Standardization
Standardization is a method of scaling a feature's data to have a mean of 0 and a standard deviation of 1. If the data is normally distributed, this gives it a standard normal distribution. Wikipedia: Feature scaling — standardization
Language: Python 3
Library: scikit-learn
Example Data: King County House Sales
Key Statements
# Create the scalar.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data_to_standardize)
# Standardize the columns.
standardized_data = raw_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[COLUMNS_TO_STANDARDIZE] = standardized_columns
Working End-to-End Example
# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
COLUMNS_TO_STANDARDIZE = [
'bedrooms',
'bathrooms',
'sqft_living',
'sqft_lot',
'floors',
'view',
'grade',
'sqft_above',
'sqft_basement',
'sqft_living15',
'sqft_lot15'
]
# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This assumes the data is in the same directory as this script.
raw_data = pd.read_csv('kc_house_data.csv', header=0)
# It's helpful to take a quick look at the data.
print("Loaded data:")
print(raw_data[:5])
print("")
# Step 4: Standardize the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Get the data to standardize.
data_to_standardize = raw_data.copy()[COLUMNS_TO_STANDARDIZE]
# Create the scaler.
scaler = StandardScaler().fit(data_to_standardize)
# Standardize the data
standardized_data = raw_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[COLUMNS_TO_STANDARDIZE] = standardized_columns
# It's helpful to double check that the final data looks good.
print('Standardized data:')
print(standardized_data[:5])
Notes
If you're including an intercept (all-ones) column, it's important to avoid standardizing that column. Otherwise, the intercept column will be standardized to a value of zero.
Standardization is important if you plan to compare the feature coefficients of a trained model.