Standardization

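Standardization rescales a feature so that it has a mean of 0 and a standard deviation of 1: each value has the feature's mean subtracted from it and is then divided by the feature's standard deviation. If the feature is normally distributed, the result follows a standard normal distribution. Reference: Wikipedia, "Feature scaling" (standardization section).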
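The definition above can be sketched with nothing but Python's standard library. The sample values are made up for illustration, and the sketch uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (ddof=0)

# z = (x - mean) / std for every value.
z_scores = [(x - mean) / std for x in values]

print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

The standardized values should have a mean of 0 and a standard deviation of 1, up to floating-point rounding.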

Language: Python 3
Library: scikit-learn
Example Data: King County House Sales

Key Statements

# Create the scaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data_to_standardize)

# Standardize the columns.
standardized_data = raw_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[COLUMNS_TO_STANDARDIZE] = standardized_columns

Working End-to-End Example

# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

import pandas as pd
from sklearn.preprocessing import StandardScaler


# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

COLUMNS_TO_STANDARDIZE = [
  'bedrooms',
  'bathrooms',
  'sqft_living',
  'sqft_lot',
  'floors',
  'view',
  'grade',
  'sqft_above',
  'sqft_basement',
  'sqft_living15',
  'sqft_lot15'
]


# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# This assumes kc_house_data.csv is in the current working directory.
raw_data = pd.read_csv('kc_house_data.csv', header=0)

# It's helpful to take a quick look at the data.
print("Loaded data:")
print(raw_data.head())
print("")


# Step 4: Standardize the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Get the data to standardize.
data_to_standardize = raw_data[COLUMNS_TO_STANDARDIZE].copy()

# Create the scaler.
scaler = StandardScaler().fit(data_to_standardize)

# Standardize the data.
standardized_data = raw_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[COLUMNS_TO_STANDARDIZE] = standardized_columns

# It's helpful to double check that the final data looks good.
print('Standardized data:')
print(standardized_data.head())
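Beyond eyeballing the first rows, it can help to check the column statistics directly. The sketch below is one way to do that. It uses a small made-up DataFrame (toy_data, with hypothetical values) in place of kc_house_data.csv so that it runs on its own. Note that pandas' std() defaults to the sample standard deviation (ddof=1), while StandardScaler uses the population one, so ddof=0 is passed explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the house data; the values are made up.
toy_data = pd.DataFrame({
    'bedrooms': [2, 3, 3, 4, 5],
    'sqft_living': [900, 1500, 1800, 2400, 3100],
    'price': [250000, 340000, 410000, 520000, 700000],
})
toy_columns = ['bedrooms', 'sqft_living']

# Same pattern as the example above: copy, then overwrite the chosen columns.
toy_standardized = toy_data.copy()
toy_standardized[toy_columns] = StandardScaler().fit_transform(toy_data[toy_columns])

# Each standardized column should have mean ~0 and population std ~1.
column_means = toy_standardized[toy_columns].mean()
column_stds = toy_standardized[toy_columns].std(ddof=0)
print(column_means)
print(column_stds)
```

Columns that were not listed (here, price) are left untouched.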

Notes

If your data includes an intercept (all-ones) column, don't standardize it. A constant column has zero variance, so standardizing it only subtracts its mean, which turns every entry into zero and removes the intercept from the model.
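A minimal sketch of the intercept problem, using a made-up three-row design matrix (the array and variable names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical design matrix: column 0 is the intercept, column 1 is a feature.
design = np.array([
    [1.0, 10.0],
    [1.0, 20.0],
    [1.0, 30.0],
])

# Standardizing every column wipes out the intercept column.
all_scaled = StandardScaler().fit_transform(design)
print(all_scaled[:, 0])  # all zeros

# Standardizing only the feature column keeps the intercept intact.
safe = design.copy()
safe[:, 1:] = StandardScaler().fit_transform(design[:, 1:])
print(safe)
```

In the end-to-end example above, the same effect is achieved by leaving the intercept column out of COLUMNS_TO_STANDARDIZE.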

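Standardization is important if you plan to compare the feature coefficients of a trained model. In a linear model fit on raw data, each coefficient is measured per unit of its feature (per bedroom, per square foot), so coefficients for features on different scales are not directly comparable. After standardization, each coefficient describes the effect of a one-standard-deviation change in its feature.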
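The sketch below illustrates this with scikit-learn's LinearRegression on made-up, noise-free data. The feature names echo the house data, but the values and the target formula are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (made-up data).
bedrooms = rng.normal(3.0, 1.0, size=200)
sqft_living = rng.normal(2000.0, 800.0, size=200)
features = np.column_stack([bedrooms, sqft_living])

# Noise-free target, so the fitted coefficients are easy to interpret.
price = 10_000.0 * bedrooms + 100.0 * sqft_living

raw_coefs = LinearRegression().fit(features, price).coef_
scaled_coefs = LinearRegression().fit(
    StandardScaler().fit_transform(features), price
).coef_

print(raw_coefs)     # effect per bedroom, effect per square foot
print(scaled_coefs)  # effect per standard deviation of each feature
```

On the raw data the bedrooms coefficient is the larger of the two, simply because one bedroom is a much bigger step than one square foot. After standardization the order flips, showing that sqft_living accounts for more of the variation in this made-up target.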
