Principal Component Analysis (PCA)

PCA is a statistical method that transforms observations (data points) of possibly correlated variables into a smaller set of linearly uncorrelated variables, called principal components, that capture most of the variance in the original data. Wikipedia: PCA
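
As a small illustration of the "uncorrelated" property, the sketch below builds synthetic data (an assumption made only for this illustration, not the King County data) in which two columns are strongly correlated, then checks that the resulting components are not.

```python
import numpy as np
from sklearn import decomposition

# Synthetic data (assumed for illustration): columns 0 and 1 share
# the same underlying signal, column 2 is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
data = np.column_stack([
    base + 0.1 * rng.normal(size=500),
    2.0 * base + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

# Reduce three correlated columns to two components.
model = decomposition.PCA(n_components=2)
components = model.fit_transform(data)

# Correlation between the first two original columns (close to 1).
original_corr = np.corrcoef(data[:, 0], data[:, 1])[0, 1]

# Correlation between the two components (zero up to rounding error).
component_corr = np.corrcoef(components[:, 0], components[:, 1])[0, 1]

print('Original columns:', original_corr)
print('PCA components:', component_corr)
```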

Language: Python 3
Library: scikit-learn
Example Data: King County House Sales

Key Statements

# Inputs: prepared_data, N_COMPONENTS

# Fit the model.
from sklearn import decomposition
model = decomposition.PCA(n_components=N_COMPONENTS)
model.fit(prepared_data)

# Get results.
pca_data = model.transform(prepared_data)
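
After fitting, it is often useful to check how much of the variance the chosen components retain. The sketch below uses the fitted model's explained_variance_ratio_ attribute; the random prepared_data array is a hypothetical stand-in so the snippet runs on its own.

```python
import numpy as np
from sklearn import decomposition

# Hypothetical stand-in for prepared_data: 200 rows, 6 numeric columns.
rng = np.random.default_rng(42)
prepared_data = rng.normal(size=(200, 6))
N_COMPONENTS = 3

model = decomposition.PCA(n_components=N_COMPONENTS)
model.fit(prepared_data)

# Fraction of the total variance captured by each component,
# ordered from largest to smallest.
ratios = model.explained_variance_ratio_
print(ratios)

# Running total: the share of variance kept by the first k components.
cumulative = np.cumsum(ratios)
print(cumulative)
```

If the last cumulative value is low, a larger N_COMPONENTS may be worth trying.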

Working End-to-End Example

# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

import pandas as pd
from sklearn import decomposition

# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# We need to know how many components to keep.
N_COMPONENTS = 5

# There are some columns we won't use in this model.
# Note that if we plan to use the PCA components in a
# predictive model later, it's critical that we exclude
# the target feature here. With this data, we often
# predict the price, so we'll exclude that feature here.
FEATURES_TO_REMOVE = ['id', 'date', 'price']

# We need to know which features are categorical.
CATEGORICAL_FEATURES = ['waterfront', 'condition', 'zipcode']


# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('kc_house_data.csv')

# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')


# Step 4: Set up the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~

# drop() returns a new DataFrame and leaves raw_data unchanged.
# The axis=1 keyword tells pandas to drop columns (not rows).
prepared_data = raw_data.drop(FEATURES_TO_REMOVE, axis=1)

# Turn categorical variables into dummy columns (0 or 1 values).
# Do this to avoid assuming a meaningful order of categories.
# Use drop_first to avoid multicollinearity among features.
prepared_data = pd.get_dummies(
    prepared_data,
    columns=CATEGORICAL_FEATURES,
    drop_first=True
)

# It's helpful to double check that the final data looks good.
print('Sample of data to use:')
print(prepared_data.sample(5))
print('')


# Step 5: Fit the model.
# ~~~~~~~~~~~~~~~~~~~~~~

model = decomposition.PCA(n_components=N_COMPONENTS)
model.fit(prepared_data)

# Yes, that's it!


# Step 6: Get the results.
# ~~~~~~~~~~~~~~~~~~~~~~~~

# Each column of pca_data is one principal component.
pca_data = pd.DataFrame(model.transform(prepared_data))
print('Sample of PCA-transformed data:')
print(pca_data.sample(5))
print('')
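
The example above passes the columns to PCA on their original scales. PCA is driven by variance, so a column measured in large units can dominate the components. One possible variant, which is an addition and not part of the example above, is to standardize the columns first with scikit-learn's StandardScaler. The sketch below uses made-up columns on very different scales to show the effect.

```python
import numpy as np
from sklearn import decomposition
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for prepared_data: one column in the
# thousands and two small ones, all independent of each other.
rng = np.random.default_rng(1)
prepared_data = np.column_stack([
    rng.normal(loc=2000.0, scale=900.0, size=300),
    rng.normal(loc=3.0, scale=1.0, size=300),
    rng.normal(loc=1.5, scale=0.5, size=300),
])

# PCA on the raw columns: the large-scale column dominates.
unscaled = decomposition.PCA(n_components=2).fit(prepared_data)
print(unscaled.explained_variance_ratio_)

# PCA after standardizing each column to mean 0 and variance 1.
scaled = make_pipeline(StandardScaler(), decomposition.PCA(n_components=2))
scaled.fit(prepared_data)
scaled_pca = scaled.named_steps['pca']
print(scaled_pca.explained_variance_ratio_)
```

Whether to standardize depends on the data; dummy columns and columns already on comparable scales may not need it.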

Notes

When PCA components are included in a predictive model, it's critical to exclude the target feature from those components.
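
A minimal sketch of that workflow follows, under stated assumptions: the data is synthetic, LinearRegression is an arbitrary choice of predictive model, and the train/test split is an extra precaution not discussed above. The point is that the target is never passed to PCA.

```python
import numpy as np
from sklearn import decomposition
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features and target (assumed for illustration only).
rng = np.random.default_rng(7)
features = rng.normal(size=(400, 8))
target = features @ rng.normal(size=8) + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# Fit PCA on the training features only; the target is not involved.
pca = decomposition.PCA(n_components=5)
train_components = pca.fit_transform(X_train)

# Reuse the same fitted PCA to transform the test features.
test_components = pca.transform(X_test)

# The target enters only when the predictive model is fitted.
regressor = LinearRegression().fit(train_components, y_train)
score = regressor.score(test_components, y_test)
print('R^2 on held-out data:', score)
```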
