Logistic Regression with PCA
This is an end-to-end example of fitting a logistic regression on the PCA components of a data set. See Wikipedia: Logistic regression and Principal component analysis
Language: Python 3
Library: scikit-learn
Example Data: HR Analytics employee attrition data (HR_comma_sep.csv)
Key Statements
# Inputs: x_data, y_data, N_COMPONENTS, TEST_SET_SIZE
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Get the PCA data.
pca_model = PCA(n_components=N_COMPONENTS).fit(x_data)
pca_data = pca_model.transform(x_data)
# Split the data.
x_train, x_test, y_train, y_test = train_test_split(
    pca_data,
    y_data,
    test_size=TEST_SET_SIZE
)
# Fit the logistic regression model.
lr_model = LogisticRegression().fit(x_train, y_train)
# Get predictions and their confusion matrix.
y_predict = lr_model.predict(x_test)
matrix = confusion_matrix(y_test, y_predict)
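The same steps can also be expressed with scikit-learn's Pipeline, which fits PCA on the training split only and so avoids leaking test-set information into the components. A minimal sketch, reusing the imports and inputs above:
from sklearn.pipeline import make_pipeline

# Split the raw features first, then let the pipeline fit PCA and
# the classifier together on the training data.
x_train, x_test, y_train, y_test = train_test_split(
    x_data,
    y_data,
    test_size=TEST_SET_SIZE
)
pipeline = make_pipeline(
    PCA(n_components=N_COMPONENTS),
    LogisticRegression()
)
pipeline.fit(x_train, y_train)
y_predict = pipeline.predict(x_test)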
Working End-to-End Example
# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import pandas as pd
from sklearn import decomposition
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We need to know how many components to make.
N_COMPONENTS = 10
# The target feature is whether or not the employee left.
TARGET_FEATURE = 'left' # Valid data values are 0 or 1.
# We'll set aside 20% of the data to test the model.
TEST_SET_SIZE = 0.2
# We need to know which features are categorical.
CATEGORICAL_FEATURES = ['sales', 'salary']
# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('HR_comma_sep.csv')
# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')
# Step 4: Set up the data for PCA.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Separate the X and Y values.
y_data = raw_data[TARGET_FEATURE]
# Using drop() doesn't change raw_data, only the return value.
# The axis=1 keyword tells pandas to drop a column (not a row).
x_data = raw_data.drop(TARGET_FEATURE, axis=1)
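# Optional sanity check: the target must no longer appear among the
# features, or it would leak into the PCA components (see Notes below).
assert TARGET_FEATURE not in x_data.columns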
# Turn categorical variables into dummy columns (0 or 1 values).
# Do this to avoid assuming a meaningful order of categories.
# Use drop_first to avoid multicollinearity among features.
x_data = pd.get_dummies(
    x_data,
    columns=CATEGORICAL_FEATURES,
    drop_first=True
)
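# With drop_first=True, a k-level categorical column becomes k-1
# indicator columns; the dropped level is implied when all are 0.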
# It's helpful to double check that the final data looks good.
print('Sample of x data:')
print(x_data.sample(5))
print('')
# Step 5: Fit the PCA model and get the PCA data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pca_model = decomposition.PCA(n_components=N_COMPONENTS)
pca_model.fit(x_data)
pca_data = pd.DataFrame(pca_model.transform(x_data))
print('Sample PCA data:')
print(pca_data.sample(5))
print('')
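# A useful diagnostic: how much of the total variance the chosen
# components capture (the explained variance ratios sum to at most 1).
print('Variance captured by %d components: %.1f%%'
      % (N_COMPONENTS, 100 * pca_model.explained_variance_ratio_.sum()))
print('')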
# Step 6: Set up the data for logistic regression.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Note: scikit-learn's LogisticRegression fits an intercept by default
# (fit_intercept=True), so no constant column needs to be added here.
# Split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    pca_data,
    y_data,
    test_size=TEST_SET_SIZE
)
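# Tip: pass random_state to train_test_split (e.g. random_state=0)
# to make the split, and thus the results below, reproducible.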
# Step 7: Fit the logistic regression model.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
logit_model = LogisticRegression().fit(x_train, y_train)
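# If the solver reports a convergence warning on your data, raising
# the iteration cap usually helps, e.g. LogisticRegression(max_iter=1000).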
# Step 8: Get the results.
# ~~~~~~~~~~~~~~~~~~~~~~~~
# Get prediction probabilities for the test set.
y_predict_proba = logit_model.predict_proba(x_test)
# The value y_predict_proba[i, j] is the model's prediction for
# prob(y[i] = j), where j = 0 or 1; y[i] is the ith target value.
# Convert to 0 or 1: y_predict[i] = 1 when prob(y[i] = 1) > 0.5.
cutoff = 0.5
y_predict = [int(proba[1] > cutoff) for proba in y_predict_proba]
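# Equivalent vectorized form:
#   y_predict = (y_predict_proba[:, 1] > cutoff).astype(int)
# With cutoff = 0.5 this matches logit_model.predict(x_test).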
# Get the confusion matrix and calculate the results.
# M[i][j] = #cases with known value i and predicted value j.
M = confusion_matrix(y_test, y_predict)
n_samples = len(y_test)
print(M)
print('Accuracy: %.2f' % ((M[0][0] + M[1][1]) / n_samples))
print('Precision: %.2f' % (M[1][1] / (M[0][1] + M[1][1])))
print('Recall: %.2f' % (M[1][1] / (M[1][0] + M[1][1])))
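As a cross-check, scikit-learn can compute the same summary statistics directly. A minimal sketch, reusing y_test and y_predict from above:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print('Accuracy: %.2f' % accuracy_score(y_test, y_predict))
print('Precision: %.2f' % precision_score(y_test, y_predict))
print('Recall: %.2f' % recall_score(y_test, y_predict))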
Notes
When PCA components are included in a predictive model, it's critical to exclude the target feature from those components.
Neither PCA nor logistic regression requires normally distributed inputs; the variables can have any distribution.
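PCA is, however, sensitive to feature scale, so standardizing the features first is often worthwhile. A minimal sketch, reusing the imports and x_data from the example above:
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance so that
# large-valued features don't dominate the components.
scaled_x = StandardScaler().fit_transform(x_data)
pca_data = pd.DataFrame(
    decomposition.PCA(n_components=N_COMPONENTS).fit_transform(scaled_x)
)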