k-means Clustering with Standardization
This is an example of k-means clustering with standardized data. For background, see Wikipedia: k-means clustering and Feature scaling (standardization).
Language: Python 3
Library: scikit-learn
Example Data: Human Resources Analytics
Key Statements
# Inputs: unstandardized_data, cols_to_standardize, N_CLUSTERS
# Create the scaler.
from sklearn.preprocessing import StandardScaler
data_to_standardize = unstandardized_data[cols_to_standardize]
scaler = StandardScaler().fit(data_to_standardize)
# Standardize the columns.
standardized_data = unstandardized_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[cols_to_standardize] = standardized_columns
# Fit the model.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=N_CLUSTERS).fit(standardized_data)
# Get results.
unstandardized_data['cluster'] = model.predict(standardized_data)
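The key statements above can be run end to end on a small synthetic dataset. This is a minimal sketch: the column names (`hours`, `satisfaction`) and the toy data are hypothetical stand-ins, chosen so the two features are on very different scales and standardization visibly matters.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical toy data: two features on very different scales.
rng = np.random.default_rng(0)
unstandardized_data = pd.DataFrame({
    'hours': rng.normal(160, 20, size=100),       # hundreds scale
    'satisfaction': rng.uniform(0, 1, size=100),  # 0-1 scale
})
cols_to_standardize = ['hours', 'satisfaction']
N_CLUSTERS = 3

# Create the scaler.
data_to_standardize = unstandardized_data[cols_to_standardize]
scaler = StandardScaler().fit(data_to_standardize)

# Standardize the columns.
standardized_data = unstandardized_data.copy()
standardized_data[cols_to_standardize] = scaler.transform(data_to_standardize)

# Fit the model and get results.
model = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(standardized_data)
unstandardized_data['cluster'] = model.predict(standardized_data)
print(unstandardized_data['cluster'].value_counts())
```

Without standardization, the `hours` column (standard deviation around 20) would dominate the Euclidean distances that k-means uses, and `satisfaction` (range 0 to 1) would barely affect the clusters.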
Working End-to-End Example
# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We need to know how many clusters to make.
N_CLUSTERS = 20
# We need to know which features are categorical.
CATEGORICAL_FEATURES = ['sales', 'salary']
# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('HR_comma_sep.csv')
# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')
# Step 4: Set up the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~
# Turn categorical variables into dummy columns (0 or 1 values).
# Do this to avoid assuming a meaningful order of categories.
# Use drop_first to avoid multicollinearity among features.
unstandardized_data = pd.get_dummies(
raw_data,
columns=CATEGORICAL_FEATURES,
drop_first=True
)
# Since the dummy columns already have values of 0 or 1,
# it is common to exclude them from standardization.
cols_to_standardize = [
column for column in raw_data.columns
if column not in CATEGORICAL_FEATURES
]
data_to_standardize = unstandardized_data[cols_to_standardize]
# Create the scaler.
scaler = StandardScaler().fit(data_to_standardize)
# Standardize the data.
standardized_data = unstandardized_data.copy()
standardized_columns = scaler.transform(data_to_standardize)
standardized_data[cols_to_standardize] = standardized_columns
# It's helpful to double check that the final data looks good.
print('Sample of data to use:')
print(standardized_data.sample(5))
print('')
# Step 5: Fit the model.
# ~~~~~~~~~~~~~~~~~~~~~~
model = KMeans(n_clusters=N_CLUSTERS).fit(standardized_data)
# Step 6: Get the results.
# ~~~~~~~~~~~~~~~~~~~~~~~~
# It's helpful to see the results on the unstandardized data.
# The output of model.predict() is an integer label identifying
# the cluster to which each data point is assigned.
unstandardized_data['cluster'] = model.predict(standardized_data)
# It's helpful to take a quick look at the count and
# average value per cluster.
print('Cluster summary:')
summary = unstandardized_data.groupby(['cluster']).mean()
summary['count'] = unstandardized_data['cluster'].value_counts()
summary = summary.sort_values(by='count', ascending=False)
print(summary)
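The example sets N_CLUSTERS = 20 as a given constant, but in practice the number of clusters is a choice. One common heuristic is the elbow method: fit k-means for a range of k values and look for the point where the inertia (within-cluster sum of squares, exposed by scikit-learn as `inertia_`) stops dropping sharply. A minimal sketch, using hypothetical synthetic data with three well-separated groups rather than the HR dataset:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical standardized data: three well-separated blobs in 2-D.
rng = np.random.default_rng(0)
standardized_data = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# Inertia (within-cluster sum of squares) for a range of k values.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0)
          .fit(standardized_data).inertia_
    for k in range(1, 8)
}
for k, inertia in inertias.items():
    print(f'k={k}: inertia={inertia:.1f}')
# The "elbow" -- where inertia stops dropping sharply -- suggests a good k.
```

On data like this, the drop in inertia is steep up to k=3 and flattens afterwards, pointing at three clusters.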
Notes
Although we train the k-means model on standardized data, we prefer to view the results on the unstandardized data, where the values are in their original, interpretable units.
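The same idea applies to the cluster centers themselves: `model.cluster_centers_` lives in standardized space, and `scaler.inverse_transform` can map the centers back to original units for interpretation. A minimal sketch, using hypothetical synthetic data (the column names `salary_k` and `tenure` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data: one large-scale and one small-scale feature.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    'salary_k': rng.normal(60, 15, size=200),
    'tenure': rng.uniform(0, 10, size=200),
})

scaler = StandardScaler().fit(data)
standardized = scaler.transform(data)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(standardized)

# Centers are in standardized units; map them back for interpretation.
centers_original = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=data.columns,
)
print(centers_original)
```

Note that in the full example above, only the non-dummy columns were standardized, so there the inverse transform would apply just to that subset of the center coordinates.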