# k-means Clustering

k-means clustering is a method of partitioning data points into k clusters. The clusters are chosen to minimize the distance between each data point and the center of its assigned cluster (see Wikipedia: k-means clustering).

- Language: Python 3
- Library: scikit-learn
- Example Data: Human Resources Analytics

# Key Statements

```
# Inputs: prepared_data, N_CLUSTERS
# Fit the model.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=N_CLUSTERS).fit(prepared_data)
# Get results. Values are cluster numbers, 0 to N_CLUSTERS-1.
prepared_data['cluster'] = model.predict(prepared_data)
```
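Beyond the cluster labels, the fitted model also exposes the cluster centers and the total within-cluster sum of squares (inertia), which are useful for sanity checks. A minimal, self-contained sketch (the synthetic `prepared_data` and `N_CLUSTERS` here are stand-ins for your own prepared DataFrame and constant):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for real prepared data: 100 rows, 4 numeric features.
rng = np.random.RandomState(0)
N_CLUSTERS = 3
prepared_data = pd.DataFrame(rng.rand(100, 4), columns=['a', 'b', 'c', 'd'])

# n_init and random_state are set explicitly for reproducibility.
model = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(prepared_data)
prepared_data['cluster'] = model.predict(prepared_data)

# One center per cluster, one coordinate per feature.
print(model.cluster_centers_.shape)  # (N_CLUSTERS, number of features)
# Total within-cluster sum of squared distances; lower is tighter.
print(model.inertia_)
```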

# Working End-to-End Example

```
# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
import pandas as pd
from sklearn.cluster import KMeans
# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# We need to know how many clusters to make.
N_CLUSTERS = 20
# We need to know which features are categorical.
CATEGORICAL_FEATURES = ['sales', 'salary']
# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('HR_comma_sep.csv')
# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')
# Step 4: Set up the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~
# Turn categorical variables into dummy columns (0 or 1 values).
# Do this to avoid assuming a meaningful order of categories.
# Use drop_first to avoid multicollinearity among features.
prepared_data = pd.get_dummies(
    raw_data,
    columns=CATEGORICAL_FEATURES,
    drop_first=True
)
# It's helpful to double check that the final data looks good.
print('Sample of data to use:')
print(prepared_data.sample(5))
print('')
# Step 5: Fit the model.
# ~~~~~~~~~~~~~~~~~~~~~~
model = KMeans(n_clusters=N_CLUSTERS).fit(prepared_data)
# Yes, that's it!
# Step 6: Get the results.
# ~~~~~~~~~~~~~~~~~~~~~~~~
# model.predict() returns, for each row, the integer label
# (0 to N_CLUSTERS-1) of the cluster that row is assigned to.
prepared_data['cluster'] = model.predict(prepared_data)
# It's helpful to take a quick look at the count and
# average feature values per cluster.
print('Cluster summary:')
summary = prepared_data.groupby(['cluster']).mean()
summary['count'] = prepared_data['cluster'].value_counts()
summary = summary.sort_values(by='count', ascending=False)
print(summary)
```

# Notes

k-means clustering is sensitive to the scale of the values because it measures distances across all features at once; a feature with a large range will dominate the result. It's common to normalize the data first; a popular approach is standardization, which rescales each feature to mean 0 and standard deviation 1.
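A minimal standardization sketch using scikit-learn's `StandardScaler` (the column names and value ranges below are illustrative, not from the HR dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'satisfaction': rng.rand(50),                    # roughly 0 to 1
    'average_monthly_hours': rng.rand(50) * 200 + 100,  # roughly 100 to 300
})

# After standardization each column has mean 0 and standard
# deviation 1, so no single feature dominates the distances.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df),
    columns=df.columns
)
```

Fit the k-means model on `scaled` instead of the raw values; the cluster labels still line up row-for-row with the original DataFrame.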

It's also helpful to check for outliers and remove them. Because each cluster center is the mean of its members, a single extreme value can pull a center far away from the bulk of the data.
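One simple way to drop outliers is a z-score filter: keep only rows whose features all lie within a few standard deviations of the mean. A sketch on synthetic data (the column name and the 3-standard-deviation cutoff are illustrative choices):

```python
import numpy as np
import pandas as pd

# Synthetic data with one obvious outlier planted in it.
rng = np.random.RandomState(0)
df = pd.DataFrame({'hours': rng.normal(160, 10, 100)})
df.loc[0, 'hours'] = 10_000  # extreme value

# Keep rows whose z-score is within 3 standard deviations
# on every column.
z = (df - df.mean()) / df.std()
filtered = df[(z.abs() < 3).all(axis=1)]
```

Run the clustering on `filtered` so the planted extreme value can no longer drag a cluster center toward it.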