Dummy Variables
A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect. It is a common way of handling categorical X variables in regression analyses and other statistical models. (Source: Wikipedia, "Dummy variable")
Language: Python 3
Example Data: King County House Sales
# Step 1: Import the libraries.
import pandas as pd
# Step 2: Set up the constants.
# List the columns we want to turn into dummy variables.
CATEGORICAL_FEATURES = ['waterfront', 'condition', 'zipcode']
# Step 3: Load in the data.
# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('kc_house_data.csv')
# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')
# Step 4: Create the dummy variables.
# Work on a copy so the raw data stays untouched.
data = raw_data.copy()
# Iterate through each of the categorical features.
for feature in CATEGORICAL_FEATURES:
    # Pandas comes with a super convenient function that creates
    # the dummy variables for you. dtype=int keeps the values as 0/1
    # (recent pandas versions return True/False by default).
    dummy_data = pd.get_dummies(data[feature], prefix=feature, dtype=int)
    # We need to remove at least one of the dummy features.
    # It doesn't matter which one. To stay consistent, we
    # usually remove the most common value.
    # value_counts() sorts by frequency, so index[0] is the most common.
    most_common_value = data[feature].value_counts().index[0]
    dummy_to_exclude = feature + '_' + str(most_common_value)
    print('Excluding dummy variable %s' % dummy_to_exclude)
    dummy_data_to_use = dummy_data.drop(dummy_to_exclude, axis=1)
    data = pd.concat([data, dummy_data_to_use], axis=1)
    # Remove the original feature, so just the dummy variables remain.
    data = data.drop(feature, axis=1)
# It's helpful to double check that the final data looks good.
print('')
print('Data with dummy variables:')
print(data.sample(5))
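If you don't have the King County file handy, the same idea can be tried on a tiny made-up table. The sketch below is only an illustration: the `add_dummies` helper and the toy values are assumptions introduced here, not part of the original script.

```python
import pandas as pd


def add_dummies(df, features):
    """Replace each column in `features` with 0/1 dummy columns,
    leaving out the dummy for the most common value.
    (Hypothetical helper, written for this example only.)"""
    result = df.copy()
    for feature in features:
        dummies = pd.get_dummies(result[feature], prefix=feature, dtype=int)
        # value_counts() sorts by frequency, so the first index entry
        # is the most common value.
        most_common = result[feature].value_counts().index[0]
        dummies = dummies.drop(columns=f'{feature}_{most_common}')
        # Swap the original column for its dummy columns.
        result = pd.concat([result.drop(columns=feature), dummies], axis=1)
    return result


# Hypothetical toy data, not taken from the King County file.
toy = pd.DataFrame({
    'price': [300, 450, 500, 275],
    'condition': [3, 4, 3, 5],
})
toy_dummies = add_dummies(toy, ['condition'])
print(toy_dummies)
```

Here condition 3 is the most common value, so it becomes the baseline: only `condition_4` and `condition_5` are kept, and a row with zeros in both columns means condition 3.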
Notes
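It's important that, for a given categorical variable, one of the category dummy variables is excluded. Otherwise the dummy columns for that variable add up to 1 in every row, which makes them perfectly collinear with the intercept of a regression model (multicollinearity, sometimes called the dummy variable trap).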
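One way to see the problem is to check the rank of the design matrix. This is a minimal sketch on made-up values, assuming the model includes an intercept column; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three categories.
condition = pd.Series([3, 4, 3, 5, 4, 3], name='condition')

all_dummies = pd.get_dummies(condition, prefix='condition', dtype=int)
reduced_dummies = all_dummies.drop(columns='condition_3')

# Add the intercept column that a regression model normally includes.
intercept = np.ones((len(condition), 1))
full_design = np.hstack([intercept, all_dummies.to_numpy()])
reduced_design = np.hstack([intercept, reduced_dummies.to_numpy()])

# With every dummy kept, the dummy columns sum to the intercept column,
# so the matrix has more columns than its rank.
full_rank = np.linalg.matrix_rank(full_design)
reduced_rank = np.linalg.matrix_rank(reduced_design)
print('All dummies:    ', full_design.shape[1], 'columns, rank', full_rank)
print('One dummy fewer:', reduced_design.shape[1], 'columns, rank', reduced_rank)
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined. Dropping one dummy restores full column rank. Note that pandas also offers `drop_first=True` on `get_dummies`, which drops the first category rather than the most common one.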