Dummy Variables

A dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect. It is a common way of handling categorical predictor (X) variables in regression analyses and other statistical models. (Source: Wikipedia, "Dummy variable".)
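To make the definition concrete, here is a minimal sketch of what the encoding looks like. The `toy` DataFrame and its values are made up for illustration and are not taken from the data set used in the script.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
toy = pd.DataFrame({'condition': [3, 4, 3, 5]})

# One 0/1 column is created per distinct category value.
# dtype=int forces 0/1 integers instead of booleans.
toy_dummies = pd.get_dummies(toy['condition'], prefix='condition', dtype=int)

print(toy_dummies)
```

Each row has a 1 in the column matching its original category and 0 everywhere else.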

# Step 1: Import the libraries.

import pandas as pd



# Step 2: Set up the constants.

# List the columns we want to turn into dummy variables.
CATEGORICAL_FEATURES = ['waterfront', 'condition', 'zipcode']



# Step 3: Load in the data.

# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('kc_house_data.csv')

# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.sample(5))
print('')



# Step 4: Create the dummy variables.

# Work on a copy so the raw data stays untouched.
data = raw_data.copy()

# Iterate through each of the categorical features.
for feature in CATEGORICAL_FEATURES:

  # Pandas comes with a convenient function that creates
  # the dummy variables for you.
  dummy_data = pd.get_dummies(data[feature], prefix=feature)

  # We need to remove one of the dummy features.
  # It doesn't matter which one. To stay consistent, we
  # usually remove the most common value.
  most_common_value = data[feature].value_counts().index[0]
  dummy_to_exclude = feature + '_' + str(most_common_value)

  print('Excluding dummy variable %s' % dummy_to_exclude)
  dummy_data_to_use = dummy_data.drop(dummy_to_exclude, axis=1)

  # Append the remaining dummy columns to the data.
  data = pd.concat([data, dummy_data_to_use], axis=1)

  # Remove the original feature, so just the dummy variables remain.
  data = data.drop(feature, axis=1)



# It's helpful to double check that the final data looks good.
print('')
print('Data with dummy variables:')
print(data.sample(5))

Notes

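It's important that for a given categorical variable, one of the category dummy variables is excluded. Otherwise the dummy columns for that variable sum to 1 in every row, which makes them perfectly collinear with the model's intercept (the "dummy variable trap").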
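The following is a minimal sketch of why one dummy has to go, assuming a regression model with an intercept term. The `toy` column is hypothetical, and NumPy's `matrix_rank` is used only to show that the full set of dummies plus an intercept column is rank-deficient.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only).
toy = pd.DataFrame({'condition': [3, 4, 3, 5, 3]})
full = pd.get_dummies(toy['condition'], prefix='condition', dtype=int)

# Every row contains exactly one 1, so the dummy columns add up
# to a column of ones, which is what an intercept column looks like.
row_sums = full.sum(axis=1)

# Intercept plus all dummies: 4 columns, but one is redundant.
design_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design_full)

# Drop the most common category, then rebuild the design matrix.
most_common = toy['condition'].value_counts().index[0]
reduced = full.drop('condition_' + str(most_common), axis=1)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print('Columns / rank with all dummies:', design_full.shape[1], rank_full)
print('Columns / rank after dropping one:', design_reduced.shape[1], rank_reduced)
```

With all dummies kept, the rank is lower than the number of columns; after dropping one, the two match. The excluded category becomes the baseline that the remaining dummy coefficients are measured against.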