Support Vector Machine (SVM)

A support vector machine (SVM) is a supervised learning model, with associated learning algorithms, that analyzes data for classification and regression analysis. Below is a classification example. (Definition adapted from Wikipedia: Support vector machine.)

Key Statements
# Inputs: x_train, y_train, x_test, y_test.

# Fit the model.
from sklearn.svm import SVC
model = SVC().fit(x_train, y_train)

# Get predictions.
y_predict = model.predict(x_test)

# Get the confusion matrix results.
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test, y_predict)
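
The statements above assume that the four input arrays already exist. As a minimal sketch for trying them out without a data file, the block below builds stand-in inputs from scikit-learn's make_classification. The synthetic data, the sample counts, and the name x_all / y_all are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not a real data set).
x_all, y_all = make_classification(
    n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows; stratify so both classes appear in each part.
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.2, stratify=y_all, random_state=0)

# The key statements, unchanged.
model = SVC().fit(x_train, y_train)
y_predict = model.predict(x_test)
matrix = confusion_matrix(y_test, y_predict)

# Rows are known classes, columns are predicted classes.
print(matrix)
```

The entries of the printed matrix always sum to the number of test samples, which is a quick sanity check on the result.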

Working End-to-End Example
# Step 1: Import the libraries.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC  # SVC = support vector classifier.

# Step 2: Set up the constants.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# The target feature is whether or not the employee left.
TARGET_FEATURE = 'left'  # Valid data values are 0 or 1.

# We'll set aside 20% of the data to test the model (the split
# itself happens in Step 4).

# We need to know which features are categorical.
CATEGORICAL_FEATURES = ['sales', 'salary']

# Step 3: Load in the raw data.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# This assumes the data is in the same directory as this script.
# Here we load the data into a pandas DataFrame.
raw_data = pd.read_csv('HR_comma_sep.csv')

# It's helpful to take a quick look at the data.
print('Sample of loaded data:')
print(raw_data.head())
print('')
print('Count per value (0 or 1) of the target feature:')
print(raw_data[TARGET_FEATURE].value_counts())

# Step 4: Set up the data.
# ~~~~~~~~~~~~~~~~~~~~~~~~

# Separate the X and Y values.
y_data = raw_data[TARGET_FEATURE]

# Using drop() doesn't change raw_data, only the return value.
# The axis=1 keyword tells pandas to drop a column (not a row).
x_data = raw_data.drop(TARGET_FEATURE, axis=1)

# SVC fits its own intercept term, so adding a constant column
# is optional for this model.
x_data['intercept'] = 1.0

# Turn categorical variables into dummy columns (0 or 1 values).
# Do this to avoid assuming a meaningful order of categories.
# Use drop_first to avoid multicollinearity among features.
x_data = pd.get_dummies(
    x_data, columns=CATEGORICAL_FEATURES, drop_first=True)

# It's helpful to double check that the final data looks good.
print('Sample of data to use:')
print(x_data.head())

# Split the data into training and test sets.
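# Hold out 20% of the rows for testing; the fixed seed makes the
# split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=0)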

# Step 5: Fit the model.
# ~~~~~~~~~~~~~~~~~~~~~~

# This call can find nonlinear decision boundaries (since the
# default kernel uses radial basis functions).
model = SVC().fit(x_train, y_train)

# Yes, that's it!

# Step 6: Get the results.
# ~~~~~~~~~~~~~~~~~~~~~~~~

# Get the predicted target (y) values.
y_predict = model.predict(x_test)

# Get the confusion matrix and calculate the results.
#   M[i][j] = #cases with known value i and predicted value j.
M = confusion_matrix(y_test, y_predict)
n_samples = len(y_test)
print('Accuracy:  %.2f' % ((M[0][0] + M[1][1]) / n_samples))
print('Precision: %.2f' % (M[1][1] / (M[0][1] + M[1][1])))
print('Recall:    %.2f' % (M[1][1] / (M[1][0] + M[1][1])))
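
The accuracy, precision, and recall formulas above can be cross-checked against scikit-learn's own metric functions. The sketch below does this on a small set of made-up labels (the y_true and y_pred arrays are invented for illustration and are not taken from the HR data).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels invented for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# M[i][j] = #cases with known value i and predicted value j.
M = confusion_matrix(y_true, y_pred)

# Manual calculations, using the same formulas as the example above.
manual_accuracy = (M[0][0] + M[1][1]) / M.sum()
manual_precision = M[1][1] / (M[0][1] + M[1][1])
manual_recall = M[1][1] / (M[1][0] + M[1][1])

# The same quantities from scikit-learn's helper functions.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)

print('Accuracy:  %.2f (manual) vs %.2f (sklearn)'
      % (manual_accuracy, sk_accuracy))
print('Precision: %.2f (manual) vs %.2f (sklearn)'
      % (manual_precision, sk_precision))
print('Recall:    %.2f (manual) vs %.2f (sklearn)'
      % (manual_recall, sk_recall))
```

The helper functions are usually the safer choice in practice, since they also handle edge cases such as a model that never predicts the positive class.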


An advantage of SVMs is that you can use them for both classification and regression problems.
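
For the regression case, scikit-learn provides SVR (support vector regressor), which follows the same fit/predict pattern as SVC. The block below is a minimal sketch on synthetic data: the noisy sine curve, the sample count, and the variable names are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.svm import SVR  # SVR = support vector regressor.

# Synthetic data (an assumption): a sine curve with a little noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 5.0, size=80)).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=80)

# Fit the model; like SVC, the default kernel is RBF.
reg_model = SVR().fit(x, y)

# Predictions are continuous values rather than class labels.
y_fit = reg_model.predict(x)

# score() returns the R^2 coefficient of determination.
r2 = reg_model.score(x, y)
print('R^2 on the training data: %.2f' % r2)
```

Because the target is continuous, a confusion matrix no longer applies; a measure such as R^2 or mean squared error takes its place.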

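SVMs can overfit when there are many more explanatory X variables than there are samples in the training data set. In that setting, the choice of kernel and the strength of the regularization (the C parameter in SVC) matter a great deal.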
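
One common response to the many-features, few-samples setting is to choose the regularization parameter C by cross-validation instead of relying on the default. The block below is a rough sketch of that idea, assuming synthetic data with 200 features and only 40 samples; the data, the linear kernel, and the grid of C values are all illustrative choices rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data (an assumption): far more features than samples.
x, y = make_classification(
    n_samples=40, n_features=200, n_informative=5, random_state=0)

# Candidate values for C; smaller C means stronger regularization.
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# Score each candidate with 5-fold cross-validation.
search = GridSearchCV(SVC(kernel='linear'), param_grid=param_grid, cv=5)
search.fit(x, y)

print('Best C:', search.best_params_['C'])
print('Cross-validated accuracy: %.2f' % search.best_score_)
```

The cross-validated score is a more honest estimate than training accuracy here, since a model with 200 features can often separate 40 training samples perfectly without generalizing.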