Category Archives: Machine Learning Work

Numerai Competition Attempt #2

In this notebook I experiment with adding additional features to the Numerai dataset using t-distributed stochastic neighbor embedding (t-SNE). It is a machine learning dimension reduction tool that is often used for visualizing data in 2 or 3 dimensions, much like PCA. Unlike PCA, it can also work with non-linear / non-parametric data. A number of models have benefited from the use of additional t-SNE features prior to running the main classification algorithm.

T-SNE works by finding patterns within observable clusters in the dataset. It then maps these clusters to a probability distribution in a lower dimension space. The underlying data is no longer visible, but local similarities between points are preserved.

I suggest the following video if you want a theoretical understanding of t-SNE from the creator: https://www.youtube.com/watch?v=RJVL80Gg3lA&t

And a practical guide to tuning its parameters: https://distill.pub/2016/misread-tsne/

I’m using an adapted version of Jim Flemings t-SNE pipeline found here: https://github.com/jimfleming/numerai/blob/master/prep_data.py

Why Apply t-SNE to Numerai’s dataset?

If you’ve worked with Numerai data before, you’ll notice that it does not contain any feature distinctions, and that each feature has been normalized between 0 and 1. Richard Craib, the founder of Numerai, explains the reason and method of this data encryption here: https://medium.com/numerai/encrypted-data-for-efficient-markets-fffbe9743ba8. As such, adding features via dimensionality reduction is the only sensible way I can think of for squeezing any additional features into the training dataset.

Note: This notebook is incomplete and outputs have not been uploaded. It also takes like 10 hours to run this notebook and I don’t recommend doing so unless you have some sort of cloud computation resources! My next post will share the data infrastucture I’ve built out on AWS to make this easier

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# Lets see whats in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))
example_model.py
example_model.r
example_predictions.csv
numerai_tournament_data.csv
numerai_training_data.csv
test_data.csv
train_data.csv
valid_data.csv

Data Prep

Jim Fleming does something pretty cool in this pipeline that I’ve recently incorporated into all my work. Rather than splitting the training data arbitrarily into a training / validation dataset, he chooses his validation dataset to be those points that are closest to the test results. This is done by first running a 5-fold cross validated random forest classifier, and then sorting the predictions by the model’s relative confidence in each prediction. The points that the model is most confident in are those, we hypothesize, the closest to representing the underlying data pattern.

In [ ]:
import time
import random

from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import StratifiedKFold

def prep():
    df_train = pd.read_csv('../numerai/T62/numerai_training_data.csv')
    df_test = pd.read_csv('../numerai/T62/numerai_tournament_data.csv')

    feature_cols = list(df_train.columns[3:-1])
    target_col = df_train.columns[-1]
    test_col = 'is_test'
    id_col = 't_id'

    df_train['is_test'] = 0
    df_test['is_test'] = 1

    df_data = pd.concat([df_train, df_test])
    df_data = df_data.reindex_axis(feature_cols + [test_col, target_col], axis='columns')

    X_split = df_data[feature_cols]
    y_split = df_data[test_col]

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=67)
    predictions = np.zeros(y_split.shape)

    kfold = StratifiedKFold(y_split, n_folds=5, shuffle=True, random_state=67)
    
    for i, (train_i, test_i) in enumerate(kfold):
        print("Fold #{}".format(i + 1))

        X_split_train = X_split.iloc[train_i]
        y_split_train = y_split.iloc[train_i]

        X_split_test = X_split.iloc[test_i]
        y_split_test = y_split.iloc[test_i]

        rf.fit(X_split_train, y_split_train)

        p = rf.predict_proba(X_split_test)[:,1]
        auc = roc_auc_score(y_split_test, p)
        print("AUC: {:.2f}".format(auc))

        predictions[test_i] = p

    # sort predictions by value
    i = predictions.argsort()

    # sort data by prediction confidence
    df_sorted = df_data.iloc[i]

    # select only training data
    df_train_sorted = df_sorted.loc[df_sorted.is_test == 0]

    # drop unnecessary columns
    df_train_sorted = df_train_sorted.drop([test_col], axis='columns')

    # verify training data
    assert(df_train_sorted[target_col].sum() == df_train[target_col].sum())

    # grab first N rows as train and last N rows as validation (those closest to test)
    validation_size = int(len(df_train_sorted) * 0.1)
    df_train = df_train_sorted.iloc[:-validation_size]
    df_valid = df_train_sorted.iloc[-validation_size:]
    print('Creating dataset with validation size: {}'.format(validation_size))

    df_train.to_csv('../numerai/T62/train_data.csv', index_label=False)
    df_valid.to_csv('../numerai/T62/valid_data.csv', index_label=False)
    df_test.to_csv('../numerai/T62/test_data.csv', index_label=False)
    print('Done.')
In [ ]:
prep()

Lets ensure we have our data in the right folder

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

t-SNE Feature Encoding

The below t-SNE feature encoding is a modified version of Jim Fleming’s code. The modular function takes as an input a perplexity value, which is in a sense, similar to the k in a nearet neighbors algorithm. It functions as a trade-off betwen the local and global patterns in the data. With a smaller perplexity, local variations will dominate the t-SNE plot. A larger perplexity will cause larger affinity clusters to be generated.

Getting the most from t-SNE may mean incorporating multiple plots with different perplexities into the final training dataset. Most t-SNE encodings I’ve seen have perplexities between 5 and 100, with ranges between 10 and 50 being the most common. I’ll be including 5 different encodings at equal intervals to my final dataset.

In [25]:
import time
import random

from tsne import bh_sne
from sklearn.preprocessing import PolynomialFeatures

def save_tsne(perplexity, dimensions=2, polynomial=False):
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    X_all = np.concatenate([X_train, X_valid, X_test], axis=0)
    
    if polynomial:
        poly = PolynomialFeatures(degree=2)
        X_all = poly.fit_transform(X_all)

    print('Running TSNE (perplexity: {}, dimensions: {}, polynomial: {})...'.format(perplexity, dimensions, polynomial))
    start_time = time.time()
    tsne_all = bh_sne(X_all, d=dimensions, perplexity=float(perplexity))
    print('TSNE: {}s'.format(time.time() - start_time))

    tsne_train = tsne_all[:X_train.shape[0]]
    assert(len(tsne_train) == len(X_train))

    tsne_valid = tsne_all[X_train.shape[0]:X_train.shape[0]+X_valid.shape[0]]
    assert(len(tsne_valid) == len(X_valid))

    tsne_test = tsne_all[X_train.shape[0]+X_valid.shape[0]:X_train.shape[0]+X_valid.shape[0]+X_test.shape[0]]
    assert(len(tsne_test) == len(X_test))

    if polynomial:
        save_path = '../numerai/T62/tsne_{}d_{}p_poly.npz'.format(dimensions, perplexity)
    else:
        save_path = '../numerai/T62/tsne_{}d_{}p.npz'.format(dimensions, perplexity)

    np.savez(save_path, \
        train=tsne_train, \
        valid=tsne_valid, \
        test=tsne_test)
    print('Saved: {}'.format(save_path))

Generate t-SNE. Warning: This will take a long time.

In [ ]:
for perplexity in [10, 20, 30, 40, 50]:
    save_tsne(perplexity, polynomial=True)

Lets check our output again

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

Fitting our Model

Having added the new features, I’ll run the final prediction model using Jim’s code again. Our model uses a logistic regression and outputs a classification probability.

In [ ]:
def main():
    # load data
    df_train = pd.read_csv('data/train_data.csv')
    df_valid = pd.read_csv('data/valid_data.csv')
    df_test = pd.read_csv('data/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    tsne_data_10p = np.load('data/tsne_2d_10p_poly.npz')
    tsne_data_20p = np.load('data/tsne_2d_20p_poly.npz')
    tsne_data_30p = np.load('data/tsne_2d_30p_poly.npz')
    tsne_data_40p = np.load('data/tsne_2d_40p_poly.npz')
    tsne_data_50p = np.load('data/tsne_2d_50p_poly.npz')

    # concat features
    X_train_concat = {
        'X': X_train,
        'tsne_10p': tsne_data_10p['train'],
        'tsne_20p': tsne_data_20p['train'],
        'tsne_30p': tsne_data_30p['train'],
        'tsne_40p': tsne_data_40p['train'],
        'tsne_50p': tsne_data_50p['train'],
    }
    X_valid_concat = {
        'X': X_valid,
        'tsne_10p': tsne_data_10p['valid'],
        'tsne_20p': tsne_data_20p['valid'],
        'tsne_30p': tsne_data_30p['valid'],
        'tsne_40p': tsne_data_40p['valid'],
        'tsne_50p': tsne_data_50p['valid'],
    }
    X_test_concat = {
        'X': X_test,
        'tsne_10p': tsne_data_10p['test'],
        'tsne_20p': tsne_data_20p['test'],
        'tsne_30p': tsne_data_30p['test'],
        'tsne_40p': tsne_data_40p['test'],
        'tsne_50p': tsne_data_50p['test'],
    }

    # build pipeline
    classifier = Pipeline(steps=[
        ('features', FeatureUnion(transformer_list=[
            ('tsne_10p', ItemSelector('tsne_10p')),
            ('tsne_20p', ItemSelector('tsne_20p')),
            ('tsne_30p', ItemSelector('tsne_30p')),
            ('tsne_40p', ItemSelector('tsne_40p')),
            ('tsne_50p', ItemSelector('tsne_50p')),
            ('X', ItemSelector('X')),
        ])),
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', MinMaxScaler()),
        ('lr', LogisticRegression(penalty='l2', C=1e-2, n_jobs=-1)),
    ])

    print('Fitting...')
    start_time = time.time()
    classifier.fit(X_train_concat, y_train)
    print('Fit: {}s'.format(time.time() - start_time))

    p_valid = classifier.predict_proba(X_valid_concat)
    loss = log_loss(y_valid, p_valid)
    print('Loss: {}'.format(loss))

    p_test = classifier.predict_proba(X_test_concat)
    df_pred = pd.DataFrame({
        't_id': df_test['t_id'],
        'probability': p_test[:,1]
    })
    csv_path = 'predictions/predictions_{}_{}.csv'.format(int(time.time()), loss)
    df_pred.to_csv(csv_path, columns=('t_id', 'probability'), index=None)
    print('Saved: {}'.format(csv_path))

Numerai Competition Attempt #1

In this notebook I do some profiling on Numer.ai’s tournament 61 dataset. For those not acquainted with numer.ai, it is a crowd sourced hedge fund where data scientists compete to create the best predictive models. Numerai then uses these predictions to create a meta-model that, as the theory goes, is better than all the disparate models used to create it.

Numerai runs tournaments each week and the format, as far as I know, is the same each week. This is my first step at creating a mostly automatic pipeline for entering into these tournaments. Given that numerai allows for 25 predictions a day, there is also a lot of room to experiement with ‘magic numbers’ and other sorts sorts of hacks to game the competition. That will obviously come later, so the purpose of this notebook is just to get an understanding of the data, the prediction format, and to test out a few models.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [10]:
# Lets see whats in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T61/"]).decode("utf8"))
example_model.py
example_model.r
example_predictions.csv
numerai_tournament_data.csv
numerai_training_data.csv

In [160]:
# Load Training Data
training_data = pd.read_csv('../numerai/T61/numerai_training_data.csv', header=0)
prediction_data = pd.read_csv('../numerai/T61/numerai_tournament_data.csv', header=0)

# Transform the loaded CSV data into numpy arrays
features = [f for f in list(training_data) if "feature" in f]
X = training_data[features]
Y = training_data["target"]
x_prediction = prediction_data[features]
ids = prediction_data["id"]
In [37]:
print X.shape, Y.shape
(108405, 21) (108405,) (45625, 25)

First we’ll get a sense of what type each column is. I’m presuming that we’ll be working primarily with numerical data.

In [19]:
print(X.dtypes)
feature1     float64
feature2     float64
feature3     float64
feature4     float64
feature5     float64
feature6     float64
feature7     float64
feature8     float64
feature9     float64
feature10    float64
feature11    float64
feature12    float64
feature13    float64
feature14    float64
feature15    float64
feature16    float64
feature17    float64
feature18    float64
feature19    float64
feature20    float64
feature21    float64
dtype: object

Python has a handy function that will summarize each column.

In [24]:
print X.describe()
            feature1      feature2       feature3       feature4  \
count  108405.000000  108405.00000  108405.000000  108405.000000   
mean        0.500489       0.52870       0.593131       0.463543   
std         0.090130       0.10490       0.098372       0.096387   
min         0.000000       0.00000       0.000000       0.000000   
25%         0.455870       0.46791       0.540400       0.408910   
50%         0.499290       0.53508       0.594830       0.466530   
75%         0.546790       0.59569       0.648070       0.520500   
max         1.000000       1.00000       0.994840       1.000000   

            feature5       feature6       feature7       feature8  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.430014       0.500072       0.533695       0.484669   
std         0.087958       0.102223       0.098016       0.087952   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.376880       0.436020       0.478910       0.437220   
50%         0.428840       0.502160       0.539150       0.483380   
75%         0.482570       0.566100       0.588430       0.529180   
max         1.000000       1.000000       1.000000       1.000000   

            feature9      feature10      ...            feature12  \
count  108405.000000  108405.000000      ...        108405.000000   
mean        0.351988       0.565529      ...             0.475409   
std         0.100833       0.095981      ...             0.103580   
min         0.005760       0.000000      ...             0.000000   
25%         0.286220       0.510540      ...             0.415570   
50%         0.340520       0.571450      ...             0.477890   
75%         0.406380       0.626150      ...             0.536430   
max         1.000000       1.000000      ...             1.000000   

           feature13      feature14      feature15      feature16  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.426605       0.622671       0.539131       0.407971   
std         0.092866       0.088380       0.093146       0.087795   
min         0.002520       0.000000       0.000000       0.002620   
25%         0.378480       0.570380       0.485290       0.359280   
50%         0.425220       0.626130       0.545240       0.406200   
75%         0.473190       0.676210       0.597950       0.454230   
max         1.000000       1.000000       0.999890       1.000000   

           feature17      feature18      feature19      feature20  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.568731       0.594192       0.525519       0.493394   
std         0.092837       0.097566       0.089129       0.090487   
min         0.000000       0.000000       0.000000       0.021120   
25%         0.522870       0.541190       0.472620       0.444070   
50%         0.569970       0.595320       0.530720       0.488750   
75%         0.616220       0.648000       0.578430       0.540270   
max         0.997570       1.000000       0.979030       0.968260   

           feature21  
count  108405.000000  
mean        0.433121  
std         0.104713  
min         0.000000  
25%         0.365090  
50%         0.423950  
75%         0.493530  
max         1.000000  

[8 rows x 21 columns]
In [33]:
pd.options.display.mpl_style = 'default'

X.boxplot(figsize=(14,8), rot = 45);
In [36]:
X.hist(figsize=(14, 12), bins=30);

Our data looks to be approximately normally distributed between 0 and 1, so theres no immediate need for us to do any sort of additional normalization.

I have heard the misconception that machine learning algorithms do not need normalization. This is false. Imagine a loss function mapped across a 2 dimensional feature space, whereby one of features has a range >> than the other. The gradient descent algorithm for movement along this feature space will not be able to differentiate between the two ranges, and will conversely, fail to navigate to the loss minimizing value along the lower feature range.

I will first be splitting the data into a training and validation set so I can get a better picture of out of sample performance.

In [40]:
# split into test and train
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=False)

First I will be seeing what my performance is looking like out of sample so I can get a sense of how much tuning will be necessary.

I will be running a simple XGBoosted model with some paramaters tuned from a prior Kaggle competition.

In [41]:
import xgboost as xgb # xgboost package 

# Split training and validation datasets
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

I tend to build models with early stopping, so that if the validation performance doesn’t improve in over n rounds, the final model will use whatever was the last configuration before the n iterations.

In [135]:
params = {}
params['eta'] = 0.02 # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1. Lower is slower but more robust to overfitting.
params['objective'] = 'binary:logistic' # Default.  Running a regression, since we're predicting values not classes
params['eval_metric'] = 'logloss' # We're evaluating our models on Mean Average Error.  
params['max_depth'] = 6 # Maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
params['silent'] = 1 # Don't print messages

# We will be tuning our model based on how it does in the validation dataset 
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# Early stopping
clf = xgb.train(params, d_train, 10000, watchlist, early_stopping_rounds=100, verbose_eval=10)
[0]	train-logloss:0.693028	valid-logloss:0.69312
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 100 rounds.
[10]	train-logloss:0.691948	valid-logloss:0.692913
[20]	train-logloss:0.690984	valid-logloss:0.692734
[30]	train-logloss:0.690102	valid-logloss:0.692609
[40]	train-logloss:0.689299	valid-logloss:0.692501
[50]	train-logloss:0.688555	valid-logloss:0.692459
[60]	train-logloss:0.687853	valid-logloss:0.692441
[70]	train-logloss:0.687231	valid-logloss:0.692426
[80]	train-logloss:0.686782	valid-logloss:0.69244
[90]	train-logloss:0.686489	valid-logloss:0.69244
[100]	train-logloss:0.686231	valid-logloss:0.692458
[110]	train-logloss:0.685894	valid-logloss:0.6925
[120]	train-logloss:0.685616	valid-logloss:0.692557
[130]	train-logloss:0.685286	valid-logloss:0.692594
[140]	train-logloss:0.685012	valid-logloss:0.692656
[150]	train-logloss:0.684729	valid-logloss:0.692697
[160]	train-logloss:0.684432	valid-logloss:0.69275
Stopping. Best iteration:
[64]	train-logloss:0.687586	valid-logloss:0.692406

In [116]:
d_prediction = xgb.DMatrix(x_prediction)
In [125]:
results = clf.predict(d_prediction)
results_df = pd.DataFrame(data={'probability':results})
joined = pd.DataFrame(ids).join(results_df)

print("Writing predictions to predictions.csv")
# Save the predictions out to a CSV file
joined.to_csv("predictions.csv", index=False)
print("Done!")
Writing predictions to predictions.csv
Done!

Wow, thats a shocker. I’ve never seen a model improve so little after the first iteration. I think it might be because the data is so cleaned already. I dont think I’ll do any tuning at this moment, and I’ll wait to see how my other models perform on this dataset.

In [127]:
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score

I’ll be running 5 fold cross validation on our training set to generate predictions on our validation dataset.

In [62]:
# import cross validation packages
from sklearn.model_selection import KFold

## Building KFolds with 5 Folds
KFold_5 = KFold(5)

I’m testing a few different models with relatively random hyper-paramaters- I’ll be likely building an ensemble model at this point so I’m trying to get a sense of what models I’ll want to tune later.

In [ ]:
names = ["LR-C=0.01", "LR-C=0.1", "LR-C=1", "LR-C=10",  
         "KNN-50", "KNN-500", "KNN-5000",
         "GBR ls",
         "Random Forest 100", "Random Forest 10"]

regressors = [
    LogisticRegression(C = 0.01),
    LogisticRegression(C = 0.1),
    LogisticRegression(C = 1),
    LogisticRegression(C = 10),
    KNeighborsClassifier(50),
    KNeighborsClassifier(500),
    KNeighborsClassifier(5000),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.02, random_state=0),
    RandomForestClassifier(n_estimators=100),
    RandomForestClassifier(n_estimators=10),   
]

for name, reg in zip(names, regressors):
    reg.fit(x_train, y_train)
    predicted = reg.predict_proba(x_valid)
    #predicted = cross_val_predict(reg, x_valid, y_valid, cv=KFold_5)
    print name, metrics.log_loss(y_valid, predicted)

With the exeption of the penalized logistic regression model, none of the other tested models outperform the XGBoosted model on the validation dataset. The good news is that we’re seeing promising results from the linear regression models and KNN models. If you’re ensembling, you first and foremost want high performing models. But you also want to ensure that your models are different in ways to compensate one another. For example, you’d stand little to gain pairing a number of forest based models since they’ll likely be strong and weak in many of the same ways.

I’m learning towards my ensemble combining a K-Nearest Neighbors model, a penalized logistic regression, and an XGboost model. I’ll see what sort of performance I get on the Numerai leaderboard before I try tuning the parameters. To ensemble, I’ll simply average the prediction probabilities over the full dataset.

In [151]:
d_X = xgb.DMatrix(X, label=Y)

params = {}
params['eta'] = 0.02 # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1. Lower is slower but more robust to overfitting.
params['objective'] = 'multi:softprob' 
params['eval_metric'] = 'logloss' # We're evaluating our models on Mean Average Error.  
params['max_depth'] = 6 # Maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
params['silent'] = 1 # Don't print messages
params['num_class'] = 2
In [152]:
LR_CLF = LogisticRegression(C = 10)
KN_CLF = KNeighborsClassifier(5000)

Fit our models…

In [153]:
XG_CLF = xgb.train(params, d_X, num_boost_round=64)
LR_CLF.fit(X, Y)
KN_CLF.fit(X, Y)
Out[153]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5000, p=2,
           weights='uniform')

Generate out predictions…

In [161]:
d_prediction = xgb.DMatrix(x_prediction)
In [170]:
XG_prediction = XG_CLF.predict(d_prediction)
LR_prediction = LR_CLF.predict_proba(x_prediction)
KN_prediction = KN_CLF.predict_proba(x_prediction)

Average and print our predictions to CSV…

In [175]:
y_prediction = (XG_prediction + XG_prediction + KN_prediction)/3
results = y_prediction[:, 1]
results_df = pd.DataFrame(data={'probability':results})
joined = pd.DataFrame(ids).join(results_df)

print("Writing predictions to predictions.csv")
# Save the predictions out to a CSV file
joined.to_csv("predictions_0624.csv", index=False)
print("Done!")
Writing predictions to predictions.csv
Done!

With our 3 model ensemble we have a score of 0.6927 on Numer.ai. Enough for a paltry 300th place, but the model is both original, concordant, and 75% consistent on historical datasets. With tuning and some feature engineering we’ll hopefully be able to bring this down to below 0.69, enough for a top 200 finish.

Ensemble Methods

The discussion of ensemble methods will compromise both popular algorithms that incporate multiple models within them, as well as the combination of algorithms.

In the first part, I’ll discuss the logic behind ensemble algorithms like Decision Trees, Random Forests, and Gradient Boosted Models. In the second half, I’ll implement a meta model that incorporates a number of different classifiers and then votes on the classification based on different rules.

In [29]:
from IPython.display import Image
from IPython.core.display import HTML

Ensemble Models

My favorite illustration of ensemble learning comes from the Jain parable of ‘The Blind Men and the Elephant’

In [30]:
Image(url= "https://www.jainworld.com/literature/story25i1.gif", width=700, height=500)
Out[30]:

Ensemble methods like random forests will combine the outputs of multiple weaker clasifier models to reduce the risk of selecting a single poorly performing strong classifier.

The random forests method takes and combines the predictions of small decision trees working on a randomized portion of the original dataset to produce a ‘meta-model’ that will generalize better on a new testing dataset.

The extent and rate at which these small decision trees in the ‘random forest’ are pruned and adjusted is what distinguishes the random forest method with the ‘boosted’ varieties like XGBoost that you’ve probably heard of or used before. The idea, however, is the same– Lots of little independent decisions made on random parts and distributions of the original dataset.

Voting Classifiers

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

In this example, I’ll be combining different machine learning classifiers and using either a hard or soft voting system to predict the class labels. The voting system will either take a majority vote or sum up the individual probabilities of each classifier to select the class labels. The use of voting classifiers is desired because they are able to balance out the individual weaknesses of each model.

Hard (Majority) Voting

With hard voting, the final class label is determined by the class label that has received the majority of votes. An example:

  • Classifier 1: Label A
  • Classifier 2: Label B
  • Classifier 3: Label B

We would expect that the hard voting classifier would predict label B.

Below I will implement the hard voting classifier on 3 algorithms: a random forest, a logistic regression, and K-Nearest Neighbors

In [9]:
# Import datasets and voting tools
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target

# Create classifier instances
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = KNeighborsClassifier(3)

# Create voting classifier instances
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3)], voting='hard')

# Print success
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'K-NN', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.960 (+/- 0.04) [Logistic Regression]
Accuracy: 0.953 (+/- 0.03) [Random Forest]
Accuracy: 0.967 (+/- 0.02) [K-NN]
Accuracy: 0.973 (+/- 0.01) [Ensemble]

With hard voting, our ensemble model had a higher accuracy score then any of the classifiers individually. Next we will implement the same model with soft voting.

Soft Voting

Soft voting returns the final class label as the max argument of the sum of predicted probabilities. The final label is the highest average probability across all classifiers. It is just an argument switch to change the implementation from hard to soft voting!

In [12]:
# Create classifier instances
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = KNeighborsClassifier(3)

# Create voting classifier instances
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), 
                                    ('knn', clf3)], voting='soft')

# Print successes and std deviation across k-folds
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 
                                                 'Random Forest', 
                                                 'K-NN', 
                                                 'Ensemble']):
    # cross validate across 5 folds
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    
    # print
    print("Accuracy: %0.3f (+/- %0.2f) [%s]" % 
          (scores.mean(), scores.std(), label))
Accuracy: 0.960 (+/- 0.04) [Logistic Regression]
Accuracy: 0.960 (+/- 0.03) [Random Forest]
Accuracy: 0.967 (+/- 0.02) [K-NN]
Accuracy: 0.960 (+/- 0.03) [Ensemble]

The ensemble method doesn’t perform as well this time around. Me thinks its because the LR and RF models (which have thus far performed worse individually) are holding more weight in a couple tie breaker scenarios.

Visualizing decision boundaries

I fund some handy code on the Sci-Kit learn documentation that shows how to plot the decision boundaries of our indivdiual classifier models. Super cool! See here for more: http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html

In [28]:
from itertools import product
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=3)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), 
                                    ('knn', clf2),
                                    ('svc', clf3)],
                        voting='hard')

# Fit classifiers
clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)

# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=4)', 'KNN (k=3)',
                         'Kernel SVM', 'Hard Voting']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

Feature Selection

This notebook will aim to provide an explanation and application of different feature ranking methods, namely that of Recursive Feature Elimination (RFE), Stability Selection, linear models as well as Random Forest.

This notebook borrows heavily from an article by Ando Saabas on feature selection found here: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/ as well as Anisotropic’s work on feature selection found here: https://www.kaggle.com/arthurtok/feature-ranking-w-randomforest-rfe-linear-models/code. A lot of Anisotropic’s comments are used in this kernel to explain what is being done.

My work is an application of the techniques and methods written by those gentlemen to the Zillow data competition on Kaggle (https://www.kaggle.com/c/zillow-prize-1). I have included additional competition specific data cleaning and comments in addition to the feature selection code.

There is one point of serious concern with the application of this code to the Zillow competition. In the Zillow competition, we are not estimating home price directly. Zillow is not sharing the actual sale price as a predictor variable (presumably to avoid something like this happening: https://www.wired.com/2010/03/netflix-cancels-contest/) but is having Kagglers estimate the log-loss of their model instead. Weird, I know.

I’ve been left thinking what a feature selection on the log-loss means. Are the most prominent features the ones that Zillow has been most incorrectly using? Are the best features at predicting log loss the worst features to use in a price estimate model? Do we need to do some feature engineering to make the ‘best’ features here more usable?

The contents of this notebook are as follows:

  1. Data Cleaning and Visualisation : This section will revolve around exploring the data and visualising some summary statistics.
  2. Stability Selection via Randomised Lasso Method : Introduce a relatively new feature selection method called “Stability Selection” and using the Randomised Lasso in its implementation
  3. Recursive Feature Elimination : Implementing the Recursive Feature Elimination method of feature ranking via the use of basic Linear Regression
  4. Linear Model Feature Coefficients : Implementing 3 of Sklearn’s linear models (Linear Regression, Lasso and Ridge) and using the inbuilt estimated coefficients for our feature selection
  5. Random Forest Feature Selection : Using the Random Forest’s convenient attribute “feature_importances” to calculate and ultimately rank the feature importance.

Finally, with all the points 1 to 5 above, we will combine the results to create our:

Feature Ranking Matrix : Matrix of all the features along with the respective model scores which we can use in our ranking.

In [167]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, RandomizedLasso)
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

1. DATA CLEANSING AND ANALYSIS

Let’s first read in the house data as a dataframe “house” and inspect the first 5 rows

In [185]:
# Load Data
train = pd.read_csv('../input/train_2016_v2.csv')
prop = pd.read_csv('../input/properties_2016.csv')

# Convert to float32 (this is done to post on Kaggle Kernels)
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)

# Merge training dataset with properties dataset
df_train = train.merge(prop, how='left', on='parcelid')

# Remove ID columns
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc', 
                         'propertycountylandusecode'], axis=1)
In [171]:
x_train.head()
Out[171]:
airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid finishedfloor1squarefeet numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 1.0 NaN NaN 2.0 3.0 NaN 4.0 2.0 NaN NaN NaN NaN 122754.0 360170.0 2015.0 237416.0 6735.88 NaN NaN 6.037107e+13
1 NaN NaN NaN 3.5 4.0 NaN NaN 3.5 NaN NaN NaN NaN 346458.0 585529.0 2015.0 239071.0 10153.02 NaN NaN NaN
2 1.0 NaN NaN 3.0 2.0 NaN 4.0 3.0 NaN NaN NaN NaN 61994.0 119906.0 2015.0 57912.0 11484.48 NaN NaN 6.037464e+13
3 1.0 NaN NaN 2.0 2.0 NaN 4.0 2.0 NaN NaN NaN NaN 171518.0 244880.0 2015.0 73362.0 3048.74 NaN NaN 6.037296e+13
4 NaN NaN NaN 2.5 4.0 NaN NaN 2.5 NaN NaN 2.0 NaN 169574.0 434551.0 2015.0 264977.0 5488.96 NaN NaN 6.059042e+13

5 rows × 55 columns

Now its time for some general data inspection. Let’s first examine to see if there are any nulls in the dataframe as well as look at the type of the data (i.e whether it is a string or numeric)

In [172]:
print(x_train.dtypes)
airconditioningtypeid           float64
architecturalstyletypeid        float64
basementsqft                    float64
bathroomcnt                     float64
bedroomcnt                      float64
buildingclasstypeid             float64
buildingqualitytypeid           float64
calculatedbathnbr               float64
decktypeid                      float64
finishedfloor1squarefeet        float64
calculatedfinishedsquarefeet    float64
finishedsquarefeet12            float64
finishedsquarefeet13            float64
finishedsquarefeet15            float64
finishedsquarefeet50            float64
finishedsquarefeet6             float64
fips                            float64
fireplacecnt                    float64
fullbathcnt                     float64
garagecarcnt                    float64
garagetotalsqft                 float64
hashottuborspa                   object
heatingorsystemtypeid           float64
latitude                        float64
longitude                       float64
lotsizesquarefeet               float64
poolcnt                         float64
poolsizesum                     float64
pooltypeid10                    float64
pooltypeid2                     float64
pooltypeid7                     float64
propertylandusetypeid           float64
rawcensustractandblock          float64
regionidcity                    float64
regionidcounty                  float64
regionidneighborhood            float64
regionidzip                     float64
roomcnt                         float64
storytypeid                     float64
threequarterbathnbr             float64
typeconstructiontypeid          float64
unitcnt                         float64
yardbuildingsqft17              float64
yardbuildingsqft26              float64
yearbuilt                       float64
numberofstories                 float64
fireplaceflag                    object
structuretaxvaluedollarcnt      float64
taxvaluedollarcnt               float64
assessmentyear                  float64
landtaxvaluedollarcnt           float64
taxamount                       float64
taxdelinquencyflag               object
taxdelinquencyyear              float64
censustractandblock             float64
dtype: object

I’ll see what values are in the object type columns — We will have to either break these features up so that for each option there is just a binary (True / False) indicator, or find a way to make the existing features binary.

In [179]:
x_train['taxdelinquencyflag'].value_counts()
Out[179]:
False    88492
True      1783
Name: taxdelinquencyflag, dtype: int64
In [180]:
x_train['fireplaceflag'].value_counts()
Out[180]:
False    90053
True       222
Name: fireplaceflag, dtype: int64
In [181]:
x_train['hashottuborspa'].value_counts()
Out[181]:
False    87910
True      2365
Name: hashottuborspa, dtype: int64

I’ll have to convert the ‘Y’ to a TRUE and otherwise convert any NaN to False’s

In [177]:
x_train['taxdelinquencyflag'] = x_train['taxdelinquencyflag'] == 'Y'
In [178]:
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
In [60]:
# Looking for nulls
print(x_train.isnull().sum())
airconditioningtypeid           61494
architecturalstyletypeid        90014
basementsqft                    90232
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             90259
buildingqualitytypeid           32911
calculatedbathnbr                1182
decktypeid                      89617
finishedfloor1squarefeet        83419
calculatedfinishedsquarefeet      661
finishedsquarefeet12             4679
finishedsquarefeet13            90242
finishedsquarefeet15            86711
finishedsquarefeet50            83419
finishedsquarefeet6             89854
fips                                0
fireplacecnt                    80668
fullbathcnt                      1182
garagecarcnt                    60338
garagetotalsqft                 60338
hashottuborspa                      0
heatingorsystemtypeid           34195
latitude                            0
longitude                           0
lotsizesquarefeet               10150
poolcnt                         72374
poolsizesum                     89306
pooltypeid10                    89114
pooltypeid2                     89071
pooltypeid7                     73578
propertylandusetypeid               0
rawcensustractandblock              0
regionidcity                     1803
regionidcounty                      0
regionidneighborhood            54263
regionidzip                        35
roomcnt                             0
storytypeid                     90232
threequarterbathnbr             78266
typeconstructiontypeid          89976
unitcnt                         31922
yardbuildingsqft17              87629
yardbuildingsqft26              90180
yearbuilt                         756
numberofstories                 69705
fireplaceflag                       0
structuretaxvaluedollarcnt        380
taxvaluedollarcnt                   1
assessmentyear                      0
landtaxvaluedollarcnt               1
taxamount                           6
taxdelinquencyflag                  0
taxdelinquencyyear              88492
censustractandblock               605
dtype: int64

Yikes, the data is pretty sparse. We’re going to have to figure out how to either remove some features or impute values for some of these.

Temporarily I’ll be imputing medians for the missing values based on the missing value for said column, but its only a stopgap solution before I can do something else (possibly predict missing values?)

In [182]:
x_train = x_train.fillna(x_train.median())

2. Stability Selection via Randomized Lasso

In a nutshell, this method serves to apply the feature selection on different parts of the data and features repeatedly until the results can be aggregated. Therefore stronger features ( defined as being selected as important) will have greater scores in this method as compared to weaker features. Refer to this paper by Nicolai Meinshausen and Peter Buhlmann for a much greater detail on the method : http://stat.ethz.ch/~nicolai/stability.pdf

In this notebook, the Stability Selection method is conveniently inbuilt into sklearn’s randomized lasso model and therefore this will be implemented as follows:

In [84]:
# First extract the target variable which is our Log Error
Y = df_train['logerror'].values
X = x_train.as_matrix()

# Store the column/feature names into a list "colnames"
colnames = x_train.columns

Next, we create a function which will be able to conveniently store our feature rankings obtained from the various methods described here into a Python dictionary. In case you are thinking I created this function, no this is not the case. All credit goes to Ando Saabas and I am only trying to apply what he has discussed in this context.

In [85]:
# Define dictionary to store our rankings
ranks = {}
# Create our function which stores the feature rankings to the ranks dictionary
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    return dict(zip(names, ranks))
In [87]:
# Finally let's run our Selection Stability method with Randomized Lasso
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["rlasso/Stability"] = ranking(np.abs(rlasso.scores_), colnames)
print('finished')
finished

3. Recursive Feature Elimination ( RFE )

Now onto the next method in our feature ranking endeavour. Recursive Feature Elimination or RFE uses a model ( eg. linear Regression or SVM) to select either the best or worst-performing feature, and then excludes the feature. After this, the whole process is iterated until all features in the dataset are used up ( or up to a user-defined limit). Sklearn conveniently possesses a RFE function via the sklearn.feature_selection call and we will use this along with a simple linear regression model for our ranking search as follows:

In [88]:
# Construct our Linear Regression model
lr = LinearRegression(normalize=True)
lr.fit(X,Y)

#stop the search when only the last feature is left
rfe = RFE(lr, n_features_to_select=1, verbose =3 )
rfe.fit(X,Y)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)), colnames, order=-1)
Fitting estimator with 55 features.
Fitting estimator with 54 features.
Fitting estimator with 53 features.
Fitting estimator with 52 features.
Fitting estimator with 51 features.
Fitting estimator with 50 features.
Fitting estimator with 49 features.
Fitting estimator with 48 features.
Fitting estimator with 47 features.
Fitting estimator with 46 features.
Fitting estimator with 45 features.
Fitting estimator with 44 features.
Fitting estimator with 43 features.
Fitting estimator with 42 features.
Fitting estimator with 41 features.
Fitting estimator with 40 features.
Fitting estimator with 39 features.
Fitting estimator with 38 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.

4. Linear Model Feature Ranking

Now let’s apply 3 different linear models (Linear, Lasso and Ridge Regression) and how the features are selected and prioritised via these models. To achieve this, I shall use the sklearn implementation of these models and in particular the attribute .coef to return the estimated coefficients for each feature in the linear model.

In [89]:
# Using Linear Regression
lr = LinearRegression(normalize=True)
lr.fit(X,Y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), colnames)

# Using Ridge 
ridge = Ridge(alpha = 7)
ridge.fit(X,Y)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)

# Using Lasso
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)

5. Random Forest feature ranking

Sklearn’s Random Forest model also comes with it’s own inbuilt feature ranking attribute and one can conveniently just call it via “featureimportances“. That is what we will be using as follows:

In [105]:
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=2)
rf.fit(X,Y)
ranks["RF"] = ranking(rf.feature_importances_, colnames)
building tree 1 of 50
building tree 3 of 50 building tree 2 of 50building tree 7 of 50building tree 4 of 50
building tree 6 of 50building tree 5 of 50building tree 8 of 50





building tree 9 of 50
building tree 10 of 50
building tree 11 of 50
building tree 12 of 50
building tree 13 of 50
building tree 14 of 50
building tree 15 of 50
building tree 16 of 50
building tree 17 of 50
building tree 18 of 50
building tree 19 of 50
building tree 20 of 50
building tree 21 of 50
building tree 22 of 50
building tree 23 of 50
building tree 24 of 50
building tree 25 of 50
building tree 26 of 50
building tree 27 of 50
building tree 28 of 50building tree 29 of 50

building tree 30 of 50
building tree 31 of 50
building tree 32 of 50
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   21.6s
building tree 33 of 50
building tree 34 of 50
building tree 35 of 50
building tree 36 of 50
building tree 37 of 50
building tree 38 of 50
building tree 39 of 50
building tree 40 of 50
building tree 41 of 50
building tree 42 of 50
building tree 43 of 50
building tree 44 of 50
building tree 45 of 50
building tree 46 of 50
building tree 47 of 50
building tree 48 of 50
building tree 49 of 50
building tree 50 of 50
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   38.3s finished

6. Creating the Feature Ranking Matrix

We combine the scores from the various methods above and output it in a matrix form for convenient viewing as such:

In [133]:
# Create empty dictionary to store the mean value calculated from all the scores
r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print("\t%s" % "\t".join(methods))
for name in colnames:
    print("%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods]))))
	Lasso	LinReg	Mean	RF	RFE	Ridge	rlasso/Stability	Mean
airconditioningtypeid	0.0	0.0	0.13	0.07	0.72	0.01	0.0	0.13
architecturalstyletypeid	0.0	0.0	0.12	0.0	0.69	0.01	0.0	0.12
basementsqft	0.0	0.0	0.1	0.0	0.63	0.0	0.0	0.1
bathroomcnt	0.0	0.0	0.22	0.1	0.85	0.36	0.0	0.22
bedroomcnt	0.0	0.0	0.18	0.19	0.83	0.03	0.0	0.18
buildingclasstypeid	0.0	1.0	0.19	0.0	0.11	0.0	0.0	0.19
buildingqualitytypeid	0.0	0.0	0.12	0.11	0.59	0.01	0.0	0.12
calculatedbathnbr	0.0	0.0	0.2	0.08	0.89	0.23	0.0	0.2
decktypeid	0.0	0.16	0.05	0.0	0.13	0.0	0.0	0.05
finishedfloor1squarefeet	0.0	0.0	0.09	0.04	0.52	0.0	0.0	0.09
calculatedfinishedsquarefeet	0.0	0.0	0.16	0.54	0.39	0.0	0.0	0.16
finishedsquarefeet12	1.0	0.0	0.33	0.51	0.48	0.0	0.0	0.33
finishedsquarefeet13	0.0	0.0	0.1	0.0	0.61	0.0	0.0	0.1
finishedsquarefeet15	0.46	0.0	0.17	0.12	0.41	0.0	0.0	0.17
finishedsquarefeet50	0.13	0.0	0.11	0.05	0.5	0.0	0.0	0.11
finishedsquarefeet6	0.02	0.0	0.08	0.02	0.46	0.0	0.0	0.08
fips	0.0	0.0	0.14	0.01	0.65	0.18	0.0	0.14
fireplacecnt	0.0	0.0	0.18	0.03	0.96	0.11	0.0	0.18
fullbathcnt	0.0	0.0	0.16	0.07	0.8	0.07	0.0	0.16
garagecarcnt	0.0	0.0	0.15	0.03	0.81	0.06	0.0	0.15
garagetotalsqft	0.54	0.0	0.18	0.13	0.43	0.0	0.0	0.18
hashottuborspa	0.0	0.0	0.24	0.02	0.94	0.51	0.0	0.24
heatingorsystemtypeid	0.0	0.0	0.14	0.07	0.76	0.02	0.0	0.14
latitude	0.0	0.0	0.18	0.88	0.2	0.0	0.0	0.18
longitude	0.0	0.0	0.19	0.92	0.22	0.0	0.0	0.19
lotsizesquarefeet	0.0	0.0	0.19	0.89	0.24	0.0	0.0	0.19
poolcnt	0.0	0.01	0.01	0.0	0.07	0.0	0.0	0.01
poolsizesum	0.0	0.0	0.07	0.01	0.44	0.0	0.0	0.07
pooltypeid10	0.0	0.03	0.01	0.0	0.04	0.0	0.0	0.01
pooltypeid2	0.0	0.0	0.02	0.0	0.15	0.0	0.0	0.02
pooltypeid7	0.0	0.01	0.01	0.0	0.06	0.0	0.0	0.01
propertylandusetypeid	0.0	0.0	0.13	0.09	0.7	0.0	0.0	0.13
rawcensustractandblock	0.0	0.0	0.1	0.27	0.33	0.0	0.0	0.1
regionidcity	0.0	0.0	0.07	0.25	0.17	0.0	0.0	0.07
regionidcounty	0.49	0.0	0.14	0.01	0.35	0.0	0.0	0.14
regionidneighborhood	0.0	0.0	0.08	0.31	0.19	0.0	0.0	0.08
regionidzip	0.03	0.0	0.12	0.44	0.26	0.0	0.0	0.12
roomcnt	0.0	0.0	0.13	0.06	0.74	0.01	0.0	0.13
storytypeid	0.0	0.01	0.02	0.0	0.09	0.0	0.0	0.02
threequarterbathnbr	0.0	0.0	0.34	0.02	1.0	1.0	0.0	0.34
typeconstructiontypeid	0.0	0.0	0.18	0.0	0.93	0.16	0.0	0.18
unitcnt	0.0	0.0	0.15	0.04	0.87	0.02	0.0	0.15
yardbuildingsqft17	0.0	0.0	0.1	0.04	0.56	0.0	0.0	0.1
yardbuildingsqft26	0.0	0.0	0.09	0.0	0.54	0.0	0.0	0.09
yearbuilt	0.0	0.0	0.19	0.58	0.57	0.0	0.0	0.19
numberofstories	0.0	0.0	0.14	0.04	0.67	0.11	0.0	0.14
fireplaceflag	0.0	0.0	0.2	0.0	0.91	0.28	0.0	0.2
structuretaxvaluedollarcnt	0.01	0.0	0.21	1.0	0.28	0.0	0.0	0.21
taxvaluedollarcnt	0.01	0.0	0.18	0.73	0.31	0.0	0.0	0.18
assessmentyear	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
landtaxvaluedollarcnt	0.01	0.0	0.18	0.76	0.3	0.0	0.0	0.18
taxamount	0.31	0.0	0.26	0.86	0.37	0.0	0.0	0.26
taxdelinquencyflag	0.0	0.0	0.3	0.02	0.98	0.83	0.0	0.3
taxdelinquencyyear	0.0	0.0	0.15	0.07	0.78	0.03	0.0	0.15
censustractandblock	0.0	0.0	0.05	0.28	0.02	0.0	0.0	0.05

Now, with the matrix above, the numbers and layout does not seem very easy or pleasant to the eye. Therefore, let’s just collate the mean ranking score attributed to each of the feature and plot that via Seaborn’s factorplot.

In [183]:
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])

# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
In [184]:
# Let's plot the ranking of the features
sns.factorplot(x="Mean Ranking", y="Feature", data = meanplot, kind="bar", size=10, 
               aspect=1, palette='coolwarm')
Out[184]:
<seaborn.axisgrid.FacetGrid at 0x126458b50>

Conclusion

The top 3 features are “Three Quarter Bathroom Count”, “Finished Square Feet”, and “Tax Delinquincy Flag”. “Tax amount” and “Has Hot Tub or Spa” are 4th and 5th.

To continue the discussion from up top – these are the features that are best at predicting Zillow’s log-loss on their price estimate model.

These rankings are showing us, I believe, which features Zillow has been using the most incorrectly to estimate their models. I’m sure that ‘Finished Square feet’ would be a great predictor at home price, if I were to write a price prediction algorithm myself. But, I’m thinking, they’ve relied on it too much, or have insufficiently factored in some additional interactions that are causing it to become a bad feature.

Hyperparameter Selection & Visualization

In this notebook I will be covering hyperparameter selection for model tuning. Hyperparameters express “higher-level” properties of the model such as its complexity or how fast it should learn. Hyperparameters are usually fixed before the actual training process begins, and cannot be learned directly from the data in the standard model training process. Model developers will predefine these hyperparameters by testing different values, training different models, and choosing the values that test better.

Examples of hyperparmeters are:

  • number of leaves in a tree based model
  • clusters in a k-means algorithm
  • number of hidden layers in a neural network
  • penalization incurred by additional model complexity
  • learning rate

In this notebook I will do hyperparameter selection for a model, including visualization of the hyperparameter grid as well as a revision after seeing unintuitive results.

In [77]:
#Image displat packages
from IPython.display import Image
from IPython.core.display import HTML 

#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target

Hyperparameter Optimization

We’ll be using cross validation to initially find the optimal hyperparameters for our SVM kernal classifier.

When training our SVM clasisifier with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. Proper choice of these two hyperparameters is critical to the SVM classifier’s performance. We use cross validation and a separate train/test set because we do not want to overfit our hyperparamters to our data. Hence we will be finding the parameters on a separate dataset than the dataset we will be evaluating our model on.

The parameter C, common to a lot of ML algorithms, trades off misclassiciation for simpliciity. A low C will make the decision boundary smooth (trading off some accuracy), while a high C will trade off high accuracy for simplicity. See below for an example of this in action.

In [5]:
Image(url= "https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_18.png")
Out[5]:

Gamma defines how much influence a single training example has. Consider it a ‘sphere of influence’. The larger gamma is, the closer other examples must be to be affected.

In [6]:
Image(url= "https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_15.png")
Out[6]:

We will be splitting our data into testing and training datasets, like we did before. We will be finding the our optimal hyperparameters on the training dataset before we evaluate them on our training set.

In [7]:
# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=False)
In [10]:
# load in SVC - a kernel classifier function from the scikit-learn library
from sklearn.svm import SVC

# create a default instance of the classifier
classifier = SVC()

I will create a dictionary of possible gammas and C’s for our cross validator to run through. I looked through literature and examples of the SVM rbf kernel to get a sense of what others were using. For an initial search, a logarithmic grid with basis 10 is used.

In [49]:
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)

I will be using the GridSearchCV package to run hyperparameter cross validation on the training dataset.

GridSearchCV’s default settings will run 3-fold cross validation, but I will be doing 5-fold cross validation.

In [50]:
# import cross validation packages
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# setup 5 fold cross validation to be fed into the GridSearchCV package
cv_5 = KFold(5)
In [78]:
# create a parameter range to test over
parameters = {'kernel':['rbf'], 'gamma':gamma_range, 'C':C_range}

# create an instance of the GridSearchCV cross-validator - using our classifier 
# and choice or parameters
cross_validator = GridSearchCV(classifier, parameters, cv=cv_5)
In [79]:
# fit the data
cross_validator.fit(X_train, y_train)
Out[79]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
       error_score='raise',
       estimator=SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ['rbf'], 'C': array([  1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
         1.00000e+02,   1.00000e+03,   1.00000e+04,   1.00000e+05,
         1.00000e+06,   1.00000e+07,   1.00000e+08,   1.00000e+09,
         1.00000e+10]), 'gamma': array([  1.00000e-09,   1.00000e-08,   1.00000e-07,   1.00000e-06,
         1.00000e-05,   1.00000e-04,   1.00000e-03,   1.00000e-02,
         1.00000e-01,   1.00000e+00,   1.00000e+01,   1.00000e+02,
         1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [80]:
best_gamma = cross_validator.best_estimator_.gamma  
best_C = cross_validator.best_estimator_.C

print 'Best Gamma: {}, Best C: {}' .format(best_gamma, best_C)
Best Gamma: 1e-07, Best C: 100000000.0

Debrief

So from what I’ve seen in my previous work, the high C initially gave me some concern. Its not exactly the perfect interpretation, but the C value of 100000000 is essentially stating that our model is weighting misclassifications as being 10*10^8 times worse than adding complexity to our model. Its suggesting a C this high because it appears that across all 5 folds of our CV, the high C model was performing very well.

In a pure out of sample evaluation, our model would likely perform very poorly. My hypothesis why the C is so high is that due to the small testing data set of each k-fold cross validation (only 24 points!) our model is fitting close to all the points in each testing set.

I’ll first re-shape the score array such that I can easily access the score of each hyperparamter combination.

In [66]:
scores = cross_validator.cv_results_['mean_test_score'].reshape(len(C_range),
                                                     len(gamma_range))

Checking the ‘optimal’ hyperparameters, we see that our model was near perfect in its predictions across the testing sets in the 5 folds.

In [65]:
scores[[C_range==100000000, gamma_range==1e-07]]
Out[65]:
array([ 0.96666667])

I found a plotting function online that would show me a heatmap of how well the scoring was across each unique hyperparameter combination.

In [67]:
# source: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

# Utility function to move the midpoint of a colormap to be around
# the values of interest.
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

## Plotting Function
plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.title('Validation accuracy')
plt.show()

Our glowing hot point near the bottom left shows that the particular hyperparameter combo did slightly better than a lot of (already pretty good!) hyperparameter combinations.

To control for this overfitting, I’m going to re-run our hyperparameter grid search using fewer k-folds (To create larger test sets) in addition to some fold shuffling.

In [70]:
cv_3 = StratifiedShuffleSplit(n_splits=3, test_size=0.33, random_state=42)

# create a parameter range to test over
parameters = {'kernel':['rbf'], 'gamma':gamma_range, 'C':C_range}

# create an instance of the GridSearchCV cross-validator - using our classifier and choice or parameters
cross_validator = GridSearchCV(classifier, parameters, cv=cv_4)

# fit the data
cross_validator.fit(X_train, y_train)

best_gamma = cross_validator.best_estimator_.gamma  
best_C = cross_validator.best_estimator_.C

print 'Best Gamma: {}, Best C: {}' .format(best_gamma, best_C)
Best Gamma: 0.01, Best C: 10.0
In [71]:
scores = cross_validator.cv_results_['mean_test_score'].reshape(len(C_range),
                                                     len(gamma_range))

plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.title('Validation accuracy')
plt.show()

White hot! Looks like we didn’t lose predictive ability by decreasing the number of folds in our cross validation. We have many good possible C or Gamma values for measuing our model performance.

Now we’ll see how our model runs on our testing data with our selected gamma and C.

In [76]:
# Run on test data
cross_validator.score(X_test, y_test)
Out[76]:
1.0

Boom! Too good. If we had other competing models, we’d be comparing this score. Iris is an easy dataset, so this is nothing too groundbreaking to be proud of 😉

In [ ]:
 

Model Validation Techniques

In statistics and machine learning, model validation is the process of deciding whether the results of a model are sufficiently generalizable to describe and make predictions on similarly distributed data.

Model validation largely revolves around the great dilemma between bias and variance. Model developers want to choose a model that will accurately capture the predictions in the training data, but will also generalize well on unseen data.

  • The first of these demands is a desire to reduce bias, which is the error from erroneous assumptions in the learning algorithm. High bias models are lower accuracy because they do not utilize all the possible relations between features and predictions available in the data. Low bias models will often be very complex, and will usually have a much higher accuracy as a result. The risk with low bias models is that they are overfit on the data and are not modelling on a true relationship between the features and predictions, but on the noise present in the data. The opposite of overfitting is underfitting, and it is used to describe models that potentially miss a significant relationship between a feature and a prediction.
  • The second demand on a model developer is to reduce variance, which is the error from fluctuations in the underlying data. High variance models are poor generalizers on data outside the training set. Models that are (over)fit to the noise in the training data will not be able to make good predictions on data that has a different distribution. Low variance models should have reasonable accuracy out of sample, because they have correctly identified real relationships between the features and predictions, and not just noise.

In the training phase of our model development process, care is taken to tune models such that they are able to minimize bias and variance as much as possible. In this notebook, I’ll be implementing the most basic of validation techniques, the Test/Train Split and Cross-Validation.

In [47]:
from IPython.display import Image
from IPython.core.display import HTML 

#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Test Train Splits

If you’re going to be developing any sort of predictive model, the minimal amount of validation will require splitting your data into testing and training datatsets.

In [48]:
# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target
In [49]:
# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=False)

We have split the dataset such that 20% of the data will be removed for testing purposes. In this scenario, I have split the data such that the last 20% are reserved for testing (via random_state=False), but you can easily imagine that another method of splitting data might pull the 20% randomly from the dataset. The use of this flag breaks down to whether or not we’re dealing with time dependent data.

If we’re dealing with a time dimension in our data (e.g. financial analysis), it is more common to remove the last 6 months of data, for example, for testing purposes. The reason being that it is easy to overfit data this way- You’re still testing over points that match the general patterns in the training dataset.

Below I’ll be training the data on our training dataset, and measuring its fit on the test dataset.

In [55]:
# load in SVC - a kernel classifier function from the scikit-learn library
from sklearn.svm import SVC

# create an instance of a kernel-based regressor from scikit learn
classifier = SVC(kernel = 'rbf')

# Fit on training dataset
classifier.fit(X[train], y[train])

# Print score
print 'Our model correctly classified {} of our test points' \
.format(classifier.score(X[test], y[test]))
Our model correctly classified 0.933333333333 of our test points

Cross Validation

Cross Validation is used in model development because it is one of the ways we get insights on out of sample model performance in addition to preventing overfitting of our models. It is often done in addition to a hold out test set (as done above) and is often used to find the optimal hyperparameters for a given function.

Cross Validation techniques will generally split up the dataset in a way that either fixed numbers or portions of the dataset get iteratively placed into testing and training pools. Two of the most common methods of cross validation are LeaveOneOut and K-Folds.

K-Folds

In K-Folds validation, the data is split into test/train datasets k times. Each iteration of k-folds will test the data on 1/k of the dataset, while training it on the remaining k-1/k portion of the data. Below is an illustration of the k iterations of k-folds.

In [46]:
Image(url= "http://cse3521.artifice.cc/images/k-fold-cross-validation.jpg")
Out[46]:

K-folds has been empircally shown to exhibit desireable tradeoffs between bias and variance with a k of 5 and 10, so you’ll see these most commonly represented in tests and examples. [Citation needed]

Leave One Out

Leave one out is a special case of K-Folds whereby k is equivalent to the number of datapoints, n.

Which is preferred?

As you can imagine, leave one out can be quite computationally expensive– each model is essentially run n times. It is therefore usually only run on small datasets whereby the computational hit can be taken. With regards to bias/variance, LOO CV will lead to estimates with lower bias, but higher variance because each training set will contain n-1 examples and will have overlap across training sets as you’re using almost the entire training set in each iteration.

The opposite is true with k-fold CV, because there is relatively less overlap between training sets, thus the test error estimates are less correlated, as a result of which the mean test error value won’t have as much variance as LOO CV.

Is cross validation a replacement for having a test data set?

It is tempting to solely use cross validation in lieu of a test train split. This is unadvised because while cross validation may provide some sense of out of sample model performance, it still uses the entirety of the training set to generate coefficients. It is moot to compare models this way, because some models are simply much better at overfitting to the data given to them.

We will be using the Iris dataset as above for classification purposes in this test.

In [53]:
# import cross validation packages
from sklearn.model_selection import KFold, LeaveOneOut

# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

Our cross validation packages will take the number of folds but otherwise are very simple to initialize.

In [54]:
## Building KFolds with 5 Folds
KFold_5 = KFold(5)

## Building KFolds with 10 Folds
KFold_10 = KFold(10)

## Building LOO, leaving one out (naturally) 
LooCV = LeaveOneOut()

To demonstrate the use of the cross validation packages, I will first initialize and display the prediction score of each fold of our 5-fold cross validation.

Scikit-learn has a useful set of packages for iterating through each fold of a cross validation scheme in addition to making predictions on a model fit with cross validation. Plenty of models have built in cross validation functions (ie: LassoCV, LogisticRegressionCV) but this is a model agnostic framework — great for comparing the performance of a lot of models against eachother!

In [71]:
from sklearn.model_selection import cross_val_predict, cross_val_score

# load in Logistic Regression 
from sklearn.linear_model import LogisticRegression

# create an instance of Logistic Regression with default options
classifier = LogisticRegression()

# Display the prediction score of each fold
cross_val_score(classifier, X_train, y_train, cv=KFold_5)
Out[71]:
array([ 0.83333333,  0.95833333,  1.        ,  1.        ,  0.875     ])

As you can see not every fold was predicted equally well in our cross validation scheme.

We will use the cross_val_predict package to see how well our model works out of sample.

In [75]:
# Generate Predictions
predicted = cross_val_predict(classifier, X_test, y_test)

# Measure model performance
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
Out[75]:
0.90000000000000002

Comparing performance of different models

Below I’ll be using our cross validation methodology (the k-fold with 5 folds) to test the out of sample performance of our models. Each of the models will be fit on 4/5th of the training data 5 times and used to make 5 predictions on the out of sample data. The 5 out of sample predictions will be averaged then to make the final prediction.

I’m testing the performane of a random set of classifier models with relatively random hyperparameters. If I wanted to get fancy I’d tune each of those hyperparameters individually, but as a start this is a good technique to see which models are looking promising on a new dataset.

In [81]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
In [84]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

for name, clf in zip(names, classifiers):
    predicted = cross_val_predict(clf, X_test, y_test, cv=KFold_5)
    print name, metrics.accuracy_score(y_test, predicted)
Nearest Neighbors 0.9
Linear SVM 0.8
RBF SVM 0.866666666667
Gaussian Process 0.966666666667
Decision Tree 0.9
Random Forest 0.933333333333
Neural Net 0.866666666667
AdaBoost 0.9
Naive Bayes 0.933333333333
QDA 0.766666666667

Discussion

Cross valiadation and the test/train split are the most basic methods of model validation you can do. Bootstrapping and Bagging are two further techniques whereby models are tested on random samples of the training set, such that the distribution of the training set is varied.

The success of ensemble meta-model techniques like Random Forests is largely because of the build in model validation that is used to create the final model. To learn more about how these models improve the bias-variance tradeoff, see my post about ensemble models.

DIY K-Nearest Neighbors

K-Nearest Neighbors for Classification

Like the K-Means algorithm we covered before this, K-Nearest Neighbors is an example of lazy learning in machine learning. Neither K-means of K-nearest neighbors are generalizable outside of their input data. In order to get usage out of these algorithms, we need to have access to the entire set of training data, even as we try and use our model to make predictions on new data. This can pose some difficulties when you’re dealing with a lot of data…imagine if you have millions of existing data points to test for!

We contrast this to eager learning, which is exhibited by all tree based algorithms, support vector machines, or logistic regressions (among many others). Once we have trained an eager model, we can export it as a generalizable function without bringing all of our training data within.

That K-Nearest neighbors is a lazy model shouldn’t be a surprise. K-NN makes predictions by comparing the new input data to existing points and their labels. Points are predicted based on which points in the training data they are most simialar to. It is extremely powerful and modifications of it (such as approximate nearest neighbors) power a lot of huge apps like Spotify or Better.

In this notebook, we’ll be hand coding an implementation of the K-nearest neighbors algorithm as well as comparing it to some other algorithms with regards to classification power on a test set of data.

Writing the KNN Classification Function First

We are loading the famous Iris dataset to test our KNN’s ability to classify flower samples based on their similarity to other flowers.

In [2]:
#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Importing Data for classification 
iris = datasets.load_iris()
X = iris.data 
y = iris.target

The iris dataset has data in 4 dimensions

In [3]:
## Sampling the data
X[:6]
Out[3]:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4]])

There are 3 iris classifications we are predicting.

In [4]:
np.unique(y)
Out[4]:
array([0, 1, 2])

We’ll be splitting the data into Test and Train splits using the super handy built in sklearn function.

I’ve chosen to split the data into 80% train, 20% test.

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

Below I have printed the length of the train and test datasets.

In [11]:
print X_train.shape, X_test.shape
(120, 4) (30, 4)

I’ve plotted the first 2 features of the training datasets to get a sense of how well clustered the points are. Seems that even with 2 features, our data is very well segmented already. I can see our model doing very well here.

In [13]:
plt.scatter(X_train[:,1], X_train[:,0], c=y_train)
Out[13]:
<matplotlib.collections.PathCollection at 0x1137d57d0>

The below K-NN classification takes the following arguments:

  • X_train: Features to build model on
  • y_train: Labels for training model
  • X_test: Features to predict on

The model outputs the following

  • y_test: labels for testing data

The algorithm

K nearest neighbors is quite beautiful in its simplicity.

  1. Given X_Train, X_Test, Y_Train
  2. Find the k nearest points to each x in X_train
  3. Classify x based on the predominant classification of those k nearest points
  4. Done!

Our implementation is below:

In [23]:
from collections import Counter

def distance(X, X_train):
    return [np.sum((X - x)**2) for x in X_train]
    
def my_knn_classification(X_train, X_test, y_train, n_neighbors):
    #find the closest n_neighbors in X_train to each point in X_test 
    #assign each point in X_test a label which is the most common label 
    #of its nearest n_neighbors
    #
    #return array y_test which is the array of above labels on X_test
    y_test = []
    for x in X_test:
        #find the index of the cloest n_neighbors points to x in X_train
        knn = np.argsort(distance(x, X_train))[:n_neighbors]
        knn_labels = y_train[knn]
        # find the most predominant label of the nearest *k* points
        label_counts = Counter(knn_labels)
        y_test.append(label_counts.most_common(1)[0][0])
    return np.array(y_test)

Can you believe the KNN algorithm is this simple??? Seriously before I understood how it worked I treated it like some magical machine learning black box. Its that simple!

Lets test her out! I’m setting my algorithm to look at the 5 nearest points — Later I’ll discuss a more programtic approach to finding the optimal number of k neighbors.

In [24]:
y_hat = my_knn_classification(X_train, X_test, y_train, 5)
In [25]:
num_correct = 0.0
for i in range(len(y_hat)):
    if y_hat[i] == y_test[i]: num_correct += 1

print "KNN correctly classified: ", num_correct/len(y_hat)*100, "%"
KNN correctly classified:  96.6666666667 %

Wow! Almost perfect classification. Granted, this is a pretty easy dataset.

K-Nearest Neighbors for Regression

For the second half of our implementation, I’ll be writing the K-NN algorithm for regression problems.

I’ll be loading another great dataset from sklearn containing features and prices of homes in Boston. We’ll be calculating the optimal number of nearest neighbors once we get the base implementation working.

K nearest neighbors is quite beautiful in its simplicity, even for regressions.

  1. Given X_Train, X_Test, Y_Train
  2. Find the k nearest points to each x in X_train
  3. Predict what Y_test will be for each x based on the average value of Y_train is in its k nearest points
  4. Done!
In [7]:
from sklearn import datasets
import matplotlib.pyplot as plt
%matplotlib inline

# import our home price data
boston = datasets.load_boston()
X = boston.data  # we only take the first two features.
y = boston.target

# split into test and train
from sklearn.model_selection import train_test_split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

The cool thing about the built in sklearn datasets is that they have all these incredible descriptions built in as well. Perfect for experiementing around with new packages!

In [8]:
print boston.DESCR
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

The Algorithm

The below K-NN regression takes the following arguments:

  • X_train: Features to build model on
  • y_train: Labels for training model
  • X_test: Features to predict on

The model outputs the following

  • y_test: labels for testing data
In [9]:
from collections import Counter

def distance(X, X_train):
    return [np.sum((X - x)**2) for x in X_train]
    
def my_knn_regression(X_train, X_test, y_train, n_neighbors):
    #find the closest n_neighbors in X_train to each point in X_test 
    #assign each point in X_test a label which is the most common label 
    #of its nearest n_neighbors
    #return array y_test which is the array of above labels on X_test
    y_test = []
    for x in X_test:
        #find the index of the cloest n_neighbors points to x in X_train
        knn = np.argsort(distance(x, X_train))[:n_neighbors]
        knn_value = np.mean(y_train[knn])
        y_test.append(knn_value)
    return np.array(y_test)

Lets test her out with 5 nearest neighbors.

In [10]:
y_hat = my_knn_regression(X_train, X_test, y_train, 5)

To judge the model performance, I’ll be using a measure thats common in my line of work called MAPE. Its the Mean Absolute Percent Error of our predictions with their actual values.

I would normally use the metrics package for measuring performance, but this is a manually implementation of the same thing.

In [11]:
percent_error = []
for i in range(len(y_hat)):
    PE = (y_hat[i]-y_test[i])/y_test[i]
    percent_error.append(PE)

print "average error is: ", np.mean(np.abs(percent_error))
average error is:  0.214663492394

Not bad! I think…

Before I compare K-NN regression against other people machine learning algorithms, I’ll first run a hyper-parameter search to find the number of k nearest neighbors that reduces our PE the most. Hyper-parameters are options we pass to machine learning algorithms to configure how they run. In a tree-based algorithm, a hyperparameter might be maximum tree depth. For a penalized linear regression, it might be the penalty assigned to additional variables.

The below code will test our k-nn regression with a number of neighbors between 1 and 20 to find the one which minimizes our loss the most.

If I were doing this in a production environment, I’d further split up my training dataset so that I’m finding the optimal set of neighbors on yet a different dataset than the one I’m building my model off of.

In [15]:
def meta_knn_regression(X_train, X_test, y_train, n_max = 20):
    best_PE = 100000
    for i in range(n_max):
        y_hat = my_knn_regression(X_train, X_test, y_train, i)
        percent_error = []
        for j in range(len(y_hat)):
            PE = (y_hat[j]-y_test[j])/y_test[j]
            percent_error.append(PE)
        temp_PE = np.mean(np.abs(percent_error))
        if temp_PE < best_PE:
            best_PE = temp_PE
            best_n = i
            best_y_hat = y_hat
    return best_y_hat, best_PE, best_n
In [16]:
y_hat, PE, n_neighbors = meta_knn_regression(X_train, X_test, y_train, n_max = 20)

print "KNN produced an optimal mean error of %s using %s nearest neighbors " % (PE, n_neighbors)
KNN produced an optimal mean error of 0.2050085244 using 8 nearest neighbors 

Comparing KNN Regression against other machine learning algorithms.

Now, using our KNN with optimized k hyperparameter, we’ll compare our model against some other popular machine learning algorithms.

This is the first time I’m bringing these algorithms into the picture, and my next posts will go over these models in greater depth.

Linear Regression

In [34]:
# import linear regression module from scikit-learn
from sklearn.linear_model import LinearRegression

# create an instance of a linear regressor from scikit learn
regressor = LinearRegression()

# fit our chosen regressor to the dataset
regressor.fit(X_train, y_train)  

y_hat = regressor.predict(X_test)
In [35]:
percent_error = []
for i in range(len(y_hat)):
    PE = (y_hat[i]-y_test[i])/y_test[i]
    percent_error.append(PE)

print "average error is: ", np.mean(np.abs(percent_error))
average error is:  0.183501475863

Elastic net

In [36]:
from sklearn.linear_model import ElasticNet

# create an instance of a elasticnet regressor from scikit learn
regressor = ElasticNet(alpha=1)

# fit our chosen regressor to the dataset
regressor.fit(X_train, y_train)  

y_hat = regressor.predict(X_test)
In [37]:
percent_error = []
for i in range(len(y_hat)):
    PE = (y_hat[i]-y_test[i])/y_test[i]
    percent_error.append(PE)

print "average error is: ", np.mean(np.abs(percent_error))
average error is:  0.194042726016

Random Forest

In [40]:
from sklearn.ensemble import RandomForestRegressor

# create an instance of the Random Forest Regressor from scikit learn
regressor = RandomForestRegressor()

# fit our chosen regressor to the dataset
regressor.fit(X_train, y_train)  

y_hat = regressor.predict(X_test)
In [41]:
percent_error = []
for i in range(len(y_hat)):
    PE = (y_hat[i]-y_test[i])/y_test[i]
    percent_error.append(PE)

print "average error is: ", np.mean(np.abs(percent_error))
average error is:  0.134505708624

Gradient Boosted Trees

In [42]:
from sklearn.ensemble import GradientBoostingRegressor

# create an instance of the Random Forest Regressor from scikit learn
regressor = GradientBoostingRegressor(loss='lad')

# fit our chosen regressor to the dataset
regressor.fit(X_train, y_train)  

y_hat = regressor.predict(X_test)
In [43]:
percent_error = []
for i in range(len(y_hat)):
    PE = (y_hat[i]-y_test[i])/y_test[i]
    percent_error.append(PE)

print "average error is: ", np.mean(np.abs(percent_error))
average error is:  0.133206918853

Summary

Hmm – seems like our KNN algorithm didn’t fare so hot compared to some other models. I’ve done a few Kaggle competititions where K-NN features strongly in my final ensemble of models, the poor performance here is a little surprising to me. My guess is that 8, while the optimal # of neighbors in our range of 20 tested, is actually pretty low. I don’t think I’ve done a competition where I have less than 100 neighbors in my final model, but then again this is a pretty small dataset to most of what is offered on Kaggle.

DIY K-Means Implementation

The K-Means clustering algorithm is a popular clustering algorithm that separates our data points into k clusters of affinity. It’s not a predictive algorithm, meaning, it’s not generally used to predict behvarior of new data. It’s more often used to find similarities between clusters of existing data for descriptive purposes. I had a chance to ues k-means in my university academic work on a set of business customer characteristic data. We were looking for segments of customers to develop individual marketing plans to. We knew we wanted to target between 4 and 6 different clusters, and the algorithm was a useful supplement to our theoretical justification of our marketing splits.

You’ll come out of this notebook having handwritten the k-means clustering algorithm. You’ll also see how we can deal with the problem of k-means having a non-convex loss function.

For our first step, I’m loading the basic packages used for data manipulation and loading.

In [2]:
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
from IPython.display import Image
from IPython.core.display import HTML

# importing a printing utility
sys.path.append('../utils')            
import dim_reduction_plotting as plotting_util

In the first cell, we load up one of the datasets

In [3]:
# path to data
basepath = '../datasets/'
dataname = '5cluster_2d_data.csv'
X = np.asarray(pd.read_csv(basepath + dataname,header = None))

First, we’ll visualize the data.

In [4]:
%matplotlib inline

# plot dataset
fig = plt.figure(figsize = (7,7))
ax = fig.add_subplot(111, aspect = 'equal')
ax.scatter(X[:,0],X[:,1],c = 'k',s = 60,edgecolor = 'w')
ax.axis('off');

We can see that there are a couple distinct clusters in here — Perfect input data for our k-means algorithm. Our data is 2 dimensional, which lets us spot check to see that the optimal set of clusters should be 5. Of course, most data you’ll work with won’t have this sort of obvious choice of clusters. Later in this notebook I’ll cover how we can think about selecting the right number of clusters. For now, we’ll write our algorithm agnostic to number of clusters.

Implementation

Our algorithm will maintain the same nomenclature as scikit learn’s k-means functionality. It will take:

X – your dataset

num_clusters – a number of desired clusters

and should return two things:

centroids – the returned centroids. These are center points of each cluster.

assignments – the set of labels assigning each point to a cluster.

The Algorithm

The k-means algorithm can be broken down into 2 individually optimizable functions by virtue of Lloyd’s algorithm. The process goes as follows:

  1. Place k points somewhere randomly in the input domain to serve as initial centroids
  2. Assign each n points to its nearest centroid. We are minimizing the distance between each point to their centroid.
  3. Move each centroid to the center of its assigned points.
  4. Repeat steps 2 and 3

Easy!

We’ll first code up our loss function. We’ll be constantly minimizing the distance between points and centroids, so we’ll be creating a function to wrap it up. For each point and set of centroids passed to it it, it’ll return the distance from that point and each of those centroids.

In [5]:
## Distance function measures the distance from each point to to each centroid
def distance(x, centroids):
    return [np.sum((x - c)**2) for c in centroids]

Below I code up the k-means algorithm with a default 100 iterations to find the minimum point. See further below for a run through of the code.

In [6]:
## Primary K-Means Function takes the number of iterations to find the
## allocation of points that gives the lowest error 
def my_kmeans(X,num_clusters, max_iters=100):
    # Assigns *k* random points to be initial centroids
    centroids = X[np.random.choice(range(len(X)), num_clusters),:]
    # build an array of assignments
    assignments = range(len(X))
    # run max_iters amount of iterations to find the loss minimum
    for i in range(max_iters):
        # assign each point to its nearest centroid
        assignments = [np.argmin(distance(x, centroids)) for x in X]
        # Move each centroid to the center of all the points assigned to it.
        for j in np.unique(assignments):
            temp_X = X[assignments==j]
            centroids[j] = [temp_X[:,0].mean(), temp_X[:,1].mean()]  
    assignments = np.asarray(assignments)
    return centroids,assignments
In [15]:
# run the k-means algorithm below
centroids,assignments = my_kmeans(X = X,num_clusters = 5)

With our centroids and assignment labels in hand, lets plot the original dataset colored by cluster membership as well as the cluster centroids (as stars of the same color as its cluster).

In [16]:
# print out dataset again, only with each cluster colored individually, 
# and plot centroids
plotting_util.cluster_plot_2d(X,centroids,assignments)

Comments

Uh oh! Looks like our model did not produce an optimal result.

Plenty of runs do not end with the optimal solution. Plenty of runs end with a very obvious case of “this could be divied up better”. To solve this, I’ll be running a ‘super loop’ on the data, choosing the run that has the lowest total loss on each point

Why does this happen?

The K-means loss function is not naturally convex. The process of assigning N points to K clusters can fall into an exponential number of locally optimal assignments of points. Consider the loss function graphed to the right below. In the process of assigning and re-assigning points to minimize loss, the k-means algorithm will frequently fall down the wrong ‘path’ for finding the global loss minimum.

In [16]:
Image(url= "https://image.slidesharecdn.com/mlconf2015-sf-anandkumar-1511140021\
55-lva1-app6892/95/animashree-anandkumar-electrical-engineering-and-cs-dept-uc-\
irvine-at-mlconf-sf-111315-10-638.jpg?cb=1447460604", width=700, height=500)
Out[16]:

We’ll be using most of the same code here. The biggest difference is that we will be wrapping our k-means function in a super-function that will run the regular k-means code N times. Our super function will then choose the run that produced the lowest total squared distance from the centroids to their constituent points.

In [62]:
## Distance function measures the distance from each point to to each centroid
def distance(x, centroids):
    return [np.sum((x - c)**2) for c in centroids]
In [17]:
def super_kmeans(X,num_clusters, super_loops = 10, max_iters=100):
    # Create a variable to measure the lowest error across loops.  
    #Set to something really high initially
    best_error = 10000000
    for k in range(super_loops):
        # Assigns *k* random points to be initial centroids
        centroids = X[np.random.choice(range(len(X)), num_clusters),:]
        # build an array of assignments
        assignments = range(len(X))
        # run max_iters amount of iterations to find the loss minimum
        for i in range(max_iters):
            # assign each point to its nearest centroid
            assignments = [np.argmin(distance(x, centroids)) for x in X]
            # Move each centroid to the center of all the points assigned to it.
            for j in np.unique(assignments):
                temp_X = X[assignments==j]
                centroids[j] = [temp_X[:,0].mean(), temp_X[:,1].mean()]  
        # Measure the error
        error = 0
        for k in range(len(assignments)):
            error += np.mean(distance(X[k], [assignments[k]]))
        # if this run has better error than the previous best, overwrite
        # print error # debug
        if error < best_error:
            final_assignments = np.asarray(assignments)
            final_centroids = centroids
            best_error = error
    return final_centroids,final_assignments
In [21]:
centroids, assignments = super_kmeans(X = X,num_clusters = 5)
In [22]:
# print out dataset again, only with each cluster colored individually, 
# and plot centroids
plotting_util.cluster_plot_2d(X,centroids,assignments)

Hmm….Doesn’t seem like its guaranteed to give the intuitively best answer any more than our regular solution. Perhaps taking the sum of our errors isn’t the greatest method of measuing our models overall fit?

Open to suggestions on how to improve this!

Loss Functions and Gradient Descent

Loss functions are the foundation and measure by which machine learning algorithms are optimized. The value of the loss function is a measure of how good the model ‘coefficients’ (we’ll call these paramaters, or betas from now on) are at fitting the underlying data.

An intuitive understand of the loss function will depend on the sort of problem we’re trying to optimize. If we’re trying to run a regression to predict housing prices, the loss function will be based off the error between the true and predicted home prices. If we’re trying to classify credit defaults, the loss function will be a measure of how accurate our predictions are at identifying the bad loans. Intuitive!

A machine learning algorithm will be finished optimizing when it finds a set of parameters that minimize the loss. When our machine learning algorithm has a convex loss function, it has a single optimal set of parameters that will minimize its loss function. Convex loss functions are like a basin. Given enough iterations, our algorithm will fall into the set of parameters that result in the lowest possible loss.

Consider the figure on the left below. If you were looking at the convex plane from the top-down, the ‘X’ and ‘Y’ would be two possible model parameters. With enough iterations, the parameters will minimize model loss. This won’t mean we found the ‘best’ model — I’ll cover overfitting in later posts and how we can control for it.

In [ ]:
from IPython.display import Image
from IPython.core.display import HTML
In [4]:
Image(url= "https://image.slidesharecdn.com/mlconf2015-sf-anandkumar\
-151114002155-lva1-app6892/95/animashree-anandkumar-electrical-engine\
ering-and-cs-dept-uc-irvine-at-mlconf-sf-111315-10-638.jpg?cb=14474606\
04", width=700, height=500)
Out[4]:

The above figure to the right shows the loss function of a non-convex algorithm. You can see there are multiple local optimums that our algorithm can fall into. In the following posts, we’ll cover which machine learning algorithms are convex or non-convex, and we’ll code up some possible resolutions to these scenarios.

Gradient Descent

Our machine learning algorithms knows which direction to travel along the convex plane through a process called gradient descent. To find a local minimimum of our loss funtion using gradient descent, the algorithm takes steps proportional to the negative of the derivative of the function at the current point.

In other words, we use the slope of our loss function at a given point to figure out 2 things:

  • Which direction we should adjust our parameters
  • How large our parameter adjustment should be

If we find outselves on a steep portion of the loss function plane, we take large steps. When the slope of our loss function is shallow, we move slowly. Eventually we’ll reach a point where our slope is practically zero and our algorithm will stop.

In practice many of these algorithms have a step size cut-off, whereby if the step is calculcated to be below a certain size it’ll stop traversing. The consequence of this will be that these algorithms don’t necessarily find the absolute lowest point, but something close enough. This isn’t actually a bad thing– In fact, when we cover overfitting in later chapters I’ll show how this is actually desired.

In [5]:
Image(url= "https://qph.ec.quoracdn.net/main-qimg-b7a3a254830ac3748\
18cdce3fa5a7f17", width=700, height=500)
Out[5]:

For our coded example, we’ll be optimizing a one dimensional loss function. We’ll be finding the parameter (w) which minimizes the loss function. We’ll traverse along the cost function by

But first, we’ll be loading our basic packages.

In [13]:
import numpy as np
#from math import sin
import pylab 
#import matplotlib.pyplot as plt
#%matplotlib inline

To get a little more visibility into our algorithm, we’ve hard-coded the loss function and its derivative as inputs into our gradient descent algorithm.

Like the above figure, our loss function is defined as w^2. We’ll be expecting our algorithm to find loss minimized around 0 or close to it.

In [17]:
def loss_func(w):
    return np.power(w, 2)

The derivative of w^2 is of course, 2*w. We’ll be using the derivative to find which direction we need to traverse along our loss function

In [18]:
def gradient_func(w):
    return 2*w

And finally, we have the code for our gradient descent function. Step by step what it does is below:

  1. Initialize a random point to begin our dscent
  2. Initialize our array of guesses and derivatives (for plotting our descent)
  3. For our random point, record the loss and gradient
  4. Move in the opposite direction of our gradient by a distance proportional to the gradient and our programmed step size
  5. As long as our movement is greater than our minimum movement, repeat steps 3-5 with our new point.
In [19]:
def gradient_descent(loss_func, gradient_func, step, resolution, guess):
    old_guess = 99999
    guesses = []
    derivatives = []
    while abs(old_guess - guess) > resolution:
        error = loss_func(guess)
        derivative = gradient_func(guess)
        move = derivative * step
        old_guess = guess
        guess = guess - move
        guesses.append(guess)
        derivatives.append(derivative)
    pylab.plot(guesses, label='Values')
    pylab.plot(derivatives, label='Derivatives')
    pylab.legend(loc='upper right')
    pylab.show()
    return guess, error

Below we run our code

In [20]:
guess, error = gradient_descent(loss_func, gradient_func, step=0.1, 
                                resolution=0.001, guess=6)

print "Best approximation of local minimum is", guess
print "Error is", error
Best approximation of local minimum is 0.00380295180068
Error is 2.25975662474e-05

As I stated previously, our gradient descent process didn’t find the lowest point in our loss function. With a greater minimum movement input, we could reach closer and closer to zero.

Summary

I hope after this little tutorial you’re more comfortable with the idea of loss functions and the process to minimizes their loss. I hear its not entirely common to start learning about machine learning with this concept, but I’m dumbfounded to think theres a better way.