Hey, thanks for stopping by! Want some tea..? Come in, come in. I’ll get the kettle going.
Let me show you around. Over to the right and below I have some machine learning tutorials written in Python on a Jupyter notebook.  If you’re new to ML, or just desire continuity in writing, I recommend browsing in the following order:

I have some write ups about other projects as well, including my experiments with Fiverr.com, some Ruby on Rails apps I’ve built, and other interesting

If you see anything blatantly incorrect or something I should improve, please do drop a line! I’m a new student to the game and looking for as much help as possible.

New Rails Apps: Flicky and Text-2-Spotify

I’ve spent the last couple weeks learning how to build web apps in Rails. I’m happy to say I finally pushed and published 2 of my favorites to Heroku today!

The first application is called ‘Text-2-Spotify’, and it allows users to text songs to add to their Spotify playlist. I envisioned it being an easy way to give a party the ability to make a collaborative playlist without relying on the Spotify app. You can find the source here and the application is hosted here.

The second application is called ‘Flicky’, and it helps indecisive groups of movie watchers find group consensus on what to watch by individually voting on their favorite films. You can find the hosted app here (Takes about 5 seconds to load from sleep) and the source here.

Rails has been absolutely fantastic to develop on. The community is huge, integrating with API’s is incredible simple, and particularly so when theres an API wrapper written for Ruby. I’d highly recommend anyone interested in learning Rails to consult the fantastic Rails 5 Tutorial by Michael Hartl.

Code Challenge: Girl / Boy Proportion Problem

Code Challenge: Girl / Boy Proportion Problem

This weekend my roommate shared a little puzzle with me, which goes as follows:

In a country in which people only want boys, every family continues to have children until they have a boy. If they have a girl, they have another child. If they have a boy, they stop. What is the proportion of boys to girls in the country?

I think its tempting to initially think that there will be more boys, but after you consider the number of females that could be born waiting for a boy, you might think the opposite.

He was trying to solve it mathematically using the sum of infinite series, but we had a little race to see who could do the Monte-Carlo simulation first. I was really happy with my 6 line solution!

In [52]:
# Simulate a birth.  0 is male, 1 is female
def birth():
    return np.random.binomial(1, .5)
In [53]:
# Input: number of families to simulate
# Output: proportion of males to females born
def new_society(int):
    female = 0
    for i in range(int):
        while (birth()==0): 
            female = female + 1
    return int/(female+int)
In [54]:
# Simulate 100,000 families trying for a boy
new_society(100000)
Out[54]:
0.49985004498650404

For fun, I wrote another function to measure what the largest family would be in this society.

In [55]:
def new_society_v2(int):
    female = 0
    longest_fam = 0
    temp_size = 0
    for i in range(int):
        while (birth()==0): 
            female = female + 1
            temp_size = temp_size + 1
        if(temp_size > longest_fam):
            longest_fam = temp_size
        temp_size = 0
    return int/(female+int), longest_fam
In [57]:
new_society_v2(100000)
Out[57]:
(0.5025226637721362, 19)

19 kids! Thats a lot of females just to get your male heir.

Experiments in Fiverr

Fiverr is a marketplace for freelance entrepreneurs, which as it’s namesake implies, has gigs for as low as five dollars on its site. This low cost is possible through the commodification and global reach of its freelancers, who more often than not come from countries with a drastically lower cost of living. In San Francisco, Fiverr has heavily pushed its branding as the “Freelance Services Marketplace for The Lean Entrepreneur”. Their late-capitalism sounding adverts dot the city’s metro reminding you that the dream is still alive:

I am still haunted by this man’s face

Say what you will about the labor concerns of contract labor, I think the globalization of traditionally ‘white collar’ services like programming, graphic design, and media production are giving smart motivated people around the world a chance to compete with their (relatively overpaid) Western counterparts. And compete they do: Fiverr is an Alexa-ranked top 130 most visited website, and there are over 3 million gigs on the site at any time. I’ve used Fiverr multiple times to both serious and comedic ends, and have documented a few of my favorite gigs so far below.

Trigger-Me-Timbers — The Chrome Extension for Trigger Words

As a junior in college I participated in a hackathon with 3 friends. We had the idea to build a Chrome extension that would generate trigger warnings when a given trigger word was found on a web page. We figured it was a simple task— build an applet that stores words, and compares them to every word on a webpage. If there’s a match, hide the word and generate a popup. Easy enough. We were all capable programmers and this seemed like a walk in the park. At the end of our 24 hour Red Bull, Chinese food, and coding binge, we were able to create a drop down text box with some broken buttons. Thats it. Four talented Princeton coders and we could barely figure out how to get an extension to run. It was disheartening.

I was committed to finishing what we started, and I had recently discovered Fiverr. I searched ‘Google Extension’ and found almost a hundred users who professed experience in extension development. I reached out to the first well reviewed user, a Pakistani Javascript developer with 6+ years of experience by the name of ‘iAmCoder’. A brief description of my goal and $15 later, I waited patiently for his response.

What did I get? $15 of damn good code. Check it out for yourself: https://github.com/mikekosk/Trigger-me-timbers. Its about 300 lines of modularized and readable code that ran immediately and without a hitch. With 6+ years of Javascript and front end development under his belt, I can’t imagine iAmCoder being paid anything less than $100,000 per annum in the bay area. I haven’t done anything with this proof of concept, but the framework is there and I’m sure with a couples days work on my end (or <$50 in Fiverr credits) I’d have a full fledged app.

Meme Production

In 2017, YouTube personality PewDiePie found himself in hot water after by pushing the limits of what people on Fiverr will do. Felix “PewDiePie” Kjellberg uploaded a video showing two young men dancing and laughing with a banner that read “death to all Jews”. Felix sought to show his 53 million subscribers just how crazy the modern world is by demonstrating that some people on Fiverr would do anything for 5 dollars.

A shocked PewDiePie

I didn’t want to repeat the distastefulness of Felix’s experiment, but the temptation of using Fiverr’s experts for meme production was too tempting. Especially after you read a bio like this:

This man is clearly an expert

I tried to come up with a very esoteric meme idea to see just what I could fish out of our expert. I sent him the request below:

I would like for you to create a My Little Pony meme for a Vietnam Veteran's 
"Brony" Community.

After some initial concerns and confusion about Ponies(tm) and Vietnam, our expert memeologist was ready to flex his muscles. I was initially concerned that the lack of cultural familiarity would hinder our expert’s humor from reading with a US audience. I think he did pretty well though:

Brutal.

Provisioning and configuring an AWS instance for Data Science

My most ambitious Fiverr gig is one I’ve been working on for the last couple weeks in conjunction with an engineer in the UK. An independent contractor and working dev ops engineer from Italy, Alberto Nugnes, has exceeded my expectations for services on Fiverr. I came to him because I was getting infuriated with my current cloud computing setup. Anytime I wanted additional firepower for a machine learning problem, I had to go through almost an hour of setup to provision myself a server with all my necessary data science packages on AWS. Even tools like DataScienceToolbox or Docker’s Data Science Stacks were insufficient to get me up and running as seamlessly as I desired.

Alberto was able to build me a 2-step pipeline that provisions, configures, and connects me via SFTP to a computing environment much faster than what my laptop can support. Using modern dev ops tools like Terraform, Ansible, and Docker, I am able to get my environment ready in a couple minutes. When you’re trying to process gigabytes of data, you need some serious firepower. Not having to wait for that makes it easy to squeeze in work when you’ve got a few minutes to spare or want to kick something off as you’re nodding off to sleep.

I continue to work with Alberto, and each week we both come back with our individual learnings. He didn’t have experience working with data scientists before, and its my first time diving into the land of dev ops. Its a wonderful domain full of endless possibilities. Alberto’s work cost me more than $5 at this point, but in the end its a bargain for having a perfectly tailored system.

Takeaways

Next time you have an idea for an app, a dank meme, or a serious project, I highly recommend taking a look at Fiverr for some help. Rather than letting your idea collect dust on the back burner, have someone on Fiverr build you a proof of concept. You’ll feel good about the initial progress, and its much easier to invest in a project when there’s something already there. It’s inexpensive and can be really fun to hone your managing skills this way.

Numerai Competition Attempt #2

In this notebook I experiment with adding additional features to the Numerai dataset using t-distributed stochastic neighbor embedding (t-SNE). It is a machine learning dimension reduction tool that is often used for visualizing data in 2 or 3 dimensions, much like PCA. Unlike PCA, it can also work with non-linear / non-parametric data. A number of models have benefited from the use of additional t-SNE features prior to running the main classification algorithm.

T-SNE works by finding patterns within observable clusters in the dataset. It then maps these clusters to a probability distribution in a lower dimension space. The underlying data is no longer visible, but local similarities between points are preserved.

I suggest the following video if you want a theoretical understanding of t-SNE from the creator: https://www.youtube.com/watch?v=RJVL80Gg3lA&t

And a practical guide to tuning its parameters: https://distill.pub/2016/misread-tsne/

I’m using an adapted version of Jim Flemings t-SNE pipeline found here: https://github.com/jimfleming/numerai/blob/master/prep_data.py

Why Apply t-SNE to Numerai’s dataset?

If you’ve worked with Numerai data before, you’ll notice that it does not contain any feature distinctions, and that each feature has been normalized between 0 and 1. Richard Craib, the founder of Numerai, explains the reason and method of this data encryption here: https://medium.com/numerai/encrypted-data-for-efficient-markets-fffbe9743ba8. As such, adding features via dimensionality reduction is the only sensible way I can think of for squeezing any additional features into the training dataset.

Note: This notebook is incomplete and outputs have not been uploaded. It also takes like 10 hours to run this notebook and I don’t recommend doing so unless you have some sort of cloud computation resources! My next post will share the data infrastucture I’ve built out on AWS to make this easier

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# Lets see whats in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))
example_model.py
example_model.r
example_predictions.csv
numerai_tournament_data.csv
numerai_training_data.csv
test_data.csv
train_data.csv
valid_data.csv

Data Prep

Jim Fleming does something pretty cool in this pipeline that I’ve recently incorporated into all my work. Rather than splitting the training data arbitrarily into a training / validation dataset, he chooses his validation dataset to be those points that are closest to the test results. This is done by first running a 5-fold cross validated random forest classifier, and then sorting the predictions by the model’s relative confidence in each prediction. The points that the model is most confident in are those, we hypothesize, the closest to representing the underlying data pattern.

In [ ]:
import time
import random

from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import StratifiedKFold

def prep():
    df_train = pd.read_csv('../numerai/T62/numerai_training_data.csv')
    df_test = pd.read_csv('../numerai/T62/numerai_tournament_data.csv')

    feature_cols = list(df_train.columns[3:-1])
    target_col = df_train.columns[-1]
    test_col = 'is_test'
    id_col = 't_id'

    df_train['is_test'] = 0
    df_test['is_test'] = 1

    df_data = pd.concat([df_train, df_test])
    df_data = df_data.reindex_axis(feature_cols + [test_col, target_col], axis='columns')

    X_split = df_data[feature_cols]
    y_split = df_data[test_col]

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=67)
    predictions = np.zeros(y_split.shape)

    kfold = StratifiedKFold(y_split, n_folds=5, shuffle=True, random_state=67)
    
    for i, (train_i, test_i) in enumerate(kfold):
        print("Fold #{}".format(i + 1))

        X_split_train = X_split.iloc[train_i]
        y_split_train = y_split.iloc[train_i]

        X_split_test = X_split.iloc[test_i]
        y_split_test = y_split.iloc[test_i]

        rf.fit(X_split_train, y_split_train)

        p = rf.predict_proba(X_split_test)[:,1]
        auc = roc_auc_score(y_split_test, p)
        print("AUC: {:.2f}".format(auc))

        predictions[test_i] = p

    # sort predictions by value
    i = predictions.argsort()

    # sort data by prediction confidence
    df_sorted = df_data.iloc[i]

    # select only training data
    df_train_sorted = df_sorted.loc[df_sorted.is_test == 0]

    # drop unnecessary columns
    df_train_sorted = df_train_sorted.drop([test_col], axis='columns')

    # verify training data
    assert(df_train_sorted[target_col].sum() == df_train[target_col].sum())

    # grab first N rows as train and last N rows as validation (those closest to test)
    validation_size = int(len(df_train_sorted) * 0.1)
    df_train = df_train_sorted.iloc[:-validation_size]
    df_valid = df_train_sorted.iloc[-validation_size:]
    print('Creating dataset with validation size: {}'.format(validation_size))

    df_train.to_csv('../numerai/T62/train_data.csv', index_label=False)
    df_valid.to_csv('../numerai/T62/valid_data.csv', index_label=False)
    df_test.to_csv('../numerai/T62/test_data.csv', index_label=False)
    print('Done.')
In [ ]:
prep()

Lets ensure we have our data in the right folder

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

t-SNE Feature Encoding

The below t-SNE feature encoding is a modified version of Jim Fleming’s code. The modular function takes as an input a perplexity value, which is in a sense, similar to the k in a nearet neighbors algorithm. It functions as a trade-off betwen the local and global patterns in the data. With a smaller perplexity, local variations will dominate the t-SNE plot. A larger perplexity will cause larger affinity clusters to be generated.

Getting the most from t-SNE may mean incorporating multiple plots with different perplexities into the final training dataset. Most t-SNE encodings I’ve seen have perplexities between 5 and 100, with ranges between 10 and 50 being the most common. I’ll be including 5 different encodings at equal intervals to my final dataset.

In [25]:
import time
import random

from tsne import bh_sne
from sklearn.preprocessing import PolynomialFeatures

def save_tsne(perplexity, dimensions=2, polynomial=False):
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    X_all = np.concatenate([X_train, X_valid, X_test], axis=0)
    
    if polynomial:
        poly = PolynomialFeatures(degree=2)
        X_all = poly.fit_transform(X_all)

    print('Running TSNE (perplexity: {}, dimensions: {}, polynomial: {})...'.format(perplexity, dimensions, polynomial))
    start_time = time.time()
    tsne_all = bh_sne(X_all, d=dimensions, perplexity=float(perplexity))
    print('TSNE: {}s'.format(time.time() - start_time))

    tsne_train = tsne_all[:X_train.shape[0]]
    assert(len(tsne_train) == len(X_train))

    tsne_valid = tsne_all[X_train.shape[0]:X_train.shape[0]+X_valid.shape[0]]
    assert(len(tsne_valid) == len(X_valid))

    tsne_test = tsne_all[X_train.shape[0]+X_valid.shape[0]:X_train.shape[0]+X_valid.shape[0]+X_test.shape[0]]
    assert(len(tsne_test) == len(X_test))

    if polynomial:
        save_path = '../numerai/T62/tsne_{}d_{}p_poly.npz'.format(dimensions, perplexity)
    else:
        save_path = '../numerai/T62/tsne_{}d_{}p.npz'.format(dimensions, perplexity)

    np.savez(save_path, \
        train=tsne_train, \
        valid=tsne_valid, \
        test=tsne_test)
    print('Saved: {}'.format(save_path))

Generate t-SNE. Warning: This will take a long time.

In [ ]:
for perplexity in [10, 20, 30, 40, 50]:
    save_tsne(perplexity, polynomial=True)

Lets check our output again

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

Fitting our Model

Having added the new features, I’ll run the final prediction model using Jim’s code again. Our model uses a logistic regression and outputs a classification probability.

In [ ]:
def main():
    # load data
    df_train = pd.read_csv('data/train_data.csv')
    df_valid = pd.read_csv('data/valid_data.csv')
    df_test = pd.read_csv('data/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    tsne_data_10p = np.load('data/tsne_2d_10p_poly.npz')
    tsne_data_20p = np.load('data/tsne_2d_20p_poly.npz')
    tsne_data_30p = np.load('data/tsne_2d_30p_poly.npz')
    tsne_data_40p = np.load('data/tsne_2d_40p_poly.npz')
    tsne_data_50p = np.load('data/tsne_2d_50p_poly.npz')

    # concat features
    X_train_concat = {
        'X': X_train,
        'tsne_10p': tsne_data_10p['train'],
        'tsne_20p': tsne_data_20p['train'],
        'tsne_30p': tsne_data_30p['train'],
        'tsne_40p': tsne_data_40p['train'],
        'tsne_50p': tsne_data_50p['train'],
    }
    X_valid_concat = {
        'X': X_valid,
        'tsne_10p': tsne_data_10p['valid'],
        'tsne_20p': tsne_data_20p['valid'],
        'tsne_30p': tsne_data_30p['valid'],
        'tsne_40p': tsne_data_40p['valid'],
        'tsne_50p': tsne_data_50p['valid'],
    }
    X_test_concat = {
        'X': X_test,
        'tsne_10p': tsne_data_10p['test'],
        'tsne_20p': tsne_data_20p['test'],
        'tsne_30p': tsne_data_30p['test'],
        'tsne_40p': tsne_data_40p['test'],
        'tsne_50p': tsne_data_50p['test'],
    }

    # build pipeline
    classifier = Pipeline(steps=[
        ('features', FeatureUnion(transformer_list=[
            ('tsne_10p', ItemSelector('tsne_10p')),
            ('tsne_20p', ItemSelector('tsne_20p')),
            ('tsne_30p', ItemSelector('tsne_30p')),
            ('tsne_40p', ItemSelector('tsne_40p')),
            ('tsne_50p', ItemSelector('tsne_50p')),
            ('X', ItemSelector('X')),
        ])),
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', MinMaxScaler()),
        ('lr', LogisticRegression(penalty='l2', C=1e-2, n_jobs=-1)),
    ])

    print('Fitting...')
    start_time = time.time()
    classifier.fit(X_train_concat, y_train)
    print('Fit: {}s'.format(time.time() - start_time))

    p_valid = classifier.predict_proba(X_valid_concat)
    loss = log_loss(y_valid, p_valid)
    print('Loss: {}'.format(loss))

    p_test = classifier.predict_proba(X_test_concat)
    df_pred = pd.DataFrame({
        't_id': df_test['t_id'],
        'probability': p_test[:,1]
    })
    csv_path = 'predictions/predictions_{}_{}.csv'.format(int(time.time()), loss)
    df_pred.to_csv(csv_path, columns=('t_id', 'probability'), index=None)
    print('Saved: {}'.format(csv_path))

Numerai Competition Attempt #1

In this notebook I do some profiling on Numer.ai’s tournament 61 dataset. For those not acquainted with numer.ai, it is a crowd sourced hedge fund where data scientists compete to create the best predictive models. Numerai then uses these predictions to create a meta-model that, as the theory goes, is better than all the disparate models used to create it.

Numerai runs tournaments each week and the format, as far as I know, is the same each week. This is my first step at creating a mostly automatic pipeline for entering into these tournaments. Given that numerai allows for 25 predictions a day, there is also a lot of room to experiement with ‘magic numbers’ and other sorts sorts of hacks to game the competition. That will obviously come later, so the purpose of this notebook is just to get an understanding of the data, the prediction format, and to test out a few models.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [10]:
# Lets see whats in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T61/"]).decode("utf8"))
example_model.py
example_model.r
example_predictions.csv
numerai_tournament_data.csv
numerai_training_data.csv

In [160]:
# Load Training Data
training_data = pd.read_csv('../numerai/T61/numerai_training_data.csv', header=0)
prediction_data = pd.read_csv('../numerai/T61/numerai_tournament_data.csv', header=0)

# Transform the loaded CSV data into numpy arrays
features = [f for f in list(training_data) if "feature" in f]
X = training_data[features]
Y = training_data["target"]
x_prediction = prediction_data[features]
ids = prediction_data["id"]
In [37]:
print X.shape, Y.shape
(108405, 21) (108405,) (45625, 25)

First we’ll get a sense of what type each column is. I’m presuming that we’ll be working primarily with numerical data.

In [19]:
print(X.dtypes)
feature1     float64
feature2     float64
feature3     float64
feature4     float64
feature5     float64
feature6     float64
feature7     float64
feature8     float64
feature9     float64
feature10    float64
feature11    float64
feature12    float64
feature13    float64
feature14    float64
feature15    float64
feature16    float64
feature17    float64
feature18    float64
feature19    float64
feature20    float64
feature21    float64
dtype: object

Python has a handy function that will summarize each column.

In [24]:
print X.describe()
            feature1      feature2       feature3       feature4  \
count  108405.000000  108405.00000  108405.000000  108405.000000   
mean        0.500489       0.52870       0.593131       0.463543   
std         0.090130       0.10490       0.098372       0.096387   
min         0.000000       0.00000       0.000000       0.000000   
25%         0.455870       0.46791       0.540400       0.408910   
50%         0.499290       0.53508       0.594830       0.466530   
75%         0.546790       0.59569       0.648070       0.520500   
max         1.000000       1.00000       0.994840       1.000000   

            feature5       feature6       feature7       feature8  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.430014       0.500072       0.533695       0.484669   
std         0.087958       0.102223       0.098016       0.087952   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.376880       0.436020       0.478910       0.437220   
50%         0.428840       0.502160       0.539150       0.483380   
75%         0.482570       0.566100       0.588430       0.529180   
max         1.000000       1.000000       1.000000       1.000000   

            feature9      feature10      ...            feature12  \
count  108405.000000  108405.000000      ...        108405.000000   
mean        0.351988       0.565529      ...             0.475409   
std         0.100833       0.095981      ...             0.103580   
min         0.005760       0.000000      ...             0.000000   
25%         0.286220       0.510540      ...             0.415570   
50%         0.340520       0.571450      ...             0.477890   
75%         0.406380       0.626150      ...             0.536430   
max         1.000000       1.000000      ...             1.000000   

           feature13      feature14      feature15      feature16  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.426605       0.622671       0.539131       0.407971   
std         0.092866       0.088380       0.093146       0.087795   
min         0.002520       0.000000       0.000000       0.002620   
25%         0.378480       0.570380       0.485290       0.359280   
50%         0.425220       0.626130       0.545240       0.406200   
75%         0.473190       0.676210       0.597950       0.454230   
max         1.000000       1.000000       0.999890       1.000000   

           feature17      feature18      feature19      feature20  \
count  108405.000000  108405.000000  108405.000000  108405.000000   
mean        0.568731       0.594192       0.525519       0.493394   
std         0.092837       0.097566       0.089129       0.090487   
min         0.000000       0.000000       0.000000       0.021120   
25%         0.522870       0.541190       0.472620       0.444070   
50%         0.569970       0.595320       0.530720       0.488750   
75%         0.616220       0.648000       0.578430       0.540270   
max         0.997570       1.000000       0.979030       0.968260   

           feature21  
count  108405.000000  
mean        0.433121  
std         0.104713  
min         0.000000  
25%         0.365090  
50%         0.423950  
75%         0.493530  
max         1.000000  

[8 rows x 21 columns]
In [33]:
pd.options.display.mpl_style = 'default'

X.boxplot(figsize=(14,8), rot = 45);
In [36]:
X.hist(figsize=(14, 12), bins=30);

Our data looks to be approximately normally distributed between 0 and 1, so theres no immediate need for us to do any sort of additional normalization.

I have heard the misconception that machine learning algorithms do not need normalization. This is false. Imagine a loss function mapped across a 2 dimensional feature space, whereby one of features has a range >> than the other. The gradient descent algorithm for movement along this feature space will not be able to differentiate between the two ranges, and will conversely, fail to navigate to the loss minimizing value along the lower feature range.

I will first be splitting the data into a training and validation set so I can get a better picture of out of sample performance.

In [40]:
# split into test and train
from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.1, random_state=False)

First I will be seeing what my performance is looking like out of sample so I can get a sense of how much tuning will be necessary.

I will be running a simple XGBoosted model with some paramaters tuned from a prior Kaggle competition.

In [41]:
import xgboost as xgb # xgboost package 

# Split training and validation datasets
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)

I tend to build models with early stopping, so that if the validation performance doesn’t improve in over n rounds, the final model will use whatever was the last configuration before the n iterations.

In [135]:
params = {}
params['eta'] = 0.02 # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1. Lower is slower but more robust to overfitting.
params['objective'] = 'binary:logistic' # Default.  Running a regression, since we're predicting values not classes
params['eval_metric'] = 'logloss' # We're evaluating our models on Mean Average Error.  
params['max_depth'] = 6 # Maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
params['silent'] = 1 # Don't print messages

# We will be tuning our model based on how it does in the validation dataset 
watchlist = [(d_train, 'train'), (d_valid, 'valid')]

# Early stopping
clf = xgb.train(params, d_train, 10000, watchlist, early_stopping_rounds=100, verbose_eval=10)
[0]	train-logloss:0.693028	valid-logloss:0.69312
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 100 rounds.
[10]	train-logloss:0.691948	valid-logloss:0.692913
[20]	train-logloss:0.690984	valid-logloss:0.692734
[30]	train-logloss:0.690102	valid-logloss:0.692609
[40]	train-logloss:0.689299	valid-logloss:0.692501
[50]	train-logloss:0.688555	valid-logloss:0.692459
[60]	train-logloss:0.687853	valid-logloss:0.692441
[70]	train-logloss:0.687231	valid-logloss:0.692426
[80]	train-logloss:0.686782	valid-logloss:0.69244
[90]	train-logloss:0.686489	valid-logloss:0.69244
[100]	train-logloss:0.686231	valid-logloss:0.692458
[110]	train-logloss:0.685894	valid-logloss:0.6925
[120]	train-logloss:0.685616	valid-logloss:0.692557
[130]	train-logloss:0.685286	valid-logloss:0.692594
[140]	train-logloss:0.685012	valid-logloss:0.692656
[150]	train-logloss:0.684729	valid-logloss:0.692697
[160]	train-logloss:0.684432	valid-logloss:0.69275
Stopping. Best iteration:
[64]	train-logloss:0.687586	valid-logloss:0.692406

In [116]:
d_prediction = xgb.DMatrix(x_prediction)
In [125]:
results = clf.predict(d_prediction)
results_df = pd.DataFrame(data={'probability':results})
joined = pd.DataFrame(ids).join(results_df)

print("Writing predictions to predictions.csv")
# Save the predictions out to a CSV file
joined.to_csv("predictions.csv", index=False)
print("Done!")
Writing predictions to predictions.csv
Done!

Wow, thats a shocker. I’ve never seen a model improve so little after the first iteration. I think it might be because the data is so cleaned already. I dont think I’ll do any tuning at this moment, and I’ll wait to see how my other models perform on this dataset.

In [127]:
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score

I’ll be running 5 fold cross validation on our training set to generate predictions on our validation dataset.

In [62]:
# import cross validation packages
from sklearn.model_selection import KFold

## Building KFolds with 5 Folds
KFold_5 = KFold(5)

I’m testing a few different models with relatively random hyper-paramaters- I’ll be likely building an ensemble model at this point so I’m trying to get a sense of what models I’ll want to tune later.

In [ ]:
names = ["LR-C=0.01", "LR-C=0.1", "LR-C=1", "LR-C=10",  
         "KNN-50", "KNN-500", "KNN-5000",
         "GBR ls",
         "Random Forest 100", "Random Forest 10"]

regressors = [
    LogisticRegression(C = 0.01),
    LogisticRegression(C = 0.1),
    LogisticRegression(C = 1),
    LogisticRegression(C = 10),
    KNeighborsClassifier(50),
    KNeighborsClassifier(500),
    KNeighborsClassifier(5000),
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.02, random_state=0),
    RandomForestClassifier(n_estimators=100),
    RandomForestClassifier(n_estimators=10),   
]

for name, reg in zip(names, regressors):
    reg.fit(x_train, y_train)
    predicted = reg.predict_proba(x_valid)
    #predicted = cross_val_predict(reg, x_valid, y_valid, cv=KFold_5)
    print name, metrics.log_loss(y_valid, predicted)

With the exeption of the penalized logistic regression model, none of the other tested models outperform the XGBoosted model on the validation dataset. The good news is that we’re seeing promising results from the linear regression models and KNN models. If you’re ensembling, you first and foremost want high performing models. But you also want to ensure that your models are different in ways to compensate one another. For example, you’d stand little to gain pairing a number of forest based models since they’ll likely be strong and weak in many of the same ways.

I’m learning towards my ensemble combining a K-Nearest Neighbors model, a penalized logistic regression, and an XGboost model. I’ll see what sort of performance I get on the Numerai leaderboard before I try tuning the parameters. To ensemble, I’ll simply average the prediction probabilities over the full dataset.

In [151]:
d_X = xgb.DMatrix(X, label=Y)

params = {}
params['eta'] = 0.02 # control the learning rate: scale the contribution of each tree by a factor of 0 < eta < 1. Lower is slower but more robust to overfitting.
params['objective'] = 'multi:softprob' 
params['eval_metric'] = 'logloss' # We're evaluating our models on Mean Average Error.  
params['max_depth'] = 6 # Maximum depth of a tree, increase this value will make the model more complex / likely to be overfitting.
params['silent'] = 1 # Don't print messages
params['num_class'] = 2
In [152]:
LR_CLF = LogisticRegression(C = 10)
KN_CLF = KNeighborsClassifier(5000)

Fit our models…

In [153]:
XG_CLF = xgb.train(params, d_X, num_boost_round=64)
LR_CLF.fit(X, Y)
KN_CLF.fit(X, Y)
Out[153]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5000, p=2,
           weights='uniform')

Generate out predictions…

In [161]:
d_prediction = xgb.DMatrix(x_prediction)
In [170]:
XG_prediction = XG_CLF.predict(d_prediction)
LR_prediction = LR_CLF.predict_proba(x_prediction)
KN_prediction = KN_CLF.predict_proba(x_prediction)

Average and print our predictions to CSV…

In [175]:
y_prediction = (XG_prediction + XG_prediction + KN_prediction)/3
results = y_prediction[:, 1]
results_df = pd.DataFrame(data={'probability':results})
joined = pd.DataFrame(ids).join(results_df)

print("Writing predictions to predictions.csv")
# Save the predictions out to a CSV file
joined.to_csv("predictions_0624.csv", index=False)
print("Done!")
Writing predictions to predictions.csv
Done!

With our 3 model ensemble we have a score of 0.6927 on Numer.ai. Enough for a paltry 300th place, but the model is both original, concordant, and 75% consistent on historical datasets. With tuning and some feature engineering we’ll hopefully be able to bring this down to below 0.69, enough for a top 200 finish.

Ensemble Methods

The discussion of ensemble methods will compromise both popular algorithms that incporate multiple models within them, as well as the combination of algorithms.

In the first part, I’ll discuss the logic behind ensemble algorithms like Decision Trees, Random Forests, and Gradient Boosted Models. In the second half, I’ll implement a meta model that incorporates a number of different classifiers and then votes on the classification based on different rules.

In [29]:
from IPython.display import Image
from IPython.core.display import HTML

Ensemble Models

My favorite illustration of ensemble learning comes from the Jain parable of ‘The Blind Men and the Elephant’

In [30]:
Image(url= "https://www.jainworld.com/literature/story25i1.gif", width=700, height=500)
Out[30]:

Ensemble methods like random forests will combine the outputs of multiple weaker clasifier models to reduce the risk of selecting a single poorly performing strong classifier.

The random forests method takes and combines the predictions of small decision trees working on a randomized portion of the original dataset to produce a ‘meta-model’ that will generalize better on a new testing dataset.

The extent and rate at which these small decision trees in the ‘random forest’ are pruned and adjusted is what distinguishes the random forest method with the ‘boosted’ varieties like XGBoost that you’ve probably heard of or used before. The idea, however, is the same– Lots of little independent decisions made on random parts and distributions of the original dataset.

Voting Classifiers

The idea behind the voting classifier implementation is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

In this example, I’ll be combining different machine learning classifiers and using either a hard or soft voting system to predict the class labels. The voting system will either take a majority vote or sum up the individual probabilities of each classifier to select the class labels. The use of voting classifiers is desired because they are able to balance out the individual weaknesses of each model.

Hard (Majority) Voting

With hard voting, the final class label is determined by the class label that has received the majority of votes. An example:

  • Classifier 1: Label A
  • Classifier 2: Label B
  • Classifier 3: Label B

We would expect that the hard voting classifier would predict label B.

Below I will implement the hard voting classifier on 3 algorithms: a random forest, a logistic regression, and K-Nearest Neighbors

In [9]:
# Import datasets and voting tools
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier

# Import models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target

# Create classifier instances
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = KNeighborsClassifier(3)

# Create voting classifier instances
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('knn', clf3)], voting='hard')

# Print success
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'K-NN', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.3f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.960 (+/- 0.04) [Logistic Regression]
Accuracy: 0.953 (+/- 0.03) [Random Forest]
Accuracy: 0.967 (+/- 0.02) [K-NN]
Accuracy: 0.973 (+/- 0.01) [Ensemble]

With hard voting, our ensemble model had a higher accuracy score then any of the classifiers individually. Next we will implement the same model with soft voting.

Soft Voting

Soft voting returns the final class label as the max argument of the sum of predicted probabilities. The final label is the highest average probability across all classifiers. It is just an argument switch to change the implementation from hard to soft voting!

In [12]:
# Create classifier instances
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = KNeighborsClassifier(3)

# Create voting classifier instances
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), 
                                    ('knn', clf3)], voting='soft')

# Print successes and std deviation across k-folds
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 
                                                 'Random Forest', 
                                                 'K-NN', 
                                                 'Ensemble']):
    # cross validate across 5 folds
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    
    # print
    print("Accuracy: %0.3f (+/- %0.2f) [%s]" % 
          (scores.mean(), scores.std(), label))
Accuracy: 0.960 (+/- 0.04) [Logistic Regression]
Accuracy: 0.960 (+/- 0.03) [Random Forest]
Accuracy: 0.967 (+/- 0.02) [K-NN]
Accuracy: 0.960 (+/- 0.03) [Ensemble]

The ensemble method doesn’t perform as well this time around. Me thinks its because the LR and RF models (which have thus far performed worse individually) are holding more weight in a couple tie breaker scenarios.

Visualizing decision boundaries

I fund some handy code on the Sci-Kit learn documentation that shows how to plot the decision boundaries of our indivdiual classifier models. Super cool! See here for more: http://scikit-learn.org/stable/auto_examples/ensemble/plot_voting_decision_regions.html

In [28]:
from itertools import product
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=3)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), 
                                    ('knn', clf2),
                                    ('svc', clf3)],
                        voting='hard')

# Fit classifiers
clf1.fit(X, y)
clf2.fit(X, y)
clf3.fit(X, y)
eclf.fit(X, y)

# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=4)', 'KNN (k=3)',
                         'Kernel SVM', 'Hard Voting']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

Feature Selection

This notebook will aim to provide an explanation and application of different feature ranking methods, namely that of Recursive Feature Elimination (RFE), Stability Selection, linear models as well as Random Forest.

This notebook borrows heavily from an article by Ando Saabas on feature selection found here: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/ as well as Anisotropic’s work on feature selection found here: https://www.kaggle.com/arthurtok/feature-ranking-w-randomforest-rfe-linear-models/code. A lot of Anisotropic’s comments are used in this kernel to explain what is being done.

My work is an application of the techniques and methods written by those gentlemen to the Zillow data competition on Kaggle (https://www.kaggle.com/c/zillow-prize-1). I have included additional competition specific data cleaning and comments in addition to the feature selection code.

There is one point of serious concern with the application of this code to the Zillow competition. In the Zillow competition, we are not estimating home price directly. Zillow is not sharing the actual sale price as a predictor variable (presumably to avoid something like this happening: https://www.wired.com/2010/03/netflix-cancels-contest/) but is having Kagglers estimate the log-loss of their model instead. Weird, I know.

I’ve been left thinking what a feature selection on the log-loss means. Are the most prominent features the ones that Zillow has been most incorrectly using? Are the best features at predicting log loss the worst features to use in a price estimate model? Do we need to do some feature engineering to make the ‘best’ features here more usable?

The contents of this notebook are as follows:

  1. Data Cleaning and Visualisation : This section will revolve around exploring the data and visualising some summary statistics.
  2. Stability Selection via Randomised Lasso Method : Introduce a relatively new feature selection method called “Stability Selection” and using the Randomised Lasso in its implementation
  3. Recursive Feature Elimination : Implementing the Recursive Feature Elimination method of feature ranking via the use of basic Linear Regression
  4. Linear Model Feature Coefficients : Implementing 3 of Sklearn’s linear models (Linear Regression, Lasso and Ridge) and using the inbuilt estimated coefficients for our feature selection
  5. Random Forest Feature Selection : Using the Random Forest’s convenient attribute “feature_importances” to calculate and ultimately rank the feature importance.

Finally, with all the points 1 to 5 above, we will combine the results to create our:

Feature Ranking Matrix : Matrix of all the features along with the respective model scores which we can use in our ranking.

In [167]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, RandomizedLasso)
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

1. DATA CLEANSING AND ANALYSIS

Let’s first read in the house data as a dataframe “house” and inspect the first 5 rows

In [185]:
# Load Data
train = pd.read_csv('../input/train_2016_v2.csv')
prop = pd.read_csv('../input/properties_2016.csv')

# Convert to float32 (this is done to post on Kaggle Kernels)
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)

# Merge training dataset with properties dataset
df_train = train.merge(prop, how='left', on='parcelid')

# Remove ID columns
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc', 
                         'propertycountylandusecode'], axis=1)
In [171]:
x_train.head()
Out[171]:
airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid finishedfloor1squarefeet numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 1.0 NaN NaN 2.0 3.0 NaN 4.0 2.0 NaN NaN NaN NaN 122754.0 360170.0 2015.0 237416.0 6735.88 NaN NaN 6.037107e+13
1 NaN NaN NaN 3.5 4.0 NaN NaN 3.5 NaN NaN NaN NaN 346458.0 585529.0 2015.0 239071.0 10153.02 NaN NaN NaN
2 1.0 NaN NaN 3.0 2.0 NaN 4.0 3.0 NaN NaN NaN NaN 61994.0 119906.0 2015.0 57912.0 11484.48 NaN NaN 6.037464e+13
3 1.0 NaN NaN 2.0 2.0 NaN 4.0 2.0 NaN NaN NaN NaN 171518.0 244880.0 2015.0 73362.0 3048.74 NaN NaN 6.037296e+13
4 NaN NaN NaN 2.5 4.0 NaN NaN 2.5 NaN NaN 2.0 NaN 169574.0 434551.0 2015.0 264977.0 5488.96 NaN NaN 6.059042e+13

5 rows × 55 columns

Now its time for some general data inspection. Let’s first examine to see if there are any nulls in the dataframe as well as look at the type of the data (i.e whether it is a string or numeric)

In [172]:
print(x_train.dtypes)
airconditioningtypeid           float64
architecturalstyletypeid        float64
basementsqft                    float64
bathroomcnt                     float64
bedroomcnt                      float64
buildingclasstypeid             float64
buildingqualitytypeid           float64
calculatedbathnbr               float64
decktypeid                      float64
finishedfloor1squarefeet        float64
calculatedfinishedsquarefeet    float64
finishedsquarefeet12            float64
finishedsquarefeet13            float64
finishedsquarefeet15            float64
finishedsquarefeet50            float64
finishedsquarefeet6             float64
fips                            float64
fireplacecnt                    float64
fullbathcnt                     float64
garagecarcnt                    float64
garagetotalsqft                 float64
hashottuborspa                   object
heatingorsystemtypeid           float64
latitude                        float64
longitude                       float64
lotsizesquarefeet               float64
poolcnt                         float64
poolsizesum                     float64
pooltypeid10                    float64
pooltypeid2                     float64
pooltypeid7                     float64
propertylandusetypeid           float64
rawcensustractandblock          float64
regionidcity                    float64
regionidcounty                  float64
regionidneighborhood            float64
regionidzip                     float64
roomcnt                         float64
storytypeid                     float64
threequarterbathnbr             float64
typeconstructiontypeid          float64
unitcnt                         float64
yardbuildingsqft17              float64
yardbuildingsqft26              float64
yearbuilt                       float64
numberofstories                 float64
fireplaceflag                    object
structuretaxvaluedollarcnt      float64
taxvaluedollarcnt               float64
assessmentyear                  float64
landtaxvaluedollarcnt           float64
taxamount                       float64
taxdelinquencyflag               object
taxdelinquencyyear              float64
censustractandblock             float64
dtype: object

I’ll see what values are in the object type columns — We will have to either break these features up so that for each option there is just a binary (True / False) indicator, or find a way to make the existing features binary.

In [179]:
x_train['taxdelinquencyflag'].value_counts()
Out[179]:
False    88492
True      1783
Name: taxdelinquencyflag, dtype: int64
In [180]:
x_train['fireplaceflag'].value_counts()
Out[180]:
False    90053
True       222
Name: fireplaceflag, dtype: int64
In [181]:
x_train['hashottuborspa'].value_counts()
Out[181]:
False    87910
True      2365
Name: hashottuborspa, dtype: int64

I’ll have to convert the ‘Y’ to a TRUE and otherwise convert any NaN to False’s

In [177]:
x_train['taxdelinquencyflag'] = x_train['taxdelinquencyflag'] == 'Y'
In [178]:
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
In [60]:
# Looking for nulls
print(x_train.isnull().sum())
airconditioningtypeid           61494
architecturalstyletypeid        90014
basementsqft                    90232
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             90259
buildingqualitytypeid           32911
calculatedbathnbr                1182
decktypeid                      89617
finishedfloor1squarefeet        83419
calculatedfinishedsquarefeet      661
finishedsquarefeet12             4679
finishedsquarefeet13            90242
finishedsquarefeet15            86711
finishedsquarefeet50            83419
finishedsquarefeet6             89854
fips                                0
fireplacecnt                    80668
fullbathcnt                      1182
garagecarcnt                    60338
garagetotalsqft                 60338
hashottuborspa                      0
heatingorsystemtypeid           34195
latitude                            0
longitude                           0
lotsizesquarefeet               10150
poolcnt                         72374
poolsizesum                     89306
pooltypeid10                    89114
pooltypeid2                     89071
pooltypeid7                     73578
propertylandusetypeid               0
rawcensustractandblock              0
regionidcity                     1803
regionidcounty                      0
regionidneighborhood            54263
regionidzip                        35
roomcnt                             0
storytypeid                     90232
threequarterbathnbr             78266
typeconstructiontypeid          89976
unitcnt                         31922
yardbuildingsqft17              87629
yardbuildingsqft26              90180
yearbuilt                         756
numberofstories                 69705
fireplaceflag                       0
structuretaxvaluedollarcnt        380
taxvaluedollarcnt                   1
assessmentyear                      0
landtaxvaluedollarcnt               1
taxamount                           6
taxdelinquencyflag                  0
taxdelinquencyyear              88492
censustractandblock               605
dtype: int64

Yikes, the data is pretty sparse. We’re going to have to figure out how to either remove some features or impute values for some of these.

Temporarily I’ll be imputing medians for the missing values based on the missing value for said column, but its only a stopgap solution before I can do something else (possibly predict missing values?)

In [182]:
x_train = x_train.fillna(x_train.median())

2. Stability Selection via Randomized Lasso

In a nutshell, this method serves to apply the feature selection on different parts of the data and features repeatedly until the results can be aggregated. Therefore stronger features ( defined as being selected as important) will have greater scores in this method as compared to weaker features. Refer to this paper by Nicolai Meinshausen and Peter Buhlmann for a much greater detail on the method : http://stat.ethz.ch/~nicolai/stability.pdf

In this notebook, the Stability Selection method is conveniently inbuilt into sklearn’s randomized lasso model and therefore this will be implemented as follows:

In [84]:
# First extract the target variable which is our Log Error
Y = df_train['logerror'].values
X = x_train.as_matrix()

# Store the column/feature names into a list "colnames"
colnames = x_train.columns

Next, we create a function which will be able to conveniently store our feature rankings obtained from the various methods described here into a Python dictionary. In case you are thinking I created this function, no this is not the case. All credit goes to Ando Saabas and I am only trying to apply what he has discussed in this context.

In [85]:
# Define dictionary to store our rankings
ranks = {}
# Create our function which stores the feature rankings to the ranks dictionary
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    return dict(zip(names, ranks))
In [87]:
# Finally let's run our Selection Stability method with Randomized Lasso
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["rlasso/Stability"] = ranking(np.abs(rlasso.scores_), colnames)
print('finished')
finished

3. Recursive Feature Elimination ( RFE )

Now onto the next method in our feature ranking endeavour. Recursive Feature Elimination or RFE uses a model ( eg. linear Regression or SVM) to select either the best or worst-performing feature, and then excludes the feature. After this, the whole process is iterated until all features in the dataset are used up ( or up to a user-defined limit). Sklearn conveniently possesses a RFE function via the sklearn.feature_selection call and we will use this along with a simple linear regression model for our ranking search as follows:

In [88]:
# Construct our Linear Regression model
lr = LinearRegression(normalize=True)
lr.fit(X,Y)

#stop the search when only the last feature is left
rfe = RFE(lr, n_features_to_select=1, verbose =3 )
rfe.fit(X,Y)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)), colnames, order=-1)
Fitting estimator with 55 features.
Fitting estimator with 54 features.
Fitting estimator with 53 features.
Fitting estimator with 52 features.
Fitting estimator with 51 features.
Fitting estimator with 50 features.
Fitting estimator with 49 features.
Fitting estimator with 48 features.
Fitting estimator with 47 features.
Fitting estimator with 46 features.
Fitting estimator with 45 features.
Fitting estimator with 44 features.
Fitting estimator with 43 features.
Fitting estimator with 42 features.
Fitting estimator with 41 features.
Fitting estimator with 40 features.
Fitting estimator with 39 features.
Fitting estimator with 38 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.

4. Linear Model Feature Ranking

Now let’s apply 3 different linear models (Linear, Lasso and Ridge Regression) and how the features are selected and prioritised via these models. To achieve this, I shall use the sklearn implementation of these models and in particular the attribute .coef to return the estimated coefficients for each feature in the linear model.

In [89]:
# Using Linear Regression
lr = LinearRegression(normalize=True)
lr.fit(X,Y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), colnames)

# Using Ridge 
ridge = Ridge(alpha = 7)
ridge.fit(X,Y)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)

# Using Lasso
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)

5. Random Forest feature ranking

Sklearn’s Random Forest model also comes with it’s own inbuilt feature ranking attribute and one can conveniently just call it via “featureimportances“. That is what we will be using as follows:

In [105]:
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=2)
rf.fit(X,Y)
ranks["RF"] = ranking(rf.feature_importances_, colnames)
building tree 1 of 50
building tree 3 of 50 building tree 2 of 50building tree 7 of 50building tree 4 of 50
building tree 6 of 50building tree 5 of 50building tree 8 of 50





building tree 9 of 50
building tree 10 of 50
building tree 11 of 50
building tree 12 of 50
building tree 13 of 50
building tree 14 of 50
building tree 15 of 50
building tree 16 of 50
building tree 17 of 50
building tree 18 of 50
building tree 19 of 50
building tree 20 of 50
building tree 21 of 50
building tree 22 of 50
building tree 23 of 50
building tree 24 of 50
building tree 25 of 50
building tree 26 of 50
building tree 27 of 50
building tree 28 of 50building tree 29 of 50

building tree 30 of 50
building tree 31 of 50
building tree 32 of 50
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   21.6s
building tree 33 of 50
building tree 34 of 50
building tree 35 of 50
building tree 36 of 50
building tree 37 of 50
building tree 38 of 50
building tree 39 of 50
building tree 40 of 50
building tree 41 of 50
building tree 42 of 50
building tree 43 of 50
building tree 44 of 50
building tree 45 of 50
building tree 46 of 50
building tree 47 of 50
building tree 48 of 50
building tree 49 of 50
building tree 50 of 50
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   38.3s finished

6. Creating the Feature Ranking Matrix

We combine the scores from the various methods above and output it in a matrix form for convenient viewing as such:

In [133]:
# Create empty dictionary to store the mean value calculated from all the scores
r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print("\t%s" % "\t".join(methods))
for name in colnames:
    print("%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods]))))
	Lasso	LinReg	Mean	RF	RFE	Ridge	rlasso/Stability	Mean
airconditioningtypeid	0.0	0.0	0.13	0.07	0.72	0.01	0.0	0.13
architecturalstyletypeid	0.0	0.0	0.12	0.0	0.69	0.01	0.0	0.12
basementsqft	0.0	0.0	0.1	0.0	0.63	0.0	0.0	0.1
bathroomcnt	0.0	0.0	0.22	0.1	0.85	0.36	0.0	0.22
bedroomcnt	0.0	0.0	0.18	0.19	0.83	0.03	0.0	0.18
buildingclasstypeid	0.0	1.0	0.19	0.0	0.11	0.0	0.0	0.19
buildingqualitytypeid	0.0	0.0	0.12	0.11	0.59	0.01	0.0	0.12
calculatedbathnbr	0.0	0.0	0.2	0.08	0.89	0.23	0.0	0.2
decktypeid	0.0	0.16	0.05	0.0	0.13	0.0	0.0	0.05
finishedfloor1squarefeet	0.0	0.0	0.09	0.04	0.52	0.0	0.0	0.09
calculatedfinishedsquarefeet	0.0	0.0	0.16	0.54	0.39	0.0	0.0	0.16
finishedsquarefeet12	1.0	0.0	0.33	0.51	0.48	0.0	0.0	0.33
finishedsquarefeet13	0.0	0.0	0.1	0.0	0.61	0.0	0.0	0.1
finishedsquarefeet15	0.46	0.0	0.17	0.12	0.41	0.0	0.0	0.17
finishedsquarefeet50	0.13	0.0	0.11	0.05	0.5	0.0	0.0	0.11
finishedsquarefeet6	0.02	0.0	0.08	0.02	0.46	0.0	0.0	0.08
fips	0.0	0.0	0.14	0.01	0.65	0.18	0.0	0.14
fireplacecnt	0.0	0.0	0.18	0.03	0.96	0.11	0.0	0.18
fullbathcnt	0.0	0.0	0.16	0.07	0.8	0.07	0.0	0.16
garagecarcnt	0.0	0.0	0.15	0.03	0.81	0.06	0.0	0.15
garagetotalsqft	0.54	0.0	0.18	0.13	0.43	0.0	0.0	0.18
hashottuborspa	0.0	0.0	0.24	0.02	0.94	0.51	0.0	0.24
heatingorsystemtypeid	0.0	0.0	0.14	0.07	0.76	0.02	0.0	0.14
latitude	0.0	0.0	0.18	0.88	0.2	0.0	0.0	0.18
longitude	0.0	0.0	0.19	0.92	0.22	0.0	0.0	0.19
lotsizesquarefeet	0.0	0.0	0.19	0.89	0.24	0.0	0.0	0.19
poolcnt	0.0	0.01	0.01	0.0	0.07	0.0	0.0	0.01
poolsizesum	0.0	0.0	0.07	0.01	0.44	0.0	0.0	0.07
pooltypeid10	0.0	0.03	0.01	0.0	0.04	0.0	0.0	0.01
pooltypeid2	0.0	0.0	0.02	0.0	0.15	0.0	0.0	0.02
pooltypeid7	0.0	0.01	0.01	0.0	0.06	0.0	0.0	0.01
propertylandusetypeid	0.0	0.0	0.13	0.09	0.7	0.0	0.0	0.13
rawcensustractandblock	0.0	0.0	0.1	0.27	0.33	0.0	0.0	0.1
regionidcity	0.0	0.0	0.07	0.25	0.17	0.0	0.0	0.07
regionidcounty	0.49	0.0	0.14	0.01	0.35	0.0	0.0	0.14
regionidneighborhood	0.0	0.0	0.08	0.31	0.19	0.0	0.0	0.08
regionidzip	0.03	0.0	0.12	0.44	0.26	0.0	0.0	0.12
roomcnt	0.0	0.0	0.13	0.06	0.74	0.01	0.0	0.13
storytypeid	0.0	0.01	0.02	0.0	0.09	0.0	0.0	0.02
threequarterbathnbr	0.0	0.0	0.34	0.02	1.0	1.0	0.0	0.34
typeconstructiontypeid	0.0	0.0	0.18	0.0	0.93	0.16	0.0	0.18
unitcnt	0.0	0.0	0.15	0.04	0.87	0.02	0.0	0.15
yardbuildingsqft17	0.0	0.0	0.1	0.04	0.56	0.0	0.0	0.1
yardbuildingsqft26	0.0	0.0	0.09	0.0	0.54	0.0	0.0	0.09
yearbuilt	0.0	0.0	0.19	0.58	0.57	0.0	0.0	0.19
numberofstories	0.0	0.0	0.14	0.04	0.67	0.11	0.0	0.14
fireplaceflag	0.0	0.0	0.2	0.0	0.91	0.28	0.0	0.2
structuretaxvaluedollarcnt	0.01	0.0	0.21	1.0	0.28	0.0	0.0	0.21
taxvaluedollarcnt	0.01	0.0	0.18	0.73	0.31	0.0	0.0	0.18
assessmentyear	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
landtaxvaluedollarcnt	0.01	0.0	0.18	0.76	0.3	0.0	0.0	0.18
taxamount	0.31	0.0	0.26	0.86	0.37	0.0	0.0	0.26
taxdelinquencyflag	0.0	0.0	0.3	0.02	0.98	0.83	0.0	0.3
taxdelinquencyyear	0.0	0.0	0.15	0.07	0.78	0.03	0.0	0.15
censustractandblock	0.0	0.0	0.05	0.28	0.02	0.0	0.0	0.05

Now, with the matrix above, the numbers and layout does not seem very easy or pleasant to the eye. Therefore, let’s just collate the mean ranking score attributed to each of the feature and plot that via Seaborn’s factorplot.

In [183]:
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])

# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
In [184]:
# Let's plot the ranking of the features
sns.factorplot(x="Mean Ranking", y="Feature", data = meanplot, kind="bar", size=10, 
               aspect=1, palette='coolwarm')
Out[184]:
<seaborn.axisgrid.FacetGrid at 0x126458b50>

Conclusion

The top 3 features are “Three Quarter Bathroom Count”, “Finished Square Feet”, and “Tax Delinquincy Flag”. “Tax amount” and “Has Hot Tub or Spa” are 4th and 5th.

To continue the discussion from up top – these are the features that are best at predicting Zillow’s log-loss on their price estimate model.

These rankings are showing us, I believe, which features Zillow has been using the most incorrectly to estimate their models. I’m sure that ‘Finished Square feet’ would be a great predictor at home price, if I were to write a price prediction algorithm myself. But, I’m thinking, they’ve relied on it too much, or have insufficiently factored in some additional interactions that are causing it to become a bad feature.

Hyperparameter Selection & Visualization

In this notebook I will be covering hyperparameter selection for model tuning. Hyperparameters express “higher-level” properties of the model such as its complexity or how fast it should learn. Hyperparameters are usually fixed before the actual training process begins, and cannot be learned directly from the data in the standard model training process. Model developers will predefine these hyperparameters by testing different values, training different models, and choosing the values that test better.

Examples of hyperparmeters are:

  • number of leaves in a tree based model
  • clusters in a k-means algorithm
  • number of hidden layers in a neural network
  • penalization incurred by additional model complexity
  • learning rate

In this notebook I will do hyperparameter selection for a model, including visualization of the hyperparameter grid as well as a revision after seeing unintuitive results.

In [77]:
#Image displat packages
from IPython.display import Image
from IPython.core.display import HTML 

#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target

Hyperparameter Optimization

We’ll be using cross validation to initially find the optimal hyperparameters for our SVM kernal classifier.

When training our SVM clasisifier with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. Proper choice of these two hyperparameters is critical to the SVM classifier’s performance. We use cross validation and a separate train/test set because we do not want to overfit our hyperparamters to our data. Hence we will be finding the parameters on a separate dataset than the dataset we will be evaluating our model on.

The parameter C, common to a lot of ML algorithms, trades off misclassiciation for simpliciity. A low C will make the decision boundary smooth (trading off some accuracy), while a high C will trade off high accuracy for simplicity. See below for an example of this in action.

In [5]:
Image(url= "https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_18.png")
Out[5]:

Gamma defines how much influence a single training example has. Consider it a ‘sphere of influence’. The larger gamma is, the closer other examples must be to be affected.

In [6]:
Image(url= "https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_15.png")
Out[6]:

We will be splitting our data into testing and training datasets, like we did before. We will be finding the our optimal hyperparameters on the training dataset before we evaluate them on our training set.

In [7]:
# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=False)
In [10]:
# load in SVC - a kernel classifier function from the scikit-learn library
from sklearn.svm import SVC

# create a default instance of the classifier
classifier = SVC()

I will create a dictionary of possible gammas and C’s for our cross validator to run through. I looked through literature and examples of the SVM rbf kernel to get a sense of what others were using. For an initial search, a logarithmic grid with basis 10 is used.

In [49]:
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)

I will be using the GridSearchCV package to run hyperparameter cross validation on the training dataset.

GridSearchCV’s default settings will run 3-fold cross validation, but I will be doing 5-fold cross validation.

In [50]:
# import cross validation packages
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# setup 5 fold cross validation to be fed into the GridSearchCV package
cv_5 = KFold(5)
In [78]:
# create a parameter range to test over
parameters = {'kernel':['rbf'], 'gamma':gamma_range, 'C':C_range}

# create an instance of the GridSearchCV cross-validator - using our classifier 
# and choice or parameters
cross_validator = GridSearchCV(classifier, parameters, cv=cv_5)
In [79]:
# fit the data
cross_validator.fit(X_train, y_train)
Out[79]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
       error_score='raise',
       estimator=SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'kernel': ['rbf'], 'C': array([  1.00000e-02,   1.00000e-01,   1.00000e+00,   1.00000e+01,
         1.00000e+02,   1.00000e+03,   1.00000e+04,   1.00000e+05,
         1.00000e+06,   1.00000e+07,   1.00000e+08,   1.00000e+09,
         1.00000e+10]), 'gamma': array([  1.00000e-09,   1.00000e-08,   1.00000e-07,   1.00000e-06,
         1.00000e-05,   1.00000e-04,   1.00000e-03,   1.00000e-02,
         1.00000e-01,   1.00000e+00,   1.00000e+01,   1.00000e+02,
         1.00000e+03])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [80]:
best_gamma = cross_validator.best_estimator_.gamma  
best_C = cross_validator.best_estimator_.C

print 'Best Gamma: {}, Best C: {}' .format(best_gamma, best_C)
Best Gamma: 1e-07, Best C: 100000000.0

Debrief

So from what I’ve seen in my previous work, the high C initially gave me some concern. Its not exactly the perfect interpretation, but the C value of 100000000 is essentially stating that our model is weighting misclassifications as being 10*10^8 times worse than adding complexity to our model. Its suggesting a C this high because it appears that across all 5 folds of our CV, the high C model was performing very well.

In a pure out of sample evaluation, our model would likely perform very poorly. My hypothesis why the C is so high is that due to the small testing data set of each k-fold cross validation (only 24 points!) our model is fitting close to all the points in each testing set.

I’ll first re-shape the score array such that I can easily access the score of each hyperparamter combination.

In [66]:
scores = cross_validator.cv_results_['mean_test_score'].reshape(len(C_range),
                                                     len(gamma_range))

Checking the ‘optimal’ hyperparameters, we see that our model was near perfect in its predictions across the testing sets in the 5 folds.

In [65]:
scores[[C_range==100000000, gamma_range==1e-07]]
Out[65]:
array([ 0.96666667])

I found a plotting function online that would show me a heatmap of how well the scoring was across each unique hyperparameter combination.

In [67]:
# source: http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html

# Utility function to move the midpoint of a colormap to be around
# the values of interest.
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

## Plotting Function
plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.title('Validation accuracy')
plt.show()

Our glowing hot point near the bottom left shows that the particular hyperparameter combo did slightly better than a lot of (already pretty good!) hyperparameter combinations.

To control for this overfitting, I’m going to re-run our hyperparameter grid search using fewer k-folds (To create larger test sets) in addition to some fold shuffling.

In [70]:
cv_3 = StratifiedShuffleSplit(n_splits=3, test_size=0.33, random_state=42)

# create a parameter range to test over
parameters = {'kernel':['rbf'], 'gamma':gamma_range, 'C':C_range}

# create an instance of the GridSearchCV cross-validator - using our classifier and choice or parameters
cross_validator = GridSearchCV(classifier, parameters, cv=cv_4)

# fit the data
cross_validator.fit(X_train, y_train)

best_gamma = cross_validator.best_estimator_.gamma  
best_C = cross_validator.best_estimator_.C

print 'Best Gamma: {}, Best C: {}' .format(best_gamma, best_C)
Best Gamma: 0.01, Best C: 10.0
In [71]:
scores = cross_validator.cv_results_['mean_test_score'].reshape(len(C_range),
                                                     len(gamma_range))

plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(C_range)), C_range)
plt.title('Validation accuracy')
plt.show()

White hot! Looks like we didn’t lose predictive ability by decreasing the number of folds in our cross validation. We have many good possible C or Gamma values for measuing our model performance.

Now we’ll see how our model runs on our testing data with our selected gamma and C.

In [76]:
# Run on test data
cross_validator.score(X_test, y_test)
Out[76]:
1.0

Boom! Too good. If we had other competing models, we’d be comparing this score. Iris is an easy dataset, so this is nothing too groundbreaking to be proud of 😉

In [ ]:
 

Model Validation Techniques

In statistics and machine learning, model validation is the process of deciding whether the results of a model are sufficiently generalizable to describe and make predictions on similarly distributed data.

Model validation largely revolves around the great dilemma between bias and variance. Model developers want to choose a model that will accurately capture the predictions in the training data, but will also generalize well on unseen data.

  • The first of these demands is a desire to reduce bias, which is the error from erroneous assumptions in the learning algorithm. High bias models are lower accuracy because they do not utilize all the possible relations between features and predictions available in the data. Low bias models will often be very complex, and will usually have a much higher accuracy as a result. The risk with low bias models is that they are overfit on the data and are not modelling on a true relationship between the features and predictions, but on the noise present in the data. The opposite of overfitting is underfitting, and it is used to describe models that potentially miss a significant relationship between a feature and a prediction.
  • The second demand on a model developer is to reduce variance, which is the error from fluctuations in the underlying data. High variance models are poor generalizers on data outside the training set. Models that are (over)fit to the noise in the training data will not be able to make good predictions on data that has a different distribution. Low variance models should have reasonable accuracy out of sample, because they have correctly identified real relationships between the features and predictions, and not just noise.

In the training phase of our model development process, care is taken to tune models such that they are able to minimize bias and variance as much as possible. In this notebook, I’ll be implementing the most basic of validation techniques, the Test/Train Split and Cross-Validation.

In [47]:
from IPython.display import Image
from IPython.core.display import HTML 

#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Test Train Splits

If you’re going to be developing any sort of predictive model, the minimal amount of validation will require splitting your data into testing and training datatsets.

In [48]:
# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target
In [49]:
# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=False)

We have split the dataset such that 20% of the data will be removed for testing purposes. In this scenario, I have split the data such that the last 20% are reserved for testing (via random_state=False), but you can easily imagine that another method of splitting data might pull the 20% randomly from the dataset. The use of this flag breaks down to whether or not we’re dealing with time dependent data.

If we’re dealing with a time dimension in our data (e.g. financial analysis), it is more common to remove the last 6 months of data, for example, for testing purposes. The reason being that it is easy to overfit data this way- You’re still testing over points that match the general patterns in the training dataset.

Below I’ll be training the data on our training dataset, and measuring its fit on the test dataset.

In [55]:
# load in SVC - a kernel classifier function from the scikit-learn library
from sklearn.svm import SVC

# create an instance of a kernel-based regressor from scikit learn
classifier = SVC(kernel = 'rbf')

# Fit on training dataset
classifier.fit(X[train], y[train])

# Print score
print 'Our model correctly classified {} of our test points' \
.format(classifier.score(X[test], y[test]))
Our model correctly classified 0.933333333333 of our test points

Cross Validation

Cross Validation is used in model development because it is one of the ways we get insights on out of sample model performance in addition to preventing overfitting of our models. It is often done in addition to a hold out test set (as done above) and is often used to find the optimal hyperparameters for a given function.

Cross Validation techniques will generally split up the dataset in a way that either fixed numbers or portions of the dataset get iteratively placed into testing and training pools. Two of the most common methods of cross validation are LeaveOneOut and K-Folds.

K-Folds

In K-Folds validation, the data is split into test/train datasets k times. Each iteration of k-folds will test the data on 1/k of the dataset, while training it on the remaining k-1/k portion of the data. Below is an illustration of the k iterations of k-folds.

In [46]:
Image(url= "http://cse3521.artifice.cc/images/k-fold-cross-validation.jpg")
Out[46]:

K-folds has been empircally shown to exhibit desireable tradeoffs between bias and variance with a k of 5 and 10, so you’ll see these most commonly represented in tests and examples. [Citation needed]

Leave One Out

Leave one out is a special case of K-Folds whereby k is equivalent to the number of datapoints, n.

Which is preferred?

As you can imagine, leave one out can be quite computationally expensive– each model is essentially run n times. It is therefore usually only run on small datasets whereby the computational hit can be taken. With regards to bias/variance, LOO CV will lead to estimates with lower bias, but higher variance because each training set will contain n-1 examples and will have overlap across training sets as you’re using almost the entire training set in each iteration.

The opposite is true with k-fold CV, because there is relatively less overlap between training sets, thus the test error estimates are less correlated, as a result of which the mean test error value won’t have as much variance as LOO CV.

Is cross validation a replacement for having a test data set?

It is tempting to solely use cross validation in lieu of a test train split. This is unadvised because while cross validation may provide some sense of out of sample model performance, it still uses the entirety of the training set to generate coefficients. It is moot to compare models this way, because some models are simply much better at overfitting to the data given to them.

We will be using the Iris dataset as above for classification purposes in this test.

In [53]:
# import cross validation packages
from sklearn.model_selection import KFold, LeaveOneOut

# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

Our cross validation packages will take the number of folds but otherwise are very simple to initialize.

In [54]:
## Building KFolds with 5 Folds
KFold_5 = KFold(5)

## Building KFolds with 10 Folds
KFold_10 = KFold(10)

## Building LOO, leaving one out (naturally) 
LooCV = LeaveOneOut()

To demonstrate the use of the cross validation packages, I will first initialize and display the prediction score of each fold of our 5-fold cross validation.

Scikit-learn has a useful set of packages for iterating through each fold of a cross validation scheme in addition to making predictions on a model fit with cross validation. Plenty of models have built in cross validation functions (ie: LassoCV, LogisticRegressionCV) but this is a model agnostic framework — great for comparing the performance of a lot of models against eachother!

In [71]:
from sklearn.model_selection import cross_val_predict, cross_val_score

# load in Logistic Regression 
from sklearn.linear_model import LogisticRegression

# create an instance of Logistic Regression with default options
classifier = LogisticRegression()

# Display the prediction score of each fold
cross_val_score(classifier, X_train, y_train, cv=KFold_5)
Out[71]:
array([ 0.83333333,  0.95833333,  1.        ,  1.        ,  0.875     ])

As you can see not every fold was predicted equally well in our cross validation scheme.

We will use the cross_val_predict package to see how well our model works out of sample.

In [75]:
# Generate Predictions
predicted = cross_val_predict(classifier, X_test, y_test)

# Measure model performance
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
Out[75]:
0.90000000000000002

Comparing performance of different models

Below I’ll be using our cross validation methodology (the k-fold with 5 folds) to test the out of sample performance of our models. Each of the models will be fit on 4/5th of the training data 5 times and used to make 5 predictions on the out of sample data. The 5 out of sample predictions will be averaged then to make the final prediction.

I’m testing the performane of a random set of classifier models with relatively random hyperparameters. If I wanted to get fancy I’d tune each of those hyperparameters individually, but as a start this is a good technique to see which models are looking promising on a new dataset.

In [81]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
In [84]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

for name, clf in zip(names, classifiers):
    predicted = cross_val_predict(clf, X_test, y_test, cv=KFold_5)
    print name, metrics.accuracy_score(y_test, predicted)
Nearest Neighbors 0.9
Linear SVM 0.8
RBF SVM 0.866666666667
Gaussian Process 0.966666666667
Decision Tree 0.9
Random Forest 0.933333333333
Neural Net 0.866666666667
AdaBoost 0.9
Naive Bayes 0.933333333333
QDA 0.766666666667

Discussion

Cross valiadation and the test/train split are the most basic methods of model validation you can do. Bootstrapping and Bagging are two further techniques whereby models are tested on random samples of the training set, such that the distribution of the training set is varied.

The success of ensemble meta-model techniques like Random Forests is largely because of the build in model validation that is used to create the final model. To learn more about how these models improve the bias-variance tradeoff, see my post about ensemble models.