Numerai Competition Attempt #2

In this notebook I experiment with adding features to the Numerai dataset using t-distributed stochastic neighbor embedding (t-SNE), a dimensionality reduction technique that is often used to visualize data in 2 or 3 dimensions, much like PCA. Unlike PCA, which is a linear projection, t-SNE can capture non-linear structure in the data. A number of models have benefited from adding t-SNE features before running the main classification algorithm.

t-SNE works by converting the distances between points in the original space into a probability distribution over pairs of points. It then looks for a layout in a lower-dimensional space whose pairwise distribution matches that one as closely as possible. The original feature values are no longer visible in the embedding, but local similarities between points are preserved.
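To make the "probability distribution" idea concrete, here is a minimal sketch of the quantities t-SNE compares. It is not the `bh_sne` implementation used later in this notebook: the toy points, the single fixed `sigma`, and every function name are assumptions for illustration, and real t-SNE chooses a bandwidth per point and symmetrizes conditional probabilities. The sketch only shows that an embedding which keeps neighbors together scores a lower KL divergence than one that separates them.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(weights):
    """Scale pairwise weights so they sum to 1 (a probability distribution)."""
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def gaussian_affinities(points, sigma=1.0):
    """Similarities in the original space: nearby pairs get most of the mass."""
    n = len(points)
    return normalize({
        (i, j): math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
        for i in range(n) for j in range(n) if i != j
    })

def student_t_affinities(embedding):
    """Similarities in the embedding, using the heavy-tailed Student-t kernel."""
    n = len(embedding)
    return normalize({
        (i, j): 1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
        for i in range(n) for j in range(n) if i != j
    })

def kl_divergence(p, q):
    """KL(P || Q): the mismatch t-SNE reduces by moving the embedded points."""
    return sum(p[pair] * math.log(p[pair] / q[pair]) for pair in p if p[pair] > 0)

# Toy data (assumed): two tight pairs of points that sit far apart.
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (5.0, 5.0, 5.0), (5.1, 5.0, 4.9)]
good_embedding = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]  # keeps each pair together
bad_embedding = [(0.0, 0.0), (10.0, 0.0), (0.1, 0.0), (10.1, 0.0)]   # splits each pair

p = gaussian_affinities(high_dim)
kl_good = kl_divergence(p, student_t_affinities(good_embedding))
kl_bad = kl_divergence(p, student_t_affinities(bad_embedding))
print(kl_good < kl_bad)
```

The actual algorithm starts from a random layout and follows the gradient of this divergence, which is why the embedding preserves local neighborhoods rather than the original coordinates.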

I suggest the following video if you want a theoretical understanding of t-SNE from the creator:

And a practical guide to tuning its parameters:

I’m using an adapted version of Jim Fleming’s t-SNE pipeline, found here:

Why Apply t-SNE to Numerai’s dataset?

If you’ve worked with Numerai data before, you’ll have noticed that the features carry no descriptive names and that each one has been normalized to lie between 0 and 1. Richard Craib, the founder of Numerai, explains the reason and method of this data encryption here: As such, adding features via dimensionality reduction is the only sensible way I can think of to squeeze additional features into the training dataset.

Note: This notebook is incomplete and outputs have not been uploaded. It also takes roughly 10 hours to run, so I don’t recommend doing so unless you have cloud compute resources. My next post will share the data infrastructure I’ve built out on AWS to make this easier.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
# Let's see what's in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

Data Prep

Jim Fleming does something pretty cool in this pipeline that I’ve recently incorporated into all my work. Rather than splitting the training data arbitrarily into training and validation sets, he chooses as validation data the training points that most closely resemble the test data. This is done by first running a 5-fold cross-validated random forest classifier that tries to tell training rows from test rows, and then sorting the rows by the predicted probability of belonging to the test set. The training points the model is most inclined to label as test data are, we hypothesize, the ones closest to the underlying pattern of the test set.
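The selection step on its own can be sketched in a few lines. This is an illustration under assumptions, not Jim Fleming's code: `split_by_test_likeness`, the row labels, and the scores are all hypothetical, and the scores stand in for the out-of-fold "is this a test row?" probabilities that the random forest produces.

```python
def split_by_test_likeness(rows, test_scores, valid_fraction=0.1):
    """Split training rows so the most test-like ones become the validation set.

    rows        -- training rows (any objects)
    test_scores -- one score per row: predicted probability that the row is test data
    """
    if len(rows) != len(test_scores):
        raise ValueError("rows and test_scores must have the same length")

    # Indices ordered from least to most test-like.
    order = sorted(range(len(rows)), key=lambda i: test_scores[i])

    n_valid = int(len(rows) * valid_fraction)
    if n_valid == 0:
        return list(rows), []

    train_idx, valid_idx = order[:-n_valid], order[-n_valid:]
    return [rows[i] for i in train_idx], [rows[i] for i in valid_idx]

# Hypothetical rows and scores, purely for illustration.
rows = ["a", "b", "c", "d", "e", "f", "g", "h"]
scores = [0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.8, 0.6]
train_rows, valid_rows = split_by_test_likeness(rows, scores, valid_fraction=0.25)
print(sorted(valid_rows))
```

Because the validation rows are chosen rather than sampled, the validation loss is meant to approximate performance on the test distribution, not on the training distribution.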

In [ ]:
import time
import random

from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def prep():
    df_train = pd.read_csv('../numerai/T62/numerai_training_data.csv')
    df_test = pd.read_csv('../numerai/T62/numerai_tournament_data.csv')

    feature_cols = list(df_train.columns[3:-1])
    target_col = df_train.columns[-1]
    test_col = 'is_test'
    id_col = 't_id'

    df_train['is_test'] = 0
    df_test['is_test'] = 1

    df_data = pd.concat([df_train, df_test])
    # reindex_axis was removed from pandas; reindex(columns=...) does the same job
    df_data = df_data.reindex(columns=feature_cols + [test_col, target_col])

    X_split = df_data[feature_cols]
    y_split = df_data[test_col]

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=67)
    predictions = np.zeros(y_split.shape)

    # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=67)
    for i, (train_i, test_i) in enumerate(kfold.split(X_split, y_split)):
        print("Fold #{}".format(i + 1))

        X_split_train = X_split.iloc[train_i]
        y_split_train = y_split.iloc[train_i]

        X_split_test = X_split.iloc[test_i]
        y_split_test = y_split.iloc[test_i]

        # learn to tell training rows from test rows on this fold
        rf.fit(X_split_train, y_split_train)

        p = rf.predict_proba(X_split_test)[:,1]
        auc = roc_auc_score(y_split_test, p)
        print("AUC: {:.2f}".format(auc))

        predictions[test_i] = p

    # sort predictions by value
    i = predictions.argsort()

    # sort data by prediction confidence
    df_sorted = df_data.iloc[i]

    # select only training data
    df_train_sorted = df_sorted.loc[df_sorted.is_test == 0]

    # drop unnecessary columns
    df_train_sorted = df_train_sorted.drop([test_col], axis='columns')

    # verify training data
    assert(df_train_sorted[target_col].sum() == df_train[target_col].sum())

    # grab first N rows as train and last N rows as validation (those closest to test)
    validation_size = int(len(df_train_sorted) * 0.1)
    df_train = df_train_sorted.iloc[:-validation_size]
    df_valid = df_train_sorted.iloc[-validation_size:]
    print('Creating dataset with validation size: {}'.format(validation_size))

    df_train.to_csv('../numerai/T62/train_data.csv', index_label=False)
    df_valid.to_csv('../numerai/T62/valid_data.csv', index_label=False)
    df_test.to_csv('../numerai/T62/test_data.csv', index_label=False)
In [ ]:
prep()

Let’s ensure our data is in the right folder

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

t-SNE Feature Encoding

The t-SNE feature encoding below is a modified version of Jim Fleming’s code. The function takes a perplexity value as input, which is loosely analogous to the k in a k-nearest-neighbors algorithm. It controls the trade-off between the local and global patterns in the data. With a smaller perplexity, local variations dominate the t-SNE plot. A larger perplexity makes each point take more neighbors into account, so broader clusters emerge.
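The link between perplexity and "number of neighbors" can be checked numerically. The sketch below is an assumption-laden illustration, not the `bh_sne` internals: it treats perplexity as 2 raised to the entropy of one point's neighbor distribution and uses bisection to find the Gaussian bandwidth that hits a target value. The function names and the toy distances are hypothetical.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Neighbor probabilities for one point, given squared distances to the others."""
    d_min = min(sq_dists)  # shift by the minimum so the largest weight is 1 (avoids underflow to all zeros)
    weights = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection on sigma; perplexity grows as the bandwidth sigma grows."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since sigma spans several decades
        if perplexity(conditional_probs(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Squared distances from one point to eight others (toy values).
sq_dists = [0.1, 0.2, 0.3, 4.0, 5.0, 6.0, 25.0, 30.0]
sigma_local = sigma_for_perplexity(sq_dists, target=3.0)
sigma_global = sigma_for_perplexity(sq_dists, target=6.0)
print(sigma_local < sigma_global)
```

A uniform distribution over k neighbors has perplexity exactly k, which is why the value reads like a soft neighbor count: asking for a higher perplexity forces a wider bandwidth, so more distant points start to influence the embedding.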

Getting the most from t-SNE may mean incorporating several embeddings with different perplexities into the final training dataset. Most t-SNE encodings I’ve seen use perplexities between 5 and 100, with values between 10 and 50 being the most common. I’ll add 5 different encodings at equal intervals (perplexities 10, 20, 30, 40 and 50) to my final dataset.

In [25]:
import time
import random

from tsne import bh_sne
from sklearn.preprocessing import PolynomialFeatures

def save_tsne(perplexity, dimensions=2, polynomial=False):
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    X_all = np.concatenate([X_train, X_valid, X_test], axis=0)
    if polynomial:
        poly = PolynomialFeatures(degree=2)
        X_all = poly.fit_transform(X_all)

    print('Running TSNE (perplexity: {}, dimensions: {}, polynomial: {})...'.format(perplexity, dimensions, polynomial))
    start_time = time.time()
    tsne_all = bh_sne(X_all, d=dimensions, perplexity=float(perplexity))
    print('TSNE: {}s'.format(time.time() - start_time))

    tsne_train = tsne_all[:X_train.shape[0]]
    assert(len(tsne_train) == len(X_train))

    tsne_valid = tsne_all[X_train.shape[0]:X_train.shape[0]+X_valid.shape[0]]
    assert(len(tsne_valid) == len(X_valid))

    tsne_test = tsne_all[X_train.shape[0]+X_valid.shape[0]:X_train.shape[0]+X_valid.shape[0]+X_test.shape[0]]
    assert(len(tsne_test) == len(X_test))

    if polynomial:
        save_path = '../numerai/T62/tsne_{}d_{}p_poly.npz'.format(dimensions, perplexity)
    else:
        save_path = '../numerai/T62/tsne_{}d_{}p.npz'.format(dimensions, perplexity)

    # store the three slices of the embedding under the keys main() reads back
    np.savez(save_path,
             train=tsne_train,
             valid=tsne_valid,
             test=tsne_test)
    print('Saved: {}'.format(save_path))

Generate t-SNE. Warning: This will take a long time.

In [ ]:
for perplexity in [10, 20, 30, 40, 50]:
    save_tsne(perplexity, polynomial=True)

Let’s check our output again

In [ ]:
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))

Fitting our Model

Having added the new features, I’ll run the final prediction model, again using Jim’s code. The model is a logistic regression that outputs a classification probability.
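For readers who want to see what "outputs a classification probability" and the loss reported on the validation set mean, here is a minimal sketch. The weights, bias, and inputs are made up for illustration, and `binary_log_loss` is a hand-written stand-in for the metric, not the scikit-learn function itself.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Logistic regression: probability of class 1 is sigmoid(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical model and data, purely for illustration.
weights, bias = [2.0, -1.0], 0.0
X = [[0.9, 0.1], [0.2, 0.8]]
y = [1, 0]

p = [predict_probability(x, weights, bias) for x in X]
loss_model = binary_log_loss(y, p)
loss_coin_flip = binary_log_loss(y, [0.5, 0.5])
print(loss_model < loss_coin_flip)
```

Always predicting 0.5 gives a loss of ln 2, roughly 0.693, so a validation loss below that value indicates the model has learned something beyond a coin flip.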

In [ ]:
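from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

class ItemSelector(BaseEstimator, TransformerMixin):
    """Pick one array out of a dict of arrays.

    Assumption: the original pipeline's ItemSelector is not shown in this
    notebook, so this is a minimal stand-in with the same interface.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

def main():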
    # load data
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    tsne_data_10p = np.load('../numerai/T62/tsne_2d_10p_poly.npz')
    tsne_data_20p = np.load('../numerai/T62/tsne_2d_20p_poly.npz')
    tsne_data_30p = np.load('../numerai/T62/tsne_2d_30p_poly.npz')
    tsne_data_40p = np.load('../numerai/T62/tsne_2d_40p_poly.npz')
    tsne_data_50p = np.load('../numerai/T62/tsne_2d_50p_poly.npz')

    # concat features
    X_train_concat = {
        'X': X_train,
        'tsne_10p': tsne_data_10p['train'],
        'tsne_20p': tsne_data_20p['train'],
        'tsne_30p': tsne_data_30p['train'],
        'tsne_40p': tsne_data_40p['train'],
        'tsne_50p': tsne_data_50p['train'],
    }
    X_valid_concat = {
        'X': X_valid,
        'tsne_10p': tsne_data_10p['valid'],
        'tsne_20p': tsne_data_20p['valid'],
        'tsne_30p': tsne_data_30p['valid'],
        'tsne_40p': tsne_data_40p['valid'],
        'tsne_50p': tsne_data_50p['valid'],
    }
    X_test_concat = {
        'X': X_test,
        'tsne_10p': tsne_data_10p['test'],
        'tsne_20p': tsne_data_20p['test'],
        'tsne_30p': tsne_data_30p['test'],
        'tsne_40p': tsne_data_40p['test'],
        'tsne_50p': tsne_data_50p['test'],
    }

    # build pipeline
    classifier = Pipeline(steps=[
        ('features', FeatureUnion(transformer_list=[
            ('tsne_10p', ItemSelector('tsne_10p')),
            ('tsne_20p', ItemSelector('tsne_20p')),
            ('tsne_30p', ItemSelector('tsne_30p')),
            ('tsne_40p', ItemSelector('tsne_40p')),
            ('tsne_50p', ItemSelector('tsne_50p')),
            ('X', ItemSelector('X')),
        ])),
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', MinMaxScaler()),
        ('lr', LogisticRegression(penalty='l2', C=1e-2, n_jobs=-1)),
    ])

    # fit the whole pipeline on the training features and time it
    start_time = time.time()
    classifier.fit(X_train_concat, y_train)
    print('Fit: {}s'.format(time.time() - start_time))

    p_valid = classifier.predict_proba(X_valid_concat)
    loss = log_loss(y_valid, p_valid)
    print('Loss: {}'.format(loss))

    p_test = classifier.predict_proba(X_test_concat)
    df_pred = pd.DataFrame({
        't_id': df_test['t_id'],
        'probability': p_test[:,1]
    })
    csv_path = 'predictions/predictions_{}_{}.csv'.format(int(time.time()), loss)
    df_pred.to_csv(csv_path, columns=('t_id', 'probability'), index=None)
    print('Saved: {}'.format(csv_path))
