In this notebook I experiment with adding features to the Numerai dataset using t-distributed stochastic neighbor embedding (t-SNE). t-SNE is a machine-learning dimensionality reduction technique that, much like PCA, is often used to visualize data in two or three dimensions. Unlike PCA, it can also capture non-linear, non-parametric structure in the data. A number of models have benefited from adding t-SNE features before running the main classification algorithm.

t-SNE works by converting the pairwise similarities between points into a probability distribution, then finding an embedding in a lower-dimensional space whose neighbor distribution matches it. The original coordinates of the data are no longer recoverable, but local similarities between points are preserved.
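As a quick illustration of the idea (my own toy example, not part of the original pipeline), here is scikit-learn's `TSNE` embedding a small synthetic dataset into two dimensions:

```python
# Minimal sketch: embed two well-separated 10-dimensional clusters into 2-D.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
# 50 points around the origin, 50 points shifted away from it
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(8, 1, (50, 10))])

# map the 10-D points to 2-D; neighborhoods are preserved, but the
# original coordinates cannot be recovered from the embedding
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (100, 2)
```

Plotting `embedding` would show the two clusters as two distinct groups of points, which is exactly the local structure we want to hand to a downstream model as extra features.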

I suggest the following video if you want a theoretical understanding of t-SNE from its creator: https://www.youtube.com/watch?v=RJVL80Gg3lA&t

And a practical guide to tuning its parameters: https://distill.pub/2016/misread-tsne/

I’m using an adapted version of Jim Fleming’s t-SNE pipeline, found here: https://github.com/jimfleming/numerai/blob/master/prep_data.py

### Why apply t-SNE to Numerai’s dataset?

If you’ve worked with Numerai data before, you’ll have noticed that it contains no feature names or descriptions, and that each feature has been normalized to between 0 and 1. Richard Craib, the founder of Numerai, explains the reason for and method of this data encryption here: https://medium.com/numerai/encrypted-data-for-efficient-markets-fffbe9743ba8. As such, adding features via dimensionality reduction is the only sensible way I can think of to squeeze additional features into the training dataset.

**Note: This notebook is incomplete and outputs have not been uploaded. It also takes about 10 hours to run, and I don’t recommend doing so unless you have some sort of cloud computation resources! My next post will share the data infrastructure I’ve built out on AWS to make this easier.**

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

```
# Let's see what's in our input folder
from subprocess import check_output
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))
```

# Data Prep

Jim Fleming does something pretty cool in this pipeline that I’ve recently incorporated into all my work. Rather than splitting the training data arbitrarily into training and validation sets, he chooses as his validation set the training points that look most like the test set. This is done by running a random forest classifier with 5-fold cross-validation to predict whether each row comes from the training or the test set, then sorting the training rows by the model’s confidence that they belong to the test set. The rows the model most confidently mistakes for test data are, we hypothesize, the training points most representative of the test distribution.

```
import time
import random
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def prep():
    df_train = pd.read_csv('../numerai/T62/numerai_training_data.csv')
    df_test = pd.read_csv('../numerai/T62/numerai_tournament_data.csv')

    feature_cols = list(df_train.columns[3:-1])
    target_col = df_train.columns[-1]
    test_col = 'is_test'
    id_col = 't_id'

    # label each row by origin so we can train a train-vs-test classifier
    df_train['is_test'] = 0
    df_test['is_test'] = 1

    df_data = pd.concat([df_train, df_test])
    df_data = df_data.reindex(columns=feature_cols + [test_col, target_col])

    X_split = df_data[feature_cols]
    y_split = df_data[test_col]

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=67)

    predictions = np.zeros(y_split.shape)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=67)
    for i, (train_i, test_i) in enumerate(kfold.split(X_split, y_split)):
        print("Fold #{}".format(i + 1))

        X_split_train = X_split.iloc[train_i]
        y_split_train = y_split.iloc[train_i]

        X_split_test = X_split.iloc[test_i]
        y_split_test = y_split.iloc[test_i]

        rf.fit(X_split_train, y_split_train)
        p = rf.predict_proba(X_split_test)[:, 1]

        auc = roc_auc_score(y_split_test, p)
        print("AUC: {:.2f}".format(auc))

        predictions[test_i] = p

    # sort all rows by predicted probability of being a test row
    i = predictions.argsort()
    df_sorted = df_data.iloc[i]

    # keep only the training rows
    df_train_sorted = df_sorted.loc[df_sorted.is_test == 0]

    # drop the train/test indicator column
    df_train_sorted = df_train_sorted.drop([test_col], axis='columns')

    # verify no training rows were lost
    assert(df_train_sorted[target_col].sum() == df_train[target_col].sum())

    # grab the first rows as train and the last rows (those that look most
    # like the test set) as validation
    validation_size = int(len(df_train_sorted) * 0.1)
    df_train = df_train_sorted.iloc[:-validation_size]
    df_valid = df_train_sorted.iloc[-validation_size:]

    print('Creating dataset with validation size: {}'.format(validation_size))
    df_train.to_csv('../numerai/T62/train_data.csv', index=False)
    df_valid.to_csv('../numerai/T62/valid_data.csv', index=False)
    df_test.to_csv('../numerai/T62/test_data.csv', index=False)
    print('Done.')
```

```
prep()
```

Let's make sure our data landed in the right folder:

```
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))
```

# t-SNE Feature Encoding

The below t-SNE feature encoding is a modified version of Jim Fleming’s code. The function takes as input a perplexity value, which is, in a sense, similar to the *k* in a nearest-neighbors algorithm. It controls the trade-off between local and global patterns in the data. With a smaller perplexity, local variations dominate the t-SNE plot; a larger perplexity causes larger affinity clusters to form.

Getting the most from t-SNE may mean incorporating multiple embeddings with different perplexities into the final training dataset. Most t-SNE encodings I’ve seen use perplexities between 5 and 100, with values between 10 and 50 being the most common. I’ll be including five different encodings at equal intervals in my final dataset.
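To make the perplexity trade-off concrete, here is a small sketch (my own toy data, with scikit-learn's `TSNE` standing in for `bh_sne`) that produces one 2-D embedding per perplexity. In the real pipeline, each embedding contributes two extra feature columns:

```python
# Sketch: embed the same data at a small and a large perplexity.
# A small perplexity emphasizes fine local structure; a large one
# pulls points into broader clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(42)
X = rng.normal(size=(120, 20))  # 120 samples, 20 anonymous features

embeddings = {}
for perplexity in (5, 50):  # perplexity must be smaller than n_samples
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, random_state=42
    ).fit_transform(X)

# each entry is an (n_samples, 2) array that could be hstacked onto the
# original feature matrix as two additional columns
print({p: e.shape for p, e in embeddings.items()})
```

Comparing scatter plots of `embeddings[5]` and `embeddings[50]` is the quickest way to see how much the perplexity choice changes the picture, which is why the distill.pub article linked above is worth reading before trusting any single embedding.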

```
import time
import random
from tsne import bh_sne
from sklearn.preprocessing import PolynomialFeatures

def save_tsne(perplexity, dimensions=2, polynomial=False):
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    # embed train, validation, and test together so they share one t-SNE space
    X_all = np.concatenate([X_train, X_valid, X_test], axis=0)

    if polynomial:
        poly = PolynomialFeatures(degree=2)
        X_all = poly.fit_transform(X_all)

    print('Running TSNE (perplexity: {}, dimensions: {}, polynomial: {})...'.format(perplexity, dimensions, polynomial))
    start_time = time.time()
    # bh_sne expects float64 input
    tsne_all = bh_sne(X_all.astype('float64'), d=dimensions, perplexity=float(perplexity))
    print('TSNE: {}s'.format(time.time() - start_time))

    # split the combined embedding back into its train / valid / test pieces
    tsne_train = tsne_all[:X_train.shape[0]]
    assert(len(tsne_train) == len(X_train))

    tsne_valid = tsne_all[X_train.shape[0]:X_train.shape[0] + X_valid.shape[0]]
    assert(len(tsne_valid) == len(X_valid))

    tsne_test = tsne_all[X_train.shape[0] + X_valid.shape[0]:]
    assert(len(tsne_test) == len(X_test))

    if polynomial:
        save_path = '../numerai/T62/tsne_{}d_{}p_poly.npz'.format(dimensions, perplexity)
    else:
        save_path = '../numerai/T62/tsne_{}d_{}p.npz'.format(dimensions, perplexity)

    np.savez(save_path, train=tsne_train, valid=tsne_valid, test=tsne_test)
    print('Saved: {}'.format(save_path))
```

Generate the t-SNE features. Warning: this will take a long time.

```
for perplexity in [10, 20, 30, 40, 50]:
    save_tsne(perplexity, polynomial=True)
```

Let's check our output again:

```
print(check_output(["ls", "../numerai/T62/"]).decode("utf8"))
```

# Fitting our Model

Having added the new features, I’ll run the final prediction model, again using Jim’s code. The model is a logistic regression that outputs a classification probability.
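One detail worth calling out: the pipeline below feeds *dicts* of arrays (raw features plus each t-SNE embedding) into a `FeatureUnion` via an `ItemSelector` transformer. `ItemSelector` is not a scikit-learn built-in; it's a small custom transformer, as in Jim Fleming's repo and scikit-learn's heterogeneous-data examples. A minimal self-contained sketch of the pattern, using toy shapes and hypothetical key names:

```python
# Sketch: FeatureUnion over a dict of arrays, keyed by feature group.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class ItemSelector(BaseEstimator, TransformerMixin):
    """Pull a single array out of a dict of feature arrays."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.key]

# toy feature dict: 4 samples with 3 raw features and a 2-D "embedding"
X_concat = {
    'X': np.ones((4, 3)),
    'tsne': np.zeros((4, 2)),
}

# the union selects each array and hstacks them into one (4, 5) matrix
union = FeatureUnion(transformer_list=[
    ('X', ItemSelector('X')),
    ('tsne', ItemSelector('tsne')),
])
print(union.fit_transform(X_concat).shape)  # (4, 5)
```

The real pipeline does the same thing with six keys (the raw features plus five t-SNE embeddings) before the polynomial expansion, scaling, and logistic regression steps.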

```
import time
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single array from a dict of feature arrays."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

def main():
    # load data
    df_train = pd.read_csv('../numerai/T62/train_data.csv')
    df_valid = pd.read_csv('../numerai/T62/valid_data.csv')
    df_test = pd.read_csv('../numerai/T62/test_data.csv')

    feature_cols = list(df_train.columns[:-1])
    target_col = df_train.columns[-1]

    X_train = df_train[feature_cols].values
    y_train = df_train[target_col].values

    X_valid = df_valid[feature_cols].values
    y_valid = df_valid[target_col].values

    X_test = df_test[feature_cols].values

    # load the saved t-SNE embeddings
    tsne_data_10p = np.load('../numerai/T62/tsne_2d_10p_poly.npz')
    tsne_data_20p = np.load('../numerai/T62/tsne_2d_20p_poly.npz')
    tsne_data_30p = np.load('../numerai/T62/tsne_2d_30p_poly.npz')
    tsne_data_40p = np.load('../numerai/T62/tsne_2d_40p_poly.npz')
    tsne_data_50p = np.load('../numerai/T62/tsne_2d_50p_poly.npz')

    # concat features
    X_train_concat = {
        'X': X_train,
        'tsne_10p': tsne_data_10p['train'],
        'tsne_20p': tsne_data_20p['train'],
        'tsne_30p': tsne_data_30p['train'],
        'tsne_40p': tsne_data_40p['train'],
        'tsne_50p': tsne_data_50p['train'],
    }
    X_valid_concat = {
        'X': X_valid,
        'tsne_10p': tsne_data_10p['valid'],
        'tsne_20p': tsne_data_20p['valid'],
        'tsne_30p': tsne_data_30p['valid'],
        'tsne_40p': tsne_data_40p['valid'],
        'tsne_50p': tsne_data_50p['valid'],
    }
    X_test_concat = {
        'X': X_test,
        'tsne_10p': tsne_data_10p['test'],
        'tsne_20p': tsne_data_20p['test'],
        'tsne_30p': tsne_data_30p['test'],
        'tsne_40p': tsne_data_40p['test'],
        'tsne_50p': tsne_data_50p['test'],
    }

    # build pipeline
    classifier = Pipeline(steps=[
        ('features', FeatureUnion(transformer_list=[
            ('tsne_10p', ItemSelector('tsne_10p')),
            ('tsne_20p', ItemSelector('tsne_20p')),
            ('tsne_30p', ItemSelector('tsne_30p')),
            ('tsne_40p', ItemSelector('tsne_40p')),
            ('tsne_50p', ItemSelector('tsne_50p')),
            ('X', ItemSelector('X')),
        ])),
        ('poly', PolynomialFeatures(degree=2)),
        ('scaler', MinMaxScaler()),
        ('lr', LogisticRegression(penalty='l2', C=1e-2, n_jobs=-1)),
    ])

    print('Fitting...')
    start_time = time.time()
    classifier.fit(X_train_concat, y_train)
    print('Fit: {}s'.format(time.time() - start_time))

    p_valid = classifier.predict_proba(X_valid_concat)
    loss = log_loss(y_valid, p_valid)
    print('Loss: {}'.format(loss))

    p_test = classifier.predict_proba(X_test_concat)
    df_pred = pd.DataFrame({
        't_id': df_test['t_id'],
        'probability': p_test[:, 1]
    })
    csv_path = 'predictions/predictions_{}_{}.csv'.format(int(time.time()), loss)
    df_pred.to_csv(csv_path, columns=('t_id', 'probability'), index=None)
    print('Saved: {}'.format(csv_path))
```