Model Validation Techniques

In statistics and machine learning, model validation is the process of deciding whether the results of a model are sufficiently generalizable to describe and make predictions on similarly distributed data.

Model validation largely revolves around the tradeoff between bias and variance. Model developers want a model that accurately captures the relationships in the training data, but that also generalizes well to unseen data.

  • The first of these demands is the desire to reduce bias, the error that comes from erroneous assumptions in the learning algorithm. High-bias models are less accurate because they fail to capture all of the relationships between features and predictions that are available in the data; such models are said to underfit, meaning they miss a significant relationship between a feature and the prediction. Low-bias models are often very complex and usually achieve much higher training accuracy as a result. The risk with low-bias models is that they overfit the data: they model the noise present in the data rather than a true relationship between the features and predictions.
  • The second demand on a model developer is to reduce variance, the error that comes from sensitivity to fluctuations in the underlying data. High-variance models generalize poorly to data outside the training set: a model that is (over)fit to the noise in the training data will not make good predictions on new samples. Low-variance models should have reasonable out-of-sample accuracy because they have identified real relationships between the features and predictions rather than noise. (A small sketch illustrating this tradeoff follows below.)
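
To make the tradeoff concrete, here is a minimal, self-contained sketch (not part of the original analysis) that fits polynomials of increasing degree to noisy synthetic data; the degrees and noise level are arbitrary choices for illustration. The low-degree fit underfits (high bias) and the high-degree fit overfits (high variance), which typically shows up as a gap between training and test error.

In [ ]:
# Illustrative sketch of underfitting vs. overfitting on synthetic data.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 60))
y_noisy = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

x_tr, x_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y_noisy,
                                          test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable, overfit (illustrative degrees)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(x_tr))
    test_err = mean_squared_error(y_te, model.predict(x_te))
    print("degree %2d: train MSE %.3f, test MSE %.3f" % (degree, train_err, test_err))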

In the training phase of model development, care is taken to tune models so that they minimize bias and variance as much as possible. In this notebook, I’ll be implementing the most basic validation techniques: the test/train split and cross-validation.

In [47]:
from IPython.display import Image
from IPython.core.display import HTML 

#Loading Necessary Background Packages
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Test/Train Splits

If you’re going to be developing any sort of predictive model, the minimal amount of validation requires splitting your data into testing and training datasets.

In [48]:
# Load data
iris = datasets.load_iris()
X = iris.data 
y = iris.target
In [49]:
# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

We have split the dataset so that 20% of the data is held out for testing. In this scenario the split is random but reproducible: train_test_split shuffles the data by default, and passing random_state=0 fixes the seed so the same rows land in the test set on every run. If you instead wanted to reserve the last 20% of the rows, you would pass shuffle=False. Which approach makes sense largely comes down to whether or not we’re dealing with time-dependent data.

If we’re dealing with a time dimension in our data (e.g. financial analysis), it is more common to reserve the most recent data, say the last 6 months, for testing. The reason is that a random split makes it easy to fool yourself about overfitting: you would still be testing on points interleaved with, and sharing the same general patterns as, the training data, rather than genuinely predicting forward in time.
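
For a time-aware split, one option is scikit-learn’s TimeSeriesSplit, which always trains on earlier rows and tests on the rows that follow. This is only a sketch of the index pattern; the iris data used in this notebook has no time dimension.

In [ ]:
# Time-aware splitting: each successive fold trains on earlier rows and tests on later ones.
from sklearn.model_selection import TimeSeriesSplit

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train on rows 0-%d, test on rows %d-%d"
          % (train_idx[-1], test_idx[0], test_idx[-1]))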

Below I’ll train a model on our training dataset and measure its fit on the test dataset.

In [55]:
# load in SVC - a kernel classifier from the scikit-learn library
from sklearn.svm import SVC

# create an instance of a kernel-based classifier from scikit-learn
classifier = SVC(kernel='rbf')

# Fit on the training dataset
classifier.fit(X_train, y_train)

# Print the accuracy score on the test dataset
print 'Our model correctly classified {} of our test points' \
.format(classifier.score(X_test, y_test))
Our model correctly classified 0.933333333333 of our test points

Cross Validation

Cross validation is used in model development because it gives insight into out-of-sample model performance and helps guard against overfitting. It is often done in addition to a hold-out test set (as above) and is frequently used to find the optimal hyperparameters for a given model.

Cross validation techniques generally split the dataset so that fixed numbers or portions of the data are iteratively placed into testing and training pools. Two of the most common methods of cross validation are K-Folds and Leave One Out.

K-Folds

In K-Folds validation, the data is split into test/train sets k times. Each iteration tests the model on 1/k of the dataset while training it on the remaining (k-1)/k portion of the data. Below is an illustration of the k iterations of k-folds.

In [46]:
Image(url= "http://cse3521.artifice.cc/images/k-fold-cross-validation.jpg")
Out[46]:

K-folds has been empirically shown to exhibit a desirable tradeoff between bias and variance with k of 5 or 10, so you’ll see these values most commonly in tests and examples. [Citation needed] A quick sketch of what the 5-fold splitter produces on our data follows.
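
As a quick sketch (5 folds chosen only for illustration), the loop below prints how many rows land in the training and test pools of each fold on the iris data loaded above.

In [ ]:
# Iterate over the 5 folds and show the size of each training and test pool.
from sklearn.model_selection import KFold

for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X)):
    print("fold %d: %d training rows, %d test rows" % (fold, len(train_idx), len(test_idx)))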

Leave One Out

Leave One Out is a special case of K-Folds in which k equals the number of data points, n.
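
As a small sanity check (a sketch, not part of the original analysis), LeaveOneOut reports one split per data point, so on the 150-row iris data loaded above it yields 150 train/test splits.

In [ ]:
# LeaveOneOut produces n splits, one per observation.
from sklearn.model_selection import LeaveOneOut

print(LeaveOneOut().get_n_splits(X))  # 150 for the iris dataset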

Which is preferred?

As you can imagine, Leave One Out can be quite computationally expensive: each model is essentially fit n times, so it is usually only used on small datasets where the computational hit can be taken. With regard to bias and variance, LOO CV leads to error estimates with lower bias but higher variance: each training set contains n-1 examples, so the training sets overlap almost completely and the resulting estimates are highly correlated.

The opposite is true of k-fold CV: there is relatively less overlap between training sets, so the test error estimates are less correlated and their mean does not have as much variance as the LOO CV estimate.

Is cross validation a replacement for having a test data set?

It is tempting to use cross validation alone in lieu of a test/train split. This is inadvisable because, while cross validation provides some sense of out-of-sample model performance, the final model is still fit on the entirety of the training data, and comparing models this way can be misleading because some models are simply much better at overfitting the data given to them.

As above, we will use the Iris dataset for this classification exercise.

In [53]:
# import cross validation packages
from sklearn.model_selection import KFold, LeaveOneOut

# split into test and train
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

The cross validation splitters take the number of folds but are otherwise very simple to initialize.

In [54]:
## Building KFolds with 5 Folds
KFold_5 = KFold(5)

## Building KFolds with 10 Folds
KFold_10 = KFold(10)

## Building LOO, leaving one out (naturally) 
LooCV = LeaveOneOut()

To demonstrate the use of the cross validation splitters, I will first display the prediction score of each fold of our 5-fold cross validation.

Scikit-learn has a useful set of functions for iterating through each fold of a cross validation scheme and for making predictions with models fit under cross validation. Plenty of estimators have built-in cross validation variants (e.g. LassoCV, LogisticRegressionCV), but this is a model-agnostic framework, which is great for comparing the performance of many models against each other!

In [71]:
from sklearn.model_selection import cross_val_predict, cross_val_score

# load in Logistic Regression 
from sklearn.linear_model import LogisticRegression

# create an instance of Logistic Regression with default options
classifier = LogisticRegression()

# Display the prediction score of each fold
cross_val_score(classifier, X_train, y_train, cv=KFold_5)
Out[71]:
array([ 0.83333333,  0.95833333,  1.        ,  1.        ,  0.875     ])

As you can see, not every fold was predicted equally well in our cross validation scheme.
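
One common way to summarize these fold scores, shown in the short sketch below, is to report their mean and spread; this reuses the classifier, training data, and KFold_5 splitter defined above.

In [ ]:
# Summarize the per-fold accuracies with a mean and standard deviation.
scores = cross_val_score(classifier, X_train, y_train, cv=KFold_5)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))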

We will use the cross_val_predict function to see how well our model works out of sample.

In [75]:
# Generate Predictions
predicted = cross_val_predict(classifier, X_test, y_test)

# Measure model performance
from sklearn import metrics
metrics.accuracy_score(y_test, predicted)
Out[75]:
0.90000000000000002

Comparing performance of different models

Below I’ll be using our cross validation methodology (k-fold with 5 folds) to compare the out-of-sample performance of a number of models. For each model, cross_val_predict fits it 5 times, each time on 4/5 of the data, and predicts the held-out fifth with the fit that never saw it; the pooled out-of-fold predictions are then scored against the true labels.

I’m testing the performance of an assortment of classifiers with fairly arbitrary hyperparameters. If I wanted to get fancy I’d tune each of those hyperparameters individually, but as a first pass this is a good way to see which models look promising on a new dataset.

In [81]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
In [84]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

for name, clf in zip(names, classifiers):
    predicted = cross_val_predict(clf, X_test, y_test, cv=KFold_5)
    print name, metrics.accuracy_score(y_test, predicted)
Nearest Neighbors 0.9
Linear SVM 0.8
RBF SVM 0.866666666667
Gaussian Process 0.966666666667
Decision Tree 0.9
Random Forest 0.933333333333
Neural Net 0.866666666667
AdaBoost 0.9
Naive Bayes 0.933333333333
QDA 0.766666666667

Discussion

Cross validation and the test/train split are the most basic methods of model validation you can do. Bootstrapping and bagging are two further techniques in which models are trained and evaluated on random resamples of the training set, so that the variability of the training data itself is taken into account.
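
As a rough, hedged sketch of the bagging idea (not how the models above were fit), scikit-learn’s BaggingClassifier trains each base estimator on a bootstrap resample of the training data; the decision tree base estimator and the number of estimators here are arbitrary illustrative choices.

In [ ]:
# Bagging sketch: each of the 25 trees is fit on a bootstrap resample of X_train.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagger = BaggingClassifier(DecisionTreeClassifier(max_depth=5),
                           n_estimators=25, bootstrap=True, random_state=0)
bagger.fit(X_train, y_train)
print("bagged accuracy on the held-out test set: %.3f" % bagger.score(X_test, y_test))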

The success of ensemble meta-model techniques like Random Forests is largely due to the built-in validation, via bootstrap resampling of the training data, that is used to create the final model. To learn more about how these models improve the bias-variance tradeoff, see my post about ensemble models.
