Feature Selection

This notebook will aim to provide an explanation and application of different feature ranking methods, namely that of Recursive Feature Elimination (RFE), Stability Selection, linear models as well as Random Forest.

This notebook borrows heavily from an article by Ando Saabas on feature selection found here: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/ as well as Anisotropic’s work on feature selection found here: https://www.kaggle.com/arthurtok/feature-ranking-w-randomforest-rfe-linear-models/code. A lot of Anisotropic’s comments are used in this kernel to explain what is being done.

My work is an application of the techniques and methods written by those gentlemen to the Zillow data competition on Kaggle (https://www.kaggle.com/c/zillow-prize-1). I have included additional competition specific data cleaning and comments in addition to the feature selection code.

There is one point of serious concern with the application of this code to the Zillow competition. In the Zillow competition, we are not estimating home price directly. Zillow is not sharing the actual sale price as a predictor variable (presumably to avoid something like this happening: https://www.wired.com/2010/03/netflix-cancels-contest/) but is having Kagglers estimate the log-loss of their model instead. Weird, I know.

I’ve been left thinking what a feature selection on the log-loss means. Are the most prominent features the ones that Zillow has been most incorrectly using? Are the best features at predicting log loss the worst features to use in a price estimate model? Do we need to do some feature engineering to make the ‘best’ features here more usable?

The contents of this notebook are as follows:

  1. Data Cleaning and Visualisation : This section will revolve around exploring the data and visualising some summary statistics.
  2. Stability Selection via Randomised Lasso Method : Introduce a relatively new feature selection method called “Stability Selection” and using the Randomised Lasso in its implementation
  3. Recursive Feature Elimination : Implementing the Recursive Feature Elimination method of feature ranking via the use of basic Linear Regression
  4. Linear Model Feature Coefficients : Implementing 3 of Sklearn’s linear models (Linear Regression, Lasso and Ridge) and using the inbuilt estimated coefficients for our feature selection
  5. Random Forest Feature Selection : Using the Random Forest’s convenient attribute “feature_importances” to calculate and ultimately rank the feature importance.

Finally, with all the points 1 to 5 above, we will combine the results to create our:

Feature Ranking Matrix : Matrix of all the features along with the respective model scores which we can use in our ranking.

In [167]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.feature_selection import RFE, f_regression
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, RandomizedLasso)
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

1. DATA CLEANSING AND ANALYSIS

Let’s first read in the house data as a dataframe “house” and inspect the first 5 rows

In [185]:
# Load Data
train = pd.read_csv('../input/train_2016_v2.csv')
prop = pd.read_csv('../input/properties_2016.csv')

# Convert to float32 (this is done to post on Kaggle Kernels)
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)

# Merge training dataset with properties dataset
df_train = train.merge(prop, how='left', on='parcelid')

# Remove ID columns
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc', 
                         'propertycountylandusecode'], axis=1)
In [171]:
x_train.head()
Out[171]:
airconditioningtypeid architecturalstyletypeid basementsqft bathroomcnt bedroomcnt buildingclasstypeid buildingqualitytypeid calculatedbathnbr decktypeid finishedfloor1squarefeet numberofstories fireplaceflag structuretaxvaluedollarcnt taxvaluedollarcnt assessmentyear landtaxvaluedollarcnt taxamount taxdelinquencyflag taxdelinquencyyear censustractandblock
0 1.0 NaN NaN 2.0 3.0 NaN 4.0 2.0 NaN NaN NaN NaN 122754.0 360170.0 2015.0 237416.0 6735.88 NaN NaN 6.037107e+13
1 NaN NaN NaN 3.5 4.0 NaN NaN 3.5 NaN NaN NaN NaN 346458.0 585529.0 2015.0 239071.0 10153.02 NaN NaN NaN
2 1.0 NaN NaN 3.0 2.0 NaN 4.0 3.0 NaN NaN NaN NaN 61994.0 119906.0 2015.0 57912.0 11484.48 NaN NaN 6.037464e+13
3 1.0 NaN NaN 2.0 2.0 NaN 4.0 2.0 NaN NaN NaN NaN 171518.0 244880.0 2015.0 73362.0 3048.74 NaN NaN 6.037296e+13
4 NaN NaN NaN 2.5 4.0 NaN NaN 2.5 NaN NaN 2.0 NaN 169574.0 434551.0 2015.0 264977.0 5488.96 NaN NaN 6.059042e+13

5 rows × 55 columns

Now its time for some general data inspection. Let’s first examine to see if there are any nulls in the dataframe as well as look at the type of the data (i.e whether it is a string or numeric)

In [172]:
print(x_train.dtypes)
airconditioningtypeid           float64
architecturalstyletypeid        float64
basementsqft                    float64
bathroomcnt                     float64
bedroomcnt                      float64
buildingclasstypeid             float64
buildingqualitytypeid           float64
calculatedbathnbr               float64
decktypeid                      float64
finishedfloor1squarefeet        float64
calculatedfinishedsquarefeet    float64
finishedsquarefeet12            float64
finishedsquarefeet13            float64
finishedsquarefeet15            float64
finishedsquarefeet50            float64
finishedsquarefeet6             float64
fips                            float64
fireplacecnt                    float64
fullbathcnt                     float64
garagecarcnt                    float64
garagetotalsqft                 float64
hashottuborspa                   object
heatingorsystemtypeid           float64
latitude                        float64
longitude                       float64
lotsizesquarefeet               float64
poolcnt                         float64
poolsizesum                     float64
pooltypeid10                    float64
pooltypeid2                     float64
pooltypeid7                     float64
propertylandusetypeid           float64
rawcensustractandblock          float64
regionidcity                    float64
regionidcounty                  float64
regionidneighborhood            float64
regionidzip                     float64
roomcnt                         float64
storytypeid                     float64
threequarterbathnbr             float64
typeconstructiontypeid          float64
unitcnt                         float64
yardbuildingsqft17              float64
yardbuildingsqft26              float64
yearbuilt                       float64
numberofstories                 float64
fireplaceflag                    object
structuretaxvaluedollarcnt      float64
taxvaluedollarcnt               float64
assessmentyear                  float64
landtaxvaluedollarcnt           float64
taxamount                       float64
taxdelinquencyflag               object
taxdelinquencyyear              float64
censustractandblock             float64
dtype: object

I’ll see what values are in the object type columns — We will have to either break these features up so that for each option there is just a binary (True / False) indicator, or find a way to make the existing features binary.

In [179]:
x_train['taxdelinquencyflag'].value_counts()
Out[179]:
False    88492
True      1783
Name: taxdelinquencyflag, dtype: int64
In [180]:
x_train['fireplaceflag'].value_counts()
Out[180]:
False    90053
True       222
Name: fireplaceflag, dtype: int64
In [181]:
x_train['hashottuborspa'].value_counts()
Out[181]:
False    87910
True      2365
Name: hashottuborspa, dtype: int64

I’ll have to convert the ‘Y’ to a TRUE and otherwise convert any NaN to False’s

In [177]:
x_train['taxdelinquencyflag'] = x_train['taxdelinquencyflag'] == 'Y'
In [178]:
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
In [60]:
# Looking for nulls
print(x_train.isnull().sum())
airconditioningtypeid           61494
architecturalstyletypeid        90014
basementsqft                    90232
bathroomcnt                         0
bedroomcnt                          0
buildingclasstypeid             90259
buildingqualitytypeid           32911
calculatedbathnbr                1182
decktypeid                      89617
finishedfloor1squarefeet        83419
calculatedfinishedsquarefeet      661
finishedsquarefeet12             4679
finishedsquarefeet13            90242
finishedsquarefeet15            86711
finishedsquarefeet50            83419
finishedsquarefeet6             89854
fips                                0
fireplacecnt                    80668
fullbathcnt                      1182
garagecarcnt                    60338
garagetotalsqft                 60338
hashottuborspa                      0
heatingorsystemtypeid           34195
latitude                            0
longitude                           0
lotsizesquarefeet               10150
poolcnt                         72374
poolsizesum                     89306
pooltypeid10                    89114
pooltypeid2                     89071
pooltypeid7                     73578
propertylandusetypeid               0
rawcensustractandblock              0
regionidcity                     1803
regionidcounty                      0
regionidneighborhood            54263
regionidzip                        35
roomcnt                             0
storytypeid                     90232
threequarterbathnbr             78266
typeconstructiontypeid          89976
unitcnt                         31922
yardbuildingsqft17              87629
yardbuildingsqft26              90180
yearbuilt                         756
numberofstories                 69705
fireplaceflag                       0
structuretaxvaluedollarcnt        380
taxvaluedollarcnt                   1
assessmentyear                      0
landtaxvaluedollarcnt               1
taxamount                           6
taxdelinquencyflag                  0
taxdelinquencyyear              88492
censustractandblock               605
dtype: int64

Yikes, the data is pretty sparse. We’re going to have to figure out how to either remove some features or impute values for some of these.

Temporarily I’ll be imputing medians for the missing values based on the missing value for said column, but its only a stopgap solution before I can do something else (possibly predict missing values?)

In [182]:
x_train = x_train.fillna(x_train.median())

2. Stability Selection via Randomized Lasso

In a nutshell, this method serves to apply the feature selection on different parts of the data and features repeatedly until the results can be aggregated. Therefore stronger features ( defined as being selected as important) will have greater scores in this method as compared to weaker features. Refer to this paper by Nicolai Meinshausen and Peter Buhlmann for a much greater detail on the method : http://stat.ethz.ch/~nicolai/stability.pdf

In this notebook, the Stability Selection method is conveniently inbuilt into sklearn’s randomized lasso model and therefore this will be implemented as follows:

In [84]:
# First extract the target variable which is our Log Error
Y = df_train['logerror'].values
X = x_train.as_matrix()

# Store the column/feature names into a list "colnames"
colnames = x_train.columns

Next, we create a function which will be able to conveniently store our feature rankings obtained from the various methods described here into a Python dictionary. In case you are thinking I created this function, no this is not the case. All credit goes to Ando Saabas and I am only trying to apply what he has discussed in this context.

In [85]:
# Define dictionary to store our rankings
ranks = {}
# Create our function which stores the feature rankings to the ranks dictionary
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    return dict(zip(names, ranks))
In [87]:
# Finally let's run our Selection Stability method with Randomized Lasso
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["rlasso/Stability"] = ranking(np.abs(rlasso.scores_), colnames)
print('finished')
finished

3. Recursive Feature Elimination ( RFE )

Now onto the next method in our feature ranking endeavour. Recursive Feature Elimination or RFE uses a model ( eg. linear Regression or SVM) to select either the best or worst-performing feature, and then excludes the feature. After this, the whole process is iterated until all features in the dataset are used up ( or up to a user-defined limit). Sklearn conveniently possesses a RFE function via the sklearn.feature_selection call and we will use this along with a simple linear regression model for our ranking search as follows:

In [88]:
# Construct our Linear Regression model
lr = LinearRegression(normalize=True)
lr.fit(X,Y)

#stop the search when only the last feature is left
rfe = RFE(lr, n_features_to_select=1, verbose =3 )
rfe.fit(X,Y)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)), colnames, order=-1)
Fitting estimator with 55 features.
Fitting estimator with 54 features.
Fitting estimator with 53 features.
Fitting estimator with 52 features.
Fitting estimator with 51 features.
Fitting estimator with 50 features.
Fitting estimator with 49 features.
Fitting estimator with 48 features.
Fitting estimator with 47 features.
Fitting estimator with 46 features.
Fitting estimator with 45 features.
Fitting estimator with 44 features.
Fitting estimator with 43 features.
Fitting estimator with 42 features.
Fitting estimator with 41 features.
Fitting estimator with 40 features.
Fitting estimator with 39 features.
Fitting estimator with 38 features.
Fitting estimator with 37 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 29 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
Fitting estimator with 5 features.
Fitting estimator with 4 features.
Fitting estimator with 3 features.
Fitting estimator with 2 features.

4. Linear Model Feature Ranking

Now let’s apply 3 different linear models (Linear, Lasso and Ridge Regression) and how the features are selected and prioritised via these models. To achieve this, I shall use the sklearn implementation of these models and in particular the attribute .coef to return the estimated coefficients for each feature in the linear model.

In [89]:
# Using Linear Regression
lr = LinearRegression(normalize=True)
lr.fit(X,Y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), colnames)

# Using Ridge 
ridge = Ridge(alpha = 7)
ridge.fit(X,Y)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)

# Using Lasso
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)

5. Random Forest feature ranking

Sklearn’s Random Forest model also comes with it’s own inbuilt feature ranking attribute and one can conveniently just call it via “featureimportances“. That is what we will be using as follows:

In [105]:
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=2)
rf.fit(X,Y)
ranks["RF"] = ranking(rf.feature_importances_, colnames)
building tree 1 of 50
building tree 3 of 50 building tree 2 of 50building tree 7 of 50building tree 4 of 50
building tree 6 of 50building tree 5 of 50building tree 8 of 50





building tree 9 of 50
building tree 10 of 50
building tree 11 of 50
building tree 12 of 50
building tree 13 of 50
building tree 14 of 50
building tree 15 of 50
building tree 16 of 50
building tree 17 of 50
building tree 18 of 50
building tree 19 of 50
building tree 20 of 50
building tree 21 of 50
building tree 22 of 50
building tree 23 of 50
building tree 24 of 50
building tree 25 of 50
building tree 26 of 50
building tree 27 of 50
building tree 28 of 50building tree 29 of 50

building tree 30 of 50
building tree 31 of 50
building tree 32 of 50
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   21.6s
building tree 33 of 50
building tree 34 of 50
building tree 35 of 50
building tree 36 of 50
building tree 37 of 50
building tree 38 of 50
building tree 39 of 50
building tree 40 of 50
building tree 41 of 50
building tree 42 of 50
building tree 43 of 50
building tree 44 of 50
building tree 45 of 50
building tree 46 of 50
building tree 47 of 50
building tree 48 of 50
building tree 49 of 50
building tree 50 of 50
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   38.3s finished

6. Creating the Feature Ranking Matrix

We combine the scores from the various methods above and output it in a matrix form for convenient viewing as such:

In [133]:
# Create empty dictionary to store the mean value calculated from all the scores
r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print("\t%s" % "\t".join(methods))
for name in colnames:
    print("%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods]))))
	Lasso	LinReg	Mean	RF	RFE	Ridge	rlasso/Stability	Mean
airconditioningtypeid	0.0	0.0	0.13	0.07	0.72	0.01	0.0	0.13
architecturalstyletypeid	0.0	0.0	0.12	0.0	0.69	0.01	0.0	0.12
basementsqft	0.0	0.0	0.1	0.0	0.63	0.0	0.0	0.1
bathroomcnt	0.0	0.0	0.22	0.1	0.85	0.36	0.0	0.22
bedroomcnt	0.0	0.0	0.18	0.19	0.83	0.03	0.0	0.18
buildingclasstypeid	0.0	1.0	0.19	0.0	0.11	0.0	0.0	0.19
buildingqualitytypeid	0.0	0.0	0.12	0.11	0.59	0.01	0.0	0.12
calculatedbathnbr	0.0	0.0	0.2	0.08	0.89	0.23	0.0	0.2
decktypeid	0.0	0.16	0.05	0.0	0.13	0.0	0.0	0.05
finishedfloor1squarefeet	0.0	0.0	0.09	0.04	0.52	0.0	0.0	0.09
calculatedfinishedsquarefeet	0.0	0.0	0.16	0.54	0.39	0.0	0.0	0.16
finishedsquarefeet12	1.0	0.0	0.33	0.51	0.48	0.0	0.0	0.33
finishedsquarefeet13	0.0	0.0	0.1	0.0	0.61	0.0	0.0	0.1
finishedsquarefeet15	0.46	0.0	0.17	0.12	0.41	0.0	0.0	0.17
finishedsquarefeet50	0.13	0.0	0.11	0.05	0.5	0.0	0.0	0.11
finishedsquarefeet6	0.02	0.0	0.08	0.02	0.46	0.0	0.0	0.08
fips	0.0	0.0	0.14	0.01	0.65	0.18	0.0	0.14
fireplacecnt	0.0	0.0	0.18	0.03	0.96	0.11	0.0	0.18
fullbathcnt	0.0	0.0	0.16	0.07	0.8	0.07	0.0	0.16
garagecarcnt	0.0	0.0	0.15	0.03	0.81	0.06	0.0	0.15
garagetotalsqft	0.54	0.0	0.18	0.13	0.43	0.0	0.0	0.18
hashottuborspa	0.0	0.0	0.24	0.02	0.94	0.51	0.0	0.24
heatingorsystemtypeid	0.0	0.0	0.14	0.07	0.76	0.02	0.0	0.14
latitude	0.0	0.0	0.18	0.88	0.2	0.0	0.0	0.18
longitude	0.0	0.0	0.19	0.92	0.22	0.0	0.0	0.19
lotsizesquarefeet	0.0	0.0	0.19	0.89	0.24	0.0	0.0	0.19
poolcnt	0.0	0.01	0.01	0.0	0.07	0.0	0.0	0.01
poolsizesum	0.0	0.0	0.07	0.01	0.44	0.0	0.0	0.07
pooltypeid10	0.0	0.03	0.01	0.0	0.04	0.0	0.0	0.01
pooltypeid2	0.0	0.0	0.02	0.0	0.15	0.0	0.0	0.02
pooltypeid7	0.0	0.01	0.01	0.0	0.06	0.0	0.0	0.01
propertylandusetypeid	0.0	0.0	0.13	0.09	0.7	0.0	0.0	0.13
rawcensustractandblock	0.0	0.0	0.1	0.27	0.33	0.0	0.0	0.1
regionidcity	0.0	0.0	0.07	0.25	0.17	0.0	0.0	0.07
regionidcounty	0.49	0.0	0.14	0.01	0.35	0.0	0.0	0.14
regionidneighborhood	0.0	0.0	0.08	0.31	0.19	0.0	0.0	0.08
regionidzip	0.03	0.0	0.12	0.44	0.26	0.0	0.0	0.12
roomcnt	0.0	0.0	0.13	0.06	0.74	0.01	0.0	0.13
storytypeid	0.0	0.01	0.02	0.0	0.09	0.0	0.0	0.02
threequarterbathnbr	0.0	0.0	0.34	0.02	1.0	1.0	0.0	0.34
typeconstructiontypeid	0.0	0.0	0.18	0.0	0.93	0.16	0.0	0.18
unitcnt	0.0	0.0	0.15	0.04	0.87	0.02	0.0	0.15
yardbuildingsqft17	0.0	0.0	0.1	0.04	0.56	0.0	0.0	0.1
yardbuildingsqft26	0.0	0.0	0.09	0.0	0.54	0.0	0.0	0.09
yearbuilt	0.0	0.0	0.19	0.58	0.57	0.0	0.0	0.19
numberofstories	0.0	0.0	0.14	0.04	0.67	0.11	0.0	0.14
fireplaceflag	0.0	0.0	0.2	0.0	0.91	0.28	0.0	0.2
structuretaxvaluedollarcnt	0.01	0.0	0.21	1.0	0.28	0.0	0.0	0.21
taxvaluedollarcnt	0.01	0.0	0.18	0.73	0.31	0.0	0.0	0.18
assessmentyear	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
landtaxvaluedollarcnt	0.01	0.0	0.18	0.76	0.3	0.0	0.0	0.18
taxamount	0.31	0.0	0.26	0.86	0.37	0.0	0.0	0.26
taxdelinquencyflag	0.0	0.0	0.3	0.02	0.98	0.83	0.0	0.3
taxdelinquencyyear	0.0	0.0	0.15	0.07	0.78	0.03	0.0	0.15
censustractandblock	0.0	0.0	0.05	0.28	0.02	0.0	0.0	0.05

Now, with the matrix above, the numbers and layout does not seem very easy or pleasant to the eye. Therefore, let’s just collate the mean ranking score attributed to each of the feature and plot that via Seaborn’s factorplot.

In [183]:
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])

# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
In [184]:
# Let's plot the ranking of the features
sns.factorplot(x="Mean Ranking", y="Feature", data = meanplot, kind="bar", size=10, 
               aspect=1, palette='coolwarm')
Out[184]:
<seaborn.axisgrid.FacetGrid at 0x126458b50>

Conclusion

The top 3 features are “Three Quarter Bathroom Count”, “Finished Square Feet”, and “Tax Delinquincy Flag”. “Tax amount” and “Has Hot Tub or Spa” are 4th and 5th.

To continue the discussion from up top – these are the features that are best at predicting Zillow’s log-loss on their price estimate model.

These rankings are showing us, I believe, which features Zillow has been using the most incorrectly to estimate their models. I’m sure that ‘Finished Square feet’ would be a great predictor at home price, if I were to write a price prediction algorithm myself. But, I’m thinking, they’ve relied on it too much, or have insufficiently factored in some additional interactions that are causing it to become a bad feature.

Leave a Reply

Your email address will not be published. Required fields are marked *