This notebook explains and applies several feature ranking methods: Recursive Feature Elimination (RFE), Stability Selection, linear models, and Random Forest.

This notebook borrows heavily from an article by Ando Saabas on feature selection found here: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/ as well as Anisotropic’s work on feature selection found here: https://www.kaggle.com/arthurtok/feature-ranking-w-randomforest-rfe-linear-models/code. A lot of Anisotropic’s comments are used in this kernel to explain what is being done.

My work is an application of the techniques and methods written by those gentlemen to the Zillow data competition on Kaggle (https://www.kaggle.com/c/zillow-prize-1). I have added competition-specific data cleaning and comments to the feature selection code.

**There is one point of serious concern with applying this code to the Zillow competition. In the Zillow competition, we are not estimating home price directly. Zillow is not sharing the actual sale price as a target variable (presumably to avoid something like this happening: https://www.wired.com/2010/03/netflix-cancels-contest/) but is having Kagglers estimate the log error of their model instead (the `logerror` column in the training data). Weird, I know.**

**I’ve been left wondering what feature selection on the log error means. Are the most prominent features the ones that Zillow has been using most incorrectly? Are the best features for predicting log error the worst features to use in a price estimate model? Do we need to do some feature engineering to make the ‘best’ features here more usable?**
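To make the target concrete: the competition defines the log error as the log of Zillow's estimate minus the log of the actual sale price. The sketch below computes it for made-up prices. The dollar figures and the helper name `log_error` are illustrative assumptions, and the natural log is assumed.

```python
import math

def log_error(zestimate, sale_price):
    """logerror = log(Zestimate) - log(SalePrice), assuming the natural log."""
    return math.log(zestimate) - math.log(sale_price)

# Hypothetical prices, for illustration only
over = log_error(zestimate=330_000, sale_price=300_000)   # estimate too high -> positive
under = log_error(zestimate=270_000, sale_price=300_000)  # estimate too low  -> negative
exact = log_error(zestimate=300_000, sale_price=300_000)  # perfect estimate  -> 0.0

print(round(over, 4), round(under, 4), exact)
```

A feature that predicts this quantity well is therefore one that is associated with where the estimate misses, not one that is associated with price itself.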

The contents of this notebook are as follows:

1. **Data Cleaning and Visualisation**: This section will revolve around exploring the data and visualising some summary statistics.
2. **Stability Selection via Randomised Lasso Method**: Introduce a relatively new feature selection method called “Stability Selection” and use the Randomised Lasso in its implementation.
3. **Recursive Feature Elimination**: Implementing the Recursive Feature Elimination method of feature ranking via the use of basic Linear Regression.
4. **Linear Model Feature Coefficients**: Implementing 3 of Sklearn’s linear models (Linear Regression, Lasso and Ridge) and using the inbuilt estimated coefficients for our feature selection.
5. **Random Forest Feature Selection**: Using the Random Forest’s convenient attribute `feature_importances_` to calculate and ultimately rank the feature importance.

Finally, with all the points 1 to 5 above, we will combine the results to create our:

**Feature Ranking Matrix** : Matrix of all the features along with the respective model scores which we can use in our ranking.

```
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.feature_selection import RFE, f_regression
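# Note: RandomizedLasso was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this import needs an older scikit-learn release.
from sklearn.linear_model import (LinearRegression, Ridge, Lasso, RandomizedLasso)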
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
```

# 1. Data Cleansing and Analysis

Let’s first read in the training and property data, merge them into a dataframe `df_train`, and inspect the first 5 rows of the feature dataframe `x_train`.

```
# Load Data
train = pd.read_csv('../input/train_2016_v2.csv')
prop = pd.read_csv('../input/properties_2016.csv')
# Convert to float32 (this is done to post on Kaggle Kernels)
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)
# Merge training dataset with properties dataset
df_train = train.merge(prop, how='left', on='parcelid')
# Remove ID columns
x_train = df_train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc',
                         'propertycountylandusecode'], axis=1)
```

```
x_train.head()
```

Now it’s time for some general data inspection. Let’s first examine whether there are any nulls in the dataframe and look at the type of the data (i.e. whether it is a string or numeric).

```
print(x_train.dtypes)
```

Next, I’ll check which values appear in the object-type columns. We will have to either break these features up so that each option gets its own binary (True / False) indicator, or find a way to make the existing features binary.

```
x_train['taxdelinquencyflag'].value_counts()
```

```
x_train['fireplaceflag'].value_counts()
```

```
x_train['hashottuborspa'].value_counts()
```

I’ll convert ‘Y’ to True and any NaN to False.

```
x_train['taxdelinquencyflag'] = x_train['taxdelinquencyflag'] == 'Y'
```

```
for c in x_train.dtypes[x_train.dtypes == object].index.values:
    x_train[c] = (x_train[c] == True)
```
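As a small check of the conversion above, here is a sketch on a toy frame. The values are made up, and it assumes (as the value counts suggest) that one flag column holds 'Y' or NaN while the other two hold True or NaN.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the three flag columns (values are made up)
flags = pd.DataFrame({
    "taxdelinquencyflag": ["Y", np.nan, "Y", np.nan],
    "fireplaceflag": [True, np.nan, np.nan, True],
    "hashottuborspa": [np.nan, True, np.nan, np.nan],
})

# 'Y' -> True, everything else (including NaN) -> False
flags["taxdelinquencyflag"] = flags["taxdelinquencyflag"] == "Y"

# The remaining object columns hold True or NaN; comparing with True maps NaN to False
for col in flags.dtypes[flags.dtypes == object].index.values:
    flags[col] = flags[col] == True  # noqa: E712

print(flags.dtypes)
print(flags)
```

After this step every flag column is a plain boolean, so the later median imputation and the models can treat them as 0/1 values.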

```
# Looking for nulls
print(x_train.isnull().sum())
```

Yikes, the data is pretty sparse. We’re going to have to figure out how to either remove some features or impute values for some of these.

For now I’ll fill each column’s missing values with that column’s median, but it’s only a stopgap until I can do something better (possibly predicting the missing values?).

```
x_train = x_train.fillna(x_train.median())
```
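To see what this one-liner does, here is a minimal sketch on a toy frame. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with missing values
toy = pd.DataFrame({
    "bathroomcnt": [1.0, 2.0, np.nan, 3.0],
    "poolcnt": [np.nan, 1.0, np.nan, 1.0],
})

# DataFrame.median() skips NaN by default and returns one median per column
medians = toy.median()

# fillna with a Series fills each column with the value under its own name
filled = toy.fillna(medians)

print(medians)
print(filled)
```

Note how every missing `poolcnt` becomes 1.0 in this toy example, which is one reason median imputation on sparse columns is best treated as a stopgap.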

# 2. Stability Selection via Randomized Lasso

In a nutshell, this method applies a feature selection algorithm repeatedly to different subsets of the data and features, then aggregates the results. Stronger features (those selected as important more often) therefore receive higher scores than weaker features. Refer to this paper by Nicolai Meinshausen and Peter Bühlmann for much greater detail on the method: http://stat.ethz.ch/~nicolai/stability.pdf

In this notebook, the Stability Selection method is conveniently inbuilt into sklearn’s randomized lasso model and therefore this will be implemented as follows:

```
# First extract the target variable which is our Log Error
Y = df_train['logerror'].values
X = x_train.values
# Store the column/feature names into a list "colnames"
colnames = x_train.columns
```

Next, we create a function that conveniently stores the feature rankings obtained from the various methods described here in a Python dictionary. I did not write this function myself; all credit goes to Ando Saabas, and I am only applying what he has discussed in this context.

```
# Define dictionary to store our rankings
ranks = {}
# Create our function which stores the feature rankings to the ranks dictionary
def ranking(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x,2), ranks)
    return dict(zip(names, ranks))
```
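A quick usage sketch may help show what `ranking` returns. The function is restated here so the block runs on its own (with a list comprehension in place of `map`), and the feature names and scores are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def ranking(ranks, names, order=1):
    """Min-max scale raw scores to [0, 1] and map them to feature names."""
    minmax = MinMaxScaler()
    scaled = minmax.fit_transform(order * np.array([ranks]).T).T[0]
    return dict(zip(names, [round(float(x), 2) for x in scaled]))

names = ["a", "b", "c"]  # hypothetical feature names

# Higher raw score = better feature (e.g. absolute coefficients)
print(ranking([10.0, 20.0, 30.0], names))
# → {'a': 0.0, 'b': 0.5, 'c': 1.0}

# Lower raw value = better feature (e.g. RFE ranks), so flip the sign with order=-1
print(ranking([1.0, 2.0, 3.0], names, order=-1))
# → {'a': 1.0, 'b': 0.5, 'c': 0.0}
```

Because every method's scores end up on the same 0 to 1 scale, they can be averaged across methods later on.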

```
# Finally let's run our Stability Selection method with Randomized Lasso
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["rlasso/Stability"] = ranking(np.abs(rlasso.scores_), colnames)
print('finished')
```
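Since `RandomizedLasso` is not available in recent scikit-learn releases, the idea can be approximated by hand. The following is a rough sketch, not the library's implementation: it repeatedly subsamples rows, randomly shrinks some columns (which penalises those features more heavily), fits a plain `Lasso`, and records how often each coefficient is non-zero. The function name `stability_scores`, its parameters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def stability_scores(X, y, alpha=0.04, n_rounds=50, sample_fraction=0.75,
                     scaling=0.5, seed=0):
    """Fraction of rounds in which each feature gets a non-zero Lasso coefficient."""
    rng = np.random.default_rng(seed)
    X = StandardScaler().fit_transform(X)  # put features on a common scale
    n_samples, n_features = X.shape
    n_sub = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Draw a random subsample of the rows without replacement
        rows = rng.choice(n_samples, size=n_sub, replace=False)
        # Shrinking a column by `scaling` is equivalent to a larger L1 penalty on it
        weights = np.where(rng.random(n_features) < 0.5, scaling, 1.0)
        model = Lasso(alpha=alpha).fit(X[rows] * weights, y[rows])
        counts += model.coef_ != 0
    return counts / n_rounds

# Synthetic data: only the first two of six features drive the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

scores = stability_scores(X_demo, y_demo)
print(np.round(scores, 2))
```

Features that are selected in nearly every round score close to 1, while features that only appear under particular subsamples score much lower.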

# 3. Recursive Feature Elimination (RFE)

Now onto the next method in our feature ranking endeavour. Recursive Feature Elimination, or RFE, uses a model (e.g. linear regression or SVM) to select either the best- or worst-performing feature and then sets that feature aside. The process is then repeated on the remaining features until all features in the dataset are used up (or a user-defined limit is reached). Sklearn conveniently provides an RFE class in sklearn.feature_selection, and we will use it along with a simple linear regression model for our ranking search as follows:

```
# Construct our Linear Regression model
lr = LinearRegression(normalize=True)
lr.fit(X,Y)
# Stop the search when only the last feature is left
rfe = RFE(lr, n_features_to_select=1, verbose=3)
rfe.fit(X,Y)
ranks["RFE"] = ranking(list(map(float, rfe.ranking_)), colnames, order=-1)
```
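To illustrate what `rfe.ranking_` holds, here is a small sketch on synthetic data where the true importance order is known. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# True coefficients shrink from left to right; the last feature is pure noise
y_demo = (5 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=300))

# Drop the weakest feature one at a time until a single feature remains
rfe_demo = RFE(LinearRegression(), n_features_to_select=1).fit(X_demo, y_demo)
print(rfe_demo.ranking_)
# → [1 2 3 4]
```

Rank 1 marks the last surviving (strongest) feature and the largest rank marks the first one eliminated, which is why the notebook passes `order=-1` to `ranking` so that lower ranks turn into higher scores.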

# 4. Linear Model Feature Ranking

Now let’s apply 3 different linear models (Linear, Lasso and Ridge Regression) and see how the features are selected and prioritised via these models. To achieve this, I shall use the sklearn implementation of these models and in particular the attribute `coef_` to return the estimated coefficients for each feature in the linear model.

```
# Using Linear Regression
lr = LinearRegression(normalize=True)
lr.fit(X,Y)
ranks["LinReg"] = ranking(np.abs(lr.coef_), colnames)
# Using Ridge
ridge = Ridge(alpha = 7)
ridge.fit(X,Y)
ranks['Ridge'] = ranking(np.abs(ridge.coef_), colnames)
# Using Lasso
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = ranking(np.abs(lasso.coef_), colnames)
```
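One caveat worth keeping in mind when ranking by absolute coefficients is that a coefficient's size depends on the units of its feature. The sketch below, on made-up data with hypothetical feature names, shows the ordering flipping once the features are standardised.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)              # large-scale feature (hypothetical)
rooms = rng.integers(1, 8, size=200).astype(float)   # small-scale feature (hypothetical)
y = 0.001 * sqft + 0.1 * rooms + rng.normal(scale=0.05, size=200)

# Coefficients on the raw features: per-unit effects, so sqft looks tiny
X_raw = np.column_stack([sqft, rooms])
raw_coef = np.abs(LinearRegression().fit(X_raw, y).coef_)

# Coefficients on standardised features: effect per standard deviation
X_std = StandardScaler().fit_transform(X_raw)
std_coef = np.abs(LinearRegression().fit(X_std, y).coef_)

print("raw:         ", np.round(raw_coef, 4))
print("standardised:", np.round(std_coef, 4))
```

On the raw data `rooms` appears far more important than `sqft`. After standardising, `sqft` comes out ahead, because its spread in the data accounts for more of the variation in the target.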

# 5. Random Forest feature ranking

Sklearn’s Random Forest model also comes with its own inbuilt feature ranking attribute, which one can conveniently access via `feature_importances_`. That is what we will be using as follows:

```
rf = RandomForestRegressor(n_jobs=-1, n_estimators=50, verbose=2)
rf.fit(X,Y)
ranks["RF"] = ranking(rf.feature_importances_, colnames)
```
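For a feel of what `feature_importances_` looks like, here is a minimal sketch on synthetic data. The data, the variable names, and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# The first feature dominates the target; the last two are pure noise
y_demo = 10 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)
importances = rf_demo.feature_importances_

# One non-negative value per feature, normalised to sum to 1
print(np.round(importances, 3), importances.sum())
```

Since the importances already sum to 1, passing them through `ranking` only rescales them so that the top feature gets 1.0 and the weakest gets 0.0.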

# 6. Creating the Feature Ranking Matrix

We combine the scores from the various methods above and output it in a matrix form for convenient viewing as such:

```
# Create empty dictionary to store the mean value calculated from all the scores
r = {}
for name in colnames:
    r[name] = round(np.mean([ranks[method][name]
                             for method in ranks.keys()]), 2)
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
print("\t%s" % "\t".join(methods))
for name in colnames:
    print("%s\t%s" % (name, "\t".join(map(str,
                      [ranks[method][name] for method in methods]))))
```
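The same matrix can also be built with pandas, which handles the alignment and the mean in a couple of lines. This is an alternative sketch, not the approach used in the cell above. The dictionary `toy_ranks` and its scores are hypothetical, laid out in the same `{method: {feature: score}}` shape as `ranks`.

```python
import pandas as pd

# Hypothetical scores in the same {method: {feature: score}} layout as `ranks`
toy_ranks = {
    "Lasso": {"feat_a": 1.0, "feat_b": 0.0, "feat_c": 0.5},
    "RF":    {"feat_a": 0.5, "feat_b": 0.5, "feat_c": 1.0},
}

# A dict of dicts becomes a frame with features as rows and methods as columns
matrix = pd.DataFrame(toy_ranks)

# Average across methods for each feature
matrix["Mean"] = matrix.mean(axis=1).round(2)

print(matrix.sort_values("Mean", ascending=False))
```

The mean must be computed before the "Mean" column is added, just as the loop above computes `r` before storing it in `ranks`.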

The numbers and layout of the matrix above are not very easy on the eye. Therefore, let’s collate the mean ranking score attributed to each feature and plot that via Seaborn’s factorplot.

```
# Put the mean scores into a Pandas dataframe
meanplot = pd.DataFrame(list(r.items()), columns= ['Feature','Mean Ranking'])
# Sort the dataframe
meanplot = meanplot.sort_values('Mean Ranking', ascending=False)
```

```
# Let's plot the ranking of the features
sns.factorplot(x="Mean Ranking", y="Feature", data=meanplot, kind="bar", size=10,
               aspect=1, palette='coolwarm')
```

# Conclusion

The top 3 features are “Three Quarter Bathroom Count”, “Finished Square Feet”, and “Tax Delinquency Flag”. “Tax amount” and “Has Hot Tub or Spa” are 4th and 5th.

To continue the discussion from up top: these are the features that are best at predicting the log error of *Zillow’s* price estimate model.

These rankings are showing us, I believe, which features Zillow has been using most incorrectly in their model. I’m sure that ‘Finished Square Feet’ would be a great predictor of home price if I were to write a price prediction algorithm myself. But I’m thinking they’ve relied on it too much, or have insufficiently factored in some additional interactions that are causing it to become a *bad* feature.