My Stint with Kaggle House Prices

September 1, 2017

Kaggle House Prices

The Kaggle competition “House Prices” uses a data set presented in the paper “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project” by Dean De Cock (Truman State University), Journal of Statistics Education, Volume 19, Number 3 (2011). The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010 and contains 2930 data instances. It comprises 79 predictor features, of which 23 are nominal (categorical), 23 ordinal, 14 discrete, and 19 continuous. The response variable is SalePrice, the price of the house in USD. The training and test data are provided in the files train.csv and test.csv. I achieved a score of 0.12234 on submitting the regression results to Kaggle, using a good old Ridge model and some feature engineering tricks. The full code is available on my Github repo.

Reading the data

The “House Prices” data is available as two CSV files. A Pandas DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. Using Pandas, the raw training and testing data is read into two DataFrames train and test.

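import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
print(train.shape)
print(test.shape)  # one column fewer: test.csv has no SalePrice
> (1460, 81)
> (1459, 80)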

Feature Engineering

Feature engineering can be defined as carving the raw data into the features that best represent the problem at hand, which in turn improves the performance of the predictive model. There is no well-defined recipe for arriving at the desired features, but feature engineering typically includes feature ranking, feature selection, and the creation of new features.

The key ingredient in solving regression problems is to observe the data. For this problem, studying the data helped me gain insight into the following characteristics:

Distribution of variable values – If the distribution is heavily skewed, the values need to be de-skewed.

The number of predictors – Too many predictors can lead to overfitting; on the other hand, too few predictors will cause the model to underfit.

Data correlation – The correlation of each predictor with the response variable, and the correlation among the predictors themselves.

Removing skewness from the response variable SalePrice

I draw the density plot and histogram of SalePrice to observe the data distribution. The kernel density estimate (KDE) plot is useful for showing the shape of the distribution.

For reference: like a histogram, a KDE plot encodes the density of observations on one axis with height along the other axis. A histogram, however, can be a poor depiction of the distribution because its shape depends on the number of bins. A KDE instead places a smooth kernel on every observation, which respects the continuous nature of the underlying random variable and results in a much smoother estimate.

I am using the seaborn plotting library, which provides a high-level interface on top of Matplotlib.

import seaborn as sns

SalePrice Histogram – original

It can be inferred that houses priced above $400,000 are very few, and their inclusion increases the range of the response without adding much information. To be exact, there are just 28 data instances with SalePrice > $400,000. So, let’s discard these data instances. The histogram below shows the modified data set:

SalePrice Histogram – after removing data > 400000

train.drop(train.index[train["SalePrice"] > 400000],inplace=True)

Observe that the tail stretches towards the right side of the plot, which shows that SalePrice is skewed to the right. A log transformation can bring a strongly skewed distribution closer to a normal one. Since none of the entries in SalePrice is zero, we can use the log function; otherwise we should use the log1p function, which computes log(1+x).

import numpy as np
y = np.log(train["SalePrice"])
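
As a rough illustration of why the log transform helps, the sketch below compares the sample skewness of a right-skewed series before and after the transform. The data is synthetic: the lognormal parameters are made-up assumptions, not values taken from the Ames data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices"; the parameters are illustrative assumptions.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skew_before = prices.skew()          # clearly positive for a right-skewed sample
skew_after = np.log(prices).skew()   # near 0: the log of a lognormal sample is normal

# log1p(x) computes log(1 + x), so it is defined at x = 0 where log(x) is not.
zero_safe = np.log1p(0.0)

print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```

A skewness near zero after the transform is what one would hope to see on the real SalePrice column as well.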

Feature reduction using Feature Ranking

For the next step in feature engineering, I perform feature ranking based on the correlation coefficient. I calculate the correlation coefficient of the numeric predictor variables with SalePrice and sort them in descending order.

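# Names of the numeric predictor columns
numerical_features = combined_data.select_dtypes(include=["float", "int64", "bool"]).columns.values
# plot_correlation() is a helper from the full code; judging by its use here, it returns the correlation matrix with SalePrice as the last column
corr = np.abs(plot_correlation())
# Absolute correlation of each feature with SalePrice, strongest first
corr_with_dependant = corr.iloc[:, -1].sort_values(ascending=False)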

Seaborn’s heatmap plot is a visual treat and beautifully depicts the correlation between all the attributes.

sns.heatmap(corr, vmax=1, square=True)

Seaborn Heatmap Plot of Attributes

Highest five correlation coefficients

Lowest five correlation coefficients

To perform feature selection based on this ranking, all attributes whose absolute correlation coefficient with SalePrice is below 0.04 are removed. This drops the features that explain almost none of the variance of the response variable. In the next step, I pick all the categorical features.
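
The ranking and threshold step can be sketched on toy data as below. The column names "area" and "noise" and the synthetic numbers are assumptions made for illustration; only the 0.04 cut-off comes from the text.

```python
import numpy as np
import pandas as pd

# Toy data: "area" drives the target, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 20000
df = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "noise": rng.normal(0, 1, n),
})
df["SalePrice"] = 100 * df["area"] + rng.normal(0, 5000, n)

# Absolute Pearson correlation of every predictor with the target, strongest first.
corr_with_target = (
    df.corr()["SalePrice"].drop("SalePrice").abs().sort_values(ascending=False)
)

threshold = 0.04  # the cut-off quoted in the text
weak = corr_with_target[corr_with_target < threshold].index.tolist()
reduced = df.drop(columns=weak)

print(corr_with_target)
print("dropped:", weak)
```

Keep in mind that the Pearson coefficient only measures linear association, so a feature with a strong non-linear relationship to the target could still be dropped by this filter.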

category_variables = combined_data.dtypes[combined_data.dtypes == "object"].index

The next step is to perform feature selection by removing all the categorical variables that have missing values in more than one third of the training data instances.

# Number of missing values per column
total_missing = train.isnull().sum()
# Columns missing in more than a third of the 1460 training rows
to_delete = total_missing[total_missing > (1460 / 3.)]

The features deleted are Alley, FireplaceQu, PoolQC, Fence, and MiscFeature.
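
A minimal sketch of this missing-value filter on a toy frame follows. The rows are invented for illustration, and the threshold uses len(toy) rather than a hard-coded row count.

```python
import pandas as pd

# Toy frame: "Alley" is mostly missing, "MasVnrType" has one gap, "LotArea" is complete.
toy = pd.DataFrame({
    "Alley": ["Grvl", None, None, None, None, "Pave"],
    "MasVnrType": ["BrkFace", "Stone", None, "BrkFace", "Stone", "BrkFace"],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

total_missing = toy.isnull().sum()

# Keep only the columns whose missing count exceeds one third of the rows.
to_delete = total_missing[total_missing > len(toy) / 3.0]
toy_reduced = toy.drop(columns=to_delete.index)

print(to_delete)
print(toy_reduced.columns.tolist())
```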

Finally, to detect the outliers we apply some statistics. This is done in three steps:

  1. Calculate the interquartile range (IQR): Q3 – Q1, where Q1 is the first quartile (25th percentile) and Q3 is the third quartile (75th percentile).
  2. Then calculate the lower and upper range values.
  3. For each value v of the attribute, check whether v is within the (lower, upper) range; otherwise it is an outlier.

Boxplot of GrLivArea

Q1 = df[feature].quantile(0.25)
Q3 = df[feature].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - (IQR * 1.5)
upper = Q3 + (IQR * 1.5)

There are different techniques for treating outliers, though none is foolproof. I resort to deleting the data instances that contain these outliers from the training set. Since doing this for every predictor variable would shrink the data set immensely, I select only one attribute, GrLivArea, as this attribute has a very high correlation with the response variable (pure intuition!).
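
The three steps above, applied to a single attribute, can be sketched as follows. The helper name iqr_bounds and the toy GrLivArea values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values; the last one is far larger than the rest.
toy = pd.DataFrame({"GrLivArea": [1000, 1200, 1300, 1400, 1500, 1600, 1700, 5600]})

lower, upper = iqr_bounds(toy["GrLivArea"])

# Keep only the rows whose value lies inside the fences.
inliers = toy[toy["GrLivArea"].between(lower, upper)]

print(lower, upper)
print(inliers["GrLivArea"].tolist())
```

Here Q1 = 1275 and Q3 = 1625, so the fences are 750 and 2150 and only the 5600 row is removed.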


Let’s fill in the missing values. Numerical features are filled with the median and categorical features with the most frequent value.

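# Stack the train and test predictors (iloc[:, 1:] skips the first column, Id)
combined_data = pd.concat((train_sans_SalePrice.iloc[:, 1:], test.iloc[:, 1:]))
# Numerical features: fill with the column median
for feature in numerical_features:
    combined_data[feature] = combined_data[feature].fillna(combined_data[feature].median())
# Categorical features: fill with the most frequent value
for feature in categorical_features:
    combined_data[feature] = combined_data[feature].fillna(combined_data[feature].value_counts().idxmax())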
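
The same imputation idea on a tiny frame, as a hedged sketch: the two columns and their values are invented, and mode()[0] is used here as an equivalent way of picking the most frequent value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", "Attchd", None, "Detchd"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Numeric gaps get the column median, categorical gaps the most frequent value.
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

Assigning the result back to the column, instead of calling fillna with inplace=True on a column selection, avoids the chained-assignment pitfalls of recent pandas versions.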

The categorical data is encoded with one-hot encoding using the pandas get_dummies() method:

combined_data = pd.get_dummies(combined_data)
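
One plausible reason for encoding the combined frame rather than train and test separately is that both parts then end up with the same dummy columns. The sketch below uses an invented one-column example to show this; the keys argument is just one way of splitting the frame again afterwards.

```python
import pandas as pd

train_toy = pd.DataFrame({"Street": ["Pave", "Pave", "Grvl"]})
test_toy = pd.DataFrame({"Street": ["Pave", "Pave"]})  # "Grvl" never appears here

# Encoding the test part alone yields a single column.
test_only = pd.get_dummies(test_toy)

# Encoding the combined frame yields one column per category seen anywhere.
combined = pd.concat((train_toy, test_toy), keys=["train", "test"])
encoded = pd.get_dummies(combined)

X_train_toy = encoded.loc["train"]
X_test_toy = encoded.loc["test"]

print(test_only.columns.tolist())
print(X_test_toy.columns.tolist())
```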

Building the model

I use the Ridge implementation from sklearn and train it with cross-validation to improve how well the model generalizes. For cross-validation, sklearn’s KFold comes in handy. It creates a K-fold cross-validator that splits the data set into k consecutive folds (without shuffling by default). Each fold is then used once as the validation set while the k – 1 remaining folds form the training set.

kf = KFold(n_splits=n_folds)
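
To make the splitting behaviour concrete, here is a small sketch on ten dummy rows. The array X and the choice of five folds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten dummy samples
kf = KFold(n_splits=5)            # no shuffling by default

# Each split yields the row indices of the training and validation parts.
folds = [(train_idx.tolist(), valid_idx.tolist()) for train_idx, valid_idx in kf.split(X)]

for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Because the folds are consecutive, the first validation fold is rows 0 and 1, the second is rows 2 and 3, and so on. If the rows of the data set are ordered in some meaningful way, passing shuffle=True may be worth considering.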

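I set the normalize parameter of Ridge to True so that the predictor variables are put on a common scale before fitting (each one is centered and divided by its L2 norm). This rescales the predictors but does not remove their skewness. Note that the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the line below only runs on older versions.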

ridge_model = Ridge(alpha=alpha, normalize=True)
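
On scikit-learn versions where Ridge no longer accepts normalize, one possible substitute is to scale the predictors in a pipeline. This is a sketch under assumptions, not the setup used for the score above: StandardScaler divides by the standard deviation rather than the L2 norm, so the same alpha value does not mean the same amount of regularization. The data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = X @ np.array([2.0, 0.03, 50.0]) + rng.normal(scale=0.1, size=200)

# The scaler is fitted on the training data only, then Ridge sees scaled columns.
model = make_pipeline(StandardScaler(), Ridge(alpha=0.1))
model.fit(X, y)

r2 = model.score(X, y)
print(f"training R2: {r2:.4f}")
```

Wrapping the scaler in the pipeline also means that, inside cross-validation, the scaling statistics are recomputed on each training fold.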

Selecting the hyper-parameter Alpha and Assessing the model

I use the R-squared (R2) score, also called the coefficient of determination, for model assessment. R-squared is a relative measure in which the model’s performance is compared with that of the mean model, which always predicts the mean of the response. R2 = 1 indicates that the regression fits the data perfectly and R2 = 0 means the model does no better than the mean model; on held-out data the score can even be negative.
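
The comparison with the mean model can be written out by hand and checked against sklearn’s r2_score. The four target values and predictions below are made up for the example.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean model
r2_manual = 1.0 - ss_res / ss_tot

# Always predicting the mean gives 0; a model worse than that goes negative.
mean_model = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_model)
r2_reversed = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_mean, r2_reversed)
```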

Before zeroing in on the alpha value (the regularization coefficient) for the model, I toyed with a few alpha values, [0.005, 0.05, 0.1, 0.3, 0.4, 1, 1.5, 2], and drew a scatter plot of the alpha value against the corresponding R2 score for the model. As visible from the scatter plot, alpha = 0.1 gave the highest R2 score.
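
A sweep of this kind might look like the sketch below, which scores each alpha by its mean cross-validated R2. The data is synthetic, so the alpha it picks says nothing about the House Prices data; only the list of candidate values comes from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: 20 predictors, only the first 5 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
coef = np.zeros(20)
coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ coef + rng.normal(scale=1.0, size=300)

alphas = [0.005, 0.05, 0.1, 0.3, 0.4, 1, 1.5, 2]
kf = KFold(n_splits=5)

# Mean R2 over the validation folds for every candidate alpha.
mean_r2 = {
    a: cross_val_score(Ridge(alpha=a), X, y, cv=kf, scoring="r2").mean()
    for a in alphas
}
best_alpha = max(mean_r2, key=mean_r2.get)

for a, score in mean_r2.items():
    print(f"alpha={a}: mean R2={score:.4f}")
print("best alpha:", best_alpha)
```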

Training the model and predicting on the Test Dataset

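for train_index, cv_index in kf.split(X_train):
    x_train, x_valid = X_train.iloc[train_index], X_train.iloc[cv_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[cv_index]
    # Fit on the k - 1 training folds
    ridge_model.fit(x_train, y_train)
    # Back-transform validation predictions and targets from the log scale to USD
    pred = np.exp(ridge_model.predict(x_valid))
    y_valid_exp = np.exp(y_valid)
    # Predictions for the test set, still on the log scale
    prediction = ridge_model.predict(X_test)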

Since we log-transformed the response variable, we need to convert the predictions from the log scale back to normal values using Numpy’s exp(x) function. If the response variable had been transformed using the log1p function instead, Numpy’s expm1 function would convert the predictions back using out = exp(x) – 1.

# np.exp inverts the np.log applied to SalePrice (use np.expm1 only with np.log1p)
pred_ridge = np.exp(prediction)
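
The pairing of each transform with its inverse can be checked with a quick round trip. The four prices are arbitrary example values.

```python
import numpy as np

prices = np.array([34900.0, 129500.0, 163000.0, 214000.0])

via_log = np.exp(np.log(prices))        # log paired with exp
via_log1p = np.expm1(np.log1p(prices))  # log1p paired with expm1

# Mixing the pairs is off by exactly 1, since expm1(log(x)) = x - 1.
mismatched = np.expm1(np.log(prices))

print(via_log, via_log1p, mismatched)
```

At house-price magnitudes a one-dollar offset barely changes the score, but keeping the pairs matched avoids a silent bias.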

The predictions are then written out in CSV format:

np.savetxt("pred_ridge_Everything01.csv", pred_ridge, delimiter=",")
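
np.savetxt writes the bare prediction values only. If a file with an Id column and a header row is wanted, which is the layout of the competition’s sample submission as far as I know, a pandas-based sketch could look like this. The Ids and prices are placeholders, and the output goes to an in-memory buffer instead of a real file.

```python
import io

import numpy as np
import pandas as pd

# Placeholder Ids and predictions; in practice these would come from test["Id"] and pred_ridge.
test_ids = np.array([1461, 1462, 1463])
pred_ridge = np.array([120000.0, 158000.5, 182500.25])

submission = pd.DataFrame({"Id": test_ids, "SalePrice": pred_ridge})

# Write to an in-memory buffer; pass a file name to to_csv to write to disk instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```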

Finally, I planned to employ an ensemble technique to provide a smoother regression and went on to build a second model using XGBoost. I had observed numerous examples where this algorithm works well on Kaggle data sets (using the XGBoost model alone got me a Kaggle score of 0.13089). I used the averaging ensemble method to get the final predictions, which simply averages, for each instance of the test data set, the predictions obtained from the Ridge model developed above and from the XGBoost model.
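
The averaging step itself is a one-liner once both models have produced predictions on the original price scale. The arrays below are placeholder numbers standing in for the Ridge and XGBoost outputs, and the weighted variant is only an optional extension, not something used above.

```python
import numpy as np

# Placeholder predictions for three test instances, already in USD.
pred_ridge = np.array([120000.0, 158000.0, 182000.0])
pred_xgb = np.array([124000.0, 154000.0, 190000.0])

# Simple averaging: each model contributes equally to every prediction.
pred_ensemble = (pred_ridge + pred_xgb) / 2.0

# Optional variant: weight the models unequally (the weights here are arbitrary).
pred_weighted = np.average([pred_ridge, pred_xgb], axis=0, weights=[0.75, 0.25])

print(pred_ensemble)
print(pred_weighted)
```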