Predicting Home Prices
This is a quick one. In this project, I've pulled the Zillow house price dataset from Kaggle to look at how a neural network can be built with Keras; you can find the dataset here.
The dataset has 1,460 observations of homes sold, with 80 features for each house – the normal things you think about when buying a house, like lot size, number of bedrooms and bathrooms, and square footage, as well as slightly less normal things like the number of fireplaces, the quality of the fence, and how close the house is to a main road or railroad.
I downloaded the dataset (no scraping or data cleaning like in my other two examples) and loaded it into Python.
import pandas as pd

train = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')
y_train = train['SalePrice']
X_train = train.drop('SalePrice', axis=1)
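Before encoding anything, it helps to see how many columns are numeric versus categorical. Here is a quick sketch of that check (the exact counts depend on the file you download):

print(X_train.shape)                  # (rows, features)
print(X_train.dtypes.value_counts())  # how many int64 / float64 / object columns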
This dataset has quite a few categorical fields. Rather than going through each field to check whether it's categorical, I'm going to do a batch one-hot encoding of the categorical features to get dummy variables. To do this, I grab the column names that have the data type object. In my data the columns were all either int64, float64 or object, so it was easy to tell which were categorical and which weren't. A few int64 variables are actually categorical (MSSubClass, for example), so I deal with those after the initial call. This is all done using pandas' get_dummies function.
columns = X_train.columns[X_train.dtypes == 'object']
X_train_ohe = pd.get_dummies(X_train, columns=columns, drop_first=True)
X_train_ohe = pd.get_dummies(X_train_ohe, columns=['MSSubClass'], drop_first=True)
From here, I found there were some missing values, so I imputed them with scikit-learn's SimpleImputer using the median strategy.
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X_train_ohe)
X_train_imp = imp.transform(X_train_ohe)
Now for Keras. Here I'm importing the Sequential model, Dense layers and the EarlyStopping callback to stop training if the model stops improving (which is helpful for my slow computer).
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
Keras requires passing the input shape to the first layer, so I created a variable called n_cols to keep things cleaner. I set the model up as Sequential and then add my Dense layers. The first two layers each get the number of features + 1 nodes/neurons, the next layer gets half that many, and the final layer has a single node for the output, since it's a continuous target variable (i.e., selling price). For the hidden layers I'm using the relu activation function (but in the future I'm considering trying a randomized ReLU for fun).
I included a pretty patient early stopping mechanism (a patience of 50 epochs); I tried a lower number but found it quit too quickly for my liking. Lastly, I'm compiling with the Adam optimizer and using the mean absolute percentage error as the loss.
n_cols = X_train_imp.shape[1]

model = Sequential()
model.add(Dense(n_cols+1, activation='relu', input_shape=(n_cols,)))
model.add(Dense(n_cols+1, activation='relu'))
model.add(Dense(int((n_cols+1)/2), activation='relu'))
model.add(Dense(1))

es = EarlyStopping(patience=50)
model.compile(optimizer='adam', loss='mean_absolute_percentage_error')
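If you want to confirm the layer sizes before training, Keras can print the architecture; this is just a quick sanity check rather than part of the original walkthrough:

model.summary()  # prints each layer's output shape and parameter count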
For plotting purposes, I'm going to assign the model fit call to the variable history. In the fit call, I've set validation_split to 25%, so one quarter of the data is held back as a validation set. I've set epochs to 1,000, but we won't reach 1,000 epochs if there isn't any progress for 50 epochs (as defined in the early stopping variable es).
history = model.fit(X_train_imp, y_train, callbacks=[es], epochs=1000, validation_split=0.25)
The neural network went through 402 epochs before settling on the final model. In the chart below, you can see that the validation loss drops quickly over the first 25 or so epochs and then keeps improving steadily until around 200 epochs. The training loss continues to fall after that, but the model is likely overfitting once the validation loss flattens out. The loss is the mean absolute percentage error, and the validation set gets down to about 10%, which means the model's predictions are off by roughly 10% of the sale price on average.
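The chart isn't reproduced here, but a similar plot can be drawn from the history object. Here's a rough sketch using matplotlib, assuming the default 'loss' and 'val_loss' keys that Keras records:

import matplotlib.pyplot as plt

# Per-epoch loss for the training and validation splits
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Mean absolute percentage error')
plt.legend()
plt.show()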
What's interesting is comparing the predicted price with the actual sale price for the entire dataset. The grey line is a 45-degree line, which represents a perfect prediction. As shown below, there are a few big outliers that are likely hurting the accuracy of the model.
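That comparison chart isn't included here either, but it can be recreated along these lines (a sketch, assuming predictions are made on the same imputed training matrix):

import matplotlib.pyplot as plt

# Predict on the one-hot encoded, imputed training data and compare to actual prices
y_pred = model.predict(X_train_imp).flatten()

plt.scatter(y_train, y_pred, alpha=0.4)
lims = [min(y_train.min(), y_pred.min()), max(y_train.max(), y_pred.max())]
plt.plot(lims, lims, color='grey')  # 45-degree line: perfect predictions
plt.xlabel('Actual sale price')
plt.ylabel('Predicted sale price')
plt.show()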