
Predicting AirBnB prices in Lisbon: Trees and Random Forests

In this short article we will quickly bootstrap a prediction model for the nightly prices of an AirBnB in Lisbon. This guide aims to be a simple, practical introduction to machine learning data analysis, using real data to develop a real model.


It also assumes a basic understanding of Python and of the machine learning library scikit-learn, and it was written in a Jupyter notebook running Python 3.6 and sklearn 0.21. The dataset, as well as the notebook, can be obtained from my Github account, or via Google’s dataset search.

1. Data exploration and cleanup

As a first step we load our dataset. After downloading the file, it is trivial to open and parse it with Pandas and get a quick overview of what we can expect from it:

import pandas as pd

airbnb_lisbon_prices_path = './airbnb_lisbon_1480_2017-07-27.csv'
airbnb_data = pd.read_csv(airbnb_lisbon_prices_path)

airbnb_data.columns
Index(['room_id', 'survey_id', 'host_id', 'room_type', 'country', 'city',
       'borough', 'neighborhood', 'reviews', 'overall_satisfaction',
       'accommodates', 'bedrooms', 'bathrooms', 'price', 'minstay', 'name',
       'last_modified', 'latitude', 'longitude', 'location'],
      dtype='object')

Even though the above confirms that the dataset was properly loaded and parsed, a quick look at the statistical description of the data gives us a better insight into its nature:

airbnb_data.describe()
room_id survey_id host_id country borough reviews overall_satisfaction accommodates bedrooms bathrooms price minstay latitude longitude
count 1.357800e+04 13578.0 1.357800e+04 0.0 0.0 13578.000000 13578.000000 13578.000000 13578.000000 0.0 13578.000000 0.0 13578.000000 13578.000000
mean 1.128653e+07 1480.0 4.031780e+07 NaN NaN 29.507512 3.242046 3.872956 1.535793 NaN 88.376712 NaN 38.720882 -9.143594
std 6.331112e+06 0.0 4.120258e+07 NaN NaN 43.657402 2.151046 2.254779 1.042837 NaN 123.044522 NaN 0.015134 0.017923
min 6.499000e+03 1480.0 1.445500e+04 NaN NaN 0.000000 0.000000 1.000000 0.000000 NaN 10.000000 NaN 38.692112 -9.226461
25% 5.611757e+06 1480.0 6.855202e+06 NaN NaN 1.000000 0.000000 2.000000 1.000000 NaN 48.000000 NaN 38.711405 -9.150805
50% 1.253847e+07 1480.0 2.393412e+07 NaN NaN 11.000000 4.500000 4.000000 1.000000 NaN 70.000000 NaN 38.715857 -9.141973
75% 1.718596e+07 1480.0 6.298880e+07 NaN NaN 40.000000 5.000000 5.000000 2.000000 NaN 96.000000 NaN 38.726283 -9.132693
max 2.009103e+07 1480.0 1.426582e+08 NaN NaN 448.000000 5.000000 16.000000 10.000000 NaN 4203.000000 NaN 38.792573 -9.091610

From this table we can read basic statistical observations for each of our parameters. As our model intends to predict the price, based on whatever set of inputs we provide to it, we can check, for example (see the snippet after this list), that:

  • the mean value of the nightly price is around 88 EUR
  • the prices range from a minimum of 10 EUR to a maximum of 4203 EUR
  • the standard deviation for the prices is around 123 EUR (!)
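
These values can also be pulled directly from the price column, without scanning the full describe() table; a minimal sketch:

# extract only the price statistics mentioned above
airbnb_data['price'].agg(['mean', 'std', 'min', 'max'])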

The price distribution could be represented as follows:

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
pd.plotting.register_matplotlib_converters()

sns.distplot(a=airbnb_data['price'])

(plot: distribution of nightly prices)

As we can see, the distribution of prices is concentrated below the 300 EUR mark, with a few entries reaching values around 4000 EUR. Plotting only the range where most of the prices reside:

airbnb_sub_set = airbnb_data.loc[airbnb_data.price < 300]
sns.distplot(a=airbnb_sub_set['price'])

(plot: distribution of prices below 300 EUR)

We can clearly see from the representation above that most nights in Lisbon cost between 0 and 150 EUR.
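
To put a rough number on that claim, we can check the fraction of listings priced under 150 EUR (a one-line sanity check, not in the original notebook):

# share of listings with a nightly price below 150 EUR
(airbnb_data.price < 150).mean()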

Let us now take a sneak peek into the actual dataset, in order to understand the kind of parameters we will be working with:

airbnb_data.head()
room_id survey_id host_id room_type country city borough neighborhood reviews overall_satisfaction accommodates bedrooms bathrooms price minstay name last_modified latitude longitude location
0 14708916 1480 91501272 Shared room NaN Lisbon NaN Santo António 19 4.5 4 1.0 NaN 30.0 NaN LOW COST HOSTEL MARQUES GARDENS 4 BED DORM 2017-07-28 01:19:53.215272 38.723987 -9.146613 0101000020E61000003F56F0DB104B22C055DD239BAB5C...
1 7465447 1480 704061 Shared room NaN Lisbon NaN Avenidas Novas 4 3.5 6 1.0 NaN 39.0 NaN Room in Lisbon center 2017-07-28 01:19:53.206052 38.735061 -9.154510 0101000020E610000057CF49EF1B4F22C054C8957A165E...
2 11058290 1480 1379661 Shared room NaN Lisbon NaN Santa Maria Maior 38 4.5 8 1.0 NaN 39.0 NaN Bed in a 8-bed dorm @ This Is Lisbon Hostel 2017-07-28 01:19:52.034547 38.715726 -9.132671 0101000020E610000057410C74ED4322C0CD72D9E89C5B...
3 9019111 1480 46762399 Shared room NaN Lisbon NaN Avenidas Novas 9 4.0 6 1.0 NaN 37.0 NaN Shared mixed bunkbed room 1 2017-07-28 01:19:52.020396 38.729017 -9.149932 0101000020E61000006C3F19E3C34C22C0309DD66D505D...
4 3123304 1480 8488955 Shared room NaN Lisbon NaN Arroios 0 0.0 2 1.0 NaN 480.0 NaN quarto no centro de lisboa 2017-07-28 01:19:50.401364 38.721617 -9.136208 0101000020E61000003AE8120EBD4522C06B9E23F25D5C...

From the sample above we can learn a bit more about the nature of the data. Besides the distribution-related parameters (which we will not be looking at for now), we can identify a few relevant insights:

  • there are empty columns: country, borough, bathrooms, minstay (confirmed by the quick check below)
  • entries like host_id, survey_id, room_id, name, city and last_modified may not be that relevant for our price predictor
  • there is some categorical data that we will not initially be able to feed into the price regression, such as room_type and neighborhood (but we will come back to these two later on)
  • location may be redundant for now, since we already have both latitude and longitude, and we would need to further investigate the format of this field
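
The empty columns are easy to confirm programmatically; a minimal check:

# columns where every single row is missing
list(airbnb_data.columns[airbnb_data.isna().all()])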

Let us then proceed to separate the dataset into:

  • one vector Y that will contain all the real prices of the dataset
  • one matrix X that contains all the features we consider relevant for our model

This can be achieved by the following snippet:

lisbon_features = ['reviews', 'overall_satisfaction', 'accommodates', 'bedrooms', 'latitude', 'longitude']

Y = airbnb_data.price
X = airbnb_data[lisbon_features]

X.head()
reviews overall_satisfaction accommodates bedrooms latitude longitude
0 19 4.5 4 1.0 38.723987 -9.146613
1 4 3.5 6 1.0 38.735061 -9.154510
2 38 4.5 8 1.0 38.715726 -9.132671
3 9 4.0 6 1.0 38.729017 -9.149932
4 0 0.0 2 1.0 38.721617 -9.136208

With our new subset, we can now try to understand how these parameters correlate, in terms of the overall satisfaction, for the most common price range:

relevant_parameters = ['price'] + lisbon_features
sns.pairplot(airbnb_data.loc[airbnb_data.price < 300][relevant_parameters], hue="overall_satisfaction")

(plot: pairwise relationships between price and the selected features, colored by overall_satisfaction)

The plots above allow us to check the distribution of each single variable and to start inferring the relationships between them. We have taken the liberty of applying a color hue based on the review score for each of our chosen parameters. Some easy readings from the figure, focusing on relations that may denote a positive correlation (a numeric check follows the list):

  • the number of reviews is higher for rooms with fewer accommodations. This could mean that most of the guests who review are renting smaller rooms.
  • most of the reviews are made for the cheaper rooms
  • given the visual dominance of the yellow hue, most of the reviews are rated with a 5. Either most of the accommodations are genuinely very satisfactory or, more probably, the large number of people who do review tend to give 5 as a rating.
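
These visual readings can be complemented with plain Pearson correlations against the price, restricted to the same sub-300 EUR range (a rough numeric sketch):

# correlation of each selected parameter with the price, for the most common price range
airbnb_data.loc[airbnb_data.price < 300][relevant_parameters].corr()['price'].sort_values()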

One curious observation is that the location heavily influences both price and rating. When plotting longitude against latitude we obtain a quasi geographical / spatial distribution of the ratings across Lisbon:

sns.pairplot(airbnb_data, vars=['latitude', 'longitude'], hue="overall_satisfaction", height=3)

(plot: latitude vs. longitude, colored by overall_satisfaction)

We can then add this data to an actual map of Lisbon, to check the distribution:

(map: distribution of reviews over Lisbon)

As expected, most of the reviews are in the city center, with an already relevant cluster of reviews around the more recent Parque das Nações area. The northern, more suburban area does have some scattered places, but its reviews are neither as high nor as frequent as in the center.
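
For reference, one way to build such a map is with the folium library; folium is not used in the original notebook, so treat this as a hedged sketch:

import folium

# center the map roughly on Lisbon
lisbon_map = folium.Map(location=[38.72, -9.14], zoom_start=13)

# plot a sample of the listings to keep the map light
for _, row in airbnb_data.sample(2000, random_state=42).iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=1,
        # highlight the highest-rated places
        color='orange' if row['overall_satisfaction'] >= 4.5 else 'blue',
    ).add_to(lisbon_map)

lisbon_map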

2. Splitting the dataset

With our dataset now properly cleaned up, we will first split it into two pieces:

  • a set that will be responsible for training our model, therefore called the training set
  • a validation set that will be used to then validate our model

Both sets are basically subsets of X and Y, each containing a portion of the rental spaces and their corresponding prices. After training our model, we will use the validation set as input to infer how well the model generalizes to data other than the data it was trained on. When a model performs very well on the training set but does not generalize well to other data, we say that the model is overfitted to the dataset. For further information on overfitting, please refer to: https://en.wikipedia.org/wiki/Overfitting

In order to detect (and avoid) this overfitting, we will use a tool from sklearn called train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) that splits our data into random train and test subsets:

from sklearn.model_selection import train_test_split

train_X, validation_X, train_Y, validation_Y = train_test_split(X, Y, random_state = 42)

print("Training set: Xt:{} Yt:{}".format(train_X.shape, train_Y.shape))
print("Validation set: Xv:{} Yv:{}".format(validation_X.shape, validation_Y.shape))
print("-")
print("Full dataset: X:{} Y:{}".format(X.shape, Y.shape))
Training set: Xt:(10183, 6) Yt:(10183,)
Validation set: Xv:(3395, 6) Yv:(3395,)
-
Full dataset: X:(13578, 6) Y:(13578,)

Now that we have our datasets in place, we can proceed to create a simple regression model that will try to predict, based on our chosen parameters, the nightly cost of an AirBnb in Lisbon.

3. Planting the Decision Trees

As one of the simplest supervised ML models, a decision tree is usually used to predict an outcome by learning and inferring decision rules from all the available feature data. By ingesting our data, the tree learns a series of educated “questions” that partition the data, and the resulting structure can be used either to classify categorical data or to build a regression model for numerical values (as is our case with the prices).

A visualization example, taken from Wikipedia, could be the decision tree for predicting the survival of passengers on the Titanic:

A tree showing survival of passengers on the Titanic ("sibsp" is the number of spouses or siblings aboard), from wikipedia

Based on the data, the tree is built from the root and is (recursively) partitioned by splitting each node into two child nodes. Each split is chosen as the decision, inferred from the statistical data we provide to the model, that yields the biggest information gain, and the process repeats until the samples in a node cannot be usefully separated any further. These end vertices are called “leaves”.

In the Wikipedia example above it is trivial to follow the decision process and, as the probability of survival is the estimated parameter here, we can easily obtain the probability that a “male, older than 9.5 years” survives when “he has no siblings aboard”.

(For a deeper understanding of how decision trees are built for regression, I would recommend the StatQuest video named Decision Trees.)

Let us then create our Decision Tree regressor by utilizing the sklearn implementation:

from sklearn.tree import DecisionTreeRegressor

lisbon_airbnb_model = DecisionTreeRegressor(random_state = 42)
lisbon_airbnb_model.fit(train_X, train_Y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

We can inspect how the tree was built, for illustration purposes, in the picture below:

from sklearn.tree import export_graphviz
from graphviz import Source

export_graphviz(lisbon_airbnb_model, out_file='tree.dot', filled=True, rounded=True, special_characters=True)  

from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

from IPython.display import Image
Image(filename = 'tree.png')

Please find here a graphical representation of the generated tree (big picture)

We can also show a snippet of the predictions, and the corresponding parameters, for a sample of the training data set. So, for the following accommodations:

train_X.head()
reviews overall_satisfaction accommodates bedrooms latitude longitude
11956 29 4.5 2 1.0 38.717426 -9.156587
5665 75 5.0 5 2.0 38.716165 -9.144141
11639 2 0.0 3 1.0 38.709567 -9.143207
13391 4 5.0 2 1.0 38.727817 -9.133753
3219 1 0.0 6 3.0 38.723333 -9.130640

We obtain the following prices:

lisbon_airbnb_model.predict(train_X.head())
array([ 30.,  81.,  60.,  30., 121.])

After fitting our model to the training data, we can now run a prediction for the validation set and measure the absolute error of our model, to assess how well it generalizes to data it was not trained on.

For this we’ll use the Mean Absolute Error (MAE) metric. We can think of this metric as the average error magnitude over a set of predictions. It can be represented as:

MAE = {1 \over n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|

It is basically an average over the differences between the actual observations (y) and our model’s predictions (ŷ), with the consideration that all individual differences have equal weight.
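
As a tiny illustration of the formula, the same average can be computed by hand with numpy; the prices below are made up for the example:

import numpy as np

actual = np.array([100.0, 50.0, 80.0])      # observed prices, y
predicted = np.array([90.0, 60.0, 100.0])   # model predictions, y-hat

# (|100 - 90| + |50 - 60| + |80 - 100|) / 3 = 13.33...
np.mean(np.abs(actual - predicted))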

Let us then apply this metric to our model, using the scikit-learn implementation:

from sklearn.metrics import mean_absolute_error

# instruct our model to make predictions for the prices on the validation set
validation_predictions = lisbon_airbnb_model.predict(validation_X)

# calculate the MAE between the actual prices (in validation_Y) and the predictions made
validation_prediction_errors = mean_absolute_error(validation_Y, validation_predictions)

validation_prediction_errors
42.91664212076583

This result basically means that our model has an average absolute error of about 42.92 EUR per accommodation when exposed to the validation data, out of a mean price of 88.38 EUR that we observed during the initial data exploration.

Whether due to our dataset being small or our model being naive, this result is not satisfactory.

Even though this may seem worrying at this point, it is always advisable to create a model that generates results as soon as possible and then iterate on its optimization. Let us therefore attempt to improve our model’s predictions a bit more.

Currently we are indeed suffering from overfitting to the training data. If we picture the decision tree being built, since we are not specifying any limit on the splits, we end up generating a tree that goes deep enough to essentially memorise the training features, and therefore does not generalize well to any test set. A quick way to see this is shown below.
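
Assuming sklearn ≥ 0.21 (the version used in this notebook), the fitted tree exposes its depth and number of leaves; the exact values will depend on your data and split:

# inspect how large the unconstrained tree actually grew
lisbon_airbnb_model.get_depth(), lisbon_airbnb_model.get_n_leaves()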

As sklearn’s DecisionTreeRegressor allows us to specify a maximum number of leaf nodes as a hyperparameter, let us quickly check whether there is a value that decreases our MAE:

# this function takes both the training and validation sets to compute the MAE for a Decision Tree
def compute_mae(train_X, train_Y, validation_X, validation_Y, max_leaf_nodes):
    trees_model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 42)
    trees_model.fit(train_X, train_Y)
    validation_predictions = trees_model.predict(validation_X)
    error = mean_absolute_error(validation_Y, validation_predictions)
    
    return(error)

def get_best_tree_size(train_X, train_Y, validation_X, validation_Y, verbose = False):
    # candidates to iterate on finding a better tree depth
    candidate_max_leaf_nodes = [5, 10, 20, 30, 50, 100, 250, 500]

    # initialization
    minimum_error = None
    best_tree_size = 5

    # loop to find the minimal error value
    for max_leaf_nodes in candidate_max_leaf_nodes:
        current_error = compute_mae(train_X, train_Y, validation_X, validation_Y, max_leaf_nodes)
        verbose and print("(Size: {}, MAE: {})".format(max_leaf_nodes, current_error))

        if minimum_error is None or current_error < minimum_error:
            minimum_error = current_error
            best_tree_size = max_leaf_nodes

    return(best_tree_size)

best_tree_size = get_best_tree_size(train_X, train_Y, validation_X, validation_Y, True)

best_tree_size
(Size: 5, MAE: 42.6016036138866)
(Size: 10, MAE: 40.951013502542885)
(Size: 20, MAE: 40.00407688450048)
(Size: 30, MAE: 39.6249335490541)
(Size: 50, MAE: 39.038730827750555)
(Size: 100, MAE: 37.72578309289501)
(Size: 250, MAE: 36.82474862034445)
(Size: 500, MAE: 37.58889602439078)

250
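
If you prefer to see the whole curve rather than just the winning value, a small optional sketch (re-using compute_mae and the matplotlib import from earlier) plots the validation MAE against each candidate tree size:

# re-compute the MAE for each candidate size and plot the curve
candidate_max_leaf_nodes = [5, 10, 20, 30, 50, 100, 250, 500]
errors = [compute_mae(train_X, train_Y, validation_X, validation_Y, n)
          for n in candidate_max_leaf_nodes]

plt.plot(candidate_max_leaf_nodes, errors, marker='o')
plt.xscale('log')
plt.xlabel('max_leaf_nodes')
plt.ylabel('validation MAE (EUR)')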

Let us then generate our model again, this time including the computed maximum tree size, and check its predictions with the new limit:

# create the model
lisbon_airbnb_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size, random_state = 42)
lisbon_airbnb_model.fit(train_X, train_Y)

# generate the predictions for the validation set
validation_predictions = lisbon_airbnb_model.predict(validation_X)
validation_prediction_errors = mean_absolute_error(validation_Y, validation_predictions)

validation_prediction_errors
36.82474862034445

So, by simply tuning our maximum number of leaf nodes hyperparameter, we obtain a significant improvement in our model’s predictions. We have reduced the model’s average error by (42.92 - 36.82) ≈ 6.09 EUR.

4. Categorical Data

As mentioned above, even though we have been able to keep optimizing our very simplistic model, we still dropped two possibly relevant fields that may (or may not) contribute to a better generalization and parameterization of our model: room_type and neighborhood.

These non-numerical data fields are usually referred to as Categorical Data, and most frequently we can approach them in one of three ways:

1) Drop

Sometimes the easiest way to deal with categorical data is… to remove it from the dataset. We did this to set up our project quickly, but one should go case by case to understand the nature of such fields and whether it makes sense to drop them.

This was the scenario we analysed until now, with a MAE of: 36.82474862034445
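
For completeness, a minimal sketch of how the “drop” strategy can be applied generically: selecting only the non-object columns is equivalent to what we did by hand when picking lisbon_features.

# keep only the numerical (non-object) columns of the dataset
airbnb_data.select_dtypes(exclude=['object']).columns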

2) Label Encoding

With label encoding, each possible value is assigned a unique integer. We can also make this transformation take into account any ordering/magnitude that may be relevant for the data (e.g., ratings, views, …). Let us check a simple example using the sklearn preprocessor:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# initialize with a list of values
encoder.fit(["shared room", "single room", "double room", "suite"])

sample_inputs = ["suite", "suite", "shared room", "double room", "single room"]

encoder.transform(sample_inputs)
array([3, 3, 1, 0, 2])

It is then trivial to see the transformation the LabelEncoder is doing: each value is assigned its index in the array of fitted classes:

encoder.classes_
array(['double room', 'shared room', 'single room', 'suite'], dtype='<U11')

Let us then apply this preprocessing technique to our categorical data, and verify how it affects our model predictions. Our new data set becomes:

lisbon_features_extended = ['room_type', 'neighborhood', 'reviews', 'overall_satisfaction', 'accommodates', 'bedrooms', 'latitude', 'longitude']
X = airbnb_data[lisbon_features_extended]

X.head()
room_type neighborhood reviews overall_satisfaction accommodates bedrooms latitude longitude
0 Shared room Santo António 19 4.5 4 1.0 38.723987 -9.146613
1 Shared room Avenidas Novas 4 3.5 6 1.0 38.735061 -9.154510
2 Shared room Santa Maria Maior 38 4.5 8 1.0 38.715726 -9.132671
3 Shared room Avenidas Novas 9 4.0 6 1.0 38.729017 -9.149932
4 Shared room Arroios 0 0.0 2 1.0 38.721617 -9.136208

Our categorical data, represented in our pandas dataframe as object columns, can then be extracted with:

categorical = (X.dtypes == 'object')
categorial_columns = list(categorical[categorical].index)

categorial_columns
['room_type', 'neighborhood']

Now that we have the columns, let us then transform them on both the training and validation sets:

train_X, validation_X, train_Y, validation_Y = train_test_split(X, Y, random_state = 42)

# create copies of our data sets to apply the transformations
train_X_encoded = train_X.copy()
validation_X_encoded = validation_X.copy()

encoder = LabelEncoder()

# for each column we fit the encoder and transform each of the rows
for column in categorial_columns:
    train_X_encoded[column] = encoder.fit_transform(train_X[column])
    validation_X_encoded[column] = encoder.transform(validation_X[column])

# a sample of our transformed data
train_X_encoded.head()
room_type neighborhood reviews overall_satisfaction accommodates bedrooms latitude longitude
11956 1 9 29 4.5 2 1.0 38.717426 -9.156587
5665 0 21 75 5.0 5 2.0 38.716165 -9.144141
11639 1 15 2 0.0 3 1.0 38.709567 -9.143207
13391 1 4 4 5.0 2 1.0 38.727817 -9.133753
3219 0 18 1 0.0 6 3.0 38.723333 -9.130640

Let us now train and fit the model with the transformed data:

# compute the best tree size
best_tree_size = get_best_tree_size(train_X_encoded, train_Y, validation_X_encoded, validation_Y)

# create the model
lisbon_airbnb_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size, random_state = 42)
lisbon_airbnb_model.fit(train_X_encoded, train_Y)

# generate the predictions for the validation set
validation_predictions = lisbon_airbnb_model.predict(validation_X_encoded)
validation_prediction_errors = mean_absolute_error(validation_Y, validation_predictions)

validation_prediction_errors
35.690195084932355

We have then improved our predictor, by encoding our categorical data, reducing our MAE to ~ 35.69 EUR.

3) One-Hot Encoding

One-Hot encoding, instead of enumerating a field’s possible values, creates new columns indicating the presence or absence of each encoded value. Let us showcase this with a small example:

from sklearn.preprocessing import OneHotEncoder

# categories
room_type = ["shared room", "single room", "double room", "suite"]
building_type = ["hotel", "apartement", "iglo"]

# initializing the encoder (non-sparse returns an array, and ignore unknown categories)
encoder = OneHotEncoder(categories=[room_type, building_type], sparse=False, handle_unknown='ignore')

# fitting and transforming the example
encoder.fit_transform([['suite', 'iglo'], ['single room', 'apartement']])
array([[0., 0., 0., 1., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1., 0.]])

From the result above we can see that the binary encoding outputs 1 for the categories each input row actually has, and 0 when a category is not present. Let us then try to use this preprocessing in our model:

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

# create a pandas dataframe with our transformed categorical columns, with One Hot Encoding
train_X_encoded_categories = pd.DataFrame(encoder.fit_transform(train_X[categorial_columns]))
validation_X_encoded_categories = pd.DataFrame(encoder.transform(validation_X[categorial_columns]))

train_X_encoded_categories.head()
0 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25 26
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0

5 rows × 27 columns

The above result may look weird at first but, for the 27 possible categories, we now have a binary codification flagging the presence of each one. We will now:

  • add back the original row indexes that were lost during the transformation
  • drop the original categorical columns from the original sets train_X and validation_X
  • replace the dropped columns with our new dataframe containing all 27 possible categories

train_X_encoded_categories.index = train_X.index
validation_X_encoded_categories.index = validation_X.index

partial_train_X = train_X.drop(categorial_columns, axis=1)
partial_validation_X = validation_X.drop(categorial_columns, axis=1)

train_X_encoded = pd.concat([partial_train_X, train_X_encoded_categories], axis=1)
validation_X_encoded = pd.concat([partial_validation_X, validation_X_encoded_categories], axis=1)


validation_X_encoded.head()
reviews overall_satisfaction accommodates bedrooms latitude longitude 0 1 2 3 ... 17 18 19 20 21 22 23 24 25 26
1061 57 5.0 4 2.0 38.717151 -9.145846 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
6480 80 5.0 4 0.0 38.711204 -9.134876 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
12233 14 5.0 2 1.0 38.727864 -9.142812 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4659 19 5.0 4 1.0 38.718072 -9.125642 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
7383 3 4.0 4 1.0 38.712991 -9.141062 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0

5 rows × 33 columns
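
As an optional aside, if the anonymous numeric column names look opaque, the fitted OneHotEncoder can map them back to the original categories (get_feature_names is available since sklearn 0.20):

# human-readable names for the one-hot encoded columns
encoder.get_feature_names(categorial_columns)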

Now we can feed our new encoded sets into our model:

# compute the best tree size
best_tree_size = get_best_tree_size(train_X_encoded, train_Y, validation_X_encoded, validation_Y)

# create the model
lisbon_airbnb_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size, random_state = 42)
lisbon_airbnb_model.fit(train_X_encoded, train_Y)

# generate the predictions for the validation set
validation_predictions = lisbon_airbnb_model.predict(validation_X_encoded)
validation_prediction_errors = mean_absolute_error(validation_Y, validation_predictions)

validation_prediction_errors
36.97010930367817

By using One-Hot Encoding on our categorical data we obtain a MAE of ~36.97 EUR.

This result suggests that One-Hot Encoding, applied to both of our categorical parameters at the same time, is not the best fit when compared with Label Encoding. Nevertheless, it still allowed us to include the categorical parameters while reducing the initial MAE.

5. Random Forests

From the previous chapter we could see that, with our Decision Tree, we are always balancing between:

  • a deep tree with many leaves, in our case with few AirBnB places in each of them, which ends up overfitted to our training set (it presents what we call high variance)
  • a shallow tree with few leaves that is unable to distinguish between the various features of an item

We can think of a “Random Forest” as an ensemble of Decision Trees which, in order to reduce the variance mentioned above, are generated with some added randomness so that combining them reduces the overall error. Some examples of how the individual trees of a random forest can be generated:

  • generating trees with different subsets of features. For example, from our set of parameters analysed above, trees would be generated with only a random subset of them (e.g., a Decision Tree with only “reviews” and “bedrooms”, another with all parameters except “latitude”)
  • generating other trees by training on different samples of the data (different sizes, different splits of the data set into training and validation, …)

The added randomness makes the errors of the individual trees less likely to be correlated. The final prediction is then taken as the average of all the individual tree predictions and, by combining the different decision trees, this has the interesting effect of cancelling some of those errors out, reducing the variance of the whole prediction (we will verify this averaging after fitting the forest below).

The original publication, explaining this algorithm in more depth, can be found on the bibliography section at the end of this article.

Let us then implement our predictor using a Random Forest:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# create copies of our data sets to apply the transformations
train_X_encoded = train_X.copy()
validation_X_encoded = validation_X.copy()

# use the label encoder to our categorical data
encoder = LabelEncoder()

# for each column we fit the encoder and transform each of the rows
for column in categorial_columns:
    train_X_encoded[column] = encoder.fit_transform(train_X[column])
    validation_X_encoded[column] = encoder.transform(validation_X[column])

# let us use 250 trees in our forest
airbnb_lisbon_forest_model = RandomForestRegressor(n_estimators = 250, random_state = 42)
airbnb_lisbon_forest_model.fit(train_X_encoded, train_Y)

predictions = airbnb_lisbon_forest_model.predict(validation_X_encoded)

mean_absolute_error(validation_Y, predictions)
33.9996500736377

We can see a significant reduction in our MAE when using a Random Forest.
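
As a quick sanity check of the averaging behaviour described earlier, we can average the individual trees’ predictions ourselves (via the forest’s estimators_ attribute) and compare the result with the forest’s own output; this is a sketch, not part of the original notebook:

import numpy as np

# average the predictions of every individual tree in the forest
manual_average = np.mean(
    [tree.predict(validation_X_encoded.values) for tree in airbnb_lisbon_forest_model.estimators_],
    axis=0)

# should match the forest's own predictions for a regressor
np.allclose(manual_average, predictions)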

Summary

Even though decision trees are among the simplest (maybe the simplest) regression techniques we can use in machine learning models, we hope to have demonstrated a sample process of analysing a dataset in order to generate predictions. It should be clear that, with small optimization steps (like cleaning up the data and encoding categorical data) and by moving from a single tree to a random forest, we can significantly reduce the mean absolute error of our model’s predictions.

We hope this example proves useful as a hands-on experience with machine learning, and please don’t hesitate to contact me if I can clarify or correct any of what was demonstrated above. The plan is also to further optimize the predictions on this specific dataset in future articles, using other approaches and tools, so stay tuned :)

Further reading

Please find below some resources that are very useful for understanding some of the concepts exposed above: