
12/2/2018 Jupyter Notebook Viewer


BIKE SHARING DEMAND COMPETITION


Source: Kaggle | Date: 11/26/18

Overview

Predict bike sharing demand for days 20 through the end of each month, using hourly training data from days 1-19 of each month.
Only information available prior to the rental period can be used to predict demand.


1. Load Libraries and Datasets

First, all required libraries are loaded.


Load the train and test datasets and the sample submission file, already saved in the working directory. Below are the first few rows of
each.

http://nbviewer.jupyter.org/github/daniel-bejarano/Bike_Sharing_Demand/blob/master/Bike%20Sharing%20Demand%20Analysis.ipynb 1/19

   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count

0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16

1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40

2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32

Training Set Size: (10886, 12)

datetime season holiday workingday weather temp atemp humidity windspeed

0 2011-01-20 00:00:00 1 0 1 1 10.66 11.365 56 26.0027

1 2011-01-20 01:00:00 1 0 1 1 10.66 13.635 56 0.0000

2 2011-01-20 02:00:00 1 0 1 1 10.66 13.635 56 0.0000

Test Set Size: (6493, 9)

datetime count

0 2011-01-20 00:00:00 0

1 2011-01-20 01:00:00 0

2 2011-01-20 02:00:00 0

Datasets Description

Features:

datetime - hourly date & timestamp


season - 1 = spring, 2 = summer, 3 = fall, 4 = winter
holiday - 1 = holiday, 0 = non-holiday
workingday - 1 = working day, 0 = weekend
weather
1. Clear, Few clouds, Partly cloudy
2. Mist & Cloudy, Mist & Broken clouds, Mist & Few clouds, Mist
3. Light Snow, Light Rain & Thunderstorm & Scattered clouds, Light Rain & Scattered clouds
4. Heavy Rain & Ice Pellets & Thunderstorm & Mist, Snow & Fog
temp - temperature in Celsius
atemp - "feels like" temperature in Celsius
humidity - relative humidity
windspeed - wind speed
casual - number of non-registered user rentals initiated
registered - number of registered user rentals initiated
count - number of total rentals (Dependent Variable)

Variable names are all lower-case. However, throughout this document I will capitalize the first letter, so it's
obvious that I'm referring to the variables in question (e.g. Holiday, Weather, etc.).

Training Set:

Contains 10,886 observations of 12 variables. The dependent variable (what we will be predicting) is the feature
Count. It contains hourly observations for days 1-19 of every month from 2011-01-01 to 2012-12-19.

Test Set:

Contains 6,493 observations of 9 variables. It includes hourly observations for days 20 through the end of each month, from 2011-01-20 to 2012-12-31.

It does not include the Casual and Registered features (which divide renters into registered and non-registered users).
Since we won't be able to use them to predict on the test set, and since Count is linearly dependent on them
(Count = Casual + Registered), we will drop them, as they won't be of much help in our predictions.

Sample Submission:

We will be predicting the count of bicycles rented per hour on the test set.

Data Types
Let's look at what type of variables we are dealing with in our datasets.

We notice that the data is, for the most part, numeric. Some of those numeric variables are categorical ones represented as
integers: Season, Holiday, Workingday and Weather.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
None

2. Exploratory Data Analysis (EDA)


Let's first look at a pairplot of the variables to get a feel for what the dataset contains. It is shown with the Casual
and Registered features already dropped. Also, Atemp is very highly correlated with Temp, so it was dropped as well to avoid
multicollinearity.

REMEMBER that ONLY the training set should be used during EDA. This prevents us from learning
information about the test set that could bias us to reach particular conclusions that are specific to the test
set and would not necessarily generalize well.

Feature Engineering - Part 1


Casual, Registered and Atemp are dropped from the training and test sets.
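A minimal sketch of that drop on a one-row stand-in frame (the notebook applies the same call to the full train and test sets; the test set lacks the casual/registered columns to begin with):

```python
import pandas as pd

# Stand-in frame with the columns to be removed; values mirror the first
# training row shown earlier.
train = pd.DataFrame({"casual": [3], "registered": [13], "atemp": [14.395],
                      "temp": [9.84], "count": [16]})

# Drop the linearly dependent and collinear columns.
train = train.drop(columns=["casual", "registered", "atemp"])
```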

Visualization - Pair Plots



Analysis:

1. Count is skewed right. Since a good number of statistical models assume a Gaussian distribution, we will try
some transformations to normalize the data.
2. Temp and Windspeed seem to have a positive and a negative correlation with Count, respectively.
3. There is not a considerable difference between working-day and weekend ridership. It does seem to depend
on the weather, where most of the cloudy-day ridership occurs on weekends and holidays.

One thing that is not present in our pairs plot is Datetime. Let's extract its components into new columns

Feature Engineering - Part 2


Datetime is converted from a string into a datetime object.
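A sketch of that conversion and component extraction on two synthetic rows (the notebook runs this on the full set):

```python
import pandas as pd

# Two rows mirroring the head of the training set.
df = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
                   "count": [16, 40]})

# Parse the string column into datetime objects, then pull out components.
df["datetime"] = pd.to_datetime(df["datetime"])
for part in ["year", "month", "day", "hour", "weekday"]:
    df[part] = getattr(df["datetime"].dt, part)

# Index by datetime, matching the table below.
df = df.set_index("datetime")
```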

Out[105]:
                     season  holiday  workingday  weather  temp  humidity  windspeed  count  year  month  day  hour  ...
datetime

2011-01-01 00:00:00  1       0        0           1        9.84  81        0.0        16     2011  1      1    0     ...

2011-01-01 01:00:00  1       0        0           1        9.02  80        0.0        40     2011  1      1    1     ...

Visualization of Count
Get an understanding of how Count relates to other variables

Rolling Sum Over 24 Hours

Out[146]: (734138.0, 734856.9583333334)
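The rolling sum behind that plot can be sketched like this (synthetic hourly counts; the notebook rolls over the real Count series):

```python
import numpy as np
import pandas as pd

# Two days of synthetic hourly counts on a datetime index.
idx = pd.date_range("2011-01-01", periods=48, freq="h")
counts = pd.Series(np.arange(48.0), index=idx)

# Sum over a 24-hour trailing window; the first 23 entries are NaN
# because the window is not yet full.
rolling = counts.rolling(window=24).sum()
```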

Analysis:

Count cycles during each month, as well as by season, but it also shows a clear increasing tendency overall with
time.

Count Outliers


Analysis:

1. Rentals show a tendency to increase with higher temperatures, up to about 36 degrees Celsius.
2. More rentals occur during the spring and summer months.
3. From the features presented above, the one offering the largest variance amongst its values is Hour. It can be
inferred that most rentals are people commuting to work, since the average rentals at 8 AM and 5 PM are
the highest. The line plot, however, does show that the cumulative rentals between 11 AM and 8 PM are high
enough to conclude that there is considerable non-commuting renting throughout the day, but it's less
predictable, as can be seen by the large number of outliers between 10 AM and 3 PM.
4. On average, rentals are fairly uniform from one day of the week to the next.
5. One assumption we could make is that a significant number of users have alternative commuting options,
since commuting is a big part of rentals and the average Count drops considerably from good to bad weather.
We will explore this later.
6. Based on the number of outliers, it might make sense to remove the most extreme values.

Let's see if the same outliers and distributions can be observed by re-scaling Count.


Count Distribution

Out[109]: [Text(0.5,1,'Modified Distribution')]

Analysis:

The data is skewed to the right. After attempting several transformations (such as x^(1/2), x^(1/3)), a log transformation
provides a distribution that looks fairly Gaussian in comparison with the original.

Feature Engineering - Part 3: Outliers and Data Transformation


This will consist of two parts:

1. Remove outliers - We will define outliers as those points beyond the 99th percentile, roughly three standard
deviations from the mean.
2. Substitute Count with its log transformation: log(count).

One needs to be cautious when removing outliers. Just because it's an extreme value doesn't mean it's an error or
that it's not informative. When doing data INFERENCE, we seek to understand the dependent and independent
variables as much as possible, and their relationship. When the focus is PREDICTION, the goal is developing a
model that predicts accurately on unseen data. In this case our interest is a little on the former, but more so on the
latter, so we will proceed to test our model with and without outliers.

Transformations may not always provide benefits either, so we will try fitting a model with and without the data
transformation as well.
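The two steps can be sketched as follows (synthetic right-skewed counts; names are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic skewed stand-in for the Count column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"count": rng.lognormal(mean=4, sigma=1, size=1000)})

# 1. Drop observations beyond the 99th percentile.
cutoff = df["count"].quantile(0.99)
trimmed = df[df["count"] <= cutoff].copy()

# 2. Substitute the target with its log transform (log1p handles zeros).
trimmed["log_count"] = np.log1p(trimmed["count"])
```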

Correlation Analysis
To make our collinearity (when two or more predictors are closely related to each other) analysis more robust, we
create and examine a correlation matrix.
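A small synthetic stand-in for the correlation matrix computation (the notebook renders the full matrix as a heatmap; column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic columns: count is built to correlate with temp,
# humidity is independent noise.
rng = np.random.default_rng(1)
temp = rng.normal(20, 8, 500)
df = pd.DataFrame({
    "temp": temp,
    "humidity": rng.uniform(20, 100, 500),
    "count": 30 + 3 * temp + rng.normal(0, 40, 500),
})

# Pairwise Pearson correlations of all numeric columns.
corr = df.corr()
```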


Out[112]: <matplotlib.axes._subplots.AxesSubplot at 0x12877cc0>

Deductions

There are only a few variables with high correlation. Some are not very interesting, like Month being highly
correlated with Season, or Humidity being correlated with Weather.

Others, however, might be less obvious and more relevant. First, Count correlates at 0.39 with Temp. Another is
how Count correlates with Hour. Even more surprising is that Log_count is highly correlated with Hour. It looks like
the data transformation may have turned some of these predictors into more powerful ones.

Some Remaining Questions


Some questions that came up during the analysis so far:

1. How does Count vary depending on the day of the week?


2. Are there times during the year when a higher percentage of the bike rentals are on a workday vs the weekend?
3. How much does Temp affect Count throughout the day?
To answer this question we need to convert Temp into a categorical variable, splitting it into 4 segments: 0-10, 11-20, 21-30, 31-max.
4. How would each of these differ between Count and Log_Count?
The answer to this question is that the relative changes between weather types, workingday, and Temp
remain after transforming the data. Since Count is easier to interpret than Log_Count, we will only plot
Count.
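The Temp bucketing from question 3 can be sketched with `pd.cut` (bin edges taken from the segments above; the upper edge of 50 is an illustrative cap):

```python
import pandas as pd

# A few sample temperatures in Celsius.
temps = pd.Series([5.0, 12.5, 24.0, 35.0, 8.0])

# Right-inclusive bins: (0, 10], (10, 20], (20, 30], (30, 50].
temp_cat = pd.cut(temps, bins=[0, 10, 20, 30, 50],
                  labels=["0-10", "11-20", "21-30", "31-max"])
```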


Let's plot those relationships mentioned above to come up with some insights.

Analysis:

1. As previously observed, during the week most of the rentals are customers commuting to/from work. The
weekend shows high rentals throughout mid-day and into the afternoon.
2. Hourly rentals vary across the months of the year, but they are not very different between working and non-
working days. We do see a trend that more "recreational" rides (non-working days) occur during the summer
when compared to working-day rentals. The opposite is true during the winter.
3. Weather seems to affect rentals uniformly regardless of the time of day. The extreme cases are when
rentals are very low and when they are very high; the gaps between good- and bad-weather rentals are
smaller/larger at those points in time, respectively.


The EDA above should be sufficient to give us an idea of how variables relate to each other, especially how
Count relates to the rest. This information is highly important when choosing the model, doing feature
selection or any other type of feature engineering, and when analyzing the results from our models.

Before we move on to modeling, let's define functions for pre-processing data to get it in the appropriate
form for our models. We will use functions since we might want to use different pre-processing steps for
different models, and since we will do this both for our train and test sets. Functions will therefore save time
and avoid confusion.

3. Data Preprocessing

Define functions to Preprocess Data

Let's load and preprocess the data. The end result will be what we walked through in the EDA section above, with 3
major additions:

1. All categorical variables (those which can only take on discrete values, whether they come in numerical or
string form) will be converted to one-hot-encodings. This is highly recommended for certain ML algorithms
because, say, for Season, a value of 4 does not mean anything in relation to a value of 1 other than the fact
they are two different categories. The fact that winter is 4 does not mean it's 4 times larger than spring.
2. We will normalize and standardize features before inputting them into some of the models.
Neural Networks, for instance, are particularly sensitive to features on different scales. This works best when the
features are normally distributed, which in this case is not true for all of them.
3. Our training dataset will be split into X_train, X_val, y_train and y_val.
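A minimal sketch of these three steps on a synthetic frame (column names are illustrative; the real pipeline handles all categorical columns and yields 53 features):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the preprocessed training frame.
rng = np.random.default_rng(2)
df = pd.DataFrame({"season": rng.integers(1, 5, 200),
                   "temp": rng.normal(20, 8, 200),
                   "log_count": rng.normal(4, 1, 200)})

# 1. One-hot encode the categorical column.
X = pd.get_dummies(df[["season", "temp"]], columns=["season"])

# 2. Standardize the continuous feature (zero mean, unit variance).
X["temp"] = (X["temp"] - X["temp"].mean()) / X["temp"].std()

# 3. Hold out 20% of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, df["log_count"], test_size=0.2, random_state=42)
```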

Data Preprocessing for train and test sets


These functions should take the input data and transform it to a form that can be fed into the various predictive
models we will fit.

We check the datasets shapes to ensure our pre-processing function did its job correctly:

X_train Shape: (8708, 53)


X_val Shape: (2178, 53)
y_train Shape: (8708,)
y_val Shape: (2178,)
Test Set Shape: (6493, 53)

4. Predictive Data Analytics (PDA)

Model A: Linear Regression

In the code we can observe that copies of the datasets were made, in case we want to make some alterations, like
dropping features that are not predictive enough. This way we can quickly test the model's performance on
differently feature-engineered sets.


OLS Regression Results


==============================================================================
Dep. Variable: count R-squared: 0.826
Model: OLS Adj. R-squared: 0.826
Method: Least Squares F-statistic: 896.7
Date: Fri, 30 Nov 2018 Prob (F-statistic): 0.00
Time: 20:13:47 Log-Likelihood: -8226.0
No. Observations: 8708 AIC: 1.655e+04
Df Residuals: 8661 BIC: 1.688e+04
Df Model: 46
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -862.1351 25.258 -34.133 0.000 -911.648 -812.623
temp 0.2264 0.016 14.160 0.000 0.195 0.258
humidity -0.0420 0.009 -4.593 0.000 -0.060 -0.024
windspeed -0.0332 0.007 -4.610 0.000 -0.047 -0.019
year 0.5374 0.016 34.241 0.000 0.507 0.568
day 0.0049 0.001 3.987 0.000 0.002 0.007
season_2 0.4854 0.035 13.955 0.000 0.417 0.554
season_3 0.4670 0.036 12.946 0.000 0.396 0.538
season_4 0.5968 0.025 24.274 0.000 0.549 0.645
holiday_1 -215.5597 6.315 -34.135 0.000 -227.939 -203.181
workingday_1 -215.6174 6.314 -34.147 0.000 -227.995 -203.240
weather_2 -0.0550 0.017 -3.258 0.001 -0.088 -0.022
weather_3 -0.5952 0.028 -21.478 0.000 -0.649 -0.541
month_2 0.1739 0.030 5.781 0.000 0.115 0.233
month_3 0.2609 0.032 8.057 0.000 0.197 0.324
month_5 0.2225 0.031 7.159 0.000 0.162 0.283
month_6 0.1649 0.034 4.806 0.000 0.098 0.232
month_7 0.0767 0.024 3.241 0.001 0.030 0.123
month_8 0.1514 0.026 5.812 0.000 0.100 0.202
month_9 0.2390 0.025 9.457 0.000 0.189 0.289
month_10 0.2516 0.028 8.917 0.000 0.196 0.307
month_11 0.2041 0.025 8.147 0.000 0.155 0.253
month_12 0.1411 0.025 5.590 0.000 0.092 0.191
hour_1 -0.6441 0.046 -13.931 0.000 -0.735 -0.553
hour_2 -1.1740 0.046 -25.275 0.000 -1.265 -1.083
hour_3 -1.7432 0.047 -37.113 0.000 -1.835 -1.651
hour_4 -2.0879 0.047 -44.661 0.000 -2.180 -1.996
hour_5 -0.9961 0.046 -21.431 0.000 -1.087 -0.905
hour_6 0.2604 0.046 5.613 0.000 0.169 0.351
hour_7 1.2552 0.046 27.096 0.000 1.164 1.346
hour_8 1.9068 0.046 41.219 0.000 1.816 1.997
hour_9 1.5865 0.046 34.278 0.000 1.496 1.677
hour_10 1.2572 0.046 27.064 0.000 1.166 1.348
hour_11 1.3697 0.047 29.297 0.000 1.278 1.461
hour_12 1.5590 0.047 33.114 0.000 1.467 1.651
hour_13 1.5431 0.047 32.490 0.000 1.450 1.636
hour_14 1.4609 0.048 30.576 0.000 1.367 1.555
hour_15 1.5117 0.048 31.567 0.000 1.418 1.606
hour_16 1.7693 0.048 37.005 0.000 1.676 1.863
hour_17 2.1938 0.048 46.125 0.000 2.101 2.287
hour_18 2.1168 0.047 44.780 0.000 2.024 2.209
hour_19 1.8188 0.047 38.859 0.000 1.727 1.911
hour_20 1.5171 0.047 32.604 0.000 1.426 1.608
hour_21 1.2588 0.046 27.161 0.000 1.168 1.350
hour_22 1.0165 0.046 21.976 0.000 0.926 1.107
hour_23 0.6034 0.046 13.055 0.000 0.513 0.694
weekday_3 0.0739 0.021 3.592 0.000 0.034 0.114
weekday_4 0.1761 0.021 8.518 0.000 0.136 0.217
weekday_5 -215.4300 6.314 -34.117 0.000 -227.808 -203.052
weekday_6 -215.5280 6.315 -34.131 0.000 -227.906 -203.150
==============================================================================
Omnibus: 878.630 Durbin-Watson: 0.604
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2004.072
Skew: -0.618 Prob(JB): 0.00
Kurtosis: 4.999 Cond. No. 9.48e+18
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.92e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
MSLE: 0.5688354263532879

Comments on How to Interpret Model Results:

You should not be able to predict the errors (there should be no observable trends). If your model did a good job
explaining/predicting the response, then the remaining error would be stochastic (the portion of the error inherent to
real-life randomness).

Also, you should not be able to predict residuals from another variable; if you can, then that variable should be
included in your model. Lastly, adjacent residuals should not be correlated with each other. This is called
autocorrelation and means the deterministic portion of your model is not capturing that information (often found in
time series).

Let's take a look at a residuals plot and determine whether a linear model that predicts the Log_Count is better or
worse than predicting Count without any log transformations.

Plot Residuals vs Predictions - WITH log transformation of Count

LINEAR REGRESSION WITH LOG TRANSFORMATION ON DATA


MSLE: 0.5689440566420537

Plot Residuals vs Predictions - WITHOUT log transformation of Count


LINEAR REGRESSION WITHOUT LOG TRANSFORMATION ON DATA


MSLE: 0.9980210264360666
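For reference, the MSLE score reported throughout can be computed as below (a sketch equivalent to sklearn's `mean_squared_log_error`; the sample values are illustrative):

```python
import numpy as np

# Mean squared logarithmic error: mean of squared differences of log1p's.
def msle(y_true, y_pred):
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Illustrative true counts and predictions.
y_true = np.array([16.0, 40.0, 32.0])
y_pred = np.array([14.0, 45.0, 30.0])
score = msle(y_true, y_pred)
```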

Analysis:

The following variations of the input data generated these results:

Using Count instead of Log-Count gives a score of 0.99, almost double the 0.56 obtained by
fitting a model that predicts the log of Count.
Removing outliers negatively impacted the model performance. Therefore, we will keep outliers.
Normalizing and Standardizing had no effect on the model. We will keep the original values.
Not converting categorical variables into one-hot vectors doubled the error rate.

After choosing log(count) as the better target variable:

1. The linear model does a good job at explaining the response variable Count, with an Adjusted R^2 score of
0.825.
2. Dropping those features with low predictive power (high p-values) did not result in any changes to the error rate
or R^2 score, so we will keep them for simplicity.
3. The error score from regularized models (both L1 and L2) came up to be higher than with no regularization, so
we will keep the unregularized model.

Note: Some of the predictions were negative. In reality we can't have negative bike rentals, so these values
were substituted by zeros.
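Flooring the negative predictions at zero, as described in the note, is a one-liner:

```python
import numpy as np

# Illustrative predictions, one of them negative.
preds = np.array([12.3, -4.1, 0.0, 57.8])

# Negative rental counts make no sense, so clip them to zero.
preds = np.clip(preds, 0.0, None)
```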

From looking at the "Residuals vs Predictions" plots, we observe discernible patterns. This means not all
of the non-random portion of the error is being captured. Let's try some non-linear models next and perform
a similar analysis as above.

Model B: Neural Network


_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_21 (InputLayer) (None, 53) 0
_________________________________________________________________
dense_61 (Dense) (None, 80) 4320
_________________________________________________________________
dense_62 (Dense) (None, 80) 6480
_________________________________________________________________
dense_63 (Dense) (None, 1) 81
=================================================================
Total params: 10,881
Trainable params: 10,881
Non-trainable params: 0
_________________________________________________________________
Train on 6966 samples, validate on 1742 samples
Epoch 1/10
6966/6966 [==============================] - 1s 211us/step - loss: 2.1208 - val_loss: 2.4419
Epoch 2/10
6966/6966 [==============================] - 0s 64us/step - loss: 2.1072 - val_loss: 2.5309
Epoch 3/10
6966/6966 [==============================] - 0s 51us/step - loss: 2.0911 - val_loss: 1.9950
Epoch 4/10
6966/6966 [==============================] - 0s 57us/step - loss: 2.0609 - val_loss: 2.2041
Epoch 5/10
6966/6966 [==============================] - 1s 89us/step - loss: 1.9502 - val_loss: 1.8334
Epoch 6/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.9393 - val_loss: 2.0684
Epoch 7/10
6966/6966 [==============================] - 1s 90us/step - loss: 1.8638 - val_loss: 1.7712
Epoch 8/10
6966/6966 [==============================] - 1s 87us/step - loss: 1.8040 - val_loss: 2.1831
Epoch 9/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.7428 - val_loss: 1.7423
Epoch 10/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.7758 - val_loss: 1.6856
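The parameter counts in the model summary above follow from the dense-layer formula (inputs + 1) × units:

```python
# Dense-layer parameter count = (n_inputs + 1) * n_units (weights + bias).
p1 = (53 + 1) * 80   # InputLayer (53 features) -> first Dense(80)
p2 = (80 + 1) * 80   # Dense(80) -> second Dense(80)
p3 = (80 + 1) * 1    # Dense(80) -> Dense(1) output
total = p1 + p2 + p3
```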

Plot Residuals vs Predictions


NEURAL NETWORKS MODEL RESULTS


MSLE: 1.333658125759546

The results from our Neural Network are not great at all. Let's try some other models.

Model C: Random Forests

Plot Residuals vs Predictions


RANDOM FORESTS MODEL RESULTS


MSLE: 1.2280650533297208
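A hedged sketch of a scikit-learn random-forest fit of the kind used here (synthetic data; the notebook's hyperparameters are not shown, so defaults plus a fixed seed are assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic nonlinear regression problem.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 300)

# Fit an ensemble of 100 trees and predict on the training data.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
preds = rf.predict(X)
```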

Model D: XGBRegressor

Fitting 5 folds for each of 2 candidates, totalling 10 fits

[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 34.7s finished

Plot Residuals vs Predictions


XGBOOST REGRESSOR WITHOUT LOG TRANSFORMATION MODEL RESULTS


MSLE: 0.6093859464940209

Model E: Ensemble Gradient Boosting

Fitting 5 folds for each of 1 candidates, totalling 5 fits

[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 8.4s finished

The Best Score : 0.7976643402785151


The Best Params : {'learning_rate': 0.1, 'max_depth': 6}
MSLE Score: 0.1154609577190961
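The 5-fold grid search can be sketched as follows with scikit-learn's `GradientBoostingRegressor` (synthetic data; the grid here contains only the best parameters reported above, whereas the notebook searched over several candidates):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic linear-ish regression data.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

# 5-fold cross-validated search over the reported best parameters.
grid = GridSearchCV(GradientBoostingRegressor(random_state=0),
                    param_grid={"learning_rate": [0.1], "max_depth": [6]},
                    cv=5)
grid.fit(X, y)
```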

Plot Residuals vs Predictions

XGBOOST REGRESSOR WITH LOG TRANSFORMATION MODEL RESULTS


MSLE: 0.11240912002474802

[Extra] Model F: Recurrent Neural Network (RNN) and Additional Feature Engineering
I tried a couple of other models just to see what kind of results they would give. They weren't great, but if you're
interested here I describe both:


1. RNN - A rolling model that used data from the previous 24 hours to predict Count. It incorporated the Count
variable from those 24 hours. Therefore, when it got to the 20th day of the month, it took data from day 19. This
meant, however, that on day 21 it used data from day 20, which were all predictions obtained previously. The
Count predictions ended up being too low. My hypothesis is that the RNN model (which used GRU cells)
assigned heavy neuron weights on the Count input from the previous 24 hours because it was a strong
predictor, but it got stuck predicting values very similar to the previous Counts. This, as we saw in our EDA, is
not a behavior observed on the data.
2. Additional Feature - Similarly to above, I incorporated Count from previous observations and used XGBoost.
The model used Count from exactly 24 hours earlier (same time, previous day), but just that one Count value,
not a window of 24 observations as in the RNN model above. To predict on the test set it used data from the
training set only for day 20. Day 21 used the predictions from day 20, and so on. Therefore it resulted in a
similar pattern as the RNN model: relying too heavily on the Count_From_24_Hours_Ago feature and then
getting stuck on very similar values. This could be circumvented in various ways; one would be to heavily
regularize the weight for that particular input, for instance.
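The lagged feature from approach 2 can be sketched with a 24-row shift (the column name `count_24h_ago` is illustrative for the Count_From_24_Hours_Ago feature described above):

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly counts on a datetime index.
idx = pd.date_range("2011-01-01", periods=72, freq="h")
df = pd.DataFrame({"count": np.arange(72.0)}, index=idx)

# The count observed exactly 24 hours earlier (same hour, previous day);
# the first 24 rows have no history and are NaN.
df["count_24h_ago"] = df["count"].shift(24)
```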

With all this in mind, let's move on to the final results...

5. Generate Results and Export Models


Based on the 7 models attempted, and their set of variations, Linear Regression and Ensemble Gradient Boosting
performed best. Therefore, we will submit results for those two models and cross our fingers (that's important, don't
forget the cross-fingers step).

In order to produce the final results, we need to do one last thing: train the models on the entire training set. So
far we have been using a subset of it, so that we could use the remaining portion (20%) for validation. So let's train
Linear Regression and Ensemble Gradient Boosting on the entire set.

5.1 Linear Regression on Whole Training Set


Code to perform linear regression on entire training set.

5.2 Ensemble Gradient Boosting on Whole Training Set


Code to perform Ensemble Gradient Boosting on entire training set.

Fitting 3 folds for each of 1 candidates, totalling 3 fits

[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 22.7s finished

5.3 Save Files with Final Predictions on Test Set


Code to save files with final predictions.

Final Results

Combine Count for train and test sets to plot a rolling sum of rentals over the entire period

Done


Out[13]: (734128.0, 734878.9583333334)

The plot shows predictions to be well within what we would expect based on the trends in the training set.

After submitting both models to Kaggle, Ensemble Gradient Boosting with log-transformed Count had the lowest
error rate at 0.43. This could be improved further by increasing the number of estimators and exploring other values
for its parameters (we tested only a few combinations with GridSearchCV).

A shout out to Vivek Srinivasan and his Kernel "EDA & Ensemble Model (Top 10 Percentile)". It provided great
insights that helped me structure my EDA.

Kindly share if you found this useful.

