Bike_Sharing_Demand (/github/daniel-bejarano/Bike_Sharing_Demand/tree/master) / Bike Sharing Demand Analysis.ipynb
Overview
Predict bike sharing demand for days 20 through the end of each month, using hourly training data from days 1-19 of the month.
Only information available prior to the rental period can be used to predict demand.
Load the training dataset and sample submission file already saved in the directory. Below are the first few rows of each.
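Since the notebook's code is hidden, here is a minimal sketch of the loading step. The file name `train.csv` and the use of `parse_dates` are assumptions; a tiny inline sample stands in for the file so the sketch is self-contained.

```python
import io
import pandas as pd

# In the notebook this would read the saved file, e.g.:
#   train = pd.read_csv("train.csv", parse_dates=["datetime"])
# Here a small inline sample stands in for train.csv so the snippet runs on its own.
sample = io.StringIO(
    "datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count\n"
    "2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16\n"
    "2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40\n"
)
train = pd.read_csv(sample, parse_dates=["datetime"])
```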
http://nbviewer.jupyter.org/github/daniel-bejarano/Bike_Sharing_Demand/blob/master/Bike%20Sharing%20Demand%20Analysis.ipynb 1/19
12/2/2018 Jupyter Notebook Viewer
   datetime             season  holiday  workingday  weather  temp  atemp   humidity  windspeed  casual  registered  count
0  2011-01-01 00:00:00  1       0        0           1        9.84  14.395  81        0.0        3       13          16
1  2011-01-01 01:00:00  1       0        0           1        9.02  13.635  80        0.0        8       32          40
2  2011-01-01 02:00:00  1       0        0           1        9.02  13.635  80        0.0        5       27          32
datetime count
0 2011-01-20 00:00:00 0
1 2011-01-20 01:00:00 0
2 2011-01-20 02:00:00 0
Datasets Description
Features:
Variable names are all lower-case. However, throughout this document I will capitalize the first letter, so it's
obvious that I'm referring to the variables in question (e.g. Holiday, Weather, etc.).
Training Set:
Contains 10,886 observations of 12 variables. The dependent variable (what we will be predicting) is the feature
Count. It contains hourly observations for days 1-19 of every month from 2011-01-01 to 2012-12-19.
Test Set:
Contains 6,493 observations of 9 variables. It includes hourly observations for days 20 through the end of each month, from 2011-01-20 to 2012-12-31.
It does not include the Casual and Registered features (which divide renters into registered and non-registered users).
Since we won't be able to use them to predict on the test set, and since Count is linearly dependent on them
(Count = Casual + Registered), we will drop them, as they won't be of much help in our predictions.
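The drop itself is a one-liner in pandas, sketched here on a toy frame with the same column names:

```python
import pandas as pd

# Toy frame with the relevant columns; note Count = Casual + Registered.
df = pd.DataFrame({"casual": [3, 8, 5], "registered": [13, 32, 27], "count": [16, 40, 32]})
assert (df["count"] == df["casual"] + df["registered"]).all()  # linear dependence
df = df.drop(columns=["casual", "registered"])                 # not available at test time
```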
Sample Submission:
We will be predicting the count of bicycles rented per hour on the test set.
Data Types
Let's look at what type of variables we are dealing with in our datasets.
We notice that most of the data is numeric. Some of those numeric variables are actually categorical ones represented as
integers: Season, Holiday, Workingday and Weather.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
datetime 10886 non-null object
season 10886 non-null int64
holiday 10886 non-null int64
workingday 10886 non-null int64
weather 10886 non-null int64
temp 10886 non-null float64
atemp 10886 non-null float64
humidity 10886 non-null int64
windspeed 10886 non-null float64
casual 10886 non-null int64
registered 10886 non-null int64
count 10886 non-null int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.6+ KB
None
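One optional way to make that distinction explicit (an illustration, not necessarily what the notebook does) is to mark those integer columns as pandas categoricals:

```python
import pandas as pd

# The integer codes carry no magnitude: season 4 is not "four times" season 1.
df = pd.DataFrame({"season": [1, 2, 3, 4], "holiday": [0, 0, 1, 0],
                   "workingday": [1, 1, 0, 0], "weather": [1, 1, 2, 3]})
for col in ["season", "holiday", "workingday", "weather"]:
    df[col] = df[col].astype("category")
```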
REMEMBER that ONLY the training set should be used during EDA. This prevents us from learning
information about the test set that could bias us to reach particular conclusions that are specific to the test
set and would not necessarily generalize well.
Analysis:
1. Count is skewed right. Since a good number of statistical models assume a Gaussian distribution, we will try
some transformations to normalize the data.
2. Temp and Windspeed seem to have a positive and a negative correlation with Count, respectively.
3. There is not such a considerable difference between working-day and weekend ridership. It does seem to depend
on the weather, where most of the cloudy-day ridership occurs on weekends and holidays.
One thing that is not present in our pairs plot is Datetime. Let's extract its components into new columns
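The extraction is straightforward with the pandas `.dt` accessor; the new column names (year, month, day, hour, weekday) are assumptions based on the table headers:

```python
import pandas as pd

# Derive the datetime component columns used in the rest of the analysis.
df = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-01-01 01:00:00"])})
dt = df["datetime"].dt
df["year"], df["month"], df["day"] = dt.year, dt.month, dt.day
df["hour"], df["weekday"] = dt.hour, dt.dayofweek
```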
Out[105]:
datetime             season  holiday  workingday  weather  temp  humidity  windspeed  count  year  month  day  hour  we…
2011-01-01 00:00:00  1       0        0           1        9.84  81        0.0        16     2011  1      1    0
2011-01-01 01:00:00  1       0        0           1        9.02  80        0.0        40     2011  1      1    1
Visualization of Count
Get an understanding of how Count relates to other variables
Analysis:
Count cycles during each month, as well as by season, but it also shows a clear overall increasing trend over
time.
Count Outliers
Analysis:
Let's see if the same outliers and distributions can be observed by re-scaling Count.
Count Distribution
Analysis:
The data is skewed to the right. After attempting several transformations (such as x^(1/2) and x^(1/3)), a log transformation
provides a distribution that looks fairly Gaussian in comparison with the original.
1. Remove outliers - We will define outliers as those points beyond the 99th percentile, which is 3 standard
deviations from the mean.
2. Substitute Count with its log transformation: log(count).
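The two steps can be sketched on a toy series; the notebook's exact cutoff logic may differ, and `np.log1p` is used here (an assumption) so that a count of 0 stays finite:

```python
import numpy as np
import pandas as pd

# Toy series: 100 ordinary values plus one extreme point.
count = pd.Series(list(range(100)) + [10_000])
cutoff = count.quantile(0.99)        # 1. cutoff at the 99th percentile
kept = count[count <= cutoff]        #    drop points beyond it
log_count = np.log1p(kept)           # 2. log-transform the target
```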
One needs to be cautious when removing outliers. Just because it's an extreme value doesn't mean it's an error or
that it's not informative. When doing data INFERENCE, we seek to understand the dependent and independent
variables as much as possible, and their relationship. When the focus is PREDICTION, the goal is developing a
model that predicts accurately on unseen data. In this case our interest is a little on the former, but more so on the
latter, so we will proceed to test our model with and without outliers.
Transformations may not always provide benefits either, so we will try fitting a model with and without the data
transformation as well.
Correlation Analysis
To make our collinearity (when two or more predictors are closely related to each other) analysis more robust, we
create and examine a correlation matrix.
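The matrix itself is just `DataFrame.corr()`, sketched here on synthetic columns; the near-collinear pair mimics Temp/Atemp and is an illustration, not the notebook's data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp = rng.normal(20, 5, 200)
df = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, 200),       # nearly collinear with temp
    "count": 3 * temp + rng.normal(0, 10, 200),  # correlated with temp
})
corr = df.corr()
# seaborn.heatmap(corr, annot=True) would render the usual heatmap view
```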
Deductions
There are only a few variables with high correlation. Some are not very interesting, like Month being highly
correlated with Season, or Humidity being correlated with Weather.
Others, however, might be less obvious and more relevant. First, Count correlates at 0.39 with Temp. Another is
how Count correlates with Hour. Even more surprising is that Log_count is highly correlated with Hour. It looks like
the data transformation may have turned some of these predictors into more powerful ones.
Let's plot those relationships mentioned above to come up with some insights.
Analysis:
1. As previously observed, during the week most of the rentals are customers commuting to/from work. The
weekend shows high rentals throughout mid-day and into the afternoon.
2. Hourly rentals vary on different months of the year, but they are not very different between working and non-
working days. We do see a trend that more "recreational" rides (non-working days) occur during the summer
when compared to workingday rentals. The opposite is true during the winter.
3. Weather seems to affect rentals fairly uniformly regardless of the time of day. The extremes are the hours when
rentals are very low and when they are very high: the gaps between good-weather and bad-weather rentals are
smaller and larger, respectively, at those points in time.
The EDA above should be sufficient to give us an idea of how variables relate to each other, especially how
Count relates to the rest. This information is highly important when choosing the model, doing feature
selection or any other type of feature engineering, and when analyzing the results from our models.
Before we move on to modeling, let's define functions for pre-processing data to get it in the appropriate
form for our models. We will use functions since we might want to use different pre-processing steps for
different models, and since we will do this both for our train and test sets. Functions will therefore save time
and avoid confusion.
3. Data Preprocessing
Let's load and preprocess the data. The end result will be what we walked through in the EDA section above, with 3
major additions:
1. All categorical variables (those which can only take on discrete values, whether they come in numerical or
string form) will be converted to one-hot-encodings. This is highly recommended for certain ML algorithms
because, say, for Season, a value of 4 does not mean anything in relation to a value of 1 other than the fact
they are two different categories. The fact that winter is 4 does not mean it's 4 times larger than spring.
2. We will perform normalization and standardization of features before inputting them into some of the models.
Neural Networks, for instance, are particularly sensitive to features on different scales. This is most reasonable
when the features are normally distributed, which in this case is not true for all of them.
3. Our training dataset will be split into X_train, X_val, y_train and y_val.
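The three steps above can be sketched like this; the toy column names, the 80/20 split ratio, and the choice of `StandardScaler` are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the training data.
df = pd.DataFrame({
    "season": [1, 2, 3, 4] * 25,
    "temp": [9.84, 14.0, 25.0, 12.0] * 25,
    "count": [16, 40, 80, 30] * 25,
})
df = pd.get_dummies(df, columns=["season"])  # 1. one-hot encode categoricals
X, y = df.drop(columns=["count"]), df["count"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)     # 3. train/validation split
scaler = StandardScaler().fit(X_train)       # 2. standardize, fitting on train only
X_train_s = scaler.transform(X_train)
```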
We check the datasets shapes to ensure our pre-processing function did its job correctly:
In the code we can observe that copies of the datasets were made, in case we want to make some alterations, like
dropping features that are not predictive enough. This way we can quickly test the performance of the model on
differently feature-engineered sets.
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.92e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
MSLE: 0.5688354263532879
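The score above is the mean squared log error. For reference, sklearn's `mean_squared_log_error` matches the direct formula; the sample values here are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([16, 40, 32])
y_pred = np.array([14, 44, 30])
msle = mean_squared_log_error(y_true, y_pred)
# Identical to averaging squared differences of log(1 + y):
manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```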
You should not be able to predict the errors (there should be no observable trends). If your model did a good job at
explaining/predicting the response, then the error left would be stochastic (the portion of the error that is inherent to
real-life randomness).
Also, you should not be able to predict residuals based off another variable. If you can, then that variable should be
included in your model. Lastly, adjacent residuals should not be correlated with each other. This is called
autocorrelation and means the deterministic portion of your model is not capturing that information (often found in
time series).
Let's take a look at a residuals plot and determine whether a linear model that predicts the Log_Count is better or
worse than predicting Count without any log transformations.
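These two checks can also be run numerically; here on synthetic, well-behaved residuals (an illustration, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(1)
preds = rng.uniform(0, 6, 300)      # stand-in predictions
resid = rng.normal(0, 0.5, 300)     # structureless residuals, as desired
# Residuals should be uncorrelated with the predictions...
corr_with_preds = np.corrcoef(preds, resid)[0, 1]
# ...and adjacent residuals should be uncorrelated (no autocorrelation).
lag1_autocorr = np.corrcoef(resid[:-1], resid[1:])[0, 1]
```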
Analysis:
Using Count instead of Log_Count yields an error score of 0.99, almost double the 0.56 obtained by
fitting a model that predicts the log of Count.
Removing outliers negatively impacted the model performance. Therefore, we will keep outliers.
Normalizing and Standardizing had no effect on the model. We will keep the original values.
Not converting categorical variables into one-hot vectors doubled the error rate.
1. The linear model does a good job at explaining the response variable Count, with an Adjusted R^2 score of
0.825.
2. Dropping those features with low predictive power (high p-values) did not result in any changes to the error rate
or R^2 score, so we will keep them for simplicity.
3. The error score from regularized models (both L1 and L2) came up to be higher than with no regularization, so
we will keep the unregularized model.
Note: Some of the predictions were negative. In reality we can't have negative bike rentals, so these values
were substituted by zeros.
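The substitution described in the note is a one-liner:

```python
import numpy as np

preds = np.array([12.3, -4.1, 0.0, 250.7])  # hypothetical raw predictions
preds = np.clip(preds, 0, None)             # negative rentals are impossible
```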
From looking at the "Residuals vs Predictions" plots, we observe discernible patterns. This means not all
of the non-random portion of the error is being captured. Let's try some non-linear models next and perform
a similar analysis as above.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_21 (InputLayer) (None, 53) 0
_________________________________________________________________
dense_61 (Dense) (None, 80) 4320
_________________________________________________________________
dense_62 (Dense) (None, 80) 6480
_________________________________________________________________
dense_63 (Dense) (None, 1) 81
=================================================================
Total params: 10,881
Trainable params: 10,881
Non-trainable params: 0
_________________________________________________________________
Train on 6966 samples, validate on 1742 samples
Epoch 1/10
6966/6966 [==============================] - 1s 211us/step - loss: 2.1208 - val_loss: 2.4419
Epoch 2/10
6966/6966 [==============================] - 0s 64us/step - loss: 2.1072 - val_loss: 2.5309
Epoch 3/10
6966/6966 [==============================] - 0s 51us/step - loss: 2.0911 - val_loss: 1.9950
Epoch 4/10
6966/6966 [==============================] - 0s 57us/step - loss: 2.0609 - val_loss: 2.2041
Epoch 5/10
6966/6966 [==============================] - 1s 89us/step - loss: 1.9502 - val_loss: 1.8334
Epoch 6/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.9393 - val_loss: 2.0684
Epoch 7/10
6966/6966 [==============================] - 1s 90us/step - loss: 1.8638 - val_loss: 1.7712
Epoch 8/10
6966/6966 [==============================] - 1s 87us/step - loss: 1.8040 - val_loss: 2.1831
Epoch 9/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.7428 - val_loss: 1.7423
Epoch 10/10
6966/6966 [==============================] - 0s 60us/step - loss: 1.7758 - val_loss: 1.6856
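The parameter counts in the summary above follow from the Dense-layer formula (weights plus biases), which can be checked directly:

```python
# Each Dense layer has n_in * n_out weights plus n_out biases.
def dense_params(n_in, n_out):
    return n_in * n_out + n_out

p1 = dense_params(53, 80)  # input (53 features) -> first hidden layer: 4,320
p2 = dense_params(80, 80)  # first -> second hidden layer: 6,480
p3 = dense_params(80, 1)   # second hidden layer -> output: 81
total = p1 + p2 + p3       # 10,881, matching the summary
```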
The results from our Neural Network are not great at all. Let's try some other models.
Model D: XGBRegressor
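A minimal sketch of fitting a gradient-boosted regressor with a small GridSearchCV, on synthetic data. The notebook uses xgboost's XGBRegressor, which exposes the same fit/predict interface; sklearn's GradientBoostingRegressor is swapped in here so the sketch stays self-contained, and the parameter grid is an assumption:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 5))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(0, 0.1, 200)

# Small illustrative grid; the notebook tests only a few combinations too.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X, y)
```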
1. RNN - A rolling model that used data from the previous 24 hours to predict Count. It incorporated the Count
variable from those 24 hours. Therefore, when it got to the 20th day of the month, it took data from day 19. This
meant, however, that on day 21 it used data from day 20, which were all predictions obtained previously. The
Count predictions ended up being too low. My hypothesis is that the RNN model (which used GRU cells)
assigned heavy neuron weights on the Count input from the previous 24 hours because it was a strong
predictor, but it got stuck predicting values very similar to the previous Counts. This, as we saw in our EDA, is
not a behavior observed on the data.
2. Additional Feature - Similarly to above, I incorporated Count from previous observations and used XGBoost.
The model used Count from exactly 24 hours earlier (same time, previous day), but just that one Count value,
not a window of 24 observations as in the RNN model above. To predict on the test set it used data from the
training set only for day 20. Day 21 used the predictions from day 20, and so on. Therefore it resulted in a
similar pattern as the RNN model: relying too heavily on the Count_From_24_Hours_Ago feature and then
getting stuck on very similar values. This could be circumvented in various ways. One would be to heavily
regularize the weight for that particular input, for instance.
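The lag feature described above is a simple shift on hourly data; the column name `Count_From_24_Hours_Ago` mirrors the description and is an assumption:

```python
import pandas as pd

# Hourly observations, so "24 hours earlier" is a shift of 24 rows.
df = pd.DataFrame({"count": range(48)})
df["Count_From_24_Hours_Ago"] = df["count"].shift(24)
# The first 24 rows have no prior day and come out as NaN.
```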
In order to produce the final results, we need to do one last thing: train the models on the entire training set. So
far we have been using a subset of it, so that we could use the remaining portion (20%) for validation. So let's train
Linear Regression and Ensemble Gradient Boosting on the entire set.
Final Results
Combine Count for train and test sets to plot a rolling sum of rentals over the entire period
Done
The plot shows predictions to be well within what we would expect based on the trends in the training set.
After submitting both models to Kaggle, Ensemble Gradient Boosting with log-transformed Count had the lowest
error rate at 0.43. This could be improved further by increasing the number of estimators and exploring other values
for its parameters (we tested only a few combinations with GridSearchCV).
A shout out to Vivek Srinivasan and his Kernel "EDA & Ensemble Model (Top 10 Percentile)". It provided great
insights that helped me structure my EDA.