
MACHINE LEARNING ENGINEER NANODEGREE

CAPSTONE PROPOSAL

PROPOSAL

1. Project Background and Description


Predicting ad click-through rates is a massive-scale machine learning problem
that is central to the multi-billion dollar online advertising industry. The project
was inspired by Google AdSense, a program run by Google that allows
publishers in the Google Network of content sites to serve automatic text,
image, video, or interactive media advertisements targeted to site content and
audience. Publishers can generate revenue on either a per-click or
per-impression basis. In this project I create a model that predicts the
probability that an ad will be clicked, based on a set of input features.

2. Problem Statement
Online advertising is a multi-billion dollar industry that has served as one of the
great success stories for machine learning. Sponsored advertising relies
heavily on the ability of learned models to predict ad click-through rates
accurately. In this project I plan to build a supervised learning model that
estimates the probability that an advertisement will be clicked. I will treat this
as a probabilistic classification problem and use the classifier's predict_proba
method to obtain that probability. The input variables are described below in
the Datasets and Inputs section. I also plan to give a statistical analysis of
which supervised learning algorithm performs best. The final model is expected
to be useful for companies, which could use it to make the necessary changes
to make their advertisements more effective.

3. Datasets and Inputs


The dataset and inputs were obtained from the HackerEarth platform, where
the same problem was listed as a challenge; it can be downloaded via the link
in the references section. The dataset is very large: test.csv is around 258 MB
and train.csv is around 850 MB. The data come from a European network
company whose networks are spread across multiple countries such as
Portugal, Germany, France, Austria, and Switzerland. Input files are also
provided with the dataset package. train.csv and test.csv each contain records
for more than 1 million IDs. The following variables are given in the dataset:

Variable     Description

ID           unique ID
datetime     timestamp
siteid       website ID
offerid      offer ID (commission-based offers)
category     offer category
countrycode  international country code
browserid    browser used
devid        device used
click        target variable

I plan to use all the variables except the last one as the features and the last
one, click, as the target variable. The target is encoded so that 0 means the
advertisement was not clicked and 1 means it was clicked.
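The feature/target split described above can be sketched as follows. Since the real train.csv is around 850 MB, two hypothetical rows (invented values, not taken from the actual files) stand in for it here:

```python
import pandas as pd

# Two hypothetical rows in the shape of the variable table above;
# the values are made up, not taken from the real train.csv.
sample = pd.DataFrame({
    "ID": ["id1", "id2"],
    "datetime": ["2017-01-14 09:42:09", "2017-01-18 17:50:53"],
    "siteid": [4709696, 5189467],
    "offerid": [887235, 178235],
    "category": [17714, 21407],
    "countrycode": ["a", "b"],
    "browserid": ["Firefox", "Chrome"],
    "devid": ["Mobile", "Desktop"],
    "click": [0, 1],
})

features = sample.drop(columns=["click"])  # every variable except the target
target = sample["click"]                   # 0 = not clicked, 1 = clicked
print(features.shape, target.tolist())
```

The same two lines would apply unchanged to the full train.csv once it is loaded with pd.read_csv.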

4. Solution Statement
I plan to use supervised learning for this model. I will treat it as a
probabilistic classification problem, since we want the probability that the
advertisement will be clicked. I will use the predict_proba method, which is
available in almost all scikit-learn classifiers, whether a decision tree or an
SVM. I will try different algorithms, since we cannot say in advance which will
be best, and I will tune their parameters to maximize performance. The most
promising candidates here are decision tree classifiers, logistic regression
(as a classification model), and SVMs (support vector machines).
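As a minimal sketch of the approach, the snippet below fits two of the candidate classifiers on synthetic stand-in data (random features and labels, since the real data is not loaded here) and reads the click probability out of predict_proba:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 numeric features, binary click labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

for clf in (LogisticRegression(), DecisionTreeClassifier(max_depth=3)):
    clf.fit(X, y)
    proba = clf.predict_proba(X)  # shape (n_samples, 2)
    click_prob = proba[:, 1]      # column 1 is P(click = 1)
    print(type(clf).__name__, click_prob[:3].round(2))
```

Each row of predict_proba sums to 1, and the second column is the estimated probability of the positive (clicked) class, which is the quantity the project cares about.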

5. Benchmark Model
Since this problem was part of a HackerEarth coding challenge, the benchmark
model will be the score of the challenge leader. I will compare my model
against the leader of the competition, as there is always room for improvement.

6. Evaluation Metrics
For this problem I will use logarithmic loss (log loss) as the evaluation metric,
since this type of metric is best suited to models whose output is the
probability of a binary outcome. Log loss takes the confidence of each
prediction into account when penalizing incorrect classifications.
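The penalty for confident mistakes can be seen directly with scikit-learn's log_loss on a toy example (the probability values below are illustrative, not model outputs):

```python
import numpy as np
from sklearn.metrics import log_loss

# log loss = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) )
y_true = [0, 1, 1, 0]
calibrated    = [0.1, 0.9, 0.8, 0.2]  # confident and correct
overconfident = [0.1, 0.9, 0.8, 0.9]  # last prediction is confidently wrong

good = log_loss(y_true, calibrated)
bad = log_loss(y_true, overconfident)
print(round(good, 3), round(bad, 3))
```

A single confidently wrong prediction (0.9 for a true label of 0) dominates the average loss, which is exactly the behavior that makes log loss a good fit for probabilistic click prediction.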

7. Project Design
This is a supervised learning problem, so I plan to implement decision trees,
logistic regression, and similar algorithms; the final model will be chosen
based on the log loss metric described in the previous section. The order of
implementation is as follows: import the data; transform and clean it (for
example, converting the date-time column into a form that the pandas library
can read); normalize the data if needed; one-hot encode categorical features
such as the browser used and the type of device used; cross-validate using
grid search (GridSearchCV) or k-fold cross-validation depending on the needs;
and finally apply the supervised learning algorithms and evaluate their metrics
so as to give a statistical analysis of which algorithm is best.
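The preprocessing and validation steps above can be sketched end to end on a few hypothetical rows (the column names match the dataset description; the values are invented):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical rows mimicking the dataset's timestamp/categorical columns.
df = pd.DataFrame({
    "datetime": ["2017-01-14 09:42:09", "2017-01-18 17:50:53"] * 10,
    "browserid": ["Firefox", "Chrome"] * 10,
    "devid": ["Mobile", "Desktop"] * 10,
    "click": [0, 1] * 10,
})

# Step 1: convert the date-time strings so pandas can work with them,
# then extract numeric parts usable as features.
df["datetime"] = pd.to_datetime(df["datetime"])
df["hour"] = df["datetime"].dt.hour
df["weekday"] = df["datetime"].dt.weekday

# Step 2: one-hot encode the categorical features (browser, device).
X = pd.get_dummies(df[["hour", "weekday", "browserid", "devid"]])
y = df["click"]

# Step 3: k-fold cross-validation scored with (negated) log loss.
scores = cross_val_score(LogisticRegression(), X, y, cv=5,
                         scoring="neg_log_loss")
print(round(-scores.mean(), 3))
```

On the real data the same pipeline would run on the loaded train.csv, with GridSearchCV substituted for cross_val_score when tuning parameters.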

8. Reference and citations


1. The dataset for this project was taken from the HackerEarth platform.
https://www.hackerearth.com/challenge/competitive/machine-learning-
challenge-3/problems/
2. The link to download the dataset is
https://he-s3.s3.amazonaws.com/media/hackathon/machine-learning-
challenge-3/predict-ad-clicks/205e1808-6-dataset.zip
3. The idea for this project was inspired by Google AdSense; the relevant
citation is the following paper.
https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
4. Several Kaggle competitions were also reviewed for ideas related to the
model.
https://www.kaggle.com/c/avazu-ctr-prediction
https://www.kaggle.com/c/avito-context-ad-clicks
