You are on page 1of 5

(http://turi.

com)

KevinMarkham
UploadedNov25,2015

DOWNLOAD

eginner'GuidetoClick-ThroughRatePredictionwithLogitic
Regreion
Let'athatou'reamajorearchengine,andouneedtodecidewhichadtodiplaatthetopofourearchreult.Howwouldoudo
it?

Your rtthoughtmightetonarrowthecopetoad"related"totheearch,andthenchooewhicheverado erthegreatetrevenue.


Companiehavealreadidonhowmuchthewillpaou,oiteemeatomaximizeourrevenuechooingthehighetpaingad.
utithattherightapproach?

Manadareactualloldona"pa-per-click"(PPC)ai,meaningthecompanonlpaforadclick,notadview.Thuouroptimal
approach(aaearchengine)iactualltochooeanadaedon"expectedvalue",meaningthepriceofaclicktimethelikelihoodthatthe
adwilleclicked.Inotherword,a\$1.00adwitha5%proailitofeingclickedhaanexpectedvalueof\$0.05,whereaa\$2.00ad
witha1%proailitofeingclickedhaanexpectedvalueofonl\$0.02.Inthicae,ouwouldchooetodiplathe rtad.

Inorderforoutomaximizeexpectedvalue,outhereforeneedtoaccuratelpredictthelikelihoodthatagivenadwilleclicked,alo
knowna"click-throughrate"(CTR).

Inthinoteook,I'llwalkthroughthepredictivemodelingproce,dicuwhlogiticregreioniagoodchoiceforthitak,andthen
explainthicodeline--lineothatoucanapplittoourownpredictivetak!

Forthiexample,I'muingthedatafromaKagglecompetition(http://www.kaggle.com/c/avazu-ctr-prediction)onclick-throughrate
predictionponoredAvazu.Thegoalinthecompetitionmatcheourgoal,whichitopredictthelikelihoodthatagivenadwille
clicked.

tep1:ReadingandxploringtheData
I'vealreaddownloadedthedataetfromKaggleforthiexampleandextractedamalluettomakemcalculationfater.Ifouwould
liketofollowalong,ouhoulddownloadanddecompre train.gz fromthecompetition'datapage(http://www.kaggle.com/c/avazu-
ctr-prediction/data)(loginrequired),andthenextractthe rt100,000linefrom train.csv uingthicommandatthecommand
line/terminal: head -n100000 train.csv > train_subset.csv

Our rttepitoreadthedataintoanFrame,whichiGraphLa'taulardatatructurethatiimilartoadataframeinRorapanda
DataFrameinPthon.

ThidatahappentoetoredinthepopularCV(commaeparatedvalue)format,utFramecanecontructedfromavarietof
ource(http://turi.com/product/create/doc/graphla.data_tructure.html#connector).We'lluetheread_cv
(http://turi.com/product/create/doc/generated/graphla.Frame.read_cv.html)methodtoreadinthedata:

In[1]:

importgraphlabasgl
data=gl.SFrame.read_csv('train_subset.csv',verbose=False)

Let'takeaquicklookatthe rtrowofdata,toeewhatwe'reworkingwith:

In[2]:

data.head(1)
Out[2]:
id click hour C1 anner_po ite_id ite_domain ite_categor app_id
1000009418151094273 0 14102100 1005 0 1fe01fe f3845767 28905ed ecad2386
app_domain app_categor device_id device_ip device_model device_tpe device_conn_tpe C14 C15
7801e8d9 07d7df22 a99f214a ddd2926e 44956a24 1 2 15706 320
C17 C18 C19 C20 C21
1722 0 35 -1 79
[1rowx24column]

FromKaggle'datadictionar(http://www.kaggle.com/c/avazu-ctr-prediction/data),Iknowthatclick=0meantheadwanotclicked,and
click=1meantheadwaclicked.The"click"columnithereforeourtargetvariale,andtheothercolumnareourpotentialfeature!

The rtthingwewanttoknowiwhatpercentageofadinthedataetwereactuallclicked.Inthicae,wecanimpltakethemeanof
the"click"column,incethatiequivalenttoaddingupalloftheone(whichithenumerofclick)anddividingthetotalnumerof
ad:

In[3]:

data['click'].mean()

Out[3]:
0.1749017490174896

Weeethat17.5%oftheadwereclicked,meaningtheoverallclick-throughratei17.5%.Thiiuefultokeepinmindaa"aeline",a
we'lleelateron.

eforewetartuildingamachinelearningmodel,it'alwauefultoexplorethedataet.OnewatogettartediuingtheGraphLa
Canva,arower-aedviualizationplatform:

In[4]:

gl.canvas.set_target('ipynb')
data.show()

Inoticedthat"device_tpe"onlha4uniquevalue,anditmakeintuitiveenethatthetpeofdeviceou'reuingwhenviewinganad
mighta ectourlikelihoodofclickingthead,olet'exploreitfurther.

Toundertandtherelationhipetweenthifeatureandthetargetvariale,wewanttocalculatetheclick-throughrateforeachvalueof
device_tpe.Wecanaccomplihthi"groupingthedata"device_tpe,andthencalculatingthemeanoftheclickcolumnforeach
group:

In[5]:

data.groupby('device_type',{'CTR':gl.aggregate.MEAN('click')})

Out[5]:
device_tpe CTR
0 0.227499406317
5 0.0990566037736
4 0.0725075528701
1 0.175977623465
[4rowx2column]
Weawearlierthattheaelineclick-throughratei17.5%,anditappearthatthereiaigdi erenceinaverageclick-throughrate
dependingondevice_tpe.Thilooklikeagoodfeature!

imilarl,theC1columnlooklikeagoodfeature:

In[6]:

data.groupby('C1',{'CTR':gl.aggregate.MEAN('click')})
Out[6]:
C1 CTR
1008 0.4
1005 0.176174097389
1001 0.103448275862
1010 0.0742713882795
1002 0.227499406317
1007 0.0
[6rowx2column]
IalonoticedthatC15andC16appeartoethedimenionofthead(widthandheight),whichwewouldaloimaginearegoodpredictor
ofwhetheranadiclicked:

In[7]:

data['C15'].sketch_summary().frequent_items()

Out[7]:
{120:2,216:912,300:3935,320:95132,728:18}

In[8]:

data['C16'].sketch_summary().frequent_items()

Out[8]:
{20:2,36:912,50:95620,90:18,250:3427,480:20}

Forourinitialmodel,we'lljutuedevice_tpe,C1,C15,andC16aourfeature.

NotethatwhenweuilttheFramefromtheCV le,itimplgueedthedatatpeofeachcolumn.ometimetheedatatpeneedto
eadjuted,olet'takeaquicklookatthecolumnnameandtheiraociatedtpetoeeifthere'anthingweneedto x:

In[9]:

zip(data.column_names(),data.column_types())

Out[9]:
[('id',str),
('click',int),
('hour',int),
('C1',int),
('banner_pos',int),
('site_id',str),
('site_domain',str),
('site_category',str),
('app_id',str),
('app_domain',str),
('app_category',str),
('device_id',str),
('device_ip',str),
('device_model',str),
('device_type',int),
('device_conn_type',int),
('C14',int),
('C15',int),
('C16',int),
('C17',int),
('C18',int),
('C19',int),
('C20',int),
('C21',int)]

Weknowthatothdevice_tpeandC1are"categoricalvariale",meaningthattheirnumericalvaluerepreentcategorie.We'llconvertthe
datatpeofothofthoecolumnfromintegertotring,ecauewedon'twantourmachinelearningmodeltothinkthereia
mathematicalrelationhipetweenthecategorvalue:
In[10]:

data['device_type']=data['C1'].astype(str)
data['C1']=data['C1'].astype(str)

Youcouldpendalotmoretimeontheexploratorphae,utlet'movealongtothenexttepinpredictivemodeling!Ifouwanttolearn
howtomanipulateFrameinmoredepth,readthroughthiexamplenoteook,IntroductiontoFrame
(http://turi.com/learn/galler/noteook/introduction_to_frame.html).

tep2:plittingtheData
Oneoftheketopropermachinelearningimodelevaluation.Thegoalofmodelevaluationitoetimatehowwellourmodelwill
"generalize"tofuturedata.Inotherword,wewanttouildamodelthataccuratelpredictthefuture,notthepat!

Oneofthemotcommonevaluationprocedureitoplitourdataintoa"traininget"anda"tetinget".

Let'uean80/20plit,inwhich80%ofthedataiuedfortrainingand20%iuedforteting:

In[11]:

train_data,test_data=data.random_split(0.8,seed=1)

WenowhavetwoeparateFrame,calledtrain_dataandtet_data.

tep3:electingaMachineLearningModel
Therearetwomaintpeofmodel:clai cationmodel,whichareuedwhenourtargetvarialeicategorical(uchae/no),and
regreionmodel,whichareuedwhenourtargetvarialeicontinuou(uchaprice).Inthicae,we'llneedtoueaclai cation
model,inceourtargetvarialeicategorical(click:eorno).

Thepeci cmodelwe'regoingtoueinthicaeilogiticregreion.Inlogiticregreion,theproailitthatthetargetiTrueimodeled
aalogiticfunctionofalinearcominationofthefeature.Thu,themodelipredictingaproailit(whichiacontinuouvalue),utthat
proailitiuedtochooethepredictedtargetcla.Inotherword,it'uingregreiontopredictacontinuouvalue,utwe'reuingthe
continuouvaluethatioutputfromthemodeltoperformclai cation.(Prettcool,right?)

Itcantakealotoftudtotrulundertandamachinelearningmodel,utagoodintroductiontologiticregreioniavailaleintheuer
guide(http://turi.com/learn/uerguide/upervied-learning/logitic-regreion.html).

o,whexactldidwechooelogiticregreionforthitak,inteadofanoftheotheravailaleclai cationmodel
(http://turi.com/learn/uerguide/upervied-learning/clai er.html)?Well,itturnoutthatlogiticregreionhamannicepropertie.For
tarter,itiaverfatmodel,meaningthatitdoenottakelongtotrainthemodelormakeprediction.Awell,itihighlinterpretale,
meaningthatoucanundertandexactlhowit'makingprediction.utthekeconiderationinthicaeithatlogiticregreionoutput
"well-calirated"predictedproailitie.

tep4:TrainingaMachineLearningModel
Nowthatwe'veelectedourmodel,wecan nalltartthemodeltrainingproce!InGraphLaCreate,thicanedoneinaingleline.You
implpainthetrainingdata,thenameofthetargetcolumn,andthenameofthefeaturecolumn.Andinfact,ifoujutreplace
gl.logistic_classifier.create with gl.classifier.create ,GraphLawillchooetheetmodelforouautomaticall(aedonthe
propertieofourdata),meaningthatoucankiptep3aove!

In[12]:

model=gl.logistic_classifier.create(train_data,target='click',features=['device_type','C1','C15','C16'])
PROGRESS:Creatingavalidationsetfrom5percentoftrainingdata.Thismaytakeawhile.
Youcanset``validation_set=None``todisablevalidationtracking.

PROGRESS:Logisticregression:
PROGRESS:
PROGRESS:Numberofexamples:76149
PROGRESS:Numberofclasses:2
PROGRESS:Numberoffeaturecolumns:4
PROGRESS:Numberofunpackedfeatures:4
PROGRESS:Numberofcoefficients:13
PROGRESS:StartingNewtonMethod
PROGRESS:
PROGRESS:++++++
PROGRESS:|Iteration|Passes|ElapsedTime|Trainingaccuracy|Validationaccuracy|
PROGRESS:++++++
PROGRESS:|1|2|1.074658|0.824095|0.819668|
PROGRESS:|2|3|1.122571|0.824095|0.819668|
PROGRESS:|3|4|1.168321|0.824095|0.819668|
PROGRESS:|4|5|1.224780|0.824095|0.819668|
PROGRESS:|5|6|1.274098|0.824095|0.819668|
PROGRESS:|6|7|1.319348|0.824095|0.819668|
PROGRESS:++++++

PROGRESS:TERMINATED:Iterationlimitreached.
PROGRESS:Thismodelmaynotbeoptimal.Toimproveit,considerincreasing`max_iterations`.

Notethatwedidn'thavetotellGraphLahowtohandleeachofthefeature,eventhoughtwofeaturewerenumericalandtheothertwo
werecategorical.Thecategoricalfeaturewereautomaticallhandleduing"dummencoding",whichiwhtheoutputaoveindicate
thattherewere4featureut13modelcoe cient.(Aimpleexplanationofdummencodingiavailaleintheuerguide
(http://turi.com/learn/uerguide/upervied-learning/linear-regreion.html#linregr-categorical-feature).)

tep5:MakingPrediction
Aftertrainingamodel,the naltepitouethemodeltomakeprediction.Inotherword,themodelhalearnedamathematical
relationhipetweenthefeatureandthetarget,anditwilluethatrelationhiptopredictthetargetvaluefornewdatapoint.

Inthicae,wepathetetingdatatothe" ttedmodel",andakittooutputthepredictedproailitofaclick:

In[13]:

model.predict(test_data,output_type='probability').head(5)

Out[13]:
dtype:float
Rows:5
[0.16537085227336723,0.22480874210027335,0.16537085227336723,0.16537085227336723,0.16537085227336723]

Atthipoint,ouwouldwanttoevaluatethemodelcomparingthepredictedproailitieverutheactualtargetvalue,uingan
appropriate"evaluationmetric."Theetmetrictoueinthicaeiproallogarithmiclo
(http://www.kaggle.com/wiki/LogarithmicLo),whichicommonluedwhenoucareaouthavingwell-caliratedproailitie.In
addition,oumightinpecttheROCcurveandcomputeothermetricuchatheF1-coreandAUC.(eethilogpot
(http://log.turi.com/how-to-evaluate-machine-learning-model-part-2a-clai cation-metric)formoredetail.)

Nowthatwehavetheeproailitie,wecould ndtheadthatmaximizerevenuemultiplingtheeproailitiethecot-per-click,
and ndingthelargetvalue.

Althoughwe'reattheendofthinoteook,thiirealljuttheeginning!Youcancontinuetoaddmorefeaturetothemodel,andthen
uetheevaluationmetrictocomparetheexpectedperformanceofeachofourmodel.Awell,oucanuefeatureengineering
(http://turi.com/learn/galler/noteook/feature-engineering.html)tocreatenewfeature,oucantrothermodel,andomuchmore!

Ifou'dliketoreadmoreaoutclick-throughrateprediction,therearereadalepaperonthetopicothCriteo
(http://people.cail.mit.edu/romer/paper/TITRepPredAd.pdf)andGoogle
(http://tatic.googleuercontent.com/media/reearch.google.com/ru//pu/archive/41159.pdf).

2017TuriAllRightReerved.NopartofthiweitemaereproducedwithoutTuri'expreedconent.Turi,GraphLa,GraphLaCreateandlogoarepropertofTuri.PrivacPolic

(http://turi.com/legal/privac_polic.html)|TermofUe(http://turi.com/legal/term_of_ue.html)|LiceneAttriution(http://turi.com/legal/licene-attriution.html)

You might also like