Rules of Machine Learning:

Best Practices for ML Engineering

Martin Zinkevich

This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.

Terminology
Overview
Before Machine Learning
Rule #1: Don't be afraid to launch a product without machine learning.
Rule #2: First, design and implement metrics.
Rule #3: Choose machine learning over a complex heuristic.
ML Phase I: Your First Pipeline
Rule #4: Keep the first model simple and get the infrastructure right.
Rule #5: Test the infrastructure independently from the machine learning.
Rule #6: Be careful about dropped data when copying pipelines.
Rule #7: Turn heuristics into features, or handle them externally.
Monitoring
Rule #8: Know the freshness requirements of your system.
Rule #9: Detect problems before exporting models.
Rule #10: Watch for silent failures.
Rule #11: Give feature columns owners and documentation.
Your First Objective
Rule #12: Don't overthink which objective you choose to directly optimize.
Rule #13: Choose a simple, observable and attributable metric for your first objective.
Rule #14: Starting with an interpretable model makes debugging easier.
Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
ML Phase II: Feature Engineering
Rule #16: Plan to launch and iterate.
Rule #17: Start with directly observed and reported features as opposed to learned features.
Rule #18: Explore with features of content that generalize across contexts.
Rule #19: Use very specific features when you can.
Rule #20: Combine and modify existing features to create new features in human-understandable ways.
Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
Rule #22: Clean up features you are no longer using.
Human Analysis of the System
Rule #23: You are not a typical end user.
Rule #24: Measure the delta between models.
Rule #25: When choosing models, utilitarian performance trumps predictive power.
Rule #26: Look for patterns in the measured errors, and create new features.
Rule #27: Try to quantify observed undesirable behavior.
Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
Training-Serving Skew
Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.
Rule #30: Importance-weight sampled data, don't arbitrarily drop it!
Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Rule #32: Reuse code between your training pipeline and your serving pipeline whenever possible.
Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.
Rule #35: Beware of the inherent skew in ranking problems.
Rule #36: Avoid feedback loops with positional features.
Rule #37: Measure Training/Serving Skew.
ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models
Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
Rule #39: Launch decisions are a proxy for long-term product goals.
Rule #40: Keep ensembles simple.
Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.

Related Work
Acknowledgements
Appendix
YouTube Overview
Google Play Overview
Google Plus Overview

Terminology

The following terms will come up repeatedly in our discussion of effective machine learning:

Instance: The thing about which you want to make a prediction. For example, the instance might be a web page that you want to classify as either "about cats" or "not about cats".
Label: An answer for a prediction task, either the answer produced by a machine learning system, or the right answer supplied in training data. For example, the label for a web page might be "about cats".
Feature: A property of an instance used in a prediction task. For example, a web page might have a feature "contains the word 'cat'".
Feature Column: A set of related features, such as the set of all possible countries in which users might live. An example may have one or more features present in a feature column. A feature column is referred to as a namespace in the VW system (at Yahoo/Microsoft), or a field. ("Feature column" is Google-specific terminology.)
Example: An instance (with its features) and a label.
Model: A statistical representation of a prediction task. You train a model on examples, then use the model to make predictions.
Metric: A number that you care about. May or may not be directly optimized.
Objective: A metric that your algorithm is trying to optimize.
Pipeline: The infrastructure surrounding a machine learning algorithm. Includes gathering the data from the front end, putting it into training data files, training one or more models, and exporting the models to production.

Overview

To make great products:

do machine learning like the great engineer you are, not like the great machine learning expert you aren't.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:
1. Make sure your pipeline is solid end to end.
2. Start with a reasonable objective.
3. Add common-sense features in a simple way.
4. Make sure that your pipeline stays solid.
This approach will make lots of money and/or make lots of people happy for a long period of time. Diverge from this approach only when there are no more simple tricks to get you any farther. Adding complexity slows future releases.

Once you've exhausted the simple tricks, cutting-edge machine learning might indeed be in your future. See the section on Phase III machine learning projects.

This document is arranged in four parts:
1. The first part should help you understand whether the time is right for building a machine learning system.
2. The second part is about deploying your first pipeline.
3. The third part is about launching and iterating while adding new features to your pipeline, how to evaluate models, and training-serving skew.
4. The final part is about what to do when you reach a plateau.
Afterwards, there is a list of related work and an appendix with some background on the systems commonly used as examples in this document.

Before Machine Learning

Rule #1: Don't be afraid to launch a product without machine learning.
Machine learning is cool, but it requires data. Theoretically, you can take data from a different problem and then tweak the model for a new product, but this will likely underperform basic heuristics. If you think that machine learning will give you a 100% boost, then a heuristic will get you 50% of the way there.

For instance, if you are ranking apps in an app marketplace, you could use the install rate or number of installs. If you are detecting spam, filter out publishers that have sent spam before. Don't be afraid to use human editing either. If you need to rank contacts, rank the most recently used highest (or even rank alphabetically). If machine learning is not absolutely required for your product, don't use it until you have data.

Rule #2: First, design and implement metrics.
Before formalizing what your machine learning system will do, track as much as possible in your current system. Do this for the following reasons:

1. It is easier to gain permission from the system's users earlier on.
2. If you think that something might be a concern in the future, it is better to get historical data now.
3. If you design your system with metric instrumentation in mind, things will go better for you in the future. Specifically, you don't want to find yourself grepping for strings in logs to instrument your metrics!
4. You will notice what things change and what stays the same. For instance, suppose you want to directly optimize one-day active users. However, during your early manipulations of the system, you may notice that dramatic alterations of the user experience don't noticeably change this metric.

The Google Plus team measures expands per read, reshares per read, plus-ones per read, comments per read, comments per user, reshares per user, etc., which they use in computing the goodness of a post at serving time. Also, note that an experiment framework, in which you can group users into buckets and aggregate statistics by experiment, is important. See Rule #12.

By being more liberal about gathering metrics, you can gain a broader picture of your system. Notice a problem? Add a metric to track it! Excited about some quantitative change on the last release? Add a metric to track it!

Rule #3: Choose machine learning over a complex heuristic.
A simple heuristic can get your product out the door. A complex heuristic is unmaintainable. Once you have data and a basic idea of what you are trying to accomplish, move on to machine learning. As in most software engineering tasks, you will want to be constantly updating your approach, whether it is a heuristic or a machine-learned model, and you will find that the machine-learned model is easier to update and maintain (see Rule #16).

ML Phase I: Your First Pipeline

Focus on your system infrastructure for your first pipeline. While it is fun to think about all the imaginative machine learning you are going to do, it will be hard to figure out what is happening if you don't first trust your pipeline.

Rule #4: Keep the first model simple and get the infrastructure right.
The first model provides the biggest boost to your product, so it doesn't need to be fancy. But you will run into many more infrastructure issues than you expect. Before anyone can use your fancy new machine learning system, you have to determine:

1. How to get examples to your learning algorithm.
2. A first cut as to what "good" and "bad" mean to your system.
3. How to integrate your model into your application. You can either apply the model live, or precompute the model on examples offline and store the results in a table. For example, you might want to preclassify web pages and store the results in a table, but you might want to classify chat messages live.

Choosing simple features makes it easier to ensure that:
1. The features reach your learning algorithm correctly.
2. The model learns reasonable weights.
3. The features reach your model in the server correctly.

Once you have a system that does these three things reliably, you have done most of the work. Your simple model provides you with baseline metrics and a baseline behavior that you can use to test more complex models. Some teams aim for a "neutral" first launch: a first launch that explicitly deprioritizes machine learning gains, to avoid getting distracted.

Rule #5: Test the infrastructure independently from the machine learning.
Make sure that the infrastructure is testable, and that the learning parts of the system are encapsulated so that you can test everything around it. Specifically:
1. Test getting data into the algorithm. Check that feature columns that should be populated are populated. Where privacy permits, manually inspect the input to your training algorithm. If possible, check statistics in your pipeline in comparison to elsewhere, such as RASTA.
2. Test getting models out of the training algorithm. Make sure that the model in your training environment gives the same score as the model in your serving environment (see Rule #37).

Machine learning has an element of unpredictability, so make sure that you have tests for the code for creating examples in training and serving, and that you can load and use a fixed model during serving. Also, it is important to understand your data: see Practical Advice for the Analysis of Large, Complex Data Sets.

Rule #6: Be careful about dropped data when copying pipelines.
Often we create a pipeline by copying an existing pipeline (i.e., cargo cult programming), and the old pipeline drops data that we need for the new pipeline. For example, the pipeline for Google Plus What's Hot drops older posts (because it is trying to rank fresh posts). This pipeline was copied to use for Google Plus Stream, where older posts are still meaningful, but the pipeline was still dropping old posts. Another common pattern is to only log data that was seen by the user. Thus, this data is useless if we want to model why a particular post was not seen by the user, because all the negative examples have been dropped. A similar issue occurred in Play. While working on Play Apps Home, a new pipeline was created that also contained examples from two other landing pages (Play Games Home and Play Home Home) without any feature to disambiguate where each example came from.


Rule #7: Turn heuristics into features, or handle them externally.
Usually the problems that machine learning is trying to solve are not completely new. There is an existing system for ranking, or classifying, or whatever problem you are trying to solve. This means that there are a bunch of rules and heuristics. These same heuristics can give you a lift when tweaked with machine learning. Your heuristics should be mined for whatever information they have, for two reasons. First, the transition to a machine-learned system will be smoother. Second, usually those rules contain a lot of the intuition about the system you don't want to throw away. There are four ways you can use an existing heuristic:
1. Preprocess using the heuristic. If the feature is incredibly awesome, then this is an option. For example, if, in a spam filter, the sender has already been blacklisted, don't try to relearn what "blacklisted" means. Block the message. This approach makes the most sense in binary classification tasks.
2. Create a feature. Directly creating a feature from the heuristic is great. For example, if you use a heuristic to compute a relevance score for a query result, you can include the score as the value of a feature. Later on you may want to use machine learning techniques to massage the value (for example, converting the value into one of a finite set of discrete values, or combining it with other features), but start by using the raw value produced by the heuristic.
3. Mine the raw inputs of the heuristic. If there is a heuristic for apps that combines the number of installs, the number of characters in the text, and the day of the week, then consider pulling these pieces apart, and feeding these inputs into the learning separately. Some techniques that apply to ensembles apply here (see Rule #40).
4. Modify the label. This is an option when you feel that the heuristic captures information not currently contained in the label. For example, if you are trying to maximize the number of downloads, but you also want quality content, then maybe the solution is to multiply the label by the average number of stars the app received. There is a lot of space here for leeway. See the section on Your First Objective.

Do be mindful of the added complexity when using heuristics in an ML system. Using old heuristics in your new machine learning algorithm can help to create a smooth transition, but think about whether there is a simpler way to accomplish the same effect.
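
To make the first two approaches concrete, here is a minimal sketch in plain Python. All of the names (the blacklist flag, the relevance heuristic, the 0.9 threshold) are hypothetical stand-ins, not details from any actual system:

```python
# A minimal sketch of approaches 1 and 2; every name here is hypothetical.

def heuristic_relevance_score(message):
    # Stand-in for an existing hand-tuned heuristic.
    return 2.0 * message["num_installs"] - 0.1 * message["text_length"]

# Approach 1: preprocess with the heuristic. The blacklist decides outright;
# the model never has to relearn what "blacklisted" means.
def should_block(message, model_score):
    if message["sender_blacklisted"]:
        return True
    return model_score > 0.9

# Approach 2: expose the heuristic's raw output as one feature among many.
def make_features(message):
    return {
        "heuristic_relevance": heuristic_relevance_score(message),
        "num_installs": message["num_installs"],
    }

msg = {"sender_blacklisted": False, "num_installs": 120, "text_length": 400}
print(should_block(msg, model_score=0.2), make_features(msg))
```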

Monitoring

In general, practice good alerting hygiene, such as making alerts actionable and having a dashboard page.

Rule #8: Know the freshness requirements of your system.
How much does performance degrade if you have a model that is a day old? A week old? A quarter old? This information can help you to understand the priorities of your monitoring. If you lose 10% of your revenue if the model is not updated for a day, it makes sense to have an engineer watching it continuously. Most ad serving systems have new advertisements to handle every day, and must update daily. For instance, if the ML model for Google Play Search is not updated, it can have an impact on revenue in under a month. Some models for What's Hot in Google Plus have no post identifier in their model, so they can export these models infrequently. Other models that have post identifiers are updated much more frequently. Also notice that freshness can change over time, especially when feature columns are added to or removed from your model.

Rule #9: Detect problems before exporting models.
Many machine learning systems have a stage where you export the model to serving. If there is an issue with an exported model, it is a user-facing issue. If there is an issue before, then it is a training issue, and users will not notice.

Do sanity checks right before you export the model. Specifically, make sure that the model's performance is reasonable on held-out data. Or, if you have lingering concerns with the data, don't export a model. Many teams continuously deploying models check the area under the ROC curve (or AUC) before exporting. Issues about models that haven't been exported require an email alert, but issues on a user-facing model may require a page. So it is better to wait and be sure before impacting users.

Rule #10: Watch for silent failures.
This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes tables are found that were months out of date, and a simple refresh improved performance more than any other launch that quarter! The coverage of a feature may also change due to implementation changes: a feature column could be populated in 90% of the examples, and suddenly drop to 60% of the examples. Play once had a table that was stale for 6 months, and refreshing the table alone gave a boost of 2% in install rate. If you track statistics of the data, as well as manually inspect the data on occasion, you can reduce these kinds of failures.

Rule #11: Give feature columns owners and documentation.
If the system is large, and there are many feature columns, know who created or is maintaining each feature column. If you find that the person who understands a feature column is leaving, make sure that someone has the information. Although many feature columns have descriptive names, it's good to have a more detailed description of what the feature is, where it came from, and how it is expected to help.

Your First Objective

You have many metrics, or measurements about the system, that you care about, but your machine learning algorithm will often require a single objective, a number that your algorithm is "trying" to optimize. I distinguish here between objectives and metrics: a metric is any number that your system reports, which may or may not be important. See also Rule #2.

Rule #12: Don't overthink which objective you choose to directly optimize.
You want to make money, make your users happy, and make the world a better place. There are tons of metrics that you care about, and you should measure them all (see Rule #2). However, early in the machine learning process, you will notice them all going up, even those that you do not directly optimize. For instance, suppose you care about number of clicks, time spent on the site, and daily active users. If you optimize for number of clicks, you are likely to see the time spent increase.

So, keep it simple and don't think too hard about balancing different metrics when you can still easily increase all the metrics. Don't take this rule too far, though: do not confuse your objective with the ultimate health of the system (see Rule #39). And, if you find yourself increasing the directly optimized metric, but deciding not to launch, some objective revision may be required.

Rule #13: Choose a simple, observable and attributable metric for your first objective.
Often you don't know what the true objective is. You think you do, but then, as you stare at the data and side-by-side analysis of your old system and new ML system, you realize you want to tweak it. Further, different team members often can't agree on the true objective. The ML objective should be something that is easy to measure and is a proxy for the "true" objective. (In fact, there is often no "true" objective; see Rule #39.) So train on the simple ML objective, and consider having a "policy layer" on top that allows you to add additional logic (hopefully very simple logic) to do the final ranking.

The easiest thing to model is a user behavior that is directly observed and attributable to an action of the system:
1. Was this ranked link clicked?
2. Was this ranked object downloaded?
3. Was this ranked object forwarded/replied to/emailed?
4. Was this ranked object rated?
5. Was this shown object marked as spam/pornography/offensive?

Avoid modeling indirect effects at first:
1. Did the user visit the next day?
2. How long did the user visit the site?
3. What were the daily active users?

Indirect effects make great metrics, and can be used during A/B testing and during launch decisions.

Finally, don't try to get the machine learning to figure out:
1. Is the user happy using the product?
2. Is the user satisfied with the experience?
3. Is the product improving the user's overall well-being?
4. How will this affect the company's overall health?

These are all important, but also incredibly hard. Instead, use proxies: if the user is happy, they will stay on the site longer. If the user is satisfied, they will visit again tomorrow. Insofar as well-being and company health are concerned, human judgement is required to connect any machine-learned objective to the nature of the product you are selling and your business plan, so we don't end up here.

Rule #14: Starting with an interpretable model makes debugging easier.
Linear regression, logistic regression, and Poisson regression are directly motivated by a probabilistic model. Each prediction is interpretable as a probability or an expected value. This makes them easier to debug than models that use objectives (zero-one loss, various hinge losses, et cetera) that try to directly optimize classification accuracy or ranking performance. For example, if probabilities in training deviate from probabilities predicted in side-by-sides or by inspecting the production system, this deviation could reveal a problem.

For example, in linear, logistic, or Poisson regression, there are subsets of the data where the average predicted expectation equals the average label (1-moment calibrated, or just calibrated). (This is exactly true assuming that you have no regularization and that your algorithm has converged, and it is approximately true in general.) If you have a feature which is either 1 or 0 for each example, then the set of examples where that feature is 1 is calibrated. Also, if you have a feature that is 1 for every example, then the set of all examples is calibrated.

With simple models, it is easier to deal with feedback loops (see Rule #36). Often, we use these probabilistic predictions to make a decision: e.g., rank posts in decreasing expected value (i.e., probability of click/download/etc.). However, remember when it comes time to choose which model to use, the decision matters more than the likelihood of the data given the model (see Rule #27).

Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer.
Quality ranking is a fine art, but spam filtering is a war. The signals that you use to determine high-quality posts will become obvious to those who use your system, and they will tweak their posts to have these properties. Thus, your quality ranking should focus on ranking content that is posted in good faith. You should not discount the quality ranking learner for ranking spam highly. Similarly, "racy" content should be handled separately from quality ranking.

Spam filtering is a different story. You have to expect that the features that you need to generate will be constantly changing. Often, there will be obvious rules that you put into the system (if a post has more than three spam votes, don't retrieve it, et cetera). Any learned model will have to be updated daily, if not faster. The reputation of the creator of the content will play a great role.

At some level, the output of these two systems will have to be integrated. Keep in mind, filtering spam in search results should probably be more aggressive than filtering spam in email messages. Also, it is a standard practice to remove spam from the training data for the quality classifier.

ML Phase II: Feature Engineering

In the first phase of the life cycle of a machine learning system, the important issues are to get the training data into the learning system, get any metrics of interest instrumented, and create a serving infrastructure. After you have a working end-to-end system with unit and system tests instrumented, Phase II begins.

In the second phase, there is a lot of low-hanging fruit. There are a variety of obvious features that could be pulled into the system. Thus, the second phase of machine learning involves pulling in as many features as possible and combining them in intuitive ways. During this phase, all of the metrics should still be rising. There will be lots of launches, and it is a great time to pull in lots of engineers who can join up all the data that you need to create a truly awesome learning system.

Rule #16: Plan to launch and iterate.
Don't expect that the model you are working on now will be the last one that you will launch, or even that you will ever stop launching models. Thus consider whether the complexity you are adding with this launch will slow down future launches. Many teams have launched a model per quarter or more for years. There are three basic reasons to launch new models:
1. You are coming up with new features.
2. You are tuning regularization and combining old features in new ways.
3. You are tuning the objective.

Regardless, giving a model a bit of love can be good: looking over the data feeding into the example can help find new signals as well as old, broken ones. So, as you build your model, think about how easy it is to add or remove or recombine features. Think about how easy it is to create a fresh copy of the pipeline and verify its correctness. Think about whether it is possible to have two or three copies running in parallel. Finally, don't worry about whether feature 16 of 35 makes it into this version of the pipeline. You'll get it next quarter.

Rule #17: Start with directly observed and reported features as opposed to learned features.
This might be a controversial point, but it avoids a lot of pitfalls. First of all, let's describe what a learned feature is. A learned feature is a feature generated either by an external system (such as an unsupervised clustering system) or by the learner itself (e.g., via a factored model or deep learning). Both of these can be useful, but they can have a lot of issues, so they should not be in the first model.

If you use an external system to create a feature, remember that the system has its own objective. The external system's objective may be only weakly correlated with your current objective. If you grab a snapshot of the external system, then it can become out of date. If you update the features from the external system, then the meanings may change. If you use an external system to provide a feature, be aware that this approach requires a great deal of care.

The primary issue with factored models and deep models is that they are non-convex. Thus, there is no guarantee that an optimal solution can be approximated or found, and the local minima found on each iteration can be different. This variation makes it hard to judge whether the impact of a change to your system is meaningful or random. By creating a model without deep features, you can get excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.

Rule #18: Explore with features of content that generalize across contexts.
Often a machine learning system is a small part of a much bigger picture. For example, if you imagine a post that might be used in What's Hot, many people will plus-one, reshare, or comment on a post before it is ever shown in What's Hot. If you provide those statistics to the learner, it can promote new posts that it has no data for in the context it is optimizing. YouTube Watch Next could use number of watches, or co-watches (counts of how many times one video was watched after another was watched) from YouTube search. You can also use explicit user ratings. Finally, if you have a user action that you are using as a label, seeing that action on the document in a different context can be a great feature. All of these features allow you to bring new content into the context. Note that this is not about personalization: figure out if someone likes the content in this context first, then figure out who likes it more or less.

Rule #19: Use very specific features when you can.
With tons of data, it is simpler to learn millions of simple features than a few complex features. Identifiers of documents being retrieved and canonicalized queries do not provide much generalization, but they align your ranking with your labels on head queries. Thus, don't be afraid of groups of features where each feature applies to a very small fraction of your data, but overall coverage is above 90%. You can use regularization to eliminate the features that apply to too few examples.

Rule #20: Combine and modify existing features to create new features in human-understandable ways.
There are a variety of ways to combine and modify features. Machine learning systems such as TensorFlow allow you to preprocess your data through transformations. The two most standard approaches are "discretizations" and "crosses".

Discretization consists of taking a continuous feature and creating many discrete features from it. Consider a continuous feature such as age. You can create a feature which is 1 when age is less than 18, another feature which is 1 when age is between 18 and 35, et cetera. Don't overthink the boundaries of these histograms: basic quantiles will give you most of the impact.

Crosses combine two or more feature columns. A feature column, in TensorFlow's terminology, is a set of homogeneous features (e.g., {male, female}, {US, Canada, Mexico}, et cetera). A cross is a new feature column with features in, for example, {male, female} × {US, Canada, Mexico}. This new feature column will contain the feature (male, Canada). If you are using TensorFlow and you tell TensorFlow to create this cross for you, this (male, Canada) feature will be present in examples representing male Canadians. Note that it takes massive amounts of data to learn models with crosses of three, four, or more base feature columns.

Crosses that produce very large feature columns may overfit. For instance, imagine that you are doing some sort of search, and you have a feature column with words in the query, and you have a feature column with words in the document. You can combine these with a cross, but you will end up with a lot of features (see Rule #21). When working with text there are two alternatives. The most draconian is a dot product. A dot product in its simplest form simply counts the number of common words between the query and the document. This feature can then be discretized. Another approach is an intersection: thus, we will have a feature which is present if and only if the word "pony" is in the document and the query, and another feature which is present if and only if the word "the" is in the document and the query.
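
As a minimal illustration of both transformations in plain Python (the bucket boundaries and vocabularies here are made up, and production systems such as TensorFlow provide their own equivalents):

```python
import bisect
import itertools

def discretize(age, boundaries=(18, 35, 50, 65)):
    """Map a continuous age to a single discrete bucket feature."""
    bucket = bisect.bisect(boundaries, age)  # index of the enclosing bucket
    return f"age_bucket_{bucket}"

def cross(*feature_columns):
    """Cross feature columns: one feature per combination, one per column."""
    return [tuple(combo) for combo in itertools.product(*feature_columns)]

print(discretize(27))  # -> 'age_bucket_1' (18 <= age < 35)
print(cross({"male", "female"}, {"US", "Canada", "Mexico"}))
```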

Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have.
There are fascinating statistical learning theory results concerning the appropriate level of complexity for a model, but this rule is basically all you need to know. I have had conversations in which people were doubtful that anything can be learned from one thousand examples, or that you would ever need more than 1 million examples, because they get stuck in a certain method of learning. The key is to scale your learning to the size of your data:
1. If you are working on a search ranking system, and there are millions of different words in the documents and the query and you have 1000 labeled examples, then you should use a dot product between document and query features, TF-IDF, and a half-dozen other highly human-engineered features. 1000 examples, a dozen features.
2. If you have a million examples, then intersect the document and query feature columns, using regularization and possibly feature selection. This will give you millions of features, but with regularization you will have fewer. Ten million examples, maybe a hundred thousand features.
3. If you have billions or hundreds of billions of examples, you can cross the feature columns with document and query tokens, using feature selection and regularization. You will have a billion examples, and 10 million features.
Statistical learning theory rarely gives tight bounds, but gives great guidance for a starting point. In the end, use Rule #28 to decide what features to use.


Rule #22: Clean up features you are no longer using.
Unused features create technical debt. If you find that you are not using a feature, and that combining it with other features is not working, then drop it out of your infrastructure. You want to keep your infrastructure clean so that the most promising features can be tried as fast as possible. If necessary, someone can always add back your feature.

Keep coverage in mind when considering what features to add or keep. How many examples are covered by the feature? For example, if you have some personalization features, but only 8% of your users have any personalization features, it is not going to be very effective.

At the same time, some features may punch above their weight. For example, if you have a feature which covers only 1% of the data, but 90% of the examples that have the feature are positive, then it will be a great feature to add.

Human Analysis of the System

Before going on to the third phase of machine learning, it is important to focus on something that is not taught in any machine learning class: how to look at an existing model, and improve it. This is more of an art than a science, and yet there are several anti-patterns that it helps to avoid.

Rule #23: You are not a typical end user.
This is perhaps the easiest way for a team to get bogged down. While there are a lot of benefits to fishfooding (using a prototype within your team) and dogfooding (using a prototype within your company), employees should look at whether the performance is correct. While a change which is obviously bad should not be used, anything that looks reasonably near production should be tested further, either by paying laypeople to answer questions on a crowdsourcing platform, or through a live experiment on real users.

There are two reasons for this. The first is that you are too close to the code. You may be looking for a particular aspect of the posts, or you are simply too emotionally involved (e.g., confirmation bias). The second is that your time is too valuable. Consider the cost of nine engineers sitting in a one-hour meeting, and think of how many contracted human labels that buys on a crowdsourcing platform.

If you really want to have user feedback, use user experience methodologies. Create user personas (one description is in Bill Buxton's Sketching User Experiences) early in the process and do usability testing (one description is in Steve Krug's Don't Make Me Think) later. User personas involve creating a hypothetical user. For instance, if your team is all male, it might help to design a 35-year-old female user persona (complete with user features), and look at the results it generates rather than 10 results for 25-to-40-year-old males. Bringing in actual people to watch their reaction to your site (locally or remotely) in usability testing can also get you a fresh perspective.

Rule #24: Measure the delta between models.
One of the easiest, and sometimes most useful, measurements you can make before any users have looked at your new model is to calculate just how different the new results are from production. For instance, if you have a ranking problem, run both models on a sample of queries through the entire system, and look at the size of the symmetric difference of the results (weighted by ranking position). If the difference is very small, then you can tell without running an experiment that there will be little change. If the difference is very large, then you want to make sure that the change is good. Looking over queries where the symmetric difference is high can help you to understand qualitatively what the change was like. Make sure, however, that the system is stable. Make sure that a model when compared with itself has a low (ideally zero) symmetric difference.
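
Here is a minimal sketch of one possible position-weighted symmetric difference. The 1/rank weighting is an assumption for illustration, not the document's prescription; any position-discounting weight would serve the same purpose:

```python
def weighted_symmetric_difference(results_a, results_b):
    """Position-weighted symmetric difference between two ranked lists.
    Each result unique to one list contributes 1/(its rank in that list)."""
    ranks_a = {doc: i + 1 for i, doc in enumerate(results_a)}
    ranks_b = {doc: i + 1 for i, doc in enumerate(results_b)}
    only_a = set(ranks_a) - set(ranks_b)
    only_b = set(ranks_b) - set(ranks_a)
    return (sum(1.0 / ranks_a[d] for d in only_a) +
            sum(1.0 / ranks_b[d] for d in only_b))

# A model compared with itself should score (ideally) zero.
print(weighted_symmetric_difference(["x", "y", "z"], ["x", "y", "z"]))  # 0.0
print(weighted_symmetric_difference(["x", "y", "z"], ["x", "z", "w"]))
```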

Rule #25: When choosing models, utilitarian performance trumps predictive power.
Your model may try to predict click-through rate. However, in the end, the key question is what you do with that prediction. If you are using it to rank documents, then the quality of the final ranking matters more than the prediction itself. If you predict the probability that a document is spam and then have a cutoff on what is blocked, then the precision of what is allowed through matters more. Most of the time, these two things should be in agreement: when they do not agree, it will likely be on a small gain. Thus, if there is some change that improves log loss but degrades the performance of the system, look for another feature. When this starts happening more often, it is time to revisit the objective of your model.

Rule #26: Look for patterns in the measured errors, and create new features.
Suppose that you see a training example that the model got "wrong". In a classification task, this could be a false positive or a false negative. In a ranking task, it could be a pair where a positive was ranked lower than a negative. The most important point is that this is an example that the machine learning system knows it got wrong and would like to fix if given the opportunity. If you give the model a feature that allows it to fix the error, the model will try to use it.

On the other hand, if you try to create a feature based upon examples the system doesn't see as mistakes, the feature will be ignored. For instance, suppose that in Play Apps Search, someone searches for "free games". Suppose one of the top results is a less relevant gag app. So you create a feature for "gag apps". However, if you are maximizing number of installs, and people install a gag app when they search for free games, the "gag apps" feature won't have the effect you want.

Once you have examples that the model got wrong, look for trends that are outside your current feature set. For instance, if the system seems to be demoting longer posts, then add post length. Don't be too specific about the features you add. If you are going to add post length, don't try to guess what "long" means; just add a dozen features and let the model figure out what to do with them (see Rule #21). That is the easiest way to get what you want.

Rule #27: Try to quantify observed undesirable behavior.
Some members of your team will start to be frustrated with properties of the system they don't like which aren't captured by the existing loss function. At this point, they should do whatever it takes to turn their gripes into solid numbers. For example, if they think that too many "gag apps" are being shown in Play Search, they could have human raters identify gag apps. (You can feasibly use human-labelled data in this case because a relatively small fraction of the queries accounts for a large fraction of the traffic.) If your issues are measurable, then you can start using them as features, objectives, or metrics. The general rule is "measure first, optimize second".

Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior.
Imagine that you have a new system that looks at every doc_id and exact_query, and then calculates the probability of click for every doc for every query. You find that its behavior is nearly identical to your current system in both side-by-sides and A/B testing, so given its simplicity, you launch it. However, you notice that no new apps are being shown. Why? Well, since your system only shows a doc based on its own history with that query, there is no way to learn that a new doc should be shown.

The only way to understand how such a system would work long-term is to have it train only on data acquired when the model was live. This is very difficult.

Training-Serving Skew

Training-serving skew is a difference between performance during training and performance during serving. This skew can be caused by:
- a discrepancy between how you handle data in the training and serving pipelines, or
- a change in the data between when you train and when you serve, or
- a feedback loop between your model and your algorithm.

We have observed production machine learning systems at Google with training-serving skew that negatively impacts performance. The best solution is to explicitly monitor it so that system and data changes don't introduce skew unnoticed.

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time.

Even if you can't do this for every example, do it for a small fraction, such that you can verify the consistency between serving and training (see Rule #37). Teams that have made this measurement at Google were sometimes surprised by the results. The YouTube home page switched to logging features at serving time with significant quality improvements and a reduction in code complexity, and many teams are switching their infrastructure as we speak.
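
A minimal sketch of the pattern, assuming a hypothetical JSON-lines feature log and a stand-in feature extractor and model:

```python
import json

def extract_features(request):
    # Hypothetical feature extraction; stands in for the real serving code.
    return {"query_length": len(request["query"]), "hour": request["hour"]}

def serve_and_log(request, predict, log_file):
    """Score a request and log the exact features used, so that training
    later consumes precisely what serving saw."""
    features = extract_features(request)
    score = predict(features)
    log_file.write(json.dumps({"features": features, "score": score}) + "\n")
    return score

with open("serving_features.jsonl", "a") as log:
    serve_and_log({"query": "free games", "hour": 14},
                  predict=lambda f: 0.5, log_file=log)
```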

Rule #30: Importance-weight sampled data, don't arbitrarily drop it!
When you have too much data, there is a temptation to take files 1-12, and ignore files 13-99. This is a mistake: dropping data in training has caused issues in the past for several teams (see Rule #6). Although data that was never shown to the user can be dropped, importance weighting is best for the rest. Importance weighting means that if you decide that you are going to sample example X with a 30% probability, then give it a weight of 10/3. With importance weighting, all of the calibration properties discussed in Rule #14 still hold.

Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change.
Say you join doc ids with a table containing features for those docs (such as number of comments or clicks). Between training and serving time, features in the table may be changed. Your model's prediction for the same document may then differ between training and serving. The easiest way to avoid this sort of problem is to log features at serving time (see Rule #32). If the table is changing only slowly, you can also snapshot the table hourly or daily to get reasonably close data. Note that this still doesn't completely resolve the issue.

Rule #32: Reuse code between your training pipeline and your serving pipeline whenever possible.
Batch processing is different than online processing. In online processing, you must handle each request as it arrives (e.g., you must do a separate lookup for each query), whereas in batch processing, you can combine tasks (e.g., making a join). At serving time, you are doing online processing, whereas training is a batch processing task. However, there are some things that you can do to reuse code. For example, you can create an object that is particular to your system where the result of any queries or joins can be stored in a very human-readable way, and errors can be tested easily. Then, once you have gathered all the information, during serving or training, you run a common method to bridge between the human-readable object that is specific to your system, and whatever format the machine learning system expects. This eliminates a source of training-serving skew. As a corollary, try not to use two different programming languages between training and serving; that decision will make it nearly impossible for you to share code.

Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after.
In general, measure performance of a model on the data gathered after the data you trained the model on, as this better reflects what your system will do in production. If you produce a model based on the data until January 5th, test the model on the data from January 6th. You will expect that the performance will not be as good on the new data, but it shouldn't be radically worse. Since there might be daily effects, you might not predict the average click rate or conversion rate, but the area under the curve, which represents the likelihood of giving the positive example a score higher than a negative example, should be reasonably close.
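
A minimal sketch of such a temporal split (the dates and example structure are made up for illustration):

```python
from datetime import date

# Hypothetical examples, each stamped with the day it was gathered.
examples = [{"day": date(2024, 1, d), "features": {}, "label": d % 2}
            for d in range(1, 10)]

cutoff = date(2024, 1, 5)
train = [e for e in examples if e["day"] <= cutoff]  # data up to Jan 5th
test = [e for e in examples if e["day"] > cutoff]    # Jan 6th and after

print(len(train), "training examples,", len(test), "test examples")
```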

Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.
In a filtering task, examples which are marked as negative are not shown to the user. Suppose you have a filter that blocks 75% of the negative examples at serving. You might be tempted to draw additional training data from the instances shown to users. For example, if a user marks an email as spam that your filter let through, you might want to learn from that.

But this approach introduces sampling bias. You can gather cleaner data if instead during serving you label 1% of all traffic as "held out", and send all held-out examples to the user. Now your filter is blocking at least 74% of the negative examples. These held-out examples can become your training data.

Note that if your filter is blocking 95% of the negative examples or more, this becomes less viable. Even so, if you wish to measure serving performance, you can make an even tinier sample (say 0.1% or 0.001%). Ten thousand examples is enough to estimate performance quite accurately.

Rule #35: Beware of the inherent skew in ranking problems.
When you switch your ranking algorithm radically enough that different results show up, you have effectively changed the data that your algorithm is going to see in the future. This kind of skew will show up, and you should design your model around it. There are multiple different approaches. These approaches are all ways to favor data that your model has already seen.
1. Have higher regularization on features that cover more queries, as opposed to those features that are on for only one query. This way, the model will favor features that are specific to one or a few queries over features that generalize to all queries. This approach can help prevent very popular results from leaking into irrelevant queries. Note that this is opposite the more conventional advice of having more regularization on feature columns with more unique values.
2. Only allow features to have positive weights. Thus, any good feature will be better than a feature that is "unknown".
3. Don't have document-only features. This is an extreme version of #1. For example, even if a given app is a popular download regardless of what the query was, you don't want to show it everywhere. Not having document-only features keeps that simple. The reason you don't want to show a specific popular app everywhere has to do with the importance of making all the desired apps reachable. For instance, if someone searches for "bird watching app", they might download "angry birds", but that certainly wasn't their intent. Showing such an app might improve download rate, but leave the user's needs ultimately unsatisfied.

Rule #36: Avoid feedback loops with positional features.
The position of content dramatically affects how likely the user is to interact with it. If you put an app in the first position it will be clicked more often, and you will be convinced it is more likely to be clicked. One way to deal with this is to add positional features, i.e., features about the position of the content in the page. You train your model with positional features, and it learns to weight, for example, the feature "1st position" heavily. Your model thus gives less weight to other factors for examples with "1st position=true". Then at serving you don't give any instances the positional feature, or you give them all the same default feature, because you are scoring candidates before you have decided the order in which to display them.

Note that it is important to keep any positional features somewhat separate from the rest of the model because of this asymmetry between training and testing. Having the model be the sum of a function of the positional features and a function of the rest of the features is ideal. For example, don't cross the positional features with any document feature.
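
A minimal sketch of the train/serve asymmetry; the feature names and the "missing" default value are assumptions for illustration:

```python
def featurize(example, serving=False):
    features = {
        "query_match": example["query_match"],
        "doc_popularity": example["doc_popularity"],
    }
    # Position is known at training time (the item was actually shown), but
    # at serving we haven't ordered the candidates yet, so use a default.
    features["position"] = "missing" if serving else str(example["position"])
    return features

train_f = featurize({"query_match": 1.0, "doc_popularity": 0.7, "position": 1})
serve_f = featurize({"query_match": 1.0, "doc_popularity": 0.7}, serving=True)
print(train_f, serve_f, sep="\n")
```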

Rule #37: Measure Training/Serving Skew.
There are several things that can cause skew in the most general sense. Moreover, you can divide it into several parts:
1. The difference between the performance on the training data and the holdout data. In general, this will always exist, and it is not always bad.
2. The difference between the performance on the holdout data and the "next-day" data. Again, this will always exist. You should tune your regularization to maximize the next-day performance. However, large drops in performance between holdout and next-day data may indicate that some features are time-sensitive and possibly degrading model performance.
3. The difference between the performance on the next-day data and the live data. If you apply a model to an example in the training data and the same example at serving, it should give you exactly the same result (see Rule #5). Thus, a discrepancy here probably indicates an engineering error. A minimal check of this last comparison is sketched below.
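
This sketch assumes you have logged (serving score, training-pipeline score) pairs for the same examples, as in Rule #29; the 0.05 tolerance is an arbitrary placeholder:

```python
def max_score_skew(logged_pairs):
    """Compare the score a model produced at serving time with the score
    the training pipeline reproduces for the same example (item 3 above)."""
    return max(abs(serving - training) for serving, training in logged_pairs)

# Hypothetical (serving_score, training_score) pairs for logged examples.
pairs = [(0.91, 0.91), (0.40, 0.41), (0.73, 0.73)]
skew = max_score_skew(pairs)
print(f"max skew = {skew:.3f}")
assert skew < 0.05, "possible engineering error (see Rule #5)"
```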

ML Phase III: Slowed Growth, Optimization Refinement, and Complex Models

There will be certain indications that the second phase is reaching a close. First of all, your monthly gains will start to diminish. You will start to have tradeoffs between metrics: you will see some rise and others fall in some experiments. This is where it gets interesting. Since the gains are harder to achieve, the machine learning has to get more sophisticated.

A caveat: this section has more blue-sky rules than earlier sections. We have seen many teams go through the happy times of Phase I and Phase II machine learning. Once Phase III has been reached, teams have to find their own path.

Rule #38: Don't waste time on new features if unaligned objectives have become the issue.
As your measurements plateau, your team will start to look at issues that are outside the scope of the objectives of your current machine learning system. As stated before, if the product goals are not covered by the existing algorithmic objective, you need to change either your objective or your product goals. For instance, you may optimize clicks, plus-ones, or downloads, but make launch decisions based in part on human raters.

Rule #39: Launch decisions are a proxy for long-term product goals.
Alice has an idea about reducing the logistic loss of predicting installs. She adds a feature. The logistic loss drops. When she does a live experiment, she sees the install rate increase. However, when she goes to a launch review meeting, someone points out that the number of daily active users drops by 5%. The team decides not to launch the model. Alice is disappointed, but now realizes that launch decisions depend on multiple criteria, only some of which can be directly optimized using ML.

The truth is that the real world is not Dungeons and Dragons: there are no "hit points" identifying the health of your product. The team has to use the statistics it gathers to try to effectively predict how good the system will be in the future. They need to care about engagement, 1-day active users (DAU), 30-day DAU, revenue, and advertiser's return on investment. These metrics that are measurable in A/B tests are in themselves only a proxy for more long-term goals: satisfying users, increasing users, satisfying partners, and profit, which even then you could consider proxies for having a useful, high-quality product and a thriving company five years from now.

The only easy launch decisions are when all metrics get better (or at least do not get worse). If the team has a choice between a sophisticated machine learning algorithm and a simple heuristic, and the simple heuristic does a better job on all these metrics, it should choose the heuristic. Moreover, there is no explicit ranking of all possible metric values. Specifically, consider the following two scenarios:

Experiment    Daily Active Users    Revenue/Day
A             1 million             $4 million
B             2 million             $2 million

If the current system is A, then the team would be unlikely to switch to B. If the current system is B, then the team would be unlikely to switch to A. This seems in conflict with rational behavior; however, predictions of changing metrics may or may not pan out, and thus there is a large risk involved with either change. Each metric covers some risk with which the team is concerned. Moreover, no metric covers the team's ultimate concern: "where is my product going to be five years from now?"

Individuals, on the other hand, tend to favor one objective that they can directly optimize. Most machine learning tools favor such an environment. An engineer banging out new features can get a steady stream of launches in such an environment. There is a type of machine learning, multi-objective learning, which starts to address this problem. For instance, one can formulate a constraint satisfaction problem that has lower bounds on each metric, and optimizes some linear combination of metrics. However, even then, not all metrics are easily framed as machine learning objectives: if a document is clicked on or an app is installed, it is because the content was shown. But it is far harder to figure out why a user visits your site. How to predict the future success of a site as a whole is AI-complete, as hard as computer vision or natural language processing.

Rule #40: Keep ensembles simple.
Unified models that take in raw features and directly rank content are the easiest models to debug and understand. However, an ensemble of models (a "model" which combines the scores of other models) can work better. To keep things simple, each model should either be an ensemble only taking the input of other models, or a base model taking many features, but not both. If you have models on top of other models that are trained separately, then combining them can result in bad behavior.

Use a simple model for ensembling that takes only the output of your "base" models as inputs. You also want to enforce properties on these ensemble models. For example, an increase in the score produced by a base model should not decrease the score of the ensemble. Also, it is best if the incoming models are semantically interpretable (for example, calibrated) so that changes of the underlying models do not confuse the ensemble model. Also, enforce that an increase in the predicted probability of an underlying classifier does not decrease the predicted probability of the ensemble.

Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals.
You've added some demographic information about the user. You've added some information about the words in the document. You have gone through template exploration and tuned the regularization. You haven't seen a launch with more than a 1% improvement in your key metrics in a few quarters. Now what?

It is time to start building the infrastructure for radically different features, such as the history of documents that this user has accessed in the last day, week, or year, or data from a different property. Use wikidata entities or something internal to your company (such as Google's knowledge graph). Use deep learning. Start to adjust your expectations on how much return you expect on investment, and expand your efforts accordingly. As in any engineering project, you have to weigh the benefit of adding new features against the cost of increased complexity.

Rule #42: Don't expect diversity, personalization, or relevance to be as correlated with popularity as you think they are.
Diversity in a set of content can mean many things, with the diversity of the source of the content being one of the most common. Personalization implies each user gets their own results. Relevance implies that the results for a particular query are more appropriate for that query than any other. Thus all three of these properties are defined as being different from the ordinary.

The problem is that the ordinary tends to be hard to beat.

Note that if your system is measuring clicks, time spent, watches, +1s, reshares, et cetera, you are measuring the popularity of the content. Teams sometimes try to learn a personal model with diversity. To personalize, they add features that would allow the system to personalize (some features representing the user's interest) or diversify (features indicating if this document has any features in common with other documents returned, such as author or content), and find that those features get less weight (or sometimes a different sign) than they expect.

This doesn't mean that diversity, personalization, or relevance aren't valuable. As pointed out in the previous rule, you can do post-processing to increase diversity or relevance. If you see longer-term objectives increase, then you can declare that diversity/relevance is valuable, aside from popularity. You can then either continue to use your post-processing, or directly modify the objective based upon diversity or relevance.

Rule #43: Your friends tend to be the same across different products. Your interests tend not to be.
Teams at Google have gotten a lot of traction from taking a model predicting the closeness of a connection in one product, and having it work well on another. Your friends are who they are. On the other hand, I have watched several teams struggle with personalization features across product divides. Yes, it seems like it should work. For now, it doesn't seem like it does. What has sometimes worked is using raw data from one property to predict behavior on another. Also, keep in mind that even knowing that a user has a history on another property can help. For instance, the presence of user activity on two products may be indicative in and of itself.

Related Work

There are many documents on machine learning at Google as well as externally:
- Machine Learning Crash Course: an introduction to applied machine learning.
- Machine Learning: A Probabilistic Perspective by Kevin Murphy, for an understanding of the field of machine learning.
- Practical Advice for the Analysis of Large, Complex Data Sets: a data science approach to thinking about data sets.
- Deep Learning by Ian Goodfellow et al., for learning nonlinear models.
- Google's paper on technical debt, which has a lot of general advice.
- TensorFlow documentation.

Acknowledgements

Thanks to David Westbrook, Peter Brandt, Samuel Ieong, Chenyu Zhao, Li Wei, Michalis Potamias, Evan Rosen, Barry Rosenberg, Christine Robson, James Pine, Tal Shaked, Tushar Chandra, Mustafa Ispir, Jeremiah Harmsen, Konstantinos Katsiapis, Glen Anderson, Dan Duckworth, Shishir Birmiwal, Gal Elidan, Su Lin Wu, Jaihui Liu, Fernando Pereira, and Hrishikesh Aradhye for many corrections, suggestions, and helpful examples for this document. Also, thanks to Kristen Lefevre, Suddha Basu, and Chris Berg, who helped with an earlier version. Any errors, omissions, or (gasp!) unpopular opinions are my own.

Appendix

There are a variety of references to Google products in this document. To provide more context, I give a short description of the most common examples below.

YouTube Overview
YouTube is a streaming video service. Both the YouTube Watch Next and YouTube Home Page teams use ML models to rank video recommendations. Watch Next recommends videos to watch after the currently playing one, while Home Page recommends videos to users browsing the home page.

Google Play Overview
Google Play has many models solving a variety of problems. Play Search, Play Home Page Personalized Recommendations, and Users Also Installed apps all use machine learning.

Google Plus Overview
Google Plus uses machine learning in a variety of situations: ranking posts in the "stream" of posts being seen by the user, ranking "What's Hot" posts (posts that are very popular now), ranking people you know, et cetera.
