Ultimate Guide To Ab Testing

Thanks for downloading this PDF!
You can find the whole post online at:
http://conversionxl.com/abtestingguide
The Smart Marketer's Guide to A/B Testing
A/B testing. You've heard so much about it in the past few years. You may have even run a test
or two (or a thousand). Yet, for all the content out there about split testing, the concept is still
frequently misunderstood.
In its simplest sense, A/B split testing is a new term for an old technique: controlled
experimentation. When researchers are testing the efficacy of new drugs, they use a split test.
In fact, most research experiments could be considered a split test, complete with a
hypothesis, a control and variation, and a statistically calculated result.
The main difference, however, lies in the variability of internet traffic. In a lab, it's easier to
control for external variables. Online, you can mitigate them, but it's truly difficult to operate a
purely controlled test.
In addition, testing new drugs requires an almost certain degree of accuracy. Lives are on the
line. In technical terms, your period of exploration can be much longer, as you want to be
damned sure during your period of exploitation that you didn't commit a type I error (false
positive).
A/B split testing online is primarily a business decision. It's a weighing of risk vs reward,
exploration vs exploitation, science vs business. Therefore, we view results with a different lens
and make decisions slightly differently than we would for tests in a pure lab setting.
Ready to dive in? Let's start with the basics.
What Is An A/B Test?
An A/B test is a controlled online experiment in which 50% of your traffic is sent to the original
page and 50% is sent to a variation.
That's it. In its simplest sense, an A/B test is a 50/50 traffic split that seeks to validate changes
you've made to a page by testing similar traffic during the same time period.
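To make the 50/50 split concrete, here is a minimal sketch of one common way to assign visitors: hash a visitor ID so the same person always lands in the same bucket. The function name, the visitor IDs, and the experiment name are made up for illustration, and real testing tools may well do this differently.

```python
import hashlib


def assign_variation(visitor_id, experiment, variations=("original", "variation")):
    """Deterministically map a visitor to one variation, with equal odds for each."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Treat the hash as a big integer and bucket it evenly across the variations.
    return variations[int(digest, 16) % len(variations)]


# Hypothetical traffic: 10,000 visitors hitting one experiment.
counts = {"original": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variation(f"visitor-{i}", "homepage-headline")] += 1
print(counts)  # roughly half of the visitors should end up in each bucket
```

Passing more names in `variations` turns the same sketch into an A/B/n split.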
You can, of course, create more than two variations. This is broadly known as an A/B/n test: if you
have the traffic that allows it, you can test as many variations as you'd like. Here's an example
of an A/B/C/D test, and how much traffic each variation is allocated:
[Image: traffic allocation per variation in an A/B/C/D test]
Now, at this point, you may be wondering what the difference is between an A/B/C/D test and a
multivariate test (MVT). MVT isn't the only other type of experiment, either. There are also
bandit algorithms that solve similar problems as A/B/n tests, just in a different way.
A/B Testing, Multivariate, and Bandit Algorithms: What's the Difference?
Multivariate Tests
A/B/n tests are controlled experiments run on 1 or more variations + the original page that
directly compare conversion rate means based on the changes made between variations.
While it sounds similar, multivariate tests are controlled experiments that test multiple versions
of a page and attempt to isolate which attributes cause the largest impact. In other words,
multivariate tests are like A/B/n tests in that they test an original against variations, but each
variation contains different design elements. For example:
[Image: example of a multivariate test]
If you have enough traffic, you should use both types of tests to maximize the output of your
optimization program. Each one has a different and specific impact and use case, and used
together, they can help you get the most out of your site. Here's how:
Use A/B testing to determine the best layouts.
Use MVT to polish the layouts to make sure all the elements interact with each other in the best possible way.
As I said before, you need to get a ton of traffic to the page you're testing before even
considering MVT.
However, make sure your priorities align with your testing program. Peep once said, "most top
agencies that I've talked to about this run ~10 A/B tests for every 1 MVT."
Bandit Testing
As for bandit algorithms, you can almost think of them as A/B/n tests that update in real time
based on the performance of each variation.
In essence, a bandit algorithm starts by sending traffic to two (or more) pages: the original and
the variation(s). Then, in an attempt to "pull the winning slot machine arm most often," the algorithm
updates based on whether or not a variation is "winning." Eventually, the algorithm fully exploits
the best option:
[Image: bandit algorithm shifting traffic toward the best option]
One of the big benefits of bandit testing is that bandits mitigate "regret," which is basically the
lost conversions you experience while exploring a potentially worse variation in a test. This chart
from Google explains that very well:
[Image: chart from Google illustrating regret]
By the way, try not to think of bandits and A/B/n tests as a this-or-that scenario; they're tools
that each have their purposes. In general, bandits are great for:
Headlines and Short-Term Campaigns
Automation for Scale
Targeting
Blending Optimization with Attribution
Read this article for more information on bandit algorithms.
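If you'd like to see the explore/exploit idea in motion, here's a rough simulation using epsilon-greedy, one of the simplest bandit strategies (not necessarily what any particular tool uses). The conversion rates, visitor count, and 10% exploration share are all assumptions picked for illustration.

```python
import random


def epsilon_greedy(true_rates, visitors=20_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit; returns (shows, conversions) per arm."""
    rng = random.Random(seed)
    shows = [0] * len(true_rates)
    conversions = [0] * len(true_rates)
    for _ in range(visitors):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick any arm at random.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the arm with the best observed conversion rate so far.
            observed = [c / s for c, s in zip(conversions, shows)]
            arm = observed.index(max(observed))
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return shows, conversions


# Hypothetical true conversion rates: original at 4%, variation at 12%.
shows, conversions = epsilon_greedy([0.04, 0.12])
print("traffic per arm:", shows)
print("conversions per arm:", conversions)
```

With a gap this large, most of the traffic should drift toward the better arm, which is the regret-reducing behavior described above. The trade-off is that the weaker arm collects less data, so comparisons against it are noisier.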
Two other minor issues with A/B testing:
One-tail or two-tail test?
Bayesian or Frequentist stats?
One-Tail vs Two-Tail A/B Tests
I promise you, this is a much smaller issue than some people think. Therefore, we'll brush over it
quickly.
One-tailed tests allow for the possibility of an effect in just one direction, whereas with two-tailed
tests, you are testing for the possibility of an effect in two directions: both positive and
negative.
No need to get very worked up about this. Matt Gershoff from Conductrics summed it up really well:
"If your testing software only does one type or the other, don't sweat it. It is super simple to
convert one type to the other (but you need to do this BEFORE you run the test) since all of the
math is exactly the same in both tests. All that is different is the significance threshold level. If
your software uses a one-tail test, just divide the p-value associated with the confidence level
you are looking to run the test by 2. So if you want your two-tail test to be at the 95%
confidence level, then you would actually input a confidence level of 97.5%, or if at a 99%, then
you need to input 99.5%. You can then just read the test as if it was two-tailed."
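The conversion in that quote is just arithmetic on the significance threshold. Here's a small sketch that reproduces the 95% to 97.5% and 99% to 99.5% numbers and checks that the critical z-score comes out the same either way. The function names are made up for illustration.

```python
from statistics import NormalDist


def one_tailed_confidence_for(two_tailed_confidence):
    """Confidence level to enter in a one-tailed tool to get a two-tailed test."""
    alpha = 1.0 - two_tailed_confidence
    return 1.0 - alpha / 2.0


def critical_z_one_tailed(confidence):
    """z-score a one-tailed test must exceed at the given confidence level."""
    return NormalDist().inv_cdf(confidence)


def critical_z_two_tailed(confidence):
    """z-score (in absolute value) a two-tailed test must exceed."""
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)


for level in (0.95, 0.99):
    converted = one_tailed_confidence_for(level)
    print(level, "two-tailed ->", round(converted, 4), "one-tailed")
    # Same threshold either way; only the label on the confidence level differs.
    print(critical_z_two_tailed(level), critical_z_one_tailed(converted))
```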
Dive down the rabbit hole with our article on one-tail vs two-tail tests if you'd like.
Bayesian or Frequentist Stats
Bayesian or Frequentist A/B testing is another hot topic for debate, especially with popular tools
rebuilding their stats engines to feature a Bayesian methodology.
Here's the difference (very much simplified):
Using a Frequentist method means making predictions on underlying truths of the experiment
using only data from the current experiment.
The difference is that in the Bayesian view, a probability is assigned to a hypothesis. In the
Frequentist view, a hypothesis is tested without being assigned a probability.
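To make "a probability is assigned to a hypothesis" a bit more tangible, here is a rough Bayesian sketch that estimates the probability that B's true conversion rate beats A's. It assumes a flat Beta(1, 1) prior and uses Monte Carlo draws. Both are simplifying assumptions for illustration, the traffic numbers are made up, and this is not how any specific tool's stats engine works.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors under a flat prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: A converts 100 of 1,000 visitors, B converts 150 of 1,000.
print(prob_b_beats_a(100, 1000, 150, 1000))
```

A Frequentist test on the same data would instead report a p-value, which is not a probability that B is better.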
Dr. Rob Balon, who carries a PhD in statistics and market research, says the debate is mostly
esoteric tail wagging done in the domain of the ivory tower. In truth, he says, most analysts
out of the ivory tower don't care that much, if at all, about Bayesian vs. Frequentist.
Don't get me wrong, there are practical business implications to each methodology. For the
sake of discussion, though, if you're at all new to A/B testing, there are much more important
things to worry about.
If you do want to dive down the rabbit hole, though, here's an article we wrote on Bayesian vs
Frequentist A/B Testing.
Strategy First: Setting up a Successful A/B Testing Plan
A structured approach to A/B testing could be your biggest area of impact. Don't listen to any
blog posts that tell you "99 Things You Can A/B Test Right Now." That's a waste of time and
traffic. Being a bit more process-minded will make you more money.
In a survey done by Econsultancy and RedEye, 74% of the survey respondents who reported
having a structured approach to conversion also stated they had improved their sales. Those
that don't have a structured approach, I'm almost sure, stay in what Craig Sullivan calls the
"Trough of Disillusionment" (unless their results are littered with false positives, which we'll get
into later).
So: structure. To simplify a winning process, it goes something like this:
1. Measurement
2. Prioritization
3. Experimentation
4. Repeat
Measurement: Getting Data-Driven Insights
Measurement is a crucial step in this process, because we want to know what is happening as
well as why it's happening.
First things first, we'll start with the high-level strategy and move down to the granular. So think
in this order:
1. Define your business objectives
2. Define your website goals
3. Define your Key Performance Indicators
4. Define your target metrics
Once you know where you want to go, we can collect the data necessary to get there. To do
this, we recommend the ResearchXL Framework (we'll go over this briefly, but to really
understand it, read this post).
Point is, we want to be data-driven, which means we want to collect and analyze relevant data
for our goals. This means a structured approach to our conversion research. Here's the executive
summary of our process:
1. Heuristic Analysis
2. Technical Analysis
3. Web Analytics Analysis
4. Mouse Tracking Analysis
5. Qualitative Surveys
6. User Testing
Heuristic analysis is about as close as we get to "best practices." The difference here in our
process is that, after years of experience, you still can't tell what exactly will work, but you can
more easily point out opportunity areas. As Craig Sullivan put it:
"My experience in observing and fixing things: these patterns do make me a better
diagnostician, but they don't function as truths. They guide and inform my work, but they don't
provide guarantees."
So when it comes to heuristic analysis, humility and a framework are important. We assess
each page based on:
Relevancy
Clarity
Value
Friction
Distraction
Read about WiderFunnel's LIFT Model for a good heuristic framework.
Technical analysis is an area often overlooked and highly underrated by optimizers. Bugs, if
they're around, are your main conversion killer. You think your site works perfectly, both in
terms of user experience and functionality, with every browser and device? Probably not.
This is low-hanging fruit, one that you can make a lot of money on (think 12-month
perspective). So start:
Conduct cross-browser and cross-device testing
Do speed analysis
Web analytics analysis is next. First things first: make sure everything is working. You'd be
surprised how many analytics setups are broken.
Google Analytics (and other analytics setups) are a course in themselves, so I'll leave you with
some helpful links to read:
Google Analytics 101: How To Configure Google Analytics To Get Actionable Data
Google Analytics 102: How To Set Up Goals, Segments & Events in Google Analytics
Next is mouse tracking analysis, which includes heat maps, scroll maps, click maps, form
analytics, and user session replays. One point of advice here is to not get carried away with
pretty visualizations of click maps, etc. Make sure you're informing your larger goals with the
analytics in this step.
Qualitative research is an important part of measurement as well, because it tells you the why
that quantitative analysis misses. Many people think that qualitative analysis is "softer" or easier
than quantitative, but it should be just as rigorous and can provide insights just as important
as your GA data.
For qualitative research, use things like:
On-site surveys
Customer surveys
Customer interviews and focus groups
Finally, we have user testing. The premise is simple: observe actual people use and interact with
your website while they're commenting on their thought process out loud. Pay attention to what
they say and experience.
After the heavy-ass conversion research, you'll have lots of data and need to do some
prioritization.
Prioritizing A/B Tests
There are many frameworks to prioritize your A/B tests, and you could even innovate with your
own formula. But here's how we do it. Once you go through all 6 steps, you will identify
issues, some of them severe, some minor. You'll want to allocate every finding into one of
these 5 buckets:
1. Test. (This bucket is where you place stuff for testing.)
2. Instrument. (This can involve fixing, adding, or improving tag or event handling on the analytics configuration.)
3. Hypothesize. (This is where we've found a page, widget, or process that's just not working well but we don't see a clear single solution.)
4. Just Do It (JFDI). (Here's the bucket for no-brainers. Just do it.)
5. Investigate. (If an item is in this bucket, you need to ask questions or do further digging.)
Then we rank them from 1 to 5 stars (1 = minor issue, 5 = critically important). There are 2
criteria that are more important than others when giving a score:
1. Ease of implementation (time/complexity/risk). Sometimes the data tells you to build a feature, but it takes months to do it. So it's not something you'd start with.
2. Opportunity score (subjective opinion on how big of a lift you might get).
Then create a spreadsheet with all of your data and you'll have a prioritized testing roadmap,
more rigorous than most of your competitors will have.
You can also use a variety of other frameworks. A very popular one is the PIE framework. This
breaks opportunity areas into three scores:
1. Potential
2. Importance
3. Ease
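As a quick illustration of turning those three scores into a ranked roadmap, here's a minimal sketch. It assumes each score is given on a 1 to 10 scale and that the three are simply averaged, which is a common way to apply PIE but still an assumption here. The test ideas and their scores are invented.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    potential: int   # how much improvement is possible (assumed 1-10 scale)
    importance: int  # how valuable the traffic to this page is (assumed 1-10)
    ease: int        # how easy the test is to implement (assumed 1-10)

    @property
    def pie_score(self) -> float:
        # Assumption: the PIE score is the plain average of the three scores.
        return (self.potential + self.importance + self.ease) / 3


def prioritize(ideas):
    """Return ideas sorted from highest to lowest PIE score."""
    return sorted(ideas, key=lambda idea: idea.pie_score, reverse=True)


# Hypothetical backlog of test ideas.
backlog = [
    Idea("Rewrite homepage headline", potential=6, importance=9, ease=9),
    Idea("Redesign checkout flow", potential=9, importance=8, ease=3),
    Idea("Add trust badges to cart", potential=5, importance=6, ease=8),
]
for idea in prioritize(backlog):
    print(f"{idea.pie_score:.1f}  {idea.name}")
```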
Read more on frameworks to prioritize A/B testing here.
Setting Up Your A/B Tests
Once you've got a prioritized list of test ideas, it's time to form a hypothesis and run an
experiment. Basically, a hypothesis will define why you believe a problem occurs. Furthermore,
a good hypothesis:
1. Is testable. It needs to be measurable, so that it can be used in testing.
2. Has a goal of solving conversion problems. Split testing is done to solve specific conversion problems.
3. Gains market insights. A well-articulated hypothesis will let your split testing results give you information about your customers, whether the test wins or loses or whatever.
Craig Sullivan has put together a hypothesis kit to simplify the process. Here's his simple
version:
1. Because we saw (data/feedback)
2. We expect that (change) will cause (impact)
3. We'll measure this using (data metric)
And the advanced one:
1. Because we saw (qual & quant data)
2. We expect that (change) for (population) will cause (impact(s))
3. We expect to see (data metric(s) change) over a period of (x business cycles)
Technical Stuff
Here's the fun part: you finally get to pick your tool.
While this is the first thing many people think about, it's not actually the most important, by any
means. The strategy and statistical knowledge aspects come first, and only then should you
worry about picking a tool.
That said, there are a few differences you should bear in mind.
One major distinction in tools is whether they are server-side or client-side testing tools.
Server-side tools render code on the server level and send a randomized version of the page to
the viewer with no modification on the visitor's browser. Client-side tools send the same page,
but JavaScript on the client's browser manipulates the appearance of both the original and the
variation.
Client-side testing tools are things like Optimizely, VWO, and Adobe Target. Conductrics has
capabilities of both, and SiteSpect does a proxy server-side method.
What does all this mean for you? If you'd like to save time up front, or if your team is small or
lacks development resources, client-side tools can get you up and running faster. Server-side
requires development resources but can often be more robust.
So while setting up tests can be slightly different depending on which tool you use, often it will
be as simple as signing up for your favorite tool and following some basic instructions, like
putting a JavaScript snippet on your website.
You'll basically want to set up goals (something that lets you know a conversion has been
made, like a "thank you for purchasing" page), and your testing tool will track when each
variation converts visitors into customers.
Some skills that come in handy when setting up tests are HTML, CSS, and JavaScript/jQuery,
as well as design and copywriting skills to draw up the variations. Sure, some tools allow use of
a visual editor, but that limits your flexibility and control, so learning some technical skills is
helpful.
Or you could use something like Testing.Agency to set up your tests for you.
How Long Should You Run A/B Tests?
First rule: don't stop a test just because it reaches statistical significance. This is probably the
most common error committed by beginning optimizers with good intentions.
If you're calling your tests when you hit significance, you'll find that most of your lifts don't
translate to increased revenue (that's the goal, after all). You'll find that the lifts were in fact
imaginary.
Consider this: One thousand A/A tests (two identical pages tested against each other) were run.
771 experiments out of 1,000 reached 90% significance at some point
531 experiments out of 1,000 reached 95% significance at some point
Stopping tests at significance breeds the risk of false positives and fails to account for possible
external validity threats like seasonality.
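You can get a feel for why this happens with a small simulation. The sketch below runs simulated A/A tests (both arms share the same true conversion rate), checks significance at regular peeks, and compares how often a test looks significant at any peek versus only at the planned end. The conversion rate, sample size, peek interval, and the plain two-proportion z-test are all assumptions chosen for illustration, so don't expect it to reproduce the exact figures above.

```python
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions (or all conversions) so far: nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=200, visitors_per_arm=2000, peek_every=100,
                      rate=0.05, alpha=0.05, seed=1):
    """Return (share significant at ANY peek, share significant at the final look)."""
    rng = random.Random(seed)
    sig_at_any_peek = 0
    sig_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_significant = False
        p = 1.0
        for n in range(1, visitors_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0:
                p = z_test_p_value(conv_a, n, conv_b, n)
                ever_significant = ever_significant or p < alpha
        sig_at_any_peek += ever_significant
        sig_at_end += p < alpha  # the last peek falls on the final visitor
    return sig_at_any_peek / runs, sig_at_end / runs


any_peek, final_only = simulate_aa_tests()
print("significant at some peek:", any_peek)
print("significant at the planned end:", final_only)
```

Because the final look is itself one of the peeks, the "any peek" share can never be lower than the "planned end" share. The gap between the two is the cost of stopping the moment a test crosses the line.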
Instead, you'll want to predetermine a sample size and run the test for full weeks, usually for at
least two business cycles.
How do you predetermine sample size? There are lots of great tools out there for that, including
tools within your favorite testing tool. Here's how you'd calculate your sample size with Evan
Miller's tool:
[Image: Evan Miller's sample size calculator]
In this case, we told the tool that we have a 3% conversion rate and want to detect at least a 10%
uplift. The tool tells us that we need 51,486 visitors per variation before we can look at the statistical
significance levels and statistical power.
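If you're curious what sits behind a calculator like that, here is a sketch using a standard sample size formula for comparing two proportions. It assumes a two-sided test at 95% significance with 80% power, and it reads the 10% uplift as relative (3% to 3.3%). It isn't claimed to be the calculator's exact implementation, though under these assumptions it should land close to the figure quoted above. The daily traffic number in the usage example is made up.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_uplift, alpha=0.05, power=0.80):
    """Visitors needed per variation to detect the given relative uplift."""
    delta = baseline * relative_uplift            # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    sd_null = (2 * baseline * (1 - baseline)) ** 0.5
    sd_alt = (baseline * (1 - baseline)
              + (baseline + delta) * (1 - baseline - delta)) ** 0.5
    return (z_alpha * sd_null + z_power * sd_alt) ** 2 / delta ** 2


def full_weeks_needed(visitors_per_variation, variations, daily_visitors):
    """Round the test duration up to whole weeks."""
    total_visitors = visitors_per_variation * variations
    return math.ceil(total_visitors / (daily_visitors * 7))


n = sample_size_per_variation(baseline=0.03, relative_uplift=0.10)
print(round(n), "visitors per variation")
# Hypothetical traffic: 5,000 visitors a day, split across 2 variations.
print(full_weeks_needed(n, variations=2, daily_visitors=5000), "full weeks")
```

If the week count comes out shorter than two business cycles, the advice above still applies: run for the longer of the two.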
Oh, and you'll notice that in addition to significance level, there's something called statistical power
in the photo above as well.
Statistical power is another important factor in running your A/B test, as it attempts to avoid
Type II errors (false negatives). In other words, it makes it more likely that you detect an effect
if there actually is one.
For practical purposes, know that 80% power is the standard for testing tools. To reach such a
level, you need either a large sample size, a large effect size, or a longer duration test.
There Are No Magic Numbers
You'll read a lot of blog posts that have magic numbers like "100 conversions" or "1,000 visitors"
as their stopping points. Math is not magic, math is math, and what we're dealing with is slightly
more complex than simplistic heuristics like that.
Andrew Anderson from Malwarebytes put it well:
"It is never about how many conversions, it is about having enough data to validate based on
representative samples and representative behavior.
100 conversions is possible in only the most remote cases and with an incredibly high delta in
behavior, but only if other requirements like behavior over time, consistency, and normal
distribution take place. Even then it has a really high chance of a type I error, false positive."
What we're worried about is the representativeness of our sample. How can we ensure it in basic
terms? Your test should run for 1, or better yet 2, business cycles, so it includes everything that
goes on:
every day of the week (and tested one week at a time, as your daily traffic can vary a lot),
various different traffic sources (unless you want to personalize the experience for a dedicated source),
your blog post and newsletter publishing schedule,
people who visited your site, thought about it, and then came back 10 days later to buy it,
any external event that might affect purchasing (e.g. payday)
Another (very important) note: be careful with low sample size. The internet is full of case
studies steeped in shitty math, and most of it (if they even release full numbers) is because
they judged a test on like 100 visitors per variation and 12 vs 22 conversions.
If you've set everything up correctly so far, then you'll just want to avoid peeking (or letting your
boss peek) at test results multiple times before the test is finished. This can result in calling a
result early due to "spotting a trend" (impossible). What you'll find is that many test results
regress to the mean.
Regression to the Mean
Often, you'll see results vary wildly in the first few days of the test. Sure enough, they tend to
converge as the test continues for the next few weeks. Here's an example Peep gave in an
older blog post of an eCommerce client:
[Image: test results per variation over the course of the test]
Here's what we're looking at:
First couple of days, blue (variation #3) is winning big, like $16 per visitor vs $12.5 for Control. Lots of people would end the test here. (Fail.)
After 7 days: blue still winning, and the relative difference is big.
After 14 days: orange (#4) is winning!
After 21 days: orange still winning!
End: no difference.
So if you'd called the test at less than four weeks, you would have made an erroneous
conclusion.
Something related, that the internet always gets confused on, is called the novelty effect. That's
when the novelty of your changes (bigger blue button) brings more attention to the variation.
With time, the lift disappears because the change is no longer novel.
All of this stuff is some of the more complex A/B testing information. We have a bunch of blog
posts devoted to the various topics covered above. Dive in if you'd like to learn more:
Stopping A/B Tests: How Many Conversions Do I Need?
Can You Run Multiple A/B Tests Simultaneously?
You want to speed up your testing program and run more tests. High-tempo testing, yeah? So a
common question is: can you run more than one A/B test at the same time on your site?
Will this increase your growth potential, or will it pollute the data because each test interacts
with the other?
Look, this is a complicated issue. Some experts say you shouldn't do multiple tests
simultaneously, and some say it's fine.
In most cases you will be fine running multiple simultaneous tests, and extreme interactions are
unlikely. Unless you're testing really important stuff (e.g. something that impacts your business
model, future of the company), the benefits of testing volume will most likely outweigh the noise
in your data and occasional false positives.
If, based on your assessment, there's a high risk of interaction between multiple tests, reduce the
number of simultaneous tests and/or let the tests run longer for improved accuracy.
If you want to read more on this, read these posts:
AB Testing: When Tests Collide
Can You Run Multiple A/B Tests at the Same Time?
Analyzing Your A/B Testing Results
Alright. You've done your research, set up your test, and the test is finally cooked. Now, on to
analysis, and it's not always as simple as glancing at the graph your testing tool gives you.
One thing you should always do is analyze your test results in Google Analytics.
It doesn't just enhance your analysis capabilities; it also allows you to be more confident in your
data and decision making.
The point is, it's possible that your testing tool could be recording the data incorrectly, and if you
have no other source for your test data, you can never be sure whether to trust it or not. Create
multiple sources of data (won't go too far into detail, but read this post for how to set it all up).
But what happens if, after analyzing the results in GA, there is no difference at all between
variations?
Don't move on too quickly. First, realize these two things:
1. Your test hypothesis might have been right, but the implementation sucked.
Let's say your qualitative research says that concern about security is an issue. How many ways
do we have to beef up the perception of security? Unlimited.
The name of the game is iterative testing, so if you were on to something, then try a few
iterations that attempt to solve the problem.
2. Just because there was no difference overall doesn't mean the treatment didn't beat control in a segment or two.
If you got a lift in returning visitors and mobile visitors, but a drop for new visitors and desktop
users, those segments might cancel each other out, and it seems like it's a case of "no
difference."
Analyze your test across key segments to see this.
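Here's a tiny worked example of segments canceling each other out. The numbers are invented so that the overall result is exactly flat while each segment moves in an opposite direction.

```python
def relative_lift(control, treatment):
    """Relative change in conversion rate; each argument is (conversions, visitors)."""
    control_rate = control[0] / control[1]
    treatment_rate = treatment[0] / treatment[1]
    return (treatment_rate - control_rate) / control_rate


# Hypothetical results, split by segment.
segments = {
    "returning visitors": {"control": (100, 2000), "treatment": (130, 2000)},
    "new visitors": {"control": (130, 2000), "treatment": (100, 2000)},
}


def totals(arm):
    """Sum conversions and visitors for one arm across all segments."""
    conversions = sum(seg[arm][0] for seg in segments.values())
    visitors = sum(seg[arm][1] for seg in segments.values())
    return conversions, visitors


overall_lift = relative_lift(totals("control"), totals("treatment"))
segment_lifts = {name: relative_lift(seg["control"], seg["treatment"])
                 for name, seg in segments.items()}

print(f"overall: {overall_lift:+.1%}")
for name, lift in segment_lifts.items():
    print(f"{name}: {lift:+.1%}")
```

A segment-level difference like this still needs its own adequate sample size before you act on it.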
All About Data Segmentation
The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall
results, B might beat A in certain segments (organic, Facebook, mobile, etc).
There are a ton of segments you can analyze. Optimizely lists the following possibilities:
Browser type
Source type
Mobile vs. desktop, or by device
Logged-in vs. logged-out visitors
PPC/SEM campaign
Geographical regions (City, State/Province, Country)
New vs. returning visitors
New vs. repeat purchasers
Power users vs. casual visitors
Men vs. women
Age range
New vs. already-submitted leads
Plan types or loyalty program levels
Current, prospective, and former subscribers
Roles (if your site has, for instance, both a Buyer and Seller role)
But definitely look at your test results at least across these segments (making sure of adequate
sample size):
Desktop vs Tablet/Mobile
New vs Returning
Traffic that lands directly on the page you're testing vs came via internal link
For segments, the same stopping rules apply.
Make sure that you have enough sample size within the segment itself as well (calculate it in
advance, be wary if it's less than 250-350 conversions PER variation within that one segment
you're looking at).
If your treatment performed well for a specific segment, it's time to consider a personalized
approach for that particular segment.
All Your Stats Questions Answered, Stat
There's a certain level of statistical knowledge that comes in handy when analyzing A/B test
results. Some of it we went over in the above section on setting up A/B tests, but there is still
more to be covered when it comes to analysis.
Why do you need to know all of this statistics stuff? We're dealing with inference here (means
and probability) and therefore cannot go without some basic understanding of stats.
Or as Matt Gershoff put it (quoting his college math professor), "how can you make cheese if
you don't know where milk comes from?!"
There are three terms you should know before we dive into the nitty-gritty of A/B testing
statistics:
1. Mean (we're not measuring all conversion rates, just a sample, and finding an average of them that is representative of the whole)
2. Variance (what is the natural variability of a population? That will affect our results and how we take action with them)
3. Sampling (again, we can't measure true conversion rate, so we select a sample that is hopefully representative of the whole)
What The Hell Is a P-Value?
There's a large number of bloggers writing about conversion optimization who are using the term
"statistical significance" inaccurately.
We talked a bit above about how statistical significance by itself is not a stopping rule, so what
is it and why is it important?
To start with, let's go over p-values, which are also very misunderstood. As FiveThirtyEight
recently pointed out, even scientists can't easily explain what p-values are.
A p-value is basically a measure of evidence against the null hypothesis (in A/B testing
parlance, the assumption that the variation performs no differently from the control).
Very important: the p-value does not tell us the probability that B is better than A.
Similarly, it doesn't tell us the probability that we will make a mistake in selecting B over A.
These are both extraordinarily common misconceptions, but they are false.
The p-value is just the probability of seeing a result at least as extreme as the one observed,
given that the null hypothesis is true. Or, "How surprising is that result?"
So to sum it up, statistical significance (or a statistically significant result) is attained when a
p-value is less than the significance level (which is usually set at .05). By the way, significance
in regards to statistical hypothesis testing is where the whole one-tail vs two-tail issue comes up.
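For the curious, here's a minimal sketch of how a p-value for an A/B test can be computed, using a pooled two-proportion z-test. That choice of test is an assumption, and your tool may use something else. The usage example plugs in the small-sample scenario mentioned earlier in this guide: 100 visitors per variation with 12 vs 22 conversions.

```python
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Probability of a difference at least this extreme if A and B truly convert the same."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_significant(p_value, significance_level=0.05):
    """Significant when the p-value falls below the chosen significance level."""
    return p_value < significance_level


p = two_sided_p_value(12, 100, 22, 100)
print(round(p, 4), is_significant(p))
```

Note what the function does not return: the probability that B is better than A.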
Confidence Intervals and Margin of Error
In A/B testing, we use confidence intervals to mitigate the risk of sampling errors. In that sense,
we're managing the risk associated with implementing a new variation. So if your tool says
something like, "We are 95% confident that the conversion rate is X% +/- Y%," then you need to
account for the +/- Y% as the margin of error.
How confident you are in your results depends largely on how large the margin of error is. As a
rule of thumb, if the 2 conversion ranges overlap, you'll need to keep testing in order to get a
valid result.
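Here's a rough sketch of that "X% +/- Y%" calculation and the overlap rule of thumb. It uses the simple normal-approximation interval, which is an assumption (tools may use other interval methods), and the visitor counts are invented.

```python
from statistics import NormalDist


def confidence_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation interval for a conversion rate: (low, high)."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z * (rate * (1 - rate) / visitors) ** 0.5
    return rate - margin_of_error, rate + margin_of_error


def intervals_overlap(a, b):
    """True when two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical results: 10,000 visitors per variation.
control = confidence_interval(300, 10_000)    # 3.0% observed
variation = confidence_interval(330, 10_000)  # 3.3% observed
print(control, variation)
print("keep testing" if intervals_overlap(control, variation) else "ranges are separated")
```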
Matt Gershoff gave a great illustration of how margin of error works:
"Say your buddy is coming to visit you from Round Rock and is taking TX-1 at 5 pm. She wants
to know how long it should take her. You say, 'I have a 95% confidence that it will take you about
60 minutes plus or minus 20 minutes.' So your margin of error is 20 minutes, or 33%.
If she is coming at 11 am you might say, 'It will take you 40 min, plus or minus 10 min,' so the
margin of error is 10 minutes, or 25%. So while both are at the 95% confidence level, the margin
of error is different."
External Validity Threats
There's a challenge with running A/B tests: the data is non-stationary.
A stationary time series is one whose statistical properties (mean, variance,
autocorrelation, etc.) are constant over time. For many reasons, website data is non-stationary,
which means we can't make the same assumptions as with stationary data. Here are a few
reasons data might fluctuate:
Season
Day of the week
Holidays
Press (positive or negative)
Other Marketing Campaigns
PPC/SEM
SEO
Word of Mouth
So seasonality and the other factors above are one source of external validity threat.
Others include sample pollution, the flicker effect, revenue tracking errors, selection bias, and
more (read here). These are all things to keep in mind in planning and analyzing your A/B tests.
Archiving Test Results For Future Learning
A/B testing isn't just about lifts, wins, losses, and testing random shit. As Matt Gershoff said,
optimization is about gathering information to inform decisions, and the learnings from
statistically valid A/B test results contribute to the greater goals of growth and optimization.
Smart organizations archive their test results and plan their approach to testing systematically.
There's a reason organizations with a structured approach to optimization have greater growth
and are limited less often by local maxima.
So here's the tough part: there's no single best way to structure your knowledge management.
We wrote an article on how effective organizations archive their results (read it), and as it turns
out, many of them do it slightly differently. Some use sophisticated internally built tools, some
use 3rd-party tools, and some use good ol' Excel and Trello.
If it helps, here are 4 tools built specifically for conversion optimization project management:
Iridion
Effective Experiments
GrowthHackers Canvas
Experiment Engine
On a similar note, in larger organizations (or hell, in smaller as well), it's important to be able to
communicate across departments and to the executives above. Often, A/B test results aren't
super intuitive to the layperson (and most people haven't read guides as long as this one). So
what helps is visualization.
This is another area where, sadly, there is no real right way to do it. That said, Annemarie
Klaassen and Ton Wesseling wrote an awesome post on our blog detailing their journey to great
visualizations. Sneak peek, here's what they ended up with:
[Image: Klaassen and Wesseling's A/B test result visualization]
A/B Testing Tools and Resources
Littered throughout this guide are tons of links to external resources: articles, tools, books, etc.
To make it convenient for you, though, here are some of the best (divided by categories).
A/B Test Tools
Optimizely
VWO
Adobe Target
Maxymiser
Conductrics
53 Conversion Optimization Tools Reviewed By Experts
A/B Testing Calculators
A/B Split Test Significance Calculator by VWO
A/B Split and Multivariate Test Duration Calculator
Evan Miller's Sample Size Calculator
Evan Miller's Whole Suite of A/B Testing Tools
A/B Testing Statistics Resources
Ignorant No More: Crash Course on A/B Testing Statistics
Statistical Analysis and A/B Testing
Understanding A/B testing statistics to get REAL Lift in Conversions
One-Tailed vs Two-Tailed Tests (Does It Matter?)
Bayesian vs Frequentist A/B Testing: What's the Difference?
Sample Pollution
Science Isn't Broken
A/B Testing/CRO Strategy Resources
eCommerce A/B Test Data for Improved Process: What Percentage of Tests Are Winners?
WiderFunnel's LIFT Model
3 Frameworks To Help Prioritize & Conduct Your Conversion Testing
What you have to know about conversion optimization
Our Conversion Optimization Guide
Small Business Big Money Online: A Proven System to Optimize eCommerce Websites and Increase Internet Profits (book)
Conclusion
A/B testing is an invaluable resource to anyone making decisions in an online environment. With
a little bit of knowledge and a lot of diligence, you can mitigate many of the risks that most
beginning optimizers face due to errors.
This isn't a complete or an ultimate guide, but it is a damn good start. If you really dig into the
information here, you'll be ahead of 90% of people running tests. If you believe in the power of
A/B testing for continued revenue growth, then that's a fantastic place to be.
Knowledge is a limiting factor that only experience and iterative learning can bust through,
though. So get testing :)
