You can find the whole post online at:
http://conversionxl.com/abtestingguide
The Smart Marketer's Guide to A/B Testing
A/B testing. You've heard so much about it in the past few years. You may have even run a test or two (or a thousand). Yet, for all the content out there about split testing, the concept is still frequently misunderstood.
In its simplest sense, A/B split testing is a new term for an old technique: controlled experimentation. When researchers are testing the efficacy of new drugs, they use a split test. In fact, most research experiments could be considered a split test, complete with a hypothesis, a control and variation, and a statistically calculated result.
The main difference, however, lies in the variability of internet traffic. In a lab, it's easier to control for external variables. Online, you can mitigate them, but it's truly difficult to operate a purely controlled test.
In addition, testing new drugs requires an almost certain degree of accuracy. Lives are on the line. In technical terms, your period of "exploration" can be much longer, as you want to be damned sure during your period of "exploitation" that you didn't commit a Type I error (false positive).
A/B split testing online is primarily a business decision. It's a weighing of risk vs reward, exploration vs exploitation, science vs business. Therefore, we view results with a different lens and make decisions slightly differently than we would for tests in a pure lab setting.
Ready to dive in? Let's start with the basics.
What Is An A/B Test?
An A/B test is a controlled online experiment in which 50% of your traffic is sent to the original page and 50% is sent to a variation.
That's it. In its simplest sense, an A/B test is a 50/50 traffic split that seeks to validate changes you've made to a page by testing similar traffic during the same time period.
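To make the 50/50 split concrete, here's a minimal sketch of how a visitor could be assigned to a page. It's an illustration only, not how any particular testing tool does it: the function name `assign_variation`, the experiment name, and the visitor IDs are all made up for this example. Hashing the visitor ID means the same visitor keeps seeing the same version on every visit.

```python
import hashlib


def assign_variation(visitor_id, experiment, variations=("original", "variation")):
    """Deterministically map a visitor to one variation with equal odds."""
    # Hash experiment + visitor so different experiments split differently.
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variations)
    return variations[bucket]


if __name__ == "__main__":
    counts = {"original": 0, "variation": 0}
    for i in range(10_000):
        counts[assign_variation(f"visitor-{i}", "homepage-headline")] += 1
    print(counts)  # the two counts should come out roughly equal
```

Passing four names in `variations` would give an equal-split A/B/C/D test.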
You can, of course, create more than two variations. This is broadly known as an A/B/n test: if you have the traffic that allows it, you can test as many variations as you'd like. Here's an example of an A/B/C/D test, and how much traffic each variation is allocated:
[Image: traffic allocation in an A/B/C/D test]
Now, at this point, you may be wondering what the difference is between an A/B/C/D test and a multivariate test (MVT). MVT isn't the only other type of experiment, either. There are also bandit algorithms that solve similar problems as A/B/n tests, just in a different way.
A/B Testing, Multivariate, and Bandit Algorithms: What's the Difference?
Multivariate Tests
A/B/n tests are controlled experiments run on 1 or more variations + the original page that directly compare conversion rate means based on the changes made between variations.
While it sounds similar, multivariate tests are controlled experiments that test multiple versions of a page and attempt to isolate which attributes cause the largest impact. In other words, multivariate tests are like A/B/n tests in that they test an original against variations, but each variation contains different design elements. For example:
[Image: example of a multivariate test]
If you have enough traffic, you should use both types of tests to maximize the output of your optimization program. Each one has a different and specific impact and use case, and used together, they can help you get the most out of your site. Here's how:
Use A/B testing to determine the best layouts.
Use MVT to polish the layouts to make sure all the elements interact with each other in the best possible way.
As I said before, you need to get a ton of traffic to the page you're testing before even considering MVT.
However, make sure your priorities align with your testing program. Peep once said, "most top agencies that I've talked to about this run ~10 A/B tests for every 1 MVT."
Bandit Testing
As for bandit algorithms, you can almost think of them as A/B/n tests that update in real time based on the performance of each variation.
In essence, a bandit algorithm starts by sending traffic to two (or more) pages: the original and the variation(s). Then, in an attempt to "pull the winning slot machine arm most often," the algorithm updates based on whether or not a variation is "winning." Eventually, the algorithm fully exploits the best option:
[Image] (Image Source)
One of the big benefits of bandit testing is that bandits mitigate "regret," which is basically the lost conversion you experience while exploring a potentially worse variation in a test. This chart from Google explains that very well:
[Image: chart from Google] (Image Source)
By the way, try not to think of bandits and A/B/n tests as a "this or that" scenario; they're tools that each have their purposes. In general, bandits are great for:
Headlines and Short-Term Campaigns
Automation for Scale
Targeting
Blending Optimization with Attribution
Read this article for more information on bandit algorithms.
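If the idea of an algorithm that "updates based on whether a variation is winning" feels abstract, here's a rough sketch using epsilon-greedy, one of the simplest bandit strategies (real tools may well use something else, such as Thompson sampling). The conversion rates, visitor count, and `epsilon` value are invented for the simulation. Note that epsilon-greedy keeps exploring with a fixed share of traffic, so it leans heavily on the best option rather than exploiting it 100%.

```python
import random


def run_epsilon_greedy(true_rates, visitors=20_000, epsilon=0.1, seed=42):
    """Simulate an epsilon-greedy bandit over pages with hidden conversion rates."""
    rng = random.Random(seed)
    shows = [0] * len(true_rates)        # visitors sent to each page
    conversions = [0] * len(true_rates)  # conversions observed on each page
    for _ in range(visitors):
        if rng.random() < epsilon or 0 in shows:
            arm = rng.randrange(len(true_rates))  # explore: pick a page at random
        else:
            observed = [c / s for c, s in zip(conversions, shows)]
            arm = observed.index(max(observed))   # exploit: current best page
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return shows, conversions


if __name__ == "__main__":
    shows, conversions = run_epsilon_greedy([0.05, 0.15])
    print("traffic per page:", shows)
    print("conversions per page:", conversions)
```

The traffic sent to the weaker page while the algorithm is still learning is the "regret" described above.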
Two other minor issues with A/B testing:
One-tail or two-tail test?
Bayesian or Frequentist stats?
One-Tail vs Two-Tail A/B Tests
I promise you, this is a much smaller issue than some people think. Therefore, we'll brush over it quickly.
One-tailed tests allow for the possibility of an effect in just one direction, whereas with two-tailed tests, you are testing for the possibility of an effect in two directions, both positive and negative.
No need to get very worked up about this.
Matt Gershoff from Conductrics summed it up really well:
"If your testing software only does one type or the other, don't sweat it. It is super simple to convert one type to the other (but you need to do this BEFORE you run the test) since all of the math is exactly the same in both tests. All that is different is the significance threshold level. If your software uses a one-tail test, just divide the p-value associated with the confidence level you are looking to run the test by 2. So if you want your two-tail test to be at the 95% confidence level, then you would actually input a confidence level of 97.5%, or if at a 99%, then you need to input 99.5%. You can then just read the test as if it was two-tailed."
Dive down the rabbit hole with our article on one-tail vs two-tail tests if you'd like.
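Gershoff's conversion is just arithmetic on the significance threshold, so it's easy to sketch. The helper names below are hypothetical, and the z-score of 1.96 is only an example value.

```python
from statistics import NormalDist


def one_tail_confidence_for(two_tail_confidence):
    """Confidence level to enter in a one-tail tool to get a two-tail result."""
    alpha_two_tail = 1.0 - two_tail_confidence
    return 1.0 - alpha_two_tail / 2.0  # halve the allowed false-positive rate


def p_values(z):
    """One-tail (upper) and two-tail p-values for the same z-score."""
    one_tail = 1.0 - NormalDist().cdf(z)
    two_tail = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return one_tail, two_tail


if __name__ == "__main__":
    print(one_tail_confidence_for(0.95))  # 95% two-tail maps to 97.5% one-tail
    print(one_tail_confidence_for(0.99))  # 99% two-tail maps to 99.5% one-tail
    print(p_values(1.96))                 # for positive z, two-tail p is double the one-tail p
```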
Bayesian or Frequentist Stats
Bayesian or Frequentist A/B testing is another hot topic for debate, especially with popular tools rebuilding their stats engines to feature a Bayesian methodology.
Here's the difference (very much simplified):
Using a Frequentist method means making predictions on underlying truths of the experiment using only data from the current experiment.
The difference is that in the Bayesian view, a probability is assigned to a hypothesis. In the Frequentist view, a hypothesis is tested without being assigned a probability.
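To see what "a probability is assigned to a hypothesis" can look like in practice, here's a small sketch of one common Bayesian approach: a Beta posterior for each conversion rate, then a simulation of how often B's rate beats A's. The uniform Beta(1, 1) prior, the function name, and the visitor counts are assumptions for illustration; the tools mentioned above may do this quite differently.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=1):
    """Estimate P(rate_B > rate_A) from Beta posteriors with a uniform prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


if __name__ == "__main__":
    # Invented data: 300/10,000 conversions for A, 360/10,000 for B.
    print(prob_b_beats_a(300, 10_000, 360, 10_000))
```

A Frequentist test on the same data would report a p-value instead, which is not a probability that B is better.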
Dr. Rob Balon, who carries a PhD in statistics and market research, says the debate is mostly "esoteric tail wagging done in the domain of the ivory tower." "In truth," he says, "most analysts out of the ivory tower don't care that much, if at all, about Bayesian vs. Frequentist."
Don't get me wrong, there are practical business implications to each methodology. For the sake of discussion, though, if you're at all new to A/B testing, there are much more important things to worry about.
If you do want to dive down the rabbit hole, though, here's an article we wrote on Bayesian vs Frequentist A/B Testing.
Strategy First: Setting up a Successful A/B Testing Plan
A structured approach to A/B testing could be your biggest area of impact. Don't listen to any blog posts that tell you "99 Things You Can A/B Test Right Now." That's a waste of time and traffic. Being a bit more process-minded will make you more money.
In a survey done by Econsultancy and RedEye, 74% of the survey respondents who reported having a structured approach to conversion also stated they had improved their sales. Those that don't have a structured approach, I'm almost sure, stay in what Craig Sullivan calls the "Trough of Disillusionment" (unless their results are littered with false positives, which we'll get into later).
So, structure. To simplify a winning process, it goes something like this:
1. Measurement
2. Prioritization
3. Experimentation
4. Repeat
Measurement: Getting Data-Driven Insights
Measurement is a crucial step in this process, because we want to know what is happening as well as why it's happening.
First things first, we'll start with the high-level strategy and move down to the granular. So think in this order:
1. Define your business objectives
2. Define your website goals
3. Define your Key Performance Indicators
4. Define your target metrics
Once you know where you want to go, you can collect the data necessary to get there. To do this, we recommend the ResearchXL Framework (we'll go over this briefly, but to really understand it, read this post).
Point is, we want to be data-driven, which means we want to collect and analyze relevant data for our goals. This means a structured approach to our conversion research. Here's the executive summary of our process:
1. Heuristic Analysis
2. Technical Analysis
3. Web Analytics Analysis
4. Mouse Tracking Analysis
5. Qualitative Surveys
6. User Testing
Heuristic analysis is about as close as we get to "best practices." The difference here in our process is that, after years of experience, you still can't tell what exactly will work, but you can more easily point out opportunity areas. As Craig Sullivan put it:
"My experience in observing and fixing things: these patterns do make me a better diagnostician, but they don't function as truths. They guide and inform my work, but they don't provide guarantees."
So when it comes to heuristic analysis, humility and a framework are important. We assess each page based on:
Relevancy
Clarity
Value
Friction
Distraction
Read about WiderFunnel's LIFT Model for a good heuristic framework.
Technical analysis is an area often overlooked and highly underrated by optimizers. Bugs, if they're around, are your main conversion killer. You think your site works perfectly, both in terms of user experience and functionality, with every browser and device? Probably not.
This is a low-hanging fruit, one that you can make a lot of money on (think 12-month perspective). So start:
Conduct cross-browser and cross-device testing
Do speed analysis
Web analytics analysis is next. First things first: make sure everything is working. You'd be surprised how many analytics setups are broken.
Google Analytics (and other analytics setups) are a course in themselves, so I'll leave you with some helpful links to read:
Google Analytics 101: How To Configure Google Analytics To Get Actionable Data
Google Analytics 102: How To Set Up Goals, Segments & Events in Google Analytics
Next is mouse tracking analysis, which includes heat maps, scroll maps, click maps, form analytics, and user session replays. One point of advice here is to not get carried away with pretty visualizations of click maps, etc. Make sure you're informing your larger goals with the analytics in this step.
Qualitative research is an important part of measurement as well, because it tells you the why that quantitative analysis misses. Many people think that qualitative analysis is "softer" or easier than quantitative, but it should be just as rigorous and can provide insights just as important as your GA data.
For qualitative research, use things like:
On-site surveys
Customer surveys
Customer interviews and focus groups
Finally we have user testing. The premise is simple: observe actual people use and interact with your website while they narrate their thought process out loud. Pay attention to what they say and experience.
After the heavy-ass conversion research, you'll have lots of data and need to do some prioritization.
Prioritizing A/B Tests
There are many frameworks to prioritize your A/B tests, and you could even innovate with your own formula. But here's how we do it. Once you go through all 6 steps, you will identify issues, some of them severe, some minor. You'll want to allocate every finding into one of these 5 buckets:
1. Test. (This bucket is where you place stuff for testing.)
2. Instrument. (This can involve fixing, adding or improving tag or event handling on the analytics configuration.)
3. Hypothesize. (This is where we've found a page, widget or process that's just not working well but we don't see a clear single solution.)
4. Just Do It (JFDI). (Here's the bucket for no-brainers. Just do it.)
5. Investigate. (If an item is in this bucket, you need to ask questions or do further digging.)
Then we rank them from 1 to 5 stars (1 = minor issue, 5 = critically important). There are 2 criteria that are more important than others when giving a score:
1. Ease of implementation (time/complexity/risk). Sometimes the data tells you to build a feature, but it takes months to do it. So it's not something you'd start with.
2. Opportunity score (subjective opinion on how big of a lift you might get).
Then create a spreadsheet with all of your data and you'll have a prioritized testing roadmap, more rigorous than most of your competitors will have.
You can also use a variety of other frameworks. A very popular one is the PIE framework. This breaks opportunity areas into three scores:
1. Potential
2. Importance
3. Ease
Read more on frameworks to prioritize A/B testing here.
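As a rough sketch of how a PIE-style roadmap could be built in code rather than a spreadsheet: the version below assumes each of the three scores is rated 1 to 10 and simply averaged, which is one common way to apply PIE. The `Idea` class and the example ideas and ratings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    potential: int   # how much improvement is possible on this page (1-10)
    importance: int  # how valuable the traffic to this page is (1-10)
    ease: int        # how easy the test is to implement (1-10)

    @property
    def pie_score(self):
        # Assumption: plain average of the three ratings.
        return (self.potential + self.importance + self.ease) / 3


def prioritize(ideas):
    """Return ideas sorted from highest to lowest PIE score."""
    return sorted(ideas, key=lambda idea: idea.pie_score, reverse=True)


if __name__ == "__main__":
    backlog = [
        Idea("Rewrite checkout headline", potential=6, importance=9, ease=8),
        Idea("Rebuild pricing page", potential=9, importance=7, ease=2),
        Idea("Tweak footer links", potential=2, importance=3, ease=9),
    ]
    for idea in prioritize(backlog):
        print(f"{idea.pie_score:.2f}  {idea.name}")
```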
Setting Up Your A/B Tests
Once you've got a prioritized list of test ideas, it's time to form a hypothesis and run an experiment. Basically, a hypothesis will define why you believe a problem occurs. Furthermore, a good hypothesis:
1. Is testable: It needs to be measurable, so that it can be used in testing.
2. Has a goal of solving conversion problems: Split testing is done to solve specific conversion problems.
3. Gains market insights: A well-articulated hypothesis will let your split testing results give you information about your customers, whether the test "wins" or "loses" or whatever.
Craig Sullivan has put together a hypothesis kit to simplify the process. Here's his simple version:
1. Because we saw (data/feedback)
2. We expect that (change) will cause (impact)
3. We'll measure this using (data metric)
And the advanced one:
1. Because we saw (qual & quant data)
2. We expect that (change) for (population) will cause (impact(s))
3. We expect to see (data metric(s) change) over a period of (x business cycles)
Technical Stuff
Here's the fun part: you finally get to pick your tool.
While this is the first thing many people think about, it's not actually the most important, by any means. The strategy and statistical knowledge aspects come first, and only then should you worry about picking a tool.
That said, there are a few differences you should bear in mind.
One major distinction in tools is whether they are server-side or client-side testing tools.
Server-side tools render code on the server level and send a randomized version of the page to the viewer with no modification on the visitor's browser. Client-side tools send the same page, but JavaScript on the client's browser manipulates the appearance on both the original and the variation.
Client-side testing tools are things like Optimizely, VWO, and Adobe Target. Conductrics has capabilities of both, and SiteSpect does a proxy server-side method.
What does all this mean for you? If you'd like to save time up front, or if your team is small or lacks development resources, client-side tools can get you up and running faster. Server-side requires development resources but can often be more robust.
So while setting up tests can be slightly different depending on which tool you use, often it will be as simple as signing up for your favorite tool and following some basic instructions, like putting a JavaScript snippet on your website.
You'll basically want to set up goals (something that lets you know a conversion has been made, like a "thank you for purchasing" page), and your testing tool will track when each variation converts visitors into customers.
Some skills that come in handy when setting up tests are HTML, CSS, and JavaScript/jQuery, as well as design and copywriting skills to draw up the variations. Sure, some tools allow use of a visual editor, but that limits your flexibility and control, so learning some technical skills is helpful.
Or you could use something like Testing.Agency to set up your tests for you.
How Long Should You Run A/B Tests?
First rule: don't stop a test just because it reaches statistical significance. This is probably the most common error committed by beginning optimizers with good intentions.
If you're calling your tests when you hit significance, you'll find that most of your lifts don't translate to increased revenue (that's the goal, after all). You'll find that the lifts were in fact imaginary.
Consider this: One thousand A/A tests (two identical pages tested against each other) were run.
771 experiments out of 1,000 reached 90% significance at some point
531 experiments out of 1,000 reached 95% significance at some point
Stopping tests at significance breeds the risk of false positives and ignores possible external validity threats like seasonality.
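You can get a feel for why this happens with a small simulation. The sketch below runs A/A tests (both pages share the same true conversion rate), checks significance after every batch of visitors, and compares how many tests were significant at some point with how many were significant at the end. It is not a reproduction of the study quoted above: the batch size, number of peeks, conversion rate, and function names are all assumptions chosen to keep the example quick to run.

```python
import random
from statistics import NormalDist


def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for two groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = (pooled * (1 - pooled) * 2 / n) ** 0.5
    return 0.0 if std_err == 0 else (conv_b / n - conv_a / n) / std_err


def simulate_aa_tests(experiments=200, peeks=40, batch=100, rate=0.10,
                      confidence=0.95, seed=7):
    """Return (share ever significant while peeking, share significant at the end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ever_significant = 0
    significant_at_end = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        crossed = False
        significant = False
        for _peek in range(peeks):
            # Both pages are identical, so any "winner" is a false positive.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = abs(z_score(conv_a, conv_b, n)) > z_crit
            crossed = crossed or significant
        ever_significant += crossed
        significant_at_end += significant
    return ever_significant / experiments, significant_at_end / experiments


if __name__ == "__main__":
    ever, at_end = simulate_aa_tests()
    print(f"significant at some point: {ever:.0%}, significant at the end: {at_end:.0%}")
```

The gap between the two numbers is the cost of stopping as soon as a test "hits significance."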
Instead, you'll want to predetermine a sample size and run the test for full weeks, usually for at least two business cycles.
How do you predetermine sample size? There are lots of great tools out there for that, including tools within your favorite testing tool. Here's how you'd calculate your sample size with Evan Miller's tool:
[Image: Evan Miller's sample size calculator]
In this case we told the tool that we have a 3% conversion rate, and want to detect at least 10% uplift. The tool tells us that we need 51,486 visitors per variation before we can look at the statistical significance levels and statistical power.
Oh, and you'll notice in addition to significance level, there's something called statistical power in the photo above as well.
Statistical power is another important factor in running your A/B test, as it attempts to avoid Type II errors (false negatives). In other words, it makes it more likely that you detect an effect if there actually is one.
For practical purposes, know that 80% power is the standard for testing tools. To reach such a level, you need either a large sample size, a large effect size, or a longer duration test.
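If you'd like to see roughly where a number like the one above comes from, here's a sketch of a standard two-proportion sample size formula. It assumes a two-sided 5% significance level and 80% power (neither setting is stated in the text), and it treats the 10% uplift as relative to the 3% baseline. Under those assumptions it lands at about 51,486 visitors per variation. The function name is hypothetical, and other calculators may use slightly different formulas.

```python
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift."""
    delta = baseline * relative_lift                # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    sd_null = (2 * baseline * (1 - baseline)) ** 0.5
    sd_alt = (baseline * (1 - baseline)
              + (baseline + delta) * (1 - baseline - delta)) ** 0.5
    return ((z_alpha * sd_null + z_power * sd_alt) / delta) ** 2


if __name__ == "__main__":
    n = sample_size_per_variation(baseline=0.03, relative_lift=0.10)
    print(round(n))
```

Try halving `relative_lift`: the required sample roughly quadruples, which is why small effects need so much traffic.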
There Are No Magic Numbers
You'll read a lot of blog posts that have magic numbers like "100 conversions" or "1,000 visitors" as their stopping points. Math is not magic, math is math, and what we're dealing with is slightly more complex than simplistic heuristics like that.
Andrew Anderson from Malwarebytes put it well:
"It is never about how many conversions, it is about having enough data to validate based on representative samples and representative behavior.
100 conversions is possible in only the most remote cases and with an incredibly high delta in behavior, but only if other requirements like behavior over time, consistency, and normal distribution take place. Even then it has a really high chance of a type I error, false positive."
What we're worried about is the representativeness of our sample. How can we ensure that in basic terms? Your test should run for 1, or better yet 2, business cycles, so it includes everything that goes on:
every day of the week (and tested one week at a time, as your daily traffic can vary a lot),
various different traffic sources (unless you want to personalize the experience for a dedicated source),
your blog post and newsletter publishing schedule,
people who visited your site, thought about it, and then came back 10 days later to buy it,
any external event that might affect purchasing (e.g. payday)
Another (very important) note: be careful with low sample size. The internet is full of case studies steeped in shitty math, and most of it (if they even release full numbers) is because they judged a test on like 100 visitors per variation and 12 vs 22 conversions.
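To put numbers on that last example, here's a sketch of a plain two-proportion z-test (two-tailed, using the normal approximation, which is itself shaky at samples this small). The function name is made up. With 12 vs 22 conversions on 100 visitors each, the variation shows an 83% relative lift, yet the p-value comes out around 0.06, so it doesn't even clear the usual 95% bar.

```python
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # 100 visitors per variation, 12 vs 22 conversions.
    print(two_proportion_p_value(12, 100, 22, 100))
```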
If you've set everything up correctly so far, then you'll just want to avoid peeking (or letting your boss peek) at test results multiple times before the test is finished. This can result in calling a result early due to "spotting a trend" (impossible). What you'll find is that many test results regress to the mean.
Regression to the Mean
Often, you'll see results vary wildly in the first few days of the test. Sure enough, they tend to converge as the test continues for the next few weeks. Here's an example Peep gave in an older blog post of an eCommerce client:
[Image: chart of test results over time]
Here's what we're looking at:
First couple of days, blue (variation #3) is winning big, like $16 per visitor vs $12.5 for Control. Lots of people would end the test here. (Fail.)
After 7 days: blue still winning, and the relative difference is big.
After 14 days: orange (#4) is winning!
After 21 days: orange still winning!
End: no difference
So if you'd called the test at less than four weeks, you would have made an erroneous conclusion.
Something related, that the internet always gets confused on, is called the novelty effect. That's when the novelty of your changes (bigger blue button) brings more attention to the variation. With time, the lift disappears because the change is no longer novel.
All of this stuff is some of the more complex A/B testing information. We have a bunch of blog posts devoted to the various topics covered above. Dive in if you'd like to learn more:
Stopping A/B Tests: How Many Conversions Do I Need?
Can You Run Multiple A/B Tests Simultaneously?
You want to speed up your testing program and run more tests. High-tempo testing, yeah? So a common question is: can you run more than one A/B test at the same time on your site?
Will this increase your growth potential, or will it pollute the data because each test interacts with the other?
Look, this is a complicated issue. Some experts say you shouldn't do multiple tests simultaneously, and some say it's fine.
In most cases you will be fine running multiple simultaneous tests, and extreme interactions are unlikely. Unless you're testing really important stuff (e.g. something that impacts your business model, future of the company), the benefits of testing volume will most likely outweigh the noise in your data and occasional false positives.
If based on your assessment there's a high risk of interaction between multiple tests, reduce the number of simultaneous tests and/or let the tests run longer for improved accuracy.
If you want to read more on this, read these posts:
AB Testing: When Tests Collide
Can You Run Multiple A/B Tests at the Same Time?
Analyzing Your A/B Testing Results
Alright. You've done your research, set up your test, and the test is finally cooked. Now, on to analysis, and it's not always as simple as glancing at the graph your testing tool gives you.
[Image] (Image Source)
One thing you should always do is analyze your test results in Google Analytics. It doesn't just enhance your analysis capabilities; it also allows you to be more confident in your data and decision making.
The point is, it's possible that your testing tool could be recording the data incorrectly, and if you have no other source for your test data, you can never be sure whether to trust it or not. Create multiple sources of data (won't go too far into detail, but read this post for how to set it all up).
But what happens if, after analyzing the results in GA, there is no difference at all between variations?
Don't move on too quickly. First, realize these two things:
1. Your test hypothesis might have been right, but the implementation sucked.
Let's say your qualitative research says that concern about security is an issue. How many ways do we have to beef up the perception of security? Unlimited.
The name of the game is iterative testing, so if you were on to something, then try a few iterations that attempt to solve the problem.
2. Even if there was no difference overall, the treatment might have beaten control in a segment or two.
If you got a lift in returning visitors and mobile visitors, but a drop for new visitors and desktop users, those segments might cancel each other out, and it seems like it's a case of "no difference." Analyze your test across key segments to see this.
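Here's a tiny sketch of how segments can cancel each other out. Every number in it is invented purely to show the effect, and the function names are hypothetical; in practice you'd pull these counts from your analytics tool.

```python
def rate(conversions, visitors):
    return conversions / visitors


def relative_lift(control, treatment):
    """Relative lift of treatment over control; each is (conversions, visitors)."""
    return rate(*treatment) / rate(*control) - 1


def overall(segments, arm):
    """Sum conversions and visitors for one arm across all segments."""
    conversions = sum(segment[arm][0] for segment in segments.values())
    visitors = sum(segment[arm][1] for segment in segments.values())
    return conversions, visitors


# Invented example data: (conversions, visitors) per arm, per segment.
segments = {
    "returning": {"control": (250, 5000), "treatment": (300, 5000)},
    "new": {"control": (250, 5000), "treatment": (200, 5000)},
}

if __name__ == "__main__":
    total = relative_lift(overall(segments, "control"), overall(segments, "treatment"))
    print(f"overall lift: {total:+.0%}")
    for name, segment in segments.items():
        lift = relative_lift(segment["control"], segment["treatment"])
        print(f"{name} lift: {lift:+.0%}")
```

Overall the two arms look identical, while returning visitors are up and new visitors are down by the same amount.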
All About Data Segmentation
The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall results, B might beat A in certain segments (organic, Facebook, mobile, etc).
There are a ton of segments you can analyze. Optimizely lists the following possibilities:
Browser type
Source type
Mobile vs. desktop, or by device
Logged-in vs. logged-out visitors
PPC/SEM campaign
Geographical regions (City, State/Province, Country)
New vs. returning visitors
New vs. repeat purchasers
Power users vs. casual visitors
Men vs. women
Age range
New vs. already-submitted leads
Plan types or loyalty program levels
Current, prospective, and former subscribers
Roles (if your site has, for instance, both a Buyer and Seller role)
But definitely look at your test results at least across these segments (making sure of adequate sample size):
Desktop vs Tablet/Mobile
New vs Returning
Traffic that lands directly on the page you're testing vs came via internal link
For segments, the same stopping rules apply.
Make sure that you have enough sample size within the segment itself as well (calculate it in advance, be wary if it's less than 250-350 conversions PER variation within that one segment you're looking at).
If your treatment performed well for a specific segment, it's time to consider a personalized approach for that particular segment.
All Your Stats Questions Answered, Stat
There's a certain level of statistical knowledge that comes in handy when analyzing A/B test results. Some of it we went over in the above section on setting up A/B tests, but there is still more to be covered when it comes to analysis.
Why do you need to know all of this statistics stuff? We're dealing with inference here (means and probability) and therefore cannot go without some basic understanding of stats.
Or as Matt Gershoff put it (quoting his college math professor), "how can you make cheese if you don't know where milk comes from?!"
There are three terms you should know before we dive into the nitty gritty of A/B testing statistics:
1. Mean (we're not measuring all conversion rates, just a sample, and finding an average of them that is representative of the whole)
2. Variance (what is the natural variability of a population? That will affect our results and how we take action with them)
3. Sampling (again, we can't measure true conversion rate, so we select a sample that is hopefully representative of the whole)
What The Hell Is a P-Value?
There's a large number of bloggers writing about conversion optimization who are using the term "statistical significance" inaccurately.
We talked a bit above about how statistical significance by itself is not a stopping rule, so what is it and why is it important?
To start with, let's go over P-Values, which are also very misunderstood. As FiveThirtyEight recently pointed out, even scientists can't easily explain what P-Values are.
P-Value is basically a measure of evidence against the null hypothesis (in A/B testing parlance, the assumption that the variation performs no differently from the control).
Very important: P-value does not tell us the probability that B is better than A.
Similarly, it doesn't tell us the probability that we will make a mistake in selecting B over A. These are both extraordinarily common misconceptions, but they are false.
The p-value is just the probability of seeing a result at least as extreme as the one observed, given that the null hypothesis is true. Or, "How surprising is that result?"
So to sum it up, statistical significance (or a statistically significant result) is attained when a p-value is less than the significance level (which is usually set at .05). By the way, significance in regards to statistical hypothesis testing is where the whole "one-tail vs two-tail" issue comes up.
Confidence Intervals and Margin of Error
In A/B testing, we use confidence intervals to mitigate the risk of sampling errors. In that sense, we're managing the risk associated with implementing a new variation. So if your tool says something like, "We are 95% confident that the conversion rate is X% +/- Y%," then you need to account for the +/- Y% as the margin of error.
[Image] (Image Source)
How confident you are in your results depends largely on how large the margin of error is. As a rule of thumb, if the 2 conversion ranges overlap, you'll need to keep testing in order to get a valid result.
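The "X% +/- Y%" statement and the overlap rule of thumb can be sketched like this. The interval below uses the simple normal approximation, which is an assumption on my part (your testing tool may compute its ranges differently), and the visitor and conversion counts are made up.

```python
from statistics import NormalDist


def conversion_interval(conversions, visitors, confidence=0.95):
    """Return (low, high) for the conversion rate using a normal approximation."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * (rate * (1 - rate) / visitors) ** 0.5  # the "+/- Y%" part
    return rate - margin, rate + margin


def intervals_overlap(a, b):
    """True if two (low, high) ranges share any values."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    control = conversion_interval(300, 10_000)
    variation = conversion_interval(360, 10_000)
    print("control:", control)
    print("variation:", variation)
    print("ranges overlap, keep testing:", intervals_overlap(control, variation))
```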
Matt Gershoff gave a great illustration of how margin of error works:
"Say your buddy is coming to visit you from Round Rock and is taking TX-1 at 5 pm. She wants to know how long it should take her. You say, 'I have a 95% confidence that it will take you about 60 minutes plus or minus 20 minutes.' So your margin of error is 20 minutes, or 33%.
If she is coming at 11 am you might say, 'It will take you 40 min, plus or minus 10 min,' so the margin of error is 10 minutes, or 25%. So while both are at the 95% confidence level, the margin of error is different."
External Validity Threats
There's a challenge with running A/B tests: the data is non-stationary.
A stationary time series is one whose statistical properties (mean, variance, autocorrelation, etc) are constant over time. For many reasons, website data is non-stationary, which means we can't make the same assumptions as with stationary data. Here are a few reasons data might fluctuate:
Season
Day of the week
Holidays
Press (positive or negative)
Other Marketing Campaigns
PPC/SEM
SEO
Word of Mouth
So seasonality and the other factors above are one source of external validity threat. Others include sample pollution, the flicker effect, revenue tracking errors, selection bias, and more (read here). These are all things to keep in mind in planning and analyzing your A/B tests.
Archiving Test Results For Future Learning
A/B testing isn't just about lifts, wins, losses, and testing random shit. As Matt Gershoff said, optimization is about "gathering information to inform decisions," and the learnings from statistically valid A/B test results contribute to the greater goals of growth and optimization.
Smart organizations archive their test results and plan their approach to testing systematically. There's a reason organizations with a structured approach to optimization have greater growth and are limited less often by local maxima.
[Image] (Image Source)
So here's the tough part: there's no single best way to structure your knowledge management. We wrote an article on how effective organizations archive their results (read it), and as it turns out, many of them do it slightly differently. Some use sophisticated internally built tools, some use 3rd party tools, and some use good ol' Excel and Trello.
If it helps, here are 4 tools built specifically for conversion optimization project management:
Iridion
Effective Experiments
GrowthHackers Canvas
Experiment Engine
On a similar note, in larger organizations (or hell, in smaller as well), it's important to be able to communicate across departments and to the executives above. Often, A/B test results aren't super intuitive to the layperson (and most people haven't read guides as long as this one). So what helps is visualization.
This is another area where, sadly, there is no real right way to do it. That said, Annemarie Klaassen and Ton Wesseling wrote an awesome post on our blog detailing their journey to great visualizations. Sneak peek, here's what they ended up with:
[Image: Klaassen and Wesseling's test result visualization]
A/B Testing Tools and Resources
Littered throughout this guide are tons of links to external resources: articles, tools, books, etc. To make it convenient for you, though, here are some of the best (divided by categories).
A/B Test Tools
Optimizely
VWO
Adobe Target
Maxymiser
Conductrics
53 Conversion Optimization Tools Reviewed By Experts
A/B Testing Calculators
A/B Split Test Significance Calculator by VWO
A/B Split and Multivariate Test Duration Calculator
Evan Miller's Sample Size Calculator
Evan Miller's Whole Suite of A/B Testing Tools
A/B Testing Statistics Resources
Ignorant No More: Crash Course on A/B Testing Statistics
Statistical Analysis and A/B Testing
Understanding A/B testing statistics to get REAL Lift in Conversions
One-Tailed vs Two-Tailed Tests (Does It Matter?)
Bayesian vs Frequentist A/B Testing: What's the Difference?
Sample Pollution
Science Isn't Broken
A/B Testing/CRO Strategy Resources
eCommerce A/B Test Data for Improved Process: What Percentage of Tests Are Winners?
WiderFunnel's LIFT Model
3 Frameworks To Help Prioritize & Conduct Your Conversion Testing
What you have to know about conversion optimization
Our Conversion Optimization Guide
Small Business Big Money Online: A Proven System to Optimize eCommerce Websites and Increase Internet Profits (book)
Conclusion
A/B testing is an invaluable resource to anyone making decisions in an online environment. With a little bit of knowledge and a lot of diligence, you can mitigate many of the risks that most beginning optimizers face due to errors.
This isn't a complete or an ultimate guide, but it is a damn good start. If you really dig into the information here, you'll be ahead of 90% of people running tests. If you believe in the power of A/B testing for continued revenue growth, then that's a fantastic place to be.
Knowledge is a limiting factor that only experience and iterative learning can bust through, though. So get testing :)