
Thanks for downloading this PDF!

You can find the whole post online at:
http://conversionxl.com/abtestingguide

The Smart Marketer's Guide to A/B Testing

A/B testing. You've heard so much about it in the past few years. You may have even run a test
or two (or a thousand). Yet, for all the content out there about split testing, the concept is still
frequently misunderstood.

In its simplest sense, A/B split testing is a new term for an old technique: controlled
experimentation. When researchers are testing the efficacy of new drugs, they use a split test.
In fact, most research experiments could be considered a split test, complete with a
hypothesis, a control and variation, and a statistically calculated result.

The main difference, however, lies in the variability of internet traffic. In a lab, it's easier to
control for external variables. Online, you can mitigate them, but it's truly difficult to operate a
purely controlled test.

In addition, testing new drugs requires an almost certain degree of accuracy. Lives are on the
line. In technical terms, your period of exploration can be much longer, as you want to be
damned sure during your period of exploitation that you didn't commit a type I error (false
positive).

A/B split testing online is primarily a business decision. It's a weighing of risk vs reward,
exploration vs exploitation, science vs business. Therefore, we view results with a different lens
and make decisions slightly differently than we would for tests in a pure lab setting.

Ready to dive in? Let's start with the basics.

What Is An A/B Test?

An A/B test is a controlled online experiment in which 50% of your traffic is sent to the original
page and 50% is sent to a variation.

That's it. In its simplest sense, an A/B test is a 50/50 traffic split that seeks to validate changes
you've made to a page by testing similar traffic during the same time period.
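
If you're curious what a 50/50 split can look like under the hood, here's a minimal sketch. It assumes each visitor has some stable ID (a cookie value, for example) and hashes it into a bucket, so the same visitor always sees the same version. The function name, experiment name, and visitor IDs are made up for illustration; your testing tool will have its own way of doing this.

```python
import hashlib

def assign_variation(visitor_id, experiment, variations=("control", "variation")):
    """Deterministically map a visitor to one of the variations.

    Hashing the experiment name together with the visitor ID keeps the
    assignment stable across visits and roughly even across visitors.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variations)
    return variations[bucket]

# Hypothetical traffic: 10,000 visitors with made-up IDs.
counts = {"control": 0, "variation": 0}
for i in range(10_000):
    counts[assign_variation(f"visitor-{i}", "homepage-headline")] += 1

print(counts)  # the two counts should land close to 5,000 each
```

Passing a longer tuple of variations gives you an even A/B/n split the same way.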

You can, of course, create more than two variations. This is broadly known as an A/B/n test: if
you have the traffic that allows it, you can test as many variations as you'd like. Here's an example
of an A/B/C/D test, and how much traffic each variation is allocated:

Now, at this point, you may be wondering what the difference is between an A/B/C/D test and a
multivariate test (MVT). MVT isn't the only other type of experiment, either. There are also
bandit algorithms that solve similar problems to A/B/n tests, just in a different way.

A/B Testing, Multivariate, and Bandit Algorithms: What's the Difference?

Multivariate Tests

A/B/n tests are controlled experiments run on 1 or more variations + the original page that
directly compare conversion rate means based on the changes made between variations.

While it sounds similar, multivariate tests are controlled experiments that test multiple versions
of a page and attempt to isolate which attributes cause the largest impact. In other words,
multivariate tests are like A/B/n tests in that they test an original against variations, but each
variation contains different design elements. For example:

If you have enough traffic, you should use both types of tests to maximize the output of your
optimization program. Each one has a different and specific impact and use case, and used
together, they can help you get the most out of your site. Here's how:

Use A/B testing to determine best layouts
Use MVT to polish the layouts to make sure all the elements interact with each other in
the best possible way.

As I said before, you need to get a ton of traffic to the page you're testing before even
considering MVT.

However, make sure your priorities align with your testing program. Peep once said, "most top
agencies that I've talked to about this run ~10 A/B tests for every 1 MVT."

Bandit Testing

As for bandit algorithms, you can almost think of them as A/B/n tests that update in real time
based on the performance of each variation.

In essence, a bandit algorithm starts by sending traffic to two (or more) pages: the original and
the variation(s). Then, in an attempt to "pull the winning slot machine arm" most often, the algorithm
updates based on whether or not a variation is winning. Eventually, the algorithm fully exploits
the best option:


One of the big benefits of bandit testing is that bandits mitigate regret, which is basically the
lost conversion you experience while exploring a potentially worse variation in a test. This chart
from Google explains that very well:


By the way, try not to think of bandits and A/B/n tests as a "this or that" scenario; they're tools
that each have their purposes. In general, bandits are great for:
Headlines and Short-Term Campaigns
Automation for Scale
Targeting
Blending Optimization with Attribution

Read this article for more information on bandit algorithms.
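
To make the "update in real time" idea a bit more concrete, here's a small simulation sketch. It uses Thompson sampling, which is just one of several bandit strategies (it is an assumption here, not necessarily what any particular tool runs), and the conversion rates are invented for the example.

```python
import random

def thompson_bandit(true_rates, visitors, seed=0):
    """Simulate a Thompson-sampling bandit and return pulls per arm.

    true_rates are hypothetical conversion rates the algorithm cannot see;
    it only observes whether each simulated visitor converted.
    """
    rng = random.Random(seed)
    arms = range(len(true_rates))
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in arms]
        arm = draws.index(max(draws))  # send this visitor to the best-looking arm
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# Made-up example: the original converts at 5%, the variation at 10%.
pulls = thompson_bandit([0.05, 0.10], visitors=5000)
print(pulls)  # most traffic should end up on the second arm
```

Early on, traffic is spread across both arms (exploration); as evidence accumulates, more and more visitors are routed to the arm that looks better (exploitation), which is how regret gets reduced.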

Two other minor issues with A/B testing:
One-tail or two-tail test?
Bayesian or Frequentist stats?

One-Tail vs Two-Tail A/B Tests

I promise you, this is a much smaller issue than some people think. Therefore, we'll brush over it
quickly.

One-tailed tests allow for the possibility of an effect in just one direction, whereas with two-tailed
tests, you are testing for the possibility of an effect in two directions, both positive and
negative.

No need to get very worked up about this. Matt Gershoff from Conductrics summed it up really
well:

"If your testing software only does one type or the other, don't sweat it. It is super simple to
convert one type to the other (but you need to do this BEFORE you run the test) since all of the
math is exactly the same in both tests. All that is different is the significance threshold level. If
your software uses a one-tail test, just divide the p-value associated with the confidence level
you are looking to run the test by 2. So if you want your two-tail test to be at the 95%
confidence level, then you would actually input a confidence level of 97.5%, or if at a 99%, then
you need to input 99.5%. You can then just read the test as if it was two-tailed."
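
Here's that conversion as a tiny sketch, in case the arithmetic is easier to follow in code. The function names are made up for this example; the second helper assumes a z-test and simply shows that the two-tailed p-value is twice the one-tailed one.

```python
from statistics import NormalDist

def one_tailed_confidence(two_tailed_confidence):
    """Confidence level to enter in a one-tail tool to get a two-tail test."""
    alpha = 1 - two_tailed_confidence  # two-tailed significance threshold
    return 1 - alpha / 2               # split the threshold across both tails

def p_values_for_z(z):
    """Return (one-tailed, two-tailed) p-values for a z statistic."""
    one_tailed = 1 - NormalDist().cdf(abs(z))
    return one_tailed, 2 * one_tailed

print(round(one_tailed_confidence(0.95), 4))  # 0.975
print(round(one_tailed_confidence(0.99), 4))  # 0.995
```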

Dive down the rabbit hole with our article on one-tail vs two-tail tests if you'd like.

Bayesian or Frequentist Stats

Bayesian or Frequentist A/B testing is another hot topic for debate, especially with
popular tools rebuilding their stats engines to feature a Bayesian methodology.

Here's the difference (very much simplified):

Using a Frequentist method means making predictions on underlying truths of the experiment
using only data from the current experiment. A Bayesian method, by contrast, also takes prior
knowledge (from previous experiments, for example) into account alongside the current data.

The difference is that in the Bayesian view, a probability is assigned to a hypothesis. In the
Frequentist view, a hypothesis is tested without being assigned a probability.

Dr. Rob Balon, who carries a PhD in statistics and market research, says the debate is "mostly
esoteric tail wagging done in the domain of the ivory tower." "In truth," he says, "most analysts
out of the ivory tower don't care that much, if at all, about Bayesian vs. Frequentist."

Don't get me wrong, there are practical business implications to each methodology. For the
sake of discussion, though, if you're at all new to A/B testing, there are much more important
things to worry about.

If you do want to dive down the rabbit hole, though, here's an article we wrote on Bayesian vs
Frequentist A/B Testing.

Strategy First: Setting up a Successful A/B Testing Plan

A structured approach to A/B testing could be your biggest area of impact. Don't listen to any
blog posts that tell you "99 Things You Can A/B Test Right Now." That's a waste of time and
traffic. Being a bit more process-minded will make you more money.

In a survey done by Econsultancy and RedEye, 74% of the survey respondents who reported
having a structured approach to conversion also stated they had improved their sales. Those
that don't have a structured approach, I'm almost sure, stay in what Craig Sullivan calls the
"Trough of Disillusionment" (unless their results are littered with false positives, which we'll get
into later).

So: structure. To simplify a winning process, it goes something like this:
1. Measurement
2. Prioritization
3. Experimentation
4. Repeat

Measurement: Getting Data-Driven Insights

Measurement is a crucial step in this process, because we want to know what is happening as
well as why it's happening.

First things first, we'll start with the high-level strategy and move down to the granular. So think
in this order:

1. Define your business objectives
2. Define your website goals
3. Define your Key Performance Indicators
4. Define your target metrics

Once you know where you want to go, you can collect the data necessary to get there. To do
this, we recommend the ResearchXL Framework (we'll go over this briefly, but to really
understand it, read this post).

Point is, we want to be data-driven, which means we want to collect and analyze relevant data
for our goals. This means a structured approach to our conversion research. Here's the executive
summary of our process:

1. Heuristic Analysis
2. Technical Analysis
3. Web Analytics Analysis
4. Mouse Tracking Analysis
5. Qualitative Surveys
6. User Testing

Heuristic analysis is about as close as we get to "best practices." The difference here in our
process is that, after years of experience, you still can't tell what exactly will work, but you can
more easily point out opportunity areas. As Craig Sullivan put it:

"My experience in observing and fixing things: these patterns do make me a better
diagnostician, but they don't function as truths. They guide and inform my work, but they don't
provide guarantees."

So when it comes to heuristic analysis, humility and a framework are important. We assess
each page based on:
Relevancy
Clarity
Value
Friction
Distraction

Read about WiderFunnel's LIFT Model for a good heuristic framework.

Technical analysis is an area often overlooked and highly underrated by optimizers. Bugs, if
they're around, are your main conversion killer. You think your site works perfectly, both in
terms of user experience and functionality, with every browser and device? Probably not.

This is low-hanging fruit, and one that you can make a lot of money on (think 12-month
perspective). So start:

Conduct cross-browser and cross-device testing
Do speed analysis

Web analytics analysis is next. First things first: make sure everything is working. You'd be
surprised how many analytics setups are broken.

Google Analytics (and other analytics setups) are a course in themselves, so I'll leave you with
some helpful links to read:

Google Analytics 101: How To Configure Google Analytics To Get Actionable Data
Google Analytics 102: How To Set Up Goals, Segments & Events in Google Analytics

Next is mouse tracking analysis, which includes heat maps, scroll maps, click maps, form
analytics, and user session replays. One point of advice here is to not get carried away with
pretty visualizations of click maps, etc. Make sure you're informing your larger goals with the
analytics in this step.

Qualitative research is an important part of measurement as well, because it tells you the why
that quantitative analysis misses. Many people think that qualitative analysis is softer or easier
than quantitative, but it should be just as rigorous and can provide insights just as important
as your GA data.

For qualitative research, use things like:
On-site surveys
Customer surveys
Customer interviews and focus groups

Finally we have user testing. The premise is simple: observe actual people use and interact with
your website while they narrate their thought process out loud. Pay attention to what
they say and experience.

After the heavy-ass conversion research, you'll have lots of data and need to do some
prioritization.

Prioritizing A/B Tests

There are many frameworks to prioritize your A/B tests, and you could even innovate with your
own formula. But here's how we do it. Once you go through all 6 steps, you will find
issues, some of them severe, some minor. You'll want to allocate every finding into one of
these 5 buckets:

1. Test. (This bucket is where you place stuff for testing.)
2. Instrument. (This can involve fixing, adding, or improving tag or event handling in the
analytics configuration.)
3. Hypothesize. (This is where we've found a page, widget, or process that's just not
working well but we don't see a clear single solution.)
4. Just Do It (JFDI). (Here's the bucket for no-brainers. Just do it.)
5. Investigate. (If an item is in this bucket, you need to ask questions or do further
digging.)

Then we rank them from 1 to 5 stars (1 = minor issue, 5 = critically important). There are 2
criteria that are more important than others when giving a score:

1. Ease of implementation (time/complexity/risk). Sometimes the data tells you to build a
feature, but it takes months to do it. So it's not something you'd start with.
2. Opportunity score (subjective opinion on how big of a lift you might get).

Then create a spreadsheet with all of your data and you'll have a prioritized testing roadmap,
more rigorous than most of your competitors will have.

You can also use a variety of other frameworks. A very popular one is the PIE framework. This
breaks opportunity areas into three scores:
1. Potential
2. Importance
3. Ease

Read more on frameworks to prioritize A/B testing here.
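
If a spreadsheet isn't your thing, a PIE-style ranking is easy to sketch in a few lines. This example assumes each idea is scored from 1 to 10 on the three criteria and that the overall score is the plain average of the three; the test ideas and their scores are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    potential: int   # how much improvement is possible (1-10)
    importance: int  # how valuable the affected traffic is (1-10)
    ease: int        # how easy the test is to implement (1-10)

    @property
    def pie_score(self) -> float:
        # Assumption: overall score is the simple average of the three.
        return (self.potential + self.importance + self.ease) / 3

# Hypothetical backlog of test ideas.
ideas = [
    Idea("New homepage hero", potential=6, importance=7, ease=3),
    Idea("Simplify checkout form", potential=8, importance=9, ease=6),
    Idea("Fix mobile navigation", potential=5, importance=8, ease=9),
]

roadmap = sorted(ideas, key=lambda idea: idea.pie_score, reverse=True)
for idea in roadmap:
    print(f"{idea.pie_score:.1f}  {idea.name}")
```

The same structure works for the star-and-bucket approach above: swap the three fields for your own criteria and change the scoring property.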

Setting Up Your A/B Tests

Once you've got a prioritized list of test ideas, it's time to form a hypothesis and run an
experiment. Basically, a hypothesis will define why you believe a problem occurs. Furthermore,
a good hypothesis:
1. Is testable: it needs to be measurable, so that it can be used in testing.
2. Has a goal of solving conversion problems: split testing is done to solve specific
conversion problems.
3. Gains market insights: a well-articulated hypothesis will let your split testing results give
you information about your customers, whether the test wins or loses.

Craig Sullivan has put together a hypothesis kit to simplify the process. Here's his simple
version:

1. Because we saw (data/feedback)
2. We expect that (change) will cause (impact)
3. We'll measure this using (data metric)

And the advanced one:

1. Because we saw (qual & quant data)
2. We expect that (change) for (population) will cause (impact(s))
3. We expect to see (data metric(s) change) over a period of (x business cycles)

Technical Stuff

Here's the fun part: you finally get to pick your tool.

While this is the first thing many people think about, it's not actually the most important, by any
means. The strategy and statistical knowledge aspects come first, and only then should you
worry about picking a tool.

That said, there are a few differences you should bear in mind.

One major distinction between tools is whether they are server-side or client-side testing tools.

Server-side tools render code on the server level and send a randomized version of the page to
the viewer with no modification on the visitor's browser. Client-side tools send the same page,
but JavaScript on the client's browser manipulates the appearance of both the original and the
variation.

Client-side testing tools are things like Optimizely, VWO, and Adobe Target. Conductrics has
capabilities of both, and SiteSpect does a proxy server-side method.

What does all this mean for you? If you'd like to save time up front, or if your team is small or
lacks development resources, client-side tools can get you up and running faster. Server-side
requires development resources but can often be more robust.

So while setting up tests can be slightly different depending on which tool you use, often it will
be as simple as signing up for your favorite tool and following some basic instructions, like
putting a JavaScript snippet on your website.

You'll basically want to set up goals (something that lets you know a conversion has been
made, like a "thank you for purchasing" page), and your testing tool will track when each
variation converts visitors into customers.

Some skills that come in handy when setting up tests are HTML, CSS, and JavaScript/jQuery,
as well as design and copywriting skills to draw up the variations. Sure, some tools allow use of
a visual editor, but that limits your flexibility and control, so learning some technical skills is
helpful.

Or you could use something like Testing.Agency to set up your tests for you.

How Long Should You Run A/B Tests?

First rule: don't stop a test just because it reaches statistical significance. This is probably the
most common error committed by beginning optimizers with good intentions.

If you're calling your tests when you hit significance, you'll find that most of your lifts don't
translate to increased revenue (that's the goal, after all). You'll find that the lifts were in fact
imaginary.

Consider this: one thousand A/A tests (two identical pages tested against each other) were run.

771 experiments out of 1,000 reached 90% significance at some point
531 experiments out of 1,000 reached 95% significance at some point

Stopping tests at significance breeds the risk of false positives and ignores possible external
validity threats like seasonality.

Instead, you'll want to predetermine a sample size and run the test for full weeks, usually for at
least two business cycles.
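
You can see the peeking problem for yourself with a quick simulation. The sketch below runs a batch of simulated A/A tests (both arms share the same made-up 10% conversion rate), checks a two-proportion z-test after every batch of visitors, and compares how often a test was "significant at some point" with how often it was significant at the planned end. All the parameters are assumptions chosen to keep it fast, and it will not reproduce the exact figures quoted above.

```python
import random
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(experiments=200, looks=20, visitors_per_look=100,
                rate=0.10, alpha=0.05, seed=1):
    """Return (share significant at any look, share significant at the end)."""
    rng = random.Random(seed)
    ever, at_end = 0, 0
    for _ in range(experiments):
        conv, n = [0, 0], [0, 0]
        hit_at_some_point = False
        for _ in range(looks):
            for arm in (0, 1):
                conv[arm] += sum(rng.random() < rate for _ in range(visitors_per_look))
                n[arm] += visitors_per_look
            p = p_value(conv[0], n[0], conv[1], n[1])
            if p < alpha:
                hit_at_some_point = True
        ever += hit_at_some_point
        at_end += p < alpha  # p now holds the result of the final look
    return ever / experiments, at_end / experiments

peeking_rate, final_rate = simulate_aa()
print(f"significant at some point: {peeking_rate:.0%}, at the end: {final_rate:.0%}")
```

Since there is no real difference between the arms, every "significant" result here is a false positive, and the more often you look, the more of them you collect.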

How do you predetermine sample size? There are lots of great tools out there for that, including
calculators built into your favorite testing tool. Here's how you'd calculate your sample size with Evan
Miller's tool:

In this case we told the tool that we have a 3% conversion rate, and want to detect at least a 10%
uplift. The tool tells us that we need 51,486 visitors per variation before we can look at the statistical
significance levels and statistical power.
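
For the curious, here's roughly what such a calculator does, as a sketch. It uses a common textbook formula for comparing two proportions with a two-tailed test; the function name is made up, and calculators differ slightly in the variance approximation they use, so expect a figure in the same ballpark as the one quoted above rather than an exact match.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-proportion test.

    baseline      -- current conversion rate, e.g. 0.03
    relative_lift -- minimum detectable effect, relative, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed threshold
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Same inputs as the example above: 3% baseline, 10% relative uplift.
n = sample_size_per_variation(0.03, 0.10)
print(n)  # on the order of 50,000+ visitors per variation
```

Try halving the detectable lift to 5% and watch the required sample roughly quadruple; small effects are expensive to detect.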

Oh, and you'll notice that in addition to significance level, there's something called statistical power
in the photo above as well.

Statistical power is another important factor in running your A/B test, as it attempts to avoid
Type II errors (false negatives). In other words, it makes sure that you detect an effect if there
actually was one.

For practical purposes, know that 80% power is the standard for testing tools. To reach such a
level, you need either a large sample size, a large effect size, or a longer duration test.

There Are No Magic Numbers

You'll read a lot of blog posts that have magic numbers like "100 conversions" or "1,000 visitors"
as their stopping points. Math is not magic, math is math, and what we're dealing with is slightly
more complex than simplistic heuristics like that. Andrew Anderson from Malwarebytes put it
well:

"It is never about how many conversions, it is about having enough data to validate based on
representative samples and representative behavior.

100 conversions is possible in only the most remote cases and with an incredibly high delta in
behavior, but only if other requirements like behavior over time, consistency, and normal
distribution take place. Even then it has a really high chance of a type I error, false positive."

What we're worried about is the representativeness of our sample. How can we ensure that in basic
terms? Your test should run for 1, or better yet 2, business cycles, so it includes everything that
goes on:
every day of the week (and tested one week at a time, as your daily traffic can vary a lot),
various different traffic sources (unless you want to personalize the experience for a
dedicated source),
your blog post and newsletter publishing schedule,
people who visited your site, thought about it, and then came back 10 days later to buy
it,
any external event that might affect purchasing (e.g. payday)

Another (very important) note: be careful with low sample size. The internet is full of case
studies steeped in shitty math, and most of it (if they even release full numbers) is because
they judged a test on like 100 visitors per variation and 12 vs 22 conversions.

If you've set everything up correctly so far, then you'll just want to avoid peeking (or letting your
boss peek) at test results multiple times before the test is finished. This can result in calling a
result early due to "spotting a trend" (impossible). What you'll find is that many test results
regress to the mean.

Regression to the Mean

Often, you'll see results vary wildly in the first few days of the test. Sure enough, they tend to
converge as the test continues for the next few weeks. Here's an example Peep gave in an
older blog post of an eCommerce client:

Here's what we're looking at:

First couple of days: blue (variation #3) is winning big, like $16 per visitor vs $12.5 for
Control. Lots of people would end the test here. (Fail.)
After 7 days: blue still winning, and the relative difference is big.
After 14 days: orange (#4) is winning!
After 21 days: orange still winning!
End: no difference

So if you'd called the test at less than four weeks, you would have made an erroneous
conclusion.

Something related, that the internet always gets confused on, is called the novelty effect. That's
when the novelty of your changes (bigger blue button) brings more attention to the variation.
With time, the lift disappears because the change is no longer novel.

All of this stuff is some of the more complex A/B testing information. We have a bunch of blog
posts devoted to the various topics covered above. Dive in if you'd like to learn more:
Stopping A/B Tests: How Many Conversions Do I Need?

Can You Run Multiple A/B Tests Simultaneously?

You want to speed up your testing program and run more tests. High-tempo testing, yeah? So a
common question is: can you run more than one A/B test at the same time on your site?

Will this increase your growth potential, or will it pollute the data because each test interacts
with the other?

Look, this is a complicated issue. Some experts say you shouldn't do multiple tests
simultaneously, and some say it's fine.

In most cases you will be fine running multiple simultaneous tests, and extreme interactions are
unlikely. Unless you're testing really important stuff (e.g. something that impacts your business
model or the future of the company), the benefits of testing volume will most likely outweigh the noise
in your data and occasional false positives.

If, based on your assessment, there's a high risk of interaction between multiple tests, reduce the
number of simultaneous tests and/or let the tests run longer for improved accuracy.

If you want to read more on this, read these posts:
AB Testing: When Tests Collide
Can You Run Multiple A/B Tests at the Same Time?

Analyzing Your A/B Testing Results

Alright. You've done your research, set up your test, and the test is finally cooked. Now, on to
analysis, and it's not always as simple as glancing at the graph your testing tool gives you.


One thing you should always do is analyze your test results in Google Analytics.

It doesn't just enhance your analysis capabilities, but it also allows you to be more confident in your
data and decision making.

The point is, it's possible that your testing tool could be recording the data incorrectly, and if you
have no other source for your test data, you can never be sure whether to trust it or not. Create
multiple sources of data (won't go too far into detail, but read this post for how to set it all up).

But what happens if, after analyzing the results in GA, there is no difference at all between
variations?

Don't move on too quickly. First, realize these two things:

1. Your test hypothesis might have been right, but the implementation sucked.

Let's say your qualitative research says that concern about security is an issue. How many ways
do we have to beef up the perception of security? Unlimited.

The name of the game is iterative testing, so if you were onto something, then try a few
iterations that attempt to solve the problem.

2. Just because there was no difference overall, the treatment might have beaten control in
a segment or two.

If you got a lift in returning visitors and mobile visitors, but a drop for new visitors and desktop
users, those segments might cancel each other out, and it seems like it's a case of "no
difference." Analyze your test across key segments to see this.

All About Data Segmentation

The key to learning in A/B testing is segmenting. Even though B might lose to A in the overall
results, B might beat A in certain segments (organic, Facebook, mobile, etc).

There are a ton of segments you can analyze. Optimizely lists the following possibilities:
Browser type
Source type
Mobile vs. desktop, or by device
Logged-in vs. logged-out visitors
PPC/SEM campaign
Geographical regions (City, State/Province, Country)
New vs. returning visitors
New vs. repeat purchasers
Power users vs. casual visitors
Men vs. women
Age range
New vs. already-submitted leads
Plan types or loyalty program levels
Current, prospective, and former subscribers
Roles (if your site has, for instance, both a Buyer and Seller role)

But definitely look at your test results at least across these segments (making sure of adequate
sample size):

Desktop vs Tablet/Mobile
New vs Returning
Traffic that lands directly on the page you're testing vs traffic that came via an internal link

For segments, the same stopping rules apply.

Make sure that you have enough sample size within the segment itself as well (calculate it in
advance, and be wary if it's less than 250 to 350 conversions PER variation within that one segment
you're looking at).

If your treatment performed well for a specific segment, it's time to consider a personalized
approach for that particular segment.
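
Here's a small sketch of what a segment breakdown might look like once you've exported the numbers. The data is entirely made up, and the 250-conversion floor is simply the lower end of the rule of thumb above; in practice you would calculate the required sample per segment in advance.

```python
MIN_CONVERSIONS = 250  # lower bound of the rule of thumb, per variation

def segment_report(results):
    """Compute conversion rates, relative lift, and a data-sufficiency flag.

    results maps segment -> {"control": (conversions, visitors),
                             "variation": (conversions, visitors)}
    """
    report = {}
    for segment, arms in results.items():
        rates = {arm: conv / visitors for arm, (conv, visitors) in arms.items()}
        report[segment] = {
            "rates": rates,
            "lift": rates["variation"] / rates["control"] - 1,
            "enough_data": all(conv >= MIN_CONVERSIONS for conv, _ in arms.values()),
        }
    return report

# Hypothetical export: (conversions, visitors) per arm and segment.
results = {
    "mobile":  {"control": (300, 10_000), "variation": (360, 10_000)},
    "desktop": {"control": (400, 10_000), "variation": (340, 10_000)},
    "tablet":  {"control": (40, 1_500),   "variation": (55, 1_500)},
}

for segment, row in segment_report(results).items():
    flag = "" if row["enough_data"] else "  (too few conversions to trust)"
    print(f"{segment:8s} lift {row['lift']:+.0%}{flag}")
```

In this invented data set, mobile is up and desktop is down by similar amounts, so the overall result looks flat even though the segments tell two different stories.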

All Your Stats Questions Answered, Stat

There's a certain level of statistical knowledge that comes in handy when analyzing A/B test
results. Some of it we went over in the above section on setting up A/B tests, but there is still
more to be covered when it comes to analysis.

Why do you need to know all of this statistics stuff? We're dealing with inference here (means
and probability) and therefore cannot go without some basic understanding of stats.

Or as Matt Gershoff put it (quoting his college math professor), "how can you make cheese if
you don't know where milk comes from?!"

There are three terms you should know before we dive into the nitty-gritty of A/B testing
statistics:

1. Mean (we're not measuring all conversion rates, just a sample, and finding an average of
them that is representative of the whole)
2. Variance (what is the natural variability of a population? That will affect our results and
how we take action with them)
3. Sampling (again, we can't measure the true conversion rate, so we select a sample that is
hopefully representative of the whole)

What The Hell Is a P-Value?

There's a large number of bloggers writing about conversion optimization who are using the term
"statistical significance" inaccurately.

We talked a bit above about how statistical significance by itself is not a stopping rule, so what
is it and why is it important?

To start with, let's go over p-values, which are also very misunderstood. As FiveThirtyEight
recently pointed out, even scientists can't easily explain what p-values are.

A p-value is basically a measure of evidence against the null hypothesis (the control, in A/B testing
parlance).

Very important: the p-value does not tell us the probability that B is better than A.

Similarly, it doesn't tell us the probability that we will make a mistake in selecting B over A.
These are both extraordinarily common misconceptions, but they are false.

The p-value is just the probability of seeing a result at least as extreme as the one observed,
given that the null hypothesis is true. Or, "How surprising is that result?"

So to sum it up, statistical significance (or a statistically significant result) is attained when a
p-value is less than the significance level (which is usually set at .05). By the way, significance
in regards to statistical hypothesis testing is where the whole one-tail vs two-tail issue comes up.
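
To make that definition less abstract, here's a sketch of one common way to compute a p-value for conversion rates: a pooled two-proportion z-test. This is an assumption for illustration; your testing tool may use a different test under the hood. The numbers plugged in are the "100 visitors per variation and 12 vs 22 conversions" example from earlier.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate if the null hypothesis is true
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Probability of a difference at least this extreme, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_p_value(12, 100, 22, 100)
print(f"p-value: {p:.3f}")
print("significant at .05" if p < 0.05 else "not significant at .05")
```

Even though 22 conversions looks like almost double 12, with only 100 visitors per variation the p-value comes out at roughly 0.06, which does not clear the usual .05 threshold.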

Confidence Intervals and Margin of Error

In A/B testing, we use confidence intervals to mitigate the risk of sampling errors. In that sense,
we're managing the risk associated with implementing a new variation. So if your tool says
something like, "We are 95% confident that the conversion rate is X% +/- Y%," then you need to
account for the +/- Y% as the margin of error.


How confident you are in your results depends largely on how large the margin of error is. As a
rule of thumb, if the two conversion ranges overlap, you'll need to keep testing in order to get a
valid result.
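
As a rough sketch, here's how a margin of error and the overlap check could be computed. It assumes the simple normal-approximation interval for a proportion (tools may use something more refined), and the visitor and conversion counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def conversion_interval(conversions, visitors, confidence=0.95):
    """Return (rate, margin_of_error) using the normal approximation."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate, margin

def ranges_overlap(a, b):
    """True if the two (rate, margin) conversion ranges overlap."""
    (rate_a, margin_a), (rate_b, margin_b) = a, b
    return rate_a - margin_a <= rate_b + margin_b and rate_b - margin_b <= rate_a + margin_a

# Hypothetical results after 10,000 visitors per variation.
control = conversion_interval(300, 10_000)
variation = conversion_interval(360, 10_000)

for name, (rate, margin) in (("control", control), ("variation", variation)):
    print(f"{name}: {rate:.2%} +/- {margin:.2%}")
print("ranges overlap, keep testing" if ranges_overlap(control, variation)
      else "ranges are clearly separated")
```

Because the margin shrinks with the square root of the sample size, quadrupling your traffic only halves the margin of error.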

Matt Gershoff gave a great illustration of how margin of error works:

"Say your buddy is coming to visit you from Round Rock and is taking TX-1 at 5pm. She wants
to know how long it should take her. You say, 'I have a 95% confidence that it will take you about
60 minutes plus or minus 20 minutes.' So your margin of error is 20 minutes, or 33%.

If she is coming at 11am you might say, 'It will take you 40 min, plus or minus 10 min,' so the
margin of error is 10 minutes, or 25%. So while both are at the 95% confidence level, the margin
of error is different."

External Validity Threats

There's a challenge with running A/B tests: the data is non-stationary.

A stationary time series is one whose statistical properties (mean, variance,
autocorrelation, etc) are constant over time. For many reasons, website data is non-stationary,
which means we can't make the same assumptions as with stationary data. Here are a few
reasons data might fluctuate:

Season
Day of the week
Holidays
Press (positive or negative)
Other Marketing Campaigns
PPC/SEM
SEO
Word of Mouth

So seasonality and the other factors above are one source of external validity threat.

Others include sample pollution, the flicker effect, revenue tracking errors, selection bias, and
more (read here). These are all things to keep in mind in planning and analyzing your A/B tests.

Archiving Test Results For Future Learning

A/B testing isn't just about lifts, wins, losses, and testing random shit. As Matt Gershoff said,
optimization is about "gathering information to inform decisions," and the learnings from
statistically valid A/B test results contribute to the greater goals of growth and optimization.

Smart organizations archive their test results and plan their approach to testing systematically.
There's a reason organizations with a structured approach to optimization have greater growth
and are limited less often by local maxima.


So here's the tough part: there's no single best way to structure your knowledge management.

We wrote an article on how effective organizations archive their results (read it), and as it turns
out, many of them do it slightly differently. Some use sophisticated internally built tools, some
use 3rd-party tools, and some use good ol' Excel and Trello.

If it helps, here are 4 tools built specifically for conversion optimization project management:

Iridion
Effective Experiments
GrowthHackers Canvas
Experiment Engine

On a similar note, in larger organizations (or hell, in smaller ones as well), it's important to be able to
communicate across departments and to the executives above. Often, A/B test results aren't
super intuitive to the layperson (and most people haven't read guides as long as this one). So
what helps is visualization.

This is another area where, sadly, there is no real right way to do it. That said, Annemarie
Klaassen and Ton Wesseling wrote an awesome post on our blog detailing their journey to great
visualizations. Sneak peek, here's what they ended up with:

A/B Testing Tools and Resources

Littered throughout this guide are tons of links to external resources: articles, tools, books, etc.
To make it convenient for you, though, here are some of the best (divided by categories).

A/B Test Tools

Optimizely
VWO
Adobe Target
Maxymiser
Conductrics
53 Conversion Optimization Tools Reviewed By Experts

A/B Testing Calculators

A/B Split Test Significance Calculator by VWO
A/B Split and Multivariate Test Duration Calculator
Evan Miller's Sample Size Calculator
Evan Miller's Whole Suite of A/B Testing Tools

A/B Testing Statistics Resources

Ignorant No More: Crash Course on A/B Testing Statistics
Statistical Analysis and A/B Testing
Understanding A/B testing statistics to get REAL Lift in Conversions
One-Tailed vs Two-Tailed Tests (Does It Matter?)
Bayesian vs Frequentist A/B Testing: What's the Difference?
Sample Pollution
Science Isn't Broken

A/B Testing/CRO Strategy Resources

eCommerce A/B Test Data for Improved Process: What Percentage of Tests Are
Winners?
WiderFunnel's LIFT Model
3 Frameworks To Help Prioritize & Conduct Your Conversion Testing
What you have to know about conversion optimization
Our Conversion Optimization Guide
Small Business Big Money Online: A Proven System to Optimize eCommerce Websites
and Increase Internet Profits (book)

Conclusion

A/B testing is an invaluable resource to anyone making decisions in an online environment. With
a little bit of knowledge and a lot of diligence, you can mitigate many of the risks that most
beginning optimizers face due to errors.

This isn't a complete or an ultimate guide, but it is a damn good start. If you really dig into the
information here, you'll be ahead of 90% of people running tests. If you believe in the power of
A/B testing for continued revenue growth, then that's a fantastic place to be.

Knowledge is a limiting factor that only experience and iterative learning can bust through,
though. So get testing :)
