
So far, we've plotted and visualized data in various ways. Today, we'll see how to statistically back up some of the observations we've made in looking at our data. Statistics is a tool that helps separate newsmaking data-backed stories from one-off anecdotes. Usually, both kinds of stories start with a hunch, and statistics helps us quantify the evidence backing that hunch.

Whenever you have a hunch (a hypothesis in statistician-speak), the first thing to do is to look at some summary statistics (e.g., averages), and explore the data graphically as we did yesterday. If the visualizations seem to support your hunch, you will move into hypothesis testing mode.

Two Running Examples
For our first set of tests, we're going to use two running examples: campaign spending and a fun comparison of two towns' citizens' heights. Here are the two scenarios:

- One thing that's been claimed about the 2008 election is that President Obama raised smaller quantities from a larger group of donors than Senator McCain, who raised a smaller number of large contributions. Statistical techniques will help us determine how true this statement is.
- Imagine two towns that only differ in that one of the towns had something in the water the year a bunch of kids were born. Did that something in the water affect the height of these kids? (Note: This situation is unrealistic. It's never the case that the only difference between two communities is the one you want to measure, but it's a nice goal!) We'll use statistics to determine whether the two communities have meaningfully different heights.

Comparing Averages
Let's start by comparing a simple statistic, to see if there's any difference in the data we observe. We'll start by comparing the average heights of the two towns. (As an aside: it would help if you wrote and ran your code in dataiap/day3/ today, since several modules like ols.py are available in that directory.)
import numpy

town1_heights = [5, 6, 7, 6, 7.1, 6, 4]
town2_heights = [5.5, 6.5, 7, 6, 7.1, 6]

town1_mean = numpy.mean(town1_heights)
town2_mean = numpy.mean(town2_heights)

print("Town 1 avg. height", town1_mean)
print("Town 2 avg. height", town2_mean)

# The effect size is the absolute difference between the two averages
print("Effect size:", abs(town1_mean - town2_mean))

It looks like town 2's average height (6.35 feet) is higher than town 1's (5.87 feet) by a difference of .479 feet. This difference is called the effect size. Town 2 certainly looks taller than Town 1!

Exercise: Compute the average campaign contribution for the Obama and McCain campaigns from the dataset in day 1. What's the effect size? We have an average contribution of $423 for McCain and $192 for Obama, for an effect size of $231. McCain appears, on average, to have more giving donors.

Before we fire up the presses on either of these stories, let's look at the data in more depth.

Graph The Data
If you finished yesterday's histogram exercise, then feel free to skip down to the boxplot section.

The effect size in both of our examples seems large. It would be nice to do more than just compare averages. Let's try to look at a histogram of the distributions. The code below builds a histogram of the two towns' heights, binned in 1-foot increments.
import matplotlib.pyplot as plt
from collections import Counter

increment = 1  # bucket size, in feet
width = .25    # bar width; also shifts town 2's bars so they sit beside town 1's

# Round each height down to the start of its bucket
town1_bucketted = [ammt - ammt % increment for ammt in town1_heights]
town2_bucketted = [ammt - ammt % increment + width for ammt in town2_heights]

# Count how many heights fall into each bucket
town1_hist = Counter(town1_bucketted)
town2_hist = Counter(town2_bucketted)

minamount = min(min(town1_heights), min(town2_heights))
maxamount = max(max(town1_heights), max(town2_heights))
buckets = range(int(minamount), int(maxamount) + 1, increment)

fig = plt.figure()
sub = fig.add_subplot(111)
sub.bar(list(town1_hist.keys()), list(town1_hist.values()), color='b', width=width, label="town 1")
sub.bar(list(town2_hist.keys()), list(town2_hist.values()), color='r', width=width, label="town 2")
sub.legend()
# The figures/ directory must already exist
plt.savefig('figures/town_histograms.png', format='png')

This results in a histogram that looks like this:

Not bad! The buckets are all exactly the same size except for one person of height between 4 and 5 feet in town 1.

Exercise: Build a histogram for the Obama and McCain campaigns. This is challenging, because there are a large number of outliers that make the histograms difficult to compare. Add the line

sub.set_xlim((-20000, 20000))

before displaying the plot in order to set the x values of the histogram to cut off donations larger than $20,000 or smaller than -$20,000 (refunds). With bar widths of 50 and increments of $100, your histogram will look something like this:

Ouch! I can't make heads or tails of that. It seems like Obama has a larger number of small donations, but there isn't a lot of granularity at that scale. For large datasets, a histogram might have too much information on it to be helpful. Luckily, descriptive statisticians have a more concise visualization. It's called a box-and-whisker plot! The code for it is quite simple as well:


import matplotlib.pyplot as plt

fig = plt.figure()
sub = fig.add_subplot(111)
# whis=1: whiskers reach at most 1x the IQR beyond the box edges
sub.boxplot([town1_heights, town2_heights], whis=1)
sub.set_xticklabels(("Town 1", "Town 2"))
sub.set_title("Town 1 vs. Town 2 Heights")
plt.savefig('figures/town_boxplots.png', format='png')

Here's what we see:

Let's interpret this plot. We show town 1 on the left and town 2 on the right. Each town is represented by a box with a red line and whiskers.

- The red line in the box represents the median, or 50th percentile value of the distribution. If we sort the dataset, 50% of the values will be below this line, and 50% will be above it.
- The bottom edge of the box represents the 25th percentile (the value larger than 25% of your dataset), and the top edge represents the 75th percentile (the value larger than 75% of your dataset). The difference between the 75th and 25th percentile is called the interquartile range (IQR).
- The whiskers represent the extremes of our dataset: the largest and smallest values we're willing to consider in our dataset before calling something an outlier. In our case, we set whis=1, requesting that each whisker extend to the most extreme value at a distance of at most 1x the IQR from the bottom and top edges of the box.

If normal distributions are your thing, this image might help you interpret the box-and-whiskers plot.

Like in the histogram, we see that the towns' height distributions don't look all that different from one another. Generally, if the boxes of each distribution overlap, and you haven't taken something on the order of a buttload (metric units) of measurements, you should doubt the differences in distribution averages. It looks like a single height measurement for town 1 is pretty far away from the others, and you should investigate such measurements as potential outliers.

Exercise: Build a box-and-whiskers plot of the McCain and Obama campaign contributions. Again, outliers make this a difficult task. With whis=1, and by setting the y range of the plots like so
sub.set_ylim((-250, 1250))

we got the following plot:

Obama is on the left, and McCain on the right. Real data sure is more confusing than fake data! Obama's box plot is a lot tighter than McCain's, who has a larger spread of donation sizes. Both of Obama's whiskers are visible on this chart, whereas only the top whisker of McCain's plot is visible. Another feature we haven't seen before is the stream of blue dots after each of the whiskers on each of Obama and McCain's plots. These represent potential outliers, or values that are extreme and do not represent the majority of the dataset.

It was easy to say that the histograms and boxplots for the town heights overlapped heavily. So while the effect size for town heights was pretty large, the distributions don't actually look all that different from one another.

The campaign plots are a bit harder to discern. The histogram told us virtually nothing. The boxplot showed us that Obama's donations seemed more concentrated on the smaller end, whereas McCain's seemed to span a larger range. There was overlap between the boxes in the plot, but we don't really have a sense for just how much overlap or similarity there is between these distributions. In the next section, we'll quantify the difference using statistics!
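If you'd like to see the numbers behind the town box plot, here is a minimal sketch that computes the median, the box edges, the IQR, the whisker ends, and the potential outliers directly. The helper name box_stats is our own invention for illustration, and it assumes numpy's default percentile interpolation, which may differ slightly from what your plotting library uses.

```python
import numpy

def box_stats(values, whis=1):
    """Return the numbers a box-and-whisker plot draws for `values` (illustrative helper)."""
    q1, median, q3 = numpy.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers may reach at most whis * IQR beyond the box edges
    low_limit = q1 - whis * iqr
    high_limit = q3 + whis * iqr
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        "iqr": float(iqr),
        "low_whisker": min(inside),    # most extreme value still inside the lower limit
        "high_whisker": max(inside),   # most extreme value still inside the upper limit
        "outliers": outliers,
    }

town1_heights = [5, 6, 7, 6, 7.1, 6, 4]
town2_heights = [5.5, 6.5, 7, 6, 7.1, 6]

town1_stats = box_stats(town1_heights)
town2_stats = box_stats(town2_heights)
print("Town 1:", town1_stats)
print("Town 2:", town2_stats)
```

For town 1, the 4-foot measurement falls below the lower whisker limit, which is the lone far-away point mentioned above.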

Run a Statistical Test
We have two population height averages. We know that they are different, but charts show that overall the two towns look similar. We have two campaign contribution averages that are also different, but with a murkier story after looking at our box-and-whisker plots. How will we definitively say whether the differences we observe are meaningful?

In statistics, what we are asking is whether differences we observed are reliable indicators of some trend, or just happened by lucky chance. For example, we might simply have measured particularly short members of town 1 and tall members of town 2. Statistical significance is a measure of the probability that, for whatever reason, we stumbled upon the results we did by chance.

There are several tests for statistical significance, each applying to a different question. Our question is: "Is the difference between the average height of people in town 1 and town 2 statistically significant?" We ask a similar question about the difference in average campaign contributions. The test that answers this question is the T-Test. There are several flavors of T-Test and we will discuss these soon, but for now we'll focus on Welch's T-Test.

import welchttest

print("Welch's T-Test p-value:", welchttest.ttest(town1_heights, town2_heights))

The Welch's T-Test emitted a p-value of .349. A p-value is the probability of seeing an effect size at least as large as the .479 feet between town 1 and town 2 purely by chance, that is, if there were really no difference between the towns. In this case, there's a 34.9% chance that we would arrive at an effect size this large by chance.

What's a good cutoff for p-values to know whether we should trust the effect size we're seeing? Two popular values are .05 or .01: if there is less than a 5% or 1% chance that we arrived at our answer by chance, we're willing to say that we have a statistically significant result.

So in our case, our result is not significant. Had we taken more measurements, or if the difference in heights were larger, we might have reached significance. But, given our current results, let's not jump to conclusions. After all, it was just food coloring in the water!

Exercise: Run Welch's T-test on the campaign data. Is the effect size between McCain and Obama significant? By our measurements, the p-value reported is within rounding error of 0. That's significant by anyone's measure: there's a near-nonexistent chance we're seeing this difference between the candidates by some random fluke in the universe. Time to write an article!
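For the curious, here is a sketch of the arithmetic behind a two-sided Welch's T-Test on the town data. It follows the standard textbook formula (the t statistic plus the Welch-Satterthwaite degrees of freedom) and is not necessarily how the course's welchttest module is written. The function name welch_ttest is our own, and the only library call is scipy.stats.t.sf, the upper-tail probability of the t distribution.

```python
import math
from scipy import stats

def welch_ttest(sample_a, sample_b):
    """Two-sided Welch's T-Test; returns (t statistic, degrees of freedom, p-value)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Sample variances (divide by n - 1)
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    # Each sample's contribution to the squared standard error of the difference
    se2_a = var_a / n_a
    se2_b = var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    # Two-sided p-value: probability of a |t| at least this large under no difference
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, float(p_value)

town1_heights = [5, 6, 7, 6, 7.1, 6, 4]
town2_heights = [5.5, 6.5, 7, 6, 7.1, 6]

t_stat, df, p_value = welch_ttest(town1_heights, town2_heights)
print("t statistic:", t_stat)
print("degrees of freedom:", df)
print("p-value:", p_value)
```

Note how the variances enter separately for each town: that is what lets the test cope with unequal spreads and unequal sample sizes.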

Can You Have a Very Significant Result?
No. There is no such thing as "very" or "almost" significant. Remember: the effect size is the interesting observation, and it's up to you what makes for an impressive effect size depending on the situation. You can have small effects, large effects, and everything in between. Significance testing tells us whether to believe that the observations we made happened by anything more than random chance. While people disagree about whether a p-value of .05 or .01 is required, they all agree that significance is a binary value.

Strictly speaking, you've learned about T-Tests at this point. If you are pressed for time, read Putting it All Together below and move on to the next section. For the overachievers in our midst, there's lots of important information to follow, and you can instead keep reading until the end.
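To make "happened by random chance" concrete, here is a small sketch of a different technique, an exact permutation test, applied to the town heights. It is not the T-Test from this section: it simply pools all 13 heights, tries every way of relabeling 7 of them as "town 1", and counts how often the relabeled effect size is at least as large as the one we observed. The function names are our own.

```python
from itertools import combinations

town1_heights = [5, 6, 7, 6, 7.1, 6, 4]
town2_heights = [5.5, 6.5, 7, 6, 7.1, 6]

def mean(values):
    return sum(values) / len(values)

def permutation_pvalue(sample_a, sample_b):
    """Fraction of relabelings whose effect size is at least the observed one."""
    pooled = list(sample_a) + list(sample_b)
    observed = abs(mean(sample_a) - mean(sample_b))
    total = sum(pooled)
    n_a = len(sample_a)
    n_b = len(pooled) - n_a
    at_least_as_large = 0
    relabelings = 0
    # Every way of choosing which pooled measurements get the first label
    for chosen in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in chosen)
        effect = abs(sum_a / n_a - (total - sum_a) / n_b)
        relabelings += 1
        # Small tolerance so floating point noise does not hide exact ties
        if effect >= observed - 1e-9:
            at_least_as_large += 1
    return at_least_as_large / relabelings

perm_p = permutation_pvalue(town1_heights, town2_heights)
print("Permutation p-value:", perm_p)
print("Significant at .05?", perm_p < 0.05)
```

The last line is the binary part: however small or large the p-value is, the answer to "is it significant at this cutoff?" is only ever yes or no.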

Types of T-Test
The T-Test has two major flavors: paired and unpaired.

Sometimes your datasets are paired (also called dependent). For example, you may be measuring the performance of the same set of students on an exam before and after teaching them the course content. To use a paired T-Test, you have to be able to measure an item twice, usually before and after some treatment. This is the ideal condition: by having before and after measurements of a treatment, you control for other potential differences in the items you measured, like performance between students.

Other times, you are measuring the difference between two sets of measured data, but the individual measurements in each dataset are unpaired (sometimes called independent). This was the case in our tests: different people contributed to each campaign, and different people live in town 1 and 2. With unpaired datasets, we lose the ability to control for differences between individuals, so we'll likely need more data to achieve statistical significance.

Unpaired datasets come in all flavors. Depending on whether the sizes of the sets are equal or unequal, and depending on whether the variances of both sets are equal, you will run different versions of an unpaired T-Test. In our case, we made no assumptions about the sizes of our datasets, and no assumptions on their variances, either. So we went with an unpaired, unequal size, unequal variance test. That's Welch's T-Test.

As with all life decisions, if you want more details, check out the Wikipedia article on T-Tests. There are implementations of paired T-Tests and unpaired ones in scipy. The unequal variance case was not available in scipy when we wrote this, which is why we included welchsttest.py. Enjoy it!
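Here is a short sketch of why pairing matters, using made-up exam scores for six students before and after a course (the numbers are hypothetical and chosen only for illustration). It runs the same data through scipy.stats.ttest_rel (paired) and scipy.stats.ttest_ind with equal_var=False (unpaired, unequal variance; this option exists in recent SciPy releases).

```python
from scipy import stats

# Hypothetical scores: the same six students, measured before and after the course.
# Students differ a lot from one another, but each one improves by a few points.
before = [55, 62, 70, 48, 81, 66]
after = [58, 66, 73, 52, 84, 70]

# Paired test: looks at each student's own improvement
paired_p = stats.ttest_rel(before, after)[1]

# Unpaired test: treats the two lists as unrelated groups of students
unpaired_p = stats.ttest_ind(before, after, equal_var=False)[1]

print("Paired T-Test p-value:", paired_p)
print("Unpaired (Welch's) T-Test p-value:", unpaired_p)
```

The paired test sees a consistent improvement of 3 to 4 points per student, while the unpaired test has to find that small shift amid the much larger differences between students, so it needs far more data to reach significance.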

T-Test Assumptions we Broke :(
We've managed to sound like smartypantses that do all the right things until this moment, but now we have to admit we broke a few rules. The math behind T-Tests makes assumptions about the datasets that make it easier to achieve statistical significance if those assumptions are true. The big assumption is that the data we used came from a normal distribution.

The first thing we should have done is check whether or not our data is actually normal. Luckily, the fine scipy folks have implemented the Shapiro-Wilk test for normality. This test calculates a p-value that, if low enough (usually < 0.05), tells us there is a low chance the distribution is normal.
import scipy.stats

# shapiro() returns (test statistic, p-value); we only print the p-value
print("Town 1 Shapiro-Wilks p-value", scipy.stats.shapiro(town1_heights)[1])

With a p-value of .380, we don't have enough evidence that our town heights are not normally distributed, so it's probably fine to run Welch's T-Test.

Exercise: Test the campaign contribution datasets for normality. We found them to not be normal (p = .003 for Obama and .014 for McCain), which means we likely broke the normality assumption of Welch's T-Test. The statistics police are going to be paying us a visit.

This turns out to be OK for two reasons: T-Tests are resilient to breaking of the normality assumption, and, if you're really serious about your statistics, there are nonparametric equivalents that don't make normality assumptions. They are more conservative since they can't make assumptions about the data, and thus likely require a larger sample size to reach significance. If you're alright with that, feel free to run the Mann-Whitney U nonparametric version of the T-Test, which has a wonderful name.
import scipy.stats

# mannwhitneyu() returns (U statistic, p-value); we only print the p-value
print("Mann-Whitney U p-value", scipy.stats.mannwhitneyu(town1_heights, town2_heights)[1])

Remember: we don't need to run the Mann-Whitney U test on our town data, since it didn't exhibit non-normality. And besides, the p-value is .254. That's still not significant. This makes sense: our less conservative Welch's test was unable to give us significance, so we don't expect a more conservative test to magically find significance.

Exercise: Since we shouldn't be using Welch's T-Test on the campaign contribution data, run the Mann-Whitney U test on the data. Is the difference between the Obama and McCain contributions still significant? We got a p-value of about 0, so you will still find the result to be statistically significant. A+ for you!
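The decision described in this section (check normality first, then pick a test) can be sketched as a small helper. The function name compare_samples and the rule "use Mann-Whitney U if either sample fails Shapiro-Wilk at the cutoff" are our own simplifications, not a rule from the course. It assumes a recent SciPy in which ttest_ind accepts equal_var=False and mannwhitneyu accepts alternative="two-sided"; p-values may differ from the ones quoted above depending on your SciPy version and on whether a one-sided or two-sided result is reported.

```python
from scipy import stats

def compare_samples(sample_a, sample_b, alpha=0.05):
    """Pick a test based on Shapiro-Wilk, then return (test name, p-value, significant?)."""
    # A low Shapiro-Wilk p-value is evidence against normality
    looks_normal = all(stats.shapiro(s)[1] >= alpha for s in (sample_a, sample_b))
    if looks_normal:
        test_name = "Welch's T-Test"
        p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)[1]
    else:
        test_name = "Mann-Whitney U"
        p_value = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")[1]
    # Significance is a yes/no answer at the chosen cutoff
    return test_name, float(p_value), bool(p_value < alpha)

town1_heights = [5, 6, 7, 6, 7.1, 6, 4]
town2_heights = [5.5, 6.5, 7, 6, 7.1, 6]

test_name, p_value, significant = compare_samples(town1_heights, town2_heights)
print(test_name, "p-value:", p_value, "significant:", significant)
```

With only a handful of measurements, the normality check itself has little power, so treat the helper's choice as a starting point rather than a verdict.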

Putting it All Together
So far, we've learned the steps to test a hypothesis:

- Compute summary statistics, like averages or medians, and see if these numbers match your intuition.
- Look at the distribution histograms or summary visualizations like boxplots to understand whether your hypothesis appears to be backed up by the data.
- If it's not immediately clear your hypothesis was wrong, test it using the appropriate statistical test to 1) quantify the effect size, and 2) ensure the data you observed couldn't have happened by chance.

There's a lot more to statistics than T-Tests, which compare two datasets' averages. Next, we'll cover correlation between two datasets using linear regression.

MIT OpenCourseWare http://ocw.mit.edu

Resource: How to Process, Analyze and Visualize Data


Adam Marcus and Eugene Wu

The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
