Behind the Statistics
Key notes
Last updated: 13 May 2016
Juan Klopper CC-BY
This work is licensed under a Creative Commons Attribution 4.0 International License.
Table of Contents:
Week 1: Getting things started by defining different study types
Getting to know study types
Observational and experimental studies
Getting to Know Study Types: Case Series
Case-control Studies
Cross-sectional studies
Cohort studies
Retrospective Cohort Studies
Prospective Cohort Studies
Experimental studies
Randomization
Blinding
Trials with independent concurrent controls
Trials with self-controls
Trials with external controls
Uncontrolled trials
Meta-analysis and systematic review
Meta-analysis
Systematic Review
What is the difference between a systematic review and a meta-analysis?
Week 2: Describing your data
The spectrum of data types
Definitions
Descriptive statistics
Inferential statistics
Population
Sample
Parameter
Statistic
Variable
Data point
Data types
Nominal categorical data
Ordinal categorical data
Numerical data types
Ratio
Summary
Discrete and continuous variables
Discrete data:
Continuous data:
Summarising data through simple descriptive statistics
Describing the data: measures of central tendency and dispersion
Measures of central tendency
Mean
Median
Mode
Measures of dispersion
Range
Quartiles
Percentile
The Interquartile Range and Outliers
Variance and standard deviation
Plots, graphs and figures
Box and whisker plots
Count plots
Histogram
Distribution plots
Violin plots
Scatter plots
Pie chart
Sampling
Introduction
Types of sampling
Simple random sampling
Systematic random sampling
Cluster random sampling
Stratified random sampling
Week 3: Building an intuitive understanding of statistical analysis
From area to probability
P-values
Rolling dice
Equating geometrical area to probability
Continuous data types
The heart of inferential statistics: Central limit theorem
Central limit theorem
Skewness and kurtosis
Skewness
Kurtosis
Combinations
Central limit theorem
Distributions: the shape of data
Distributions
Normal distribution
Sampling distribution
Z-distribution
t-distribution
Week 4: The important first steps: Hypothesis testing and confidence levels
Hypothesis testing
The null hypothesis
The alternative hypothesis
The alternative hypothesis
Two ways of stating the alternative hypothesis
The two-tailed test
The one-tailed test
Hypothesis testing errors
Type I and II errors
Confidence in your results
Introduction to confidence intervals
Confidence levels
Confidence intervals
Week 5: Which test should you use?
Introduction to parametric tests
Types of parametric tests
Student's t-test: Introduction
Types of t-tests
ANOVA
Linear Regression
Nonparametric testing for your non-normal data
Nonparametric tests
Nonparametric tests
Week 6: Categorical data and analyzing accuracy of results
Comparing categorical data
The chi-squared goodness-of-fit test
The chi-squared test for independence
Fisher's exact test
Sensitivity, specificity, and predictive values
Considering medical investigations
Sensitivity and specificity
Predictive values
Week 1: Getting things started by defining different study types
Getting to know study types
Do you know your cross-sectional from your cohort study? Your retrospective case-control series
from your dependent control trials? Study types can be very confusing. Yet it is essential to know
what type of study you are reading or planning to conduct. Defining a study into a specific type
tells us a lot about what we can learn from the outcomes, what strengths and weaknesses
underpin its design and what statistical analysis we can expect from the data.
There are various classification systems and it is even possible to combine aspects of different
study types to create new research designs. We'll start this course off with an intuitive
classification system that views studies as either observational or experimental (interventional).
Observational and experimental studies
In this first lecture I covered clinical study types, using the classification system that divides all
studies into either observational or experimental. I also took a look at meta-analyses and
systematic reviews.
Let's summarise the key characteristics of the different study types that I mentioned. The diagram
below gives an overview.
In observational studies:
subjects and variables pertaining to them are observed and described
no treatment or intervention takes place other than the continuation of normal work-flow, i.e. healthcare workers are allowed to carry on with their normal patient management or treatment plans
there are four main observational study types: case series, case-control studies, cross-sectional studies and cohort studies
In experimental studies:
subjects are subjected to treatments or interventions based on a predesignated plan
healthcare workers are not allowed to continue their routine care, but must alter their actions based on the design of the study
this usually results in at least two groups of patients or subjects that follow a different plan and these can then be compared to each other
the main idea behind an experimental study is to remove bias
if a study involves humans, this is known as a clinical trial
the main experimental study types are: trials with independent concurrent controls, trials with self-controls, trials with external controls, and uncontrolled trials
I also introduced the topic of meta-analyses and systematic reviews. A meta-analysis uses
pre-existing research studies and combines their results to obtain an overall conclusion. It aims to
overcome one of the most common problems that beset clinical research, namely small sample
sizes, resulting in underpowered results. A systematic review is a literature review that sums up
the best available information on a specific research question and includes results from research
into the specific field, published guidelines, expert (group) opinion and meta-analyses.
Now that you know how to distinguish between the various clinical studies, I'll cover each of the
study types in more depth in the upcoming lectures.
Getting to Know Study Types: Case Series
A case series is perhaps the simplest of all study types and reports a simple descriptive account
of a characteristic observed in a group of subjects. It is also known by the terms clinical series or
clinical audit.
A case series:
observes and describes subjects
can take place over a defined period or at an instant in time
is purely descriptive and requires no research hypotheses
is commonly used to identify interesting observations for future research or planning
Although simple by nature and design, case series are nevertheless important first steps in many
research areas. They identify numbers involved, i.e. how many patients are seen, diagnosed,
under threat, etc. and describe various characteristics regarding these subjects.
We are by nature biased and as humans have a tendency to remember only selected cases or
events. We are usually poor at seeing patterns in large numbers or over extended periods. By
examining case series (audits), we find interesting and sometimes surprising results and these
can lead to further research and even a change in management.
Paper referenced in video:
1. Donald, K. A., Walker, K. G., Kilborn, T., Carrara, H., Langerak, N. G., Eley, B., &
Wilmshurst, J. M. (2015). HIV Encephalopathy: pediatric case series description and
insights from the clinic coalface. AIDS Research and Therapy, 12, 1-10.
doi:10.1186/s12981-014-0042-7
Case-control Studies
Now that I have covered the topic of case-control studies in the video lecture, let's summarise and
expand on what we've learned.
A case-control study:
selects subjects on the basis of the presence (cases) or absence (controls) of an outcome or disease
looks back in time to find variables and risk factors that differ between groups
can attempt to determine the relationship between the exposure to risk factors (or any measured variable) and the disease
can include more than two groups
To illustrate these points, consider an example where patients undergo some form of invasive
surgical intervention. You might note that after the same surgical procedure, some develop
infection at the wound site and some do not. Those with the wound infection (termed surgical site
infection) make up the cases and those without, the controls. We can now gather data on various
variables such as gender, age, admission temperature, etc. and compare these between the two
groups. Note how the data on these variables all existed prior to the occurrence of the wound
infection. The study looks back in time and the data is collected retrospectively.
What is a drawback of case-control studies?
The main drawback is confounding, which refers to a false association between the exposure and
outcome. This occurs when there is a third variable, which we call the confounding factor, that is
associated with both the risk factor and the disease.
Let's consider another example. Several studies find an association between alcohol intake
(exposure) and heart disease (outcome). Here, a group of patients with heart disease will form the
cases and a group without heart disease, the controls, and we look back in time at their alcohol
consumption. If, by statistical analysis, we find a higher alcohol consumption in the heart disease
group than in the control group, we may think that drinking alcohol causes heart disease. But
a confounding factor, e.g. smoking, may be related to both alcohol intake and heart disease.
If the study does not consider this confounding factor, the relationship between the exposure and
outcome may be misinterpreted. The confounding factor, in this case smoking, needs to be
controlled in order to find the true association.
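The effect of a confounder can be illustrated with a small sketch. The counts below are hypothetical, invented purely for illustration, and the odds ratio (the odds of exposure among cases divided by the odds of exposure among controls) is one common measure of association that these notes do not otherwise name. In this made-up data the crude comparison suggests an association between alcohol and heart disease, yet within smokers and within non-smokers there is none.

```python
def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical counts, in the order:
# (cases who drink, cases who do not, controls who drink, controls who do not)
smokers = (80, 20, 40, 10)
non_smokers = (10, 40, 20, 80)

# Ignoring smoking means adding the two strata together
crude = tuple(s + n for s, n in zip(smokers, non_smokers))

crude_or = odds_ratio(*crude)
smokers_or = odds_ratio(*smokers)
non_smokers_or = odds_ratio(*non_smokers)

print("Crude odds ratio:", crude_or)
print("Odds ratio among smokers:", smokers_or)
print("Odds ratio among non-smokers:", non_smokers_or)
```

Splitting the data by the suspected confounder (stratification) is only one way of controlling for it, shown here because it needs nothing more than arithmetic.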
You can review the example of the case-control study in the paper I discussed in the lecture.
References:
1. Yung J, Yuen JWM, Ou Y, Loke AY. Factors Associated with Atopy in Toddlers: A
Case-Control Study. Tchounwou PB, ed. International Journal of Environmental Research
and Public Health. 2015;12(3):2501-2520. doi:10.3390/ijerph120302501.
Cross-sectional studies
Let's review the characteristics of cross-sectional studies.
A cross-sectional study:
identifies a population or sub-population rather than individuals
takes place at a point in time or over a (relatively) short period
can measure a range of variables across groups at the same time
is often conducted in the form of a survey
can be a quick, easy and cost-effective way of collecting information
can be included in other study designs such as case-control and cohort studies
is commonly used to measure prevalence of an outcome or disease, i.e. epidemiological studies
What are the potential drawbacks of cross-sectional studies?
Bias
Response bias is when an individual is more likely to respond if they possess a particular
characteristic or set of characteristics. For example, HIV negative individuals may be more
comfortable responding to a survey discussing their status compared to HIV positive individuals. A
variety of technical difficulties or even age may also influence responders. Once bias exists in the
group of responders, it can lead to skewing of the data and inappropriate conclusions can be drawn
from the results. This can have devastating consequences as these studies are sometimes used
to plan large scale interventions.
Separating Cause and Effect
Cross-sectional studies may not provide accurate information on cause and effect. This is
because the study takes place at a moment in time, and does not consider the sequence of
events. Exposure and outcome are assessed at the same time. In most cases we are unable to
determine whether the disease outcome followed the exposure, or the exposure resulted from the
outcome. Therefore it is almost impossible to infer causality.
You may find it useful to review the papers I discussed in the video, which are good examples of
cross-sectional studies.
References:
1. Lawrenson JG, Evans JR. Advice about diet and smoking for people with or at risk of
age-related macular degeneration: a cross-sectional survey of eye care professionals in
the UK. BMC Public Health. 2013;13:564. doi:10.1186/1471-2458-13-564.
2. Sartorius B, Veerman LJ, Manyema M, Chola L, Hofman K (2015) Determinants of
Obesity and Associated Population Attributability, South Africa: Empirical Evidence from a
National Panel Survey, 2008-2012. PLoS ONE 10(6): e0130218.
doi:10.1371/journal.pone.0130218
Cohort studies
As I outlined in the lecture, a cohort study:
begins by identifying subjects (the cohort) with a common trait such as a disease or risk factor
observes a cohort over time
can be conducted retrospectively or prospectively
Retrospective Cohort Studies
A retrospective study uses existing data to identify a population and exposure status. Since we are
looking back in time both the exposure and outcome have already occurred before the start of the
investigation. It may be difficult to go back in time and find the required data on exposure, as any
data collected was not designed to be used as part of a study. However, in cases where reliable
records are on hand, retrospective cohort studies can be useful.
Prospective Cohort Studies
In a prospective cohort study, the researcher identifies subjects comprising a cohort and their
exposure status at the beginning of the study. They are followed over time to see whether the
outcome (disease) develops or not. This usually allows for better data collection, as the actual
data collection tools are in place, with required data clearly defined.
The term cohort is often a source of confusion. It simply refers to a group of subjects and for the
purposes of research, they usually have some common trait. We often use this term when referring
to this type of study, but you will also note it in the other forms of observational studies. When used
there, it is simply a generic term. When used in the case of cohort studies it refers to the fact that
the data gathered for the research points to events that occurred after the groups (cohorts) were
identified.
To get back to our earlier example of wound infection patients (that we used in the case-control
section), the patients with and without wound infection could be considered cohorts and we
consider what happened to them after the diagnosis of their wound infection. We might then
consider length of hospital stay, total cost, or the occurrence of any events after the development
of the wound infection (or at least after the surgery for those without wound infection). The
defining fact, though, is that we are looking forward in time from the wound infection, in contrast to
case-control studies, where we look back at events before the development of the wound infection.
You may find it useful to review the paper I discussed in the video, which is a good example of a
cohort study.
Paper discussed in the video:
Le Roux, D. M., Myer, L., Nicol, M. P., & Zar, H. J. (2015). Incidence and severity of childhood
pneumonia in the first year of life in a South African birth cohort: the Drakenstein Child Health
Study. The Lancet Global Health, 3(2), e95-e103. doi:10.1016/S2214-109X(14)70360-2
Experimental studies
In experimental studies (as opposed to observational studies, which we discussed earlier), an
active intervention takes place. These interventions can take many forms such as medication,
surgery, psychological support and many others.
Experimental studies:
aim to reduce bias inherent in observational studies
usually involve two groups or more, of which at least one is the control group
have a control group that receives no intervention or a sham intervention (placebo)
Randomization
To reduce bias, true randomization is required. That means that every member of a population
must have an equal opportunity (random chance) to be included in a sample group. That
necessitates the availability of a full list of the population and some method of randomly selecting
from that list.
In practical terms this means that every subject that forms part of the trial must have an equal
opportunity of ending up in any of the groups. Usually it also means that all of these subjects are
taken from a non-selected group, i.e. in a non-biased way. For example, if we want to
investigate the effectiveness of a new drug on hypertension, we must not only be certain that all
patients have an equal opportunity to receive either the drug or a placebo, but that all the
participants are randomly selected as a whole. If all the participants come from a selected group,
say, from a clinic for the aged, there is bias in the selection process. In this case, the researchers
must report that their results are only applicable to this subset of the population.
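The idea that every subject in a trial has an equal opportunity of ending up in any group can be sketched in a few lines. This is a minimal illustration, not a description of how any particular trial was randomized: the function name `allocate`, the group labels and the subject numbers are all hypothetical, and the fixed seed is there only to make the example repeatable.

```python
import random


def allocate(subject_ids, groups=("treatment", "placebo"), seed=None):
    """Shuffle the subjects, then deal them out to the groups in turn."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering of the subjects is equally likely

    allocation = {group: [] for group in groups}
    for index, subject in enumerate(shuffled):
        allocation[groups[index % len(groups)]].append(subject)
    return allocation


# Twenty hypothetical subjects, numbered 1 to 20
allocation = allocate(range(1, 21), seed=42)
for group, members in allocation.items():
    print(group, sorted(members))
```

Note that this only covers allocation within the trial. It does nothing about the second point in the text, namely how the twenty subjects were selected from the wider population in the first place.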
Blinding
If the subjects do not know whether they are in the control group or not, this is termed blinding.
When the researchers are similarly unaware of the grouping, it is termed double-blinding. This
method is preferable but not always possible, e.g. in a surgical procedure. In these cases the
observer taking measurements after the intervention may be blinded to the intervention and the
surgeon is excluded from the data collection or measurements.
The pinnacle of clinical research is usually seen to be the randomized, double-blind controlled trial.
It often provides the strongest evidence of causation.
The control group can be set up in a variety of ways:
Trials with independent concurrent controls
In trials with independent concurrent controls, the controls are included in the trial at the same
time as the study participants, and data points are collected at the same time. In practical terms
this means that a participant cannot be in both groups, nor are monozygotic (identical) twins allowed.
Trials with self-controls
In trials with self-controls, the same subjects serve as both the control and the treatment group.
Data is collected on subjects before and after the intervention.
The most elegant subtype of this form of trial is the cross-over study. Two groups are formed, each
with their own intervention. Most commonly one group will receive a placebo. They form their own
controls, therefore data is collected on both groups before and after the intervention. After the
intervention and data collection, a period of no intervention takes place. Before intervention
resumes, individuals in the placebo group are swapped with individuals in the treatment group.
The placebo group then becomes the treatment group and the treatment group becomes the
placebo group.
Trials with external controls
Trials with external controls compare a current intervention group to a group outside of the
research sample. The most common external control is a historical control, which compares the
intervention group with a group tested at an earlier time. For example, a published paper can
serve as a historical control.
Uncontrolled trials
In these studies an intervention takes place, but there are no controls. All patients receive the
same intervention and the outcomes are observed. The hypothesis is that there will be varying
outcomes and reasons for these can be elucidated from the data. No attempt is made to evaluate
the intervention itself, as it is not being compared to either a placebo or an alternative form of
intervention.
You may find it useful to review the paper I discussed in the video, which is a good example of an
experimental study.
Paper mentioned in video:
1. Thurtell MJ, Joshi AC, Leone AC, et al. Cross-Over Trial of Gabapentin and Memantine as
Treatment for Acquired Nystagmus. Annals of Neurology. 2010;67(5):676-680.
doi:10.1002/ana.21991.
Meta-analysis and systematic review
We're almost at the end of week 1! I've covered observational and experimental studies in
previous videos, and I'm concluding this lesson with a discussion on meta-analyses and
systematic reviews.
Meta-analysis
A meta-analysis:
uses pre-existing research studies and combines their statistical results to draw an overall conclusion
centers around a common measurement such as finding an average or mean
is useful for combining individual studies of inadequate size or power to strengthen results
uses inclusion and exclusion criteria to select papers to be analysed
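To make the idea of combining results around a common measurement concrete, here is a deliberately simplified sketch. The three studies and their numbers are hypothetical, and weighting each study's mean by its sample size is an assumption made for illustration; published meta-analyses commonly use more refined weights, such as the inverse of each study's variance.

```python
# Each hypothetical study reports a sample size and a mean for the same measurement
studies = [
    {"name": "Study A", "n": 20, "mean": 5.0},
    {"name": "Study B", "n": 30, "mean": 6.0},
    {"name": "Study C", "n": 50, "mean": 7.0},
]

total_n = sum(study["n"] for study in studies)

# Larger studies contribute more to the pooled mean
pooled_mean = sum(study["n"] * study["mean"] for study in studies) / total_n

# For comparison: treating every study as equally important
unweighted_mean = sum(study["mean"] for study in studies) / len(studies)

print("Total participants:", total_n)
print("Pooled (sample-size weighted) mean:", pooled_mean)
print("Unweighted mean of study means:", unweighted_mean)
```

The pooled value rests on 100 participants rather than 20, 30 or 50, which is the sense in which combining small studies strengthens the result.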
What is a drawback of meta-analysis?
One possible drawback, which makes meta-analyses less useful, is the ever-present danger of
publication bias. Publication bias is well recognized and refers to the fact that through various
positive and negative incentives, it is much more likely to find positive results in the published
literature, i.e. statistically significant results. It is much less common to find negative results. Even
though meta-analysis is used to increase the power of a result and make it more generalizable, its
results may still be poor if the studies on which it is based are biased towards positive, statistically
significant results.
Systematic Review
How often do you come across research studies that contradict one another's findings? One
study reports that carbohydrates are bad for you, another study says carbohydrates are required
as part of a balanced diet. When looking for research evidence, we need to look beyond a single
study. This is where systematic reviews fit in.
A systematic review:
summarises a comprehensive amount of published research
helps you find a definite word on a research question
can include a meta-analysis
can use frameworks such as PRISMA to structure the review
What is the difference between a systematic review and a meta-analysis?
There is some overlap and not everyone sticks clearly to the strict definitions of these two types of
research (although, as I mentioned in the lesson, clear guidelines for both have been accepted by
most researchers and publishers). The aim of both is to collect and use previously published data.
Most systematic reviews include a meta-analysis, but they are rather like two circles in a
Venn diagram, with some overlap (intersection).
A meta-analysis is indeed the combination of previously published research. This combination can
then be used for re-analysis. Combining these results into bigger numbers may result in improved
results. It is often seen as a quantitative look at previously published data. There may or may not
be some narrative to the meta-analysis giving a little background information and knowledge
about the subject.
A systematic review also collects previously published work, but takes a more qualitative look
and is usually much more involved than a meta-analysis. It really aims to be the most definitive
word on a topic or research question. Certain procedures are followed and are clearly set out in the
design of the study, so as to minimise bias in the selection and analysis of the data. Objective
techniques are then used to analyse the data. The focus is on the magnitude of the effect, rather
than on statistical significance (meta-analysis). It adds a lot of detail and explanation of the topic at
hand and includes a lot of narrative.
There is then also the narrative review. These are found in publications such as the various North
American Clinics journals. There is reference to previously published data, but the trend is more
towards the style of a textbook.
Next up, we've developed a practice quiz for you to check your understanding. Good luck!
Papers mentioned in the video:
1. Geng L, Sun C, Bai J. Single Incision versus Conventional Laparoscopic Cholecystectomy
Outcomes: A Meta-Analysis of Randomized Controlled Trials. Hills RK, ed. PLoS ONE.
2013;8(10):e76530. doi:10.1371/journal.pone.0076530.
Week 2: Describing your data
The spectrum of data types
Definitions
There are eight key definitions which I introduced in the video lecture that I will be using
throughout the rest of this course.
Descriptive statistics
The use of statistical tools to summarize and describe a set of data values
Human beings usually find it difficult to create meaning from long lists of numbers or words
Summarizing the numbers or counting the occurrences of words and expressing that summary with single values makes much more sense to us
In descriptive statistics, no attempt is made to compare any data sets or groups
Inferential statistics
The investigation of specified elements which allow us to make inferences about a larger population (i.e., beyond the sample)
Here we compare groups of subjects or individuals
It is normally not possible to include each subject or individual in a population in a study, therefore we use statistics and infer that the results we get apply to the larger population
Population
A group of individuals that share at least one characteristic in common
On a macro level, this might refer to all of humanity
At the level of clinical research, this might refer to every individual with a certain disease, or risk factor, which might still be an enormous number of individuals
It is quite possible to have a small population, e.g. in the case of a very rare condition
The findings of a study are inferred to a larger population; we use those findings to manage the population to which they apply
Sample
A sample is a selection of members within the population (I'll discuss different ways of selecting a sample a bit later in this course)
Research is conducted using that sample set of members and any results can be inferred to the population from which the sample was taken
This use of statistical analysis makes clinical research possible as it is usually near impossible to include the complete population
Parameter
A statistical value that is calculated from all the values in a whole population is termed a parameter
If we knew the age of every individual on earth and calculated the mean or average age, that age would be a parameter
Statistic
A statistical value that is calculated from all the values in a sample is termed a statistic
The mean or average age of all the participants in a study would be a statistic
Variable
There are many ways to define a variable, but for use in this course I will refer to a variable as a group name for any data values that are collected for a study
Examples would include age, presence of risk factor, admission temperature, infective organism, systolic blood pressure
These invariably become the column names in a data spreadsheet, with each row representing the findings for an individual in a study
Data point
I refer to a data point as a single example value for a variable, i.e. a patient might have a systolic blood pressure (the variable) of 120 mmHg (the data point)
Let's use this knowledge in a quick example: Say we want to test the effectiveness of
levothyroxine as treatment for hypothyroidism in South African subjects between the ages of
18 and 24. It will be physically impossible to find every individual in the country with hypothyroidism
and collect their data. However, we can collect a representative sample of the population. For
example, we can calculate the average (sample statistic) of the thyroid stimulating hormone level
before and after treatment. We can use this average to infer results about the population parameter.
In this example thyroid stimulating hormone level would be a variable and the actual numerical
value for each patient would represent individual data points.
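The relationship between a parameter and a statistic can be sketched with simulated numbers. Everything below is hypothetical: the "population" is simply ten thousand randomly generated ages between 18 and 24, chosen so that, unlike in real research, the whole population is available and the parameter can actually be calculated.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: the age (the variable) of every individual
population_ages = [rng.randint(18, 24) for _ in range(10_000)]

# Parameter: calculated from every value in the whole population
parameter = statistics.mean(population_ages)

# Sample: 100 individuals selected at random from the population
sample_ages = rng.sample(population_ages, 100)

# Statistic: calculated from the values in the sample only
statistic = statistics.mean(sample_ages)

print("Population mean age (parameter):", parameter)
print("Sample mean age (statistic):", statistic)
```

The two values will usually be close but not identical; how far a statistic can be trusted as a stand-in for the parameter is the subject of inferential statistics.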
Data types
I will be using two classification systems for data - categorical and numerical, each of which has
two sub-divisions.
Categorical (including nominal and ordinal data) refers to categories or things, not mathematical values
Numerical (further defined as being either interval or ratio data) refers to data which is about measurement and counting
In short, categorical data types refer to words. Although words can be counted, the words
themselves only represent categories. This can be said for diseases. Acute cholecystitis (infection
of the gallbladder) and acute cholangitis (infection of the bile ducts) are both diseases of the biliary
(bile) tract. As words, these diseases represent categorical entities. Although I can count how
many patients have one of these conditions, the diseases themselves are not numerical entities.
The same would go for gender, medications, and many other examples.
Just to make things a bit more difficult, actual numbers are sometimes categorical and not
numerical. A good, illustrative example would be choosing from a rating system for indicating the
severity of pain. I could ask a patient to rate the severity of the pain they experience after surgery
on a scale from 0 (zero) to 10 (ten). These are numbers, but they do NOT represent numerical
values. I can never say that a patient who chooses 6 (six) has twice as much pain as someone
who chooses 3 (three). There is no fixed difference between each of these numbers. They are not
quantifiable. As such, they represent categorical values.
As the name implies, numerical data refers to actual numbers. We distinguish numerical number
values from categorical number values in that there is a fixed difference between them. The
difference between 3 (three) and 4 (four) is the exact same difference as that between 101
(one-hundred-and-one) and 102 (one-hundred-and-two).
I will also cover another numerical classification type: discrete and continuous variables. Discrete
values, as the name implies, exist as little islands which are not connected (no land between them).
Think of the roll of a die. With a normal six-sided die you cannot roll a three-and-a-half.
Continuous numerical values on the other hand have (for practical purposes) many values
in between other values. They are infinitely divisible (within reasonable limits).
Nominal categorical data
Nominal categorical data:
are data points that either represent words ("yes" or "no") or concepts (like gender or heart disease) which have no mathematical value
have no natural order to the values or words - i.e. nominal - for example: gender, or your profession
Be careful of categorical concepts that may be perceived as having some order. Usually these
are open to interpretation. Someone might suggest that heart disease is worse than kidney
disease or vice versa. This, though, depends on so many points of view. Don't make things too
complicated. In general, it is easy to spot the nominal categorical data type.
Readings referred to in this video as examples of nominal categorical data:
1. De Moraes, A. G., Racedo Africano, C. J., Hoskote, S. S., Reddy, D. R. S., Tedja, R.,
Thakur, L., Smischney, N. J. (2015). Ketamine and Propofol Combination (Ketofol) for
Endotracheal Intubations in Critically Ill Patients: A Case Series. The American Journal of
Case Reports, 16, 81-86. doi:10.12659/AJCR.892424
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4332295/
Ordinal categorical data
If categorical data have some natural order or a logical ranking to the data points, it is termed
ordinal categorical data, i.e. they can be placed in some increasing or decreasing order.
I gave the example of a pain score from 1 (one) to 10 (ten). Even though these are numbers, no
mathematical operation can be performed on these digits. They are ordered in magnitude from 1
to 10. But there is no standardized measurement of these rankings and therefore no indication
that the interval between the specific scores is of the same value.
Other common examples include survey questions where a participant can rate their agreement
with a statement on a scale, say 1 (one), indicating that they don't agree at all, to 5 (five),
indicating that they fully agree. Likert style answers such as totally disagree, disagree, neither
agree nor disagree, agree and totally agree can also be converted to numbers, i.e. 1 (one) to 5
(five). Although they can be ranked, they still have no inherent numerical value and as such
remain ordinal categorical data values.
References:
1. Lawrenson JG, Evans JR. Advice about diet and smoking for people with or at risk of
age-related macular degeneration: a cross-sectional survey of eye care professionals in
the UK. BMC Public Health. 2013;13:564. doi:10.1186/1471-2458-13-564.
Numerical data types
As opposed to categorical data types (words, things, concepts, rating numbers), numerical data
types involve actual numbers. Numerical data is quantitative data - for example, the weights of the
babies attending a clinic, the doses of medicine, or the blood pressure of different patients. They
can be compared and you can do calculations on the values. From a mathematical point of view,
there are fixed differences between values. The difference between a systolic blood pressure
value of 110 and 120 mmHg is the same as between 150 and 160 mmHg (being 10 mmHg).
There are two types of numerical data - interval and ratio.
Interval
With interval data, the difference between each value is the same, which means the definition
I used above holds. The difference between 1 and 2 degrees Celsius is the same as the
difference between 3 and 4 degrees Celsius (there is a 1 degree difference). However,
temperatures expressed in degrees Celsius (or Fahrenheit) do not have a true zero because 0
(zero) degrees Celsius does not mean an absence of temperature. This means that with numerical
interval data (like temperature) we can order the data and we can add and subtract, but we cannot
divide and multiply the data (we can't do ratios without a true zero). 10 degrees plus 10 degrees is
20 degrees, but 20 degrees is not twice as hot as 10 degrees Celsius. Ratio type numerical data
requires a true zero.
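A quick calculation shows why ratios of interval data mislead. The sketch below converts the two Celsius temperatures from the paragraph above to kelvin, a temperature scale that does have a true zero; the kelvin scale is brought in here only for illustration and is not part of the lecture.

```python
import math


def celsius_to_kelvin(celsius):
    """Kelvin has a true zero; 0 degrees Celsius is 273.15 kelvin."""
    return celsius + 273.15


cold, warm = 10.0, 20.0  # degrees Celsius

# Differences are meaningful on an interval scale: 10 degrees on both scales
difference_celsius = warm - cold
difference_kelvin = celsius_to_kelvin(warm) - celsius_to_kelvin(cold)

# Ratios are not: the apparent "twice as hot" disappears on a true-zero scale
ratio_celsius = warm / cold
ratio_kelvin = celsius_to_kelvin(warm) / celsius_to_kelvin(cold)

print("Same difference on both scales:", math.isclose(difference_celsius, difference_kelvin))
print("Ratio in Celsius:", ratio_celsius)
print("Ratio in kelvin:", round(ratio_kelvin, 3))
```

On the true-zero scale, 20 degrees Celsius turns out to be only about 3.5% hotter than 10 degrees Celsius, not twice as hot.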
Ratio
This type applies to data that have a true 0 (zero), which means you can establish a meaningful
relationship between the data points as related to the 0 (zero) value, e.g. age from birth (0) or white
blood cell count or number of clinic visits (from 0). A systolic blood pressure of 200 mmHg is
indeed twice as high as a pressure of 100 mmHg.
Summary
Nominal categorical = naming and describing (e.g. gender)
Ordinal categorical = some ordering or natural ranking (e.g. pain scales)
Interval numerical = meaningful increments of difference (e.g. temperature)
Ratio numerical = can establish a baseline relationship between the data with the absolute 0 (e.g. age)
Why do we need to spend time distinguishing data?
You have to use very different statistical tests for different types of data, and without
understanding what data type the values (data points) reflect, it is easy to make false claims or use
incorrect statistical tests.
Discrete and continuous variables
Another important way to classify the data you are looking at is to distinguish between discrete and
continuous types of data.
Discrete data:
has a finite set of values
cannot be subdivided (rolling of the dice is an example, you can only roll a 6, not a 6.5!)
a good example is binomial values, where only two values are present, for example, a patient develops a complication, or they do not
Continuous data:
has infinite possibilities of subdivisions (for example, 1.1, 1.11, 1.111 etc.)
an example I used was the measure of blood pressure, and the possibility of taking ever more detailed readings depending on the sensitivity of the equipment that is being used
is mostly treated as continuous for practical reasons, i.e. although we can keep on halving the number of red blood cells per litre of blood and eventually end up with a single (discrete) cell, the very large numbers we are dealing with make red blood cell count a continuous data value
In the next lesson, I will look at why knowledge about the data type is so important. Spoiler: The
statistical tests used are very different depending on which types of data you are working with.
Summarising data through simple descriptive statistics
Describing the data: measures of central tendency and dispersion
Research papers show summarized data values, (usually) without showing the actual data set.
Instead, key methods are used to convey the essence of the data to the reader. This summary is
also the first step towards understanding the research data. As humans we cannot make sense of
large sets of numbers or values. Instead, we rely on summaries of these values to aid this
understanding.
There are three common methods of representing a set of values by a single number - the mean,
median and mode. Collectively, these are all measures of central tendency, or sometimes, point
estimates.
Most papers will also describe the actual size of the spread of the data points, also known as the
dispersion. This is where you will come across terms such as range, quartiles, percentiles,
variance and the more common standard deviation, often abbreviated as SD.
Measures of central tendency
Let's summarize what we've just learned about mean, median and mode. They are measures of
central tendency. As the name literally implies, they represent some value that tends to the middle
of all the data point values in a set. What they achieve in reality is to summarize a set of data
point values for us, replacing a whole set of values with a single value that is somehow
representative of all the values in the set.
As humans we are poor at interpreting meaning from a large set of numbers. It is easier to take
meaning from a data set if we can consider a single value that represents that set of numbers. In
order for all of this to be meaningful, the measure of central tendency must be an accurate
reflection of all the actual values. No one method would suffice for this purpose and therefore we
have at least these three.
Mean
- or average refers to the simple mathematical concept of adding up all the data point
  values for a variable in a data set and dividing that sum by the number of values in the set
- is a meaningful way to represent a set of numbers that do not have outliers (values that
  are way different from the large majority of numbers)
- Example: for the data set 3, 4, 5, 8, 10, 12 and 63, the average or mean is 15
  ((3 + 4 + 5 + 8 + 10 + 12 + 63) / 7 = 15).
Median
- is a calculated value that falls right in the middle of all the other values. That means that
  half of the values are higher than and half are lower than this value, irrespective of how
  high or low they are (what their actual values are)
- is used when there are values that might skew your data, i.e. a few of the values are
  much different from the majority of the values
- in the example above (under mean) the value 15 is intuitively a bit of an overestimation
  and only one of the values (63) is larger than it, making it somehow unrepresentative of
  the other values (3, 4, 5, 8, 10, and 12)
- Example: the calculation is based on whether there is an odd or even number
  of values. If odd, this is easy. There are seven numbers (odd). The number 8 appears in
  the data set and three of the values (3, 4, 5) are lower than 8 and three (10, 12, 63) are
  higher than 8, therefore the median is 8. In the case of an even number of values, the
  average of the middle two is taken to reach the median.
Mode
- is the data value that appears most frequently
- is used to describe categorical values
- returns the value that occurs most commonly in a data set, which means that some
  data sets might have more than one mode, leading to the terms bimodal (for two modes)
  and multimodal (for more than two modes)
- Example: think of a questionnaire in which participants could choose between values of
  0 to 10 to indicate their amount of pain after a procedure:
  Then arrange the numbers in order:
  It is evident now that most chose the value of 4, so 4 would be the mode.
Note: It would be incorrect to use mean and median values when it comes to categorical data.
Mode is most appropriate for categorical data types, and the only measure of central tendency
that can be applied to nominal categorical data types. In the case of ordinal categorical data such
as in our pain score example above, or with a Likert scale, a mean score of 5.5625 would be
meaningless. Even if we rounded this off to 5.6, it would be difficult to explain what .6 of a pain unit
is. If you consider it carefully, even the median suffers from the same shortcoming.
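The three measures can be checked with a few lines of Python. This is a minimal sketch that uses only the standard `statistics` module. The first list is the data set from the mean and median examples; the pain scores are hypothetical values made up for illustration and are not the questionnaire data from the lecture.

```python
import statistics

# Data set used in the mean and median examples.
values = [3, 4, 5, 8, 10, 12, 63]

mean_value = statistics.mean(values)      # sum of the values divided by 7
median_value = statistics.median(values)  # middle value once the data are sorted

# Hypothetical pain scores (0 to 10), invented for this sketch only.
pain_scores = [2, 4, 4, 7, 4, 5, 9, 4, 6, 3]

# multimode returns a list, because a data set can have more than one mode.
modes = statistics.multimode(pain_scores)

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

Note how the single large value (63) pulls the mean well above the median.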
Readings in this video
1. Mgelea E et al. Detecting virological failure in HIV-infected Tanzanian children. S Afr Med
   J. 2014;104(10):696-9. doi:10.7196/samj.7807
   http://www.samj.org.za/index.php/samj/article/view/7807/6241
2. Naidoo, S., Wand, H., Abbai, N., & Ramjee, G. (2014). High prevalence and incidence of
   sexually transmitted infections among women living in Kwazulu-Natal, South Africa. AIDS
   Research and Therapy, 11(1), 31. doi:10.1186/1742-6405-11-31
   http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168991/pdf/1742-6405-11-31.pdf
   You will note that they represented their age and duration analysis by the median. In their
   case they described a median age of 28, and a median duration of 12 months.
Measures of dispersion
There are several measures of dispersion such as range, quartiles, percentiles, variance and the
more common standard deviation (SD). Whereas measures of central tendency give us a
single-value representation of a data set, measures of dispersion summarize for us how spread
out the data set is.
Range
- refers to the difference between the minimum and maximum values
- is usually expressed by noting both the values
- is used when simply describing data, i.e. when no inference is called for
Quartiles
- divide the group of values into four equal quarters, in the same way that the median divides
  the data set into two equal parts
- have a zeroth value, which represents the minimum value in a data set, and a fourth
  quartile, which represents the maximum value
- have a first quartile, which represents the value in the data set that divides the set
  into a quarter of the values being smaller than the first quartile value and three-quarters
  being larger than that value
- have a third quartile, which represents the value in the data set that divides the set
  into three-quarters of the values being smaller than the third quartile value and one
  quarter being larger than that value
- have a second quartile value, which divides the data set into two equal sets and is nothing
  other than the median
Percentile
- looks at your data in finer detail and instead of simply cutting your values into quarters,
  you can calculate a value for any percentage of your data points
- turns the first quartile into the 25th percentile, the median (or second quartile) into the 50th
  percentile and the third quartile into the 75th percentile (and all of these are just different
  expressions of the same thing)
- also includes a percentile rank that gives the percentage of values that fall below any value
  in your set that you decide on, i.e. a value of 99 might have a percentile rank of 13,
  meaning that 13% of the values in the set are less than 99 and 87% are larger than 99
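Quartiles and percentile ranks can be sketched as follows. Two things here are assumptions rather than part of the notes: the choice of the "inclusive" interpolation method in `statistics.quantiles` (statistical packages use several slightly different conventions, so cut points can differ a little between tools), and the helper name `percentile_rank`, which is made up for this example.

```python
import statistics

values = [3, 4, 5, 8, 10, 12, 63]

# n=4 asks for the three cut points that split the data into quarters.
# method="inclusive" is one of several accepted conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")


def percentile_rank(data, x):
    """Percentage of values in data that are strictly less than x."""
    below = sum(1 for v in data if v < x)
    return 100 * below / len(data)


print("first quartile (25th percentile):", q1)
print("second quartile (median, 50th percentile):", q2)
print("third quartile (75th percentile):", q3)
print("percentile rank of 10:", round(percentile_rank(values, 10), 1))
```

The second quartile returned here is the same value as the median calculated earlier.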
The Interquartile Range and Outliers
The interquartile range (IQR) is the difference between the values of the first and third quartiles. A
simple subtraction. It is used to determine statistical outliers.
Extreme or atypical values which fall far out of the range of data points are termed outliers and
can be excluded.
For example:
Remember our initial sample values from the lecture (3, 4, 5, 8, 10, 12 and 63)?
With this small data set we can intuitively see that 63 is an outlier. When data sets are much larger
this might not be so easy, and outliers can be detected by multiplying the interquartile range (IQR)
by 1.5. This value is subtracted from the first quartile and added to the third quartile. Any value in
the data set that is lower than the first result or higher than the second can be considered a
statistical outlier.
Outlier values will have the biggest impact on the calculation of the mean (rather than on the mode or
median). Such values can be omitted from analysis if it is reasonable to do so (e.g. incorrect data
input or machine error) and the researcher states that this was done and why. If the value(s) is/
are rechecked and confirmed as valid, special statistical techniques can help reduce the skewing
effect.
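The 1.5 times IQR rule can be sketched in a few lines. The function name `iqr_outliers` and the choice of the "inclusive" quartile method are assumptions made for this illustration; other quartile conventions move the cut-offs slightly.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr  # values below this are flagged
    upper_fence = q3 + k * iqr  # values above this are flagged
    return [v for v in data if v < lower_fence or v > upper_fence]


values = [3, 4, 5, 8, 10, 12, 63]
print(iqr_outliers(values))
```

For this data set only the value 63 falls outside the fences, which matches the intuition described above.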
Variance and standard deviation
The method of describing the extent of dispersion or spread of data values in relation to the mean
is referred to as the variance. We use the square root of the variance, which is called the standard
deviation (SD).
- Imagine all the data values in a data set are represented by dots on a straight line, i.e. the
  familiar x-axis from graphs at school. A dot can also be placed on this line representing
  the mean value. Now the distance between each point and the mean is taken and then
  averaged, so as to get an average distance of how far all the points are from the mean.
  Note that we want distance away from the mean, i.e. not negative values (some values
  will be smaller than the mean). For this mathematical reason all the differences are
  squared, resulting in all positive values.
- The average of all these squared values is the variance. The square root of this is then the SD,
  which can be thought of as the typical distance that the data points are away from the mean.
- As an illustration, the data values of 1, 2, 3, 20, 38, 39, and 40 have a much wider spread
  (standard deviation) than 17, 18, 19, 20, 21, 22, and 23. Both sets have an average of
  about 20, but the first has a much wider spread or SD. When comparing the results of two
  groups we should always be circumspect when large standard deviations are reported
  and especially so when the values of the standard deviations overlap for the two groups.
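The two illustrative data sets can be compared directly. This sketch uses the population forms `pvariance` and `pstdev`, which divide by the number of values as in the description above. Treat that choice as an assumption: for sample data, most statistical software reports `statistics.stdev`, which divides by one less than the number of values.

```python
import statistics

wide = [1, 2, 3, 20, 38, 39, 40]
narrow = [17, 18, 19, 20, 21, 22, 23]

for name, data in (("wide", wide), ("narrow", narrow)):
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)  # average of squared distances from the mean
    sd = statistics.pstdev(data)           # square root of the variance
    print(f"{name}: mean={mean:.2f} variance={variance:.2f} SD={sd:.2f}")
```

Both means are close to 20, but the SD of the first set is several times larger than that of the second.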
Readings referred to in this video:
1. Mgelea E et al. Detecting virological failure in HIV-infected Tanzanian children. S Afr Med
   J. 2014;104(10):696-9. doi:10.7196/samj.7807
   http://www.samj.org.za/index.php/samj/article/view/7807/6241
2. Naidoo, S., Wand, H., Abbai, N., & Ramjee, G. (2014). High prevalence and incidence of
   sexually transmitted infections among women living in Kwazulu-Natal, South Africa. AIDS
   Research and Therapy, 11(1), 31. doi:10.1186/1742-6405-11-31
   http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168991/pdf/1742-6405-11-31.pdf
Plots, graphs and figures
Most published articles, poster presentations and indeed almost all research presentations make
use of graphs, plots and figures. The graphical representation of data allows for compact, visual
and information-rich consumption of data.
It is invariably much easier to understand complex categorical and numerical data when it is
represented in pictures. Now that you have a good understanding of the different data types, we
will take a look at the various ways to graph them.
Box and whisker plots
A box and whisker plot provides us with information on both the measures of central tendency as
well as measures of spread. It takes the list of numerical data point values for each group of a
categorical variable and makes use of quartiles to construct a rectangular block.
In the example above we see three categorical groups (Group A, B and C) on the x-axis and some
numerical data type on the y-axis. The rectangular block has three lines. The bottom line (bottom
of the rectangle if drawn vertically) represents the first quartile value for the list of numerical values
and the top line (top of the rectangle) indicates the third quartile value. The middle line represents
the median.
The whiskers can represent several values and the authors of a paper should make it clear what
their whisker values represent. Possible values include: minimum and maximum values, values
beyond which statistical outliers are found (one-and-a-half times the interquartile range below and
above the first and third quartiles), one standard deviation below and above the mean, or a variety
of percentiles (2nd and 98th or 9th and 91st).
Some authors also add the actual data points to these plots. The mean and standard deviation
can also be added as we can see in the graph below (indicated by dotted lines).
For both graphs above the whiskers indicate the minimum and maximum values. Note clearly how
the box and whisker plot gives us an idea of the spread and central tendency of numerical data
points for various categories.
Count plots
Count plots tell us how many times a categorical data point value or discrete numerical value
occurred. Count plots are often referred to as bar plots, but the term bar plot is often misused.
This count plot is taken from a famous data set and represents the number of passengers on the
Titanic divided by age and gender.
Histogram
A histogram is different from a count plot in that it takes numerical values on the x-axis. From
these, so-called bins are constructed. A bin represents a minimum and a maximum cut-off value.
Think of the counting numbers 0 to 20. Imagine now that I have a list of 200 values, all taken from 0 to
20, something like 1, 15, 15, 13, 12, 20, 13, 13, 13, 14, ... and so on. I can construct bins with a
size of five, i.e. 0 to 5, 6 to 10, 11 to 15, and 16 to 20. I can now count how many of my list of 200
fall into each bin. From the abbreviated list here, there are already 8 values in the 11 to 15 bin. The
size of the bins is completely arbitrary and the choice is up to the researcher(s) involved in a
project.
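The binning step behind a histogram can be sketched by hand. The ten values are the ones quoted in the paragraph above; the helper `bin_label` is a made-up name for this illustration, and a plotting library would normally do this counting for you.

```python
from collections import Counter

# The first ten values quoted in the text (the full list of 200 is not given).
values = [1, 15, 15, 13, 12, 20, 13, 13, 13, 14]

# Each bin is a (minimum, maximum) pair, both ends included.
bins = [(0, 5), (6, 10), (11, 15), (16, 20)]


def bin_label(value):
    """Return the label of the bin that a value falls into."""
    for low, high in bins:
        if low <= value <= high:
            return f"{low}-{high}"
    raise ValueError(f"{value} falls outside every bin")


counts = Counter(bin_label(v) for v in values)

for low, high in bins:
    label = f"{low}-{high}"
    print(label, counts[label])
```

Changing the bin boundaries changes the counts, which is why the choice of bin size affects how a histogram looks.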
The graph above actually adds what is termed a rug plot at the bottom, showing the actual data
points, which then represents how many there are in each bin.
Distribution plots
These plots take histograms to the next level and, through mathematical equations, give us a
visual representation of the distribution of the data.
Note that, as with the histogram, the x-axis is a numerical variable. If you look closely you'll see
the strange values on the y-axis. Later in the course we will learn that it is from these graphs that
we calculate p-values. For now, it is clear that they give us a good indication of what the
distribution shape is like. For the graph above, showing densities for three groups, we note that
some values (at around 50) occur much more commonly than values at 20 or 80, as indicated by
the height of the lines at each of these x-axis values. The distributions here are all nearly
bell-shaped, called the normal distribution.
Violin plots
Violin plots combine box and whisker plots and density plots. They actually take the density plots,
turn them on their sides and mirror them.
In the graph above we have dotted lines indicating the median and first and third quartiles as with
box and whisker plots, but we get a much better idea of the distribution of the numerical values.
Scatter plots
Scatter plots combine sets of numerical values. Each dot has a value from two numerical data
point sets, so as to make an x- and a y-coordinate. Think of a single patient with a white cell count
and a red cell count value.
From the graph above it is clear that both axes have numerical values. Through mathematical
equations, we can even create lines that represent all these points. These are useful for creating
predictions. Based on any value on the x-axis, we can calculate a predicted value on the y-axis.
Later we will see that this is a form of linear regression and we can use it to calculate how well
two sets of values are correlated. The graph below shows such a line and even adds a histogram
and density plot for each of the two sets of numerical variables.
Pie chart
Lastly, we have to mention the poor old pie chart. Often frowned upon in scientific circles, it
nonetheless has its place.
The plot above divides up each circle by how many of each of the values 1 through 5 occurred in
each data set.
Sampling
Introduction
This course tackles the problem of how healthcare research is conducted, but have you ever
wondered, on a very fundamental level, why healthcare research is conducted? Well, the why is
actually quite easy. We would like to find satisfactory answers to questions we have about human
diseases and their prevention and management. There are endless numbers of questions. The
how is more tricky.
To answer questions related to healthcare, we need to investigate, well, humans. Problem is, there
are so many of us. It is almost always impossible to examine all humans, even when it comes to
some pretty rare diseases and conditions. Can you imagine the effort and the cost?
To solve this problem, we take only a (relatively) small group of individuals from a larger population
that contains people with the disease or trait that we would like to investigate. When we analyse
the data pertaining to the sample selection and get answers to our questions, we need to be sure
that these answers are useful when used in managing everyone who was not in the sample.
In order to do this, we must be sure that the sample properly reflects the larger population. The
larger the sample size, the higher the likelihood that the results of our analyses can correctly be
inferred to the population. When a sample is not properly representative of the population to which
the results will be inferred, some sort of bias was introduced in the selection process of the sample.
All research must strive to minimize bias, or if it occurred, to properly account for it.
This section deals with a few of the methods that are used to properly sample participants for
studies from larger populations.
Types of sampling
Simple random sampling
In simple random sampling a master list of the whole population is available and each individual on
that list has an equal likelihood of being chosen to be part of the sample.
This form of sampling can be used in two settings. On a larger scale we might have a master list
of patients (or people with a common trait) who we would like to investigate. We can draw from
that list using simple random sampling. On a smaller scale, we might already have a list of
participants for a clinical trial. We now need to divide them into different groups. A good example
would be dividing our participants into two groups, group one receiving an active drug and group
two, a placebo. Each individual participant must have an equal likelihood of getting either treatment.
This can be achieved by simple random sampling.
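Both settings can be sketched with Python's `random` module. The master list of 200 patient identifiers, the sample size of 20 and the fixed seed are all hypothetical choices made so that the example is small and repeatable; a real trial would follow its own randomisation protocol.

```python
import random

rng = random.Random(42)  # fixed seed only so that the sketch is repeatable

# Setting 1: draw a sample from a (hypothetical) master list of the population.
master_list = [f"patient_{i:03d}" for i in range(1, 201)]
sample = rng.sample(master_list, k=20)  # every patient is equally likely to be picked

# Setting 2: divide the chosen participants into two groups at random.
shuffled = sample[:]   # copy, so the original sample order is left untouched
rng.shuffle(shuffled)
active_group = shuffled[:10]
placebo_group = shuffled[10:]

print("active drug:", active_group)
print("placebo:", placebo_group)
```

Shuffling and then splitting the list is one simple way to give each participant the same chance of landing in either group while keeping the groups the same size.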
Systematic random sampling
In systematic random sampling, the selection process steps through the master list at a fixed,
predetermined interval, i.e. every 10th or 100th individual on the list is selected.
We can again consider two scenarios. In the first, we are dealing again with finding participants
for a study and in the second, we already have our participants selected and now need to divide
them into groups.
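A minimal sketch of systematic sampling is shown below. Choosing the starting position at random within the first interval is an assumption added here (it is a common way to keep the selection random); the function name and the list of 100 numbered individuals are hypothetical.

```python
import random


def systematic_sample(master_list, step, rng):
    """Select every step-th individual, starting at a random offset."""
    start = rng.randrange(step)  # random position within the first interval
    return master_list[start::step]


rng = random.Random(7)              # fixed seed only for repeatability
master_list = list(range(1, 101))   # hypothetical master list of 100 individuals
sample = systematic_sample(master_list, step=10, rng=rng)
print(sample)
```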
Cluster random sampling
In cluster sampling, the groups of individuals that are included are somehow clustered, i.e. all in
the same space, location, or allowed time-frame. There are many forms of cluster random
sampling. We often have to deal with the fact that a master list simply does not exist or is too
costly or difficult to obtain. Clustering groups of individuals greatly simplifies the selection process.
We could even see the trivial case of a case series or case-control series as having made use of
clustering. A study might compare those with and without a trait, say a postoperative wound
complication. The sample is taken from a population who attended a certain hospital over a
certain time period.
Stratified random sampling
In stratified sampling individuals are chosen because of some common, mutually exclusive trait.
The most common such trait is gender, but it may also be socio-economic class, age, and many
others.
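One common way of carrying out stratified sampling is to split the master list into strata by the trait and then draw a simple random sample from each stratum. The sketch below assumes that approach and samples the same fraction from every stratum; the function name, the 10% fraction and the made-up population of 90 people are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(people, key, fraction, rng):
    """Draw the same fraction at random from every stratum defined by key."""
    strata = defaultdict(list)
    for person in people:
        strata[person[key]].append(person)  # mutually exclusive groups

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(1)  # fixed seed only for repeatability
# Hypothetical population: 60 women and 30 men.
people = [{"id": i, "gender": "F" if i % 3 else "M"} for i in range(90)]
sample = stratified_sample(people, key="gender", fraction=0.1, rng=rng)

print(sum(p["gender"] == "F" for p in sample), "women,",
      sum(p["gender"] == "M" for p in sample), "men")
```

Sampling within each stratum keeps the proportions of the trait in the sample the same as in the population.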
Week 3: Building an intuitive understanding of statistical analysis
From area to probability
P-values
When you read medically related research papers you are likely to come across the concept of
probability and the term p-value, along with the gold standard of statistical significance: 0.05.
What is the p-value?
- The p-value explains a probability of an event occurring
- It is based on the calculation of a geometrical area
- The mathematics behind a p-value draws a curve and simply calculates the area under a
  certain part of that curve
Rolling dice
I explained the notion of probability by using the common example of rolling dice:
- For each die, there is an equal likelihood of rolling a one, two, three, four, five, or a six
- The probability of rolling a one is one out of six or 16.67% (it is customary to write
  probability as a fraction (between zero and one) as opposed to a percentage, so let's
  make that 0.1667)
- It is impossible to have a negative probability (a less than 0% chance of something
  happening) or a probability of more than one (more than a 100% chance of something
  happening)
- If we consider all probabilities in the rolling of our die, they add up to one (0.1667 times six
  equals one, allowing for rounding)
- This is our probability space, nothing exists outside of it
- By looking at it from a different (more of a medical statistics) point of view, I could roll a
  five and ask the question: "What was the probability of finding a five?", to which we
  would answer, p = 0.1667
- I could also ask: "What is the likelihood of rolling a five or more?", to which the answer is,
  p = 0.333
Example of rolling a pair of dice:
- We hardly ever deal with single participants in a study, so let's ramp things up to a pair of
  dice
- If you roll a pair of dice, adding the values that land face-up will leave you with possible
  values between two and 12
- Note that there are 36 possible outcomes (using the fact that rolling, for example, a one
  and a six is not the same as rolling a six and a one)
- Since there are six ways of rolling a total of seven, the chances are six in 36 or 0.1667
- Rolling two sixes or two ones is less likely at one out of 36 each, or 0.0278 (a 2.78%
  chance)
We can make a chart of these called a histogram. It shows how many times each outcome can
occur. You'll note the actual outcomes on the horizontal axis and the number of ways of achieving
each outcome on the vertical axis.
Equating geometrical area to probability
We could also chart a probability plot. So, instead of the actual number of times, we chart the
probability on the y-axis. It is just the number of occurrences divided by the total number of
outcomes.
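The counts and probabilities for a pair of dice can be enumerated directly. This is a small sketch using only the standard library; the variable names are chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes: (1, 6) and (6, 1) are counted separately.
outcomes = list(product(range(1, 7), repeat=2))

# Number of ways of rolling each total (the heights in the histogram).
ways = Counter(a + b for a, b in outcomes)

# Probability of each total: number of ways divided by the number of outcomes.
probability = {total: Fraction(n, len(outcomes)) for total, n in ways.items()}

for total in sorted(ways):
    print(total, ways[total], round(float(probability[total]), 4))
```

The printed table has its largest count at a total of seven and its smallest counts at two and 12, and the probabilities add up to one.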
Continuous data types
We started this lesson off by looking at discrete outcomes, i.e. the rolling of dice. The outcomes
were discrete by way of the fact that no values exist between the whole numbers two to 12 (rolling
two dice). That made the shape of the probability graphs quite easy, with a base width of one and
all in the shape of little rectangles. Continuous variables on the other hand are considered to be
infinitely divisible, which makes it very difficult to measure the geometric width of those areas, not
to mention the fact that the top ends of the little rectangles are curved and not straight.
Integral calculus solves the problem of determining the area of an irregular shape (as long as we
have a nice mathematical function for that shape). Our bigger problem is the fact that (for
continuous variables) it is no longer possible to ask what the probability of finding a single value
(outcome) is. A single value does not exist as in the case of discrete data types. Remember that
with continuous data types we can (in theoretical terms at least) infinitely divide the width of the
base. Now, we can only ask what the probability (area under the curve) is between two values, or
more commonly what the area is for a value larger than or smaller than a given value (stretching
out to positive or negative infinity on either side).
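For a continuous variable, the area between two values can be read off a cumulative distribution function. The sketch below assumes a normally distributed variable with a mean of 120 and a standard deviation of 15; these numbers are hypothetical and only serve to illustrate the idea.

```python
from statistics import NormalDist

# Hypothetical continuous variable: normal, mean 120, standard deviation 15.
curve = NormalDist(mu=120, sigma=15)

# cdf(x) is the area under the curve from negative infinity up to x.
area_between = curve.cdf(135) - curve.cdf(105)  # between two values
area_above = 1 - curve.cdf(150)                 # larger than a given value
area_below = curve.cdf(90)                      # smaller than a given value

print(round(area_between, 4))
print(round(area_above, 4))
print(round(area_below, 4))
```

The area at exactly one value, say `curve.cdf(120) - curve.cdf(120)`, is zero, which is why we can only ask about ranges of values.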
So, how does this work?
The graph below (not to scale) illustrates the p-value. Let's go back to the example of researching
the white cell count of two groups of patients. Imagine that group one has a certain average white
cell count and so does group two. There is a difference between these averages. The question is
whether this difference is statistically significant. The mathematics behind the calculation is going
to use some values calculated from the data point values and represent them in a (bell-shaped) curve
(as in the graph below).
If you chose a p-value of less than 0.05 to indicate a significant difference and you decided in
your hypothesis that one group will have an average higher than the other, the maths will work out
a cut-off on the x-axis which will indicate an area under the curve (the green) of 0.05 (5% of the
total area). It will then mark the difference in averages of your data and see what the area under
the curve was for this (in blue). You can see it was larger than 0.05, so the difference was not
statistically significant.
And there you have it. An intuitive understanding of the p-value. It only gets better from here!
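The cut-off and the area for an observed difference can be sketched numerically. The example assumes a standard normal curve and an observed standardised difference of 1.2; both are hypothetical stand-ins for the curve and the difference shown in the graph, which are not given in the notes.

```python
from statistics import NormalDist

curve = NormalDist()  # standard normal curve (an assumption for this sketch)
alpha = 0.05          # chosen significance level

# Cut-off on the x-axis with 5% of the total area to its right.
cutoff = curve.inv_cdf(1 - alpha)

# Hypothetical observed difference, expressed on the same standardised scale.
observed = 1.2
p_value = 1 - curve.cdf(observed)  # area under the curve to the right of it

print("cut-off:", round(cutoff, 3))
print("p-value:", round(p_value, 3))
print("significant:", p_value < alpha)
```

Because the observed value lies to the left of the cut-off, the area to its right is larger than 0.05 and the difference is not statistically significant, as in the graph described above.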
- As if by magic, the heights of the rectangular bars are now equal to the likelihood
  (a p-value of sorts) of rolling a particular total
- Note how the height of the centre rectangle (an outcome of seven) is 0.1667 (there are
  six ways of rolling a seven and therefore we calculate a p-value of 6/36 = 0.1667)
- If you look at each individual rectangular bar and if you consider the width of each to be
  one, the area of each rectangle (height times width) gives you the probability of rolling
  that number (the p-value) (for the sake of completeness, we should actually be using the
  probability density function, but this is an example of discrete data types and the results
  are the same)
- If we want to know what the probability of rolling a 10 or more is, we simply calculate
  the area of the rectangles from 10 and up
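Adding up the areas of the rectangles from 10 and up can be sketched as follows; the variable names are chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

# Totals for all 36 ordered outcomes of rolling two dice.
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

width = 1  # each rectangle in the probability plot has a base width of one

# Area of each rectangle = height (probability of that total) times width.
area_10_or_more = sum(
    Fraction(totals.count(total), len(totals)) * width
    for total in (10, 11, 12)
)

print(area_10_or_more, round(float(area_10_or_more), 4))
```

There are three ways of rolling a 10, two of rolling an 11 and one of rolling a 12, so the summed area is 6/36.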
The p then refers to probability, as in the chance or likelihood of an event occurring (in healthcare
research, an event is an outcome of an experiment).
An example of an experiment is comparing the difference in the average of some blood test value
between two groups of patients, with the p-value representing the probability that the particular
difference was found. If the probability was sufficiently low, we infer that the two sets of
participants represent two separate sets of populations, which are then significantly different from
each other.
The heart of inferential statistics: Central limit theorem
Central limit theorem
Now that you are aware of the fact that the p-value represents the area under a very beautiful and
symmetric curve (for continuous data type variables at least), something may start to concern you.
If it hasn't, let's spell it out. Is the probability curve always so symmetric? Surely, when you look at
the occurrence of data point values for variables in a research project (experiment), they are not
symmetrically arranged.
In this lesson, we get the answer to this question. We will learn that this specific difference
between the means of the two groups is but one of many, many, many (really many) differences
that are possible. We will also see that some differences occur much more commonly than others.
The answer lies in a mathematical theorem called the Central Limit Theorem (CLT). As usual, don't
be alarmed, we won't go near the math. A few simple visual graphs will explain it quite painlessly.
Skewness and kurtosis
As I mentioned, data can be very non-symmetrical in its distribution. To be clear, by distribution I
mean physically counting how many times each individual value comes up in a set of data point
values. The two terms that describe non-symmetric distribution of data point values are skewness
and kurtosis.
Skewness
Skewness is rather self-explanatory and is commonly present in clinical research. It is a marker
that shows that there are more occurrences of certain data point values at one end of a spectrum
than another. Below is a graph showing the age distribution of participants in a hypothetical
research project. Note how most individuals were on the younger side. Younger data point values
occur more commonly (although there seem to be some very old people in this study). The
skewness in this instance is right-tailed. It tails off to the right, which means it is positively skewed.
On the other hand, negative skewness would indicate the data is left-tailed.
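The sign of the skewness can be checked numerically. The notes do not give a formula, so this sketch assumes one common moment-based definition (the third central moment divided by the second central moment raised to the power 1.5). The ages are hypothetical values with a long tail to the right.

```python
import statistics


def skewness(data):
    """Moment-based skewness: positive for right-tailed, negative for left-tailed data."""
    mean = statistics.fmean(data)
    n = len(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical ages: mostly young participants and a few very old ones.
ages = [21, 22, 22, 23, 24, 25, 25, 26, 28, 30, 35, 48, 72, 95]

print("right-tailed:", round(skewness(ages), 2))
print("mirrored (left-tailed):", round(skewness([-a for a in ages]), 2))
```

Mirroring the data flips the tail to the left and changes the sign of the result, while a perfectly symmetric data set gives a skewness of zero.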
Kurtosis
Kurtosis refers to the spread of your data values, i.e. how peaked or flat the distribution is.
- A platykurtic curve is flatter and broader than normal as a result of having few scores
  around the mean. Large sections under the curve are forced into the tail, thereby (falsely)
  increasing the probability of finding a value quite far from the mean.
- A mesokurtic curve takes the middle ground with a medium curve from average
  distributions.
- A leptokurtic curve is more peaked, with many values centred around the mean.
Remember, in this section we are discussing the distribution of the actual data point
values in a study, but the terms used here can also refer to the curve that is eventually
constructed when we calculate a p-value. As we will see later, these are quite different
things (the curve of actual data point values and the p-value curve calculated from the
data point values). This is a very important distinction.
Combinations
Combinations lie at the heart of the Central Limit Theorem and also inferential statistical analysis.
They are the key to understanding the curve that we get when attempting to calculate the p-value.
Combinations refer to the number of ways a selection of objects can be taken from a group of
objects when the order of the selection does not matter.
- In a simple example we might consider how many combinations of two colors we can
  make from a total choice of four colors, say red, green, blue, and black. We could
  choose: red + green, red + blue, red + black, green + blue, green + black, and finally,
  blue + black (noting that choosing blue + black is the same as choosing black + blue).
  That is six possible combinations when choosing a two-color combination from four choices.
- Many countries in the world have lotteries in which you pick a few numbers and hand
  over some money for a chance to win a large cash prize should those numbers pop up in
  a draw. The order doesn't matter, so we are dealing with combinations. So, if you had to
  choose six numbers between say one and 47, how many combinations could come up?
  It's a staggering 10,737,573. Over 10 million. Your choice of six numbers is but one of all
  of those. That means that your chances of picking the right combination are less than one
  in 10 million! Most lotteries have even more numbers to choose from! To put things into
  perspective (just for those who play the lottery), a choice of the numbers 1, 2, 3, 4, 5, and
  6 (a seemingly very, very unlikely choice) is just as likely to come up as your favorite choice of 13,
  17, 28, 29, 30, and 47! Good luck!
Combinations have some serious implications for clinical research, though.
For example, a research project decides to focus on 30 patients for a study. The researcher chose
the first 30 patients to walk through the door at the local hypertension (high blood pressure) clinic
and notes down their ages. If a different group of patients was selected on a different day, there
would be completely different data. The sample group that you end up with (the chosen 30) is but
one of many, many, many that you could have had! If 1000 people attended the clinic and you
had to choose 30, the number of possible combinations would be larger than 2.4 times 10 to the
power 57. Billions upon billions upon billions!
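The counts quoted in this section can be verified with Python's `math.comb`, which returns the number of ways of choosing a given number of items when order does not matter. This is a small check rather than part of the original lecture.

```python
import math
from itertools import combinations

colors = ["red", "green", "blue", "black"]

# List every two-color combination explicitly, then count them.
pairs = list(combinations(colors, 2))
print(pairs)
print(len(pairs), math.comb(4, 2))

# Lottery: six numbers chosen from 47.
print(f"{math.comb(47, 6):,}")

# Clinic: 30 patients chosen from 1000 attendees (scientific notation).
print(f"{math.comb(1000, 30):.2e}")
```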
This is how the distribution curve for the outcomes of studies (from which the p-value is
calculated) is constructed, be it the difference in means between two or more groups, or
proportions of choices for a cross-sectional study's Likert-style questions. There is
(mathematically) an almost uncountable number of different outcomes (given the same variables
to be studied) and the one found in an actual study is but one of those.
Central limit theorem
We saw in the previous section on combinations that when you compare the difference in
averages between two groups, your answer is but one of many that exist. The Central Limit
Theorem states that if we were to plot all the possible differences, the resulting graph would form
a smooth, symmetrical curve (provided the samples are reasonably large). Therefore, we can do
statistical analysis and look for the area under the curve to calculate our p-values.
The mathematics behind the calculation of the p-value constructs an estimation of all the possible
outcomes (or differences as in our example).
Let's look at a visual representation of the data. In the first graph below we asked a computer
program to give us 10,000 random values between 30 and 40. As you can see, there is no
pattern.
Let's suggest that these 10,000 values represent a population and we need to randomly select 30
individuals from the population to represent our study sample. So, let's instruct the computer to
take a random sample of 30 from these 10,000 values and calculate the average for those 30. Now,
let's repeat this process 1000 times. We are in essence repeating our medical study 1000 times!
The result of the occurrence of all the averages is shown in the graph below. The Central Limit
Theorem predicts a lovely smooth, symmetric distribution. Just ready and waiting for some
statistical analysis.
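The experiment described above can be re-created in a few lines. The sketch assumes the 10,000 values were drawn uniformly between 30 and 40 (the notes only say "random values"), and the seed is fixed purely so that the run is repeatable. Instead of drawing the graph, it prints a rough text histogram of the 1000 averages.

```python
import random
import statistics
from collections import Counter

rng = random.Random(2016)  # fixed seed only for repeatability

# A "population" of 10,000 random values between 30 and 40 (uniform: an assumption).
population = [rng.uniform(30, 40) for _ in range(10_000)]

# Repeat the "study" 1000 times: sample 30 individuals and record their average.
sample_means = [
    statistics.fmean(rng.sample(population, 30)) for _ in range(1_000)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print("mean of the averages:", round(mean_of_means, 2))
print("SD of the averages:", round(sd_of_means, 2))

# Rough text histogram: averages rounded to the nearest 0.5.
bins = Counter(round(m * 2) / 2 for m in sample_means)
for value in sorted(bins):
    print(f"{value:5.1f} {'#' * (bins[value] // 10)}")
```

Although the population itself shows no pattern, the averages pile up around 35 and thin out symmetrically on both sides.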
Every time a medical study is conducted, the data point values (and their measures of central
tendency and dispersion) are just one example of countless others. Some will occur more
commonly than others and it is the Central Limit Theorem that allows us to calculate how likely it
was to find a result as extreme as the one found in any particular study.
Distributions: the shape of data
Distributions
We all know that certain things occur more commonly than others. We all accept that there are
more days with lower temperatures in winter than days with higher temperatures. In the northern
hemisphere there will be more days in January that are less than 10 degrees Celsius (50 degrees
Fahrenheit) than there are days that are more than 20 degrees Celsius (68 degrees Fahrenheit).
Actual data point values for any imaginable variable come in a variety of, shall we say, shapes or
patterns of spread. The proper term for this is a distribution. The most familiar shape is the normal
distribution. Data from this type of distribution is symmetric and forms what many refer to as a
bell-shaped curve. Most values center around the average and taper off to both ends.
If we turn to healthcare, we can imagine that certain hemoglobin levels occur more commonly than
others in a normal population. There is a distribution to the data point values. In deciding which
type of statistical test to use, we are concerned with the distribution that the variable takes in
the population. As we will see later, we do not always know what the shape of the distribution is and
we can only calculate if our sample data point values might come from a population in which that
variable is normally (or otherwise) distributed.
It turns out that there are many forms of data distributions for both discrete and continuous data
type variables. Even more so, averages, standard deviations, and other statistics also have
distributions. This follows naturally from the Central Limit Theorem we looked at before. We saw
that if we could repeat an experiment thousands of times, each time selecting a new combination
of subjects, the average values or differences in averages between two groups would form a
symmetrical distribution.
It is important to understand the various types of distributions because, as mentioned, distribution
types have an influence on the choice of statistical analysis that should be performed on them. It
would be quite incorrect to do the famous t-test on data values for a sample that do not come
from a variable with a normal distribution in the population from which the sample was taken.
Unfortunately, most data is not shared openly and we have to trust the integrity of the authors and
that they chose an appropriate test for their data. The onus then also rests on you to be aware of
the various distributions and what tests to perform when conducting your own research, as well
as to scrutinize these choices when reading the literature.
In this lesson you will note that I refer to two main types of distributions. First, there is the
distribution pattern taken by the actual data point values in a study sample (or the distribution of
that variable in the underlying population from which the sample was taken). Then there is the
distribution that can be created from the data point values by way of the Central Limit Theorem.
There are two of these distributions and they are the Z- and the t-distributions (both sharing a
beautifully symmetric, bell-shaped pattern, allowing us to calculate a p-value from them).
Normal distribution
The normal distribution is perhaps the most important distribution. We need to know that data
point values for a sample are taken from a population in which that variable is normally distributed
before we decide on what type of statistical test to use. Furthermore, the distribution of all possible
outcomes (through the Central Limit Theorem) is normally distributed.
The normal distribution has the following properties:
- most values are centered around the mean
- as you move away from the mean, there are fewer data points
- symmetrical in nature
- bell-shaped curve
- almost all data points (99.7%) occur within 3 standard deviations of the mean
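The last property can be checked against the standard normal curve (mean 0, standard deviation 1) using `statistics.NormalDist`. The shares within one and two standard deviations are added here for comparison; they are well-known properties of the normal distribution rather than figures from the notes.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Share of values lying within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)
}

for k, share in within.items():
    print(f"within {k} SD: {share:.1%}")
```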
Most variables that we use in clinical research have data point values that are normally
distributed, i.e. the sample data points we have come from a population for whom the values are
normally distributed. As mentioned in the introduction, it is important to know this, because we
have to know what distribution pattern the variable has in the population in order to decide on the
correct statistical analysis tool to use.
It is worthwhile to repeat the fact that actual data point values for a variable (i.e. age, height, white
cell count, etc.) have a distribution (both in a sample and in the population), but that through the
Central Limit Theorem, where we calculate how often certain values or differences in values
occur, we have a normal distribution that is (for sufficiently large samples) practically guaranteed.
Sampling distribution
As was mentioned in the previous section, actual data point values for variables, be it from a
sample or from a population, have a distribution. Then we have the distribution that is guaranteed
through the Central Limit Theorem. Whether we are talking about means, the difference in means
between two or more groups, or even the difference in medians, a plot of how many times each
will occur given an almost unlimited repeat of a study will be normally distributed, which allows us
to do inferential statistical analysis.
So, imagine, once again, being in a clinic and entering the ages of consecutive patients in a
spreadsheet. Some ages will occur more or less commonly, allowing us to plot a histogram of the
values. It would show the distribution of our actual values. These patients come from a much,
much larger population. In this population, ages will also come in a certain distribution pattern,
which may be different from our sample distribution.
In the video on the Central Limit Theorem we learn that our sample or the difference between two
sample groups is but one of an enormous number of differences that could occur. This larger
distribution will always be normally distributed according to the Central Limit Theorem and refers
to another meaning of the term distribution, termed a sampling distribution (in other words, the
type of distribution we get through the mathematics of the Central Limit Theorem is called a
sampling distribution).
Z-distribution
TheZ-distributionisoneofthesamplingdistributions.Ineect,itisagraphofamathematical
function.Thisfunctiontakessomevaluesfromparametersinthepopulation(asopposedtoonly
fromoursamplestatistics)andconstructsasymmetrical,bell-shapedcurvefromwhichwecan
doourstatisticalanalysis.
Itisnotcommonlyusedinmedicalstatisticsasitrequiresknowledgeofsomepopulation
parametersandinmostcases,theseareunknown.Itismuchmorecommontousethe
t-distribution,whichonlyrequiresknowledgethatisavailablefromthedatapointvaluesofa
sample.
t-distribution
Thisisthemostcommonlyusedsamplingdistribution.Itrequiresonlycalculationsthatcanbe
madefromthedatapointvaluesforavariablethatisknownfromthesamplesetdataforastudy.
Thet-distributionisalsosymmetricalandbell-shapedandisamathematicalequationthatfollows
fromtheCentralLimitTheoremallowingustodoinferentialstatisticalanalysis.
One of the values that has to be calculated when using the t-distribution is called the degrees
of freedom. It is quite an interesting concept with many interpretations and uses. In the context
we use it here, it depends on the number of participants in a study and is easily calculated as
the difference between the total number of participants and the total number of groups. So, if we
have a total of 60 subjects in a study and we have them divided into two groups, we will have 58
degrees of freedom, being 60 minus two. The larger the degrees of freedom, the more the
shape of the t-distribution resembles that of the normal distribution.
The effect of this is that studies should aim to have as many participants as is practical. A
larger sample set allows the sample to be more representative of the population,
increasing the power of the study.
Week 4: The important first steps: Hypothesis testing and confidence levels
Hypothesistesting
Thiscourserepresentsaninferentialviewonstatistics.Indoingsowemakecomparisonsand
calculatehowsignicantthosedierencesare,orratherhowlikelyitistohavefoundthespecic
dierencewhencomparingresults.
Thisformofstatisticalanalysismakesuseofwhatistermedhypothesistesting.Anyresearch
question,pertainingtocomparisons,hastwohypotheses.Wecallthemthenullandalternate
(maintained,research,test)hypothesisandinlightofcomparisons,theyareactuallyquiteeasyto
understand.
Thenullhypothesis
It is of utmost importance that researchers remain unbiased towards their research and
particularly the outcome of their research. It is natural to become excited about a new academic
theory and most researchers would like to find significant results.
The proper scientific method, though, is to maintain a point of no departure. This means that
until data is collected and analyzed, we state a null hypothesis. The null hypothesis states that
there will be no difference when a comparison is done through data collection and analysis.
Beingallscienticaboutit,though,wedosetathresholdforourtestingandiftheanalysesnds
thatthisthresholdhasbeenbreached,wenolongerhavetoacceptthenullhypothesis.Inactual
fact,wecannowrejectit.
Thismethod,calledthescienticmethod,formsthebedrockofevidence-basedmedicine.
Asscientistswebelieveintheneedforproof.
Thealternativehypothesis
Thealternativehypothesisstatesthatthereisadierencewhenacomparisonisdone.Almostall
comparisonswillshowdierences,butweareinterestedinthesizeofthatdierence,andmore
importantlyinhowlikelyitistohavefoundthedierencethataspecicstudynds.
WehaveseenthroughtheCentralLimitTheoremthatsomedierenceswilloccurmore
commonlythanothers.Wealsoknowthroughtheuseofcombinations,thattherearemany,many
dierencesindeed.
Thealternativehypothesisisacceptedwhenanarbitrarythresholdispassed.Thisthresholdisa
lineinthesandandstatesthatanydierencefoundbeyondthispointisveryunlikelytohave
occurred.Thisallowsustorejectthenullhypothesisandacceptthealternatehypothesis,thatis,
thatwehavefoundastatisticallysignicantdierenceorresults.
In the next section we will see that there is more than one way to state the alternative
hypothesis and that the choice is of enormous importance.
Thealternativehypothesis
Twowaysofstatingthealternativehypothesis
AsIhavealludedtointheprevioussection,therearetwowaystostatethealternativehypothesis.
Itisexactlyforthereasonofdierentwaysofstatingthealternativehypothesis,thatitisso
importanttostatethenullandalternativehypothesespriortoanydatacollectionoranalyses.
Let's consider comparing the cholesterol level of two groups of patients. Group one is on a new
test drug and group two is taking a placebo. After many weeks of dutifully taking their
medication, the cholesterol level of each participant is taken.
Since the data point values for the variable total cholesterol level are continuous and
ratio-type numerical, we can compare the means (or as we shall see later, the medians) between
the two groups.
Asgoodscientistsandresearchers,though,westatedourhypotheseswaybeforethispoint!The
nullhypothesiswouldbeeasy.Therewillbenodierencebetweenthemeans(ormedians)of
thesetwogroups.
Whataboutthealternatehypothesis?It'sclearthatwecanstatethisintwoways.Itmightbe
natural,especiallyifwehaveputalotofmoneyandeortintothisnewdrugtostatethatthe
treatmentgroup(groupone)willhavelowercholesterolthantheplaceboorcontrolgroup(group
two).Wecouldalsosimplystatethattherewillbeadierence.Eitherofthetwogroupsmight
havealowermean(ormedian)cholesterollevel.
This choice has a fundamental effect on the calculated p-value. For the exact same data
collection and analysis, one of these two ways of stating the alternative hypothesis has a
p-value of half of the other. That is why it is so important to state these hypotheses before
commencing a study. A non-significant p-value of 0.08 can very easily be changed after the fact
to a significant 0.04 by simply restating the alternative hypothesis.
Thetwo-tailedtest
Whenanalternativehypothesisstatessimplythattherewillbeadierence,weconductwhatis
termedatwo-tailedtest.
Let's consider again our cholesterol example from before. Through the use of combinations and the
Central Limit Theorem we will be able to construct a symmetrical bell-shaped t-distribution curve
from our data point values. A computer program will construct this graph and will also know what
cut-off values on the x-axis mark off an area under the curve equal to 0.05 (5%) of the
total area.
This being a two-tailed test, though, it will split this between both sides, with 0.025 (2.5%) on
either side.
When comparing the means (or medians) of the two groups, one will be less than the other and
depending on which one you subtract from which, you will get either a positive or a negative
answer. This is converted to a value (in units of standard error) that can also be plotted on the
x-axis. Since this is a two-tailed test, though, it is reflected on the other side. The computer
can calculate the area under the curve for both sides, called a two-tailed p-value. From the
graphs it is clear that this form of alternative hypothesis statement is less likely to yield a
low p-value. The area is doubled and the cut-off marks representing the 0.025 (2.5%) level are
further away from the mean.
Theone-tailedtest
A researcher must be very clear when using a one-tailed test. For the exact same study and data
point values, the p-value for a one-tailed test would be half of that of a two-tailed test. The
graph is only represented on one side (not duplicated on the other) and all of the area
representing a value of 0.05 (5%) is on one side. The cut-off is therefore closer to the mean,
making it easier for a difference to fall beyond it.
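The halving of the p-value can be sketched numerically. The block below is a simplified
illustration, not the calculation a statistics package would run: it assumes a standard normal
curve as a stand-in for a t-distribution with many degrees of freedom, and the function name
p_values and the standardised difference of 1.75 standard errors are hypothetical.

```python
from statistics import NormalDist


def p_values(standardised_difference):
    """Return (one-tailed, two-tailed) p-values for a difference in standard-error units.

    The one-tailed value assumes the alternative hypothesis predicted a positive difference.
    """
    curve = NormalDist()  # standard normal: mean 0, standard deviation 1
    one_tailed = 1 - curve.cdf(standardised_difference)
    two_tailed = 2 * (1 - curve.cdf(abs(standardised_difference)))
    return one_tailed, two_tailed


one_tailed, two_tailed = p_values(1.75)
print("one-tailed p-value:", round(one_tailed, 3))
print("two-tailed p-value:", round(two_tailed, 3))
```

With this difference the two-tailed value lands just above 0.05 and the one-tailed value just
below it, which is exactly why the hypotheses must be fixed before the data are seen.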
Thechoicebetweenaone-tailedandtwo-tailedapproach
Thereisnomagicbehindthischoice.Aresearchershouldmakethischoicebyconvincingheror
hispeersthroughlogicalargumentsorpriorinvestigations.Anyonewhoreadsaresearchpaper
shouldbeconvincedthatthechoicewaslogicaltomake.
Hypothesistestingerrors
Allresearchstartswithaquestionthatneedsananswer.Everyonemighthavetheirownopinion,
butaninvestigatorneedstolookfortheanswerbydesigninganexperimentandinvestigatingthe
outcome.
Aconcernisthattheinvestigatormayintroducebias,evenunintentionally.Toavoidbias,most
healthcareresearchfollowsaprocessinvolvinghypothesistesting.Thehypothesisisaclear
statementofwhatistobeinvestigatedandshouldbedeterminedbeforeresearchbegins.
To begin, let's go over the important definitions I discussed in this lesson:
Thenullhypothesispredictsthattherewillbenodierencebetweenthevariablesof
investigation.
Thealternativehypothesis,alsoknownasthetestorresearchhypothesispredictsthat
therewillbeadierenceorsignicantrelationship.
Therearetwotypesofalternativehypotheses.Onethatpredictsthedirectionofthehypothesis
(e.g.AwillbemorethanBorAwillbelessthanB),knownasaone-tailedtest.Theotherstates
therewillbeasignicantrelationshipbutdoesnotstateinwhichway(e.g.Acanbemoreorless
thanB).Thelatterisknownasatwo-tailedtest.
Forexample,ifwewanttoinvestigatethedierencebetweenthewhitebloodcellcounton
admissionbetweenHIV-positiveandHIV-negativepatients,wecouldhavethefollowing
hypotheses:
Nullhypothesis:thereisnodierencebetweentheadmissionwhitebloodcellcount
betweenHIV-positiveandHIV-negativepatients.
Alternativehypothesis(one-tailed):theadmissionwhitecellcountofHIV-positivepatients
willbehigherthantheadmissionwhitebloodcellcountofHIV-negativepatients.
Alternativehypothesis(two-tailed):theadmissionwhitebloodcellcountofHIV-positive
patientswilldierfromthatofHIV-negativepatients.
TypeIandIIerrors
Wrongconclusionscansometimesbemadeaboutourresearchndings.Forexample,astudy
ndsthatdrugAhasnoeectonpatients,wheninfactithasmajorconsequences.Howmight
suchmistakesbemade,andhowdoweidentifythem?Sincesuchexperimentsareabout
studyingasampleandmakinginferencesaboutapopulation(whichwehavenotstudied),these
mistakescanconceivablyoccur.
TableIshowstwopossibletypesoferrorsthatexistwithinhypothesistesting.ATypeIerrorisone
wherewefalselyrejectthenullhypothesis.Forinstance,aTypeIerrorcouldoccurwhenwe
concludethatthereisadierenceinthewhitebloodcellcountbetweenHIV-positiveand
HIV-negativepatientswheninfactnosuchdierenceexists.
A Type II statistical error relates to failing to reject the null hypothesis when in fact a
difference is present. This is where we would conclude that there is no difference in the white
blood cell count between HIV-positive and HIV-negative patients when in reality a difference
exists.
Why do we say "we fail to reject the null hypothesis" instead of "we accept the null hypothesis"?
This is tricky, but just because we fail to reject the null hypothesis, this does not mean we
have enough evidence to prove it is true.
In the next lesson, I'll cover the topic of confidence intervals.
Condenceinyourresults
Introductiontocondenceintervals
Condenceintervalsareveryoftenquotedinthemedicalliterature,yetitisoftenmentionedthatit
isapoorlyunderstoodstatisticalconceptamongsthealthcarepersonnel.
Inthiscourseweareconcentratingoninferentialstatistics.Werealizethatmostpublished
literatureisbasedontheconceptoftakingasampleofindividualsfromapopulationand
analyzingsomedatapointvalueswegatherfromthem.Theseresultsarethenusedtoinfersome
understandingaboutthepopulation.Wedothisbecausewesimplycannotinvestigatethewhole
population.
Thismethod,though,isfraughtwiththedangerofintroducingbias.Thesamplethatisselectedfor
analysismightnotproperlyrepresentthepopulation.Ifasamplestatisticsuchasameanis
calculated,itwilldierfromthepopulationmean(ifwewereabletotestthewholepopulation).Ifa
samplewastakenwithpropercare,though,thissamplemeanshouldbefairlyclosetothe
populationmean.
This is what confidence intervals are all about. We construct a range of values (a lower and an
upper limit), which is symmetrically formed around the sample mean (in this example), and we
infer that the population mean should fall between those values.
Wecanneverbeentirelysureaboutthis,though,sowedecideonhowcondentwewanttobein
theboundsthatweset,allowingustocalculatethesebounds.
Condenceintervalscanbeconstructedaroundmorethanjustmeansandinreality,theyhavea
slightlymorecomplexmeaningthanwhatIhavelaidouthere.Allwillberevealedinthislesson,
allowingyoutofullyunderstandwhatismeantbyallthosecondenceintervalsthatyoucome
acrossintheliterature.
Reference:
1. DinhTetal.ImpactofMaternalHIVSeroconversionduringPregnancyonEarlyMotherto
ChildTransmissionofHIV(MTCT)Measuredat4-8WeeksPostpartuminSouthAfrica
2011-2012:ANationalPopulation-BasedEvaluationPLoSONE10(5)DOI:
10.1371/journal.pone.0125525.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0125525
Condencelevels
Whatarethey?
As I've mentioned in the previous section, we only examine a small sample of subjects from a
much larger population. The aim, though, is to use the results of our studies when managing that
population. Since we only investigate a small sample, any statistic that we calculate based on
the data points gathered from them will not necessarily reflect the population parameter, which
is what we are really interested in.
Somehow,weshouldbeabletotakeastabatwhatthatpopulationparametermightbe.Thinking
onaverylargescale,thetruepopulationparameterforanyvariablecouldbeanythingfrom
negativeinnitytopositiveinnity!Thatsoundsodd,butmakesmathematicalsense.Let'stake
ageforinstance.ImagineIhavethemeanageofasampleofpatientsandamwonderingabout
thetruemeanageinthepopulationfromwhichthatsamplewastaken.Now,noonecanbe
-1000yearsold,neithercantheybe+1000yearsold.RemembertheCentralLimitTheorem,
though?Itpositedthattheanalysisofavariablefromasamplewasjustoneofmany(bywayof
combinations).Thatdistributiongraphisamathematicalconstructandstretchesfromnegativeto
positiveinnity.Tobesure,theoccurrencesofthesevaluesarebasicallynilandinpracticethey
are.Solet'ssay(mathematically)thattherearevaluesof-1000and+1000(andforthatmatter
negativeandpositiveinnity)andIusetheseboundsasmyguessastowhatthemeanvaluein
thepopulationasawholeis.WiththatwideaguessIcanbe100%condentthatthispopulation
meanwouldfallbetweenthosebounds.
This 100% represents my confidence level and I can set this arbitrarily for any sample statistic.
What happens if I shrink the bounds?
Enough with the nonsensical values. What happens if, for argument's sake, the sample mean age
was 55 and I suggest that the population mean is 45 to 65? I have shrunk the bounds, but now,
logically, I should lose some confidence in my guess. Indeed, that is what happens. The
confidence level goes down. If I shrink it to 54 to 56, there is a much greater chance that the
population mean escapes these bounds and the confidence level would be much smaller.
A95%condencelevel
It is customary to use a confidence level of 95% and that is the value that you will notice most
often in the literature when confidence intervals are quoted. The 95% refers to the confidence
level and the values represent the actual interval.
Themathematicsbehindcondenceintervalsconstructsadistributionaroundthesamplestatistic
basedonthedatapointvaluesandcalculateswhatareawouldbecoveredby95%(fora95%
condencelevel)ofthecurve.Thex-axisvaluesarereconstitutedtoactualvalueswhicharethen
theloweranduppervaluesoftheinterval.
Condenceintervals
Nowthatweunderstandwhatcondencelevelsare,wearereadytodenetheproper
interpretationofcondenceintervals.
It might be natural to suggest that, given a 95% confidence level, we are 95% confident that
the true population parameter lies within the interval given by that confidence level. Revisiting
our last example, we might have a sample statistic of 55 years for the mean of our sample and
with a 95% confidence level construct an interval of 51 to 59 years. This would commonly be
written as a mean age of 55 years (95% CI, 51-59). It would be incorrect, though, to suggest that
there is a 95% chance of the population mean age being between 51 and 59 years.
Thetrueinterpretationofcondenceintervals
Considerthatboththesamplestatistics(meanageoftheparticipantsinastudy)andthe
populationparameter(meanageofthepopulationfromwhichthesamplewastaken)existin
reality.Theyarebothabsolutes.Giventhis,thepopulationparametereitherdoesordoesnotfall
insideofthecondenceinterval.Itisallornothing.
The true meaning of a confidence level of, say, 95% is that if the study is repeated 100 times
(each with its own random sample set of patients drawn from the population, each with its own
mean and 95% confidence interval), about 95 of these studies would be expected to have the
population parameter within their intervals and about 5 would not. There is no way to know which
one you have for any given study.
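This repeated-study interpretation can be checked with a simulation. The sketch below makes
several assumptions that are not in the notes: a made-up population with a true mean age of 55
and standard deviation of 12, samples of 100 participants, and an interval built from the normal
curve rather than the t-distribution (reasonable only because the samples are fairly large). The
function name coverage is hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def coverage(true_mean, true_sd, sample_size, studies, level=0.95, seed=1):
    """Share of repeated studies whose confidence interval contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    hits = 0
    for _ in range(studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        mean = statistics.mean(sample)
        standard_error = statistics.stdev(sample) / sample_size ** 0.5
        lower, upper = mean - z * standard_error, mean + z * standard_error
        if lower <= true_mean <= upper:  # all or nothing for each single study
            hits += 1
    return hits / studies


share = coverage(true_mean=55, true_sd=12, sample_size=100, studies=2_000)
print("share of intervals containing the true mean:", round(share, 3))
```

Each individual interval either contains 55 or it does not; the 95% describes how often the
procedure succeeds over many repeats.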
Week5:Whichtestshouldyouuse?
Introductiontoparametrictests
Finally in this course we get to grips with some real inferential statistics and we start things
off with parametric tests. Inferential statistics is all about comparing different groups of
sample subjects to each other. Most commonly we deal with numerical data point values, for which
we can calculate measures of central tendency and dispersion.
When using parametric tests, we use the mean or average as our measure of central tendency.
Commonly we will have two (or more) groups and for any given variable, say for instance white
cell count, we could calculate a mean value for each group. A difference will exist between the
means of the groups and through the use of statistical tests we could calculate how commonly
certain differences should occur given many repetitions of a study and also how likely it was
to have found a difference at least as wide as the one for the particular study at hand.
Parametric tests are the most commonly used tests, but with this common use come some
very strict rules or assumptions. If these are not met and parametric tests are used, the
subsequent p-values might not be a true reflection of what we can expect in the population.
Typesofparametrictests
Iwilldiscussthreemaintypesofparametrictestsinthislesson.
t-tests
ThesearetrulythemostcommonlyusedandmostpeoplearefamiliarwithStudent'st-test.There
ismorethanonet-testdependingonvariousfactors.
Asagroup,though,theyareusedtocomparethepointestimateforsomenumericalvariable
betweentwogroups.
ANOVA
ANOVAistheacronymforanalysisofvariance.Asopposedtot-test,ANOVAcancompareapoint
estimateforanumericalvariablebetweenmorethantwogroups.
Linearregression
Whencomparingtwoormoregroups,wehavethefactthatalthoughthedatapointvaluesfora
variablearenumericalintype,thetwogroupsthemselvesarenot.WemightcallthegroupsAand
B,oroneandtwo,ortestandcontrol.Assuchtheyrefertocategoricaldatatypes.
Inlinearregressionwedirectlycompareanumericalvaluetoanumericalvalue.Forthis,weneed
pairsofvaluesandinessencewelookforacorrelationbetweenthese.Canwendthata
changeinthesetthatisrepresentedbytherstvalueinallthepairscausesapredictablechange
inthesetmadeupbyallthesecondvaluesinthepair.
Asanexamplewemightcorrelatenumberofcigarettessmokedperdaytobloodpressurelevel.
Todothiswewouldneedasampleofparticipantandforeachhaveavalueforcigarettessmoked
perdayandbloodpressurevalue.Aswewillseelater,correlationdoesnotprovecausation!
Student's t-test: Introduction
We have learned that this is one of the most common statistical tests. It takes numerical
data point values for some variable in two groups of subjects in a study and compares them to
each other.
William Gosset (developer of Student's t-test) used the t-distribution, which we covered earlier.
It is a bell-shaped, symmetrical curve and represents a distribution of all the differences (in
the mean) between two groups should the same study be repeated multiple times. Some will occur
very often and some will not. The t-distribution uses the concept of degrees of freedom. This
value is the total number of participants in a study (both groups) minus the number of groups
(which is two). The higher the degrees of freedom, the more closely the t-distribution follows
the normal distribution and, in some sense, the more accurate the p-value. This is another reason
to include as large a sample size as possible.
Oncethegraphforthisdistributionisconstructed,itbecomespossibletocalculatewhereonthe
x-axisthecut-orepresentingadesiredareawouldbe.Theactualdierenceisalsoconvertedto
thesameunits(calledstandarderrors)andtheareaforthiscalculatedaswehaveseenbefore.
Sincewearetryingtomimicthenormaldistribution,wehaveoneofthemostcrucialassumptions
fortheuseofthet-testandotherparametrictests.Weneedassurancesthatthesampleof
subjectsinastudywastakenfromapopulationinwhichthevariablethatisbeingtestedis
normallydistributed.Ifnot,wecannotuseparametrictests.
Typesoft-tests
There are a variety of t-tests. Commonly we will have two independent groups. If we were to
compare the average cholesterol levels between two groups of patients, participants in these two
groups must be independent of each other, i.e. we cannot have the same individual appear in
both groups. A special type of t-test (the paired-sample t-test) exists if the two groups do not
contain independent individuals, as would happen if the groups are made up of monozygotic
(identical) twins or if we test the same variable in the same group of participants before and
after an intervention (with the two sets of data constituting the two groups).
There are also two variations of the t-test based on equal and unequal variances. It is important
to consider the difference in the variances (square of the standard deviation) for the data point
values for the two groups. If there is a big difference, a t-test assuming unequal variances
should be used.
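The two variations can be sketched as follows. This is an illustration only: the cholesterol
values are invented, the function names pooled_t and welch_t are hypothetical, and the block
stops at the test statistic and degrees of freedom (turning these into a p-value needs the
t-distribution, which a statistics package would supply). The unequal-variance version shown is
the one usually called Welch's t-test.

```python
import statistics


def pooled_t(a, b):
    """t statistic and degrees of freedom assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2  # participants minus number of groups
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, df


def welch_t(a, b):
    """t statistic and approximate degrees of freedom assuming unequal variances."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical cholesterol values for two independent groups (not real data).
treatment = [4.1, 4.8, 5.0, 4.4, 4.6, 5.2]
placebo = [5.3, 5.9, 5.1, 6.2, 5.6, 5.8]

pooled_stat, pooled_df = pooled_t(treatment, placebo)
welch_stat, welch_df = welch_t(treatment, placebo)
print("equal variances:  ", round(pooled_stat, 2), "with", pooled_df, "degrees of freedom")
print("unequal variances:", round(welch_stat, 2), "with", round(welch_df, 1), "degrees of freedom")
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; the
versions diverge more when group sizes and variances are both unequal.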
ANOVA
ANOVAistheacronymforanalysisofvariance.Asopposedtothet-test,ANOVAcancomparethe
meansofmorethantwogroups.ThereareanumberofdierentANOVAtests,whichcandeal
withmorethanonefactor.
ThemostcommontypeofANOVAtestisone-wayANOVA.Herewesimplyuseasinglefactor
(variable)andcomparemorethantwogroupstoeachother.
ANOVAlooksatbothvariationsofvaluesinsideofgroupsandbetweengroupsandconstructsa
distributionbasedonthesamecriteriathatwehavebasedontheCentralLimitTheoremand
combinations.
Whencomparingmorethantwogroups,itisessentialtostartwithanalysisofvariance.Only
whenasignicantvalueiscalculatedshouldaresearchercontinuewithcomparingtwogroups
directlyusingat-test,soastolookforsignicantdierences.Iftheanalysisofvariancedoesnot
returnasignicantresult,itispointlesstorunt-testsbetweengroupsandifdoneanysignicant
ndingshouldbeignored.
LinearRegression
Upuntilnowwehavebeencomparingnumericalvaluesbetweentwocategoricalgroups.We
havehadexamplesofcomparingwhitecellcountvaluesbetweengroupsAandB,cholesterol
valuesbetweengroupstakinganewtestdrugandaplacebo.Thevariablecholesterolandwhite
cellcountcontaindatapointthatareratio-typenumericalandcontinuous,butthegroups
themselvesarecategorical(groupAandBortestandcontrol).
We can, however, compare numerical values directly to each other through the use of linear
regression.
In linear regression we are looking for a correlation between two sets of values. These sets must
come in pairs. The graph below shows a familiar plot of the mathematical equation y = x. We can
plot sets of values on this line, i.e. (0,0), (1,1), (2,2), etc. This is how linear regression is
done. The first value in all the pairs comes from one set of data point values and the second
from a second set of data point values. I've mentioned an example before, looking at number of
cigarettes smoked per day versus blood pressure value.
Linear regression looks at the correlation between these two sets of values. Does one depend on
the other? Is there a change in the one as the other changes?
Setsofdatapointswillalmostneverfallonastraightline,butthemathematicsunderlyinglinear
regressioncantryandmakeastraightlineoutofallthedatapointsets.Whenwedothiswenote
thatweusuallygetadirection.Mostsetsareeitherpositivelyornegativelycorrelated.With
positivecorrelation,onevariable(calledthedependentvariable,whichisonthey-axis),increases
astheother(calledtheindependentvariable,whichisonthex-axis)alsoincreases.
Thereisalsoanegativecorrelationandasyoumightimagine,thedependentvariablesdecreases
astheindependentvariableincreases.
Strengthofthecorrelation
As you would have noticed, some of the data point pairs are quite a distance away from the linear
regression line. With statistical analysis we can calculate how strongly the pairs of values are
correlated and express that strength as a correlation coefficient, r. This correlation coefficient
ranges from -1 to +1, with negative one being absolute negative correlation. This means that all
the dots would fall on the line and there is perfect movement in the one variable as the other
moves. With a correlation of positive one we have the opposite, a perfect positive correlation.
In most real-life situations, the correlation coefficient will fall somewhere in between.
There is also the zero value, which means no correlation at all.
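The correlation coefficient described here is usually Pearson's r, and a minimal sketch of its
calculation follows. The cigarette and blood pressure values are invented for illustration and
the function name pearson_r is hypothetical.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired values."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / (sum_xx * sum_yy) ** 0.5


# Hypothetical pairs: cigarettes smoked per day and systolic blood pressure.
cigarettes = [0, 5, 10, 15, 20, 25]
blood_pressure = [118, 121, 127, 125, 134, 139]

r = pearson_r(cigarettes, blood_pressure)
print("r for the made-up data:", round(r, 2))
print("perfect positive:", round(pearson_r([1, 2, 3], [2, 4, 6]), 2))
print("perfect negative:", round(pearson_r([1, 2, 3], [6, 4, 2]), 2))
```

Pairs that lie exactly on a rising or falling line give +1 or -1; the scattered made-up data
give a value in between.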
Correlating variables is always fascinating, but comes with a big warning. Any correlation
between variables does not necessarily mean causation. Just because two variables are
correlated does not mean the change in one is caused by a change in the other. There might be a
third factor influencing both. Proof of a causal relationship requires much more than linear
regression.
Nonparametrictestingforyournon-normaldata
Nonparametrictests
When comparing two or more groups of numerical data values, we have a host of statistical tools
on hand. We have seen all the t-tests for use when comparing two groups as well as ANOVA for
comparing more than two groups. As mentioned in the previous lecture, though, the use of these
tests does require that some assumptions are met.
Chiefamongthesewasthefactthatthedatapointvaluesforaspecicvariablemustcomefrom
apopulationinwhichthatsamevariableisnormallydistributed.Wearetalkingaboutapopulation
andthereforeaparameter,hencethetermparametric tests.
Unfortunatelywedonothaveaccesstothedatapointvaluesforthewholepopulation.Fortunately
therearestatisticalteststomeasurethelikelihoodthatthedatapointvaluesintheunderlying
populationarenormallydistributedandtheyaredonebasedonthedatapointvaluesthatare
availablefromthesampleofparticipants.
Checkingfornormality
Thereareavarietyofwaystocheckwhetherparametrictestsshouldbeused.I'llmentiontwo
here.TherstisquitevisualandmakesuseofwhatiscalledaQQplot.InaQQplot,allthedata
pointvaluesfromasamplegroupforagivenvariablearesortedinascendingorderandeachis
assigneditspercentilerankvalue,calleditsquantile.Thisreferstothepercentageofvaluesinthe
setthatfallsbelowthatspecicvalue.Thisisplottedagainstthequantilesofvaluesfroma
distributionthatweneedtotestagainst.Indecidingifaparametrictestshouldbeused,thiswould
bethenormaldistribution.
Intherstimagebelowwehaveacomputergeneratedplotshowingdatapointvaluesthatdonot
followthehypotheticalstraight(red)lineifthedatapointvaluesforthesampleweretakenfroma
populationinwhichthatvariablewasnormallydistributed.
In the image below, we note the opposite and would all agree that these points are a much closer
match for the red line representing the normal distribution.
Ifyoulookcloselyatthesegraphs,youwillalsonoteanR-squaredvalue.Thatisthesquareofthe
correlationcoecient,r,whichwemetinthelessononlinearregression.Notethevalueof0.99
(verycloseto1)forthesecondsetofvalues,versusonly0.84fortherst.
ThesecondmethodthatIwillmentionhereistheKolmogorov-Smirnovtest.Itisinitselfaformof
anon-parametrictestandcancompareasetofdatapointvaluesagainstareferenceprobability
distribution,mostnotably,thenormaldistribution.Aswithallstatisticaltestsofcomparisonit
calculatesap-valueandunderthenullhypothesis,thesampledatapointvaluesaredrawnfrom
thesamedistributionagainstwhichitistested.Thealternativehypothesiswouldstatethatitisnot
andwoulddemandtheuseofanon-parametrictestforcomparingsetsofdatapointvalues.
Non-parametricteststotherescue
Since we are going to compare numerical values to each other and in most cases make use of
the t-distribution of sample means (trying to mimic the normal distribution), it is clear that
when the data point values do not seem to come from a population in which that variable is
normally distributed, the areas under the curve (p-values) that we calculate would be incorrect.
In these cases, it is much better to use the set of non-parametric tests that we will meet this
week.
Nonparametrictests
Themostcommontestsusedintheliteraturetocomparenumericaldatapointvaluesaret-tests,
analysisofvariance,andlinearregression.
Thesetestscanbeveryaccuratewhenusedappropriately,butdosuerfromthefactthatfairly
stringentassumptionsmustbemadefortheiruse.Notallresearcherscommentonwhetherthese
assumptionsweremetbeforechoosingtousethesetests.
Fortunately, there are alternative tests for when the assumptions fail and these are called
nonparametric tests. They go by names such as the Mann-Whitney-U test and the Wilcoxon
signed-rank test.
Inthislessonwewilltakealookatwhentheassumptionsforparametrictestsarenotmetsothat
youcanspotthemintheliteratureandwewillbuildanintuitiveunderstandingofthebasisfor
thesetests.Theydeservealotmoreattentionanduseinhealthcareresearch.
Keyconcepts
Commonparametrictestsfortheanalysisofnumericaldatatypevaluesincludethe
varioust-tests,analysisofvariance,andlinearregression
Themostimportantassumptionthatmustbemetfortheirappropriateuseisthatthe
sampledatapointmustbeshowntocomefromapopulationinwhichtheparameteris
normallydistributed
Thetermparametricstemsfromthewordparameter,whichshouldgiveyouaclueasto
theunderlyingpopulationparameter
Testingwhetherthesampledatapointarefromapopulationinwhichtheparameteris
normallydistributedcanbedonebycheckingforskewnessinthesampledataorbythe
useofquantile(distribution)plots,amongstothers
Whenthisassumptionisnotmet,itisnotappropriatetouseparametrictests
Theinappropriateuseofparametricanalysesmayleadtofalseconclusions
Nonparametrictestsareslightlylesssensitiveatpickingupdierencesbetweengroups
Nonparametrictestscanbeusedfornumericaldatatypesaswellasordinalcategorical
datatypes
Whendatapointsarenotfromanormaldistributionthemean(onwhichparametrictests
arebased)arenotgoodpointestimates.Inthesecasesitisbettertoconsiderthe
median
Comparingmediansmakesuseofsigns,ranks,signranksandranksums
Whenusingsignsallofthesampledatapointsaregroupedtogetherandeachvalueis
assignedalabelofeitherzeroor(plus)onebasedonwhethertheyareatorlowerthana
suspectedmedian(zero)orhigherthanthatmedian(one)
Thecloserthesuspectedvalueistothetruemedian,thecloserthesumofthesigns
shouldbetoonehalfofthesizeofthesample
Adistributioncanbecreatedfromarankwhereallthesamplevaluesareplacedin
ascendingorderandrankedfromone,withtiesresolvedbygivingthemanaveragerank
value
Thesignvalueismultipliedbythe(absolutevalue)ranknumbertogivethesignrank
value
Intheranksumsmethodspecictablescanbeusedtocomparevaluesingroupsand
makinglistsofwhichvaluesbeatwhichvalueswiththespecicoutcomeoneofmany
possibleoutcomes
TheMann-Whitney-U(orMann-Whitney-WilcoxonorWilcoxon-Mann-WhitneyorWilcoxon
rank-sum)testusestherank-sumdistribution
TheKruskal-Wallistestisthenonparametricequivalenttotheone-wayanalysisof
variancetest
IfaKruskal-Wallisndsasignicantp-valuethentheindividualgroupscanbecompared
usingtheMann-Whitney-Utest
TheWilcoxonsign-ranktestisanalogoustotheparametricpaired-samplet-test
Spearmansrankcorrelationisanalogoustolinearregression
ThealternativeKendallsranktestcanbeusedformoreaccuracywhentheSpearmans
rankcorrelationrejectsthenullhypothesis
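The ranking steps listed above (ascending order, ranks starting from one, ties given an average
rank, and rank sums per group) can be sketched as follows. The function names average_ranks and
rank_sums are hypothetical, and the block only produces the ranks and their sums; it does not
carry out the full Mann-Whitney-U test.

```python
def average_ranks(values):
    """Rank values from one in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # positions are 0-based, ranks start at 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sums(group_a, group_b):
    """Rank both groups together and return the sum of ranks for each group."""
    ranks = average_ranks(list(group_a) + list(group_b))
    return sum(ranks[: len(group_a)]), sum(ranks[len(group_a):])


print(average_ranks([10, 20, 20, 30]))   # the two tied values share rank 2.5
print(rank_sums([1, 2, 3], [4, 5, 6]))   # every value in the first group is lower
```

When one group holds all the lowest values, as in the second call, its rank sum is as small as
it can be, which is the kind of extreme outcome the rank-sum distribution is used to judge.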
References:
1. ParazziP,etal.Ventilatoryabnormalitiesinpatientswithcysticbrosisundergoingthe
submaximaltreadmillexercisetest,BiomedCentralPulmonaryMedicine,2015,15:63,
http://www.biomedcentral.com/1471-2466/15/63
2. Bello,A,etal.Knowledgeofpregnantwomenaboutbirthdefects,BiomedCentral
PregnancyChildbirth,2013,
13:45,http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3598521/
3. CraigE,etal.RiskfactorsforoverweightandoverfatnessinruralSouthAfricanchildren
andadolescents,JournalofPublicHealth,2015
http://jpubhealth.oxfordjournals.org/content/early/2015/03/04/pubmed.fdv016.full
4. Paul R, et al. Study of platelet aggregation in acute coronary syndrome with special
reference to metabolic syndrome, International Journal of Applied Medical Research
2013 Jul-Dec; 3(2): 117-121,
http://www.ijabmr.org/article.asp?issn=2229-516X;year=2013;volume=3;issue=2;spage=117;epage=121;aulast=Paul
Week 6: Categorical data and analyzing accuracy of results
Comparingcategoricaldata
In the previous sections we looked at comparing numerical data types. What about methods to
analyze categorical data, though? The most often used statistical test for categorical data
types, both nominal and ordinal, is the chi-squared test.
In this section we will use the chi-squared (χ²) distribution to perform a goodness-of-fit test
and have a look at how to test sample data points for independence.
Thechi-squaredgoodness-of-ttest
Before we get to the more commonly used chi-squared test for independence, let's start off with
the goodness-of-fit test. This test allows us to see whether the distribution pattern of our data
fits a predicted distribution. In essence we will predict a distribution of values, go out and
measure some actual data and see how well our prediction fared.
Since we are dealing with a frequency distribution, we have to count how many times a data
value occurs and divide it by the sum total of values. Let's consider predicting the five most
common emergency surgical procedures over the next five months. The prediction estimates that
appendectomies will be most common, making up 40% of the total, cholecystectomies making up
30% of the total, incision and drainage (I&D) of abscesses making up 20% of the total and with
5% each we predict repairing perforated peptic ulcers and major lower limb amputations. Over
the next five months we actually note the following: 290 appendectomies, 256 cholecystectomies,
146 I&D procedures, 64 perforated peptic ulcer repairs and 44 amputations.
A chi-squared goodness-of-fit test allows us to test whether the observed frequency distribution
fits our prediction. Our null hypothesis would be that the actual frequency distribution can be
described by the expected distribution and the test hypothesis states that these would differ.
The actual values are already available. We do need to calculate the expected (predicted) values,
though. Fortunately we know the total, which in our example above was 800 (adding all the actual
values) and we can construct values based on the predicted percentages above. This will leave us
with 40% of 800, which is 320. Compare this to the actual value of 290. For the 30%, 20% and
two 5% values we get 240, 160, 40 and 40, which compare to the observed values of 256, 146,
64 and 44.
The chi-squared value is calculated from the differences between the observed and expected
values and from this we can calculate a probability of having found this difference (a p-value).
If it is less than our chosen value of significance we can reject the null hypothesis and accept
the test hypothesis. If not, we cannot reject the null hypothesis.
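The calculation for the surgical procedures example can be sketched as below. The observed counts
and predicted percentages come from the text; everything else is an assumption made for
illustration: the usual Pearson statistic (the sum of squared differences divided by the expected
value), degrees of freedom equal to the number of categories minus one, and a closed-form
p-value that only holds for exactly four degrees of freedom.

```python
import math

observed = {
    "appendectomy": 290,
    "cholecystectomy": 256,
    "incision and drainage": 146,
    "perforated peptic ulcer repair": 64,
    "amputation": 44,
}
predicted_share = {
    "appendectomy": 0.40,
    "cholecystectomy": 0.30,
    "incision and drainage": 0.20,
    "perforated peptic ulcer repair": 0.05,
    "amputation": 0.05,
}

total = sum(observed.values())
expected = {name: total * share for name, share in predicted_share.items()}

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[name] - expected[name]) ** 2 / expected[name] for name in observed
)
degrees_of_freedom = len(observed) - 1

# Upper-tail area of the chi-squared distribution; this closed form is valid for 4 df only.
p_value = math.exp(-chi_square / 2) * (1 + chi_square / 2)

print("chi-squared:", round(chi_square, 2), "df:", degrees_of_freedom)
print("p-value below 0.05:", p_value < 0.05)
```

Most of the statistic comes from the peptic ulcer repairs, where 64 were observed against 40
expected, so under these assumptions the prediction would be rejected at the 0.05 level.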
Thechi-squaredtestforindependence
This is also called the χ²-test for association (when considering treatment and condition) and is
the common form that we see in clinical literature. It is performed by constructing so-called
contingency tables.
The null hypothesis states that there is no association between treatment and condition, with the
alternative hypothesis stating that there is an association. Below is a contingency table,
clearly showing the categorical (ordinal) nature of the data.
Outcome                      Treatment  Placebo  Total
Considerable improvement            27        5     32
Moderate improvement                11       12     23
No change                            3        2      5
Moderate deterioration               4       13     17
Considerable deterioration           5        7     12
Death                                4       14     18
Totals                              54       53    107
This table represents the observed totals from a hypothetical study and simply counts the number
of occurrences of each outcome, i.e. 27 patients in the treatment group were assessed as having
improved considerably, whereas only five did so in the placebo group. Note how totals occur for
both the rows and columns of the data. Thirty-two patients in total (both treatment and placebo
groups) showed considerable improvement. There were 54 patients in the treatment group and 53
in the placebo group.
From this table an expected table can be calculated. Mathematical analysis of these two tables
results in a χ²-value, which is converted to a p-value. For a p-value less than a chosen value of
significance we can reject the null hypothesis, thereby accepting the alternate hypothesis, that
there is an association between treatment and outcome. When viewing the observed table above,
that would mean that there is a difference in proportions between the treatment and placebo
columns. Stated differently, which (treatment) group a patient is in does affect the outcome (the
two are not independent).
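The notes do not spell out how the expected table is built. The sketch below assumes the standard rule, in which each expected cell equals its row total multiplied by its column total, divided by the grand total, and then sums (observed - expected)² / expected over all cells. Variable names are illustrative only.

```python
# Observed counts per outcome as (treatment, placebo), taken from the table above.
observed = {
    "Considerable improvement": (27, 5),
    "Moderate improvement": (11, 12),
    "No change": (3, 2),
    "Moderate deterioration": (4, 13),
    "Considerable deterioration": (5, 7),
    "Death": (4, 14),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Expected cell = row total x column total / grand total (standard rule, assumed).
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Chi-squared statistic summed over every cell of the table.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)

for name, exp_row in zip(observed, expected):
    print(f"{name}: expected treatment {exp_row[0]:.2f}, placebo {exp_row[1]:.2f}")
print(f"chi-squared = {chi_square:.2f}")
```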
The calculation for the p-value using the chi-squared test makes use of the concept of degrees of
freedom. This is a simple calculation that multiplies two values. The first subtracts 1 from the
number of columns and the second subtracts 1 from the number of rows. In our example above,
we have two columns and six rows. Subtracting 1 from each yields 1 and 5. Multiplying them
yields a value of five for the degrees of freedom.
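Expressed as a small helper (the function name is our own), the rule reads:

```python
def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a contingency table: (rows - 1) x (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


# Six outcome rows and two columns (treatment and placebo), totals excluded.
dof = degrees_of_freedom(6, 2)
print(dof)  # 5
```

Note that the totals row and totals column are not counted when determining the size of the table.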
Fisher's exact test
There are cases in which the χ²-test does become inaccurate. This happens when the numbers
are quite small, with totals in the order of five or less. There is actually a rule called
Cochran's rule which states that more than 80% of the values in the expected table (described
above) must be larger than five. If not, Fisher's exact test should be used. Fisher's test,
though, only considers two columns and two rows. So in order to use it, the categorical numbers
above must be reduced by combining some of the categories. In the example above we might combine
considerable improvement, moderate improvement and no change into a single row and all the
deteriorations and death in a second row, leaving us with a two-column and two-row contingency
table (observed table).
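A minimal sketch of this reduction and of Fisher's exact test follows. It assumes the usual formulation, in which the row and column totals are held fixed and the probability of each possible table follows the hypergeometric distribution. The two-sided p-value here sums the probabilities of all tables that are no more likely than the observed one, which is one common convention among several. Names are illustrative.

```python
from math import comb

# Collapse the six outcome rows into two rows of (treatment, placebo) counts.
no_worse = (27 + 11 + 3, 5 + 12 + 2)  # improvement or no change
worse = (4 + 5 + 4, 13 + 7 + 14)      # deterioration or death

a, b = no_worse
c, d = worse
row1, row2 = a + b, c + d
col1 = a + c
n = row1 + row2


def table_probability(x):
    """Probability of x in the top-left cell when all margins are fixed."""
    return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)


# All values the top-left cell can take without changing the margins.
lowest = max(0, col1 - row2)
highest = min(row1, col1)
probabilities = [table_probability(x) for x in range(lowest, highest + 1)]

p_observed = table_probability(a)
# Two-sided p-value: tables at most as probable as the observed one
# (the small tolerance guards against floating-point ties).
p_value = sum(p for p in probabilities if p <= p_observed * (1 + 1e-9))

print(f"collapsed table: {no_worse}, {worse}")
print(f"two-sided p-value = {p_value:.6f}")
```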
The calculation for Fisher's test uses factorials. Five factorial (written as 5!) means
5 x 4 x 3 x 2 x 1 = 120 and 3! is 3 x 2 x 1 = 6. For interest's sake, 1! = 1 and 0! is also equal
to 1. As you might realise, factorial values increase in size very quickly. In the example above
we had a value of 27, and 27! is a number with 29 digits. That is billions and billions. When
such large values are used, a researcher must make sure that his or her computer can accurately
manage such large numbers and not make rounding mistakes. Fisher's exact test should only be
used when small sample sizes require it.
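The point about large numbers can be illustrated in Python, whose integers are exact at any size while its floating-point numbers keep only about 15 to 17 significant digits. This is a small demonstration, not part of the original notes.

```python
import math

small = math.factorial(5)   # 5 x 4 x 3 x 2 x 1
large = math.factorial(27)  # exact, because Python integers have no fixed size

print(small)            # 120
print(len(str(large)))  # 29 digits

# Converting to floating point silently rounds the value, so the round trip
# back to an integer no longer matches the exact factorial.
print(int(float(large)) == large)  # False
```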
Sensitivity, specificity, and predictive values
Considering medical investigations
The terms sensitivity and specificity, as well as positive and negative predictive values, are
used quite often in the medical literature and it is very important in the day-to-day management
of patients that healthcare workers are familiar with these terms.
These four terms are used when we consider medical investigations and tests. They look at the
problem from two points of view. In the case of sensitivity and specificity we consider how many
patients will be correctly indicated as suffering from a disease or not suffering from a disease.
From this vantage point, no test has been ordered and we do not have the results.
This is in contrast to the use of positive and negative predictive value. From this point of view
we already have the test results and need to know how to interpret a positive and a negative
finding. Sensitivity and specificity help us to decide which test to use and the predictive values
help us to decide how to interpret the results once we have them.
Sensitivity and specificity
Let's consider the choice of a medical test or investigation. We need to be aware of the fact that
tests are not completely accurate. False positive and negative results do occur. With a false
positive result the patient really does not have the disease, yet the results return positive. In
contrast to this we have the false negative result. Although the test returns negative, the
patient really does have the disease. As a good example, we often note headlines scrutinising the
use of screening tests such as mammography, where false positive tests lead to both unnecessary
psychological stresses and further investigation, even surgery.
Sensitivity refers to how often a test returns positive when patients really have the disease. It
requires some gold standard by which we absolutely know that the patient has the disease. So
imagine we have a set of 100 patients with a known disease. If we subject all of them to a new (or
different) test, the sensitivity will be the percentage of times that this test returns a positive
result. If 86 of these tests come back as positive, we have an 86% true positive rate, stated as a
sensitivity of 86%. That means that in 14% of cases of using this test, we will get a false
negative and might miss the fact that a patient has the disease that we were investigating.
Specificity refers to how often a test returns a negative result in the absence of a disease. Once
again, it requires the presence of some gold standard, whereby we absolutely know that a
disease is absent. Let's use a hundred patients again, all known not to have a certain disease. If
we subject them to a test and 86 of those tests come back as negative, we have a specificity of
86%. This would also mean that 14% of patients will have a false positive result and might be
subjected to unnecessary further investigations and even interventions.
The great majority of medical tests and investigations cannot absolutely discriminate between
those with and without the presence of a disease and we have to be circumspect when choosing
any test.
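Here we have an example:
We note 1000 patients, 100 with a disease and 900 without. They are all subjected to a new test
and 180 return a positive result and 820 a negative result. The counts, including the false
positives and false negatives, are as follows.
                   Disease present   Disease absent   Total
Test positive             90               90          180
Test negative             10              810          820
Total                    100              900         1000
The equations for sensitivity and specificity are:
Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)
This gives us a sensitivity of 90% (90/100) and a specificity of 90% (810/900).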
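The same calculation as a short sketch. The four counts are inferred from the totals and results quoted in this example (90 of the 100 diseased patients and 90 of the 900 healthy patients test positive); the variable names are our own.

```python
# Counts inferred from the worked example in the text.
true_positive = 90    # diseased, test positive
false_negative = 10   # diseased, test negative
false_positive = 90   # healthy, test positive
true_negative = 810   # healthy, test negative

# Sensitivity: proportion of diseased patients who test positive.
sensitivity = true_positive / (true_positive + false_negative)

# Specificity: proportion of healthy patients who test negative.
specificity = true_negative / (true_negative + false_positive)

print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```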
Predictive values
Let's turn the tables and now consider how to interpret test results once they return. Again we
could imagine that some tests will return both false positive and false negative results.
We express positive predictive value as the percentage of patients with a positive test result
who turn out to have the disease, and we express negative predictive value as the percentage of
patients with a negative test result who turn out not to have the disease. The formulae for the
predictive values are simple:
Positive predictive value = true positives / (true positives + false positives)
Negative predictive value = true negatives / (true negatives + false negatives)
For the example of 1000 patients above, this gives us a positive predictive value of 50% (90/180),
so only half of patients with a positive result will actually have the disease, and a negative
predictive value of 99% (810/820), which means that almost all patients with a negative result
will actually not have the disease.
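A small helper makes the predictive value calculation explicit. The function name and argument order are our own choices, and the counts are those of the example with 1000 patients.

```python
def predictive_values_from_counts(tp, fp, tn, fn):
    """Return (positive predictive value, negative predictive value)."""
    ppv = tp / (tp + fp)  # share of positive results that are true positives
    npv = tn / (tn + fn)  # share of negative results that are true negatives
    return ppv, npv


ppv, npv = predictive_values_from_counts(tp=90, fp=90, tn=810, fn=10)
print(f"PPV = {ppv:.0%}, NPV = {npv:.1%}")
```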
Predictive values are very dependent on the prevalence of a disease. Here we chose a sample
set of 1000 patients in which the disease existed in only 10%. It is this low prevalence which
gives us the poor positive predictive value. When interpreting positive and negative predictive
values, you must always compare the prevalence of the disease in the study sample with the
prevalence of the disease in the patient population that you see. (There are mathematical
methods of converting results to different levels of prevalence.)
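One such method, sketched here under the assumption that sensitivity and specificity stay the same across populations, applies Bayes' theorem to recompute the predictive values at any prevalence. The notes do not say which method they have in mind, so this is offered only as an illustration; the function name is hypothetical.

```python
def predictive_values_at_prevalence(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at the given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# The same 90% sensitive, 90% specific test at three levels of prevalence.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values_at_prevalence(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

At a prevalence of 10% this reproduces the values from the example above, and it shows how the positive predictive value rises as the disease becomes more common in the population tested.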
Reference:
1. Sutton PA, Humes DJ, Purcell G, et al. The Role of Routine Assays of Serum Amylase
and Lipase for the Diagnosis of Acute Abdominal Pain. Annals of The Royal College of
Surgeons of England. 2009;91(5):381-384. doi:10.1308/003588409X392135.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2758431/