
Understanding Clinical Research:

Behind the Statistics

Keynotes

Last updated: 13 May 2016

Juan Klopper CC-BY

This work is licensed under a Creative Commons Attribution 4.0 International License.

Table of Contents:

Week 1: Getting things started by defining different study types
Getting to know study types
Observational and experimental studies
Getting to Know Study Types: Case Series
Case-control Studies
Cross-sectional studies
Cohort studies
Retrospective Cohort Studies
Prospective Cohort Studies
Experimental studies
Randomization
Blinding
Trials with independent concurrent controls
Trials with self-controls
Trials with external controls
Uncontrolled trials
Meta-analysis and systematic review
Meta-analysis
Systematic Review
What is the difference between a systematic review and a meta-analysis?

Week 2: Describing your data
The spectrum of data types
Definitions
Descriptive statistics
Inferential statistics
Population
Sample
Parameter
Statistic
Variable
Data point
Data types
Nominal categorical data
Ordinal categorical data
Numerical data types
Ratio
Summary
Discrete and continuous variables
Discrete data:
Continuous data:
Summarising data through simple descriptive statistics
Describing the data: measures of central tendency and dispersion
Measures of central tendency
Mean
Median
Mode
Measures of dispersion
Range
Quartiles
Percentile
The Interquartile Range and Outliers
Variance and standard deviation
Plots, graphs and figures
Box and whisker plots
Count plots
Histogram
Distribution plots
Violin plots
Scatter plots
Pie chart
Sampling
Introduction
Types of sampling
Simple random sampling
Systematic random sampling
Cluster random sampling
Stratified random sampling

Week 3: Building an intuitive understanding of statistical analysis
From area to probability
P-values
Rolling dice
Equating geometrical area to probability
Continuous data types
The heart of inferential statistics: Central limit theorem
Central limit theorem
Skewness and kurtosis
Skewness
Kurtosis
Combinations
Central limit theorem
Distributions: the shape of data
Distributions
Normal distribution
Sampling distribution
Z-distribution
t-distribution

Week 4: The important first steps: Hypothesis testing and confidence levels
Hypothesis testing
The null hypothesis
The alternative hypothesis
The alternative hypothesis
Two ways of stating the alternative hypothesis
The two-tailed test
The one-tailed test
Hypothesis testing errors
Type I and II errors
Confidence in your results
Introduction to confidence intervals
Confidence levels
Confidence intervals

Week 5: Which test should you use?
Introduction to parametric tests
Types of parametric tests
Student's t-test: Introduction
Types of t-tests
ANOVA
Linear Regression
Nonparametric testing for your non-normal data
Nonparametric tests
Nonparametric tests

Week 6: Categorical data and analyzing accuracy of results
Comparing categorical data
The chi-squared goodness-of-fit test
The chi-squared test for independence
Fisher's exact test
Sensitivity, specificity, and predictive values
Considering medical investigations
Sensitivity and specificity
Predictive values


Week 1: Getting things started by defining different study types

Getting to know study types
Do you know your cross-sectional from your cohort study? Your retrospective case-control series
from your dependent control trials? Study types can be very confusing. Yet it is essential to know
what type of study you are reading or planning to conduct. Classifying a study as a specific type
tells us a lot about what we can learn from the outcomes, what strengths and weaknesses
underpin its design and what statistical analysis we can expect from the data.

There are various classification systems and it is even possible to combine aspects of different
study types to create new research designs. We'll start this course off with an intuitive
classification system that views studies as either observational or experimental (interventional).

Observational and experimental studies

In this first lecture I covered clinical study types, using the classification system that divides all
studies into either observational or experimental. I also took a look at meta-analyses and
systematic reviews.

Let's summarise the key characteristics of the different study types that I mentioned. The diagram
below gives an overview.

In observational studies:

- subjects and variables pertaining to them are observed and described
- no treatment or intervention takes place other than the continuation of normal work-flow,
  i.e. healthcare workers are allowed to carry on with their normal patient management or
  treatment plans
- there are four main observational study types: case series, case-control studies,
  cross-sectional studies and cohort studies

In experimental studies:

- subjects receive treatments or interventions based on a predesignated plan
- healthcare workers are not allowed to continue their routine care, but must alter their
  actions based on the design of the study
- this usually results in at least two groups of patients or subjects that follow a different plan
  and these can then be compared to each other
- the main idea behind an experimental study is to reduce bias
- if a study involves humans, this is known as a clinical trial
- the main experimental study types are: trials with independent concurrent controls, trials
  with self-controls, trials with external controls, and uncontrolled trials

I also introduced the topic of meta-analyses and systematic reviews. A meta-analysis uses
pre-existing studies and combines their results to obtain an overall conclusion. Meta-analyses
aim to overcome one of the most common problems that beset clinical research: small sample
sizes, resulting in underpowered results. A systematic review is a literature review that sums up
the best available information on a specific research question and includes results from research
into the specific field, published guidelines, expert (group) opinion and meta-analyses.

Now that you know how to distinguish between the various clinical studies, I'll cover each of the
study types in more depth in the upcoming lectures.

Getting to Know Study Types: Case Series

A case series is perhaps the simplest of all study types and reports a simple descriptive account
of a characteristic observed in a group of subjects. It is also known by the terms clinical series or
clinical audit.

A case series:

- observes and describes subjects
- can take place over a defined period or at an instant in time
- is purely descriptive and requires no research hypotheses
- is commonly used to identify interesting observations for future research or planning

Although simple by nature and design, case series are nevertheless important first steps in many
research areas. They identify the numbers involved, i.e. how many patients are seen, diagnosed,
under threat, etc., and describe various characteristics of these subjects.

We are by nature biased and, as humans, have a tendency to remember only selected cases or
events. We are usually poor at seeing patterns in large numbers or over extended periods. By
examining case series (audits), we find interesting and sometimes surprising results, and these
can lead to further research and even a change in management.

Paper referenced in video:
1. Donald, K. A., Walker, K. G., Kilborn, T., Carrara, H., Langerak, N. G., Eley, B., &
Wilmshurst, J. M. (2015). HIV Encephalopathy: pediatric case series description and
insights from the clinic coalface. AIDS Research and Therapy, 12, 1-10.
doi:10.1186/s12981-014-0042-7

Case-control Studies

Now that I have covered the topic of case-control studies in the video lecture, let's summarise and
expand on what we've learned.

A case-control study:

- selects subjects on the basis of the presence (cases) or absence (controls) of an
  outcome or disease
- looks back in time to find variables and risk factors that differ between groups
- can attempt to determine the relationship between the exposure to risk factors (or any
  measured variable) and the disease
- can include more than two groups

To illustrate these points, consider an example where patients undergo some form of invasive
surgical intervention. You might note that after the same surgical procedure, some develop
infection at the wound site and some do not. Those with the wound infection (termed surgical site
infection) make up the cases and those without, the controls. We can now gather data on various
variables such as gender, age, admission temperature, etc. and compare these between the two
groups. Note how the data on these variables all existed prior to the occurrence of the wound
infection. The study looks back in time and the data is collected retrospectively.

What is a drawback of case-control studies?

The main drawback is confounding, which refers to a false association between the exposure and
outcome. This occurs when there is a third variable, called the confounding factor, which is
associated with both the risk factor and the disease.

Let's consider another example. Several studies find an association between alcohol intake
(exposure) and heart disease (outcome). Here, a group of patients with heart disease will form the
cases and a group without heart disease, the controls, and we look back in time at their alcohol
consumption. If, by statistical analysis, we find a higher alcohol consumption in the heart disease
group than in the control group, we may think that drinking alcohol causes heart disease. But
a confounding factor, e.g. smoking, may be related to both alcohol intake and heart disease.
If the study does not consider this confounding factor, the relationship between the exposure and
outcome may be misinterpreted. The confounding factor, in this case smoking, needs to be
controlled for in order to find the true association.
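
The effect of a confounder can be made concrete with a small sketch. The counts below are invented purely for illustration (they are not from any study), and the odds ratio is used here as an assumed measure of association; it is not introduced in the notes above. When smoking is ignored, drinking appears to be associated with heart disease, but within each smoking group the association disappears.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio for a 2x2 table of exposure (alcohol) by outcome (heart disease)."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical counts, invented for illustration only:
# (drinking cases, non-drinking cases, drinking controls, non-drinking controls)
strata = {
    "smokers": (80, 20, 40, 10),
    "non-smokers": (10, 40, 20, 80),
}

# Crude table: ignore smoking and add the two strata together.
crude_counts = tuple(sum(column) for column in zip(*strata.values()))
crude_or = odds_ratio(*crude_counts)

# Stratified analysis: one odds ratio per smoking group.
stratum_ors = {name: odds_ratio(*counts) for name, counts in strata.items()}

print(f"Crude odds ratio (smoking ignored): {crude_or:.2f}")
for name, value in stratum_ors.items():
    print(f"Odds ratio among {name}: {value:.2f}")
```

In this made-up data the smokers both drink more and have more heart disease, which is what produces the misleading crude association.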

You can review the example of the case-control study in the paper I discussed in the lecture.

References:

1. Yung J, Yuen JWM, Ou Y, Loke AY. Factors Associated with Atopy in Toddlers: A
Case-Control Study. Tchounwou PB, ed. International Journal of Environmental Research
and Public Health. 2015;12(3):2501-2520. doi:10.3390/ijerph120302501.

Cross-sectional studies

Let's review the characteristics of cross-sectional studies.

A cross-sectional study:

- identifies a population or sub-population rather than individuals
- takes place at a point in time or over a (relatively) short period
- can measure a range of variables across groups at the same time
- is often conducted in the form of a survey
- can be a quick, easy and cost-effective way of collecting information
- can be included in other study designs such as case-control and cohort studies
- is commonly used to measure the prevalence of an outcome or disease, i.e. in
  epidemiological studies

What are the potential drawbacks of cross-sectional studies?

Bias

Response bias is when an individual is more likely to respond if they possess a particular
characteristic or set of characteristics. For example, HIV negative individuals may be more
comfortable responding to a survey discussing their status compared to HIV positive individuals. A
variety of technical difficulties or even age may also influence who responds. Once bias exists in
the group of responders, it can lead to skewing of the data and inappropriate conclusions can be
drawn from the results. This can have devastating consequences as these studies are sometimes
used to plan large scale interventions.

Separating Cause and Effect

Cross-sectional studies may not provide accurate information on cause and effect. This is
because the study takes place at a moment in time, and does not consider the sequence of
events. Exposure and outcome are assessed at the same time. In most cases we are unable to
determine whether the disease outcome followed the exposure, or the exposure resulted from the
outcome. Therefore it is almost impossible to infer causality.

You may find it useful to review the papers I discussed in the video, which are good examples of
cross-sectional studies.

References:
1. Lawrenson JG, Evans JR. Advice about diet and smoking for people with or at risk of
age-related macular degeneration: a cross-sectional survey of eye care professionals in
the UK. BMC Public Health. 2013;13:564. doi:10.1186/1471-2458-13-564.
2. Sartorius B, Veerman LJ, Manyema M, Chola L, Hofman K (2015) Determinants of
Obesity and Associated Population Attributability, South Africa: Empirical Evidence from a
National Panel Survey, 2008-2012. PLoS ONE 10(6): e0130218.
doi:10.1371/journal.pone.0130218

Cohort studies

As I outlined in the lecture, a cohort study:

- begins by identifying subjects (the cohort) with a common trait such as a disease or risk
  factor
- observes a cohort over time
- can be conducted retrospectively or prospectively

Retrospective Cohort Studies

A retrospective study uses existing data to identify a population and exposure status. Since we are
looking back in time, both the exposure and outcome have already occurred before the start of the
investigation. It may be difficult to go back in time and find the required data on exposure, as any
data collected was not designed to be used as part of a study. However, in cases where reliable
records are on hand, retrospective cohort studies can be useful.

Prospective Cohort Studies

In a prospective cohort study, the researcher identifies subjects comprising a cohort and their
exposure status at the beginning of the study. They are followed over time to see whether the
outcome (disease) develops or not. This usually allows for better data collection, as the actual
data collection tools are in place, with the required data clearly defined.

The term cohort is often misunderstood. It simply refers to a group of subjects and, for the
purposes of research, they usually have some common trait. We often use this term when referring
to this type of study, but you will also note it in the other forms of observational studies. When
used there, it is simply a generic term. When used in the case of cohort studies it refers to the
fact that the data gathered for the research points to events that occurred after the groups
(cohorts) were identified.

To get back to our earlier example of wound infection patients (that we used in the case-control
section), the patients with and without wound infection could be considered cohorts and we
consider what happened to them after the diagnosis of their wound infection. We might then
consider length of hospital stay, total cost, or the occurrence of any events after the development
of the wound infection (or at least after the surgery for those without wound infection). The
defining fact, though, is that we are looking forward in time from the wound infection, in contrast
to case-control studies, where we look back at events before the development of the wound
infection.

You may find it useful to review the paper I discussed in the video, which is a good example of a
cohort study.

Paper discussed in the video:

Le Roux, D. M., Myer, L., Nicol, M. P., & Zar, H. J. (2015). Incidence and severity of childhood
pneumonia in the first year of life in a South African birth cohort: the Drakenstein Child Health
Study. The Lancet Global Health, 3(2), e95-e103. doi:10.1016/S2214-109X(14)70360-2

Experimental studies

In experimental studies (as opposed to observational studies, which we discussed earlier), an
active intervention takes place. These interventions can take many forms such as medication,
surgery, psychological support and many others.

Experimental studies:

- aim to reduce bias inherent in observational studies
- usually involve two groups or more, of which at least one is the control group
- have a control group that receives no intervention or a sham intervention (placebo)

Randomization

To reduce bias, true randomization is required. That means that every member of a population
must have an equal opportunity (random chance) to be included in a sample group. That
necessitates the availability of a full list of the population and some method of randomly selecting
from that list.

In practical terms this means that every subject that forms part of the trial must have an equal
opportunity of ending up in any of the groups. Usually it also means that all of these subjects are
taken from a non-selected group, i.e. in a non-biased way. For example, if we want to
investigate the effectiveness of a new drug on hypertension, we must not only be certain that all
patients have an equal opportunity to receive either the drug or a placebo, but that all the
participants are randomly selected as a whole. If all the participants come from a selected group,
say, from a clinic for the aged, there is bias in the selection process. In this case, the researchers
must report that their results are only applicable to this subset of the population.
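
The idea that every subject has an equal chance of ending up in any group can be sketched in a few lines of Python. This is a minimal illustration only, not a description of how any particular trial allocates patients; the function name `randomize`, the group labels and the subject numbers are all assumptions made for the example.

```python
import random


def randomize(subject_ids, groups=("treatment", "placebo"), seed=None):
    """Randomly allocate subjects to groups of (near-)equal size."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering of the subjects is equally likely

    # Deal the shuffled subjects out to the groups in turn, like dealing cards.
    allocation = {group: [] for group in groups}
    for position, subject in enumerate(shuffled):
        allocation[groups[position % len(groups)]].append(subject)
    return allocation


# Twenty hypothetical subjects, numbered 1 to 20.
allocation = randomize(range(1, 21), seed=42)
for group, members in allocation.items():
    print(group, sorted(members))
```

Shuffling first and then dealing keeps the group sizes balanced. Note that this only covers allocation within the trial; whether the twenty subjects were themselves selected without bias is a separate question.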

Blinding

If the subjects do not know whether they are in the control group or not, this is termed blinding.
When the researchers are similarly unaware of the grouping, it is termed double-blinding. This
method is preferable but not always possible, e.g. in a surgical procedure. In these cases the
observer taking measurements after the intervention may be blinded to the intervention and the
surgeon is excluded from the data collection or measurements.

The pinnacle of clinical research is usually seen to be the randomized, double-blind controlled trial.
It often provides the strongest evidence to prove causation.

The control group can be set up in a variety of ways:

Trials with independent concurrent controls

In trials with independent concurrent controls, the controls are included in the trial at the same
time as the study participants, and data points are collected at the same time. In practical terms
this means that a participant cannot be in both groups, nor are monozygotic (identical) twins
allowed.

Trials with self-controls

In trials with self-controls, the same subjects serve as both the control and the treatment group.
Data is collected on subjects before and after the intervention.

The most elegant subtype of this form of trial is the cross-over study. Two groups are formed, each
with their own intervention. Most commonly one group will receive a placebo. They form their own
controls, therefore data is collected on both groups before and after the intervention. After the
intervention and data collection, a period of no intervention takes place. Before intervention
resumes, individuals in the placebo group are swapped with individuals in the treatment group.
The placebo group then becomes the treatment group and the treatment group becomes the
placebo group.

Trials with external controls

Trials with external controls compare a current intervention group to a group outside of the
research sample. The most common external control is a historical control, which compares the
intervention group with a group tested at an earlier time. For example, a published paper can
serve as a historical control.

Uncontrolled trials

In these studies an intervention takes place, but there are no controls. All patients receive the
same intervention and the outcomes are observed. The hypothesis is that there will be varying
outcomes and reasons for these can be elucidated from the data. No attempt is made to evaluate
the intervention itself, as it is not being compared to either a placebo or an alternative form of
intervention.

You may find it useful to review the paper I discussed in the video, which is a good example of an
experimental study.

Paper mentioned in video:

1. Thurtell MJ, Joshi AC, Leone AC, et al. Cross-Over Trial of Gabapentin and Memantine as
Treatment for Acquired Nystagmus. Annals of Neurology. 2010;67(5):676-680.
doi:10.1002/ana.21991.

Meta-analysis and systematic review

We're almost at the end of week 1! I've covered observational and experimental studies in
previous videos, and I'm concluding this lesson with a discussion on meta-analyses and
systematic reviews.

Meta-analysis

A meta-analysis:

- uses pre-existing research studies and combines their statistical results to draw an
  overall conclusion
- centers around a common measurement such as finding an average or mean
- is useful for combining individual studies of inadequate size or power to strengthen
  results
- uses inclusion and exclusion criteria to select papers to be analysed
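
To give a feel for what combining results around a common measurement can look like, here is a deliberately simplified sketch that pools the means of three studies, weighting each by its sample size. The study sizes and means are invented for illustration. Weighting by sample size is an assumption made to keep the example short; it should not be read as the method a real meta-analysis would use, since those typically rely on more careful weighting and on checks for differences between studies.

```python
# Hypothetical studies, invented for illustration: (sample size, reported mean)
studies = [(20, 5.0), (30, 6.0), (50, 4.0)]


def pooled_mean(studies):
    """Sample-size-weighted mean across studies."""
    total_n = sum(n for n, _ in studies)
    weighted_sum = sum(n * mean for n, mean in studies)
    return weighted_sum / total_n


# Simple average of the three reported means: every study counts equally.
unweighted = sum(mean for _, mean in studies) / len(studies)

# Pooled estimate: larger studies count for more.
pooled = pooled_mean(studies)

print(f"Unweighted average of study means: {unweighted:.2f}")
print(f"Sample-size-weighted pooled mean:  {pooled:.2f}")
```

The pooled value is pulled towards the largest study, which reflects the idea that a result based on more subjects deserves more weight.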

What is a drawback of meta-analysis?

One possible drawback, which makes meta-analyses less useful, is the ever-present danger of
publication bias. Publication bias is well recognized and refers to the fact that, through various
positive and negative incentives, it is much more likely to find positive results in the published
literature, i.e. statistically significant results. It is much less common to find negative results.
Even though meta-analysis is used to increase the power of a result and make it more
generalizable, its results may still be poor if the studies on which it is based are biased towards
positive, statistically significant results.

Systematic Review

How often do you come across research studies that contradict one another's findings? One
study reports that carbohydrates are bad for you, another study says carbohydrates are required
as part of a balanced diet. When looking for research evidence, we need to look beyond a single
study. This is where systematic reviews fit in.

A systematic review:

- summarises a comprehensive amount of published research
- helps you find a definitive word on a research question
- can include a meta-analysis
- can use frameworks such as PRISMA to structure the review

What is the difference between a systematic review and a meta-analysis?
There is some overlap and not everyone sticks clearly to the strict definitions of these two types
of research (although, as I mentioned in the lesson, clear guidelines for both have been accepted
by most researchers and publishers). The aim of both is to collect and use previously published
data. Most systematic reviews include a meta-analysis, but they are rather like two circles in a
Venn diagram, with some overlap (intersection).

A meta-analysis is indeed the combination of previously published research. This combination can
then be used for re-analysis. Combining these results into bigger numbers may result in improved
results. It is often seen as a quantitative look at previously published data. There may or may not
be some narrative to the meta-analysis giving a little background information and knowledge
about the subject.

A systematic review also collects previously published work, but takes a more qualitative look
and is usually much more involved than a meta-analysis. It really aims to be the most definitive
word on a topic or research question. Certain procedures are followed and are clearly set out in
the design of the study, so as to minimise bias in the selection and analysis of the data. Objective
techniques are then used to analyse the data. The focus is on the magnitude of the effect, rather
than on statistical significance (meta-analysis). It adds a lot of detail and explanation of the topic
at hand and includes a lot of narrative.

There is then also the narrative review. These are found in publications such as the various North
American Clinics journals. There is reference to previously published data, but the trend is more
towards the style of a textbook.

Next up, we've developed a practice quiz for you to check your understanding. Good luck!

Papers mentioned in the video:

1. Geng L, Sun C, Bai J. Single Incision versus Conventional Laparoscopic Cholecystectomy
Outcomes: A Meta-Analysis of Randomized Controlled Trials. Hills RK, ed. PLoS ONE.
2013;8(10):e76530. doi:10.1371/journal.pone.0076530.

Week 2: Describing your data

The spectrum of data types

Definitions
There are eight key definitions which I introduced in the video lecture that I will be using
throughout the rest of this course.

Descriptive statistics

- The use of statistical tools to summarize and describe a set of data values
- Human beings usually find it difficult to create meaning from long lists of numbers or
  words
- Summarizing the numbers or counting the occurrences of words and expressing that
  summary with single values makes much more sense to us
- In descriptive statistics, no attempt is made to compare any data sets or groups

Inferential statistics

- The investigation of specified elements which allow us to make inferences about a
  larger population (i.e., beyond the sample)
- Here we compare groups of subjects or individuals
- It is normally not possible to include each subject or individual in a population in a study,
  therefore we use statistics and infer that the results we get apply to the larger population

Population

- A group of individuals that share at least one characteristic in common
- On a macro level, this might refer to all of humanity
- At the level of clinical research, this might refer to every individual with a certain
  disease or risk factor, which might still be an enormous number of individuals
- It is quite possible to have a quite small population, e.g. in the case of a very rare condition
- The findings of a study are inferred to a larger population; we make use of the findings
  to manage the population to which those study findings apply

Sample

- A sample is a selection of members within the population (I'll discuss different ways of
  selecting a sample a bit later in this course)
- Research is conducted using that sample set of members and any results can be
  inferred to the population from which the sample was taken
- This use of statistical analysis makes clinical research possible as it is usually near
  impossible to include the complete population

Parameter

- A statistical value that is calculated from all the values in a whole population is termed a
  parameter
- If we knew the age of every individual on earth and calculated the mean or average age,
  that age would be a parameter

Statistic

- A statistical value that is calculated from all the values in a sample is termed a statistic
- The mean or average age of all the participants in a study would be a statistic

Variable

- There are many ways to define a variable, but for use in this course I will refer to a
  variable as a group name for any data values that are collected for a study
- Examples would include age, presence of risk factor, admission temperature, infective
  organism, systolic blood pressure
- These invariably become the column names in a data spreadsheet, with each row
  representing the findings for an individual in a study

Data point

- I refer to a data point as a single example value for a variable, e.g. a patient might have a
  systolic blood pressure (the variable) of 120 mmHg (the data point)

Let's use this knowledge in a quick example: Say we want to test the effectiveness of
levothyroxine as treatment for hypothyroidism in South African subjects between the ages of
18 and 24. It will be physically impossible to find every individual in the country with
hypothyroidism and collect their data. However, we can collect a representative sample of the
population. For example, we can calculate the average (sample statistic) of the thyroid
stimulating hormone level before and after treatment. We can use this average to infer results
about the population parameter.

In this example thyroid stimulating hormone level would be a variable and the actual numerical
value for each patient would represent individual data points.
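
The distinction between a parameter and a statistic can be illustrated with a short simulation. Everything in this sketch is an assumption made for illustration: the population is generated by the computer rather than measured, and the numbers do not represent real hormone levels.

```python
import random
import statistics

rng = random.Random(2016)  # fixed seed so the sketch is reproducible

# Simulated population of 10 000 values for one variable (hypothetical, not real data).
population = [rng.gauss(2.5, 1.0) for _ in range(10_000)]

# Parameter: calculated from every value in the whole population.
parameter = statistics.mean(population)

# Statistic: calculated from a sample of 100 members of that population.
sample = rng.sample(population, 100)
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:.2f}")
print(f"Sample mean (statistic):     {statistic:.2f}")
```

In real research the population values are unknown, so only the statistic can be calculated; the simulation simply shows that a sample mean tends to land close to the population mean it is used to estimate.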

Data types
I will be using two classifications of data, categorical and numerical, each of which has
two sub-divisions.

- Categorical (including nominal and ordinal data) refers to categories or things, not
  mathematical values

- Numerical (further defined as being either interval or ratio data) refers to data which is
  about measurement and counting

In short, categorical data types refer to words. Although words can be counted, the words
themselves only represent categories. This can be said for diseases. Acute cholecystitis (infection
of the gallbladder) and acute cholangitis (infection of the bile ducts) are both diseases of the
biliary (bile) tract. As words, these diseases represent categorical entities. Although I can count
how many patients have one of these conditions, the diseases themselves are not numerical
entities. The same would go for gender, medications, and many other examples.

Just to make things a bit more difficult, actual numbers are sometimes categorical and not
numerical. A good, illustrative example would be choosing from a rating system for indicating the
severity of pain. I could ask a patient to rate the severity of the pain they experience after surgery
on a scale from 0 (zero) to 10 (ten). These are numbers, but they do NOT represent numerical
values. I can never say that a patient who chooses 6 (six) has twice as much pain as someone
who chooses 3 (three). There is no fixed difference between each of these numbers. They are not
quantifiable. As such, they represent categorical values.

As the name implies, numerical data refers to actual numbers. We distinguish numerical number
values from categorical number values in that there is a fixed difference between them. The
difference between 3 (three) and 4 (four) is the exact same difference as that between 101
(one-hundred-and-one) and 102 (one-hundred-and-two).

I will also cover another numerical classification type: discrete and continuous variables. Discrete
values, as the name implies, exist as little islands which are not connected (no land between
them). Think of the roll of a die. With a normal six-sided die you cannot roll a three-and-a-half.
Continuous numerical values, on the other hand, have (for practical purposes) many values
in between other values. They are infinitely divisible (within reasonable limits).

Nominal categorical data
Nominal categorical data:

- are data points that represent either words (yes or no) or concepts (like gender or
  heart disease) which have no mathematical value
- have no natural order to the values or words, i.e. they are nominal, for example: gender, or
  your profession

Be careful of categorical concepts that may be perceived as having some order. Usually these
are open to interpretation. Someone might suggest that heart disease is worse than kidney
disease or vice versa. This, though, depends on so many points of view. Don't make things too
complicated. In general, it is easy to spot the nominal categorical data type.

Readings referred to in this video as examples of nominal categorical data:

1. De Moraes, A. G., Racedo Africano, C. J., Hoskote, S. S., Reddy, D. R. S., Tedja, R.,
Thakur, L., Smischney, N. J. (2015). Ketamine and Propofol Combination (Ketofol) for
Endotracheal Intubations in Critically Ill Patients: A Case Series. The American Journal of
Case Reports, 16, 81-86. doi:10.12659/AJCR.892424
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4332295/

Ordinal categorical data
If categorical data have some natural order or a logical ranking to the data points, it is termed
ordinal categorical data, i.e. they can be placed in some increasing or decreasing order.

I gave the example of a pain score from 1 (one) to 10 (ten). Even though these are numbers, no
mathematical operation can be performed on these digits. They are ordered in magnitude from 1
to 10. But there is no standardized measurement of these rankings and therefore no indication
that the interval between the specific scores is of the same value.

Other common examples include survey questions, where a participant can rate their agreement
with a statement on a scale, say 1 (one), indicating that they don't agree at all, to 5 (five),
indicating that they fully agree. Likert style answers such as totally disagree, disagree, neither
agree nor disagree, agree and totally agree can also be converted to numbers, i.e. 1 (one) to 5
(five). Although they can be ranked, they still have no inherent numerical value and as such
remain ordinal categorical data values.
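
As a rough sketch of what this means in practice, the Likert answers above can be coded as 1 to 5 and then ranked, but the codes are only labels for positions in an order. The five responses below are invented for illustration, and reporting the middle answer (rather than an arithmetic average) is an assumption chosen here because it relies only on the ordering.

```python
import statistics

# Coding of the Likert answers described above (1 = totally disagree, 5 = totally agree).
likert = {
    "totally disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "totally agree": 5,
}
labels = {code: text for text, code in likert.items()}

# Hypothetical survey responses, invented for illustration.
responses = ["agree", "totally disagree", "agree",
             "neither agree nor disagree", "totally agree"]

# Ranking is meaningful for ordinal data ...
codes = sorted(likert[answer] for answer in responses)

# ... so the middle answer can be reported, translated back to its label.
median_code = statistics.median_low(codes)
median_label = labels[median_code]

print("Ranked codes:", codes)
print("Middle answer:", median_label)
```

What the codes do not support is a statement such as "agree (4) is twice as much agreement as disagree (2)", because the distance between neighbouring answers is not fixed.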

References:

1. Lawrenson JG, Evans JR. Advice about diet and smoking for people with or at risk of
age-related macular degeneration: a cross-sectional survey of eye care professionals in
the UK. BMC Public Health. 2013;13:564. doi:10.1186/1471-2458-13-564.

Numerical data types
As opposed to categorical data types (words, things, concepts, rating numbers), numerical data
types involve actual numbers. Numerical data is quantitative data, for example, the weights of the
babies attending a clinic, the doses of medicine, or the blood pressure of different patients. They
can be compared and you can do calculations on the values. From a mathematical point of view,
there are fixed differences between values. The difference between a systolic blood pressure
value of 110 and 120 mmHg is the same as between 150 and 160 mmHg (being 10 mmHg).

There are two types of numerical data: interval and ratio.

Interval

With interval data, the difference between each value is the same, which means the definition
I used above holds. The difference between 1 and 2 degrees Celsius is the same as the
difference between 3 and 4 degrees Celsius (there is a 1 degree difference). However,
temperatures expressed in degrees Celsius (or Fahrenheit) do not have a true zero because 0
(zero) degrees Celsius is not a true zero. This means that with numerical interval data (like
temperature) we can order the data and we can add and subtract, but we cannot divide and
multiply the data (we can't do ratios without a true zero). 10 degrees plus 10 degrees is 20
degrees, but 20 degrees is not twice as hot as 10 degrees Celsius. Ratio type numerical data
requires a true zero.

Ratio

This type applies to data that have a true 0 (zero), which means you can establish a meaningful
relationship between the data points as related to the 0 (zero) value, e.g. age from birth (0) or
white blood cell count or number of clinic visits (from 0). A systolic blood pressure of 200 mmHg is
indeed twice as high as a pressure of 100 mmHg.
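
The difference between interval and ratio data can be checked with a few lines of arithmetic. The sketch below converts the Celsius values to kelvin, a temperature scale that does have a true zero; the kelvin scale is brought in here as an outside illustration and is not part of the lecture. The function name `celsius_to_kelvin` is likewise just a name chosen for the example.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature to kelvin, a scale with a true zero."""
    return celsius + 273.15


# Interval data: differences are meaningful.
difference = 20 - 10  # 10 degrees, the same size as the step from 30 to 40

# Dividing the Celsius values gives 2, but that number depends on where zero was placed.
naive_ratio = 20 / 10

# On a scale with a true zero, 20 degrees Celsius is only about 3.5% warmer than 10.
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Ratio data: blood pressure has a true zero, so this ratio is meaningful as it stands.
pressure_ratio = 200 / 100

print(f"Naive Celsius ratio: {naive_ratio:.2f}")
print(f"Ratio in kelvin:     {true_ratio:.3f}")
print(f"Pressure ratio:      {pressure_ratio:.2f}")
```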

Summary

Nominal categorical = naming and describing (e.g. gender)

Ordinal categorical = some ordering or natural ranking (e.g. pain scales)

Interval numerical = meaningful increments of difference (e.g. temperature)

Ratio numerical = can establish a base-line relationship between the data with the absolute 0
(e.g. age)

Why do we need to spend time distinguishing data?

You have to use very different statistical tests for different types of data, and without
understanding what data type the values (data points) reflect, it is easy to make false claims or
use incorrect statistical tests.

Discrete and continuous variables
Another important way to classify the data you are looking at is to distinguish between discrete
and continuous types of data.

Discrete data:
- has a finite set of values

- cannot be subdivided (rolling a die is an example: you can only roll a 6, not a 6.5!)

- a good example is binomial values, where only two values are present, for example, a
  patient develops a complication, or they do not

Continuous data:
- has infinite possibilities of subdivision (for example, 1.1, 1.11, 1.111, etc.)

- an example I used was the measure of blood pressure, and the possibility of taking ever
  more detailed readings depending on the sensitivity of the equipment that is being used

- is mostly seen in a practical manner, i.e. although we can keep on halving the number of
  red blood cells per litre of blood and eventually end up with a single (discrete) cell, the
  very large numbers we are dealing with make red blood cell count a continuous data
  value

In the next lesson, I will look at why knowledge about the data type is so important. Spoiler: The
statistical tests used are very different depending on which types of data you are working with.

Summarising data through simple descriptive statistics

Describing the data: measures of central tendency and dispersion

Research papers show summarized data values, (usually) without showing the actual data set.
Instead, key methods are used to convey the essence of the data to the reader. This summary is
also the first step towards understanding the research data. As humans we cannot make sense of
large sets of numbers or values. Instead, we rely on summaries of these values to aid this
understanding.

There are three common methods of representing a set of values by a single number: the mean,
median and mode. Collectively, these are all measures of central tendency, or sometimes, point
estimates.

Most papers will also describe the actual size of the spread of the data points, also known as the
dispersion. This is where you will come across terms such as range, quartiles, percentiles,
variance and the more common standard deviation, often abbreviated as SD.

Measures of central tendency

Let's summarize what we've just learned about mean, median and mode. They are measures of
central tendency. As the name literally implies, they represent some value that tends to the middle
of all the data point values in a set. What they achieve in reality is to summarize a set of data
point values for us, replacing a whole set of values with a single value that is somehow
representative of all the values in the set.

As humans we are poor at interpreting meaning from a large set of numbers. It is easier to take
meaning from a data set if we can consider a single value that represents that set of numbers. In
order for all of this to be meaningful, the measure of central tendency must be an accurate
reflection of all the actual values. No one method would suffice for this purpose and therefore we
have at least these three.

Mean

The mean, or average, is found by adding up all the data point values for a variable in a
dataset and dividing that sum by the number of values in the set.
It is a meaningful way to represent a set of numbers that do not have outliers (values that
are very different from the large majority of numbers).
Example: the average or mean for the dataset 3, 4, 5, 8, 10, 12, 63 is 15
((3+4+5+8+10+12+63)/7 = 15).

Median

The median is a calculated value that falls right in the middle of all the other values. That
means that half of the values are higher than and half are lower than this value, irrespective
of how high or low they are (what their actual values are).
It is used when there are values that might skew your data, i.e. a few of the values are
much different from the majority of the values.
In the example above (under mean) the value 15 is intuitively a bit of an overestimation:
only one of the values (63) is larger than it, making it somehow unrepresentative of
the other values (3, 4, 5, 8, 10, and 12).
Example: The calculation is based on whether there is an odd or even number
of values. If odd, this is easy. There are seven numbers (odd). The number 8 appears in
the dataset and three of the values (3, 4, 5) are lower than 8 and three (10, 12, 63) are
higher than 8, therefore the median is 8. In case of an even number of values, the
average of the middle two is taken to reach the median.
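
The two calculations above can be checked in a few lines. This is only a sketch using Python's standard `statistics` module; the six-value list `even_values` is a hypothetical variation I added to show the even-numbered case.

```python
from statistics import mean, median

# The example dataset used in these notes
values = [3, 4, 5, 8, 10, 12, 63]

print(mean(values))    # sum of the values (105) divided by how many there are (7)
print(median(values))  # the middle value once the data are sorted

# Hypothetical variation: the same data without the outlier 63 (an even count).
# The median is then the average of the two middle values, 5 and 8.
even_values = [3, 4, 5, 8, 10, 12]
print(median(even_values))
```

Note how the single large value 63 pulls the mean up to 15, while the median stays at 8.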

Mode

The mode is the data value that appears most frequently.
It is used to describe categorical values.
Because it returns the value that occurs most commonly in a dataset, some
datasets might have more than one mode, leading to the terms bimodal (for two modes)
and multimodal (for more than two modes).
Example: Think of a questionnaire in which participants could choose between values of
0 to 10 to indicate their amount of pain after a procedure:

Then arrange the numbers in order:

It is evident now that most chose the value of 4, so 4 would be the mode.
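
As a rough sketch of how the mode can be found, the snippet below counts how often each value occurs. The list `pain_scores` is hypothetical (the scores from the lecture are not reproduced in these notes); it was made up so that 4 is the most common answer, as in the example.

```python
from collections import Counter
from statistics import multimode

# Hypothetical pain scores (0 to 10) from ten participants
pain_scores = [4, 2, 7, 4, 5, 4, 3, 6, 4, 5]

# Arrange the numbers in order, then count how often each one occurs
print(sorted(pain_scores))
print(Counter(pain_scores).most_common(1))  # the most frequent value and its count

# multimode returns every value that shares the highest count,
# so a bimodal dataset gives back two values.
print(multimode(pain_scores))
print(multimode([1, 1, 2, 2, 3]))  # hypothetical bimodal example
```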

Note: It would be incorrect to use mean and median values when it comes to categorical data.
Mode is most appropriate for categorical data types, and the only measure of central tendency
that can be applied to nominal categorical data types. In the case of ordinal categorical data such
as in our pain score example above, or with a Likert scale, a mean score of 5.5625 would be
meaningless. Even if we rounded this off to 5.6, it would be difficult to explain what 0.6 of a pain
unit is. If you consider it carefully, even the median suffers from the same shortcoming.

Readings in this video

1. Mgelea E et al. Detecting virological failure in HIV-infected Tanzanian children. S Afr Med
J. 2014;104(10):696-9. doi:10.7196/samj.7807
http://www.samj.org.za/index.php/samj/article/view/7807/6241
2. Naidoo, S., Wand, H., Abbai, N., & Ramjee, G. (2014). High prevalence and incidence of
sexually transmitted infections among women living in Kwazulu-Natal, South Africa. AIDS
Research and Therapy, 11(1), 31. doi:10.1186/1742-6405-11-31
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168991/pdf/1742-6405-11-31.pdf
You will note that they represented their age and duration analysis by the median. In their
case they described a median age of 28, and a median duration of 12 months.

Measures of dispersion

There are several measures of dispersion such as range, quartiles, percentiles, variance and the
more common standard deviation (SD). Whereas measures of central tendency give us a
single-value representation of a dataset, measures of dispersion summarize for us how spread
out the dataset is.

Range

The range refers to the difference between the minimum and maximum values.
It is usually expressed by noting both the values.
It is used when simply describing data, i.e. when no inference is called for.

Quartiles

Quartiles divide the group of values into four equal quarters, in the same way that the median
divides the dataset into two equal parts.
The zeroth quartile is the same as the minimum value in a dataset, and the fourth
quartile is the same as the maximum value.
The first quartile is the value that divides the dataset so that a quarter of the values are
smaller than the first quartile value and three-quarters are larger than that value.
The second quartile divides the dataset into two equal sets and is nothing
other than the median.
The third quartile is the value that divides the dataset so that three-quarters of the values
are smaller than the third quartile value and one quarter are larger than that value.

Percentile

Percentiles look at your data in finer detail: instead of simply cutting your values into quarters,
you can calculate a value for any percentage of your data points.
The first quartile becomes the 25th percentile, the median (or second quartile) the 50th
percentile and the third quartile the 75th percentile (all of these are just different
expressions of the same thing).
A percentile rank gives the percentage of values that fall below any value
in your set that you decide on, e.g. a value of 99 might have a percentile rank of 13,
meaning that 13% of the values in the set are less than 99 and 87% are larger than 99.
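
The percentile rank described above can be sketched as a small function. The name `percentile_rank` is my own, and I am assuming the simple "percentage of values strictly below" definition used in the text; statistical software sometimes uses slightly different conventions.

```python
def percentile_rank(values, x):
    """Return the percentage of values in the set that are strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


values = [3, 4, 5, 8, 10, 12, 63]  # the example dataset from these notes

# Four of the seven values (3, 4, 5 and 8) are smaller than 10
print(round(percentile_rank(values, 10), 1))
```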

The Interquartile Range and Outliers

The interquartile range (IQR) is the difference between the values of the first and third quartiles.
A simple subtraction. It is used to determine statistical outliers.

Extreme or atypical values which fall far out of the range of data points are termed outliers and
can be excluded.

For example:
Remember our initial sample values from the lecture (3, 4, 5, 8, 10, 12 and 63)?

With this small dataset we can intuitively see that 63 is an outlier. When datasets are much larger
this might not be so easy, and outliers can be detected by multiplying the interquartile range (IQR)
by 1.5. This value is subtracted from the first quartile and added to the third quartile. Any value in
the dataset that is lower than the first of these limits or higher than the second can be considered
a statistical outlier.
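
The 1.5 times IQR rule can be sketched as follows. Quartiles can be calculated in several slightly different ways; here I assume the "inclusive" method of Python's `statistics.quantiles`, so other software may give somewhat different limits for the same data.

```python
from statistics import quantiles

values = [3, 4, 5, 8, 10, 12, 63]  # the example dataset from these notes

# First, second (median) and third quartiles, using the "inclusive" method (an assumption)
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any value outside the two limits is flagged as a statistical outlier
outliers = [v for v in values if v < lower_limit or v > upper_limit]
print(q1, q3, iqr, lower_limit, upper_limit, outliers)
```

With these data only the value 63 falls outside the limits.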

Outlier values will have the biggest impact on the calculation of the mean (rather than on the mode
or median). Such values can be omitted from analysis if it is reasonable to do so (e.g. incorrect
data input or machine error) and the researcher states that this was done and why. If the value(s)
is/are rechecked and confirmed as valid, special statistical techniques can help reduce the skewing
effect.

Variance and standard deviation

The variance describes the extent of dispersion or spread of data values in relation to the mean.
We usually report the square root of the variance, which is called the standard
deviation (SD).

Imagine all the data values in a dataset are represented by dots on a straight line, i.e. the
familiar x-axis from graphs at school. A dot can also be placed on this line representing
the mean value. Now the distance between each point and the mean is taken and then
averaged, so as to get an average distance of how far all the points are from the mean.
Note that we want distance away from the mean, i.e. not negative values (some values
will be smaller than the mean). For this mathematical reason all the differences are
squared, resulting in all positive values.
The average of all these squared values is the variance. The square root of this is then the SD,
which can be thought of as the typical distance of the data points from the mean.
As an illustration, the data values of 1, 2, 3, 20, 38, 39, and 40 have a much wider spread
(standard deviation) than 17, 18, 19, 20, 21, 22, and 23. Both sets have an average of
about 20, but the first has a much wider spread or SD. When comparing the results of two
groups we should always be circumspect when large standard deviations are reported
and especially so when the values of the standard deviations overlap for the two groups.
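
The illustration above can be reproduced with a short sketch. I use `pvariance` and `pstdev` from Python's `statistics` module because they average the squared distances exactly as described (dividing by the number of values). Many packages instead divide by one less than the number of values when working with a sample (`variance` and `stdev` in the same module), which gives slightly larger results.

```python
from statistics import mean, pstdev, pvariance

wide = [1, 2, 3, 20, 38, 39, 40]
narrow = [17, 18, 19, 20, 21, 22, 23]

for name, data in (("wide", wide), ("narrow", narrow)):
    # pvariance: the average of the squared distances from the mean
    # pstdev: the square root of that average
    print(name, round(mean(data), 2), round(pvariance(data), 2), round(pstdev(data), 2))
```

The two means are nearly identical, but the standard deviation of the first set is several times larger than that of the second.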

Readings referred to in this video:

1. Mgelea E et al. Detecting virological failure in HIV-infected Tanzanian children. S Afr Med
J. 2014;104(10):696-9. doi:10.7196/samj.7807
http://www.samj.org.za/index.php/samj/article/view/7807/6241
2. Naidoo, S., Wand, H., Abbai, N., & Ramjee, G. (2014). High prevalence and incidence of
sexually transmitted infections among women living in Kwazulu-Natal, South Africa. AIDS
Research and Therapy, 11(1), 31. doi:10.1186/1742-6405-11-31
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168991/pdf/1742-6405-11-31.pdf

Plots, graphs and figures
Most published articles, poster presentations and indeed almost all research presentations make
use of graphs, plots and figures. The graphical representation of data allows for compact, visual
and information-rich consumption of data.

It is invariably much easier to understand complex categorical and numerical data when it is
represented in pictures. Now that you have a good understanding of the different data types, we
will take a look at the various ways to graph them.

Box and whisker plots
A box and whisker plot provides us with information on both the measures of central tendency as
well as measures of spread. It takes the list of numerical data point values for each group of a
categorical variable and makes use of quartiles to construct a rectangular block.

In the example above we see three categorical groups (Group A, B and C) on the x-axis and some
numerical data type on the y-axis. The rectangular block has three lines. The bottom line (bottom
of the rectangle if drawn vertically) represents the first quartile value for the list of numerical
values and the top line (top of the rectangle) indicates the third quartile value. The middle line
represents the median.

The whiskers can represent several values and the authors of a paper should make it clear what
their whisker values represent. Possible values include: minimum and maximum values, values
beyond which statistical outliers are found (one-and-a-half times the interquartile range below and
above the first and third quartiles), one standard deviation below and above the mean, or a variety
of percentiles (2nd and 98th or 9th and 91st).

Some authors also add the actual data points to these plots. The mean and standard deviation
can also be added, as we can see in the graph below (indicated by dotted lines).

For both graphs above the whiskers indicate the minimum and maximum values. Note clearly how
the box and whisker plot gives us an idea of the spread and central tendency of numerical data
points for various categories.

Count plots
Count plots tell us how many times a categorical data point value or discrete numerical value
occurred. Count plots are often referred to as bar plots, but the term bar plot is often misused.

This count plot is taken from a famous dataset and represents the number of passengers on the
Titanic divided by age and gender.

Histogram
A histogram is different from a count plot in that it takes numerical values on the x-axis. From
these, so-called bins are constructed. A bin represents a minimum and a maximum cut-off value.
Think of the counting numbers 0 to 20. Imagine now that I have a list of 200 values, all taken from
0 to 20, something like 1, 15, 15, 13, 12, 20, 13, 13, 13, 14, ... and so on. I can construct bins with
a size of five, i.e. 0 to 5, 6 to 10, 11 to 15, and 16 to 20. I can now count how many of my list of 200
fall into each bin. From the truncated list here, there are already 8 values in the 11 to 15 bin. The
size of the bins is completely arbitrary and the choice is up to the researcher(s) involved in a
project.
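
The counting step behind a histogram can be sketched in a few lines. I use only the ten values listed above (not the full, imagined list of 200) and treat both cut-off values of each bin as included, which is an assumption about how the bins are meant to be read.

```python
from collections import Counter

values = [1, 15, 15, 13, 12, 20, 13, 13, 13, 14]  # the truncated list from the text
bins = [(0, 5), (6, 10), (11, 15), (16, 20)]      # (minimum, maximum) cut-off values

counts = Counter()
for v in values:
    for low, high in bins:
        if low <= v <= high:  # both cut-offs are treated as part of the bin
            counts[(low, high)] += 1
            break

for low, high in bins:
    print(f"{low:>2} to {high:<2}: {counts[(low, high)]}")
```

The heights of the bars in a histogram are simply these counts.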

The graph above actually adds what is termed a rug plot at the bottom, showing the actual data
points, which in turn shows how many there are in each bin.

Distribution plots
These plots take histograms to the next level and, through mathematical equations, give us a
visual representation of the distribution of the data.

Note that, as with the histogram, the x-axis is a numerical variable. If you look closely you'll see
the strange values on the y-axis. Later in the course we will learn that it is from these graphs that
we calculate p-values. For now, it is clear that they give us a good indication of what the
distribution shape is like. For the graph above, showing densities for three groups, we note that
some values (at around 50) occur much more commonly than values at 20 or 80, as indicated by
the height of the lines at each of these x-axis values. The distributions here are all nearly
bell-shaped, a shape called the normal distribution.

Violin plots
Violin plots combine box and whisker plots and density plots. They actually take the density plots,
turn them on their sides and mirror them.

In the graph above we have dotted lines indicating the median and first and third quartiles as with
box and whisker plots, but we get a much better idea of the distribution of the numerical values.

Scatter plots
Scatter plots combine sets of numerical values. Each dot has a value from two numerical data
point sets, so as to make an x- and a y-coordinate. Think of a single patient with a white cell count
and a red cell count value.

From the graph above it is clear that both axes have numerical values. Through mathematical
equations, we can even create lines that represent all these points. These are useful for creating
predictions. Based on any value on the x-axis, we can calculate a predicted value on the y-axis.
Later we will see that this is a form of linear regression and we can use it to calculate how well
two sets of values are correlated. The graph below shows such a line and even adds a histogram
and density plot for each of the two sets of numerical variables.

Pie chart
Lastly, we have to mention the poor old pie chart. Often frowned upon in scientific circles, it
nonetheless has its place.

The plot above divides up each circle by how many of each of the values 1 through 5 occurred in
each dataset.

Sampling

Introduction
This course tackles the problem of how healthcare research is conducted, but have you ever
wondered, on a very fundamental level, why healthcare research is conducted? Well, the why is
actually quite easy. We would like to find satisfactory answers to questions we have about human
diseases and their prevention and management. There are endless numbers of questions. The
how is more tricky.
To answer questions related to healthcare, we need to investigate, well, humans. Problem is, there
are so many of us. It is almost always impossible to examine all humans, even when it comes to
some pretty rare diseases and conditions. Can you imagine the effort and the cost?

To solve this problem, we take only a (relatively) small group of individuals from a larger population
that contains people with the disease or trait that we would like to investigate. When we analyse
the data pertaining to the sample selection and get answers to our questions, we need to be sure
that these answers are useful when used in managing everyone who was not in the sample.

In order to do this, we must be sure that the sample properly reflects the larger population. The
larger the sample size, the higher the likelihood that the results of our analyses can be correctly
inferred to the population. When a sample is not properly representative of the population to which
the results will be inferred, some sort of bias was introduced in the selection process of the sample.
All research must strive to minimize bias, or if it occurred, to properly account for it.

This section deals with a few of the methods that are used to properly sample participants for
studies from larger populations.

Types of sampling

Simple random sampling

In simple random sampling a master list of the whole population is available and each individual on
that list has an equal likelihood of being chosen to be part of the sample.

This form of sampling can be used in two settings. On a larger scale we might have a master list
of patients (or people with a common trait) who we would like to investigate. We can draw from
that list using simple random sampling. On a smaller scale, we might already have a list of
participants for a clinical trial. We now need to divide them into different groups. A good example
would be dividing our participants into two groups, group one receiving an active drug and group
two, a placebo. Each individual participant must have an equal likelihood of getting either the
active drug or the placebo. This can be achieved by simple random sampling.
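
Both settings can be sketched with Python's `random` module. Everything here is hypothetical: the master list of 200 patient labels, the sample size of 20 and the fixed seed (used only so that the sketch gives the same result each time it is run).

```python
import random

rng = random.Random(42)  # fixed seed, only so the sketch is repeatable

# Hypothetical master list of 200 patients
master_list = [f"patient_{i:03d}" for i in range(1, 201)]

# Setting 1: draw 20 participants; every patient on the list is equally likely to be chosen
sample = rng.sample(master_list, k=20)

# Setting 2: divide the chosen participants at random into two groups of ten
shuffled = sample[:]
rng.shuffle(shuffled)
active_group, placebo_group = shuffled[:10], shuffled[10:]

print(active_group)
print(placebo_group)
```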

Systematic random sampling

In systematic random sampling, the selection process picks individuals at a fixed, pre-decided
interval, i.e. every 10th or 100th individual on a master list.

We can again consider two scenarios. In the first, we are dealing again with finding participants
for a study and in the second, we already have our participants selected and now need to divide
them into groups.
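
A minimal sketch of the first scenario is shown below. The master list and the interval of 10 are hypothetical, and choosing the starting position at random is an assumption on my part (a common way of keeping the selection random) rather than something stated in these notes.

```python
import random

rng = random.Random(7)  # fixed seed, only so the sketch is repeatable

# Hypothetical master list of 200 patients
master_list = [f"patient_{i:03d}" for i in range(1, 201)]

step = 10                    # select every 10th individual
start = rng.randrange(step)  # assumed: a random starting position among the first ten
sample = master_list[start::step]

print(len(sample), sample[:3])
```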

Cluster random sampling

In cluster sampling, the groups of individuals that are included are somehow clustered, i.e. all in
the same space, location, or allowed time-frame. There are many forms of cluster random
sampling. We often have to deal with the fact that a master list simply does not exist or is too
costly or difficult to obtain. Clustering groups of individuals greatly simplifies the selection process.

We could even see the trivial case of a case series or case-control series as having made use of
clustering. A study might compare those with and without a trait, say a postoperative wound
complication. The sample is taken from a population who attended a certain hospital over a
certain time period.

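Stratified random sampling

In stratified sampling the population is first divided into groups (strata) according to some
common, mutually exclusive trait, and individuals are then chosen at random from within each
group. The most common such trait is gender, but it may also be socio-economic class, age, and
many others.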
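
As a rough sketch, stratified sampling can be written as "group first, then sample at random within each group". The population, the trait labels and the 10% sampling fraction below are all hypothetical, and taking the same fraction from every stratum is just one of several possible choices.

```python
import random
from collections import defaultdict

rng = random.Random(1)  # fixed seed, only so the sketch is repeatable

# Hypothetical population of 120 people, each with one mutually exclusive trait
population = [(f"p{i:03d}", "female" if i % 3 else "male") for i in range(1, 121)]

# Step 1: divide the population into strata according to the trait
strata = defaultdict(list)
for person, trait in population:
    strata[trait].append(person)

# Step 2: draw a simple random sample from within each stratum
fraction = 0.1  # assumed sampling fraction, applied to every stratum
sample = {
    trait: rng.sample(members, k=round(len(members) * fraction))
    for trait, members in strata.items()
}

for trait, chosen in sample.items():
    print(trait, len(strata[trait]), len(chosen))
```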

Week 3: Building an intuitive understanding of statistical analysis

From area to probability

P-values

When you read medically related research papers you are likely to come across the concept of
probability and the term p-value, along with the gold standard of statistical significance: 0.05.

What is the p-value?

The p-value expresses the probability of an event occurring.
It is based on the calculation of a geometrical area.
The mathematics behind a p-value draws a curve and simply calculates the area under a
certain part of that curve.

Rolling dice

I explained the notion of probability by using the common example of rolling dice:

For each die, there is an equal likelihood of rolling a one, two, three, four, five, or a six.
The probability of rolling a one is one out of six or 16.67% (it is customary to write
probability as a fraction (between zero and one) as opposed to a percentage, so let's
make that 0.1667).
It is impossible to have a negative probability (a less than 0% chance of something
happening) or a probability of more than one (more than a 100% chance of something
happening).
If we consider all probabilities in the rolling of our die, they add up to one (0.1667 times six
equals one).
This is our probability space, nothing exists outside of it.
By looking at it from a different (more of a medical statistics) point of view, I could roll a
five and ask the question: "What was the probability of finding a five?", to which we
would answer, p = 0.1667.
I could also ask: "What is the likelihood of rolling a five or more?", to which the answer is
p = 0.333.

Example of rolling a pair of dice:

We hardly ever deal with single participants in a study, so let's ramp things up to a pair of
dice.
If you roll a pair of dice, adding the values that land face-up will leave you with possible
values between two and 12.
Note that there are 36 possible outcomes (using the fact that rolling, for example, a one
and a six is not the same as rolling a six and a one).
Since there are six ways of rolling a total of seven, the chances are six in 36 or 0.1667.
Rolling two sixes or two ones is less likely at one out of 36 each, or 0.0278 (a 2.78%
chance).

We can make a chart of these called a histogram. It shows how many times each outcome can
occur. You'll note the actual outcomes on the horizontal axis and the number of ways of achieving
each outcome on the vertical axis.

Equating geometrical area to probability

We could also chart a probability plot. So, instead of the actual number of times, we chart the
probability on the y-axis. It is just the number of occurrences divided by the total number of
outcomes.
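
The counts behind the histogram and the probabilities behind the probability plot can be worked out by listing all 36 outcomes. This is a small sketch using only Python's standard library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two dice; (1, 6) and (6, 1) count separately
outcomes = list(product(range(1, 7), repeat=2))

# Histogram: the number of ways of rolling each total
ways = Counter(a + b for a, b in outcomes)

# Probability plot: the number of ways divided by the total number of outcomes
for total in range(2, 13):
    probability = Fraction(ways[total], len(outcomes))
    print(total, ways[total], f"{float(probability):.4f}")
```

The row for a total of seven shows six ways and a probability of 0.1667; the rows for two and 12 show one way each and a probability of 0.0278.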

Continuous data types

We started this lesson off by looking at discrete outcomes, i.e. the rolling of dice. The outcomes
were discrete by way of the fact that no values exist between the whole numbers two to 12 (rolling
two dice). That made the shape of the probability graphs quite easy, with a base width of one and
all in the shape of little rectangles. Continuous variables on the other hand are considered to be
infinitely divisible, which makes it very difficult to measure the geometric width of those areas, not
to mention the fact that the top ends of the little rectangles are curved and not straight.

Integral calculus solves the problem of determining the area of an irregular shape (as long as we
have a nice mathematical function for that shape). Our bigger problem is the fact that (for
continuous variables) it is no longer possible to ask what the probability of finding a single value
(outcome) is. A single value does not exist as in the case of discrete data types. Remember that
with continuous data types we can (in theoretical terms at least) infinitely divide the width of the
base. Now, we can only ask what the probability (area under the curve) is between two values, or
more commonly what the area is for values larger than or smaller than a given value (stretching
out to positive or negative infinity respectively).
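
To make the idea of "area between two values" concrete, here is a sketch using `NormalDist` from Python's `statistics` module, which does the integral calculus for us. The variable (a white cell count with a mean of 8 and a standard deviation of 2) and the bell-shaped curve are assumptions made purely for illustration.

```python
from statistics import NormalDist

# Hypothetical continuous variable: white cell count, assumed bell-shaped
# with a mean of 8 and a standard deviation of 2
wcc = NormalDist(mu=8, sigma=2)

# cdf(x) is the area under the curve to the left of x
between = wcc.cdf(10) - wcc.cdf(6)  # probability of a value between 6 and 10
above = 1 - wcc.cdf(11)             # probability of a value larger than 11
below = wcc.cdf(5)                  # probability of a value smaller than 5

print(round(between, 4), round(above, 4), round(below, 4))
```

Because the assumed curve is symmetric, the areas above 11 and below 5 (both 1.5 standard deviations from the mean) are equal.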

So, how does this work?

The graph below (not to scale) illustrates the p-value. Let's go back to the example of researching
the white cell count of two groups of patients. Imagine that group one has a certain average white
cell count and so does group two. There is a difference between these averages. The question is
whether this difference is statistically significant. The mathematics behind the calculation is going
to use some values calculated from the data point values and represent them in a (bell-shaped)
curve (as in the graph below).

If you chose a p-value of less than 0.05 to indicate a significant difference and you decided in
your hypothesis that one group will have an average higher than the other, the maths will work out
a cut-off on the x-axis which will indicate an area under the curve (the green) of 0.05 (5% of the
total area). It will then mark the difference in averages of your data and see what the area under
the curve was for this (in blue). You can see it was larger than 0.05, so the difference was not
statistically significant.
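
The logic of the cut-off can be sketched as follows. I am assuming a standard bell-shaped (normal) curve as a stand-in for the curve in the graph, and the value 1.2 for the observed difference (expressed on the x-axis of that curve) is hypothetical.

```python
from statistics import NormalDist

curve = NormalDist()  # assumed standard bell-shaped curve (mean 0, SD 1)

alpha = 0.05
cut_off = curve.inv_cdf(1 - alpha)  # x-axis value with an area of 0.05 to its right

observed = 1.2                      # hypothetical position of the observed difference
p_value = 1 - curve.cdf(observed)   # area under the curve to the right of the observed value

print(round(cut_off, 3), round(p_value, 3), p_value < alpha)
```

In this sketch the area beyond the observed value is larger than 0.05, so, as in the graph, the difference would not be called statistically significant.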

And there you have it. An intuitive understanding of the p-value. It only gets better from here!

Returning to the probability plot for the pair of dice:

As if by magic, the heights of the rectangular bars are now equal to the likelihood
(a p-value of sorts) of rolling a particular total.
Note how the height of the centre rectangle (an outcome of seven) is 0.1667 (there are
six ways of rolling a seven and therefore we calculate a p-value of 6/36 = 0.1667).
If you look at each individual rectangular bar and if you consider the width of each to be
one, the area of each rectangle (height times width) gives you the probability of rolling
that number (the p-value) (for the sake of completeness, we should actually be using the
probability density function, but this is an example of discrete data types and the results
are the same).
If we want to know what the probability of rolling a 10 or more is, we simply calculate
the area of the rectangles from 10 and up.
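
The last point can be checked with a short sketch: each bar has a width of one, so its area equals its height, and the probability of rolling a 10 or more is the sum of the areas of the bars for 10, 11 and 12.

```python
from fractions import Fraction
from itertools import product

# Totals for all 36 ordered outcomes of rolling two dice
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

# Height of each bar; with a width of one, the area of a bar equals its height
area = {t: Fraction(totals.count(t), len(totals)) for t in range(2, 13)}

p_ten_or_more = sum(area[t] for t in range(10, 13))
print(p_ten_or_more, round(float(p_ten_or_more), 4))
```

There are three ways of rolling a 10, two of rolling an 11 and one of rolling a 12, so the area is 6/36, the same as the probability of rolling a seven.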

The p then refers to probability, as in the chance or likelihood of an event occurring (in healthcare
research, an event is an outcome of an experiment).

An example of an experiment is comparing the difference in the average of some blood test value
between two groups of patients, with the p-value representing the probability of finding a
difference at least as large as the one observed if the two groups really came from the same
population. If the probability is sufficiently low, we infer that the two sets of participants represent
two separate populations, which are then significantly different from each other.

The heart of inferential statistics: Central limit theorem

Central limit theorem

Now that you are aware of the fact that the p-value represents the area under a very beautiful and
symmetric curve (for continuous data type variables at least) something may start to concern you.
If it hasn't, let's spell it out. Is the probability curve always so symmetric? Surely, when you look at
the occurrence of data point values for variables in a research project (experiment), they are not
symmetrically arranged.

In this lesson, we get the answer to this question. We will learn that this specific difference
between the means of the two groups is but one of many, many, many (really many) differences
that are possible. We will also see that some differences occur much more commonly than others.
The answer lies in a mathematical theorem called the Central Limit Theorem (CLT). As usual, don't
be alarmed, we won't go near the math. A few simple visual graphs will explain it quite painlessly.

Skewness and kurtosis

As I mentioned, data can be very non-symmetrical in its distribution. To be clear, by distribution I
mean physically counting how many times each individual value comes up in a set of data point
values. The two terms that describe non-symmetric distribution of data point values are skewness
and kurtosis.

Skewness

Skewness is rather self-explanatory and is commonly present in clinical research. It is a marker
that shows that there are more occurrences of certain data point values at one end of a spectrum
than another. Below is a graph showing the age distribution of participants in a hypothetical
research project. Note how most individuals were on the younger side. Younger data point values
occur more commonly (although there seem to be some very old people in this study). The
skewness in this instance is right-tailed. It tails off to the right, which means it is positively skewed.

On the other hand, negative skewness would indicate the data is left-tailed.

Kurtosis

Kurtosis refers to the spread of your data values.

A platykurtic curve is flatter and broader than normal as a result of having few scores
around the mean. Large sections under the curve are forced into the tail, thereby (falsely)
increasing the probability of finding a value quite far from the mean.
A mesokurtic curve takes the middle ground, with a medium curve from average
distributions.
A leptokurtic curve is more peaked, with many values centred around the mean.
Remember, in this section we are discussing the distribution of the actual data point
values in a study, but the terms used here can also refer to the curve that is eventually
constructed when we calculate a p-value. As we will see later, these are quite different
things (the curve of actual data point values and the p-value curve calculated from the
data point values). This is a very important distinction.

Combinations

Combinations lie at the heart of the Central Limit Theorem and also inferential statistical analysis.
They are the key to understanding the curve that we get when attempting to calculate the p-value.
Combinations refer to the number of ways a selection of objects can be chosen from a group of
objects when the order of the chosen objects does not matter.

In a simple example we might consider how many combinations of two colors we can
make from a total choice of four colors, say red, green, blue, and black. We could
choose: red + green, red + blue, red + black, green + blue, green + black, and finally,
blue + black (noting that choosing blue + black is the same as choosing black + blue).
That is six possible combinations when choosing a two-color combination from four choices.
Many countries in the world have lotteries in which you pick a few numbers and hand
over some money for a chance to win a large cash prize should those numbers pop up in
a draw. The order doesn't matter, so we are dealing with combinations. So, if you had to
choose six numbers between say one and 47, how many combinations could come up?
It's a staggering 10,737,573. Over 10 million. Your choice of six numbers is but one of all
of those. That means that your chance of picking the right combination is less than one
in 10 million! Most lotteries have even more numbers to choose from! To put things into
perspective (just for those who play the lottery), a choice of the numbers 1, 2, 3, 4, 5, and
6 (seemingly a very, very unlikely choice) is just as likely to come up as your favorite choice of 13,
17, 28, 29, 30, and 47! Good luck!

Combinations have some serious implications for clinical research, though.

For example, a research project decides to focus on 30 patients for a study. The researcher chose
the first 30 patients to walk through the door at the local hypertension (high blood pressure) clinic
and notes down their ages. If a different group of patients was selected on a different day, there
would be completely different data. The sample group that you end up with (the chosen 30) is but
one of many, many, many that you could have had! If 1000 people attended the clinic and you
had to choose 30, the number of possible combinations would be larger than 2.4 times 10 to the
power 57. Billions upon billions upon billions!
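
The three counts mentioned above (two colors from four, six lottery numbers from 47, and 30 patients from 1000) can be checked with Python's standard library. This is just a sketch: `math.comb(n, k)` counts the combinations and `itertools.combinations` lists them.

```python
from itertools import combinations
from math import comb

colors = ["red", "green", "blue", "black"]

# List every two-color combination; order is ignored, so blue + black appears only once
pairs = list(combinations(colors, 2))
print(pairs)
print(len(pairs), comb(4, 2))

# Six lottery numbers chosen from 47
print(comb(47, 6))

# 30 patients chosen from 1000 clinic attendees, shown in scientific notation
print(f"{comb(1000, 30):.2e}")
```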

This is how the distribution curve for the outcomes of studies (from which the p-value is
calculated) is constructed, be it the difference in means between two or more groups, or
proportions of choices for a cross-sectional study's Likert-style questions. There is
(mathematically) an almost uncountable number of different outcomes (given the same variables
to be studied) and the one found in an actual study is but one of those.

Central limit theorem

We saw in the previous section on combinations that when you compare the difference in
averages between two groups, your answer is but one of many that exist. The Central Limit
Theorem states that if we were to plot all the possible differences, the resulting graph would form
a smooth, symmetrical curve. Therefore, we can do statistical analysis and look for the area under
the curve to calculate our p-values.

The mathematics behind the calculation of the p-value constructs an estimation of all the possible
outcomes (or differences as in our example).
Let's look at a visual representation of the data. In the first graph below we asked a computer
program to give us 10,000 random values between 30 and 40. As you can see, there is no
pattern.

Let's suggest that these 10,000 values represent a population and we need to randomly select 30
individuals from the population to represent our study sample. So, let's instruct the computer to
take a random sample of 30 of these 10,000 values and calculate the average for those 30. Now,
let's repeat this process 1000 times. We are in essence repeating our medical study 1000 times!
The result of the occurrence of all the averages is shown in the graph below. The Central Limit
Theorem predicts a lovely smooth, symmetric distribution. Just ready and waiting for some
statistical analysis.
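
The computer experiment described above can be sketched as follows. I am assuming that the 10,000 random values are spread evenly (uniformly) between 30 and 40, and the fixed seed is there only so that the sketch gives the same numbers each time it is run.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable

# A "population" of 10,000 values, assumed to be spread evenly between 30 and 40
population = [rng.uniform(30, 40) for _ in range(10_000)]

# Repeat the "study" 1000 times: select 30 individuals and record their average
sample_means = [mean(rng.sample(population, k=30)) for _ in range(1000)]

print(round(mean(population), 2), round(stdev(population), 2))
print(round(mean(sample_means), 2), round(stdev(sample_means), 2))
```

The 1000 averages cluster tightly around the population average of about 35, with a much smaller spread than the individual values. A histogram of `sample_means` would show the smooth, symmetric shape described above.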

Every time a medical study is conducted, the data point values (and their measures of central
tendency and dispersion) are just one example of countless others. Some will occur more
commonly than others and it is the Central Limit Theorem that allows us to calculate how likely it
was to find a result as extreme as the one found in any particular study.

Distributions: the shape of data

Distributions

We all know that certain things occur more commonly than others. We all accept that there are
more days with lower temperatures in winter than days with higher temperatures. In the northern
hemisphere there will be more days in January that are less than 10 degrees Celsius (50 degrees
Fahrenheit) than there are days that are more than 20 degrees Celsius (68 degrees Fahrenheit).

Actual data point values for any imaginable variable come in a variety of, shall we say, shapes or
patterns of spread. The proper term for this is a distribution. The most familiar shape is the normal
distribution. Data from this type of distribution is symmetric and forms what many refer to as a
bell-shaped curve. Most values center around the average and taper off to both ends.
If we turn to healthcare, we can imagine that certain hemoglobin levels occur more commonly than
others in a normal population. There is a distribution to the data point values. In deciding which
type of statistical test to use, we are concerned with the distribution that the parameter takes in
the population. As we will see later, we do not always know what the shape of the distribution is and
we can only calculate if our sample data point values might come from a population in which that
variable is normally (or otherwise) distributed.

It turns out that there are many forms of data distributions for both discrete and continuous data
type variables. Even more so, averages, standard deviations, and other statistics also have
distributions. This follows naturally from the Central Limit Theorem we looked at before. We saw
that if we could repeat an experiment thousands of times, each time selecting a new combination
of subjects, the average values or differences in averages between two groups would form a
symmetrical distribution.

It is important to understand the various types of distributions because, as mentioned, distribution
types have an influence on the choice of statistical analysis that should be performed on them. It
would be quite incorrect to do the famous t-test on data values for a sample that do not come
from a variable with a normal distribution in the population from which the sample was taken.

Unfortunately, most data is not shared openly and we have to trust the integrity of the authors and
that they chose an appropriate test for their data. The onus then also rests on you to be aware of
the various distributions and what tests to perform when conducting your own research, as well
as to scrutinize these choices when reading the literature.

In this lesson you will note that I refer to two main types of distributions. First, there is the
distribution pattern taken by the actual data point values in a study sample (or the distribution of
that variable in the underlying population from which the sample was taken). Then there is the
distribution that can be created from the data point values by way of the Central Limit Theorem.
There are two of these distributions and they are the Z- and the t-distributions (both sharing a
beautifully symmetric, bell-shaped pattern, allowing us to calculate a p-value from them).

Normal distribution

The normal distribution is perhaps the most important distribution. We need to know that data
point values for a sample are taken from a population in which that variable is normally distributed
before we decide on what type of statistical test to use. Furthermore, the distribution of all possible
outcomes (through the Central Limit Theorem) is normally distributed.
The normal distribution has the following properties:

most values are centered around the mean
as you move away from the mean, there are fewer data points
symmetrical in nature
bell-shaped curve
almost all data points (99.7%) occur within 3 standard deviations of the mean
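
The last property can be checked with `NormalDist` from Python's `statistics` module. This sketch uses the standard normal curve (a mean of 0 and a standard deviation of 1), but the percentages are the same for any normal distribution.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

for k in (1, 2, 3):
    # area under the curve between k standard deviations below and above the mean
    share = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD of the mean: {share:.1%}")
```

The areas within one and two standard deviations (roughly 68% and 95%) are not mentioned in the list above, but they follow from the same curve.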

Most variables that we use in clinical research have data point values that are normally
distributed, i.e. the sample data points we have come from a population for whom the values are
normally distributed. As mentioned in the introduction, it is important to know this, because we
have to know what distribution pattern the variable has in the population in order to decide on the
correct statistical analysis tool to use.

It is worthwhile to repeat the fact that actual data point values for a variable (i.e. age, height, white
cell count, etc.) have a distribution (both in a sample and in the population), but that through the
Central Limit Theorem, where we calculate how often certain values or differences in values
occur, we have a guaranteed normal distribution.

Samplingdistribution
Aswasmentionedintheprevioussection,actualdatapointvaluesforvariables,beitfroma
sampleorfromapopulation,hasadistribution.Thenwehavethedistributionthatisguaranteed
throughtheCentralLimitTheorem.Whetherwearetalkingaboutmeans,thedierenceinmeans
betweentwoormoregroups,oreventhedierenceinmedians,aplotofhowmanytimeseach
willoccurgivenanalmostunlimitedrepeatofastudywillbenormallydistributed,allowsusdo
inferentialstatisticalanalysis.

So,imagine,onceagain,beinginaclinicandenteringtheagesofconsecutivepatientsina
spreadsheet.Someageswilloccurmoreorlesscommonly,allowingustoplotahistogramofthe
values.Itwouldshowthedistributionofouractualvalues.Thesepatientscomefromamuch,
muchlargerpopulation.Inthispopulation,ageswillalsocomeinacertaindistributionpattern,
whichmaybedierentfromoursampledistribution.

In the video on the Central Limit Theorem we learn that our sample statistic, or the difference between two sample groups, is but one of an enormous number of values that could occur. This larger distribution will always be normally distributed according to the Central Limit Theorem and refers to another meaning of the term distribution, termed a sampling distribution (in other words, the type of distribution we get through the mathematics of the Central Limit Theorem is called a sampling distribution).
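
A small simulation can make the idea of a sampling distribution concrete. The sketch below is an illustration under stated assumptions, not the method used in the course videos: it draws repeated random samples (rather than enumerating combinations) from a deliberately skewed, made-up population (exponential, with a true mean of 1) and collects the mean of each sample. The function name and the sample sizes are hypothetical.

```python
import random
import statistics


def sample_means(n_per_sample: int, n_repeats: int, seed: int = 0) -> list[float]:
    """Repeat a 'study' many times and return the mean found in each repeat."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_repeats):
        # Skewed population: exponential distribution with mean 1.0
        sample = [rng.expovariate(1.0) for _ in range(n_per_sample)]
        means.append(statistics.fmean(sample))
    return means


means = sample_means(n_per_sample=50, n_repeats=2000)
# The means cluster around the population mean (1.0), with a spread close to
# the population standard deviation divided by sqrt(50).
print(statistics.fmean(means), statistics.stdev(means))
```

Even though the individual values are far from normally distributed, a histogram of `means` would be close to bell-shaped, which is the point of the Central Limit Theorem.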

Z-distribution

The Z-distribution is one of the sampling distributions. In effect, it is a graph of a mathematical function. This function takes some values from parameters in the population (as opposed to only from our sample statistics) and constructs a symmetrical, bell-shaped curve from which we can do our statistical analysis.

It is not commonly used in medical statistics as it requires knowledge of some population parameters and in most cases, these are unknown. It is much more common to use the t-distribution, which only requires knowledge that is available from the data point values of a sample.

t-distribution

This is the most commonly used sampling distribution. It requires only calculations that can be made from the data point values for a variable that are known from the sample set data for a study. The t-distribution is also symmetrical and bell-shaped and is a mathematical equation that follows from the Central Limit Theorem, allowing us to do inferential statistical analysis.

One of the values that has to be calculated when using the t-distribution is called degrees of freedom. It is quite an interesting concept with many interpretations and uses. In the context we use it here, it depends on the number of participants in a study and is easily calculated as the difference between the total number of participants and the total number of groups. So, if we have a total of 60 subjects in a study and we have them divided into two groups, we will have 58 degrees of freedom, being 60 minus two. The larger the degrees of freedom, the more the shape of the t-distribution resembles that of the normal distribution.
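
The degrees-of-freedom calculation described above is simple enough to write down directly. This is a minimal sketch; the function name is made up for illustration.

```python
def degrees_of_freedom(total_participants: int, groups: int) -> int:
    """Degrees of freedom as used here: total participants minus number of groups."""
    if groups < 1 or total_participants <= groups:
        raise ValueError("need at least one group and more participants than groups")
    return total_participants - groups


# 60 subjects divided into two groups
print(degrees_of_freedom(60, 2))  # 58
```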

The effect of this is that all studies should aim to have as many participants as possible. A larger sample set allows the sample to be more representative of the population, increasing the power of the study.

Week 4: The important first steps: Hypothesis testing and confidence levels

Hypothesis testing

This course represents an inferential view on statistics. In doing so we make comparisons and calculate how significant those differences are, or rather how likely it is to have found the specific difference when comparing results.

This form of statistical analysis makes use of what is termed hypothesis testing. Any research question pertaining to comparisons has two hypotheses. We call them the null and alternative (maintained, research, test) hypothesis and in light of comparisons, they are actually quite easy to understand.

The null hypothesis

It is of utmost importance that researchers remain unbiased towards their research and particularly the outcome of their research. It is natural to become excited about a new academic theory and most researchers would like to find significant results.

The proper scientific method, though, is to maintain a point of no departure. This means that until data is collected and analyzed, we state a null hypothesis: there will be no difference when a comparison is done through data collection and analysis.

Being all scientific about it, though, we do set a threshold for our testing and if the analysis finds that this threshold has been breached, we no longer have to accept the null hypothesis. In actual fact, we can now reject it.

This method, called the scientific method, forms the bedrock of evidence-based medicine.

As scientists we believe in the need for proof.

The alternative hypothesis

The alternative hypothesis states that there is a difference when a comparison is done. Almost all comparisons will show differences, but we are interested in the size of that difference, and more importantly in how likely it is to have found the difference that a specific study finds.

We have seen through the Central Limit Theorem that some differences will occur more commonly than others. We also know through the use of combinations that there are many, many possible differences indeed.

The alternative hypothesis is accepted when an arbitrary threshold is passed. This threshold is a line in the sand and states that any difference found beyond this point is very unlikely to have occurred by chance if the null hypothesis were true. This allows us to reject the null hypothesis and accept the alternative hypothesis, that is, that we have found a statistically significant difference or result.

In the next section we will see that there is more than one way to state the alternative hypothesis and that this is of enormous importance.

The alternative hypothesis

Two ways of stating the alternative hypothesis

As I have alluded to in the previous section, there are two ways to state the alternative hypothesis. It is exactly because there are different ways of stating the alternative hypothesis that it is so important to state the null and alternative hypotheses prior to any data collection or analysis.

Let's consider comparing the cholesterol level of two groups of patients. Group one is on a new test drug and group two is taking a placebo. After many weeks of dutifully taking their medication, the cholesterol level of each participant is taken.

Since the data point values for the variable total cholesterol level are continuous and ratio-type numerical, we can compare the means (or as we shall see later, the medians) between the two groups.

As good scientists and researchers, though, we stated our hypotheses way before this point! The null hypothesis would be easy. There will be no difference between the means (or medians) of these two groups.

What about the alternative hypothesis? It's clear that we can state this in two ways. It might be natural, especially if we have put a lot of money and effort into this new drug, to state that the treatment group (group one) will have lower cholesterol than the placebo or control group (group two). We could also simply state that there will be a difference. Either of the two groups might have a lower mean (or median) cholesterol level.

This choice has a fundamental effect on the calculated p-value. For the exact same data collection and analysis, one of these two ways of stating the alternative hypothesis has a p-value of half of the other. That is why it is so important to state these hypotheses before commencing a study. A non-significant p-value of 0.08 can very easily be changed after the fact to a significant 0.04 by simply restating the alternative hypothesis.
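
The halving of the p-value can be illustrated with a few lines of code. This is a hedged sketch: Python's standard library has no t-distribution, so the standard normal curve is used here as a stand-in (a reasonable approximation when the degrees of freedom are large), and the test statistic of 1.75 standard errors is a made-up value chosen to reproduce the 0.08 versus 0.04 example. The one-tailed value also assumes that the observed difference lies in the predicted direction.

```python
from statistics import NormalDist


def p_values(test_statistic: float) -> tuple[float, float]:
    """Return (one-tailed, two-tailed) p-values for a statistic in standard-error units."""
    # Area in one tail beyond the observed statistic
    one_tail_area = 1.0 - NormalDist().cdf(abs(test_statistic))
    # The two-tailed test mirrors that area on the other side of the curve
    return one_tail_area, 2.0 * one_tail_area


one_tailed, two_tailed = p_values(1.75)  # hypothetical difference of 1.75 standard errors
print(f"one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

With a significance threshold of 0.05, the same data would be called significant under the one-tailed statement and non-significant under the two-tailed statement.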

The two-tailed test

When an alternative hypothesis states simply that there will be a difference, we conduct what is termed a two-tailed test.

Let's consider again our cholesterol example from before. Through the use of combinations and the Central Limit Theorem we will be able to construct a symmetrical bell-shaped t-distribution curve from our data point values. A computer program will construct this graph and will also know what cut-off values on the x-axis would mark off an area under the curve representing 0.05 (5%) of the total area.

This being a two-tailed test, though, it will split this to both sides, with 0.025 (2.5%) on either side.

When comparing the means (or medians) of the two groups, one will be less than the other and depending on which one you subtract from which, you will get either a positive or a negative answer. This is converted to a value (in units of standard error) that can also be plotted on the x-axis. Since this is a two-tailed test, though, it is reflected on the other side. The computer can calculate the area under the curve for both sides, called a two-tailed p-value. From the graphs it is clear that this form of alternative hypothesis statement is less likely to yield a low p-value. The area is doubled and the cut-off marks representing the 0.025 (2.5%) level are further away from the mean.

The one-tailed test

A researcher must be very clear when using a one-tailed test. For the exact same study and data point values, the p-value for a one-tailed test would be half of that of a two-tailed test. The graph is only represented on one side (not duplicated on the other) and all of the area representing a value of 0.05 (5%) is on one side and therefore the cut-off is closer to the mean, making it easier for a difference to fall beyond it.
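
How much closer to the mean the one-tailed cut-off lies can be computed directly. The sketch below again uses the standard normal curve as an approximation to the t-distribution (an assumption made because the standard library lacks a t-distribution); the function name is illustrative.

```python
from statistics import NormalDist


def cutoffs(alpha: float = 0.05) -> tuple[float, float]:
    """Return the (one-tailed, two-tailed) cut-offs in standard-error units."""
    standard_normal = NormalDist()
    one_tailed = standard_normal.inv_cdf(1 - alpha)      # all 5% in one tail
    two_tailed = standard_normal.inv_cdf(1 - alpha / 2)  # 2.5% in each tail
    return one_tailed, two_tailed


one_cut, two_cut = cutoffs()
print(f"one-tailed cut-off: {one_cut:.3f}, two-tailed cut-off: {two_cut:.3f}")
```

The one-tailed cut-off sits at about 1.645 standard errors from the mean and the two-tailed cut-off at about 1.960, so a smaller difference is enough to cross the one-tailed line.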

The choice between a one-tailed and two-tailed approach

There is no magic behind this choice. A researcher should make this choice by convincing her or his peers through logical arguments or prior investigations. Anyone who reads a research paper should be convinced that the choice was logical to make.
Hypothesis testing errors
All research starts with a question that needs an answer. Everyone might have their own opinion, but an investigator needs to look for the answer by designing an experiment and investigating the outcome.

A concern is that the investigator may introduce bias, even unintentionally. To avoid bias, most healthcare research follows a process involving hypothesis testing. The hypothesis is a clear statement of what is to be investigated and should be determined before research begins.

To begin, let's go over the important definitions I discussed in this lesson:

The null hypothesis predicts that there will be no difference between the variables of investigation.
The alternative hypothesis, also known as the test or research hypothesis, predicts that there will be a difference or significant relationship.

There are two types of alternative hypotheses. One predicts the direction of the difference (e.g. A will be more than B or A will be less than B) and is tested with a one-tailed test. The other states that there will be a significant relationship but does not state in which direction (e.g. A can be more or less than B). The latter is tested with a two-tailed test.

For example, if we want to investigate the difference in the white blood cell count on admission between HIV-positive and HIV-negative patients, we could have the following hypotheses:

Null hypothesis: there is no difference in the admission white blood cell count between HIV-positive and HIV-negative patients.
Alternative hypothesis (one-tailed): the admission white blood cell count of HIV-positive patients will be higher than the admission white blood cell count of HIV-negative patients.
Alternative hypothesis (two-tailed): the admission white blood cell count of HIV-positive patients will differ from that of HIV-negative patients.

Type I and II errors

Wrong conclusions can sometimes be made about our research findings. For example, a study finds that drug A has no effect on patients, when in fact it has major consequences. How might such mistakes be made, and how do we identify them? Since such experiments are about studying a sample and making inferences about a population (which we have not studied), these mistakes can conceivably occur.

Table I shows two possible types of errors that exist within hypothesis testing. A Type I error is one where we falsely reject the null hypothesis. For instance, a Type I error could occur when we conclude that there is a difference in the white blood cell count between HIV-positive and HIV-negative patients when in fact no such difference exists.

A Type II statistical error relates to failing to reject the null hypothesis when in fact a difference was present. This is where we would conclude that there is no difference in the white blood cell count between HIV-positive and HIV-negative patients when in reality a difference exists.

Why do we say "we fail to reject the null hypothesis" instead of "we accept the null hypothesis"? This is tricky, but just because we fail to reject the null hypothesis, this does not mean we have enough evidence to prove it is true.
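
A Type I error can be made visible with a simulation. In the sketch below, which is an illustration under assumptions and not part of the course material, both groups are always drawn from the same made-up population (mean 0, known standard deviation 1), so the null hypothesis is true by construction. Every rejection is therefore a Type I error. A simple z-test with a known standard deviation is used instead of a t-test to keep the code within the standard library.

```python
import random
from statistics import NormalDist, fmean


def type_one_error_rate(n_per_group: int = 30, n_studies: int = 2000,
                        alpha: float = 0.05, seed: int = 1) -> float:
    """Fraction of simulated studies that falsely reject a true null hypothesis."""
    rng = random.Random(seed)
    sigma = 1.0  # population standard deviation, assumed known
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed cut-off
    standard_error = sigma * (2 / n_per_group) ** 0.5
    false_rejections = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        z = (fmean(group_a) - fmean(group_b)) / standard_error
        if abs(z) > cutoff:
            false_rejections += 1  # no real difference exists, so this is a Type I error
    return false_rejections / n_studies


rate = type_one_error_rate()
print(f"Type I error rate: {rate:.3f}")  # expected to be close to alpha (0.05)
```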

In the next lesson, I'll cover the topic of confidence intervals.

Confidence in your results

Introduction to confidence intervals
Confidence intervals are very often quoted in the medical literature, yet it is often mentioned that they are a poorly understood statistical concept amongst healthcare personnel.

In this course we are concentrating on inferential statistics. We realize that most published literature is based on the concept of taking a sample of individuals from a population and analyzing some data point values we gather from them. These results are then used to infer some understanding about the population. We do this because we simply cannot investigate the whole population.

This method, though, is fraught with the danger of introducing bias. The sample that is selected for analysis might not properly represent the population. If a sample statistic such as a mean is calculated, it will differ from the population mean (if we were able to test the whole population). If a sample was taken with proper care, though, this sample mean should be fairly close to the population mean.

This is what confidence intervals are all about. We construct a range of values (with a lower and an upper limit), which is symmetrically formed around the sample mean (in this example) and we infer that the population mean should fall between those values.

We can never be entirely sure about this, though, so we decide on how confident we want to be in the bounds that we set, allowing us to calculate these bounds.

Confidence intervals can be constructed around more than just means and in reality, they have a slightly more complex meaning than what I have laid out here. All will be revealed in this lesson, allowing you to fully understand what is meant by all those confidence intervals that you come across in the literature.


Confidence levels

What are they?

As I've mentioned in the previous section, we only examine a small sample of subjects from a much larger population. The aim, though, is to use the results of our studies when managing that population. Since we only investigate a small sample, any statistic that we calculate based on the data points gathered from them will not necessarily reflect the population parameter, which is what we are really interested in.

Somehow, we should be able to take a stab at what that population parameter might be. Thinking on a very large scale, the true population parameter for any variable could be anything from negative infinity to positive infinity! That sounds odd, but makes mathematical sense. Let's take age for instance. Imagine I have the mean age of a sample of patients and am wondering about the true mean age in the population from which that sample was taken. Now, no one can be -1000 years old, neither can they be +1000 years old. Remember the Central Limit Theorem, though? It posited that the analysis of a variable from a sample was just one of many (by way of combinations). That distribution graph is a mathematical construct and stretches from negative to positive infinity. To be sure, the probability of such values occurring is basically nil, and in practice they never occur. So let's say (mathematically) that there are values of -1000 and +1000 (and for that matter negative and positive infinity) and I use these bounds as my guess as to what the mean value in the population as a whole is. With that wide a guess I can be 100% confident that this population mean would fall between those bounds.

This 100% represents my confidence level and I can set this arbitrarily for any sample statistic.

What happens if I shrink the bounds?

Enough with the nonsensical values. What happens if, for argument's sake, the sample mean age was 55 and I suggest that the population mean is 45 to 65? I have shrunk the bounds, but now, logically, I should lose some confidence in my guess. Indeed, that is what happens. The confidence level goes down. If I shrink it to 54 to 56, there is a much greater chance that the population mean escapes these bounds and the confidence level would be much smaller.
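
The trade-off between the width of the bounds and the confidence level can be sketched numerically. The example below is an illustration under assumptions: it uses a normal sampling distribution and a standard error of 2 years for the sample mean, a made-up value that does not come from the course material.

```python
from statistics import NormalDist


def confidence_level(half_width: float, standard_error: float) -> float:
    """Confidence level for an interval of sample mean plus or minus half_width."""
    z = half_width / standard_error  # half-width expressed in standard-error units
    return 2.0 * NormalDist().cdf(z) - 1.0


sample_mean = 55.0
standard_error = 2.0  # hypothetical value, in years
for half_width in (10, 4, 1):
    level = confidence_level(half_width, standard_error)
    print(f"{sample_mean - half_width:.0f} to {sample_mean + half_width:.0f}: "
          f"{100 * level:.1f}% confidence")
```

Under these assumptions the 45 to 65 interval carries a confidence level very close to 100%, while the 54 to 56 interval carries well under 50%.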

A 95% confidence level

It is customary to use a confidence level of 95% and that is the value that you will notice most often in the literature when confidence intervals are quoted. The 95% refers to the confidence level and the values represent the actual interval.

The mathematics behind confidence intervals constructs a distribution around the sample statistic based on the data point values and calculates what area would be covered by 95% (for a 95% confidence level) of the curve. The x-axis values are reconstituted to actual values, which are then the lower and upper values of the interval.

Confidence intervals

Now that we understand what confidence levels are, we are ready to define the proper interpretation of confidence intervals.

It might be natural to suggest that, given a 95% confidence level, we are 95% confident that the true population parameter lies within the interval given by that confidence level. Revisiting our last example we might have a sample statistic of 55 years for the mean of our sample and with a 95% confidence level construct an interval of 51 to 59 years. This would commonly be written as a mean age of 55 years (95% CI, 51-59). It would be incorrect, though, to suggest that there is a 95% chance of the population mean age being between 51 and 59 years.

The true interpretation of confidence intervals

Consider that both the sample statistic (mean age of the participants in a study) and the population parameter (mean age of the population from which the sample was taken) exist in reality. They are both absolutes. Given this, the population parameter either does or does not fall inside of the confidence interval. It is all or nothing.

The true meaning of a confidence level of, say, 95% is that if the study is repeated 100 times (each with its own random sample set of patients drawn from the population, each with its own mean and 95% confidence interval), we expect about 95 of these studies to have the population parameter within their intervals and about 5 not to. There is no way to know which one you have for any given study.
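
This repeated-study interpretation can be checked with a simulation. The sketch below is illustrative only: the population (mean age 55, standard deviation 12) and the sample size are made-up values, and the interval is built with a known population standard deviation so that only the standard library is needed.

```python
import random
from statistics import NormalDist, fmean


def coverage(n: int = 40, n_studies: int = 2000, level: float = 0.95,
             seed: int = 2) -> float:
    """Fraction of repeated studies whose interval contains the true population mean."""
    rng = random.Random(seed)
    true_mean, sigma = 55.0, 12.0  # hypothetical population parameters
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        # Each study either captures the true mean or it does not
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_studies


share = coverage()
print(f"share of intervals containing the true mean: {share:.3f}")
```

The share should land near 0.95, while any single interval in the simulation either contains 55 or misses it.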

Week 5: Which test should you use?
Introduction to parametric tests

Finally in this course we get to grips with some real inferential statistics and we start things off with parametric tests. Inferential statistics is all about comparing different groups of sample subjects to each other. Most commonly we deal with numerical data point values, for which we can calculate measures of central tendency and dispersion.

When using parametric tests, we use the mean or average as our measure of central tendency. Commonly we will have two (or more) groups and for any given variable, say for instance white cell count, we could calculate a mean value for each group. A difference will exist between the means of the groups and through the use of statistical tests we could calculate how commonly certain differences should occur given many repetitions of a study and also how likely it was then to have found a difference at least as wide as the one in the particular study at hand.

Parametric tests are the most commonly used tests, but with this common use come some very strict rules or assumptions. If these are not met and parametric tests are used, the subsequent p-values might not be a true reflection of what we can expect in the population.

Types of parametric tests

I will discuss three main types of parametric tests in this lesson.

t-tests

These are truly the most commonly used and most people are familiar with Student's t-test. There is more than one t-test depending on various factors.

As a group, though, they are used to compare the point estimate for some numerical variable between two groups.

ANOVA

ANOVA is the acronym for analysis of variance. As opposed to the t-test, ANOVA can compare a point estimate for a numerical variable between more than two groups.

Linear regression

When comparing two or more groups, we have the fact that although the data point values for a variable are numerical in type, the groups themselves are not. We might call the groups A and B, or one and two, or test and control. As such they refer to categorical data types.

In linear regression we directly compare a numerical value to a numerical value. For this, we need pairs of values and in essence we look for a correlation between these. Can we find that a change in the set that is represented by the first value in all the pairs goes along with a predictable change in the set made up of all the second values in the pairs?

As an example we might correlate number of cigarettes smoked per day to blood pressure level. To do this we would need a sample of participants and for each have a value for cigarettes smoked per day and a blood pressure value. As we will see later, correlation does not prove causation!

Student's t-test: Introduction
We have learned that this is at least one of the most common statistical tests. It takes numerical data point values for some variable in two groups of subjects in a study and compares them to each other.

William Gosset (developer of Student's t-test) used the t-distribution, which we covered earlier. It is a bell-shaped, symmetrical curve and represents a distribution of all the differences (in the mean) between two groups should the same study be repeated multiple times. Some will occur very often and some will not. The t-distribution uses the concept of degrees of freedom. This value refers to the total number of participants in a study (both groups) minus the number of groups (which is two). The higher the degrees of freedom, the more accurately the t-distribution follows the normal distribution and the mathematics behind this posits, in some way, a more accurate p-value. This is another reason to include as large a sample size as possible.

Once the graph for this distribution is constructed, it becomes possible to calculate where on the x-axis the cut-off representing a desired area would be. The actual difference is also converted to the same units (called standard errors) and the area for this calculated as we have seen before.

Since we are trying to mimic the normal distribution, we have one of the most crucial assumptions for the use of the t-test and other parametric tests. We need assurances that the sample of subjects in a study was taken from a population in which the variable that is being tested is normally distributed. If not, we cannot use parametric tests.

Types of t-tests
There are a variety of t-tests. Commonly we will have two independent groups. If we were to compare the average cholesterol levels between two groups of patients, participants in these two groups must be independent of each other, i.e. we cannot have the same individual appear in both groups. A special type of t-test exists if the two groups do not contain independent individuals, as would happen if the groups are made up of monozygotic (identical) twins or if we test the same variable in the same group of participants before and after an intervention (with the two sets of data constituting the two groups).

There are also two variations of the t-test based on equal and unequal variances. It is important to consider the difference in the variances (square of the standard deviation) of the data point values for the two groups. If there is a big difference, a t-test assuming unequal variances should be used.
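
The two variance assumptions lead to two slightly different calculations. The sketch below shows both for two independent groups: the classic equal-variance (pooled) version and the unequal-variance version, usually called Welch's t-test (a name not used in these notes). It returns only the t statistic and the degrees of freedom; converting these to a p-value needs the t-distribution, which is not in Python's standard library. The cholesterol values are made up for illustration.

```python
from statistics import fmean, variance


def pooled_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """t statistic and degrees of freedom assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2  # total participants minus number of groups
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    standard_error = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (fmean(a) - fmean(b)) / standard_error, df


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """t statistic and approximate degrees of freedom assuming unequal variances."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    standard_error = (va + vb) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return (fmean(a) - fmean(b)) / standard_error, df


# Hypothetical total cholesterol values (mmol/L) for two independent groups
treatment = [4.1, 4.8, 5.0, 4.4, 4.6, 5.2]
placebo = [5.1, 5.9, 4.9, 6.3, 5.6, 6.8]
print(pooled_t(treatment, placebo))
print(welch_t(treatment, placebo))
```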

ANOVA

ANOVA is the acronym for analysis of variance. As opposed to the t-test, ANOVA can compare the means of more than two groups. There are a number of different ANOVA tests, which can deal with more than one factor.

The most common type of ANOVA test is one-way ANOVA. Here we simply use a single factor (variable) and compare more than two groups to each other.

ANOVA looks at the variation of values both inside of groups and between groups and constructs a distribution based on the same ideas that we have already met, namely the Central Limit Theorem and combinations.
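
The comparison of variation between groups with variation inside groups is usually summarised in a single number, the F statistic (a term not used in these notes). The sketch below computes it for a one-way ANOVA using only the standard library; turning F into a p-value requires the F-distribution, which is left out here. The white cell count values are made up for illustration.

```python
from statistics import fmean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F statistic, between-group df, within-group df) for a one-way ANOVA."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual values around their own group mean
    ss_within = sum((value - fmean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical white cell counts for three groups of patients
groups = [[6.1, 7.4, 5.9, 6.8], [8.2, 9.1, 7.7, 8.8], [6.5, 7.0, 7.9, 6.2]]
print(one_way_anova_f(groups))
```

A large F value means the group means differ by more than the scatter inside the groups would suggest.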

When comparing more than two groups, it is essential to start with analysis of variance. Only when a significant value is calculated should a researcher continue with comparing two groups directly using a t-test, so as to look for significant differences. If the analysis of variance does not return a significant result, it is pointless to run t-tests between groups and, if done, any significant finding should be ignored.

Linear Regression

Up until now we have been comparing numerical values between two categorical groups. We have had examples of comparing white cell count values between groups A and B, and cholesterol values between groups taking a new test drug and a placebo. The variables cholesterol and white cell count contain data points that are ratio-type numerical and continuous, but the groups themselves are categorical (group A and B or test and control).

We can also compare numerical values directly to each other through the use of linear regression.

In linear regression we are looking for a correlation between two sets of values. These sets must come in pairs. The graph below shows a familiar plot of the mathematical equation y = x. We can plot sets of values on this line, i.e. (0,0), (1,1), (2,2), etc. This is how linear regression is done. The first value in each pair comes from one set of data point values and the second from a second set of data point values. I've mentioned an example before, looking at number of cigarettes smoked per day versus blood pressure value.

Linear regression looks at the correlation between these two sets of values. Does one depend on the other? Is there a change in the one as the other changes?

Sets of data points will almost never fall on a straight line, but the mathematics underlying linear regression can try to make a straight line out of all the data point sets. When we do this we note that we usually get a direction. Most sets are either positively or negatively correlated. With positive correlation, one variable (called the dependent variable, which is on the y-axis) increases as the other (called the independent variable, which is on the x-axis) also increases.

There is also negative correlation and, as you might imagine, the dependent variable decreases as the independent variable increases.
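
One common way of drawing the straight line through the pairs is the method of least squares, which these notes do not name explicitly, so treat it as an assumption of this sketch. The function below returns the slope and intercept of that line; a positive slope corresponds to positive correlation and a negative slope to negative correlation. The cigarette and blood pressure values are made up for illustration.

```python
from statistics import fmean


def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for y as a function of x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical pairs: cigarettes per day (independent), systolic pressure (dependent)
cigarettes = [0, 5, 10, 15, 20, 30]
pressure = [118, 121, 127, 126, 134, 141]
slope, intercept = fit_line(cigarettes, pressure)
print(f"pressure is roughly {intercept:.1f} + {slope:.2f} x cigarettes per day")
```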

Strength of the correlation

As you would have noticed, some of the data point pairs are quite a distance away from the linear regression line. With statistical analysis we can calculate how strongly the pairs of values are correlated and express that strength as a correlation coefficient, r. This correlation coefficient ranges from -1 to +1, with negative one being absolute negative correlation. This means that all the dots would fall on the line and there is perfect movement in the one variable as the other moves. With a correlation of positive one we have the opposite, an absolute positive correlation. In most real-life situations, the correlation coefficient will fall somewhere in between.

There is also the zero value, which means no correlation at all.
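
The correlation coefficient r described above is commonly computed as Pearson's correlation coefficient; that name and the function below are additions for illustration, not taken from the notes.

```python
from statistics import fmean


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between paired values x and y."""
    mean_x, mean_y = fmean(x), fmean(y)
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / (sum_xx * sum_yy) ** 0.5


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # all points on a rising line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))        # all points on a falling line
print(pearson_r([1, 2, 3, 4, 5], [2, 1, 3, 1, 2]))  # no linear relationship
```

The three calls give the extremes and the middle of the scale: +1, -1 and 0 (up to floating-point rounding).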

Correlating variables is always fascinating, but comes with a big warning. Any correlation between variables does not necessarily mean causation. Just because two variables are correlated does not mean the change in one is caused by a change in the other. There might be a third factor influencing both. Proof of a causal relationship requires much more than linear regression.

Nonparametric testing for your non-normal data
Nonparametric tests

When comparing two or more groups of numerical data values, we have a host of statistical tools on hand. We have seen all the t-tests for use when comparing two groups as well as ANOVA for comparing more than two groups. As mentioned in the previous lecture, though, the use of these tests does require that some assumptions are met.

Chief among these was the fact that the data point values for a specific variable must come from a population in which that same variable is normally distributed. We are talking about a population and therefore a parameter, hence the term parametric tests.

Unfortunately we do not have access to the data point values for the whole population. Fortunately there are statistical tests to measure the likelihood that the data point values in the underlying population are normally distributed and they are done based on the data point values that are available from the sample of participants.
Checking for normality

There are a variety of ways to check whether parametric tests should be used. I'll mention two here. The first is quite visual and makes use of what is called a QQ plot. In a QQ plot, all the data point values from a sample group for a given variable are sorted in ascending order and each is assigned its percentile rank value, called its quantile. This refers to the percentage of values in the set that fall below that specific value. This is plotted against the quantiles of values from a distribution that we need to test against. In deciding if a parametric test should be used, this would be the normal distribution.
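
The pairs of points that make up a QQ plot can be computed without any plotting library. The sketch below is one possible construction: it assigns the i-th sorted value the quantile (i - 0.5) / n, a common plotting-position convention that is an assumption here, since the notes do not specify one. The white cell counts are made up for illustration.

```python
from statistics import NormalDist


def qq_points(values: list[float]) -> list[tuple[float, float]]:
    """Pair each sorted sample value with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()
    points = []
    for i, value in enumerate(ordered, start=1):
        quantile = (i - 0.5) / n  # share of the sample at or below this value
        points.append((standard_normal.inv_cdf(quantile), value))
    return points


# Hypothetical white cell counts
for theoretical, observed in qq_points([7.2, 5.1, 6.4, 9.8, 6.0, 8.1, 6.9]):
    print(f"{theoretical:6.2f}  {observed:5.1f}")
```

If the sample comes from a normally distributed population, plotting the second column against the first gives points close to a straight line.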

In the first image below we have a computer-generated plot showing data point values that do not follow the hypothetical straight (red) line on which they would fall if the data point values for the sample were taken from a population in which that variable was normally distributed.

In the image below, we note the opposite and would all agree that these points are a much closer match for the red line of the normal distribution.

If you look closely at these graphs, you will also note an R-squared value. That is the square of the correlation coefficient, r, which we met in the lesson on linear regression. Note the value of 0.99 (very close to 1) for the second set of values, versus only 0.84 for the first.

The second method that I will mention here is the Kolmogorov-Smirnov test. It is in itself a form of non-parametric test and can compare a set of data point values against a reference probability distribution, most notably the normal distribution. As with all statistical tests of comparison it calculates a p-value. Under the null hypothesis, the sample data point values are drawn from the same distribution against which they are tested. The alternative hypothesis would state that they are not, and accepting it would demand the use of a non-parametric test for comparing sets of data point values.
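
The Kolmogorov-Smirnov test is built on a single number: the largest gap between the cumulative distribution of the sample and that of the reference distribution. The sketch below computes only that gap (often written D); the conversion to a p-value is omitted because it needs tables or a statistics package. The two data sets are constructed artificially, one spread like a normal distribution and one strongly skewed, and each is compared with a normal distribution fitted to its own mean and standard deviation, which is a simplification made for this example.

```python
import math
from statistics import NormalDist


def ks_statistic(values: list[float], reference: NormalDist) -> float:
    """Largest gap between the sample's cumulative distribution and the reference."""
    ordered = sorted(values)
    n = len(ordered)
    largest_gap = 0.0
    for i, value in enumerate(ordered, start=1):
        expected = reference.cdf(value)
        # The sample's cumulative share jumps from (i - 1) / n to i / n at this value
        largest_gap = max(largest_gap, i / n - expected, expected - (i - 1) / n)
    return largest_gap


# Artificial data: 100 values spread like a normal curve, 100 values strongly skewed
normal_like = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
skewed = [-math.log(1 - (i + 0.5) / 100) for i in range(100)]

d_normal = ks_statistic(normal_like, NormalDist.from_samples(normal_like))
d_skewed = ks_statistic(skewed, NormalDist.from_samples(skewed))
print(f"normal-like data: D = {d_normal:.3f}, skewed data: D = {d_skewed:.3f}")
```

The skewed data produce a much larger gap, which is what would lead the test to reject normality.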

Non-parametric tests to the rescue

Since we are going to compare numerical values to each other and in most cases make use of the t-distribution of sample means (trying to mimic the normal distribution), it is clear that using parametric tests when the data point values seem not to come from a population in which those values are normally distributed would lead to incorrect areas under the curve (p-values). In these cases, it is much better to use the set of non-parametric tests that we will meet this week.

Nonparametric tests

The most common tests used in the literature to compare numerical data point values are t-tests, analysis of variance, and linear regression.

These tests can be very accurate when used appropriately, but do suffer from the fact that fairly stringent assumptions must be made for their use. Not all researchers comment on whether these assumptions were met before choosing to use these tests.

Fortunately, there are alternative tests for when the assumptions fail and these are called nonparametric tests. They go by names such as the Mann-Whitney-U test and the Wilcoxon signed-rank test.

In this lesson we will take a look at when the assumptions for parametric tests are not met so that you can spot them in the literature and we will build an intuitive understanding of the basis for these tests. They deserve a lot more attention and use in healthcare research.

Key concepts

Common parametric tests for the analysis of numerical data type values include the various t-tests, analysis of variance, and linear regression.
The most important assumption that must be met for their appropriate use is that the sample data points must be shown to come from a population in which the parameter is normally distributed.
The term parametric stems from the word parameter, which should give you a clue as to the underlying population parameter.
Testing whether the sample data points are from a population in which the parameter is normally distributed can be done by checking for skewness in the sample data or by the use of quantile (distribution) plots, amongst others.
When this assumption is not met, it is not appropriate to use parametric tests.
The inappropriate use of parametric analyses may lead to false conclusions.
Nonparametric tests are slightly less sensitive at picking up differences between groups.
Nonparametric tests can be used for numerical data types as well as ordinal categorical data types.
When data points are not from a normal distribution, the mean (on which parametric tests are based) is not a good point estimate. In these cases it is better to consider the median.
Comparing medians makes use of signs, ranks, sign ranks and rank sums.
When using signs, all of the sample data points are grouped together and each value is assigned a label of either zero or (plus) one based on whether it is at or lower than a suspected median (zero) or higher than that median (one).
The closer the suspected value is to the true median, the closer the sum of the signs should be to one half of the size of the sample.
A distribution can be created from a rank where all the sample values are placed in ascending order and ranked from one, with ties resolved by giving them an average rank value.
The sign value is multiplied by the (absolute value) rank number to give the sign rank value.
In the rank sums method, specific tables can be used to compare values in groups, making lists of which values beat which values, with the specific outcome being one of many possible outcomes.
The Mann-Whitney-U (or Mann-Whitney-Wilcoxon or Wilcoxon-Mann-Whitney or Wilcoxon rank-sum) test uses the rank-sum distribution.
The Kruskal-Wallis test is the nonparametric equivalent to the one-way analysis of variance test.
If a Kruskal-Wallis test finds a significant p-value, then the individual groups can be compared using the Mann-Whitney-U test.
The Wilcoxon signed-rank test is analogous to the parametric paired-sample t-test.
Spearman's rank correlation is analogous to linear regression.
The alternative Kendall's rank test can be used for more accuracy when the Spearman's rank correlation rejects the null hypothesis.
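
The ranking and rank-sum ideas in the list above can be sketched in a few lines. The code below is an illustration, not the exact procedure from the lectures: it ranks values from one with ties given their average rank, and then derives the Mann-Whitney U values from the rank sum, where each U counts how many cross-group pairs one group "wins". The function names and the sample values are made up; looking up the p-value in the rank-sum distribution is left out.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values in ascending order from one; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def mann_whitney_u(a: list[float], b: list[float]) -> tuple[float, float]:
    """U values for groups a and b, computed from the rank sum of group a."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2  # pairs in which a value from a is higher
    u_b = len(a) * len(b) - u_a
    return u_a, u_b


print(average_ranks([10, 20, 20, 30]))               # [1.0, 2.5, 2.5, 4.0]
print(mann_whitney_u([3.1, 4.0, 2.8], [5.2, 4.9, 6.0]))  # (0.0, 9.0)
```

In the second call every value in the first group is lower than every value in the second, so the first group wins none of the nine possible pairings.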


Week 6: Categorical data and analyzing accuracy of results

Comparing categorical data

In the previous sections we looked at comparing numerical data types. What about methods to analyze categorical data, though? The most often used statistical test for categorical data types, both nominal and ordinal, is the chi-squared test.

In this section we will use the chi-squared (χ²) distribution to perform a goodness-of-fit test and have a look at how to test sample data points for independence.

Thechi-squaredgoodness-of-ttest

Beforewegettothemorecommonlyusedchi-squaredtestforindependence,let'sstartowith
thegoodness-of-ttest.Thistestallowsustoseewhetherthedistributionpatternofourdatatsa
predicteddistribution.Thisiscalledagoodness-of-ttest.Inessencewewillpredictadistribution
ofvalues,gooutandmeasuresomeactualdataandseehowwellourpredictionfaired.

Sincewearedealingwithafrequencydistribution,wehavetocounthowmanytimesadata
valueoccursanddivideitbythesum-totalofvalues.Letsconsiderpredictingthevemost
commonemergencysurgicalproceduresoverthenextvemonths.Thepredictionestimatesthat
appendectomieswillbemostcommon,makingup40%ofthetotal,cholecystectomiesmakingup
30%ofthetotal,incisionanddrainage(I&D)ofabscessesmakingup20%ofthetotalandwith
5%eachwepredictrepairingperforatedpepticulcersandmajorlowerlimbamputations.Over
thenextvemonthsweactuallynotethefollowing:290appendectomies,256cholecystectomies,
146I&Dprocedures,64perforatedpepticulcerrepairsand44amputations.

Achi-squaregoodness-of-ttestallowsustoiftheobservedfrequencydistributiontsour
prediction.Ournullhypothesiswouldbethattheactualfrequencydistributioncanbedescribed
bytheexpecteddistributionandthetesthypothesisstatesthatthesewoulddier.

Theactualvaluesarealreadyavailable.Wedoneedtocalculatetheexpected(predicted)values,
though.Fortunatelyweknowthetotal,whichinourexampleabovewas800(addingalltheactual
values)andwecanconstructvaluesbasedonthepredictedpercentagesabove.Thiswillleaveis
with40%of800,whichis320.Comparethistotheactualvalueof290.Forthe30%,20%and
two5%valuesweget240,160,40and40,whichcomparestotheobservedvaluesof256,146,
64and44.

The chi-square value is calculated from the differences between the observed and expected
values and from this we can calculate a probability of having found this difference (a p-value).
If it is less than our chosen value of significance we can reject the null hypothesis and accept
the test hypothesis. If not, we cannot reject the null hypothesis.
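
The surgical example above can be worked through in a few lines of Python. This is only a sketch
using the standard library: the helper chi2_sf and all variable names are illustrative choices,
and taking the degrees of freedom as the number of categories minus one is the usual convention
for a goodness-of-fit test rather than something stated in these notes.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable (hypothetical helper)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2                 # closed form for df = 2
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1      # closed form for df = 1
    # Step up two degrees of freedom at a time:
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2) * math.log(x / 2) - x / 2 - math.lgamma(k / 2 + 1))
        k += 2
    return q

# Predicted share of each procedure and the counts actually observed
predicted_share = {"appendectomy": 0.40, "cholecystectomy": 0.30, "I&D": 0.20,
                   "ulcer repair": 0.05, "amputation": 0.05}
observed = {"appendectomy": 290, "cholecystectomy": 256, "I&D": 146,
            "ulcer repair": 64, "amputation": 44}

total = sum(observed.values())                      # 800 procedures in all
expected = {name: share * total for name, share in predicted_share.items()}

# Sum of (observed - expected)^2 / expected over all categories
chi2 = sum((observed[n] - expected[n]) ** 2 / expected[n] for n in observed)
df = len(observed) - 1                              # assumption: categories minus one
p_value = chi2_sf(chi2, df)

print(f"chi-square = {chi2:.2f}, df = {df}, p = {p_value:.5f}")
```

With a significance level of 0.05, a p-value this small would lead us to reject the null
hypothesis that the observed procedures follow the predicted distribution.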

The chi-squared test for independence

This is also called the χ²-test for association (when considering treatment and condition) and is
the common form that we see in clinical literature. It is performed by constructing so-called
contingency tables.

The null hypothesis states that there is no association between treatment and condition, with the
alternative hypothesis stating that there is an association. Below is a contingency table,
clearly showing the categorical (ordinal) nature of the data.

Assessment (categorical)      Treatment   Placebo   Totals
Considerable improvement         27          5        32
Moderate improvement             11         12        23
No change                         3          2         5
Moderate deterioration            4         13        17
Considerable deterioration        5          7        12
Death                             4         14        18
Totals                           54         53       107

This table represents the observed totals from a hypothetical study and simply counts the number
of occurrences of each outcome, i.e. 27 patients in the treatment group were assessed as having
improved considerably, whereas only five did so in the placebo group. Note how totals occur for
both the rows and columns of the data. Thirty-two patients in total (both treatment and placebo
groups) showed considerable improvement. There were 54 patients in the treatment group and 53
in the placebo group.

From this table an expected table can be calculated. Mathematical analysis of these two tables
results in a χ²-value, which is converted to a p-value. For a p-value less than a chosen value of
significance we can reject the null hypothesis, thereby accepting the alternate hypothesis, that
there is an association between treatment and outcome. When viewing the observed table above,
that would mean that there is a difference in proportions between the treatment and placebo
columns. Stated differently, which (treatment) group a patient is in does affect the outcome
(the two are not independent).

The calculation for the p-value using the chi-squared test makes use of the concept of degrees of
freedom. This is a simple calculation and multiplies two values. The first one subtracts 1 from the
number of columns and the second subtracts 1 from the number of rows. In our example above,
we have two columns and six rows. Subtracting 1 from each yields 1 and 5. Multiplying them
yields a value of five for the degrees of freedom.
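
As a rough illustration of how the expected table, the χ²-value and the degrees of freedom fit
together, the sketch below works through the observed table above. It assumes the usual
construction in which each expected count is (row total x column total) / grand total, which the
notes do not spell out, and it compares the statistic with the tabulated 5% critical value for
five degrees of freedom (11.07) instead of computing an exact p-value. Variable names are
illustrative.

```python
# Observed counts: one row per assessment, columns are (treatment, placebo)
observed = [
    [27, 5],    # considerable improvement
    [11, 12],   # moderate improvement
    [3, 2],     # no change
    [4, 13],    # moderate deterioration
    [5, 7],     # considerable deterioration
    [4, 14],    # death
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under the null hypothesis of no association (assumed construction)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (columns - 1) x (rows - 1)
df = (len(col_totals) - 1) * (len(row_totals) - 1)

CRITICAL_VALUE_5_PERCENT_DF_5 = 11.07   # standard table value for alpha = 0.05, df = 5
reject_null = chi2 > CRITICAL_VALUE_5_PERCENT_DF_5

print(f"chi-square = {chi2:.2f}, df = {df}, reject null at 5%: {reject_null}")
```

A statistic above the critical value corresponds to a p-value below 0.05, so for this table the
null hypothesis of no association would be rejected.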

Fisher's exact test

There are cases in which the χ²-test does become inaccurate. This happens when the numbers
are quite small, with totals in the order of five or less. There is actually a rule called Cochran's
rule which states that more than 80% of the values in the expected table (above) must be larger
than five. If not, Fisher's exact test should be used. Fisher's test, though, only considers two
columns and two rows. So in order to use it, the categorical numbers above must be reduced by
combining some of the categories. In the example above we might combine considerable
improvement, moderate improvement and no change into a single row and all the deteriorations
and death in a second row, leaving us with a two column and two row contingency table
(observed table).
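
The two steps described above, checking Cochran's rule and collapsing the table, can be sketched
as follows. The expected counts are again assumed to be (row total x column total) / grand total,
the rule is applied exactly as worded here (more than 80% of expected values larger than five),
and the names are illustrative.

```python
# Observed counts: one row per assessment, columns are (treatment, placebo)
observed = [
    [27, 5],    # considerable improvement
    [11, 12],   # moderate improvement
    [3, 2],     # no change
    [4, 13],    # moderate deterioration
    [5, 7],     # considerable deterioration
    [4, 14],    # death
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Cochran's rule as stated in the text: more than 80% of expected values above five
cells = [e for row in expected for e in row]
share_above_five = sum(e > 5 for e in cells) / len(cells)
cochran_ok = share_above_five > 0.80

# Collapse six rows into two: improvement or no change versus deterioration or death
improved_or_unchanged = [sum(col) for col in zip(*observed[:3])]
deteriorated_or_died = [sum(col) for col in zip(*observed[3:])]
collapsed = [improved_or_unchanged, deteriorated_or_died]

print(f"share of expected values above five: {share_above_five:.0%}")
print("collapsed table:", collapsed)
```

For this particular table only the two "no change" cells have expected values below five, so the
rule is met and the χ²-test would be acceptable. The collapsed table is shown purely to
illustrate how the categories could be combined.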

The calculation for Fisher's test uses factorials. Five factorial (written as 5!) means
5 x 4 x 3 x 2 x 1 = 120 and 3! is 3 x 2 x 1 = 6. For interest's sake 1! = 1 and 0! is also equal
to 1. As you might realise, factorial values increase in size quite considerably. In the example
above we had a value of 27 and 27! is a value with 29 digits (roughly 1.09 x 10^28). When such
large values are used, a researcher must make sure that his or her computer can accurately
manage such large numbers and not make rounding mistakes. Fisher's exact test should be
reserved for cases where small sample sizes require it.
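
To make the factorial arithmetic concrete, here is a minimal sketch of a two-sided Fisher's exact
test on the collapsed two-by-two table (41 and 19 in the first row, 13 and 34 in the second).
Python integers and fractions are exact, which sidesteps the rounding concern mentioned above.
The function name is hypothetical, and the two-sided rule used (add up every table that is no
more probable than the observed one, with all margins fixed) is one common convention, assumed
here rather than taken from these notes.

```python
from fractions import Fraction
from math import comb, factorial

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]] (illustrative helper)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with all margins fixed
        return Fraction(comb(row1, x) * comb(row2, col1 - x), comb(n, col1))

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Add the probabilities of all tables that are no more likely than the observed one
    return sum(p for p in map(prob, range(lowest, highest + 1)) if p <= p_observed)

p_value = fisher_exact_two_sided(41, 19, 13, 34)
print(f"Fisher's exact p-value: {float(p_value):.2e}")

# Exact integer arithmetic: 27! has 29 digits and loses no precision
print(len(str(factorial(27))))
```

The binomial coefficients computed by comb are themselves ratios of factorials, which is where
the very large intermediate numbers come from.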

Sensitivity, specificity, and predictive values
Considering medical investigations

The terms sensitivity and specificity, as well as positive and negative predictive values, are
used quite often in the medical literature and it is very important in the day-to-day management
of patients that healthcare workers are familiar with these terms.

These four terms are used when we consider medical investigations and tests. They look at the
problem from two points of view. In the case of sensitivity and specificity we consider how many
patients will be correctly indicated as suffering from a disease or not suffering from a disease.
From this vantage point, no test has been ordered and we do not have the results.

This is in contrast to the use of positive and negative predictive value. From this point of view
we already have the test results and need to know how to interpret a positive and a negative
finding. Sensitivity and specificity help us to decide which test to use and the predictive values
help us to decide how to interpret the results once we have them.

Sensitivity and specificity

Let's consider the choice of a medical test or investigation. We need to be aware of the fact that
tests are not completely accurate. False positive and negative results do occur. With a false
positive result the patient really does not have the disease, yet the results return positive. In
contrast to this we have the false negative result. Although the test returns negative, the patient
really does have the disease. As a good example we often note headlines scrutinising the use of
screening tests such as mammography, where false positive tests lead to both unnecessary
psychological stresses and further investigation, even surgery.

Sensitivity refers to how often a test returns positive when patients really have the disease. It
requires some gold-standard by which we absolutely know that the patient has the disease. So
imagine we have a set of 100 patients with a known disease. If we subject all of them to a new (or
different) test, the sensitivity will be the percentage of times that this test returns a positive
result. If 86 of these tests come back as positive, we have an 86% true positive rate, stated as a
sensitivity of 86%. That means that in 14% of cases of using this test, we will get a false
negative and might miss the fact that a patient has the disease that we were investigating.

Specificity refers to how often a test returns a negative result in the absence of a disease. Once
again, it requires the presence of some gold-standard, whereby we absolutely know that a
disease is absent. Let's use a hundred patients again, all known not to have a certain disease. If
we subject them to a test and 86 of those tests come back as negative, we have a specificity of
86%. This would also mean that 14% of patients will have a false positive result and might be
subjected to unnecessary further investigations and even interventions.

The great majority of medical tests and investigations cannot absolutely discriminate between
those with and without the presence of a disease and we have to be circumspect when choosing
any test.

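Here we have an example:

We note 1000 patients, 100 with a disease and 900 without. They are all subjected to a new test
and 180 return a positive result and 820 a negative result. Of the 100 patients with the disease,
90 test positive (true positives) and 10 test negative (false negatives). Of the 900 patients
without the disease, 90 test positive (false positives) and 810 test negative (true negatives).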

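The equations for sensitivity and specificity are:

Sensitivity = true positives / (true positives + false negatives)
Specificity = true negatives / (true negatives + false positives)

This gives us a sensitivity of (90/100) 90% and a specificity of (810/900) 90%.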
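
The same arithmetic in Python, as a small sketch. The counts of 10 false negatives and 90 false
positives are not quoted directly in the example; they are derived here from the group sizes
(100 with the disease, 900 without) and the 90 and 810 correct results. Variable names are
illustrative.

```python
with_disease, without_disease = 100, 900   # known from the gold standard
true_positives = 90                        # diseased patients who test positive
true_negatives = 810                       # disease-free patients who test negative

# Derived counts (not quoted directly in the example)
false_negatives = with_disease - true_positives       # diseased but test negative
false_positives = without_disease - true_negatives    # disease-free but test positive

# The derived counts agree with the 180 positive and 820 negative results reported
positive_results = true_positives + false_positives
negative_results = true_negatives + false_negatives

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```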

Predictive values

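Let's turn the tables and now consider how to interpret test results once they return. Again we
could imagine that some tests will return both false positive and false negative results.

We express positive predictive value as the percentage of patients with a positive test result
that turn out to have the disease and we express negative predictive value as the percentage of
patients with a negative test result that turn out not to have the disease. The formulae for
predictive values are simple:

Positive predictive value = true positives / (true positives + false positives)
Negative predictive value = true negatives / (true negatives + false negatives)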

This gives us a positive predictive value of (90/180) 50% (only half of patients with a positive
result will actually have the disease) and a negative predictive value of (810/820) 99%, which
means that almost all patients with a negative result will actually not have the disease.
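
A short sketch of the predictive value calculation for the same 1000 patients. The denominators
are simply the 180 positive and 820 negative test results from the example, and the variable
names are illustrative.

```python
true_positives = 90      # positive test, disease present
true_negatives = 810     # negative test, disease absent
positive_results = 180   # everyone who tested positive
negative_results = 820   # everyone who tested negative

# Of those who tested positive, how many really have the disease?
positive_predictive_value = true_positives / positive_results

# Of those who tested negative, how many really are disease-free?
negative_predictive_value = true_negatives / negative_results

print(f"PPV = {positive_predictive_value:.1%}, NPV = {negative_predictive_value:.1%}")
```

Note that the same test has a sensitivity and specificity of 90% each, yet its positive predictive
value in this sample is only 50%.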

Predictive values are very dependent on the prevalence of a disease. Here we chose a sample
set of 1000 patients in which the disease existed in only 10%. It is this low prevalence which
gives us the poor positive predictive value. When interpreting positive and negative predictive
values, you must always compare the prevalence of the disease in the study sample versus the
prevalence of the disease in the patient population that you see. (There are mathematical
methods of converting results to different levels of prevalence.)
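
One such conversion follows from Bayes' theorem: if sensitivity and specificity are assumed to
stay the same, the predictive values at any prevalence can be recomputed from the expected
proportions of true and false results. The sketch below is one way to do this and is not
necessarily the method the notes have in mind. The function name and the prevalence levels
other than 10% are illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence                 # diseased and test positive
    false_neg = (1 - sensitivity) * prevalence          # diseased but test negative
    false_pos = (1 - specificity) * (1 - prevalence)    # healthy but test positive
    true_neg = specificity * (1 - prevalence)           # healthy and test negative
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same test (90% sensitivity, 90% specificity) at three levels of prevalence
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.1%}, NPV = {npv:.1%}")
```

At 10% prevalence this reproduces the 50% positive predictive value from the example. At lower
prevalence the positive predictive value falls further, and at higher prevalence it rises.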

Reference:

1. Sutton PA, Humes DJ, Purcell G, et al. The Role of Routine Assays of Serum Amylase
and Lipase for the Diagnosis of Acute Abdominal Pain. Annals of The Royal College of
Surgeons of England. 2009;91(5):381-384. doi:10.1308/003588409X392135.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2758431/
