Algorithms For Calculating Variance

From Wikipedia, the free encyclopedia
Algorithms for calculating variance play a major role in computational statistics. A key difficulty in the design of good
algorithms for this problem is that formulas for the variance may involve sums of squares, which can lead to numerical
instability as well as to arithmetic overflow when dealing with large values.
Contents
1 Naïve algorithm
1.1 Computing shifted data
2 Two-pass algorithm
2.1 Compensated variant
3 Online algorithm
4 Weighted incremental algorithm
5 Parallel algorithm
6 Example
7 Higher-order statistics
8 Covariance
9 See also
10 References
11 External links
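Naïve algorithm
A formula for calculating the variance of an entire population of size N is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N}$$
Using Bessel's correction to calculate an unbiased estimate of the population variance from a finite sample of n
observations, the formula is:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$
Therefore, a naïve algorithm to calculate the estimated variance is given by the following:
Let n ← 0, Sum ← 0, SumSq ← 0
For each datum x:
    n ← n + 1
    Sum ← Sum + x
    SumSq ← SumSq + x × x
Var = (SumSq - (Sum × Sum) / n) / (n - 1)
This algorithm can easily be adapted to compute the variance of a finite population: simply divide by N instead of n - 1
on the last line.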
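The pseudocode above can be sketched in Python as follows. This is only an illustration: the name `naive_variance` and the choice to raise an error for fewer than two values are assumptions, not part of the original text.

```python
def naive_variance(data):
    """Sample variance from running sums of x and x*x (numerically fragile)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        # Assumption: treat too-small samples as an error rather than returning 0
        raise ValueError("need at least two values")
    # Divide by n instead of (n - 1) for the variance of a finite population
    return (total_sq - (total * total) / n) / (n - 1)
```

For small, well-scaled inputs such as `[4.0, 7.0, 13.0, 16.0]` this returns the expected sample variance of 30.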
Because SumSq and (Sum × Sum)/n can be very similar numbers, cancellation can leave the result with much less
precision than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this
algorithm should not be used in practice.[1][2] This is particularly bad if the standard deviation is small relative to the
mean. However, the algorithm can be improved by adopting the method of the assumed mean.
Computing shifted data
We can use a property of the variance to avoid the catastrophic cancellation in this formula, namely that the variance is
invariant with respect to changes in a location parameter:
$$\operatorname{Var}(X - K) = \operatorname{Var}(X)$$
with $K$ any constant, which leads to the new formula
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - K)^2 - \left(\sum_{i=1}^{n} (x_i - K)\right)^2 / n}{n - 1}$$
The closer $K$ is to the mean value, the more accurate the result will be, but just choosing a value inside the range of the
samples will guarantee the desired stability. If the values $(x_i - K)$ are small, then there are no problems with the sum of
their squares; conversely, if they are large, it necessarily means that the variance is large as well. In either case the second
term in the formula is always smaller than the first one, so no cancellation may occur.[2]
If we take just the first sample as $K$, the algorithm can be written in the Python programming language as
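def shifted_data_variance(data):
    # With fewer than two values the (n - 1) divisor below would be zero
    if len(data) < 2:
        return 0.0
    K = data[0]  # shift every value by the first sample
    n = 0
    sum_ = 0.0
    sum_sqr = 0.0
    for x in data:
        n = n + 1
        sum_ += x - K
        sum_sqr += (x - K) * (x - K)
    variance = (sum_sqr - (sum_ * sum_) / n) / (n - 1)
    # use n instead of (n - 1) if want to compute the exact variance of the given data
    # use (n - 1) if data are samples of a larger population
    return variance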
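This formula also facilitates incremental computation, which can be expressed as
K = 0.0
n = 0
ex = 0.0
ex2 = 0.0

def add_variable(x):
    global K, n, ex, ex2
    if n == 0:
        K = x  # the first value seen becomes the shift
    n = n + 1
    ex += x - K
    ex2 += (x - K) * (x - K)

def remove_variable(x):
    global n, ex, ex2
    n = n - 1
    ex -= x - K
    ex2 -= (x - K) * (x - K)

def get_meanvalue():
    return K + ex / n

def get_variance():
    return (ex2 - (ex * ex) / n) / (n - 1)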
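Two-pass algorithm
An alternative approach, using a different formula for the variance, first computes the sample mean,
$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n},$$
and then computes the sum of the squares of the differences from the mean,
$$\text{sample variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$
where s is the standard deviation. This is given by the following pseudocode: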
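def two_pass_variance(data):
    n = 0
    sum1 = 0.0
    sum2 = 0.0
    # first pass: compute the mean
    for x in data:
        n += 1
        sum1 += x
    mean = sum1 / n
    # second pass: sum the squared differences from the mean
    for x in data:
        sum2 += (x - mean) * (x - mean)
    variance = sum2 / (n - 1)
    return variance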
This algorithm is numerically stable if n is small.[1][3] However, the results of both of these simple algorithms ("Naïve"
and "Two-pass") can depend inordinately on the ordering of the data and can give poor results for very large data sets due
to repeated roundoff error in the accumulation of the sums. Techniques such as compensated summation can be used to
combat this error to a degree.
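As one illustration of compensated summation, the sketch below uses Kahan summation (listed under "See also") to accumulate a sum while tracking the low-order bits lost at each step. The function names `kahan_sum` and `compensated_mean` are hypothetical and chosen only for this example.

```python
def kahan_sum(values):
    """Sum floats while carrying a correction term for lost low-order bits."""
    total = 0.0
    compensation = 0.0  # running estimate of the rounding error accumulated so far
    for x in values:
        y = x - compensation            # apply the correction to the incoming value
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost
        total = t
    return total


def compensated_mean(values):
    """Mean computed from a Kahan-compensated sum."""
    values = list(values)
    return kahan_sum(values) / len(values)
```

A mean computed this way could replace the plain running sum in the first pass of the two-pass algorithm.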
Compensated variant
The compensated-summation version of the algorithm above reads:[4]
def compensated_variance(data):
    n = 0
    sum1 = 0.0
    for x in data:
        n += 1
        sum1 += x
    mean = sum1 / n
    sum2 = 0.0
    sum3 = 0.0
    for x in data:
        sum2 += (x - mean) ** 2
        sum3 += x - mean  # would be exactly zero without rounding error
    variance = (sum2 - sum3 ** 2 / n) / (n - 1)
    return variance
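Online algorithm
It is often useful to be able to compute the variance in a single pass, inspecting each value $x_i$ only once; for example,
when the data are being collected without enough storage to keep all the values, or when costs of memory access
dominate those of computation. For such an online algorithm, a recurrence relation is required between quantities from
which the required statistics can be calculated in a numerically stable fashion.
The following formulas can be used to update the mean and (estimated) variance of the sequence, for an additional
element $x_n$. Here, $\bar{x}_n$ denotes the sample mean of the first n samples $(x_1, \ldots, x_n)$, $s_n^2$ their sample variance, and $\sigma_n^2$ their
population variance.
$$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n} = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$
$$s_n^2 = \frac{n-2}{n-1}\, s_{n-1}^2 + \frac{(x_n - \bar{x}_{n-1})^2}{n}, \quad n > 1$$
$$\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)}{n}$$
These formulas suffer from numerical instability. A better quantity for updating is the sum of squares of differences from
the current mean, $\sum_{i=1}^{n} (x_i - \bar{x}_n)^2$, here denoted $M_{2,n}$:
$$M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$$
$$s_n^2 = \frac{M_{2,n}}{n-1}$$
$$\sigma_n^2 = \frac{M_{2,n}}{n}$$
A numerically stable algorithm for the sample variance is given below. It also computes the mean. This algorithm was
found by Welford,[5][6] and it has been thoroughly analyzed.[7][8] It is also common to denote $M_k = \bar{x}_k$ and $S_k = M_{2,k}$.[9]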
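def online_variance(data):
    n = 0
    mean = 0.0
    M2 = 0.0
    for x in data:
        n += 1
        delta = x - mean       # difference from the old mean
        mean += delta / n
        M2 += delta * (x - mean)  # uses both the old and the updated mean
    if n < 2:
        return float('nan')
    else:
        return M2 / (n - 1)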
This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient
because of the division operation inside the loop. For a particularly robust two-pass algorithm for computing the variance,
one can first compute and subtract an estimate of the mean, and then use this algorithm on the residuals.
The parallel algorithm below illustrates how to merge multiple sets of statistics calculated online.
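When values arrive one at a time rather than as a complete sequence, the same update can be kept in a small accumulator object. The following is a minimal sketch under that assumption; the class name `RunningVariance` and its method names are hypothetical.

```python
class RunningVariance:
    """Accumulates count, mean and M2 one value at a time (Welford-style update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two values
        if self.n < 2:
            return float("nan")
        return self.m2 / (self.n - 1)


stats = RunningVariance()
for value in (4.0, 7.0, 13.0, 16.0):
    stats.update(value)
```

After the loop, `stats.mean` and `stats.variance()` hold the mean and sample variance of the four values seen so far, and further values can be added without revisiting earlier ones.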
Weighted incremental algorithm
The algorithm can be extended to handle unequal sample weights, replacing the simple counter n with the sum of weights
seen so far. West (1979)[10] suggests this incremental algorithm:
def weighted_incremental_variance(dataWeightPairs):
    sumweight = 0.0
    mean = 0.0
    M2 = 0.0
    for x, weight in dataWeightPairs:  # Alternatively "for x, weight in zip(data, weights):"
        temp = weight + sumweight
        delta = x - mean
        R = delta * weight / temp
        mean += R
        M2 += sumweight * delta * R  # Alternatively, "M2 = M2 + weight * delta * (x - mean)"
        sumweight = temp
    variance_n = M2 / sumweight
    variance = variance_n * len(dataWeightPairs) / (len(dataWeightPairs) - 1)
    return variance
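Parallel algorithm
Chan et al.[4] note that the above "Online" algorithm is a special case of an algorithm that works for any partition of the
sample $X$ into sets $X_A$, $X_B$, with $n_X = n_A + n_B$:
$$\delta = \bar{x}_B - \bar{x}_A$$
$$\bar{x}_X = \bar{x}_A + \delta \cdot \frac{n_B}{n_X}$$
$$M_{2,X} = M_{2,A} + M_{2,B} + \delta^2 \cdot \frac{n_A n_B}{n_X}.$$
This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.
Chan's method for estimating the mean is numerically unstable when $n_A \approx n_B$ and both are large, because the numerical
error in $\bar{x}_B - \bar{x}_A$ is not scaled down in the way that it is in the $n_B = 1$ case. In such cases, prefer
$$\bar{x}_X = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_X}.$$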
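def parallel_variance(avg_a, count_a, var_a, avg_b, count_b, var_b):
    delta = avg_b - avg_a
    # recover the sums of squared differences from the sample variances
    m_a = var_a * (count_a - 1)
    m_b = var_b * (count_b - 1)
    M2 = m_a + m_b + delta ** 2 * count_a * count_b / (count_a + count_b)
    return M2 / (count_a + count_b - 1)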
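To illustrate how partial results might be merged in practice, the sketch below summarizes two chunks independently as (count, mean, M2) triples and then combines them with the pairwise update. The helper names `summarize` and `merge` are hypothetical, and carrying M2 rather than the variance between steps is a design choice of this example.

```python
def summarize(data):
    """Return (count, mean, M2) for one chunk, computed with two passes."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data)
    return n, mean, m2


def merge(a, b):
    """Combine two (count, mean, M2) summaries into one for the union of the chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


left = summarize([4.0, 7.0])
right = summarize([13.0, 16.0])
n, mean, m2 = merge(left, right)
sample_variance = m2 / (n - 1)
```

Because `merge` returns the same kind of triple it accepts, any number of chunks can be folded together in sequence or in a tree.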
Example
Assume that all floating-point operations use standard IEEE 754 double-precision arithmetic. Consider the sample (4,
7, 13, 16) from an infinite population. Based on this sample, the estimated population mean is 10, and the unbiased
estimate of population variance is 30. Both the "Naïve" algorithm and the "Two-pass" algorithm compute these values correctly.
Next consider the sample ($10^8 + 4$, $10^8 + 7$, $10^8 + 13$, $10^8 + 16$), which gives rise to the same estimated variance as the
first sample. The "Two-pass" algorithm computes this variance estimate correctly, but the "Naïve" algorithm returns
29.333333333333332 instead of 30. While this loss of precision may be tolerable and viewed as a minor flaw of the "Naïve"
algorithm, it is easy to find data that reveal a major flaw in the naïve algorithm: take the sample to be ($10^9 + 4$, $10^9 + 7$,
$10^9 + 13$, $10^9 + 16$). Again the estimated population variance of 30 is computed correctly by the "Two-pass" algorithm, but
the "Naïve" algorithm now computes it as -170.66666666666666. This is a serious problem with the "Naïve" algorithm and is
due to catastrophic cancellation in the subtraction of two similar numbers at the final stage of the algorithm.
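The comparison can be reproduced with a short script. This is a sketch that assumes the inputs are Python floats (IEEE 754 doubles); the compact function bodies below are written for this example and are not the article's listings.

```python
def naive_variance(data):
    """Running-sums formula; suffers from cancellation for large offsets."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - (total * total) / n) / (n - 1)


def two_pass_variance(data):
    """Mean first, then squared differences from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


base = (4.0, 7.0, 13.0, 16.0)
for shift in (0.0, 1e8, 1e9):
    sample = [shift + v for v in base]
    print(shift, naive_variance(sample), two_pass_variance(sample))
```

With a shift of $10^9$ the naïve result is negative, which is impossible for a true variance, while the two-pass result stays at 30.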
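Higher-order statistics
Terriberry[11] extends Chan's formulae to calculating the third and fourth central moments, needed for example when
estimating skewness and kurtosis. With $\delta = \bar{x}_B - \bar{x}_A$ and $n_X = n_A + n_B$:
$$M_{3,X} = M_{3,A} + M_{3,B} + \delta^3 \frac{n_A n_B (n_A - n_B)}{n_X^2} + 3\delta \frac{n_A M_{2,B} - n_B M_{2,A}}{n_X}$$
$$M_{4,X} = M_{4,A} + M_{4,B} + \delta^4 \frac{n_A n_B (n_A^2 - n_A n_B + n_B^2)}{n_X^3} + 6\delta^2 \frac{n_A^2 M_{2,B} + n_B^2 M_{2,A}}{n_X^2} + 4\delta \frac{n_A M_{3,B} - n_B M_{3,A}}{n_X}$$
Here the $M_k$ are again the sums of powers of differences from the mean, $\sum (x - \bar{x})^k$, giving
skewness: $g_1 = \frac{\sqrt{n}\, M_3}{M_2^{3/2}}$,
kurtosis: $g_2 = \frac{n\, M_4}{M_2^2} - 3$.
For the incremental case (i.e., $B = \{x\}$), this simplifies to:
$$\delta = x - m$$
$$m' = m + \frac{\delta}{n}$$
$$M_2' = M_2 + \delta^2 \frac{n - 1}{n}$$
$$M_3' = M_3 + \delta^3 \frac{(n - 1)(n - 2)}{n^2} - \frac{3 \delta M_2}{n}$$
$$M_4' = M_4 + \delta^4 \frac{(n - 1)(n^2 - 3n + 3)}{n^3} + \frac{6 \delta^2 M_2}{n^2} - \frac{4 \delta M_3}{n}$$
By preserving the value $\delta / n$, only one division operation is needed and the higher-order statistics can thus be calculated
for little incremental cost.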
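An example of the online algorithm for kurtosis implemented as described is:
def online_kurtosis(data):
    n = 0
    mean = 0.0
    M2 = 0.0
    M3 = 0.0
    M4 = 0.0
    for x in data:
        n1 = n
        n = n + 1
        delta = x - mean
        delta_n = delta / n  # the single division per element
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        mean = mean + delta_n
        # update the highest moment first: each update uses the old lower moments
        M4 = M4 + term1 * delta_n2 * (n*n - 3*n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
        M3 = M3 + term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 = M2 + term1
    kurtosis = (n * M4) / (M2 * M2) - 3
    return kurtosis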
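The same incremental updates can be used for skewness by dropping the fourth moment. The sketch below assumes the skewness estimator $\sqrt{n}\, M_3 / M_2^{3/2}$; the function name `online_skewness` is hypothetical.

```python
import math


def online_skewness(data):
    """One-pass skewness using the incremental M2 and M3 updates."""
    n = 0
    mean = 0.0
    M2 = 0.0
    M3 = 0.0
    for x in data:
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        term1 = delta * delta_n * n1
        mean += delta_n
        # M3 is updated before M2 because it needs the previous value of M2
        M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 += term1
    return math.sqrt(n) * M3 / M2 ** 1.5
```

Symmetric data such as `[1, 2, 3, 4, 5]` should give a value close to zero, while data with a long right tail gives a positive value.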
Pébay[12] further extends these results to arbitrary-order central moments, for the incremental and the pairwise cases. One
can also find there similar formulas for covariance.
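Choi and Sweetman[13] offer two alternative methods to compute the skewness and kurtosis, each of which can save
substantial computer memory requirements and CPU time in certain applications. The first approach is to compute the
statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting
histogram, which effectively becomes a one-pass algorithm for higher moments. One benefit is that the statistical moment
calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the
data storage format or the original measurement hardware. A relative histogram of a random variable can be constructed
in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin
are counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:
$$H(x_k) = \frac{h(x_k)}{A}$$
where $h(x_k)$ and $H(x_k)$ represent the frequency and the relative frequency at bin $x_k$ and $A = \sum_{k=1}^{K} h(x_k)\,\Delta x_k$ is the total
area of the histogram. After this normalization, the $n$ raw moments and central moments of $x(t)$ can be calculated from
the relative histogram:
$$m_n^{(h)} = \sum_{k=1}^{K} x_k^n\, H(x_k)\, \Delta x_k = \frac{1}{A} \sum_{k=1}^{K} x_k^n\, h(x_k)\, \Delta x_k$$
$$\theta_n^{(h)} = \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n H(x_k)\, \Delta x_k = \frac{1}{A} \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n h(x_k)\, \Delta x_k$$
where the superscript $(h)$ indicates the moments are calculated from the histogram. For constant bin width $\Delta x_k = \Delta x$
these two expressions can be simplified using $I = A / \Delta x$:
$$m_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} x_k^n\, h(x_k)$$
$$\theta_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n h(x_k)$$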
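A minimal sketch of the histogram approach for constant bin width follows. It assumes each bin is represented by a single value (here called its centre) and that the counts per bin are already available; the function name `histogram_moments` and its parameters are hypothetical.

```python
def histogram_moments(bin_centers, counts, order):
    """Return (raw moment, central moment) of the given order from binned counts.

    Assumes constant bin width, so the width cancels and only the counts matter.
    """
    total = float(sum(counts))  # total number of occurrences across all bins
    mean = sum(x * h for x, h in zip(bin_centers, counts)) / total
    raw = sum(x ** order * h for x, h in zip(bin_centers, counts)) / total
    central = sum((x - mean) ** order * h for x, h in zip(bin_centers, counts)) / total
    return raw, central


raw2, central2 = histogram_moments([1.0, 2.0, 3.0], [1, 2, 1], order=2)
```

The accuracy of the result depends on how finely the data are binned, since every value in a bin is treated as if it sat at the bin's representative value.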
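The second approach from Choi and Sweetman[13] is an analytical methodology to combine statistical moments from
individual segments of a time history such that the resulting overall moments are those of the complete time history. This
methodology could be used for parallel computation of statistical moments with subsequent combination of those
moments, or for combination of statistical moments computed at sequential times.
If $Q$ sets of statistical moments are known: $(\gamma_{0,q}, \mu_q, \sigma_q^2, \alpha_{3,q}, \alpha_{4,q})$ for $q = 1, 2, \ldots, Q$, then each $\gamma_n$ can be
expressed in terms of the equivalent $n$ raw moments:
$$\gamma_{n,q} = m_{n,q}\, \gamma_{0,q} \quad \text{for } n = 1, 2, 3, 4 \text{ and } q = 1, 2, \ldots, Q$$
where $\gamma_{0,q}$ is generally taken to be the duration of the $q$-th time history, or the number of points if $\Delta t$ is constant.
The benefit of expressing the statistical moments in terms of $\gamma$ is that the $Q$ sets can be combined by addition, and there is
no upper limit on the value of $Q$.
$$\gamma_{n,c} = \sum_{q=1}^{Q} \gamma_{n,q} \quad \text{for } n = 0, 1, 2, 3, 4$$
where the subscript $c$ represents the concatenated time history or combined $\gamma$. These combined values of $\gamma$ can then be
inversely transformed into raw moments representing the complete concatenated time history
$$m_{n,c} = \frac{\gamma_{n,c}}{\gamma_{0,c}} \quad \text{for } n = 1, 2, 3, 4$$
Known relationships between the raw moments ($m_n$) and the central moments ($\theta_n = E[(x - \mu)^n]$) are then used to
compute the central moments of the concatenated time history. Finally, the statistical moments of the concatenated history
are computed from the central moments:
$$\mu_c = m_{1,c}, \qquad \sigma_c^2 = \theta_{2,c}, \qquad \alpha_{3,c} = \frac{\theta_{3,c}}{\sigma_c^3}, \qquad \alpha_{4,c} = \frac{\theta_{4,c}}{\sigma_c^4} - 3$$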
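The combination-by-addition idea can be sketched as follows, assuming equally spaced samples so that the zeroth quantity of each segment is simply its number of points. The helper names `segment_gammas`, `combine` and `raw_moments` are hypothetical.

```python
def segment_gammas(data, max_order=4):
    """Return [gamma_0, ..., gamma_max_order] for one segment.

    gamma_0 is the number of points; gamma_n equals the n-th raw moment
    times gamma_0, which is just the sum of x**n over the segment.
    """
    return [float(len(data))] + [sum(x ** n for x in data) for n in range(1, max_order + 1)]


def combine(gamma_sets):
    """Combine any number of segments by element-wise addition."""
    return [sum(column) for column in zip(*gamma_sets)]


def raw_moments(gammas):
    """Transform combined gammas back into raw moments m_1 .. m_max_order."""
    return [g / gammas[0] for g in gammas[1:]]


a = [4.0, 7.0]
b = [13.0, 16.0]
combined = combine([segment_gammas(a), segment_gammas(b)])
m1, m2, m3, m4 = raw_moments(combined)
variance_c = m2 - m1 ** 2  # second central moment from the raw moments
```

Note that the final step from raw to central moments is the same kind of subtraction that makes the naïve variance formula fragile, so this sketch is only suitable for well-scaled data.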
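Covariance
Very similar algorithms can be used to compute the covariance. The naïve algorithm is:
$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) / n}{n}$$
For the algorithm above, one could use the following Python code: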
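def naive_covariance(data1, data2):
    n = len(data1)
    sum12 = 0.0
    sum1 = float(sum(data1))
    sum2 = float(sum(data2))
    for i in range(n):
        sum12 += data1[i] * data2[i]
    # divides by n, i.e. the population covariance of the given data
    covariance = (sum12 - sum1 * sum2 / n) / n
    return covariance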
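As for the variance, the covariance of two random variables is also shift-invariant, so given that $k_x$ and $k_y$ are any
two constant values it can be written:
$$\operatorname{Cov}(X, Y) = \operatorname{Cov}(X - k_x, Y - k_y) = \frac{\sum_{i=1}^{n} (x_i - k_x)(y_i - k_y) - \left(\sum_{i=1}^{n} (x_i - k_x)\right)\left(\sum_{i=1}^{n} (y_i - k_y)\right) / n}{n}$$
and again choosing a value inside the range of values will stabilize the formula against catastrophic cancellation as well as
make it more robust against big sums. Taking the first value of each data set, the algorithm can be written as: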
def shifted_data_covariance(dataX, dataY):
    n = len(dataX)
    if n < 2:
        return 0
    Kx = dataX[0]  # shift each data set by its first value
    Ky = dataY[0]
    Ex = 0.0
    Ey = 0.0
    Exy = 0.0
    for i in range(n):
        Ex += dataX[i] - Kx
        Ey += dataY[i] - Ky
        Exy += (dataX[i] - Kx) * (dataY[i] - Ky)
    return (Exy - Ex * Ey / n) / n
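The two-pass algorithm first computes the sample means, and then the covariance:
$$\bar{x} = \sum_{i=1}^{n} x_i / n$$
$$\bar{y} = \sum_{i=1}^{n} y_i / n$$
$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$
The two-pass algorithm may be written as: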
def two_pass_covariance(data1, data2):
    n = len(data1)
    mean1 = float(sum(data1)) / n
    mean2 = float(sum(data2)) / n
    covariance = 0.0
    for i in range(n):
        a = data1[i] - mean1
        b = data2[i] - mean2
        covariance += a * b / n
    return covariance
A slightly more accurate compensated version performs the full naïve algorithm on the residuals. The final sums
$\sum x_i$ and $\sum y_i$ should be zero, but the second pass compensates for any small error.
A slight modification of the online algorithm for computing the variance yields an online algorithm for the covariance:
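def online_covariance(data1, data2):
    mean1 = mean2 = 0.
    M12 = 0.  # running population covariance of the values seen so far
    n = len(data1)
    for i in range(n):
        delta1 = (data1[i] - mean1) / (i + 1)
        mean1 += delta1
        delta2 = (data2[i] - mean2) / (i + 1)
        mean2 += delta2
        M12 += i * delta1 * delta2 - M12 / (i + 1)
    # rescale to the sample covariance
    return n / (n - 1.) * M12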
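A stable one-pass algorithm exists, similar to the one above, that computes the co-moment $C_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)$:
$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$
$$\bar{y}_n = \bar{y}_{n-1} + \frac{y_n - \bar{y}_{n-1}}{n}$$
$$C_n = C_{n-1} + (x_n - \bar{x}_n)(y_n - \bar{y}_{n-1})$$
The apparent asymmetry in that last equation is due to the fact that $(x_n - \bar{x}_n) = \frac{n-1}{n}(x_n - \bar{x}_{n-1})$, so both update terms
are equal to $\frac{n-1}{n}(x_n - \bar{x}_{n-1})(y_n - \bar{y}_{n-1})$. Even greater accuracy can be achieved by first computing the means, then
using the stable one-pass algorithm on the residuals.
Thus we can compute the covariance as
$$\operatorname{Cov}_N(X, Y) = \frac{C_N}{N} = \frac{\operatorname{Cov}_{N-1}(X, Y) \cdot (N - 1) + (x_N - \bar{x}_N)(y_N - \bar{y}_{N-1})}{N}$$
Likewise, there is a formula for combining the covariances of two sets that can be used to parallelize the computation:
$$C_X = C_A + C_B + (\bar{x}_A - \bar{x}_B)(\bar{y}_A - \bar{y}_B) \cdot \frac{n_A n_B}{n_X}.$$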
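The pairwise combination of co-moments can be sketched in the same style as the parallel variance algorithm. This example assumes the commonly cited update in which the product of the two mean differences is scaled by $n_A n_B / n_X$; the helper names `co_moment_summary` and `merge_co_moments` are hypothetical.

```python
def co_moment_summary(xs, ys):
    """Return (count, mean_x, mean_y, co-moment) for one chunk of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    c = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return n, mean_x, mean_y, c


def merge_co_moments(a, b):
    """Combine two chunk summaries into the summary of their union."""
    n_a, mx_a, my_a, c_a = a
    n_b, mx_b, my_b, c_b = b
    n = n_a + n_b
    dx = mx_b - mx_a
    dy = my_b - my_a
    c = c_a + c_b + dx * dy * n_a * n_b / n
    return n, mx_a + dx * n_b / n, my_a + dy * n_b / n, c


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
merged = merge_co_moments(co_moment_summary(xs[:2], ys[:2]),
                          co_moment_summary(xs[2:], ys[2:]))
whole = co_moment_summary(xs, ys)
```

Dividing the merged co-moment by the count (or by the count minus one) then gives the population (or sample) covariance of the full data set.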
See also
Algebraic formula for the variance
Kahan summation algorithm
Squared deviations from the mean
References
1. Bo Einarsson (1 August 2005). Accuracy and Reliability in Scientific Computing. SIAM. p. 47. ISBN 9780898715842. Retrieved 17 February 2013.
2. T. F. Chan, G. H. Golub and R. J. LeVeque (1983). "Algorithms for computing the sample variance: Analysis and recommendations" (PDF). The American Statistician 37: 242–247.
3. Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2 ed) (Problem 1.10). SIAM.
4. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1979), "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances." (PDF), Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University.
5. B. P. Welford (1962). "Note on a method for calculating corrected sums of squares and products" (http://www.jstor.org/stable/1266577). Technometrics 4 (3): 419–420.
6. Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
7. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1983). Algorithms for Computing the Sample Variance: Analysis and Recommendations. The American Statistician 37, 242–247. http://www.jstor.org/stable/2683386
8. Ling, Robert F. (1974). Comparison of Several Algorithms for Computing Sample Means and Variances. Journal of the American Statistical Association, Vol. 69, No. 348, 859–866. doi:10.2307/2286154 (https://dx.doi.org/10.2307%2F2286154)
9. http://www.johndcook.com/standard_deviation.html
10. D. H. D. West (1979). Communications of the ACM, 22, 9, 532–535: Updating Mean and Variance Estimates: An Improved Method
11. Terriberry, Timothy B. (2007), Computing Higher-Order Moments Online
12. Pébay, Philippe (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments" (PDF), Technical Report SAND2008-6212, Sandia National Laboratories
13. Choi, Muenkeun; Sweetman, Bert (2010), Efficient Calculation of Statistical Moments for Structural Health Monitoring (PDF)
External links
Weisstein, Eric W., "Sample Variance Computation" (http://mathworld.wolfram.com/SampleVarianceComputation.html), MathWorld.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Algorithms_for_calculating_variance&oldid=729776128"
Categories: Statistical algorithms; Statistical deviation and dispersion
This page was last modified on 14 July 2016, at 13:47.
