
Algorithms for calculating variance

From Wikipedia, the free encyclopedia

Algorithms for calculating variance play a major role in computational statistics. A key difficulty in the design of good algorithms for this problem is that formulas for the variance may involve sums of squares, which can lead to numerical instability as well as to arithmetic overflow when dealing with large values.

Contents
1 Naïve algorithm
1.1 Computing shifted data
2 Two-pass algorithm
2.1 Compensated variant
3 Online algorithm
4 Weighted incremental algorithm
5 Parallel algorithm
6 Example
7 Higher-order statistics
8 Covariance
9 See also
10 References
11 External links

Naïve algorithm
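A formula for calculating the variance of an entire population of size N is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N}$$

Using Bessel's correction to calculate an unbiased estimate of the population variance from a finite sample of n observations, the formula is:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n - 1}$$

Therefore, a naïve algorithm to calculate the estimated variance is given by the following:

    Let n ← 0, Sum ← 0, SumSq ← 0
    For each datum x:
        n ← n + 1
        Sum ← Sum + x
        SumSq ← SumSq + x × x
    Var = (SumSq − (Sum × Sum)/n)/(n − 1)

This algorithm can easily be adapted to compute the variance of a finite population: simply divide by N instead of n − 1 on the last line.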
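For illustration only, the pseudocode above might be transcribed into Python roughly as follows. The function name `naive_variance` and the choice to return NaN for fewer than two values are assumptions of this sketch, not part of the original pseudocode.

```python
def naive_variance(data):
    """Sample variance from running sums of x and x*x (illustrative sketch)."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        # Assumption of this sketch: the sample variance is undefined here.
        return float('nan')
    # Subtracting two nearly equal numbers here is where precision can be lost.
    return (total_sq - (total * total) / n) / (n - 1)
```

For small, well-scaled inputs such as `[4, 7, 13, 16]` this gives the expected value of 30; its behaviour on large offsets is a separate matter.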

Because SumSq and (Sum × Sum)/n can be very similar numbers, cancellation can cause the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice.[1][2] This is particularly bad if the standard deviation is small relative to the mean. However, the algorithm can be improved by adopting the method of the assumed mean.

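Computing shifted data
We can use a property of the variance to avoid the catastrophic cancellation in this formula, namely that the variance is invariant with respect to changes in a location parameter:

$$\operatorname{Var}(X - K) = \operatorname{Var}(X)$$

with $K$ any constant, which leads to the new formula

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - K)^2 - \left(\sum_{i=1}^{n} (x_i - K)\right)^2 / n}{n - 1}$$

The closer $K$ is to the mean value, the more accurate the result will be, but just choosing a value inside the range of the samples will guarantee the desired stability. If the values $(x_i - K)$ are small, then there are no problems with the sum of their squares; on the contrary, if they are large, it necessarily means that the variance is large as well. In any case the second term in the formula is always smaller than the first one, therefore no cancellation may occur.[2]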
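If we take just the first sample as $K$, the algorithm can be written in the Python programming language as

    def shifted_data_variance(data):
        # With fewer than two values the (n - 1) divisor below would be zero,
        # so return 0 early.
        if len(data) < 2:
            return 0
        K = data[0]  # shift every value by the first sample
        n = 0
        sum_ = 0
        sum_sqr = 0
        for x in data:
            n = n + 1
            sum_ += x - K
            sum_sqr += (x - K) * (x - K)
        variance = (sum_sqr - (sum_ * sum_) / n) / (n - 1)
        # use n instead of (n - 1) to compute the exact variance of the given data
        # use (n - 1) if the data are samples of a larger population
        return variance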

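This formula also facilitates incremental computation, which can be expressed as

    K = 0
    n = 0
    ex = 0
    ex2 = 0

    def add_variable(x):
        global K, n, ex, ex2  # the running state lives at module level
        if n == 0:
            K = x  # shift by the first value seen
        n = n + 1
        ex += x - K
        ex2 += (x - K) * (x - K)

    def remove_variable(x):
        global n, ex, ex2
        n = n - 1
        ex -= (x - K)
        ex2 -= (x - K) * (x - K)

    def get_meanvalue():
        return K + ex / n

    def get_variance():
        return (ex2 - (ex * ex) / n) / (n - 1)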
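One way the add and remove operations above could be combined is a fixed-size sliding window. The following is a minimal sketch under that assumption; the class name `ShiftedWindowVariance`, its methods, and the NaN return for fewer than two values are hypothetical and not part of the original text. It assumes a window size of at least 1.

```python
from collections import deque


class ShiftedWindowVariance:
    """Sliding-window sample variance built on shifted sums (hypothetical helper)."""

    def __init__(self, size):
        self.size = size        # assumed to be >= 1
        self.window = deque()
        self.K = 0.0            # shift constant, fixed at the first value seen
        self.ex = 0.0           # running sum of (x - K)
        self.ex2 = 0.0          # running sum of (x - K)^2

    def push(self, x):
        if not self.window:
            self.K = x
        self.window.append(x)
        d = x - self.K
        self.ex += d
        self.ex2 += d * d
        if len(self.window) > self.size:
            # Drop the oldest value, mirroring remove_variable above.
            old = self.window.popleft() - self.K
            self.ex -= old
            self.ex2 -= old * old

    def variance(self):
        n = len(self.window)
        if n < 2:
            return float('nan')
        return (self.ex2 - self.ex * self.ex / n) / (n - 1)
```

Note that `K` is never updated in this sketch, so if the data drift far from the first value, `K` may end up outside the range of the current window and the stability argument given earlier no longer applies.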

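Two-pass algorithm
An alternative approach, using a different formula for the variance, first computes the sample mean,

$$\bar{x} = \frac{\sum_{j=1}^{n} x_j}{n},$$

and then computes the sum of the squares of the differences from the mean,

$$\text{variance} = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},$$

where s is the standard deviation. This is given by the following pseudocode:

    def two_pass_variance(data):
        n = 0
        sum1 = 0
        sum2 = 0

        # First pass: count the values and compute the mean.
        for x in data:
            n += 1
            sum1 += x

        mean = sum1 / n

        # Second pass: sum the squared deviations from the mean.
        for x in data:
            sum2 += (x - mean) * (x - mean)

        variance = sum2 / (n - 1)
        return variance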

This algorithm is numerically stable if n is small.[1][3] However, the results of both of these simple algorithms ("Naïve" and "Two-pass") can depend inordinately on the ordering of the data and can give poor results for very large data sets due to repeated roundoff error in the accumulation of the sums. Techniques such as compensated summation can be used to combat this error to a degree.
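As a rough illustration of compensated summation, the sketch below applies Kahan summation (listed under "See also") to both passes of the two-pass algorithm. The helper names `kahan_sum` and `two_pass_variance_kahan` are assumptions of this sketch, and it is only one of several ways compensation could be applied.

```python
def kahan_sum(values):
    """Sum an iterable of floats while tracking a running compensation term."""
    total = 0.0
    c = 0.0  # estimate of the low-order bits lost so far
    for v in values:
        y = v - c            # apply the correction to the incoming value
        t = total + y        # low-order bits of y may be lost here
        c = (t - total) - y  # recover what was lost
        total = t
    return total


def two_pass_variance_kahan(data):
    """Two-pass sample variance with compensated sums (illustrative sketch)."""
    data = list(data)
    n = len(data)
    if n < 2:
        return float('nan')  # assumption: undefined for fewer than two values
    mean = kahan_sum(data) / n
    return kahan_sum((x - mean) ** 2 for x in data) / (n - 1)
```

The compensation reduces the growth of roundoff error in long sums; it does not remove the dependence on how accurately the mean itself is represented.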

Compensated variant
The compensated-summation version of the algorithm above reads:[4]

    def compensated_variance(data):
        n = 0
        sum1 = 0
        for x in data:
            n += 1
            sum1 += x
        mean = sum1 / n

        sum2 = 0
        sum3 = 0
        for x in data:
            sum2 += (x - mean) ** 2
            sum3 += (x - mean)  # would be exactly zero in exact arithmetic
        variance = (sum2 - sum3 ** 2 / n) / (n - 1)
        return variance

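Online algorithm

It is often useful to be able to compute the variance in a single pass, inspecting each value $x_i$ only once; for example, when the data are being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation. For such an online algorithm, a recurrence relation is required between quantities from which the required statistics can be calculated in a numerically stable fashion.
The following formulas can be used to update the mean and (estimated) variance of the sequence, for an additional element $x_n$. Here, $\bar{x}_n$ denotes the sample mean of the first n samples $(x_1, \ldots, x_n)$, $s_n^2$ their sample variance, and $\sigma_n^2$ their population variance.

$$\bar{x}_n = \frac{(n-1)\,\bar{x}_{n-1} + x_n}{n} = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

$$s_n^2 = \frac{n-2}{n-1}\, s_{n-1}^2 + \frac{(x_n - \bar{x}_{n-1})^2}{n}, \quad n > 1$$

$$\sigma_n^2 = \frac{(n-1)\,\sigma_{n-1}^2 + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)}{n}$$

These formulas suffer from numerical instability. A better quantity for updating is the sum of squares of differences from the current mean, $\sum_{i=1}^{n} (x_i - \bar{x}_n)^2$, here denoted $M_{2,n}$:

$$M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$$

$$s_n^2 = \frac{M_{2,n}}{n-1}$$

$$\sigma_n^2 = \frac{M_{2,n}}{n}$$

A numerically stable algorithm for the sample variance is given below. It also computes the mean. This algorithm was found by Welford,[5][6] and it has been thoroughly analyzed.[7][8] It is also common to denote $M_k = \bar{x}_k$ and $S_k = M_{2,k}$.[9]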
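
    def online_variance(data):
        n = 0
        mean = 0.0
        M2 = 0.0

        for x in data:
            n += 1
            delta = x - mean           # deviation from the old mean
            mean += delta / n
            M2 += delta * (x - mean)   # old-mean deviation times new-mean deviation

        if n < 2:
            return float('nan')
        else:
            return M2 / (n - 1)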

This algorithm is much less prone to loss of precision due to catastrophic cancellation, but might not be as efficient because of the division operation inside the loop. For a particularly robust two-pass algorithm for computing the variance, one can first compute and subtract an estimate of the mean, and then use this algorithm on the residuals.
The parallel algorithm below illustrates how to merge multiple sets of statistics calculated online.
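The robust two-pass idea mentioned above (subtract a mean estimate first, then run the online update on the residuals) could be sketched as follows. The function name `residual_online_variance` is hypothetical, and using the plain arithmetic mean as the estimate is an assumption of this sketch; any rough estimate of the mean would serve.

```python
def residual_online_variance(data):
    """Online (Welford-style) update applied to residuals from a mean estimate."""
    data = list(data)
    n = len(data)
    if n < 2:
        return float('nan')
    estimate = sum(data) / n  # first pass: a rough estimate of the mean

    count = 0
    mean = 0.0  # running mean of the residuals (should stay near zero)
    m2 = 0.0    # running sum of squared differences from that mean
    for x in data:
        r = x - estimate  # residual; small in magnitude if the estimate is good
        count += 1
        delta = r - mean
        mean += delta / count
        m2 += delta * (r - mean)
    return m2 / (n - 1)
```

Because the variance is shift-invariant, the variance of the residuals equals the variance of the original data.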

Weighted incremental algorithm
The algorithm can be extended to handle unequal sample weights, replacing the simple counter n with the sum of weights seen so far. West (1979)[10] suggests this incremental algorithm:

    def weighted_incremental_variance(dataWeightPairs):
        sumweight = 0
        mean = 0
        M2 = 0

        for x, weight in dataWeightPairs:  # Alternatively "for x, weight in zip(data, weights):"
            temp = weight + sumweight
            delta = x - mean
            R = delta * weight / temp
            mean += R
            M2 += sumweight * delta * R  # Alternatively, "M2 = M2 + weight * delta * (x - mean)"
            sumweight = temp

        variance_n = M2 / sumweight
        # len() requires dataWeightPairs to be a sequence, not a one-shot iterator
        variance = variance_n * len(dataWeightPairs) / (len(dataWeightPairs) - 1)
        return variance

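Parallel algorithm
Chan et al.[4] note that the above "Online" algorithm is a special case of an algorithm that works for any partition of the sample $X$ into sets $X_A$, $X_B$, with $n_X = n_A + n_B$:

$$\delta = \bar{x}_B - \bar{x}_A$$

$$\bar{x}_X = \bar{x}_A + \delta \cdot \frac{n_B}{n_X}$$

$$M_{2,X} = M_{2,A} + M_{2,B} + \delta^2 \cdot \frac{n_A n_B}{n_X}.$$

This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.
Chan's method for estimating the mean is numerically unstable when $n_A \approx n_B$ and both are large, because the numerical error in $\bar{x}_B - \bar{x}_A$ is not scaled down in the way that it is in the $n_B = 1$ case. In such cases, prefer

$$\bar{x}_X = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_X}.$$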
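
    def parallel_variance(avg_a, count_a, var_a, avg_b, count_b, var_b):
        delta = avg_b - avg_a
        m_a = var_a * (count_a - 1)  # recover M2 of set A from its sample variance
        m_b = var_b * (count_b - 1)  # recover M2 of set B from its sample variance
        M2 = m_a + m_b + delta ** 2 * count_a * count_b / (count_a + count_b)
        return M2 / (count_a + count_b - 1)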
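To show how such a merge might be used across several chunks, here is a minimal sketch that carries (count, mean, M2) triples rather than variances and uses the weighted form of the combined mean. The names `summarize`, `merge`, and `chunked_variance` are hypothetical, and the sequential `reduce` stands in for whatever parallel reduction a real system would use.

```python
from functools import reduce


def summarize(chunk):
    """Return (count, mean, M2) for one chunk using the online update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Combine two (count, mean, M2) summaries of disjoint chunks."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = (n_a * mean_a + n_b * mean_b) / n  # weighted form of the combined mean
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def chunked_variance(chunks):
    """Sample variance of all values across an iterable of chunks."""
    n, _, m2 = reduce(merge, map(summarize, chunks), (0, 0.0, 0.0))
    return m2 / (n - 1) if n > 1 else float('nan')
```

Carrying M2 directly avoids converting to and from variances at every merge, and it also handles chunks with a single element, for which the sample variance is undefined.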

Example
Assume that all floating-point operations use standard IEEE 754 double-precision arithmetic. Consider the sample (4, 7, 13, 16) from an infinite population. Based on this sample, the estimated population mean is 10, and the unbiased estimate of population variance is 30. Both the "Naïve" algorithm and the "Two-pass" algorithm compute these values correctly.
Next consider the sample ($10^8 + 4$, $10^8 + 7$, $10^8 + 13$, $10^8 + 16$), which gives rise to the same estimated variance as the first sample. The "Two-pass" algorithm computes this variance estimate correctly, but the "Naïve" algorithm returns 29.333333333333332 instead of 30. While this loss of precision may be tolerable and viewed as a minor flaw of the "Naïve" algorithm, it is easy to find data that reveal a major flaw in the naïve algorithm: take the sample to be ($10^9 + 4$, $10^9 + 7$, $10^9 + 13$, $10^9 + 16$). Again the estimated population variance of 30 is computed correctly by the "Two-pass" algorithm, but the "Naïve" algorithm now computes it as −170.66666666666666. This is a serious problem with the "Naïve" algorithm and is due to catastrophic cancellation in the subtraction of two similar numbers at the final stage of the algorithm.
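The comparison above can be tried with a short script along the following lines. The function names are assumptions of this sketch. The sums are accumulated with explicit loops rather than the built-in `sum`, because recent Python versions may apply compensated summation to floats in `sum`, which would change the naïve result. Exact printed values depend on the platform using IEEE 754 double precision.

```python
def naive_variance_demo(data):
    """Naive running-sums formula, written out with plain float accumulation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - (total * total) / n) / (n - 1)


def two_pass_variance_demo(data):
    """Two-pass formula: mean first, then squared deviations."""
    n = 0
    total = 0.0
    for x in data:
        n += 1
        total += x
    mean = total / n
    sq_dev = 0.0
    for x in data:
        sq_dev += (x - mean) * (x - mean)
    return sq_dev / (n - 1)


# Same four values at three different offsets, stored as floats.
for offset in (0.0, 1e8, 1e9):
    sample = [offset + v for v in (4.0, 7.0, 13.0, 16.0)]
    print(offset, naive_variance_demo(sample), two_pass_variance_demo(sample))
```

The samples are built as floats on purpose: with Python integers the arithmetic would be exact and the cancellation would not appear.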

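Higher-order statistics

Terriberry[11] extends Chan's formulae to calculating the third and fourth central moments, needed for example when estimating skewness and kurtosis:

$$M_{3,X} = M_{3,A} + M_{3,B} + \delta^3\,\frac{n_A n_B (n_A - n_B)}{n_X^2} + 3\delta\,\frac{n_A M_{2,B} - n_B M_{2,A}}{n_X}$$

$$M_{4,X} = M_{4,A} + M_{4,B} + \delta^4\,\frac{n_A n_B \left(n_A^2 - n_A n_B + n_B^2\right)}{n_X^3} + 6\delta^2\,\frac{n_A^2 M_{2,B} + n_B^2 M_{2,A}}{n_X^2} + 4\delta\,\frac{n_A M_{3,B} - n_B M_{3,A}}{n_X}$$

Here the $M_k$ are again the sums of powers of differences from the mean, $\sum (x - \bar{x})^k$, giving

skewness: $g_1 = \dfrac{\sqrt{n}\, M_3}{M_2^{3/2}}$
kurtosis: $g_2 = \dfrac{n\, M_4}{M_2^2} - 3$

For the incremental case (i.e., $B = \{x\}$), this simplifies to:

$$\delta = x - m$$

$$m' = m + \frac{\delta}{n}$$

$$M_2' = M_2 + \delta^2\,\frac{n-1}{n}$$

$$M_3' = M_3 + \delta^3\,\frac{(n-1)(n-2)}{n^2} - \frac{3\delta M_2}{n}$$

$$M_4' = M_4 + \delta^4\,\frac{(n-1)(n^2 - 3n + 3)}{n^3} + \frac{6\delta^2 M_2}{n^2} - \frac{4\delta M_3}{n}$$

By preserving the value $\delta / n$, only one division operation is needed and the higher-order statistics can thus be calculated for little incremental cost.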

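An example of the online algorithm for kurtosis implemented as described is:

    def online_kurtosis(data):
        n = 0
        mean = 0
        M2 = 0
        M3 = 0
        M4 = 0

        for x in data:
            n1 = n  # count before this value is added
            n = n + 1
            delta = x - mean
            delta_n = delta / n
            delta_n2 = delta_n * delta_n
            term1 = delta * delta_n * n1
            mean = mean + delta_n
            # M4 and M3 are updated first because they use the old M3 and M2.
            M4 = M4 + term1 * delta_n2 * (n*n - 3*n + 3) + 6 * delta_n2 * M2 - 4 * delta_n * M3
            M3 = M3 + term1 * delta_n * (n - 2) - 3 * delta_n * M2
            M2 = M2 + term1

        kurtosis = (n * M4) / (M2 * M2) - 3
        return kurtosis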
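By analogy, a skewness variant needs only the updates for the mean, $M_2$, and $M_3$. The sketch below follows the same update order as `online_kurtosis`; the function name `online_skewness` and the NaN return for degenerate inputs are assumptions of this sketch.

```python
import math


def online_skewness(data):
    """One-pass skewness g1 = sqrt(n) * M3 / M2**1.5 (illustrative sketch)."""
    n = 0
    mean = 0.0
    M2 = 0.0
    M3 = 0.0
    for x in data:
        n1 = n  # count before this value is added
        n += 1
        delta = x - mean
        delta_n = delta / n
        term1 = delta * delta_n * n1
        mean += delta_n
        # M3 must be updated before M2 because it uses the old M2.
        M3 += term1 * delta_n * (n - 2) - 3 * delta_n * M2
        M2 += term1
    if n < 2 or M2 == 0:
        return float('nan')  # assumption: undefined without any spread
    return math.sqrt(n) * M3 / M2 ** 1.5
```

For symmetric data the result should be close to zero, which makes a convenient sanity check.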

Pébay[12] further extends these results to arbitrary-order central moments, for the incremental and the pairwise cases. One can also find there similar formulas for covariance.
Choi and Sweetman[13] offer two alternative methods to compute the skewness and kurtosis, each of which can save substantial computer memory requirements and CPU time in certain applications. The first approach is to compute the statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting histogram, which effectively becomes a one-pass algorithm for higher moments. One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware. A relative histogram of a random variable can be constructed in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin is counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:

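
$$H(x_k) = \frac{h(x_k)}{A}$$

where $h(x_k)$ and $H(x_k)$ represent the frequency and the relative frequency at bin $x_k$ and $A = \sum_{k=1}^{K} h(x_k)\,\Delta x_k$ is the total area of the histogram. After this normalization, the $n$ raw moments and central moments of $x(t)$ can be calculated from the relative histogram:

$$m_n^{(h)} = \sum_{k=1}^{K} x_k^n\, H(x_k)\,\Delta x_k = \frac{1}{A} \sum_{k=1}^{K} x_k^n\, h(x_k)\,\Delta x_k$$

$$\theta_n^{(h)} = \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n H(x_k)\,\Delta x_k = \frac{1}{A} \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n h(x_k)\,\Delta x_k$$

where the superscript $(h)$ indicates the moments are calculated from the histogram. For constant bin width $\Delta x_k = \Delta x$ these two expressions can be simplified using $I = A / \Delta x$:

$$m_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} x_k^n\, h(x_k)$$

$$\theta_n^{(h)} = \frac{1}{I} \sum_{k=1}^{K} \left(x_k - m_1^{(h)}\right)^n h(x_k)$$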
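For equal-width bins, each histogram moment reduces to a count-weighted average over the bin centers. A minimal sketch under that assumption follows; the function name `histogram_moments` and its parameters (`centers`, `counts`, `max_order`) are hypothetical, and the histogram itself is assumed to have been built elsewhere.

```python
def histogram_moments(centers, counts, max_order=4):
    """Raw and central moments from bin centers and counts (equal-width bins).

    Returns two lists indexed by order - 1: raw moments and central moments.
    """
    total = sum(counts)  # total number of observations across all bins
    if total == 0:
        raise ValueError("histogram is empty")
    raw = [
        sum(h * x ** n for x, h in zip(centers, counts)) / total
        for n in range(1, max_order + 1)
    ]
    mean = raw[0]
    central = [
        sum(h * (x - mean) ** n for x, h in zip(centers, counts)) / total
        for n in range(1, max_order + 1)
    ]
    return raw, central
```

The accuracy of the result is limited by the bin width, since every value in a bin is treated as if it sat at the bin center.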

The second approach from Choi and Sweetman[13] is an analytical methodology to combine statistical moments from individual segments of a time-history such that the resulting overall moments are those of the complete time-history. This methodology could be used for parallel computation of statistical moments with subsequent combination of those moments, or for combination of statistical moments computed at sequential times.
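If $Q$ sets of statistical moments are known: $(\gamma_{0,q}, \mu_q, \sigma_q^2, \alpha_{3,q}, \alpha_{4,q})$ for $q = 1, 2, \ldots, Q$, then each $\gamma_n$ can be expressed in terms of the equivalent $n$ raw moments:

$$\gamma_{n,q} = m_{n,q}\,\gamma_{0,q} \quad \text{for } n = 1, 2, 3, 4 \text{ and } q = 1, 2, \ldots, Q$$

where $\gamma_{0,q}$ is generally taken to be the duration of the $q$-th time-history, or the number of points if $\Delta t$ is constant.

The benefit of expressing the statistical moments in terms of $\gamma$ is that the $Q$ sets can be combined by addition, and there is no upper limit on the value of $Q$.

$$\gamma_{n,c} = \sum_{q=1}^{Q} \gamma_{n,q} \quad \text{for } n = 0, 1, 2, 3, 4$$

where the subscript $c$ represents the concatenated time-history or combined $\gamma$. These combined values of $\gamma$ can then be inversely transformed into raw moments representing the complete concatenated time-history:

$$m_{n,c} = \frac{\gamma_{n,c}}{\gamma_{0,c}} \quad \text{for } n = 1, 2, 3, 4$$

Known relationships between the raw moments ($m_n$) and the central moments ($\theta_n = E[(x - \mu)^n]$) are then used to compute the central moments of the concatenated time-history. Finally, the statistical moments of the concatenated history are computed from the central moments:

$$\mu_c = m_{1,c}, \qquad \sigma_c^2 = \theta_{2,c}, \qquad \alpha_{3,c} = \frac{\theta_{3,c}}{\sigma_c^3}, \qquad \alpha_{4,c} = \frac{\theta_{4,c}}{\sigma_c^4} - 3$$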
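The combination by addition described above might be sketched as follows, assuming each segment supplies a weight (its duration or point count) and its first four raw moments. The function name `combine_segment_moments`, the tuple layout, and the use of the excess-kurtosis convention (subtracting 3, as in `online_kurtosis`) are assumptions of this sketch.

```python
def combine_segment_moments(segments):
    """Combine per-segment raw moments into overall statistics.

    segments: iterable of (weight, m1, m2, m3, m4), where m1..m4 are the raw
    moments of one segment and weight is its duration or number of points.
    Returns (mean, variance, skewness, excess_kurtosis) for the whole record.
    """
    gamma = [0.0] * 5
    for weight, *raw in segments:
        gamma[0] += weight
        for order, m in enumerate(raw, start=1):
            gamma[order] += m * weight  # weighted raw moments simply add up

    # Transform the summed quantities back into raw moments of the whole record.
    m1, m2, m3, m4 = (g / gamma[0] for g in gamma[1:])

    # Standard raw-to-central moment relationships.
    theta2 = m2 - m1 ** 2
    theta3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    theta4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4

    skewness = theta3 / theta2 ** 1.5
    excess_kurtosis = theta4 / theta2 ** 2 - 3
    return m1, theta2, skewness, excess_kurtosis
```

Since the final step converts raw moments to central moments by subtraction, it can suffer the same cancellation as the naïve variance formula when the mean is large relative to the spread. The sketch also does not guard against zero variance.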

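Covariance
Very similar algorithms can be used to compute the covariance. The naïve algorithm is:

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right) / n}{n}$$

For the algorithm above, one could use the following Python code:

    def naive_covariance(data1, data2):
        n = len(data1)
        sum12 = 0
        sum1 = sum(data1)
        sum2 = sum(data2)

        for i in range(n):
            sum12 += data1[i] * data2[i]

        covariance = (sum12 - sum1 * sum2 / n) / n
        return covariance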

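As for the variance, the covariance of two random variables is also shift-invariant, so given any two constant values $k_x$ and $k_y$ it can be written:

$$\operatorname{Cov}(X, Y) = \operatorname{Cov}(X - k_x, Y - k_y) = \frac{\sum_{i=1}^{n} (x_i - k_x)(y_i - k_y) - \left(\sum_{i=1}^{n} (x_i - k_x)\right)\left(\sum_{i=1}^{n} (y_i - k_y)\right) / n}{n}$$

and again choosing a value inside the range of values will stabilize the formula against catastrophic cancellation as well as make it more robust against big sums. Taking the first value of each data set, the algorithm can be written as:

    def shifted_data_covariance(dataX, dataY):
        n = len(dataX)
        if n < 2:
            return 0
        Kx = dataX[0]  # shift each series by its first value
        Ky = dataY[0]
        Ex = 0
        Ey = 0
        Exy = 0
        for i in range(n):
            Ex += dataX[i] - Kx
            Ey += dataY[i] - Ky
            Exy += (dataX[i] - Kx) * (dataY[i] - Ky)
        return (Exy - Ex * Ey / n) / n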

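The two-pass algorithm first computes the sample means, and then the covariance:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$

The two-pass algorithm may be written as:

    def two_pass_covariance(data1, data2):
        n = len(data1)

        mean1 = sum(data1) / n
        mean2 = sum(data2) / n

        covariance = 0

        for i in range(n):
            a = data1[i] - mean1
            b = data2[i] - mean2
            covariance += a * b / n

        return covariance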

A slightly more accurate compensated version performs the full naïve algorithm on the residuals. The final sums of the residuals, $\sum_i (x_i - \bar{x})$ and $\sum_i (y_i - \bar{y})$, should be zero, but the second pass compensates for any small error.
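The compensated version described above could look roughly like this: the naïve covariance formula applied to residuals from the computed means. The function name `compensated_covariance` is hypothetical, and the sketch divides by n to match the other covariance routines in this section.

```python
def compensated_covariance(data1, data2):
    """Naive covariance formula applied to residuals (illustrative sketch)."""
    n = len(data1)
    mean1 = sum(data1) / n
    mean2 = sum(data2) / n

    sum12 = 0.0  # sum of products of residuals
    sum1 = 0.0   # sum of residuals of data1; zero in exact arithmetic
    sum2 = 0.0   # sum of residuals of data2; zero in exact arithmetic
    for x, y in zip(data1, data2):
        a = x - mean1
        b = y - mean2
        sum12 += a * b
        sum1 += a
        sum2 += b

    # The correction term absorbs any small error in the computed means.
    return (sum12 - sum1 * sum2 / n) / n
```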
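A slight modification of the online algorithm for computing the variance yields an online algorithm for the covariance:

    def online_covariance(data1, data2):
        mean1 = mean2 = 0.
        M12 = 0.  # running population covariance of the values seen so far
        n = len(data1)
        for i in range(n):
            delta1 = (data1[i] - mean1) / (i + 1)
            mean1 += delta1
            delta2 = (data2[i] - mean2) / (i + 1)
            mean2 += delta2
            M12 += i * delta1 * delta2 - M12 / (i + 1)
        # rescale by n / (n - 1) to return the sample covariance
        return n / (n - 1.) * M12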

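A stable one-pass algorithm exists, similar to the one above, that computes the co-moment $C_n = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)$:

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}$$

$$\bar{y}_n = \bar{y}_{n-1} + \frac{y_n - \bar{y}_{n-1}}{n}$$

$$C_n = C_{n-1} + (x_n - \bar{x}_n)(y_n - \bar{y}_{n-1})$$

The apparent asymmetry in that last equation is due to the fact that $(x_n - \bar{x}_n) = \frac{n-1}{n}(x_n - \bar{x}_{n-1})$, so both update terms are equal to $\frac{n-1}{n}(x_n - \bar{x}_{n-1})(y_n - \bar{y}_{n-1})$. Even greater accuracy can be achieved by first computing the means, then using the stable one-pass algorithm on the residuals.
Thus we can compute the covariance as

$$\operatorname{Cov}_N(X, Y) = \frac{C_N}{N} = \frac{\operatorname{Cov}_{N-1}(X, Y)\cdot(N-1) + (x_N - \bar{x}_N)(y_N - \bar{y}_{N-1})}{N}$$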
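A minimal sketch of this one-pass co-moment update follows, using the new mean for x and the old mean for y in the update term. The function name `online_comoment_covariance` and the NaN return for empty input are assumptions of this sketch; it returns the population covariance $C_n / n$.

```python
def online_comoment_covariance(pairs):
    """One-pass population covariance via a running co-moment C."""
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    C = 0.0
    for x, y in pairs:
        n += 1
        dy = y - mean_y              # deviation of y from its old mean
        mean_x += (x - mean_x) / n   # update the mean of x first
        mean_y += dy / n
        C += (x - mean_x) * dy       # new-mean x deviation times old-mean y deviation
    if n == 0:
        return float('nan')
    return C / n
```

Dividing `C` by `n - 1` instead would give the sample covariance.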

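Likewise, there is a formula for combining the covariances of two sets that can be used to parallelize the computation:

$$C_X = C_A + C_B + (\bar{x}_A - \bar{x}_B)(\bar{y}_A - \bar{y}_B) \cdot \frac{n_A n_B}{n_X}.$$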
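The combination step for co-moments could be sketched like this, with each set summarized as a (count, mean of x, mean of y, co-moment) tuple. The function name `merge_comoments` and the tuple layout are assumptions of this sketch, and both sets are assumed non-empty.

```python
def merge_comoments(a, b):
    """Combine (count, mean_x, mean_y, C) summaries of two disjoint sets."""
    n_a, mx_a, my_a, c_a = a
    n_b, mx_b, my_b, c_b = b
    n = n_a + n_b  # assumed to be positive
    dx = mx_b - mx_a
    dy = my_b - my_a
    c = c_a + c_b + dx * dy * n_a * n_b / n
    mean_x = (n_a * mx_a + n_b * mx_b) / n
    mean_y = (n_a * my_a + n_b * my_b) / n
    return n, mean_x, mean_y, c
```

The population covariance of the combined set is then the returned co-moment divided by the returned count.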

See also
Algebraic formula for the variance
Kahan summation algorithm
Squared deviations from the mean

References
1. Bo Einarsson (1 August 2005). Accuracy and Reliability in Scientific Computing. SIAM. p. 47. ISBN 978-0-89871-584-2. Retrieved 17 February 2013.
2. T.F. Chan, G.H. Golub and R.J. LeVeque (1983). "Algorithms for computing the sample variance: Analysis and recommendations" (PDF). The American Statistician 37: 242–247.
3. Higham, Nicholas (2002). Accuracy and Stability of Numerical Algorithms (2 ed) (Problem 1.10). SIAM.
4. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1979), "Updating Formulae and a Pairwise Algorithm for Computing Sample Variances." (PDF), Technical Report STAN-CS-79-773, Department of Computer Science, Stanford University.
5. B. P. Welford (1962). "Note on a method for calculating corrected sums of squares and products" (http://www.jstor.org/stable/1266577). Technometrics 4(3): 419–420.
6. Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn., p. 232. Boston: Addison-Wesley.
7. Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1983). Algorithms for Computing the Sample Variance: Analysis and Recommendations. The American Statistician 37, 242–247. http://www.jstor.org/stable/2683386
8. Ling, Robert F. (1974). Comparison of Several Algorithms for Computing Sample Means and Variances. Journal of the American Statistical Association, Vol. 69, No. 348, 859–866. doi:10.2307/2286154 (https://dx.doi.org/10.2307%2F2286154)
9. http://www.johndcook.com/standard_deviation.html
10. D. H. D. West (1979). Communications of the ACM, 22, 9, 532–535: Updating Mean and Variance Estimates: An Improved Method
11. Terriberry, Timothy B. (2007), Computing Higher-Order Moments Online
12. Pébay, Philippe (2008), "Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments" (PDF), Technical Report SAND2008-6212, Sandia National Laboratories
13. Choi, Muenkeun; Sweetman, Bert (2010), Efficient Calculation of Statistical Moments for Structural Health Monitoring (PDF)

External links
Weisstein, Eric W., "Sample Variance Computation" (http://mathworld.wolfram.com/SampleVarianceComputation.html), MathWorld.
Retrieved from "https://en.wikipedia.org/w/index.php?title=Algorithms_for_calculating_variance&oldid=729776128"
Categories: Statistical algorithms; Statistical deviation and dispersion
This page was last modified on 14 July 2016, at 13:47.

Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.
