Lab #1 - Data Screening: Statistics - Spring 2008

The purpose of data screening is to:
(a) check if data have been entered correctly, such as out-of-range values.
(b) check for missing values, and decide how to deal with the missing values.
(c) check for outliers, and decide how to deal with outliers.
(d) check for normality, and decide how to deal with nonnormality.

1. Finding incorrectly entered data
Your first step with "Data Screening" is using "Frequencies".
1. Select Analyze > Descriptive Statistics > Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.
Output below is for only the four "system" variables in our dataset because copy/pasting the output for all
variables in our dataset would take up too much space in this document.
The "Statistics" box tells you the number of missing values for each variable. We will use this information
later when we are discussing missing values.
Each variable is then presented as a frequency table. For example, below we see the output for "system1". By
looking at the coding manual for the "Legal beliefs" survey, you can see that the available responses for
"system1" are 1 through 11. By looking at the output below, you can see that there is a number out of range:
"13". (NOTE: in your dataset there will not be a "13" because I gave you the screened dataset, so I have
included the "13" in this example to show you what it looks like when a number is out of range.) Since 13 is
an invalid number, you then need to identify why "13" was entered. For example, did the person entering data
make a mistake? Or, did the subject respond with a "13" even though the question indicated that only numbers
1 through 11 are valid? You can identify the source of the error by looking at the hard copies of the data. For
example, first identify which subject indicated the "13" by clicking on the variable name to highlight it
(system1), and then using the "find" function by: Edit > Find, and then scrolling to the left to identify the
subject number. Then, hunt down the hard copy of the data for that subject number.
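Outside of SPSS, the same screening idea can be sketched in a few lines of Python. This is only an illustration and not part of the lab: the response list, the name `screen`, and the use of `None` for a blank answer are all made up for the example. It builds a frequency table, counts the missing values, and lists the case number of any value outside the valid 1 through 11 range.

```python
from collections import Counter

# Hypothetical responses for "system1"; None marks a missing value.
# The 13 mimics the out-of-range entry described above.
system1 = [4, 7, 13, 1, None, 11, 7, 5, None, 9]
VALID = range(1, 12)  # valid responses are 1 through 11

def screen(values, valid):
    """Return (frequency table, missing count, out-of-range cases)."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    n_missing = len(values) - len(present)
    # Case numbers are 1-based row positions in the list.
    bad = [(case, v) for case, v in enumerate(values, start=1)
           if v is not None and v not in valid]
    return freq, n_missing, bad

freq, n_missing, bad = screen(system1, VALID)
for value in sorted(freq):
    print(value, freq[value])
print("missing:", n_missing)
print("out of range (case, value):", bad)
```

The case number returned for the out-of-range value plays the same role as the subject number you would look up with Edit > Find.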
2. Missing Values
Below, I describe in depth how to identify and deal with missing values.
Why do missing values occur? Missing values are either random or nonrandom. Random missing values may
occur because the subject inadvertently did not answer some questions. For example, the study may be overly
complex and/or long, or the subject may be tired and/or not paying attention, and miss the question. Random
missing values may also occur through data entry mistakes. Nonrandom missing values may occur because
the subject purposefully did not answer some questions. For example, the question may be confusing, so many
subjects do not answer the question. Also, the question may not provide appropriate answer choices, such as
"no opinion" or "not applicable", so the subject chooses not to answer the question. Also, subjects may be
reluctant to answer some questions because of social desirability concerns about the content of the question,
such as questions about sensitive topics like past crimes, sexual history, prejudice or bias toward certain
groups, etc.
Why is missing data a problem? Missing values mean reduced sample size and loss of data. You conduct
research to measure empirical reality, so missing values thwart the purpose of research. Missing values may
also indicate bias in the data. If the missing values are nonrandom, then the study is not accurately measuring
the intended constructs. The results of your study may have been different if the missing data were not missing.
How do I identify missing values?
1. Select Analyze > Descriptive Statistics > Frequencies
2. Move all variables into the "Variable(s)" window.
3. Click OK.
Output below is for only the four "system" variables in our dataset because copy/pasting the output for all
variables in our dataset would take up too much space in this document.
The "Statistics" box tells you the number of missing values for each variable.
How do I identify if missing values are random or nonrandom? First, if there are only a small number of
missing values, then they are extremely unlikely to be nonrandom. For example, "system3" has only 2 missing
values, and 2 people out of 327 total subjects are too few to form a "nonrandom" pattern. Second, even if there
are a larger number of missing values, that does not necessarily mean the missing values are nonrandom. You
should look to the question itself to identify if it is poorly constructed or engenders social desirability
concerns. Third, some questions will always have a large number of missing values because of the way the
question is designed. For example, the "threshold3" questions in our dataset ask the subjects to "mark all
answers that apply", so there will be a lot of missing data because some options are chosen less frequently
than others. Fourth, SPSS has an add-on module called "Missing Values Analysis" that will statistically test
whether missing values are random or nonrandom. The add-on is included in your copy of SPSS, but most people
do not have the add-on module. It is not even offered on the versions of SPSS at USC. Given how rarely
nonrandom missing values occur in datasets, I know no one who conducts this analysis. However, if you do want
to conduct missing values analysis using the SPSS add-on, then you can access it by Analyze > Missing Value
Analysis, and check EM estimation. EM estimation checks if the subjects with missing values are different from
the subjects without missing values. If p < .05, then the two groups are significantly different from each
other, which indicates the missing values are nonrandom. In other words, you want the value to be greater than
.05, which indicates the missing values are random.
How do I deal with missing values? Irrespective of whether the missing values are random or nonrandom,
you have three options when dealing with missing values.
Option 1 is to do nothing. Leave the data as is, with the missing values in place. This is the most frequent
approach, for a few reasons. First, the number of missing values is typically small. Second, missing values
are typically random. Third, even if there are a few missing values on individual items, you typically create
composites of the items by averaging them together into one new variable, and this composite variable will not
have missing values because it is an average of the existing data. However, if you choose this option, you
must keep in mind how SPSS will treat the missing values. SPSS will use either "listwise deletion" or
"pairwise deletion" of the missing values. You can elect either one when conducting each test in SPSS.
a. Listwise deletion: SPSS will not include cases (subjects) that have missing values on the variable(s)
under analysis. If you are only analyzing one variable, then listwise deletion is simply analyzing the
existing data. If you are analyzing multiple variables, then listwise deletion removes cases (subjects) if
there is a missing value on any of the variables. The disadvantage is a loss of data because you are
removing all data from subjects who may have answered some of the questions, but not others (e.g.,
the missing data).
b. Pairwise deletion: SPSS will include all available data. Unlike listwise deletion, which removes cases
(subjects) that have missing values on any of the variables under analysis, pairwise deletion only
removes the specific missing values from the analysis (not the entire case). In other words, all available
data is included. For example, if you are conducting a correlation on multiple variables, then SPSS will
conduct the bivariate correlation between all available data points, and ignore only those missing values
if they exist on some variables. In this case, pairwise deletion will result in different sample sizes for
each correlation. Pairwise deletion is useful when the sample size is small or the number of missing values
is large because there are not many values to begin with, so why omit even more with listwise deletion?
c. In order to better understand how listwise deletion versus pairwise deletion influences your results, try
conducting the same test using both deletion methods. Does the outcome change?
d. IMPORTANT: for each type of test you conduct, you need to identify if SPSS is using listwise or
pairwise deletion. I would recommend electing pairwise deletion, if possible. For example, we have
been using the Explore command. If you are analyzing more than one variable in the Explore
command, be sure to click "Options" and "Exclude cases pairwise" because the default option is
listwise deletion. Most tests allow you to elect your preference, but GLM Multivariate only allows
listwise. So, always check your output for the number of cases used in each analysis.
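To see how the two deletion methods change the number of cases behind each correlation, here is a minimal Python sketch. It is an illustration only, not how SPSS is implemented: the three variables `q1`, `q2`, `q3`, their scores, and the helper names are hypothetical, and `None` stands for a missing value.

```python
# Hypothetical scores for three variables; None marks a missing value.
data = {
    "q1": [3, 5, None, 4, 2, 5],
    "q2": [2, None, 4, 4, 1, 5],
    "q3": [3, 4, 5, None, 2, 4],
}

def listwise_cases(data):
    """Row indexes with no missing value on any variable."""
    n_rows = len(next(iter(data.values())))
    return [i for i in range(n_rows)
            if all(col[i] is not None for col in data.values())]

def pairwise_cases(data, a, b):
    """Row indexes with no missing value on the two variables compared."""
    return [i for i in range(len(data[a]))
            if data[a][i] is not None and data[b][i] is not None]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlate(data, a, b, rows):
    """Correlate variables a and b using only the given rows."""
    return pearson([data[a][i] for i in rows], [data[b][i] for i in rows])

complete = listwise_cases(data)
for a, b in [("q1", "q2"), ("q1", "q3"), ("q2", "q3")]:
    pair = pairwise_cases(data, a, b)
    print(a, b,
          "| listwise n =", len(complete), "r =", round(correlate(data, a, b, complete), 3),
          "| pairwise n =", len(pair), "r =", round(correlate(data, a, b, pair), 3))
```

With this made-up data, listwise deletion keeps the same three complete cases for every correlation, while pairwise deletion keeps four cases for each pair, and a different four each time.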
Option 2 is to delete cases with missing values. For example, for every missing value in the dataset, you can
delete the subjects with the missing values. Thus, you are left with complete data for all subjects. The
disadvantage to this approach is that you reduce the sample size of your data. If you have a large dataset,
then it may not be a big disadvantage because you have enough subjects even after you delete the cases with
missing values. Another disadvantage to this approach is that the subjects with missing values may be different
from the subjects without missing values (e.g., missing values that are nonrandom), so you have a
non-representative sample after removing the cases with missing values. One situation in which I use Option 2
is when particular subjects have not answered an entire scale or page of the study.
Option 3 is to replace the missing values, called imputation. There is little agreement about whether or not to
conduct imputation. There is some agreement, however, about which type of imputation to conduct. For example,
you typically do NOT conduct Mean substitution or Regression substitution. Mean substitution is replacing the
missing value with the mean of the variable. Regression substitution uses regression analysis to replace the
missing value. Regression analysis is designed to predict one variable based upon another variable, so it can be
used to predict the missing value based upon the subject's answer to another variable. Both Mean substitution
and Regression substitution can be found using: Transform > Replace Missing Values. The favored type of
imputation is replacing the missing values using different estimation methods. The "Missing Values Analysis"
add-on contains the estimation methods, but versions of SPSS without the add-on module do not. The
estimation methods can be found by using: Analyze > Missing Value Analysis.
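One way to see why Mean substitution is usually avoided is to try it on a small made-up variable. The sketch below is an illustration only (the scores and the name `mean_substitute` are hypothetical, and `None` stands for a missing value). Every replaced value sits exactly at the mean, so the filled-in variable has less spread than the answers that were actually observed.

```python
from statistics import mean, stdev

# Hypothetical item scores; None marks a missing value.
scores = [2, 4, None, 6, 8, None, 10]

def mean_substitute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)
print("filled:", filled)
print("SD of observed values:", round(stdev(observed), 3))
print("SD after mean substitution:", round(stdev(filled), 3))
```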
3. Outliers - Univariate
What are outliers? Outliers are extreme values as compared to the rest of the data. The determination of values
as "outliers" is subjective. While there are a few benchmarks for determining whether a value is an "outlier",
those benchmarks are arbitrarily chosen, similar to how "p < .05" is also arbitrarily chosen.
Should I check for outliers? Outliers can render your data nonnormal. Since normality is one of the
assumptions for many of the statistical tests you will conduct, finding and eliminating the influence of outliers
may render your data normal, and thus render your data appropriate for analysis using those statistical tests.
However, I know no one who checks for outliers. For example, just because a value is extreme compared to
the rest of the data does not necessarily mean it is somehow an anomaly, or invalid, or should be removed. The
subject chose to respond with that value, so removing that value is arbitrarily throwing away data simply
because it does not fit this "assumption" that data should be "normal". Conducting research is about
discovering empirical reality. If the subject chose to respond with that value, then that data is a reflection of
reality, so removing the "outlier" is the antithesis of why you conduct research.
There is another (less theoretical, and more practical) reason why I know no one who conducts outlier
analysis. Outliers are usually found in many (many!) of the variables in every study. If you are going to check
for outliers, then you have to check for outliers in all your variables (e.g., could be 100+ in some surveys), and
also check for outliers in the bivariate and multivariate relationships between your variables (e.g., 1000+ in
some surveys). Given the large number of outlier analyses you have to conduct in every study, you will
invariably find outliers in EVERY STUDY. If you find and eliminate outliers in one of your published studies,
then from an ethical and equity point of view, you should conduct the same outlier analysis in every study you
analyze for the rest of your career. Many researchers do not want to undertake outlier analysis in every one of
their studies because it's cumbersome and sometimes overwhelming. Plus, if outliers are valid data, then why
conduct outlier analysis at all?
There is one more (less theoretical, and more practical) reason why I know no one who conducts outlier
analysis. It is common practice to use multiple questions to measure constructs because it increases the power
of your statistical analysis. You typically create a "composite" score (average of all the questions) when
analyzing your data. For example, in a study about happiness, you may use an established happiness scale, or
create your own happiness questions that measure all the facets of the happiness construct. When analyzing
your data, you average together all the happiness questions into 1 happiness composite measure. While there
may be some outliers in each individual question, averaging the items together reduces the probability of
outliers due to the increased amount of data composited into the variable.
There is one last (less theoretical, and more practical) reason why I know no one who conducts outlier
analysis. If you decide to reduce the influence of the outlier, as described in the next section, you then rerun
the outlier analysis again after reducing the influence of the known outliers to determine if the outlier is
eliminated. Sometimes new outliers emerge because they were masked by the old outliers and/or the data is
now different after removing the old outlier so existing extreme data points may now qualify as outliers. Once
those outliers are removed, you rerun the outlier analysis again, and sometimes new outliers emerge again. It
can become a cumbersome and sometimes overwhelming process that has no end in sight. Plus, at what point,
if any, should you draw the line and stop removing the newly emerging outliers?
There are two categories of outliers: univariate and multivariate outliers.
a. Univariate outliers are extreme values on a single variable. For example, if you have 10 survey
questions in your study, then you would conduct 10 separate univariate outlier analyses, one for each
variable. Also, when you average the 10 questions together into a new composite variable, you can
conduct univariate outlier analysis on the new variable. Another way you would conduct univariate
analysis is by looking at individual variables within different groups. For example, you would conduct
univariate analysis on those same 10 survey questions within each gender (males and females), or
within political groups (republican, democrat, other), etc. Or, if you are conducting an experiment with
more than one condition, such as manipulating happiness and sadness in your study, then you would
conduct univariate analysis on those same 10 survey questions within both groups.
b. The second category of outliers is multivariate outliers. Multivariate outliers are extreme combinations
of scores on two or more variables. For example, if you are looking at the relationship between height
and weight, then there may be a joint value that is extreme compared to the rest of the data, such as
someone with extremely low height but high weight, or high height but low weight, and so forth. You
first look for univariate outliers, then proceed to look for multivariate outliers.
Univariate outliers:
1. Select Analyze > Descriptive Statistics > Explore
2. Move the variables into the "Dependent List".
3. Click "Statistics", and click "Outliers"
4. Click "Plots", and unclick "Stem-and-leaf"
5. Click OK.
Output on next page is for "system1"
"Descriptives" box tells you descriptive statistics about the variable, including the value of Skewness and
Kurtosis, with accompanying standard error for each. This information will be useful later when we talk about
"normality". The "5% Trimmed Mean" indicates the mean value after removing the top and bottom 5% of
scores. By comparing this "5% Trimmed Mean" to the "mean", you can identify if extreme scores (such as
outliers that would be removed when trimming the top and bottom 5%) are having an influence on the
variable.
"Extreme Values" and the Boxplot relate to each other. The boxplot is a graphical display of the data that
shows: (1) median, which is the middle black line, (2) middle 50% of scores, which is the shaded region, (3)
top and bottom 25% of scores, which are the lines extending out of the shaded region, (4) the smallest and
largest (non-outlier) scores, which are the horizontal lines at the top/bottom of the boxplot, and (5) outliers.
The boxplot shows both "mild" outliers and "extreme" outliers. Mild outliers are any score more than 1.5 × IQR
beyond the edge of the shaded region, and are indicated by open dots. IQR stands for "Interquartile range", and
is the range spanned by the middle 50% of the scores. Extreme outliers are any score more than 3 × IQR beyond
the edge of the shaded region, and are indicated by stars. However, keep in mind that these benchmarks are
arbitrarily chosen, similar to how p < .05 is arbitrarily chosen. For "system1", there is an open dot. Notice
that the dot says "42", but, by looking at the "Extreme Values" box, there are actually FOUR lowest scores of
"1", one of which is case 42. Since all four scores of "1" overlap each other, the boxplot can only display one
case. In summary, this output tells us there are four outliers, each with a value of "1".
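The 1.5 × IQR and 3 × IQR benchmarks can be sketched in Python as follows. Treat this as a rough illustration under stated assumptions: the scores are made up (they are not the lab dataset), the function name is hypothetical, and the quartiles come from Python's `statistics.quantiles` with the "inclusive" method, which may not match the quartile rule SPSS uses for its boxplots.

```python
from statistics import quantiles

# Hypothetical responses on a 1-11 scale.
scores = [1, 1, 1, 1, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11]

def classify_outliers(values):
    """Split values into mild and extreme outliers using IQR benchmarks."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance beyond the edge of the box (0 if the value is inside it).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    return q1, q3, mild, extreme

q1, q3, mild, extreme = classify_outliers(scores)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

In this made-up example the four scores of 1 fall more than 1.5 × IQR below the box but less than 3 × IQR, so they count as mild outliers, which a boxplot would draw as a single overlapping open dot.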
4. Outliers - Within Groups
Another way to look for univariate outliers is to do outlier analysis within different groups in your study. For
example, imagine a study that manipulated the presence or absence of a weapon during a crime, and the
Dependent Variable was measuring the level of emotional reaction to the crime. In addition to looking for
univariate outliers for your DV, you may want to also look for univariate outliers within each condition.
In our dataset about "Legal Beliefs", let's treat gender as the grouping variable.
1. Select Analyze > Descriptive Statistics > Explore
2. Move the variables into the "Dependent List", and move "sex" into the "Factor List"
3. Click "Statistics", and click "Outliers"
4. Click "Plots", and unclick "Stem-and-leaf"
5. Click OK.
Output below is for "system1"
"Descriptives" box tells you descriptive statistics about the variable. Notice that information for "males" and
"females" is displayed separately.
"Extreme Values" and the Boxplot relate to each other. Notice the difference between males and females.
5. Outliers - Multivariate
Multivariate outliers are traditionally analyzed when conducting correlation and regression analysis. The
multivariate outlier analysis is somewhat complex, so I will discuss how to identify multivariate outliers when
we get to correlation and regression.
6. Outliers - dealing with outliers
First, we need to identify why the outlier(s) exist. It is possible the outlier is due to a data entry mistake, so
you should first conduct the test described above as "1. Finding incorrectly entered data" to ensure that any
outlier you find is not due to data entry errors. It is also possible that the subjects responded with the "outlier"
value for a reason. For example, maybe the question is poorly worded or constructed. Or, maybe the question
is adequately constructed but the subjects who responded with the outlier values are different from the subjects
who did not respond with the extreme scores. You can create a new variable that categorizes all the subjects as
either "outlier subjects" or "non-outlier subjects", and then re-examine the data to see if there is a difference
between these two types of subjects. Also, you may find the same subjects are responsible for outliers in many
questions in the survey by looking at the subject numbers for the outliers displayed in all the boxplots.
Remember, however, that just because a value is extreme compared to the rest of the data does not necessarily
mean it is somehow an anomaly, or invalid, or should be removed.
Second, if you want to reduce the influence of the outliers, you have four options.
Option 1 is to delete the value. If you have only a few outliers, you may simply delete those values, so they
become blank or missing values.
Option 2 is to delete the variable. If you feel the question was poorly constructed, or if there are too many
outliers in that variable, or if you do not need that variable, you can simply delete the variable. Also, if
transforming the value or variable (e.g., Options #3 and #4) does not eliminate the problem, you may want to
simply delete the variable.
Option 3 is to transform the value. You have a few options for transforming the value. You can change the
value to the next highest/lowest (non-outlier) number. For example, if you have a 100 point scale, and you
have two outliers (95 and 96), and the next highest (non-outlier) number is 89, then you could simply change
the 95 and 96 to 89s. Alternatively, if the two outliers were 5 and 6, and the next lowest (non-outlier) number
was 11, then the 5 and 6 would change to 11s. Another option is to change the value to the next highest/lowest
(non-outlier) number PLUS one unit increment higher/lower. For example, the 95 and 96 numbers would
change to 90s (i.e., 89 plus 1 unit higher). The 5 and 6 numbers change to 10s (i.e., 11 minus 1 unit lower).
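Option 3 can be sketched in Python using the numbers from the example above. This is an illustration only: the function name `pull_in_outliers`, the surrounding scores, and the cutoffs that decide which scores count as outliers are all hypothetical (in practice the cutoffs would come from your outlier analysis).

```python
def pull_in_outliers(values, low_cut, high_cut, step=0):
    """Replace values outside [low_cut, high_cut] with the nearest
    non-outlier value, moved `step` units further out (step=0 or 1)."""
    inside = [v for v in values if low_cut <= v <= high_cut]
    lowest, highest = min(inside), max(inside)
    result = []
    for v in values:
        if v > high_cut:
            result.append(highest + step)
        elif v < low_cut:
            result.append(lowest - step)
        else:
            result.append(v)
    return result

# Hypothetical scores on a 100-point scale; 95 and 96 are treated as outliers
# and 89 is the next highest non-outlier score.
high = [40, 55, 62, 70, 89, 95, 96]
print(pull_in_outliers(high, low_cut=0, high_cut=90))          # 95, 96 become 89
print(pull_in_outliers(high, low_cut=0, high_cut=90, step=1))  # 95, 96 become 90

# 5 and 6 are treated as low outliers; 11 is the next lowest non-outlier score.
low = [5, 6, 11, 30, 42, 58]
print(pull_in_outliers(low, low_cut=10, high_cut=100))          # 5, 6 become 11
print(pull_in_outliers(low, low_cut=10, high_cut=100, step=1))  # 5, 6 become 10
```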
Option 4 is to transform the variable. Instead of changing the individual outliers (as in Option #3), we are now
talking about transforming the entire variable. Transformation can bring a distribution closer to normal, as
described in the next section below about "Normality". Since outliers are one cause of nonnormality, see the
next section to learn how to transform variables, and thus reduce the influence of outliers.
Third, after dealing with the outlier, you rerun the outlier analysis to determine if any new outliers emerge or
if the data are outlier free. If new outliers emerge, and you want to reduce the influence of the outliers, you
choose one of the four options again. Then, rerun the outlier analysis to determine if any new outliers emerge
or if the data are outlier free, and repeat again.
7. Normality
Below, I describe five steps for determining and dealing with normality. However, the bottom line is that
almost no one checks their data for normality; instead they assume normality, and use the statistical tests that
are based upon assumptions of normality that have more power (ability to find significant results in the data).
First, what is normality? A normal distribution is a symmetric bell-shaped curve defined by two things: the
mean (average) and variance (variability).
Second, why is normality important? The central idea behind statistical inference is that as sample size
increases, the sampling distribution of a statistic (such as the mean) will approximate normal. Most statistical
tests rely upon the assumption that your data is "normal". Tests that rely upon the assumption of normality are
called parametric tests. If your data is not normal, then you would use statistical tests that do not rely upon
the assumption of normality, called nonparametric tests. Nonparametric tests are less powerful than parametric
tests, which means the nonparametric tests have less ability to detect real differences or variability in your
data. In other words, you want to conduct parametric tests because you want to increase your chances of finding
significant results.
Third, how do you determine whether data are "normal"? There are three interrelated approaches to determine
normality, and all three should be conducted.
First, look at a histogram with the normal curve superimposed. A histogram provides a useful graphical
representation of the data. SPSS can also superimpose the theoretical "normal" distribution onto the histogram
of your data so that you can compare your data to the normal curve. To obtain a histogram with the
superimposed normal curve:
1. Select Analyze > Descriptive Statistics > Frequencies.
2. Move the variables into the "Variable(s)" window.
3. Click "Charts", and click "Histogram, with normal curve".
4. Click OK.
Output below is for "system1". Notice the bell-shaped black line superimposed on the distribution. All
samples deviate somewhat from normal, so the question is how much deviation from the black line indicates
"nonnormality"? Unfortunately, graphical representations like histograms provide no hard-and-fast rules. After
you have viewed many (many!) histograms, over time you will get a sense for the normality of data. In my
view, the histogram for "system1" shows a fairly normal distribution.
Second, look at the values of Skewness and Kurtosis. Skewness involves the symmetry of the distribution.
Skewness that is normal involves a perfectly symmetric distribution. A positively skewed distribution has
scores clustered to the left, with the tail extending to the right. A negatively skewed distribution has scores
clustered to the right, with the tail extending to the left. Kurtosis involves the peakedness of the distribution.
Kurtosis that is normal involves a distribution that is bell-shaped and not too peaked or flat. Positive kurtosis
is indicated by a peak. Negative kurtosis is indicated by a flat distribution. Descriptive statistics about
skewness and kurtosis can be found by using either the Frequencies, Descriptives, or Explore commands. I
like to use the "Explore" command because it provides other useful information about normality, so
1. Select Analyze > Descriptive Statistics > Explore.
2. Move the variables into the "Dependent List".
3. Click "Plots", and unclick "Stem-and-leaf"
4. Click OK.
"Descriptives" box tells you descriptive statistics about the variable, including the value of Skewness and
Kurtosis, with accompanying standard error for each. Both Skewness and Kurtosis are 0 in a normal
distribution, so the farther away from 0, the more nonnormal the distribution. The question is "how much"
skew or kurtosis renders the data nonnormal? This is an arbitrary determination, and sometimes difficult to
interpret using the values of Skewness and Kurtosis. Luckily, there are more objective tests of normality,
described next.
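For readers who want to see where the Skewness and Kurtosis values come from, here is a minimal Python sketch. It uses the sample-adjusted formulas that are common in statistics packages (kurtosis is reported as "excess" kurtosis, so a normal curve scores 0); that these match your version of SPSS exactly is an assumption, and the two small datasets and the function name are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def skew_kurtosis(values):
    """Sample-adjusted skewness and excess kurtosis with standard errors.

    Needs at least 4 values that are not all identical.
    """
    n = len(values)
    m, s = mean(values), stdev(values)
    z3 = sum(((v - m) / s) ** 3 for v in values)
    z4 = sum(((v - m) / s) ** 4 for v in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]    # hypothetical, symmetric
right_tail = [1, 1, 1, 2, 2, 3, 4, 7, 11]  # hypothetical, tail to the right

for name, values in [("symmetric", symmetric), ("right_tail", right_tail)]:
    skew, se_skew, kurt, se_kurt = skew_kurtosis(values)
    print(f"{name}: skewness {skew:.3f} (SE {se_skew:.3f}), "
          f"kurtosis {kurt:.3f} (SE {se_kurt:.3f})")
```

The symmetric list gives a skewness of essentially 0, while the list with a long right tail gives a positive skewness, matching the description of positive skew above.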
Third, the descriptive statistics for Skewness and Kurtosis are not as informative as established tests for
normality that take into account both Skewness and Kurtosis simultaneously. The Kolmogorov-Smirnov test
(KS) and Shapiro-Wilk (SW) test are designed to test normality by comparing your data to a normal
distribution with the same mean and standard deviation as your sample:
1. Select Analyze > Descriptive Statistics > Explore.
2. Move the variables into the "Dependent List".
3. Click "Plots", and unclick "Stem-and-leaf", and click "Normality plots with tests".
4. Click OK.
"Tests of Normality" box gives the KS and SW test results. If the test is NOT significant, then the data are
normal, so any value above .05 indicates normality. If the test is significant (less than .05), then the data are
nonnormal. In this case, both tests indicate the data are nonnormal. However, one limitation of the normality
tests is that the larger the sample size, the more likely you are to get significant results. Thus, you may get
significant results with only slight deviations from normality. In this case, our sample size is large (n = 327)
so the significance of the KS and SW tests may only indicate slight deviations from normality. You need to
eyeball your data (using histograms) to determine for yourself if the data rise to the level of nonnormal.
"Normal Q-Q Plot" provides a graphical way to determine the level of normality. The black line indicates the
values your sample should adhere to if the distribution were normal. The dots are your actual data. If the dots
fall exactly on the black line, then your data are normal. If they deviate from the black line, your data are
nonnormal. In this case, you can see substantial deviation from the straight black line.
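The idea behind the Normal Q-Q Plot can be sketched in Python without any plotting. This is a rough illustration, not a reproduction of the SPSS output: the scores are made up, the function name is hypothetical, and the plotting position (i - 0.5) / n is just one common convention that may differ from the one SPSS uses. Each sorted score is paired with the score a normal curve with the same mean and standard deviation would predict at that rank; the "gap" column is how far a dot would sit from the straight line.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted score with the score a normal curve would predict."""
    ordered = sorted(values)
    n = len(ordered)
    normal = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        # Plotting position (i - 0.5) / n is one common convention.
        expected = normal.inv_cdf((i - 0.5) / n)
        points.append((expected, observed))
    return points

# Hypothetical responses on a 1-11 scale.
scores = [1, 1, 1, 1, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 10, 10, 11]
for expected, observed in qq_points(scores):
    print(f"expected {expected:6.2f}   observed {observed:3d}   "
          f"gap {observed - expected:6.2f}")
```

Large gaps at one end of the list correspond to dots that bend away from the black line in the plot.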
Fourth, if your data are nonnormal, what are your options to deal with nonnormality? You have four basic
options.
a. Option 1 is to leave your data nonnormal, and conduct the parametric tests that rely upon the
assumptions of normality. Just because your data are nonnormal does not instantly invalidate the
parametric tests. Normality (versus nonnormality) is a matter of degrees, not a strict cut-off point.
Slight deviations from normality may render the parametric tests only slightly inaccurate. The issue is
the degree to which the data are nonnormal.
b. Option 2 is to leave your data nonnormal, and conduct the nonparametric tests designed for
nonnormal data.
c. Option 3 is to conduct "robust" tests. There is a growing branch of statistics called "robust" tests that
are just as powerful as parametric tests but account for nonnormality of the data.
d. Option 4 is to transform the data. Transforming your data involves using mathematical formulas to
modify the data into normality.
Fifth, how do you transform your data into "normal" data? There are different types of transformations based
upon the type of nonnormality. For example, see handout "Figure 8.1" on the last page of this document that
shows six types of nonnormality (e.g., 3 positive skew that are moderate, substantial, and severe; 3 negative
skew that are moderate, substantial, and severe). Figure 8.1 also shows the type of transformation for each
type of nonnormality. Transforming the data involves using the "Compute" function to create a new variable
(the new variable is the old variable transformed by the mathematical formula):
1. Select Transform > Compute Variable
2. Type the name of the new variable you want to create, such as "transform_system1".
3. Select the type of transformation from the "Functions" list, and double-click.
4. Move the (nonnormal) variable name into the place of the question mark "?".
5. Click OK.
The new variable is reproduced in the last column in the "Data view".
Now, check that the variable is normal by using the tests described above.
If the variable is normal, then you can start conducting statistical analyses of that variable.
If the variable is nonnormal, then try other transformations.
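As a rough illustration of what "Compute" does with a transformation formula, here is a Python sketch of three transformations that textbooks commonly pair with increasing degrees of skew: square root, base-10 logarithm, and inverse. Figure 8.1 is not reproduced in this text, so which transformation goes with which degree of skew is an assumption here, as are the function name, the made-up scores, and the choice to reflect negatively skewed scores around the largest score plus 1 before transforming.

```python
from math import log10, sqrt

def transform(values, kind, negative_skew=False):
    """Apply a square-root, log10, or inverse transformation.

    For negative skew the scores are first reflected (largest + 1 - score)
    so that the long tail points to the right before transforming.
    """
    if negative_skew:
        k = max(values) + 1
        values = [k - v for v in values]
    if min(values) <= 0:
        raise ValueError("scores must be positive; add a constant first")
    formulas = {"sqrt": sqrt, "log10": log10, "inverse": lambda v: 1 / v}
    return [formulas[kind](v) for v in values]

# Hypothetical positively skewed scores (long tail to the right).
system1 = [1, 1, 2, 2, 3, 4, 9, 16, 100]
print([round(v, 2) for v in transform(system1, "sqrt")])
print([round(v, 2) for v in transform(system1, "log10")])
print([round(v, 2) for v in transform(system1, "inverse")])
```

Each call returns a new list, in the same way that Compute creates a new variable and leaves the original variable untouched, so you can rerun the normality checks on the transformed scores.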