
Descriptive Statistics

- Purpose of descriptive statistics
- Frequency distributions
- Measures of central tendency
- Measures of dispersion

Statistics as a Tool for LIS Research


Importance of statistics in research
- Summarize observations to provide answers to research questions and hypotheses
- Make general conclusions based on specific study observations
- Objectively evaluate reliability of study conclusions

Gary Geisler Simmons College LIS 403 Spring, 2004

Statistics as a Tool for LIS Research


Main purposes of statistics in research
- Describe the central point in a set of data/observations
- Describe how broad, diversified, or variable the data in a set are
- Indicate whether specific features of a set of data are related, and how closely they are related
- Indicate the probability of features of the data being influenced by factors other than simply chance

Statistics as a Tool for LIS Research


Two main types or branches of statistics
Descriptive statistics
- Characterizing or summarizing a data set
- Presenting data in charts and tables to clarify characteristics
- No inference, just describing a particular group of observations

Inferential statistics
- Using sample data to make generalizations (inferences) or estimates about a population
- Statements made in terms of probability

Statistics as a Tool for LIS Research


Descriptive and inferential statistics not mutually exclusive
- Overlap in what can be called descriptive and what can be called inferential
- Intent is important:
  - Group of observations intended to describe an event: descriptive
  - Group of observations collected from a sample and intended to predict what a larger population is like: inferential

Statistics as a Tool for LIS Research


Choosing statistical methods
- Type of data collected largely determines choice of statistical analysis techniques
- Decisions about how and what type of data are collected will determine the specific statistical tests that can be performed to analyze the data
- Data collected should determine statistical tests used, not the other way around
- But how you want to analyze the data should be considered as part of research design, to ensure the study can produce the type of conclusions you want to make

Descriptive Statistics
- Commonly used in LIS research
- Cannot test causal relationships
- Primary strength is describing and summarizing data:
  - Describing data in terms of frequency distributions
  - Describing most typical value in data set: measures of central tendency
  - Describing variability of data: measures of dispersion

Frequency Distributions
Describing data in terms of frequency distributions
- Counts of totals by value or category for each measured variable
- Can be presented as absolute totals, cumulative totals, percentages, grouped totals
- Often a first step in statistical analysis of data
- Usually presented in tables or charts (histogram, bar graph, etc.)

[Chart: books checked out (vertical axis, 0 to 80) by age group (0-10, 11-20, 21-40, 41-60, 61+)]
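A minimal sketch of how such a frequency distribution could be tabulated in Python. The observations are invented for illustration (they are not the data behind the slide's chart); only the age-group labels come from the slide.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical observations: the age group of each patron who checked out a book.
age_groups = ["0-10", "11-20", "11-20", "21-40", "21-40", "21-40", "41-60", "61+"]
order = ["0-10", "11-20", "21-40", "41-60", "61+"]

counts = Counter(age_groups)
absolute = [counts[group] for group in order]                    # absolute totals
cumulative = list(accumulate(absolute))                          # cumulative totals
percent = [100 * count / len(age_groups) for count in absolute]  # percentages

for group, a, c, p in zip(order, absolute, cumulative, percent):
    print(f"{group:>6}  {a:>3}  {c:>3}  {p:5.1f}%")
```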

Measures of Central Tendency


Describing most typical value in data set - measures of central tendency
The mean is often referred to as the "average", though "average" can refer to any of these measures of central tendency:
- Mean (arithmetic average)
- Median
- Mode

Measures of Central Tendency


Mean
- Most popular statistic for summarizing data
- Can be used for interval or ratio data
- Based on all observations of the data set
- Arithmetic average of a set of observations
- Example: mean of 5, 10, and 30 is 15, since 45/3 = 15
- Mean of a set of numbers can be a number not in the set
- Example: mean of 1, 2, 3, and 4 is 2.5, since 10/4 = 2.5

Sample mean: $\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$

Population mean: $\mu = \frac{\sum x}{N}$

where $\sum$ (sigma) = sum of, $x$ = set of observations, and $n$ or $N$ = number of observations (sample size or population size)
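The two mean examples can be checked with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean

first = [5, 10, 30]
second = [1, 2, 3, 4]

mean_first = mean(first)     # 45 / 3
mean_second = mean(second)   # 10 / 4, a value that is not in the set itself

print(mean_first, mean_second)
print(mean_second in second)
```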

Measures of Central Tendency


Median
- Value that is above the lower one-half and below the upper one-half of the values: the middle value of a set of observations when they have been arranged in order
- Can be used for ordinal, interval, or ratio data
- Most central measure of a distribution
- Every data set has a median that is unique
- Calculated differently for sets with an odd number of observations than for sets with an even number: the middle value for odd, the mean of the two middle values for even
- Example: median of the five observations 1, 3, 15, 16, and 17 = 15
- Example: median of the six observations 1, 2, 3, 5, 8, and 9 = 4
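A short sketch of the odd/even distinction, using the two examples above. The helper name `median_of` is illustrative; Python's `statistics.median` follows the same rule.

```python
def median_of(values):
    """Middle value of the sorted observations (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

odd_median = median_of([1, 3, 15, 16, 17])
even_median = median_of([1, 2, 3, 5, 8, 9])
print(odd_median, even_median)
```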

Measures of Central Tendency


Mode
- Can be used for any type of data
- Most frequently occurring value among a set of observations
- Examples:
  - Mode of the observations 1, 2, 2, 3, 4, 5 = 2
  - Set of observations 1, 2, 3, 4, 5 has no mode
  - Set of observations 1, 2, 3, 3, 4, 5, 5 has no single mode, but can be considered to have two modes (3 and 5), or to be bi-modal
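The three mode examples can be reproduced with a small helper. This sketch follows the slide's convention that a set in which no value repeats has no mode. Library functions such as `statistics.multimode` instead return every value in that case, so the convention here is an explicit assumption.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # slide convention: no value occurs more than once, so no mode
    return sorted(value for value, count in counts.items() if count == top)

single = modes([1, 2, 2, 3, 4, 5])
none = modes([1, 2, 3, 4, 5])
bimodal = modes([1, 2, 3, 3, 4, 5, 5])
print(single, none, bimodal)
```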

Measures of Central Tendency


Advantages of mean
- Always exists
- Is unique
- Can always be calculated by a simple formula

Disadvantages of mean
- Mean value for a data set is not necessarily one of the values of the data set
- Sensitive to extreme scores, either high or low
- Easily distorted by extremely large or extremely small values among the set of observations
- Example: mean of 1, 2, and 1,000,000 is 333,334.33
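A quick check of the distortion example, with the median shown alongside for contrast. This is a sketch using the standard `statistics` module.

```python
from statistics import mean, median

observations = [1, 2, 1_000_000]

distorted_mean = mean(observations)   # pulled far upward by the single extreme value
middle_value = median(observations)   # unaffected by how extreme the largest value is

print(round(distorted_mean, 2), middle_value)
```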

Measures of Central Tendency


Advantages of median
- Not affected by extreme scores
- Useful way of describing sets of observations that are skewed by including extremely large or small values

Disadvantages of median
- Median is not necessarily one of the values of the data set
- Defined differently for odd and even numbers of observations

Measures of Central Tendency


Advantages of mode
- Can be used with any scale of measurement
- If a set of observations has a mode, the mode usefully characterizes the set
- For example, a large set of observations noting the result of rolling two dice will tend to have a mode of 7

Disadvantages of mode
- Many sets of observations lack a mode because no observed value occurs more than once
- Other sets of observations may have several different most frequent values
- Doesn't characterize the set beyond its most frequently occurring value
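The two-dice example can be checked by enumerating all 36 equally likely outcomes rather than simulating rolls. This sketch assumes two fair six-sided dice.

```python
from collections import Counter
from itertools import product

# Count how many of the 36 (die 1, die 2) outcomes produce each total.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

most_likely_total, number_of_ways = ways.most_common(1)[0]
print(most_likely_total, number_of_ways)
```

Because 7 can be produced in more ways than any other total, it is the value a long series of rolls is expected to show most often.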

Measures of Central Tendency


Calculating mean

Age   Frequency   Age x Frequency
13    3           39
14    4           56
15    6           90
16    8           128
17    4           68
18    3           54
19    3           57

N = 31
Sum of X = 492
Mean = 492/31 = 15.87
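The mean calculation weights each age by its frequency. A minimal sketch, using the age and frequency values from the slide:

```python
# Age -> frequency, as given on the slide.
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

n = sum(frequency.values())                                       # number of observations
sum_of_x = sum(age * count for age, count in frequency.items())   # sum of all ages
mean_age = sum_of_x / n

print(n, sum_of_x, round(mean_age, 2))
```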

Measures of Central Tendency


Calculating mode
Age 16 has the highest frequency (8 observations)
Mode = 16

Measures of Central Tendency


Calculating median
Non-grouped data

Age   Positions in ordered list
13    1-3
14    4-7
15    8-13
16    14-21
17    22-25
18    26-28
19    29-31

N = 31, so midpoint is 16th value
Median = 16
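A sketch of the same lookup in code: walk through the ages in order, keep a running count, and stop at the age whose positions include the midpoint. It assumes an odd number of observations, as in the slide (N = 31).

```python
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

n = sum(frequency.values())
midpoint = (n + 1) // 2   # 16th value when n = 31 (odd n assumed)

running_count = 0
for age in sorted(frequency):
    running_count += frequency[age]   # last position occupied by this age
    if running_count >= midpoint:
        median_age = age
        break

print(midpoint, median_age)
```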

Measures of Central Tendency


Calculating median
Grouped data:
- Each value lies somewhere within its age range
- Values are assumed to be equally distributed within the range

Age   Positions in ordered list
13    1-3
14    4-7
15    8-13
16    14-21
17    22-25
18    26-28
19    29-31

N = 31, so midpoint is 16th value

Positions within the age 16 range:
Position   14     15     16     17     18     19     20     21
Value      16.06  16.19  16.31  16.44  16.56  16.69  16.81  16.94

Median = 16.31
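The interpolated values on the slide are consistent with spreading the eight age-16 observations evenly across the interval from 16 to 17, each at the centre of its own one-eighth slice. The sketch below is written under that assumption; the interval width of 1 year and the variable names are illustrative.

```python
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}
width = 1  # assumed: each age covers a one-year interval, e.g. 16 up to 17

n = sum(frequency.values())
position = (n + 1) // 2   # the 16th value

count_before = 0
for age in sorted(frequency):
    count = frequency[age]
    if count_before + count >= position:
        rank_in_group = position - count_before   # 3rd of the 8 values at age 16
        # Centre of the rank-th slice when the group is spread evenly over the interval.
        grouped_median = age + (rank_in_group - 0.5) / count * width
        break
    count_before += count

print(grouped_median)
```

The result, 16.3125, is the value the slide reports as 16.31.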

Measures of Central Tendency


Mean = 15.87
Mode = 16
Median = 16.31 (grouped data)

Measures of Central Tendency


Normal distribution
- Also called the normal curve, bell-shaped curve, or Gaussian distribution
- Many types of data are normally distributed in a population
- Histogram of data approximates a bell-shaped, symmetrical curve
- Concentration of scores in the middle, with fewer and fewer scores as you approach the extremes
- Example: heights of people in a population are approximately normally distributed

Measures of Central Tendency


Skewness
- Not all sets of data will exhibit properties of a normal distribution
- Some data sets are asymmetrical around a central point
- Majority of scores are closer to one extreme or the other: skewed distribution
- In a skewed distribution, the mean does not equal the median

Measures of Central Tendency


- Positively skewed distribution: tail goes to the right, and the median is less than the mean
  - Example: annual income of a population
- Negatively skewed distribution: tail goes to the left, and the mean is less than the median

Measures of Central Tendency


Special case of skewness: J-Curve
- Extreme skewness
- Proposed by Allport to describe conforming behavior in groups of people
- Large majority of scores fall at the end representing socially acceptable behavior; a small minority represent deviation from the norm
- Example: amount of time drivers who park in a No Parking zone stay there

[Chart: vertical axis 0 to 100; horizontal axis categories <5, 5 to 10, 10 to 15, 15 to 20, 20 to 25, >25]

Measures of Central Tendency


Determining when a distribution is skewed too much to be considered normal
- General rule of thumb: values beyond 2 standard errors of skewness (ses) are probably significantly skewed
- $ses = \sqrt{6/N}$, or use the ses statistic from software output (SPSS, for example)

Example: if sample size = 30 and skewness statistic is .9814:

$ses = \sqrt{6/30} = \sqrt{.20} = .4472$

$2 \times ses = 2 \times .4472 = .8944$

The skewness statistic of .9814 is beyond 2 ses, so the distribution is significantly skewed

Other factors (histograms, normal probability plots, type of test to be used) should influence the decision, depending on the exact circumstances of the analysis
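The rule of thumb can be sketched in a few lines. The function name is illustrative, and the skewness statistic itself is taken as given (for example from statistical software) rather than computed here.

```python
import math

def standard_error_of_skewness(n):
    """Approximate standard error of skewness for a sample of size n: sqrt(6 / n)."""
    return math.sqrt(6 / n)

sample_size = 30
skewness_statistic = 0.9814   # value from the slide's example

ses = standard_error_of_skewness(sample_size)
threshold = 2 * ses
is_skewed = abs(skewness_statistic) > threshold   # rule of thumb only, not a formal test

print(round(ses, 4), round(threshold, 4), is_skewed)
```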

Measures of Central Tendency


Kurtosis - amount of peakedness or flatness of the distribution
- Mesokurtic: normal
- Leptokurtic: peaked, many scores around the middle
- Platykurtic: flat, many scores dispersed from the middle
- Non-normal kurtosis is determined by a process similar to that for skewness
- Non-normal kurtosis is only a concern with some statistical tests

Measures of Central Tendency


Selecting appropriate measure of central tendency
- Interactive selection at Selecting Statistics by William M.K. Trochim: http://trochim.human.cornell.edu/selstat/ssstart.htm
- Rules below can be bent, depending on situation

Distribution and type of data                   Measure
Unimodal, ratio or interval data, skewed        median
Unimodal, ratio or interval data, not skewed    mean
Unimodal, ordinal                               median
Unimodal, nominal                               mode
Bi-modal or multi-modal distribution            mode

Measures of Dispersion
- Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency
- Measures of central tendency are not enough to accurately describe a data set
- Also need to be able to describe the variability or dispersion of the data
- Dispersion: scatteredness or fluctuation of scores around the average score
- Several types of measures of dispersion:
  - Range
  - Standard deviation
  - Variance

Measures of Dispersion
Range
- Distance between the smallest and largest observations in a set of data
- Examples:
  - Range of the set of observations 2, 4, 7 is 5
  - Range of the set -10, -3, 4 is 14
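Both range examples as a one-line calculation. This is a sketch, and the helper name is illustrative.

```python
def value_range(values):
    """Distance between the largest and smallest observations."""
    return max(values) - min(values)

range_a = value_range([2, 4, 7])
range_b = value_range([-10, -3, 4])
print(range_a, range_b)
```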

Measures of Dispersion
Interquartile range
- Simplified version: ignore the top and bottom 25% after sorting
- Difference between the remaining largest and smallest numbers is the interquartile range
- Addresses the problem of outliers
- Other methods of calculating interquartile range are slightly more complicated but take into account more data
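A sketch of the simplified version described above. How to handle a set whose size is not divisible by 4 is not specified in the slide; rounding the number of dropped values down is an assumption made here, and the data set is invented for illustration.

```python
def simple_iqr(values):
    """Drop the lowest and highest quarter of the sorted values, then take the range of the rest."""
    ordered = sorted(values)
    drop = len(ordered) // 4                     # assumed: round down when n is not divisible by 4
    remaining = ordered[drop:len(ordered) - drop]
    return remaining[-1] - remaining[0]

data = [1, 2, 3, 4, 5, 6, 7, 100]                # hypothetical set with one outlier
full_range = max(data) - min(data)
iqr = simple_iqr(data)
print(full_range, iqr)
```

The outlier dominates the full range but has no effect on the simplified interquartile range.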

Measures of Dispersion
Standard deviation
- Measures the variability or the degree of dispersion of the data set
- Square root of the average squared deviation from the mean
- Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations

Range: $\text{Range} = \text{Max} - \text{Min}$

Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Measures of Dispersion
Calculating standard deviation
1. Subtract the sample/population mean from each observation and square the result
2. Add the squared distances
3. Divide the sum by n - 1 (sample) or N (population) to get the adjusted mean of squared distances
4. Take the square root of the mean squared distance

SD of sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

SD of population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
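The four calculation steps can be sketched directly in code. The function name and the `sample` flag are illustrative, and the result is cross-checked against the standard `statistics` module.

```python
import math
import statistics

def standard_deviation(values, sample=True):
    n = len(values)
    mean = sum(values) / n
    squared = [(x - mean) ** 2 for x in values]   # step 1: deviation from the mean, squared
    total = sum(squared)                          # step 2: add the squared distances
    divisor = n - 1 if sample else n              # step 3: divide by n - 1 (sample) or N (population)
    return math.sqrt(total / divisor)             # step 4: take the square root

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
s = standard_deviation(data)
sigma = standard_deviation(data, sample=False)

print(round(s, 2), round(statistics.stdev(data), 2))
print(round(sigma, 2), round(statistics.pstdev(data), 2))
```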

Measures of Dispersion
Variance
- Square of the standard deviation
- Not used for descriptive statistics, but is important for specific inferential statistics tests

Variance of sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Variance of population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
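A brief check that the variance is the square of the standard deviation, using the standard `statistics` module. The data set is chosen only for illustration.

```python
from statistics import pvariance, stdev, variance

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

sample_variance = variance(data)        # divides by n - 1
population_variance = pvariance(data)   # divides by N
sd_squared = stdev(data) ** 2           # should match the sample variance

print(sample_variance, round(population_variance, 2), round(sd_squared, 2))
```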

Measures of Dispersion
Advantages of range as measure of dispersion
- Very simple to calculate
- Provides a meaningful characteristic of a set of observations (total spread of the observations)

Disadvantages of range as measure of dispersion


- Extreme values distort range
- Only measures the total spread; tells us nothing about the pattern of data distribution
- Examples:
  - Data set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8
  - Data set 1, 9, 9, 9, 9, 9, 9, 9, 9 also has a range of 8, though it is clearly less scattered

Measures of Dispersion
Advantages of standard deviation as measure of dispersion
- Can always be calculated
- Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations

Examples:
- Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74
- Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67
- Range doesn't distinguish the difference in scatteredness of the sets, but standard deviation does

Disadvantage of standard deviation as measure of dispersion is that it is more complicated to calculate, though not for computers
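The comparison between the two example sets can be reproduced as follows. This sketch uses the sample standard deviation (n - 1 divisor), which matches the figures quoted on the slide.

```python
from statistics import stdev

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for observations in (evenly_spread, clustered):
    spread = max(observations) - min(observations)   # range: identical for both sets
    s = stdev(observations)                          # sample standard deviation: differs
    print(spread, round(s, 2))

range_even = max(evenly_spread) - min(evenly_spread)
range_clustered = max(clustered) - min(clustered)
sd_even = round(stdev(evenly_spread), 2)
sd_clustered = round(stdev(clustered), 2)
```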
