Purpose of descriptive statistics
Frequency distributions
Measures of central tendency
Measures of dispersion
Inferential statistics
Using sample data to make generalizations (inferences) or estimates about a population
Statements made in terms of probability
Gary Geisler Simmons College LIS 403 Spring, 2004
Descriptive Statistics
Commonly used in LIS research
Cannot test causal relationships
Primary strength is describing and summarizing data:
  Describing data in terms of frequency distributions
  Describing most typical value in data set - measures of central tendency
  Describing variability of data - measures of dispersion
Frequency Distributions
Describing data in terms of frequency distributions
Counts of totals by value or category for each measured variable
Can be presented as absolute totals, cumulative totals, percentages, grouped totals
Often a first step in statistical analysis of data
Usually presented in tables or charts (histogram, bar graph, etc.)
[Figure: bar chart of books checked out by age group (0-10, 11-20, 21-40, 41-60, 61+); vertical axis from 0 to 80]
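The presentation options listed above (absolute totals, cumulative totals, percentages) can be sketched in a few lines of Python. The function name `frequency_table` and the sample data are hypothetical and are not taken from the slides.

```python
from collections import Counter
from itertools import accumulate


def frequency_table(values):
    """Return rows of (value, absolute count, cumulative count, percentage)."""
    counts = Counter(values)
    ordered = sorted(counts)
    absolute = [counts[v] for v in ordered]
    cumulative = list(accumulate(absolute))
    total = len(values)
    return [
        (v, a, c, 100 * a / total)
        for v, a, c in zip(ordered, absolute, cumulative)
    ]


# Hypothetical data: number of books checked out by ten patrons.
books = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
for row in frequency_table(books):
    print(row)
```

Each row is one line of a frequency table; grouped totals would need an extra step that maps each value to its group before counting.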
Sample mean: $\bar{x} = \frac{\sum x}{n}$
Population mean (mean of a variable): $\mu = \frac{\sum x}{N}$
Range: $\text{Range} = \text{Max} - \text{Min}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population standard deviation (standard deviation of a variable): $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Quartile positions: $(n + 1)/4$, $(n + 1)/2$, $3(n + 1)/4$
Interquartile range: $\text{IQR} = Q_3 - Q_1$
Lower limit $= Q_1 - 1.5 \cdot \text{IQR}$, Upper limit $= Q_3 + 1.5 \cdot \text{IQR}$
where $n$ is the sample size and $N$ is the population size
Disadvantages of mean
Mean value for a data set is not necessarily one of the values of the data set
Sensitive to extreme scores, either high or low
Easily distorted by extremely large or extremely small values among the set of observations
  Example: mean of 1, 2, and 1,000,000 is 333,334.33
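The example above can be checked with Python's standard `statistics` module. This is a small sketch; the median is printed only for comparison.

```python
from statistics import mean, median

observations = [1, 2, 1_000_000]

# The mean is pulled toward the single extreme value and is not
# one of the observed values.
print(round(mean(observations), 2))  # 333334.33

# The median of the same observations is unaffected by the extreme value.
print(median(observations))  # 2
```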
Disadvantages of median
Median is not necessarily one of the values of the data set
Defined differently for odd and even numbers of observations
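A minimal sketch of the two cases, assuming the usual convention that an even-sized set uses the average of the two middle values. The function name `median_of` is hypothetical.

```python
def median_of(values):
    """Median with the odd and even cases handled separately."""
    if not values:
        raise ValueError("median of an empty set is undefined")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value.
        return ordered[mid]
    # Even number of observations: average of the two middle values,
    # which need not be one of the observed values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([3, 1, 2]))     # 2
print(median_of([1, 2, 3, 4]))  # 2.5
```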
Disadvantages of mode
Many sets of observations lack a mode because no observed value occurs more than once
Other sets of observations may have several different most frequent values
Doesn't characterize set beyond most frequently occurring value
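Both situations can be illustrated with a short sketch. The function `modes` is hypothetical; it returns an empty list when no value repeats, following the convention above that such a set lacks a mode.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # No observed value occurs more than once: no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


print(modes([1, 2, 3]))         # []
print(modes([1, 1, 2, 2, 3]))   # [1, 2]
print(modes([15, 16, 16, 17]))  # [16]
```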
Calculating mean
[Figure: frequency distribution over the values 13 to 19]
Mode = 16
N = 31, so the midpoint is the 16th value
Calculating median
Grouped data:
  Each value is somewhere within each age range
  Values are assumed to be equally distributed within range
[Figure: frequency distribution over the values 13 to 19; N = 31, so the midpoint is the 16th value]
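Under the assumption that values are spread evenly within each range, a grouped median is commonly found by linear interpolation inside the class that contains the halfway point. The sketch below uses that textbook interpolation; the function name `grouped_median` and the class frequencies are hypothetical and are not the data from the slide.

```python
def grouped_median(classes):
    """Interpolated median for grouped data.

    classes: list of (lower, upper, frequency), sorted by lower bound.
    Assumes values are evenly distributed within each class.
    """
    total = sum(freq for _, _, freq in classes)
    half = total / 2
    below = 0  # cumulative frequency of the classes already passed
    for lower, upper, freq in classes:
        if freq > 0 and below + freq >= half:
            # Move into the class in proportion to how far the halfway
            # point lies past the cumulative count below it.
            return lower + (half - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no observations")


# Hypothetical grouped data: (lower bound, upper bound, frequency).
groups = [(0, 10, 2), (10, 20, 6), (20, 30, 8), (30, 40, 4)]
print(grouped_median(groups))  # 22.5
```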
Measures of Dispersion
Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency
Measures of central tendency are not enough to accurately describe a data set
Also need to be able to describe the variability or dispersion of the data
Dispersion: scatteredness or fluctuation of scores around average score
Several types of measures of dispersion:
  Range
  Standard deviation
  Variance
Measures of Dispersion
Range
Distance between the smallest and largest observations in a set of data
Examples:
  Range of the set of observations 2, 4, 7 is 5
  Range of the set -10, -3, 4 is 14
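The two examples can be reproduced directly. The helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Distance between the smallest and largest observations."""
    return max(values) - min(values)


print(value_range([2, 4, 7]))     # 5
print(value_range([-10, -3, 4]))  # 14
```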
Measures of Dispersion
Interquartile range
Simplified version: ignore the top and bottom 25% after sorting
Difference between the remaining largest and smallest numbers is interquartile range
Addresses the problem of outliers
Other methods of calculating interquartile range are slightly more complicated but take into account more data
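A rough sketch of the simplified version follows. The slide does not say how to handle set sizes that are not divisible by four; rounding the trimmed count down is an assumption made here, and the name `simple_iqr` is hypothetical.

```python
def simple_iqr(values):
    """Simplified interquartile range: trim 25% from each end after sorting."""
    if not values:
        raise ValueError("interquartile range of an empty set is undefined")
    ordered = sorted(values)
    k = len(ordered) // 4  # assumption: round the 25% count down
    middle = ordered[k:len(ordered) - k]
    return middle[-1] - middle[0]


with_outlier = [1, 2, 3, 4, 5, 6, 7, 100]
print(max(with_outlier) - min(with_outlier))  # range: 99
print(simple_iqr(with_outlier))               # simplified IQR: 3
```

The single extreme value dominates the range but is trimmed away before the interquartile range is taken, which is the sense in which it addresses outliers.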
Measures of Dispersion
Standard deviation
Measures the variability or the degree of dispersion of the data set
Square root of the average squared deviations from the mean
Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations
Measures of Dispersion
Calculating standard deviation
1. Subtract each observation from the sample/population mean and square the difference
2. Add squared distances
3. Divide sum by n - 1 (sample) or N (population) to get the adjusted mean of squared distances
4. Take square root of mean squared distances
SD of sample: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
SD of population: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
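The four calculation steps map directly onto code. This is a minimal sketch; the function name `standard_deviation`, its `sample` flag, and the data are hypothetical.

```python
from math import sqrt


def standard_deviation(values, sample=True):
    """Standard deviation following the four steps for calculating it."""
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough observations")
    center = sum(values) / n
    # Step 1: distance of each observation from the mean, squared.
    squared = [(x - center) ** 2 for x in values]
    # Step 2: add the squared distances.
    total = sum(squared)
    # Step 3: divide by n - 1 for a sample, or by N for a population.
    divisor = n - 1 if sample else n
    # Step 4: take the square root.
    return sqrt(total / divisor)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
print(standard_deviation(data, sample=False))  # population SD: 2.0
print(standard_deviation(data))                # sample SD, slightly larger
```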
Measures of Dispersion
Variance
Square of standard deviation
Not used for descriptive statistics, but is important for specific inferential statistics tests
Variance of sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Variance of population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
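The relationship between variance and standard deviation can be checked with Python's standard `statistics` module, which provides separate sample and population versions. The data are hypothetical.

```python
from math import isclose
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

sample_var = variance(data)       # divides by n - 1
population_var = pvariance(data)  # divides by N

print(sample_var, population_var)

# In both cases the variance is the square of the standard deviation.
print(isclose(stdev(data) ** 2, sample_var))       # True
print(isclose(pstdev(data) ** 2, population_var))  # True
```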
Measures of Dispersion
Advantages of range as measure of dispersion
Very simple to calculate
Provides a meaningful characteristic of a set of observations (total spread of the observations)
Measures of Dispersion
Advantages of standard deviation as measure of dispersion
Can always be calculated
Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations
Examples:
Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74
Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67
Range doesn't distinguish the difference in scatteredness of the two sets (both have a range of 8), but standard deviation does
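Both figures can be reproduced with the sample standard deviation from Python's `statistics` module. The variable names are hypothetical.

```python
from statistics import stdev

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
clustered = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for observations in (evenly_spread, clustered):
    spread = max(observations) - min(observations)
    # stdev is the sample standard deviation (divides by n - 1).
    print(spread, round(stdev(observations), 2))
# prints "8 2.74" and then "8 2.67"
```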
Disadvantage of standard deviation as a measure of dispersion: it is more complicated to calculate, though not for computers