Prologue
Statistics: What is it?
Statistics deals with the collection and analysis of data to solve real-world problems.

Time | Contributor | Contribution
Ancient | Greek philosophers | Ideas only; no quantitative analyses.
Ancient | Babylonians, Egyptians | Collected demographic data for tax collection and recruitment of military units.
14th Century | European marine companies | Set marine insurance rates using data concerning the success of the transportation of goods.
 | Blaise Pascal, Pierre de Fermat | Studied probability through games of chance and gambling.
 | | Proved the law of large numbers -- as the number of observations increases, the ratio of observed successful to unsuccessful occurrences will differ from the true ratio only within certain small limits.
 | | Constructed the normal curve; developed the application of probability ideas to astronomy.
 | | Statistics underwent simultaneous horizontal and vertical development. Horizontal : methods spread among disciplines including astronomy, geodesy, psychology, biology, social sciences, etc. Vertical : understanding of mathematical probability theory led to the development of statistical inference.
 | Adolphe Quetelet | Astronomer who first applied statistical analyses to human biology.
 | | Studied genetic variation in humans, using regression and correlation.
 | Karl Pearson | Studied natural selection using correlation; formed the first academic department of statistics and the Biometrika journal; helped develop the Chi Square analysis.
 | | Studied the process of brewing; alerted the statistics community to problems with small sample sizes; developed Student's test.
 | | Evolutionary biologists who developed ANOVA; stressed the importance of experimental design.
20th Century | Computer technology | Provided many advantages over calculations by hand or by calculator; stimulated the growth of investigation into new techniques.
The word Statistics is derived from the Latin word for the state, as the first important accumulation of data was for the purposes of the state. The term Statistik was probably first used by the German philosopher Gottfried Achenwall in the middle of the eighteenth century. It referred to inquiries respecting the population, political circumstances, the production of a country, and other matters of state. While the science of statistics was being studied in Germany, the words Statistics and Statistical were introduced into the English language around 1787 by Eberhard A.W. von Zimmermann (1743-1815).
Elements of Statistics
Statistics
  Data Collection
    Survey Sampling
    Experimental Design
    Observational Study
  Data Analysis
    Descriptive Statistics and Statistical Graphics
    Statistical Inference
      Estimation (Point, Interval)
      Testing Hypothesis
Statistics Users
Category 1 : able to understand statistical presentations
Category 2 : able to select and apply statistical procedures to a particular problem
Category 3 : applied statisticians who help others use statistics on a particular problem
Category 4 : mathematical statisticians who develop new statistical techniques and discover new characteristics of old techniques
Chapter I
Terminology
A device used to assist in the production of a number from the object.
Variable : a piece of information that may be expressed as a number, changing from item to item.
Data : a set of numbers representing records of observations.

Layout of a dataset : one row for each unit (Unit 1, Unit 2, ..., Unit n) and one column for each variable (Variable 1, Variable 2, ..., Variable m).
Dataset : student_record.dat
1. Run Microsoft Excel 2007.
2. Office Button -> Open...
3. Change directory and select the data file student_record.dat. You may need to first change the Files of type setting to All Files (*.*) to show all files.
4. In the Text Import Wizard, select Delimited as the Original data type. Click on Next.
5. Select both Tab and Space as the Delimiters. This specifies how the data are separated in each row of the original data file. The Data preview window shows the input dataset. Clicking on Finish loads the data into the worksheet. (Clicking on Next instead lets you set the Data Format of each column.)
6. From the Office Button menu, select Save as .... You can save the worksheet in Excel Workbook (*.xlsx) format so that the data format for each variable is stored too. You can also save the worksheet in Excel 97-2003 Workbook (*.xls) format so that the file is compatible with older versions of Excel.
1. Nominal Scale
Tells only what class a unit falls in with respect to the property, e.g. sex, nationality, tutorial class. The classes are often called categories. The categories have no logical order.
2. Ordinal Scale
Also tells when one unit has more of the property than does another unit, e.g. grade, attitude (disagree, agree, strongly agree).
Quantitative Scales
3. Interval Scale
Also tells us that one unit differs by a certain amount of the property from another unit, e.g. temperature in degrees Celsius, altitude of a place, time of occurrence of an event. Zero is just a reference point.
4. Ratio Scale
Tells us that one unit has so many times as much of the property as does another unit, e.g. height, weight. It has a meaningful zero: zero really means nothing of the property.
1.2 Distribution of Data
The distribution of a quantitative variable gives a general picture of how many units have values of the variable falling in a given range. It can be represented by a histogram, a boxplot, or summary statistics.
A first glance at the above histograms gives the following impressions. The distribution of scores in 00/01-02/03 is located to the right of the distribution of scores in 03/04. On the other hand, the scores in 03/04 look more symmetric and are spread out wider. Such comparisons can also be made by calculating statistics that represent the location and spread of the distributions. The interpretation and construction of histograms will be described in later sections.
1.3
Suppose we have a set of data, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9} which are the sales (in $m) of 10 furniture companies in a particular year. There are several measures of central location.
A. Mean
At this point it is necessary to introduce some symbols. The above ten individual items of data (X) will be designated $X_1, X_2, \ldots, X_{10}$, and the number of these items of data by $n$. The mean $\bar{X}$, pronounced x-bar, is then computed by

$$\bar{X} = \frac{\sum X}{n}$$

where $\sum$, sigma, is the summation sign and is simply a command to add up all the 10 x-values, i.e.

$$\sum X = X_1 + X_2 + \cdots + X_{10} = 116, \qquad \bar{X} = \frac{116}{10} = 11.6 \text{ million dollars.}$$
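As a quick check of the arithmetic, the mean of the ten sales figures can be computed with a short Python sketch. The variable names are illustrative and not part of the course material.

```python
# Sales (in $m) of the 10 furniture companies quoted in the text.
sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]

n = len(sales)        # number of items of data
total = sum(sales)    # the summation sign: add up all the x-values
mean = total / n      # x-bar

print(f"n = {n}, sum = {total}, mean = {mean}")
```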
B. Median
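Arrange the data in ascending order. The median is the number with middle rank, i.e.

$$\text{median} = \begin{cases} \left(\dfrac{n+1}{2}\right)\text{th number} & \text{if } n \text{ is odd} \\[2ex] \text{average of } \left(\dfrac{n}{2}\right)\text{th and } \left(\dfrac{n}{2}+1\right)\text{th numbers} & \text{if } n \text{ is even} \end{cases}$$

where $n$ is the size of the dataset. For these ten sales of companies, {9, 16, 11, 19, 11, 10, 13, 12, 6, 9}, the sorted dataset will be

6 9 9 10 11 11 12 13 16 19

Since $n = 10$ is even, the median is the average of the 5th and 6th numbers:

$$\text{median} = \frac{11 + 11}{2} = 11 \text{ million dollars.}$$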
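The middle-rank rule can be sketched in Python as follows. The function name `median_by_rank` is hypothetical and chosen only for this illustration.

```python
def median_by_rank(values):
    """Median by the middle-rank rule described in the text."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th number, which is index mid when counting from 0
        return ordered[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
print(sorted(sales))
print(median_by_rank(sales))
```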
Using the Excel function @MEDIAN(data range), the median overall score of all the 706 students in the student record dataset is found to be 71.9. Comparing this figure with the mean, we can see that both measures of location give quite close results. However, this is not always the case.
Example 1.2
Monthly income of five staff members (in $1000) : 13, 21, 23, 32, 35
Mean = 24.8, Median = 23
If 35 is mistyped as 135, then the data will become : 13, 21, 23, 32, 135
Mean = 44.8, Median = 23
Hence it is possible to have very different values of mean and median. From this example we can also see that the mean is more sensitive to extreme values than the median. In general, the median is best for describing highly skewed (not symmetric) distributions or when there are one or more outlying values whose validity is suspect. Otherwise, the mean is best, as it fully utilizes all the data and is easier to study mathematically.
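The sensitivity of the mean to a single mistyped value can be reproduced with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are assumptions.

```python
from statistics import mean, median

incomes = [13, 21, 23, 32, 35]      # monthly income in $1000
mistyped = [13, 21, 23, 32, 135]    # the same data with 35 entered as 135

for label, data in (("original", incomes), ("mistyped", mistyped)):
    # The mean moves with the extreme value; the median does not.
    print(label, "mean =", mean(data), "median =", median(data))
```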
1.4
The above histograms show the scores (based on a questionnaire of ten five-point-scale items) at five-star hotels rated by 127 male guests and 114 female guests. The locations of both distributions are more or less the same. However, the scores given by male guests spread out wider than the scores given by female guests. That means the variation in ratings by male guests is larger than that by female guests. There are several ways to assess variation, but by far the most useful is the statistic known as the standard deviation, which is given the symbol S. The formula for calculating the standard deviation is

$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Example 1.3
Sales ($m) X    Deviation squared (X - X̄)²
9               6.76
16              19.36
11              0.36
19              54.76
11              0.36
10              2.56
13              1.96
12              0.16
6               31.36
9               6.76

$\sum X = 116$, $\quad \sum (X - \bar{X}) = 0$, $\quad \sum (X - \bar{X})^2 = 124.4$

$$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{124.4}{9} = 13.8222$$

$$S = \sqrt{13.8222} = 3.7178$$
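The same calculation can be sketched in Python, following the definition step by step (squared deviations, divide by n - 1, take the square root). Variable names are illustrative.

```python
import math

sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
n = len(sales)
mean = sum(sales) / n

# Squared deviation of each value from the mean.
squared_devs = [(x - mean) ** 2 for x in sales]
sum_sq = sum(squared_devs)

variance = sum_sq / (n - 1)      # S squared, with the n - 1 divisor
std_dev = math.sqrt(variance)    # S

print(round(sum_sq, 4), round(variance, 4), round(std_dev, 4))
```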
Dataset : hotel_scores.dat
1. Read the data into an Excel worksheet. The first column will contain the gender (M/F) of the guests and the second column will contain the scores given by the guests.
2. In any empty cells, input @AVERAGE(B2:B242), @MEDIAN(B2:B242), @STDEV(B2:B242) to obtain respectively the mean, median, and standard deviation of the scores given by all the guests.
3. To compute the standard deviation for male scores only, we can use the array formula: @STDEV(IF(A2:A242="M",B2:B242)). Press Ctrl-Shift-Enter instead of just Enter for an array formula. This will compute the standard deviation of the data in the range B2:B242 that satisfy the criterion A2:A242="M", i.e. gender = male. The other statistics, and the statistics for female scores, can be computed similarly.
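For readers working outside Excel, the idea behind the array formula (filter the rows by gender, then take the standard deviation of what remains) can be sketched in Python. The records below are made up for illustration and are not the hotel_scores.dat data; the helper name `stdev_for` is hypothetical.

```python
from statistics import stdev

# Hypothetical (gender, score) records; NOT the actual hotel_scores.dat data.
records = [("M", 30), ("F", 34), ("M", 42), ("F", 36), ("M", 25), ("F", 38)]


def stdev_for(gender, rows):
    """Sample standard deviation of the scores whose gender matches."""
    scores = [score for g, score in rows if g == gender]
    return stdev(scores)


print("male:", round(stdev_for("M", records), 4))
print("female:", round(stdev_for("F", records), 4))
```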
Remarks
An equivalent computational formula for the variance is

$$S^2 = \frac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}$$
Example 1.5
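Sales ($m) X    X²
9               81
16              256
11              121
19              361
11              121
10              100
13              169
12              144
6               36
9               81

$\sum X = 116$, $\quad \sum X^2 = 1470$

$$\sum X^2 - \frac{(\sum X)^2}{n} = 1470 - \frac{116^2}{10} = 124.4$$

$$S^2 = \frac{124.4}{10 - 1} = 13.8222$$

$$S = \sqrt{13.8222} = 3.7178$$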
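A short Python sketch confirms that the computational formula, which needs only the sum of the values and the sum of their squares, gives the same standard deviation as the definition. Variable names are illustrative.

```python
import math

sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
n = len(sales)

sum_x = sum(sales)                    # sum of X
sum_x2 = sum(x * x for x in sales)    # sum of X squared

# Sum of X squared minus (sum of X) squared over n.
corrected = sum_x2 - sum_x ** 2 / n
variance = corrected / (n - 1)
std_dev = math.sqrt(variance)

print(sum_x, sum_x2, round(corrected, 4), round(std_dev, 4))
```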
1.5
A. Bar Chart
Displays summarized data where there is no emphasis on the percentage of a total.
[Figure: bar chart "Mean Overall Scores of Math244 students"; x-axis: Year, y-axis: Mean Overall]
B. Pie Charts
A simple descriptive display of data that sum to a given total. Most illustrative way of displaying percentages. For nominal or ordinal data.
[Figure: pie chart "Grade Distribution of Math244 Students"; surviving slice labels: B, 107, 15%; C, 72, 10%]
C. Dot Diagram
[Figure: dot diagram on an axis running from 80 to 130]
Remarks
1. It is easy to construct.
2. It is compact and can be used in the margins of other displays to add information.
3. It is not suitable when we have too many points.
D. Frequency Distribution
e.g. Table : Flow of vehicles passing through a particular point during an hour.

Vehicles       Frequency   Percentage
Cars           45          59
Lorries        22          29
Motorcycles    6           8
Buses          3           4
Total          76          100
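The percentage column can be recovered from the counts alone, as in this small Python sketch. The dictionary layout and names are illustrative.

```python
# Counts from the vehicle flow table.
counts = {"Cars": 45, "Lorries": 22, "Motorcycles": 6, "Buses": 3}

total = sum(counts.values())
# Percentage of the total for each category, rounded to whole numbers.
percentages = {name: round(100 * c / total) for name, c in counts.items()}

for name, c in counts.items():
    print(f"{name:<12}{c:>4}{percentages[name]:>5}")
print(f"{'Total':<12}{total:>4}{sum(percentages.values()):>5}")
```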
Remarks
1. Percentages and/or cumulative percentages should be included if they are what readers are interested in. Counts can be omitted provided that they can be recovered if desired.
2. The total size of the data should always be included.
3. For nominal data, arrange the categories in a meaningful way. If there is no natural order, arrange the categories so that their associated frequencies are decreasing. For ordinal data, categories should be arranged in their natural order.
4. For continuous data, information is lost through grouping of data.
5. Different choices of class intervals may give different impressions.
E. Histogram
A histogram is a very suitable way to represent the distribution of data, especially for large datasets. It is a frequency diagram because it shows the frequency of occurrence of results within particular intervals. When inspecting a histogram, the most important point to keep in mind is that counts of data are represented by area, rather than height. To construct a histogram, one should:
1. Partition the range of data into several intervals (not necessarily of equal widths).
2. Draw rectangles over the intervals. The area of each rectangle should be proportional to the corresponding count.
To construct a histogram using Excel, the Analysis ToolPak should be loaded first:
1. Office Button -> Excel Options
2. Click the Add-ins tab on the left panel. At the Manage menu, select Excel Add-ins, then click Go...
3. Select Analysis ToolPak and click on OK. (If you are prompted that the Analysis ToolPak is not installed, install it.)
4. After the Analysis ToolPak is successfully loaded, the Data Analysis command will be added to the Data menu. This command provides some handy functions for performing statistical analyses. Detailed instructions can be found at:
http://office.microsoft.com/en-us/excel-help/load-the-analysis-toolpak-HP001127724.aspx
There is a histogram function in the added Data Analysis command. However, it produces the histogram as a column bar chart, with the numerical labels misaligned. A handy add-in built on the basis of the original histogram command was developed by Prof. Michael R. Middleton, School of Business and Management, University of San Francisco.
1. Download the add-in BetterHistogram_20070222_2050.xla from the course web.
2. Office Button -> Excel Options -> Add-ins -> Excel Add-ins -> Go...
3. Click Browse..., change directory to select the downloaded add-in file, then click OK.
4. After the Better Histogram add-in is successfully loaded, the Better Histogram command will be added to the Add-Ins menu.
Detailed instructions on the use of this command can be found in the book Data Analysis Using Microsoft Excel: Updated for Office 07, or at the following site:
http://www.treeplan.com/BetterHistogram_20041117_1555.htm
[Figure: frequency histogram of the overall scores; x-axis: Overall (0 to 100), y-axis: Frequency (0 to 200)]
From this histogram we can see that the rectangular blocks from 50 to 90 comprise most of the area. Hence most of the students scored between 50 and 90. The graph is not symmetric, as there is a tail pointing towards the left. It indicates that, relative to the average students, there are fewer low-scoring students than high-scoring students, with the low scores stretching further away from the centre. This kind of deviation from symmetry is called skewness. Since the tail of the histogram points towards the left hand side, the distribution is said to be skewed to the left, or negatively skewed.
Since the areas represent frequencies, formally the scale of the y-axis should be frequency per unit score rather than just frequency (height = area ÷ width). The area under the whole distribution would then represent the total number of data, i.e. 706. Usually the y-scale is further adjusted to make the total area equal to 1. Such a y-scale is called the density.

$$\text{density} = \frac{\text{frequency}}{n \times \text{class width}}$$
Changing the y-axis scale makes no difference to the shape of these two histograms because, with equal class widths, the heights of the rectangular blocks are directly proportional to their areas. However, one must use density as the y-axis scale whenever the class widths are unequal, as illustrated by the following examples.
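The effect of unequal class widths can be sketched numerically in Python. The class boundaries and counts below are hypothetical and are not taken from the student record dataset. The height of each block is computed as frequency divided by (n × class width), so that the total area is 1.

```python
# Hypothetical class boundaries and counts (NOT the course dataset).
edges = [0, 35, 70, 80, 90, 100]
counts = [20, 100, 50, 25, 5]

n = sum(counts)
densities = []
for left, right, count in zip(edges, edges[1:], counts):
    width = right - left
    # Height of the block = relative frequency / class width.
    densities.append(count / (n * width))

# Area of each block = height * width; the areas should sum to 1.
total_area = sum(d * (r - l) for d, l, r in zip(densities, edges, edges[1:]))

for left, right, count, d in zip(edges, edges[1:], counts, densities):
    print(f"[{left:>3}, {right:>3})  count = {count:>3}  density = {d:.4f}")
print("total area =", round(total_area, 6))
```

In this made-up example the class from 35 to 70 has the largest count, but the class from 70 to 80 has the tallest block, because its count is packed into a much narrower interval.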
Example 1.8 Frequency histogram with unequal width classes (Incorrect construction of histogram)
The shape of the distribution is totally ruined. It gives the incorrect impression that there are many more students scoring 35 to 70 than 70 to 90.
This histogram provides a more detailed description of the data distribution from 70 to 90, yet the overall shape is still preserved. Preserving the perceived shape of the distribution is the main reason for using area, rather than height, to represent frequencies.
Relative frequency and Probability
We saw from the above histogram that a given class of results, say those lying between 60 and 70, made up about 0.02 × (70 - 60) = 0.2 = 20% of the total. This 20% is called the relative frequency of scores between 60 and 70. If the population of all students (including those in future semesters) has more or less the same distribution as this dataset, we may infer that there would be around a 20% chance of observing a student scoring in this range. The process of statistical inference (with suitable assumptions) allows us to equate in a numerical fashion the relative frequency of past events and the probability of future events. Hence it is important to understand the interpretation of the histogram, as there are similarities between the histogram and the probability density curve, which will be discussed in Chapter III.
Remarks
1. The size of the dataset should be given.
2. There should be no gaps between blocks.
3. It is invalid to use a broken vertical scale in a histogram.
4. There is no good way to represent open-ended intervals graphically when they have non-zero frequencies.
5. A histogram is a more suitable form of representation than a frequency distribution table when the class intervals have unequal widths.
F. Percentile
In a dataset of n observations, the (100p)th percentile (0 < p < 1) has approximately np observations less than or equal to it and approximately n(1-p) observations greater than it.
e.g. Test scores of ten students : 68, 75, 58, 47, 83, 34, 90, 71, 63, 79
Sorted scores : 34, 47, 58, 63, 68, 71, 75, 79, 83, 90
There are 3 students having scores less than or equal to 58. Therefore the datum 58 is a 30th percentile. Note that under the above definition, all the values between 58 and 63 are also 30th percentiles.
Let $X_{(i)}$ denote the ith smallest value (such that $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$). Let r and f be the integer part and fractional part of (n + 1)p respectively. The following definition provides a formula to compute the percentile uniquely:

$$(100p)\text{th percentile} = X_{(r)} + f\,\bigl(X_{(r+1)} - X_{(r)}\bigr)$$
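A minimal Python sketch of this rule, taking r and f as the integer and fractional parts of (n + 1)p and interpolating between the r-th and (r + 1)-th smallest values. The function name and the clamping at the two ends are assumptions made for the illustration; the text does not say how to handle r = 0 or r = n.

```python
def percentile(values, p):
    """(100p)th percentile: X(r) + f * (X(r+1) - X(r)), where (n + 1)p = r + f."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p
    r = int(position)      # integer part
    f = position - r       # fractional part
    if r < 1:              # assumption: fall back to the smallest value
        return ordered[0]
    if r >= n:             # assumption: fall back to the largest value
        return ordered[-1]
    # ordered[r - 1] is the r-th smallest value, ordered[r] the (r + 1)-th.
    return ordered[r - 1] + f * (ordered[r] - ordered[r - 1])


scores = [68, 75, 58, 47, 83, 34, 90, 71, 63, 79]
print(round(percentile(scores, 0.30), 2))   # 30th percentile
print(round(percentile(scores, 0.50), 2))   # median
```

For the ten test scores, (n + 1)p = 3.3 when p = 0.3, so the 30th percentile lies three tenths of the way from 58 to 63.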
In a relative frequency histogram, the (100p)th percentile is the cutoff point on the x-axis such that the total area of the histogram to the left of this cutoff point is equal to 100p percent. It can be computed from a frequency table by

$$(100p)\text{th percentile} = L_p + d\,\frac{np - F_{p-1}}{f_p}$$

where
$L_p$ = lower class boundary of the class of this percentile
$n$ = number of observations
$F_{p-1}$ = cumulative frequency below the class of this percentile
$f_p$ = frequency of the class of this percentile
$d$ = class width of the class of this percentile
e.g. Table : Frequency table for 20 grain bullet penetration depths into oak wood from a distance of 15 feet.

Penetration Depth (mm)   Frequency   Cumulative Frequency
58 - 60                  5           5
60 - 62                  3           8
62 - 64                  6           14
64 - 66                  3           17
66 - 68                  1           18
68 - 70                  0           18
70 - 72                  2           20
Total                    20
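The grouped-data percentile can be sketched in Python for the bullet penetration table: find the first non-empty class whose cumulative frequency reaches np, then interpolate within that class as lower boundary + width × (np - cumulative frequency below) / class frequency. The function name and argument layout are illustrative assumptions.

```python
def grouped_percentile(lower_bounds, width, freqs, p):
    """Percentile from a frequency table with equal class widths."""
    n = sum(freqs)
    target = n * p
    cumulative = 0    # cumulative frequency below the current class
    for lower, freq in zip(lower_bounds, freqs):
        if freq > 0 and cumulative + freq >= target:
            # Interpolate inside the class that contains the percentile.
            return lower + width * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("p must satisfy 0 < p < 1")


lower_bounds = [58, 60, 62, 64, 66, 68, 70]   # lower class boundaries (mm)
freqs = [5, 3, 6, 3, 1, 0, 2]

for p in (0.25, 0.50, 0.75):
    print(p, round(grouped_percentile(lower_bounds, 2, freqs, p), 2))
```

The three values obtained this way are the lower quartile, median and upper quartile of the grouped data.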
G. Boxplot
The 25th, 50th and 75th percentiles cut the data into four pieces and are given the special names lower quartile, median and upper quartile respectively. These three values, together with the maximum and minimum, provide a five-number summary of the data. A boxplot is a graphical display of the five-number summary. It is simple and compact. Although it is less informative than the histogram, it can give a good picture of the centre and spread of the distribution.
e.g. Boxplot of the bullet penetration data. (Min = 58, QL = 60, Median = 62.67, QU = 64.67, Max = 72)
[Figure: boxplot of the bullet penetration data on an axis running from 55 to 75]
Example 1.10 Boxplot of Student Record Data (created by another statistical package)
[Figure: boxplot of the student record data, with the upper quartile labelled]
The box comprises the middle half of the dataset, i.e. about 50% of the overall scores are located at the centre, from 60 to 80. The stars represent data points that are too extreme compared with the others and are sometimes regarded as outliers.
We can easily compare the distributions of the data in several groups with just one graph. For example, the scores of audit (*) and year 2 students spread out a little wider than those of year 1 and year 3 students. The distribution of audit students is located to the right of all the others, which suggests that they performed better than the other students in terms of examination results.
H. Stem-and-Leaf Display
Each data value is split into two components called stem and leaf.

Data               Stem   Leaf
22.9               22     9
or 22.9            2      29
or 22 (truncated)  2      2

e.g.
Data : 78, 65, 90, 86, 79, 51, 79, 62, 84, 101

Stem   Leaf
 5*    1
 6     25
 7     899
 8     46
 9     0
10*    1
n = 10    Leaf unit = 1
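A basic stem-and-leaf display for the ten data values can be produced with a few lines of Python. This sketch uses the tens as the stem and the units digit as the leaf; the function name and the `leaf_unit` parameter are illustrative assumptions, and the asterisk markers on the stems are not reproduced.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Map each stem to its leaves in ascending order."""
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)            # truncate to the leaf unit
        rows[scaled // 10].append(scaled % 10)  # stem: leading digits, leaf: last digit
    return dict(rows)


data = [78, 65, 90, 86, 79, 51, 79, 62, 84, 101]
display = stem_and_leaf(data)

for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>3} | {leaves}")
print(f"n = {len(data)}, leaf unit = 1")
```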
A stem-and-leaf plot is more informative than a histogram as it displays the raw data. However, it is not suitable for large datasets. There are variations of stem-and-leaf displays.
Stretched stem-and-leaf display :
[Display: stretched stem-and-leaf with stems 3+, 4*, 4+, 5*, 5+; n = 72, Leaf unit = 1]
Example 1.12 Stem-and-leaf display of student record data (created by other statistical package)
Stem-and-Leaf Display: Overall
Stem-and-leaf of Overall   Year = 1   N = 190
Leaf Unit = 1.0

   1   2  2
   2   2  6
   3   3  2
   4   3  5
   8   4  0034
  19   4  55778888999
  33   5  00122233344444
  52   5  5556666777888899999
  71   6  0011111223333334444
 (25)  6  5555566667778888888999999
  94   7  00011112233333344444444
  71   7  5555566677788888899999
  49   8  0000000001111112334444
  27   8  555666666777788999
   9   9  01112223
   1   9  6
The numbers in the first column are the cumulative frequencies of each class interval, accumulating from two ends. Note that (25) is the frequency of the median class interval [65,70).
1.6 Cautions about graphs
Pictures can be deceptive even when there is no intention to deceive.
Lying with statistics : presenting data graphically on a stretched or compressed scale of numbers with the aim of making the data show whatever you want to show.
Statistical tests tend to be more objective than human eyes and are less prone to deception as long as the corresponding assumptions hold.
Example 1.13
Four brands of cigarette : A, B, C, D
Pie chart : Market penetration of the four brands.
[Figure: two pie charts with slice labels A 27%, B 37%, C 18%, D 18%, and two bar charts with axis labels 0, A, B, C, D]
Example 1.14
Example 1.15
Example 1.17
"This may well be the worst graphic ever to find its way into print." --- Tufte (1983)
This graph uses colours, 3D effects, disguised redundancy to display just five numbers. Note the clever use of mirror-imaging -- the top series is just (100 - the bottom series) and the interesting use of curved lines, front and back to avoid the appearance that there's a lot less here than meets the eye. A simple bar chart displaying the same set of data is given below:
1.7 Measures of Skewness
Shapes of histograms: L shape, J shape, bell shape, U shape.
Skewness indicates how far the shape of the histogram departs from a symmetric shape.
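Measures of skewness:

$$\gamma_1 = \frac{\frac{1}{n}\sum (X - \bar{X})^3}{S^3}$$

$\gamma_1 > 0$ : skewed to the right
$\gamma_1 < 0$ : skewed to the left
$\gamma_1 = 0$ : not skewed (symmetric)

$$\gamma_2 = \frac{\text{mean} - \text{median}}{S}$$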
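Both measures can be sketched in Python using the furniture sales data from earlier in the chapter: the third-moment measure, (1/n) × sum of cubed deviations divided by S cubed, and the (mean - median) / S measure. The function name and the symbols `g1` and `g2` are illustrative assumptions, and S is taken to be the sample standard deviation with the n - 1 divisor, as defined earlier.

```python
from statistics import mean, median, stdev


def skewness_measures(values):
    """Return (third-moment skewness, (mean - median) / S)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)   # sample standard deviation S (n - 1 divisor)
    g1 = sum((x - m) ** 3 for x in values) / n / s ** 3
    g2 = (m - median(values)) / s
    return g1, g2


sales = [9, 16, 11, 19, 11, 10, 13, 12, 6, 9]
g1, g2 = skewness_measures(sales)
print(round(g1, 4), round(g2, 4))
```

Both values come out positive for the sales data, which suggests a distribution skewed to the right.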
The Descriptive Statistics command in the Data Analysis add-in provides an integrated calculation for all the summary statistics described in this chapter.