
OVERVIEW AND DESCRIPTIVE STATISTICS

Statistics
Statistics is the branch of science that deals with the collection, presentation, analysis and interpretation of data for the purpose of decision-making and problem-solving.
Statistics is a critical skill in quality improvement, as statistical techniques can be used to describe and to understand variability.

2010 LC Tang. All rights reserved


Population vs Sample
Population
the entire set of measurements of interest
Sample
a subset of data from the population

Parameters vs Statistics
Parameters
numerical measures of a population
Statistics
numerical measures of a sample

Measure               Parameter (Population)   Statistic (Sample)
Mean                  $\mu$                    $\bar{x}$
Variance              $\sigma^2$               $s^2$
Standard Deviation    $\sigma$                 $s$

Population Parameters vs Sample Statistics
$\bar{x}$ = Sample Mean
$s$ = Sample Standard Deviation
$\mu$ = Population Mean
$\sigma$ = Population Standard Deviation
Statistics estimate Parameters

Types of Data: An Overview
Types of Data
Descriptive Orientation
    Response (dependent variable)
    Predictor (independent variable)
"Gappiness"
    Discrete (obtained by counting)
    Continuous (obtained by measuring)
Level of Measurement
    Nominal (grouping)
    Ordinal (grouping & ordering)
    Interval (measured vs scale)
    Ratio (measured vs absolute)

Descriptive Orientation
Whether a variable is to be described by other variables or is used to describe them.
Response or Dependent Variable
the variable under investigation, which is to be described in terms of other variables
Predictor or Independent Variable
a variable used in conjunction with other variables to describe a given response variable

Descriptive Orientation
Y = f(X)
X: KPIV
Y: KPOV

What's In A Name?
Field                  X (Independent Variable)   Y (Dependent Variable)
Statistics             Predictor (Factor)         Response
Systems Engineering    Input                      Output
Quality Engineering    Cause                      Effect (Quality)
Control Engineering    Parameter                  Performance Index
Process Engineering    Control Characteristic     Process Characteristic
Six Sigma              KPIV                       KPOV

Types of Data: Gappiness
Whether there are gaps between successively observed values of a variable.
Discrete Variable
gaps exist between observations
obtained by counting
Values can take on only a particular set of numerical values. This set is often evenly spaced.
Examples:
Stock prices: 34 7/8
Number of soft errors on a disk
Supreme Court votes: 6 to 3

Types of Data: Gappiness
Continuous Variable
no gaps exist between observations
obtained by measuring
Examples:
Baking temperature: 100 +/- 5 degrees Celsius
Pressure: 35 psi

Level of Measurement
Deals with the preciseness of measurement of the variable.
Nominal Variable
values assumed by a variable indicate different categories
Ordinal Variable
allows not only grouping but also ordering of categories
Interval Variable
meaningful measure of the distance between categories
Ratio or Ratio-Scale Variable
an interval variable that has a scale with a true zero

Level of Measurement
Nominal Variable - Categorical & Independent of Relative Size
Application form (select one from each group): Nationality, Marital status, Occupation
List of Field Reps, grouped as "Access to a digital camera" or "Without access to a digital camera": Fred W., Bill S., John D., Sam C., Bob T., Jim C., Joe W., Diane A.

Level of Measurement
Ordinal Variable - Categorical but Orderable
Example 1: Pareto Chart - Paint Adhesion Test
Categories in order of importance: Preparation, Primer Type, Paint Type, Application, Humidity, Operator
Example 2: Customer Survey
Question: How would you rate our service?
Ordinal scale: Excellent, Very Good, Good, Fair, Poor

Level of Measurement
Interval (relative) Scale (no absolute zero)
1. Displaced Scale
2. Dial Gage (with gage block)
3. Relative Velocity
Ratio Scale (absolute zero exists)
1. Ruler
2. Position vs Time at Constant Speed
3. Weight As a Function of Number of Bricks

Objectives
The Primary Objective of Statistics is to Learn from Data
Tools of Statistics:
data reduction: Descriptive Statistics
inferential measurement: Inferential Statistics
identification of relationships: Regression, ANOVA

Statistics: An Overview
Statistics
    Descriptive Statistics
        Graphical Presentations: Charts, Tables
        Numerical Measures: Location, Dispersion, Shape
    Inferential Statistics
        Parameter Estimation: Point Estimate, Interval Estimate
        Hypothesis Testing: Parametric Methods, Nonparametric Methods

Descriptive vs Inferential Statistics
Descriptive Statistics comprises those methods concerned with collecting and describing a set of data so as to yield meaningful information.
Inferential Statistics comprises those methods concerned with the analysis of a subset of data leading to predictions or inferences about the entire set of data.

Descriptive Statistics
Descriptive Statistics
    Graphical Presentations
        Charts: Dot Plot, Box Plot, Histogram, Stem & Leaf Diagram, Bar Chart, Trend Chart
        Tables: Frequency Distribution
    Numerical Measures
        Location: Mean, Median, Mode, Quartiles
        Dispersion: Range, Standard Deviation, Variance, Interquartile Range
        Shape: Skewness, Kurtosis

Numerical Measures
Describes the characteristics of the data set.
Key numerical measures:
measures of location (central tendency)
measures of dispersion (variation)
measures of shape (distribution)

Measures of Location
Mean, Median, Mode, Quartiles

Mean
If the observations in a sample of size n are $x_1, x_2, \ldots, x_n$, then the sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The mean is the most common measure of location or center of the data.

Mean
The pull strengths (in gf) of 10 gold bonding wires are
16.85 16.40 17.21 16.35 16.52 17.04 16.96 17.15 16.59 16.57
The sample mean pull strength for the 10 observations is
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{16.85 + 16.40 + \cdots + 16.57}{10} = \frac{167.64}{10} = 16.764 \text{ gf}$$

Mean
The sample mean $\bar{x}$ represents the average value of all observations in the sample. For a finite population of N observations, the population mean (denoted by $\mu$) may be determined by
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
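The pull-strength calculation can be checked with a few lines of Python. This is only a minimal sketch; the names `pull_strengths` and `sample_mean` are illustrative and do not come from the slides.

```python
# Pull strengths (gf) of the 10 gold bonding wires from the example.
pull_strengths = [16.85, 16.40, 17.21, 16.35, 16.52,
                  17.04, 16.96, 17.15, 16.59, 16.57]


def sample_mean(xs):
    """Return the arithmetic mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)


x_bar = sample_mean(pull_strengths)
print(f"sample mean = {x_bar:.3f} gf")
```

The same function applies to a finite population; only the interpretation of the result (statistic versus parameter) changes.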

Mean
During Operation Desert Storm in 1991, USAF F-117A pilots flew 1270 combat sorties for a total of 6905 hours. Hence, the mean duration of an F-117A mission during this operation was
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6905}{1270} \approx 5.4 \text{ hours}$$

Median
Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote a sample arranged in increasing order of magnitude; then the sample median is defined as
$$\tilde{x} = \begin{cases} x_{([n+1]/2)} & \text{if } n \text{ is odd} \\ \dfrac{x_{(n/2)} + x_{([n/2]+1)}}{2} & \text{if } n \text{ is even} \end{cases}$$
The advantage of the median is that it is not influenced very much by extreme values.

Median
If the sample observations are
1 3 4 2 7 8 6
the sample mean and median are 4.4 and 4 respectively. Both quantities give a reasonable measure of the central tendency of the data.
If the last observation is changed so that the data are
1 3 4 2 7 8 2450
the sample mean is 353.6 while the sample median remains unchanged.

Median
Just as the sample median $\tilde{x}$ is the middle value in a sample, there is a middle value in the population. The population median is that value at which half the population lies below it and half lies above.
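The robustness of the median to a single extreme value can be reproduced with Python's standard `statistics` module. This is a small sketch using the two seven-value samples from the example; the variable names are illustrative.

```python
from statistics import mean, median

original = [1, 3, 4, 2, 7, 8, 6]
with_outlier = [1, 3, 4, 2, 7, 8, 2450]  # last observation replaced by 2450

for label, data in (("original", original), ("with outlier", with_outlier)):
    # The mean uses every value, so one extreme observation shifts it a lot;
    # the median depends only on the middle of the ordered sample.
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```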

Mode
The mode is the observation that occurs most frequently in the sample. The mode may be unique, or there may be more than one mode. Sometimes, the mode may not exist.
If the sample observations are
3 6 9 3 5 8 3 4 6 3 1 10
the sample mode is 3, since it occurs four times.

Mode
If the sample observations are
3 6 9 3 5 8 3 4 6 3 1 10 6 2 5 6
the sample modes are 3 and 6, since they both occur four times.
If the sample observations are
1 3 4 2 7 6 8
the sample mode does not exist.
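The three cases (one mode, several modes, no mode) can be sketched with a frequency count. The helper `sample_modes` is a hypothetical name, and treating a sample in which every value occurs once as having no mode is the convention taken from the slide, not a universal rule.

```python
from collections import Counter


def sample_modes(xs):
    """Return the sorted list of modes, or an empty list when no value repeats."""
    counts = Counter(xs)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Every value occurs exactly once: report that the mode does not exist.
        return []
    return sorted(value for value, count in counts.items() if count == top)


print(sample_modes([3, 6, 9, 3, 5, 8, 3, 4, 6, 3, 1, 10]))              # one mode
print(sample_modes([3, 6, 9, 3, 5, 8, 3, 4, 6, 3, 1, 10, 6, 2, 5, 6]))  # two modes
print(sample_modes([1, 3, 4, 2, 7, 6, 8]))                              # no mode
```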

Quartiles
When an ordered set of data is divided into four equal parts, the division points are called quartiles.
The first or lower quartile Q1 is a value that has approximately 25% of the observations below it in value.
The second quartile Q2 is a value that has approximately 50% of the observations below it in value. It is also called the median.
The third or upper quartile Q3 is a value that has approximately 75% of the observations below it in value.

Quartiles
Twenty ordered observations on the times to failure (in hours) of electrical insulation material are shown below.
204 228 252 300 324 444 624 720 816 912
1176 1296 1392 1488 1512 2520 2856 3192 3528 3710
$$Q_1 = \frac{324 + 444}{2} = 384$$
$$\tilde{x} = Q_2 = \frac{912 + 1176}{2} = 1044$$
$$Q_3 = \frac{1512 + 2520}{2} = 2016$$
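The quartiles in the insulation example are medians of the lower and upper halves of the ordered data. The sketch below follows that approach; the function name `quartiles` and the rule for odd n (the middle value is counted in both halves) are assumptions, since the slides only show an even-sized sample. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def quartiles(xs):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, and upper half.

    Assumption: when n is odd, the middle value is included in both halves.
    """
    data = sorted(xs)
    n = len(data)
    half = (n + 1) // 2  # size of each half (includes the middle value if n is odd)
    lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)


failure_times = [204, 228, 252, 300, 324, 444, 624, 720, 816, 912,
                 1176, 1296, 1392, 1488, 1512, 2520, 2856, 3192, 3528, 3710]

q1, q2, q3 = quartiles(failure_times)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```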

Measures of Dispersion
Range, Variance, Standard Deviation, Inter-Quartile Range

Range
The sample range is defined as the difference between the largest and the smallest sample observations, i.e.
r = x(max) - x(min)
The range is the simplest measure of dispersion or variation of the data. However, it ignores all the information in the sample between the smallest and the largest observations.

Range
Consider the two samples 1, 3, 5, 8, 9 and 1, 5, 5, 5, 9. Both have the same range (r = 8). However, in the second sample there is variability only in the two extreme values, while in the first sample the middle values vary considerably.
When the sample is small (n ≤ 10), the information loss associated with the range is not too serious.

Variance & Standard Deviation
If $x_1, x_2, \ldots, x_n$ is a sample of n observations, then the sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The sample standard deviation is the positive square root of the sample variance, i.e.
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Variance & Standard Deviation
The sample variance and the sample standard deviation are the most important measures of dispersion.
The units of measurement for the sample variance are the square of the original units of the variable.
The units of measurement for the sample standard deviation are the original units of the variable.
A smaller value of $s^2$ (and $s$) implies less variability.

Variance & Standard Deviation
For the two samples quoted earlier
Sample A: 1, 3, 5, 8, 9
Sample B: 1, 5, 5, 5, 9

                        Sample A   Sample B
Range                   8          8
Inter-Quartile Range    5          0
Variance                11.20      8.00
Standard Deviation      3.35       2.83
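The comparison of Sample A and Sample B can be reproduced with Python's standard `statistics` module. This is a sketch: `dispersion_summary` is a hypothetical helper, and the `"inclusive"` quartile method is an assumption about which quartile convention was used for the table (it happens to give the tabulated inter-quartile ranges).

```python
from statistics import quantiles, stdev, variance

samples = {"A": [1, 3, 5, 8, 9], "B": [1, 5, 5, 5, 9]}


def dispersion_summary(xs):
    """Return range, inter-quartile range, sample variance and sample standard deviation."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")  # quartile convention assumed
    return {
        "range": max(xs) - min(xs),
        "iqr": q3 - q1,
        "variance": variance(xs),  # divides by n - 1
        "stdev": stdev(xs),        # positive square root of the sample variance
    }


for name, xs in samples.items():
    s = dispersion_summary(xs)
    print(f"Sample {name}: range={s['range']}, IQR={s['iqr']}, "
          f"variance={s['variance']:.2f}, stdev={s['stdev']:.2f}")
```

Both samples share the same range, but every other measure shows that Sample B is less variable.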

Computation of Variance
Method 1:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Method 2:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

Computation of Variance (Method 1)
Observations:
i      x_i    x_i - x̄    (x_i - x̄)^2
1      90     -58        3364
2      128    -20        400
3      205    57         3249
4      140    -8         64
5      165    17         289
6      160    12         144
Sum    888    0          7510

$$\bar{x} = \frac{888}{6} = 148$$
$$s^2 = \frac{7510}{6 - 1} = 1502 \text{ psi}^2$$

Computation of Variance (Method 2)
i      x_i    x_i^2
1      90     8,100
2      128    16,384
3      205    42,025
4      140    19,600
5      165    27,225
6      160    25,600
Sum    888    138,934

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} = \frac{138{,}934 - \frac{(888)^2}{6}}{6 - 1} = \frac{7510}{5} = 1502 \text{ psi}^2$$

Variance & Standard Deviation
Analogous to the sample variance $s^2$, there is a measure of variability in the population: the population variance $\sigma^2$. The population standard deviation $\sigma$ is the positive square root of the population variance.
For a finite population comprising N values,
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
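The two computation methods can be compared directly in code. The sketch below uses the six observations (in psi) from the worked tables; the function names are illustrative. Both routes give the same result for this sample, although in floating-point arithmetic Method 2 can lose accuracy when the mean is very large relative to the spread.

```python
observations = [90, 128, 205, 140, 165, 160]  # psi


def variance_method_1(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def variance_method_2(xs):
    """Shortcut: (sum of squares minus (sum)^2 / n), divided by n - 1."""
    n = len(xs)
    sum_of_squares = sum(x * x for x in xs)
    return (sum_of_squares - sum(xs) ** 2 / n) / (n - 1)


print(variance_method_1(observations))  # psi^2
print(variance_method_2(observations))  # psi^2
```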

Inter-Quartile Range
The inter-quartile range is another measure of dispersion. IQR = Q3 - Q1 The inter-quartile range is less sensitive to extreme values in a sample than the range. For the two samples (1, 3, 5, 8, 9 and 1, 5, 5, 5, 9), their inter-quartile ranges are 5 and 0 respectively.

Measures of Shape
Skewness Kurtosis


Skewness
The degree of asymmetry of a distribution around its mean is referred to as its skewness.
Positive skewness implies a distribution with an asymmetric tail extending towards higher values. Sometimes referred to as right-handed skew. Negative skewness implies a distribution with an asymmetric tail extending towards lower values. Sometimes referred to as left-handed skew.


Skewness
If the data are symmetric, the mean and median will coincide. If, in addition, the data are unimodal, then the mean, median and mode will all coincide.
If the data are skewed, the mean, median and mode will generally not coincide.
For right-handed skewness, typically: mode < median < mean
For left-handed skewness, typically: mode > median > mean

Kurtosis
Kurtosis characterizes the relative peakedness or flatness of a distribution compared to a normal (mesokurtic) distribution. It is reported here relative to the normal distribution, so a normal distribution scores approximately zero.
Positive kurtosis indicates a relatively peaked (leptokurtic) distribution compared to the normal distribution.
Negative kurtosis indicates a relatively flat (platykurtic) distribution compared to the normal distribution.
Kurtosis is most readily interpreted for symmetrical distributions.

Skewness and Kurtosis
$$\text{Skewness} = \frac{n}{(n-1)(n-2)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$$
$$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4} - \frac{3(n-1)^2}{(n-2)(n-3)}$$
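The sample skewness and kurtosis formulas (the same small-sample forms used by Excel's SKEW and KURT functions) can be sketched in plain Python. The function names and the two test samples are illustrative only; skewness needs at least 3 observations and kurtosis at least 4.

```python
from statistics import mean, stdev


def sample_skewness(xs):
    """n / ((n-1)(n-2)) * sum((x - x_bar)^3) / s^3, with s the sample standard deviation."""
    n = len(xs)
    x_bar, s = mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum((x - x_bar) ** 3 for x in xs) / s ** 3


def sample_kurtosis(xs):
    """Kurtosis relative to the normal distribution (a normal sample scores near zero)."""
    n = len(xs)
    x_bar, s = mean(xs), stdev(xs)
    fourth = sum((x - x_bar) ** 4 for x in xs) / s ** 4
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_skewed = [1, 2, 3, 4, 20]    # hypothetical sample with a long right tail

print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(right_skewed), sample_kurtosis(right_skewed))
```

A symmetric sample gives a skewness of zero, while the long right tail in the second sample gives a positive value.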

Excel's Descriptive Statistics
Numerical Measure       Excel's Built-In Function
Mean                    =AVERAGE(data set)
Median                  =MEDIAN(data set)
Mode                    =MODE(data set)
Quartile                =QUARTILE(data set, quartile)
Range                   =MAX(data set) - MIN(data set)
Variance                =VAR(data set)
Standard Deviation      =STDEV(data set)
Inter-Quartile Range    =QUARTILE(data set, 3) - QUARTILE(data set, 1)
Skewness                =SKEW(data set)
Kurtosis                =KURT(data set)

MINITAB's Descriptive Statistics
Stat > Basic Statistics > Display Descriptive Statistics
produces statistics for each column of data (or subsets within a column) and displays them in the Session Window and optionally in a graph
the user has no control over which statistics are computed/displayed
Stat > Basic Statistics > Store Descriptive Statistics
descriptive statistics for each column (or subsets within a column) are displayed in adjacent columns within the Worksheet
the user can select which statistics are to be computed/displayed, but has no control over the order in which they are displayed

Example 1
C1 of Basic Statistics.MTW contains the measurements of a certain quality characteristic for 500 production units. Determine the appropriate numerical measures.

Example 1
Stat > Basic Statistics > Store Descriptive Statistics

Example 1
Stat > Basic Statistics > Display Descriptive Statistics
It would seem that the data set is somewhat symmetrical about the mean (since skewness ≈ 0), but with a flatter distribution than the normal distribution.

Example 1
Descriptive Statistics
Variable: Dist-1

Anderson-Darling Normality Test
A-Squared       27.108
P-Value         0.000

Mean            100.000
StDev           32.385
Variance        1048.78
Skewness        7.16E-03
Kurtosis        -1.63184
N               500

Minimum         41.771
1st Quartile    68.695
Median          104.202
3rd Quartile    130.809
Maximum         162.821

95% Confidence Interval for Mu        97.154 to 102.846
95% Confidence Interval for Sigma     30.494 to 34.527
95% Confidence Interval for Median    82.784 to 117.663

[Figure: graphical summary of Dist-1, annotated "What the data meant" and "What we assumed"]

Graphical Presentations
Visual interpretation of the data set.
Common graphical tools to illustrate a data set:
Dot Plot
Box Plot
Frequency Distribution
Histogram
Stem & Leaf Diagram

Box & Whisker Plot
The box displays
the lower quartile (Q1)
the median (Q2)
the upper quartile (Q3)
The whiskers extend to
the minimum
the maximum
Not to be used when sample size is less than 10 units.
[Figure: box plot labelled Minimum, Q1, Median, Q3, Maximum]

Box & Whisker Plot
1. Arrange the data set in ascending order.
2. Determine the minimum and maximum, the lower and upper quartiles, the median, and Tukey's outliers.
3. Draw a box that extends from the lower quartile to the upper quartile; the median is a line drawn through the box.
4. Draw lines (or whiskers) that extend from the ends of the box
   a) from Q1 to Max{Minimum, Q1 - 1.5 IQR}
   b) from Q3 to Min{Maximum, Q3 + 1.5 IQR}
5. Outliers are represented by appropriate symbols
   a) mild or possible outliers
   b) severe or extreme outliers
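The construction steps for a box and whisker plot can be sketched numerically before anything is drawn. In the sketch below, `box_plot_summary` and the ten-value data set are hypothetical, and the `"inclusive"` quartile method is an assumed convention. Points beyond the whisker ends are simply reported as outliers; the split into mild and severe outliers is not attempted because the slides do not state the cut-off.

```python
from statistics import quantiles


def box_plot_summary(xs):
    """Return the quantities needed to draw a box and whisker plot."""
    data = sorted(xs)                                       # step 1
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")   # step 2 (convention assumed)
    iqr = q3 - q1
    lower_end = max(data[0], q1 - 1.5 * iqr)                # step 4a
    upper_end = min(data[-1], q3 + 1.5 * iqr)               # step 4b
    outliers = [x for x in data if x < lower_end or x > upper_end]  # step 5
    return {"q1": q1, "median": q2, "q3": q3,
            "lower_whisker": lower_end, "upper_whisker": upper_end,
            "outliers": outliers}


# Hypothetical sample with one unusually large value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for key, value in summary.items():
    print(f"{key}: {value}")
```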

Histogram
The histogram, a graphical presentation of the frequency distribution, provides a visual impression of the shape of the distribution of measurements.
X-axis: measurement scale
Y-axis: frequency (or relative frequency) scale
Area of each class interval's rectangle is proportional to the frequency for that class interval.

Histogram
Sample Size = 100 units
a) Bins = 5, Width = 40
b) Bins = 9, Width = 20
c) Bins = 18, Width = 10
[Figure: three histograms, panels a) to c), of Frequency, f(x) against Compressive Strength, x]
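The effect of the bin width on a histogram can be explored by building the frequency distribution by hand. The sketch below is illustrative only: `frequency_distribution`, the starting value of 70 and the 25 compressive-strength values are all hypothetical and are not the 100-unit data set behind the slides.

```python
def frequency_distribution(xs, start, width, bins):
    """Count observations in `bins` equal-width class intervals [lo, lo + width)."""
    counts = [0] * bins
    for x in xs:
        k = int((x - start) // width)  # index of the class interval containing x
        if 0 <= k < bins:
            counts[k] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical compressive strength readings.
strengths = [76, 105, 110, 118, 121, 133, 135, 141, 145, 149, 150, 153, 158,
             160, 163, 167, 171, 176, 181, 190, 199, 207, 218, 229, 245]

for width, bins in ((40, 5), (20, 9), (10, 18)):
    print(f"Bins = {bins}, Width = {width}")
    for lo, hi, count in frequency_distribution(strengths, 70, width, bins):
        print(f"  {lo:>3} to {hi:<3} {'*' * count}")
```

Wide bins smooth away detail, while narrow bins make the shape ragged; the bin counts always sum to the sample size as long as the intervals cover the data.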