Lecture 1: Introduction: Statistics Is Concerned With

Lecture 1: Introduction
Statistics is concerned with

Collecting and presenting data to assist decision making
Processing and analyzing data
Obtaining reliable forecasts
Examples involving statistics

To inspect the incoming goods from a supplier (One-sample hypothesis testing)
Developers of a new hypertension drug want to determine if the drug lowers blood pressure (Two-sample hypothesis testing)
In marketing, statistics is used to evaluate whether higher spending on advertising is justified (Simple linear regression)
To forecast economic indices, such as GNP, GDP, etc., related to many factors (Multiple linear regression)
Key Definitions
A population (universe) is the collection of all members of a group
N represents the population size
A sample is a portion of the population selected for analysis

n represents the sample size
A parameter is a numerical measure that describes a characteristic of a population
A statistic is a numerical measure that describes a characteristic of a sample
Population vs. Sample

Population: a b c d e f g h i j k l m n o p q r s t u v w x y z
Measures used to describe a population are called parameters
Sample: b c g i n o r u y
Measures computed from sample data are called statistics
Examples
Population                                                    Sample
All eligible voters                                           1000 voters polled
All light bulbs manufactured in a day                         100 light bulbs selected
All patients with high blood pressure for a clinical study    200 hypertension patients enrolled for a clinical study
Two branches of statistics

Descriptive Statistics
Collecting, presenting, and characterizing data
Inferential Statistics
Drawing conclusions and/or making decisions concerning a population based only on sample data
Descriptive Statistics
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Characterize data
e.g., Sample mean = $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
Inferential statistics
Population: Use parameters to summarize features
Sample: Use statistics to summarize features
Drawing conclusions about a population based on sample results.

Two types of Inferential Statistics

Estimation
e.g., Estimate the population mean weight using the sample mean weight
Hypothesis testing
e.g., Test the claim that earnings are higher for males than for females
Reasons for Drawing a Sample

Less Time Consuming Than a Census
Less Costly to Administer Than a Census
Less Cumbersome and More Practical to Administer Than a Census of the Population
Types of Data
Data
Categorical
Examples: Marital Status, Political Party, Eye Color (Defined categories)
Numerical
Discrete
Examples: Number of Children, Defects per hour (Counted items)
Continuous
Examples: Weight, Distance (Measured characteristics)
Descriptive Statistics: Graphical description of Numerical Data

Numerical Data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
Stem-and-Leaf Display:
2 | 144677
3 | 028
4 | 1
Frequency Distributions & Cumulative Distributions
Tables
Histograms
Stem-and-Leaf Display
A simple way to see distribution details in a data set
METHOD: Separate the sorted data series into leading digits (the stem) and the trailing digits (the leaves)
Data in Raw Form (as Collected): 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in Ordered Array from Smallest to Largest: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-Leaf Display:
2 | 144677
3 | 028
4 | 1
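The method above can be sketched in a few lines of Python. This is only an illustration, assuming two-digit whole numbers so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is made up for this sketch.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted two-digit integers by tens digit (stem); units digits are leaves."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 27 -> stem 2, leaf 7
        groups[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(groups.items())}

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]  # raw data from the slide
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(stem, "|", leaves)
# 2 | 144677
# 3 | 028
# 4 | 1
```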
Tabulating Numerical Data: Frequency Distributions

What is a Frequency Distribution?
A frequency distribution is a list or a table containing class groupings (ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category
It allows for a quick visual interpretation of the data
Tabulating Numerical Data: Frequency Distributions

Condenses the raw data and allows for a quick visual interpretation of the data
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Sort Raw Data in Ascending Order

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find Range: 58 - 12 = 46
Select Number of Classes: 5 (usually between 5 and 15)
Compute Class Interval (Width): 10 (46/5, then round up)
Determine Class Boundaries (Limits): 10, 20, 30, 40, 50, 60
Count Observations & Assign to Classes
Frequency Distributions and Percentage Distributions

Data in Ordered Array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class       Frequency   Relative Frequency   Percentage
[10, 20)        3              .15               15
[20, 30)        6              .30               30
[30, 40)        5              .25               25
[40, 50)        4              .20               20
[50, 60)        2              .10               10
Total          20             1                 100
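As a rough sketch, the counting step can be reproduced in Python. The class boundaries are the ones chosen above; the half-open convention [lower, upper) follows the table, and the helper name `frequency_table` is an assumption of this sketch.

```python
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
boundaries = [10, 20, 30, 40, 50, 60]

def frequency_table(data, bounds):
    """Return (lower, upper, frequency, relative frequency) for each class [lower, upper)."""
    rows = []
    for lo, hi in zip(bounds, bounds[1:]):
        count = sum(lo <= x < hi for x in data)  # upper limit is excluded
        rows.append((lo, hi, count, count / len(data)))
    return rows

table = frequency_table(temps, boundaries)
for lo, hi, freq, rel in table:
    print(f"[{lo}, {hi})  frequency={freq}  relative={rel:.2f}")
```

The frequencies come out as 3, 6, 5, 4, 2, which sum to the 20 observations.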
Histogram Example
Class       Class Midpoint   Frequency
[10, 20)         15              3
[20, 30)         25              6
[30, 40)         35              5
[40, 50)         45              4
[50, 60)         55              2
[Figure: Histogram: Daily High Temperature (x-axis: Class Midpoints; y-axis: Frequency)]
(No gaps between bars)
Distribution Shape
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center.
[Figure: Symmetric Distribution (histogram; y-axis: Frequency)]
Distribution Shape
(continued)
The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center.
A positively skewed distribution (skewed to the right) has a tail that extends to the right in the direction of positive values.
[Figure: Positively Skewed Distribution (histogram; y-axis: Frequency)]
A negatively skewed distribution (skewed to the left) has a tail that extends to the left in the direction of negative values.
[Figure: Negatively Skewed Distribution (histogram; y-axis: Frequency)]
What is the shape of distribution of daily high temperature?

[Figure: Histogram: Daily high temperature (y-axis: Frequency)]
Numerical description
Summary Measures
Central Tendency (location measures): Mean, Median, Mode
Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation
Mean
Mean (Arithmetic Mean) of Data Values
Sample mean (n = Sample Size):
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
Population mean (N = Population Size):
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
An example
TV watching hours/week: 5, 7, 3, 38, 7
Mean = (5 + 7 + 3 + 38 + 7)/5 = 60/5 = 12
If the correct time for the 4th subject is 8 (not 38):

Mean = (5 + 7 + 3 + 8 + 7)/5 = 30/5 = 6
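The same arithmetic as a small Python sketch, using the TV-watching data above; the helper `mean` and the variable names are illustrative only.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

hours_recorded = [5, 7, 3, 38, 7]    # 4th subject recorded as 38
hours_corrected = [5, 7, 3, 8, 7]    # 4th subject corrected to 8

print(mean(hours_recorded))   # 12.0
print(mean(hours_corrected))  # 6.0
```

Changing a single observation halves the mean here, which is why the mean is described as sensitive to extreme values.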
Mean (Cont'd)

The Most Common Measure of Central Tendency, especially when n is large
Affected by Extreme Values (Outliers)
Median
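Robust measure of central tendency
Not affected by extreme values
TV watching hours with the outlier, ordered (3, 5, 7, 7, 38): Median = 7
TV watching hours with the corrected value, ordered (3, 5, 7, 7, 8): Median = 7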
In an ordered array, the median is the middle number

If n is odd, the median is the middle number (i.e., the (n+1)/2-th measurement)
If n is even, the median is the average of the n/2-th and (n/2 + 1)-th measurements
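A minimal Python sketch of the odd/even rule, assuming the data can be sorted; the function name `median` is chosen for this illustration (Python's standard library also offers `statistics.median`).

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2-th measurement (1-based)
    return (ordered[mid - 1] + ordered[mid]) / 2  # n/2-th and (n/2 + 1)-th measurements

print(median([5, 7, 3, 38, 7]))  # 7, with the outlier
print(median([5, 7, 3, 8, 7]))   # 7, with the corrected value
print(median([1, 2, 3, 4]))      # 2.5, an even-sized made-up data set
```

Unlike the mean, the median of the TV-watching data is the same with or without the outlier.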
Mode
A Measure of Central Tendency
Value that Occurs Most Often
Not Affected by Extreme Values
There May Not Be a Mode
There May Be Several Modes
Used for Either Numerical or Categorical Data
[Figures: two dot plots, one with Mode = 9 and one with No Mode]
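One way to sketch these properties in Python is shown here. The convention that a data set has no mode when every value occurs only once is an assumption of this sketch, and the eye-color list is made-up illustration data.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([5, 7, 3, 38, 7]))                            # [7]
print(modes([0, 1, 2, 3, 4, 5, 6]))                       # [] (no mode)
print(modes(["brown", "blue", "brown", "blue", "green"])) # ['blue', 'brown'] (several modes)
```

The last call shows that the mode also works for categorical data, where a mean or median would not be defined.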
Which measure of location is the best?

The mean is generally used, unless extreme values (outliers) exist
In that case the median is often used, since the median is not sensitive to extreme values
Example: Median home prices may be reported for a region, because the median is less sensitive to outliers
Quartiles
Split ordered data into 4 quarters
25% | 25% | 25% | 25%
    Q1    Q2    Q3
Position of i-th quartile: $(Q_i) = \frac{i(n+1)}{4}$
Q1 (1st quartile) and Q3 (3rd quartile) are measures of Noncentral Location
Q1, Q2, and Q3 are called the 25th, 50th, and 75th percentiles, respectively.
A pth percentile is the value of X such that p% of the measurements are less than X and (100-p)% are greater than X.
Quartiles (example)

Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21
Position of first quartile is 1(10 + 1)/4 = 2.75
Q1 = 6
Position of third quartile is 3(10 + 1)/4 = 8.25
Q3 = 15 + 0.25 (18 - 15) = 15.75
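The position rule i(n + 1)/4 with linear interpolation between neighbouring values, as used for Q3 above, can be sketched as follows. The function name `quartile` is illustrative, and other textbooks and software packages use slightly different quartile rules, so results may differ elsewhere.

```python
def quartile(ordered, i):
    """i-th quartile (i = 1, 2, 3) of already-sorted data via position i(n+1)/4."""
    n = len(ordered)
    pos = i * (n + 1) / 4      # 1-based position, possibly fractional
    lower = int(pos)           # whole part of the position
    frac = pos - lower         # fractional part used for interpolation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
print(quartile(data, 1))  # 6.0   (position 2.75, between 6 and 6)
print(quartile(data, 3))  # 15.75 (position 8.25, between 15 and 18)
```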
5-number summary

Box-and-Whisker Plot
Graphical display of data using the 5-number summary
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21
X_smallest = 3, Q1 = 6, Median (Q2) = 12, Q3 = 15.75, X_largest = 21
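For this data set, the five numbers can also be obtained with Python's standard library. This sketch relies on `statistics.quantiles` with `method="exclusive"` (Python 3.8+), which places quartiles at positions i(n + 1)/4 and interpolates; that this matches the lecture's rule in general is an assumption, although it agrees on this example.

```python
import statistics

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
five_number = (min(data), q1, q2, q3, max(data))
print(five_number)  # (3, 6.0, 12.0, 15.75, 21)
```

A box-and-whisker plot draws the box from Q1 to Q3 with a line at the median, and the whiskers out to the smallest and largest values.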
Example: Comparing variations

Suppose that you are a purchasing agent for a large manufacturing firm and that you regularly place orders with two different suppliers (A & B). The numbers of days required to fill orders are the following:
A: 9, 10, 10, 10, 10, 10, 11, 11, 11, 11
B: 7, 7, 8, 10, 10, 10, 11, 12, 13, 15
Which supplier do you prefer?

Supplier A: Mean = 10.3, Median = Mode = 10
Supplier B: Mean = 10.3, Median = Mode = 10
[Figures: frequency histograms of # of days (7 to 15) for Supplier A and Supplier B]
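A quick Python check of the supplier data, offered as a sketch: the three location measures agree for both suppliers, while the gap between the fastest and slowest delivery does not. The helper `summarize` is a made-up name.

```python
import statistics

supplier_a = [9, 10, 10, 10, 10, 10, 11, 11, 11, 11]
supplier_b = [7, 7, 8, 10, 10, 10, 11, 12, 13, 15]

def summarize(days):
    """Location measures plus the gap between slowest and fastest delivery."""
    return {
        "mean": statistics.mean(days),
        "median": statistics.median(days),
        "mode": statistics.mode(days),
        "spread": max(days) - min(days),
    }

print(summarize(supplier_a))  # spread of 2 days
print(summarize(supplier_b))  # spread of 8 days
```

Both suppliers have the same center, but Supplier A's delivery times are far more consistent.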
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation
Measures of variation give information on the spread or variability of the data values.
Same center, different variation
Range
Easy to compute
Difference between the Largest and the Smallest Observations:
Range = X_Largest - X_Smallest

Example (data on the axis 7 8 9 10 11 12):
Range = 12 - 7 = 5
Disadvantages of the Range

Ignores the way in which data are distributed
[Figure: two differently distributed data sets on the axis 7 to 12, each with Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
Interquartile Range

Difference between the First and Third Quartiles
Data in Ordered Array: 3 6 6 12 12 12 15 15 18 21
Interquartile range = Q3 - Q1 = 15.75 - 6 = 9.75
Not Affected by Extreme Values
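To illustrate the robustness claim, the sketch below compares the range and the interquartile range before and after replacing the largest observation with an extreme value. The value 210 is hypothetical, and the quartiles use `statistics.quantiles` with `method="exclusive"`, which reproduces Q1 = 6 and Q3 = 15.75 for this data set.

```python
import statistics

def value_range(data):
    """Largest observation minus smallest observation."""
    return max(data) - min(data)

def iqr(data):
    """Interquartile range Q3 - Q1, quartiles at positions i(n+1)/4."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return q3 - q1

data = [3, 6, 6, 12, 12, 12, 15, 15, 18, 21]
with_outlier = data[:-1] + [210]   # hypothetical: 21 replaced by 210

print(value_range(data), iqr(data))                  # 18 9.75
print(value_range(with_outlier), iqr(with_outlier))  # 207 9.75
```

The range jumps from 18 to 207, while the interquartile range stays at 9.75.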
Variance
Sample Variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Population Variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Standard Deviation
Most widely used Measure of Variation
Has the Same Units as the Original Data
Sample Standard Deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Population Standard Deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Examples
Data set 11, 12, 13, 16, 16, 17, 18, 21, n = 8
$\bar{X} = \frac{1}{8}(11 + 12 + \cdots + 21) = 15.5$
$X_i - \bar{X}$ = -4.5, -3.5, -2.5, 0.5, 0.5, 1.5, 2.5, 5.5

$s^2 = \frac{1}{7}\left[(-4.5)^2 + (-3.5)^2 + \cdots + (5.5)^2\right] = 11.14$
$s = \sqrt{s^2} = \sqrt{11.14} = 3.34$
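The worked example can be replayed step by step in Python. This is a sketch of the definitional formula with divisor n - 1; the function name `sample_variance` is illustrative.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [11, 12, 13, 16, 16, 17, 18, 21]
s2 = sample_variance(data)   # 78 / 7
s = math.sqrt(s2)

print(round(s2, 2))  # 11.14
print(round(s, 2))   # 3.34
```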
Computational formula for s:

$s = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]}$
All we need to know are $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i^2$
Example (revisit)

Data set 11, 12, 13, 16, 16, 17, 18, 21
$\sum_{i=1}^{8} X_i = 11 + 12 + \cdots + 21 = 124$
$\sum_{i=1}^{8} X_i^2 = 11^2 + 12^2 + \cdots + 21^2 = 2000$

$s = \sqrt{\frac{1}{7}\left[2000 - \frac{1}{8}(124)^2\right]} = 3.34$
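A sketch of the computational formula in Python, which needs only the sum of the values and the sum of their squares. It is checked here against `statistics.stdev`, which also uses the n - 1 divisor; the function name `stdev_computational` is made up for this illustration.

```python
import math
import statistics

def stdev_computational(values):
    """Sample standard deviation from sum(X) and sum(X^2) only."""
    n = len(values)
    total = sum(values)                      # sum of X_i
    total_sq = sum(x * x for x in values)    # sum of X_i squared
    return math.sqrt((total_sq - total ** 2 / n) / (n - 1))

data = [11, 12, 13, 16, 16, 17, 18, 21]
print(sum(data), sum(x * x for x in data))   # 124 2000
print(round(stdev_computational(data), 2))   # 3.34
print(math.isclose(stdev_computational(data), statistics.stdev(data)))  # True
```

With floating-point data whose mean is very large relative to its spread, this one-pass form can lose precision to cancellation, so the definitional formula is the safer choice there.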
Advantages of Variance and Standard Deviation

Each value in the data set is used in the calculation
Values far from the mean are given extra weight (because deviations from the mean are squared)
Visualizing variation
Small standard deviation
Large standard deviation
