Statistics I

Chapter 1: Introduction

Contents
I What is ’Statistics’? Definition
I Key-words: population, parameter, sample, statistic, population size,
sample size, individuals, objects
I Types of variables: categorical (ordinal, nominal) and numerical
(discrete, continuous)
I Why sample? Definition of a simple random sample
I Frequencies and frequency distribution/table: absolute, absolute
cumulative, relative, relative cumulative. Properties.

Recommended reading
I Peña, D., Romo, J., ’Introducción a la Estadística para las Ciencias
Sociales’
    I Chapters 1, 2, 3
I Newbold, P. ’Estadística para los Negocios y la Economía’ (2009)
    I Chapter 1
    I Sections 2.1, 2.4, 2.7. How to lie with Statistics

Definition of Statistics

Def. Statistics is a science that deals with:


I collecting, organizing, summarizing, presenting, interpreting,
processing data to transform data into information
⇐ Descriptive Statistics
I predictions, forecasts, estimation
⇐ Inferential Statistics
• On what occasions have you heard or seen the word ’statistics’?
◦ football/tennis match summary
◦ unemployment rates, number of people injured in car accidents
• There is much more to statistics than percentages and counts!
Key-words

I A population is the complete collection of all
items/individuals/objects/subjects of interest or under investigation
N represents the population size
I A sample is an observed subset of the population, typically chosen to
investigate the properties of a parent population
n represents the sample size
I A parameter is a specific characteristic of a population (fixed)
I A statistic is a specific characteristic of a sample (varies from sample
to sample)
I A variable is a characteristic of an individual
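
The difference between a parameter (fixed) and a statistic (varies from sample to sample) can be illustrated with a minimal sketch. The population below is simulated and purely hypothetical (heights in cm, N = 1000), and the names `population`, `parameter` and `statistics_from_samples` are introduced only for this illustration.

```python
import random
from statistics import mean

# Hypothetical population: simulated heights in cm of N = 1000 individuals
# (made-up numbers, not real data).
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
population = [rng.gauss(170, 10) for _ in range(1000)]

# Parameter: computed from the whole population, so it is a fixed number.
parameter = mean(population)

# Statistic: computed from a sample of size n = 50, so it changes
# every time a different sample is drawn.
statistics_from_samples = [mean(rng.sample(population, 50)) for _ in range(3)]

print(f"parameter (population mean): {parameter:.2f}")
for i, s in enumerate(statistics_from_samples, start=1):
    print(f"statistic (mean of sample {i}): {s:.2f}")
```

Each run of the last loop prints three slightly different sample means, while the population mean stays the same.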

Examples

I Pop: all students at UC3M. Variable: height ∈ (0, ∞)
Param: average height of all students. Statistic: average height of
sampled students
I Pop: all fish in a sea. Variable: size ∈ {L, M, S}
Param: number of small fish in the entire sea. Statistic: number of
small fish caught
I Pop: all patients of Getafe Hospital. Variable: blood type ∈
{A, B, AB, O}
Param: percentage of all patients with AB. Statistic: percentage of
sampled patients with AB
I Pop: all Philips light bulbs. Variable: life expectancy in days
∈ {0, 1, 2, . . .}
Param: variation in life expectancy of all light bulbs. Statistic:
variation in life expectancy of sampled light bulbs
Types of data

Data (Variable)

Categorical (Qualitative)
    Ordinal: classes can be ranked. Example: clothes size, L > M > S
    Nominal: no natural order. Example: blood type, A, B, AB, O
Numerical (Quantitative)
    Discrete: integer values. Example: # of children, 0, 1, 2, . . .
    Continuous: non-integer values allowed. Example: height, 1.55 m, 1.71 m

Notation: Letters X, Y, Z are typically used. Example:

X = height in m (upper-case letters for the variable)
x = 1.55 (lower-case letters for specific values)
x1 = 1.55, x2 = 1.71 (add subscripts if more than one value)

Why sample?

In practice we don’t study the whole population because:

I Measuring may destroy the items (e.g. life expectancy of a light bulb)
I The population may exist as a concept but not in reality (e.g. population
of defective items)
I It may be impractical (e.g. population of all fish in a sea)
I Too expensive
I Too time consuming
Definition of a simple random sample (SRS)

Def. A simple random sample is obtained in such a way that

I each member of the population is chosen strictly by chance
I each member of the population is equally likely to be chosen, and
I every possible sample of n objects is equally likely to be chosen
Notation: Sample of size n from a variable X means that:
I We have n individuals selected at random from a population
I For each of the individuals we report the value of the variable X
I If X is categorical or discrete, it is convenient to write the different
sample values that X takes as x1 , x2 , . . . , xk , k ≤ n (ranked from the
smallest to the largest, unless X is nominal)
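
As a rough illustration of the definition and notation above, the sketch below draws a simple random sample with Python's standard `random.sample`, which selects without replacement so that every subset of size n is equally likely. The population labels and the values of X (number of children) are hypothetical and exist only for this example.

```python
import random

# Hypothetical population of N = 10 labelled individuals.
population = [f"p{i}" for i in range(1, 11)]

# Hypothetical discrete variable X (number of children) for each individual.
x_by_individual = dict(zip(population, [0, 2, 1, 1, 3, 0, 2, 2, 1, 0]))

n = 4
rng = random.Random(42)  # fixed seed only to make the sketch reproducible

# Simple random sample: n distinct individuals chosen strictly by chance.
sample = rng.sample(population, n)

# Report the value of X for each sampled individual.
observed = [x_by_individual[p] for p in sample]

# Different sample values x1 < x2 < ... < xk, with k <= n.
distinct_values = sorted(set(observed))
k = len(distinct_values)

print("sampled individuals:", sample)
print("observed values of X:", observed)
print("distinct values:", distinct_values, "k =", k)
```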

Frequencies and frequency distribution

Def. A frequency distribution is


I a list or a table . . .
I containing class groupings (categories or ranges within which the
data fall) . . .
I and the corresponding frequencies with which data fall within each
class or category
Frequencies:
I absolute (number of times the value appeared in the sample)
I relative (proportion of times the value appeared in the sample)
Why use frequency distributions?

I A frequency distribution is a way to summarize data


I The distribution condenses the raw data into a more useful form . . .
I and allows for a quick visual interpretation of the data

Grouping by classes: categorical and discrete data

                                   Cumulative      Cumulative
           Absolute   Relative     Absolute        Relative
Class, xi  Freq, ni   Freq, fi     Freq, Ni        Freq, Fi
x1         n1         f1 = n1/n    N1 = n1         F1 = f1
x2         n2         f2 = n2/n    N2 = N1 + n2    F2 = F1 + f2
...        ...        ...          ...             ...
xk         nk         fk = nk/n    Nk = n          Fk = 1
Total      n          1

Note:
I ni = number of times xi appears in the sample, fi = ni / n

I Ni = Ni−1 + ni , Fi = Fi−1 + fi
I 0 ≤ fi , Fi ≤ 1
I Fi and Ni do not make sense for categorical-nominal variables
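
The four kinds of frequencies can be computed mechanically. The following is a minimal sketch, not a reference implementation: the function name `frequency_table` and the sample values are made up for illustration, and the cumulative relative frequency is computed as Ni / n, which equals Fi−1 + fi but avoids accumulating rounding error.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(sample):
    """Return rows (x_i, n_i, f_i, N_i, F_i) for the distinct values of a sample."""
    n = len(sample)
    counts = Counter(sample)
    values = sorted(counts)                   # x_1 < x_2 < ... < x_k
    abs_freq = [counts[x] for x in values]    # n_i
    rel_freq = [c / n for c in abs_freq]      # f_i = n_i / n
    cum_abs = list(accumulate(abs_freq))      # N_i = N_{i-1} + n_i
    cum_rel = [N / n for N in cum_abs]        # F_i = N_i / n (same as F_{i-1} + f_i)
    return list(zip(values, abs_freq, rel_freq, cum_abs, cum_rel))

# Hypothetical sample of a discrete variable (made up for illustration).
sample = [2, 0, 1, 1, 3, 0, 2, 2, 1, 0]
table = frequency_table(sample)
for x, n_i, f_i, N_i, F_i in table:
    print(f"{x:>3} {n_i:>3} {f_i:>6.2f} {N_i:>3} {F_i:>6.2f}")
```

Sorting the values only makes sense for ordinal or numerical variables; for a nominal variable the cumulative columns should simply be ignored.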
Grouping by classes

Example 1: The data below shows blood types reported for a sample of
40 individuals.
AB, A, B, O, A, A, A, B, O, AB,
B, O, B, B, B, A, A, A, AB, B,
O, A, A, A, AB, AB, O, B, B, AB,
O, B, O, O, A, A, O, B, AB, AB

I What kind of variable is ’blood type’? Find a frequency distribution
of the data.
I What percentage of the sampled people have blood type A?
I What percentage of the individuals have blood type other than O?

Grouping by classes

Example 1 cont.:
I Categorical, nominal with 4 different classes. The frequency
distribution is:

Class   Absolute Frequency   Relative Frequency
A       12                   0.300
B       11                   0.275
AB      8                    0.200
O       9                    0.225
Total   40                   1
I 30%
I 100% − 22.5% = 77.5%
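
The counts in Example 1 can be checked with a short sketch that uses the 40 observations listed in the example. The variable names are chosen here for illustration only.

```python
from collections import Counter

# The 40 blood types reported in Example 1, row by row.
blood_types = (
    "AB A B O A A A B O AB "
    "B O B B B A A A AB B "
    "O A A A AB AB O B B AB "
    "O B O O A A O B AB AB"
).split()

n = len(blood_types)
counts = Counter(blood_types)                                # absolute frequencies
rel = {c: counts[c] / n for c in ("A", "B", "AB", "O")}      # relative frequencies

pct_A = 100 * rel["A"]            # percentage with blood type A
pct_not_O = 100 * (1 - rel["O"])  # percentage with a type other than O

for c in ("A", "B", "AB", "O"):
    print(f"{c:<3} {counts[c]:>3} {rel[c]:.3f}")
print(f"type A: {pct_A:.1f}%, other than O: {pct_not_O:.1f}%")
```

No cumulative frequencies are computed because blood type is nominal, so the classes have no natural order.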
Grouping by classes

Example 2: The table below shows different levels of satisfaction
(S=satisfied, V=very, U=unsatisfied) for 901 employees.

Absolute
Class Frequency
VU 62
U 108
S 319
VS 412
Total 901

I What type of variable is being studied? Find a frequency distribution
of the data.
I What percentage of the sampled people are satisfied?
I How many individuals are unsatisfied or worse? In %?
I How many individuals are at least satisfied? In %?

Grouping by classes

Example 2 cont.:
I Categorical, ordinal with 4 different classes. The frequency
distribution is:
        Absolute    Relative    Cumulative Absolute   Cumulative Relative
Class   Frequency   Frequency   Frequency             Frequency
VU      62          0.07        62                    0.07
U       108         0.12        170                   0.19
S       319         0.35        489                   0.54
VS      412         0.46        901                   1
Total   901         1
I 35%
I 170, 19%
I 319 + 412 = 731 or 901 − 170 = 731, 35% + 46% = 81% or
100% − 19% = 81%
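
For an ordinal variable the cumulative columns are meaningful, so the answers to Example 2 can be read off them. The sketch below uses the absolute frequencies given in the example; the variable names are illustrative only.

```python
from itertools import accumulate

# Classes ordered from worst to best, with the absolute frequencies of Example 2.
classes = ["VU", "U", "S", "VS"]
abs_freq = [62, 108, 319, 412]

n = sum(abs_freq)
rel_freq = [c / n for c in abs_freq]
cum_abs = list(accumulate(abs_freq))   # N_i
cum_rel = [N / n for N in cum_abs]     # F_i

# "Unsatisfied or worse" is the cumulative absolute frequency up to class U.
unsatisfied_or_worse = cum_abs[classes.index("U")]
# "At least satisfied" is everyone else.
at_least_satisfied = n - unsatisfied_or_worse

print(f"satisfied: {100 * rel_freq[classes.index('S')]:.0f}%")
print(f"unsatisfied or worse: {unsatisfied_or_worse} "
      f"({100 * unsatisfied_or_worse / n:.0f}%)")
print(f"at least satisfied: {at_least_satisfied} "
      f"({100 * at_least_satisfied / n:.0f}%)")
```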
Grouping by classes
Example 3: To evaluate the performance of a new pesticide, a sample of
50 plants, from those treated by the new pesticide, was selected. The
number of leaves attacked by a pest was counted for each of the sampled
plants. The results are shown below.
Absolute
xi Frequency
0 6
1 10
2 12
3 8
4 5
5 4
6 3
8 1
10 1
Total 50

Grouping by classes

Example 3 cont.:
I What can you say about the variable in the study? Find its
frequency distribution.
I What percentage of the sampled plants had exactly 3 leaves attacked?
I How many plants had no more than 3 leaves attacked?
I How many plants had at least 6 leaves attacked?
I What percentage of plants had between 3 and 5 leaves attacked?
I What percentage of plants had at least 8 leaves attacked?
I What percentage of plants had at most 2 leaves attacked?
Grouping by classes

Example 3 cont.:
I Numerical, discrete with 9 different values. The frequency
distribution is:
        Absolute    Relative    Cumulative Absolute   Cumulative Relative
xi      Frequency   Frequency   Frequency             Frequency
0       6           0.12        6                     0.12
1       10          0.20        16                    0.32
2       12          0.24        28                    0.56
3       8           0.16        36                    0.72
4       5           0.10        41                    0.82
5       4           0.08        45                    0.90
6       3           0.06        48                    0.96
8       1           0.02        49                    0.98
10      1           0.02        50                    1
Total   50          1

Grouping by classes

Example 3 cont.:
I 16%
I 36
I 3 + 1 + 1 = 5 or 50 − 45 = 5
I 16% + 10% + 8% = 34% or (8 + 5 + 4)/50 = 34%
I 2% + 2% = 4% or 100% − 96% = 4%
I 56%
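
The six answers to Example 3 can be reproduced directly from the frequency table. This is a minimal sketch; the helper `count_where` is a name introduced here, not part of the original material.

```python
# Values and absolute frequencies from Example 3 (leaves attacked per plant).
values = [0, 1, 2, 3, 4, 5, 6, 8, 10]
abs_freq = [6, 10, 12, 8, 5, 4, 3, 1, 1]
n = sum(abs_freq)

def count_where(condition):
    """Number of sampled plants whose value x satisfies the condition."""
    return sum(c for x, c in zip(values, abs_freq) if condition(x))

pct_exactly_3 = 100 * count_where(lambda x: x == 3) / n
n_no_more_than_3 = count_where(lambda x: x <= 3)
n_at_least_6 = count_where(lambda x: x >= 6)
pct_between_3_and_5 = 100 * count_where(lambda x: 3 <= x <= 5) / n
pct_at_least_8 = 100 * count_where(lambda x: x >= 8) / n
pct_at_most_2 = 100 * count_where(lambda x: x <= 2) / n

print(pct_exactly_3, n_no_more_than_3, n_at_least_6,
      pct_between_3_and_5, pct_at_least_8, pct_at_most_2)
```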
Grouping by class intervals: continuous (and discrete) data

Class Interval     Midpoint
[l_{i−1}, l_i)     x_i = (l_{i−1} + l_i)/2   n_i   f_i   N_i   F_i
[l_0, l_1)         x_1                       n_1   f_1   N_1   F_1
[l_1, l_2)         x_2                       n_2   f_2   N_2   F_2
...                ...                       ...   ...   ...   ...
[l_{k−1}, l_k]     x_k                       n_k   f_k   n     1
Total                                        n     1

Note:
I Left end-point is included, but right end-point is excluded (typical
convention)
I The reverse end-point convention can also be used; check how your
software defines the intervals
I Useful for tabulating discrete data if X takes many values

Grouping by class intervals: continuous (and discrete) data

I Very often class intervals have the same width
I Determine the width w of each interval by
      w = (largest number − smallest number) / (number of desired intervals)
I How many intervals? Roughly between 5 and 20. More specifically:
    I k ≈ √n if n is ’small’
    I k ≈ 1 + 3.322 log10(n) if n is ’large’ (Sturges’ rule)
I Intervals never overlap
I Round up the interval width to get desirable interval endpoints
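
The choice of the number of intervals and their width can be sketched as follows. Several things here are assumptions rather than part of the slides: the square-root rule k ≈ √n for small n, Sturges' rule k ≈ 1 + log2(n) (about 1 + 3.322 log10(n)) for large n, rounding k up to the next integer, and rounding the width up to the next integer. The function names are hypothetical.

```python
import math

def suggested_intervals(n, large=False):
    """Suggested number of class intervals k for a sample of size n.

    Assumption: k is rounded up to the next integer.
    """
    if large:
        # Sturges' rule: 1 + log2(n), i.e. about 1 + 3.322 * log10(n).
        return math.ceil(1 + math.log2(n))
    # Square-root rule for small samples.
    return math.ceil(math.sqrt(n))

def interval_width(smallest, largest, k):
    """Common width: the range divided by k, rounded up to an integer (assumption)."""
    return math.ceil((largest - smallest) / k)

k = suggested_intervals(20)       # sample of size n = 20
w = interval_width(12, 58, k)     # smallest value 12, largest value 58
print(k, w)
```

With n = 20, smallest value 12 and largest value 58, this gives k = 5 and w = 10.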
Grouping by class intervals: continuous (and discrete) data

Example 4: A manufacturer of insulation randomly selects 20 winter
days and records the daily high temperature (in degrees Fahrenheit)
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Find the frequency distribution of the data.


I Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
32, 35, 37, 38, 41, 43, 44, 46, 53, 58
I Find range: 58 − 12 = 46
I Select number of classes: say k = 5
I Compute interval width: 10 (46/5 then round up)
I Determine the end-points: 10 but less than 20, 20 but less than 30,
etc
I Count the observations and assign to classes

Grouping by class intervals: continuous (and discrete) data

Example 4 cont.:
Class Interval Midpoint ni fi Ni Fi
[10, 20) 15 3 0.15 3 0.15
[20, 30) 25 6 0.30 9 0.45
[30, 40) 35 5 0.25 14 0.70
[40, 50) 45 4 0.20 18 0.90
[50, 60] 55 2 0.10 20 1
Total 20 1

I On how many days was the temperature below 30 °F? In %?
(3 + 6 = 9, which is 45%)
I On how many days (approximately) was the temperature at least
45 °F? In %?
(2 + 4 × (50 − 45)/(50 − 40) = 4, which is 20%, assuming the observations
are spread evenly within [40, 50))
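
The grouping of Example 4 and the two answers can be reproduced with a short sketch. It assumes every observation lies between the first and the last end-point, and the helper `approx_at_least` is a name introduced here for the interpolation step.

```python
from bisect import bisect_right
from itertools import accumulate

# Daily high temperatures from Example 4 and the chosen interval end-points.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
edges = [10, 20, 30, 40, 50, 60]
n = len(temps)
k = len(edges) - 1

abs_freq = [0] * k
for t in temps:
    # Left end-point included, right end-point excluded;
    # the last interval also includes its right end-point.
    i = min(bisect_right(edges, t) - 1, k - 1)
    abs_freq[i] += 1

cum_abs = list(accumulate(abs_freq))
midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

# Days below 30: cumulative absolute frequency of the interval [20, 30).
below_30 = cum_abs[1]

def approx_at_least(threshold):
    """Approximate count of observations >= threshold from the grouped table,
    assuming observations are spread evenly within each interval."""
    total = 0.0
    for lo, hi, c in zip(edges, edges[1:], abs_freq):
        if threshold <= lo:
            total += c                                  # whole interval counts
        elif threshold < hi:
            total += c * (hi - threshold) / (hi - lo)   # part above the threshold
    return total

print(abs_freq, cum_abs, midpoints)
print(below_30, 100 * below_30 / n)
print(approx_at_least(45), 100 * approx_at_least(45) / n)
```

The interpolated answer of 4 days is only an approximation: in the raw data exactly 3 days (46, 53 and 58) reach at least 45.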
