percent the previous year. The honeymoon Q3 growth for President Aquino is the highest when
compared with 0.4 percent for President Ramos, negative 0.7 percent for President Estrada, and 1.3
percent and 5.6 percent for the first and second terms, respectively, of President Arroyo.
Fishermen, farmers, and children comprised the poorest three sectors in 2006 with poverty incidences
of 49.9%, 44.0%, and 40.8%, respectively.
With the government funds for infrastructure projects frontloaded in the first semester, Government
Consumption Expenditure (GCE) reversed its growth by 6.1 percent from 12.1 percent
The expenditure items that recorded higher growths compared to the previous year
were: Miscellaneous expenses, 7.1 percent from 6.1 percent; Fuel, Light and Water, 8.1 percent from
negative 6.3 percent; Beverages, 11.4 percent from negative 5.8 percent; Household furnishings, 9.2
percent from negative 15.3 percent; Transportation and Communication, 0.9 percent from negative 0.7
percent; and, Clothing & Footwear, 1.4 percent from 3.6 percent.
(Source: http://www.nscb.gov.ph)
Statistical techniques are used to make many decisions that affect our lives
o Insurance companies use statistical analysis to set rates for home, automobile, life, and health insurance.
o Laguna Lake Development Authority is monitoring the water quality of Laguna Lake. They periodically take water
samples to establish the level of contamination and maintain the level of quality.
o Medical researchers study the cure rates for diseases using different drugs and different forms of treatment
Knowledge of statistical methods will help you understand how decisions are made and give you a better understanding of
how they affect you.
o No matter what your career is, you will make professional decisions that involve data
o To make an informed decision you need to
Determine whether the existing information is adequate or additional information is required
Gather additional information, if needed, in such a way that it does not provide misleading results
Summarize the information in a useful and informative manner
Analyze the available information
Draw conclusions and make inferences while assessing the risk of an incorrect conclusion
Statistics is the science that deals with the collection, classification, analysis, and interpretation of information or data to make
decisions, solve problems, and design products and processes
TYPES OF STATISTICS
Descriptive statistics utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information
revealed in a data set, and to present the information in a convenient form.
Inferential statistics utilizes sample data to make estimates, decisions, predictions, or other generalizations about a larger set of
data or population.
population: the complete collection of individuals, items, or data under consideration in a statistical study
sample: the portion of the population selected for analysis
Population | Sample
All registered voters | A telephone survey of 600 registered voters
All owners of handguns | A telephone survey of 1000 handgun owners
Households headed by a single parent | The results from questionnaires sent to 2500 households headed by a single parent
The CEOs of all private companies | The results from surveys sent to 150 CEOs of private companies
All prison inmates | A criminal justice study of 350 prison inmates
Foreigners living in the Philippines | A sociological study conducted by a university researcher of 200 foreigners in the Philippines
Alzheimer patients in the Philippines | A medical study of 50 such patients conducted by a university hospital
Adult children of alcoholics | A psychological study of 200 such individuals
CLASSIFICATION OF DATA
Quantitative data are counts or measurements for which representation on a numerical scale is naturally meaningful.
Discrete
Continuous
Examples
Daytime temperature readings (in degrees Fahrenheit) in a 30-day period
Heights (in centimeters) of plants in a plot of land
Number (0, 1, 2, or so on) of people attending a conference
Distances (in miles) traveled by students commuting to school
Heights (in inches) of girls in a classroom
Number (0, 1, 2, or so on) of students in a classroom
Number (0, 1, 2, or so on) of teachers in favor of school uniforms
Ages (in months) of children in a preschool
Qualitative data consist of labels, category names, ratings, rankings, and such for which representation on a numerical scale is
not naturally meaningful.
Examples
Satisfaction ratings (on a scale from not satisfied to very satisfied) by users of a website
Party affiliation (Liberal, Nacionalista, Pwersa ng Masang Pilipino, Lakas Kampi, Bangon Pilipinas, etc.) of voters
Eye colors (blue, brown, or so on) of babies
Names (first and last) of a group of students who took an exam
Ten-digit Social Security numbers of a group of citizens
Foremost colors (red, yellow, orange, or so on) of flowers in a garden
Sex (male or female) of users of a website
Discrete data are quantitative data that are countable using a finite count, such as 0, 1, 2, and so on.
Examples
Number (0, 1, 2, or so on) of people attending a conference
Number (0, 1, 2, or so on) of male children in a family
Number (0, 1, 2, or so on) of students in a classroom
Number (0, 1, 2, or so on) of female teachers at a school
Number (0, 1, 2, or so on) of correct answers on a 20-item quiz
Number (0, 1, 2, or so on) of heads in 100 tosses of a coin
Continuous data are quantitative data that can take on any value within a range of values on a numerical scale in such a way that
there are no gaps, jumps, or other interruptions.
Examples
Ages (in years) of participants in a survey
Heights (in inches) of plants in a plot of land
Lengths (in inches) of newborn babies
Distances (in miles) traveled by students commuting to school
Heights (in inches) of girls in a classroom
Weights (in pounds) of male police officers
Daytime temperatures (in degrees Fahrenheit) over a 30-day period
Lengths (in meters) of broad jumps
Recognizing quantitative data as discrete or continuous is another useful skill in statistics.
Once you decide on the type of data (quantitative or qualitative) appropriate for the problem at hand, you'll need to collect the
data. Generally, data can be obtained in four different ways:
Data from a published source
Data from a designed experiment
Data from a survey
Data collected observationally
Levels of Measurement
nominal scale
o the lowest level of data
o applied to data that are used for category identification
o characterized by data that consist of names, labels, or categories only
o data cannot be arranged in an ordering scheme
o arithmetic operations are not performed for nominal data
Qualitative variable | Categories
Blood type | A, B, AB, O
Province of residence | Laguna, Batangas, Cavite, Rizal, Quezon
Type of crime | misdemeanor, felony
Color of road signs | red, white, blue, green
Religion | Christian, Moslem, etc.
ordinal scale
o the next higher level of data
o applied to data that can be arranged in some order, but differences between data values either cannot be
determined or are meaningless
o characterized by data that applies to categories that can be ranked
o data can be arranged in an ordering scheme
o arithmetic operations are not performed on ordinal level data
Qualitative variable
Product rating
Socioeconomic class
Pain level
interval scale
o applied to data that can be arranged in some order and for which differences in data values are meaningful
o results from counting or measuring
o data can be arranged in an ordering scheme and differences can be calculated and interpreted
o the value zero is arbitrarily chosen for interval data and does not imply an absence of the characteristic being
measured
o ratios are not meaningful for interval data
o Examples: temperature (in degrees Celsius or Fahrenheit), IQ scores
ratio scale
ORGANIZING DATA
Raw data or ungrouped data are collected data that have not been organized numerically.
manslaughter
theft
theft
The type of offense is classified into the categories: rape, robbery, burglary, arson, murder, theft, and manslaughter.
Table 1.1
The relative frequency of a category is obtained by dividing the frequency for a category by the sum of all the frequencies.
The percentage for a category is obtained by multiplying the relative frequency for that category by 100.
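The two definitions above can be sketched in a few lines of Python. The raw offense labels below are invented purely for illustration (the original raw data table is not reproduced here), and the function name `frequency_table` is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical raw (ungrouped) data: offense labels made up for illustration.
raw_offenses = [
    "theft", "robbery", "theft", "burglary", "theft",
    "murder", "robbery", "theft", "arson", "burglary",
]

def frequency_table(observations):
    """Map each category to (frequency, relative frequency, percentage)."""
    counts = Counter(observations)
    total = sum(counts.values())  # sum of all the frequencies
    return {
        category: (freq, freq / total, 100 * freq / total)
        for category, freq in counts.items()
    }

table = frequency_table(raw_offenses)
for category, (freq, rel, pct) in sorted(table.items()):
    print(f"{category:<10} {freq:>3} {rel:>6.2f} {pct:>6.1f}%")
```

The relative frequencies always add up to 1, and the percentages to 100, which is a quick check on any frequency table.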
Table 1.2
Height (in) | No. of Students
60-62 | 5
63-65 | 18
66-68 | 42
69-71 | 27
72-74 | 8
Table 1.3
Data organized and summarized as in the above frequency distribution are often called grouped data. Although the grouping
process generally destroys much of the original detail of the data, an important advantage is gained in the clear "overall" picture
that is obtained and in the vital relationships that are thereby made evident.
For further mathematical analysis, all observations belonging to a given class interval are assumed to coincide with the class mark. Thus
all heights in the class interval 60-62 in are considered to be 61 in.
Class limits | Class boundaries | Class width | Class mark
60-62 | 59.5-62.5 | 3 | 61
63-65 | 62.5-65.5 | 3 | 64
66-68 | 65.5-68.5 | 3 | 67
69-71 | 68.5-71.5 | 3 | 70
72-74 | 71.5-74.5 | 3 | 73
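As a rough sketch, the class boundaries, class width, and class mark can be derived from the class limits alone. The helper name `describe_class` and the `unit` parameter (the precision to which heights were recorded, assumed here to be 1 inch) are assumptions of this example.

```python
# Class limits for the height distribution (in inches).
class_limits = [(60, 62), (63, 65), (66, 68), (69, 71), (72, 74)]

def describe_class(lower, upper, unit=1):
    """Derive boundaries, width, and mark for one class interval.

    `unit` is the recording precision (assumed 1 inch); boundaries sit
    half a unit outside the class limits.
    """
    half = unit / 2
    lower_boundary = lower - half
    upper_boundary = upper + half
    return {
        "boundaries": (lower_boundary, upper_boundary),
        "width": upper_boundary - lower_boundary,
        "mark": (lower + upper) / 2,  # midpoint of the class
    }

classes = [describe_class(lo, hi) for lo, hi in class_limits]
for (lo, hi), info in zip(class_limits, classes):
    print(f"{lo}-{hi}: {info}")
```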
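Height (in) | No. of Students
Less than 59.5 | 0
Less than 62.5 | 5
Less than 65.5 | 23
Less than 68.5 | 65
Less than 71.5 | 92
Less than 74.5 | 100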
Table 1.4
A graph showing the cumulative frequency less than any upper class boundary plotted against the upper class boundary is
called a cumulative-frequency polygon or ogive. For some purposes, it is desirable to consider a cumulative-frequency distribution
of all values greater than or equal to the lower class boundary of each class interval. Because in this case we consider heights of 59.5
in or more, 62.5 in or more, etc., this is sometimes called an "or more" cumulative distribution, while the one considered above is a
"less than" cumulative distribution. One is easily obtained from the other. The corresponding ogives are then called "or more" and
"less than" ogives. Whenever we refer to cumulative distributions or ogives without qualification, the "less than" type is implied.
The relative cumulative frequency, or percentage cumulative frequency, is the cumulative frequency divided by the total
frequency. For example, the relative cumulative frequency of heights less than 68.5 in is 65/100 = 65%, signifying that 65% of the
students have heights less than 68.5 in. If the relative cumulative frequencies are used in Table 1.4, in place of cumulative
frequencies, the results are called relative cumulative-frequency distributions (or percentage cumulative distributions) and relative
cumulative frequency polygons (or percentage ogives), respectively.
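The cumulative quantities described above can be computed with a running sum. This is a minimal sketch that assumes the class frequencies 5, 18, 42, 27, and 8 of the height distribution; the variable names are illustrative.

```python
from itertools import accumulate

upper_boundaries = [62.5, 65.5, 68.5, 71.5, 74.5]
frequencies = [5, 18, 42, 27, 8]

# "Less than" cumulative frequencies at each upper class boundary.
less_than = list(accumulate(frequencies))
total = less_than[-1]

# "Or more" cumulative frequencies at each lower class boundary
# (59.5, 62.5, 65.5, 68.5, 71.5).
or_more = [total - below for below in [0] + less_than[:-1]]

# Relative (percentage) cumulative frequencies.
relative = [c / total for c in less_than]

for boundary, c, r in zip(upper_boundaries, less_than, relative):
    print(f"less than {boundary}: {c:>3}  ({r:.0%})")
```

The third row reproduces the figure quoted in the text: 65 of the 100 students, or 65%, have heights less than 68.5 in.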
frequency polygon for a large population to have so many small, broken line segments that they closely approximate curves, which
we call frequency curves or relative-frequency curves, respectively.
It is reasonable to expect that such theoretical curves can be approximated by smoothing the frequency polygons or
relative-frequency polygons of the sample, the approximation improving as the sample size is increased. For this reason, a frequency
curve is sometimes called a smoothed frequency polygon.
In a similar manner, smoothed ogives are obtained by smoothing the cumulative-frequency polygons, or ogives. It is usually
easier to smooth an ogive than a frequency polygon.
Symmetrical or bell-shaped curves are characterized by the fact that observations equidistant from the central maximum have
the same frequency. Adult male and adult female heights have bell-shaped distributions.
Curves that have tails to the left are said to be skewed to the left. The lifetimes of males and females are skewed to the left. A
few die early in life but most live between 60 and 80 years. Generally, females live about ten years, on the average, longer than
males.
Curves that have tails to the right are said to be skewed to the right. The ages at the time of marriage of brides and grooms are
skewed to the right. Most marry in their twenties and thirties but a few marry in their forties, fifties, sixties and seventies.
Curves that have approximately equal frequencies across their values are said to be uniformly distributed. Certain machines
that dispense liquid colas do so uniformly between 15.9 and 16.1 ounces, for example.
In a J-shaped or reverse J-shaped frequency curve the maximum occurs at one end or the other.
A U-shaped frequency distribution curve has maxima at both ends and a minimum in between.
A bimodal frequency curve has two maxima.
A multimodal frequency curve has more than two maxima.
Examining a distribution
In any graph of data, look for the overall pattern and for striking deviations from that pattern. You can describe the overall
pattern of a histogram by its shape, center, and spread. An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern.
Skewed distributions can show us where to concentrate our efforts.
Example. The numbers of 911 emergency calls classified as domestic disturbance calls in a large metropolitan location
were recorded for thirty randomly selected 24-hour periods, with the following results. Find the mean number of calls per
24-hour period.
25  40  46  30  34  27  45  38  37  47
36  58  40  22  30  29  29  56  37  40
44  46  56  38  50  19  47  49  23  50
Example. The total number of 911 emergency calls classified as domestic disturbance calls last year in a large
metropolitan location was 14,950. Find the mean number of such calls per 24-hour period if last year was not a leap
year.
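Both examples reduce to a sum divided by a count. The sketch below works them out: the first treats the thirty values as a sample, the second treats the yearly total as population data over 365 days. Variable names are illustrative.

```python
# Sample: calls in thirty randomly selected 24-hour periods.
calls = [
    25, 40, 46, 30, 34, 27, 45, 38, 37, 47,
    36, 58, 40, 22, 30, 29, 29, 56, 37, 40,
    44, 46, 56, 38, 50, 19, 47, 49, 23, 50,
]
sample_mean = sum(calls) / len(calls)  # 1168 / 30

# Population: 14,950 calls over a non-leap year of 365 days.
population_mean = 14950 / 365

print(f"sample mean     = {sample_mean:.1f} calls per 24-hour period")
print(f"population mean = {population_mean:.1f} calls per 24-hour period")
```

The sample mean of about 38.9 is an estimate of the population mean of about 41.0; the two differ because the sample covers only thirty of the 365 periods.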
The Median
The median of a set of numbers arranged in increasing order is either the middle value or the arithmetic mean
of the two middle values. Median splits the set of ranked data values into equal-in-numbers parts. Extreme values do
not affect the median, making the median a good alternative to the mean when such values occur.
To find the median of a data set, first arrange the data in increasing order. If the number of observations is odd,
the median is the number in the middle of the ordered list. If the number of observations is even, the median is the
mean of the two values closest to the middle of the ordered list. The median is the $\left(\frac{n+1}{2}\right)$th value in the ordered list; when $n$ is even, this position falls halfway between the two middle values.
Examples. Find the median for each of the following data sets.
a. 25, 43, 40, 60, 12
b. 7, 22, 7, 8, 16, 1
c. 6.7, 7.6, 7.5, 6.9, 9.3, 6.7, 7.6, 8.5
Solutions
a. Arrange the numbers: 12, 25, 40, 43, 60. The median is 40, the middle number.
40, the median, is the $\left(\frac{n+1}{2}\right)$th $= \left(\frac{5+1}{2}\right)$th = 3rd value.
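The procedure for finding the median can be sketched as a short function. Data set (a) is taken from the example above; the even-sized list is a made-up illustration of the "mean of the two middle values" case.

```python
def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value, which is index n // 2 (0-based).
        return ordered[middle]
    # Even n: average the two values on either side of the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([25, 43, 40, 60, 12]))  # data set (a): 40
print(median([1, 2, 3, 4]))          # hypothetical even-sized set: 2.5
```

Python's standard library offers `statistics.median`, which follows the same rule.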
Relative positions of mean, median, and mode for different frequency curves
The table below gives the shape of the distribution, the mean, the median, and the mode for the three data sets.
The variance and the standard deviation of a data set measure the spread of the data about the mean of the data set.
The variance of a sample of size n is represented by $s^2$ and is given by
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Example. The times required in minutes for five preschoolers to complete a task were 5, 10, 15, 3, and 7. The mean
time for the five preschoolers is 8 minutes. The table below illustrates the computation indicated by the sample variance
formula. The first column lists the observations, $x$. The second column lists the deviations from the mean, $x - \bar{x}$. The third
column lists the squares of the deviations, $(x - \bar{x})^2$. The sum at the bottom of the second column is called the sum of the
deviations, and is always equal to zero for any data set. The sum at the bottom of the third column is referred to as the
sum of the squares of the deviations. The sample variance is obtained by dividing the sum of the squares of the
deviations by n − 1, or 5 − 1 = 4. The sample variance equals 88 divided by 4, which is 22 minutes squared.
x | x − 8 | (x − 8)²
5 | −3 | 9
10 | 2 | 4
15 | 7 | 49
3 | −5 | 25
7 | −1 | 1
Sum | 0 | 88
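The computation in this example can be reproduced step by step. This is a small sketch of the definitional formula (sum of squared deviations divided by n − 1); the variable names are illustrative.

```python
times = [5, 10, 15, 3, 7]  # minutes, from the example
n = len(times)
mean = sum(times) / n      # 8.0

deviations = [x - mean for x in times]  # x - mean
squared = [d ** 2 for d in deviations]  # (x - mean)^2

print(f"{'x':>4} {'x - mean':>9} {'squared':>8}")
for x, d, sq in zip(times, deviations, squared):
    print(f"{x:>4} {d:>9.1f} {sq:>8.1f}")
print(f"{'sum':>4} {sum(deviations):>9.1f} {sum(squared):>8.1f}")

sample_variance = sum(squared) / (n - 1)
print(f"sample variance = {sample_variance} minutes squared")
```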
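The shortcut formulas for computing sample and population standard deviations are
$$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}} \qquad \sigma = \sqrt{\frac{\sum x^2 - (\sum x)^2 / N}{N}}$$
For the five preschooler times,
$$\sum x^2 = 5^2 + 10^2 + 15^2 + 3^2 + 7^2 = 408$$
$$\left(\sum x\right)^2 = (5 + 10 + 15 + 3 + 7)^2 = 1600$$
The variance is given as follows:
$$s^2 = \frac{408 - 1600/5}{5 - 1} = \frac{88}{4} = 22$$
And the standard deviation is
$$s = \sqrt{22} \approx 4.69 \text{ minutes}$$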
For data sets having a symmetric mound-shaped distribution, the standard deviation is approximately equal to
one-fourth of the range of the data set. This fact can be used to estimate s for bell-shaped distributions.
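As a check on the shortcut computation for the five preschooler times, the sketch below computes the sample variance from the sum of the values and the sum of their squares, then compares the standard deviation with the range-based estimate. With only five values the data are not mound-shaped, so the range estimate is shown only to illustrate the arithmetic and is not expected to be close.

```python
import math

times = [5, 10, 15, 3, 7]  # minutes
n = len(times)

sum_x = sum(times)                   # 40
sum_x2 = sum(x * x for x in times)   # 408

# Shortcut form: no deviations from the mean are needed.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

# Rough estimate of s for symmetric, mound-shaped data: range / 4.
range_estimate = (max(times) - min(times)) / 4

print(f"variance = {variance}, s = {std_dev:.2f}, range/4 = {range_estimate}")
```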
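The mean for grouped data is
$$\bar{x} = \frac{\sum x f}{n}$$
where $x$ = class mark, $f$ = class frequency, and $n = \sum f$ = total frequency. For the age distribution used in the examples below (class marks 9.5, 19.5, 29.5, 39.5, and 49.5 with frequencies 750, 2005, 1950, 195, and 100),
$$\bar{x} = \frac{116{,}400}{5000} \approx 23.3 \text{ years}$$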
The median for grouped data is found by locating the value that divides the data into two equal parts. In finding
the median for grouped data, it is assumed that the data in each class is uniformly spread across the class.
Median $= L + \left(\dfrac{n/2 - \mathit{sf}}{\mathit{fm}}\right) c$
where L = lower class boundary of the median class, the class containing the median
n = number of items in the data (i.e., total frequency)
sf = sum of frequencies of all classes lower than the median class
fm = frequency of the median class
c = size of the median class interval
Example. The median age for the data in the above table is a value such that 2500 ages are less than the value and 2500
are greater than the value. The median age must occur in the age group 15-24, since 750 are less than 15 and 2755 are
24 years or less. The class 15-24 is called the median class since the median must fall in this class. Since 750 are less than
15 years, there must be 1750 additional ages in the class 15-24 that are less than the median. In other words, we need
to go the fraction 1750/2005 across the class 15-24 to locate the median. We give the value 14.5 + (1750/2005) × 10 =
23.2 years as the median age. To summarize, 14.5 is the lower boundary of the median class, 1750/2005 is the fraction
we must go across the median class to reach the median, and 10 is the class width for the median class.
Median $= 14.5 + \left(\dfrac{2500 - 750}{2005}\right)(10) \approx 23.2$ years
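The steps of this example (locate the median class, then interpolate across it) can be sketched as follows. The frequencies 750 and 2005 come from the example; 1950, 195, and 100 are taken from the sum-of-squares computation later in this section. The function name and argument layout are assumptions of this sketch.

```python
from itertools import accumulate

lower_boundaries = [4.5, 14.5, 24.5, 34.5, 44.5]  # classes 5-14, ..., 45-54
frequencies = [750, 2005, 1950, 195, 100]
width = 10                                        # class width c

def grouped_median(lower_boundaries, frequencies, width):
    """Median = L + ((n/2 - sf) / fm) * c, assuming values are spread
    uniformly across the median class."""
    n = sum(frequencies)
    cumulative = list(accumulate(frequencies))
    # Median class: first class whose cumulative frequency reaches n/2.
    k = next(i for i, c in enumerate(cumulative) if c >= n / 2)
    below = cumulative[k - 1] if k > 0 else 0  # sf
    return lower_boundaries[k] + (n / 2 - below) / frequencies[k] * width

median_age = grouped_median(lower_boundaries, frequencies, width)
print(f"median age = {median_age:.1f} years")
```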
The mode for grouped data is defined to be the class mark of the modal class, the class with the maximum
frequency.
Example. The modal class for the distribution in the table above is the class 15-24. The mode is the class mark for this
class that equals 19.5 years.
The range for grouped data is given by the difference between the upper boundary of the class having the
largest values minus the lower boundary of the class having the smallest values.
Example. The upper boundary for the class 45-54 is 54.5 and the lower boundary for the class 5-14 is 4.5, and the range
is 54.5 - 4.5 = 50.0 years.
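Example. In order to find the variance and standard deviation for the distribution in the table above we first evaluate
$\sum x^2 f$ and $(\sum x f)^2 / n$, where $x$ is a class mark and $f$ is the corresponding class frequency:
$$\sum x^2 f = 9.5^2 (750) + 19.5^2 (2005) + 29.5^2 (1950) + 39.5^2 (195) + 49.5^2 (100) = 3{,}076{,}350$$
$$\frac{\left(\sum x f\right)^2}{n} = \frac{(116{,}400)^2}{5000} = 2{,}709{,}792$$
The variance is
$$s^2 = \frac{\sum x^2 f - \left(\sum x f\right)^2 / n}{n - 1} = \frac{3{,}076{,}350 - 2{,}709{,}792}{4999} \approx 73.3$$
and the standard deviation is $s = \sqrt{73.3} \approx 8.6$ years.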
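The grouped-data measures in this section can be checked together. The sketch assumes the class marks 9.5 through 49.5 and the frequencies that appear in the sum-of-squares computation above; every observation in a class is treated as equal to its class mark.

```python
import math

class_marks = [9.5, 19.5, 29.5, 39.5, 49.5]  # classes 5-14, ..., 45-54
frequencies = [750, 2005, 1950, 195, 100]

n = sum(frequencies)                                            # 5000
sum_xf = sum(x * f for x, f in zip(class_marks, frequencies))   # 116,400
sum_x2f = sum(x * x * f for x, f in zip(class_marks, frequencies))

mean = sum_xf / n
variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

# Mode: class mark of the class with the largest frequency.
mode = class_marks[frequencies.index(max(frequencies))]

# Range: upper boundary of the last class minus lower boundary of the first.
grouped_range = 54.5 - 4.5

print(f"mean = {mean:.1f}, mode = {mode}, range = {grouped_range}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```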
COEFFICIENT OF VARIATION
The coefficient of variation is equal to the standard deviation divided by the mean. The result is usually
multiplied by 100 to express it as a percent.
sample coefficient of variation: $CV = \dfrac{s}{\bar{x}} \times 100\%$
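As an illustration, the coefficient of variation can be computed for the five preschooler task times used earlier (5, 10, 15, 3, and 7 minutes). The sketch relies on `statistics.stdev`, which uses the n − 1 divisor appropriate for sample data.

```python
import statistics

times = [5, 10, 15, 3, 7]  # minutes

mean = statistics.mean(times)  # 8
s = statistics.stdev(times)    # sample standard deviation, sqrt(22)

cv_percent = s / mean * 100
print(f"CV = {cv_percent:.1f}%")
```

Because the coefficient of variation is unit-free, it allows the spread of data sets measured in different units to be compared.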
Z SCORES
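A z score is the number of standard deviations that a given observation, x, is below or above the mean. For
sample data, the z score is
$$z = \frac{x - \bar{x}}{s}$$
where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.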
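A short sketch of the z score for sample data, using the five preschooler task times from the variance example. The function name `z_scores` is illustrative.

```python
import statistics

def z_scores(data):
    """Return (x - mean) / s for each observation, with s the sample
    standard deviation (n - 1 divisor)."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]

times = [5, 10, 15, 3, 7]  # minutes
z = z_scores(times)
for x, score in zip(times, z):
    print(f"x = {x:>2}  z = {score:+.2f}")
```

A positive z score marks an observation above the mean and a negative one marks an observation below it; the z scores of a data set always sum to zero.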
If a data set is skewed to the right or to the left, then there is a greater chance that an outlier may be in your data set.
Outliers can greatly affect the mean and standard deviation of a data set. So, if your data set is skewed, you might want
to think about using different measures of central tendency and dispersion!
The percentile for observation x is found by dividing the number of observations less than x by the total number of
observations and then multiplying this quantity by 100. This percent is then rounded to the nearest whole number to
give the percentile for observation x.
Example. The number of observations in the table less than 5.5 is 11. Eleven divided by 45 is 0.244 and 0.244 multiplied
by 100 is 24.4%. This percent rounds to 24%. The diameter 5.5 is the 24th percentile and we express this as P24 = 5.5. The
number of observations less than 5.0 is 9. Nine divided by 45 is 0.20 and 0.20 multiplied by 100 is 20%, so P20 = 5.0. The
number of observations less than 10.0 is 39. Thirty-nine divided by 45 is 0.867 and 0.867 multiplied by 100 is 86.7%.
Since 86.7% rounds to 87% we write P87 = 10.0.
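The rule for the percentile of an observation can be sketched as below. The table of 45 diameters is not reproduced in these notes, so the example reuses the thirty 911 call counts from the mean example; the function name `percentile_of` is an assumption.

```python
def percentile_of(data, x):
    """Percent of observations strictly less than x, rounded to a whole number.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    below = sum(1 for value in data if value < x)
    return round(100 * below / len(data))

calls = [
    25, 40, 46, 30, 34, 27, 45, 38, 37, 47,
    36, 58, 40, 22, 30, 29, 29, 56, 37, 40,
    44, 46, 56, 38, 50, 19, 47, 49, 23, 50,
]

print(percentile_of(calls, 40))  # 15 of 30 values are below 40
print(percentile_of(calls, 30))  # 7 of 30 values are below 30
```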
The pth percentile for a ranked data set consisting of n observations is found by a two-step procedure. The first step is to
compute the index $i = \dfrac{np}{100}$. If $i$ is not an integer, the next integer greater than $i$ locates the position of the pth percentile in
the ranked data set. If $i$ is an integer, the pth percentile is the average of the observations in positions $i$ and $i + 1$ in the
ranked data set.
Example. To find the tenth percentile for the data above, compute $i$ = 10(45)/100 = 4.5. The next integer greater than
4.5 is 5. The observation in the fifth position in the table above is 3.6. Therefore, P10 = 3.6. Note that at least 10% of the
data in the table are 3.6 or less (the actual amount is 11.1%) and at least 90% of the data are 3.6 or more (the actual
amount is 91.1%). For very large data sets, the percentage of observations equal to or less than P10 will be very close to
10% and the percentage of observations equal to or greater than P10 will be very close to 90%.
Example. To find the fortieth percentile for the data in the table above, compute $i$ = 40(45)/100 = 18. The fortieth
percentile is the average of the observations in the 18th and 19th positions in the ranked data set. The observation in
the 18th position is 6.0 and the observation in the 19th position is 6.2. Therefore P40 = (6.0 + 6.2)/2 = 6.1. Note that 40%
of the data in the table are 6.1 or less and that 60% of the observations are 6.1 or more.
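The two-step procedure can be sketched as a function. Since the table of 45 observations is not reproduced in these notes, the sketch is applied to the thirty 911 call counts from the mean example instead. It assumes `p` is a whole number strictly between 0 and 100, and uses integer arithmetic so the "is the index an integer?" test is exact.

```python
def percentile(data, p):
    """pth percentile via the index i = n * p / 100 (p an integer, 0 < p < 100)."""
    ranked = sorted(data)
    whole, remainder = divmod(len(ranked) * p, 100)
    if remainder:
        # i is not an integer: use the next position above i (1-based).
        return ranked[whole]  # 1-based position whole + 1
    # i is an integer: average the observations in positions i and i + 1.
    return (ranked[whole - 1] + ranked[whole]) / 2

calls = [
    25, 40, 46, 30, 34, 27, 45, 38, 37, 47,
    36, 58, 40, 22, 30, 29, 29, 56, 37, 40,
    44, 46, 56, 38, 50, 19, 47, 49, 23, 50,
]

print(percentile(calls, 25))  # i = 7.5, so the 8th ranked value
print(percentile(calls, 40))  # i = 12, so the average of the 12th and 13th
```

Statistical packages use several slightly different percentile conventions, so their results may differ a little from this textbook rule.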
Deciles and quartiles are determined in the same manner as percentiles, since they may be expressed as percentiles.
The deciles are represented as D1, D2, . . . , D9 and the quartiles are represented by Q1 , Q2 , and Q3. The following
equalities hold for deciles and percentiles:
D1 = P10 , D2 = P20 , ..., D9 = P90
The following equalities hold for quartiles and percentiles:
Q1 = P25 , Q2 = P50 , Q3 = P75
From the above definitions of percentiles, deciles, and quartiles, the following equalities also hold:
Median = P50 = D5 = Q2
INTERQUARTILE RANGE
The interquartile range, designated by IQR, is defined as IQR = Q3 − Q1. The interquartile range shows the
spread of the middle 50% of the data and is not affected by extremes in the data set.
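Because quartiles are percentiles, the interquartile range follows directly from the two-step percentile rule. The sketch below assumes that rule (index i = np/100, with p a whole number) and applies it to the thirty 911 call counts from the mean example; the helper name `percentile` is illustrative.

```python
def percentile(data, p):
    """pth percentile via the index i = n * p / 100 (p an integer, 0 < p < 100)."""
    ranked = sorted(data)
    whole, remainder = divmod(len(ranked) * p, 100)
    if remainder:
        return ranked[whole]                      # next position above i
    return (ranked[whole - 1] + ranked[whole]) / 2  # average of i and i + 1

calls = [
    25, 40, 46, 30, 34, 27, 45, 38, 37, 47,
    36, 58, 40, 22, 30, 29, 29, 56, 37, 40,
    44, 46, 56, 38, 50, 19, 47, 49, 23, 50,
]

q1 = percentile(calls, 25)  # Q1 = P25
q2 = percentile(calls, 50)  # Q2 = P50 = median
q3 = percentile(calls, 75)  # Q3 = P75
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The extreme values 19 and 58 play no part in the result, which is why the IQR is described as unaffected by extremes.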
Source:
Spiegel, M. R. and L. J. Stephens. 2008. Schaum's Outline of Theory and Problems of Statistics. McGraw-Hill.