
Subjects for examination

1. Data sources and types of data. Examples.


Primary data means original data that has been collected specially for the purpose in mind: someone collected the data from the original source first hand. For example: your own questionnaire.
Secondary data is data that has been collected for another purpose and is being reused, usually in a different context. For example: data from a book.
Data can be collected from existing sources or obtained through surveys and experimental studies designed to obtain new data. All companies maintain a variety of databases about their employees, customers, and business operations. Government agencies are another important source of existing data. For instance, the Department of Labor maintains data on employment rates, wage rates, size of labor force, and union membership.
In recent years, the Internet has become an important source of data. Most government agencies that collect and process data make the results available through a web site. Many companies have created Internet web sites and provide public access to them. One can obtain access to an almost infinite variety of information (for example, World Bank Open Data).

Data can be classified as being qualitative or quantitative.


Qualitative data consists of labels or names used to identify an
attribute (property, aspect or sign) of an element. Qualitative data
may be numeric or nonnumeric. Quantitative data are always
numeric and indicate how much or how many for the variable of
interest.

For purposes of statistical analysis, it is important to distinguish between cross-sectional and time series data.

Cross-sectional data are data collected at the same or approximately the same point of time. For instance, the academic scores of students represent cross-sectional data, since they are collected at the end of a term or academic year.

Data collected at several successive periods of time are called time series data.
2. Data. Levels of measurement. Examples.

Another characteristic of data is its level of measurement. The level of measurement determines which statistical calculations are meaningful. There are four levels of measurement: nominal, ordinal, interval, and ratio.

Data at the nominal level of measurement are qualitative only. They are given by names, labels, qualities, or codes. No mathematical calculations can be made at this level. For instance, students' names in the register, or the months in the calendar, are data at the nominal level. The same holds for social security numbers and phone numbers, because these numbers simply represent labels that identify particular individuals.

Data at the ordinal level of measurement are qualitative or quantitative. They can be arranged in order, or ranked, but differences between data entries are not meaningful (have no mathematical meaning).

In data at the interval level, a zero entry simply represents a position on a scale; in such a case we say that zero is not an inherent zero. An inherent zero is a zero that implies "none". For instance, the amount of $0 means no money; it is an inherent zero. On the contrary, a temperature of 0°C does not mean "no heat". Thus, 0°C is simply a position on the Celsius scale, and it is not an inherent zero.

Data at the ratio level of measurement are similar to data at the interval level, with the additional property that a zero entry is an inherent zero. In this case, one data value can be expressed meaningfully as a multiple of another (their ratio can be formed).

3. Descriptive statistics and inferential statistics. Examples.

The study of statistics has two major branches: descriptive statistics and inferential statistics.

Descriptive statistics involves the organization, summarization, and display of data, and can be used to provide summaries of the information contained in the data set and to present it in a form that is easy for the reader to understand. Summaries of data may be tabular, graphical or numerical.

The major contribution of statistics is that data from a sample can be used to make estimates about the characteristics of a population. This process is referred to as statistical inference.

Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population. A basic tool in the study of inferential statistics is probability.

Whenever statisticians use a sample to make an inference about a characteristic of a population, they provide a statement of the precision associated with the inference.

Note. A sample should be representative of a population so that sample data can be used to form conclusions about that population. If the sample data are not collected using the appropriate method, the data are of no value.

4. Statistical studies. Examples.


Sometimes the data needed for a particular decision or application are not available through existing sources. In such cases, the necessary data can be obtained by conducting a statistical study. Statistical studies can be classified as follows.

* Experimental study. In an experimental study, a variable of interest is first identified. For instance, the Department of Health might be interested in how a new drug influences body temperature. Thus, temperature is the variable of interest in the study. The dosage level of the new drug is the other variable that affects the body temperature. To obtain data about the effect of the new drug, a sample of individuals will be selected. Different groups of individuals will be given different dosage levels, and data on body temperature will be collected for each group.

After that, statistical analysis of the experimental data will help to determine how the new drug influences body temperature.

* Observational study. In an observational study a researcher observes and measures variables of interest but does not change existing conditions. For instance, one can observe the behavior of people crossing the street.

* Simulation. A simulation is the use of a mathematical or physical model to reproduce the conditions of a situation or process. This often involves the use of computers. Simulations allow you to study situations that are impractical or dangerous to create in real life. For instance, when studying the effects of automobile crashes on humans, dummies are often used.

* Survey. A survey is an investigation of one or more characteristics of a population. Most often surveys are carried out on people by asking them questions, using interviews, mail, or telephone. For instance, the customers of some restaurants are asked for their opinion about food quality, friendly service, prompt service, cleanliness, and management. The answers may be: excellent, good, satisfactory, and unsatisfactory, and provide data that enable the managers to assess the quality of the restaurant's operation.

A goal of every statistical study is to collect data and then use the data to make a decision. Before you interpret the results of a study, you should determine whether the results are valid, as well as reliable. For this you should know how to design a statistical study. Let us mention some guidelines for designing a statistical study.

1. Identify the variables of interest and the population of the study.
2. Develop the plan for collecting data. If a sample is used,
make sure the sample is representative of the population.
3. Collect the data.
4. Describe the data, using descriptive statistics techniques.
5. Interpret the data and make decisions about the populations
using inferential statistics.

6. Identify any possible errors.

5. Tabular methods of summarizing data. Examples.

You will learn here how to organize and describe data sets. The goal is to make the data easier to understand by describing trends, patterns or special characteristics.

To develop a frequency distribution of the given data, we organize the data set by grouping the data into intervals, called classes (the first column), and then counting the number of data entries, called frequencies, in each class (the second column).

Definition. The frequency distribution is a table showing the frequency of items in each of several nonoverlapping classes.

The advantage of a frequency distribution is that it provides a data summary showing how the sample is distributed across the classes.

Besides the frequency distribution, we are often interested to know the proportion (relative frequency), or the percentage (percent frequency), of the data items in each class. For a data set with n observations, the relative frequency of each class is given by the following formula:

Relative Frequency of the Class = (Frequency of the Class) / n.

The percent frequency of a class is the relative frequency multiplied by 100.
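As an illustration, here is a minimal Python sketch (the data values are made up for the example) that builds the frequency, relative frequency, and percent frequency columns for a small qualitative data set:

```python
from collections import Counter

# Hypothetical sample of 10 soft-drink purchases (qualitative data)
data = ["Coke", "Pepsi", "Coke", "Sprite", "Coke",
        "Pepsi", "Fanta", "Coke", "Sprite", "Pepsi"]

n = len(data)
freq = Counter(data)  # frequency of each class

print(f"{'Class':<8}{'Freq':>6}{'Rel. freq':>11}{'Percent':>9}")
for label, f in freq.most_common():
    rel = f / n  # relative frequency = frequency / n
    print(f"{label:<8}{f:>6}{rel:>11.2f}{rel * 100:>8.0f}%")
```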

6. Graphical methods of summarizing data. Examples.

* Graphing Qualitative Data Sets.

To present qualitative data graphically, it is convenient to use such devices as bar graphs and pie charts.

A bar graph is used to depict data that have been summarized in a frequency, relative frequency, or percent frequency distribution.

On the horizontal axis of the graph, we specify the labels that are used for each of the classes. Then, using a bar of fixed width drawn above each label, we extend the height of the bar until we reach the frequency, relative frequency, or percent frequency of the label as indicated by the vertical axis. The bars are separated to emphasize the fact that the labels are different categories.

The pie chart is commonly used for presenting relative frequency distributions of qualitative data. To draw a pie chart, we first draw a circle, then use the relative frequencies to divide the circle into sectors that correspond to the relative frequencies for each class.

With quantitative data we have to be more careful in defining the nonoverlapping classes to be used in the frequency distribution.

The three steps necessary to define the classes are:

1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.

Generally, it is recommended to use between 5 and 20 classes. The goal is to use enough classes to show the variation of the data, but not so many classes that they contain only a few elements.

It is recommended also that the width be the same for each class. Find
the range of the data set, and then divide the range by the number of classes
and round up to find the class width.

In practice, the number of classes and the appropriate class width are
determined by trial and error.

Class limits must be chosen so that each data value belongs to one and
only one class. Having the lower limit of the first class, you can find the lower
limit of the second class by adding the class width to the lower limit of the first
class. The upper limit of the first class will be one less than the lower limit of
the second class. The limits of the rest of the classes are determined similarly.
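A small Python sketch of this procedure, assuming a made-up data set and a number of classes chosen in advance:

```python
import math

# Hypothetical quantitative data set
data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
num_classes = 5  # chosen by trial and error

# class width: range divided by number of classes, rounded up
width = math.ceil((max(data) - min(data)) / num_classes)

lower = min(data)  # lower limit of the first class
for _ in range(num_classes):
    upper = lower + width - 1  # upper limit: one less than the next lower limit
    count = sum(lower <= x <= upper for x in data)
    print(f"[{lower}, {upper}]: frequency {count}")
    lower += width             # lower limit of the next class
```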

Now, a frequency distribution can be obtained by counting the number of data items belonging to each class.

In some applications, we want to know the midpoints of the classes in a frequency distribution. The class midpoint is the value halfway between the lower and upper class limits.

The relative (or percent) frequency of a class is the portion (or percentage) of the data that falls in that class.

Another tabular summary of quantitative data is provided by the cumulative frequency distribution.

The cumulative frequency of a class is the sum of the frequencies of that class and all previous classes. The cumulative frequency of the last class is equal to the sample size n. The cumulative frequency distribution describes the number of data entries that are equal to or below a certain value.

Sometimes it is easier to find patterns in a data set by looking at a graph of the frequency distribution. One such graph is the frequency histogram.

In a histogram the frequency of each class is shown by drawing a rectangle whose base is the class interval on the horizontal axis and whose height is the corresponding frequency. Unlike in a bar graph, consecutive rectangles in a histogram must touch. To eliminate spaces, draw the vertical lines of the histogram halfway between the class limits. These halfway points are called class boundaries.
A histogram for the relative or percent frequency distribution looks similar.

Another way to graph a frequency distribution is to use a frequency polygon.

To construct a frequency polygon, plot points that represent the midpoint and frequency of each class, and connect the points in order from left to right.

A graph of a cumulative frequency distribution is called an ogive. To build an ogive, plot points that represent the upper class boundaries and their corresponding cumulative frequencies. Connect the points in order from left to right. The graph should start at the lower boundary of the first class (cumulative frequency is zero) and should end at the upper boundary of the last class (cumulative frequency is equal to the sample size).

7. Summarizing data for two variables. Examples.

Thus far, we have seen tabular and graphical methods that are used to
summarize the data for one variable at a time. Often managers and decision
makers are interested in tabular and graphical methods that will help in
understanding the relationship between two variables. Such methods are,
respectively, cross-tabulation and scatter diagram.

Cross-tabulation is a tabular method for summarizing the data for two variables simultaneously.

Converting the entries in the table into row percentages or column percentages can offer additional insight about the relationship between the variables.

A scatter diagram is a graphical presentation of the relationship between two quantitative variables.
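A minimal Python sketch of a cross-tabulation; the (quality rating, price range) pairs are invented for the example:

```python
from collections import Counter

# Hypothetical (quality, price range) pairs for 12 restaurants
pairs = [("Good", "$10-19"), ("Very Good", "$20-29"), ("Good", "$10-19"),
         ("Excellent", "$30-39"), ("Very Good", "$20-29"), ("Good", "$20-29"),
         ("Very Good", "$10-19"), ("Excellent", "$20-29"), ("Good", "$10-19"),
         ("Very Good", "$30-39"), ("Excellent", "$30-39"), ("Good", "$10-19")]

table = Counter(pairs)  # joint frequency of each (row, column) combination
rows = ["Good", "Very Good", "Excellent"]
cols = ["$10-19", "$20-29", "$30-39"]

print(f"{'':<12}" + "".join(f"{c:>9}" for c in cols) + f"{'Total':>9}")
for r in rows:
    counts = [table[(r, c)] for c in cols]
    print(f"{r:<12}" + "".join(f"{v:>9}" for v in counts) + f"{sum(counts):>9}")
```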

8. Measures of data location. Examples.


We will consider here several numerical methods of descriptive statistics that
provide additional alternatives to tabular and graphical methods.

Consider a data set for a single variable, containing n items, or data values. We will define numerical measures of data location and dispersion. If the measures are computed for data from a sample, they are called sample statistics, and if the measures are computed for data from a population, they are called population parameters.

The most important numerical measure of central location is the mean, or average value, of a variable. The mean provides a measure of central location for a data set.
The mean of a data set is the sum of the data entries divided by the number of entries. To find the mean of a data set, use one of the following formulas:

Population Mean: μ = Σxᵢ / N.    Sample Mean: x̄ = Σxᵢ / n,

where Σxᵢ represents the sum of the data values in the data set, N is the number of entries in a population and n is the number of entries in a sample.

The mean can be interpreted as the balancing point of the data.

Another measure of central location for data is the median.

The median of a data set is the value that lies in the middle of the data when
the data set is ordered. If the data set has an odd number of entries, the median is the
middle data entry. If the data set has an even number of entries, the median is the
mean of the two middle data entries.

Although the mean is the more commonly used measure of central location, in some situations the median is preferred. The mean is influenced by extremely small or large data values. So, whenever there are extreme data values, the median is often the preferred measure of central location.

A third measure of location is the mode.

The mode is the data value that occurs with the greatest frequency. When the greatest frequency occurs at two or more different values, the data are said to be bimodal or multimodal, respectively.

The mode is an important measure of location for qualitative data, since in that case it makes no sense to speak of the mean or median.

As we have seen, the mean can be greatly affected when the data set contains
outliers. An outlier is a data entry that is far removed from the other entries in the
data set.
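A short Python illustration using the standard library statistics module (the salary figures are made up):

```python
import statistics

# Hypothetical sample of monthly starting salaries (in dollars)
data = [3450, 3550, 3550, 3480, 3355, 3310, 3490, 3730, 3540, 3925, 3520, 3480]

print("mean:  ", round(statistics.mean(data), 2))  # sum of entries / number of entries
print("median:", statistics.median(data))          # middle value of the ordered data
print("modes: ", statistics.multimode(data))       # most frequent values (bimodal here)
```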

9. Fractiles and box plot. Examples.

Fractiles are used to specify the position of a data entry within a data set.

Fractiles are numbers that divide an ordered data set into equal parts.

For instance, the median is a fractile because it divides an ordered data set into
two equal parts.

The most common fractiles are summarized as follows.

Fractiles       Summary                                    Symbols
Quartiles       Divide a data set into 4 equal parts       Q1, Q2, Q3
Deciles         Divide a data set into 10 equal parts      D1, D2, ..., D9
Percentiles     Divide a data set into 100 equal parts     P1, P2, ..., P99

Percentiles are often used in education and health-related fields to indicate how one individual compares with others in the group. For instance, test scores are often expressed in percentiles. Scores in the 95th percentile and above are unusually high, while those in the 5th percentile and below are unusually low.

To find a certain percentile for the given data set, we perform the following operations.

Step 1. Arrange the data in order from smallest to largest value.

Step 2. Compute the index i = (p/100)·n, where p is the percentile of interest and n is the number of items.

Step 3. (a) If i is not an integer, then the next integer greater than i denotes the position of the pth percentile.

(b) If i is an integer, then the pth percentile is the average of the data values in positions i and i + 1.

The interquartile range (IQR) of a data set is the range of the middle 50% of the data. It is the difference between the third and first quartiles: IQR = Q3 − Q1.

* Box Plot.

The interquartile range is used to represent data sets by box plots. The box plot is a data analysis tool that highlights the important features of a data set. To graph a box plot, you must know the following values:

1. The minimum entry
2. The first quartile Q1
3. The median Q2
4. The third quartile Q3
5. The maximum entry.

These five numbers are called the five-number summary of the data. Now, to draw the box plot perform the following steps:

1. Find the five-number summary of the data set.
2. Draw a horizontal scale.
3. Plot the five numbers.
4. Draw a box from Q1 to Q3 and draw a vertical line in the box at Q2.
5. Compute the lower limit according to the formula: lower limit = Q1 − 1.5·(IQR).
6. Compute the upper limit according to the formula: upper limit = Q3 + 1.5·(IQR). Data outside these limits are considered outliers.
7. Draw the whiskers from the box to the lower and upper limits.
8. Finally, show the outliers.
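A minimal Python sketch of the five-number summary and outlier limits; note that statistics.quantiles uses a slightly different interpolation rule than the index method described above, so the quartiles may differ somewhat from hand computation:

```python
import statistics

# Hypothetical test scores
data = [5, 7, 8, 9, 10, 12, 13, 14, 15, 17, 18, 35]

q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print("five-number summary:", min(data), q1, q2, q3, max(data))
print("IQR:", iqr, " limits:", lower_limit, upper_limit, " outliers:", outliers)
```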

10. Measures of data variability. Examples.

As we know, the range of a data set is the difference between the maximum and minimum data entries in the set. This is a measure of variation that uses only two entries from the data set. There are two measures of variability that use all the entries in a data set. These are the variance and the standard deviation.

The variance is based on the difference between each data value and the mean.

The difference between each data value xᵢ and the mean (x̄ for a sample, μ for a population) is called a deviation about the mean. Thus, for a sample, a deviation is (xᵢ − x̄); for a population, the deviation is (xᵢ − μ). In the calculation of the variance, the deviations about the mean are squared. The average of the squared deviations is called the variance.

Definition. The population variance is

σ² = Σ(xᵢ − μ)² / N, where N is the population size.

The sample variance is

s² = Σ(xᵢ − x̄)² / (n − 1), where n is the sample size.

The sample variance is often used to estimate the population variance.

Note that in computing the sample variance, we use the divisor (n − 1) instead of n. It can be shown that such a formula provides an unbiased estimate of the population variance.

The positive square root of the variance is called the standard deviation.

Definition.

Population Standard Deviation: σ = √σ².

Sample Standard Deviation: s = √s².

Unlike variance, the standard deviation is measured in the same units as the
original data. For this reason, the standard deviation is more easily compared to the
mean and other statistics that are measured in the same units as the original data.

The standard deviation measures the spread about the mean of a distribution. It
shows a typical deviation from the mean. The more the entries are spread out, the
greater the standard deviation.
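A short Python check of these definitions on a made-up sample (statistics.variance and statistics.stdev use the divisor n − 1):

```python
import statistics

# Hypothetical sample of class sizes
data = [46, 54, 42, 46, 32]

mean = statistics.mean(data)      # 44
var = statistics.variance(data)   # sample variance: sum of squared deviations / (n - 1)
sd = statistics.stdev(data)       # sample standard deviation: square root of the variance

print("mean:", mean, " variance:", var, " std dev:", sd)  # 44, 64, 8.0
```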
11. Empirical rule and Chebyshev's theorem. Examples.

* Empirical Rule (3σ Rule).

Many real-life data sets have approximately symmetric and bell-shaped distributions. For such distributions we can apply the Empirical Rule, which shows how valuable the standard deviation can be as a measure of variation.

Empirical Rule (3σ rule).

1. About 68% of the data lie within one standard deviation of the mean.

2. About 95% of the data lie within two standard deviations of the mean.

3. About 99.7% of the data lie within three standard deviations of the mean.
Data values that lie more than two standard deviations from the mean are
considered unusual. Data values that lie more than three standard deviations from the
mean are considered very unusual.

The empirical rule can be applied only to symmetric, bell-shaped distributions. But what if the distribution is not bell-shaped, or its shape is not even known? The following theorem gives an inequality statement that applies to all distributions.

Chebyshev's Theorem. At least 1 − 1/k² of any data set lies within k standard deviations of the mean (k > 1).

Thus, for any data set we have:

k = 2: At least 1 − 1/2² = 3/4, or 75%, of the data lies within 2 standard deviations of the mean.

k = 3: At least 1 − 1/3² = 8/9, or about 89%, of the data lies within 3 standard deviations of the mean.

k = 4: At least 1 − 1/4² = 15/16, or about 94%, of the data lies within 4 standard deviations of the mean.

One additional measure (statistic) is used when we are interested in how large the standard deviation is in relation to the mean. This measure is called the coefficient of variation and is calculated as follows.

Coefficient of Variation = (Standard Deviation / Mean) × 100%.

In general, the coefficient of variation is a useful statistic for comparing the variability in data sets having different standard deviations and different means.
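A small Python sketch verifying the Chebyshev bound and computing the coefficient of variation for a made-up data set:

```python
import statistics

# Hypothetical data set of unknown shape
data = [22, 25, 25, 27, 28, 30, 31, 33, 35, 44]

mean = statistics.mean(data)
sd = statistics.stdev(data)

# Chebyshev: at least 1 - 1/k^2 of the data lies within k sd of the mean
k = 2
low, high = mean - k * sd, mean + k * sd
inside = sum(low <= x <= high for x in data) / len(data)
print(f"within {k} sd: {inside:.0%} (Chebyshev guarantees at least {1 - 1/k**2:.0%})")

cv = sd / mean * 100  # coefficient of variation, in percent
print(f"coefficient of variation: {cv:.1f}%")
```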

12. Covariance and correlation coefficient. Examples.

* Covariance and Correlation Coefficient.


Researchers interested in the relationship between two variables often use descriptive measures such as the covariance and the correlation coefficient.

We introduce the covariance as a descriptive numerical measure of the linear dependence between two variables.

Definition. For a sample of n pairs (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the sample covariance s_xy is defined by the following equation:

s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1).

Note. The population covariance σ_xy is defined similarly for a population of size N with population means μ_x and μ_y of the variables x and y, respectively:

σ_xy = Σ(xᵢ − μ_x)(yᵢ − μ_y) / N.

A positive value of s_xy shows a positive linear dependence between x and y, while a negative value of s_xy indicates a negative linear dependence between these two variables. Finally, if the points are evenly distributed, the value of s_xy will be close to zero, showing no linear dependence between x and y.

One problem with using the covariance as an indicator of linear relationship is that the value we obtain for the covariance depends on the units of measurement for x and y. To avoid this difficulty, another measure of the relationship between two variables, called the correlation coefficient, is used.

Definition. For sample data, the correlation coefficient r_xy is defined as follows:

r_xy = s_xy / (s_x · s_y), where s_xy is the sample covariance, s_x is the sample standard deviation of x, and s_y is the sample standard deviation of y.

Similarly, the correlation coefficient ρ_xy for population data is defined:

ρ_xy = σ_xy / (σ_x · σ_y), where σ_xy is the population covariance, σ_x is the population standard deviation of x, and σ_y is the population standard deviation of y.

A value of +1 (−1) of the correlation coefficient corresponds to a perfect positive (negative) linear relationship between x and y. If the linear relationship is not perfect (the points of the scatter diagram are not all on a straight line), the absolute value of the correlation coefficient will be less than 1. A value of r_xy near zero indicates a weak linear relationship between the variables, and a value of r_xy equal to zero shows no linear relationship between x and y.
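A minimal Python sketch computing the sample covariance and correlation coefficient for made-up paired data:

```python
import statistics

# Hypothetical paired sample: hours studied (x) and exam score (y)
x = [2, 3, 5, 6, 8, 9]
y = [55, 60, 70, 72, 85, 90]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)

# sample covariance: sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# correlation coefficient: covariance divided by the product of standard deviations
r_xy = s_xy / (statistics.stdev(x) * statistics.stdev(y))

print("sample covariance:", round(s_xy, 2))
print("correlation coefficient:", round(r_xy, 3))  # close to +1: strong positive
```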

13. Probability of an event. Classical, empirical and subjective methods. Examples.


* Expected Value and Variance

We can measure the center of the probability distribution of a random variable with its expected value, or mean.

The expected value of a discrete random variable X is given by

E(X) = M(X) = Σ x·P(x).

The mean of the random variable describes the typical outcome (value) of the random
variable. To study the variation of the outcomes, we can use the variance and the standard
deviation of the random variable.

The variance of the discrete random variable X is

D(X) = Var(X) = σ² = Σ (x − M(X))²·P(x).

The standard deviation is

σ = √σ² = √( Σ (x − M(X))²·P(x) ).
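A short Python sketch for a hypothetical discrete distribution (the probabilities are invented for the example):

```python
import math

# Hypothetical distribution: value x -> probability P(x)
dist = {0: 0.10, 1: 0.25, 2: 0.30, 3: 0.25, 4: 0.10}

assert abs(sum(dist.values()) - 1.0) < 1e-9  # probabilities must sum to 1

mean = sum(x * p for x, p in dist.items())               # E(X) = sum of x * P(x)
var = sum((x - mean) ** 2 * p for x, p in dist.items())  # sum of (x - M(X))^2 * P(x)
sd = math.sqrt(var)

print("E(X):", mean, " Var(X):", var, " sd:", round(sd, 3))
```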

* Binomial Distribution

There are many experiments for which the results can be reduced to only two outcomes:
success and failure. Such are the binomial experiments.

A binomial experiment is characterized by the following properties.

1. The experiment consists of n identical independent trials.

2. There are only two possible outcomes in each trial: success and failure.

3. The probability of success, denoted by p, does not change from trial to trial. Consequently, the probability of failure, 1 − p, denoted by q, also does not change from trial to trial.

In a binomial experiment we are interested in the number of successes occurring in the n trials. Let us denote by X the number of successes occurring in the n trials. Then X is a discrete random variable, assuming the values 0, 1, 2, ..., n. The probability distribution associated with this random variable is called the binomial probability distribution.

To calculate the probability of x successes in n trials, a special formula, called the binomial probability formula, can be used.

Binomial Probability Formula (Bernoulli's Formula)

In a binomial experiment, the probability of x successes in n trials is

P(x) = C(n, x) · p^x · (1 − p)^(n − x), where p is the probability of a success in any trial.
There exist special tables developed to provide the probability of x successes in n trials, for a given probability p of a success in one trial. They are called tables of binomial probabilities, and are usually given in books on probability and statistics.

To find the mean, variance and standard deviation for a discrete random variable, we can use the formulas of the respective definitions. These formulas can be simplified when the random variable has a binomial distribution.

Expected Value: E(X) = M(X) = np.

Variance: D(X) = Var(X) = σ² = npq.

Standard Deviation: σ = √(npq).

* Poisson Distribution

In a binomial experiment you are interested in finding the probability of a certain number of successes in a given number of trials. Suppose instead that you are interested in the number of occurrences of an event over a specified interval of time or space. For instance, X might be the number of arrivals at a car wash in one hour. In such cases, the random variable has a Poisson probability distribution if the following conditions are satisfied.

Poisson Experiment

1. The probability of occurrence is the same for any two intervals of the same length.

2. The number of occurrences in one interval is independent of the number of occurrences in other intervals.

The probability of exactly x occurrences in an interval is given by the Poisson probability function:

P(x) = (μ^x · e^(−μ)) / x!,

where μ is the expected value, or the mean number of occurrences per interval, and e ≈ 2.71828. Note that the discrete random variable may assume an infinite number of values (x = 0, 1, 2, 3, ...).

The Poisson probability distribution can be used to approximate the binomial probability distribution when the probability p of a success in one trial is small and the number n of trials is large. The approximation is good whenever p ≤ 0.05 and n ≥ 20. In such a case, put μ = np and use the Poisson probability function.
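A minimal Python sketch comparing the binomial probabilities with their Poisson approximation, for made-up values n = 100 and p = 0.02:

```python
import math

def binomial_pmf(x, n, p):
    """P(x) = C(n, x) * p^x * (1 - p)^(n - x)"""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, mu):
    """P(x) = mu^x * e^(-mu) / x!"""
    return mu**x * math.exp(-mu) / math.factorial(x)

n, p = 100, 0.02
for x in range(4):
    b = binomial_pmf(x, n, p)
    ps = poisson_pmf(x, n * p)  # Poisson approximation with mu = np
    print(f"x={x}: binomial {b:.4f}, Poisson {ps:.4f}")
```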

14. Addition rule and multiplication rule. Examples.

15. Conditional probability. Bayes theorem. Examples.

16. Discrete probability distributions. Expected value and variance. Examples.


* Discrete Probability Distributions

For a discrete random variable X, the probability distribution is defined by a probability function, denoted P(x), which provides the probability for each value x of the random variable. A discrete probability distribution must satisfy the following conditions:

(a) The probability of each value is between 0 and 1, that is, 0 ≤ P(x) ≤ 1;

(b) The sum of all the probabilities is one: Σ P(x) = 1.

One important advantage of defining a random variable and its probability distribution is
that once it is known, it is relatively easy to find the probability of diverse events of interest.

17. Continuous probability distributions. Standard normal distribution. Examples.


* Continuous Probability Distributions.

So far, we have used the probability function P(x) to assign probabilities to the values of a random variable. Recall that this function provides the probability of a concrete value a discrete random variable can take. For a continuous random variable, we introduce the concept of a probability density function, denoted by f(x).

The function f(x) is a probability density function if the following two conditions are satisfied.

1. f(x) ≥ 0 for all values of x.

2. The total area under the graph of f(x) is equal to 1.

The first requirement is the analog of the requirement that P(x) ≥ 0 for discrete probability functions. The second condition is the analog of the condition that the sum of the probabilities must equal 1 for a discrete probability function.

Once a probability density function f(x) has been identified, the probability that X will assume a value in a certain interval [x₁, x₂] can be found by computing the area under the graph of f(x) over that interval.

The most important continuous probability distribution in statistical inference is the normal distribution. Normal distributions are used in diverse models applied in studying nature, industry and business, where the random variable represents, for example, the heights of people, measurement errors, amounts of rainfall, rental costs, and so on. The normal distribution can also be used as an approximation to the discrete binomial distribution.

The normal distribution is a continuous probability distribution for a random variable X, having the following probability density function:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),

where μ is the mean and σ is the standard deviation of the probability distribution of X.

A normal distribution has the following properties.

1. The graph of the normal distribution, called the normal curve, is bell-shaped and symmetric about the mean. Its points of inflection are one standard deviation from the mean.

2. The highest point on the normal curve is at the mean, which is also the
median and mode of the distribution.

3. The total area under the normal curve is equal to 1.

4. The standard deviation determines the width of the curve. Larger values of σ result in wider curves, showing more variability in the data.

5. The tails of the curve extend to infinity in both directions and never touch the
horizontal axis.

6. Probabilities for some commonly used intervals are: (a) 68.26% of the values of the random variable are within plus or minus one standard deviation of its mean; (b) 95.44% of the values are within plus or minus two standard deviations of the mean; (c) 99.72% of the values are within plus or minus three standard deviations of the mean.

* Standard Normal Distribution.

To find the probability that a normal random variable takes values in a certain segment, we must compute the area under the normal curve over that interval. Probabilities for all normal distributions are computed using the standard normal distribution.

A random variable that has a normal distribution with mean μ = 0 and standard deviation σ = 1 is said to have a standard normal probability distribution. This particular normal random variable is usually denoted by Z.


Any normal random variable X with mean μ and standard deviation σ can be converted to the standard normal random variable Z by the following formula:

z = (x − μ) / σ.

This is the same formula as that for computing z-scores for a data set. Thus, we can interpret Z as the number of standard deviations that the normal random variable X is from its mean μ.

For the standard normal probability distribution, areas under the normal curve have been computed and are available in special tables.

To compute probabilities for any normal random variable with mean μ and standard deviation σ, we first convert it to a standard normal variable. Then we can use the Standard Normal Table to find the desired probabilities.
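In code, the standard normal CDF plays the role of the table. A minimal Python sketch with made-up parameters:

```python
from statistics import NormalDist

# Hypothetical: X is normal with mean 70 and standard deviation 10
mu, sigma = 70, 10
x = 85

z = (x - mu) / sigma         # convert to the standard normal variable
p = NormalDist().cdf(z)      # area under the standard normal curve to the left of z
print(f"z = {z:.2f}, P(X <= {x}) = {p:.4f}")

# P(a <= X <= b) is the difference of two such areas
d = NormalDist(mu, sigma)
print("P(60 <= X <= 80) =", round(d.cdf(80) - d.cdf(60), 4))
```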

18. Sampling methods. Examples.

* Sampling

We have defined earlier the two basic notions of statistics: a population and a sample. Let us recall these definitions.

1. A population is the set of all the elements of interest in the study.

2. A sample is a subset of the population.

The main purpose of statistical inference is to make conclusions about a population from information obtained from a sample.

Recall that a numerical characteristic of the population is called a parameter, and the corresponding characteristic of the sample is called a sample statistic.

There are many methods of selecting a sample from a population. One of the
most common is simple random sampling. The definition and the process of selecting a
simple random sample depend on whether the population is finite or infinite.
A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected. In the case of an infinite population, a sample must be selected such that the following conditions are satisfied.

1. Each element selected comes from the same population.

2. Each element is selected independently.

Simple random sampling is not the only method for selecting a sample from a population. Let us see
some alternative sampling methods.

Stratified Random Sampling. In stratified random sampling, the population is first divided into groups called strata, such that each element in the population belongs to one and only one stratum (for instance, by department, age, marital status, and so on). After the strata are formed, a simple random sample is selected from each stratum. The results for the strata samples are combined into one estimate of the population characteristic of interest. If elements within strata are homogeneous, the strata will have low variance, and, consequently, small sample sizes can be used to obtain good estimates of the strata characteristics.

Cluster Sampling. In this case the population is first divided into separate groups called clusters. Each element of the population belongs to one and only one cluster. The elements of each cluster form a sample. The best results are provided when the elements within a cluster are heterogeneous (not alike). Then each cluster is a small-scale version of the entire population. Sampling a small number of clusters gives good estimates of the population parameters.

Cluster sampling is often applied in area sampling, where clusters are city blocks, schools, or other defined areas. Many sample observations can be obtained from a cluster in a relatively short time.

Systematic sampling. An alternative to simple random sampling is systematic sampling, applied especially for large populations. Suppose we want to select a sample of 50 items from a population of 10,000 elements. That is one element for every 200 elements in the population. We randomly select one of the first 200 elements from the population list. Starting with this first sampled element, we then select every 200th element that follows in the population list. This method is applicable especially when the population elements are ordered randomly.
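A minimal Python sketch of systematic sampling under the assumptions above (a randomly ordered population list):

```python
import random

# Hypothetical population list of 10,000 elements
population = list(range(10_000))
sample_size = 50
k = len(population) // sample_size  # one element for every k = 200

start = random.randrange(k)         # randomly pick one of the first k elements
sample = population[start::k]       # then take every k-th element that follows

print("step k =", k, " first element:", start, " sample size:", len(sample))
```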

19. Point estimator. Sampling distribution of sample mean. Examples.

* Point Estimation.
Let us return to the previous example and assume that a sample of 30 employees has been selected and the corresponding data are recorded. To estimate the value of a population parameter, we compute the value of a corresponding sample statistic. Thus, to estimate the population mean μ (the population average age) and the population standard deviation σ, we calculate the corresponding sample mean x̄ (the sample average age) and the sample standard deviation s. In addition, by computing the proportion of employees in the sample who have completed the training program, we find the sample proportion p̄, which can be used to estimate the population proportion p.

The statistical procedure in which we use the data from a sample to compute a value of a sample statistic that serves as an estimate of a population parameter is called point estimation. Thus, x̄ is the point estimator of the population mean μ, s is the point estimator of the population standard deviation σ, and p̄ is the point estimator of the population proportion p. The actual numerical values of x̄, s, and p̄ obtained for a particular sample are called point estimates of the respective population parameters.

Usually, we select only one sample from the population of interest. If the sampling process is repeated many times, then different samples generate a variety of values for the sample statistics x̄, s, and p̄. It follows that each of these statistics is a random variable. Just like any random variable, each of the sample statistics x̄, s, and p̄ has a mean or expected value, a variance, and a standard deviation. The probability distribution of any sample statistic is called the sampling distribution of that statistic.

* Sampling Distribution of x̄

The sample mean x̄ is commonly used to make inferences about the population mean μ. The sampling distribution of x̄ is the probability distribution of the sample mean.

Let us denote by E(x̄) the expected value of x̄, and use the notation σ_x̄ for the standard deviation of the random variable x̄. Then the following properties of the sample mean x̄ can be proved.

1. The expected value of the sample mean is equal to the population mean: E(x̄) = μ.

2. If the size of the population is large in comparison to the sample size, that is, if n/N ≤ 0.05, then the standard deviation of x̄ is σ_x̄ = σ/√n; otherwise, σ_x̄ = √((N − n)/(N − 1)) · σ/√n.

The standard deviation of the sample mean is called the standard error of the mean.
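A short Python illustration of the standard error with and without the finite population correction (the numbers are made up):

```python
import math

# Hypothetical: sigma = 12, sample of n = 36 from a very large population
sigma, n = 12, 36
se = sigma / math.sqrt(n)  # standard error of the mean: sigma / sqrt(n)
print("standard error:", se)  # 2.0

# Finite population of N = 500: apply the correction factor sqrt((N - n)/(N - 1))
N = 500
se_fpc = math.sqrt((N - n) / (N - 1)) * sigma / math.sqrt(n)
print("with correction:", round(se_fpc, 3))
```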

20. Central limit theorem. Sampling distribution of sample proportion. Examples.

* Central Limit Theorem

This theorem describes the relationship between the sampling distribution of the sample mean and the respective population distribution.

Central Limit Theorem.

In selecting simple random samples of size n from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean x̄ can be approximated by a normal probability distribution if n ≥ 30. The bigger the sample size, the better the approximation. If the population itself has a normal probability distribution, then the sampling distribution of x̄ is normally distributed for any sample size n.


Each time we select a simple random sample and compute the sample mean x̄ to estimate the population mean μ, we cannot expect the sample mean to be exactly equal to the population mean. The absolute value of the difference between x̄ and μ, |x̄ − μ|, is called the sampling error. The sampling distribution of x̄ can be used to provide information about the size of the sampling error.

As the sample size increases, the standard error of the mean decreases. As a result, larger sample sizes will provide higher probabilities that the sample mean is close to the population mean.

* Sampling Distribution of p̄

In many problems in business and economics, we need the sample proportion p̄ to estimate the population proportion p. The sample proportion is nothing other than a sample mean. Indeed, suppose a simple random sample (x₁, x₂, ..., xₙ) is selected from the population, with xᵢ = 1 when a characteristic of interest is present for the i-th element and xᵢ = 0 when the characteristic is not present. Then the sample proportion is p̄ = Σxᵢ/n, which is exactly the sample mean. The probability distribution of all possible values of the sample proportion p̄ is called the sampling distribution of the sample proportion.

The sampling distribution of p̄ has the following properties.

1. The expected value of p̄ is E(p̄) = p.

2. If the size of the population is large in comparison to the sample size, that is, if n/N ≤ 0.05, then the standard deviation of p̄ is σ_p̄ = √(p(1 − p)/n); otherwise, σ_p̄ = √((N − n)/(N − 1)) · √(p(1 − p)/n).
To determine the form of the sampling distribution of p̄, we apply the central limit theorem, which implies the following. The sampling distribution of p̄ can be approximated by a normal probability distribution when the sample size is large, that is, when np ≥ 5 and n(1 − p) ≥ 5.

The absolute value of the difference between the value of the sample proportion p̄ and the value of the population proportion p, |p̄ − p|, is called the sampling error. Information about the sampling error can be derived from the sampling distribution of p̄.
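A small simulation sketch of these properties for a made-up population proportion:

```python
import random
import statistics

# Hypothetical population proportion p = 0.4; draw many samples of size n = 100
p, n, trials = 0.4, 100, 2000
p_bars = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]

print("mean of p_bar:", round(statistics.mean(p_bars), 3))   # close to p
print("sd of p_bar:  ", round(statistics.stdev(p_bars), 4))  # close to theory
print("theory:       ", round((p * (1 - p) / n) ** 0.5, 4))  # sqrt(p(1-p)/n)
```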

21. Interval estimation of the population mean (n ≥ 30). Examples.

* Interval Estimation of a Population Mean (Large-Sample Case).

As we know, point estimators offer estimates for the values of the population parameters (mean, proportion). However, they do not provide information about the precision of the estimation. You will learn further how interval estimation of population parameters can be used to deliver such information.

The interval [θ₁, θ₂] that covers the unknown population parameter with a probability of c is called the confidence interval, and the probability c is called the confidence coefficient or the confidence level.

In our course we will most often use the following confidence coefficients and the corresponding z-scores:

Confidence coefficient:   90%    95%    99%
z-score:                  1.64   1.96   2.57


You can see in the graph, that if the probability that z falls within the confidence interval is c

(the confidence coefficient is c ), then the remaining area is (1 c , and the probability that z falls

(1c)
within each tail is 2 .

For instance, if c= 90%, then 5% of the area lies to the left of z c = 1.64 and 5% lies to

z c = 1.64.
the right of

z c and z c are called critical values.


The values

Given a confidence coefficient c, the margin of error E is defined as follows:

E = z_c · σ_x̄ = z_c · σ/√n.

In many sampling situations the value of the population standard deviation is unknown. If the sample size is large (n ≥ 30), we simply use the statistic s to estimate the population standard deviation σ.

Knowing the sample mean and the margin of error, the confidence interval for the population mean is now written as

x̄ − E ≤ μ ≤ x̄ + E.

Let us summarize the procedure of finding the confidence interval for the population mean (for the case n ≥ 30) in the following steps.

1. Find the sample mean x̄.

2. Determine the standard deviation σ_x̄ of the sample mean x̄. If the population standard deviation σ is known, use the formula σ_x̄ = σ/√n; otherwise find the sample standard deviation s = √( Σ(xᵢ − x̄)² / (n − 1) ) and use s/√n as an estimate of σ_x̄.

3. Find the critical value z_c that corresponds to the given confidence coefficient c, using the Standard Normal Table.

4. Compute the margin of error E = z_c · σ_x̄.

5. Form the confidence interval x̄ − E ≤ μ ≤ x̄ + E.

Note that a larger confidence coefficient provides a wider confidence interval.

* Minimum Sample Size

One way to improve the precision of the estimate without decreasing the confidence coefficient is to enlarge the sample size. For a certain confidence coefficient c and a given margin of error E, the minimum sample size n needed to estimate the population mean μ is

n = (z_c · σ / E)².

If σ is unknown, it can be estimated by the sample statistic s, provided that the sample size is at least 30 elements.
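A minimal Python sketch of both computations, with made-up sample values:

```python
import math
from statistics import NormalDist

# Hypothetical large sample: n = 49, sample mean 24.8, sample sd 6.3
n, x_bar, s = 49, 24.8, 6.3
c = 0.95
z_c = NormalDist().inv_cdf((1 + c) / 2)  # critical value: about 1.96 for 95%

E = z_c * s / math.sqrt(n)               # margin of error
print(f"{c:.0%} CI: [{x_bar - E:.2f}, {x_bar + E:.2f}]")

# Minimum sample size for a desired margin of error of 1.5
E_target = 1.5
n_min = math.ceil((z_c * s / E_target) ** 2)
print("minimum sample size:", n_min)
```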

22. Interval estimation of the population mean (n < 30). Examples.

* Interval Estimation of a Population Mean (Small-Sample Case)


In many real-life situations it is often impractical to collect samples of size 30 or more. Moreover, the population standard deviation σ is often unknown. How could you build the confidence interval for the population mean in this case?

If the population has a normal probability distribution, the sampling distribution of the sample mean x̄ is normal regardless of the sample size. To build the confidence interval for μ, the sample standard deviation s will be used to estimate the unknown population standard deviation σ, and a probability distribution known as the t-distribution will be applied instead of the standard normal probability distribution.

The quantity t = (x̄ − μ)/(s/√n) is said to have a t-distribution if X is a normally distributed random variable. Here μ is the expected value of X, and n, x̄, and s are, respectively, the sample size, the sample mean, and the sample standard deviation.

Main properties of the t-distribution are as follows.

1. The t-distribution is a family of curves, each bell-shaped and symmetric about the mean.

2. Each t-distribution curve depends on a parameter known as the degrees of freedom: d.f. = n − 1.

3. The total area under the t-curve is 1.

4. The mean, the median, and the mode of the t-distribution are 0.

5. As the degrees of freedom increase, the t-distribution approaches the standard normal distribution.
We will use the notation t_(α/2) to indicate the t-value with an area of α/2 in the upper tail of the t-distribution, where α = 1 − c, and c is the confidence coefficient.

The T-Distribution Table lists the values of t_(α/2) for given confidence coefficients and degrees of freedom.

Now let us see how the t-distribution is used to build a confidence interval for the population mean. The procedure is similar to constructing a confidence interval for the normal distribution: in both cases we need a sample mean x̄ and a margin of error E. Let us summarize it in the following steps.

1. Find the sample mean x̄ and the sample standard deviation s.

2. Determine the degrees of freedom d.f., the confidence coefficient c and the respective t_(α/2) value.

3. Find the margin of error E = t_(α/2) · s/√n.

4. Build the confidence interval x̄ − E ≤ μ ≤ x̄ + E.
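A sketch of the small-sample procedure, assuming SciPy is available for the t critical value (the sample figures are made up):

```python
import math
from scipy.stats import t

# Hypothetical small sample: n = 16, sample mean 9.4, sample sd 2.1
n, x_bar, s = 16, 9.4, 2.1
c = 0.95
alpha = 1 - c
t_crit = t.ppf(1 - alpha / 2, df=n - 1)  # t value with area alpha/2 in the upper tail

E = t_crit * s / math.sqrt(n)            # margin of error
print(f"t = {t_crit:.3f}, {c:.0%} CI: [{x_bar - E:.2f}, {x_bar + E:.2f}]")
```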

23. Interval estimation of the population proportion. Examples.

* Interval Estimation of a Population Proportion

Building a confidence interval for a population proportion p is similar to building a confidence interval for a population mean. The following steps are to be made.

1. Based on a sample of size n, find the sample proportion p̄.

2. Check whether the sampling distribution of p̄ can be approximated by a normal distribution (np̄ ≥ 5 and n(1 − p̄) ≥ 5).

3. Find the critical value z_(α/2) (z_(α/2) = z_c) that corresponds to the given confidence coefficient c (α = 1 − c), using the Standard Normal Table.

4. Compute the margin of error E = z_(α/2) · √(p̄(1 − p̄)/n).

5. Form the confidence interval: p̄ − E ≤ p ≤ p̄ + E.

As we have seen earlier, one way to increase the precision of a confidence interval without diminishing the confidence coefficient is to increase the sample size. Given a confidence coefficient c and a margin of error E, the minimum sample size n needed to estimate the population proportion p is

n = p̄(1 − p̄) · (z_c / E)²,

if p̄ is known. If not, use p̄ = 0.5.
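A minimal Python sketch of the proportion interval and the minimum sample size, with a made-up poll:

```python
import math
from statistics import NormalDist

# Hypothetical poll: 220 of 500 respondents support a proposal
n, successes = 500, 220
p_bar = successes / n
c = 0.95
z_c = NormalDist().inv_cdf((1 + c) / 2)

assert n * p_bar >= 5 and n * (1 - p_bar) >= 5  # normal approximation check

E = z_c * math.sqrt(p_bar * (1 - p_bar) / n)    # margin of error
print(f"p_bar = {p_bar:.2f}, {c:.0%} CI: [{p_bar - E:.3f}, {p_bar + E:.3f}]")

# Minimum n for a margin of error of 0.03 with no prior estimate (p_bar = 0.5)
print("minimum n:", math.ceil(0.5 * 0.5 * (z_c / 0.03) ** 2))
```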

24. Interval estimation of the population variance and standard deviation. Examples.

* Interval Estimation of a Population Variance and Standard Deviation

In many practical situations it is important to control the amount of variation in some characteristic of the population. For instance, thousands of parts produced by a factory must vary little or not at all. To control the variation σ², you may use its point estimate s². Consequently, for the standard deviation σ the point estimate s can be used. To build a confidence interval for the variance and standard deviation, the chi-square distribution is applied.

If a simple random sample of size n is selected from a normal population, the sampling distribution of the quantity

χ² = (n − 1)s² / σ²

has a chi-square distribution with n − 1 degrees of freedom.

The following properties of the chi-square distribution are to be mentioned.

1. All χ² values are nonnegative.

2. The area under each curve of the chi-square distribution is 1.

3. Chi-square distributions are positively skewed.

The values of the χ² distribution are tabulated for various degrees of freedom. Each value in the table represents the area under the chi-square curve to the right of the critical value. There are two critical values for each level of confidence: χ²_L and χ²_R.

For a confidence level of c, the areas to the right of χ²_R and χ²_L are, respectively, (1 − c)/2 and (1 + c)/2. Thus, there is a probability of c of obtaining a χ² value such that

χ²_L ≤ χ² ≤ χ²_R.

Since χ² = (n − 1)s²/σ², from χ²_L ≤ (n − 1)s²/σ² we can derive the inequality σ² ≤ (n − 1)s²/χ²_L. Similarly, from (n − 1)s²/σ² ≤ χ²_R the inequality (n − 1)s²/χ²_R ≤ σ² can be obtained. As a result, we can write the following general expression for the confidence intervals:

Confidence Interval for σ²:

(n − 1)s²/χ²_R ≤ σ² ≤ (n − 1)s²/χ²_L.

Confidence Interval for σ:

√( (n − 1)s²/χ²_R ) ≤ σ ≤ √( (n − 1)s²/χ²_L ).
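A sketch of these intervals, assuming SciPy is available for the chi-square critical values (the sample figures are made up):

```python
import math
from scipy.stats import chi2

# Hypothetical sample: n = 20 parts, sample variance s^2 = 1.3
n, s2 = 20, 1.3
c = 0.95
alpha = 1 - c
df = n - 1

chi_R = chi2.ppf(1 - alpha / 2, df)  # right critical value (area alpha/2 to its right)
chi_L = chi2.ppf(alpha / 2, df)      # left critical value

var_low, var_high = df * s2 / chi_R, df * s2 / chi_L
print(f"CI for variance: [{var_low:.3f}, {var_high:.3f}]")
print(f"CI for std dev:  [{math.sqrt(var_low):.3f}, {math.sqrt(var_high):.3f}]")
```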

25. Hypothesis testing procedure. Examples.


Hypothesis testing is a process in which sample statistics are used to determine whether a statement
about the value of a population parameter should be accepted or rejected.

A statement about a population parameter is called a statistical hypothesis. Traditionally, two hypotheses are considered: H0, the null hypothesis, and Ha, the alternative hypothesis.

Definition. A null hypothesis H0 contains a statement of equality, such as ≤, =, or ≥. The alternative hypothesis Ha is the complement of the null hypothesis, and it contains a statement of strict inequality, such as <, ≠, or >. The conclusion that Ha is true can be made if the sample data indicate that H0 is false.

Generally, a hypothesis test about the value of the population mean μ may take one of the following three forms:

H0: μ ≤ μ₀, Ha: μ > μ₀;    H0: μ ≥ μ₀, Ha: μ < μ₀;    H0: μ = μ₀, Ha: μ ≠ μ₀.

Similar statements can be formulated to test other population parameters.

No matter which of the three forms of hypothesis test is used, you always begin by assuming that the equality condition in the null hypothesis is true. After performing the hypothesis test, you will make one of two decisions:

1. Reject the null hypothesis, or

2. Accept (fail to reject) the null hypothesis.

26. Types of errors and level of significance. Examples.

Definition. The type I error is made if the null hypothesis is rejected when it is true. The type II
error is made if the null hypothesis is not rejected when it is false.
Thus, we have the following four results of a hypothesis testing.

                        State of Things
Decision                H0 is true           H0 is false
Do not reject H0        Correct decision     Type II error
Reject H0               Type I error         Correct decision

Although we cannot eliminate the possibility of errors while performing a hypothesis test, we can
indicate the probability of their occurrence. The common notations are:

α — the probability of making a Type I error;

β — the probability of making a Type II error.

Most applications of hypothesis testing control the probability of making a Type I error and do not always control the probability of making a Type II error. That is why, in order to avoid the risk of making a Type II error, the statement "do not reject H0" is used instead of "accept H0".

Definition. The maximum allowable probability of making a Type I error is called the level of significance for the test.

Common choices for the level of significance are α = 0.1, α = 0.05, and α = 0.01. The lower the level of significance, the smaller the probability of rejecting a true null hypothesis.

After stating the null and alternative hypotheses, and specifying the level of significance, the next
step is selecting a random sample from the population and calculating the sample statistics. The statistic used
to estimate the parameter in the null hypothesis is called the test statistic. The following table shows the
population parameters and corresponding test statistics.

Population parameter    Test statistic    Standardized test statistic
μ                       x̄                 z (n ≥ 30) or t (n < 30)
p                       p̄                 z
σ²                      s²                χ²

One way to decide whether to reject the null hypothesis is to check if the standardized test statistic
falls within a rejection region of the sampling distribution.

Definition. A rejection region (or critical region) of the sampling distribution is the range of values of the test statistic for which the null hypothesis is rejected. The value of the test statistic that establishes the boundary of the rejection region is called the critical value.

27. Hypothesis tests about a population mean. Examples.

* Tests about a Population Mean.

The following is the procedure of conducting a hypothesis test for a population mean (large-sample
case).

1. State the null and alternative hypotheses H0 and Ha.

2. Specify the level of significance α.

3. Use the Standard Normal Table to determine the critical value z₀ (or values ±z₀).

4. Sketch the graph of the normal curve and indicate the rejection region. If the rejection region is placed only in the lower tail (or upper tail) of the sampling distribution, then we say the test is a left-tailed (or right-tailed) hypothesis test. If the rejection region is placed in both the lower and the upper tails of the sampling distribution, then we say the test is a two-tailed hypothesis test.

5. Find the standardized test statistic (for the population mean it is called the z-statistic):

z = (x̄ − μ₀) / (σ/√n) if σ is known, or z = (x̄ − μ₀) / (s/√n) if σ is unknown.

6. Make a decision: if z is in the rejection region, reject H0; otherwise fail to reject H0.
Assume now that the sample size is small (n < 30) and the population standard deviation σ is unknown. If the population has a normal distribution, you can use the t-distribution to make inferences about the population mean. The test statistic for the mean is

t = (x̄ − μ₀) / (s/√n)  (called the t-statistic), where s is the sample standard deviation. This statistic has a t-distribution with n − 1 degrees of freedom.

The procedure of conducting a hypothesis test for a population mean for the small-sample case is as follows.

1. State the null and alternative hypotheses H0 and Ha.

2. Specify the level of significance α.

3. Identify the degrees of freedom, d.f. = n − 1.

4. Use the T-Distribution Table to determine the critical value t₀ (or values ±t₀).

5. Determine the rejection region.

6. Find the standardized test statistic (t-statistic): t = (x̄ − μ₀) / (s/√n).

7. Make a decision: if t is in the rejection region, reject H0; otherwise fail to reject H0.
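A minimal Python sketch of the small-sample (t) test, assuming SciPy for the critical value; the data and hypotheses are made up:

```python
import math
from scipy.stats import t as t_dist

# Hypothetical two-tailed test: H0: mu = 20 vs Ha: mu != 20, alpha = 0.05
mu0, alpha = 20, 0.05
sample = [19.1, 21.2, 18.7, 22.4, 20.9, 17.8, 19.5, 21.0, 18.2, 20.3]

n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

t_stat = (x_bar - mu0) / (s / math.sqrt(n))  # t-statistic
t0 = t_dist.ppf(1 - alpha / 2, df=n - 1)     # critical value for a two-tailed test

print(f"t = {t_stat:.3f}, rejection region: |t| > {t0:.3f}")
print("reject H0" if abs(t_stat) > t0 else "fail to reject H0")
```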

28. Hypothesis tests about a population proportion. Examples.

* Tests about a Population Proportion.

The three forms for a hypothesis test about a population proportion p are as follows:

H0: p ≤ p₀, Ha: p > p₀;    H0: p ≥ p₀, Ha: p < p₀;    H0: p = p₀, Ha: p ≠ p₀.

If np₀ ≥ 5 and n(1 − p₀) ≥ 5, then the sampling distribution of p̄ is approximately normal, with an expected value of E(p̄) = p₀ and a standard deviation of σ_p̄ = √(p₀(1 − p₀)/n) (assuming, as the test does, that p = p₀). Consequently, the following z-test about a population proportion can be used:

z = (p̄ − p₀) / σ_p̄.
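A short Python sketch of this z-test for a made-up right-tailed example:

```python
import math
from statistics import NormalDist

# Hypothetical right-tailed test: H0: p <= 0.5 vs Ha: p > 0.5, alpha = 0.05
p0, alpha = 0.5, 0.05
n, successes = 200, 115
p_bar = successes / n

sigma_p = math.sqrt(p0 * (1 - p0) / n)  # standard deviation of p_bar under H0
z = (p_bar - p0) / sigma_p              # z-statistic
z0 = NormalDist().inv_cdf(1 - alpha)    # critical value for a right-tailed test

print(f"z = {z:.3f}, critical value = {z0:.3f}")
print("reject H0" if z > z0 else "fail to reject H0")
```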

29. Hypothesis tests about means and proportions with two populations. Examples.

* Inference about Means and Proportions with two Populations.

You will learn further how to test a claim about the difference between the same parameters from two
populations. For instance, you may want to conduct a hypothesis test to find whether there is any difference
between educational quality provided at two high schools, or test the difference between the proportions of
defective parts supplied by two factories.

The type of test to be used is determined by the sizes of the samples selected from the two populations, as well as by whether the respective samples are dependent or independent.

Definition. Two samples are called independent if they are selected from two different populations and are not related to one another. Two samples are called dependent or matched if each element of one sample corresponds to an element of the other sample.

For instance, if you randomly select 100 graduates from university A and 90 graduates from university B and test their qualification level, you obtain two independent samples. But if you randomly select 70 freshmen from a university, measure their qualification level, and then, after 3 years, test the same sample of students again, you have dependent (or matched) samples.

For a claim about two population parameters $\mu_1$ and $\mu_2$, the possible pairs of null and alternative hypotheses are

$H_0: \mu_1 = \mu_2$      $H_0: \mu_1 \ge \mu_2$      $H_0: \mu_1 \le \mu_2$
$H_a: \mu_1 \ne \mu_2$      $H_a: \mu_1 < \mu_2$      $H_a: \mu_1 > \mu_2$

Difference between the means of two populations. Independent samples. Let us consider the hypothesis tests about the difference between the means of two populations for independent samples. If each sample size is at least 30, then the sampling distribution of the difference of the sample means, $\bar{x}_1 - \bar{x}_2$, can be approximated by a normal probability distribution with mean and standard deviation as follows.

$E(\bar{x}_1 - \bar{x}_2) = \mu_1 - \mu_2$,   $\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\sigma_{\bar{x}_1}^2 + \sigma_{\bar{x}_2}^2}$.

Since the sampling distribution of $\bar{x}_1 - \bar{x}_2$ is normal, we can use the $z$-test with the standardized test statistic of the form

$z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_{\bar{x}_1}^2 + \sigma_{\bar{x}_2}^2}}$.
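A sketch of this large-sample z-test on hypothetical summary statistics follows; with both sample sizes at least 30, the sample variances are used in place of the unknown population variances.

# Hypothetical two-sample z-test for H0: mu1 - mu2 = 0 (two-tailed).
from math import sqrt
from scipy.stats import norm

x1, s1, n1 = 72.4, 8.2, 45    # sample 1: assumed mean, std. deviation, size
x2, s2, n2 = 69.9, 7.5, 52    # sample 2: assumed mean, std. deviation, size

se = sqrt(s1**2 / n1 + s2**2 / n2)   # estimated standard deviation of x1-bar - x2-bar
z = ((x1 - x2) - 0) / se             # hypothesized difference is 0
p_value = 2 * norm.sf(abs(z))        # two-tailed p-value
print(f"z = {z:.3f}, p-value = {p_value:.4f}")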

In real life it is often impractical to collect samples of large size. To test the difference between the means of two small independent samples, we assume that both populations have a normal probability distribution. Under this condition, the sampling distribution for the difference of sample means $\bar{x}_1 - \bar{x}_2$ is approximated by a $t$-distribution with mean $\mu_1 - \mu_2$. The standard deviation and the degrees of freedom depend on whether the population standard deviations $\sigma_1$ and $\sigma_2$ are equal or not.

The t-statistic for the difference between two population means $\mu_1$ and $\mu_2$ is

$t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$, where $s_{\bar{x}_1 - \bar{x}_2}$ is the standard deviation of $\bar{x}_1 - \bar{x}_2$.

If the population variances are equal, then

$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{s^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}$,

where $s^2$ is a weighted average of the two sample variances $s_1^2$ and $s_2^2$:

$s^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$,   and d.f. $= n_1 + n_2 - 2$.

If the population variances are not equal, then the standard error is

$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$   and d.f. $= \min(n_1 - 1,\, n_2 - 1)$.
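In practice both variants can be computed with scipy's ttest_ind, as in the sketch below on hypothetical samples. Note that with equal_var=False scipy uses the Welch-Satterthwaite approximation for the degrees of freedom rather than the conservative min(n1 - 1, n2 - 1) rule given above.

# Hypothetical small-sample t-tests for two independent samples.
from scipy.stats import ttest_ind

group_a = [23.1, 25.4, 22.8, 26.0, 24.3, 23.7]   # assumed sample from population 1
group_b = [21.5, 22.9, 20.8, 23.4, 21.1, 22.6]   # assumed sample from population 2

pooled = ttest_ind(group_a, group_b, equal_var=True)    # equal population variances
welch = ttest_ind(group_a, group_b, equal_var=False)    # unequal population variances
print(f"pooled: t = {pooled.statistic:.3f}, p = {pooled.pvalue:.4f}")
print(f"Welch:  t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")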

Difference between the means of two populations. Matched samples. To perform a two-sample hypothesis test with dependent samples, we must use a different technique. First find the difference $d_i$ for each data pair. Then determine the mean of these differences: $\bar{d} = \dfrac{\sum d_i}{n}$. If both populations are normally distributed, then the sampling distribution of $\bar{d}$ is approximated by a $t$-distribution with $n - 1$ degrees of freedom, where $n$ is the number of data pairs.

Let us denote by $\mu_d$ the mean of the difference values in the population and formulate the null and alternative hypotheses as follows.

$H_0: \mu_d = 0$      $H_0: \mu_d \ge 0$      $H_0: \mu_d \le 0$
$H_a: \mu_d \ne 0$      $H_a: \mu_d < 0$      $H_a: \mu_d > 0$.

The sample mean and sample standard deviation for the difference values are

$\bar{d} = \dfrac{\sum d_i}{n}$   and   $s_d = \sqrt{\dfrac{\sum (d_i - \bar{d})^2}{n - 1}}$.

To test the null hypothesis, we will use the t-statistic

$t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$ with $n - 1$ degrees of freedom.
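A sketch with hypothetical before/after measurements on the same subjects follows; scipy's ttest_rel forms the pair differences internally.

# Hypothetical matched-samples t-test for H0: mu_d = 0.
from scipy.stats import ttest_rel

before = [62, 58, 71, 65, 60, 68, 64]   # assumed first measurement per subject
after = [66, 61, 70, 69, 63, 72, 67]    # assumed second measurement, same subjects

res = ttest_rel(after, before)          # tests mu_d = 0 on d_i = after_i - before_i
print(f"t = {res.statistic:.3f}, p-value = {res.pvalue:.4f}")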

Difference between the proportions of two populations. The difference between two population proportions $p_1$ and $p_2$ can be tested using a sample proportion from each population. The following pairs of null and alternative hypotheses are considered.

$H_0: p_1 = p_2$      $H_0: p_1 \ge p_2$      $H_0: p_1 \le p_2$
$H_a: p_1 \ne p_2$      $H_a: p_1 < p_2$      $H_a: p_1 > p_2$.

If the samples are randomly selected, independent, and large enough to use a normal sampling distribution, that is, $n_1 \bar{p}_1 \ge 5$, $n_1 \bar{q}_1 \ge 5$, $n_2 \bar{p}_2 \ge 5$, and $n_2 \bar{q}_2 \ge 5$, then the sampling distribution for the difference between the sample proportions, $\bar{p}_1 - \bar{p}_2$, is a normal distribution with mean

$E(\bar{p}_1 - \bar{p}_2) = p_1 - p_2$

and standard error

$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\bar{p}\bar{q} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}$, where $\bar{p}$ is the weighted mean of the sample proportions, that is,

$\bar{p} = \dfrac{n_1 \bar{p}_1 + n_2 \bar{p}_2}{n_1 + n_2}$   and   $\bar{q} = 1 - \bar{p}$.

If the sampling distribution for $\bar{p}_1 - \bar{p}_2$ is normal, you can use the following $z$-statistic to test the difference between two population proportions.

$z = \dfrac{(\bar{p}_1 - \bar{p}_2) - (p_1 - p_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}$.
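A sketch of this test on hypothetical counts, using the pooled (weighted mean) proportion in the standard error:

# Hypothetical two-proportion z-test for H0: p1 = p2 (two-tailed).
from math import sqrt
from scipy.stats import norm

n1, x1 = 300, 108    # sample 1: assumed size and number of "successes"
n2, x2 = 250, 74     # sample 2: assumed size and number of "successes"

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                         # weighted mean of proportions
se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))       # standard error under H0
z = (p1 - p2) / se
print(f"z = {z:.3f}, two-tailed p-value = {2 * norm.sf(abs(z)):.4f}")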
30. Individual and aggregate price indexes. Examples.

* Individual Price Indexes.

Index numbers are used as a descriptive statistical tool for describing the evolution of an economic variable over time. An index number represents a ratio between the value of the variable recorded in one period, called the current period, and the value of the same variable recorded in a reference period, called the base period.

The National Bureau of Statistics regularly publishes a variety of indexes that can help users to
understand current business and economic situation. The most widely known and used is the Consumer Price
Index (CPI). The CPI measures changes in price over a period of time. Given a starting point or base period
with its associated index of 100, the CPI can be used to compare current period consumer prices with prices
in the base period. For example, a CPI of 110 shows that consumer prices increased approximately 10%
compared to the base period.

To compare prices in different years, we convert them to price relatives, or individual price indexes,
which express the unit price in each period as a percentage of the unit price in a base period.

Price relative in period $t = \dfrac{\text{Price in period } t}{\text{Base period price}} \times 100$.

Knowing the price relative, you can easily compare the price in any one year with the price in the
base year. Price relatives are very helpful in understanding and interpreting economic changes over time.
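For instance, a few lines of Python suffice to convert a hypothetical price history into price relatives:

# Hypothetical unit prices of one item over several periods.
base_price = 2.50                   # assumed unit price in the base period
prices = [2.50, 2.65, 2.80, 3.10]   # assumed unit prices in successive periods

relatives = [p / base_price * 100 for p in prices]
print(relatives)   # [100.0, 106.0, 112.0, 124.0]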

* Aggregate Price Indexes.

Economists are often more interested in the general price change for a group of products and services
taken as a whole. For example, if we are interested in the overall cost of living over time, we will need the
index that involves the price changes for a variety of items, including food, housing, clothing, medical care,
and so on. To measure the combined change of a group of items, a special aggregate price index is developed.

A simple (unweighted) aggregate price index, $I_t$, in period $t$ is computed by simply summing the unit prices in period $t$ and dividing that sum by the sum of unit prices of the base year:

$I_t = \dfrac{\sum P_{it}}{\sum P_{i0}} \times 100$,

where $P_{it}$ is the unit price for item $i$ in period $t$, and $P_{i0}$ is the unit price for item $i$ in the base period.

The value of the index is heavily influenced by the items having large per-unit prices. Because of
such sensitivity, the simple index is not widely used. Instead, a weighted aggregate price index is commonly
applied. In computing this index, each item in the group is weighted according to its importance, for instance,
its quantity of usage or quantity weights. The quantity of usage shows the expected annual usage for each
type of item.

If $Q_i$ denotes the quantity weight for item $i$, then the weighted aggregate price index is given by

$I_t = \dfrac{\sum P_{it} Q_i}{\sum P_{i0} Q_i} \times 100$.

The weighted index, compared with the simple aggregate index, typically shows a more moderate change in expenses, since it takes into account the quantity of usage of the main products (such as bread and milk) and helps to offset a large increase in the price of a rarely purchased item (such as repair costs). The weighted aggregate index with quantities of usage as weights is the preferred method for determining a price index for a group of products and services.

When the quantities $Q_i$ are considered fixed and do not vary with time, they can be determined from base-year usage: $Q_i = Q_{i0}$. Once established, they remain the same for all periods of time the index is used. In this case the weighted aggregate price index is computed according to the formula

$I_t = \dfrac{\sum P_{it} Q_{i0}}{\sum P_{i0} Q_{i0}} \times 100$

and is called the Laspeyres index.

In the case when the quantity weights are revised and computed each year, the weighted aggregate index in period $t$ with quantity weights $Q_{it}$ is given by

$I_t = \dfrac{\sum P_{it} Q_{it}}{\sum P_{i0} Q_{it}} \times 100$.

This weighted aggregate index is called the Paasche index. Although it has the advantage of reflecting the current usage patterns, this method requires re-determining the quantities $Q_{it}$ each period and recomputing the index numbers for the previous periods to reflect the effect of the current quantity weights. Because of these disadvantages, the Laspeyres index is more widely used in applications.
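The following sketch computes the simple, Laspeyres, and Paasche indexes for a hypothetical three-item group; the prices and quantity weights are invented for illustration, and the output shows how weighting moderates the effect of the expensive but rarely purchased item.

# Hypothetical prices and quantity weights for three items (e.g. bread, milk, repairs).
p0 = [1.20, 0.80, 15.00]   # assumed base-period unit prices
pt = [1.35, 0.90, 21.00]   # assumed current-period unit prices
q0 = [500, 700, 2]         # assumed base-period quantity weights
qt = [520, 680, 3]         # assumed current-period quantity weights

simple = sum(pt) / sum(p0) * 100
laspeyres = sum(p * q for p, q in zip(pt, q0)) / sum(p * q for p, q in zip(p0, q0)) * 100
paasche = sum(p * q for p, q in zip(pt, qt)) / sum(p * q for p, q in zip(p0, qt)) * 100
print(f"simple = {simple:.1f}, Laspeyres = {laspeyres:.1f}, Paasche = {paasche:.1f}")
# simple = 136.8, Laspeyres = 113.2, Paasche = 113.5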

The aggregate price index can be computed directly from the individual price indexes of each item. Indeed, for the Laspeyres index we have

$I_t = \dfrac{\sum P_{it} Q_{i0}}{\sum P_{i0} Q_{i0}} \times 100 = \dfrac{\sum \frac{P_{it}}{P_{i0}} P_{i0} Q_{i0}}{\sum P_{i0} Q_{i0}} \times 100 = \dfrac{\sum \frac{P_{it}}{P_{i0}} w_i}{\sum w_i} \times 100$,

where $w_i = P_{i0} Q_{i0}$ is the weight applied to the individual price index for item $i$.

Similarly, the Paasche index takes the form

$I_t = \dfrac{\sum P_{it} Q_{it}}{\sum P_{i0} Q_{it}} \times 100 = \dfrac{\sum \frac{P_{it}}{P_{i0}} P_{i0} Q_{it}}{\sum P_{i0} Q_{it}} \times 100 = \dfrac{\sum \frac{P_{it}}{P_{i0}} w_i}{\sum w_i} \times 100$,

where $w_i = P_{i0} Q_{it}$ is the weight applied to the individual price index for item $i$.

31. Price indexes as deflators. Examples.

* Price Indexes as Deflators.

When dealing with time series data (data collected over several time periods) that involve money amounts, the interpretations can be very misleading if price changes over time are ignored.

Deflating a time series has an important application in calculating the Gross Domestic Product (GDP). The GDP is the total value of all goods and services produced in a particular country. To make the total value reflect real changes in the volume of goods and services produced and sold, the GDP must be computed with a price index used as a deflator: each money amount is divided by the price index for its period and multiplied by 100. The procedure is similar to that illustrated in the example with the wages calculations.
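As a sketch, the series and index values below are hypothetical; dividing by the index expresses each amount in constant base-year prices.

# Hypothetical nominal series deflated by a hypothetical price index (base = 100).
nominal = [42000, 44500, 47200]   # assumed money amounts in successive years
cpi = [100.0, 104.0, 109.5]       # assumed price index values, base year = 100

real = [v / i * 100 for v, i in zip(nominal, cpi)]
print([round(r) for r in real])   # values in base-year (constant) prices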
32. Components of a time series. Examples.

* Components of a Time Series.

A forecast is a prediction of what will happen in the future. Suppose you are asked to forecast the
sales of a certain product in the coming year. To provide such a prediction, you will review the actual sales
data for the product in the previous periods to better understand the patterns of past sales. Historical sales
represent a time series.

Definition. A time series is a set of observations on a variable made at successive points of time or
over successive periods of time.

The objective of analyzing the time series is to provide good forecasts of future values of the time
series.

Forecasting methods are classified into quantitative and qualitative methods. Quantitative forecasting methods are used when (1) past information about the variable is available, (2) the data can be quantified, and (3) it can be assumed that the pattern of the past will continue in the future. If the forecast is based only on past values of the time series, the forecasting procedure is called a time series method; if it also uses other related variables, it is called a causal method. We will consider here three of the time series methods: smoothing, trend projection, and trend projection adjusted for seasonal influence.

Qualitative methods generally use expert judgment to provide forecasts. For instance, a group of experts may come to a consensus regarding the prime rate for the next year. Qualitative methods are applied when the information used cannot be quantified or when historical data are unavailable.

Usually, four main components are distinguished in the behavior of the data in a time series: the trend, cyclical, seasonal, and irregular components.

Trend Component. The time series may show gradual shifts (or movements) to higher or lower values over a long period of time. This shifting is usually the result of long-term factors such as demographic changes, technology development, consumer preferences, and so on. The gradual shifting of the time series is called the trend in the time series.

There can be other possible time series trend patterns (linear decreasing, nonlinear, no trend, and so on).

Cyclical Component. Usually, the values of the time series do not fall exactly on the trend line, and often show alternating sequences of points below and above the trend line. Any repeated sequence of increases and decreases about the trend line lasting more than one year can be attributed to the cyclical component of the time series. This component of the time series is due to cyclical movements in the economy, such as moderate inflation followed by rapid inflation.

Seasonal Component. The trend and cyclical components can be observed when you are studying historical data over multiannual periods. However, many time series show a regular pattern during a one-year period. For instance, peak sales are expected for snow equipment during the winter months and low sales during the summer months. Thus, the variability in data due to seasonal influences is determined by the seasonal component of the time series. Generally, the seasonal component can be used to represent any regularly repeating pattern within a one-year period. For example, the daily sales volume in a small market is expected to be higher in the evening and lower during the day.

Irregular Component. The irregular component of the time series is responsible for deviations of the actual values from those expected according to the effects of the trend, cyclical, and seasonal components. It is caused by unanticipated factors and, hence, is unpredictable.

33. Smoothing methods in forecasting. Examples.

* Smoothing Methods in Forecasting.

We will consider here three forecasting methods: moving averages, weighted moving averages, and exponential smoothing. They are called smoothing methods, since their objective is to smooth out the effects of the irregular component of the time series. These methods are appropriate for a stable time series, with no significant trend, cyclical, or seasonal changes. They provide a high level of precision for short-range forecasts, for instance a forecast for the next time period.

Moving Averages. This method uses the average of the most recent $n$ data values in the time series to make a forecast for the next period. The computing formula is

Moving Average $= \dfrac{\sum (\text{most recent } n \text{ data values})}{n}$.

One of the often-used measures of the accuracy of a method is the mean squared error (MSE), which is equal to the average of the sum of squared forecast errors.

Obviously, for a time series, moving averages of different lengths may provide different forecasting
accuracy. You may use trial and error to determine the length that minimizes the MSE for the past values in
the time series, and apply it for the next period.
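A sketch of a 3-period moving-average forecast and its MSE on a hypothetical series:

# Hypothetical time series; n = 3 moving average.
series = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]
n = 3

forecasts = [sum(series[i - n:i]) / n for i in range(n, len(series))]
errors = [x - f for x, f in zip(series[n:], forecasts)]   # actual minus forecast
mse = sum(e**2 for e in errors) / len(errors)
print(f"next-period forecast = {sum(series[-n:]) / n:.2f}, MSE = {mse:.2f}")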
Weighted Moving Averages. One variation of the moving average method involves weighted moving averages, in which a different weight is selected for each data value and a weighted average of the most recent $n$ values is computed as the forecast. Usually, the most recent data value is given the largest weight, and the weights decrease for older values. The weights must be selected so that their sum equals 1. If we believe that the recent past is a better predictor than more distant values, then larger weights should be given to the more recent observations. However, when the time series is highly variable, approximately equal weights may be given to all data values. To measure the forecast accuracy, we can again use the MSE, choosing the combination of weights that minimizes the mean squared error.
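A sketch of a weighted 3-period moving average on the same hypothetical series, with assumed weights that sum to 1:

# Hypothetical series; weights are assumed, heaviest on the most recent value.
series = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]
weights = [1/6, 2/6, 3/6]   # oldest .. newest; weights sum to 1

forecast = sum(w * x for w, x in zip(weights, series[-3:]))
print(f"next-period forecast = {forecast:.2f}")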

Exponential Smoothing. This method has minimal data requirements and is well suited to forecasting with large numbers of items. Exponential smoothing is a special case of the weighted moving averages method in which only one weight is selected: the weight for the most recent data value. The forecasting formula is as follows.

$F_{t+1} = \alpha X_t + (1 - \alpha) F_t$,

where

$F_t$ is the forecast for period $t$,
$X_t$ is the actual value of the time series in period $t$,
$\alpha$ is the smoothing constant ($0 \le \alpha \le 1$).

As the formula shows, the forecast for period $t + 1$ is a weighted average of the actual value in period $t$ and the forecast value for period $t$. It can be proved that the exponential smoothing forecast for any period $t$ is a weighted average of all previous actual values $X_1, X_2, \ldots, X_{t-1}$.

The formula for the exponential smoothing calculation can be rewritten as follows.

$F_{t+1} = F_t + \alpha (X_t - F_t)$.

It shows that the forecast in period $t + 1$ is obtained by adjusting the forecast in period $t$ by a fraction $\alpha$ of the forecast error.


Similarly to the previous smoothing methods, we choose the value of $\alpha$ so as to minimize the mean squared error (MSE).
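A sketch of exponential smoothing on the same hypothetical series; the smoothing constant and the convention of setting the first forecast equal to the first actual value are assumptions of the example.

# Hypothetical series; alpha is an assumed smoothing constant.
series = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]
alpha = 0.2

forecast = series[0]   # convention: first forecast equals the first actual value
errors = []
for x in series[1:]:
    errors.append(x - forecast)                     # forecast error for this period
    forecast = alpha * x + (1 - alpha) * forecast   # F_{t+1} = a*X_t + (1-a)*F_t
mse = sum(e**2 for e in errors) / len(errors)
print(f"next-period forecast = {forecast:.2f}, MSE = {mse:.2f}")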

34. Trend projection in forecasting. Examples.

* Trend Projection in Forecasting.

The trend projection method is applicable to time series that have a trend component.

To identify a linear trend, simple linear regression can be applied. Recall that the least squares method is used to determine the best straight-line relationship between two variables. Thus, the equation of the linear trend is

$Y_t = at + b$,

where

$Y_t$ is the trend value of the time series in period $t$,
$t$ is time,
$a$ is the slope of the trend line,
$b$ is the intercept of the trend line.

For the time series on sport shoes sales, $t = 1$ corresponds to the first year period, $t = 2$ corresponds to the second year period, and so on. Formulas for computing the estimated regression coefficients $a$ and $b$ are as follows.

$a = \dfrac{\sum t X_t - n \bar{t} \bar{X}}{\sum t^2 - n \bar{t}^2}$,

$b = \bar{X} - a\bar{t}$,

where

$X_t$ is the value of the time series in period $t$,
$\bar{X}$ is the average value of the time series, that is, $\bar{X} = \dfrac{\sum X_t}{n}$,
$\bar{t}$ is the average value of $t$, that is, $\bar{t} = \dfrac{\sum t}{n}$,
$n$ is the number of periods.
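A sketch applying these formulas to a hypothetical yearly series:

# Hypothetical yearly values; fit Y_t = a*t + b by least squares.
series = [120, 126, 131, 135, 142, 147, 153, 158]
n = len(series)
t_vals = range(1, n + 1)

t_bar = sum(t_vals) / n
x_bar = sum(series) / n
a = sum(t * x for t, x in zip(t_vals, series)) - n * t_bar * x_bar
a /= sum(t**2 for t in t_vals) - n * t_bar**2
b = x_bar - a * t_bar
print(f"trend: Y_t = {a:.2f}*t + {b:.2f}; forecast for t = {n + 1}: {a*(n+1) + b:.1f}")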

Regression analysis can also be applied to model curvilinear trends in time series.

Trend projection adjusted for seasonal influence. Let us see how to forecast a time series that has both trend and seasonal components. In many situations economists are interested in period-to-period comparisons of time-series values. In such cases seasonal effects must be taken into account, in order not to draw wrong conclusions about the overall trend in a time series. For example, electric power consumption may decrease in the summer months even though yearly use of electric power is increasing.

Removing the seasonal effect from a time series is called deseasonalizing the time series. Thus, the first step is to compute the seasonal indexes and use them to deseasonalize the data. After that, if a trend exists, it can be estimated using regression analysis.
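As a sketch, one simple way to obtain quarterly seasonal indexes (assumed here for illustration; textbooks often compute them from centered moving averages instead) is to divide each quarter's average by the overall average, and then divide every observation by its quarter's index.

# Hypothetical quarterly sales for 3 years.
sales = [48, 72, 90, 60, 52, 78, 96, 66, 57, 84, 104, 71]

overall_mean = sum(sales) / len(sales)
indexes = [sum(sales[q::4]) / 3 / overall_mean for q in range(4)]   # one index per quarter
deseasonalized = [x / indexes[i % 4] for i, x in enumerate(sales)]
print([round(i, 3) for i in indexes])
print([round(d, 1) for d in deseasonalized])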
