Applied Statistics en

APPLIED STATISTICS
EXAMPLES IN EXCEL AND SPSS

1
CONTENTS
I. Descriptive statistics ...................................................................................................4
What is Statistics?..........................................................................................................4
Scales of measurement...................................................................................................4
Discrete and continuous variables .................................................................................5
Data collecting ...............................................................................................................5
Census........................................................................................................................6
Sampling ....................................................................................................................6
Types of sample .............................................................................................................7
Simple random sample...............................................................................................7
Stratified sample ........................................................................................................8
Cluster sampling ........................................................................................................8
Quota sampling..........................................................................................................8
Systematic sampling ..................................................................................................9
Calculating a Sample Size .............................................................................................9
Frequency distribution ...................................................................................................9
Class intervals ..........................................................................................................22
Outliers.....................................................................................................................30
Data presentation: tables, diagrams and graphs...........................................................30
Descriptive statistics ....................................................................................................42
Measures of central tendency...................................................................................43
Measures of dispersion ............................................................................................43
Shape of distribution................................................................................................45
Symmetry or skewness ........................................................................................45
Kurtosis................................................................................................................46
Modality...............................................................................................................46
Measure of concentration.........................................................................................47
II. Empirical versus appropriate theoretical distributions (approximations with
binomial; Poisson, hypergeometric or normal distribution) ........................................67
BINOMIAL DISTRIBUTION.....................................................................................68
Probability distribution of a binomial random variable...........................................69
Characteristics of the Binomial distribution ............................................................70
POISSON DISTRIBUTION........................................................................................80
Probability distribution of Poisson random variable ...............................................80
Characteristics of the Poisson distribution...............................................................84
HYPERGEOMETRIC DISTRIBUTION....................................................................93
NORMAL DISTRIBUTION.......................................................................................95
Roles for standardized normal distribution..............................................................97
Characteristic intervals for normal distribution .......................................................98
STUDENT t-DISTRIBUTION..................................................................................111
CHI-SQUARE ,
2
_ DISTRIBUTION.....................................................................113
F DISTRIBUTION....................................................................................................115
LOGNORMAL DISTRIBUTION.............................................................................116
EXPONENTIAL DISTRIBUTION...........................................................................119
GAMA DISTRIBUTION..........................................................................................121
APPLIED STATISTICS
2
APROXIMATIONS FOR BINOMIAL, POISSON AND HYPERGEOMETIC
DISTRIBUTION WITH NORMAL DISTRIBUTION.............................................123
III. Inferential statistics: Estimation theory and hypothesis testing...........................124
INFERENCE..............................................................................................................124
THE DISTRIBUTION OF THE SAMPLE MEANS................................................125
CONFIDENCE INTERVAL FOR THE POPULATION MEAN.............................125
Standard deviation from population is known.......................................................125
Standard deviation from population isnt known...................................................126
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTIONS .............132
CONFIDENCE INTERVAL FOR VARIANCE IN POPULATION .......................134
HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR? .137
Determining sample size for estimating population mean.....................................137
Determining sample size for estimating population proportion ............................138
HYPOTHESIS TESTING .........................................................................................140
Regions of rejection and non-rejection..................................................................141
Risks in decision making process ..........................................................................142
Procedure for hypothesis testing............................................................................142
Hypothesis for the mean ........................................................................................142
o known ............................................................................................................142
o unknown, small sample .................................................................................143
o unknown, large sample..................................................................................144
A two sample test for mean ...................................................................................150
A two sample test for variances .............................................................................154
Testing differences between arithmetic means of more than two populations on the
basis of their samples - analysis variance ANOVA...............................................162
Chi-square (
2
_ ) test ..............................................................................................167
Test for differences between proportion for populations...................................176
Test adequacy of approximations (goodness of fit) ...........................................177
Kolmogorov-Smirnov test .....................................................................................179
IV. REGRRESSION AND CORRELATION ANALISYS ......................................182
Aim............................................................................................................................182
Basic aspects ..............................................................................................................182
Scatter plot ................................................................... Error! Bookmark not defined.
Line of Best Fit (Regression Line).............................................................................187
The Correlation Coefficient .......................................................................................188
The Coefficient of Determination..............................................................................190
Interpretation of the size of a correlation...................................................................190
The standard error of estimate and the correlation coefficient ..................................192
Calculating the Equation of the Regression Line for two variables ..........................193
Prediction or forecasting............................................................................................197
Spearmans rank correlation coefficient ....................................................................198
Statistical testing (t test, ANOVA) ............................................................................201
Overview example for simple regression model with SPSS .....................................202
MULTIPLE REGRESSION MODEL.......................................................................209
The general multiple regression model..................................................................209
Measures for quality of multiple regression model ...................................................210
Statistical test (t test, ANOVA) .................................................................................211
Indicator dummy variables .....................................................................................215
APPLIED STATISTICS
3
Simple model with dummy variable ..................................................................216
Example indicator variables as the regression variables in the simple model with a
"dummy" variable ..................................................................................................217
Example of multiple regression models with indicator variables as a explanatory
variable and a continuous variable as another variable explanatory......................217
CONDITIONS FOR ECONOMETRIC MODELS...................................................222
Assumptions regression models through SPSS .....................................................222
MULTICOLLINEARITY..................................................................................222
OUTLIERS ........................................................................................................223
NORMALITY....................................................................................................224
AUTOCORRELATION....................................................................................224
HETEROSKEDASTICITY...............................................................................224
ECONOMETRIC CONDITIONS FOR REGRESION MODELS WITH SPSS
EXAMPLES ..........................................................................................................225
References..................................................................................................................282
DESCRIPTIVE STATISTICS
EXAMPLES IN EXCEL
4
I. Descriptive statistics
Wbat is Statistics?
Statistics, in short, is the study of data. It includes:
Descriptive statistics (the study of methods and tools for collecting data, and
mathematical models to describe and interpret data) and
Inferential statistics (the systems and techniques for making probability-based
decisions and accurate predictions based on incomplete (sample) data).
Three main aspects in statistical dealing with data are:
1. The collection of qualitative or numerical data,
2. The presentation of qualitative or numerical data and
3. The analysis of numerical data with appropriate statistical methods and models.
Scales of measurement
Different scales of measurement have correspondence with appropriate data type.
1. Nominal scale
Nominal scale classifies data into various distinct categories in which no ordering is
implied. Nominal variables might be used to identify different attributes. For example
nominal scale is appropriate for:
Gender
Citizenship
Internet provider that you prefer.
The license plate number of a car
The only comparisons that can be made between variable values are equality and
inequality. There are no "less than" or "greater than" relations among them, nor
operations such as addition or subtraction.
2. Ordinal scale
Ordinal scale classifies data into various distinct categories in which no ordering is
implied. Ordinal scale is in direct connection with ranking. For example there is
product satisfaction, because you can be: very satisfied, satisfied, neutral,
unsatisfied or very unsatisfied.
Comparisons of better and worst can be made, in addition to equality and inequality.
However, operations such as conventional addition and subtraction are still without
meaning. While the scale can be ranked from high to low the difference between
points cannot be quantified. We cannot say that the person who thinks facilities are
EXAMPLES IN EXCEL
5
good regards the facilities as twice as good as the person who thinks they are below
average.
3. Ratio scale
Ratio scale is an ordered scale in which the difference between the measurements
involves a true zero point (height, consumption, profit, etc.). All mathematical
operations are possible with this type of data and lead to meaningful results. There are
numerous methods for analyzing this type of data.
4. Interval scale
The most important characteristic of interval scale is that the measurement does not
involve a true zero point. The numbers have all the features of ordinal measurement
and also are separated by the same interval. Zero value is arbitrary, not real
(temperature, etc.)
In this case, differences between arbitrary pairs of numbers can be meaningfully
compared. Operations such as addition and subtraction are therefore meaningful.
However, the zero point on the scale is arbitrary, and ratios between numbers on the
scale are not meaningful, so operations such as multiplication and division cannot be
carried out. On the other hand, negative values on the scale can be used.
Categorical variables (attributes) are connected with nominal or ordinal scale, but
numerical variables are connected with ratio or interval scale.
Discrete and continuous variables
Numerical variable can be discrete or continuous:
Discrete variables produce numerical responses that arise from a counting
process. An example of a discrete numerical variable is the number of magazines
subscribed to. Another example would be the score given by a judge to a
gymnast in competition: the range is 0 to 10 and the score is always given to one
decimal (e.g., a score of 8.5). The response is one of a finite number of integers,
so a discrete variable can only take a finite number of real values.
Continuous variable produce numerical responses that arise from a measuring
process. The response takes on any value within a continuum or interval,
depending on the precision of the measuring instrument. Examples of a
continuous variable are distance, age, height, consumption, revenue, loan amount,
export/import...
Data collecting
Depending on the scope of research, data can be collected from a whole population or
from a part of population (a sample).
EXAMPLES IN EXCEL
6
Census
A survey of a whole population is called a census. A census refers to data collection
about every unit in a group or population. If you collected data about the height of
everyone in your class, that would be regarded as a class census. A characteristic of a
population (such as the population mean) is referred to as a parameter.
There are various reasons why a census may or may not be chosen as the method of
data collection:
Census data
Advantages (+)
Sampling variance is zero: There is no sampling variability attributed to the statistic
because it is calculated using data from the entire population.
Detail: Detailed information about small sub-groups of the population can be made
available.
Disadvantages ()
Cost: In terms of money, conducting a census for a large population can be very
expensive.
Time: A census generally takes longer to conduct than a sample survey.
Control: A census of a large population is such a huge undertaking that it makes it
difficult to keep every single operation under the same level of scrutiny and control.
Sampling
Sampling frame is a complete or partial listing of items comprising the population.
The frame can be data sources as population lists, directories or maps. Samples are
drawn from this frame. If the frame is inadequate because certain groups if individuals
or items in the population were not properly included, then the samples will be
inaccurate and biased.
The sampling process comprises several stages:
Defining the population of concern,
Specifying a sampling frame, a set of items or events possible to measure,
Specifying a sampling method for selecting items or events from the frame,
Determining the sample size,
Implementing the sampling plan,
Sampling and data collecting,
Reviewing the sampling process.
Examples of sample surveys:
Phoning the fifth person on every page of the local phonebook and asking them
how long they have lived in the area.
Selecting several cities in a country, several neighbourhoods in those cities and
several streets in those neighbourhoods to recruit participants for a survey.
EXAMPLES IN EXCEL
7
A characteristic of a sample (such as the sample standard deviation) is referred to as a
statistic.
Reasons one may or may not choose to use a sample survey include:
Sample survey
Advantages (+)
Cost: A sample survey costs less than a census because data are collected from only
part of a group.
Time: Results are obtained far more quickly for a sample survey, than for a census.
Fewer units are contacted and less data needs to be processed.
Control: The smaller scale of this operation allows for better monitoring and quality
control.
Disadvantages ()
Sampling variance is non-zero: The data may not be as precise because the data
came from a sample of a population, instead of the total population.
Detail: The sample may not be large enough to produce information about small
population sub-groups or small geographical areas.
Types of sample
Simple random sample
A simple random sample is selected so that every possible sample has an equal chance
of being selected from the population. Each individual is chosen randomly and
entirely by chance, such that each individual has the same probability of being chosen
at any stage during the sampling process.
In small populations such sampling is typically done without replacement. This
means that person or item once selected is not returned to the frame and therefore
cannot be selected again. An unbiased random selection of individuals is important so
that in the long run, the sample represents the population. However, this does not
guarantee that a particular sample is a perfect representation of the population.
Although simple random sampling can be conducted with replacement instead, this is
less common and would normally be described more fully as simple random sampling
with replacement. This means that person or item once selected is returned to the
frame and therefore can be selected again with the same probability
1
N
.
Advantages are that a random sample is free of classification error and it requires
minimum advance knowledge of the population. Random sampling best suits
situations where not much information is available about the population and data
collection can be efficiently conducted on randomly distributed items.
EXAMPLES IN EXCEL
8
Stratified sample
When sub-populations vary considerably, it is advantageous to sample each
subpopulation (stratum) independently. Stratification is the process of grouping
members of the population into relatively homogeneous subgroups before sampling.
The strata should be mutually exclusive: every element in the population must be
assigned to only one stratum. The strata should also be collectively exhaustive: no
population element can be excluded. Then random or systematic sampling is applied
within each stratum. This often improves the representativeness of the sample by
reducing sampling error.
In general, the size of the sample in each stratum is taken in proportion to the size of
the stratum. This is called proportionate allocation. If the population consists of 60%
in the male stratum and 40% in the female stratum, then the relative size of the two
samples (three males, two females) should reflect this proportion.
Cluster sampling
The problem with random sampling methods when we have to sample a population
that is disbursed across a wide geographic region is that you will have to cover a lot of
ground geographically in order to get to each of the units you sampled. It is for
precisely this problem that cluster or area random sampling was invented.
In cluster sampling, we follow these steps:
divide population into clusters (usually along geographic boundaries)
randomly sample clusters
measure all units within sampled clusters.
Cluster samples are generally used if:
No list of the population exists.
Well-defined clusters, which will often be geographic areas, exist.
Often the total sample size must be fairly large to enable cluster sampling to be used
effectively.
Quota sampling
Quota sampling is the non-probability equivalent of stratified sampling. Like
stratified sampling, the researcher first identifies the stratums and their proportions as
they are represented in the population. Then convenience or judgment sampling is
used to select the required number of subjects from each stratum. This differs from
stratified sampling, where the stratums are filled by random sampling.
There are two types of quota sampling: proportional and non-proportional. In
proportional quota sampling you want to represent the major characteristics of the
population by sampling a proportional amount of each. For instance, if you know the
EXAMPLES IN EXCEL
9
population has 40% women and 60% men, and that you want a total sample size of
100, you will continue sampling until you get those percentages and then you will
stop.
Non-proportional quota sampling is a bit less restrictive. In this method, you
specify the minimum number of sampled units you want in each category. Here,
you're not concerned with having numbers that match the proportions in the
population. Instead, you simply want to have enough to assure that you will be able to
talk about even small groups in the population.
Systematic sampling
Systematic sampling is a statistical method involving the selection of every k
th
element from a sampling frame, where k, the sampling interval, is calculated as:
k = population size (N) / sample size (n)
Using this procedure each element in the population has a known and equal
probability of selection. This makes systematic sampling functionally similar to
simple random sampling. It is however, much more efficient and much less expensive
to carry out. The researcher must ensure that the chosen sampling interval does not
hide a pattern. Any pattern would threaten randomness. A random starting point must
also be selected.
Systematic sampling is to be applied only if the given population is logically
homogeneous, because systematic sample units are uniformly distributed over the
population.
Calculating a Sample Size
The three most important factors that determine sample size are:
How accurate you wish to be?
How confident you are in the results?
What budget you have available?
The temptation is to say all should be as high as possible. The problem is that an
increase in either accuracy or confidence (or both) will always require a larger sample
and higher budget. Therefore, a compromise must be reached.
Frequency distribution
First result that we get after research is series with gross data. It is a database in
which we entered data for each item or object without any order (piled data). In
order to get an arranged statistical series (ordered array), we need to sort data by
order of magnitude (from smallest observation to the largest observation). The easiest
EXAMPLES IN EXCEL
10
method of organizing data is a frequency distribution, which converts raw data into
a meaningful pattern for statistical analysis.
Well, the final form of data grouping is the statistical distribution of frequencies, in
which each variable modality or interval (there is n of modalities or intervals)
associate a corresponding absolute frequency
i
f (number of times each value
(modality or class) appears or number of occurrences of a modality or class)
, ,
i i
x f or
,
1, 1, 1
,
i i i
L L f
+

.
The number of class groupings used depends on the number of observations in the
data (N). In general, the frequency distribution should have at least 5 class groupings
but no more than 15.
When a variable can take continuous values instead of discrete values or when the
number of possible values is too large, the table construction is cumbersome, if it is
not impossible. A slightly different tabulation scheme based on the range of values
(classes or intervals) is used in such cases
,
1, 1, 1
,
i i i
L L f
+

.
Frequency distribution tables can be used for both categorical and numeric variables.
Continuous variables should only be used with class intervals.
The relative frequency is proportion of units of a statistical set with the same
modality or interval. This relative frequency of a particular modality or class interval
is found by dividing the absolute frequency by the number of observations:
1
, 1
n
i
i i
i
f
p p
N
=
= =
_
.
The percentage frequency is found by multiplying each relative frequency value by
100. The percentage frequency is shown in percentages, and it has the same meaning
like the relative frequency:
1
100 100, 100
n
i
i i i
i
f
P p P
N
=
= = =
_
Cumulative frequency (CF) is used to determine the number of observations that lie
above (or below) a particular value in a data set (how many data have the value that is
equal to or lower than the value of present modality). The cumulative frequency is
calculated using a frequency distribution table. The cumulative frequency is
calculated by adding each frequency from a frequency distribution table to the sum of
its predecessors.
1
i
i j
j
S f
=
=
_
The last value will always be equal to the total for all observations, since all
frequencies will already have been added to the previous total.
Cumulative percentage (CF%) is used to determine the percentage or part of
observations that lie above (or below) a particular value in a data set (which part or %
EXAMPLES IN EXCEL
11
data have the value that is equal to or lower than the value of present modality). It is
calculated by adding each percentage frequency from a frequency distribution table to
the sum of its predecessors:
1
i
i j
j
F P
=
=
_
.
Excel solution for frequency distribution creating:
1. For qualitative data:
o Create column with modalities.
o In next column for first cell behind first modality choose Excel
function Statistical - Countif
Range - row or column or array with original data (fix that
range with $)
Criteria description of modality ()
o For other modalities do this with Copy option.
- For numerical data:
o Create new columns, one with lower and one with upper endpoints of
classes,
o Select free cells beside that column,
o Choose Excel function Statistical Frequency,
Data array row or column or array with original data,
Bins array new column with upper endpoint of classes,
CTRL+SHIFT+ENTER,
o That will produce absolute frequencies for all classes.
Example 1.
According to data base for HBS 2004 we have information about several variables for
7,413 households:
Entity
Canton
Gender
Marital status
Education level
Employment status
We have qualitative variables with small number of modalities, so we will use non-
interval grouping, or we will find absolute frequency for each modality.
First, we will in empty column of Excel sheet type modalities for given variable. We
will take variable marital status and modalities are: unmarried, married, unformal
marriage, divorced and widower/widow.
For construction of frequency distribution we will use Excel function: COUNTIF:
EXAMPLES IN EXCEL
12
Now we will give elements to the chosen CONTIF function:
Range row or column with original data (we will fix that data range with $:
$D$2:$D$7414)
Criteria cell with given modality (H10)
EXAMPLES IN EXCEL
13
With Copy-Paste option we will complete other cells for absolute frequency:
EXAMPLES IN EXCEL
14
On the same way we can complete frequency distribution for other variables.
Next step is to calculate relative and percentage frequencies according to absolute
frequencies:
1. we will get relative frequency when we divide absolute frequency with sample or
population size (N) like sum for absolute frequencies (when we give sum we
always fix series with $):
Other relative frequencies we will get with Copy-Paste option and sum of relative
frequencies has to be equal 1:
EXAMPLES IN EXCEL
15
2. Percentage frequency we will get when multiply relative frequency with 100%,
so we will transform part in percentage form:
Other percentage frequencies we will get with Copy-Paste option and sum of
percentage frequencies has to be equal 100:
EXAMPLES IN EXCEL
16
Interpretation: Highest part (71.24%) households has head in formal marriage, but
lowest part (0.27%) households has head in unformal marriage
When we have qualitative variable there is no any sense to calculate cumulative
frequency, because there is no logical explanation.
Example 2.
We have data base about import and export in year 2007. for 181 countries (Doing
business 2007 trading across boundaries). Variable number of documents for
export is example for discrete quantitative variable. For construction of frequency
distribution we will use option FREQUENCY.
First we will find minimal and maximal value of modality with statistical function
MIN and MAX:
EXAMPLES IN EXCEL
17
Minimal value of modalities is 3 and maximal value is 14, so we will according to
that take modalities from interval 3-14 in new column (I8:I19) for frequency
distribution:
Then we will select all cells where we need absolute frequencies (J8:J19) and we
choose in Functions: Statistical functions Frequency and:
1. Data array are original data (B2:B182)
2. Bins array are modalities (I8:I19)
EXAMPLES IN EXCEL
18
Than in the same time we press CTRL+SHIFT+ENTER and we will get frequency
distribution:
According to sum of absolute frequencies (175) we can see that for 6 countries data
about this variable are missing.
Next step is to calculate relative, percentage and cumulative frequencies according to
absolute frequencies:
EXAMPLES IN EXCEL
19
1. we will get relative frequency when we divide absolute frequency with sample or
population size (N) like sum for absolute frequencies (when we give sum we
always fix series with $):
Other relative frequencies we will get with Copy-Paste option and sum of relative
frequencies has to be equal 1:
3. Percentage frequency we will get when multiply relative frequency with 100%:
EXAMPLES IN EXCEL
20
Interpretation: Highest part of countries (19,43%) ask for 6 documents for export
realization, but lowest part of countries (1,14%) ask for 13 or 14 documents for export
realization.
EXAMPLES IN EXCEL
21
4. Increasing cumulative frequency
First increasing cumulative frequency is always same as first absolute frequency and
then we on current cumulant add next absolute frequency:
Other cumulative frequency we will get with option Copy-Paste and last cumulative
frequency has to be equal N:
Interpretation: 149 countries ask 9 or less than 9 documents for export realization.
5. Increasing cumulative percentage frequency
First increasing cumulative percentage frequency is always same as first percentage
frequency and then we on current cumulant add next percentage frequency:
EXAMPLES IN EXCEL
22
Other cumulative percentage frequency we will get with option Copy-Paste and last
cumulative percentage frequency has to be equal 100:
Interpretation: 61,14%countries ask 7 or less than 7 documents for export realization.
Class intervals
Class interval width is the difference between the lower and upper endpoint of an
interval (
2, 1, i i i
l L L = ).
In summary, follow these basic rules when constructing a frequency distribution table
for a data set that contains a large number of observations:
find the lowest and highest values of the variables,
EXAMPLES IN EXCEL
23
decide on the width of the class intervals and form class intervals that are mutually
exclusive,
include all possible values of the variable.
In an interval grouped series, in order to provide for additional data calculation, we
need to approximate the intervals to corresponding class middles (class mark,
midpoint, centre of interval):
1, 2,
2
i i
i
L L
c
+
= .
Example 3.
We have data base about import and export in year 2007. for 181 countries (Doing
business 2007 trading across boundaries). Variable cost to import is example for
continuous quantitative variable. For construction of frequency distribution we will
use option FREQUENCY.
First we will find minimal and maximal value of modality with statistical function
MIN and MAX:
EXAMPLES IN EXCEL
24
Minimal value is 367 and maximal value is 5.520, and according to that we will
determine interval for frequency distribution. We will take intervals with width 500
and in next cells we will type boundaries for that intervals (truing to be visually
symmetric):
When we set up boundaries for intervals then we can go to the function Frequency.
We will select all cells where we want to find absolute frequencies (K8:K19),
Kada smo odredili granice intervala moemo pristupiti funkciji Frequency. We select
all cells where we want to find absolute frequencies (K8:K19) and we choose in
Functions: Statistical functions Frequency and:
1. Data array are original data (G2:G182)
EXAMPLES IN EXCEL
25
2. Bins array are upper boundaries for intervals (that are included in current
interval) (J8:J19)
We pres at the same time CTRL+SHIFT+ENTER and we will get frequency
distribution:
According to sum of absolute frequencies (175) we can see that for 6 countries data
about this variable are missing.
Frequency distribution looks like:
EXAMPLES IN EXCEL
26
Next step is to calculate relative, percentage and cumulative frequencies according to
absolute frequencies:
1. we will get relative frequency when we divide absolute frequency with
sample or population size (N) like sum for absolute frequencies (when we
give sum we always fix series with $):
Copy-Paste option is used to give other relative frequencies:
EXAMPLES IN EXCEL
27
Sum of relative frequencies is 1.
2. Percentage frequency we will get when multiply relative frequency with
100%:
EXAMPLES IN EXCEL
28
Interpretation: Highest part (32,57%) countries have cost of import per container in
interval 1000-1500 US$., but lowest part of them (0,57%) have cost of import per
container in intervals 3000-3500 or 5500-6000. Because of that we can conclude that
interval 5500-6000 or data from that interval is outlier.
3. Increasing cumulative frequency
First is equal to first absolute frequency and than we use cumulation:
Then we use Copy-Paste option:
EXAMPLES IN EXCEL
29
For example, one of the conclusions can be that 170 countries have cost for import
lower than 4000 US$
4. Increasing cumulative percentage frequency
Procedure is same like in previous step but with percentage frequencies:
Then we use Copy-Paste option:
EXAMPLES IN EXCEL
30
For example, one of the conclusions can be that 90,29% countries have cost for
import lower than 2500 US$.
Outliers
An outlier is an extreme value of the data. It is an observation value that is
significantly different from the rest of the data. There may be more than one outlier in
a set of data.
Sometimes, outliers are significant pieces of information and should not be ignored.
Other times, they occur because of an error or misinformation and should be ignored.
Data presentation: tables, diagrams and grapbs
Two most important ways for presenting data are previously presented tables with
frequency distributions and graphs.
Why use graphs when presenting data? Because graphs:
are quick and direct
highlight the most important facts
facilitate understanding of the data
can convince readers
can be easily remembered.
Knowing what type of graph to use with what type of information is crucial.
Depending on the nature of the data and variable type some graphs might be more
appropriate than others. You too can experiment with different types of graphs and
select the most appropriate. There are several suggestions for appropriate selection
according to effects that you want to get with graphs:
EXAMPLES IN EXCEL
31
pie chart (description of components)
horizontal bar graph (comparison of items and relationships, time series)
vertical bar graph (discrete variable, comparison of items and relationships, time
series, frequency distribution)
line graph (time series and frequency distribution)
scatter plot (analysis of relationships)
histogram (continuous variable).
In Excel in segment Tools Customize Insert- Chart we can find function Chart and
choose different types of graphs:
Example 1.
We will again work with variable marital status. What types of graphs we can use?
According to the variable type qualitative variable, we can construct structural pie
or vertical bars.
For this example we will construct structural pie:
EXAMPLES IN EXCEL
32
We choose option Next:
1. Titles - we give title to the graph
2. Legend We choose way to represent legend
EXAMPLES IN EXCEL
33
3. Data labels we choose options to show on pie: variable name, modality name,
absolute frequency, %. We will take to show % because we already have modality
names in legend.
We choose option Next and determine place where graph will be saved:
EXAMPLES IN EXCEL
34
We choose option Finish:
Example 2.
Variable number of documents for export is discrete variable. Because of that we
will choose structure pie, vertical bars or frequency polygon to represent it.
We will construct graph for vertical bars:
EXAMPLES IN EXCEL
35
In Series option we will fix values for modalities (I8:I19):
a) Titles we will determine title for graph and axes
EXAMPLES IN EXCEL
36
b) Axes we set up axes
c) Gridlines we set up gridlines
d) Legend we choose to include legend and how to do that. If we have only one
variable than legend is not important. But if we have more variables we will
use legend to classify variables.
e) Data labels we choose options to show on graph: variable name, modality
name, absolute frequency, %. We will take to show absolute frequencies:
f) Data table if we include this option we will get table below graph, but this is
same information like information on graph.
EXAMPLES IN EXCEL
37
We choose option Next and we determine place where graph will be saved:
We choose option Finish:
Example 3.
This is continuous variable cost of import. Because of that we prefer to use
histogram, frequency polygon or polygon of cumulative frequency.
A. First we will construct histogram. Procedure is same like with vertical bars. On the
end when we get graph with vertical bar we will on the graph make format for gap
within bars, to be equal 0:
EXAMPLES IN EXCEL
38
We click on bars on Excel graph and we choose Format data series, and then we
choose Options where we make that Gap width be equal to 0:
Click OK and there is histogram (graph with continuous bars):
EXAMPLES IN EXCEL
39
B. Now we will construct polygon of absolute frequency. We need centres of intervals
for that. We need columns with lower and upper boundaries for intervals. Centre of
interval is sum of lower and upper boundary divided by 2:
Others centre of intervals we will get with Copy-Paste function:
Now we can to construct polygon of frequency:
EXAMPLES IN EXCEL
40
We choose Next and select in Data range cells with absolute frequencies:
For Series we select centers of intervals like modalities:
EXAMPLES IN EXCEL
41
Again we use option Next:
a) Axes we set up axes
b) Gridlines we set up gridlines
c) Legend we choose to include legend and how to do that. If we have only one
variable than legend is not important. But if we have more variables we will use
legend to classify variables.
d) Data labels we choose options to show on graph: variable name, modality name,
absolute frequency, %. We will take to show absolute frequencies.
e) Data table if we include this option we will get table below graph, but this is
same information like information on graph:
EXAMPLES IN EXCEL
42
We choose option Next and we determine place where graph will be saved:
On the same way we can create polygon of cumulative frequencies, but in that case on
the beginning in Data range we would select cells with cumulative frequencies.
poligon of cumulative procentual frequencies
3,4286
34,8571
67,4286
82,8571
99,4286
90,2857
95,4286
96,0000
97,1429
98,2857
99,4286100,0000
0,0000
20,0000
40,0000
60,0000
80,0000
100,0000
120,0000
1 2 3 4 5 6 7 8 9 10 11 12
centre of interval
C
F
%
Descriptive statistics
Descriptive statistics are used to describe the basic features of the data in a study.
Together with simple graphics analysis, they form the basis of virtually every
quantitative and qualitative analysis of data.
There may be several objectives for formulating a summary statistic or parameter:
- To choose a statistic that shows how different units seem similar. Statistical
textbooks call one solution to this objective, a measure of central tendency.
- To choose another statistic that shows how they differ. This kind of statistic is
often called a measure of statistical variability.
- To analyze shape of frequency distribution.
EXAMPLES IN EXCEL
43
Measures of central tendency
Measures of central tendency summarize a list of numbers by a "typical" value called
measure of location. The three most common measures of location are the mean, the
median, and the mode.
- The mean (average) is the sum of the values, divided by the number of values. It
has the smallest possible sum of squared differences from members of the list.
1
N
i
x
X
N
=
_
- The median is the middle value in the sorted list. It has the smallest possible sum
of absolute differences from members of the list frequencies. The first modality or
interval in which it is
2
Me
N
CF s is the median or interval in which the median
is contained. If it is an interval, then the median is determined using the following
formula:
1
1
2
( )
e
e e
e
M
e M M
M
N
CF
M L l
f R
= +
- The mode is the most frequent value in the list (or one of the most frequent values,
if there are more than one). Mode is only calculated for the statistical distribution
(grouped series). It is graphically determined in a histogram. For a non-interval
grouped distribution, on the basis of the highest frequency (
max Mo
f f = ) the mod
data is read. For an interval grouped distribution, the frequency of the read
interval opposed to the highest frequency is determined on the basis of the
following formula:
, ,
1
1
1 1
o o
o o
o o o o
M M
o M M
M M M M
f f
M L l
f f f f
= +
+
Sometimes, we choose specific values from the cumulative distribution function
called quartiles. Procedure is same like with median:
25% of data has value less or equal to the first quartile and 75% of data has
value higher than the first quartile (theoretical position
1
4
Q
N
CF s )
75% of data has value less or equal to the third quartile and 25% of data has
value higher than the third quartile (theoretical position
3
3
4
Q
N
CF s ).
Measures of dispersion
Dispersion refers to the spread of the values around the central tendency. There are
three common absolute measures of dispersion:
EXAMPLES IN EXCEL
44
- The range
The range is simply the highest value minus the lowest value:
max min
RV x x = .
- The quartile range
The quartile range (
3 1 Q
I Q Q = ) is the range from the 25th to the 75th percentile
of a distribution. It represents the "Middle Half" of the data and is a marker of
variability or spread that is robust to outliers.
- The standard deviation
The standard deviation is the square root of the sum of the squared deviations
from the mean divided by the number of scores (or the number of scores minus
one, if we work with sample).
For population:
,
2
2 2
1
1
,
N
i
i
x X
N
o o o
=
1
= =
(
]
_
For sample:
,
2
2 2
1
1
,
1
N
i
i
x X
N
o o o
=
1
= =
(
]
_
The standard deviation allows us to reach some conclusions about specific scores
in our distribution. Assuming that the distribution of scores is normal or bell-
shaped (or close to it!), the following conclusions can be reached (role six sigma):
- approximately 68% of the scores in the sample fall within one standard
deviation of the mean
- approximately 95% of the scores in the sample fall within two standard
deviations of the mean
- approximately 99% of the scores in the sample fall within three standard
deviations of the mean.
Problem with standard deviation, like absolute measure of dispersion, is that we
can not use standard deviation for comparison of series with different unit of
measure or with different average.
Behind that we can define relative measures of dispersion like:
- Coefficient of variation
The variance coefficient is a relative measure of variability which can be used for
comparing series with different units of measure, because it is an unnamed
number.
100 (%) V
X
o
=
It can be used for comparing series with different arithmetic means.
- z value
Z values determine the relative position of variable modality in the series:
, 1, 2,...,
i
i
x X
z i N
o
= =
They are appropriate for comparing positions of data in different series. Z values
are specific because of fact that we can calculate z value for each modality, not
only for series of data.
EXAMPLES IN EXCEL
45
- The quartile deviation coefficient
The quartile deviation coefficient is relative dispersion indicator and shows
variability around median value:
, ,
3 1
1 3 1 3
100% 100%
Q
Q
I
Q Q
V
Q Q Q Q
= =
+ +
Higher value of the quartile deviation coefficient indicates greater dispersion and
vice versa. This is relative indicator of data varying around the median.
Shape of distribution
Symmetry or skewness
A frequency distribution may be symmetrical or asymmetrical. Imagine constructing a
histogram centred on a piece of paper and folding the paper in half the long way. If
the distribution is symmetrical, the part of the histogram on the left side of the fold
would be the mirror image of the part on the right side of the fold. If the distribution is
asymmetrical, the two sides will not be mirror images of each other. True symmetric
distributions include what we will later call the normal distribution. Asymmetric
distributions are more commonly found.
Measure of skewness
3
3
3
o
= o
,
3
3
1
1
N
i
i
x X
N
=
1
=
]
_
= o 0
3
symmetry
3
0 o > positively skewed
3
0 o < negatively skewed
X
f
symmetric
left asymmetric right asymmetric
EXAMPLES IN EXCEL
46
If a distribution is asymmetric it is either positively skewed or negatively skewed. A
distribution is said to be positively skewed if the scores tend to cluster toward the
lower end of the scale (that is, the smaller numbers) with increasingly fewer scores at
the upper end of the scale (that is, the larger numbers). A negatively skewed
distribution is exactly the opposite. With a negatively skewed distribution, most of the
scores tend to occur toward the upper end of the scale while increasingly fewer scores
occur toward the lower end.
Kurtosis
Another descriptive statistic that can be derived to describe a distribution is called
kurtosis. It refers to the relative concentration of data in the centre, the upper and
lower ends (tails), and the shoulders of a distribution. A distribution is platykurtic if
it is flatter than the corresponding normal curve and leptokurtic if it is more peaked
than the normal curve.
Modality
A distribution is called unimodal if there is only one major "peak" in the distribution
of scores when represented as a histogram. A distribution is bimodal if there are two
major peaks. If there are more than two major peaks, we call the distribution
multimodal.
Measure of kurtosis
4
4 4
o
o
=
,
4
4
1
1
N
i
i
x X
N
=
1
=
]
_
4
3 o = normal
4
3 o > leptocurtic
4
3 o < platykurtic
EXAMPLES IN EXCEL
47
Measure of concentration
The Lorenz curve is a graphical representation of the cumulative distribution
function of a probability distribution; it is a graph showing the proportion of the
distribution assumed by the bottom y% of the values. It is often used to represent
income distribution, where it shows for the bottom x% of households, what
percentage y% of the total income they have.
Every point on the Lorenz curve represents a statement like "the bottom 20% of all
households has 10% of the total income". A perfectly equal income distribution would
be one in which every person has the same income. In this case, the bottom N% of
society would always have N% of the income. This can be depicted by the straight
line y = x; called the line of perfect equality.
By contrast, a perfectly unequal distribution would be one in which one person has all
the income and everyone else has none. In that case, the curve would be at y = 0 for
all x < 100%, and y = 100% when x = 100%. This curve is called the line of perfect
inequality.
The Ginny coefficient is the area between the line of perfect equality and the
observed Lorenz curve, as a percentage of the area between the line of perfect
equality and the line of perfect inequality. This equals two times the area between the
line of perfect equality and the observed Lorenz curve.
EXAMPLES IN EXCEL
48
,
1
concentration area
2 concentration area 2
0, 5
1
0 1
j j j
G S
G p Q Q
G
= = =
= +
s s
_
The higher the Ginny coefficient, the more unequal the distribution is.
Software Excel and SPSS do not offer the option to directly calculate measures of
concentration, and we therefore have based on a formula in Excel, so we develop the
procedure.
Example 4.
We have data base about variables that follow procedure of paving taxes for 181
countries (source: http://www.doingbusiness.org/CustomQuery/, data for 2008. year).
Data are given in Excel sheet (A1-G363). Variables are:
- Payments (number) (B2-B363)
- Time (hours) (C2-C363)
- Total tax rate (%profit) (D2-D363).
There are quantitative variables, so we can apply methodology for descriptive
statistics for series of 181 data per each variable to get several parameters which will
describe given series.
Most simple and fast way to get several parameters which will describe given series
(x
min
, x
max
, average, deviation, mod, median, kurtosis and skewness) is to use Excel
function: Tools Data Analysis. If that option is not included we have to renew it:
1. Tools Add-ins:
EXAMPLES IN EXCEL
49
2. We have to renew or choose Analysis ToolPak and Analysis ToolPak VBA:
3. Click OK and we will get in Tools:
Now we can use Data Analysis option:
EXAMPLES IN EXCEL
50
We will get list with analysis that we can make. Currently we are interested for option
Descriptive statistics, so we choose it and click OK. In Input range we can in the same
time to select all columns with several variables and to give grouping according to the
columns ($B$1:$D$182). When we select data we include and first cell with variable
name and include option Labels in first row. Then we set up empty cell or new sheet
where we want to save result of analyses and we select what we want to get of
parameters:
- Summary statistics - x
min
, x
max
, average, deviation, mod, median, kurtosis and
skewness, range, count...
- Confidence level for mean This is boundary for confidence interval for
average with given confidence level (for example 95%)
- Kth largest i Kth smallest If we want to calculate quantiles we will choose
this option , for example for first and third quartile in both case we take 25, for
firs and ninth decile in both case we take 10
Click OK and result is:
EXAMPLES IN EXCEL
51
On example on of this variables time (hours) we will give interpretations for results:
- Average is 317.63 hours, for sample of 181 countries (count), so in average it
is needed 317.63 hours for paying taxes procedure.
- Standard error of average estimation is given on base of sample size and
standard deviation in sample (
X
n
o
o = ) is 23.61 hours.
- Median is 256, so for 50% of countries is needed 256 hours or less for paying
taxes procedure until for 50% of countries is needed more than 256 hours for
paying taxes procedure.
- Mod is 270, so we have most frequently appeared country with 270 hours for
paying taxes procedure.
- Standard deviation like average linear deviation from average is 317.66 hours,
so we can calculate coefficient of variation:
317.66
100 100 100%
317.63
V
X
o
= = =
Relative variability around average is 100%. Only in comparison with another
series this information has sense.
- Variance like average square deviation from average is 100906.1, but we
interpret this through standard deviation.
- Kurtosis is (19.96+3)=22.96 what is more than 3 so we can conclude that this
distribution is significantly more peaked than the normal curve.
- Skewness is 3.77 what is more than 0 so we can conclude that this distribution
is significantly right asymmetric in comparison with the normal curve
- Range like difference between highest and lowest value is 2600 h.
- Minimal time for paying taxes procedure is 0 h.
- Maximal time for paying taxes procedure is 2600 h.
- Sum of data in series is 57491, but there is no logic interpretation for this
information.
EXAMPLES IN EXCEL
52
- Third quartile is 453, so for 75% of countries is needed 453 hours or less for
paying taxes procedure until for 25% of countries is needed more than 453
hours for paying taxes procedure.
- Third quartile is 105, so for 25% of countries is needed 105 hours or less for
paying taxes procedure until for 75% of countries is needed more than 105
hours for paying taxes procedure.
- Boundary for confidence interval for average with given confidence level 95%
is 46.59. Confidence interval for average with 95% confidence level is
[317.6346,59]= [271.04-364.22]. So with first type error 5% we can
conclude that time for paying taxes procedure for some country will be I
interval [271.04-364.22] hours.
To see these parameters visually we will construct histogram. We have option in Data
analysis:
Before we construct histogram we have to define intervals according to minimal and
maximal value and to numbers of interval that we want to create. Maximal value is
2600 and minimal value is 0, so we will determine intervals with width 100: 0-100,
100-200, ..., 400-500, 500-600, ..., 2500-2600. Upper limits for that intervals that are
included in intervals are: 99, 199, ..., 499, 599, ..., 2600. We will type this limits in
one Excel column (I22:I47).
For Input range we will select column with original data (C2:C182) and for Bin
Range we will select cells where we type upper limits for intervals (I22:I47). We will
find place to save result and option Chart output:
EXAMPLES IN EXCEL
53
Graph that we are get is graph with vertical bars, but we will click on graph and get
Chart options Options. There we will set up that gap between bars be equal 0:
Finally histogram looks like:
EXAMPLES IN EXCEL
54
Histogram
0
10
20
30
40
50
60
9
9
2
9
9
4
9
9
6
9
9
8
9
9
1
0
9
9
1
2
9
9
1
4
9
9
1
6
9
9
1
8
9
9
2
0
9
9
2
2
9
9
2
4
9
9
M
o
r
e
Bin
F
r
e
q
u
e
n
c
y
Our interpretation of parameters for distribution shape is completely proved. It is very
positive (right) asymmetric and peaked distribution. This distribution is significantly
different in comparison with normal curve.
Example 5.
With aim to analyse concentration for consumption for data base HBS 2008, we are
taken data about consumption per capita for 23374 individuals from 7071 households:
There are original gross data, so we will firs construct appropriate frequency
distribution. We need to find minimal and maximal value for consumption level in our
sample:
EXAMPLES IN EXCEL
55
According to that we make decision to set up intervals with width 5000, so we have
upper limits that are included in intervals (bins): 4999,99, 9999,99, 14999,99, ,
54999,99. That limits we will type in empty column in sheet where are original data:
EXAMPLES IN EXCEL
56
We select empty cells in column behind (E6:E16). In function (f
x
) we choose
Frequency:
With CTRL+SHIFT+ENTER we will get frequency distribution:
EXAMPLES IN EXCEL
57
Now we can to start with construction of Lorenz curve and calculation of Ginny
coefficient. We need centres of intervals and relative frequencies, but before that we
have to form columns with lower and upper limits for intervals:
First we will calculate centres of intervals:
EXAMPLES IN EXCEL
58
With Copy-Paste option we will get column with centres of intervals:
Than we will calculate relative frequencies:
EXAMPLES IN EXCEL
59
With Copy-Paste option we will get column with relative frequencies:
Than, we will calculate relative cumulative frequencies. First is same like first relative
frequency and we follow cumulating:
With Copy-Paste option we will get column with relative cumulative frequencies:
EXAMPLES IN EXCEL
60
Than we need cumulant for relative aggregate. First we will calculate aggregate (cp)
like product of centre of interval and absolute frequency for given interval:
With Copy-Paste option we will get column for aggregate:
We will calculate relative aggregate like:
i i
i
i i
c p
q
c p
_
:
With Copy-Paste option we will get column for relative aggregate:
EXAMPLES IN EXCEL
61
On the end we will find relative cumulative aggregate (Q):
With Copy-Paste option we will get column for cumulant of relative aggregate:
To graph Lorenz curve for x axes we will take relative cumulative frequencies and
like y axes we will take cumulant of relative aggregate. Before that we will insert one
point with value 0 for both cumulant:
EXAMPLES IN EXCEL
62
Now we can graph Lorenz curve:
For line of perfect equality we will for both axes take same data for relative
cumulative frequencies.
For Lorenz curve we take:
EXAMPLES IN EXCEL
63
Now with Add we will insert new series for line with perfect equality:
We choose Next and then we will get option to give titles:
EXAMPLES IN EXCEL
64
Finally graph looks like this:
White area is area of concentration.
We will calculate Ginny coefficient like quantification for measure of concentration
according to relation:
,
1
1
j j j
G p Q Q
= +
_
:
With Copy-Paste option we will complete this column:
EXAMPLES IN EXCEL
65
When we calculate (1-this sum) we will get Ginny coefficient:
And we will get result:
EXAMPLES IN EXCEL
66
Ginny coefficient is 0.3378 so distribution of consumption is not perfect equal but
level of concentration is not very high.
EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONS
EXAMPLES IN EXCEL
67
II. Empirical versus appropriate tbeoretical distributions
{approximations witb binomial, Poisson,
bypergeometric or normal distribution]
PROBABILITY DISTRIBUTIONS
Frequency distribution formed with groupation of population units according to same
characteristics is empirical distribution. Distribution formed on the basis of theoretical
prepositions is theoretical distribution. Main characteristics of theoretical distributions
are:
We suppose them in some statistical model or we create them like hypothesis
that we have to test.
Theoretical distributions are given like analytic model with known parameters:
expectation, mod, median, standard deviation, skewness and kurtosis.
Theoretical distributions are given like probability distributions.
Probability where we know number of possible outcomes of event and we know
number of success realization is a priory probability. But in statistical research is
most frequently that we dont know probability a priori so with experiment we try to
get knowledge for probability calculations like a posterior. Well a posterior
probability is empirical or statistical probability.
Empirical probability or a posterior is limited value for relative frequency for number
of success of event A if we have great number of trials: which tends to infinity:
( ) lim
n
m
p A
n
= ; m- number of success realization, n- number of trials.

Cumulative function for discrete variable X (F(x)) is function that x will take values
lower or equal to same real number
i
x or , , ( )
i
i i i
X x
F x P X x p x
s
= s =
_
.
Cumulative function for continuous variable X (F(x)) has general form
like , ,
}

= =
a
dx x f a X F , and it is determined by parameters like expectation and
variance..
If discrete variable X (F(x)) can take values
k
x x x ,..., ,
2 1
with
probabilities , , ,
k
x p x p x p ,..., ,
2 1
, where sum of probabilities has to be 1, expectation
for X is :
, , , , ,
i
k
i
i k k
x x p x x p x x p x x p X E = + + + =
_
=1
2 2 1 1
... .
For continuous variable expectation is:
EXAMPLES IN EXCEL
68
, ,
}

= dx x xf X E , - < < x .
Variance for discrete variable is:
, , , , X E x p x X
k
i
i i
= = =
_ _
=
o ,
1
2 2 2
odnosno
, ,
2
1 1
2 2
(
]
1
=
_ _
= =
k
i
i i
k
i
i i
x p x x p x o .
Variance for continuous variable is:
, , , , dx x xf dx x f x X E
} }

= = = o ,
2 2 2 2
.
Well, theoretical probability distributions can be split into 2 groups:
discrete probability distributions deal with discrete events
o binomial distribution
o Poisson distribution
o Hypergeometric distribution.
continuous probability distributions deal with continuous events
o normal distribution
o Student (t) distribution
o
2
_ (chi-square) distribution
o F distribution.
The probability distribution of a random variable describes the probability off all
possible outcomes. The sum (integral) of these probabilities will equal 1.
BINUMIAL DISTRIBUTIUN
The binomial distribution is used when discrete random variable of interest is the
number of successes obtained in a sample of n observations. It is used to model
situations that have the following properties:
The sample consists of a fixed number of observations n.
Each observation is classified into one of two mutually exclusive categories,
usually called success and failure.
The probability of an observation being classified as success, noted as p, is
constant from observation to observation. Thus, the probability of an observation
being classified as failure, noted as (1-p)=q, is constant over all observations.
The outcome (success or failure) of any observation is independent of the outcome
of any other observation.
Well, binomial distribution has two parameters:
n number of observations, trials or experiment repetitions.
EXAMPLES IN EXCEL
69
p the probability of success (occurrences of a given event) on a single
observation, trial or experiment.
Probability distribution of a binomial random variable
The probability distribution of a binomial random variable is:
, ( ) 1 , 0,
n x
x
n
p x p p x n
x
| |
= =
|
\ .
,
where x is exact number of successes of interest and ( ) p x is probability that among n
trials will been realized exactly x successes (given event will be realized exactly x
times).
Binomial probability function
1
Example 1.
An insurance broker believes that for particular contact, the probability of making sale
is 0.4. Suppose now that he has five contacts. What is probability that he will realize
three sales among these five contacts?
Solution:
Discrete random variable X is defined to take value 1 if sale is made and 0 if sale is
not made so this is discrete variable that can be treated with binomial distribution.
Experiment of sale we will repeat 5 times n=5.
According to conclusion about dichotomous variable we will apply approximation
with binomial distribution:
1 From Wikipedia, the free encyclopedia
EXAMPLES IN EXCEL
70
(1) 0.4
(0) 1 0.4 0.6
5
3
p p
q p
n
x
= =
= = =
=
=
,
3 2
5
( ) 1 (3) 0.4 0.6 0.23
3
n x
x
n
p x p p p
x
| | | |
= = =
| |
\ . \ .
Probability that he will realize three sales among these five contacts is 23%.
Characteristics of the Binomial distribution
Shape
Binomial distribution can be symmetrical (if p=0.5) or skewed (if p= 0.5)
Mean
( ) E X n p = =
Variance
,
2
2
(1 ) E X n p p o
1
= =
]
We have 4 types for binomial distribution:
symmetric; if p=q=0.5
asymmetric; if p = q
a priori; if we know probabilities p and q
a posterior; if we have to find p and q by empirical method
Conditions for approximation empirical distribution with binomial distribution are:
0 1
X
n
s s
2
1
X
X
n
o
| |
~
|
\ .
Error of approximation is measure for quality of approximation. Error of
approximation according to modalities is:
b
k k k
d f f = where:
k
f is empirical
frequency and
b
k
f is theoretical frequency, so overall error of approximation is:
2 2
1
1
b k
d
n
o =
+
_
Example 2.
Accounting office in one company has information that 40% customers don't realize
obligation on time because of inflation. If we randomly select 6 customers, what is
probability:
1. that are all customers realized obligation on time
2. that more than 3/4 of customers realized obligation on time
3. that 50% or more of customers don't realize obligation on time.
EXAMPLES IN EXCEL
71
Solution:
p=60%=0,6 (realize obligation on time)
q=40%=0,4 (dont realize obligation on time)
n=6
( )
x n x
n
p x p q
x
| |
=
|
\ .
1. Probability that that
are all customers realized obligation on time according to the table is
p(6)=4.67%.
2. Probability that more than 3/4 of customers realized obligation on time 3/4
of 6 is 4,5 so we will take probability for x=5 and x=6. According to the table
p(5)= 18.66% and p(6)= 4.67% , so final result according to (Additional
theorem) is 23.33%.
3. Probability that 50% or more of customers don't realize obligation on time
50% of 6 is 3, so we will take probability for x=3, 4, 5, 6. According to the
table this is (0.27648+0.311040+0.186624+0.046656)=0.8208 82.8%.
Example 3.
For 1000 products we can find 28 with defect. If we randomly select 14 products for
sample, what is probability that:
a) in sample we have exactly 4 products with defect;
b) in sample we have maximum 2 products with defect;
c) in sample we have minimum 4 products with defect.
Solution (by Excel):
This is dichotomous variable, so in that case we will apply Binomial distribution with
modalities - x: 0,1,2,3,4,...,14.
28
0.028 0.972
1000
p q = = =
14
, 0,14
:
14
( ) 0.028 0.972
k
b k k
k k
x k k
X
p P x k
k
| |
= =
|
| |
|
= = =
|
|
\ . \ .
We will use Excel function:
x
i
p(x) F(x)
0 0.004096 0.004096
1 0.036864 0.040960
2 0.138245 0.179205
3 0.276480 0.455685
4 0.311040 0.766725
5 0.186624 0.953349
6 0.046656 1.000000
EXAMPLES IN EXCEL
72
a) in sample we have exactly 4 products with defect
We ask for probability in point not for cumulative function, so for option Cumulative
we will take False.
{=BINOMDIST(4;14;0.028;FALSE)}= 0.000463 0.0463%
b) in sample we have maximum 2 products with defect (so 0, 1 or 2 product with
defect), this is cumulative distribution so for option Cumulative we will take True.
EXAMPLES IN EXCEL
73
{=BINOMDIST(2;14;0.028;TRUE)}= 0.993662 99.3662%
c) in sample we have minimum 4 products with defect 4, 5 or more products with
defect, what is opposite event for cumulative frequency (maximum 3 products with
defect or 1, 2 or 3 products with defect). Event and opposite event for sum of
probabilities have 1, so we can use Excel to get probability for opposite event (1, 2 or
3 products with defect) and than use that characteristic:
1- {=BINOMDIST(3;14;0.028;TRUE)}=1- 0.999509=0.000491 0.491%
Example 4.
For monitoring of work for one automat machine, inspector will take sample with 10
products. On base of 50 samples we get this information about number of products
with defect:
Number of
products with
defect
Number of
samples
0 6
1 11
2 15
3 10
EXAMPLES IN EXCEL
74
4 7
5 1
50
We have to create appropriate theoretical approximation for this empirical distribution.
Solution:
This is discrete random variable. We have two modalities in one trial: product can be
correct or with defect. That shows us that appropriate theoretical distribution is
binomial distribution. According to empirical distribution of frequencies we will
calculate average and standard deviation. We can con use Excel function directly,
because this is grouped distribution and we will set up formulas for calculate average
and standard deviation:
10 , 50 = = n N
Result is:
EXAMPLES IN EXCEL
75
Or we will create new column (xf) and sum for that column we will divide with sum
of absolute frequencies:
k
x
k
f
k k
f x
0 6 0
1 11 11
2 15 30
3 10 30
4 7 28
5 1 5
50 104
104
2.08
50
k k
x f
X
N
= = =
_
Then we will calculate standard deviation:
Result is:
EXAMPLES IN EXCEL
76
Or we will create new column
k k
f x
2
and calculate o with general formula
2
2
2 k k
x f
X
N
o

=
_
:
k
x
k
f
k k
f x
k k
f x
2
0 6 0 0
1 11 11 11
2 15 30 60
3 10 30 90
4 7 28 112
5 1 10 25
50 109 298
2
2
2 2
298
2.08 1.63 1.278
50
k k
x f
X
N
o o
= = = =
_
Now we will test that conditions for binomial approximations are satisfied:
2
2.08
1 2.08 1 1.65 1
10
X X
X X
n n
o
| | | |
| |
= = ~
| |
|
\ .
\ . \ .
0 0.208 1
X
n
s = s
Conditions are satisfied so we can apply approximation. Then is: 0.208
X
p
n
= = and
0, 792 q = .
10
10
0.208 0.709 , 0, 5
b x x b b
x x x
p x f p N
x
| |
= = =
|
\ .
In Excel we will create formula for probability calculations
10
10
0.208 0.709 , 0, 5
b x x
x
p x
x
| |
= =
|
\ .
and than according to these theoretical
probabilities we can compute theoretical frequencies
b b
x x
f p N = :
EXAMPLES IN EXCEL
77
With Paste option we can complete other cells in column with theoretical probabilities.
Result is:
Now we will compute theoretical frequencies:
EXAMPLES IN EXCEL
78
With Paste option we can complete other cells in column with theoretical frequencies.
Result is:
That was procedure for approximation with binomial distribution. Now we have
schedule for this variable and we can make predictions. Quality of approximation will
be measured by error of approximation.
Error of approximation for modalities is:
b
k k k
d f f =

EXAMPLES IN EXCEL
79
Because of different signs, we will square those errors:
We will sum square of errors:

EXAMPLES IN EXCEL
80
2 2
1 9.589
0.872
1 11
b k
d
n
o = = =
+
_
Error of approximation is 0.872.
PUISSUN DISTRIBUTIUN
The Poisson distribution is a useful discrete probability distribution when you are
interested in the number of times a certain event will occur in a given unit of area or
time. This type of situation occurs frequently in a business. of opportunity approaches
zero as the area of opportunity becomes smaller. The Poisson distribution has one
parameter 0 i > , which is average or expected number of events per unit.
Probability distribution of Poisson random variable
The probability distribution of a Poisson random variable is: ( )
!
x
e
p x
x
i
i
=
where is:
x number of events per unit (number of successes per unit)
EXAMPLES IN EXCEL
81
( ) p x is the probability of x successes given a knowledge of i
i average number of events per unit (average number of successes per unit)
e=2.71828 (constant)
Poisson probability function
2
The horizontal axis is the index k. The function is only defined at
integer values of k (empty lozenges). The connecting lines are only
guides for the eye.
Example 5.
If the probability that an individual be late on job on Friday is 0.001, determine the
probability that out of 2000 individuals.
a) exactly 3
b) more than 2
individuals will be late on job on Friday.
Solution:
p=0.001 - probability that an individual be late on job on Friday (rare event
Poisson distribution)
2000 0.001 2 N p i = = =
2
2
( )
! !
x x
e e
p x
x x
i
i

= =
a)
2 3
2
(3) 0.18
3!
e
p
= =
There is 18% of chance that out of 2000 individuals exactly 3 will be late on job on
Friday.
EXAMPLES IN EXCEL
82
b)

2 0 2 1 2 2
( 2) (3) (4) ... 1 (0) (1) (2)
2 2 2
1 0.323
0! 1! 2!
p x p p p p p
e e e

> = + + = + + =
1
= + + =
(
]
There is 32.3% of chance that out of 2000 individuals more than 2 will be late on job
on Friday.
Example 6.
Suppose that, on average, three customers arrive per minute at the bank during the
noon to 1 p.m. hour. What is probability that in a given minute exactly two customers
will arrive?
Solution:
We are interested in the number of times a certain event will occur in a given unit of
time Poisson distribution.
i=3
3
3
( )
! !
x x
e e
p x
x x
i
i

= =
3 2
3
(2) 0.224
2!
e
p
= =
There is 22.4% probability that at in a given minute exactly two customers will arrive.
Example 7.
If probability that randomly selected person will be daltonist is 0.3% what is
probability that between 2800 persons we will find:
a) 4 daltonists
b) more than 3 daltonists.
c) not more than 2 daltonists.
Solution (by Excel-a):
0.003 0.3% p = Rare event Poisson distribution
2800 0.003 8.4 n p i = = =
8,4
8.4
( )
! !
x x
e e
p x
x x
i
i

= =
We will use Excel function:
EXAMPLES IN EXCEL
83
a) exactly 4 daltonists
We ask for probability in point not for cumulative function, so for option Cumulative
we will take False.
= = ) 4 ( X P {=POISSON(4;8.4;FALSE)}= 0.046648 4.6648%
b) more than 3 daltonists, this is opposite to cumulative distribution so for option
Cumulative we will take True and on the end we will find probability for opposite
event:
EXAMPLES IN EXCEL
84
1- = s ) 3 ( X P 1-{=POISSON(3;8.4;TRUE)}=1- 0.03226= 0.96774 96.774%
c) not more than 2 daltonists, this is cumulative distribution so for option Cumulative
we will take True.
= s ) 2 ( X P {=POISSON(2;8.4;TRUE)}=0.0100471.0047 %
Characteristics of the Poisson distribution
Shape
Poisson distribution is always positively (right) skewed.
Mean
( ) E X i = =
Variance
,
2
2
E X o i
1
= =
]
i
o
1
3
= ,
i
o
1
3
4
+ = .
EXAMPLES IN EXCEL
85
The Poisson distribution can be derived as a limiting case to the binomial
distribution as the number of trials goes to infinity and the expected number of
successes remains fixed. Therefore it can be used as an approximation of the
binomial distribution if n is sufficiently large and p is sufficiently small. There is a
rule of thumb stating that the Poisson distribution is a good approximation of the
binomial distribution if n is at least 20 and p is smaller than or equal to 0.05.
According to this rule the approximation is excellent if n 100 and np 10.
Example 8.
In one office there is copy machine. We want to determine average number of
incorrect copies. We take samples with 1000 copies, number of trials was 250 and
results are:
number of
incorrect copies
Number
of
samples
0 10
1 20
2 40
3 55
4 50
5 40
6 15
7 10
8 5
9 3
10 2
250
Solution:
This is discrete random variable. We have two modalities in one trial: copy can be
correct or incorrect. That shows us that appropriate theoretical distribution is binomial
or Poisson distribution. According to empirical distribution of frequencies we will
calculate average and standard deviation. We can con use Excel function directly,
because this is grouped distribution and we will set up formulas for calculate average
and standard deviation:
100 , 250 = = n N
EXAMPLES IN EXCEL
86
Result for average is:
We will find variance:
EXAMPLES IN EXCEL
87
Result for variance is:
There is o ~
2
X Poisson distribution, 3.65 X i = =
3.65
3.65
!
x
p
x
p e
x
=
EXAMPLES IN EXCEL
88
In Excel we will create formula for probability calculations
3.65
3.65
, 0
!
x
p
x
p e x
x
= >
and than according to these theoretical probabilities we can compute theoretical
frequencies
b b
x x
f p N = :
With Paste option we can complete other cells in column with theoretical probabilities.
Result is:
Now we will calculate theoretical frequencies:
EXAMPLES IN EXCEL
89
With Paste option we can complete other cells in column with theoretical frequencies.
Result is:
That was procedure for approximation with Poisson distribution. Now we have
schedule for this variable and we can make predictions. Quality of approximation will
be measured by error of approximation.
EXAMPLES IN EXCEL
90
Error of approximation for modalities is:
b
k k k
d f f =
Because of different signs, we will square those errors:

EXAMPLES IN EXCEL
91
We will sun that square errors:

EXAMPLES IN EXCEL
92
2 2
1 1941.47
7.76
1 251
b k
d
n
o = = =
+
_
Approximation error is 7.76.
EXAMPLES IN EXCEL
93
HYPERCEUMETRIC DISTRIBUTIUN
Hipergeometric distribution H(N,n,p) is distribution for n random Bernoullis
dependent variables. There is sampling without replications. Symbols are:
N- number of elements in population
M- number of elements in population with characteristic A
n- number of elements in sample
k - number of elements in sample with characteristic A
N M k N n s s s ,
h
k
p is probability that in sample from that population be k elements with
characteristic A: ,
n
N
k n
N
k
N
C
C C
n
N
k n
N
k
N
k X p
=
|
|
.
|
\
|
|
|
.
|
\
|
|
|
.
|
\
|
= =
2 1
2 1
Expectations and variance are:
,
1
;
2 1 2 1
= =
N
n N
N
N
N
N
n
N
N
n X E o
This distribution has application in sampling procedure. When is (n/N<1/10) we can
approximate hypergeometric distribution with binomial distribution.
Example 9.
In firm, we have 10 economists and 22 employees with other vocations. What is
probability that sample of 8 employees will have 3 employees with other vocations?
Solution:
3 , 22 , 8 , 32 = = = = k M n N
22 10
3 5
0.037 3.7%
32
8
h
k
M N M
k n k
p
N
n
| | | | | | | |

| | | |
\ . \ . \ . \ .
= = =
| | | |
| |
\ . \ .
Example 10.
In population we have 30 products and there is 30% of incorrect products. We will
choose sample with 4 products without replications. What is probability that we will
have not more than 2 incorrect products?
30% incorrect there is 9 incorrect products in population
Without replication dependent events hipergeometric distribution.
not more than 2 incorrect products 0 or 1 or 2 incorrect products
We will apply Excel function for hipergeometric distribution:
EXAMPLES IN EXCEL
94
- probability that we select 0 incorrect products
={=HYPGEOMDIST(0;4;9;30)} = 0.21839121.84%
- probability that we select 1 incorrect product
={=HYPGEOMDIST(1;4;9;30)} = 0.43678243.68%
probability that we select 2 incorrect products
EXAMPLES IN EXCEL
95
={=HYPGEOMDIST(2;4;9;30)} = 0.27586227.59%
- Finally, probability that we will have not more than 2 incorrect products is sum of
previous find probabilities (like or probability for mutually excluded events)
0.931034 93.1%
NURMAL DISTRIBUTIUN
The normal distribution, also called the Gaussian distribution, is an important family
of continuous probability distributions, applicable in many fields. Each member of the
family may be defined by two parameters, location and scale: the mean ("average", )
and variance (standard deviation squared,
2
) respectively.
The continuous probability density function of the normal distribution is the Gaussian
function:
,
2
1
2 1
, , ( )
2
x E
i
i i
x f x e
o
o t
| |
|
|
\ .
e + =

where > 0 is the standard deviation, the real parameter is the expected value. To
indicate that a real-valued random variable X is normally distributed with mean and
variance 0, we write
2
( ; ) X N o
EXAMPLES IN EXCEL
96
Normal probability density function
3
The red line is the standard normal distribution
The standard normal distribution is the normal distribution with a mean of zero and a
variance of one (the red curves in the plots to the right). According to transformation
formula that will be:
2
2
2
3 4
1
, ( ) , (0,1),
2
( ) 0, 1, 0, 3
i
z
i
i i
Z
x E
z z e Z N
E Z
o t
o o o

= =
= = = =
The probability density function has notable properties including:
symmetry about its mean
the mode and median both equal the mean
the inflection points of the curve occur one standard deviation away from the
mean, i.e. at and + .
The cumulative distribution function of a probability distribution, evaluated at a
number (lower-case) x, is the probability of the event that a random variable X with
that distribution is less than or equal to x. The cumulative distribution function of the
normal distribution is expressed in terms of the density function as follows:
2
1
2
1
( ) ( )
2
i
x E
x
i i
x p X x e dx
o
|
o t
| |
|
|
\ .
= s =

}
x
~
EXAMPLES IN EXCEL
97
The cumulative distribution function of a probability distribution, evaluated at a
number (lower-case) z, is the probability of the event that a random variable Z with
that distribution is less than or equal to z. The cumulative distribution function of the
standardized normal distribution (red line) is expressed in terms of the density
function as follows:
2
2 1
( ) ( )
2
i
z
z
i i
F z p z z e dz
t

= s =
}
There are tables with values of cumulative distribution function of the standardized
normal distribution.
Roles for standardized normal distribution
Roles for determination probability for different kinds of cases with standardized
normal distribution are:
1. ( ) 1 ( )
i i
p Z z F z > =
2. ( ) ( ) ( )
i j j i
i j p z Z z F z F z < s s =
5. ( ) 1 ( )
i i
p Z z F z s =
6. ( ) ( ) ( ) 2 ( ) 1
i i i i i
p z Z z F z F z F z < s = =
On next two graphs we can see illustration for determination area under curve for
standardized normal distribution (probability):
1. ( 1.25) (1.25) p z F s =
EXAMPLES IN EXCEL
98
2.
( 1.25) ( 1.25) ( 1.25) ( 1.25) 1 ( 1.25)
( 1.25) 1 (1.25)
p z F p z p z p z
F F
s = = s = > = s
=
Characteristic intervals for normal distribution
If
2
~N( ; ) X o then we have characteristic intervals for distances of one, two and
three standard deviations from the mean:

68.3% p X o o s s + =

2 2 95.4% p X o o s s + =

3 3 99.7% p X o o s s + =
Example 5.
The tread life of a certain brand of tire has a normal distribution with mean 35000
miles and standard deviation 4000 miles. For randomly selected tire, what is
probability that its life is:
a) less than 37200 miles
b) more than 38000 miles
c) between 30000 and 36000 miles
d) less than 34000 miles
e) more than 33000 miles.
EXAMPLES IN EXCEL
99
Solution:
2
(35000; 4000 ) X N
First we have to standardize or to transform x in z. We use Excel function:
For probabilities with z scores we use Excel function:
a) less than 37200 miles
~
EXAMPLES IN EXCEL
100
Standardize or to transform x in z
Or by formula:
we made transformation from to
37200 35000
( 37200) ( 0.55)
4000
x z
p x p z p z
| |
< = < = < =
|
\ .
This is table value for cumulate because z is positive and relation is <. We dont ask
for probability in point than for cumulative function, so for option Cumulative we will
take True:
Or by formula:
(0.55) 0.708840 70.884%
from tables
F = =
b) more than 38000 miles
Standardize or to transform x in z
EXAMPLES IN EXCEL
101
Or by formula:
38000 35000
( 38000) ( 0.75) 1 (0.75)
4000
p x p z p z F
| |
> = > = > =
|
\ .
This is not table value for cumulate because z is positive and relation is >. We dont
ask for probability in point than for cumulative function, so for option Cumulative we
will take True but on the end we will apply formula for opposite event:
Or by formula for opposite events:
1 0.773373=0.226627 22.6627%
from tables
=
d) between 30000 and 36000 miles
First standardization:
EXAMPLES IN EXCEL
102
And
Or by formula:
30000 35000 36000 35000
(30000 36000) ( 1.25 0.25)
4000 4000
(0.25) ( 1.25)
p x p z p z
F F
| |
< < = < < = < < =
|
\ .
= =
Now we will find cumulative probabilities for that z scores:
EXAMPLES IN EXCEL
103
And
Now we complete formula:
0.598706 0.10565=
=0.493056 49.3056%
from tables
=
e) less than 34000 miles

First standardization:
EXAMPLES IN EXCEL
104
Or by formula:
34000 35000
( 34000) ( 0.25)
4000
p x p z p z
| |
< = < = < =
|
\ .
Then cumulative probabilities:
0.401294 40.1294%
from tables
=
f) more than 33000 miles
Transformation in z:
EXAMPLES IN EXCEL
105
Or formula:
33000 35000
( 33000) ( 0.5) (0.5)
4000
p x p z p z F
| |
> = > = > =
|
\ .
Then cumulates:
This is opposite event:
tablic value
= 1-0,308537=0,691463 69,1463%
Example 6.
Scores on an examination taken by a very large group of students are normally
distributed with mean 700 and deviation 120. It is decided to give a failing grade to
the 5% of students with lowest scores. What is minimum score needed to avoid a
failing grade (or maximum score that means a failing grade)?
Solution:
2
(700;120 ) X N
EXAMPLES IN EXCEL
106
0
0
( ) 0.05
?
p x x
x
< =
=
There is inverse situation, we know probability and we need to find z and x for that
probability. We will use Excel function NORMINV:
Or mathematically by formula:
0 0 0 0
0
0
( ) 0.05 ( ) 0.95 1.65 1.65
700
1.65 502
120
from tables
p z z p z z z z
z x
x
x
< = < = = =
= =
Minimum score needed to avoid a failing grade is 502.
Example 7.
A journal editor find that the length of time that elapses between receipt of a
manuscript and a decision on publication follows a normal distribution with mean 18
weeks and deviation 4 weeks. The probability is 0.2 that it will take longer, than how
many weeks before a decision is made on a manuscript?
Solution:
2
(18; 4 ) X N
0
0
( ) 0.2
?
p x x
x
> =
=
There is opposite for table cumulate. So, we will find z for table value (1-0.2)=0.8.
EXAMPLES IN EXCEL
107
Or by formula:
0 0 0
0
0
( ) 0.2 ( ) 1 0.2 0.8 0.85
18
0.85 21.4
4
from tables
p z z p z z z
z x
x
x
> = < = = =
= =
21.4 weeks before a decision is made on a manuscript.
Example 14.
For 100 employees in one company we know annual income (in 000 KM):
Annual
income
Number of
employees
60-62 5
62-64 20
64-66 42
66-68 27
68-70 6
Solution:
This is continuous variable. We will use approximation with normal distribution. We
will replace intervals with centre of intervals:
EXAMPLES IN EXCEL
108
First we will compute average and deviation for empirical distribution:
Then is:
Now we will standardize upper limit for intervals:
2 2
( ) 65,18
1, 9
i i
i
x
L A X L
z
o

= =
EXAMPLES IN EXCEL
109
For this z scores we will find table cumulative probabilities with function
NORMSDIST . On the beginning and on the end we will make new intervals with x: -
and +, so cumulative for that intervals are 0 and 1:
Then we will find theoretical probabilities for normal distribution according to the
relation:
1 1
( ) ( )
i i i
p F z F z
+ +
= and theoretical frequencies according to relation:
ti i
f N p = :
EXAMPLES IN EXCEL
110
That was procedure for approximation with normal distribution. Now we can find
approximation error:
5
2
1
1
( )
n i ti
i
f f
n
o
=
=
_
. First we will compute square distances
theoretical from empirical frequencies:
EXAMPLES IN EXCEL
111
Then we create formula for approximation error:
Or by formula:
5
2
1
1 8.6772
( ) 1.32
5
n i ti
i
f f
n
o
=
= = =
_
Approximation error
STUDENT t-DISTRIBUTIUN
T distribution was constructed 1908. by W.S.Gosset, but he published that with
pseudonym Student and we call this distribution Student t distribution. He created
that when he worked with results on samples methods. Density function is:
EXAMPLES IN EXCEL
112
,
2
2
1
1
2
1
,
2
1
1
1
n
n
t n
B n
t f
|
|
.
|
\
|
+
|
.
|
\
|
= ,
where is
|
.
|
\
|
2
1
,
2
1 n
B beta-function with parameters
2
1
,
2
1 n
and n is number of
elements.
With function F(t) we can compute probability that variable has value more that fixed
t, and we can find tables with appropriate probabilities.
Shape of t distribution depends on n, but (n-1) is degree of freedom or v (ni). Degree
of freedom is number of independent observations minus number of parameters that
define distribution: k n df = = v
Student distribution is wider than normal distribution. For greater values of n (more
than 30) student distribution tends to standardized normal distribution.
T distribution doesnt have application in concrete problems like normal distribution,
but it is very important for inferential statistic. So we will see finding t when we know
probability.
Example 15.
For degrees of freedom 9 = n , we have to find
0
t , for
0 0
( ) 0.99 P t t t s s = . For the
same distribution we have to determine the function of probability if t = 2.54.
Solution:
This is inverse situation when we know area (probability) between two symmetric t
scores. We use Excel function for Two-tailed:
We calculate for opposite event:
EXAMPLES IN EXCEL
113
Or by formula:
0 0 9 0 9 0 0
( ) 2 ( ) 1 0.99 ( ) 0.995 3.3 P t t t S t S t t s s = = = =
Now we have to find function of probability and cumulative probability if t = 2.54.
We will use function TDIST:
CHI-SQUARE ,
2
_ DISTRIBUTIUN
Applies in cases where the need to make a decision on the significant difference of
actual (observed) and theoretical (expected) frequency, or the value of variable
(characteristics). Marked by the Greek letter hi , _ , is defined as the sum of the
distances (relationship difference) between the observed and expected values
according to the expected values, that is
,
_
=
=
r
i i
i i
e
e m
1
2
2
_ ,
i
m - observed frequency
-
i
e expected (theoretical) frequency
This distribution can take values from 0 to , the values are always positive, depends
on the number of degrees of freedom, and for each number of degrees of freedom hi-
square distribution is different. Probability distributions are given in the table. The
EXAMPLES IN EXCEL
114
table was to give information to the 30 degrees of freedom, and if it is about more
than 30 degrees of freedom R. A. Fisher suggests that took the form 1 2 2
2
v _ ,
that is approximately normally distributed, so that the case can apply the table surface
below the normal curve.
Arithmetic mean hi-square distribution is equal to the number of degrees of freedom,
a mode is where it is
2
_ = 2 v (unless if 1 = v ), variance is v 2 and coefficient of
skewness
v
2
. From the expression for the coefficient of skewness follows that this
distribution is very asymmetrical for a small number of degrees of freedom, and that
with increasing degrees of freedom approaching to symmetric distribution.
In the specific problems it has no autonomous application as a normal distribution, but
it is very important for inferential statistics. Therefore, we observe the calculations
with hi-square distribution.
Example 16.
If degrees of freedom is 5 and we know probability 0.9, we have to find appropriate
2
0
_ value, if probability is for
2 2
0
_ _ > . With same conditions find
2
0
_ if probability
is for
2 2
0
_ _ s .
Solution:
5 = n
2 2
0
_ _ > - this is direct relation for Excel function CHIINV.
Or by formula:
2 2 2
0 0
( ) 0, 9 1, 61 P _ _ _ > = =
Opposite event is
2 2
0
_ _ s , so:
EXAMPLES IN EXCEL
115
2 2 2 2
0 0
( ) 0, 9 ( ) 1 0, 9 0,1 P P _ _ _ _ s = > = = . That means:
Or by formula:
2 2 2
0 0
( ) 1 0, 9 0,1 9, 5 P _ _ _ > = = =
DISTRIBUTIUN
We suppose that is:
X continuous random variable which has a hi-square distribution ,
2
_ with m
degrees of freedom and
Y continuous random variable which has a hi-square distribution ,
2
_ with n
degrees of freedom
These two variables are independent,
Then the variable F, which we define like quotient of quotients for previously defined
variables and their respective degrees of freedom:
n Y
m X
F
/
/
= follows Ficher-
Snedecor's distribution with the degrees of freedom
|
|
.
|
\
|
n
m
. Distribution of probability
is not balanced or symmetric with respect to m and n.
Random variable takes the value of the interval (0, ) and distribution has the
following format:
,
, 2
1
2
,
1
2
2
n m
m
n m
x
x
n
s
m
n m
x f
+
|
.
|
\
|
I I
|
.
|
\
| +
I
=
where m and n represent degrees of freedom (df).
Expected values and variance are:
EXAMPLES IN EXCEL
116
2 za
2
>
= n
n
m
,
, ,
4 za ,
4 2
2 2
2
2
>

+
= n
n n
m n m
o
Ficher's (F) distribution is used in cases where we want to analyze variability two
basic populations based on the sample. We will use the F distribution to test
hypotheses about the equality of two samples variance over their relations on the basis
of the number of degree of freedom for each of them. When the referent populations
meetings normally distributed then the quotient two independent assessments variance
given in the form of:
2
2
2
1
o
o
= F
Example 17
If it is a Fisher-Snedecor's schedule use it to determine
0
F if the appropriate number
of degrees of freedom is
1 2
4, 7 v v = = and the corresponding likelihood is
0
( ) 0, 05 P F F > = .
Solution:
There is relation > , so we can direct apply Excel function FINV:
Or: 12 , 4 7 , 4 , 05 , 0
0 2 1
= = v = v = F p
tablica iz
LUCNURMAL DISTRIBUTIUN
Lognormal distribution characteristics are as follows:
Probability distribution of random variables whose logarithms (base 10 or e)
below normal distribution.
There is curved - the asymmetrical right.
When we find logarithm value for variables whose distribution is curved to the
right, obtained logarithms follow a normal distribution.
EXAMPLES IN EXCEL
117
As a measure of central tendencies in lognormal distribution is used geometric
mean for x or the arithmetic mean of ln(x) or log(x).
It is defined by the expectations and standard deviation.
If there is a normal distribution for ln(x), then x has a lognormal distribution with the
function of probability:
,
2
2
1 1 1
( ) exp ln , 0
2 2
f x x x
x

o t o
1
= >
(
]
,
Where is o standard deviation for ln x and expected value for ln( ) x .
Unlike the normal distribution, lognormal distribution is not balanced, but seeks to
normal distribution if there is a value less than 0.1.
On next graph we can see function lognormal probability distribution depending on
the value of standard deviations:
In Excel function LOGNORMDIST elements are:
x the value for which we observe the function.
Mean is average for ln(x) or log(x).
Standard-dev. is standard deviation for ln(x) or log(x).
Example 18.
We have next information: 4, x = expected value for ln(x) is 3.5 and standard
deviation is 1,2. How to read the appropriate value of the timetable for the lognormal
function?
Solution:
We use Excel function:
EXAMPLES IN EXCEL
118
Or by normal distribution First standardization:

EXAMPLES IN EXCEL
119
We get same result.
If you know the appropriate function of the timetable (probability) and we want to
determine which x is responsible use the inverse function LOGINV. For example, for
this problem we know that value for cumulative probability is 34% and we want to
find x for that function:
EXPUNENTIAL DISTRIBUTIUN
Exponential distribution is used in the analysis of reliability to describe the time "out"
components or systems. Exponential functions of probability distribution is:
( ) , 0
x
f x e x
i
i

= > , where is 0 i > constant.
Expected value and variance exponential distribution are:
2
2
1 1
o
i i
= = .
Parameter i is the rate of "fault" system, while the expected value of the distribution
of the average time until the next "out" system.
EXAMPLES IN EXCEL
120
Exponential distribution is used for planning time between certain procedures, stages
or steps. For example, we know how much time is needed cash-machine that pays
cash. We can use the exponential distribution, for example to determine may be that
the process of a maximum duration of 2 minutes.
Example 19.
We know that: 0, 4 x = , and expected value is 0,1. How to read the appropriate
probability or likelihood for the exponential distribution?
Solution:
We will use Excel function EXPONDIST:
We need i and we know that is
1 1 1
10
0,1
i
i
= = = = . So, we are defined
exponential distribution.
For probability that is 0, 4 x = in option Cumulative we will choose False:
For probability that is 0, 4 x s in option Cumulative we will choose True:
EXAMPLES IN EXCEL
121
CAMA DISTRIBUTIUN
Gamma function of probability distribution is:
1
1
( ) , 0
( )
x
f x x e x
o |
o
| o

= >
I
,
4
where is 0 o > parameter for slope and
1
0
|
>
is measuring parameter for scale.
Expected value and variance for gamma distribution are:
2 2
o | o o | = = .
If o=1 then is gamma distribution equal to the exponential distribution. Gamma
distribution can have different shapes depend on o and |. This makes it useful model
for a wide range of continuous random variables.
If
2
n
o = and 2 | = we will get special form of gamma distribution - chi-square
distribution, where n is number for degrees of freedom.
Gamma distribution is used in the case asymmetric distribution. We have practical
application of the theory in the ranks.
Example 20.
Let the value of continuous random variable is 8 x = . How to read the appropriate
value for gamma distribution if the parameters are 6 i 2 o | = = ? If the probability
that x is less than the default value is equal to 54% to determine the value.
Solution:
4
( ) o I is gamma function, defined like
1
0
( ) , 0
x
x e dx
o
o o

I = >
}
. if o is positive integer, then is
( ) ( 1)! o o I = .
EXAMPLES IN EXCEL
122
For known x and determination of probability we will use Excel function
GAMMADIST:
For probability that 8 x = in option Cumulative we will take False:
For probability that 8 x s in option Cumulative we will take True:
Now we will find value for x if probability that x is less than the default value is equal
to 54%. We will use function GAMMAINV:
EXAMPLES IN EXCEL
123
For that gamma distribution x may be less than the default value of 11.83 is equal to
54%.
APRUXIMATIUNS FUR BINUMIAL, PUISSUN AND
HYPERCEUMETIC DISTRIBUTIUN WITH NURMAL
DISTRIBUTIUN
There are conditions for approximations for binomial, Poisson and hypergeometic
distribution with normal distribution:
10%
n
N

30
0,10
n
p

( , , ) H N n p ( , ) B n p ( ) P
10%
30
n
N
n
20
10
(1 ) 10
n
np
n p

15
( , ) N
INFERENCIAL STATISTICS
124
III. Inferential statistics: Estimation tbeory and
bypotbesis testing
INFERENCE
Inferential statistics are used to draw inferences about a population from a sample. It
is very important that the chosen sample is randomly selected and representative for
the population. However you select the sample there is always the likelihood of some
level of sample error. But, there is role: the larger sample lead to the smaller sample
error.
Consider an experiment in which 10 subjects who performed a task after 24 hours of
sleep deprivation scored 12 points lower than 10 subjects who performed after a
normal night's sleep. Is the difference real or could it be due to chance? This is the
type of questions answered by inferential statistics.
There are two main methods used in inferential statistics: estimation and hypothesis
testing. In estimation, the sample is used to estimate a parameter and a confidence
interval about the estimate is constructed. A confidence interval gives an estimated
range of values which is likely to include an unknown population parameter, the
estimated range being calculated from a given set of sample data
5
:
( ) 1 P h h 0 o s s + =
where is:
- statistic from sample
0- parameter from population
h surroundings
, 1 o - confidence
o- first type error
In the most common use of hypothesis testing, null hypothesis is put forward and it is
determined whether the data are strong enough to reject it. For the sleep deprivation
study, the null hypothesis would be that sleep deprivation has no effect on
performance.
Inferential statistics are used to make generalizations from a sample to a population. It
is possible that error occur. There are two sources of error that may result in a
sample's being different from (not representative of) the population from which it is
drawn. These are
Sampling error - chance, random
error, decreases as the sample size
increases
Sample bias - constant error, due to
inadequate frame and design, does not
depend on the size of the sample
5
Definition taken from Valerie J. Easton and John H. McCools Statistics Glossary v1.1
125
Inferential statistics take into account sampling error. These statistics do not
correct for sample bias.
THE DISTRIBUTIUN UF THE SAMPLE MEANS
The distribution of sample means has some interesting characteristics. First, if our
samples are big enough (a large n), then the sampling distribution will approximate a
normal distribution. Second, the mean of our sampling distribution, which is
sometimes designated
X
, will be the same as the population mean. Together, these
two properties of sampling distributions comprise the central limit theorem.
Third, as you also know, to compute probabilities from a normal distribution, you
have to know the standard deviation of the distribution. In this case, the standard
deviation of the sampling distribution is called the standard error of means, designated
X
o , and is calculated by dividing the population standard deviation by the square root
of n. In other words, standard error of means can be calculated as:
X
n
o
o = .
Standard error of means depends on the sample size (n), so the larger sample lead to
the smaller standard error of means.
CUNFIDENCE INTERVAL FUR THE PUPULATIUN MEAN
Standard deviation from population is known
For a population with unknown mean and known standard deviation o for
population, a confidence interval for the population mean, based on a simple random
sample of size n, is:
,
2 ( ) 1 1
X X
P X z X z F z o o o s s + = =
where:
X is the mean from the sample
z is the upper (1
2
o
) critical value for the standard normal distribution and
depends on required confidence
X
n
o
o = is standard error of means.
This is rare situation when we know standard deviation from population.
126
Standard deviation from population isnt known
If standard deviation from population isnt known, unbiased estimator from the
sample is:
,
2
1
i i
i
x X f
n
o

=
_
where
i
o is standard deviation from sample.
In most practical research, the standard deviation for the population of interest is not
known. In this case, the standard deviation from population o is replaced by the
estimated standard deviation from sample
i
o , also known as the standard error. Since
the standard error is an estimate for the true value of the standard deviation, the
distribution of the sample mean X is no longer normal with mean and standard
deviation
n
o
. Instead, the sample mean follows the t distribution with mean and
standard deviation
i
n
o
. The t distribution is also described by its degrees of freedom.
For a sample of size n, the t distribution will have (n-1) degrees of freedom. The
notation for a t distribution with k degrees of freedom is
k
t .
Well, for a population with unknown mean and unknown standard deviation, a
confidence interval for the population mean, based on a simple random sample of size
n, is:
,
1 1 1
2 ( ) 1 1
n n n X X
P X t S X t S S t o

s s + = =
where:
X is the mean from the sample
t is the upper (1
2
o
) critical value for the t distribution with (n-1) degrees of
freedom,
1 n
t

i
X
S
n
o
= is approximation for standard error of means.
This is most common situation when we dont know standard deviation from
population.
As the sample size n increases, the t distribution becomes closer to the normal
distribution, since the standard error approaches the true standard deviation o for
large n. So, for sample size n >30, we can use normal instead of t distribution.
Example 1.
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6
127
different samples of the liquid. What is the confidence interval for the population
mean at a 95% confidence level?
Solution:
First we will calculate statistics for sample: Tools Descriptive statistics. We will
choose option for confidence interval and appropriate significance level:
128
, 1 0, 95 0, 05 o o = =
Unknown standard deviation o for population:
X X
X t S X t S s s +
n>30, unknown standard deviation o for population, we know only standard
deviation
i
o for sample t distribution
1 6 1 5
( ) 1 0, 975
2
n
S t
o
= =
= = from tables or with Excel function TINV:
X
S
X
t S
129
101.82 2.57 0.402 101.82 2.57 0.402
100.78 102.85 ( 5%)
o
s s +
s s =
Confidence interval for the population mean at a 95% confidence level is (101.78-
102.85).
Example 2.
According to report for 2009. year, we have data about predicted Recovery rate in
cent per dollar after closing business (from
http://www.doingbusiness.org/CustomQuery/, predictions for 2009. year) for sample
with 33 countries. We have data in Excel sheet (A1-A33). We have to construct
confidence interval for Recovery rate for population of all countries with first type
error 1%.
Solution:
For beginning, we will calculate statistics for sample of 33 countries: Tools
Descriptive statistics:
130
X X
X z S X z S s s +
131
n>30, we don't know deviation for population o , we only know deviation from
sample
i
o z distribution
( ) 1 0, 995
2
F z
o
= = from tables or with Excel function NORMSINV:
52.91 2.58 4.05 52.91 2.58 4.05
42.48 63.34
s s +
s s
Confidence interval for Recovery rate for population of all countries with first type
error 1% is (42.48-63.34).
Example 3.
The dataset "Normal Body Temperature, Gender, and Heart Rate" contains 130
observations of body temperature, along with the gender of each individual and his or
her heart rate. Sample mean is 98.249 and sample standard deviation is 0.733. Find a
99% confidence interval for the mean of population.
Solution:
,
130
98.249
0.733
1 0.99 0.01
i
n
X
o
o o
=
=
=
= =
n>30, unknown standard deviation o for population, we know only standard
deviation
i
o for sample z distribution
X X
X z S X z S s s +
( ) 1 0.995 2.58
2
from tables
F z z
o
= = =
132
0.733 0.733
98.249 2.58 98.249 2.58
130 130
98.08 98.41
s s +
s s
Confidence interval for the population mean at a 99% confidence level is (98.08-
98.41).
CUNFIDENCE INTERVAL FUR THE PUPULATIUN
PRUPURTIUNS
Applying the general formula for a confidence interval, the confidence interval for a
proportion, , is
p
p z t o 1 e
]
where is:
p is the proportion in the sample,
z depends on the level of confidence desired, and
p
o , the standard error of a proportion, is equal to:
, 1
p
n
t t
o

=
where is:
is the proportion in the population and
n is the sample size.
Since is not known, p is used to estimate it. Therefore the estimated value of
p
o is:
, 1
p
p p
S
n

=
and than will be:
p
p z S t 1 e
]
Example 4.
Based on the HBS 2004 databases have information on 7413 households for the
variable marital status holder households. On the basis of this information to assess
proportion of households whose holder married to a complete population of
households in B&H, with the first type of error of 2%.
Solution:
It is necessary first for the sample of n=7413 households calculate proportion of
households where the holder of household persons in marriage:
133
Value for p is: 0.7124 p = and n=7413.
, 1
0.7124 0.2876
0.005
7414
p
p p
S
n

= = =
0.02 ( ) 1 0.99
2
F z
o
o = = =
134
0.7124 2.326 0.005 0.7124 2.326 0.005
0.701 0.724
p
p z S t
t
t
1 e
]
s s +
s s
Confidence interval for proportion of households whose holder married to a complete

population of households in B&H states (70.1-72.4%).
CUNFIDENCE INTERVAL FUR VARIANCE IN PUPULATIUN
Depending on whether the sample is small or large for the determination of
confidence interval for the population variance we use chi-square or normal
distribution according to the following forms:
small sample
2 2
2
2 2
1,1 1,
2 2
1
i i
n n
n n
P
o o
o o
o o
_ _

| |
|
s s =
|
|
\ .
2 2
1
1,1
2
2 2
1
1,
2
( ) 1
2
( )
2
n
n
n
n
P
P
o
o
o
_ _
o
_ _

=
=
large sample
, ,
2 2
2
2 2
2 2
2 ( ) 1
2 3 2 3
i i
n n
P F z
n z n z
o o
o o
| |

|
s s = =
|
| +
\ .
( ) 1
2
F z z
o
=
135
Example 1. cont.
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6
different samples of the liquid. What is the confidence interval for the population
variance at a 97% confidence level?
Solution:
It is necessary first to determine variance from the sample:
2
0, 9697
i
o =
It is a small sample and we use a chi-square distribution, it is a function CHINV:
2 2 2
1 6 1 5, 0,985
1,1
2
2 2 2
1 6 1 5, 0,015
1,
2
( ) 1
2
( )
2
n
n
n
n
P
P
o
o
o
_ _ _
o
_ _ _
=

=
= =
= =
Now we can complete the term for confidence interval:
136
2
2
6 0, 9697 6 0, 9697
14, 098 0, 662
0, 413 8, 789
o
o

s s
s s
Confidence interval for variance variable temperature boiling with reliability 97%
read (0,413-8,789).
Example 2. cont.
According to report for 2009. year, we have data about predicted Recovery rate in
cent per dollar after closing business (from
http://www.doingbusiness.org/CustomQuery/, predictions for 2009. year) for sample
with 33 countries. We have data in Excel sheet (A1-A33). We have to construct
confidence interval for variance of variable Recovery rate for population of all
countries with first type error 4%.
Solution:
It is necessary first to determine variance from the sample:
2
541,16
i
o =
It is a large sample and we use a normal distribution, it is a function NORMDIST:
, ,
2 2
2
2 2
2 2
2 3 2 3
i i
o
n n
n z n z
o o
o

s s
+
We need a value for z if the first type of error was 4%:
Now we can complete the term for confidence interval:
, ,
2
2 2
2
2 33 541,16 2 33 541,16
2 33 3 2, 05 2 33 3 2, 05
353, 62 1030, 49
o
o

s s
+
s s
Confidence interval for variance of variable Recovery rate for population of all
countries with first type error 4% is (353,62-1030,49).
137
HUW TU DETERMINE SAMPLE SIZE ACCURDINC TU
SAMPLE ERRUR?
Determining sample size for estimating population mean
Determining sample size is a very important issue because samples that are too large
may waste time, resources and money, while samples that are too small may lead to
inaccurate results. In many cases, we can easily determine the minimum sample size
needed to estimate a population parameter, such as the population mean .
When sample data is collected and the sample mean X is calculated, that sample
mean is typically different from the population mean . This difference between the
sample and population means can be thought of as an error. The margin of error
X
E
is the maximum difference between the observed sample mean X and the true value
of the population mean :
2 2
X X
E z z
n
o o
o
o = =
where:
1
2
z
o
is known as the critical value, the positive z value that is at the vertical
boundary for the area of
2
o
in the right tail of the standard normal distribution.
o is the population standard deviation.
Rearranging this formula, we can solve for the sample size necessary to produce
results accurate to a specified confidence and margin of error:
2
1
2 X
n z
E
o
o
| |
=
|
\ .
This formula can be used when you know o and want to determine the sample size
necessary to establish, with a confidence of , 1 o , the mean value to within E .
138
You can still use this formula if you dont know your population standard deviation o
and you have standard deviation for sample:
2
1
2
i
X
n z
E
o
o
| |
=
|
\ .
Example 5.
A consumer group wants to estimate the mean electric bill for the month July for
single-family homes in a large city. Based on studies conducted in other cities, the
standard deviation is assumed to be $25. The group wants to estimate the mean bill
for July to within $5 of the true average with 95% confidence. What sample size is
needed?
Solution:
25
5
0.05
?
X
E
n
o
o
=
=
=
=
0.05 ( ) 1 0.975
2
F z
o
o = = =
2
2
1
2
25
1.96 96
5
X
n z
E
o
o
| |
| |
= = =
|
|
\ .
\ .
They need sample with 96 single-family homes in a large city.
Determining sample size for estimating population proportion
To develop formula for determining the appropriate sample size needed when
constructing a confidence interval estimate of the proportion, recall equation for
confidence interval estimate of the proportion:
139
, 1
p p
E z z
n
t t
o

= =
where:
1
2
z
o
is known as the critical value, the positive value that is at the vertical
boundary for the area of
2
o
in the right tail of the standard normal distribution.
t is the proportion from population.
Rearranging this formula, we can solve for the sample size necessary to produce
results accurate to a specified confidence and margin of error.
,
2
2
1
p
z
n
E
t t
=
This formula can be used when you know t and want to determine the sample size
necessary to establish, with a confidence of , 1 o , the proportion for population to
within
p
E . You can still use this formula if you dont know your population
proportion and you have a proportion from sample:
,
2
2
1
p
z p p
n
E

=
Example 6.
If you want to be 99% confident of estimating the population proportion to within an
error of 0.02 and there is historical evidence that the population proportion is
approximately 0.4, what sample size is needed?
Solution:
0.01
0.4
0.02
?
p
E
n
o
t
=
=
=
=
0.01 ( ) 1 0.995
2
F z
o
o = = =
140
,
2
2
2 2
1
2.58 0.4 0.6
3994
0.02
p
z
n
E
t t

= = =
We need sample with 3994 elements.
HYPUTHESIS TESTINC
Hypothesis testing typically begins with some theory, claim, or assertion about a
particular parameter of a population. For example, for purposes of statistical analysis,
your initial hypothesis about the cereal example is that the process is working
properly, meaning that the mean fill is 368 grams, and no corrective action is needed.
The hypothesis that the population parameter is equal to the company specification is
referred to as the null hypothesis. A null hypothesis is always one of status quo, and is
identified by the symbol H
0
. Here the null hypothesis is that the filling process is
working properly, that the mean fill per box is the 368-gram specification. This can be
stated as:
H
0
: =368
Whenever a null hypothesis is specified, an alternative hypothesis must also be
specified, one that must be true if the null hypothesis is found to be false. The
alternative hypothesis H
1
is the opposite of the null hypothesis H
0.
This is stated in the
cereal example as:
H
1
: =368
The alternative hypothesis represents the conclusion reached by rejecting the null
hypothesis if there is sufficient from the sample information to decide that the null
hypothesis is unlikely to be true.
Hypothesis-testing methodology is designed so that the rejection of the null
hypothesis is based on evidence from the sample that the alternative hypothesis is far
more likely to be true. However, failure to reject the null hypothesis is not proof that it
is true. One can never prove that the null hypothesis is correct because the decision is
based only on the sample information, not on the entire population. Therefore, if you
fail to reject the null hypothesis, you can only conclude that there is insufficient
evidence to warrant its rejection.
141
The following key points summarize the null and alternative hypotheses:
1. The null hypothesis H
0
represents the status quo or the current belief in a situation.
2. The alternative hypothesis H
1
is the opposite if the null hypothesis and represents
a research claim or specific inference we would like to prove.
3. If we reject the null hypothesis, we have statistical proof that the alternative
hypothesis is correct.
4. The failure to prove the alternative hypothesis, however, does not mean that we
have proven the null hypothesis.
5. The null hypothesis H
0
always refers to specified value of the population
parameter (such as ), not a sample statistic (such as X ).
6. The statement of the null hypothesis always contains an equal sign regarding the
specified value of the population parameter (H
0
: =368)
7. The statement of the alternative hypothesis never contains an equal sign regarding
the specified value of the population parameter (H
1
: =368).
Hypothesis-testing methodology provides clear definitions for evaluating such
differences and enables us to quantify the decision-making process so that the
probability of obtaining a given sample result can be found if the null hypothesis is
not reject. This is achieved by first determining the sampling distribution for the
sample statistic of interest (e.g. the sample mean) and then computing the particular
test statistics based on the given sample result.
Regions of rejection and non-rejection
The sampling distributions of the test statistics is divided into two regions:
Region of rejection (critical region) and
Region of non-rejection.
If the test statistic falls into the region of non-rejection, the null hypothesis cannot be
rejected. If a value of the test statistic falls into this rejection region, the null
hypothesis is rejected because that value is unlikely if the null hypothesis is true.
142
When we use a sample statistic to make decision about a population parameter, there
is a risk that an incorrect conclusion will be reached. Two different types of errors can
occur when applying hypothesis testing methodology, type I errors and type II errors.
A type I error occurs if the null hypothesis H
0
is rejected when a fact it is true and
should not be rejected. The probability of a type I error occurring is o. A type II error
occurs if the null hypothesis H
0
is not rejected when a fact it is false and should be
rejected. The probability of a type II error occurring is |.
The confidence coefficient (1-o) is the probability that the null hypothesis H
0
is not
rejected when in fact it is true and should not be rejected. The power of a statistical
test (1-|) is the probability of rejecting the null hypothesis when in fact it is false and
should be rejected.
Risks in decision making process
Next table illustrates the results of the two possible decision (do not reject H
0
or reject
H
0
) that can occur in any hypothesis test. Depending on the specific decision, one of
two types of errors may occur or one of two types correct conclusion may be reached.
Actual situation
Statistical decision
H
0
true H
0
false
do not reject H
0
Correct decision
Confidence=(1-o)
Type II error
p(type II error)=|
reject H
0
Type I error
p(type I error)=o
Correct decision
Power=(1-|)
Procedure for hypothesis testing
Several steps can describe procedure for hypothesis testing:
1. Determine null and alternative hypothesis
2. State critical of test statistics according to significance or confidence level and
appropriate theoretical distribution
3. Calculate the test statistic according to values from the sample
4. Compare test statistic to critical values draw conclusion.
Hypothesis for the mean
We begin with the problem of testing the simple null hypothesis that the population
mean is equal, higher or lower than some specified value
0
.
known
1. Two-tailed test
143
0 0 1 0
. .
.
1
2 2
. .
1. : / :
( )
2
2. ,
( ) 1
2
t t
t
t t
H H
F z z
z z z
F z z
o o

o
o

= =
1
=
(
1
e (
(
( ]
=
(
]
0
.
. . 0 . . 1
3.
4. ,
e
X
e t e t
X
z
z z H z z H
=
e e
2. One-tailed test
a.
0 0 1 0
. .
0
.
. . 0 . . 1
1. : / :
2. ( )
3.
4. ,
t t
e
X
e t e t
H H
F z z
X
z
z z H z z H

o
o
> <
=
=
> <
b.
0 0 1 0
. .
0
.
. . 0 . . 1
1. : / :
2. ( ) 1
3.
4. ,
t t
e
X
e t e t
H H
F z z
X
z
z z H z z H

o
o
s >
=
=
s >
unknown, small sample
1. Two-tailed test
0 0 1 0
1 . .
.
1
2 2
1 . .
0
.
. . 0 . . 1
1. : / :
( )
2
2. ,
( ) 1
2
3.
4. ,
n t t
t
n t t
e
X
e t e t
H H
S t t
t t t
S t t
X
t
S
t t H t t H
o o

o
o
= =
1
=
(
1
e (
(
( ]
=
(
]
=
e e
2. One-tailed test
a.
0 0 1 0
1 . .
0
.
. . 0 . . 1
1. : / :
2. ( )
3.
4. ,
n t t
e
X
e t e t
H H
S t t
X
t
S
t t H t t H

o
> <
=
=
> <
144
b.
0 0 1 0
1 . .
0
.
. . 0 . . 1
1. : / :
2. ( ) 1
3.
4. ,
n t t
e
X
e t e t
H H
S t t
X
t
S
t t H t t H

o
s >
=
=
s >
unknown, large sample
1. Two-tailed test
0 0 1 0
. .
.
1
2 2
. .
0
.
. . 0 . . 1
1. : / :
( )
2
2. ,
( ) 1
2
3.
4. ,
t t
t
t t
e
X
e t e t
H H
F z z
z z z
F z z
X
z
S
z z H z z H
o o

o
o
= =
1
=
(
1
e (
(
( ]
=
(
]
=
e e
2. One-tailed test
a.
0 0 1 0
. .
0
.
. . 0 . . 1
1. : / :
2. ( )
3.
4. ,
t t
e
X
e t e t
H H
F z z
X
z
S
z z H z z H

o
> <
=
=
> <
b.
0 0 1 0
. .
0
.
. . 0 . . 1
1. : / :
2. ( ) 1
3.
4. ,
t t
e
X
e t e t
H H
F z z
X
z
S
z z H z z H

o
s >
=
=
s >
Example 7.
Studies have shown that the average height of adult European males 176.28 cm.
Determine whether there is a statistically significant difference between the average
height of adult men in the city of Sarajevo based sample of 48 citizens of Sarajevo
male gender (data in an Excel table) and the European average with the first type of
error of 5%.
145
Solution:
It is necessary first to determine the sample average height and standard deviation:
We do not know the standard deviation for the population, the sample is large, then it
is two-way z test:
0 1
.
.
1. : 176, 28 / : 176, 28
( ) 0, 025
2
2.
( ) 1 0, 975
2
t
t
H H
F z
F z

o
o
= =
1
= =
(
(
(
= =
(
]
146

.
1, 96, 1, 96
t
z e
0
.
. . 1
182, 33 176, 28
3. 4, 87
1, 24
8, 61
1, 24
48
4.
e
X
X
e t
X
z
n
z z H
o
o
o

= = =
= = =
e
There is significant difference between the average height of adult men, the city of
Sarajevo and the European average (5% error).
Or SPSS solution:
147
One-Sample Statistics
N Mean Std. Deviation Std. Error Mean
height 48 182,3333 8,61551 1,24354
One-Sample Test
Test Value = 176.28
95% Confidence Interval of the
Difference
t df Sig. (2-tailed) Mean Difference
Lower Upper
height 4,868 47 ,000 6,05333 3,5517 8,5550
p value for t test is p=0,000<0,05
1
H .
Example 8.
The director of admissions at a large university advises parents of incoming students
about the cost of food during a typical semester. A sample of 80 students enrolled in
the university indicates a sample meat cost of $315.4 with a sample standard deviation
of $43.2. Using the 0.01 level of significance, is there evidence that the population
mean is less than $320?
Solution:
0
43.2
80
315.4
320
i
n
X
o
=
=
=
=
We dont know standard deviation for population, sample is large and this is one-
tailed z test:
148
0 1
.
1. : 320 / : 320
2.. ( ) 0.01
t
H H
F z

o
> <
= =
.
0
.
. . 0
2.33
315.4 320
3. 0.95
4.83
43.2
4.83
80
4.
t
e
X
i
X
e t
z
X
z
S
S
n
z z H
o
=

= = =
= = =
>
There is no evidence that the population mean is less than $320.
We can not use SPSS procedure directly, because there is one-tailed test.
Example 9.
A manufacturer of flashlight batteries took a sample of 13 batteries from a days
production and used them continuously until they failed to work. The life as measured
by the number of hours until failure was:
342, 426, 317, 545, 264, 451, 1049, 631, 512, 266, 492, 562, 298.
At the level of significance 0.1, is there evidence that the mean life of the batteries is
more than 350 hours?
Solution:
From original data we calculate :
149
0
13
473.46
210.77
350
i
n
X
o
=
=
=
=
We dont know standard deviation for population, sample is small and this is one-
tailed t test:
0 1
1 12 . .
0
.
. . 1
1. : 350 / : 350
2. ( ) 1 0.9 1.78
473.46 350
3. 2.11
58.45
210.77
58.45
13
4.
n t t
e
X
i
X
e t
H H
S t t
X
t
S
S
n
t t H

o
o
=
s >
= = =

= = =
= = =
>
150
There is evidence that the mean life of the batteries is more than 350 hours.
A two sample test for mean
Means are used to summarize distributions based on continuous data (interval or ratio
measurement). A statistical measure called the t test is used to test for the significance
of the difference between two means. The t test assesses the degree of overlap in the
distribution of scores in each of two samples being compared. When the two
distributions are highly similar, there will be little difference between the means.
When scores in one distribution are distributed differently from the other, there is a
greater probability that the difference between the means will be greater.
A t test can be used with large or small samples. However, as the sample size
becomes smaller, mean differences have to be larger to become significant. In
addition to the requirement of continuous measurement, the t test assumes that the
variable being measured is normally distributed in the population from which the
sample was selected. Even when distributions for samples are mildly skewed, it may
be reasonable to assume a normal distribution for the variable in the population.
However, when the distribution for a sample is badly skewed or you doubt that the
variable is normally distributed in the population, you should not use a t test. As an
alternative you can compare medians or convert continuous data to a set of intervals
and conduct a chi square test.
We have two main type of test for the significance of the difference between two
mean:
1. If
1 2
2 30 n n + > z distribution
0 1 2 1 1 2
. .
.
1
2 2
. .
2 2
2 1 1 2 2 1 2 1 2
.
1 2 1 2
. . 0 . . 1
1. : / :
( )
2
2. ,
( ) 1
2
( 1) ( 1)
3. ,
2
4. ,
t t
t
t t
i i
e d
d
e t e t
H H
F z z
z z z
F z z
n n X X n n
z S
s n n n n
z z H z z H
o o

o
o
o o
= =
1
=
(
1
e (
(
( ]
=
(
]
+ +
= =
+
e e
2. If
1 2
2 30 n n + s t distribution
151
1 2
1 2
0 1 2 1 1 2
2 . .
.
1
2 2
2 . .
1 2
.
. . 0 . . 1
1. : / :
( )
2
2. ,
( ) 1
2
3.
4. ,
n n t t
t
n n t t
e
d
e t e t
H H
S t t
t t t
S t t
X X
t
s
t t H t t H
o o

o
o
+
+
= =
1
=
(
1
e (
(
( ]
=
(
]
=
e e
We also have different procedures depending on the test of whether the samples are
independent or dependent.
Example 10.
We conducted the research on the impact lack of sleep on the ability of solving
mathematical tasks. On a sample of 30 of the first test mathematics applied in the
"normal" circumstances. After that we not allowed to them to sleep 72 hours, and is
applied parallel to the test (the test results in an Excel table). Is there significant
difference in the results of 1st and 2 testing? The data are in the table, use the
reliability of 0.94.
Solution:
1 2
2 30 n n + > z distribution, paired samples
0 1 2 1 1 2
1. : / : H H = =
Data used in the analysis of paired samples:
152
t-Test: Paired Two Sample for Means
Variable 1Variable 2
Mean 28,13333 26,06667
Variance 45,29195 32,61609
Observations 30 30
Pearson Correlation 0,853868
Hypothesized Mean Difference 0
df 29
t Stat 3,231368
P(T<=t) one-tail 0,001531
t Critical one-tail 1,601972
P(T<=t) two-tail 0,003063
t Critical two-tail 1,957293
1
0, 003 0, 05 p H = <
There is significant difference between the averages for the population which means
that it is confirmed the existence of a lack of sleep impact on the ability of solving
mathematical tasks.
Or SPSS variant:
153
154
Paired Samples Statistics
Mean N Std. Deviation Std. Error Mean
test1 28,1333 30 6,72993 1,22871 Pair 1
test2 26,0667 30 5,71105 1,04269
Paired Samples Correlations
N Correlation Sig.
Pair 1 test1 & test2 30 ,854 ,000
Paired Samples Test
Paired Differences
95% Confidence
Interval of the
Difference
Mean
Std.
Deviation
Std. Error
Mean
Lower Upper
t df
Sig. (2-
tailed)
Pair 1 test1-
test2
2,06667 3,50304 ,63956 ,75861 3,37472 3,231 29 ,003
Of course the results are the same.
A two sample test for variances
For testing hypotheses about the (non)existence of differences between variances two
populations based on their samples using F test:
155
1 2
2 2 2 2
0 1 2 1 1 2
2 . 2
1 1 2 2
2 2
1 1
. 2 2
2 2
. 0 . 1
1. : / :
2. ( )
2
1, 1
3.
4. ,
o o o o
teor
u
izr
u
i teor i teor
H H
P F F F F
n n
S
F
S
F F H F F H
o o
o o o o
o
v v
o
o
>
= =
= =
= =
= =
< >
Example 11.
In 4.b. grade of 1. Gymnasium measured the emotional intelligence. The results of the
test are given in an Excel table. Is there a statistically significant difference in
intelligence between genders? ( 5% o = )?
Solution:
1 2
2 30 n n + < t distribution, independent sample
0 1 2 1 1 2
1. : / : H H = =
Data used in the analysis for independent samples, but first we have F test to check
whether variance equal:
156
F-Test Two-Sample for Variances
Variable 1 Variable 2
Mean 83,3125 77,58333
Variance 35,42917 28,26515
Observations 16 12
df 15 11
F 1,253458
P(F<=f) one-tail 0,358325
F Critical one-tail 2,71864
p value of F test is greater than 0.05 variance equal.
157
t-Test: Two-Sample Assuming Equal Variances
Mean 77,58333 83,3125
Variance 28,26515 35,42917
Observations 12 16
Pooled Variance 32,39824
df 26
t Stat -2,63574
p value of t test is less than 0.05 averages are not equal, and the conclusion follows
that there are significant differences in intelligence between gender.
Or SPSS test:
Both the samples are presented in the same column but the column of projects make
selection according to gender:
158
159
Group Statistics
gender N Mean Std. Deviation Std. Error Mean
M 12 77,5833 5,31650 1,53474 EI
16 83,3125 5,95224 1,48806
Independent Samples Test
Levenes Test for
Equality of Variances t-test for Equality of Means
95% Confidence Interval
of the Difference
F Sig. t df
Sig. (2-
tailed)
Mean
Difference
Std. Error
Difference Lower Upper
Equal
variances
assumed
,412 ,526 -2,636 26 ,014 -5,72917 2,17365 -10,19716 -1,26117 EI
Equal
variances
not
assumed
-2,680 25,122 ,013 -5,72917 2,13770 -10,13075 -1,32758
p value of t test is less than 0.05 averages are not equal, and the conclusion follows
that there are significant differences in intelligence between gender.
Example 12.
The company X are checked as being of employees affects the number of days sick
leave. A random sample selected 14 employees were younger age (20 to 30 years) and
14 employees, older age (50 to 60 years). Data on number of days of sick leave in
160
2008. year are given in an Excel table. Is there a statistically significant difference in
the number of days of sick leave between the two referent age group, the reliability of
99%?
Solution:
1 2
2 30 n n + < t distribution, independent samples
0 1 2 1 1 2
1. : / : H H = =
Data used in the analysis for independent samples, but first we have F test to check
whether variance equal:
F-Test Two-Sample for Variances
Variable 1 Variable 2
Mean 7,071429 5,714286
Variance 136,2253 18,83516
Observations 14 14
df 13 13
F 7,232497
P(F<=f) one-tail 0,000542
F Critical one-tail 3,905204
p vralue for F test lower than 0,01 variances are not equal.
161
t-Test: Two-Sample Assuming Unequal Variances
Mean 5,714286 7,071429
Variance 18,83516 136,2253
Observations 14 14
df 17
t Stat -0,40779
p value of t test is greater than 0.01 averages are equal, the conclusion that there is
no statistically significant difference in the number of days of sick leave between the
two referent age group.
Or SPSS test:
Both the sample are presented in the same column, but in the column before make
selection according to age group. Choosing Compare Means - Independent samples:
ANOVA
daysofsick
Sum of Squares df Mean Square F Sig.
Between Groups 13,000 2 6,500 ,109 ,897
Within Groups 2335,286 39 59,879
Total 2348,286 41
162
Group Statistics
grupa N Mean Std. Deviation Std. Error Mean
M 14 5,7143 4,33995 1,15990 daysofsick
S 14 7,0714 11,67156 3,11936
Levenes Test for
Equality of Variances t-test for Equality of Means
95% Confidence Interval
of the Difference
F Sig. t df
Sig. (2-
tailed)
Mean
Difference
Std. Error
Equal
variances
assumed
8,536 ,007 -,408 26 ,687 -1,35714 3,32802 -8,19799 5,48371 daysofsi
ck
Equal
variances
not
assumed
-,408 16,527 ,689 -1,35714 3,32802 -8,39398 5,67970
Conclusion is same.
Testing differences between arithmetic means of more than
two populations on the basis of their samples - analysis
variance ANOVA
The aim of the analysis variance that are testing whether there is a difference between
the arithmetic means of two basic paper on the basis of their samples and comparing
163
their variances. In other words, we want to investigate the influence of various k
factors on the one character
k
A A A ,..., ,
2 1
. Therefore, we have k samples with their
elements and the sample works only factor. For example, investigating the influence
of different fertilizer the harvest yields some kind of wheat. If the number of elements
in the i-th sample is
i
n , and if the j-th element and the sample we designate with
ij
x ,
we have the following results of measurements:
k
i
kn k k
in i i
n
n
x x x
x x x
x x x
x x x
...
... ... ... ...
...
... ... ... ...
...
...
2 1
2 1
2 22 21
1 12 11
2
1
Arithmetic mean and variance these samples are:
k i x
n
X
j
n
j
ij i
, 1 ,
1
1 1
= =
_
=
, k i X x
n
j
n
j
i ij i
, 1 ,
1
1
2
1
2
= = o
_
=
If all of these blocks are connected in one sample returns a sample with
_
=
=
k
i
i
n n
1
elements of the arithmetic mean
_
=
=
k
i
i i
X n
n
X
1
1
and total variance
,
__
= =
=
k
i
n
j
ij t
i
X x S
1 1
2
2
.
As the , , , , ,
2
1
2
1
2
1
2
X X n X x X X X x X x
i i
n
j
i ij
n
j
i i ij
n
j
ij
i i i
+ = + =
_ _ _
= = =
, then
is , ,
2 2
1
2
1 1
2
2
A r
k
i
i i
k
i
n
j
i ij t
S S X X n X x S
i
+ = + =
_ __
= = =
, where is
,
_ __
= = =
o = =
k
i
i i
k
i
n
j
i ij r
n X x S
i
1
2
1 1
2
2
residual variance, ,
_
=
=
k
i
i i A
X X n S
1
2
2
factorial
variance.
Degrees of freedom are: , , , . , 1 , 1
2 2 2
k n S k S n S
r A t

Appropriate assessments for variances are:
1
2
=
n
S
W
t
t
- this is estimate for total variance for population and is a result of
fluctuations in the sample as well as all other causes that effectively influence
the characteristic seen.
164
1
2
=
k
S
W
A
A
- this is estimate for a mid-grade variance more groups of samples
and is a result of fluctuations as the sample and the diversity of actions to the
factors. Therefore it is called a factorial variance.
k n
S
W
r
r
=
2
- this is estimation for total variance base with whom he eliminated
the influence of factors. It is a product of the fluctuations in the sample and,
therefore, is called residual variance.
If there is no difference in the effects of different factors to the characteristic
seen variances and
A r
W W should represent the same variance and the
quotient
2
2
1
r
A
r
A
i
S
S
k
k n
W
W
F
= = should not be significantly different from 1.

If it is correct, our assumption that all k samples belong to a same normally divided
basic set (not different to their arithmetic mean), then the theoretical value for the
comparison have a timetable value of F with 1
1
= v k and k n = v
2
degrees of
freedom and first type error o, so thats will be
t i
F F s .
Example 13.
Data on number of days of sick leave of employees of enterprises X in 2004. The
results are in an Excel table. We analyzed the three employee groups: younger age (20
- 30 yrs.), mean age (30 - 50 yrs.) and older age (50 - 60 yrs.). Is there a statistically
significant difference in the number of days of sick leave among the three referent age
group, ie, whether impact significantly age the number of employees on sick leave?
(reliability 95%)?
Solution:
How date are in three groups, to test the differences between averages we use
ANOVA:
165
Anova: Single Factor
SUMMARY
Groups Count Sum Average Variance
Column 1 14 80 5,714286 18,83516
Column 2 14 91 6,5 24,57692
Column 3 14 99 7,071429 136,2253
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 13 2 6,5 0,108552 0,897402 3,238096
Within Groups 2335,286 39 59,87912
Total 2348,286 41
p value of F test is greater than 0.01 averages are equal, the conclusion that there
is no statistically significant difference in the number of days of sick leave among the
three referent age group.
Or SPSS options:
All three samples are presented in the same column, but in the column before make
selection according to age group.
166
167
ANOVA
daysofsick
Sum of Squares df Mean Square F Sig.
Between Groups 13,000 2 6,500 ,109 ,897
Within Groups 2335,286 39 59,879
Total 2348,286 41
Conclusion is same.
Chi-square (
2
_ ) test
For a contingency table that has r rows and c columns,
2
_ test can be generalized as a
test of independence or association in the joint responses to two categorical variables.
Contingency table (m
ij
=f
ij
, n
i.
=f
i.
, n
.j
=f
.j
) looks like:
modalities for variable B modalities
for
variable A
B
1
B
2
... B
j
... B
c
total ()
A
1
m
11
m
12
m
1j
m
1c
n
1.
A
2
m
21
m
22
m
2j
m
2c
n
2.
...
A
i
m
i1
m
i2
m
ij
m
ic
n
i.
...
A
r
m
r1
m
r2
m
rj
m
rc
n
r.
total () n
.1
n
.1
n
.j
n
.c
n
168
0
1
0 . . 1 . .
1. : there is no relationship between two categorical variables
or that variables are independent /
: there is relationship between two categorical variables
or that variables are dependent
: / :
2. (
ij i j ij i j
k
H
H
H p p p H p p p
P
= - =
, ,
,
2 2 2
. .
2
( ) ( ) 2
.
. . .
.
. . 2
) 1
1 1
3. ,
( )
, , ,
t t
sum off all frequencies in the row sum off all frequencies in the column e t
e t
t
j i j
i
i j ij ij ij
k r c
row total column total
f f
f
f overal sample size n
n n n
n
or p p p e n p
n n n
_ _ o _
_
< =
=
= = (
(
]
= = = = =
_
,
2
. . 2
.
1 1
2 2 2 2
. . 0 . . 1
4. ,
r c
ij ij
i j
izr
i j
ij
e t e t
m e
n n
n e
H H
_
_ _ _ _
= =
=
< >
__
Example 14.
A large corporation is interested in detrmining whether an association exists between
the commuting time of its employees and the level of stress related problems observed
on the job. A study of 116 assembly line workers reveals the following:
Stress Commuting
time high moderate low total
Under 15 min 9 5 18 32
15-45 min 17 8 28 53
Over 45 min 18 6 7 31
total 44 19 53 116
At the level of significance 0.1, is there evidence of a significant relationship between
commuting time and stress?
Solution:
In contingency table we have information about empirical frequency. We will
calculate theoretical frequency by the formula:
( ) ( )
( )
sum off all frequencies in the row sum off all frequencies in the column
t
row total column total
f
overal sample size n
=
169
Theoretical frequencies
Stress Commuting
time high moderate low total
Under 15 min 12,13793 5,241379 14,62069 32
15-45 min 20,10345 8,681034 24,21552 53
Over 45 min 11,75862 5,077586 14,16379 31
total 44 19 53 116
Now we can calculate
,
2
e t
t
f f
f
:
170
,
2
e t
t
f f
f
high moderate low

Under 15 min 0,811226 0,011116 0,781067
15-45 min 0,479092 0,053428 0,591452
Over 45 min 3,312873 0,167569 3,623318
On the end we will sum all
,
2
e t
t
f f
f
:
171
,
2
2
.
9.831141
e t
e
t
f f
f
_
1
= = (
(
]
_
Appropriate
2
_ test procedure is:
, ,
0
1
2 2
4 .
1. : there is no relationship between two categorical variables /
: there is relationship between two categorical variables
2. ( ) 1 0.99
1 1 2 2 4
k t
H
H
P
k r c
_ _ o
=
< = =
= = =
,
2
.
2
2
.
2 2
. . 0
13.277
3. 9.831141
4.
t
e t
e
t
e t
f f
f
H
_
_
_ _
=
1
= = (
(
]
<
_
There is no evidence of a significant relationship between commuting time and stress
172
Example 15.
In the framework of a survey among tourists who go to the Sarajevo airport (sample
of 216 passengers) between the other set are the following questions:
What kind of tourism is the theme (motive) of your visitors?

How do you assess the security situation in B&H?
How do you rate the hotel accommodation that you have had in Bosnia and
Herzegovina?
Answers are given in the Excel table.
Examine whether the type of tourism that is the motive of the visit and given a total
interdependence with error 1%.
Solution:
Used SPSS:
173
In Statistics option we will choose:
Case Processing Summary
Cases
Valid Missing Total
N Percent N Percent N Percent
motive * gradeS 211 97,7% 5 2,3% 216 100,0%
174
Case Processing Summary
Cases
Valid Missing Total
N Percent N Percent N Percent
motive * gradeS 211 97,7% 5 2,3% 216 100,0%
motive * gradeH 200 92,6% 16 7,4% 216 100,0%
motive * gradeS
Crosstab
Count
gradeS
1,00 2,00 3,00
Total
1,00 2 7 2 11
2,00 7 10 0 17
3,00 2 5 2 9
4,00 67 40 4 111
5,00 30 11 2 43
motive
6,00 6 8 6 20
Total 114 81 16 211
Chi-square Tests
Value df
Asymp. Sig. (2-
sided)
Pearson Chi-square 37,653
a
10 ,000
Likelihood Ratio 33,789 10 ,000
N of Valid Cases 211
a. 8 cells (44,4%) have expected count less than 5. The minimum expected
count is ,68.
P value of chi-square test was 0000 which is less than 0.01 and indicates that the 1%
error can claim that the referent crossword variables are not independent, it is that
independent variables affect the dependent variable. In this case the motive of arrival
and city safety are dependent variables.
Symmetric Measures
Value Approx. Sig.
175
Nominal by Nominal Contingency Coefficient ,389 ,000
motive * gradeH
Crosstab
Count
gradeH
1,00 2,00 3,00
Total
1,00 4 6 1 11
2,00 6 9 1 16
3,00 3 6 0 9
4,00 45 52 5 102
5,00 21 18 3 42
motive
6,00 10 5 5 20
Total 89 96 15 200
Chi-square Tests
Value df
Asymp. Sig. (2-
sided)
Pearson Chi-square 14,457
a
10 ,153
Likelihood Ratio 12,553 10 ,250
a. 8 cells (44,4%) have expected count less than 5. The minimum expected
count is ,68.
P value of chi-square test was 0.153 which is greater than 0.01 and indicates that the
1% error can claim that the referent crossword independent variables. In this case the
motive of arrival and grade of hotel accommodation are independent variables.
Symmetric Measures
Value Approx. Sig.
Nominal by Nominal Contingency Coefficient ,260 ,153
176
Test for differences between proportion for populations
If we are to examine whether there are significant differences between proportions for
two or more than two populations based on samples from the structure of these
populations. Model used for testing is as follows:
,
'
0 1 2
1
2 2 2
. .
'
2
2 1
.
1
1
2 2 2 2
. . 0 . . 1
1. : ... ... /
: , 1,
2. ( ) 1
1 1( 0)
3. , ,
4. ,
k m
k
teor teor
k
m
k m
k kt
k
izr kt k m
k
kt
k
k
izr teor izr teor
H P P P P P
H P P k m
P
k m r m r
f
f f
f n p p
f
n
H H
_ _ o _
_
_ _ _ _
=
=
=
= = = = = =
- = =
< =
= = =
1
= = = (
(
]
< >
_
_
_
Where is:
m- number of samples (number of populations)
k
P - proportion in k-th population
k
n - sample size for sample from k-th population
) (
kt k
f f - empirical (theoretical) frequency
Example 16.
On 4 separate areas are investigating the purchase of coffee. It is assumed that the
coffee in the same proportion buying consumers in each of these 4 areas. We have
selected a sample of consumers coffee to test this assumption.
area Sample size Number of coffee
consumers and buyers in
sample
A 100 20
B 200 35
C 150 37
D 250 43
total 700 135
Can we accept the assumption that the proportion of buyers coffee equal to each area
with a 5% error?
Solution:
area Sample size -
i
n
Number of coffee consumers and
buyers in sample -
i
f
A 100 20
B 200 35
C 150 37
D 250 43
total 700 135
177
'
0 1 2
1
2 2
4 1 3
1. : ... ... same proportion for each area
: not same proportion for each area, 1,
2. ( ) 1 0, 05 0, 95
k m
k
t
k
H P P P P P
H P P k m
P _ _
= =
= = = = = =
- = =
< = =
2
7, 815
t
_ =
1
1
135
3. where is 0,19286
700
m
k
k
ti i m
k
k
f
f n p p
n
=
=
= = = =
_
_
area
i
n
i
f
Expected number of
coffee consumers and
buyers in sample -
ti
f
,
ti
ti i
f
f f
2
A 100 20 19,286 0,02643

B 200 35 38,572 0,33079
C 150 37 28,929 2,25176
D 250 43 48,215 0,56406
sum 700 135 135,002 3,17304
,
2
2
1
2 2
0
3,17304,
4.
m
k kt
i
k
kt
i t
f f
f
H
_
_ _
=
1
= = (
(
]
>
_
Therefore, we can say that in every area of proportions equal to coffee customers.
Test adequacy of approximations(goodness of fit)
If we have previously approximation for empirical distribution by some theoretical
schedule, and we want to examine the quality (adequacy of the approximations) we
use a nonparametric chi-square test:
: . 1
0
H Arrange the population is a specific form connected to a specific theoretical
distribution of frequency /
0 1
: H H approximation is not correct
178
,
'
2 2 2 '
. .
2
2
.
1
2 2 2 2
. . 0 . . 1
2. ( ) 1 , 1
3.
4. ,
teor teor
k
m
k kt
izr
k
kt
izr teor izr teor
P k m r
f f
f
H H
_ _ o _
_
_ _ _ _
=
< = =
1
= (
(
]
< >
_
Where is:
r - number of parameters that are estimated from empirical data
m- number of modalities or intervals
) (
kt k
f f - empirical (theoretical) frequencies
Example 17.
For empirical distribution:
modalities frequencies
0 150
1 100
2 50
3 15
4 7
5 2
We assume that behaves according Poisson distribution. We have to test the validity
of these assumptions. ( % 4 = o )
Given that the Poisson distribution of a characteristic parameter and it is the same as
the arithmetic mean, like first we will calculate arithmetic mean of the series (using
the Paste function):
m X = ={=SUMPRODUCT(A45:A50;B45:B50)/SUM(B45:B50)}= 0,873457.
Then we calculate the theoretical frequency Poisson distribution as follows:
{=324*POISSON(x;0,873457;FALSE)} for each x from interval 0 to 5.
Modalities
(A45:A50)
Frequencies
(B45:B50)
(C45:C50)
0 150 135,2719
1 100 118,1541
2 50 51,60127
3 15 15,02383
4 7 3,280666
5 2 0,573104
sum 324 323,7571
Given that we have a class with a frequencies lower than 5 must make transformation:
modalities
(E45:E48)
Frequencies
(F45:F48)
(G45:G49)
0 150 135,2719
1 100 118,1541
179
2 50 51,60127
3 24 18,8776
We will apply chi square test:
{=CHITEST(F45:F48;G45:G48)} empirical probability is 0,120048 and based on it
with a 2 degree of freedom returns = _
2
e
{=CHIINV(0,000563349;2)}= 4,239725
Now we will found chi-square theoretical value: = _
2
t
{=CHIINV(0,04;2)}= 6,437737
There is
2 2
e t
_ _ < we can not reject null hypothesis assumption that data from
research conduct by Poisson distribution accept.
Kolmogorov-Smirnov test
KS test is nonparametric test and examines whether the analyzed variable behaves by
default theoretical distribution. Used with the larger sample of 50 observations.
Option for the implementation of KS test provides SPSS program.
Example 18.
For 208 employees, we track the data on the amount of wages. The data presented in
the Excel sheet. Whether the analyzed variable behaves according to normal
distribution (the reliability of 99%).
Solution:
From the Excel we will take data to the data into SPSS sheet:
180
181
Descriptive Statistics
N Mean Std. Deviation Minimum Maximum
wage 208 39,9231 11,25548 26,70 97,00
One-Sample Kolmogorov-Smirnov Test
wage
N 208
Mean 39,9231 Normal Parameters
a,,b
Std. Deviation 11,25548
Absolute ,138
Positive ,138
Most Extreme Differences
Negative -,136
Kolmogorov-Smirnov Z 1,997
Asymp. Sig. (2-tailed) ,001
a. Test distribution is Normal.
b. Calculated from data.
As the p value of KS test is less than 0.01 it means to accept an alternative hypothesis,
and we think the presumption is not done "normal" when analyzed empirical
frequency distribution for the variable wages.
REGRESSION AND CORRELATION ANALISYS
182
IV. RECRRESSIUN AND CURRELATIUN ANALISYS
Aim
Correlation and regression analysis has a different purpose than the previous techniques
we have looked at. The goal of correlation and regression analysis is to determine and
quantify the relationship between two or more than two variables. One variable has two
or more scores (the data must be interval for the technique we will look at) coming from
the same individual. Over many cases we wish to know whether there is a relationship
between the variables. Correlation and regression are methods of describing the nature
and degree of relationship between two or more variables. For example:
Hours spent studying and grade point average
Family income and child's I.Q.
College G.P.A and adult income
Amount of time watching T.V. and fear of crime.
In each case, for each person or case, the individual is measured on the two variables and
we wish to determine if the two variables are related.
There are there most important concepts in correlation and regression analysis:
The scatter plot displays the form, direction, and strength of the relationship
between two quantitative variables. Straight-line (linear) relationships are
particularly important because a straight line is a simple pattern that is quite
common.
The correlation measures the direction and strength of the linear relationship.
The least-squares regression line is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Basic aspects
In correlation and regression analysis, basic aspects are:
a) The direction of the relationship
Positive high scores on one variable go with high scores on the other variable.
Negative high scores on one variable go with low scores on the other variable
and vice versa.
b) The form of the relationship
Linear versus nonlinear relationships
c) The degree of the relationship
In a positive relationship are high scores always associated with other high scores
and low scores with other low scores or just sometimes?
183
Scatter plot
A scatter plot is a type of graph using Cartesian coordinates to display values for two
variables for a set of data. The data is displayed as a collection of points, each having the
value of one variable (independent variable x) determining the position on the horizontal
axis and the value of the other variable determining the position on the vertical axis
(dependent variable y). A scatter plot is also called a scatter chart, scatter diagram and
scatter graph.
Example 1.
Here is a table showing the results of two examinations set to 10 students. They took a
maths exam and an Statistics exam and record the scores that they get in both:
John Betty Sarah Peter Fiona Charlie Tim Gerry Martine Rachel
Maths
score 72 65 80 36 50 21 79 64 44 55
Statistics
score 78 70 81 31 55 29 74 64 47 53
We want to create scatter graph for this variables.
Solution:
We draw two axes. The horizontal axis will represent the score on the Maths exam. The
vertical axis will represent the score on the Statistics exam. For each student, we then
mark a small dot at the co-ordinates representing their two scores.
In Excel we choose chart:
184
185
0
10
20
30
40
50
60
70
80
90
0 10 20 30 40 50 60 70 80 90
Maths score
S
t
a
t
i
s
t
i
c
s

s
c
o
r
e
We can see that the points follow a fairly strong pattern. Students who are good at Maths
tend to be good at Statistics as well. The marks lie fairly close to an imaginary straight
line that we can draw on the graph. In the diagram below, we have drawn in this straight
line: we will make right click with mouse on marks and we will get next options.
We choose Add Trend line:
186
We choose linear model, what is obvious from graph:
0
10
20
30
40
50
60
70
80
90
0 10 20 30 40 50 60 70 80 90
Maths score
S
t
a
t
i
s
t
i
c
s

s
c
o
r
e
The fact that the points lie close to the straight line is called a strong linear correlation.
The fact that this line points upwards to right - indicating that the Statistics mark tends to
increase as the Maths mark increases - is called a positive correlation.
187
On next graph we can see different forms of scatter plots
6
:
x
y
a x
y
b x
y
c
x
y
d x
y
e x
y
f
In cases a and b we have linear relationships. In case a direction of relationship is positive
(direct, when a case is high on one variable it is high on the other variable), but in case b
relationship is negative (indirect, when one variable is high the other is low). In case c
there is no relationship between the variables, a case can be high on one variable and
either high or low on the other. In cases d, e and f there are nonlinear relationships.
Line of Best Fit {Regression Line]
The straight line that we draw through the points is called either the line of best fit or the
regression line. It describes the relationship between the two variables (the quantities
compared) mathematically. There is a standard way to draw this line to ensure that it fits
as closely to the data points as possible. Later on, we will investigate exactly what that
mathematical way is. For now, we only have to remember one thing:
The regression line goes through the point whose co-ordinates are the mean values
of the variables.
The arithmetic means are found by adding the relevant scores, and dividing by 10. We
work out:
mean Maths scores = (72 + 65 + 80 + 36 + 50 + 21 + 79 + 64 + 44 + 55) / 10 =
56.6
6
Somun-Kapetanovi R., Statistika u ekonomiji i menadmentu, Ekonomski fakultet u Sarajevu, Sarajevo
2006., page 112
188
mean of the Statistics scores = (78 + 70 + 81 + 31 + 55 + 29 + 74 + 64 + 47 + 53)
/ 10 = 58.2
and we can be sure that the line must go through the point (56.6, 58.2). We notice that
there are roughly the same number of data point lying above this line as there are below it
on scaterr plot for example 1.
We can use the regression line to make predictions. For instance, what Statistics mark
would we expect someone to receive if they received a Maths mark of 30? If we look at
the straight line, we can see that when the Maths mark is 30, the Statistics mark is
approximately 28. Similarly, we can assume that anyone who got an Statistics mark of 40,
would also get a Maths mark of about 40. However, there are limits on the predictions
that we can make, as you will see later on.
Tbe Correlation Coefficient
We can see by looking at the graph whether there is a strong or weak linear correlation
between two variables, and whether that correlation is positive or negative. However,
there is a mathematical way of working it out, and that is to calculate the correlation
coefficient. This is also known as Pearson's Correlation Coefficient, represented by the
letter r, and it is a single number which ranges from -1 (perfect strong negative
correlation) to +1 (perfect strong positive correlation). Correlation coefficients which are
close to -1 or +1 indicate a strong correlation. Values close to 0 indicate a weak
correlation, with 0 itself indicating no correlation at all. The stronger the correlation
means the better the prediction and the smaller the errors of prediction.
Here is how we calculate the linear correlation coefficient between two variables:
, ,
2 2
2 2
2 2
( )( )
( , )
( ) ( )
i i
X Y
i i
x x y y
Cov X Y
r
x x y y
n x y x y
r
n x x n y y
o o

= =

=
1 1

( (
] ]
_
_ _
_ _ _
_ _ _ _
where:

1
( ) ( )
xy i i
xy
C x x y y x y
n n
= =
_
_
is covariance between x (like independent
variable) and y (like dependent variable). Covariance simultaneously monitor variability
of both variables
2
2 2
1
( )
x i
x
x x x
N N
o = =
_
_
is standard deviation for variable x
2
2 2
1
( )
y i
y
y y y
N N
o = =
_
_
is standard deviation for variable y
189
x - mean for variable x
y - mean for variable y
n is number of objects.
Example 1. cont.
We want to calculate correlation coefficient between Maths score and Statistics score.
Solution:
In Excel statistical functions we will chose function CORREL:
190
Correlation coefficient is close to 1 and indicates a strong positive correlation, as we
supposed according to scatter plot. Well, there is strong direct relationship between
scores on Math and Statistics.
Tbe Coefficient of Determination
Another figure that is useful is the coefficient of determination. This is written as r
2
and
is found by squaring the correlation coefficient. Because the correlation coefficient must
be in the range -1 to +1, and square numbers must be positive, the coefficient of
determination must be in the range 0 to +1.
The correlation coefficient indicates whether there is a relationship between the two
variables, and whether the relationship is a positive or a negative number.
The coefficient of determination tells you what proportion of the variation between the
data points is explained or accounted for by the best fit line fitted to the points. It
indicates how close the points are to the line.
Interpretation of tbe size of a correlation
Several authors have offered guidelines for the interpretation of a correlation coefficient,
as we can see in next table:
191
Correlation Negative Positive
Small 0.3 to 0.1 0.1 to 0.3
Medium 0.5 to 0.3 0.3 to 0.5
Large 1.0 to 0.5 0.5 to 1.0
Cohen (1988) has observed, however, that all such criteria are in some ways arbitrary and
should not be observed too strictly. This is because the interpretation of a correlation
coefficient depends on the context and purposes. A correlation of 0.9 may be very low if
one is verifying a physical law using high-quality instruments, but may be regarded as
very high in the social sciences where there may be a greater contribution from
complicating factors.
Along this vein, it is important to remember that "large" and "small" should not be taken
as synonyms for "good" and "bad" in terms of determining that a correlation is of a
certain size.
192
Tbe standard error of estimate and tbe correlation
coefficient
1. Decomposition of an observed score if y is dependent variable:
,
( )
i i i i
y y y y y y = + +
2. Partitioning the variance in scores
a) More useful may be looking at it in terms of variability, breaking the total variability
of the score (its deviation from the mean) into two portions:
,
( ) ( )
i i i i
y y y y y y = +
( )
i
y y - The deviation of the score from the mean.
,
i
y y - The deviation of the predicted score from the mean this is the portion
of the score that reflects the relationship with the x variable.
( )
i i
y y - The deviation of the observed score from the predicted score, this is
error, or the part of the score that is not related to the x variable.
b) If we square these deviations and sum them we have sums of squares, these sums of
squares are additive
:
,
2
2 2
( ) ( )
i i i i
y y y y y y = +
_ _ _
2
( )
i
y y
_
is the total sum of squares for the dependent variable SS
y
(total
variability)
,
2
i
y y
_
is the sum of squares due to prediction or regression (SS
regression
) this
is the part of the y variable that the x variable did predict (explained variability).
The stronger the correlation the larger this term will be:
193
- If r = 0 then ,
2
0
i
y y =
_
- If r = 1 then , ,
2 2
i i
y y y y =
_ _
2
( )
i i
y y
_
is the sum of squares for the residual or the errors of prediction, the
part of SS
y
that the x variable did not predict (SS
errors in prediction
or residual SS
regression
or unexplained variability). The stronger the correlation, the smaller this term
will be:
- If r = 0 then ,
2
2
( )
i i i
y y y y =
_ _
- If r = 1 then
2
( ) 0
i i
y y =
_
3.
,
,
2
2
2
1
errors in prediction regression i
y y
i
SS SS y y
r
SS SS
y y
= = =
_
_
is the coefficient of determination and it can be seen that it represents the fraction of the
total variation in the y scores that can be predicted from the x scores.
a. Than, we can calculate standard error of estimate like:
,
2
2
(1 )
standard error of estimate
2 2
y i i
error
r SS y y
SS
df n n

= = =

_
Calculating tbe Equation of tbe Regression Line for two
variables
The regression line is defined by two numbers - the gradient and the intercept on the
vertical axis of the line that best fits those points. We always refer to the gradient of the
line as b and the intercept as a, which gives the equation of the regression line as:
i i
y a b x = +
The Least-Squares Method (LSM) determines the values of a and b that minimizes the
sum of squares for the residual or the errors of prediction:
,
2
2
( ) minimum
i i i i
y y y a b x = + 1
]
_ _
.
According to this LSM method, here are formulas for calculation of the gradient and the
intercept and general roles for their interpretation:
x b y a = - indicates which is the value of y when x is 0.
,
2 2
2
xy
X
Cov n x y x y
b
n x x
o

= =
_ _ _
_ _
- indicates how much the y values change as x
changes for one unit.
194
Example 1. cont.
We want to create regression model for relationship between Maths score and Statistics
score, in sense that Statistics score depends on Maths score.
Solution:
I way for solution:
In Excel function we will find functions INTERCEPT and SLOPE:
195
Regression model: 5.089 0.938

i i
y x = +
Interpretation:
Statistics score will rise for 0.938 if Math score rise for 1.
Student who have 0 score from Math will have 5.089 score from Statistics.
II way for solution:
Excel Data Analysis Regresion:
196
SUMMARY OUTPUT
Regression Statistics
Multiple R 0,971121335
R Square 0,943076647
Adjusted R Square0,935961228
Standard Error 4,68868839
Observations 10
197
ANOVA
df SS MS F Significance F
Regression 1 2913,729609 2913,729609 132,5399 2,94E-06
Residual 8 175,8703905 21,98379882
Total 9 3089,6
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 5,083182203 4,846187507 1,048903328 0,324874 -6,09215 16,25851
X Variable 1 0,938459678 0,081515907 11,5125957 2,94E-06 0,750484 1,126436
RESIDUAL OUTPUT
Observation Predicted Y Residuals Standard Residuals
172,65227905 5,347720953 1,209744422
2 66,0830613 3,916938701 0,886077412
380,15995647 0,840043526 0,190031974
438,86773063 -7,867730625 -1,779812993
552,00616612 2,993833877 0,677255576
624,79083545 4,209164551 0,952183814
7 79,2214968 -5,221496796 -1,181190394
865,14460162 -1,14460162 -0,258928137
946,37540805 0,624591948 0,141293203
1056,69846451 -3,698464515 -0,836654877
X Variable 1 Residual Plot
-10
-5
0
5
10
0 20 40 60 80 100
X Variable 1
R
e
s
i
d
u
a
l
s
Prediction or forecasting
This model, which is determined by LSM method, is used for forecasting values of
dependent variable y for different given values of independent variable x. Predictions in
regression analysis can be:
198
Interpolation - values of independent variable x are within original range from
smallest to largest x used in developing the regression model. This is relatively
reliable prediction.
Extrapolation - values of independent variable x arent within original range from
smallest to largest x used in developing the regression model. This prediction can be
subject to unknown effects that we dont expect, so in case of extrapolation,
reliability is questionable.
Example 1. cont.
If student have Math score 75, what is expected score for Statistics?
Solution:
We will make interpolation:
75 5.089 0.938 5.089 0.938 75 75.214
i i i
x y x = = + = + =
According to previous regression model, we will expect that student who have Math
score 75 get 75.214 score on Statistic.
Spearman's rank correlation coefficient
Spearmans correlation coefficient () used with ranked data, can be calculated like:
2
3
6
1
d
n n
_
where d is difference in ranking for x and y:
x y
d r r = .
The only difference between it and the standard r is that the data used are ranks.
Example 2.
Two art historians were asked to rank six paintings from 1 (best) to 6 (worst). Their
rankings are shown like table:
Painting Historian 1 Historian 2
A 6 5
B 5 6
C 1 2
D 3 1
E 4 3
F 2 4
Calculate Spearmans rank correlation coefficient. Explain.
199
Solution:
We have ranks for two variables and we will calculate difference in ranking for x and y:
x y
d r r = .
Painting Historian 1 -
x
r
Historian 2 -
y
r
d
2
d
A 6 5 1 1
B 5 6 -1 1
C 1 2 -1 1
D 3 1 2 4
E 4 3 1 1
F 2 4 -2 4
suma 12
Spearmans rank correlation coefficient is:
2
3 3
6
6 12
1 1 0.66
6 6
d
n n

= = =

_
That suggests relatively strong direct agreement (66%) between opinion of this two art
historians.
Or by SPSS program:
200
For correlation option we will choose bivariate and then we will define variables:
201
Correlations
K1 K2
Correlation Coefficient 1,000 ,657
Sig. (2-tailed) . ,156
K1
N 6 6
Correlation Coefficient ,657 1,000
Sig. (2-tailed) ,156 .
Spearmans rho
K2
N 6 6
Same conclusion.
Statistical testing {t test, ANUVA]
It is possible to implement test significance of parameters in the model of simple
regressions:
1.
0 1
: 0/ : 0 H b H b = =
2. standard error for parameter b
202
,
2 2
2
where is:
2
b
i
i i
x N x
y y
N
o
o
o
=

_
_
3.
e
b
b
t
o
=
4.
1,
2
1, 1,1
2 2
1,1
2
;
N k
t t
N k N k
N k
t
t t t t
t
o
o o
o

1
= e
(
]
Where is k=1 number of independent variables in simple regression model
5.
0 e t
t t H e , parameter b is not significant, it is the independent variable
that follows the model was not significant.
1 e t
t t H e , parameter b is significant.
Concept of p values, which is simpler, concludes that:
If the p value with a parameter, which were significant we tested, less than 0.05
might mistake the first kind of 5% say that is a parameter that is the variable that
it monitors significant in the model.
If the p value with a parameter, which were significant we tested, higher of 0.05 is
with the first type of error of 5% say that this parameter is variable that
accompanies it is not significant in the model, it is such independent variable be
excluded from the model.
Example 2, cont.
We will analyze some Excel output for regression analysis in example 1:
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 5,083182203 4,846187507 1,048903328 0,324874 -6,09215 16,25851
X Variable 1 0,938459678 0,081515907 11,5125957 2,94E-06 0,750484 1,126436
Uverview example for simple regression model witb SPSS
Example 3.
To examine relationship between the store size (i.e. square footage) and its annual sales, a
sample of 14 stores was selected. The results for these 14 stores are summarized in next
table:
Store Square feet
(000)
Annual sales (in
millions of $)
1 1.7 3.7
203
2 1.6 3.9
3 2.8 6.7
4 5.6 9.5
5 1.3 3.4
6 2.2 5.6
7 1.3 3.7
8 1.1 2.7
9 3.2 5.5
10 1.5 2.9
11 5.2 10.7
12 4.6 7.6
13 5.8 11.8
14 3.0 4.1
a) To examine relationship between the store size and its annual sales create scatter
plot. Comment.
b) Create regression model for this variables. Explain parameters.
c) Calculate and explain coefficient of correlation and coefficient of determination.
d) Comment model representatives.
e) If store size is 4200 square feet, what level of annual sales for that store we could
expect?
Solution:
a) Scatter plot:
1. independent variable is store size,
2. dependent variable is annual sale
We use graph option in SPSS:
204
We will find variables:
205
206
According to this scatter plot, we suppose that there is direct linear relationship.
b) Linear model:
i i
y a b x = +
207
208
Model Summary
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate
1 ,951
a
,904 ,896 ,9664
a. Predictors: (Constant), size
ANOVA
b
Model Sum of Squares df Mean Square F Sig.
Regression 105,748 1 105,748 113,234 ,000
a
Residual 11,207 12 ,934
1
Total 116,954 13
a. Predictors: (Constant), size
b. Dependent Variable: sale
Coefficients
a
Model
Unstandardized Coefficients
Standardized
Coefficients
t Sig.
209
B Std. Error Beta
(Constant) ,964 ,526 1,833 ,092 1
size 1,670 ,157 ,951 10,641 ,000
a. Dependent Variable: sale
Regression model is: 0.964 1.67
i i
y x = +
b - indicates that annual sale increase for 1.67 million of dollars as story size increase
for 1000 square feet.
a - indicates that annual sale is 0.964 million of dollars when story size is 0 square
feet (this interpretation is not logic).
c) Correlation coefficient is 0.95. This indicates strong (but not perfect) positive
correlation.
Coefficient of determination is:
2
0.904 r = Use of regression model has reduced
variability in predicting annual sales by 90.4%. Only 9.6% of the sample variability in
annual sales is due to factors other than what is accounted for by linear regression model
that uses only square footage.
d) We can analyse quality of regression simple models, in addition coefficients of
determination and correlations, and monitor the t test for parameters with an independent
variable. Empirical value of t is forthcoming 10.64 and p value of the t test is 0000, which
means that the independent variable in the model that accompanies this parameter is
significant.
e) 4.2
i
x = is within original range from smallest to largest x used in developing the
regression model, so we made interpolation.
4.2 0.964 1.67 0.964 1.67 4.2 7.978
i i i
x y x = = + = + =
The predicted average annual sale of a store with 4200 square feet is $7,978,000.
MULTIPLE RECRESSIUN MUDEL
The general multiple regression model
The general multiple regression model with K independent variables is:
e X X X f Y
K
+ = ) ,...., , (
2 1
210
Dependent variable Y is expressed as a function of K independent random variables and e.
If the member is a functional part of the model defined linear function model can define
the standard model of multiple linear regressions the following equation:
e X b X b X b a Y
K K
+ + + + + = ...
2 2 1 1
Coefficients in the regression model have the following meaning:
Parameter a is free, constant member who represents the expected value of dependent
variable Y when the value of K independent variables (X
1
, X
2
,...,X
K
) equal to zero. The
value of this parameter does not always logical explanation.
Parameter b
i
(i=1,2,....,K) or the regression coefficient by the independent variable
indicates the average change in dependent variable Y conditional unit increase in
independent variables X
i
, provided that the other independent variables remain
unchanged. Positive value of parameter that indicates the proportional relationship
between variables Y and X
i
. This means that the growth of independent variables X
i
conditional growth dependent variable Y. A negative value means that the coefficient
inversely proportional relationship between dependent variable Y and independent
variable X
i
. In this case the direction of changes of independent and dependent
variables is the opposite, that growth is causing the decline X
i
dependent variable Y, a
decline X
i
causes growth dependent variable Y.
Values of model parameters, multiple regressions evaluated using the method of least
squares.
Measures for quality of multiple regression model
A. Model error
,
2
i
ie ie
x
e
x x
N
o

=
_
- model error is unexplained variability.
B. Coefficient variation for model
i
x
i
V
X
o
=
C. Coefficient of multiple determination (relationship for explained and total variability)
has defined the following expression:
1 0 ,
) (
) (
2
2
2
2
,.., 2 , 1 ;
,.., 2 , 1 ;
s s
_
_
=
K Y
R
y y
y y
R
i
i
K Y
Coefficient of multiple determinations explains how the changes in variability of
dependent characteristics are explained by the changes of variability for K independent
features included in the regression model.
211
D.Coefficient of multiple linear correlations expresses strength relationship between
variability dependent variables, and summary variability for K independent variables.
Determined as the square root of the coefficient multiple determinations:
1 0 ,
) (
) (
,.., 2 , 1 ;
2
2
,.., 2 , 1 ;
s s
_
_
=
K Y
i
i
K Y
R
y y
y y
R
Or by expression:
y y
i i
K Y
n
y y y y
R
o o
) )( (
,.., 2 , 1 ;
_
=
Coefficient is not the sign of the association, because relations between the dependent
and independent variables can be multidirectional.
E. Partial correlation coefficient shows the strength and direction of the connection
dependent variable Y and j-independent variables with the same impact of the remaining
(K-1) variables which represent the c. The value of this coefficient is moving within
limits: 1 1
, ;
s s
c j y
r .
For example, partial correlation coefficients of the first order for K = 2 is defined using a
simple coefficient of linear correlation in the following manner:
) 1 )( 1 (
;
) 1 )( 1 (
2
2 , 1
2
2 ;
2 , 1 1 ; 2 ;
1 , 2 ;
2
2 , 1
2
1 ;
2 , 1 2 ; 1 ;
2 , 1 ;
r r
r r r
r
r r
r r r
r
y
y y
y
y
y y
y

=

=
Interpretation of partial correlation coefficients: explaining the strength and status the
independent and dependent variables (their variability), if you switch off the influence of
others (K-1) independent variables.
F. Adjusted determination coefficient
,
2 2
. ...
1
1 1
1
i
n
adjusted R R
n k
1
=
(

]
Adjustment is done with number of predictors and size of sample and with small samples
taken into consideration this coefficient.
Statistical test {t test, ANUVA]
a. Testing for parameter significance b
ij.12...m
7
for multiple regression model
1.
0 .12... 1 .12...
: 0 / : 0
ij m ij m
H b H b = =
7
Behind point are not only i and j.
212
2. Standard error evaluation parameter b is
.12...
ij m
b
o
and determined on the
basis of
,
2
i i
y y
N M
o

=
_
3.
.12...
.12...
ij m
ij m
e
b
b
t
o
=
4.
1,
2
1, 1,1
2 2
1,1
2
;
N k
t t
N k N k
N k
t
t t t t
t
o
o o
o

1
= e
(
]
where k = M-1 - the number of independent variables in multiple regression
model
5.
0 e t
t t H e
, parameter b is not significant, it is the variable that follows
the model was not significant
1 e t
t t H e
, parameter b is significant, it is the variable that follows the
model was significant
b. Analysis of variance in the regression model - F test for regression model
This analysis tested whether there is a significant link between a number of independent
variables included in the model and the dependent variable.
The methodology of conducting F test is as follows:
1.
0 1. ... 2. ... . ... 1 . ...
: ... 0/ : least one parameter 0
i i ik ij
H b b b H b = = = = =
2.
2
/
2
1
y x
e
y
k
F
n k
o
o
=

3. for given o,
; 1 t k n k
F F

=
where k is number of independent variables in the regression model
4.
0
1
e t
e t
F F H
F F H
s
>
If you accept an alternative hypothesis can be considered that at least one of the
independent (explanatory) variables involved in the model important for the movement of
dependent variables.
Example 4.
Sample of 34 shops in the chain store was selected for a marketing test. Dependent
variable is the volume of sales, while the independent variables are price and cost of the
promotion:
213
Sale (units) Price (KM) Promotion
cost (00
KM)
Sale (units) Price (KM) Promotion
cost (00
KM)
4141 59 200 2730 79 400
3842 59 200 2618 79 400
3056 59 200 4421 79 400
3519 59 200 4113 79 600
4226 59 400 3746 79 600
4630 59 400 3532 79 600
3507 59 400 3825 79 600
3754 59 400 1096 99 200
5000 59 600 761 99 200
5120 59 600 2088 99 200
4011 59 600 820 99 200
5015 59 600 2114 99 400
1916 79 200 1882 99 400
675 79 200 2159 99 400
3636 79 200 1602 99 400
3224 79 200 3354 99 600
2295 79 400 2927 99 600
Create an appropriate regression model and analyze the results.
Solution:
It is a model of multiple regressions with two independent variables. Using Excel (Data
analysis - Regression)
8
obtained the regression model. The result looks like this:
SUMMARY OUTPUT
Multiple R 0,870475
R Square 0,757726
Adjusted R
Square 0,742095
Standard
Error 638,0653
Observations 34
ANOVA df SS MS F
Significance
F
Regression 2 39472731 19736365 48,47713 2,86E-10
Residual 31 12620947 407127,3
Total 33 52093677
Coefficients
Standard
Error t Stat P-value Lower 95%
Upper
95%
Intercept 5837,521 628,1502 9,293192 1,79E-10 4556,4 7118,642
8
The database column with the dependent variable must be either the first or last, because the independent
variables must be given as a "block" variables
214
Price -53,2173 6,852221 -7,76644 9,2E-09 -67,1925 -39,2421
Promotion
cost 3,613058 0,685222 5,272828 9,82E-06 2,215538 5,010578
Excel output obtained we interpret in the following manner:
Correlation coefficient (multiple R) 0,87
Determination coefficient (R square) 0,757
Adjusted determination coefficient (adjusted R square) 0,742
Model error (Standard Error) = 638,06 =
residual unexplained by model
MS MS =
Previous coefficients indicate a model with 87% strength explains a dependent
variable volume of sales. So, the model is good.
Then they give the results of ANOVA (analysis variances) gained as a test model:
o In the first column, the information on the appropriate number of
degrees of freedom:
-
explained
unexplained
1
1
regresion
residual
total
df df k
df df n k
df n
= =
= =
=
o In the second column are the results of the sum of squares deviation.
-
2
explained
( )
regresion i
SS SS y y = = =
_
39,472,731
-
2
unexplained
( )
residual i i
SS SS y y = = =
_
12,620,947
-
2
( )
total i
SS y y = =
_
52,093,677
o In the third column are the results of the MS (the sum of squares of
deviation / number of degrees of freedom)
-
explained
unexplained
total
( )
1
1
number of independent variables in model
- numb
regresion
regresion
regresion
residual
residual
residual
total
total
total
SS
MS aproppriate
df
SS SS
MS
df k
SS
SS
MS
df n k
SS SS
MS
df n
k
n
=
= =
= =

= =
er of observation (objects)
o In the fourth column is the empirical value of F test, and in the fifth
column of the appropriate p-value (F significance).
Like it is
1
48.48
=2.86E-10 < =0.05
e
F
p H o
=
, we consider a model significant (at

least one of the independent variables included in the model is significant is
to influence the dependent variable).
215
In the latter part of the table are the parameters of the model and the information
that they follow:
o
1 2 3
5837.52 53.2173 3.6131
i i i
x x x = +
- If the price increases by 1 KM, the volume of sales is reduced by
53.2173 KM, provided that the investment in the promotion does
not change.
- If the cost increases for the promotion of 100 KM, volume of sales
increases of 3.6131, with the condition that the price does not
change.
o In addition to the parameters or coefficients regression model gives:
A. standard error estimates of these parameters
B.
parameter
standard error for parameter
e
z =
9
for testing significance of each
parameter separately. How are all these theoretical values outside the
interval
0.025
0.975
1.96
1.96
t
z
z
z
=
=
=
, accept an alternative hypothesis, and we
think both explanatory significant variables in the model
C. p-value for testing significance of each parameter separately. How are
all these values of less than a specified level of errors of first kind of 5%,
accept an alternative hypothesis, and we think both explanatory
significant variables in the model.
D. lower and upper limits for interval evaluation of each parameter (*
standard error separately, obtained as: the parameter of the model
parameter estimates)
Indicator - dummy variables
In the previous considerations regression models we talked about the independent
variables in terms of quantitative variables. "Dummy", dichotomy, encrypted, or indicator
variable is based on qualitative variables derived or artificial numerical variable, which is
used in regression analysis to show subsets of the analyzed sample from the population.
In simplest case, the indicator variable used values 0 and 1:
0 for elements in the control groups or elements that are not in the target group
(do not have the desired characteristic) and
1 for elements in the experimental group (are in a specific treatment) or for
elements that are the target group (with the desired characteristic).
When designing the research, the indicator variable is often used to set boundaries
between differently treated groups.
Indicator variable is very useful because it gives the possibility to use a simple regression
9
If n is less than 30 we use t distribution with (n-k-1) degrees of freedom.
216
equation for the representation of different groups, which means that it is not necessary to
construct separate regression models for each group or subset.
Indicator variable is used to qualitative explanatory (independent) variables included in
the regression model. So, another advantage of the indicator variable is that despite the
fact that it is a nominal scale variable it is possible to treat as a variable interval scale. For
example, if the calculation of average for this variable, the result is interpreted as the
proportion of models in the distribution of 1.
Examples of indicator variables:
indicator variable for gender: 1 if male, 0 if not
indicator for marital status: 1 if married, 0 if not
indicator for employment: 1 if employed, 0 if not
indicator for categorization according to the mid urbanity: 1 if urban, 0 if not
indicator for citizenship: 1 if a citizen of the state date, 0 if not ....
Simple model with dummy variable
Simple regression model with a "dummy" variable is a model with only one independent
variable type of "dummy" variables, and read:
i i i
y a b d e = + +
Where is:

i
y - value of dependent variable (the result of the outcome) for and the i object
a coefficient of intercept
b slope coefficient
dichotomy variable:
1, if object is in experimental group
0, if object is in control group
i
i
d
i
=
i
e - residual (error) for i object
That illustrates the indicator variable, we will analyse further simple regression model
with a "dummy" variables. The first step is to set that looks regression equation
separately for both groups. For the control group 0
i
d = , for experimental group 1
i
d = .
When referred to introduce in the regression model assuming that the phrase residuals or
errors on average equal to 0, returns the following:
i i i
y a b d e = + +
For control group ( 0
i
d = ):
0 0
Ki
Ki
y a b
y a
= + +
=
For experimental group ( 1
i
d = ):
1 0
Ei
Ei
y a b
y a b
= + +
= +
217
We will calculate difference between the groups. This will be the difference between
regression models for the referent group.
( )

Ei Ki
Ei Ki
y y a b a b
y y b
= + =
=
.
Therefore, the difference between the groups shows the coefficient b.
Example indicator variables as the regression variables in the
simple model with a "dummy" variable
Let us take a concrete example of a simple regression model where the dependent
variable wages and independent indicator variable is an indicator for marital status (1 if
married, 0 if not).
798.44 178.61
i i i i i
y a b d e d e = + + = + +
What is interpretation in these case parameters in the model?
Parameter a mean that for people who are not married average wage equal to 798.44
KM.
Parameter b means that the salaries of persons who are married to 178.61 KM greater
than the salaries of persons who are not married.
Summary of parameter a and b means that for people who are married average wage
equal to 977.05 KM.
Example of multiple regression models with indicator variables
as a explanatory variable and a continuous variable as another
variable explanatory
Let us take a concrete example regression model where the dependent variable wages and
independent variables:
indicator variable is an indicator for completed faculty (1 if completed university,
0 if not).
continuous variable is the length of employment (in months)
1
1 1
275 162 6.3
i d i x i i i i i
y a b d b x e d x e = + + + = + + +
What is interpretation in these case parameters in the model?
Parameter a means that for people who have not completed university, and whose
work experience is equal to 0 (start to work) is equal 275 KM.
Parameter
d
b means that the salary the person who finished university for 162 KM
more than pay the person who has not completed university.
Parameter
1
x
b means that if all other factors in the model remain unchanged increase
of service for 1 month leads to increase wages for 6.3 KM.
218
Note: In the model it is possible to include more continuous and indicator variables.
Interpretations remain the same, noting that other factors remain unchanged (under the
control of the) we will interpret parameter obtained for the given variable.
Example 5.
For 15 houses are well-known information about: the sale value (000 KM), size (00 m2)
and possession of fire protection systems:
Sale value Size Possession of fire
protection systems
84.4 2.00 yes
77.4 1.71 no
75.7 1.45 no
85.9 1.76 yes
79.1 1.93 no
70.4 1.20 yes
75.8 1.55 yes
85.9 1.93 yes
78.5 1.59 yes
79.2 1.50 yes
86.7 1.90 yes
79.3 1.39 yes
74.5 1.54 no
83.8 1.89 yes
76.8 1.59 no
Construct model to predict the sales value of the house depending on its size and
information about the system of fire protection. Interpret the parameters obtained.
Solution:
As the possession of variable quality fire protection based on the need to create the
indicator variable:
1, if house have fire protection system
0, if house don't have fire protection system
i
d = :
Use the Excel IF function to create dummy variables:
219
Continue with the Copy-Paste were joined by other cells:
Sale value -
y
Size - x Possession of fire
protection systems
d
84.4 2 yes
1
77.4 1.71 no
0
75.7 1.45 no
0
85.9 1.76 yes
1
79.1 1.93 no
0
220
70.4 1.2 yes
1
75.8 1.55 yes
1
85.9 1.93 yes
1
78.5 1.59 yes
1
79.2 1.5 yes
1
86.7 1.9 yes
1
79.3 1.39 yes
1
74.5 1.54 no
0
83.8 1.89 yes
1
76.8 1.59 no
0
Appropriate regression model reads:
1
1
i x i d i i
y a b x b d e = + + +
Model thus designed is evaluated as multiple regression (EXCEL - Data analysis):
SUMMARY OUTPUT
Multiple R 0,900587
R Square 0,811057
Adjusted R
Square 0,779567
Standard
Error 2,262596
Observations 15
ANOVA df SS MS F
Significance
F
Regression 2 263,7039 131,852 25,75565 4,55E-05
Residual 12 61,43209 5,11934
Total 14 325,136
Coefficients
Standard
Error t Stat P-value Lower 95%
Upper
95%
Intercept 50,09049 4,351658 11,51067 7,68E-08 40,60904 59,57194
Size 16,18583 2,574442 6,287124 4,02E-05 10,57661 21,79506
Possession
of fire
protection
systems 3,852982 1,241223 3,104183 0,009119 1,148591 6,557374
Interpretations:
Correlation coefficient 0.9
Determination coefficient 0.81
Adjusted determination coefficient 0.7796
Model error 2.26
Previous coefficients indicate a model with 90% strength explains a dependent
variable sale value for house. So, the model is good.
221
Then they give the results of ANOVA (analysis variances) gained as a test model:.
o
2
( )
i
y y =
_
263.7
o
2
( )
i i
y y =
_
61.43
o
2
( )
i
y y =
_
325.13
o Like it is
1
25.75565321
=4.54968E-05 < =0.05
e
F
p H o
=
, we consider a model
significant (at least one of the independent variables included in the
model is significant is to influence the dependent variable).
In the latter part of the table are the parameters of the model and the information that
they follow:
1
50.09 16.186 3.853
i i i
y x d = + + . That means:
o For each 100 square meters sale value is higher for 16.186 KM, if
other variables stay same.
o House that possess of fire protection system has for 3.853 KM sale
value than house without fire protection system.
o In addition to the parameters or coefficients regression model gives:
standard error estimates of these parameters
e
t for testing parameter significance for each parameter separately.
First we have to find theoretical interval:
How are all these theoretical values (t Stat in table behind parameters) outside the
theoretical interval
12, 0.025
12, 0.975
2.178
2.178
t
t
t
t
=
=
=
, accept an alternative hypothesis, and we
think both explanatory significant variables in the model.
p-value for testing significance of each parameter separately. How are all these
values of less than a specified level of errors of first kind of 5%, accept an
alternative hypothesis, and we think both explanatory significant variables in the
model.
222
lower and upper limits for interval evaluation of each parameter (* standard
error separately, obtained as: the parameter of the model parameter estimates)
CUNDITIUNS FUR ECUNUMETRIC MUDELS
Regression model stated right line: , 1, 2,..., .
i i i
y a bx e i n = + + = has two parts. The first
part of the model (a+bx
i
) represents a functional relationship in which the Y is linearly
dependent of X, if the other factors constant. Second, stochastic models of the (e
i
),
represents the random variation, which takes into account the effect of changes in other
variables that are not explicitly included in the model.
Provided that the specification matches the model in relation to economic realities and
practices and to problems of measuring economic relations expressed as problems of
statistical evaluation of parameters of probability timetable must be met assumptions
about linear regression model. These assumptions are as follows:
a. E(e
i
) = 0, (expected value of errors is equal to zero)
b. E(e
i
2
)= o
2
, (constant common variance homoskedastic)
c. E(e
i
e
j
)= 0, for each i, j ; i=j; (independency, there is no autocorrelation with
stochastic part)
d. e
i
: N(0, o
2
), (normality) - This assumption points to the absence of extreme
data in the sample or the outlier values of Xt and Yt, which are very distant from the
values of other variables.
e. E(eiX
j
) = 0, for each i, j; ( independency from X
j
).
To evaluate the value of parameters regression model it is necessary to choose the
formula (assessor, estimator), which will come to their best estimates. Estimators should
have the following characteristics:
1. Impartiality
2. Consistency
3. Efficiency
4. The best linear impartiality.
Assumptions regression models through SPSS
MULTICOLLINEARITY
For first, we monitor correlation matrix. If the correlation coefficient between the
independent variables is higher then 0.7, there could be problem of multicollinearity.
VIF (Variance Inflation Factor)
223
2
2
1 1
1
where is determination coefficient in multiple regression model
VIF
Tolerance R
R
= =
If VIF>10 and Tolerance<0,1 assumption of noncollinearity is not done.

Eigen value (the total amount of variance of independent variables which can be
explained, and include by one dimension). If it is greater than 1 indicates that
assumption of noncollinearity is not done.
Condition index (CI) square roots quotient successive Eigen values:
o , 5 10 CI e and more than two proportions variances for the independent
variables are greater than 0.5 weak dependence between the
independent variables.
o , 10 30 CI e and more than two proportions variances for the
independent variables are greater than 0.5 medium dependence
between the independent variables.
o , 30 CI e and more than two proportions variances for the independent
variables are greater than 0.5 strong dependence between the independent
variables assumption of noncollinearity is not done.
How to solve the problem mullticolinearity?
To combine related independent variables into one (the average z score of
independent variables, factorial analysis ...)
To eliminate some of the independent variables for which the interdependence is
characteristic.
To collect more data on the analyzed variables in order to verify the new
multicollinearity.
OUTLIERS
Outlier exist where standardized residuals have values 3.5
ri
z > . There are several ways
to detect outliers through appropriate tests:
Distance - analysis residuals. It is important that no more than 5% standardized
residuals with a value of 2.5
ri
z >
Calculates the Laverage value (as a new variable). The problem of outlier should
review instances where the value is greater than 0.04
Calculates the Cook's D value (as a new variable). The problem of outlier should
review instances where the value is greater than (4/n). High Cook 'Y value
indicates the outliers.
Standardized Dfbeta indicates the change of coefficients regression if exclude
outlier. The problem of outlier should review instances where the absolute value
is greater than (2/ n ). High Dfbeta value indicates the outliers.
224
NORMALITY
After construction regression model provides a new variable - residuals. Kolmongorov-
Smirnov test checked whether the assumption of normality done residuals distribution.
AUTOCORRELATION
Durbin-Watson test indicates autocorrelation. DW value equal 2 indicates that there is no
autocorrelation. If the Durbin-Watson statistics is significantly smaller than 2, there is
evidence of positive serial correlation. As a rough rule that applies if the Durbin-Watson
statistic is less than 1, it is cause for alarm because autocorrelation. In case that the DW
statistics in the interval 2-4, which indicates no negative serial correlation.
According to the position of empirical values of DW in the interval between 0 and 4, we
can conclude the following:
2 2
1
1
1 2 2 1
1. 4
2. 0
3. 4 4
4. or 4 4
0 2 there is no autocorrelation
0< <2 positive autocorrelationthat is higher if dwis more lower than 2
1 0 perfect positive autocorrelation
1 4
d dw d
dw d
d dw
d dw d d dw d
dw
dw
dw
dw
< <
< <
< <
< < < <
= =
= =
= = perfect negative autocorrelation
2< <4 negative autokorelacija autocorrelationthat is higher if dwis more higher dw
HETEROSKEDASTICITY
Test Goldfeld-Quandt aims that compare the sum of residual squares deviation after
division of the sample into two samples. For models in the time cross-section groups
together the information to the growing or raising values independent variables that can
be a source of heteroskedasticity (this is not necessary for models with time-series).
We will create two regressions for two samples and using the F test compared the
residual deviations. Hypothesis H
0
is accepted if there are no significant differences
between the sums of residual squares deviation.
It needs to be grouped according to given independent variable that can be a source of
heteroskedasticity. Share a number of observations in two samples, for both sample rate
regressions and calculate residuals. We will test whether the residual variances from
225
different samples are same or not - Leven test (within the test of arithmetic means). If
residual variances from different samples are not equal, it is a problem heteroskedasticity.
This problem can be try to solve by the weighted regression with the factor same inverse
square root of the variable that is the source heteroskedasticity.
ECONOMETRIC CONDITIONS FOR REGRESION MODELS WITH
SPSS EXAMPLES
Example 1.
SIMPLE LINEAR REGRESSION
We have data in Excel sheet. We will open SPSS document:
SPSS provides a blank sheet. Importing data, or we approach the transfer of data from
excel Sheet:
226
Further to the type of file selected Excel and we give document that convert from Excel
to SPSS:
Choosing Open. If the Excel document, we have more sheet provides us the option to
choose which we want to covert:
227
Choose the sheet that we want and OK. We got a sheet with SPSS data:
We can adjust the characteristics of variables so that we will with Data view exceed the
Variable view (options are in the bottom of the window):
228
We have chosen the type of numerical variables. We have a dependent and one
independent variable, and then it is a simple regression. First we create a diagram scatter
plot:
229
Choosing Simple Scatter and Define. It provides a window in which the problem
variables:
230
In Titles can define a chart title, with the Options control with missing information (there
is no them and that part we did not use). As output we obtain a diagram:
231
Now we create a model simple linear regression, as this diagram scaterrplot refers to a
different form of connection:
Returns a window in which the problem variables:
232
The Statistics allocates ancillary parameters regression models:
We choose Continue to return to the window with the regression.
The plots allocates Normal probability plot:
233
We choose Continue to return to the window with the regression.
In the end we choose OK. Returns Output:
Regression
Mean Std. Deviation N
Yt 24,33 4,579 12
Xt 61,75 4,993 12
Correlations
Yt Xt
Yt 1,000 ,624 Pearson Correlation
Xt ,624 1,000
Yt . ,015 Sig. (1-tailed)
Xt ,015 .
Yt 12 12 N
Xt 12 12
Variables Entered/Removed
b
234
Model Variables Entered
Variables
Removed Method
1 Xt
a
. Enter
a. All requested variables entered.
b. Dependent Variable: Yt
Model Summary
b
Model R R Square Adjusted R Square
Std. Error of the
Estimate Durbin-Watson
1 ,624
a
,390 ,329 3,752 ,836
a. Predictors: (Constant), Xt
ANOVA
b
Regression 89,878 1 89,878 6,384 ,030
a
Residual 140,789 10 14,079
1
Total 230,667 11
a. Predictors: (Constant), Xt
235
Coefficients
a
Standardized
Coefficients 95,0% Confidence Interval for B Correlations Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Lower Bound Upper Bound Zero-order Partial Part Tolerance VIF
(Constant) -11,017 14,033 -,785 ,451 -42,284 20,250 1
Xt ,572 ,227 ,624 2,527 ,030 ,068 1,077 ,624 ,624 ,624 1,000 1,000
a. Dependent Variable: Yt
Coefficient Correlations
a
Model Xt
Correlations Xt 1,000 1
Covariances Xt ,051
Residuals Statistics
a
Minimum Maximum Mean Std. Deviation N
Predicted Value 19,32 29,06 24,33 2,858 12
Residual -5,766 6,806 ,000 3,578 12
Std. Predicted Value -1,752 1,652 ,000 1,000 12
Std. Residual -1,537 1,814 ,000 ,953 12
236
We can test the assumption of normality of residuals taking residuals as variable with the
KS test:
237
Unstandardized
Residual
N 12
Mean ,0000000 Normal Parameters
a,,b
Absolute ,161
Positive ,161
Negative -,117
Kolmogorov-Smirnov Z ,557
238
Unstandardized
Residual
N 12
a,,b
Absolute ,161
Positive ,161
Negative -,117
P value of normality test is greater than 0.05, which means that it complies with the
assumption of normality.
239
Example 2.
MULTIPLE LINEAR REGRESSION
We have data in Excel sheet and transforming them into SPSS document:
We start with Regression and allocates variables:
240
Completing Statistics (include check for multicollinearity because in multiple regression),
plots and Save with the desired options. Output is:
Y 21,90 6,471 10
X1 12,10 4,748 10
X2 30,80 9,402 10
Correlations
Y X1 X2
Y 1,000 ,919 -,829
X1 ,919 1,000 -,691
Pearson Correlation
X2 -,829 -,691 1,000
Y . ,000 ,001
X1 ,000 . ,013
Sig. (1-tailed)
X2 ,001 ,013 .
Y 10 10 10
X1 10 10 10
N
X2 10 10 10
VariablesEntered/Removed
Model
Variables
Entered
Variables
Removed Method
1 X2, X1
a
. Enter
241
Model Summary
b
Change Statistics
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate
R Square
Change F Change df1 df2 Sig. F Change
Durbin-Watson
1 ,957
a
,917 ,893 2,120 ,917 38,420 2 7 ,000 2,156
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y
ANOVA
b
Regression 345,431 2 172,716 38,420 ,000
a
Residual 31,469 7 4,496
1
Total 376,900 9
242
Coefficients
a
Standardized
Model
B Std. Error Beta
t Sig.
(Constant) 18,872 5,290 3,568 ,009 6,363 31,381
X1 ,902 ,206 ,662 4,377 ,003 ,415 1,389 ,919 ,856 ,478 ,522 1,916
1
X2 -,256 ,104 -,372 -2,460 ,043 -,502 -,010 -,829 -,681 -,269 ,522 1,916
a. Dependent Variable: Y
a
Model X2 X1
X2 1,000 ,691 Correlations
X1 ,691 1,000
X2 ,011 ,015
1
Covariances
X1 ,015 ,042
243
Collinearity Diagnostics
a
Variance Proportions
Model
Dimensi
on Eigenvalue Condition Index
(Constant) X1 X2
1 2,822 1,000 ,00 ,01 ,00
2 ,168 4,098 ,00 ,21 ,11
1
3 ,010 16,496 1,00 ,78 ,89
VIF<10
Eigen value <1
Condition index <30
So, there is no multicollinearity problem..
a
Predicted Value 12,90 31,41 21,90 6,195 10
Standard Error of Predicted
Value
,671 1,641 1,110 ,361 10
Adjusted Predicted Value 13,72 33,43 22,14 6,416 10
Residual -1,945 4,251 ,000 1,870 10
Std. Residual -,918 2,005 ,000 ,882 10
Stud. Residual -1,053 2,251 -,044 1,022 10
Deleted Residual -3,430 5,357 -,244 2,567 10
Stud. Deleted Residual -1,062 3,964 ,126 1,483 10
Mahal. Distance ,001 4,489 1,800 1,716 10
Cooks Distance ,005 ,513 ,130 ,188 10
Centered Leverage Value ,000 ,499 ,200 ,191 10
The problem of outlier should review instances where the value is greater than 0.04.
244
245
We can test the assumption of normality of residuals taking residuals as variable with the
KS test:
Unstandardized Residual
N 10
a,,b
Absolute ,165
Positive ,165
Negative -,149
246
Unstandardized Residual
N 10
a,,b
Absolute ,165
Positive ,165
Negative -,149
Residuals distribution is normal.

.
247
Example 3.
MULTIPLE REGRESSION WITH DUMMY VARIABLE
Here is a regression with qualitative variables, with type of settlement (village / city /
suburban). We have data in Excel sheet and transforming them into SPSS document
How it comes to quality we have on the basis of the indicator, or create dummy variables:
1, if there is village
0, if there is no village
Si
d =
1, if there is city
0, if there is no city
Gi
d =
1, if there is suburban
0, if there is no suburban
Pi
d =
We use option Recode into different variables:
248
We give qualitative variable SGP and transformed into the first dummy variable for the
modality Village:
249
We will complete transformation with Old and new values:
Before the next choose changes Add to include the current change. Completing all three
changes:
250
We choose Continue, return to the start window and choosing OK. The result is a new
column with the indicator variable for the modality Village:
In the same way we create dummy variables for the ways of city and suburbs:
251
Now we can create a regression model. In order to avoid the problem appeared
multicollinearity we will take in regression two dummy variables (to city and suburbs),
while the interpretation based on the connection with the third dummy variable (village):
Cost 3678,64 1118,325 364
Revenue 39425,41 15002,371 364
dP ,35 ,479 364
dG ,38 ,487 364
Correlations
Cost Revenue dP dG
Cost 1,000 ,532 -,402 ,586
Revenue ,532 1,000 -,134 -,238
dP -,402 -,134 1,000 -,582
Pearson Correlation
dG ,586 -,238 -,582 1,000
252
Cost . ,000 ,000 ,000
Revenue ,000 . ,005 ,000
dP ,000 ,005 . ,000
Sig. (1-tailed)
dG ,000 ,000 ,000 .
Cost 364 364 364 364
Revenue 364 364 364 364
dP 364 364 364 364
N
dG 364 364 364 364
Model
Variables
Entered
Variables
Removed Method
1 dG, Revenue,
dP
a
. Enter
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,923
a
,852 ,851 432,254 1,750
a. Predictors: (Constant), dG, Revenue, dP
b. Dependent Variable: Cost
253
ANOVA
b
Regression 3,867E8 3 1,289E8 689,923 ,000
a
Residual 6,726E7 360 186843,445
1
Total 4,540E8 363
a. Predictors: (Constant), dG, Revenue, dP
b. Dependent Variable: Cost
Coefficients
a
Standardized
Model
B Std. Error Beta
t Sig.
(Constant) 409,740 93,392 4,387 ,000 226,078 593,401
Revenue ,058 ,002 ,778 34,962 ,000 ,055 ,061 ,532 ,879 ,709 ,831 1,203
dP 533,163 62,069 ,228 8,590 ,000 411,101 655,226 -,402 ,412 ,174 ,582 1,717
1
dG 2078,090 62,355 ,904 33,327 ,000 1955,464 2200,717 ,586 ,869 ,676 ,559 1,788
a. Dependent Variable: Cost
254
a
Model
Dimensi
(Constant) Revenue dP dG
1 2,683 1,000 ,01 ,01 ,02 ,02
2 1,000 1,638 ,00 ,00 ,19 ,17
3 ,280 3,098 ,00 ,14 ,41 ,38
1
4 ,037 8,532 ,99 ,85 ,38 ,43
On the basis of collinearity diagnostic, concludes that there is no problem multi
collinearity.
a
Predicted Value 1681,29 7068,75 3678,64 1032,159 364
Value
36,703 90,065 44,447 8,825 364
Residual -728,342 768,672 ,000 430,464 364
Std. Residual -1,685 1,778 ,000 ,996 364
Stud. Residual -1,696 1,787 ,000 1,002 364
Deleted Residual -737,861 778,844 ,010 435,365 364
Mahal. Distance 1,620 14,762 2,992 1,907 364
Cooks Distance ,000 ,023 ,003 ,003 364
255
256
Example 4.
MULTIPLE REGRESSION (HETEROSKEDASTICITY - TWO SAMPLES)
We have data in Excel sheet and transforming them into SPSS document:
We start with Regression and allocate variables:
257
Completing Statistics, plots and Save with the desired options. Output is:
Y 23,00 27,617 32
X1 25,09 30,003 32
X2 63,53 60,156 32
Correlations
Y X1 X2
Y 1,000 ,901 ,815
X1 ,901 1,000 ,769
Pearson Correlation
X2 ,815 ,769 1,000
Y . ,000 ,000
X1 ,000 . ,000
Sig. (1-tailed)
X2 ,000 ,000 .
Y 32 32 32 N
X1 32 32 32
258
Correlations
Y X1 X2
Y 1,000 ,901 ,815
X1 ,901 1,000 ,769
Pearson Correlation
X2 ,815 ,769 1,000
Y . ,000 ,000
X1 ,000 . ,000
Sig. (1-tailed)
X2 ,000 ,000 .
Y 32 32 32
X1 32 32 32
X2 32 32 32
Model
Variables
Entered
Variables
Removed Method
1 X2, X1
a
. Enter
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,921
a
,848 ,837 11,142 1,396
ANOVA
b
Regression 20043,661 2 10021,831 80,724 ,000
a
Residual 3600,339 29 124,150
1
Total 23644,000 31
259
Coefficients
a
Standardized
Coefficients 95,0% Confidence Interval for B Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Lower Bound Upper Bound Tolerance VIF
(Constant) -1,215 2,890 -,421 ,677 -7,126 4,695
X1 ,618 ,104 ,671 5,923 ,000 ,404 ,831 ,409 2,443
1
X2 ,137 ,052 ,299 2,639 ,013 ,031 ,244 ,409 2,443
a
Model X2 X1
X2 1,000 -,769 Correlations
X1 -,769 1,000
X2 ,003 -,004
1
Covariances
X1 -,004 ,011
a
Model
Dimensi
(Constant) X1 X2
1 2,505 1,000 ,05 ,03 ,03
2 ,378 2,575 ,83 ,17 ,03
1
3 ,117 4,634 ,12 ,80 ,94
260
Casewise Diagnostics
a
Case
Number Std. Residual Y Predicted Value Residual
1 1,257 149 134,99 14,010
2 1,001 73 61,84 11,156
3 -1,030 21 32,48 -11,476
4 ,041 7 6,54 ,462
5 -,703 9 16,83 -7,831
6 1,645 60 41,67 18,330
7 ,940 26 15,53 10,473
8 ,281 5 1,87 3,128
9 -,212 5 7,36 -2,362
10 ,276 14 10,93 3,070
11 ,540 26 19,99 6,013
12 ,851 27 17,52 9,483
13 -,250 10 12,78 -2,782
14 ,186 13 10,93 2,070
15 ,270 23 19,99 3,013
16 -,342 38 41,81 -3,808
17 -3,057 19 53,06 -34,061
18 -1,925 17 38,45 -21,445
19 ,160 11 9,21 1,786
20 ,783 13 4,27 8,726
21 -1,666 17 35,56 -18,563
22 -,293 31 34,26 -3,260
23 ,342 19 15,18 3,816
24 -,930 11 21,36 -10,360
25 -,157 3 4,75 -1,754
26 ,284 13 9,83 3,168
27 ,408 15 10,45 4,551
28 1,002 26 14,84 11,159
29 ,913 22 11,82 10,178
30 -,207 3 5,30 -2,303
31 -,207 3 5,30 -2,303
32 -,205 7 9,28 -2,283
261
Casewise Diagnostics
a
Case
Number Std. Residual Y Predicted Value Residual
1 1,257 149 134,99 14,010
2 1,001 73 61,84 11,156
3 -1,030 21 32,48 -11,476
4 ,041 7 6,54 ,462
5 -,703 9 16,83 -7,831
6 1,645 60 41,67 18,330
7 ,940 26 15,53 10,473
8 ,281 5 1,87 3,128
9 -,212 5 7,36 -2,362
10 ,276 14 10,93 3,070
11 ,540 26 19,99 6,013
12 ,851 27 17,52 9,483
13 -,250 10 12,78 -2,782
14 ,186 13 10,93 2,070
15 ,270 23 19,99 3,013
16 -,342 38 41,81 -3,808
17 -3,057 19 53,06 -34,061
18 -1,925 17 38,45 -21,445
19 ,160 11 9,21 1,786
20 ,783 13 4,27 8,726
21 -1,666 17 35,56 -18,563
22 -,293 31 34,26 -3,260
23 ,342 19 15,18 3,816
24 -,930 11 21,36 -10,360
25 -,157 3 4,75 -1,754
26 ,284 13 9,83 3,168
27 ,408 15 10,45 4,551
28 1,002 26 14,84 11,159
29 ,913 22 11,82 10,178
30 -,207 3 5,30 -2,303
31 -,207 3 5,30 -2,303
32 -,205 7 9,28 -2,283
262
a
Predicted Value 1,87 134,99 23,00 25,428 32
Std. Predicted Value -,831 4,404 ,000 1,000 32
Value
2,033 9,335 3,108 1,428 32
Residual -34,061 18,330 ,000 10,777 32
Std. Residual -3,057 1,645 ,000 ,967 32
Stud. Residual -3,309 2,303 ,020 1,081 32
Deleted Residual -39,921 46,990 ,720 14,518 32
Stud. Deleted Residual -4,122 2,503 -,004 1,193 32
Mahal. Distance ,063 20,788 1,937 3,783 32
Cooks Distance ,000 4,161 ,172 ,737 32
263
Now, since the data we create two samples (according to growing values of independent
variables X
1
, which could be the cause heteroskedasticityi because the likelihood of its
critical minimum):
Determine their regression models, preserving the value of residuals. Then calculate variance
residuals for both regression and testing the difference variances, in order to check the
assumption homoskedasticity. Add new variable (1 for the sample - X
1
has a value of 8 or
less, 2 for sample II - X
1
has a value of 21 or more, and the third group are the other
observations). Outputs organize in groups (Split file):
264
Specifies that the data is divided into groups according to the new variables:
Choosing OK.
Restarts Regression and allocates variables as previously. Here the goal is to save the new
variable with the residuals.
265
I model regressions (group 1) ANOVA and coefficients:
Model Summary
b,c
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,671
a
,450 ,328 4,771 2,266
b. VAR00001 = 1,00
c. Dependent Variable: Y
ANOVA
b,c
Regression 167,820 2 83,910 3,687 ,068
a
Residual 204,847 9 22,761
1
Total 372,667 11
b. VAR00001 = 1,00
266
Coefficients
a,b
Standardized
Coefficients Correlations Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Zero-order Partial Part Tolerance VIF
(Constant) -2,849 4,497 -,633 ,542
X1 1,701 ,747 ,582 2,278 ,049 ,638 ,605 ,563 ,935 1,070
1
X2 ,036 ,043 ,217 ,847 ,419 ,365 ,272 ,209 ,935 1,070
a. VAR00001 = 1,00
II model regressions (group 2) ANOVA and coefficients:
Model Summary
b,c
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,905
a
,819 ,783 17,101 1,983
b. VAR00001 = 2,00
267
ANOVA
b,c
Regression 13236,938 2 6618,469 22,633 ,000
a
Residual 2924,293 10 292,429
1
Total 16161,231 12
b. VAR00001 = 2,00
Coefficients
a,b
Standardized
Model
B Std. Error Beta
t Sig.
(Constant) -4,944 8,792 -,562 ,586
X1 ,597 ,272 ,554 2,195 ,053 ,881 ,570 ,295 ,284 3,519
1
X2 ,173 ,113 ,387 1,533 ,156 ,855 ,436 ,206 ,284 3,519
a. VAR00001 = 2,00
Annihilate Split File option and return to the analysis of the new variables residuals.
268
Start with T test differences, because we average in the framework of the profit test and the
differences between variances:
269
In Define groups we give samples (1 i 2):
We choose Continue, then OK:
Group Statistics
VAR00
001 N Mean Std. Deviation Std. Error Mean
1,00 12 ,0000000 4,31537183 1,24574054 Unstandardized Residual
2,00 13 ,0000000 15,61060671 4,32960330
270
Levenes Test for Equality of
Variances t-test for Equality of Means
95% Confidence Interval of the
Difference
F Sig. t df Sig. (2-tailed) Mean Difference
Std. Error
Equal variances
assumed
14,753 ,001 ,000 23 1,000 ,00000000 4,66934791 -9,65928209 9,65928209 Unstandardized
Residual
Equal variances not
assumed
,000 13,965 1,000 ,00000000 4,50525629 -9,66510526 9,66510526
As the p for Leven test less than 0.05, concludes that the variances residuals for selected samples of different, which indicates the presence
heteroskedasticity. Variable, which is the source heteroskedasticity, it is critical that the likelihood (with the parameter action regression)
minimum. Heteroskedasticity can be solved weighted regression, we weight variable which is the source heteroskedasticity. So we create new
variables, by weighting and using the Compute:
271
272
With new variables we create regression and Output is:
273
Yn 1,1531 ,60050 32
X1n ,1048 ,09310 32
X2n 4,7574 4,44110 32
Correlations
Yn X1n X2n
Yn 1,000 ,221 ,392
X1n ,221 1,000 ,642
Pearson Correlation
X2n ,392 ,642 1,000
Yn . ,112 ,013
X1n ,112 . ,000
Sig. (1-tailed)
X2n ,013 ,000 .
Yn 32 32 32
X1n 32 32 32
N
X2n 32 32 32
Model
Variables
Entered
Variables
Removed Method
1 X2n, X1n
a
. Enter
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,394
a
,155 ,097 ,57062 1,890
a. Predictors: (Constant), X2n, X1n
b. Dependent Variable: Yn
274
ANOVA
b
Regression 1,736 2 ,868 2,666 ,087
a
Residual 9,443 29 ,326
1
Total 11,179 31
a. Predictors: (Constant), X2n, X1n
b. Dependent Variable: Yn
Coefficients
a
Standardized
Model
B Std. Error Beta
t Sig.
(Constant) ,914 ,160 5,708 ,000
X1n -,331 1,435 -,051 -,230 ,819 ,221 -,043 -,039 ,588 1,700
1
X2n ,057 ,030 ,425 1,910 ,066 ,392 ,334 ,326 ,588 1,700
a. Dependent Variable: Yn
275
a
Model X2n X1n
X2n 1,000 -,642 Correlations
X1n -,642 1,000
X2n ,001 -,028
1
Covariances
X1n -,028 2,060
a
Model
Dimensi
(Constant) X1n X2n
1 2,554 1,000 ,05 ,03 ,04
2 ,287 2,983 ,94 ,10 ,19
1
3 ,159 4,006 ,01 ,86 ,77
a
Predicted Value ,8483 1,9590 1,1531 ,23664 32
Value
,102 ,408 ,161 ,068 32
Adjusted Predicted Value ,8033 2,3659 1,1716 ,29959 32
Residual -,70505 2,15608 ,00000 ,55191 32
Std. Residual -1,236 3,778 ,000 ,967 32
Stud. Residual -1,277 3,997 -,014 1,024 32
Deleted Residual -,79451 2,41295 -,01846 ,62254 32
Mahal. Distance ,027 14,907 1,938 3,017 32
Cooks Distance ,000 ,635 ,046 ,123 32
276
Comparison of regressions obtained with the initial regression indicates significantly "worse"
model in the case of corrections to heteroskedasticity implemented method.
277
Example 5
MULTIPLE REGRESSION (AUTOCORRELATION)
We have data in Excel sheet and transforming them into SPSS document. We start with
Regression and allocates variables:
Completing Statistics, plots and Save with the desired options. Output is:
Y 9,95 36,626 19
X1 10832,26 4771,812 19
X2 377,21 1204,523 19
Correlations
Y X1 X2
Y 1,000 ,629 ,413
X1 ,629 1,000 -,009
Pearson Correlation
X2 ,413 -,009 1,000
Y . ,002 ,039
X1 ,002 . ,486
Sig. (1-tailed)
X2 ,039 ,486 .
N Y 19 19 19
278
X1 19 19 19
X2 19 19 19
Model
Variables
Entered
Variables
Removed Method
1 X2, X1
a
. Enter
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
1 ,756
a
,572 ,518 25,421 1,071
So, this means presence of positive autocorrelation.
ANOVA
b
Regression 13807,684 2 6903,842 10,684 ,001
a
Residual 10339,263 16 646,204
1
Total 24146,947 18
279
Coefficients
a
Standardized
Model
B Std. Error Beta
t Sig.
(Constant) -47,505 14,933 -3,181 ,006 -79,161 -15,848
X1 ,005 ,001 ,633 3,870 ,001 ,002 ,008 ,629 ,695 ,633 1,000 1,000
1
X2 ,013 ,005 ,419 2,561 ,021 ,002 ,023 ,413 ,539 ,419 1,000 1,000
a
Model
Dimensi
(Constant) X1 X2
1 2,078 1,000 ,03 ,03 ,06
2 ,842 1,571 ,01 ,01 ,94
1
3 ,080 5,082 ,96 ,95 ,01
REGRESSION ANALYSIS
280
a
Predicted Value -51,14 65,38 9,95 27,696 19
Residual -42,129 40,390 ,000 23,967 19
Std. Residual -1,657 1,589 ,000 ,943 19
REGRESSION ANALYSIS
281
REGRESSION ANALYSIS
282
References
Curwin J. and Slater R., Quantitative Methods for Business Decisions, Thomson
Learning fifth edition 2002.
Ku H., Statistike funkcije u Excelu, Grafiar, Zenica, 1999.
Levine D.M. and others, Statistics for Managers Using Microsoft Excel, Prentice
Hall, NY, 2005.
Newbold P., Statistics for business and economics, Prentice Hall, 1988.
Papi M., Primjenjena statistika u MS Excelu, Zoro, Zagreb, 2005.
Somun-Kapetanovi R., Statistika u ekonomiji i menadmentu, Ekonomski
fakultet u Sarajevu, Sarajevo 2006.
http://www.answers.com
http://www.mnstate.edu
http://www.statcan.ca
http://www.wikipedia.org
http://www.socialresearchmethods.net

Applied Statistics en

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Applied Statistics en

Uploaded by

Copyright:

Available Formats

APPLIED STATISTICS

EXAMPLES IN EXCEL AND SPSS

= ; m- number of success realization, n- number of trials.

EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONS

We will sum square of errors:

Because of different signs, we will square those errors:

We will sun that square errors:

e) less than 34000 miles

Or by normal distribution First standardization:

EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONS

Confidence interval for proportion of households whose holder married to a complete

= = should not be significantly different from 1.

high moderate low

What kind of tourism is the theme (motive) of your visitors?

A 100 20 19,286 0,02643

Regression model: 5.089 0.938

, we consider a model significant (at

If VIF>10 and Tolerance<0,1 assumption of noncollinearity is not done.

Residuals distribution is normal.

You might also like