- The mode is the most frequent value in the list (or one of the most frequent values, if there is more than one). The mode is calculated only for a statistical distribution (grouped series). It can be determined graphically from a histogram. For a non-interval grouped distribution, the mode is read off as the value with the highest frequency ($f_{M_o} = f_{\max}$). For an interval grouped distribution, the mode is located within the interval that has the highest frequency (the modal interval) on the basis of the following formula:
$$M_o = L_{M_o} + l_{M_o}\,\frac{f_{M_o} - f_{M_o-1}}{(f_{M_o} - f_{M_o-1}) + (f_{M_o} - f_{M_o+1})}$$
where $L_{M_o}$ is the lower limit of the modal interval, $l_{M_o}$ is its width, $f_{M_o}$ is its frequency, and $f_{M_o-1}$ and $f_{M_o+1}$ are the frequencies of the preceding and following intervals.
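As a minimal sketch of the interpolation for an interval grouped distribution (modal interval lower limit plus width times the share of the frequency excess over the preceding interval), the following Python function may help. The function name and the numbers are hypothetical illustrations, not data from the text.

```python
def grouped_mode(lower, width, f_prev, f_modal, f_next):
    """Mode of an interval-grouped distribution by linear interpolation."""
    d_prev = f_modal - f_prev   # excess over the preceding interval
    d_next = f_modal - f_next   # excess over the following interval
    return lower + width * d_prev / (d_prev + d_next)

# Hypothetical modal interval 200-300 with frequency 50,
# neighbouring intervals with frequencies 30 and 40.
mode = grouped_mode(lower=200, width=100, f_prev=30, f_modal=50, f_next=40)
print(round(mode, 2))
```

The mode is pulled toward the neighbouring interval with the higher frequency, here the following one.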
Sometimes we choose specific values from the cumulative distribution function, called quartiles. The procedure is the same as for the median:
- 25% of the data have a value less than or equal to the first quartile and 75% of the data have a value higher than the first quartile (theoretical position $\frac{N}{4} \le CF_{Q_1}$)
- 75% of the data have a value less than or equal to the third quartile and 25% of the data have a value higher than the third quartile (theoretical position $\frac{3N}{4} \le CF_{Q_3}$).
Measures of dispersion
Dispersion refers to the spread of the values around the central tendency. There are
three common absolute measures of dispersion:
DESCRIPTIVE STATISTICS
EXAMPLES IN EXCEL
44
- The range
The range is simply the highest value minus the lowest value:
$RV = x_{\max} - x_{\min}$.
- The quartile range
The quartile range ($I_Q = Q_3 - Q_1$) is the range from the 25th to the 75th percentile
of a distribution. It represents the "Middle Half" of the data and is a marker of
variability or spread that is robust to outliers.
- The standard deviation
The standard deviation is the square root of the sum of the squared deviations
from the mean divided by the number of scores (or the number of scores minus
one, if we work with sample).
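For population:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^2, \qquad \sigma = \sqrt{\sigma^2}$$
For sample:
$$\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^2, \qquad \sigma = \sqrt{\sigma^2}$$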
The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following conclusions can be reached (the six sigma rule):
- approximately 68% of the scores in the sample fall within one standard deviation of the mean
- approximately 95% of the scores in the sample fall within two standard deviations of the mean
- approximately 99.7% of the scores in the sample fall within three standard deviations of the mean.
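The difference between dividing by the number of scores and by the number of scores minus one can be checked with Python's standard `statistics` module. This is only a sketch: the list `scores` is hypothetical data chosen so that the arithmetic is easy to follow.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean = 5

pop_sd = statistics.pstdev(scores)     # divides by N (population)
sample_sd = statistics.stdev(scores)   # divides by N - 1 (sample)

mean = statistics.mean(scores)
# Share of scores within one population standard deviation of the mean.
within_one_sd = sum(abs(x - mean) <= pop_sd for x in scores) / len(scores)

print(pop_sd, round(sample_sd, 4), within_one_sd)
```

For such a small, non-normal list the share within one standard deviation need not be close to 68%; the rule applies to approximately normal distributions.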
The problem with the standard deviation, as an absolute measure of dispersion, is that we cannot use it to compare series with different units of measure or with different averages.
For that reason we define relative measures of dispersion, such as:
- Coefficient of variation
The coefficient of variation is a relative measure of variability which can be used for comparing series with different units of measure, because it is a dimensionless number.
$$V = \frac{\sigma}{\bar{X}} \cdot 100\ (\%)$$
It can also be used for comparing series with different arithmetic means.
- z value
Z values determine the relative position of a value (modality) of the variable within the series:
$$z_i = \frac{x_i - \bar{X}}{\sigma}, \quad i = 1, 2, \ldots, N$$
They are appropriate for comparing the positions of data in different series. Z values are specific in that we can calculate a z value for each modality, not only for the series as a whole.
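A short sketch of both relative measures, the coefficient of variation (standard deviation divided by the mean, times 100) and the z values (deviation from the mean divided by the standard deviation), is given below. The helper names and the data are hypothetical, and the population standard deviation is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

def z_scores(values):
    """Standardized position of every value in the series."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical series, mean 5, sigma 2
cv = coefficient_of_variation(data)
z = z_scores(data)
print(cv, z)
```

Because both results are unit-free, they can be compared across series measured in different units.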
- The quartile deviation coefficient
The quartile deviation coefficient is a relative dispersion indicator that shows variability around the median value:
$$V_Q = \frac{I_Q}{Q_1 + Q_3} \cdot 100\% = \frac{Q_3 - Q_1}{Q_1 + Q_3} \cdot 100\%$$
A higher value of the quartile deviation coefficient indicates greater dispersion and vice versa. This is a relative indicator of how the data vary around the median.
Shape of distribution
Symmetry or skewness
A frequency distribution may be symmetrical or asymmetrical. Imagine constructing a
histogram centred on a piece of paper and folding the paper in half the long way. If
the distribution is symmetrical, the part of the histogram on the left side of the fold
would be the mirror image of the part on the right side of the fold. If the distribution is
asymmetrical, the two sides will not be mirror images of each other. True symmetric
distributions include what we will later call the normal distribution. Asymmetric
distributions are more commonly found.
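Measure of skewness
$$\alpha_3 = \frac{\mu_3}{\sigma^3}, \qquad \mu_3 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^3$$
- $\alpha_3 = 0$: symmetry
- $\alpha_3 > 0$: positively skewed
- $\alpha_3 < 0$: negatively skewed
[Figure: frequency curves (f against X) for symmetric, left asymmetric and right asymmetric distributions]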
If a distribution is asymmetric it is either positively skewed or negatively skewed. A
distribution is said to be positively skewed if the scores tend to cluster toward the
lower end of the scale (that is, the smaller numbers) with increasingly fewer scores at
the upper end of the scale (that is, the larger numbers). A negatively skewed
distribution is exactly the opposite. With a negatively skewed distribution, most of the
scores tend to occur toward the upper end of the scale while increasingly fewer scores
occur toward the lower end.
Kurtosis
Another descriptive statistic that can be derived to describe a distribution is called
kurtosis. It refers to the relative concentration of data in the centre, the upper and
lower ends (tails), and the shoulders of a distribution. A distribution is platykurtic if
it is flatter than the corresponding normal curve and leptokurtic if it is more peaked
than the normal curve.
Modality
A distribution is called unimodal if there is only one major "peak" in the distribution
of scores when represented as a histogram. A distribution is bimodal if there are two
major peaks. If there are more than two major peaks, we call the distribution
multimodal.
Measure of kurtosis
$$\alpha_4 = \frac{\mu_4}{\sigma^4}, \qquad \mu_4 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{X}\right)^4$$
- $\alpha_4 = 3$: normal
- $\alpha_4 > 3$: leptokurtic
- $\alpha_4 < 3$: platykurtic
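Both shape measures are standardized central moments (the third or fourth central moment divided by the matching power of the standard deviation), so one small function can sketch them. The function name and the data are hypothetical, and the population formulas with divisor N are assumed; spreadsheet functions may apply sample corrections and report kurtosis as an excess over 3.

```python
def standardized_moment(values, order):
    """Central moment of the given order divided by sigma**order."""
    n = len(values)
    mean = sum(values) / n
    sigma = (sum((x - mean) ** 2 for x in values) / n) ** 0.5
    central = sum((x - mean) ** order for x in values) / n
    return central / sigma ** order

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical series
skewness = standardized_moment(data, 3)  # > 0: positively skewed
kurtosis = standardized_moment(data, 4)  # < 3: platykurtic
print(skewness, kurtosis)
```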
Measure of concentration
The Lorenz curve is a graphical representation of the cumulative distribution
function of a probability distribution; it is a graph showing the proportion of the
distribution assumed by the bottom y% of the values. It is often used to represent
income distribution, where it shows for the bottom x% of households, what
percentage y% of the total income they have.
Every point on the Lorenz curve represents a statement like "the bottom 20% of all
households has 10% of the total income". A perfectly equal income distribution would
be one in which every person has the same income. In this case, the bottom N% of
society would always have N% of the income. This can be depicted by the straight
line y = x; called the line of perfect equality.
By contrast, a perfectly unequal distribution would be one in which one person has all
the income and everyone else has none. In that case, the curve would be at y = 0 for
all x < 100%, and y = 100% when x = 100%. This curve is called the line of perfect
inequality.
The Gini coefficient is the area between the line of perfect equality and the observed Lorenz curve, as a percentage of the area between the line of perfect equality and the line of perfect inequality. This equals two times the area between the line of perfect equality and the observed Lorenz curve.
$$G = \frac{\text{concentration area}}{0.5} = 2 \cdot \text{concentration area} = 2S$$
$$G = 1 - \sum_{j} p_j \left(Q_j + Q_{j-1}\right)$$
$$0 \le G \le 1$$
The higher the Gini coefficient, the more unequal the distribution is.
Excel and SPSS do not offer an option to calculate measures of concentration directly, so we develop the procedure in Excel on the basis of the formula.
Example 4.
We have a database of variables describing the procedure of paying taxes for 181 countries (source: http://www.doingbusiness.org/CustomQuery/, data for 2008). The data are given in an Excel sheet (A1-G363). The variables are:
- Payments (number) (B2-B363)
- Time (hours) (C2-C363)
- Total tax rate (% profit) (D2-D363).
These are quantitative variables, so we can apply the methodology of descriptive statistics to the series of 181 data points for each variable to get several parameters that describe the given series.
The simplest and fastest way to get several parameters that describe the given series ($x_{\min}$, $x_{\max}$, average, deviation, mode, median, kurtosis and skewness) is to use the Excel function Tools → Data Analysis. If that option is not available, we have to enable it:
1. Tools → Add-ins:
2. Select Analysis ToolPak and Analysis ToolPak VBA:
3. Click OK and the option will appear in Tools:
Now we can use the Data Analysis option:
We get a list of the analyses that we can run. Currently we are interested in the option Descriptive Statistics, so we choose it and click OK. In Input Range we can select all the columns with the variables at once, grouped by columns ($B$1:$D$182). When selecting the data we also include the first cell with the variable name and tick the option Labels in first row. Then we choose an empty cell or a new sheet where the results of the analysis will be saved, and we select which parameters we want:
- Summary statistics: x_min, x_max, average, deviation, mode, median, kurtosis and skewness, range, count...
- Confidence level for mean: this is the boundary of the confidence interval for the average at a given confidence level (for example 95%)
- Kth largest and Kth smallest: if we want to calculate quantiles we choose this option, for example for the first and third quartile we take 25 in both cases, and for the first and ninth decile we take 10 in both cases
Click OK and the result is:
Using one of these variables, time (hours), as an example, we interpret the results:
- The average is 317.63 hours for the sample of 181 countries (count), so on average 317.63 hours are needed for the tax payment procedure.
- The standard error of the mean, based on the sample size and the sample standard deviation ($\sigma_{\bar{X}} = \sigma / \sqrt{n}$), is 23.61 hours.
- The median is 256, so 50% of countries need 256 hours or less for the tax payment procedure, while 50% of countries need more than 256 hours.
- The mode is 270, so the most frequent value is 270 hours for the tax payment procedure.
- The standard deviation, as the average deviation from the mean, is 317.66 hours, so we can calculate the coefficient of variation:
$$V = \frac{\sigma}{\bar{X}} \cdot 100 = \frac{317.66}{317.63} \cdot 100 \approx 100\%$$
Relative variability around the average is 100%. This information is meaningful only in comparison with another series.
- The variance, as the average squared deviation from the mean, is 100906.1, but we interpret it through the standard deviation.
- Kurtosis is (19.96 + 3) = 22.96, which is more than 3, so we can conclude that this distribution is significantly more peaked than the normal curve.
- Skewness is 3.77, which is more than 0, so we can conclude that this distribution is significantly right asymmetric in comparison with the normal curve.
- The range, as the difference between the highest and lowest value, is 2600 h.
- The minimal time for the tax payment procedure is 0 h.
- The maximal time for the tax payment procedure is 2600 h.
- The sum of the data in the series is 57491, but there is no meaningful interpretation for this information.
- The third quartile is 453, so 75% of countries need 453 hours or less for the tax payment procedure, while 25% of countries need more than 453 hours.
- The first quartile is 105, so 25% of countries need 105 hours or less for the tax payment procedure, while 75% of countries need more than 105 hours.
- The boundary of the confidence interval for the average at the 95% confidence level is 46.59. The confidence interval for the average at the 95% confidence level is [317.63 ± 46.59] = [271.04, 364.22]. So, with a type I error of 5%, we can conclude that the average time for the tax payment procedure lies in the interval [271.04, 364.22] hours.
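The derived figures in this interpretation can be re-checked from the reported summary statistics alone. The sketch below uses only the values quoted in the text (average 317.63, standard deviation 317.66, count 181, confidence boundary 46.59); the variable names are illustrative.

```python
import math

mean, sd, n = 317.63, 317.66, 181   # reported summary statistics
margin = 46.59                      # reported 95% confidence boundary

std_error = sd / math.sqrt(n)       # standard error of the mean
cv = sd / mean * 100                # coefficient of variation in %
ci = (mean - margin, mean + margin) # confidence interval for the average

print(round(std_error, 2), round(cv), [round(b, 2) for b in ci])
```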
To see these parameters visually we will construct a histogram. This option is available in Data Analysis:
Before we construct the histogram we have to define the intervals according to the minimal and maximal values and the number of intervals we want to create. The maximal value is 2600 and the minimal value is 0, so we choose intervals of width 100: 0-100, 100-200, ..., 400-500, 500-600, ..., 2500-2600. The upper limits included in these intervals are: 99, 199, ..., 499, 599, ..., 2600. We type these limits in one Excel column (I22:I47).
For Input Range we select the column with the original data (C2:C182) and for Bin Range we select the cells with the upper limits of the intervals (I22:I47). We choose where to save the result and tick the option Chart output:
The graph we get has vertical bars separated by gaps, so we click on the graph and open Chart options → Options. There we set the gap between the bars to 0:
Finally, the histogram looks like this:
[Figure: Histogram. x-axis: Bin (upper limits 99, 299, 499, ..., 2499, More); y-axis: Frequency (0 to 60)]
The histogram fully confirms our interpretation of the parameters for the shape of the distribution. The distribution is strongly positively (right) asymmetric and peaked. It differs significantly from the normal curve.
Example 5.
To analyse the concentration of consumption in the HBS 2008 database, we take data on consumption per capita for 23374 individuals from 7071 households:
These are original raw data, so we will first construct an appropriate frequency distribution. We need to find the minimal and maximal values of consumption in our sample:
Based on these, we decide to set up intervals of width 5000, so the upper limits included in the intervals (bins) are: 4999.99, 9999.99, 14999.99, ..., 54999.99. We type these limits in an empty column of the sheet containing the original data:
We select the empty cells in the next column (E6:E16). In the function list (fx) we choose Frequency:
With CTRL+SHIFT+ENTER we get the frequency distribution:
Now we can start constructing the Lorenz curve and calculating the Gini coefficient. We need the centres of the intervals and the relative frequencies, but before that we have to form columns with the lower and upper limits of the intervals:
First we calculate the centres of the intervals:
With the Copy-Paste option we get the column with the centres of the intervals:
Then we calculate the relative frequencies:
With the Copy-Paste option we get the column with the relative frequencies:
Then we calculate the relative cumulative frequencies. The first is the same as the first relative frequency, and we continue cumulating:
With the Copy-Paste option we get the column with the relative cumulative frequencies:
Then we need the cumulative relative aggregate. First we calculate the aggregate (cp) as the product of the centre of the interval and the absolute frequency of the given interval:
With the Copy-Paste option we get the column for the aggregate:
We calculate the relative aggregate as $q_i = \dfrac{c_i p_i}{\sum_i c_i p_i}$:
With the Copy-Paste option we get the column for the relative aggregate:
Finally we find the cumulative relative aggregate (Q):
With the Copy-Paste option we get the column for the cumulative relative aggregate:
To graph the Lorenz curve, we take the relative cumulative frequencies for the x axis and the cumulative relative aggregate for the y axis. Before that we insert one point with value 0 for both cumulative series:
Now we can graph the Lorenz curve:
For the line of perfect equality we take the relative cumulative frequencies for both axes.
For the Lorenz curve we take:
Now with Add we insert a new series for the line of perfect equality:
We choose Next and then we get the option to enter titles:
Finally, the graph looks like this:
The white area is the area of concentration.
We calculate the Gini coefficient, which quantifies the measure of concentration, according to the relation
$$G = 1 - \sum_{j} p_j \left(Q_j + Q_{j-1}\right)$$
With the Copy-Paste option we complete this column:
When we calculate (1 - this sum) we get the Gini coefficient:
And we get the result:
The Gini coefficient is 0.3378, so the distribution of consumption is not perfectly equal, but the level of concentration is not very high.
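The worksheet steps of this example (relative frequencies, aggregate as interval centre times frequency, cumulative relative aggregate, and finally one minus the sum of relative frequency times the sum of two neighbouring cumulative aggregates) can be sketched in Python as follows. The function name and the two small test distributions are hypothetical; the HBS 2008 data themselves are not reproduced here.

```python
def gini_from_grouped(centres, frequencies):
    """Gini coefficient from interval centres and absolute frequencies."""
    total_n = sum(frequencies)
    aggregates = [c * f for c, f in zip(centres, frequencies)]
    total_aggregate = sum(aggregates)

    weighted_sum = 0.0
    q_prev = 0.0  # cumulative relative aggregate Q_{j-1}, starts at 0
    for f, a in zip(frequencies, aggregates):
        p = f / total_n                     # relative frequency p_j
        q_cum = q_prev + a / total_aggregate  # cumulative relative aggregate Q_j
        weighted_sum += p * (q_cum + q_prev)
        q_prev = q_cum
    return 1.0 - weighted_sum

# Hypothetical checks: one class means perfect equality (G = 0);
# two equally sized classes with centres 10 and 30 give G = 0.25.
print(gini_from_grouped([20], [5]), gini_from_grouped([10, 30], [1, 1]))
```

The starting value 0 for the cumulative aggregate plays the role of the extra point with value 0 that is inserted before graphing the Lorenz curve.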
EMPIRICAL VERSUS APPROPRIATE THEORETICAL DISTRIBUTIONS
EXAMPLES IN EXCEL
67
II. Empirical versus appropriate theoretical distributions
(approximations with binomial, Poisson,
hypergeometric or normal distribution)
PROBABILITY DISTRIBUTIONS
A frequency distribution formed by grouping population units according to the same characteristics is an empirical distribution. A distribution formed on the basis of theoretical assumptions is a theoretical distribution. The main characteristics of theoretical distributions are:
- We assume them in some statistical model, or we set them up as a hypothesis that we have to test.
- Theoretical distributions are given as analytic models with known parameters: expectation, mode, median, standard deviation, skewness and kurtosis.
- Theoretical distributions are given as probability distributions.
A probability for which we know the number of possible outcomes of an event and the number of successful realizations is an a priori probability. In statistical research, however, we most often do not know the probability a priori, so we use experiments to obtain the probability a posteriori. An a posteriori probability is therefore an empirical or statistical probability.
The empirical (a posteriori) probability is the limiting value of the relative frequency of successes of event A when the number of trials is large, i.e. tends to infinity:
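$$p(A) = \lim_{n \to \infty} \frac{m}{n}$$
[...]
Expectation for continuous variable is:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \quad -\infty < x < \infty.$$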
Variance for discrete variable is:
$$\sigma^2 = E\left[(X - E(X))^2\right] = \sum_{i=1}^{k} \left(x_i - E(X)\right)^2 p(x_i),$$
that is,
$$\sigma^2 = \sum_{i=1}^{k} x_i^2\, p(x_i) - \left[\sum_{i=1}^{k} x_i\, p(x_i)\right]^2.$$
Variance for continuous variable is:
$$\sigma^2 = E\left[(X - E(X))^2\right] = \int x^2 f(x)\,dx - \left[\int x f(x)\,dx\right]^2.$$
Theoretical probability distributions can be split into two groups:
- discrete probability distributions, which deal with discrete events
  o binomial distribution
  o Poisson distribution
  o hypergeometric distribution
- continuous probability distributions, which deal with continuous events
  o normal distribution
  o Student (t) distribution
  o $\chi^2$ (chi-square) distribution
  o F distribution.
The probability distribution of a random variable describes the probabilities of all possible outcomes. The sum (integral) of these probabilities equals 1.
BINOMIAL DISTRIBUTION
The binomial distribution is used when the discrete random variable of interest is the number of successes obtained in a sample of n observations. It is used to model situations that have the following properties:
- The sample consists of a fixed number of observations n.
- Each observation is classified into one of two mutually exclusive categories, usually called success and failure.
- The probability of an observation being classified as success, noted as p, is constant from observation to observation. Thus, the probability of an observation being classified as failure, noted as (1-p) = q, is constant over all observations.
- The outcome (success or failure) of any observation is independent of the outcome of any other observation.
The binomial distribution has two parameters:
- n: number of observations, trials or experiment repetitions.
- p: the probability of success (occurrence of a given event) on a single observation, trial or experiment.
Probability distribution of a binomial random variable
The probability distribution of a binomial random variable is:
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, \ldots, n,$$
where x is the exact number of successes of interest and p(x) is the probability that exactly x successes will be realized among n trials (the given event will be realized exactly x times).
[Figure: Binomial probability function] 1
Example 1.
An insurance broker believes that, for a particular contact, the probability of making a sale is 0.4. Suppose now that he has five contacts. What is the probability that he will make three sales among these five contacts?
Solution:
The discrete random variable X takes the value 1 if a sale is made and 0 if a sale is not made, so it can be treated with the binomial distribution. The sale experiment is repeated 5 times, so n = 5.
Because the variable is dichotomous, we apply the binomial distribution:
1 From Wikipedia, the free encyclopedia
$$p(1) = p = 0.4, \quad p(0) = q = 1 - p = 0.6, \quad n = 5, \quad x = 3$$
$$p(x) = \binom{n}{x} p^x (1-p)^{n-x} \;\Rightarrow\; p(3) = \binom{5}{3} \cdot 0.4^3 \cdot 0.6^2 = 0.23$$
The probability that he will make three sales among these five contacts is 23%.
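The arithmetic of this example can be verified with a few lines of Python. This is a sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example 1: five contacts, probability of a sale 0.4, exactly three sales.
p3 = binomial_pmf(3, 5, 0.4)
print(round(p3, 4))

# The probabilities over all possible x sum to 1.
total = sum(binomial_pmf(x, 5, 0.4) for x in range(6))
```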
Characteristics of the Binomial distribution
Shape
The binomial distribution can be symmetrical (if p = 0.5) or skewed (if p ≠ 0.5).
Mean
$$\mu = E(X) = n \cdot p$$
Variance
$$\sigma^2 = E\left[(X - \mu)^2\right] = n \cdot p \cdot (1 - p)$$
We have 4 types of binomial distribution:
- symmetric, if p = q = 0.5
- asymmetric, if p ≠ q
- a priori, if we know the probabilities p and q
- a posteriori, if we have to find p and q by an empirical method
The conditions for approximating an empirical distribution with the binomial distribution are:
$$0 \le \frac{\bar{X}}{n} \le 1$$
$$\sigma^2 \approx \bar{X}\left(1 - \frac{\bar{X}}{n}\right)$$
The error of approximation is a measure of the quality of the approximation. The error of approximation by modality is $d_k = f_k - f_k^b$, where $f_k$ is the empirical frequency and $f_k^b$ is the theoretical frequency, so the overall error of approximation is:
$$\sigma_b^2 = \frac{1}{n+1} \sum_k d_k^2$$
Example 2.
The accounting office of a company has information that 40% of customers do not settle their obligations on time because of inflation. If we randomly select 6 customers, what is the probability:
1. that all customers settled their obligations on time
2. that more than 3/4 of the customers settled their obligations on time
3. that 50% or more of the customers did not settle their obligations on time.
Solution:
p = 60% = 0.6 (settle the obligation on time)
q = 40% = 0.4 (do not settle the obligation on time)
n = 6
$$p(x) = \binom{n}{x} p^x q^{n-x}$$
1. The probability that all customers settled their obligations on time is, according to the table, p(6) = 4.67%.
2. The probability that more than 3/4 of the customers settled their obligations on time: 3/4 of 6 is 4.5, so we take the probabilities for x = 5 and x = 6. According to the table p(5) = 18.66% and p(6) = 4.67%, so the final result, by the addition theorem, is 23.33%.
3. The probability that 50% or more of the customers did not settle their obligations on time: 50% of 6 is 3, and x counts the customers who settle on time, so at least 3 late customers means x = 0, 1, 2, 3. According to the table this is F(3) = 0.455685, i.e. about 45.57%.
Example 3.
Among 1000 products, 28 are defective. If we randomly select 14 products for a sample, what is the probability that:
a) the sample contains exactly 4 defective products;
b) the sample contains at most 2 defective products;
c) the sample contains at least 4 defective products.
Solution (by Excel):
This is a dichotomous variable, so we apply the binomial distribution with modalities x: 0, 1, 2, 3, 4, ..., 14.
$$p = \frac{28}{1000} = 0.028, \quad q = 0.972$$
$$X:\; x_k = k, \quad p_k^b = P(x = k) = \binom{14}{k} \cdot 0.028^k \cdot 0.972^{14-k}, \quad k = 0, \ldots, 14$$
We will use the Excel function:
Binomial probabilities for Example 2 (n = 6, p = 0.6):
x_i    p(x)        F(x)
0      0.004096    0.004096
1      0.036864    0.040960
2      0.138245    0.179205
3      0.276480    0.455685
4      0.311040    0.766725
5      0.186624    0.953349
6      0.046656    1.000000
a) the sample contains exactly 4 defective products
We ask for the probability at a point, not for the cumulative function, so for the option Cumulative we take False.
{=BINOMDIST(4;14;0.028;FALSE)} = 0.000463, i.e. 0.0463%
b) the sample contains at most 2 defective products (so 0, 1 or 2 defective products); this is a cumulative distribution, so for the option Cumulative we take True.
{=BINOMDIST(2;14;0.028;TRUE)} = 0.993662, i.e. 99.3662%
c) the sample contains at least 4 defective products (4, 5 or more), which is the opposite event of the cumulative one (at most 3 defective products, i.e. 0, 1, 2 or 3). The probabilities of an event and its opposite sum to 1, so we use Excel to get the probability of the opposite event (0, 1, 2 or 3 defective products) and then use that property:
1 - {=BINOMDIST(3;14;0.028;TRUE)} = 1 - 0.999509 = 0.000491, i.e. 0.0491%
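For readers without Excel, the three results of this example can be reproduced with a small Python stand-in for the spreadsheet function. The helper `binomdist` below is a hypothetical imitation of the Excel call (same argument order, point or cumulative probability); it is not an Excel or library API.

```python
from math import comb

def binomdist(x, n, p, cumulative):
    """Point probability P(X = x) or cumulative probability P(X <= x)."""
    def pmf(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

exactly_4 = binomdist(4, 14, 0.028, False)       # a) exactly 4 defective
at_most_2 = binomdist(2, 14, 0.028, True)        # b) at most 2 defective
at_least_4 = 1 - binomdist(3, 14, 0.028, True)   # c) opposite of at most 3

print(round(exactly_4, 6), round(at_most_2, 6), round(at_least_4, 6))
```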
Example 4.
To monitor the work of an automatic machine, an inspector takes samples of 10 products. On the basis of 50 samples we get the following information about the number of defective products:
Number of defective products    Number of samples
0                               6
1                               11
2                               15
3                               10
4                               7
5                               1
Total                           50
We have to create an appropriate theoretical approximation for this empirical distribution.
Solution:
This is a discrete random variable. There are two modalities in one trial: a product can be correct or defective. This shows that the appropriate theoretical distribution is the binomial distribution. From the empirical frequency distribution we calculate the average and the standard deviation. We cannot use the Excel functions directly, because this is a grouped distribution, so we set up formulas to calculate the average and the standard deviation:
n = 10, N = 50
The result is:
Alternatively, we create a new column ($x_k f_k$) and divide the sum of that column by the sum of the absolute frequencies:
x_k      f_k    x_k f_k
0        6      0
1        11     11
2        15     30
3        10     30
4        7      28
5        1      5
Total    50     104
$$\bar{X} = \frac{\sum x_k f_k}{N} = \frac{104}{50} = 2.08$$
Then we calculate the standard deviation:
The result is:
Alternatively, we create a new column $x_k^2 f_k$ and calculate $\sigma$ with the general formula $\sigma^2 = \dfrac{\sum x_k^2 f_k}{N} - \bar{X}^2$:
x_k      f_k    x_k f_k    x_k^2 f_k
0        6      0          0
1        11     11         11
2        15     30         60
3        10     30         90
4        7      28         112
5        1      5          25
Total    50     104        298
$$\sigma^2 = \frac{\sum x_k^2 f_k}{N} - \bar{X}^2 = \frac{298}{50} - 2.08^2 = 1.63, \qquad \sigma = 1.278$$
Now we check that the conditions for the binomial approximation are satisfied:
$$\bar{X}\left(1 - \frac{\bar{X}}{n}\right) = 2.08\left(1 - \frac{2.08}{10}\right) = 1.65 \approx \sigma^2$$
$$0 \le \frac{\bar{X}}{n} = 0.208 \le 1$$
The conditions are satisfied, so we can apply the approximation. Then $p = \dfrac{\bar{X}}{n} = 0.208$ and $q = 0.792$.
$$p_x^b = \binom{10}{x} \cdot 0.208^x \cdot 0.792^{10-x}, \qquad f_x^b = p_x^b \cdot N, \qquad x = 0, \ldots, 5$$
In Excel we create a formula for the probability calculation, $p_x^b = \binom{10}{x} \cdot 0.208^x \cdot 0.792^{10-x}$, $x = 0, \ldots, 5$, and then from these theoretical probabilities we compute the theoretical frequencies $f_x^b = p_x^b \cdot N$:
With the Paste option we complete the other cells in the column with theoretical probabilities.
The result is:
Now we compute the theoretical frequencies:
With the Paste option we complete the other cells in the column with theoretical frequencies.
The result is:
That was the procedure for approximation with the binomial distribution. Now we have the distribution for this variable and we can make predictions. The quality of the approximation is measured by the error of approximation.
The error of approximation for the modalities is $d_k = f_k - f_k^b$, so:
$$\sigma_b^2 = \frac{1}{n+1} \sum d_k^2 = \frac{9.589}{11} = 0.872$$
The error of approximation is 0.872.
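The fitting steps of this example (grouped mean and variance, the estimate of p as the mean divided by n, then theoretical probabilities and frequencies) can be sketched in Python. Only the empirical table from the text is used; the variable names are illustrative, and the error of approximation is not recomputed here because it depends on the worksheet columns that are not reproduced in the text.

```python
from math import comb

values = [0, 1, 2, 3, 4, 5]        # number of defective products
freqs = [6, 11, 15, 10, 7, 1]      # number of samples
n_trials = 10                      # products per sample (n)

N = sum(freqs)
mean = sum(x * f for x, f in zip(values, freqs)) / N
variance = sum(x * x * f for x, f in zip(values, freqs)) / N - mean**2

p_hat = mean / n_trials            # estimated probability of a defect
condition = mean * (1 - mean / n_trials)  # should be close to the variance

# Theoretical binomial frequencies for the observed modalities.
theoretical = [
    N * comb(n_trials, x) * p_hat**x * (1 - p_hat) ** (n_trials - x)
    for x in values
]
print(mean, round(variance, 4), p_hat, round(condition, 4))
```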
POISSON DISTRIBUTION
The Poisson distribution is a useful discrete probability distribution when you are interested in the number of times a certain event will occur in a given unit of area or time. This type of situation occurs frequently in business. [...] of opportunity approaches zero as the area of opportunity becomes smaller. The Poisson distribution has one parameter, $\lambda > 0$, which is the average or expected number of events per unit.
Probability distribution of Poisson random variable
The probability distribution of a Poisson random variable is:
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!}$$
where:
- x is the number of events per unit (number of successes per unit)
- p(x) is the probability of x successes given knowledge of $\lambda$
- $\lambda$ is the average number of events per unit (average number of successes per unit)
- e = 2.71828 (constant)
[Figure: Poisson probability function] 2
The horizontal axis is the index k. The function is only defined at integer values of k (empty lozenges). The connecting lines are only guides for the eye.
Example 5.
If the probability that an individual is late for work on Friday is 0.001, determine the probability that, out of 2000 individuals,
a) exactly 3
b) more than 2
individuals will be late for work on Friday.
Solution:
p = 0.001 is the probability that an individual is late for work on Friday (a rare event, so we use the Poisson distribution)
$$\lambda = N \cdot p = 2000 \cdot 0.001 = 2$$
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-2} \cdot 2^x}{x!}$$
a)
$$p(3) = \frac{e^{-2} \cdot 2^3}{3!} = 0.18$$
There is an 18% chance that out of 2000 individuals exactly 3 will be late for work on Friday.
2 From Wikipedia, the free encyclopedia
b)
$$p(x > 2) = p(3) + p(4) + \ldots = 1 - \left[p(0) + p(1) + p(2)\right] = 1 - \left[\frac{e^{-2} \cdot 2^0}{0!} + \frac{e^{-2} \cdot 2^1}{1!} + \frac{e^{-2} \cdot 2^2}{2!}\right] = 0.323$$
There is a 32.3% chance that out of 2000 individuals more than 2 will be late for work on Friday.
Example 6.
Suppose that, on average, three customers arrive per minute at the bank during the noon to 1 p.m. hour. What is the probability that in a given minute exactly two customers will arrive?
Solution:
We are interested in the number of times a certain event will occur in a given unit of time, so we use the Poisson distribution.
$$\lambda = 3$$
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-3} \cdot 3^x}{x!}$$
$$p(2) = \frac{e^{-3} \cdot 3^2}{2!} = 0.224$$
There is a 22.4% probability that in a given minute exactly two customers will arrive.
Example 7.
If the probability that a randomly selected person is colour-blind is 0.3%, what is the probability that among 2800 persons we will find:
a) 4 colour-blind persons
b) more than 3 colour-blind persons
c) not more than 2 colour-blind persons.
Solution (by Excel):
p = 0.003 (0.3%) is a rare event, so we use the Poisson distribution
$$\lambda = n \cdot p = 2800 \cdot 0.003 = 8.4$$
$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{e^{-8.4} \cdot 8.4^x}{x!}$$
We will use the Excel function:
a) exactly 4 colour-blind persons
We ask for the probability at a point, not for the cumulative function, so for the option Cumulative we take False.
P(X = 4) = {=POISSON(4;8.4;FALSE)} = 0.046648, i.e. 4.6648%
b) more than 3 colour-blind persons; this is the opposite of the cumulative distribution, so for the option Cumulative we take True and at the end we find the probability of the opposite event:
1 - P(X ≤ 3) = 1 - {=POISSON(3;8.4;TRUE)} = 1 - 0.03226 = 0.96774, i.e. 96.774%
c) not more than 2 colour-blind persons; this is a cumulative distribution, so for the option Cumulative we take True.
P(X ≤ 2) = {=POISSON(2;8.4;TRUE)} = 0.010047, i.e. 1.0047%
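The three Excel results of this example can be cross-checked with a short Python sketch. The helper `poisson` is a hypothetical imitation of the spreadsheet call (point or cumulative probability for a given mean); it is not an Excel or library API.

```python
from math import exp, factorial

def poisson(x, lam, cumulative):
    """Point probability P(X = x) or cumulative probability P(X <= x)."""
    def pmf(k):
        return exp(-lam) * lam**k / factorial(k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

lam = 2800 * 0.003                          # expected number, 8.4

exactly_4 = poisson(4, lam, False)          # a) exactly 4
more_than_3 = 1 - poisson(3, lam, True)     # b) opposite of at most 3
at_most_2 = poisson(2, lam, True)           # c) not more than 2

print(round(exactly_4, 6), round(more_than_3, 5), round(at_most_2, 6))
```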
Characteristics of the Poisson distribution
Shape
The Poisson distribution is always positively (right) skewed.
Mean
$$\mu = E(X) = \lambda$$
Variance
$$\sigma^2 = E\left[(X - \mu)^2\right] = \lambda$$
$$\alpha_3 = \frac{1}{\sqrt{\lambda}}, \qquad \alpha_4 = 3 + \frac{1}{\lambda}.$$
The Poisson distribution can be derived as a limiting case to the binomial
distribution as the number of trials goes to infinity and the expected number of
successes remains fixed. Therefore it can be used as an approximation of the
binomial distribution if n is sufficiently large and p is sufficiently small. There is a
rule of thumb stating that the Poisson distribution is a good approximation of the
binomial distribution if n is at least 20 and p is smaller than or equal to 0.05.
According to this rule the approximation is excellent if $n \ge 100$ and $np \le 10$.
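To see how close the approximation is in practice, the sketch below compares binomial and Poisson point probabilities for the values used in Example 7 (n = 2800, p = 0.003). It is an illustration only; the function names are assumptions, and the range 0..30 was chosen simply because it covers almost all of the probability mass.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean lam."""
    return lam ** k * exp(-lam) / factorial(k)


n, p = 2800, 0.003
lam = n * p  # the Poisson mean is set equal to the binomial mean n * p

# Largest absolute difference between the two point probabilities for k = 0..30
largest_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(31)
)
print(largest_gap)
```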
Example 8.
In one office there is a copy machine. We want to determine the average number of
incorrect copies. We take samples of 1000 copies each; the number of samples was 250 and
the results are:
Number of incorrect copies    Number of samples
0                             10
1                             20
2                             40
3                             55
4                             50
5                             40
6                             15
7                             10
8                             5
9                             3
10                            2
Total                         250
We have to find an appropriate theoretical approximation for this empirical distribution.
Solution:
This is a discrete random variable. Each trial has two possible outcomes: the copy is either
correct or incorrect. This indicates that the appropriate theoretical distribution is the binomial
or the Poisson distribution. From the empirical frequency distribution we calculate the average
and the standard deviation. We cannot use an Excel function directly, because this is a grouped
distribution, so we set up formulas for the average and the standard deviation:
$N = 250,\ n = 100$
Result for average is:
We will find variance:
Result for variance is:
Since $\sigma^2 \approx \bar{X}$, we use the Poisson distribution with $\lambda = \bar{X} = 3.65$:
$$p_x = \frac{3.65^x}{x!}\, e^{-3.65}$$
In Excel we create a formula for the probabilities
$$p_x = \frac{3.65^x}{x!}\, e^{-3.65}, \quad x \ge 0$$
and then, from these theoretical probabilities, we compute the theoretical
frequencies $f_x^b = p_x \cdot N$:
With the Paste option we can fill the other cells in the column with theoretical probabilities.
The result is:
Now we calculate the theoretical frequencies:
With the Paste option we can fill the other cells in the column with theoretical frequencies.
The result is:
That was the procedure for approximation with the Poisson distribution. Now we have the
distribution for this variable and we can make predictions. The quality of the approximation is
measured by the error of approximation.
The error of approximation for each modality is:
$$d_k = f_k - f_k^b$$
2 2
1 1941.47
7.76
1 251
b k
d
n
o = = =
+
_
Approximation error is 7.76.
HYPERGEOMETRIC DISTRIBUTION
The hypergeometric distribution H(N,n,p) is the distribution for n dependent Bernoulli random
variables. It describes sampling without replacement. The symbols are:
N - number of elements in the population
M - number of elements in the population with characteristic A
n - number of elements in the sample
k - number of elements in the sample with characteristic A
where $n \le N$ and $k \le M \le N$.
$p_k^h$ is the probability that a sample from that population contains k elements with
characteristic A:
$$p_k^h = P(X = k) = \frac{C_{N_1}^{k}\, C_{N_2}^{n-k}}{C_N^{n}} = \frac{\binom{N_1}{k}\binom{N_2}{n-k}}{\binom{N}{n}},$$
where $N_1 = M$ and $N_2 = N - M$.
The expectation and variance are:
$$E(X) = n\,\frac{N_1}{N}; \qquad \sigma^2 = n\,\frac{N_1}{N}\,\frac{N_2}{N}\,\frac{N - n}{N - 1}$$
This distribution is applied in sampling procedures. When $n/N < 1/10$, we can
approximate the hypergeometric distribution with the binomial distribution.
Example 9.
In a firm, we have 10 economists and 22 employees with other vocations. What is the
probability that a sample of 8 employees will contain 3 employees with other vocations?
Solution:
$N = 32,\ n = 8,\ M = 22,\ k = 3$
$$p_k^h = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} = \frac{\binom{22}{3}\binom{10}{5}}{\binom{32}{8}} = 0.037 \rightarrow 3.7\%$$
Example 10.
In a population of 30 products, 30% are incorrect. We choose a sample of 4 products
without replacement. What is the probability that the sample contains not more than
2 incorrect products?
Solution (by Excel):
30% incorrect → there are 9 incorrect products in the population.
Without replacement → dependent events → hypergeometric distribution.
Not more than 2 incorrect products → 0 or 1 or 2 incorrect products.
We apply the Excel function for the hypergeometric distribution:
- probability that we select 0 incorrect products
{=HYPGEOMDIST(0;4;9;30)} = 0.218391 → 21.84%
- probability that we select 1 incorrect product
{=HYPGEOMDIST(1;4;9;30)} = 0.436782 → 43.68%
- probability that we select 2 incorrect products
{=HYPGEOMDIST(2;4;9;30)} = 0.275862 → 27.59%
- Finally, the probability of not more than 2 incorrect products is the sum of the
probabilities found above (the "or" rule for mutually exclusive events):
0.931034 → 93.1%
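The hypergeometric results of Examples 9 and 10 can be reproduced with binomial coefficients. This is a small sketch in Python; the function name `hypergeom_pmf` and its argument order (k, n, M, N, matching the order used by HYPGEOMDIST) are choices made for this illustration.

```python
from math import comb


def hypergeom_pmf(k: int, n: int, M: int, N: int) -> float:
    """P(X = k): k marked elements in a sample of n drawn without replacement
    from a population of N elements, M of which are marked."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


# Example 9: 3 of 8 sampled employees come from the 22 with other vocations
example_9 = hypergeom_pmf(3, 8, 22, 32)

# Example 10: not more than 2 incorrect products = 0, 1 or 2 incorrect products
example_10 = sum(hypergeom_pmf(k, 4, 9, 30) for k in range(3))

print(round(example_9, 3), round(example_10, 6))
```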
NORMAL DISTRIBUTION
The normal distribution, also called the Gaussian distribution, is an important family
of continuous probability distributions, applicable in many fields. Each member of the
family may be defined by two parameters, location and scale: the mean ("average", $\mu$)
and variance (standard deviation squared, $\sigma^2$) respectively.
The continuous probability density function of the normal distribution is the Gaussian
function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in (-\infty, +\infty)$$
where $\sigma > 0$ is the standard deviation and the real parameter $\mu$ is the expected value. To
indicate that a real-valued random variable X is normally distributed with mean $\mu$ and
variance $\sigma^2 \ge 0$, we write
$$X \sim N(\mu;\ \sigma^2)$$
Figure: Normal probability density function (see footnote 3). The red line is the standard normal distribution.
The standard normal distribution is the normal distribution with a mean of zero and a
variance of one (the red curve in the plot). According to the transformation
formula, this gives:
$$z = \frac{x - \mu}{\sigma}, \qquad \varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \qquad Z \sim N(0, 1),$$
$$E(Z) = 0, \quad \sigma_Z = 1, \quad \alpha_3 = 0, \quad \alpha_4 = 3$$
The probability density function has notable properties including:
symmetry about its mean
the mode and median both equal the mean
the inflection points of the curve occur one standard deviation away from the
mean, i.e. at $\mu - \sigma$ and $\mu + \sigma$.
The cumulative distribution function of a probability distribution, evaluated at a
number (lower-case) x, is the probability of the event that a random variable X with
that distribution is less than or equal to x. The cumulative distribution function of the
normal distribution is expressed in terms of the density function as follows:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{t - \mu}{\sigma}\right)^2} dt$$
3 From Wikipedia, the free encyclopedia
The cumulative distribution function of a probability distribution, evaluated at a
number (lower-case) z, is the probability of the event that a random variable Z with
that distribution is less than or equal to z. The cumulative distribution function of the
standardized normal distribution (red line) is expressed in terms of the density
function as follows:
$$F(z) = P(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}} dt$$
There are tables with values of cumulative distribution function of the standardized
normal distribution.
Rules for the standardized normal distribution
The rules for determining probabilities in the different cases with the standardized
normal distribution are:
1. $P(Z > z_i) = 1 - F(z_i)$
2. $P(z_i \le Z \le z_j) = F(z_j) - F(z_i)$, for $i < j$
5. $P(Z \le -z_i) = 1 - F(z_i)$
6. $P(-z_i < Z \le z_i) = F(z_i) - F(-z_i) = 2F(z_i) - 1$
The next two graphs illustrate how the area under the curve of the standardized normal
distribution (the probability) is determined:
1. $P(Z \le 1.25) = F(1.25)$
2. $P(Z \le -1.25) = F(-1.25) = P(Z > 1.25) = 1 - P(Z \le 1.25)$, so $F(-1.25) = 1 - F(1.25)$
Characteristic intervals for normal distribution
If $X \sim N(\mu;\ \sigma^2)$, then we have characteristic intervals for distances of one, two and
three standard deviations from the mean:
$P(\mu - \sigma \le X \le \mu + \sigma) = 68.3\%$
$P(\mu - 2\sigma \le X \le \mu + 2\sigma) = 95.4\%$
$P(\mu - 3\sigma \le X \le \mu + 3\sigma) = 99.7\%$
Example 5.
The tread life of a certain brand of tire has a normal distribution with mean 35000
miles and standard deviation 4000 miles. For a randomly selected tire, what is the
probability that its life is:
a) less than 37200 miles
b) more than 38000 miles
c) between 30000 and 36000 miles
d) less than 34000 miles
e) more than 33000 miles.
Solution:
$X \sim N(35000;\ 4000^2)$
First we have to standardize, i.e. transform x into z. We use the Excel function:
For probabilities with z scores we use the Excel function:
a) less than 37200 miles
Standardize, i.e. transform x into z.
Or by formula (transforming x into z):
$$P(X < 37200) = P\left(Z < \frac{37200 - 35000}{4000}\right) = P(Z < 0.55)$$
This is a table value of the cumulative function because z is positive and the relation is <. We are
not asking for the probability at a point but for the cumulative function, so we set the option
Cumulative to TRUE:
Or by formula:
$F(0.55) = 0.708840$ (from tables) → 70.884%
b) more than 38000 miles
Standardize, i.e. transform x into z.
Or by formula:
$$P(X > 38000) = P\left(Z > \frac{38000 - 35000}{4000}\right) = P(Z > 0.75) = 1 - F(0.75)$$
This is not a table value of the cumulative function because z is positive and the relation is >. We
are not asking for the probability at a point but for the cumulative function, so we set the option
Cumulative to TRUE, and at the end we apply the formula for the opposite event:
Or by formula for opposite events:
$1 - F(0.75) = 1 - 0.773373 = 0.226627$ (from tables) → 22.6627%
c) between 30000 and 36000 miles
First we standardize both limits.
Or by formula:
$$P(30000 < X < 36000) = P\left(\frac{30000 - 35000}{4000} < Z < \frac{36000 - 35000}{4000}\right) = P(-1.25 < Z < 0.25) = F(0.25) - F(-1.25)$$
Now we find the cumulative probabilities for both z scores and complete the formula:
$F(0.25) - F(-1.25) = 0.598706 - 0.10565 = 0.493056$ (from tables) → 49.3056%
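Parts a) to c) of the tire example can be checked with the `NormalDist` class from the Python standard library (available from Python 3.8). This is a sketch for verification only; the variable names are illustrative.

```python
from statistics import NormalDist

tire_life = NormalDist(mu=35000, sigma=4000)
standard = NormalDist()  # standard normal distribution N(0; 1)

# Standardization z = (x - mean) / standard deviation
z_a = (37200 - 35000) / 4000   # 0.55
z_b = (38000 - 35000) / 4000   # 0.75

p_a = standard.cdf(z_a)                              # a) P(X < 37200) = F(0.55)
p_b = 1 - standard.cdf(z_b)                          # b) P(X > 38000) = 1 - F(0.75)
p_c = tire_life.cdf(36000) - tire_life.cdf(30000)    # c) F(0.25) - F(-1.25)

print(round(p_a, 6), round(p_b, 6), round(p_c, 6))
```

Part c) is evaluated directly on the unstandardized distribution to show that both routes give the same probability.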
Minimum score needed to avoid a failing grade is 502.
Example 7.
A journal editor finds that the length of time that elapses between receipt of a
manuscript and a decision on publication follows a normal distribution with mean 18
weeks and standard deviation 4 weeks. The probability is 0.2 that it will take longer than
how many weeks before a decision is made on a manuscript?
Solution:
$X \sim N(18;\ 4^2)$
$P(X > x_0) = 0.2, \quad x_0 = ?$
This is the opposite of the tabulated cumulative probability, so we find z for the table value (1 - 0.2) = 0.8.
Or by formula:
$$P(Z > z_0) = 0.2 \Rightarrow P(Z < z_0) = 1 - 0.2 = 0.8 \Rightarrow z_0 = 0.85 \text{ (from tables)}$$
Transforming z back into x:
$$z_0 = \frac{x_0 - 18}{4} = 0.85 \Rightarrow x_0 = 21.4$$
With probability 0.2, it will take longer than 21.4 weeks before a decision is made on a manuscript.
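The inverse lookup in Example 7 (finding x for a given probability) can be sketched with `NormalDist.inv_cdf` from the Python standard library. The exact quantile is about 0.84 rather than the table value 0.85 used above, so the result differs slightly in the second decimal but rounds to the same 21.4 weeks.

```python
from statistics import NormalDist

decision_time = NormalDist(mu=18, sigma=4)

# P(X > x0) = 0.2 is the same as P(X <= x0) = 0.8
z0 = NormalDist().inv_cdf(0.8)      # quantile of the standard normal distribution
x0 = decision_time.inv_cdf(0.8)     # same quantile, transformed back to weeks

print(round(z0, 4), round(x0, 2))
```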
Example 14.
For 100 employees in one company we know the annual income (in 000 KM):
Annual income    Number of employees
60-62            5
62-64            20
64-66            42
66-68            27
68-70            6
We have to find an appropriate theoretical approximation for this empirical distribution.
Solution:
This is a continuous variable, so we use an approximation with the normal distribution. We
replace the intervals with their midpoints:
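First we compute the average and the standard deviation of the empirical distribution:
The result is:
Now we standardize the upper limits of the intervals:
$$z_i = \frac{L_{2,i} - \bar{X}}{\sigma_x} = \frac{L_{2,i} - 65.18}{1.9}$$
where $L_{2,i}$ is the upper limit of interval i.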
For these z scores we find the tabulated cumulative probabilities with the function
NORMSDIST. At the beginning and at the end we add new limits for x, $-\infty$ and $+\infty$,
so the cumulative probabilities for these limits are 0 and 1:
Then we find the theoretical probabilities for the normal distribution according to the
relation $p_{i+1} = F(z_{i+1}) - F(z_i)$, and the theoretical frequencies according to the
relation $f_{ti} = N \cdot p_i$:
That was the procedure for approximation with the normal distribution. Now we can find the
approximation error:
$$\sigma_n = \sqrt{\frac{1}{n}\sum_{i=1}^{5}(f_i - f_{ti})^2}$$
First we compute the squared distances between the theoretical and the empirical frequencies:
Then we create the formula for the approximation error:
Or by formula:
$$\sigma_n = \sqrt{\frac{1}{n}\sum_{i=1}^{5}(f_i - f_{ti})^2} = \sqrt{\frac{8.6772}{5}} = 1.32$$
The approximation error is 1.32.
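The whole fitting procedure of Example 14 can be sketched in Python as follows. The steps follow the text (midpoints, mean and standard deviation, cumulative probabilities at the upper limits, theoretical frequencies, approximation error); the variable names are illustrative, and because the code does not round intermediate values, the error it prints may differ slightly from the 1.32 obtained in the worked example.

```python
from math import sqrt
from statistics import NormalDist

intervals = [(60, 62), (62, 64), (64, 66), (66, 68), (68, 70)]
observed = [5, 20, 42, 27, 6]          # empirical frequencies f_i
N = sum(observed)                      # 100 employees

# Replace each interval with its midpoint and compute mean and deviation
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
mean = sum(m * f for m, f in zip(midpoints, observed)) / N
sd = sqrt(sum(f * (m - mean) ** 2 for m, f in zip(midpoints, observed)) / N)

# Cumulative probabilities at the upper limits; the last limit is +infinity (F = 1)
fitted = NormalDist(mean, sd)
cumulative = [fitted.cdf(hi) for _, hi in intervals[:-1]] + [1.0]

# Interval probabilities are differences of neighbouring cumulative values (first F = 0)
probabilities = [c - prev for prev, c in zip([0.0] + cumulative[:-1], cumulative)]
theoretical = [N * p for p in probabilities]   # theoretical frequencies f_ti

error = sqrt(
    sum((f - ft) ** 2 for f, ft in zip(observed, theoretical)) / len(observed)
)
print(round(mean, 2), round(sd, 2), [round(ft, 2) for ft in theoretical], round(error, 2))
```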
STUDENT t-DISTRIBUTION
The t distribution was constructed in 1908 by W. S. Gosset, who published it under the
pseudonym "Student", which is why we call it the Student t distribution. He developed
it while working on results from sampling methods. The density function is:
$$f(t) = \frac{1}{\sqrt{n-1}\; B\left(\frac{1}{2}, \frac{n-1}{2}\right)} \left(1 + \frac{t^2}{n-1}\right)^{-\frac{n}{2}},$$
where $B\left(\frac{1}{2}, \frac{n-1}{2}\right)$ is the beta function with parameters $\frac{1}{2}$ and $\frac{n-1}{2}$, and n is the number of
elements.
With the function F(t) we can compute the probability that the variable takes a value greater than a
fixed t, and tables with the corresponding probabilities are available.
The shape of the t distribution depends on n, where (n-1) is the number of degrees of freedom,
denoted $\nu$ (nu). The number of degrees of freedom is the number of independent observations
minus the number of parameters that define the distribution: $\nu = df = n - k$.
The Student distribution is wider than the normal distribution. For larger values of n (more
than 30) the Student distribution tends to the standardized normal distribution.
Unlike the normal distribution, the t distribution is not applied directly to concrete problems,
but it is very important for inferential statistics. So we will look at finding t when the
probability is known.
Example 15.
For $n = 9$ degrees of freedom, we have to find $t_0$ such that $P(-t_0 \le t \le t_0) = 0.99$. For the
same distribution we have to determine the function of probability if t = 2.54.
Solution:
This is the inverse situation, where we know the area (probability) between two symmetric t
scores. We use the Excel function for the two-tailed case:
We calculate for the opposite event:
Or by formula:
$$P(-t_0 \le t \le t_0) = 2S_9(t_0) - 1 = 0.99 \Rightarrow S_9(t_0) = 0.995 \Rightarrow t_0 \approx 3.25$$
Now we have to find the function of probability and the cumulative probability if t = 2.54.
We will use the function TDIST:
CHI-SQUARE ($\chi^2$) DISTRIBUTION
It applies in cases where we need to decide whether the difference between actual (observed)
and theoretical (expected) frequencies, or values of a variable (characteristic), is significant.
It is denoted by the Greek letter chi, $\chi$, and is defined as the sum of the squared
differences between the observed and expected values relative to the expected values, that is
$$\chi^2 = \sum_{i=1}^{r} \frac{(m_i - e_i)^2}{e_i},$$
$m_i$ - observed frequency
$e_i$ - expected (theoretical) frequency
This distribution can take values from 0 to $\infty$, so the values are never negative. It depends
on the number of degrees of freedom, and the chi-square distribution is different for each number
of degrees of freedom. The probability distributions are given in tables. The
tables give information up to 30 degrees of freedom; for more than 30 degrees of freedom
R. A. Fisher suggested using the expression $\sqrt{2\chi^2} - \sqrt{2\nu - 1}$,
which is approximately normally distributed, so that the table of areas under the normal
curve can be applied.
The arithmetic mean of the chi-square distribution is equal to the number of degrees of freedom $\nu$,
the mode is at $\chi^2 = \nu - 2$ (unless $\nu = 1$), the variance is $2\nu$ and the coefficient of
skewness is $\sqrt{2/\nu}$. From the expression for the coefficient of skewness it follows that this
distribution is very asymmetrical for a small number of degrees of freedom, and that
it approaches a symmetric distribution as the number of degrees of freedom increases.
Unlike the normal distribution, it has no independent application to specific problems, but
it is very important for inferential statistics. Therefore, we look at calculations
with the chi-square distribution.
Example 16.
If the number of degrees of freedom is 5 and we know the probability 0.9, we have to find the
corresponding value $\chi_0^2$ if the probability refers to $\chi^2 > \chi_0^2$. Under the same
conditions, find $\chi_0^2$ if the probability refers to $\chi^2 \le \chi_0^2$.
Solution:
$n = 5$
$\chi^2 > \chi_0^2$ - this is the direct relation for the Excel function CHIINV.
Or by formula:
$$P(\chi^2 > \chi_0^2) = 0.9 \Rightarrow \chi_0^2 = 1.61$$
The opposite event is $\chi^2 \le \chi_0^2$, so:
$P(\chi^2 \le \chi_0^2) = 0.9 \Rightarrow P(\chi^2 > \chi_0^2) = 1 - 0.9 = 0.1$. That means:
Or by formula:
$$P(\chi^2 > \chi_0^2) = 1 - 0.9 = 0.1 \Rightarrow \chi_0^2 = 9.24$$
F DISTRIBUTION
We suppose that:
X is a continuous random variable which has a chi-square distribution ($\chi^2$) with m
degrees of freedom, and
Y is a continuous random variable which has a chi-square distribution ($\chi^2$) with n
degrees of freedom.
These two variables are independent.
Then the variable F, defined as the ratio of these variables divided by their respective
degrees of freedom,
$$F = \frac{X/m}{Y/n},$$
follows the Fisher-Snedecor distribution with $(m, n)$ degrees of freedom. The probability
distribution is not symmetric with respect to m and n.
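The random variable takes values in the interval $(0, \infty)$ and the density has the
following form:
$$f(x) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n}\right)^{\frac{m}{2}} x^{\frac{m}{2}-1} \left(1 + \frac{m}{n}\,x\right)^{-\frac{m+n}{2}}, \quad x > 0,$$
where m and n represent degrees of freedom (df).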
The expected value and variance are:
$$\mu = \frac{n}{n-2} \quad \text{for } n > 2,$$
$$\sigma^2 = \frac{2n^2(m + n - 2)}{m(n-2)^2(n-4)} \quad \text{for } n > 4.$$
Fisher's (F) distribution is used when we want to analyze the variability of two
populations on the basis of samples. We use the F distribution to test hypotheses
about the equality of the variances of two samples through their ratio, based on the
number of degrees of freedom of each. When the populations concerned are normally
distributed, the ratio of two independent variance estimates is given in the form:
$$F = \frac{\sigma_1^2}{\sigma_2^2}$$
Example 17
Given the Fisher-Snedecor distribution, use it to determine $F_0$ if the numbers
of degrees of freedom are $\nu_1 = 4$, $\nu_2 = 7$ and the corresponding probability is
$P(F > F_0) = 0.05$.
Solution:
The relation is >, so we can directly apply the Excel function FINV:
Or, from tables: $p = 0.05,\ \nu_1 = 4,\ \nu_2 = 7 \Rightarrow F_0 = 4.12$
LOGNORMAL DISTRIBUTION
The characteristics of the lognormal distribution are as follows:
It is the probability distribution of random variables whose logarithms (base 10 or e)
follow a normal distribution.
It is skewed to the right (asymmetrical).
When we take logarithms of a variable whose distribution is skewed to the
right, the logarithms obtained follow a normal distribution.
As a measure of central tendency for the lognormal distribution we use the geometric
mean of x, or the arithmetic mean of ln(x) or log(x).
It is defined by the expectation and the standard deviation.
If ln(x) follows a normal distribution, then x has a lognormal distribution with the
density function:
$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma^2}(\ln x - \mu)^2\right], \quad x > 0,$$
where $\sigma$ is the standard deviation of ln x and $\mu$ is the expected value of ln(x).
Unlike the normal distribution, the lognormal distribution is not symmetric, but it tends to the
normal distribution if the value of $\sigma$ is less than 0.1.
The next graph shows the lognormal density function depending on
the value of the standard deviation:
In the Excel function LOGNORMDIST the arguments are:
x - the value at which we evaluate the function.
Mean - the average of ln(x) or log(x).
Standard-dev. - the standard deviation of ln(x) or log(x).
Example 18.
We have the following information: x = 4, the expected value of ln(x) is 3.5 and the standard
deviation is 1.2. How do we read the corresponding value of the lognormal distribution
function?
Solution:
We use the Excel function:
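Since ln(x) is normally distributed, the cumulative lognormal probability can be obtained from the normal distribution of ln(x). The sketch below does this in Python, on the assumption that the value asked for in Example 18 is the cumulative probability, which is what LOGNORMDIST returns; `lognormal_cdf` is a name introduced for this illustration.

```python
from math import log
from statistics import NormalDist


def lognormal_cdf(x: float, mean_ln: float, sd_ln: float) -> float:
    """P(X <= x) when ln(X) is normal with mean mean_ln and deviation sd_ln."""
    return NormalDist(mean_ln, sd_ln).cdf(log(x))


value = lognormal_cdf(4, 3.5, 1.2)
print(round(value, 4))
```

Here ln(4) is about 1.386, which lies roughly 1.76 standard deviations below the mean 3.5, so the cumulative probability is small.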
= >
I
,
4
where $\alpha > 0$ is the shape parameter and $\beta > 0$ is the scale parameter.
The expected value and variance of the gamma distribution are:
$$E(X) = \alpha\beta, \qquad \sigma^2 = \alpha\beta^2.$$
If $\alpha = 1$, the gamma distribution is equal to the exponential distribution. The gamma
distribution can take different shapes depending on $\alpha$ and $\beta$. This makes it a useful model
for a wide range of continuous random variables.
If $\alpha = \frac{n}{2}$ and $\beta = 2$ we get a special form of the gamma distribution, the chi-square
distribution, where n is the number of degrees of freedom.
The gamma distribution is used in the case of asymmetric distributions. It has practical
application in queueing theory.
Example 20.
Let the value of a continuous random variable be x = 8. How do we read the corresponding
value of the gamma distribution if the parameters are $\alpha = 6$ and $\beta = 2$? If the probability
that x is less than a given value is equal to 54%, determine that value.
Solution:
4 $\Gamma(\alpha)$ is the gamma function, defined as $\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$, $\alpha > 0$. If $\alpha$ is a positive integer, then $\Gamma(\alpha) = (\alpha - 1)!$.
For a known x, we determine the probability with the Excel function
GAMMADIST:
For the density at x = 8, we set the option Cumulative to FALSE:
For the probability that $x \le 8$, we set the option Cumulative to TRUE:
Now we find the value of x for which the probability that x is less than this value is equal
to 54%. We use the function GAMMAINV:
For this gamma distribution, the probability that x is less than 11.83 is equal to
54%.
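The values in Example 20 can be checked by hand when the shape parameter is a positive integer, because the gamma function then reduces to a factorial and the cumulative function has a closed form. The sketch below assumes the usual shape-scale form of the density, $f(x) = x^{\alpha-1} e^{-x/\beta} / (\beta^{\alpha}\,\Gamma(\alpha))$, which is consistent with the mean $\alpha\beta$ and variance $\alpha\beta^2$ stated above; the function names are illustrative.

```python
from math import exp, factorial


def gamma_pdf(x: float, alpha: int, beta: float) -> float:
    """Gamma density for integer shape alpha and scale beta (assumed shape-scale form)."""
    return x ** (alpha - 1) * exp(-x / beta) / (beta ** alpha * factorial(alpha - 1))


def gamma_cdf(x: float, alpha: int, beta: float) -> float:
    """P(X <= x) for integer shape alpha, using the closed-form finite sum."""
    y = x / beta
    return 1 - exp(-y) * sum(y ** k / factorial(k) for k in range(alpha))


density_at_8 = gamma_pdf(8, 6, 2)       # Cumulative = FALSE
prob_up_to_8 = gamma_cdf(8, 6, 2)       # Cumulative = TRUE
check_inverse = gamma_cdf(11.83, 6, 2)  # should be close to the 54% of the example

print(round(density_at_8, 4), round(prob_up_to_8, 4), round(check_inverse, 3))
```

The last line does not invert the distribution; it only confirms that the value 11.83 reported in the example corresponds to a cumulative probability of about 54%.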
APPROXIMATIONS FOR BINOMIAL, POISSON AND
HYPERGEOMETRIC DISTRIBUTION WITH NORMAL
DISTRIBUTION
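The conditions for approximating the binomial, Poisson and hypergeometric
distributions with the normal distribution are:
$H(N, n, p) \rightarrow B(n, p)$: $\frac{n}{N} < 10\%$
$B(n, p) \rightarrow P(\lambda)$: $n \ge 30$, $p \le 0.10$
$H(N, n, p) \rightarrow N(\mu, \sigma)$: $\frac{n}{N} < 10\%$, $n \ge 30$
$B(n, p) \rightarrow N(\mu, \sigma)$: $n \ge 20$, $np \ge 10$, $n(1 - p) \ge 10$
$P(\lambda) \rightarrow N(\mu, \sigma)$: $\lambda \ge 15$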
INFERENCIAL STATISTICS
EXAMPLES IN EXCEL AND SPSS
124
III. Inferential statistics: Estimation theory and
hypothesis testing
INFERENCE
Inferential statistics are used to draw inferences about a population from a sample. It
is very important that the chosen sample is randomly selected and representative for
the population. However you select the sample, there is always the likelihood of some
level of sampling error. But there is a rule: a larger sample leads to a smaller sampling
error.
Consider an experiment in which 10 subjects who performed a task after 24 hours of
sleep deprivation scored 12 points lower than 10 subjects who performed after a
normal night's sleep. Is the difference real or could it be due to chance? This is the
type of questions answered by inferential statistics.
There are two main methods used in inferential statistics: estimation and hypothesis
testing. In estimation, the sample is used to estimate a parameter and a confidence
interval about the estimate is constructed. A confidence interval gives an estimated
range of values which is likely to include an unknown population parameter, the
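estimated range being calculated from a given set of sample data (see footnote 5):
$$P(\hat{\theta} - h \le \theta \le \hat{\theta} + h) = 1 - \alpha$$
where:
$\hat{\theta}$ - statistic from the sample
$\theta$ - parameter of the population
h - margin around the estimate
$(1 - \alpha)$ - confidence
$\alpha$ - first type error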
In the most common use of hypothesis testing, a null hypothesis is put forward and it is
determined whether the data are strong enough to reject it. For the sleep deprivation
study, the null hypothesis would be that sleep deprivation has no effect on
performance.
Inferential statistics are used to make generalizations from a sample to a population.
Errors may occur. There are two sources of error that may result in a
sample's being different from (not representative of) the population from which it is
drawn. These are:
Sampling error - chance, random error; it decreases as the sample size increases.
Sample bias - constant error, due to an inadequate frame and design; it does not
depend on the size of the sample.
5 Definition taken from Valerie J. Easton and John H. McColl's Statistics Glossary v1.1
Inferential statistics take into account sampling error. These statistics do not
correct for sample bias.
THE DISTRIBUTION OF THE SAMPLE MEANS
The distribution of sample means has some interesting characteristics. First, if our
samples are big enough (a large n), then the sampling distribution will approximate a
normal distribution. Second, the mean of our sampling distribution, which is
sometimes designated $\mu_{\bar{X}}$, will be the same as the population mean. Together, these
two properties of sampling distributions comprise the central limit theorem.
Third, to compute probabilities from a normal distribution, you
have to know the standard deviation of the distribution. In this case, the standard
deviation of the sampling distribution is called the standard error of means, designated
$\sigma_{\bar{X}}$, and is calculated by dividing the population standard deviation by the square root
of n. In other words, the standard error of means can be calculated as:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$
The standard error of means depends on the sample size (n), so a larger sample leads to
a smaller standard error of means.
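The relation between the standard error and the sample size can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not part of the original examples: it draws 2000 samples of size 25 from a uniform population on (0, 1), whose standard deviation is $\sqrt{1/12}$, and compares the spread of the sample means with $\sigma/\sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)  # fixed seed so that the illustration is repeatable

population_sd = sqrt(1 / 12)   # standard deviation of the uniform(0, 1) population
n = 25                         # size of each sample
number_of_samples = 2000

# Mean of each simulated sample
sample_means = [
    mean([random.random() for _ in range(n)]) for _ in range(number_of_samples)
]

theoretical_se = population_sd / sqrt(n)   # sigma / sqrt(n)
empirical_se = pstdev(sample_means)        # spread of the simulated sample means

print(round(theoretical_se, 4), round(empirical_se, 4))
```

The two printed values should be close, and the mean of `sample_means` should be close to the population mean 0.5.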
CONFIDENCE INTERVAL FOR THE POPULATION MEAN
Standard deviation of the population is known
For a population with unknown mean $\mu$ and known standard deviation $\sigma$,
a confidence interval for the population mean, based on a simple random
sample of size n, is:
$$P(\bar{X} - z \cdot \sigma_{\bar{X}} \le \mu \le \bar{X} + z \cdot \sigma_{\bar{X}}) = 2F(z) - 1 = 1 - \alpha$$
where:
$\bar{X}$ is the mean of the sample,
z is the upper $(1 - \frac{\alpha}{2})$ critical value for the standard normal distribution and
depends on the required confidence,
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ is the standard error of means.
Knowing the standard deviation of the population is a rare situation.
Standard deviation of the population is not known
If the standard deviation of the population is not known, its unbiased estimator from the
sample is:
$$\hat{\sigma} = \sqrt{\frac{\sum_i (x_i - \bar{X})^2 f_i}{n - 1}}$$
where $\hat{\sigma}$ is the standard deviation from the sample.
In most practical research, the standard deviation for the population of interest is not
known. In this case, the population standard deviation $\sigma$ is replaced by the
estimated standard deviation from the sample $\hat{\sigma}$, also known as the standard error. Since
the standard error is an estimate for the true value of the standard deviation, the
distribution of the sample mean $\bar{X}$ is no longer normal with mean $\mu$ and standard
deviation $\frac{\sigma}{\sqrt{n}}$. Instead, the sample mean follows the t distribution with mean $\mu$ and
standard deviation $\frac{\hat{\sigma}}{\sqrt{n}}$. The t distribution is also described by its degrees of freedom.
For a sample of size n, the t distribution will have (n-1) degrees of freedom. The
notation for a t distribution with k degrees of freedom is $t_k$.
So, for a population with unknown mean $\mu$ and unknown standard deviation, a
confidence interval for the population mean, based on a simple random sample of size
n, is:
$$P(\bar{X} - t_{n-1} \cdot S_{\bar{X}} \le \mu \le \bar{X} + t_{n-1} \cdot S_{\bar{X}}) = 2S_{n-1}(t) - 1 = 1 - \alpha$$
where:
$\bar{X}$ is the mean of the sample,
$t_{n-1}$ is the upper $(1 - \frac{\alpha}{2})$ critical value for the t distribution with (n-1) degrees of
freedom,
$S_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}}$ is the approximation of the standard error of means.
Not knowing the standard deviation of the population is the most common situation.
As the sample size n increases, the t distribution becomes closer to the normal
distribution, since the standard error approaches the true standard deviation $\sigma$ for
large n. So, for sample size n > 30, we can use the normal instead of the t distribution.
Example 1.
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6
different samples of the liquid. What is the confidence interval for the population
mean at a 95% confidence level?
Solution:
First we calculate the statistics for the sample: Tools → Descriptive statistics. We
choose the option for the confidence interval and the appropriate significance level:
$1 - \alpha = 0.95 \Rightarrow \alpha = 0.05$
The standard deviation $\sigma$ of the population is unknown:
$$\bar{X} - t \cdot S_{\bar{X}} \le \mu \le \bar{X} + t \cdot S_{\bar{X}}$$
n < 30 and the standard deviation $\sigma$ of the population is unknown; we know only the standard
deviation $\hat{\sigma}$ of the sample → t distribution
$S_{n-1=6-1=5}(t) = 1 - \frac{\alpha}{2} = 0.975$ → from tables or with the Excel function TINV:
(The annotated output values are $S_{\bar{X}}$ and $t \cdot S_{\bar{X}}$.)
$$101.82 - 2.57 \cdot 0.402 \le \mu \le 101.82 + 2.57 \cdot 0.402$$
$$100.78 \le \mu \le 102.85 \quad (\alpha = 5\%)$$
The confidence interval for the population mean at a 95% confidence level is (100.78-
102.85).
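The interval for the boiling-temperature readings can be recomputed as follows. This is a sketch in Python; the Python standard library has no t distribution, so the critical value 2.571 (5 degrees of freedom, two-sided 95%) is entered by hand as an assumption taken from a t table.

```python
from math import sqrt
from statistics import mean, stdev

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]

n = len(readings)
x_bar = mean(readings)                 # sample mean
sigma_hat = stdev(readings)            # sample standard deviation (n - 1 in the denominator)
standard_error = sigma_hat / sqrt(n)   # S = sigma_hat / sqrt(n)

t_critical = 2.571                     # assumed table value for 5 degrees of freedom, 95%

lower = x_bar - t_critical * standard_error
upper = x_bar + t_critical * standard_error

print(round(x_bar, 2), round(standard_error, 3), round(lower, 2), round(upper, 2))
```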
Example 2.
According to the report for the year 2009, we have data about the predicted Recovery rate in
cents per dollar after closing a business (from
http://www.doingbusiness.org/CustomQuery/, predictions for the year 2009) for a sample
of 33 countries. The data are in an Excel sheet (A1-A33). We have to construct a
confidence interval for the Recovery rate for the population of all countries with first type
error 1%.
Solution:
To begin, we calculate the statistics for the sample of 33 countries: Tools →
Descriptive statistics:
$$\bar{X} - z \cdot S_{\bar{X}} \le \mu \le \bar{X} + z \cdot S_{\bar{X}}$$
n > 30, we do not know the standard deviation $\sigma$ of the population, we know only the
standard deviation $\hat{\sigma}$ of the sample → z distribution
$F(z) = 1 - \frac{\alpha}{2} = 0.995$ → from tables or with the Excel function NORMSINV:
$$52.91 - 2.58 \cdot 4.05 \le \mu \le 52.91 + 2.58 \cdot 4.05$$
$$42.48 \le \mu \le 63.34$$
The confidence interval for the Recovery rate for the population of all countries with first type
error 1% is (42.48-63.34).
Example 3.
The dataset "Normal Body Temperature, Gender, and Heart Rate" contains 130
observations of body temperature, along with the gender of each individual and his or
her heart rate. Sample mean is 98.249 and sample standard deviation is 0.733. Find a
99% confidence interval for the mean of population.
Solution:
$n = 130, \quad \bar{X} = 98.249, \quad \hat{\sigma} = 0.733, \quad 1 - \alpha = 0.99 \Rightarrow \alpha = 0.01$
n > 30, the standard deviation $\sigma$ of the population is unknown, we know only the standard
deviation $\hat{\sigma}$ of the sample → z distribution
$$\bar{X} - z \cdot S_{\bar{X}} \le \mu \le \bar{X} + z \cdot S_{\bar{X}}$$
$F(z) = 1 - \frac{\alpha}{2} = 0.995 \Rightarrow z = 2.58$ (from tables)
$$98.249 - 2.58 \cdot \frac{0.733}{\sqrt{130}} \le \mu \le 98.249 + 2.58 \cdot \frac{0.733}{\sqrt{130}}$$
$$98.08 \le \mu \le 98.41$$
The confidence interval for the population mean at a 99% confidence level is (98.08-
98.41).
CONFIDENCE INTERVAL FOR THE POPULATION
PROPORTIONS
Applying the general formula for a confidence interval, the confidence interval for a
proportion, $\pi$, is
$$\pi \in \left[p \pm z \cdot \sigma_p\right]$$
where:
p is the proportion in the sample,
z depends on the level of confidence desired, and
$\sigma_p$, the standard error of a proportion, is equal to:
$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$$
where:
$\pi$ is the proportion in the population and
n is the sample size.
Since $\pi$ is not known, p is used to estimate it. Therefore the estimated value of
$\sigma_p$ is:
$$S_p = \sqrt{\frac{p(1 - p)}{n}}$$
and then:
$$\pi \in \left[p \pm z \cdot S_p\right]$$
Example 4.
Based on the HBS 2004 database, we have information on 7413 households for the
variable marital status of the household holder. On the basis of this information, estimate
the proportion of households whose holder is married in the complete population of
households in B&H, with a first type error of 2%.
Solution:
First, for the sample of n=7413 households, we calculate the proportion of
households where the holder of the household is married:
The value of p is: p = 0.7124 and n = 7413.
$$S_p = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.7124 \cdot 0.2876}{7413}} = 0.005$$
$\alpha = 0.02 \Rightarrow F(z) = 1 - \frac{\alpha}{2} = 0.99$
$$\pi \in \left[p \pm z \cdot S_p\right]$$
$$0.7124 - 2.326 \cdot 0.005 \le \pi \le 0.7124 + 2.326 \cdot 0.005$$
$$0.701 \le \pi \le 0.724$$
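The proportion interval of Example 4 can be sketched in Python as below. The function name `proportion_interval` is introduced for this illustration. The code keeps the unrounded standard error (about 0.0053 instead of the rounded 0.005), so its limits differ from the worked example in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat: float, n: int, alpha: float) -> tuple[float, float]:
    """Confidence interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)        # F(z) = 1 - alpha / 2
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


lower, upper = proportion_interval(0.7124, 7413, 0.02)
print(round(lower, 4), round(upper, 4))
```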
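Large sample:
$$P\left(\frac{2n\hat{\sigma}^2}{\left(\sqrt{2n - 3} + z\right)^2} \le \sigma^2 \le \frac{2n\hat{\sigma}^2}{\left(\sqrt{2n - 3} - z\right)^2}\right) = 2F(z) - 1 = 1 - \alpha$$
$F(z) = 1 - \frac{\alpha}{2} \Rightarrow z$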
Example 1. cont.
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6
different samples of the liquid. What is the confidence interval for the population
variance at a 97% confidence level?
Solution:
It is necessary first to determine variance from the sample:
$$\sigma_i^2 = 0.9697$$

It is a small sample, so we use the chi-square distribution (Excel function CHIINV):

$$P\left( \chi^2 \le \chi^2_{n-1;\,1-\alpha/2} \right) = 1 - \frac{\alpha}{2} = 0.985 \;\Rightarrow\; \chi^2_{6-1=5;\;0.985} = 14.098$$

$$P\left( \chi^2 \le \chi^2_{n-1;\,\alpha/2} \right) = \frac{\alpha}{2} = 0.015 \;\Rightarrow\; \chi^2_{6-1=5;\;0.015} = 0.662$$

Now we can complete the expression for the confidence interval:

$$\frac{6 \cdot 0.9697}{14.098} \le \sigma^2 \le \frac{6 \cdot 0.9697}{0.662}$$

$$0.413 \le \sigma^2 \le 8.789$$

The confidence interval for the variance of the boiling temperature, with a reliability of 97%, is (0.413, 8.789).
Example 2. cont.
From the report for 2009 we have data on the predicted Recovery rate, in cents per dollar, after closing a business (from
http://www.doingbusiness.org/CustomQuery/, predictions for 2009) for a sample of 33 countries. The data are in an Excel sheet (A1-A33). We have to construct a confidence interval for the variance of the variable Recovery rate for the population of all countries, with a type I error of 4%.
Solution:
It is necessary first to determine variance from the sample:
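$$\sigma_i^2 = 541.16$$

It is a large sample, so we use the normal distribution (Excel function NORMDIST):

$$\frac{2 n \sigma_i^2}{\left( \sqrt{2n - 3} + z \right)^2} \le \sigma^2 \le \frac{2 n \sigma_i^2}{\left( \sqrt{2n - 3} - z \right)^2}$$

We need the value of z for a type I error of 4%:

$$F(z) = 1 - \frac{\alpha}{2} = 0.98 \;\Rightarrow\; z = 2.05$$

Now we can complete the expression for the confidence interval:

$$\frac{2 \cdot 33 \cdot 541.16}{\left( \sqrt{2 \cdot 33 - 3} + 2.05 \right)^2} \le \sigma^2 \le \frac{2 \cdot 33 \cdot 541.16}{\left( \sqrt{2 \cdot 33 - 3} - 2.05 \right)^2}$$

$$358.08 \le \sigma^2 \le 1030.49$$

The confidence interval for the variance of the variable Recovery rate for the population of all countries, with a type I error of 4%, is (358.08, 1030.49).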
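The large-sample formula is easy to evaluate numerically. The sketch below assumes the formula with $\sqrt{2n-3} \pm z$ in the denominators and takes z = 2.05 as a plain input, as in the worked example; the function name is illustrative.

```python
from math import sqrt


def variance_ci_large_sample(sample_variance, n, z):
    """Large-sample confidence interval for the population variance."""
    root = sqrt(2 * n - 3)
    numerator = 2 * n * sample_variance
    lower = numerator / (root + z) ** 2  # larger denominator gives the lower limit
    upper = numerator / (root - z) ** 2
    return lower, upper


lower, upper = variance_ci_large_sample(541.16, 33, 2.05)
print(f"{lower:.2f} <= sigma^2 <= {upper:.2f}")
```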
HOW TO DETERMINE SAMPLE SIZE ACCORDING TO SAMPLE ERROR?
Determining sample size for estimating population mean
Determining sample size is a very important issue because samples that are too large
may waste time, resources and money, while samples that are too small may lead to
inaccurate results. In many cases, we can easily determine the minimum sample size
needed to estimate a population parameter, such as the population mean $\mu$.
When sample data is collected and the sample mean $\bar{X}$ is calculated, that sample mean is typically different from the population mean $\mu$. This difference between the sample and population means can be thought of as an error. The margin of error $E_{\bar{X}}$ is the maximum difference between the observed sample mean $\bar{X}$ and the true value of the population mean $\mu$:

$$E_{\bar{X}} = z_{1-\alpha/2} \cdot \sigma_{\bar{X}} = z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$
where:
$z_{1-\alpha/2}$ is known as the critical value, the positive z value that is at the vertical boundary for the area of $\alpha/2$ in the right tail of the standard normal distribution.
$\sigma$ is the population standard deviation.
n is the sample size.
Rearranging this formula, we can solve for the sample size necessary to produce results accurate to a specified confidence and margin of error:

$$n = \left( z_{1-\alpha/2} \cdot \frac{\sigma}{E_{\bar{X}}} \right)^2$$

This formula can be used when you know $\sigma$ and want to determine the sample size necessary to establish, with a confidence of $(1 - \alpha)$, the mean value to within $E_{\bar{X}}$.
You can still use this formula if you don't know the population standard deviation $\sigma$ but have the standard deviation $\sigma_i$ of a sample:

$$n = \left( z_{1-\alpha/2} \cdot \frac{\sigma_i}{E_{\bar{X}}} \right)^2$$
Example 5.
A consumer group wants to estimate the mean electric bill for the month July for
single-family homes in a large city. Based on studies conducted in other cities, the
standard deviation is assumed to be $25. The group wants to estimate the mean bill
for July to within $5 of the true average with 95% confidence. What sample size is
needed?
Solution:
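$$\sigma = 25, \quad E_{\bar{X}} = 5, \quad \alpha = 0.05, \quad n = ?$$

$$\alpha = 0.05 \;\Rightarrow\; F(z) = 1 - \frac{\alpha}{2} = 0.975 \;\Rightarrow\; z_{1-\alpha/2} = 1.96$$

$$n = \left( z_{1-\alpha/2} \cdot \frac{\sigma}{E_{\bar{X}}} \right)^2 = \left( 1.96 \cdot \frac{25}{5} \right)^2 \approx 96$$

They need a sample of 96 single-family homes in a large city.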
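A short Python sketch of the same calculation follows. The function name is illustrative, and the function returns the unrounded value; the example rounds it to 96, while rounding up to the next whole number (97 here) is a common, more conservative convention.

```python
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, alpha):
    """Unrounded n = (z * sigma / E)^2 for estimating a mean to within +/- margin."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (z * sigma / margin) ** 2


n_raw = sample_size_for_mean(sigma=25, margin=5, alpha=0.05)
print(round(n_raw, 2))
```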
Determining sample size for estimating population proportion
To develop formula for determining the appropriate sample size needed when
constructing a confidence interval estimate of the proportion, recall equation for
confidence interval estimate of the proportion:
$$E_p = z_{1-\alpha/2} \cdot \sigma_p = z_{1-\alpha/2} \cdot \sqrt{\frac{\pi (1 - \pi)}{n}}$$

where:
$z_{1-\alpha/2}$ is known as the critical value, the positive value that is at the vertical boundary for the area of $\alpha/2$ in the right tail of the standard normal distribution.
$\pi$ is the proportion in the population.
n is the sample size.
Rearranging this formula, we can solve for the sample size necessary to produce results accurate to a specified confidence and margin of error.

$$n = \frac{z^2 \, \pi (1 - \pi)}{E_p^2}$$

This formula can be used when you know $\pi$ and want to determine the sample size necessary to establish, with a confidence of $(1 - \alpha)$, the proportion for the population to within $E_p$. You can still use this formula if you don't know the population proportion but have a proportion from a sample:

$$n = \frac{z^2 \, p (1 - p)}{E_p^2}$$
Example 6.
If you want to be 99% confident of estimating the population proportion to within an
error of 0.02 and there is historical evidence that the population proportion is
approximately 0.4, what sample size is needed?
Solution:
$$\alpha = 0.01, \quad \pi = 0.4, \quad E_p = 0.02, \quad n = ?$$

$$\alpha = 0.01 \;\Rightarrow\; F(z) = 1 - \frac{\alpha}{2} = 0.995 \;\Rightarrow\; z = 2.58$$

$$n = \frac{z^2 \, \pi (1 - \pi)}{E_p^2} = \frac{2.58^2 \cdot 0.4 \cdot 0.6}{0.02^2} = 3994$$

We need a sample of 3994 elements.
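The calculation in Example 6 can be reproduced as follows. This is a sketch with an illustrative function name; z is passed in as the tabulated value 2.58 so that the result matches the hand calculation (a more precise z of about 2.576 would give a slightly smaller n).

```python
def sample_size_for_proportion(pi, margin, z):
    """Unrounded n = z^2 * pi * (1 - pi) / E_p^2."""
    return z ** 2 * pi * (1.0 - pi) / margin ** 2


n_raw = sample_size_for_proportion(pi=0.4, margin=0.02, z=2.58)
print(round(n_raw))
```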
HYPOTHESIS TESTING
Hypothesis testing typically begins with some theory, claim, or assertion about a
particular parameter of a population. For example, for purposes of statistical analysis,
your initial hypothesis about the cereal example is that the process is working
properly, meaning that the mean fill is 368 grams, and no corrective action is needed.
The hypothesis that the population parameter is equal to the company specification is referred to as the null hypothesis. A null hypothesis is always one of status quo, and is identified by the symbol $H_0$. Here the null hypothesis is that the filling process is working properly, that the mean fill per box is the 368-gram specification. This can be stated as:
$H_0: \mu = 368$
Whenever a null hypothesis is specified, an alternative hypothesis must also be specified, one that must be true if the null hypothesis is found to be false. The alternative hypothesis $H_1$ is the opposite of the null hypothesis $H_0$. This is stated in the cereal example as:
$H_1: \mu \ne 368$
The alternative hypothesis represents the conclusion reached by rejecting the null hypothesis if there is sufficient evidence from the sample information to decide that the null hypothesis is unlikely to be true.
Hypothesis-testing methodology is designed so that the rejection of the null
hypothesis is based on evidence from the sample that the alternative hypothesis is far
more likely to be true. However, failure to reject the null hypothesis is not proof that it
is true. One can never prove that the null hypothesis is correct because the decision is
based only on the sample information, not on the entire population. Therefore, if you
fail to reject the null hypothesis, you can only conclude that there is insufficient
evidence to warrant its rejection.
The following key points summarize the null and alternative hypotheses:
1. The null hypothesis $H_0$ represents the status quo or the current belief in a situation.
2. The alternative hypothesis $H_1$ is the opposite of the null hypothesis and represents a research claim or specific inference we would like to prove.
3. If we reject the null hypothesis, we have statistical proof that the alternative hypothesis is correct.
4. The failure to prove the alternative hypothesis, however, does not mean that we have proven the null hypothesis.
5. The null hypothesis $H_0$ always refers to a specified value of the population parameter (such as $\mu$), not a sample statistic (such as $\bar{X}$).
6. The statement of the null hypothesis always contains an equal sign regarding the specified value of the population parameter ($H_0: \mu = 368$).
7. The statement of the alternative hypothesis never contains an equal sign regarding the specified value of the population parameter ($H_1: \mu \ne 368$).
Hypothesis-testing methodology provides clear definitions for evaluating such differences and enables us to quantify the decision-making process so that the probability of obtaining a given sample result can be found if the null hypothesis is true. This is achieved by first determining the sampling distribution for the sample statistic of interest (e.g. the sample mean) and then computing the particular test statistic based on the given sample result.
Regions of rejection and non-rejection
The sampling distribution of the test statistic is divided into two regions:
Region of rejection (critical region) and
Region of non-rejection.
If the test statistic falls into the region of non-rejection, the null hypothesis cannot be rejected. If the test statistic falls into the rejection region, the null hypothesis is rejected because that value is unlikely if the null hypothesis is true.
When we use a sample statistic to make a decision about a population parameter, there is a risk that an incorrect conclusion will be reached. Two different types of errors can occur when applying hypothesis testing methodology: type I errors and type II errors.
A type I error occurs if the null hypothesis $H_0$ is rejected when in fact it is true and should not be rejected. The probability of a type I error occurring is $\alpha$. A type II error occurs if the null hypothesis $H_0$ is not rejected when in fact it is false and should be rejected. The probability of a type II error occurring is $\beta$.
The confidence coefficient $(1 - \alpha)$ is the probability that the null hypothesis $H_0$ is not rejected when in fact it is true and should not be rejected. The power of a statistical test $(1 - \beta)$ is the probability of rejecting the null hypothesis when in fact it is false and should be rejected.
Risks in decision making process
The next table illustrates the results of the two possible decisions (do not reject $H_0$ or reject $H_0$) that can occur in any hypothesis test. Depending on the specific decision, one of two types of errors may occur or one of two types of correct conclusions may be reached.

                          Actual situation
Statistical decision      H0 true                    H0 false
Do not reject H0          Correct decision           Type II error
                          Confidence = (1 - α)       P(type II error) = β
Reject H0                 Type I error               Correct decision
                          P(type I error) = α        Power = (1 - β)
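The meaning of $\alpha$ in the table can be illustrated by simulation: if $H_0$ is true and we test repeatedly at level $\alpha$, roughly a fraction $\alpha$ of the tests reject $H_0$. The sketch below is only an illustration; the population standard deviation (15), the sample size (25) and the function name are assumptions and are not taken from the cereal example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_one_error_rate(mu0, sigma, n, alpha, trials, seed=0):
    """Fraction of two-tailed z tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_e = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z_e) > z_crit:
            rejections += 1  # a type I error
    return rejections / trials


rate = type_one_error_rate(mu0=368, sigma=15, n=25, alpha=0.05, trials=10000)
print(rate)  # expected to be close to alpha = 0.05
```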
Procedure for hypothesis testing
The procedure for hypothesis testing can be described in several steps:
1. Determine the null and alternative hypotheses.
2. Determine the critical value(s) of the test statistic according to the significance or confidence level and the appropriate theoretical distribution.
3. Calculate the test statistic from the sample values.
4. Compare the test statistic with the critical value(s) and draw a conclusion.
Hypothesis for the mean
We begin with the problem of testing the simple null hypothesis that the population mean is equal to, higher than or lower than some specified value $\mu_0$.
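$\sigma$ known
1. Two-tailed test
   1. $H_0: \mu = \mu_0$ / $H_1: \mu \ne \mu_0$
   2. $F(z_t) = \frac{\alpha}{2} \Rightarrow z_{\alpha/2}$ and $F(z_t) = 1 - \frac{\alpha}{2} \Rightarrow z_{1-\alpha/2}$, so $z_t \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right]$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{\sigma_{\bar{X}}}$
   4. $z_e \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_0$, $\quad z_e \notin \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_1$
2. One-tailed test
a.
   1. $H_0: \mu \ge \mu_0$ / $H_1: \mu < \mu_0$
   2. $F(z_t) = \alpha \Rightarrow z_t$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{\sigma_{\bar{X}}}$
   4. $z_e \ge z_t \Rightarrow H_0$, $\quad z_e < z_t \Rightarrow H_1$
b.
   1. $H_0: \mu \le \mu_0$ / $H_1: \mu > \mu_0$
   2. $F(z_t) = 1 - \alpha \Rightarrow z_t$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{\sigma_{\bar{X}}}$
   4. $z_e \le z_t \Rightarrow H_0$, $\quad z_e > z_t \Rightarrow H_1$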
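$\sigma$ unknown, small sample
($S_{n-1}$ denotes the distribution function of the Student t distribution with n-1 degrees of freedom.)
1. Two-tailed test
   1. $H_0: \mu = \mu_0$ / $H_1: \mu \ne \mu_0$
   2. $S_{n-1}(t_t) = \frac{\alpha}{2} \Rightarrow t_{\alpha/2}$ and $S_{n-1}(t_t) = 1 - \frac{\alpha}{2} \Rightarrow t_{1-\alpha/2}$, so $t_t \in \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right]$
   3. $t_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $t_e \in \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right] \Rightarrow H_0$, $\quad t_e \notin \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right] \Rightarrow H_1$
2. One-tailed test
a.
   1. $H_0: \mu \ge \mu_0$ / $H_1: \mu < \mu_0$
   2. $S_{n-1}(t_t) = \alpha \Rightarrow t_t$
   3. $t_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $t_e \ge t_t \Rightarrow H_0$, $\quad t_e < t_t \Rightarrow H_1$
b.
   1. $H_0: \mu \le \mu_0$ / $H_1: \mu > \mu_0$
   2. $S_{n-1}(t_t) = 1 - \alpha \Rightarrow t_t$
   3. $t_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $t_e \le t_t \Rightarrow H_0$, $\quad t_e > t_t \Rightarrow H_1$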
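$\sigma$ unknown, large sample
1. Two-tailed test
   1. $H_0: \mu = \mu_0$ / $H_1: \mu \ne \mu_0$
   2. $F(z_t) = \frac{\alpha}{2} \Rightarrow z_{\alpha/2}$ and $F(z_t) = 1 - \frac{\alpha}{2} \Rightarrow z_{1-\alpha/2}$, so $z_t \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right]$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $z_e \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_0$, $\quad z_e \notin \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_1$
2. One-tailed test
a.
   1. $H_0: \mu \ge \mu_0$ / $H_1: \mu < \mu_0$
   2. $F(z_t) = \alpha \Rightarrow z_t$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $z_e \ge z_t \Rightarrow H_0$, $\quad z_e < z_t \Rightarrow H_1$
b.
   1. $H_0: \mu \le \mu_0$ / $H_1: \mu > \mu_0$
   2. $F(z_t) = 1 - \alpha \Rightarrow z_t$
   3. $z_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}}$
   4. $z_e \le z_t \Rightarrow H_0$, $\quad z_e > z_t \Rightarrow H_1$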
Example 7.
Studies have shown that the average height of adult European males is 176.28 cm. Determine whether there is a statistically significant difference between the average height of adult men in the city of Sarajevo, based on a sample of 48 male citizens of Sarajevo (data in an Excel table), and the European average, with a type I error of 5%.
Solution:
It is necessary first to determine the sample average height and standard deviation:
We do not know the standard deviation of the population and the sample is large, so this is a two-tailed z test:
1. $H_0: \mu = 176.28$ / $H_1: \mu \ne 176.28$
2. $F(z_t) = \frac{\alpha}{2} = 0.025$ and $F(z_t) = 1 - \frac{\alpha}{2} = 0.975 \;\Rightarrow\; z_t \in [-1.96,\; 1.96]$
3. $S_{\bar{X}} = \dfrac{\sigma_i}{\sqrt{n}} = \dfrac{8.61}{\sqrt{48}} = 1.24$, $\quad z_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}} = \dfrac{182.33 - 176.28}{1.24} = 4.87$
4. $z_e \notin [-1.96,\; 1.96] \Rightarrow H_1$
There is a significant difference between the average height of adult men in the city of Sarajevo and the European average (5% error).
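The four steps of the two-tailed z test can be sketched in Python as follows. The function name is illustrative; the sample mean and standard deviation are the unrounded values from the SPSS output for this example (182,3333 and 8,61551), which is an assumption about which inputs to use.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, sample_sd, n, mu0, alpha):
    """Return (z_e, z_t, reject_h0) for H0: mu = mu0 against H1: mu != mu0."""
    z_e = (sample_mean - mu0) / (sample_sd / sqrt(n))  # empirical statistic
    z_t = NormalDist().inv_cdf(1.0 - alpha / 2.0)      # critical value
    return z_e, z_t, abs(z_e) > z_t


z_e, z_t, reject_h0 = two_tailed_z_test(182.3333, 8.61551, 48, 176.28, 0.05)
print(round(z_e, 2), round(z_t, 2), reject_h0)
```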
Or SPSS solution:

One-Sample Statistics
          N    Mean       Std. Deviation   Std. Error Mean
height    48   182,3333   8,61551          1,24354

One-Sample Test (Test Value = 176.28)
                                                           95% Confidence Interval of the Difference
          t       df   Sig. (2-tailed)   Mean Difference   Lower     Upper
height    4,868   47   ,000              6,05333           3,5517    8,5550

The p value for the t test is p = 0,000 < 0,05 ⇒ $H_1$.
Example 8.
The director of admissions at a large university advises parents of incoming students about the cost of food during a typical semester. A sample of 80 students enrolled in the university indicates a sample mean cost of $315.4 with a sample standard deviation of $43.2. Using the 0.01 level of significance, is there evidence that the population mean is less than $320?
Solution:
Given: sample standard deviation 43.2, n = 80, sample mean 315.4, hypothesized mean 320.
We don't know the standard deviation of the population, the sample is large, and this is a one-tailed z test:
1. H0: μ ≥ 320 / H1: μ < 320
2. F(z_t) = α = 0.01 ⇒ z_t = -2.33
3. S_X = 43.2 / √80 = 4.83, z_e = (315.4 - 320) / 4.83 = -0.95
4. z_e > z_t ⇒ H0
There is no evidence that the population mean is less than $320.
We cannot use the SPSS procedure directly, because this is a one-tailed test.
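Since the one-tailed test has to be done by hand, a small Python sketch may help. It computes the empirical z value for Example 8 and, as an addition not shown in the worked solution, the lower-tail p value $F(z_e)$; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def lower_tailed_z_test(sample_mean, sample_sd, n, mu0, alpha):
    """Test H0: mu >= mu0 against H1: mu < mu0 with a large-sample z test."""
    z_e = (sample_mean - mu0) / (sample_sd / sqrt(n))
    p_value = NormalDist().cdf(z_e)  # area to the left of z_e
    return z_e, p_value, p_value < alpha


z_e, p_value, reject_h0 = lower_tailed_z_test(315.4, 43.2, 80, 320, 0.01)
print(round(z_e, 2), reject_h0)
```

Rejecting when the p value is below $\alpha$ is equivalent to rejecting when $z_e < z_t$, so the conclusion agrees with step 4 above.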
Example 9.
A manufacturer of flashlight batteries took a sample of 13 batteries from a day's production and used them continuously until they failed to work. The life, as measured by the number of hours until failure, was:
342, 426, 317, 545, 264, 451, 1049, 631, 512, 266, 492, 562, 298.
At the level of significance 0.1, is there evidence that the mean life of the batteries is
more than 350 hours?
Solution:
From the original data we calculate:

$$n = 13, \quad \bar{X} = 473.46, \quad \sigma_i = 210.77, \quad \mu_0 = 350$$

We don't know the standard deviation of the population, the sample is small, and this is a one-tailed t test:
1. $H_0: \mu \le 350$ / $H_1: \mu > 350$
2. $S_{n-1=12}(t_t) = 1 - \alpha = 0.9 \;\Rightarrow\; t_t = 1.356$
3. $S_{\bar{X}} = \dfrac{\sigma_i}{\sqrt{n}} = \dfrac{210.77}{\sqrt{13}} = 58.45$, $\quad t_e = \dfrac{\bar{X} - \mu_0}{S_{\bar{X}}} = \dfrac{473.46 - 350}{58.45} = 2.11$
4. $t_e > t_t \Rightarrow H_1$
There is evidence that the mean life of the batteries is more than 350 hours.
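The test statistic for Example 9 can be recomputed directly from the 13 observations. In this sketch the critical value is not computed but passed in as a number read from a t table (about 1.356 for 12 degrees of freedom and an upper-tail area of 0.10); the names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

hours = [342, 426, 317, 545, 264, 451, 1049, 631, 512, 266, 492, 562, 298]


def one_sample_t(data, mu0):
    """Empirical t statistic, using the sample standard deviation (n - 1)."""
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / sqrt(n))


T_CRITICAL = 1.356  # from a t table: 12 df, upper-tail area 0.10 (assumed input)
t_e = one_sample_t(hours, 350)
print(round(t_e, 2), t_e > T_CRITICAL)
```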
A two sample test for mean
Means are used to summarize distributions based on continuous data (interval or ratio
measurement). A statistical measure called the t test is used to test for the significance
of the difference between two means. The t test assesses the degree of overlap in the
distribution of scores in each of two samples being compared. When the two
distributions are highly similar, there will be little difference between the means.
When scores in one distribution are distributed differently from the other, there is a
greater probability that the difference between the means will be greater.
A t test can be used with large or small samples. However, as the sample size
becomes smaller, mean differences have to be larger to become significant. In
addition to the requirement of continuous measurement, the t test assumes that the
variable being measured is normally distributed in the population from which the
sample was selected. Even when distributions for samples are mildly skewed, it may
be reasonable to assume a normal distribution for the variable in the population.
However, when the distribution for a sample is badly skewed or you doubt that the
variable is normally distributed in the population, you should not use a t test. As an
alternative you can compare medians or convert continuous data to a set of intervals
and conduct a chi square test.
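We have two main types of test for the significance of the difference between two means:
1. If $n_1 + n_2 - 2 > 30$ → z distribution
   1. $H_0: \mu_1 = \mu_2$ / $H_1: \mu_1 \ne \mu_2$
   2. $F(z_t) = \frac{\alpha}{2} \Rightarrow z_{\alpha/2}$ and $F(z_t) = 1 - \frac{\alpha}{2} \Rightarrow z_{1-\alpha/2}$, so $z_t \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right]$
   3. $z_e = \dfrac{\bar{X}_1 - \bar{X}_2}{S_d}$, $\quad S_d = \sqrt{\dfrac{(n_1 - 1)\,\sigma_{i1}^2 + (n_2 - 1)\,\sigma_{i2}^2}{n_1 + n_2 - 2} \cdot \dfrac{n_1 + n_2}{n_1 n_2}}$
   4. $z_e \in \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_0$, $\quad z_e \notin \left[ z_{\alpha/2},\; z_{1-\alpha/2} \right] \Rightarrow H_1$
2. If $n_1 + n_2 - 2 \le 30$ → t distribution
   1. $H_0: \mu_1 = \mu_2$ / $H_1: \mu_1 \ne \mu_2$
   2. $S_{n_1+n_2-2}(t_t) = \frac{\alpha}{2} \Rightarrow t_{\alpha/2}$ and $S_{n_1+n_2-2}(t_t) = 1 - \frac{\alpha}{2} \Rightarrow t_{1-\alpha/2}$, so $t_t \in \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right]$
   3. $t_e = \dfrac{\bar{X}_1 - \bar{X}_2}{S_d}$
   4. $t_e \in \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right] \Rightarrow H_0$, $\quad t_e \notin \left[ t_{\alpha/2},\; t_{1-\alpha/2} \right] \Rightarrow H_1$
We also have different procedures depending on whether the samples are independent or dependent.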
Example 10.
We conducted research on the impact of lack of sleep on the ability to solve mathematical tasks. A mathematics test was first given to a sample of 30 people under "normal" circumstances. After that, they were not allowed to sleep for 72 hours, and a parallel test was applied (the test results are in an Excel table). Is there a significant difference between the results of the 1st and 2nd testing? The data are in the table; use a reliability of 0.94.
Solution:
$n_1 + n_2 - 2 > 30$ → z distribution, paired samples
1. $H_0: \mu_1 = \mu_2$ / $H_1: \mu_1 \ne \mu_2$
Data used in the analysis of paired samples:
t-Test: Paired Two Sample for Means
                               Variable 1   Variable 2
Mean                           28,13333     26,06667
Variance                       45,29195     32,61609
Observations                   30           30
Pearson Correlation            0,853868
Hypothesized Mean Difference   0
df                             29
t Stat                         3,231368
P(T<=t) one-tail               0,001531
t Critical one-tail            1,601972
P(T<=t) two-tail               0,003063
t Critical two-tail            1,957293

p = 0,003 < 0,06 ⇒ $H_1$
There is a significant difference between the population averages, which confirms that lack of sleep has an impact on the ability to solve mathematical tasks.
Or SPSS variant:

Paired Samples Statistics
                 Mean      N    Std. Deviation   Std. Error Mean
Pair 1   test1   28,1333   30   6,72993          1,22871
         test2   26,0667   30   5,71105          1,04269

Paired Samples Correlations
                         N    Correlation   Sig.
Pair 1   test1 & test2   30   ,854          ,000

Paired Samples Test
                         Paired Differences
                                                                      95% Confidence Interval of the Difference
                         Mean      Std. Deviation   Std. Error Mean   Lower     Upper      t       df   Sig. (2-tailed)
Pair 1   test1 - test2   2,06667   3,50304          ,63956            ,75861    3,37472    3,231   29   ,003
Of course the results are the same.
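The paired t statistic reported by both programs is the mean of the paired differences divided by its standard error. A minimal sketch, using the summary values from the SPSS table above (mean difference 2,06667, standard deviation of differences 3,50304, 30 pairs) and an illustrative function name:

```python
from math import sqrt


def paired_t_from_summary(mean_diff, sd_diff, n_pairs):
    """t = mean of paired differences / (sd of differences / sqrt(n))."""
    standard_error = sd_diff / sqrt(n_pairs)
    return mean_diff / standard_error


t_stat = paired_t_from_summary(2.06667, 3.50304, 30)
print(round(t_stat, 3))
```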
A two sample test for variances
To test hypotheses about the (non)existence of a difference between the variances of two populations on the basis of their samples, we use the F test:
1. $H_0: \sigma_1^2 = \sigma_2^2$ / $H_1: \sigma_1^2 > \sigma_2^2$
2. $P(F \le F_{teor}) = 1 - \alpha \Rightarrow F_{teor}$, with $\nu_1 = n_1 - 1$, $\nu_2 = n_2 - 1$ degrees of freedom
3. $F_{izr} = \dfrac{S_{u1}^2}{S_{u2}^2}$
4. $F_{izr} < F_{teor} \Rightarrow H_0$, $\quad F_{izr} > F_{teor} \Rightarrow H_1$
Example 11.
In class 4.b of the 1st Gymnasium, emotional intelligence was measured. The results of the test are given in an Excel table. Is there a statistically significant difference in emotional intelligence between genders ($\alpha = 5\%$)?
Solution:
$n_1 + n_2 - 2 < 30$ → t distribution, independent samples
1. $H_0: \mu_1 = \mu_2$ / $H_1: \mu_1 \ne \mu_2$
We use the analysis for independent samples, but first we apply the F test to check whether the variances are equal:
F-Test Two-Sample for Variances
                      Variable 1   Variable 2
Mean                  83,3125      77,58333
Variance              35,42917     28,26515
Observations          16           12
df                    15           11
F                     1,253458
P(F<=f) one-tail      0,358325
F Critical one-tail   2,71864

The p value of the F test is greater than 0.05 ⇒ the variances are equal.
t-Test: Two-Sample Assuming Equal Variances
                               Variable 1   Variable 2
Mean                           77,58333     83,3125
Variance                       28,26515     35,42917
Observations                   12           16
Pooled Variance                32,39824
Hypothesized Mean Difference   0
df                             26
t Stat                         -2,63574
P(T<=t) one-tail               0,006984
t Critical one-tail            1,705618
P(T<=t) two-tail               0,013969
t Critical two-tail            2,055529

The p value of the t test is less than 0.05 ⇒ the averages are not equal, and the conclusion follows that there are significant differences in emotional intelligence between genders.
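The pooled variance and the t statistic in the Excel output can be reproduced from the group summaries. The following is a sketch with an illustrative function name, using the means, variances and sample sizes printed in the table above:

```python
from math import sqrt


def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Equal-variance two-sample t statistic from group summaries."""
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    s_d = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))  # standard error of the difference
    return pooled_var, (mean1 - mean2) / s_d


pooled_var, t_stat = pooled_t(77.58333, 28.26515, 12, 83.3125, 35.42917, 16)
print(round(pooled_var, 5), round(t_stat, 5))
```

Note that $\frac{1}{n_1} + \frac{1}{n_2} = \frac{n_1 + n_2}{n_1 n_2}$, so this is the same standard error $S_d$ as in the general procedure.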
Or SPSS test:
Both samples are entered in the same column, and a separate grouping column identifies the gender:

Group Statistics
     gender   N    Mean      Std. Deviation   Std. Error Mean
EI   M        12   77,5833   5,31650          1,53474
              16   83,3125   5,95224          1,48806

Independent Samples Test
                                   Levene's Test for
                                   Equality of Variances   t-test for Equality of Means
                                                                                                                                  95% Confidence Interval of the Difference
                                   F      Sig.             t        df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower       Upper
EI   Equal variances assumed       ,412   ,526             -2,636   26       ,014              -5,72917          2,17365                 -10,19716   -1,26117
     Equal variances not assumed                           -2,680   25,122   ,013              -5,72917          2,13770                 -10,13075   -1,32758

The p value of the t test is less than 0.05 ⇒ the averages are not equal, and the conclusion follows that there are significant differences in emotional intelligence between genders.
Example 12.
Company X wants to check whether the age of employees affects the number of days of sick leave. A random sample of 14 younger employees (20 to 30 years) and 14 older employees (50 to 60 years) was selected. Data on the number of days of sick leave in 2008 are given in an Excel table. Is there a statistically significant difference in the number of days of sick leave between the two age groups, at a reliability of 99%?
Solution:
$n_1 + n_2 - 2 < 30$ → t distribution, independent samples
1. $H_0: \mu_1 = \mu_2$ / $H_1: \mu_1 \ne \mu_2$
We use the analysis for independent samples, but first we apply the F test to check whether the variances are equal:
F-Test Two-Sample for Variances
                      Variable 1   Variable 2
Mean                  7,071429     5,714286
Variance              136,2253     18,83516
Observations          14           14
df                    13           13
F                     7,232497
P(F<=f) one-tail      0,000542
F Critical one-tail   3,905204

The p value of the F test is lower than 0,01 ⇒ the variances are not equal.
t-Test: Two-Sample Assuming Unequal Variances
                               Variable 1   Variable 2
Mean                           5,714286     7,071429
Variance                       18,83516     136,2253
Observations                   14           14
Hypothesized Mean Difference   0
df                             17
t Stat                         -0,40779
P(T<=t) one-tail               0,344258
t Critical one-tail            2,566934
P(T<=t) two-tail               0,688516
t Critical two-tail            2,898231

The p value of the t test is greater than 0.01 ⇒ the averages are equal; the conclusion is that there is no statistically significant difference in the number of days of sick leave between the two age groups.
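When the variances are not assumed equal, the standard error uses each group's own variance and the degrees of freedom come from the Welch-Satterthwaite approximation. This sketch (illustrative function name, inputs from the Excel table above) returns about 16.5 degrees of freedom, which Excel reports rounded to 17:

```python
from math import sqrt


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = var1 / n1, var2 / n2
    t_stat = (mean1 - mean2) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df


t_stat, df = welch_t(5.714286, 18.83516, 14, 7.071429, 136.2253, 14)
print(round(t_stat, 5), round(df, 3))
```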
Or SPSS test:
Both samples are entered in the same column, and the preceding column identifies the age group. Choose Compare Means - Independent samples:

ANOVA
daysofsick
                 Sum of Squares   df   Mean Square   F      Sig.
Between Groups   13,000           2    6,500         ,109   ,897
Within Groups    2335,286         39   59,879
Total            2348,286         41

Group Statistics
             grupa   N    Mean     Std. Deviation   Std. Error Mean
daysofsick   M       14   5,7143   4,33995          1,15990
             S       14   7,0714   11,67156         3,11936

Independent Samples Test
                                           Levene's Test for
                                           Equality of Variances   t-test for Equality of Means
                                                                                                                                         95% Confidence Interval of the Difference
                                           F       Sig.            t       df       Sig. (2-tailed)   Mean Difference   Std. Error Difference   Lower      Upper
daysofsick   Equal variances assumed       8,536   ,007            -,408   26       ,687              -1,35714          3,32802                 -8,19799   5,48371
             Equal variances not assumed                           -,408   16,527   ,689              -1,35714          3,32802                 -8,39398   5,67970

The conclusion is the same.
Testing differences between arithmetic means of more than two populations on the basis of their samples - analysis of variance (ANOVA)
The aim of the analysis of variance is to test whether there is a difference between the arithmetic means of several populations, on the basis of their samples, by comparing their variances. In other words, we want to investigate the influence of k different factors $A_1, A_2, \ldots, A_k$ on one characteristic. Therefore, we have k samples, and only one factor acts on the elements of each sample. For example, we investigate the influence of different fertilizers on the harvest yield of some kind of wheat. If the number of elements in the i-th sample is $n_i$, and if we denote the j-th element of the i-th sample by $x_{ij}$, we have the following results of measurements:

$$\begin{array}{cccc}
x_{11} & x_{12} & \ldots & x_{1 n_1} \\
x_{21} & x_{22} & \ldots & x_{2 n_2} \\
\ldots & \ldots & \ldots & \ldots \\
x_{i1} & x_{i2} & \ldots & x_{i n_i} \\
\ldots & \ldots & \ldots & \ldots \\
x_{k1} & x_{k2} & \ldots & x_{k n_k}
\end{array}$$

The arithmetic means and variances of these samples are:

$$\bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \quad i = 1, \ldots, k$$

$$\sigma_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X}_i \right)^2, \quad i = 1, \ldots, k$$

If all of these samples are joined into one sample, we get a sample with $n = \sum_{i=1}^{k} n_i$ elements, with arithmetic mean

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{k} n_i \bar{X}_i$$

and total variance

$$S_t^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X} \right)^2 .$$

As

$$\sum_{j=1}^{n_i} \left( x_{ij} - \bar{X} \right)^2 = \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X}_i + \bar{X}_i - \bar{X} \right)^2 = \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X}_i \right)^2 + n_i \left( \bar{X}_i - \bar{X} \right)^2 ,$$

then

$$S_t^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X}_i \right)^2 + \sum_{i=1}^{k} n_i \left( \bar{X}_i - \bar{X} \right)^2 = S_r^2 + S_A^2 ,$$

where

$$S_r^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( x_{ij} - \bar{X}_i \right)^2 = \sum_{i=1}^{k} n_i \sigma_i^2$$

is the residual variance and

$$S_A^2 = \sum_{i=1}^{k} n_i \left( \bar{X}_i - \bar{X} \right)^2$$

is the factorial variance.
The degrees of freedom are: $n - 1$ for $S_t^2$, $k - 1$ for $S_A^2$ and $n - k$ for $S_r^2$.
The appropriate estimates of the variances are:
$W_t = \dfrac{S_t^2}{n - 1}$ - this is the estimate of the total variance of the population; it is a result of fluctuations in the sample as well as of all other causes that influence the observed characteristic.
$W_A = \dfrac{S_A^2}{k - 1}$ - this is the estimate of the variance between the groups of samples; it is a result of fluctuations in the sample and of the different actions of the factors. Therefore it is called the factorial variance.
$W_r = \dfrac{S_r^2}{n - k}$ - this is the estimate of the total variance of the population from which the influence of the factors has been eliminated. It is a product of the fluctuations in the sample and, therefore, is called the residual variance.
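The decomposition $S_t^2 = S_r^2 + S_A^2$ and the two variance estimates can be illustrated with a small sketch. The wheat yields below are invented purely for illustration (three hypothetical fertilizers, three plots each), and the function name is an assumption; the last returned value is the quotient of the factorial and residual estimates.

```python
def one_way_anova(groups):
    """Return (S_t^2, S_A^2, S_r^2, W_A / W_r) for a list of samples."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Factorial part: spread of the group means around the overall mean.
    s_a = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Residual part: spread of the observations around their own group mean.
    s_r = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    s_t = sum((x - grand_mean) ** 2 for g in groups for x in g)
    w_a = s_a / (k - 1)
    w_r = s_r / (n - k)
    return s_t, s_a, s_r, w_a / w_r


# Hypothetical yields for three fertilizers (illustrative data only).
yields = [[20, 22, 24], [25, 27, 29], [30, 32, 34]]
s_t, s_a, s_r, f_ratio = one_way_anova(yields)
print(s_t, s_a, s_r, f_ratio)
```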
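If there is no difference in the effects of the different factors on the observed characteristic, the variances $W_A$ and $W_r$ should represent the same variance, and the quotient is

$$F_i = \frac{W_A}{W_r} = \frac{n - k}{k - 1} \cdot \frac{S_A^2}{S_r^2}$$

$$\chi^2_{izr} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left( m_{ij} - e_{ij} \right)^2}{e_{ij}}, \qquad e_{ij} = \frac{n_{i \cdot} \, n_{\cdot j}}{n}$$

4. $\chi^2_e < \chi^2_t \Rightarrow H_0$, $\quad \chi^2_e > \chi^2_t \Rightarrow H_1$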
Example 14.
A large corporation is interested in determining whether an association exists between the commuting time of its employees and the level of stress-related problems observed on the job. A study of 116 assembly line workers reveals the following:

                  Stress
Commuting time    high   moderate   low   total
Under 15 min      9      5          18    32
15-45 min         17     8          28    53
Over 45 min       18     6          7     31
total             44     19         53    116
At the level of significance 0.01, is there evidence of a significant relationship between commuting time and stress?
Solution:
In the contingency table we have the empirical frequencies. We calculate the theoretical frequencies by the formula:

$$f_t = \frac{(\text{row total}) \cdot (\text{column total})}{\text{overall sample size } n}$$

where the row total is the sum of all frequencies in the row and the column total is the sum of all frequencies in the column.

Theoretical frequencies
                  Stress
Commuting time    high       moderate   low        total
Under 15 min      12,13793   5,241379   14,62069   32
15-45 min         20,10345   8,681034   24,21552   53
Over 45 min       11,75862   5,077586   14,16379   31
total             44         19         53         116

Now we can calculate $\dfrac{(f_e - f_t)^2}{f_t}$ for each cell and sum over all cells:

$$\chi^2_e = \sum \frac{\left( f_e - f_t \right)^2}{f_t} = 9.831141$$
The appropriate $\chi^2$ test procedure is:
1. $H_0$: there is no relationship between the two categorical variables /
   $H_1$: there is a relationship between the two categorical variables
2. $P(\chi^2 < \chi^2_t) = 1 - \alpha = 0.99$, with $k = (r - 1)(c - 1) = 2 \cdot 2 = 4$ degrees of freedom $\;\Rightarrow\; \chi^2_{4;\,t} = 13.277$
3. $\chi^2_e = \sum \dfrac{\left( f_e - f_t \right)^2}{f_t} = 9.831141$
4. $\chi^2_e < \chi^2_t \Rightarrow H_0$
There is no evidence of a significant relationship between commuting time and stress.
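The theoretical frequencies and the $\chi^2$ statistic for the contingency table can be recomputed with a short sketch. The function name is illustrative; the critical value is not computed here and would still be read from a table.

```python
def chi_square_independence(observed):
    """Expected frequencies, chi-square statistic and degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # f_t = (row total * column total) / n for every cell
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df


# Rows: under 15 min, 15-45 min, over 45 min; columns: high, moderate, low stress.
observed = [[9, 5, 18], [17, 8, 28], [18, 6, 7]]
expected, chi2_e, df = chi_square_independence(observed)
print(round(chi2_e, 4), df)
```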
Example 15.
In the framework of a survey among tourists who go to the Sarajevo airport (sample of 216 passengers), among others, the following questions were asked:

$$\chi^2_e = \sum_{k=1}^{m} \frac{\left( f_k - f_{kt} \right)^2}{f_{kt}}$$

$\chi^2_e < \chi^2_t \Rightarrow H_0$, $\quad \chi^2_e > \chi^2_t \Rightarrow H_1$
where:
m - number of samples (number of populations)
$P_k$ - proportion in the k-th population
$n_k$ - sample size for the sample from the k-th population
$f_k$ ($f_{kt}$) - empirical (theoretical) frequency
Example 16.
The purchase of coffee is investigated in 4 separate areas. It is assumed that the same proportion of consumers buys coffee in each of these 4 areas. We have selected a sample of coffee consumers to test this assumption.

area    Sample size   Number of coffee consumers and buyers in sample
A       100           20
B       200           35
C       150           37
D       250           43
total   700           135

Can we accept the assumption that the proportion of coffee buyers is equal in each area, with a 5% error?
Solution:

area    Sample size - $n_i$   Number of coffee consumers and buyers in sample - $f_i$
A       100                   20
B       200                   35
C       150                   37
D       250                   43
total   700                   135

1. $H_0: P_1 = P_2 = \ldots = P_k = \ldots = P_m = P$ (same proportion for each area) /
   $H_1: \exists\, P_k \ne P,\ k = 1, \ldots, m$ (not the same proportion for each area)
2. $P(\chi^2 < \chi^2_t) = 1 - \alpha = 1 - 0.05 = 0.95$, with $m - 1 = 4 - 1 = 3$ degrees of freedom $\;\Rightarrow\; \chi^2_t = 7.815$
3. $f_{ti} = n_i \cdot p$, where $p = \dfrac{\sum_{k=1}^{m} f_k}{\sum_{k=1}^{m} n_k} = \dfrac{135}{700} = 0.19286$

The working table has the columns: area, $n_i$, $f_i$, expected number of coffee consumers and buyers in sample $f_{ti}$, and $\dfrac{\left( f_i - f_{ti} \right)^2}{f_{ti}}$.

$$\chi^2_e = \sum_{i=1}^{m} \frac{\left( f_i - f_{ti} \right)^2}{f_{ti}}$$

4. $\chi^2_t > \chi^2_e \Rightarrow H_0$
Therefore, we can say that the proportion of coffee customers is equal in every area.
Test of the adequacy of approximations (goodness of fit)
If we have previously approximated an empirical distribution by some theoretical distribution, and we want to examine the quality (adequacy) of the approximation, we use a nonparametric chi-square test:
1. $H_0$: the distribution of the population has a specific form, corresponding to a specific theoretical frequency distribution /
   $H_1$: the approximation is not correct
2. $P(\chi^2 < \chi^2_{teor}) = 1 - \alpha$, with $k' = m - r - 1$ degrees of freedom
3. $\chi^2_{izr} = \sum_{k=1}^{m} \dfrac{\left( f_k - f_{kt} \right)^2}{f_{kt}}$
4. $\chi^2_{izr} < \chi^2_{teor} \Rightarrow H_0$, $\quad \chi^2_{izr} > \chi^2_{teor} \Rightarrow H_1$
where:
r - number of parameters that are estimated from the empirical data
m - number of modalities or intervals
$f_k$ ($f_{kt}$) - empirical (theoretical) frequencies
Example 17.
For empirical distribution:
modalities frequencies
0 150
1 100
2 50
3 15
4 7
5 2
We assume that the variable follows a Poisson distribution. We have to test the validity of this assumption ($\alpha = 4\%$).
Solution (by Excel):
The Poisson distribution has one characteristic parameter, which equals the arithmetic mean, so we first calculate the arithmetic mean of the series (using the Paste function):
$m = \bar{X}$ = {=SUMPRODUCT(A45:A50;B45:B50)/SUM(B45:B50)} = 0,873457.
Then we calculate the theoretical frequencies of the Poisson distribution as follows:
{=324*POISSON(x;0,873457;FALSE)} for each x from 0 to 5.
Modalities (A45:A50)   Frequencies (B45:B50)   Theoretical frequencies (C45:C50)
0                      150                     135,2719
1                      100                     118,1541
2                      50                      51,60127
3                      15                      15,02383
4                      7                       3,280666
5                      2                       0,573104
sum                    324                     323,7571
Because some classes have theoretical frequencies lower than 5, we must merge them (modalities 3, 4 and 5 are combined into one class):
Modalities (E45:E48)   Frequencies (F45:F48)   Theoretical frequencies (G45:G48)
0                      150                     135,2719
1                      100                     118,1541
2                      50                      51,60127
3                      24                      18,8776
We apply the chi-square test:
{=CHITEST(F45:F48;G45:G48)} returns the empirical probability 0,120048 and, based on it, with 2 degrees of freedom we obtain $\chi^2_e$ = {=CHIINV(0,120048;2)} = 4,239725
Now we find the theoretical chi-square value: $\chi^2_t$ = {=CHIINV(0,04;2)} = 6,437737
Since $\chi^2_e < \chi^2_t$, we cannot reject the null hypothesis: the assumption that the data from the research follow a Poisson distribution is accepted.
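As a cross-check, the same test can be sketched in Python by summing $(f_k - f_{kt})^2 / f_{kt}$ directly over the four merged classes. This is a minimal sketch, not the Excel procedure itself; the names are illustrative. It relies on the fact that, for 2 degrees of freedom, the chi-square critical value has the closed form $-2\ln\alpha$. Note that the direct sum comes out near 5,83 rather than 4,24; the difference appears to arise because CHITEST assumes 3 degrees of freedom for four cells before the value is converted back with 2. Both values are below the theoretical value, so the conclusion does not change.

```python
import math

# Empirical distribution from Example 17.
modalities = [0, 1, 2, 3, 4, 5]
observed = [150, 100, 50, 15, 7, 2]
n = sum(observed)

# Poisson parameter estimated by the arithmetic mean of the series.
mean = sum(x * f for x, f in zip(modalities, observed)) / n


def poisson_pmf(x, lam):
    """Poisson probability P(X = x) for parameter lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Theoretical frequencies, as in 324*POISSON(x; mean; FALSE).
expected = [n * poisson_pmf(x, mean) for x in modalities]

# Merge modalities 3, 4 and 5 into one class, as in the text.
observed_merged = observed[:3] + [sum(observed[3:])]
expected_merged = expected[:3] + [sum(expected[3:])]

# Empirical chi-square computed directly from the merged classes.
chi2_e = sum((o - e) ** 2 / e for o, e in zip(observed_merged, expected_merged))

# Theoretical value for alpha = 0.04 and m - r - 1 = 4 - 1 - 1 = 2 degrees
# of freedom; for 2 degrees of freedom it equals -2 * ln(alpha).
alpha = 0.04
chi2_t = -2 * math.log(alpha)

reject_h0 = chi2_e > chi2_t
print(f"mean = {mean:.6f}, chi2_e = {chi2_e:.3f}, chi2_t = {chi2_t:.3f}")
```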
Kolmogorov-Smirnov test
The KS test is a nonparametric test that examines whether the analyzed variable follows a specified theoretical distribution. It is used with samples larger than 50 observations.
SPSS provides an option for running the KS test.
Example 18.
For 208 employees, we track data on the amount of wages. The data are presented in an Excel sheet. Does the analyzed variable follow a normal distribution (with a reliability of 99%)?
Solution:
We transfer the data from the Excel sheet into an SPSS sheet:
Descriptive Statistics
       N     Mean      Std. Deviation   Minimum   Maximum
wage   208   39,9231   11,25548         26,70     97,00

One-Sample Kolmogorov-Smirnov Test
                                             wage
N                                            208
Normal Parameters (a, b)    Mean             39,9231
                            Std. Deviation   11,25548
Most Extreme Differences    Absolute         ,138
                            Positive         ,138
                            Negative         -,136
Kolmogorov-Smirnov Z                         1,997
Asymp. Sig. (2-tailed)                       ,001
a. Test distribution is Normal.
b. Calculated from data.
As the p-value of the KS test is less than 0.01, we accept the alternative hypothesis: the empirical frequency distribution of the variable wage does not follow the normal distribution.
REGRESSION AND CORRELATION ANALISYS
EXAMPLES IN EXCEL AND SPSS
182
IV. REGRESSION AND CORRELATION ANALYSIS
Aim
Correlation and regression analysis has a different purpose than the previous techniques we have looked at. The goal of correlation and regression analysis is to determine and quantify the relationship between two or more variables. We have two or more scores (the data must be interval for the technique we will look at) coming from the same individual. Over many cases we wish to know whether there is a relationship between the variables. Correlation and regression are methods of describing the nature and degree of the relationship between two or more variables. For example:
Hours spent studying and grade point average
Family income and child's I.Q.
College G.P.A and adult income
Amount of time watching T.V. and fear of crime.
In each case, for each person or case, the individual is measured on the two variables and
we wish to determine if the two variables are related.
There are three most important concepts in correlation and regression analysis:
The scatter plot displays the form, direction, and strength of the relationship
between two quantitative variables. Straight-line (linear) relationships are
particularly important because a straight line is a simple pattern that is quite
common.
The correlation measures the direction and strength of the linear relationship.
The least-squares regression line is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Basic aspects
In correlation and regression analysis, basic aspects are:
a) The direction of the relationship
Positive: high scores on one variable go with high scores on the other variable.
Negative: high scores on one variable go with low scores on the other variable, and vice versa.
b) The form of the relationship
Linear versus nonlinear relationships.
c) The degree of the relationship
In a positive relationship, are high scores always associated with other high scores, and low scores with other low scores, or just sometimes?
Scatter plot
A scatter plot is a type of graph using Cartesian coordinates to display values for two
variables for a set of data. The data is displayed as a collection of points, each having the
value of one variable (independent variable x) determining the position on the horizontal
axis and the value of the other variable determining the position on the vertical axis
(dependent variable y). A scatter plot is also called a scatter chart, scatter diagram and
scatter graph.
Example 1.
Here is a table showing the results of two examinations taken by 10 students. They took a Maths exam and a Statistics exam, and the scores they got in both were recorded:
                   John  Betty  Sarah  Peter  Fiona  Charlie  Tim  Gerry  Martine  Rachel
Maths score        72    65     80     36     50     21       79   64     44       55
Statistics score   78    70     81     31     55     29       74   64     47       53
We want to create a scatter graph for these variables.
Solution:
We draw two axes. The horizontal axis will represent the score on the Maths exam. The
vertical axis will represent the score on the Statistics exam. For each student, we then
mark a small dot at the co-ordinates representing their two scores.
In Excel we choose chart:
[Figure: scatter plot of Statistics score (vertical axis, 0 to 90) against Maths score (horizontal axis, 0 to 90)]
We can see that the points follow a fairly strong pattern. Students who are good at Maths tend to be good at Statistics as well. The marks lie fairly close to an imaginary straight line that we can draw on the graph. To draw this straight line in Excel, we right-click on the data points and get the following options.
We choose Add Trendline:
We choose the linear model, which the graph clearly suggests:
[Figure: scatter plot of Statistics score against Maths score with the linear trend line added]
The fact that the points lie close to the straight line is called a strong linear correlation.
The fact that this line points upwards to right - indicating that the Statistics mark tends to
increase as the Maths mark increases - is called a positive correlation.
The next graph shows different forms of scatter plots (footnote 6):
[Figure: six scatter plots, panels a to f, each with horizontal axis x and vertical axis y]
In cases a and b we have linear relationships. In case a direction of relationship is positive
(direct, when a case is high on one variable it is high on the other variable), but in case b
relationship is negative (indirect, when one variable is high the other is low). In case c
there is no relationship between the variables, a case can be high on one variable and
either high or low on the other. In cases d, e and f there are nonlinear relationships.
Line of Best Fit (Regression Line)
The straight line that we draw through the points is called either the line of best fit or the
regression line. It describes the relationship between the two variables (the quantities
compared) mathematically. There is a standard way to draw this line to ensure that it fits
as closely to the data points as possible. Later on, we will investigate exactly what that
mathematical way is. For now, we only have to remember one thing:
The regression line goes through the point whose co-ordinates are the mean values
of the variables.
The arithmetic means are found by adding the relevant scores, and dividing by 10. We
work out:
mean of the Maths scores = (72 + 65 + 80 + 36 + 50 + 21 + 79 + 64 + 44 + 55) / 10 = 56.6
mean of the Statistics scores = (78 + 70 + 81 + 31 + 55 + 29 + 74 + 64 + 47 + 53) / 10 = 58.2
and we can be sure that the line must go through the point (56.6, 58.2). We notice that there are roughly the same number of data points lying above this line as there are below it on the scatter plot for Example 1.
We can use the regression line to make predictions. For instance, what Statistics mark would we expect someone to receive if they received a Maths mark of 30? If we look at the straight line, we can see that when the Maths mark is 30, the Statistics mark is approximately 33. Similarly, we can assume that anyone who got a Statistics mark of 40 would also get a Maths mark of about 37. However, there are limits on the predictions that we can make, as you will see later on.
6 Somun-Kapetanović R., Statistika u ekonomiji i menadžmentu, Ekonomski fakultet u Sarajevu, Sarajevo 2006., page 112
The Correlation Coefficient
We can see by looking at the graph whether there is a strong or weak linear correlation between two variables, and whether that correlation is positive or negative. However, there is a mathematical way of working it out, and that is to calculate the correlation coefficient. This is also known as Pearson's Correlation Coefficient, represented by the letter r, and it is a single number which ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation). Correlation coefficients which are close to -1 or +1 indicate a strong correlation. Values close to 0 indicate a weak correlation, with 0 itself indicating no linear correlation at all. The stronger the correlation, the better the prediction and the smaller the errors of prediction.
Here is how we calculate the linear correlation coefficient between two variables:
$$r = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$
where:
$C_{xy} = \frac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y}) = \frac{\sum xy}{n} - \bar{x}\,\bar{y}$ is the covariance between x (as the independent variable) and y (as the dependent variable). Covariance simultaneously monitors the variability of both variables.
$\sigma_x = \sqrt{\frac{1}{N} \sum (x_i - \bar{x})^2} = \sqrt{\frac{\sum x^2}{N} - \bar{x}^2}$ is the standard deviation for variable x
$\sigma_y = \sqrt{\frac{1}{N} \sum (y_i - \bar{y})^2} = \sqrt{\frac{\sum y^2}{N} - \bar{y}^2}$ is the standard deviation for variable y
$\bar{x}$ - mean for variable x
$\bar{y}$ - mean for variable y
n is the number of objects.
Example 1. cont.
We want to calculate correlation coefficient between Maths score and Statistics score.
Solution:
Among the Excel statistical functions we choose the function CORREL:
The correlation coefficient is close to 1 and indicates a strong positive correlation, as we supposed from the scatter plot. So there is a strong direct relationship between the scores in Maths and Statistics.
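The computational formula for r can be checked against the data of Example 1 with a few lines of Python. This is a minimal sketch with illustrative variable names; it should reproduce the value that Excel reports as Multiple R in the regression output further on (0,971121335).

```python
import math

# Scores of the 10 students from Example 1.
maths = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
stats = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
n = len(maths)

# Sums needed by the computational formula for r.
sum_x = sum(maths)
sum_y = sum(stats)
sum_xy = sum(x * y for x, y in zip(maths, stats))
sum_x2 = sum(x * x for x in maths)
sum_y2 = sum(y * y for y in stats)

# r = (n*Sxy - Sx*Sy) / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"r = {r:.6f}")
```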
The Coefficient of Determination
Another figure that is useful is the coefficient of determination. This is written as $r^2$ and is found by squaring the correlation coefficient. Because the correlation coefficient must be in the range -1 to +1, and square numbers cannot be negative, the coefficient of determination must be in the range 0 to +1.
The correlation coefficient indicates whether there is a relationship between the two variables, and whether the relationship is positive or negative.
The coefficient of determination tells you what proportion of the variation between the
data points is explained or accounted for by the best fit line fitted to the points. It
indicates how close the points are to the line.
Interpretation of the size of a correlation
Several authors have offered guidelines for the interpretation of a correlation coefficient, as we can see in the next table:
Correlation   Negative        Positive
Small         -0.3 to -0.1    0.1 to 0.3
Medium        -0.5 to -0.3    0.3 to 0.5
Large         -1.0 to -0.5    0.5 to 1.0
Cohen (1988) has observed, however, that all such criteria are in some ways arbitrary and
should not be observed too strictly. This is because the interpretation of a correlation
coefficient depends on the context and purposes. A correlation of 0.9 may be very low if
one is verifying a physical law using high-quality instruments, but may be regarded as
very high in the social sciences where there may be a greater contribution from
complicating factors.
Along this vein, it is important to remember that "large" and "small" should not be taken
as synonyms for "good" and "bad" in terms of determining that a correlation is of a
certain size.
The standard error of estimate and the correlation coefficient
1. Decomposition of an observed score, if y is the dependent variable:
$$y_i = \bar{y} + (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
2. Partitioning the variance in scores
a) It may be more useful to look at it in terms of variability, breaking the total variability of the score (its deviation from the mean) into two portions:
$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
$(y_i - \bar{y})$ - the deviation of the score from the mean.
$(\hat{y}_i - \bar{y})$ - the deviation of the predicted score from the mean; this is the portion of the score that reflects the relationship with the x variable.
$(y_i - \hat{y}_i)$ - the deviation of the observed score from the predicted score; this is the error, or the part of the score that is not related to the x variable.
b) If we square these deviations and sum them, we have sums of squares; these sums of squares are additive:
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
$\sum (y_i - \bar{y})^2$ is the total sum of squares for the dependent variable, $SS_y$ (total variability).
$\sum (\hat{y}_i - \bar{y})^2$ is the sum of squares due to prediction or regression ($SS_{regression}$); this is the part of the y variable that the x variable did predict (explained variability). The stronger the correlation, the larger this term will be:
- If r = 0 then $\sum (\hat{y}_i - \bar{y})^2 = 0$
- If r = 1 then $\sum (\hat{y}_i - \bar{y})^2 = \sum (y_i - \bar{y})^2$
$\sum (y_i - \hat{y}_i)^2$ is the sum of squares for the residual or the errors of prediction, the part of $SS_y$ that the x variable did not predict ($SS_{errors\ in\ prediction}$, or residual SS, or unexplained variability). The stronger the correlation, the smaller this term will be:
- If r = 0 then $\sum (y_i - \hat{y}_i)^2 = \sum (y_i - \bar{y})^2$
- If r = 1 then $\sum (y_i - \hat{y}_i)^2 = 0$
3.
$$r^2 = \frac{SS_{regression}}{SS_y} = 1 - \frac{SS_{errors\ in\ prediction}}{SS_y} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$
is the coefficient of determination, and it can be seen that it represents the fraction of the total variation in the y scores that can be predicted from the x scores.
a. Then we can calculate the standard error of estimate as:
$$\text{standard error of estimate} = \sqrt{\frac{SS_{error}}{df}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{(1 - r^2)\, SS_y}{n - 2}}$$
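The additivity of the sums of squares can be verified numerically on the data of Example 1. The following Python sketch is illustrative only: it fits the line with the least-squares formulas for a and b given in the next section, and the variable names are chosen here. The results can be compared with the Excel regression output for the same data (total SS 3089,6; R Square 0,943076647; Standard Error 4,68868839).

```python
import math

# Data from Example 1: x = Maths score, y = Statistics score.
x = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
y = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept (formulas from the next section).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Predicted scores on the regression line.
y_hat = [a + b * xi for xi in x]

# Total, explained (regression) and unexplained (residual) sums of squares.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Coefficient of determination and standard error of estimate (df = n - 2).
r2 = ss_regression / ss_total
standard_error = math.sqrt(ss_residual / (n - 2))

print(f"SS_y = {ss_total:.4f}, SS_regression = {ss_regression:.4f}, "
      f"SS_residual = {ss_residual:.4f}")
print(f"r^2 = {r2:.6f}, standard error of estimate = {standard_error:.5f}")
```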
Calculating the Equation of the Regression Line for two variables
The regression line is defined by two numbers - the gradient and the intercept on the vertical axis of the line that best fits those points. We always refer to the gradient of the line as b and the intercept as a, which gives the equation of the regression line as:
$$\hat{y}_i = a + b x_i$$
The Least-Squares Method (LSM) determines the values of a and b that minimize the sum of squares for the residual or the errors of prediction:
$$\sum (y_i - \hat{y}_i)^2 = \sum \left[y_i - (a + b x_i)\right]^2 \rightarrow \text{minimum}$$
According to the LSM, here are the formulas for calculating the gradient and the intercept, and general rules for their interpretation:
$a = \bar{y} - b\,\bar{x}$ - indicates the value of y when x is 0.
$b = \frac{Cov_{xy}}{\sigma_X^2} = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$ - indicates how much the y values change as x changes by one unit.
Example 1. cont.
We want to create regression model for relationship between Maths score and Statistics
score, in sense that Statistics score depends on Maths score.
Solution:
First way:
Among the Excel functions we find the functions INTERCEPT and SLOPE:
Interpretation:
The Statistics score rises by 0.938 if the Maths score rises by 1.
A student with a Maths score of 0 would have a Statistics score of 5.083.
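The two Excel functions can be mirrored with the least-squares formulas for b and a. The sketch below is a plain Python illustration (the variable names are chosen here, and it is not how Excel computes the values internally); its results can be compared with the coefficients in the Excel regression output for the same data.

```python
# Data from Example 1: x = Maths score, y = Statistics score.
x = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
y = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# Gradient: b = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2), the analogue of SLOPE.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = mean(y) - b * mean(x), the analogue of INTERCEPT.
a = sum_y / n - b * sum_x / n

print(f"slope b = {b:.6f}, intercept a = {a:.6f}")
```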
Second way:
Excel, Data Analysis, Regression:
SUMMARY OUTPUT
Regression Statistics
Multiple R          0,971121335
R Square            0,943076647
Adjusted R Square   0,935961228
Standard Error      4,68868839
Observations        10

ANOVA
             df   SS            MS            F          Significance F
Regression   1    2913,729609   2913,729609   132,5399   2,94E-06
Residual     8    175,8703905   21,98379882
Total        9    3089,6

               Coefficients   Standard Error   t Stat        P-value    Lower 95%   Upper 95%
Intercept      5,083182203    4,846187507      1,048903328   0,324874   -6,09215    16,25851
X Variable 1   0,938459678    0,081515907      11,5125957    2,94E-06   0,750484    1,126436

RESIDUAL OUTPUT
Observation   Predicted Y   Residuals      Standard Residuals
1             72,65227905   5,347720953    1,209744422
2             66,0830613    3,916938701    0,886077412
3             80,15995647   0,840043526    0,190031974
4             38,86773063   -7,867730625   -1,779812993
5             52,00616612   2,993833877    0,677255576
6             24,79083545   4,209164551    0,952183814
7             79,2214968    -5,221496796   -1,181190394
8             65,14460162   -1,14460162    -0,258928137
9             46,37540805   0,624591948    0,141293203
10            56,69846451   -3,698464515   -0,836654877

[Figure: X Variable 1 Residual Plot, showing Residuals (vertical axis, -10 to 10) against X Variable 1 (horizontal axis, 0 to 100)]
Prediction or forecasting
This model, which is determined by the LSM, is used for forecasting values of the dependent variable y for different given values of the independent variable x. Predictions in regression analysis can be:
Interpolation - values of the independent variable x are within the original range, from the smallest to the largest x used in developing the regression model. This is a relatively reliable prediction.
Extrapolation - values of the independent variable x are not within the original range, from the smallest to the largest x used in developing the regression model. This prediction can be subject to unknown effects that we do not expect, so in the case of extrapolation, reliability is questionable.
Example 1. cont.
If a student has a Maths score of 75, what is the expected score for Statistics?
Solution:
The value 75 lies within the range of the observed Maths scores (21 to 80), so this is an interpolation:
$$x_i = 75 \Rightarrow \hat{y}_i = 5.083 + 0.938\, x_i = 5.083 + 0.938 \cdot 75 = 75.433$$
According to the previous regression model, we expect that a student who has a Maths score of 75 will get a score of about 75.4 in Statistics.
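The distinction between interpolation and extrapolation can be made explicit in code. The sketch below is only an illustration: the function name `predict` is hypothetical, and the coefficients are the rounded values from the Excel regression output for Example 1 (intercept 5,083 and slope 0,938).

```python
# Rounded coefficients from the Excel regression output for Example 1.
A = 5.083  # intercept
B = 0.938  # slope

# Range of the Maths scores used to fit the model.
X_MIN, X_MAX = 21, 80


def predict(x):
    """Return the predicted Statistics score and the kind of prediction."""
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return A + B * x, kind


score, kind = predict(75)
print(f"Maths 75 -> Statistics {score:.3f} ({kind})")

# A Maths score of 10 lies below the smallest observed x, so the
# prediction is an extrapolation and should be treated with caution.
score_low, kind_low = predict(10)
print(f"Maths 10 -> Statistics {score_low:.3f} ({kind_low})")
```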
Spearman's rank correlation coefficient
Spearman's correlation coefficient ($\rho$), used with ranked data, can be calculated as:
$$\rho = 1 - \frac{6 \sum d^2}{n^3 - n}$$
where d is the difference in ranking for x and y: $d = r_x - r_y$.
The only difference between it and the standard r is that the data used are ranks.
Example 2.
Two art historians were asked to rank six paintings from 1 (best) to 6 (worst). Their rankings are shown in the table:
Painting Historian 1 Historian 2
A 6 5
B 5 6
C 1 2
D 3 1
E 4 3
F 2 4
Calculate Spearman's rank correlation coefficient. Explain.
Solution:
We have ranks for two variables and we calculate the difference in ranking for x and y: $d = r_x - r_y$.
Painting   Historian 1 - $r_x$   Historian 2 - $r_y$   $d$   $d^2$
A          6                     5                     1     1
B          5                     6                     -1    1
C          1                     2                     -1    1
D          3                     1                     2     4
E          4                     3                     1     1
F          2                     4                     -2    4
sum                                                          12
Spearman's rank correlation coefficient is:
$$\rho = 1 - \frac{6 \sum d^2}{n^3 - n} = 1 - \frac{6 \cdot 12}{6^3 - 6} = 0.66$$
That suggests relatively strong direct agreement ($\rho \approx 0.66$) between the opinions of these two art historians.
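The same calculation can be written as a short Python sketch (illustrative names; it assumes there are no tied ranks, as in this example):

```python
# Ranks given by the two art historians to paintings A-F (Example 2).
historian_1 = [6, 5, 1, 3, 4, 2]
historian_2 = [5, 6, 2, 1, 3, 4]
n = len(historian_1)

# Differences in ranking d = r_x - r_y and the sum of their squares.
d = [rx - ry for rx, ry in zip(historian_1, historian_2)]
sum_d2 = sum(di ** 2 for di in d)

# Spearman's coefficient: rho = 1 - 6 * sum(d^2) / (n^3 - n).
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

print(f"sum of d^2 = {sum_d2}, rho = {rho:.3f}")
```

The result agrees, after rounding, with the value ,657 in the SPSS output for the same ranks.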
Or by the SPSS program:
For the correlation option we choose bivariate and then define the variables:
Correlations
                                                   K1      K2
Spearman's rho   K1   Correlation Coefficient      1,000   ,657
                      Sig. (2-tailed)              .       ,156
                      N                            6       6
                 K2   Correlation Coefficient      ,657    1,000
                      Sig. (2-tailed)              ,156    .
                      N                            6       6
Same conclusion.
Statistical testing (t test, ANOVA)
It is possible to test the significance of the parameters in the simple regression model:
1. $H_0: b = 0$ / $H_1: b \neq 0$
2. Standard error for parameter b:
$$\sigma_b = \frac{\hat{\sigma}}{\sqrt{\sum x_i^2 - N \bar{x}^2}}, \quad \text{where } \hat{\sigma} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N - 2}}$$
3. $t_e = \frac{b}{\sigma_b}$
4. $t_t = \left[t_{N-k-1;\,\alpha/2},\ t_{N-k-1;\,1-\alpha/2}\right]$
where k = 1 is the number of independent variables in the simple regression model
5. $t_e \in t_t \Rightarrow H_0$: parameter b is not significant, i.e. the independent variable it accompanies is not significant in the model.
$t_e \notin t_t \Rightarrow H_1$: parameter b is significant.
The concept of p-values, which is simpler, leads to the following conclusions:
If the p-value of the parameter whose significance we tested is less than 0.05, we can say, with a type I error of 5%, that the parameter, i.e. the variable it accompanies, is significant in the model.
If the p-value of the parameter whose significance we tested is higher than 0.05, we say, with a type I error of 5%, that the parameter, i.e. the variable it accompanies, is not significant in the model; such an independent variable should be excluded from the model.
Example 1, cont.
We will analyze part of the Excel output of the regression analysis in Example 1:
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 5,083182203 4,846187507 1,048903328 0,324874 -6,09215 16,25851
X Variable 1 0,938459678 0,081515907 11,5125957 2,94E-06 0,750484 1,126436
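The standard error and the t statistic of the slope in this output can be recomputed from the formulas of the test. The Python sketch below is illustrative; the critical value 2,306 (8 degrees of freedom, $\alpha$ = 0.05, two-sided) is taken from standard t tables and is an input assumed here, not something computed by the code.

```python
import math

# Data from Example 1: x = Maths score, y = Statistics score.
x = [72, 65, 80, 36, 50, 21, 79, 64, 44, 55]
y = [78, 70, 81, 31, 55, 29, 74, 64, 47, 53]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of slope and intercept.
sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual standard deviation with N - 2 degrees of freedom.
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(ss_residual / (n - 2))

# Standard error of b and the empirical t value.
sigma_b = sigma_hat / math.sqrt(sxx)
t_e = b / sigma_b

# Tabulated critical value for 8 degrees of freedom, alpha = 0.05 (two-sided).
t_t = 2.306
significant = abs(t_e) > t_t

print(f"sigma_b = {sigma_b:.6f}, t_e = {t_e:.4f}, significant: {significant}")
```

Since the empirical t value lies far outside the interval from -2,306 to 2,306, and the p-value 2,94E-06 is below 0.05, the slope is significant.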
Overview example for simple regression model with SPSS
Example 3.
To examine the relationship between store size (i.e. square footage) and annual sales, a sample of 14 stores was selected. The results for these 14 stores are summarized in the next table:
Store   Square feet (000)   Annual sales (in millions of $)
1 1.7 3.7
2 1.6 3.9
3 2.8 6.7
4 5.6 9.5
5 1.3 3.4
6 2.2 5.6
7 1.3 3.7
8 1.1 2.7
9 3.2 5.5
10 1.5 2.9
11 5.2 10.7
12 4.6 7.6
13 5.8 11.8
14 3.0 4.1
a) To examine the relationship between store size and annual sales, create a scatter plot. Comment.
b) Create a regression model for these variables. Explain the parameters.
c) Calculate and explain the coefficient of correlation and the coefficient of determination.
d) Comment on the representativeness of the model.
e) If the store size is 4200 square feet, what level of annual sales could we expect for that store?
Solution:
a) Scatter plot:
1. the independent variable is store size,
2. the dependent variable is annual sales.
We use the graph option in SPSS:
We will find variables:
According to this scatter plot, we suppose that there is a direct linear relationship.
b) Linear model: $\hat{y}_i = a + b x_i$
Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       ,951 (a)   ,904       ,896                ,9664
a. Predictors: (Constant), size

ANOVA (b)
Model            Sum of Squares   df   Mean Square   F         Sig.
1   Regression   105,748          1    105,748       113,234   ,000 (a)
    Residual     11,207           12   ,934
    Total        116,954          13
a. Predictors: (Constant), size
b. Dependent Variable: sale

Coefficients (a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B       Std. Error            Beta                        t        Sig.
1   (Constant)   ,964    ,526                                              1,833    ,092
    size         1,670   ,157                  ,951                        10,641   ,000
a. Dependent Variable: sale
The regression model is: $\hat{y}_i = 0.964 + 1.67\, x_i$
b - indicates that annual sales increase by 1.67 million dollars as store size increases by 1000 square feet.
a - indicates that annual sales are 0.964 million dollars when store size is 0 square feet (this interpretation is not logical).
c) The correlation coefficient is 0.95. This indicates a strong (but not perfect) positive correlation.
The coefficient of determination is $r^2 = 0.904$. Use of the regression model has reduced the variability in predicting annual sales by 90.4%. Only 9.6% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses only square footage.
d) In addition to the coefficients of determination and correlation, we can analyse the quality of a simple regression model by means of the t test for the parameter of the independent variable. The empirical value of t is 10.64 and the p-value of the t test is 0.000, which means that the independent variable that accompanies this parameter is significant in the model.
e) $x_i = 4.2$ is within the original range from the smallest to the largest x used in developing the regression model, so we make an interpolation:
$$x_i = 4.2 \Rightarrow \hat{y}_i = 0.964 + 1.67\, x_i = 0.964 + 1.67 \cdot 4.2 = 7.978$$
The predicted average annual sales of a store with 4200 square feet are $7,978,000.
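The main SPSS results of this example can be retraced with a plain Python sketch. It is an illustration only (variable names are chosen here); it applies the least-squares formulas to the 14 stores and should agree with the SPSS output up to rounding.

```python
import math

# Example 3 data: store size (thousands of square feet) and annual sales
# (millions of dollars) for the 14 stores.
size = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]
n = len(size)
x_bar = sum(size) / n
y_bar = sum(sales) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - x_bar) ** 2 for x in size)
syy = sum((y - y_bar) ** 2 for y in sales)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(size, sales))

# Regression coefficients, correlation and determination coefficients.
b = sxy / sxx
a = y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)
r2 = r ** 2

# Predicted annual sales for a store of 4200 square feet (x = 4.2).
predicted = a + b * 4.2

print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.3f}, r^2 = {r2:.3f}")
print(f"predicted sales for x = 4.2: {predicted:.3f} million dollars")
```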
MULTIPLE REGRESSION MODEL
The general multiple regression model
The general multiple regression model with K independent variables is:
$$Y = f(X_1, X_2, \ldots, X_K) + e$$
The dependent variable Y is expressed as a function of K independent variables and the random variable e. If the functional part of the model is a linear function, we obtain the standard model of multiple linear regression, given by the following equation:
$$Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_K X_K + e$$
Coefficients in the regression model have the following meaning:
Parameter a is the free (constant) term, which represents the expected value of the dependent variable Y when the values of the K independent variables ($X_1, X_2, \ldots, X_K$) are equal to zero. The value of this parameter does not always have a logical explanation.
Parameter $b_i$ (i = 1, 2, ..., K), the regression coefficient of the independent variable $X_i$, indicates the average change in the dependent variable Y for a unit increase in the independent variable $X_i$, provided that the other independent variables remain unchanged. A positive value of the parameter indicates a direct relationship between the variables Y and $X_i$: growth of the independent variable $X_i$ leads to growth of the dependent variable Y. A negative value of the coefficient indicates an inverse relationship between the dependent variable Y and the independent variable $X_i$. In this case the independent and dependent variables change in opposite directions: growth of $X_i$ causes a decline in the dependent variable Y, and a decline in $X_i$ causes growth in the dependent variable Y.
The values of the parameters of the multiple regression model are estimated using the method of least squares.
Measures for quality of multiple regression model
A. Model error
$$\sigma_{\hat{y}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N}}$$
- model error is unexplained variability.
B. Coefficient of variation for the model
$$V = \frac{\sigma_{\hat{y}}}{\bar{y}}$$
C. The coefficient of multiple determination (the ratio of explained to total variability) is defined by the following expression:
$$R^2_{Y;1,2,..,K} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}, \quad 0 \le R^2_{Y;1,2,..,K} \le 1$$
The coefficient of multiple determination shows how much of the variability of the dependent variable is explained by the variability of the K independent variables included in the regression model.
D. The coefficient of multiple linear correlation expresses the strength of the relationship between the variability of the dependent variable and the joint variability of the K independent variables. It is determined as the square root of the coefficient of multiple determination:
$$R_{Y;1,2,..,K} = \sqrt{\frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}}, \quad 0 \le R_{Y;1,2,..,K} \le 1$$
Or by the expression:
$$R_{Y;1,2,..,K} = \frac{\sum (y_i - \bar{y})(\hat{y}_i - \bar{y})}{n\, \sigma_y\, \sigma_{\hat{y}}}$$
The coefficient carries no sign of the association, because the relations between the dependent and the independent variables can go in different directions.
E. The partial correlation coefficient shows the strength and direction of the connection between the dependent variable Y and the j-th independent variable, with the impact of the remaining (K-1) variables, denoted by c, held constant. The value of this coefficient lies within the limits $-1 \le r_{y,j;c} \le 1$.
For example, the first-order partial correlation coefficients for K = 2 are defined using the simple coefficients of linear correlation in the following manner:
$$r_{y,1;2} = \frac{r_{y;1} - r_{y;2}\, r_{1,2}}{\sqrt{(1 - r_{y;2}^2)(1 - r_{1,2}^2)}}; \quad r_{y,2;1} = \frac{r_{y;2} - r_{y;1}\, r_{1,2}}{\sqrt{(1 - r_{y;1}^2)(1 - r_{1,2}^2)}}$$
Interpretation of partial correlation coefficients: they explain the strength and direction of the relationship between one independent variable and the dependent variable (their variability) when the influence of the other (K-1) independent variables is switched off.
F. Adjusted determination coefficient
$$\text{adjusted } R^2 = 1 - \left[(1 - R^2)\, \frac{n - 1}{n - k - 1}\right]$$
The adjustment accounts for the number of predictors and the size of the sample; this coefficient should be taken into consideration with small samples.
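A short Python sketch of the adjustment follows. The function name `adjusted_r2` is hypothetical; the check uses the values from the Excel output of Example 1 (R Square 0,943076647, 10 observations, one independent variable), for which Excel reports Adjusted R Square 0,935961228.

```python
def adjusted_r2(r2, n, k):
    """Adjusted determination coefficient for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Values from the Excel regression output of Example 1.
example_1 = adjusted_r2(0.943076647, n=10, k=1)
print(f"adjusted R^2 = {example_1:.9f}")

# With the same R^2, more predictors or fewer observations lower the
# adjusted value, which is why it matters most for small samples.
more_predictors = adjusted_r2(0.943076647, n=10, k=3)
print(f"adjusted R^2 with k = 3: {more_predictors:.6f}")
```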
Statistical test (t test, ANOVA)
a. Testing the significance of parameter $b_{ij.12...m}$ (footnote 7) for the multiple regression model
1. $H_0: b_{ij.12...m} = 0$ / $H_1: b_{ij.12...m} \neq 0$
2. The standard error of the estimate of parameter b is $\sigma_{b_{ij.12...m}}$ and is determined on the basis of
$$\hat{\sigma} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N - M}}$$
3. $t_e = \frac{b_{ij.12...m}}{\sigma_{b_{ij.12...m}}}$
4. $t_t = \left[t_{N-k-1;\,\alpha/2},\ t_{N-k-1;\,1-\alpha/2}\right]$
where k = M-1 is the number of independent variables in the multiple regression model
5. $t_e \in t_t \Rightarrow H_0$: parameter b is not significant, i.e. the variable it accompanies is not significant in the model.
$t_e \notin t_t \Rightarrow H_1$: parameter b is significant, i.e. the variable it accompanies is significant in the model.
7 The indices after the point do not include i and j.
b. Analysis of variance in the regression model - F test for the regression model
This analysis tests whether there is a significant link between the set of independent variables included in the model and the dependent variable.
The methodology of conducting the F test is as follows:
1. $H_0: b_{1.\ldots} = b_{2.\ldots} = \ldots = b_{k.\ldots} = 0$ / $H_1$: at least one parameter $b_{j.\ldots} \neq 0$
2. $F_e = \frac{\sigma^2_{\hat{y}} / k}{\sigma^2_{y/x} / (n - k - 1)}$ (explained variability over unexplained variability, each divided by its number of degrees of freedom)
3. For the given $\alpha$, $F_t = F_{k;\, n-k-1}$
where k is the number of independent variables in the regression model
4. $F_e \le F_t \Rightarrow H_0$
$F_e > F_t \Rightarrow H_1$
If we accept the alternative hypothesis, we can consider that at least one of the independent (explanatory) variables included in the model is important for the movement of the dependent variable.
Example 4.
A sample of 34 shops in a chain store was selected for a marketing test. The dependent variable is the volume of sales, while the independent variables are the price and the cost of the promotion:
Sale (units)  Price (KM)  Promotion cost (00 KM)  Sale (units)  Price (KM)  Promotion cost (00 KM)
4141 59 200 2730 79 400
3842 59 200 2618 79 400
3056 59 200 4421 79 400
3519 59 200 4113 79 600
4226 59 400 3746 79 600
4630 59 400 3532 79 600
3507 59 400 3825 79 600
3754 59 400 1096 99 200
5000 59 600 761 99 200
5120 59 600 2088 99 200
4011 59 600 820 99 200
5015 59 600 2114 99 400
1916 79 200 1882 99 400
675 79 200 2159 99 400
3636 79 200 1602 99 400
3224 79 200 3354 99 600
2295 79 400 2927 99 600
Create an appropriate regression model and analyze the results.
Solution:
This is a multiple regression model with two independent variables. Using Excel (Data analysis - Regression; see footnote 8) we obtain the regression model. The result looks like this:
SUMMARY OUTPUT

Regression Statistics
Multiple R           0,870475
R Square             0,757726
Adjusted R Square    0,742095
Standard Error       638,0653
Observations         34

ANOVA
             df    SS          MS          F          Significance F
Regression   2     39472731    19736365    48,47713   2,86E-10
Residual     31    12620947    407127,3
Total        33    52093677

                 Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept        5837,521       628,1502         9,293192   1,79E-10   4556,4      7118,642
Price            -53,2173       6,852221         -7,76644   9,2E-09    -67,1925    -39,2421
Promotion cost   3,613058       0,685222         5,272828   9,82E-06   2,215538    5,010578

[8] The database column with the dependent variable must be either the first or the last, because the independent
variables must be given as a contiguous "block" of columns.
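The Excel result can be cross-checked outside the spreadsheet. The following is a minimal sketch in plain Python, assuming the 34 observations are transcribed correctly from the data table of this example; it solves the normal equations $(X'X)b = X'y$ directly. The helper names `solve` and `ols` are illustrative and not part of Excel or SPSS.

```python
# Sketch: least squares for sales = a + b1*price + b2*promotion,
# solved through the normal equations (X'X) b = X'y without external libraries.

# (sale, price, promotion cost) for the 34 shops, transcribed from the example table
DATA = [
    (4141, 59, 200), (3842, 59, 200), (3056, 59, 200), (3519, 59, 200),
    (4226, 59, 400), (4630, 59, 400), (3507, 59, 400), (3754, 59, 400),
    (5000, 59, 600), (5120, 59, 600), (4011, 59, 600), (5015, 59, 600),
    (1916, 79, 200), (675, 79, 200), (3636, 79, 200), (3224, 79, 200),
    (2295, 79, 400), (2730, 79, 400), (2618, 79, 400), (4421, 79, 400),
    (4113, 79, 600), (3746, 79, 600), (3532, 79, 600), (3825, 79, 600),
    (1096, 99, 200), (761, 99, 200), (2088, 99, 200), (820, 99, 200),
    (2114, 99, 400), (1882, 99, 400), (2159, 99, 400), (1602, 99, 400),
    (3354, 99, 600), (2927, 99, 600),
]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols(rows):
    """Return [intercept, b_price, b_promotion] estimated by least squares."""
    X = [[1.0, float(p), float(m)] for _, p, m in rows]
    y = [float(s) for s, _, _ in rows]
    k = 3
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(k)]
    return solve(xtx, xty)


intercept, b_price, b_promo = ols(DATA)
residuals = [s - (intercept + b_price * p + b_promo * m) for s, p, m in DATA]
print(round(intercept, 3), round(b_price, 4), round(b_promo, 4))
```

If the transcription is right, the printed coefficients should land close to the Coefficients column of the Excel output; any noticeable difference would point to a typing error in the data rather than to a different method.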
We interpret the Excel output in the following manner:
Correlation coefficient (Multiple R) = 0,87
Determination coefficient (R Square) = 0,758
Adjusted determination coefficient (Adjusted R Square) = 0,742
Model error (Standard Error) = 638,06 = $\sqrt{MS_{residual}} = \sqrt{MS_{\text{unexplained by model}}}$
The determination coefficient indicates that the model explains about 75.8% of the
variation in the dependent variable, volume of sales (the multiple correlation coefficient
is 0.87). So, the model is good.
Next come the results of ANOVA (analysis of variance), which serve as a test of the model:
o The first column gives the corresponding number of degrees of freedom:
- $df_{regression} = df_{explained} = k$
- $df_{residual} = df_{unexplained} = n - k - 1$
- $df_{total} = n - 1$
o The second column gives the sums of squared deviations:
- $SS_{regression} = SS_{explained} = \sum (\hat{y}_i - \bar{y})^2$ = 39,472,731
- $SS_{residual} = SS_{unexplained} = \sum (y_i - \hat{y}_i)^2$ = 12,620,947
- $SS_{total} = \sum (y_i - \bar{y})^2$ = 52,093,677
o The third column gives the MS values (the sum of squared deviations / number of
degrees of freedom):
- $MS = \dfrac{SS}{\text{appropriate } df}$
- $MS_{regression} = \dfrac{SS_{regression}}{df_{regression}} = \dfrac{SS_{explained}}{k}$
- $MS_{residual} = \dfrac{SS_{residual}}{df_{residual}} = \dfrac{SS_{unexplained}}{n - k - 1}$
- $MS_{total} = \dfrac{SS_{total}}{df_{total}} = \dfrac{SS_{total}}{n - 1}$
k - number of independent variables in the model
n - number of observations (objects)
o The fourth column gives the empirical value of the F test, and the fifth column the
corresponding p-value (Significance F).
Since $F_e = 48.48$ and $p = 2.86\text{E-}10 < \alpha = 0.05$, we accept $H_1$.
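The columns of the ANOVA table follow from the two sums of squares, n and k alone. The sketch below recomputes them for the shop example (SS values taken from the Excel output, n = 34, k = 2); the function name `anova_summary` is illustrative only.

```python
import math


def anova_summary(ss_regression, ss_residual, n, k):
    """Rebuild the ANOVA quantities from the explained and unexplained sums of squares."""
    df_reg, df_res, df_tot = k, n - k - 1, n - 1
    ss_total = ss_regression + ss_residual
    ms_reg = ss_regression / df_reg
    ms_res = ss_residual / df_res
    r2 = ss_regression / ss_total
    return {
        "df": (df_reg, df_res, df_tot),
        "MS_regression": ms_reg,
        "MS_residual": ms_res,
        "F": ms_reg / ms_res,
        "R2": r2,
        "adjusted_R2": 1 - (1 - r2) * df_tot / df_res,
        "standard_error": math.sqrt(ms_res),  # model error = sqrt(MS residual)
    }


result = anova_summary(39472731, 12620947, n=34, k=2)
for name, value in result.items():
    print(name, value)
```

The values should agree with the Regression Statistics and ANOVA blocks of the Excel output up to rounding, which also confirms that the Standard Error reported by Excel is the square root of the residual mean square.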
$e_i$ - residual (error) for the i-th object
To illustrate the indicator variable, we will analyse further a simple regression model
with a "dummy" variable. The first step is to write out the regression equation
separately for both groups. For the control group $d_i = 0$, for the experimental group
$d_i = 1$.
Substituting these values into the regression model, and assuming that the residuals
(errors) are on average equal to 0, gives the following:
$y_i = a + b d_i + e_i$
For the control group ($d_i = 0$):
$y_{Ki} = a + b \cdot 0 + 0$
$y_{Ki} = a$
For the experimental group ($d_i = 1$):
$y_{Ei} = a + b \cdot 1 + 0$
$y_{Ei} = a + b$
Next we calculate the difference between the groups, that is, the difference between the
regression equation of the experimental group and that of the reference (control) group:
$y_{Ei} - y_{Ki} = (a + b) - a$
$y_{Ei} - y_{Ki} = b$
Therefore, the coefficient b shows the difference between the groups.
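The result can be checked numerically: with a single 0/1 regressor, the least-squares intercept equals the mean of the control group and the slope equals the difference between the group means. The data below are hypothetical and serve only to illustrate this property.

```python
def simple_ols(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data: d = 0 for the control group, d = 1 for the experimental group
d = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [10.0, 12.0, 11.0, 13.0, 15.0, 17.0, 16.0, 18.0, 14.0]

a, b = simple_ols(d, y)
mean_control = sum(yi for di, yi in zip(d, y) if di == 0) / d.count(0)
mean_experimental = sum(yi for di, yi in zip(d, y) if di == 1) / d.count(1)
print(a, b, mean_control, mean_experimental - mean_control)
```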
Example: an indicator variable as the regressor in a
simple model with a "dummy" variable
Let us take a concrete example of a simple regression model where the dependent
variable is wages and the independent variable is an indicator for marital status (1 if
married, 0 if not).
$y_i = a + b d_i + e_i = 798.44 + 178.61 d_i + e_i$
What is the interpretation of the model parameters in this case?
Parameter a means that the average wage of people who are not married is equal to
798.44 KM.
Parameter b means that the average wage of married persons is 178.61 KM higher
than the average wage of persons who are not married.
The sum of parameters a and b means that the average wage of people who are married
is equal to 977.05 KM.
Example: a multiple regression model with an indicator variable
as one explanatory variable and a continuous variable as the
other explanatory variable
Let us take a concrete example of a regression model where the dependent variable is
wages and the independent variables are:
- an indicator variable for a completed university degree (1 if completed university,
0 if not),
- a continuous variable, the length of employment (in months).
$y_i = a + b_d d_i + b_{x_1} x_{1i} + e_i = 275 + 162 d_i + 6.3 x_{1i} + e_i$
What is the interpretation of the model parameters in this case?
Parameter a means that for people who have not completed university and whose
work experience is equal to 0 (just starting work), the average wage is 275 KM.
Parameter $b_d$ means that the wage of a person who completed university is 162 KM
higher than the wage of a person with the same length of employment who has not
completed university.
Parameter $b_{x_1}$ means that, if all other factors in the model remain unchanged, an
increase in the length of employment of 1 month leads to an increase in wages of 6.3 KM.
Note: It is possible to include more continuous and indicator variables in the model.
The interpretations remain the same: the parameter of a given variable is interpreted on
the condition that the other factors remain unchanged (are kept under control).
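The three interpretations can be read off directly by evaluating the fitted equation. The sketch below uses the coefficients quoted in the example (275, 162 and 6.3); the function name `expected_wage` and the chosen input values are illustrative.

```python
def expected_wage(completed_university: bool, months_employed: float) -> float:
    """Fitted wage (KM) from y = 275 + 162*d + 6.3*x1, with e_i set to its mean of 0."""
    a, b_d, b_x1 = 275.0, 162.0, 6.3
    d = 1 if completed_university else 0
    return a + b_d * d + b_x1 * months_employed


start_no_degree = expected_wage(False, 0)                             # parameter a
degree_gap = expected_wage(True, 24) - expected_wage(False, 24)       # parameter b_d
one_more_month = expected_wage(False, 25) - expected_wage(False, 24)  # parameter b_x1
print(start_no_degree, degree_gap, one_more_month)
```

The gap between the two groups is the same at any length of employment, because this specification has no interaction term between the indicator and the continuous variable.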
Example 5.
For 15 houses the following information is known: the sale value (000 KM), size (00 m2)
and possession of a fire protection system:
Sale value   Size   Possession of fire protection system
84.4 2.00 yes
77.4 1.71 no
75.7 1.45 no
85.9 1.76 yes
79.1 1.93 no
70.4 1.20 yes
75.8 1.55 yes
85.9 1.93 yes
78.5 1.59 yes
79.2 1.50 yes
86.7 1.90 yes
79.3 1.39 yes
74.5 1.54 no
83.8 1.89 yes
76.8 1.59 no
Construct a model to predict the sale value of a house depending on its size and on
whether it has a fire protection system. Interpret the parameters obtained.
Solution:
Since possession of a fire protection system is a qualitative variable, we need to create
an indicator variable:
$d_i = \begin{cases} 1, & \text{if the house has a fire protection system} \\ 0, & \text{if the house does not have a fire protection system} \end{cases}$
Use the Excel IF function to create the dummy variable:
Then copy and paste the formula into the remaining cells:
Sale value - y   Size - x   Possession of fire protection systems   d
84.4             2          yes                                     1
77.4             1.71       no                                      0
75.7             1.45       no                                      0
85.9             1.76       yes                                     1
79.1             1.93       no                                      0
70.4             1.2        yes                                     1
75.8             1.55       yes                                     1
85.9             1.93       yes                                     1
78.5             1.59       yes                                     1
79.2             1.5        yes                                     1
86.7             1.9        yes                                     1
79.3             1.39       yes                                     1
74.5             1.54       no                                      0
83.8             1.89       yes                                     1
76.8             1.59       no                                      0
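The same recoding that the Excel IF function performs can be sketched in Python. The list below repeats the yes/no column of the example in table order; the variable names are illustrative.

```python
# yes/no answers for the 15 houses, in the order of the example table
fire_protection = [
    "yes", "no", "no", "yes", "no",
    "yes", "yes", "yes", "yes", "yes",
    "yes", "yes", "no", "yes", "no",
]

# Indicator variable: 1 if the house has a fire protection system, 0 otherwise
d = [1 if answer == "yes" else 0 for answer in fire_protection]

print(d)
print("houses with system:", sum(d), "houses without:", len(d) - sum(d))
```

Counting the ones is a quick sanity check that no cell was recoded by mistake before the variable is used in the regression.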
The appropriate regression model reads:
$y_i = a + b_{x_1} x_{1i} + b_d d_i + e_i$
The model specified in this way is estimated as a multiple regression (Excel - Data Analysis):
SUMMARY OUTPUT

Regression Statistics
Multiple R           0,900587
R Square             0,811057
Adjusted R Square    0,779567
Standard Error       2,262596
Observations         15

ANOVA
             df    SS         MS        F          Significance F
Regression   2     263,7039   131,852   25,75565   4,55E-05
Residual     12    61,43209   5,11934
Total        14    325,136

                                        Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept                               50,09049       4,351658         11,51067   7,68E-08   40,60904    59,57194
Size                                    16,18583       2,574442         6,287124   4,02E-05   10,57661    21,79506
Possession of fire protection systems   3,852982       1,241223         3,104183   0,009119   1,148591    6,557374
Interpretations:
Correlation coefficient 0.90
Determination coefficient 0.81
Adjusted determination coefficient 0.7796
Model error 2.26
The determination coefficient indicates that the model explains about 81% of the
variation in the dependent variable, sale value of the house (the multiple correlation
coefficient is 0.90). So, the model is good.
Next come the results of ANOVA (analysis of variance), which serve as a test of the model:
o $\sum (\hat{y}_i - \bar{y})^2 = 263.70$
o $\sum (y_i - \hat{y}_i)^2 = 61.43$
o $\sum (y_i - \bar{y})^2 = 325.14$
o Since $F_e = 25.756$ and $p = 4.55\text{E-}05 < \alpha = 0.05 \Rightarrow H_1$, we consider the model
significant (at least one of the independent variables included in the
model has a significant influence on the dependent variable).
The last part of the table contains the parameters of the model and the accompanying
information:
$\hat{y}_i = 50.09 + 16.186 x_{1i} + 3.853 d_i$. That means:
o For each additional 100 square meters the sale value is higher by 16.186
thousand KM, if the other variable stays the same.
o A house that has a fire protection system has a sale value 3.853 thousand KM
higher than a house of the same size without a fire protection system.
o In addition to the parameters (coefficients), the regression output gives:
- standard error estimates of these parameters;
- $t_e$ for testing the significance of each parameter separately. First we have to find
the theoretical interval: $t_{12;\,0.025} = -2.178$, $t_{12;\,0.975} = 2.178$. Since all the empirical
values (t Stat in the table, next to the parameters) lie outside the theoretical interval,
we accept the alternative hypothesis and consider both explanatory variables
significant in the model;
- the p-value for testing the significance of each parameter separately. Since all these
values are less than the specified level of type I error of 5%, we accept the
alternative hypothesis and consider both explanatory variables significant in the
model;
- lower and upper limits of the interval estimate of each parameter (obtained
as: parameter estimate ± t × standard error of the parameter).
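The t Stat and the interval limits reported by Excel can be reproduced from the coefficient and its standard error. The sketch below uses the values from the output of Example 5 and the critical value 2.178 quoted in the text; the helper `t_and_interval` is an illustrative name.

```python
T_CRIT = 2.178  # t(12; 0.975), the value quoted in the text for 12 degrees of freedom


def t_and_interval(coefficient, standard_error, t_crit=T_CRIT):
    """Return (t statistic, lower limit, upper limit, significant?) for one parameter."""
    t_stat = coefficient / standard_error
    lower = coefficient - t_crit * standard_error
    upper = coefficient + t_crit * standard_error
    return t_stat, lower, upper, abs(t_stat) > t_crit


# (coefficient, standard error) as reported in the Excel output
parameters = {
    "Intercept": (50.09049, 4.351658),
    "Size": (16.18583, 2.574442),
    "Fire protection": (3.852982, 1.241223),
}

results = {name: t_and_interval(b, se) for name, (b, se) in parameters.items()}
for name, (t_stat, lower, upper, significant) in results.items():
    print(f"{name}: t = {t_stat:.3f}, interval = [{lower:.3f}, {upper:.3f}], significant = {significant}")
```

Because the critical value is rounded to three decimals, the limits can differ from Excel's in the third decimal place.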
CONDITIONS FOR ECONOMETRIC MODELS
The regression model stated as a straight line, $y_i = a + b x_i + e_i,\ i = 1, 2, \dots, n$, has two
parts. The first part of the model ($a + b x_i$) represents a functional relationship in which
Y depends linearly on X, with the other factors held constant. The second, stochastic part
of the model ($e_i$) represents the random variation, which takes into account the effect of
changes in other variables that are not explicitly included in the model.
Provided that the specification of the model matches economic reality and practice, the
problems of measuring economic relations become problems of statistical estimation of the
parameters of a probability distribution, for which the assumptions of the linear
regression model must be met. These assumptions are as follows:
a. $E(e_i) = 0$ (the expected value of the errors is equal to zero)
b. $E(e_i^2) = \sigma^2$ (constant common variance, homoskedasticity)
c. $E(e_i e_j) = 0$ for each $i, j$; $i \neq j$ (independence, there is no autocorrelation in the
stochastic part)
d. $e_i \sim N(0, \sigma^2)$ (normality). This assumption points to the absence of extreme
data in the sample, that is, of outlier values of $X_t$ and $Y_t$ which are very distant from the
other values.
e. $E(e_i X_j) = 0$ for each $i, j$ (independence from $X_j$).
To estimate the parameters of the regression model it is necessary to choose a
formula (estimator) that gives their best estimates. Estimators should
have the following characteristics:
1. Unbiasedness
2. Consistency
3. Efficiency
4. Best linear unbiasedness (BLUE).
Assumptions of regression models through SPSS
MULTICOLLINEARITY
First, we inspect the correlation matrix. If the correlation coefficient between the
independent variables is higher than 0.7, there could be a problem of multicollinearity.
VIF (Variance Inflation Factor)
$VIF = \dfrac{1}{Tolerance} = \dfrac{1}{1 - R^2}$
where $R^2$ is the determination coefficient in the multiple regression model
Durbin-Watson (dw) test:
$dw = 4$: perfect negative autocorrelation
$2 < dw < 4$: negative autocorrelation, which is stronger the higher dw is
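The formula for dw has not survived in this part of the text. The sketch below uses the standard definition of the Durbin-Watson statistic, $dw = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, as an assumption about what SPSS reports in the Durbin-Watson column; the residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Standard Durbin-Watson statistic for a sequence of regression residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals that change sign at every step (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0]
# Hypothetical residuals that never change (extreme positive autocorrelation)
constant = [1.0, 1.0, 1.0, 1.0]

dw_alternating = durbin_watson(alternating)
dw_constant = durbin_watson(constant)
print(dw_alternating, dw_constant)
```

The alternating series falls in the range between 2 and 4, in line with the classification of negative autocorrelation, while the constant series gives the lowest possible value.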
HETEROSKEDASTICITY
The Goldfeld-Quandt test compares the sums of squared residuals after the sample has
been divided into two subsamples. For cross-section models, the observations are first
ordered by increasing values of the independent variable that may be a source of
heteroskedasticity (this is not necessary for time-series models).
We estimate two regressions, one for each subsample, and compare the residual
deviations using the F test. Hypothesis $H_0$ is accepted if there is no significant
difference between the sums of squared residuals.
The data need to be ordered according to the independent variable that may be a source of
heteroskedasticity. Divide the observations into two samples, estimate a regression for
each sample and calculate the residuals. Then test whether the residual variances of the
two samples are the same or not, using Levene's test (available within the test of
arithmetic means). If the residual variances of the two samples are not equal, there is a
problem of heteroskedasticity.
This problem can be addressed by weighted regression, with the weight equal to the inverse
square root of the variable that is the source of heteroskedasticity.
ECONOMETRIC CONDITIONS FOR REGRESSION MODELS WITH
SPSS EXAMPLES
Example 1.
SIMPLE LINEAR REGRESSION
We have the data in an Excel sheet. We open an SPSS document:
SPSS provides a blank sheet. We import the data, that is, transfer them from the
Excel sheet:
Next, for the file type we select Excel and specify the document to be converted from
Excel to SPSS:
We choose Open. If the Excel document has more than one sheet, SPSS offers the option to
choose which sheet we want to convert:
We choose the sheet that we want and click OK. We get a sheet with the data in SPSS:
We can adjust the characteristics of the variables by switching from Data View to
Variable View (the options are at the bottom of the window):
We have chosen the numeric type for the variables. We have one dependent and one
independent variable, so this is a simple regression. First we create a scatter plot:
We choose Simple Scatter and Define. This opens a window in which we assign the
variables:
In Titles we can define a chart title, and in Options we control the handling of missing
data (there are none, so we did not use that part). As output we obtain the diagram:
Now we create a simple linear regression model, since the scatter plot does not suggest a
different form of relationship:
This opens a window in which we assign the variables:
Under Statistics we select additional parameters of the regression model:
We choose Continue to return to the regression window.
Under Plots we select the Normal probability plot:
We choose Continue to return to the regression window.
Finally we choose OK. The output is:
Regression

Descriptive Statistics
      Mean    Std. Deviation   N
Yt    24,33   4,579            12
Xt    61,75   4,993            12

Correlations
                           Yt      Xt
Pearson Correlation   Yt   1,000   ,624
                      Xt   ,624    1,000
Sig. (1-tailed)       Yt   .       ,015
                      Xt   ,015    .
N                     Yt   12      12
                      Xt   12      12

Variables Entered/Removed (b)
Model   Variables Entered   Variables Removed   Method
1       Xt (a)              .                   Enter
a. All requested variables entered.
b. Dependent Variable: Yt

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       ,624 (a)   ,390       ,329                3,752                        ,836
a. Predictors: (Constant), Xt
b. Dependent Variable: Yt

ANOVA (b)
Model            Sum of Squares   df   Mean Square   F       Sig.
1   Regression   89,878           1    89,878        6,384   ,030 (a)
    Residual     140,789          10   14,079
    Total        230,667          11
a. Predictors: (Constant), Xt
b. Dependent Variable: Yt

Coefficients (a)
Model 1                                          (Constant)   Xt
Unstandardized Coefficients     B                -11,017      ,572
                                Std. Error       14,033       ,227
Standardized Coefficients       Beta                          ,624
t                                                -,785        2,527
Sig.                                             ,451         ,030
95,0% Confidence Interval for B Lower Bound      -42,284      ,068
                                Upper Bound      20,250       1,077
Correlations                    Zero-order                    ,624
                                Partial                       ,624
                                Part                          ,624
Collinearity Statistics         Tolerance                     1,000
                                VIF                           1,000
a. Dependent Variable: Yt

Coefficient Correlations (a)
Model                    Xt
1   Correlations   Xt    1,000
    Covariances    Xt    ,051
a. Dependent Variable: Yt

Residuals Statistics (a)
                       Minimum   Maximum   Mean    Std. Deviation   N
Predicted Value        19,32     29,06     24,33   2,858            12
Residual               -5,766    6,806     ,000    3,578            12
Std. Predicted Value   -1,752    1,652     ,000    1,000            12
Std. Residual          -1,537    1,814     ,000    ,953             12
a. Dependent Variable: Yt
We can test the assumption of normality of the residuals by taking the residuals as a
variable in the Kolmogorov-Smirnov (KS) test:

One-Sample Kolmogorov-Smirnov Test
                                            Unstandardized Residual
N                                           12
Normal Parameters (a,b)    Mean             ,0000000
                           Std. Deviation   3,57756669
Most Extreme Differences   Absolute         ,161
                           Positive         ,161
                           Negative         -,117
Kolmogorov-Smirnov Z                        ,557
Asymp. Sig. (2-tailed)                      ,915
a. Test distribution is Normal.
b. Calculated from data.

The p-value of the normality test is greater than 0.05, which means that the residuals
comply with the assumption of normality.
Example 2.
MULTIPLE LINEAR REGRESSION
We have the data in an Excel sheet and transfer them into an SPSS document:
We start with Regression and assign the variables:
We complete Statistics (including the check for multicollinearity, because this is a
multiple regression), Plots and Save with the desired options. The output is:
Descriptive Statistics
      Mean    Std. Deviation   N
Y     21,90   6,471            10
X1    12,10   4,748            10
X2    30,80   9,402            10

Correlations
                           Y       X1      X2
Pearson Correlation   Y    1,000   ,919    -,829
                      X1   ,919    1,000   -,691
                      X2   -,829   -,691   1,000
Sig. (1-tailed)       Y    .       ,000    ,001
                      X1   ,000    .       ,013
                      X2   ,001    ,013    .
N                     Y    10      10      10
                      X1   10      10      10
                      X2   10      10      10

Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
1       X2, X1 (a)          .                   Enter
a. All requested variables entered.

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       ,957 (a)   ,917       ,893                2,120                        2,156
Change Statistics
Model   R Square Change   F Change   df1   df2   Sig. F Change
1       ,917              38,420     2     7     ,000
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y

ANOVA (b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   345,431          2    172,716       38,420   ,000 (a)
    Residual     31,469           7    4,496
    Total        376,900          9
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y

Coefficients (a)
Model 1                                          (Constant)   X1      X2
Unstandardized Coefficients     B                18,872       ,902    -,256
                                Std. Error       5,290        ,206    ,104
Standardized Coefficients       Beta                          ,662    -,372
t                                                3,568        4,377   -2,460
Sig.                                             ,009         ,003    ,043
95,0% Confidence Interval for B Lower Bound      6,363        ,415    -,502
                                Upper Bound      31,381       1,389   -,010
Correlations                    Zero-order                    ,919    -,829
                                Partial                       ,856    -,681
                                Part                          ,478    -,269
Collinearity Statistics         Tolerance                     ,522    ,522
                                VIF                           1,916   1,916
a. Dependent Variable: Y

Coefficient Correlations (a)
Model                    X2      X1
1   Correlations   X2    1,000   ,691
                   X1    ,691    1,000
    Covariances    X2    ,011    ,015
                   X1    ,015    ,042
a. Dependent Variable: Y

Collinearity Diagnostics (a)
                                                   Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   X1    X2
1       1           2,822        1,000             ,00          ,01   ,00
        2           ,168         4,098             ,00          ,21   ,11
        3           ,010         16,496            1,00         ,78   ,89
a. Dependent Variable: Y
VIF < 10
Eigenvalue < 1
Condition index < 30
So, there is no multicollinearity problem.
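The Tolerance and VIF columns can be tied back to the correlation matrix. With exactly two independent variables, the determination coefficient of one regressor on the other is the square of their correlation coefficient, so both statistics follow from r = -0.691 (the X1, X2 correlation in the output). A minimal sketch, with `tolerance_and_vif` as an illustrative name:

```python
def tolerance_and_vif(r_squared):
    """Tolerance = 1 - R^2 and VIF = 1 / Tolerance for one independent variable."""
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance


# Correlation between X1 and X2 taken from the Correlations table (rounded to 3 decimals)
r_x1_x2 = -0.691

tolerance, vif = tolerance_and_vif(r_x1_x2 ** 2)
print(round(tolerance, 3), round(vif, 3))
```

The small difference from the SPSS value of VIF comes from using the correlation rounded to three decimals.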
Residuals Statistics
a
Minimum Maximum Mean Std. Deviation N
Predicted Value 12,90 31,41 21,90 6,195 10
Std. Predicted Value -1,453 1,535 ,000 1,000 10
Standard Error of Predicted
Value
,671 1,641 1,110 ,361 10
Adjusted Predicted Value 13,72 33,43 22,14 6,416 10
Residual -1,945 4,251 ,000 1,870 10
Std. Residual -,918 2,005 ,000 ,882 10
Stud. Residual -1,053 2,251 -,044 1,022 10
Deleted Residual -3,430 5,357 -,244 2,567 10
Stud. Deleted Residual -1,062 3,964 ,126 1,483 10
Mahal. Distance ,001 4,489 1,800 1,716 10
Cooks Distance ,005 ,513 ,130 ,188 10
Centered Leverage Value ,000 ,499 ,200 ,191 10
a. Dependent Variable: Y
To check for outliers, review the cases where the value is greater than 0.04.
We can test the assumption of normality of the residuals by taking the residuals as a
variable in the Kolmogorov-Smirnov (KS) test:

One-Sample Kolmogorov-Smirnov Test
                                            Unstandardized Residual
N                                           10
Normal Parameters (a,b)    Mean             ,0000000
                           Std. Deviation   1,86989587
Most Extreme Differences   Absolute         ,165
                           Positive         ,165
                           Negative         -,149
Kolmogorov-Smirnov Z                        ,522
Asymp. Sig. (2-tailed)                      ,948
a. Test distribution is Normal.
b. Calculated from data.
$d_{Si} = \begin{cases} 1, & \text{if village} \\ 0, & \text{if not village} \end{cases}$
$d_{Gi} = \begin{cases} 1, & \text{if city} \\ 0, & \text{if not city} \end{cases}$
$d_{Pi} = \begin{cases} 1, & \text{if suburb} \\ 0, & \text{if not suburb} \end{cases}$
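The three indicator variables defined above can be sketched in Python as follows. The category labels and the sample list are hypothetical; the function `make_dummies` is an illustrative helper, not an SPSS feature.

```python
def make_dummies(values, categories, reference=None):
    """Build one 0/1 indicator per category, optionally skipping a reference category."""
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
        if category != reference
    }


# Hypothetical qualitative variable with three modalities
locations = ["village", "city", "suburb", "city", "village", "suburb"]
categories = ["village", "city", "suburb"]

all_dummies = make_dummies(locations, categories)
model_dummies = make_dummies(locations, categories, reference="village")

# Every observation belongs to exactly one modality, so the three dummies sum to 1
row_sums = [sum(all_dummies[c][i] for c in categories) for i in range(len(locations))]
print(row_sums)
print(sorted(model_dummies))
```

Because the three indicators always sum to 1, entering all of them together with a constant makes the regressors perfectly collinear. Leaving one category out as the reference avoids this, and the remaining coefficients are then read as differences from that category.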
We use the option Recode into Different Variables:
We select the qualitative variable SGP and transform it into the first dummy variable, for
the modality Village:
We complete the transformation with Old and New Values:
Before entering the next change, choose Add to include the current change. After
completing all three changes:
We choose Continue, return to the start window and choose OK. The result is a new
column with the indicator variable for the modality Village:
In the same way we create dummy variables for the modalities City and Suburb:
Now we can create the regression model. In order to avoid the problem of
multicollinearity we include two dummy variables in the regression (for city and suburb),
while the interpretation is made relative to the third dummy variable (village):
Descriptive Statistics
Mean Std. Deviation N
Cost 3678,64 1118,325 364
Revenue 39425,41 15002,371 364
dP ,35 ,479 364
dG ,38 ,487 364
Correlations
Cost Revenue dP dG
Cost 1,000 ,532 -,402 ,586
Revenue ,532 1,000 -,134 -,238
dP -,402 -,134 1,000 -,582
Pearson Correlation
dG ,586 -,238 -,582 1,000
Cost . ,000 ,000 ,000
Revenue ,000 . ,005 ,000
dP ,000 ,005 . ,000
Sig. (1-tailed)
dG ,000 ,000 ,000 .
Cost 364 364 364 364
Revenue 364 364 364 364
dP 364 364 364 364
N
dG 364 364 364 364
Variables Entered/Removed
Model
Variables
Entered
Variables
Removed Method
1 dG, Revenue,
dP
a
. Enter
a. All requested variables entered.
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate Durbin-Watson
1 ,923
a
,852 ,851 432,254 1,750
a. Predictors: (Constant), dG, Revenue, dP
b. Dependent Variable: Cost
ANOVA
b
Model Sum of Squares df Mean Square F Sig.
Regression 3,867E8 3 1,289E8 689,923 ,000
a
Residual 6,726E7 360 186843,445
1
Total 4,540E8 363
a. Predictors: (Constant), dG, Revenue, dP
b. Dependent Variable: Cost
Coefficients
a
Unstandardized Coefficients
Standardized
Coefficients 95,0% Confidence Interval for B Correlations Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Lower Bound Upper Bound Zero-order Partial Part Tolerance VIF
(Constant) 409,740 93,392 4,387 ,000 226,078 593,401
Revenue ,058 ,002 ,778 34,962 ,000 ,055 ,061 ,532 ,879 ,709 ,831 1,203
dP 533,163 62,069 ,228 8,590 ,000 411,101 655,226 -,402 ,412 ,174 ,582 1,717
1
dG 2078,090 62,355 ,904 33,327 ,000 1955,464 2200,717 ,586 ,869 ,676 ,559 1,788
a. Dependent Variable: Cost
Collinearity Diagnostics
a
Variance Proportions
Model
Dimensi
on Eigenvalue Condition Index
(Constant) Revenue dP dG
1 2,683 1,000 ,01 ,01 ,02 ,02
2 1,000 1,638 ,00 ,00 ,19 ,17
3 ,280 3,098 ,00 ,14 ,41 ,38
1
4 ,037 8,532 ,99 ,85 ,38 ,43
a. Dependent Variable: Cost
On the basis of the collinearity diagnostics, we conclude that there is no
multicollinearity problem.
Residuals Statistics
a
Minimum Maximum Mean Std. Deviation N
Predicted Value 1681,29 7068,75 3678,64 1032,159 364
Std. Predicted Value -1,935 3,284 ,000 1,000 364
Standard Error of Predicted
Value
36,703 90,065 44,447 8,825 364
Adjusted Predicted Value 1666,92 7058,33 3678,63 1032,123 364
Residual -728,342 768,672 ,000 430,464 364
Std. Residual -1,685 1,778 ,000 ,996 364
Stud. Residual -1,696 1,787 ,000 1,002 364
Deleted Residual -737,861 778,844 ,010 435,365 364
Stud. Deleted Residual -1,700 1,792 ,000 1,003 364
Mahal. Distance 1,620 14,762 2,992 1,907 364
Cooks Distance ,000 ,023 ,003 ,003 364
Centered Leverage Value ,004 ,041 ,008 ,005 364
a. Dependent Variable: Cost
Example 4.
MULTIPLE REGRESSION (HETEROSKEDASTICITY - TWO SAMPLES)
We have the data in an Excel sheet and transfer them into an SPSS document:
We start with Regression and assign the variables:
We complete Statistics, Plots and Save with the desired options. The output is:
Descriptive Statistics
Mean Std. Deviation N
Y 23,00 27,617 32
X1 25,09 30,003 32
X2 63,53 60,156 32
Correlations
Y X1 X2
Y 1,000 ,901 ,815
X1 ,901 1,000 ,769
Pearson Correlation
X2 ,815 ,769 1,000
Y . ,000 ,000
X1 ,000 . ,000
Sig. (1-tailed)
X2 ,000 ,000 .
Y 32 32 32
X1 32 32 32
X2 32 32 32
Variables Entered/Removed
Model
Variables
Entered
Variables
Removed Method
1 X2, X1
a
. Enter
a. All requested variables entered.
Model Summary
b
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate Durbin-Watson
1 ,921
a
,848 ,837 11,142 1,396
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y
ANOVA
b
Model Sum of Squares df Mean Square F Sig.
Regression 20043,661 2 10021,831 80,724 ,000
a
Residual 3600,339 29 124,150
1
Total 23644,000 31
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y
Coefficients
a
Unstandardized Coefficients
Standardized
Coefficients 95,0% Confidence Interval for B Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Lower Bound Upper Bound Tolerance VIF
(Constant) -1,215 2,890 -,421 ,677 -7,126 4,695
X1 ,618 ,104 ,671 5,923 ,000 ,404 ,831 ,409 2,443
1
X2 ,137 ,052 ,299 2,639 ,013 ,031 ,244 ,409 2,443
a. Dependent Variable: Y
Coefficient Correlations
a
Model X2 X1
X2 1,000 -,769 Correlations
X1 -,769 1,000
X2 ,003 -,004
1
Covariances
X1 -,004 ,011
a. Dependent Variable: Y
Collinearity Diagnostics
a
Variance Proportions
Model
Dimensi
on Eigenvalue Condition Index
(Constant) X1 X2
1 2,505 1,000 ,05 ,03 ,03
2 ,378 2,575 ,83 ,17 ,03
1
3 ,117 4,634 ,12 ,80 ,94
a. Dependent Variable: Y
Casewise Diagnostics (a)
Case Number   Std. Residual   Y   Predicted Value   Residual
1 1,257 149 134,99 14,010
2 1,001 73 61,84 11,156
3 -1,030 21 32,48 -11,476
4 ,041 7 6,54 ,462
5 -,703 9 16,83 -7,831
6 1,645 60 41,67 18,330
7 ,940 26 15,53 10,473
8 ,281 5 1,87 3,128
9 -,212 5 7,36 -2,362
10 ,276 14 10,93 3,070
11 ,540 26 19,99 6,013
12 ,851 27 17,52 9,483
13 -,250 10 12,78 -2,782
14 ,186 13 10,93 2,070
15 ,270 23 19,99 3,013
16 -,342 38 41,81 -3,808
17 -3,057 19 53,06 -34,061
18 -1,925 17 38,45 -21,445
19 ,160 11 9,21 1,786
20 ,783 13 4,27 8,726
21 -1,666 17 35,56 -18,563
22 -,293 31 34,26 -3,260
23 ,342 19 15,18 3,816
24 -,930 11 21,36 -10,360
25 -,157 3 4,75 -1,754
26 ,284 13 9,83 3,168
27 ,408 15 10,45 4,551
28 1,002 26 14,84 11,159
29 ,913 22 11,82 10,178
30 -,207 3 5,30 -2,303
31 -,207 3 5,30 -2,303
32 -,205 7 9,28 -2,283
a. Dependent Variable: Y
Residuals Statistics
a
Minimum Maximum Mean Std. Deviation N
Predicted Value 1,87 134,99 23,00 25,428 32
Std. Predicted Value -,831 4,404 ,000 1,000 32
Standard Error of Predicted
Value
2,033 9,335 3,108 1,428 32
Adjusted Predicted Value 1,64 102,01 22,28 21,408 32
Residual -34,061 18,330 ,000 10,777 32
Std. Residual -3,057 1,645 ,000 ,967 32
Stud. Residual -3,309 2,303 ,020 1,081 32
Deleted Residual -39,921 46,990 ,720 14,518 32
Stud. Deleted Residual -4,122 2,503 -,004 1,193 32
Mahal. Distance ,063 20,788 1,937 3,783 32
Cooks Distance ,000 4,161 ,172 ,737 32
Centered Leverage Value ,002 ,671 ,063 ,122 32
a. Dependent Variable: Y
Now we split the data into two samples (according to increasing values of the independent
variable $X_1$, which could be the cause of heteroskedasticity because the p-value of its
parameter is the lowest):
We determine their regression models, saving the values of the residuals. Then we calculate
the variance of the residuals for both regressions and test the difference between the
variances, in order to check the assumption of homoskedasticity. We add a new variable
(1 for sample I, where $X_1$ has a value of 8 or less; 2 for sample II, where $X_1$ has a value
of 21 or more; the third group are the other observations). The output is organized in
groups (Split File):
We specify that the data are divided into groups according to the new variable:
We choose OK.
We restart Regression and assign the variables as before. Here the goal is to save the new
variable with the residuals.
Regression model I (group 1), ANOVA and coefficients:
Model Summary
b,c
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate Durbin-Watson
1 ,671
a
,450 ,328 4,771 2,266
a. Predictors: (Constant), X2, X1
b. VAR00001 = 1,00
c. Dependent Variable: Y
ANOVA
b,c
Model Sum of Squares df Mean Square F Sig.
Regression 167,820 2 83,910 3,687 ,068
a
Residual 204,847 9 22,761
1
Total 372,667 11
a. Predictors: (Constant), X2, X1
b. VAR00001 = 1,00
c. Dependent Variable: Y
Coefficients
a,b
Unstandardized Coefficients
Standardized
Coefficients Correlations Collinearity Statistics
Model
B Std. Error Beta
t Sig.
Zero-order Partial Part Tolerance VIF
(Constant) -2,849 4,497 -,633 ,542
X1 1,701 ,747 ,582 2,278 ,049 ,638 ,605 ,563 ,935 1,070
1
X2 ,036 ,043 ,217 ,847 ,419 ,365 ,272 ,209 ,935 1,070
a. VAR00001 = 1,00
b. Dependent Variable: Y
Regression model II (group 2), ANOVA and coefficients:
Model Summary
b,c
Model R R Square
Adjusted R
Square
Std. Error of the
Estimate Durbin-Watson
1 ,905
a
,819 ,783 17,101 1,983
a. Predictors: (Constant), X2, X1
b. VAR00001 = 2,00
c. Dependent Variable: Y
ANOVA
b,c
Model Sum of Squares df Mean Square F Sig.
Regression 13236,938 2 6618,469 22,633 ,000
a
Residual 2924,293 10 292,429
1
Total 16161,231 12
a. Predictors: (Constant), X2, X1
b. VAR00001 = 2,00
c. Dependent Variable: Y
Coefficients (a, b)
Column groups: Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); Correlations (Zero-order, Partial, Part); Collinearity Statistics (Tolerance, VIF).
Model                B        Std. Error   Beta   t       Sig.   Zero-order   Partial   Part   Tolerance   VIF
1       (Constant)   -4,944   8,792               -,562   ,586
        X1           ,597     ,272         ,554   2,195   ,053   ,881         ,570      ,295   ,284        3,519
        X2           ,173     ,113         ,387   1,533   ,156   ,855         ,436      ,206   ,284        3,519
a. VAR00001 = 2,00
b. Dependent Variable: Y
Turn off the Split File option and return to the analysis of the new residual variable.
Start with the independent-samples t-test for the difference between means, because its output
also includes the test of the difference between variances (Levene's test):
In Define Groups we specify the two samples (1 and 2):
We choose Continue, then OK:
Group Statistics
                          VAR00001   N    Mean       Std. Deviation   Std. Error Mean
Unstandardized Residual   1,00       12   ,0000000   4,31537183       1,24574054
                          2,00       13   ,0000000   15,61060671      4,32960330
Independent Samples Test (Unstandardized Residual)
                                              Equal variances   Equal variances
                                              assumed           not assumed
Levene's Test for Equality of Variances
  F                                           14,753
  Sig.                                        ,001
t-test for Equality of Means
  t                                           ,000              ,000
  df                                          23                13,965
  Sig. (2-tailed)                             1,000             1,000
  Mean Difference                             ,00000000         ,00000000
  Std. Error Difference                       4,66934791        4,50525629
  95% Confidence Interval of the Difference
    Lower                                     -9,65928209       -9,66510526
    Upper                                     9,65928209        9,66510526
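The standard errors and degrees of freedom in the Independent Samples Test table can be recomputed from the Group Statistics table. The Python sketch below is an illustration only, not the code SPSS runs. The function names are chosen for this example, and the inputs are the N and Std. Deviation values printed above, with decimal commas written as points.

```python
import math


def pooled_se(n1, s1, n2, s2):
    """Std. error of the mean difference when equal variances are assumed."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def welch_se_df(n1, s1, n2, s2):
    """Std. error and Welch-Satterthwaite df when equal variances are not assumed."""
    a, b = s1**2 / n1, s2**2 / n2
    se = math.sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return se, df


# N and Std. Deviation of the unstandardized residuals in groups 1 and 2.
n1, s1 = 12, 4.31537183
n2, s2 = 13, 15.61060671

se_pooled = pooled_se(n1, s1, n2, s2)
se_welch, df_welch = welch_se_df(n1, s1, n2, s2)
print(round(se_pooled, 4), round(se_welch, 4), round(df_welch, 3))
```

Because the residuals of each regression average to zero, the mean difference and t are zero by construction; the informative part of the table is Levene's test.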
As the p-value of Levene's test is less than 0.05, we conclude that the residual variances of the selected samples differ, which indicates the presence
of heteroskedasticity. The critical variable, the one that is the source of heteroskedasticity, is the variable whose regression parameter has the
lowest p-value. Heteroskedasticity can be addressed with weighted regression, in which we weight by the variable that is the source of heteroskedasticity. So we create new
variables by weighting, using Compute:
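The exact weighting formula is not reproduced in this text. A minimal sketch, assuming the common choice of dividing every term of the model by X1 (so that Yn = Y/X1, X1n = 1/X1 and X2n = X2/X1), is given below in Python. The helper functions `solve` and `ols` and the data are made up for the illustration and are not taken from the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up, noise-free data generated from Y = 2 + 3*X1 + 0.5*X2.
x1 = [4.0, 6.0, 8.0, 15.0, 21.0, 30.0]
x2 = [10.0, 3.0, 25.0, 7.0, 40.0, 12.0]
y = [2 + 3 * a + 0.5 * b for a, b in zip(x1, x2)]

# Assumed weighting: divide every term of the model by X1.
yn = [yi / a for yi, a in zip(y, x1)]
x1n = [1 / a for a in x1]
x2n = [b / a for a, b in zip(x1, x2)]

# Regress Yn on a constant, X1n and X2n.
const, b_x1n, b_x2n = ols([[1.0, u, v] for u, v in zip(x1n, x2n)], yn)
print(round(const, 6), round(b_x1n, 6), round(b_x2n, 6))
```

Under this assumed transformation the roles of the first two coefficients swap: the constant of the transformed model estimates the original slope on X1, and the coefficient on X1n estimates the original constant.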
With the new variables we run the regression. The output is:
Descriptive Statistics
      Mean     Std. Deviation   N
Yn    1,1531   ,60050           32
X1n   ,1048    ,09310           32
X2n   4,7574   4,44110          32

Correlations
                            Yn      X1n     X2n
Pearson Correlation   Yn    1,000   ,221    ,392
                      X1n   ,221    1,000   ,642
                      X2n   ,392    ,642    1,000
Sig. (1-tailed)       Yn    .       ,112    ,013
                      X1n   ,112    .       ,000
                      X2n   ,013    ,000    .
N                     Yn    32      32      32
                      X1n   32      32      32
                      X2n   32      32      32
Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
1       X2n, X1n (a)        .                   Enter
a. All requested variables entered.

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       ,394 (a)   ,155       ,097                ,57062                       1,890
a. Predictors: (Constant), X2n, X1n
b. Dependent Variable: Yn
ANOVA (b)
Model                Sum of Squares   df   Mean Square   F       Sig.
1       Regression   1,736            2    ,868          2,666   ,087 (a)
        Residual     9,443            29   ,326
        Total        11,179           31
a. Predictors: (Constant), X2n, X1n
b. Dependent Variable: Yn

Coefficients (a)
Column groups: Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); Correlations (Zero-order, Partial, Part); Collinearity Statistics (Tolerance, VIF).
Model                B       Std. Error   Beta    t       Sig.   Zero-order   Partial   Part    Tolerance   VIF
1       (Constant)   ,914    ,160                 5,708   ,000
        X1n          -,331   1,435        -,051   -,230   ,819   ,221         -,043     -,039   ,588        1,700
        X2n          ,057    ,030         ,425    1,910   ,066   ,392         ,334      ,326    ,588        1,700
a. Dependent Variable: Yn
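The Tolerance and VIF columns can be checked by hand. With exactly two predictors, the tolerance of each predictor is 1 minus the squared correlation between the two predictors, and VIF is the reciprocal of the tolerance. The Python sketch below applies this to the correlation ,642 between X1n and X2n from the Correlations table. The function names are chosen for this illustration, and the shortcut holds only in the two-predictor case (in general the tolerance comes from regressing each predictor on all the others).

```python
def tolerance_two_predictors(r):
    """Tolerance of either predictor when the model has exactly two predictors."""
    return 1 - r**2


def vif(tolerance):
    """Variance inflation factor as the reciprocal of the tolerance."""
    return 1 / tolerance


r_x1n_x2n = 0.642  # Pearson correlation between X1n and X2n (rounded in the output)
tol = tolerance_two_predictors(r_x1n_x2n)
vif_value = vif(tol)
print(round(tol, 3), round(vif_value, 3))
```

Small differences in the last digit relative to the SPSS output are expected, because the correlation used here is already rounded to three decimals.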
Coefficient Correlations (a)
Model                        X2n     X1n
1       Correlations   X2n   1,000   -,642
                       X1n   -,642   1,000
        Covariances    X2n   ,001    -,028
                       X1n   -,028   2,060
a. Dependent Variable: Yn

Collinearity Diagnostics (a)
                                                   Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   X1n   X2n
1       1           2,554        1,000             ,05          ,03   ,04
        2           ,287         2,983             ,94          ,10   ,19
        3           ,159         4,006             ,01          ,86   ,77
a. Dependent Variable: Yn

Residuals Statistics (a)
                                    Minimum   Maximum   Mean     Std. Deviation   N
Predicted Value                     ,8483     1,9590    1,1531   ,23664           32
Std. Predicted Value                -1,288    3,406     ,000     1,000            32
Standard Error of Predicted Value   ,102      ,408      ,161     ,068             32
Adjusted Predicted Value            ,8033     2,3659    1,1716   ,29959           32
Residual                            -,70505   2,15608   ,00000   ,55191           32
Std. Residual                       -1,236    3,778     ,000     ,967             32
Stud. Residual                      -1,277    3,997     -,014    1,024            32
Deleted Residual                    -,79451   2,41295   -,01846  ,62254           32
Stud. Deleted Residual              -1,291    5,861     ,047     1,284            32
Mahal. Distance                     ,027      14,907    1,938    3,017            32
Cook's Distance                     ,000      ,635      ,046     ,123             32
Centered Leverage Value             ,001      ,481      ,063     ,097             32
a. Dependent Variable: Yn
Comparing this regression with the initial one shows that the model corrected for
heteroskedasticity with this method is considerably "worse".
Example 5
MULTIPLE REGRESSION (AUTOCORRELATION)
The data are in an Excel sheet and we import them into an SPSS document. We start with
Regression and assign the variables:
We complete Statistics, Plots and Save with the desired options. The output is:
Descriptive Statistics
     Mean       Std. Deviation   N
Y    9,95       36,626           19
X1   10832,26   4771,812         19
X2   377,21     1204,523         19

Correlations
                           Y       X1      X2
Pearson Correlation   Y    1,000   ,629    ,413
                      X1   ,629    1,000   -,009
                      X2   ,413    -,009   1,000
Sig. (1-tailed)       Y    .       ,002    ,039
                      X1   ,002    .       ,486
                      X2   ,039    ,486    .
N                     Y    19      19      19
                      X1   19      19      19
                      X2   19      19      19
Variables Entered/Removed
Model   Variables Entered   Variables Removed   Method
1       X2, X1 (a)          .                   Enter
a. All requested variables entered.

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       ,756 (a)   ,572       ,518                25,421                       1,071
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y
The Durbin-Watson statistic of 1,071 is well below 2, which indicates the presence of positive autocorrelation.
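The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The Python sketch below shows the calculation on two made-up residual series (the residuals of this example are not listed in the text) and the usual approximation of the first-order autocorrelation coefficient as 1 - d/2 for the reported value. The function name is chosen for the illustration.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e**2 for e in residuals)
    return numerator / denominator


# Made-up residual series, used only to show how d reacts to the pattern.
alternating = [1.0, -1.0, 1.0, -1.0]             # sign flips each period
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of the same sign

d_alternating = durbin_watson(alternating)
d_persistent = durbin_watson(persistent)

# Approximate first-order autocorrelation implied by the reported d = 1.071.
rho_hat = 1 - 1.071 / 2
print(d_alternating, round(d_persistent, 3), round(rho_hat, 4))
```

The persistent series gives a value well below 2 and the alternating series a value above 2, which matches the reading of the statistic described above.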
ANOVA (b)
Model                Sum of Squares   df   Mean Square   F        Sig.
1       Regression   13807,684        2    6903,842      10,684   ,001 (a)
        Residual     10339,263        16   646,204
        Total        24146,947        18
a. Predictors: (Constant), X2, X1
b. Dependent Variable: Y
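The Model Summary and ANOVA tables are linked by a few identities: R Square is the regression sum of squares over the total, Adjusted R Square compares the residual and total mean squares, F is the ratio of the regression and residual mean squares, and the standard error of the estimate is the square root of the residual mean square. The Python sketch below recomputes these from the sums of squares and degrees of freedom printed above (decimal commas written as points); the variable names are chosen for the illustration.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 13807.684, 2
ss_residual, df_residual = 10339.263, 16

ss_total = ss_regression + ss_residual
df_total = df_regression + df_residual

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (ss_residual / df_residual) / (ss_total / df_total)

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual
std_error_estimate = math.sqrt(ms_residual)

print(round(r_square, 3), round(adjusted_r_square, 3))
print(round(f_statistic, 3), round(std_error_estimate, 3))
```

The rounded results agree with the values reported in the two tables (,572; ,518; 10,684; 25,421).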
Coefficients (a)
Column groups: Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); 95,0% Confidence Interval for B (Lower Bound, Upper Bound); Correlations (Zero-order, Partial, Part); Collinearity Statistics (Tolerance, VIF).
Model                B         Std. Error   Beta   t        Sig.   Lower Bound   Upper Bound   Zero-order   Partial   Part   Tolerance   VIF
1       (Constant)   -47,505   14,933              -3,181   ,006   -79,161       -15,848
        X1           ,005      ,001         ,633   3,870    ,001   ,002          ,008          ,629         ,695      ,633   1,000       1,000
        X2           ,013      ,005         ,419   2,561    ,021   ,002          ,023          ,413         ,539      ,419   1,000       1,000
a. Dependent Variable: Y
Collinearity Diagnostics (a)
                                                   Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   X1    X2
1       1           2,078        1,000             ,03          ,03   ,06
        2           ,842         1,571             ,01          ,01   ,94
        3           ,080         5,082             ,96          ,95   ,01
a. Dependent Variable: Y
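Each condition index in the table is the square root of the largest eigenvalue divided by the eigenvalue of that dimension. The Python sketch below recomputes the indices from the eigenvalues printed above; the function name is chosen for the illustration, and because the printed eigenvalues are rounded to three decimals, the last index differs slightly from the SPSS value.

```python
import math


def condition_indices(eigenvalues):
    """Square root of (largest eigenvalue / each eigenvalue)."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]


# Eigenvalues from the Collinearity Diagnostics table (rounded in the output).
eigenvalues = [2.078, 0.842, 0.080]
indices = condition_indices(eigenvalues)
print([round(value, 3) for value in indices])
```

The smaller an eigenvalue is relative to the largest one, the larger its condition index, so the third dimension carries the highest index here.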
Residuals Statistics (a)
                       Minimum   Maximum   Mean   Std. Deviation   N
Predicted Value        -51,14    65,38     9,95   27,696           19
Residual               -42,129   40,390    ,000   23,967           19
Std. Predicted Value   -2,205    2,001     ,000   1,000            19
Std. Residual          -1,657    1,589     ,000   ,943             19
a. Dependent Variable: Y