
Basics of Statistical Inference
V. Sreenivas
sreenivas_vishnu@yahoo.com
Basics of Statistical Inference
Importance
Primary entities in statistical inference
Types of statistical inference
Estimation considerations
Measure of accuracy of estimates
General framework of testing
Types of errors in statistical testing
Form of a Statistical test
Interpretation of a test result
Basics of Statistical Inference
In all clinical/epidemiological studies,
information collected represents only a
sample from the target population
Drawing conclusions about the
population depends on statistical analysis
of data
So understanding the basis of statistical inference is
important for interpreting the results of
epidemiological studies
Basics of Statistical Inference
3 primary entities
The target population
Set of characteristics or
variables
Probability distribution of the
characteristics
Basics of Statistical Inference
Population
The collection of units of observation that is of
interest & is the target of the investigation
Eg. In studying the prevalence of
osteoporosis in women of a city, all the
women in that city would form the target
population
Essential to identify the population clearly &
precisely
Basics of Statistical Inference
Variables
Once population is identified, clearly define what
characteristics of the units of this population are to
be studied
In the above example, we need to define:
Osteoporosis (reliable & valid method of diagnosis:
DEXA/ Ultrasound, normal values of BMD etc.)
Clear & precise methods of measuring these
characteristics are essential for the success of the
study
Basics of Statistical Inference
Variables
Qualitative: Take a few possible values (Eg.
Sex, Disease status)
Quantitative: Can theoretically take any
value within a specified range (Eg.
Blood sugar, Syst. BP)
Type of analysis depends on the type of the
variable
Basics of Statistical Inference
Probability distribution
Most crucial link between population & its
characteristics
Allows us to draw inferences on the population,
based on a sample
Tells us what different values a variable can take
and how frequently each value can occur in the
population
Basics of Statistical Inference
Probability distribution
Common distributions in health research are
Binomial, Poisson & Normal
Eg. Incidence of a relatively common disease
may be approximated by Binomial distribution
Incidence of a rare disease can be considered
to have a Poisson distribution
Continuous variables are often considered to
be Normally distributed
Probability distribution
Prob. Distribution is characterized by certain quantities
called parameters
These quantities allow us to calculate the probabilities of
various events concerning the variable
Eg. Binomial dist. has 2 parameters n and p.
This distribution occurs when a fixed number (n) of
subjects is observed, the characteristic is dichotomous in
nature and each subject has the same probability (p) of
having one value and (1-p) of having the other value
The statistical inference then involves finding out the value
of p in the population, based on a carefully selected sample
[Figure: Binomial distribution for n = 10 & p = 0.5. X-axis: number of successes (0 to 10); Y-axis: probability (0.00 to 0.25)]
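To make the role of the parameters concrete, here is a minimal sketch (an illustration added here, not part of the original slides) that computes the Binomial probabilities plotted above for n = 10 and p = 0.5, using only the Python standard library.

from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k 'successes' among n subjects when each
    subject independently has probability p of success."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
for k in range(n + 1):
    print(f"P(X = {k:2d}) = {binomial_pmf(k, n, p):.4f}")
# The probabilities peak at k = 5 (P = 0.2461), matching the figure above.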

Probability distribution
Eg. The Normal distribution is a mathematical curve
characterized by two quantities (μ, σ): the mean and
standard deviation respectively
Most quantitative characteristics follow this
Symmetric, bell-shaped curve
One half is the mirror image of the other
Mean, median & mode are the same and are at the center
Mean ± 1 SD covers ~68% of the data, ± 1.96 SD ~95%, ± 2.58 SD ~99%
[Figure: Standard Normal curve. Empirical properties of a Normal deviate (X: variable in original units, Z: standardized variable): X ± 1 SD (Z = ±1) covers ~68% of the area, X ± 1.96 SD (Z = ±1.96) covers 95%, and X ± 2.58 SD (Z = ±2.58) covers 99% of the area under the curve.]
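These coverage figures can be checked directly from the standard Normal CDF. The sketch below (an illustrative check, not from the original slides) prints the area between -Z and +Z using only the Python standard library.

from math import erf, sqrt

def central_coverage(z: float) -> float:
    """Area under the standard Normal curve between -z and +z."""
    return erf(z / sqrt(2))

for z in (1.0, 1.96, 2.58):
    print(f"Z = ±{z:.2f} covers {central_coverage(z) * 100:.1f}% of the area")
# prints approximately 68.3%, 95.0% and 99.0%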
Statistical inference
Estimation
We estimate some
characteristic of the
population, based
on a sample
Testing
We test some
hypotheses about
the population
parameters
Descriptive Studies
In these, generally the objective is:
To estimate the values of the parameters of
the Prob. dist., based on the sampled
observations
Best guess of the value in the population
and a measure of accuracy of this estimate
are obtained
Estimation
Best guesses
Population mean : Mean of sample
Population proportion: Sample proportion
Considerations:
Consistency: As the sample size increases, the
estimates approach their target values
Unbiasedness: The average value of the estimated
parameter over a large number of repeated samples of the
same size will equal the population value
Maximum likelihood: The value of the parameter
which maximizes the probability of observing the
sample that has actually been observed
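As a small numerical illustration of the maximum-likelihood idea (the counts below are hypothetical, chosen only for illustration): if 18 of 60 sampled subjects show the characteristic, the value of p that maximizes the Binomial likelihood is the sample proportion, 0.30.

from math import comb

def likelihood(p: float, k: int, n: int) -> float:
    """Probability of observing k 'positives' out of n if the true proportion is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 18, 60                                  # hypothetical sample
grid = [i / 1000 for i in range(1, 1000)]      # candidate values of p
best_p = max(grid, key=lambda p: likelihood(p, k, n))
print(f"Maximum-likelihood estimate of p = {best_p:.3f}")   # 0.300 = k/n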
Accuracy of estimates
When an estimate (E) of a parameter is obtained,
we need to know how this value (E) would
change if another sample is studied
The distribution of values of E over different
repeated samples (under identical conditions) is
known as the sampling distribution of E
This sampling distribution can be determined
empirically or purely on the basis of sampling theory
The standard deviation of the estimate E is called
the Standard Error (SE)
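The sketch below (a simulation under assumed population values, added for illustration) shows what the sampling distribution and the SE mean in practice: draw many repeated samples from a Normal population, take the mean of each, and compare the spread of those means with σ/√n.

import random
import statistics

random.seed(1)
mu, sigma, n = 0.678, 0.12, 150     # assumed population values (as in the BMD example below)
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(2000)            # 2000 repeated samples of size n
]
empirical_se = statistics.stdev(sample_means)
theoretical_se = sigma / n ** 0.5
print(f"Empirical SE of the mean:  {empirical_se:.4f}")
print(f"Theoretical SE = σ/√n:     {theoretical_se:.4f}")   # ≈ 0.0098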
Accuracy of Estimates
Once the sampling distribution of the
estimate is known, we can answer:
How close is my estimate likely to be to
the true value of the parameter?
We can state with a certain confidence that
the true value will lie within a certain
interval (Confidence Interval)
Confidence Interval
The more confidence required, the wider
the interval, for a given sample size
Intuitively, the more information we
have (larger sample), the narrower the
interval (the more certain we are about
the result)
Estimation of parameters from a Normal population - Example
The average Bone Mineral Density (BMD) of 150
elderly women (60+ years) is 0.678 gm/cm² with a SD
of 0.12 gm/cm². What is the 95% C.I. of the mean BMD?
Sample size (n) = 150, Mean = 0.678, SD = 0.12
It has been shown that the sample Mean has a Normal
distribution, with mean equal to the population mean and
Standard Error SE = σ/√n
SE = 0.12/√150 = 0.0098
Confidence Interval for BMD in elderly women
We have the Mean = 0.678, SE = 0.0098; and
we also know that the Mean follows a Normal
distribution
Using the Normal distribution properties, we know
that Mean ± 1.96 SE covers 95% of values
0.678 ± (1.96 × 0.0098) = (0.659, 0.697) covers
95% of results if we repeat the study
This interval is called the 95% CI for the mean BMD
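A minimal sketch of this calculation, using the figures quoted above:

from math import sqrt

n, mean, sd = 150, 0.678, 0.12                 # BMD example
se = sd / sqrt(n)                              # standard error of the mean
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"SE = {se:.4f}")                        # 0.0098
print(f"95% CI = ({lower:.3f}, {upper:.3f})")  # (0.659, 0.697)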
Interpretation of CI
Mean BMD = 0.678 and 95% CI: (0.659, 0.697)
If we repeat the study 100 times, 95% of the times we get
a mean BMD between 0.66 and 0.70 gm/cm²
Another interpretation is:
There is a 95% chance that these two limits cover the
true, unknown, but fixed value of the mean BMD in
elderly women
There is a 95% chance that the truth is somewhere in
this interval
Do not get the impression that the truth varies from
0.66 to 0.70
Interpretation of CI
Mean BMD = 0.678 and 95% CI: (0.659, 0.697)
The narrower the interval, the more confident
we are of the result
Alternatively, the wider this interval, the less
certain we are about the result
Analytical Studies
Involve testing of hypothesis
Study will have formulated research questions
(hypotheses)
Eg. Is treatment A superior to treatment B?
Based on the observations from the sample, we
need to draw conclusions
Inference is a 2 step process:
- Estimate the parameters
- Test the hypotheses involving these parameters
Statistical Tests of Hypotheses
Step 1: Identify the Null Hypothesis (H0)
- No additional effect of the new treatment;
- No difference in prevalence rates;
- Relative risk is one etc.
It should be testable
- Possible to identify which parameters
need to be estimated and their sampling
distribution, given the study design
Statistical tests of Hypotheses
Null hypothesis: cure rate p1 = p2
Alternative hypotheses to the null
hypothesis:
The cure rates are different
(p1 ≠ p2, the two-tailed alternative)
The cure rate in the new method is higher
(p2 > p1, the one-tailed alternative)
Step 2: Determine the levels of errors
that can be acceptable

Decision     | Truth in the population
             | H0 is true        | H0 is false
Accept H0    | No error          | Type II error (β)
Reject H0    | Type I error (α)  | No error

Analogous with a laboratory test:
α: False Positivity
β: False Negativity
1 - β: Sensitivity (Power of a test)
Step 2 (contd.): Determine the levels of errors
that can be acceptable
(decision table as above)
It is impossible to reduce both errors simultaneously
for a fixed sample size: one decreases when the other increases
Design the study with the desired level of α and
minimize β
The choice of α & β is made after considering the
consequences of each of the errors, and is made
at the design stage itself
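The α-β trade-off can be made concrete with a small sketch (an illustration with assumed numbers, not from the slides): for a fixed true difference and standard error, lowering α raises β, i.e. lowers power.

from statistics import NormalDist

Z = NormalDist()                        # standard Normal distribution
delta, se = 0.05, 0.02                  # assumed true difference and its standard error
for alpha in (0.10, 0.05, 0.01):
    z_crit = Z.inv_cdf(1 - alpha / 2)            # two-sided critical value
    power = 1 - Z.cdf(z_crit - delta / se)       # approximate power, 1 - β
    print(f"α = {alpha:.2f}:  power = {power:.2f},  β = {1 - power:.2f}")
# β rises from about 0.20 to 0.53 as α is tightened from 0.10 to 0.01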
Step 3: Determine the best Statistical test for
the stated Null hypothesis
Depends on:
Study design (Cross over or Independent
groups, Paired or Unpaired observations etc.)
Type of variable (Qualitative / Quantitative)
The properties of the study variable
(Binomial/Normal distribution, Standard
Error of the estimate etc.)
Step 3: Determining the best test
Common tests of significance
t TEST
Chi-square (χ²) test
Z test
Non-parametric tests
Involves calculating a critical ratio
that helps to make a decision
Tests of significance
Critical Ratio (Test Statistic) = Parameter / SE of that parameter
If we are comparing two proportions:
Critical Ratio = Z = (Difference between the two proportions) / (SE of the difference between the two proportions)
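A sketch of this critical ratio for two cure rates (the counts below are hypothetical, and the pooled-proportion form of the SE is used, as is usual when testing p1 = p2):

from math import sqrt
from statistics import NormalDist

x1, n1 = 45, 60                   # hypothetical: 45/60 cured on treatment A
x2, n2 = 33, 60                   # hypothetical: 33/60 cured on treatment B
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)    # pooled proportion under H0: p1 = p2
se_diff = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se_diff           # the critical ratio
p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-tailed p value
print(f"Z = {z:.2f}, two-tailed p = {p_value:.4f}")   # Z ≈ 2.30, p ≈ 0.022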
Step 4: Perform the Statistical Test
- Calculate the test statistic (Z / χ² / t etc.)
- Using the properties of the distribution of the test
statistic, obtain the probability of
observing such an estimate of the statistic
- This probability is the probability of getting the
observed value of the test statistic if the Null
hypothesis is true
- If this is small, the Null hypothesis is an unlikely
explanation for the results: Reject the Null
hypothesis (significant result). If not, do not reject it
[Figure: Acceptance & rejection regions for a paired t test. Acceptance region: |t| < t(n-1, 1-α/2). Rejection regions: t < -t(n-1, 1-α/2) and t > t(n-1, 1-α/2).]
Step 5: If the Null hypothesis is not rejected
at the given level of significance, the
statistical power of the test (1 - β) should be
computed
Recall that β is the error of accepting H0
when it is false. So 1 - β will be the probability of
rejecting H0 when it is false. If this
quantity is low, we recommend that the
study be repeated with a larger sample
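A minimal sketch of such a power calculation (all numbers are assumptions chosen for illustration): the approximate power of a two-sided, two-sample Z test comparing means, and the sample size per group that would be needed for 80% power.

from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()
sigma, delta = 0.12, 0.03          # assumed SD and smallest difference worth detecting
n, alpha = 50, 0.05                # assumed sample size per group and significance level

se = sigma * sqrt(2 / n)           # SE of the difference between two means
z_crit = Z.inv_cdf(1 - alpha / 2)
power = 1 - Z.cdf(z_crit - delta / se)      # approximate power, 1 - β
print(f"Power with n = {n} per group: {power:.2f}")         # ≈ 0.24 (low)

n_needed = 2 * ((z_crit + Z.inv_cdf(0.80)) * sigma / delta) ** 2
print(f"n per group needed for 80% power: {ceil(n_needed)}")   # ≈ 252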
Statistical test of Hypothesis - An example
We wish to compare the BMD of Indian elderly
women with that of Caucasian elderly women
We hypothesize that Indian women will have lower
BMD
Our Null hypothesis: both groups have equal mean
BMD
Our alternative hypothesis: the two groups' mean BMD
are unequal
We collected data on 150 Indian women; data on
Caucasian women are available from the literature
Statistical test of Hypothesis - An example
Indian data: n1 = 150, mean = 0.678, SD (S1) = 0.12
Caucasian data: n2 = 300, mean = 0.854, SD (S2) = 0.11
Since the sample sizes are large, we can apply a
test called the Z test, and the Z statistic is calculated as:
Z = |mean1 - mean2| / √(S1²/n1 + S2²/n2)
Calculations give us a Z value:
Z = 0.176 / 0.0117 = 15.04
Statistical test of Hypothesis - An example
From the data, we have Z = 15.04
We know that Z follows a Normal distribution
Using the properties of the Normal distribution, we
find that the probability of observing a Z value this
large, or more extreme in either direction,
is < 0.0000001, i.e. less than one in a million
In other words, if our Null hypothesis were correct, the
chance of finding Z = 15.04 would be vanishingly small
We therefore reject the Null hypothesis and
conclude that the two groups have statistically different
mean BMD levels
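A short sketch reproducing this calculation from the figures quoted on the slides:

from math import sqrt
from statistics import NormalDist

n1, mean1, s1 = 150, 0.678, 0.12     # Indian data
n2, mean2, s2 = 300, 0.854, 0.11     # Caucasian data (from the literature)

se_diff = sqrt(s1**2 / n1 + s2**2 / n2)
z = abs(mean1 - mean2) / se_diff
p_value = 2 * (1 - NormalDist().cdf(z))   # underflows to 0.0 for a Z this large
print(f"Z = {z:.2f}")                     # ≈ 15.07 (the slides quote 15.04 after rounding the SE to 0.0117)
print(f"two-tailed p = {p_value:.3g}, i.e. far below one in a million")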
Summary
3 entities viz. Population, Variables &
Probability distribution of the variables are
important in Statistical Inference
Estimation & Testing are 2 components of
Statistical Inference
Descriptive studies generally deal with
estimation & Analytical studies deal with
testing of hypotheses
Summary contd.
Estimation is followed by a measure of accuracy:
the Confidence Interval
2 types of errors can be committed in statistical
testing
The Type I error rate (α) is what the usual p value is
judged against (False Positivity)
The complement of the type II error (False Negativity)
is called the Power of the test (Sensitivity)
The test statistic is generally of the form of a ratio of
2 quantities
Summary contd.
The calculated Ratio under given
circumstances follows a known pattern called
its distribution
Using this distribution, we can find the
probability of observing a Ratio of the
observed magnitude by chance alone
If this chance probability is low, chance is
unlikely to explain the observed result and we
Reject the null hypothesis
Summary Contd..
If the Null hypothesis is rejected, we attribute the
observed difference to the exposure under
consideration
If the null hypothesis is not rejected (accepted), we
should be sure that our data is sensitive enough to
believe the negative result (statistical power
should be calculated)
