Gagan

Point estimation and interval estimation
Learning objectives:
 to understand the relationship between point estimation and interval estimation
 to calculate and interpret the confidence interval
Statistical estimation
Random sample: every member of the population has the same chance of being selected in the sample
[Diagram: Population (parameters) → random sample → sample (statistics); estimation leads from the statistics back to the parameters]
Statistical estimation
Estimate
Point estimate          Interval estimate
• sample mean           • confidence interval for mean
• sample proportion     • confidence interval for proportion
Point estimate is always within the interval estimate
Interval estimation
Confidence interval (CI)
provides us with a range of values that we believe, with a given level of confidence, contains the true value
CI for the population means:
$95\%\ \mathrm{CI} = \bar{x} \pm 1.96\ \mathrm{SEM}$
$99\%\ \mathrm{CI} = \bar{x} \pm 2.58\ \mathrm{SEM}$
$\mathrm{SEM} = \mathrm{SD} / \sqrt{n}$
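Interval estimation
Confidence interval (CI)
[Figure: standard normal curve over z from -3.0 to 3.0, with areas of 34%, 14% and 2% in successive unit intervals on each side of 0.0, and the cut-offs ±1.96 and ±2.58 marked]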
Interval estimation
Confidence interval (CI), interpretation and example
[Figure: histogram of age in years (x axis, 22.5 to 60.0) against frequency (y axis, 0 to 50)]
x̄ = 41.0, SD = 8.7, SEM = 0.46, 95% CI (40.0, 42.0), 99% CI (39.7, 42.1)
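The interval formulas can be sketched in a few lines of Python. This is only an illustration: the mean and SD are the rounded values from the age example, n = 352 is an assumption borrowed from the nurses sample described in the significance test slides, and the function name is made up for this sketch. Because the inputs are rounded, the limits can differ slightly from the figures quoted on the slide.

```python
import math

def confidence_interval(mean, sd, n, z):
    """Return (lower, upper) for mean ± z * SEM, where SEM = SD / sqrt(n)."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem

# Rounded values from the age example; n = 352 is an assumption.
mean_age, sd_age, n = 41.0, 8.7, 352

sem = sd_age / math.sqrt(n)
ci95 = confidence_interval(mean_age, sd_age, n, 1.96)
ci99 = confidence_interval(mean_age, sd_age, n, 2.58)

print(f"SEM    = {sem:.2f}")
print(f"95% CI = ({ci95[0]:.1f}, {ci95[1]:.1f})")
print(f"99% CI = ({ci99[0]:.1f}, {ci99[1]:.1f})")
```

The 99% interval is wider than the 95% interval because a larger multiplier of the same SEM is used.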
Testing of hypotheses
» to understand the role of significance test
» to distinguish the null and alternative hypotheses
» to interpret p-value, type I and II errors
Statistical inference. Role of chance.
[Diagram: scientific knowledge draws on reason and intuition and on empirical observation, in a cycle: formulate hypotheses → collect data to test hypotheses]
Statistical inference. Role of chance.
[Diagram: formulate hypotheses → collect data to test hypotheses → accept hypothesis or reject hypothesis; systematic error and CHANCE both affect the result]
Random error (chance) can be controlled by statistical significance or by confidence interval
Significance test
Subjects: random sample of 352 nurses from HUS surgical hospitals
Mean age of the nurses (based on sample): 41.0
Another random sample gave mean value: 42.0.
Question: Is it possible that the “true” age of nurses from HUS surgical hospitals was 41 years and the observed mean ages differed just because of sampling error?
The answer can be given based on significance testing.
Null hypothesis H0 - there is no difference
Alternative hypothesis HA - question explored by the investigator
Statistical methods are used to test hypotheses
The null hypothesis is the basis for the statistical test.

Example
The purpose of the study: to assess the effect of the lactation nurse on attitudes towards breast feeding among women
Research question: Does the lactation nurse have an effect on attitudes towards breast feeding?
HA: The lactation nurse has an effect on attitudes towards breast feeding.
H0: The lactation nurse has no effect on attitudes towards breast feeding.
Definition of p-value.
[Figure: histogram of AGE (x axis, 23.8 to 58.8; frequency 0 to 90) with two green lines enclosing the central 95% of the distribution and 2.5% in each tail]
If our observed age value lies outside the green lines, the probability of getting a value as extreme as this if the null hypothesis is true is < 5%
Definition of p-value.
p-value = probability of observing a value at least as extreme as the value actually observed, if the null hypothesis is true
The smaller the p-value, the more unlikely the null hypothesis seems as an explanation for the data
Interpretation for the example
If the result falls outside the green lines, p < 0.05;
if it falls inside the green lines, p > 0.05
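A two-sided p-value of this kind can be sketched with the normal approximation. The numbers are assumptions chosen for illustration: the null value 41.0 and the observed mean 42.0 are taken from the nurses example, and the SEM is derived from SD = 8.7 and n = 352 quoted elsewhere in the slides. The function name is hypothetical.

```python
import math

def two_sided_p_value(observed, null_value, sem):
    """z statistic and two-sided p-value under a normal approximation."""
    z = (observed - null_value) / sem
    # P(|Z| >= |z|) for a standard normal Z, via the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

sem = 8.7 / math.sqrt(352)            # assumed SD and n
z, p = two_sided_p_value(42.0, 41.0, sem)

print(f"z = {z:.2f}, p = {p:.3f}")
print("outside the 95% limits" if p < 0.05 else "inside the 95% limits")
```

A value of |z| above 1.96 corresponds to p < 0.05, which matches the green-line reading of the figure.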
Type I and Type II Errors
No study is perfect, there is always the chance for error
Decision                 H0 true / HA false           H0 false / HA true
Accept H0 / reject HA    OK (p = 1-α)                 Type II error (β) (p = β)
Reject H0 / accept HA    Type I error (α) (p = α)     OK (p = 1-β)
α - level of significance, 1-β - power of the test
α = 0.05: there are only 5 chances in 100 that the result termed "significant" could occur by chance alone
The probability of making a Type I error (α) can be decreased by lowering the level of significance (a smaller α):
• it will be more difficult to find a significant result
• the power of the test will be decreased
• the risk of a Type II error will be increased
The probability of making a Type II error (β) can be decreased by increasing the level of significance (a larger α):
• it will increase the chance of a Type I error
Which type of error are you willing to risk?
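The trade-off between α and β can be illustrated with a one-sided z-test. This is a sketch under stated assumptions: the true effect is taken to be 2 standard errors (a hypothetical value, not from the slides), and the function name is made up for the example.

```python
from statistics import NormalDist

def power_one_sided(alpha, effect_in_sem):
    """Power (1 - beta) of a one-sided z-test for a true effect given in SEM units."""
    z_crit = NormalDist().inv_cdf(1 - alpha)          # rejection threshold
    beta = NormalDist().cdf(z_crit - effect_in_sem)   # P(fail to reject | HA true)
    return 1 - beta

effect = 2.0  # hypothetical true effect, in units of the standard error
for alpha in (0.10, 0.05, 0.01):
    power = power_one_sided(alpha, effect)
    print(f"alpha = {alpha:.2f}  power = {power:.2f}  beta = {1 - power:.2f}")
```

Lowering α moves the rejection threshold outward, so β rises and the power falls, as the bullet points state.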

Type I and Type II Errors. Example
Suppose there is a test for a particular disease.
If the disease really exists and is diagnosed early, it can be successfully treated
If it is not diagnosed and treated, the person will become severely disabled
If a person is erroneously diagnosed as having the disease and treated, no physical damage is done.
Which type of error are you willing to risk?
Type I and Type II Errors. Example.
Decision         No disease                                               Disease
Not diagnosed    OK                                                       Type II error: irreparable damage would be done
Diagnosed        Type I error: treated but not harmed by the treatment    OK
Decision: to avoid a Type II error, have a high level of significance
Confidence interval and significance test
A value for the null hypothesis within the 95% CI: p-value > 0.05, null hypothesis is accepted
A value for the null hypothesis outside of the 95% CI: p-value < 0.05, null hypothesis is rejected
Parametric and nonparametric tests of significance
 to distinguish parametric and nonparametric tests of significance
 to identify situations in which the use of parametric tests is appropriate
 to identify situations in which the use of nonparametric tests is appropriate
Parametric test of significance - to estimate at least one population parameter from sample statistics
Assumption: the variable we have measured in the sample is normally distributed in the population to which we plan to generalize our findings
Nonparametric test - distribution free, no assumption about the distribution of the variable in the population
                        Nonparametric tests             Parametric tests
                        Nominal data    Ordinal data    Ordinal, interval, ratio data
One group
Two unrelated groups
Two related groups
K-unrelated groups
K-related groups
Some concepts related to the statistical methods.
Multiple comparison
two or more data sets, which should be analyzed
– repeated measurements made on the same individuals
– entirely independent samples
Sample size
number of cases, on which data have been obtained
Which of the basic characteristics of a distribution are more sensitive to the sample size?
central tendency (mean, median, mode): mean
variability (standard deviation, range, IQR): standard deviation
skewness: skewness
kurtosis: kurtosis
Degrees of freedom
the number of scores, items, or other units in the data set, which are free to vary
One- and two-tailed tests
one-tailed test of significance used for directional hypothesis
two-tailed tests in all other situations
Selected nonparametric tests
Chi-Square goodness of fit test.
to determine whether a variable has a frequency distribution comparable to the one expected
$\chi^2 = \sum_i \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$
expected frequency can be based on
• theory
• previous experience
• comparison groups
Example
The average prognosis of total hip replacement in relation to pain reduction in hip joint is (expected):
excellent - 80%
good - 10%
medium - 5%
bad - 5%
In our study we got a different outcome (observed):
excellent - 95%
good - 2%
medium - 2%
bad - 1%
Do observed frequencies differ from expected?

Example
fe1 = 80, fe2 = 10, fe3 = 5, fe4 = 5;
fo1 = 95, fo2 = 2, fo3 = 2, fo4 = 1;
$\chi^2 = 14.2$, df = 3 (4-1)
Critical values for df = 3:
$\chi^2 > 7.815$: p < 0.05
$\chi^2 > 11.345$: p < 0.01
$\chi^2 > 16.266$: p < 0.001
0.001 < p < 0.01
Null hypothesis is rejected at 5% level
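The goodness of fit statistic for the hip replacement example can be recomputed with a short sketch. The function and constant names are illustrative only; 7.815 is the standard tabulated 5% critical value of chi-square for df = 3.

```python
def chi_square_gof(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [80, 10, 5, 5]   # excellent, good, medium, bad (expected)
observed = [95, 2, 2, 1]    # same categories, observed in the study

chi2 = chi_square_gof(observed, expected)
df = len(observed) - 1
CRITICAL_5_PERCENT_DF3 = 7.815  # tabulated chi-square value, df = 3, alpha = 0.05

print(f"chi-square = {chi2:.1f}, df = {df}")
print("reject H0 at 5% level" if chi2 > CRITICAL_5_PERCENT_DF3 else "do not reject H0")
```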
Chi-Square test.
Chi-square statistic (test) is usually used with an R (row) by C (column) table.
Expected frequencies can be calculated:
$F_{rc} = \frac{f_r f_c}{N}$
where $f_r$ and $f_c$ are the row and column totals and $N$ is the overall total
then
$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - F_{ij})^2}{F_{ij}}$
df = (R-1)(C-1)
Chi-Square test. Example
Question: whether men are treated more aggressively for cardiovascular problems than women?
Sample: people have similar results on initial testing
Response: whether or not a cardiac catheterization was recommended
Independent: sex of the patient
Result: observed frequencies
Cardiac Cath     male    female    Row total
No               15      16        31
Yes              45      24        69
Column total     60      40        100
Result: expected frequencies
Cardiac Cath     male    female    Row total
No               18.6    12.4      31
Yes              41.4    27.6      69
Column total     60      40        100
Result:
$\chi^2 = 2.52$, df = 1 = (2-1)(2-1)
p > 0.05
Null hypothesis is accepted at 5% level
Conclusion: Recommendation for cardiac catheterization is not related to the sex of the patient
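The expected frequencies and the chi-square statistic for the catheterization table can be reproduced with the sketch below. The function names are illustrative; only the observed counts come from the slide.

```python
def expected_frequencies(table):
    """Expected count for each cell: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Chi-square statistic for an R x C table of observed counts."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Rows: cath not recommended / recommended; columns: male / female
observed = [[15, 16], [45, 24]]

expected = expected_frequencies(observed)
chi2 = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", [[round(e, 1) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f}, df = {df}")
```

With df = 1 the tabulated 5% critical value is 3.841, so a statistic of about 2.52 does not reach significance.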
Chi-Square test. Underlying assumptions.
✔ Frequency data: cannot be used to analyze differences in scores or their means
✔ Adequate sample size: expected frequencies should not be less than 5
✔ Measures independent of each other: no subjects can be counted more than once
✔ Theoretical basis for the categorization of the variables: categories should be defined prior to data collection and analysis
Fisher’s exact test. McNemar test.
 For N x N design and very small sample size Fisher's exact test should be applied
 McNemar test can be used with two dichotomous measures on the same subjects (repeated measurements). It is used to measure change
Parametric and nonparametric tests of significance
                        Nonparametric tests                           Parametric tests
                        Nominal data                  Ordinal data    Ordinal, interval, ratio data
One group               Chi square goodness of fit
Two unrelated groups    Chi square
Two related groups      McNemar’s test
K-unrelated groups      Chi square test
K-related groups
Ordinal data independent groups.
Mann-Whitney U: used to compare two groups
Kruskal-Wallis H: used to compare two or more groups
Mann-Whitney test
Null hypothesis: Two sampled populations are equivalent in location
The observations from both groups are combined and ranked, with the average rank assigned in the case of ties.
If the populations are identical in location, the ranks should be randomly mixed between the two samples
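The ranking step described above can be sketched as follows. The data are hypothetical and the function names are illustrative; the sketch computes only the two U statistics, not a p-value.

```python
def average_ranks(values):
    """1-based ranks of values; tied values share the average of their positions."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1        # first position held by this value
        last = first + ordered.count(v) - 1 # last position held by this value
        ranks.append((first + last) / 2)
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) from the rank sums of the combined sample."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, n_a * n_b - u_a

# Hypothetical ordinal scores for two independent groups
group_a = [1, 2, 2]
group_b = [2, 3]
print(mann_whitney_u(group_a, group_b))
```

When the ranks are well mixed between the groups, the two U values are close to each other; a very small U on one side indicates that one group tends to hold the lower ranks.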
Kruskal-Wallis test
k-groups comparison, k ≥ 2
Null hypothesis: k sampled populations are equivalent in location
The observations from all groups are combined and ranked, with the average rank assigned in the case of ties.
If the populations are identical in location, the ranks should be randomly mixed between the k samples
Ordinal data related groups.
Wilcoxon matched-pairs signed rank test: used to compare two related groups
Friedman matched samples: used to compare two or more related groups
Ordinal data 2 related groups Wilcoxon signed rank test
Two related variables. No assumptions about the shape of distributions of the variables.
Null hypothesis: Two variables have the same distribution
Takes into account information about the magnitude of differences within pairs and gives more weight to pairs that show large differences than to pairs that show small differences.
Based on the ranks of the absolute values of the differences between the two variables.
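Ranking the absolute differences can be sketched as below. The paired scores are hypothetical, the function names are illustrative, and dropping pairs with a zero difference is an assumption (a common convention, not stated on the slide). Only the signed rank sums are computed, not a p-value.

```python
def average_ranks(values):
    """1-based ranks of values; tied values share the average of their positions."""
    ordered = sorted(values)
    return [
        (2 * ordered.index(v) + 1 + ordered.count(v)) / 2  # mean of first and last position
        for v in values
    ]

def signed_rank_sums(before, after):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

# Hypothetical paired measurements on the same five subjects
before = [10, 12, 9, 14, 11]
after = [12, 11, 13, 14, 8]
print(signed_rank_sums(before, after))
```

Large differences receive high ranks, so a pair with a big change contributes more to its rank sum than a pair with a small change.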
Parametric and nonparametric tests of significance
                        Nonparametric tests                                                             Parametric tests
                        Nominal data                  Ordinal data
One group               Chi square goodness of fit    Wilcoxon signed rank test
Two unrelated groups    Chi square                    Wilcoxon rank sum test, Mann-Whitney test
Two related groups      McNemar’s test                Wilcoxon signed rank test
K-unrelated groups      Chi square test               Kruskal-Wallis one way analysis of variance
K-related groups                                      Friedman matched samples
Selected parametric tests
One group t-test. Example
Comparison of sample mean with a population mean
It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg.
Thus the population mean, µ = 70.0 and population standard deviation, σ = 4.0.
Data from a random sample of 28 males of similar ages but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg.
Question: Does the studied group have a significantly lower body weight than the general population?
One group t-test. Example
population mean, µ = 70.0
population standard deviation, σ = 4.0
sample size = 28
sample mean, x̄ = 67.0
sample standard deviation, s = 4.2
Null hypothesis: There is no difference between sample mean and population mean.
$t = (\bar{x} - \mu) / (s / \sqrt{n}) = (67.0 - 70.0) / (4.2 / \sqrt{28}) \approx -3.78$, df = 27, p < 0.05
Null hypothesis is rejected at 5% level
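The one group t statistic can be recomputed from the figures quoted in the example. This is a sketch under stated assumptions: the sample SD is taken as 4.2 kg, as in the problem statement, and the function name is illustrative.

```python
import math

def one_sample_t(sample_mean, population_mean, sample_sd, n):
    """t statistic and degrees of freedom for a one group t-test."""
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - population_mean) / standard_error
    return t, n - 1

# Figures from the enzyme defect example (weights in kg)
t, df = one_sample_t(sample_mean=67.0, population_mean=70.0, sample_sd=4.2, n=28)
print(f"t = {t:.2f}, df = {df}")
```

The result is compared with the tabulated t value for the same df; for df = 27 the two-tailed 5% critical value is about 2.05.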

Two unrelated groups, t-test. Example
Comparison of means from two unrelated groups
Study of the effects of anticonvulsant therapy on bone disease in the elderly.
Study design:
Samples: group of treated patients (n=55)
         group of untreated patients (n=47)
Outcome measure: serum calcium concentration
Research question: Do the groups differ statistically significantly in mean serum concentration?
Test of significance: Pooled t-test
Two unrelated groups, t-test. Example
Comparison of means from two unrelated groups
Study design:
Samples: group of treated patients (n=20)
         group of untreated patients (n=27)
Outcome measure: serum calcium concentration
Research question: Do the groups differ statistically significantly in mean serum concentration?
Test of significance: Separate t-test
Two related groups, paired t-test. Example
Comparison of means from two related variables
Study design:
Sample: group of treated patients (n=40)
Outcome measure: serum concentration before and after operation
Research question: Does the mean serum concentration differ statistically significantly before and after operation?
Test of significance: paired t-test
k unrelated groups, one-way ANOVA test. Example
Comparison of means from k unrelated groups
Study of the effects of two different drugs (A and B) on weight reduction.
Study design:
Samples: group of patients treated with drug A (n=32)
         group of patients treated with drug B (n=35)
         control group (n=40)
Outcome measure: weight reduction
Research question: Do the groups differ statistically significantly in mean weight reduction?
Test of significance: one-way ANOVA test
k unrelated groups, one-way ANOVA test. Example
The group means are compared with the overall mean of the sample
Visual examination of the individual group means may yield no clear answer about which of the means are different
Additionally, post-hoc tests can be used (Scheffe or Bonferroni)
k related groups, two-way ANOVA test. Example
Comparison of means for k related variables
Study of the effects of drug A on weight reduction.
Study design:
Samples: group of patients treated with drug A (n=35)
         control group (n=40)
Outcome measure: weight at Time 1 (before using drug) and Time 2 (after using drug)
Kate Grayson
Why Segmentation?
 Used by e.g. retail and consumer product companies

 Trying to learn about and describe their customers'
buying habits, gender, age, income level, etc.
 These companies tailor their marketing and product
development strategies to each consumer group to
increase sales and build brand loyalty.
 A valuable approach in Market Research, and SPSS
offers some useful tools to facilitate this commercial
process
Segmentation in SPSS
 Most of the techniques for segmentation and

profiling are exploratory
 There is no right or wrong answer, and the
results are open to interpretation
 Trying to make sense of the data or find
patterns
 Iterative techniques
 If it does not make business sense then it is
not a good model!
Segmentation in SPSS
Techniques include:
 Factor Analysis / Principal Components
Analysis
 Hierarchical Clustering
 K-Means Cluster
 Non-Linear Principal Components Analysis
(PRINCALS/CATPCA)
 The new Two-Step Cluster
Which Technique to Use?
[Diagram: techniques placed between exploratory and confirmatory: Cluster Analysis, Categories, Factor Analysis, Discriminant Analysis, AnswerTree]
Which Test to use?
 Factor Analysis - to find patterns within variables
 Categories - use if data doesn’t fit assumptions for Factor
Analysis
 Cluster Analysis - to find patterns between individuals
 Two-Step Cluster – To use with both categorical and
continuous variables
 Discriminant Analysis - to look for differences between
groups, try to predict target variable
 AnswerTree - combinations of data, to predict target
Multivariate Analysis
 These techniques are inter-

related, but don’t have to use
all of them
 Can use a combination of

these techniques to segment
the data
Main Considerations
 Looking for patterns or trying to

make predictions?
 Levels of Measurement of the
data (categorical or continuous)
 Sample size
 Missing values
 Does data fulfil assumptions for
test?
Handling Missing Data
 Check before analysis for any patterns within

missing data
 Check before analysis that missing values are
defined as missing - otherwise may
compromise the model
 Be aware that most segmentation techniques
ignore any cases with missing values - so may
have less usable data than you think!
Variable and Value
Labels….
 It is worth checking the labels on your
file
 SPSS may truncate long variable and
value labels in the output, making it
difficult to interpret the output
 Make sure all the useful information is
at the beginning of the variable and
value labels - so even if they are
truncated, the output is still easy to
read
Data Coding
 Check the direction of the coding scheme,

and maybe consider re-coding the data if
the codes are counter-intuitive
 e.g. if have a rating scale that ranges from
high to low, rather than low to high…
 ... it can be difficult to interpret output and
factor scores etc. once the data has been
through several transformations
Sample Data
 Data = usage of underarm deodorants for men
 Three brands tested:
 ‘Rambo’: the current market leader
 ‘Brad’: second most popular
 ‘Clint’: recently launched product
Brand usually use: frequency within sample
                    Frequency    Percent
Rambo AP Spray      1013         39.5
Rambo AP Roll-on    624          24.3
Brad AP Spray       441          17.2
Brad AP Roll-on     140          5.5
Clint AP Spray      278          10.8
Clint AP Roll-on    71           2.8
Total               2567         100.0
Profiling the
Customers..
‘Clint’ isn’t selling as well as was hoped, so the
research aims to find out:
 Who is buying ‘Clint’?
 What sort of characteristics do they share?
 Who is buying the other deodorants tested?
 How might the marketing campaign be changed
to ensure that the correct market is targeted?
Data Collected
 Ratings of a range of lifestyle attribute

questions, e.g. ‘I tend to own the most up-to-
date products’, ‘My family is most important
thing in my life’, ‘I prefer to dress and entertain
casually’ etc. (34 of these)
 Demographics: age, type of work, exercise etc.
 Brand of D/O usually use
 How see yourself in relation to others, e.g.
‘What makes you distinctive from your friends’
Segmentation – the steps
1. Run Principal Components Analysis on ‘attribute

rating’ questions, to see if any underlying dimension
in the variables
2. Check using Discriminant Analysis to see if these
dimensions help predict brand used
3. Run Cluster Analysis to see if can find similarities
between cases
4. Decide if other variables need to be included, e.g.
categorical demographics
5. Run Two-Step Cluster using all variables
Factor Analysis: what is
it?
 Looks for relationships between continuous variables
(based on correlations), in this case ‘attribute rating’
questions
 Derives underlying constructs or dimensions in the
data
 Tries to reduce a large number of variables to a small
number of factors which explain most of the variance
in the data
 If can’t interpret the resulting solution then no good!
Factor Analysis Results
The best solution produced 9 factors, interpreted below:

 F1: High computer use
 F2: Rules, need to conform
 F3: Party animal
 F4: Family man
 F5: Likes new products, experiments
 F6: Likes pampering, pays more for trusted brands
 F7: Cautious, follower rather than leader for new products
 F8: Relaxed, casual
 F9: Home loving
Do these factors help?
Run Discriminant Analysis to see if can predict D/O used
[Figure: Combined Groups Plot of Function 1 (x axis, -7.5 to 7.5) against Function 2 (y axis, -4 to 4), showing cases and group centroids by brand usually used: Rambo AP Spray, Rambo AP Roll-on, Brad AP Spray, Brad AP Roll-on, Clint AP Spray, Clint AP Roll-on]
Factor Analysis Results
 The factors are good at predicting ‘Rambo’

usage, but not at differentiating between
‘Brad’ and ‘Clint’
 So try instead investigating relationships
between cases – using Cluster Analysis
 Options for clustering are:
 Hierarchical Cluster
 K-Means Cluster
 Two-Step Cluster
Hierarchical Cluster
 This is often thought of as the

‘proper cluster’ method
 Looking for natural groupings within
the data
 Bases groupings upon the similarity
or dissimilarity between cases,
rather than variables
 Very iterative technique – time
consuming!
[Diagram: Clustering Data; each point in the diagram is a data point, i.e. one case]
Decisions before
Cluster:
Which variables to use?
Which distance measures between cases to use?
Which criteria for creating clusters to choose?
NB
The quality of the analysis will always depend upon the
variables used
Cluster Analysis will always find a solution!
It is not possible to assess in the analysis itself how
appropriate a variable is
Stages of Hierarchical
Cluster:
Select variables for analysis (carefully!)
Build and assess model
Save cluster membership
If required, create cluster matrix for K-
Means
NB
Because based on cases, need to make sure
data is measured on same scale - if not,
data should be standardized
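Standardizing means converting each variable to z-scores, so that variables measured on different scales contribute comparably to the distances between cases. A minimal sketch, with hypothetical variables and values (not from the deodorant data):

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variables on very different scales
age_years = [20, 30, 40]
rating_1_to_5 = [1, 3, 5]

print(standardize(age_years))
print(standardize(rating_1_to_5))
```

After standardizing, both variables have mean 0 and standard deviation 1, so neither dominates the distance measure simply because of its units.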
Decision with D/O Data
 I can’t get a very good (i.e. useful to

the business) model from
Hierarchical Cluster analysis
 Also, I want to be able to include
both categorical and continuous
variables in the same model
 So I decide to use Two-Step Cluster
instead
Two-Step Cluster
 The TwoStep Cluster Analysis procedure is an
exploratory tool designed to reveal natural
groupings (or clusters) within a data set that would
otherwise not be apparent.
 The algorithm employed by this procedure has
several features that differentiate it from traditional
clustering techniques:
 The ability to create clusters based on both categorical and
continuous variables.
 Automatic selection of the number of clusters.
 The ability to analyze large data files efficiently.
TwoStep Cluster
 Uses scalable cluster analysis algorithm
 This algorithm can handle both continuous and
categorical variables or attributes and requires only
one data pass in the procedure
 The first step of the procedure pre-clusters the records
into many small sub-clusters
 Then it clusters the sub-clusters created in the pre-
cluster step into the desired number of clusters
 If the desired number of clusters is unknown, TwoStep
Cluster analysis automatically finds the proper
number of clusters
Two-Step Cluster
• This is unlike other clustering methods

in SPSS - if the desired number of
clusters is unknown, TwoStep Cluster
analysis automatically finds the proper
number of clusters
• Or you can pre-specify the number of
clusters required - flexibility
Link to more information
 More useful information about Two-Step Cluster can be

found at the following websites:
 http://www.rrz.uni-hamburg.de/RRZ/Software/SPSS/Algorith.12
 NB This was the handout for the talk, with algorithm etc.
 Also useful:
 http://www.spss.com/pdfs/S115AD8-1202A.pdf
 http://www.norusis.com/pdf/SPC_v13.pdf
Brand usually use by
Cluster
‘Clint’ spray seems to be associated with Cluster 6,

with the roll-on version being associated with
Clusters 4 and 2
branduse Brand usually use
Percent
Cluster
1 2 3 4 5 6 Combined
Rambo AP Spray .0% 18.1% 52.3% .0% 29.6% .0% 100.0%
Rambo AP Roll-on 70.3% 29.7% .0% .0% .0% .0% 100.0%
Brad AP Spray .0% .0% .0% 100.0% .0% .0% 100.0%
Brad AP Roll-on .0% 3.6% .0% 96.4% .0% .0% 100.0%
Clint AP Spray .0% .4% .0% .0% .0% 99.6% 100.0%
Clint AP Roll-on .0% 14.3% .0% 85.7% .0% .0% 100.0%
Employment Status by Cluster
Cluster 2 (‘Clint’ roll-on) is largely made up of part-time, retired and not working respondents, Cluster 4 also has a high number of retired respondents, while Cluster 6 (‘Clint’ spray) also has a high percentage of part-time and unemployed.
employ Employment Status
Percent
Cluster                 1        2        3        4        5        6        Combined
Full time employment    24.5%    2.3%     29.7%    13.8%    16.8%    12.9%    100.0%
Part-time employment    11.9%    61.9%    4.8%     2.4%     .0%      19.0%    100.0%
Not employed            .0%      79.3%    .0%      9.8%     .0%      10.9%    100.0%
Student                 .0%      91.9%    .0%      1.0%     .0%      7.1%     100.0%
Retired                 .0%      61.1%    .0%      33.3%    5.6%     .0%      100.0%
Age Group by Cluster
Cluster 2 (‘Clint’ roll-on) is largely made up of the

younger and older age groups, Cluster 4 also has a
high percentage of older respondents. Cluster 6 is
more from 25 years upwards
agerseu Age of respondent
Percent
Cluster
Under 18 .0% 96.8% .0% 3.2% .0% .0% 100.0%
18-24 .0% 57.2% .0% 6.9% 27.6% 8.3% 100.0%
25-34 18.3% 11.1% 47.4% 10.5% .0% 12.7% 100.0%
35-44 23.0% 3.8% 44.6% 14.2% .0% 14.4% 100.0%
45-54 29.7% 5.5% .0% 13.1% 39.5% 12.2% 100.0%
55-64 15.5% 38.7% .0% 15.2% 19.1% 11.6% 100.0%
65 or over .0% 68.8% .0% 18.8% .0% 12.5% 100.0%
TwoStep Cluster Number = 4
•Cluster 4 (‘Clint’ roll-on) has below average computer use and need to conform, above average on ‘Home loving’
[Chart: variable scores for this cluster on an axis from -30 to 0; variables listed in the order F1: High computer use, F8: Relaxed, casual, F2: Rules, need to conform, F9: Home loving, F4: Family man, F7: Cautious, follower rather than leader for new products, F5: Likes new products, experiments, F6: Likes pampering, pays more for trusted brands, F3: Party animal]
TwoStep Cluster Number = 6
•Cluster 6 (‘Clint’ spray) has above average scores on ‘Relaxed, Casual’ but not much else – this is Mr Laid Back!
[Chart: variable scores for this cluster on an axis from -40 to 40; variables listed in the order F1: High computer use, F2: Rules, need to conform, F8: Relaxed, casual, F7: Cautious, follower rather than leader for new products, F4: Family man, F6: Likes pampering, pays more for trusted brands, F9: Home loving, F5: Likes new products, experiments, F3: Party animal]
Summary of Findings
 Profiling of this data suggests that ‘Clint’ is not

targeting the expected market
 ‘Clint’ is often not seen as sufficiently different from
‘Brad’, it has no perceived USP
 ‘Clint’ is being used by a high percentage of older,
retired, and part-time or not employed consumers,
which may be a result of the aggressive product
launch campaign with free samples, discounted prices
etc.
 ‘Clint’ marketing needs some more work!
Summary of Segmenting and
Profiling this data using
SPSS
 Principal Components Analysis helped investigate
relationships between the rated attribute variables
 Hierarchical Cluster was used to try and find similarities
between cases, using the factors derived from PCA
 Two-Step Cluster was then used to enable clustering of both
continuous and categorical variables in the same model
 Useful conclusions were drawn about the market positioning
of ‘Clint’ deodorant
Windows Display System
What is SAS?
 SAS is a comprehensive statistical software

system which integrates utilities for storing,
modifying, analyzing, and graphing data.
 SAS runs on both Windows and UNIX platforms
 SAS is used in a wide range of industries such as
healthcare, education, financial services, life
sciences,…
 Check out the webpage to learn more
 http://www.sas.com/
SAS User Interface
[Screenshot: SAS main window with the following parts labeled]
•Run button – click on this button to run SAS code
•Tool bar similar to Windows applications
•Click here for SAS help
•New Window button
•Save button
•Explorer Window
•Log Window
•Editor Window
•Results Window (not shown)
•Output Window (not shown)
Editor Window
•The Editor Window contains inputted data

•sets and SAS programs
Explorer Window
•Libraries Folder - Contains data sets created in SAS
•Explorer
•Window
Libraries Folder
•Contents of the Libraries Folder
•The Work Folder contains data sets created in SAS
•Contents of the Work Folder
•These are the data sets that have been created in SAS through inputting data and by creating data sets in SAS programs
Log Window
•The Log Window contains a record
•of all commands submitted to
•SAS and shows errors in the
•commands.
Output Window
•The Output Window contains output

•based on SAS programs submitted in the
•Editor Window.
Results Window
•The Results Window shows a listing of SAS programs that have been submitted in the order that they were submitted.
•Click on any procedure to view all output parts of the procedure and click on any individual part to view the actual output.
SAS Help
•PRINCESS C. BARCEGA
•APG SCHOOL, MANAMA,
BAHRAIN
What do you know about
 bar graph?
 double bar graph?
 Histogram?
Bar Graph
 A bar graph can be used to display and compare data
 The scale should include all the data values and be easily divided into equal intervals.
[Figure: horizontal bar graph with bars for Spanish, Mandarin, Hindi and English on a scale from 0 to 1000]
How to interpret a Bar Graph?
•The bar graph shows Mr. Snowden’s students by gender and band membership.
 How many of Mr. Snowden’s students are band members?
 How many of Mr. Snowden’s students are not band members?
[Figure: bar graph with bars for Female band, Female not band, Male band and Male not band on a scale from 0 to 7]
Double Bar Graph
 Can be used to compare two related sets of data
[Figure: double bar graph for 1st Qtr to 4th Qtr on a scale from 0 to 90]
How to make a Double-
Bar Graph?
 Choose a scale and
interval for the vertical
axis.
 Draw a pair of bars for
each country’s data. Use
different colors to show
males and females.
 Label the axes and give
the graph a title.
 Make a key to show what
each bar represents.
The table shows the highway speed limits on interstate roads within three states.
State      Urban      Rural
Florida    65 mi/h    70 mi/h
Texas      70 mi/h    70 mi/h
Vermont    55 mi/h    65 mi/h
Step 1
 Choose a scale and interval for the vertical axis.
[Figure: empty vertical axis from 0 to 80 in intervals of 20]
Step 2 Draw a pair of bars for each
state’s data. Use different colors
to show urban and rural.
80
State Urban Rural 60

40
Vermont 55mi/h 65 mi/h 20
0
Florida T exas Vermon t
Step 3 and 4
 Label the axes and give the graph a title.
 Make a key to show what each bar represents
[Figure: double bar graph titled "Speed Limit on Interstate Roads"; vertical axis "Speed Limit (mi/h)" from 0 to 80; horizontal axis Florida, Texas, Vermont; key: Urban, Rural]
Histogram
 Histogram is a bar graph that shows

the frequency of data within equal
intervals.
 There is no space in between the
bars.
How to make a histogram?
The table below shows the number of hours students watch TV in one week. Make a histogram of all the data.
Number of hours of TV
1   II             6   III
2   IIII           7   IIII - IIII
3   IIII - IIII    8   III
4   IIII - I       9   IIII
5   IIII - III
Step 1
 Make a frequency table of the data. Be sure to use equal intervals
Number of hours of TV    Frequency
1-3                      15
4-6                      17
7-9                      16
Step 2
 Choose an appropriate scale and interval for the vertical axis. The greatest value on the scale should be at least as great as the greatest frequency.
[Figure: empty axes; vertical scale from 0 to 20 in intervals of 4; horizontal intervals 1-3, 4-6, 7-9]
Step 3
 Draw a bar for each interval. The height of the bar is the frequency for that interval. Bars must touch but not overlap.
 Label the axes and give the graph a title
[Figure: histogram titled "Hours of Television Watched"; vertical axis "Number of students" from 0 to 20; horizontal axis "Hours" with intervals 1-3, 4-6, 7-9]
[Figure: completed histogram "Hours of Television Watched"; vertical axis "Number of students" from 0 to 20; horizontal axis "Hours" with intervals 1-3, 4-6, 7-9]
The list below shows the results
of a typing test in words per
minute. Make a histogram of
the data.
62, 55, 68, 47, 50, 41, 62, 39,
54, 70, 56, 70, 56, 47, 71, 55,
60, 42
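The first step, making a frequency table with equal intervals, can be sketched in Python. The interval width of 10 and the starting value of 30 are assumptions chosen for this illustration, and the function name is made up; other equal intervals would work just as well.

```python
from collections import Counter

def frequency_table(data, start, width):
    """Count values in equal intervals [start, start+width), [start+width, ...), ..."""
    counts = Counter((x - start) // width for x in data)
    n_bins = max(counts) + 1
    return [
        (start + i * width, start + (i + 1) * width - 1, counts.get(i, 0))
        for i in range(n_bins)
    ]

# Typing test results in words per minute, from the list above
wpm = [62, 55, 68, 47, 50, 41, 62, 39, 54, 70, 56, 70, 56, 47, 71, 55, 60, 42]

table = frequency_table(wpm, start=30, width=10)
for low, high, freq in table:
    # One '#' per result: a text version of the histogram bar
    print(f"{low}-{high}: {'#' * freq} ({freq})")
```

Each row of the table becomes one bar of the histogram, with the bar height equal to the frequency for that interval.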
Essential Information
Commonly used visual tools
 Charts:
 Bar
 Line
 Pie
 XY
 Area
 Thematic map
