
Point estimation and interval estimation

learning objectives:

» to understand the relationship between point estimation and interval estimation

» to calculate and interpret the confidence interval
Statistical estimation

[Diagram: Population (parameters) → random sample → sample statistics → estimation of the population parameters]

Every member of the population has the same chance of being selected in the sample
Statistical estimation

Estimate

• Point estimate: sample mean, sample proportion
• Interval estimate: confidence interval for mean, confidence interval for proportion

Point estimate is always within the interval estimate
Interval estimation
Confidence interval (CI)

provides us with a range of values that we believe, with a given level of confidence, contains the true value

CI for the population mean:

$$95\%\ \mathrm{CI} = \bar{x} \pm 1.96\,\mathrm{SEM}$$

$$99\%\ \mathrm{CI} = \bar{x} \pm 2.58\,\mathrm{SEM}$$

$$\mathrm{SEM} = \frac{SD}{\sqrt{n}}$$
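The interval formulas can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name is hypothetical, and the inputs (mean 41.0, SD 8.7, n = 352) are borrowed from the nurses' age example that appears elsewhere in the deck. Results may differ in the last decimal from figures quoted on the slides because of rounding.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """Return (lower, upper) for mean ± z * SEM, where SEM = SD / sqrt(n)."""
    sem = sd / math.sqrt(n)
    return mean - z * sem, mean + z * sem

# Assumed inputs taken from the nurses' age example: mean 41.0, SD 8.7, n = 352
sem = 8.7 / math.sqrt(352)
ci95 = confidence_interval(41.0, 8.7, 352, z=1.96)
ci99 = confidence_interval(41.0, 8.7, 352, z=2.58)

print(f"SEM    = {sem:.2f}")
print(f"95% CI = ({ci95[0]:.1f}, {ci95[1]:.1f})")
print(f"99% CI = ({ci99[0]:.1f}, {ci99[1]:.1f})")
```

The 99% interval is always wider than the 95% interval, because a larger multiplier of the same SEM is used.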
Interval estimation
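Confidence interval (CI)

[Figure: standard normal curve, z from -3.0 to 3.0; 34% of the area lies in each band between 0 and ±1, 14% between ±1 and ±2, 2% beyond ±2; the cut-offs ±1.96 and ±2.58 are marked]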
Interval estimation
Confidence interval (CI), interpretation and example

[Figure: histogram of age; x-axis: Age in years (22.5 to 60.0); y-axis: Frequency (0 to 50)]

x̄ = 41.0, SD = 8.7, SEM = 0.46, 95% CI (40.0, 42.0), 99% CI (39.7, 42.1)
Testing of hypotheses

learning objectives:

» to understand the role of significance test

» to distinguish the null and alternative


hypotheses

» to interpret p-value, type I and II errors


Statistical inference. Role of chance.

Scientific knowledge

Reason and intuition    Empirical observation

Formulate Collect data to


hypotheses test hypotheses
Statistical inference. Role of chance.

Systematic error

Formulate Collect data to


hypotheses test hypotheses

CHANCE

Accept hypothesis Reject hypothesis

Random error (chance) can be controlled by statistical significance


or by confidence interval
Testing of hypotheses
Significance test
Subjects: random sample of 352 nurses from HUS surgical
hospitals
Mean age of the nurses (based on sample): 41.0
Another random sample gave mean value: 42.0.

Question: Is it possible that the “true” age of


nurses from HUS surgical hospitals was
41 years and observed mean ages
differed just because of sampling error?

Answer can be given based on Significance


Testing.
Testing of hypotheses

Null hypothesis H0 - there is no difference

Alternative hypothesis HA - question explored by


the investigator

Statistical methods are used to test hypotheses

The null hypothesis is the basis for statistical test.


Testing of hypotheses
Example
The purpose of the study:
to assess the effect of the lactation nurse on attitudes
towards breast feeding among women

Research question: Does the lactation nurse have


an effect on attitudes towards
breast feeding ?

HA : The lactation nurse has an


effect on attitudes towards
breast feeding.
H0 : The lactation nurse has no
effect on attitudes towards
breast feeding.
Testing of hypotheses
Definition of p-value.
[Figure: histogram of AGE (x-axis 23.8 to 58.8, frequency 0 to 90) with green lines enclosing the central 95% of values and leaving 2.5% in each tail]

If our observed age value lies outside the green lines, the probability of getting a value as extreme as this if the null hypothesis is true is < 5%
Testing of hypotheses
Definition of p-value.

p-value = probability of observing a value as extreme as, or more extreme than, the value actually observed, if the null hypothesis is true

The smaller the p-value, the more unlikely the null hypothesis seems as an explanation for the data

Interpretation for the example
If the result falls outside the green lines, p<0.05;
if it falls inside the green lines, p>0.05
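As a rough sketch of the definition, the two-sided p-value for a test statistic can be computed directly when that statistic is assumed to follow a standard normal distribution under the null hypothesis. The function name is illustrative, and the normality assumption is ours, not the slides'.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic, assuming a standard normal
    distribution under the null hypothesis: P(|Z| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.0, 1.96, 2.58):
    print(f"z = {z:4.2f}  ->  p = {two_sided_p(z):.4f}")
```

A statistic at the 1.96 cut-off gives p close to 0.05, and one at 2.58 gives p close to 0.01, matching the multipliers used for the 95% and 99% confidence intervals.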
Testing of hypotheses
Type I and Type II Errors
No study is perfect, there is always the chance for error

Decision                 H0 true / HA false        H0 false / HA true
Accept H0 / reject HA    OK, p = 1-α               Type II error (β), p = β
Reject H0 / accept HA    Type I error (α), p = α   OK, p = 1-β

α - level of significance; 1-β - power of the test


Testing of hypotheses
Type I and Type II Errors
α = 0.05: there are only 5 chances in 100 that a result termed "significant" could occur by chance alone

The probability of making a Type I error (α) can be decreased by lowering the level of significance. As a consequence:

• it will be more difficult to find a significant result
• the power of the test will be decreased
• the risk of a Type II error will be increased
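The trade-off between the two error types can be illustrated numerically. The sketch below assumes a one-sided z-test and a true effect expressed in standard-error units (the value 2.5 is made up for illustration); neither assumption comes from the slides.

```python
from statistics import NormalDist

def power_one_sided_z(alpha, effect_in_se):
    """Power (1 - beta) of a one-sided z-test when the true effect equals
    `effect_in_se` standard errors (hypothetical setting for illustration)."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection cut-off under H0
    return 1 - std_normal.cdf(critical - effect_in_se)

for alpha in (0.05, 0.01):
    power = power_one_sided_z(alpha, effect_in_se=2.5)
    print(f"alpha = {alpha:.2f}: power = {power:.3f}, beta = {1 - power:.3f}")
```

With the effect held fixed, the smaller α gives the lower power and therefore the larger β.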
Testing of hypotheses
Type I and Type II Errors

The probability of making a Type II error (β) can be decreased by increasing the level of significance.

it will increase the chance of a Type I error

Which type of error are you willing to risk?


Testing of hypotheses
Type I and Type II Errors. Example

Suppose there is a test for a particular disease.


If the disease really exists and is diagnosed early,
it can be successfully treated
If it is not diagnosed and treated, the person will
become severely disabled
If a person is erroneously diagnosed as having the
disease and treated, no physical damage is done.

Which type of error are you willing to risk?


Testing of hypotheses
Type I and Type II Errors. Example.
Decision         No disease                                              Disease
Not diagnosed    OK                                                      Type II error: irreparable damage would be done
Diagnosed        Type I error: treated but not harmed by the treatment   OK

Decision: to avoid a Type II error, use a high level of significance
Testing of hypotheses
Confidence interval and significance test
A value for the null hypothesis within the 95% CI → p-value > 0.05 → null hypothesis is accepted

A value for the null hypothesis outside of the 95% CI → p-value < 0.05 → null hypothesis is rejected
Parametric and nonparametric
tests of significance
learning objectives:

» to distinguish parametric and nonparametric tests of significance

» to identify situations in which the use of parametric tests is appropriate

» to identify situations in which the use of nonparametric tests is appropriate
Parametric and nonparametric
tests of significance

Parametric test of significance - to estimate at least one


population parameter from sample statistics
Assumption: the variable we have measured in the sample is
normally distributed in the population to which we plan to
generalize our findings

Nonparametric test - distribution free, no assumption about


the distribution of the variable in the population
Parametric and nonparametric
tests of significance
                        Nonparametric tests             Parametric tests
                        Nominal data    Ordinal data    Ordinal, interval, ratio data
One group
Two unrelated groups
Two related groups
K unrelated groups
K related groups
Some concepts related to the statistical
methods.

Multiple comparison

two or more data sets, which should be analyzed

– repeated measurements made on the same


individuals

– entirely independent samples


Some concepts related to the statistical
methods.
Sample size
number of cases, on which data have been
obtained
Which of the basic characteristics of a distribution are more sensitive to the sample size?
• central tendency (mean, median, mode): mean
• variability (standard deviation, range, IQR): standard deviation
• skewness: skewness
• kurtosis: kurtosis
Some concepts related to the statistical
methods.

Degrees of freedom
the number of scores, items, or other units in
the data set, which are free to vary

One- and two tailed tests


one-tailed test of significance used for directional
hypothesis
two-tailed tests in all other situations
Selected nonparametric tests
Chi-Square goodness of fit test.

to determine whether a variable has a frequency distribution comparable to the one expected

$$\chi^2 = \sum_i \frac{(f_{oi} - f_{ei})^2}{f_{ei}}$$

where f_oi is the observed and f_ei the expected frequency in category i

expected frequency can be based on


• theory
• previous experience
• comparison groups
Selected nonparametric tests
Chi-Square goodness of fit test.
Example
The average prognosis of total hip replacement in relation to pain reduction in hip joint is (expected):
excellent - 80%
good - 10%
medium - 5%
bad - 5%
In our study we got a different outcome (observed):
excellent - 95%
good - 2%
medium - 2%
bad - 1%

Do the observed frequencies differ from the expected ones?


Selected nonparametric tests
Chi-Square goodness of fit test.
Example
fe1 = 80, fe2 = 10, fe3 = 5, fe4 = 5;
fo1 = 95, fo2 = 2, fo3 = 2, fo4 = 1;

χ² = 14.2, df = 3 (4-1)

Critical values for df = 3:
χ² > 7.815    p < 0.05
χ² > 11.345   p < 0.01
χ² > 16.266   p < 0.001

0.001 < p < 0.01
Null hypothesis is rejected at 5% level
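The statistic in this example can be reproduced with a short Python sketch. The function name is illustrative, the percentages are treated as frequencies out of 100 as on the slide, and 7.815 is the tabulated 5% critical value of the chi-square distribution for df = 3.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (fo - fe)^2 / fe."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

expected = [80, 10, 5, 5]   # excellent, good, medium, bad
observed = [95, 2, 2, 1]

statistic = chi_square_gof(observed, expected)
df = len(observed) - 1      # number of categories minus one

CRITICAL_05_DF3 = 7.815     # tabulated 5% critical value for df = 3
print(f"chi-square = {statistic:.1f}, df = {df}")
print("reject H0 at the 5% level" if statistic > CRITICAL_05_DF3
      else "do not reject H0 at the 5% level")
```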
Selected nonparametric tests
Chi-Square test.

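Chi-square statistic (test) is usually used with an R (row) by C (column) table.

Expected frequencies can be calculated:

$$F_{rc} = \frac{f_r f_c}{N}$$

where f_r and f_c are the row and column totals and N is the grand total, then

$$\chi^2 = \sum_i \sum_j \frac{(f_{ij} - F_{ij})^2}{F_{ij}}$$

df = (R-1)(C-1)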
Selected nonparametric tests
Chi-Square test. Example
Question: whether men are treated more aggressively for
cardiovascular problems than women?

Sample: people who have similar results on initial testing
Response: whether or not a cardiac catheterization was recommended
Independent: sex of the patient


Selected nonparametric tests
Chi-Square test. Example

Result: observed frequencies

                Sex
Cardiac Cath    male    female    Row total
No              15      16        31
Yes             45      24        69
Column total    60      40        100
Selected nonparametric tests
Chi-Square test. Example

Result: expected frequencies

                Sex
Cardiac Cath    male    female    Row total
No              18.6    12.4      31
Yes             41.4    27.6      69
Column total    60      40        100
Selected nonparametric tests
Chi-Square test. Example

Result:

χ² = 2.52, df = 1 = (2-1)(2-1)

p > 0.05

Null hypothesis is accepted at 5% level

Conclusion: Recommendation for cardiac catheterization is not related to the sex of the patient
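The expected frequencies and the test statistic for the catheterization table can be recomputed with the sketch below. Function names are illustrative, and 3.841 is the tabulated 5% critical value for df = 1.

```python
def expected_frequencies(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an R x C table."""
    expected = expected_frequencies(table)
    stat = sum((o - e) ** 2 / e
               for row_o, row_e in zip(table, expected)
               for o, e in zip(row_o, row_e))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: catheterization No / Yes; columns: male / female
observed = [[15, 16], [45, 24]]
stat, df = chi_square_independence(observed)

CRITICAL_05_DF1 = 3.841  # tabulated 5% critical value for df = 1
print(expected_frequencies(observed))
print(f"chi-square = {stat:.2f}, df = {df}, "
      f"significant at 5%: {stat > CRITICAL_05_DF1}")
```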
Selected nonparametric tests
Chi-Square test. Underlying assumptions.
✔ Frequency data: cannot be used to analyze differences in scores or their means
✔ Adequate sample size: expected frequencies should not be less than 5
✔ Measures independent of each other: no subject can be counted more than once
✔ Theoretical basis for the categorization of the variables: categories should be defined prior to data collection and analysis
Selected nonparametric tests
Fisher’s exact test. McNemar test.

 For N x N design and very small sample size Fisher's


exact test should be applied

 McNemar test can be used with two dichotomous


measures on the same subjects (repeated
measurements). It is used to measure change
Parametric and nonparametric tests of
significance
                        Nonparametric tests                           Parametric tests
                        Nominal data                  Ordinal data    Ordinal, interval, ratio data
One group               Chi square goodness of fit
Two unrelated groups    Chi square
Two related groups      McNemar's test
K unrelated groups      Chi square test
K related groups
Selected nonparametric tests
Ordinal data independent groups.

Mann-Whitney U : used to compare two groups

Kruskal-Wallis H: used to compare two or more


groups
Selected nonparametric tests
Ordinal data independent groups.
Mann-Whitney test
Null hypothesis : Two sampled populations are
equivalent in location

The observations from both groups are


combined and ranked, with the average rank
assigned in the case of ties.

If the populations are identical in location, the


ranks should be randomly mixed between the
two samples
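The ranking step described above can be sketched as follows: pool both groups, assign average ranks to ties, and derive the U statistic from the rank sum of one group. This is an illustrative implementation with hypothetical function names; it computes the statistic only and does not look up a p-value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(group_a, group_b):
    """Return the smaller of the two U statistics for two independent groups."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

# Made-up ordinal scores for two independent groups
print(mann_whitney_u([3, 4, 2, 5], [6, 4, 7, 8]))
```

A U close to zero means the ranks of the two groups barely overlap; a U near half of n_a × n_b means the ranks are well mixed, as expected when the populations are identical in location.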
Selected nonparametric tests
Ordinal data independent groups.
Kruskal-Wallis test
k- groups comparison, k ≥ 2

Null hypothesis : k sampled populations are


equivalent in location

The observations from all groups are combined


and ranked, with the average rank assigned in
the case of ties.

If the populations are identical in location, the


ranks should be randomly mixed between the k
samples
Selected nonparametric tests
Ordinal data related groups.

Wilcoxon matched-pairs signed rank test:


used to compare two related groups

Friedman matched samples:


used to compare two or more related
groups
Selected nonparametric tests
Ordinal data 2 related groups Wilcoxon signed rank test
Two related variables. No assumptions about the
shape of distributions of the variables.

Null hypothesis : Two variables have the same


distribution

Takes into account information about the


magnitude of differences within pairs and gives
more weight to pairs that show large differences
than to pairs that show small differences.

Based on the ranks of the absolute values of the differences


between the two variables.
Parametric and nonparametric tests of
significance
                        Nonparametric tests                                                               Parametric tests
                        Nominal data                  Ordinal data
One group               Chi square goodness of fit    Wilcoxon signed rank test
Two unrelated groups    Chi square                    Wilcoxon rank sum test, Mann-Whitney test
Two related groups      McNemar's test                Wilcoxon signed rank test
K unrelated groups      Chi square test               Kruskal-Wallis one-way analysis of variance
K related groups                                      Friedman matched samples
Selected parametric tests
One group t-test. Example

Comparison of sample mean with a population


mean
It is known that the weight of young adult males has a mean value of 70.0 kg with a standard deviation of 4.0 kg.
Thus the population mean, µ = 70.0 and population standard deviation, σ = 4.0.
Data from a random sample of 28 males of similar ages but with a specific enzyme defect: mean body weight of 67.0 kg and sample standard deviation of 4.2 kg.
Question: Does the studied group have a significantly lower body weight than the general population?
Selected parametric tests
One group t-test. Example

population mean, µ = 70.0
population standard deviation, σ = 4.0
sample size = 28
sample mean, x̄ = 67.0
sample standard deviation, s = 4.2

Null hypothesis: There is no difference between sample mean and population mean.

t-statistic = (67.0 - 70.0) / (4.2 / √28) = -3.78, df = 27, p < 0.05

Null hypothesis is rejected at 5% level
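The one-group t-statistic can be recomputed from the figures stated in the example (µ = 70.0, n = 28, sample mean 67.0, sample SD 4.2). The function name is illustrative, and 2.052 is the tabulated two-tailed 5% critical value of Student's t for 27 degrees of freedom.

```python
import math

def one_sample_t(sample_mean, population_mean, sample_sd, n):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error, n - 1

t_stat, df = one_sample_t(67.0, 70.0, 4.2, 28)

CRITICAL_T_05_DF27 = 2.052  # tabulated two-tailed 5% critical value, df = 27
print(f"t = {t_stat:.2f}, df = {df}")
print("significant at the 5% level:", abs(t_stat) > CRITICAL_T_05_DF27)
```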


Selected parametric tests
Two unrelated group, t-test. Example

Comparison of means from two unrelated groups


Study of the effects of anticonvulsant therapy on
bone disease in the elderly.
Study design:
Samples: group of treated patients (n=55)
group of untreated patients (n=47)
Outcome measure: serum calcium
concentration
Research question: Do the groups differ statistically significantly in mean serum calcium concentration?
Test of significance: Pooled t-test
Selected parametric tests
Two unrelated group, t-test. Example

Comparison of means from two unrelated groups


Study of the effects of anticonvulsant therapy on
bone disease in the elderly.
Study design:
Samples: group of treated patients (n=20)
group of untreated patients (n=27)
Outcome measure: serum calcium
concentration
Research question: Do the groups differ statistically significantly in mean serum calcium concentration?
Test of significance: Separate t-test
Selected parametric tests
Two related group, paired t-test. Example

Comparison of means from two related variables

Study of the effects of anticonvulsant therapy on bone disease in the elderly.
Study design:
Sample: group of treated patients (n=40)
Outcome measure: serum calcium concentration before and after operation
Research question: Does the mean serum calcium concentration differ statistically significantly before and after operation?
Test of significance: paired t-test
Selected parametric tests
k unrelated group, one -way ANOVA test. Example
Comparison of means from k unrelated groups
Study of the effects of two different drugs (A and B) on weight reduction.
Study design:
Samples: group of patients treated with drug A (n=32)
group of patients treated with drug B (n=35)
control group (n=40)
Outcome measure: weight reduction
Research question: Do the groups differ statistically significantly in mean weight reduction?
Test of significance: one-way ANOVA test
Selected parametric tests
k unrelated group, one -way ANOVA test. Example

The group means compared with the overall mean


of the sample

Visual examination of the individual group means


may yield no clear answer about which of the
means are different

Additionally post-hoc tests can be used (Scheffe or


Bonferroni)
Selected parametric tests
k related group, two -way ANOVA test. Example
Comparison of means for k related variables

Study of the effects of drug A on weight reduction.
Study design:
Samples: group of patients treated with drug A (n=35)
control group (n=40)
Outcome measure: weight at Time 1 (before using drug) and Time 2 (after using drug)
Kate Grayson
Why Segmentation?

 Used by e.g. retail and consumer product companies


 Trying to learn about and describe their customers'
buying habits, gender, age, income level, etc.
 These companies tailor their marketing and product
development strategies to each consumer group to
increase sales and build brand loyalty.
 A valuable approach in Market Research, and SPSS
offers some useful tools to facilitate this commercial
process
Segmentation in SPSS

 Most of the techniques for segmentation and


profiling are exploratory
 There is no right or wrong answer, and the
results are open to interpretation
 Trying to make sense of the data or find
patterns
 Iterative techniques
 If it does not make business sense then it is
not a good model!
Segmentation in SPSS

Techniques include:
 Factor Analysis / Principal Components
Analysis
 Hierarchical Clustering
 K-Means Cluster
 Non-Linear Principal Components Analysis
(PRINCALS/CATPCA)
 The new Two-Step Cluster
Which Technique to Use?

[Diagram: techniques placed between exploratory and confirmatory use: Cluster Analysis, Categories, Factor Analysis, Discriminant Analysis, AnswerTree]
Which Test to use?
 Factor Analysis - to find patterns within variables
 Categories - use if data doesn’t fit assumptions for Factor
Analysis
 Cluster Analysis - to find patterns between individuals
 Two-Step Cluster – To use with both categorical and
continuous variables
 Discriminant Analysis - to look for differences between
groups, try to predict target variable
 AnswerTree - combinations of data, to predict target
Multivariate Analysis

 These techniques are inter-


related, but don’t have to use
all of them

 Can use a combination of


these techniques to segment
the data
Main Considerations

 Looking for patterns or trying to


make predictions?
 Levels of Measurement of the
data (categorical or continuous)
 Sample size
 Missing values
 Does data fulfil assumptions for
test?
Handling Missing Data

 Check before analysis for any patterns within


missing data
 Check before analysis that missing values are
defined as missing - otherwise may
compromise the model
 Be aware that most segmentation techniques
ignore any cases with missing values - so may
have less usable data than you think!
Variable and Value
Labels….
 It is worth checking the labels on your
file
 SPSS may truncate long variable and
value labels in the output, making it
difficult to interpret the output
 Make sure all the useful information is
at the beginning of the variable and
value labels - so even if they are
truncated, the output is still easy to
read
Data Coding

 Check the direction of the coding scheme,


and maybe consider re-coding the data if
the codes are counter-intuitive
 e.g. if have a rating scale that ranges from
high to low, rather than low to high…
 ... it can be difficult to interpret output and
factor scores etc. once the data has been
through several transformations
Sample Data

• Data = usage of underarm deodorants for men
• Three brands tested:
  • ‘Rambo’: the current market leader
  • ‘Brad’: second most popular
  • ‘Clint’: recently launched product

Brand usually use: frequency within sample

Brand               Frequency    Percent
Rambo AP Spray      1013         39.5
Rambo AP Roll-on    624          24.3
Brad AP Spray       441          17.2
Brad AP Roll-on     140          5.5
Clint AP Spray      278          10.8
Clint AP Roll-on    71           2.8
Total               2567         100.0
Profiling the
Customers..
‘Clint’ isn’t selling as well as was hoped, so the
research aims to find out:
 Who is buying ‘Clint’?
 What sort of characteristics do they share?
 Who is buying the other deodorants tested?
 How might the marketing campaign be changed
to ensure that the correct market is targeted?
Data Collected

 Ratings of a range of lifestyle attribute


questions, e.g. ‘I tend to own the most up-to-
date products’, ‘My family is most important
thing in my life’, ‘I prefer to dress and entertain
casually’ etc. (34 of these)
 Demographics: age, type of work, exercise etc.
 Brand of D/O usually use
 How see yourself in relation to others, e.g.
‘What makes you distinctive from your friends’
Segmentation – the steps

1. Run Principal Components Analysis on the ‘attribute rating’ questions, to see if there are any underlying dimensions in the variables
2. Check using Discriminant Analysis to see if these dimensions help predict brand used
3. Run Cluster Analysis to see if similarities can be found between cases
4. Decide if other variables need to be included, e.g. categorical demographics
5. Run Two-Step Cluster using all variables
Factor Analysis: what is
it?
 Looks for relationships between continuous variables
(based on correlations), in this case ‘attribute rating’
questions
 Derives underlying constructs or dimensions in the
data
 Tries to reduce a large number of variables to a small
number of factors which explain most of the variance
in the data
 If can’t interpret the resulting solution then no good!
Factor Analysis Results

The best solution produced 9 factors, interpreted below:


 F1: High computer use
 F2: Rules, need to conform
 F3: Party animal
 F4: Family man
 F5: Likes new products, experiments
 F6: Likes pampering, pays more for trusted brands
 F7: Cautious, follower rather than leader for new products
 F8: Relaxed, casual
 F9: Home loving
Do these factors
help?
Run Discriminant Analysis to see if the factors can predict D/O used

[Figure: Combined Groups Plot; x-axis: Function 1 (-7.5 to 7.5); y-axis: Function 2 (-4 to 4); cases and group centroids shown by brand usually used: Rambo AP Spray, Rambo AP Roll-on, Brad AP Spray, Brad AP Roll-on, Clint AP Spray, Clint AP Roll-on]
Factor Analysis Results

 The factors are good at predicting ‘Rambo’


usage, but not at differentiating between
‘Brad’ and ‘Clint’
 So try instead investigating relationships
between cases – using Cluster Analysis
 Options for clustering are:
 Hierarchical Cluster
 K-Means Cluster
 Two-Step Cluster
Hierarchical Cluster

 This is often thought of as the


‘proper cluster’ method
 Looking for natural groupings within
the data
 Bases groupings upon the similarity
or dissimilarity between cases,
rather than variables
 Very iterative technique – time
consuming!
•Clustering Data - Diagram
•= data point:
•one case
Decisions before
Cluster:
Which variables to use?
Which distance measures between cases to use?
Which criteria for creating clusters to choose?
NB
The quality of the analysis will always depend upon the
variables used
Cluster Analysis will always find a solution!
It is not possible to assess in the analysis itself how
appropriate a variable is
Stages of Hierarchical
Cluster:
Select variables for analysis (carefully!)
Build and assess model
Save cluster membership
If required, create cluster matrix for K-
Means
NB
Because based on cases, need to make sure
data is measured on same scale - if not,
data should be standardized
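Standardizing here usually means converting each variable to z-scores before distances between cases are computed. The sketch below shows the idea in plain Python; the variable names and values are made up, and this is not how SPSS implements it internally.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the sample SD."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variables measured on very different scales
age_years = [20, 30, 40]
rating_1_to_5 = [1, 3, 5]

print(standardize(age_years))      # both variables now share
print(standardize(rating_1_to_5))  # the same unit-free scale
```

Without this step, a variable with a large numeric range (such as age) would dominate the distance between cases simply because of its units.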
Decision with D/O Data

 I can’t get a very good (i.e. useful to


the business) model from
Hierarchical Cluster analysis
 Also, I want to be able to include
both categorical and continuous
variables in the same model
 So I decide to use Two-Step Cluster
instead
Two-Step Cluster
 The TwoStep Cluster Analysis procedure is an
exploratory tool designed to reveal natural
groupings (or clusters) within a data set that would
otherwise not be apparent.
 The algorithm employed by this procedure has
several features that differentiate it from traditional
clustering techniques:
 The ability to create clusters based on both categorical and
continuous variables.
 Automatic selection of the number of clusters.
 The ability to analyze large data files efficiently.
TwoStep Cluster
 Uses scalable cluster analysis algorithm
 This algorithm can handle both continuous and
categorical variables or attributes and requires only
one data pass in the procedure
 The first step of the procedure pre-clusters the records
into many small sub-clusters
 Then it clusters the sub-clusters created in the pre-
cluster step into the desired number of clusters
 If the desired number of clusters is unknown, TwoStep
Cluster analysis automatically finds the proper
number of clusters
Two-Step Cluster

• This is unlike other clustering methods


in SPSS - if the desired number of
clusters is unknown, TwoStep Cluster
analysis automatically finds the proper
number of clusters
• Or you can pre-specify the number of
clusters required - flexibility


Link to more information

 More useful information about Two-Step Cluster can be


found at the following websites:
 http://www.rrz.uni-hamburg.de/RRZ/Software/SPSS/Algorith.12
 NB This was the handout for the talk, with algorithm etc.

 Also useful:
 http://www.spss.com/pdfs/S115AD8-1202A.pdf
 http://www.norusis.com/pdf/SPC_v13.pdf
Brand usually use by
Cluster

‘Clint’ spray seems to be associated with Cluster 6,


with the roll-on version being associated with
Clusters 4 and 2
branduse Brand usually use

Percent
Cluster
1 2 3 4 5 6 Combined
Rambo AP Spray .0% 18.1% 52.3% .0% 29.6% .0% 100.0%
Rambo AP Roll-on 70.3% 29.7% .0% .0% .0% .0% 100.0%
Brad AP Spray .0% .0% .0% 100.0% .0% .0% 100.0%
Brad AP Roll-on .0% 3.6% .0% 96.4% .0% .0% 100.0%
Clint AP Spray .0% .4% .0% .0% .0% 99.6% 100.0%
Clint AP Roll-on .0% 14.3% .0% 85.7% .0% .0% 100.0%
Employment Status by
Cluster
Cluster 2 (‘Clint’ roll-on) is largely made up of part-time, retired and not working respondents. Cluster 4 also has a high number of retired respondents, while Cluster 6 (‘Clint’ spray) also has a high percentage of part-time and unemployed respondents.
employ Employment Status

Percent
Cluster
1 2 3 4 5 6 Combined
Full time employment 24.5% 2.3% 29.7% 13.8% 16.8% 12.9% 100.0%
Part-time employment 11.9% 61.9% 4.8% 2.4% .0% 19.0% 100.0%
Not employed .0% 79.3% .0% 9.8% .0% 10.9% 100.0%
Student .0% 91.9% .0% 1.0% .0% 7.1% 100.0%
Retired .0% 61.1% .0% 33.3% 5.6% .0% 100.0%
Age Group by Cluster

Cluster 2 (‘Clint’ roll-on) is largely made up of the


younger and older age groups, Cluster 4 also has a
high percentage of older respondents. Cluster 6 is
more from 25 years upwards
agerseu Age of respondent

Percent
Cluster
1 2 3 4 5 6 Combined
Under 18 .0% 96.8% .0% 3.2% .0% .0% 100.0%
18-24 .0% 57.2% .0% 6.9% 27.6% 8.3% 100.0%
25-34 18.3% 11.1% 47.4% 10.5% .0% 12.7% 100.0%
35-44 23.0% 3.8% 44.6% 14.2% .0% 14.4% 100.0%
45-54 29.7% 5.5% .0% 13.1% 39.5% 12.2% 100.0%
55-64 15.5% 38.7% .0% 15.2% 19.1% 11.6% 100.0%
65 or over .0% 68.8% .0% 18.8% .0% 12.5% 100.0%
TwoStep Cluster Number = 4

•Cluster 4 (‘Clint’ roll-on) has below average computer use and need to conform, above average on ‘Home

[Figure: bar chart of factor scores for Cluster 4; x-axis from -30 to 0; y-axis: Variable, listed as F1: High computer use; F8: Relaxed, casual; F2: Rules, need to conform; F9: Home loving; F4: Family man; F7: Cautious, follower rather than leader for new products; F5: Likes new products, experiments; F6: Likes pampering, pays more for trusted brands; F3: Party animal]
TwoStep Cluster Number = 6

•Cluster 6 (‘Clint’ spray) has above average scores on ‘Relaxed, Casual’ but not much else – this is Mr Laid Back!

[Figure: bar chart of factor scores for Cluster 6; x-axis from -40 to 40; y-axis: Variable, listed as F1: High computer use; F2: Rules, need to conform; F8: Relaxed, casual; F7: Cautious, follower rather than leader for new products; F4: Family man; F6: Likes pampering, pays more for trusted brands; F9: Home loving; F5: Likes new products, experiments; F3: Party animal]
Summary of Findings

 Profiling of this data suggests that ‘Clint’ is not


targeting the expected market
 ‘Clint’ is often not seen as sufficiently different from
‘Brad’, it has no perceived USP
 ‘Clint’ is being used by a high percentage of older,
retired, and part-time or not employed consumers,
which may be a result of the aggressive product
launch campaign with free samples, discounted prices
etc.
 ‘Clint’ marketing needs some more work!
Summary of Segmenting and
Profiling this data using
SPSS
 Principal Components Analysis helped investigate
relationships between the rated attribute variables
 Hierarchical Cluster was used to try and find similarities
between cases, using the factors derived from PCA
 Two-Step Cluster was then used to enable clustering of both
continuous and categorical variables in the same model
 Useful conclusions were drawn about the market positioning
of ‘Clint’ deodorant
Windows Display System
What is SAS?

 SAS is a comprehensive statistical software


system which integrates utilities for storing,
modifying, analyzing, and graphing data.
 SAS runs on both Windows and UNIX platforms
 SAS is used in a wide range of industries such as
healthcare, education, financial services, life
sciences,…
 Check out the webpage to learn more
 http://www.sas.com/
SAS User Interface
•Run button – click on this button to run SAS code
•Tool bar similar to Windows applications
•Click here for SAS help
•New Window button
•Save button
•Explorer Window
•Editor Window
•Log Window
•Results Window (not shown)
•Output Window (not shown)
Editor Window

•The Editor Window contains inputted data sets and SAS programs
Explorer Window

•Libraries Folder - Contains data sets created in SAS
•Explorer Window
Libraries Folder

•Contents of the Libraries Folder
•Contents of the Work Folder
•The Work Folder contains data sets created in SAS
•These are the data sets that have been created in SAS through inputting data and by creating data sets in SAS programs
Log Window
•The Log Window contains a record of all commands submitted to SAS and shows errors in the commands.
Output Window

•The Output Window contains output based on SAS programs submitted in the Editor Window.
Results Window

•The Results Window shows a listing of SAS programs that have been submitted in the order that they were submitted.
•Click on any procedure to view all output parts of the procedure and click on any individual part to view the actual output.
SAS Help
•PRINCESS C. BARCEGA
•APG SCHOOL, MANAMA,
BAHRAIN

What do you know about

 bar graph?
 double bar graph?
 Histogram?
Bar Graph

• A bar graph can be used to display and compare data
• The scale should include all the data values and be easily divided into equal intervals.

[Figure: horizontal bar graph with bars for Spanish, Mandarin, Hindi and English on a scale from 0 to 1000]
How to interpret a Bar
Graph?
The bar graph shows Mr. Snowden’s students by gender and band membership.

• How many of Mr. Snowden’s students are band members?
• How many of Mr. Snowden’s students are not band members?

[Figure: bar graph with bars for Female band, Female not band, Male band and Male not band on a scale from 0 to 7]
Double Bar Graph

• Can be used to compare two related sets of data

[Figure: double bar graph for 1st Qtr, 2nd Qtr, 3rd Qtr and 4th Qtr on a scale from 0 to 90]
How to make a Double-
Bar Graph?
 Choose a scale and
interval for the vertical
axis.
 Draw a pair of bars for
each country’s data. Use
different colors to show
males and females.
 Label the axes and give
the graph a title.
 Make a key to show what
each bar represents.
The table shows the highway speed limits on interstate roads within three states.

State      Urban      Rural
Florida    65 mi/h    70 mi/h
Texas      70 mi/h    70 mi/h
Vermont    55 mi/h    65 mi/h


Step 1

• Choose a scale and interval for the vertical axis.

[Figure: empty vertical axis marked 0, 20, 40, 60, 80, shown beside the speed limit table]
Step 2 Draw a pair of bars for each
state’s data. Use different colors
to show urban and rural.

80

State Urban Rural 60

Florida 65mi/h 70 mi/h


40
Texas 70 mi/h 70 mi/h
Vermont 55mi/h 65 mi/h 20

0
Florida T exas Vermon t
Step 3 and 4
• Label the axes and give the graph a title.
• Make a key to show what each bar represents

[Figure: double bar graph titled "Speed Limit on Interstate Roads"; x-axis: Florida, Texas, Vermont; y-axis: Speed Limit (mi/h), 0 to 80; key: Urban, Rural]
Histogram

 Histogram is a bar graph that shows


the frequency of data within equal
intervals.
 There is no space in between the
bars.
How to make a
histogram?
The table below shows the number of hours students watch TV in one week. Make a histogram of all the data.

Number of hours of TV    Tally (count)
1                        2
2                        4
3                        9
4                        6
5                        8
6                        3
7                        9
8                        3
9                        4
Step 1
• Make a frequency table of the data. Be sure to use equal intervals

Number of hours of TV    Frequency
1-3                      15
4-6                      17
7-9                      16
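Grouping the tallies into equal intervals can be sketched in Python. The per-hour counts below are our reading of the tally marks (each struck-through group counted as five), which reproduces the totals in the frequency table; the function name is illustrative.

```python
# Students per number of hours, read from the tally marks (assumed reading)
counts_by_hours = {1: 2, 2: 4, 3: 9, 4: 6, 5: 8, 6: 3, 7: 9, 8: 3, 9: 4}

def group_into_intervals(counts, width, start):
    """Sum counts into equal-width intervals labelled 'low-high'."""
    table = {}
    for value, count in sorted(counts.items()):
        lower = start + ((value - start) // width) * width
        label = f"{lower}-{lower + width - 1}"
        table[label] = table.get(label, 0) + count
    return table

frequency_table = group_into_intervals(counts_by_hours, width=3, start=1)
for interval, frequency in frequency_table.items():
    print(f"{interval}: {frequency}")
```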
Step 2
• Choose an appropriate scale and interval for the vertical axis. The greatest value on the scale should be at least as great as the greatest frequency.

[Figure: empty axes with vertical scale 0, 4, 8, 12, 16, 20 and horizontal intervals 1-3, 4-6, 7-9, shown beside the frequency table]
Step 3
• Draw a bar for each interval. The height of the bar is the frequency for that interval. Bars must touch but not overlap.
• Label the axes and give the graph a title

[Figure: histogram titled "Hours of Television Watched"; x-axis: Hours (1-3, 4-6, 7-9); y-axis: Number of students (0 to 20); bar heights 15, 17 and 16]
The list below shows the results
of a typing test in words per
minute. Make a histogram of
the data.
62, 55, 68, 47, 50, 41, 62, 39,
54, 70, 56, 70, 56, 47, 71, 55,
60, 42
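One way to start this exercise is to build the frequency table in Python and print a rough text histogram. The choice of intervals (width 10, starting at 30) is an assumption made for illustration; the exercise itself does not prescribe them.

```python
words_per_minute = [62, 55, 68, 47, 50, 41, 62, 39,
                    54, 70, 56, 70, 56, 47, 71, 55,
                    60, 42]

def frequency_table(data, width, start):
    """Count values in equal-width intervals labelled 'low-high'."""
    intervals = {}
    lower = start
    while lower <= max(data):          # create every interval, even empty ones
        intervals[f"{lower}-{lower + width - 1}"] = 0
        lower += width
    for value in data:
        lower = start + ((value - start) // width) * width
        intervals[f"{lower}-{lower + width - 1}"] += 1
    return intervals

table = frequency_table(words_per_minute, width=10, start=30)
for interval, frequency in table.items():
    print(f"{interval:>5} | {'#' * frequency} ({frequency})")
```

Each row of `#` marks corresponds to one bar of the histogram, with the bar height equal to the interval's frequency.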
Essential Information
Commonly used visual tools
 Charts:
 Bar
 Line
 Pie
 XY
 Area
 Thematic map
