Evolution of Outsourcing
Knowledge Based
2007 and beyond
1. Knowledge of SAS
Track | Profile | Account/Role | Time | Designation | Skills
Technical | Statistician - Expertise Buildup | Project Accounts | 12-24 Months | Statistical Analyst | Exposure to projects in segmentation, SEM / other techniques across accounts and domains
Technical | Statistician - Expertise Deployment | TL - accounts | 24-36 Months | Lead - Statistical Solutions | Set up processes / provide statistical solutions to new projects / accounts
Technical | Statistician - Evangelist | TL - Multiple accounts | > 36 months | Statistical Evangelist | Take on lead training role / lead role in multiple accounts
Growth over 5 years - Technical – Tools experts
Track | Profile | Account/Role | Time | Designation | Skills
Technical | Tools Experts - Learning Phase | Individual Contributor | 0-12 Months | Tools Expert | Individual Contributor for Reporting / Automation; Individual Contributor for modeling; Gain expertise in SAS / Standardised Dataset creation / automation techniques / IBM database expertise
Technical | Tools Experts - Expertise buildup | Project Accounts | 12-24 Months | Data and Reporting Strategist | Independently provide optimal solutions across multiple platforms / databases to clients reporting needs
Technical | Tools Experts - Expertise Deployment | TL - accounts | 24-36 Months | Lead - Analytics Support | Independently provide optimal solutions across multiple platforms / databases across multiple clients in an account or across 2-3 accounts
Informational
1. Which region, dealer and product have the highest sales?
2. In the credit cards business, which segment of customers is the most profitable?
3. Who is the most successful salesperson in the organization?
4. Which competitor is gaining market share?
Predictive
1. How will the profitability of my bank be affected if a younger profile of people is targeted?
2. What will be the increase in sales for Liril if a ‘fairness’ feature is added to the campaign?
3. How many mailers need to be sent for a new life insurance product to get 1000 new insurance applications?
4. Which factors need to be focused on to maximize sales of credit cards?
Answering these questions requires
Quantitative
Statistics - Definition
a) Collection,
b) Classification,
c) Analysis,
e) Presentation of data.
Data and Variables
Quantitative:
Frequency Distribution
Relative Frequency and Percent Frequency Distributions
Dot Plot
Histogram
Cumulative Distributions
Ogive
Frequency Distribution
Rating          Frequency   Relative Frequency   Percent Frequency
Poor            2           0.10                 10
Below Average   3           0.15                 15
Average         5           0.25                 25
Above Average   9           0.45                 45
Excellent       1           0.05                 5
Total           20          1.00                 100
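The relative and percent frequency columns follow directly from the counts. A minimal Python sketch, assuming the ratings are held in a plain dictionary (the variable names are illustrative and not part of the original material):

```python
# Frequency counts taken from the rating table.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequency.values())  # number of observations

# Relative frequency = class count / total; percent frequency = relative * 100.
relative = {rating: count / total for rating, count in frequency.items()}
percent = {rating: 100 * share for rating, share in relative.items()}

for rating in frequency:
    print(f"{rating:<14} {frequency[rating]:>3} {relative[rating]:>6.2f} {percent[rating]:>5.0f}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any hand-built table.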
Bar Graph
A bar graph is a graphical device for depicting qualitative data.
On the horizontal axis we specify the labels that are used for each of the
classes.
A frequency, relative frequency, or percent frequency scale can be used for
the vertical axis.
[Figure: bar graph of Frequency (vertical axis) by Rating (horizontal axis): Poor, Below Average, Average, Above Average, Excellent]
Pie Chart
The pie chart is a commonly used graphical device for presenting
relative frequency distributions for qualitative data. Example -
[Figure: pie chart of ratings: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
Quantitative data representation: example
910 780 930 570 750 520 990 880 970 620
710 690 720 890 660 750 790 750 720 760
1040 740 620 680 970 1050 770 650 800 1090
850 970 880 680 830 680 710 690 670 740
620 820 980 1010 790 1050 790 690 620 730
Frequency Distribution
[Figure: dot plot of the parts-cost data; horizontal axis: Cost (Rs), 500 to 1100]
Histogram
[Figure: histogram of the parts-cost data; vertical axis: Frequency (0 to 18); horizontal axis: Parts Cost (Rs), 500 to 1100]
Cumulative Distributions
Cost (Rupees)   Cum. frequency   Cum. relative frequency
<=590           2                .04
<=690           15               .30
<=790           31               .62
<=890           38               .76
<=990           45               .90
<=1090          50               1.00
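The cumulative columns can be recomputed from the 50 parts-cost values listed in the quantitative data example. A short Python sketch, assuming the class upper limits 590, 690, ..., 1090 (the variable names are illustrative):

```python
# Parts-cost data (Rs) from the quantitative data example.
costs = [
    910, 780, 930, 570, 750, 520, 990, 880, 970, 620,
    710, 690, 720, 890, 660, 750, 790, 750, 720, 760,
    1040, 740, 620, 680, 970, 1050, 770, 650, 800, 1090,
    850, 970, 880, 680, 830, 680, 710, 690, 670, 740,
    620, 820, 980, 1010, 790, 1050, 790, 690, 620, 730,
]

# Upper class limits used in the cumulative distribution.
upper_limits = [590, 690, 790, 890, 990, 1090]

n = len(costs)

# Cumulative frequency: how many observations are at or below each limit.
cum_freq = [sum(1 for c in costs if c <= limit) for limit in upper_limits]

# Cumulative relative frequency: cumulative frequency as a share of all observations.
cum_rel = [f / n for f in cum_freq]

for limit, f, r in zip(upper_limits, cum_freq, cum_rel):
    print(f"<={limit:<5} {f:>3} {r:.2f}")
```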
Ogive
An ogive is a graph of a cumulative distribution.
The data values are shown on the horizontal axis.
Shown on the vertical axis are the:
cumulative frequencies, or
cumulative relative frequencies, or
cumulative percent frequencies
The frequency (one of the above) of each class is plotted as
a point.
The plotted points are connected by straight lines.
Because the class limits for the parts-cost data are 500-590, 600-690, and so on, there appear to be ten-unit gaps from 590 to 600, 690 to 700, and so on.
These gaps are eliminated by plotting points halfway between the class limits.
Thus, 595 is used for the 500-590 class, 695 is used for the 600-690 class, and so on.
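The halfway-point rule can be sketched in Python. The cumulative frequencies are taken from the cumulative distribution of the parts-cost data; the variable names and the fixed class width of 100 are assumptions made for illustration:

```python
# Class limits for the parts-cost data and their cumulative frequencies.
class_limits = [(500, 590), (600, 690), (700, 790), (800, 890), (900, 990), (1000, 1090)]
cum_freq = [2, 15, 31, 38, 45, 50]
class_width = 100  # distance between successive lower class limits

total = cum_freq[-1]
ogive_points = []
for (lower, upper), f in zip(class_limits, cum_freq):
    # Plot halfway between this class's upper limit and the next class's lower limit.
    x = (upper + (lower + class_width)) / 2
    # Vertical coordinate: cumulative percent frequency.
    ogive_points.append((x, 100 * f / total))

for x, pct in ogive_points:
    print(f"{x:>7.1f} {pct:>6.1f}")
```

Connecting these points with straight lines gives the ogive with cumulative percent frequencies.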
Ogive Example: Bimal Auto Repair
[Figure: ogive with cumulative percent frequencies; vertical axis: Cumulative Percent Frequency (0 to 100)]
The number of Sobha homes sold for each style and price for
the past two years is shown below.
Insights:
About twice as many houses priced below 35,00,000 rupees were sold as houses priced above 35,00,000.
Only 6 of the houses sold were duplexes.
Scatter Diagram
[Figure: scatter diagrams of y against x]
Summary: Tabular and Graphical Procedures
Data
Dispersion
Skewness
Kurtosis
Descriptive statistics – Central Tendency
Central Tendency:
Central Tendency is the middle point of a distribution.
Measures of Central Tendency are also called Measures of Location.
Dispersion:
Dispersion is the spread of the data in a distribution, that is, the extent to which the observations are scattered.
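Both ideas can be computed with Python's standard `statistics` module. This is a sketch only; the sample is an assumption, namely the first ten parts-cost values from the quantitative data example:

```python
import statistics

# Assumed sample: the first ten parts-cost values (Rs) from the example data.
sample = [910, 780, 930, 570, 750, 520, 990, 880, 970, 620]

# Measures of central tendency (location).
mean = statistics.mean(sample)
median = statistics.median(sample)

# Measures of dispersion (spread).
value_range = max(sample) - min(sample)
variance = statistics.variance(sample)  # sample variance, divides by n - 1
std_dev = statistics.stdev(sample)      # square root of the sample variance

print(f"mean={mean}, median={median}")
print(f"range={value_range}, variance={variance:.1f}, std dev={std_dev:.1f}")
```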
[Figure: curve of f(x) against X, from -4 to 4]
Descriptive Statistics – Dispersion Continued ….
Positively Skewed
Negatively Skewed
Descriptive Statistics – Skewness
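Syntax:
/* Summary statistics for one variable: N, mean, standard deviation, minimum, maximum */
proc means data=<Dataset-name>;
    var <variable name>;
run;
Syntax:
/* Fuller profile of one variable: moments (including skewness and kurtosis), quantiles, extreme values */
proc univariate data=<Dataset-name>;
    var <variable name>;
run;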
SAS code – Descriptive Statistics
For example, we may survey a large number of consumers (say 1000) and
ask for their preference of brand of computer. We then record the number
(x) who prefer a particular brand. Since we don’t know the number we will
record for x before the experiment, it is called a random variable
Random Variables
Example –
A multiple choice exam of 20 questions. The random variable X is the
number of correct answers.
Possible values for X are 0, 1, 2, 3, 4, 5, ……. 20.
Example
The time spent studying for a course per week could be the
measurement variable X.
It could be measured in days, hours, minutes, seconds, etc… (say 600
minutes/week, or 591 minutes/week, or 590 minutes and 45 seconds,
and so on)
Discrete probability distributions
Once we know all the possible values and the probabilities associated with
those values for a Discrete Random Variable, we can construct a Discrete
Probability Distribution
[Figure: tree diagram for two coin tosses, with P(HH)=0.25, P(HT)=0.25, P(TH)=0.25]
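A discrete probability distribution can be built by listing every outcome and its probability. The sketch below assumes the experiment is two tosses of a fair coin and that X counts the heads; both are assumptions made for illustration:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every equally likely outcome of tossing a fair coin twice: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome.
heads_count = Counter(outcome.count("H") for outcome in outcomes)

# p(x) = number of outcomes giving x heads / total number of outcomes.
distribution = {
    x: Fraction(count, len(outcomes)) for x, count in sorted(heads_count.items())
}

for x, p in distribution.items():
    print(f"x={x}  p(x)={float(p):.2f}")
```

The probabilities must sum to 1, which `sum(distribution.values())` confirms.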
Discrete probability distributions (Cont.)
x 0 1 2
[Figure: bar chart of p(x) against X for x = 0, 1, 2]
Discrete probability distributions (Cont.)
x 0 1 2
$$ E(X) = \sum_{i=1}^{N} x_i \, p(x_i) $$
$$ \sigma_x^2 = E(X^2) - \mu^2 $$
$$ E(X^2) = \sum_{i=1}^{N} x_i^2 \, p(x_i) $$
$$ VAR(X) = \sigma_x^2 = \sum_{i=1}^{N} x_i^2 \, p(x_i) - \mu^2 $$
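As a worked instance of the mean and variance formulas, the sketch below uses an assumed distribution over x = 0, 1, 2 (the number of heads in two tosses of a fair coin); the probabilities are not taken from the slides:

```python
# Assumed distribution: number of heads in two tosses of a fair coin.
x_values = [0, 1, 2]
probabilities = [0.25, 0.50, 0.25]

# E(X) = sum of x_i * p(x_i)
mean = sum(x * p for x, p in zip(x_values, probabilities))

# E(X^2) = sum of x_i^2 * p(x_i)
mean_of_squares = sum(x ** 2 * p for x, p in zip(x_values, probabilities))

# VAR(X) = E(X^2) - mu^2
variance = mean_of_squares - mean ** 2

print(f"E(X)={mean}, E(X^2)={mean_of_squares}, VAR(X)={variance}")
```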
The Binomial Distribution
2. Each experimental unit can take only one of two possible outcomes.
Conventionally these are called either success or failure.
[Figure: binomial distributions, p(x) against X, for P=0.5, n=5 and P=0.3, n=10; two further panels with X from 0 to 20]
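Formula for a Binomial Distribution
$$ P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n $$
Where,
n : Number of trials
p : Probability of success on each trial
Formula for a Poisson Distribution
$$ P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!} $$
Where,
λ : Mean number of successes in a given time period, λ > 0
x : Number of successes we are interested in, where x = 0, 1, 2, …
e : Base of the natural logarithm (≈ 2.71828)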
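A minimal Python sketch of both probability functions, assuming the standard textbook forms: binomial with n trials and success probability p, and Poisson with mean λ. The function names and the value λ = 2 are illustrative assumptions:

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for a binomial variable: n trials, success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson variable with mean number of successes lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


# Binomial with P=0.5, n=5: probability of each possible number of successes.
binomial_probs = [binomial_pmf(x, 5, 0.5) for x in range(6)]
print([round(p, 4) for p in binomial_probs])

# Poisson with an assumed mean of 2 successes per period.
print(round(poisson_pmf(3, 2.0), 4))
```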
Continuous Probability Distributions
[Figure: two panels plotted against X from 0 to 10: p(x) for a discrete distribution and f(x) for a continuous distribution]
Continuous Probability Distributions
The normal distribution is important because it is central to the statistical theory of drawing conclusions from sample data about the populations from which the samples are drawn, and to Statistical Process Control.
Several characteristics make the normal distribution very important for statisticians:
a) It is bell shaped
b) It is symmetrical about the mean, which is also the median and the mode
c) Most observations in the distribution are close to the mean, with gradually fewer observations further away
The Normal Distribution (Cont..)
[Figure: normal density f(x) against X, from -4 to 4]
The Normal Distribution (Cont)
$$ P(\mu - \sigma < X < \mu + \sigma) = 0.683 $$
[Figure: density of X ∼ N(40, 10) against X, from 0 to 80, centred at µ = 40]
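The 0.683 figure can be checked numerically with `statistics.NormalDist` from the Python standard library. This sketch reads N(40, 10) as mean 40 and standard deviation 10, as in the worked example on this distribution:

```python
from statistics import NormalDist

mu, sigma = 40, 10
dist = NormalDist(mu, sigma)

# P(mu - sigma < X < mu + sigma) = F(mu + sigma) - F(mu - sigma)
within_one_sd = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)

print(f"{within_one_sd:.3f}")
```

The result does not depend on the particular µ and σ chosen; any normal distribution has about 68.3% of its probability within one standard deviation of the mean.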
The Standard Normal Distribution
A special case of the normal distribution, the standard normal
distribution has a mean of 0 and a standard deviation of 1
[Figure: standard normal density f(z) against Z, from -4 to 4]
The Standard Normal Distribution (Cont)
$$ Z = \frac{X - \mu}{\sigma} $$
Standard Normal Distribution Tables
A) Because the distribution is symmetrical, 50% of observations lie above the mean and 50% lie below it. If the mean is zero, then 50% of observations lie above zero and 50% lie below zero
i.e. if Z∼N(0,1)
P(Z<0) = 0.5
[Figure: standard normal density f(z) against Z, from -4 to 4]
Standard Normal Distribution
Example: If X is a normally distributed continuous random variable with a mean of 40 and a standard deviation of 10, what proportion of observations are a) less than 50, b) less than 20, c) between 20 and 50?
[Figure: density of X ∼ N(40, 10) against X, from 0 to 80]
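One way to work this example is to standardize each value with Z = (X − µ)/σ and read the probability from the standard normal distribution. A Python sketch using `statistics.NormalDist` in place of printed tables (the helper name `z_score` is illustrative):

```python
from statistics import NormalDist

mu, sigma = 40, 10
standard_normal = NormalDist(0, 1)


def z_score(x: float) -> float:
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


# a) P(X < 50): z = 1
p_less_50 = standard_normal.cdf(z_score(50))

# b) P(X < 20): z = -2
p_less_20 = standard_normal.cdf(z_score(20))

# c) P(20 < X < 50) = P(X < 50) - P(X < 20)
p_between = p_less_50 - p_less_20

print(f"a) {p_less_50:.4f}  b) {p_less_20:.4f}  c) {p_between:.4f}")
```

So roughly 84% of observations are below 50, about 2% are below 20, and about 82% fall between 20 and 50.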
Sampling & Sampling Distribution
Sampling and Sampling Distributions
We have previously learnt that for a given distribution, we can calculate the
probability of an individual observation lying within a certain range
In the real world, we don’t know the exact population parameters, so we use a sample to make inferences about the population
Sample
Because it is seldom possible to measure all the individuals in a population, researchers use samples and generalize their results to the population of interest
Example:
Allocate a number to each member of the population and use a random
number generator to determine which individuals will be measured
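The numbering-and-random-selection procedure can be sketched with Python's `random` module. The population size of 1000, the sample size of 50 and the seed are hypothetical values chosen for illustration:

```python
import random

# Fixed seed so the sketch gives the same sample on every run.
rng = random.Random(42)

# Allocate a number to each member of a hypothetical population of 1000.
population_ids = list(range(1, 1001))

# Draw 50 distinct members; each member is equally likely to be chosen.
sample_ids = rng.sample(population_ids, k=50)

print(sorted(sample_ids)[:10])
```

`random.Random.sample` draws without replacement, so no individual is measured twice.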
Sampling (Cont.)
Age Group          Percentage of total
Birth – 19 years   30
20 – 39 years      40
40 – 59 years      20
In Cluster sampling, we divide the population into groups, or clusters, and then select a random sample of these clusters. We assume that these individual clusters are representative of the population as a whole.
For example:
Suppose a market research team is attempting to determine, by sampling, the average number of television sets per household in a large city.
They could use a city map to divide the territory into blocks and then choose a certain number of blocks (clusters) for interviewing. Every household in each of these blocks would be interviewed.
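The television example can be sketched in Python. Everything about the city here is made up for illustration: 40 blocks, 20 to 30 households per block, 0 to 4 sets per household, and 8 blocks chosen:

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical city: block number -> TV sets owned by each household in that block.
city = {
    block: [rng.randint(0, 4) for _ in range(rng.randint(20, 30))]
    for block in range(40)
}

# Step 1: choose a random sample of blocks (clusters).
chosen_blocks = rng.sample(sorted(city), k=8)

# Step 2: interview every household in each chosen block.
tv_counts = [tv for block in chosen_blocks for tv in city[block]]

# Estimate of the average number of TV sets per household.
estimate = sum(tv_counts) / len(tv_counts)

print(f"blocks={sorted(chosen_blocks)}, households={len(tv_counts)}, estimate={estimate:.2f}")
```

Randomness enters only at the block level; within a chosen block nothing is sampled, which is what distinguishes this from drawing households directly.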
Comparison of Stratified and Cluster Sampling
With both cluster and stratified sampling, the population is divided into
well-defined groups.
We use:
a) stratified sampling when each group has small variation within itself but there is a wide variation between the groups.
b) cluster sampling in the opposite case, when there is considerable variation within each group but the groups are essentially similar to each other.
Sampling Distributions
Statisticians have also proven that the distribution of these sample means will be approximately normal when the sample size is large enough, regardless of the shape of the parent population. This is known as the Central Limit Theorem
The Central Limit Theorem
STATEMENT: For a distribution with mean µ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean µ and variance σ²/N as the sample size N increases.
The standard deviation of the sample means is called the standard error, and can be calculated by the formula:
$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} $$
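A small simulation illustrates the theorem. The sketch below assumes an exponential parent population with mean 1 and standard deviation 1 (strongly skewed, so clearly not normal), samples of size N = 30, and 5000 repeated samples; all of these are illustrative choices. With variance σ²/N, the standard error should be close to σ/√N:

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N = 30              # size of each sample
num_samples = 5000  # number of repeated samples
sigma = 1.0         # standard deviation of the exponential(1) population

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(N)])
    for _ in range(num_samples)
]

# Standard error predicted by the theorem versus the spread actually observed.
theoretical_se = sigma / math.sqrt(N)
empirical_se = statistics.stdev(sample_means)

print(f"mean of sample means = {statistics.fmean(sample_means):.3f}")
print(f"theoretical SE = {theoretical_se:.4f}, empirical SE = {empirical_se:.4f}")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve centred near the population mean, even though individual observations are far from normal.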