Data
Pristine www.edupristine.com
Agenda
Introduction
Data
2. Data
I. Population vs. Sample
2.a. Population vs. Sample
[Figure: a population from which several samples (Sample1, Sample2, Sample3) are drawn]
2.b. Case: Types of Data variables
Romanov, an Analytics consultant, works with Credit One bank. His manager gave him a list
of the names of the bank's customers and asked him to pull the information pertaining to
those customers from the bank's database. The information relates to the credit
cards issued by the bank. He needs to define the variable types and the type of value each
variable will contain. Romanov, who has just started his professional career, doesn't have a good
idea of the different variable types.
Now, suppose that after extracting the data he approaches you and asks for your help in categorizing the
different variables. Help Romanov with the variable categorization.
2.b. Case: Types of Data variables
[Table template: for each of the eight variables, fill in the Value Stored, the Variable Type, and Remarks]
2.b. Case: Types of Data variables (Data snapshot)
Columns in the data snapshot:
Sl #
Customer ID
Name of Customer
Number of Credit Cards
Age of Customer Last Birthday
Gender of the Customer
Marital Status of the Customer
Annual Salary (in USD)
Monthly Credit Card Usage
2.b. Case: Types of Data variables
[Table template: for each of the eight variables, fill in the Variable Type and Remarks]
2.b. Types of Data Variables
Data consists of a combination of "variables", which contain the actual values.
At a high level, variables are of two types, depending on the kind of values they store:
Numerical
Categorical
2.b. Types of Data Variables - Summary
Data (consists of variables)
  Numerical
    Continuous: arises from measuring
    Discrete: arises from counting
  Categorical
    Dichotomous or Binary: only two categories
    Nominal: several unordered categories
    Ordinal: several ordered categories
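As a rough illustration of the hierarchy, the sketch below assigns one detailed type to each column of the data snapshot and maps it back to the top-level split. The slides leave these assignments as question marks, so every entry in `VARIABLE_TYPES` is an assumption made for illustration (for example, gender is treated as dichotomous only if exactly two values are recorded), and the helper name `high_level_type` is hypothetical.

```python
# Hypothetical classification of the snapshot columns. The source slides do not
# give the answers, so each assignment below is an assumption for illustration.
VARIABLE_TYPES = {
    "Customer ID": "nominal",
    "Name of Customer": "nominal",
    "Number of Credit Cards": "discrete",
    "Age of Customer Last Birthday": "discrete",
    "Gender of the Customer": "dichotomous",
    "Marital Status of the Customer": "nominal",
    "Annual Salary (in USD)": "continuous",
    "Monthly Credit Card Usage": "continuous",
}

NUMERICAL = {"continuous", "discrete"}
CATEGORICAL = {"dichotomous", "nominal", "ordinal"}


def high_level_type(variable_type):
    """Map a detailed type to the top-level split: numerical or categorical."""
    if variable_type in NUMERICAL:
        return "numerical"
    if variable_type in CATEGORICAL:
        return "categorical"
    raise ValueError(f"unknown variable type: {variable_type}")


for column, detailed in VARIABLE_TYPES.items():
    print(f"{column}: {high_level_type(detailed)} ({detailed})")
```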
2.b. Case: Types of Data variables (Revisited)
2.c. Case: Summarizing Data
Romanov, an Analytics consultant, works with Credit One bank. His manager gave him some
credit card data covering the number of credit cards issued to a set of customers and
the credit limit of the cards. He has been tasked with summarizing the data in a
presentable form and preparing a report. Romanov, who has just started his professional
career, has never worked with this kind of data, so he is unfamiliar with the different
summarizing techniques.
Now, suppose he approaches you and asks for your help in preparing the report. Help Romanov
summarize the data and prepare the report.
2.c. Comments: Summarizing Data
2.c. Summarizing Data - Frequency distribution
A technique to summarize discrete data
A simple process which involves counting of distinct discrete values
The representation can be either tabular or graphical
Example: Number of credit cards owned in a sample of 3000 individuals
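Number of Credit Cards    # Customers
1                         150
2                         300
3                         450
4                         660
5                         540
6                         300
7                         240
8                         150
9                         120
10                        90
[Figure: bar chart of # Customers against # Cards (1 to 10)]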
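The counting step can be sketched in a few lines of Python. The per-card counts are the ones tabulated in these slides for the sample of 3000 individuals; the granular list `cards_per_customer` is rebuilt from that table purely so there is something to count, and the variable names are illustrative.

```python
from collections import Counter

# Counts from the slides: number of cards -> number of customers.
SOURCE_TABLE = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

# Rebuild one record per customer so there is granular data to count.
cards_per_customer = [cards for cards, n in SOURCE_TABLE.items() for _ in range(n)]

# Counting each distinct discrete value gives the frequency distribution.
frequency = Counter(cards_per_customer)

for cards in sorted(frequency):
    print(f"{cards:>2} cards: {frequency[cards]:>4} customers")
```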
2.c. Summarizing Data - Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 7 for building the frequency table of Number of Credit Cards against # Customers, with a column chart of # Customers for 1 to 10 cards]
4. Press Ctrl+Shift+Enter
2.c. Summarizing Data - Grouped Frequency distribution
A technique to summarize continuous data or discrete data having large number of observations
and an extended range
A simple process which involves counting of values falling under the different intervals (grouped)
Example and illustration 2.2: Number of customers falling under different Salary groups
Graphical representation - Bar Chart
[Figure: bar chart of # Customers by Salary Band]
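A minimal sketch of grouped counting follows. The salaries and the band width of 25,000 are hypothetical, since the underlying salary data is not reproduced in the slides. The bands here include their lower bound (for example 100000-124999), which differs slightly from the band labels used in the slides.

```python
from collections import Counter

# Hypothetical annual salaries in USD; not taken from the source data.
salaries = [52000, 61000, 74500, 88000, 101000,
            118000, 124999, 125001, 160000, 171000]
BAND_WIDTH = 25000  # assumed band width


def salary_band(salary, width=BAND_WIDTH):
    """Return the (lower, upper) bounds of the band that contains the salary."""
    lower = (salary // width) * width
    return lower, lower + width - 1


# Count how many observations fall into each interval.
grouped = Counter(salary_band(s) for s in salaries)

for (lower, upper), count in sorted(grouped.items()):
    print(f"{lower}-{upper}: {count}")
```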
2.c. Summarizing Data - Grouped Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 5 for building the grouped frequency table, with two column charts of # Customers by salary band (from 0-75000 to 950001-975000)]
1. Press Ctrl+Shift+Enter
4. From Edit, select the salary bands as the horizontal axis
5. Observe the difference between the horizontal axes of the two charts
2.c. Summarizing Data - Cumulative Frequency distribution
Cumulative frequencies are obtained by accumulating the frequencies to give the total number of
observations up to and including the value or group in question.
Example and illustration 2.3: Cumulative number of cards in the sample of 3000 individuals
# Cards    Cumulative # Customers
1          150
2          450
3          900
4          1560
5          2100
6          2400
7          2640
8          2790
9          2910
10         3000
[Figure: chart of Cumulative # Customers against # Cards]
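The accumulation can be sketched with a running total. The per-card counts below are the frequencies given in these slides for the same sample; the variable names are illustrative.

```python
from itertools import accumulate

cards = list(range(1, 11))
customers = [150, 300, 450, 660, 540, 300, 240, 150, 120, 90]

# Running total of the frequencies, up to and including each value.
cumulative = list(accumulate(customers))

for value, total in zip(cards, cumulative):
    print(f"{value:>2} cards or fewer: {total}")
```

The last cumulative entry equals the total number of observations, which is a quick check that no frequency was dropped.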
2.c. Summarizing Data - Cumulative Frequency distribution (Using MS Excel)
[Screenshots: Excel steps 1 to 5 for building the cumulative frequency table, with a chart of Cumulative # Customers]
3. Observe the last entry. It is equal to the total number of observations.
2.c. Summarizing Data - Stem-leaf diagram
Stem-leaf diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given the ages of 20 individuals in years, represent them using a stem-leaf diagram.
Sl #    Age    Age (Sorted)
1       23     21
2       33     23
3       23     24
4       33     27
5       34     30
6       21     31
7       54     33
8       52     34
9       34     35
10      36     36
11      52     39
12      51     40
13      48     43
14      35     48
15      40     49
16      43     51
17      49     52
18      54     53
19      27     54
20      39     57

Stem    Leaf
20      1 3 4 7
30      0 1 3 4 5 6 9
40      0 3 8 9
50      1 2 3 4 7
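A stem-leaf diagram can be built by splitting each value into its tens (the stem) and its units (the leaf). The sketch below uses the values from the "Age (Sorted)" column; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Values from the "Age (Sorted)" column of the illustration.
ages = [21, 23, 24, 27, 30, 31, 33, 34, 35, 36,
        39, 40, 43, 48, 49, 51, 52, 53, 54, 57]


def stem_and_leaf(values):
    """Group the units digit (leaf) of each value under its tens (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10 * 10].append(value % 10)
    return dict(stems)


diagram = stem_and_leaf(ages)
for stem, leaves in sorted(diagram.items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```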
2.c. Summarizing Data - Line Plots
Line plot diagram
Not suitable for large data. Hence, not extensively used in industry.
Illustration: Given test scores of 20 students. Represent them using line plot diagram
Sl # Score Score (Sorted)
1 50 20
2 20 20
3 50 20
4 50 30
5 50 30
6 30 30
7 30 30
8 40 30
9 30 40
10 40 40
11 30 40
12 20 40
13 50 40
14 40 50
15 20 50
16 30 50
17 40 50
18 40 50
19 50 50
20 50 50
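A line plot stacks one mark above each distinct value for every time it occurs. The text-based sketch below uses the 20 scores from the table; the layout (an `X` per observation, four characters per column) is an arbitrary choice for illustration.

```python
from collections import Counter

# Scores from the "Score" column of the illustration.
scores = [50, 20, 50, 50, 50, 30, 30, 40, 30, 40,
          30, 20, 50, 40, 20, 30, 40, 40, 50, 50]


def line_plot_rows(values):
    """Return text rows: stacked X marks per distinct value, then the axis."""
    counts = Counter(values)
    points = sorted(counts)
    height = max(counts.values())
    rows = []
    for level in range(height, 0, -1):
        # Draw a mark where the value occurs at least `level` times.
        rows.append("".join(" X  " if counts[p] >= level else "    " for p in points))
    rows.append("".join(f"{p:^4}" for p in points))
    return rows


for row in line_plot_rows(scores):
    print(row)
```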
2.c. Case: Measure of Central Tendency/Location
After Romanov presented the summarized data to his manager at Credit One, he was asked to
produce the various measures of Central Tendency of the Credit Card data.
Now, Romanov, being unaware of the term "central tendency", again approached you and asked
for your help in calculating the central tendency of the data in question. Help Romanov in carrying
out his task.
2.d. Measure of Central Tendency/Location
There are a number of different quantities, which can be used to estimate the central point of a
sample.
These are:
Mean
Median
Mode
2.d. Measure of Central Tendency/Location - Mean
By far the most common measure for describing the location of a set of data is the mean.
For a set of observations denoted by $x_1, x_2, \ldots, x_n$, the mean is defined by
$\bar{x} = (x_1 + x_2 + \cdots + x_n)/n$ (also written <x> or x-bar).
For a frequency distribution with values $x_1, x_2, \ldots, x_n$ and corresponding frequencies $f_1, f_2, \ldots, f_n$,
it is defined as
$\bar{x} = (f_1 x_1 + f_2 x_2 + \cdots + f_n x_n)/(f_1 + f_2 + \cdots + f_n)$.
Illustration 2.4: Calculating mean for sample of 3000 individuals having credit cards.
1. Using Excel function for granular data
2. Using Excel function for frequency distribution table
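Both forms of the mean can be checked against each other in Python. This is a sketch using the frequency table given in these slides for the sample of 3000 individuals; the function name `mean_from_frequencies` is illustrative, and the granular list is rebuilt from the table.

```python
import statistics

# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}


def mean_from_frequencies(table):
    """Weighted mean: sum of f * x divided by the sum of f."""
    total = sum(table.values())
    return sum(value * count for value, count in table.items()) / total


mean_cards = mean_from_frequencies(frequency_table)

# The same mean from granular data (one entry per customer).
cards = [value for value, count in frequency_table.items() for _ in range(count)]
granular_mean = statistics.mean(cards)

print(mean_cards, granular_mean)
```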
2.d. Measure of Central Tendency/Location - Median
The median is another useful measure of location.
The median is a value that splits the data set into two equal halves,
so that half the observations are less than the median and half are greater than the median.
If n is odd, the median is the (n + 1)/2th observation of the sorted data. If n is even, it is the midpoint
of the middle two observations, i.e. of the (n/2)th and (n/2 + 1)th observations.
One of the potential advantages of the median for certain data sets is that it is robust or resistant
to the effects of extreme observations.
Illustration 2.5: Calculating median for sample of 3000 individuals having credit cards along with
demonstration of extreme observations.
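The robustness of the median can be seen by distorting one observation. In this sketch the data is rebuilt from the frequency table in these slides, and the extreme value (one customer recorded with 1000 cards) is a hypothetical error introduced only for the demonstration.

```python
import statistics

# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

# Sorted granular data, one entry per customer.
cards = [value for value, count in sorted(frequency_table.items())
         for _ in range(count)]

median_cards = statistics.median(cards)
mean_cards = statistics.mean(cards)

# Hypothetical extreme observation: the last 10-card customer becomes 1000 cards.
cards_with_extreme = cards[:-1] + [1000]
median_with_extreme = statistics.median(cards_with_extreme)
mean_with_extreme = statistics.mean(cards_with_extreme)

print("median:", median_cards, "->", median_with_extreme)
print("mean:  ", mean_cards, "->", mean_with_extreme)
```

The median stays where it was while the mean shifts upward, which is the resistance to extreme observations described above.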
2.d. Measure of Central Tendency/Location - Median
1. Using Excel function for granular data
2. For summarized data in form of frequency table
[Screenshots: Excel worksheets showing the Median # Cards for each approach]
2.d. Measure of Central Tendency/Location - Mode
A third measure of location is the mode.
It is defined as the value which occurs with the greatest frequency, i.e. the most typical value.
Illustration 2.6: Finding the mode for sample of 3000 individuals having credit cards.
Excel has an inbuilt function, Mode, for granular data.
For summarized data it can be found easily by visual inspection.
Tabular representation
Number of Credit Cards    # Customers
1                         150
2                         300
3                         450
4                         660
5                         540
6                         300
7                         240
8                         150
9                         120
10                        90
Mode = 4, i.e. the highest number of individuals have 4 cards.
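The visual inspection amounts to picking the value with the largest frequency. A short sketch, using the table above (variable names are illustrative):

```python
# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

# The mode is the value whose frequency is the largest.
mode_cards = max(frequency_table, key=frequency_table.get)

print("Mode:", mode_cards, "with", frequency_table[mode_cards], "customers")
```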
2.d. Case: Measure of Spread
After Romanov presented the summarized data along with "measures of Central tendency" to
his manager at Credit One, he was further asked to add the various measures of spread to the
report.
Now, Romanov being unaware of the term "measures of spread" again approached you and
asked for your help. Help Romanov in carrying out his task.
2.d. Measure of Spread
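The central tendency of a data set is usually the main feature of interest, but the spread of the data also matters,
meaning how widely spread the data are about the mean (or other measure of location).
The measures of spread covered here are:
Variance and Standard Deviation
The Range
Inter quartile Range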
2.d. Measure of Spread - Variance and Standard Deviation
The most commonly used measure of spread is the standard deviation.
Essentially it is a measure of how far on average the observations are from the mean.
For a data set having values $x_1, x_2, \ldots, x_n$ (or $x_i$ where $i = 1, 2, \ldots, n$) and mean $\bar{x}$ (also written <x>), variance is
calculated as
For granular data: Variance $\sigma^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / n$
For summarized frequency table: Variance $\sigma^2 = \sum_i f_i (x_i - \bar{x})^2 / n$, where $n = \sum_i f_i$ is the total number of observations
Standard deviation is the positive square root of variance, denoted by $\sigma$.
For a sample, variance is calculated as
Variance $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$
Dividing by $(n - 1)$ makes the sample variance an unbiased estimator of the population variance.
We will look into the details of this in a later part of the course.
Illustration 2.7: Calculating variance and standard deviation for sample of 3000 individuals having
credit cards
Exercise: Do the algebra to make sure that the above mentioned formulae of variance are
equivalent.
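As a numerical companion to the exercise, the sketch below computes the variance from the frequency table given in these slides and compares it with the granular calculation from Python's `statistics` module. It is a check on one data set, not a substitute for the algebra; the variable names are illustrative.

```python
import math
import statistics

# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

n = sum(frequency_table.values())
mean_cards = sum(x * f for x, f in frequency_table.items()) / n

# Sum of f * (x - mean)^2 over the frequency table.
sum_of_squares = sum(f * (x - mean_cards) ** 2 for x, f in frequency_table.items())

population_variance = sum_of_squares / n        # divide by n
sample_variance = sum_of_squares / (n - 1)      # divide by n - 1
population_sd = math.sqrt(population_variance)

# Granular data gives the same population variance.
cards = [x for x, f in frequency_table.items() for _ in range(f)]
granular_variance = statistics.pvariance(cards)

print(population_variance, sample_variance, population_sd)
```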
2.d. Measure of Spread - Variance and Standard Deviation (Using MS Excel)
[Screenshots: Excel steps 1 and 2 for calculating variance and standard deviation]
2.d. Measure of Spread - Range
The range is a very simple measure of spread defined, as its name suggests, by the difference
between the largest and smallest observations in the data set.
It is a poor measure of the spread of the data, as it relies only on the extreme values.
Illustration 2.8: Calculating Range for sample of 3000 individuals having credit cards
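For the credit card sample, the range follows directly from the smallest and largest values that actually occur. A short sketch using the frequency table from these slides:

```python
# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}

# Only values observed at least once count towards the range.
observed_values = [x for x, f in frequency_table.items() if f > 0]
data_range = max(observed_values) - min(observed_values)

print("Range:", data_range)
```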
2.d. Measure of Spread - Inter quartile Range
Similar to Range but is not affected by the data extremes.
Just as the median divides a set of data into two halves, the quartiles divide a set of data into four
quarters. They are denoted by Q1, Q2 and Q3.
Q2 is just the median, while Q1 is called the lower quartile and Q3 the upper quartile.
Q1 can be defined to be the (n + 2) / 4th observation counting from below and Q3 as the same counting
from above, with relevant interpolation if needed.
The Inter quartile range is defined as Q3 - Q1.
Illustration 2.9: Calculating Inter quartile Range for sample of 3000 individuals having credit cards
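The quartile definition above can be sketched as follows, using the (n + 2)/4th observation from below for Q1 and from above for Q3, with linear interpolation for fractional positions. The helper names are illustrative, the data is rebuilt from the frequency table in these slides, and the sketch assumes at least two observations. Other quartile conventions exist and may give slightly different values.

```python
# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}
cards = [x for x, f in frequency_table.items() for _ in range(f)]


def observation_at(ordered_values, position):
    """Value at a 1-based, possibly fractional position, interpolated linearly."""
    whole = int(position)
    fraction = position - whole
    current = ordered_values[whole - 1]
    if fraction == 0:
        return current
    following = ordered_values[whole]
    return current + fraction * (following - current)


def quartiles(values):
    """Q1 and Q3 as the (n + 2)/4th observation from below and from above."""
    ordered = sorted(values)
    position = (len(ordered) + 2) / 4
    q1 = observation_at(ordered, position)
    q3 = observation_at(ordered[::-1], position)
    return q1, q3


q1, q3 = quartiles(cards)
iqr = q3 - q1
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```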
2.d. Case: Symmetry and skewness of data
Romanov was appreciated after he presented the summarized data along with "measures of
Central tendency" and "measure of spread" to his manager at Credit One. But he was further
asked to create an illustration around symmetry and skewness of data and, following that, to carry out
the analysis of the credit card data.
Now, Romanov, being unaware of the term "symmetry and skewness", again approached you
and asked for your help. In return he promised to gift you a bottle of Champagne. Help
Romanov in carrying out his task.
2.d. Symmetry and skewness
It deals with the shape of the distribution of a data set, that is, whether it is symmetric or skewed
to one side or the other.
Illustration 2.10: Calculating mean, median, mode and variance for symmetric and skewed data.
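One way to examine the shape of the credit card data is to compare the mean with the median and mode, and to compute a moment-based skewness coefficient. The coefficient used here (third central moment divided by the cube of the standard deviation) is a common choice that the slides do not specify, so treat it as an assumption of this sketch.

```python
import statistics

# Frequency table from the slides: number of cards -> number of customers.
frequency_table = {1: 150, 2: 300, 3: 450, 4: 660, 5: 540,
                   6: 300, 7: 240, 8: 150, 9: 120, 10: 90}
cards = [x for x, f in frequency_table.items() for _ in range(f)]

mean_cards = statistics.mean(cards)
median_cards = statistics.median(cards)
mode_cards = statistics.mode(cards)

# Moment coefficient of skewness: zero for symmetric data,
# positive when the longer tail is on the right.
n = len(cards)
sd = statistics.pstdev(cards)
skewness = sum((x - mean_cards) ** 3 for x in cards) / (n * sd ** 3)

print("mean:", mean_cards, "median:", median_cards, "mode:", mode_cards)
print("skewness:", round(skewness, 3))
```

A mean above the median together with a positive coefficient suggests that this sample is skewed to the right.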
2.d. Symmetry and skewness
Romanov was appreciated for his work. As a next step, his manager asked him to put a data
collection and management framework in place.
2.d. Comments: Data Collection and Management Framework
At a high level, from an analyst's perspective, a data collection and management framework
involves the following components:
Data collection mechanism
Maintaining a data dictionary
Missing value imputation
Outlier treatment
2.e. Data Collection - quick background
2.f. Data Dictionary
A comprehensive data dictionary should be maintained and updated as and when any new information is gathered.
USE: It can go a long way in helping us understand the data better. For instance, it can help us to revisit old information and see what our initial
hypothesis was and how it is changing with the new updated information.
Things To Include In The Data Dictionary:
Meaning of all Potential Predictors:
Maintain labels of as many variables as possible
If possible, one should also try to capture the business sense of these variables
Wherever things are not clear, it should be noted down so that it can be clarified with the client later on
Clear Definition of Unique Identifier and its Meaning:
Ascertain the level at which data is to be rolled up / down. For instance,
Individual level
Individual x Account level
Individual x Month level
Individual x Account x Month level, etc.
Identify unique key of every dataset. Few examples below:
Payment data may be at transaction level
Demographic data at individual level
Census data at zip code level
Dependent Variable Definition and Meaning: This is a very crucial step in a modeling exercise, as a wrong definition can lead to completely
wrong conclusions. In the absence of a clear definition at this stage, it may be defined later after some actual data analysis.
Variable Classification: If not already given, one should always try and classify the variables like
Demographic variables, e.g. age, gender
Performance variables, e.g. spend, number of transactions
Credit Attributes, e.g. total credit line, FICO score
Census level, e.g. population, location attributes such as income levels
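Identifying the unique key of a dataset can be checked mechanically: a candidate key is valid at a given level only if no combination of its fields repeats. The sketch below uses hypothetical payment records and a hypothetical helper, `duplicate_keys`; the field names are not from the source.

```python
from collections import Counter

# Hypothetical records at Individual x Account x Month level.
records = [
    {"individual": "A1", "account": "X", "month": "2024-01", "spend": 120.0},
    {"individual": "A1", "account": "X", "month": "2024-02", "spend": 80.0},
    {"individual": "A1", "account": "Y", "month": "2024-01", "spend": 45.5},
    {"individual": "B2", "account": "X", "month": "2024-01", "spend": 300.0},
]


def duplicate_keys(rows, key_fields):
    """Return the key combinations that occur more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# The full combination is unique, so it can serve as the key.
print(duplicate_keys(records, ("individual", "account", "month")))
# Individual alone repeats, so the data is not at individual level.
print(duplicate_keys(records, ("individual",)))
```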
2.g. Missing Value Imputation
There are a variety of techniques for missing value imputation; but these should be considered more
as scenario-specific than just being a set of pure alternative choices.
Missing Value Imputation Techniques
A. Impute Missing Values with ZERO
B. Impute Missing Values with MEDIAN
C. Impute Missing Values with MEAN
D. Impute Missing Values with MODE
E. Information based Segmentation
F. Non-Missing Dummy Creation
G. Imputation and Non-Missing Dummy Creation
H. Impute based on Bivariate Graphs
I. Impute using Regression on other Non-Missing Predictors
J. DNI
K. Multiple Imputation
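A few of the simpler techniques in this list (A to D, plus the non-missing dummy of F and G) can be sketched as below. The column values are hypothetical, missing entries are represented as `None`, and the function names are illustrative. Which technique is appropriate remains scenario-specific, as noted above.

```python
import statistics

# Hypothetical column with missing values represented as None.
ages = [23, None, 35, 41, None, 35, 52]


def impute(values, strategy):
    """Fill missing entries with zero, or the mean, median or mode of the observed ones."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def non_missing_dummy(values):
    """Indicator column: 1 where the value was present, 0 where it was missing."""
    return [0 if v is None else 1 for v in values]


print(impute(ages, "median"))
print(non_missing_dummy(ages))
```

Keeping the dummy alongside the imputed column preserves the information that a value was originally missing.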
2.h. Outlier Treatment
An outlier is a single observation "far away" from the rest of the data.
Reasons for outliers:
Errors
Data errors
Sampling error
Standardization failure
Faulty distributional assumptions
Human Error
Genuine Outliers
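The slides do not define how far is "far away". One widely used rule of thumb, swapped in here purely as an assumption, flags values lying more than 1.5 times the interquartile range beyond the quartiles. The data below is hypothetical, and the quartiles come from Python's `statistics.quantiles`, whose default convention differs from the (n + 2)/4 definition given earlier.

```python
import statistics

# Hypothetical observations with one value far from the rest.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]


def iqr_outliers(data, multiplier=1.5):
    """Flag values beyond the quartiles by more than multiplier * IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = q3 - q1
    lower_fence = q1 - multiplier * spread
    upper_fence = q3 + multiplier * spread
    return [x for x in data if x < lower_fence or x > upper_fence]


outliers = iqr_outliers(values)
print(outliers)
```

Whether a flagged value is an error or a genuine outlier still has to be decided from the reasons listed above.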