
LECTURE 2: NOTES

BASICS OF DATA ANALYTICS | 21 January 2017

The sample space is the set of all possible outcomes of an experiment. Any subset of
the sample space is an event, and a probability can be assigned to each event. When
all outcomes are equiprobable, the axioms of probability give the probability of an
event as the number of outcomes in the event divided by the total number of outcomes.
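
As a minimal sketch of the equiprobable case, the snippet below counts outcomes for two fair dice. The dice experiment, the helper name `probability` and the event `sum_is_seven` are illustrative assumptions, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Assumed example experiment: rolling two fair six-sided dice.
# Every ordered pair (first die, second die) is one equally likely outcome.
sample_space = set(product(range(1, 7), repeat=2))


def probability(event, space):
    """P(event) = |event| / |space|; valid only when all outcomes are equiprobable."""
    return Fraction(len(event & space), len(space))


# An event is any subset of the sample space: here, "the two dice sum to 7".
sum_is_seven = {outcome for outcome in sample_space if sum(outcome) == 7}

p_seven = probability(sum_is_seven, sample_space)
print(p_seven)  # 1/6
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand calculation.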

Partition:
A set of mutually exclusive and collectively exhaustive events is called a
partition. (See Figure 1)
A1 ∪ A2 ∪ A3 ∪ A4 = S
Ai ∩ Aj = φ, for i, j ∈ {1, 2, 3, 4} and i ≠ j
where S is the sample space.
Here A1, A2, A3 and A4 together form a partition of S.
Likewise, any event A and its complement Ᾱ constitute a partition.
Figure 1: Partitions (S divided into A1, A2, A3, A4)
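
The two conditions for a partition can be checked mechanically with sets. This is a sketch only: the function name `is_partition` and the single-die sample space are assumptions chosen for illustration.

```python
from itertools import combinations


def is_partition(events, space):
    """True if the events are pairwise disjoint and their union equals the space."""
    mutually_exclusive = all(a.isdisjoint(b) for a, b in combinations(events, 2))
    collectively_exhaustive = set().union(*events) == space
    return mutually_exclusive and collectively_exhaustive


# Assumed sample space: one roll of a six-sided die.
S = set(range(1, 7))

# Four events that do not overlap and together cover S.
A1, A2, A3, A4 = {1}, {2, 3}, {4}, {5, 6}
four_way = is_partition([A1, A2, A3, A4], S)

# An event and its complement always form a partition.
A = {2, 4, 6}
complement_pair = is_partition([A, S - A], S)

# Overlapping events (2 appears twice) fail the mutual-exclusivity condition.
overlapping = is_partition([{1, 2}, {2, 3}, {4, 5, 6}], S)

print(four_way, complement_pair, overlapping)  # True True False
```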

Conditional Probability:

The probability that event A occurs given that event B has certainly occurred.
(See Figure 2)

P(A|B) = P(A ∩ B)/P(B)

for P(B) > 0.
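
A small worked instance of the definition, assuming one roll of a fair die. The events `A` (even number) and `B` (greater than 3) and the helper names are hypothetical.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
S = set(range(1, 7))


def prob(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event & S), len(S))


def conditional(a, b):
    """P(a | b) = P(a ∩ b) / P(b), defined only for P(b) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b) / prob(b)


A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

p_a_given_b = conditional(A, B)
print(prob(A), p_a_given_b)  # 1/2 2/3
```

Knowing that B occurred shrinks the sample space to {4, 5, 6}, of which two outcomes are even, so the probability of A rises from 1/2 to 2/3.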

Figure 2: Conditional Probability (events A and B, with their overlap A ∩ B)

Independent Events:
If the outcome of one event doesn't affect the outcome of the other, the events are
called independent events.

For independent events A and B,

P(A|B) = P(A), as the occurrence of B doesn't affect the occurrence of A.
From conditional probability, we know:
P(A|B) = P(A ∩ B)/P(B).
So, P(A ∩ B) = P(A|B)*P(B) = P(A)*P(B)

=> P(A ∩ B) = P(A)*P(B), which is the condition for events A and B to be
independent.
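
The product condition can be tested directly by counting. The sketch below assumes two fair dice; the events and the helper `are_independent` are illustrative names, not from the lecture.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: ordered outcomes of two fair six-sided dice.
S = set(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event & S), len(S))


def are_independent(a, b):
    """Check the condition P(A ∩ B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)


first_is_even = {o for o in S if o[0] % 2 == 0}
second_is_six = {o for o in S if o[1] == 6}
sum_at_least_ten = {o for o in S if sum(o) >= 10}

# The two dice do not influence each other, so these events are independent.
dice_independent = are_independent(first_is_even, second_is_six)

# The sum depends on the first die, so these events are not independent.
sum_independent = are_independent(first_is_even, sum_at_least_ten)

print(dice_independent, sum_independent)  # True False
```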

Mutually Exclusive Events:

Two events A and B are mutually exclusive if A ∩ B = φ, that is, P(A ∩ B) = 0.
Figure 3: Mutually Exclusive Events


While collecting data, we need to make sure that the data is collected from an
unbiased sample, that all the fields needed for the required analysis are covered,
and that a sufficiently large number of data points is collected.

A larger data set => a smaller error in the estimates.

What to do with the collected Data?

Step 1: Understand the Data:

It is important to understand what the data represents. A domain expert will help us
understand the data better, which helps ensure that no incorrect inferences are drawn.
Figure 4

Example: The height of waves in the Arabian Sea and the population of Portugal
both show similar patterns. However, the two are completely unrelated. Understanding
the data helps avoid the error of inferring a relationship between two independent
data sets.

Example: Data collected to predict the number of cars in Bangalore at the end of the
year needs to contain not only the cars added each year over the past ten years, but
also the change in infrastructure and the change in the demographics of the city over
the same period. Understanding the problem at hand and collecting comprehensive data
for it will help increase the accuracy of the inferences drawn from the data.

Step 2: Make a scatter plot and identify the pattern:

A scatter plot helps us get a feel for the data by revealing the pattern in it.

Example: Number of cars sold vs. salary, when plotted as a scatter plot, would
typically show a linear pattern. (See Figure 6)
Figure 6: Linear Trend in Scatter Plot

Example: Cool drink sales vs. month (Jan to Dec), plotted on a scatter plot, will
roughly show the pattern in Figure 5. This is called seasonality.
Figure 5: Seasonality in Scatter Plot (cool drink sales vs. months)

When data shows seasonality over a longer time period (in years), it is termed
cyclicity. Cyclicity is observed mostly in astronomical data. Example: sunspot
activity, which peaks roughly every 11 years.


Outliers and the King Kong Effect:

Say we collect data on the height vs. weight of monkeys, as shown in Figure 7. We can
see that all the data points (except one outlier) are scattered in a cluster, which
shows that the heights and weights of the monkeys lie in a certain interval. From this
we can compute the average weight and height of the monkeys in the population
surveyed. However, when the height and weight of King Kong are added to the same data
set, even though King Kong is a monkey, the point lies way outside the cluster and
appears as an outlier that can distort the averages of the data.
Figure 7: King-kong Effect (Height of Monkey vs. Weight of Monkey, with one outlier)

Outliers often originate from errors in measurement (human or instrument error).
However, in the above data, the outlier is a perfectly valid data point and should not
be ignored.

The help of a domain expert should be taken to validate the data, and the existence
of outliers should be studied and understood. An informed decision can then be taken
to either ignore or accept the data point. Often, a valid outlier is analysed
separately and may be left out when estimating certain population parameters.
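
To see how strongly one valid but extreme point can pull an average, the sketch below compares the mean and the median with and without it. The weights are made-up numbers for illustration only, and the median is used here as a contrast; the lecture itself does not prescribe it.

```python
import statistics

# Hypothetical weights of ordinary monkeys (arbitrary units, made up for illustration).
weights = [4, 5, 5, 6, 6, 7, 5, 6]

# A single valid but extreme observation: the "King Kong" data point.
king_kong_weight = 500
weights_with_outlier = weights + [king_kong_weight]

mean_before = statistics.mean(weights)
mean_after = statistics.mean(weights_with_outlier)
median_before = statistics.median(weights)
median_after = statistics.median(weights_with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

The mean jumps by an order of magnitude while the median barely moves, which is why an outlier is often analysed separately before population averages are reported.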

In a nutshell, we should understand the data collected, make a scatter plot and
identify the pattern (trend or seasonality) in the data, carefully analyse the
outliers, and carry out the required analysis without deviating from the goal.

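Correlation Coefficient:

Say x ∈ X and y ∈ Y are two sets of data.

Common Formulae:

$$\text{Mean} = \mu = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}$$

Covariance of X and Y is defined as:

$$\operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$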

Variance tells us how much the data is scattered about the mean. Squaring removes the
sign so that the positive and negative deviations don't cancel each other out to give
an unrealistically small value. Squaring also gives more weight to larger deviations,
thus highlighting them. Mean absolute deviation also removes the sign; however, it
gives equal weight to each deviation.
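
The following sketch contrasts the raw deviations (which cancel), the sample variance with the n − 1 denominator used in these notes, and the mean absolute deviation. The data values and function names are assumptions for illustration.

```python
import statistics

# Made-up data set for illustration; its mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / (len(values) - 1)


def mean_absolute_deviation(values):
    """Average of |x - mean|; every deviation gets equal weight."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)


mu = sum(data) / len(data)
raw_deviation_sum = sum(x - mu for x in data)  # positives and negatives cancel

variance = sample_variance(data)
mad = mean_absolute_deviation(data)

print(f"sum of raw deviations: {raw_deviation_sum}")
print(f"sample variance:       {variance:.3f}")
print(f"mean abs. deviation:   {mad:.3f}")
```

The largest deviation (9 − 5 = 4) contributes 16 of the 32 units of squared deviation but only 4 of the 12 units of absolute deviation, which shows the extra weight squaring gives to large deviations.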


Covariance measures the relationship between two data sets X and Y, but its value
depends on their units. To make the quantity dimensionless, we define the correlation
coefficient as:

$$\text{correlation coefficient} = r = \frac{\operatorname{cov}(x, y)}{\sigma_x \cdot \sigma_y}, \quad \text{where } -1 \le r \le 1$$
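
A minimal sketch of the computation, built from the covariance and standard deviations with the n − 1 denominator. The function names and the sample data are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def std_dev(values):
    """Sample standard deviation: the square root of the variance."""
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """r = cov(x, y) / (sigma_x * sigma_y); dimensionless, between -1 and 1."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y_increasing = [2, 4, 6, 8, 10]   # exact increasing line
y_decreasing = [10, 8, 6, 4, 2]   # exact decreasing line
y_scaled = [v * 1000 for v in y_increasing]  # same data in different units

r_up = correlation(x, y_increasing)
r_down = correlation(x, y_decreasing)
r_scaled = correlation(x, y_scaled)

print(round(r_up, 6), round(r_down, 6), round(r_scaled, 6))
```

Rescaling y multiplies the covariance by 1000 but leaves r unchanged, which is the point of dividing by the two standard deviations.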

If r = 0 (or close to 0)*, there is no linear relationship between X and Y.

If r = 1 or r = -1 (or close to 1 or -1)*, there exists a linear trend (increasing or
decreasing, respectively) between X and Y.

*Typically, if $r > \frac{1.96}{\sqrt{n}}$ or $r < \frac{-1.96}{\sqrt{n}}$, we can say
with higher confidence that a linear relationship exists. This implies that the larger
the data set collected, the higher the confidence in the observed trend.
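
The rule of thumb above can be written as a single check on |r|. This is a sketch under that rule only; the function name and the example values of r and n are assumptions.

```python
import math


def exceeds_threshold(r, n):
    """True if r > 1.96/sqrt(n) or r < -1.96/sqrt(n), i.e. |r| > 1.96/sqrt(n)."""
    threshold = 1.96 / math.sqrt(n)
    return abs(r) > threshold


# The same observed r = 0.3 with two hypothetical sample sizes.
small_sample = exceeds_threshold(0.3, 25)    # threshold = 1.96 / 5  = 0.392
large_sample = exceeds_threshold(0.3, 100)   # threshold = 1.96 / 10 = 0.196

print(small_sample, large_sample)  # False True
```

The threshold shrinks as n grows, so a modest correlation that is inconclusive with 25 points clears the bar with 100 points.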

Figure 8: 95% of the area under the normal distribution lies within 1.96 standard
deviations of the mean.
