
TYPES OF VARIABLES.

1. Numerical
   Take numerical values, and arithmetic operations on them make sense.
   1. Continuous
      Can take any value, e.g. height. (Though if we round off our height it
      might seem discrete, it's not.)
   2. Discrete
      One of a specific set of values, e.g. the number of cars a household has.
2. Categorical
   Take on a distinct category; the levels can be numerical, but arithmetic on
   them makes no sense.
   1. Ordinal
      Ordered levels (customer service review: 1, 2, 3, 4, 5).
   2. Regular
      No order (are you a morning person or an afternoon person?).
When two variables show some connection with one another, they are called
associated variables.
There are two types of association:
1. Positive
2. Negative
First, always find out what type of variable you are dealing with.

STUDIES.
1. Observational
   Collect data in a way that does not interfere with how the data arise.
   Can only establish an association.
   1.1. Retrospective: uses data from the past.
   1.2. Prospective: data are collected throughout the study.
2. Experiment
   Randomly assign subjects to treatments.
   Can establish causal connections.
Extraneous variables that affect both the explanatory and the response variable,
and that make it seem like there is a relationship between them, are called
CONFOUNDING VARIABLES.
CORRELATION DOES NOT IMPLY CAUSATION.

SAMPLING AND SOURCES OF BIAS.


Cons of a census:
1. It takes a lot of resources.
2. Some individuals may be hard to locate or measure, and these people may be
   different from the rest of the population.
3. Populations rarely stand still; they change constantly.
To taste soup, you take a spoonful; deciding that the spoonful is not salty
enough is exploratory analysis, and concluding that the whole pot needs more
salt is inference.

Types of biases:
1. Convenience sample
   People who are easily available are used in the study.
2. Non-response
   Only a few NON-RANDOM people from the randomly sampled people respond, so
   the result is not representative.
3. Voluntary response
   Contains only people who volunteer to respond. These tend to be people with
   strong opinions, and thus the sample is also not representative.

Difference between voluntary response and non-response: in voluntary response
the sampling is not random, while in non-response the sampling is random.

SAMPLING METHODS
1. Simple Random Sampling
   Randomly select cases from the population, where each case is equally
   likely to be selected; like drawing names from a hat.
2. Stratified Sampling
   Divide the population into homogeneous strata, then randomly sample from
   within each stratum.
3. Cluster Sampling
   Divide the population into clusters, randomly sample a few of the clusters,
   and then randomly sample from within these clusters. Unlike strata, the
   clusters may not be homogeneous, but each cluster is similar to the others,
   so that we can get away with sampling from just a few clusters.
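A minimal sketch of the three methods in Python (numpy and pandas assumed
available); the population table and its stratum/cluster columns are
hypothetical, made up just to show the mechanics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical population of 1000 cases with a stratum and a cluster label.
population = pd.DataFrame({
    "id": range(1000),
    "stratum": rng.choice(["urban", "suburban", "rural"], size=1000),
    "cluster": rng.integers(0, 20, size=1000),  # e.g. 20 city blocks
})

# 1. Simple random sampling: every case equally likely to be selected.
srs = population.sample(n=100, random_state=1)

# 2. Stratified sampling: randomly sample within each homogeneous stratum.
stratified = population.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(n=30, random_state=1)
)

# 3. Cluster sampling: pick a few clusters, then sample within those clusters.
chosen = rng.choice(population["cluster"].unique(), size=4, replace=False)
cluster_sample = (population[population["cluster"].isin(chosen)]
                  .groupby("cluster", group_keys=False)
                  .apply(lambda g: g.sample(n=15, random_state=1)))
```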

EXPERIMENTAL DESIGN.
1. Control
   Compare the treatment of interest to a control group.
2. Randomize
   Randomly assign subjects to treatments.
3. Replicate
   Collect a sufficiently large sample, or replicate the entire study.
4. Block
   Block for variables known or suspected to affect the outcome.
Difference between explanatory variables and blocking variables:
Explanatory variables (factors) are conditions that we can impose on
experimental units.
Blocking variables are characteristics that the experimental units come with
and that we would like to control for.
Blocking is like stratifying:
blocking is done during random assignment;
stratifying is done during random sampling.
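A minimal sketch of blocked random assignment in Python; the subject IDs and
the smoker / non-smoker blocking variable are hypothetical, chosen only to
illustrate randomizing within each block:

```python
import random

# Hypothetical subjects, each carrying a blocking characteristic.
subjects = [(f"s{i:02d}", "smoker" if i % 3 == 0 else "non-smoker")
            for i in range(12)]

random.seed(7)
assignment = {}
for block in ("smoker", "non-smoker"):
    members = [sid for sid, b in subjects if b == block]
    random.shuffle(members)        # randomize within the block
    half = len(members) // 2
    for sid in members[:half]:     # half of each block to treatment...
        assignment[sid] = "treatment"
    for sid in members[half:]:     # ...and half to control
        assignment[sid] = "control"

print(assignment)
```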

A few new terms:
1. Placebo
   Fake treatment, often used as the control group in medical studies.
2. Placebo Effect
   Showing change despite being on the placebo.
3. Blinding
   Experimental units don't know which group they're in.
4. Double-Blind
   Neither the experimental units nor the researchers know the group
   assignments.

VISUALISING NUMERICAL DATA


1. Scatter Plot
   The explanatory variable usually goes on the x-axis and the response on
   the y-axis.
   Things to bear in mind when evaluating the relationship between two
   variables:
   1.1. Direction
        Positive or negative.
   1.2. Shape
        Linear or some other form.
   1.3. Strength
        Strong (indicated by little scatter) or weak (indicated by lots of
        scatter).
   1.4. Any potential outliers
        Investigate these points to make sure they are not data entry errors.

A naïve approach would be to ignore (exclude) the outliers, but sometimes these
outliers can be very interesting cases, so handling them with careful
consideration of the research question and other associated variables is
important.
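A minimal scatter-plot sketch (matplotlib assumed available); the
linear-with-noise data are made up purely to illustrate the
direction/shape/strength checklist above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
explanatory = rng.uniform(0, 10, size=80)
response = 2.5 * explanatory + rng.normal(0, 3, size=80)  # positive, linear

plt.scatter(explanatory, response, alpha=0.6)
plt.xlabel("explanatory variable (x-axis)")
plt.ylabel("response variable (y-axis)")
plt.title("Direction: positive; shape: linear; strength: fairly strong")
plt.show()
```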
2. Histograms
   2.1. Provide a view of the data density.
   2.2. Help identify the shape of the distribution.
The width of the bins in a histogram can alter the story that the histogram
conveys.
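A quick sketch of that bin-width effect: the same made-up right-skewed sample
drawn with wide and with narrow bins (matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=500)  # made-up right-skewed sample

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(data, bins=5)    # wide bins: detail is smoothed away
axes[0].set_title("5 bins")
axes[1].hist(data, bins=50)   # narrow bins: density and shape emerge
axes[1].set_title("50 bins")
plt.show()
```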
3. Dot Plot
4. Box Plot
5. Intensity Map

MEASURES OF CENTER:
1. Mean
   Arithmetic average.
2. Median
   50th percentile.
3. Mode
   Most frequent observation.
If these measures are calculated from a sample, they are known as sample
statistics.
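A minimal sketch of the three measures using Python's standard library; the
sample values are made up:

```python
import statistics

sample = [2, 3, 3, 4, 5, 5, 5, 8, 13]   # hypothetical sample

print(statistics.mean(sample))    # arithmetic average: 5.33...
print(statistics.median(sample))  # 50th percentile: 5
print(statistics.mode(sample))    # most frequent observation: 5
```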

MEASURES OF SPREAD
1. Range: max - min.
2. Variance: the average squared deviation from the mean.

   Why do we square the differences?
   To get rid of negatives, so that positive and negative deviations don't
   cancel each other out.
   Large deviations are also weighed more heavily than small deviations.
3. Standard Deviation: the square root of the variance.
4. Inter-Quartile Range: Q3 - Q1, the range of the middle 50% of the data.
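A minimal numpy sketch of all four measures on the same made-up sample; note
that numpy's var/std default to the population formulas (divide by n), so
ddof=1 is passed to get the sample versions (divide by n - 1):

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 5, 5, 8, 13])   # hypothetical sample

print(x.max() - x.min())          # range
print(x.var(ddof=1))              # sample variance
print(x.std(ddof=1))              # sample standard deviation
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                    # inter-quartile range
```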

ROBUST STATISTICS
Robust statistics are measures on which extreme observations have little
effect (e.g. the median and IQR, as opposed to the mean and standard
deviation).
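A quick demonstration of robustness on the same made-up sample: adding one
extreme observation shifts the mean noticeably while the median barely moves:

```python
import numpy as np

x = np.array([2, 3, 3, 4, 5, 5, 5, 8, 13])
x_out = np.append(x, 100)               # add one extreme observation

print(np.mean(x), np.mean(x_out))       # mean jumps: 5.33 -> 14.8
print(np.median(x), np.median(x_out))   # median hardly changes: 5 -> 5
```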

TRANSFORMING DATA
A transformation is a rescaling of the data using a function.
When data are very strongly skewed, we sometimes transform them so they are
easier to model.
Methods:
1. (Natural) Log Transformation (most common)
   Used to make the relationship between variables more linear, and hence
   easier to model with simple methods.
2. Other transformations, e.g. square root or inverse.

Goals of transformation:
1. To see the data structure differently.
2. To reduce skew, to assist modeling.
3. To straighten a non-linear relationship in a scatterplot.
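A minimal sketch of a natural-log transformation on a made-up right-skewed
sample: the log pulls in the long right tail, so the mean and median move
much closer together:

```python
import numpy as np

rng = np.random.default_rng(2)
skewed = rng.lognormal(mean=0, sigma=1, size=1000)  # strongly right-skewed
transformed = np.log(skewed)                        # roughly symmetric now

print(np.mean(skewed), np.median(skewed))            # mean >> median (skew)
print(np.mean(transformed), np.median(transformed))  # nearly equal
```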

EXPLORING CATEGORICAL VARIABLES


1. Frequency table and bar plot

2. Pie Chart
Less helpful than bar plots.
3. Contingency Table.
4. Relative frequencies.
5. Segmented bar plot.
6. Relative frequency segmented bar plot.
7. Mosaic plot.
8. Side-by-side box plots.
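A minimal pandas sketch (made-up data) covering the first few of these: a
frequency table, a contingency table, and row relative frequencies:

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "person_type": ["morning", "evening", "morning", "morning",
                    "evening", "evening", "morning", "evening"],
    "drinks_coffee": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
})

print(df["person_type"].value_counts())             # frequency table
table = pd.crosstab(df["person_type"], df["drinks_coffee"])
print(table)                                        # contingency table
print(table.div(table.sum(axis=1), axis=0))         # row relative frequencies
```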

INTRODUCTION TO INFERENCE

PROBABILITY AND DISTRIBUTIONS

Random Process
In a random process we know what outcomes could happen, but we don't know
which particular outcome will happen.
1. Frequentist interpretation
The probability of an outcome is the proportion of times the outcome
would occur if we observed the random process an infinite number of
times
2. Bayesian interpretation
A Bayesian interprets probability as a subjective degree of belief.
This interpretation was largely popularized by revolutionary advances in
computational technology and methods during the last twenty years.
Law of large numbers
The law of large numbers states that as more observations are collected, the
proportion of occurrences of a particular outcome converges to the
probability of that outcome.
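A minimal simulation of the law of large numbers: the running proportion of
heads in repeated fair-coin flips drifts toward the true probability, 0.5:

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)          # 0 = tails, 1 = heads
running_prop = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(n, running_prop[n - 1])                 # converges toward 0.5
```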

DISJOINT EVENTS & GENERAL ADDITION RULE


Disjoint or mutually exclusive events cannot happen at the same time.

1. Union of events
   For any two events A and B, P(A or B) = P(A) + P(B) - P(A & B)
   (the general addition rule).
   For disjoint events, P(A & B) = 0, so P(A or B) = P(A) + P(B).
   (See the sketch after this list.)
2. Sample space
A sample space is a collection of all possible outcomes of a trial.
3. Probability distribution
A probability distribution lists all possible outcomes in sample space and
the probabilities with which they occur.
3.1. The events must be disjoint
3.2. Each probability must be between 0 & 1
3.3. The probabilities must total 1.
4. Complementary events
   Complementary events are two mutually exclusive events whose probabilities
   add up to 1.
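A minimal check of the general addition rule by enumerating a fair die's
sample space: with A = "roll is even" and B = "roll is at least 4", the
events overlap, so P(A & B) must be subtracted once:

```python
outcomes = [1, 2, 3, 4, 5, 6]      # sample space of a fair die
A = {2, 4, 6}                      # roll is even
B = {4, 5, 6}                      # roll is at least 4

p = lambda event: len(event) / len(outcomes)
print(p(A | B))                    # P(A or B) directly: 4/6
print(p(A) + p(B) - p(A & B))      # 3/6 + 3/6 - 2/6 = 4/6, the same
```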

DISJOINT vs. COMPLEMENTARY


The probabilities of disjoint events do not necessarily add up to 1.
The probabilities of complementary events always add up to 1.
Therefore, complementary events are necessarily disjoint; however, the
converse is not true.

INDEPENDENT EVENTS
Two processes are said to be independent if knowing the outcome of one
provides no useful information about the outcome of the other.
Checking for independence:
If P(A|B) = P(A), then A and B are independent.
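Applying that check to the die example above: with A = "even" and B = "at
least 4", P(A|B) differs from P(A), so these two events are not independent:

```python
outcomes = [1, 2, 3, 4, 5, 6]
A = {2, 4, 6}                        # roll is even
B = {4, 5, 6}                        # roll is at least 4

p = lambda event: len(event) / len(outcomes)
p_A_given_B = len(A & B) / len(B)    # P(A|B) = 2/3
print(p_A_given_B, p(A))             # 0.666... vs 0.5 -> dependent
```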

DETERMINING DEPENDENCE BASED ON SAMPLE DATA

If the observed difference is large, there is stronger evidence that the
difference is real.
If the sample size is large, even a small difference can provide strong
evidence of a real difference.

RULE FOR INDEPENDENT EVENTS


If A and B are independent, then P(A & B) = P(A) × P(B).
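A quick simulation of the multiplication rule for two independent fair coin
flips: P(heads on first & heads on second) = 0.5 × 0.5 = 0.25:

```python
import numpy as np

rng = np.random.default_rng(4)
first = rng.integers(0, 2, size=100_000)    # independent flip 1
second = rng.integers(0, 2, size=100_000)   # independent flip 2

both_heads = np.mean((first == 1) & (second == 1))
print(both_heads)   # close to 0.5 * 0.5 = 0.25
```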
