
Statistics

as a process can be summarized in four steps:
1. Producing data from a sample from the population of interest
2. Exploratory Data Analysis (EDA)- summarizing data
3-4. Probability and inference- drawing conclusions about the whole population from the data collected.

3. Plot the intervals on the X-axis and the frequency on the Y-axis
Computing for number of observations in an interval:
1. Multiply total number of observations by relative frequency
2. Round up to the next whole number
Describing the pattern of a histogram:
o Shape
Keywords: Symmetry/skewness
Data- information about individuals organized into
variables
Individual- person or object
Variable- characteristic of an individual
Data set- collection of data identified within particular
circumstances

Variables can be:
Categorical- labels; mutually exclusive
Quantitative- numerical and measurable

Scales of Measurement/ Types of Variables:
Nominal- qualitative; discrete categories; presence or
absence of an attribute; no rankings
Ordinal- rank-orders, but ranks do not reflect equal differences
Peakedness (modality)
Interval- numerical and has measurable distances
among points, but does not have a meaningful zero
point (e.g. temperature, IQ scores)
Ratio- similar to the interval where value represents
an amount of the attribute, but there is an absolute
zero


Frequency Distributions
Used to summarize the distribution of a categorical variable
**skewed shapes can also be bimodal

Steps:
1. List the different categories the variable takes
2. Tally how many observations fall under each category
3. Convert into percentages by dividing the number of occurrences in each category by the total number of observations collected

Spread/Variability
- described by the approximate range covered by the data (min and max values give us the range)
Outliers
- observations that fall outside the overall pattern

Stemplots
Also graphically display the distribution of quantitative data

Pie and Bar Charts
Can be used to visualize the distribution of categorical data
Bar charts help with distribution; pie charts help with
visualizing relation to the whole
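
The frequency distribution steps above (list the categories, tally, convert to percentages) can be sketched in a few lines of Python. The function name `frequency_table` and the sample data are made up for illustration only.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(observations)   # steps 1 and 2: list categories and tally
    total = len(observations)        # the entire number of observations
    # step 3: convert each tally into a percentage of the total
    return {category: (count, 100 * count / total)
            for category, count in counts.items()}

# Hypothetical sample: favorite color of 8 individuals
colors = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]
table = frequency_table(colors)

for category, (count, percent) in table.items():
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

The percentages always add up to 100, which is a quick check that no observation was dropped during the tally.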

Histograms
Can be used to visualize distribution of a continuous
quantitative variable
Steps:
1. Arrange values and break into intervals
depending on what is appropriate for data (up to
you!)
2. Tally how many observations fall into each
interval
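
As a rough sketch of the tallying step for a histogram, the snippet below counts how many observations fall into equal-width intervals. The interval start, width, and number of intervals are choices left to you (as the notes say); the values used here and the function name `histogram_counts` are assumptions for illustration.

```python
def histogram_counts(values, start, width, n_bins):
    """Tally how many values fall into each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)   # which interval this value belongs to
        if 0 <= index < n_bins:             # ignore values outside the covered range
            counts[index] += 1
    return counts

# Hypothetical exam scores
scores = [52, 58, 61, 64, 67, 70, 71, 75, 78, 83, 88, 95]
counts = histogram_counts(scores, start=50, width=10, n_bins=5)

# Text version of the histogram: interval, then one mark per observation
for i, count in enumerate(counts):
    low = 50 + 10 * i
    print(f"{low}-{low + 9}: {'#' * count}")
```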
Steps:
1. Create intervals for your data
2. Sort the numbers from smallest to largest
3. Doing so by interval, write the stem (left-most number) on the left column and all the leaves on the corresponding right column
4. Rotate your graph 90 degrees counterclockwise
** you can also create a dotplot (display each observation as a dot instead of leaves)

Symmetric distributions with no outliers: the MEAN IS EQUAL TO THE MEDIAN.
Skewed-right distributions: the MEAN IS BIGGER THAN THE MEDIAN.
Skewed-left distributions: the MEAN IS LESS THAN THE MEDIAN.


Measures of Central Tendency

Tell us what a typical value is within the distribution



Mode

most frequently occurring value

How to obtain:
** Get the frequency of each unique value, the
one with the largest frequency is the mode
Mean
Average
How to obtain:
** sum all the values and divide by the total number of observations

Median
Midpoint of the distribution wherein half of the numbers fall above and the other half fall below
How to obtain:
** if n is odd, the median is the (n+1)/2 th value when the observations are arranged from smallest to largest
** if n is even, the median is the mean of the two center observations, which are in the n/2 and n/2+1 positions

Relationship between the Mean and Median
The mean is obtained from the actual values, whereas the median is obtained from the order.
The mean is sensitive to outliers, whereas the median is resistant.

Measures of Spread
Range
Distance between the largest data point (max) and the smallest data point (min)

Inter-Quartile Range
Measures the range covered by the MIDDLE 50% of the data

How to obtain:
o Arrange data from smallest to largest and find the MEDIAN (as the median divides the data into equal halves)
o Find the median of the LOWER 50% of the data. This is called Q1, as 25% of the data fall below it.
o Find the median of the TOP 50% of the data. This is called Q3, as three quarters (75%) of the data fall below it.
o The IQR is the distance between Q1 and Q3, hence IQR = Q3 - Q1.
The IQR is used as basis to classify extreme observations as outliers using the 1.5(IQR) criterion.
o An observation is a suspected outlier if it is
BELOW Q1 - 1.5(IQR)
ABOVE Q3 + 1.5(IQR)

6. Mark outliers with an asterisk.

Standard Deviation
Quantifies spread of data by measuring how far they are from their mean.
It first takes the distance of each value from the mean and squares each individual difference. It then averages the squared differences by dividing their sum by n-1 (n = total no. of observations). This value is the variance. When you take the square root, it is the SD.
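
The standard deviation recipe described above (squared distances from the mean, divided by n-1, then a square root) can be written out step by step. This is a minimal sketch with made-up data; the function name `sample_variance` is an assumption, and the result is compared against Python's built-in `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

def sample_variance(values):
    """Variance with the n-1 divisor, following the steps in the notes."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(v - mean) ** 2 for v in values]  # square each distance from the mean
    return sum(squared_diffs) / (n - 1)                # "average" using n-1

# Hypothetical data set
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
sd = math.sqrt(variance)   # the SD is the square root of the variance

print(f"variance = {variance:.4f}, SD = {sd:.4f}")
print("matches statistics.stdev:", math.isclose(sd, statistics.stdev(data)))
```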


The SD Rule or the Empirical Rule
If the distribution is symmetrical and bell-shaped (approximately normal), the following rules apply:
o Approximately 68% of the observations fall within 1 standard deviation of the mean.
o Approximately 95% of the observations fall within 2 standard deviations of the mean.
o Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.

What to do with outliers:
o KEEP if it is produced by the same process as the rest of the data and has the potential to occur again
o DISCARD if produced under different conditions or it is a mistake in the data

The Five Number Summary and Boxplots
The Five Number Summary consists of the min and max values, Q1, M (the median), and Q3 (Q1 and Q3 give the IQR).
These values provide a numerical summary of the center and spread of a distribution.

The BOXPLOT graphically displays the five number
summary and observations that were suspected as
outliers.
How to plot:
1. Draw a number line wherein the numbers are
of an equal distance from each other and can
contain all 5 numbers of the summary.
2. Draw a box around Q1 and Q3.
3. Draw a line in the box where the median is
located.
4. Obtain the IQR and the minimum and
maximum values to place the whiskers using
the Tukey method (1.5*IQR). Add this value
to Q3 and subtract it from Q1.
5. If there are outliers, draw lines to the last
value that is not an outlier; otherwise draw a
line until the min and max values.
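
The five number summary, the 1.5(IQR) criterion, and the whisker placement above can be sketched as follows. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" method from these notes and, as an extra assumption, leaves the median out of both halves when n is odd. The function names and data are illustrative only.

```python
import statistics

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves method (median excluded when n is odd)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

def boxplot_summary(values):
    data = sorted(values)
    q1, med, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # below this: suspected outlier
    high_fence = q3 + 1.5 * iqr    # above this: suspected outlier
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "five_number": (data[0], q1, med, q3, data[-1]),
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # last values that are not outliers
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical data with one unusually large value
data = [3, 5, 7, 8, 9, 10, 11, 12, 13, 30]
summary = boxplot_summary(data)
print(summary)
```

Here the upper whisker stops at the largest value that is not an outlier, and the outlier itself would be marked separately on the plot.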
Exploring Relationships Between 2 Variables
An explanatory variable (independent) claims to
explain or predict the response.
The response variable (dependent) is the outcome of
the study.

Role Type Classification
Further classifying the explanatory and response variables into categorical and quantitative types yields the following:
Categorical explanatory and quantitative response
Categorical explanatory and categorical response
Quantitative explanatory and quantitative response
Quantitative explanatory and categorical response

The Role-Type classification is important to know what relationship the variables follow in the study so we can choose appropriate statistical tools to analyze them.

C -> Q: use side-by-side boxplots to compare distributions with the five number summary.

C -> C: use a two-way table; sum at each intersection of column and row to see number of responses
You also need to supplement the table with CONDITIONAL PERCENTS. This means that you convert the values to percentages, but separately for each value of the explanatory variable (e.g. do not divide by the entire number of observations in the study, only by the total number of observations for the specific value of the explanatory variable).

Q -> Q: use a scatterplot
o How to create:
1. Place the EXPLANATORY VARIABLE ON THE X AXIS. The response variable should be on the Y axis. If there is no clear explanatory or response variable, you can put either on either axis.
2. Plot each value on the graph.
We look at the overall pattern of the scatterplot: direction, form, and strength. Also include any deviations from the pattern (i.e. outliers).
The direction can go up (positive; direct relationship) or it can go down (negative; inverse relationship). Not all fit into these two categories.
The form describes the shape of the graph. It can be linear, curvilinear, etc.
The strength of the relationship is how closely the data adhere to the form of the graph.
Outliers are data that deviate from the pattern of the relationship.
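
For the two-way table case (C -> C) described in this section, conditional percents can be sketched as below. The table layout (one row per value of the explanatory variable), the variable values, and the counts are all hypothetical.

```python
def conditional_percents(table):
    """Convert counts to percentages within each value of the explanatory variable."""
    result = {}
    for explanatory_value, row in table.items():
        row_total = sum(row.values())   # total for this explanatory value only
        result[explanatory_value] = {
            response: 100 * count / row_total for response, count in row.items()
        }
    return result

# Hypothetical counts: explanatory variable = group, response = answer
counts = {
    "group 1": {"yes": 30, "no": 20},
    "group 2": {"yes": 45, "no": 5},
}
percents = conditional_percents(counts)

for group, row in percents.items():
    print(group, row)
```

Dividing by each row total, rather than by all 100 observations, is what makes the two groups comparable even when their sizes differ.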

Linear Relationships
Correlation Coefficient: numerical measure that assesses the strength of a linear relationship, denoted by r.
r can only have a value that ranges from negative one to one.
The sign of r indicates the direction of the relationship.
The closer to zero the value of r, the weaker the relationship.
The closer to negative one or one, the stronger the relationship.
r is unitless
r only measures the strength of a LINEAR relationship; it can be close to 0 even when there is a strong non-linear relationship
r cannot determine for you if a relationship is linear or not
r is heavily influenced by outliers

Linear Regression
Regression is a technique that specifies how much the response variable is dependent on the explanatory variable.
If the dependence is linear, then it is called linear regression.
Therefore, linear regression is the technique of finding the line that best fits the pattern of the linear relationship.
The LEAST SQUARES criterion states that among all the possible lines for the relationship, we need to choose the one with the smallest sum of squared vertical deviations. This line is called the least-squares regression line.

Algebra of a Line (Review)
A line relates two variables X and Y. Their relationship is expressed in the formula Y = a + bX
o a (intercept): value of Y when X is 0
o b (slope): change in Y when X increases by 1 unit.

Least Squares Regression Line
The intercept and slope can be obtained given this formula:
b = r * (S_Y / S_X)
a = Y-bar - b * X-bar
where r is the correlation coefficient, S_Y is the SD of the response variable, S_X is the SD of the explanatory variable, Y-bar is the mean of the response variable, and X-bar is the mean of the explanatory variable.
You need to find the slope first, as the value of the intercept depends on it.

Extrapolation is predicting for values of the explanatory variable that are outside the range of the data. It is not reliable and should be avoided.

Association does not imply causation.
Lurking variables (those not part of either explanatory or response variables) may affect results significantly.
A lurking variable is confounded with the explanatory variable if their effects on the response variable cannot be distinguished from each other.

Target Population: people you want your results to apply to
Sampling frame: A list of all the members of the target population.
Census: Getting desired information from everyone in the target population.
Random sample: Each member of the population has an equal chance of being selected for the sample.
Bias: Systematic unfairness in sample selection or data collection.
Convenience sample: Chosen solely for convenience, not based on randomness.
Volunteer/self-selected sample: Sample where people determine on their own to be involved.
Non-response bias: Occurs when someone in the sample doesn't return or doesn't finish the survey.
Response bias: When the respondent takes the survey but doesn't give correct information.
Undercoverage: Sampling frame doesn't include adequate representation from certain groups within the target population.

Types of Samples
Volunteer: individuals select themselves to be participants
o Participants have strong opinions, which cannot be generalized outside themselves
Convenience: people who are at the right time and place to suit the schedule of the researcher are sampled
o Some individuals are more likely to be selected than others
Sampling Frame: a list of potential individuals to be sampled
o List may not be representative of entire population of interest
Systematic Sampling: using a uniform method of sampling to select individuals from a sampling frame (e.g. every 50th name in the list)
o Each individual does not have an equal chance of being selected
Simple Random Sample (SRS): individuals sampled completely at random
o All subjects may not respond or comply (nonresponse)
Probability Sampling Plans:
o Simple Random Sampling
o Cluster Sampling: population is naturally divided into groups (called clusters); we randomly select clusters and have all individuals within the selected clusters as our sample
Multistage Sampling (can have more than 2 stages): Conducting another SRS within clusters after selection
o Stratified Sampling: population is naturally divided into sub-populations (called strata; singular: stratum) and we choose a simple random sample from each of these strata

Designs
o Experiment
o Observational Study
Retrospective
Prospective
o Survey

Observational Studies
o Values of the variable are recorded as they naturally occur; no interference from researchers
o Too many lurking variables; can never assume causation
Survey
o Also a type of observational study, but individuals report the values themselves
o Questions may be leading, have unbalanced responses, or be complicated or sensitive
Experiment
o Researchers assign the values of the explanatory variable to individuals
o May have random sampling and assignment
o May have a control group
o Control may be used to mean: control of a confounding variable, controlled experiment (explanatory variable is assigned), and control group
o May be blind (either participant or researcher; if participant is blinded: employs use of a placebo) or double-blind (both of them); prevents experimenter effect
o Lacks ecological validity
o Hawthorne effect- behavior of participant changes because of awareness of being observed
o Noncompliance may occur

Modifications to Randomization
o Blocking: dividing subjects into groups based on similarities to chosen background variables; analogous to stratification in sampling
o Matched Pairs: comparison of only two groups who are similar in many important respects; or the same individual's responses are compared for two explanatory values (ex. before and after studies)

Probability
Official name for chance; likelihood that an event will happen
Represented by P, and the probability of an event A is denoted by P(A).
The probability of an event can have values of 0 to 1. It can be expressed as a decimal or a percentage.
o 0= no chance of happening
o .50= equally likely to happen and not happen
o x< .50= less likely to happen
o x> .50= more likely to happen
o 1= definitely happening

Random Experiment: process of observing the outcome of a chance event; experiment that produces an outcome that cannot be predicted in advance (hence the uncertainty)


Elementary Outcomes: all possible results of the random experiment; different from EVENTS, which are grouped

Events: set of elementary outcomes; statement about the nature of the outcome we are going to get; denoted in capital letters
can be combined with and, or, and not
In a finite set, the no. of events is 2^n, meaning 2 raised to the number of outcomes in the sample space

Sample Space: set or collection of all the elementary outcomes
The total probability of the sample space must be 1!
When choosing from a set of objects:
o Permutations: order MATTERS: P(n,r) = n!/(n-r)!
o Combinations: order DOES NOT MATTER: C(n,r) = n!/(r!(n-r)!)

Methods of determining probability:
Theoretical or Classical
o Used for games of chance
o Games themselves determine all possible outcomes
o The game is fair and all elementary outcomes have the same probability.
Empirical or Observational (via Relative Frequency)
o We perform the random experiment multiple times and list down our observations (usually with a large sample)
o Strength lies in allowing us to understand events where patterns cannot be predetermined

Relative Frequency
In empirical methods, it involves expressing the total number of occurrences of the event of interest and the total number of times the random experiment was performed as a ratio.
This is only an estimate of the actual probability.

Law of Large Numbers
The actual (or true) probability of an event (A) is estimated by the relative frequency with which the event occurs in a long series of trials.
In other words, we get closer to the theoretical/actual probability with increasing number of trials.

Laws of Probability:
Value of probability ranges from 0 to 1
The Complement Rule/ the Subtraction Rule
o P(not A)= 1-P(A)
o P(A)=1-P(not A)
P(A)= event of interest
P(A) occurs AT LEAST ONCE: the opposite is not at all
P(not A)= NOT AT ALL (all are the opposite of the event of interest!)
Since P(A) + P(not A) = 1, then P(A) = 1 - P(not A)
o Useful for statements that ask for AT LEAST ONE OF, which has a complement of NONE and can be combined with the multiplication rule for independent events
The Addition Rule (OR)
o P(A or B)= P(A) + P(B) - P(A and B)
o P(A and B) can be obtained by the multiplication rule
o Disjoint events: Two events that cannot occur at the same time are called disjoint or mutually exclusive.
o For DISJOINT events, there is no OVERLAP (they do not share any common outcomes), therefore P(A or B)= P(A) + P(B)
The Multiplication Rule (AND)
o P(A and B)= P(A)*P(B|A)
o Special rule for independent events: P(A and B) = P(A)*P(B)
o Independent events: Two events A and B are said to be independent if the fact that one event has occurred does not affect the probability that the other event will occur.
o Dependent events: If whether or not one event occurs does affect the probability that the other event will occur, then the two events are said to be dependent.
o For INDEPENDENT EVENTS: P(A and B) = P(A)*P(B)
o For dependent events, we turn to CONDITIONAL PROBABILITY.
Possible combinations:
If A and B are disjoint, they have no intersection and cannot happen together. If A happens, B cannot happen and vice versa. Therefore, this makes the probability of the second event 0. Since one event affects the probability of the other, disjoint events are NOT independent.

Probability Table
The general addition rule can sometimes be better represented by this table.
Margins of the table: totals for just one of the events
Body of the table: totals for outcomes that involve both events
Total row and column must both sum to 1.
Theoretically, if you have at least one of the values in the 3 divisions of the table, you can complete it:

Bayes Theorem/Law of Total Probability
Used when we want to know the probability of a prior occurrence involving conditional probabilities
Events that occur first (the givens) are usually the explanatory variables; those that come after are the response variables
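
As a small numeric illustration of Bayes' theorem and the law of total probability, the sketch below works backwards from P(B|A) to P(A|B). The formula used is the standard textbook form, P(A|B) = P(A)P(B|A) / [P(A)P(B|A) + P(not A)P(B|not A)], which is not spelled out in these notes; the function name and the input probabilities are hypothetical.

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) from P(A), P(B | A), and P(B | not A)."""
    p_not_a = 1 - p_a                          # complement rule
    p_a_and_b = p_a * p_b_given_a              # multiplication rule
    p_not_a_and_b = p_not_a * p_b_given_not_a  # multiplication rule
    p_b = p_a_and_b + p_not_a_and_b            # law of total probability
    return p_a_and_b / p_b                     # definition of conditional probability

# Hypothetical inputs: A happens first (the given), B is observed afterwards
posterior = bayes_posterior(p_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.10)
print(f"P(A | B) = {posterior:.4f}")
```

Even with a high P(B|A), the posterior P(A|B) can stay small when A itself is rare, which is why the prior probability matters.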



LAWS OF PROBABILITY CHEAT SHEET




Conditional Probability
Expressed as P(A|B) which is read as probability of
event A given B.
Can be visualized in a probability table.
Formal definition: The conditional probability of event
B, given event A, is P(B | A) = P(A and B) / P(A)
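
The definition above can be applied directly to a probability table. In this sketch the joint probabilities are made up, and the helper names `marginal` and `conditional` are assumptions; the last line compares P(B|A) with P(B) as a simple check of whether A has any bearing on B.

```python
import math

# Hypothetical probability table: body cells are P(row event and column event)
joint = {
    ("A", "B"): 0.12, ("A", "not B"): 0.28,
    ("not A", "B"): 0.18, ("not A", "not B"): 0.42,
}

def marginal(event):
    """Margin of the table: total probability of a single event."""
    return sum(p for cell, p in joint.items() if event in cell)

def conditional(b, a):
    """P(b | a) = P(a and b) / P(a)."""
    p_a_and_b = sum(p for cell, p in joint.items() if a in cell and b in cell)
    return p_a_and_b / marginal(a)

p_b = marginal("B")
p_b_given_a = conditional("B", "A")
print(f"P(B) = {p_b:.2f}, P(B | A) = {p_b_given_a:.2f}")
print("A and B look independent:", math.isclose(p_b, p_b_given_a))
```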

Independence Checks
Done to see if the events are indeed independent or
not
Done by comparing original probability of one of the
events with their probability given the second event. If
different, they are dependent. Otherwise, they are
independent.
We can also compare P(B|A) and P(B| not A). If the
events are independent, not A would have no
bearing and you would get the same values.
We can also check by comparing the known
intersection and the product of the special
multiplication rule.
In short, two events are independent if any of these statements hold. If one is not true, then all the rest will also be false.
P(B | A) = P(B)
P(A | B) = P(A)
P(B | A) = P(B | not A)
P(A and B) = P(A) * P(B)

Probability Models
Help you get the long-term average outcomes and amount of variability in the results of one random experiment to the next

Discrete Random Variables
take a finite or countably infinite number of values (the values can be listed one by one, even if the list never ends)
contrast with continuous random variables, whose possible values are uncountable

Probability Mass Function
is a function that assigns probabilities to each value of X using a probability distribution
P(X) is always between 0 and 1
You can add probabilities of values a or b, as individual values of X are mutually exclusive
Probabilities in a probability distribution all add up to 1
It is called a mass function as it is the formula that assigns a weight to how probable each value is to occur

(Relative Frequency) Histograms
For illustrating probability distributions
X-Axis: numerical values
Y-Axis: percentage of time each value occurs

Events
One or more outcomes of interest from the sample space

Cumulative Distribution Function of X
The cumulative distribution function (cdf) is the probability that the variable takes a value less than or equal to x.
The cdf at a value a is equal to the sum of the probabilities of all values of X that are less than or equal to a.
CDFs are step functions, as the probabilities remain the same until a new value of x is reached, but the endpoint is always 1.
You can get the PMF from the CDF by taking the size of each jump in the CDF at each value of X.

Expected Value
Long-term average outcome of a random variable
Weighted average of all possible values of X (mean of X)
How to obtain:
o Multiply each value of X by its probability
o Repeat for all values
o Sum all results
If the pmf of X is symmetric, the expected value would be the middle value. If it is skewed, it won't be that value.
Mean of X does not have to equal a value of X, because it is an average. However, it must lie between the minimum and maximum values. It does not have to equal 1.

Variance
Expected amount of variability in your results after repeating an experiment a theoretically infinite number of times
V(X) or σ²
How to obtain:
o Subtract the mean from the value of X
o Square the difference
o Multiply by probability of X for that particular value of X
o Repeat for all values
o Sum
Variance is always greater than or equal to zero since it is squared

Standard Deviation
Get the square root of the variance
Takes the unit of the given
Standard deviations are not additive!

Linear Transformation of One Random Variable

Linear Transformation of Two Random Variables

Binomial Experiments:
a fixed number (n) of trials
each trial must be independent of the others
o To assume independence of events, the population size is greater than or equal to 10 times the sample size. In symbols this is: N >= 10n.
each trial has just two possible outcomes, called "success" (the outcome of interest) and "failure"
there is a constant probability (p) of success for each trial, the complement of which is the probability (1 - p) of failure
written as, for example: X is binomial with n = 3 and p = 1/4.
To get the probability distribution of a binomial variable:
P(X = x) = C(n,x) * p^x * (1 - p)^(n - x)
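
The expected value and variance steps for a discrete random variable can be sketched as follows, using a binomial variable with n = 3 and p = 1/4 as the example. The pmf uses the standard binomial formula C(n,x) * p^x * (1-p)^(n-x), which is assumed here rather than taken from the notes; the function names are illustrative.

```python
from math import comb, isclose, sqrt

def binomial_pmf(n, p):
    """Probability of each number of successes x = 0..n in n independent trials."""
    return {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

def expected_value(pmf):
    # multiply each value of X by its probability, then sum
    return sum(x * prob for x, prob in pmf.items())

def variance(pmf):
    mean = expected_value(pmf)
    # subtract the mean, square, weight by the probability, then sum
    return sum((x - mean) ** 2 * prob for x, prob in pmf.items())

pmf = binomial_pmf(n=3, p=0.25)
mean = expected_value(pmf)
var = variance(pmf)

print("pmf:", pmf)
print("probabilities sum to 1:", isclose(sum(pmf.values()), 1.0))
print(f"mean = {mean}, variance = {var}, SD = {sqrt(var):.4f}")
```

For a binomial variable the same results come out of the shortcut formulas mean = n*p and variance = n*p*(1-p), which makes a convenient cross-check.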

The combination formula tells us the number of ways each outcome occurs.
P is RAISED to x!!! Remember!
Complement: OPPOSITE of what you want!

Mean and Standard Deviation of a Binomial Random Variable
Mean = n * p (number of trials times the probability of success)
Variance and SD: variance = n * p * (1 - p); SD = square root of the variance

Shape is symmetric when p=.5.
If p is larger than .5= skewed left.
If p is smaller than .5= skewed right.
You need a larger n to even this out to a symmetric shape again.

Continuous Random Variables
best illustrated by a probability density curve rather than a histogram, as values are infinite across a range
The total area under the curve equals 1, as the areas are all probabilities
The probability that X gets a value in any interval of interest is the area above this interval and below the density curve.
Probability is 0 when X=specific value; only ranges have nonzero probability

Z-Score
Tells us how many SDs (not units!) below or above the mean the original value is
Formula used to transform a value from a normal distribution into a standard normal distribution value
How to obtain:
z-score = (value - mean)/standard deviation
For values above the mean, z is positive. For values below the mean, z is negative.
Z-scores can also be used to compare values across distributions

Finding Probabilities for a Normal Distribution
1. Draw a picture of the distribution
2. Translate the problem into probability notation. Shade the area in the picture.
3. Transform the value a into a z value using the z-score formula.
4. Look up the z value on the z table (on a standard table, use the first digit and first decimal of z to locate the row, then the second decimal to locate the column; where they intersect is the probability of falling below that z)
5. If the problem is "less than", the table value is the answer. If the problem is "greater than", take the complement (OR INVERT ALL SIGNS: inequality and +/-). If you have an in-between problem, find the "less than" table value for each endpoint (applying the complement rule to any "greater than" value first) and subtract the lesser value from the greater value.
6. Answer the problem in the original context of the question.

Rule of Thumb: Normal Approximation to the Binomial

Central Limit Theorem
the distribution of sample means will be approximately normal as long as the sample size is large enough (30 as rule of thumb)
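
The z-score formula and the less-than, greater-than, and in-between cases can be sketched with Python's standard library, using `statistics.NormalDist` (Python 3.8+) in place of a printed z table. The mean of 100 and SD of 15 are made-up values, and the helper names are assumptions.

```python
from math import isclose
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, SD 1; its cdf plays the role of the z table

def z_score(value, mean, sd):
    return (value - mean) / sd

def prob_less_than(value, mean, sd):
    # table lookup: area to the left of z
    return standard_normal.cdf(z_score(value, mean, sd))

def prob_greater_than(value, mean, sd):
    # complement rule
    return 1 - prob_less_than(value, mean, sd)

def prob_between(low, high, mean, sd):
    # subtract the lesser area from the greater area
    return prob_less_than(high, mean, sd) - prob_less_than(low, mean, sd)

# Hypothetical normal distribution with mean 100 and SD 15
z = z_score(130, mean=100, sd=15)
print(f"z = {z}")
print(f"P(X < 130) = {prob_less_than(130, 100, 15):.4f}")
print(f"P(X > 130) = {prob_greater_than(130, 100, 15):.4f}")
print(f"P(85 < X < 115) = {prob_between(85, 115, 100, 15):.4f}")
```

The last probability covers one SD on each side of the mean, so it should land near the 68% figure from the empirical rule.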
