

A. Variables:

Categorical Variable: Puts individuals into one of several groups or categories.


- Pie Charts/Graphs.
- Bar Graphs.

Quantitative Variable: Uses numerical operations, such as averages, to describe data.


- Histograms.
- Stem Plots.

** Good for large data sets, but does not tell you what each individual data value is.

Types of Distributions:

Symmetric: A distribution is symmetric if the right and left sides of the histogram are
approximately the same shape.

Unimodal: The distribution has a single peak that shows the most common value in the data.
(Only one peak).

Bimodal: Distribution has two peaks; represents the two modes of the data.

Skewed to the Right: If the right side extends much farther out than the left side.

Skewed to the Left: If the left side extends much farther out than the right side.


B. Measures of Central Tendency:

Mean: x̄ = (∑x)/n

Median:
- If the total number of observations (n) is odd: the median is the value in position (n + 1)/2.
- If n is even: the median is the average of the values in positions n/2 and (n/2) + 1.

Mode: The most frequently occurring score or value.
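A minimal Python sketch of these rules, using made-up data (the positions in the median rule are 1-based, so they are shifted by one for Python's 0-based indexing):

    # Mean and median from their definitions (example data is made up).
    data = sorted([4, 8, 15, 16, 23, 42])
    n = len(data)

    mean = sum(data) / n                       # x-bar = (sum of x) / n

    if n % 2 == 1:                             # odd n: the value in position (n + 1)/2
        median = data[(n + 1) // 2 - 1]
    else:                                      # even n: average of positions n/2 and n/2 + 1
        median = (data[n // 2 - 1] + data[n // 2]) / 2

    print(mean, median)                        # 18.0 15.5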

Density Curves:

Symmetric Curve: Mean, median, and mode are all in the same spot…in the centre.

Right Skewed Curve: Mean is pulled to the right, towards the tail.

Mean > Median > Mode → the Mean is greater than the Median.

Left Skewed Curve: Mean is pulled to the left.

Mode > Median > Mean → the Median is greater than the Mean.

C. Measures of Spread:

Range: xMax – xMin

Quartiles:
- Q1: Larger than 25% of the observations.
- Q2: Median.
- Q3: Larger than 75% of the observations.

MIN Q1 MEDIAN Q3 MAX

IQR: Q3-Q1

Five Number Summary and Box-Plot:

Outliers: Can greatly affect the values of the mean and standard deviation.

To Find Outliers (1.5 × IQR rule):
- Any value above Q3 + (1.5)IQR is an outlier.
- Any value below Q1 – (1.5)IQR is an outlier.
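A short Python sketch of the five-number summary and the 1.5 × IQR rule on made-up data; statistics.quantiles (Python 3.8+) uses one common quartile convention, so hand calculations may differ slightly:

    import statistics

    data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]       # made-up example data
    q1, q2, q3 = statistics.quantiles(data, n=4)         # Q1, median, Q3
    iqr = q3 - q1

    lower_fence = q1 - 1.5 * iqr                         # Q1 - (1.5)IQR
    upper_fence = q3 + 1.5 * iqr                         # Q3 + (1.5)IQR
    outliers = [x for x in data if x < lower_fence or x > upper_fence]

    print(min(data), q1, q2, q3, max(data))              # five-number summary
    print(iqr, outliers)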


D. Normal Distribution:

Properties:
- Have a bell-shape and are symmetrical.
- The mean (MU) is in the centre of the distribution.
- Standard deviation tells us how spread out the data is on both sides.

Large SD → the more spread out the data.

Empirical Rule: Approximately 68% of the data lies within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.

Standard Normal Variable:

Z = (X – MU)/SIGMA

X = MU + Z(SIGMA)

- When GREATER: we want the area ABOVE Z, so take 1 – (table area for Z).
- When LESS: the table area for Z is all we need.
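A hedged sketch of the standardization step in Python, with made-up values MU = 100 and SIGMA = 15 (statistics.NormalDist plays the role of the standard Normal table):

    from statistics import NormalDist

    mu, sigma, x = 100, 15, 120            # made-up values for illustration
    z = (x - mu) / sigma                   # Z = (X - MU)/SIGMA

    p_less = NormalDist().cdf(z)           # LESS than x: the table area for Z
    p_greater = 1 - p_less                 # GREATER than x: area above, 1 - table area

    print(round(z, 2), round(p_less, 4), round(p_greater, 4))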

E. Scatterplots:

- Shows the relationship between TWO QUANTITATIVE variables. The independent variable appears on the horizontal axis, and the dependent variable appears on the vertical axis.

X-Variable: Explanatory Variable
Y-Variable: Response Variable


The strength of a linear relation is indicated by HOW CLOSELY the points form a straight line.


F. Correlation:

- Measures the direction and strength of the linear relationship between two quantitative
variables.

Properties of Correlation:

1. No distinction between explanatory and response variables.
2. R doesn’t change when you change the unit of measurement.
3. Positive R indicates a positive relationship; negative R indicates a negative relationship.
4. R is always between -1 and 1.
5. Correlation is NOT resistant and is strongly affected by outliers.
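A small Python sketch computing r from standardized values, with made-up paired data; the second call illustrates property 2 (changing the units of x leaves r unchanged):

    import math

    def correlation(x, y):
        # Sample correlation r: average product of the standardized x and y values.
        n = len(x)
        xbar, ybar = sum(x) / n, sum(y) / n
        sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
        return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
                   for xi, yi in zip(x, y)) / (n - 1)

    x = [1, 2, 3, 4, 5]                              # made-up explanatory values
    y = [2, 4, 5, 4, 5]                              # made-up response values
    print(round(correlation(x, y), 3))
    print(round(correlation([100 * xi for xi in x], y), 3))   # same r after rescaling x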

G. Regression:

Regression Line: A straight line that describes how a response variable Y CHANGES as an
explanatory variable X CHANGES. We use this to predict the value of Y for a given value of X.

Least-Squares Regression Line: ŷ = a + bx, where the slope is b = r(Sy/Sx) and the intercept is a = ȳ – b(x̄).

SLOPE indicates the rate of change in Y per unit change in X.

Y-INTERCEPT indicates the Y-VALUE when the X-VALUE is 0.

1. B = SLOPE and R = CORRELATION always have the same sign.

2. The least-squares regression line always passes through the point (x̄, ȳ).

3. R² is the fraction of the variation in the y-values that is explained by the least-squares regression of Y on X.

Residuals: Difference between an observed value of the response variable and the predicted
value.
OBSERVED Y – PREDICTED Y
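A sketch of the least-squares line and residuals on the same kind of made-up data, using b = r(Sy/Sx) and a = ȳ – b(x̄):

    import math

    x = [1, 2, 3, 4, 5]                          # made-up explanatory values
    y = [2, 4, 5, 4, 5]                          # made-up response values
    n = len(x)

    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy) for xi, yi in zip(x, y)) / (n - 1)

    b = r * sy / sx                              # slope: same sign as r
    a = ybar - b * xbar                          # intercept: line passes through (xbar, ybar)

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]    # observed y - predicted y
    print(round(b, 3), round(a, 3), round(r ** 2, 3))          # r^2 = variation explained
    print([round(e, 3) for e in residuals])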

Lurking Variable: A variable that is NOT among the explanatory or response variables and
may influence the interpretation of relationships.
- Association does NOT imply causation.


H. Two-Way Tables:

- TWO CATEGORICAL variables.

Marginal Distribution: Involves ONE of the categorical variables.

Conditional Distribution: The distribution of one variable among only the individuals with a given value of the other variable; used to compare groups.

Simpson’s Paradox: Sometimes an association that holds for all of several groups can reverse direction when the data are combined to form one larger group.
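A tiny Python sketch of marginal and conditional distributions from a made-up two-way table of counts (both variables and the counts are invented for illustration):

    # Rows = gender, columns = preferred snack (made-up categorical data).
    table = {
        "female": {"fruit": 30, "chips": 20},
        "male":   {"fruit": 15, "chips": 35},
    }
    grand_total = sum(sum(row.values()) for row in table.values())

    # Marginal distribution: involves ONE variable (gender alone).
    marginal_gender = {g: sum(row.values()) / grand_total for g, row in table.items()}

    # Conditional distribution of snack, GIVEN gender = "female".
    female_total = sum(table["female"].values())
    conditional_female = {s: c / female_total for s, c in table["female"].items()}

    print(marginal_gender)        # {'female': 0.5, 'male': 0.5}
    print(conditional_female)     # {'fruit': 0.6, 'chips': 0.4}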

J. Methods of Sampling:

Population: The group you want information about.

Sample: The group you actually obtain information from (does not include people who don’t respond).

Sampling:
Observational Study: Variables are measured on individuals without influencing the responses.

Experiments: Impose some treatment on individuals.

Principles of Experimental Design:

1. CONTROL the effects of lurking variables on the response.
2. RANDOMLY assign the individuals to the treatments.
3. REPEAT each treatment on many units to reduce chance variation in the results.

Bias: A study design is biased if it systematically favours certain outcomes.

Block Design: Group similar individuals together and then randomize within each of these
BLOCKS.

Matched Pairs Design: Two treatments under study. Subjects are matched in pairs based on
attributes. Randomly assigned within each pair.

Statistically Significant: An observed effect is so large that it would be unlikely to have occurred by chance.

Designs for Sample Surveys:

- Voluntary Response: People volunteer themselves.
- Convenience Sample: Ask people who are convenient to reach.
- Judgement Sample: Select individuals based on one’s own opinion or judgement.
- Simple Random Sample (SRS): Individuals are selected completely at random.
- Stratified Random Sample: Divide the population into groups called STRATA, then take an SRS within each stratum.
- Multi-Stage Sampling: Split the population into parts; an SRS is taken at each stage.
- Cluster Sampling: Divide the population into groups called clusters, then randomly select whole clusters.
- Systematic Sample: Select every Kth individual.


K. Probability:

Mutually Exclusive VS Independence:

Mutually Exclusive (or disjoint): Events that can’t both happen at the same time.
Ex: Draw one card; drawing an ace and drawing a king are mutually exclusive, since a single card cannot be both an ace and a king.

Independent Events: The occurrence of one does not affect the probability of the other.

Ex: Drawing an ace and drawing a spade; a single card can be both a spade and an ace.

Probability = (Number of Favourable Outcomes)/(Total Number of Outcomes)

Properties:

1. The probability of any event A is between 0 and 1.
2. The sum of the probabilities of all possible outcomes must equal 1.
3. COMPLEMENT: P(not A) = 1 – P(A).
4. Mutually Exclusive/Disjoint: P(A or B) = P(A) + P(B).
5. Independent: P(A and B) = P(A) x P(B).
6. Addition: P(A or B) = P(A) + P(B) – P(A and B).
7. Conditional: P(A given B) = P(A and B)/P(B).
8. Multiplication: P(A and B) = P(A given B)P(B) = P(B given A)P(A).
9. Bayes’: P(B given A) = P(A given B)P(B)/P(A).
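A quick Python check of several of these rules on the single-card examples above (standard 52-card deck):

    # One card drawn from a standard 52-card deck.
    p_ace, p_king, p_spade = 4 / 52, 4 / 52, 13 / 52
    p_ace_and_spade = 1 / 52                                # the ace of spades

    p_ace_or_king = p_ace + p_king                          # disjoint events: rule 4
    p_ace_or_spade = p_ace + p_spade - p_ace_and_spade      # addition rule: rule 6

    print(abs(p_ace_and_spade - p_ace * p_spade) < 1e-12)   # independence (rule 5) holds

    p_ace_given_spade = p_ace_and_spade / p_spade            # conditional: rule 7
    p_spade_given_ace = p_ace_given_spade * p_spade / p_ace  # Bayes' (rule 9), equals 1/4

    print(round(p_ace_or_king, 4), round(p_ace_or_spade, 4),
          round(p_ace_given_spade, 4), round(p_spade_given_ace, 4))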

L. Random Variables:

- Assigns a numerical result to an outcome of an experiment that is associated with chance.
- Can be either DISCRETE or CONTINUOUS.

Discrete Random Variables: Lists each possible value the random variable can assume,
together with its probability.

Must satisfy the following conditions:


- Probability of each value of the discrete variable is between 0 and 1.
- Sum of all probabilities is 1.
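A minimal Python check of those two conditions, for a made-up discrete distribution:

    # Made-up discrete distribution: value -> probability.
    pmf = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

    each_ok = all(0 <= p <= 1 for p in pmf.values())     # each probability in [0, 1]
    total_ok = abs(sum(pmf.values()) - 1) < 1e-12        # probabilities sum to 1

    print(each_ok, total_ok)                             # True True -> valid distribution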

Continuous Random Variables: Can take on any value in an interval.

- Probabilities are given for ranges of values (e.g., P(a < X < b)), using a probability density function (PDF).

Properties:
- f(x) ≥ 0.
- The total area under the curve is equal to 1.
- P(a < X < b) = the area under the curve between a and b.


M. Sampling Distribution:

Parameter: Number that describes the population.

Statistic: Number that can be calculated from a sample without using any unknown parameters.

Law of Large Numbers: The sample mean XBAR is rarely exactly equal to MU. If we keep taking larger and larger samples, the statistic XBAR gets closer and closer to the parameter MU.

Mean and Standard Deviation of a Sample Mean: If an SRS of size n is drawn from a large population with mean MU and standard deviation SIGMA, then the sample mean XBAR has mean MU and standard deviation SIGMA/SQR(N).

Sampling Distribution of the Sample Mean:

XBAR ~ N(MU, SIGMA/SQR(N))

Central Limit Theorem: Forms foundation for the inferential branch of statistics.

1. If the sample size n is large (n > 40), then the sampling distribution of the sample mean approximates a Normal distribution. The greater the sample size, the better the approximation.

2. If population itself Normally distributed, the sampling distribution of the sample mean is
Normally Distributed for ANY sample size n.

** The CLT allows us to use Normal probability calculations to answer questions about sample means from many observations even when the population distribution is NOT Normal.
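A small simulation sketch of the CLT in Python, drawing repeated samples from a strongly skewed (exponential) population with MU = 1 and SIGMA = 1; all settings are made up:

    import random
    import statistics

    random.seed(1)
    n, reps = 40, 2000                          # sample size and number of repeated samples

    # Mean of each of many SRSs from a skewed population.
    sample_means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
                    for _ in range(reps)]

    # The sample means pile up near MU, with spread near SIGMA/SQR(N).
    print(round(statistics.fmean(sample_means), 3))    # close to 1
    print(round(statistics.stdev(sample_means), 3))    # close to 1/sqrt(40) ≈ 0.158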

N. Confidence Interval for a Mean:

- An estimated range of values, calculated from the sample data, that is meant to capture the population parameter.
- The confidence level gives the success rate of the method used to construct the interval.

Interpretations of the Confidence Interval: (Suppose 95% Confidence Interval)

1. Based on the data, we are 95% confident that the population parameter is contained in the interval.

2. Out of many separately constructed confidence intervals, about 95% of them will contain the population parameter.

Estimate ± Margin of Error (m)

Confidence Interval for a Population Mean:

Case 1: SIGMA is known and the population is NORMALLY DISTRIBUTED.

XBAR ± Z*(SIGMA/SQR(N))

Case 2: SIGMA is unknown and the population is NORMALLY DISTRIBUTED.

XBAR ± T*(S/SQR(N))
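A hedged Python sketch of both cases on made-up data; the Case 1 SIGMA is assumed known purely for illustration, and SciPy is assumed available for the T* critical value:

    import math
    from statistics import NormalDist
    from scipy import stats                    # only for the t* critical value

    data = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0]   # made-up sample
    n, conf = len(data), 0.95
    xbar = sum(data) / n

    # Case 1: SIGMA known (value assumed here).
    sigma = 0.25
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)       # about 1.96
    m_z = z_star * sigma / math.sqrt(n)
    print(xbar - m_z, xbar + m_z)

    # Case 2: SIGMA unknown -> sample SD s and T* with n - 1 degrees of freedom.
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    t_star = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    m_t = t_star * s / math.sqrt(n)
    print(xbar - m_t, xbar + m_t)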


As SIGMA increases, the MARGIN OF ERROR increases.
As Z* increases (higher confidence level), the MARGIN OF ERROR increases.
As N increases, the MARGIN OF ERROR decreases.

**ALWAYS ROUND UP N

O. Sample Size and Margin of Error:

Sample Size n for a desired MARGIN of ERROR m:

n = (Z* x SIGMA/m)^2

Margin of Error:

m = Z*(SIGMA/SQR(N))
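A short Python sketch of both formulas with made-up numbers (SIGMA = 15, desired margin m = 2, 95% confidence):

    import math
    from statistics import NormalDist

    sigma, m, conf = 15, 2, 0.95
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96

    n_exact = (z_star * sigma / m) ** 2                 # n = (Z* x SIGMA/m)^2
    n = math.ceil(n_exact)                              # always round n UP
    print(n_exact, n)

    print(z_star * sigma / math.sqrt(n))                # resulting margin of error, <= 2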

P. Hypothesis Testing:

- Is a process using sample statistics to test a claim about the value of a population
parameter.

Steps:
1. State the NULL HYPOTHESIS. It contains a statement of equality: ≤, =, or ≥. This is what we actually test with the evidence.

2. State the ALTERNATIVE HYPOTHESIS. It is the complement of the NULL HYPOTHESIS. It contains a statement of strict inequality (>, ≠, or <) and must be true if the NULL HYPOTHESIS is false. This is the competing theory to the NULL, or the claim you are trying to find evidence for.

3. Determine the appropriate statistical test and corresponding test statistic.

4. Determine the P-VALUE by finding the appropriate probability. For a 2-SIDED HYPOTHESIS, MULTIPLY the one-tail P-VALUE BY 2.

5. Make a statistical decision.

a. P-VALUE < 0.01: VERY STRONG EVIDENCE AGAINST THE NULL.
b. 0.01 < P-VALUE < 0.05: STRONG EVIDENCE AGAINST THE NULL.
c. 0.05 < P-VALUE < 0.10: MODERATE EVIDENCE AGAINST THE NULL.
d. P-VALUE > 0.10: LITTLE TO NO EVIDENCE AGAINST THE NULL.

** Use this scale if ALPHA is not given. ALPHA = the level of significance to test at.

P-Values:
- The smaller the p-value, the greater the evidence against the null hypothesis. Use the
STANDARD NORMAL TABLE to find the p-value.

Z Test for a Population Mean:

Z = (XBAR – MU)/(SIGMA/SQR(N))
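A sketch of the z test with made-up numbers (H0: MU = 50 against a two-sided alternative):

    import math
    from statistics import NormalDist

    mu0, sigma, n, xbar = 50, 8, 36, 52.5           # made-up values

    z = (xbar - mu0) / (sigma / math.sqrt(n))       # Z = (XBAR - MU)/(SIGMA/SQR(N))
    p_one_tail = 1 - NormalDist().cdf(abs(z))       # area beyond |z| in one tail
    p_value = 2 * p_one_tail                        # 2-sided hypothesis: multiply by 2

    print(round(z, 3), round(p_value, 4))           # z ≈ 1.875, p ≈ 0.0608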


Q. Type I and Type II Errors:

- There are TWO possible decisions: 1) Reject the null hypothesis. 2) Fail to reject the null hypothesis.

TYPE I ERROR: REJECT a TRUE null hypothesis. (False positive)

- You conclude something is wrong when it actually is not.

TYPE II ERROR: FAIL to reject a FALSE null hypothesis. (False negative)

- You conclude nothing is wrong when something actually is.

V. One-Sample t-Confidence Interval:

- t-distributions are more spread out than the standard Normal: they have more probability in the tails and less in the centre.
- (n – 1) degrees of freedom.

- Draw an SRS of size n from a large population having an unknown mean MU. A level C confidence interval for MU is:

XBAR ± T*(S/SQR(N)), where T* comes from the t-distribution with (n – 1) degrees of freedom.

S/SQR(N) is called the Standard Error of the mean.

Robustness:
- The t procedure is correct when the population is Normally distributed.
- It is ROBUST to small deviations from Normality: the results will not be greatly affected.

Factors:
1. The data MUST be an SRS from the population.
2. Outliers and skewness strongly influence the mean; their effect gets smaller as the sample size increases.

- n < 15: the data must be very close to Normal and have NO outliers.
- n between 15 and 40: mild skewness is allowed, but NO outliers.
- n > 40: the t procedures are VALID even with strong skewness.

Z-TEST: Use when the POPULATION standard deviation is given (keywords: population, process, manufacturer).

T-TEST: Use when only the SAMPLE standard deviation is given (a mean reported WITH its sample standard deviation).


R. Two-Sample Problems (Independent Samples):

Conditions for Inferences Comparing Two Means:


1. TWO SRS and TWO different populations. The samples are independent.

2. Both populations are Normally distributed. The MEANS and STANDARD DEVIATIONS are UNKNOWN.

S. Large Sample Confidence Intervals for a Proportion:

Confidence Interval:

PHAT ± Z* SQR(PHAT(1 – PHAT)/N)

To find the required sample size…

N = (Z*/M)^2 x P*(1 – P*)

where P* is the guessed value for the sample proportion.

Plus Four Confidence Interval:

- Add 2 successes and 2 failures to the data: use PTILDE = (X + 2)/(N + 4) with sample size N + 4 in the interval formula.
- Used when the confidence level C is at least 90% and the sample size n is at least 10.
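A Python sketch of the large-sample interval, the plus-four interval, and the sample-size formula, on made-up counts:

    import math
    from statistics import NormalDist

    x, n, conf = 52, 120, 0.95                     # made-up successes, trials, level
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)

    # Large-sample interval: PHAT +/- Z* SQR(PHAT(1 - PHAT)/N)
    p_hat = x / n
    m = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    print(p_hat - m, p_hat + m)

    # Plus-four interval: add 2 successes and 2 failures.
    p_tilde = (x + 2) / (n + 4)
    m4 = z_star * math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    print(p_tilde - m4, p_tilde + m4)

    # Required sample size for a desired margin M, with guessed proportion P*.
    p_star, m_target = 0.5, 0.03
    print(math.ceil((z_star / m_target) ** 2 * p_star * (1 - p_star)))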
