Bivariate Data
When data come in dependent pairs (i.e., one data value is linked to another), say X is height and Y is
weight, and we take the measurements on n individuals, then our data would be as follows:
(X1, Y1), (X2, Y2), ..., (Xn, Yn). This type of data is called bivariate.
Scatterplots
Scatterplots, as we discussed in Lecture 2, are, like histograms, a good visual means of
understanding the patterns in bivariate numerical data.
Recall from the Lecture 2 Notes:
Scatterplot - A graph of two numerical variables (X vs. Y). This graph is also known as a scattergram or scatter diagram. A
scatterplot is a two-dimensional graph representing two variables measured from the same set of subjects or elements. That
is, for each point being graphed, there are two pieces of information. For example, the heights and weights of a group of 10
teenagers would result in a scatterplot of 10 paired (X=height, Y=weight) points on the graph. The graph can then be
studied to determine whether a relationship exists between the two variables. If the relationship between the two variables forms
a straight line (linear), the variables are considered to be correlated. This correlated pattern usually appears as a band of
points resembling a roadway rather than an exact straight line.
Typically, the more controlled variable is plotted on the X axis and the response variable is plotted on the Y axis.
Note that there are many other relationships beyond linear, such as quadratic, cubic, exponential, etc.
Here are some examples of X,Y paired data:
X-Cloud Density vs. Y-Rainfall
X-Advertising Dollars Spent vs. Y-Sales
X-Speed vs. Y-Braking Distance
Here are two examples of scatter plots that show a linear relationship (correlation):
The scatterplot gives us a means of observing relationships between the two variables. We call this
relationship positive if an increase in one variable corresponds to an increase in the other. This results
in an upward trend of the data points (1st graph above). When one variable increases while the other
decreases, we call the relationship negative, which results in a downward trend of the data points (2nd
graph above).
What does a scatterplot show? In general terms, it gives us an idea of what kind of relationships (or
patterns) the bivariate data has. We may have
1. Positive (or negative) linear relationship
2. Positive (or negative) curved relationship
3. Other relationships
4. No relationship
Scatterplots can also give us visual evidence of outliers or suspicious observations.
Here is another scatterplot, of the diameter of oak trees vs. the age of the trees.
Correlation
To get a measure of how strongly X and Y values are related, we will use Pearson's Correlation
Coefficient. Correlation is concerned with linear trends: if X increases, does Y tend to increase or
decrease? How much? How strong is this tendency?
Pearson's Correlation Coefficient is computed as
r = Sxy / sqrt(Sxx * Syy) = Sxy / [(n - 1) * Sx * Sy]
Note: Sxy measures the covariation of X and Y, while Sxx and Syy measure the internal variation of X and Y,
respectively.
Sxx = Σ(Xi - X̄)² = Sum of Squares of the X data (same as SSx)
Syy = Σ(Yi - Ȳ)² = Sum of Squares of the Y data (same as SSy)
Sxy = Σ(Xi - X̄)(Yi - Ȳ) = Sum of Cross Products of the (X,Y) pairs (same as SSxy)
Sx = Standard deviation of the X data
Sy = Standard deviation of the Y data
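As a concrete sketch of how these quantities combine to give r, here is a short Python computation; the small (X, Y) data set is invented purely for illustration.

```python
# Pearson's correlation coefficient computed directly from the sums of
# squares (Sxx, Syy) and the sum of cross products (Sxy) defined above.
# The data below are made up purely for illustration.
from math import sqrt

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

Sxx = sum((x - x_bar) ** 2 for x in X)                      # internal variation of X
Syy = sum((y - y_bar) ** 2 for y in Y)                      # internal variation of Y
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))  # covariation of X and Y

r = Sxy / sqrt(Sxx * Syy)
print(round(r, 4))  # very close to +1: a strong positive linear relationship
```

Because these Y values were chosen to rise almost exactly in step with X, r comes out just under +1.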
Properties of Pearson's Correlation
1. The value of r does not depend upon the units of measurement. For example, if X is the weight
in pounds and Y is the height in inches of a person, then the correlation between X and Y would
be the same if we measured X in kilograms and Y in centimeters.
2. The value of r does not depend upon which variable is labeled X and which variable is labeled
Y. In other words, the correlation of X and Y = correlation of Y and X.
3. −1 ≤ r ≤ +1. A positive value of r means a positive linear relationship, a negative value of r
means a negative linear relationship.
4. r = +1 is a perfect positive linear relationship and r = −1 is a perfect negative linear
relationship. This is extremely rare.
5. Values of r close to 0 imply no linear relationship. But note that "no linear relationship"
doesn't mean "no relationship" exists; a strong non-linear relationship might still exist when r
is close to 0. Always remember that r measures only the linear relationship between X and Y.
Strength of Correlation
We can generally define the strength of correlation as follows: (This is only a general guide!)
Strong: |r| ≥ 0.8
Moderate: 0.5 ≤ |r| < 0.8
Weak: |r| < 0.5
NOTE: Since correlation measures only a linear relationship, to have r close to or equal to zero does
not mean that there is no relationship between x and y.
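This point can be made concrete with a small sketch: invented data with a perfect (noiseless) quadratic relationship, for which Pearson's r is exactly 0 even though Y is completely determined by X.

```python
# A perfect quadratic relationship, Y = X**2, whose Pearson correlation
# is exactly 0: the relationship is strong, just not linear.
from math import sqrt

X = [-2, -1, 0, 1, 2]
Y = [x ** 2 for x in X]  # Y is completely determined by X

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
Syy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

r = Sxy / sqrt(Sxx * Syy)
print(r)  # 0.0 -- r detects no linear trend despite the exact relationship
```

The cross products cancel in pairs (the pattern is symmetric about X = 0), so Sxy and therefore r are exactly zero.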
2. Common response: Both X and Y respond to changes in some unobserved variable. Each of the
following three examples is an example of common response.
a. Ice cream sales and shark attacks both increase during summer.
b. Skirt lengths and stock prices are both controlled by the general attitude of the country,
liberal or conservative.
c. The number of cavities and children's vocabulary are both related to a child's age.
3. Confounding: The effect of X on Y is hopelessly mixed up with the effects of other
explanatory variables on Y. For example, if we are studying the effects of Tylenol on reducing
pain, and we give a group of pain-sufferers Tylenol and record how much their pain is reduced,
we are confounding the effect of giving them Tylenol with giving them any pill. Many people
report a reduction in pain by simply being given a sugar pill (placebo) with no medication in it
at all. This is called the placebo effect. To establish causation, a designed experiment must be
run.
The Regression Model & Regression Analysis
Regression Analysis
Regression analysis is a statistical tool that utilizes the relation between two or more quantitative
variables so that one variable (dependent variable) can be predicted from the others (independent
variables). For example, if one knows the relationship between advertising expenditures and sales, one
can predict sales by regression analysis once the level of advertising expenditures has been set. In
simple linear regression, we specifically consider the case when a single independent variable is used
to predict the dependent variable, and the two variables are linearly related.
In simple linear regression we fit a line of the form
Y = a + bX
where
The slope "b" measures the amount by which Y changes when X increases by one unit.
The y-intercept "a" is the value of Y when X = 0.
The objective of simple linear regression is to fit a straight line through the points on a scatterplot that
best represents all the points. So we want to find a and b such that the line [Y = a + bX] fits the points
as well as possible. To do this, we first define what we mean by a "best fit" line. This line, in some
sense, is closest to all of the data points simultaneously. In statistics, we define a residual, ei, as the
vertical distance between a point and the line,
ei = Yi - (a + bXi)
Since residuals can be positive or negative, we square them to remove the sign. By adding up all of the
squared residuals, we get a total measure of how far away from the data our line is. This sum is called
the SSresid = Sum of Squared residuals. Thus, the "best fit" line is defined as the one whose sum of
squared residuals is a minimum. This method of finding a line is called least squares. This is a very
important point and is explained further on pp. 199-201 of your textbook!
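The least-squares line has a closed-form solution, b = Sxy / Sxx and a = Ȳ − bX̄. Here is a minimal Python sketch of that computation on invented data:

```python
# Least-squares estimates for the line Y = a + bX, using the standard
# closed-form solution b = Sxy / Sxx, a = y_bar - b * x_bar.
# The data are invented purely for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.1, 5.9, 8.2, 9.9]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = Sxy / Sxx          # slope: change in Y per one-unit increase in X
a = y_bar - b * x_bar  # intercept: value of Y when X = 0

# The residuals e_i = Y_i - (a + b * X_i), and the quantity that
# least squares minimizes: the sum of the squared residuals.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
ss_resid = sum(e ** 2 for e in residuals)
```

Any other choice of a and b gives a larger sum of squared residuals; that minimality is exactly what "least squares" means. (A useful by-product: the residuals of the least-squares line always sum to zero.)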
This is expressed in the following diagram:
Suppose, for example, that X is the age of a child in months and Y is the height of that child, and
assume further that the X data values range from 1 to 24 months. To predict the height of an
18-month-old child, we plug X = 18 into the fitted equation.
And remember: since correlation measures only a linear relationship, an r close or equal to zero
does not mean that there is no relationship between X and Y.
Let's consider a second measure of the strength of a linear relationship.
Coefficient of Determination
A statistic that is widely used to determine how well a regression line fits a set of (X,Y) data pairs is
the coefficient of determination, R². The coefficient of determination represents the percent of
variability in Y that can be explained by the linear relationship between X and Y. In other words, R²
tells us how much of the variability in the Y's can be explained by the fact that they are related to X
(i.e., how close the points lie to the regression line).
X (treadmill time)    Y (ski time)
7.7                   71.0
8.4                   71.4
8.7                   65.0
9.0                   68.7
9.6                   64.4
9.6                   69.4
10.0                  63.0
10.2                  64.6
10.4                  66.9
11.0                  62.6
11.7                  61.7
It appears that a negative linear relationship exists between treadmill time (X) and ski time (Y).
STEP 2: ESTIMATE THE LEAST-SQUARES REGRESSION LINE
Since the p-value is less than 0.05, Ho is rejected and you can conclude that the slope is significantly
different from 0; thus a statistically significant linear relationship exists between X and Y.
Remember, if the true slope could reasonably equal 0 (failing to reject Ho), then there is no
statistically significant linear relationship between X and Y. If we reject Ho, then we conclude
that a statistically significant linear relationship does exist between X and Y.
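As a sketch of the test being reported here (assuming the usual t-test for the slope, with t = b / SE(b); the comparison value 2.262 is the two-sided t table value for n − 2 = 9 degrees of freedom at the 0.05 level), the computation on the treadmill/ski data looks like:

```python
# t-test for the slope on the treadmill (X) / ski (Y) data above.
# Ho: true slope = 0  vs  Ha: true slope != 0.
from math import sqrt

X = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
Y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
Syy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = Sxy / Sxx                        # estimated slope
ss_resid = Syy - b * Sxy             # sum of squared residuals
s = sqrt(ss_resid / (n - 2))         # residual standard deviation
se_b = s / sqrt(Sxx)                 # standard error of the slope

t = b / se_b
print(round(t, 2))                   # roughly -3.95, well past -2.262
if abs(t) > 2.262:                   # t table value, df = 9, alpha = 0.05
    print("Reject Ho: statistically significant linear relationship")
```

Since |t| ≈ 3.95 exceeds 2.262, the p-value is below 0.05, matching the conclusion above.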
STEP 4: STRENGTH OF THE LINEAR RELATIONSHIP
Again, from the above output, R2 = 63.4%, thus you can conclude that 63.4% of the variability in Y
can be "explained" by the linear relationship between X and Y.
Correlation (r) can be determined by taking the square root of R²:
SQRT(0.634) = 0.80
But since the linear relationship is negative, r must carry a negative sign, so
r = −0.80
This is considered a strong negative linear relationship between X and Y.
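These quoted values can be reproduced from the raw data in the table above; here is a minimal Python sketch. Computing r directly from Sxy gives the sign automatically, so no manual sign flip is needed.

```python
# Reproducing R-squared and r from the raw treadmill (X) / ski (Y) data.
from math import sqrt

X = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
Y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
Syy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

r = Sxy / sqrt(Sxx * Syy)   # r computed this way already carries the sign
r_squared = r ** 2

print(f"R-squared = {r_squared:.3f}")   # 0.634 -> the 63.4% in the output
print(f"r = {r:.2f}")                   # -0.80, negative because Sxy < 0
```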
STEP 5: CHECK ASSUMPTIONS
The residuals appear to be normal since the Normality plot shows an approximate straight line.
Since there seems to be a random pattern of the standardized residuals, it appears the linear relationship
between X and Y is appropriate, and there are no higher order relationships here. In addition, there
appears to be constant variance across the graph. In other words, the points are approximately equally
spread up and down across the graph.
STEP 6: CONFIDENCE AND PREDICTION INTERVALS FOR THE MEAN RESPONSE
The following graph shows confidence and prediction bands for the mean response (Y) across the
range of X values.
Notice how far the confidence and prediction bands are from the estimated regression line. This is
partly due to the small set of points we used in this example. Remember, more data, less error!
This is how the output looks if you want an estimate and prediction for a specific value of X, for
example, let X*=8.5.
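As a sketch of what lies behind that output (assuming the usual interval formulas; t = 2.262 is the two-sided t table value for n − 2 = 9 degrees of freedom at the 95% level), the intervals at X* = 8.5 can be computed from the treadmill/ski data:

```python
# 95% confidence interval (mean response) and prediction interval
# (single new response) at X* = 8.5 for the treadmill (X) / ski (Y) data.
from math import sqrt

X = [7.7, 8.4, 8.7, 9.0, 9.6, 9.6, 10.0, 10.2, 10.4, 11.0, 11.7]
Y = [71.0, 71.4, 65.0, 68.7, 64.4, 69.4, 63.0, 64.6, 66.9, 62.6, 61.7]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
Syy = sum((y - y_bar) ** 2 for y in Y)
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

b = Sxy / Sxx                         # estimated slope
a = y_bar - b * x_bar                 # estimated intercept
s = sqrt((Syy - b * Sxy) / (n - 2))   # residual standard deviation

x_star = 8.5
y_hat = a + b * x_star                # point estimate of Y at X* = 8.5
t = 2.262                             # t table value, df = 9, 95% level

# Confidence interval for the MEAN response at X*
ci_half = t * s * sqrt(1 / n + (x_star - x_bar) ** 2 / Sxx)
# Prediction interval for a SINGLE new response at X* -- the extra "1 +"
# term is why prediction bands are always wider than confidence bands
pi_half = t * s * sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / Sxx)

print(f"estimate at X* = 8.5: {y_hat:.2f}")
print(f"95% CI: ({y_hat - ci_half:.2f}, {y_hat + ci_half:.2f})")
print(f"95% PI: ({y_hat - pi_half:.2f}, {y_hat + pi_half:.2f})")
```

Both intervals widen as X* moves away from the mean of the X data, which is why the bands in the graph bow outward at the edges.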