
# Term Project Details

For this semester, we are going to use the concepts in Chapter 10 (Correlation
and Linear Regression) in a statistical study. Your project should follow the
steps below, and your report should include details of how you performed each step.
1. Find a statistical question to answer.
In this case, the question should be about the correlation between two sets of
data such that one (y) depends on the other (x). For example, a statistical
question would be: is there a correlation between age (x) and, say, cholesterol
level (y)? Note that cholesterol level (y) depends on age (x), not the other way around.
(Remember that correlation DOES NOT NECESSARILY mean causation).
Clearly state your question and elaborate on what correlation you are going to
analyze. A project is worthless if the readers have no clue what it is that you are
trying to find and analyze.
2. Come up with your own hypothesis.
This is where you make a guess about what you are likely to find. For example,
a hypothesis for age vs. cholesterol may be that the older a person is, the
higher the cholesterol level. Your analysis of the data will either support or
contradict your early conclusion, i.e., your hypothesis.
3. Collect data.
Since we are going to use Chapter 4 material exclusively, make sure that the
data you choose are, generally speaking, linearly related. Remember that you
are going to analyze the data to find out whether they are linearly correlated
and, if so, come up with the linear regression line that best fits the data. The
equation of the line (y=mx+b) is then the model for the data: you can plug an x
value into that equation to find y. Therefore, if the data (x vs. y) are, say,
nonlinear, trying to do a linear analysis would be foolish. So, if you plot your
data in a scatterplot and the data do not look linear, find a different question
to answer, one for which the data are somewhat linear.
Make sure you have enough data. If you are going to draw a conclusion about the
whole American population and all you have is 20 data points, it will be hard to
impress/convince anyone with your analysis.
Give the source of your data.

4. Organize and summarize the data.
Use tables and charts. Use Chapter 3 knowledge to find/calculate 1-variable
summaries, i.e., shape, outliers, center (mean), and variability (standard deviation) of
the data. You will do this for the dataset x and dataset y separately.
You will need to include the raw data and the 1-variable summaries for the x and
y datasets in the main report or in an appendix.
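As a rough illustration of the 1-variable summaries in Step 4, here is a minimal Python sketch. The data values are hypothetical placeholders, not real measurements, and the function name `summarize` is an assumption made for this example. Outliers are flagged with the common 1.5 x IQR fence rule; Python's quartile method may differ slightly from the one in your textbook, so check your numbers against the Chapter 3 procedure.

```python
import statistics


def summarize(values):
    """Return basic 1-variable summaries for a list of numbers."""
    # Quartiles from the standard library (its default method may
    # differ slightly from the textbook's hand method).
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "n": len(values),
        "mean": statistics.mean(values),       # center
        "median": statistics.median(values),   # compare with mean to judge skew
        "stdev": statistics.stdev(values),     # sample standard deviation
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical illustrative data (NOT real measurements).
ages = [25, 32, 41, 48, 55, 63]
cholesterol = [180, 195, 200, 215, 230, 240]

# Summarize x and y separately, as Step 4 asks.
for name, data in (("x (age)", ages), ("y (cholesterol)", cholesterol)):
    print(name, summarize(data))
```

A mean that sits well above or below the median is a quick hint that the distribution is skewed, which is worth mentioning when you describe the shape.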
5. Plot x vs. y and calculate the linear correlation coefficient, r.
Compare your calculated r to the critical value of r from Table II in the
Appendix of your textbook.
If the absolute value of your calculated r is greater than the Table II r, then
a significant linear correlation exists in your data set, and you are okay to
proceed with linear regression analysis to come up with the best-fit line and an
equation (y=mx+b) for the line. Finding the equation means calculating the slope
m and the intercept b.
Show the chart with all the points, the best-fit line, and the equation.
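The calculation in Step 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical, the function name `linear_fit` is made up for this example, and `CRITICAL_R` is an assumed value for n = 6 that you must replace with the Table II entry for your own sample size.

```python
import math


def linear_fit(xs, ys):
    """Return (r, m, b): correlation coefficient, slope, and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)   # linear correlation coefficient
    m = sxy / sxx                    # least-squares slope
    b = mean_y - m * mean_x          # least-squares intercept
    return r, m, b


# Hypothetical illustrative data (NOT real measurements).
ages = [25, 32, 41, 48, 55, 63]
cholesterol = [180, 195, 200, 215, 230, 240]

r, m, b = linear_fit(ages, cholesterol)

# Assumed critical value for n = 6; look up your own n in Table II.
CRITICAL_R = 0.811

print(f"r = {r:.3f}, y = {m:.3f}x + {b:.3f}")
if abs(r) > CRITICAL_R:
    print("Significant linear correlation: proceed to the residual checks.")
else:
    print("No significant linear correlation at this sample size.")
```

The comparison uses the absolute value of r so that a strong negative correlation is also detected.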
6. Plot the residual (observed y - predicted y) vs. x.
Here observed y is the raw data that you collected. Predicted y is calculated
from the best-fit line equation, where you plug in your x and calculate your y.
Before we can conclusively say that the data (x vs. y) are linearly correlated, and
before you can use the best-fit line equation as your model of the data to make
predictions, you need to check two more conditions:
A. If the residual vs. x plot shows a discernible pattern, then the data are NOT
linearly related.
B. If the residual vs. x plot shows the spread of the residuals increasing or
decreasing, then the data are NOT linearly related.
Include in your report the residual vs. x data tables as well as the graphs.
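The residual table for Step 6 could be produced along these lines. This is a sketch, not a required method: the data are hypothetical, the helper name `residuals` is an assumption, and the slope and intercept shown are rounded placeholder values standing in for whatever your own Step 5 fit produces.

```python
def residuals(xs, ys, m, b):
    """Return residuals: observed y minus predicted y for each x."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


# Hypothetical illustrative data (NOT real measurements).
ages = [25, 32, 41, 48, 55, 63]
cholesterol = [180, 195, 200, 215, 230, 240]

# Placeholder slope and intercept; use the values from your own fit.
m, b = 1.57, 140.9

res = residuals(ages, cholesterol, m, b)

# Print the residual vs. x table that goes into the report.
print(f"{'x':>6} {'observed y':>12} {'predicted y':>12} {'residual':>10}")
for x, y, e in zip(ages, cholesterol, res):
    print(f"{x:>6} {y:>12} {m * x + b:>12.2f} {e:>10.2f}")
```

When you plot these residuals against x, look for a curve or other pattern (condition A) and for a fan shape in which the spread widens or narrows (condition B).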
7. Check to see if the linear model assumption is valid.
A. If all 3 criteria in Steps 5 and 6 are met, then a linear correlation exists
between the data and you can use the best-fit line as a model of the data to
make predictions.
B. If ANY of the 3 criteria is NOT met, then the linear model assumption is
wrong and the equation is no good. In this case, the predicted y is always the
average y (the average of the raw y data), no matter what your x is.
State in detail your final conclusion about the data in terms of whether a linear
relation exists. If it does, then give the best-fit line equation with r. Also, discuss how your
data.
8. Make a few predictions.
Remember again, if your linear model assumption is valid, then you can just plug
an x value into the best-fit line equation to predict y. However, if your linear
model assumption is NOT valid, then your predicted y should just be the average y,
since the linear regression equation is not representative of the data.
Make sure you do not make predictions outside your data range, as your linear
regression equation is only good within the data range that you used to come up
with the equation.
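The prediction rule in Step 8 can be written as a small helper. This is a hedged sketch: the function name `predict`, the `model_valid` flag, and the numbers used are all assumptions made for illustration, and the slope and intercept are rounded placeholders rather than results from real data.

```python
def predict(x_new, xs, ys, m, b, model_valid):
    """Predict y for x_new, following the rules in Step 8."""
    # Refuse to extrapolate beyond the x range used to fit the line.
    if not min(xs) <= x_new <= max(xs):
        raise ValueError(
            f"x = {x_new} is outside the data range "
            f"[{min(xs)}, {max(xs)}]; do not extrapolate."
        )
    if model_valid:
        return m * x_new + b       # use the best-fit line
    return sum(ys) / len(ys)       # fall back to the average y


# Hypothetical illustrative data (NOT real measurements).
ages = [25, 32, 41, 48, 55, 63]
cholesterol = [180, 195, 200, 215, 230, 240]
m, b = 1.57, 140.9  # placeholder slope and intercept

print(predict(50, ages, cholesterol, m, b, model_valid=True))
print(predict(50, ages, cholesterol, m, b, model_valid=False))
```

Set `model_valid` according to your Step 7 conclusion; asking for an x outside the data range raises an error instead of returning a number that cannot be trusted.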
9. Afterthought.
In this section, you explain/analyze what you did wrong (if anything!) in your study.
Ask yourself questions like: did you just use convenience or voluntary-response
sampling to collect your data? Did your study suffer from too few data points?
Are you misrepresenting the data? Is your analysis correct? Does your
conclusion make sense?
Also write what you could have done to make your study more worthwhile and
useful to the reader. This part can also go in your reflective writing in the e-portfolio.
The grade for this project is 5%. The term project (a final report) is due on May
1. Please make sure you put in time and effort commensurate with that
to produce a really professional, thorough, well-organized report. And finally, it is
a team project, so make sure everyone contributes and learns from the
experience. Good luck.
e-Portfolio
The last thing to do is to create an e-portfolio website (use WordPress, Weebly, or
Yola).
Then you need to upload the final version of the term project on your e-portfolio
website. It is your responsibility to make sure you get a copy of the report so you