
Linear Models, Assessed Practical

Marco Scutari
November 14, 2014
This practical sheet contains two exercises. Write a report on Exercise 2 only. Any
queries you have about Exercise 1 may be directed to the Teaching Assistants or the
Lecturer. Neither will answer questions regarding Exercise 2 (with the sole exception
of questions relating to a limited number of programming issues).
Several R functions will be suggested and used for the analysis; please use the help
(?fun or help("fun")) to familiarise yourself with them as needed.

Exercise 1: NOT ASSESSED


1. Load the data from R using data(longley); a description of the variables and
their meaning is provided in ?longley. Rename the columns for easy reference:
names(longley) = c("Gd", "G", "U", "AF", "P", "Y", "E")
2. Perform a basic exploratory analysis using partial correlation coefficients, which
we can obtain from the correlation matrix produced by cor() after inverting it
(with solve(), then cov2cor() to rescale it and a change of sign off the diagonal).
3. Use anova() to assess whether the explanatory variables are significant using
ANOVA. Would you select the same model from the output of summary()? If
not, why?
4. Use step() to select an appropriate model, with AIC as the selection criterion.
5. Use ridge regression to fit a model that is robust against collinear explanatory
variables. Look at the regression coefficients for the two models: are they very
different? Look at the standard deviations of the residuals of the classic linear
regression used for ANOVA and ridge regression with sd() and residuals().
Which model is better, and why?
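
As a rough illustration of item 2, the partial correlations can be sketched as
follows. The object names prec and pcor are hypothetical, and the sign change
reflects the standard identity linking partial correlations to the inverse of the
correlation matrix; treat this as one possible approach rather than the expected
solution.

```r
data(longley)
names(longley) = c("Gd", "G", "U", "AF", "P", "Y", "E")

# Invert the correlation matrix (prec is a hypothetical name).
prec = solve(cor(longley))

# Rescale the inverse to unit diagonal, then flip the sign of the
# off-diagonal entries to obtain the partial correlations.
pcor = -cov2cor(prec)
diag(pcor) = 1

round(pcor, 3)
```

Each off-diagonal entry is the correlation between two variables after adjusting
for all the remaining ones, so it can differ markedly from the corresponding entry
of cor(longley).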

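For item 5 of Exercise 1, a minimal sketch of the comparison between ridge and
classic regression is given below. It rests on several assumptions that are not in
the sheet: E is taken as the response, lm.ridge() from the MASS package is used as
the ridge implementation, and lambda = 0.01 is an arbitrary illustrative penalty.
Since lm.ridge() objects have no residuals() method, the ridge residuals are
computed by hand.

```r
library(MASS)  # assumption: lm.ridge() is used; the sheet names no package here

data(longley)
names(longley) = c("Gd", "G", "U", "AF", "P", "Y", "E")

# Assumption: E is the response variable.
ols = lm(E ~ ., data = longley)
# Assumption: lambda = 0.01 is an arbitrary, illustrative penalty.
ridge = lm.ridge(E ~ ., data = longley, lambda = 0.01)

# Coefficients side by side, both on the original scale of the variables.
print(cbind(ols = coef(ols), ridge = coef(ridge)))

# Ridge residuals computed by hand from the design matrix and coefficients.
X = model.matrix(E ~ ., data = longley)
ridge.res = longley$E - drop(X %*% coef(ridge))

# Standard deviations of the residuals of the two fits.
c(ols = sd(residuals(ols)), ridge = sd(ridge.res))
```

Least squares minimises the residual sum of squares by construction, so the ridge
residuals cannot have a smaller standard deviation on the same data; the comparison
is more informative when read together with the stability of the coefficients.
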
Exercise 2: ASSESSED
Human body composition is the distribution of the three components that form
body weight: bone, fat and lean. They identify, respectively, the mineral content,
the fat and the remaining mass of the body (mainly muscles). In a detailed analysis,
body composition is measured separately across trunk, legs, and arms, and it is an
important diagnostic tool since ratios of these masses can reveal regional physiological
disorders. One of the most common ways to measure these masses is dual-energy
X-ray absorptiometry (DXA), which unfortunately is time-consuming and very expensive.
Therefore, there is an interest in alternative protocols that can be used to the same
effect.
The main aim of the report is to predict, in a very simple way, body composition
from related quantities that are much cheaper and easier to measure: age, height,
weight and waist circumference. This should be covered in detail for at least one
variable, trunk fat (TF). To estimate such a model, we will use a sample of 100 white
men from the NHANES project, which includes simultaneous measurements of all the
variables above.
1. Read the data from the boco.txt file and perform a basic exploratory analysis,
using plots and computing correlation coefficients to assess which of A, H, W and
C might be good predictors of each of the body composition variables. Try using
partial correlations for at least one body mass variable (say, TF), and explain why
this approach is preferable to that above.
2. Fit a linear model for TF using A, H, W and C as explanatory variables (main effects
only, no interactions); then comment on it using the output of summary() and
anova().
3. Use step() to perform model selection for TF using BIC, and comment on the selected
model. Why does it make more sense to use BIC than AIC in this case?
4. Ridge regression is preferable to a classic linear regression model when some
explanatory variables are collinear, i.e. strongly correlated. Is that the case
for the models above? Fit a ridge regression model with penalized() (from the
penalized package), describe the required steps and comment on how the regression
coefficients differ from those estimated with lm().
Variable names are as follows: age (A), height (H), weight (W), waist circumference (C),
trunk fat (TF), legs fat (LF), arms fat (AF), trunk lean mass (TL), legs lean mass (LL),
arms lean mass (AL), trunk bone (TB), legs bone (LB), arms bone (AB).
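
Regarding the choice of criterion in step(), the penalty per parameter is set
through its k argument: k = 2 (the default) gives AIC, while k = log(n) gives BIC.
The sketch below illustrates this on the built-in longley data, which stand in
for boco.txt since the layout of that file is not described here; taking E as the
response is likewise an assumption.

```r
data(longley)
names(longley) = c("Gd", "G", "U", "AF", "P", "Y", "E")

# Assumption: E is the response; longley stands in for the boco.txt data.
full = lm(E ~ ., data = longley)
n = nrow(longley)

# Backward selection from the full model; k is the penalty per parameter.
aic.fit = step(full, k = 2, trace = 0)       # k = 2 (the default) gives AIC
bic.fit = step(full, k = log(n), trace = 0)  # k = log(n) gives BIC

formula(aic.fit)
formula(bic.fit)
```

Because log(n) exceeds 2 as soon as n is at least 8, BIC penalises each additional
parameter more heavily than AIC for any realistic sample size.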
