Chapter 9
Regression Methods
ANOVA analysis using single marker genotypes
ANOVA analysis using multiple marker genotypes
Regression on QTL probability, conditional on marker haplotypes
Haley-Knott regression
Regression of phenotype on marker type
Maximum Likelihood estimation
Comparison of likelihood and regression procedures
Multiple regression on marker genotypes
Interval mapping with marker co-factors (composite interval mapping)
Precision of mapping and hypothesis testing
Permutation testing
Bootstrapping
Accounting for multiple testing
References
In this chapter we discuss regression analysis and maximum likelihood methods for QTL
mapping in more detail. Regression methods are generally much easier to use (standard
software such as SAS or ASREML can be used) and are much faster computationally.
Maximum likelihood is computationally more demanding and requires specific software.
For many designs, its results are very similar to those of regression. This makes
regression analysis attractive, as it is fast enough to be used in resampling methods.
Resampling methods are used to determine the distribution of test statistics for
hypothesis testing. In this chapter we discuss bootstrapping and permutation tests.
We will also discuss QTL mapping with multiple markers (more than 2) and methods to
account for more than one QTL. Accounting for other QTL has been proposed by
including cofactors, or by using composite interval mapping.
There are two classes of methods that are not discussed in this chapter: mixed model
methods and Markov chain Monte Carlo (MCMC) methods. In both methods, QTLs
are modeled either as fixed or as random effects, and additional random effects can
account for polygenic variation. Combined segregation and linkage analysis is needed to
infer QTL genotype probabilities from marker data.
Both methods are useful in ‘complex pedigrees’, typical in animal breeding data
from outbred populations. When line crosses are analysed, or half sib families ignoring
relationships across families, such methods are less relevant, and they have not been
extensively used in QTL detection studies. In most animal breeding applications,
however, such methods are typically needed in genetic evaluations including QTLs.
We will discuss mixed model methods including QTL effects in chapters 17 and 18.
Chapter 9 Methods for QTL analysis
Regression Methods
y = µ + MG1 + e
This is a multiple regression model, and markers can drop out of the model if they are not
significant. The set of markers that is significant in the final analysis points to the
existence of a significant QTL effect (or more than one, depending on how far the markers are apart).
The analysis does not take into account any recombination rates between markers, or
between QTL and markers. In that sense it is comparable with regression on single
marker genotype. The multiple marker method is more powerful than single marker
analysis, and when the markers are well spread over the genome, it is better able to
distinguish the position of the QTL. Normally, after detection of such a location, analysis
with interval mapping would be recommended.
For a given marker genotype, or marker haplotype that was inherited from the sire, we
can calculate the probability for having inherited the Q or the q allele. It seems therefore
natural to regress phenotype on Q-probability. The model is
y = µ + α.x + e
The coefficients for x are obtained as in Chapter 7 (Table 4). For each QTL position, the
residual sum of squares (SSE) can be determined, and the estimate of the QTL position is
the position where SSE is at its minimum. This is interval mapping (see Chapter 7).
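The scan over QTL positions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Chapter 7 itself: a half sib family in which the sire carries the haplotypes M1-Q-M2 and m1-q-m2, no interference, and a known recombination fraction theta between the two markers. The function names (q_prob, simple_regression, scan_interval) and the evenly spaced grid of positions are hypothetical choices made for this example.

```python
def q_prob(m1, m2, r1, r2):
    """P(Q inherited | paternal marker alleles) for a sire with haplotypes
    M1-Q-M2 / m1-q-m2. m1 and m2 are 1 if the M allele was inherited, else 0.
    r1, r2: recombination fractions M1-QTL and QTL-M2 (assumed, no interference)."""
    theta = r1 + r2 - 2 * r1 * r2          # recombination between the markers
    p_left = (1 - r1) if m1 else r1        # P(Q and observed allele at marker 1)
    p_right = (1 - r2) if m2 else r2       # same for marker 2
    p_markers = (1 - theta) if m1 == m2 else theta
    return p_left * p_right / p_markers

def simple_regression(x, y):
    """Least squares fit of y = mu + alpha * x + e; returns (mu, alpha, SSE)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    mu = my - alpha * mx
    sse = sum((yi - mu - alpha * xi) ** 2 for xi, yi in zip(x, y))
    return mu, alpha, sse

def scan_interval(markers, y, theta, steps=20):
    """Try a grid of QTL positions inside the bracket and keep the one
    with the smallest SSE. Returns (r1, mu, alpha, SSE)."""
    best = None
    for k in range(1, steps):
        r1 = theta * k / steps
        r2 = (theta - r1) / (1 - 2 * r1)   # so that theta = r1 + r2 - 2*r1*r2
        x = [q_prob(m1, m2, r1, r2) for m1, m2 in markers]
        mu, alpha, sse = simple_regression(x, y)
        if best is None or sse < best[3]:
            best = (r1, mu, alpha, sse)
    return best
```

Here markers is a list of (m1, m2) pairs, one per offspring, and y the matching list of phenotypes. The grid steps evenly in recombination fraction; stepping in centimorgans would be an equally reasonable choice.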
Haley-Knott regression
Haley and Knott (1992) have proposed a slight reparameterization from the previous
model, but the principle is similar. Rather than dealing with marker haplotypes, they
present a more general model where QTL genotypes are dependent on marker genotypes.
The probability of carrying a certain QTL genotype depends on the marker genotypes and
the design
y = µ + α.x1 + βx2 + e
x1 and x2 are probabilities for QTL genotypes conditional on the flanking marker genotypes.
The regression coefficients α and β represent the difference between the homozygote
QTL genotypes, and the QTL dominance effect, respectively.
Haley and Knott are well known for their proposed regression model, but an important
result from their paper was the similarity that was shown with maximum likelihood. They
proposed to use the following test statistic, indicated as ‘approximate Likelihood ratio
test’:
$$LR = n \ln\left(\frac{SSE_{reduced}}{SSE_{full}}\right) = -n \ln(1 - r^2)$$
This statistic is based on the ratio of the residual sums of squares in a model without the
QTL (‘reduced’) and a model with it (‘full’). The term $r^2$ is the usual R-squared, used for
the percentage of variance explained by the model (only applicable if there are no other
fixed effects).
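As a small numerical check of the two forms of the approximate LR statistic, the following Python sketch computes it once from the residual sums of squares and once from R-squared. The function names and the example numbers are hypothetical.

```python
import math

def approx_lr(sse_reduced, sse_full, n):
    """Approximate likelihood ratio statistic from residual sums of squares."""
    return n * math.log(sse_reduced / sse_full)

def approx_lr_from_r2(r2, n):
    """Same statistic written in terms of R-squared (no other fixed effects)."""
    return -n * math.log(1 - r2)

# Example with made-up numbers: the QTL model reduces SSE from 100 to 80.
n = 50
sse_reduced, sse_full = 100.0, 80.0
r2 = 1 - sse_full / sse_reduced
lr_a = approx_lr(sse_reduced, sse_full, n)
lr_b = approx_lr_from_r2(r2, n)
```

Both calls return the same value, because r2 = 1 - SSE_full / SSE_reduced when the reduced model contains only the mean.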
Whittaker et al. (1996) have shown that direct regression of phenotype on marker type
provides the same information about QTL location and effect, without having to step
through all positions in the interval.
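y = µ + β1.xL + β2.xR + e
where xL and xR are the genotypes at the left and right flanking marker, and β1 and β2 are
the corresponding regression coefficients. The recombination fraction r1 between the left
marker and the QTL, and the QTL effect α, follow from these coefficients as
$$r_1 = 0.5\left[1 - \sqrt{1 - \frac{4\beta_2\theta(1-\theta)}{\beta_2 + \beta_1(1-2\theta)}}\right]$$
$$\alpha = \sqrt{\frac{[\beta_1 + (1-2\theta)\beta_2]\,[\beta_2 + (1-2\theta)\beta_1]}{1-2\theta}}$$
where θ = r1 + r2(1 − 2r1) is the recombination fraction between the two markers and r2
is the recombination fraction between the QTL and the right marker. Hence a single
analysis can give the same result as a complete interval mapping. Note that this assumes
that there are no QTL in the neighbouring marker brackets.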
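The relation between the marker regression coefficients and the QTL parameters can be checked numerically with a short Python sketch. The function marker_coefficients is an assumption added for this illustration: it gives the expected partial regression coefficients when marker and QTL alleles are coded as indicator variables with equal variance and there is no interference. The second function applies the two formulas for r1 and α. All names are hypothetical.

```python
import math

def marker_coefficients(alpha, r1, r2):
    """Expected coefficients (beta1, beta2) on the left and right marker for a
    single QTL with effect alpha inside the bracket (assumed coding, see text)."""
    theta = r1 + r2 - 2 * r1 * r2
    denom = 4 * theta * (1 - theta)
    beta1 = alpha * (1 - 2 * r1) * 4 * r2 * (1 - r2) / denom
    beta2 = alpha * (1 - 2 * r2) * 4 * r1 * (1 - r1) / denom
    return beta1, beta2, theta

def qtl_from_coefficients(beta1, beta2, theta):
    """Recover r1 and alpha from the two marker regression coefficients."""
    inner = 4 * beta2 * theta * (1 - theta) / (beta2 + beta1 * (1 - 2 * theta))
    r1 = 0.5 * (1 - math.sqrt(1 - inner))
    alpha_sq = ((beta1 + (1 - 2 * theta) * beta2)
                * (beta2 + (1 - 2 * theta) * beta1) / (1 - 2 * theta))
    return r1, math.sqrt(alpha_sq)   # the square root returns |alpha|

# Round trip: start from a known QTL, derive the betas, recover the QTL.
b1, b2, theta = marker_coefficients(alpha=0.4, r1=0.05, r2=0.12)
r1_hat, alpha_hat = qtl_from_coefficients(b1, b2, theta)
```

Because the effect is obtained through a square root, its sign has to be taken from the sign of the regression coefficients.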
Maximum Likelihood estimation
In these notes, we will not discuss the details of a maximum likelihood analysis
(interested readers are referred to Lynch and Walsh (1998)). Only the principle is given
here.
We have a probability of observing certain data (y) for a given set of parameters (θ):
F(y) = P(y|θ)
This function F is referred to as the probability density function (pdf). For example, if we
take normally distributed observations and the simplest model, with a mean (µ) and
standard deviation (σ), the pdf looks like:
$$f(y_i \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}} \qquad [2]$$
The likelihood is the probability of certain parameters, given the observed data: L(θ| y).
We can use the same function for this, e.g.
$$L(\mu, \sigma \mid y_i) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\frac{(y_i-\mu)^2}{\sigma^2}}$$
The total likelihood of data set y is calculated as the product of all likelihoods for each
observation.
L( µ, σ| y) = Πi L(µ, σ|yi)
As these likelihoods can become very small numbers, it is better to work with the
log-likelihood.
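Written out for the normal model in [2], taking natural logarithms turns the product into a sum. This is a standard result, given here only for convenience, with n the number of observations:

```latex
\ln L(\mu, \sigma \mid y)
  = \sum_{i=1}^{n} \ln L(\mu, \sigma \mid y_i)
  = -n \ln\left(\sigma\sqrt{2\pi}\right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2
```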
Also for an alternative model, e.g. with a QTL effect, we may have different means.
A new set of parameters is then (µ1, µ2, α, and σ) and we can write the likelihood.
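$$L(\mu_1, \mu_2, \sigma \mid y_i) = P(\mu_1)\,\frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\frac{(y_i-\mu_1)^2}{\sigma^2}} + P(\mu_2)\,\frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\frac{(y_i-\mu_2)^2}{\sigma^2}} \qquad [3]$$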
Typically, in QTL analysis, we are not sure about the QTL genotype, i.e. whether an
observation belongs to the Q-mean or to the q-mean. The likelihood is therefore calculated
as the sum of the two possibilities, each weighted by its probability, P(µi).
The estimates of the model parameters are those values for which the likelihood
is at its maximum. The maximum can be found using maximization routines (EM;
Newton-Raphson; NAG libraries).
A test of significance is obtained by comparing the maximum likelihood with the
likelihood of a model with the tested parameter omitted (reduced model).
The reduced model refers to the null-hypothesis, e.g. "there is no QTL effect"
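The mixture likelihood in [3] and the comparison of a full and a reduced model can be sketched in Python as follows. This is only an illustration: parameter values are supplied by hand rather than maximized, the test statistic is the usual LR = 2(ln L_full − ln L_reduced), and all function names and example data are hypothetical.

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal density as in [2]."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_loglik(y, p_q, mu_q, mu_q_small, sigma):
    """Log-likelihood of the data under [3]: each observation is a mixture of
    the Q-mean and the q-mean, weighted by P(Q) and 1 - P(Q)."""
    total = 0.0
    for yi, pi in zip(y, p_q):
        lik = pi * normal_pdf(yi, mu_q, sigma) \
            + (1 - pi) * normal_pdf(yi, mu_q_small, sigma)
        total += math.log(lik)
    return total

def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio test statistic."""
    return 2 * (loglik_full - loglik_reduced)

# Made-up example: four offspring with their P(Q) given the marker haplotype.
y = [49.8, 50.1, 50.6, 50.3]
p_q = [0.1, 0.2, 0.9, 0.8]
ll_full = mixture_loglik(y, p_q, 50.443, 50.057, 0.3)   # QTL model
ll_h0 = mixture_loglik(y, p_q, 50.2, 50.2, 0.3)         # no QTL: equal means
lr = lr_statistic(ll_full, ll_h0)
```

Setting both means equal makes the mixture collapse to a single normal distribution, which is how the reduced model is obtained here. In a real analysis each model would be fitted at its own maximum before the two log-likelihoods are compared.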
In QTL analysis the data consist not only of phenotypic observations of performance,
but also of marker genotypes.
Using the example from Chapter 7, where we looked at a half sib family with known
paternal marker haplotypes, we can calculate the probability of having inherited the
paternal QTL alleles for each of the four marker haplotypes (given the
recombination fractions, i.e. for a given QTL position).
If the dam alleles are fixed there are only two possible QTL genotypes, hence we can
calculate the likelihood for each observation as in [3]. If the dam alleles are not fixed, we
would have to sum over all three possibilities.
In a simple fixed effects model, the ML estimates of the fixed effect parameters are equal to
the LS estimates of the fixed effects. Hence, for a given QTL position we can calculate µ
and α from a regression as in [1] and subsequently calculate the likelihood as in [3].
The following Table shows a likelihood calculation of the example as in Chapter 7, for
the QTL position M1-Q = 0.1
The likelihood is calculated according to [3] using σ², and the two means are
µQ = µ + α = 50.443 and µq = µ = 50.057; the weights are P(Q) and 1 − P(Q),
where P(Q) is given for each individual in the third column of the Table.
The sum of the Log Likelihood over the whole data for the H0-model = -5.964
Comparison of likelihood and regression procedures
The difference between maximum likelihood and regression is that the latter method
assumes normality within a marker group, i.e. there is a homogeneous variance within a
marker group (errors only due to e). Maximum likelihood accounts for the fact that
within a marker group, some animals have obtained a q and some have obtained a Q,
hence there are actually two distributions. The fact that the test statistics are in practice
very similar shows that accounting for this bimodality within marker genotypes is not
very important. Most of the variation is explained by the differences between the
marker genotypes. Xu (1995) shows that the regression method is somewhat biased: it
overestimates the residual variance, and therefore tends to give lower values for the
approximate LR test. This bias is larger if the difference between Q and q is larger, and
when there is less certainty about which QTL allele was inherited. The largest differences
between the two methods will be found in the middle of a marker bracket, where there is
most uncertainty about which QTL allele was inherited.
Xu’s suggested correction is
$$\sigma^2_{e,\,corrected} = \sigma^2_e - a^2 \sum_{i=1}^{4} p_i (1 - p_i)$$
where pi is the probability of having inherited Q in marker genotype class i and a is the
regression coefficient on Q-probability in the regression model. Generally, this
adjustment has only a small effect, unless the QTL effect is very large and markers are
far from the QTL position.
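To get a feeling for the size of this adjustment, the correction can be evaluated for a few values. The Python sketch below simply implements the formula as printed above; the function name and the numbers are hypothetical.

```python
def xu_corrected_variance(sigma2_e, a, p):
    """Residual variance corrected as in the formula above.
    sigma2_e: residual variance from the regression
    a: regression coefficient on Q-probability
    p: P(Q) for each of the marker genotype classes"""
    return sigma2_e - a ** 2 * sum(pi * (1 - pi) for pi in p)

# Made-up example: four marker classes, QTL effect of half a residual SD.
p_classes = [0.98, 0.79, 0.21, 0.02]
corrected = xu_corrected_variance(1.0, 0.5, p_classes)
```

The term pi(1 − pi) is largest when pi is close to 0.5, which is why the correction matters most for the recombinant marker classes and for positions in the middle of a bracket.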
A few approaches have been proposed to avoid effects of additional linked QTL.
Jansen (1993) proposed an interval mapping approach where additional markers were
included in the model as cofactors. Such an additional QTL (say QTL2) can be
accounted for if there is information about additional markers (outside the bracket) that
are linked to QTL2. This analysis is also referred to as composite interval mapping (CIM)
(Zeng, 1994). Regression is on the additional marker genotypes; hence, additional
QTL are accounted for as if they were at the marker locus.
Several authors have shown that composite interval mapping gives a large increase in
power, and much more precision in estimating QTL position.
As we discussed earlier in this chapter, Whittaker et al. (1996) found that the regression
coefficients for two adjacent markers contain all information about the position and effect
of a QTL between those markers. If the QTL is isolated, i.e. there are no QTL in the
adjacent brackets, then these regression coefficients cannot be biased by other QTL
outside the bracket. However, no distinction can be made between one or more QTL
within the bracket. Hence, the position estimate within a marker bracket is only unbiased
if there is only one QTL. If there are more QTL within the bracket, we cannot estimate
their positions.
Rather than accounting for more QTL as in [5], we can also account for them with the
following model:
y = p(QTL1| M1M2) + p(QTL2| other markers near QTL2) [5]
Hence, this refers to a multiple interval mapping procedure (Kao et al., 1999).
Some problems here can be that 1) not all markers are informative, especially not in
outbred populations 2) it is hard to search for the best fitting model (set of positions) as
there are many combinations possible with multiple QTL.
The problem of multiple QTL will be further dealt with in chapter 10.
Precision of mapping and hypothesis testing
Maximum likelihood estimates are approximately normally distributed for large sample
sizes, and confidence intervals can be based on the sampling variances. However, these
are often not so easy to obtain.
Approximate 95% confidence intervals for QTL position can be constructed using the
‘one-LOD rule’ (Lander and Botstein, 1989). All QTL positions with a LOD score that is
less than 1 below the maximum fall within this confidence interval. Note that 1 LOD score
corresponds to an LR value of 4.61 (LR = 2 ln(10) × LOD), which has a significance value
of about 3% for the $\chi^2_1$-distribution.
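The conversion between LOD and LR, and the matching tail probability of a chi-square distribution with one degree of freedom, can be verified with the Python standard library. The function names are hypothetical; the tail probability uses the identity P(χ²₁ > x) = erfc(√(x/2)).

```python
import math

def lod_to_lr(lod):
    """LR = 2 ln(10) * LOD, since LOD is a base-10 log of the likelihood ratio."""
    return 2 * math.log(10) * lod

def chi2_1df_tail(x):
    """Upper tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2))

lr_one_lod = lod_to_lr(1.0)            # LR value of a drop of one LOD unit
p_one_lod = chi2_1df_tail(lr_one_lod)  # its nominal significance level
```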
LR tests have a $\chi^2_{df}$-distribution, where df refers to the degrees of freedom of the tested
parameter (i.e. the difference in df between the full model and the restricted model).
In QTL analysis, this statistic provides only an approximate test, as the null hypothesis
involves a non-mixture distribution whereas the QTL model involves a mixture
distribution.
Regression analysis also provides only approximate test statistics, as it assumes normally
distributed errors within marker type, whereas the distribution is really a mixture of two
(or three) distributions.
Simulation studies have been used to examine distributions of test statistics, or to
determine threshold values. However, such studies rely on the true data having the same
distribution as the simulated data.
Permutation testing
Bootstrapping
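Accounting for multiple testing
When many tests are carried out, the chance of at least one false positive increases.
For example, 25 independent tests with a significance level of 1% give a probability of
22% of finding at least one false positive. This probability is nearly one for a few
hundred tests. To obtain an overall false positive rate γ over n independent tests, the
significance level α for each single test should be
$$\alpha = 1 - (1 - \gamma)^{1/n} \approx \gamma / n$$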
Hence, for 200 tests we would need a significance level of 0.05/200 = 0.00025 to have a
chance of false positives of about 5%. Usually, a significance level of around 0.1% is
applied.
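These numbers are easy to reproduce. The following Python sketch assumes independent tests, and the function names are hypothetical.

```python
def familywise_error(alpha, n):
    """Probability of at least one false positive in n independent tests,
    each carried out at significance level alpha."""
    return 1 - (1 - alpha) ** n

def per_test_level(gamma, n):
    """Significance level per test that gives an overall error rate gamma."""
    return 1 - (1 - gamma) ** (1 / n)

fwe_25 = familywise_error(0.01, 25)      # 25 tests at the 1% level
alpha_200 = per_test_level(0.05, 200)    # exact per-test level for 200 tests
approx_200 = 0.05 / 200                  # the simpler approximation gamma / n
```

The exact per-test level and the approximation γ/n differ only in the third significant digit here, which is why the simpler division is normally used.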
However, the theoretical distributions of test statistics from standard analyses are usually
not valid. Empirical threshold values obtained by permutation testing are more reliable.
Permutation testing can also be used to obtain genome-wide significance levels, by simply
repeating the procedure across all markers.
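A permutation test for an empirical threshold can be sketched as follows, in the spirit of Churchill and Doerge (1994): phenotypes are shuffled relative to the marker genotypes, the largest test statistic over all markers is recorded for each shuffle, and the threshold is taken as an upper percentile of these maxima. The Python code below is a minimal illustration, not a reference implementation. It assumes single-marker regression with the approximate LR statistic, genotypes coded 0/1, and hypothetical function names.

```python
import math
import random

def lr_single_marker(x, y):
    """Approximate LR, n * ln(SST / SSE), for regression of y on one marker."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    sse = sst - sxy ** 2 / sxx           # residual SS of the simple regression
    return n * math.log(sst / sse)

def max_lr(genotypes, y):
    """Largest statistic over all markers (one list of genotypes per marker)."""
    return max(lr_single_marker(col, y) for col in genotypes)

def permutation_threshold(genotypes, y, n_perm=1000, level=0.05, seed=1):
    """Empirical genome-wide threshold: the (1 - level) quantile of the
    maximum statistic over permutations of the phenotypes."""
    rng = random.Random(seed)
    y_perm = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)              # breaks any marker-phenotype association
        null_max.append(max_lr(genotypes, y_perm))
    null_max.sort()
    idx = min(n_perm - 1, math.ceil((1 - level) * n_perm) - 1)
    return null_max[idx]
```

An observed statistic would then be declared significant if it exceeds the returned threshold. Because the maximum is taken over all markers within each permutation, the threshold accounts for the number of tests and for the correlation between linked markers.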
References
Churchill, G.A. and R.W. Doerge. 1994. Empirical threshold values for quantitative trait
mapping. Genetics 138:963-971.
Haley, C.S. and S.A. Knott. 1992. A simple regression method for mapping quantitative
trait loci in line crosses using flanking markers. Heredity 69:315-324.
Jansen, R.C. 1993. Interval mapping of multiple quantitative trait loci. Genetics 135:205-211.
Kao, C.H., Z-B. Zeng, and R.D. Teasdale. 1999. Multiple interval mapping for
quantitative trait loci. Genetics 152:1203-1216.
Kearsey, M.J. and V. Hyne. 1994. QTL analysis: a simple ‘marker regression’ approach.
Theor. Appl. Genet. 698-702.
Lynch, M. and B. Walsh. 1998. Genetics and analysis of quantitative traits. Sinauer
Associates Inc. ISBN 0-87893-481-2.
Visscher, P.M., R. Thompson, and C.S. Haley. 1996. Confidence intervals in QTL mapping by
bootstrapping. Genetics 143:1013-1020.
Whittaker, J.C., R. Thompson, and P. Visscher. 1996. On the mapping of QTL by
regression of phenotype on marker type. Heredity 77:23-32.
Xu, S. 1995. A comment on the simple regression method for interval mapping. Genetics
141:1657-1659.
Zeng, Z-B. 1994. Precision mapping of quantitative trait loci. Genetics 136:1457-14.