Classification of Microarray Gene Expression Data

Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience,
University of Queensland
“A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions.”

Statistical Methods for Gene Expression: Microarrays and Proteomics
Outline of Talk
• Introduction
• Supervised classification of tissue samples – selection bias
• Unsupervised classification (clustering) of tissues – mixture model-based approach
Vital Statistics
by C. Tilstone
Nature 424, 610-612, 2003.

   
“DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?”

[Figure: Branching out: cluster analysis can group samples that show similar patterns of gene expression.]
MICROARRAY DATA
REPRESENTED by a p × n matrix

(x_1, …, x_n)

x_j contains the gene expressions for the p genes of the jth tissue sample (j = 1, …, n).
p = No. of genes (10^3 - 10^4)
n = No. of tissue samples (10 - 10^2)

STANDARD STATISTICAL METHODOLOGY IS APPROPRIATE FOR n >> p.
HERE p >> n.
Two Groups in Two Dimensions. All cluster information would
be lost by collapsing to the first principal component. The
principal ellipses of the two groups are shown as solid curves.
bioArray News (2, no. 35, 2002)
Arrays Hold Promise for Cancer Diagnostics
Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it is to respond to a certain type of treatment, and how long the patient will probably survive. It would be useful if the gene expression signatures could distinguish between subtypes of tumours that require different treatments but that standard methods, such as histological pathology from a biopsy, fail to discriminate.
van’t Veer & De Jong (2002, Nature Medicine 8)
The microarray way to tailored cancer treatment
In principle, gene activities that determine the
biological behaviour of a tumour are more
likely to reflect its aggressiveness than general
parameters such as tumour size and age of the
patient.
(indistinguishable disease states in diffuse large B-cell
lymphoma unravelled by microarray expression profiles
– Shipp et al., 2002, Nature Med. 8)
Microarray to be used as routine clinical screen
by C. M. Schubert
Nature Medicine 9, 9, 2003.

The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.
Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei.

Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).
CLASSIFICATION OF TISSUES
SUPERVISED CLASSIFICATION
(DISCRIMINANT ANALYSIS)
We OBSERVE the CLASS LABELS y_1, …, y_n, where y_j = i if the jth tissue sample comes from the ith class (i = 1, …, g).
AIM: TO CONSTRUCT A CLASSIFIER C(x) FOR PREDICTING THE UNKNOWN CLASS LABEL y OF A TISSUE SAMPLE x.
e.g. g = 2 classes: G1 - DISEASE-FREE, G2 - METASTASES
LINEAR CLASSIFIER
FORM

C(x) = β_0 + β^T x = β_0 + β_1 x_1 + … + β_p x_p

for the prediction of the group label y of a future entity with feature vector x.
FISHER’S LINEAR DISCRIMINANT FUNCTION

ŷ = sign C(x),

where

β = S^{-1}(x̄_1 - x̄_2),

β_0 = -(1/2)(x̄_1 + x̄_2)^T S^{-1}(x̄_1 - x̄_2),

and x̄_1, x̄_2, and S are the sample means and pooled sample covariance matrix found from the training data.
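A minimal numpy sketch of this rule (the function name and toy data are ours, not the talk's; note it requires n_1 + n_2 - 2 ≥ p so that S is invertible, i.e. not the p >> n setting):

```python
import numpy as np

def fisher_rule(X1, X2):
    """Fisher's linear discriminant from two training samples (n_i x p)."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled sample covariance matrix S.
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)     # S^{-1}(xbar_1 - xbar_2)
    beta0 = -0.5 * (m1 + m2) @ beta        # -(1/2)(xbar_1 + xbar_2)^T beta
    return beta0, beta

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(20, 5))    # class G1 training sample
X2 = rng.normal(1.0, 1.0, size=(20, 5))    # class G2 training sample
beta0, beta = fisher_rule(X1, X2)
x_new = rng.normal(size=5)
print("G1" if beta0 + beta @ x_new > 0 else "G2")
```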
SUPPORT VECTOR CLASSIFIER
Vapnik (1995)

C(x) = β_0 + β_1 x_1 + … + β_p x_p,

where β_0 and β are obtained as follows:

min_{β, β_0} (1/2)‖β‖² + γ Σ_{j=1}^n ξ_j

subject to ξ_j ≥ 0 and y_j C(x_j) ≥ 1 - ξ_j (j = 1, …, n).

Here ξ_1, …, ξ_n are the slack variables, and γ = ∞ gives the separable case.
β̂ = Σ_{j=1}^n α̂_j y_j x_j,

with nonzero α̂_j only for those observations j for which the constraints are exactly met (the support vectors). Hence

C(x) = Σ_{j=1}^n α̂_j y_j x_j^T x + β̂_0 = Σ_{j=1}^n α̂_j y_j ⟨x_j, x⟩ + β̂_0.
Support Vector Machine (SVM)

REPLACE x by h(x):

C(x) = Σ_{j=1}^n α̂_j y_j ⟨h(x_j), h(x)⟩ + β̂_0 = Σ_{j=1}^n α̂_j y_j K(x_j, x) + β̂_0,

where the kernel function K(x_j, x) = ⟨h(x_j), h(x)⟩ is the inner product in the transformed feature space.
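Few would code the dual by hand today; a hedged sketch with scikit-learn's SVC, whose kernel argument plays the role of K (the synthetic data and parameter values are our choices):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))        # toy stand-in: 40 tissues, 200 genes
y = np.repeat([-1, 1], 20)

# kernel="linear" recovers the support vector classifier above;
# kernel="rbf" corresponds to an implicit feature map h(x).
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", clf.support_.size)
print("training accuracy:", clf.score(X, y))
```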
HASTIE et al. (2001, Chapter 12)

The Lagrange primal function is

L_P = (1/2)‖β‖² + γ Σ_{j=1}^n ξ_j - Σ_{j=1}^n α_j {y_j C(x_j) - (1 - ξ_j)} - Σ_{j=1}^n μ_j ξ_j,   (1)

which we minimize w.r.t. β, β_0, and the ξ_j.

Setting the respective derivatives to zero, we get

β = Σ_{j=1}^n α_j y_j x_j,   (2)

0 = Σ_{j=1}^n α_j y_j,   (3)

α_j = γ - μ_j (j = 1, …, n),   (4)

with α_j ≥ 0, μ_j ≥ 0, and ξ_j ≥ 0 (j = 1, …, n).

By substituting (2) to (4) into (1), we obtain the Lagrangian dual function

L_D = Σ_{j=1}^n α_j - (1/2) Σ_{j=1}^n Σ_{k=1}^n α_j α_k y_j y_k x_j^T x_k.   (5)

We maximize (5) subject to 0 ≤ α_j ≤ γ and Σ_{j=1}^n α_j y_j = 0.

In addition to (2) to (4), the constraints include

α_j {y_j C(x_j) - (1 - ξ_j)} = 0,   (6)

μ_j ξ_j = 0,   (7)

y_j C(x_j) - (1 - ξ_j) ≥ 0,   (8)

for j = 1, …, n. Together, equations (2) to (8) uniquely characterize the solution to the primal and dual problem.
Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231.

Discussants include Brad Efron and David Cox.

Selection bias in gene extraction on the basis of microarray gene-expression data
Ambroise and McLachlan
Proceedings of the National Academy of Sciences 99 (10), 6562-6566, May 14, 2002.
http://www.pnas.org/cgi/content/full/99/10/6562
GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning)

• COLON Data (Alon et al., 1999)
• LEUKAEMIA Data (Golub et al., 1999)

Since p >> n, consideration is given to the selection of suitable genes:

SVM: FORWARD or BACKWARD selection (in terms of the magnitude of the weight β_i): RECURSIVE FEATURE ELIMINATION (RFE); see the sketch below.

FISHER: FORWARD selection ONLY (in terms of CVE).
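RFE trains the classifier, ranks genes by |β_i|, drops the lowest-ranked block, and repeats; a minimal sketch with scikit-learn (synthetic data and parameter choices are ours):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 500))         # 60 tissues, 500 genes
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.0                  # make the first 10 genes informative

# Rank genes by the linear-SVM weights; halve the gene set at each step.
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=0.5)
selector.fit(X, y)
print("genes kept:", np.flatnonzero(selector.support_))
```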
GUYON et al. (2002)

LEUKAEMIA DATA: only 2 genes are needed to obtain a zero CVE (cross-validated error rate).

COLON DATA: using only 4 genes, the CVE is 2%.
GUYON et al. (2002)

“The success of the RFE indicates that RFE has a built-in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”
Figure 1: Error rates of the SVM rule with RFE procedure, averaged over 50 random splits of the colon tissue samples.
Figure 2: Error rates of the SVM rule with RFE procedure, averaged over 50 random splits of the leukemia tissue samples.
Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure, using all the colon data.
Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure, using all the leukemia data.
Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumour tissues.
Error Rate Estimation
Suppose there are two groups G1 and G2, and C(x) is a classifier formed from the data set (x_1, x_2, …, x_n). The apparent error is the proportion of the data set misallocated by C(x).
Cross-Validation
From the original data set, remove x_1 to give the reduced set (x_2, x_3, …, x_n). Then form the classifier C^{(1)}(x) from this reduced set, and use C^{(1)}(x_1) to allocate x_1 to either G1 or G2.

Repeat this process for the second data point, x_2, so that this point is assigned to either G1 or G2 on the basis of the classifier C^{(2)}(x_2). And so on, up to x_n.
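Crucially for the selection-bias theme of this talk, any gene selection must be redone inside each leave-one-out fold. A hedged sketch of both the biased and the honest versions (scikit-learn on pure-noise data; parameter choices are ours):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 1000))     # pure noise: the true error rate is 50%
y = np.repeat([0, 1], 20)

# WRONG: select genes on all of the data, then cross-validate the classifier.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=LeaveOneOut())

# RIGHT: redo the gene selection within each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
honest = cross_val_score(pipe, X, y, cv=LeaveOneOut())

print(f"biased CV accuracy: {biased.mean():.2f}")   # optimistically high
print(f"honest CV accuracy: {honest.mean():.2f}")   # near 0.5, as it should be
```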
Figure 1: Error rates of the SVM rule with RFE procedure, averaged over 50 random splits of the colon tissue samples.
ADDITIONAL REFERENCES

Selection bias ignored:
XIONG et al. (2001, Molecular Genetics and Metabolism)
XIONG et al. (2001, Genome Research)
ZHANG et al. (2001, PNAS)

Aware of selection bias:
SPANG et al. (2001, In Silico Biology)
WEST et al. (2001, PNAS)
NGUYEN and ROCKE (2002)
BOOTSTRAP APPROACH
Efron’s (1983, JASA) .632 estimator:

B.632 = .368 AE + .632 B1,

where B1 is the bootstrap error when the rule R*_k is applied to a point not in the training sample. A Monte Carlo estimate of B1 is

B1 = Σ_{j=1}^n E_j / n,

where

E_j = Σ_{k=1}^K I_jk Q_jk / Σ_{k=1}^K I_jk,

with I_jk = 1 if x_j is not in the kth bootstrap sample (0 otherwise), and Q_jk = 1 if R*_k misallocates x_j (0 otherwise).
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

A(w) = (1 - w) AE + w CV2E,

where w = 0.5.

McLachlan (1977) proposed w = w_0, where w_0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w_0 was found to range between 0.6 and 0.7, depending on the values of p, n_1, and n_2.
.632+ estimate of Efron & Tibshirani (1997, JASA):

B.632+ = (1 - w) AE + w B1,

where

w = .632 / (1 - .368 r),

r = (B1 - AE) / (γ - AE)   (relative overfitting rate),

γ = Σ_{i=1}^g p_i (1 - q_i)   (estimate of the no-information error rate).

If r = 0, then w = .632, and so B.632+ = B.632;
if r = 1, then w = 1, and so B.632+ = B1.
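A rough numpy sketch of these formulas wrapped around a 1-NN rule (our choice of rule; class labels are assumed coded 0, 1, and this is a sketch rather than Efron and Tibshirani's implementation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def b632_plus(X, y, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    clf = KNeighborsClassifier(n_neighbors=1)

    # Apparent error AE (zero for 1-NN, since each point is its own neighbour).
    full_pred = clf.fit(X, y).predict(X)
    ae = np.mean(full_pred != y)

    # Monte Carlo estimate of B1 over K bootstrap samples.
    err_sum, out_cnt = np.zeros(n), np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, n)              # bootstrap draw
        out = np.setdiff1d(np.arange(n), idx)    # points not drawn
        if out.size == 0:
            continue
        pred = clf.fit(X[idx], y[idx]).predict(X[out])
        err_sum[out] += pred != y[out]
        out_cnt[out] += 1
    seen = out_cnt > 0
    b1 = np.mean(err_sum[seen] / out_cnt[seen])

    # No-information error rate gamma = sum_i p_i (1 - q_i), overfitting rate r.
    p_i = np.bincount(y) / n                     # class proportions
    q_i = np.bincount(full_pred, minlength=p_i.size) / n
    gamma = np.sum(p_i * (1 - q_i))
    r = float(np.clip((b1 - ae) / (gamma - ae), 0, 1)) if gamma > ae else 0.0
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * ae + w * b1

X = np.random.default_rng(1).normal(size=(50, 10))
y = np.repeat([0, 1], 25)
print(f"B.632+ estimate: {b632_plus(X, y):.3f}")   # near 0.5 for pure noise
```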
One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between. Because tumours are heterogeneous, even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize.
“What we really need are expression profiles from
hundreds or thousands of tumours linked to relevant,
and appropriate, clinical data.”
John Quackenbush
UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS)

INFER the CLASS LABELS y_1, …, y_n of x_1, …, x_n.

Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes (Eisen, Spellman, Brown, & Botstein, 1998, PNAS).

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated, and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.
“in the absence of a well-grounded statistical
model, it seems difficult to define what is
meant by a ‘good’ clustering algorithm or the
‘right’ number of clusters.”
(Yeung et al., 2001, Model-Based Clustering and Data Transformations
for Gene Expression Data, Bioinformatics 17)
Attention is now turning towards a model-based approach to the analysis of microarray data. For example:

• Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9.
• Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18.
• Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7.
• Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3.
• Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17.
The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance, e.g. k-means. Usually, however, there is no a priori metric (or, equivalently, a user-defined distance matrix) for a cluster analysis.

That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space. For example, the clustering is unchanged under the transformation

x = (Height, Weight, BP)^T → (H + W, H - W, BP)^T.
MIXTURE OF g NORMAL COMPONENTS

f(x) = π_1 φ(x; μ_1, Σ_1) + … + π_g φ(x; μ_g, Σ_g),

where

-2 log φ(x; μ, Σ) = (x - μ)^T Σ^{-1} (x - μ) + constant.

The quadratic form (x - μ)^T Σ^{-1} (x - μ) is the squared MAHALANOBIS DISTANCE; with Σ = I it reduces to the squared EUCLIDEAN DISTANCE, (x - μ)^T (x - μ).
MIXTURE OF g NORMAL COMPONENTS

f(x) = π_1 φ(x; μ_1, Σ_1) + … + π_g φ(x; μ_g, Σ_g)

k-means corresponds to the constraint

Σ_1 = … = Σ_g = σ² I:

SPHERICAL CLUSTERS (equal spherical covariance matrices).
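A short illustration of this correspondence with scikit-learn (synthetic elliptical clusters; note that sklearn's "spherical" option fits one variance per component, a slight relaxation of the exact k-means constraint above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
cov = [[4.0, 1.9], [1.9, 1.0]]                 # two elongated clusters
X = np.vstack([rng.multivariate_normal([0, 0], cov, 150),
               rng.multivariate_normal([4, 1], cov, 150)])

km = KMeans(n_clusters=2, n_init=10).fit(X)                  # spherical bias
gm_sph = GaussianMixture(2, covariance_type="spherical").fit(X)
gm_full = GaussianMixture(2, covariance_type="full").fit(X)  # elliptical fit
print("spherical mixture BIC:", round(gm_sph.bic(X)))
print("full-covariance BIC:  ", round(gm_full.bic(X)))       # lower (better) here
```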
Crab Data
Figure 6: Plot of the crab data.
Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.
With a mixture model-based approach to
clustering, an observation is assigned
outright to the ith cluster if its density in
the ith component of the mixture
distribution (weighted by the prior
probability of that component) is greater
than in the other (g-1) components.

f(x) = π_1 φ(x; μ_1, Σ_1) + … + π_i φ(x; μ_i, Σ_i) + … + π_g φ(x; μ_g, Σ_g)
http://www.maths.uq.edu.au/~gjm

McLachlan and Peel (2000), Finite Mixture Models. Wiley.
Estimation of Mixture Distributions
It was the publication of the seminal paper of
Dempster, Laird, and Rubin (1977) on the
EM algorithm that greatly stimulated interest
in the use of finite mixture distributions to
model heterogeneous data.

McLachlan and Krishnan (1997, Wiley)

• If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities.

• With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.
The advantage of the t mixture model is that,
although the number of outliers needed for
breakdown is almost the same as with the
normal mixture model, the outliers have to
be much larger.
Two Clustering Problems:
• Clustering of genes on the basis of tissues (the genes are not independent)
• Clustering of tissues on the basis of genes; the latter is a nonstandard problem in cluster analysis (n << p)
Mixture Software: EMMIX
McLachlan, Peel, Adams, and Basford (1999)
http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
EMMIX for Windows:
http://www.maths.uq.edu.au/~gjm/EMMIX_Demo/emmix.html

EMMIX PROVIDES A MODEL-BASED APPROACH TO CLUSTERING.

McLachlan, Bean, and Peel (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18, 413-422.
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
Example: Microarray Data
Colon data of Alon et al. (1999): n = 62 tissue samples (40 tumours; 22 normals) of p = 2,000 genes, giving a 2,000 × 62 matrix.
Mixture of 2 normal components
Mixture of 2 t components
Mixture of 2 t components
Mixture of 3 t components
In this process, the genes are being treated anonymously. One may wish to incorporate existing biological information on the function of genes into the selection procedure.

Lottaz and Spang (2003, Proceedings of the 54th Meeting of the ISI) structure the feature space by using a functional grid provided by the Gene Ontology annotations.
Clustering of COLON Data Genes using EMMIX-GENE
Grouping for Colon Data: [heat maps of gene groups 1-20]
Clustering of COLON Data Tissues using EMMIX-GENE
Grouping for Colon Data: [heat maps of gene groups 1-20]
Mixtures of Factor Analyzers
A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data.

One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers (Ghahramani & Hinton, 1997):

f(x_j) = Σ_{i=1}^g π_i φ(x_j; μ_i, Σ_i),

where

Σ_i = B_i B_i^T + D_i   (i = 1, …, g),

B_i is a p × q matrix and D_i is a diagonal matrix.
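The gain is easy to quantify by counting the free covariance parameters per component (a quick sketch; the counts follow directly from the two forms of Σ_i):

```python
# Unrestricted Sigma_i is a symmetric p x p matrix; the factor-analytic
# form B_i B_i^T + D_i needs p*q loadings plus p diagonal entries
# (ignoring the q(q-1)/2 rotational redundancy in B_i).
p, q = 2000, 4                     # e.g. 2,000 genes with q = 4 factors
full = p * (p + 1) // 2
factor = p * q + p
print(f"full covariance:  {full:,} parameters")    # 2,001,000
print(f"factor-analytic:  {factor:,} parameters")  # 10,000
```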
Number of Components
in a Mixture Model
Testing for the number of components,
g, in a mixture is an important but very
difficult problem which has not been
completely resolved.
Order of a Mixture Model
A mixture density with g components might
be empirically indistinguishable from one
with either fewer than g components or
more than g components. It is therefore
sensible in practice to approach the question
of the number of components in a mixture
model in terms of an assessment of the
smallest number of components in the
mixture compatible with the data.
Likelihood Ratio Test Statistic
An obvious way of approaching the problem of testing for the smallest value of the number of components in a mixture model is to use the likelihood ratio test statistic (LRTS), -2 log λ. Suppose we wish to test the null hypothesis

H_0: g = g_0  versus  H_1: g = g_1

for some g_1 > g_0. We let Ψ̂_i denote the MLE of Ψ calculated under H_i (i = 0, 1). Then the evidence against H_0 will be strong if λ is sufficiently small, or equivalently, if -2 log λ is sufficiently large, where

-2 log λ = 2{log L(Ψ̂_1) - log L(Ψ̂_0)}.
Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

H_0: g = g_0  versus  H_1: g = g_1

for a specified value of g_0.
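A hedged scikit-learn sketch of the resampling idea (normal mixtures, default EM starts, and a modest number of replications; all our simplifications):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_lrts_pvalue(X, g0=1, g1=2, B=99, seed=0):
    """Resampling P-value for H0: g = g0 versus H1: g = g1."""
    def lrts(data):
        m0 = GaussianMixture(g0, random_state=seed).fit(data)
        m1 = GaussianMixture(g1, random_state=seed).fit(data)
        # score() is the mean log-likelihood per observation.
        return 2 * len(data) * (m1.score(data) - m0.score(data))

    observed = lrts(X)
    m0 = GaussianMixture(g0, random_state=seed).fit(X)
    null_stats = []
    for _ in range(B):
        Xb, _ = m0.sample(len(X))      # simulate under the fitted H0 model
        null_stats.append(lrts(Xb))
    # P-value: how extreme the observed LRTS is under the null distribution.
    return (1 + sum(s >= observed for s in null_stats)) / (B + 1)
```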
Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) is given by

-2 log L(Ψ̂) + d log n,

where d is the number of free parameters; this penalized form of the log likelihood is to be minimized in model selection, including the present situation of choosing the number of components g in a mixture model.
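For example, scikit-learn's GaussianMixture.bic reports BIC in this same convention (lower is better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

# Fit g = 1..5 and keep the g with the smallest BIC.
bics = {g: GaussianMixture(g, random_state=0).fit(X).bic(X) for g in range(1, 6)}
print("chosen g:", min(bics, key=bics.get))
```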
Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)
Analysis of LEUKAEMIA Data using EMMIX-GENE
Grouping for Leukemia Data: [heat maps of gene groups 1-40]
Breast cancer data set of van’t Veer et al. (2002, Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer, Nature 415)

These data were the result of microarray experiments on three patient groups with different classes of breast cancer tumours. The overall goal was to identify a set of genes that could distinguish between the different tumour groups based upon the gene expression information for these groups.

The Economist (US), February 2, 2002: The chips are down; Diagnosing breast cancer (gene chips have shown that there are two sorts of breast cancer).
Nature (2002, 4 July issue, 418)
News feature (Ball): Data visualization: Picture this
Colour-coded: this plot of gene-expression data shows breast tumours falling into two groups.
Microarray data from 98 patients with primary breast cancers, with p = 24,881 genes:
• 44 from the good prognosis group (remained metastasis-free after a period of more than 5 years)
• 34 from the poor prognosis group (developed distant metastases within 5 years)
• 20 with a hereditary form of cancer (18 with BRCA1; 2 with BRCA2)
Pre-processing filter of van’t Veer et al.: only genes with both
• a P-value less than 0.01, and
• at least a two-fold difference in expression
in more than 5 out of the 98 tissues were retained.

This reduces the data set to 4,869 genes.
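A small pandas sketch of such a filter (the column layout and random data are illustrative; van't Veer et al.'s actual pipeline differs in detail):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n_genes, n_tissues = 24881, 98
log_ratio = pd.DataFrame(rng.normal(0.0, 0.6, (n_genes, n_tissues)))
p_value = pd.DataFrame(rng.uniform(0.0, 1.0, (n_genes, n_tissues)))

# A gene counts as regulated in a tissue if it shows at least a two-fold
# change (|log2 ratio| >= 1) with a P-value below 0.01 in that tissue.
regulated = (log_ratio.abs() >= 1) & (p_value < 0.01)

# Retain genes regulated in more than 5 of the 98 tissues.
keep = regulated.sum(axis=1) > 5
print(f"genes retained: {int(keep.sum())}")
```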


Heat Map Displaying the Reduced Set of 4,869 Genes
on the 98 Breast Cancer Tumours
Unsupervised Classification Analysis Using EMMIX-GENE
Steps used in the application of EMMIX-GENE (a rough stand-in sketch follows the list):
1. Select the most relevant genes from this filtered set of 4,869 genes. The set of retained genes is thus reduced to 1,867.
2. Cluster these 1,867 genes into forty groups. The majority of gene groups produced were reasonably cohesive and distinct.
3. Using these forty group means, cluster the tissue samples into two and three components using a mixture of factor analyzers model with q = 4 factors.
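EMMIX-GENE itself is the authors' own program; as a rough stand-in for step 3 only, one could cluster the tissues on the forty group means with an off-the-shelf mixture model (scikit-learn has no mixture of factor analyzers, so a full-covariance normal mixture is substituted here, and the data are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
group_means = rng.normal(size=(98, 40))   # 40 gene-group means per tissue

for g in (2, 3):
    gm = GaussianMixture(g, covariance_type="full", random_state=0)
    labels = gm.fit_predict(group_means)
    print(f"g = {g}: cluster sizes {np.bincount(labels)}")
```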
Heat Map of Top 1867 Genes: [heat maps of gene groups 1-40]
i    mi     Ui      i    mi    Ui      i    mi    Ui      i    mi    Ui
1    146  112.98    11   66   25.72    21   44   13.77    31   53   9.84
2     93   74.95    12   38   25.45    22   30   13.28    32   36   8.95
3     61   46.08    13   28   25.00    23   25   13.10    33   36   8.89
4     55   35.20    14   53   21.33    24   67   13.01    34   38   8.86
5     43   30.40    15   47   18.14    25   12   12.04    35   44   8.02
6     92   29.29    16   23   18.00    26   58   12.03    36   56   7.43
7     71   28.77    17   27   17.62    27   27   11.74    37   46   7.21
8     20   28.76    18   45   17.51    28   64   11.61    38   19   6.14
9     23   28.44    19   80   17.28    29   38   11.38    39   29   4.64
10    23   27.73    20   55   13.79    30   21   10.72    40   35   2.44

where i = group number, mi = number in group i, and Ui = -2 log λi.
Heat Map of Genes in Group G1
Heat Map of Genes in Group G2
Heat Map of Genes in Group G3
1. A change in gene expression is apparent between the sporadic (first 78 tissue samples) and hereditary (last 20 tissue samples) tumours.
2. The final two tissue samples (the two BRCA2 tumours) show consistent patterns of expression, different from that exhibited by the set of BRCA1 tumours.
3. The problem of trying to distinguish between the two classes, patients who were disease-free after 5 years (G1) and those with metastases within 5 years (G2), is not straightforward on the basis of the gene expressions.
Selection of Relevant Genes
We compared the genes selected by EMMIX-GENE with those genes retained in the original study by van’t Veer et al. (2002). van’t Veer et al. used an agglomerative hierarchical algorithm to organise the genes into dominant gene groups. Two of these groups were highlighted in their paper, with their genes corresponding to biologically significant features.
Identification of van’t Veer et al.                          Number     Matches with genes
                                                             of genes   retained by select-genes
Cluster A: containing genes co-regulated with the ER-α
           gene (ESR1)                                          40            24
Cluster B: containing “co-regulated genes that are the
           molecular reflection of extensive lymphocytic
           infiltrate, and comprise a set of genes
           expressed in T and B cells”                          40            23

We can see that of the 80 genes identified by van’t Veer et al., only 47 are retained by the select-genes step of the EMMIX-GENE algorithm.
Comparing Clusters from the Hierarchical Algorithm with those from the EMMIX-GENE Algorithm

             Cluster index    Number of        Percentage
             (EMMIX-GENE)     genes matched    matched (%)
Cluster A          2               21             87.5
                   3                2              8.33
                  14                1              4.17
Cluster B         17               18             78.3
                  19                1              4.35
                  21                4             17.4

Subsets of these 47 genes appeared inside several of the 40 groups produced by the cluster-genes step of EMMIX-GENE.
Genes Retained by EMMIX-GENE Appearing in Cluster A
(vertical blue lines indicate the three groups of tumours)
Genes Rejected by EMMIX-GENE Appearing in Cluster A
Genes Retained by EMMIX-GENE Appearing in Cluster B
Genes Rejected by EMMIX-GENE Appearing in Cluster B
Assessing the Number of Tissue Groups
To assess the number of components g to be used in the normal mixture, the likelihood ratio statistic λ was adopted, with the resampling approach used to assess the P-value.

By proceeding sequentially, testing the null hypothesis H_0: g = g_0 versus the alternative hypothesis H_1: g = g_0 + 1, starting with g_0 = 1 and continuing until a non-significant result was obtained, it was concluded that g = 3 components were adequate for this data set.
Clustering Tissue Samples on the Basis of Gene Groups using EMMIX-GENE

The tissue samples can be subdivided into two groups corresponding to the 78 sporadic tumours and the 20 hereditary tumours. When the two-cluster assignment of EMMIX-GENE is compared to this genuine grouping, only 1 of the 20 hereditary tumour patients is misallocated, although 37 of the sporadic tumour patients are incorrectly assigned to the hereditary tumour cluster.

Using a mixture of factor analyzers model with q = 8 factors, we would misallocate:
• 7 out of the 44 members of G1;
• 24 out of the 34 members of G2; and
• 1 of the 18 BRCA1 samples.

The misallocation rate of 24/34 for the second class, G2, is not surprising, given both the gene expressions as summarized in the groups of genes and the fact that we are classifying the tissues in an unsupervised manner, without using knowledge of their true classification.
Supervised Classification
When knowledge of the true classification is used (van’t Veer et al.), the reported error rate was approximately 50% for members of G2 when allowance was made for the selection bias in forming a classifier on the basis of an optimal subset of the genes.

Further analysis of this data set in a supervised context confirms the difficulty in trying to discriminate between the disease-free class G1 and the metastases class G2 (Tibshirani and Efron, 2002, “Pre-Validation and Inference in Microarrays”, Statistical Applications in Genetics and Molecular Biology 1).
Investigating Underlying Signatures with Other Clinical Indicators

The three clusters constructed by EMMIX-GENE were investigated in order to determine whether they followed a pattern contingent upon the clinical predictors of histological grade, angioinvasion, oestrogen receptor status, and lymphocytic infiltrate.
Microarrays have become promising diagnostic tools for clinical applications. However, large-scale screening approaches in general, and microarray technology in particular, inescapably lead to the challenging problem of learning from high-dimensional data.
Hope to see you in Cairns in 2004!
