
Bioinformatics I

2004 - 2005

Mihaela Zavolan | Michael Primig


Erik van Nimwegen | Torsten Schwede

Leandro Hermida (Database developer)


Philippe Demougin (LSTF)
Microarray expression data

Goals of BixI Array I


1. Introduce you to expression profiling using microarrays
2. Outline experimental steps
3. Array data management
4. Present GeneChip system and its approach to raw data computation and analysis

• System available at Biozentrum


• 900 scanners world-wide
• Robust and reliable platform
Summary

1. Introduction to microarray technology


2. Major applications
3. Current systems
• PCR/oligonucleotide spotting
• GeneChip system
4. Experimental design considerations
5. GeneChip architecture
6. Raw data production (GCOS)
• File types
• data content
7. Data management (MIMAS)
8. Data computation and analysis (GCOS)
• mRNA detection calls
9. Practical work (LSTF)
10. Textbooks, Literature & web portals
Introduction to DNA microarray technology

Basic principle:

DNA molecules are attached to a glass support by:

• Adhesion (“spotted microarrays”): single or double stranded oligonucleotides or PCR fragments in aqueous solution are deposited by a robotic device

• Covalent bond (“high density oligonucleotide microarrays or GeneChips”): single stranded oligonucleotides are synthesized in situ
Major applications

1. RNA concentrations (expression profiling)


• Cell cycle
• Germ cell differentiation
• Transcription factor target genes
• Stress/drug response
• Temperature response
• Cancer progression & categorization

2. DNA mutations (SNP-Chip)

3. Protein-DNA interactions
• Transcription factors
• Recombination enzymes
• RNA replication factors
Spotted microarray technology

Treat glass slide with poly-lysine (to make glass sticky), PCR amplify and purify probe DNA (including controls from bacteria)

Spot DNA in duplicate/triplicate using robotic systems

Prepare targets from two samples and label them with CY5 or CY3 fluorophores

Hybridize array with mix

Scan microarray with filters in 635 nm (CY5) and 532 nm (CY3) channels

Green and red indicate different target concentrations, yellow indicates identical target concentrations in the samples

YOU DETERMINE A RATIO OF TWO VALUES
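The ratio of the two channels can be sketched in a few lines. This is a minimal illustration only: the log2 transform is a common convention and not stated on the slide, and the function name and intensities are hypothetical.

```python
import math

def log2_ratio(cy5, cy3):
    """Log2 ratio of the red (CY5) to the green (CY3) channel intensity.

    Assumption: both intensities are positive and already background-corrected.
    """
    return math.log2(cy5 / cy3)

# Hypothetical spot intensities.
print(log2_ratio(2000.0, 500.0))  # 2.0: four times more target in the CY5 sample
print(log2_ratio(800.0, 800.0))   # 0.0: identical concentrations (yellow spot)
```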
Current systems at the Biozentrum

GMS 427 spotting system


Current systems at the Biozentrum

• Fluid transport and


transfer based on
surface tension
forces
• Pin-and-Ring
System for accurate
and reproducible
spotting
Current systems at the Biozentrum
GeneChip system
GeneChip technology

[Figure labels: wafer, 12.7 cm; single array, 1.28 cm; ≈ 60 million probes; bases A, T, C, G]

Probe cell (Feature), 11 µm

≈ 1.3 million in Human U133 Plus 2.0 GeneChip
10^6 to 10^7 oligo probes in each probe cell
GeneChip manufacturing
GeneChip architecture
Oligonucleotide probes of 25 bases are synthesized directly onto a glass support using photolithography and combinatorial chemistry

Several 10^6 identical oligos fit into one location of ~20 µm called a feature

A glass slide of 1.28 cm² can hold up to 500 000 features

11 to 16 oligos of different sequence are complementary to one ORF/gene.

A wild-type oligo is called a Perfect Match (PM)

An oligo that contains a point mutation at position 13 is called a MisMatch (MM)

You hybridize a GeneChip with one single target sample

YOU DETERMINE a signal per gene from all probe pairs. That signal reflects the expression level of the target gene.
GeneChip architecture

Probe pairs are not arranged


right next to each other but
dispersed all over the
GeneChip.

This minimizes the impact of


local perturbations (dust,
manufacturing problem,
local background).

>> Distributed architecture


Experimental design

Total RNA preparation

Reverse transcriptase to make cDNA

In vitro transcription (IVT) using biotinylated CTP/UTP nucleotide analogs to make cRNA target

Heat fragmentation

Hybridization: a specific target molecule will bind to PM but not to MM

Washing and staining

Using statistics, a signal per gene from all probe pairs (PM-MM) is computed after the scan
Data file types

• DAT
- Raw bitmap image of the Affymetrix GeneChip® scanned probe array
- Encodes essential information about the experiment and sample that the
image belongs to
• CEL
- Cell intensities and quality control information generated by aligning grid
to DAT file and using the Affymetrix Cell Analysis Algorithm to compute
intensity values for each X and Y coordinate
- Contains information about the image, sample, and experiment that the CEL data
were derived from
• CDF
- Affymetrix GeneChip® array library definition information containing the X
and Y coordinate map on the probe array to the following information:
probe set ID, feature number, and perfect match (PM) or mismatch (MM)
Data file types

• CHP
- GCOS native file containing the probe (CEL) analysis output using the
GCOS probe analysis algorithm (which computes one expression signal
value and detection call from all the probe intensities in a probe set)
• EXP
- GCOS native file containing general experiment information such as name,
sample, probe array type, hybridization information, etc.
• RPT
- Summary analysis information gathered during GCOS probe analysis (CHP
file generation)
• TXT
- Tab-delimited text output of the CHP file containing probe set IDs, probe
set signals, and detection calls generated by the GCOS probe analysis
algorithm as well as important information about the experiment, sample,
and analysis algorithm parameters
Data file types

Flat file storage

MIMAS repository
File types
DAT: image at the pixel level
File types

DAT

CEL

Raw data for PM and MM at the feature (oligonucleotide probe) level
File types

DAT

CEL

CHP

Affy ID       Stat Pairs   Stat Pairs Used   Signal   Flag   p-value
222080_s_at   11           11                1025.2   P      0.432373

Raw data from the complete probe set at the locus/transcript level
DAT: feature level data
CDF: array library definition file
CHP: transcript level data
EXP: experiment annotation
RPT: key data on experiment

P calls:
Yeast: 75-80%
Mammals: 35-45%
TXT: export of CHP transcript level data
Why do we need MIMAS?

• CENTRAL DATA REPOSITORY


- Independent microarray facilities need to be integrated under one umbrella
- Microarray core data needs to be stored in one easy to access location

• VERSATILE DATA STORAGE BACK-END


- Utilizing DBMS technology, microarray core data can be stored to allow for data access from many types of
tools
- Data storage needs can change as microarray technology evolves

• COMPLETE AND STANDARDIZED DESCRIPTION OF DATA


- Microarray data can only be compared when the meta data describing it is complete and understandable
- Publishing of scientific results is only possible with complete and MIAME-compliant meta data

• LONG-TERM SAFETY & SECURITY OF DATA


- Archival and backup of microarray data can only be effectively done with a central system
- Affymetrix microarray data is expensive to produce so maintaining it is of vital importance

• FLEXIBLE DATA MANIPULATION & ANALYSIS


- Option to develop custom home-grown tools to interact with data as well as using commercially available
solutions
Why do we need MIMAS?
• EXPENSIVE COMMERCIAL SOLUTIONS (> $100K)
- Right fundamental concepts that we need
- Scalable
- Data and capabilities are locked in the proprietary software

• MID-PRICED COMMERCIAL SOLUTIONS (< $100K)


- Have some of the fundamental concepts right
- Somewhat scalable
- Data and capabilities are locked in the proprietary software

• OPEN-SOURCE/FREE SOLUTIONS
- Do not have fundamental concepts right (makes customized development for our
needs difficult)
- No scalability

• CUSTOM SOLUTION + INTEGRATION W/ MID-PRICED COMMERCIAL SOLUTION


- Best of both worlds!
MIMAS

•Oracle Database Back-end

•MIMAS API

•Web Front-end

•Loading, Transformation, & Extraction Processes

•Integration with Silicon Genetics Signet and GeneSpring
MIMAS

• REPOSITORY
- Feature-level intensities
- Meta-data
- Data warehouse schema design

• ONTOLOGY/ARRAY LIBRARY
- Controlled vocabulary used to describe microarray data
- Library can be extended by MIMAS users
- MIAME compliant and extensible
- Affymetrix GeneChip® information

• UPLOAD/STAGING DATABASE
- Persistent area to store uploads before they are ready to go into the repository

• WEB MANAGEMENT DATABASE


- User, laboratory, session, and external job requests
Array experiment annotation: MIAME

•Microarray Gene Expression Data Society (MGED)
- http://www.mged.org/

•Minimum Information About a Microarray Experiment - MIAME
- http://www.mged.org/Workgroups/MIAME/miame.html
MIMAS processes

• MIMAS-Signet Loader
- Master script which takes uploaded/staged experiments and then loads them into the
MIMAS Repository
- Runs daily (or more frequently depending on hardware)
- Integrity and redundancy checking of experiment, CEL, TXT data
- Integrity checking of CEL files
- CEL fingerprinting (avoid redundancy)
- Transformation and loading of meta-data (MIAME)
- Integrity checking of TXT files
- Emails user of success or failure
• External Job Execution
- Periodically scans the EXTERNAL_JOBS MIMAS table for requests and executes them
depending on available resources
- Recreation and archiving of sample CEL files for download
Data computation

DAT >> CEL >> CHP >> TXT


Data computation and analysis

Single Array Analysis

• Detection p-value
• Detection call
• Signal algorithm

Comparison analysis

• Normalization
• Change p-value
• Change call
• Signal log ratio algorithm
The CEL analysis algorithm

DAT >> CEL file

[Figure: histogram of the pixel intensities of one feature; x-axis: Bin (1140 to 1380), y-axis: Frequency, with cumulative percentage]

Pixel intensities of the feature:
1243 1283 1346 1271
1158 1272 1254 1247
1247 1255 1192 1182
1254 1309 1241 1122

The Cumulative Distribution Function (cdf) is used to determine the intensity of the 75th percentile = 1271.3: this eliminates outliers with extreme values
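The 75th percentile on this slide can be reproduced from the 16 pixel values of the figure. The sketch below is illustrative: linear interpolation between ranks is an assumption (the slide does not say how the percentile is interpolated), and the function name is hypothetical.

```python
# Pixel intensities of one feature, copied from the 4 x 4 grid on the slide.
pixels = [
    1243, 1283, 1346, 1271,
    1158, 1272, 1254, 1247,
    1247, 1255, 1192, 1182,
    1254, 1309, 1241, 1122,
]

def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks.

    Assumption: the interpolation rule is not given on the slide.
    """
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

cell_intensity = percentile(pixels, 75)
print(cell_intensity)  # 1271.25, reported as 1271.3 on the slide
```

Because only the rank of a pixel matters, a single very bright or very dark pixel does not shift the cell intensity.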
Data computation and analysis

Single Array Analysis

•Generates qualitative and quantitative values from one


gene expression hybridization experiment

•Yields data required to compare experiments (one


hybridization = one experiment)

The Detection call (a qualitative value) indicates whether a transcript is reliably detected (Present) or fails to be detected (Absent) in the array experiment

The Detection call is determined by comparing the Detection p-value generated in the analysis against user-defined cut-off values.

A quantitative value, the signal, assigns a relative


measure of abundance to each transcript represented by
probes on the array.
Data computation and analysis

Hypothesis testing is the use of statistics to assess whether the observed data are compatible with a given hypothesis. The usual process of hypothesis testing consists of four steps.

1. Formulate the null hypothesis H0 (commonly, that the observations are the result of pure chance) and the alternative hypothesis Ha (commonly, that the observations show a real effect combined with a component of chance variation).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value (also called an alpha value). If the P-value is below alpha, the observed effect is statistically significant and the null hypothesis is rejected in favor of the alternative hypothesis.
Data computation and analysis

Detection algorithm

There are four steps:

1. Remove saturated probe pairs and ignore probe pairs


wherein PM ~ MM + tau
2. Calculate the discrimination scores. (This tells us how
different the PM and MM cells are.)
3. Use Wilcoxon’s Signed Rank test to calculate a
significance or p-value. (This tells us how confident
we can be about a certain result.)
4. Compare the p-value with the preset significance
levels to make the call.

P-value: the probability that a variate Z would assume a value greater than or equal to the observed value strictly by chance: P(Z ≥ z_observed)
Data computation and analysis

Detection algorithm

Discrimination Score

The discrimination score [R] is a relative measure of the difference between the PM and MM intensities. The discrimination score for the ith probe pair is:

Ri = (PMi - MMi) / (PMi + MMi)

We use τ (default τ = 0.015), a small threshold value between 0 and 1, as a significant difference from zero. If the median (Ri) > τ, the hypothesis that PM and MM are equally hybridizing to the sample can be rejected. A detection call based on the strength of this rejection (the p-value) can be made.
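As a small worked example, the discrimination score Ri = (PMi - MMi) / (PMi + MMi) can be computed per probe pair and its median compared with τ. The intensities below are invented for illustration and the function name is hypothetical; this is a sketch, not the GCOS implementation.

```python
from statistics import median

TAU = 0.015  # default threshold quoted on the slide

def discrimination_scores(pm, mm):
    """Return R_i = (PM_i - MM_i) / (PM_i + MM_i) for each probe pair."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]

# Hypothetical intensities for a probe set with four probe pairs.
pm = [1200.0, 900.0, 1500.0, 400.0]
mm = [300.0, 450.0, 500.0, 400.0]

scores = discrimination_scores(pm, mm)
print(scores)                # scores lie between -1 and 1
print(median(scores) > TAU)  # True here: PM is clearly brighter than MM
```

A score near 1 means the PM cell is much brighter than its MM partner, a score near 0 means the two are indistinguishable.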
Data computation and analysis

Detection algorithm

Probe pairs (PPs) with R > τ vote for the transcript to be present (P) while PPs with R < τ vote for the transcript to be absent (A).

The voting results of all PPs are summarized as a p-value computed by the one-sided Wilcoxon’s Signed Rank test.
Data computation and analysis

Detection algorithm

One-sided Wilcoxon’s Signed Rank test

Null hypothesis:

H0: median (Ri) = τ versus the alternative

H1: median (Ri) > τ

τ = 0.015 (see Liu et al. Bioinformatics 2002)


Data computation and analysis

One-sided Wilcoxon’s Signed Rank test

A nonparametric test that assumes that there is information in


the magnitudes of the differences between paired observations.

Take the paired observations, calculate the differences, and rank


them from smallest to largest by absolute value.

Compute the sum of the ranks of the positive differences. If the


null hypothesis is true, the sum of the ranks of the positive
differences should be about the same as the sum of the ranks of
the negative differences.

Add all the ranks associated with positive differences, giving the
T+ statistic. Finally, the P-value associated with this statistic is
identified in an appropriate table.

The Wilcoxon test is an R-estimate (meaning that it is based on a rank test).
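The ranking procedure described above can be sketched as follows, applied to discrimination scores tested against τ. This is an illustrative implementation, not the GCOS code: it computes an exact p-value by enumerating the null distribution instead of looking it up in a table, drops zero differences, and gives tied values their average rank. Those three choices and the function name are assumptions.

```python
def signed_rank_p_value(scores, tau=0.015):
    """One-sided exact p-value for H0: median(R_i) = tau vs H1: median(R_i) > tau."""
    diffs = [r - tau for r in scores if r != tau]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0

    # Rank the absolute differences from smallest to largest (ties: average rank).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # T+ is the sum of the ranks of the positive differences.
    t_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: each rank is positive with probability 1/2.
    # Ranks are doubled so that average ranks (x.5) become integers.
    counts = {0: 1}
    for doubled in (int(round(2 * r)) for r in ranks):
        updated = {}
        for total, ways in counts.items():
            updated[total] = updated.get(total, 0) + ways
            updated[total + doubled] = updated.get(total + doubled, 0) + ways
        counts = updated

    target = int(round(2 * t_plus))
    at_least_as_large = sum(ways for total, ways in counts.items() if total >= target)
    return at_least_as_large / 2 ** n

# Hypothetical discrimination scores for a probe set with 11 probe pairs.
example_scores = [0.61, 0.45, 0.52, 0.38, 0.70, 0.12, 0.33, 0.48, 0.05, 0.27, 0.59]
print(signed_rank_p_value(example_scores))
```

When all 11 differences are positive, T+ takes its largest possible value and the p-value is the smallest one this test can return for 11 probe pairs.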
Data computation and analysis

Detection algorithm

If a mismatch cell is saturated (MM > 46000),


the corresponding probe pair is not used in
further computations.

PPs where PM and MM are within τ of each other


are discarded.

If all probe pairs in a unit are saturated, the


gene is reported as detected and the p-value is
set to 0.
Making the call

Present (detected): p < α1
Marginal: α1 ≤ p < α2
Absent (undetected): p ≥ α2
Significance levels

Default α1 = 0.04 (16-20 probe pairs)


Default α2 = 0.06 (16-20 probe pairs)
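The decision rule can be written as a short function. This is a sketch under assumptions: the boundary cases are handled as α1 ≤ p < α2 for Marginal and p ≥ α2 for Absent, the one-letter labels P/M/A are used for the three calls, and the function name is hypothetical.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a call using the default significance levels."""
    if p_value < alpha1:
        return "P"  # Present (detected)
    if p_value < alpha2:
        return "M"  # Marginal
    return "A"      # Absent (undetected)

for p in (0.001, 0.05, 0.3):
    print(p, detection_call(p))
```

Raising α1 produces more Present calls at the cost of more false positives; lowering it does the opposite.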
Data computation and analysis

Signal algorithm: CEL >> CHP


Data computation and analysis

Background

• For purposes of calculating


background values, the array
is split up into K rectangular
zones Z_k (k = 1, …, K, default
K = 16).
• Control cells and masked cells
are not used in the calculation.
• The cells are ranked and the
lowest 2% is chosen as the
background b for that zone
(bZk).
• The standard deviation of the
lowest 2% cell intensities is
calculated as an estimate of
the background variability n for
each zone (nZk).
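The per-zone background estimate can be sketched as below. Several points are assumptions not stated on the slide: the background b is taken as the mean of the lowest 2% of cells, the variability n as their population standard deviation, and control or masked cells are assumed to have been removed already. The function name and data are hypothetical.

```python
from statistics import mean, pstdev

def zone_background(cell_intensities, percent=2):
    """Return (b, n) for one zone: background level and background variability.

    Assumption: b is the mean and n the population standard deviation
    of the lowest `percent` percent of cell intensities in the zone.
    """
    ordered = sorted(cell_intensities)
    count = max(1, len(ordered) * percent // 100)  # always keep at least one cell
    lowest = ordered[:count]
    return mean(lowest), pstdev(lowest)

# Hypothetical zone with 200 cells of increasing intensity.
zone = list(range(100, 300))
b, n = zone_background(zone)
print(b, n)  # the 4 dimmest cells (100..103) define the background
```

How the zone values are then combined into a background for each individual cell is not covered by this sketch.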
Data computation and analysis

Signal algorithm: CEL >> CHP

The signal is a quantitative value that reflects the relative concentration of a given mRNA

It is computed as a weighted mean using the One-Step Tukey’s Biweight Estimate

The specific signal for each PP is calculated by subtracting the stray signal (estimated by MM) from the PM value and then taking its log: IPM = IT + IS
IS: Intensity due to stray signal
IT: Intensity due to true signal
Data computation and analysis

Signal algorithm: CEL >> CHP

Three rules are applied:

If MM < PM then MM is considered informative and its value is directly used as a stray (background) estimate

If MM are often but not always informative, the outliers are adjusted

If MM > PM the MMs are replaced by a value smaller than PM
Data computation and analysis

Signal algorithm: CEL >> CHP

Signal calculation:

1. CEL intensity values are adjusted for global


background
2. MM value is calculated and subtracted from PM
3. Adjusted PM values are log2-transformed to
stabilize variance
4. Tukey’s biweight estimator is used to compute a
robust mean of the resulting values
5. The signal is scaled using a trimmed mean
Data computation and analysis

Signal algorithm: CEL >> CHP

One-step Tukey’s bi-weight algorithm

• Determine median to define center of


data
• Calculate distance of each data point
from median. This distance is used to
determine to what extent a given
value will contribute to the final signal

The greater the distance to the median, the smaller the contribution of a data point
>>this minimizes the effect of outliers…
Data computation and analysis

Signal algorithm: CEL >> CHP

One-step Tukey’s bi-weight


algorithm

• Calculate median M for n values


• Calculate absolute distance of each
data point from median.
• Calculate S, the median of the
absolute distances from M.
• The Median Absolute Deviation
(MAD) is a first measure of the
data distribution
Data computation and analysis

Signal algorithm: CEL >> CHP

One-step Tukey’s bi-weight algorithm

For each datapoint i, a uniform measure of distance u from the center is given by

ui = (xi - M) / (c·S + ε)

xi: value of datapoint i
M: median of the data
S: median absolute deviation (MAD)
c: tuning constant (default c=5)
ε: small value used to avoid division by zero
Data computation and analysis

Signal algorithm: CEL >> CHP

One-step Tukey’s bi-weight algorithm

The weight w of each point is calculated by the bisquare function:

w(u) = (1 - u²)² for |u| ≤ 1, and w(u) = 0 for |u| > 1

•For each point the weight w is reduced as a function of its distance from the median. The weight of extreme values is reduced to zero
Data computation and analysis

Signal algorithm: CEL >> CHP

One-step Tukey’s bi-weight algorithm

Corrected values can then be computed with the one-step w-estimate which is a weighted mean

Tbi = Σ w(ui)·xi / Σ w(ui)
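Putting the preceding slides together, the one-step Tukey's biweight can be sketched as follows. It uses the median M, the MAD S, the scaled distance u = (x - M) / (c·S + ε) and the bisquare weight w(u) = (1 - u²)² for |u| ≤ 1 (zero otherwise). The value ε = 0.0001 and the example data are assumptions; the function name is hypothetical.

```python
from statistics import median

def tukey_biweight(values, c=5.0, epsilon=0.0001):
    """One-step Tukey's biweight: a weighted mean that down-weights outliers.

    Assumption: epsilon = 0.0001; the slides only call it a small value.
    """
    m = median(values)                        # center of the data
    s = median([abs(x - m) for x in values])  # median absolute deviation (MAD)
    weighted_sum = 0.0
    weight_total = 0.0
    for x in values:
        u = (x - m) / (c * s + epsilon)       # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) <= 1 else 0.0  # bisquare weight
        weighted_sum += w * x
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical log2 probe-pair values with one outlier (14.0).
log_values = [10.0, 10.2, 9.9, 10.1, 14.0]
print(tukey_biweight(log_values))          # close to 10, the outlier gets weight 0
print(sum(log_values) / len(log_values))   # the plain mean is pulled up by the outlier
```

The tuning constant c controls how far from the median a point may lie before its weight reaches zero; a larger c makes the estimate behave more like an ordinary mean.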
Practical work

Life Sciences Training Facility


Pharmazentrum, room 5021
http://www.bioz.unibas.ch/corelab

GCOS:
•Compute CEL, CHP files
•Determine presence/absence calls
•Data quality control

MIMAS
•Annotate and upload files
Textbooks, Literature & web portals

DNA Microarray Data Analysis

J Tuimala and M Laine

Follow the link “books and magazines”.

You can download the pdf file after registration for free
http://www.csc.fi/molbio/
Textbooks, Literature & web portals

Microarray Gene
Expression Data Analysis:
A Beginner's Guide

Helen Causton, John Quackenbush,


Alvis Brazma

http://www.amazon.co.uk/exec/obidos/ASIN/1405106824/qid%3D1047375686/026-1898565-5814030
Textbooks, Literature & web portals

Bioinformatics
Sequence and Genome
Analysis

David Mount
2004 CSH Press

http://www.bioinformaticsonline.org/
Textbooks, Literature & web portals

Liu et al.
Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 2002

Hubbell et al.
Robust estimators for expression analysis. Bioinformatics 2002

Irizarry et al.
Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003

Bolstad et al.
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003

Gautier et al.
affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004
Textbooks, Literature & web portals

NETAFFX:
http://www.affymetrix.com/analysis/

Register and access info on probes,


annotation, technotes, stats
reference guide, expression manuals
etc.
Textbooks, Literature & web portals

Certified array data repositories

EBI: ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

NCBI: Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/projects/geo/


Textbooks, Literature & web portals

http://www.nslij-genetics.org/microarray/
