Micro Arrays

Bioinformatics I
2004 - 2005
Mihaela Zavolan | Michael Primig

Erik van Nimwegen | Torsten Schwede
Leandro Hermida (Database developer)

Philippe Demougin (LSTF)
Microarray expression data
Goals of BixI Array I

1. Introduce you to expression profiling
using microarrays
2. Outline experimental steps
3. Array data management
4. Present GeneChip system and its
approach to raw data computation
and analysis
• System available at Biozentrum

• 900 scanners world-wide
• Robust and reliable platform
Summary
1. Introduction to microarray technology

2. Major applications
3. Current systems
• PCR/oligonucleotide spotting
• GeneChip system
4. Experimental design considerations
5. GeneChip architecture
6. Raw data production (GCOS)
• File types
• data content
7. Data management (MIMAS)
8. Data computation and analysis (GCOS)
• mRNA detection calls
9. Practical work (LSTF)
10. Textbooks, Literature & web portals
Introduction to DNA microarray technology
Basic principle:
DNA molecules are attached to a glass support by
adhesion (“spotted microarrays”): single or double stranded oligonucleotides or PCR fragments in aqueous solution are deposited by a robotic device
covalent bond (“high density oligonucleotide microarrays” or “GeneChips”): single stranded oligonucleotides are synthesized in situ
Major applications
1. RNA concentrations (expression profiling)

• Cell cycle
• Germ cell differentiation
• Transcription factor target genes
• Stress/drug response
• Temperature response
• Cancer progression & categorization
2. DNA mutations (SNP-Chip)
3. Protein-DNA interactions
• Transcription factors
• Recombination enzymes
• RNA replication factors
Spotted microarray technology
Treat glass slide with poly-lysine (to make

glass sticky), PCR amplify and purify probe
DNA (including controls from bacteria)
Spot DNA in duplicate/triplicate using robotic

systems
Prepare targets from two samples and label them with CY5 or CY3 fluorophores
Hybridize array with mix
Scan microarray with filters in 635nm (CY5)

and 532nm (CY3) channels
Green and red indicate different target concentrations, yellow indicates identical target concentrations in the samples
YOU DETERMINE A RATIO OF TWO

VALUES
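The ratio of the two channel values can be illustrated with a minimal Python sketch. The spot intensities, the function name and the use of a log2 scale are assumptions made for illustration; the slide only states that a ratio of two values is determined.

```python
import math


def log2_ratio(cy5_intensity, cy3_intensity):
    """Return log2(CY5 / CY3) for one spot; both intensities must be > 0."""
    return math.log2(cy5_intensity / cy3_intensity)


# Invented spot intensities (CY5, CY3), for illustration only.
spots = {
    "spot_1": (4000.0, 1000.0),  # more CY5-labelled target
    "spot_2": (1500.0, 1500.0),  # identical target concentrations
    "spot_3": (500.0, 2000.0),   # more CY3-labelled target
}

for name, (cy5, cy3) in spots.items():
    print(name, log2_ratio(cy5, cy3))
```

On a log2 scale, equal concentrations give 0, and a two-fold difference in either direction gives +1 or -1, which makes up- and down-regulation symmetric.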
Current systems at the Biozentrum
GMS 427 spotting system

• Fluid transport and

transfer based on
surface tension
forces
• Pin-and-Ring
System for accurate
and reproducible
spotting
GeneChip system
GeneChip technology
[Figure: wafer (12.7 cm) and GeneChip array (1.28 cm); labels: ≈ 60 million probes; A, T, C, G]
Probe cell (Feature), 11 µm
≈ 1.3 million probe cells in the Human U133 Plus 2.0 GeneChip
10^6 to 10^7 oligo probes in each probe cell
GeneChip manufacturing
GeneChip architecture
Oligonucleotide probes of 25 bases are synthesized directly onto a glass support using photolithography and combinatorial chemistry
Several 10^6 identical oligos fit into one location of ~20 µm called a feature
A glass slide of 1.28 cm × 1.28 cm can hold up to 500 000 features
11 to 16 oligos of different sequence are

complementary to one ORF/gene.
A wild-type oligo is called a Perfect Match (PM)
An oligo that contains a point mutation at position 13 is called a MisMatch (MM)
You hybridize a GeneChip with one single target sample
YOU DETERMINE a signal per gene for all probe pairs. That signal reflects the expression level of the target gene.
GeneChip architecture
Probe pairs are not arranged

right next to each other but
dispersed all over the
GeneChip.
This minimizes the impact of

local perturbations (dust,
manufacturing problem,
local background).
>> Distributed architecture

Experimental design
Total RNA preparation
Reverse transcriptase to make cDNA
IVT using biotinylated CTP/UTP nucleotide analogs to make cRNA target
Heat fragmentation
Hybridization: a specific target molecule will bind to PM but not to MM
Washing and staining
Using statistics, a signal per gene for all probe pairs (PM-MM) is computed after the scan
Data file types
• DAT
- Raw bitmap image of the Affymetrix GeneChip® scanned probe array
- Encodes essential information about the experiment and sample that the
image belongs to
• CEL
- Cell intensities and quality control information generated by aligning grid
to DAT file and using the Affymetrix Cell Analysis Algorithm to compute
intensity values for each X and Y coordinate
- Contains information about the image, sample, and experiment that the CEL data are derived from
• CDF
- Affymetrix GeneChip® array library definition information containing the X
and Y coordinate map on the probe array to the following information:
probe set ID, feature number, and perfect match (PM) or mismatch (MM)
Data file types
• CHP
- GCOS native file containing the probe (CEL) analysis output using the
GCOS probe analysis algorithm (which computes one expression signal
value and detection call from all the probe intensities in a probe set)
• EXP
- GCOS native file containing general experiment information such as name,
sample, probe array type, hybridization information, etc.
• RPT
- Summary analysis information gathered during GCOS probe analysis (CHP
file generation)
• TXT
- Tab-delimited text output of the CHP file containing probe set IDs, probe
set signals, and detection calls generated by the GCOS probe analysis
algorithm as well as important information about the experiment, sample,
and analysis algorithm parameters
Data file types
Flat file storage
MIMAS repository
File types
[Figure: transcript with 3' UTR and poly(A) tail (AAAAA)]
DAT: image at the pixel level
File types
CEL: raw data for PM and MM at the feature (oligonucleotide probe) level
CHP: raw data from the complete probe set at the locus/transcript level
Affy ID       Stat Pairs   Stat Pairs used   Signal   Flag   p-value
222080_s_at   11           11                1025.2   P      0.432373
DAT: feature level data
CDF: array library definition file
CHP: transcript level data
EXP: experiment annotation
RPT: key data on experiment
P calls:
Yeast: 75-80%
Mammals: 35-45%
TXT: export chp transcript level data
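The fraction of Present calls reported in the RPT file can also be recomputed from a TXT export. The sketch below is only an illustration: the rows are invented, the column names are taken from the CHP excerpt shown above, and a real GCOS export may use different headers.

```python
import csv
import io

# Invented example rows; column names follow the CHP excerpt in the slides
# and are an assumption about the real export format.
EXAMPLE_TXT = (
    "Affy ID\tStat Pairs\tStat Pairs Used\tSignal\tFlag\tp-value\n"
    "example_1_at\t11\t11\t1025.2\tP\t0.001\n"
    "example_2_at\t11\t11\t35.7\tA\t0.500\n"
    "example_3_at\t11\t11\t210.4\tM\t0.050\n"
    "example_4_at\t11\t11\t640.9\tP\t0.010\n"
)


def percent_present(handle, call_column="Flag"):
    """Return the percentage of probe sets with a Present (P) detection call."""
    calls = [row[call_column] for row in csv.DictReader(handle, delimiter="\t")]
    return 100.0 * calls.count("P") / len(calls)


print(percent_present(io.StringIO(EXAMPLE_TXT)))
```

A value far outside the typical range for the organism (75-80% for yeast, 35-45% for mammals) would be a reason to inspect the hybridization.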
Why do we need MIMAS?
• CENTRAL DATA REPOSITORY

- Independent microarray facilities need to be integrated under one umbrella
- Microarray core data needs to be stored in one easy to access location
• VERSATILE DATA STORAGE BACK-END

- Utilizing DBMS technology, microarray core data can be stored to allow for data access from many types of
tools
- Data storage needs can change as microarray technology evolves
• COMPLETE AND STANDARDIZED DESCRIPTION OF DATA

- Microarray data can only be compared when the meta data describing it is complete and understandable
- Publishing of scientific results is only possible with complete and MIAME-compliant meta data
• LONG-TERM SAFETY & SECURITY OF DATA

- Archival and backup of microarray data can only be effectively done with a central system
- Affymetrix microarray data is expensive to produce so maintaining it is of vital importance
• FLEXIBLE DATA MANIPULATION & ANALYSIS

- Option to develop custom home-grown tools to interact with data as well as using commercially available
solutions
Why do we need MIMAS?
• EXPENSIVE COMMERCIAL SOLUTIONS (> $100K)
- Right fundamental concepts that we need
- Scalable
- Data and capabilities are locked in the proprietary software
• MID-PRICED COMMERCIAL SOLUTIONS (< $100K)

- Have some of the fundamental concepts right
- Somewhat scalable
- Data and capabilities are locked in the proprietary software
• OPEN-SOURCE/FREE SOLUTIONS
- Do not have fundamental concepts right (make customized development to our
needs difficult)
- No scalability
• CUSTOM SOLUTION + INTEGRATION W/ MID-PRICED COMMERCIAL SOLUTION

- Best of both worlds!
MIMAS
•Oracle Database Back-end
•MIMAS API
•Web Front-end
•Loading, Transformation, & Extraction

Processes
•Integration with Silicon Genetics Signet

and GeneSpring
MIMAS
• REPOSITORY
- Feature-level intensities
- Meta-data
- Data warehouse schema design
• ONTOLOGY/ARRAY LIBRARY
- Controlled vocabulary used to describe microarray data
- Library can be extended by MIMAS users
- MIAME compliant and extensible
- Affymetrix GeneChip® information
• UPLOAD/STAGING DATABASE
- Persistent area to store uploads before they are ready to go into the repository
• WEB MANAGEMENT DATABASE

- User, laboratory, session, and external job requests
Array experiment annotation: MIAME
•Microarray Gene Expression Data Society

(MGED)
- http://www.mged.org/
•Minimal Information about a microarray

experiment - MIAME
- http://www.mged.org/Workgroups/MIAME
/miame.html
MIMAS processes
• MIMAS-Signet Loader
- Master script which takes uploaded/staged experiments and loads them into the MIMAS Repository
- Runs daily (or more frequently depending on hardware)
- Integrity and redundancy checking of experiment, CEL, TXT data
- Integrity checking of CEL files
- CEL fingerprinting (avoid redundancy)
- Transformation and loading of meta-data (MIAME)
- Integrity checking of TXT files
- Emails user of success or failure
• External Job Execution
- Periodically scans the EXTERNAL_JOBS MIMAS table for requests and executes them
depending on available resources
- Recreation and archiving of sample CEL files for download
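How CEL fingerprinting is implemented in MIMAS is not described here. As a hedged illustration of the idea only, the sketch below assumes that a fingerprint is a SHA-256 digest of the file content and that redundancy means an identical digest is already stored; the function names are hypothetical.

```python
import hashlib


def cel_fingerprint(cel_bytes):
    """Return a hex digest that identifies the content of a CEL file (assumed scheme)."""
    return hashlib.sha256(cel_bytes).hexdigest()


def is_redundant(cel_bytes, known_fingerprints):
    """True if a file with identical content has already been loaded."""
    return cel_fingerprint(cel_bytes) in known_fingerprints


# Invented byte strings stand in for real CEL file contents.
already_loaded = {cel_fingerprint(b"contents of an earlier upload")}
print(is_redundant(b"contents of an earlier upload", already_loaded))
print(is_redundant(b"contents of a new upload", already_loaded))
```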
Data computation
DAT >> CEL >> CHP >> TXT

Data computation and analysis
Single Array Analysis
• Detection p-value
• Detection call
• Signal algorithm
Comparison analysis
• Normalization
• Change p-value
• Change call
• Signal log ratio algorithm
The CEL analysis algorithm
DAT >> CEL file
Example intensities (4 × 4 grid):
1243 1283 1346 1271
1158 1272 1254 1247
1247 1255 1192 1182
1254 1309 1241 1122
[Figure: histogram of these intensities (x-axis: Bin, 1140 to 1380; y-axis: Frequency) with cumulative percentage; the 75th percentile, 1271.3, is marked]
The Cumulative Distribution Function (cdf) is used to determine the intensity of the 75th percentile = 1271.3: this eliminates outliers with extreme values
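The 75th percentile can be checked with a short Python sketch. It assumes that the sixteen values shown in the figure are the intensities from which the percentile is taken, and it uses linear interpolation between neighbouring ranks; the exact interpolation rule used by GCOS is not stated in the slides.

```python
def percentile(values, percent):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * percent / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# The sixteen intensities from the figure.
intensities = [
    1243, 1283, 1346, 1271,
    1158, 1272, 1254, 1247,
    1247, 1255, 1192, 1182,
    1254, 1309, 1241, 1122,
]

print(percentile(intensities, 75))  # 1271.25, shown as 1271.3 on the slide
```

Unlike the maximum, the 75th percentile is not moved by a single very bright value such as 1346.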
Single Array Analysis
•Generates qualitative and quantitative values from one

gene expression hybridization experiment
•Yields data required to compare experiments (one

hybridization = one experiment)
The Detection call (a qualitative value) indicates whether a transcript is reliably detected (Present) or fails to be detected (Absent) in the array experiment
The Detection call is determined by comparing the Detection p-value generated in the analysis against user-defined cut-off values.
A quantitative value, the signal, assigns a relative measure of abundance to each transcript represented by probes on the array.
Hypothesis testing is the use of statistics to decide whether the data provide enough evidence to reject a given hypothesis. The usual process of hypothesis testing consists of four steps.
1. Formulate the null hypothesis H0 (commonly, that the

observations are the result of pure chance) and the alternative
hypothesis Ha (commonly, that the observations show a real effect
combined with a component of chance variation).
2. Identify a test statistic that can be used to assess the truth of the
null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value (also called an alpha value). If the P-value is below the alpha value, the observed effect is statistically significant and the null hypothesis is rejected in favor of the alternative hypothesis.
Detection algorithm
There are four steps:
1. Remove saturated probe pairs and ignore probe pairs

wherein PM ~ MM + tau
2. Calculate the discrimination scores. (This tells us how
different the PM and MM cells are.)
3. Use Wilcoxon’s Signed Rank test to calculate a
significance or p-value. (This tells us how confident
we can be about a certain result.)
4. Compare the p-value with the preset significance
levels to make the call.
P-value: the probability that a variate would assume a value

greater than or equal to the observed value strictly by
chance:
Detection algorithm
Discrimination Score
The discrimination score [R] is a relative measure of the difference between the PM and MM intensities. The discrimination score for the ith probe pair is:
R_i = (PM_i - MM_i) / (PM_i + MM_i)
We use τ (default τ = 0.015), a small threshold value between 0 and 1, as a significant difference from zero. If the median (R_i) > τ, the hypothesis that PM and MM are equally hybridizing to the sample can be rejected. A detection call based on the strength of this rejection (the p-value) can be made.
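A small worked example may help here. It is a sketch that assumes the usual definition R = (PM - MM) / (PM + MM) for the discrimination score; the probe pair intensities are invented.

```python
import statistics

TAU = 0.015  # default threshold from the slide


def discrimination_score(pm, mm):
    """Relative difference between PM and MM; ranges from -1 to 1."""
    return (pm - mm) / (pm + mm)


# Invented (PM, MM) intensities for one probe set.
probe_pairs = [(1200.0, 400.0), (900.0, 300.0), (800.0, 800.0),
               (1500.0, 500.0), (600.0, 200.0)]

scores = [discrimination_score(pm, mm) for pm, mm in probe_pairs]
print(scores)
print(statistics.median(scores) > TAU)
```

A score near 1 means PM is much brighter than MM, a score near 0 means both cells hybridize equally, and a negative score means MM is brighter than PM.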
Detection algorithm
Probe pairs (PPs) with R > τ vote for the transcript to be present (P) while PPs with R < τ vote for the transcript to be absent (A).
The voting results of all PPs are summarized as a p-value computed by the one-sided Wilcoxon's Signed Rank test.
Detection algorithm
Null hypothesis:
H0: median (Ri) = τ versus the alternative
H1: median (Ri) > τ
τ = 0.015 (see Liu et al. Bioinformatics 2002)

The Wilcoxon Signed Rank test is a nonparametric test that assumes that there is information in the magnitudes of the differences between paired observations.
Take the paired observations, calculate the differences, and rank them from smallest to largest by absolute value.
Add all the ranks associated with positive differences, giving the T+ statistic. If the null hypothesis is true, the sum of the ranks of the positive differences should be about the same as the sum of the ranks of the negative differences.
Finally, the p-value associated with this statistic is identified in an appropriate table.
The Wilcoxon test is rank-based (rank tests are the basis of R-estimates), which makes it robust against outliers.
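The ranking procedure above can be sketched in Python. This is a minimal illustration, not the GCOS implementation: it takes discrimination scores as input, tests the differences R - τ, and replaces the table lookup with an exact enumeration of the null distribution of T+. The function names and example scores are assumptions.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of the 1-based positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def detection_p_value(scores, tau=0.015):
    """One-sided exact p-value for H0: median(R) = tau versus H1: median(R) > tau."""
    diffs = [r - tau for r in scores if r != tau]  # zero differences carry no sign
    if not diffs:
        return 1.0
    # Doubled ranks are integers even when ties produce half-integer ranks.
    doubled = [round(2 * rank) for rank in average_ranks([abs(d) for d in diffs])]
    t_plus = sum(rank for rank, d in zip(doubled, diffs) if d > 0)
    # Under H0 every rank is positive or negative with probability 1/2.
    # counts[s] = number of sign assignments whose positive rank sum equals s.
    counts = {0: 1}
    for rank in doubled:
        updated = {}
        for total, ways in counts.items():
            updated[total] = updated.get(total, 0) + ways                # sign negative
            updated[total + rank] = updated.get(total + rank, 0) + ways  # sign positive
        counts = updated
    at_least = sum(ways for total, ways in counts.items() if total >= t_plus)
    return at_least / 2 ** len(doubled)


# Invented discrimination scores for an 11-probe-pair probe set.
scores = [0.42, 0.38, 0.51, 0.05, -0.10, 0.33, 0.47, 0.29, 0.36, 0.44, 0.40]
print(detection_p_value(scores))
```

With 11 to 20 probe pairs the exact enumeration is cheap, so no normal approximation is needed in this sketch.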
Detection algorithm
If a mismatch cell is saturated (MM > 46000),

the corresponding probe pair is not used in
further computations.
PPs where PM and MM are within τ of each other

are discarded.
If all probe pairs in a unit are saturated, the

gene is reported as detected and the p-value is
set to 0.
Making the call
Present (detected): p < α1

Marginal: α1 ≤ p < α2
Absent (undetected): p ≥ α2
Significance levels
Default α1 = 0.04 (16-20 probe pairs)

Default α2 = 0.06 (16-20 probe pairs)
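The call rule can be written as a few lines of Python. This is a sketch with the default significance levels from the slide; it assumes that a p-value exactly equal to α1 counts as Marginal and one exactly equal to α2 counts as Absent.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to Present (P), Marginal (M) or Absent (A)."""
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"


for p in (0.001, 0.05, 0.5):
    print(p, detection_call(p))
```

Lowering α1 makes Present calls more stringent; widening the gap between α1 and α2 increases the number of Marginal calls.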
Signal algorithm: CEL >> CHP

Background
• For purposes of calculating

background values, the array
is split up into K rectangular
zones Z_k (k = 1, …, K, default
K = 16).
• Control cells and masked cells
are not used in the calculation.
• The cells are ranked and the
lowest 2% is chosen as the
background b for that zone
(bZk).
• The standard deviation of the
lowest 2% cell intensities is
calculated as an estimate of
the background variability n for
each zone (nZk).
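The zone background can be sketched as follows. The sketch assumes a square grid of 4 × 4 zones (K = 16), takes the background b as the mean of the lowest 2% of cells in a zone, and uses the population standard deviation for n; these choices, the function name and the synthetic grid are illustrative assumptions. Control and masked cells are not modelled.

```python
import statistics


def zone_backgrounds(intensities, zones_per_side=4, fraction=0.02):
    """Return {(zone_row, zone_col): (b, n)} for a rectangular grid of cell intensities."""
    n_rows, n_cols = len(intensities), len(intensities[0])
    result = {}
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            r0, r1 = zr * n_rows // zones_per_side, (zr + 1) * n_rows // zones_per_side
            c0, c1 = zc * n_cols // zones_per_side, (zc + 1) * n_cols // zones_per_side
            cells = sorted(intensities[r][c]
                           for r in range(r0, r1) for c in range(c0, c1))
            # Keep the lowest 2% of the ranked cells (at least one cell).
            lowest = cells[:max(1, round(len(cells) * fraction))]
            result[(zr, zc)] = (statistics.fmean(lowest), statistics.pstdev(lowest))
    return result


# Synthetic 40 x 40 array: each zone holds 100 cells, so 2% is two cells.
grid = [[100.0 + 40 * r + c for c in range(40)] for r in range(40)]
backgrounds = zone_backgrounds(grid)
print(len(backgrounds), backgrounds[(0, 0)])
```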
The signal is a quantitative value that reflects the relative concentration of a given mRNA
It is computed as a weighted mean using the One-Step Tukey’s Biweight Estimate
The PM intensity is the sum of a true and a stray component: I_PM = I_T + I_S
The specific signal for each PP is calculated by subtracting the stray signal (estimated from MM) from the PM value and then taking its log.
I_S: intensity due to stray signal
I_T: intensity due to true signal
Three rules are applied:
If MM < PM, then MM is considered informative and its value is used directly as a stray (background) estimate
If the MMs are often but not always informative, the outliers are adjusted
If MM > PM, the MMs are replaced by a value smaller than PM
Signal calculation:
1. CEL intensity values are adjusted for global

background
2. MM value is calculated and subtracted from PM
3. Adjusted PM values are log2-transformed to
stabilize variance
4. Tukey’s biweight estimator is used to compute a
robust mean of the resulting values
5. The signal is scaled using a trimmed mean
One-step Tukey’s bi-weight algorithm
• Determine median to define center of data
• Calculate distance of each data point from median. This distance is used to determine to what extent a given value will contribute to the final signal
The greater the distance to the median, the smaller the contribution of a data point
>> this minimizes the effect of outliers…
One-step Tukey’s bi-weight algorithm
• Calculate median M for n values
• Calculate absolute distance of each data point from median.
• Calculate S, the median of the absolute distances from M.
• S is the Median Absolute Deviation (MAD), a first measure of the spread of the data
For each data point i, a uniform measure of distance u_i from the center is given by
u_i = (x_i - M) / (c S + ε)
x_i: value of data point i
M: median of the data
S: median absolute deviation (MAD)
c: tuning constant (default c = 5)
ε: small value used to avoid division by zero
The weight w of each point is calculated by the bisquare function:
w(u) = (1 - u^2)^2 for |u| ≤ 1
w(u) = 0 for |u| > 1
• For each point the weight w is reduced as a function of its distance from the median. The weight of extreme values is reduced to zero
Corrected values can then be computed with the one-step w-estimate, which is a weighted mean:
T_bi = Σ_i w(u_i) x_i / Σ_i w(u_i)
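The one-step estimate can be sketched in Python. The sketch assumes the usual definitions u = (x - M) / (c S + ε) and bisquare weights w(u) = (1 - u^2)^2 for |u| ≤ 1 and 0 otherwise; the value ε = 0.0001 and the example data are assumptions.

```python
import statistics


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = statistics.median(values)
    s = statistics.median([abs(x - m) for x in values])  # MAD
    weighted_sum = 0.0
    weight_total = 0.0
    for x in values:
        u = (x - m) / (c * s + epsilon)
        w = (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0
        weighted_sum += w * x
        weight_total += w
    return weighted_sum / weight_total


# Invented log-scale probe pair values with one extreme outlier.
values = [10.0, 10.1, 9.9, 10.05, 9.95, 100.0]
print(statistics.fmean(values))   # the ordinary mean is pulled up by the outlier
print(tukey_biweight(values))     # the biweight estimate stays near 10
```

The tuning constant c controls how quickly points lose influence: with c = 5, any value more than about five MADs from the median receives zero weight.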
Practical work
Life Sciences Training Facility

Pharmazentrum, room 5021
http://www.bioz.unibas.ch/corelab
GCOS:
•Compute CEL, CHP files
•Determine presence/absence calls
•Data quality control
MIMAS
•Annotate and upload files
Textbooks, Literature & web portals
DNA Microarray Data Analysis
J Tuimala and M Laine
http://www.csc.fi/molbio/
Follow the link “books and magazines”. You can download the pdf file for free after registration.
Microarray Gene Expression Data Analysis: A Beginner's Guide
Helen Causton, John Quackenbush, Alvis Brazma
http://www.amazon.co.uk/exec/obidos/ASIN/1405106824/qid%3D1047375686/026-1898565-5814030
Bioinformatics: Sequence and Genome Analysis
David Mount
2004 CSH Press
http://www.bioinformaticsonline.org/
Liu et al. Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 2002
Hubbell et al. Robust estimators for expression analysis. Bioinformatics 2002
Irizarry et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003
Bolstad et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003
Gautier et al. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004
NETAFFX:
http://www.affymetrix.com/analysis/
Register and access info on probes,

annotation, technotes, stats
reference guide, expression manuals
etc.
Certified array data repositories
EBI: ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
NCBI: Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/projects/geo/

http://www.nslij-genetics.org/microarray/
