
From Expression Measurements to Cell Signaling Inference

Michael Ochs
The Sidney Kimmel CCC, Johns Hopkins University

Focusing on Cell Signaling


Role in Cancer
Makes it a worthy target
Potential for improved treatment

Established Biology
Large body of knowledge built from decades of mechanistic studies
But need to remain open to updates that affect models
Oncology Biostatistics and Bioinformatics

Focus on Expression
Microarrays Mature, Global
We understand artifacts and normalization
We have validated predictions

Methylation, SNPs, CNVs


Array technologies maturing, useful
Sequencing mixed (SNPs best)

Mutations
Overall immature for global measurements

Cancer Revisiting Hallmarks


EGFR ERBB2 RAS RAS p53 pRb p53

RAS VEGF

FAK NEDD9 p53 hTERT


Hanahan and Weinberg, Cell, 100, 57, 2000

Emerging Hallmarks
Cytokines and Receptors

p53 MDM2 ATM BRCA1 BRCA2


Hanahan and Weinberg, Cell, 144, 646, 2011

Measuring Signaling
Signaling Protein Levels
very hard in vivo for signaling proteins
efforts using 2D gel have been successful, but expensive and not scalable
not a global measure, only what you seek

Transcript Levels
easily measured from biopsies/tumors but

Expression of Signaling Proteins


General Relation of mRNA to Protein
correlation coefficient of 0.36 (yeast)
human prostate cancer (60% concordant meaning only same direction of change)

Signaling
driven by post-translational modifications
signaling proteins have low expression

Transcriptional Signatures for Signaling

[Diagram: signaling proteins M, F, and H each drive a set of transcribed genes: M drives A and B, F drives C and D, H drives E]

Use Expression as Downstream Marker!!

Transcription as Biomarker for Cell Signaling


Isolate Signature in Presence of Overlapping Regulation
Determine Significance of Signature for Activity of Transcriptional Regulators
Link Transcriptional Regulators to Signaling Processes

Cell Cycle Data Revisited


[Figure: time series of microarray measurements for 6000 yeast genes]

(Spellman et al, Mol Biol Cell, 9, 3273, 1998) (Cho et al, Mol Cell, 2, 65, 1998)

Validation Gold Standard Coregulation

Cherepinsky et al, PNAS, 100, 9668, 2003

ROC Curve for Clusters Revisited

Random Line!

Matrix Factorization
To address multiple regulation

Analytic Solution
Principal Component Analysis

Computational Solutions
Independent Component Analysis
Nonnegative Matrix Factorization
Bayesian Matrix Factorization

Principal Component Analysis


Identifies Directions in Data in Terms of Decreasing Variance
Each Principal Component (PC) is Orthogonal to all Previous PCs

$N_{pc} \le \dim(D)$
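The two PCA properties on this slide (directions ordered by decreasing variance, each orthogonal to the previous ones) can be checked numerically. The following is a minimal sketch using a singular value decomposition; the toy data and the function name `principal_components` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def principal_components(data, n_components):
    """Return the top principal directions and their variances.

    `data` has one row per observation and one column per variable.
    """
    centered = data - data.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (data.shape[0] - 1)
    return vt[:n_components], variances[:n_components]

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 6))    # hypothetical toy data: 50 samples, 6 variables
components, variances = principal_components(toy, 3)
gram = components @ components.T  # identity matrix when the PCs are orthonormal
```

The orthogonality seen in `gram` is imposed by the method itself, whatever the structure of the data.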


PCA of Cell Cycle Data


[Figure: principal components 1 to 6 of the cell cycle data, with M phase, G1 phase, and S-G2 phase marked]

PCA
Captures Some Cell Cycle Behavior
Peaks in correct places
Shape and overall slope incorrect

Fails to Correctly Link Genes


Poor recovery of gold standard links

Issue of Orthogonality
Biology does not match condition of orthogonality of gene expression

Goal of Analysis
[Diagram: D = A x P
D: Data vs Mock (gene 1 to gene N by condition 1 to condition M)
A: Distribution of Genes in Patterns (gene 1 to gene N by pattern 1 to pattern k)
P: Patterns of Behavior (pattern 1 to pattern k by condition 1 to condition M)]
Patterns Nonorthogonal in General


[Diagram: P matrix, pattern 1 to pattern k by condition 1 to condition M]
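A small numerical illustration may help here. The sketch below builds a noise-free data matrix from an amplitude matrix A and a pattern matrix P whose rows overlap in time, so the patterns are not orthogonal. All numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical patterns (rows of P): two overlapping time courses over 5 conditions,
# each normalized to sum to 1.
P = np.array([
    [0.0, 0.1, 0.4, 0.4, 0.1],   # pattern 1
    [0.1, 0.4, 0.4, 0.1, 0.0],   # pattern 2, overlaps pattern 1
])

# Hypothetical amplitudes (rows of A): how much each gene uses each pattern.
A = np.array([
    [2.0, 0.0],   # gene 1: pattern 1 only
    [0.0, 3.0],   # gene 2: pattern 2 only
    [1.0, 1.0],   # gene 3: multiply regulated, uses both patterns
])

D = A @ P                      # noise-free data: 3 genes x 5 conditions
overlap = float(P[0] @ P[1])   # nonzero dot product, so the patterns are not orthogonal
```

Gene 3 is the multiply regulated case: its row of D is the sum of the two patterns, which an orthogonal basis cannot represent directly.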

Link Conditions

P: Patterns of Behavior, typically normalized to sum to 1

Time Series (t): show when process is on
Normal vs Tumor: show change in activity
Patients: group patients with similar expression behaviors

Distributions
A: Distribution of Genes in Patterns

[Diagram: A matrix, gene 1 to gene N by pattern 1 to pattern k]

Distributes Behavior of Genes Among Patterns
How much of a gene's expression is due to a given pattern
Isolates genes together despite multiple regulation
Useful for Gene Set Analysis

Issues
Nonorthogonality
No known analytic solution to best patterns (nonorthogonal basis vectors)
Infinite number of potential bases (rotation of any basis set is new basis set)

Bias over Variance


Solutions require introducing bias
Typically sparseness and positivity

Original NMF
Nonnegative Matrix Factorization
[Diagram: data matrix factored as the product (X) of two nonnegative matrices]

Lee and Seung Nature, 401, 788, 1999

NMF
Populate matrices randomly (i.e., from a distribution)

Minimize a Cost Function

$$\mathrm{Div}(D \,\|\, M) = \sum_{ij} \left( D_{ij} \log \frac{D_{ij}}{\sum_k A_{ik} P_{kj}} - D_{ij} + \sum_k A_{ik} P_{kj} \right)$$

Pick next element

By Moving to Local Minimum in Cost
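As a concrete reading of the cost function, the sketch below evaluates the divergence between a data matrix D and a model M = A P. The small constant `eps` and the toy matrices are assumptions added for numerical safety and illustration.

```python
import numpy as np

def divergence(D, A, P, eps=1e-12):
    """Generalized Kullback-Leibler divergence between D and the model M = A @ P."""
    M = A @ P
    # eps guards the logarithm against zeros; it is an implementation assumption.
    return float(np.sum(D * np.log((D + eps) / (M + eps)) - D + M))

# Hypothetical toy factors.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
P = np.array([[0.2, 0.8], [0.6, 0.4]])
D_exact = A @ P

cost_exact = divergence(D_exact, A, P)      # zero when the model reproduces the data
cost_off = divergence(D_exact + 0.5, A, P)  # positive when it does not
```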


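$$P_{kj} \leftarrow P_{kj}\, \frac{\sum_i A_{ik}\, D_{ij} / \left( \sum_h A_{ih} P_{hj} \right)}{\sum_i A_{ik}}$$

$$A_{ik} \leftarrow A_{ik}\, \frac{\sum_j P_{kj}\, D_{ij} / \left( \sum_h A_{ih} P_{hj} \right)}{\sum_j P_{kj}}$$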
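The multiplicative update scheme of Lee and Seung can be sketched in a few lines. This version updates each whole matrix at once rather than stepping through individual elements, and the random initialization range, iteration count, and `eps` guard are assumptions of the sketch.

```python
import numpy as np

def divergence(D, M, eps=1e-12):
    """Generalized KL divergence between data D and model M."""
    return float(np.sum(D * np.log((D + eps) / (M + eps)) - D + M))

def nmf(D, k, n_iter=200, seed=0, eps=1e-12):
    """Factor nonnegative D (genes x conditions) into A (genes x k) and P (k x conditions)."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    A = rng.uniform(0.1, 1.0, size=(n, k))   # populate matrices randomly
    P = rng.uniform(0.1, 1.0, size=(k, m))
    costs = []
    for _ in range(n_iter):
        ratio = D / (A @ P + eps)
        P *= (A.T @ ratio) / (A.sum(axis=0)[:, None] + eps)   # update patterns
        ratio = D / (A @ P + eps)
        A *= (ratio @ P.T) / (P.sum(axis=1)[None, :] + eps)   # update amplitudes
        costs.append(divergence(D, A @ P))
    return A, P, costs

# Hypothetical rank-2 toy data.
rng = np.random.default_rng(1)
D = rng.uniform(0.0, 1.0, size=(20, 2)) @ rng.uniform(0.0, 1.0, size=(2, 8))
A_fit, P_fit, costs = nmf(D, k=2)
```

Because each run only moves downhill to a local minimum, changing `seed` gives a different answer in general, which is why repeated runs from other random starting points are needed.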

NMF Procedures
Choose a Dimensionality
Populate the Matrices
Step through Matrix Elements, Updating each in turn
Stop when Reach Best Fit
Repeat from Other Random Starting Points

Bioinformatics

Fox Chase Cancer Center

NMF Issues
Guaranteed Local Minimum = Almost Guaranteed Global Non-Minimum
Updating Sequence through Matrix Encourages Trapping
Need to Determine Dimensionality
No Ability to Include Uncertainty Information (all data equal variance)
No Sampling of Distribution

NMF of Yeast Cell Cycle Pick Best Fit from 200 Runs


Wang et al, BMC Bioinfo, 28, 7, 2006

NMF Issues
Poor Sampling
Local maxima lead to poor sampling of actual distribution
Lack of information on uncertainty of parameter estimates

Poor Recovery in Complex Cases


Demonstrate later for human tumor data


Markov Chain Monte Carlo Bayesian Decomposition


MCMC Allows Sampling
Provides more thorough exploration of model space
Provides estimates of uncertainties for each parameter

MCMC Expensive
Gibbs sampling computationally complex
Potential in newer stochastic methods

Markov Chain Monte Carlo


Find A and P Simultaneously
We can only estimate relative probabilities of possible solutions

Markov Chain Monte Carlo is used to explore the possible solutions (Gibbs sampler with simulated annealing)

Ochs et al, J Magn Res, 137, 161, 1999 Moloshok et al, Bioinformatics, 18, 566, 2002

Fertig et al, Bioinformatics, 2010

Sampling of Posterior Distribution

$$p(M \mid D) = \frac{p(D \mid M)\, p(M)}{p(D)}$$

Posterior: p(M|D)
Likelihood: p(D|M)
Prior: p(M)
Evidence (Marginal of Data): p(D)


Bayes Equation (independently by Laplace)
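To make the roles of the terms concrete, the sketch below applies Bayes' equation to a finite set of candidate models. The prior and likelihood values are hypothetical. It also shows that the ratio of two posteriors depends only on likelihood times prior, since the evidence cancels.

```python
def posterior(priors, likelihoods):
    """Posterior p(M|D) over a finite set of candidate models.

    priors[i] is p(M_i); likelihoods[i] is p(D|M_i).
    """
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(unnormalized)   # p(D), the marginal of the data
    return [u / evidence for u in unnormalized]

# Hypothetical numbers: two candidate models with equal prior weight.
post = posterior([0.5, 0.5], [0.02, 0.06])

# The evidence cancels in the ratio, so relative probabilities need no normalization.
ratio = post[1] / post[0]
```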

The Prior
Atomic Domain 1D Infinitely Divisible
Atoms are created ex vacuo with a prior uniform in space, exponential in amplitude
Positive Additive Distribution
Atomic Domain is divided into multiple sections mapping to matrix; one domain for each matrix
In this case, A and P are positive

[Diagram: atoms in the atomic domain mapped to elements of the A matrix (Genes in Patterns)]

Prior Qualities
Atoms
Point Masses Described Completely by Location and Amplitude

Distribution
Bias toward Few Large Atoms
Bias toward Minimizing Structure
Include Correlations by Kernel Function

*Sibisi and Skilling, J Royal Stat Soc B, 59, 217, 1997

Specific Priors
Number of Atoms
$$P(N) = (1 - a)\, a^{N}, \qquad a = \frac{\alpha_0}{\alpha_0 + 1}$$
$\alpha_0$ is expectation on N (hyperparameter)

Flux of Atoms
$$P(z) = q^{-1} e^{-z/q}$$
with q expected flux (hyperparameter)

[Figures: CDF for N, for $\alpha_0$ = 4800; CDF for z, for q = 200]
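The two priors have closed-form cumulative distributions, which is presumably what the CDF plots show. The sketch below assumes the geometric form for the number of atoms and the exponential form for the flux as stated on the slide; the function names and the numerical mean check are illustrative additions.

```python
import math

def atom_count_cdf(n, expected_n):
    """P(N <= n) for P(N) = (1 - a) * a**N with a = expected_n / (expected_n + 1)."""
    a = expected_n / (expected_n + 1.0)
    return 1.0 - a ** (n + 1)

def flux_cdf(z, q):
    """P(flux <= z) for the exponential prior P(z) = exp(-z / q) / q."""
    return 1.0 - math.exp(-z / q)

def atom_count_mean(expected_n, n_max=200000):
    """Numerically confirm that the prior mean of N equals the hyperparameter."""
    a = expected_n / (expected_n + 1.0)
    return sum(n * (1.0 - a) * a ** n for n in range(n_max))

cdf_n = atom_count_cdf(3500, 4800)   # hyperparameter value 4800 taken from the slide
median_flux = 200 * math.log(2)      # flux where the CDF reaches 0.5 for q = 200
```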

MCMC

Step types: Birth, Death, Move, Exchange
Number of Steps is ~ 1/2[N + Ga(N)]
First for A, Then for P

Birthing Atoms
Atomic Domain 1D Infinitely Divisible

Position from $U(0, 2^{32})$
Flux from positive Normal Distribution based on gradient and curvature of LogLikelihood
As average flux increases, distribution shifts to low amplitude

[Figures: probability versus flux, with low average atom flux and with high average atom flux]

Effect of logLikelihood

The distributions for choosing atomic flux adjust to both changes in average flux of present atoms and changes in the logLikelihood around a chosen point

[Figures: effect of decreasing gradient; effect of decreasing curvature]

Accepting Changes

Test change in likelihood using standard MCMC criteria
Note: Only using relative likelihood changes, ignoring normalization (evidence)

Destroying Atoms
Atomic Domain 1D Infinitely Divisible

Pick Atom at Random
Destroy if Effect on LogLikelihood is within allowable MCMC criteria
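The slides do not spell out the acceptance criterion used for births, deaths, and moves. One common choice is the Metropolis rule, sketched below under the assumptions of a symmetric proposal and no prior ratio. The `temperature` parameter is a hypothetical stand-in for the simulated annealing schedule.

```python
import math
import random

def accept_change(delta_log_likelihood, rng, temperature=1.0):
    """Metropolis rule: always accept improvements, sometimes accept worse states.

    Only the change in log-likelihood is needed, so the evidence never appears.
    """
    if delta_log_likelihood >= 0:
        return True
    return rng.random() < math.exp(delta_log_likelihood / temperature)

rng = random.Random(0)
always = accept_change(2.5, rng)   # an improvement is always accepted

# A change that lowers the log-likelihood by 1 is accepted about exp(-1) of the time.
trials = 20000
rate = sum(accept_change(-1.0, rng) for _ in range(trials)) / trials
```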


Exchanging Flux/Moving Atoms


Atomic Domain 1D Infinitely Divisible
Movement involves trying to move atom between neighbors

Pick Atom at Random
Find its Neighbor
Look at Joint LogLikelihood and choose fluxes based on random sample from joint LogLikelihood

BD Patterns of Cell Cycle Data


[Figure: pattern amplitude (0 to 0.3) across time points labeled by cell cycle phase (M/G1, G1, S, G2, M); patterns shown: M/G1, G1A, S/G2, M, G1B, Oscillator]

Moloshok et al, Bioinformatics, 18, 566, 2002

Bayesian Decomposition

ROC analysis equivalent to NMF results
Matrix factorization identifies known coregulation in yeast data

[Figure: ROC curve, Sensitivity versus 1 - Specificity]

Bayesian Decomposition
Finds P Matrix
Determines patterns in the data using bias of positivity and minimal structure
For time series, can see that the patterns make sense

Simultaneously Finds A Matrix
Links genes together despite multiple regulation
Column of A becomes SIGNATURE

Transcription as Biomarker for Cell Signaling


Isolate Signature in Presence of Overlapping Regulation
Determine Significance of Signature for Activity of Transcriptional Regulators
Link Transcriptional Regulators to Signaling Processes

GIST Experiment Targeted Therapy


GIST-T1 cells in triplicate
Agilent arrays
Gene level estimates
Keep genes with known TF regulation

Matrix Factorization of GIST Data


[Diagram: D = A x P
D: Data (gene 1 to gene N by time, from Time = 0 to Time = 48 hr)
A: Distribution of Genes in Patterns (gene 1 to gene N by pattern 1 to pattern k)
P: Patterns of Behavior (pattern 1 to pattern k by time, from Time = 0 to Time = 48 hr)]

Patterns in the Data


Mathematically - Basis Vectors
Decreases with IM Increases with IM

Transient with IM

Increasing time with IM



Estimating Transcription Factor (TF) Activity


Build a Gene Set of All Targets
Genes known to be regulated by a TF
High quality annotations important

Build a Statistical Test for TF


Use sum of individual gene tests
Test for significance

Inference on TF Activity
For each gene
A: Distribution of Genes in Patterns (gene 1 to gene 1363 by pattern 1 to pattern 5), and from MCMC 500,000 samples from posterior distribution

Calculate Z-score
$$Z_{ip} = \mu_{ip} / \sigma_{ip}$$
with $\mu_{ip}$ and $\sigma_{ip}$ the mean and standard deviation of $A_{ip}$ over the posterior samples

Calculate Z-score for TF
$$Z_{tp} = \frac{1}{R} \sum_{r \in G_t} Z_{rp}$$
with $G_t$ the set of R target genes of TF t
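The two Z-score steps can be sketched directly from posterior samples. The example assumes that each gene's Z-score in a pattern is the posterior mean of its amplitude divided by the posterior standard deviation, and that the TF score is the average over its target genes. The gene names and sample values are hypothetical.

```python
import statistics

def gene_z_scores(samples):
    """samples[g] is a list of posterior draws of the amplitude of gene g in one pattern."""
    return {g: statistics.mean(v) / statistics.stdev(v) for g, v in samples.items()}

def tf_z_score(z, targets):
    """Average the gene Z-scores over the target set of one transcription factor."""
    return sum(z[g] for g in targets) / len(targets)

# Hypothetical posterior draws for three genes in one pattern.
draws = {
    "geneA": [4.0, 5.0, 6.0],
    "geneB": [1.0, 2.0, 3.0],
    "geneC": [0.0, 1.0, 2.0],
}
z = gene_z_scores(draws)
z_tf = tf_z_score(z, ["geneA", "geneB"])   # TF with two known targets
```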

Significance of Ztp
Need distribution of Ztp for
Different Rs Different Patterns

Permutation Test
For each pattern generate Ztp for random collection of R genes
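A minimal sketch of such a permutation test follows. It assumes a one-sided test for over-representation and adds one to both counts so the estimated p-value is never exactly zero. The gene names, Z-scores, and number of permutations are hypothetical.

```python
import random

def permutation_p_value(z, targets, n_perm=2000, seed=0):
    """Fraction of random gene sets of size R whose mean Z is at least the observed mean."""
    rng = random.Random(seed)
    genes = list(z)
    r = len(targets)
    observed = sum(z[g] for g in targets) / r
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, r)   # random collection of R genes
        if sum(z[g] for g in sample) / r >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical Z-scores for 100 genes in one pattern; the first 5 are strong TF targets.
z = {f"g{i}": (10.0 if i < 5 else float(i % 3)) for i in range(100)}
p_targets = permutation_p_value(z, [f"g{i}" for i in range(5)])
p_background = permutation_p_value(z, [f"g{i}" for i in range(50, 55)])
```

Because the null distribution depends on both R and the pattern, the test has to be rerun for each TF set size and each pattern.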

Transcription as Biomarker for Cell Signaling


Isolate Signature in Presence of Overlapping Regulation
Determine Significance of Signature for Activity of Transcriptional Regulators
Link Transcriptional Regulators to Signaling Processes

Linking to Pathway Activity For now Visually


Convert TF p-value to TF Activity
Rescale p < 0.5 for over-representation to a 0 to 1 scale
Rescale p < 0.5 for under-representation to a -1 to 0 scale

Visualize as Blue (Low Activity) - Yellow (High Activity) with Pathway Maps
Look for Coordinated Changes
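The exact rescaling is not given on the slide. One simple possibility is a linear map in which p = 0 gives full activity and p = 0.5 gives zero, sketched below. The function name and the linear form are assumptions.

```python
def tf_activity(p_over, p_under):
    """Map one-sided p-values to a signed activity in [-1, 1].

    Assumed linear rescaling: p = 0 maps to +1 (over-representation)
    or -1 (under-representation), and p = 0.5 maps to 0.
    """
    if p_over < 0.5:
        return 1.0 - 2.0 * p_over
    if p_under < 0.5:
        return -(1.0 - 2.0 * p_under)
    return 0.0

high = tf_activity(0.0, 1.0)     # strongly over-represented targets
medium = tf_activity(0.25, 0.75)
low = tf_activity(0.9, 0.1)      # under-represented targets give negative activity
```

The signed value can then be mapped onto a blue to yellow color scale for display on a pathway map.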

Network Activity TF Target Genes Not Shown

Decreases with IM Transient with IM Increases with IM



Biological Validation DNA Damage to p53 Activation


Biological Validation STAT3 and ELK1 Activity


STAT3 and ELK1 Activity


Advantages of MCMC Sampling


coGAPS: sampling NMF: best 50 of 500

Decreases with IM

Transient with IM

Increases with IM


Significant High Active TF Significant Low Inactive TF

Summary of Matrix Factorization


Matrix Factorization Recovers Coregulation better than Clustering
MCMC Sampling Recovers Signatures of Biology better than NMF
Costs
Bias of NMF > Bias of Clustering
Complexity of MCMC > Complexity of NMF

Tumor Responses Method Works in Clinical Data

Cell Line Increases with IM

Tumor Pattern A


RTOG 0132 Phase II Trial

Initial Clinical Response


Group A vs Group B
>20% Shrinkage


Logistic Regression

RTOG 0132 Survival


Good Initial Tumor Response May Correlate with Cancer Stem Cell Activation
3 Year PFS Does NOT Correlate with Initial Response

DESIDE
Differential Expr for Signaling Determination

Inference on Activity of Downstream Signaling Effectors
Global Unbiased Measure
Multiple Signaling Proteins Potentially Give Same Transcriptional Signature
Some Important Signaling Molecules Change Translation Only (mTOR)

The Role of Methylation


Transcription Factors
Function by turning on genes by binding to DNA in promoters
Promoters must be accessible

Methylation
Of promoters silences genes
Of histones can also silence genes

Transcription - Epigenetics

TF
TFBS

BTM

CGCGATACGCG

~1500 5000 bases


Copy Number Effects


Loss of Gene
Obviously this will also affect the expression of the gene
LOH can also affect amount of transcript

Measurements
Because of tumor heterogeneity (some normal tissue, infiltrates), exact measurement is difficult

Simplest Correction
Gene Set for Transcription Factor
Group of genes regulated by TF
If gene CANNOT be regulated, remove from the Gene Set

Filtered Gene Set Analysis


Each set is adjusted for methylation and CNV
Requires cut-off and patient-specific measurement
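The per-patient filtering step can be sketched as follows. The cutoff values, the direction of the copy number test (dropping genes with copy number loss), and the treatment of genes without measurements are all assumptions of this sketch.

```python
def filter_gene_set(targets, methylation_beta, copy_number,
                    beta_cutoff=0.25, cn_cutoff=1.2):
    """Keep only the TF targets that can still be regulated in one patient.

    A gene is dropped when its promoter methylation beta exceeds beta_cutoff
    or its copy number falls below cn_cutoff. Genes missing from either
    measurement are kept (an assumption of this sketch).
    """
    kept = []
    for gene in targets:
        beta = methylation_beta.get(gene, 0.0)
        cn = copy_number.get(gene, 2.0)
        if beta <= beta_cutoff and cn >= cn_cutoff:
            kept.append(gene)
    return kept

# Hypothetical patient in whom the promoter of Gene C is heavily methylated.
targets = ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E"]
patient_beta = {"Gene A": 0.05, "Gene B": 0.10, "Gene C": 0.80,
                "Gene D": 0.02, "Gene E": 0.20}
patient_cn = {gene: 2.0 for gene in targets}
filtered = filter_gene_set(targets, patient_beta, patient_cn)
```

Each patient gets a separate filtered set, so the same TF can be tested on different target lists in different patients.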

Filtered Gene Set Test


[Diagram: data matrix (gene 1 to gene N by Patient 1 to Patient M) with the TF Targets (Gene A, Gene B, Gene C, Gene D, Gene E) marked. Methylation data filter the target list separately for each patient, giving one TF Targets set per patient; for Patient 1 the filtered set is Gene A, Gene B, Gene D, Gene E.]

New Statistic on Genes Needed


Gene Set Analysis
Depends on ranking of genes between two groups
But have ELIMINATED ONE GROUP, each patient enters individually

Generate a t-like statistic


Provides a signed statistics suitable for mixed

A Simple Modified t-Statistic


$$t_{is} = \frac{x_{is} - \bar{x}_{iN}}{\sigma_{iN}}$$

$x_{is}$: expression of gene i in patient s
$\bar{x}_{iN}$: mean expression of gene i in normal samples
$\sigma_{iN}$: standard deviation of gene i in normal samples
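The statistic is simple enough to compute by hand. A short sketch follows, with hypothetical expression values and the sample standard deviation assumed for the normal group.

```python
import statistics

def modified_t(x_is, normal_values):
    """Signed statistic for one gene in one patient, relative to the normal samples."""
    mean_n = statistics.mean(normal_values)
    sd_n = statistics.stdev(normal_values)   # sample standard deviation (n - 1)
    return (x_is - mean_n) / sd_n

# Hypothetical log2 expression of one gene in three normal samples.
normals = [7.0, 8.0, 9.0]
t_up = modified_t(11.0, normals)    # (11 - 8) / 1 = 3.0
t_down = modified_t(6.0, normals)   # (6 - 8) / 1 = -2.0
```

The sign distinguishes over-expression from under-expression in each patient, which is what a per-patient gene ranking needs.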


HNSCC Data
ARRA Funded Biomarker Study
69 Samples: 44 HNSCC, 25 UPPP

Transcripts: 16330 X 69
Promoter Methylation: 12033 X 69
Copy Number Variation: 14599 X 69

Joe Califano, Patrick Hennessey

Exploring Impact of FGSA Filtered Gene Set Analysis


Vary Cutoffs
Methylation and copy number to see impact on estimation of p-value
Compare with GSA

Biological Validation
See predicted changes and try to validate
Problem: tumor samples are difficult to work with for validation

Gene Sets
TRANSFAC Professional 2010.4
1325 TFs filtered to keep only those with 5 unique targets
230 TFs for Analysis

FGSA
TFs can have fewer targets for any individual patient due to filtering


Most Varying Significant TFs


[Heatmap with Color Key and Histogram (Count versus Value)]

Rows (TFs): HSF2A, HSF1L, GABPbeta, NFYA, STAT3, STAT1alpha, E2F1, p50:RelAp65, cFos:cJun, NFkappaB1p50, LRF, STAT5, AP2, E2F4, NFkappaB1, PPARalpha:RXRalpha, RXRalpha:PPARgamma, RXRalpha:PPARalpha, STAT3:STAT3, STAT1:STAT1, sp4, AhR:arnt, usf1:USF2, PPARalpha, ATF1, NF1, MyoD

Columns: pTF100; pTF1500, pTF1508, pTF1512, pTF1515; pTF2500, pTF2508, pTF2512, pTF2515; pTF3500, pTF3508, pTF3512, pTF3515; pTF4500, pTF4508, pTF4512, pTF4515

Column label key, e.g. pTF2512: Methylation Beta as 0.NN, CNV as N.N

Highlights
Some TFs Show Changes
Methylation and CNV affect p-value estimates of activities for some TFs

Some TFs Show No Significant Change

Methylation Dominates over CNV


Total CNV range captured in each methylation range by clustering

Methylation Expression

[Figure: Log2 Expression versus Methylation Beta for FADS2; sample groups: Normal, HPV, HPV+; correlation: 0.335633632853448]

[Figure: Log2 Expression versus Methylation Beta for NR1D1; sample groups: Normal, HPV, HPV+; correlation: 0.00821586442546115]

[Figure: Log2 Expression versus Methylation Beta for FABP1; sample groups: Normal, HPV, HPV+; correlation: 0.0441809836195627]

Summary
Abstraction
Reduce complexity of signaling networks
Link networks to effectors (TFs)

Bias
NMF/MCMC necessary but introduces Bias
TF activity defined by known target genes

Experimental Validation
Biological complexity means models limited

Acknowledgements
Research
Signaling and TFs: Ghislain Bidaut (CRC-Marseille), Andrew Kossenkov (Wistar)
Computational Modeling: Elana Fertig, Luda Danilova

Collaborators
HNSCC: Joseph Califano
GIST: Andy Godwin (KUMC)
Genetics: Giovanni Parmigiani (DFCI, Harvard), Olga Favorova (RSMU)

New MCMC Techniques: Elana Fertig, Alexander Favorov
Network Modeling: Donald Geman (JHU), Laurent Younes (JHU)
Genetics of Complex Diseases: Alexander Favorov
