Established Biology
Large body of knowledge built from decades of mechanistic studies
But need to remain open to updates that affect models
Oncology Biostatistics and Bioinformatics
Focus on Expression
Microarrays Mature, Global
We understand artifacts and normalization
We have validated predictions
Mutations
Overall immature for global measurements
RAS VEGF
Emerging Hallmarks
Cytokines and Receptors
Measuring Signaling
Signaling Protein Levels
very hard in vivo for signaling proteins
efforts using 2D gel have been successful, but expensive and not scalable
not a global measure, only what you seek
Transcript Levels
easily measured from biopsies/tumors but
Signaling
driven by post-translational modifications
signaling proteins have low expression
[Figure: network diagram with nodes M, A, B, F, C, D, H, and E; transcribed genes A, B, C, D, and E]
Random Line!
Matrix Factorization
To address multiple regulation
Analytic Solution
Principal Component Analysis
Computational Solutions
Independent Component Analysis
Nonnegative Matrix Factorization
Bayesian Matrix Factorization
Npc dim(D)
[Figure: PCA of cell-cycle data; labeled phases: M phase, G1 phase, S-G2 phase]
PCA
Captures Some Cell Cycle Behavior
Peaks in correct places
Shape and overall slope incorrect
Issue of Orthogonality
Biology does not satisfy the orthogonality condition: gene expression patterns are not orthogonal
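The orthogonality constraint can be made concrete with a small sketch. It assumes PCA is computed by SVD on a genes-by-conditions matrix; the data, the matrix sizes, and the function name `pca_patterns` are made up for illustration and do not come from the slides.

```python
import numpy as np

def pca_patterns(D, n_components):
    """Return the top principal-component patterns (rows) of a genes x conditions matrix."""
    # Center each condition (column) across genes before the decomposition.
    centered = D - D.mean(axis=0, keepdims=True)
    # SVD: centered = U @ diag(s) @ Vt; the rows of Vt are orthonormal patterns.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:n_components]

rng = np.random.default_rng(0)
D = rng.random((50, 8))      # hypothetical data: 50 genes, 8 conditions
P = pca_patterns(D, 3)
gram = P @ P.T               # equals the identity when patterns are orthonormal
```

Whatever the biology looks like, the patterns returned here are forced to be mutually orthogonal, which is the mismatch the slide points to.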
Goal of Analysis
[Figure: matrix factorization D ≈ A P. D: Data (gene 1 to gene N by condition 1 to condition M; panel labels: Mock vs Normal). A: Distribution of Genes in Patterns (gene 1 to gene N by pattern 1 to pattern k). P: Patterns of Behavior (pattern 1 to pattern k by condition 1 to condition M), typically normalized to sum to 1.]
Distributions
A: Distribution of Genes in Patterns
[Figure: A matrix, gene 1 to gene N by pattern 1 to pattern k; additional labels: Patients, Tumor]
Distributes Behavior of Genes
Isolates genes together despite multiple regulation
Useful for Gene Set Analysis
Issues
Nonorthogonality
No known analytic solution to best patterns (nonorthogonal basis vectors)
Infinite number of potential bases (rotation of any basis set is new basis set)
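A minimal numerical sketch of the non-uniqueness point: any invertible mixing matrix R gives a second factorization with exactly the same fit. The matrices below are random placeholders rather than data from the talk, and R is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 2))    # hypothetical gene-by-pattern amplitudes
P = rng.random((2, 5))    # hypothetical pattern-by-condition matrix

# Any invertible 2x2 matrix mixes the two patterns into a new basis.
R = np.array([[1.0, 0.3],
              [0.2, 1.0]])
A2 = A @ R
P2 = np.linalg.inv(R) @ P

# Both factorizations reproduce the same data matrix.
same_fit = np.allclose(A @ P, A2 @ P2)
```

Constraints such as nonnegativity or a prior on structure are what narrow this family of equivalent solutions.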
Original NMF
Nonnegative Matrix Factorization
[Figure: schematic of the product of two matrices]
NMF
Populate matrices randomly (i.e., from a distribution)
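[Figure: A and P matrices populated with random values]
Update rules:
$$A_{ik} \leftarrow A_{ik} \sum_{j} \frac{D_{ij}}{\sum_{h} A_{ih} P_{hj}} P_{kj}$$
$$A_{ik} \leftarrow \frac{A_{ik}}{\sum_{i'} A_{i'k}}$$
$$P_{kj} \leftarrow P_{kj} \sum_{i} A_{ik} \frac{D_{ij}}{\sum_{h} A_{ih} P_{hj}}$$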
NMF Procedures
Choose a Dimensionality
Populate the Matrices
Step through Matrix Elements, Updating each in turn
Stop when Reach Best Fit
Repeat from Other Random Starting Points
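The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the talk: it uses the Lee and Seung multiplicative updates for the divergence objective, updates whole matrices at once rather than stepping element by element, and stops after a fixed number of iterations instead of testing for best fit. The function names and default parameters are hypothetical.

```python
import numpy as np

def nmf(D, k, n_iter=500, seed=0, eps=1e-9):
    """Factor nonnegative D (genes x conditions) into A (genes x k) and P (k x conditions)."""
    rng = np.random.default_rng(seed)
    n, m = D.shape
    A = rng.random((n, k)) + eps          # populate the matrices randomly
    P = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        A *= ((D / (A @ P + eps)) @ P.T) / (P.sum(axis=1) + eps)
        P *= (A.T @ (D / (A @ P + eps))) / (A.sum(axis=0)[:, None] + eps)
    return A, P

def kl_divergence(D, M, eps=1e-9):
    """Generalized Kullback-Leibler divergence between data D and model M."""
    return np.sum(D * np.log((D + eps) / (M + eps)) - D + M)

def best_of_restarts(D, k, n_starts=5):
    """Repeat from several random starting points and keep the best fit."""
    fits = []
    for seed in range(n_starts):
        A, P = nmf(D, k, seed=seed)
        fits.append((kl_divergence(D, A @ P), seed, A, P))
    return min(fits, key=lambda f: f[0])
```

Calling `best_of_restarts(D, k)` returns the fit value, the seed that produced it, and the two factors. Keeping only the best run discards the information in the other runs.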
NMF Issues
Guaranteed Local Minimum = Almost Guaranteed Global Non-Minimum
Updating Sequence through Matrix Encourages Trapping
Need to Determine Dimensionality
No Ability to Include Uncertainty Information (all data equal variance)
No Sampling of Distribution
NMF of Yeast Cell Cycle
Pick Best Fit from 200 Runs
NMF Issues
Poor Sampling
Local maxima lead to poor sampling of actual distribution
Lack of information on uncertainty of parameter estimates
MCMC Expensive
Gibbs sampling computationally complex
Potential in newer stochastic methods
Ochs et al, J Magn Res, 137, 161, 1999
Moloshok et al, Bioinformatics, 18, 566, 2002
Posterior = Likelihood × Prior / Evidence
$$p(M \mid D) = \frac{p(D \mid M)\, p(M)}{p(D)}$$
The Prior
Atomic Domain: 1D, Infinitely Divisible
Atoms are created ex vacuo with a prior that is uniform in space and exponential in amplitude.
The atomic domain is divided into multiple sections that map to the matrix; there is one domain for each matrix.
In this case, A and P are positive
Prior Qualities
Atoms
Point Masses Described Completely by Location and Amplitude
Distribution
Bias toward Few Large Atoms
Bias toward Minimizing Structure
Include Correlations by Kernel Function
Specific Priors
Number of Atoms
$P(N) = (1 - a)\,a^{N}$ with $a = \alpha_0 / (\alpha_0 + 1)$
$\alpha_0$ is the expectation on N (hyperparameter)
Flux of Atoms
$P(z) = q^{-1} e^{-z/q}$ with q the expected flux (hyperparameter)
[Figure: CDF for N and CDF for z, for $\alpha_0 = 4800$]
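As a sketch of what these two priors imply, the snippet below draws a number of atoms from the geometric distribution and a flux for each atom from the exponential distribution. The parameter names `alpha` (expected number of atoms) and `q` (expected flux), and the function name, are assumptions made for this example.

```python
import numpy as np

def sample_prior(alpha, q, rng):
    """Draw a number of atoms and their fluxes from the priors on N and z."""
    a = alpha / (alpha + 1.0)
    # NumPy's geometric sampler has support 1, 2, ...; subtracting 1 gives
    # P(N) = (1 - a) * a**N on N = 0, 1, 2, ..., whose mean is alpha.
    n_atoms = rng.geometric(1.0 - a) - 1
    # Exponential flux prior with mean q.
    fluxes = rng.exponential(scale=q, size=n_atoms)
    return n_atoms, fluxes

rng = np.random.default_rng(0)
n_atoms, fluxes = sample_prior(alpha=10.0, q=2.0, rng=rng)
```

Both distributions put most of their mass near zero, so a solution needs support from the data before it acquires many atoms or large fluxes.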
MCMC
[Figure: MCMC step types: Move, Birth, Death, Exchange]
Number of Steps is ~ 1/2[N + Ga(N)]
Birthing Atoms
Atomic Domain: 1D, Infinitely Divisible
Position from U(0, 2^32)
Flux from positive Normal Distribution based on gradient and curvature of LogLikelihood
As average flux increases, distribution shifts to low amplitude
[Figure: probability versus flux; effect of logLikelihood shown for decreasing gradient and decreasing curvature]
The distribution for choosing atomic flux adjusts both to changes in the average flux of the present atoms and to changes in the logLikelihood around a chosen point.
Accepting Changes
Test change in likelihood using standard MCMC criteria
Note: Only using relative likelihood changes, ignoring normalization (evidence)
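One common form of such a criterion is the Metropolis rule, sketched below. It assumes a symmetric proposal and is offered as an illustration of "standard MCMC criteria", not as the exact test used in Bayesian Decomposition. Only the difference of log-likelihoods enters, so the evidence term cancels. The function name is hypothetical.

```python
import math
import random

def accept_change(loglik_old, loglik_new, rng=random):
    """Metropolis test: accept with probability min(1, L_new / L_old)."""
    log_ratio = loglik_new - loglik_old
    if log_ratio >= 0:
        # The change does not lower the likelihood: always accept.
        return True
    # Otherwise accept with probability equal to the likelihood ratio.
    return rng.random() < math.exp(log_ratio)
```

Adding the same constant to both log-likelihoods leaves `log_ratio` unchanged, which is why the normalization can be ignored.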
Destroying Atoms
Atomic Domain: 1D, Infinitely Divisible
Pick Atom at Random
Destroy if Effect on LogLikelihood is within allowable MCMC criteria
Look at Joint LogLikelihood and choose fluxes based on random sample from joint LogLikelihood
[Figure: pattern amplitudes (0 to 0.25) across cell-cycle phases (M/G1, G1, S, G2, M); patterns: M/G1, G1A, S/G2, M, G1B, Oscillator]
Bayesian Decomposition
ROC analysis equivalent to NMF results
Matrix factorization identifies known coregulation in yeast data
[Figure: ROC curve, Sensitivity versus 1 - Specificity]
Bayesian Decomposition
Finds P Matrix
Determines patterns in the data using bias of positivity and minimal structure
For time series, can see that the patterns make sense
[Figure: D ≈ A P for a time series. D: Data (gene 1 to gene N, from Time = 0 to Time = 48 hr). A: Distribution of Genes in Patterns (gene 1 to gene N by pattern 1 to pattern k). P: Patterns of Behavior (pattern 1 to pattern k). Highlighted pattern: Transient with IM.]
Inference on TF Activity
For each gene
[Figure: A: Distribution of Genes in Patterns, gene 1 to gene 1363 by pattern 1 to pattern 5]
$$Z_{tp} = \frac{1}{R} \sum_{r \in G_t} Z_{rp}$$
Significance of Ztp
Need distribution of Ztp for
Different Rs
Different Patterns
Permutation Test
For each pattern generate Ztp for random collection of R genes
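A minimal sketch of such a permutation test follows. It assumes that the activity score for a transcription factor is the mean of per-gene z-scores over its R target genes, which is one reading of the formula on the slide. The function name, the one-sided test, and the number of permutations are choices made for this example.

```python
import numpy as np

def tf_activity_pvalue(Z, gene_set, pattern, n_perm=1000, seed=0):
    """Compare a gene set's mean z-score in one pattern with random sets of equal size.

    Z is a genes x patterns array of per-gene z-scores; gene_set holds row indices.
    """
    rng = np.random.default_rng(seed)
    R = len(gene_set)
    observed = Z[gene_set, pattern].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Random collection of R genes, drawn without replacement.
        random_genes = rng.choice(Z.shape[0], size=R, replace=False)
        null[i] = Z[random_genes, pattern].mean()
    # One-sided p-value with the usual +1 correction so it is never zero.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

Because the null distribution depends on both R and the pattern, it is regenerated for every combination of gene set size and pattern.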
Visualize as Blue (Low Activity) - Yellow (High Activity) with Pathway Maps
Look for Coordinated Changes
Decreases with IM
Transient with IM
Increases with IM
Tumor Pattern A
Logistic Regression
DESIDE
Differential Expr for Signaling Determination
Inference on Activity of Downstream Signaling Effectors
Global Unbiased Measure
Multiple Signaling Proteins Potentially Give Same Transcriptional Signature
Some Important Signaling Molecules Change Translation Only (mTOR)
Methylation
Of promoters silences genes
Of histones can also silence genes
Transcription - Epigenetics
[Figure: TF at a TFBS, with BTM; example sequence: CGCGATACGCG]
Measurements
Because of tumor heterogeneity (some normal tissue, infiltrates), exact measurement is difficult
Simplest Correction
Gene Set for Transcription Factor
Group of genes regulated by TF
If gene CANNOT be regulated, remove from the Gene Set
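A sketch of this correction for a single sample is shown below. The beta cutoff of 0.8, the gene names, the beta values, and the choice to keep genes that have no methylation measurement are all assumptions made for illustration; the slides do not state how "cannot be regulated" is decided.

```python
def filter_gene_set(targets, methylation_beta, max_beta=0.8):
    """Drop target genes whose promoter methylation suggests they cannot be regulated.

    max_beta is a hypothetical cutoff; genes without a measurement are kept.
    """
    return [g for g in targets if methylation_beta.get(g, 0.0) < max_beta]

# Hypothetical TF target list and per-sample methylation beta values.
targets = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
sample_beta = {"GeneA": 0.10, "GeneB": 0.92, "GeneC": 0.85,
               "GeneD": 0.30, "GeneE": 0.95}
kept = filter_gene_set(targets, sample_beta)
```

Because the filter is applied per sample, the same transcription factor can end up with a different number of targets in each patient.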
[Figure: Methylation matrix (gene 1 to gene N) with Gene C, Gene B, and Gene E marked; TF Targets list: Gene A, Gene B, Gene C, Gene D, Gene E]
HNSCC Data
ARRA Funded Biomarker Study
69 Samples: 44 HNSCC, 25 UPPP
14599 X 69
Biological Validation
See predicted changes and try to validate
Problem: tumor samples are difficult to work with for validation
Gene Sets
TRANSFAC Professional 2010.4
1325 TFs filtered to keep only those with 5 unique targets
230 TFs for Analysis
FGSA
TFs can have fewer targets for any individual patient due to filtering
[Figure: heatmap with color key (Count versus Value, 0.2 to 0.6). TF labels: HSF2A, HSF1L, GABPbeta, NFYA, STAT3, STAT1alpha, E2F1, p50:RelAp65, cFos:cJun, NFkappaB1p50, LRF, STAT5, AP2, E2F4, NFkappaB1, PPARalpha:RXRalpha, RXRalpha:PPARgamma, RXRalpha:PPARalpha, STAT3:STAT3, STAT1:STAT1, sp4, AhR:arnt, usf1:USF2, PPARalpha, ATF1, NF1, MyoD. Other labels: pTF100, pTF1515, pTF1512, pTF1508, pTF1500, pTF4515, pTF4512, pTF4508, pTF4500, pTF3515, pTF3512, pTF3508, pTF3500, pTF2515, pTF2512, pTF2508, pTF2500.]
Highlights
Some TFs Show Changes
Methylation and CNV affect p-value estimates of activities for some TFs
[Figure: Methylation versus Expression, FADS2, correlation: 0.335633632853448; x-axis: Methylation Beta (0.0 to 1.0), y-axis: Log2 Expression]
[Figure: Methylation versus Expression, NR1D1, correlation: 0.00821586442546115; x-axis: Methylation Beta (0.0 to 1.0), y-axis: Log2 Expression (6.5 to 8.5)]
[Figure: Methylation versus Expression, FABP1, correlation: 0.0441809836195627; x-axis: Methylation Beta (0.0 to 1.0), y-axis: Log2 Expression (3.0 to 5.0)]
Summary
Abstraction
Reduce complexity of signaling networks
Link networks to effectors (TFs)
Bias
NMF/MCMC necessary but introduces Bias
TF activity defined by known target genes
Experimental Validation
Biological complexity means models limited
Acknowledgements
Research
Signaling and TFs: Ghislain Bidaut (CRC-Marseille), Andrew Kossenkov (Wistar)
Computational Modeling: Elana Fertig, Luda Danilova
New MCMC Techniques: Elana Fertig, Alexander Favorov
Genetics of Complex Diseases: Alexander Favorov
Collaborators
HNSCC: Joseph Califano
GIST: Andy Godwin (KUMC)
Genetics: Giovanni Parmigiani (DFCI, Harvard), Olga Favorova (RSMU)
Network Modeling: Donald Geman (JHU), Laurent Younes (JHU)