Query construction
Footprint
Set of patients is selected through RPDR and data is
gathered into a data mart
Selected patients
RPDR
Project-Specific Phenotypic Data
• Data directly from RPDR
• Data from other hospital sources
• Data collected specifically for project
Daily Automated Queries search for Patients and add Data
Data is available through a specialized Workbench
Requirements of Genomic Variant Notation
• Maintainability
– Define the variant so it may be reliably identified over time
• Flanking sequences
– 5’ AGGCGCTAGAGAAGTCCGAGGCTC
– 3’ CCGCAAGGAGCTGGAGGAGAAGAT
• RS number
• Gene name + flanking sequences
• HGVS name
RS number
NM_005228.3:c.2155G>T
WE PROPOSE:
– HGVS name (nucleotide substitution, positional info)
– Flanking sequences (a way to verify positional info)
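The role of the flanking sequences as a positional check can be illustrated with a minimal sketch. It assumes the variant base sits directly between the 5' and 3' flanks; the function name `locate_variant`, the toy reference string, and the choice of "G" as the reference base are hypothetical and not part of the proposal.

```python
def locate_variant(reference, flank_5, ref_base, flank_3):
    """Return the 0-based index of ref_base if flank_5 + ref_base + flank_3
    occurs exactly once in reference; otherwise return None."""
    motif = flank_5 + ref_base + flank_3
    first = reference.find(motif)
    if first == -1:
        return None  # flanks do not match: stored positional info is stale
    if reference.find(motif, first + 1) != -1:
        return None  # motif occurs more than once: position is ambiguous
    return first + len(flank_5)


# Flanking sequences as shown on the slide. The surrounding "reference" is a
# toy string invented for this example, not real genomic sequence, and the
# reference base "G" between the flanks is an assumption.
FLANK_5 = "AGGCGCTAGAGAAGTCCGAGGCTC"
FLANK_3 = "CCGCAAGGAGCTGGAGGAGAAGAT"
toy_reference = "TTTT" + FLANK_5 + "G" + FLANK_3 + "AAAA"

position = locate_variant(toy_reference, FLANK_5, "G", FLANK_3)
print(position, toy_reference[position])
```

If a later reference build shifts coordinates, the same flanks can be searched again to recover the position, which is the maintainability property asked for above.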
GenomicMetadata
  Version 1.0
  ReferenceGenomeVersion hg18
  SequenceVariant
    HGVSName NM_005228.3:c.2155G>T
    SystematicName c.2155G>T
    SystematicNameProtein p.Gly719Cys
    AaChange missense
    DnaChange substitution
  SequenceVariantLocation
    GeneName EGFR
    FlankingSeq_5 GAATTCAAAAAGATCAAAGTGCTG
    FlankingSeq_3 GCTCCGGTGCGTTCGGCACGGTGT
    RegionType exon
    RegionName Exon 18
  Accessions
    Accession
      Name NM_005228
      Type mrna (NCBI)
    Accession
      Name NP_005219
      Type protein (NCBI)
    Accession
      Name NT_004487
      Type contig (NCBI)
  ChromosomeLocation
    Chromosome chr7
    Region 7p12
    Orientation +
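A record like this can be checked for internal consistency before it is loaded. The sketch below is one possible approach, not the project's actual loader: the dictionary layout, `parse_hgvs_substitution`, and `check_record` are hypothetical names, and the regular expression only covers simple coding-DNA substitutions such as NM_005228.3:c.2155G>T.

```python
import re

# Handles only simple coding-DNA substitutions, e.g. NM_005228.3:c.2155G>T.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_hgvs_substitution(name):
    """Split a simple HGVS c. substitution into its fields."""
    match = HGVS_SUBSTITUTION.match(name)
    if match is None:
        raise ValueError(f"not a simple c. substitution: {name!r}")
    fields = match.groupdict()
    fields["version"] = int(fields["version"])
    fields["position"] = int(fields["position"])
    return fields


def check_record(rec):
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    parsed = parse_hgvs_substitution(rec["HGVSName"])
    if not rec["HGVSName"].endswith(":" + rec["SystematicName"]):
        problems.append("SystematicName does not match HGVSName")
    mrna = [a["Name"] for a in rec["Accessions"] if a["Type"].startswith("mrna")]
    if parsed["accession"] not in mrna:
        problems.append("HGVS accession is not among the mRNA accessions")
    return problems


# Hypothetical in-memory form of a subset of the metadata fields.
record = {
    "HGVSName": "NM_005228.3:c.2155G>T",
    "SystematicName": "c.2155G>T",
    "GeneName": "EGFR",
    "Accessions": [
        {"Name": "NM_005228", "Type": "mrna (NCBI)"},
        {"Name": "NP_005219", "Type": "protein (NCBI)"},
        {"Name": "NT_004487", "Type": "contig (NCBI)"},
    ],
}

problems = check_record(record)
print(problems)
```

A check of this kind flags, for example, an HGVS name whose accession carries an extra digit (NM_0005228) while the accession list says NM_005228.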
Combining equivalent terms
Linking to external services
• Genome Browser
– Requires chromosome location; reference genome
• Conservation plots
– Based on location
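Because a genome browser link needs only the reference genome and a chromosome location, it can be built directly from the metadata. The following is a minimal sketch assuming the UCSC Genome Browser's `hgTracks` entry point with its `db` and `position` query parameters; the helper name `genome_browser_url` and the coordinates are placeholders for illustration, not the actual EGFR variant position.

```python
from urllib.parse import urlencode


def genome_browser_url(genome, chromosome, start, end):
    """Build a UCSC Genome Browser link for a region of a reference genome."""
    query = urlencode({"db": genome, "position": f"{chromosome}:{start}-{end}"})
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


# Placeholder coordinates, chosen only to show the URL shape.
url = genome_browser_url("hg18", "chr7", 55000000, 55300000)
print(url)
```

Keeping the reference genome version (here hg18) alongside the location matters because the same coordinates point to different bases in different builds.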
VISTA workbench tools
Embedded VISTA browser
References
• Kimball, R. The Data Warehousing Toolkit. New York: John Wiley, 1997.
• Murphy, S.N., Gainer, V.S., Chueh, H. A Visual Interface Designed for Novice Users
to find Research Patient Cohorts in a Large Biomedical Database. AMIA, Fall Symp.
2003: 489-493.
• Murphy, S.N., Weber, G., Mendis, M., Gainer, V.S., Churchill, S., Kohane, I.S.
Serving the Enterprise and Beyond with Informatics for Integrating Biology and the
Bedside (i2b2). Journal of the American Medical Informatics Association, 2010 March
1; 17(2): 124-130.
• den Dunnen JT, Antonarakis SE: Mutation nomenclature extensions and suggestions
to describe complex mutations: A discussion. Hum Mutat 2000, 15:7-12.
• Dalgleish R, et al.: Locus Reference Genomic sequences: an improved basis for
describing human DNA variants. Genome Medicine; 2010, 2:24.
• http://www.hgvs.org/mutnomen/recs.html
Integrating Clinical and Genomic Data:
Opportunities, Challenges and a Proposal
Henry Lowe MD
Stanford Center for Clinical Informatics
And The Division of Systems Medicine
Stanford University School of Medicine
Opportunities – Clinical Data Warehouses
• Electronic Health Record Deployment Increasing
• Creation of Clinical Data Warehouses Increasing
• Support for Research Access to Clinical Data
• Optimized for use of Aggregate Data
• Cohort Searching, Data Review & Analysis
Opportunities – Biospecimen Linkage
• Linkage of Clinical and Biospecimen Data
• Characterizing Biospecimens using Clinical Data
• Identifying Biospecimen Cohorts
• Linkage of Genomic Data to Clinical Data
• Integration of Genomic Data back into the EHR
Challenges – Clinical Data
[Diagram labels: Genetic data; CPR; Clinical data (CDW)]
Why do we need routine data?
• Clinical research moving abroad
– Glickman SW, McHutchison JG, Peterson ED, et al. Ethical and scientific implications of the globalization of clinical research. N Engl J Med. 2009;360(8):816–823, PMID:19228627.
• If we are to compete…
– Need to make use of routine data
– Only some routine clinical care can be outsourced
Problems
• Little overlap between clinical and research data
– Genetic data on study subjects
– Clinical data on patients who are not study subjects
• Clinical data not like research data
– Measurement error
– Missing data
– Biased data
– …
Attempts to leverage clinical data
• [Quality of care]
• [(Non-representative) Cohort selection]
• Reproducing large RCTs
– Extremely large sample sizes
– Example: Tannen RL, Weiner MG, Xie D. Use of primary care electronic medical record database in drug efficacy research on cardiovascular outcomes: comparison of database and randomised controlled trial findings. BMJ. 2009;338:b81. – 8M patients (5.7% of the population of the UK)
• Reproducing prediction rules
– Example: Hripcsak G, Knirsch C, Zhou L, Wilcox A, Melton GB. Using discordance to improve classification in narrative clinical databases: An application to community-acquired pneumonia. Comp Biol Med, 37 (2007) 296-304.
– Often doesn’t work
• Solution 1: eliminate problematic data (10% of sample)
– Bias
• Solution 2: account for the confounds via statistical model
– Requires knowing the answer
Required enabling technologies
• Infrastructure
– Collect, store, protect, analyze, update
• NLP, NLP, NLP
– Structured (billing) data misleading
• (UTH data) 20% endometrial cancer, 50% breast cancer
• Statistics
– Requires unusual degree of collaboration with statistical colleagues
For the present
• Critically important research area
• Careful to maintain enthusiasm without over-promising
– AI Winter(s)
Thank you!
Elmer Bernstam
Elmer.V.Bernstam@uth.tmc.edu
Integrating genomic and clinical data: some challenges from EU and Italian projects
Riccardo Bellazzi
Biomedical Informatics Labs ‘Mario Stefanelli’
University of Pavia, Italy
Collaborations BMI labs and Pavia hospitals
[Diagram labels: HIV EMR; Biobanks; DW / clinical research chart; Discharge letters; Knowledge repositories; Research data-bases; Intelligent query / data mining; Reasoning systems]
Projects
TRIAD: Transatlantic registry of inherited Arrhythmogenic diseases
i2b2
ETL - KETTLE
TRIAD and i2b2
Adding statistical functionalities
BIOINFORMATICS METHODOLOGY AND TECHNOLOGY TO INTEGRATE CLINICAL AND BIOLOGICAL KNOWLEDGE SUPPORTING ONCOLOGY TRANSLATIONAL RESEARCH (ONCO-I2B2)
Projects
Centre for Inherited Cardiovascular Diseases - IRCCS Policlinico San Matteo - Pavia
From “DCM” to…
Dystrophinopathies
Laminopathies
Desminopathies
Mitochondriopathies
Epicardinopathies
Actinopathies
Zaspopathies
Desmosomopathies
[Diagram: clinical work-up steps and their findings]
Pedigree, family screening: Non Familial; Familial: AD, AR, X-LR, MT
Symptoms, duration: Cardiac, Extra Cardiac, Recent Onset, Long term
Physical evaluation: Muscle, Skin, Eyes, Kidney, Liver, Lung
ECG (rest, effort, holter): AVB, PR, WPW, etc.
LAB: CPK, Leukocytes, Enzymes, Metab., etc.
Imaging (echo, MRI): LVNC, DE
RV Cath EMB
Diagnostic Hypothesis: Family screening; Clinical markers; Before Genetic Testing
Increasing the number of genotyped CMP
One gene ---> one disease
Inheritance architecture
[Architecture diagram labels: Annotation tools; Cardioregister Web interface; Text mining and literature search engines; Data analysis plugin; Data warehouse; Reasoning module; KB/Red flags; I2b2 environment; Legacy Databases; Research; Clinical Documents; Data; Domain Ontology; ICHD Code System; NLP System; ICHD Ontology; Diagnosis; Mapped Clinical Data; CRC]
Task 1. Computational methods and tools to perform data mining and knowledge integration
Efficient management of MS-data
Mining annotations and literature
Automated Literature search
Web-based data analytics
Web-based annotation
In summary
• Several projects where the same architecture can be applied
• Main adaptation needs:
– Specific domain ontologies
– Representation of genetic information
– Representation of phenotypic information
– Importing data from EHR
• Interesting research directions related to building –omics enabled decision support and knowledge management tools
Integrating Genomic and Clinical Data
for EHR and Biomedical Repositories
• 550,000 outpatient visits/year
• 180,000 hospital admissions
• 17 million orders
• 2 million patients
Request about individual
Clinician/Researcher wants data
Return data D
EHR and Genomics at
Getting the Job Done
[Diagram labels: Focus Groups, Surveys; Community; Identity Management; Trust Management; Patient; Information Exchange Registry; Inspection; Home; Security Entity; Healthcare Entity; Preference Registry]
“I can check who or which entity looked (wanted to look) at the data for what reasons”
“this program shows the estimated health risks of people with your same age, gender, and risk factor levels”
“this means that 5 of 100 people with this level of risk will have a heart attack or die”
People “like you”
“people with your same age, gender, and risk factor levels” (p=1, x) versus “people with this level of risk”
[Figures: People “like me” and Patients “like you”: plots of height against gender with a point marked “me”; the final panel adds risk]
Assessing Quality of Individual Predictions
• Hybrid model construction
– Non-parametric and parametric regression
– Kernel-based models
• Evaluation of calibration
– Graphical tools based on calibration error
– Input-based assessment
• Calibration methods
– Smooth isotonic regression (1:30 Cyril Magnin II)
– Doubly-penalized SVM
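To make the calibration step concrete, here is a minimal sketch of plain isotonic regression using the pool-adjacent-violators procedure. It is not the smooth variant named on the slide, and the function name `isotonic_fit` and the toy scores and outcomes are invented for illustration.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators: return (sorted_scores, calibrated_values),
    where calibrated_values is non-decreasing in the score."""
    pairs = sorted(zip(scores, outcomes))
    # Each block is [sum of outcomes, count]. Adjacent blocks are merged
    # whenever the earlier block has a higher mean than the later one.
    blocks = []
    for _, y in pairs:
        blocks.append([float(y), 1])
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted


# Toy data: raw model scores and observed binary outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
outcomes = [0, 1, 0, 0, 1, 1]
sorted_scores, calibrated = isotonic_fit(scores, outcomes)
print(list(zip(sorted_scores, calibrated)))
```

The fitted values replace each raw score with the observed event rate of its pooled neighbours, so the mean prediction matches the mean outcome while the ranking by score is preserved.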
Summary
• We need to aggregate as much information as we can from experiments and clinical data to create reasonable predictive models
Funding from NLM, NHLBI, NHGRI, NIBIB, NCRR, NIGMS, AHRQ, Fogarty, VAMRF, Komen
Foundation, UCSD Medical Center
March 9, 2011
AMIA TBI-19/CRI-01 ACMI Panel
Honest broker
Integrated Data Repository
Researcher
The University of Washington data repository
(Amalga) integrates phenotypic data from
30+ interfaces (10/2010)
Scope of Repository
• 3.5M patients, 42M visits, 220M+ lab results, 180M+ diagnoses & procedures over 18 years
• 14 data systems populating Amalga via 30+ real-time or batch interfaces
• 2.7 Terabytes of data
• 4M new messages/day
• Use IRB/HIPAA compliant
Amalga can identify patients with a given phenotype and help investigators augment phenotypic information
• Eligibility criteria (IRB approved study)
Patients whose age >=18 years and are not deceased
AND
Had ICD-9 codes of 648.* OR 250.* OR 648 OR 250
AND
Had lab test results (Albumin >= 30 and <= 400) OR (Albumin/Creatinine Ratio >= 30 and <= 400) within the last 2 years.
AND
Had ANY encounter in the service centers for Internal Medicine OR Diabetes Care Center OR Family Medical Center in the last 2 years
AND
Have not had a diagnosis of 592.* OR 592 OR 585.6 OR V42.0
AND
Have not had lab test (Calcium > 10.5) OR (GFR < 60) OR (Hemoglobin A1C HPLC > 9.5) OR (Hemoglobin A1C Rapid > 9.5).
• Link demographic, diagnoses, labs, & visit history data
• Nightly updates to candidate list, automated notification, & custom study input screen
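The eligibility criteria above translate naturally into a predicate over a patient record. The sketch below is only an illustration of that logic, not Amalga's query language: the record layout, the names `has_code` and `is_eligible`, and the approximation of "2 years" as 730 days are all assumptions.

```python
from datetime import date, timedelta

TWO_YEARS = timedelta(days=730)  # assumption: "last 2 years" taken as 730 days
CLINICS = {"Internal Medicine", "Diabetes Care Center", "Family Medical Center"}


def has_code(codes, prefixes):
    """True if any ICD-9 code equals a prefix or falls under it (e.g. 250.*)."""
    return any(c == p or c.startswith(p + ".") for c in codes for p in prefixes)


def is_eligible(patient, today):
    """Apply the inclusion and exclusion criteria to one patient record."""
    cutoff = today - TWO_YEARS
    if patient["age"] < 18 or patient["deceased"]:
        return False
    if not has_code(patient["icd9"], ["648", "250"]):
        return False
    albumin_ok = any(
        lab["name"] in ("Albumin", "Albumin/Creatinine Ratio")
        and 30 <= lab["value"] <= 400
        and lab["date"] >= cutoff
        for lab in patient["labs"]
    )
    seen_recently = any(
        enc["service_center"] in CLINICS and enc["date"] >= cutoff
        for enc in patient["encounters"]
    )
    if not (albumin_ok and seen_recently):
        return False
    # Exclusions: diagnoses and lab values with no time window stated.
    if has_code(patient["icd9"], ["592", "585.6", "V42.0"]):
        return False
    excluded_lab = any(
        (lab["name"] == "Calcium" and lab["value"] > 10.5)
        or (lab["name"] == "GFR" and lab["value"] < 60)
        or (lab["name"] in ("Hemoglobin A1C HPLC", "Hemoglobin A1C Rapid")
            and lab["value"] > 9.5)
        for lab in patient["labs"]
    )
    return not excluded_lab


# Hypothetical patient record for illustration.
today = date(2011, 3, 9)
example_patient = {
    "age": 54,
    "deceased": False,
    "icd9": ["250.00"],
    "labs": [{"name": "Albumin/Creatinine Ratio", "value": 120,
              "date": date(2010, 11, 1)}],
    "encounters": [{"service_center": "Diabetes Care Center",
                    "date": date(2010, 11, 1)}],
}
eligible = is_eligible(example_patient, today)
print(eligible)
```

Running such a predicate over the repository each night would yield the refreshed candidate list mentioned in the last bullet.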
Some phenotypes more challenging to capture
Capurro, Tarczy-Hornoch TBI 2011 (TBI-10)
Semantic alignment in data repositories pulling data from disparate systems is a challenge
Increased monitoring for poor metabolizers recommended
Motivation
Pharmacogenomic decision support requires reasoning across assertions with different levels of evidence
Genotype
scoring system
Raw from
Sheffield et al. Clin Bio Rev. 2009
Methods
Prototype system built on Amalga integrates Illumina SNP data, clinical data, and basic genomic knowledge