
Integrating Genomic and Clinical Data in
Electronic Health Records and
Biomedical Repositories:
Challenges, Solutions and Opportunities

American College of Medical Informatics (ACMI)
What is ACMI?

The American College of Medical Informatics is a college of
elected fellows from the United States and abroad who
have made significant and sustained contributions to the
field of medical informatics. Initially incorporated in 1984,
the organization later dissolved its separate corporate
status to merge with the American Association for Medical
Systems and Informatics (AAMSI) and the Symposium on
Computer Applications in Medical Care (SCAMC) when the
American Medical Informatics Association was formed in
1989. The College now exists as an elected body of fellows
within AMIA, with its own bylaws and regulations that guide
the organization, its activities, and its relationship with the
parent organization.
Integration of Genomic and Clinical Data
The integration of genomic data into the more traditional phenomic
databases, such as electronic health records and biomedical data
warehouses, offers great potential for the advancement of biomedical
research and patient care. However, there are a number of challenges
to accomplishing this integration in a seamless manner, including the
consistent, standardized representation and coding of the data, coping
with the sheer volume of information, and proper indexing of important
genomic features to facilitate retrieval. The panelists, all Fellows of the
American College of Medical Informatics, each a leader in their own
institutions and in the field of biomedical informatics, will describe their
work on addressing the challenges of genomic-phenomic integration,
with working solutions and examples of how such integration can be
brought to bear on tasks such as helping clinicians understand their
patients’ genetic data, using genetic data to support clinical decision
making, and advancing biomedical research. The panel will also
discuss implications for national standards on representation and data
sharing. The session will include time for audience participants to share
the solutions from their own institutions.
Integration of Genomic and Clinical Data
• Potential for biomedical research and patient care
• Consistent standardized representation
• Consistent standardized coding
• Coping with the volume
• Indexing of important genomic features
• Helping clinicians understand patients’ genetic data
• Using genetic data to support clinical decision making
• Advancing biomedical research
• Implications for national standards for representation
• Implications for national standards for data sharing
Presenters
• Shawn N. Murphy, FACMI (representation)
– Massachusetts General Hospital
– Harvard Medical School
– Partners HealthCare
• Henry Lowe, FACMI (linking genome & phenome)
– Stanford University
• Elmer V. Bernstam, FACMI (reuse for research)
– University of Texas at Houston
• Riccardo Bellazzi, FACMI (supporting research)
– Università di Pavia
• Lucila Ohno-Machado, FACMI (iDASH Center)
– University of California at San Diego
• Peter Tarczy-Hornoch, FACMI (decision support)
– University of Washington
Expression of Genomic
Variants in a Clinical
Research Database

Shawn Murphy MD, Ph.D.


Lori Phillips MS
Brian Wilson
Research Patient Data Registry exists at Partners
Healthcare to find patient cohorts for clinical research

Query construction in web tool

1) Queries for aggregate patient numbers
- Warehouse of in & outpatient clinical data
- 5.0 million Partners Healthcare patients
- 1.3 billion diagnoses, medications, procedures, laboratories, & physical findings coupled to demographic & visit data
- Authorized use by faculty status
- Clinicians can construct complex queries
- Queries cannot identify individuals, internally can produce identifiers for (2)
[Figure: De-identified Data Warehouse holding encrypted identifiers such as Z731984X and Z74902XX]

2) Returns identified patient data
- Start with list of specific patients, usually from (1)
- Authorized use by IRB Protocol
- Returns contact and PCP information, demographics, providers, visits, diagnoses, medications, procedures, laboratories, microbiology, reports (discharge, LMR, operative, radiology, pathology, cardiology, pulmonary, endoscopy), and images into a Microsoft Access database and text files.
[Figure: real identifiers such as 0000004 and 2185793]
[Figure: screenshot of the web query tool. Labeled regions: query items; person who is using tool; query construction; results, broken down by number of distinct patients]
HGVS Variant Notation
[Figure labels: Wildtype, Sequence Variant, Footprint]
Set of patients is selected through RPDR and data is
gathered into a data mart
[Figure: selected patients from RPDR feed a project-specific phenotypic data mart that holds data directly from RPDR, data from other hospital sources, and data collected specifically for the project]
Daily Automated Queries search for Patients and add Data
Data is available through a specialized Workbench
Requirements of Genomic Variant Notation

Ability to organize the variants for ease of navigation

Ability to query for the variant in the workbench
– Implication is that the identifier (basecode) for the variant does not
change over time or is maintainable.

Ability to explore or annotate the variant within the workbench
– Implication is that we know enough about the variant so that it can be
located in existing external genome browsers, analytical tools, etc.
Challenges of Genomic Variant Notation

• Balancing the capabilities of multiple providers


– Genomic labs may report data differently

• Maintainability
– Define the variant so it may be reliably identified over time

• Balancing the needs of multiple consumers


– Needs may differ for geneticists vs physicians vs research scientists
Proposed Strategy for Clinical Data feeds

• Gather SNP data from genomic lab reporting system
• Gather SNP data from reference data
Weighing the data provided by the lab source

• Gene location MYH7

• Flanking sequences
– 5’ AGGCGCTAGAGAAGTCCGAGGCTC
– 3’ CCGCAAGGAGCTGGAGGAGAAGAT

• Positional information c.2606

• Nucleotide substitution G>A

• Functional information p.Arg869His


Proposed Strategy for Research Data feeds

1. Store summarized genomic annotation information within the
current fact table of the star schema (EAV table).

2. Store detailed genomic annotation information within an
object-oriented database.

3. Store genomic datasets (BAM, PED, etc.) within a secure
file system, indexed within the i2b2 data mart.
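A minimal sketch of point 1, assuming a generic entity-attribute-value layout: one row per patient observation, with the variant's HGVS name folded into a stable concept code (the "basecode"). The `Fact` fields, the `VARIANT:` prefix, the `LAB:ALBUMIN` code, and the stored values are hypothetical illustrations and are not taken from the i2b2 schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One EAV row: who (entity), what (attribute), when, and a value."""
    patient_id: str    # encrypted or study-specific identifier
    concept_code: str  # attribute; must stay stable over time
    observed_on: date
    value: str


def variant_basecode(hgvs_name: str) -> str:
    """Build a concept code from an HGVS name (hypothetical 'VARIANT:' prefix)."""
    return "VARIANT:" + hgvs_name.strip()


def patients_with(facts, concept_code):
    """Return the sorted, distinct patient identifiers carrying a concept code."""
    return sorted({f.patient_id for f in facts if f.concept_code == concept_code})


# Illustrative rows only; the identifiers reuse the encrypted-ID style shown earlier.
egfr_code = variant_basecode("NM_005228.3:c.2155G>T")
facts = [
    Fact("Z731984X", egfr_code, date(2011, 3, 9), "detected"),
    Fact("Z74902XX", "LAB:ALBUMIN", date(2011, 3, 9), "42"),
]
print(patients_with(facts, egfr_code))
```

Because the query matches on the concept code alone, the design only works if that code never changes, which is the maintainability requirement stated earlier for the variant identifier.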
Flow Diagram
[Figure: i2b2 genomic data flow. Recoverable labels:
– i2b2 Hive Core: PM-Cell (authentication), CRC-Cell (summary annotations)
– Access through R, Perl, and cURL to the i2b2 Web Service API and the Genomic Report API
– Interface: Report Request Broker, Report Engine, Genomic Data Importer (PED, GFF3, ...), Genomic Data Exporter (PED, GFF3, BED, WIG, ...), export of gene-level results to CRC
– Genomic data analysis: i2b2-Galaxy Adaptor, Galaxy, other resources, 'canned' workflow reports
– MongoDB data persistence: features metadata, experiment metadata, raw data storage, gridFS for BAM files
– Users: i2b2 Hive domain power users, domain experts]

How do we make Invariant Variants
…that are palatable for human use in queries?

• RS number
• Gene name + flanking sequences
• HGVS name
RS number

• Uniquely identifies a variant over time ….but….

• Novel variants may not have rs number


– User may not want to submit to dbSNP
Gene name + flanking sequences

• Not guaranteed if gene has several isoforms


– EGFR
HGVS Name

• Uniquely identifies variant within a referenced and versioned


accession and details the nucleotide substitution.

NM_005228.3:c.2155G>T

RefSeq accession Position Nucleotide


substitution
Coding DNA
Is there a common denominator in all of this?

• Yes … all ultimately describe variant location on a chromosome.


• Nucleotide substitution defines the physical manifestation of the variant.

WE PROPOSE:
– HGVS name (n/t subst, positional info)
– Flanking sequences ( a way to verify positional info)

AS A WAY TO UNEQUIVOCALLY EQUATE TWO VARIANTS


– ACROSS DOMAINS
– ACROSS VERSIONS
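The proposal above can be sketched in a few lines, under stated assumptions: the parser below handles only simple coding-DNA substitutions of the form shown on the HGVS slide, and the equivalence rule (same substitution and same flanking sequences, regardless of accession version or position) is one possible reading of "equate two variants across versions", not the authors' implementation. The version `.4` in the usage lines is hypothetical and exists only to show a cross-version comparison.

```python
import re
from dataclasses import dataclass

# Matches only simple substitutions such as NM_005228.3:c.2155G>T.
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):c\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


@dataclass(frozen=True)
class Variant:
    accession: str
    version: int
    position: int
    ref: str
    alt: str
    flank_5: str
    flank_3: str


def parse_variant(hgvs_name: str, flank_5: str, flank_3: str) -> Variant:
    """Split an HGVS substitution into its parts and attach the flanking sequences."""
    match = HGVS_SUBSTITUTION.match(hgvs_name.strip())
    if match is None:
        raise ValueError(f"unsupported HGVS name: {hgvs_name!r}")
    return Variant(
        accession=match["accession"],
        version=int(match["version"]),
        position=int(match["position"]),
        ref=match["ref"],
        alt=match["alt"],
        flank_5=flank_5.upper(),
        flank_3=flank_3.upper(),
    )


def same_variant(a: Variant, b: Variant) -> bool:
    """Assumed rule: equal substitution and equal flanks, ignoring version and position."""
    return (a.ref, a.alt, a.flank_5, a.flank_3) == (b.ref, b.alt, b.flank_5, b.flank_3)


v3 = parse_variant("NM_005228.3:c.2155G>T",
                   "GAATTCAAAAAGATCAAAGTGCTG", "GCTCCGGTGCGTTCGGCACGGTGT")
v4 = parse_variant("NM_005228.4:c.2155G>T",  # hypothetical later version
                   "gaattcaaaaagatcaaagtgctg", "GCTCCGGTGCGTTCGGCACGGTGT")
print(same_variant(v3, v4))
```

A real comparison would also have to cope with labs reporting flanks of different lengths or on opposite strands, which this sketch does not attempt.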
GenomicMetadata record

GenomicMetadata
Version 1.0
ReferenceGenomeVersion hg18
SequenceVariant
HGVSName NM_005228.3:c.2155G>T
SystematicName c.2155G>T
SystematicNameProtein p.Gly719Cys
AaChange missense
DnaChange substitution
SequenceVariantLocation
GeneName EGFR
FlankingSeq_5 GAATTCAAAAAGATCAAAGTGCTG
FlankingSeq_3 GCTCCGGTGCGTTCGGCACGGTGT
RegionType exon
RegionName Exon 18
Accessions
Accession
Name NM_005228
Type mrna (NCBI)
Accession
Name NP_005219
Type protein (NCBI)
Accession
Name NT_004487
Type contig (NCBI)
ChromosomeLocation
Chromosome chr7
Region 7p12
Orientation +
Combining equivalent terms
Linking to external services

• Genome Browser
– Requires chromosome location; reference genome

• PolyPhen (predicted functional effects)


– Requires chromosome location; reference genome
– RS number
– Or HGVS name
VISTA Services

• Flankmap (location service)


Converts several formats to a chromosome location on a
reference genome
– Gene/flanking sequence
– Full HGVS notation
– dbSNP rs number

• Conservation plots
– Based on location
VISTA workbench tools
Embedded VISTA browser
References

• Kimball, R. The Data Warehouse Toolkit. New York: John Wiley, 1997.
• Murphy, S.N., Gainer, V.S., Chueh, H. A Visual Interface Designed for Novice Users
to find Research Patient Cohorts in a Large Biomedical Database. AMIA, Fall Symp.
2003: 489-493.
• Murphy, S.N., Weber, G., Mendis, M., Gainer, V.S., Churchill, S., Kohane, I.S.
Serving the Enterprise and Beyond with Informatics for Integrating Biology and the
Bedside (i2b2). Journal of the American Medical Informatics Association, 2010 March
1; 17(2): 124-130.
• den Dunnen JT, Antonarakis SE: Mutation nomenclature extensions and suggestions
to describe complex mutations: A discussion. Hum Mutat 2000, 15:7-12.
• Dalgleish R, et al.: Locus Reference Genomic sequences: an improved basis for
describing human DNA variants. Genome Medicine; 2010, 2:24.
• http://www.hgvs.org/mutnomen/recs.html
Integrating Clinical and Genomic
Data:
Opportunities, Challenges
and a Proposal
Henry Lowe MD
Stanford Center for Clinical Informatics
And The Division of Systems Medicine
Stanford University School of Medicine

Opportunities – Clinical Data Warehouses
• Electronic Health Record Deployment Increasing
• Creation of Clinical Data Warehouses Increasing
• Support for Research Access to Clinical Data
• Optimized for use of Aggregate Data
• Cohort Searching, Data Review & Analysis
Opportunities – Biospecimen Linkage
• Linkage of Clinical and Biospecimen Data
• Characterizing Biospecimens using Clinical Data
• Identifying Biospecimen Cohorts
• Linkage of Genomic Data to Clinical Data
• Integration of Genomic Data back into the EHR
Challenges – Clinical Data
• Clinical Data is not the Entire Phenotype
• Missing Data (e.g. Occupational History)
• May be Spread across many eHealth Systems
• Clinical Data is not Perfect
• Diagnoses may be coded only in ICD9
Challenges – Identifying Phenotype
• Creating validated algorithms to define phenotype from EMRs is complex
• Diagnostic Codes Alone may not be sufficient (eMERGE)
• Extracting phenotypic data from clinical text can be difficult
• Phenotype data use goes beyond genomic studies, e.g. Research Cohort Identification
Proposal – A National Phenotype Catalog
• Create a Web-based, searchable directory of validated high level phenotype algorithms
• Encourage contributions from multiple sites
• Algorithms would be freely available for use
• Would use a standard set of metadata elements
• Would use a standard description
Challenges in Leveraging
Clinical Data
Elmer Bernstam
Professor
Biomedical Informatics and Internal Medicine
Director, Biomedical Informatics Component
Center for Clinical and Translational Research
The University of Texas Health Science Center at
Main points
• To leverage genomic data, need (matching)
clinical data
– Research data
• Expensive and scarce
• Relatively easy to compute
• May not accurately reflect clinical reality
– Routine clinical data
• Plentiful and “cheap” (though may not match)
• Very hard to compute
• Necessary
• Challenges inherent in routine clinical data
Traditional view

Genetic
data

CPR

Clinical data
(CDW)
Why do we need routine data?
• Clinical research moving abroad
– Glickman SW, McHutchison JG, Peterson ED, et al. Ethical and
scientific implications of the globalization of clinical research. N
Engl J Med. 2009;360(8):816–823, PMID:19228627.

• If we are to compete…
– Need to make use of routine data
– Only some routine clinical care can be
outsourced
Problems
• Little overlap between clinical and
research data
– Genetic data on study subjects
– Clinical data on patients who are not study
subjects
• Clinical data not like research data
– Measurement error
– Missing data
– Biased data
– …
Attempts to leverage
clinical data
• [Quality of care]
• [(Non-representative)Cohort selection]
• Reproducing large RCTs
– Extremely large sample sizes
– Example: Tannen RL. Weiner MG. Xie D. Use of primary care electronic
medical record database in drug efficacy research on cardiovascular
outcomes: comparison of database and randomised controlled trial findings.
BMJ. 2009. 338:b81. – 8M patients (5.7% of the population of the UK)
• Reproducing prediction rules
– Example: Hripcsak G, Knirsch C, Zhou L, Wilcox A, Melton GB. Using
discordance to improve classification in narrative clinical databases: An
application to community-acquired pneumonia. Comp Biol Med, 37 (2007) 296-
304.
– Often doesn’t work
• Solution 1: eliminate problematic data (10% of sample)
– Bias
• Solution 2: account for the confounds via statistical model
– Requires knowing the answer
Required enabling
technologies
• Infrastructure
– Collect, store, protect, analyze, update
• NLP, NLP, NLP
– Structured (billing) data misleading
• (UTH data) 20% endometrial cancer, 50%
breast cancer
• Statistics
– Requires unusual degree of
collaboration with statistical colleagues
For the present
• Critically important research area
• Careful to maintain enthusiasm
without over-promising
– AI Winter(s)
Thank you!

Elmer Bernstam
Elmer.V.Bernstam@uth.tmc.edu
Integrating genomic and clinical data: some
challenges from EU and Italian projects

Riccardo Bellazzi
University of Pavia, Italy
Biomedical Informatics Labs ‘Mario Stefanelli’
Collaborations BMI labs and Pavia hospitals
• IRCCS Fondazione C. Mondino: Headache
• IRCCS Policlinico S. Matteo: The EU Inheritance project
• IRCCS Fondazione S. Maugeri: Cardiology, Oncology
Clinical Bioinformatics – the Italbionet / i2b2 Pavia project
[Figure labels: HIV; EMR; biobanks; discharge letters; DW / clinical research chart; knowledge repositories; research data-bases; intelligent query / data mining; reasoning systems]
Projects

IRCCS Fondazione S. Maugeri

Genetics of arrhythmogenic diseases

Support to oncology research
TRIAD and i2b2

TRIAD: Transatlantic registry of inherited arrhythmogenic diseases
[Figure: TRIAD data loaded into i2b2 through ETL (Kettle)]
TRIAD and i2b2
Adding statistical functionalities
BIOINFORMATICS METHODOLOGY AND TECHNOLOGY TO INTEGRATE
CLINICAL AND BIOLOGICAL KNOWLEDGE SUPPORTING
ONCOLOGY TRANSLATIONAL RESEARCH (ONCO-I2B2)
Projects

IRCCS Policlinico S. Matteo


The EU Inheritance project

Inheritance: dilated cardiomyopathies

Dilated cardiomyopathy

Centre for Inherited Cardiovascular Diseases - IRCCS Policlinico San Matteo - Pavia

From DCM to…

Clinically oriented genetic investigation

[Figure: “DCM” resolved into Dystrophinopathies, Laminopathies, Desminopathies, Mitochondriopathies, Epicardinopathies, Actinopathies, Zaspopathies, Desmosomopathies]

[Figure: clinical work-up before genetic testing. Recoverable labels:
– Pedigree, family screening: non familial; familial: AD, AR, X-LR, MT
– Symptoms: cardiac, extra cardiac; duration: recent onset, long term
– Physical evaluation: muscle, skin, eyes, kidney, liver, lung
– ECG (rest, effort, holter): AVB, PR, WPW, etc.
– LAB: CPK, leukocytes, enzymes, metab., etc.
– Imaging: echo, MRI: LVNC, DE
– RV Cath, EMB
– Diagnostic hypothesis: family screening and clinical markers before genetic testing
– Increasing the number of genotyped CMP
– One gene ---> one disease]
Inheritance architecture
[Figure labels: annotation tools; Cardioregister; web interface; text mining and literature search engines; data analysis plugin; data warehouse; reasoning module; KB/Red flags; i2b2 environment; wiki-based collaborative system]


Projects

IRCCS Fondazione C. Mondino


Headache
Populating the datawarehouse
[Figure labels: legacy databases; research data; clinical documents; domain ontology; ICHD code system; NLP system; ICHD diagnosis; ontology-mapped clinical data; CRC]
Task 1. Computational methods and tools to perform data
mining and knowledge integration
[Figure labels: efficient management of MS-data; mining annotations and literature; automated literature search; web-based data analytics; web-based annotation]
In summary
• Several projects where the same architecture can be applied
• Main adaptation needs:
– Specific domain ontologies
– Representation of genetic information
– Representation of phenotypic information
– Importing data from EHR
• Interesting research directions related to building –omics enabled decision support and knowledge management tools
Integrating Genomic and Clinical Data
for EHR and Biomedical Repositories

Lucila Ohno-Machado, MD, PhD


Division of Biomedical Informatics UCSD

TBI-CRI Bridge Day Panel


03/8/11
EHR and Genomics at

Division of Biomedical Informatics overview

Research and Applications


• Clinical Data Warehouse
– NLP, privacy technology, preference management
• integrating Data for Analysis, Anonymization, and Sharing
• Personalized risk assessment
– How ‘personalized’ is it?
Clinical Data Warehouse

• 550,000 outpatient visits/year
• 180,000 hospital admissions
• 17 million orders
• 2 million patients
[Figure: data request across sites. A clinician/researcher wants data and issues a request about an individual; a data matching function maps the request for data D onto the data dictionaries of UCSD (Epic), UCLA (Epic), UC Irvine (Eclipsys), Community Partners, UC Davis (Epic), and UCSF (GE); data D is returned]
Sharing Data
– Today
• Public repositories (mostly non-clinical)
• Limited data use agreements
– Tomorrow
• Annotated public databases
• Informed consent management system
• Certified trust network

Sharing Computational Resources
– Today
• Computer scientists looking for data, biomedical and behavioral scientists looking for analytics
• Duplication of pre-processing efforts
• Massive storage and high performance computing limited to a few institutions
– Tomorrow
• Processed, de-identified, ‘anonymized’, shared data
• Secure biomedical/behavioral cloud
Analysis
(Courtesy Bafna and Varghese)
• Compression (2nd generation seq)
• Query language
• Study design
• Pattern recognition
• NLP (computing with streams, rare events)
• High performance computing
Anonymization
Respecting Privacy and Individual preferences
Getting the Job Done

[Figure: preference-based data release. Recoverable content:
– Provider P requests Data D on individual I for Reason R
– Does the law or regulation require D to be sent? If not: do I wish to disclose data D to P for Reason R? (Yes / No, per Preferences)
– Informed Consent Management System, informed by focus groups, surveys, and community
– Trusted Broker(s): identity management, trust management
– Information Exchange Registry and Preference Registry
– Inspection: Patient I can check who or which entity looked (wanted to look) at the data for what reasons
– Other labels: Home, Security Entity, Healthcare Entity]
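The request flow in the diagram above can be sketched as a single decision function. This is an illustration under assumptions, not the iDASH design: the `Request` fields, the dictionary-based preference registry, the deny-by-default rule, and the in-memory audit log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    provider: str    # P
    individual: str  # I
    data: str        # D
    reason: str      # R


def decide(request, legally_required, preferences, audit_log):
    """Return True if D may be sent to P; record every request for later inspection."""
    if legally_required(request):
        allowed, basis = True, "required by law or regulation"
    else:
        key = (request.individual, request.provider, request.data, request.reason)
        # Assumption: no recorded preference means "do not disclose".
        allowed = preferences.get(key, False)
        basis = "individual preference"
    # Denied requests are logged too, so the individual can see who wanted to look.
    audit_log.append((request, allowed, basis))
    return allowed


audit_log = []
preferences = {("I1", "P1", "genotype", "research"): True}
print(decide(Request("P1", "I1", "genotype", "research"),
             lambda r: False, preferences, audit_log))
```

Logging before returning, including refusals, is what lets the inspection step answer both "who looked" and "who wanted to look".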
Personalized Medicine

If the rule of thumb for building predictive models is 10 cases per variable:
How many individual genotypes are needed?
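The arithmetic behind the question is simple to make concrete. The sketch below applies the 10-cases-per-variable rule of thumb as stated; the variable counts in the loop are illustrative assumptions, not figures from the talk.

```python
def min_cases(n_variables: int, cases_per_variable: int = 10) -> int:
    """Smallest number of cases the rule of thumb asks for."""
    if n_variables < 0 or cases_per_variable <= 0:
        raise ValueError("counts must be non-negative and the ratio positive")
    return n_variables * cases_per_variable


# Hypothetical model sizes: a small clinical model up to a genotyping-array scale.
for n in (10, 1_000, 500_000):
    print(f"{n} variables -> at least {min_cases(n)} cases")
```

At the scale of hundreds of thousands of genotype variables, the rule asks for millions of cases, which is the motivation for aggregating data across institutions.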
Your Risk
[Figure: screenshot of a risk calculator showing 22% and 16%, annotated with the two quotes below]
“this program shows the estimated health risks of people with your same age, gender, and risk factor levels”
“this means that 5 of 100 people with this level of risk will have a heart attack or die”
People “like you”
[Figure: input space vs. output space. In the input space, "me" sits among “people with your same age, gender, and risk factor levels”; in the output space, among “people with this level of risk”. Axis labels: x, p (p=1 marked)]
People “like me”
[Figure: "me" plotted by height and gender]
Patients “like you”
[Figure: series of panels plotting "me" by height and gender (0/1), with a risk axis added in the final panel]
Assessing Quality of Individual
Predictions
• Hybrid model construction
– Non-parametric and parametric regression
– Kernel-based models
• Evaluation of calibration
– Graphical tools based on calibration error
– Input-based assessment
• Calibration methods
– Smooth isotonic regression (1:30 Cyril Magnin II)
– Doubly-penalized SVM
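As a point of reference for the calibration methods listed above, here is a minimal sketch of plain isotonic regression using the pool-adjacent-violators algorithm. It is the basic, non-smooth version, not the smooth isotonic regression named on the slide, and it does not pool tied scores; the function name and example data are illustrative.

```python
def isotonic_calibrate(scores, outcomes):
    """Return calibrated probabilities, in input order, that are non-decreasing in score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []  # each block is [sum of outcomes, number of cases]
    for i in order:
        blocks.append([float(outcomes[i]), 1])
        # Merge backwards while a block's mean is lower than its predecessor's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    calibrated = [0.0] * len(scores)
    start = 0
    for total, count in blocks:
        # Every case in a block receives the block's observed event rate.
        for i in order[start:start + count]:
            calibrated[i] = total / count
        start += count
    return calibrated


print(isotonic_calibrate([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1]))
```

The output is a step function of the score, which is one reason a smoothed variant can be attractive for individual-level predictions.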
Summary
• We need to aggregate as much information as we
can from experiments and clinical data to create
reasonable predictive models

• Objective models are being used in a variety of
medical domains, but few users know their
limitations

• We need better methods to assess the quality of
the models
Genome-Phenome Integration @ UCSD

Funding from NLM, NHLBI, NHGRI, NIBIB, NCRR, NIGMS, AHRQ, Fogarty, VAMRF, Komen
Foundation, UCSD Medical Center
March 9, 2011
AMIA TBI-19/CRI-01 ACMI Panel

Integrating Genomic and


Clinical Data in EHRs and
Biomedical Repositories:
Challenges, Solutions and
Opportunities
Peter Tarczy-Hornoch MD
Director, Biomedical Informatics Core, ITHS
Director, Research and Data Integration, ITS
Head and Professor, Biomedical and Health Informatics
Adjunct Professor, Computer Science and Engineering
Professor, Neonatology
Solutions for generating new genomic knowledge require integrating diverse phenotypic and genomic data
[Figure: data sources are Electronic Medical Record/Clinical Data, Electronic Case Report Form Data, Biodata (Instruments), and Biospecimens. Other labels: IRB approved protocol (three times), Honest broker, Integrated Data Repository, Researcher]
The University of Washington data repository
(Amalga) integrates phenotypic data from
30+ interfaces (10/2010)
Scope of Repository
• 3.5M patients, 42M visits,
220M+ lab results, 180M+
diagnoses & procedures
over 18 years
• 14 data systems populating
Amalga via 30+ real-time or
batch interfaces
• 2.7 Terabytes of data
• 4M new messages/day
• Use IRB/HIPAA compliant
Amalga can identify patients with a given phenotype and help investigators augment phenotypic information
• Eligibility criteria (IRB approved study)
Patients whose age >=18 years and are not deceased
AND
Had ICD-9 codes of 648.* OR 250.* OR 648 OR 250
AND
Had lab test results (Albumin >= 30 and <= 400) OR
(Albumin/Creatinine Ratio >= 30 and <= 400) within the last 2
years.
AND
Had ANY encounter in the service centers for Internal Medicine
OR Diabetes Care Center OR Family Medical Center in the last 2
years
AND
Have not had a diagnosis of 592.* OR 592 OR 585.6 OR V42.0
AND
Have not had lab test (Calcium > 10.5) OR (GFR < 60) OR
(Hemoglobin A1C HPLC > 9.5) OR (Hemoglobin A1C Rapid >
9.5).
• Link demographic, diagnoses, labs, & visit history data
• Nightly updates to candidate list, automated
notification, & custom study input screen
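The eligibility criteria above translate directly into a filter over linked patient data. The sketch below is an illustration, not Amalga's query interface: the `Patient` structure, the tuple layouts, and the use of 730 days for "the last 2 years" are assumptions, while the codes, lab names, and thresholds are copied from the criteria.

```python
from dataclasses import dataclass
from datetime import date, timedelta

TWO_YEARS = timedelta(days=730)  # assumed approximation of "the last 2 years"
INCLUSION_LABS = {"Albumin", "Albumin/Creatinine Ratio"}
SERVICE_CENTERS = {"Internal Medicine", "Diabetes Care Center", "Family Medical Center"}
EXCLUSION_LABS = {
    "Calcium": lambda v: v > 10.5,
    "GFR": lambda v: v < 60,
    "Hemoglobin A1C HPLC": lambda v: v > 9.5,
    "Hemoglobin A1C Rapid": lambda v: v > 9.5,
}


@dataclass
class Patient:
    birth_date: date
    deceased: bool
    diagnoses: list   # ICD-9 code strings, e.g. "250.00"
    labs: list        # (name, value, date) tuples
    encounters: list  # (service center, date) tuples


def age_on(birth_date, today):
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def has_diagnosis(patient, families=(), exact=()):
    """True if any code equals an exact code or falls in a family such as 250 or 250.*."""
    return any(c in exact or c.split(".")[0] in families for c in patient.diagnoses)


def is_eligible(p, today):
    recent = today - TWO_YEARS
    if p.deceased or age_on(p.birth_date, today) < 18:
        return False
    if not has_diagnosis(p, families={"648", "250"}):
        return False
    if not any(n in INCLUSION_LABS and 30 <= v <= 400 and d >= recent
               for n, v, d in p.labs):
        return False
    if not any(c in SERVICE_CENTERS and d >= recent for c, d in p.encounters):
        return False
    if has_diagnosis(p, families={"592"}, exact={"585.6", "V42.0"}):
        return False
    # Exclusion labs apply at any time, matching "have not had".
    return not any(n in EXCLUSION_LABS and EXCLUSION_LABS[n](v) for n, v, _ in p.labs)
```

Running a function like this over the whole repository each night would reproduce the candidate-list update described in the last bullet.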
Some phenotypes more challenging to capture
[Figure credit: Capurro, Tarczy-Hornoch, TBI 2011 (TBI-10)]
Semantic alignment in data repositories pulling data from disparate systems is a challenge

1)
- 2 medication lists
- 2 systems: Pharmacy, Medical record
- Single dictionary (A)
2)
- 1 medication list
- 1 system: Medical record
- Single dictionary (B)
3)
- n medication lists
- n systems (members): hospitals, clinics, pharmacies, mail-order
- NO dictionary

Ongoing research and research opportunities: ontologies, semantic alignment
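A minimal sketch of dictionary-based alignment for the three situations above, assuming each source system either has a local dictionary that maps its terms to a shared concept or has none. All system names, local terms, and concept labels here are hypothetical; entries from a system with no dictionary simply fall through to a review queue, which is where ontology and NLP work would come in.

```python
def align(records, dictionaries):
    """Map (system, local term) pairs to shared concepts; collect the rest for review."""
    aligned, unmapped = [], []
    for system, term in records:
        # Dictionary keys are assumed to be stored lowercase and trimmed.
        concept = dictionaries.get(system, {}).get(term.strip().lower())
        if concept is None:
            unmapped.append((system, term))
        else:
            aligned.append((system, term, concept))
    return aligned, unmapped


dictionaries = {
    "pharmacy": {"a-001": "metformin"},                          # dictionary (A)
    "medical_record": {"metformin 500 mg tab": "metformin"},     # dictionary (B)
}
records = [
    ("pharmacy", "A-001"),
    ("medical_record", "Metformin 500 mg tab"),
    ("member_clinic", "metformin bid"),  # system with no dictionary
]
aligned, unmapped = align(records, dictionaries)
print(aligned)
print(unmapped)
```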
EHR computable phenotypes may not be
granular enough thus text mining is a key
opportunity

CONFIDENTIAL – UNPUBLISHED DATA (Black, Capurro et al)


Systems to bring genomic knowledge to the
point of care need to integrate with both
genomic and phenotypic data
• “Pharmacogenomics (PGx) is the study of the genetic basis of variability
among individuals in response to drugs” (Pharmacogenomics & Personalized
Medicine, Cohen N. 2008)
• Example: Tamoxifen and time to recurrence

Increased monitoring
for poor metabolizers
recommended

Note: Given limited evidence, as of 2009, ASCO


does NOT recommend testing for CYP2D6

Overby, Tarczy-Hornoch et al BMC Bioinformatics 2010

Motivation
Pharmacogenomic decision support requires
reasoning across assertions with different levels
of evidence

Genotype
scoring system

Raw from 
Sheffield et al. Clin Bio Rev.  2009

Overby et al IDAMAP 2010

Methods
Prototype system built on Amalga integrates
Illumina SNP data and clinical data and
basic genomic knowledge

Data is from simulated patients


• Potential applications: discovery of associations,
validation of associations, clinical alerts/reminders
• Research opportunities: genomics, data modeling, data
mining, text mining/NLP, decision support
* Overby: decision support for pharmacogenomics, * Yetisgen-Yildiz: phenotype extraction
Collaboration is key to realizing the opportunity to use biomedical data to advance genomic research & practice
[Figure: collaborating groups. Recoverable labels:
– Computer Science
– Faculty (19+39): Research, Service
– Students (39): MS, PhD, Postdoc
– 13 Cores including Biomedical Informatics (and Regulatory/Bioethics)
– Biomedical Data & Biospecimens: Medical Records, Billing Systems, Data Repositories
– Academics & Clinical: Lab Medicine, Pathology, Genome Sciences, Northwest Institute of Genetic Medicine
– Nursing, Public Health]
Acknowledgements (incomplete)
• Funding: NCRR UL1 RR 025014, NLM T15 LM07442, NIH, NSF, AHRQ, UW Medicine
• Faculty: Nick Anderson, Jim Brinkley, Alon Halevy, Ira Kalet, Kari Stephens, Dan Suciu, Peter Tarczy-Hornoch
• PhD Students: Eithon Cadag, Daniel Capurro, Paul Fearn, Alicia Guidry, Ping Lin, Brent Louie, Peter Mork, Casey Overby, Rupa Patel, Terry Shen
• ITHS BMI Core Staff: Bill Barker, Tony Black, Joshua Franklin, Gene Hart, Greg Hather, Xenia Hertzenberg, Brent Louie, May Lim, Paul Oldenkamp, Roy Pardee, Jim Piper, Jaime Prosser, Justin Prosser, Ron Shaker, Richard Veino
• ITS Staff: Joe Frost, Jim Hoath, Mike Kuffel, Soohee Lee, Dave Rankin, Dan Sullivan, Paul Tittel, Tanya Tobin, and more
• Shawn: importing variant data from our clinical genomics repository raises several issues in
representing variant data in a time-invariant way. Looking across implementations, the same
issues are seen in normalizing variant reporting across sites. The nomenclature of SNPs is not
well formalized in a time-invariant fashion. The HGVS name can change with the gene version
because it depends on knowing the first nucleotide of a specific gene and agreeing
upon the exon sequence. The RS number also varies, is only available for a limited set
of "named" SNPs, and does not identify the base pair that was actually substituted at that
location. Thinking about how these details can be addressed (or at least bringing them to
people's attention) is critical to pooling phenotypic/genotypic data across clinical sites
and domains, as well as to being able to reuse the data well into the future.
• Bellazzi: supporting research in oncology and cardiology by integrating data from several
sources, including a biobank, and the issue of properly querying and analyzing these data.
• Lucila Ohno-Machado: the iDASH National Center for Biomedical Computing.
• Peter: representation, querying, and decision support, both in the research warehouse and in a
pilot clinical context.
• Henry: how to automatically link patient-level clinical (phenotype) and genomic data
(genotype) using the clinical data warehouse models being deployed to support secondary
use of health information (e.g. i2b2, STRIDE and others). How do we design a "service"
that allows researchers to identify genomic data and/or biospecimens using a mix of
phenotype and genotype criteria? What are the opportunities for data sharing and
aggregation at the national level? There are clearly also many questions about using clinical
data as phenotypic data in this setting (being addressed by the eMERGE group), and a multitude
of questions about how best to integrate genomic data/reports back into clinical workflow
and have the data be understandable and actionable by both clinician and patient.
