
BIOINFORMATICS As published in BTi June 2006

The universal genetics database: information sharing in genetics and beyond
by Dr D. Widdows and Prof M. Barmada
In many empirically intensive fields, researchers spend much of
their time marshalling, formatting, and moving impenetrable blocks
of data from one place to another. To solve this problem and pave
the way for future-proof development, the University of Pittsburgh
Graduate School of Public Health teamed up with MAYA Design, a
research and technology lab spun out of Carnegie Mellon University.
The result is the first prototype of a Universal Genetics Database,
which automatically combines background information from genetic
databases with experimental results in particular studies.

THE NEED FOR INTEGRATION
Biomedical informatics has become a field with many huge databases that contain valuable common information, and countless small studies with information kept in local databases or text files that are spread across different file systems and exchanged by e-mail. Data integration has become a huge challenge in itself, with a variety of relational databases, markup languages, and heterogeneous ontologies vying for attention.

Researchers in the Department of Human Genetics at the University of Pittsburgh's Graduate School of Public Health (GSPH) are acutely aware of the information integration problems in the field of genetic epidemiology. With the advent of the genomics age ushered in by the Human Genome Project has come a multitude of databases, each with its own unique syntax. New graduate students with a background in biological sciences have to become familiar with scripting and query languages just to do the data plumbing that has become a prerequisite to almost any scientific inference.

To pave the way towards solving these problems once and for all, the GSPH started an ongoing collaboration with MAYA Design, a Pittsburgh technology lab that has been developing and helping to deploy collaborative information architectures for the past 15 years.

THE UNIVERSAL DATABASE ARCHITECTURE
While the problems currently facing biomedical informatics are unique in scale and complexity, they are similar to the increasing problems faced in many other growing fields. MAYA Design has spent much of the past 15 years researching, developing, and deploying systems for information collaboration, and during this period has found many endemic infrastructure patterns that were hampering integration.

Electronic information is dispersed across a huge number of locations, and to find that information, users need to request copies of files directly from those locations. Every time a file is moved from one machine to another, or even to a different location on the same machine, it changes its identity and becomes a different file. Imagine if every time a book moved to a different bookshelf, it became a different book! Library systems would never work, and the reliable transmission of knowledge that enabled modern science to develop would probably never have happened.

MAYA's core innovation is an information system in which information itself, and not its physical location or transmission medium, is the primary currency. Because information retains its identity wherever it is found, the information system is described as the Universal Database. The Universal Database is a peer-to-peer system in which all information is broken down into data objects called u-forms, which can be moved independently and replicated to many places at once [Figure 1].

Figure 1. The same information objects can be replicated to many locations and seen in different ways by different users.

Each u-form has a universally unique identifier (UUID), so that any data object can refer to any other data object, even if they are from completely different datasets on different peers in the network. New variables or attributes can be added to any u-form, so the format of each data item can evolve as the need arises. These two requirements (data identity and data extensibility) have been recognized as vitally important by other developing frameworks such as the Semantic Web, and since (UUID, attribute, value) triples can be mapped directly to (URI, predicate, object) triples, the Universal Database can automatically be used to express the Resource Description Framework (RDF) of the Semantic Web. The key difference between these two frameworks is that the Universal Database encourages data liquidity, i.e. the flow of information to wherever it is needed. Because the identifiers in the Universal Database are not physical locations, a u-form with a particular UUID can often be found in many locations, and can be replicated to a user's own venue. This is vital for supporting offline activity, and for optimizing analytical work with small portions of large datasets.

Because objects from authoritative datasets can be replicated to many different locations, MAYA uses the Universal Database architecture as a platform for disseminating the Information Commons, through which definitive and authoritative public data can be obtained and massively replicated.

THE UNIVERSAL GENETICS DATABASE AND THE GENETICS INFORMATION COMMONS
Researchers at the University of Pittsburgh's Graduate School of Public Health teamed up with MAYA Design to test the viability of a Universal Genetics Database (UGD), which uses the Universal Database as a platform for collecting and fusing genetic information, and for making this information and accompanying tools automatically available to researchers.

By studying the workflow and data needs of a typical genetic study, researchers from the GSPH and MAYA created an information architecture and data fusion tools to fulfill all of the information representation needs of several end-to-end studies, using the Universal Database architecture. (See attached case study.) Part of the novelty of this approach is that information from publicly available genetic databases is represented using the same infrastructure as study-specific data about particular genetic samples, markers, and pedigrees. Where necessary, personally identifiable information is protected by encryption and by preventing its flow outside of the GSPH network. In this way, the public parts of the UGD become a Genetics Information Commons (GIC), contributing to the growing wealth of Information Commons data [Figure 2].

Figure 2. The Genetics Information Commons is the part of the Universal Genetics Database that is in the public domain.

THE GREATER GIC VISION
Together with MAYA Design, GSPH researchers are extending the GIC tools to address a richer variety of human conditions, and to fuse data across a wider collection of datasets from different domains that are already part of the Information Commons. The boundaries of the Genetics Information Commons are not fixed. Instead there is an integrated heterogeneous information space through which researchers can access demographic, textual, environmental and economic datasets, all fused around common points of reference such as shared spatial or temporal concepts.

The long-term goal of this system, as it develops, is to create an information infrastructure in which researchers spend less time tracking down, parsing and organizing different data files ("data plumbing"), and more time analysing and publishing scientific results. Universal identity, extensibility, and data liquidity gradually encourage an environment in which the importing and formatting of data by one researcher naturally makes it available to others if the initial research team so desires. Data reuse and availability across multiple disciplines will eventually enable scientists from many fields to collaborate and explore information at a much more granular level than is possible using the traditional journal article as the main means for information publication. Analysts using the Universal Database do not spend time researching the artificial data mismatches that result from data evolving separately in separate machines. This leaves the much more important problem of researching the relationships between phenomena in the real world.

TAKING PART IN GIC DEVELOPMENT
Researchers interested in joining the Genetics Information Commons can do so in a number of ways. Currently, MAYA Design is offering pilot versions of the Universal Database with command-line tools and APIs to select researchers who are interested in joining the GIC. In addition, MAYA Design is seeking partners to collaborate in pioneering the development of the GIC through joint projects that will add useful data and tools to the GIC for distribution to the research community. For more information, contact Josh Knauer, Director of Advanced Development (knauer@maya.com).

THE AUTHORS
Dominic Widdows, D.Phil.
Senior Research Engineer
MAYA Design, Inc.
Building 2, Suite 300,
2730 Sidney Street
Pittsburgh, PA 15203
USA
Tel: +1 412 488-2900
e-mail: widdows@maya.com

Michael Barmada, Ph.D.
Associate Professor of Human Genetics
Director, Center for Computational Genetics
Co-Director, Bioinformatics Analysis Core Services,
Graduate School of Public Health, University of Pittsburgh,
130 Desoto Street, Pittsburgh, PA 15261
USA

UGD CASE STUDIES AT THE GRADUATE SCHOOL OF PUBLIC HEALTH
Founded in 1948, GSPH is world-renowned for contributions that have influenced public health practices and medical care for millions of people. The Human Genetics Department within the GSPH is concerned with identifying genetic susceptibility loci for common complex disorders, and with understanding the impact of those susceptibility loci on disease prevention and public health.

Carrying out a genetic study to identify which genes are responsible for different effects involves collecting and collating information about family trees, which family members are affected, what biological samples are taken from each individual, which "versions" (alleles/haplotypes) of each gene are present in each sample, and what information is available in standard genetic databases concerning genes of interest. This process could take up to 6 person-weeks for a single study and involve the integration of information from a plethora of local and remote databases, spreadsheets, and lab results in different file formats. To complicate matters more, typical studies generate several potentially overlapping partial data sets which must be compiled together to form a whole.

Figure 3. An example of the type of data used in a genetic epidemiology study. In addition to the information about who is related to whom, and what condition each person has, genetic studies need to deal with information on the markers typed in each individual (like the NOD2 marker, shown here, which is a known risk factor for inflammatory bowel disease). As the figure demonstrates, individuals carry different "versions" (or alleles) of markers within genes of interest, and the patterns of inheritance can give information on the likelihood that the gene is involved in the disorder (when combined with information about the pedigree and the disease). In addition, public databases such as those housed at the U.S. National Center for Biotechnology Information (NCBI) can give additional information, both about the disease of interest and about the gene or markers that are used in a study. All of this information must be integrated in a meaningful fashion for a genetic study to have a hope of identifying genes for complex human genetic diseases.

To streamline this process, MAYA Design and the GSPH worked together to create import and data fusion tools that do the heavy lifting involved in creating a genetic study. As well as creating data import and export tools, this process involved importing datasets into the GIC, including the National Center for Biotechnology Information's dbSNP and Iceland's Decode Genetics databases that keep track of known genetic loci and their positions on the chromosome. These datasets are indexed along useful (and easily extended) dimensions such as name and genetic distance in base pairs.

As well as different portals for viewing the information space, the main functionality created was a set of data import and export tools. Using simple commands, whole collections of input data can be brought into the system. As well as uniting data from several different file formats into a single model, this import step links the data to important background information, such as public databases of information about genetic markers.

The GIC tools were used to import and analyse previous studies concerning Ulcerative Colitis and Crohn's Disease (the two common subtypes of Inflammatory Bowel Disease), as well as data from several population- and family-based studies of acute and chronic pancreatitis. The automatic import, indexing and fusion tools within a single information infrastructure allowed results to be obtained much more quickly, so that researchers could spend more time on data analysis and less time on data integration.
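
The u-form model and the fusion of overlapping partial data sets described in this article can be illustrated with a short Python sketch. It is only a rough illustration under stated assumptions, not MAYA's implementation: the `UForm` class, the `fuse` and `to_triples` functions, the attribute names, and the allele values are all hypothetical, and the choice to raise an error on conflicting values is one possible policy among several. The sketch shows three ideas from the text: a data object is identified by a UUID rather than by its location, new attributes can be added freely, and each (UUID, attribute, value) triple maps onto a (URI, predicate, object) triple.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Tuple
from uuid import UUID, uuid4


@dataclass
class UForm:
    """Hypothetical stand-in for a u-form: a UUID plus open-ended attributes."""
    ident: UUID
    attributes: Dict[str, Any] = field(default_factory=dict)


def fuse(partials: Iterable[UForm]) -> Dict[UUID, UForm]:
    """Merge partial records that share a UUID into one u-form per UUID."""
    fused: Dict[UUID, UForm] = {}
    for part in partials:
        target = fused.setdefault(part.ident, UForm(part.ident))
        for name, value in part.attributes.items():
            # Assumed policy: disagreeing sources are reported, not overwritten.
            if name in target.attributes and target.attributes[name] != value:
                raise ValueError(f"conflicting values for {name!r} on {part.ident}")
            target.attributes[name] = value
    return fused


def to_triples(form: UForm) -> List[Tuple[str, str, Any]]:
    """Express a u-form as (subject, predicate, object) triples.

    A UUID-valued attribute is treated as a reference to another u-form
    and is rendered as a urn:uuid URI, like the subject.
    """
    subject = form.ident.urn
    triples = []
    for name, value in form.attributes.items():
        obj = value.urn if isinstance(value, UUID) else value
        triples.append((subject, name, obj))
    return triples


# Background u-form, as if replicated from a public marker database.
marker = UForm(uuid4(), {"type": "marker", "name": "NOD2"})

# Two partial views of the same individual, as if read from different files.
# The allele values are made up purely for illustration.
person_id = uuid4()
from_pedigree = UForm(person_id, {"type": "individual", "affected": True})
from_lab = UForm(person_id, {"typed_marker": marker.ident, "alleles": ("A", "G")})

fused = fuse([marker, from_pedigree, from_lab])
person = fused[person_id]
for triple in to_triples(person):
    print(triple)
```

Because the two partial records carry the same UUID, they collapse into a single object regardless of which file they came from, and the reference to the marker is just another UUID, so study data and background data live in the same model.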
