
technology feature

Which tools do I use?
What about processing power?
Aren't pipelines for oil?
Can I buy the data analysis?
Beyond doubts, questions await

Drilling into big cancer-genome data


Vivien Marx

Paving roads through data mountains, consortia are developing workflows and tools for widespread use.

nature methods | VOL.10 NO.4 | APRIL 2013 | 293
© 2013 Nature America, Inc. All rights reserved.

Cancer geneticist Matthew Meyerson, who is at the Dana-Farber Cancer Institute and the Broad Institute of MIT and Harvard, tracks the many ways tumors wreak chaos in orderly cells. He wants to squeeze into his schedule a dedicated time period in Gad Getz's lab at the Broad Institute to hone his computational skills for analyzing data about cancer genomes.

Such collaborations could become more common as scientists dive into data sets generated by large consortia including The Cancer Genome Atlas (TCGA) Research Network and the International Cancer Genome Consortium (ICGC). To shape an experiment, Getz suggests that scientists first look at existing data. However, this shift in habits is not an easy sell, and doubts about tools and computational approaches abound. To make choosing among the options easier for the community, Getz and colleagues at other institutions are comparing and benchmarking software tools and making analysis pipelines more accessible. Separately, companies are expanding the ways to help customers work through big cancer-genome data.

Where do I find data?

TCGA teams are profiling molecular differences between tumor cells and healthy cells in 500 patients and for more than 20 cancer types [1]. Since 2006, TCGA has explored these differences using a variety of platforms across more than 6,000 patient tumor-normal sample pairs, using single-nucleotide polymorphism, small RNA, transcriptome, exome and methylation data from sequencing and microarrays, says Kenna Shaw, TCGA program office director. For many samples, whole-genome sequence data are becoming available.

Those data, along with other sequencing results such as exome or mRNA sequence data, are held at the Cancer Genomics Hub at the University of California, Santa Cruz (UCSC), with controlled access for data that could allow individuals to be identified. Nonsequence data are kept at the TCGA data portal. Recently, a researcher in Getz's group at the Broad downloaded genome sequences of patients' tumor and normal tissue from the Cancer Genomics Hub. "We downloaded something like 20 whole genomes, tumor-normal pairs, in 3 days," Getz says. "That's quite fast."

To create data packets that are easier to handle for analysis than the gigantic raw sequence files, the TCGA centers have created tiers. "Higher-level data, for example, the list of somatic mutations in exome data or copy-number changes along the genome, or expression levels of different genes, all of these are public data," Getz says. Those are much smaller in size than raw sequence files, he says, a difference that can help scientists shopping for more manageable files.

Figure: TCGA is characterizing many tumor types. In this simplified Circos plot visualizing TCGA breast cancer data, scientists can integrate results and explore the genome data inter-relationships. (Credit: US National Institutes of Health/TCGA)

Since its launch in 2008, the ICGC has amassed around 250 terabytes of data from approximately 1,300 donors, in Lincoln Stein's rough estimation. He directs bioinformatics and computational biology at the Ontario Institute for Cancer Research (OICR), which is also the ICGC's data coordination center. ICGC scientists in Asia,
Europe and North America are characterizing over 24,000 tumor genomes from 50 tumor types, comparing tumor and normal tissue [2].

ICGC data are deposited in the European Genome Phenome Archive. Somatic variant data are openly accessible at the ICGC Data Portal but scientists must apply to access data such as raw sequence, germline mutations or clinical data.

In the past, each ICGC country housed its own data, but that strategy is changing. "The federated model, as we've discovered, has an Achilles heel," Stein says. Network connectivity issues have on occasion made data inaccessible. All the interpreted data are now being copied into a centralized database administered by OICR. This transfer will be completed this autumn. The year-long project is worth the effort because the new system scales well, Stein says. The database uses the distributed MongoDB architecture, which also offers high data availability, he says.

Which tools do I use?

A full toolbox is evidence of a vibrant developer community. "Cancer genome analysis tools number easily in the hundreds," says Stein, and every conference poster session brings more. "It's daunting for experts in the field as well."

"It's good to have many tools, but there is no systematic comparison of these tools," says Getz. Stein and his team find that many published tools have issues beyond a lack of documentation. "They don't install; they crash; they don't pass their own internal tests," Stein says. Although many tool builders test their algorithms for publication, they often do not finish the software engineering needed to stabilize the tool.

Addressing these hurdles, both TCGA and ICGC have begun benchmarking tools against a so-called gold-standard data set, which can take months. The OICR is wrapping up the benchmarking of nearly two dozen algorithms that detect structural genome rearrangement.

Stein believes that tool developers would save their colleagues time by distributing software preinstalled on a virtual machine, a so-called instance, on Amazon's Elastic Compute Cloud, Oracle's VirtualBox or VMware. Many scientists already load tools into the online and cloud-based genome data analysis platform Galaxy, which also has a software repository called Tool Shed. Although not all tools are guaranteed to run, the more restricted environment of Galaxy's virtual machine offers a predictable version of the operating system with preinstalled libraries, says Stein. "People can star a tool that they like and dislike, so if it doesn't work, it will get low ratings, and we would probably not even bother with it in our benchmarking," he says.

One popular tool in use at OICR is the Broad's sequence-variant caller, GATK. It teases out the alterations between a person's tumor and normal tissue as well as variations from the human reference genome, says Quang Trinh, a computational biologist at OICR. Careful testing precedes the addition of any tool to the OICR production pipeline, an approach he hopes others can follow, too. "Each time you pick a tool, you have to run through the process of testing and validating, making sure what the false positive and false negative rate is, and so on," says Trinh.

What about processing power?

Downloading and analyzing large data sets takes planning and plenty of computational horsepower. For researchers who do not regularly need continuous large-scale analysis, cloud computing can be an option.

After downloading a data set, analysis at the Broad runs on-site in a high-performance computing environment. To expand their offerings and make data and tools more available to the community, Getz, UCSC's David Haussler and colleagues from other institutions are exploring cloud computing options, which must be made secure to process patient data. "These are things that are still in flux," Getz says. "But we're experimenting with building our compute pipeline on the cloud."

The cloud's best feature, he says, is elasticity. Researchers pay for the amount of compute time used, not for maintaining their own hardware. "You could say, 'OK, now I need 1,000 computers,'" he says. "And then the next day you only need two." This solution works for both big genome centers and small labs, which can use clouds run by Amazon, Google, Microsoft, IBM or other providers.

Once the analysis is done, the virtual computers are released, and the data have to travel, which costs time and money. To address data transfer issues, Getz and his colleagues are exploring ways to keep data in the cloud. He says more help will come from increased access to the high-speed academic Internet backbone called Internet2, which includes 100-gigabit-per-second connections and is being set up by a consortium of universities, government agencies and companies.

Aren't pipelines for oil?

Genomics analysis pipelines cannot get oil from point A to point B, but they can transform data from A to Z. Every 2 weeks, the Broad's Genome Data Analysis Center (GDAC; http://gdac.broadinstitute.org/), with team members from the Broad, MD Anderson Cancer Center and Harvard Medical School, swoops up all the generated TCGA data, normalizes them and makes them available.

"In a separate automated analysis pipeline series, these data sets are run through many
software tools: for example, to detect significant copy-number alterations, correlate methylation status with clinical features or find significantly mutated genes," Getz says. The pipelines run in a computational framework called Firehose, which also generates analysis reports.

Soon the Broad will open Firehose to all TCGA scientists and, eventually, the wider research community. "We want to make the system available so people can install their own tools and run more tools," Getz says. The future aim is "to generate something that looks like a publication automatically, with figures, supplementary information and figure legends," he says. The pipeline report still requires interpretation by scientists, but it jump-starts analysis.

Figure: Virtual data factory. TCGA data feed an analysis stage (expression clusters, copy-number peaks, significantly mutated genes, survival, pathways) that produces an analysis summary for each run. Panel labels: "2,500+ pipelines per month, across all TCGA disease types"; "Results dashboards and biologist-friendly reports"; "Open to public for browsing and automatic download"; "Democratize TCGA science by lowering entry barriers". The dashboard shown is for the 2013_01_16 analyses run:

AnalysisReport   # Pipelines   % Successful   Download
BLCA             49            100%           Open, Protected
BRCA             66            100%           Open, Protected
CESC             46            100%           Open, Protected
COADREAD         66            100%           Open, Protected
COAD             66            100%           Open, Protected
DLBC             8             100%           Open, Protected
GBM              68            100%           Open, Protected
HNSC             49            100%           Open, Protected
KICH             23            100%           Open, Protected
KIRC             66            100%           Open, Protected
KIRP             63            100%           Open, Protected
LAML             31            100%           Open, Protected
LGG              63            100%           Open, Protected
LIHC             15            100%           Open, Protected
LUAD             66            100%           Open, Protected
LUSC             66            100%           Open, Protected
OV               72            100%           Open, Protected
PAAD             39            100%           Open, Protected
PRAD             46            100%           Open, Protected
READ             66            100%           Open, Protected
SARC             11            100%           Open, Protected
SKCM             49            100%           Open, Protected
STAD             44            100%           Open, Protected
THCA             49            100%           Open, Protected
UCEC             66            100%           Open, Protected

Every 2 weeks, the Broad's Genome Data Analysis Center (GDAC) takes the TCGA data, normalizes them and makes them available. Analysis pipelines that are run in a computational framework called Firehose take these data sets through many software tools. (Credit: M. Noble/Broad Institute)

One analysis challenge has been the "Babel Problem," as Broad software engineer Michael Noble calls it. Scientists were not able to precisely refer to TCGA data slices, which reduced reproducibility. "They did not speak the same language," says Getz. To resolve this issue, Noble created Version Stamp to tag each data set and analysis run. Scientists can now identify the specific data they use for a particular analysis.

Firehose has a cousin called SeqWare developed by computational biologist Brian O'Connor during his postdoctoral fellowship at the University of California, Los Angeles. In 2011, he joined OICR, where he is senior software architect. "SeqWare is a framework to package, archive and share sequence analysis workflows," O'Connor says. Built to be location agnostic, it is not restricted to any one institution or type of computational infrastructure.

Whereas Firehose handles a substantial portion of analytical workflows for TCGA, SeqWare currently handles just the ICGC variant annotation pipeline, says O'Connor.

With SeqWare, data coming off the sequencer flow into a database that is monitored by a software-based decider that triggers predetermined workflows for assembly, alignment and analysis. "This type of system has allowed us to automatically analyze thousands of samples with very little human interaction," he says. "Our plans are to release these workflows to the public, which would allow people to replicate our work at their own organization or on the Amazon cloud." There is also a portal for nontechies to interact with the system and get analyzed data back.

Figure: SeqWare components: sequencers, tracking database, workflows, files, cluster engine, and local and cloud-based computing. With the pipeline framework SeqWare, data flow off the sequencer into a database, where software triggers predetermined analysis workflows. SeqWare handles genome-variant annotation at the OICR. (Credit: B. O'Connor/OICR)

Not all labs need platforms for large-scale automated analysis of terabases of sequence data. However, that's changing, says O'Connor. As sequencing technologies evolve, individual labs increasingly produce data hills similar to the output of small genome centers from a few years ago.

SeqWare-based workflows are among the many ICGC pipelines. As Stein explains, the consortium is currently addressing this multipipeline situation by benchmarking their pipelines in an exercise run by Ivo Gut of Spain's Centre Nacional d'Anàlisi Genòmica. At a recent meeting, the researchers reviewed the first results. "It's kind of interesting because nobody got exactly the same results," says Stein. The team plans to tally their findings into a series of best practices, which stand to help researchers use pipelines.

Can I buy the data analysis?

Beyond open-source tools, many commercial offerings exist. As the Broad widens the Firehose user base, some tools might be commercialized, says Getz, via a model that evolves tools by keeping them free for academics and nonprofits but requiring a fee from companies. "It's typically not that easy to get funding to support tools and make them commercial-level tools."

Taking SeqWare beyond academia, O'Connor launched a consulting company, Nimbus Informatics, providing an Amazon cloud-based version of SeqWare. He tailors workflows for clients: for example, helping Courtagen scale up their sequencing services and Life Technologies analyze the Iceman genome (http://icemangenome.net/), a mummy dating back to 3,300 BC.

Some companies focus on sequence data analysis for drug discovery or clinical uses. Cancer research right now is not unlike the phase when whole-genome sequencing took off, says Thomas Knudsen, CEO of the bioinformatics firm CLC bio, which has customers in academia, biotech and pharma. First, the early adopters in large genome centers built their own tools, and then companies such as his offered theirs. Similarly, large-scale cancer research will
soon broaden, but it is now more limited to groups with bioinformaticians for in-house software development.

Knudsen's customer base has grown as more companies and clinical researchers have begun using second-generation sequencing. "Another trend is that we sell to more and more customers who replace their open-source pipelines and internally developed software with our solutions."

Jorge Conde, a cofounder of Knome, sees TCGA and similar projects as a source of growth for his firm's user base, which includes scientists seeking additional computational know-how. Customers can approach Knome to find genomic variants in data by using the company's platform, which integrates public data sources and analysis tools.

Early versions of academically produced software can start out "clunky and buggy," says bioinformatician Martin Ferguson, who consults for TCGA, setting up processes that ease data comparison across the hundreds of participating clinical sites.

Though they may have small beginnings, academic cancer genome analysis tools can develop significant business careers. One example is Compendia Bioscience, a 2006 University of Michigan spin-off that Life Technologies acquired last fall. Compendia's founders sought applications in drug development and clinical research. Its platform Oncomine includes cancer genome data, such as those from TCGA, and analysis tools, and it is free for academics but not for companies.

Firms are big users of TCGA data, slicing out what they need, often with commercial success, Ferguson says. Some of his clients are pharma companies. "They're using the data for anything they can: they're mining it for new targets, they're mining it for potential biomarkers that can be tested and turned into a companion diagnostic," he says.

Cancer genome projects are among the reasons Oracle built a platform for scientists to scale up genomic data analysis and include large public-domain data sets such as TCGA, and to view them across genotype and phenotype, says Jonathan Sheldon, global senior director of translational medicine in Oracle's health sciences business unit. "Frankly, bioinformaticians have to spend way too much time doing the mundane but necessary formatting and reformatting work to load these public data into systems ready for analysis; we are productizing this step so they can focus on working with the disease scientists."

To this end, the company built an omics data model, which involves such tasks as defining data structures and how they relate to one another, and a platform that can analyze data from different sequencers and analysis pipelines, either locally or in a secured cloud-based computing environment or a combination of both, Sheldon says.

Recently, researchers at the Whitehead Institute for Biomedical Research used data posted in the 1000 Genomes Project and public genealogy information on the web to identify 50 individuals on the basis of short tandem repeats from sequence data [3]. This finding makes it likely that the privacy landscape of genomic data per se is going to tighten up, Sheldon says.

Many companies offer to help researchers set up cloud-based genomics analysis. "It's perhaps a better economic solution than trying to put together your own compute farm, hiring an IT staff to maintain it and then hiring a bunch of programmers to build pipelines for you," says Elaine Mardis, who codirects The Genome Institute at Washington University School of Medicine. She also advises DNAnexus, a company offering these types of genome analysis services.

Beyond doubts, questions await

Researchers can use the available data and tools on their own hardware and the cloud to pursue their questions of interest. There are plenty of open questions to plumb because large genome centers do not have time for in-depth analysis, says Meyerson. Besides working to better understand the mix of normal and cancerous cells in a tumor, scientists seek to discern mutations that drive cancer progression. "The identification of drivers and passengers either computationally or experimentally remains a challenge," he says.

Figure: Anatomy of a Firehose Version Stamp. The example stamp analyses__2013_01_16 has three parts:
What ("analyses"): semantically unique.
Separator ("__"): double underscore aids parsing.
When ("2013_01_16"): chronologically unique and sortable.
Further examples:
stddata__2013_01_16: data snapshot on 16 January 2013, packaged into standardized form.
awg_lgg__2013_01_16: packages with same date guaranteed to contain same data subset (for example, custom analyses of lower-grade glioma data).
The Broad Institute confronts the Babel Problem that emerged when scientists used TCGA data but could not readily identify data sets. Version Stamp makes each automated analysis identifiable. (Credit: M. Noble/Broad Institute)
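The Version Stamp layout (a semantic label, a double underscore, then a date written as YYYY_MM_DD) lends itself to mechanical parsing. The following is a minimal Python sketch of that idea, inferred only from the three example stamps analyses__2013_01_16, stddata__2013_01_16 and awg_lgg__2013_01_16. The names parse_version_stamp and VersionStamp are hypothetical illustrations and are not part of Firehose, and the real stamp grammar may allow forms this pattern does not cover.

```python
import re
from datetime import date
from typing import NamedTuple


class VersionStamp(NamedTuple):
    """Hypothetical container for the two parts of a stamp."""
    what: str   # semantic label, e.g. "analyses", "stddata", "awg_lgg"
    when: date  # run date, encoded in the stamp as YYYY_MM_DD


# Assumed grammar: <what>__<YYYY>_<MM>_<DD>.
# The label may itself contain single underscores (as in "awg_lgg"),
# so the double underscore is the only reliable separator.
_STAMP = re.compile(r"^(?P<what>.+?)__(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})$")


def parse_version_stamp(stamp: str) -> VersionStamp:
    """Split a stamp into its label and date, or raise ValueError."""
    match = _STAMP.match(stamp)
    if match is None:
        raise ValueError(f"not a recognizable version stamp: {stamp!r}")
    when = date(int(match["y"]), int(match["m"]), int(match["d"]))
    return VersionStamp(match["what"], when)


if __name__ == "__main__":
    for text in ("analyses__2013_01_16", "stddata__2013_01_16", "awg_lgg__2013_01_16"):
        stamp = parse_version_stamp(text)
        print(stamp.what, stamp.when.isoformat())
```

Because the date fields are zero-padded and ordered year, month, day, stamps that share a label also sort chronologically as plain strings, which is presumably what the figure means by "sortable".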
The genes most commonly mutated in cancer are turning out to be ones that had been identified on either the gene or pathway level prior to the advent of second-generation sequencing, says Meyerson. But the data are also delivering unexpected results. He and TCGA colleagues discovered previously unreported loss-of-function mutations in the HLA-A gene in over 170 squamous cell lung cancers [4]. The team noted that this discovery speaks to cancer's ability to evade the immune system. "Such somatic mutations are delivering big surprises in terms of the types of genes to be found mutated in cancer," says Meyerson.

Although large-scale cancer genome projects such as those connected to TCGA and the ICGC are reaching "the afternoons of their first iterations, the systematic characterization of cancer genomes is at the very earliest moment of dawn," Meyerson says. "I think the cancer genome is more deeply disordered, especially in terms of rearrangements and all sort of unexpected structural events than we had ever anticipated."

He believes genome-based diagnosis will become common for cancer patients, and his commercial ventures reflect this view, including a licensed patent to LabCorp of America and the launch of Foundation Medicine, which offers sequencing-based cancer diagnosis.

Though individual scientists lacking computational expertise cannot yet take raw whole-genome sequence reads and find the important variants in their samples, they can stitch together open-source tools from the genome centers to analyze their own large data sets, says Mardis.

A little over 5 years ago, her team published the first whole-genome sequence comparison of tumor and normal tissue in an acute myeloid leukemia patient [5]. At the time, her project proposal about whole-genome analysis was met with "disbelief and derision," she says. Yet advances require pushing the technology and picking aggressive goals. "Doing things because it's early is often the only way to get them figured out."

Today's cancer patients often react to targeted therapies with dramatic improvements and then, almost inevitably, relapse into therapy-resistant disease, she says. Scientists do not yet understand what fundamental changes in the genome explain such events, but the momentum in cancer genomics and analysis can address such conundrums, which stands to help cancer patients, she says. "That's really at the end of the day why we are doing all of this."

1. Chin, L., Hahn, W.C., Getz, G. & Meyerson, M. Genes Dev. 25, 534-555 (2011).
2. The International Cancer Genome Consortium et al. Nature 464, 993-998 (2010).
3. Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Science 339, 321-324 (2013).
4. The Cancer Genome Atlas Research Network. Nature 489, 519-525 (2012).
5. Ley, T.J. et al. Nature 456, 66-72 (2008).

Vivien Marx is technology editor for Nature and Nature Methods (v.marx@us.nature.com).