
TECHNOLOGY FEATURE

DEEP LEARNING FOR BIOLOGY


A popular artificial-intelligence method provides a powerful tool for surveying and classifying
biological data. But for the uninitiated, the technology poses significant difficulties.
[Image: The brain’s neural network has long inspired artificial-intelligence researchers. Credit: Alfred Pasieka/SPL/Getty]

BY SARAH WEBB

Four years ago, scientists from Google showed up on neuroscientist Steve Finkbeiner’s doorstep. The researchers were based at Google Accelerated Science, a research division in Mountain View, California, that aims to use Google technologies to speed scientific discovery. They were interested in applying ‘deep-learning’ approaches to the mountains of imaging data generated by Finkbeiner’s team at the Gladstone Institute of Neurological Disease in San Francisco, also in California.

Deep-learning algorithms take raw features from an extremely large, annotated data set, such as a collection of images or genomes, and use them to create a predictive tool based on patterns buried inside. Once trained, the algorithms can apply that training to analyse other data, sometimes from wildly different sources. The technique can be used to “tackle really hard, tough, complicated problems, and be able to see structure in data — amounts of data that are just too big and too complex for the human brain to comprehend”, Finkbeiner says.

He and his team produce reams of data using a high-throughput imaging strategy known as robotic microscopy, which they had developed for studying brain cells. But the team couldn’t analyse the data as quickly as it acquired them, so Finkbeiner welcomed the opportunity to collaborate.

“I can’t honestly say at the time that I had a clear grasp of what questions might be addressed with deep learning, but I knew that we were generating data at about twice to three times the rate we could analyse it,” he says.

Today, those efforts are beginning to pay off. Finkbeiner’s team, with scientists at Google, trained a deep-learning algorithm with two sets of cells, one artificially labelled to highlight features that scientists can’t normally see, the other unlabelled. When they later exposed the algorithm to images of unlabelled cells that it had never seen before, Finkbeiner says, “it was astonishingly good at predicting what the labels should be for those images”. A publication detailing that work is now in the press.

Finkbeiner’s success highlights how deep learning, one of the most promising branches of artificial intelligence (AI), is making inroads in biology. The algorithms are already infiltrating modern life in smartphones, smart speakers and self-driving cars. In biology, deep-learning algorithms dive into data in ways that humans can’t, detecting features that might otherwise be impossible to catch. Researchers are using the algorithms to classify cellular images, make genomic connections, advance drug discovery and even find links across different data types, from genomics and imaging to electronic medical records.

More than 440 articles on the bioRxiv preprint server discuss deep learning; PubMed lists more than 700 references in 2017. And the tools are on the cusp of becoming widely available to biologists and clinical researchers. But researchers face challenges in understanding just what these algorithms are doing, and ensuring that they don’t lead users astray.


TRAINING SMART ALGORITHMS
Deep-learning algorithms (see ‘Deep thoughts’) rely on neural networks, a computational model first proposed in the 1940s, in which layers of neuron-like nodes mimic how human brains analyse information. Until about five years ago, machine-learning algorithms based on neural networks relied on researchers to process the raw information into a more meaningful form before feeding it into the computational models, says Casey Greene, a computational biologist at the University of Pennsylvania in Philadelphia. But the explosion in the size of data sets — from sources such as smartphone snapshots or large-scale genomic sequencing — and algorithmic innovations have now made it possible for humans to take a step back. This advance in machine learning — the ‘deep’ part — forces the computers, not their human programmers, to find the meaningful relationships embedded in pixels and bases. And as the layers in the neural network filter and sort information, they also communicate with each other, allowing each layer to refine the output from the previous one.

Eventually, this process allows a trained algorithm to analyse a new image and correctly identify it as, for example, Charles Darwin or a diseased cell. But as researchers distance themselves from the algorithms, they can no longer control the classification process or even explain precisely what the software is doing. Although these deep-learning networks can be stunningly accurate at making predictions, Finkbeiner says, “it’s still challenging sometimes to figure out what it is the network sees that enables it to make such a good prediction”.

Still, many subdisciplines of biology, including imaging, are reaping the rewards of those predictions. A decade ago, software for automated biological-image analysis focused on measuring single parameters in a set of images. For example, in 2005, Anne Carpenter, a computational biologist at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, released an open-source software package called CellProfiler to help biologists to quantitatively measure individual features: the number of fluorescent cells in a microscopy field, for example, or the length of a zebrafish. But deep learning is allowing her team to go further. “We’ve been shifting towards measuring things that biologists don’t realize they want to measure out of images,” she says. Recording and combining visual features such as DNA staining, organelle texture and the quality of empty spaces in a cell can produce thousands of ‘features’, any one of which can reveal fresh insights. The current version of CellProfiler includes some deep-learning elements, and her team expects to add more-sophisticated deep-learning tools in the next year.
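In outline, that kind of single-feature measurement is simple to express in code. The sketch below uses the open-source scikit-image library rather than CellProfiler itself, and the Otsu threshold, the 50-pixel size cut-off and the feature list are illustrative choices, not Carpenter’s pipeline:

    # A minimal CellProfiler-style measurement pass (illustrative, not CellProfiler code).
    import numpy as np
    from skimage import filters, measure, morphology

    def per_cell_features(dna_image: np.ndarray) -> dict:
        """Segment nuclei in a DNA-stained image and tabulate simple per-cell features."""
        mask = dna_image > filters.threshold_otsu(dna_image)      # global intensity threshold
        mask = morphology.remove_small_objects(mask, min_size=50)
        labels = measure.label(mask)                               # one integer label per nucleus
        return measure.regionprops_table(
            labels,
            intensity_image=dna_image,
            properties=("area", "eccentricity", "solidity", "mean_intensity"),
        )

Each column of the resulting table is a hand-designed feature of the kind that deep networks now learn for themselves.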
“Most people have a hard time wrapping their heads around this,” Carpenter says, “but there’s just as much information, in fact maybe more, in a single image of cells as there is in a transcriptomic analysis of a cell population.”

That type of processing allows Carpenter’s team to take a less supervised approach to translating cell images into disease-associated phenotypes — and to capitalize on it. Carpenter is a scientific adviser to Recursion Pharmaceuticals in Salt Lake City, Utah, which is using its deep-learning tools to target rare, single-gene disorders for drug development.

MINING GENOMIC DATA
When it comes to deep learning, not just any data will do. The method often requires massive, well-annotated data sets. Imaging data provide a natural fit, but so, too, do genomic data.

One biotech firm that is using such data is Verily Life Sciences (formerly Google Life Sciences) in San Francisco. Researchers at Verily, a subsidiary of Google’s parent company, Alphabet, have developed a deep-learning tool that identifies a common type of genetic variation, called single-nucleotide polymorphisms, more accurately than conventional tools. Called DeepVariant, the software translates genomic information into image-like representations, which are then analysed as images (see ‘Tools for deep diving’). Mark DePristo, who heads deep-learning-based genomic research at Verily, expects DeepVariant to be particularly useful for researchers studying organisms outside the mainstream — those with low-quality reference genomes and high error rates in identifying genetic variants. Working with DeepVariant in plants, his colleague Ryan Poplin has achieved error rates closer to 2% than the more-typical 20% of other approaches.
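The ‘image-like representation’ is easiest to grasp with a toy example. The function below is a simplified sketch of the general idea, not DeepVariant’s actual encoding: it stacks the reads covering a candidate site into a reads-by-position-by-channel array that a convolutional network can treat like a picture.

    # Toy pileup encoding: the general idea only, not DeepVariant's real format.
    import numpy as np

    BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pileup_tensor(reads, ref, max_reads=100):
        """reads: list of (sequence, base_qualities) aligned to the reference window `ref`."""
        tensor = np.zeros((max_reads, len(ref), 6), dtype=np.float32)
        for i, (seq, quals) in enumerate(reads[:max_reads]):
            for j, (base, qual) in enumerate(zip(seq, quals)):
                if base in BASES:                        # ignore ambiguous bases such as 'N'
                    tensor[i, j, BASES[base]] = 1.0      # one-hot base identity
                tensor[i, j, 4] = qual / 60.0            # scaled base quality
                tensor[i, j, 5] = float(base == ref[j])  # agreement with the reference
        return tensor

    # Two made-up reads over a five-base window:
    t = pileup_tensor([("ACGTA", [30, 32, 28, 31, 29]),
                       ("ACGTT", [25, 30, 33, 30, 20])], ref="ACGTA")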
Brendan Frey, chief executive of the Canadian company Deep Genomics in Toronto, also focuses on genomic data, but with the goal of predicting and treating disease. Frey’s academic team at the University of Toronto developed algorithms trained on genomic and transcriptomic data from healthy cells. Those algorithms built predictive models of RNA-processing events such as splicing, transcription and polyadenylation within those data. When applied to clinical data, the algorithms were able to identify mutations and flag them as pathogenic, Frey says, even though they’d never seen clinical data. At Deep Genomics, Frey’s team is using the same tools to identify and target the disease mechanisms that the software uncovered, to develop therapies derived from short nucleic-acid sequences.
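A generic version of such a sequence model is straightforward to write down. The sketch below is not Deep Genomics’ code: it one-hot-encodes a 200-base window of DNA and trains a small one-dimensional convolutional network to score the window for a splicing signal, with the window length and layer sizes chosen purely for illustration.

    # Illustrative sequence classifier; window size and architecture are assumptions.
    import numpy as np
    import tensorflow as tf

    def one_hot(seq: str) -> np.ndarray:
        index = {"A": 0, "C": 1, "G": 2, "T": 3}
        encoded = np.zeros((len(seq), 4), dtype=np.float32)
        for i, base in enumerate(seq):
            encoded[i, index[base]] = 1.0
        return encoded

    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(64, kernel_size=9, activation="relu",
                               input_shape=(200, 4)),    # scan for short sequence motifs
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of a splice signal
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # windows = np.stack([one_hot(s) for s in annotated_sequences])  # shape (n, 200, 4)
    # model.fit(windows, labels)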
Another discipline with massive data sets that are amenable to deep learning is drug discovery. Here, deep-learning algorithms are helping to solve categorization challenges, sifting through such molecular features as shape and hydrogen bonding to identify criteria on which to rank potential drugs. For instance, Atomwise, a biotech company based in San Francisco, has developed algorithms that convert molecules into grids of 3D pixels, called voxels. This representation allows the company to account for the 3D structure of proteins and small molecules with atomic precision, modelling features such as the geometries of carbon atoms. Those features are then translated into mathematical vectors that the algorithm can use to predict which small molecules are likely to interact with a given protein, says Abraham Heifets, the company’s chief executive. “A lot of the work we do is for [protein] targets with no known binders,” he says.

Atomwise is using this strategy to power its new AI-driven molecular-screening programme, which scans a library of 10 million compounds to provide academic researchers with up to 72 potential small-molecule binders for their protein of interest.
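Voxelization itself is simple to sketch. The function below bins atom coordinates into a cubic grid with one channel per element type, the sort of input a 3D convolutional network can learn local geometry from; the box size, resolution and element list are assumptions for illustration, not Atomwise’s settings.

    # Illustrative voxel grid; box size, resolution and channels are assumed values.
    import numpy as np

    ELEMENTS = {"C": 0, "N": 1, "O": 2, "S": 3}

    def voxelize(atoms, box_size=24.0, resolution=1.0):
        """atoms: list of (element, (x, y, z)) in angstroms, centred on the site of interest."""
        n = int(box_size / resolution)
        grid = np.zeros((n, n, n, len(ELEMENTS)), dtype=np.float32)
        for element, (x, y, z) in atoms:
            if element not in ELEMENTS:
                continue                                 # skip elements outside this toy list
            i, j, k = (int((c + box_size / 2) / resolution) for c in (x, y, z))
            if 0 <= i < n and 0 <= j < n and 0 <= k < n:
                grid[i, j, k, ELEMENTS[element]] += 1.0  # count atoms falling in each voxel
        return grid

    # grid = voxelize([("C", (0.0, 1.2, -0.8)), ("O", (2.1, 0.4, 0.3))])  # toy input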
Tools for deep diving
Deep-learning tools are evolving rapidly, and labs will need dedicated computational expertise, collaborations or both to take advantage of them.

First, take a colleague with deep-learning expertise out to lunch and ask whether the strategy might be useful, advises Steve Finkbeiner, a neuroscientist at the Gladstone Institutes in San Francisco, California. With some data sets, such as imaging data, an off-the-shelf program might work; for more complicated projects, consider a collaborator, he says. Workshops and meetings can provide training opportunities.

Access to cloud-computing resources means that researchers might not need an on-site computer cluster to use deep learning — they can run the computation elsewhere. Google’s TensorFlow, an open-source platform for building deep-learning algorithms, is available on the software-sharing site GitHub, as is an open-source version of DeepVariant, a tool for accurately identifying genetic variation.

Google Accelerated Science, a Google research division based in Mountain View, California, collaborates with a range of scientists, including biologists, says Michelle Dimon, one of its research scientists. Projects require a compelling biological question, large amounts of high-quality, labelled data, and a challenge that will allow the company’s machine-learning experts to make unique computational contributions to the field, Dimon says.

Those wishing to get up to speed on deep learning should check out the ‘deep review’, a comprehensive, crowdsourced review led by computational biologist Casey Greene of the University of Pennsylvania in Philadelphia (T. Ching et al. Preprint at bioRxiv http://doi.org/gbpvh5; 2018). S.W.


DEEP THOUGHTS
[Figure: Deep-learning algorithms take many forms. Steve Finkbeiner’s lab used a convolutional neural network (CNN) such as this one to identify, with high accuracy, dead neurons in a population of live and dead cells. INPUT: the network is trained using several hundred thousand annotated images of live and dead cells. TRAINING AI: over multiple iterations, the network discovers patterns in the data that can distinguish live from dead cells; convolutional layers identify structural features of the images, which are integrated in fully connected layers, and combining layers of different structure lets the network adapt to recognize images of varying type and clarity. APPLICATION: challenged with unlabelled images, the trained network assigns each cell as alive or dead with high accuracy. Source: Jeremy Linsley/Drew Linsley/Steve Finkbeiner/Thomas Serre]
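In code, a network of the general shape shown in the figure takes only a few lines with TensorFlow’s Keras API. The image size, layer widths and training calls below are illustrative placeholders, not the specifics of Finkbeiner’s network.

    # A small convolutional classifier of the kind sketched in 'Deep thoughts'.
    import tensorflow as tf

    model = tf.keras.Sequential([
        # Convolutional layers pick out structural features of the cell images...
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
        tf.keras.layers.MaxPooling2D((2, 2)),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        tf.keras.layers.MaxPooling2D((2, 2)),
        # ...which fully connected layers integrate into a live/dead call.
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # probability that the cell is dead
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(annotated_images, labels, epochs=10)     # hundreds of thousands of examples
    # predictions = model.predict(new_unlabelled_images)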
Deep-learning tools could also help researchers to stratify disease types, understand disease subpopulations, find new treatments and match them with the appropriate patients for clinical testing and treatment. Finkbeiner, for instance, is part of a consortium called Answer ALS, an effort to combine a range of data — genomics, transcriptomics, epigenomics, proteomics, imaging and even pluripotent stem-cell biology — from 1,000 people with the neurodegenerative disease amyotrophic lateral sclerosis (also called motor neuron disease). “For the first time, we’ll have a data set where we can apply deep learning and look at whether deep learning can uncover a relationship between the things we can measure in a dish around a cell, and what’s happening to that patient,” he says.

CHALLENGES AND CAUTIONS
For all its promise, deep learning poses significant challenges, researchers warn. As with any computational-biology technique, the results that arise from algorithms are only as good as the data that go in. Overfitting a model to its training data is also a concern. In addition, for deep learning, the criteria for data quantity and quality are often more rigorous than some experimental biologists might expect.

Deep-learning algorithms have required extremely large data sets that are well annotated so that the algorithms can learn to distinguish features and categorize patterns. Larger, clearly labelled data sets — with millions of data points representing different experimental and physiological conditions — give researchers the most flexibility for training an algorithm. Finkbeiner notes that algorithm training in his work improves significantly after about 15,000 examples. Those high-quality ‘ground truth’ data can be exceptionally hard to come by, says Carpenter.

To circumvent this challenge, researchers have been working on ways to train more with less data. Advances in the underlying algorithms are allowing the neural networks to use data much more efficiently, Carpenter says, enabling training on just a handful of images for some applications. Scientists can also exploit transfer learning, the ability of neural networks to apply classification prowess acquired from one data type to another type. For example, Finkbeiner’s team has developed an algorithm that it initially taught to predict cell death on the basis of morphology changes. Although the researchers trained it to study images of rodent cells, it achieved 90% accuracy the first time it was exposed to images of human cells, improving to 99% as it gained experience.

For some of its biological image-recognition work, Google Accelerated Science uses algorithms that were initially trained on hundreds of millions of consumer images mined from the Internet. Researchers then refine that training, using as few as several hundred biological images similar to the ones they wish to study.
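What that transfer looks like in practice can be sketched briefly: start from a network pretrained on everyday photographs (ImageNet weights stand in here for any large consumer-image corpus), freeze those general-purpose layers and retrain only a small classification head on a few hundred labelled biological images. The choices below are illustrative, not Google’s own setup.

    # Transfer learning with a frozen pretrained backbone (illustrative choices throughout).
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                             input_shape=(224, 224, 3))
    base.trainable = False                               # keep the pretrained features fixed

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(2, activation="softmax"),  # e.g. two phenotype classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(small_labelled_set, epochs=5)            # a few hundred biological images,
    #                                                    # resized and replicated to RGB to fit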
Another challenge with deep learning is that the computers are both unintelligent and lazy, notes Michelle Dimon, a research scientist at Google Accelerated Science. They lack the judgement to distinguish biologically relevant differences from normal variation. “The computer is shockingly good at finding batch variation,” she notes. As a result, obtaining data that will be fed into a deep-learning algorithm often means applying a high bar for experimental design and controls. Google Accelerated Science requires researchers to place controls randomly on cell-culture plates to account for subtle environmental factors such as incubator temperature, and to use twice as many controls as a biologist might otherwise run. “We make it hard to pipette,” Dimon quips.

This hazard underscores the importance of biologists and computer scientists working together to design experiments that incorporate deep learning, Dimon says. And that careful design has become even more important with one of Google’s latest projects: Contour, a strategy for clustering cellular-imaging data in ways that highlight trends (such as dose responses) instead of putting them into specific categories (such as alive or dead).

Although deep-learning algorithms can evaluate data without human preconceptions and filters, Greene cautions, that doesn’t mean they are unbiased. Training data can be skewed — as happens, for example, when genomic data only from northern Europeans are used. Deep-learning algorithms trained on such data will acquire embedded biases and reflect them in their predictions, which could in turn lead to unequal patient care. If humans help to validate these predictions, that provides a potential check on the problem. But such concerns are troubling if a computer alone is left to make key decisions. “Thinking of these methods as a way to augment humans is better than thinking of these methods as replacing humans,” Greene says.

And then there’s the challenge of understanding exactly how these algorithms are building the characteristics, or features, that they use to classify data in the first place. Computer scientists are attacking this question by changing or shuffling individual features in a model and then examining how those tweaks change the accuracy of predictions, says Polina Mamoshina, a research scientist at Insilico Medicine in Baltimore, Maryland, which uses deep learning to improve drug discovery. But different neural networks working on the same problem won’t approach it in the same way, Greene cautions. Researchers are increasingly focusing on algorithms that make both accurate and explainable predictions, he says, but for now the systems remain black boxes.
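One widely used version of that probing is ‘permutation importance’: shuffle a single input feature, leave everything else untouched and measure how far the model’s accuracy falls. The sketch below assumes a generic fitted classifier whose predict method returns class labels.

    # Feature shuffling as a crude window into what a model relies on.
    import numpy as np

    def permutation_importance(model, X, y, n_repeats=10, seed=0):
        rng = np.random.default_rng(seed)
        baseline = np.mean(model.predict(X) == y)        # accuracy on intact data
        drops = np.zeros(X.shape[1])
        for j in range(X.shape[1]):                      # perturb each feature in turn
            scores = []
            for _ in range(n_repeats):
                X_shuffled = X.copy()
                rng.shuffle(X_shuffled[:, j])            # break this feature's link to the labels
                scores.append(np.mean(model.predict(X_shuffled) == y))
            drops[j] = baseline - np.mean(scores)        # a large drop means the feature mattered
        return drops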
“I don’t think highly explainable deep-learning models are going to come on the scene in 2018, though I’d love to be wrong,” Greene says. ■

Sarah Webb is a freelance writer in Chattanooga, Tennessee.

