
Classification of crystallization outcomes using deep convolutional neural networks

Andrew E. Bruno
Center for Computational Research, University at Buffalo, Buffalo, New York, United States of America.

Patrick Charbonneau
Department of Chemistry, Duke University, Durham, North Carolina, USA and
Department of Physics, Duke University, Durham, North Carolina, USA

Janet Newman
Collaborative Crystallisation Centre CSIRO, Parkville, Victoria, Australia.

Edward H. Snell
Hauptman-Woodward Medical Research Institute and SUNY Buffalo,
Department of Materials, Design, and Innovation,
Buffalo, New York 14203, United States of America.

David R. So and Vincent Vanhoucke
Google Brain, Google Inc., Mountain View, California, United States of America.

Christopher J. Watkins
IM&T Scientific Computing, CSIRO, Clayton South, Victoria, Australia

Shawn Williams
Platform Technology and Sciences, GlaxoSmithKline Inc.,
Collegeville, Pennsylvania, United States of America.

Julie Wilson
Department of Mathematics, University of York, York, United Kingdom.
(Dated: May 29, 2018)
The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly
half a million annotated images of macromolecular crystallization experiments from various sources
and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different
parts of this data set. We find that more than 94% of the test images can be correctly labeled,
irrespective of their experimental origin. Because crystal recognition is key to high-density screening
and the systematic analysis of crystallization experiments, this approach opens the door to both
industrial and fundamental research applications.
Author summary: Protein crystal growth experiments are routinely imaged, but the mass of
accumulated data is difficult to manage and analyze. Using state-of-the-art machine learning
algorithms on a large and diverse set of reference images, we manage to recapitulate the labels of
a remarkably large fraction of the set. This automation should enable a number of industrial and
fundamental applications.

I. INTRODUCTION

X-ray crystallography provides the atomic structure of molecules and molecular complexes. These structures in turn provide insight into the molecular driving forces for small molecule binding, protein-protein interactions, supramolecular assembly and other biomolecular processes. The technique is thus foundational to molecular modeling and design. Beyond the obvious importance of structure information for understanding and altering the role of biomolecules, it also has important industrial applications. The pharmaceutical industry, for instance, uses structures to guide chemistry as part of a “predict first” strategy [1], employing expert systems to reduce optimization cycle times and more effectively bring medicine to patients. Yet, despite decades of methodological advances, crystallizing molecular targets of interest remains the bottleneck of the entire crystallography program in structural biology.

Even when crystallization is facile, it is microscopically rare; for macromolecules it is also uncommon [2–5]. Experimental trials typically involve: (i) mixing a purified sample with chemical cocktails designed to promote molecular association, (ii) generating a supersaturated solution of the desired molecule via evaporation or equilibration, and (iii) visually monitoring the outcomes, before (iv) optimizing those conditions and analyzing the resultant crystal with an X-ray beam. One hopes for the formation of a crystal instead of non-specific (amorphous) precipitates or of nothing at all. In order to help run these trials,
commercial crystallization screens have been developed; each screen generally contains 96 formulations designed to promote crystal growth. Whether these screens are equally effective or not [5, 6] remains debated, but their overall yield is in any case paltry. Typically fewer than 5% of crystallization attempts produce useful results (with a success rate as low as 0.2% in some contexts [7]).

The practical solution to this hurdle has been to increase the convenience and number of crystallization trials. To offset the expense of reagents and scientist time, labs routinely employ industrial robotic liquid handlers, nanoliter-size drops, and record trial outcomes using automated imaging systems [5, 8–11]. Hoping to compensate for the rarity of crystallization, commercially available systems readily probe a large area of chemical space with minimal sample volume, with a throughput of ∼1000 individual experiments per hour.

While liquid handling is readily automated, crystal recognition is not. Imaging systems may have made viewing results more comfortable than bending over a microscope, but crystallographers still manually inspect images and/or drops, looking for crystals or, more commonly, conditions that are likely to produce good crystals when optimized. This human cost makes crystal recognition a key experimental bottleneck within the larger challenge of crystallizing biomolecules [7]. A typical experiment for a given sample includes four 96-well screens at two temperatures, i.e., 768 conditions (and can have up to twice that [12]). Assuming that it takes 2 seconds to manually scan a droplet (and noting that the scans have to be repeated, as crystallization is time dependent), simply looking at a single set of 96 trials over the lifetime of an experiment can take the better part of an hour [13]. For the sake of illustration, the U.S. Structural Science group at GlaxoSmithKline performs ∼1200 96-well experiments per year. If the targeted observation schedule were rigorously followed, the group would spend a quarter of the year staring at drops, of which the vast majority contains no crystal. Recording outcomes and analyzing the results of the 96 trials would further increase the time burden. Current operations are already straining existing resources, and the approach simply does not scale for proposed higher-density screening [10].

Crystal growth is also sufficiently uncommon that the tolerance for false negatives is almost nil. Yet most crystallographers are misguided in thinking that they themselves would never miss identifying a crystal given an image containing a crystal, or indeed miss a crystal in a droplet viewed directly under a microscope [14]. In fact, not only do crystallographers miss crystals due to lack of attention through boredom, they often disagree on the class an image should be assigned to. An overall agreement rate of ∼70% was found when the classes assigned to 1200 images by 16 crystallographers were compared [14]. (When considering only crystalline outcomes, agreement rose to ∼93%.) Consistency in visual scoring was also considered by Snell et al. when compiling a ∼150,000 image dataset [15]. They found that viewers give different scores to the same image on different occasions during the study, with the average agreement rate for scores on a control set at the beginning and middle of the study being 77%, rising to 84% for the agreement in scores between the middle and end of the study. Crystallographers also tend to be optimistically biased when scoring their own experiments [16]. A better use of expert time and attention would be to focus on scientific inquiry.

An algorithm that could analyze images of drops, distinguish crystals from trivial outcomes, and reduce the effort spent cataloging failure would present clear value both to the discipline and to industry. Ideally, such an algorithm would act like an experienced crystallographer in:

• recognizing macromolecular crystals appropriate for diffraction experiments;

• recognizing outcomes that, while requiring optimization, would lead to crystals for diffraction experiments;

• recognizing non-macromolecular crystals;

• ignoring technical failures;

• identifying non-crystalline outcomes that require follow up;

• being agnostic as to the imaging platform used;

• being indefatigable and unbiased;

• occurring in a time frame that does not impede the process;

• learning from experience.

Such an algorithm would further reduce the variance in the assessments, irrespective of its accuracy. A high-variance, manual process is not conducive to automating the quality control of the system end-to-end, including the imaging equipment. Enhanced reproducibility enables traceability of the outcomes, and paves the way for putting in place measurable, continuous improvement processes across the entire imaging chain.

Automated crystallization image classification that meets the above criteria has been attempted before. The research laboratories that first automated crystallization inspection quickly realized that image analysis would be a huge problem, and concomitantly developed algorithms to interpret the images [17–20]. None of these programs was ever widely adopted. This may have been due in part to their dependence on a particular imaging system, and to the relatively limited use of imaging systems at the time. Many of the early image analysis programs further required very time-consuming collation of features and significant preprocessing, e.g., drop segmentation to locate the experimental droplet within the image. To the best of our knowledge, there was also no widespread effort to make a widely available image analysis package in the same way that the diffraction-oriented programs have been organized, e.g., the CCP4 package [21].

Can a better algorithm be constructed and trained? In order to help answer this question, the Machine Recognition of Crystallization Outcomes (MARCO) initiative was set up [22]. MARCO assembled a set of roughly half a million classified images of crystallization trials through an international collaboration with five separate institutions. Here, we present a machine-learning based approach to categorize these images. Remarkably, the algorithm we employ manages to obtain an accuracy exceeding 94%, which is even above what was once thought possible for human categorization. This suggests that a deployment of this technology in a variety of laboratory settings is now conceivable. The rest of this paper is as follows. Section II describes the dataset and the scoring scheme, Sec. III describes the machine-learning model and training procedure, Secs. IV and V describe and discuss the results, respectively, and Sec. VI briefly concludes.

II. MATERIAL AND METHODS

Image Data

The MARCO data set used in this study contains 493,214 scored images from five institutions (see Table I [22]). The images were collected from imagers made by two different manufacturers (Rigaku Automation and Formulatrix), which have different optical systems, as well as by the in-house imaging equipment built at the Hauptman-Woodward Medical Research Institute (HWMRI) High-Throughput Crystallization Center (HTCC). Different versions of the setups were also used: some Rigaku images are collected with a true color camera, some are collected as greyscale images. The zoom extent varies, with some imagers set up to collect a field-of-view (FOV) of only the experimental droplet, and some set for the FOV to encompass a larger area of the experimental setup. The Rigaku and Formulatrix automation imaged vapor diffusion based experiments, while the HTCC system imaged microbatch-under-oil experiments. A random selection of 50,284 test images was held out for validation. Images in the test set were not represented in the training set. The precise data split is available from the MARCO website [22].

Labeling

Images were scored by one or more crystallographers. As the dataset is composed of archival data, no common scoring system was imposed, nor were exemplar images distributed to the various contributors. Instead, existing scores were collapsed into four comprehensive and fairly robust categories: clear, precipitate, crystal, and other. This last category was originally used as a catchall for images not obviously falling into the three major classes, and came to assume a functional significance as the classification process was further investigated. Examination of the least classifiable five percent of images indeed revealed many instances of process failure, such as dispensing errors or illumination problems. These uninterpretable images were then labelled as “other” during the rescoring, which added an element of quality control to the overall process [25].

Relabeling

After a first baseline system was trained (see Sec. III), the 5% of the images that were most in disagreement with the classifier (independently of whether the image was in the training or the test set) were relabeled by one expert, in order to obtain a systematic eye on the most problematic images.

Because no rules were established and no exemplars were circulated prior to the initial scoring, individual viewpoints varied on classifying certain outcomes. For example, the bottom 5% contained many instances of phase separation, where the protein forms oil droplets or an oily film that coats the bottom of the crystallization well. Images were found to be inconsistently scored as “clear”, “precipitate”, or “other” depending on the amount and visibility of the oil film. This example highlights the difficulty of scoring experimental outcomes beyond crystal identification. A more serious source of ambiguity arises from process failure. Many of the problematic images did not capture experimental results at all: they were out of focus, dark, overexposed, dropless, etc. Whatever labeling convention was initially followed, for the relabeling the “other” category was deemed to also diagnose problems with the imaging process.

A total of 42.6% of annotations for the images that were revisited disagreed with the original label, suggesting somewhat high (1 to 2%) label noise in this difficult fraction of the dataset. For a fraction of this data, multiple raters were asked to label the images independently and had an inter-rater disagreement rate of approximately 22%. The inherent difficulty of assigning a label to a small fraction of the images is therefore consistent with the results of Ref. [14]. Table II shows the final image counts after relabeling.

TABLE I. Breakdown of data sources and imaging technology per institution contributing to MARCO.

Institution            Technical Setup                              # of Images
Bristol-Myers Squibb   Formulatrix Rock Imager (FRI)                8,719
CSIRO                  Sitting drop, FRI, Rigaku Minstrel [23, 24]  15,933
HWMRI                  Under oil, Home system [15]                  79,632
GlaxoSmithKline        Sitting drop, FRI                            83,126
Merck                  Sitting drop, FRI                            305,804

TABLE II. Data distribution. Final number of images in the dataset for each category after collapsing the labels and relabeling.

             Number of images
Label        Training   Validation
Crystals     56,672     6,632
Precipitate  212,541    23,892
Clear        148,861    16,760
Other        24,856     3,000

III. MACHINE LEARNING MODEL

The goal of the classifier here is to take an image as an input, and output the probability of it belonging to each of four classes (crystals, precipitate, clear, other) (see Fig. 1). The classifier used is a deep Convolutional Neural Network (CNN). CNNs, originally proposed in Ref. [26], and their modern ‘deep’ variants (see, e.g., Refs. [27, 28] for recent reviews), have proven to consistently provide reliable results on a broad variety of visual recognition tasks, and are particularly amenable to addressing data-rich problems. They have been, for instance, state of the art on the very competitive ILSVRC image recognition challenge [29] since 2012.

This approach to visual perception has been making unprecedented inroads in areas such as medical imaging [30] and computational biology [31], and has also been shown to be human-competitive on a variety of specialized visual identification tasks [32, 33]. The chosen classifier is thus well suited for the current analysis.

Fig 1. Conceptual Representation of a Convolutional Neural Network. A CNN is a stack of nonlinear filters (three filter levels are depicted here) that progressively reduce the spatial extent of the image, while increasing the number of filter outputs that describe the image at every location. On top of this stack sits a multinomial logistic regression classifier, which maps the representation to one probability value per output class (Crystals vs. Precipitate vs. Clear vs. Others). The entire network is jointly optimized through backpropagation [34], in general by means of a variant of stochastic gradient descent [35].

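For concreteness, a minimal TensorFlow/Keras sketch of the shape described in Fig. 1 is given below; the layer count and widths are illustrative only and do not correspond to the MARCO production model.

```python
import tensorflow as tf

# Minimal illustration of the classifier shape in Fig. 1: a stack of
# nonlinear, strided filters that shrink the spatial extent while
# growing the channel count, topped by a 4-way softmax head.
# Layer sizes here are arbitrary, not those of the MARCO model.
def tiny_crystal_cnn(num_classes=4):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, strides=2, activation='relu',
                               input_shape=(299, 299, 3)),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation='relu'),
        tf.keras.layers.Conv2D(128, 3, strides=2, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        # Multinomial logistic regression over the learned representation.
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])

model = tiny_crystal_cnn()
# Jointly optimized by backpropagation with a stochastic-gradient variant.
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
```
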

Model Architecture

The model is a variation on the widely-used Inception-v3 architecture [36], which was state of the art on the ILSVRC challenge around 2015. Several more recent alternatives were tried, including Inception-ResNet-v2 [37] and automatically generated variants of NASNet [38], but none yielded any significant improvements. An extensive hyperparameter search was also conducted using Vizier [39], also without providing significant improvement over the baseline.

The Inception-v3 architecture is a complex deep CNN architecture described in detail in Ref. [36] as well as the reference implementation [40]. We only describe here the modifications made to tailor the model to the task at hand.

Standard Inception-v3 operates on a 299x299 square image. Because the current problem involves very detailed, thin structures, it is plausible to assume that a larger input image may yield better outcomes. We use instead 599x599 images, and compress them down to 299x299 using an additional convolutional layer at the very bottom of the network, before the layer labeled Conv2d_1a_3x3 in the reference implementation. The additional convolutional layer has a depth (number of filters) of 16, a 3×3 receptive field (it operates on a 3×3 square patch convolved over the image) and a stride of 2 (it skips over every other location in the image to reduce the dimensionality of the feature map). This modification improved classification absolute accuracy by approximately 0.3%.

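As an illustration of this modification only (the authoritative version is the reference implementation [40]), the extra stem layer could be written in TensorFlow/Keras as:

```python
import tensorflow as tf

# Sketch of the extra stem layer described above: a 16-filter, 3x3,
# stride-2 convolution compressing a 599x599 input to 299x299 before
# the standard Inception-v3 stack (its Conv2d_1a_3x3 layer).
# This mirrors the text, not the exact reference code.
extra_stem = tf.keras.layers.Conv2D(
    filters=16,        # depth (number of filters)
    kernel_size=3,     # 3x3 receptive field
    strides=2,         # skip every other location
    padding='valid',   # (599 - 3) // 2 + 1 = 299
    activation='relu')

images = tf.keras.Input(shape=(599, 599, 3))
compressed = extra_stem(images)   # shape: (None, 299, 299, 16)
```
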

A few other convolutional layers were shrunk compared to the standard Inception-v3 by capping their depth as described in Table III, using the conventions from the reference implementation.

TABLE III. Limits applied to layer depths to reduce the model complexity. In each named layer of the deep network – here named after the conventions of the reference implementation – every convolutional subblock had its number of filters reduced to contain no more than this many outputs.

Layer          Max depth
Conv2d_4a_3x3  144
Mixed_6b       128
Mixed_6c       144
Mixed_6d       144
Mixed_6e       96
Mixed_7a       96
Mixed_7b       192
Mixed_7c       192

While these parameters are exhaustively reported here to ensure reproducibility of the results, their fine tuning is not essential to maximizing the success rate, and was mainly motivated by improving the speed of training. In the end, it was possible to train the model at a larger batch size (64 instead of 32) and still fit within the memory of an NVidia K80 GPU (see more details in the training section below). Given the large number of examples available, all dropout [41] regularizers were removed from the model definition at no cost in performance.

Data Preprocessing and Augmentation

The source data is partitioned randomly into 415,990 training images and 47,062 test images.

The training data is generated dynamically by taking random 599x599 patches of the input images, and subjecting them to a wide array of photometric distortions, identical to the reference implementation:

• randomized brightness (±32 out of 255),

• randomized saturation (from 50% to 150%),

• randomized hue (±0.2 out of 0.5),

• randomized contrast (from 50% to 150%).

In addition, images are randomly flipped left to right with 50% probability, and, in contrast to the usual practice for natural scenes, which don’t have a vertical symmetry, they are also flipped upside down with 50% probability. Because images in this dataset have full rotational invariance, one could also consider rotations beyond the mere 90°, 180°, and 270° that these flips provide, but we didn’t attempt it here, as we surmise the incremental benefits would likely be minimal for the additional computational cost. This form of aggressive data augmentation greatly improves the robustness of image classifiers, and partly alleviates the need for large quantities of human labels.

For evaluation, no distortion is applied. The test images are center cropped and resized to 599x599.

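A sketch of the above recipe using standard tf.image operations follows; it matches the listed distortion ranges but is not the reference implementation, which may differ in ordering and numerical details:

```python
import tensorflow as tf

def distort_for_training(image):
    """Random crop, photometric distortions and flips, as listed above.

    `image` is assumed to be a float32 tensor scaled to [0, 1], with both
    sides at least 599 pixels."""
    image = tf.image.random_crop(image, size=[599, 599, 3])
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_hue(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    image = tf.image.random_flip_left_right(image)
    # Drop images have no preferred vertical orientation either:
    image = tf.image.random_flip_up_down(image)
    return tf.clip_by_value(image, 0.0, 1.0)

def preprocess_for_eval(image):
    """No distortion at test time: center crop to square, then resize."""
    side = tf.minimum(tf.shape(image)[0], tf.shape(image)[1])
    image = tf.image.resize_with_crop_or_pad(image, side, side)
    return tf.image.resize(image, [599, 599])
```
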
Training

The model is implemented in TensorFlow [42], and trained using an asynchronous distributed training setup [43] across 50 NVidia K80 GPUs. The optimizer is RMSProp [44], with a batch size of 64, a learning rate of 0.045, a momentum of 0.9, a decay of 0.9 and an epsilon of 0.1. The learning rate is decayed every two epochs by a factor of 0.94. Training completed after 1.7M steps (Fig. 2) in approximately 19 hours, having processed 100M images, which is the equivalent of 260 epochs. The model thus sees every training sample 260 times on average, with a different crop and set of distortions applied each time. The model used at test time is a running average of the training model over a short window to help stabilize the predictions.

Fig 2. Classifier Accuracy. Accuracy on the training and validation sets as a function of the number of steps of training. Training halts when the performance on the evaluation set no longer increases (‘early stopping’). As is typical for this type of stochastic training, performance increases rapidly at first as large training steps are taken, and slows down as the learning rate is annealed and the model fine-tunes its weights.

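In present-day TensorFlow, the quoted hyperparameters translate roughly as follows; the steps-per-epoch value is a back-of-envelope estimate from the dataset and batch sizes, and the averaging window is a placeholder rather than the value used in the study:

```python
import tensorflow as tf

# ~415,990 training images / batch size 64 -> ~6,500 steps per epoch.
STEPS_PER_EPOCH = 415_990 // 64

# Learning rate 0.045, decayed by a factor of 0.94 every two epochs.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=2 * STEPS_PER_EPOCH,
    decay_rate=0.94,
    staircase=True)

# RMSProp with decay (rho) 0.9, momentum 0.9 and epsilon 0.1.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule, rho=0.9, momentum=0.9, epsilon=0.1)

# At test time the paper evaluates a running average of the weights; one
# way to keep such a shadow copy is an exponential moving average:
ema = tf.train.ExponentialMovingAverage(decay=0.999)  # window is a guess
```
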

IV. RESULTS

Classification

The original labeling gave rise to a model with 94.2% accuracy on the test set. Relabeling improved reported classification accuracy by approximately 0.3% absolute, with the caveat that the figures are not precisely comparable since some of the test labels changed in between. The revised model thus achieves 94.5% accuracy on the test set for the four-way classification task. It overfits modestly to the training set, reaching just above 97% at the early-stopping mark of 1.7M steps. Table IV summarizes the confusions between classes. Although the classifier does not perform equally well on images from the various datasets, the standard deviation in performance from one set to another is fairly small, about 5% (see Table V), compared to the overall performance of the classifier.

TABLE IV. Confusion Matrix. Fraction of the test data that is assigned to each class based on the posterior probability assigned by the classifier. For instance, 0.8% of images labeled as Precipitate in the test set were classified as Crystals.

                         Predictions
True Label    Crystals  Precipitate  Clear  Other
Crystals      91.0%     5.8%         1.7%   1.5%
Precipitate   0.8%      96.1%        2.3%   0.7%
Clear         0.2%      1.8%         97.9%  0.2%
Other         4.8%      19.7%        5.9%   69.6%

TABLE V. Standard Deviation of the predictions across data sources. Note in particular the large variability in the consistency of the label ‘Other’ across datasets, which leads to comparatively poor selectivity of that less well-defined class.

                         Predictions
True Label    Crystals  Precipitate  Clear  Other
Crystals      5%        4%           1%     1%
Precipitate   2%        4%           1%     2%
Clear         1%        3%           5%     1%
Other         7%        15%          6%     21%

The classifier outputs a posterior probability for each class. By varying the acceptance threshold for a proposed classification, one can trade precision of the classification against recall. The receiver operating characteristic (ROC) curves can be seen in Fig. 3.

Fig 3. Receiver Operating Characteristic Curves. (A) Percentage of the correctly accepted detection of crystals as a function of the percentage of incorrect detections (AUC: 98.8). 98.7% of the crystal images can be recalled at the cost of less than 19% false positives. Alternatively, 94% of the crystals can be retrieved with less than 1.6% false positives. (B) Percentage of the correctly accepted detection of precipitate as a function of the percentage of incorrect detections (AUC: 98.9). 99.6% of the precipitate images can be recalled at the cost of less than 25% false positives. Alternatively, 94% of the precipitates can be retrieved with less than 3.4% false positives.

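For example, a one-vs-rest ROC for the crystal class can be traced from the four-way posteriors as sketched below (a generic illustration using scikit-learn, not the analysis code of this study):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# probs: (n_images, 4) posteriors, assumed ordered as
# [crystals, precipitate, clear, other]; y_true: integer labels.
def crystal_roc(probs, y_true, crystal_class=0):
    scores = probs[:, crystal_class]                  # acceptance score
    positives = (y_true == crystal_class).astype(int)
    fpr, tpr, thresholds = roc_curve(positives, scores)
    return fpr, tpr, thresholds, auc(fpr, tpr)

# Raising the acceptance threshold trades recall (tpr) against
# false positives (fpr), as in Fig. 3.
```
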

Validation

At CSIRO C3, a workflow [45] has been set up which uses a variation of the analysis tool from DeepCrystal [46] to analyze newly collected crystallisation images and to assign either no score, a ‘crystal’ score, or a ‘clear’ score. A total of 37,851 images, collected in Q1 2018 and assigned a human score by a C3 user, were used as an independent dataset to test the MARCO tool. Within this dataset, 9,746 images had been identified as containing crystals. The current DeepCrystal tool (which assigns only crystal or clear scores) was found to have an overall accuracy rate of 74%, while the MARCO tool has 90%. Although this retrospective analysis doesn’t allow for a direct comparison of the ROC, the precision, recall and accuracy of the two tools all favor the MARCO tool, as shown in Table VI. The precision achieved by MARCO on this dataset is also very similar to that seen for the CSIRO images in the training data.

TABLE VI. Validation at C3. Precision, recall and accuracy from an independent set of images collected after the MARCO tool was developed. The 38K images of sitting drop trials were collected between January 1 and March 30, 2018 on two Formulatrix Rock Imager (FRI) instruments.

DL tool      Precision  Recall  Accuracy
DeepCrystal  0.4928     0.4520  0.7391
MARCO        0.7777     0.8663  0.9018


Pixel Attribution

We visually inspect to what parts of the image the classifier learns to attend by aggregating noisy gradients of the image with respect to its label on a per-pixel basis. The SmoothGrad [47] approach is used to visualize the focus of the classifier. The images in Fig. 4 are constructed by overlaying a heat map of the classifier’s attention over a grayscale version of the input image.

Note that saliency methods are imperfect and do not in general weigh faithfully all the evidence present in an image according to their contributions to the decision, especially when the evidence is highly correlated. Although these visualizations paint a simplified and very partial picture of the classifier’s decision mechanisms, they help confirm that it is likely not picking up and overfitting to cues that are irrelevant to the task.

Fig 4. Sample heatmaps for various types of images. (A) Crystal: the classifier focuses on some of the angular geometric features of individual crystals (arrows). (B) Precipitate: the classifier lands on the precipitate (arrows). (C) Clear: the classifier broadly samples the image, likely because this label is characterized by the absence of structures rather than their presence. Note the slightly more pronounced focus on some darker areas (circle and arrows) that could be confused for crystals or precipitate. Because the ‘Others’ class is defined negatively by the image being not identifiable as belonging to the other three classes, heatmaps for images of that class are not particularly informative.

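A minimal SmoothGrad sketch in TensorFlow, following the recipe of Ref. [47]; the sample count and noise level are typical values, not those of the study:

```python
import tensorflow as tf

def smoothgrad(model, image, class_index, n_samples=32, noise_std=0.1):
    """Average input gradients over noisy copies of `image`.

    `image` is a float32 tensor of shape (H, W, 3) scaled to [0, 1];
    `model` maps a batch of images to class probabilities."""
    grads = tf.zeros_like(image)
    for _ in range(n_samples):
        noisy = image + tf.random.normal(tf.shape(image), stddev=noise_std)
        with tf.GradientTape() as tape:
            tape.watch(noisy)
            score = model(noisy[tf.newaxis])[0, class_index]
        grads += tape.gradient(score, noisy)
    saliency = tf.abs(grads) / float(n_samples)
    # Collapse channels to one heat map to overlay on the grayscale image.
    return tf.reduce_max(saliency, axis=-1)
```
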

Inference and Availability

The model is open-sourced and available online at [48]. It can be run locally using TensorFlow or TensorFlow Lite, or as a Google Cloud Machine Learning [49] endpoint. At time of writing, inference on a standard Cloud instance takes approximately 260 ms end-to-end per standalone query. However, due to the very efficient parallelism properties of convolutional networks, latency per image can be dramatically cut down for batch requests.

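For local batched inference, a sketch along the following lines could be used; the export path and signature handling are assumptions about the packaging, and the repository in Ref. [48] should be consulted for the actual interface:

```python
import tensorflow as tf

# Hypothetical path; see the repository in Ref. [48] for the actual
# export format and input specification of the released model.
model = tf.saved_model.load('/path/to/marco_savedmodel')
infer = model.signatures['serving_default']
input_name = list(infer.structured_input_signature[1])[0]

# Batching many drop images into a single call amortizes per-query
# overhead, exploiting the parallelism of convolutional networks.
batch = tf.random.uniform([32, 599, 599, 3])  # stand-in for real images
outputs = infer(**{input_name: batch})        # dict of output tensors
```
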
V. DISCUSSION

Previous attempts at automating the analysis of crystallisation images have employed various pattern recognition and machine learning techniques, including linear discriminant analysis [50, 51], decision trees and random forests [52–54], and support vector machines [20, 55]. Neural networks, including self-organizing maps, have also been used to classify these images [17, 56], with the most recent involving deep learning [57]. However, all previous approaches have required a consistent set of images with the same field of view and resolution, in order to identify the crystallization droplet in the well [23], and thereby restrict the analysis. Various statistical, geometric or textural features were then extracted, either directly from the image or from some transformation of the region of interest, to be used as variables in the classification algorithms.

The results from various studies can be difficult to compare head-to-head because different groups present confusion matrices with the number of classes ranging from 2 to 11, only sometimes aggregating results for crystals/crystalline materials. There is also a tradeoff between the number of false negatives and the number of false positives. Yet most report classification rates for crystals around 80–85%, even in more recent work [8, 54, 58] in which missed crystals are reported at much lower rates. This advance comes at the expense of more false positives. For example, Pan et al. report just under 3% false negatives, but almost 38% false positives [55].

As the trained algorithms are specific to a set of images, they are also restricted to a particular type of crystallisation experiment. Prior to the curation of the current dataset, the largest set of images (by far) came from the Hauptman-Woodward Medical Research Institute HTCC [15]. This dataset, which contains 147,456 images from 96 different proteins but is limited to experiments with the microbatch-under-oil technique, has been used in a number of studies [57, 59]. Most notably, Yann et al. used a deep convolutional neural network that automatically extracted features, and reported correct classification rates as high as 97% for crystals and 96% for non-crystals. Although impressive, these results were however obtained from a curated subset of 85,188 clean images, i.e., images with class labels on which several human experts agreed [57]. In order to validate our approach, we retrained our model to perform the same 10-way classification on that subset of the data alone, without any tuning of the model’s hyperparameters, and achieved 94.7% accuracy, compared to the reported 90.8%.

In this context, the current results are especially remarkable. A crystallographer can classify images of experiments independently of the systems used to create those images. They can view an experiment with a microscope or look at a computer image and reach similar conclusions. They can look at a vapor diffusion experiment or a microbatch-under-oil setup and, again, assess either with confidence. Here, we show that this can be accomplished equally well, if not better, using deep CNNs. A benchtop researcher can classify many images, especially if they relate to a project that has been years in the making. For high-throughput approaches, however, that task becomes challenging. The strength of computational approaches is that each image is treated like the previous one, with no fatigue. Classification of 10,000 images is as consistent as classification of one. This advance opens the door for complete classification of all results in a high-throughput setting and for data mining of repositories of past image data.

Another remarkable aspect of our results is that they leverage a very generic computer vision architecture originally designed for a different classification problem – categorization of natural images – with very distinct characteristics. For instance, one can presume that the global geometric relationships between object parts would play a greater role in identifying a car or a dog in an image, compared to the very local, texture-like features involved in recognizing crystal-like structures. Yet no particular specialization of the model was required to adapt it to the widely differing visual appearances of the samples originating from different imagers. This convergence of approaches toward a unified perception architecture across a wide range of computer vision problems has been a common theme in recent years, further suggesting that the technology is now ready for wide adoption for any human-mediated visual recognition task.

VI. CONCLUSION

In this work, we have collated biomolecular crystallization images from nearly half a million experiments across a large range of conditions, and trained a CNN on the labels of these images. Remarkably, the resulting machine-learning scheme was able to recapitulate the labels of more than 94% of a test set. Such accuracy has rarely been obtained, and has no equal for an uncurated dataset. The analysis also identified a small subset of problematic images, which upon reconsideration revealed a high level of label discrepancy. This variability inherent to human labeling highlights one of the main benefits of automatic scoring. Such accuracy also makes high-density screening conceivable.

Enhancing the imaging capabilities by including UV or SONICC results, for instance, could certainly enrich the model. But several research avenues could also be pursued without additional laboratory equipment. In particular, it should be possible to leverage side information that is currently not being used.

• The four-way classification scheme used is a distillation of 38 categories which are present in the source data. While these categories are presumed to be somewhat inconsistent across datasets, they could potentially provide an additional supervision signal.

• Because one goal of this classifier is to be able to generalize across datasets, it would be worthwhile to investigate the contribution of techniques that have been designed to specifically reduce the effect of domain shift across data sources on the classification outcomes [60, 61].

• Each crystallization experiment records a series of images taken over time. Using the timecourse information could enhance the success rate of the classifier [62].

Note in closing that the current study focused on crystallization as an outcome, which is but a small fraction of the protein solubility diagram. Patterns of precipitation, phase separation, and clear drops also provide information as to whether and where crystallization might occur. The success in identifying crystals, precipitate and clear can thus also be used to accurately chart the crystallization regimes and to identify pathways for optimization [59, 63, 64]. The application of this approach to large libraries of historical data may therefore reveal patterns that guide future crystallization strategies, including novel chemical screens and mutagenesis programs.

ACKNOWLEDGMENTS

We acknowledge discussions at various stages of this project with I. Altan, S. Bowman, R. Dorich, D. Fusco, E. Gualtieri, R. Judge, A. Narayanaswamy, J. Noah-Vanhoucke, P. Orth, M. Pokross, X. Qiu, P. F. Riley, V. Shanmugasundaram, B. Sherborne and F. von Delft. PC acknowledges support from National Science Foundation Grant no. NSF DMR-1749374.


[1] S Harrison, B Lahue, Z Peng, A Donofrio, C Chang, and M Glick, “Extending predict first to the design make-test cycle in small-molecule drug discovery,” Future Med. Chem. 9, 533–536 (2017).
[2] Alexander McPherson, Crystallization of Biological Macromolecules (CSHL Press, Cold Spring Harbor, 1999) p. 586.
[3] Naomi E. Chayen, “Turning protein crystallisation from an art into a science,” Curr. Opin. Struct. Biol. 14, 577–583 (2004).
[4] Diana Fusco and Patrick Charbonneau, “Soft matter perspective on protein crystal assembly,” Colloids Surf. B: Biointerfaces 137, 22–31 (2016).
[5] J.T. Ng, C. Dekker, P. Reardon, and F. von Delft, “Lessons from ten years of crystallization experiments at the SGC,” Acta Cryst. D 72, 224–235 (2016).
[6] V.J. Fazio, T.S. Peat, and J. Newman, “Lessons for the future,” Methods Mol. Biol. 1261, 141–156 (2015).
[7] Janet Newman, Evan E. Bolton, Jochen Muller-Dieckmann, Vincent J. Fazio, D. Travis Gallagher, David Lovell, Joseph R. Luft, Thomas S. Peat, David Ratcliffe, Roger A. Sayle, Edward H. Snell, Kerry Taylor, Pascal Vallotton, Sameer Velanker, and Frank von Delft, “On the need for an international effort to capture, share and use crystallization screening data,” Acta Cryst. F 68, 253–258 (2012).
[8] Yulia Kotseruba, Christian A Cumbaa, and Igor Jurisica, “High-throughput protein crystallization on the world community grid and the GPU,” J. Phys. Conf. Ser. 341, 012027 (2012).
[9] Janet Newman, “One plate, two plates, a thousand plates. How crystallisation changes with large numbers of samples,” Methods 55, 73–80 (2011).
[10] Shuheng Zhang, Charline J.J. Gerard, Aziza Ikni, Gilles Ferry, Laurent M. Vuillard, Jean A. Boutin, Nathalie Ferte, Romain Grossier, Nadine Candoni, and Stéphane Veesler, “Microfluidic platform for optimization of crystallization conditions,” J. Cryst. Growth 472, 18–28 (2017); Industrial Crystallization and Precipitation in France (CRISTAL-8), May 2016, Rouen (France).
[11] Y. Thielmann, J. Koepke, and H. Michel, “The ESFRI Instruct Core Centre Frankfurt: Automated high-throughput crystallization suited for membrane proteins and more,” J. Struct. Funct. Genomics 13, 63–69 (2012).
[12] Edward H. Snell, Angela M. Lauricella, Stephen A. Potter, Joseph R. Luft, Stacey M. Gulde, Robert J. Collins, Geoff Franks, Michael G. Malkowski, Christian Cumbaa, Igor Jurisica, and George T. DeTitta, “Establishing a training set through the visual analysis of crystallization trials. Part II: Crystal examples,” Acta Cryst. D 64, 1131–1137 (2008).
[13] This estimate is based on personal communication with five experienced crystallographers at GlaxoSmithKline: 2 seconds/observation × 8 observations × 96 wells. Note that current technology can automatically store and image plates at about 3 min/plate.
[14] Julie Wilson, “Automated classification of images from crystallisation experiments,” in Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining, edited by Petra Perner (Springer Berlin Heidelberg) pp. 459–473.
[15] Edward H. Snell, Joseph R. Luft, Stephen A. Potter, Angela M. Lauricella, Stacey M. Gulde, Michael G. Malkowski, Mary Koszelak-Rosenblum, Meriem I. Said, Jennifer L. Smith, Christina K. Veatch, Robert J. Collins, Geoff Franks, Max Thayer, Christian Cumbaa, Igor Jurisica, and George T. DeTitta, “Establishing a training set through the visual analysis of crystallization trials. Part I: ∼150,000 images,” Acta Cryst. D 64, 1123–1130 (2008).
[16] David Hargreaves, personal communication.
[17] Glen Spraggon, Scott A. Lesley, Andreas Kreusch, and John P. Priestle, “Computational analysis of crystallization trials,” Acta Cryst. D 58, 1915–1923 (2002).
[18] Christian Cumbaa and Igor Jurisica, “Automatic classification and pattern discovery in high-throughput protein crystallization trials,” J. Struct. Funct. Genomics 6, 195–202 (2005).
[19] Kuniaki Kawabata, Kanako Saitoh, Mutsunori Takahashi, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara, and Masashi Miyano, “Evaluation of protein crystallization state by sequential image classification,” Sensor Rev. 28, 242–247 (2008).
[20] Samarasena Buchala and Julie C. Wilson, “Improved classification of crystallization images using data fusion and multiple classifiers,” Acta Cryst. D 64, 823–833 (2008).
[21] Martyn D. Winn, Charles C. Ballard, Kevin D. Cowtan, Eleanor J. Dodson, Paul Emsley, Phil R. Evans, Ronan M. Keegan, Eugene B. Krissinel, Andrew G. W. Leslie, Airlie McCoy, Stuart J. McNicholas, Garib N. Murshudov, Navraj S. Pannu, Elizabeth A. Potterton, Harold R. Powell, Randy J. Read, Alexei Vagin, and Keith S. Wilson, “Overview of the CCP4 suite and current developments,” Acta Cryst. D 67, 235–242 (2011).
[22] “MAchine Recognition of Crystallization Outcomes (MARCO),” (2017), https://marco.ccr.buffalo.edu [Online; accessed 17-March-2018].
[23] Pascal Vallotton, Changming Sun, David Lovell, Vincent J. Fazio, and Janet Newman, “Droplit, an improved image analysis method for droplet identification in high-throughput crystallization trials,” J. Appl. Crystallogr. 43, 1548–1552 (2010).
[24] N. Rosa, M. Ristic, B. Marshall, and J. Newman, “Keeping crystallographers app-y,” Acta Cryst. F, submitted.
[25] Katarina Mele, Rongxin Li, Vincent J. Fazio, and Janet Newman, “Quantifying the quality of the experiments used to grow protein crystals: the iQC suite,” J. Appl. Cryst. 47, 1097–1106 (2014).
[26] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation 1, 541–551 (1989).
[27] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature 521, 436 (2015).
[28] Waseem Rawat and Zenghui Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural Comput. 29, 2352–2449 (2017).
[29] A Berg, J Deng, and L Fei-Fei, “Large scale visual recognition challenge (ILSVRC),” (2010), http://www.image-net.org/challenges/LSVRC.
[30] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez, “A survey on deep learning in medical image analysis,” Med. Image Anal. 42, 60–88 (2017).
[31] Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle, “Deep learning for computational biology,” Mol. Syst. Biol. 12, 878 (2016).
[32] Jonathan Krause, Varun Gulshan, Ehsan Rahimy, Peter Karth, Kasumi Widner, Greg S Corrado, Lily Peng, and Dale R Webster, “Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy,” arXiv:1710.01711 [cs.CV] (preprint) (2017).
[33] Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q Nelson, Greg S Corrado, et al., “Detecting cancer metastases on gigapixel pathology images,” arXiv:1703.02442 [cs.CV] (preprint) (2017).
[34] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams, “Learning representations by back-propagating errors,” Nature 323, 533 (1986).
[35] Léon Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010 (Springer, 2010) pp. 177–186.
[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 2818–2826.
[37] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” arXiv:1602.07261 [cs.CV] (preprint) (2017).
[38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le, “Learning transferable architectures for scalable image recognition,” arXiv:1707.07012 [cs.CV] (preprint) (2017).
[39] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D Sculley, “Google Vizier: A service for black-box optimization,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017) pp. 1487–1495.
[40] Nathan Silberman and Sergio Guadarrama, “TensorFlow-Slim image classification model library,” (2017), https://github.com/tensorflow/models/tree/master/research/slim [Online; accessed 02-February-2018].
[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res. 15, 1929–1958 (2014).
[42] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” (2015).
[43] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems (2012) pp. 1223–1231.
[44] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning 4, 26–31 (2012).
[45] Chris Watkins, “C4, C3 Classifier Pipeline. v1. CSIRO. Software Collection.” (2018), https://doi.org/10.4225/08/5a97375e6c0aa [Online; accessed 09-May-2018].
[46] “DeepCrystal,” (2017), http://www.deepcrystal.com [Online; accessed 09-May-2018].
[47] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg, “SmoothGrad: Removing noise by adding noise,” arXiv:1706.03825 [cs.LG] (preprint) (2017).
[48] Vincent Vanhoucke, “MARCO repository in TensorFlow Models,” (2018), http://github.com/tensorflow/models/tree/master/research/marco [Online; accessed 01-May-2018].
[49] “Google Cloud Machine Learning Engine,” (2018), [Online; accessed 02-February-2018].
[50] Christian A. Cumbaa, Angela Lauricella, Nancy Fehrman, Christina Veatch, Robert Collins, Joe Luft, George DeTitta, and Igor Jurisica, “Automatic classification of sub-microlitre protein-crystallization trials in 1536-well plates,” Acta Cryst. D 59, 1619–1627 (2003).
[51] Kanako Saitoh, Kuniaki Kawabata, Hajime Asama, Taketoshi Mishima, Mitsuaki Sugahara, and Masashi Miyano, “Evaluation of protein crystallization states based on texture information derived from greyscale images,” Acta Cryst. D 61, 873–880 (2005).
[52] Marshall Bern, David Goldberg, Raymond C. Stevens, and Peter Kuhn, “Automatic classification of protein crystallization images using a curve-tracking algorithm,” J. Appl. Cryst. 37, 279–287 (2004).
[53] R. Liu, Y. Freund, and G. Spraggon, “Image-based crystal detection: a machine-learning approach,” Acta Cryst. D 64, 1187–1195 (2008).
[54] Christian A. Cumbaa and Igor Jurisica, “Protein crystallization analysis on the World Community Grid,” J. Struct. Funct. Genomics 11, 61–69 (2010).
[55] Shen Pan, Gidon Shavit, Marta Penas-Centeno, Dong-Hui Xu, Linda Shapiro, Richard Ladner, Eve Riskin, Wim Hol, and Deirdre Meldrum, “Automated classification of protein crystallization images using support vector machines with scale-invariant texture and Gabor features,” Acta Cryst. D 62, 271–279 (2006).
[56] M.J. Po and A.F. Laine, “Leveraging genetic algorithm and neural network in automated protein crystal recognition,” in Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS’08 – “Personalized Healthcare through Technology” (2008) pp. 1926–1929.
[57] Margot Lisa-Jing Yann and Yichuan Tang, “Learning deep convolutional neural networks for X-ray protein crystallization image analysis,” in Thirtieth AAAI Conference on Artificial Intelligence (2016).
[58] Jeffrey Hung, John Collins, Mehari Weldetsion, Oliver Newland, Eric Chiang, Steve Guerrero, and Kazunori Okada, “Protein crystallization image classification with elastic net,” in SPIE Medical Imaging, Vol. 9034 (SPIE) p. 14.
[59] Diana Fusco, Timothy J. Barnum, Andrew E. Bruno, Joseph R. Luft, Edward H. Snell, Sayan Mukherjee, and Patrick Charbonneau, “Statistical analysis of crystallization database links protein physico-chemical features with crystallization mechanisms,” PLoS ONE 9, e101123 (2014).
[60] Yaroslav Ganin and Victor Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning (2015) pp. 1180–1189.
[61] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan, “Domain separation networks,” in Advances in Neural Information Processing Systems (2016) pp. 343–351.
[62] Katarina Mele, B. M. Thamali Lekamge, Vincent J. Fazio, and Janet Newman, “Using time courses to enrich the information obtained from images of crystallization trials,” Cryst. Growth Des. 14, 261–269 (2014).
[63] Edward H. Snell, Ray M. Nagel, Ann Wojtaszcyk, Hugh O’Neill, Jennifer L. Wolfley, and Joseph R. Luft, “The application and use of chemical space mapping to interpret crystallization screening results,” Acta Cryst. D 64, 1240–1249 (2008).
[64] Irem Altan, Patrick Charbonneau, and Edward H. Snell, “Computational crystallization,” Arch. Biochem. Biophys. 602, 12–20 (2016).
