
HELIX: Holistic Optimization for Accelerating Iterative Machine Learning

Doris Xin, Stephen Macke, Litian Ma, Jialin Liu, Rong Ma, Shuchen Song, Aditya Parameswaran
University of Illinois (UIUC)
{dorx0,smacke,litianm2,jialin2,ssong18,adityagp}@illinois.edu

ABSTRACT

Machine learning application developers and data scientists spend an inordinate amount of time iterating on machine learning workflows—by modifying the data pre-processing, model training, and post-processing steps—via trial-and-error, until the desired accuracy is achieved. Unfortunately, most work on making machine learning workflows faster has focused on optimizing model training. While a few systems optimize the one-shot execution of the workflow, they ignore the fact that rapid iteration is, in fact, the key bottleneck. We propose HELIX, a declarative machine learning system that optimizes the execution across iterations—intelligently caching and reusing, or recomputing intermediates as appropriate. HELIX captures a wide variety of application needs within its Scala DSL, defining unified processes for data pre-processing, model specification, and learning. We demonstrate that for the simplest setting, this optimization objective reduces to MAX-FLOW, while for more complex settings, it is NP-HARD—we develop lightweight heuristics for this purpose. We demonstrate that HELIX is not just able to handle a wide variety of use cases in one unified workflow, but is also succinct—and much faster—providing latency reductions of up to an order of magnitude over state-of-the-art systems that do not optimize across iterations, such as DeepDive or KeystoneML.

1. INTRODUCTION

From emergent applications like precision medicine, voice-controlled devices, and driverless cars, to well-established ones like product recommendations and credit card fraud detection, machine learning continues to be the key driver of innovations that are transforming our everyday lives. At the same time, it is well-known that developing machine learning applications is time-consuming and cumbersome. To this end, there have been a number of efforts to make machine learning more declarative, and to speed up the model training process [3].

However, the majority of the development time is in fact spent iterating on the machine learning workflow, incrementally modifying various steps within, including the (i) pre-processing steps: by transforming, cleaning, or extracting data differently, or adding, deleting, or transforming features (e.g., via normalization or cleaning); (ii) model training steps: by tweaking parameters, changing the algorithm or regularization; and (iii) post-processing steps: by evaluating the model on test data, or generating statistics or charts. The reason for these iterations is that it is often difficult to predict the performance of a workflow a priori, both due to the variability of the data, and due to the complexity and unpredictability of machine learning. Thus, developers must resort to iterative modifications of the workflow via "trial-and-error" to improve performance. A recent survey reports that less than 15% of the time is actually spent on model training [23], with the bulk of the time spent iterating on all steps of the machine learning workflow.

EXAMPLE 1 (GENE FUNCTION PREDICTION). Consider the following example from our bioinformatics collaborators who form part of a genomics center at the University of Illinois [28]. Their goal is to identify pairs of genes that are functionally related to each other, and their correlation with diseases, by mining scientific literature. As part of this activity, they process published papers to extract entity—gene and disease—mentions, compute co-occurrence vectors for these mentions based on surrounding text, compute embeddings using a word2vec [22]-like approach, and finally cluster the embeddings to find related entities. They repeatedly iterate on this workflow to try to improve accuracy. For example, they may (i) expand or shrink the literature corpus, (ii) add in external knowledge sources such as known gene databases to refine how entities are identified, and (iii) try different NLP libraries for tokenization and entity recognition. They may also (iv) change the algorithm used for computing word embedding vectors, e.g., from word2vec to LINE [36], or (v) tweak the number of clusters to control the granularity of the clustering. Every single change that they make necessitates waiting for the entire workflow to rerun from scratch—often multiple hours on a large cluster for each single change, even though the change may be quite small.

As this example illustrates, the key bottleneck in applying machine learning is iteration—every single small change to the workflow results in several hours of recomputation, even though the change may only affect a small portion of the workflow. For example, changing the regularization parameter or adding a new feature should only affect the parts of the workflow that depend on it, as opposed to the rest, which shouldn't need to be rerun. Indeed, one approach to mitigate this expensive recomputation is to manually materialize every single intermediate that doesn't change across iterations, but this approach requires the developers to write code to keep track of what changes and what doesn't across iterations, as well as worry about how and when to materialize the intermediates, and to reuse them in subsequent iterations. Since this is so cumbersome, developers instead often opt to rerun the entire workflow from scratch.

Unfortunately, existing machine learning systems do not facilitate rapid, optimized iteration in machine learning workflows. For example, KeystoneML [31], which allows developers to specify workflows at a higher-level abstraction, is thereby able to optimize the one-shot execution of that workflow by applying techniques such as careful operator selection and intermediate result caching, as well as dead code elimination. Likewise, Columbus [44], targeted at generalized linear models, optimizes the one-shot execution of such workflows. On the other extreme, DeepDive [43], targeted at knowledge-base construction, materializes the results of all
of the feature extraction and engineering steps, while also applying approximate inference to speed up the model training. While this naïve materialization approach does help iterative execution somewhat, it can be both wasteful and time-consuming, as we will see below.

In this paper, we present HELIX, a declarative, general-purpose machine learning system that optimizes across iterative machine learning workflows. HELIX is able to match or exceed the performance of KeystoneML and DeepDive on one-shot execution, while providing gains of up to 10× on iterative execution. By optimizing across iterations, HELIX allows data scientists to not be constrained in running the entire workflow from scratch every time they make a change, but to instead run machine learning workflows "at the speed of thought", repeatedly iterating on workflows with the execution time proportional to the complexity of the change made. Thereby, HELIX is able to substantially increase the productivity of developers and data scientists and reduce the time spent waiting for workflows to complete execution.

Developing HELIX involves two types of challenges—challenges in iterative execution optimization and challenges in specification and generalization.

Challenges in Iterative Execution Optimization. Suppose we can represent the current machine learning workflow as a directed acyclic graph (translating the workflow into this graph is another challenge that we will discuss later), where each node corresponds to a collection of data—be it the original data items, such as documents or images, the transformed data items, such as sentences or words, the extracted features, or the final model or model outcomes. This graph, for practical workflows, can be quite large and complex. One simple approach to enable iterative execution optimization (adopted by DeepDive [43]) is to materialize the result of every single node, such that the next time the workflow is run, we can simply check if the result can be reused from the previous iteration, and if so, we can simply reuse it. Unfortunately, this approach is not only wasteful in storage but also potentially very costly, since materialization increases the overheads of the system. Moreover, in a subsequent iteration, it may be cheaper to recompute an intermediate result, as opposed to reading it from disk.

A better approach is, for each node, to determine whether the result is worth materializing, taking into account the time taken for computing that node, the time taken for computing its parents, and which ancestors, if any, are materialized. Then, during subsequent iterations, determine whether to read the result for a node from persistent storage (if materialized), or to compute it from scratch. In this paper, we formally demonstrate that the latter problem is in PTIME via a non-trivial reduction to MAX-FLOW using the PROJECT SELECTION PROBLEM [13], while the former problem is, in fact, NP-HARD. Further complicating matters is the fact that the materialization decision cannot be deferred to the end of the iteration, and needs to be made on the fly.

Challenges in Specification and Generalization. To enable iterative execution optimization, we need to support the specification of the end-to-end machine learning workflow in a high-level language. Unfortunately, this is rather challenging. The data preparation, feature extraction, engineering, and transformation steps—in fact, everything apart from the model training step—are often written in imperative code, and often in a different programming language, making it hard to automatically analyze the workflow to identify data collections, and their relationships with each other, to apply holistic iterative execution optimization.

Instead, we adopt a hybrid approach within HELIX: data scientists write code in a simple, intuitive, and modular domain-specific language (DSL) built on Scala (similar to existing machine learning systems like KeystoneML), while also employing UDFs as needed to insert imperative code, say for feature extraction or transformation. This interoperability allows data scientists to seamlessly leverage existing functions and libraries within Scala, such as the CoreNLP toolkit for natural language processing [19], the Deeplearning4j [38] toolkit for deep learning, and the ImageJ [27] package for computer vision—thereby retaining all of the benefits of a full imperative language. Moreover, HELIX is built on top of Spark, allowing data scientists to mix Spark-specific operators within machine learning workflows, and leverage Spark's parallel processing capabilities.

Thus, the modular construction of HELIX's DSL enables it not just to automatically identify data dependencies and data flow, but also to encapsulate all typical machine learning workflow designs. Unlike Columbus [44] or DeepDive [43], we don't restrict HELIX's learning paradigm to be regression or factor-graph-based, enabling data scientists to use their preferred model training approach. HELIX's workflows can easily capture supervised, semi-supervised, or unsupervised learning methodologies with applications ranging from natural language processing, to network analysis, to computer vision. Developing this DSL, while satisfying all of these requirements, was challenging. As we will see in the following, HELIX's DSL is both as succinct as other declarative machine learning systems and also general and powerful; at the same time, it enables iterative execution optimization by allowing HELIX to synthesize a graph of data collections and their dependencies. Finally, by studying the variation in this graph across iterations, HELIX is able to identify reuse opportunities across iterations.

Contributions and Outline. The rest of the paper is organized as follows: Section 2 presents a quick recap of ML workflows, statistics on how users iterate on ML workflows collected from the applied ML literature, an architectural overview of the system, and a concrete workflow to illustrate concepts discussed in the subsequent sections; Section 3 describes the programming interface for effortless end-to-end workflow specification; Section 4 discusses HELIX system internals, including the compilation process for generating the intermediate representation (IR), pruning techniques to optimize the IR, and change tracking between iterations; Section 5 formally presents the two major optimization problems in accelerating iterative ML and HELIX's solutions to both problems. We evaluate our framework on four workflows from different application domains and against two state-of-the-art systems closely related to our work. Section 6 presents the results of these evaluations and our analysis. We discuss related work in Section 7 and finally conclude in Section 8.

2. BACKGROUND AND OVERVIEW

We now provide a high-level overview of machine learning workflows and characterize how developers iterate on workflows by reviewing the applied machine learning research literature. We then describe the HELIX system architecture, and conclude with a sample workflow programmed in HELIX that will serve as a running example.

2.1 A Brief Overview of Workflows

A Machine Learning (ML) workflow accomplishes a specific ML task, ranging from simple tasks like classification or clustering, to complex ones like entity resolution or text and image captioning. The more complex ML tasks are often broken down into smaller subtasks; e.g., image captioning is broken down into identifying individual objects or actions via classification, followed by generating sentences using a language model [12].
Within HELIX, we decompose ML workflows into three components: data preparation (DPR), learning/inference (L/I), and post-processing (PPR). These three components are generic and adapt to a wide variety of supervised, semi-supervised, and unsupervised settings, as we will demonstrate in Section 6. Let R be the raw input data for the ML workflow.

Data Preparation (DPR). During data preparation, R is transformed, through a series of operations, into some representation $D \in \mathcal{X}$, where $\mathcal{X}$ is the input space for model training purposes. The transformation from R to D can involve a variety of operations, such as fine-grained feature definition from individual attributes (e.g., number of vowels in a word), joining in other data sources (e.g., joining user information into log data), parsing (e.g., a text document to individual words), and aggregation (e.g., counting clicks for an ad from log data).

Learning/Inference (L/I). During learning, an algorithm is run on D to obtain a model $f : \mathcal{X} \to \mathcal{Y}$, such as a linear classifier, decision tree, or cluster centers, where $\mathcal{Y}$ denotes the space of the target outputs for the ML task. Then, during inference[1], this model f is used to process new data from $\mathcal{X}$. For example, in spam classification, f decides whether a new email is spam; in clustering, f assigns a datapoint to a specific (set of) cluster(s); in image captioning, f generates a text caption for an image; and in word embeddings, f maps a string containing a word onto a vector. We treat learning and inference as a unified component because data processed by the DPR component can either be used for learning, or, in the case a model has already been learned, inference. As shown in the examples above, this observation is valid for both supervised and unsupervised learning.

Post-processing (PPR). In addition to learning and inference, an ML workflow usually contains additional operations on the output model or inference results. This could include model evaluation, visualizations, or other application-specific activities. We refer to these operations as post-processing.

[1] We use the term as defined in the ML community; this term is not to be confused with statistical inference, which is concerned with estimating distributions based on data.

2.2 A Small-Scale Survey of Applied ML Lit.

ML workflow development is anecdotally regarded to be highly iterative in nature, due to the unpredictable performance arising from data variability and model complexity [17, 35]. However, to the best of our knowledge, no quantitative studies of iteration exist in the literature. To remedy this, we conducted a small-scale survey of the applied ML literature, trying to analyze the workflow variations corresponding to the results reported in each paper. Note that since authors report on only a small subset of the iterations necessary to produce the experimental results presented in the paper, this survey captures only a limited view of the iterative process. Thus, our findings are merely a lower bound on the actual number of iterations. Nevertheless, the survey provides valuable insights into common practices for iterating on ML workflows.

We surveyed 105 papers randomly sampled from the KDD '16 Applied Data Science Track, ACL '16, Nature '16, and CVPR '16, spanning applications in social sciences (SocS), web applications (WWW), natural sciences (NS), natural language processing (NLP), and computer vision (CV). Paper topics were determined using the ACM Computing Classification System. See Figure 1(a) for paper counts by conference and topic.

For each paper, we recorded details about the datasets, feature engineering steps, ML models and optimization algorithms, and evaluation methods. To approximate the number of DPR, L/I, and PPR iterations in each paper, we use the following estimators:

• DPR: the number of feature transformation methods (e.g., normalization);
• L/I: $n_{ml} - 1$, where $n_{ml}$ is the number of ML algorithms (e.g., SVM), optimization techniques (e.g., SGD), and model tuning steps (e.g., regularization);
• PPR: $\min(n_{metrics}, n_{ft})$, where $n_{metrics}$ is the number of metrics reported (counting closely related metrics, such as precision and recall, as one), and $n_{ft}$ is the number of figures and/or tables containing evaluation results.

Assuming that at most one variable is changed between two iterations, these estimators are all lower bounds because they count only the distinct results but not the transitions between them. Treating each result as a node in a connected graph, the number of edges is lower bounded by the number of nodes minus one.

Figure 1(c) shows the average number of each type of iteration for workflows by application domain. CV and NLP workflows have fewer DPR iterations, due to the fact that all surveyed papers study the same pre-processed and annotated datasets, in addition to the prevalence of deep neural networks (DNNs). We discovered that, by convention, CV papers report only the final model parameters and not the entire model tuning process, hence the below-average number of L/I iterations. On the other hand, NS papers tend to report on a larger number of models (e.g., SVM, Random Forest) since the applicability of a model class to the problem investigated presents value to future researchers (SVM is the most popular choice by a large margin). Overall, the average (max) numbers of DPR, L/I, and PPR iterations are 1.3 (5), 1.5 (7), and 2.8 (6), respectively. We also find that the average number of iterations is four or more across all domains, with WWW, SocS, and NS papers reporting an average of > 6 iterations.

In addition, we highlight four interesting characteristics discovered during our survey in Figure 1(b). First, web applications and SocS are much more likely to incorporate multiple data sources in creating an ML model. Here, we focus on cases where a single model relies on multiple data sources, not the case where models are evaluated on multiple datasets. For web applications, this often entails joining log data with user profiles; SocS often consider both the social network and auxiliary information such as geographic features. Second, except in CV, most domains still rely upon features handcrafted by domain experts. Third, contrary to common belief, deep neural networks are not ubiquitously adopted, especially in SocS and web applications, due largely in part to limited data and computing resource availability, and to the lack of human interpretability of outputs. Lastly, in addition to reporting aggregate metrics such as AUC, authors often conduct fine-grained case studies on specific datapoints to elucidate the limitations of their approach. Almost every CV paper contains case studies since images readily lend themselves to visualization. Furthermore, authors in the natural sciences study specific high-impact features to derive new scientific insights.

We use the iteration trends presented in Figure 1(c) to guide our empirical evaluations in Section 6; specific system design decisions are informed by insights presented in Figure 1(b).

2.3 System Architecture

The HELIX system consists of a domain specific language (DSL) in Scala as the programming interface, a compiler for the DSL, and an execution engine, as shown in Figure 2. The three components work collectively to minimize the execution time for both the current iteration and subsequent iterations, by judiciously materializing intermediate results for reuse. We provide a brief overview of each of these three components below.

Figure 1: (a) Paper count by domain for the conferences included in the literature survey. (b) Fraction of papers with each characteristic in each domain. (c) The average DPR, L/I, and PPR iterations for each domain.

[Figure 2 (diagram): the HELIX stack. A Scala DSL serves as the programming interface; during compilation, the intermediate code generator produces a Workflow DAG, which the DAG optimizer turns into an optimized DAG; during execution, the execution engine, guided by the materialization optimizer, runs on Spark alongside app-specific libraries.]

Figure 3: Workflow lifecycle.
Figure 2: System architecture. A program written by the user in the HELIX DSL, known as a Workflow, is first compiled into an intermediate DAG representation, which is optimized to produce a physical plan to be run by the execution engine. At runtime, the execution engine communicates with the materialization optimizer to decide whether to materialize intermediate results on disk.

1. Programming Interface. HELIX provides a single Scala interface named Workflow for programming the entire workflow. Users code their applications using the HELIX DSL, which enables embedding of imperative code in declarative statements for easy inline UDF declaration, in a similar fashion to SparkSQL [2]. Through just a handful of extensible operator types, the DSL supports a wide range of use cases for both data preparation and machine learning. We describe the DSL in detail in Section 3.

2. Compilation. A Workflow is internally represented as a directed acyclic graph (DAG) of intermediate results corresponding to the declared operators. We formally define the DAG and describe the compilation of a Workflow into the DAG in Section 4.1. The compiled DAG is analyzed by the DAG optimizer, which considers both the DAG and relevant data, to produce an optimal physical execution plan that minimizes the one-shot run time of the workflow. The optimal plan prunes extraneous operators via program slicing [40] and orders the retained operators so that they are evaluated lazily, i.e., only when they are needed, to reduce memory footprint. We discuss the mechanisms that enable pruning in Section 4.2. The DAG optimizer tracks changes between iterations to determine the reusability of previous results and selectively reloads previous intermediate results via a MAX-FLOW-based algorithm. We describe the change tracking feature in Section 4.3, and the MAX-FLOW-based algorithm in Section 5.2.

3. Execution Engine. The execution engine carries out the physical execution plan produced during the compilation phase, terminating at the earliest point of failure. While running each operator, the execution engine communicates with the materialization optimizer, which determines whether its results are materialized[2], in order to minimize the run time of future executions. Materialized results can be loaded in subsequent executions to prune not only the corresponding operator but also potentially all of its ancestors in the DAG, which can lead to significant speedup. We present the algorithm used by the materialization optimizer in Section 5.3. The execution engine uses Spark [42] to support data processing, and domain-specific libraries such as CoreNLP [19] and Deeplearning4j [39] for custom application needs.

[2] We use materialization to mean persisting to disk throughout the paper.

2.4 The Workflow Lifecycle

Figure 3 provides an overview of the lifecycle of ML workflows in HELIX. Starting with W0, an initial version of the user-programmed workflow, the lifecycle includes the following stages:

• DAG Compilation. The Workflow in HML is compiled into the DAG of intermediate results used by HELIX's optimization and execution engines.
• DAG Optimization. After obtaining the workflow DAG from the compilation step, the DAG optimizer creates a physical plan $W_i^{OPT}$ to be executed by pruning and ordering the nodes in the DAG and deciding whether any computation can be replaced with loading previous results from disk.
• Materialization Optimization. During execution, the materialization optimizer determines which nodes should have their
results persisted to disk so that they may be reloaded to avoid recomputation in future iterations.
• User Interaction. Upon execution completion, the user may modify the workflow based on the results, such as tuning hyperparameters. The updated workflow, fed back to HELIX, marks the beginning of a new iteration, and the cycle repeats.

Without loss of generality, we assume that a workflow Wt is only executed once in each iteration. We model a repeated execution of Wt as a new iteration where Wt+1 = Wt. Distinguishing two executions of the same workflow is important because they may have different run times—the second execution can reuse results materialized in the first execution for a potential run time reduction.

2.5 Example Workflow

We demonstrate the usage of HELIX with a simple example ML workflow for predicting income using census data from Kohavi [14], shown in Figure 4a); this workflow will serve as a running example throughout the paper. Details about the individual operators will be provided in subsequent sections. We overlay the original workflow with an iterative update, with additions annotated with + and deletions annotated with -, while the rest of the lines are retained as is. We begin by describing the original workflow consisting of all the unannotated lines plus the line annotated with - (deletions).

Original Workflow: DPR Steps. First, after some variable name declarations, the user defines in lines 3-4 a data collection rows read from a data source data consisting of two CSV files, one for training and one for test data, and names the columns of the CSV files age, education, etc. In lines 5-10, the user declares simple features that are values from specific named columns. Note that the user is not required to specify the feature type, which is automatically inferred by HELIX from data. In line 11, ageBucket is declared as a derived feature formed by discretizing age into ten buckets (whose boundaries are computed by HELIX), while line 12 declares an interaction feature, commonly used to capture higher-order patterns, formed out of the concatenation of eduExt and occExt.

Once the features are declared, the next step, line 13, declares the features to be extracted from and associated with each element of rows. Users do not need to worry about how these features are attached and propagated; users are also free to perform manual feature selection here, studying the impact of various feature combinations, by excluding some of the feature extractors. Finally, as the last step of data preprocessing, line 14 declares that an example collection named income is to be made from rows using target as labels. Importantly, this step converts the features from human-readable formats (e.g., color=red) into an indexed vector representation required for learning.

Original Workflow: L/I & PPR Steps. Line 15 declares an ML model named incPred with type "Logistic Regression" and regularization parameter 0.1, while line 16 specifies that incPred is to be learned on the training data in income and applied on all data in income to produce a new example collection called predictions. Lines 17-18 declare a Reducer named checkResults, which outputs a scalar using a UDF for computing prediction accuracy. Line 19 explicitly specifies checkResults's dependency on target since the content of the UDF is opaque to the optimizer. Line 20 declares that the output scalar named checked is only to be computed from the test data in income. Line 21 declares that checked must be part of the final output.

Original Workflow: Optimized DAG. The HELIX compiler first translates verbatim the program in Figure 4a) into a DAG, which contains all nodes including raceExt and all edges (including the dashed edge) except the ones marked with dots in Figure 4b). This DAG is then transformed by the optimizer, which prunes away raceExt (grayed out) because it does not contribute to the output, and adds the edges marked by dots to link relevant features to the model. DPR involves nodes in purple, and L/I and PPR involve nodes in orange. Nodes with a drum to the right are materialized to disk, either as mandatory output or for aiding future iterations.

Updated Workflow. In the updated version of the workflow, a feature named msExt is added below line 9 of the original version. Additionally, the feature clExt in the original version is removed and replaced with the new feature msExt in line 13.

Updated Workflow: Optimized DAG. In the optimized DAG for the updated workflow, a node is added for the new feature msExt, and the node for clExt gets pruned. Additionally, the materialized result from the last iteration for rows is loaded from disk (drum to the left), allowing data to be pruned. data and rows are roughly the same size; thus, computing rows from data is less efficient than simply reloading rows. Furthermore, since a new feature is introduced, rows, which is the input to the feature extractor, cannot be pruned. Recomputing ageBucket requires making a pass over the entire dataset to compute the bounds, a costly operation that we avoid by reloading a previous materialization. HELIX materializes predictions in both iterations because it has changed due to the new feature set. Although predictions is not reused in the updated workflow, its materialization has a high expected payoff over iterations because PPR iterations (changes to checked in this case) are the most common as per our survey results shown in Figure 1(c). This example illustrates that:
• Nodes selected for materialization lead to significant speedup in subsequent iterations.
• HELIX reuses results safely, deprecating old results when changes are detected (e.g., predictions is not reused because of the model change).
• HELIX correctly prunes away extraneous operations via dataflow analysis.

3. PROGRAMMING INTERFACE

In this section, we define HELIX's programming interface, which enables users to program ML workflows with high-level object-oriented abstractions, unburdened by low-level system details.

The HELIX DSL, called HML, is an embedded DSL in the Scala programming language. An embedded DSL exists as a library in the host language (Scala in our case), leading to seamless integration. LINQ [20], a data query framework integrated in .NET languages, is another example of an embedded DSL. In HELIX, users can freely incorporate Scala code for user-defined functions (UDFs) directly into HML. A wide range of JVM-based libraries such as CoreNLP [19], Deeplearning4j [39], and MLlib [21] can be imported directly into HML to support application-specific needs.

The basic building blocks of HML are HELIX objects, each of which can be either a Data Collection (DC) or an operator. Statements in HML either declare new instances of objects or the relationships between declared objects. Users program the entire workflow in a single Workflow interface, as shown in Figure 4a). We describe DCs and operators in detail in Sections 3.1 and 3.2, respectively, and we elaborate on the HML syntax and semantics in Section 3.3.

Unified learning support. HML provides unified support for training and test data by treating them as a single data source, as done in Line 4 in Figure 4a. This design ensures that both training and
1.  object Census extends Workflow {
2.    // Declare variable names (for consistent reference) omitted.
3.    data refers_to new FileSource(train="dir/train.csv", test="dir/test.csv")
4.    data is_read_into rows using CSVScanner(Array("age", "education", ...))
5.    ageExt refers_to FieldExtractor("age")
6.    eduExt refers_to FieldExtractor("education")
7.    occExt refers_to FieldExtractor("occupation")
8.    clExt refers_to FieldExtractor("capital_loss")
9.    raceExt refers_to FieldExtractor("race")
+     msExt refers_to FieldExtractor("marital_status")
10.   target refers_to FieldExtractor("target")
11.   ageBucket refers_to Bucketizer(ageExt, bins=10)
12.   eduXocc refers_to InteractionFeature(Array(eduExt, occExt))
13. - rows has_extractors(eduExt, ageBucket, eduXocc, clExt, target)
    + rows has_extractors(eduExt, ageBucket, eduXocc, msExt, target)
14.   income results_from rows with_labels target
15.   incPred refers_to new Learner(modelType="LR", regParam=0.1)
16.   predictions results_from incPred on income
17.   checkResults refers_to new Reducer( (preds: DataCollection) => {
18.     // Scala UDF for checking prediction accuracy omitted. })
19.   checkResults uses extractorName(rows, target)
20.   checked results_from checkResults on testData(predictions)
21.   checked is_output()
22. }

a) Census Workflow Program
[b) Optimized DAG for original workflow and c) Optimized DAG for modified workflow (diagrams): nodes data, rows, the extractors (raceExt, eduExt, occExt, clExt, ageExt, msExt, target), the derived features (eduXocc, ageBucket), income, predictions, and checked, grouped into data-preprocessing and machine-learning stages; raceExt is pruned in b), and clExt is replaced by msExt in c).]

Figure 4: Example workflow for predicting income from census data.

test data undergo the exact same data preparation steps, thus precluding bugs caused by inconsistent data, and eliminates repetitive code that handles data preparation for training and test data separately. HELIX automatically selects the appropriate data for training and evaluation.

3.1 HELIX Data Collections

A data collection (DC) is analogous to a relation in an RDBMS, while each element in the DC is analogous to a tuple in the relation. The content of a DC comes from either i) reading a data source from persistent storage (e.g., data in Line 4 in Figure 4a) or ii) applying an operation on other DCs (e.g., rows in Line 5 in Figure 4a, obtained by applying the CSVScanner on data). An element in a DC can either be a Semantic Unit or an Example, described next.

Semantic Units. Semantic units (SUs) are the main data structure for supporting data preparation in HML. An SU encapsulates a unit data value v, such as a string, a number, or an array, and features derived from v. Human-defined features are common in many application domains, as discovered by our survey reported in Section 2.2. To facilitate this type of feature engineering, data in SUs is kept in human-readable formats (e.g., "color=red", "weight=2g"), which often require additional transformations to become compatible with ML algorithms.

Example. Examples are the data structure for machine learning in HML. Each Example corresponds to an example (also referred to as an instance) in ML, a datapoint from the input space to the learning algorithm. An Example contains one or more SUs and an optional label, which can be the value of an extracted feature. Examples inherit all features extracted on constituent SUs and their ancestors (this way, features can be used across different data granularities, e.g., words, sentences, documents). Examples alleviate the burden of low-level data preparation by automatically transforming these raw features into ML-compatible representations, e.g., mapping the isCapital feature onto {0, 1}.

A DC is parameterized by the type of elements it contains. DC[SU] and DC[E] denote a DC of Semantic Units and a DC of Examples, respectively. DCs are not polymorphic, i.e., they can only contain a single type of element. The type of elements in a DC is determined by the operation that produced the DC and not explicitly specified by the user. All DCs preceding the key phrase results_from contain Examples (e.g., income in Line 15 and predictions in Line 18 of Figure 4a), whereas the rest contain Semantic Units (e.g., data, rows in Figure 4a). DC[E]s are input to or output from machine learning operations.
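To make the two element types concrete, here is a minimal Scala sketch of Semantic Units and Examples; the class and field names are ours for illustration and are not HELIX's actual internals.

// Illustrative sketch of the two DC element types.
case class Feature(name: String, value: String) // human-readable, e.g., "color" -> "red"

// A Semantic Unit wraps a raw value plus features derived from it.
case class SemanticUnit(value: Any,
                        features: Seq[Feature],
                        parent: Option[SemanticUnit] = None) {
  // Features are inherited from ancestor SUs (e.g., word -> sentence -> document).
  def allFeatures: Seq[Feature] =
    features ++ parent.map(_.allFeatures).getOrElse(Seq.empty)
}

// An Example groups one or more SUs with an optional label, inheriting
// every feature extracted on its constituent SUs and their ancestors.
case class Example(units: Seq[SemanticUnit], label: Option[String] = None) {
  def features: Seq[Feature] = units.flatMap(_.allFeatures)

  // Index human-readable features into the ML-compatible vector
  // representation, given a mapping from feature identity to position.
  def toIndexedVector(index: Map[Feature, Int]): Map[Int, Double] =
    features.flatMap(f => index.get(f).map(_ -> 1.0)).toMap
}

Under this sketch, a categorical feature such as color=red becomes a 1 at its assigned index, mirroring the indexed vector representation produced in line 14 of Figure 4a).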
3.2 HELIX Operators

A HELIX operator takes one or more DCs and outputs DCs, ML models, or scalars. Each operator encapsulates a function, written in Scala, to be applied to individual elements in the input DCs. We refer to this encapsulated function simply as the UDF below. An operator belongs to one of the following types:

Scanner. Scanners support relational operators in data preparation such as projection and selection, in addition to parsing (one-to-many mapping) and other general one-to-one mappings, such as format conversion. The UDF in a Scanner outputs zero or more SUs for each input SU. For example, CSVScanner in Line 4 of Figure 4a) transforms each input SU containing a file line as a string into a map containing column name to value pairs.

Extractor. Extractors support feature engineering, and are particularly well adapted for fine-grained feature definitions using domain knowledge, a common practice in most application domains as discovered by our survey in Section 2.2. Extractors can be composed for complex features, such as eduXocc in Line 13 of Figure 4a), which generates interaction features between the education and occupation features.

Synthesizer. Many ML applications involve multiple data sources (event logs, user databases, etc.), as discovered by our survey in Section 2.2. Synthesizers support the join operations required in working with multiple data sources by allowing users to make Examples using the SUs from multiple DC[SU]s. Conceptually, synthesizers in this case work over the cartesian product of the input DC[SU]s but operate more efficiently in practice. In addition, synthesizers support aggregation operations such as sliding windows in time series, which combine multiple consecutive datapoints to form a single learning example. Synthesizers in this case operate on a single DC[SU] S, but the input space is the power set of elements in S. In simple use cases where an Example is made from a single SU in a single DC[SU], users implicitly declare a pass-through synthesizer by simply naming the output DC[E], as done in Line 15 of Figure 4a). Note that synthesizers are not responsible for converting the raw feature representations into ML-compatible ones, since this conversion is internal to the system and opaque to the user.
Learner. Learners in HELIX handle both learning and inference in a single operator. A Learner L contains an ML model M, which can be populated by learning from the input data or loading from existing sources. When M is empty, L attempts to learn a model using input data designated for model training; when M is populated, L performs inference on the input data using M and outputs the inference results into a DC[E]. incPred in Figure 4a) Line 19 is a Learner trained on the "train" portion of the DC[E] income that outputs inference results as the DC[E] predictions.

Existing frameworks (e.g., ScikitLearn [25], KeystoneML [31]) distinguish the two modes of operation via the concepts of Estimators for learning and Transformers for inference. This is necessary when the training and test data are handled separately. We eliminate this overhead by providing unified support for training and testing in HELIX, as discussed in Section 3.

Reducer. Reducers support post-processing operations on inference results, such as accuracy evaluation and visualization. The UDF in a Reducer aggregates data collected from the input DC[E] to produce scalar outputs, similar to the reduce operator in Spark. For example, checkResults in Figure 4a) Line 19 computes the prediction accuracy of the inference results in predictions.

Table 1 summarizes the purposes and signatures of the operators described above, as well as the workflow component they belong to. To declare an operator, the user specifies the name, operator type, and the UDF satisfying the signature of the operator type. For Learners, the user can directly import ML models and learning algorithms from MLlib [21], which supports a large variety of commonly used ML models and algorithms.

Workflow Component | Operator | Purpose
Data Preparation | Scanner : DC[SU] → DC[SU] | Projection, Selection, Parsing, Format Conversion
Data Preparation | Extractor : DC[SU] → DC[SU] | Feature Engineering
Data Preparation | Synthesizer : (DC[SU], DC[SU], ...) → DC[E] | Join, Aggregate
Learning/Inference | Learner : DC[E] → (DC[E], ML Model) | Model Training, Inference
Learning/Inference | Reducer : DC[E] → Scalar | Evaluation, Statistics, Analysis

Table 1: Summary of operators in HELIX.
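The signatures in Table 1 can be summarized as a small Scala type sketch; this is our illustrative simplification, not HELIX's actual class hierarchy.

// Illustrative Scala sketch of the operator signatures in Table 1.
// DC[A] stands for a data collection of elements of type A.
trait DC[A]
trait SU       // Semantic Unit
trait Example
trait MLModel

trait Scanner     { def apply(in: DC[SU]): DC[SU] }            // parse, project, select, convert
trait Extractor   { def apply(in: DC[SU]): DC[SU] }            // feature engineering
trait Synthesizer { def apply(in: Seq[DC[SU]]): DC[Example] }  // join, aggregate into Examples
trait Learner     { def apply(in: DC[Example]): (DC[Example], MLModel) } // train or infer
trait Reducer     { def apply(in: DC[Example]): Double }       // evaluation, statistics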

3.3 Syntax & Semantics

Statements in HML either declare new HELIX objects and/or specify the relationship between existing ones via linking expressions. Expressions are infixed to mimic natural language, e.g., word has_extractors (first_letter, last_letter). Both DCs and operators are referenced by their string identifiers in HML, as shown in the syntax specifications in Figure 8 (located in Appendix A). The complete set of linking expressions, along with their usages and functions, is listed in Table 2. With just a handful of object types and linking expressions, HML supports a wide range of use cases encompassing supervised and unsupervised learning and many distinct application domains.

4. COMPILATION AND REPRESENTATION

In this section, we discuss the compilation process by which a Workflow programmed by the user is transformed into HELIX's internal representation, which enables the workflow optimizations described in Section 5.

4.1 The Workflow DAG

At compile time, HELIX's Intermediate Code Generator constructs a DAG from the HML declarations, with nodes corresponding to operator outputs, which can be DCs, scalars, or ML models, and edges indicating the input-output relationships between the operators. We refer to this DAG as the Workflow DAG:

DEFINITION 1. For a Workflow W containing HELIX operators $F = \{f_i\}$, the Workflow DAG is a directed acyclic graph $G_W = (N, E)$, where node $n_i \in N$ represents the output of $f_i \in F$ and $(n_i, n_j) \in E$ if the output of $f_i$ is an input to $f_j$.

Figure 4b) shows an example of the Workflow DAG, in which the node data corresponds to a data source DC loaded from storage, and all other nodes are the outputs of operators declared in Figure 4a). Nodes for operators involved in DPR are in purple whereas those involved in L/I and PPR are in orange.

Constructing the Workflow DAG. The transformation from HML into a Workflow DAG is largely a straightforward process, in which a node is created for each declared operator and edges are constructed between these nodes based on the linking expressions, e.g., A results_from B creates an edge (B, A). Additionally, the compiler introduces edges not specified in the Workflow between Extractor and Synthesizer nodes. The edges marked by dots in Figure 4b) are examples of such edges. These edges ensure that Examples inherit all features of the constituent SUs, as mentioned in Section 3.1. However, automatic feature aggregation could potentially lead to running Extractors that do not contribute to the final output. This is counteracted by the pruning mechanisms in HELIX, to be discussed next.
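A minimal sketch of this construction, assuming each declaration is reduced to a record of its name, its declared inputs, and (for Extractors) the DC it is attached to via has_extractors; the actual compiler works over richer HML objects.

// Sketch: building the Workflow DAG from simplified declaration records.
// An edge (src, dst) means dst consumes src's output.
case class Declaration(name: String,
                       inputs: Seq[String],
                       isExtractor: Boolean,
                       extractsFor: Option[String] = None) // set via has_extractors

def buildDag(decls: Seq[Declaration]): (Set[String], Set[(String, String)]) = {
  val nodes = decls.map(_.name).toSet
  // Edges from explicit linking expressions, e.g., `A results_from B` yields (B, A).
  val declared = for (d <- decls; in <- d.inputs) yield (in, d.name)
  // Edges the compiler adds between Extractors and the operators that build
  // Examples from their DC, so Examples inherit all extracted features.
  val implied = for {
    d <- decls if d.isExtractor
    target <- d.extractsFor.toSeq
  } yield (d.name, target)
  (nodes, (declared ++ implied).toSet)
}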
4.2 Pruning

HELIX prunes extraneous operators by applying program slicing on the Workflow DAG. In a nutshell, HELIX traverses the DAG backwards from the output nodes and prunes away any nodes not visited in this traversal, as sketched below. Users can explicitly guide this process in the programming interface through the has_extractors and uses keywords, described in Table 2. An example of an Extractor pruned in this fashion is raceExt (grayed out) in Figure 4b), as it is excluded from the rows has_extractors statement. This allows users to conveniently perform manual feature selection using domain knowledge.
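A sketch of the backward traversal, reusing the node-and-edge representation from Section 4.1 (our simplification):

// Sketch: program slicing on the Workflow DAG. Starting from the output
// nodes, walk edges backwards; anything never visited is pruned.
def liveNodes(edges: Set[(String, String)], outputs: Set[String]): Set[String] = {
  // parents(n) = set of nodes whose output feeds n
  val parents = edges.groupBy(_._2).map { case (dst, es) => dst -> es.map(_._1) }
  def visit(frontier: Set[String], seen: Set[String]): Set[String] =
    if (frontier.isEmpty) seen
    else {
      val next = frontier.flatMap(n => parents.getOrElse(n, Set.empty)) -- seen
      visit(next, seen ++ next)
    }
  visit(outputs, outputs)
}

// Everything not reachable backwards from an output can be dropped:
def pruned(all: Set[String], edges: Set[(String, String)], outputs: Set[String]): Set[String] =
  all -- liveNodes(edges, outputs)

On the census workflow, raceExt is not reachable backwards from checked, so it is pruned.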
Data-Driven Pruning. Furthermore, HELIX inspects relevant data to automatically identify operators to prune. The key challenge in data-driven pruning is data lineage tracking across the entire workflow. For many existing systems, once the features are joined to form ML examples, it is difficult to trace features in the learned model back to the operators that produced them. To overcome this limitation, HELIX performs additional provenance bookkeeping to track the operators that led to each feature in the model when converting DPR output to ML-compatible formats. An example of data-driven workflow optimization enabled by this bookkeeping is pruning features by model weights. Operators resulting in features with zero weights can be pruned without changing the prediction outcome, thus lowering the overall runtime without compromising model performance.

Data-driven pruning is a powerful technique that can be extended to unlock many more impactful automatic workflow optimizations. Possible future work includes using this technique to minimize online inference time in large-scale, high-QPS settings and to adapt the workflow online in stream processing.
Phrase | Usage | Operation | Example
refers_to | string refers_to HELIX object | Register a HELIX object to a string name. | "ext1" refers_to Extractor(...)
is_read_into ... using | DCi[SU] is_read_into DCj[SU] using scanner | Apply scanner on DCi to obtain DCj. | "sentence" is_read_into "word" using whitespaceTokenizer
has_extractors | DC[SU] has_extractors extractor+ | Apply extractors to DC. | "word" has_extractors ("ext1", "ext2")
on | synthesizer/learner/reducer on DC[*]+ | Apply synthesizer/learner on input DC(s) to produce an output DC[E]. | "match" on ("person_candidate", "known_persons")
results_from | DCi[E] results_from DCj[*] [with_label extractor] | Wrap each element in DCj in an Example and optionally label the Examples with the output of extractor. | "income" results_from "rows" with_label "target"
results_from | DC[E]/Scalar results_from clause | Specify the name for clause's output DC[E]. | "learned" results_from "L" on "income"
uses | synthesizer/learner/reducer uses extractors+ | Specify synthesizer/learner/reducer's dependency on the output of extractors+ to prevent pruning or uncaching of intermediate results due to optimization. | "match" uses ("ext1", "ext2")
is_output | DC[*]/result is_output | Require DC/result to be materialized. | "learned" is_output

Table 2: Usage and functions of key phrases in HML. DC[A] denotes a DC with name DC and elements of type A ∈ {SU, E}, with A = * indicating both types are legal. x+ indicates that x appears one or more times. When appearing in the same statement, on takes precedence over results_from.
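Read in combination, these phrases compose into complete statements. The fragment below is a hypothetical HML-style sketch (the refers_to declarations for the named operators and extractors are elided), stringing the key phrases together using the same names as Table 2's examples:

object PersonMatcher extends Workflow {
  // is_read_into ... using: apply a Scanner (a one-to-many parse).
  sentence is_read_into word using whitespaceTokenizer
  // has_extractors: attach feature extractors to a DC[SU].
  word has_extractors (ext1, ext2)
  // on + results_from: a Synthesizer joins two DC[SU]s into Examples;
  // results_from names the output DC[E].
  matches results_from matcher on (person_candidate, known_persons)
  // uses: declare an opaque UDF's dependencies so they are neither
  // pruned nor prematurely uncached.
  matcher uses (ext1, ext2)
  // A Learner bound to its input via on; is_output forces materialization.
  learned results_from personLearner on matches
  learned is_output()
}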

Cache Pruning. While Spark provides automatic data uncaching via a least-recently-used (LRU) scheme, HELIX improves upon this performance by actively managing the set of data to evict from cache. From the DAG, HELIX can detect when a node n becomes out-of-scope.

DEFINITION 2. Given a Workflow DAG $G_W = (N, E)$, $n_i \in N$ is out-of-scope at runtime if all children of $n_i$ have either been computed or reloaded from disk, thus allowing $n_i$ to be safely removed from the cache without impacting performance.

Upon the completion of every operator, HELIX analyzes the DAG to uncache newly out-of-scope nodes. Combined with the lazy evaluation order, the intermediate results for an operator reside in cache only when they are immediately needed for a dependent operator.

One limitation of this eager eviction scheme is that any dependencies undetected by HELIX, such as the ones created in a UDF, can lead to premature uncaching of DCs before they are truly out-of-scope. The uses keyword in HML, described in Table 2, provides a mechanism for users to manually prevent this by explicitly declaring a UDF's dependencies on other operators. In the future, we plan on providing automatic UDF dependency detection via introspection.
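A sketch of the eviction test of Definition 2, assuming the engine tracks the set of nodes that have finished, i.e., been computed or reloaded (our simplification):

// Sketch: eager cache eviction per Definition 2. After each operator
// finishes, evict every cached node whose children have all finished.
def newlyOutOfScope(children: Map[String, Set[String]],
                    finished: Set[String],
                    cached: Set[String]): Set[String] =
  cached.filter { n =>
    children.getOrElse(n, Set.empty).forall(finished.contains)
  }

// Called by the engine after every operator completes. Note that
// UDF-created dependencies invisible to the DAG must be declared via
// `uses`, or this eviction may fire too early.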
programs [41, 26, 9]. H ELIX currently employs a simple repre-
4.3 Tracking Changes sentational equivalence verification — an operator remains opera-
As described in Section 2.4, a user starts with an initial workflow tionally equivalent across iterations if its declaration in the DSL is
W0 and iterates on the workflow by making incremental changes not modified and all of its ancestors are unchanged. We plan to
based on the results obtained from the current version. Let Wt be incorporate more advanced programming languages techniques for
the version of the workflow at iteration t ≥ 0 with the correspond- verifying operator equivalence as future work.
ing DAG Gtw = (Nt , Et ). Thus, Wt+1 denotes the workflow ob-
tained in the next iteration. To describe the changes between Wt
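This conservative check composes naturally over the DAG; a sketch, assuming each node carries the source text of its DSL declaration (our simplification):

// Sketch: representational operator equivalence across iterations t and t+1.
// A node is equivalent iff its DSL declaration is unchanged and all of its
// parents are (recursively) equivalent.
case class Node(name: String, declaration: String, parents: Seq[Node])

def equivalent(prev: Map[String, Node], cur: Node,
               memo: collection.mutable.Map[String, Boolean] =
                 collection.mutable.Map.empty): Boolean =
  memo.getOrElseUpdate(cur.name,
    prev.get(cur.name) match {
      case Some(old) =>
        old.declaration == cur.declaration &&
          cur.parents.forall(p => equivalent(prev, p, memo))
      case None => false // newly added operator: no equivalent node
    })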
[Figure 5 (diagram of a Workflow DAG over nodes n1-n8).]
Figure 5: A legal state assignment for a Workflow DAG. Red nodes are loaded, blue nodes are computed, and dashed nodes are pruned. Loading n7 allows us to prune everything above the red dashed line; however, computing n8 requires that n5 must not be pruned.

5. OPTIMIZATION AND EXECUTION

In this section, we describe HELIX's workflow-level optimizations. Recall that these optimizations are motivated by the simple observation that workflows often share a large amount of intermediate computation between iterations. Thus, if certain intermediate results are chosen at time t to be materialized on disk, the HELIX optimizer can choose to reuse these results at time t + 1 if they are still valid, reducing workflow execution time. In the following sections, we first provide an intuitive performance model that captures reuse. Using this model, we present HELIX's optimization strategies for choosing the optimal set of reusable results to minimize run time and for selecting intermediate results to materialize to accelerate future iterations.

5.1 Workflow Performance Model
In this section, we describe H ELIX’s workflow execution perfor- each node ni ∈ G, introduce binary indicator variables Ai and Bi
mance model. The iteration number t is omitted when the model defined as follows:
treats the workflow as static.
Ai = I {s(ni ) = Sp }
Operator Run Time. For a Workflow DAG Gw = (N, E) in- Bi = I {s(ni ) 6= Sc }
troduced in Section 4.1, each node ni ∈ N corresponding to the
output of the operator fi is associated with a load time li ≥ 0, the That is, Ai = 1 if node ni is pruned, and Bi = 1 if node ni is
time it takes to read ni from disk, and a compute time ci ≥ 0, the either pruned or loaded from storage, but not computed. Note that
time it takes to compute ni from inputs. If ni has not been ma- it is not possible to have Ai = 1 and Bi = 0. Also note that these
terialized on disk previously, we set li = ∞. For simplicity, we variables uniquely determine node ni ’s state s(ni ).
assume the time to write ni to disk is also li for li 6= ∞. With the {Ai } and {Bi } thus defined, our ILP is as follows:
Operator State. In the execution of workflow W , each node ni
|N |
assumes one of the following states: X
• Load, or Sl , if ni is loaded from disk;
maximize Ai li + Bi (ci − li ) (3a)
Ai , Bi i=1
• Compute, or Sc , ni is computed from inputs;

• Prune, or Sp , if ni is skipped (neither loaded nor computed). subject to Ai , Bi ∈ {0, 1}, 1 ≤ i ≤ |N |, (3b)
Let s(ni) ∈ {Sl, Sc, Sp} denote the state of each ni ∈ N. To ensure that nodes in the Compute state have their inputs available, i.e., not pruned, the states in a Workflow DAG GW = (N, E) must satisfy the following execution state constraint:

CONSTRAINT 1. For a node ni ∈ N, if s(ni) = Sc, then s(nj) ≠ Sp for every nj ∈ parents(ni).

Workflow Run Time. A node ni in state Sc, Sl, or Sp has run time ci, li, or 0, respectively. The total run time of W w.r.t. s is thus
$$T(W, s) = \sum_{n_i \in N} \mathbb{I}\{s(n_i) = S_c\}\, c_i + \mathbb{I}\{s(n_i) = S_l\}\, l_i \quad (1)$$
where I{} is the indicator function.

Choosing to load certain nodes can have cascading effects, since all ancestors of a loaded node can potentially be pruned, leading to large reductions in run time. On the other hand, Constraint 1 prevents the pruning of parents of computed nodes. Thus, the decision to load a node can be affected by nodes outside its dependency chain. For example, in Figure 5, loading n7 allows n1–n6 to be pruned, which can lead to a much lower run time. However, the decision to compute n8, possibly stemming from the fact that l8 ≪ c8, requires that n5 not be pruned.

5.2 Optimal Execution Plan
The Optimal Execution Plan (OEP) problem is the core problem solved by HELIX's DAG optimizer, which determines the optimal execution plan given results and statistics from previous iterations.

PROBLEM 1. (OPTIMAL-EXECUTION-PLAN) Given a Workflow W with DAG GW = (N, E), find a state assignment s for N that minimizes T(W, s) while satisfying Constraint 1.

Let T*(W) be the minimum execution time achieved by the solution to OEP, i.e.,
$$T^*(W) = \min_s T(W, s) \quad (2)$$

Note that since this optimization takes place prior to execution, we must resort to operator statistics from past iterations. If a node has no equivalent operator (as defined in Definition 3) in the last iteration, we set its compute cost to 0 and its load time to ∞, which forces the node to be computed.

Integer Linear Programming Formulation. Problem 1 can be formulated as an integer linear program (ILP) as follows. First, for each node ni ∈ N, introduce binary indicator variables Ai and Bi, where Ai = 1 iff s(ni) = Sp and Bi = 1 iff s(ni) ≠ Sc. The ILP maximizes
$$\sum_{i=1}^{|N|} A_i l_i + B_i (c_i - l_i) \quad (3a)$$
subject to
$$A_i, B_i \in \{0, 1\}, \quad 1 \le i \le |N|, \quad (3b)$$
$$A_i = 0, \quad \forall n_i \in \{\text{outputs}\}, \quad (3c)$$
$$B_i = 1, \quad \forall n_i \in \{\text{inputs}\}, \quad (3d)$$
$$B_i - A_i \ge 0, \quad 1 \le i \le |N|, \quad (3e)$$
$$B_i - A_j \ge 0, \quad \forall n_j \in \text{Pa}(n_i), \ 1 \le i \le |N| \quad (3f)$$

Equation (3b) follows from the definition of Ai and Bi, Equation (3c) ensures that output nodes are never pruned, Equation (3d) ensures that input nodes are never computed (they have no inputs from which they could be computed), Equation (3e) ensures that a node ni cannot simultaneously have s(ni) = Sp and s(ni) = Sc, and Equation (3f) is equivalent to Constraint 1. By negating Equation (3a) and adding the constant $\sum_{i=1}^{|N|} c_i$, we see that maximizing Equation (3a) is equivalent to
$$\min_{A_i, B_i} \sum_{i=1}^{|N|} (B_i - A_i) l_i + (1 - B_i) c_i \quad (4)$$
which corresponds to the execution time as given in Equation (2).

Although ILPs are, in general, NP-Hard, we will shortly show that this particular ILP can be reduced to the project-selection problem, which is solvable via a known reduction to MAX-FLOW [13]. In the project selection problem, we are given a DAG of projects and need to pick the subset of projects that maximizes some notion of profit. We elaborate below.

DEFINITION 4. A project DAG is a directed acyclic graph where each node is associated with some real-valued (possibly negative) profit, and an edge a → b between nodes a and b indicates that b is a prerequisite for a; i.e., before we can select project a, we must first select project b.

Given a project DAG, we want to select a set of projects in order to maximize profit. We make this notion formal below.

PROBLEM 2. (PROJECT-SELECTION-PROBLEM (PSP)) Given a project DAG G, select a (possibly empty) subset of the projects such that the prerequisites of the selected projects are also selected, and the sum of the profits of the selected projects is as large as possible. That is, if project a is selected, then each project b with a → b is also selected.

Reducing the ILP to the Project Selection Problem. Each instance of Problem 1 can be transformed into an equivalent instance of Problem 2, in the sense that an optimal solution to the equivalent PROJECT-SELECTION-PROBLEM instance maps to an optimal solution of the OPTIMAL-EXECUTION-PLAN instance in question. The reduction, φ, is as follows:
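To make the cost model concrete, the following is a minimal Scala sketch of checking Constraint 1 and evaluating Equation (1) for a candidate state assignment; the Node representation here is a simplified stand-in for HELIX's internal operator DAG, not its actual classes.

    // Simplified stand-ins for the three node states of Section 5.1.
    sealed trait State
    case object Compute extends State // S_c: run the operator, paying c_i
    case object Load    extends State // S_l: load a materialized result, paying l_i
    case object Prune   extends State // S_p: skip the node entirely, paying 0

    case class Node(id: Int, c: Double, l: Double, parents: Seq[Int])

    // Constraint 1: a computed node must not have any pruned parent.
    def feasible(nodes: Seq[Node], s: Map[Int, State]): Boolean =
      nodes.forall { n =>
        s(n.id) != Compute || n.parents.forall(p => s(p) != Prune)
      }

    // Equation (1): sum c_i over computed nodes plus l_i over loaded nodes.
    def runTime(nodes: Seq[Node], s: Map[Int, State]): Double =
      nodes.map { n =>
        s(n.id) match {
          case Compute => n.c
          case Load    => n.l
          case Prune   => 0.0
        }
      }.sum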
1. Create a bipartite graph partitioned by sets A and B. For each variable Ai, create a corresponding project node ai ∈ A, and for each variable Bi, create a corresponding project node bi ∈ B.

2. For each Ai not appearing in Equation (3c), assign a profit of li to the corresponding ai. For each Bi not appearing in Equation (3d), assign a profit of ci − li. This corresponds to the objective given in Equation (3a).

3. For each Ai appearing in Equation (3c), assign a profit of −1 to the corresponding ai. For each Bi appearing in Equation (3d), assign a profit of 1 to the corresponding project bi.

4. For each Ai and Bi appearing in a constraint represented by Equation (3e), draw an edge between the corresponding project nodes ai and bi as ai → bi.

5. For each Aj and Bi appearing in a constraint represented by Equation (3f), draw an edge between the corresponding project nodes aj and bi as aj → bi.

Given a solution to φ(P), we construct the solution to P by setting Ai = I{project ai was selected} and Bi = I{project bi was selected}.

Intuition Behind the Reduction. At a high level, constraints represented by any of Equations (3e) and (3f) map to dependency edges in the project DAG — e.g., ai cannot be performed before bi is performed, corresponding to the constraint Bi − Ai ≥ 0. Assigning a profit of −1 to each ai corresponding to an Ai constrained to be 0 ensures that an optimal project selection will, in fact, not select that ai (so that Ai = 0), and likewise, assigning a profit of 1 to each bi corresponding to a Bi constrained to be 1 ensures that an optimal project selection will, in fact, select that bi.

Execution Plan Interpretation. In terms of the original execution workflow, one can imagine starting with all nodes in state Sc, save for the inputs, which are in state Sl. One can then work one's way down the execution DAG, performing jobs ai and bi — performing a job of type bj will involve setting some node nj to state Sl, which may be costly if lj > cj and represents "savings" of cj − lj otherwise; but if node nj has a parent ni without any other children in state Sc, then node ni can be pruned — representing a job of type ai and corresponding to savings of li.

We are now ready to prove the correctness of this reduction.

THEOREM 1. Given an instance of OPTIMAL-EXECUTION-PLAN P, an optimal, feasible selection of projects in problem φ(P) maps back to a feasible and optimal assignment of Ai and Bi when setting each Ai = 1 if project ai was selected (0 otherwise) and each Bi = 1 if project bi was selected (0 otherwise).

PROOF. See appendix.

Computational Complexity. The reduction in the previous discussion does not change (asymptotically) the number of edges or nodes when transforming Problem 1 to Problem 2. A similar claim can be made for transforming Problem 2 to MAX-FLOW, indicating that the complexity characteristic of OPTIMAL-EXECUTION-PLAN is equivalent to that of MAX-FLOW. We prove these claims formally below.

THEOREM 2. For a given execution DAG G = (N, E), the instance of MAX-FLOW that we map P to will lead to a graph G′ with |N′| = O(|N|) and |E′| = O(|E|).

PROOF. To see this, note that the execution DAG must already have Ω(|N|) edges if it is connected, so when we map it to project selection, adding O(|N|) additional edges does not change the asymptotic number of edges. Similarly, the number of nodes added is also O(|N|). Finally, the reduction to MAX-FLOW described in [13] adds O(|N|) edges to E′, so that overall |E′| = O(|E|). A constant number of nodes are added (one source and one sink), so that overall |N′| = O(|N|).

We use the Edmonds–Karp algorithm [6] for solving MAX-FLOW in our implementation, which runs in time O(|N| · |E|²).
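For concreteness, here is a Scala sketch of the reduction φ, reusing the simplified Node representation from the earlier sketch; the Project class and string naming scheme are our own illustrative choices, and solving the resulting instance is left to any standard implementation of the project-selection-to-MAX-FLOW reduction in [13].

    // Sketch of φ (steps 1-5): one project per ILP variable, profits per
    // steps 2-3, and dependency edges per steps 4-5.
    case class Project(name: String, profit: Double)

    def phi(nodes: Seq[Node], outputs: Set[Int], inputs: Set[Int])
        : (Seq[Project], Seq[(String, String)]) = {
      // Step 1-3: projects a_i and b_i with their profits.
      val a = nodes.map { n =>
        n.id -> Project(s"a${n.id}", if (outputs(n.id)) -1.0 else n.l)
      }.toMap
      val b = nodes.map { n =>
        n.id -> Project(s"b${n.id}", if (inputs(n.id)) 1.0 else n.c - n.l)
      }.toMap
      // Step 4: edges a_i -> b_i for constraints (3e).
      // Step 5: edges a_j -> b_i for each parent n_j of n_i, per (3f).
      val edges =
        nodes.map(n => (s"a${n.id}", s"b${n.id}")) ++
        nodes.flatMap(n => n.parents.map(p => (s"a$p", s"b${n.id}")))
      ((a.values ++ b.values).toSeq, edges)
    }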
5.3 Optimal Materialization Plan
The Optimal Materialization Plan (OMP) problem is tackled by HELIX's materialization optimizer while running workflow Wt at iteration t. Intermediate results are selectively materialized for the purpose of accelerating executions in iterations > t. Materialization decisions are made on the fly during execution, so that the data size and operator run time are accurate and out-of-scope (see Definition 2) results are not cached for the sole purpose of materialization later.

To quantify the benefit of materializing intermediate results at time t on subsequent iterations, we introduce the materialization cost TM(Wt), which captures the tradeoff between the additional time to materialize intermediate results and the run time reduction in iteration t + 1.

DEFINITION 5. Given a workflow Wt and a subset of nodes M ⊆ Nt, the materialization cost is defined as
$$T_M(W_t) = \sum_{n_i \in M} l_i + T^*(W_{t+1}) \quad (5)$$
where T*(·) is the optimal workflow run time found using the algorithm in Section 5.2.

PROBLEM 3. (OPTIMAL-MATERIALIZATION-PLAN) Given a Workflow Wt with DAG GtW = (Nt, Et) at iteration t and a storage budget S, find a subset of nodes M ⊆ Nt to materialize at t in order to minimize TM(Wt) such that $\sum_{n_i \in M} l_i \le S$.

Let M* be the optimal solution to OMP, i.e.,
$$M^* = \operatorname*{argmin}_{M \subseteq N_t} \sum_{n_i \in M} l_i + T^*(W_{t+1}) \quad (6)$$

As we have already seen in Section 2.2, there are many possibilities for Wt+1, and the modus operandi differs for each application domain. User modeling and predictive analysis of Wt+1 itself is a substantial research topic that we will address in follow-up work. Such a user model can be incorporated into OMP by using the predicted changes to better estimate the likelihood of reuse for each operator. For the scope of this work, we sidestep the user-model issue by making the simplifying assumption that
$$G^{t+1}_W = G^t_W \quad (7)$$
Under this assumption, we achieve maximum reusability of materialized intermediate results, since all operators between t and t + 1 are equivalent. Note that this assumption still allows us to create algorithms that are effective in practice — the changes from Wt to Wt+1 are often small, because users tend to study the effect of a single component in one iteration.

NP-Hardness. Even with the simplifying assumption in Eq (7), OMP is NP-Hard, which we show through a reduction from the known NP-hard problem Knapsack.

PROBLEM 4. (Knapsack) Given a knapsack capacity B and a set N of n items, with each i ∈ N having a size si and a profit pi, find
$$S^* = \operatorname*{argmax}_{S \subseteq N} \sum_{i \in S} p_i \quad (8)$$
subject to $\sum_{i \in S} s_i \le B$.

Figure 6: OMP DAG for Knapsack reduction. A single input node n0 (l0 = ε < mini si) feeds output nodes n1, ..., nN with li = si and ci = pi + 2si.

For an instance of Knapsack, we construct a simple Workflow DAG W as shown in Figure 6. For each item i in Knapsack, we construct an output node ni with li = si and ci = pi + 2si. We add an input node n0 with l0 = ε < mini si on which all output nodes depend. Let Yi ∈ {0, 1} indicate whether node ni ∈ M in the optimal solution to OMP in Eq (6), and let Xi ∈ {0, 1} indicate whether item i is picked in the Knapsack problem. We use B as the storage budget, i.e., $\sum_i Y_i l_i \le B$.

THEOREM 3. We obtain an optimal solution to the Knapsack problem by setting Xi = Yi for all i ∈ {1, 2, . . . , n}.

PROOF. First, we observe that for each ni, T*(W) will pick min(li, ci) given the flat structure of the DAG. By construction, min(li, ci) = li in our reduction. Second, materializing ni in the first iteration helps only when ni is loaded in the second iteration. Thus, we can rewrite Eq (6) as
$$\operatorname*{argmin}_{\mathbf{Y} \in \{0,1\}^N} \left( \sum_{i=1}^{N} Y_i l_i + \sum_{i=1}^{N} Y_i l_i + (1 - Y_i) c_i \right) \quad (9)$$
where Y = (Y1, Y2, . . . , YN). Substituting in our choices of li and ci in terms of pi and si in (9), we obtain $\operatorname*{argmin}_{\mathbf{Y} \in \{0,1\}^N} \sum_{i=1}^{N} -Y_i p_i$. Clearly, satisfying the storage constraint also satisfies the budget constraint in Knapsack by construction. Thus, the optimal solution to OMP as constructed gives the optimal solution to Knapsack.
Streaming constraint. As mentioned above, we want to avoid keeping intermediate results in cache solely for the purpose of materialization later, which imposes undue pressure on memory and cripples performance. Thus, we impose the following constraint on the materialization optimizer:

CONSTRAINT 2. Once ni becomes out-of-scope, it is either materialized immediately or removed from cache.

OMP Heuristics. We now describe the heuristic employed by HELIX to approximate OMP, which is NP-Hard, while satisfying Constraint 2. First, we introduce some notation. Given a Workflow DAG GW = (N, E), we denote the operator run time as t(ni) ∈ {li, ci, 0}.

DEFINITION 6. Given a Workflow DAG GW = (N, E), the cumulative run time for a node ni is defined as
$$C(n_i) = t(n_i) + \sum_{n_j \in \text{ancestors}(n_i)} t(n_j) \quad (10)$$

Algorithm 1: Streaming OMP
  Data: GW = (N, E), li, ci, S
  1: M ← ∅
  2: while workflow is running do
  3:   O ← FindOutOfScope(N)
  4:   for ni ∈ O do
  5:     if C(ni) − 2li ≥ 0 and S − li ≥ 0 then
  6:       Materialize ni
  7:       M ← M ∪ {ni}
  8:       S ← S − li
  9:     end
  10:  end
  11: end

Algorithm 1 shows the heuristic employed by HELIX's materialization optimizer to decide what intermediate results to materialize. In a nutshell, Algorithm 1 decides to materialize a node if twice its load cost is less than its cumulative run time. The intuition behind this rule is that, assuming loading a node allows all of its ancestors to be pruned, the materialization time in iteration t and the load time in iteration t + 1 combined should be less than the total pruned compute time for the materialization to be cost effective. Intricate dependencies between descendants of ancestors, such as the one between n8 and n5 in Figure 5, are ignored by Algorithm 1 because of the streaming constraint — we cannot retroactively update our decision once n8 has been run. We show in the experiments that this simple algorithm is effective in multiple application domains.
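Below is a minimal Scala sketch of this heuristic; OpStats and the materialize callback are simplified stand-ins for HELIX's internals, and runTime supplies t(ni) from the notation above.

    // Sketch of Algorithm 1 (Streaming OMP); not HELIX's actual classes.
    case class OpStats(id: Int, load: Double, ancestors: Set[Int])

    class StreamingOMP(var budget: Double,
                       runTime: Int => Double,     // t(n_i) ∈ {l_i, c_i, 0}
                       materialize: Int => Unit) {
      // Equation (10): cumulative run time C(n_i).
      private def cumulative(n: OpStats): Double =
        runTime(n.id) + n.ancestors.toSeq.map(runTime).sum

      // Invoked once per node as it goes out of scope (Constraint 2):
      // materialize immediately or let the result drop from cache.
      def onOutOfScope(n: OpStats): Unit =
        if (cumulative(n) - 2 * n.load >= 0 && budget - n.load >= 0) {
          materialize(n.id)
          budget -= n.load
        }
    }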
6. EXPERIMENTAL RESULTS
To assess HELIX's ability to accelerate iterative ML application development, we evaluate its runtime performance on four distinct workflows and compare against two similar systems, DeepDive and KeystoneML. Table 3 summarizes the characteristics of the four workflows used for evaluation, as well as the three systems' ability to support each workflow.

6.1 Workflows
The four workflows studied span a diverse range of application domains with distinct characteristics to test each system's versatility. KeystoneML, which focuses on large-scale classification problems, does not provide a programming interface conducive to structured prediction problems. On the other hand, DeepDive is highly specialized for information extraction and therefore inflexible in the type of ML tasks it supports.

6.1.1 Census Workflow
This workflow illustrates a simple classification task with straightforward features from structured input. The dataset from [14], hosted by the UCI Machine Learning Repository [5], contains 14 continuous and categorical attributes representing demographic information, such as age, education, and occupation. The classification task is to use demographic information to predict whether a person's annual income is >50K. The complexity of this application is representative of applications from the social and natural sciences, where well-defined variables are studied for covariate analysis. Code for the initial version of this workflow is shown in Figure 4(a).
6.1.2 NLP Workflow
                                   | Census                     | NLP                   | MNIST                      | Genomics
Num. Data Source                   | Single                     | Multiple              | Single                     | Multiple
Input Rec. to ML Example Mapping   | One-to-One                 | One-to-Many           | One-to-One                 | One-to-Many
Feature Granularity                | Fine Grained               | Fine Grained          | Coarse Grained             | N/A
Learning Type                      | Supervised; Classification | Structured Prediction | Supervised; Classification | Unsupervised
Application Domain                 | Social Sciences            | NLP                   | Computer Vision            | Natural Sciences
Supported by HELIX                 | X                          | X                     | X                          | X
Supported by KeystoneML            | X                          |                       | X                          | X
Supported by DeepDive              | X                          | X                     |                            |
Table 3: Summary of workflow characteristics and support by the systems compared.
Figure 7: Cumulative run time in log scale for the four workflows studied. To indicate the type of change in each iteration, we color the area
under the curve purple for DPR, orange for ML, and green for PPR. DeepDive has missing data for iterations > 2 because its learning and
evaluation components are not user configurable.
This is a complex structured prediction task that identifies mentions of spouse pairs from news articles. DeepDive provides a detailed tutorial describing the steps involved in this workflow³. In contrast to Census, the input to this workflow is unstructured text, and the objective is to extract structured information instead of performing simple classification. Thus, this workflow requires more data pre-processing steps to enable learning, which mirrors the industry application setting, in which extensive data ETL is necessary.

In our evaluations, the input articles have been pre-processed using the Bazaar parser⁴ prior to being ingested by both HELIX and DeepDive. This is to exclude from the overall run time the time for NLP parsing, which both systems perform by invoking the same Scala library.

³https://github.com/HazyResearch/deepdive/blob/master/doc/example-spouse.md
⁴https://github.com/HazyResearch/bazaar

6.1.3 MNIST Workflow
The MNIST dataset contains images of handwritten digits to be classified by an ML model, which is a well-studied task in the computer vision community. The workflow, implemented in both
KeystoneML and HELIX, involves very little data pre-processing, and the majority of the time is spent on learning the model. We include this application in our evaluations to ensure that, in the extreme case where there is little reuse in each iteration, HELIX does not behave suboptimally.

6.1.4 KnowEng Workflow
This workflow is described in Example 1. It involves two major steps: 1) splitting the input articles into words and learning vector representations for each word using word2vec [22]; 2) clustering the vector representations of genes using K-Means to identify functional similarity. This workflow also has minimal data pre-processing but involves multiple learning steps, unlike MNIST. Additionally, each input record, which is an article, maps onto many training examples, which are gene names.

6.2 Experiments

6.2.1 Simulating iterative development
The modus operandi for all application domains represented in the workflows is studied in our survey in Section 2.2. We convert the iteration counts in Figure 1 into fractions and use them to create user models for each domain, with the fractions representing the likelihood of an iteration containing a certain type of change. These user models are used to determine the type of change in each iteration as follows. At each iteration, we draw an iteration type from {DPR, L/I, PPR} according to the user model for the application domain of the workflow, then randomly choose an operator of the drawn type and modify the code for that operator. For example, if an "L/I" iteration were drawn, we could choose to change the regularization parameter for the ML model.
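To illustrate, here is a Scala sketch of this sampling procedure; the probabilities shown are placeholders rather than the actual fractions derived from Figure 1.

    import scala.util.Random

    // Iteration types, matching the categories used in our user models.
    sealed trait IterationType
    case object DPR extends IterationType // data pre-processing change
    case object LI  extends IterationType // learning/inference change
    case object PPR extends IterationType // post-processing change

    // Hypothetical user model for one domain; fractions are placeholders.
    val userModel: Seq[(IterationType, Double)] =
      Seq(DPR -> 0.5, LI -> 0.3, PPR -> 0.2)

    // Draw one iteration type according to the model's probabilities.
    def drawIterationType(rng: Random): IterationType = {
      val r = rng.nextDouble()
      var acc = 0.0
      for ((t, p) <- userModel) {
        acc += p
        if (r < acc) return t
      }
      userModel.last._1
    }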
6.2.2 Running workflows
All experiments are run on a server with 120GB RAM and 16 CPUs (Intel Xeon @ 2.40GHz), running Spark in local mode. After running the initial version to obtain the run time for iteration 0, a workflow is modified according to the type of change determined as above. In all three systems, the modified workflow is recompiled. In DeepDive, we rerun the workflow using the command deepdive run. In HELIX and KeystoneML, we resubmit a job to Spark in local mode.

6.2.3 Results
Figure 7 shows the cumulative run time in log scale for all four workflows. For the Census workflow, HELIX shows nearly an order of magnitude reduction in cumulative run time over KeystoneML in just nine iterations. Extrapolating from the run time trends, the performance gap can only widen with more iterations. DeepDive has data for only the first two iterations because its learning and evaluation components are opaque and therefore not configurable by the end user. For the same reason, we only include DPR iterations for the NLP workflow, which is supported only by HELIX and DeepDive. The 40–60% per-iteration speedup over DeepDive in the NLP workflow is mainly due to the fact that, unlike DeepDive, HELIX does not need to rely on an external procedural language (Python in the case of DeepDive) for UDFs.

As mentioned above, the bulk of the workflow run time for MNIST is spent on learning the model, which cannot be reused across iterations because the pre-processing is nondeterministic. Thus, this workflow does not benefit from reuse of intermediate results, which become invalid in the next iteration. This workflow is meant to test HELIX's ability to be efficient in the extreme case of zero reuse. Figure 7 demonstrates that HELIX does not wastefully materialize useless intermediate results in this case, intelligently adapting to the particular setting of the application. The small discrepancy observed between KeystoneML and HELIX is mainly due to data provenance overhead, which HELIX maintains to facilitate feature engineering.

The Genomics workflow also has minimal data pre-processing and encompasses multiple learning steps. DeepDive does not appear to support multiple learning steps for models that are not factor graphs or logistic regression; hence we compare HELIX with only KeystoneML for this workflow. The materialize-nothing strategy in KeystoneML clearly leads to no run time reduction in subsequent iterations. HELIX, on the other hand, shows per-iteration run times that are proportional to the number of operators affected by the type of change in that iteration. For example, in PPR iterations (green), in which only the evaluation metric is changed, HELIX has nearly zero run time. Iterations with DPR changes (purple) have the least amount of reusable intermediate results from the last iteration, hence having the largest rerun time of all three iteration types. As expected, the rerun time for ML iterations is somewhere in the middle between DPR and PPR. This trend also holds for the iteration rerun times in the Census workflow.

7. RELATED WORK
End-to-end systems. An end-to-end ML system supports specification and execution for both the data preparation and statistical modeling components of a workflow. Such systems have the potential to simplify workflow development and identify new optimization opportunities that are impossible in siloed systems. Existing end-to-end ML systems impose various limitations on the nature of workflows supported, trading generality for specialization. COLUMBUS [44] focuses on optimizing the feature selection process, limited to regression models; DeepDive [4] specializes in iterative feature engineering for knowledge base construction. MLBase [15] provides optimized support for model selection through its TuPAQ system [32] but does not address needs for complex data preparation. Scikit-learn's comprehensive machine learning and data mining libraries allow users to construct complex workflows entirely within its framework [25]. However, it does not natively support parallel processing, which is the cornerstone of current big data systems. Lower-level systems such as Tensorflow and SystemML afford more programming flexibility but inherently lack robust support for either component. TFLearn [37] provides high-level deep learning APIs on top of Tensorflow, while data preparation still remains a pain point in Tensorflow. Spark [42] provides end-to-end support through the combination of SparkSQL [2], which provides robust support for SQL query execution and DataFrame (DF) manipulations for data preparation, and MLlib [21], which provides a rich learning API. Work such as MADlib and Bismarck injects ML support into single-node RDBMSs, leveraging existing database operations for data preparation. The (object-)relational data model used in these systems flounders in the face of unstructured data, and the subsystems for data preparation and ML are not well integrated. HELIX overcomes these limitations by introducing a well-defined yet general data preparation protocol and primitives for ML model composition, and by leveraging existing specialized systems, including Spark [42] for distributed computation, through seamless interoperability.

KeystoneML [31], a dedicated end-to-end ML system, shares many of its objectives with HELIX, including integrated support for feature engineering and whole-workflow optimization. Operationally, HELIX is better equipped to handle unstructured data; the programming interface to KeystoneML is less structured. Unlike the depth-first execution model in KeystoneML, HELIX optimizes the order of operations at compile time. Consequently, the materialization strategies used in HELIX are driven by program analysis, differing from the data-driven strategies used in KeystoneML. Instead of operator-level optimization as carried out in KeystoneML, HELIX focuses on uncovering novel whole-workflow optimization techniques.
Whole workflow optimization. We identify three main types of whole-workflow optimization carried out by existing end-to-end systems that lead to significant overall performance gains.
• Dead code elimination (DCE). An operation in a workflow is dead code if it does not affect the final results. Systems capable of this optimization [1, 4, 8, 31] support efficient partial executions of workflows. HELIX performs DCE through an approach similar to [1, 4, 8] by analyzing the data and operation dependencies of the output in a bottom-up fashion. Spark and KeystoneML achieve DCE through lazy evaluation.
• Common subexpression elimination (CSE). Oftentimes the output of a set of operations is the input to multiple other operations in the workflow. [1, 4, 8, 31] perform CSE on the operation DAGs they construct internally for execution. HELIX achieves CSE by constructing a data dependency graph, similar to RDD lineage in Spark, that automatically merges code paths with common subexpressions.
• Redundant computation elimination across iterations shortens each cycle in the iterative refinement of the workflow, which can significantly improve developer productivity during model prototyping. Tensorflow achieves this to some extent through feed nodes that accept user-defined input sources and values. DeepDive, backed by an RDBMS, reduces redundancy by building upon incremental view maintenance developed in the database community. COLUMBUS, on the other hand, caches models in main memory for reuse between iterations in the same execution run, where reuse is determined by a similarity measure instead of common code paths. HELIX's mechanism for cross-iteration redundancy elimination is similar to that of Tensorflow, where users can specify paths to obtain results for certain nodes. HELIX skips all computation leading to a node whose value can be loaded from an external source, which can be a file system or DBMS. Other systems do not explicitly address this need.

Programming model. A number of systems and libraries have been built to help nonexperts program machine learning tasks with high-level abstractions. They operate at a higher level of abstraction than declarative systems designed for implementing new ML algorithms such as [34, 8, 33, 1, 30]. To support ML task declaration, MLBase [15] proposes coarse-grained statements such as doClassify to allow its optimizer to perform model selection. DeepDive [4]'s declarative language DDLog complements MLBase's as it specializes in feature extraction queries, with an inference layer opaque to the programmer. HELIX provides language primitives to support complexity in both the data preparation and modeling stages. MADlib [11] enables declarative ML through a library that provides native support for ML algorithms in RDBMSs, while Bismarck [7] aims to achieve a unified architecture to support all ML algorithms in RDBMSs without low-level ad hoc implementations. Such systems inherit the limited expressivity of RDBMS programming. As an embedded DSL, HELIX benefits from the same type of declarative and procedural integration in a single environment boasted by SparkSQL [2]. In the same vein, KeystoneML combines high-level pipeline construction declarations with low-level operator extensions.

Libraries that provide high-level ML APIs [21, 24, 10, 18, 16] allow users with limited ML knowledge to prototype models. Of these libraries, MLlib relies on integration with SparkSQL while Scikit-learn is self-sufficient for end-to-end support. None leads to the optimization of the data preparation component based on model content that HELIX is capable of. By allowing a workflow to be declared end-to-end in a single system, HELIX seizes on opportunities for longer-range optimization.

System                     | Data Proc. Support | ML Support | Distributed | DCE | CSE | Cross Iter. | PM
Columbus [44]              | ★                  | ★          |             |     |     | X           | H
DeepDive [4]               | ★★★                | ★          |             | X   | X   | X           | H
MLBase [15]                | ★                  | ★★         | X           |     |     |             | H
Scikit-Learn [25]          | ★★★                | ★★★        |             |     |     |             | H
Tensorflow [1]             | ★                  | ★★★†       | X           | X   | X   | X           | L
SystemML [8]               | ★                  | ★★         | X           | X   | X   |             | L
MADlib [11]/Bismarck [7]   | ★★                 | ★★         |             |     |     |             | H
MLlib [21] + SparkSQL [2]  | ★★                 | ★★★        | X           | X   | X   |             | H+L
KeystoneML [31]            | ★★                 | ★★★        | X           | X   | X   |             | H+L
HELIX                      | ★★★                | ★★★        | X           | X   | X   | X           | H+L
Table 4: Comparison of end-to-end systems for ML workflows. Of the systems mentioned in Section 7, the ones shown in this table support end-to-end workflows to varying extents. The majority tend to emphasize either data preparation or ML and provide limited support for the other. Lower-level systems such as SystemML and Tensorflow are less skewed but require more effort to program both components. The last column indicates the programming model abstraction supported in each system, with "H" indicating high-level declarative support and "L" indicating low-level procedural support.
†Through TFLearn.

8. CONCLUSION
In this paper we present HELIX, a declarative machine learning framework aimed at accelerating iterative machine learning application development. In addition to its user-friendly, flexible, and succinct programming interface, HELIX tackles two major optimization problems, namely OPTIMAL-EXECUTION-PLAN and OPTIMAL-MATERIALIZATION-PLAN, that together enable cross-iteration optimizations resulting in significant run time reductions for future iterations. We devise a PTIME algorithm to solve OPTIMAL-EXECUTION-PLAN using a reduction to MAX-FLOW. OPTIMAL-MATERIALIZATION-PLAN, which we have shown to be NP-Hard through a reduction from Knapsack, addresses the problem of selecting intermediate results to materialize in order to benefit future iterations. We proposed an efficient linear-time approximation scheme for this problem that the reported experimental results show to be effective. We evaluate HELIX against DeepDive and KeystoneML on four workflows from the social sciences, NLP, computer vision, and the natural sciences that vary greatly in characteristics, to test the versatility and limitations of our system. We found that HELIX supports all four use cases with ease, demonstrating 40–60% cumulative run time reduction on complex learning tasks and nearly an order of magnitude reduction on simpler ML tasks compared to both DeepDive and KeystoneML. We note that although HELIX is implemented in a specific language and uses a specific data processing engine, the techniques and modeling presented in this work are general-purpose; other systems can enjoy the benefits of HELIX's core optimization engines through simple wrappers and connectors.

9. REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[2] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015.
[3] M. Boehm, A. V. Evfimievski, N. Pansare, and B. Reinwald. Declarative machine learning — a classification of basic properties and types. arXiv preprint arXiv:1605.05826, 2016.
[4] C. De Sa, A. Ratner, C. Ré, J. Shin, F. Wang, S. Wu, and C. Zhang. DeepDive: Declarative knowledge base construction. SIGMOD Rec., 45(1):60–67, June 2016.
[5] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
[6] J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248–264, 1972.
[7] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 325–336. ACM, 2012.
[8] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. In 2011 IEEE 27th International Conference on Data Engineering, pages 231–242. IEEE, 2011.
[9] A. D. Gordon. A tutorial on co-induction and functional programming. In Functional Programming, Glasgow 1994, pages 78–95. Springer, 1995.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[11] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, et al. The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment, 5(12):1700–1711, 2012.
[12] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[13] J. Kleinberg and E. Tardos. Algorithm design. Pearson Education, 2006.
[14] R. Kohavi. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 202–207. AAAI Press, 1996.
[15] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. MLbase: A distributed machine-learning system. In CIDR, 2013.
[16] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007.
[17] J. Lin and D. Ryaboy. Scaling big data mining infrastructure: the Twitter experience. ACM SIGKDD Explorations Newsletter, 14(2):6–19, 2013.
[18] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.
[19] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In ACL (System Demonstrations), pages 55–60, 2014.
[20] E. Meijer, B. Beckman, and G. Bierman. LINQ: Reconciling object, relations and XML in the .NET framework. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, SIGMOD '06, pages 706–706, New York, NY, USA, 2006. ACM.
[21] X. Meng, J. Bradley, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine learning in Apache Spark. 2016.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[23] M. A. Munson. A study on the importance of and time spent on different modeling steps. ACM SIGKDD Explorations Newsletter, 13(2):65–71, 2012.
[24] S. Owen, R. Anil, T. Dunning, and E. Friedman. Mahout in Action. 2012.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[26] A. M. Pitts. Operationally-based theories of program equivalence. Semantics and Logics of Computation, 14:241, 1997.
[27] W. Rasband. ImageJ: Image processing and analysis in Java. Astrophysics Source Code Library, 2012.
[28] X. Ren, J. Shen, M. Qu, X. Wang, Z. Wu, Q. Zhu, M. Jiang, F. Tao, S. Sinha, D. Liem, et al. Life-iNet: A structured network-based knowledge exploration and analytics system for life sciences. Proceedings of ACL 2017, System Demonstrations, pages 55–60, 2017.
[29] H. G. Rice. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2):358–366, 1953.
[30] S. Schelter, A. Palumbo, S. Quinn, S. Marthi, and A. Musselman. Samsara: Declarative machine learning on distributed dataflow systems. In ML Systems Workshop at NIPS, 2016.
[31] E. Sparks. End-to-end large scale machine learning with KeystoneML. 2016.
[32] E. R. Sparks, A. Talwalkar, M. J. Franklin, M. I. Jordan, and T. Kraska. TuPAQ: An efficient planner for large-scale predictive analytic queries. arXiv preprint arXiv:1502.00068, 2015.
[33] E. R. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. MLI: An API for distributed machine learning. In 2013 IEEE 13th International Conference on Data Mining, pages 1187–1192. IEEE, 2013.
[34] A. Sujeeth, H. Lee, K. Brown, T. Rompf, H. Chafi, M. Wu, A. Atreya, M. Odersky, and K. Olukotun. OptiML: an implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 609–616, 2011.
[35] R. Sumbaly, J. Kreps, and S. Shah. The big data ecosystem at LinkedIn. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1125–1134. ACM, 2013.
[36] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015.
[37] Y. Tang. TF.Learn: TensorFlow's high-level module for distributed machine learning. arXiv preprint arXiv:1612.04251, 2016.
[38] D. Team. Deeplearning4j: Open-source distributed deep learning for the JVM. Apache Software Foundation License, 2, 2016.
[39] D. Team et al. Deeplearning4j: Open-source distributed deep learning for the JVM. Apache Software Foundation License, 2.
[40] M. Weiser. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, pages 439–449. IEEE Press, 1981.
[41] J. Woodcock, P. G. Larsen, J. Bicarregui, and J. Fitzgerald. Formal methods: Practice and experience. ACM Computing Surveys (CSUR), 41(4):19, 2009.
[42] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[43] C. Zhang. DeepDive: A data management system for automatic knowledge base construction. PhD thesis, Citeseer, 2015.
[44] C. Zhang, A. Kumar, and C. Ré. Materialization optimizations for feature selection workloads. ACM Transactions on Database Systems (TODS), 41(1):2, 2016.

APPENDIX
A. HML GRAMMAR

⟨var⟩ ::= ⟨string⟩
⟨scanner⟩ ::= ⟨var⟩ | ⟨scanner-obj⟩
⟨extractor⟩ ::= ⟨var⟩ | ⟨extractor-obj⟩
⟨typed-ext⟩ ::= '(' ⟨var⟩ ',' ⟨extractor⟩ ')'
⟨extractors⟩ ::= '(' ⟨extractor⟩ { ',' ⟨extractor⟩ } ')'
⟨typed-exts⟩ ::= '(' ⟨typed-ext⟩ { ',' ⟨typed-ext⟩ } ')'
⟨obj⟩ ::= ⟨data-source⟩ | ⟨scanner-obj⟩ | ⟨extractor-obj⟩ | ⟨learner-obj⟩ | ⟨synthesizer-obj⟩ | ⟨reducer-obj⟩
⟨assign⟩ ::= ⟨var⟩ 'refers_to' ⟨obj⟩
⟨expr1⟩ ::= ⟨var⟩ 'is_read_into' ⟨var⟩ 'using' ⟨scanner⟩
⟨expr2⟩ ::= ⟨var⟩ 'has_extractors' ⟨extractors⟩
⟨list⟩ ::= ⟨var⟩ | '(' ⟨var⟩ ',' ⟨var⟩ { ',' ⟨var⟩ } ')'
⟨apply⟩ ::= ⟨var⟩ 'on' ⟨list⟩
⟨expr3⟩ ::= ⟨apply⟩ 'as_examples' ⟨var⟩
⟨expr4⟩ ::= ⟨apply⟩ 'as_results' ⟨var⟩
⟨expr5⟩ ::= ⟨var⟩ 'as_examples' ⟨var⟩ 'with_labels' ⟨extractor⟩
⟨expr6⟩ ::= ⟨var⟩ 'uses' ⟨typed-exts⟩
⟨expr7⟩ ::= ⟨var⟩ 'is_output()'
⟨statement⟩ ::= ⟨assign⟩ | ⟨expr1⟩ | ⟨expr2⟩ | ⟨expr3⟩ | ⟨expr4⟩ | ⟨expr5⟩ | ⟨expr6⟩ | ⟨expr7⟩ | ⟨Scala expr⟩
⟨program⟩ ::= 'object' ⟨string⟩ 'extends Workflow {' { ⟨statement⟩ ⟨line-break⟩ } '}'

Figure 8: HELIX syntax in Extended Backus-Naur Form. ⟨string⟩ denotes a legal String object in Scala; ⟨*-obj⟩ denotes the correct syntax for instantiating an object of type "*"; ⟨Scala expr⟩ denotes any legal Scala expression. A HELIX Workflow can be comprised of any combination of HELIX and Scala expressions, a direct benefit of being an embedded DSL.
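As an illustration of the grammar, here is a hypothetical workflow sketch; every operator object name below (FileSource, DelimitedScanner, and so on) is an invented stand-in for an application's ⟨*-obj⟩ instantiations, not an operator shipped with HELIX.

    // Hypothetical program conforming to the grammar in Figure 8.
    object IncomeWorkflow extends Workflow {
      // <assign> statements binding vars to (invented) objects.
      raw refers_to FileSource("data.csv")
      scanner refers_to DelimitedScanner(",")
      age refers_to NumericExtractor("age")
      edu refers_to CategoryExtractor("education")
      label refers_to CategoryExtractor("income")
      model refers_to LogisticRegression()

      raw is_read_into rows using scanner          // <expr1>
      rows has_extractors (age, edu)               // <expr2>
      rows as_examples examples with_labels label  // <expr5>
      model on examples as_results predictions     // <expr4>
      predictions is_output()                      // <expr7>
    }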
B. PROOF FOR THEOREM 1

THEOREM 4. Given an instance of OPTIMAL-EXECUTION-PLAN P, an optimal, feasible selection of projects in problem φ(P) maps back to a feasible and optimal assignment of Ai and Bi when setting each Ai = 1 if project ai was selected (0 otherwise) and each Bi = 1 if project bi was selected (0 otherwise).

The proof for Theorem 4 follows directly from the two lemmas proven below.

LEMMA 1. A feasible solution to PSP under φ also produces a feasible solution to OEP.

PROOF. We first show that satisfying the prerequisite constraint in PSP leads to satisfying Constraint 1 in OPTIMAL-EXECUTION-PLAN. Suppose for contradiction that a feasible solution to PSP under φ does not produce a feasible solution to OEP. This implies that for some node ni ∈ N such that s(ni) = Sc, at least one parent nj has s(nj) = Sp. By construction, projects ai and bi are not selected, and both projects aj and bj are selected. By step 5 of the reduction, there exists an edge aj → bi. The project selection entailed by the operator states thus violates the prerequisite constraint, since aj is selected but its prerequisite bi is not — a contradiction. Thus, a feasible solution to PSP must produce a feasible solution to OEP under φ.
LEMMA 2. An optimal solution to PSP is also an optimal solution to OEP.

PROOF. Let Xai be the indicator for whether project ai is selected and Xbi the indicator for bi. The optimization objective for PSP can then be written as
$$\max_{X_{a_i}, X_{b_i}} \sum_{i=1}^{|N|} X_{a_i} p(a_i) + X_{b_i} p(b_i) \quad (11)$$
Substituting in our choices for p(ai) and p(bi), Eq (11) becomes
$$\max_{X_{a_i}, X_{b_i}} \sum_{i=1}^{|N|} X_{a_i} l_i + X_{b_i} (c_i - l_i) \quad (12)$$
$$= \max_{X_{a_i}, X_{b_i}} -\sum_{i=1}^{|N|} \left[ (X_{b_i} - X_{a_i}) l_i - X_{b_i} c_i \right] \quad (13)$$
Thus the maximization problem in Eq (13) is equivalent to the minimization problem in Eq (4). By setting Ai to the value of Xai and Bi to the value of Xbi, we obtain an optimal solution to OEP from the optimal solution to PSP.