
Feature-Cost Sensitive Learning with Submodular Trees of Classifiers

Matt J. Kusner, Wenlin Chen, Quan Zhou, Zhixiang (Eddie) Xu, Kilian Q. Weinberger, Yixin Chen

Washington University in St. Louis, 1 Brookings Drive, MO 63130
Tsinghua University, Beijing 100084, China
{mkusner, wenlinchen, zhixiang.xu, kilian, ychen25}@wustl.edu, zhouq10@mails.tsinghua.edu.cn

Abstract

During the past decade, machine learning algorithms have become commonplace in large-scale real-world industrial applications. In these settings, the computation time to train and test machine learning algorithms is a key consideration. At training time the algorithms must scale to very large data set sizes. At testing time, the cost of feature extraction can dominate the CPU runtime. Recently, a promising method was proposed to account for the feature extraction cost at testing time, called Cost-Sensitive Tree of Classifiers (CSTC). Although the CSTC problem is NP-hard, the authors suggest an approximation through a mixed-norm relaxation across many classifiers. This relaxation is slow to train and requires involved optimization hyperparameter tuning. We propose a different relaxation using approximate submodularity, called Approximately Submodular Tree of Classifiers (ASTC). ASTC is much simpler to implement, yields equivalent results, requires no optimization hyperparameter tuning, and is up to two orders of magnitude faster to train.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Machine learning is rapidly moving into real-world industrial settings with widespread impact. Already, it has touched ad placement (Bottou et al. 2013), web-search ranking (Zheng et al. 2008), large-scale threat detection (Zhang et al. 2013), and email spam classification (Weinberger et al. 2009). This shift, however, often introduces resource constraints on machine learning algorithms. Specifically, Google reported an average of 5.9 billion searches executed per day in 2013. Running a machine learning algorithm is impractical unless it can generate predictions in tens of milliseconds. Additionally, a large-scale corporation may want to limit the amount of carbon consumed through electricity use. Taking into account the needs of the learning setting is a step machine learning must make if it is to be widely adopted outside of academic settings.

In this paper we focus on the resource of test-time CPU cost. This test-time cost is the total cost to evaluate the classifier and to extract features. In industrial settings, where linear models are prevalent, the feature extraction cost dominates the test-time CPU cost. Typically, this cost must be strictly within budget in expectation. For example, an email spam classifier may have 10 milliseconds per email, but can spend more time on difficult cases as long as it is faster on easier ones. A key insight to facilitate this flexibility is to extract different features for different test instances.

Recently, a model for the feature-cost efficient learning setting was introduced that demonstrates impressive test-time cost reductions for web search (Xu et al. 2014). This model, called Cost-Sensitive Tree of Classifiers (CSTC), partitions test inputs into subsets using a tree structure. Each node in the tree is a sparse linear classifier that decides what subset a test input should fall into, by passing these inputs along a path through the tree. Once an input reaches a leaf node it is classified by a specialized leaf classifier. By traversing the tree along different paths, different features are extracted for different test inputs. While CSTC demonstrates impressive real-world performance, the training time is non-trivial and the optimization procedure involves several sensitive hyperparameters.

Here, we present a simplification of CSTC that takes advantage of approximate submodularity to train a near-optimal tree of classifiers. We call our model Approximately Submodular Tree of Classifiers (ASTC). We make a number of novel contributions: 1. we show how the CSTC optimization can be formulated as an approximately submodular set function optimization problem, which can be solved greedily; 2. we reduce the training complexity of each classifier within the tree to scale linearly with the training set size; 3. we demonstrate on several real-world datasets that ASTC matches CSTC's cost/accuracy trade-offs, yet it is significantly simpler to implement, requires no additional hyperparameters for the optimization, and is up to two orders of magnitude faster to train.

Resource-efficient learning

In this section we formalize the general setting of resource-efficient learning and introduce our notation. We are given a training dataset with inputs and class labels {(x_i, y_i)}_{i=1}^n = (X, y) ∈ R^{n×d} × Y^n (in this paper we limit our analysis to the regression case). Without loss of generality we assume that each feature vector x_α ∈ R^n satisfies ‖x_α‖_2 = 1 for all α = 1, ..., d, that ‖y‖_2 = 1, and that all have zero mean. Our aim is to learn a linear classifier h_β(x) = x^⊤β that generalizes well to unseen data and uses the available resources as effectively as possible. We assume that each feature α incurs a cost of

c(α) during extraction. Additionally, let ℓ(·) be a convex loss function and let B be a budget on the feature cost. Formally, we want to solve the following optimization problem,

    min_β ℓ(y; X, β)  subject to  Σ_{α : |β_α| > 0} c(α) ≤ B,    (1)

where Σ_{α : |β_α| > 0} c(α) sums over the (used) features with non-zero weight in β.

Cost-sensitive tree of classifiers (CSTC)

Recent work in resource-budgeted learning (Xu et al. 2014) (CSTC) shows impressive results by learning multiple classifiers from eq. (1), arranged in a tree of depth D, β^1, ..., β^{2^D − 1}. The CSTC model is shown in figure 1 (throughout the paper we consider the linear classifier version of CSTC). Each node v^k is a classifier whose predictions x^⊤β^k for an input x are thresholded by θ^k. The threshold decides whether to send x to the upper or lower child of v^k. An input continues through the tree in this way until arriving at a leaf node, which predicts its label.

Figure 1: The CSTC tree (depth 3). Instances x are sent along a path through the tree (e.g., in red) based on the predictions of node classifiers β^k. If predictions are above a threshold θ^k (x^⊤β^k > θ^k), x is sent to an upper child node; otherwise it is sent to a lower child. The leaf nodes predict the class of x.

Combinatorial optimization. There are two road-blocks to learning the CSTC classifiers β^k. First, because instances traverse different paths through the tree, the optimization is a complex combinatorial problem. Xu et al. (2014) fix this by probabilistic tree traversal. Specifically, each classifier x^⊤β^k is trained using all instances, weighted by the probability that instances reach v^k. This probability is derived by squashing node predictions using the sigmoid function: σ(x^⊤β) = 1/(1 + exp(−x^⊤β)).

To make the optimization more amenable to gradient-based methods, Xu et al. (2014) convert the constrained optimization problem in eq. (1) to the Lagrange equivalent and minimize the expected classifier loss plus the expected feature cost, where the expectation is taken over the probability of an input reaching node v^k,

    min_{β^k} (1/n) Σ_{i=1}^n p_i^k (y_i − x_i^⊤ β^k)^2 + ρ‖β^k‖_1 + λ E[C(β^k)],    (2)

where the first term is the expected squared loss and the last term the expected feature cost. Here p_i^k is the probability that instance x_i traverses to v^k and ρ is the regularization constant to control overfitting. The last term is the expected feature cost of β^k,

    E[C(β^k)] = Σ_{v^l ∈ P^k} p^l [ Σ_α c(α) ‖ Σ_{v^j ∈ π^l} |β_α^j| ‖_0 ].    (3)

The outer-most sum is over all leaf nodes P^k affected by β^k, and p^l is the probability of any input reaching such a leaf node v^l. The remaining terms describe the cost incurred for an input traversing to that leaf node. CSTC makes the assumption that, once a feature is extracted for an instance, it is free for future requests. Therefore, CSTC sums over all features α and, if any classifier along the path to leaf node v^l uses feature α, it is paid for exactly once (π^l is the set of nodes on the path to v^l).

ℓ_0 norm and differentiability. The second optimization road-block is that this feature cost term is non-continuous, and is thus hard to optimize. Their solution is to derive a continuous relaxation of the ℓ_0 norm using the mixed-norm (Kowalski 2009). The final optimization is non-convex and not differentiable, and the authors present a variational approach, introducing auxiliary variables for both the ℓ_0 and ℓ_1 norms so that the optimization can be solved with cyclic block-coordinate descent.

There are a number of practical difficulties that arise when using CSTC for a given dataset. First, optimizing a non-leaf node in the CSTC tree affects all descendant nodes via the instance probabilities. This slows the optimization and is difficult to implement. Second, the optimization is sensitive to gradient learning rates and convergence thresholds, which require careful tuning. In the same vein, selecting appropriate ranges for the hyperparameters λ and ρ may take repeated trial runs. Third, because CSTC needs to reoptimize all classifier nodes, the training time is non-trivial for large datasets, making hyperparameter tuning on a validation set time-consuming. Additionally, the stationary point reached by block-coordinate descent is initialization dependent. These difficulties may serve as significant barriers to entry, potentially preventing practitioners from using CSTC.

Model

In this paper we propose a vastly simplified variant of the CSTC classifier, called Approximately Submodular Tree of Classifiers (ASTC). Instead of relaxing the expected cost term into a continuous function, we reformulate the entire optimization as an approximately submodular set function optimization problem.

ASTC nodes. We begin by considering an individual classifier β^k in the CSTC tree, optimized using eq. (2). If we ignore the effect of β^k on descendant leaf nodes P^k and previous nodes on its path π^k, the feature cost changes:

    E[C(β^k)] = Σ_α c(α) ‖β_α^k‖_0.    (4)
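As a concrete illustration of the probabilistic traversal above, the path probabilities p_i^k are products of sigmoid branch probabilities along the path from the root. The sketch below is our own illustrative reconstruction, not the authors' code; `path_probability` and its argument layout are hypothetical names.

```python
import numpy as np

def sigmoid(t):
    """CSTC's squashing function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def path_probability(x, path):
    """Soft probability that input x reaches the node at the end of `path`.

    `path` lists one (beta, theta, goes_up) triple per edge from the root:
    the probability of taking the upper branch at a node with classifier
    beta and threshold theta is sigma(x @ beta - theta), and one minus
    that for the lower branch.
    """
    p = 1.0
    for beta, theta, goes_up in path:
        p_up = sigmoid(x @ beta - theta)
        p *= p_up if goes_up else 1.0 - p_up
    return p

# Example: the two children of a single root node split its probability mass.
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
beta0, theta0 = rng.standard_normal(3), 0.0
p_up = path_probability(x, [(beta0, theta0, True)])
p_lo = path_probability(x, [(beta0, theta0, False)])
assert np.isclose(p_up + p_lo, 1.0)  # branch probabilities sum to 1
```

ASTC instead sets these probabilities to hard 0/1 indicators, which is what reduces each node's expected feature cost to the simple form of eq. (4).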

This combined with the loss term is simply a weighted classifier with cost-weighted ℓ_0-regularization. We propose to greedily select features based on their performance/cost trade-off and to build the tree of classifiers top-down, starting from the root. We will solve one node at a time and set features 'free' that are used by parent nodes (as they need not be extracted twice). Figure 2 shows a schematic of the difference between the optimization of ASTC and the reoptimization of CSTC.

Figure 2: The optimization schemes of CSTC and ASTC. Left: When optimizing the classifier and threshold (β^2, θ^2) of node v^2 in CSTC, it affects all of the descendant nodes (highlighted in blue). If the depth of the tree is large (i.e., larger than 3), this results in a complex and expensive gradient computation. Right: ASTC, on the other hand, optimizes each node greedily using the familiar ordinary least squares closed-form solution β^2 = (X^⊤X)^{−1}X^⊤y. θ^2 is set by binary search to send half of the inputs to each child node.

Resource-constrained submodular optimization. An alternative way to look at the optimization of a single CSTC node is as an optimization over sets of features. Let Ω be the set of all features. Define the loss function for node v^k, ℓ^k(A), over a set of features A ⊆ Ω as

    ℓ^k(A) = min_{β^k} (1/n) Σ_{i=1}^n p_i^k (y_i − δ_A(x_i)^⊤ β^k)^2,    (5)

where we treat the probabilities p_i^k as indicator weights: p_i^k = 1 if input x_i is sent to v^k, and 0 otherwise. Define δ_A(x) as an element-wise feature indicator function that returns feature x_a if a ∈ A and 0 otherwise. Thus, ℓ^k(·) is the squared loss of the optimal model using only (a) inputs that reach v^k and (b) the features in set A. Our goal is to select a set of features A that has low cost and, simultaneously, a low optimal loss ℓ^k(A).

Certain problems in constrained set function optimization have very nice properties. In particular, a class of set functions, called submodular set functions, has been shown to admit simple near-optimal greedy algorithms (Nemhauser, Wolsey, and Fisher 1978). For the resource-constrained case, each feature (set element) a has a certain resource cost c(a), and we would like to ensure that the cost of the selected features falls under some resource budget B. For a submodular function s that is non-decreasing and non-negative, the resource-constrained set function optimization

    max_{A ⊆ Ω} s(A)  subject to  Σ_{a ∈ A} c(a) ≤ B    (6)

can be solved near-optimally by greedily selecting set elements a ∈ Ω that maximize s as follows,

    g_j = argmax_{a ∈ Ω} [ (s(G_{j−1} ∪ a) − s(G_{j−1})) / c(a) ],    (7)

where we define the greedy ordering G_{j−1} = (g_1, g_2, ..., g_{j−1}). To find g_j we evaluate all remaining set elements a ∈ Ω \ G_{j−1} and pick the element g_j = â for which s(G_{j−1} ∪ a) increases the most over s(G_{j−1}) per cost. Let G⟨T⟩ = (g_1, ..., g_m) be the largest feasible greedy set, having total cost T (i.e., Σ_{j=1}^m c(g_j) = T ≤ B and T + c(g_{m+1}) > B). Streeter and Golovin (2007) prove that for any non-decreasing and non-negative submodular function s and some budget B, eq. (7) gives an approximation ratio of (1 − e^{−1}) ≈ 0.63 with respect to the optimal set with cost T ≤ B. Call this set C*⟨T⟩. Then, s(G⟨T⟩) ≥ (1 − e^{−1}) s(C*⟨T⟩) for the optimization in eq. (6).

Optimization

In this section we demonstrate that optimizing a single CSTC node, and hence the CSTC tree, greedily is approximately submodular. We begin by introducing a modification to the set function (5). We then connect this to the approximation ratio of the greedy cost-aware algorithm, eq. (7), demonstrating that it produces near-optimal solutions. The resulting optimization is very simple to implement and is described in Algorithm 1.

Approximate submodularity. To make ℓ^k(·) amenable to resource-constrained set function optimization (6), we convert the loss minimization problem into an equivalent label 'fit' maximization problem. Define the set function z^k,

    z^k(A) = (Var(y; p^k) − ℓ^k(A)) / Var(y; p^k),    (8)

where Var(y; p^k) = Σ_i p_i^k (y_i − ȳ)^2 is the variance of the training label vector y weighted by the 0/1 probabilities p_i^k (ȳ is the mean predictor). It is straightforward to show that maximizing z^k(·) is equivalent to minimizing ℓ^k(·). In fact, the following approximation guarantees hold for z^k(·) constructed from a wide range of loss functions (via a modification of Grubb and Bagnell (2012)). As we are interested in developing a new method for CSTC training, we focus purely on the squared loss. Note that z^k(·) is always non-negative (as the mean predictor is a worse training set predictor than the OLS solution using one feature, assuming it takes on more than one value). To see that it is also non-decreasing, note that z^k(·) is precisely the squared multiple correlation R^2 (Diekhoff 1992; Johnson and Wichern 2002), which is known to be non-decreasing.
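As a concrete illustration, the greedy rule (7) applied to the fit function (8) can be sketched in a few lines of Python. This is our own deliberately naive sketch, with all p_i = 1 (i.e., a root node), unit-cost hypothetical toy data, and full OLS refits for every candidate; the function names are ours.

```python
import numpy as np

def z(X, y, features):
    """Eq. (8) with all p_i = 1: the fraction of label variance explained
    by an OLS fit on the feature subset (the squared multiple
    correlation R^2)."""
    if not features:
        return 0.0
    XA = X[:, features]
    beta, *_ = np.linalg.lstsq(XA, y, rcond=None)
    resid = y - XA @ beta
    var = np.sum((y - y.mean()) ** 2)
    return (var - resid @ resid) / var

def greedy_select(X, y, cost, budget):
    """Greedy rule of eq. (7): repeatedly add the affordable feature with
    the largest gain in z per unit cost."""
    selected, spent = [], 0.0
    remaining = set(range(X.shape[1]))
    while True:
        base = z(X, y, selected)
        scores = {a: (z(X, y, selected + [a]) - base) / cost[a]
                  for a in remaining if spent + cost[a] <= budget}
        if not scores:
            return selected
        best = max(scores, key=scores.get)
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)

# Toy example: y is exactly a multiple of feature 2, so with unit costs and
# a budget of 1 it is the only feature purchased.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
y = 2.0 * X[:, 2]
picked = greedy_select(X, y, cost=[1.0, 1.0, 1.0, 1.0], budget=1.0)
assert picked == [2]
```

Each refit here costs a full least-squares solve; the fast greedy selection described later in the paper removes exactly this redundancy.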

If the features are orthogonal then z^k(·) is submodular (Krause and Cevher 2010). However, if this is not the case, it can be shown that z^k(·) is approximately submodular and has a submodularity ratio, defined as follows:

Definition 2.1 (Submodularity Ratio) (Das and Kempe 2011). Any non-negative set function z(·) has a submodularity ratio γ_{U,i} for a set U ⊆ Ω and i ≥ 1,

    Σ_{s ∈ S} [z(L ∪ {s}) − z(L)] ≥ γ_{U,i} [z(L ∪ S) − z(L)],

for all L, S ⊂ U such that |S| ≤ i and S ∩ L = ∅.

The submodularity ratio ranges from 0 (z(·) is not submodular) to 1 (z(·) is submodular) and measures how close a function z(·) is to being submodular. As is done in Grubb and Bagnell (2012), for simplicity we will only consider the smallest submodularity ratio γ ≤ γ_{U,i} over all U and i.

The submodularity ratio in general is non-trivial to compute. However, we can take advantage of work by Das and Kempe (2011), who show that the submodularity ratio of z^k(·), defined in eq. (8), is further bounded. Define C_A^k as the covariance matrix of X̃, where x̃_i = p_i^k δ_A(x_i) (inputs weighted by the probability of reaching v^k, using only the features in A). Das and Kempe (2011) show in Lemma 2.4 that for z^k(·) it holds that γ ≥ λ_min(C_A^k), where λ_min(C_A^k) is the minimum eigenvalue of C_A^k.

Approximation ratio. As in the submodular case, we can optimize z^k(·) subject to the resource constraint that the cost of the selected features must total less than a resource budget B. This optimization can be done greedily using the rule described in eq. (7). The following theorem, which is proved for any non-decreasing, non-negative, approximately submodular set function in Grubb and Bagnell (2012), gives an approximation ratio for this greedy rule.

Theorem 2.2 (Grubb and Bagnell 2012). The greedy algorithm selects an ordering G⟨T⟩ such that

    z^k(G⟨T⟩) ≥ (1 − e^{−γ}) z^k(C*⟨T⟩),

where G⟨T⟩ = (g_1, g_2, ..., g_m) is the greedy sequence truncated at cost T, such that Σ_{i=1}^m c(g_i) = T ≤ B, and C*⟨T⟩ is the set of optimal features having cost T.

Thus, the approximation ratio depends directly on the submodularity ratio of z^k(·). For each node in the CSTC tree we greedily select features using the rule described in (7). If we are not at the root node, we set the cost of features used by the parent of v^k to 0 and select them immediately (as we have already paid their cost). We fix a new-feature budget B, identical for each node in the tree, and then greedily select new features up to cost B for each node. By setting the probabilities p_i^k to 0 or 1 depending on whether x_i traverses to v^k, learning each node is like solving a unique approximately submodular optimization problem, using only the inputs sent to that node. Finally, we set the node thresholds θ^k to send half of the training inputs to each child node.

We call our approach Approximately Submodular Tree of Classifiers (ASTC), which is shown in Algorithm 1. The optimization is much simpler than CSTC.

Algorithm 1 ASTC in pseudo-code.
 1: Inputs: {X, y}; tree depth D; node budget B; costs c
 2: Set the initial costs c^1 = c
 3: for k = 1 to 2^D − 1 nodes do
 4:   G^k = ∅
 5:   while budget not exceeded: Σ_{g ∈ G^k} c(g) ≤ B do
 6:     Select feature a ∈ Ω via eq. (7)
 7:     G^k = G^k ∪ {a}
 8:   end while
 9:   Solve β^k using weighted ordinary least squares
10:   if v^k is not a leaf node, with children v^l and v^u then
11:     Set child probabilities:
          p_i^u = 1 if p_i^k = 1 and x_i^⊤β^k > θ^k, 0 otherwise
          p_i^l = 1 if p_i^k = 1 and x_i^⊤β^k ≤ θ^k, 0 otherwise
12:     Set new feature costs: c^u = c^l = c^k
13:     Free used features: c^u(G^k) = c^l(G^k) = 0
14:   end if
15: end for
16: Return {β^1, β^2, ..., β^{2^D − 1}}

Fast greedy selection

Equation (7) requires solving an ordinary least squares problem, eq. (5), when selecting the feature that improves z^k(·) the most. This requires a matrix inversion, which typically takes O(d^3) time. However, because we only consider selecting one feature at a time, we can avoid the inversion for z^k(·) altogether using the QR decomposition. Let G_j = (g_1, g_2, ..., g_j) be our current set of greedily-selected features. For simplicity let X_{G_j} = δ_{G_j}(X), the data masked so that only features in G_j are non-zero. Computing z^k(·) requires computing the weighted squared loss, eq. (5), which, after the QR decomposition, requires no inverse. Redefine x_i = √(p_i^k / n) x_i and y_i = √(p_i^k / n) y_i; then we have

    ℓ^k(G_j) = min_{β^k} (y − X_{G_j} β^k)^⊤ (y − X_{G_j} β^k).    (9)

Let X_{G_j} = QR be the QR decomposition of X_{G_j}. Plugging in this decomposition, taking the gradient of ℓ^k(G_j) with respect to β^k, and setting it to 0 yields (Hastie, Tibshirani, and Friedman 2001)

    β^k = R^{−1} Q^⊤ y.

The squared loss for the optimal β^k is

    ℓ^k(G_j) = (y − QRR^{−1}Q^⊤y)^⊤ (y − QRR^{−1}Q^⊤y)
             = (y − QQ^⊤y)^⊤ (y − QQ^⊤y)
             = y^⊤y − y^⊤QQ^⊤y.    (10)

Imagine we have extracted j features and we are considering selecting a new feature a. The immediate approach would be to recompute Q including this feature and then recompute the squared loss (10). However, computing q_{j+1} (the column corresponding to feature a) can be done incrementally

using the Gram–Schmidt process:

    q_{j+1} = (X_a − QQ^⊤X_a) / ‖X_a − QQ^⊤X_a‖

(where X_a is the column of X corresponding to the candidate selected feature). Finally, in order to select the best next feature using eq. (7), for each feature a we must compute

    (z^k(G_j ∪ a) − z^k(G_j)) / c(a)
        = (−ℓ^k(G_j ∪ a) + ℓ^k(G_j)) / (Var(y; p^k) c(a))
        = (y^⊤Q_{1:j+1}Q_{1:j+1}^⊤y − y^⊤QQ^⊤y) / (Var(y; p^k) c(a))
        = (q_{j+1}^⊤ y)^2 / (Var(y; p^k) c(a)),    (11)

where Q_{1:j+1} = [Q, q_{j+1}]. The first two equalities follow from the definitions of z^k(·) and ℓ^k(·). The third equality follows because Q and Q_{1:j+1} are orthogonal matrices.

We can compute all of the possible q_{j+1} columns, corresponding to all of the remaining features a, in parallel; call this matrix Q_remain. Then we can compute eq. (11) vector-wise on Q_remain and select the feature with the largest corresponding value of z^k(·).

Complexity. Computing the ordinary least squares solution the naive way for the (j+1)th feature, (X^⊤X)^{−1}X^⊤y, requires O(n(j+1)^2 + (j+1)^3) time for the covariance multiplication and inversion. This must be done d − j times to compute z^k(·) for every remaining feature. Using the QR decomposition, computing q_{j+1} requires O(nj) time and computing eq. (11) takes O(n) time. As before, this must be done d − j times for all remaining features, but as mentioned above both steps can be done in parallel.

Results

In this section, we evaluate our approach on a real-world feature-cost sensitive ranking dataset: the Yahoo! Learning to Rank Challenge dataset. We begin by describing the dataset and show Precision@5 per cost compared against CSTC (Xu et al. 2014) and another cost-sensitive baseline. We then present results on a diverse set of non-cost sensitive datasets, demonstrating the flexibility of our approach. For all datasets we evaluate the training times of our approach compared to CSTC for varying tree budgets.

Yahoo! learning to rank. To judge how well our approach performs in a particular real-world setting, we test ASTC on the binary Yahoo! Learning to Rank Challenge data set (Chen et al. 2012). The dataset consists of 473,134 web documents and 19,944 queries. Each input x_i is a query-document pair containing 519 features, each with extraction costs in the set {1, 5, 20, 50, 100, 150, 200}. The unit of cost is weak-learner evaluations (i.e., the most expensive feature takes time equivalent to 200 weak-learner evaluations). We remove the mean and normalize the features by their ℓ_2 norm, as is assumed by the submodularity ratio bound analysis. We use the Precision@5 metric, which is often used for binary ranking datasets.

Figure 3: Plot of ASTC, CSTC, and a cost-sensitive baseline (a cost-weighted ℓ_1-classifier) on one real-world feature-cost sensitive dataset (Yahoo!, Precision@5 per feature cost, higher is better) and three non-cost sensitive datasets (Forest, CIFAR, MiniBooNE; test error per feature cost). ASTC demonstrates roughly the same error/cost trade-off as CSTC, sometimes improving upon CSTC. For Yahoo!, circles mark the CSTC points that are used for the training time comparison; otherwise, all points are compared.

Figure 3 compares the test Precision@5 of CSTC with the greedy algorithm described in Algorithm 1 (ASTC). For both algorithms we set a maximum tree depth of 5. We also compare against setting the probabilities p_i^k using the sigmoid function σ(x^⊤β^k) = 1/(1 + exp(−x^⊤β^k)) on the node predictions, as is done by CSTC (ASTC, soft). Specifically, the probability of an input x traversing from parent node v^k to its upper child v^u is σ(x^⊤β^k − θ^k), and to its lower child v^l it is 1 − σ(x^⊤β^k − θ^k). Thus, the probability of x reaching node v^k from the root is the product of all such parent-child probabilities from the root to v^k. Unlike CSTC, we disregard the effect β^k has on descendant node probabilities (see Figure 2). Finally, we also compare against a single cost-weighted ℓ_1-regularized classifier.

We note that the ASTC methods perform just as well as, and sometimes slightly better than, the state-of-the-art CSTC. All of the techniques perform better than the single ℓ_1 classifier, as it must extract features that perform well for all instances. CSTC and ASTC instead may select a small number of expert features to classify small subsets of test inputs.

Forest, CIFAR, MiniBooNE. We evaluate ASTC on three very different non-cost sensitive datasets in tree-type and image classification (Forest, CIFAR), as well as particle identification (MiniBooNE). As the feature extraction costs are unknown, we set the cost of each feature α to c(α) = 1. As before, ASTC is able to improve upon CSTC.

Training time speed-up. Tables 1 and 2 show the speed-up of our approaches over CSTC for various tree budgets.

Table 1: Training speed-up of ASTC over CSTC for different tree budgets on the Yahoo! and Forest datasets.

                                  YAHOO!                    |              FOREST
Cost budget    10     52     86     169    468    800  1495 |    3     5     8    13    23    50
ASTC          119x    52x    41x    21x    15x   9.2x  6.6x | 8.4x  7.0x  6.3x  4.9x  3.1x  1.4x
ASTC, soft    121x    48x    46x    18x    15x   8.2x  6.4x | 8.0x  6.4x  5.7x  4.5x  2.8x  1.5x

Table 2: Training speed-up of ASTC over CSTC for the CIFAR and MiniBooNE datasets.

                       CIFAR               |                 MINIBOONE
Cost budget     9     24     76   180  239 |    4     5    12    14    18    33    47
ASTC          5.6x  2.3x  0.68x 0.25x 0.14x| 7.4x  7.9x  5.5x  5.2x  4.1x  3.1x  2.0x
ASTC, soft    5.3x  2.3x  0.62x 0.27x 0.13x| 7.2x  6.2x  5.9x  4.2x  4.3x  2.5x  1.7x
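The QR-based update scheme behind these training-time gains is easy to sanity-check numerically. The snippet below is our own construction rather than the paper's code: it scores candidate features both via the one-step Gram–Schmidt formula of eq. (11) and via a naive OLS refit, and asserts that the two agree (for a root node, where all p_i = 1 and Var(y; p) is the plain label variance).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 6
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)   # zero-mean, unit-l2 features, as assumed
y = rng.standard_normal(n)
y -= y.mean()
var = np.sum(y ** 2)             # Var(y; p) with all p_i = 1 and zero-mean y
cost = np.ones(d)                # hypothetical unit extraction costs

def ols_loss(cols):
    """l(A) of eq. (9): squared loss of the OLS fit on columns `cols`."""
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ beta
    return r @ r

G = [0, 3]                       # features greedily selected so far
Q, _ = np.linalg.qr(X[:, G])     # orthonormal basis for the selected columns

for a in sorted(set(range(d)) - set(G)):
    # One Gram-Schmidt step: orthogonalize X_a against Q and normalize.
    r = X[:, a] - Q @ (Q.T @ X[:, a])
    q = r / np.linalg.norm(r)
    fast = (q @ y) ** 2 / (var * cost[a])                      # eq. (11)
    naive = (ols_loss(G) - ols_loss(G + [a])) / (var * cost[a])
    assert np.isclose(fast, naive)
```

In practice all remaining columns can be orthogonalized against Q in a single matrix operation, which is what brings the per-candidate cost down from a fresh O(nj² + j³) solve to O(nj).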

For a fair speed comparison, we first learn a CSTC tree for in which a learner can purchase features in ‘rounds’. Most

different values of λ, which controls the allowed feature ex- similar to our work is Golovin and Krause (2011) who learn

traction cost (the timed settings on the Yahoo! dataset are a policy to adaptively select features to optimize a set func-

marked with black circles on Figure 3, whereas all points are tion. Differently, their work assumes the set function is fully

timed for the other datasets). We then determine the cost of submodular and every policy action only selects a single ele-

unique features extracted at each node in the learned CSTC ment (feature). To our knowledge, this work is the first tree-

tree. We set these unique feature costs as individual node based model to tackle resource-efficient learning using ap-

budgets B k for ASTC methods and greedily learn tree fea- proximate submodularity.

tures until reaching the budget for each node. We note that

on the real-world feature-cost sensitive dataset Yahoo! the Conclusion

ASTC methods are consistently faster than CSTC. Of the

remaining datasets ASTC is faster in all settings except for We have introduced Approximately Submodular Tree of

three parameter settings on CIFAR. One possible explana- Classifiers (ASTC), using recent developments in approxi-

tion for the reduced speed-ups is that the training set of these mate submodular optimization to develop a practical near-

datasets are much smaller (Forest: n = 36, 603 d = 54; CI- optimal greedy method for feature-cost sensitive learning.

FAR: n = 19, 761 d = 400; MiniBooNE: n = 45, 523 d = 50) The resulting optimization yields an efficient update scheme

than Yahoo! (n = 141, 397 and d = 519). Thus, the speed-ups for training ASTC up to 120 times faster than CSTC.

are not as pronounced and the small, higher dimensionality One limitation of this approach is that the approximation

CIFAR dataset trains slightly slower than CSTC. guarantee does not hold if features are preprocessed. Specif-

ically, for web-search ranking, it is common to first perform

Related work gradient boosting to generate a set of limited-depth decision

trees. The predictions of these decision trees can then be

Prior to CSTC (Xu et al. 2014), a natural approach to con- used as features, as in non-linear of CSTC (Xu et al. 2013a).

trolling feature resource cost is to use `1 -regularization to Additionally, a set of features may cost less than the sum

obtain a sparse set of features (Efron et al. 2004). One down- of their individual costs. One common example is image de-

side of these approaches is that certain inputs may only re- scriptors for object detection such as HOG features (Dalal

quire a small number of cheap features to compute, while other inputs may require a number of expensive features. This scenario motivated the development of CSTC (Xu et al. 2014). There are a number of models that use similar decision-making schemes to speed up test-time classification. These include Dredze, Gevaryahu, and Elias-Bachrach (2007), who build feature-cost sensitive decision trees; Busa-Fekete et al. (2012), who use a Markov decision process to adaptively select features for each instance; Xu et al. (2013b), who build feature-cost sensitive representations; and Wang and Saligrama (2012), who learn subset-specific classifiers.

Feature selection has been tackled by a number of submodular optimization papers (Krause and Cevher 2010; Das and Kempe 2011; Das, Dasgupta, and Kumar 2012; Krause and Guestrin 2012). Surprisingly, until recently there were relatively few papers addressing resource-efficient learning. Grubb and Bagnell (2012) greedily learn weak learners that are cost-effective using (orthogonal) matching pursuit. Work last year (Zolghadr et al. 2013) considers an online setting …

… (Dalal and Triggs 2005). A descriptor can be thought of as a group of features. Once a single feature from the group is selected, the remaining features in the group become ‘free’, as they were already computed for the descriptor. Extending ASTC to these features would widen the scope of the approach.

Overall, by presenting a simple, near-optimal method for feature-cost sensitive learning we hope to bridge the gap between machine learning models designed for real-world industrial settings and those implemented in such settings. Without the need for expert tuning and with faster training, we believe our approach can be rapidly adopted in an increasing number of large-scale machine learning applications.

Acknowledgements. KQW, MK, ZX, YC, and WC are supported by NIH grant U01 1U01NS073457-01 and NSF grants 1149882, 1137211, CNS-1017701, CCF-1215302, and IIS-1343896. QZ is partially supported by Key Technologies Program of China grant 2012BAF01B03 and NSFC grant 61273233. Computations were performed via the Washington University Center for High Performance Computing,
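The grouped ‘free’-feature cost structure of descriptors described above can be sketched as follows. This is an illustrative sketch, not code from the paper; the function and variable names (`incremental_cost`, `group_of`, `group_cost`) are hypothetical. The mechanism it implements is the one stated in the text: paying a descriptor's extraction cost once makes every other feature in that descriptor's group free.

```python
def incremental_cost(feature, acquired, group_of, group_cost):
    """Cost of extracting `feature` given the already-`acquired` feature set.

    `group_of` maps feature -> descriptor group id; `group_cost` maps
    group id -> cost of computing that descriptor once.
    """
    g = group_of[feature]
    # If any acquired feature shares this group, the descriptor was
    # already computed, so this feature costs nothing extra.
    if any(group_of[f] == g for f in acquired):
        return 0.0
    return group_cost[g]

# Toy example: features 0-3 come from descriptor "A" (cost 4.0),
# feature 4 from descriptor "B" (cost 1.0).
group_of = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B"}
group_cost = {"A": 4.0, "B": 1.0}

print(incremental_cost(0, set(), group_of, group_cost))    # 4.0: "A" not yet paid
print(incremental_cost(1, {0}, group_of, group_cost))      # 0.0: "A" already paid
print(incremental_cost(4, {0, 1}, group_of, group_cost))   # 1.0: "B" not yet paid
```

Under this cost model, a cost-sensitive learner that has already selected one HOG feature is encouraged to exploit the rest of that descriptor before paying for a new one.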


partially through grant NCRR 1S10RR022984-01A1.

References

Bottou, L.; Peters, J.; Quiñonero-Candela, J.; Charles, D. X.; Chickering, D. M.; Portugaly, E.; Ray, D.; Simard, P.; and Snelson, E. 2013. Counterfactual reasoning and learning systems: the example of computational advertising. The Journal of Machine Learning Research 14(1):3207–3260.

Busa-Fekete, R.; Benbouzid, D.; Kégl, B.; et al. 2012. Fast classification using sparse decision DAGs. In International Conference on Machine Learning.

Chen, M.; Weinberger, K. Q.; Chapelle, O.; Kedem, D.; and Xu, Z. 2012. Classifier cascade for minimizing feature evaluation cost. In International Conference on Artificial Intelligence and Statistics, 218–226.

Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, volume 1, 886–893.

Das, A., and Kempe, D. 2011. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In International Conference on Machine Learning.

Das, A.; Dasgupta, A.; and Kumar, R. 2012. Selecting diverse features via spectral regularization. In Advances in Neural Information Processing Systems, 1592–1600.

Diekhoff, G. 1992. Statistics for the social and behavioral sciences: univariate, bivariate, multivariate. Wm. C. Brown Publishers, Dubuque, IA.

Dredze, M.; Gevaryahu, R.; and Elias-Bachrach, A. 2007. Learning fast classifiers for image spam. In Conference on Email and Anti-Spam.

Efron, B.; Hastie, T.; Johnstone, I.; and Tibshirani, R. 2004. Least angle regression. The Annals of Statistics 32(2):407–499.

Golovin, D., and Krause, A. 2011. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research 42(1):427–486.

Grubb, A., and Bagnell, J. A. 2012. SpeedBoost: Anytime prediction with uniform near-optimality. In International Conference on Artificial Intelligence and Statistics.

Hastie, T.; Tibshirani, R.; and Friedman, J. 2001. The elements of statistical learning: data mining, inference and prediction. New York: Springer-Verlag 1(8):371–406.

Johnson, R. A., and Wichern, D. W. 2002. Applied multivariate statistical analysis, volume 5. Prentice Hall, Upper Saddle River, NJ.

Kowalski, M. 2009. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis 27(3):303–324.

Krause, A., and Cevher, V. 2010. Submodular dictionary selection for sparse representation. In International Conference on Machine Learning, 567–574.

Krause, A., and Guestrin, C. E. 2012. Near-optimal nonmyopic value of information in graphical models. arXiv preprint arXiv:1207.1394.

Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1):265–294.

Streeter, M., and Golovin, D. 2007. An online algorithm for maximizing submodular functions. Technical report, DTIC Document.

Wang, J., and Saligrama, V. 2012. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems, 91–99.

Weinberger, K.; Dasgupta, A.; Langford, J.; Smola, A.; and Attenberg, J. 2009. Feature hashing for large scale multitask learning. In International Conference on Machine Learning, 1113–1120.

Xu, Z.; Kusner, M. J.; Chen, M.; and Weinberger, K. Q. 2013a. Cost-sensitive tree of classifiers. In International Conference on Machine Learning.

Xu, Z.; Kusner, M.; Huang, G.; and Weinberger, K. Q. 2013b. Anytime representation learning. In International Conference on Machine Learning.

Xu, Z.; Kusner, M. J.; Weinberger, K. Q.; Chen, M.; and Chapelle, O. 2014. Budgeted learning with trees and cascades. Journal of Machine Learning Research.

Zhang, D.; He, J.; Si, L.; and Lawrence, R. 2013. Mileage: Multiple instance learning with global embedding. In International Conference on Machine Learning.

Zheng, Z.; Zha, H.; Zhang, T.; Chapelle, O.; Chen, K.; and Sun, G. 2008. A general boosting method and its application to learning ranking functions for web search. In Advances in Neural Information Processing Systems, 1697–1704.

Zolghadr, N.; Bartók, G.; Greiner, R.; György, A.; and Szepesvári, C. 2013. Online learning with costly features and labels. In Advances in Neural Information Processing Systems, 1241–1249.

