
Machine Teaching:
An Inverse Problem to Machine Learning and
an Approach Toward Optimal Education

Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin–Madison
Madison, WI, USA 53706
jerryzhu@cs.wisc.edu

Abstract

I draw the reader's attention to machine teaching, the problem of finding an optimal training set given a machine learning algorithm and a target model. In addition to generating fascinating mathematical questions for computer scientists to ponder, machine teaching holds the promise of enhancing education and personnel training. The Socratic dialogue style aims to stimulate critical thinking.
Of Machines

Q: I know machine learning; what is machine teaching?
Consider a student who is a machine learning algorithm, for example, a Support Vector Machine (SVM) or k-means clustering. Now consider a teacher who wants the student to learn a target model θ*. For example, θ* can be a specific hyperplane in SVM, or the location of the k centroids in k-means. The teacher knows θ* and the student's learning algorithm, and teaches by giving the student training examples. Machine teaching aims to design the optimal training set D.

Q: What do you mean by optimal?
One definition is the cardinality of D: the smaller |D| is, the better. But there are other definitions, as we shall see.

Q: If we already know the true model θ*, why bother training a learner?
The applications are such that the teacher and the learner are separate entities, and the teacher cannot directly hard-wire the learner. One application is education, where the learner is a human student. A more sinister application is security, where the learner is an adaptive spam filter and the teacher is a hacker who wishes to change the filtering behavior by sending the spam filter specially designed messages. Regardless of the intention, machine teaching aims to maximally influence the learner via optimal training data.

Q: How is machine teaching an inverse problem to machine learning?
One may view a machine learning algorithm A as a function mapping the space of training sets 𝔻 to a model space Θ, see this figure:

[Figure: A maps a training set D ∈ 𝔻 to a model A(D) ∈ Θ; the inverse A⁻¹(θ*) maps a target model θ* ∈ Θ back to the set of training sets that produce it.]

Given a training set D ∈ 𝔻, machine learning returns a model A(D) ∈ Θ. Note A in general is many-to-one. Conversely, given a target model θ*, the inverse function A⁻¹ returns the set of training sets that will result in θ*. Machine teaching aims to identify the optimal member of A⁻¹(θ*). However, A⁻¹ is often challenging to compute, and may even be empty for some θ*. Machine teaching must handle these issues.

Q: Isn't machine teaching just active learning / experimental design?
No. Recall that active learning allows the learner to ask questions by selecting an item x and asking an oracle for its label y (Settles 2012). Consider learning a noiseless threshold classifier θ* in [0, 1], as shown below.

[Figure: the interval [0, 1] with threshold θ*; items to the left of θ* are labeled −1, items to the right are labeled +1.]

To learn the decision boundary up to precision ε, active learning needs to perform binary search with log(1/ε) queries. In contrast, in machine teaching the teacher only needs two examples: (θ* − ε/2, −1) and (θ* + ε/2, +1). The key difference is that the teacher knows θ* upfront and doesn't need to explore. Note that passive learning, where training items x are sampled iid uniformly from [0, 1], requires O(1/ε) items.
Q: So the teacher can create arbitrary training items?
That is one teaching setting. Another setting is pool-based teaching, where the teacher is given a pool of candidate items and can only select (or modify) those items. There are many other aspects of the teaching setting that one can consider: whether teaching is done in a batch or sequentially, whether the teacher has full knowledge of the learner, whether the learner is unsupervised or supervised, how to define training set optimality, and so on.
Q: The optimal teaching data D can be non-iid?
Yes, as the threshold classifier example shows. This is most noticeable when the optimality of D is measured by its cardinality, but it is true in general. It calls for new theoretical analysis methods for machine teaching, since traditional concentration inequalities based on iid-ness no longer apply.

Q: Did you just invent machine teaching?
No. What is new is our focus on tractable teaching computation for modern machine learners, while generalizing the definition of optimality beyond training set cardinality. But there has been a long history of related work. The seminal notion of teaching dimension concerns the cardinality of the optimal teaching set (Goldman and Kearns 1995; Shinohara and Miyano 1991). However, the learner's algorithm A was restricted to empirical risk minimization (specifically, eliminating hypotheses inconsistent with training data). Subsequent theoretical developments can be found in, e.g., (Zilles et al. 2011; Balbach and Zeugmann 2009; Angluin 2004; Angluin and Krikis 1997; Goldman and Mathias 1996; Mathias 1997; Balbach and Zeugmann 2006; Balbach 2008; Kobayashi and Shinohara 2009; Angluin and Krikis 2003; Rivest and Yin 1995; Ben-David and Eiron 1998; Doliwa et al. 2014). There are similar ideas in psychology, some of which can be found in the references in (Patil et al. 2014).

Q: So you have a way to compute the optimal training set?
This is in general still an open problem, but we now understand the solution for certain machine teaching tasks (Zhu 2013; Patil et al. 2014). Specifically, we restrict ourselves to batch teaching with full knowledge of the learning algorithm. The idea is as follows. Instead of directly computing the difficult inverse function A⁻¹(θ*), we first convert it into an optimization problem:

    min_{D ∈ 𝔻}  E(D)          (1)
    s.t.  A(D) = θ*.           (2)

This is not the final formulation; it will evolve later.
Q: OK. What is that E(D) objective?
E(D) is a teaching effort function, which we must define to capture the notion of training set optimality. For example, if we define

    E(D) = |D|                 (3)

then we prefer small training sets. Alternatively, if we require the optimal training set to contain exactly n items, we may define

    E(D) = I_{[|D| = n]}       (4)

where the indicator function I_Z = 0 if Z is true, and ∞ otherwise. This teaching effort function is useful for designing human experiments, as was done in (Patil et al. 2014). One can encode more complex notions of optimality with E(D). For example, in teaching a classification task we may prefer that any two training items from different classes be clearly distinguishable. Here, D is of the form D = (x_1, y_1), …, (x_n, y_n). We may define

    E(D) = ∑_{i,j: y_i ≠ y_j} ‖x_i − x_j‖⁻¹    (5)

to avoid any near-identical training items with different labels.
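For concreteness, the three effort functions (3)–(5) translate directly into code. This is a small sketch of mine, assuming D is a list of (x, y) pairs and reading (5) as a sum of inverse pairwise distances, with each pair counted once:

```python
import math
import numpy as np

def effort_cardinality(D):
    """Equation (3): prefer small training sets."""
    return len(D)

def effort_exactly_n(D, n):
    """Equation (4): indicator effort, 0 if |D| = n and infinity otherwise."""
    return 0.0 if len(D) == n else math.inf

def effort_separation(D):
    """Equation (5): penalize near-identical items that carry different labels."""
    total = 0.0
    for i, (xi, yi) in enumerate(D):
        for xj, yj in D[i + 1:]:
            if yi != yj:
                total += 1.0 / np.linalg.norm(np.asarray(xi) - np.asarray(xj))
    return total
```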
Q: What is 𝔻 under your minimization in (1)?
𝔻 is the search space of training sets. This is another design choice we must make. For example, we may decide upfront that in D we want n/2 positive training examples and n/2 negative ones, but the teacher can arbitrarily design each d-dimensional feature vector. The corresponding search space can be

    𝔻 = { {(x_i, y_i)}_{1:n} | y_i = (−1)^i, x_i ∈ ℝ^d, i = 1…n },    (6)

which is equivalent to ℝ^{nd}. Such a continuous search space is necessary for many standard optimization methods.
As another example, in pool-based machine teaching we are given a candidate item set S. The training set D must be a subset of S. Then 𝔻 = 2^S. Discrete optimization methods, such as those based on submodularity, may be applicable here (Krause and Golovin 2014; Iyer and Bilmes 2013; Bach 2013; Feige 1998; Krause and Guestrin 2005; Nemhauser, Wolsey, and Fisher 1978), as sketched below.
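One plausible baseline for the 𝔻 = 2^S setting (a construction of mine, not an algorithm from the paper) is greedy forward selection: grow D ⊆ S by repeatedly adding the pool item whose inclusion brings the learned model closest to the target.

```python
def greedy_pool_teaching(pool, learner, distance_to_target, budget):
    """Greedy forward selection of a teaching set D from the candidate pool S.

    pool: list of candidate (x, y) items.
    learner: function mapping a training set D to a learned model A(D).
    distance_to_target: function scoring how far a learned model is from theta*.
    """
    D = []
    for _ in range(budget):
        remaining = [item for item in pool if item not in D]
        if not remaining:
            break
        # Add the item that most helps the learner approach the target model.
        best = min(remaining,
                   key=lambda item: distance_to_target(learner(D + [item])))
        D.append(best)
    return D
```

When the underlying set objective is (approximately) submodular, greedy schemes of this form inherit the classical (1 − 1/e)-style approximation guarantees (Nemhauser, Wolsey, and Fisher 1978).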
Q: What about A(D) in (2)? Many machine learning algorithms do not have a closed-form solution w.r.t. the training set D.
For some learners we are lucky to have a closed-form A(D), and we can just plug in the closed-form expression. One example is ordinary least squares regression, where A(D) = (X⊤X)⁻¹X⊤y with D = (X, y). Another example is a Bayesian learner with a prior conjugate to D, where A(D) is simply the posterior distribution (Zhu 2013; Tenenbaum and Griffiths 2001; Rafferty and Griffiths 2010). Yet another example is a kernel density estimator, where A(D) is written as a weighted sum of the items in D (Patil et al. 2014).
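The least squares case makes the inverse function tangible: any full-rank design X labeled consistently with y = Xθ* is a member of A⁻¹(θ*). A quick numpy check (my example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0, 0.5])         # target model
d = theta_star.size

# The teacher designs any full-rank X and labels it consistently.
X = rng.standard_normal((d, d))                 # n = d items suffice here
y = X @ theta_star                              # noiseless teaching labels

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # the OLS learner A(D)
print(np.allclose(theta_hat, theta_star))       # True: the learner lands on theta*
```

The teaching effort view would then ask for the cheapest such (X, y), e.g. the one with the fewest rows.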
But you are right: for most learners there is no closed-form A(D). Nonetheless, a very large fraction of modern machine learning algorithms are optimization-based. That means the learner itself can be expressed as an optimization problem of the form

    min_θ  R(θ, D) + λΩ(θ)         (7)
    s.t.  g(θ) ≤ 0, h(θ) = 0.      (8)

Specifically, R(θ, D) is the empirical risk function, Ω(θ) is the regularizer, and g, h are constraints when applicable. For these modern machine learners, we may replace A(D) in (2) by the machine learning optimization problem. This turns the original teaching problem (1) into a bilevel optimization problem:

    min_{D ∈ 𝔻}  E(D)                             (9)
    s.t.  θ* ∈ argmin_θ  R(θ, D) + λΩ(θ)          (10)
          s.t.  g(θ) ≤ 0, h(θ) = 0.               (11)

In this bilevel optimization problem, the teaching objective (9) is known as the upper problem, while the learning problem (10) is the lower problem.
Q: Wait, I remember bilevel optimization is difficult.
True, and solving (9) in general is a challenge. For certain convex learners, one strategy is to further replace the lower problem (10) by the corresponding Karush-Kuhn-Tucker conditions. In doing so, the lower problem becomes a set of new constraints for the upper problem, and the bilevel optimization reduces to a single-level optimization problem.
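As a toy instance of this reduction (my construction, under the stated assumptions): for an unconstrained one-dimensional ridge learner, the lower problem's KKT conditions collapse to the stationarity equation ∑_i x_i(x_i θ − y_i) + λθ = 0, which, with θ pinned to θ*, is a linear constraint on the training data that the teacher can solve directly:

```python
lam = 0.1          # the learner's ridge weight, known to the teacher
theta_star = 1.5   # target model

# Teach with a single item (x, y). Stationarity at theta_star requires
#   x * (x * theta_star - y) + lam * theta_star = 0.
x = 1.0
y = theta_star * (x * x + lam) / x   # = 1.65: the label over-shoots theta_star * x

# Verify: the ridge learner A(D) = sum(x_i * y_i) / (sum(x_i^2) + lam).
theta_hat = (x * y) / (x * x + lam)
print(abs(theta_hat - theta_star) < 1e-12)   # True
```

Note the teacher must exaggerate the label (y = 1.65 rather than 1.5) to pre-compensate for the learner's shrinkage toward zero.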
Q: I am still concerned. Constraints (2) and (10) look overly stringent; it is like finding a needle in a haystack.
I agree. The feasible sets can be exceedingly ill-formed. One way to address the issue is to relax the teaching constraint so that the learner does not need to exactly learn θ*. Indeed, the original teaching problem (1) is equivalent to

    min_{D ∈ 𝔻}  I_{[A(D) = θ*]} + E(D).          (12)

Recall the indicator function I_Z = 0 if Z is true, and ∞ otherwise. We may relax the indicator by another teaching risk function ρ() with a minimum at θ*:

    min_{D ∈ 𝔻}  ρ(A(D), θ*) + η E(D),            (13)

where η is a weight that balances teaching risk and effort. ρ() would measure the quality of the learned model A(D) against the target model θ*. For example, a natural choice is ρ(A(D), θ*) = ‖A(D) − θ*‖, which measures how close the learned model A(D) is to θ* in parameter space, with an appropriate norm. As another example, for teaching a classifier we may measure the teaching risk by how much A(D) and θ* disagree on future test data:

    ρ(A(D), θ*) = E_{x ∼ P_X} 1_{[A(D)(x) ≠ θ*(x)]}    (14)

where P_X is the marginal test distribution; 1_Z = 1 if Z is true, and 0 otherwise; and we treat the models as classifiers. We can then relax the bilevel optimization problem using the teaching risk function ρ() as follows:

    min_{D ∈ 𝔻, θ}  ρ(θ, θ*) + η E(D)             (15)
    s.t.  θ ∈ argmin_{θ′}  R(θ′, D) + λΩ(θ′)      (16)
          s.t.  g(θ′) ≤ 0, h(θ′) = 0.             (17)

We can also bring the teaching effort function down to a constraint E(D) ≤ B within some budget B.
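Here is a minimal sketch of the relaxed problem (13) for the same one-dimensional ridge learner, with ρ(A(D), θ*) = (A(D) − θ*)² and E(D) = |D|; the candidate sets and the label grid are illustrative choices of mine:

```python
import itertools
import numpy as np

lam, eta, theta_star = 0.1, 0.05, 1.5

def ridge_learner(D):
    """A(D) = argmin_theta sum_i (theta * x_i - y_i)^2 + lam * theta^2."""
    xs = np.array([x for x, _ in D])
    ys = np.array([y for _, y in D])
    return float(xs @ ys) / (float(xs @ xs) + lam)

def relaxed_objective(D):
    """Equation (13): teaching risk plus eta times teaching effort."""
    return (ridge_learner(D) - theta_star) ** 2 + eta * len(D)

# Enumerate small candidate training sets built from a grid of labeled items.
items = [(x, y) for x in (0.5, 1.0) for y in np.linspace(-3, 3, 61)]
candidates = [list(c) for n in (1, 2) for c in itertools.combinations(items, n)]

best = min(candidates, key=relaxed_objective)
print(best, ridge_learner(best))  # a one-item set whose label over-shoots theta_star
```

For a richer learner one would replace this brute-force enumeration with continuous or submodular optimization, as discussed above.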
Q: There is still a glaring flaw: can the teacher really have full knowledge of the learning algorithm?
Good point. Under some circumstances this is plausible, such as when the learner is a robot and the teacher has its specifications. But when the learner is a human, we will rely on established cognitive models from the psychology and education literature. We will turn our attention to human learners in the next part of the article.
Before that, though, I briefly mention a setting where the teacher knows that the learning algorithm A is in a set 𝒜 of candidate learning algorithms, but not which one. As a concrete example, 𝒜 may be the set of SVMs with different regularization weights. Note that, given the same training data D, the learned model A(D) may be different for different A ∈ 𝒜. The teacher knows that the learner is an SVM but doesn't know its regularization weight.
A natural idea for this teaching setting is to probe the learner. A simple probing strategy for classification is as follows. Start by teaching the learner with an initial training set D_0. We assume the teacher cannot directly observe the resulting model A(D_0), but the teacher can ask the learner to make predictions on some test items X and observe the predicted labels A(D_0)(X). The teacher can then eliminate all A′ ∈ 𝒜 where A′(D_0)(X) ≠ A(D_0)(X). This procedure is repeated until 𝒜 is sufficiently reduced. Then the teacher finds the optimal training set for one of the remaining algorithms in 𝒜. An open question is how to make combined probing and teaching optimal.
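The probing loop is straightforward to write down; this sketch assumes generic learner objects with fit/predict methods (an interface of my choosing, not a particular library):

```python
def probe(candidates, D0, probe_items, observed_labels):
    """Keep only candidate algorithms consistent with the observed predictions.

    candidates: hypothesized learning algorithms A' in the set.
    observed_labels: the labels A(D0)(X) returned by the actual learner
                     on probe items X after training on D0.
    """
    surviving = []
    for A in candidates:
        model = A.fit(D0)
        if list(model.predict(probe_items)) == list(observed_labels):
            surviving.append(A)   # A' agrees with the true learner so far
    return surviving

# Repeat with fresh probe items until the candidate set is small enough,
# then compute the optimal teaching set for one of the survivors.
```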
Q: You generalized the notion of teaching optimality from |D| to an arbitrary E(D). Is there a theory for E(D) similar to teaching dimension?
This is another open question.

Of Humans

Q: What can machine teaching do for humans?
Machine teaching provides a unique approach to enhancing education and personnel training. In principle, we can use machine teaching to design the optimal lesson for an individual student.

Q: There are already many intelligent computer tutoring systems out there, and MOOCs; what is unique about machine teaching?
Oversimplifying somewhat, some existing education systems treat the human learner as a black-box function f(D). The input D is an educational intervention, e.g., which course modules to offer to the student. The output of f(D) is the student's test score. The systems can make point evaluations of f at some input D, and aim to maximize f based on such evaluations; see, e.g., (Lindsey et al. 2013). Being a black box, the actual learning process of the student is not directly modeled.
In contrast, machine teaching explicitly assumes a computational learning algorithm, or cognitive model, A of the student. Given an educational intervention D, one may first compute the resulting cognitive state A(D), which then leads to the observed test score via an appropriate teaching risk function ρ(A(D), θ*). Thus, machine teaching treats the human learner as a transparent box. There is literature on such cognitive models for tutoring (Koedinger et al. 2013). Machine teaching's focus is to explicitly compute the inverse A⁻¹ to find the optimal lesson directly.

Q: What is the advantage of machine teaching?
The lesson will be optimal and personalized, given a correct cognitive model of the student.

Q: Isn't a correct cognitive model a strong assumption?
True. Like many things in mathematics, the more conditions one posits, the stronger the result. Of course, empirically the quality of the lesson will depend on the validity of the cognitive model.

Q: Do we have the correct cognitive model for education?
Yes and no. For low-level personnel training tasks such as improving the performance of human categorization, there are multiple well-established but competing cognitive models. On the other hand, accurate cognitive models for higher-level education are still a work in progress. The latter is currently the major limit on the applicability of machine teaching to education. However, the community does have useful model fragments, and the machine teacher should take full advantage of them.

Q: So, is there any indication that machine teaching works at all on humans?
Yes. In (Patil et al. 2014) the authors studied a human categorization task, where participants learn to classify line segments as short vs. long. The assumed cognitive model was a limited-capacity retrieval model of human learners, similar to a kernel density classifier. The teaching risk function ρ() encodes the desire to minimize the expected future test error.
Machine teaching identified an interesting optimal training set: the negative training items are tightly clumped together, as are the positive training items, and the negative and positive items are symmetric about, but far away from, the decision boundary; see the figure below:

[Figure: the optimal training set on [0, 1] with decision boundary θ* = 0.5; negative items (y = −1) clumped on one side, positive items (y = 1) clumped symmetrically on the other, both far from θ*.]

This solution resembles the earlier toy example; the larger separation between positive and negative items is due to a wide kernel bandwidth parameter in the cognitive model, which fits noisy human behavior.
Human participants indeed generalize better from this training set computed by machine teaching, compared to a control group who received iid uniformly sampled training sets. Both groups were tested on a test set consisting of a dense grid over the feature space. The group who received the machine-teaching training set had an average test-set accuracy of 72.5%, while the control group achieved 69.8%. The difference is statistically significant.
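To see why a wide-bandwidth kernel density learner favors tight, well-separated clumps, here is a small numpy demonstration; this is my simplified stand-in for the limited-capacity model of Patil et al. (2014), with illustrative item locations and bandwidth:

```python
import numpy as np

def kde_classifier(D, sigma):
    """Kernel density classifier: predict the class with the higher kernel vote."""
    def predict(x):
        votes = {}
        for xi, yi in D:
            votes[yi] = votes.get(yi, 0.0) + np.exp(-(x - xi) ** 2 / (2 * sigma ** 2))
        return max(votes, key=votes.get)
    return predict

# Clumped teaching set, symmetric about the target boundary theta_star = 0.5.
D = [(0.15, -1), (0.17, -1), (0.19, -1), (0.81, +1), (0.83, +1), (0.85, +1)]
predict = kde_classifier(D, sigma=0.3)   # deliberately wide bandwidth

grid = np.linspace(0, 1, 101)
labels = [predict(x) for x in grid]
boundary = grid[labels.index(+1)]        # first grid point classified positive
print(boundary)                          # close to 0.5, the target boundary
```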
Q: That's promising. Still, how do you know that you have the correct cognitive model? I remember you said there are other competing models.
As is often said, "all models are wrong, some are more useful." One may use the prior literature and data to come up with the best hypothesized model. It may be good enough for machine teaching to produce useful lessons. While identifying such a model is an issue, it is also an opportunity. Strangely enough, another potential use of machine teaching in psychology is to adjudicate cognitive models.

Q: Really? How can machine teaching tell which cognitive model is correct?
Different cognitive models correspond to different learning algorithms. Call these algorithms A_1, …, A_m. Given a teaching target θ*, machine teaching can compute the optimal training sets A_1⁻¹(θ*), …, A_m⁻¹(θ*) for each cognitive model, respectively. If the previous result is any indication, these optimal training sets will each be non-iid and idiosyncratic.
We can then conduct a human experiment where we teach a participant using one of the m training sets, and test her on a common test set. Comparing multiple participants' test set performance, we identify the best training set, say A_j⁻¹(θ*), among A_1⁻¹(θ*), …, A_m⁻¹(θ*). This could lend weight to the j-th cognitive model A_j being the best cognitive model.

Q: Tell me more about it.
A variant is as follows. We conduct a different kind of human experiment: we ask the participant to be the teacher, and let the participant design a lesson intended to teach θ* to a human student. Let D be the lesson, i.e., the training data, that the participant comes up with. Assuming the human teacher is (near) optimal, we can then compare D to A_1⁻¹(θ*), …, A_m⁻¹(θ*) to identify the best-matching cognitive model A_j. We may then view A_j as what the human teacher assumes about the student. This is the strategy used in (Khan, Zhu, and Mutlu 2011).

Coda

Q: Now I am excited about machine teaching. What can I do?
I call on the research community to study open problems in machine teaching, including:
(Optimization) Despite our early successes in certain teaching settings, solving for the optimal training data D is still difficult in general. We expect that many tools developed in the optimization community can be brought to bear on difficult problems like (15).
(Theory) Machine teaching originated from the theoretical study of teaching dimension. It is important to understand the theoretical properties of the optimal training set under more general teaching settings. We speculate that information theory may be a suitable tool here: the teacher is the encoder, the learner is the decoder, and the message is the target model θ*. But there is a twist: the decoder is not ideal. It is specified by whatever machine learning algorithm it runs.
(Psychology) Cognitive psychology has been the first place where machine teaching met humans. More studies are needed to adjudicate existing cognitive models for human categorization. Human experiments also call for new developments in machine teaching, such as the sequential teaching setting (Bengio et al. 2009; Khan, Zhu, and Mutlu 2011; McCandliss et al. 2002; Kumar, Packer, and Koller 2010; Lee and Grauman 2011; Cakmak and Lopes 2012; Pashler and Mozer 2013; Kobayashi and Shinohara 2009; Balbach and Zeugmann 2009). More broadly, psychologists have proposed cognitive models for many supervised and unsupervised human learning tasks besides categorization. These tasks form an ideal playground for machine teaching practitioners.
(Education) Arguably more complex, education first needs to identify computable cognitive models of the student. Existing intelligent tutoring systems are a good place to start: with a little effort, one may hypothesize the inner workings of the student black box. As a concrete example, at the University of Wisconsin–Madison we are developing a cognitive model for chemistry education.
(Novel applications) Consider computer security. As mentioned earlier, machine teaching also describes the optimal attack strategy if a hacker wants to influence a learning agent; see (Mei and Zhu 2015) and the references therein. The question is: knowing the optimal attack strategy predicted by machine teaching, can we effectively defend the learning agent? There may be other serendipitous applications of machine teaching besides education and computer security. Perhaps you will discover the next one!
Acknowledgments

I am grateful to many people with whom I had helpful discussions, including Michael Kearns, Robert Nowak, Daniel Lowd, Timothy Rogers, Chuck Kalish, Bradley Love, Michael Mozer, Joshua Tenenbaum, Tom Griffiths, Stephen Wright, Michael Ferris, Bilge Mutlu, Martina Rau, Kaustubh Patil, and Shike Mei. I also thank the anonymous reviewers for their constructive comments. Support for this research was provided in part by NSF grant IIS-0953219 and by the University of Wisconsin–Madison Graduate School with funding from the Wisconsin Alumni Research Foundation.

References

Angluin, D., and Krikis, M. 1997. Teachers, learners and black boxes. In COLT, 285–297. ACM.
Angluin, D., and Krikis, M. 2003. Learning from different teachers. Machine Learning 51(2):137–163.
Angluin, D. 2004. Queries revisited. Theoretical Computer Science 313(2):175–194.
Bach, F. 2013. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning 6(2-3):145–373.
Balbach, F. J., and Zeugmann, T. 2006. Teaching randomized learners. In COLT, 229–243. Springer.
Balbach, F. J., and Zeugmann, T. 2009. Recent developments in algorithmic teaching. In The 3rd Intl. Conf. on Language and Automata Theory and Applications, 1–18.
Balbach, F. J. 2008. Measuring teachability using variants of the teaching dimension. Theoretical Computer Science 397(1-3):94–113.
Ben-David, S., and Eiron, N. 1998. Self-directed learning and its relation to the VC-dimension and to teacher-directed learning. Machine Learning 33(1):87–104.
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML.
Cakmak, M., and Lopes, M. 2012. Algorithmic and human teaching of sequential decision tasks. In AAAI.
Doliwa, T.; Fan, G.; Simon, H. U.; and Zilles, S. 2014. Recursive teaching dimension, VC-dimension and sample compression. JMLR 15:3107–3131.
Feige, U. 1998. A threshold of ln n for approximating set cover. Journal of the ACM 45(4):634–652.
Goldman, S., and Kearns, M. 1995. On the complexity of teaching. Journal of Computer and System Sciences 50(1):20–31.
Goldman, S., and Mathias, H. 1996. Teaching a smarter learner. Journal of Computer and System Sciences 52(2):255–267.
Iyer, R. K., and Bilmes, J. A. 2013. Submodular optimization with submodular cover and submodular knapsack constraints. In NIPS.
Khan, F.; Zhu, X.; and Mutlu, B. 2011. How do humans teach: On curriculum learning and teaching dimension. In NIPS.
Kobayashi, H., and Shinohara, A. 2009. Complexity of teaching by a restricted number of examples. In COLT.
Koedinger, K. R.; Brunskill, E.; de Baker, R. S. J.; McLaughlin, E. A.; and Stamper, J. C. 2013. New potentials for data-driven intelligent tutoring system development and optimization. AI Magazine 34(3):27–41.
Krause, A., and Golovin, D. 2014. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems (to appear). Cambridge University Press.
Krause, A., and Guestrin, C. 2005. Near-optimal value of information in graphical models. In UAI.
Kumar, M. P.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NIPS.
Lee, Y. J., and Grauman, K. 2011. Learning the easy things first: Self-paced visual category discovery. In CVPR.
Lindsey, R.; Mozer, M.; Huggins, W. J.; and Pashler, H. 2013. Optimizing instructional policies. In NIPS.
Mathias, H. D. 1997. A model of interactive teaching. Journal of Computer and System Sciences 54(3):487–501.
McCandliss, B. D.; Fiez, J. A.; Protopapas, A.; Conway, M.; and McClelland, J. L. 2002. Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. Cognitive, Affective, & Behavioral Neuroscience 2(2):89–108.
Mei, S., and Zhu, X. 2015. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI.
Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming 14(1):265–294.
Pashler, H., and Mozer, M. C. 2013. When does fading enhance perceptual category learning? Journal of Experimental Psychology: Learning, Memory, and Cognition.
Patil, K.; Zhu, X.; Kopec, L.; and Love, B. 2014. Optimal teaching for limited-capacity human learners. In NIPS.
Rafferty, A. N., and Griffiths, T. L. 2010. Optimal language learning: The importance of starting representative. In 32nd Annual Conference of the Cognitive Science Society.
Rivest, R. L., and Yin, Y. L. 1995. Being taught can be faster than asking questions. In COLT, 144–151. ACM.
Settles, B. 2012. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool.
Shinohara, A., and Miyano, S. 1991. Teachability in computational learning. New Generation Computing 8(4):337–348.
Tenenbaum, J. B., and Griffiths, T. L. 2001. The rational basis of representativeness. In 23rd Annual Conference of the Cognitive Science Society.
Zhu, X. 2013. Machine teaching for Bayesian learners in the exponential family. In NIPS.
Zilles, S.; Lange, S.; Holte, R.; and Zinkevich, M. 2011. Models of cooperative teaching and learning. JMLR 12:349–384.
