
INTRODUCTION TO MACHINE LEARNING

Introduction to Machine Learning


Alex Smola and S.V.N. Vishwanathan
Yahoo! Labs
Santa Clara
and
Departments of Statistics and Computer Science
Purdue University
and
College of Engineering and Computer Science
Australian National University

published by the press syndicate of the university of cambridge
The Pitt Building, Trumpington Street, Cambridge, United Kingdom

cambridge university press
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011-4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org

© Cambridge University Press 2008

This book is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without
the written permission of Cambridge University Press.

First published 2008

Printed in the United Kingdom at the University Press, Cambridge

Typeface Monotype Times 10/13pt    System LaTeX 2e    [Alexander J. Smola and S.V.N. Vishwanathan]

A catalogue record for this book is available from the British Library

Library of Congress Cataloguing in Publication data available

ISBN 0 521 82583 0 hardback

Author: smola
Revision: 174
Timestamp: February 8, 2010
URL: svn://smola@repos.stat.purdue.edu/thebook/trunk/Book/thebook.tex

Contents

Preface

1 Introduction
  1.1 A Taste of Machine Learning
    1.1.1 Applications
    1.1.2 Data
    1.1.3 Problems
  1.2 Probability Theory
    1.2.1 Random Variables
    1.2.2 Distributions
    1.2.3 Mean and Variance
    1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule
  1.3 Basic Algorithms
    1.3.1 Naive Bayes
    1.3.2 Nearest Neighbor Estimators
    1.3.3 A Simple Classifier
    1.3.4 Perceptron
    1.3.5 K-Means

2 Density Estimation
  2.1 Limit Theorems
    2.1.1 Fundamental Laws
    2.1.2 The Characteristic Function
    2.1.3 Tail Bounds
    2.1.4 An Example
  2.2 Parzen Windows
    2.2.1 Discrete Density Estimation
    2.2.2 Smoothing Kernel
    2.2.3 Parameter Estimation
    2.2.4 Silverman's Rule
    2.2.5 Watson-Nadaraya Estimator
  2.3 Exponential Families
    2.3.1 Basics
    2.3.2 Examples
  2.4 Estimation
    2.4.1 Maximum Likelihood Estimation
    2.4.2 Bias, Variance and Consistency
    2.4.3 A Bayesian Approach
    2.4.4 An Example
  2.5 Sampling
    2.5.1 Inverse Transformation
    2.5.2 Rejection Sampler

3 Directed Graphical Models
  3.1 Introduction
    3.1.1 Alarms and Burglars
    3.1.2 Formal Definition
    3.1.3 d-Separation and Dependence
  3.2 Estimation
    3.2.1 Information Theory Primer
    3.2.2 An Example: Clustering
    3.2.3 Direct Maximization
    3.2.4 Expectation Maximization
    3.2.5 Gibbs Sampling
    3.2.6 Collapsed Sampling
  3.3 Applications
    3.3.1 Hidden Markov Models
    3.3.2 Kalman Filter
    3.3.3 Factor Analysis
    3.3.4 Latent Dirichlet Allocation

4 Undirected Graphical Models
  4.0.5 Definition
  4.1 Examples
  4.2 Nonparametric Exponential Families
  4.3 Inference
  4.4 The Generalized Distributive Law
  4.5 Approximate Inference

5 Optimization
  5.1 Preliminaries
    5.1.1 Convex Sets
    5.1.2 Convex Functions
    5.1.3 Subgradients
    5.1.4 Strongly Convex Functions
    5.1.5 Convex Functions with Lipschitz Continuous Gradient
  5.2 Unconstrained Smooth Convex Minimization
    5.2.1 Minimizing a One-Dimensional Convex Function
    5.2.2 Gradient Descent
    5.2.3 Higher Order Methods
  5.3 Constrained Optimization
    5.3.1 Lagrange Duality
    5.3.2 Linear and Quadratic Programs
  5.4 Stochastic Optimization
    5.4.1 Stochastic Gradient Descent
  5.5 Nonconvex Optimization
    5.5.1 BFGS
    5.5.2 Randomization
    5.5.3 Concave-Convex Procedure

6 Conditional Densities
  6.1 Conditional Exponential Models
    6.1.1 Basic Model
    6.1.2 Joint Feature Map
    6.1.3 Optimization
    6.1.4 Gaussian Process Link
  6.2 Binary Classification
    6.2.1 Binomial Model
    6.2.2 Optimization
  6.3 Regression
    6.3.1 Conditionally Normal Models
    6.3.2 Posterior Distribution
    6.3.3 Heteroscedastic Estimation
  6.4 Multiclass Classification
    6.4.1 Conditionally Multinomial Models
  6.5 What is a CRF?
    6.5.1 Linear Chain CRFs
    6.5.2 Higher Order CRFs
    6.5.3 Kernelized CRFs
  6.6 Optimization Strategies
    6.6.1 Getting Started
    6.6.2 Optimization Algorithms
    6.6.3 Handling Higher Order CRFs
  6.7 Hidden Markov Models
  6.8 Further Reading
    6.8.1 Optimization

7 Kernels and Function Spaces
  7.1 Kernels
    7.1.1 Feature Maps
    7.1.2 The Kernel Trick
    7.1.3 Examples of Kernels
  7.2 Algorithms
    7.2.1 Kernel Perceptron
    7.2.2 Trivial Classifier
    7.2.3 Kernel Principal Component Analysis
  7.3 Reproducing Kernel Hilbert Spaces
    7.3.1 Hilbert Spaces
    7.3.2 Theoretical Properties
    7.3.3 Regularization
  7.4 Banach Spaces
    7.4.1 Properties
    7.4.2 Norms and Convex Sets

8 Linear Models
  8.1 Support Vector Classification
    8.1.1 A Regularized Risk Minimization Viewpoint
    8.1.2 An Exponential Family Interpretation
    8.1.3 Specialized Algorithms for Training SVMs
    8.1.4 The ν-trick
  8.2 Support Vector Regression
    8.2.1 Incorporating the ν-trick
    8.2.2 Regularized Risk Minimization
  8.3 Novelty Detection
    8.3.1 Density Estimation via the Exponential Family
  8.4 Ordinal Regression
    8.4.1 Preferences
    8.4.2 Dual Problem
    8.4.3 Optimization
  8.5 Margins and Probability
  8.6 Large Margin Classifiers with Structure
    8.6.1 Margin
    8.6.2 Penalized Margin
    8.6.3 Nonconvex Losses
  8.7 Applications
    8.7.1 Sequence Annotation
    8.7.2 Matching
    8.7.3 Ranking
    8.7.4 Shortest Path Planning
    8.7.5 Image Annotation
    8.7.6 Contingency Table Loss
  8.8 Optimization
    8.8.1 Column Generation
    8.8.2 Bundle Methods
    8.8.3 Overrelaxation in the Dual
  8.9 CRFs vs Structured Large Margin Models
    8.9.1 Loss Function
    8.9.2 Dual Connections
    8.9.3 Optimization

9 Model Selection
  9.1 Basics
    9.1.1 Estimators
    9.1.2 Maximum Likelihood Revisited
    9.1.3 Empirical Methods
  9.2 Uniform Convergence Bounds
    9.2.1 Vapnik-Chervonenkis Dimension
    9.2.2 Rademacher Averages
    9.2.3 Compression Bounds
  9.3 Bayesian Methods
    9.3.1 Priors Revisited
    9.3.2 PAC-Bayes Bounds
  9.4 Asymptotic Analysis
    9.4.1 Efficiency of an Estimator
    9.4.2 Asymptotic Efficiency

10 Maximum Mean Discrepancy
  10.1 Fenchel Duality
    10.1.1 Motivation
    10.1.2 Applications
  10.2 Dual Problems
    10.2.1 Maximum Likelihood
    10.2.2 Maximum a Posteriori
  10.3 Priors
    10.3.1 Motivation
    10.3.2 Conjugate Priors
    10.3.3 Priors and Maxent
  10.4 Moments
    10.4.1 Sufficient Statistics and the Marginal Polytope
  10.5 Two Sample Test
    10.5.1 Maximum Mean Discrepancy
    10.5.2 Mean Map and Norm
    10.5.3 Efficient Estimation
    10.5.4 Covariate Shift Correction
  10.6 Independence Measures
    10.6.1 Test Statistic
    10.6.2 Efficient Estimation
  10.7 Applications
    10.7.1 Independent Component Analysis
    10.7.2 Feature Selection
    10.7.3 Clustering
    10.7.4 Maximum Variance Unfolding
  10.8 Introduction
  10.9 The Maximum Mean Discrepancy
    10.9.1 Definition of the Maximum Mean Discrepancy
    10.9.2 The MMD in Reproducing Kernel Hilbert Spaces
    10.9.3 Witness Function of the MMD for RKHSs
    10.9.4 The MMD in Other Function Classes
    10.9.5 Examples of Non-RKHS Function Classes
  10.10 Background Material
    10.10.1 Statistical Hypothesis Testing
    10.10.2 A Negative Result
    10.10.3 Previous Work
  10.11 Tests Based on Uniform Convergence Bounds
    10.11.1 Bound on the Biased Statistic and Test
    10.11.2 Bound on the Unbiased Statistic and Test
  10.12 Test Based on the Asymptotic Distribution of the Unbiased Statistic
  10.13 A Linear Time Statistic and Test
  10.14 Similarity Measures Related to MMD
    10.14.1 Link with L2 Distance between Parzen Window Estimates
    10.14.2 Set Kernels and Kernels Between Probability Measures
    10.14.3 Kernel Measures of Independence
    10.14.4 Kernel Statistics Using a Distribution over Witness Functions
    10.14.5 Outlier Detection
  10.15 Experiments
    10.15.1 Toy Example: Two Gaussians
    10.15.2 Data Integration
    10.15.3 Computational Cost
    10.15.4 Attribute Matching
  10.16 Conclusion
  10.17 Large Deviation Bounds for Tests with Finite Sample Guarantees
    10.17.1 Preliminary Definitions and Theorems
    10.17.2 Bound when p and q May Differ
    10.17.3 Bound when p = q and m = n
  10.18 Proofs for Asymptotic Tests
    10.18.1 Convergence of the Empirical MMD under H0
    10.18.2 Moments of the Empirical MMD under H0

11 Reinforcement Learning

Appendix 1  Linear Algebra and Functional Analysis
Appendix 2  Conjugate Distributions
Appendix 3  Loss Functions
Bibliography

Preface

Since this is a textbook we biased our selection of references towards easily


accessible work rather than the original references. While this may not be
in the interest of the inventors of these concepts, it greatly simplifies access
to those topics. Hence we encourage the reader to follow the references in
the cited works should they be interested in finding out who may claim
intellectual ownership of certain key ideas.

Structure of the Book

[Diagram: the book's chapters (Introduction, Density Estimation, Graphical Models,
Duality and Estimation, Conditional Densities, Kernels, Optimization, Conditional
Random Fields, Linear Models, Structured Estimation, Moment Methods, and
Reinforcement Learning) arranged in three alternative reading orders.]

Canberra, August 2008

1
Introduction

Over the past two decades Machine Learning has become one of the mainstays of information technology and with that, a rather central, albeit usually
hidden, part of our life. With the ever increasing amounts of data becoming
available there is good reason to believe that smart data analysis will become
even more pervasive as a necessary ingredient for technological progress.
The purpose of this chapter is to provide the reader with an overview of
the vast range of applications which have at their heart a machine learning
problem and to bring some degree of order to the zoo of problems. After
that, we will discuss some basic tools from statistics and probability theory,
since they form the language in which many machine learning problems must
be phrased to become amenable to solving. Finally, we will outline a set of
fairly basic yet effective algorithms to solve an important problem, namely
that of classification. More sophisticated tools, a discussion of more general
problems and a detailed analysis will follow in later parts of the book.

1.1 A Taste of Machine Learning


Machine learning can appear in many guises. We now discuss a number of
applications, the types of data they deal with, and finally, we formalize the
problems in a somewhat more stylized fashion. The latter is key if we want to
avoid reinventing the wheel for every new application. Instead, much of the
art of machine learning is to reduce a range of fairly disparate problems to
a set of fairly narrow prototypes. Much of the science of machine learning is
then to solve those problems and provide good guarantees for the solutions.

1.1.1 Applications
Most readers will be familiar with the concept of web page ranking. That
is, the process of submitting a query to a search engine, which then finds
webpages relevant to the query and which returns them in their order of
relevance. See e.g. Figure 1.1 for an example of the query results for "machine
learning". That is, the search engine returns a sorted list of webpages
given a query.

Fig. 1.1. The 5 top scoring webpages for the query "machine learning". [Figure:
screenshot of a Google search results page, omitted.]

To achieve this goal, a search engine needs to know which
pages are relevant and which pages match the query. Such knowledge can be
gained from several sources: the link structure of webpages, their content,
the frequency with which users will follow the suggested links in a query, or
from examples of queries in combination with manually ranked webpages.
Increasingly machine learning rather than guesswork and clever engineering
is used to automate the process of designing a good search engine [RPB06].
A rather related application is collaborative filtering. Internet book-
stores such as Amazon, or video rental sites such as Netflix use this information extensively to entice users to purchase additional goods (or rent more
movies). The problem is quite similar to the one of web page ranking. As
before, we want to obtain a sorted list (in this case of articles). The key difference is that an explicit query is missing and instead we can only use past
purchase and viewing decisions of the user to predict future viewing and
purchase habits. The key side information here are the decisions made by
similar users, hence the collaborative nature of the process. See Figure 1.2
for an example. It is clearly desirable to have an automatic system to solve
this problem, thereby avoiding guesswork and time [BK07].
An equally ill-defined problem is that of automatic translation of documents. At one extreme, we could aim at fully understanding a text before
translating it using a curated set of rules crafted by a computational linguist
well versed in the two languages we would like to translate. This is a rather
arduous task, in particular given that text is not always grammatically correct, nor is the document understanding part itself a trivial one. Instead, we
could simply use examples of translated documents, such as the proceedings
of the Canadian parliament or other multilingual entities (United Nations,
European Union, Switzerland) to learn how to translate between the two
languages. In other words, we could use examples of translations to learn
how to translate. This machine learning approach proved quite successful
[BPX+ 07].
Many security applications, e.g. for access control, use face recognition as
one of their components. That is, given the photo (or video recording) of a
person, recognize who this person is. In other words, the system needs to
classify the faces into one of many categories (Alice, Bob, Charlie, . . . ) or
decide that it is an unknown face. A similar, yet conceptually quite different
problem is that of verification. Here the goal is to verify whether the person
in question is who he claims to be. Note that, unlike before, this
is now a yes/no question. To deal with different lighting conditions, facial
expressions, whether a person is wearing glasses, hairstyle, etc., it is desirable
to have a system which learns which features are relevant for identifying a
person.
Another application where learning helps is the problem of named entity
recognition (see Figure 1.4). That is, the problem of identifying entities,
such as places, titles, names, actions, etc. from documents. Such steps are
crucial in the automatic digestion and understanding of documents. Some
modern e-mail clients, such as Apple's Mail.app, nowadays ship with the
ability to identify addresses in mails and file them automatically in an
address book. While systems using hand-crafted rules can lead to satisfactory results, it is far more efficient to use examples of marked-up documents
to learn such dependencies automatically, in particular if we want to deploy our system in many languages. For instance, while "bush" and "rice"
are clearly terms from agriculture, it is equally clear that in the context of
contemporary politics they refer to members of the Republican Party.

Fig. 1.2. Books recommended by Amazon.com when viewing Tom Mitchell's Machine Learning book [Mit97]. It is desirable for the vendor to recommend relevant
books which a user might purchase. [Figure: screenshot of the Amazon.com product
page, omitted.]

Fig. 1.3. 11 pictures of the same person taken from the Yale face recognition
database. The challenge is to recognize that we are dealing with the same person in all 11 cases. [Figure omitted.]

HAVANA (Reuters) - The European Union's top development aid official
left Cuba on Sunday convinced that EU diplomatic sanctions against
the communist island should be dropped after Fidel Castro's
retirement, his main aide said.

<TYPE="ORGANIZATION">HAVANA</> (<TYPE="ORGANIZATION">Reuters</>) - The
<TYPE="ORGANIZATION">European Union</>'s top development aid official left
<TYPE="ORGANIZATION">Cuba</> on Sunday convinced that EU diplomatic sanctions
against the communist <TYPE="LOCATION">island</> should be dropped after
<TYPE="PERSON">Fidel Castro</>'s retirement, his main aide said.

Fig. 1.4. Named entity tagging of a news article (using LingPipe). The relevant
locations, organizations and persons are tagged for further information extraction.
Other applications which take advantage of learning are speech recognition (annotate an audio sequence with text, such as the system shipping
with Microsoft Vista), the recognition of handwriting (annotate a sequence
of strokes with text, a feature common to many PDAs), trackpads of computers (e.g. Synaptics, a major manufacturer of such pads derives its name
from the synapses of a neural network), the detection of failure in jet engines, avatar behavior in computer games (e.g. Black and White), direct
marketing (companies use past purchase behavior to guesstimate whether
you might be willing to purchase even more) and floor cleaning robots (such
as iRobot's Roomba). The overarching theme of learning problems is that
there exists a nontrivial dependence between some observations, which we
will commonly refer to as x and a desired response, which we refer to as y,
for which a simple set of deterministic rules is not known. By using learning
we can infer such a dependency between x and y in a systematic fashion.
We conclude this section by discussing the problem of classification,
since it will serve as a prototypical problem for a significant part of this
book. It occurs frequently in practice: for instance, when performing spam
filtering, we are interested in a yes/no answer as to whether an e-mail contains relevant information or not. Note that this issue is quite user dependent: for a frequent traveller e-mails from an airline informing him about
recent discounts might prove valuable information, whereas for many other
recipients this might prove more of a nuisance (e.g. when the e-mail relates
to products available only overseas). Moreover, the nature of annoying emails might change over time, e.g. through the availability of new products
(Viagra, Cialis, Levitra, . . . ), different opportunities for fraud (the Nigerian
419 scam which took a new twist after the Iraq war), or different data types
(e.g. spam which consists mainly of images). To combat these problems we

want to build a system which is able to learn how to classify new e-mails.
A seemingly unrelated problem, that of cancer diagnosis, shares a common
structure: given histological data (e.g. from a microarray analysis of a patient's tissue) infer whether a patient is healthy or not. Again, we are asked
to generate a yes/no answer given a set of observations. See Figure 1.5 for
an example.

Fig. 1.5. Binary classification; separate stars from diamonds. In this example we
are able to do so by drawing a straight line which separates both sets. We will see
later that this is an important example of what is called a linear classifier.

1.1.2 Data
It is useful to characterize learning problems according to the type of data
they use. This is a great help when encountering new challenges, since quite
often problems on similar data types can be solved with very similar techniques. For instance natural language processing and bioinformatics use very
similar tools for strings of natural language text and for DNA sequences.
Vectors constitute the most basic entity we might encounter in our work.
For instance, a life insurance company might be interested in obtaining the
vector of variables (blood pressure, heart rate, height, weight, cholesterol
level, smoker, gender) to infer the life expectancy of a potential customer.
A farmer might be interested in determining the ripeness of fruit based on
(size, weight, spectral data). An engineer might want to find dependencies
in (voltage, current) pairs. Likewise one might want to represent documents
by a vector of counts which describe the occurrence of words. The latter is
commonly referred to as bag of words features.
One of the challenges in dealing with vectors is that the scales and units
of different coordinates may vary widely. For instance, we could measure the
weight in kilograms, pounds, grams, tons, stones, all of which would amount
to multiplicative changes. Likewise, when representing temperatures, we
have a full class of affine transformations, depending on whether we represent them in terms of Celsius, Kelvin or Fahrenheit. One way of dealing

with those issues in an automatic fashion is to normalize the data. We will
discuss means of doing so in an automatic fashion.
Lists: In some cases the vectors we obtain may contain a variable number
of features. For instance, a physician might not necessarily decide to perform
a full battery of diagnostic tests if the patient appears to be healthy.
Sets may appear in learning problems whenever there is a large number of
potential causes of an effect, which are not well determined. For instance, it is
relatively easy to obtain data concerning the toxicity of mushrooms. It would
be desirable to use such data to infer the toxicity of a new mushroom given
information about its chemical compounds. However, mushrooms contain a
cocktail of compounds out of which one or more may be toxic. Consequently
we need to infer the properties of an object given a set of features, whose
composition and number may vary considerably.
Matrices are a convenient means of representing pairwise relationships.
For instance, in collaborative filtering applications the rows of the matrix
may represent users whereas the columns correspond to products. Only in
some cases we will have knowledge about a given (user, product) combination, such as the rating of the product by a user.
A related situation occurs whenever we only have similarity information
between observations, as implemented by a semi-empirical distance measure. Some homology searches in bioinformatics, e.g. variants of BLAST
[AGML90], only return a similarity score which does not necessarily satisfy
the requirements of a metric.
Images could be thought of as two dimensional arrays of numbers, that is,
matrices. This representation is very crude, though, since they exhibit spatial coherence (lines, shapes) and (natural images exhibit) a multiresolution
structure. That is, downsampling an image leads to an object which has very
similar statistics to the original image. Computer vision and psychooptics
have created a raft of tools for describing these phenomena.
Video adds a temporal dimension to images. Again, we could represent
them as a three dimensional array. Good algorithms, however, take the temporal coherence of the image sequence into account.
Trees and Graphs are often used to describe relations between collections of objects. For instance the ontology of webpages of the DMOZ project
(www.dmoz.org) has the form of a tree with topics becoming increasingly
refined as we traverse from the root to one of the leaves (Arts → Animation
→ Anime → General Fan Pages → Official Sites). In the case of gene ontology the relationships form a directed acyclic graph, also referred to as the
GO-DAG [ABB+ 00].
Both examples above describe estimation problems where our observations

are vertices of a tree or graph. However, graphs themselves may be the
observations. For instance, the DOM-tree of a webpage, the call-graph of
a computer program, or the protein-protein interaction networks may form
the basis upon which we may want to perform inference.
Strings occur frequently, mainly in the area of bioinformatics and natural
language processing. They may be the input to our estimation problems, e.g.
when classifying an e-mail as spam, when attempting to locate all names of
persons and organizations in a text, or when modeling the topic structure
of a document. Equally well they may constitute the output of a system.
For instance, we may want to perform document summarization, automatic
translation, or attempt to answer natural language queries.
Compound structures are the most commonly occurring object. That
is, in most situations we will have a structured mix of different data types.
For instance, a webpage might contain images, text, tables, which in turn
contain numbers, and lists, all of which might constitute nodes on a graph of
webpages linked among each other. Good statistical modelling takes such dependencies and structures into account in order to tailor sufficiently flexible
models.

1.1.3 Problems
The range of learning problems is clearly large, as we saw when discussing
applications. That said, researchers have identified an ever growing number
of templates which can be used to address a large set of situations. It is those
templates which make deployment of machine learning in practice easy and
our discussion will largely focus on a choice set of such problems. We now
give a by no means complete list of templates.
Binary Classification is probably the most frequently studied problem
in machine learning and it has led to a large number of important algorithmic
and theoretic developments over the past century. In its simplest form it
reduces to the question: given a pattern x drawn from a domain X, estimate
which value an associated binary random variable y ∈ {±1} will assume.
For instance, given pictures of apples and oranges, we might want to state
whether the object in question is an apple or an orange. Equally well, we
might want to predict whether a home owner might default on his loan,
given income data, his credit history, or whether a given e-mail is spam or
ham. The ability to solve this basic problem already allows us to address a
large variety of practical settings.
There are many variants with regard to the protocol in which we are
required to make our estimation:


Fig. 1.6. Left: binary classification. Right: 3-class classification. Note that in the
latter case we have much more room for ambiguity. For instance, being able to
distinguish stars from diamonds may not suffice to identify either of them correctly,
since we also need to distinguish both of them from triangles.

• We might see a sequence of (xi , yi ) pairs for which yi needs to be estimated
  in an instantaneous online fashion. This is commonly referred to as online
  learning.
• We might observe a collection X := {x1 , . . . , xm } and Y := {y1 , . . . , ym } of
  pairs (xi , yi ) which are then used to estimate y for a (set of) so-far unseen
  X' := {x'1 , . . . , x'm' }. This is commonly referred to as batch learning.
• We might be allowed to know X' already at the time of constructing the
  model. This is commonly referred to as transduction.
• We might be allowed to choose X for the purpose of model building. This
  is known as active learning.
• We might not have full information about X, e.g. some of the coordinates
  of the xi might be missing, leading to the problem of estimation with
  missing variables.
• The sets X and X' might come from different data sources, leading to the
  problem of covariate shift correction.
• We might be given observations stemming from two problems at the same
  time with the side information that both problems are somehow related.
  This is known as co-training.
• Mistakes of estimation might be penalized differently depending on the
  type of error, e.g. when trying to distinguish diamonds from rocks a very
  asymmetric loss applies.
Multiclass Classification is the logical extension of binary classification. The main difference is that now y ∈ {1, . . . , n} may assume a range
of different values. For instance, we might want to classify a document according to the language it was written in (English, French, German, Spanish,
Hindi, Japanese, Chinese, . . . ). See Figure 1.6 for an example. The main difference to before is that the cost of error may heavily depend on the type of


Fig. 1.7. Regression estimation. We are given a number of instances (indicated by


black dots) and would like to find some function f mapping the observations X to
R such that f (x) is close to the observed values.

error we make. For instance, in the problem of assessing the risk of cancer, it
makes a significant difference whether we mis-classify an early stage of cancer as healthy (in which case the patient is likely to die) or as an advanced
stage of cancer (in which case the patient is likely to be inconvenienced from
overly aggressive treatment).
Structured Estimation goes beyond simple multiclass estimation by
assuming that the labels y have some additional structure which can be used
in the estimation process. For instance, y might be a path in an ontology,
when attempting to classify webpages, y might be a permutation, when
attempting to match objects, to perform collaborative filtering, or to rank
documents in a retrieval setting. Equally well, y might be an annotation of
a text, when performing named entity recognition. Each of those problems
has its own properties in terms of the set of y which we might consider
admissible, or how to search this space. We will discuss a number of those
problems in Chapter ??.
Regression is another prototypical application. Here the goal is to estimate a real-valued variable y ∈ R given a pattern x (see e.g. Figure 1.7). For
instance, we might want to estimate the value of a stock the next day, the
yield of a semiconductor fab given the current process, the iron content of
ore given mass spectroscopy measurements, or the heart rate of an athlete,
given accelerometer data. One of the key issues in which regression problems
differ from each other is the choice of a loss. For instance, when estimating
stock values our loss for a put option will be decidedly one-sided. On the
other hand, a hobby athlete might only care that our estimate of the heart
rate matches the actual on average.
Novelty Detection is a rather ill-defined problem. It describes the issue
of determining unusual observations given a set of past measurements.
Clearly, the choice of what is to be considered unusual is very subjective.
A commonly accepted notion is that unusual events occur rarely. Hence a
possible goal is to design a system which assigns to each observation a rating


Fig. 1.8. Left: typical digits contained in the database of the US Postal Service.
Right: unusual digits found by a novelty detection algorithm [SPST+ 01] (for a
description of the algorithm see Section 8.3). The score below the digits indicates
the degree of novelty. The numbers on the lower right indicate the class associated
with the digit.

as to how novel it is. Readers familiar with density estimation might contend
that the latter would be a reasonable solution. However, we neither need a
score which sums up to 1 on the entire domain, nor do we care particularly
much about novelty scores for typical observations. We will later see how this
somewhat easier goal can be achieved directly. Figure 1.8 has an example of
novelty detection when applied to an optical character recognition database.

1.2 Probability Theory


In order to deal with the instances where machine learning can be used, we
need to develop an adequate language which is able to describe the problems
concisely. Below we begin with a fairly informal overview over probability
theory. For more details and a very gentle and detailed discussion see the
excellent book of [BT03].

1.2.1 Random Variables


Assume that we cast a dice and we would like to know our chances whether
we would see 1 rather than another digit. If the dice is fair all six outcomes
X = {1, . . . , 6} are equally likely to occur, hence we would see a 1 in roughly
1 out of 6 cases. Probability theory allows us to model uncertainty in the outcome of such experiments. Formally we state that 1 occurs with probability
1/6.
In many experiments, such as the roll of a dice, the outcomes are of a
numerical nature and we can handle them easily. In other cases, the outcomes
may not be numerical, e.g., if we toss a coin and observe heads or tails. In
these cases, it is useful to associate numerical values to the outcomes. This
is done via a random variable. For instance, we can let a random variable

X take on a value +1 whenever the coin lands heads and a value of −1
otherwise. Our notational convention will be to use uppercase letters, e.g.,
X, Y etc to denote random variables and lower case letters, e.g., x, y etc to
denote the values they take.

Fig. 1.9. The random variable maps from the set of outcomes of an experiment
(denoted here by X) to real numbers. As an illustration here X consists of the
patients a physician might encounter, and they are mapped via the random variable
to their weight and height. [Figure omitted; axes show weight and height.]

1.2.2 Distributions
Perhaps the most important way to characterize a random variable is to
associate probabilities with the values it can take. If the random variable is
discrete, i.e., it takes on a finite number of values, then this assignment of
probabilities is called a probability mass function or PMF for short. A PMF
must be, by definition, non-negative and must sum to one. For instance,
if the coin is fair, i.e., heads and tails are equally likely, then the random
variable X described above takes on values of +1 and −1 with probability
0.5. This can be written as

Pr(X = +1) = 0.5 and Pr(X = −1) = 0.5.        (1.1)

When there is no danger of confusion we will use the slightly informal notation p(x) := Pr(X = x).
In case of a continuous random variable the assignment of probabilities
results in a probability density function or PDF for short. With some abuse
of terminology, but keeping in line with convention, we will often use density
or distribution instead of probability density function. As in the case of the
PMF, a PDF must also be non-negative and integrate to one. Figure 1.10
shows two distributions: the uniform distribution
p(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{otherwise,} \end{cases}        (1.2)

Fig. 1.10. Two common densities. Left: uniform distribution over the interval
[−1, 1]. Right: Normal distribution with zero mean and unit variance.

and the Gaussian distribution (also called normal distribution)


p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).        (1.3)
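As a quick numerical check (an illustration added here, not part of the original text), the short Python sketch below evaluates both densities of Figure 1.10 and verifies that each is non-negative and integrates to one; the interval [−1, 1] and the zero-mean, unit-variance parameters match the figure.

import numpy as np

def uniform_pdf(x, a=-1.0, b=1.0):
    # Uniform density (1.2) on the interval [a, b].
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density (1.3) with mean mu and standard deviation sigma.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Approximate the integral of each density by a Riemann sum on a fine grid.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
print((uniform_pdf(x) * dx).sum())   # approximately 1.0
print((normal_pdf(x) * dx).sum())    # approximately 1.0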

Closely associated with a PDF is the indefinite integral over p. It is commonly referred to as the cumulative distribution function (CDF).
Definition 1.1 (Cumulative Distribution Function) For a real valued
random variable X with PDF p the associated Cumulative Distribution Function F is given by
F(x') := Pr\left\{ X \le x' \right\} = \int_{-\infty}^{x'} dp(x).        (1.4)

The CDF F(x') allows us to perform range queries on p efficiently. For
instance, by integral calculus we obtain

Pr(a \le X \le b) = \int_a^b dp(x) = F(b) - F(a).        (1.5)

The values of x' for which F(x') assumes a specific value, such as 0.1 or 0.5,
have a special name. They are called the quantiles of the distribution p.

Definition 1.2 (Quantiles) Let q ∈ (0, 1). Then the value of x' for which
Pr(X < x') ≤ q and Pr(X > x') ≤ 1 − q is the q-quantile of the distribution
p. Moreover, the value x' associated with q = 0.5 is called the median.
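For a concrete illustration of Definitions 1.1 and 1.2 (a worked example added here), consider the uniform distribution (1.2) on [a, b], whose CDF and quantiles have closed form:

F(x') = \int_{-\infty}^{x'} dp(x) =
\begin{cases}
0 & \text{if } x' < a \\
\frac{x' - a}{b - a} & \text{if } x' \in [a, b] \\
1 & \text{if } x' > b.
\end{cases}

Solving F(x') = q gives the q-quantile x' = a + q(b − a); in particular the median (q = 0.5) is (a + b)/2.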

Fig. 1.11. Quantiles of a distribution correspond to the area under the integral of
the density p(x) for which the integral takes on a pre-specified value. Illustrated
are the 0.1, 0.5 and 0.9 quantiles respectively.

1.2.3 Mean and Variance


A common question to ask about a random variable is what its expected
value might be. For instance, when measuring the voltage of a device, we
might ask what its typical values might be. When deciding whether to administer a growth hormone to a child a doctor might ask what a sensible
range of height should be. For those purposes we need to define expectations
and related quantities of distributions.
Definition 1.3 (Mean) We define the mean of a random variable X as

E[X] := \int x \, dp(x).        (1.6)

More generally, if f : R → R is a function, then f (X) is also a random
variable. Its mean is given by

E[f (X)] := \int f (x) \, dp(x).        (1.7)

Whenever X is a discrete random variable the integral in (1.6) can be replaced by a summation:

E[X] = \sum_x x \, p(x).        (1.8)

For instance, in the case of a dice we have equal probabilities of 1/6 for all
6 possible outcomes. It is easy to see that this translates into a mean of
(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
The mean of a random variable is useful in assessing expected losses and
benefits. For instance, as a stock broker we might be interested in the expected value of our investment in a year's time. In addition to that, however,
we also might want to investigate the risk of our investment. That is, how
likely it is that the value of the investment might deviate from its expectation since this might be more relevant for our decisions. This means that we


need a variable to quantify the risk inherent in a random variable. One such
measure is the variance of a random variable.
Definition 1.4 (Variance) We define the variance of a random variable
X as
Var[X] := E\left[ (X - E[X])^2 \right].        (1.9)

As before, if f : R → R is a function, then the variance of f (X) is given by

Var[f (X)] := E\left[ (f (X) - E[f (X)])^2 \right].        (1.10)

The variance measures by how much on average f (X) deviates from its expected value. As we shall see in Section 2.1, an upper bound on the variance
can be used to give guarantees on the probability that f (X) will be within ε
of its expected value. This is one of the reasons why the variance is often
associated with the risk of a random variable. Note that often one discusses
properties of a random variable in terms of its standard deviation, which is
defined as the square root of the variance.
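As a worked example (added for illustration), we can compute the variance of the fair dice whose mean was found to be 3.5 above. Expanding the square in (1.9) gives Var[X] = E[X^2] − (E[X])^2, so

E[X^2] = \tfrac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \tfrac{91}{6}, \qquad
Var[X] = \tfrac{91}{6} - 3.5^2 = \tfrac{35}{12} \approx 2.92,

and the standard deviation is approximately 1.71.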

1.2.4 Marginalization, Independence, Conditioning, and Bayes Rule
Given two random variables X and Y , one can write their joint density
p(x, y). Given the joint density, one can recover p(x) by integrating out y.
This operation is called marginalization:
p(x) = \int_y dp(x, y).        (1.11)

If Y is a discrete random variable, then we can replace the integration with
a summation:

p(x) = \sum_y p(x, y).        (1.12)

We say that X and Y are independent, i.e., the values that X takes do
not depend on the values that Y takes, whenever

p(x, y) = p(x) p(y).        (1.13)

Independence is useful when it comes to dealing with large numbers of random variables whose behavior we want to estimate jointly. For instance,
whenever we perform repeated measurements of a quantity, such as when

measuring the voltage of a device, we will typically assume that the individual measurements are drawn from the same distribution and that they are
independent of each other. That is, having measured the voltage a number
of times will not affect the value of the next measurement. We will call such
random variables independently and identically distributed, or in short,
iid random variables. See Figure 1.12 for an example of a pair of random
variables drawn from dependent and independent distributions respectively.

Fig. 1.12. Left: a sample from two dependent random variables. Knowing about
the first coordinate allows us to improve our guess about the second coordinate.
Right: a sample drawn from two independent random variables, obtained by randomly
permuting the dependent sample.
Conversely, dependence can be vital in classification and regression problems. For instance, the traffic lights at an intersection are dependent on each
other. This allows a driver to perform the inference that when the lights are
green in his direction there will be no traffic crossing his path, i.e. the other
lights will indeed be red. Likewise, whenever we are given a picture x of a
digit, we hope that there will be dependence between x and its label y.
Especially in the case of dependent random variables, we are interested
in conditional probabilities, i.e., the probability that X takes on a particular
value given the value of Y . Clearly Pr(X = rain|Y = cloudy) is higher than
Pr(X = rain|Y = sunny). In other words, knowledge about the value of Y
significantly influences the distribution of X. This is captured via conditional
probabilities:
p(x|y) := \frac{p(x, y)}{p(y)}.        (1.14)

Equation 1.14 leads to one of the key tools in statistical inference.


Theorem 1.5 (Bayes Rule) Denote by X and Y random variables; then
the following holds:

p(y|x) = \frac{p(x|y) p(y)}{p(x)}.        (1.15)

This follows from the fact that p(x, y) = p(x|y)p(y) = p(y|x)p(x). The key
consequence of (1.15) is that we may reverse the conditioning between a
pair of random variables.
1.2.4.1 An Example
We illustrate our reasoning by means of a simple example: inference using
an AIDS test. Assume that a patient would like to have such a test carried
out on him. The physician recommends a test which is guaranteed to detect
HIV-positive whenever a patient is infected. On the other hand, for healthy
patients it has a 1% error rate. That is, with probability 0.01 it diagnoses
a patient as HIV-positive even when he is, in fact, HIV-negative. Moreover,
assume that 0.15% of the population is infected.
Now assume that the patient has the test carried out and the test returns HIV-negative. In this case, logic implies that he is healthy, since the
test has 100% detection rate. In the converse case things are not quite as
straightforward. Denote by X and T the random variables associated with
the health status of the patient and the outcome of the test respectively. We
are interested in p(X = HIV+|T = HIV+). By Bayes rule we may write
p(X = HIV+|T = HIV+) = \frac{p(T = HIV+|X = HIV+) p(X = HIV+)}{p(T = HIV+)}.

While we know all terms in the numerator, p(T = HIV+) itself is unknown.
That said, it can be computed via
p(T = HIV+) = \sum_{x \in \{HIV+, HIV-\}} p(T = HIV+, x)
            = \sum_{x \in \{HIV+, HIV-\}} p(T = HIV+|x) p(x)
            = 1.0 \cdot 0.0015 + 0.01 \cdot 0.9985.


Substituting back into the conditional expression yields
p(X = HIV+|T = HIV+) = \frac{1.0 \cdot 0.0015}{1.0 \cdot 0.0015 + 0.01 \cdot 0.9985} = 0.1306.

In other words, even though our test is quite reliable, there is such a low
prior probability of having been infected with AIDS that there is not much
evidence to accept the hypothesis even after this test.
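The arithmetic above is easy to mirror in a few lines of code. The following sketch (ours, not part of the original text; the variable names are made up) applies Bayes rule (1.15) to the numbers given for the test:

# Posterior probability of being HIV-positive given a positive test, via Bayes rule (1.15).
p_infected = 0.0015           # prior: 0.15% of the population is infected
p_pos_given_infected = 1.0    # the test always detects an infection
p_pos_given_healthy = 0.01    # 1% error rate on healthy patients

# Marginal probability of a positive test, summing over both health states.
p_pos = p_pos_given_infected * p_infected + p_pos_given_healthy * (1.0 - p_infected)

posterior = p_pos_given_infected * p_infected / p_pos
print(round(posterior, 4))    # 0.1306, as computed in the text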

Fig. 1.13. A graphical description of our HIV testing scenario. Knowing the age of
the patient influences our prior on whether the patient is HIV positive (the random
variable X). The outcomes of the tests 1 and 2 are independent of each other given
the status X. We observe the shaded random variables (age, test 1, test 2) and
would like to infer the un-shaded random variable X. This is a special case of a
graphical model which we will discuss in Chapter ??.

Let us now think how we could improve the diagnosis. One way is to obtain further information about the patient and to use this in the diagnosis.
For instance, information about his age is quite useful. Suppose the patient
is 35 years old. In this case we would want to compute p(X = HIV+|T =
HIV+, A = 35) where the random variable A denotes the age. The corresponding expression yields:
p(X = HIV+|T = HIV+, A) = \frac{p(T = HIV+|X = HIV+, A) p(X = HIV+|A)}{p(T = HIV+|A)}.
Here we simply conditioned all random variables on A in order to take additional information into account. We may assume that the test is independent
of the age of the patient, i.e.
p(t|x, a) = p(t|x).
What remains therefore is p(X = HIV+|A). Recent US census data pegs this
number at approximately 0.9%. Plugging all data back into the conditional
expression yields \frac{1.0 \cdot 0.009}{1.0 \cdot 0.009 + 0.01 \cdot 0.991} = 0.48. What has happened here is that
by including additional observed random variables our estimate has become
more reliable. Combination of evidence is a powerful tool. In our case it
helped us make the classification problem of whether the patient is HIV-positive or not more reliable.
A second tool in our arsenal is the use of multiple measurements. After
the first test the physician is likely to carry out a second test to confirm the
diagnosis. We denote by T1 and T2 (and t1 , t2 respectively) the two tests.
Obviously, what we want is that T2 will give us an independent second
opinion of the situation. In other words, we want to ensure that T2 does
not make the same mistakes as T1 . For instance, it is probably a bad idea
to repeat T1 without changes, since it might perform the same diagnostic

mistake as before. What we want is that the diagnosis of T2 is independent
of that of T1 given the health status X of the patient. This is expressed as

p(t1 , t2 |x) = p(t1 |x) p(t2 |x).        (1.16)

See Figure 1.13 for a graphical illustration of the setting. Random variables
satisfying the condition (1.16) are commonly referred to as conditionally
independent. In shorthand we write T1 ⊥⊥ T2 | X. For the sake of the argument
we assume that the statistics for T2 are given by
p(t2 |x)

x = HIV-

x = HIV+

t2 = HIV0.95
0.01
t2 = HIV+ 0.05
0.99
Clearly this test is less reliable than the first one. However, we may now
combine both estimates to obtain a very reliable estimate based on the
combination of both events. For instance, for t1 = t2 = HIV+ we have

p(X = HIV+|T1 = HIV+, T2 = HIV+) = \frac{1.0 \cdot 0.99 \cdot 0.009}{1.0 \cdot 0.99 \cdot 0.009 + 0.01 \cdot 0.05 \cdot 0.991} = 0.95.

In other words, by combining two tests we can now confirm with very high
confidence that the patient is indeed diseased. What we have carried out is a
combination of evidence. Strong experimental evidence of two positive tests
effectively overcame an initially very strong prior which suggested that the
patient might be healthy.
Tests such as in the example we just discussed are fairly common. For
instance, we might need to decide which manufacturing procedure is preferable, which choice of parameters will give better results in a regression estimator, or whether to administer a certain drug. Note that often our tests
may not be conditionally independent and we would need to take this into
account.
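The combination of evidence can likewise be scripted. The sketch below (an illustration under the assumptions stated in the text: the age-adjusted prior of roughly 0.9% and conditionally independent tests as in (1.16)) reproduces the posteriors of 0.48 and 0.95:

# p(T = HIV+ | x) for the two tests; t1 is the perfect/1%-error test, t2 the weaker one.
prior = 0.009                       # p(X = HIV+ | A = 35), roughly 0.9%
t1 = {"HIV+": 1.0, "HIV-": 0.01}
t2 = {"HIV+": 0.99, "HIV-": 0.05}

def posterior_all_positive(tests, prior):
    # p(X = HIV+ | all tests positive), using conditional independence (1.16).
    num = prior
    den_healthy = 1.0 - prior
    for t in tests:
        num *= t["HIV+"]
        den_healthy *= t["HIV-"]
    return num / (num + den_healthy)

print(round(posterior_all_positive([t1], prior), 2))      # 0.48 after one positive test
print(round(posterior_all_positive([t1, t2], prior), 2))  # 0.95 after two positive tests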

1.3 Basic Algorithms


We conclude our introduction to machine learning by discussing four simple
algorithms, namely Naive Bayes, Nearest Neighbors, the Mean Classifier,
and the Perceptron, which can be used to solve a binary classification problem such as that described in Figure 1.5. We will also introduce the K-means
algorithm which can be employed when labeled data is not available. All
these algorithms are readily usable and easily implemented from scratch in
their most basic form.
For the sake of concreteness assume that we are interested in spam filtering. That is, we are given a set of m e-mails xi , denoted by X := {x1 , . . . , xm }

and associated labels yi , denoted by Y := {y1 , . . . , ym }. Here the labels satisfy yi ∈ {spam, ham}. The key assumption we make here is that the pairs
(xi , yi ) are drawn jointly from some distribution p(x, y) which represents
the e-mail generating process for a user. Moreover, we assume that there
is sufficiently strong dependence between x and y that we will be able to
estimate y given x and a set of labeled instances X, Y.

From: "LucindaParkison497072" <LucindaParkison497072@hotmail.com>
To: <kargr@earthlink.net>
Subject: we think ACGU is our next winner
Date: Mon, 25 Feb 2008 00:01:01 -0500
MIME-Version: 1.0
X-OriginalArrivalTime: 25 Feb 2008 05:01:01.0329 (UTC) FILETIME=[6A931810:01C8776B]
Return-Path: lucindaparkison497072@hotmail.com

(ACGU) .045 UP 104.5%
I do think that (ACGU) at its current levels looks extremely attractive.
Asset Capital Group, Inc., (ACGU) announced that it is expanding the marketing of bio-remediation fluids and cleaning equipment. After
its recent acquisition of interest in American Bio-Clean Corporation and an 80
News is expected to be released next week on this growing company and could drive the price even higher. Buy (ACGU) Monday at open. I
believe those involved at this stage could enjoy a nice ride up.

Fig. 1.14. Example of a spam e-mail

x1 : The quick brown fox jumped over the lazy dog.
x2 : The dog hunts a fox.

        the  quick  brown  fox  jumped  over  lazy  dog  hunts  a
   x1    2     1      1     1     1      1     1     1     0    0
   x2    1     0      0     1     0      0     0     1     1    1

Fig. 1.15. Vector space representation of strings.
Before we do so we need to address the fact that e-mails such as Figure 1.14
are text, whereas the three algorithms we present will require data to be
represented in a vectorial fashion. One way of converting text into a vector
is by using the so-called bag of words representation [Mar61, Lew98]. In its
simplest version it works as follows: Assume we have a list of all possible
words occurring in X, that is, a dictionary, then we are able to assign a unique number to each of those words (e.g. its position in the dictionary). Now
we may simply count for each document xi the number of times a given
word j is occurring. This is then used as the value of the j-th coordinate
of xi . Figure 1.15 gives an example of such a representation. Once we have
the latter it is easy to compute distances, similarities, and other statistics
directly from the vectorial representation.
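To make this concrete, here is a minimal Python sketch (not from the book; the helper names are our own) that builds such a dictionary and converts the two example sentences of Figure 1.15 into count vectors:

import re
from collections import Counter

def tokenize(doc):
    # lower-case the text and keep only alphabetic tokens
    return re.findall(r"[a-z]+", doc.lower())

def build_dictionary(docs):
    # assign a unique index (its position) to every word in the corpus
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: j for j, w in enumerate(words)}

def bag_of_words(doc, dictionary):
    # return the vector of word counts for a single document
    counts = Counter(tokenize(doc))
    return [counts[w] for w in dictionary]

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog hunts a fox."]
D = build_dictionary(docs)
X = [bag_of_words(doc, D) for doc in docs]
for doc, x in zip(docs, X):
    print(x, doc)

Real systems would additionally handle punctuation, capitalization, and very large dictionaries more carefully, but the underlying representation is exactly the one shown in Figure 1.15.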


1.3.1 Naive Bayes


In the example of the AIDS test we used the outcomes of the test to infer
whether the patient is diseased. In the context of spam filtering the actual
text of the e-mail x corresponds to the test and the label y is equivalent to
the diagnosis. Recall Bayes Rule (1.15). We could use the latter to infer
  p(y|x) = p(x|y) p(y) / p(x).
We may have a good estimate of p(y), that is, the probability of receiving
a spam or ham mail. Denote by mham and mspam the number of ham and
spam e-mails in X. In this case we can estimate
  p(ham) ≈ mham /m   and   p(spam) ≈ mspam /m.
The key problem, however, is that we do not know p(x|y) or p(x). We may
dispose of the requirement of knowing p(x) by settling for a likelihood ratio
  L(x) := p(spam|x) / p(ham|x) = [p(x|spam) p(spam)] / [p(x|ham) p(ham)]     (1.17)

Whenever L(x) exceeds a given threshold c we decide that x is spam and


consequently reject the e-mail. If c is large then our algorithm is conservative
and classifies an email as spam only if p(spam|x) ≫ p(ham|x). On the other
hand, if c is small then the algorithm aggressively classifies emails as spam.
The key obstacle is that we have no access to p(x|y). This is where we make
our key approximation. Recall Figure 1.13. In order to model the distribution
of the test outcomes T1 and T2 we made the assumption that they are
conditionally independent of each other given the diagnosis. Analogously,
we may now treat the occurrence of each word in a document as a separate
test and combine the outcomes in a naive fashion by assuming that
  p(x|y) = ∏_{j=1}^{# of words in x} p(wj |y),     (1.18)

where wj denotes the j-th word in document x. This amounts to the assumption that the probability of occurrence of a word in a document is
independent of all other words given the category of the document. Even though this assumption does not hold in general (for instance, the word York is much more likely to occur after the word New), it suffices for our purposes (see Figure 1.16).
This assumption reduces the difficulty of knowing p(x|y) to that of estimating the probabilities of occurrence of individual words w. Estimates for


Fig. 1.16. Naive Bayes model. The occurrence of individual words is independent
of each other, given the category of the text. For instance, the word Viagra is fairly
frequent if y = spam but it is considerably less frequent if y = ham, except when
considering the mailbox of a Pfizer sales representative.

p(w|y) can be obtained, for instance, by simply counting the frequency of occurrence of the word within documents of a given class. That is, we estimate

  p(w|spam) ≈ Σ_{i=1}^{m} Σ_{j=1}^{# of words in xi} {yi = spam and wij = w}  /  Σ_{i=1}^{m} Σ_{j=1}^{# of words in xi} {yi = spam}
Here {yi = spam and wij = w} equals 1 if and only if xi is labeled as spam


and w occurs as the j-th word in xi . The denominator is simply the total
number of words in spam documents. Similarly one can compute p(w|ham).
In principle we could perform the above summation whenever we see a new
document x. This would be terribly inefficient, since each such computation
requires a full pass through X and Y. Instead, we can perform a single pass
through X and Y and store the resulting statistics as a good estimate of
the conditional probabilities.
Algorithm 1.1 has details of an implementation. Note that we performed a number of optimizations: Firstly, the normalization by 1/mspam and 1/mham respectively is independent of x, hence we incorporate it as a fixed offset.
Secondly, since we are computing a product over a large number of factors
the numbers might lead to numerical overflow or underflow. This can be
addressed by summing over the logarithm of terms rather than computing
products. Thirdly, we need to address the issue of estimating p(w|y) for
words w which we might not have seen before. One way of dealing with
this is to increment all counts by 1. This method is commonly referred to
as Laplace smoothing. We will encounter a theoretical justification for this
heuristic in Section 2.3.
This simple algorithm is known to perform surprisingly well, and variants
of it can be found in most modern spam filters. It amounts to what is


Algorithm 1.1 Naive Bayes

Train(X, Y) {reads documents X and labels Y}
  Compute dictionary D of X with n words.
  Compute m, mham and mspam .
  Initialize b := log c + log mham - log mspam to offset the rejection threshold
  Initialize p ∈ R^{2×n} with pij = 1, wspam = 0, wham = 0.
  {Count occurrence of each word}
  {Here xji denotes the number of times word j occurs in document xi }
  for i = 1 to m do
    if yi = spam then
      for j = 1 to n do
        p0,j ← p0,j + xji
        wspam ← wspam + xji
      end for
    else
      for j = 1 to n do
        p1,j ← p1,j + xji
        wham ← wham + xji
      end for
    end if
  end for
  {Normalize counts to yield word probabilities}
  for j = 1 to n do
    p0,j ← p0,j /wspam
    p1,j ← p1,j /wham
  end for

Classify(x) {classifies document x}
  Initialize score threshold t = -b
  for j = 1 to n do
    t ← t + xj (log p0,j - log p1,j )
  end for
  if t > 0 return spam else return ham

commonly known as Bayesian spam filtering. Obviously, we may apply it


to problems other than document categorization, too.
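For illustration, the following Python sketch mirrors the structure of Algorithm 1.1 under simplifying assumptions (whitespace tokenization, a tiny hand-made corpus); the function names and the toy data are ours and not part of the book:

import math
from collections import Counter

def train(docs, labels, c=1.0):
    # count word occurrences per class with all counts started at 1
    # (Laplace smoothing) and precompute the log-space offset b
    dictionary = sorted({w for d in docs for w in d.lower().split()})
    counts = {"spam": Counter(), "ham": Counter()}
    m = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d.lower().split())
    logp = {}
    for y in ("spam", "ham"):
        total = sum(counts[y][w] + 1 for w in dictionary)
        logp[y] = {w: math.log((counts[y][w] + 1) / total) for w in dictionary}
    b = math.log(c) + math.log(m["ham"]) - math.log(m["spam"])
    return logp, b

def classify(doc, logp, b):
    # return 'spam' when the log likelihood ratio exceeds the offset
    t = -b
    for w in doc.lower().split():
        if w in logp["spam"]:
            t += logp["spam"][w] - logp["ham"][w]
    return "spam" if t > 0 else "ham"

docs = ["cheap viagra buy now", "meeting agenda for monday",
        "cheap meds buy viagra", "monday lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]
logp, b = train(docs, labels)
print(classify("buy cheap viagra", logp, b))      # spam
print(classify("team meeting on monday", logp, b))  # ham

Working in log-space avoids the numerical underflow mentioned above, and unseen words are simply skipped in this sketch.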


Fig. 1.17. 1 nearest neighbor classifier. Depending on whether the query point x is
closest to the star, diamond or triangles, it uses one of the three labels for it.

Fig. 1.18. k-Nearest neighbor classifiers using Euclidean distances. Left: decision
boundaries obtained from a 1-nearest neighbor classifier. Middle: color-coded sets
of where the number of red / blue points ranges between 7 and 0. Right: decision
boundary determining where the blue or red dots are in the majority.

1.3.2 Nearest Neighbor Estimators


An even simpler estimator than Naive Bayes is nearest neighbors. In its most
basic form it assigns the label of its nearest neighbor to an observation x
(see Figure 1.17). Hence, all we need to implement it is a distance measure
d(x, x′) between pairs of observations. Note that this distance need not even
be symmetric. This means that nearest neighbor classifiers can be extremely
flexible. For instance, we could use string edit distances to compare two
documents or information theory based measures.
However, the problem with nearest neighbor classification is that the estimates can be very noisy whenever the data itself is very noisy. For instance,
if a spam email is erroneously labeled as nonspam then all emails which
are similar to this email will share the same fate. See Figure 1.18 for an
example. In this case it is beneficial to pool together a number of neighbors,
say the k-nearest neighbors of x and use a majority vote to decide the class
membership of x. Algorithm 1.2 has a description of the algorithm. Note
that nearest neighbor algorithms can yield excellent performance when used


Fig. 1.19. k-Nearest neighbor regression estimator using Euclidean distances. Left:
some points (x, y) drawn from a joint distribution. Middle: 1-nearest neighbour
classifier. Right: 7-nearest neighbour classifier. Note that the regression estimate is
much more smooth.

with a good distance measure. For instance, the technology underlying the
Netflix progress prize [BK07] was essentially nearest neighbours based.
Algorithm 1.2 k-Nearest Neighbor Classification
Classify(X, Y, x) {reads documents X, labels Y and query x}
for i = 1 to m do
Compute distance d(xi , x)
end for
Compute set I containing indices for the k smallest distances d(xi , x).
return majority label of {yi where i ∈ I}.
Note that it is trivial to extend the algorithm to regression. All we need
to change in Algorithm 1.2 is to return the average of the values yi instead
of their majority vote. Figure 1.19 has an example.
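A minimal Python sketch of Algorithm 1.2 and of its regression variant might look as follows (the toy data and function names are our own illustration):

import numpy as np

def knn_classify(X, Y, x, k=3):
    # Algorithm 1.2: majority vote among the k closest training points
    d = np.linalg.norm(X - x, axis=1)       # distances d(x_i, x)
    I = np.argsort(d)[:k]                   # indices of the k smallest distances
    labels, votes = np.unique(Y[I], return_counts=True)
    return labels[np.argmax(votes)]

def knn_regress(X, Y, x, k=3):
    # regression variant: average the values y_i instead of voting
    d = np.linalg.norm(X - x, axis=1)
    I = np.argsort(d)[:k]
    return Y[I].mean()

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
Y = np.array([-1, -1, 1, 1])
print(knn_classify(X, Y, np.array([0.1, 0.0]), k=3))
print(knn_regress(X, Y, np.array([0.95, 1.0]), k=2))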
Note that the distance computation d(xi , x) for all observations can become extremely costly, in particular whenever the number of observations is
large or whenever the observations xi live in a very high dimensional space.
Random projections are a technique that can alleviate the high computational cost of Nearest Neighbor classifiers. A celebrated lemma by Johnson
and Lindenstrauss [DG03] asserts that a set of m points in high dimensional
Euclidean space can be projected into a O(log m/ε²) dimensional Euclidean space such that the distance between any two points changes only by a factor of (1 ± ε). Since Euclidean distances are preserved, running the Nearest
Neighbor classifier on this mapped data yields the same results but at a
lower computational cost [GIM99].
The surprising fact is that the projection relies on a simple randomized
algorithm: to obtain a d-dimensional representation of n-dimensional random observations we pick a matrix R ∈ R^{d×n} where each element is drawn


Fig. 1.20. A trivial classifier. Classification is carried out in accordance to which of the two means μ− or μ+ is closer to the test point x. Note that the sets of positive and negative labels respectively form a half space.
independently from a normal distribution with n^{-1/2} variance and zero mean.


Multiplying x with this projection matrix can be shown to achieve this property with high probability. For details see [DG03].
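The following Python sketch (ours, not from the book) illustrates the idea on synthetic data; we use the common 1/√d scaling of the projection matrix, which may differ slightly from the normalization discussed above:

import numpy as np

rng = np.random.default_rng(0)

# original data: m points in a high dimensional space
m, n, d = 200, 10000, 500
X = rng.normal(size=(m, n))

# random projection matrix; scaling by 1/sqrt(d) keeps squared distances
# approximately unchanged in expectation
R = rng.normal(scale=1.0 / np.sqrt(d), size=(d, n))
Z = X @ R.T                      # projected data, m points in d dimensions

# compare a pairwise distance before and after the projection
i, j = 0, 1
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Z[i] - Z[j]))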
1.3.3 A Simple Classifier
We can use geometry to design another simple classification algorithm [SS02]
for our problem. For simplicity we assume that the observations x ∈ Rd , such as the bag-of-words representation of e-mails. We define the means μ+ and μ− to correspond to the classes y ∈ {±1} via

  μ− := (1/m−) Σ_{yi=−1} xi   and   μ+ := (1/m+) Σ_{yi=+1} xi .

Here we used m− and m+ to denote the number of observations with label yi = −1 and yi = +1 respectively. An even simpler approach than using the
nearest neighbor classifier would be to use the class label which corresponds
to the mean closest to a new query x, as described in Figure 1.20.
For Euclidean distances we have

  ‖μ− − x‖² = ‖μ−‖² + ‖x‖² − 2⟨μ−, x⟩  and     (1.19)
  ‖μ+ − x‖² = ‖μ+‖² + ‖x‖² − 2⟨μ+, x⟩.         (1.20)

Here ⟨·, ·⟩ denotes the standard dot product between vectors. Taking differences between the two distances yields

  f(x) := ‖μ− − x‖² − ‖μ+ − x‖² = 2⟨μ+ − μ−, x⟩ + ‖μ−‖² − ‖μ+‖².     (1.21)
This is a linear function in x and its sign corresponds to the labels we estimate for x. Our algorithm sports an important property: The classification


Fig. 1.21. The feature map φ maps observations x from X into a feature space H. The map φ is a convenient way of encoding pre-processing steps systematically.

rule can be expressed via dot products. This follows from

  ‖μ+‖² = ⟨μ+, μ+⟩ = m+^{-2} Σ_{yi=yj=1} ⟨xi , xj⟩   and   ⟨μ+, x⟩ = m+^{-1} Σ_{yi=1} ⟨xi , x⟩.
Analogous expressions can be computed for μ−. Consequently we may express the classification rule (1.21) as

  f(x) = Σ_{i=1}^{m} αi ⟨xi , x⟩ + b     (1.22)

where b = m−^{-2} Σ_{yi=yj=−1} ⟨xi , xj⟩ − m+^{-2} Σ_{yi=yj=+1} ⟨xi , xj⟩ and αi = yi /m_{yi} .
This offers a number of interesting extensions. Recall that when dealing
with documents we needed to perform pre-processing to map e-mails into a
vector space. In general, we may pick arbitrary maps φ : X → H mapping
the space of observations into a feature space H, as long as the latter is
endowed with a dot product (see Figure 1.21). This means that instead of
dealing with ⟨x, x′⟩ we will be dealing with ⟨φ(x), φ(x′)⟩.
As we will see in Chapter 7, whenever H is a so-called Reproducing Kernel
Hilbert Space, the inner product can be abbreviated in the form of a kernel function k(x, x′) which satisfies

  k(x, x′) := ⟨φ(x), φ(x′)⟩.     (1.23)

This small modification leads to a number of very powerful algorithms and


it is at the foundation of an area of research called kernel methods. We
will encounter a number of such algorithms for regression, classification,
segmentation, and density estimation over the course of the book. Examples
of suitable k are the polynomial kernel k(x, x′) = ⟨x, x′⟩^d for d ∈ N and the Gaussian RBF kernel k(x, x′) = exp(−γ‖x − x′‖²) for γ > 0.
The upshot of (1.23) is that our basic algorithm can be kernelized. That


Algorithm 1.3 The Perceptron


Perceptron(X, Y) {reads stream of observations (xi , yi )}
Initialize w = 0 and b = 0
while there exists some (xi , yi ) with yi (⟨w, xi⟩ + b) ≤ 0 do
  w ← w + yi xi and b ← b + yi
end while

is, we may rewrite (1.21) as

  f(x) = Σ_{i=1}^{m} αi k(xi , x) + b     (1.24)

where as before αi = yi /m_{yi} and the offset b is computed analogously. As


a consequence we have now moved from a fairly simple and pedestrian linear classifier to one which yields a nonlinear function f (x) with a rather
nontrivial decision boundary.
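To make the construction concrete, here is a small Python sketch of the (non-kernelized) mean classifier of (1.21); the data and names are our own illustration, and a kernelized version would replace the dot products by k(xi , x):

import numpy as np

def train_mean_classifier(X, Y):
    # compute the class means and the linear rule of equation (1.21)
    mu_plus = X[Y == 1].mean(axis=0)
    mu_minus = X[Y == -1].mean(axis=0)
    w = 2 * (mu_plus - mu_minus)
    b = mu_minus @ mu_minus - mu_plus @ mu_plus
    return w, b

def predict(w, b, x):
    # the sign of the linear function f(x) = <w, x> + b
    return np.sign(w @ x + b)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(20, 2)),
               rng.normal(loc=+1.0, size=(20, 2))])
Y = np.array([-1] * 20 + [1] * 20)
w, b = train_mean_classifier(X, Y)
print(predict(w, b, np.array([1.2, 0.8])))    # expect +1
print(predict(w, b, np.array([-1.5, -0.5])))  # expect -1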

1.3.4 Perceptron
In the previous sections we assumed that our classifier had access to a training set of spam and non-spam emails. In real life, such a set might be difficult
to obtain all at once. Instead, a user might want to have instant results whenever a new e-mail arrives and he would like the system to learn immediately
from any corrections to mistakes the system makes.
To overcome both these difficulties one could envisage working with the
following protocol: As emails arrive our algorithm classifies them as spam or
non-spam, and the user provides feedback as to whether the classification is
correct or incorrect. This feedback is then used to improve the performance
of the classifier over a period of time.
This intuition can be formalized as follows: Our classifier maintains a
parameter vector. At the t-th time instance it receives a data point xt , to
which it assigns a label ŷt using its current parameter vector. The true label
yt is then revealed, and used to update the parameter vector of the classifier.
Such algorithms are said to be online. We will now describe perhaps the
simplest classifier of this kind namely the Perceptron [Heb49, Ros58].
Let us assume that the data points xt ∈ Rd , and labels yt ∈ {±1}. As before we represent an email as a bag-of-words vector and we assign +1 to spam emails and −1 to non-spam emails. The Perceptron maintains a weight



Fig. 1.22. The Perceptron without bias. Left: at time t we have a weight vector wt denoted by the dashed arrow with corresponding separating plane (also dashed). For reference we include the linear separator w∗ and its separating plane (both denoted by a solid line). As a new observation xt arrives which happens to be mis-classified by the current weight vector wt we perform an update. Also note the margin between the point xt and the separating hyperplane defined by w∗. Right: This leads to the weight vector wt+1 which is more aligned with w∗.

Algorithm 1.4 The Kernel Perceptron


KernelPerceptron(X, Y) {reads stream of observations (xi , yi )}
Initialize f = 0
while there exists some (xi , yi ) with yi f(xi ) ≤ 0 do
  f ← f + yi k(xi , ·) + yi
end while
vector w ∈ Rd and classifies xt according to the rule

  ŷt := sign(⟨w, xt⟩ + b),     (1.25)

where ⟨w, xt⟩ denotes the usual Euclidean dot product and b is an offset. Note the similarity of (1.25) to (1.21) of the simple classifier. Just as the latter, the Perceptron is a linear classifier which separates its domain Rd into two halfspaces, namely {x| ⟨w, x⟩ + b > 0} and its complement. If ŷt = yt then no updates are made. On the other hand, if ŷt ≠ yt the weight vector is updated as

  w ← w + yt xt and b ← b + yt .     (1.26)

Figure 1.22 shows an update step of the Perceptron algorithm. For simplicity


we illustrate the case without bias, that is, where b = 0 and where it remains
unchanged. A detailed description of the algorithm is given in Algorithm 1.3.
An important property of the algorithm is that it performs updates on w
by multiples of the observations xi on which it makes a mistake. Hence we
may express w as w = Σ_{i∈Error} yi xi . Just as before, we can replace xi and x by φ(xi ) and φ(x) to obtain a kernelized version of the Perceptron algorithm
[FS99] (Algorithm 1.4).
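A minimal Python sketch of Algorithm 1.3 on a toy linearly separable dataset might look as follows (names and data are ours; a kernelized variant would track the coefficients of k(xi , ·) instead of w):

import numpy as np

def perceptron(X, Y, max_passes=100):
    # Algorithm 1.3: keep updating until every point is classified correctly
    # (this terminates only if the data are linearly separable)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(X, Y):
            if y * (w @ x + b) <= 0:   # mistake: update by a multiple of x
                w += y * x
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
Y = np.array([1, 1, -1, -1])
w, b = perceptron(X, Y)
print(w, b, np.sign(X @ w + b))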
If the dataset (X, Y) is linearly separable, then the Perceptron algorithm
eventually converges and correctly classifies all the points in X. The rate of
convergence however depends on the margin. Roughly speaking, the margin
quantifies how linearly separable a dataset is, and hence how easy it is to
solve a given classification problem.
Definition 1.6 (Margin) Let w ∈ Rd be a weight vector and let b ∈ R be an offset. The margin of an observation x ∈ Rd with associated label y is

  γ(x, y) := y (⟨w, x⟩ + b).     (1.27)

Moreover, the margin of an entire set of observations X with labels Y is

  γ(X, Y) := min_i γ(xi , yi ).     (1.28)

Geometrically speaking (see Figure 1.22) the margin measures the distance of x from the hyperplane defined by {x| ⟨w, x⟩ + b = 0}. The larger the margin, the better separated the data and hence the easier it is to find a hyperplane which correctly classifies the dataset. The following theorem asserts that if
there exists a linear classifier which can classify a dataset with a large margin, then the Perceptron will also correctly classify the same dataset after
making a small number of mistakes.
Theorem 1.7 (Novikoff's theorem) Let (X, Y) be a dataset with at least one example labeled +1 and one example labeled −1. Let R := max_t ‖xt‖, and assume that there exists (w∗, b∗) such that ‖w∗‖ = 1 and γt := yt (⟨w∗, xt⟩ + b∗) ≥ γ for all t. Then, the Perceptron will make at most (1 + R²)(1 + (b∗)²)/γ² mistakes.
This result is remarkable since it does not depend on the dimensionality
of the problem. Instead, it only depends on the geometry of the setting,
as quantified via the margin and the radius R of a ball enclosing the
observations. Interestingly, a similar bound can be shown for Support Vector
Machines [Vap95] which we will be discussing in Chapter 8.
Proof We can safely ignore the iterations where no mistakes were made


and hence no updates were carried out. Therefore, without loss of generality
assume that the t-th update was made after seeing the t-th observation and
let wt denote the weight vector after the update. Furthermore, for simplicity
assume that the algorithm started with w0 = 0 and b0 = 0. By the update
equation (1.26) we have
  ⟨wt , w∗⟩ + bt b∗ = ⟨wt−1 , w∗⟩ + bt−1 b∗ + yt (⟨xt , w∗⟩ + b∗)
                    ≥ ⟨wt−1 , w∗⟩ + bt−1 b∗ + γ.
By induction it follows that ⟨wt , w∗⟩ + bt b∗ ≥ tγ. On the other hand we made an update because yt (⟨xt , wt−1⟩ + bt−1 ) < 0. By using yt yt = 1,

  ‖wt‖² + bt² = ‖wt−1‖² + bt−1² + yt² ‖xt‖² + 1 + 2 yt (⟨wt−1 , xt⟩ + bt−1 )
              ≤ ‖wt−1‖² + bt−1² + ‖xt‖² + 1

Since ‖xt‖² ≤ R² we can again apply induction to conclude that ‖wt‖² + bt² ≤ t(R² + 1). Combining the upper and the lower bounds, using the Cauchy-Schwartz inequality, and ‖w∗‖ = 1 yields
  tγ ≤ ⟨wt , w∗⟩ + bt b∗ = ⟨(wt , bt ), (w∗, b∗)⟩
     ≤ ‖(wt , bt )‖ ‖(w∗, b∗)‖ = √(‖wt‖² + bt²) · √(1 + (b∗)²)
     ≤ √(t(R² + 1)) · √(1 + (b∗)²).
Squaring both sides of the inequality and rearranging the terms yields an
upper bound on the number of updates and hence the number of mistakes.
The Perceptron was the building block of research on Neural Networks
[Hay98, Bis95]. The key insight was to combine large numbers of such networks, often in a cascading fashion, to larger objects and to fashion optimization algorithms which would lead to classifiers with desirable properties.
In this book we will take a complementary route. Instead of increasing the
number of nodes we will investigate what happens when increasing the complexity of the feature map and its associated kernel k. The advantage of
doing so is that we will reap the benefits from convex analysis and linear
models, possibly at the expense of a slightly more costly function evaluation.
1.3.5 K-Means
All the algorithms we discussed so far are supervised, that is, they assume
that labeled training data is available. In many applications this is too much


to hope for; labeling may be expensive, error prone, or sometimes impossible. For instance, it is very easy to crawl and collect every page within the
www.purdue.edu domain, but rather time consuming to assign a topic to
each page based on its contents. In such cases, one has to resort to unsupervised learning. A prototypical unsupervised learning algorithm is K-means,
which is a clustering algorithm. Given X = {x1 , . . . , xm } the goal of K-means is to partition it into k clusters such that each point in a cluster is more similar to points from its own cluster than to points from some other cluster.
Towards this end, define prototype vectors μ1 , . . . , μk and an indicator vector rij which is 1 if, and only if, xi is assigned to cluster j. To cluster our dataset we will minimize the following distortion measure, which minimizes the distance of each point from the prototype vector:

  J(r, μ) := (1/2) Σ_{i=1}^{m} Σ_{j=1}^{k} rij ‖xi − μj‖²     (1.29)

where r = {rij }, μ = {μj }, and ‖·‖² denotes the usual Euclidean squared norm.
Our goal is to find r and μ, but since it is not easy to jointly minimize J with respect to both r and μ, we will adopt a two stage strategy:
Stage 1 Keep the μ fixed and determine r. In this case, it is easy to see that the minimization decomposes into m independent problems. The solution for the i-th data point xi can be found by setting:

  rij = 1 if j = argmin_{j′} ‖xi − μj′‖²     (1.30)

and 0 otherwise.
Stage 2 Keep the r fixed and determine μ. Since the r's are fixed, J is a quadratic function of μ. It can be minimized by setting the derivative with respect to μj to be 0:

  Σ_{i=1}^{m} rij (xi − μj ) = 0 for all j.     (1.31)

Rearranging yields

  μj = Σ_i rij xi / Σ_i rij     (1.32)

Since Σ_i rij counts the number of points assigned to cluster j, we are essentially setting μj to be the sample mean of the points assigned to cluster j.


The algorithm stops when the cluster assignments do not change significantly. Detailed pseudo-code can be found in Algorithm 1.5.
Algorithm 1.5 K-Means
Cluster(X) {Cluster dataset X}
  Initialize cluster centers μj for j = 1, . . . , k randomly
  repeat
    for i = 1 to m do
      Compute j′ = argmin_{j=1,...,k} d(xi , μj )
      Set rij′ = 1 and rij = 0 for all j ≠ j′
    end for
    for j = 1 to k do
      Compute μj = Σ_i rij xi / Σ_i rij
    end for
  until Cluster assignments rij are unchanged
  return {μ1 , . . . , μk } and rij
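For illustration, a compact NumPy sketch of Algorithm 1.5 could read as follows (initialization, data, and names are ours; empty clusters are handled crudely by keeping the old center):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # alternate between assigning points to the closest center and
    # recomputing each center as the mean of its assigned points
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        r = np.argmin(d, axis=1)                         # hard assignments
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):                      # assignments stabilized
            break
        mu = new_mu
    return mu, r

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
mu, r = kmeans(X, k=3)
print(np.round(mu, 2))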
Two issues with K-Means are worth noting. First, it is sensitive to the
choice of the initial cluster centers μ. A number of practical heuristics have
been developed. For instance, one could randomly choose k points from the
given dataset as cluster centers. Other methods try to pick k points from X
which are farthest away from each other. Second, it makes a hard assignment
of every point to a cluster center. Variants which we will encounter later in
the book will relax this. Instead of letting rij ∈ {0, 1} these soft variants
will replace it with the probability that a given xi belongs to cluster j.
The K-Means algorithm concludes our discussion of a set of basic machine
learning methods for classification and regression. They provide a useful
starting point for an aspiring machine learning researcher. In this book we
will see many more such algorithms as well as connections between these
basic algorithms and their more advanced counterparts.
Problems
Problem 1.1 (Eyewitness) Assume that an eyewitness is 90% certain
that a given person committed a crime in a bar. Moreover, assume that
there were 50 people in the restaurant at the time of the crime. What is the
posterior probability of the person actually having committed the crime.
Problem 1.2 (DNA Test) Assume the police have a DNA library of 10
million records. Moreover, assume that the false recognition probability is


below 0.00001% per record. Suppose a match is found after a database search
for an individual. What are the chances that the identification is correct? You
can assume that the total population is 100 million people. Hint: compute
the probability of no match occurring first.
Problem 1.3 (Bomb Threat) Suppose that the probability that one of a
thousand passengers on a plane has a bomb is 1 : 1, 000, 000. Assuming that
the probability to have a bomb is evenly distributed among the passengers,
the probability that two passengers have a bomb is roughly equal to 10^{-12}.
Therefore, one might decide to take a bomb on a plane to decrease chances
that somebody else has a bomb. What is wrong with this argument?
Problem 1.4 (Monty-Hall Problem) Assume that in a TV show the
candidate is given the choice between three doors. Behind two of the doors
there is a pencil and behind one there is the grand prize, a car. The candidate chooses one door. After that, the showmaster opens another door behind
which there is a pencil. Should the candidate switch doors after that? What
is the probability of winning the car?
Problem 1.5 (Mean and Variance for Random Variables) Denote by
Xi random variables. Prove that in this case
  E_{X1 ,...,XN} [Σ_i xi ] = Σ_i E_{Xi}[xi ]   and   Var_{X1 ,...,XN} [Σ_i xi ] = Σ_i Var_{Xi}[xi ]

To show the second equality assume independence of the Xi .


Problem 1.6 (Two Dices) Assume you have a game which uses the maximum of two dices. Compute the probability of seeing any of the events
{1, . . . , 6}. Hint: prove first that the cumulative distribution function of the
maximum of a pair of random variables is the square of the original cumulative distribution function.
Problem 1.7 (Matching Coins) Consider the following game: two players bring a coin each. The first player bets that when tossing the coins both
will match and the second one bets that they will not match. Show that even
if one of the players were to bring a tainted coin, the game still would be
fair. Show that it is in the interest of each player to bring a fair coin to the
game. Hint: assume that the second player knows that the first coin favors
heads over tails.


Problem 1.8 (Randomized Maximization) How many observations do


you need to draw from a distribution to ensure that the maximum over them
is larger than 95% of all observations with at least 95% probability? Hint:
generalize the result from Problem 1.6 to the maximum over n random variables.
Application: Assume we have 1000 computers performing MapReduce [DG08]
and the Reducers have to wait until all 1000 Mappers are finished with their
job. Compute the quantile of the typical time to completion.
Problem 1.9 Prove that the Normal distribution (1.3) has mean μ and variance σ². Hint: exploit the fact that p is symmetric around μ.
Problem 1.10 (Cauchy Distribution) Prove that for the density
  p(x) = 1 / (π (1 + x²))     (1.33)

mean and variance are undefined. Hint: show that the integral diverges.
Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds the median. Hint: the mean depends on the value of the high-quantile
terms, whereas the median does not.
Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory Naive Bayes the optimal decision is given by
  y∗(x) := argmax_y p(y) ∏_{i=1}^{n} p([x]i |y)     (1.34)

where y ∈ Y is the class label of the observation x.


Problem 1.13 (Bayes Optimal Decisions) Denote by y∗(x) = argmax_y p(y|x) the label associated with the largest conditional class probability. Prove that for y∗(x) the probability of choosing the wrong label y is given by

  l(x) := 1 − p(y∗(x)|x).

Moreover, show that y∗(x) is the label incurring the smallest misclassification error.
Problem 1.14 (Nearest Neighbor Loss) Show that the expected loss incurred by the nearest neighbor classifier does not exceed twice the loss of the
Bayes optimal decision.

2
Density Estimation

2.1 Limit Theorems


Assume you are a gambler and go to a casino to play a game of dice. As
it happens, it is your unlucky day and among the 100 times you toss the
dice, you only see 6 eleven times. For a fair dice we know that each face
should occur with equal probability 1/6. Hence the expected value over 100 draws is 100/6 ≈ 17, which is considerably more than the eleven times that we
observed. Before crying foul you decide that some mathematical analysis is
in order.
The probability of seeing a particular sequence of m trials out of which n are a 6 is given by (1/6)^n (5/6)^{m−n}. Moreover, there are (m choose n) = m!/(n!(m−n)!) different sequences of 6 and not-6 with proportions n and m−n respectively. Hence we may compute the probability of seeing a 6 only 11 times or less via

  Pr(X ≤ 11) = Σ_{i=0}^{11} p(i) = Σ_{i=0}^{11} (100 choose i) (1/6)^i (5/6)^{100−i} ≈ 7.0%     (2.1)

After looking at this figure you decide that things are probably reasonable.
And, in fact, they are consistent with the convergence behavior of a simulated dice in Figure 2.1. In computing (2.1) we have learned something
useful: the expansion is a special case of a binomial series. The first term
Fig. 2.1. Convergence of empirical means to expectations. From left to right: empirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100, 200, and 500 times respectively. Note that after 20 throws we still have not observed a single 6, an event which occurs with only (5/6)^{20} ≈ 2.6% probability.


counts the number of configurations in which we could observe i times 6 in a


sequence of 100 dice throws. The second and third term are the probabilities
of seeing one particular instance of such a sequence.
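The tail probability (2.1) is easy to evaluate numerically. The following short Python check (ours, not from the book) reproduces the 7% figure:

from math import comb

# probability of seeing at most 11 sixes in 100 throws of a fair dice, eq. (2.1)
p = sum(comb(100, i) * (1/6)**i * (5/6)**(100 - i) for i in range(12))
print(p)            # approximately 0.07, i.e. about 7%

# probability of not seeing a single six in 20 throws (cf. Figure 2.1)
print((5/6)**20)    # approximately 0.026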
Note that in general we may not be as lucky, since we may have considerably less information about the setting we are studying. For instance,
we might not know the actual probabilities for each face of the dice, which
would be a likely assumption when gambling at a casino of questionable
reputation. Often the outcomes of the system we are dealing with may be
continuous valued random variables rather than binary ones, possibly even
with unknown range. For instance, when trying to determine the average
wage through a questionnaire we need to determine how many people we
need to ask in order to obtain a certain level of confidence.
To answer such questions we need to discuss limit theorems. They tell
us by how much averages over a set of observations may deviate from the
corresponding expectations and how many observations we need to draw to
estimate a number of probabilities reliably. For completeness we will present
proofs for some of the more fundamental theorems in Section 2.1.2. They
are useful albeit non-essential for the understanding of the remainder of the
book and may be omitted.

2.1.1 Fundamental Laws


The Law of Large Numbers developed by Bernoulli in 1713 is one of the
fundamental building blocks of statistical analysis. It states that averages
over a number of observations converge to their expectations given a sufficiently large number of observations and given certain assumptions on the
independence of these observations. It comes in two flavors: the weak and
the strong law.
Theorem 2.1 (Weak Law of Large Numbers) Denote by X1 , . . . , Xm random variables drawn from p(x) with mean μ = E_{Xi}[xi ] for all i. Moreover let

  X̄m := (1/m) Σ_{i=1}^{m} Xi     (2.2)

be the empirical average over the random variables Xi . Then for any ε > 0 the following holds

  lim_{m→∞} Pr(|X̄m − μ| ≤ ε) = 1.     (2.3)


Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line denotes the mean 3.5. The uneven solid line denotes the actual mean X̄n as a function of the number of draws, given as a semilogarithmic plot. The crosses denote the outcomes of the dice. Note how X̄n ever more closely approaches the mean 3.5 as we obtain an increasing number of observations.

This establishes that, indeed, for large enough sample sizes, the average will
converge to the expectation. The strong law strengthens this as follows:
Theorem 2.2 (Strong Law of Large Numbers) Under the conditions of Theorem 2.1 we have Pr(lim_{m→∞} X̄m = μ) = 1.

The strong law implies that almost surely (in a measure theoretic sense) X̄m converges to μ, whereas the weak law only states that for every ε the random variable X̄m will be within the interval [μ − ε, μ + ε]. Clearly the strong implies the weak law since the measure of the events X̄m = μ converges to 1, hence any ε-ball around μ would capture this.
Both laws justify that we may take sample averages, e.g. over a number
of events such as the outcomes of a dice and use the latter to estimate their
means, their probabilities (here we treat the indicator variable of the event
as a {0; 1}-valued random variable), their variances or related quantities. We
postpone a proof until Section 2.1.2, since an effective way of proving Theorem 2.1 relies on the theory of characteristic functions which we will discuss
in the next section. For the moment, we only give a pictorial illustration in
Figure 2.2.
Once we established that the random variable X̄m = m^{-1} Σ_{i=1}^{m} Xi converges to its mean μ, a natural second question is to establish how quickly it converges and what the properties of the limiting distribution of X̄m are. Note in Figure 2.2 that the initial deviation from the mean is large whereas as we observe more data the empirical mean approaches the true one.


Fig. 2.3. Five instantiations of a running average over outcomes of a toss of a dice. Note that all of them converge to the mean 3.5. Moreover note that they all are well contained within the upper and lower envelopes given by μ ± √(VarX [x]/m).

The central limit theorem answers this question exactly by addressing a


slightly more general question, namely whether the sum over a number of
independent random variables where each of them arises from a different
distribution might also have a well behaved limiting distribution. This is
the case as long as the variance of each of the random variables is bounded.
The limiting distribution of such a sum is Gaussian. This affirms the pivotal
role of the Gaussian distribution.
Theorem 2.3 (Central Limit Theorem) Denote by Xi independent random variables with means μi and standard deviations σi . Then

  Zm := [Σ_{i=1}^{m} σi²]^{-1/2} Σ_{i=1}^{m} (Xi − μi )     (2.4)

converges to a Normal Distribution with zero mean and unit variance.


Note that just like the law of large numbers the central limit theorem (CLT)
is an asymptotic result. That is, only in the limit of an infinite number of
observations will it become exact. That said, it often provides an excellent
approximation even for finite numbers of observations, as illustrated in Figure 2.4. In fact, the central limit theorem and related limit theorems build
the foundation of what is known as asymptotic statistics.
Example 2.1 (Dice) If we are interested in computing the mean of the
values returned by a dice we may apply the CLT to the sum over m variables


which have all mean μ = 3.5 and variance (see Problem 2.1)

  VarX [x] = EX [x²] − EX [x]² = (1 + 4 + 9 + 16 + 25 + 36)/6 − 3.5² ≈ 2.92.

We now study the random variable Wm := m^{-1} Σ_{i=1}^{m} [Xi − 3.5]. Since each of the terms in the sum has zero mean, also Wm's mean vanishes. Moreover, Wm is a multiple of Zm of (2.4). Hence we have that Wm converges to a normal distribution with zero mean and standard deviation √2.92 · m^{-1/2}.
Consequently the average of m tosses of the dice yields a random variable with mean 3.5 and it will approach a normal distribution with variance m^{-1} 2.92. In other words, the empirical mean converges to its average at rate O(m^{-1/2}). Figure 2.3 gives an illustration of the quality of the bounds implied by the CLT.
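The following small simulation (ours, not from the book) illustrates Example 2.1: the empirical variance of the average of m dice tosses is close to 2.92/m.

import numpy as np

rng = np.random.default_rng(0)
m = 100
trials = 20000

# average of m dice tosses, repeated many times
rolls = rng.integers(1, 7, size=(trials, m))
means = rolls.mean(axis=1)

print(means.mean())   # close to 3.5
print(means.var())    # close to 2.92 / m
print(2.92 / m)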
One remarkable property of functions of random variables is that in many
conditions convergence properties of the random variables are bestowed upon
the functions, too. This is manifest in the following two results: a variant
of Slutskys theorem and the so-called delta method. The former deals with
limit behavior whereas the latter deals with an extension of the central limit
theorem.
Theorem 2.4 (Slutsky's Theorem) Denote by Xi , Yi sequences of random variables with Xi → X and Yi → c for c ∈ R in probability. Moreover, denote by g(x, y) a function which is continuous for all (x, c). In this case the random variable g(Xi , Yi ) converges in probability to g(X, c).
For a proof see e.g. [Bil68]. Theorem 2.4 is often referred to as the continuous
mapping theorem (Slutsky only proved the result for affine functions). It
means that for functions of random variables it is possible to pull the limiting
procedure into the function. Such a device is useful when trying to prove
asymptotic normality and in order to obtain characterizations of the limiting
distribution.
Theorem 2.5 (Delta Method) Assume that Xn ∈ Rd is asymptotically normal with a_n^{-2}(Xn − b) → N(0, Σ) for a_n → 0. Moreover, assume that g : Rd → Rl is a mapping which is continuously differentiable at b. In this case the random variable g(Xn ) converges

  a_n^{-2}(g(Xn ) − g(b)) → N(0, [∇x g(b)] Σ [∇x g(b)]⊤).     (2.5)

Proof  Via a Taylor expansion we see that

  a_n^{-2}[g(Xn ) − g(b)] = [∇x g(ξn )]⊤ a_n^{-2}(Xn − b)     (2.6)


Here ξn lies on the line segment [b, Xn ]. Since Xn → b we have that ξn → b, too. Since g is continuously differentiable at b we may apply Slutsky's theorem to see that a_n^{-2}[g(Xn ) − g(b)] → [∇x g(b)]⊤ a_n^{-2}(Xn − b). As a consequence, the transformed random variable is asymptotically normal with covariance [∇x g(b)] Σ [∇x g(b)]⊤.
We will use the delta method when it comes to investigating properties of
maximum likelihood estimators in exponential families. There g will play the
role of a mapping between expectations and the natural parametrization of
a distribution.

2.1.2 The Characteristic Function


The Fourier transform plays a crucial role in many areas of mathematical
analysis and engineering. This is equally true in statistics. For historic reasons its applications to distributions is called the characteristic function,
which we will discuss in this section. At its foundations lie standard tools
from functional analysis and signal processing [Rud73, Pap62]. We begin by
recalling the basic properties:
Definition 2.6 (Fourier Transform) Denote by f : Rn → C a function defined on a d-dimensional Euclidean space. Moreover, let x, ω ∈ Rn . Then
the Fourier transform F and its inverse F^{-1} are given by

  F[f](ω) := (2π)^{-d/2} ∫_{Rn} f(x) exp(−i⟨ω, x⟩) dx     (2.7)
  F^{-1}[g](x) := (2π)^{-d/2} ∫_{Rn} g(ω) exp(i⟨ω, x⟩) dω.     (2.8)

The key insight is that F^{-1} ∘ F = F ∘ F^{-1} = Id. In other words, F and F^{-1} are inverses to each other for all functions which are L2 integrable on Rd , which includes probability distributions. One of the key advantages of Fourier transforms is that derivatives and convolutions on f translate into multiplications. That is F[f ∗ g] = (2π)^{d/2} F[f] · F[g]. The same rule applies to the inverse transform, i.e. F^{-1}[f ∗ g] = (2π)^{d/2} F^{-1}[f] F^{-1}[g].
The benefit for statistical analysis is that often problems are more easily
expressed in the Fourier domain and it is easier to prove convergence results
there. These results then carry over to the original domain. We will be
exploiting this fact in the proof of the law of large numbers and the central
limit theorem. Note that the definition of Fourier transforms can be extended
to more general domains such as groups. See e.g. [BCR84] for further details.


We next introduce the notion of a characteristic function of a distribution.1

Definition 2.7 (Characteristic Function) Denote by p(x) a distribution of a random variable X ∈ Rd . Then the characteristic function φX(ω) with ω ∈ Rd is given by

  φX(ω) := (2π)^{d/2} F^{-1}[p(x)] = ∫ exp(i⟨ω, x⟩) dp(x).     (2.9)

In other words, φX(ω) is the inverse Fourier transform applied to the probability measure p(x). Consequently φX(ω) uniquely characterizes p(x) and moreover, p(x) can be recovered from φX(ω) via the forward Fourier transform. One of the key utilities of characteristic functions is that they allow us to deal in easy ways with sums of random variables.
Theorem 2.8 (Sums of random variables and convolutions) Denote by X, Y ∈ R two independent random variables. Moreover, denote by Z := X + Y the sum of both random variables. Then the distribution over Z satisfies p(z) = p(x) ∗ p(y). Moreover, the characteristic function yields:

  φZ(ω) = φX(ω) φY(ω).     (2.10)

Proof  Z is given by Z = X + Y. Hence, for a given Z = z we have the freedom to choose X = x freely provided that Y = z − x. In terms of distributions this means that the joint distribution p(z, x) is given by

  p(z, x) = p(Y = z − x) p(x)  and hence  p(z) = ∫ p(Y = z − x) dp(x) = [p(x) ∗ p(y)](z).

The result for characteristic functions follows from the convolution property of the Fourier transform.
For sums of several random variables the characteristic function is the product of the individual characteristic functions. This allows us to prove both
the weak law of large numbers and the central limit theorem (see Figure 2.4
for an illustration) by proving convergence in the Fourier domain.
Proof [Weak Law of Large Numbers] At the heart of our analysis lies
a Taylor expansion of the exponential into
  exp(iwx) = 1 + i⟨w, x⟩ + o(|w|)  and hence  φX(w) = 1 + iwEX [x] + o(|w|).
1 In Chapter 10 we will discuss more general descriptions of distributions of which φX is a special case. In particular, we will replace the exponential exp(i⟨ω, x⟩) by a kernel function k(x, x′).


Fig. 2.4. A working example of the central limit theorem. The top row contains distributions of sums of uniformly distributed random variables on the interval [−0.5, 0.5]. From left to right we have sums of 1, 2, 4, 8 and 16 random variables. The bottom row contains the same distribution with the means rescaled by √m, where m is the number of observations. Note how the distribution converges increasingly to the normal distribution.

Given m random variables Xi with mean EX [x] = μ this means that their average X̄m := (1/m) Σ_{i=1}^{m} Xi has the characteristic function

  φX̄m(w) = (1 + (i/m) wμ + o(m^{-1}|w|))^m     (2.11)

In the limit of m → ∞ this converges to exp(iwμ), the characteristic function of the constant distribution with mean μ. This proves the claim that in the large sample limit X̄m is essentially constant with mean μ.
Proof [Central Limit Theorem] We use the same idea as above to prove
the CLT. The main difference, though, is that we need to assume that the
second moments of the random variables Xi exist. To avoid clutter we only
prove the case of constant mean E_{Xi}[xi ] = μ and variance Var_{Xi}[xi ] = σ².


Let Zm := (σ√m)^{-1} Σ_{i=1}^{m} (Xi − μ). Our proof relies on showing convergence of the characteristic function of Zm , i.e. φZm , to that of a normally distributed random variable W with zero mean and unit variance. Expanding the exponential to second order yields:

  exp(iwx) = 1 + iwx − (1/2) w² x² + o(|w|²)
  and hence φX(w) = 1 + iwEX [x] − (1/2) w² VarX [x] + o(|w|²)

Since the mean of Zm vanishes by centering (Xi − μ) and the variance per variable is m^{-1} we may write the characteristic function of Zm via

  φZm(w) = (1 − (1/(2m)) w² + o(m^{-1}|w|²))^m

As before, taking limits m → ∞ yields the exponential function. We have that lim_{m→∞} φZm(w) = exp(−w²/2) which is the characteristic function of the normal distribution with zero mean and variance 1. Since the characteristic function transform is injective this proves our claim.
Note that the characteristic function has a number of useful properties. For
instance, it can also be used as a moment generating function via the identity:

  ∂w^n φX(0) = i^n EX [x^n].     (2.12)

Its proof is left as an exercise. See Problem 2.2 for details. This connection
also implies (subject to regularity conditions) that if we know the moments
of a distribution we are able to reconstruct it directly since it allows us
to reconstruct its characteristic function. This idea has been exploited in
density estimation [Cra46] in the form of Edgeworth and Gram-Charlier
expansions [Hal92].

2.1.3 Tail Bounds


In practice we never have access to an infinite number of observations. Hence
the central limit theorem does not apply but is just an approximation to the
real situation. For instance, in the case of the dice, we might want to state
worst case bounds for finite sums of random variables to determine by how
much the empirical mean may deviate from its expectation. Those bounds
will not only be useful for simple averages but to quantify the behavior of
more sophisticated estimators based on a set of observations.
The bounds we discuss below differ in the amount of knowledge they
assume about the random variables in question. For instance, we might only


know their mean. This leads to the Gauss-Markov inequality. If we know


their mean and their variance we are able to state a stronger bound, the
Chebyshev inequality. For an even stronger setting, when we know that
each variable has bounded range, we will be able to state a Chernoff bound.
Those bounds are progressively more tight and also more difficult to prove.
We state them in order of technical sophistication.
Theorem 2.9 (Gauss-Markov) Denote by X ≥ 0 a random variable and let μ be its mean. Then for any ε > 0 we have

  Pr(X ≥ ε) ≤ μ/ε.     (2.13)

Proof  We use the fact that for nonnegative random variables

  Pr(X ≥ ε) = ∫_ε^∞ dp(x) ≤ ∫_ε^∞ (x/ε) dp(x) ≤ ε^{-1} ∫_0^∞ x dp(x) = μ/ε.

This means that for random variables with a small mean, the proportion of samples with large value has to be small.
Consequently deviations from the mean are O(ε^{-1}). However, note that this
bound does not depend on the number of observations. A useful application
of the Gauss-Markov inequality is Chebyshevs inequality. It is a statement
on the range of random variables using its variance.
Theorem 2.10 (Chebyshev) Denote by X a random variable with mean μ and variance σ². Then the following holds for ε > 0:

  Pr(|x − μ| ≥ ε) ≤ σ²/ε².     (2.14)

Proof  Denote by Y := |X − μ|² the random variable quantifying the deviation of X from its mean μ. By construction we know that EY [y] = σ². Next let γ := ε². Applying Theorem 2.9 to Y and γ yields Pr(Y > γ) ≤ σ²/γ which proves the claim.
Note the improvement to the Gauss-Markov inequality. Where before we had
bounds whose confidence improved with O(ε^{-1}) we can now state O(ε^{-2})
bounds for deviations from the mean.
Example 2.2 (Chebyshev bound) Assume that X̄m := m^{-1} Σ_{i=1}^{m} Xi is the average over m random variables with mean μ and variance σ². Hence X̄m also has mean μ. Its variance is given by

  Var_{X̄m}[x̄m ] = Σ_{i=1}^{m} m^{-2} Var_{Xi}[xi ] = m^{-1} σ².


Applying Chebyshev's inequality yields that the probability of a deviation of ε from the mean is bounded by σ²/(m ε²). For fixed failure probability δ = Pr(|X̄m − μ| > ε) we have

  δ ≤ σ² m^{-1} ε^{-2} and equivalently ε ≤ σ/√(δ m).

This bound is quite reasonable for large δ but it means that for high levels of confidence we need a huge number of observations.
Much stronger results can be obtained if we are able to bound the range
of the random variables. Using the latter, we reap an exponential improvement in the quality of the bounds in the form of the McDiarmid [McD89]
inequality. We state the latter without proof:
Theorem 2.11 (McDiarmid) Denote by f : X^m → R a function on X and let Xi be independent random variables. In this case the following holds:

  Pr(|f(x1 , . . . , xm ) − E_{X1 ,...,Xm}[f(x1 , . . . , xm )]| > ε) ≤ 2 exp(−2 ε² C^{-2}).

Here the constant C² is given by C² = Σ_{i=1}^{m} ci², where

  |f(x1 , . . . , xi , . . . , xm ) − f(x1 , . . . , x′i , . . . , xm )| ≤ ci

for all x1 , . . . , xm , x′i and for all i.
This bound can be used for averages of a number of observations when
they are computed according to some algorithm as long as the latter can be
encoded in f . In particular, we have the following bound [Hoe63]:
Theorem 2.12 (Hoeffding) Denote by Xi iid random variables with bounded range Xi ∈ [a, b] and mean μ. Let X̄m := m^{-1} Σ_{i=1}^{m} Xi be their average. Then the following bound holds:

  Pr(|X̄m − μ| > ε) ≤ 2 exp(−2mε²/(b − a)²).     (2.15)
Proof  This is a corollary of Theorem 2.11. In X̄m each individual random variable has range [a/m, b/m] and we set f(X1 , . . . , Xm ) := X̄m . Straightforward algebra shows that C² = m^{-1}(b − a)². Plugging this back into McDiarmid's theorem proves the claim.
Note that (2.15) is exponentially better than the previous bounds. With
increasing sample size the confidence level also increases exponentially.
Example 2.3 (Hoeffding bound) As in Example 2.2 assume that Xi are iid random variables and let X̄m be their average. Moreover, assume that


Xi ∈ [a, b] for all i. As before we want to obtain guarantees on the probability that |X̄m − μ| > ε. For a given level of confidence 1 − δ we need to solve

  2 exp(−2mε²/(b − a)²) ≤ δ     (2.16)

for ε. Straightforward algebra shows that in this case ε needs to satisfy

  ε ≥ |b − a| √([log 2 − log δ]/2m)     (2.17)

In other words, while the confidence level only enters logarithmically into the inequality, the sample size m improves our confidence only with ε = O(m^{-1/2}). That is, in order to improve our confidence interval from ε = 0.1 to ε = 0.01 we need 100 times as many observations.
While this bound is tight (see Problem 2.5 for details), it is possible to obtain better bounds if we know additional information. In particular knowing
a bound on the variance of a random variable in addition to knowing that it
has bounded range would allow us to strengthen the statement considerably.
The Bernstein inequality captures this connection. For details see [BBL05]
or works on empirical process theory [vdVW96, SW86, Vap82].

2.1.4 An Example
It is probably easiest to illustrate the various bounds using a concrete example. In a semiconductor fab processors are produced on a wafer. A typical
300mm wafer holds about 400 chips. A large number of processing steps
are required to produce a finished microprocessor and often it is impossible
to assess the effect of a design decision until the finished product has been
produced.
Assume that the production manager wants to change some step from
process A to some other process B. The goal is to increase the yield of
the process, that is, the number of chips of the 400 potential chips on the
wafer which can be sold. Unfortunately this number is a random variable,
i.e. the number of working chips per wafer can vary widely between different
wafers. Since process A has been running in the factory for a very long
time we may assume that the yield is well known, say it is μA = 350 out
of 400 processors on average. It is our goal to determine whether process
B is better and what its yield may be. Obviously, since production runs
are expensive we want to be able to determine this number as quickly as
possible, i.e. using as few wafers as possible. The production manager is risk
averse and wants to ensure that the new process is really better. Hence he
requires a confidence level of 95% before he will change the production.


A first step is to formalize the problem. Since we know process A exactly


we only need to concern ourselves with B. We associate the random variable
Xi with wafer i. A reasonable (and somewhat simplifying) assumption is to
posit that all Xi are independent and identically distributed where all Xi
have the mean μB. Obviously we do not know μB, otherwise there would be no reason for testing! We denote by X̄m the average of the yields of m wafers using process B. What we are interested in is the accuracy ε for which the probability

  δ = Pr(|X̄m − μB| > ε) satisfies δ ≤ 0.05.
Let us now discuss how the various bounds behave. For the sake of the
argument assume that μB − μA = 20, i.e. the new process produces on
average 20 additional usable chips.
Chebyshev In order to apply the Chebyshev inequality we need to bound the variance of the random variables Xi . The worst possible variance would occur if Xi ∈ {0; 400} where both events occur with equal probability. In other words, with equal probability the wafer is fully usable or it is entirely broken. This amounts to σ² = 0.5(200 − 0)² + 0.5(200 − 400)² = 40,000. Since for Chebyshev bounds we have

  δ ≤ σ² m^{-1} ε^{-2}     (2.18)

we can solve for m = σ²/(δ ε²) = 40,000/(0.05 · 400) = 20,000. In other words, we would typically need 20,000 wafers to assess with reasonable confidence whether process B is better than process A. This is completely unrealistic.
Slightly better bounds can be obtained if we are able to make better assumptions on the variance. For instance, if we can be sure that the yield of process B is at least 300, then the largest possible variance is 0.25(300 − 0)² + 0.75(300 − 400)² = 30,000, leading to a minimum of 15,000 wafers which is not much better.
Hoeffding Since the yields are in the interval {0, . . . , 400} we have an explicit bound on the range of observations. Recall the inequality (2.16) which
bounds the failure probability δ = 0.05 by an exponential term. Solving this for m yields

  m ≥ 0.5 |b − a|² ε^{-2} log(2/δ) ≈ 737.8     (2.19)

In other words, we need at least 738 wafers to determine whether process B


is better. While this is a significant improvement of almost two orders of
magnitude, it still seems wasteful and we would like to do better.


Central Limit Theorem The central limit theorem is an approximation.


This means that our reasoning is not accurate any more. That said, for
large enough sample sizes, the approximation is good enough to use it for
practical predictions. Assume for the moment that we knew the variance σ² exactly. In this case we know that X̄m is approximately normal with mean μB and variance m^{-1} σ². We are interested in the interval [μ − ε, μ + ε] which
contains 95% of the probability mass of a normal distribution. That is, we
need to solve the integral

  ∫_{μ−ε}^{μ+ε} (2πσ²)^{-1/2} exp(−(x − μ)²/(2σ²)) dx = 0.95     (2.20)

This can be solved efficiently using the cumulative distribution function of a normal distribution (see Problem 2.3 for more details). One can check that (2.20) is solved for ε = 2.96σ. In other words, an interval of ±2.96σ contains 95% of the probability mass of a normal distribution. The number of observations is therefore determined by

  ε = 2.96 σ/√m and hence m = 8.76 σ²/ε²     (2.21)
Again, our problem is that we do not know the variance of the distribution.
Using the worst-case bound on the variance, i.e. σ² = 40,000, would lead to
a requirement of at least m = 876 wafers for testing. However, while we do
not know the variance, we may estimate it along with the mean and use the
empirical estimate, possibly plus some small constant to ensure we do not
underestimate the variance, instead of the upper bound.
Assuming that fluctuations turn out to be in the order of 50 processors,
i.e. σ² = 2500, we are able to reduce our requirement to approximately 55
wafers. This is probably an acceptable number for a practical test.
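The sample sizes quoted above for the Hoeffding and CLT arguments are easy to recompute. The following short sketch (ours, not part of the book) evaluates (2.19) and (2.21) with the parameters of the example:

import numpy as np

eps = 20          # accuracy we want to certify (additional usable chips)
delta = 0.05      # allowed failure probability (95% confidence)
span = 400        # range b - a of the yield per wafer

# Hoeffding requirement, eq. (2.19)
m_hoeffding = 0.5 * span**2 / eps**2 * np.log(2 / delta)

# CLT-based requirement, eq. (2.21), for the two variance guesses
m_clt_worst = 8.76 * 40_000 / eps**2    # worst-case variance
m_clt_emp = 8.76 * 2_500 / eps**2       # empirical variance estimate

print(round(m_hoeffding, 1), round(m_clt_worst), round(m_clt_emp))
# 737.8 876 55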
Rates and Constants The astute reader will have noticed that all three
confidence bounds had scaling behavior m = O(ε^{-2}). That is, in all cases
the number of observations was a fairly ill behaved function of the amount
of confidence required. If we were just interested in convergence per se, a
statement like that of the Chebyshev inequality would have been entirely
sufficient. The various laws and bounds can often be used to obtain considerably better constants for statistical confidence guarantees. For more
complex estimators, such as methods to classify, rank, or annotate data,
a reasoning such as the one above can become highly nontrivial. See e.g.
[MYA94, Vap98] for further details.


2.2 Parzen Windows


2.2.1 Discrete Density Estimation
The convergence theorems discussed so far mean that we can use empirical
observations for the purpose of density estimation. Recall the case of the
Naive Bayes classifier of Section 1.3.1. One of the key ingredients was the
ability to use information about word counts for different document classes
to estimate the probability p([x]i |y), where [x]i denoted the number of occurrences of word i in document x, given that it was of category y. In the
following we discuss an extremely simple and crude method for estimating
probabilities. It relies on the fact that for random variables Xi drawn from
distribution p(x) with discrete values Xi ∈ X we have

  lim_{m→∞} p̂X(x) = p(x)     (2.22)

  where p̂X(x) := m^{-1} Σ_{i=1}^{m} {xi = x} for all x ∈ X.     (2.23)

Let us discuss a concrete case. We assume that we have 12 documents and


would like to estimate the probability of occurrence of the word dog from it. As raw data we have, for each of the 12 documents, the number of occurrences of the word dog. This means that the word dog occurs the following number of times:

  Occurrences of dog      0    1    2    3    4    5    6
  Number of documents     4    2    2    1    1    0    2

Something unusual is happening here: for some reason we never observed


5 instances of the word dog in our documents, only 4 and less, or alternatively 6 times. So what about 5 times? It is reasonable to assume that
the corresponding value should not be 0 either. Maybe we did not sample
enough. One possible strategy is to add pseudo-counts to the observations.
This amounts to the following estimate:
\hat p_X(x) := \frac{1}{m + |X|}\left(1 + \sum_{i=1}^m \{x_i = x\}\right)    (2.24)

Clearly the limit for m → ∞ is still p(x). Hence, asymptotically we do not lose anything. This prescription is what we used in Algorithm 1.1; it is a method called Laplace smoothing. Below we contrast the two methods:


Occurrences of dog        0     1     2     3     4     5     6
Number of documents       4     2     2     1     1     0     2
Frequency of occurrence   0.33  0.17  0.17  0.083 0.083 0     0.17
Laplace smoothing         0.26  0.16  0.16  0.11  0.11  0.05  0.16
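As a minimal sketch (not part of the original text), the rows of the table above can be recomputed from the histogram of counts with a few lines of Python; the variable names are ours and numpy is the only dependency.

import numpy as np

# Number of documents containing the word "dog" exactly 0, 1, ..., 6 times.
counts = np.array([4, 2, 2, 1, 1, 0, 2])
m = counts.sum()                  # number of observations (12 documents)
domain_size = len(counts)         # |X|, the number of possible outcomes

frequency = counts / m                            # empirical estimate (2.23)
laplace = (counts + 1) / (m + domain_size)        # smoothed estimate (2.24)

for k, (f, l) in enumerate(zip(frequency, laplace)):
    print(f"{k} occurrences: frequency {f:.3f}, Laplace smoothing {l:.3f}")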

The problem with this method is that as |X| increases we need increasingly
more observations to obtain even a modicum of precision. On average, we
will need at least one observation for every x X. This can be infeasible for
large domains as the following example shows.
Example 2.4 (Curse of Dimensionality) Assume that X = {0, 1}^d, i.e. x consists of binary bit vectors of dimensionality d. As d increases the size of X increases exponentially, requiring an exponential number of observations to perform density estimation. For instance, if we work with images, a 100 × 100 black and white picture would require in the order of 10^{3010} observations to model such fairly low-resolution images accurately. This is clearly utterly infeasible: the number of particles in the known universe is in the order of 10^{80}. Bellman [Bel61] was one of the first to formalize this dilemma by coining the term curse of dimensionality.
This example clearly shows that we need better tools to deal with high-dimensional data. We will present one such tool in the next section.

2.2.2 Smoothing Kernel


We now proceed to proper density estimation. Assume that we want to
estimate the distribution of weights of a population. Sample data from a
population might look as follows: X = {57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
90, 98, 102, 97, 106, 99, 103, 112}. We could use this to perform a density
estimate by placing discrete components at the locations xi X with weight
1/|X| as is done in Figure 2.5. There is no reason to believe that weights are quantized in kilograms, or grams, or milligrams (or pounds and stones). And even if they were, we would expect that similar weights have similar densities associated with them. Indeed, as the right diagram of Figure 2.5 shows, the corresponding density is continuous.
The key question arising is how we may transform X into a realistic estimate of the density p(x). Starting with a density estimate with only discrete terms

\hat p(x) = \frac{1}{m} \sum_{i=1}^m δ(x − x_i)    (2.25)


we may choose to smooth it out by a smoothing kernel h(x) such that the
probability mass becomes somewhat more spread out. For a density estimate
on X Rd this is achieved by
\hat p(x) = \frac{1}{m} \sum_{i=1}^m r^{−d} h\left(\frac{x − x_i}{r}\right)    (2.26)

This expansion is commonly known as the Parzen windows estimate. Note


that obviously h must be chosen such that h(x) ≥ 0 for all x ∈ X and moreover that \int h(x) dx = 1 in order to ensure that (2.26) is a proper probability distribution.
We now formally justify this smoothing. Let R be a small region such that

q = \int_R p(x) \, dx.

Out of the m samples drawn from p(x), the probability that k of them fall in region R is given by the binomial distribution

\binom{m}{k} q^k (1 − q)^{m−k}.

The expected fraction of points falling inside the region can easily be computed from the expected value of the binomial distribution: E[k/m] = q. Similarly, the variance can be computed as Var[k/m] = q(1 − q)/m. As m → ∞ the variance goes to 0 and hence the estimate peaks around the expectation. We can therefore set

k ≈ mq.

If we assume that R is so small that p(x) is constant over R, then

q ≈ p(x) · V,

where V is the volume of R. Rearranging we obtain

p(x) ≈ \frac{k}{mV}.    (2.27)

Let us now set R to be a cube with side length r, and define a function
h(u) = \begin{cases} 1 & \text{if } |u_i| ≤ \frac{1}{2} \text{ for all } i \\ 0 & \text{otherwise.} \end{cases}

Observe that h\left(\frac{x − x_i}{r}\right) is 1 if and only if x_i lies inside a cube of side length r centered


around x. If we let

k = \sum_{i=1}^m h\left(\frac{x − x_i}{r}\right)

then one can use (2.27) to estimate p via

\hat p(x) = \frac{1}{m} \sum_{i=1}^m r^{−d} h\left(\frac{x − x_i}{r}\right)

where r^d is the volume of the hypercube of side length r in d dimensions. By symmetry, we can interpret this equation as the sum over m cubes centered around the m data points x_i. If we replace the cube by any smooth kernel function h(·) this recovers (2.26).
There exists a large variety of different kernels which can be used for the
kernel density estimate. [Sil86] has a detailed description of the properties
of a number of kernels. Popular choices are
h(x) = (2π)^{−1/2} e^{−x²/2}          Gaussian kernel        (2.28)
h(x) = \frac{1}{2} e^{−|x|}           Laplace kernel         (2.29)
h(x) = \frac{3}{4} \max(0, 1 − x²)    Epanechnikov kernel    (2.30)
h(x) = \frac{1}{2} χ_{[−1,1]}(x)      Uniform kernel         (2.31)
h(x) = \max(0, 1 − |x|)               Triangle kernel.       (2.32)

Further kernels are the triweight and the quartic kernel which are basically powers of the Epanechnikov kernel. For practical purposes the Gaussian kernel (2.28) or the Epanechnikov kernel (2.30) are most suitable. In particular, the latter has the attractive property of compact support. This means that for any given density estimate at location x we will only need to evaluate terms h((x_i − x)/r) for which the distance ‖x_i − x‖ is less than r. Such expansions are computationally much cheaper, in particular when we make use of fast nearest neighbor search algorithms [GIM99, IM98]. Figure 2.7 has some examples of kernels.
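The estimate (2.26) translates directly into code. The following Python sketch (illustrative names only, numpy assumed) evaluates a Parzen windows estimate on the one-dimensional weight data with the Gaussian and Epanechnikov kernels from the list above.

import numpy as np

X = np.array([57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
              90, 98, 102, 97, 106, 99, 103, 112], dtype=float)

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)        # kernel (2.28)

def epanechnikov(u):
    return 0.75 * np.maximum(0.0, 1.0 - u ** 2)               # kernel (2.30)

def parzen(x, data, r, h=gaussian, d=1):
    """Parzen windows estimate (2.26) evaluated at the points x."""
    u = (x[:, None] - data[None, :]) / r       # pairwise (x - x_i) / r
    return np.mean(r ** (-d) * h(u), axis=1)

grid = np.linspace(40, 120, 5)
print(parzen(grid, X, r=10.0))                  # smooth Gaussian estimate
print(parzen(grid, X, r=10.0, h=epanechnikov))  # compactly supported kernel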
2.2.3 Parameter Estimation
So far we have not discussed the issue of parameter selection. It should be
evident from Figure 2.6, though, that it is quite crucial to choose a good
kernel width. Clearly, a kernel that is overly wide will oversmooth any fine
detail that there might be in the density. On the other hand, a very narrow
kernel will not be very useful, since it will be able to make statements only
about the locations where we actually observed data.


Fig. 2.5. Left: a naive density estimate given a sample of the weight of 18 persons.
Right: the underlying weight distribution.
Fig. 2.6. Parzen windows density estimate associated with the 18 observations of
the Figure above. From left to right: Gaussian kernel density estimate with kernel
of width 0.3, 1, 3, and 10 respectively.
Fig. 2.7. Some kernels for Parzen windows density estimation. From left to right: Gaussian kernel, Laplace kernel, Epanechnikov kernel, and uniform density.

Moreover, there is the issue of choosing a suitable kernel function. The


fact that a large variety of them exists might suggest that this is a crucial
issue. In practice, this turns out not to be the case and instead, the choice
of a suitable kernel width is much more vital for good estimates. In other
words, size matters, shape is secondary.
The problem is that we do not know which kernel width is best for the
data. If the problem is one-dimensional, we might hope to be able to eyeball
the size of r. Obviously, in higher dimensions this approach fails. A second


option would be to choose r such that the log-likelihood of the data is


maximized. It is given by
\sum_{i=1}^m \log \hat p(x_i) = −m \log m + \sum_{i=1}^m \log \sum_{j=1}^m r^{−d} h\left(\frac{x_i − x_j}{r}\right)    (2.33)

Remark 2.13 (Log-likelihood) We consider the logarithm of the likelihood for reasons of computational stability to prevent numerical underflow. While each term p(x_i) might be within a suitable range, say 10^{−2}, the product of 1000 such terms will easily exceed the exponent range of floating point representations on a computer. Summing over the logarithms, on the other hand, is perfectly feasible even for large numbers of observations.
Unfortunately, choosing r by maximizing the log-likelihood directly is not feasible: for decreasing r the only surviving terms in (2.33) are the functions h((x_i − x_i)/r) = h(0), since the arguments of all other kernel functions diverge. In other words, the log-likelihood is maximized when p̂(x) is peaked exactly at the locations where we observed the data. The graph on the left of Figure 2.6 shows what happens in such a situation.
What we just experienced is a case of overfitting where our model is too
flexible. This led to a situation where our model was able to explain the
observed data unreasonably well, simply because we were able to adjust
our parameters given the data. We will encounter this situation throughout
the book. There exist a number of ways to address this problem.
Validation Set: We could use a subset of our set of observations as an estimate of the log-likelihood. That is, we could partition the observations into X := {x_1, . . . , x_n} and X′ := {x_{n+1}, . . . , x_m} and use the second part for a likelihood score according to (2.33). The second set is typically called a validation set.
n-fold Crossvalidation: Taking this idea further, note that there is no particular reason why any given x_i should belong to X or X′ respectively. In fact, we could use all splits of the observations into sets X and X′ to infer the quality of our estimate. While this is computationally infeasible, we could decide to split the observations into n equally sized subsets, say X_1, . . . , X_n, and use each of them as a validation set at a time while the remainder is used to generate a density estimate.
Typically n is chosen to be 10, in which case this procedure is


referred to as 10-fold crossvalidation. It is a computationally attractive procedure insofar as it does not require us to change the basic
estimation algorithm. Nonetheless, computation can be costly.
Leave-one-out Estimator: At the extreme end of crossvalidation we could
choose n = m. That is, we only remove a single observation at a time
and use the remainder of the data for the estimate. Using the average
over the likelihood scores provides us with an even more fine-grained
estimate. Denote by p̂_{−i}(x) the density estimate obtained by using X := {x_1, . . . , x_m} without x_i. For a Parzen windows estimate this is given by

\hat p_{−i}(x_i) = \frac{1}{m−1} \sum_{j \neq i} r^{−d} h\left(\frac{x_i − x_j}{r}\right) = \frac{m}{m−1}\left[\hat p(x_i) − \frac{1}{m} r^{−d} h(0)\right].    (2.34)

Note that this is precisely the term r^{−d} h(0) that is removed from the estimate. It is this term which led to divergent estimates for r → 0. This means that the leave-one-out log-likelihood estimate can be computed easily via

L(X) = m \log \frac{m}{m−1} + \sum_{i=1}^m \log\left[\hat p(x_i) − \frac{1}{m} r^{−d} h(0)\right].    (2.35)

We then choose r such that L(X) is maximized. This strategy is very


robust and whenever it can be implemented in a computationally
efficient manner, it is very reliable in performing model selection.
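A possible implementation of the leave-one-out criterion (2.35) for choosing the kernel width is sketched below; the function and variable names are ours, the candidate radii are arbitrary, and numpy is assumed.

import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def loo_log_likelihood(data, r, h=gaussian, d=1):
    """Leave-one-out log-likelihood (2.35) of a Parzen windows estimate."""
    m = len(data)
    u = (data[:, None] - data[None, :]) / r
    p_hat = np.mean(r ** (-d) * h(u), axis=1)        # full estimate at each x_i
    loo = p_hat - r ** (-d) * h(np.zeros(1)) / m     # remove the self term
    return m * np.log(m / (m - 1)) + np.sum(np.log(loo))

data = np.array([57., 88., 54., 84., 83., 59., 56., 43., 70., 63.,
                 90., 98., 102., 97., 106., 99., 103., 112.])
radii = [1.0, 3.0, 10.0, 30.0]
scores = [loo_log_likelihood(data, r) for r in radii]
print(dict(zip(radii, scores)), "best r:", radii[int(np.argmax(scores))])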
An alternative, probably more of theoretical interest, is to choose the scale r a priori based on the amount of data we have at our disposition. Intuitively, we need a scheme which ensures that r → 0 as the number of observations increases, m → ∞. However, we need to ensure that this happens slowly enough that the number of observations within range r keeps on increasing in order to ensure good statistical performance. For details we refer the reader to [Sil86]. Chapter 9 discusses issues of model selection for estimators in general in considerably more detail.

2.2.4 Silverman's Rule


Assume you are an aspiring demographer who wishes to estimate the population density of a country, say Australia. You might have access to a limited
census which, for a random portion of the population determines where they
live. As a consequence you will obtain a relatively high number of samples


Fig. 2.8. Nonuniform density. Left: original density with samples drawn from the distribution. Middle: density estimate with a uniform kernel. Right: density estimate using Silverman's adjustment.

of city dwellers, whereas the number of people living in the countryside is


likely to be very small.
If we attempt to perform density estimation using Parzen windows, we
will encounter an interesting dilemma: in regions of high density (i.e. the
cities) we will want to choose a narrow kernel width to allow us to model
the variations in population density accurately. Conversely, in the outback,
a very wide kernel is preferable, since the population there is very low.
Unfortunately, this information is exactly what a density estimator itself
could tell us. In other words we have a chicken and egg situation where
having a good density estimate seems to be necessary to come up with a
good density estimate.
Fortunately this situation can be addressed by realizing that we do not
actually need to know the density but rather a rough estimate of the latter.
This can be obtained by using information about the average distance of the
k nearest neighbors of a point. One of Silverman's rules of thumb [Sil86] is to choose r_i as

r_i = \frac{c}{k} \sum_{x ∈ \mathrm{kNN}(x_i)} ‖x − x_i‖.    (2.36)

Typically c is chosen to be 0.5 and k is small, e.g. k = 9 to ensure that the


estimate is computationally efficient. The density estimate is then given by
\hat p(x) = \frac{1}{m} \sum_{i=1}^m r_i^{−d} h\left(\frac{x − x_i}{r_i}\right)    (2.37)

Figure 2.8 shows an example of such a density estimate. It is clear that a


locality dependent kernel width is better than choosing a uniformly constant
kernel density estimate. However, note that this increases the computational
complexity of performing a density estimate, since first the k nearest neighbors need to be found before the density estimate can be carried out.
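A sketch of the adjustment (2.36)-(2.37), using a brute-force nearest-neighbor search and illustrative parameter choices (c = 0.5, k = 4 because the toy sample is small); the data below is synthetic and not from the text.

import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def silverman_radii(data, k=4, c=0.5):
    """Per-point kernel widths r_i from (2.36) via the k nearest neighbors."""
    dist = np.abs(data[:, None] - data[None, :])   # pairwise distances (1-d data)
    knn = np.sort(dist, axis=1)[:, 1:k + 1]        # exclude the point itself
    return c * knn.mean(axis=1)

def adaptive_parzen(x, data, radii, h=gaussian, d=1):
    """Locality dependent density estimate (2.37)."""
    u = (x[:, None] - data[None, :]) / radii[None, :]
    return np.mean(radii[None, :] ** (-d) * h(u), axis=1)

rng = np.random.default_rng(0)
# Dense "city" cluster plus a few sparse "outback" points.
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.uniform(20, 100, 10)])
radii = silverman_radii(data)
print(adaptive_parzen(np.array([0.0, 2.0, 50.0]), data, radii))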


2.2.5 Watson-Nadaraya Estimator


Now that we are able to perform density estimation we may use it to perform
classification and regression. This leads us to an effective method for nonparametric data analysis, the Watson-Nadaraya estimator [Wat64, Nad65].
The basic idea is very simple: assume that we have a binary classification
problem, i.e. we need to distinguish between two classes. Provided that we
are able to compute density estimates p(x) given a set of observations X we
could appeal to Bayes rule to obtain
p(y|x) = \frac{p(x|y)\, p(y)}{p(x)} = \frac{\frac{1}{m_y} \sum_{i:y_i = y} r^{−d} h\left(\frac{x_i − x}{r}\right) \cdot \frac{m_y}{m}}{\frac{1}{m} \sum_{i=1}^m r^{−d} h\left(\frac{x_i − x}{r}\right)}    (2.38)
Here we only take the sum over all xi with label yi = y in the numerator.
The advantage of this approach is that it is very cheap to design such an
estimator. After all, we only need to compute sums. The downside, similar
to that of the k-nearest neighbor classifier is that it may require sums (or
search) over a large number of observations. That is, evaluation of (2.38) is
potentially an O(m) operation. Fast tree based representations can be used
to accelerate this [BKL06, KM00], however their behavior depends significantly on the dimensionality of the data. We will encounter computationally
more attractive methods at a later stage.
For binary classification (2.38) can be simplified considerably. Assume that y ∈ {±1}. For p(y = 1|x) > 0.5 we should estimate y = 1 and in the converse case y = −1. Taking the difference between twice the numerator and the denominator we can see that the function

f(x) = \frac{\sum_i y_i h\left(\frac{x_i − x}{r}\right)}{\sum_i h\left(\frac{x_i − x}{r}\right)} = \sum_i y_i \frac{h\left(\frac{x_i − x}{r}\right)}{\sum_j h\left(\frac{x_j − x}{r}\right)} =: \sum_i y_i w_i(x)    (2.39)

can be used to achieve the same goal since f(x) > 0 ⟺ p(y = 1|x) > 0.5.
Note that f (x) is a weighted combination of the labels yi associated with
weights wi (x) which depend on the proximity of x to an observation xi .
In other words, (2.39) is a smoothed-out version of the k-nearest neighbor
classifier of Section 1.3.2. Instead of drawing a hard boundary at the k closest
observation we use a soft weighting scheme with weights wi (x) depending
on which observations are closest.
Note furthermore that the numerator of (2.39) is very similar to the simple classifier of Section 1.3.3. In fact, for kernels k(x, x′) such as the Gaussian RBF kernel, which are also kernels in the sense of a Parzen windows density estimate, i.e. k(x, x′) = r^{−d} h((x − x′)/r), the two terms are identical.


Fig. 2.9. Watson Nadaraya estimate. Left: a binary classifier. The optimal solution
would be a straight line since both classes were drawn from a normal distribution
with the same variance. Right: a regression estimator. The data was generated from
a sinusoid with additive noise. The regression tracks the sinusoid reasonably well.

This means that the Watson-Nadaraya estimator provides us with an alternative explanation as to why (1.24) leads to a usable classifier.
In the same fashion as the Watson-Nadaraya classifier extends the k-nearest neighbor classifier we also may construct a Watson-Nadaraya regression estimator by replacing the binary labels y_i by real-valued values y_i ∈ R to obtain the regression estimator \sum_i y_i w_i(x). Figure 2.9 has an example of the workings of both a regression estimator and a classifier. They are easy to use and they work well for moderately dimensional data.
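Both the classifier (2.39) and the regression estimator amount to a weighted average of the labels. A small illustrative sketch (our own function names, Gaussian smoothing kernel, synthetic sinusoid data):

import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def watson_nadaraya(x, data, labels, r, h=gaussian):
    """Weighted label average f(x) = sum_i y_i w_i(x) as in (2.39)."""
    w = h((x[:, None] - data[None, :]) / r)      # unnormalized weights
    w = w / w.sum(axis=1, keepdims=True)         # weights w_i(x) sum to one
    return w @ labels

rng = np.random.default_rng(1)
# Regression: noisy sinusoid, as in the right panel of Figure 2.9.
data = np.sort(rng.uniform(0, 2 * np.pi, 100))
labels = np.sin(data) + 0.3 * rng.normal(size=100)
grid = np.linspace(0, 2 * np.pi, 5)
print(watson_nadaraya(grid, data, labels, r=0.5))

# Classification: labels in {-1, +1}; predict the sign of f(x).
print(np.sign(watson_nadaraya(grid, data, np.sign(labels), r=0.5)))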

2.3 Exponential Families


Distributions from the exponential family are some of the most versatile
tools for statistical inference. Gaussians, Poisson, Gamma and Wishart distributions all form part of the exponential family. They play a key role in
dealing with graphical models, classification, regression and conditional random fields which we will encounter in later parts of this book. Some of the
reasons for their popularity are that they lead to convex optimization problems and that they allow us to describe probability distributions by linear
models.

2.3.1 Basics
Densities from the exponential family are defined by
p(x; θ) := \exp\left(\langle φ(x), θ \rangle − g(θ)\right).    (2.40)


Here φ(x) is a map from x to the sufficient statistics φ(x). θ is commonly referred to as the natural parameter. It lives in the space dual to φ(x). Moreover, g(θ) is a normalization constant which ensures that p(x) is properly normalized. g is often referred to as the log-partition function. The name stems from physics where Z = e^{g(θ)} denotes the number of states of a physical ensemble. g can be computed as follows:

g(θ) = \log \int_X \exp\left(\langle φ(x), θ \rangle\right) dx.    (2.41)

Example 2.5 (Binary Model) Assume that X = {0, 1} and that φ(x) = x. In this case we have g(θ) = \log(e^0 + e^θ) = \log(1 + e^θ). It follows that p(x = 0; θ) = \frac{1}{1 + e^θ} and p(x = 1; θ) = \frac{e^θ}{1 + e^θ}. In other words, by choosing different values of θ one can recover different Bernoulli distributions.
One of the convenient properties of exponential families is that the logpartition function g can be used to generate moments of the distribution
itself simply by taking derivatives.
Theorem 2.14 (Log partition function) The function g(θ) is convex. Moreover, the distribution p(x; θ) satisfies

∂_θ g(θ) = \mathbf{E}_x[φ(x)] and ∂_θ^2 g(θ) = \mathrm{Var}_x[φ(x)].    (2.42)

Proof Note that ∂_θ^2 g(θ) = \mathrm{Var}_x[φ(x)] implies that g is convex, since the covariance matrix is positive semidefinite. To show (2.42) we expand

∂_θ g(θ) = \frac{\int_X φ(x) \exp\langle φ(x), θ \rangle \, dx}{\int_X \exp\langle φ(x), θ \rangle \, dx} = \int_X φ(x)\, p(x; θ)\, dx = \mathbf{E}_x[φ(x)].    (2.43)

Next we take the second derivative to obtain

∂_θ^2 g(θ) = \int_X φ(x)\left[φ(x) − ∂_θ g(θ)\right]^\top p(x; θ)\, dx    (2.44)
 = \mathbf{E}_x\left[φ(x) φ(x)^\top\right] − \mathbf{E}_x[φ(x)]\, \mathbf{E}_x[φ(x)]^\top    (2.45)

which proves the claim. For the first equality we used (2.43). For the second line we used the definition of the variance.
One may show that higher derivatives ∂_θ^n g(θ) generate higher order cumulants of φ(x) under p(x; θ). This is why g is often also referred to as the cumulant-generating function. Note that in general, computation of g(θ) is nontrivial since it involves solving a high-dimensional integral. For many


cases, in fact, the computation is NP hard, for instance when X is the domain of permutations [FJ95]. Throughout the book we will discuss a number
of approximation techniques which can be applied in such a case.
Let us briefly illustrate (2.43) using the binary model of Example 2.5. We have that ∂_θ g(θ) = \frac{e^θ}{1 + e^θ} and ∂_θ^2 g(θ) = \frac{e^θ}{(1 + e^θ)^2}. This is exactly what we would have obtained from direct computation of the mean p(x = 1; θ) and variance p(x = 1; θ) − p(x = 1; θ)^2 subject to the distribution p(x; θ).
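The identities (2.42) are easy to verify numerically for this binary model. The sketch below (our own names, an arbitrarily chosen θ and finite-difference step) compares derivatives of the log-partition function with the Bernoulli mean and variance.

import numpy as np

def g(theta):
    return np.log(1.0 + np.exp(theta))          # log-partition function of the binary model

theta = 0.7
p1 = np.exp(theta) / (1.0 + np.exp(theta))      # p(x = 1; theta)

eps = 1e-4
grad = (g(theta + eps) - g(theta - eps)) / (2 * eps)                # ~ E[x]
hess = (g(theta + eps) - 2 * g(theta) + g(theta - eps)) / eps ** 2  # ~ Var[x]

print(grad, p1)                  # both ~ e^theta / (1 + e^theta)
print(hess, p1 * (1.0 - p1))     # both ~ variance of a Bernoulli(p1)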

2.3.2 Examples
A large number of densities are members of the exponential family. Note,
however, that in statistics it is not common to express them in the dot
product formulation for historic reasons and for reasons of notational compactness. We discuss a number of common densities below and show why
they can be written in terms of an exponential family. A detailed description of the most commonly occurring types is given in a table.
Gaussian Let x, μ ∈ R^d and let Σ ∈ R^{d×d} where Σ ≻ 0, that is, Σ is a positive definite matrix. In this case the normal distribution can be expressed via

p(x) = (2π)^{−d/2} |Σ|^{−1/2} \exp\left(−\tfrac{1}{2}(x − μ)^\top Σ^{−1} (x − μ)\right)    (2.46)
 = \exp\left(x^\top Σ^{−1} μ + \mathrm{tr}\left(x x^\top \left(−\tfrac{1}{2} Σ^{−1}\right)\right) − c(μ, Σ)\right)

where c(μ, Σ) = \tfrac{1}{2} μ^\top Σ^{−1} μ + \tfrac{d}{2} \log 2π + \tfrac{1}{2} \log |Σ|. By combining the terms in x into φ(x) := (x, −\tfrac{1}{2} x x^\top) we obtain the sufficient statistics of x. The corresponding linear coefficients (Σ^{−1} μ, Σ^{−1}) constitute the natural parameter θ. All that remains to be done to express p(x) in terms of (2.40) is to rewrite g(θ) in terms of c(μ, Σ). The summary table below contains details.
Multinomial Another popular distribution is one over k discrete events. In this case X = {1, . . . , k} and we have in completely generic terms p(x) = π_x where π_x ≥ 0 and \sum_x π_x = 1. Now denote by e_x ∈ R^k the x-th unit vector of the canonical basis, that is \langle e_x, e_{x′} \rangle = δ_{x,x′}. In this case we may rewrite p(x) via

p(x) = π_x = \exp\left(\langle e_x, \log π \rangle\right)    (2.47)

where \log π = (\log π_1, . . . , \log π_k). In other words, we have succeeded in rewriting the distribution as a member of the exponential family


where φ(x) = e_x and where θ = \log π. Note that in this definition θ is restricted to a k − 1 dimensional manifold. If we relax those constraints we need to ensure that p(x) remains normalized. Details are given in the summary table.
Poisson This distribution is often used to model distributions over discrete
events. For instance, the number of raindrops which fall on a given
surface area in a given amount of time, the number of stars in a
given volume of space, or the number of Prussian soldiers killed by
horse-kicks in the Prussian cavalry all follow this distribution. It is
given by

p(x; λ) = \frac{e^{−λ} λ^x}{x!} = \frac{1}{x!} \exp\left(x \log λ − λ\right) where x ∈ N_0.    (2.48)

By defining φ(x) = x we obtain an exponential families model. Note that things are a bit less trivial here since \frac{1}{x!} is the nonuniform counting measure on N_0. The case of the uniform measure, which leads to the exponential distribution, is discussed in Problem 2.16.
The reason why many discrete processes follow the Poisson distribution is that it can be seen as the limit over the average of a large number of Bernoulli draws: denote by z ∈ {0, 1} a random variable with p(z = 1) = α. Moreover, denote by z_n the sum over n draws from this random variable. In this case z_n follows the binomial distribution with p(z_n = k) = \binom{n}{k} α^k (1 − α)^{n−k}. Now assume that we let n → ∞ such that the expected value of z_n remains constant. That is, we rescale α = \frac{λ}{n}. In this case we have

p(z_n = k) = \frac{n!}{(n − k)!\, k!} \frac{λ^k}{n^k} \left(1 − \frac{λ}{n}\right)^{n−k} = \frac{λ^k}{k!} \left(1 − \frac{λ}{n}\right)^{n−k} \frac{n!}{n^k (n − k)!}    (2.49)

For n → ∞ the second term converges to e^{−λ}. The third term converges to 1, since it is a product of only k terms, each of which converges to 1. Using the exponential families notation we may check that E[x] = λ and that moreover also Var[x] = λ.
Beta This is a distribution on the unit interval X = [0, 1] which is very versatile when it comes to modelling unimodal and bimodal distributions. It is given by

p(x) = x^{a−1} (1 − x)^{b−1} \frac{Γ(a + b)}{Γ(a)\, Γ(b)}.    (2.50)

Fig. 2.10. Left: Poisson distributions with λ ∈ {1, 3, 10}. Right: Beta distributions with a = 2 and b ∈ {1, 2, 3, 5, 7}. Note how with increasing b the distribution becomes more peaked close to the origin.

Taking logarithms we see that this, too, is an exponential families distribution, since p(x) = \exp\left((a − 1) \log x + (b − 1) \log(1 − x) + \log Γ(a + b) − \log Γ(a) − \log Γ(b)\right).
Figure 2.10 has a graphical description of the Poisson distribution and the Beta distribution. For a more comprehensive list of exponential family distributions see the table below and [Fel71, FT94, MN83]. In principle any map φ(x), domain X with underlying measure are suitable, as long as the log-partition function g(θ) can be computed efficiently.

Table: common members of the exponential family (Bernoulli, Multinomial, Exponential, Poisson, Laplace, Gaussian, Inverse Normal, Beta, Gamma, Wishart, Dirichlet, Inverse χ², Logarithmic), listing for each the domain X, the underlying measure, the sufficient statistic φ(x), the domain of the natural parameter θ, and the log-partition function g(θ). S_n denotes the probability simplex in n dimensions; C_n is the cone of positive semidefinite matrices in R^{n×n}.
Theorem 2.15 (Convex feasible domain) The domain of definition Θ of g(θ) is convex.
Proof By construction g is convex and differentiable everywhere. Hence the below-sets {θ | g(θ) ≤ c} exist for all values c. Consequently the domain of definition is convex.
Having a convex function is very valuable when it comes to parameter inference since convex minimization problems have unique minimum values and
global minima. We will discuss this notion in more detail when designing
maximum likelihood estimators.
2.4 Estimation
In many statistical problems the challenge is to estimate parameters of interest. For instance, in the context of exponential families, we may want to compute an estimate θ̂ which is close to the true parameter θ of the distribution. While the problem is fully general, we will describe the


relevant steps in obtaining estimates for the special case of the exponential
family. This is done for two reasons: firstly, exponential families are an important special case and we will encounter slightly more complex variants of this reasoning in later chapters of the book. Secondly, they are of a sufficiently simple form that we are able to show a range of different techniques.
In more advanced applications only a small subset of those methods may be
practically feasible. Hence exponential families provide us with a working
example based on which we can compare the consequences of a number of
different techniques.

2.4.1 Maximum Likelihood Estimation


Whenever we have a distribution p(x; θ) parametrized by some parameter θ we may use data to find a value of θ which maximizes the likelihood that the data would have been generated by a distribution with this choice of parameter.
For instance, assume that we observe a set of temperature measurements
X = {x1 , . . . , xm }. In this case, we could try finding a normal distribution
such that the likelihood p(X; ) of the data under the assumption of a normal
distribution is maximized. Note that this does not imply in any way that the
temperature measurements are actually drawn from a normal distribution.
Instead, it means that we are attempting to find the Gaussian which fits the
data in the best fashion.
While this distinction may appear subtle, it is critical: we do not assume
that our model accurately reflects reality. Instead, we simply try doing the
best possible job at modeling the data given a specified model class. Later
we will encounter alternative approaches to estimation, namely Bayesian
methods, which make the assumption that our model ought to be able to
describe the data accurately.
Definition 2.16 (Maximum Likelihood Estimator) For a model p(·; θ) parametrized by θ and observations X the maximum likelihood estimator (MLE) is

θ̂_{ML}[X] := \operatorname{argmax}_θ \, p(X; θ).    (2.51)

In the context of exponential families this leads to the following procedure:


given m observations drawn iid from some distribution, we can express the


joint likelihood as

p(X; θ) = \prod_{i=1}^m p(x_i; θ) = \exp\left(\sum_{i=1}^m \langle φ(x_i), θ \rangle − m g(θ)\right)    (2.52)
 = \exp\left(m \left(\langle μ[X], θ \rangle − g(θ)\right)\right)    (2.53)
where μ[X] := \frac{1}{m} \sum_{i=1}^m φ(x_i).    (2.54)

Here μ[X] is the empirical average of the map φ(x). Maximization of p(X; θ) is equivalent to minimizing the negative log-likelihood −\log p(X; θ). The latter is a common practical choice since for independently drawn data, the product of probabilities decomposes into the sum of the logarithms of individual likelihoods. This leads to the following objective function to be minimized

−\log p(X; θ) = m \left[g(θ) − \langle θ, μ[X] \rangle\right]    (2.55)

Since g(θ) is convex and \langle θ, μ[X] \rangle is linear in θ, it follows that minimization of (2.55) is a convex optimization problem. Using Theorem 2.14 and the first order optimality condition ∂_θ g(θ) = μ[X] for (2.55) implies that

θ̂ = [∂_θ g]^{−1}(μ[X]) or equivalently \mathbf{E}_{x∼p(x;θ̂)}[φ(x)] = ∂_θ g(θ̂) = μ[X].    (2.56)

Put another way, the above conditions state that we aim to find the distribution p(x; θ) which has the same expected value of φ(x) as what we observed empirically via μ[X]. Under very mild technical conditions a solution to (2.56) exists.
In general, (2.56) cannot be solved analytically. In certain special cases,
though, this is easily possible. We discuss two such choices in the following:
Multinomial and Poisson distributions.
Example 2.6 (Poisson Distribution) For the Poisson distribution¹ where p(x; θ) = \frac{1}{x!} \exp(θ x − e^θ) it follows that g(θ) = e^θ and φ(x) = x. This allows us to solve (2.56) in closed form using

∂_θ g(θ) = e^θ = \frac{1}{m} \sum_{i=1}^m x_i and hence θ = \log \sum_{i=1}^m x_i − \log m.    (2.57)

¹ Often the Poisson distribution is specified via its rate parameter λ, where θ = \log λ. In this case we have p(x; λ) = λ^x e^{−λ}/x! as its parametrization. The advantage of the natural parametrization using θ is that we can directly take advantage of the properties of the log-partition function as generating the cumulants of x.
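A quick numerical check of (2.57), as a sketch with an arbitrarily chosen rate: draw Poisson samples, compute the empirical mean of φ(x) = x, and recover the natural parameter.

import numpy as np

rng = np.random.default_rng(42)
lam = 3.5                                        # rate parameter, so theta = log(lam)
x = rng.poisson(lam, size=100_000)

mu = x.mean()                                    # empirical average of phi(x) = x
theta_hat = np.log(x.sum()) - np.log(len(x))     # MLE from (2.57)

print(theta_hat, np.log(lam))                    # the two should be close
print(np.exp(theta_hat), mu)                     # e^theta equals the empirical mean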


Example 2.7 (Multinomial Distribution) For the multinomial distribution the log-partition function is given by g(θ) = \log \sum_{i=1}^N e^{θ_i}, hence we have that

∂_i g(θ) = \frac{e^{θ_i}}{\sum_{j=1}^N e^{θ_j}} = \frac{1}{m} \sum_{j=1}^m \{x_j = i\}.    (2.58)

It is easy to check that (2.58) is satisfied for e^{θ_i} = \frac{1}{m} \sum_{j=1}^m \{x_j = i\}. In other words, the MLE for a discrete distribution is simply given by the empirical frequencies of occurrence.

The multinomial setting also exhibits two rather important aspects of exponential families: firstly, choosing θ_i = c + \log \sum_{j=1}^m \{x_j = i\} for any c ∈ R will lead to an equivalent distribution. This is the case since the sufficient statistic φ(x) is not minimal. In our context this means that the coordinates of φ(x) are linearly dependent: for any x we have that \sum_j [φ(x)]_j = 1, hence we could eliminate one dimension. This is precisely the additional degree of freedom which is reflected in the scaling freedom in θ.
Secondly, for data where some events do not occur at all, the expression \log \sum_{j=1}^m \{x_j = i\} = \log 0 is ill defined. This is due to the fact that this particular set of counts occurs on the boundary of the convex set within which the natural parameters are well defined. We will see how different types of priors can alleviate the issue.
Using the MLE is not without problems. As we saw in Figure 2.1, convergence can be slow, since we are not using any side information. The latter
can provide us with problems which are both numerically better conditioned
and which show better convergence, provided that our assumptions are accurate. Before discussing a Bayesian approach to estimation, let us discuss
basic statistical properties of the estimator.

2.4.2 Bias, Variance and Consistency

When designing any estimator θ̂(X) we would like to obtain a number of
desirable properties: in general it should not be biased towards a particular
solution unless we have good reason to believe that this solution should
be preferred. Instead, we would like the estimator to recover, at least on
average, the correct parameter, should it exist. This can be formalized in
the notion of an unbiased estimator.
Secondly, we would like that, even if no correct parameter can be found,
e.g. when we are trying to fit a Gaussian distribution to data which is not


normally distributed, that we will converge to the best possible parameter


choice as we obtain more data. This is what is understood by consistency.
Finally, we would like the estimator to achieve low bias and near-optimal
estimates as quickly as possible. The latter is measured by the efficiency
of an estimator. In this context we will encounter the Cramer-Rao bound
which controls the best possible rate at which an estimator can achieve this
goal. Figure 2.11 gives a pictorial description.

Fig. 2.11. Left: unbiased estimator; the estimates, denoted by circles have as mean
the true parameter, as denoted by a star. Middle: consistent estimator. While the
true model is not within the class we consider (as denoted by the ellipsoid), the
estimates converge to the white star which is the best model within the class that
approximates the true model, denoted by the solid star. Right: different estimators
have different regions of uncertainty, as made explicit by the ellipses around the
true parameter (solid star).

Definition 2.17 (Unbiased Estimator) An estimator θ̂[X] is unbiased if for all θ where X ∼ p(X; θ) we have \mathbf{E}_X[θ̂[X]] = θ.


In other words, in expectation the parameter estimate matches the true parameter. Note that this only makes sense if a true parameter actually exists.
For instance, if the data is Poisson distributed and we attempt modeling it
by a Gaussian we will obviously not obtain unbiased estimates.
For finite sample sizes MLE is often biased. For instance, for the normal
distribution the variance estimates carry bias O(m^{−1}). See Problem 2.19
for details. In general, under fairly mild conditions, MLE is asymptotically
unbiased [DGL96]. We prove this for exponential families. For more general
settings the proof depends on the dimensionality and smoothness of the
family of densities that we have at our disposition.
Theorem 2.18 (MLE for Exponential Families) Assume that X is an m-sample drawn iid from p(x; θ). The estimate θ̂[X] = [∂_θ g]^{−1}(μ[X]) is asymptotically normal with

m^{1/2} \left[θ̂[X] − θ\right] → \mathcal{N}\left(0, \left[∂_θ^2 g(θ)\right]^{−1}\right).    (2.59)

In other words, the estimate θ̂[X] is asymptotically normal, it converges to the true parameter θ, and moreover, the variance at the correct parameter is given by the inverse of the covariance matrix of the data, as given by the second derivative of the log-partition function ∂_θ^2 g(θ).

Proof Denote by μ = ∂_θ g(θ) the true mean. Moreover, note that ∂_θ^2 g(θ) is the covariance of the data drawn from p(x; θ). By the central limit theorem (Theorem 2.3) we have that m^{1/2} [μ[X] − μ] → \mathcal{N}(0, ∂_θ^2 g(θ)). Now note that θ̂[X] = [∂_θ g]^{−1}(μ[X]). Therefore, by the delta method (Theorem 2.5) we know that θ̂[X] is also asymptotically normal. Moreover, by the inverse function theorem the Jacobian of g^{−1} satisfies ∂_μ [∂_θ g]^{−1}(μ) = \left[∂_θ^2 g(θ)\right]^{−1}. Applying Slutsky's theorem (Theorem 2.4) proves the claim.
Now that we established the asymptotic properties of the MLE for exponential families it is only natural to ask how much variation one may expect in θ̂[X] when performing estimation. The Cramer-Rao bound governs this.
Theorem 2.19 (Cramér and Rao [Rao73]) Assume that X is drawn from p(X; θ) and let θ̂[X] be an asymptotically unbiased estimator. Denote by I the Fisher information matrix and by B the variance of θ̂[X] where

I := \mathrm{Var}\left[∂_θ \log p(X; θ)\right] and B := \mathrm{Var}\left[θ̂[X]\right].    (2.60)

In this case \det IB ≥ 1 for all estimators θ̂[X].


Proof We prove the claim for the scalar case. The extension to matrices is straightforward. Using the Cauchy-Schwarz inequality we have

\mathrm{Cov}^2\left(∂_θ \log p(X; θ), θ̂[X]\right) ≤ \mathrm{Var}\left[∂_θ \log p(X; θ)\right] \mathrm{Var}\left[θ̂[X]\right] = IB.    (2.61)

Note that at the true parameter the expected log-likelihood score vanishes

\mathbf{E}_X\left[∂_θ \log p(X; θ)\right] = ∂_θ \int p(X; θ)\, dX = ∂_θ 1 = 0.    (2.62)

Hence we may simplify the covariance formula by dropping the means via

\mathrm{Cov}\left(∂_θ \log p(X; θ), θ̂[X]\right) = \mathbf{E}_X\left[∂_θ \log p(X; θ) \cdot θ̂[X]\right]
 = \int p(X; θ)\, θ̂(X)\, ∂_θ \log p(X; θ)\, dX
 = ∂_θ \int p(X; θ)\, θ̂(X)\, dX = ∂_θ θ = 1.


Here the last equality follows since we may interchange integration by X and the derivative with respect to θ.
The Cramer-Rao theorem implies that there is a limit to how well we may estimate a parameter given finite amounts of data. It is also a yardstick by which we may measure how efficiently an estimator uses data. Formally, we define the efficiency as the quotient between actual performance and the Cramer-Rao bound via

e := 1/\det IB.    (2.63)

The closer e is to 1, the lower the variance of the corresponding estimator θ̂(X). Theorem 2.18 implies that for exponential families MLE is asymptotically efficient. This turns out to be true more generally.
Theorem 2.20 (Efficiency of MLE [Cra46, GW92, Ber85]) The maximum likelihood estimator is asymptotically efficient (e = 1).

So far we only discussed the behavior of θ̂[X] whenever there exists a true θ generating p(X; θ). If this is not true, we need to settle for less: how well θ̂[X] approaches the best possible choice of θ within the given model class. Such behavior is referred to as consistency. Note that it is not possible to define consistency per se. For instance, we may ask whether θ̂ converges to the optimal parameter θ*, or whether p(x; θ̂) converges to the optimal density p(x; θ*), and with respect to which norm. Under fairly general conditions this turns out to be true for finite-dimensional parameters and smoothly parametrized densities. See [DGL96, vdG00] for proofs and further details.

2.4.3 A Bayesian Approach


The analysis of the Maximum Likelihood method might suggest that inference is a solved problem. After all, in the limit, MLE is unbiased and it exhibits as small a variance as possible. Empirical results using a finite amount of data, as presented in Figure 2.1, indicate otherwise.
While not making any assumptions can lead to interesting and general
theorems, it ignores the fact that in practice we almost always have some
idea about what to expect of our solution. It would be foolish to ignore such
additional information. For instance, when trying to determine the voltage
of a battery, it is reasonable to expect a measurement in the order of 1.5V
or less. Consequently such prior knowledge should be incorporated into the
estimation process. In fact, the use of side information to guide estimation


turns out to be the tool to building estimators which work well in high
dimensions.
Recall Bayes rule (1.15) which states that p(θ|x) = \frac{p(x|θ)\, p(θ)}{p(x)}. In our context this means that if we are interested in the posterior probability of θ assuming a particular value, we may obtain this using the likelihood (often referred to as evidence) of x having been generated by θ via p(x|θ) and our prior belief p(θ) that θ might be chosen in the distribution generating x.
Observe the subtle but important difference to MLE: instead of treating θ as a parameter of a density model, we treat θ as an unobserved random variable which we may attempt to infer given the observations X.
This can be done for a number of different purposes: we might want to infer the most likely value of the parameter given the posterior distribution p(θ|X). This is achieved by

θ̂_{MAP}(X) := \operatorname{argmax}_θ \, p(θ|X) = \operatorname{argmin}_θ \, \left[−\log p(X|θ) − \log p(θ)\right].    (2.64)

The second equality follows since p(X) does not depend on θ. This estimator is also referred to as the Maximum a Posteriori, or MAP estimator. It differs from the maximum likelihood estimator by adding the negative log-prior to the optimization problem. For this reason it is sometimes also referred to as Penalized MLE. Effectively we are penalizing unlikely choices of θ via −\log p(θ).
Note that using θ̂_{MAP}(X) as the parameter of choice is not quite accurate. After all, we can only infer a distribution over θ and in general there is no guarantee that the posterior is indeed concentrated around its mode. A more accurate treatment is to use the distribution p(θ|X) directly via

p(x|X) = \int p(x|θ)\, p(θ|X)\, dθ.    (2.65)

In other words, we integrate out the unknown parameter and obtain the
density estimate directly. As we will see, it is generally impossible to solve
(2.65) exactly, an important exception being conjugate priors. In the other
cases one may resort to sampling from the posterior distribution to approximate the integral.
While it is possible to design a wide variety of prior distributions, this book
focuses on two important families: norm-constrained priors and conjugate
priors. We will encounter them throughout, the former sometimes in the
guise of regularization and Gaussian Processes, the latter in the context of
exchangeable models such as the Dirichlet Process.


Norm-constrained priors take on the form

p(θ) ∝ \exp\left(−λ \|θ − θ_0\|_p^d\right) for p, d ≥ 1 and λ > 0.    (2.66)

That is, they restrict the deviation of the parameter value θ from some guess θ_0. The intuition is that extreme values of θ are much less likely than more moderate choices of θ which will lead to more smooth and even distributions p(x|θ).
A popular choice is the Gaussian prior which we obtain for p = d = 2 and λ = 1/2σ². Typically one sets θ_0 = 0 in this case. Note that in (2.66) we did not spell out the normalization of p(θ); in the context of MAP estimation this is not needed since it simply becomes a constant offset in the optimization problem (2.64). We have

θ̂_{MAP}[X] = \operatorname{argmin}_θ \, m \left[g(θ) − \langle θ, μ[X] \rangle\right] + λ \|θ − θ_0\|_p^d    (2.67)

For d, p ≥ 1 and λ ≥ 0 the resulting optimization problem is convex and it has a unique solution. Moreover, very efficient algorithms exist to solve this problem. We will discuss this in detail in Chapter 5. Figure 2.12 shows the regions of equal prior probability for a range of different norm-constrained priors.
As can be seen from the diagram, the choice of the norm can have profound consequences on the solution. That said, as we will show in Chapter ??, the estimate θ̂_{MAP} is well concentrated and converges to the optimal solution under fairly general conditions.

Fig. 2.12. From left to right: regions of equal prior probability in R² for priors using the ℓ₁, ℓ₂ and ℓ∞ norm. Note that only the ℓ₂ norm is invariant with regard to the coordinate system. As we shall see later, the ℓ₁ norm prior leads to solutions where only a small number of coordinates is nonzero.

An alternative to norm-constrained priors are conjugate priors. They are designed such that the posterior p(θ|X) has the same functional form as the


prior p(θ). In exponential families such priors are defined via

p(θ|n, ν) = \exp\left(\langle n ν, θ \rangle − n g(θ) − h(ν, n)\right) where    (2.68)
h(ν, n) = \log \int \exp\left(\langle n ν, θ \rangle − n g(θ)\right) dθ.    (2.69)

Note that p(θ|n, ν) itself is a member of the exponential family with the feature map φ(θ) = (θ, −g(θ)). Hence h(ν, n) is convex in (nν, n). Moreover, the posterior distribution has the form

p(θ|X) ∝ p(X|θ)\, p(θ|n, ν) ∝ \exp\left(\langle m μ[X] + n ν, θ \rangle − (m + n) g(θ)\right).    (2.70)

That is, the posterior distribution has the same form as a conjugate prior with parameters \frac{m μ[X] + n ν}{m + n} and m + n. In other words, n acts like a phantom sample size and ν is the corresponding mean parameter. Such an interpretation is reasonable given our desire to design a prior which, when combined with the likelihood, remains in the same model class: we treat prior knowledge as having observed virtual data beforehand which is then added to the actual set of observations. In this sense data and prior become completely equivalent: we obtain our knowledge either from actual observations or from virtual observations which describe our belief into how the data generation process is supposed to behave.
Eq. (2.70) has the added benefit of allowing us to provide an exact normalized version of the posterior. Using (2.68) we obtain that

p(θ|X) = \exp\left(\langle m μ[X] + n ν, θ \rangle − (m + n) g(θ) − h\!\left(\tfrac{m μ[X] + n ν}{m + n}, m + n\right)\right).

The main remaining challenge is to compute the normalization h for a range of important conjugate distributions. The summary table provides details. Besides attractive algebraic properties, conjugate priors also have a second advantage: the integral (2.65) can be solved exactly:

p(x|X) = \int \exp\left(\langle φ(x), θ \rangle − g(θ)\right) \exp\left(\langle m μ[X] + n ν, θ \rangle − (m + n) g(θ) − h\!\left(\tfrac{m μ[X] + n ν}{m + n}, m + n\right)\right) dθ.

Combining terms one may check that the integrand amounts to the normalization in the conjugate distribution, albeit with φ(x) added. This yields

p(x|X) = \exp\left(h\!\left(\tfrac{m μ[X] + n ν + φ(x)}{m + n + 1}, m + n + 1\right) − h\!\left(\tfrac{m μ[X] + n ν}{m + n}, m + n\right)\right).

Such an expansion is very useful whenever we would like to draw x from p(x|X) without the need to obtain an instantiation of the latent variable θ. We provide explicit expansions in appendix 2. [GS04] use the fact that θ can be integrated out to obtain what is called a collapsed Gibbs sampler for topic models [BNJ03].

2.4.4 An Example
Assume we would like to build a language model based on available documents. For instance, a linguist might be interested in estimating the frequency of words in Shakespeare's collected works, or one might want to compare the change with respect to a collection of webpages. While models describing documents by treating them as bags of words which all have been obtained independently of each other are exceedingly simple, they are valuable for quick-and-dirty content filtering and categorization, e.g. a spam filter on a mail server or a content filter for webpages.
Hence we model a document d as a multinomial distribution: denote by w_i for i ∈ {1, . . . , m_d} the words in d. Moreover, denote by p(w|θ) the probability of occurrence of word w, then under the assumption that the words are independently drawn, we have

p(d|θ) = \prod_{i=1}^{m_d} p(w_i|θ).    (2.71)

It is our goal to find parameters θ such that p(d|θ) is accurate. For a given collection D of documents denote by m_w the number of counts for word w in the entire collection. Moreover, denote by m the total number of words in the entire collection. In this case we have

p(D|θ) = \prod_i p(d_i|θ) = \prod_w p(w|θ)^{m_w}.    (2.72)

Finding suitable parameters θ given D proceeds as follows: In a maximum likelihood model we set

p(w|θ) = \frac{m_w}{m}.    (2.73)

In other words, we use the empirical frequency of occurrence as our best guess and the sufficient statistic of D is φ(w) = e_w, where e_w denotes the unit vector which is nonzero only for the coordinate w. Hence μ[D]_w = \frac{m_w}{m}.
We know that the conjugate prior of the multinomial model is a Dirichlet model. It follows from (2.70) that the posterior mode is obtained by replacing μ[D] by \frac{m μ[D] + n ν}{m + n}. Denote by n_w := ν_w n the pseudo-counts arising from the conjugate prior with parameters (ν, n). In this case we will estimate the


probability of the word w as

p(w|θ) = \frac{m μ[D]_w + n ν_w}{m + n} = \frac{m_w + n_w}{m + n}.    (2.74)

In other words, we add the pseudo counts n_w to the actual word counts m_w. This is particularly useful when the document we are dealing with is brief, that is, whenever we have little data: it is quite unreasonable to infer from a webpage of approximately 1000 words that words not occurring in this page have zero probability. This is exactly what is mitigated by means of the conjugate prior (ν, n).
Finally, let us consider norm-constrained priors of the form (2.66). In this case, the integral required for

p(D) = \int p(D|θ)\, p(θ)\, dθ ∝ \int \exp\left(−λ \|θ − θ_0\|_p^d + m \langle μ[D], θ \rangle − m g(θ)\right) dθ

is intractable and we need to resort to an approximation. A popular choice


is to replace the integral by p(D|θ*) where θ* maximizes the integrand. This is precisely the MAP approximation of (2.64). Hence, in order to perform estimation we need to solve

\operatorname{minimize}_θ \; g(θ) − \langle μ[D], θ \rangle + \frac{λ}{m} \|θ − θ_0\|_p^d.    (2.75)

A very simple strategy for minimizing (2.75) is gradient descent. That is, for a given value of θ we compute the gradient of the objective function and take a fixed step towards its minimum. For simplicity assume that d = p = 2 and λ = 1/2σ², that is, we assume that θ is normally distributed with variance σ² and mean θ_0. The gradient is given by

∂_θ \left[−\log p(D, θ)\right] = \mathbf{E}_{x∼p(x|θ)}[φ(x)] − μ[D] + \frac{1}{m σ²}\left[θ − θ_0\right]    (2.76)

In other words, it depends on the discrepancy between the mean of φ(x) with respect to our current model and the empirical average μ[D], and the difference between θ and the prior mean θ_0.
Unfortunately, convergence of this gradient descent procedure is usually very slow, even if we adjust the step length efficiently. The reason is that the gradient need not point towards the minimum as the space is most likely distorted. A better strategy is to use Newton's method (see Chapter 5 for a detailed discussion and a convergence proof). It relies on a second order


Taylor approximation

−\log p(D, θ + δ) ≈ −\log p(D, θ) + \langle δ, G \rangle + \frac{1}{2} δ^\top H δ    (2.77)

where G and H are the first and second derivatives of −\log p(D, θ) with respect to θ. The quadratic expression can be minimized with respect to δ by choosing δ = −H^{−1} G and we can fashion an update algorithm from this by letting θ ← θ − H^{−1} G. One may show (see Chapter 5) that Algorithm 2.1 is quadratically convergent. Note that the prior on θ ensures that H is well conditioned even in the case where the variance of φ(x) is not. In practice this means that the prior ensures fast convergence of the optimization algorithm.
Algorithm 2.1 Newton method for MAP estimation
NewtonMAP(D)
  Initialize θ = θ_0
  while not converged do
    Compute G = \mathbf{E}_{x∼p(x|θ)}[φ(x)] − μ[D] + \frac{1}{m σ²}[θ − θ_0]
    Compute H = \mathrm{Var}_{x∼p(x|θ)}[φ(x)] + \frac{1}{m σ²} \mathbf{1}
    Update θ ← θ − H^{−1} G
  end while
  return θ
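As a concrete one-dimensional instance of Algorithm 2.1, consider the Poisson model of Example 2.6, where g(θ) = e^θ and hence E[φ(x)] = Var[φ(x)] = e^θ, so G and H are available in closed form. The sketch below is only an illustration under these assumptions; the prior mean θ₀, variance σ², and data are chosen arbitrarily.

import numpy as np

def newton_map_poisson(x, theta0=0.0, sigma2=1.0, tol=1e-10, max_iter=50):
    """Newton iterations of Algorithm 2.1 for a Poisson model with Gaussian prior."""
    m, mu = len(x), np.mean(x)           # mu[D], the empirical mean of phi(x) = x
    theta = theta0
    for _ in range(max_iter):
        G = np.exp(theta) - mu + (theta - theta0) / (m * sigma2)   # gradient (2.76)
        H = np.exp(theta) + 1.0 / (m * sigma2)                     # Hessian
        step = G / H
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(0)
x = rng.poisson(3.5, size=200)
print(newton_map_poisson(x), np.log(np.mean(x)))   # MAP estimate vs. plain MLE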

2.5 Sampling
So far we considered the problem of estimating the underlying probability
density, given a set of samples drawn from that density. Now let us turn to
the converse problem, that is, how to generate random variables given the
underlying probability density. In other words, we want to design a random
variable generator. This is useful for a number of reasons:
We may encounter probability distributions where optimization over suitable model parameters is essentially impossible and where it is equally impossible to obtain a closed form expression of the distribution. In these cases
it may still be possible to perform sampling to draw examples of the kind of data we expect to see from the model. Chapter 3 discusses a number of
graphical models where this problem arises.
Secondly, assume that we are interested in testing the performance of a
network router under different load conditions. Instead of introducing the
under-development router in a live network and wreaking havoc, one could
estimate the probability density of the network traffic under various load
conditions and build a model. The behavior of the network can then be


simulated by using a probabilistic model. This involves drawing random


variables from an estimated probability distribution.
Carrying on, suppose that we generate data packets by sampling and see an anomalous behavior in our router. In order to reproduce and debug this problem one needs access to the same set of random packets which caused the problem in the first place. In other words, it is often convenient if our random variable generator is reproducible. At first blush this seems like a contradiction. After all, our random number generator is supposed to generate random variables. This is less of a contradiction if we consider how random numbers are generated in a computer: given a particular initialization (which typically depends on the state of the system, e.g. time, disk size, bios checksum, etc.) the random number algorithm produces a sequence of numbers which, for all practical purposes, can be treated as iid.
A simple method is the linear congruential generator [PTVF94]

x_{i+1} = (a x_i + b) \mod c.

The performance of these iterations depends significantly on the choice of the constants a, b, c. For instance, the GNU C compiler uses a = 1103515245, b = 12345 and c = 2^{32}. In general b and c need to be relatively prime and a − 1 needs to be divisible by all prime factors of c and by 4. It is very much advisable not to attempt implementing such generators on one's own unless it is absolutely necessary.
Useful desiderata for a pseudo random number generator (PRNG) are that
for practical purposes it is statistically indistinguishable from a sequence of
iid data. That is, when applying a number of statistical tests, we will accept
the null-hypothesis that the random variables are iid. See Chapter 10 for
a detailed discussion of statistical testing procedures for random variables.
In the following we assume that we have access to a uniform RNG U [0, 1]
which draws random numbers uniformly from the range [0, 1].

2.5.1 Inverse Transformation


In the following we assume that we have access to (pseudo)random variables
drawn from the uniform distribution U [0, 1]. However, we would like to draw
from some distinctively non-uniform distribution. Whenever the latter is
relatively simple this can be achieved by applying an inverse transform:
Theorem 2.21 For x ∼ p(x) with x ∈ X and an injective transformation φ : X → Z with inverse transform φ^{−1} on φ(X) it follows that the random variable z := φ(x) is drawn from \left|∂_z φ^{−1}(z)\right| \cdot p(φ^{−1}(z)).


Fig. 2.13. Left: discrete probability distribution over 5 possible outcomes. Right:
associated cumulative distribution function. When sampling, we draw x uniformly
at random from U [0, 1] and compute the inverse of F .

This follows immediately by applying a variable transformation for a measure, i.e. we change dp(x) to dp(φ^{−1}(z)) \left|∂_z φ^{−1}(z)\right|. Such a conversion strategy is particularly useful for univariate distributions.

Corollary 2.22 Denote by p(x) a distribution on R with cumulative distribution function F(x′) = \int_{−∞}^{x′} dp(x). Then the transformation x = φ(z) = F^{−1}(z) converts samples z drawn from U[0, 1] into samples drawn from p(x).

We now apply this strategy to a number of univariate distributions. One of the most common cases is sampling from a discrete distribution.

Example 2.8 (Discrete Distribution) In the case of a discrete distribution over {1, . . . , k} the cumulative distribution function is a step-function with steps at {1, . . . , k} where the height of each step is given by the corresponding probability of the event.
The implementation works as follows: denote by p ∈ [0, 1]^k the vector of probabilities and denote by f ∈ [0, 1]^k with f_i = f_{i−1} + p_i and f_1 = p_1 the steps of the cumulative distribution function. Then for a random variable z drawn from U[0, 1] we obtain x(z) := \min\{i \mid f_i ≥ z\}. See Figure 2.13 for an example of a distribution over 5 events.
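The step-function inversion translates into a cumulative sum and a search for the first step exceeding z; a minimal sketch with an arbitrary five-event distribution:

import numpy as np

def sample_discrete(p, size, rng):
    """Inverse-transform sampling from a discrete distribution p over {0, ..., k-1}."""
    f = np.cumsum(p)             # steps f_i of the cumulative distribution function
    f[-1] = 1.0                  # guard against floating point round-off
    z = rng.random(size)         # z ~ U[0, 1)
    return np.searchsorted(f, z) # smallest i with f_i >= z

rng = np.random.default_rng(0)
p = np.array([0.1, 0.4, 0.2, 0.25, 0.05])     # a distribution over 5 events
samples = sample_discrete(p, 100_000, rng)
print(np.bincount(samples) / len(samples))    # empirical frequencies approach p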
Example 2.9 (Laplace Distribution) The density of a Laplace-distributed random variable is given by

p(x|λ) = λ \exp(−λ x) if λ > 0 and x ≥ 0.    (2.78)


Fig. 2.14. Left: Laplace distribution with λ = 1. Right: associated cumulative distribution function. When sampling, we draw x uniformly at random from U[0, 1] and compute the inverse.

This allows us to compute its cdf as

F(x|λ) = 1 − \exp(−λ x) if λ > 0 and x ≥ 0.    (2.79)

Therefore to generate a Laplace distributed random variable we draw z ∼ U[0, 1] and solve x = F^{−1}(z|λ) = −\frac{1}{λ} \log(1 − z). Since z and 1 − z are both drawn from U[0, 1] we can simplify this to x = −\frac{1}{λ} \log z.
We could apply the same reasoning to the normal distribution in order to draw Gaussian random variables. Unfortunately, the cumulative distribution function of the Gaussian is not available in closed form and we would need to resort to rather nontrivial numerical techniques. It turns out that there exists a much more elegant algorithm which has its roots in Gauss' proof of the normalization constant of the Normal distribution. This technique is known as the Box-Müller transform.
Example 2.10 (Box-Müller Transform) Denote by X, Y independent Gaussian random variables with zero mean and unit variance. We have

p(x, y) = \frac{1}{\sqrt{2π}} e^{−\frac{1}{2} x^2} \cdot \frac{1}{\sqrt{2π}} e^{−\frac{1}{2} y^2} = \frac{1}{2π} e^{−\frac{1}{2}(x^2 + y^2)}    (2.80)

The key observation is that the joint distribution p(x, y) is radially symmetric, i.e. it only depends on the radius r² = x² + y². Hence we may perform a variable substitution in polar coordinates via the map φ where

x = r \cos θ and y = r \sin θ, hence (x, y) = φ(r, θ).    (2.81)


Fig. 2.15. The true density of the standard normal distribution (red line) is contrasted with the histogram of 20,000 random variables generated by the Box-Müller transform.

This allows us to express the density in terms of (r, θ) via

p(r, θ) = p(φ(r, θ)) \left|\det ∂_{r,θ} φ(r, θ)\right| = \frac{1}{2π} e^{−\frac{1}{2} r^2} \left|\det \begin{pmatrix} \cos θ & −r \sin θ \\ \sin θ & r \cos θ \end{pmatrix}\right| = \frac{r}{2π} e^{−\frac{1}{2} r^2}.
The fact that p(r, θ) is constant in θ means that we can easily sample θ ∈ [0, 2π] by drawing a random variable, say z_θ, from U[0, 1] and rescaling it with 2π. To obtain a sampler for r we need to compute the cumulative distribution function for p(r) = r e^{−\frac{1}{2} r^2}. It is given by

F(r′) = \int_0^{r′} r e^{−\frac{1}{2} r^2} dr = 1 − e^{−\frac{1}{2} r′^2} and hence r = F^{−1}(z) = \sqrt{−2 \log(1 − z)}.    (2.82)

Since z and 1 − z are both drawn from U[0, 1] we may equivalently use r = \sqrt{−2 \log z}. This yields the following sampler: draw z_θ, z_r ∼ U[0, 1] and compute x and y by

x = \sqrt{−2 \log z_r} \cos 2π z_θ and y = \sqrt{−2 \log z_r} \sin 2π z_θ.

Note that the Box-Müller transform yields two independent Gaussian random variables. See Figure 2.15 for an example of the sampler.
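A minimal Python sketch of the Box-Müller sampler described above (names are illustrative):

import numpy as np

def box_muller(size, rng=np.random.default_rng(0)):
    z_theta = rng.uniform(size=size)        # angle variable
    z_r = rng.uniform(size=size)            # radius variable
    r = np.sqrt(-2.0 * np.log(z_r))         # r has density r * exp(-r^2 / 2)
    x = r * np.cos(2 * np.pi * z_theta)
    y = r * np.sin(2 * np.pi * z_theta)
    return x, y                             # two independent N(0, 1) samples per draw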
Example 2.11 (Uniform distribution on the disc) A similar strategy can be employed when sampling from the unit disc. In this case the closed-form expression of the distribution is simply given by

p(x, y) = 1/π if x² + y² ≤ 1 and 0 otherwise.    (2.83)


Fig. 2.16. Rejection sampler. Left: samples drawn from the uniform distribution on
[0, 1]2 . Middle: the samples drawn from the uniform distribution on the unit disc
are all the points in the grey shaded area. Right: the same procedure allows us to
sample uniformly from arbitrary sets.

Using the variable transform of (2.81) yields

p(r, θ) = p(ψ(r, θ)) |det ∇_{r,θ} ψ(r, θ)| = r/π if r ≤ 1 and 0 otherwise.    (2.84)

Solving the integral over θ yields p(r) = 2r for r ∈ [0, 1] with corresponding CDF F(r) = r² for r ∈ [0, 1]. Hence our sampler draws z_r, z_θ ∼ U[0, 1] and then computes x = √z_r cos 2πz_θ and y = √z_r sin 2πz_θ.
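The corresponding sampler for the unit disc can be sketched in a few lines (again only an illustration):

import numpy as np

def sample_disc(size, rng=np.random.default_rng(0)):
    z_r = rng.uniform(size=size)
    z_theta = rng.uniform(size=size)
    r = np.sqrt(z_r)                        # F(r) = r^2, hence r = sqrt(z_r)
    return r * np.cos(2 * np.pi * z_theta), r * np.sin(2 * np.pi * z_theta)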


2.5.2 Rejection Sampler
All the methods for random variable generation that we looked at so far require intimate knowledge about the pdf of the distribution. We now describe
a general purpose method, which can be used to generate samples from an
arbitrary distribution. Let us begin with sampling from a set:
Example 2.12 (Rejection Sampler) Denote by X̄ ⊆ X a set and let p be a density on X. Then a sampler for drawing from p_X̄(x) ∝ p(x) for x ∈ X̄ and p_X̄(x) = 0 for x ∉ X̄, that is, p_X̄(x) = p(x | x ∈ X̄), is obtained by the procedure:
repeat
  draw x ∼ p(x)
until x ∈ X̄
return x
That is, the algorithm keeps on drawing from p until the random variable is contained in X̄. The probability that this occurs is clearly p(X̄). Hence the larger p(X̄) the higher the efficiency of the sampler. See Figure 2.16.
Example 2.13 (Uniform distribution on a disc) The procedure works trivially as follows: draw x, y ∼ U[0, 1]. Accept if (2x − 1)² + (2y − 1)² ≤ 1 and return the sample (2x − 1, 2y − 1). This sampler has efficiency π/4 since this is the ratio between the area of the unit disc and that of the enclosing square [−1, 1]².
Note that this time we did not need to carry out any sophisticated measure transform. This mathematical convenience came at the expense of a slightly less efficient sampler: about 21% of all samples are rejected.


Fig. 2.17. Accept reject sampling for the Beta(2, 5) distribution. Left: Samples are
generated uniformly from the blue rectangle (shaded area). Only those samples
which fall under the red curve of the Beta(2, 5) distribution (darkly shaded area)
are accepted. Right: The true density of the Beta(2, 5) distribution (red line) is
contrasted with the histogram of 10,000 samples drawn by the rejection sampler.

The same reasoning that we used to obtain a hard accept/reject procedure


can be used for a considerably more sophisticated rejection sampler. The
basic idea is that if, for a given distribution p we can find another distribution
q which, after rescaling, becomes an upper envelope on p, we can use q to
sample from and reject depending on the ratio between q and p.
Theorem 2.23 (Rejection Sampler) Denote by p and q distributions on X and let c be a constant such that cq(x) ≥ p(x) for all x ∈ X. Then the algorithm below draws from p with acceptance probability c^{−1}.
repeat
  draw x ∼ q(x) and t ∼ U[0, 1]
until ct ≤ p(x)/q(x)
return x
Proof Denote by Z the event that the sample drawn from q is accepted. Then by Bayes rule the probability Pr(x|Z) can be written as follows

Pr(x|Z) = Pr(Z|x) Pr(x) / Pr(Z) = [p(x)/(c q(x))] q(x) / c^{−1} = p(x).    (2.85)

Here we used that Pr(Z) = ∫ Pr(Z|x) q(x) dx = ∫ c^{−1} p(x) dx = c^{−1}.

Note that the algorithm of Example 2.12 is a special case of such a rejection sampler: we majorize p_X̄ by the uniform distribution rescaled by 1/p(X̄).
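A minimal Python sketch of the sampler of Theorem 2.23; p, q, sample_q and c are assumed to be supplied by the user, with c q(x) ≥ p(x) for all x.

import numpy as np

def rejection_sample(p, q, sample_q, c, rng=np.random.default_rng(0)):
    # draw x ~ q and t ~ U[0, 1]; accept when c t <= p(x)/q(x)
    while True:
        x = sample_q(rng)
        t = rng.uniform()
        if c * t <= p(x) / q(x):
            return x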
Example 2.14 (Beta distribution) Recall that the Beta(a, b) distribution,


as a member of the Exponential Family with sufficient statistics (log x, log(1 − x)), is given by

p(x|a, b) = [Γ(a + b)/(Γ(a)Γ(b))] x^{a−1} (1 − x)^{b−1}.    (2.86)

For given (a, b) one can verify (Problem 2.25) that

argmax_x p(x|a, b) = (a − 1)/(a + b − 2),    (2.87)

provided a > 1. Hence, if we use as proposal distribution the uniform distribution U[0, 1] with scaling factor c = p((a − 1)/(a + b − 2) | a, b), we may apply Theorem 2.23. As illustrated in Figure 2.17, to generate a sample from Beta(a, b) we first generate a pair (x, t) uniformly at random from the shaded rectangle. A sample is retained if ct ≤ p(x|a, b), and rejected otherwise. The acceptance rate of this sampler is c^{−1}.
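As a concrete illustration, a self-contained sketch of this accept/reject scheme for Beta(2, 5) with a uniform proposal might look as follows (10,000 samples, as in Figure 2.17); all names are illustrative.

import numpy as np
from math import gamma

a, b = 2, 5
def p_beta(x):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

c = p_beta((a - 1) / (a + b - 2))   # height of the mode, roughly 2.46
rng = np.random.default_rng(0)
samples = []
while len(samples) < 10000:
    x, t = rng.uniform(), rng.uniform()
    if c * t <= p_beta(x):          # the proposal density q(x) = 1 on [0, 1]
        samples.append(x)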
Example 2.15 (Normal distribution) We may use the Laplace distribution to generate samples from the Normal distribution. That is, we use

q(x|λ) = (λ/2) e^{−λ|x|}    (2.88)

as the proposal distribution. For a normal distribution p = N(0, 1) with zero mean and unit variance it turns out that choosing λ = 1 yields the most efficient sampling scheme (see Problem 2.27) with

p(x) ≤ √(2e/π) q(x|λ = 1).

As illustrated in Figure 2.18, we first generate x ∼ q(x|λ = 1) using the inverse transform method (see Example 2.9 and Problem 2.21) and t ∼ U[0, 1]. If t √(2e/π) q(x|λ = 1) ≤ p(x) we accept the sample, otherwise we reject it. The efficiency of this scheme is √(π/(2e)).
While rejection sampling is fairly efficient in low dimensions its efficiency is unsatisfactory in high dimensions. This leads us to an instance of the curse of dimensionality [Bel61]: the pdf of a d-dimensional Gaussian random variable centered at 0 with covariance σ²·1 is given by

p(x|σ²) = (2π)^{−d/2} σ^{−d} e^{−‖x‖²/(2σ²)}.

Now suppose that we want to draw from p(x|σ²) by sampling from another Gaussian q with slightly larger variance σ̃² > σ². In this case the ratio


Fig. 2.18. Rejection sampling for the Normal distribution (red curve). Samples are generated uniformly from the Laplace distribution rescaled by √(2e/π). Only those samples which fall under the red curve of the standard normal distribution (darkly shaded area) are accepted.

between both distributions is maximized at 0 and it yields

c = p(0|σ²)/q(0|σ̃²) = (σ̃²/σ²)^{d/2}.

If we choose σ̃ = 1.01 and d = 1000, we find that c ≈ 20960. In other words, we need to generate approximately 21,000 samples on average from q to draw a single sample from p. We will discuss a more sophisticated sampling algorithm, namely Gibbs Sampling, in Section 3.2.5. It allows us to draw from rather nontrivial distributions as long as the distributions in small subsets of random variables are simple enough to be tackled directly.
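The constant above can be checked with a one-line computation (σ = 1, σ̃ = 1.01, d = 1000 as in the text):

sigma, sigma_tilde, d = 1.0, 1.01, 1000
print((sigma_tilde / sigma) ** d)   # roughly 2.1e4, i.e. about 21,000 proposals per accepted sample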

Problems
Problem 2.1 (Bias Variance Decomposition {1}) Prove that the variance Var_X[x] of a random variable can be written as E_X[x²] − E_X[x]².
Problem 2.2 (Moment Generating Function {2}) Prove that the characteristic function can be used to generate moments as given in (2.12). Hint:
use the Taylor expansion of the exponential and apply the differential operator before the expectation.


Problem 2.3 (Cumulative Error Function {2})

erf(x) = √(2/π) ∫_0^x e^{−x²} dx.    (2.89)

Problem 2.4 (Weak Law of Large Numbers {2}) In analogy to the proof of the central limit theorem prove the weak law of large numbers. Hint: use a first order Taylor expansion of e^{itx} = 1 + itx + o(t) to compute an approximation of the characteristic function. Next compute the limit m → ∞ for X̄_m. Finally, apply the inverse Fourier transform to associate the constant distribution at the mean with it.
Problem 2.5 (Rates and confidence bounds {3}) Show that the rate of the Hoeffding bound is tight: derive a bound from the central limit theorem and compare it to the Hoeffding rate.
Problem 2.6 Why can't we just use each chip on the wafer as a random variable? Give a counterexample. Give bounds if we actually were allowed to do this.
Problem 2.7 (Union Bound) Work on many bounds at the same time.
We only have logarithmic penalty.
Problem 2.8 (Randomized Rounding {4}) Solve the linear system of
equations Ax = b for integral x.
Problem 2.9 (Randomized Projections {3}) Prove that the randomized projections converge.
Problem 2.10 (The Count-Min Sketch {5}) Prove the projection trick
Problem 2.11 (Parzen windows with triangle kernels {1}) Suppose you are given the following data: X = {2, 3, 3, 5, 5}. Plot the estimated density using a kernel density estimator with the following kernel:

k(u) = 0.5 − 0.25 |u| if |u| ≤ 2 and 0 otherwise.

Problem 2.12 Gaussian process link with Gaussian prior on natural parameters
Problem 2.13 Optimization for Gaussian regularization


Problem 2.14 Conjugate prior (Student-t and Wishart).

Problem 2.15 (Multivariate Gaussian {1}) Prove that Σ ≻ 0 is a necessary and sufficient condition for the normal distribution to be well defined.
Problem 2.16 (Discrete Exponential Distribution {2}) φ(x) = x and uniform measure.
Problem 2.17 Exponential random graphs.
Problem 2.18 (Maximum Entropy Distribution) Show that exponential families arise as the solution of the maximum entropy estimation problem.
Problem 2.19 (Maximum Likelihood Estimates for Normal Distributions) Derive the maximum likelihood estimates for a normal distribution, that is, show that they result in

μ̂ = (1/m) Σ_{i=1}^m x_i and σ̂² = (1/m) Σ_{i=1}^m (x_i − μ̂)²    (2.90)

using the exponential families parametrization. Next show that while the mean estimate μ̂ is unbiased, the variance estimate has a slight bias of O(1/m). To see this, take the expectation with respect to σ̂².
Problem 2.20 (cdf of Logistic random variable {1}) Show that the cdf
of the Logistic random variable (??) is given by (??).
Problem 2.21 (Double-exponential (Laplace) distribution {1}) Use
the inverse-transform method to generate a sample from the double-exponential
(Laplace) distribution (2.88).
Problem 2.22 (Normal random variables in polar coordinates {1}) If X₁ and X₂ are standard normal random variables, let (R, θ) denote the polar coordinates of the pair (X₁, X₂). Show that R² ∼ χ²₂ and θ ∼ Unif[0, 2π].
Problem 2.23 (Monotonically increasing mappings {1}) A mapping T : R → R is one-to-one if, and only if, T is monotonically increasing, that is, x > y implies that T(x) > T(y).


Problem 2.24 (Monotonically increasing multi-maps {2}) Let T : Rⁿ → Rⁿ be one-to-one. If X ∼ p_X(x), then show that the distribution p_Y(y) of Y = T(X) can be obtained via (??).
Problem 2.25 (Argmax of the Beta(a, b) distribution {1}) Show that
the mode of the Beta(a, b) distribution is given by (2.87).
Problem 2.26 (Accept reject sampling for the unit disk {2}) Give at
least TWO different accept-reject based sampling schemes to generate samples uniformly at random from the unit disk. Compute their efficiency.
Problem 2.27 (Optimizing Laplace for Standard Normal {1}) Optimize the ratio p(x)/g(x|μ, λ), with respect to μ and λ, where p(x) is the standard normal distribution (??), and g(x|μ, λ) is the Laplace distribution (2.88).
Problem 2.28 (Normal Random Variable Generation {2}) The aim of this problem is to write code to generate standard normal random variables (??) by using different methods. To do this generate U ∼ Unif[0, 1] and apply
(i) the Box-Müller transformation outlined in Section ??.
(ii) the following approximation to the inverse CDF

Φ^{−1}(α) ≈ t − (a₀ + a₁ t)/(1 + b₁ t + b₂ t²),    (2.91)

where t² = −log(α²) and a₀ = 2.30753, a₁ = 0.27061, b₁ = 0.99229, b₂ = 0.04481.
(iii) the method outlined in Example 2.15.
Plot a histogram of the samples you generated to confirm that they are normally distributed. Compare these different methods in terms of the time needed to generate 1000 random variables.
Problem 2.29 (Non-standard Normal random variables {2}) Describe a scheme based on the Box-Müller transform to generate d-dimensional normal random variables p(x|0, I). How can this be used to generate arbitrary normal random variables p(x|μ, Σ)?

Problem 2.30 (Uniform samples from a disk {2}) Show how the ideas described in Section ?? can be generalized to draw samples uniformly at random from an axis-parallel ellipse: {(x, y) : x²/a² + y²/b² ≤ 1}.

3
Directed Graphical Models

Reasoning and inference from data is a key problem in machine learning


and in principle the tools developed in the previous two chapters suffice to
address this problem: simply observe instances, construct a density estimate
of the random variables involved, and finally compute the density of what
we want to predict conditioned on the context. Unfortunately this approach
is infeasible as the following example illustrates:
Suppose that you want to decide whether to bring an umbrella when leaving the house. You would typically take factors such as the current weather,
your plans for the day, your clothes, your means of transport, or your health
into consideration. On the other hand, it is fairly safe to ignore today's stock market or the contents of your fridge in your reasoning. In other words, we can
judiciously ignore certain factors since it is safe to assume that the event
we want to predict is independent of them. If we naively performed a joint
density estimate without exploiting independence we would likely arrive at
a poor estimate and therefore be making poor decisions.
A somewhat more subtle type of information is conditional independence. It is safe to assume that conditioned on the state of the weather your decision of bringing an umbrella is independent of your neighbor's decision to do the same. On the other hand, without knowing the weather, clearly your and your neighbor's decisions will not be independent.
There exists a rich structure between these factors which can help us improve decision making. This structure will allow us to reduce the expressive
complexity of joint distributions of random variables to a set of simpler dependencies. The latter can be estimated accurately requiring less data and
less computation than the general case requires. It is also useful for incorporating prior knowledge and for visualization.
Our language of choice is graphical models. We will be discussing directed and undirected models in this and the following chapters. These two
models are by no means the only possible choices for describing dependency
structures between random variables. Alternative representations use the cumulative distribution function directly. This leads to the copula framework
[Nel06], rediscovered recently as cumulative distribution function networks
[HF08]. Directed models are slightly more intuitive when describing notions

of (causal) dependence. In the present chapter we focus on directed models


and provide a number of applications of such models. Some details regarding
inference are relegated to Chapter 4, since both frameworks are amenable
to the generalized distributive law and message passing.

3.1 Introduction
3.1.1 Alarms and Burglars
Directed graphical models are some of the most intuitive ways of describing
dependencies between random variables. Consider the following chain of
events which will serve as a running example in this section: we denote by
B the event that your house is burgled, let A be the event that the alarm
in the house is triggered, and let N be the event that you receive a phone
call from your neighbor. To model this set of events we can always write
p(B, A, N ) = p(B)p(A, N |B) = p(B)p(A|B)p(N |B, A).

(3.1)

In other words, we may start with the probability of a burglary, the probability of the alarm being triggered by the burglary, and finally, the probability
of the neighbor calling given the burglary and the alarm. It is here that we
can make a simplifying modeling assumption: the probability of the neighbor does not directly depend on the burglar but only on the alarm being
triggered. This is probably reasonable since burglars tend to be stealthy and
try not being seen by neighbors. In other words, we assume that
p(N |B, A) = p(N |A) and hence p(B, A, N ) = p(B)p(A|B)p(N |A).

(3.2)

We have just observed an instance of a rather fundamental phenomenon:


modeling the joint distribution over B, A, N has become easier since now,
instead of specifying the probability of the neighbor calling as a random
variable depending on four possible contexts (B, A) {0, 1}2 we now only
need to consider A {0, 1}. If we need to estimate the probability of N from
observations it is surely much easier to estimate p(N ) when conditioned on
a smaller number of covariates. Secondly, conditioned on the status of the
alarm A, the random variables B and N become independent:
p(B, N|A) = p(B, A, N) / Σ_{B′,N′} p(B′, A, N′) = [p(B)p(A|B) / Σ_{B′} p(B′)p(A|B′)] · p(N|A)    (3.3)
           = p(B|A) p(N|A).
We express the fact that B and N are conditionally independent given A as B ⊥⊥ N | A. Note that above we dropped the summation over N


Fig. 3.1. From left to right: a simple three variable model involving a burglary
B, an alarm A, and a phone call from a neighbor N . The second model from the
left denotes that we observe A. This is typically indicated by solid colored nodes.
The second graph from the right denotes a model where we added another random
variable, the event of an earthquake. Note that while B and E are independent of
each other, they cease being so once we observe A or N . Clearly, we can add further
random variables, e.g. whether we receive information about an earthquake on the
radio R.

since by definition Σ_N p(N|A) = 1. This means that for a given model,
inference is simpler, since we may split the chain of reasoning into one part
affecting the random variables (B, A) and a second part affecting (A, N ).
Graphically we may describe the model by the directed graph of Figure 3.1. The key issue was that we were able to drop the arc B → N from the diagram. Here the arrow B → N has the semantics that N depends on B. Such diagrams are quite useful in specifying chains of causality.
Let us now extend the model somewhat. Californian homes are threatened
by earthquakes. Hence let us add the random variable E denoting such
an event. Obviously earthquakes may trigger the alarm A. On the other
hand, we can assume that burglars B are not privy to advance knowledge
of earthquakes, hence the events B and E can reasonably be considered to
be independent (we ignore looting). It is an equally reasonable assumption
that alarms and phone calls will not trigger earthquakes. This leads to the
model (see Figure 3.1)
p(B, A, N, E) = p(B)p(E)p(A|B, E)p(N |A).

(3.4)

While obviously B and E are independent, let us now consider the situation
that we observe A. In this case, the probability p(B, E|A) does not factorize
in B and E any more. In other words, conditioned on the alarm A, the


random variables B and E are dependent:

p(B, E|A) = Σ_N p(B)p(E)p(A|B, E)p(N|A) / Σ_{B,E,N} p(B)p(E)p(A|B, E)p(N|A) ∝ p(B)p(E)p(A|B, E).
This effect is often referred to as explaining away since now the observation
of a joint effect couples the causes. It is easy to check (see Problem 4.2) that
the same holds when conditioning on N rather than A. What happens is
that the information that it is likely either a burglar or an earthquake but
quite unlikely both at the same time gets passed along to N .
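A small numerical illustration of this effect for the model (3.4); the probability values below are assumptions chosen purely for illustration:

pB, pE = 0.01, 0.02                          # assumed prior probabilities of burglary and earthquake
pA = {(0, 0): 0.001, (1, 0): 0.95,           # assumed p(A=1 | B=b, E=e)
      (0, 1): 0.30, (1, 1): 0.99}

# p(B=b, E=e | A=1) is proportional to p(B=b) p(E=e) p(A=1 | b, e); N integrates out
joint = {(b, e): (pB if b else 1 - pB) * (pE if e else 1 - pE) * pA[(b, e)]
         for b in (0, 1) for e in (0, 1)}
Z = sum(joint.values())
post = {k: v / Z for k, v in joint.items()}

p_b_given_a = post[(1, 0)] + post[(1, 1)]                       # about 0.58
p_b_given_a_e = post[(1, 1)] / (post[(0, 1)] + post[(1, 1)])    # about 0.03
print(p_b_given_a, p_b_given_a_e)   # observing the earthquake explains the alarm away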

3.1.2 Formal Definition


Elementary algebra was sufficient to draw nontrivial conclusions about the
above model of 4 random variables B, A, N, E. However, it is evident that
in general such conclusions are somewhat more difficult as the number of
variables increases. Fortunately there exists a formal framework which allows
us to deal with such models in an automatic fashion, e.g. when checking
independence (or conditional independence) and when performing inference.

Fig. 3.2. From left to right: an undirected planar graph, that is a graph that can be represented on a plane without crossing edges. Middle: a directed acyclic graph; it does not contain cycles. Right: a directed loop.

Definition 3.1 (Graphs) Denote by V ⊂ N a set of vertices and let E ⊆ V × V be a subset of pairs of vertices, the edges. In this case we denote by G(V, E) the graph on V, where an edge (i, j) is formed when (i, j) ∈ E. We call a graph directed when the order (i, j) matters and undirected when the order does not matter.
The directed graph G(V, E) is called a directed acyclic graph (DAG) whenever it is not possible to find a cyclic sequence of vertices in V such that all adjacent pairs are contained in E. That is, G has no directed loops (see Figure 3.2).
For a vertex j ∈ V we refer to all i with (i, j) ∈ E as parents of j, denoted by Par(j). Conversely, for a given i we refer to all j with (i, j) ∈ E as children, denoted by Chi(i).
The notion of a DAG allows us to formally introduce directed graphical
models, also known as Bayesian Networks.
Theorem 3.2 Denote by G(V, E) a directed acyclic graph. Moreover, denote by X_i for i ∈ V random variables associated with the vertices V. Then the following is a proper probability distribution function:

p(X) = ∏_{i∈V} p(X_i | X_{Par(i)})    (3.5)

Proof We prove this claim by induction on |V |. If |V | = 1 the theorem


trivially holds, since the corresponding vertex has no parent, hence (3.5) is
proper. Now assume that the claim holds for |V| ≤ n. For a DAG G with
|V | = n + 1 there exists at least one vertex i in G which does not have
children. In this case, summing over Xi leaves all terms unchanged except
for p(Xi |XPar(i) ). The latter integrates out to 1, in which case we obtain
(3.5) for the case where i was removed from G. This proves our claim.
Note that the condition that G does not contain cycles is essential since otherwise normalization is not guaranteed. The following example shows why: assume that we have four binary random variables arranged in a loop, that is A → B → C → D → A as shown in Figure 3.2. Moreover, assume that p(A|B) = δ_{A,B}, that is, A takes on the same value as B, and that the same holds for (B, C), (C, D) and (D, A). In this case
p(A = B = C = D = 1) = p(A = B = C = D = 0) = 1 · 1 · 1 · 1 = 1.
This is clearly impossible: the distribution is not properly normalized. We will encounter undirected graphical models which can deal with such issues more effectively.

3.1.3 d-Separation and Dependence


An immediate question arising from a given directed graphical model is
which variables are independent of each other, when conditioned on a given
set of observations. For specific settings this question is easily answered.
For instance, consider the chain of Figure 3.3. There each observation only
depends on its ancestor. Such models are quite common in practice. For instance, we could assume that today's weather only depends on yesterday's weather. Likewise, today's bank account only depends on yesterday's account status


Fig. 3.3. A Markov chain. Each random variable i depends on its predecessor i − 1. Clearly, observing 5 makes {1, 2, 3, 4} independent from {6, 7}.


Fig. 3.4. Transition behavior of the Bayes-ball algorithm. The ball may only pass
whenever variables depend on each other: (a) The variables A and C depend on each
other via B; (b) A and C are independent, since B is not observed and it therefore
can be integrated out without side effects; (c) A and C depend on each other through
the joint conditioning on their parent B; (d) Observing B makes C independent of
A; The ball bounces back for C to deal with downstream dependencies. This is a
special case of the next diagram; (e) Since B has A and C as causes, observing B
makes A and C dependent (e.g. imagine an XOR relationship); (f) observing the
joint cause of A and C makes them independent. The Bayes-ball algorithm simply
allows messages to travel along lines of dependence, as indicated by the arrows on
the side of B.

and whatever income we derived in the meantime. The position of a robot


might only depend on its speed and location an instant before plus whatever
uncertainty we incur due to moving it.
An important extension of such models is the case where the current state
cannot be observed directly but rather only via some indirect measurement.
This leads to Hidden Markov models and the Kalman Filter. They are very
popular for instance in speech recognition where the hidden states are assumed to be the text and the observation would be the utterances of the
speaker. Likewise, the robot may not observe its actual speed or location but
only some indirect data such as the images of its environment via a sensor.
We will discuss a variety of such structures in Section 3.3.
In order to design an automatic means of determining which random variables are conditionally independent of each other for a given graphical model
we rely on the following two key results: one which allows us to expand the
conditional independence between sets of variables as one involving all pairs


from the sets, and a second one which allows us to check dependence between
random variables by a reachability problem on a graph.
Theorem 3.3 ([Pea01]) Denote by A, B, C three sets of random variables associated with a DAG G(V, E). In this case A ⊥⊥ B | C if and only if a ⊥⊥ b | C holds for all a ∈ A and b ∈ B.
Theorem 3.4 ([GVP90]) Denote by a, b random variables and by C a set of random variables associated with a DAG G(V, E). Then a and b are dependent given C if there exists a path from a to b where each node with two incoming arcs on the path either is in C or has a descendant in C. If no such path exists we say that a and b are d-separated by C.
Figure 3.4 explains the rationale behind this theorem: two variables i, j depend on each other when the connecting variable k is unobserved and it is not a joint child of i and j. This relation is reversed when we observe k. In this case, only if k is a joint child of i and j does the observation of k render i and j dependent. Theorem 3.4 states that this dependence is transitive, that is,
it is inherited along a chain of random variables. Algorithm 3.1 determines
the reachability efficiently. Moreover, its runtime is linear in the size of the
graph G(V, E).
Theorem 3.5 ([Sha98]) For a given graph G(V, E) algorithm 3.1 terminates in at most O(|V | + |E|) time. Moreover for disjoint sets A, C V the
set B = BayesBall(G(V,E), A, C) contains all variables which are conditionally dependent on A given C.
Proof Each vertex is visited at most twice since we set the flags ti and
bi according to whether we follow to the children or parents of a vertex.
Moreover, we follow each edge at most twice (once from the parent and once
from the descendant).
To see the dependence, note that Algorithm 3.1 is a reachability test: the variables t_i and b_i are only book-keeping tools to ensure that we do not traverse a given edge in a given direction twice (t_i deals with child-parent paths, b_i with parent-child paths). Now note that any path between vertices a ∈ A and b ∈ B will traverse an unobserved vertex i ∉ C with the exception of a parent-child-parent path or unless the ball bounces back from an observed descendant. Moreover, it will traverse an observed vertex i ∈ C only if it is a parent-child-parent path. This is consistent with Theorem 3.4, which concludes our proof.
Algorithm 3.1 Bayes Ball algorithm for d-separation

VisitParents(i)
  if t_i = 0 then set t_i = 1 and for all j ∈ Par(i) do Visit(j, from child)
VisitChildren(i)
  if b_i = 0 then set b_i = 1 and for all j ∈ Chi(i) do Visit(j, from parent)
Visit(i, label)
  if i ∉ C then
    if label = from parent then
      VisitChildren(i) {pass through}
    else
      VisitParents(i) {pass through}
      VisitChildren(i) {bounce back down}
    end if
  else
    if label = from parent then VisitParents(i) {bounce back up}
  end if
BayesBall(G(V, E), A, C)
  For all i set t_i = 0 and b_i = 0
  For all i ∈ A do Visit(i, from child)
  return {i | b_i = 1 and i ∉ A ∪ C}

The notion of directed graphical models can be extended to deterministic nodes which encode direct consequences of some other observed random variables. For instance, we might have a random variable describing the event of a little boy dropping a vase and then the deterministic consequence of the vase being broken. See [Sha98] for details.
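The following is a minimal Python sketch of Algorithm 3.1, assuming the DAG is given as a dictionary mapping each vertex to the list of its children; it is only meant to illustrate the bookkeeping with the flags t_i and b_i.

def bayes_ball(children, A, C):
    # parents[j] lists all i with an edge i -> j
    parents = {v: [] for v in children}
    for i, chs in children.items():
        for j in chs:
            parents[j].append(i)
    t = {v: False for v in children}   # flag: ball already forwarded to the parents of v
    b = {v: False for v in children}   # flag: ball already forwarded to the children of v

    def visit_parents(i):
        if not t[i]:
            t[i] = True
            for j in parents[i]:
                visit(j, "from_child")

    def visit_children(i):
        if not b[i]:
            b[i] = True
            for j in children[i]:
                visit(j, "from_parent")

    def visit(i, label):
        if i not in C:                       # unobserved vertex
            if label == "from_parent":
                visit_children(i)            # pass through
            else:
                visit_parents(i)             # pass through
                visit_children(i)            # bounce back down
        elif label == "from_parent":         # observed vertex
            visit_parents(i)                 # bounce back up

    for i in A:
        visit(i, "from_child")
    return {i for i in children if b[i] and i not in A and i not in C}

# Burglary example B -> A <- E, A -> N: conditioning on the alarm couples B and E.
dag = {"B": ["A"], "E": ["A"], "A": ["N"], "N": []}
print(bayes_ball(dag, {"B"}, {"A"}))   # {'E'}: explaining away
print(bayes_ball(dag, {"B"}, set()))   # {'A', 'N'}: without observations B and E stay independent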

3.2 Estimation
So far we demonstrated that directed graphical models are a versatile tool
when it comes to specifying dependencies in human-readable form. If we
observe all variables in the graph we are immediately able to read off the
probability / density for the corresponding event. Many applications, however, do not fit into this category: typically we observe only some of the
variables while the others are unknown. In this context we would like to
perform three basic operations: we would like to quantify the density of
the observed variables. This requires an operation called Marginalization.
A second question is to determine the density of some configuration of the
unobserved variables conditioned on the observed random variables. This
requires Conditioning. Finally, we may want to adjust the parameters of


the model itself. While this, strictly speaking, could be viewed as a case of
marginalization over the unknown model parameters, it is practically useful,
nonetheless, to study algorithms such as Expectation Maximization, which
can be tailored towards the adaptation of model parameters.
Let us discuss the basic operations in some more detail. The model of Figure 3.1 will serve as an example to motivate the problem further.
Marginalization: Assume that we are only interested in the probability
of receiving a phone call from our neighbor N . In this case we
need to sum over all possible constellations of (B, A, N ) to obtain
p(N ) = B,A,N p(B, A, N, E). In other words, we integrate out the
unobserved random variables in order to obtain the marginal distribution of the random variable we are actually interested in.
More formally, given a distribution p(X, Y ) we want to compute
p(X). This requires us to sum over the variables Y via
p(X) = Σ_Y p(X, Y) = Σ_Y ∏_{i∈V} p(Z_i | Z_{Par(i)})    (3.6)

Here we used Z := (X, Y ) as a shorthand to denote the union of


both random variables. Clearly, whenever Y only contains children
of x this operation is straightforward. In other cases, we will need
to resort to message passing algorithms and dynamic programming
to compute the sum efficiently. A principled discussion of this family
of algorithms is deferred to Section ??. Moreover, we will encounter
examples of this algorithm family in Section 3.3.
Conditioning: Returning to Figure 3.1, we could ask what chances are that
we will receive a call, provided that someone actually did break into
the house. This requires us to obtain p(A, N, E|B) or more specifically p(N |B) from p(B, A, N, E). This operation requires us to perform both conditioning on B and marginalization (over the variables
A, E) in order to answer this query.
Formally, given a distribution p(X, Y ) we want to compute p(Y |X).
As long as the random variables X are exclusively parents of Y this
is easily achieved. In general we need to solve
p(Y|X) = p(X, Y)/p(X).    (3.7)

This means that the same operations driving marginalization are


also key ingredients in conditioning. Note that p(X, Y ) and p(X)
only need to be known up to a joint multiplicative constant. This
simple observation will greatly simplify calculations in practice.


Parameter Estimation: So far we assumed that the probabilities in our


simple model of burglars and neighbors were known beforehand.
However, it is clear that at some point we will want to use empirical
data to estimate the parameters accurately.
In general, we want to determine the probabilities p(Xi |XPar(i) )
from observations. In some cases, the parameters of these distributions themselves will be drawn from some prior distribution and we
will discuss algorithms for obtaining estimates for them.
If we fully observe X in the form of some instances X := {x₁, . . . , x_m} it is possible to decompose the log-likelihood log p(X; θ) into

log p(X; θ) = log ∏_{i=1}^m ∏_{j=1}^n p(x_{ij} | Par(x_{ij}); θ) = Σ_{i=1}^m Σ_{j=1}^n log p(x_{ij} | Par(x_{ij}); θ).

Here we denote by x_{ij} the j-th coordinate of x_i. As a consequence, any estimation problem in terms of θ decouples into n separate problems on the (child, parent) sets of the underlying directed acyclic graph, possibly coupled by the choice of parameters θ. Hence, we can use the techniques introduced in Chapter 2 to adjust θ. Usually things are not quite that easy.

3.2.1 Information Theory Primer


The techniques we employ for estimation require some modicum of knowledge in information theory. These tools will prove useful throughout the
book. We introduce them here since they are directly useful in the context
of inference in graphical models. That said, they have a much broader range
of applications e.g. in designing inference principles, data selection, and the
design of codes. See [CT91] for an excellent introduction.
Entropy A basic measure when dealing with distributions p is the amount
of information that can be encoded by observations drawn from p. Without
further ado, we define the entropy as
H(p) := −∫ log p(x) dp(x)  and  H(p) := −Σ_{x∈X} p(x) log p(x)    (3.8)

for continuous and discrete domains respectively. The following theorem


which dates back to Shannon [Sha48] describes the basic connection between
information and distributions:
Theorem 3.6 (Source Coding) Assume that X = {x1 , . . . , xm } with xi


{1, . . . , n} is drawn iid from p with entropy H(p). Then in the limit of m → ∞ the average number of bits required to encode any element of X is given by H(p)/log 2.
In other words, it is possible to encode the datastream drawn from p with H(p)/log 2 bits per symbol. Moreover, it is impossible to decode it with less than
that threshold if lossless compression is required.
A simple example of the above theorem can be seen as follows: if we have a
distribution over 2ⁿ equidistributed events we could obviously encode each
of the events with n bits. However, if one of these events were to occur
very frequently, we might want to reserve a shorter bit sequence for the
frequent event at the expense of investing more bits for the less frequent
terms. Theorem 3.6 indicates that the best possible code to encode event x should require −log₂ p(x) bits. For continuous variables this analogy is,
unfortunately, somewhat fragile, since by their very nature, storing numbers
in R at full precision almost always requires an infinite number of bits.
Nonetheless, entropy proves to be a useful concept when measuring the
complexity of distributions under consideration.
Kullback Leibler Divergence Given two distributions it is only natural
to define a distance between them. We will discuss this topic in great detail in
Section 10 where we will be analyzing it from the viewpoint of expectations
over a given set of random variables. In the context of information theory
we may compute distances between distributions by asking how many extra
bits we would need to spend on encoding data drawn from p when using
the code for q rather than the (optimal) code for p. This is precisely the
Kullback Leibler (KL) Divergence:
D(p‖q) := ∫ log [p(x)/q(x)] dp(x)  and  D(p‖q) := Σ_{x∈X} p(x) log [p(x)/q(x)]    (3.9)

for continuous and discrete distributions respectively. Note that in general the KL-divergence is not symmetric, i.e. D(p‖q) ≠ D(q‖p). Moreover, the following holds:
Theorem 3.7 For all distributions p and q we have D(q‖p) ≥ 0 with equality only for p = q. For continuous distributions the equality holds in the measure-theoretic sense (i.e. almost everywhere).
Proof The equality is obvious since log 1 = 0. To see nonnegativity note that the negative logarithm is a convex function, hence by Jensen's inequality we have E_x[−log f(x)] ≥ −log E_x[f(x)]. It follows that

D(p‖q) = E_x[−log (q(x)/p(x))] ≥ −log E_x[q(x)/p(x)] = −log 1 = 0.

Since the negative logarithm is strictly convex it follows that D(p‖q) = 0 only if p = q.
The KL-Divergence enjoys a number of attractive properties. For instance, it is convex in q and it is concave in p. This connection will be used subsequently in the design of variational inference techniques for graphical models. Moreover, by choosing q(x) = δ_{x₀,x} it follows that H(p) = D(p‖q), hence:
Corollary 3.8 The entropy H(p) is nonnegative.
Mutual Information We conclude our brief analysis of information theory
by defining the mutual information between random variables. It is a simple
quantity which measures how related two random variables are. In particular, whenever two random variables are independent the mutual information
vanishes. We define it as follows:
I(X, Y) := H(p_X) + H(p_Y) − H(p_{X,Y}).    (3.10)

One may check that I(X, Y) = D(p_{X,Y} ‖ p_X · p_Y), that is, the mutual information is the distance between the joint distribution in X and Y and the product of its marginals. Consequently we have that I(X, Y) ≥ 0 and moreover that I(X, Y) = 0 if and only if p_{X,Y} = p_X · p_Y. Finally we have

H(p_{X,Y}) = −∫ [log p(y|x) + log p(x)] dp(y|x) dp(x) = H(p_{Y|X}) + H(p_X).

Hence we have I(X, Y) = H(p_Y) − E_X[H(p_{Y|X})]. Mutual information can be used to address problems such as independent component analysis [LGBS00, Car97, YAC98]. There the goal is to un-mix a typically linear mixture of independent sources such that we recover the independent signals without having direct access to the mixing matrix.

3.2.2 An Example Clustering


To make things more concrete we introduce clustering. It is one of the
most commonly used techniques in exploratory and descriptive data analysis. Moreover, it will allow us to explain the concepts of latent variables,
variational inference, Expectation Maximization, and Gibbs sampling more



Fig. 3.5. K-means clustering as a graphical model. We assume that there exists some
distribution over k terms, denoted by yi which generates vector-valued observations
xi . The diagram in the middle represents the same model, now in plate notation.
The model on the right hand side is an extension obtained by adding conjugate
priors on π, μ and Σ.

concretely. Its simplicity allows us to focus on concepts rather than problemspecific details. A considerable number of additional more complex examples covering applications ranging from Hidden Markov Models to Latent
Dirichlet Allocation and Collaborative Filtering models can be found in Section 3.3.
Consider the setting in Figure 3.5 which describes the basic mixture of
Gaussians clustering model. In the graph on the left hand side xi Rn
denote observations, yi {1, . . . k} correspond to the latent cluster membership variables, denotes the discrete distribution over class memberships
and j , j denote the means and variances of x for a given cluster membership. We may express this model as follows:
m

p(xi |yi , , )p(yi |)

p(X, Y|, , ) =
i=1
m

(2) 2 |yi | 2 e 2 (xi yi )

(3.11)
1
yi (xi yi )

p(yi |) (3.12)

i=1

This means that if we knew X and Y it would be quite easy to obtain a maximum likelihood or a maximum a-posteriori estimate of the parameters π, μ, Σ, simply by computing the empirical means, variances, and class probabilities. A naive approach would be to perform maximum likelihood (or maximum a-posteriori) estimation directly. This can be achieved by solving


the optimization problem:

maximize_{π,μ,Σ} Σ_{i=1}^m log Σ_{y∈Y} (2π)^{−n/2} |Σ_y|^{−1/2} exp(−½ (x_i − μ_y)^⊤ Σ_y^{−1} (x_i − μ_y)) p(y|π)    (3.13)

Unfortunately this problem is quite intractable and nonconvex, hence we need to resort to a number of simpler surrogate techniques.
3.2.3 Direct Maximization
Maximizing the likelihood p(X|π, μ, Σ) directly is rather difficult, whereas maximizing p(X, Y|π, μ, Σ) with respect to Y and π, μ, Σ can be carried out efficiently. That is, for a given choice of π, μ, Σ we can find the maximum likelihood values of Y and vice versa. Note that this procedure is fraught with many problems: there is no guarantee that it will lead to any representative choice of model parameters whatsoever. For instance, if we have a multimodal distribution (i.e. a distribution with several maxima) the algorithm will only pick the largest peak and ignore the remaining modes which (potentially) may represent a much more significant part of the distribution.
Quite surprisingly, this admittedly rather crude procedure leads to a well known and widely used algorithm: the k-means clustering algorithm [Mac67]. We make the following two simplifying assumptions: firstly we assume that p(y|π) = 1/k is fixed and that we will not attempt to adjust this at all. Secondly, we assume that the variance of the Gaussians is also fixed to a multiple of the diagonal matrix, that is, Σ_y = σ²·1 for all y ∈ {1, . . . , k}. In other words, the algorithm only attempts to find the cluster assignments Y and the corresponding cluster centers μ.
Assume that we are given some cluster centers μ₁, . . . , μ_k. In this case maximizing p(X, Y|μ, σ) with respect to Y decomposes into

y_i = argmax_y p(x_i|y, μ, σ) p(y|π) = argmin_y ‖x_i − μ_y‖²    (3.14)

This can be seen directly from (3.12): all Σ_y are multiples of the identity matrix and p(y|π) is constant, hence the only term that matters is the distance from the cluster mean. Once we have determined the most likely cluster assignments we can compute the most likely cluster centers by solving

μ_y = argmax_μ ∏_{i: y_i = y} p(x_i|y, μ, σ) p(y|π) = argmin_μ Σ_{i: y_i = y} ‖x_i − μ‖²  and hence  μ_y = (1/|{i : y_i = y}|) Σ_{i: y_i = y} x_i.
3.2 Estimation

103

Algorithm 3.2 k-means Algorithm

kmeans(X)
  Initialize μ_y as distinct random points in X.
  while not converged do
    Compute y_i := argmin_y ‖x_i − μ_y‖² for all i.
    Compute μ_y := (1/|{i : y_i = y}|) Σ_{i: y_i = y} x_i for all y.
  end while
  return Y, μ

The second equality follows from the fact that only terms with y_i = y depend on μ_y and that p(y|π) is constant. The above pair of equations yields the k-means clustering algorithm (Algorithm 3.2).
Theorem 3.9 (k-means Convergence) Algorithm 3.2 converges in a finite number of iterations.
Proof At each step the likelihood of the data with respect to Y and μ is increased, hence the algorithm either makes progress or it has converged. Since the set of assignments Y ∈ {1, . . . , k}^m is finite the number of steps is finite, too.
Note that while this shows convergence in a finite number of steps it does not provide any guarantee regarding the rate of convergence. In practice, though, k-means converges very quickly (often in tens of steps).
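A minimal NumPy sketch of Algorithm 3.2, assuming X is an (m, n) array of observations; the function signature is an illustrative choice.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # distinct random points of X
    for _ in range(max_iter):
        # assignment step: y_i = argmin_y ||x_i - mu_y||^2
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        y = d.argmin(axis=1)
        # update step: mu_y is the mean of the points assigned to cluster y
        new_mu = np.array([X[y == j].mean(axis=0) if np.any(y == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return y, mu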

3.2.4 Expectation Maximization


The reason why (3.13) is difficult to solve is that some of the random variables are unobserved, leading to a nonconvex optimization problem. Expectation Maximization [DLR77] addresses precisely this issue by modeling the distribution of the unobserved variables in alternation with a joint
model of both the observed and latent variables. It also is the first example of what has become known as variational methods in graphical models
[JGJS99, Bis06, WJ08]. It is based on the following observation.
Theorem 3.10 (Variational Bound) Let p(x, y; θ) be a distribution that is parametrized by θ with marginal p(x; θ). In this case the following holds:

log p(x; θ) ≥ E_{y∼q(y)}[log p(x, y; θ)] + H(q).    (3.15)

The inequality (3.15) is tight when choosing q(y) = p(y|x; θ).


Algorithm 3.3 Expectation Maximization Algorithm

EM(x)
  while not converged do
    Compute q(y) := p(y|x; θ)
    θ ← argmax_θ E_{y∼q(y)}[log p(x, y; θ)]
  end while
  return θ

Proof By the properties of the KL-divergence we have

log p(x; θ) ≥ log p(x; θ) − D(q(y) ‖ p(y|x; θ))
           = ∫ dq(y) [log p(x; θ) + log p(y|x; θ) − log q(y)]
           = ∫ dq(y) log p(x, y; θ) − ∫ dq(y) log q(y).

This shows the inequality. To see that the bound is tight, simply note that for q(y) = p(y|x; θ) the KL-divergence vanishes.
This inequality immediately suggests a procedure to maximize p(x; θ): for a given choice of q maximize E_{y∼q(y)}[log p(x, y; θ)] over θ. Subsequently adapt q and repeat. Note that this increases the lower bound on the best possible value of log p(x; θ) at every step: this is clearly the case when optimizing over θ. Moreover, when recomputing q we tighten the lower bound (it becomes exact in this case), hence the lower bound increases again. This is what is described in Algorithm 3.3. This procedure is known as the Expectation Maximization (EM) algorithm. One may show [DLR77] that the EM algorithm is convergent. Note though, that in general it is difficult to obtain guarantees on the rate of convergence.
Theorem 3.11 (EM Convergence) Algorithm 3.3 generates monotonically increasing likelihood scores for p(x; θ) and it converges to a local maximum of the likelihood.
In the context of maximum likelihood estimation for clustering the latent variables are the cluster memberships y_i whereas the parameters are given by π, μ_i, Σ_i. For a given choice of the latter the probability p(y|x; π, μ, Σ) factorizes into p(y_i|x_i; π, μ, Σ) which leads to

q_i(y) ∝ p(x_i, y; π, μ, Σ) = p(y|π) (2π)^{−n/2} |Σ_y|^{−1/2} exp(−½ (x_i − μ_y)^⊤ Σ_y^{−1} (x_i − μ_y))    (3.16)

3.2 Estimation

105

The normalization can be computed simply by summing p(x_i, y; π, μ, Σ) with respect to y. Given q we can compute the negative expected log-likelihood L(q, π, μ, Σ) := E_{y∼q(y)}[−log p(X, Y; π, μ, Σ)]. This yields

L(q, π, μ, Σ) = −Σ_{i=1}^m Σ_y q_i(y) log p(y|π) + (mn/2) log 2π
               + ½ Σ_{i=1}^m Σ_y q_i(y) [log |Σ_y| + (x_i − μ_y)^⊤ Σ_y^{−1} (x_i − μ_y)]    (3.17)

Minimizing L with respect to π is straightforward: the best we can do is to set p(y|π) to the average of the q_i(y). This follows directly from Section 2.3.2. Hence we have

p(y|π) = (1/m) Σ_{i=1}^m q_i(y)    (3.18)

Next note that the problem of minimizing L decomposes into separate subproblems for each value of y. Taking derivatives with respect to μ_y and Σ_y yields

∂_{μ_y} L = Σ_{i=1}^m q_i(y) Σ_y^{−1} (μ_y − x_i)    (3.19)

∂_{Σ_y} L = ½ Σ_i q_i(y) [Σ_y^{−1} − Σ_y^{−1} (x_i − μ_y)(x_i − μ_y)^⊤ Σ_y^{−1}]    (3.20)

Solving the above with respect to μ and Σ yields the well-known clustering equations: for m_y = Σ_{i=1}^m q_i(y) update

μ_y = (1/m_y) Σ_{i=1}^m q_i(y) x_i  and  Σ_y = (1/m_y) Σ_{i=1}^m q_i(y) (x_i − μ_y)(x_i − μ_y)^⊤.    (3.21)

In practice it is advisable to adjust (3.18) and (3.21) slightly by adding a


conjugate prior for both the cluster distribution in y and the normal distribution in x|y. Algorithm 3.4 provides the updated equations. The additional
parameters are a default smoothing mass per class, set to m0 , and a default
second order moment Q0 . See also Problem 4.7 for further details. In a nutshell, this ensures that the probability of a given cluster will never get stuck
at 0 probability due to the m0 prior and moreover, that the variance never
vanishes due to the default quadratic moment Q0 .
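A minimal NumPy sketch of these smoothed updates, a rough rendering of Algorithm 3.4 below; X is assumed to be an (m, n) array, m0 a scalar smoothing mass and Q0 an (n, n) default second moment, all names being illustrative.

import numpy as np

def mixture_of_gaussians(X, k, m0=1.0, Q0=None, iters=50, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    Q0 = np.eye(n) if Q0 is None else Q0
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    pi = np.full(k, 1.0 / k)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(k)])
    for _ in range(iters):
        # E-step: q_i(y) proportional to p(y|pi) N(x_i | mu_y, Sigma_y), computed in log space
        logq = np.empty((m, k))
        for y in range(k):
            diff = X - mu[y]
            inv = np.linalg.inv(Sigma[y])
            _, logdet = np.linalg.slogdet(Sigma[y])
            logq[:, y] = np.log(pi[y]) - 0.5 * (n * np.log(2 * np.pi) + logdet
                                                + np.einsum('ij,jk,ik->i', diff, inv, diff))
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step with the (m0, Q0) smoothing described above
        my = q.sum(axis=0)
        pi = (m0 + my) / (m0 * k + m)
        for y in range(k):
            mu[y] = q[:, y] @ X / (m0 + my[y])
            Sigma[y] = (m0 * Q0 + (q[:, y, None] * X).T @ X) / (m0 + my[y]) - np.outer(mu[y], mu[y])
    return pi, mu, Sigma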
The inequality in Theorem 3.10 allows for a much more general class of algorithms: it is not always possible to find q such that q(y) = p(y|x; θ), in particular not whenever the normalization with respect to y is intractable.


Algorithm 3.4 Mixture of Gaussians

MixtureOfGaussians(X, m₀, Q₀)
  Initialize μ_y as distinct random points in X.
  Initialize p(y|π) = 1/k.
  Initialize Σ_y = Var[X].
  while not converged or iteration count exceeded do
    for i = 1 to m do
      q_i(y) = p(y|π) (2π)^{−n/2} |Σ_y|^{−1/2} exp(−½ (x_i − μ_y)^⊤ Σ_y^{−1} (x_i − μ_y)) for all y
      c = Σ_y q_i(y)
      q_i(y) = q_i(y)/c for all y
    end for
    for y = 1 to k do
      m_y := Σ_{i=1}^m q_i(y)
      p(y|π) = (m₀ + m_y)/(m₀ k + m)
      μ_y = (1/(m₀ + m_y)) Σ_{i=1}^m q_i(y) x_i
      Σ_y = (1/(m₀ + m_y)) [m₀ Q₀ + Σ_{i=1}^m q_i(y) x_i x_i^⊤] − μ_y μ_y^⊤
    end for
  end while
  return π, μ, Σ

In such a case we could try to pick q from a restricted simpler class Q of distributions:

q(y) = argmax_{q∈Q} E_{y∼q(y)}[log p(x, y; θ)] + H(q).    (3.22)

This is what is commonly known as a variational method. We shall see applications of variational inference when discussing topic models in Section 3.3.4. See [JGJS99] for a much more detailed overview over variational inference algorithms.

3.2.5 Gibbs Sampling


The methods we presented so far have the distinct disadvantage of not providing any guarantees whatsoever of converging to the correct estimate.
Instead, both the k-means and the Mixture of Gaussians algorithm only
converge to a local optimum of the entire set of parameters.
An alternative is to draw observations from the distribution over the latent
variables. This is appropriate, in particular when we have a joint probability


distribution in the latent variables and parameters of the problem.1 While


there exists a large number of advanced techniques, such as Markov-Chain
Monte Carlo sampling [DdFG01] we restrict ourselves to Gibbs sampling.
Remark 3.12 (Gibbs Sampling) Denote by x := (x₁, . . . , x_n) a random variable with probability distribution p(x). The procedure

x_i ∼ p(x_i | x \ x_i)    (3.23)

when applied to all x1 , . . . , xn in sequence will converge to a draw from p(x)


after an infinite number of passes, provided that the regions of high density
are suitably connected.
To see the condition of connectivity consider the case where we have a uniform distribution over the pairs (1, 2) and (2, 1). In this case, if x₁ = 1 then x₂ must be 2 and vice versa. Hence, regardless of our starting point, we will never draw more than one distinct value by using the sampler.
In the context of clustering we need to introduce priors on π, μ, Σ in order for our model to be complete. For convenience, since both the multinomial and the Normal distribution are members of the Exponential Family, we use conjugate priors: a Dirichlet prior for the multinomial and a Gauss-Wishart prior for the Normal. This yields
p(|) = exp (f oo)

(3.24)

In the context of clustering we have two sets of variables: the cluster labels Y which can be sampled efficiently given π, μ, Σ, and the model parameters (μ_i, Σ_i) and π, all of which again can be sampled independently of each other for given Y. As we can see, the main difference between k-means, the Mixture
for given Y . As we can see, the main difference between k-means, the Mixture
of Gaussians clustering, and the Gibbs sampling algorithm is that rather
than finding optimal assignments we now sample from the distribution. The
fact that we alternate between cluster description and model description
remains unchanged. In the following we describe these steps in more detail.
Drawing from Pr(y_i | x_i, π, μ, Σ)
Drawing from Pr(π | Y)
1

Note that in this case there is no need whatsoever to distinguish between latent variables and
parameters at all. Overall, we have a joint distribution over the unobserved variables and the
parameters of the model.


3.2.6 Collapsed Sampling


3.3 Applications
When designing graphical models we often need to deal with repeated random variables, such as in the k-means clustering model of Figure 3.5. It is
quite tedious if we need to specify large models explicitly. This motivates
the need for a statistical equivalent of a for loop. The convention is that
random variables are instantiated using the index denoted in the frame. The
variables are uniquely identified by their name. That is, in order to express
a chain-like dependency we only need to characterize one link, say between
yi and yi+1 . Furthermore, we may nest such loops. Figure 3.8 describes a
famous example Latent Dirichlet Allocation, a topic model for sparse
data. Finally, intersections between plates imply that variables contained
inside the intersection are repeated according to all plates they're contained
in (the latter is not quite as easily possible with for loops).
We now give a short summary of a number of popular directed graphical models. This list is by far not complete. For instance, see [RG99] for
a taxonomy of a family of models and [KF09] for a much more detailed
treatment.

3.3.1 Hidden Markov Models


In Figure 3.5 we tried to model data xi as an unordered set of observations.
In reality, we often have some guiding dimension when data is gathered:
data may be time-dependent (music, speech, the news-stream of a blog). It
may be inherently sequential, such as text or biological sequence code, or it
may be interpreted as a sequence of actions.
A simple way of extending the clustering model is to introduce dependence
between subsequent observations xi and xi+1 via their hidden states yi and
yi+1 . That is, instead of drawing yi+1 from some distribution p(yi+1 |) we
now draw it from p(yi+1 |yi , ).
For instance, p(yi+1 |yi , ) may denote the distribution over a set of new
states of a robot given its current state and p(xi |yi ) denotes the observations
the robot makes given its current state. Such models have led to simultaneous localization and mapping algorithms for robots [FMT96] where y denotes
the location on a (to be estimated) map. Similar techniques have led to state
of the art speech recognition systems [Rab89]. There the text is the latent
variable yi while xi denotes the utterances. Similarly, in bioinformatics yi
may denote whether a particular segment is an intron, an exon, or whether
it relates to other genetically relevant regions [KMH94].



Fig. 3.6. A simple hidden Markov model. The variables xi are observed, while the
states yi responsible for generating xi are unknown. The graph on the right is an
equivalent representation as a plate. Our notation is slightly sloppy by not dealing
explicitly with the last hidden state y10 which does not have a child y11 . That said,
since y11 is an unobserved child node in the DAG, it integrates out without any
further effect on the joint probability distribution.

Often, the transition probabilities p(y_{i+1}|y_i) are constrained by what is commonly known as a grammar. That is, only certain sequences of states are allowed. This is useful for the incorporation of prior knowledge. See [LBBH98] for a highly nontrivial system including grammars to build an industrial check reader system.
The problem when performing inference in an HMM is that, just as before in clustering, the hidden variables y_i are unknown. We can use the EM algorithm for inference. However, unlike before, estimating p(Y|X, θ) is no trivial task, since the random variables depend on each other. We now discuss this estimation in detail, since it is a good introduction to the world of dynamic programming. We will use these techniques extensively in Section ?? when dealing with undirected graphical models and what is known as message passing.
The joint likelihood of the model in Figure 3.6 can be written as

p(x, y) = p(y₁) ∏_{i=1}^{m−1} p(y_{i+1}|y_i) p(x_i|y_i) · p(x_m|y_m)    (3.25)

If we want to compute the likelihood of x we need to perform a sum over all values of y. That is, we need to compute

p(x) = Σ_{y₁,...,y_m} p(y₁) ∏_{i=1}^{m−1} p(y_{i+1}|y_i) p(x_i|y_i) · p(x_m|y_m)    (3.26)
     = Σ_{y₂,...,y_m} [Σ_{y₁} p(y₁) p(y₂|y₁) p(x₁|y₁)] ∏_{i=2}^{m−1} p(y_{i+1}|y_i) p(x_i|y_i) · p(x_m|y_m)
     = Σ_{y₃,...,y_m} [Σ_{y₂} f₂(y₂) p(y₃|y₂) p(x₂|y₂)] ∏_{i=3}^{m−1} p(y_{i+1}|y_i) p(x_i|y_i) · p(x_m|y_m)

where the bracketed sums are denoted by f₂(y₂) and f₃(y₃) respectively.

While the original sum is most likely to be too costly as it involves a sum over
an exponential number of terms, we have managed to compute it step-bystep simply by pushing summations into the product of factors. Extending
this idea we can compute p(x) by solving the recursion
f₁(y₁) = p(y₁) p(x₁|y₁),  f_i(y_i) = p(x_i|y_i) Σ_{y_{i−1}} f_{i−1}(y_{i−1}) p(y_i|y_{i−1}).    (3.27)

and therefore p(x) = Σ_{y_m} f_m(y_m). As a result we can immediately apply Bayes rule to obtain p(y|x) = p(x, y)/p(x). Note that just as well we could have applied a recursion from the end of the Markov chain to obtain the recursion

b_m(y_m) = p(x_m|y_m),  b_i(y_i) = p(x_i|y_i) Σ_{y_{i+1}} b_{i+1}(y_{i+1}) p(y_{i+1}|y_i)    (3.28)

such that p(x) = Σ_{y₁} p(y₁) b₁(y₁). The recursions (3.27) and (3.28) are commonly known as the forward and backward passes through a Hidden Markov model. Note that both of them can be computed in linear time in the length of the chain.
Now assume that we want to obtain the probability p(y_i|x). By Bayes rule these probabilities can be computed via p(y_i|x) = p(x, y_i)/p(x). The only difference is that we must not perform the summation over y_i to obtain the correct numerator. This is achieved as follows: we perform the forward pass of (3.27) until we reach f_i(y_i). Likewise we perform the backward pass until we reach b_i(y_i). This yields

p(x, y_i) = f_i(y_i) b_i(y_i) / p(x_i|y_i)    (3.29)

An analogous reasoning leads to p(x, yi , yi+1 ) = fi (yi )p(yi+1 |yi )bi+1 (yi+1 ).


As a consequence, we can compute all marginal probabilities for individual


events yi and pairs (yi , yi+1 ) in O(m) time provided that we use O(m) storage
to cache the functions fi (yi ) and bi (yi ).
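A minimal NumPy sketch of the forward-backward recursions; for simplicity it uses the common convention in which the backward message does not include the emission term p(x_i|y_i) (with the convention above one would divide by p(x_i|y_i) as in (3.29)). The transition matrix T[i, j] = p(y_{t+1} = j | y_t = i), emission matrix E[j, o] = p(x = o | y = j) and initial distribution pi0 = p(y_1) are assumed to be supplied for discrete observations.

import numpy as np

def forward_backward(x, pi0, T, E):
    m, k = len(x), len(pi0)
    f = np.zeros((m, k))
    b = np.zeros((m, k))
    f[0] = pi0 * E[:, x[0]]                        # f_1(y) = p(y_1) p(x_1 | y_1)
    for t in range(1, m):
        f[t] = E[:, x[t]] * (f[t - 1] @ T)         # forward pass, cf. (3.27)
    b[m - 1] = 1.0
    for t in range(m - 2, -1, -1):
        b[t] = T @ (E[:, x[t + 1]] * b[t + 1])     # backward pass, cf. (3.28)
    px = f[-1].sum()                               # likelihood p(x)
    gamma = f * b / px                             # posterior marginals p(y_t | x)
    return px, gamma
# In practice one rescales f and b at every step to avoid numerical underflow on long chains.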
Besides being useful in its own right, we can use these equations as components in the EM algorithm as follows: assume that we want to maximize the likelihood of observing the data x, that is p(x; θ), for some parameter θ. Algorithm 3.3 requires that we compute the expected log-likelihood with respect to the distribution q(y) := p(y|x; θ). Plugging in (3.25) yields
\[
\mathbf{E}_{y \sim q}[\log p(x, y; \theta)] = \mathbf{E}_{y_1 \sim q}\, \log p(y_1; \theta) + \sum_{i=1}^{m} \mathbf{E}_{y_i \sim q}\, \log p(x_i|y_i; \theta) + \sum_{i=1}^{m-1} \mathbf{E}_{y_{i+1}, y_i \sim q}\, \log p(y_{i+1}|y_i; \theta) \tag{3.30}
\]

Note that it is precisely the expectations with respect to y_i and with respect to (y_i, y_{i+1}) that we need to compute the expected log-likelihood. Assuming that we are dealing with an HMM clustering model with Gaussian outputs, that is y_i ∈ {1, ..., k} and x_i|y_i ~ N(μ_{y_i}, σ²_{y_i}), we obtain update equations very similar to clustering. In fact, only (??) needs changing; Eq. (??) and (??) remain unchanged. The difference is that now, instead of estimating p(y_i|θ), we have the equations
\[
p(y_1; \theta) = q(y_1) \quad\text{and}\quad p(y_{i+1} = a \,|\, y_i = b; \theta) = \frac{1}{m-1} \sum_{j=1}^{m-1} q(y_{j+1} = a \,|\, y_j = b).
\]
In other words, we pick the transition probabilities p(y_{i+1}|y_i) as the average over the posterior transition probabilities p(y_{i+1}|y_i, x). Only the initial probability p(y_1) is treated differently. The probabilities p(x_i|y_i), also called emission probabilities, are estimated just as in a clustering model.
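To make this M-step concrete, here is a small sketch of our own (the array name pairwise_posteriors is hypothetical): it averages the posterior transition probabilities q(y_{j+1} = a | y_j = b) over all positions j, exactly as in the update displayed above.

import numpy as np

def update_transitions(pairwise_posteriors):
    """M-step for the transition matrix of an HMM clustering model.

    pairwise_posteriors: (m-1, k, k) array, entry [j, b, a] being the
    posterior q(y_j = b, y_{j+1} = a) computed in the E-step.
    Returns A with A[b, a] = p(y_{i+1} = a | y_i = b).
    """
    # condition on y_j = b, then average the resulting posterior
    # transition probabilities over all positions j
    cond = pairwise_posteriors / pairwise_posteriors.sum(axis=2, keepdims=True)
    return cond.mean(axis=0)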
While the above model is deceptively simple (we merely have transitions between a given number of states which emit Gaussian random variables), such models form the basis of advanced speech [BM90] and handwriting recognition [LBBH98] algorithms, and they have found widespread application in bioinformatics [DEKM98]. There exist countless variants on the HMM theme. See e.g. Figure 3.7 for some examples.

3.3.2 Kalman Filter


So far we assumed that the hidden state yi is discrete and that the observations xi are generated by a mixture of distributions (in our case Gaussians).



Fig. 3.7. Left: Factor analysis. The observations x1 , . . . , xm arise from latent factors y1 , . . . , yn which are considered independent. Middle: an Input-Output HMM.
A sequence x1 , . . . , xm is observed and transcribed according to a hidden state
y1 , . . . , ym into an output z1 , . . . , zm . For instance, translation models can be fashioned after this structure. Right: a factorial HMM. To deal with richly structured
states we assume that instead of having just a single latent set of states y1 , . . . , ym
which cause z1 , . . . , zm , we have an additional set x1 , . . . , xm . For instance, x could
represent the text we are trying to read and y could correspond to the vocal apparatus which is generating speech.

Let us now consider the case where y_i ∈ R^n and x_i ∈ R^m and moreover where all distributions are Gaussians. We assume that
\[
y_1 \sim \mathcal{N}(\mu_1, \Sigma_1) \tag{3.31a}
\]
\[
y_{i+1}|y_i \sim \mathcal{N}(\mu_y + A y_i, \Sigma_y) \tag{3.31b}
\]
\[
x_i|y_i \sim \mathcal{N}(\mu_x + B y_i, \Sigma_x) \tag{3.31c}
\]

In other words, we assume that the hidden state evolves according to a dynamical system y_{i+1} = μ_y + A y_i with some Gaussian noise with covariance Σ_y added, and where the observations are affine transformations of the hidden state, x_i = μ_x + B y_i, with some Gaussian noise with covariance Σ_x added.
Since the sum of two Gaussians is Gaussian again (see Problem 4.12), it follows that if y_i is Gaussian then so are both y_{i+1} and x_i. By and large, the estimation of the distributions p(y_{i+1}|y_i) and p(x_i|y_i) proceeds in the same fashion as in the case of the discrete Hidden Markov Model. See e.g. [Hay91, WB06] for details. For illustration purposes we discuss how x_i and y_{i+1} are related, conditioned on y_i.
Assume that y_i ~ N(μ_i, Σ_i). By (3.31b) and (3.31c) it follows that the random variables (x_i, y_{i+1}) are jointly Gaussian: they have a variable offset (B y_i, A y_i) which is Gaussian. Moreover, we can compute the means and variances easily by exploiting that they are both additive. Hence we have


that
\[
\begin{pmatrix} y_{i+1} \\ x_i \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_y + A\mu_i \\ \mu_x + B\mu_i \end{pmatrix}, \begin{pmatrix} \Sigma_y + A\Sigma_i A^\top & A\Sigma_i B^\top \\ B\Sigma_i A^\top & \Sigma_x + B\Sigma_i B^\top \end{pmatrix} \right). \tag{3.32}
\]
If we observe x_i we can use this to improve our estimate of y_{i+1}. Using (4.5) of Problem 4.15 we may check that y_{i+1}|x_i ~ N(μ_{i+1}, Σ_{i+1}) where
\[
\mu_{i+1} = \mu_y + A\mu_i + A\Sigma_i B^\top (\Sigma_x + B\Sigma_i B^\top)^{-1} (x_i - \mu_x - B\mu_i) \tag{3.33}
\]
\[
\Sigma_{i+1} = \Sigma_y + A\Sigma_i A^\top - A\Sigma_i B^\top (\Sigma_x + B\Sigma_i B^\top)^{-1} B\Sigma_i A^\top \tag{3.34}
\]
The above equations can be used to track the variables yi via the indirect
observations xi . Note that the same could be achieved in the discrete case.
For details on inference and nonlinear extensions see [WB06].
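For illustration, a minimal sketch of the resulting one-step update (our own rendering of (3.33) and (3.34); the function and argument names are ours, not the book's) is:

import numpy as np

def predict_update(mu_i, Sigma_i, x_i, A, B, mu_y, mu_x, Sigma_y, Sigma_x):
    """One step of tracking y_{i+1} given an observation x_i, cf. (3.33)-(3.34)."""
    S = Sigma_x + B @ Sigma_i @ B.T            # covariance of x_i
    K = A @ Sigma_i @ B.T @ np.linalg.inv(S)   # gain applied to the innovation
    innovation = x_i - mu_x - B @ mu_i
    mu_next = mu_y + A @ mu_i + K @ innovation
    Sigma_next = Sigma_y + A @ Sigma_i @ A.T - K @ B @ Sigma_i @ A.T
    return mu_next, Sigma_next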

3.3.3 Factor Analysis


Figure 3.7 contains yet another common variant of directed graphical models, namely factor analysis. It is a convenient way of representing hidden causes (or factors) for observations.
In its simplest form the observations x_1, ..., x_m are merely linear functions of the latent variables y_1, ..., y_n. This problem is also commonly known as the independent component analysis problem for linear mixtures [HKO01, Car98]. While simplistic, it has wide-ranging applications, from linear demixing of audio sources (here we assume that our microphones record x while the actual signal is contained in y) to medical applications [ADSM06].
Note that p(y|x) and p(x|y) factorize, which makes parameter inference
relatively easy. A detailed discussion of the variants of such models is beyond
the scope of this book. See [RG99] for more details.

3.3.4 Latent Dirichlet Allocation


Assume that we would like to cluster a collection of documents. This leads to an interesting challenge: documents tend to cover a number of topics rather than addressing a single issue. For instance, we might have a webpage addressing a lawsuit against a fast food company, or a page describing the arrest of an actor, or a news article concerning a large financial donation to a symphony orchestra. In each of those cases the articles cannot be adequately described by a single pure topic but rather by a mixture of topics.
A simple combinatorial argument shows us why. Assume that we are allowed to encode the content of a document by k out of 1000 terms. In this

Fig. 3.8. Latent Dirichlet Allocation. For given parameters α and β, we draw multinomial distributions θ_i and ψ_l from Dirichlet distributions with parameters α and β respectively. Subsequently we draw m_i topics z_{ij} for each document. Finally, for every topic z_{ij} we draw a word w_{ij} from the corresponding topic distribution ψ_{z_{ij}}.

case, we can encode $\binom{1000}{k}$ combinations. For k = 2 this amounts to 499,500 choices. Hence, if we wanted to encode the same information by a clustering
model (i.e. k = 1) we would need a significantly higher number of clusters
and consequently a significantly larger amount of data to estimate these clusters accurately. Moreover, the fact that two documents are partially similar
by sharing some common topics is useful for search and retrieval purposes.
Figure 3.8 contains the most basic variant of a topic model as introduced by [BNJ03]. We observe a collection of m documents, of length m_i respectively, with words w_{ij}. The idea is that they are generated as follows: for each document draw a topic distribution θ_i from a Dirichlet distribution over k events. Subsequently, draw topics z_{ij} from θ_i. For each z_{ij} draw a word w_{ij} from one of k multinomial distributions ψ_{z_{ij}}. All ψ_l are in turn drawn from a conjugate prior, namely a Dirichlet distribution with coefficients β.
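As a toy illustration of this generative process (our own sketch; the symbols alpha, beta, theta and psi follow the reconstruction above and are not fixed by the text), one might sample a corpus as follows:

import numpy as np

def sample_corpus(m, doc_lengths, alpha, beta, vocab_size, k, rng=np.random):
    """Sample documents from the LDA generative process of Figure 3.8."""
    # topic-word distributions psi_l, drawn from a symmetric Dirichlet(beta)
    psi = rng.dirichlet(beta * np.ones(vocab_size), size=k)
    corpus = []
    for i in range(m):
        theta_i = rng.dirichlet(alpha * np.ones(k))   # per-document topic mix
        words = []
        for _ in range(doc_lengths[i]):
            z = rng.choice(k, p=theta_i)              # draw a topic z_ij
            w = rng.choice(vocab_size, p=psi[z])      # draw a word w_ij
            words.append(w)
        corpus.append(words)
    return corpus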
Unfortunately, inference in this model is considerably less trivial. In particular, it is not possible to simply apply the EM algorithm to estimate p(θ, ψ, z|w, α, β). The reason is that it is impossible to compute the sum over the variables θ, ψ, z exactly any more. Instead, we will need further approximations, such as a restriction of the family of distributions Q, or a sampling algorithm. These advanced issues are discussed in Section ??. We will discuss further instances of directed graphical models in Chapter 6.

Problems
Problem 3.1 (Chain Smoker) Smoking - tar in the lungs - cancer. Asbestos. What happens now? What if we know that the guy was in the construction industry and had to deal with asbestos?
Problem 3.2 (Burglar, Earthquake, and Neighbor) Show that in our


joint model p(B, A, N, E) the random variables B, E are dependent when


conditioned on N . Hint: integrate out A to obtain p(N |B, E).
Problem 3.3 (Bayes-ball) Bayes-ball for a number of graphs.
Problem 3.4 (Deterministic Nodes) Extend Bayes ball to networks with
deterministic nodes.
Problem 3.5 (Discrete Sampler and Entropy) Recall the sampler for
discrete distributions of Example 2.8. Discuss how we can take advantage of
the fact that some probabilities may be much larger than others in sampling.
Prove that in the limit of an infinitely long draw the expected number of
comparison operations per sample is given by the entropy H(p).
Problem 3.6 (Stationary Distribution in a Markov Chain) Random
walk on a graph distribution over states. Same transition at every step.
Problem 3.7 (Page Rank) Random surfer visits each link uniformly. With
given probability resets. Speed of convergence (find eigenvalue).
Problem 3.8 (k-means with Conjugate Prior) Figure 3.5 on the right. Default mean μ_0 = 0 and default per-cluster sample size m_0. Moreover, default class smoother m_0 k. This simplifies the equations a bit.
Problem 3.9 (User behavior in search engine) Let us model the click
behavior of a user when dealing with the results of a search engine.
[Figure: graphical model with nodes v, v_i, c_i, u_i, v_{i+1}.]
Problem 3.10 Assume that f is a convex function which satisfies f(1) = 0. Show that in this case the f-divergence D_f(q‖p) := \mathbf{E}_{x \sim q(x)}\left[ f\!\left( \frac{p(x)}{q(x)} \right) \right] vanishes if and only if p = q, and that moreover D_f is nonnegative.
Problem 3.11 Assume we are on a graph. Want to reach B from A. How
to solve this in O(V + E) time? Flood fill the graph (Dijkstra algorithm).
Problem 3.12 Assume we want to find the most likely sequence in a HMM.


Problem 3.13 Modify the HMM of Figure 3.6 to account for a conjugate
prior on the transition probabilities and the emission probabilities. Hint: use
a Dirichlet prior for p(yi+1 |yi ) and a Wishart prior for p(xi |yi ). Design an
EM algorithm for it. This is a generalization of Problem 4.7.
Problem 3.14 (Gaussian Models) Show that the sum of n independent Gaussians is Gaussian again and that means and variances are additive. Moreover, show that if (x, y) is jointly Gaussian, then so are the marginal x and the conditional x|y. Compute their means and variances.
Problem 3.15 (Sparse Linear Systems) Assume that we have a matrix M ∈ R^{m×m} for which we would like to solve the linear system M x = y. Moreover, assume that M is diagonal. Can you find an O(m) algorithm for it? Now assume that M is banded with nonzero terms only within one band
it? Now assume that M is banded with nonzero terms only within one band
of the main diagonal. Show that you can still solve the system in O(m) time,
albeit with slightly higher cost. Show that you can solve M x = y by processing
both the upper left and the lower right corner of the matrix simultaneously.
Problem 3.16 (Pair-HMM) Assume that we have a gene sequence which
is being transformed under mutation. In particular, assume that we have 4
symbols (A, C, G, T) and that for each symbol there is a probability that it is
changed to another symbol, that we insert a symbol, and that we will omit this
symbol [Wat99]. Compute the probability that we observe a sequence x given
an original sequence x'. Next assume that we have a uniform distribution
over sequences. Compute the probability that two sequences arise from the
same ancestor.
Note: this type of problem is common when comparing the genomes of
different organisms, e.g. in the computation of phylogenetic trees.
Problem 3.17 (Gaussian Conditioning) Assume that (x, y) are jointly normal with
\[
\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \Sigma_{yy} \end{pmatrix} \right). \tag{3.35}
\]
Show that in this case y|x is normal with
\[
y|x \sim \mathcal{N}\!\left( \mu_y + \Sigma_{xy}^\top \Sigma_{xx}^{-1}(x - \mu_x),\; \Sigma_{yy} - \Sigma_{xy}^\top \Sigma_{xx}^{-1} \Sigma_{xy} \right). \tag{3.36}
\]
Hint: expand p(x, y) in terms depending only on y and a multiplicative remainder which can be dropped, since you only need to normalize p(y|x) ∝ p(x, y). You will need to invert a 2x2 block matrix. Then perform a quadratic expansion to obtain linear and quadratic terms for a Gaussian model in y.
Problem 3.18 (Cumulative Distribution Function Networks) Marginalization
F (x) = maxy F (x, y). Conditioning F (y|x) (check this). [HF08] paper.
Problem 3.19 (Directed and Undirected Models) Directed and Undirected Models are not equivalent. Use the example from Kevin Murphy.

4
Undirected Graphical Models

Reasoning and inference from data is a key problem in machine learning


and in principle the tools developed in the previous two chapters suffice to
address this problem: simply observe instances, construct a density estimate
of the random variables involved, and finally compute the density of what
we want to predict conditioned on the context. Unfortunately this approach
is infeasible as the following example illustrates:
Suppose that you want to decide whether to bring an umbrella when leaving the house. You would typically take factors such as the current weather,
your plans for the day, your clothes, your means of transport, or your health
into consideration. On the other hand, it is fairly safe to ignore today's stock market or the contents of your fridge in your reasoning. In other words, we can
judiciously ignore certain factors since it is safe to assume that the event
we want to predict is independent of them. If we naively performed a joint
density estimate without exploiting independence we would likely arrive at
a poor estimate and therefore be making poor decisions.
A somewhat more subtle type of information is conditional independence. It is safe to assume that, conditioned on the state of the weather, your decision to bring an umbrella is independent of your neighbor's decision to do the same. On the other hand, without knowing the weather, your and your neighbor's decisions will clearly not be independent.
There exists a rich structure between these factors which can help us improve decision making. This structure will allow us to reduce the expressive
complexity of joint distributions of random variables to a set of simpler dependencies. The latter can be estimated accurately requiring less data and
less computation than the general case requires. It is also useful for incorporating prior knowledge and for visualization.
Directed graphical models are good for modeling dependencies whenever we are able to specify the interaction between random variables as a directed acyclic graph. Unfortunately, this is not always possible or even desirable. Consider the graph on the right in Figure 3.2. If we want to model a set of random variables A, B, C, D where A and B, B and C, C and D, and D and A are related in a symmetric fashion, it is clear that any symmetric
Fig. 4.1. Undirected graph and cliques

directed graph cannot be acyclic (see also Problem 4.17 for an example). Loosely speaking, the relationship between directed and undirected models is that the former model causal dependencies between random variables whereas the latter only model correlations.

4.0.5 Definition
We begin by introducing some more graph-theoretic tools:
Definition 4.1 (Cliques) Denote by G(V, E) an undirected graph with vertices V and edges E. Then a clique c of G is a subset of vertices V_c ⊆ V with all associated edges E_c ⊆ E such that all vertices in c are fully connected. A maximal clique is a clique which cannot be extended by adding more vertices.
Figure 4.1 provides an example of such graphs. As before in the case of directed models we associate with each vertex of the graph a random variable.
For a given graph G denote by C the set of all maximal cliques. In this case
we define
\[
p(x) := \frac{1}{Z} \prod_{c \in C} \psi_c(x_c) \quad\text{where}\quad Z := \int_{\mathcal{X}} \prod_{c \in C} \psi_c(x_c)\, dx \tag{4.1}
\]
to be the family of densities associated with G. Here x_c ∈ X_c denotes the restriction of x (and X respectively) onto the clique c, and the ψ_c are nonnegative functions defined on X_c. The constant Z ensures that the probability distribution is properly normalized. It is commonly referred to as the partition function. As before with directed graphical models let us investigate the conditional independence relations. As we shall see, distributions of the form (4.1) have the property that random variables are conditionally independent if the parts of the graph they are contained in become disjoint after removing the conditioning variables. To prove this we need a slightly more sophisticated tool, the Möbius inversion formula.
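As a toy illustration of (4.1) (ours, not from the text), the following sketch evaluates an unnormalized density over three binary variables on a chain with cliques {1,2} and {2,3}, and computes the partition function Z by brute-force summation; the potential values are arbitrary.

import itertools
import numpy as np

# hypothetical clique potentials psi_c for the cliques {x1, x2} and {x2, x3};
# rows and columns index the binary states of the two variables in the clique
psi_12 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi_23 = np.array([[1.0, 2.0], [2.0, 1.0]])

def unnormalized(x1, x2, x3):
    return psi_12[x1, x2] * psi_23[x2, x3]

# partition function Z: sum of the clique products over all joint states
Z = sum(unnormalized(*x) for x in itertools.product([0, 1], repeat=3))

def p(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z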

Theorem 4.2 (Möbius) Denote by g a real-valued function on X and denote by g_A(x) the restriction of g onto the subset A via g_A(x) = g(x_A), that is by setting all coordinates of x not contained in A to 0. Then the following holds:
\[
\phi_A(x) = \sum_{B \subseteq A} (-1)^{|A \setminus B|} g_B(x) \quad\text{and}\quad g(x) = \sum_A \phi_A(x). \tag{4.2}
\]
Moreover, for A ≠ ∅, if x_i = 0 for some i ∈ A we have φ_A(x) = 0.


Proof To show the second claim, divide all subsets of A into two sets: one containing i and another which does not contain i. Since the corresponding terms differ only in sign, all terms in the left sum of (4.2) cancel. To see the first claim we plug the left equality into the right one to obtain
\[
\sum_A \phi_A(x) = \sum_A \sum_{B \subseteq A} (-1)^{|A \setminus B|} g_B(x) = \sum_B g_B(x) \sum_{A \supseteq B} (-1)^{|A \setminus B|}.
\]

Denote by n = max{|A\B|} the largest increment of A relative to B. The sum over A can be rewritten as follows:
\[
\sum_{A \supseteq B} (-1)^{|A \setminus B|} = \sum_{i=0}^{n} \binom{n}{i} (-1)^i = (1 + (-1))^n = 0 \quad\text{whenever } n > 0.
\]
Moreover, for n = 0, that is, when B is maximal, the sum amounts to 1, hence we get g_B(x) = g(x). This proves the claim.
We can now state what could be considered the fundamental theorem of
undirected graphical models, the Hammersley-Clifford theorem. It was first
proposed by [HC71] and the first correct proof is due to [Bes74].
Theorem 4.3 (Hammersley-Clifford) Denote by G(V, E) an undirected graph and let p be a probability distribution parametrized according to (4.1). Denote by A, B, C ⊆ V disjoint subsets of vertices. Then a sufficient condition for x_A ⫫ x_B | x_C is that after removing C from G, the sets A and B are disconnected from each other.
Moreover, any nonnegative probability distribution which satisfies the above conditional independence properties for any sets A, B, C according to G has the factorization (4.1).
Proof Without loss of generality assume that A ∪ B ∪ C = V. Eq. (4.1) allows us to factorize p(x_A, x_B, x_C) into a product of terms containing only (x_A, x_C) and (x_B, x_C). This can be seen by contradiction: assume that there exists some factor ψ_c(x_c) such that c ∩ A and c ∩ B are both nonempty. Since c is a clique, this means that there exists a path from A to B not passing through C, which is impossible by our assumptions.

To see the converse we explicitly construct a suitable expansion using the conditional independence relations. Define
\[
g(x) := \log p(x) \quad\text{and correspondingly}\quad g_A(x) := g(x_A) \tag{4.3}
\]
as in Theorem 4.2. It follows that g(x) = \sum_A \phi_A(x). All we need to show now is that φ_A(x) = 0 whenever A is not a maximal clique in order to prove the claim, since we may then set ψ_A(x_A) := exp φ_A(x) to establish the factorization.
Whenever A is not a maximal clique, there exist at least two coordinates
(a, b) which do not share an edge in the graph. Hence we may decompose
the sum in (4.2) via
\[
\phi_A(x) = \sum_{B \subseteq A \setminus \{a,b\}} (-1)^{|A \setminus B|} \left[ g_B(x) + g_{B \cup \{a,b\}}(x) - g_{B \cup \{a\}}(x) - g_{B \cup \{b\}}(x) \right].
\]

Now note that due to the Markov property (p(x_a|x_B, x_b) = p(x_a|x_B))
\[
g_{B \cup \{a,b\}}(x) - g_{B \cup \{a\}}(x) = \log \frac{p(x_B, x_a, x_b)}{p(x_B, x_a)} = \log \frac{p(x_a|x_B, x_b)\, p(x_B, x_b)}{p(x_a|x_B)\, p(x_B)} = \log \frac{p(x_B, x_b)}{p(x_B)} = g_{B \cup \{b\}}(x) - g_B(x).
\]
Hence φ_A(x) = 0, and therefore ψ_A(x) = 1, whenever A is not a clique.


The consequence of this profound connection is that whenever the joint
distribution over random variables has conditional independence properties
according to an undirected graph, we can decompose the distribution into
a product of potentially much smaller factors which can subsequently be
estimated with much higher confidence.

4.1 Examples
When modeling distributions over random variables we have three principal
strategies to keep the models simple.
We can make stringent conditional independence assumptions by means
of a suitable graph. This is the main point of the present chapter and we
discuss how to build such models and perform inference in them.
We can make assumptions of symmetry. For instance, if we have a set of n random variables which depend on each other in a ring-wise fashion (see Figure 4.2) we may make the assumption that the clique potentials ψ_A(x_A) all take the same functional form. This is justified whenever there is no specific property associated with a particular position on the ring.

4.2 Nonparametric Exponential Families

123

Such a choice can significantly improve our estimates, since we only need to estimate one potential ψ_A(x_A) instead of n separate potentials.
We can assume that the clique potentials themselves are simple. For instance, we could assume that ψ_A(x_A) is slowly varying. This will become the focus of our discussions when it comes to regularization and priors.
We will discuss the first and second of those modeling choices for a number of common structures: chains, rings, and lattices. In the process of doing so we will again encounter dynamic programming algorithms which are a special case of the generalized distributive law. A more detailed discussion of the latter will follow in Section 4.3.
Chains
Rings
Lattices

Fig. 4.2. Ring

4.2 Nonparametric Exponential Families


Exponential inner product notation. We get sum decomposition.
4.3 Inference
Message passing, junction trees
Fig. 4.3. Junction graph


4.4 The Generalized Distributive Law


semi-ring definition
examples
show how the worked example on linear chains can be transformed by
using semirings
GDL (brief mention and pointer to main paper)
4.5 Approximate Inference

Message passing on trees


variable elimination
triangulated graphs and Junction tree
loopy BP
Tree sampling
Need to mention that we did not cover many topics:

Querying a graphical model


learning the structure (à la Bartlett, Taskar et al. or Ghahramani style)
Variational methods for inference
Algorithmic tricks not based on DP/GDL used for certain graphical models (viz. planar graphs, grids etc.)
When algorithmic tricks are used then all three inference subproblems
might not have same complexity (for instance in case of planar graphs).
Problems
Problem 4.1 (Chain Smoker) Smoking - tar in the lungs - cancer. Asbestos. What happens now? What if we know that the guy was in the construction industry and had to deal with asbestos?
Problem 4.2 (Burglar, Earthquake, and Neighbor) Show that in our
joint model p(B, A, N, E) the random variables B, E are dependent when
conditioned on N . Hint: integrate out A to obtain p(N |B, E).
Problem 4.3 (Bayes-ball) Bayes-ball for a number of graphs.
Problem 4.4 (Deterministic Nodes) Extend Bayes ball to networks with
deterministic nodes.
Problem 4.5 (Stationary Distribution in a Markov Chain) Random
walk on a graph distribution over states. Same transition at every step.


Problem 4.6 (Page Rank) Random surfer visits each link uniformly. With
given probability resets. Speed of convergence (find eigenvalue).
Problem 4.7 (k-means with Conjugate Prior) Figure 3.5 on the right.
Problem 4.8 (User behavior in search engine) Let us model the click
behavior of a user when dealing with the results of a search engine.
[Figure: graphical model with nodes v, v_i, c_i, u_i, v_{i+1}.]
Problem 4.9 Assume we are on a graph. Want to reach B from A. How


to solve this in O(V + E) time? Flood fill the graph (Dijkstra algorithm).
Problem 4.10 Assume we want to find the most likely sequence in a HMM.
Problem 4.11 Modify the HMM of Figure 3.6 to account for a conjugate
prior on the transition probabilities and the emission probabilities. Hint: use
a Dirichlet prior for p(yi+1 |yi ) and a Wishart prior for p(xi |yi ). Design an
EM algorithm for it. This is a generalization of Problem 4.7.
Problem 4.12 (Gaussian Models) Show that the sum of n independent Gaussians is Gaussian again and that means and variances are additive. Moreover, show that if (x, y) is jointly Gaussian, then so are the marginal x and the conditional x|y. Compute their means and variances.
Problem 4.13 (Sparse Linear Systems) Assume that we have a matrix M ∈ R^{m×m} for which we would like to solve the linear system M x = y. Moreover, assume that M is diagonal. Can you find an O(m) algorithm for it? Now assume that M is banded with nonzero terms only within one band
of the main diagonal. Show that you can still solve the system in O(m) time,
albeit with slightly higher cost. Show that you can solve M x = y by processing
both the upper left and the lower right corner of the matrix simultaneously.
Problem 4.14 (Pair-HMM) Assume that we have a gene sequence which
is being transformed under mutation. In particular, assume that we have 4


symbols (A, C, G, T) and that for each symbol there is a probability that it is
changed to another symbol, that we insert a symbol, and that we will omit this
symbol [Wat99]. Compute the probability that we observe a sequence x given
an original sequence x'. Next assume that we have a uniform distribution
over sequences. Compute the probability that two sequences arise from the
same ancestor.
Note: this type of problem is common when comparing the genomes of
different organisms, e.g. in the computation of phylogenetic trees.
Problem 4.15 (Gaussian Conditioning) Assume that (x, y) are jointly normal with
\[
\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \Sigma_{yy} \end{pmatrix} \right). \tag{4.4}
\]
Show that in this case y|x is normal with
\[
y|x \sim \mathcal{N}\!\left( \mu_y + \Sigma_{xy}^\top \Sigma_{xx}^{-1}(x - \mu_x),\; \Sigma_{yy} - \Sigma_{xy}^\top \Sigma_{xx}^{-1} \Sigma_{xy} \right). \tag{4.5}
\]
Hint: expand p(x, y) in terms depending only on y and a multiplicative remainder which can be dropped, since you only need to normalize p(y|x) ∝ p(x, y). You will need to invert a 2x2 block matrix. Then perform a quadratic expansion to obtain linear and quadratic terms for a Gaussian model in y.
Problem 4.16 (Cumulative Distribution Function Networks) Marginalization
F (x) = maxy F (x, y). Conditioning F (y|x) (check this). [HF08] paper.
Problem 4.17 (Directed and Undirected Models) Directed and Undirected Models are not equivalent. Use the example from Kevin Murphy.

5
Optimization

Optimization plays an increasingly important role in machine learning. For


instance, many machine learning algorithms minimize a regularized risk functional:
\[
\min_f J(f) := \lambda \Omega(f) + R_{\mathrm{emp}}(f), \tag{5.1}
\]
with the empirical risk
\[
R_{\mathrm{emp}}(f) := \frac{1}{m} \sum_{i=1}^{m} l(f(x_i), y_i). \tag{5.2}
\]

Here x_i are the training instances and y_i are the corresponding labels. The loss function l measures the discrepancy between y_i and the predictions f(x_i). Finding the optimal f involves solving an optimization problem.
This chapter provides a self-contained overview of some basic concepts and
tools from optimization, especially geared towards solving machine learning
problems. In terms of concepts, we will cover topics related to convexity,
duality theory, and Lagrange multipliers. In terms of tools, we will cover
a variety of optimization algorithms including gradient descent, stochastic
gradient descent, Newton method, and Quasi-Newton methods. We will also
look at some specialized algorithms tailored towards solving Linear Programming and Quadratic Programming problems which often arise in machine
learning problems.

5.1 Preliminaries
Minimizing an arbitrary function is, in general, very difficult, but if the objective function to be minimized is convex then things become considerably simpler. As we will see shortly, the key advantage of dealing with convex functions is that a local optimum is also the global optimum. Therefore, well-developed tools exist to find the global minimum of a convex function. Consequently, many machine learning algorithms are now formulated in terms of convex optimization problems. We briefly review the concept of convex sets and functions in this section.

5.1.1 Convex Sets


Definition 5.1 (Convex Set) A subset C of R^n is said to be convex if (1 − λ)x + λy ∈ C whenever x ∈ C, y ∈ C, and 0 < λ < 1.
Intuitively, what this means is that the line joining any two points x and y
from the set C lies inside C (see Figure 5.1). It is easy to see (Exercise 5.1)
that intersections of convex sets are also convex.

Fig. 5.1. The convex set (left) contains the line joining any two points that belong
to the set. A non-convex set (right) does not satisfy this property.

A vector sum \sum_i \lambda_i x_i is called a convex combination if λ_i ≥ 0 and \sum_i \lambda_i = 1. Convex combinations are helpful in defining a convex hull:

Definition 5.2 (Convex Hull) The convex hull, conv(X), of a finite subset X = {x1 , . . . , xn } of Rn consists of all convex combinations of x1 , . . . , xn .

5.1.2 Convex Functions


Let f be a real valued function defined on a set X ⊆ R^n. The set
\[
\{(x, \mu) : x \in X,\; \mu \in \mathbb{R},\; \mu \geq f(x)\} \tag{5.3}
\]
is called the epigraph of f . The function f is defined to be a convex function


if its epigraph is a convex set in Rn+1 . An equivalent, and more commonly
used, definition (Exercise 5.5) is as follows (see Figure 5.2 for geometric
intuition):
Definition 5.3 (Convex Function) A function f defined on a set X is called convex if, for any x, x' ∈ X and any 0 < λ < 1 such that λx + (1 − λ)x' ∈ X, we have
\[
f(\lambda x + (1 - \lambda)x') \leq \lambda f(x) + (1 - \lambda) f(x'). \tag{5.4}
\]


A function f is called strictly convex if
\[
f(\lambda x + (1 - \lambda)x') < \lambda f(x) + (1 - \lambda) f(x') \tag{5.5}
\]
whenever x ≠ x'.

Fig. 5.2. A convex function (left) satisfies (5.4); the shaded region denotes its epigraph. A nonconvex function (right) does not satisfy (5.4).

If f : X → R is differentiable, then f is convex if, and only if,
\[
f(x') \geq f(x) + \langle x' - x, \nabla f(x) \rangle \quad\text{for all } x, x' \in X. \tag{5.6}
\]
In other words, the first order Taylor approximation lower bounds the convex function universally (see Figure 5.4).
If f is twice differentiable, then f is convex if, and only if, its Hessian is positive semi-definite, that is,
\[
\nabla^2 f(x) \succeq 0. \tag{5.7}
\]
For twice differentiable strictly convex functions, the Hessian matrix is positive definite, that is, ∇²f(x) ≻ 0. We briefly summarize some operations which preserve convexity:
Addition: If f_1 and f_2 are convex, then f_1 + f_2 is also convex.
Scaling: If f is convex, then λf is convex for λ > 0.
Affine Transform: If f is convex, then g(x) = f(Ax + b) for some matrix A and vector b is also convex.
Adding a Linear Function: If f is convex, then g(x) = f(x) + ⟨a, x⟩ for some vector a is also convex.
Pointwise Maximum: If the f_i are convex, then g(x) = max_i f_i(x) is also convex.
Scalar Composition: If f(x) = h(g(x)), then f is convex if a) g is convex and h is convex and non-decreasing, or b) g is concave and h is convex and non-increasing.


Fig. 5.3. Left: Convex Function in two variables. Right: the corresponding convex
below-sets {x|f (x) c}, for different values of c.

There is an intimate relation between convex functions and convex sets. For instance, the following lemma shows that the below sets (level sets) of convex functions, that is the sets for which f(x) ≤ c, are convex.
Lemma 5.4 (Below-Sets of Convex Functions) Denote by f : X → R a convex function. Then the set
\[
X_c := \{x \mid x \in X \text{ and } f(x) \leq c\}, \quad\text{for all } c \in \mathbb{R}, \tag{5.8}
\]
is convex.
Proof For any x, x' ∈ X_c, we have f(x), f(x') ≤ c. Moreover, since f is convex, we also have
\[
f(\lambda x + (1 - \lambda)x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq c \quad\text{for all } 0 < \lambda < 1. \tag{5.9}
\]
Hence, for all 0 < λ < 1, we have (λx + (1 − λ)x') ∈ X_c, which proves the claim. Figure 5.3 depicts this situation graphically.
As we hinted in the introduction of this chapter, minimizing an arbitrary
function on a (possibly not even compact) set of arguments can be a difficult
task, and will most likely exhibit many local minima. In contrast, minimization of a convex objective function on a convex set exhibits exactly one global
minimum. We now prove this property.
Theorem 5.5 (Minima on Convex Sets) If the convex function f : X → R attains its minimum, then the set of x ∈ X for which the minimum value is attained is a convex set. Moreover, if f is strictly convex, then this set contains a single element.


Proof Denote by c the minimum of f on X. Then the set X_c := {x | x ∈ X and f(x) ≤ c} is clearly convex.
If f is strictly convex, then for any two distinct x, x' ∈ X_c and any 0 < λ < 1 we have
\[
f(\lambda x + (1 - \lambda)x') < \lambda f(x) + (1 - \lambda) f(x') = \lambda c + (1 - \lambda)c = c,
\]
which contradicts the assumption that c is the minimum value of f. Therefore X_c must contain only a single element.
As the following lemma shows, the minimum point can be characterized precisely.
Lemma 5.6 Let f : X → R be a differentiable convex function. Then x is a minimizer of f if, and only if,
\[
\langle x' - x, \nabla f(x) \rangle \geq 0 \quad\text{for all } x'. \tag{5.10}
\]

Proof To show the forward implication, suppose that x is the optimum but (5.10) does not hold, that is, there exists an x' for which
\[
\langle x' - x, \nabla f(x) \rangle < 0.
\]
Consider the line segment z(λ) = (1 − λ)x + λx', with 0 < λ < 1. Since X is convex, z(λ) lies in X. On the other hand,
\[
\left. \frac{d}{d\lambda} f(z(\lambda)) \right|_{\lambda = 0} = \langle x' - x, \nabla f(x) \rangle < 0,
\]
which shows that for small values of λ we have f(z(λ)) < f(x), thus showing that x is not optimal.
The reverse implication follows from (5.6) by noting that f(x') ≥ f(x) whenever (5.10) holds.
One way to ensure that (5.10) holds is to set ∇f(x) = 0. In other words, minimizing a convex function is equivalent to finding an x such that ∇f(x) = 0. Therefore, the first order conditions are both necessary and sufficient when minimizing a convex function.

5.1.3 Subgradients
So far, we worked with differentiable convex functions. The subgradient is a
generalization of gradients appropriate for convex functions, including those
which are not necessarily smooth.


Definition 5.7 (Subgradient) Suppose x is a point where a convex function f is finite. Then a subgradient is the normal vector of any tangential supporting hyperplane of f at x. Formally, s is called a subgradient of f at x if, and only if,
\[
f(x') \geq f(x) + \langle x' - x, s \rangle \quad\text{for all } x'. \tag{5.11}
\]
The set of all subgradients at a point is called the subdifferential, and is denoted by ∂_x f(x). If this set is not empty then f is said to be subdifferentiable at x. On the other hand, if this set is a singleton then the function is said to be differentiable at x. In this case we use ∇f(x) to denote the gradient of f. Convex functions are subdifferentiable everywhere in their domain. We now state some simple rules of subgradient calculus:
Addition: ∂_x (f_1(x) + f_2(x)) = ∂_x f_1(x) + ∂_x f_2(x).
Scaling: ∂_x (λ f(x)) = λ ∂_x f(x), for λ > 0.
Affine Transform: If g(x) = f(Ax + b) for some matrix A and vector b, then ∂_x g(x) = A^⊤ ∂_y f(y) with y = Ax + b.
Pointwise Maximum: If g(x) = max_i f_i(x), then ∂g(x) = conv(∂_x f_{i'}) where i' ∈ argmax_i f_i(x).

The definition of a subgradient can also be understood geometrically. As illustrated by Figure 5.4, a differentiable convex function is always lower bounded by its first order Taylor approximation. This concept can be extended to non-smooth functions via subgradients, as Figure 5.5 shows.
By using more involved concepts, the proof of Lemma 5.6 can be extended to subgradients. In this case, minimizing a convex nonsmooth function entails finding an x such that 0 ∈ ∂f(x).
5.1.4 Strongly Convex Functions
When analyzing optimization algorithms, it is sometimes easier to work with strongly convex functions, which strengthen the definition of convexity.
Definition 5.8 (Strongly Convex Function) A convex function f is strongly convex if, and only if, there exists a constant σ > 0 such that the function f(x) − (σ/2)‖x‖² is convex.
The constant σ is called the modulus of strong convexity of f. If f is twice differentiable, then there is an equivalent, and perhaps easier, definition of strong convexity: f is strongly convex if there exists a σ such that
\[
\nabla^2 f(x) \succeq \sigma I. \tag{5.12}
\]


In other words, the smallest eigenvalue of the Hessian of f is uniformly lower bounded by σ everywhere.
If f is a strongly convex function with modulus σ, then one can show (Exercise 5.6) that
\[
f(x') \geq f(x) + \langle x' - x, s \rangle + \frac{\sigma}{2} \|x' - x\|^2 \quad\text{for all } x, x' \text{ and } s \in \partial f(x). \tag{5.13}
\]
The right hand side can be minimized by setting the gradient with respect to x' equal to zero (since the RHS is a convex function of x', Lemma 5.6 applies). This yields x' = x − (1/σ)s. Substituting this back into (5.13) yields
\[
f(x') \geq f(x) - \frac{1}{2\sigma} \|s\|^2 \quad\text{for all } x, x' \text{ and } s \in \partial f(x). \tag{5.14}
\]
In particular, by setting x' = x*, the global minimizer of f, one obtains
\[
f(x) - f(x^*) \leq \frac{1}{2\sigma} \|s\|^2 \quad\text{for all } x \text{ and } s \in \partial f(x). \tag{5.15}
\]

5.1.5 Convex Functions with Lipschitz Continuous Gradient
A somewhat symmetric concept to strong convexity is the Lipschitz continuity of the gradient.
Definition 5.9 (Lipschitz Continuous Gradient) A differentiable convex function f is said to have a Lipschitz continuous gradient if there exists a constant L > 0 such that
\[
\|\nabla f(x) - \nabla f(x')\| \leq L \|x - x'\| \quad\text{for all } x, x'. \tag{5.16}
\]
If f has a Lipschitz continuous gradient with modulus L, then one can show (Exercise 5.6) that
\[
f(x') \leq f(x) + \langle x' - x, \nabla f(x) \rangle + \frac{L}{2} \|x' - x\|^2 \quad\text{for all } x, x'. \tag{5.17}
\]
Furthermore, if f is twice differentiable, then there is an equivalent definition: f has a Lipschitz continuous gradient if there exists an L such that
\[
L I \succeq \nabla^2 f(x). \tag{5.18}
\]
In other words, the largest eigenvalue of the Hessian of f is uniformly upper bounded by L everywhere.
5.2 Unconstrained Smooth Convex Minimization
In this section we will describe various methods to minimize a smooth convex
objective function.


Fig. 5.4. A convex function is always lower bounded by its first order Taylor approximation. This is true even if the function is not differentiable (see Figure 5.5)

Fig. 5.5. Geometric intuition of a subgradient. The nonsmooth convex function


(solid blue) is only subdifferentiable at the kink points. We illustrate two of its
subgradients (dashed green and red lines) at a kink point which are tangential to
the function. The normal vectors to these lines are subgradients. Observe that the
first order Taylor approximations obtained by using the subgradients lower bound the convex function.

5.2.1 Minimizing a One-Dimensional Convex Function


As a warm up let us consider the problem of minimizing a smooth one dimensional convex function J : R → R in the interval [L, U]. This seemingly
simple problem has many applications. As we will see later, many optimization methods find a direction of descent and minimize the objective function


Algorithm 5.1 Interval Bisection
1: Input: L, U, precision ε
2: Set t = 0, a_0 = L and b_0 = U
3: while (b_t − a_t) · J'(U) > ε do
4:    if J'((a_t + b_t)/2) > 0 then
5:        a_{t+1} = a_t and b_{t+1} = (a_t + b_t)/2
6:    else
7:        a_{t+1} = (a_t + b_t)/2 and b_{t+1} = b_t
8:    end if
9:    t = t + 1
10: end while
11: Return: (a_t + b_t)/2
along this direction1 ; this subroutine is called a line search. Algorithm 5.1
depicts a simple line search routine based on interval bisection.
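A direct transcription of Algorithm 5.1 into Python (our own sketch; dJ denotes a routine returning the derivative J', and we assume the minimizer lies in [L, U]) reads:

def interval_bisection(dJ, L, U, eps):
    """Line search by interval bisection, following Algorithm 5.1.

    dJ:   callable returning the derivative J'(w) of a convex 1-d function
    L, U: interval end points, with the minimizer assumed inside [L, U]
    eps:  desired precision
    """
    a, b = L, U
    while (b - a) * dJ(U) > eps:
        mid = 0.5 * (a + b)
        if dJ(mid) > 0:      # minimizer lies to the left of mid
            b = mid
        else:                # minimizer lies to the right of mid
            a = mid
    return 0.5 * (a + b)

# usage sketch: minimize J(w) = (w - 3)^2 on [0, 10]
w_star = interval_bisection(lambda w: 2 * (w - 3), 0.0, 10.0, 1e-8)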
Before we show that Algorithm 5.1 converges, let us first derive an important property of convex functions of one variable. For a differentiable one-dimensional convex function J, (5.6) reduces to
\[
J(w) \geq J(w') + (w - w') \cdot J'(w'), \tag{5.19}
\]
where J'(w) denotes the gradient of J. Exchanging the roles of w and w' in (5.19), we can write
\[
J(w') \geq J(w) + (w' - w) \cdot J'(w). \tag{5.20}
\]
Adding the above two equations yields
\[
(w - w') \cdot (J'(w) - J'(w')) \geq 0. \tag{5.21}
\]
If w ≥ w', then this implies that J'(w) ≥ J'(w'). In other words, the gradient of a one dimensional convex function is monotonically non-decreasing.
Recall that minimizing a convex function is equivalent to finding w* such that J'(w*) = 0. Furthermore, it is easy to see that interval bisection maintains the invariant J'(a_t) < 0 and J'(b_t) > 0. This along with the monotonicity of the gradient suffices to ensure that w* ∈ (a_t, b_t). Setting w = w* in (5.19), and using the monotonicity of the gradient, allows us to write for any w' ∈ (a_t, b_t)
\[
J(w') - J(w^*) \leq (w' - w^*) \cdot J'(w') \leq (b_t - a_t) \cdot J'(U). \tag{5.22}
\]
¹ If the objective function is convex, then the one dimensional function obtained by restricting it along the search direction is also convex (Exercise 5.7).


Algorithm 5.2 Gradient Descent
1: Input: initial point w_0, precision ε
2: Set t = 0
3: while ‖∇J(w_t)‖ > ε do
4:    Compute η_t = argmin_η J(w_t − η∇J(w_t)), e.g., via Algorithm 5.1.
5:    w_{t+1} = w_t − η_t ∇J(w_t)
6:    t = t + 1
7: end while
8: Return: w_t
Since we halve the interval (a_t, b_t) at every iteration, it follows that (b_t − a_t) = (U − L)/2^t. Therefore
\[
J(w') - J(w^*) \leq \frac{(U - L) \cdot J'(U)}{2^t}, \tag{5.23}
\]
for all w' ∈ (a_t, b_t). In other words, to find an ε-accurate solution, that is, a w' with J(w') − J(w*) ≤ ε, we only need log(U − L) + log J'(U) + log(1/ε) < t iterations. An algorithm which converges to an ε accurate solution in O(log(1/ε)) iterations is said to be linearly convergent.
For multi-dimensional objective functions, one cannot rely on the monotonicity property of the gradient. Therefore, one needs more sophisticated
optimization algorithms, some of which we now describe.

5.2.2 Gradient Descent
Gradient descent (also widely known as steepest descent) is one of the simplest optimization techniques to implement for minimizing smooth functions of the form J : R^d → R. The basic idea is as follows: given a location w_t at iteration t, compute the gradient ∇J(w_t), and update
\[
w_{t+1} = w_t - \eta_t \nabla J(w_t), \tag{5.24}
\]
where the scalar η_t minimizes the one dimensional objective function J(w_t − η∇J(w_t)) with respect to η. See Algorithm 5.2 for details.
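A minimal sketch of this procedure (ours, not the book's code) is given below; since an exact line search is rarely available, we substitute a simple backtracking (Armijo) step-size rule for step 4 of Algorithm 5.2.

import numpy as np

def gradient_descent(J, gradJ, w0, eps=1e-6, max_iter=1000):
    """Gradient descent with a backtracking line search, cf. Algorithm 5.2."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = gradJ(w)
        if np.linalg.norm(g) <= eps:
            break
        eta = 1.0
        # halve the step until it produces sufficient decrease
        while J(w - eta * g) > J(w) - 0.5 * eta * (g @ g):
            eta *= 0.5
        w = w - eta * g
    return w

# usage sketch on the quadratic J(w) = 0.5 * w^T A w
A = np.diag([1.0, 10.0])
w_min = gradient_descent(lambda w: 0.5 * w @ A @ w, lambda w: A @ w, [5.0, 5.0])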
In order to analyze the convergence behavior of the gradient descent procedure, let us assume that J is strongly convex with modulus of strong convexity σ. Furthermore, we assume that J has a Lipschitz continuous gradient with modulus L, and that an exact line search is used to find η_t. For such functions, (5.15) implies that ‖∇J(w_t)‖² ≥ 2σ(J(w_t) − J(w*)). Therefore, ensuring that ‖∇J(w_t)‖ < ε is sufficient to ensure that J(w_t) − J(w*) < ε²/(2σ). In


particular, setting ε²/(2σ) equal to the desired accuracy yields a simple and easy to monitor stopping criterion.
Since J is assumed to have a Lipschitz continuous gradient, (5.17) implies that
\[
J(w_{t+1}) \leq J(w_t) - \eta_t \|\nabla J(w_t)\|^2 + \frac{L \eta_t^2}{2} \|\nabla J(w_t)\|^2.
\]
In particular, setting η_t = 1/L shows that
\[
J(w_{t+1}) \leq J(w_t) - \frac{1}{2L} \|\nabla J(w_t)\|^2.
\]
Combining this with ‖∇J(w_t)‖² ≥ 2σ(J(w_t) − J(w*)), one can write
\[
J(w_{t+1}) - J(w^*) \leq \left(1 - \frac{\sigma}{L}\right) (J(w_t) - J(w^*)).
\]
Letting c := 1 − σ/L, and applying the above equation recursively, we obtain
\[
J(w_t) - J(w^*) \leq c^t\, (J(w_0) - J(w^*)),
\]
which shows that J(w_t) − J(w*) ≤ ε after at most
\[
\frac{\log((J(w_0) - J(w^*))/\epsilon)}{\log(1/c)}
\]
iterations.
The above analysis shows that gradient descent with an exact line search enjoys linear rates of convergence, i.e., to obtain an ε-accurate solution we only require O(log(1/ε)) iterations. But it also points to a disturbing fact, namely, the number of iterations depends inversely on log(1/c). If we approximate log(1/c) = −log(1 − σ/L) ≈ σ/L, then it shows that convergence depends on the ratio L/σ. This ratio is called the condition number of a problem. If the problem is well conditioned, i.e., σ ≈ L, then gradient descent converges extremely fast. On the other hand, if σ ≪ L, then gradient descent requires many iterations. This is best illustrated with an example: Consider the quadratic objective function
\[
J(w) = \frac{1}{2} (w - w^*)^\top A (w - w^*) + c, \tag{5.25}
\]
where A is a symmetric positive definite matrix and c is a constant¹. This is clearly a convex function with minimum at w*, and J(w*) = c.
¹ In fact, we can rewrite any convex quadratic function J(w) = w^⊤Aw + b^⊤w + d in the form (5.25).
Recall that a twice differentiable convex function has modulus of strong convexity σ if, and only if, its Hessian satisfies ∇²J(w) ⪰ σI (see (5.12)). Similarly, the

modulus of Lipschitz continuity of the gradient L satisfies ∇²J(w) ⪯ LI (see (5.18)). Our quadratic function (5.25) is clearly twice differentiable, and its Hessian is simply the matrix A. Therefore σ = λ_min and L = λ_max, where λ_min (respectively λ_max) denotes the minimum (respectively maximum) eigenvalue of A. One can thus change the condition number of the problem by
varying the eigen-spectrum of the matrix A. We illustrate the behavior of
gradient descent on two different quadratic problems with different condition
numbers in Figure 5.6.

Fig. 5.6. Convergence of gradient descent with exact line search on two quadratic
problems (5.25). The problem on the left is ill-conditioned, whereas the problem
on the right is well-conditioned. We plot the contours of the objective function,
and the steps taken by gradient descent. As can be seen gradient descent converges
fast on the well conditioned problem, while it zigzags and takes many iterations to
converge on the ill-conditioned problem.

5.2.3 Higher Order Methods


Unlike gradient based methods, which minimize the first order Taylor approximation to the objective function, the methods we will discuss in this
section minimize a second order Taylor approximation to the objective function (see Figure 5.7). Concretely, let us assume that we want to minimize a
smooth, twice differentiable, strictly convex objective function J : R^d → R. We can compute a second order Taylor approximation to the objective function at w_t via
\[
Q_t(p) := J(w_t) + \langle \nabla J(w_t), p \rangle + \frac{1}{2} p^\top \nabla^2 J(w_t)\, p, \tag{5.26}
\]
where ∇J(w_t) denotes the gradient of J evaluated at w_t and ∇²J(w_t) is the Hessian of J evaluated at w_t. A Newton method minimizes this local quadratic model to obtain a direction of descent. Towards this end, take the


Algorithm 5.3 Newton Method
1: Input: initial point w_0, precision ε
2: Set t = 0
3: while ‖∇J(w_t)‖ > ε do
4:    Compute p_t := −∇²J(w_t)^{-1} ∇J(w_t)
5:    Compute η_t = argmin_η J(w_t + η p_t), e.g., via Algorithm 5.1.
6:    w_{t+1} = w_t + η_t p_t
7:    t = t + 1
8: end while
9: Return: w_t

gradient of Q_t(p) with respect to p and set it to zero to obtain:
\[
p_t := -\nabla^2 J(w_t)^{-1} \nabla J(w_t). \tag{5.27}
\]
Since J is assumed to be strictly convex, the Hessian matrix is positive definite, and hence invertible. Since we are only minimizing a model of the objective function, we perform a line search along the descent direction (5.27) to compute the step size η_t, which yields the next iterate:
\[
w_{t+1} = w_t + \eta_t p_t. \tag{5.28}
\]
Details can be found in Algorithm 5.3.
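As an illustration (ours, not the book's listing), a bare-bones Newton iteration with unit step length, i.e. without the line search of Algorithm 5.3, can be written as:

import numpy as np

def newton(gradJ, hessJ, w0, eps=1e-8, max_iter=50):
    """Newton iteration w_{t+1} = w_t - H(w_t)^{-1} grad J(w_t) with unit steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = gradJ(w)
        if np.linalg.norm(g) <= eps:
            break
        p = np.linalg.solve(hessJ(w), -g)   # Newton direction, cf. (5.27)
        w = w + p                           # unit step; a line search could be added
    return w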


Suppose w* denotes the minimum of J(w). We say that an algorithm exhibits quadratic convergence if the sequence of iterates {w_k} generated by the algorithm satisfies
\[
\|w_{k+1} - w^*\| \leq C \|w_k - w^*\|^2 \tag{5.29}
\]
for some constant C > 0. We now show that Newton's method exhibits quadratic convergence close to the optimum.
Theorem 5.10 (Quadratic convergence of Newton Method) Suppose J is twice differentiable, strongly convex, and the Hessian of J is bounded and Lipschitz continuous in a neighborhood of the solution w*. Furthermore, assume that ‖∇²J(w)^{-1}‖ ≤ M. Then the iterations w_{t+1} = w_t − ∇²J(w_t)^{-1}∇J(w_t) converge quadratically to w*, the minimizer of J.
Proof First notice that
\[
\nabla J(w_t) - \nabla J(w^*) = \int_0^1 \nabla^2 J(w_t + t(w^* - w_t))\, (w_t - w^*)\, dt. \tag{5.30}
\]


Next, using the fact that ∇²J(w_t) is invertible and that the gradient vanishes at the optimum (∇J(w*) = 0), write
\[
w_{t+1} - w^* = w_t - w^* - \nabla^2 J(w_t)^{-1} \nabla J(w_t) = \nabla^2 J(w_t)^{-1} \left[ \nabla^2 J(w_t)(w_t - w^*) - (\nabla J(w_t) - \nabla J(w^*)) \right]. \tag{5.31}
\]
Using (5.31), (5.30), and the Lipschitz continuity of ∇²J,
\[
\begin{aligned}
\left\| \nabla^2 J(w_t)(w_t - w^*) - (\nabla J(w_t) - \nabla J(w^*)) \right\|
&= \left\| \int_0^1 \left[ \nabla^2 J(w_t) - \nabla^2 J(w_t + t(w^* - w_t)) \right] (w_t - w^*)\, dt \right\| \\
&\leq \int_0^1 \left\| \nabla^2 J(w_t) - \nabla^2 J(w_t + t(w^* - w_t)) \right\| \, \|w_t - w^*\| \, dt \\
&\leq \|w_t - w^*\|^2 \int_0^1 L\, t\, dt = \frac{L}{2} \|w_t - w^*\|^2. \tag{5.32}
\end{aligned}
\]
Finally use (5.31) and (5.32) to conclude that
\[
\|w_{t+1} - w^*\| \leq \frac{L}{2} \left\| \nabla^2 J(w_t)^{-1} \right\| \|w_t - w^*\|^2 \leq \frac{LM}{2} \|w_t - w^*\|^2.
\]

Newton's method as we described it suffers from two major problems. First, it applies only to twice differentiable, strictly convex functions. Second, it involves computing and inverting the d × d Hessian matrix at every iteration, thus making it computationally very expensive. Although Newton's method can be extended to deal with positive semi-definite Hessian matrices, the computational burden often makes it unsuitable for large scale applications. In such cases one resorts to Quasi-Newton methods, which we now describe.
5.2.3.1 Quasi-Newton Methods
Unlike the Newton method, which computes the Hessian of the objective
function at every iteration, quasi-Newton methods never compute the Hessian; they approximate it from past gradients. Since they do not require
the objective function to be twice differentiable, quasi-Newton methods are
much more widely applicable. They are widely regarded as the workhorses of
smooth nonlinear optimization due to their combination of computational efficiency and good asymptotic convergence. The most popular quasi-Newton
algorithm is BFGS, named after its discoverers Broyden, Fletcher, Goldfarb,


Fig. 5.7. The blue solid line depicts the one dimensional convex function f(x) = x^4 + 20x^2 + x. The green dotted-dashed line represents the first order Taylor approximation to f(x), while the red dashed line represents the second order Taylor approximation, both evaluated at x = 2.

and Shanno. In this section we will describe BFGS and its limited memory counterpart L-BFGS.
Suppose we are given a smooth (not necessarily strictly) convex objective function J : R^d → R and a current iterate w_t ∈ R^d. Just like the Newton method, BFGS forms a local quadratic model of the objective function J:
\[
Q_t(p) := J(w_t) + \langle \nabla J(w_t), p \rangle + \frac{1}{2} p^\top H_t\, p. \tag{5.33}
\]
Unlike the Newton method, which uses the computed Hessian (5.26) to build its quadratic model, BFGS uses a matrix H_t ≻ 0 which is a positive-definite estimate of the Hessian. As before, ∇J denotes the gradient of J. A quasi-Newton direction of descent is found by minimizing Q_t(p):
\[
p_t = -H_t^{-1} \nabla J(w_t). \tag{5.34}
\]

The step size η_t > 0 is found by a line search obeying the Wolfe conditions:
\[
J(w_{t+1}) \leq J(w_t) + c_1 \eta_t \nabla J(w_t)^\top p_t \quad\text{(sufficient decrease)} \tag{5.35}
\]
\[
\nabla J(w_{t+1})^\top p_t \geq c_2 \nabla J(w_t)^\top p_t \quad\text{(curvature)} \tag{5.36}
\]
with 0 < c_1 < c_2 < 1. The final update is given by
\[
w_{t+1} = w_t + \eta_t p_t. \tag{5.37}
\]

Given w_{t+1} we need to update our quadratic model (5.33) to
\[
Q_{t+1}(p) := J(w_{t+1}) + \langle \nabla J(w_{t+1}), p \rangle + \frac{1}{2} p^\top H_{t+1}\, p. \tag{5.38}
\]
When updating our model it is reasonable to expect that the gradient of Q_{t+1} should match the gradient of J at w_t and w_{t+1}. Clearly,
\[
\nabla Q_{t+1}(p) = \nabla J(w_{t+1}) + H_{t+1}\, p, \tag{5.39}
\]
which implies that ∇Q_{t+1}(0) = ∇J(w_{t+1}), and hence our second condition is automatically satisfied. In order to satisfy our first condition, we require
\[
\nabla Q_{t+1}(-\eta_t p_t) = \nabla J(w_{t+1}) - \eta_t H_{t+1} p_t = \nabla J(w_t). \tag{5.40}
\]

By rearranging, we obtain the so-called secant equation:
\[
H_{t+1} s_t = y_t, \tag{5.41}
\]
where s_t := w_{t+1} − w_t and y_t := ∇J(w_{t+1}) − ∇J(w_t) denote the most recent step along the optimization trajectory in parameter and gradient space, respectively. Since H_{t+1} is a positive definite matrix, pre-multiplying the secant equation by s_t yields the curvature condition
\[
s_t^\top y_t > 0. \tag{5.42}
\]

If the curvature condition is satisfied, then there are an infinite number of matrices H_{t+1} which satisfy the secant equation (the secant equation represents n linear equations, but the symmetric matrix H_{t+1} has n(n+1)/2 degrees of freedom). To resolve this issue we choose the matrix closest to H_t which satisfies the secant equation. The key insight of BFGS comes from the observation that the descent direction computation (5.34) involves the inverse matrix B_t := H_t^{-1}. Therefore, we choose a matrix B_{t+1} := H_{t+1}^{-1} such that it is close to B_t and also satisfies the secant equation:
\[
\min_B \|B - B_t\| \tag{5.43}
\]
\[
\text{s.t.}\quad B = B^\top \quad\text{and}\quad B y_t = s_t. \tag{5.44}
\]
If the matrix norm is appropriately chosen [NW99], then it can be shown that
\[
B_{t+1} = (I - \rho_t s_t y_t^\top) B_t (I - \rho_t y_t s_t^\top) + \rho_t s_t s_t^\top, \tag{5.45}
\]
where ρ_t := (y_t^⊤ s_t)^{-1}. In other words, the matrix B_t is modified via an


Algorithm 5.4 LBFGS
1: Input: initial point w_0, precision ε
2: Set t = 0 and B_0 = I
3: while ‖∇J(w_t)‖ > ε do
4:    p_t = −B_t ∇J(w_t)
5:    Find η_t that obeys (5.35) and (5.36)
6:    s_t = η_t p_t
7:    w_{t+1} = w_t + s_t
8:    y_t := ∇J(w_{t+1}) − ∇J(w_t)
9:    if t = 0: B_t := (s_t^⊤ y_t / y_t^⊤ y_t) I
10:   ρ_t = (s_t^⊤ y_t)^{-1}
11:   B_{t+1} = (I − ρ_t s_t y_t^⊤) B_t (I − ρ_t y_t s_t^⊤) + ρ_t s_t s_t^⊤
12:   t = t + 1
13: end while
14: Return: w_t

incremental rank-two update, which is very efficient to compute, to obtain B_{t+1}.
Limited-memory BFGS (LBFGS) is a variant of BFGS designed for solving large-scale optimization problems where the O(d^2) cost of storing and updating B_t would be prohibitively expensive. LBFGS approximates the
quasi-Newton direction (5.34) directly from the last m pairs of st and yt via
a matrix-free approach. This reduces the cost to O(md) space and time per
iteration, with m freely chosen. Details can be found in Algorithm 5.4.
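For illustration, the following sketch (our own, not Algorithm 5.4 itself) shows the standard two-loop recursion that computes the product −B_t ∇J(w_t) directly from the last m pairs (s_i, y_i) without ever forming a matrix; it is one common way to realize the matrix-free approach mentioned above.

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate -H^{-1} grad from recent (s, y) pairs."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    # first loop: newest to oldest
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    # initial scaling of the inverse Hessian estimate
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # second loop: oldest to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r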
5.2.3.2 Conjugate Gradient
Conjugate gradient was first developed as an iterative method for solving a linear system of equations
\[
Ax = b, \tag{5.46}
\]
where A is an n × n symmetric positive definite matrix. The problem of computing the solution x* of (5.46) can equivalently be posed as minimizing the function
\[
\phi(x) = \frac{1}{2} x^\top A x - b^\top x. \tag{5.47}
\]
Conveniently, the gradient of φ is the residual of the linear system
\[
\nabla \phi(x) = Ax - b =: g(x). \tag{5.48}
\]


A key concept underlying the algorithm we are going to describe is that of conjugate directions.
Definition 5.11 (Conjugate Directions) Non-zero vectors p_i and p_j are said to be conjugate with respect to a symmetric positive definite matrix A if p_i^⊤ A p_j = 0 for i ≠ j.
Conjugate directions {p_0, ..., p_{n−1}} are linearly independent and form a basis. To see this, suppose the p_i's are not linearly independent. Then there exist non-zero coefficients σ_i such that \sum_i σ_i p_i = 0. The p_i's are conjugate directions, therefore p_j^⊤ A (\sum_i σ_i p_i) = \sum_i σ_i p_j^⊤ A p_i = σ_j p_j^⊤ A p_j = 0 for all j. Since A is positive definite this implies that σ_j = 0 for all j, a contradiction.
Since the conjugate directions form a basis, one can write the minimizer x* of (5.47) as
\[
x^* = \alpha_0 p_0 + \alpha_1 p_1 + \ldots + \alpha_{n-1} p_{n-1}.
\]
Premultiplying the above equation by p_t^⊤ A one obtains
\[
p_t^\top A x^* = \alpha_t\, p_t^\top A p_t,
\]
which implies that
\[
\alpha_t = \frac{p_t^\top A x^*}{p_t^\top A p_t} = \frac{p_t^\top b}{p_t^\top A p_t}. \tag{5.49}
\]
The advantage of using the conjugate directions now becomes apparent. The coefficients α_t are now expressed in terms of a known vector b, which implies that if we had a cheap way to compute the conjugate directions, then we could solve (5.46) efficiently. The following theorem shows how this idea can be extended to minimize the quadratic function (5.47) sequentially.
Theorem 5.12 Let {p_0, ..., p_{n−1}} denote a set of conjugate directions. For any x_0 ∈ R^n, the sequence {x_t} generated by
\[
x_{t+1} = x_t + \alpha_t p_t, \tag{5.50}
\]
with
\[
\alpha_t = -\frac{g_t^\top p_t}{p_t^\top A p_t}, \tag{5.51}
\]
and g_t := A x_t − b, converges to the solution x* of (5.47) after at most n steps. Furthermore,
\[
g_t^\top p_j = 0 \quad\text{for all } j < t. \tag{5.52}
\]


Proof Let x* denote the minimizer of (5.47). Since the p_i's form a basis,
\[
x^* - x_0 = \sigma_0 p_0 + \ldots + \sigma_{n-1} p_{n-1},
\]
for some scalars σ_i. Our proof strategy will be to show that the coefficients σ_t coincide with the α_t defined in (5.51). Towards this end we premultiply with p_t^⊤ A and use conjugacy to obtain
\[
\sigma_t = \frac{p_t^\top A (x^* - x_0)}{p_t^\top A p_t}. \tag{5.53}
\]
On the other hand, following the iterative process (5.50) from x_0 until x_t yields
\[
x_t - x_0 = \alpha_0 p_0 + \ldots + \alpha_{t-1} p_{t-1}.
\]
Again premultiplying with p_t^⊤ A and using conjugacy,
\[
p_t^\top A (x_t - x_0) = 0. \tag{5.54}
\]
Substituting (5.54) into (5.53) produces
\[
\sigma_t = \frac{p_t^\top A (x^* - x_t)}{p_t^\top A p_t} = -\frac{g_t^\top p_t}{p_t^\top A p_t}, \tag{5.55}
\]
thus showing that σ_t = α_t.


To prove (5.52) we use induction. For t = 0 the statement is trivially true.
Recall that gt+1 := Axt+1 b. Premultiplying with pj , and expanding xt+1
using (5.50) and (5.51) yields
pj gt+1 = pj

Axt

= p j gt

gt pt
Apt b
pt Apt

pj Apt
pt Apt

gt pt .

For j = t, both terms cancel out, while for j < t both terms vanish due
to the induction hypothesis as well as the fact that the pj are conjugate
directions.
In a nutshell, the above theorem already contains the conjugate gradient
descent algorithm: At each step, we perform gradient descent with respect
to one of the conjugate directions, which means that after n steps we will
reach the minimum. We still need a way to generate the conjugate directions.
It turns out that we can generate them on the fly efficiently.


Starting with any x_0 ∈ R^n define p_0 = −g_0 = b − A x_0 and set
\[
x_{t+1} = x_t + \alpha_t p_t \tag{5.56}
\]
\[
\alpha_t = -\frac{g_t^\top p_t}{p_t^\top A p_t} \tag{5.57}
\]
\[
p_{t+1} = -g_{t+1} + \beta_{t+1} p_t \tag{5.58}
\]
\[
\beta_{t+1} = \frac{g_{t+1}^\top A p_t}{p_t^\top A p_t} \tag{5.59}
\]
Note that the scalar β_{t+1} is found by the requirement that p_t and p_{t+1} must be conjugate directions. The following theorem asserts that the directions p_t are indeed conjugate directions:
Theorem 5.13 Suppose the t-th iterate generated by the conjugate gradient method (Equations (5.56) to (5.59)) is not the solution point x*. Then the following properties hold:
\[
\mathrm{span}\{g_0, g_1, \ldots, g_t\} = \mathrm{span}\{g_0, A g_0, \ldots, A^t g_0\}. \tag{5.60}
\]
\[
\mathrm{span}\{p_0, p_1, \ldots, p_t\} = \mathrm{span}\{g_0, A g_0, \ldots, A^t g_0\}. \tag{5.61}
\]
\[
p_t^\top A p_j = 0 \quad\text{for all } j < t. \tag{5.62}
\]

Proof The proof is by induction. Clearly (5.60), (5.61), and (5.62) hold
when t = 0. Assuming that they are true for some t, we prove that they
continue to hold for t + 1. Recall that gt+1 = Axt+1 b. Using (5.50) and
(5.51) one can conclude that
gt+1 = Axt + t Apt b = gt + t Apt .
By our induction hypothesis gt span{g0 , Ag0 , . . . , At g0 }, while Apt
span{Ag0 , A2 g0 , . . . , At+1 g0 }. Combining the two it is easy to see that gt+1
span{g0 , Ag0 , . . . , At+1 g0 }. On the other hand, (5.52) implies that gt+1 is orthogonal to {p0 , p1 , . . . , pt }. Therefore, gt+1
/ span{p0 , p1 , . . . , pt }, thus our
induction assumption implies that gt+1
/ span{g0 , Ag0 , . . . , At g0 }. This allows us to conclude that span{g0 , g1 , . . . , gt+1 } = span{g0 , Ag0 , . . . , At+1 g0 }.
The proof of (5.61) is immediate by using (5.58) and (5.60).
To show (5.62) we use (5.58) to write
pt+1 Apj = gt+1 Apj + t+1 pt Apj
By the definition of t+1 (5.59) the above expression vanishes for j = t. For
j < t, the first term is zero because Apj span{p0 , p1 , . . . , pj+1 }, a subspace
orthogonal to gt+1 by (5.52). The induction hypothesis guarantees that the
second term is zero.

5.2 Unconstrained Smooth Convex Minimization

147

A practical implementation of CG requires two more observations: First,


using (5.58) and (5.52) it follows that gt pt = gt gt t gt pt1 = gt gt .
Therefore (5.57) simplifies to
t =

gt gt
.
pt Apt

(5.63)

Second, using BUGBUG it follows that Apt = 1t (gt+1 gt ). But gt


span{p0 , . . . , pt }, a subspace orthogonal to gt+1 . Therefore gt+1 Apt = 1t (gt+1 gt+1 ).
Substituting this back into (5.59) and using (5.63) yields
t+1 =

gt+1 gt+1
.
gt gt

(5.64)

We summarize the CG algorithm in Algorithm BUGBUG.


Given that the CG algorithm can be viewed as a minimization algorithm
for the convex quadratic function , it is natural to ask if this idea can
be extended for minimizing a general convex function. The main difficulty
here is that Theorem 5.13 does not hold. In spite of this, conjugate gradient
has proven to be effective even in this setting. Basically the update rules
for gt and pt remain the same, but the parameters t and t are computed
differently. Table 5.1 gives an overview of different extensions. See [NW99,
Lue84] for details.

Table 5.1. Non-Quadratic modifications of Conjugate Gradient Descent


Generic Method

Compute Hessian Kt := 2 f (xt ) and update t


and t with
t = p

Fletcher-Reeves

g t pt
Kt pt

and t =

gt+1 Kt pt
pt Kt pt

Find t via a line search and set


t = argmin f (xt + pt ) and t =

Polak-Ribi`ere

gt+1 gt+1
.
gt gt

Find t via a line search and set


t ) gt+1
t = argmin f (xt + pt ) and t = (gt+1 g
.
gt gt
In practice, Polak-Ribi`ere tends to be better than
Fletcher-Reeves.

algorithm, key ideas, only reference to proof (but state theorem).

148

5 Optimization

5.2.3.3 Bundle Methods


The methods we discussed above are applicable for minimizing smooth, convex objective functions. Some regularized risk minimization problems involve
a non-smooth objective function. In such cases, one needs to use bundle
methods. In order to lay the ground for bundle methods we first describe
their precursor the cutting plane method [Kel60]. Cutting plane method is
based on a simple observation: A convex function is bounded from below by
its linearization (i.e., first order Taylor approximation). see Figures 5.4 and
5.5 for geometric intuition. Mathematically:
J(w) J(w ) + w w , s

w.

(5.65)

Given subgradients s1 , s2 , . . . , st evaluated at locations w0 , w1 , . . . , wt1 , we


can construct a tighter (piecewise linear) lower bound for J as follows (also
see Figure 5.8):
J(w) JtCP (w) := max {J(wi1 ) + w wi1 , si }.
1it

(5.66)

t1
Given iterates {wi }i=0
, the cutting plane method minimizes JtCP to obtain
the next iterate wt :

wt := argmin JtCP (w).

(5.67)

This iteratively refines the piecewise linear lower bound J CP and allows us
to get close to the minimum of J (see Figure 5.8 for an illustration).
If w denotes the minimizer of J, then clearly each J(wi ) J(w ) and
hence min0it J(wi ) J(w ). On the other hand, since J JtCP it follows that J(w ) JtCP (wt ). In other words, J(w ) is sandwiched between
min0it J(wi ) and JtCP (wt ) (see Figure 5.9 for an illustration). The cutting
plane method monitors the monotonically decreasing quantity
t

:= min J(wi ) JtCP (wt ),


0it

(5.68)

and terminates whenever t falls below a predefined threshold . This ensures


that the solution J(wt ) is optimum, that is, J(wt ) J(w ) + .
Although cutting plane method was shown to be convergent [Kel60], it is
well known (see e.g., [LNN95, Bel05]) that it can be very slow when new
iterates move too far away from the previous ones (i.e., causing unstable
zig-zag behavior in the iterates). In fact, in the worst case the cutting
plane method might require exponentially many steps to converge to an
optimum solution.
Bundle methods stabilize CPM by augmenting the piecewise linear lower

5.2 Unconstrained Smooth Convex Minimization

149

Fig. 5.8. A convex function (blue solid curve) is bounded from below by its linearizations (dashed lines). The gray area indicates the piecewise linear lower bound
obtained by using the linearizations. We depict a few iterations of the cutting plane
method. At each iteration the piecewise linear lower bound is minimized and a new
linearization is added at the minimizer (red rectangle). As can be seen, adding more
linearizations improves the lower bound.

(e.g., JtCP (w) in (5.66)) with a prox-function (i.e., proximity control function) which prevents overly large steps in the iterates [Kiw90]. Roughly
speaking, there are 3 popular types of bundle methods, namely, proximal
[Kiw90], trust region [SZ92], and level set [LNN95]. All three versions use
2
1
as their prox-function, but differ in the way they compute the new
2
iterate:
t
ww
t1 2 + JtCP (w)},
2
w
1
ww
t1 2 t },
trust region: wt := argmin{JtCP (w) |
2
w
1
t1 2 | JtCP (w) t },
level set: wt := argmin{ w w
2
w
proximal:

wt := argmin{

(5.69)
(5.70)
(5.71)

where w
t1 is the current prox-center, and t , t , and t are positive tradeoff parameters of the stabilization. Although (5.69) can be shown to be
equivalent to (5.70) for appropriately chosen t and t , tuning t is rather
difficult while a trust region approach can be used for automatically tuning

150

5 Optimization

Fig. 5.9. A convex function (blue solid curve) with four linearizations evaluated at
four different locations (magenta circles). The approximation gap 3 at the end of
fourth iteration is indicated by the height of the cyan horizontal band i.e., difference
between lowest value of J(w) evaluated so far and the minimum of J4CP (w) (red
diamond).

t . Consequently the trust region algorithm BT of [SZ92] is widely used in


practice.

5.3 Constrained Optimization


So far our focus was on unconstrained optimization problems. Many machine learning problems involve constraints, and can often be written in the
following canonical form:
min f (x)

(5.72a)

s. t. ci (x) 0 for i I

(5.72b)

ei (x) = 0 for i E

(5.72c)

where both ci and ei are convex functions. We say that x is feasible if and
only if it satisfies the constraints, that is, ci (x) 0 for i I and ei (x) = 0
for i E.
Recall that x is the minimizer of an unconstrained problem if and only
if f (x) = 0 (see Lemma 5.6). Unfortunately, when constraints are present
one cannot use this simple characterization of the solution. For instance, the
x at which f (x) = 0 may not be a feasible point. To illustrate, consider

5.3 Constrained Optimization

151

the following simple minimization problem (see Figure 5.10):


1 2
x
2
s. t. 1 x 2 0.
min

(5.73a)

Clearly,

1 2
2x

(5.73b)

is minimized at x = 0, but because of the presence of the

14
12
10
f(x)

8
6
4
2
0
6

0
x

Fig. 5.10. The unconstrained minimizer of the quadratic function 12 x2 is attained


at x = 0 (red circle). But, if we enforce the constraints 1 x 2 (illustrated by
the shaded area) then the minimizer is attained at x = 1 (green diamond).

constrains, the minimum of (5.73) is attained at x = 1 where f (x) = x


is equal to 1. Clearly, we need other ways to characterize the minimum of
constrained optimization problems.

5.3.1 Lagrange Duality


Lagrange duality plays a central role in constrained convex optimization.
The basic idea here is to augment the objective function (5.72) with a
weighted sum of the constraint functions by defining the Lagrangian:
L(x, , ) = f (x) +

i ci (x) +
iI

i ei (x)

(5.74)

iE

for i 0 and i R. In the sequel, we will refer to (respectively ) as the


Lagrange multipliers associated with the inequality (respectively equality)
constraints. Furthermore, we will call and dual feasible if and only if
i 0 and i R. The Lagrangian satisfies the following fundamental
property, which makes it extremely useful for constrained optimization.

152

5 Optimization

Theorem 5.14 The Lagrangian (5.74) of (5.72) satisfies


f (x) if x is feasible

max L(x, , ) =

otherwise.

0,

In particular, if P denotes the optimal value of (5.72), then


P = min max L(x, , ).
x

0,

Proof First assume that x is feasible, that is, ci (x) 0 for i I and
ei (x) = 0 for i E. Since i 0 we have
i ei (x) 0,

i ci (x) +

(5.75)

iE

iI

with equality being attained by setting i = 0 whenever ci (x) < 0. Consequently,


max L(x, , ) = max f (x) +

0,

0,

i ci (x) +
iI

i ei (x) = f (x)
iE

whenever x is feasible. On the other hand, if x is not feasible then either


ci (x) > 0 or ei (x) = 0 for some i . In the first case simply let i to
see that max0, L(x, , ) . Similarly, when ei (x) = 0 let i if
ei (x) > 0 or i if ei (x) < 0 to arrive at the same conclusion.
If define the Lagrange dual function
g(, ) = min L(x, , ),
x

(5.76)

for 0 and , then one can prove the following property, which is often
called as weak duality.
Theorem 5.15 (Weak Duality) The Lagrange dual function (5.76) satisfies
g(, ) f (x)
for all feasible x and 0 and . In particular
D := max min L(x, , ) min max L(x, , ) = P .
0,

0,

Proof As before, observe that whenever x is feasible


i ei (x) 0.

i ci (x) +
iI

iE

(5.77)

5.3 Constrained Optimization

153

Therefore
x

i ei (x) f (x)

i ci (x) +

g(, ) = min L(x, , ) = min f (x) +

iE

iI

for all feasible x and 0 and . In particular, one can choose x to be the
minimizer of (5.72) and 0 and to be maximizers of g(, ) to obtain
(5.77).
Weak duality holds for any arbitrary function, not-necessarily convex. When
the objective function and constraints are convex, and certain technical conditions hold then we can say more.
Theorem 5.16 (Strong Duality) BUGBUG
The proof of the above theorem is quite technical and can be found in
any standard reference (e.g., [BV04]). Therefore we will omit the proof and
proceed to discuss various implications of strong duality. First note that
strong duality implies
min max L(x, , ) = max min L(x, , ).
x

0,

0,

(5.78)

In other words, one can switch the order of minimization over x with maximization over and . This is called the saddle point property of convex
functions.
Suppose the primal and dual optimal values are attained at x and ( , )
respectively, and consider the following line of argument:
f (x ) = g( , )

(5.79a)
i ci (x) +

= min f (x) +
x

iI

f (x ) +

i ej (x)

i ci (x ) +
iI

(5.79b)

iE

i ei (x )

(5.79c)

iE

f (x ).

(5.79d)

To write (5.79a) we used strong duality, while (5.79c) obtains by setting


x = x in (5.79c). Finally, to obtain (5.79d) we used the fact that x is
feasible and hence (5.75) holds. Since (5.79) holds with equality, one can
conclude that the following complementary slackness condition:
i ci (x ) +
iI

i ei (x ) = 0.
iE

154

5 Optimization

In other words, i ci (x ) = 0 or equivalently i = 0 whenever ci (x) <


0. Furthermore, since x minimizes L(x, , ) over x, it follows that its
gradient must vanish at x , that is,
i ei (x ) = 0.

i ci (x ) +

f (x ) +

iE

iI

Putting everything together, we obtain


ci (x ) 0

i I

(5.80a)

ej (x ) = 0 i E

(5.80b)

(5.80c)

)=0

(5.80d)

i ei (x ) = 0.

(5.80e)

i ci (x
f (x ) +

i ci (x ) +
iI

iE

The above conditions are called the KKT conditions. If the primal problem
is convex, then the KKT conditions are both necessary and sufficient. In
satisfy (5.80) then x
are primal and
other words, if x
and (
, )
and (
, )
dual optimal with zero duality gap. To see this note that the first conditions
show that x
is feasible. Since i 0, L(x, , ) is convex in x. Finally the
last condition states that x
maximizes L(x, , ). Since
i ci (
x) = 0 and
ej (
x) = 0, we have
= L(

g(
, )
x,
, )
n

i ci (x ) +

= f (
x) +
i=1

j ej (x )
j=1

= f (
x).
5.3.2 Linear and Quadratic Programs
So far we discussed general constrained optimization problems. Many machine learning problems have special structure and can therefore be reduced
to a linear or quadratic program. We discuss the implications of duality for
these class of problems.
An optimization problem with a linear objective function and (both equality and inequality) linear constraints is said to be a linear program (LP). A
canonical linear program is of the following form:
min c x

(5.81a)

s. t. Ax = b, x 0.

(5.81b)

5.3 Constrained Optimization

155

Here x and c are n dimensional vectors, while b is a m dimensional vector,


and A is a m n matrix with m < n.
Suppose we are given a LP of the form:
min c x

(5.82a)

s. t. Ax b,

(5.82b)

we can transform it into a canonical LP by introducing non-negative slack


variables
min c x

(5.83a)

s. t. Ax = b, 0.

(5.83b)

Next, we split x into its positive and negative parts x+ and x respectively

by setting x+
i = max(0, xi ) and xi = max(0, xi ). Using these new variables
we rewrite (5.83) as

+
c
x

min
(5.84a)
c
x
x
0

+
+
x
x
s. t. A A I x = b, x 0,
(5.84b)

thus yielding a canonical LP (5.81) in the variables x+ , x and .


By introducing non-negative Lagrange multipliers and s one can write
the Lagrangian of (5.81) as
L(x, , s) = c x (Ax b) s x.

(5.85)

Taking gradients with respect to the primal and dual variables and setting
them to zero obtains
A +s=c

(5.86a)

Ax = b

(5.86b)

s x=0

(5.86c)

(5.86d)

s 0.

(5.86e)

Condition (5.86c) can be simplified by noting that both x and s are constrained to be non-negative, therefore s x = 0 if, and only if, si xi = 0 for
i = 1, . . . , n.

156

5 Optimization

Substituting (5.86a) into the objective of (5.81), and using (5.86b) and
(5.86c) one can eliminate the primal variable x to obtain the following dual
LP
max b

(5.87a)

,s

s.t. A + s = c, 0, s 0.

(5.87b)

This can easily be converted into a canonical LP as follows


max

b
0

s.t.

,s

s
I

(5.88a)

= c,

0.

(5.88b)

It can be easily verified that the primal-dual problem is symmetric; by taking


the dual of the dual we recover the primal. One important thing to note
however is that the primal (5.81) involves n variables and n + m constraints,
while the dual (5.88) involves n + m variables and 2m + n constraints.
An optimization problem with a convex quadratic objective function and
linear constraints is said to be a convex quadratic program (QP). They
are rather important because many machine learning algorithms require the
solution of a convex QP. The canonical convex QP can be written as follows:
1
x Gx + x d
2
s.t. ai x = bi for i E

(5.89b)

ai x bi for i I

(5.89c)

min
x

(5.89a)

Here G 0 is a n n positive semi-definite matrix, E and I are finite set of


indices, while d and ai are n dimensional vectors, and bi are scalars.
As a warm up let us consider, the arguably simpler, equality constrained
quadratic programs. In this case, we can stack the ai into a matrix A and
the bi into a vector b to write the QP as
1
x Gx + x d
2
s.t. Ax = b

min
x

(5.90a)
(5.90b)

By introducing non-negative Lagrange multipliers the Lagrangian of the


above optimization problem can be written as
1
L(x, ) = x Gx + x d (Ax b).
2

(5.91)

5.4 Stochastic Optimization

157

To find the saddle point of the Lagrangian we take gradients with respect
to x and and set them to zero. This obtains
Gx + d A = 0
Ax = b.
Putting these two conditions together yields the following linear system of
equations
G A
A
0

d
b

Furthermore, the KKT conditions ensure that


(Ax b) = 0.
This can be interpreted as follows: Let P = {i : ai x = bi }, then the KKT
conditions ensure that
5.4 Stochastic Optimization
Recall that regularized risk minimization involves a data-driven optimization
problem in which the objective function involves the summation of loss terms
over a set of data to be modeled:
min J(f ) := (f ) +
f

1
m

l(f (xi ), yi ).
i=1

Classical optimization techniques must compute this sum in its entirety for
each evaluation of the objective, respectively its gradient. As available data
sets grow ever larger, such batch optimizers therefore become increasingly
inefficient. They are also ill-suited for the incremental setting, where partial
data must be modeled as it arrives.
Stochastic gradient-based methods, by contrast, work with gradient estimates obtained from small subsamples (mini-batches) of training data. This
can greatly reduce computational requirements: on large, redundant data
sets, simple stochastic gradient descent routinely outperforms sophisticated
second-order batch methods by orders of magnitude.
The key idea here is that J(w) is replaced by an instantaneous estimate
Jt which is computed from a mini-batch of size k comprising of a subset of
points (xti , yit ) with i = 1, . . . , k drawn from the dataset:
Jt (w) = (w) +

1
k

l(w, xti , yit ).


i=1

(5.92)

158

5 Optimization

Algorithm 5.5 Stochastic Gradient Descent


1: Input: Maximum iterations T , batch size k, and
2: Set t = 0 and w0 = 0
3: while t < T do
4:
Choose a subset of k data points (xti , yit )
5:
6:
7:
8:
9:

+t

Compute step size t =


wt+1 = wt t Jt (wt )
t=t+1
end while
Return: wT

Setting k = 1 obtains an algorithm which processes data points as they


arrive.

5.4.1 Stochastic Gradient Descent


Perhaps the simplest stochastic optimization algorithm is Stochastic Gradient Descent (SGD). The parameter update of SGD takes the form:
wt+1 = wt t Jt (wt ).

(5.93)

If Jt is not differentiable, then one can choose an arbitrary subgradient from


Jt (wt ) to compute the update. It has been shown that SGD asymptotically

converges to the true minimizer of J(w) if the step size t decays as O(1/ t).
For instance, one could set
t =

,
+t

(5.94)

where > 0 is a tuning parameter. See Algorithm 5.5 for details.


5.4.1.1 Practical Considerations
One simple yet effective rule of thumb to tune is to select a small subset
of data, try various values of on this subset, and choose the that most
reduces the objective function.
In some cases letting t to decay as O(1/t) has been found to be more
effective:

t =
.
(5.95)
+t
The free parameter > 0 can be tuned as described above.

5.5 Nonconvex Optimization

159

If (w) is strongly convex with modulus , then dividing the step size t
by yields good practical performance.
Finally, many sophisticated step size adaptation algorithms such as SMD
have been also been proposed for automatically tuning t .

5.5 Nonconvex Optimization


5.5.1 BFGS
and variants (LBFGS, oLBFGS)

5.5.2 Randomization
randomized maximization

5.5.3 Concave-Convex Procedure


Any function with a bounded Hessian can be decomposed into the difference
of two (non-unique) convex functions, that is, one can write
J(w) = f (w) g(w),

(5.96)

where f and g are convex functions. Clearly, J is not convex, but there
exists a reasonably simple algorithm namely the Concave-Convex Procedure
(CCP) for finding a local minima of J. The basic idea is simple: In the
tth iteration replace g by its first order Taylor expansion at wt , that is,
g(wt ) + w wt , g(wt ) and minimize
Jt (w) = f (w) g(wt ) w wt , g(wt ) .

(5.97)

Taking gradients and setting it to zero shows that Jt is minimized by setting


f (wt+1 ) = g(wt ).

(5.98)

The iterations of CCP on a toy minimization problem is illustrated in Figure


5.11, while the complete algorithm listing can be found in Algorithm 5.6.
Theorem 5.17 Let J be a function which can be decomposed into a difference of two convex functions e.g., (5.96). The iterates generated by (5.98)
monotically decrease J. Furthermore, the stationary point of the iterates is
a local minima of J.

160

5 Optimization
10

200

20

150

30
100

40
50

50

60
0

70
801.0

1.5

2.0

2.5

3.0

3.5

4.0

501.0

1.5

2.0

2.5

3.0

3.5

4.0

Fig. 5.11. Given the function on the left we decompose it into the difference of two
convex functions depicted on the right panel. The CCP algorithm generates iterates
by matching points on the two convex curves which have the same tangent vectors.
As can be seen, the iterates approach the solution x = 2.0.

Algorithm 5.6 Concave-Convex Procedure


1: Input: Initial point w0 , maximum iterations T , convex functions f , rg
2: Set t = 0
3: while t < T do
4:
Set wt+1 = argminw f (w) g(wt ) w wt , g(wt )
5:
t=t+1
6: end while
7: Return: wT

Proof Since f and g are convex


f (wt ) f (wt+1 ) + wt wt+1 , f (wt+1 )
g(wt+1 ) g(wt ) + wt+1 wt , g(wt ) .
Adding the two inequalities, rearranging, and using (5.98) shows that J(wt ) =
f (wt ) g(wt ) f (wt+1 ) g(wt+1 ) = J(wt+1 ), as claimed.
Let w be a stationary point of the iterates. Then f (w ) = g(w ),
which in turn implies that w is a local minima of J because J(w ) = 0.
There are a number of extensions to CCP. We mention only a few in the
passing. First, it can be shown that all instances of the EM algorithm (Section ??) can be shown to be special cases of CCP. Second, the rate of convergence of CCP is related to the eigenvalues of the positive semi-definite
matrix 2 (f + g). Third, CCP can also be extended to solve constrained

5.5 Nonconvex Optimization

161

problems of the form:


min f0 (w) g0 (w)
w

s.t. fi (w) gi (w) ci for i = 1, . . . , n.


where, as before, fi and gi for i = 0, 1, . . . , n are assumed convex. At every
iteration, we replace gi by its first order Taylor approximation and solve the
following constrained convex problem:
min f0 (w) g0 (wt ) + w wt , g0 (wt )
w

s.t. fi (w) gi (wt ) + w wt , gi (wt ) ci for i = 1, . . . , n.


Problems
Problem 5.1 (Intersection of Convex Sets {1}) If C1 and C2 are convex sets, then show that C1 C2 is also convex. Extend your result to show
that ni=1 Ci are convex if Ci are convex.
Problem 5.2 (Linear Transform of Convex Sets {1}) Given a set C
Rn and a linear transform A Rmn , define AC := {y = Ax : x C}. If
C is convex then show that AC is also convex.
Problem 5.3 (Convex Combinations {1}) Show that a subset of Rn is
convex if and only if it contains all the convex combination of its elements.
Problem 5.4 (Convex Hull {2}) Show that the convex hull, conv(X) is
the smallest convex set which contains X.
Problem 5.5 (Epigraph of a Convex Function {2}) Show that a function satisfies Definition 5.3 if, and only if, its epigraph is convex.
Problem 5.6 (Strong Convexity and Lipschitz Continuous Gradient {2})
Prove 5.13 and 5.17.
Problem 5.7 (One Dimensional Projection {1}) If f : Rd R is
convex, then show that for an arbitrary x and p in Rd the one dimensional
function () := f (x + p) is also convex.
Problem 5.8 (Quasi-Convex Functions {2}) In Section 5.1 we showed
that the below-sets of a convex function Xc := {x | f (x) c} are convex. Give
a counter-example to show that the converse is not true, that is, there exist
non-convex functions whose below-sets are convex. This class of functions is
called Quasi-Convex.

6
Conditional Densities

6.1 Conditional Exponential Models


6.1.1 Basic Model
- definition - log partition function

6.1.2 Joint Feature Map


- definition - kernels

6.1.3 Optimization
- distribution over natural parameter (for posterior) - newton method and
bundle method

6.1.4 Gaussian Process Link


- joint random variables - reparameterization and integrating out - connection between norm and gp posterior

6.2 Binary Classification


6.2.1 Binomial Model
logistic, posterior simple model

6.2.2 Optimization
Newton method why the posterior isnt nice

6.3 Regression
6.3.1 Conditionally Normal Models
fixed variance
163

164

6 Conditional Densities

6.3.2 Posterior Distribution


integrating out vs. Laplace approximation, efficient estimation (sparse greedy)

6.3.3 Heteroscedastic Estimation


explain that we have two parameters. not too many details (do that as an
assignment).

6.4 Multiclass Classification


6.4.1 Conditionally Multinomial Models
joint feature map

6.5 What is a CRF?


Motivation with learning a digit example
general definition
Gaussian process + structure = CRF

6.5.1 Linear Chain CRFs


Graphical model
Applications
Optimization problem

6.5.2 Higher Order CRFs


2-d CRFs and their applications in vision
Skip chain CRFs
Hierarchical CRFs (graph transducers, sutton et. al. JMLR etc)

6.5.3 Kernelized CRFs

From feature maps to kernels


The clique decomposition theorem
The representer theorem
Optimization strategies for kernelized CRFs

6.6 Optimization Strategies

165

6.6 Optimization Strategies


6.6.1 Getting Started
three things needed to optimize
MAP estimate
log-partition function
gradient of log-partition function
Worked out example (linear chain?)

6.6.2 Optimization Algorithms


- Optimization algorithms (LBFGS, SGD, EG (Globerson et. al))

6.6.3 Handling Higher order CRFs


- How things can be done for higher order CRFs (briefly)

6.7 Hidden Markov Models

Definition
Discuss that they are modeling joint distribution p(x, y)
The way they predict is by marginalizing out x
Why they are wasteful and why CRFs generally outperform them

6.8 Further Reading


What we did not talk about:
Details of HMM optimization
CRFs applied to predicting parse trees via matrix tree theorem (collins,
koo et al)
CRFs for graph matching problems
CRFs with Gaussian distributions (yes they exist)

6.8.1 Optimization
issues in optimization (blows up with number of classes). structure is not
there. can we do better?

166

Problems
Problem 6.1 Poisson models
Problem 6.2 Bayes Committee Machine
Problem 6.3 Newton / CG approach

6 Conditional Densities

7
Kernels and Function Spaces

7.1 Kernels
7.1.1 Feature Maps
give examples, linear classifier, nonlinear ones with r2-r3 map

7.1.2 The Kernel Trick


7.1.3 Examples of Kernels
gaussian, polynomial, linear, texts, graphs
- stress the fact that there is a difference between structure in the input
space and structure in the output space

7.2 Algorithms
7.2.1 Kernel Perceptron
7.2.2 Trivial Classifier
7.2.3 Kernel Principal Component Analysis
7.3 Reproducing Kernel Hilbert Spaces
7.3.1 Hilbert Spaces
evaluation functionals, inner products

7.3.2 Theoretical Properties


Mercers theorem, positive semidefiniteness

7.3.3 Regularization
Representer theorem, regularization
167

168

7 Kernels and Function Spaces

7.4 Banach Spaces


7.4.1 Properties
7.4.2 Norms and Convex Sets
- smoothest function (L2) - smallest coefficients (L1) - structured priors
(CAP formalism)
Problems

8
Linear Models
A hyperplane in a dot product space H is described by the set
{x H| w, x + b = 0},

(8.1)

where w H and b R. Such a hyperplane naturally divides H into two


half-spaces: {x H| w, x + b 0} and {x H| w, x + b < 0}, and
hence can be used as the decision boundary of a binary classifier. In this
chapter we will study a number of algorithms which employ such linear
decision boundaries. Although such models look restrictive at first glance,
when combined with kernels (Chapter 7) they yield a large class of useful
algorithms.
All the algorithms we will study in this chapter maximize the margin.
Given a set X = {x1 , . . . , xm }, the margin is the distance of the closest point
in X to the hyperplane (8.1). Elementary geometric arguments (Exercise
BUGBUG) show that the distance of a point xi to a hyperplane is given by
| w, xi + b |/ w , and hence the margin is simply
min

i=1,...,m

| w, xi + b |
.
w

(8.2)

Note that the parameterization of the hyperplane (8.1) is not unique; if we


multiply both w and b by the same non-zero constant, then we obtain the
same hyperplane. One way to resolve this ambiguity is to set
min | w, xi + b| = 1.

i=1,...m

In this case, the margin simply becomes 1/ w . We postpone justification


of margin maximization for later and jump straight ahead to the description
of various algorithms.
8.1 Support Vector Classification
Consider a binary classification task, where we are given a training set
{(x1 , y1 ), . . . , (xm , ym )} with xi H and yi {1}. Our aim is to find
a hyperplane parameterized by (w, b) such that w, xi + b 0 whenever
yi = +1 and w, xi + b < 0 whenever yi = 1. Furthermore, as discussed
169

170

8 Linear Models

above, we fix the scaling of w by requiring mini=1,...m | w, xi + b | = 1. A


compact way to write our desiderata is to require yi ( w, xi + b) 1 for all
i. The problem of maximizing the margin therefore reduces to
1
w

max
w,b

(8.3a)

s.t. yi ( w, xi + b) 1 for all i,

(8.3b)

or equivalently
1
w 2
w,b 2
s.t. yi ( w, xi + b) 1 for all i.
min

(8.4a)
(8.4b)

This is a constrained convex optimization problem with a quadratic objective function and linear constraints (see Section 5.3). In deriving (8.4) we
implicitly assumed that the data is linearly separable, that is, there is a
hyperplane which correctly classifies the training data. Such a classifier is
called a hard margin classifier. If the data is not linearly separable, then
(8.4) does not have a solution. To deal with this situation we introduce
non-negative slack variables i to relax the constraints:
yi ( w, xi + b) 1 i .
Given any w and b the constraints can now be satisfied by making i large
enough. This renders the whole optimization problem useless. Therefore, one
has to penalize large i . This is done via the following modified optimization
problem:
min

w,b,

1
w
2

C
m

(8.5a)

i=1

s.t. yi ( w, xi + b) 1 i for all i


i 0,

(8.5b)
(8.5c)

where C > 0 is a penalty parameter. The resultant classifier is said to be a


soft margin classifier.
By introducing non-negative Lagrange multipliers i and i one can write
the Lagrangian
L(w, b, , , ) =

1
w
2

C
m

i=1

i (1 i yi ( w, xi + b))

i +
i=1

i i .
i=1

8.1 Support Vector Classification

171

Taking gradients with respect to w, b and and setting them to 0 yields


m

w L = w

i yi xi = 0

(8.6)

i=1
m

b L =

i yi = 0

(8.7)

i=1

i L =

C
i i = 0.
m

(8.8)

Substituting (8.6), (8.7), and (8.8) into the Lagrangian and simplifying yields
the dual objective function:

1
2

yi yj i j xi , xj +
i,j

i ,

(8.9)

i=1

which needs to be maximized with respect to . For notational convenience


we will minimize the negative of (8.9) below. Next we turn our attention
to the dual constraints. Recall that i 0 and i 0. Using this and
C
. Furthermore, by (8.7) m
(8.8) immediately yields 0 i m
i=1 i yi = 0.
Therefore, the dual optimization problem boils down to
m

1
min

yi yj i j xi , xj
i,j

(8.10a)

i=1

s.t.

i yi = 0

(8.10b)

C
.
m

(8.10c)

i=1

0 i

If we let H be a m m matrix with entries Hij = yi yj xi , xj , e be a m


dimensional vector of all ones, be a vector whose entries are i , and y be
a vector whose entries are yi , then the above dual can be compactly written
as the following Quadratic Program (QP) (Section 5.3.2):
1
H e
2
s.t. y = 0
C
0 i .
m
min

(8.11a)
(8.11b)
(8.11c)

Before turning our attention to algorithms for solving (8.11), a number


of observations are in order. First, note that computing H only requires
computing dot products between training examples. If we map the input

172

8 Linear Models

data to a Reproducing Kernel Hilbert Space (RKHS) via a feature map ,


then we can still compute the entries of H and solve for the optimal . In
this case, Hij = yi yj (xi ), (xj ) = yi yj k(xi , xj ), where k is the kernel
associated with the RKHS. Given the optimal , one can easily recover the
decision boundary. This is a direct consequence of (8.6), which allows us to
write w as a linear combination of the training data:
m

w=

i yi (xi ),
i=1

and hence the decision boundary as


m

w, x + b =

i yi k(xi , x) + b.

(8.12)

i=1

As a consequence of the KKT conditions (Section 5.3) we have


i (1 i yi ( w, xi + b)) = 0 and i i = 0.
We now consider three cases for yi ( w, xi + b) and the implications of the
KKT conditions.
yi ( w, xi + b) < 1: In this case, i > 0, and hence the KKT conditions
C
(see (8.8)). Such points are
imply that i = 0. Consequently, i = m
said to be margin errors.
yi ( w, xi + b) > 1: In this case, i = 0, (1i yi ( w, xi +b)) < 0, and by
the KKT conditions i = 0. Such points are said to be well classified.
It is easy to see that the decision boundary (8.12) does not change
even if these points are removed from the training set.
yi ( w, xi + b) = 1: In this case i = 0 and i 0. Since i is non-negative
C
and satisfies (8.8) it follows that 0 i m
. Such points are said
to be on the margin. They are also sometimes called support vectors.
Because it uses support vectors, the overall algorithm is called C-Support
Vector classifier or C-SV classifier for short.

8.1.1 A Regularized Risk Minimization Viewpoint


A closer examination of (8.5) reveals that i = 0 whenever yi ( w, xi +b) > 1.
On the other hand, i = 1 yi ( w, xi + b) whenever yi ( w, xi + b) <
1. In short, i = max(0, 1 yi ( w, xi + b)). Using this observation one
can eliminate i from (8.5), and write it as the following unconstrained

8.1 Support Vector Classification

173

optimization problem:
1
min
w
w,b 2

C
+
m

max(0, 1 yi ( w, xi + b)).

(8.13)

i=1

Writing (8.5) as (8.13) is particularly revealing because it shows that a


support vector classifier is nothing but a regularized risk minimizer. Here
the regularizer is the square norm of the decision hyperplane 12 w 2 , and
the loss function is the so-called binary hinge loss:
max(0, 1 yi ( w, xi + b)).

(8.14)

In the later part of the chapter generalizations of the binary hinge loss will
be used to extend the support vector machinery to deal with a large class of
problems such as novelty detection, multiclass classification, and structured
prediction.
8.1.2 An Exponential Family Interpretation
Our motivating arguments for deriving the SVM algorithm have largely
been geometric. We now show that an equally elegant probabilistic interpretation also exists. Assuming that the training set {(x1 , y1 ), . . . , (xm , ym )}
was drawn iid from some underlying distribution, and using the Bayes rule
(1.15) one can write the likelihood
m

p(|X, Y ) p(Y |X, )p() =

p(yi |xi , ) g(|xi )p(),

(8.15)

log p(yi |xi , ) log p() + const.

(8.16)

i=1

and hence the negative log-likelihood


m

log p(|X, Y ) =
i=1

In the absence of any prior knowledge about the data, we choose a zero
mean unit variance isotropic normal for p(). This yields
log p(|X, Y ) =

m
2

log p(yi |xi , ) + const.

(8.17)

i=1

The maximum aposteriori (MAP) estimate for is obtained by minimizing


(8.17) with respect to . Given the optimal , we can predict the class label
at any given x via
y = argmax p(y|x, ).
y

(8.18)

174

8 Linear Models

Of course, our aim is not just to maximize p(yi |xi , ) but also to ensure
that p(y|xi , ) is small for all y = yi . This, for instance, can be achieved by
requiring
p(yi |xi , )
, for all y = yi and some 1.
p(y|xi , )

(8.19)

As we saw in Section 2.3 exponential families of distributions are rather flexible modeling tools. We could, for instance, model p(yi |xi , ) as a conditional
exponential family distribution. Recall the definition:
p(y|x, ) = exp ( (x, y), g(|x)) .

(8.20)

Here (x, y) is a joint feature map which depends on both the input data
x and the label y, while g(|x) is a log-partition function. Now (8.19) boils
down to
p(yi |xi , )
= exp
maxy=yi p(y|xi , )

(xi , yi ) max (xi , y),

y=yi

If we choose such that log = 1, set (x, y) =


y {1} we can rewrite (8.21) as

y
2 (x),

(8.21)

and observe that

yi
yi
(xi )
(xi ), = yi (xi ), 1.
2
2

(8.22)

By replacing log p(yi |xi , ) in (8.17) with the condition (8.22) we obtain
the following objective function:
min

s.t.

1
2
2
yi (xi ), 1 for all i,

(8.23a)
(8.23b)

which recovers (8.4), but without the bias b. As before, we can replace (8.22)
by a linear penalty for constraint violation in order to recover (8.5).

8.1.3 Specialized Algorithms for Training SVMs


The main task in training SVMs boils down to solving (8.11). The m m
matrix H is usually dense and cannot be stored in memory. Decomposition
methods are designed to overcome these difficulties. The basic idea here
is to identify and update a small working set B by solving a small subproblem at every iteration. Formally, let B {1, . . . , m} be the working set
= {1, . . . , m} \ B
and B be the corresponding sub-vector of . Define B

8.1 Support Vector Classification

175

and B analogously. In order to update B we need to solve the following


sub-problem of (8.11) obtained by freezing B :
min
B

s.t.

1
2

B B

HBB HB B
HBB
HB B

B
B

B B

B B y = 0
C
for all i B.
0 i
m

(8.24a)
(8.24b)
(8.24c)

HBB HB B
is a permutation of the matrix H. By eliminating
HBB
HB B

constant terms and rearranging, one can simplify the above problem to
Here,

1
HBB B + B (HBB
B
e)
B
2 B
s.t. B yB = B yB
C
for all i B.
0 i
m
min

(8.25a)
(8.25b)
(8.25c)

An extreme case of a decomposition method is the Sequential Minimal Optimization (SMO) algorithm of Platt [Pla99], which updates only two coefficients per iteration. The advantage of this strategy as we will see below is
that the resultant sub-problem can be solved analytically. Without loss of
generality let B = {i, j}, and define s = yi /yj , ci cj = (HBB
B
e)
and d = (B yB /yj ). Then (8.25) specializes to
1
(Hii i2 + Hjj j2 + 2Hij j i ) + ci i + cj j
i ,j 2
s.t. si + j = d
C
0 i , j .
m
min

(8.26a)
(8.26b)
(8.26c)

This QP in two variables has an analytic solution.


Lemma 8.1 (Analytic solution of 2 variable QP) Define bounds
L=

max(0,
max(0,

H=

C
d m
s
d
s)

(8.27)

otherwise

C d
min( m
, s)

C
C d m
min( m
, s

if s > 0

if s > 0
)

otherwise,

(8.28)

176

8 Linear Models

and auxiliary variables


= (Hii + Hjj s2 2sHij ) and

(8.29)

= (cj s ci Hij d + Hjj ds).

(8.30)

The optimal value of (8.26) can be computed analytically as follows: If = 0


then
i =

if < 0

otherwise.

If > 0, then i = max(L, min(H, /)). In both cases, j = (d si ).


Proof Eliminate the equality constraint by setting j = (d si ). Due to
C
the constraint 0 j m
it follows that si = d j can be bounded
C
C
via d m si d. Combining this with 0 i m
one can write
L i H where L and H are given by (8.27) and (8.28) respectively.
Substituting j = (dsi ) into the objective function, dropping the terms
which do not depend on i , and simplifying by substituting and yields
the following optimization problem in i :
1
min i2 i
i
2
s.t. L i H.
First consider the case when = 0. In this case, i = L if < 0 otherwise
i = H. On other hand, if > 0 then the unconstrained optimum of the
above optimization problem is given by /. The constrained optimum is
obtained by clipping appropriately: max(L, min(H, /)). This concludes
the proof.
To complete the description of SMO we need a valid stopping criterion as
well as a scheme for selecting the working set at every iteration. In order
to derive a stopping criterion we will use the KKT gap, that is, the extent
to which the KKT conditions are violated. Towards this end introduce nonnegative Lagrange multipliers b R, Rm and Rm and write the
Lagrangian of (8.11).
1
C
L(, b, , ) = H e + b y + ( e).
2
m

(8.31)

If we let J() = 21 H e be the objective function and J() =


H e its gradient, then taking gradient of the Lagrangian with respect to
and setting it to 0 shows that
J() + by = .

(8.32)

8.1 Support Vector Classification

177

Furthermore, by the KKT conditions we have


i i = 0 and i (

C
i ) = 0,
m

(8.33)

with i 0 and i 0. Equations (8.32) and (8.33) can be compactly


rewritten as
J()i + byi 0 if i = 0
C
J()i + byi 0 if i =
m
J()i + byi = 0 if 0 < i <

(8.34a)
(8.34b)
C
.
m

(8.34c)

Since yi {1}, we can further rewrite (8.34) as


yi J()i b for all i Iup
yi J()i b for all i Idown ,
where the index sets Iup and Idown are defined as
C
, yi = 1 or i > 0, yi = 1}
m
C
= {i : i < , yi = 1 or i > 0, yi = 1}.
m

Iup = {i : i <
Idown

(8.35a)
(8.35b)

In summary, the KKT conditions imply that is a solution of (8.11) if and


only if
m() M ()
where
m() = max yi J()i and M () = min yi J()i .
iIup

iIdown

(8.36)

Therefore, a natural stopping criterion is to stop when the KKT gap falls
below a desired tolerance , that is,
m() M () + .

(8.37)

Finally, we turn our attention to the issue of working set selection. The
first order approximation to the objective function J() can be written as
J( + d) J() + J() d.
Since we are only interested in updating coefficients in the working set B
we set d = dB 0 , in which case we can rewrite the above first order

178

8 Linear Models

approximation as
J()B dB J( + d) J().
From among all possible directions dB we wish to choose one which decreases
the objective function the most while maintaining feasibility. This is best
expressed as the following optimization problem:
min J()B dB

(8.38a)

s.t. yB dB = 0

(8.38b)

dB

di 0 if i = 0 and i B
C
di 0 if i =
and i B
m
1 di 1.

(8.38c)
(8.38d)
(8.38e)

Here (8.38b) comes from y ( + d) = 0 and y = 0, while (8.38c) and


C
. Finally, (8.38e) prevents the objective
(8.38d) comes from 0 i m
function from diverging to . If we specialize (8.38) to SMO, we obtain
min J()i di + J()j dj

(8.39a)

s.t. yi di + yj dj = 0

(8.39b)

i,j

dk 0 if k = 0 and k {i, j}
C
dk 0 if k =
and k {i, j}
m
1 dk 1 for k {i, j}.

(8.39c)
(8.39d)
(8.39e)

At first glance, it seems that choosing the optimal i and j from the set
{1, . . . , m}{1, . . . m} requires O(m2 ) effort. We now show that O(m) effort
suffices.
Define new variables dk = yk dk for k {i, j}, and use the observation
yk {1} to rewrite the objective function as
(yi J()i + yj J()j ) dj .
Consider the case J()i yi J()j yj . Because of the constraints
(8.39c) and (8.39d) if we choose i Iup and j Idown , then dj = 1 and
di = 1 is feasible and the objective function attains a negative value. For
all other choices of i and j (i, j Iup ; i, j Idown ; i Idown and j Iup )
the objective function value of 0 is attained by setting di = dj = 0. The
case J()j yj J()i yi is analogous. In summary, the optimization

8.1 Support Vector Classification

179

problem (8.39) boils down to


min

iIup ,jIdown

yi J()i yj J()j = min yi J()i max yj J()j ,


iIup

jIdown

which clearly can be solved in O(m) time. Comparison with (8.36) shows
that at every iteration of SMO we choose to update coefficients i and j
which maximally violate the KKT conditions.

8.1.4 The trick


In the soft margin formulation the parameter C is a trade-off between two
conflicting requirements namely maximizing the margin and minimizing the
training error. Unfortunately, this parameter is rather unintuitive and hence
difficult to tune. The -SVM was proposed to address this issue. As Theorem
8.2 shows, controls the number of support vectors and margin errors. The
primal problem for the -SVM can be written as
min

w,b,,

1
w
2

1
m

(8.40a)

i=1

s.t. yi ( w, xi + b) i for all i

(8.40b)

i 0, and 0.

(8.40c)

As before, we write the Lagrangian by introducing non-negative Lagrange


multipliers i , i , and .
1
w
2

1
m

i ( i yi ( w, xi + b))

i +
i=1

i=1

i i .
i=1

Taking gradients with respect to the primal variables and setting them to 0
yields
m

i yi x i = w

(8.41)

i=1
m

i yi = 0

(8.42)

i=1

1
m

(8.43)

i = .

(8.44)

i + i =
m
i=1

180

8 Linear Models

Plugging these conditions back into the Lagrangian and using i 0, i 0


and 0 yields the following dual, which can be optimized by a variant of
the SMO algorithm.
min

1
2

yi yj i j xi , xj

(8.45a)

i,j

s.t.

i yi = 0

(8.45b)

1
m

(8.45c)

i .

(8.45d)

i=1

0 i
m
i=1

The following theorems, which we state without proof, explain the significance of and the connection of -SVM and the soft margin formulation.

Theorem 8.2 Suppose we run -SVM with kernel k on some data and
obtain > 0. Then
(i) is an upper bound on the fraction of margin errors, that is points
for which yi ( w, xi + bi ) < .
(ii) is a lower bound on the fraction of support vectors, that is points
for which yi ( w, xi + bi ) = .
(iii) Suppose the data (X, Y ) were generated iid from a distribution p(x, y)
such that neither p(x, y = +1) or p(x, y = 1) contain any discrete
components. Moreover, assume that the kernel k is analytic and nonconstant. With probability 1, asympotically, equals both the fraction
of support vectors and fraction of margin errors.
Theorem 8.3 If (8.40) leads to a decision function with > 0, then (8.5)
with C = 1 leads to the same decision function.
8.2 Support Vector Regression
As opposed to classification where the labels yi are binary valued, in regression they are real valued. Given a tolerance , our aim here is to find a
hyperplane parameterized by (w, b) such that
|yi ( w, xi + b)| .

(8.46)

8.2 Support Vector Regression

181

In other words, we want to find a hyperplane such that all the training data
lies within an tube around the hyperplane. We may not always be able to
find such a hyperplane, hence we relax the above condition by introducing
slack variables i+ and i and write the corresponding primal problem as

1
w
2

min

w,b, + ,

s.t.

C
m

(i+ + i )

(8.47a)

i=1

( w, xi + b) yi + i+ for all i

(8.47b)

yi ( w, xi + b) + i for all i

(8.47c)

i+

0, and

0.

(8.47d)

The Lagrangian can be written by introducing non-negative Lagrange multipliers i+ , i , i+ and i :

1
L(w, b, , , , , , ) = w
2
+

C
+
m

(i+
i=1

i )

(i+ i+ + i i )

i=1

i+ (( w, xi + b) yi + )

+
i=1
m

i (yi ( w, xi + b) ).

+
i=1

Taking gradients with respect to the primal variables and setting them to
0, we obtain the following conditions:
m

(i i+ )xi

w=

(8.48)

i=1
m

i+ =
i=1

(8.49)

i=1

C
m
C
i + i = .
m

i+ + i+ =

(8.50)
(8.51)

182

8 Linear Models
{+,}

{+,}

Noting that i
, i
0 and substituting the above conditions into
the Lagrangian yields the dual
1
2

min

+ ,

(i i+ )(j j+ ) xi , xj
m

(i+ + i )

+
i=1

yi (i i+ )
i=1

i+ =

s.t.

(8.52a)

i,j

i=1

(8.52b)

i=1

C
(8.52c)
m
C
0 i .
(8.52d)
m
This is a quadratic programming problem with one equality constraint, and
hence a SMO like decomposition method can be derived for finding the
optimal coefficients + and (Problem ??).
As a consequence of (8.48), analogous to the classification case, one can
map the data via a feature map into an RKHS with kernel k and recover
the decision boundary f (x) = w, (x) + b via
0 i+

(i

f (x) =

i+ )

(i i+ )k(xi , x) + b. (8.53)

(x)i , (x) + b =

i=1

i=1

Finally, the KKT conditions


C
i+ i+ = 0
m

C
i i = 0 and
m

i+ (( w, xi + b) yi + ) = 0 i (yi ( w, xi + b) ) = 0,
allow us to draw many useful conclusions:
Whenever |yi ( w, xi + b)| < , this implies that i+ = i = i+ =
i = 0. In other words, points which lie inside the tube around the
hyperplane w, x + b do not contribute to the solution thus leading to
sparse expansions in terms of .
C
If ( w, xi +b)yi > we have i+ > 0 and therefore i+ = m
. On the other

hand, = 0 and i = 0. The case yi ( w, xi + b) > is symmetric


C
and yields + = 0, i > 0, i = m
, and i+ = 0.
C
Finally, if ( w, xi + b) yi = we have i+ = 0 and 0 i+ m
, while

= 0 and i = 0. Similarly, when yi ( w, xi + b) = we obtain


C
i = 0, 0 i m
, + = 0 and i+ = 0.

8.2 Support Vector Regression

183

Note that i+ and i are never simultaneously non-zero.


8.2.1 Incorporating the Trick
The primal problem obtained after incorporating the trick can be written
as
min

w,b, + , ,

s.t.

1
w
2

1
+
m

(i+ + i )

(8.54a)

( w, xi + b) yi + i+ for all i

(8.54b)

+C

i=1

yi ( w, xi + b) +
i+

0, i

0, and

for all i

(8.54c)

0.

(8.54d)

The Lagrangian can be written by introducing non-negative Lagrange multipliers i+ , i , i+ , i , and 0:


L(w, b, + , , , + , , + , , ) =

1
w
2

+ C +

C
m

(i+ + i )
i=1

(i+ i+ + i i )


i=1
m

i+ (( w, xi + b) yi + )

+
i=1
m

i (yi ( w, xi + b) ).

+
i=1

Taking gradients with respect to the primal variables and setting them to
0, we obtain the following conditions:
m

(i i+ )xi

w=

(8.55)

i=1
m

C =

(8.56)

(8.57)

i=1
m

i+ =
i=1

(i+ + i )

i=1

C
m
C
i + i = .
m

i+ + i+ =

(8.58)
(8.59)

184

8 Linear Models
{+,}

{+,}

Noting that i
, i
0 and substituting the above conditions into
the Lagrangian yields the dual
min

+ ,

1
2

(i

i+ )(j

yi (i i+ )

xi , xj

i,j

(8.60a)

i=1
m

i+ =

s.t.

j+ )

i=1

(8.60b)

i=1

C
m
C
0 i
m

0 i+

(8.60c)
(8.60d)

(i+ + i ) C.

(8.60e)

i=1

8.2.2 Regularized Risk Minimization


General loss functions: Huber, quadratic, L1, quantile regression,

8.3 Novelty Detection


The large margin approach can also be adapted to perform novelty detection
or quantile estimation. Novelty detection is an unsupervised task where one
is interested in flagging a small fraction of the input X = {x1 , . . . , xm } as
atypical or novel. It can be viewed as a special case of the quantile estimation
task, where we are interested in estimating a simple set C such that P r(x
C) for some [0, 1]. One way to measure simplicity is to use the
volume of the set. Formally, if |C| denotes the volume of a set, then the
quantile estimation task is to estimate
arginf{|C| s.t. P r(x C) }.

(8.61)

Given the input data X one can compute the empirical density
p(x) =

1
m

if x X

otherwise,

and estimate its (not necessarily unique) -quantiles. Unfortunately, such


estimates are very brittle and do not generalize well to unseen data. One
possible way to address this issue is to restrict C to be simple subsets such
as spheres or half spaces. In other words, we estimate simple sets which
contain fraction of the dataset. For our purposes, we specifically work

8.3 Novelty Detection

185

with half-spaces defined by hyperplanes. While half-spaces may seem rather


restrictive remember that the kernel trick can be used to map data into
a high-dimensional space; half-spaces in the mapped space correspond to
non-linear decision boundaries in the input space. Furthermore, instead of
explicitly identifying C we will learn an indicator function for C, that is, a
function f which takes on values 1 inside C and 1 elsewhere.
With 12 w 2 as a regularizer, the problem of estimating a hyperplane such
that a large fraction of the points in the input data X lie on one of its sides
can be written as:
1
min
w
w,, 2
s.t.

1
+
m

(8.62a)

i=1

w, xi i for all i
i 0.

(8.62b)
(8.62c)

Clearly, we want to be as large as possible so that the volume of the halfspace w, x is minimized. Furthermore, [0, 1] is a parameter which
is analogous to we introduced for the -SVM earlier. Roughly speaking,
it denotes the fraction of input data for which w, xi . An alternative
interpretation of (8.62) is to assume that we are separating the data set X
from the origin (See Figure BUGBUG for an illustration). Therefore, this
method is also widely known as the one-class SVM.
The Lagrangian of (8.62) can be written by introducing non-negative
Lagrange multipliers i , and i :
L(w, , , , ) =

1
w
2

1
m

i +
i=1

i ( i w, xi )
i=1

i i .
i=1

By taking gradients with respect to the primal variables and setting them
to 0 we obtain
m

w=

i xi

(8.63)

i=1

i =

1
1
i
m
m

(8.64)

i = 1.

(8.65)

i=1

Noting that i , i 0 and substituting the above conditions into the La-

186

8 Linear Models

grangian yields the dual


min

1
2

i j xi , xj

(8.66a)

1
m

(8.66b)

i,j

s.t. 0 i
m

i = 1.

(8.66c)

i=1

This can easily be solved by a straightforward modification of the SMO


algorithm (see Section 8.1.3 and Problem 8.2). Like in the previous sections,
an analysis of the KKT conditions shows that 0 < if and only if w, xi ;
such points are called support vectors. The following theorem explains the
significance of the parameter .
Theorem 8.4 Assume that the solution of (8.66) satisfies = 0, then the
following statements hold:
(i) is an upper bound on the fraction of support vectors, that is points
for which w, xi .
(ii) Suppose the data X were generated independently from a distribution
p(x) which does not contain discrete components. Moreover, assume
that the kernel k is analytic and non-constant. With probability 1,
asympotically, equals the fraction of support vectors.
8.3.1 Density Estimation via the Exponential Family
One approach to novelty detection is to first perform density estimation
and then flag regions of low density as novel. For instance, if we make the
standard assumption that the points xi X are iid samples from some
underlying parametric distribution p(x|), then we can write the likelihood
p(X|) as
m

p(xi |).

p(X|) =

(8.67)

i=1

The maximum likelihood estimation (MLE) problem is to find a which


maximizes p(X|). Since the MLE estimates are often brittle, it is customary
to add a prior p() and write the posterior
m

p(|X)

p(xi |)p().
i=1

(8.68)

8.3 Novelty Detection

187

The maximum aposteriori estimation (MAP) problem is to find the mode


of p(|X). However, this approach is wasteful because we do not really care
about how the data distribution looks like in the high density areas. All we
want is to flag the regions where p(x|) , for some [0, 1], as novel.
Therefore, one can replace p(xi |) in (8.68) by
p(xi |)
,1 ,

min

(8.69)

take logs and minimize the following objective function instead


m

log p()

log min
i=1

p(xi |)
,1

(8.70)

Assuming that p() is normally distributed with zero mean and unit variance, we can further rewrite (8.70) as
1

m
2

log min
i=1

p(xi |)
,1 .

(8.71)

A rather convenient parametric family of density functions is the exponential


family (See Section 2.3):
p(x|) = exp ( (x), g()) .

(8.72)

Recall that (x) denotes the sufficient statistics and g() is the log-partition
function which normalizes the distribution to sum to one. For our task,
only the shape of p(x|) matters and the normalization g() is irrelevant.
Therefore we can absorb it into by defining a new constant
and rewrite
(8.71) as
1

m
2

log min
i=1

exp ( (x), )
,1 .

(8.73)

Since
is unknown, we can introduce a variable and set
= exp(m)
for some [0, 1]. Plugging this into (8.73) and some simple algebraic
manipulations yield
J(, ) :=

m
2

max (m (x), , 0) .

(8.74)

i=1

svnvish: BUGBUG I am stuck here because I dont know how to connect


this up with (8.63).

188

8 Linear Models

8.4 Ordinal Regression


8.4.1 Preferences
relative terms matter, not in absolute

8.4.2 Dual Problem


8.4.3 Optimization
the issue of quadratically many constraints. thorstens trick

8.5 Margins and Probability


discuss the connection between probabilistic models and linear classifiers.
issues of consistency, optimization, efficiency, etc.

8.6 Large Margin Classifiers with Structure


8.6.1 Margin
define margin pictures

8.6.2 Penalized Margin


different types of loss, rescaling

8.6.3 Nonconvex Losses


the max - max loss

8.7 Applications
8.7.1 Sequence Annotation
8.7.2 Matching
8.7.3 Ranking
8.7.4 Shortest Path Planning
8.7.5 Image Annotation
8.7.6 Contingency Table Loss
8.8 Optimization
8.8.1 Column Generation
subdifferentials

8.9 CRFs vs Structured Large Margin Models

189

8.8.2 Bundle Methods


8.8.3 Overrelaxation in the Dual
when we cannot do things exactly

8.9 CRFs vs Structured Large Margin Models


8.9.1 Loss Function
8.9.2 Dual Connections
8.9.3 Optimization
Problems
Problem 8.1 (SVM without Bias {1}) A homogeneous hyperplane is one
which passes through the origin, that is,
{x H| w, x = 0}.

(8.75)

If we devise a soft margin classifier which uses the homogeneous hyperplane


as a decision boundary, then the corresponding primal optimization problem
can be written as
min
w,

s.t.

1
w
2

m
2

+C

(8.76a)

i=1

yi w, xi 1 i for all i

(8.76b)

i 0,

(8.76c)

Derive the dual of (8.76) and contrast it with (8.11). What changes to the
SMO algorithm would you make to solve this dual?
Problem 8.2 (SMO for various SVM formulations {2}) Derive an SMO
like decomposition algorithm for solving the dual of the following problems:
-SVM (8.45).
SV regression (8.52).
SV novelty detection (8.66).
Problem 8.3 (Novelty detection with Balls {2}) In Section 8.3 we assumed that we wanted to estimate a halfspace which contains a major fraction of the input data. An alternative approach is to use balls, that is, we
estimate a ball of small radius in feature space which encloses a majority of
the input data. Write the corresponding optimization problem and its dual.
Show that if the kernel is translation invariant, that is, k(x, x ) depends only

190

8 Linear Models

on x x then the optimization problem with balls is equivalent to (8.66).


Explain why this happens geometrically.
Problem 8.4 Invariances (basic loss)
Problem 8.5 Polynomial transformations - SDP constraints

9
Model Selection

9.1 Basics
Why model selection. overfitting ...
9.1.1 Estimators
unbiased estimator, bias variance dilemma
9.1.2 Maximum Likelihood Revisited
When it may overfit, when it is ok,
9.1.3 Empirical Methods
cross-validation; show that it is unbiased
9.2 Uniform Convergence Bounds
9.2.1 Vapnik-Chervonenkis Dimension
covering number arguments, chernoff bounds, just basic idea, maybe radius
margin bound
9.2.2 Rademacher Averages
explain the basic approach (from annals paper)
9.2.3 Compression Bounds
9.3 Bayesian Methods
9.3.1 Priors Revisited
9.3.2 PAC-Bayes Bounds
9.4 Asymptotic Analysis
9.4.1 Efficiency of an Estimator
Cramér-Rao bound

9.4.2 Asymptotic Efficiency


Amari Yoshizawa Murata results


10
Maximum Mean Discrepancy

10.1 Fenchel Duality


10.1.1 Motivation
- definition - implications

10.1.2 Applications
- dual of f (x) + g(x)

10.2 Dual Problems


10.2.1 Maximum Likelihood
- Dual of MLE is maxent with exact matching

10.2.2 Maximum A Posteriori


- Dual of MAP is maxent with approx matching (Smola, Altun, Hegland) Some philosophy about maxent

10.3 Priors
10.3.1 Motivation
- Philosophy and examples - Properties of priors: l1 is sparsity inducing, l2
is amenable to kernels etc

10.3.2 Conjugate Priors


- Conjugate priors and examples

10.3.3 Priors and Maxent


- Priors as norm stabilization in maxent

10.4 Moments
10.4.1 Sufficient Statistics and the Marginal Polytope

Exponential families again


natural parameters vs mean parameters
connections via the link function (log-partition function)
Map from distributions to means is unique for regular exponential families
(Wainwright and Jordan + ASH)

10.5 Two Sample Test


10.5.1 Maximum Mean Discrepancy
10.5.2 Mean Map and Norm
10.5.3 Efficient Estimation
U-statistic

10.5.4 Covariate Shift Correction


simple optimization problem

10.6 Independence Measures


10.6.1 Test Statistic
take mmd between joint and product of marginals

10.6.2 Efficient Estimation


tr HKHL criterion convergence theorems

10.7 Applications
10.7.1 Independent Component Analysis
10.7.2 Feature Selection
10.7.3 Clustering
10.7.4 Maximum Variance Unfolding
10.8 Introduction
We address the problem of comparing samples from two probability distributions, by proposing statistical tests of the hypothesis that these distributions
are different (this is called the two-sample or homogeneity problem). Such


tests have application in a variety of areas. In bioinformatics, it is of interest


to compare microarray data from identical tissue types as measured by different laboratories, to detect whether the data may be analysed jointly, or
whether differences in experimental procedure have caused systematic differences in the data distributions. Equally of interest are comparisons between
microarray data from different tissue types, either to determine whether two
subtypes of cancer may be treated as statistically indistinguishable from a diagnosis perspective, or to detect differences in healthy and cancerous tissue.
In database attribute matching, it is desirable to merge databases containing
multiple fields, where it is not known in advance which fields correspond:
the fields are matched by maximising the similarity in the distributions of
their entries.
We test whether distributions p and q are different on the basis of samples
drawn from each of them, by finding a well behaved (e.g. smooth) function
which is large on the points drawn from p, and small (as negative as possible)
on the points from q. We use as our test statistic the difference between the
mean function values on the two samples; when this is large, the samples
are likely from different distributions. We call this statistic the Maximum
Mean Discrepancy (MMD).
Clearly the quality of the MMD as a statistic depends on the class F of
smooth functions that define it. On one hand, F must be rich enough
so that the population MMD vanishes if and only if p = q. On the other
hand, for the test to be consistent, F needs to be restrictive enough for
the empirical estimate of MMD to converge quickly to its expectation as the
sample size increases. We shall use the unit balls in universal reproducing
kernel Hilbert spaces [Ste01] as our function classes, since these will be shown
to satisfy both of the foregoing properties (we also review classical metrics
on distributions, namely the Kolmogorov-Smirnov and Earth-Movers distances, which are based on different function classes). On a more practical
note, the MMD has a reasonable computational cost, when compared with
other two-sample tests: given m points sampled from p and n from q, the
cost is O((m + n)²) time. We also propose a less statistically efficient algorithm
with a computational cost of O(m+n), which can yield superior performance
at a given computational cost by looking at a larger volume of data.
We define three non-parametric statistical tests based on the MMD. The
first two, which use distribution-independent uniform convergence bounds,
provide finite sample guarantees of test performance, at the expense of being
conservative in detecting differences between p and q. The third test is based
on the asymptotic distribution of the MMD, and is in practice more sensitive
to differences in distribution at small sample sizes. The present work synthesizes and expands on results of [GBR+07], [SGSS07], and [SZS+08],¹ which in turn build on the earlier work of [BGR+06]. Note that the latter addresses only the third kind of test, and that the approach of [GBR+07] employs a more accurate approximation to the asymptotic distribution of the test statistic.
We begin our presentation in Section 10.9 with a formal definition of the
MMD, and a proof that the population MMD is zero if and only if p = q
when F is the unit ball of a universal RKHS. We also review alternative
function classes for which the MMD defines a metric on probability distributions. In Section 10.10, we give an overview of hypothesis testing as
it applies to the two-sample problem, and review other approaches to this
problem. We present our first two hypothesis tests in Section 10.11, based
on two different bounds on the deviation between the population and empirical MMD. We take a different approach in Section 10.12, where we use
the asymptotic distribution of the empirical MMD estimate as the basis for
a third test. When large volumes of data are available, the cost of computing the MMD (quadratic in the sample size) may be excessive: we therefore
propose in Section 10.13 a modified version of the MMD statistic that has
a linear cost in the number of samples, and an associated asymptotic test.
In Section 10.14, we provide an overview of methods related to the MMD in
the statistics and machine learning literature. Finally, in Section 10.15, we
demonstrate the performance of MMD-based two-sample tests on problems
from neuroscience, bioinformatics, and attribute matching using the Hungarian marriage method. Our approach performs well on high dimensional
data with low sample size; in addition, we are able to successfully distinguish
distributions on graph data, for which ours is the first proposed test.
10.9 The Maximum Mean Discrepancy
In this section, we present the maximum mean discrepancy (MMD), and
describe conditions under which it is a metric on the space of probability
distributions. The MMD is defined in terms of particular function spaces
that witness the difference in distributions: we therefore begin in Section
10.9.1 by introducing the MMD for some arbitrary function space. In Section
10.9.2, we compute both the population MMD and two empirical estimates
when the associated function space is a reproducing kernel Hilbert space,
and we derive the RKHS function that witnesses the MMD for a given pair
of distributions in Section 10.9.3. Finally, we describe the MMD for more
general function classes in Section 10.9.4.
¹ In particular, most of the proofs here were not provided by [?].


10.9.1 Definition of the Maximum Mean Discrepancy


Our goal is to formulate a statistical test that answers the following question:

Problem 10.1 Let p and q be Borel probability measures defined on a domain X. Given observations X := {x1 , . . . , xm } and Y := {y1 , . . . , yn },
drawn independently and identically distributed (i.i.d.) from p and q, respectively, can we decide whether p = q?
To start with, we wish to determine a criterion that, in the population
setting, takes on a unique and distinctive value only when p = q. It will be
defined based on Lemma 9.3.2 of [Dud02].
Lemma 10.1 Let (X, d) be a metric space, and let p, q be two Borel probability measures defined on X. Then p = q if and only if E_{x∼p}[f(x)] = E_{y∼q}[f(y)] for all f ∈ C(X), where C(X) is the space of bounded continuous functions on X.
Although C(X) in principle allows us to identify p = q uniquely, it is not
practical to work with such a rich function class in the finite sample setting.
We thus define a more general class of statistics, for as yet unspecified function classes F, to measure the disparity between p and q [FM53, Mül97].
Definition 10.2 Let F be a class of functions f : X → R and let p, q, X, Y be defined as above. We define the maximum mean discrepancy (MMD) as

\mathrm{MMD}[F, p, q] := \sup_{f \in F} \left( E_{x \sim p}[f(x)] - E_{y \sim q}[f(y)] \right). \qquad (10.1)

[Mül97] calls this an integral probability metric. A biased empirical estimate of the MMD is

\mathrm{MMD}_b[F, X, Y] := \sup_{f \in F} \left( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{i=1}^{n} f(y_i) \right). \qquad (10.2)
The empirical MMD defined above has an upward bias (we will define an
unbiased statistic in the following section). We must now identify a function
class that is rich enough to uniquely identify whether p = q, yet restrictive
enough to provide useful finite sample estimates (the latter property will be
established in subsequent sections).


10.9.2 The MMD in Reproducing Kernel Hilbert Spaces


If F is the unit ball in a reproducing kernel Hilbert space H, the empirical
MMD can be computed very efficiently. This will be the main approach we
pursue in the present study. Other possible function classes F are discussed
at the end of this section. We will refer to H as universal whenever H, defined
on a compact metric space X and with associated kernel k : X × X → R, is dense in C(X) with respect to the L∞ norm. It is shown in [Ste01] that Gaussian
and Laplace kernels are universal. We have the following result:
Theorem 10.3 Let F be a unit ball in a universal RKHS H, defined on the
compact metric space X, with associated kernel k(·, ·). Then MMD [F, p, q] =
0 if and only if p = q.
Proof It is clear that MMD[F, p, q] is zero if p = q. We prove the converse by showing that MMD[C(X), p, q] = D for some D > 0 implies MMD[F, p, q] > 0: this is equivalent to MMD[F, p, q] = 0 implying MMD[C(X), p, q] = 0 (where this last result implies p = q by Lemma 10.1, noting that compactness of the metric space X implies its separability). Let H be the universal RKHS of which F is the unit ball. If MMD[C(X), p, q] = D, then there exists some f̃ ∈ C(X) for which E_p[f̃] − E_q[f̃] ≥ D/2. We know that H is dense in C(X) with respect to the L∞ norm: this means that for ε = D/8, we can find some f* ∈ H satisfying ‖f* − f̃‖∞ < ε. Thus, we obtain |E_p[f*] − E_p[f̃]| < ε and consequently

|E_p[f^*] - E_q[f^*]| > E_p[\tilde f] - E_q[\tilde f] - 2\epsilon > \frac{D}{2} - 2\,\frac{D}{8} = \frac{D}{4} > 0.

Finally, using ‖f*‖_H < ∞, we have

\left( E_p[f^*] - E_q[f^*] \right) / \|f^*\|_H \geq D / (4\|f^*\|_H) > 0,

and hence MMD[F, p, q] > 0.


We now review some properties of H that will allow us to express the MMD
in a more easily computable form [SS02]. Since H is an RKHS, the operator
of evaluation δ_x mapping f ∈ H to f(x) ∈ R is continuous. Thus, by the Riesz representation theorem, there is a feature mapping φ(x) from X to H such that f(x) = ⟨f, φ(x)⟩_H. Moreover, ⟨φ(x), φ(y)⟩_H = k(x, y), where k(x, y) is a positive definite kernel function. The following lemma is due to
[BGR+ 06].
Lemma 10.4 Denote the expectation of φ(x) under p by μ[p] := E_p[φ(x)] (assuming its existence).¹ Then

\mathrm{MMD}[F, p, q] = \sup_{\|f\|_H \leq 1} \langle \mu[p] - \mu[q], f \rangle = \|\mu[p] - \mu[q]\|_H. \qquad (10.3)

Proof

\mathrm{MMD}^2[F, p, q] = \left[ \sup_{\|f\|_H \leq 1} \left( E_p[f(x)] - E_q[f(y)] \right) \right]^2
= \left[ \sup_{\|f\|_H \leq 1} \left( E_p[\langle \phi(x), f \rangle_H] - E_q[\langle \phi(y), f \rangle_H] \right) \right]^2
= \left[ \sup_{\|f\|_H \leq 1} \langle \mu[p] - \mu[q], f \rangle_H \right]^2
= \|\mu[p] - \mu[q]\|_H^2

Given we are in an RKHS, the norm ‖μ[p] − μ[q]‖²_H may easily be computed


in terms of kernel functions. This leads to a first empirical estimate of the
MMD, which is unbiased.
Lemma 10.5 Given x and x′ independent random variables with distribution p, and y and y′ independent random variables with distribution q, the population MMD² is

\mathrm{MMD}^2[F, p, q] = E_{x,x' \sim p}[k(x, x')] - 2 E_{x \sim p,\, y \sim q}[k(x, y)] + E_{y,y' \sim q}[k(y, y')]. \qquad (10.4)

Let Z := (z_1, ..., z_m) be m i.i.d. random variables, where z_i := (x_i, y_i) (i.e. we assume m = n). An unbiased empirical estimate of MMD² is

\mathrm{MMD}^2_u[F, X, Y] = \frac{1}{m(m-1)} \sum_{i \neq j} h(z_i, z_j), \qquad (10.5)

which is a one-sample U-statistic with h(z_i, z_j) := k(x_i, x_j) + k(y_i, y_j) − k(x_i, y_j) − k(x_j, y_i) (we define h(z_i, z_j) to be symmetric in its arguments due to requirements that will arise in Section 10.12).
Proof Starting from the expression for MMD²[F, p, q] in Lemma 10.4,

\mathrm{MMD}^2[F, p, q] = \|\mu[p] - \mu[q]\|_H^2
= \langle \mu[p], \mu[p] \rangle_H + \langle \mu[q], \mu[q] \rangle_H - 2 \langle \mu[p], \mu[q] \rangle_H
= E_p[\langle \phi(x), \phi(x') \rangle_H] + E_q[\langle \phi(y), \phi(y') \rangle_H] - 2 E_{p,q}[\langle \phi(x), \phi(y) \rangle_H]

¹ A sufficient condition for this is ‖μ[p]‖²_H < ∞, which is rearranged as E_p[k(x, x′)] < ∞, where x and x′ are independent random variables drawn according to p. In other words, k is a trace class operator with respect to the measure p.


Fig. 10.1. Illustration of the function maximizing the mean discrepancy in the case where a Gaussian is being compared with a Laplace distribution. Both distributions have zero mean and unit variance. The function f that witnesses the MMD has been scaled for plotting purposes, and was computed empirically on the basis of 2 × 10⁴ samples, using a Gaussian kernel with σ = 0.5.

The proof is completed by applying ⟨φ(x), φ(x′)⟩_H = k(x, x′); the empirical estimate follows straightforwardly.

The empirical statistic is an unbiased estimate of MMD2 , although it does


not have minimum variance, since we are ignoring the cross-terms k(xi , yi )
of which there are only O(n). The minimum variance estimate is almost
identical, though [Ser80, Section 5.1.4].
The biased statistic in (10.2) may also be easily computed following the above reasoning. Substituting the empirical estimates μ[X] := (1/m) Σ_{i=1}^m φ(x_i) and μ[Y] := (1/n) Σ_{i=1}^n φ(y_i) of the feature space means based on respective samples X and Y, we obtain

\mathrm{MMD}_b[F, X, Y] = \left[ \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i, y_j) \right]^{1/2}. \qquad (10.6)
Intuitively we expect the empirical test statistic MMD[F, X, Y], whether biased or unbiased, to be small if p = q, and large if the distributions are far apart. It costs O((m + n)²) time to compute both statistics.
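The two quadratic-time estimators can be made concrete with a short sketch. The following NumPy code is illustrative only and not part of the original text; the Gaussian kernel and all function names are assumptions made for the example. It computes MMD_b of (10.6) and MMD²_u of (10.5).

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_biased(X, Y, sigma=1.0):
    """Biased statistic MMD_b[F, X, Y] of (10.6)."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    val = Kxx.sum() / m**2 - 2 * Kxy.sum() / (m * n) + Kyy.sum() / n**2
    return np.sqrt(max(val, 0.0))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased statistic MMD^2_u[F, X, Y] of (10.5); requires m = n.
    The diagonal terms, including the cross-terms k(x_i, y_i), are dropped,
    so the estimate may be negative."""
    m = len(X)
    assert len(Y) == m
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)
    return (Kxx.sum() + Kyy.sum() - 2 * Kxy.sum()) / (m * (m - 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 1))        # sample from p
    Y = rng.laplace(0.0, 2**-0.5, size=(200, 1))   # sample from q, same mean and variance
    print(mmd_biased(X, Y), mmd2_unbiased(X, Y))
```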
Finally, we note that [HBM08] recently proposed a modification of the
kernel MMD statistic in Lemma 10.4, by scaling the feature space mean distance using the inverse within-sample covariance operator, thus employing
the kernel Fisher discriminant as a statistic for testing homogeneity. This
statistic is shown to be related to the χ² divergence.

10.9.3 Witness Function of the MMD for RKHSs


It is also instructive to consider the witness f which is chosen by MMD to exhibit the maximum discrepancy between the two distributions. The population f and its empirical estimate f̂(x) are respectively

f(x) \propto \langle \phi(x), \mu[p] - \mu[q] \rangle = E_{x' \sim p}[k(x, x')] - E_{x' \sim q}[k(x, x')]
\hat f(x) \propto \langle \phi(x), \mu[X] - \mu[Y] \rangle = \frac{1}{m} \sum_{i=1}^{m} k(x_i, x) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, x).


This follows from the fact that the unit vector v maximizing ⟨v, x⟩_H in a Hilbert space is v = x/‖x‖.
We illustrate the behavior of MMD in Figure 10.1 using a one-dimensional
example. The data X and Y were generated from distributions p and q with
equal means and variances, with p Gaussian and q Laplacian. We chose F
to be the unit ball in an RKHS using the Gaussian kernel. We observe that
the function f that witnesses the MMD in other words, the function
maximizing the mean discrepancy in (10.1) is smooth, positive where the
Laplace density exceeds the Gaussian density (at the center and tails), and
negative where the Gaussian density is larger. Moreover, the magnitude of f
is a direct reflection of the amount by which one density exceeds the other,
insofar as the smoothness constraint permits it.
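For intuition, the empirical witness f̂ above can be evaluated directly. The sketch below is illustrative and not from the original text; it mirrors the one-dimensional setup behind Figure 10.1, and the Gaussian kernel width σ is an assumption.

```python
import numpy as np

def empirical_witness(X, Y, grid, sigma=0.5):
    """f_hat(t) = (1/m) sum_i k(x_i, t) - (1/n) sum_i k(y_i, t) on a grid;
    the witness function is proportional to this difference of mean kernel maps."""
    def kmap(S, t):
        return np.exp(-(S[:, None] - t[None, :])**2 / (2 * sigma**2))
    return kmap(X, grid).mean(axis=0) - kmap(Y, grid).mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, 20000)           # Gaussian sample (p)
    Y = rng.laplace(0.0, 2**-0.5, 20000)      # Laplace sample (q), same mean and variance
    grid = np.linspace(-4, 4, 201)
    f_hat = empirical_witness(X, Y, grid)     # positive where the Laplace density exceeds the Gaussian
```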

10.9.4 The MMD in Other Function Classes


The definition of the maximum mean discrepancy is by no means limited to
RKHS. In fact, any function class F that comes with uniform convergence
guarantees and is sufficiently powerful will enjoy the above properties.
Definition 10.6 Let F be a subset of some vector space. The star S[F] of a set F is
S[F] := \{\alpha x \mid x \in F \text{ and } \alpha \in [0, \infty)\}.
Theorem 10.7 Denote by F the subset of some vector space of functions from X to R for which S[F] ∩ C(X) is dense in C(X) with respect to the L∞(X) norm. Then MMD [F, p, q] = 0 if and only if p = q.
Moreover, under the above conditions MMD[F, p, q] is a metric on the space of probability distributions. Whenever the star of F is not dense, MMD is only a pseudo-metric.¹
Proof The first part of the proof is almost identical to that of Theorem 10.3
and is therefore omitted. To see the second part, we only need to prove the
triangle inequality. We have
\sup_{f \in F} |E_p f - E_q f| + \sup_{g \in F} |E_q g - E_r g| \geq \sup_{f \in F} \left[ |E_p f - E_q f| + |E_q f - E_r f| \right]
\geq \sup_{f \in F} |E_p f - E_r f|.
¹ According to [Dud02, p. 26] a metric d(x, y) satisfies the following four properties: symmetry, triangle inequality, d(x, x) = 0, and d(x, y) = 0 ⟹ x = y. A pseudo-metric only satisfies the first three properties.


The first part of the theorem establishes that MMD[F, p, q] is a metric, since
only for p = q do we have MMD[F, p, q] = 0.
Note that any uniform convergence statements in terms of F allow us immediately to characterize an estimator of MMD(F, p, q) explicitly. The following result shows how (we will refine this reasoning for the RKHS case in
Section 10.11).
Theorem 10.8 Let δ ∈ (0, 1) be a confidence level and assume that for some ε(δ, m, F) the following holds for samples {x_1, ..., x_m} drawn from p:

\Pr\left\{ \sup_{f \in F} \left| E_p[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right| > \epsilon(\delta, m, F) \right\} \leq \delta. \qquad (10.7)

In this case we have that

\Pr\left\{ |\mathrm{MMD}[F, p, q] - \mathrm{MMD}_b[F, X, Y]| > 2\epsilon(\delta/2, m, F) \right\} \leq \delta. \qquad (10.8)
Proof The proof works simply by using convexity and suprema as follows:

|\mathrm{MMD}[F, p, q] - \mathrm{MMD}_b[F, X, Y]|
= \left| \sup_{f \in F} |E_p[f] - E_q[f]| - \sup_{f \in F} \left| \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{n} \sum_{i=1}^{n} f(y_i) \right| \right|
\leq \sup_{f \in F} \left| E_p[f] - E_q[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) + \frac{1}{n} \sum_{i=1}^{n} f(y_i) \right|
\leq \sup_{f \in F} \left| E_p[f] - \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right| + \sup_{f \in F} \left| E_q[f] - \frac{1}{n} \sum_{i=1}^{n} f(y_i) \right|.

Bounding each of the two terms via a uniform convergence bound proves the claim.
This shows that MMDb [F, X, Y ] can be used to estimate MMD[F, p, q] and
that the quantity is asymptotically unbiased.
Remark 10.9 (Reduction to Binary Classification) Any classifier which maps a set of observations {z_i, l_i} with z_i ∈ X on some domain X and labels l_i ∈ {±1}, for which uniform convergence bounds exist on the convergence of the empirical loss to the expected loss, can be used to obtain a similarity measure on distributions: simply assign l_i = 1 if z_i ∈ X and l_i = −1 for z_i ∈ Y and find a classifier which is able to separate the two sets. In this case maximization of E_p[f] − E_q[f] is achieved by ensuring that as many z ∼ p(z) as possible correspond to f(z) = 1, whereas for as many z ∼ q(z) as possible we have f(z) = −1. Consequently neural networks, decision trees,


boosted classifiers and other objects for which uniform convergence bounds
can be obtained can be used for the purpose of distribution comparison. For
instance, [BBCP07, Section 4] use the error of a hyperplane classifier to
approximate the A-distance between distributions of [KBG04].

10.9.5 Examples of Non-RKHS Function Classes


Other function spaces F inspired by the statistics literature can also be
considered in defining the MMD. Indeed, Lemma 10.1 defines an MMD with
F the space of bounded continuous real-valued functions, which is a Banach
space with the supremum norm [Dud02, p. 158]. We now describe two further
metrics on the space of probability distributions, the Kolmogorov-Smirnov
and Earth Mover's distances, and their associated function classes.
10.9.5.1 Kolmogorov-Smirnov Statistic
The Kolmogorov-Smirnov (K-S) test is probably one of the most famous two-sample tests in statistics. It works for random variables x ∈ R (or any other set for which we can establish a total order). Denote by F_p(x) the cumulative distribution function of p and let F_X(x) be its empirical counterpart, that is

F_p(z) := \Pr\{x \leq z \text{ for } x \sim p(x)\} \quad \text{and} \quad F_X(z) := \frac{1}{|X|} \sum_{i=1}^{|X|} 1_{x_i \leq z}.
It is clear that F_p captures the properties of p. The Kolmogorov metric is simply the L∞ distance ‖F_X − F_Y‖∞ for two sets of observations X and Y. [Smi39] showed that for p = q the limiting distribution of the empirical cumulative distribution functions satisfies

\lim_{m,n \to \infty} \Pr\left\{ \left( \frac{mn}{m+n} \right)^{1/2} \|F_X - F_Y\|_\infty > x \right\} = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 x^2} \quad \text{for } x \geq 0. \qquad (10.9)
This allows for an efficient characterization of the distribution under the
null hypothesis H0 . Efficient numerical approximations to (10.9) can be
found in numerical analysis handbooks [PTVF94]. The distribution under
the alternative, p ≠ q, however, is unknown.
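As a point of comparison with the MMD, the K-S statistic ‖F_X − F_Y‖∞ is easy to compute from the empirical distribution functions. The sketch below is illustrative and the function name is an assumption; in practice one would typically rely on a library routine such as scipy.stats.ks_2samp.

```python
import numpy as np

def ks_two_sample_statistic(X, Y):
    """sup_z |F_X(z) - F_Y(z)| for one-dimensional samples X and Y.
    The supremum is attained at one of the pooled sample points."""
    X, Y = np.sort(X), np.sort(Y)
    z = np.concatenate([X, Y])
    F_X = np.searchsorted(X, z, side="right") / len(X)  # empirical CDF of X at z
    F_Y = np.searchsorted(Y, z, side="right") / len(Y)  # empirical CDF of Y at z
    return np.abs(F_X - F_Y).max()
```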
The Kolmogorov metric is, in fact, a special instance of MMD[F, p, q] for
a certain Banach space [Mül97, Theorem 5.2].


Proposition 10.10 Let F be the class of functions X → R of bounded variation¹ 1. Then MMD[F, p, q] = ‖F_p − F_q‖∞.
10.9.5.2 Earth-Mover Distances
Another class of distance measures on distributions that may be written as
an MMD are the Earth-Mover distances. We assume (X, d) is a separable
metric space, and define P1 (X) to be the space of probability measures on
X for which ∫ d(x, z) dp(z) < ∞ for all p ∈ P₁(X) and x ∈ X (these are the probability measures for which E|x| < ∞ when X = R). We then have the
following definition [Dud02, p. 420].
Definition 10.11 (Monge-Wasserstein metric) Let p ∈ P₁(X) and q ∈ P₁(X). The Monge-Wasserstein distance is defined as

W(p, q) := \inf_{\mu \in M(p, q)} \int d(x, y) \, d\mu(x, y),

where M(p, q) is the set of joint distributions on X × X with marginals p and q.
We may interpret this as the cost (as represented by the metric d(x, y)) of
transferring mass distributed according to p to a distribution in accordance
with q, where μ is the movement schedule. In general, a large variety of costs
of moving mass from x to y can be used, such as psychooptical similarity
measures in image retrieval [RTG00]. The following theorem holds [Dud02,
Theorem 11.8.2].
Theorem 10.12 (Kantorovich-Rubinstein) Let p ∈ P₁(X) and q ∈ P₁(X), where X is separable. Then a metric on P₁(X) is defined as

W(p, q) = \|p - q\|_L^* = \sup_{\|f\|_L \leq 1} \int f \, d(p - q),

where

\|f\|_L := \sup_{x \neq y \in X} \frac{|f(x) - f(y)|}{d(x, y)}

is the Lipschitz seminorm² for real valued f on X.


¹ A function f defined on [a, b] is of bounded variation C if the total variation, i.e. the supremum over all sums
\sum_{1 \leq i \leq n} |f(x_i) - f(x_{i-1})|,
where a ≤ x₀ ≤ ... ≤ xₙ ≤ b, is bounded by C [Dud02, p. 184].
² A seminorm satisfies the requirements of a norm except that ‖x‖ = 0 need not imply x = 0 [Dud02, p. 156].


A simple example of this theorem is as follows [Dud02, Exercise 1, p. 425].

Example 10.1 Let X = R with associated d(x, y) = |x − y|. Then given f such that ‖f‖_L ≤ 1, we use integration by parts to obtain

\int f \, d(p - q) = -\int (F_p - F_q)(x) f'(x) \, dx \leq \int |F_p - F_q|(x) \, dx,

where the maximum is attained for the function g with derivative g′ = 2·1_{F_p > F_q} − 1 (and for which ‖g‖_L = 1). We recover the L₁ distance between distribution functions,

W(p, q) = \int |F_p - F_q|(x) \, dx.
One may further generalize Theorem 10.12 to the set of all laws P(X) on
arbitrary metric spaces X [Dud02, Proposition 11.3.2].
Definition 10.13 (Bounded Lipschitz metric) Let p and q be laws on a metric space X. Then

\beta(p, q) := \sup_{\|f\|_{BL} \leq 1} \int f \, d(p - q)

is a metric on P(X), where f belongs to the space of bounded Lipschitz functions with norm

\|f\|_{BL} := \|f\|_L + \|f\|_\infty.
10.10 Background Material


We now present three background results. First, we introduce the terminology used in statistical hypothesis testing. Second, we demonstrate via an
example that even for tests which have asymptotically no error, one cannot
guarantee performance at any fixed sample size without making assumptions
about the distributions. Finally, we briefly review some earlier approaches
to the two-sample problem.
10.10.1 Statistical Hypothesis Testing
Having described a metric on probability distributions (the MMD) based
on distances between their Hilbert space embeddings, and empirical estimates (biased and unbiased) of this metric, we now address the problem of
determining whether the empirical MMD shows a statistically significant difference between distributions. To this end, we briefly describe the framework


of statistical hypothesis testing as it applies in the present context, following


[CB02, Chapter 8]. Given i.i.d. samples X ∼ p of size m and Y ∼ q of size n, the statistical test T(X, Y) : X^m × X^n → {0, 1} is used to distinguish between the null hypothesis H0 : p = q and the alternative hypothesis H1 : p ≠ q. This is achieved by comparing the test statistic¹ MMD[F, X, Y] with a particular threshold: if the threshold is exceeded, then the test rejects the null hypothesis (bearing in mind that a zero population MMD indicates p = q). The acceptance region of the test is thus defined as the set of real numbers below the threshold. Since the test is based on finite samples, it is possible that an incorrect answer will be returned: we define the Type I error as the probability of rejecting p = q based on the observed sample, despite the null hypothesis having generated the data. Conversely, the Type II error is the probability of accepting p = q despite the underlying distributions being different. The level α of a test is an upper bound on the Type I error: this is a design parameter of the test, and is used to set the threshold to which we compare the test statistic (finding the test threshold for a given α is the topic of Sections 10.11 and 10.12). A consistent test achieves a level α, and a Type II error of zero, in the large sample limit. We will see that the tests proposed in this chapter are consistent.

10.10.2 A Negative Result


Even if a test is consistent, it is not possible to distinguish distributions with
high probability at a given, fixed sample size (i.e., to provide guarantees
on the Type II error), without prior assumptions as to the nature of the
difference between p and q. This is true regardless of the two-sample test
used. There are several ways to illustrate this, which each give different
insight into the kinds of differences that might be undetectable for a given
number of samples. The following example1 is one such illustration.
Example 10.2 Assume that we have a distribution p from which we draw m i.i.d. observations. Moreover, we construct a distribution q by drawing m² i.i.d. observations from p and subsequently defining a discrete distribution over these m² instances with probability m⁻² each. It is easy to check that if we now draw m observations from q, there is at least a

\binom{m^2}{m} \frac{m!}{m^{2m}} > 1 - e^{-1} > 0.63

probability that we thereby will have effectively obtained an m sample from p. Hence no test will be able to distinguish samples from p and q in this case.
¹ This may be biased or unbiased.
¹ This is a variation of a construction for independence tests, which was suggested in a private communication by John Langford.


We could make the probability of detection arbitrarily small by increasing the


size of the sample from which we construct q.

10.10.3 Previous Work


We next give a brief overview of some earlier approaches to the two-sample
problem for multivariate data. Since our later experimental comparison is
with respect to certain of these methods, we give abbreviated algorithm
names in italics where appropriate: these should be used as a key to the
tables in Section 10.15. A generalisation of the Wald-Wolfowitz runs test to
the multivariate domain was proposed and analysed by [FR79, HP99] (FR
Wolf ), and involves counting the number of edges in the minimum spanning
tree over the aggregated data that connect points in X to points in Y .
The resulting test relies on the asymptotic normality of the test statistic,
and this quantity is not distribution-free under the null hypothesis for finite
samples (it depends on p and q). The computational cost of this method
using Kruskal's algorithm is O((m + n)² log(m + n)), although more modern
methods improve on the log(m + n) term. See [Cha00] for details. [FR79]
claim that calculating the matrix of distances, which costs O((m + n)²),
dominates their computing time; we return to this point in our experiments
(Section 10.15). Two possible generalisations of the Kolmogorov-Smirnov
test to the multivariate case were studied in [Bic69, FR79]. The approach of
Friedman and Rafsky (FR Smirnov) in this case again requires a minimal
spanning tree, and has a similar cost to their multivariate runs test.
A more recent multivariate test was introduced by [Ros05]. This entails
computing the minimum distance non-bipartite matching over the aggregate
data, and using the number of pairs containing a sample from both X and
Y as a test statistic. The resulting statistic is distribution-free under the
null hypothesis at finite sample sizes, in which respect it is superior to the
Friedman-Rafsky test; on the other hand, it costs O((m + n)³) to compute.
Another distribution-free test (Hall) was proposed by [HT02]: for each point
from p, it requires computing the closest points in the aggregated data,
and counting how many of these are from q (the procedure is repeated for
each point from q with respect to points from p). As we shall see in our
experimental comparisons, the test statistic is costly to compute; [HT02]
consider only tens of points in their experiments.
Yet another approach is to use some distance (e.g. L1 or L2 ) between
Parzen window estimates of the densities as a test statistic [AHT94, BG05],
based on the asymptotic distribution of this distance given p = q. When
the L2 norm is used, the test statistic is related to those we present here,


although it is arrived at from a different perspective. Briefly, the test of


[AHT94] is obtained in a more restricted setting where the RKHS kernel is
an inner product between Parzen windows. Since we are not doing density
estimation, however, we need not decrease the kernel width as the sample
grows. In fact, decreasing the kernel width reduces the convergence rate
of the associated two-sample test, compared with the (m + n)^{−1/2} rate for
fixed kernels. We provide more detail in Section 10.14.1. The L1 approach
of [BG05] (Biau) requires the space to be partitioned into a grid of bins,
which becomes difficult or impossible for high dimensional problems. Hence
we use this test only for low-dimensional problems in our experiments.

10.11 Tests Based on Uniform Convergence Bounds


In this section, we introduce two statistical tests for the two-sample problem which
have exact performance guarantees at finite sample sizes, based on uniform
convergence bounds. The first, in Section 10.11.1, uses the [McD89] bound
on the biased MMD statistic, and the second, in Section 10.11.2, uses a
[Hoe63] bound for the unbiased statistic.

10.11.1 Bound on the Biased Statistic and Test


We establish two properties of the MMD, from which we derive a hypothesis
test. First, we show that regardless of whether or not p = q, the empirical
MMD converges in probability at rate O((m + n)^{−1/2}) to its population value.
This shows the consistency of statistical tests based on the MMD. Second,
we give probabilistic bounds for large deviations of the empirical MMD in
the case p = q. These bounds lead directly to a threshold for our first hypothesis test. We begin our discussion of the convergence of MMDb [F, X, Y ]
to MMD[F, p, q].
Theorem 10.14 Let p, q, X, Y be defined as in Problem 10.1, and assume 0 ≤ k(x, y) ≤ K. Then

\Pr\left\{ |\mathrm{MMD}_b[F, X, Y] - \mathrm{MMD}[F, p, q]| > 2\left( (K/m)^{1/2} + (K/n)^{1/2} \right) + \epsilon \right\} \leq 2 \exp\left( \frac{-\epsilon^2 mn}{2K(m+n)} \right).

See Appendix 10.17.2 for proof. Our next goal is to refine this result in a way that allows us to define a test threshold under the null hypothesis p = q. Under this circumstance, the constants in the exponent are slightly improved.


Theorem 10.15 Under the conditions of Theorem 10.14 where additionally p = q and m = n,

\mathrm{MMD}_b[F, X, Y] \leq \underbrace{\sqrt{\tfrac{2}{m} E_p[k(x, x) - k(x, x')]}}_{B_1(F, p)} + \epsilon \leq \underbrace{(2K/m)^{1/2}}_{B_2(F, p)} + \epsilon,

both with probability at least 1 − exp(−ε²m/(4K)) (see Appendix 10.17.3 for the proof).

In this theorem, we illustrate two possible bounds B1 (F, p) and B2 (F, p) on


the bias in the empirical estimate (10.6). The first inequality is interesting
inasmuch as it provides a link between the bias bound B1 (F, p) and kernel
size (for instance, if we were to use a Gaussian kernel with large σ, then k(x, x) and k(x, x′) would likely be close, and the bias small). In the context
of testing, however, we would need to provide an additional bound to show
convergence of an empirical estimate of B1 (F, p) to its population equivalent.
Thus, in the following test for p = q based on Theorem 10.15, we use B2 (F, p)
to bound the bias.1
Corollary 10.16 A hypothesis test of level α for the null hypothesis p = q, that is, for MMD[F, p, q] = 0, has the acceptance region

\mathrm{MMD}_b[F, X, Y] < \sqrt{2K/m} \left( 1 + \sqrt{2 \log \alpha^{-1}} \right).
We emphasise that Theorem 10.14 guarantees the consistency of the test, and that the Type II error probability decreases to zero at rate O(m^{−1/2}),
assuming m = n. To put this convergence rate in perspective, consider a
test of whether two normal distributions have equal means, given they have
unknown but equal variance [CB02, Exercise 8.41]. In this case, the test
statistic has a Student-t distribution with n + m 2 degrees of freedom, and
its error probability converges at the same rate as our test.
It is worth noting that bounds may be obtained for the deviation between
expectations μ[p] and the empirical means μ[X] in a completely analogous
fashion. The proof requires symmetrization by means of a ghost sample, i.e.
a second set of observations drawn from the same distribution. While not
the key focus of the present paper, such bounds can be used in the design
of inference principles based on moment matching [AS06, DS06, DPS04].
¹ Note that we use a tighter bias bound than [?].


10.11.2 Bound on the Unbiased Statistic and Test


While the previous bounds are of interest since the proof strategy can be used
for general function classes with well behaved Rademacher averages, a much
easier approach may be used directly on the unbiased statistic MMD2u in
Lemma 10.5. We base our test on the following theorem, which is a straightforward application of the large deviation bound on U-statistics of [Hoe63,
p. 25].
Theorem 10.17 Assume 0 ≤ k(x_i, x_j) ≤ K, from which it follows −2K ≤ h(z_i, z_j) ≤ 2K. Then

\Pr\left\{ \mathrm{MMD}^2_u(F, X, Y) - \mathrm{MMD}^2(F, p, q) > t \right\} \leq \exp\left( \frac{-t^2 m_2}{8 K^2} \right),

where m_2 := ⌊m/2⌋ (the same bound applies for deviations of −t and below).
A consistent statistical test for p = q using MMD2u is then obtained.
Corollary 10.18 A hypothesis test of level α for the null hypothesis p = q has the acceptance region \mathrm{MMD}^2_u < (4K/\sqrt{m}) \sqrt{\log \alpha^{-1}}.


We now compare the thresholds of the two tests. We note first that the threshold for the biased statistic applies to an estimate of MMD, whereas that for the unbiased statistic is for an estimate of MMD². Squaring the former threshold to make the two quantities comparable, the squared threshold in Corollary 10.16 decreases as m^{−1}, whereas the threshold in Corollary 10.18 decreases as m^{−1/2}. Thus for sufficiently large¹ m, the McDiarmid-based threshold will be lower (and the associated test statistic is in any case biased upwards), and its Type II error will be better for a given Type I bound. This is confirmed in our Section 10.15 experiments. Note, however, that the rate of convergence of the squared, biased MMD estimate to its population value remains at 1/√m (bearing in mind we take the square of a biased estimate, where the bias term decays as 1/√m).
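To make the comparison concrete, the following sketch (illustrative only; K = 1 is an assumption that holds, for instance, for a Gaussian kernel) evaluates the two distribution-free acceptance thresholds of Corollaries 10.16 and 10.18; squaring the first reproduces the crossover near m ≈ 12 mentioned in the footnote for α = 0.05.

```python
import numpy as np

def threshold_biased(m, alpha=0.05, K=1.0):
    """Corollary 10.16: accept H0 if MMD_b < sqrt(2K/m) * (1 + sqrt(2 log(1/alpha)))."""
    return np.sqrt(2 * K / m) * (1 + np.sqrt(2 * np.log(1 / alpha)))

def threshold_unbiased(m, alpha=0.05, K=1.0):
    """Corollary 10.18: accept H0 if MMD^2_u < (4K / sqrt(m)) * sqrt(log(1/alpha))."""
    return 4 * K / np.sqrt(m) * np.sqrt(np.log(1 / alpha))

if __name__ == "__main__":
    for m in (10, 12, 100, 1000):
        # compare the squared biased threshold with the unbiased (already squared) threshold
        print(m, threshold_biased(m)**2, threshold_unbiased(m))
```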


Finally, we note that the bounds we obtained here are rather conservative
for a number of reasons: first, they do not take the actual distributions into
account. In fact, they are finite sample size, distribution free bounds that
hold even in the worst case scenario. The bounds could be tightened using
localization, moments of the distribution, etc. Any such improvements could
be plugged straight into Theorem 10.8 for a tighter bound. See e.g. [BBL05]
for a detailed discussion of recent uniform convergence bounding methods.
¹ In the case of α = 0.05, this is m ≥ 12.


Second, in computing bounds rather than trying to characterize the distribution of MMD(F, X, Y ) explicitly, we force our test to be conservative by
design. In the following we aim for an exact characterization of the asymptotic distribution of MMD(F, X, Y ) instead of a bound. While this will not
satisfy the uniform convergence requirements, it leads to superior tests in
practice.

10.12 Test Based on the Asymptotic Distribution of the Unbiased Statistic
We now propose a third test, which is based on the asymptotic distribution
of the unbiased estimate of MMD2 in Lemma 10.5.
Theorem 10.19 We assume E[h²] < ∞. Under H1, MMD²_u converges in distribution [see e.g. GriSti01, Section 7.2] to a Gaussian according to

m^{1/2} \left( \mathrm{MMD}^2_u - \mathrm{MMD}^2[F, p, q] \right) \xrightarrow{D} N(0, \sigma_u^2),

where \sigma_u^2 = 4 \left( E_z\left[ (E_{z'} h(z, z'))^2 \right] - \left[ E_{z, z'} h(z, z') \right]^2 \right), uniformly at rate 1/√m [Ser80, Theorem B, p. 193]. Under H0, the U-statistic is degenerate, meaning E_{z'} h(z, z') = 0. In this case, MMD²_u converges in distribution according to

m \, \mathrm{MMD}^2_u \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left[ z_l^2 - 2 \right], \qquad (10.10)

where z_l ∼ N(0, 2) i.i.d., the λ_i are the solutions to the eigenvalue equation

\int_X \tilde{k}(x, x') \psi_i(x) \, dp(x) = \lambda_i \psi_i(x'),

and \tilde{k}(x_i, x_j) := k(x_i, x_j) - E_x k(x_i, x) - E_x k(x, x_j) + E_{x, x'} k(x, x') is the centred RKHS kernel.
The asymptotic distribution of the test statistic under H1 is given by [Ser80,
Section 5.5.1], and the distribution under H0 follows [Ser80, Section 5.5.2]
and [AHT94, Appendix]; see Appendix 10.18.1 for details. We illustrate the
MMD density under both the null and alternative hypotheses by approximating it empirically for both p = q and p ≠ q. Results are plotted in Figure
10.2.
Our goal is to determine whether the empirical test statistic MMD2u is so
large as to be outside the 1 − α quantile of the null distribution in (10.10)
(consistency of the resulting test is guaranteed by the form of the distribution


Fig. 10.2. Left: Empirical distribution of the MMD under H0, with p and q both Gaussians with unit standard deviation, using 50 samples from each. Right: Empirical distribution of the MMD under H1, with p a Laplace distribution with unit standard deviation, and q a Laplace distribution with standard deviation 3√2, using 100 samples from each. In both cases, the histograms were obtained by computing 2000 independent instances of the MMD.

under H1 ). One way to estimate this quantile is using the bootstrap on the
aggregated data, following [AG92]. Alternatively, we may approximate the
null distribution by fitting Pearson curves to its first four moments [JKB94,
Section 18.8]. Taking advantage of the degeneracy of the U-statistic, we
obtain (see Appendix 10.18.2)
E\left[ (\mathrm{MMD}^2_u)^2 \right] = \frac{2}{m(m-1)} E_{z, z'}\left[ h^2(z, z') \right] \qquad (10.11)

and

E\left[ (\mathrm{MMD}^2_u)^3 \right] = \frac{8(m-2)}{m^2 (m-1)^2} E_{z, z'}\left[ h(z, z') \, E_{z''}\left( h(z, z'') h(z', z'') \right) \right] + O(m^{-4}). \qquad (10.12)

The fourth moment E[(MMD²_u)⁴] is not computed, since it is both very small, O(m⁻⁴), and expensive to calculate, O(m⁴). Instead, we replace the kurtosis¹ with a lower bound due to [Wil44], kurt(MMD²_u) ≥ [skew(MMD²_u)]² + 1.
Note that MMD²_u may be negative, since it is an unbiased estimator of (MMD[F, p, q])². However, the only terms missing to ensure nonnegativity are the terms h(z_i, z_i), which were removed to avoid spurious correlations between observations. Consequently we have the bound

\mathrm{MMD}^2_u + \frac{1}{m(m-1)} \sum_{i=1}^{m} \left[ k(x_i, x_i) + k(y_i, y_i) - 2 k(x_i, y_i) \right] \geq 0. \qquad (10.13)
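The bootstrap approach mentioned above is simple to sketch. The code below is illustrative and not part of the original text: it treats the aggregated sample as exchangeable under H0 and repeatedly re-splits it (a permutation-style variant of the resampling idea of [AG92]) to approximate the null distribution of MMD²_u, rejecting when the observed statistic exceeds the empirical 1 − α quantile. The argument mmd2_unbiased stands for an estimator of (10.5), such as the one sketched earlier.

```python
import numpy as np

def bootstrap_test(X, Y, mmd2_unbiased, alpha=0.05, n_boot=1000, seed=0):
    """Resampling approximation of the null distribution of MMD^2_u on the
    aggregated data. Returns (statistic, threshold, reject)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    stat = mmd2_unbiased(X, Y)
    Z = np.concatenate([X, Y], axis=0)
    null_stats = np.empty(n_boot)
    for b in range(n_boot):
        perm = rng.permutation(len(Z))            # re-split the aggregate sample
        null_stats[b] = mmd2_unbiased(Z[perm[:m]], Z[perm[m:]])
    threshold = np.quantile(null_stats, 1 - alpha)
    return stat, threshold, stat > threshold
```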

10.13 A Linear Time Statistic and Test


While the above tests are already more efficient than the O(m² log m) and O(m³) tests described earlier, it is still desirable to obtain O(m) tests which
do not sacrifice too much statistical power. Moreover, we would like to obtain
tests which have O(1) storage requirements for computing the test statistic
in order to apply it to data streams. We now describe how to achieve this
¹ The kurtosis is defined in terms of the fourth and second moments as kurt(MMD²_u) = \frac{E[(\mathrm{MMD}^2_u)^4]}{\left( E[(\mathrm{MMD}^2_u)^2] \right)^2} - 3.


by computing the test statistic based on a subsampling of the terms in the


sum. The empirical estimate in this case is obtained by drawing pairs from
X and Y respectively without replacement.
Lemma 10.20 Recall m_2 := ⌊m/2⌋. The estimator

\mathrm{MMD}^2_l[F, X, Y] := \frac{1}{m_2} \sum_{i=1}^{m_2} h\left( (x_{2i-1}, y_{2i-1}), (x_{2i}, y_{2i}) \right)

can be computed in linear time. Moreover, it is an unbiased estimate of MMD²[F, p, q].
While it is expected (as we will see explicitly later) that MMD²_l has higher variance than MMD²_u, it is computationally much more appealing. In particular, the statistic can be used in stream computations needing only O(1) memory, whereas MMD²_u requires O(m) storage and O(m²) time to compute the kernel h on all interacting pairs.
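The linear-time estimator of Lemma 10.20 is equally easy to sketch (illustrative code; the Gaussian kernel is an assumption). Because each pair of pairs is visited once, the same loop also runs over a data stream with O(1) additional memory.

```python
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b)**2) / (2 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    """MMD^2_l of Lemma 10.20: average of h over floor(m/2) disjoint pairs of pairs."""
    m2 = len(X) // 2
    total = 0.0
    for i in range(m2):
        x1, y1 = X[2 * i], Y[2 * i]
        x2, y2 = X[2 * i + 1], Y[2 * i + 1]
        total += (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
                  - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    return total / m2
```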
Since MMD²_l is just the average over a set of random variables, Hoeffding's
bound and the central limit theorem readily allow us to provide both uniform
convergence and asymptotic statements for it with little effort. The first
follows directly from [Hoe63, Theorem 2].
Theorem 10.21 Assume 0 ≤ k(x_i, x_j) ≤ K. Then

\Pr\left\{ \mathrm{MMD}^2_l(F, X, Y) - \mathrm{MMD}^2(F, p, q) > t \right\} \leq \exp\left( \frac{-t^2 m_2}{8 K^2} \right),

where m_2 := ⌊m/2⌋ (the same bound applies for deviations of −t and below).
Note that the bound of Theorem 10.17 is identical to that of Theorem 10.21,
which shows the former is rather loose. Next we invoke the central limit
theorem.
Corollary 10.22 Assume 0 < E[h²] < ∞. Then MMD²_l converges in distribution to a Gaussian according to

m^{1/2} \left( \mathrm{MMD}^2_l - \mathrm{MMD}^2[F, p, q] \right) \xrightarrow{D} N(0, \sigma_l^2),

where \sigma_l^2 = 2 \left( E_{z, z'}\left[ h^2(z, z') \right] - \left[ E_{z, z'} h(z, z') \right]^2 \right), uniformly at rate 1/√m.

The factor of 2 arises since we are averaging over only ⌊m/2⌋ observations. Note the difference in the variance between Theorem 10.19 and Corollary 10.22, namely in the former case we are interested in the average conditional variance E_z Var_{z′}[h(z, z′)|z], whereas in the latter case we compute the full variance Var_{z,z′}[h(z, z′)].


We end by noting another potential approach to reducing the computational cost of the MMD, by computing a low rank approximation to the
Gram matrix [FS01, WS01, SS00]. An incremental computation of the MMD
based on such a low rank approximation would require O(md) storage and
O(md) computation (where d is the rank of the approximate Gram matrix which is used to factorize both matrices) rather than O(m) storage and
O(m²) operations. That said, it remains to be determined what effect this
approximation would have on the distribution of the test statistic under H0 ,
and hence on the test threshold.

10.14 Similarity Measures Related to MMD


Our main point is to propose a new kernel statistic to test whether two
distributions are the same. However, it is reassuring to observe links to
other measures of similarity between distributions.

10.14.1 Link with L2 Distance between Parzen Window Estimates
In this section, we demonstrate the connection between our test statistic and
the Parzen window-based statistic of [AHT94]. We show that a two-sample
test based on Parzen windows converges more slowly than an RKHS-based
test, also following [AHT94]. Before proceeding, we motivate this discussion
with a short overview of the Parzen window estimate and its properties
[Sil86]. We assume a distribution p on Rd , which has an associated density
function also written p to minimise notation. The Parzen window estimate
of this density from an i.i.d. sample X of size m is

\hat{p}(x) = \frac{1}{m} \sum_{l=1}^{m} \kappa(x_l - x), \quad \text{where } \kappa \text{ satisfies } \int_X \kappa(x) \, dx = 1 \text{ and } \kappa(x) \geq 0.

We may rescale κ according to κ_{h_m}(x) = h_m^{−d} κ(x/h_m). Consistency of the Parzen window estimate requires

\lim_{m \to \infty} h_m^d = 0 \quad \text{and} \quad \lim_{m \to \infty} m h_m^d = \infty. \qquad (10.14)

We now show that the L₂ distance between Parzen window density estimates [AHT94] is a special case of the biased MMD in equation (10.6). Denote by D_r(p, q) := ‖p − q‖_r the L_r distance. For r = 1 the distance D_r(p, q) is known as the Lévy distance [Fel71], and for r = 2 we encounter distance measures derived from the Rényi entropy [GP02].


Assume that p̂ and q̂ are given as kernel density estimates with kernel κ(x − x′), that is, p̂(x) = (1/m) Σ_i κ(x_i − x) and q̂(y) is defined by analogy. In this case

D_2(\hat{p}, \hat{q})^2 = \int \left[ \frac{1}{m} \sum_{i=1}^{m} \kappa(x_i - z) - \frac{1}{n} \sum_{i=1}^{n} \kappa(y_i - z) \right]^2 dz \qquad (10.15)
= \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i - x_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i - y_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i - y_j), \qquad (10.16)

where k(x − y) = ∫ κ(x − z) κ(y − z) dz. By its definition k(x − y) is a Mercer kernel [Mer09], as it can be viewed as an inner product between κ(x − z) and κ(y − z) on the domain X.
A disadvantage of the Parzen window interpretation is that when the
Parzen window estimates are consistent (which requires the kernel size to
decrease with increasing sample size), the resulting two-sample test converges more slowly than using fixed kernels. According to [AHT94, p. 43], the Type II error of the two-sample test converges as m^{−1/2} h_m^{−d/2}. Thus, given the schedule for the Parzen window size decrease in (10.14), the convergence rate will lie in the open interval (0, 1/2): the upper limit is approached as h_m decreases more slowly, and the lower limit corresponds to h_m decreasing near the upper bound of 1/m. In other words, by avoiding density estimation, we obtain a better convergence rate (namely m^{−1/2}) than
using a Parzen window estimate with any permissible bandwidth decrease
schedule. In addition, the Parzen window interpretation cannot explain the
excellent performance of MMD based tests in experimental settings where
the dimensionality greatly exceeds the sample size (for instance the Gaussian toy example in Figure 10.4B, for which performance actually improves
when the dimensionality increases; and the microarray datasets in Table
10.1). Finally, our tests are able to employ universal kernels that cannot be
written as inner products between Parzen windows, normalized or otherwise:
several examples are given by [Ste01, Section 3] and [MXZ06, Section 3]. We
may further generalize to kernels on structured objects such as strings and
graphs [STV04]: see also our experiments in Section 10.15.
10.14.2 Set Kernels and Kernels Between Probability Measures
[?] propose kernels to deal with sets of observations. These are then used
in the context of Multi-Instance Classification (MIC). The problem MIC
attempts to solve is to find estimators which are able to infer, from the fact that some elements in the set satisfy a certain property, that the set of observations has this property, too. For instance, a dish of mushrooms
is poisonous if it contains poisonous mushrooms. Likewise a keyring will
open a door if it contains a suitable key. One is only given the ensemble,
however, rather than information about which instance of the set satisfies
the property.
The solution proposed by [?] is to map the ensembles X_i := {x_{i1}, ..., x_{im_i}}, where i is the ensemble index and m_i the number of elements in the i-th ensemble, jointly into feature space via

\mu(X_i) := \frac{1}{m_i} \sum_{j=1}^{m_i} \phi(x_{ij}), \qquad (10.17)

and use the latter as the basis for a kernel method. This simple approach affords rather good performance. With the benefit of hindsight, it is now understandable why the kernel

k(X_i, X_j) = \frac{1}{m_i m_j} \sum_{u,v}^{m_i, m_j} k(x_{iu}, x_{jv}) \qquad (10.18)

produces useful results: it is simply the kernel between the empirical means in feature space, ⟨μ(X_i), μ(X_j)⟩ [HLB04, Eq. 4]. [JK03] later extended this setting by smoothing the empirical densities before computing inner products.
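The set kernel (10.18) is straightforward to compute. The sketch below is illustrative (names assumed; a Gaussian base kernel is an assumption): the double sum is simply the mean of all pairwise base-kernel values, which equals the inner product of the two empirical mean maps.

```python
import numpy as np

def set_kernel(Xi, Xj, sigma=1.0):
    """k(X_i, X_j) = (1 / (m_i m_j)) * sum_{u,v} k(x_iu, x_jv), cf. (10.18)."""
    sq = np.sum(Xi**2, 1)[:, None] + np.sum(Xj**2, 1)[None, :] - 2 * Xi @ Xj.T
    K = np.exp(-sq / (2 * sigma**2))
    return K.mean()  # average over all m_i * m_j pairs
```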
Note, however, that property testing for distributions is probably not optimal when using the mean μ[p] (or μ[X] respectively): we are only interested in determining whether some instances in the domain have the desired
property, rather than making a statement regarding the distribution of those
instances. Taking this into account leads to an improved algorithm [ATH03].

10.14.3 Kernel Measures of Independence


We next demonstrate the application of MMD in determining whether two
random variables x and y are independent. In other words, assume that pairs
of random variables (x_i, y_i) are jointly drawn from some distribution p := Pr_{x,y}. We wish to determine whether this distribution factorizes, i.e. whether q := Pr_x × Pr_y is the same as p. One application of such an independence
measure is in independent component analysis [Com94], where the goal is to
find a linear mapping of the observations xi to obtain mutually independent
outputs. Kernel methods were employed to solve this problem by [BJ02,


GBSS05, GHS+ 05]. In the following we re-derive one of the above kernel
independence measures using mean operators instead.
We begin by defining

\mu[\Pr_{xy}] := E_{x,y}[v((x, y), \cdot)] \quad \text{and} \quad \mu[\Pr_x \times \Pr_y] := E_x E_y[v((x, y), \cdot)].

Here we assumed that V is an RKHS over X × Y with kernel v((x, y), (x′, y′)). If x and y are dependent, the equality μ[Pr_{xy}] = μ[Pr_x × Pr_y] will not hold. Hence we may use Δ := ‖μ[Pr_{xy}] − μ[Pr_x × Pr_y]‖ as a measure of dependence. Now assume that v((x, y), (x′, y′)) = k(x, x′) l(y, y′), i.e. that the RKHS V is a direct product H ⊗ G of the RKHSs on X and Y. In this case it is easy to see that

\Delta^2 = \left\| E_{xy}[k(x, \cdot) l(y, \cdot)] - E_x[k(x, \cdot)] E_y[l(y, \cdot)] \right\|^2
= E_{xy} E_{x'y'}[k(x, x') l(y, y')] - 2 E_x E_y E_{x'y'}[k(x, x') l(y, y')] + E_x E_y E_{x'} E_{y'}[k(x, x') l(y, y')].
The latter, however, is exactly what [GBSS05] show to be the Hilbert-Schmidt norm of the cross-covariance operator between RKHSs: this is zero
if and only if x and y are independent, for universal kernels. We have the
following theorem:
Theorem 10.23 Denote by C_xy the covariance operator between random variables x and y, drawn jointly from Pr_{xy}, where the functions on X and Y are the reproducing kernel Hilbert spaces F and G respectively. Then the Hilbert-Schmidt norm ‖C_xy‖_HS equals Δ.
Empirical estimates of this quantity are as follows:
Theorem 10.24 Denote by K and L the kernel matrices on X and Y respectively, and by H = I − 11⊤/m the projection matrix onto the subspace orthogonal to the vector with all entries set to 1. Then m⁻² tr(HKHL) is an estimate of Δ² with bias O(m⁻¹). With high probability the deviation from Δ² is O(m^{−1/2}).
[GBSS05] provide explicit constants. In certain circumstances, including in
the case of RKHSs with Gaussian kernels, the empirical Δ² may also be
interpreted in terms of a smoothed difference between the joint empirical
characteristic function (ECF) and the product of the marginal ECFs [Feu93,
Kan95]. This interpretation does not hold in all cases, however, e.g. for


Fig. 10.3. Illustration of the function maximizing the mean discrepancy when MMD
is used as a measure of independence. A sample from dependent random variables
x and y is shown in black, and the associated function f that witnesses the MMD
is plotted as a contour. The latter was computed empirically on the basis of 200
samples, using a Gaussian kernel with σ = 0.2.

kernels on strings, graphs, and other structured spaces. An illustration of


the witness function f F from Definition 10.2 is provided in Figure 10.3.
This is a smooth function which has large magnitude where the joint density
is most different from the product of the marginals.
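The empirical estimate in Theorem 10.24 is equally compact. The following sketch is illustrative and not from the text (Gaussian kernels on both variables are an assumption): it centres the two Gram matrices with H = I − 11⊤/m and takes the normalized trace.

```python
import numpy as np

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical estimate m^{-2} tr(HKHL) of Delta^2 (Theorem 10.24)."""
    m = len(X)
    def gram(A, sigma):
        sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
        return np.exp(-sq / (2 * sigma**2))
    K, L = gram(X, sigma_x), gram(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m        # centering projection H = I - 11^T / m
    return np.trace(H @ K @ H @ L) / m**2
```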
We remark that a hypothesis test based on the above kernel statistic is
more complicated than for the two-sample problem, since the product of the
marginal distributions is in effect simulated by permuting the variables of
the original sample. Further details are provided by [?].

10.14.4 Kernel Statistics Using a Distribution over Witness Functions
[STD07] define a distance between distributions as follows: let F be a set of functions on X and r be a probability distribution over F. Then the distance between two distributions p and q is given by

D(p, q) := E_{f \sim r(f)} \left| E_{x \sim p}[f(x)] - E_{x \sim q}[f(x)] \right|. \qquad (10.19)

That is, we compute the average distance between p and q with respect to
a distribution of test functions.
Lemma 10.25 Let H be a reproducing kernel Hilbert space, f ∈ H, and assume r(f) = r(‖f‖_H) with finite E_{f∼r}[‖f‖_H]. Then D(p, q) = C ‖μ[p] − μ[q]‖_H for some constant C which depends only on H and r.
Proof By definition E_p[f(x)] = ⟨μ[p], f⟩_H. Using linearity of the inner product, Equation (10.19) equals

\int \left| \langle \mu[p] - \mu[q], f \rangle_H \right| dr(f) = \|\mu[p] - \mu[q]\|_H \int \left| \left\langle \frac{\mu[p] - \mu[q]}{\|\mu[p] - \mu[q]\|_H}, f \right\rangle_H \right| dr(f),

where the integral is independent of p, q. To see this, note that for any p, q, the vector (μ[p] − μ[q])/‖μ[p] − μ[q]‖_H is a unit vector which can be turned into, say, the first canonical basis vector by a rotation which leaves the integral invariant, bearing in mind that r is rotation invariant.

10.14.5 Outlier Detection


An application related to the two-sample problem is that of outlier detection: this is the question of whether a novel point is generated from the same distribution as a particular i.i.d. sample. In a way, this is a special case of a two-sample test, where the second sample contains only one observation.
Several methods essentially rely on the distance between a novel point to
the sample mean in feature space to detect outliers.
For instance, [DGDR02] use a related method to deal with nonstationary
time series. Likewise [SC04, p. 117] discuss how to detect novel observations by using the following reasoning: the probability of being an outlier
is bounded both as a function of the spread of the points in feature space
and the uncertainty in the empirical feature space mean (as bounded using
symmetrisation and McDiarmid's tail bound).
Instead of using the sample mean and variance, [TD99] estimate the center and radius of a minimal enclosing sphere for the data, the advantage
being that such bounds can potentially lead to more reliable tests for single
observations. [SPST+ 01] show that the minimal enclosing sphere problem is
equivalent to novelty detection by means of finding a hyperplane separating
the data from the origin, at least in the case of radial basis function kernels.

10.15 Experiments
We conducted distribution comparisons using our MMD-based tests on datasets
from three real-world domains: database applications, bioinformatics, and
neurobiology. We investigated both uniform convergence approaches (MMD_b with the Corollary 10.16 threshold, and MMD²_u H with the Corollary 10.18 threshold); the asymptotic approaches with bootstrap (MMD²_u B) and moment matching to Pearson curves (MMD²_u M), both described in Section 10.12; and the asymptotic approach using the linear time statistic (MMD²_l)
from Section 10.13. We also compared against several alternatives from
the literature (where applicable): the multivariate t-test, the Friedman-Rafsky Kolmogorov-Smirnov generalisation (Smir), the Friedman-Rafsky
Wald-Wolfowitz generalisation (Wolf ), the Biau-Gyorfi test (Biau), and the
Hall-Tajvidi test (Hall). See Section 10.10.3 for details regarding these tests.
Note that we do not apply the Biau-Gyorfi test to high-dimensional problems (since the required space partitioning is no longer possible), and that
MMD is the only method applicable to structured data such as graphs.
An important issue in the practical application of the MMD-based tests
is the selection of the kernel parameters. We illustrate this with a Gaussian
RBF kernel, where we must choose the kernel width σ (we use this kernel for univariate and multivariate data, but not for graphs). The empirical MMD is zero both for kernel size σ = 0 (where the aggregate Gram matrix over X and Y is the identity matrix), and also approaches zero as σ → ∞ (where the aggregate Gram matrix becomes uniformly constant). We set σ to be the median distance between points in the aggregate sample, as a compromise between these two extremes: this remains a heuristic, similar to those
described in [TLSS06, Sch97], and the optimum choice of kernel size is an
ongoing area of research.
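The median heuristic just described amounts to one line of linear algebra. The sketch below is illustrative (the helper name is an assumption, not from the text): it returns the kernel width σ as the median of the pairwise Euclidean distances in the pooled sample.

```python
import numpy as np

def median_heuristic(X, Y):
    """Set the Gaussian kernel width to the median pairwise distance
    in the aggregated sample Z = X union Y."""
    Z = np.concatenate([X, Y], axis=0)
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    d = np.sqrt(np.maximum(sq, 0.0))
    return np.median(d[np.triu_indices_from(d, k=1)])  # distinct pairs only
```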

10.15.1 Toy Example: Two Gaussians


In our first experiment, we investigated the scaling performance of the various tests as a function of the dimensionality d of the space X = R^d, when
both p and q were Gaussian. We considered values of d up to 2500: the
performance of the MMD-based tests cannot therefore be explained in the
context of density estimation (as in Section 10.14.1), since the associated
density estimates are necessarily meaningless here. The levels for all tests
were set at = 0.05, m = 250 samples were used, and results were averaged
over 100 repetitions. In the first case, the distributions had different means
and unit variance. The percentage of times the null hypothesis was correctly
rejected over a set of Euclidean distances between the distribution means
(20 values logarithmically spaced from 0.05 to 50), was computed as a function of the dimensionality of the normal distributions. In case of the t-test,
a ridge was added to the covariance estimate, to avoid singularity (the ratio
of largest to smallest eigenvalue was ensured to be at most 2). In the second case, samples were drawn from distributions N(0, I) and N(0, 2 I) with
different variance. The percentage of null rejections was averaged over 20
values logarithmically spaced from 100.01 to 10. The t-test was not compared
in this case, since its output would have been irrelevant. Results are plotted
in Figure 10.4.
Fig. 10.4. Type II performance of the various tests when separating two Gaussians, with test level α = 0.05. A: Gaussians have same variance and different means. B: Gaussians have same mean and different variances.

In the case of Gaussians with differing means, we observe that the t-test performs best in low dimensions; however, its performance is severely weakened when the number of samples exceeds the number of dimensions. The performance of MMD2u M is comparable to the t-test for low sample sizes, and outperforms all other methods for larger sample sizes. The worst performance is obtained for MMD2u H, though MMDb also does relatively poorly: this is unsurprising given that these tests derive from distribution-free large deviation bounds, whereas the sample size is relatively small. Remarkably, MMD2l performs quite well compared with classical tests in high dimensions.
In the case of Gaussians of differing variance, the Hall test performs best, followed closely by MMD2u. FR Wolf and (to a much greater extent) FR Smirnov both have difficulties in high dimensions, failing completely once the dimensionality becomes too great. The linear test MMD2l again performs surprisingly well, almost matching the MMD2u performance in the highest dimensionality. Both MMD2u H and MMDb perform poorly, the former failing completely: this is one of several illustrations we will encounter of the much greater tightness of the Corollary 10.16 threshold over that in Corollary 10.18.
10.15.2 Data Integration
In our next application of MMD, we performed distribution testing for data
integration: the objective being to aggregate two datasets into a single sample, with the understanding that both original samples were generated from
the same distribution. Clearly, it is important to check this last condition before proceeding, or an analysis could detect patterns in the new dataset that
are caused by combining the two different source distributions, and not by
real-world phenomena. We chose several real-world settings to perform this
task: we compared microarray data from normal and tumor tissues (Health
status), microarray data from different subtypes of cancer (Subtype), and
local field potential (LFP) electrode recordings from the Macaque primary visual cortex (V1) with and without spike events (Neural Data I and II, as described in more detail by [RasGreMurMasetal08]). In all cases, the two data sets have different statistical properties, but the detection of these differences is made difficult by the high data dimensionality (indeed, for the microarray data, density estimation is impossible given the sample size and data dimensionality, and no successful test can rely on accurate density estimates as an intermediate step).
We applied our tests to these datasets in the following fashion. Given two
datasets A and B, we either chose one sample from A and the other from


B (attributes = different); or both samples from either A or B (attributes =


same). We then repeated this process up to 1200 times. Results are reported
in Table 10.1. Our asymptotic tests perform better than all competitors
besides Wolf : in the latter case, we have greater Type II error for one neural
dataset, lower Type II error on the Health Status data (which has very high
dimension and low sample size), and identical (error-free) performance on
the remaining examples. We note that the Type I error of the bootstrap test
on the Subtype dataset is far from its design value of 0.05, indicating that
the Pearson curves provide a better threshold estimate for these low sample
sizes. For the remaining datasets, the Type I errors of the Pearson and
Bootstrap approximations are close. Thus, for larger datasets, the bootstrap
is to be preferred, since it costs O(m²), compared with a cost of O(m³) for Pearson (due to the cost of computing (10.12)). Finally, the uniform
convergence-based tests are too conservative, with MMDb finding differences
in distribution only for the data with largest sample size, and MMD2u H never
finding differences.
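As an illustration of the bootstrap threshold referred to above, the following hedged sketch resamples the aggregate pooled sample (here 150 resamplings, matching the experiments reported below); it assumes the mmd2_unbiased helper from the earlier sketch and is not meant to reproduce the exact procedure of Section 10.12.

    # Hypothetical sketch: bootstrap threshold for the MMD^2_u test at level alpha,
    # obtained by repeatedly re-splitting the aggregate sample.
    import numpy as np

    def mmd2_bootstrap_threshold(X, Y, sigma, alpha=0.05, n_boot=150, seed=0):
        rng = np.random.default_rng(seed)
        Z = np.vstack([X, Y])
        m = len(X)
        stats = []
        for _ in range(n_boot):
            perm = rng.permutation(len(Z))
            stats.append(mmd2_unbiased(Z[perm[:m]], Z[perm[m:]], sigma))
        return np.quantile(stats, 1 - alpha)

    # Reject H0: p = q whenever mmd2_unbiased(X, Y, sigma) exceeds the threshold.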

Table 10.1. Distribution testing for data integration on multivariate data. Numbers indicate the percentage of repetitions for which the null hypothesis (p = q) was accepted, given α = 0.05. Sample size (dimension; repetitions of experiment): Neural I 4000 (63; 100); Neural II 1000 (100; 1200); Health Status 25 (12,600; 1000); Subtype 25 (2,118; 1000).

Dataset          Attr.       MMDb    MMD2u H  MMD2u B  MMD2u M  t-test  Wolf   Smir   Hall
Neural Data I    Same        100.0   100.0    96.5     96.5     100.0   97.0   95.0   96.0
Neural Data I    Different    38.0   100.0     0.0      0.0      42.0    0.0   10.0   49.0
Neural Data II   Same        100.0   100.0    94.6     95.2     100.0   95.0   94.5   96.0
Neural Data II   Different    99.7   100.0     3.3      3.4     100.0    0.8   31.8    5.9
Health status    Same        100.0   100.0    95.5     94.4     100.0   94.7   96.1   95.6
Health status    Different   100.0   100.0     1.0      0.8     100.0    2.8   44.0   35.7
Subtype          Same        100.0   100.0    99.1     96.4     100.0   94.6   97.3   96.5
Subtype          Different   100.0   100.0     0.0      0.0     100.0    0.0   28.4    0.2


10.15.3 Computational Cost


We next investigate the tradeoff between computational cost and performance of the various tests, with particular attention to how the quadratic
time MMD tests from Sections 10.11 and 10.12 compare with the linear
time MMD-based asymptotic test from Section 10.13. We consider two 1-D
datasets (CNUM and FOREST) and two higher-dimensional datasets (FOREST10D and NEUROII). Results are plotted in Figure 10.5. If cost is not
a factor, then the MMD2u B shows best overall performance as a function
of sample size, with a Type II error dropping to zero as fast or faster than
competing approaches in three of four cases, and narrowly trailing FR Wolf
in the fourth (FOREST10D). That said, for datasets CNUM, FOREST, and
FOREST10D, the linear time MMD achieves results comparable to MMD2u
B at a far smaller computational cost, albeit by looking at a great deal more
data. In the CNUM case, however, the linear test is not able to achieve zero
error even for the largest data set size. For the NEUROII data, attaining
zero Type II error has about the same cost for both approaches. The difference in cost of MMD2u B and MMDb is due to the bootstrapping required for
the former, which produces a constant offset in cost between the two (here
150 resamplings were used).
The t-test also performs well in three of the four problems, and in fact
represents the best cost-performance tradeoff in these three datasets (i.e.
while it requires much more data than MMD2u B for a given level of performance, it costs far less to compute). The t-test assumes that only the
difference in means is important in distinguishing the distributions, and it
requires an accurate estimate of the within-sample covariance; the test fails
completely on the NEUROII data. We emphasise that the Kolmogorov-Smirnov results in 1-D were obtained using the classical statistic, and not the Friedman-Rafsky statistic, hence the low computational cost. The cost of both Friedman-Rafsky statistics is therefore given by the FR Wolf cost in this case. The latter scales similarly with sample size to the quadratic time MMD tests, confirming Friedman and Rafsky's observation that obtaining
the pairwise distances between sample points is the dominant cost of their
tests. We also remark on the unusual behaviour of the Type II error of the
FR Wolf test in the FOREST dataset, which worsens for increasing sample
size.
We conclude that the approach to be recommended when testing homogeneity will depend on the data available: for small amounts of data, the
best results are obtained using every observation to maximum effect, and
employing the quadratic time MMD2u B test. When large volumes of data are


available, a better option is to look at each point only once, which can yield
greater accuracy for a given computational cost. It may also be worth doing a
t-test first in this case, and only running more sophisticated non-parametric
tests if the t-test accepts the null hypothesis, to verify the distributions are
identical in more than just mean.
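As a rough illustration of the linear-time alternative discussed above, the sketch below averages the U-statistic kernel h over disjoint pairs of observations, which is one way to realise the Section 10.13 construction; the pairing scheme, the function names, and the omission of the variance estimate used for the threshold are our own simplifying assumptions.

    # Hypothetical sketch of a linear-time MMD^2 statistic: average h over
    # disjoint pairs so that each observation is touched only once.
    import numpy as np

    def gaussian_k(a, b, sigma):
        return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

    def mmd2_linear(X, Y, sigma):
        m = (min(len(X), len(Y)) // 2) * 2      # use an even number of points
        x1, x2 = X[0:m:2], X[1:m:2]
        y1, y2 = Y[0:m:2], Y[1:m:2]
        h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
             - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
        return h.mean()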

10.15.4 Attribute Matching


Our final series of experiments addresses automatic attribute matching.
Given two databases, we want to detect corresponding attributes in the
schemas of these databases, based on their data-content (as a simple example, two databases might have respective fields Wage and Salary, which are assumed to be observed via a subsampling of a particular population, and we wish to automatically determine that both Wage and Salary denote the same underlying attribute). We use a two-sample test on pairs of attributes from two databases to find corresponding pairs.¹ This procedure is
also called table matching for tables from different databases. We performed
attribute matching as follows: first, the dataset D was split into two halves
A and B. Each of the n attributes in A (and B, resp.) was then represented
by its instances in A (resp. B). We then tested all pairs of attributes from A
and from B against each other, to find the optimal assignment of attributes
A1 , . . . , An from A to attributes B1 , . . . , Bn from B. We assumed that A and
B contain the same number of attributes.
As a naive approach, one could assume that any possible pair of attributes
might correspond, and thus that every attribute of A needs to be tested
against all the attributes of B to find the optimal match. We report results for this naive approach, aggregated over all pairs of possible attribute
matches, in Table 10.2. We used three datasets: the census income dataset
from the UCI KDD archive (CNUM), the protein homology dataset from
the 2004 KDD Cup (BIO) [CJ04], and the forest dataset from the UCI ML
archive [BM98]. For the final dataset, we performed univariate matching of
attributes (FOREST) and multivariate matching of tables (FOREST10D)
from two different databases, where each table represents one type of forest.
Both our asymptotic MMD2u -based tests perform as well as or better than
the alternatives, notably for CNUM, where the advantage of MMD2u is large.
Unlike in Table 10.1, the next best alternatives are not consistently the same
across all data: e.g. in BIO they are Wolf or Hall, whereas in FOREST they
¹ Note that corresponding attributes may have different distributions in real-world databases. Hence, schema matching cannot solely rely on distribution testing. Advanced approaches to schema matching using MMD as one key statistical test are a topic of current research.


Fig. 10.5. Linear vs quadratic MMD. First column is performance, second is runtime. The dashed grey horizontal line indicates zero Type II error (required due to the log y-axis).


are Smir, Biau, or the t-test. Thus, MMD2u appears to perform more consistently across the multiple datasets. The Friedman-Rafsky tests do not
always return a Type I error close to the design parameter: for instance,
Wolf has a Type I error of 9.7% on the BIO dataset (on these data, MMD2u
has the joint best Type II error without compromising the designed Type
I performance). Finally, MMDb performs much better than in Table 10.1,
although surprisingly it fails to reliably detect differences in FOREST10D.
The results of MMD2u H are also improved, although it remains among the
worst performing methods.
A more principled approach to attribute matching is also possible. Assume that φ(A) = (φ₁(A₁), φ₂(A₂), ..., φₙ(Aₙ)): in other words, the kernel decomposes into kernels on the individual attributes of A (and also decomposes this way on the attributes of B). In this case MMD² can be written as Σ_{i=1}^n ‖φ_i(A_i) − φ_i(B_i)‖², where we sum over the MMD terms on each of the attributes. Our goal of optimally assigning attributes from B to attributes of A via MMD is equivalent to finding the optimal permutation π of attributes of B that minimizes Σ_{i=1}^n ‖φ_i(A_i) − φ_i(B_{π(i)})‖². If we define C_{ij} = ‖φ_i(A_i) − φ_i(B_j)‖², then this is the same as minimizing the sum over C_{i,π(i)}. This is the linear assignment problem, which costs O(n³) time using the Hungarian method [Kuh55].
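A hedged sketch of this matching step, with scipy's assignment solver standing in for a hand-written Hungarian method, might look as follows; mmd2_unbiased is the hypothetical helper from the earlier sketch, and attribute samples are passed as lists of arrays.

    # Hypothetical sketch: attribute matching via the linear assignment problem.
    # C[i, j] holds the MMD^2 between attribute i of database A and attribute j
    # of database B; the assignment minimizes the total matching cost.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_attributes(A_attrs, B_attrs, sigma):
        n = len(A_attrs)
        C = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                C[i, j] = mmd2_unbiased(A_attrs[i], B_attrs[j], sigma)
        row, col = linear_sum_assignment(C)   # col[i]: attribute of B matched to A_i
        return col, C[row, col].sum()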
While this may appear to be a crude heuristic, it nonetheless defines a
semi-metric on the sample spaces X and Y and the corresponding distributions p and q. This follows from the fact that matching distances are
proper metrics if the matching cost functions are metrics. We formalize this
as follows:

Theorem 10.26 Let p, q be distributions on R^d and denote by p_i, q_i the marginal distributions on the i-th variable. Moreover, denote by Π_d the symmetric group on {1, ..., d}. The following distance, obtained by optimal coordinate matching, is a semi-metric:

    Δ[F, p, q] := min_{π ∈ Π_d} Σ_{i=1}^d MMD[F, p_i, q_{π(i)}].

Proof  Clearly Δ[F, p, q] is nonnegative, since all of its summands are. Next we show the triangle inequality. Denote by r a third distribution on R^d and let π_{p,q}, π_{q,r} and π_{p,r} be the distance minimizing permutations between p, q; q, r; and p, r respectively. It then follows that

    Δ[F, p, q] + Δ[F, q, r] = Σ_{i=1}^d MMD[F, p_i, q_{π_{p,q}(i)}] + Σ_{i=1}^d MMD[F, q_i, r_{π_{q,r}(i)}]
                            ≥ Σ_{i=1}^d MMD[F, p_i, r_{π_{q,r}(π_{p,q}(i))}] ≥ Δ[F, p, r].

Here the first inequality follows from the triangle inequality on MMD, that is,

    MMD[F, p_i, q_{π_{p,q}(i)}] + MMD[F, q_{π_{p,q}(i)}, r_{π_{q,r}(π_{p,q}(i))}] ≥ MMD[F, p_i, r_{π_{q,r}(π_{p,q}(i))}].

The second inequality is a result of minimization over π.

Table 10.2. Naive attribute matching on univariate (BIO, FOREST, CNUM) and multivariate data (FOREST10D). Numbers indicate the percentage of accepted null hypothesis (p = q) pooled over attributes, with α = 0.05. Sample size (dimension; attributes; repetitions of experiment): BIO 377 (1; 6; 100); FOREST 538 (1; 10; 100); CNUM 386 (1; 13; 100); FOREST10D 1000 (10; 2; 100).

Dataset      Attr.       MMDb    MMD2u H  MMD2u B  MMD2u M  t-test  Wolf   Smir   Hall   Biau
BIO          Same        100.0   100.0    93.8     94.8      95.2   90.3   95.8   95.3    99
BIO          Different    20.0    52.6    17.2     17.6      36.2   17.2   18.6   17.9    42
FOREST       Same        100.0   100.0    96.4     96.0      97.4   94.6   99.8   95.5   100
FOREST       Different     3.9    11.0     0.0      0.0       0.2    3.8    0.0   50.1     0.
CNUM         Same        100.0   100.0    94.5     93.8      94.0   98.4   97.5   91.2    98
CNUM         Different    14.9    52.7     2.7      2.5      19.17  22.5   11.6   79.1    50
FOREST10D    Same        100.0   100.0    94.0     94.0     100.0   93.5   96.5   97.0   100
FOREST10D    Different    86.6   100.0     0.0      0.0       0.0    0.0    1.0   72.0   100
We tested this Hungarian approach to attribute matching via MMD2u B
on three univariate datasets (BIO, CNUM, FOREST) and for table matching
on a fourth (FOREST10D). To study MMD2u B on structured data, we
obtained two datasets of protein graphs (PROTEINS and ENZYMES) and
used the graph kernel for proteins from [BOS+ 05] for table matching via the Hungarian method (the other tests were not applicable to this graph data). The challenge here is to match tables representing one functional class of proteins (or enzymes) from dataset A to the corresponding tables (functional classes) in B. Results are shown in Table 10.3. Apart from the BIO and CNUM datasets, MMD2u B made no errors.
Table 10.3. Hungarian Method for attribute matching via MMD2u B on univariate (BIO, CNUM, FOREST), multivariate (FOREST10D), and structured data (ENZYMES, PROTEINS) (α = 0.05; % correct matches is the percentage of the correct attribute matches detected over all repetitions).

Dataset      Data type     No. attributes   Sample size   Repetitions   % correct matches
BIO          univariate         6               377           100            90.0
CNUM         univariate        13               386           100            99.8
FOREST       univariate        10               538           100           100.0
FOREST10D    multivariate       2              1000           100           100.0
ENZYME       structured         -                50            50           100.0
PROTEINS     structured         -               200            50           100.0

10.16 Conclusion
We have established three simple multivariate tests for comparing two distributions p and q, based on samples of size m and n from these respective
distributions. Our test statistic is the maximum mean discrepancy (MMD),
defined as the maximum deviation in the expectation of a function evaluated
on each of the random variables, taken over a sufficiently rich function class:
in our case, a universal reproducing kernel Hilbert space (RKHS). Equivalently, the statistic can be written as the norm of the difference between
distribution feature means in the RKHS. We do not require density estimates
as an intermediate step. Two of our tests provide Type I error bounds that
are exact and distribution-free for finite sample sizes. We also give a third
test based on quantiles of the asymptotic distribution of the associated test
statistic. All three tests can be computed in O((m + n)²) time; however, when sufficient data are available, a linear time statistic can be used, which employs more data to get better results at smaller computational cost. In addition, a number of metrics on distributions (Kolmogorov-Smirnov, Earth Mover's, L2 distance between Parzen window density estimates), as well as certain kernel similarity measures on distributions, are included within our framework.

While our result establishes that statistical tests based on the MMD are
consistent for universal kernels on compact domains, we draw attention to
the recent introduction of characteristic kernels by [FGSS08], these being
kernels for which the mean map is injective. Fukumizu et al. establish that
Gaussian and Laplace kernels are characteristic on Rd , and thus the MMD
is a consistent test for this domain. [?] further explore the properties of
characteristic kernels, providing a simple condition to determine whether
convolution kernels are characteristic, and describing characteristic kernels
which are not universal on compact domains. We also note (following Section 10.14.2) that the MMD for RKHSs is associated with a particular kernel between probability distributions. [HLB04] describe several further such
kernels, which induce corresponding distances between feature space distribution mappings: these may in turn lead to new and powerful two-sample
tests.

Two recent studies have shown that additional divergence measures between distributions can be obtained empirically through optimization in a
reproducing kernel Hilbert space. [HBM08] build on the work of [?], considering a homogeneity statistic arising from the kernel Fisher discriminant,
rather than the difference of RKHS means; and [NWJ08] obtain a KL divergence estimate by approximating the ratio of densities (or its log) with a
function in an RKHS. By design, both these kernel-based statistics prioritise
different features of p and q when measuring the divergence between them,
and the resulting effects on distinguishability of distributions are therefore
of interest.

Finally, we have seen in Section 10.9 that several classical metrics on probability distributions can be written as maximum mean discrepancies with
function classes that are not Hilbert spaces, but rather Banach, metric, or
semi-metric spaces. It would be of particular interest to establish under
what conditions one could write these discrepancies in terms of norms of
differences of mean elements. In particular, [DL07] consider Banach spaces
endowed with a semi-inner product, for which a General Riesz Representation exists for elements in the dual.


10.17 Large Deviation Bounds for Tests with Finite Sample Guarantees

10.17.1 Preliminary Definitions and Theorems
We need the following theorem, due to [McD89].

Theorem 10.27 (McDiarmid's inequality) Let f : X^m → R be a function such that for all i ∈ {1, ..., m}, there exist c_i < ∞ for which

    sup_{X ∈ X^m, x̃ ∈ X} | f(x_1, ..., x_m) − f(x_1, ..., x_{i−1}, x̃, x_{i+1}, ..., x_m) | ≤ c_i.

Then for all probability measures p and every ε > 0,

    p_{X^m}( f(X) − E_{X^m}[f(X)] > ε ) < exp( −2ε² / Σ_{i=1}^m c_i² ).

We also define the Rademacher average of the function class F with respect to the m-sample X.

Definition 10.28 (Rademacher average of F on X) Let F be the unit ball in a universal RKHS on the compact domain X, with kernel bounded according to 0 ≤ k(x, y) ≤ K. Let X be an i.i.d. sample of size m drawn according to a probability measure p on X, and let the σ_i be i.i.d. and take values in {−1, 1} with equal probability. We define the Rademacher average

    R_m(F, X) := E_σ sup_{f ∈ F} (1/m) | Σ_{i=1}^m σ_i f(x_i) |  ≤  (K/m)^{1/2},

where the upper bound is due to [BM02, Lemma 22]. Similarly, we define

    R_m(F, p) := E_{p,σ} sup_{f ∈ F} (1/m) | Σ_{i=1}^m σ_i f(x_i) |.

10.17.2 Bound when p and q May Differ


We want to show that the absolute difference between MMD(F, p, q) and
MMDb (F, X, Y ) is close to its expected value, independent of the distributions p and q. To this end, we prove three intermediate results, which we
then combine. The first result we need is an upper bound on the absolute

difference between MMD(F, p, q) and MMDb(F, X, Y). We have

    | MMD(F, p, q) − MMDb(F, X, Y) |
      = | sup_{f∈F} ( E_p[f] − E_q[f] ) − sup_{f∈F} ( (1/m) Σ_{i=1}^m f(x_i) − (1/n) Σ_{i=1}^n f(y_i) ) |
      ≤ sup_{f∈F} | E_p[f] − (1/m) Σ_{i=1}^m f(x_i) − E_q[f] + (1/n) Σ_{i=1}^n f(y_i) |        (10.20)
      =: Δ(p, q, X, Y).

Second, we provide an upper bound on the difference between Δ(p, q, X, Y) and its expectation. Changing either of x_i or y_i in Δ(p, q, X, Y) results in changes in magnitude of at most 2K^{1/2}/m or 2K^{1/2}/n, respectively. We can then apply McDiarmid's theorem, given a denominator in the exponent of

    m (2K^{1/2}/m)² + n (2K^{1/2}/n)² = 4K (1/m + 1/n) = 4K (m + n)/(mn),

to obtain

    Pr( Δ(p, q, X, Y) − E_{X,Y}[Δ(p, q, X, Y)] > ε ) ≤ exp( −ε² mn / (2K(m + n)) ).        (10.21)
For our final result, we exploit symmetrisation, following e.g. [vdVW96, p. 108], to upper bound the expectation of Δ(p, q, X, Y). Denoting by X' an i.i.d. sample of size m drawn independently of X (and likewise Y' for Y), and by σ_i, σ'_i i.i.d. Rademacher signs, we have

    E_{X,Y}[Δ(p, q, X, Y)]
      = E_{X,Y} sup_{f∈F} | E_p[f] − (1/m) Σ_i f(x_i) − E_q[f] + (1/n) Σ_i f(y_i) |
      ≤ E_{X,Y} sup_{f∈F} | E_{X'}[(1/m) Σ_i f(x'_i)] − (1/m) Σ_i f(x_i) + (1/n) Σ_i f(y_i) − E_{Y'}[(1/n) Σ_i f(y'_i)] |
      ≤(a) E_{X,Y,X',Y'} sup_{f∈F} | (1/m) Σ_i ( f(x'_i) − f(x_i) ) + (1/n) Σ_i ( f(y_i) − f(y'_i) ) |
      = E_{X,Y,X',Y',σ,σ'} sup_{f∈F} | (1/m) Σ_i σ_i ( f(x'_i) − f(x_i) ) + (1/n) Σ_i σ'_i ( f(y_i) − f(y'_i) ) |
      ≤(b) E_{X,X',σ} sup_{f∈F} | (1/m) Σ_i σ_i ( f(x_i) − f(x'_i) ) | + E_{Y,Y',σ'} sup_{f∈F} | (1/n) Σ_i σ'_i ( f(y_i) − f(y'_i) ) |
      ≤(c) 2 [ R_m(F, p) + R_n(F, q) ]
      ≤(d) 2 [ (K/m)^{1/2} + (K/n)^{1/2} ],        (10.22)

where (a) uses Jensen's inequality, (b) uses the triangle inequality, (c) substitutes Definition 10.28 (the Rademacher average), and (d) bounds the Rademacher averages, also via Definition 10.28.
Having established our preliminary results, we proceed to the proof of Theorem 10.14.

Proof [Theorem 10.14]  Combining equations (10.21) and (10.22) gives

    Pr( Δ(p, q, X, Y) − 2 [ (K/m)^{1/2} + (K/n)^{1/2} ] > ε ) ≤ exp( −ε² mn / (2K(m + n)) ).

Substituting equation (10.20) yields the result.

10.17.3 Bound when p = q and m = n


In this section, we derive the Theorem 10.15 result, namely the large deviation bound on the MMD when p = q and m = n. Note also that we consider only positive deviations of MMDb(F, X, Y) from MMD(F, p, q), since negative deviations are irrelevant to our hypothesis test. The proof follows the same three steps as in the previous section. The first step in (10.20) becomes

    MMDb(F, X, X') − MMD(F, p, q) = MMDb(F, X, X') − 0 = sup_{f∈F} (1/m) Σ_{i=1}^m ( f(x_i) − f(x'_i) ).        (10.23)

The McDiarmid bound on the difference between (10.23) and its expectation is now a function of 2m observations, and has a denominator in the exponent of 2m (2K^{1/2}/m)² = 8K/m. We use a different strategy in obtaining an upper bound on the expected (10.23), however: this is now

    E_{X,X'} [ sup_{f∈F} (1/m) Σ_{i=1}^m ( f(x_i) − f(x'_i) ) ]
      = (1/m) E_{X,X'} ‖ Σ_{i=1}^m ( φ(x_i) − φ(x'_i) ) ‖
      ≤ (1/m) [ E_{X,X'} Σ_{i,j=1}^m ( k(x_i, x_j) + k(x'_i, x'_j) − k(x_i, x'_j) − k(x'_i, x_j) ) ]^{1/2}
      = (1/m) [ 2m E_x k(x, x) + 2m(m−1) E_{x,x'} k(x, x') − 2m² E_{x,x'} k(x, x') ]^{1/2}
      = [ (2/m) ( E_x k(x, x) − E_{x,x'} k(x, x') ) ]^{1/2}        (10.24)
      ≤ (2K/m)^{1/2}.        (10.25)

We remark that both (10.24) and (10.25) bound the amount by which our biased estimate of the population MMD exceeds zero under H0. Combining the three results, we find that under H0,

    Pr_X( MMDb(F, X, X') − [ (2/m) ( E_x k(x, x) − E_{x,x'} k(x, x') ) ]^{1/2} > ε ) < exp( −ε² m / (4K) )

and

    Pr_X( MMDb(F, X, X') − (2K/m)^{1/2} > ε ) < exp( −ε² m / (4K) ).

10.18 Proofs for Asymptotic Tests


We derive results needed in the asymptotic test of Section 10.12. Appendix
10.18.1 describes the distribution of the empirical MMD under H0 (both
distributions identical). Appendix 10.18.2 contains derivations of the second
and third moments of the empirical MMD, also under H0 .


10.18.1 Convergence of the Empirical MMD under H0

We describe the distribution of the unbiased estimator MMD2u[F, X, Y] under the null hypothesis. In this circumstance, we denote it by MMD2u[F, X, X'], to emphasise that the second sample X' is drawn independently from the same distribution as X. We thus obtain the U-statistic

    MMD2u[F, X, X'] = (1/(m(m−1))) Σ_{i≠j} [ k(x_i, x_j) + k(x'_i, x'_j) − k(x_i, x'_j) − k(x'_j, x_i) ]        (10.26)
                    = (1/(m(m−1))) Σ_{i≠j} h(z_i, z_j),        (10.27)

where z_i = (x_i, x'_i). Under the null hypothesis, this U-statistic is degenerate, meaning

    E_{z_j} h(z_i, z_j) = E_{x_j} k(x_i, x_j) + E_{x'_j} k(x'_i, x'_j) − E_{x'_j} k(x_i, x'_j) − E_{x_j} k(x'_j, x_i) = 0.

The following theorem from [Ser80, Section 5.5.2] then applies.

Theorem 10.29 Assume MMD2u[F, X, X'] is as defined in (10.27), with E_{z'} h(z, z') = 0, and furthermore assume 0 < E_{z,z'} h²(z, z') < ∞. Then MMD2u[F, X, X'] converges in distribution according to

    m MMD2u[F, X, X'] →_D Σ_{l=1}^∞ λ_l [ χ²_{1l} − 1 ],

where the χ²_{1l} are independent chi squared random variables of degree one, and the λ_l are the solutions to the eigenvalue equation

    λ_l ψ_l(u) = ∫ h(u, v) ψ_l(v) d Pr(v).

While this result is adequate for our purposes (since we do not explicitly
use the quantities l in our subsequent reasoning), it does not make clear
the dependence of the null distribution on the kernel choice. For this reason,
we provide an alternative expression based on the reasoning of [AHT94, Appendix], bearing in mind the following changes:
- we do not need to deal with the bias terms S_{1j} seen by [AHT94, Appendix] that vanish for large sample sizes, since our statistic is unbiased (although these bias terms drop faster than the variance);
- we require greater generality, since we deal with distributions on compact metric spaces, and not densities on R^d; correspondingly, our kernels are not necessarily inner products in L2 between probability density functions (although this is a special case).
Our first step is to express the kernel h(z_i, z_j) of the U-statistic in terms of an RKHS kernel k̃(x_i, x_j) between feature space mappings from which the mean has been subtracted,

    k̃(x_i, x_j) := ⟨ φ(x_i) − μ[p], φ(x_j) − μ[p] ⟩
                 = k(x_i, x_j) − E_x k(x_i, x) − E_x k(x, x_j) + E_{x,x'} k(x, x').

The centering terms cancel (the distance between the two points is unaffected by an identical global shift in both the points), meaning

    h(z_i, z_j) = k̃(x_i, x_j) + k̃(x'_i, x'_j) − k̃(x_i, x'_j) − k̃(x'_j, x_i).

Next, we write the kernel k̃(x_i, x_j) in terms of eigenfunctions ψ_l(x) with respect to the probability measure Pr_x,

    k̃(x, x') = Σ_{l=1}^∞ λ_l ψ_l(x) ψ_l(x'),

where

    ∫_X k̃(x, x') ψ_i(x) d Pr(x) = λ_i ψ_i(x')  and  ∫_X ψ_i(x) ψ_j(x) d Pr(x) = δ_{ij}.        (10.28)

Since

    E_x k̃(x, v) = E_x k(x, v) − E_{x,x'} k(x, x') − E_x k(x, v) + E_{x,x'} k(x, x') = 0,

then when λ_i ≠ 0, we have that

    λ_i E_{x'} ψ_i(x') = ∫ E_x k̃(x, x') ψ_i(x) d Pr(x) = 0,

and hence

    E_x ψ_i(x) = 0.        (10.29)

We now use these results to transform the expression in (10.26). First,

    (1/m) Σ_{i≠j} k̃(x_i, x_j) = (1/m) Σ_{i≠j} Σ_{l=1}^∞ λ_l ψ_l(x_i) ψ_l(x_j)
                              = (1/m) Σ_{l=1}^∞ λ_l [ ( Σ_i ψ_l(x_i) )² − Σ_i ψ_l²(x_i) ]
                              →_D Σ_{l=1}^∞ λ_l ( y_l² − 1 ),

where the y_l ~ N(0, 1) are i.i.d., and the final relation denotes convergence in distribution, using (10.28) and (10.29), following [Ser80, Section 5.5.2]. Likewise

    (1/m) Σ_{i≠j} k̃(x'_i, x'_j) →_D Σ_{l=1}^∞ λ_l ( z_l² − 1 ),

where z_l ~ N(0, 1), and

    (1/m) Σ_{i≠j} [ k̃(x_i, x'_j) + k̃(x_j, x'_i) ] →_D 2 Σ_{l=1}^∞ λ_l y_l z_l.

Combining these results, we get

    m MMD2u(F, X, X') →_D Σ_{l=1}^∞ λ_l [ y_l² + z_l² − 2 − 2 y_l z_l ] = Σ_{l=1}^∞ λ_l [ ( y_l − z_l )² − 2 ].

Note that y_l − z_l, being the difference of two independent Gaussian variables, has a normal distribution with mean zero and variance 2. This is therefore a quadratic form in a Gaussian random variable minus an offset 2 Σ_{l=1}^∞ λ_l.
10.18.2 Moments of the Empirical MMD Under H0
In this section, we compute the moments of the U-statistic in Section 10.12, under the null hypothesis conditions

    E_{z,z'} h(z, z') = 0        (10.30)

and, importantly,

    E_{z'} h(z, z') = 0.        (10.31)

Note that the latter implies the former.

Variance/2nd moment: This was derived by [Hoe48, p. 299], and is also described by [Ser80, Lemma A, p. 183]. Applying these results,

    E[ (MMD2u)² ] = ( 2/(n(n−1)) )² [ (n(n−1)/2)(n−2)(2) E_z( E_{z'} h(z, z') )² + (n(n−1)/2) E_{z,z'} h²(z, z') ]
                  = ( 4(n−2)/(n(n−1)) ) E_z( E_{z'} h(z, z') )² + ( 2/(n(n−1)) ) E_{z,z'} h²(z, z')
                  = ( 2/(n(n−1)) ) E_{z,z'} h²(z, z'),

where the first term in the penultimate line is zero due to (10.31). Note that variance and 2nd moment are the same under the zero mean assumption.

3rd moment: We consider the terms that appear in the expansion of E[ (MMD2u)³ ]. These are all of the form

    ( 2/(n(n−1)) )³ E( h_{ab} h_{cd} h_{ef} ),

where we shorten h_{ab} = h(z_a, z_b), and we know z_a and z_b are always independent. Most of the terms vanish due to (10.30) and (10.31). The first terms that remain take the form

    ( 2/(n(n−1)) )³ E( h_{ab} h_{bc} h_{ca} ),

and there are (n(n−1)/2)(n−2)(2) of them, which gives us the expression

    ( 2/(n(n−1)) )³ (n(n−1)/2)(n−2)(2) E_{z,z'}[ h(z, z') E_{z''}( h(z, z'') h(z', z'') ) ]
      = ( 8(n−2)/(n²(n−1)²) ) E_{z,z'}[ h(z, z') E_{z''}( h(z, z'') h(z', z'') ) ].        (10.32)

Note the scaling 8(n−2)/(n²(n−1)²) ≈ n^{−3}. The remaining non-zero terms, for which a = c = e and b = d = f, take the form

    ( 2/(n(n−1)) )³ E_{z,z'}[ h³(z, z') ],

and there are n(n−1)/2 of them, which gives

    ( 2/(n(n−1)) )² E_{z,z'}[ h³(z, z') ].        (10.33)

However ( 2/(n(n−1)) )² ≈ n^{−4}, so this term is negligible compared with (10.32). Thus, a reasonable approximation to the third moment is

    E[ (MMD2u)³ ] ≈ ( 8(n−2)/(n²(n−1)²) ) E_{z,z'}[ h(z, z') E_{z''}( h(z, z'') h(z', z'') ) ].

11
Reinforcement Learning


Appendix 1
Linear Algebra and Functional Analysis

A1.1 Spectral Properties of Matrices


A1.1.1 Basics
A1.1.2 Special Matrices
unitary, Hermitian, positive semidefinite
A1.1.3 Normal Forms
Jacobi
A1.2 Functional Analysis
A1.2.1 Norms and Metrics
vector space, norm, triangle inequality
A1.2.2 Banach Spaces
normed vector space, evaluation functionals, examples, dual space
A1.2.3 Hilbert Spaces
symmetric inner product
A1.2.4 Operators
spectrum, norm, bounded, unbounded operators
A1.3 Fourier Analysis
A1.3.1 Basics
A1.3.2 Operators


Appendix 2
Conjugate Distributions


Binomial Beta
    φ(x) = x,   e^{h(n, nμ)} = Γ(nμ + 1) Γ(n(1 − μ) + 1) / Γ(n + 2) = B(nμ + 1, n(1 − μ) + 1)
In traditional notation one represents the conjugate as
    p(z; α, β) = [ Γ(α + β) / ( Γ(α) Γ(β) ) ] z^{α−1} (1 − z)^{β−1},
where α = nμ + 1 and β = n(1 − μ) + 1.

Multinomial Dirichlet
    φ(x) = e_x,   e^{h(n, nμ)} = Π_{i=1}^d Γ(nμ_i + 1) / Γ(n + d)
In traditional notation one represents the conjugate as
    p(z; α) = [ Γ( Σ_{i=1}^d α_i ) / Π_{i=1}^d Γ(α_i) ] Π_{i=1}^d z_i^{α_i − 1},
where α_i = nμ_i + 1.

Poisson Gamma
    φ(x) = x,   e^{h(n, nμ)} = n^{−nμ} Γ(nμ)
In traditional notation one represents the conjugate as
    p(z; α, β) = [ β^α / Γ(α) ] z^{α−1} e^{−βz},
where α = nμ and β = n.

Multinomial / Binomial
Gaussian
Laplace
Poisson
Dirichlet
Wishart
Student-t
Beta
Gamma

Appendix 3
Loss Functions

A3.1 Loss Functions


A multitude of loss functions are commonly used to derive seemingly different algorithms. This often blurs the similarities as well as subtle differences between them, often for historic reasons: each new loss is typically accompanied by at least one publication dedicated to it. In many cases, the loss is not even spelled out explicitly; instead, it is only given by means of a constrained optimization problem. A case in point are the papers introducing the (binary) hinge loss [BM92, CV95] and structured loss [TGK04, TJHA05]. Likewise, a geometric description obscures the underlying loss function, as in novelty detection [SPST+ 01].
In this section we give an expository yet unifying presentation of many of those loss functions. Many of them are well known, while others, such as multivariate ranking, hazard regression, or Poisson regression, are not commonly used in machine learning. Tables A3.1 and A3.2 contain a choice subset of simple scalar and vectorial losses. Our aim is to put the multitude of loss functions into a unified framework, and to show how these losses and their (sub)gradients can be computed efficiently for use in our solver framework.
Note that not all losses, while convex, are continuously differentiable. In this situation we give a subgradient. While this may not be optimal, the convergence rates of our algorithm do not depend on which element of the subdifferential we provide: in all cases the first order Taylor approximation is a lower bound which is tight at the point of expansion.
In this section, with a slight abuse of notation, v_i is understood as the i-th component of vector v when v is clearly not an element of a sequence or a set.

A3.1.1 Scalar Loss Functions


It is well known [Wah97] that the convex optimization problem

    min_ξ ξ   subject to   y ⟨w, x⟩ ≥ 1 − ξ  and  ξ ≥ 0        (3.1)

Table A3.1. Scalar loss functions and their derivatives, depending on f := ⟨w, x⟩ and y.
  Hinge [BM92]: l = max(0, 1 − yf);  l' = 0 if yf ≥ 1, else −y
  Squared Hinge [KD05]: l = (1/2) max(0, 1 − yf)²;  l' = 0 if yf ≥ 1, else f − y
  Exponential [CDLS99]: l = exp(−yf);  l' = −y exp(−yf)
  Logistic [CSS00]: l = log(1 + exp(−yf));  l' = −y / (1 + exp(yf))
  Novelty [SPST+ 01]: l = max(0, ρ − f);  l' = 0 if f ≥ ρ, else −1
  Least mean squares [Wil98]: l = (1/2)(f − y)²;  l' = f − y
  Least absolute deviation: l = |f − y|;  l' = sign(f − y)
  Quantile regression [Koe05]: l = max(τ(f − y), (1 − τ)(y − f));  l' = τ if f > y, else τ − 1
  ε-insensitive [VGS97]: l = max(0, |f − y| − ε);  l' = 0 if |f − y| ≤ ε, else sign(f − y)
  Huber's robust loss [MSR+ 97]: l = (1/2)(f − y)² if |f − y| ≤ 1, else |f − y| − 1/2;  l' = f − y if |f − y| ≤ 1, else sign(f − y)
  Poisson regression [Cre93]: l = exp(f) − yf;  l' = exp(f) − y

Table A3.2. Vectorial loss functions and their derivatives, depending on the vector f := W x and on y.
  Soft-Margin Multiclass [TGK04], [CS03]: l = max_{y'} (f_{y'} − f_y + Δ(y, y'));  ∂l = e_{y*} − e_y, where y* is the argmax of the loss
  Scaled Soft-Margin Multiclass [TJHA05]: l = max_{y'} Γ(y, y')(f_{y'} − f_y + Δ(y, y'));  ∂l = Γ(y, y*)(e_{y*} − e_y), where y* is the argmax of the loss
  Softmax Multiclass [CDLS99]: l = log Σ_{y'} exp(f_{y'}) − f_y;  ∂l = [ Σ_{y'} e_{y'} exp(f_{y'}) ] / [ Σ_{y'} exp(f_{y'}) ] − e_y
  Multivariate Regression: l = (1/2)(f − y)⊤ M (f − y) where M ⪰ 0;  ∂l = M(f − y)

takes on the value max(0, 1 − y⟨w, x⟩). The latter is a convex function in w and x. Likewise, we may rewrite the ε-insensitive loss, Huber's robust loss, the quantile regression loss, and the novelty detection loss in terms of loss functions rather than a constrained optimization problem. In all cases, ⟨w, x⟩ will play a key role insofar as the loss is convex in terms of the scalar quantity ⟨w, x⟩. A large number of loss functions fall into this category, as described in Table A3.1. Note that not all functions of this type are continuously differentiable. In this case we adopt the convention that

    ∂_x max(f(x), g(x)) = ∂_x f(x) if f(x) ≥ g(x), and ∂_x g(x) otherwise.        (3.2)

Since we are only interested in obtaining an arbitrary element of the subdifferential, this convention is consistent with our requirements.
Let us discuss the issue of efficient computation. For all scalar losses we may write l(x, y, w) = l(⟨w, x⟩, y), as described in Table A3.1. In this case a simple application of the chain rule yields that ∂_w l(x, y, w) = l'(⟨w, x⟩, y) · x. For instance, for squared loss we have

    l(⟨w, x⟩, y) = (1/2) (⟨w, x⟩ − y)²  and  l'(⟨w, x⟩, y) = ⟨w, x⟩ − y.

Consequently, the derivative of the empirical risk term is given by

    ∂_w R_emp(w) = (1/m) Σ_{i=1}^m l'(⟨w, x_i⟩, y_i) x_i.        (3.3)

This means that if we want to compute l and ∂_w l on a large number of observations x_i, represented as matrix X, we can make use of fast linear algebra routines to pre-compute the vectors

    f = Xw  and  g⊤X,  where  g_i = l'(f_i, y_i).        (3.4)

This is possible for any of the loss functions listed in Table A3.1, and many other similar losses. The advantage of this unified representation is that implementation of each individual loss can be done in very little time. The computational infrastructure for computing Xw and g⊤X is shared. Evaluating l(f_i, y_i) and l'(f_i, y_i) for all i can be done in O(m) time and is not time-critical in comparison to the remaining operations. Algorithm 3.1 describes the details.
An important but often neglected issue is worth mentioning. Computing f requires us to right multiply the matrix X with the vector w, while computing the gradient requires the left multiplication of X with the vector g. If X is stored in a row major format then Xw can be computed rather efficiently, while g⊤X is


Algorithm 3.1 ScalarLoss(w, X, y)
1: input: Weight vector w, feature matrix X, and labels y
2: Compute f = Xw
3: Compute r = Σ_i l(f_i, y_i) and g = l'(f, y)
4: g ← g⊤X
5: return Risk r and gradient g

expensive. This is particularly true if X cannot fit in main memory. The converse is the case when X is stored in column major format. Similar problems are encountered when X is a sparse matrix and stored in either compressed row format or in compressed column format.
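For concreteness, here is a minimal Python sketch of Algorithm 3.1 specialised to the hinge loss; the (1/m) scaling of (3.3) is folded in, and the function name is our own choice rather than part of any library.

    # Hypothetical sketch of Algorithm 3.1 for the hinge loss: only l(f_i, y_i)
    # and l'(f_i, y_i) are loss-specific; X @ w and g^T X are shared work.
    import numpy as np

    def scalar_loss_hinge(w, X, y):
        f = X @ w                                # f_i = <w, x_i>
        margin = 1.0 - y * f
        risk = np.maximum(0.0, margin).mean()    # (1/m) sum_i l(f_i, y_i)
        gprime = np.where(margin > 0, -y, 0.0)   # a subgradient l'(f_i, y_i)
        grad = (gprime @ X) / len(y)             # (1/m) sum_i l'(f_i, y_i) x_i
        return risk, grad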

A3.1.2 Structured Loss


In recent years structured estimation has gained substantial popularity in
machine learning [TJHA05, TGK04, BHS+ 07]. At its core it relies on two
types of convex loss functions: the logistic loss

    l(x, y, w) = log Σ_{y'∈Y} exp(⟨w, φ(x, y')⟩) − ⟨w, φ(x, y)⟩,        (3.5)

and the soft-margin loss

    l(x, y, w) = max_{y'∈Y} Γ(y, y') ⟨w, φ(x, y') − φ(x, y)⟩ + Δ(y, y').        (3.6)

Here φ(x, y) is a joint feature map, Δ(y, y') ≥ 0 describes the cost of misclassifying y by y', and Γ(y, y') ≥ 0 is a scaling term which indicates by how much the large margin property should be enforced. For instance, [TGK04] choose Γ(y, y') = 1. On the other hand [TJHA05] suggest Γ(y, y') = Δ(y, y'), which reportedly yields better performance. Finally, [McA07] recently suggested generic functions Γ(y, y').
The logistic loss can also be interpreted as the negative log-likelihood of a conditional exponential family model:

    p(y|x; w) := exp(⟨w, φ(x, y)⟩ − g(w|x)),        (3.7)

where the normalizing constant g(w|x), often called the log-partition function, reads

    g(w|x) := log Σ_{y'∈Y} exp(⟨w, φ(x, y')⟩).        (3.8)


As a consequence of the Hammersley-Clifford theorem [Jor08] every exponential family distribution corresponds to an undirected graphical model. In
our case this implies that the labels y factorize according to an undirected
graphical model. A large number of problems have been addressed by this
setting, amongst them named entity tagging [LMP01], sequence alignment
[TJHA05], segmentation [RSS+ 07] and path planning [RBZ06]. It is clearly
impossible to give examples of all settings in this section, nor would a brief
summary do this field any justice. We therefore refer the reader to the edited
volume [BHS+ 07] and the references therein.
If the underlying graphical model is tractable then efficient inference algorithms based on dynamic programming can be used to compute (3.5) and
(3.6). We discuss intractable graphical models in Section A3.1.2.1, and now
turn our attention to the derivatives of the above structured losses.
When it comes to computing derivatives of the logistic loss, (3.5), we have

    ∂_w l(x, y, w) = [ Σ_{y'} φ(x, y') exp(⟨w, φ(x, y')⟩) ] / [ Σ_{y'} exp(⟨w, φ(x, y')⟩) ] − φ(x, y)        (3.9)
                   = E_{y' ~ p(y'|x)}[ φ(x, y') ] − φ(x, y),        (3.10)

where p(y|x) is the exponential family model (3.7). In the case of (3.6) we denote by ȳ(x) the argmax of the RHS, that is

    ȳ(x) := argmax_{y'} Γ(y, y') ⟨w, φ(x, y') − φ(x, y)⟩ + Δ(y, y').        (3.11)

This allows us to compute the derivative of l(x, y, w) as

    ∂_w l(x, y, w) = Γ(y, ȳ(x)) [ φ(x, ȳ(x)) − φ(x, y) ].        (3.12)

In the case where the loss is maximized for more than one distinct value ȳ(x) we may average over the individual values, since any convex combination of such terms lies in the subdifferential.
Note that (3.6) majorizes Δ(y, y*), where y* := argmax_{y'} ⟨w, φ(x, y')⟩ [TJHA05]. This can be seen via the following series of inequalities:

    Δ(y, y*) ≤ Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ + Δ(y, y*) ≤ l(x, y, w).

The first inequality follows because Γ(y, y*) ≥ 0 and y* maximizes ⟨w, φ(x, y')⟩, thus implying that Γ(y, y*) ⟨w, φ(x, y*) − φ(x, y)⟩ ≥ 0. The second inequality follows by definition of the loss.
We conclude this section with a simple lemma which is at the heart of
several derivations of [Joa05]. While the proof in the original paper is far
from trivial, it is straightforward in our setting:


Lemma 3.1 Denote by Δ(y, y') a loss and let φ(x_i, y_i) be a feature map for observations (x_i, y_i) with 1 ≤ i ≤ m. Moreover, denote by X, Y the set of all m patterns and labels respectively. Finally let

    Φ(X, Y) := Σ_{i=1}^m φ(x_i, y_i)  and  Δ(Y, Y') := Σ_{i=1}^m Δ(y_i, y'_i).        (3.13)

Then the following two losses are equivalent:

    Σ_{i=1}^m max_{y'} ⟨w, φ(x_i, y') − φ(x_i, y_i)⟩ + Δ(y_i, y')  and  max_{Y'} ⟨w, Φ(X, Y') − Φ(X, Y)⟩ + Δ(Y, Y').

This is immediately obvious, since both feature map and loss decompose, which allows us to perform maximization over Y' by maximizing each of its m components. In doing so, we showed that aggregating all data and labels into a single feature map and loss yields results identical to minimizing the sum over all individual losses. This holds, in particular, for the sample error loss of [Joa05]. Also note that this equivalence does not hold whenever Γ(y, y') is not constant.
A3.1.2.1 Intractable Models
We now discuss cases where computing l(x, y, w) itself is too expensive. For instance, for intractable graphical models, the sum Σ_{y'} exp(⟨w, φ(x, y')⟩) cannot be computed efficiently. [WJ03] propose the use of a convex majorization of the log-partition function in those cases. In our setting this means that instead of dealing with

    l(x, y, w) = g(w|x) − ⟨w, φ(x, y)⟩  where  g(w|x) := log Σ_{y'} exp(⟨w, φ(x, y')⟩),        (3.14)

one uses a more easily computable convex upper bound on g via

    sup_{μ ∈ MARG(x)} ⟨w, μ⟩ + H_Gauss(μ|x).        (3.15)

Here MARG(x) is an outer bound on the conditional marginal polytope associated with the map φ(x, y). Moreover, H_Gauss(μ|x) is an upper bound on the entropy by using a Gaussian with identical variance. More refined tree decompositions exist, too. The key benefit of our approach is that the solution of the optimization problem (3.15) can immediately be used as a gradient of the upper bound. This is computationally rather efficient.


Likewise note that [TGK04] use relaxations when solving structured estimation problems of the form

    l(x, y, w) = max_{y'} Γ(y, y') ⟨w, φ(x, y') − φ(x, y)⟩ + Δ(y, y'),        (3.16)

by enlarging the domain of maximization with respect to y'. For instance,

instead of an integer programming problem we might relax the setting to
a linear program which is much cheaper to solve. This, again, provides an
upper bound on the original loss function.
In summary, we have demonstrated that convex relaxation strategies are
well applicable for bundle methods. In fact, the results of the corresponding
optimization procedures can be used directly for further optimization steps.

A3.1.3 Scalar Multivariate Performance Scores


We now discuss a series of structured loss functions and how they can be implemented efficiently. For the sake of completeness, we give a concise representation of previous work on multivariate performance scores and ranking methods. All these loss functions rely on having access to ⟨w, x⟩, which can be computed efficiently by using the same operations as in Section A3.1.1.

A3.1.3.1 ROC Score
Denote by f = Xw the vector of function values on the training set. It is well known that the area under the ROC curve is given by

    AUC(x, y, w) = (1/(m₊ m₋)) Σ_{y_i < y_j} I( ⟨w, x_i⟩ < ⟨w, x_j⟩ ),        (3.17)

where m₊ and m₋ are the numbers of positive and negative observations respectively, and I(·) is the indicator function. Directly optimizing the cost 1 − AUC(x, y, w) is difficult as it is not continuous in w. By using max(0, 1 + ⟨w, x_i − x_j⟩) as the surrogate loss function for all pairs (i, j) for which y_i < y_j we have the following convex multivariate empirical risk

    R_emp(w) = (1/(m₊ m₋)) Σ_{y_i < y_j} max(0, 1 + ⟨w, x_i − x_j⟩) = (1/(m₊ m₋)) Σ_{y_i < y_j} max(0, 1 + f_i − f_j).        (3.18)

Obviously, we could compute R_emp(w) and its derivative by an O(m²) operation. However [Joa05] showed that both can be computed in O(m log m) time using a sorting operation, which we now describe.
Denote by c = f − (1/2) y an auxiliary variable and let i and j be indices such


Algorithm 3.2 ROCScore(X, y, w)
1: input: Feature matrix X, labels y, and weight vector w
2: initialization: s₋ = m₋ and s₊ = 0 and l = 0_m and c = Xw − (1/2) y
3: π ← {1, ..., m} sorted in ascending order of c
4: for i = 1 to m do
5:   if y_{π_i} = −1 then
6:     l_{π_i} ← s₊ and s₋ ← s₋ − 1
7:   else
8:     l_{π_i} ← −s₋ and s₊ ← s₊ + 1
9:   end if
10: end for
11: Rescale l ← l/(m₊ m₋) and compute r = ⟨l, c⟩ and g = l⊤X
12: return Risk r and subgradient g

that y_i = −1 and y_j = 1. It follows that c_i − c_j = 1 + f_i − f_j. The efficient algorithm is due to the observation that at most m distinct terms c_k, k = 1, ..., m, each with a different frequency l_k and sign, appear in (3.18). These frequencies l_k can be determined by first sorting c in ascending order and then scanning through the labels according to the sorted order of c, keeping running statistics such as the number s₋ of negative labels yet to be encountered, and the number s₊ of positive labels already encountered. When visiting y_k, we know c_k should appear s₊ (or s₋) times with positive (or negative) sign in (3.18) if y_k = −1 (or y_k = 1). Algorithm 3.2 spells out explicitly how to compute R_emp(w) and its subgradient.
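A compact sketch of this sorting procedure (our own rendering of the idea behind Algorithm 3.2, assuming labels in {−1, +1}) is given below.

    # Hypothetical sketch of the O(m log m) ROC-area surrogate risk (3.18)
    # and its subgradient via the sorting trick.
    import numpy as np

    def roc_score_risk(w, X, y):
        f = X @ w
        c = f - 0.5 * y                       # c_i = f_i - y_i / 2
        order = np.argsort(c, kind="stable")
        m_pos, m_neg = np.sum(y == 1), np.sum(y == -1)
        s_pos, s_neg = 0, m_neg
        l = np.zeros(len(y))
        for i in order:                       # ascending in c
            if y[i] == -1:
                l[i] = s_pos                  # positives with smaller c
                s_neg -= 1
            else:
                l[i] = -s_neg                 # negatives with larger c
                s_pos += 1
        l /= m_pos * m_neg
        return l @ c, l @ X                   # risk and subgradient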
A3.1.3.2 Ordinal Regression
Essentially the same preference relationships need to hold for ordinal regression. The only difference is that y_i need not take on binary values any more. Instead, we may have an arbitrary number of different values y_i (e.g., 1 corresponding to "strong reject" up to 10 corresponding to "strong accept", when it comes to ranking papers for a conference). That is, we now have y_i ∈ {1, ..., n} rather than y_i ∈ {±1}. Our goal is to find some w such that ⟨w, x_i − x_j⟩ < 0 whenever y_i < y_j. Whenever this relationship is not satisfied, we incur a cost C(y_i, y_j) for preferring x_i to x_j. For example, C(y_i, y_j) could be constant, i.e., C(y_i, y_j) = 1 [Joa06], or linear, i.e., C(y_i, y_j) = y_j − y_i.
Denote by m_i the number of x_j for which y_j = i. In this case, there are M̄ = m² − Σ_{i=1}^n m_i² pairs (y_i, y_j) for which y_i ≠ y_j; this implies that there are M = M̄/2 pairs (y_i, y_j) such that y_i < y_j. Normalizing by the total


number of comparisons we may write the overall cost of the estimator as

    (1/M) Σ_{y_i < y_j} C(y_i, y_j) I( ⟨w, x_i⟩ > ⟨w, x_j⟩ )  where  M = (1/2) [ m² − Σ_i m_i² ].        (3.19)

Using the same convex majorization as above when we were maximizing the ROC score, we obtain an empirical risk of the form

    R_emp(w) = (1/M) Σ_{y_i < y_j} C(y_i, y_j) max(0, 1 + ⟨w, x_i − x_j⟩).        (3.20)

Now the goal is to find an efficient algorithm for obtaining the number of
times when the individual losses are nonzero such as to compute both the
value and the gradient of Remp (w). The complication arises from the fact
that observations xi with label yi may appear in either side of the inequality
depending on whether yj < yi or yj > yi . This problem can be solved as
follows: sort f = Xw in ascending order and traverse it while keeping track
of how many items with a lower value yj are no more than 1 apart in terms
of their value of fi . This way we may compute the count statistics efficiently.
Algorithm 3.3 describes the details, generalizing the results of [Joa06]. Again,
its runtime is O(m log m), thus allowing for efficient computation.
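For reference, a direct O(m²) evaluation of (3.20), useful for checking an implementation of Algorithm 3.3 on small data, can be sketched as follows; the cost function C is passed as a callable and the labels are assumed to be integers in {1, ..., n}.

    # Hypothetical sketch: naive O(m^2) evaluation of the ordinal regression
    # risk (3.20) and its subgradient.
    import numpy as np

    def ordinal_risk_naive(w, X, y, C):
        f = X @ w
        m = len(y)
        counts = np.bincount(y)
        M = (m ** 2 - np.sum(counts.astype(np.int64) ** 2)) / 2
        risk, grad = 0.0, np.zeros(X.shape[1])
        for i in range(m):
            for j in range(m):
                if y[i] < y[j]:
                    margin = 1.0 + f[i] - f[j]
                    if margin > 0:
                        risk += C(y[i], y[j]) * margin
                        grad += C(y[i], y[j]) * (X[i] - X[j])
        return risk / M, grad / M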

A3.1.3.3 Preference Relations


In general, our loss may be described by means of a set of preference relations over arbitrary pairs (i, j) ∈ {1, ..., m}², associated with a cost C(i, j) which is incurred whenever i is ranked above j. This set of preferences may or may not form a partial or a total order on the domain of all observations. In these cases efficient computations along the lines of Algorithm 3.3 exist. In general, this is not the case and we need to rely on the fact that the set P containing all preferences is sufficiently small that it can be enumerated efficiently. The risk is then given by

    (1/|P|) Σ_{(i,j) ∈ P} C(i, j) I( ⟨w, x_i⟩ > ⟨w, x_j⟩ ).        (3.21)


Algorithm 3.3 OrdinalRegression(X, y, w, C)
1: input: Feature matrix X, labels y, weight vector w, and score matrix C
2: initialization: l = 0_n and u_i = m_i for all i ∈ [n] and r = 0 and g = 0_m
3: Compute f = Xw and set c = [f − 1/2, f + 1/2] ∈ R^{2m} (concatenate the vectors)
4: Compute M = (m² − Σ_{i=1}^n m_i²)/2
5: Rescale C ← C/M
6: π ← {1, ..., 2m} sorted in ascending order of c
7: for i = 1 to 2m do
8:   j = π_i mod m
9:   if π_i ≤ m then
10:    for k = 1 to y_j − 1 do
11:      r ← r − C(k, y_j) u_k c_j
12:      g_j ← g_j − C(k, y_j) u_k
13:    end for
14:    l_{y_j} ← l_{y_j} + 1
15:  else
16:    for k = y_j + 1 to n do
17:      r ← r + C(y_j, k) l_k c_{j+m}
18:      g_j ← g_j + C(y_j, k) l_k
19:    end for
20:    u_{y_j} ← u_{y_j} − 1
21:  end if
22: end for
23: g ← g⊤X
24: return: Risk r and subgradient g

Again, the same majorization argument as before allows us to write a convex upper bound

    R_emp(w) = (1/|P|) Σ_{(i,j) ∈ P} C(i, j) max(0, 1 + ⟨w, x_i − x_j⟩),        (3.22)

whose subgradient is

    ∂_w R_emp(w) = (1/|P|) Σ_{(i,j) ∈ P} C(i, j) g_{ij},  where  g_{ij} = x_i − x_j if ⟨w, x_j − x_i⟩ ≤ 1 and g_{ij} = 0 otherwise.        (3.23)

The implementation is straightforward, as given in Algorithm 3.4.


Algorithm 3.4 Preference(X, w, C, P)
1: input: Feature matrix X, weight vector w, score matrix C, and preference set P
2: initialization: r = 0 and g = 0_m
3: Compute f = Xw
4: for (i, j) ∈ P do
5:   if f_j − f_i < 1 then
6:     r ← r + C(i, j)(1 + f_i − f_j)
7:     g_i ← g_i + C(i, j) and g_j ← g_j − C(i, j)
8:   end if
9: end for
10: g ← g⊤X
11: return Risk r and subgradient g

A3.1.3.4 Ranking
In webpage and document ranking we are often in a situation similar to that described in Section A3.1.3.2, with the difference that we do not only care about objects x_i being ranked according to scores y_i, but moreover that different degrees of importance are placed on different documents.
The information retrieval literature is full of a large number of different scoring functions. Examples are criteria such as Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), Precision@n, or Expected Rank Utility (ERU). They are used to address the issue of evaluating rankers, search engines or recommender systems [Voo01, JK02, BHK98, BH04]. For instance, in webpage ranking only the first k retrieved documents matter, since users are unlikely to look beyond the first k, say 10, retrieved webpages in an internet search. [LS07] show that these scores can be optimized directly by minimizing the following loss:

    l(X, y, w) = max_π Σ_i c_i ⟨w, x_{π(i)} − x_i⟩ + ⟨a − a(π), b(y)⟩.        (3.24)

Here c_i is a monotonically decreasing sequence, the documents are assumed to be arranged in order of decreasing relevance, π is a permutation, the vectors a and b(y) depend on the choice of a particular ranking measure, and a(π) denotes the permutation of a according to π. Pre-computing f = Xw we may rewrite (3.24) as

    l(f, y) = max_π [ c⊤f(π) − a(π)⊤b(y) ] − c⊤f + a⊤b(y),        (3.25)


Algorithm 3.5 Ranking(X, y, w)
1: input: Feature matrix X, relevances y, and weight vector w
2: Compute vectors a and b(y) according to some ranking measure
3: Compute f = Xw
4: Compute elements of matrix C_{ij} = c_i f_j − b_i a_j
5: π = LinearAssignment(C)
6: r = c⊤(f(π) − f) + (a − a(π))⊤b
7: g = c(π^{−1}) − c and g ← g⊤X
8: return Risk r and subgradient g
and consequently the derivative of l(X, y, w) with respect to w is given by

    ∂_w l(X, y, w) = ( c(π̄^{−1}) − c )⊤ X  where  π̄ = argmax_π c⊤f(π) − a(π)⊤b(y).        (3.26)

Here π^{−1} denotes the inverse permutation, such that π ∘ π^{−1} = 1. Finding the permutation maximizing c⊤f(π) − a(π)⊤b(y) is a linear assignment problem which can be easily solved by the Hungarian Marriage algorithm, that is, the Kuhn-Munkres algorithm.
The original papers by [Kuh55] and [Mun57] implied an algorithm with O(m³) cost in the number of terms. Later, [Kar80] suggested an algorithm
with expected quadratic time in the size of the assignment problem (ignoring log-factors). Finally, [OL93] propose a linear time algorithm for large
problems. Since in our case the number of pages is fairly small (in the order
of 50 to 200 per query) the scaling behavior per query is not too important.
We used an existing implementation due to [JV87].
Note also that training sets consist of a collection of ranking problems,
that is, we have several ranking problems of size 50 to 200. By means of
parallelization we are able to distribute the work onto a cluster of workstations, which is able to overcome the issue of the rather costly computation
per collection of queries. Algorithm 3.5 spells out the steps in detail.
A3.1.3.5 Contingency Table Scores
[Joa05] observed that F scores and related quantities dependent on a contingency table can also be computed efficiently by means of structured estimation. Such scores depend in general on the number of true and false
positives and negatives alike. Algorithm 3.6 shows how a corresponding empirical risk and subgradient can be computed efficiently. As with the previous losses, here again we use convex majorization to obtain a tractable
optimization problem.


Given a set of labels y and an estimate y', the numbers of true positives (T₊), true negatives (T₋), false positives (F₊), and false negatives (F₋) are determined according to a contingency table as follows:

               y > 0    y < 0
    y' > 0      T₊       F₊
    y' < 0      F₋       T₋

In the sequel, we denote by m₊ = T₊ + F₋ and m₋ = T₋ + F₊ the numbers of positive and negative labels in y, respectively. We note that the F_β score can be computed based on the contingency table [Joa05] as

    F_β(T₊, T₋) = (1 + β²) T₊ / ( T₊ + m₋ − T₋ + β² m₊ ).        (3.27)

If we want to use ⟨w, x_i⟩ to estimate the label of observation x_i, we may use the following structured loss to directly optimize w.r.t. the F_β score [Joa05]:

    l(X, y, w) = max_{y'} [ (y' − y)⊤f + Δ(T₊, T₋) ],        (3.28)

where f = Xw, Δ(T₊, T₋) := 1 − F_β(T₊, T₋), and (T₊, T₋) is determined by using y and y'. Since Δ does not depend on the specific choice of (y, y') but rather just on the sets on which they disagree, l can be maximized as follows: enumerate all possible m₊ × m₋ contingency tables in a way such that, given a configuration (T₊, T₋), the T₊ (T₋) positive (negative) observations x_i with largest (lowest) value of ⟨w, x_i⟩ are labeled as positive (negative). This is effectively implemented as a nested loop and hence runs in O(m²) time. Algorithm 3.6 describes the procedure in detail.
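The following sketch implements the enumeration idea directly for β = 1, returning only the loss value rather than the subgradient of Algorithm 3.6; the quadratic loop over contingency tables is kept explicit for clarity, and both classes are assumed to be non-empty.

    # Hypothetical sketch of the F1 structured loss (3.28) by explicit
    # enumeration of contingency tables (beta = 1).
    import numpy as np

    def f1_structured_loss(w, X, y):
        f = X @ w
        pos = np.sort(f[y == 1])[::-1]           # positives, descending score
        neg = np.sort(f[y == -1])                # negatives, ascending score
        m_pos, m_neg = len(pos), len(neg)
        best = -np.inf
        for T_pos in range(m_pos + 1):
            for T_neg in range(m_neg + 1):
                F1 = (2 * T_pos / (T_pos + m_neg - T_neg + m_pos)) if T_pos > 0 else 0.0
                delta = 1.0 - F1
                # flipping a positive to -1 changes (y' - y)^T f by -2 f_i;
                # flipping a negative to +1 changes it by +2 f_i
                score = -2 * pos[T_pos:].sum() + 2 * neg[T_neg:].sum() + delta
                best = max(best, score)
        return best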

A3.1.4 Vector Loss Functions


Next we discuss vector loss functions, i.e., functions where w is best described as a matrix (denoted by W ) and the loss depends on W x. Here, we
have feature vector x Rd , label y Rk , and weight matrix W Rdk . We
also denote feature matrix X Rmd as a matrix of m feature vectors xi ,
and stack up the columns Wi of W as a vector w.
Some of the most relevant cases are multiclass classification using both
the exponential families model and structured estimation, hierarchical models, i.e., ontologies, and multivariate regression. Many of those cases are
summarized in Table A3.1.


Algorithm 3.6 Fβ(X, y, w)
1: input: Feature matrix X, labels y, and weight vector w
2: Compute f = Xw
3: π+ ← {i : y_i = 1} sorted in descending order of f
4: π− ← {i : y_i = −1} sorted in ascending order of f
5: Let p_0 = 0 and p_i = 2 Σ_{k=i}^{m+} f_{π+_k} for i = 1, . . . , m+
6: Let n_0 = 0 and n_i = 2 Σ_{k=i}^{m−} f_{π−_k} for i = 1, . . . , m−
7: y′ ← −y and r ← −∞
8: for i = 0 to m+ do
9:   for j = 0 to m− do
10:    r_tmp = Δ(i, j) − p_i + n_j
11:    if r_tmp > r then
12:      r ← r_tmp
13:      T+ ← i and T− ← j
14:    end if
15:   end for
16: end for
17: y′_{π+_i} ← 1 for i = 1, . . . , T+
18: y′_{π−_i} ← −1 for i = 1, . . . , T−
19: g ← (y′ − y)⊤ X
20: return Risk r and subgradient g
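For readers who prefer executable code, the following Python/NumPy sketch implements the enumeration over contingency tables described above. It is a sketch under our own naming conventions, assumes y is a NumPy array with labels in {+1, −1}, and is not a line-by-line transcription of Algorithm 3.6:

import numpy as np

def fbeta_structured_loss(X, y, w, beta=1.0):
    # Enumerate all (T+, T-) configurations; the T+ highest-scoring positives
    # and the T- lowest-scoring negatives keep their labels, the rest flip.
    f = X @ w
    pos = np.where(y == 1)[0]
    neg = np.where(y == -1)[0]
    pos = pos[np.argsort(-f[pos])]                 # positives by descending score
    neg = neg[np.argsort(f[neg])]                  # negatives by ascending score
    m_plus, m_minus = len(pos), len(neg)
    cp = np.concatenate(([0.0], np.cumsum(f[pos])))   # prefix sums of positive scores
    cn = np.concatenate(([0.0], np.cumsum(f[neg])))   # prefix sums of negative scores

    def delta(tp, tn):                             # Delta = 1 - F_beta, cf. (3.27)
        if tp == 0:
            return 1.0
        return 1.0 - (1 + beta ** 2) * tp / (tp + (m_minus - tn) + beta ** 2 * m_plus)

    best, best_tp, best_tn = -np.inf, 0, 0
    for tp in range(m_plus + 1):
        for tn in range(m_minus + 1):
            # (y' - y)^T f: flipped positives contribute -2 f, flipped negatives +2 f
            val = (delta(tp, tn)
                   - 2.0 * (cp[m_plus] - cp[tp])
                   + 2.0 * (cn[m_minus] - cn[tn]))
            if val > best:
                best, best_tp, best_tn = val, tp, tn

    y_bar = -y.astype(float)                       # start with every label flipped
    y_bar[pos[:best_tp]] = 1.0
    y_bar[neg[:best_tn]] = -1.0
    g = (y_bar - y) @ X                            # subgradient with respect to w
    return best, g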

A3.1.4.1 Unstructured Setting

The simplest loss is multivariate regression, where l(x, y, W) = ½ (y − x⊤W) M (y − x⊤W)⊤. In this case it is clear that by pre-computing XW, subsequent calculations of the loss and its gradient are significantly accelerated.
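A minimal NumPy sketch of this computation, assuming a symmetric weighting matrix M and one label vector per row of Y (all names are illustrative), might read:

import numpy as np

def multivariate_regression_risk(X, Y, W, M=None):
    # Empirical risk and gradient for l = 0.5 (y - x^T W) M (y - x^T W)^T,
    # with the predictions F = XW computed only once.
    if M is None:
        M = np.eye(Y.shape[1])
    F = X @ W                          # all predictions in a single matrix product
    R = F - Y                          # residuals, one row per observation
    risk = 0.5 * np.einsum('ij,jk,ik->', R, M, R)
    grad = X.T @ (R @ M)               # gradient with respect to W (M symmetric)
    return risk, grad
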
A second class of important losses is given by plain multiclass classification problems, e.g., recognizing digits of a postal code or categorizing high-level document categories. In this case, φ(x, y) is best represented by e_y ⊗ x (using a linear model). Clearly we may view ⟨w, φ(x, y)⟩ as an operation which chooses a column indexed by y from xW, since all labels y correspond to a different weight vector W_y. Formally we set ⟨w, φ(x, y)⟩ = [xW]_y. In this case, structured estimation losses can be rewritten as

l(x, y, W) = max_{y′} Γ(y, y′) ⟨W_{y′} − W_y, x⟩ + Δ(y, y′)    (3.29)

and

∂_W l(x, y, W) = Γ(y, y*) (e_{y*} − e_y) ⊗ x.    (3.30)

Here Γ and Δ are defined as in Section A3.1.2 and y* denotes the value of y′ for which the RHS of (3.29) is maximized. This means that for unstructured multiclass settings we may simply compute xW. Since this needs to be performed for all observations x_i we may take advantage of fast linear algebra routines and compute f = XW for efficiency. Likewise note that computing the gradient over m observations is now a matrix-matrix multiplication, too: denote by G the matrix whose rows are the gradients Γ(y_i, y_i*) (e_{y_i*} − e_{y_i}). Then ∂_W R_emp(X, y, W) = G⊤ X. Note that G is very sparse with at most two nonzero entries per row, which makes the computation of G⊤ X essentially as expensive as two matrix-vector multiplications. Whenever we have many classes, this may yield significant computational gains.
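The following NumPy sketch carries out this batched computation for (3.29) and (3.30). The default choices of Γ (unit scaling) and Δ (0/1 cost) and all names are illustrative assumptions, not part of the original text:

import numpy as np

def multiclass_softmargin_risk(X, y, W, Gamma=None, Delta=None):
    # y holds class indices in {0, ..., k-1}; Gamma and Delta are (k, k) matrices
    # of margin scalings and label costs (defaults: unit scaling, 0/1 cost).
    m = X.shape[0]
    k = W.shape[1]
    if Delta is None:
        Delta = 1.0 - np.eye(k)
    if Gamma is None:
        Gamma = np.ones((k, k))
    F = X @ W                                       # row i holds x_i W
    margins = F - F[np.arange(m), y][:, None]       # <W_y' - W_y, x_i> for all y'
    scores = Gamma[y] * margins + Delta[y]          # RHS of (3.29), all y' at once
    y_star = scores.argmax(axis=1)
    risk = scores[np.arange(m), y_star].sum()
    # G has at most two nonzero entries per row: +Gamma(y, y*) at y*, -Gamma(y, y*) at y
    G = np.zeros((m, k))
    coef = Gamma[y, y_star]
    G[np.arange(m), y_star] += coef
    G[np.arange(m), y] -= coef
    grad = X.T @ G                                  # same entries as G^T X, shaped like W
    return risk, grad
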
Log-likelihood scores of exponential families share similar expansions. We have

l(x, y, W) = log Σ_{y′} exp ⟨w, φ(x, y′)⟩ − ⟨w, φ(x, y)⟩ = log Σ_{y′} exp ⟨W_{y′}, x⟩ − ⟨W_y, x⟩    (3.31)

and

∂_W l(x, y, W) = [Σ_{y′} (e_{y′} ⊗ x) exp ⟨W_{y′}, x⟩] / [Σ_{y′} exp ⟨W_{y′}, x⟩] − e_y ⊗ x.    (3.32)

The main difference to the soft-margin setting is that the gradients are
not sparse in the number of classes. This means that the computation of
gradients is slightly more costly.
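A corresponding NumPy sketch for the log-likelihood loss (3.31) and its dense gradient (3.32) could look as follows (the names and the summed, rather than averaged, empirical risk are our own choices):

import numpy as np

def multiclass_loglik_risk(X, y, W):
    # Conditional log-likelihood loss of the multiclass exponential family model.
    m = X.shape[0]
    F = X @ W                                   # <W_y', x_i> for all i and y'
    Fmax = F.max(axis=1, keepdims=True)         # stabilize the log-sum-exp
    log_Z = Fmax[:, 0] + np.log(np.exp(F - Fmax).sum(axis=1))
    risk = (log_Z - F[np.arange(m), y]).sum()   # (3.31) summed over the data
    P = np.exp(F - log_Z[:, None])              # p(y' | x_i); dense in the classes
    G = P.copy()
    G[np.arange(m), y] -= 1.0                   # expectation of e_y' minus e_y
    grad = X.T @ G                              # gradient with respect to W
    return risk, grad
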
A3.1.4.2 Ontologies

Fig. A3.1. Two ontologies. Left: a binary hierarchy with internal nodes {1, . . . , 7}
and labels {8, . . . , 15}. Right: a generic directed acyclic graph with internal nodes
{1, . . . , 6, 12} and labels {7, . . . , 11, 13, . . . , 15}. Note that node 5 has two parents,
namely nodes 2 and 3. Moreover, the labels need not be found at the same level of
the tree: nodes 14 and 15 are one level lower than the rest of the nodes.

Assume that the labels we want to estimate belong to a directed acyclic graph. For instance, this may be a gene-ontology graph [ABB+ 00], a patent hierarchy [CH04], or a genealogy. In these cases we have a hierarchy of categories to which an element x may belong. Figure A3.1 gives two examples of such directed acyclic graphs (DAGs). The first example is a binary tree, while the second contains nodes with different numbers of children (e.g., nodes 4 and 12), nodes at different levels having children (e.g., nodes 5 and 12), and nodes which have more than one parent (e.g., node 5). It is a well-known property of trees in which every internal node has at least two children that they have at most as many internal nodes as leaf nodes.
It is now our goal to build a classifier which is able to categorize observations according to the leaf node they belong to (each leaf node is assigned a label y). Denote by k + 1 the number of nodes in the DAG including the root node. In this case we may design a feature map φ(y) ∈ R^k [CH04] by associating with every label y the vector describing the path from the root node to y, ignoring the root node itself. For instance, for the first DAG in Figure A3.1 we have

φ(8) = (1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0) and φ(13) = (0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0).

Whenever several paths are admissible, as in the right DAG of Figure A3.1, we average over all possible paths. For example, we have

φ(10) = (0.5, 0.5, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0) and φ(15) = (0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1).
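The path feature map can be generated mechanically from the parent structure of the DAG. The Python sketch below is one possible reading of the construction: node 1 is taken to be the root and is omitted, entry j of the vector corresponds to node j + 1, and the weight is split equally over multiple parents. All names are illustrative:

import numpy as np
from functools import lru_cache

def make_phi(parents, n_nodes):
    # Path feature map phi(y) in R^(n_nodes - 1); 'parents' maps each node to
    # the list of its parents, and multiple parents are averaged, matching the
    # example for phi(10) above.
    k = n_nodes - 1

    @lru_cache(maxsize=None)
    def phi(node):
        v = np.zeros(k)
        if node == 1:                    # the root contributes nothing
            return v
        v[node - 2] = 1.0                # segment ending in this node
        for p in parents[node]:
            v = v + phi(p) / len(parents[node])
        return v

    return phi

# One consistent reading of the left (binary) DAG of Figure A3.1, rooted at node 1:
parents = {i: [i // 2] for i in range(2, 16)}
phi = make_phi(parents, 15)
print(phi(8))    # reproduces phi(8) as given above
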
Also note that the lengths of the paths need not be the same (e.g., reaching 15 takes a longer path than reaching 13). Likewise, it is natural to assume that Δ(y, y′), i.e., the cost for mislabeling y as y′, will depend on the similarity of the paths. In other words, it is likely that the cost for placing x into the wrong sub-sub-category is less than getting the main category of the object wrong.
To complete the setting, note that for φ(x, y) = φ(y) ⊗ x the cost of computing all labels is k inner products, since the value of ⟨w, φ(x, y)⟩ for a particular y can be obtained by summing the contributions for the segments of the path. This means that the values for all terms can be computed by a simple breadth-first traversal through the graph. As before, we may make use of vectorization in our approach, since we may compute xW ∈ R^k to obtain the contributions on all segments of the DAG before performing the graph traversal. Since we have m patterns x_i we may vectorize matters by pre-computing XW.
Also note that φ(y) − φ(y′) is nonzero only for those edges where the paths for y and y′ differ. Hence we only change weights on those parts of the graph where the categorization differs. Algorithm 3.7 describes the subgradient and loss computation for the soft-margin type of loss function.

Algorithm 3.7 Ontology(X, y, W)
1: input: Feature matrix X ∈ R^{m×d}, labels y, and weight matrix W ∈ R^{d×k}
2: initialization: G = 0 ∈ R^{m×k} and r = 0
3: Compute f = XW and let f_i = x_i W
4: for i = 1 to m do
5:   Let D_i be the DAG with edges annotated with the values of f_i
6:   Traverse D_i to find the node y* that maximizes the sum of the f_i values on the path plus Δ(y_i, y*)
7:   G_i = φ(y*) − φ(y_i)
8:   r ← r + z_{y*} − z_{y_i}
9: end for
10: g = G⊤ X
11: return Risk r and subgradient g
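To connect Algorithm 3.7 with the feature map sketched earlier, the following Python fragment computes the loss and the gradient row G_i for a single observation. For clarity it enumerates the admissible leaves directly instead of performing the breadth-first traversal, and all names (including the cost function Delta) are illustrative:

import numpy as np

def ontology_loss_row(f_i, y_i, leaves, phi, Delta):
    # f_i = x_i W holds one score per path segment; phi is a feature map as in
    # the earlier sketch; leaves is the set of admissible labels (including y_i);
    # Delta(y, y') is the path-dependent cost with Delta(y, y) = 0, so the
    # soft-margin slack below is non-negative.
    scores = {c: f_i @ phi(c) + Delta(y_i, c) for c in leaves}
    y_star = max(scores, key=scores.get)            # maximizer of the margin term
    G_i = phi(y_star) - phi(y_i)                    # row i of the gradient matrix G
    r_i = scores[y_star] - f_i @ phi(y_i)           # soft-margin loss for x_i
    return r_i, G_i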

The same reasoning applies to estimation when using an exponential families model. The only difference is that we need to compute a soft-max over paths rather than exclusively choosing the best path over the ontology. Again, a breadth-first recursion suffices: each of the leaves y of the DAG is associated with a probability p(y|x). To obtain E_{y∼p(y|x)}[φ(y)] all we need to do is perform a bottom-up traversal of the DAG, summing over all probability weights on the path. Wherever a node has more than one parent, we distribute the probability weight equally over its parents.
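A bottom-up pass of this kind can be sketched as follows. It assumes, purely for convenience, that nodes are numbered so that every parent has a smaller index than its children (node 1 being the root) and that leaf_prob maps each leaf to p(y|x); all names are illustrative:

import numpy as np

def expected_phi(parents, n_nodes, leaf_prob):
    # Computes E_{y ~ p(y|x)}[phi(y)] by pushing leaf probabilities towards the
    # root, splitting the weight equally whenever a node has several parents.
    k = n_nodes - 1
    weight = np.zeros(n_nodes + 1)
    for leaf, p in leaf_prob.items():
        weight[leaf] = p
    expectation = np.zeros(k)
    for node in range(n_nodes, 1, -1):              # children before their parents
        if weight[node] == 0.0:
            continue
        expectation[node - 2] += weight[node]       # this node's path segment
        for par in parents[node]:
            weight[par] += weight[node] / len(parents[node])
    return expectation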

Bibliography
[ABB+ 00] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M.
Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris,
D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock, Gene ontology: tool for the
unification of biology. the gene ontology consortium, Nat Genet 25 (2000), 25
29.
[ADSM06] J. Anemuller, J.-R. Duann, T. J. Sejnowski, and S. Makeig, Spatiotemporal dynamics in fmri recordings revealed with complex independent component analysis, Neurocomputing 69 (2006), 15021512.
[AG92] M. Arcones and E. Gine, On the bootstrap of u and v statistics, The Annals
of Statistics 20 (1992), no. 2, 655674.
[AGML90] S. F. Altschul, W. Gish, E. W. Myers, and D. J. Lipman, Basic local
alignment search tool, Journal of Molecular Biology 215 (1990), no. 3, 403
410.
[AHT94] N. Anderson, P. Hall, and D. Titterington, Two-sample test statistics
for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates, Journal of Multivariate Analysis 50
(1994), 4154.
[AS06] Y. Altun and A.J. Smola, Unifying divergence minimization and statistical
inference via convex duality, Proc. Annual Conf. Computational Learning Theory (H.U. Simon and G. Lugosi, eds.), LNCS, Springer, 2006, pp. 139153.
[ATH03] S. Andrews, I. Tsochantaridis, and T. Hofmann, Support vector machines
for multiple-instance learning, Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun, and K. Obermayer, eds.), MIT Press, 2003.
[BBCP07] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, Analysis of representations for domain adaptation, NIPS 19, MIT Press, 2007, pp. 137144.
[BBL05] O. Bousquet, S. Boucheron, and G. Lugosi, Theory of classification: a survey of recent advances, ESAIM: Probab. Stat. 9 (2005), 323 375.
[BCR84] C. Berg, J. P. R. Christensen, and P. Ressel, Harmonic analysis on semigroups, Springer, New York, 1984.
[Bel61] R. E. Bellman, Adaptive control processes, Princeton University Press,
Princeton, NJ, 1961.
[Bel05] Alexandre Belloni, Introduction to bundle methods, Tech. report, Operation
Research Center, M.I.T., 2005.
[Ber85] J. O. Berger, Statistical decision theory and Bayesian analysis, Springer,
New York, 1985.
[Bes74] Julian Besag, Spatial interaction and the statistical analysis of lattice systems
(with discussion), Journal of the Royal Statistical Society. Series B 36 (1974),
no. 2, 192236.
[BG05] G. Biau and L. Gyorfi, On the asymptotic properties of a nonparametric
l1 -test statistic of homogeneity, IEEE Transactions on Information Theory 51
(2005), no. 11, 39653973.
[BGR+ 06] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Scholkopf,


and A. J. Smola, Integrating structured biological data by kernel maximum
mean discrepancy, Bioinformatics (ISMB) 22 (2006), no. 14, e49e57.
[BH04] J. Basilico and T. Hofmann, Unifying collaborative and content-based filtering, Proc. Intl. Conf. Machine Learning (New York, NY), ACM Press, 2004,
pp. 6572.
[BHK98] J. S. Breese, D. Heckerman, and C. Kardie, Empirical analysis of predictive
algorithms for collaborative filtering, Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, 1998, pp. 4352.
[BHS+ 07] G. Bakir, T. Hofmann, B. Scholkopf, A. Smola, B. Taskar, and S. V. N.
Vishwanathan, Predicting structured data, MIT Press, Cambridge, Massachusetts, 2007.
[Bic69] P. Bickel, A distribution free version of the Smirnov two sample test in the
p-variate case, The Annals of Mathematical Statistics 40 (1969), no. 1, 123.
[Bil68] Patrick Billingsley, Convergence of probability measures, John Wiley and
Sons, 1968.
[Bis95] C. M. Bishop, Neural networks for pattern recognition, Clarendon Press,
Oxford, 1995.
[Bis06] Christopher Bishop, Pattern recognition and machine learning, Springer,
2006.
[BJ02] F. R. Bach and M. I. Jordan, Kernel independent component analysis, J.
Mach. Learn. Res. 3 (2002), 148.
[BK07] R. M. Bell and Y. Koren, Lessons from the netflix prize challenge, SIGKDD
Explorations 9 (2007), no. 2, 7579.
[BKL06] A. Beygelzimer, S. Kakade, and J. Langford, Cover trees for nearest neighbor, International Conference on Machine Learning, 2006.
[BM90] H. Bourlard and N. Morgan, A continuous speech recognition system embedding MLP into HMM, Advances in Neural Information Processing Systems 2
(San Mateo, CA) (D. S. Touretzky, ed.), Morgan Kaufmann Publishers, 1990,
pp. 186193.
[BM92] K. P. Bennett and O. L. Mangasarian, Robust linear programming discrimination of two linearly inseparable sets, Optim. Methods Softw. 1 (1992), 2334.
[BM98] C. L. Blake and C. J. Merz, UCI repository of machine learning databases,
1998.
[BM02] P. L. Bartlett and S. Mendelson, Rademacher and Gaussian complexities:
Risk bounds and structural results, J. Mach. Learn. Res. 3 (2002), 463482.
[BNJ03] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 9931022.
[BOS+ 05] K. M. Borgwardt, C. S. Ong, S. Schonauer, S. V. N. Vishwanathan, A. J.
Smola, and H. P. Kriegel, Protein function prediction via graph kernels, Bioinformatics 21 (2005), no. Suppl 1, i47i56.
[BPX+ 07] T. Brants, A.C. Popat, P. Xu, F.J. Och, and J. Dean, Large language models in machine translation, Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning EMNLPCoNLL, 2007, pp. 858867.
[BT03] D.P. Bertsekas and J.N. Tsitsiklis, Introduction to probability, Athena Scientific, 2003.
[BV04] S. Boyd and L. Vandenberghe, Convex optimization, Cambridge University
Press, Cambridge, England, 2004.
[Car97] J.-F. Cardoso, Infomax and maximum likelihood for blind source separation,

IEEE Letters on Signal Processing 4 (1997), 112114.


[Car98] J.-F. Cardoso, Blind signal separation: statistical principles, Proceedings of the IEEE 90 (1998), no. 8, 2009–2026.
[CB02] G. Casella and R. Berger, Statistical inference, 2nd ed., Duxbury, Pacific
Grove, CA, 2002.
[CDLS99] R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter, Probabilistic
networks and expert sytems, Springer, New York, 1999.
[CH04] Lijuan Cai and T. Hofmann, Hierarchical document categorization with support vector machines, Proceedings of the Thirteenth ACM conference on Information and knowledge management (New York, NY, USA), ACM Press, 2004,
pp. 7887.
[Cha00] B. Chazelle, A minimum spanning tree algorithm with inverse-ackermann
type complexity, Journal of the ACM 47 (2000).
[CJ04] R. Caruana and T. Joachims, KDD Cup, http://kodiak.cs.cornell.edu/kddcup/index.html, 2004.
[Com94] P. Comon, Independent component analysis, a new concept?, Signal Processing 36 (1994), 287314.
[Cra46] H. Cramer, Mathematical methods of statistics, Princeton University Press,
1946.
[Cre93] N. A. C. Cressie, Statistics for spatial data, John Wiley and Sons, New York,
1993.
[CS03] K. Crammer and Y. Singer, Ultraconservative online algorithms for multiclass problems, Journal of Machine Learning Research 3 (2003), 951991.
[CSS00] M. Collins, R. E. Schapire, and Y. Singer, Logistic regression, AdaBoost
and Bregman distances, Proc. 13th Annu. Conference on Comput. Learning
Theory, Morgan Kaufmann, San Francisco, 2000, pp. 158169.
[CT91] T. M. Cover and J. A. Thomas, Elements of information theory, John Wiley
and Sons, New York, 1991.
[CV95] Corinna Cortes and V. Vapnik, Support vector networks, Machine Learning
20 (1995), no. 3, 273297.
[DdFG01] Arnaud Doucet, Nando de Freitas, and Neil Gordon, Sequential monte
carlo methods in practice, Springer-Verlag, 2001.
[DEKM98] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence
analysis: Probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.
[DG03] S. Dasgupta and A. Gupta, An elementary proof of a theorem of johnson
and lindenstrauss, Random Struct. Algorithms 22 (2003), no. 1, 6065.
[DG08] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large
clusters, CACM 51 (2008), no. 1, 107113.
[DGDR02] M. Davy, A. Gretton, A. Doucet, and P. J. W. Rayner, Optimized support
vector machines for nonstationary signal classification, IEEE Signal Processing
Letters 9 (2002), no. 12, 442445.
[DGL96] L. Devroye, L. Györfi, and G. Lugosi, A probabilistic theory of pattern
recognition, Applications of mathematics, vol. 31, Springer, New York, 1996.
[DL07] R. Der and D. Lee, Large-margin classification in banach spaces, AISTATS
11, 2007.
[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum Likelihood from
Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society
B 39 (1977), no. 1, 122.
[DPS04] M. Dudík, S. Phillips, and R.E. Schapire, Performance guarantees for regularized maximum entropy density estimation, Proc. Annual Conf. Computational Learning Theory, Springer Verlag, 2004.
[DS06] M. Dudík and R. E. Schapire, Maximum entropy distribution estimation with generalized regularization, Proc. Annual Conf. Computational Learning Theory (Gábor Lugosi and Hans U. Simon, eds.), Springer Verlag, June 2006.
[Dud02] R. M. Dudley, Real analysis and probability, Cambridge University Press,
Cambridge, UK, 2002.
[Fel71] W. Feller, An introduction to probability theory and its applications, 2 ed.,
John Wiley and Sons, New York, 1971.
[Feu93] Andrey Feuerverger, A consistent test for bivariate dependence, International Statistical Review 61 (1993), no. 3, 419433.
[FGSS08] K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf, Kernel measures of
conditional dependence, Advances in Neural Information Processing Systems
20 (Cambridge, MA), MIT Press, 2008, pp. 489496.
[FJ95] A. Frieze and M. Jerrum, An analysis of a monte carlo algorithm for estimating the permanent, Combinatorica 15 (1995), no. 1, 6783.
[FM53] R. Fortet and E. Mourier, Convergence de la répartition empirique vers la répartition théorique, Ann. Scient. École Norm. Sup. 70 (1953), 266–285.
[FMT96] J. Franklin, T. Mitchell, and S. Thrun (eds.), Recent advances in robot
learning, Kluwer International Series in Engineering and Computer Science,
no. 368, Kluwer Academic Publishers, 1996.
[FR79] J. Friedman and L. Rafsky, Multivariate generalizations of the WaldWolfowitz and Smirnov two-sample tests, The Annals of Statistics 7 (1979),
no. 4, 697717.
[FS99] Y. Freund and R. E. Schapire, Large margin classification using the perceptron algorithm, Machine Learning 37 (1999), no. 3, 277296.
[FS01] S. Fine and K. Scheinberg, Efficient SVM training using low-rank kernel
representations, JMLR 2 (2001), 243264.
[FT94] L. Fahrmeir and G. Tutz, Multivariate statistical modelling based on generalized linear models, Springer, 1994.
[GBR+ 07] A. Gretton, K. Borgwardt, M. Rasch, B. Schlkopf, and A. Smola, A kernel
approach to comparing distributions, Proceedings of the 22nd Conference on
Artificial Intelligence (AAAI-07) (2007), 16371641.
[GBSS05] A. Gretton, O. Bousquet, A.J. Smola, and B. Scholkopf, Measuring statistical dependence with Hilbert-Schmidt norms, ALT (S. Jain, H. U. Simon,
and E. Tomita, eds.), Springer-Verlag, 2005, pp. 6377.
[GHS+ 05] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Scholkopf, Kernel
methods for measuring independence, J. Mach. Learn. Res. 6 (2005), 2075
2129.
[GIM99] A. Gionis, P. Indyk, and R. Motwani, Similarity search in high dimensions
via hashing, Proceedings of the 25th VLDB Conference (Edinburgh, Scotland)
(M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie,
eds.), Morgan Kaufmann, 1999, pp. 518529.
[GP02] E. Gokcay and J.C. Principe, Information theoretic clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), no. 2, 158171.
[GS04] T.L. Griffiths and M. Steyvers, Finding scientific topics, Proceedings of the
National Academy of Sciences 101 (2004), 52285235.
[GVP90] D. Geiger, T. Verma, and J. Pearl, Recognizing independence in Bayesian
networks, Networks 20 (1990), 507534.
[GW92] P. Groeneboom and J. A. Wellner, Information bounds and nonparametric

maximum likelihood estimation, DMV, vol. 19, Springer, 1992.


[Hal92] P. Hall, The bootstrap and edgeworth expansions, Springer, New York, 1992.
[Hay91] S. Haykin, Adaptive filter theory, Prentice-Hall, Englewood Cliffs, NJ, 1991,
Second edition.
[Hay98] S. Haykin, Neural networks: A comprehensive foundation, 2nd ed., Macmillan, New York, 1998.
[HBM08] Z. Harchaoui, F. Bach, and E. Moulines, Testing for homogeneity with
kernel fisher discriminant analysis, NIPS 20, MIT Press, 2008.
[HC71] J. M. Hammersley and P. E. Clifford, Markov fields on finite graphs and
lattices, unpublished manuscript, 1971.
[Heb49] D. O. Hebb, The organization of behavior, John Wiley and Sons, New York,
1949.
[HF08] J.C. Huang and B.J. Frey, Cumulative distribution networks and the
derivative-sum-product algorithm, UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, July 9-12, 2008, Helsinki, Finland
(D.A. McAllester and P. Myllymaki, eds.), AUAI Press, 2008, pp. 290297.
[HKO01] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis,
John Wiley and Sons, New York, 2001.
[HLB04] M. Hein, T.N. Lal, and O. Bousquet, Hilbertian metrics on probability measures and their application in SVMs, Proceedings of the 26th DAGM Symposium (Berlin), Springer, 2004, pp. 270277.
[Hoe48] Wassily Hoeffding, A class of statistics with asymptotically normal distribution, The Annals of Mathematical Statistics 19 (1948), no. 3, 293325.
[Hoe63] W. Hoeffding, Probability inequalities for sums of bounded random variables,
Journal of the American Statistical Association 58 (1963), 1330.
[HP99] N. Henze and M. Penrose, On the multivariate runs test, The Annals of
Statistics 27 (1999), no. 1, 290298.
[HT02] P. Hall and N. Tajvidi, Permutation tests for equality of distributions in
high-dimensional settings, Biometrika 89 (2002), no. 2, 359374.
[IM98] P. Indyk and R. Motawani, Approximate nearest neighbors: Towards removing the curse of dimensionality, Proceedings of the 30th Symposium on Theory
of Computing, 1998, pp. 604613.
[JGJS99] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, An introduction to variational methods for graphical models, Machine Learning 37 (1999),
no. 2, 183233.
[JK02] K. Jarvelin and J. Kekalainen, IR evaluation methods for retrieving highly
relevant documents, ACM Special Interest Group in Information Retrieval (SIGIR), New York: ACM, 2002, pp. 4148.
[JK03] T. Jebara and R. Kondor, Bhattacharyya and expected likelihood kernels,
Conference on Computational Learning Theory (COLT) (Heidelberg, Germany) (B. Schölkopf and M. Warmuth, eds.), LNCS, vol. 2777, Springer-Verlag,
2003, pp. 5771.
[JKB94] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous univariate distributions. Volume 1 (second edition), John Wiley and Sons, 1994.
[Joa05] T. Joachims, A support vector method for multivariate performance measures, Proc. Intl. Conf. Machine Learning (San Francisco, California), Morgan
Kaufmann Publishers, 2005, pp. 377384.
[Joa06] T. Joachims, Training linear SVMs in linear time, Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD), ACM, 2006.
[Jor08] M. I. Jordan, An introduction to probabilistic graphical models, MIT Press,

2008, To Appear.
[JV87] R. Jonker and A. Volgenant, A shortest augmenting path algorithm for dense
and sparse linear assignment problems, Computing 38 (1987), 325340.
[Kan95] A. Kankainen, Consistent testing of total independence based on the empirical characteristic function, Ph.D. thesis, University of Jyvaskyla, 1995.
[Kar80] R.M. Karp, An algorithm to solve the m n assignment problem in expected
time O(mn log n), Networks 10 (1980), no. 2, 143152.
[KBG04] D. Kifer, S. Ben-David, and J. Gehrke, Detecting change in data streams,
Very Large Databases (VLDB), 2004.
[KD05] S. S. Keerthi and D. DeCoste, A modified finite Newton method for fast
solution of large scale linear SVMs, J. Mach. Learn. Res. 6 (2005), 341361.
[Kel60] J. E. Kelly, The cutting-plane method for solving convex programs, Journal
of the Society for Industrial and Applied Mathematics 8 (1960), no. 4, 703712.
[KF09] D. Koller and N. Friedman, Probabilistic graphical models: Principles and
techniques, MIT Press, 2009.
[Kiw90] Krzysztof C. Kiwiel, Proximity control in bundle methods for convex nondifferentiable minimization, Mathematical Programming 46 (1990), 105122.
[KM00] Paul Komarek and Andrew Moore, A dynamic adaptation of AD-trees for
efficient machine learning on large data sets, Proc. Intl. Conf. Machine Learning, Morgan Kaufmann, San Francisco, CA, 2000, pp. 495502.
[KMH94] A. Krogh, I. S. Mian, and D. Haussler, A hidden Markov model that finds
genes in e. coli DNA, Nucleic Acids Research 22 (1994), 47684778.
[Koe05] R. Koenker, Quantile regression, Cambridge University Press, 2005.
[Kuh55] H.W. Kuhn, The Hungarian method for the assignment problem, Naval Research Logistics Quarterly 2 (1955), 8397.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning
applied to document recognition, Proceedings of the IEEE 86 (1998), no. 11,
22782324.
[Lew98] D. D. Lewis, Naive (Bayes) at forty: The independence assumption in information retrieval, Proceedings of ECML-98, 10th European Conference on
Machine Learning (Chemnitz, DE) (C. Nedellec and C. Rouveirol, eds.), no.
1398, Springer Verlag, Heidelberg, DE, 1998, pp. 415.
[LGBS00] T.-W. Lee, M. Girolami, A. Bell, and T. Sejnowski, A unifying framework
for independent component analysis, Comput. Math. Appl. 39 (2000), 121.
[LMP01] J. D. Lafferty, A. McCallum, and F. Pereira, Conditional random fields:
Probabilistic modeling for segmenting and labeling sequence data, Proceedings
of International Conference on Machine Learning (San Francisco, CA), vol. 18,
Morgan Kaufmann, 2001, pp. 282289.
[LNN95] Claude Lemarechal, Arkadii Nemirovskii, and Yurii Nesterov, New variants
of bundle methods, Mathematical Programming 69 (1995), 111147.
[LS07] Q. Le and A.J. Smola, Direct optimization of ranking measures, J. Mach.
Learn. Res. (2007), submitted.
[Lue84] D. G. Luenberger, Linear and nonlinear programming, second ed., AddisonWesley, Reading, May 1984.
[Mac67] J. MacQueen, Some methods of classification and analysis of multivariate
observations, Proc. 5th Berkeley Symposium on Math., Stat., and Prob. (L. M.
LeCam and J. Neyman, eds.), U. California Press, Berkeley, CA, 1967, p. 281.
[Mar61] M.E. Maron, Automatic indexing: An experimental inquiry, Journal of the
Association for Computing Machinery 8 (1961), 404417.
[McA07] David McAllester, Generalization bounds and consistency for structured

labeling, Predicting Structured Data (Cambridge, Massachusetts), MIT Press,


2007.
[McD89] C. McDiarmid, On the method of bounded differences, Survey in Combinatorics, Cambridge University Press, 1989, pp. 148188.
[Mer09] J. Mercer, Functions of positive and negative type and their connection with
the theory of integral equations, Philos. Trans. R. Soc. Lond. Ser. A Math.
Phys. Eng. Sci. A 209 (1909), 415446.
[Mit97] T. M. Mitchell, Machine learning, McGraw-Hill, New York, 1997.
[MN83] P. McCullagh and J. A. Nelder, Generalized linear models, Chapman and
Hall, London, 1983.
[MSR+ 97] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and
V. Vapnik, Predicting time series with support vector machines, Artificial Neural Networks ICANN97 (Berlin) (W. Gerstner, A. Germond, M. Hasler, and
J.-D. Nicoud, eds.), Lecture Notes in Comput. Sci., vol. 1327, Springer-Verlag,
1997, pp. 9991004.
[Mül97] A. Müller, Integral probability metrics and their generating classes of functions, Adv. Appl. Prob. 29 (1997), 429–443.
[Mun57] J. Munkres, Algorithms for the assignment and transportation problems,
Journal of SIAM 5 (1957), no. 1, 3238.
[MXZ06] C. Micchelli, Y. Xu, and H. Zhang, Universal kernels, Journal of Machine
Learning Research 7 (2006), 26512667.
[MYA94] N. Murata, S. Yoshizawa, and S. Amari, Network information criterion
determining the number of hidden units for artificial neural network models,
IEEE Transactions on Neural Networks 5 (1994), 865872.
[Nad65] E. A. Nadaraya, On nonparametric estimates of density functions and regression curves, Theory of Probability and its Applications 10 (1965), 186190.
[Nel06] R. B. Nelsen, An introduction to copulas, Springer, 2006.
[NW99] J. Nocedal and S. J. Wright, Numerical optimization, Springer Series in
Operations Research, Springer, 1999.
[NWJ08] X.L. Nguyen, M. Wainwright, and M. Jordan, Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization, NIPS 20,
MIT Press, 2008.
[OL93] J.B. Orlin and Y. Lee, Quickmatch: A very fast algorithm for the assignment
problem, Working Paper 3547-93, Sloan School of Management, Massachusetts
Institute of Technology, Cambridge, MA, March 1993.
[Pap62] A. Papoulis, The fourier integral and its applications, McGraw-Hill, New
York, 1962.
[Pea01] J. Pearl, Causality: Models, reasoning and inference, Cambridge University
Press, 2001.
[Pla99] J. Platt, Fast training of support vector machines using sequential minimal
optimization, Advances in Kernel Methods Support Vector Learning (Cambridge, MA) (B. Schölkopf, C. J. C. Burges, and A. J. Smola, eds.), MIT Press,
1999, pp. 185208.
[PTVF94] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,
Numerical recipes in c. the art of scientific computation, Cambridge University
Press, Cambridge, UK, 1994.
[Rab89] L. R. Rabiner, A tutorial on hidden Markov models and selected applications
in speech recognition, Proceedings of the IEEE 77 (1989), no. 2, 257285.
[Rao73] C. R. Rao, Linear statistical inference and its applications, John Wiley and
Sons, New York, 1973.

[RBZ06] N. Ratliff, J. Bagnell, and M. Zinkevich, Maximum margin planning, International Conference on Machine Learning, July 2006.
[RG99] S. Roweis and Z. Ghahramani, A unifying review of linear Gaussian models,
Neural Computation 11 (1999), no. 2.
[Ros58] F. Rosenblatt, The perceptron: A probabilistic model for information storage
and organization in the brain, Psychological Review 65 (1958), no. 6, 386408.
[Ros05] P. Rosenbaum, An exact distribution-free test comparing two multivariate
distributions based on adjacency, Journal of the Royal Statistical Society B 67
(2005), no. 4, 515530.
[RPB06] M. Richardson, A. Prakash, and E. Brill, Beyond pagerank: machine learning for static ranking, Proceedings of the 15th international conference on
World Wide Web, WWW (L. Carr, D. De Roure, A. Iyengar, C.A. Goble,
and M. Dahlin, eds.), ACM, 2006, pp. 707715.
[RSS+ 07] G. Rätsch, S. Sonnenburg, J. Srinivasan, H. Witte, K.-R. Müller, R. J. Sommer, and B. Schölkopf, Improving the Caenorhabditis elegans genome annotation using machine learning, PLoS Computational Biology 3 (2007), no. 2,
e20 doi:10.1371/journal.pcbi.0030020.
[RTG00] Y. Rubner, C. Tomasi, and L.J. Guibas, The earth movers distance as a
metric for image retrieval, Int. J. Comput. Vision 40 (2000), no. 2, 99121.
[Rud73] W. Rudin, Functional analysis, McGraw-Hill, New York, 1973.
[SC04] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis,
Cambridge University Press, Cambridge, UK, 2004.
[Sch97] B. Schölkopf, Support vector learning, R. Oldenbourg Verlag, Munich, 1997,
Download: http://www.kernel-machines.org.
[Ser80] R. Serfling, Approximation theorems of mathematical statistics, Wiley, New
York, 1980.
[SGSS07] A.J. Smola, A. Gretton, L. Song, and B. Scholkopf, A hilbert space embedding for distributions, Algorithmic Learning Theory (E. Takimoto, ed.),
Lecture Notes on Computer Science, Springer, 2007.
[Sha48] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal 27 (1948), 379423, 623656.
[Sha98] R.D. Shachter, Bayes-ball: The rational pasttime, Fourteenth Conference
on Uncertainty in Artificial Intelligence (Wisconsin, USA) (G.F. Cooper and
S. Moral, eds.), Morgan Kaufmann Publishers, Inc., San Francisco, June 1998.
[Sil86] B. W. Silverman, Density estimation for statistical and data analysis, Monographs on statistics and applied probability, Chapman and Hall, London, 1986.
[Smi39] N.V. Smirnov, On the estimation of the discrepancy between empirical curves
of distribution for two independent samples, Bulletin Mathematics 2 (1939),
326, University of Moscow.
[SPST+ 01] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C.
Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (2001), no. 7, 14431471.
[SS00] A. J. Smola and B. Schölkopf, Sparse greedy matrix approximation for machine learning, Proc. Intl. Conf. Machine Learning (San Francisco) (P. Langley,
ed.), Morgan Kaufmann Publishers, 2000, pp. 911918.
[SS02] B. Schölkopf and A. Smola, Learning with kernels, MIT Press, Cambridge,
MA, 2002.
[STD07] J. Shawe-Taylor and A. Dolia, A framework for probability density estimation, Proceedings of International Workshop on Artificial Intelligence and
Statistics (M. Meila and X. Shen, eds.), 2007.


[Ste01] I. Steinwart, On the influence of the kernel on the consistency of support


vector machines, J. Mach. Learn. Res. 2 (2001), 6793.
[STV04] B. Schölkopf, K. Tsuda, and J.-P. Vert, Kernel methods in computational
biology, MIT Press, Cambridge, MA, 2004.
[SW86] G.R. Shorack and J.A. Wellner, Empirical processes with applications to
statistics, Wiley, New York, 1986.
[SZ92] Helga Schramm and Jochem Zowe, A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical
results, SIAM J. Optimization 2 (1992), 121152.
[SZS+ 08] L. Song, X. Zhang, A. Smola, A. Gretton, and B. Scholkopf, Tailoring
density estimation via reproducing kernel moment matching, ICML, 2008.
[TD99] D. M. J. Tax and R. P. W. Duin, Data domain description by support vectors,
Proceedings ESANN (Brussels) (M. Verleysen, ed.), D Facto, 1999, pp. 251
256.
[TGK04] B. Taskar, C. Guestrin, and D. Koller, Max-margin Markov networks,
Advances in Neural Information Processing Systems 16 (Cambridge, MA)
(S. Thrun, L. Saul, and B. Scholkopf, eds.), MIT Press, 2004, pp. 2532.
[TJHA05] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large margin
methods for structured and interdependent output variables, J. Mach. Learn.
Res. 6 (2005), 14531484.
[TLSS06] I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola, Nonparametric quantile
estimation, J. Mach. Learn. Res. 7 (2006).
[Vap82] V. Vapnik, Estimation of dependences based on empirical data, Springer,
Berlin, 1982.
[Vap95] V. Vapnik, The nature of statistical learning theory, Springer, New York, 1995.
[Vap98] V. Vapnik, Statistical learning theory, John Wiley and Sons, New York, 1998.
[vdG00] S. van de Geer, Empirical processes in M-estimation, Cambridge University
Press, 2000.
[vdVW96] A. W. van der Vaart and J. A. Wellner, Weak convergence and empirical
processes, Springer, 1996.
[VGS97] V. Vapnik, S. Golowich, and A. J. Smola, Support vector method for function approximation, regression estimation, and signal processing, Advances in
Neural Information Processing Systems 9 (Cambridge, MA) (M. C. Mozer,
M. I. Jordan, and T. Petsche, eds.), MIT Press, 1997, pp. 281287.
[Voo01] E. Voorhees, Overview of the TREC 2001 question answering track,
TREC, 2001.
[Wah97] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces and
the randomized GACV, Tech. Report 984, Department of Statistics, University
of Wisconsin, Madison, 1997.
[Wat64] G. S. Watson, Smooth regression analysis, Sankhya A 26 (1964), 359372.
[Wat99] C. Watkins, Dynamic alignment kernels, CSD-TR-98- 11, Royal Holloway,
University of London, Egham, Surrey, UK, 1999.
[WB06] G. Welch and G. Bishop, An introduction to the kalman filter, Tech. Report
TR-95-041, Department of Computer Science, University of North Carolina at
Chapel Hill, 2006.
[Wil44] J. E. Wilkins, A note on skewness and kurtosis, The Annals of Mathematical
Statistics 15 (1944), no. 3, 333335.
[Wil98] C. K. I. Williams, Prediction with Gaussian processes: From linear regression
to linear prediction and beyond, Learning and Inference in Graphical Models
(M. I. Jordan, ed.), Kluwer Academic, 1998, pp. 599621.

[WJ03] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, Tech. Report 649, UC Berkeley, Department of
Statistics, September 2003.
[WJ08] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference, vol. 1, Foundations and Trends in Machine Learning, no. 1–2, 2008.
[WS01] Christopher K. I. Williams and Matthias Seeger, Using the Nyström method to
speed up kernel machines, Advances in Neural Information Processing Systems
13 (Cambridge, MA) (T. K. Leen, T. G. Dietterich, and V. Tresp, eds.), MIT
Press, 2001, pp. 682688.
[YAC98] H. H. Yang, S.-I. Amari, and A. Cichocki, Information theoretic approach to
blind separation of sources in non-linear mixture, Signal Processing 64 (1998),
no. 3, 291300.
