examples from the same class together although they look very different in pixel space. We call this the blue sky problem, with the pictures of a flying bird and a flying airplane in mind: both will contain many blue pixels, but only a few other pixels make the actual distinction. In particular, auto-encoder approaches suffer from the blue sky problem since their cost function usually penalizes reconstruction in pixel space and hence favors the encoding of potentially unnecessary information such as the color of the sky.
1.2. Related work

Classical clustering approaches are limited to the original data space and are thus not very effective in high-dimensional spaces with complicated data distributions, such as images. Early deep-learning-based clustering methods first train a feature extractor, and in a separate step apply a clustering algorithm to the features [10, 32, 34].
Well-clusterable deep representations are characterized by high within-cluster similarities of points compared to low across-cluster similarities. Some loss functions optimize only one of these two aspects. In particular, methods that do not encourage across-cluster dissimilarities are at risk of producing worse (or theoretically even trivial) representations/results [42], but some of them nonetheless work well in practice. Other methods avoid trivial representations by using additional techniques unrelated to clustering, such as an autoencoder reconstruction loss during pre-training or during the entire training [20, 41, 42, 26].
The Deep Embedded Clustering (DEC) [41] model simultaneously learns feature representations and cluster assignments. To get a good initialization, DEC needs auto-encoder pretraining. Also, the approach has some issues scaling to larger datasets such as STL-10.
Variational Deep Embedding (VaDE) [44] is a generative clustering approach based on variational autoencoders. It achieves significantly more accurate results on small datasets, but does not scale to larger, higher-resolution datasets. For STL-10, it uses a network pretrained on the dramatically larger Imagenet dataset.
Joint Unsupervised Learning (JULE) of representations and clusters [43] is based on agglomerative clustering. In contrast to our method, JULE's network training procedure alternates between cluster updates and network training. JULE achieves very good results on several datasets. However, its computational and memory requirements are relatively high, as indicated by [17].
Clustering convolutional neural networks (CCNN) [17] predict clustering assignments at the last layer, and a clustering-friendly representation at an intermediate layer. The training loss is the difference between the clustering predictions and the results of k-means applied to the learned representation. Interestingly, CCNN yields good results in practice, despite both the clustering features and the cluster predictions being initialized randomly and in contradictory ways. CCNN requires running k-means in each iteration.

Deep Embedded Regularized Clustering (DEPICT) [5] is similar to DEC in that feature representations and cluster assignments are simultaneously learned using a deep network. An additional regularization term is used to balance the cluster assignment probabilities, which allows the pretraining step to be omitted. This method achieved a performance comparable to ours on MNIST and FRGC. However, it requires pre-training using the autoencoder reconstruction loss.

In Categorical Generative Adversarial Networks (CatGANs) [35], unlike standard GANs, the discriminator learns to separate the data into k categories instead of learning a binary discriminative function. Results on large datasets are not reported. Another approach based on GANs is [31].

Information-Maximizing Self-Augmented Training (IMSAT) [18] is based on Regularized Information Maximization (RIM) [22], which learns a probabilistic classifier that maximizes the mutual information between the input and the class assignment. In IMSAT, this mutual information is represented by the difference between the marginal and conditional distributions of those values. IMSAT also introduces regularization via self-augmentation: an additional term in the cost function ensures that the generated data point and the original are assigned similarly. For large datasets, IMSAT uses fixed pre-trained network layers.

Several other methods with varying quality of results exist [3, 27, 40, 2, 33, 15].

In summary, previous methods either do not scale well to large datasets [41, 44, 35], have high computational and/or memory cost [43], require a clustering algorithm to be run during or after the training (most methods, e.g. [43, 17]), are at risk of producing trivial representations, and/or require some labeled data [18] or clustering-unrelated losses (for example an additional autoencoder loss [20, 41, 42, 26, 44, 5, 1]) to (pre-)train (parts of) the network. In particular, reconstruction losses tend to overestimate the importance of low-level features such as colors.

In order to solve these problems, it is desirable to develop new training schemes that tackle the clustering problem as a direct end-to-end training of a neural network.

1.3. Contribution

In this paper, we propose Associative Deep Clustering as an end-to-end framework that allows a neural network to be trained directly for clustering. In particular, we introduce centroid embeddings: variables that look like embeddings of images but are actually part of the model. They can be optimized, and they are used to learn a projection from embedding space to the desired dimensionality, e.g. logits ∈ R^k, where k is the number of classes. The intuition is that the centroid variables carry over high-level information about the data structure (i.e. cluster centroid embeddings) from iteration to iteration. It makes sense to train a neural network directly for a clustering task, rather than a proxy task (such as reconstruction from embeddings), since the ultimate goal is actually clustering.

To facilitate this, we introduce a cost function consisting of multiple terms which cause clusters to separate while associating similar images. Associations [13] are made between centroids and image embeddings, and between embeddings of images and their transformations. Unlike previous methods, we use clustering-specific loss terms that allow us to directly learn the assignment of an image to a cluster. There is no need for a subsequent training procedure on the embeddings. The output of the network is a cluster membership distribution for a given input image. We demonstrate that this approach is useful for clustering images without any prior knowledge other than the number of classes.
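To make the idea of centroid embeddings concrete, the following minimal NumPy sketch treats them as ordinary trainable parameters with the same shape as image embeddings, sharing one projection to k logits with the image embeddings. The embedding dimensionality, batch size and random stand-ins for learned parameters are assumptions for illustration only; this is not our TensorFlow implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    k, d = 10, 128                          # number of clusters, embedding size (assumed)

    centroids = rng.normal(size=(k, d))     # centroid embeddings: part of the model, trainable
    W = rng.normal(size=(d, k))             # shared projection from embedding space to k logits

    def cluster_probabilities(embeddings):
        # Project embeddings (image embeddings or centroids) to logits and softmax them,
        # yielding one cluster-membership distribution per row.
        logits = embeddings @ W
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    z = rng.normal(size=(32, d))            # stand-in for image embeddings f(x) of a batch
    probs = cluster_probabilities(z)        # shape (32, k); each row sums to 1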
The resulting learned cluster assignments are so good that subsequently re-running a clustering algorithm on the learned embeddings does not further improve the results (unlike e.g. [43]).

To the best of our knowledge, we are the first to introduce a training scheme for direct clustering jointly with network training, as opposed to feature learning approaches where the obtained features need to be clustered by a second algorithm such as k-means (as in most methods), or to methods where the clustering is directly learned but parts of the network are pre-trained and fixed [18].

Moreover, unlike most methods, we use only clustering-specific and invariance-imposing losses and no clustering-unrelated losses such as autoencoder reconstruction or classification-based pre-training.

In summary, our contributions are:

• We introduce centroid variables that are jointly trained with the network's weights.

• This is facilitated by our clustering cost function that makes associations between cluster centroids and image embeddings. No labels are needed at any time. In particular, there is no subsequent clustering step necessary such as k-means.

• We conducted an extensive ablation study demonstrating the effects of our proposed training schedule.

• Our method outperforms the current state of the art on a number of datasets.

• All code is available as an open-source implementation in TensorFlow¹.

¹ The code will be made available upon publication of this paper.

2. Associative Deep Clustering

In this section, we describe our setup and the cost function. Figure 2 depicts an overall schematic which will be referenced in the following.

2.1. Associative learning

Recent works have shown that associations in embedding space can be used for semi-supervised training and domain adaptation [13, 12]. Both applications require some amount of labeled training data which is fed through a neural network along with unlabeled data. Then, an imaginary walker is sent from embeddings of labeled examples to embeddings of unlabeled examples and back. From this idea, a cost function is constructed that encourages consistent association cycles, meaning that two labeled examples are associated via an unlabeled example with a high probability if the labels match and with a low probability otherwise.

More formally, let A_i and B_j be embeddings of labeled and unlabeled data, respectively. Then a similarity matrix M_ij := A_i · B_j can be defined. These similarities can now be transformed into a transition probability matrix by softmaxing M over columns:

    P^ab_ij = P(B_j | A_i) := (softmax_cols(M))_ij = exp(M_ij) / Σ_j' exp(M_ij')        (1)

Analogously, the transition probabilities in the other direction (P^ba) are obtained by replacing M with M^T. Finally, the metric of interest is the probability of an association cycle from A to B to A:

    P^aba_ij := (P^ab P^ba)_ij = Σ_k P^ab_ik P^ba_kj        (2)

Since the labels of A are known, it is possible to define a target distribution where inconsistent association cycles (where labels mismatch) have zero probability:

    T_ij := 1/#class(A_i)  if class(A_i) = class(A_j),  and 0 else        (3)

Now, the associative loss function becomes L_assoc(A, B) := crossentropy(T_ij; P^aba_ij). For more details, the reader is kindly referred to [13].
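The quantities above translate directly into a few lines of code. The following NumPy sketch mirrors Eqs. (1)-(3); it is an illustration of the math only, not the TensorFlow implementation of [13], and the small constant added inside the logarithm is a numerical-stability assumption.

    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def association_loss(A, B, labels_A):
        # A: (n_a, d) embeddings of labeled data, B: (n_b, d) embeddings of unlabeled data.
        M = A @ B.T                                   # similarities M_ij = A_i . B_j
        P_ab = softmax(M, axis=1)                     # Eq. (1): transition probabilities A -> B
        P_ba = softmax(M.T, axis=1)                   # transition probabilities B -> A
        P_aba = P_ab @ P_ba                           # Eq. (2): round-trip probabilities A -> B -> A
        same = (labels_A[:, None] == labels_A[None, :]).astype(float)
        T = same / same.sum(axis=1, keepdims=True)    # Eq. (3): uniform over same-class pairs
        return -np.mean(np.sum(T * np.log(P_aba + 1e-8), axis=1))   # crossentropy(T, P_aba)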
2.2. Clustering loss function

In this work, we further develop this setup since there is no labeled batch A. In its stead, we introduce centroid variables µ_k that have the same dimensionality as the embeddings and that have surrogate labels 0, 1, …, k − 1.
With them and embeddings of (augmented) images, we can define clustering associations.

We define two associative loss terms:

• L_assoc,c := L_assoc(µ_k; z_i), where associations are made between (labeled) centroid variables µ_k (instead of A) and (unlabeled) image embeddings z_i.

• L_assoc,aug := L_assoc(z_i; f(τ(x_j))), where we apply 4 random transformations τ to each image x_j, resulting in 4 "augmented" embeddings z_j that share the same surrogate label. The "labeled" batch A then consists of all augmented images, the "unlabeled" batch B of embeddings of the unaugmented images x_i.

For simplicity, we imply a sum over all examples x_i and x_j in the batch. The loss is then divided by the number of summands.
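Reusing the association_loss sketch from Section 2.1, the two terms can be written as follows. This is again a NumPy illustration: the batch size, embedding size and the random stand-ins for the network outputs f(x_i) and f(τ(x_j)) are assumptions, not part of our implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 32, 128, 10                       # batch size, embedding size, clusters (assumed)
    centroids = rng.normal(size=(k, d))         # trainable centroid variables mu_k
    z = rng.normal(size=(n, d))                 # stand-in for embeddings f(x_i) of the batch
    z_aug = rng.normal(size=(4 * n, d))         # stand-in for embeddings f(tau(x_j)), 4 per image

    # L_assoc,c: centroids (surrogate labels 0..k-1) are associated with image embeddings.
    L_assoc_c = association_loss(centroids, z, np.arange(k))

    # L_assoc,aug: the 4 augmented copies of image j all carry surrogate label j and are
    # associated with the embeddings of the unaugmented images.
    L_assoc_aug = association_loss(z_aug, z, np.tile(np.arange(n), 4))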
The transformation τ randomly applies cropping and flipping, Gaussian noise, small rotations and changes in brightness, saturation and hue.
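A much simplified version of τ might look like the following NumPy sketch; only flipping, Gaussian noise and a brightness change are shown, and the noise scale and brightness range are placeholder values. Crops, rotations and saturation/hue jitter would be added analogously, e.g. via an image-processing library.

    import numpy as np

    rng = np.random.default_rng(0)

    def tau(image):
        # image: array of shape (H, W, 3) with values in [0, 1]
        out = image.copy()
        if rng.random() < 0.5:
            out = out[:, ::-1, :]                                # random horizontal flip
        out = out + rng.normal(scale=0.02, size=out.shape)      # Gaussian noise
        out = out * rng.uniform(0.8, 1.2)                        # brightness change
        return np.clip(out, 0.0, 1.0)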
All embeddings and centroids are regularized with an L2 loss L_norm to have norm 1. We chose this value empirically to avoid too small numbers in the 128-dimensional embedding vectors.

A fully-connected layer W projects the embeddings (µ_k, z_i and z_j) to logits o_{k,i,j} ∈ R^k. After a softmax operation, each logit vector entry can be interpreted as the probability of membership in the respective cluster. W is optimized through a standard cross-entropy classification loss L_classification, where the inputs are µ_k and the desired outputs are the surrogate labels of the centroids.
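Both components are straightforward; one possible NumPy sketch follows. The exact form of the norm penalty is an assumption (any L2 loss pulling the norms towards 1 fits the description above), and W denotes the projection matrix from the earlier sketches.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def l2_norm_loss(v):
        # L_norm: penalize embeddings/centroids whose L2 norm deviates from 1.
        return np.mean((np.linalg.norm(v, axis=1) - 1.0) ** 2)

    def classification_loss(centroids, W):
        # L_classification: centroid mu_k, projected by W and softmaxed, should be
        # classified as its own surrogate label k.
        probs = softmax(centroids @ W)
        idx = np.arange(centroids.shape[0])
        return -np.mean(np.log(probs[idx, idx] + 1e-8))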
Finally, we define a transformation loss:

    L_trafo := | 1 − f(x_i)^T f(τ(x_j)) − crossentropy(o_i, o_j) |        (4)
Here, we apply τ once and set it once to the identity, such that the "augmented" batch contains each image and a transformed version of it. This formulation can be interpreted as a trick to obtain weak labels for examples, since the logits yield an estimate for the class membership "to the best knowledge of the network so far". In particular, this loss serves multiple purposes:

• The logit o_i of an image x_i and the logit o_j of the transformed version τ(x_j) should be similar for i = j.

• Images that the network thinks belong to the same cluster (i.e. their centroid distribution o is similar) should also have similar embeddings, regardless of whether τ was applied.

• Embeddings of one image and a different image are allowed to be similar if the centroid membership is similar.

• Embeddings are forced to be dissimilar if they do not belong to the same cluster.
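Interpreting the implied sum over all pairs (x_i, τ(x_j)) in the batch, Eq. (4) can be sketched as follows. In this NumPy illustration, emb and emb_t are assumed to hold the embeddings of the original and transformed images, and o and o_t the corresponding logits.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def trafo_loss(emb, emb_t, o, o_t):
        # Eq. (4), averaged over all pairs (i, j) in the batch.
        dots = emb @ emb_t.T                                  # f(x_i)^T f(tau(x_j))
        ce = -softmax(o) @ np.log(softmax(o_t) + 1e-8).T      # crossentropy(o_i, o_j)
        return np.mean(np.abs(1.0 - dots - ce))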
The final cost function now becomes

    L =   α L_assoc,c        (5)
        + β L_assoc,aug      (6)
        + γ L_norm           (7)
        + δ L_trafo          (8)

with the loss weights α, β, γ, δ.
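In code, the objective is simply the weighted sum of the four scalars produced by the sketches above, minimized jointly with respect to the network weights, the projection W and the centroid variables. The weight values below are placeholders, not the tuned values resulting from our ablation study.

    # Total objective, Eqs. (5)-(8): weighted sum of the four loss terms defined above.
    alpha, beta, gamma, delta = 1.0, 1.0, 1.0, 1.0          # placeholder loss weights
    L_total = alpha * L_assoc_c + beta * L_assoc_aug + gamma * L_norm + delta * L_trafo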
In the remainder of this paper, we present an ablation study to demonstrate that the loss terms do in fact support the clustering performance. From this study, we determine a set of hyperparameters to be held fixed for all datasets. With these fixed parameters, we finally report clustering scores on various datasets along with a qualitative analysis of the clustering results.
Figure 2: Schematic of the framework and the losses. Green boxes are trainable variables. Yellow boxes are representations of the input. Red lines connect the losses with their inputs. Please find the definitions in Section 2.2.
Method                     MNIST        FRGC         SVHN          CIFAR-10      STL-10
k-means on pixels          53.49        24.3         12.5          20.8          22.0
DEC [41]                   84.30        37.8         -             -             35.9‡†
pretrain+DEC [18]          -            -            11.9 (0.4)†   46.9 (0.9)†   78.1 (0.1)†
VaDE [44]                  94.46        -            -             -             84.5†
CatGAN [35]                95.73        -            -             -             -
JULE [43]                  96.4         46.1         -             63.5‡‡        -
DEPICT [5]                 96.5         47.0         -             -             -
IMSAT [18]                 98.4 (0.4)   -            57.3 (3.9)‡   45.6 (2.9)†   94.1†
CCNN [17]                  91.6         -            -             -             -
ours (VCNN): Mean          98.7 (0.6)   43.7 (1.9)   37.2 (4.6)    26.7 (2.0)    38.9 (5.9)
ours (VCNN): Best          99.2         46.0         43.4          28.7          41.5
ours (ResNet): Mean        95.4 (2.9)   21.9 (4.9)   38.6 (4.1)    29.3 (1.5)    47.8 (2.7)
ours (ResNet): Best        97.3         29.0         45.3          32.5          53.0
ours (ResNet): K-Means     93.8 (4.5)   24.0 (3.0)   35.4 (3.8)    30.0 (1.8)    47.1 (3.2)

Table 1: Clustering. Accuracy (%) on the test sets (higher is better). Standard deviation in parentheses where available. We report the mean and best of 20 runs for two architectures. For ResNet, we also report the accuracy when k-means is run on top of the obtained embeddings after convergence. †: Using features after pre-training on Imagenet. ‡: Using GIST features. ‡†: Using Histogram-of-oriented-gradients (HOG) features. ‡‡: Using 5k labels to train a classifier on top of the features from clustering.