Towards Personalized Image Captioning via Multimodal Memory Networks
Cesc Chunseong Park, Byeongchang Kim, and Gunhee Kim

• Cesc Chunseong Park is with Lunit Inc., Seoul, Korea. E-mail: cspark@lunit.io
• Byeongchang Kim and Gunhee Kim are with the Department of Computer Science and Engineering & Center of Superintelligence, Seoul National University, Seoul, Korea. E-mail: byeongchang.kim@vision.snu.ac.kr, gunhee@snu.ac.kr
Manuscript received April 19, 2005; revised August 26, 2015.

Abstract—We address personalized image captioning, which generates a descriptive sentence for a user’s image, accounting for prior
knowledge such as her active vocabulary or writing style in her previous documents. As applications of personalized image captioning,
we solve two post automation tasks in social networks: hashtag prediction and post generation. The hashtag prediction predicts a list of
hashtags for an image, while the post generation creates natural text consisting of normal words, emojis, and even hashtags. We
propose a novel personalized captioning model named Context Sequence Memory Network (CSMN). Its unique updates over existing
memory networks include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously
generated words into memory to capture long-term information, and (iii) adopting a CNN memory structure to jointly represent nearby
ordered memory slots for better context understanding. For evaluation, we collect a new dataset InstaPIC-1.1M, comprising 1.1M
Instagram posts from 6.3K users. We further use the benchmark YFCC100M dataset [1] to validate the generality of our approach.
With quantitative evaluation and user studies via Amazon Mechanical Turk, we show that the three novel features of the CSMN help
enhance the performance of personalized image captioning over state-of-the-art captioning models.

Index Terms—Image captioning, personalization, memory networks, convolutional neural networks.

1 INTRODUCTION

Image captioning is a task of automatically generating a descriptive sentence of an image [2], [3], [4], [5], [6], [7], [8], [9], [10]. As this task is often regarded as one of the frontier-AI problems, it has been actively studied in recent vision and language research. It requires not only an algorithm with in-depth understanding going beyond category or attribute levels, but also the ability to connect its interpretation with a language model to create a natural sentence.

This work addresses personalization issues of image captioning, which have not been discussed in previous research. We aim at generating a descriptive sentence for an image, accounting for prior knowledge such as the user's active vocabulary or writing styles in her previous documents. Figure 1(a) illustrates the motivation with an Instagram post example. Given the same riverside image, each user generates different captions according to their own experience, thoughts, or writing styles, while focusing on different themes such as solitude, Melbourne, and wedding, respectively.

Potentially, personalized image captioning is applicable to a wide range of automation services in photo-sharing social networks. For example, in Instagram or Facebook, users tend to instantly take and share pictures as posts using their mobile phones. One convenient function here is to automatically craft hashtags or associated text descriptions using their own words, to expedite the completion of an image post. Another example would be annotating photos from personal photo lifelogs [11]. This often necessitates generating diverse annotations for individual events while keeping consistency within the whole lifelog. The difficulty can be alleviated to some extent by personalized captioning, utilizing prior knowledge about the user's writing. There are also other interesting potential applications, such as generating personalized scene descriptions for visually impaired people [12], suggesting diverse captions for an image [13], and making a chit-chat dialogue agent that is conditioned on user profile information [14].

To show the usefulness of personalized image captioning, we focus on two post automation tasks: hashtag prediction and post generation. See an example of an Instagram post in Figure 1(b). The hashtag prediction automatically predicts a list of hashtags for the image, while the post generation creates a natural sentence consisting of normal words, emojis, and even hashtags. Personalization is key to success in these two tasks, because the text in social networks is not a simple description of image content, but the user's own story and experience about the image, told with his or her favorite vocabulary and expressions.

To achieve personalized image captioning, we propose a multimodal memory network model named the context sequence memory network (CSMN). Our model is inspired by recent advances in neural memory networks [15], [16], [17], which explicitly include memory components to which neural networks read and write data for capturing long-term information. Our major updates over previous memory network models are three-fold.

First, we propose to use the memory as a context repository of prior knowledge for personalized image captioning. Since the topics of social network posts are too broad and users' writing styles are too diverse, it is crucial to leverage prior knowledge about the authors or metadata around the images. Our memory retains such multiple types of context information to promote more focused prediction, including users' active vocabulary and various image descriptors.


Fig. 1. Problem statement of personalized image captioning with an Instagram example. (a) Personalized image captioning is motivated by the observation that different users are likely to generate different sentences for the same image, according to their own experiences, thoughts, or writing styles (the figure shows three users' captions for the same query image: "Beautiful solitude in the morning", "The beautiful Melbourne, I love spring", and "Beautiful day for a wedding"). (b) As main applications, we address hashtag prediction and post generation tasks. Given a query image, the former predicts a list of hashtags, while the latter generates a descriptive text to complete a post. We propose a versatile context sequence memory network (CSMN) model.

Second, we design the memory to sequentially store all of the output words that the model generates. This leads to three important advantages. First, it enables the model to selectively attend, at every step, to the most informative previous words and their combination with the other context information in the memory. Second, our model does not suffer from the vanishing gradient problem. Most captioning models are equipped with RNN-based decoders (e.g. [2], [7], [8], [9], [10], [18]), which predict a word at every time step based only on the current input and a single or a few hidden states as an implicit summary of all previous histories. Thus, RNNs and their variants often fail to capture long-term dependencies, which could worsen if one wants to use prior knowledge together. On the other hand, our state-based sequence generation explicitly retains all the word information in the memory to predict the next words. By using teacher-forced learning [19], our model has a Markov property at training time; predicting a previous word y_{t-1} has no effect on predicting the next word y_t, which depends only on the current memory state. Thus, the gradients from the current time step prediction y_t are not propagated through time. Third, our model can be easily parallelized during training, while RNNs are often tricky to parallelize since they need to maintain hidden states over all past time steps. In contrast, our model has no such hidden states for history, as it does not depend on the computations of the previous time steps, and thus allows parallelization over every time step in the sequence.

Third, we propose to exploit a CNN to jointly represent nearby ordered memory slots for better context understanding. Original memory networks [16], [17] leverage time embedding to model the memory order. Still, its representation power is low, because it cannot represent the correlations between multiple memory slots, for which we exploit convolution layers that lead to much stronger representation power. We will present a more in-depth justification of the memory CNN in section 4.3, and the quantitative performance improvement in section 5.3.

To evaluate the effectiveness of the three novel features of the proposed CSMN, we collect a new personalized image captioning dataset named InstaPIC-1.1M, comprising 1.1M Instagram posts from 6.3K users. Instagram is a great source for personalized captioning, because posts mostly include personal pictures with long hashtag lists and characteristic text with a wide range of topics. For each picture post, we consider the body text or a list of hashtags as groundtruth captions. We also use an existing benchmark dataset, the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) [1], to validate the generality of our approach. It comprises 100 million images and videos that have been uploaded to Flickr between 2004 and 2014.

Our experimental results demonstrate that the aforementioned three unique features of our CSMN model indeed improve captioning performance, especially for personalization purposes. We also validate that our CSMN significantly outperforms several state-of-the-art captioning models with RNN or LSTM decoders (e.g. [8], [9], [20]). We evaluate with quantitative language metrics (e.g. BLEU [21], CIDEr [22], METEOR [23], and ROUGE [24]) and user studies via Amazon Mechanical Turk.

This paper extends the preliminary work of Park et al. [25] in a number of aspects. First, we make model updates after thorough experimental comparisons, including replacing the single-layer CNN of [25] with multi-layer ones for more expressive memory representation. Second, we apply our model to the benchmark dataset YFCC100M [1] to show better generalization performance of our approach. Third, we perform thorough experiments to validate the effects of individual building blocks of our algorithm. For example, we present visualization examples of memory

attention in section 5.6. Finally, we describe more details of the algorithms, implementation, and experimental settings, including the formulation changes for different image descriptors in section 4.2 and a minibatch sampling strategy in section 4.4.

We summarize the contributions of this work as follows.

(1) To the best of our knowledge, we propose the first personalized image captioning approach. We introduce two practical post automation tasks that benefit from personalized captioning: post generation and hashtag prediction.

(2) We propose a novel memory network model named CSMN for personalized captioning. The unique updates of CSMN include (i) exploiting memory as a repository for multiple types of context information, (ii) appending previously generated words into memory to capture long-term information without the vanishing gradient problem, and (iii) adopting a CNN memory structure to jointly represent nearby ordered memory slots.

(3) For evaluation of personalized image captioning, we introduce a novel Instagram dataset named InstaPIC-1.1M. We make the code and data publicly available.

(4) With quantitative evaluation and user studies via AMT, we demonstrate the effectiveness of the three novel features of CSMN and its performance superiority over state-of-the-art captioning models, including [8], [9], [20].

2 RELATED WORK

Image Captioning. In recent years, much work has been published on image captioning, including [2], [3], [4], [5], [6], [7], [8], [9], [10], [26], [27], [28], to name a few. Many proposed captioning models exploit RNN-based decoders to generate a sequence of words from an encoded input image representation. For example, long-term recurrent convolutional networks [2] are one of the earliest models to use RNNs for modeling the relations between sequential inputs and outputs. You et al. [10] exploit semantic attention to combine top-down and bottom-up strategies to extract richer information from images, and couple it with an LSTM decoder. Ren et al. [26] utilize an actor-critic reinforcement learning model with a visual-semantic embedding reward for better guidance of captioning. Venugopalan et al. [27] use external image and semantic knowledge to describe unseen object categories in output captions. Lu et al. [28] propose an adaptive attention encoder-decoder framework to provide a fallback option to the caption decoder.

Compared to such recent progress of image captioning research, it is novel to replace an RNN-based decoder with an explicit sequence memory. Moreover, no previous work has tackled the personalization issue, which is the key objective of this work. We also introduce post completion and hashtag prediction as concrete and practical applications for image captioning.

Personalization in Vision and Language Research. There have been many studies about personalization in computer vision and natural language processing [29], [30], [31], [32], [33], [34]. Especially, Denton et al. [29] develop a CNN model that predicts hashtags from image content and user information. However, this work does not formulate the hashtag prediction as image captioning, and does not address post completion. In computer vision, Yao et al. [32] propose a domain adaptation approach to classify user-specific human gestures. Almaev et al. [31] adopt a transfer learning framework for person-specific facial action unit detection. In NLP, Mirkin et al. [35] enhance machine translation performance by exploiting personal traits. Polozov et al. [34] generate personalized mathematical word problems for a given tutor/student specification by logic programming.

Compared to these papers, our problem setup is novel in that personalization issues in image captioning have not been discussed yet.

Neural Networks with Memory. Various memory network models have been proposed to enable neural networks to store variables and data over long timescales. Neural Turing Machines [15] use external memory to solve algorithmic problems such as sorting and copying. Later, this architecture was extended to the Differentiable Neural Computer (DNC) [36] to solve more complicated algorithmic problems such as finding the shortest path and graph traversal. Weston et al. [17] propose one of the earliest memory network models for natural language question answering (QA), and later Sukhbaatar et al. [16] modify the network to be trainable in an end-to-end manner. Kumar et al. [37] and Miller et al. [38] address language QA tasks by proposing novel memory networks such as dynamic networks with episodic memory in [37] and key-value memory networks in [38].

Compared to previous memory networks, our CSMN has three novel features as discussed in section 1: (i) memory usage of multiple context types, (ii) an explicit memory storage of previous words to capture long-term dependencies, and (iii) CNN-based memory attention to fuse multiple context types for better word prediction.

Natural Language Processing with Convolutional Neural Networks. Recently, CNN models have been popularly applied to NLP tasks beyond computer vision tasks, including [39], [40], [41], [42]. Kim [39] shows that single-layer CNN models achieve state-of-the-art performance on multiple text classification tasks, such as sentiment analysis and question classification. Later, this CNN architecture was extended by combining highway connections and LSTMs for language modeling problems [41]. Lai et al. [40] introduce a recurrent CNN model for text classification, in which the CNN constructs word representations while the recurrent structure captures contextual information of the text. Gehring et al. [42] enhance machine translation performance by using multi-layer CNN-based sequence generation models.

Although our method is partly influenced by such recent progress of CNNs in NLP tasks, it is novel to apply the CNN structure to the read/write operations of neural memory networks.

3 DATASETS

We develop our personalized image captioning approach with our newly collected InstaPIC-1.1M dataset and the YFCC100M benchmark dataset [1]. Table 1 outlines key statistics of the two datasets.

3.1 InstaPIC-1.1M

We make separate datasets for post completion and hashtag prediction as follows. We first collect image posts from Instagram, which is one of the fastest growing photo-sharing


social networks. As a post crawler, we use the built-in hashtag search function provided by Instagram APIs. We select 270 search keywords, which consist of the 10 most common hashtags for each of the 27 general categories of Pinterest (e.g. celebrities, design, education, food, drink, gardening, hair, health, fitness, history, humor, decor, outdoor, illustration, quotes, product, sports, technology, travel, wedding, tour, car, football, animal, pet, fashion and worldcup). We use the Pinterest categories because they are well-defined topics to obtain image posts of diverse users. We collect 3,455,021 raw posts from 17,813 users in total.

Next, we apply a series of filtering steps, because posts are often too short, highly noisy, or not written in English. We first apply language filtering to include only English posts; we exclude the posts where more than 20% of words are not in English based on the dictionary en.us dict of PyEnchant. We then remove the posts that embed hyperlinks in the body text because they are likely to be advertisements. Finally, if users have more than max(15, 0.15 × #user posts) non-English or advertisement posts, we remove all of their posts.

Then, we apply filtering rules for the lengths of captions and hashtags. We limit the maximum number of posts per user to 1,000, so as not to make the dataset biased toward a small number of dominant users. We also limit the minimum number of posts per user to 50, to be sufficiently large to discover users' writing patterns from posts. We also filter out the posts whose lengths are too short or too long. We set 15 as the maximum post length, based on the statistics of the dataset; we observe that lengthy posts tend to contain information irrelevant to the associated pictures. We set 3 as the minimum post length; too short posts are likely to include only an exclamation (e.g. great!) or a short reply (e.g. thanks to everyone!). We use the same rule for the hashtag dataset. We observe that lengthy lists of more than 15 hashtags are often too redundant (e.g. #fashionable, #fashionblog, #fashionista, #fashionistas, #fashionlover, #fashionlovers). Finally, we obtain about 721,176 posts for captions and 518,116 posts for hashtags.

Fig. 2. Cumulative distribution functions (CDFs) of the words for captions and hashtags in the InstaPIC-1.1M dataset. For captions/hashtags, the top 40K/60K most frequent words take 97.63%/84.31% of all the word occurrences of the dataset, respectively.

TABLE 1
Statistics of the InstaPIC-1.1M and YFCC100M datasets. We also show average and median (in parentheses) values. The total unique posts and users are (1,124,815 / 6,315) in the InstaPIC-1.1M dataset and (867,922 / 11,093) in the YFCC100M dataset.

InstaPIC-1.1M
  Dataset   # posts    # users   # posts/user   # words/post
  caption   721,176    4,820     149.6 (118)    8.55 (8)
  hashtag   518,116    3,633     142.6 (107)    7.45 (7)
YFCC100M
  caption   462,036    6,197     74.6 (40)      6.30 (5)
  hashtag   434,936    5,495     79.2 (49)      7.46 (6)

3.2 YFCC100M Dataset

The Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) [1] consists of 100 million Flickr user-uploaded images and videos between 2004 and 2014 (i.e. 99,206,564 images and 793,436 videos from 578,268 different Flickr users) along with their corresponding metadata, including titles, descriptions, camera types, and usertags. We regard the titles and descriptions as captions and the usertags as hashtags. We apply a series of filtering steps similar to those used for InstaPIC-1.1M. We first exclude the posts where more than 50% of words are not in English, and then remove all of a user's posts if the user has more than max(30, 0.15 × #user posts) non-English posts. We limit the maximum number of posts per user to 1,000 and the minimum number of posts per user to 30. We set 20 and 3 as the maximum and minimum post length, respectively. If both post titles and descriptions satisfy these rules, we include them in the dataset. As a result, we obtain about 462,036 posts for captions and 1,353,498 posts for hashtags. To balance the dataset sizes of captions and hashtags, we only consider 35% randomly sampled posts for hashtags, which finally amounts to 434,936 posts.

3.3 Preprocessing

We separately build a vocabulary dictionary V for each of the two tasks, by choosing the most frequent V words in the dataset. Based on the statistics in Figure 2, we set V of InstaPIC-1.1M to 40K for post completion and 60K for hashtag prediction. In the case of post completion, the top 40K most frequent words take 97.63% of all the word occurrences, indicating a sufficient coverage of the vocabulary used on Instagram. On the other hand, hashtags in Instagram show extremely diverse neologisms. Although 60K hashtags cover only 84.31% of the word usage, we set 60K as the dictionary size due to the very slow increase of the cumulative distribution function (CDF). Likewise, for YFCC100M, we set the size of dictionary V to 40K for post completion and 60K for hashtag prediction.

Before building the dictionary, we remove any URLs, special characters, and unicode characters except emojis. We then lowercase words and change usernames to an @username token. Once the dictionary is obtained, we remove out-of-dictionary words from the text.
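To make the preprocessing pipeline concrete, the following minimal Python sketch mirrors the steps described above (URL removal, the @username substitution, lowercasing, a frequency-based vocabulary, and dropping out-of-dictionary words). The regular expressions, the omission of special-character and emoji handling, and all function names are simplifications for this illustration, not a description of our released code.

import re
from collections import Counter

def normalize(text):
    # Strip URLs, map user mentions to a shared @username token, and lowercase.
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'@\w+', '@username', text)
    return text.lower().split()

def build_vocab(posts, size):
    # Keep the `size` most frequent tokens (40K for captions, 60K for hashtags).
    counts = Counter(tok for post in posts for tok in normalize(post))
    return set(tok for tok, _ in counts.most_common(size))

def apply_vocab(post, vocab):
    # Out-of-dictionary words are removed rather than replaced by an UNK token.
    return [tok for tok in normalize(post) if tok in vocab]

vocab = build_vocab(["great day at the beach http://t.co/x", "@alice thanks"], size=40000)
print(apply_vocab("thanks @bob great beach day", vocab))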


Fig. 3. Illustration of the context sequence memory network (CSMN) model. (a) The context memory is constructed using image descriptions and D frequent words from the query user's previous posts (section 4.1). (b) The model generates an output word at every step t based on the memory state, and the newly generated word is inserted into the word output memory (section 4.2).

4 THE CONTEXT SEQUENCE MEMORY NETWORK

Figure 3 illustrates the proposed context sequence memory network (CSMN) model, which consists of a recurrent model and context memory. Memory network models [16], [17] allow us to handle very long-term dependencies by easily reading and writing to parts of a memory component. The input is a query image I_q of a specific user, and the output is a sequence of words {y_t} = y_1, . . . , y_T, each of which is a symbol coming from the dictionary V. That is, {y_t} corresponds to a list of hashtags in hashtag prediction, and to a post sentence in post generation. Another input is the context information to be added to memory, such as the active vocabulary of a query user, which will be discussed in section 4.1.

Since both tasks can be formulated as word sequence prediction for a given image, we exploit the same CSMN model, only changing the dictionary. In particular, we also regard hashtag prediction as sequence prediction instead of prediction of a bag of orderless tag words. Although some previous papers (e.g. [29], [43]) formulate hashtag prediction as a ranking problem, we believe it is more advantageous to pose it as sequence generation for several reasons. First, since hashtags in a post tend to have strong co-occurrence relations, it is better to take previous hashtags into account to predict the next one. Second, users often write a list of hashtags according to their own habitual sequence patterns, which can be captured by sequence modeling. Third, models also need to decide the number of hashtags to be generated, according to the user's post history or the content of a given image, which can only be learned through sequence modeling. This will be validated by our experimental results.

4.1 Construction of Context Memory

We present what types of context information are stored in the memory. As in Figure 3(a), we store three types of context information: (i) image memory for the representation of a query image, (ii) user context memory for TF-IDF weighted D frequent words from the query user's previous posts, and (iii) word output memory for previously generated words. Following [16], [17], each input to the memory is embedded into an input and an output memory representation, for which we use superscripts a and c, respectively. The input memory representation is used to decide the attention weights over memory slots for a given query image, while the output memory representation is used to compute the memory readout vector.

Image Memory. We represent images using ResNet-101 [44] pretrained on the ImageNet 2012 dataset. We test two different descriptors: the (7 × 7) res5c feature map and the pool5 feature vector. The res5c feature map, denoted by I^{r5c} ∈ R^{2,048×7×7}, is useful if a model exploits spatial attention; otherwise, the pool5 feature I^{p5} ∈ R^{2,048} is used as a feature vector of the image. Hence, the pool5 feature is inserted into a single memory cell, while the res5c feature map occupies 49 cells, on which the memory attention can later focus on different regions of the (7 × 7) image grid. We will compare these two descriptors in the experiments.

The image memory vector m_{im} ∈ R^{1,024} for the res5c feature is represented by

m^a_{im,j} = ReLU(W^a_{im} I^{r5c}_j + b^a_{im}),   (1)
m^c_{im,j} = ReLU(W^c_{im} I^{r5c}_j + b^c_{im}),   (2)

for j = 1, . . . , 49. The parameters to be learned include W^{a,c}_{im} ∈ R^{1,024×2,048} and b^{a,c}_{im} ∈ R^{1,024}. ReLU indicates an element-wise ReLU activation [45]. For the pool5 feature, we use

m^{a/c}_{im,j} = ReLU(W^{a/c}_{im} I^{p5}_j + b^{a/c}_{im})   (3)

for j = 1. In Eq.(3), we simply present the two equations for the input and output memory as a single one using the superscript a/c. Below we derive the formulation assuming that we use the res5c feature, because it subsumes the pool5 feature.
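As a concrete illustration of Eqs. (1)-(3), the following PyTorch-style sketch maps ResNet-101 features to the 1,024-dimensional input (a) and output (c) image memory slots. It is a minimal stand-in for the formulation above, not our released implementation; the module and variable names are ours.

import torch
import torch.nn as nn

class ImageMemory(nn.Module):
    # Maps ResNet-101 features to 1,024-d input (a) and output (c) memory slots.
    def __init__(self, feat_dim=2048, mem_dim=1024):
        super().__init__()
        self.proj_a = nn.Linear(feat_dim, mem_dim)   # W_im^a, b_im^a
        self.proj_c = nn.Linear(feat_dim, mem_dim)   # W_im^c, b_im^c

    def forward(self, feat):
        # res5c: feat is (49, 2048), one slot per cell of the 7x7 grid.
        # pool5: feat is (1, 2048), a single memory slot.
        m_a = torch.relu(self.proj_a(feat))          # Eq. (1) / Eq. (3)
        m_c = torch.relu(self.proj_c(feat))          # Eq. (2) / Eq. (3)
        return m_a, m_c

res5c = torch.randn(49, 2048)       # a flattened (7x7) res5c feature map
m_a, m_c = ImageMemory()(res5c)     # both of shape (49, 1024)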

User Context Memory. In a personalized setting where the author of a query image is identifiable, we define {u_i}^D_{i=1} by selecting the D most frequent words from the user's previous posts. We input {u_i}^D_{i=1} into the user context memory in decreasing order of TF-IDF scores, in order to exploit the CNN later effectively. This context memory improves the model's performance by focusing more on the user's writing style of active vocabulary or hashtags. To build {u_i}^D_{i=1}, we compute TF-IDF scores and select the top-D words for a given user. Using TF-IDF scores means that we do not include overly general terms that many users commonly use, because they are not helpful for personalization. Finally, the user context memory vector m^{a/c}_{us} ∈ R^{1,024} becomes

u^a_j = W^a_e u_j,   u^c_j = W^c_e u_j,   j ∈ 1, . . . , D,   (4)
m^{a/c}_{us,j} = ReLU(W_h u^{a/c}_j + b_h),   (5)

where u_j is a one-hot vector for the j-th active word. Parameters include W^{a/c}_e ∈ R^{512×V} and W_h ∈ R^{1,024×512}. We use the same W_h for both input and output memory, while we learn separate word embedding matrices W^{a/c}_e.

Word Output Memory. As shown in Figure 3(b), we insert the series of previously generated words y_1, . . . , y_{t−1} into the word output memory, which is represented as

o^a_j = W^a_e y_j,   o^c_j = W^c_e y_j,   j ∈ 1, . . . , t − 1,   (6)
m^{a/c}_{ot,j} = ReLU(W_h o^{a/c}_j + b_h),   (7)

where y_j is a one-hot vector for the j-th previous word. We use the same word embeddings W^{a/c}_e and parameters W_h, b_h as the user context memory in Eq.(4). We update m^{a/c}_{ot,j} at every iteration whenever a new word is generated.

Finally, we concatenate the representations of all memory types: M^{a/c}_t = [m^{a/c}_{im,1} ⊕ · · · ⊕ m^{a/c}_{im,49} ⊕ m^{a/c}_{us,1} ⊕ · · · ⊕ m^{a/c}_{us,D} ⊕ m^{a/c}_{ot,1} ⊕ · · · ⊕ m^{a/c}_{ot,t−1}]. We use m to denote the memory size, which is the sum of the sizes of the three memory types: m = m_{im} + m_{us} + m_{ot}.

4.2 State-Based Sequence Generation

RNNs and their variants have been widely used for sequence generation via recurrent connections throughout time. However, our approach does not involve any RNN module; instead, it sequentially stores all of the previously generated words into the memory. This enables the prediction of each output word by selectively attending to the combinations of all previous words, image regions, and user context.

Input word. We discuss how to predict a word y_t at time step t based on the memory state (see Figure 3(b)). Letting y_{t−1} denote the one-hot vector of the previous word, we first generate an input vector q_t at time t for our memory network as

q_t = ReLU(W_q x_t + b_q),   where x_t = W^b_e y_{t−1},   (8)

where W^b_e ∈ R^{512×V} and W_q ∈ R^{1,024×512} are learned. Next, q_t is fed into the attention model of the context memory:

p_t = softmax(M^a_t q_t),   M^o_t(∗, i) = p_t ◦ M^c_t(∗, i),   (9)

where we compute how well the input vector q_t matches each cell of memory M^a_t by a matrix multiplication followed by a softmax. That is, p_t ∈ R^m indicates the compatibility of q_t over the m memory cells. Another interpretation is that p_t indicates which part of the input memory is important for input q_t at the current time step (i.e. to which part of memory the attention turns at time t [9]). Next, we rescale each column of the output memory representation M^c_t ∈ R^{m×1,024} by element-wise multiplication (denoted by ◦) with p_t ∈ R^m. As a result, we obtain the attended output memory representation M^o_t, which is decomposed into the three memory types as M^o_t = [m^o_{im,1:49} ⊕ m^o_{us,1:D} ⊕ m^o_{ot,1:t−1}].

We then apply a CNN to the attended memory output M^o_t. As will be shown in our experiments, using a CNN significantly boosts the captioning performance. This is mainly because the CNN allows us to obtain a set of expressive representations by fusing multiple heterogeneous cells with different filters. In section 4.3, we will further justify the intuition of why memory CNNs help enhance the captioning performance.

We test two different CNN types as follows: (i) a single layer with multiple kernel sizes [39], and (ii) multiple layers with a single kernel size [42].

Single-layer CNNs. We define a set of three filters with a depth of 300 and window sizes h = [3, 4, 5]. We separately apply a single convolutional layer and a max-pooling layer to each memory type. For h = [3, 4, 5],

c^h_{im,t} = maxpool(ReLU(w^h_{im} ∗ m^o_{im,1:49} + b^h_{im})),   (10)

where ∗ indicates the convolution operation. Parameters include biases b^h_{im} ∈ R^{49×300} and filters w^h_{im} ∈ R^{[3,4,5]×1,024×300}. Via max-pooling, each c^h_{im,t} is reduced from (300 × [47, 46, 45]) to (300 × [1, 1, 1]). Finally, we obtain c_{im,t} by concatenating c^h_{im,t} from h = 3 to 5. We repeat the convolution and max-pooling operation of Eq.(10) for the other memory types as well. As a result, we obtain c_t = [c_{im,t} ⊕ c_{us,t} ⊕ c_{ot,t}], whose dimension is 2,700 = 3 × 3 × 300.

Multi-layer CNNs. We define a multi-layer CNN by stacking several layers, each of which contains a convolution followed by a non-linearity. After thorough tests, we use 3-layer CNNs with a depth of 1,024 and a window size of h = 3, although our formulation is orthogonal to any choice of CNN structure. Each convolution filter is parameterized as w^l_{im} ∈ R^{3×1,024×(2×1,024)} and b^l_{im} ∈ R^{49×(2×1,024)}. We use the gated linear unit (GLU) [46] as the non-linearity, which implements a simple gating mechanism over the output of the convolution [c1 c2] ∈ R^{2×1,024}: v([c1 c2]) = c1 ◦ σ(c2) ∈ R^{1,024}, where σ is a sigmoid and ◦ is the element-wise multiplication. That is, the GLU activation halves the output of the convolution. We also add residual connections from the input of each convolution to the output of the layer. As a result, the output of the l-th layer c^l_{im} becomes

c^l_{im,t} = v(w^l_{im} ∗ c^{l−1}_{im,1:49} + b^l_{im}) + c^{l−1}_{im,t},   (11)

where c^1_{im,t} is initialized as m^o_{im,1:49}. We apply the convolution operation of Eq.(11) to the other memory types in the same way. As a result, we obtain c_t = [c^l_{im,t} ⊕ c^l_{us,t} ⊕ c^l_{ot,t}], whose dimension is 3,072 = 3 × 1,024.
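To make Eqs. (8)-(10) concrete, here is a hedged PyTorch sketch of one memory read for a single memory type: the query attends over the input memory, rescales the output memory, and a single-layer CNN with window sizes 3, 4, 5 and 300 filters is max-pooled into a readout. Shapes follow the text, but the function and variable names are ours and the code is illustrative rather than the released implementation.

import torch
import torch.nn.functional as F

def memory_read(q_t, M_a, M_c, convs):
    # q_t:   (1024,)    query vector from Eq. (8)
    # M_a:   (m, 1024)  input memory representation
    # M_c:   (m, 1024)  output memory representation
    # convs: Conv1d modules with window sizes h = 3, 4, 5
    p_t = torch.softmax(M_a @ q_t, dim=0)          # Eq. (9): attention over m slots
    M_o = p_t.unsqueeze(1) * M_c                   # Eq. (9): rescaled output memory
    x = M_o.t().unsqueeze(0)                       # (1, 1024, m) for Conv1d
    chunks = []
    for conv in convs:                             # Eq. (10), one conv per window size
        c_h = F.relu(conv(x))                      # (1, 300, m - h + 1)
        chunks.append(c_h.max(dim=2).values)       # max-pool over positions -> (1, 300)
    return torch.cat(chunks, dim=1)                # (1, 900) readout for this memory type

convs = [torch.nn.Conv1d(1024, 300, h) for h in (3, 4, 5)]
c_im = memory_read(torch.randn(1024), torch.randn(49, 1024), torch.randn(49, 1024), convs)

Applying the same readout to the user context and word output memories and concatenating the three results yields the 2,700-dimensional vector c_t used later in Eqs. (13)-(14).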

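A single layer of the multi-layer variant (Eq. (11)) can be sketched as a gated convolution with a residual connection. The sketch below assumes same-length padding so that the residual addition lines up, which the text does not specify; it illustrates the structure rather than reproducing the released code.

import torch
import torch.nn as nn

class GatedConvLayer(nn.Module):
    # One layer of the 3-layer memory CNN: conv to 2x1,024 channels, GLU gate, residual.
    def __init__(self, dim=1024, width=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, 2 * dim, width, padding=width // 2)

    def forward(self, c_prev):                 # c_prev: (1, 1024, m)
        a, b = self.conv(c_prev).chunk(2, dim=1)
        gated = a * torch.sigmoid(b)           # GLU halves the 2,048 channels back to 1,024
        return gated + c_prev                  # residual connection, Eq. (11)

layers = nn.Sequential(*[GatedConvLayer() for _ in range(3)])
c_l = layers(torch.randn(1, 1024, 49))         # attended image memory M_o as input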

Use of pool5 Instead of res5c Features. Above, we describe our model assuming that we use the res5c features I^{r5c} ∈ R^{2,048×7×7} for image representation. Here we discuss two formulation changes when we use the pool5 features I^{p5} ∈ R^{2,048}. First, Eq.(10) for the output of the image memory is changed to

c^h_{im,t} = ReLU(w^h_{im} m^o_{im,1} + b^h_{im}),   (12)

where b^h_{im} ∈ R^{1,800} and w^h_{im} ∈ R^{1,024×1,800}. After experiments, we find that adding the ReLU in Eq.(12) slightly improves performance.

Second, the memory output concatenation for pool5 features is changed to c_t = c_{im,t} + [c_{us,t} ⊕ c_{ot,t}], where the dimension of c_t is changed from 2,700 = 3 × 3 × 300 for res5c features to 1,800 = 2 × 3 × 300 for pool5 features.

Output word. After the CNNs are applied, we compute the output word probability s_t ∈ R^V as

h_t = ReLU(W_o c_t + b_o),   (13)
s_t = softmax(W_f h_t).   (14)

We obtain the hidden state h_t by Eq.(13) with a weight matrix W_o ∈ R^{2,700×2,700} and a bias b_o ∈ R^{2,700}. We then compute the output probability s_t over the vocabulary V by a softmax layer in Eq.(14).

Finally, we select the word that attains the highest probability, y_t = argmax_{s∈V}(s_t). Unless the output word y_t is the EOS token, we repeat generating the next word by feeding y_t into the word output memory in Eq.(6) and into the input of Eq.(8) at time step t + 1. As simple post-processing, only for hashtag prediction, we remove duplicate output hashtags. In summary, this inference is greedy in the sense that the model creates the best sequence by a sequential search for the best word at each time step.

4.3 Why Can Using the Memory CNN Be Helpful?

While conventional memory networks cannot model the structural ordering unless time embedding is added to the model (e.g. [16]), we propose to exploit the memory CNN to model the structural ordering, which results in stronger representation power. More specifically, for the word output memory, the CNN is useful to represent the sequential order of generated words. For the user context memory, the CNN can correctly capture the importance order of the context words, given that we store the user's frequent words in decreasing order of TF-IDF weighted scores (rather than putting them in a random order).

For example, as illustrated in Figure 4, suppose that two users have active words related to art and fashion in the user context memory, respectively. If the art-related words are at the top of the memory, street is joined with art, this user can be modeled as interested in art, and thereby the meaning of street can be interpreted similarly to street art. If the fashion-related words are at the top of the memory, the same word street is interpreted as street fashion. That is, the CNN encourages an n-gram effect; on the other hand, without the memory CNN, every single memory slot is accessed separately, and thus it is difficult to distinguish between these two different uses of street. Note that such an effect is also exploited in the other types of memory.

Fig. 4. An intuitive example of the user context memory showing why the memory CNN can be helpful for better captioning. Suppose that two users have street among their active words in the memory. User 1 has art-related words at the top of the memory, so street is joined with art, and the meaning of street can be interpreted similarly to street art. User 2 has fashion-related words in the memory, so the same word street is interpreted as street fashion. In summary, the CNN encourages an n-gram effect.

4.4 Training

We adopt teacher-forced learning [19], in which we provide the groundtruth words up to t − 1 to train the prediction at t for sequence learning. We use the softmax cross-entropy loss as the cost function, which minimizes the negative log likelihood from the estimated y_t to its corresponding target word y_{GT,t} at every time step. We randomly initialize all the parameters with a uniform unit scaling of factor 1.0: [±sqrt(3/dim)].

We apply mini-batch stochastic gradient descent. We select the Adam optimizer [47] with β_1 = 0.9, β_2 = 0.999, and ε = 1e−08. To speed up the training procedure, we use four GPUs for data parallelism, and set the batch size to 200 for each GPU. We obtain the best results when the initial learning rate is set to 0.001 for all the models. Every 5 epochs, we divide the learning rate by 1.2 to gradually decrease it. We train our models for up to 20 epochs.

Although our model can take variable-length sequences as input, to speed up training it is better for a minibatch to consist of sentences of the same length. Therefore, we randomly group training samples into a set of minibatches, each of which has the same length if possible. We then randomly shuffle the batch order so that short and long minibatches are mixed. We also use the curriculum learning proposed in [48], which empirically leads to better training.
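At test time, words are generated greedily as described at the end of section 4.2. The sketch below illustrates that loop; step() is only a stand-in for one full CSMN forward pass (Eqs. (8)-(14)), and the begin-of-sequence handling and maximum length are assumptions made for this illustration.

import torch

EOS, V, MAX_LEN = 0, 40000, 15

def step(prev_word, generated_words):
    # Stand-in for one CSMN forward pass; returns a distribution over the vocabulary.
    return torch.softmax(torch.randn(V), dim=0)

def greedy_decode(dedupe_hashtags=False):
    words, prev = [], EOS                     # start from a begin-of-sequence symbol
    for _ in range(MAX_LEN):
        probs = step(prev, words)
        y_t = int(probs.argmax())             # greedy: best word at each time step
        if y_t == EOS:
            break
        words.append(y_t)                     # fed back into the word output memory
        prev = y_t
    if dedupe_hashtags:                       # post-processing for hashtag prediction
        words = list(dict.fromkeys(words))    # remove duplicates, keep order
    return words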

5 EXPERIMENTS

We compare the performance of our approach with other state-of-the-art models via quantitative measures and Amazon Mechanical Turk (AMT) studies. We make public our source code and datasets at https://github.com/cesc-park/attend2u.

Fig. 5. Examples of post generation from InstaPIC-1.1M (top) and YFCC100M (bottom). In each set, we present a query image, groundtruth (GT), and generated posts by our method (Ours) and baselines. The @username token denotes an anonymized user. Most of the predicted texts are relevant and meaningful for the query images.

Fig. 6. Examples of hashtag prediction from InstaPIC-1.1M (top) and YFCC100M (bottom). In each set, we present a query image, groundtruth (GT), and generated posts by our method (Ours) and baselines. Most of the predicted texts are relevant and meaningful for the query images. Bold hashtags indicate correctly predicted ones (i.e. the hashtags that appear in both groundtruth and prediction).

5.1 Experimental Setting

We use the image of a test post as a query and the associated hashtags and text description as groundtruth (GT). As the evaluation metric for hashtag prediction, we compute the F1-score as a balanced average metric between precision and recall: F1 = 2 (1/precision + 1/recall)^{-1}.
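Assuming set-based matching between predicted and groundtruth hashtags, the F1 metric can be computed as in the short Python sketch below; the function name and the set-based matching are ours.

def hashtag_f1(predicted, groundtruth):
    # F1 = harmonic mean of precision and recall over hashtag sets.
    pred, gt = set(predicted), set(groundtruth)
    if not pred or not gt:
        return 0.0
    hits = len(pred & gt)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pred), hits / len(gt)
    return 2.0 / (1.0 / precision + 1.0 / recall)

hashtag_f1(['#vsco', '#sunset'], ['#vsco', '#sunset', '#pnw'])  # -> 0.8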


For the evaluation of post generation, we compute the language similarity between predicted sentences and GTs. We exploit BLEU [21], CIDEr [22], METEOR [23], and ROUGE-L [24] scores. In all measures, higher scores indicate better performance.

For InstaPIC-1.1M, we randomly split the dataset into 90% for training, 5K posts for test, and the rest for validation. For YFCC100M, we randomly split the dataset into 90% for training, 5% for validation, and 5% for test. We divide the dataset by users so that training and test users are disjoint, in order to measure the generalizability of the methods. If a user's posts exist in both training and test sets, then prediction can often be trivial by simply retrieving their closest posts in the training set.

While some benchmark datasets for image captioning (e.g. Flickr30K [49] and MS COCO [50]) have multiple GTs (e.g. five sentences per image in MS COCO), our dataset has only one GT text and hashtag list per test example. Hence, the absolute metric values in this work may be lower than those in these benchmark datasets.

5.2 Baselines

As baselines, we select multiple nearest neighbor approaches, one language generation algorithm, two state-of-the-art image captioning methods, and multiple variants of our model. As straightforward baselines, we first test the 1-nearest search by images, denoted by (1NN-Im); for a query image, we find its closest training image using the L2 distance on ResNet pool5 descriptors, and return its text as the prediction. Second, we test the 1-nearest search by users, denoted by (1NN-Usr); we find the nearest user whose 60 active words overlap the most with those of the query user, and then randomly select one post of the nearest user. The third nearest neighbor variant, denoted by (1NN-UsrIm), is to find the 1-nearest image among the nearest user's images, and return its text as the prediction.

As a language-only method, we use the sequence-to-sequence model by Vinyals et al. [20], denoted by (seq2seq). It is a recurrent neural network with three hidden LSTM layers, originally applied to language translation. This baseline takes the 60 active words of the query user in decreasing order of TF-IDF weights, and predicts captions. Since this baseline does not use an image for text generation, this comparison quantifies how important the image is for hashtag or text prediction.

We also compare with two state-of-the-art image captioning methods that have no personalization. The first baseline is (ShowTell) from [8], which is a multimodal CNN and LSTM model. The second baseline is the attention-based captioning model from [9], denoted by (AttendTell).

We compare different variants of our method (CSMN-*). To validate the contribution of each component, we exclude one of the key components from our model as follows: (i) without the memory CNN of section 4.2, denoted by (-NoCNN-), (ii) without the user context memory, denoted by (-NoUC-), and (iii) without feedback of previously generated words to the output memory, denoted by (-NoWO-). That is, the (-NoCNN-) quantifies the performance improvement by the use of the memory CNN. The (-NoUC-) is the model without personalization; it does not use the information about query users, such as their D active words. Finally, the (-NoWO-) is the model without sequential prediction. For hashtag prediction, the (-NoWO-) indicates the performance of separate tag generation instead of the sequential prediction of our original proposal. We test the two different image descriptors of section 4.1, (7 × 7) res5c feature maps and pool5 feature vectors, denoted by (-R5C) and (-P5) respectively. We also test the multi-layer CNNs for memory as described in section 4.2, denoted by (-Mul). We evaluate the effects of the size of the user context memory: (-W20-) and (-W80-) or (-W100-). Finally, we test the effect of beam search with beam sizes of [3, 5, 7], denoted by (-B3), (-B5), and (-B7), respectively.

5.3 Quantitative Results

Tables 2 and 3 summarize the quantitative results of post generation and hashtag prediction, respectively. Since the algorithms show similar patterns in both tasks, we analyze the experimental results together below.

First of all, according to most metrics in both tasks, our approach (CSMN-*) significantly outperforms the baselines. We can divide the algorithms into two groups, with and without personalization; the latter includes (ShowTell), (AttendTell), (1NN-Im), and (CSMN-NoUC-P5), while the former comprises the other methods. Our (CSMN-NoUC-P5) ranks first among the methods with no personalization, while the (CSMN-P5) achieves the best overall.

We summarize other interesting observations as follows. First, among the baselines, the simple nearest neighbor approach (1NN-UsrIm) turns out to be the strongest candidate. Second, our approach becomes significantly worse if we remove one of the key components, such as the memory CNN, personalization, or sequential prediction. Third, among the tested memory sizes of the user context, the best performance is obtained with 60. With larger memory sizes, attention learning becomes harder. Moreover, we choose the size of 60 (40 for YFCC100M) based on the statistics of our dataset; with too large a size, there are many empty slots, which also make attention learning difficult. Fourth, interestingly, for image description, using pool5 features occupying only a single memory slot leads to better performance than using (7 × 7) res5c feature maps with 49 slots. This is mainly because attention learning quickly becomes harder with a much larger dimension of image representation. Another reason could be that users do not tend to discuss the level of detail of individual (7 × 7) image grids, and thus a holistic view of the image content is often sufficient for the prediction of users' posts. Fifth, beam search does not improve the performance in either task. The reason for the poor performance of beam search might be similar to that observed in [51], which reports that beam search is successful for producing generic, repetitive, high-level image descriptions (e.g. This is a picture of a dog), but poor for story generation. That is, our personalized post prediction is more similar to subjective, specific story generation than to generic, high-level image captioning. Sixth, post generation is more challenging than hashtag prediction. This is because the expression space of post generation is


much larger, since the posts' text includes any combinations of words, emojis, and symbols. Finally, the average length of the hashtag lists generated by our model is 5.95, which is shorter than the average GT length of 8.71. This is mainly because we adopt early stopping during training to avoid overfitting. We choose F1 scores as the main evaluation metric, and the F1 scores are penalized severely when the model predicts wrong hashtags. Therefore, our model is trained to prefer shorter hashtag lists compared to the GT answers.

TABLE 2
Evaluation of post generation for the InstaPIC-1.1M and YFCC100M datasets. As performance measures, we use language similarity metrics (BLEU, CIDEr, METEOR, ROUGE-L). The methods with [∗] use no personalization.

InstaPIC-1.1M
  Methods             B-1    B-2    B-3    B-4    METEOR  CIDEr  ROUGE-L
  (seq2seq) [20]      0.050  0.012  0.003  0.000  0.024   0.034  0.065
  (ShowTell)∗ [8]     0.055  0.019  0.007  0.003  0.038   0.004  0.081
  (AttendTell)∗ [9]   0.106  0.015  0.000  0.000  0.026   0.049  0.140
  (1NN-Im)∗           0.071  0.020  0.007  0.004  0.032   0.059  0.069
  (1NN-Usr)           0.063  0.014  0.002  0.000  0.028   0.025  0.059
  (1NN-UsrIm)         0.106  0.032  0.011  0.005  0.046   0.084  0.104
  (CSMN-NoCNN-P5)     0.086  0.037  0.015  0.000  0.037   0.103  0.122
  (CSMN-NoUC-P5)∗     0.079  0.032  0.015  0.008  0.037   0.133  0.120
  (CSMN-NoWO-P5)      0.090  0.040  0.016  0.006  0.037   0.119  0.116
  (CSMN-R5C)          0.097  0.034  0.013  0.006  0.040   0.107  0.110
  (CSMN-P5)           0.171  0.068  0.029  0.013  0.064   0.214  0.177
  (CSMN-P5-Mul)       0.145  0.049  0.022  0.009  0.049   0.145  0.143
  (CSMN-W20-P5)       0.116  0.041  0.018  0.007  0.044   0.119  0.123
  (CSMN-W100-P5)      0.109  0.037  0.015  0.007  0.042   0.109  0.112
  (CSMN-P5-B3)        0.128  0.049  0.019  0.008  0.052   0.124  0.128
  (CSMN-P5-B5)        0.132  0.051  0.019  0.008  0.054   0.123  0.130
  (CSMN-P5-B7)        0.129  0.050  0.020  0.009  0.054   0.126  0.128
YFCC100M
  (seq2seq) [20]      0.076  0.010  0.000  0.000  0.034   0.069  0.066
  (ShowTell)∗ [8]     0.027  0.003  0.000  0.000  0.024   0.003  0.043
  (AttendTell)∗ [9]   0.088  0.010  0.001  0.000  0.034   0.076  0.116
  (1NN-Im)∗           0.033  0.006  0.002  0.000  0.020   0.063  0.043
  (1NN-Usr)           0.032  0.003  0.001  0.000  0.016   0.028  0.041
  (1NN-UsrIm)         0.039  0.005  0.001  0.000  0.021   0.050  0.050
  (CSMN-P5)           0.106  0.034  0.012  0.004  0.033   0.064  0.099
  (CSMN-P5-Mul)       0.116  0.036  0.010  0.003  0.036   0.060  0.111

TABLE 3
Evaluation of hashtag prediction (F1 score). We show test results for the split by users in the left column and the split by posts in the right column.

InstaPIC-1.1M
  Methods             split by users   split by posts
  (seq2seq) [20]      0.132            0.085
  (ShowTell)∗ [8]     0.028            0.011
  (AttendTell)∗ [9]   0.020            0.014
  (1NN-Im)∗           0.049            0.110
  (1NN-Usr)           0.054            0.173
  (1NN-UsrIm)         0.109            0.380
  (CSMN-NoCNN-P5)     0.135            0.310
  (CSMN-NoUC-P5)∗     0.111            0.076
  (CSMN-NoWO-P5)      0.117            0.244
  (CSMN-R5C)          0.192            0.340
  (CSMN-P5)           0.230            0.390
  (CSMN-P5-Mul)       0.320            0.270
  (CSMN-W20-P5)       0.147            0.349
  (CSMN-W80-P5)       0.135            0.341
  (CSMN-P5-B3)        0.224            0.386
  (CSMN-P5-B5)        0.220            0.378
  (CSMN-P5-B7)        0.218            0.372
YFCC100M
  (seq2seq) [20]      0.112            0.337
  (ShowTell)∗ [8]     0.047            0.047
  (AttendTell)∗ [9]   0.069            0.313
  (1NN-Im)∗           0.020            0.177
  (1NN-Usr)           0.029            0.353
  (1NN-UsrIm)         0.065            0.609
  (CSMN-P5)           0.229            0.335
  (CSMN-P5-Mul)       0.240            0.348

Fig. 7. Three examples of hashtag prediction and two examples of post prediction with query images and multiple predictions by different users (shown in different colors). Predicted results vary according to the query users, but are still relevant and meaningful for the query images.

Fig. 8. Six examples of hashtag prediction for a single user, whose posts are mostly about design. (a) For design-related query images, our CSMN predicts hashtags relevant to the design topic. (b) For query images of substantially different topics, our CSMN is also resilient enough to predict meaningful hashtags for the images.


This is mainly because we adopt early stopping during training to avoid overfitting. We choose the F1 score as the main evaluation metric, and F1 is penalized severely when the model predicts wrong hashtags; therefore, our model learns to prefer shorter hashtag lists than the GT answers.
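To make the hashtag evaluation concrete, the following minimal Python sketch computes the per-example hashtag F1 and averages it over a test set. It is an illustration under our own naming (hashtag_f1 and mean_f1 are hypothetical helpers), not the released attend2u evaluation code. Because every wrong hashtag lowers precision, padding the output with extra guesses is penalized, which is consistent with the shorter predicted hashtag lists noted above.

def hashtag_f1(predicted, ground_truth):
    # F1 between a predicted and a GT hashtag list, treated as sets.
    pred, gt = set(predicted), set(ground_truth)
    if not pred or not gt:
        return 0.0
    overlap = len(pred & gt)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gt)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, references):
    # Average per-example F1 over the whole test set.
    scores = [hashtag_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

# Toy usage: an extra wrong hashtag lowers precision and therefore F1.
print(hashtag_f1(["#ipa", "#craftbeer"], ["#ipa", "#craftbeer", "#beer"]))          # 0.8
print(hashtag_f1(["#ipa", "#craftbeer", "#cat"], ["#ipa", "#craftbeer", "#beer"]))  # ~0.667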
Given that Instagram provides a function for hashtag automation from previous posts when writing a new post, we test another dataset split for hashtag prediction. That is, we divide the dataset by posts so that each user’s posts are included in both the training and test sets. We call this the split by posts, and the original split the split by users. Table 3 shows the results of the split by users on the left and the split by posts on the right. We observe that, due to the automation function, many posts in our training and test sets have almost identical hashtags. This setting is highly favorable for (1NN-UsrIm), which returns the hashtags of the closest training image of the query user. Interestingly, our (CSMN-P5) works better than (1NN-UsrIm) even in the setting of split by posts, although its performance margin (i.e., 0.01 in F1 score) is not as significant as in the split by users (i.e., 0.121). While our (CSMN-P5) is the best in the split by posts, (CSMN-P5-Mul) is overall the best in the split by users. Given that the test and training data in the split by posts are more similar to one another than in the split by users, we can conclude that (CSMN-P5-Mul) generalizes better than (CSMN-P5).
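The two dataset splits can be sketched as follows. This is our own illustration with hypothetical field names ("user", "hashtags"), not the actual InstaPIC-1.1M preprocessing script: the split by users holds out entire users, whereas the split by posts holds out a fraction of each user's posts so that every user appears in both training and test sets.

import random

def split_by_users(posts, test_ratio=0.2, seed=0):
    # Hold out whole users: no test user ever appears in training.
    users = sorted({p["user"] for p in posts})
    rng = random.Random(seed)
    rng.shuffle(users)
    test_users = set(users[: int(len(users) * test_ratio)])
    train = [p for p in posts if p["user"] not in test_users]
    test = [p for p in posts if p["user"] in test_users]
    return train, test

def split_by_posts(posts, test_ratio=0.2, seed=0):
    # Hold out a fraction of each user's posts: every user is seen during training.
    rng = random.Random(seed)
    by_user = {}
    for p in posts:
        by_user.setdefault(p["user"], []).append(p)
    train, test = [], []
    for user_posts in by_user.values():
        rng.shuffle(user_posts)
        k = max(1, int(len(user_posts) * test_ratio))
        test.extend(user_posts[:k])
        train.extend(user_posts[k:])
    return train, test

# posts is assumed to be a list of dicts, e.g. {"user": "u1", "hashtags": ["#art", ...]}.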
We also summarize notable results on YFCC100M as follows. First, compared to InstaPIC-1.1M, all methods show lower prediction accuracies for caption generation. Its main cause may be the way of data collection: while InstaPIC-1.1M is collected from the 10 most common hashtags for each of 27 Pinterest categories, YFCC100M includes all the Flickr posts between 2004 and 2014. Thus, the topics of YFCC100M are much broader than those of InstaPIC-1.1M, which makes it more challenging to predict correct captions for YFCC100M. Second, in the split by users for hashtag prediction, our approach (CSMN-*) significantly outperforms all the baselines. Third, in the split by posts, the baseline (1NN-UsrIm) attains the best results, mainly because of the severe variety of topics in the YFCC100M data, where a nonparametric nearest-neighbor method can perform better than parametric models.
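For reference, the (1NN-UsrIm) baseline described above can be sketched as a per-user nearest-neighbor lookup in image-feature space. The data structures and the Euclidean distance below are assumptions for illustration, not the exact baseline implementation.

import numpy as np

def one_nn_usrim(query_feature, query_user, train_posts):
    # train_posts: list of dicts with "user", "feature" (1-D np.ndarray), and "hashtags".
    # Returns the hashtags of the query user's nearest training image.
    candidates = [p for p in train_posts if p["user"] == query_user]
    if not candidates:
        return []
    feats = np.stack([p["feature"] for p in candidates])
    dists = np.linalg.norm(feats - query_feature, axis=1)  # distance to each candidate image
    return candidates[int(np.argmin(dists))]["hashtags"]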
It is worth noting that the absolute metric values in this work may be lower than those reported on standard benchmark datasets. Most benchmark datasets for image captioning (e.g., Flickr30K [49] and MS COCO [50]) contain a limited number of objects in the images, and multiple groundtruth (GT) sentences per image. For example, the MS COCO dataset contains 91 object categories and five GT sentences per image. On the other hand, our dataset is extremely diverse and has only a single GT post and hashtag list per test example. Such large diversity of our datasets may be the main reason why the measured language metrics, which require exact word matches, are quite low. A similar tendency can also be found in other highly diverse datasets with a single GT sentence; one typical example is LSMDC [52], a movie-based video captioning dataset. As shown in its leaderboard (https://competitions.codalab.org/competitions/6121#results), the best performing algorithm, named (ELTanque), attains 0.007 in BLEU-4, 0.055 in METEOR, 0.139 in ROUGE-L, and 0.110 in CIDEr-D.

5.4 Human Evaluation via Amazon Mechanical Turk
We perform two types of user studies to complement the limitations of the automatic language metrics of the previous section for evaluating the quality of generated sentences. We include actual examples of the AMT surveys in the supplementary material.

User Preferences. We perform AMT tests to observe general users’ preferences between different algorithms for the two post automation tasks. For each task, we randomly sample 100 test examples. At test time, we show a query image and three randomly sampled complete posts of the query user as a personalization clue, along with two text descriptions, generated by our method and by one baseline, in a random order. We ask turkers to choose the one that is more relevant to the query image and the query user, and we obtain answers from three different turkers for each query. We select the variant (CSMN-P5) as the representative of our method because of its best quantitative performance. We compare with three baselines by selecting the best method in each group of 1NNs, image captioning, and language-only methods: (1NN-UsrIm), (ShowTell), and (seq2seq).

Table 4 summarizes the results of the AMT tests, which validate that human annotators significantly prefer our results to those of the baselines. Among the baselines, (1NN-UsrIm) is preferred the most, given that its performance gap with our approach is the smallest. These results coincide with those of the quantitative evaluation in Tables 2 and 3. Another reason may be that the nearest-neighbor approach always retrieves grammatically correct text, which may be more favorable to general turkers.

Sentence Quality. We perform another, more in-depth AMT test to evaluate several aspects of sentence generation, using a setting similar to that proposed in [53]. We use the same test set as in the user preference test. We ask turkers to rate the quality of each output sentence on three aspects: (i) Plausibility: is the generated post plausible for the given image? (ii) Grammaticality: is the post grammatically correct? (iii) Relevance: does the post have a style similar to the user’s previous posts? Each item is scored from 1 to 3, where 1 means bad, 2 acceptable, and 3 good.

Table 5 summarizes the results of this AMT test, which validate that human annotators assign significantly higher scores to our results than to those of the baselines. Interestingly, in grammaticality our model even earns higher scores than the (1NN-UsrIm) baseline, which simply retrieves the nearest training sentence; this indicates that the sequence generation capability of our model is strong enough to serve as a language model.
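As a small illustration of how the human-evaluation numbers reported in Tables 4 and 5 are tallied (hypothetical bookkeeping in plain Python, not the actual AMT pipeline), the pairwise preference percentages and the mean/standard-deviation ratings can be computed as:

from collections import Counter
from statistics import mean, stdev

def preference_rate(votes):
    # votes: list of "ours" / "baseline" strings, one per turker response.
    counts = Counter(votes)
    total = len(votes)
    wins = counts["ours"]
    return 100.0 * wins / total, wins, total

def rating_summary(scores):
    # scores: list of 1-3 integer ratings for one aspect (e.g., plausibility).
    return mean(scores), stdev(scores)

# 100 examples x 3 turkers = 300 responses per method pair.
votes = ["ours"] * 201 + ["baseline"] * 99
print(preference_rate(votes))                    # (67.0, 201, 300)
print(rating_summary([2, 3, 2, 3, 3, 2, 1, 3]))  # mean rating and its standard deviation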


TABLE 4
AMT preference results for the two tasks between our methods and three baselines for the InstaPIC-1.1M dataset. We show the percentages of responses in which turkers vote for our approach over the baselines.

Hashtag Prediction
vs. Baselines      (1NN-UsrIm)      (ShowTell)       (seq2seq)
(CSMN-P5)          67.0 (201/300)   88.0 (264/300)   81.3 (244/300)
Post Generation
vs. Baselines      (1NN-UsrIm)      (ShowTell)       (seq2seq)
(CSMN-P5)          73.0 (219/300)   78.0 (234/300)   81.3 (244/300)

TABLE 5
AMT test results for plausibility, grammaticality, and relevance of generated posts by our method (CSMN-P5) and three baselines for the InstaPIC-1.1M dataset. The score of each item ranges from 1 to 3. We report averages and standard deviations (in parentheses).

Post Generation
Methods        Plausibility   Grammaticality   Relevance
(1NN-UsrIm)    2.13 (0.55)    2.36 (0.54)      2.02 (0.58)
(ShowTell)     2.26 (0.52)    2.34 (0.69)      2.12 (0.56)
(seq2seq)      1.98 (0.62)    2.10 (0.66)      1.88 (0.60)
(CSMN-P5)      2.36 (0.58)    2.43 (0.56)      2.25 (0.60)

5.5 Qualitative Results
Figure 5 illustrates selected examples of post generation. In each set, we show a query image, the GT, and the text descriptions generated by our method and the baselines. In many of the Instagram examples, the GT comments are hard to predict correctly, because they are extremely diverse, subjective, and private conversations over a variety of topics. Nonetheless, most of the predicted text descriptions are relevant to the query images. Moreover, our CSMN model can appropriately use normal words, emojis, and even mentions of other users (anonymized by @username). Figure 6 shows examples of hashtag prediction. We observe that our hashtag prediction is robust even across a variety of topics, including profiles, food, fashion, and interior design.

Figure 7 shows examples of how much the hashtag and post predictions vary across different users for the same query images. Although the predicted results change in terms of word usage and focused topics, all of the results are relevant and meaningful to the query images. In the fourth example of Figure 7, the first user focuses on family while the third user focuses on the rugby game to generate captions. Figure 8 illustrates the variation of text predictions according to query images for the same user. We first select a user most of whose posts are about design, and then obtain predictions by changing the query images. For design-related query images, the CSMN correctly predicts hashtags relevant to the design topic (Figure 8(a)). For query images of substantially different topics, our CSMN remains robust and still predicts relevant hashtags for the images (Figure 8(b)).

5.6 Attention Visualization
Figure 9 shows three examples of user context memory attention for hashtag prediction. In other words, they visualize which cells of the context memory are attended during three time steps from t1 to t3. Each example presents a query image along with the GT and predicted hashtags by our method, and the evolution of attention diagrams, where the words along the x-axis are the top-5 most attended words in the memory cells at each time step, with darker colors indicating more strongly attended memory cells. In the first figure, our model generates #ipa #craftbeer #ballastpoint from t1 to t3, with attention on saison, spareribs, paleale, calgary, ballastpoint in the memory, where the generated words and the strongly attended words are both related to the beer topic. Similarly, in the second figure, our model produces #actionfigures #mastersofuniverse #toys #hoardworld with attention on batman, blogs, hoardax, actionfigures, allstarcomicsmelbourne, dc, where #mastersofuniverse is a comic name, #hoardworld is a famous comics blog name, and #dc is a popular comics publisher. The third figure is another interesting example in that our model generates #nz #kiwipics #nzmustdo #newzealand with attention on bestvacations, thegreatoutdoors, snorkeling, nzmustdo, newzealandfinds, nz, where #nz is an abbreviation of New Zealand and #kiwipics is a frequently used hashtag for pictures taken in New Zealand. Note that the output hashtags do not necessarily coincide with the most attended words in the user context memory, although they are likely to be correlated.

6 CONCLUSIONS
We proposed the context sequence memory network (CSMN) as a first personalized image captioning approach. We addressed two post automation tasks: hashtag prediction and post generation. With quantitative evaluation and AMT user studies on the newly collected InstaPIC-1.1M dataset and the standard benchmark YFCC100M dataset, we showed that our CSMN approach outperformed other state-of-the-art captioning models.

There are several promising future directions that go beyond the current paper. First, we can extend the model to incorporate other metadata such as location, gender, or time to generate more personalized sentences. Second, we can extend the CSMN model to another interesting related task such as post commenting, which generates a thread of replies for a given post. Third, since we dealt with only Instagram and Flickr posts in this work, we can explore data from other social networks such as Pinterest and Tumblr, which have different post types, metadata, and text and hashtag usages.

ACKNOWLEDGMENTS
This research is partially supported by Kakao, Kakao Brain, and the Brain Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (2017M3C7A1047860). Gunhee Kim is the corresponding author. The code and dataset are available at https://github.com/cesc-park/attend2u.

REFERENCES
[1] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “YFCC100M: The New Data in Multimedia Research,” in CACM, 2016.
[2] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term Recurrent Convolutional Networks for Visual Recognition and Description,” in CVPR, 2015.
[3] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt, L. Zitnick, and G. Zweig, “From Captions to Visual Concepts and Back,” in CVPR, 2015.
[4] A. Karpathy and L. Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” in CVPR, 2015.
[5] R. Kiros, R. Salakhutdinov, and R. Zemel, “Multimodal Neural Language Models,” in ICML, 2014.
[6] V. Ordonez, G. Kulkarni, and T. L. Berg, “Im2Text: Describing Images Using 1 Million Captioned Photographs,” in NIPS, 2011.
[7] C. C. Park and G. Kim, “Expressing an Image Stream with a Sequence of Natural Sentences,” in NIPS, 2015.
[8] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge,” in IEEE TPAMI, 2016.
[9] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,” in ICML, 2015.

(Figure 9 content: for each of the three query images, the figure lists the GT hashtags, the hashtags predicted by our model over the decoding steps (t1 #ipa, t2 #craftbeer, t3 #ballastpoint; t1 #actionfigures, t2 #mastersoftheuniverse, t3 #toys, t4 #hoardworld; t1 #nz, t2 #kiwipics, t3 #nzmustdo, t4 #newzealand), and the top attended words in the user context memory at each step.)

Fig. 9. Three examples of user context memory attention for hashtag prediction. We show the query image along with the groundtruth and predicted
hashtags by our method. On the right, we present the evolution of attention diagrams for three time steps from t1 to t3 , where the words along
the x-axis are the top-5 most attended words in the memory cells at each time step, with darker colors indicating more strongly attended memory
cells. Note that output hashtags do not necessarily coincide with the most attended words in the user context memory, although they are likely to
be correlated.
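To make the attention diagrams of Figure 9 concrete, the top-k most attended user context memory words at each decoding step can be extracted from an attention-weight matrix as in the following NumPy sketch. The array shapes and names are assumptions for illustration rather than the exact tensors of the CSMN implementation.

import numpy as np

def top_attended_words(attention, memory_words, k=5):
    # attention: (T, M) array of attention weights over M memory slots for T decoding steps.
    # memory_words: list of the M words stored in the user context memory.
    # Returns, for each step, the k most attended (word, weight) pairs.
    tops = []
    for t in range(attention.shape[0]):
        idx = np.argsort(attention[t])[::-1][:k]  # indices of the k largest weights
        tops.append([(memory_words[i], float(attention[t, i])) for i in idx])
    return tops

# Toy example with 3 decoding steps and a 6-slot memory.
words = ["saison", "spareribs", "paleale", "calgary", "ballastpoint", "ferry"]
att = np.random.default_rng(0).random((3, len(words)))
att = att / att.sum(axis=1, keepdims=True)        # normalize each step's weights to sum to 1
for t, entries in enumerate(top_attended_words(att, words, k=5), start=1):
    print(f"t{t}:", [w for w, _ in entries])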

[10] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image Captioning with Semantic Attention,” in CVPR, 2016.
[11] C. Fan and D. Crandall, “DeepDiary: Automatic Caption Generation for Lifelogging Image Streams,” in ECCV EPIC, 2016.
[12] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White et al., “VizWiz: Nearly Real-Time Answers to Visual Questions,” in UIST, 2010.
[13] K. Ramnath, S. Baker, L. Vanderwende, M. El-Saban, S. N. Sinha, A. Kannan, N. Hassan, M. Galley, Y. Yang, and D. Ramanan, “AutoCaption: Automatic Caption Generation for Personal Photos,” in WACV, 2014.
[14] S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston, “Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too?” in arXiv:1801.07243, 2018.
[15] A. Graves, G. Wayne, and I. Danihelka, “Neural Turing Machines,” in arXiv:1410.5401, 2014.
[16] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-End Memory Networks,” in NIPS, 2015.
[17] J. Weston, S. Chopra, and A. Bordes, “Memory Networks,” in ICLR, 2015.
[18] I. Sutskever, O. Vinyals, and Q. Le, “Sequence to Sequence Learning with Neural Networks,” in NIPS, 2014.
[19] R. J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” in NIPS, 1989.
[20] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, “Grammar as a Foreign Language,” in NIPS, 2015.
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a Method for Automatic Evaluation of Machine Translation,” in ACL, 2002.
[22] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” in CVPR, 2015.
[23] S. Banerjee and A. Lavie, “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments,” in ACL, 2005.
[24] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in WAS, 2004.
[25] C. C. Park, B. Kim, and G. Kim, “Attend to You: Personalized Image Captioning with Context Sequence Memory Networks,” in CVPR, 2017.
[26] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep Reinforcement Learning-based Image Captioning with Embedding Reward,” in CVPR, 2017.
[27] S. Venugopalan, L. A. Hendricks, and M. Rohrbach, “Captioning Images with Diverse Objects,” in CVPR, 2017.
[28] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning,” in CVPR, 2017.
[29] E. Denton, J. Weston, M. Paluri, L. Bourdev, and R. Fergus, “User Conditional Hashtag Prediction for Images,” in KDD, 2015.
[30] P.-L. Hsieh, C. Ma, J. Yu, and H. Li, “Unconstrained Realtime Facial Performance Capture,” in CVPR, 2015.
[31] T. Almaev, B. Martinez, and M. Valstar, “Learning to Transfer: Transferring Latent Task Structures and Its Application to Person-specific Facial Action Unit Detection,” in ICCV, 2015.
[32] A. Yao, L. Van Gool, and P. Kohli, “Gesture Recognition Portfolios for Personalization,” in CVPR, 2014.
[33] W. Kienzle and K. Chellapilla, “Personalized Handwriting Recognition via Biased Regularization,” in ICML, 2006.
[34] O. Polozov, E. O’Rourke, A. M. Smith, L. Zettlemoyer, S. Gulwani, and Z. Popovic, “Personalized Mathematical Word Problem Generation,” in IJCAI, 2015.
[35] S. Mirkin, S. Nowson, C. Brun, and J. Perez, “Motivating Personality-aware Machine Translation,” in EMNLP, 2015.
[36] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou et al., “Hybrid Computing Using a Neural Network with Dynamic External Memory,” in Nature, 2016.
[37] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing,” in ICML, 2016.
[38] A. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, “Key-Value Memory Networks for Directly Reading Documents,” in EMNLP, 2016.
[39] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in EMNLP, 2014.
[40] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent Convolutional Neural Networks for Text Classification,” in AAAI, 2015.
[41] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-Aware Neural Language Models,” in AAAI, 2016.
[42] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional Sequence to Sequence Learning,” in ICML, 2017.
[43] J. Weston, S. Chopra, and K. Adams, “TagSpace: Semantic Embeddings from Hashtags,” in EMNLP, 2014.
[44] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
[45] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in ICML, 2010.
[46] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling with Gated Convolutional Networks,” in ICML, 2017.
[47] D. P. Kingma and J. L. Ba, “ADAM: A Method for Stochastic Optimization,” in ICLR, 2015.
[48] W. Zaremba and I. Sutskever, “Learning to Execute,” in arXiv:1410.4615, 2014.

[49] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions,” in TACL, 2014.
[50] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in ECCV, 2014.
[51] T. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. B. Girshick, X. He, P. Kohli, D. Batra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, and M. Mitchell, “Visual Storytelling,” in NAACL HLT, 2016.
[52] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele, “Movie Description,” in IJCV, 2017.
[53] K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang, “Generating Sentences by Editing Prototypes,” in arXiv:1709.08878, 2017.

Cesc Chunseong Park is currently working as a researcher at Lunit Inc. He received the M.S. degree from Computer Science and Engineering at Seoul National University in 2017. He is a recipient of the 2016 Naver MS Fellowship and NVIDIA Deep Learning Contest Awards (Korea). His research interests span the intersection of deep learning, computer vision and natural language. In particular, he is interested in defining and solving new problems and data rather than standard tasks and data.

Byeongchang Kim is an M.S. student in Computer Science and Engineering at Seoul National University. Before that, he received the B.S. degree in Biosystems Engineering & Computer Science and Engineering at Seoul National University. His research interests include deep learning, computer vision and natural language understanding.

Gunhee Kim received the BS degree in mechanical engineering from the Korea Advanced Institute of Science and Technology (KAIST), the MS degree from the Robotics Institute, CMU, and the PhD degree from the Computer Science Department, Carnegie Mellon University, in 2013. He has been an assistant professor in computer science and engineering at Seoul National University since 2015. Prior to that, he was a postdoctoral researcher at Disney Research Pittsburgh for one and a half years. His research interests include computer vision, machine learning, data mining, and robotics. He received the 2014 ACM SIGKDD Doctoral Dissertation Award and the 2015 Naver New Faculty Award.
