
Audio-Visual Content Description for Video Genre Classification in the Context of Social Media

Bogdan Ionescu(1,3), Klaus Seyerlehner(2), Constantin Vertan(1), Patrick Lambert(3)

(1) LAPI - University Politehnica of Bucharest, 061071 Bucharest, Romania
{bionescu,cvertan}@alpha.imag.pub.ro
(2) DCP - Johannes Kepler University, A-4040 Linz, Austria
klaus.seyerlehner@gmail.com
(3) LISTIC - Polytech Annecy-Chambery, B.P. 80439, 74944 France
patrick.lambert@univ-savoie.fr

ABSTRACT

In this paper we address automatic video genre classification with descriptors extracted from both the audio (block-based features) and the visual (color- and temporal-based) modalities. Tests performed on 26 genres from the blip.tv media platform prove the potential of these descriptors for this task.

Categories and Subject Descriptors

I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—audio, color and action descriptors; I.5.3 [Pattern Recognition]: Clustering—video genre.

Keywords

block-based audio features, color perception, action content, video genre classification.

1. INTRODUCTION

In this paper we address the issue of automatic video genre classification in the context of social media platforms, as part of the MediaEval 2011 Benchmarking Initiative for Multimedia Evaluation (see http://www.multimediaeval.org/). The challenge is to provide solutions for distinguishing between up to 26 common genres, like "art", "autos", "business", "comedy", "food and drink", "gaming", and so on [2]. Validation is carried out on video footage from the blip.tv media platform (see http://blip.tv/).

We approach this task globally, from the classification point of view, and focus on the feature extraction step. For a state of the art of the literature see [1]. In our approach, we extract information from both the audio and the visual modalities. While these sources of information have already been exploited for genre classification, the novelty of our approach lies in the content descriptors we use.

2. VIDEO CONTENT DESCRIPTION

Audio descriptors. Most common video genres tend to have very specific audio signatures, e.g. music clips contain music, sports feature a specific crowd noise, etc. To address this specificity, we propose audio descriptors related to rhythm, timbre, onset strength, noisiness and vocal aspects. The proposed audio features are block-level based, which compared to classic approaches has the advantage of capturing local temporal information by analyzing sequences of consecutive frames in a time-frequency representation. Audio information is described with parameters such as: spectral pattern (characterizes the soundtrack's timbre), delta spectral pattern (captures the strength of onsets), variance delta spectral pattern (captures the variation of the onset strength over time), logarithmic fluctuation pattern (captures rhythmic aspects), spectral contrast pattern (estimates "tone-ness") and correlation pattern (captures the temporal relation of loudness changes over different frequency bands). For more information see [3].

Temporal descriptors. Genre specificity is also reflected at the temporal level, e.g. music clips tend to have a high visual tempo, documentaries have a reduced action content, etc. To address these aspects we detect sharp transitions (cuts) and two of the most frequent gradual transitions (fades and dissolves). Based on this information, we assess rhythm as the movie's average shot-change speed computed over 5 s time windows (which provides information about the movie's changing tempo) and action in terms of a high-action ratio (e.g. fast changes, fast motion, visual effects, etc.) and a low-action ratio (the occurrence of static scenes). The action level is determined based on user ground truth [4].

Color descriptors. Finally, many genres have specific color palettes, e.g. sports tend to have predominant hues, indoor scenes have different lighting conditions than outdoor scenes, etc. We assess color perception by projecting colors onto a color naming system (associating names with colors allows everyone to create a mental image of a given color or color mixture). We compute a global weighted color histogram (the movie's color distribution), an elementary color histogram (the distribution of basic hues), light/dark, saturated/weakly-saturated and warm/cold color ratios, color variation (the amount of different colors in the movie), color diversity (the amount of different hues) and adjacency/complementarity color ratios. For more information on the visual descriptors see [4].

3. EXPERIMENTAL RESULTS

Results on development data. A first validation was performed on the provided development data set (247 sequences), which was eventually extended to 648 sequences in order to provide a consistent training data set for classification (source: blip.tv; the sequences are different from the ones proposed for the official runs).

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
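The temporal descriptors of Section 2 (rhythm as average shot-change speed over 5 s windows, plus high/low action ratios) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the transition timestamps are assumed to come from a separate shot-boundary detector, and the high/low thresholds are placeholders rather than the values actually used.

```python
from bisect import bisect_left

def rhythm_and_action(transitions, duration, window=5.0, high_thr=3, low_thr=1):
    """Rhythm and action descriptors from shot-transition timestamps.

    transitions: sorted timestamps (seconds) of detected cuts/fades/dissolves.
    Returns (rhythm, high_action_ratio, low_action_ratio), where rhythm is
    the average number of transitions per `window`-second interval.
    Thresholds are illustrative placeholders.
    """
    counts = []
    t = 0.0
    while t < duration:
        # number of transitions falling inside the window [t, t + window)
        counts.append(bisect_left(transitions, t + window) - bisect_left(transitions, t))
        t += window
    rhythm = sum(counts) / len(counts)
    high_action = sum(c >= high_thr for c in counts) / len(counts)  # "fast" windows
    low_action = sum(c <= low_thr for c in counts) / len(counts)    # near-static windows
    return rhythm, high_action, low_action
```

For example, a 10 s clip with transitions at 1.0, 2.0, 3.0 and 9.0 seconds yields a rhythm of 2.0 transitions per 5 s window, with half of the windows counted as high action and half as low action under these placeholder thresholds.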
[Figure 1 here: per-genre bar chart, F-score on the vertical axis, the 26 genres with their number of test sequences on the horizontal axis; classifiers in the legend: BayesNet, DecisionTable, FT, HyperPipes, J48, NNge, NaiveBayes, RBFNetwork, RandomForest, RandomTree, Ridor, SVM linear, VFI, kNN.]

Figure 1: Average F-score achieved using all audio-visual descriptors, and genre classification "performance" for the best run, i.e. SVM with linear kernel (graph on top).
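The per-genre scores in Figure 1 combine precision and recall as Fscore = 2*P*R/(P + R). A minimal one-vs-rest sketch of this computation, with hypothetical label lists for illustration:

```python
def genre_fscore(y_true, y_pred, genre):
    """Precision/recall/F-score for one genre, treated one-vs-rest.

    y_true, y_pred: ground-truth and predicted genre labels (same length).
    """
    tp = sum(t == genre and p == genre for t, p in zip(y_true, y_pred))
    fp = sum(t != genre and p == genre for t, p in zip(y_true, y_pred))
    fn = sum(t == genre and p != genre for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0  # genre never predicted correctly
    return 2 * precision * recall / (precision + recall)
```

For instance, with y_true = ["art", "art", "food", "food"] and y_pred = ["art", "food", "food", "food"], the "food" F-score is 0.8 (P = 2/3, R = 1).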

We observed that, in the case of some of the proposed genres, the genre-specific content is captured mainly by the textual information. Therefore, our tests focused mainly on genres with specific audio-visual contents, like "art", "food", "cars", "sports", etc. (for which we provided a representative number of examples).

Tests were performed using a cross-validation approach. We use p% of the existing sequences for training (randomly selected and uniformly distributed with respect to genre) and the remainder for testing. Experiments were repeated for different combinations of training and testing sets (1000 repetitions).

Figure 1 presents the average Fscore = 2*P*R/(P + R) (where P and R are the average precision and recall, respectively, over all repetitions) for p = 50%, descriptor set = audio-color-action (i.e. the descriptor set which provided the most accurate results) and various classification approaches (see Weka at http://www.cs.waikato.ac.nz/ml/weka/). The numbers in brackets represent the number of test sequences used for each genre.

From the global point of view, the best results are obtained with SVM and a linear kernel (depicted in orange), followed by k-NN (k = 3, depicted in dark green) and FT (Functional Trees, depicted in cyan). At genre level, the best accuracy is obtained for genres with particular audio-visual signatures. The graph on top presents a measure of the individual genre classification "performance", computed as the Fscore times the number of test sequences used. An Fscore obtained for a greater number of sequences is more representative than one obtained for only a few (values are normalized to 1 for visualization purposes). The proposed descriptors provided good discriminative power for genres like (the number in brackets is the Fscore): "food and drink" (0.757), "web development and sites" (0.697), "travel" (0.633) and "politics" (0.552), while at the bottom end are genres whose contents are less reflected in the audio-visual information, e.g. "citizen journalism", "business", "comedy" (see Figure 1).

Results on test data. For the final official runs, classification was performed on 1727 sequences, with training performed on the previous data set (648 sequences). The overall results in terms of MAP (Mean Average Precision) are less accurate than the previous ones: 0.121 for SVM linear on audio-color-action (the best run), 0.103 for SVM linear on audio, 0.077 for k-NN on audio-color-action, 0.038 for SVM linear on color-action and 0.027 for RandomForest on audio-color-action. This is mainly due to the limited training data set compared to the diversity of the test sequences, and to the inclusion of genres for which we obtain 0 precision (i.e. for which audio-visual information is not discriminant, see Figure 1). Since MAP provides only an overall average precision over all genres, we cannot conclude which genres are better suited to retrieval with audio-visual information and which are not.

4. CONCLUSIONS AND FUTURE WORK

The proposed descriptors performed well for some of the genres; however, to improve the classification performance a more consistent training database is required. Also, our approach is more suitable for classifying genre patterns from a global point of view, like episodes of a series, and is not able to detect genre-related content within a sequence. Future tests will consist of performing cross-validation on all 2375 sequences (development + test sets).

5. ACKNOWLEDGMENTS

Part of this work has been supported under the Financial Agreement EXCEL POSDRU/89/1.5/S/62557.

6. REFERENCES

[1] D. Brezeale, D.J. Cook, "Automatic Video Classification: A Survey of the Literature," IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(3), pp. 416-430, 2008.
[2] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, G.J.F. Jones, "Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task," MediaEval 2011 Workshop, Pisa, Italy, 2011.
[3] K. Seyerlehner, M. Schedl, T. Pohle, P. Knees, "Using Block-Level Features for Genre Classification, Tag Classification and Music Similarity Estimation," MIREX-10, Utrecht, Netherlands, 2010.
[4] B. Ionescu, C. Rasche, C. Vertan, P. Lambert, "A Contour-Color-Action Approach to Automatic Classification of Several Common Video Genres," AMR (LNCS 6817), Linz, Austria, 2010.
