Fig. 1. Average Fscore obtained with SVM and Functional Trees using all audio-visual descriptors (50% percentage split; in parentheses, the number of sequences/genre; vertical lines = min-max average Fscore intervals for percentage split ranging from 10% to 90%).

[Fig. 2 (line plot): overall correct detection, CD (%), versus training percentage (10% to 90%), for 3-NN, BayesNet, RandomForest, J48, FT, RBFNetwork, and linear SVM.]

Fig. 2. Average correct classification for various machine learning techniques using all audio-visual descriptors.

The best performance is obtained for genres such as (we provide values for the 50% percentage split and the highest value): "literature" (Fscore = 83%, highest 87%) and "politics" (Fscore = 81%, highest 84%), followed by "health" (Fscore = 78%, highest 85%), "citizen journalism" (Fscore = 65%, highest 68%), "food and drink" (Fscore = 62%, highest 77%), "web development and sites" (Fscore = 63%, highest 84%), "mainstream media" (Fscore = 63%, highest 74%), and "travel" (Fscore = 57%, highest 60%). Due to the large variety of video materials, increasing the number of examples may result in overtraining and thus in reduced classification performance (e.g., in Figure 2, linear SVM for 90% training).

Retrieval scenario. We present the results obtained at the MediaEval 2011 Video Genre Tagging Task [2]. The challenge of the competition was to develop a retrieval mechanism for all 26 genres. Each participant was provided with a development set consisting of 247 sequences (unequally distributed with respect to genre). Participants were encouraged to build their own training sets; we extended the data set to up to 648 sequences (using the same source: blip.tv). The final retrieval task was performed only once, on a test set of 1727 sequences. We used the linear SVM classifier and all audio-visual descriptors. Retrieval results were provided using a binary ranking, where the maximum relevance of 1 is associated with the genre category to which the document was classified, while all other genres have 0 relevance. Performance was assessed with Mean Average Precision (MAP; see the trec_eval scoring tool, http://trec.nist.gov/trec_eval/).

Table 1. MediaEval benchmark [2] (selection of results).

descriptors                                              modality       method                     MAP     team
speech trans.                                            text           SVM                        11.79%  LIA
speech trans., metadata, tags                            text           Bag-of-Words + Terrier IR  11.15%  SINAI
speech trans.                                            text           Bag-of-Words               5.47%   SINAI
speech trans.                                            text           TF-IDF                     6.21%   UAB
speech trans., metadata                                  text           TF-IDF + cosine dist.      9.34%   UAB
MFCC, zero cross., signal energy                         audio          SVMs                       0.1%    KIT
proposed                                                 audio          SVM                        10.3%   RAF
clustered SURF                                           visual         Bag-of-Vis.-Words + SVM    9.43%   TUB
hist., moments, autocorr., co-occ., wavelet, edge hist.  visual         SVMs                       0.35%   KIT
face statistics [4]                                      visual         SVMs                       0.1%    KIT
shot statistics [4]                                      visual         SVMs                       0.3%    KIT
proposed                                                 visual         SVM                        3.84%   RAF
color, texture, aural, cognitive, structural             audio, visual  SVMs                       0.23%   KIT
proposed                                                 audio, visual  SVM                        12.1%   RAF

In Table 1 we compare our results with several other approaches using various modalities of the video, from textual information (e.g., the provided speech transcripts, user tags, metadata) to audio-visual (Terrier Information Retrieval is available at http://terrier.org/). We achieved an overall MAP of up to 12.1% (see team RAF), the best result obtained using audio-visual information alone. Use of descriptors such as cognitive information (face statistics), temporal information (average shot duration,
distribution of shot lengths) [4], audio (MFCC, zero-crossing rate, signal energy), color (histograms, color moments, autocorrelogram, denoted autocorr.), and texture (co-occurrence, denoted co-occ., wavelet texture grid, edge histograms) with SVM resulted in a MAP of less than 1% (see team KIT), while clustered SURF features and SVM achieved a MAP of up to 9.4% (see team TUB). We achieved better performance even compared to some classic text-based approaches, such as TF-IDF (MAP 9.34%, see team UAB) and Bag-of-Words (MAP 11.15%, see team SINAI). It must be noted that the results presented in Table 1 cannot be definitive, as the classification approaches were not trained and set up strictly comparably (e.g., team KIT used up to 2514 sequences, and most text-based approaches employed query expansion techniques). However, these results not only provide a (crude) performance ranking, but also illustrate the difficulty of this task. The most efficient retrieval approach remains the inclusion of textual information, which led to an average MAP around 30%, while the inclusion of information such as movie ID led to a MAP of up to 56% (the highest obtained).

Relevance feedback scenario. In the final experiment we sought to prove that, notwithstanding the superiority of text descriptors, audio-visual information also has great potential in classification tasks, but may benefit from the help of relevance feedback. For the tests we used the entire data set, all 2375 sequences, each sequence being represented by the proposed audio-visual descriptors. The user feedback was simulated automatically from the known class membership of each video (i.e., the genre labels). We used only one feedback session. Tests were conducted for various sizes of the user browsing window N, ranging from 20 to 50 sequences. Table 2 summarizes the overall retrieval MAP, estimated as the area under the uninterpolated precision-recall curve.

Table 2. MAP obtained with Relevance Feedback

RF method     20 seq.  30 seq.  40 seq.  50 seq.
Rocchio [10]  46.8%    43.84%   42.05%   40.73%
FRE [9]       48.45%   45.27%   43.67%   42.12%
SVM [8]       47.73%   44.44%   42.17%   40.26%
proposed      51.27%   46.79%   43.96%   41.84%

For the proposed HCRF, the MAP ranges from 41.8% to 51.3%, an improvement of at least a few percent over the other methods. It can also be seen that relevance feedback is a promising alternative for improving retrieval performance, since it provides results close to those obtained with high-level textual descriptors (see Table 1).

5. CONCLUSIONS

We have addressed web video categorization and proposed a set of audio-visual descriptors. Experimental validation was carried out with a real-world scenario, using 26 video genres from the blip.tv web media platform. The proposed descriptors show potential in distinguishing some specific genres. To reduce the descriptor semantic gap, we have designed a relevance feedback approach which uses hierarchical clustering; it boosted performance close to that obtained with higher-level semantic textual information. Future improvements will mainly address sub-genre categorization and the constraints of very large scale approaches.

6. REFERENCES

[1] D. Brezeale, D.J. Cook, "Automatic Video Classification: A Survey of the Literature," IEEE TSMC, Part C: Applications and Reviews, 38(3), pp. 416-430, 2008.

[2] M. Larson, A. Rae, C.-H. Demarty, C. Kofler, F. Metze, R. Troncy, V. Mezaris, G.J.F. Jones (eds.), Working Notes Proceedings of the MediaEval 2011 Workshop at Interspeech 2011, vol. 807, Italy, September 1-2, 2011.

[3] X. Yuan, W. Lai, T. Mei, X.-S. Hua, X.-Q. Wu, S. Li, "Automatic Video Genre Categorization using Hierarchical SVM," IEEE ICIP, pp. 2905-2908, 2006.

[4] M. Montagnuolo, A. Messina, "Parallel Neural Networks for Multimodal Video Genre Classification," Multimedia Tools and Applications, 41(1), pp. 125-159, 2009.

[5] Z. Al A. Ibrahim, I. Ferrane, P. Joly, "A Similarity-Based Approach for Audiovisual Document Classification Using Temporal Relation Analysis," EURASIP JIVP, doi:10.1155/2011/537372, 2011.

[6] B. Ionescu, D. Coquin, P. Lambert, V. Buzuloiu, "A Fuzzy Color-Based Approach for Understanding Animated Movies Content in the Indexing Task," EURASIP JIVP, doi:10.1155/2008/849625, 2008.

[7] K. Seyerlehner, M. Schedl, T. Pohle, P. Knees, "Using Block-Level Features for Genre Classification, Tag Classification and Music Similarity Estimation," Music Information Retrieval Evaluation eXchange, Netherlands, 2010.

[8] S. Liang, Z. Sun, "Sketch retrieval and relevance feedback with biased SVM classification," Pattern Recognition Letters, 29, pp. 1733-1741, 2008.

[9] Y. Rui, T.S. Huang, M. Ortega, S. Mehrotra, "Relevance feedback: a power tool for interactive content-based image retrieval," IEEE Trans. on Circuits and Systems for Video Technology, 8(5), pp. 644-655, 1998.

[10] N.V. Nguyen, J.-M. Ogier, S. Tabbone, A. Boucher, "Text Retrieval Relevance Feedback Techniques for Bag-of-Words Model in CBIR," Int. Conf. on Machine Learning and Pattern Recognition, 2009.
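Appendix note (added for clarity; not part of the original paper). The MAP figures reported above are uninterpolated average precision, averaged over queries: precision is taken at the rank of each relevant document in the returned list, as in trec_eval. A minimal sketch of this computation, assuming binary relevance lists (function names are illustrative, not from trec_eval):

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 0/1 flags, one per returned document, in the
    # system's ranking order (1 = document belongs to the query genre).
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    # MAP = mean of per-query (here: per-genre) average precisions.
    return sum(average_precision(r) for r in runs) / len(runs)
```

For example, a ranking with relevance flags [1, 0, 1] scores AP = (1/1 + 2/3) / 2; averaging such scores over the 26 genre queries gives the MAP values of Tables 1 and 2.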
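Among the feedback baselines of Table 2, Rocchio [10] is the simplest: the query descriptor is moved toward the centroid of the sequences marked relevant in the browsing window and away from the centroid of the non-relevant ones. A sketch of the classic update (the alpha/beta/gamma values below are conventional defaults, not the paper's settings):

```python
def rocchio_update(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    # Classic Rocchio feedback on descriptor vectors:
    #   q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(component) / len(vectors) for component in zip(*vectors)]

    rel_c, non_c = centroid(relevant), centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]
```

One feedback session then re-ranks the collection by distance to the updated descriptor, which is the loop the simulated user feedback described above would drive.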