
Content-Based Video Retrieval and Compression: A Unified Solution

HongJiang Zhang, John Y. A. Wang and Yucel Altunbasak
Hewlett-Packard Laboratories, 1501 Page Mill Rd., Palo Alto, CA 94304

Abstract
Video compression and retrieval have been treated as separate problems in the past. In this paper, we present an object-based video representation that facilitates both compression and retrieval. Typically, in retrieval applications a video sequence is subdivided in time into a set of shorter segments, each containing similar content. These segments are represented by 2-D representative images called "key-frames" that greatly reduce the amount of data to be searched. However, key-frames do not describe the motions and actions of objects within a segment. We propose a representation that extends the idea of the key-frame to further include what we define as "key-objects". These key-objects consist of regions within a key-frame that move with similar motion. Our key-objects thus allow a retrieval system to present information to users more efficiently and assist them in browsing and retrieving relevant video content.

1. Introduction
As computer networks improve and digital image and video libraries become more readily available to homes and offices via the Internet, the problems of bandwidth and of accessing relevant data become more challenging. Even now, the task of viewing and browsing through the vast amount of image and video data with conventional VCR-like interfaces is tedious, time-consuming, and unbearable. To make these databases widely usable, we need to develop more effective methods for content selection, data retrieval, and browsing.

Many compression algorithms and standards have been developed and adopted over the last two decades for efficient video transmission and storage. These include the well-accepted MPEG-1, MPEG-2, and H.26x standards based on simple block-transform techniques. These block-based representations, however, do not encode data in ways that are suitable for content-based retrieval [1]. This is understandable, since they were designed and optimized around bandwidth and distortion constraints. Many researchers have investigated using these compression schemes for browsing and temporal analysis, and have painstakingly realized the difficulties and challenges of using a representation not designed to encode content information for retrieval applications.

In browsing and retrieval applications, the system needs to sample the data quickly, decide on relevance, and present the results to the user in a timely fashion. Consequently, some form of data compression must be considered together with a scheme for content indexing so that the system can achieve the required response; compression and content analysis must be addressed simultaneously in a unified framework [1]. Just as in text retrieval, where a document is divided into smaller sub-components such as sentences, phrases, words, letters, and numerals, a long video sequence must be organized into smaller, more manageable components consisting of scenes, shots, moving objects, and pixel properties such as color, texture, and motion. Ultimately, a compact representation that encodes data as a set of moving objects would be highly desirable. Several researchers have pursued this direction, for example in the layered image representation work of Wang and Adelson [2], and the emerging MPEG-4 video standard has incorporated similar ideas based on image layers. However, given these representations, many challenges remain in the design and creation of effective indices based on image and video content.

In this paper we propose a scheme for video retrieval applications that extends the current key-frame retrieval approach to include attributes of moving objects, which we call key-objects. We begin in Section 2 by reviewing current strategies used in video retrieval and identifying their desirable features. With these features, we develop a video representation that incorporates key-objects and show how it facilitates both content-based query and browsing. In Section 3, we present techniques for key-object analysis. Finally, we discuss applications of key-objects in a video database management system.

2. Video Representation and Indexing


In this section, we briefly discuss current strategies used in video indexing and, based on these ideas, develop a strategy that involves a more semantic, object-based representation. Current video indexing schemes rely on two primary concepts: video parsing (temporal segmentation) and content analysis. Video parsing involves the detection of temporal boundaries and the identification of meaningful segments of video. These segments are categorized in a hierarchy similar to the storyboards used in film making. The top-level construct consists of sequences, which are composed of a set of scenes. Scenes are further partitioned into shots. Each shot contains a sequence of frames that expresses an event within the long video sequence.

In current systems, shot boundaries are identified from quantitative differences between consecutive frames, so that neighboring shots portray different actions or events. Such shot-boundary detection techniques, which rely on simple temporal segmentation, achieve satisfactory performance; many such algorithms detect discontinuities in color and motion between frames [3]. Given the shot boundaries and data partitions, representative frames within each shot are selected to convey the content. These 2-D image frames, called "key-frames", allow users of retrieval systems to quickly browse the entire video by viewing frames from selected time samples that describe the highlights of each shot. The use of key-frames greatly reduces the amount of data required for indexing and, furthermore, provides an organizational framework for dealing with video content. The problem of video retrieval and browsing thus becomes manageable, since it can rely on existing 2-D image database retrieval infrastructure to catalog the key-frames. These techniques are typically based on global analysis of color, texture, and simple motions [3, 4]. Such global features are rather effective for general search applications because they describe the characteristics of the key-frame as a whole. However, they sometimes do not provide sufficient resolution for object-based retrievals and for queries based on the properties of objects. As a result, key-frame retrieval is limited to searches over general color, texture, or signal attributes.
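To make this concrete, the sketch below implements one common instance of such temporal segmentation: a color-histogram difference detector that declares a shot boundary when the distance between consecutive frames' histograms exceeds a threshold. This is a generic illustration of the approach, not the specific algorithm of [3]; the bin count and threshold are assumed, tunable values.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Coarse, normalized RGB histogram of a frame (H x W x 3 uint8 array)."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3), bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / hist.sum()  # normalize so frames of any size compare

def detect_shot_boundaries(frames, threshold=0.35):
    """Mark a shot boundary wherever the histogram difference between
    consecutive frames exceeds the threshold (an assumed, tunable value)."""
    boundaries = []
    prev = None
    for i, frame in enumerate(frames):
        h = color_histogram(frame)
        # 0.5 * L1 distance of normalized histograms lies in [0, 1]
        if prev is not None and 0.5 * np.abs(h - prev).sum() > threshold:
            boundaries.append(i)  # a new shot starts at frame i
        prev = h
    return boundaries
```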

3. Object-based video indexing


In queries, it is often desirable to specify particular attributes of objects and subregions within the viewable image. Support for objects in the retrieval representation would therefore be greatly advantageous in dealing with such queries about objects and their motions. Though object representations could be useful, analysis of semantic objects is not always possible. Many researchers have investigated segmentation techniques based on motion, color, texture, and other cues to identify objects, though none can claim that their segmentation algorithms identify semantic objects. These techniques merely identify simple regions of coherent motion, color, or texture, and frequently cannot deal with complex scenes that involve multiple motions, actions, and objects. Despite these limitations, it is nevertheless useful to provide simple object abstractions where possible.


3.1 Key-Object representation


Key-frames provide a suitable abstraction and framework for video browsing. However, we could support an even wider range of queries by defining smaller units within the key-frame. We call these smaller units "key-objects" because we use them to represent key regions that participate in distinct actions within the shot. Our key-objects do not necessarily correspond to semantic objects; identifying semantic objects is an extremely difficult analysis problem that we want to avoid, so instead we seek out regions of coherent motion. This coherent-motion criterion is perceptually and physically motivated, since the points of an object exhibit motion coherence and people tend to group regions of similar motion into one semantic object. Thus motion coherence captures some aspects of objects desirable in retrieval. The attributes that we attach to key-objects include color, texture, shape, motion, and life cycle. The color and texture attributes can be computed with algorithms described by previous authors [3].

We incorporate our key-objects within the shot-based key-frame representation. In the augmented representation, each shot is represented by one or more key-frames, which are further decomposed into key-objects. We also provide a general description of motion activity within the shot. In shots where key-objects cannot be reliably detected, a motion activity descriptor provides information about likely actions within the shot. Motion activity captures the general motions in the shot, such as global motions arising from camera pan or zoom. For example, our motion activity descriptor can be used to distinguish "shaky" sequences captured with a hand-held camera from professionally captured sequences.

Furthermore, there are advantages in decomposing key-object motion into a global component and a local, object-based component. In this decomposition, the key-object motion can more easily reflect motion relative to the background and to other key-objects in the scene; without this distinction, the key-object motion would instead represent motion relative to the image frame. Thus this decomposition provides a more meaningful and effective description for retrieval.

We summarize our descriptors for video indexing as follows:

Sequence
  Sequence-ID: unique index key of the sequence;
  Shots: { Shot(1), Shot(2), ..., Shot(N) }

Shot
  Shot-ID: unique index key of the shot;
  Motion-Activity: mean/dominant motion, activity based on variance;
  Objects: { Object(1), Object(2), ..., Object(N) }

Object
  Object-ID: identification number of an object within the shot;
  Object-Shape: alpha map of the object;
  Object-Life: relative time frame in which the object appears and disappears within the shot;
  Object-Color/Texture: the color and texture of the object;
  Object-Motion: the trajectory and motion model parameters of the object.

In a decomposition where object attributes do not change drastically over the entire shot, we need only one representative description for each attribute. However, because object attributes do often change, a single descriptor may not be representative. In this case, we derive a set of attributes from various instances in time; thus, each key-object is a collection of "key-instances".
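As a minimal sketch, the descriptor summary above might be realized as the following data structures. The field names mirror the summary; the concrete types (an alpha map as a boolean array, motion as affine parameters, a color histogram) are our illustrative assumptions rather than the paper's specification.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class KeyObject:
    object_id: int              # identification number within the shot
    shape: np.ndarray           # alpha map: H x W boolean support mask
    life: Tuple[int, int]       # (appear_frame, disappear_frame) relative to the shot
    color_hist: np.ndarray      # color descriptor, e.g. a normalized histogram
    texture: np.ndarray         # texture descriptor vector
    motion: np.ndarray          # per-frame trajectory / affine motion parameters

@dataclass
class Shot:
    shot_id: int                              # unique index key of the shot
    motion_activity: dict                     # mean/dominant motion, variance-based activity
    objects: List[KeyObject] = field(default_factory=list)

@dataclass
class Sequence:
    sequence_id: int                          # unique index key of the sequence
    shots: List[Shot] = field(default_factory=list)
```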

3.2 Key-object analysis


In key-object analysis, we want to identify and group regions within a shot that move with coherent motion. Several techniques exist for achieving this goal; we use the motion segmentation techniques described by Wang and Adelson [2]. The two major components of this algorithm are local motion estimation and motion model estimation. The algorithm is similar to other model-based clustering techniques, except that it makes several optimizations that improve robustness. It also reduces the computation in our analysis procedure, because we reuse the local motion information to derive a measure of motion activity.

The local motion field can be obtained by optic-flow estimation techniques, which produce a dense motion field describing the motion between consecutive frames for every pixel in the image. Based on this observed local motion, various motion models are hypothesized and their motion parameters iteratively refined. These models can range from simple translation to complex planar parallax motion; we find that affine motion models provide a good compromise between complexity and stability. The models are used to classify pixels, that is, to partition the image into coherently moving regions.

In cases where clear and distinct motion regions exist, we track each region throughout the duration of the shot. These regions and their attributes of shape, color, and texture are cataloged along with the shot descriptors. Furthermore, we use the size of a region to determine the global motion: for example, a large region that includes the peripheral pixels is likely to correspond to the background. This assumption performs fairly well when foreground objects are proportionately smaller than the background. The key-object motions are adjusted to reflect motion relative to this background motion; these adjustments of the affine parameters can be easily completed with a simple matrix transformation. Likewise, the relative motion between key-objects can be computed in a similar fashion. Figure 1 shows an example of optic-flow calculated between two frames of a video shot, where the moving object, a man riding a horse, can be seen clearly.
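The sketch below illustrates the model-fitting and classification steps under our affine assumption: a least-squares fit recovers six affine parameters from the dense optic-flow over a candidate region, and each pixel is then assigned to the model that best predicts its flow. This is a simplified illustration of the kind of estimation used in [2], not the authors' exact implementation.

```python
import numpy as np

def fit_affine_motion(flow, mask):
    """Least-squares affine fit u(x,y) = a0 + a1*x + a2*y,
    v(x,y) = a3 + a4*x + a5*y over pixels selected by `mask`.
    `flow` is an (H, W, 2) dense optic-flow field."""
    ys, xs = np.nonzero(mask)
    A = np.stack([np.ones_like(xs), xs, ys], axis=1).astype(float)
    au, *_ = np.linalg.lstsq(A, flow[ys, xs, 0], rcond=None)
    av, *_ = np.linalg.lstsq(A, flow[ys, xs, 1], rcond=None)
    return np.concatenate([au, av])      # six affine parameters

def classify_pixels(flow, models):
    """Assign each pixel to the affine model with the smallest flow
    residual, partitioning the image into coherently moving regions."""
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    residuals = []
    for p in models:
        pred_u = p[0] + p[1] * xs + p[2] * ys
        pred_v = p[3] + p[4] * xs + p[5] * ys
        residuals.append((flow[..., 0] - pred_u) ** 2 +
                         (flow[..., 1] - pred_v) ** 2)
    return np.argmin(np.stack(residuals), axis=0)   # region label map
```

One plausible reading of the matrix transformation mentioned above is that an object's motion relative to the background is obtained by composing its affine transform with the inverse of the background's.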

Because of difficulties in motion segmentation and the limitations of current algorithms, complex scenes often cannot be decomposed into distinct coherent regions. Under these circumstances, we resort to computing a simple measure of motion activity. This motion activity includes measurements of the mean, variance, and distribution of the observed optic-flow motion; these measurements provide information about the complexity and distribution of motion that may be useful in queries. The Wang and Adelson approach of model estimation from optic-flow data nicely complements our analysis of key-objects and motion activity: the local motion is computed once and then used both to compute activity and to identify key-objects. The amount of computation is thus greatly reduced compared with an approach based on a dominant motion estimation algorithm, which requires many costly image warps and motion estimations.
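A minimal sketch of such an activity descriptor, assuming a dense optic-flow field as input; the particular statistics kept here (mean magnitude, variance, and a magnitude-weighted direction histogram) are our example choices:

```python
import numpy as np

def motion_activity(flow):
    """Summarize a dense optic-flow field (H, W, 2) into activity statistics."""
    magnitude = np.hypot(flow[..., 0], flow[..., 1])
    direction = np.arctan2(flow[..., 1], flow[..., 0])
    hist, _ = np.histogram(direction, bins=8, range=(-np.pi, np.pi),
                           weights=magnitude)           # motion distribution
    return {
        "mean": float(magnitude.mean()),                # overall amount of motion
        "variance": float(magnitude.var()),             # high for shaky hand-held shots
        "direction_hist": hist / max(hist.sum(), 1e-9), # dominant motion directions
    }
```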

3.3 Key-frame selection for video browsing


As described earlier, in video browsing applications key-frames provide a quick and effective way to convey the video content of shot segments. The effectiveness of key-frame browsing depends on the choice of the image frames selected to represent the shot. The image frames within a shot are not all equally descriptive: certain frames may provide more information about the objects and actions within the shot than others. Although, as a convenience, image frames at shot boundaries are frequently used as key-frames for video browsing, they often "miss" the more interesting actions that occur in the middle of the shot. Below we outline some strategies for key-frame selection based on the motion activity and the attributes of key-objects within the shot:

i) a key-object enters or leaves the image frame boundaries;
ii) key-objects participate in an occlusion relationship;
iii) two key-objects are at their closest distance;
iv) key-object attributes (color, shape, motion) reach their mean or extrema;
v) selected key-frames should have only a small amount of background overlap.

Figures 1 and 2 show three key-frames selected from a video sequence according to the criteria outlined above.

Fig.1: (a) One key-frame and (b) its corresponding optic-flow. The key-object, a man riding a horse, is clearly visible.

Fig.2: Key-frames selected based on moving objects: (a) the horse enters the scene; (b) the horse leaves the scene.
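As an illustration of criterion i) above, the following sketch selects candidate key-frames at the points where key-objects enter or leave the shot, using the object life-cycle attribute and the hypothetical Shot/KeyObject structures sketched in Section 3.1.

```python
def keyframes_from_object_events(shot, first_frame, last_frame):
    """Pick frames where a key-object enters or leaves the shot (criterion i).
    Entry/exit exactly at the shot boundary is ignored, since boundary
    frames are the conventional fallback key-frames anyway."""
    candidates = set()
    for obj in shot.objects:
        appear, disappear = obj.life
        if appear > first_frame:
            candidates.add(appear)        # object enters mid-shot
        if disappear < last_frame:
            candidates.add(disappear)     # object leaves mid-shot
    return sorted(candidates)
```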


Another approach to presenting 2-D images to the user for effective video browsing involves computing image mosaics [5]. Image mosaics can be effective for retrieval because the mosaic, which condenses information from many image frames to produce an expanded view or "panorama" of the scene, quickly conveys the elements of the scene. However, mosaics are not generally applicable or computable for all shots. In situations where they are computable, our proposed video description and analysis of key-objects help us build these mosaics.

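A hedged sketch of how the background motion recovered during key-object analysis could drive mosaic construction: each frame is warped into a common coordinate system by the accumulated background transform and composited onto a larger canvas. The 3x3 homogeneous matrices, grayscale frames, precomputed canvas size, and simple overwrite compositing are all our simplifying assumptions; this outlines the idea rather than the method of [5].

```python
import numpy as np
from scipy.ndimage import affine_transform

def build_mosaic(frames, frame_to_prev, canvas_shape):
    """Composite grayscale frames into a panorama. `frame_to_prev[k]` is an
    assumed 3x3 homogeneous affine (row/col coordinates) mapping frame k
    into the coordinates of frame k-1, derived from the background motion;
    entry 0 is unused. Frame 0 defines the mosaic coordinate system."""
    mosaic = np.zeros(canvas_shape)
    to_mosaic = np.eye(3)                    # accumulated frame-to-mosaic transform
    for k, frame in enumerate(frames):
        if k > 0:
            to_mosaic = to_mosaic @ frame_to_prev[k]
        inv = np.linalg.inv(to_mosaic)       # scipy warps via the inverse map
        warped = affine_transform(frame, inv[:2, :2], offset=inv[:2, 2],
                                  output_shape=canvas_shape, order=1)
        mosaic = np.where(warped > 0, warped, mosaic)   # later frames overwrite
    return mosaic
```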


4. Object-based retrieval and compression


The object-based representation has advantages beyond its use in key-frame extraction and video mosaic construction: it widens the range of queries possible in content-based video retrieval. For instance, many queries involve figure/ground relationships, such as "a bright red object moving over a green background". Our representation scheme contains all the features required to support such queries. In our scheme, the matching between the query and candidate video sequences is based on the visual attributes of individual objects and/or their composition within shots. That is, for queries searching for the presence of one or more objects in a shot, the similarity between the query and the candidate is defined as the similarity between their key-objects. Formally, if two shots are denoted $S_i$ and $S_j$, and their key-object sets $K_i = \{o_{i,m},\ m = 1, \ldots, M\}$ and $K_j = \{o_{j,n},\ n = 1, \ldots, N\}$, then the similarity between the two shots can be defined as


Fig.3: Data and process flow diagram of a content-based video compression and retrieval process.

$$ S_k(S_i, S_j) \;=\; \frac{1}{M} \sum_{m=1}^{M} \max\left[\, s_k(o_{i,m}, o_{j,1}),\; s_k(o_{i,m}, o_{j,2}),\; \ldots,\; s_k(o_{i,m}, o_{j,N}) \,\right] \qquad (1) $$

where $s_k$ is a similarity metric between two objects. There are $M \times N$ similarity values in total, from which the maximum over $n$ is selected for each $m$. This definition states that the similarity between two shots is the normalized sum of the similarities of the most similar key-object pairs. Note that the similarity between two objects may be defined over any combination of the object attributes. Likewise, when key-objects consist of several key-instances, the similarity measure extends easily to compare the attributes of the various instances.
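A direct transcription of Equation (1) is sketched below, with histogram intersection over the assumed color descriptors standing in for the per-object metric $s_k$; any combination of attributes could be substituted. It builds on the hypothetical Shot/KeyObject structures from Section 3.1.

```python
import numpy as np

def object_similarity(obj_a, obj_b):
    """Example s_k: histogram intersection of the objects' color descriptors."""
    return float(np.minimum(obj_a.color_hist, obj_b.color_hist).sum())

def shot_similarity(shot_i, shot_j, s_k=object_similarity):
    """Equation (1): for each key-object in shot_i, take its best match in
    shot_j, then average over shot_i's M key-objects."""
    if not shot_i.objects or not shot_j.objects:
        return 0.0
    best_matches = [max(s_k(oi, oj) for oj in shot_j.objects)
                    for oi in shot_i.objects]
    return sum(best_matches) / len(best_matches)
```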

The proposed object-based representation also provides a unified framework for both video indexing and compression. Using this framework in video coding, we may achieve higher compression ratios while supporting content-based data access. As shown in Figure 3, the major difference from traditional video coding schemes is that a content analysis module is added to the compression process: video is first segmented into shots, each shot is then decomposed into objects with extracted content features, and, finally, the encoding module encodes the objects. The resulting meta-data (content features) is small relative to the image data and can be further compressed with conventional algorithms. The emerging video standard, MPEG-4, incorporates a similar object-based scheme in compression; however, retrieval has not been its focus [6]. At the decoder end, the compressed data are decoded into object image maps and meta-data. Visual browsing and query can then be performed without the sophisticated analysis of compressed video that is required when browsing MPEG-2 streams. For example, the color histograms coded with key-objects can help retrieve shots with similar key-objects. Furthermore, with key-objects and their motion features, key-frames or image mosaics of retrieved shots can be generated efficiently for browsing.

5. Conclusion

Video retrieval, browsing, and compression face similar problems. In this paper, we proposed a compact representation that enables efficient search and browsing. Our approach involves key-objects, which extend current key-frame techniques and support more descriptive queries about objects and their actions. Furthermore, we use key-object analysis to assist in the selection of key-frames. Techniques for using key-objects in retrieval are discussed, and their advantages over key-frame techniques demonstrated.

6. References
1. R. Picard, "Content Access for Image/Video Coding: The Fourth Criterion," Proc. of ICPR, Oct. 1994.
2. J. Y. A. Wang and E. H. Adelson, "Representing Moving Images with Layers," IEEE Trans. on Image Processing, 3(5):625-638, Sept. 1994.
3. H. J. Zhang et al., "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution," Proc. of ACM Multimedia '95, San Francisco, Nov. 7-9, 1995, pp. 15-24.
4. M. M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots," Proc. of ICIP '95, Oct. 1995, pp. 338-341.
5. L. Teodosio and W. Bender, "Salient Video Stills: Content and Context Preserved," Proc. of ACM Multimedia '93, Aug. 1993.
6. "Description of MPEG-4," ISO/IEC JTC1/SC29/WG11 N1410, Oct. 1996.
