HongJiang Zhang, John Y. A. Wang and Yucel Altunbasak
Hewlett-Packard Laboratories, 1501 Page Mill Rd., Palo Alto, CA 94304
Abstract

Video compression and retrieval have been treated as separate problems in the past. In this paper, we present an object-based video representation that facilitates both compression and retrieval. Typically in retrieval applications, a video sequence is subdivided in time into a set of shorter segments, each of which contains similar content. These segments are represented by 2-D representative images called "key-frames" that greatly reduce the amount of data that is searched. However, key-frames do not describe the motions and actions of objects within a segment. We propose a representation that extends the idea of the key-frame to further include what we define as "key-objects". These key-objects consist of regions within a key-frame that move with similar motion. Thus our key-objects allow a retrieval system to present information to users more efficiently and assist them in browsing and retrieving relevant video content.

1. Introduction

As computer networks improve and digital image and video libraries become more readily available to homes and offices via the Internet, the problems of bandwidth and of one's ability to access relevant data become more challenging. Even now, the task of viewing and browsing through the vast amount of image and video data with conventional VCR-like interfaces is tedious and time-consuming. In order to make these databases widely usable, we need to develop more effective methods for content selection, data retrieval, and browsing.

Many compression algorithms and standards have been developed and adopted in the last two decades for efficient video transmission and storage. These include the well-accepted MPEG-1, MPEG-2, and H.26x standards based on simple block-transform techniques. These block-based representations, however, do not encode data in ways that are suitable for content-based retrieval [1]. This is understandable, since they were designed and optimized around bandwidth and distortion constraints. Many researchers have investigated using these compression schemes for browsing and temporal analysis and have painstakingly realized the difficulties and challenges of using a representation not designed to encode content information for retrieval applications.

In browsing and retrieval applications, the system needs to quickly sample the data, decide on relevance, and present the results to the user in a timely fashion. Consequently, it is necessary to consider some form of data compression and a scheme for content indexing so that the system can achieve the required response. Thus, compression and content analysis must be addressed simultaneously in a unified framework [1]. Just as in text retrieval, where a document is divided into smaller sub-components such as sentences, phrases, words, letters, and numerals, a long video sequence must be organized into smaller, more manageable components: scenes, shots, moving objects, and pixel properties such as color, texture, and motion. Ultimately, a compact representation that encodes data as a set of moving objects would be highly desirable. Several researchers have pursued these directions, for example the layered image representation of Wang and Adelson [2]. The emerging video standard MPEG-4 has also incorporated similar ideas based on image layers. However, given these representations, there remain many challenges in the design and creation of effective indices based on image and video content.

In this paper we propose a scheme for video retrieval applications that extends the current key-frame retrieval approach to include attributes of moving objects called key-objects. We begin in Section 2 by reviewing current strategies used in video retrieval and identifying the desirable features. With these features, we develop a video representation that incorporates key-objects and show how it facilitates both content-based query and browsing. In Section 3, we present techniques for key-object analysis. Finally, we discuss applications of key-objects in a video database management system.
In current systems, shot boundaries are identified based on quantitative differences between consecutive frames; consequently, neighboring shots portray different actions or events. Techniques for detecting shot boundaries that rely on this simple temporal segmentation achieve satisfactory performance; many such algorithms rely on detecting discontinuities in color and motion between frames [3].

Given the shot boundaries and data partitions, representative frames within each shot are selected to convey the content. These 2-D image frames, called "key-frames", allow users of retrieval systems to quickly browse the entire video by viewing frames from selected time samples that describe the highlights of each shot. The use of key-frames greatly reduces the amount of data required for indexing and, furthermore, provides an organizational framework for dealing with video content. Thus, the problem of video retrieval and browsing becomes manageable, since it can rely on existing 2-D image database retrieval infrastructure to catalog the key-frames. These techniques are typically based on global analysis of color, texture, and simple motions [3, 4]. Such global features are rather effective for general search applications because they describe the characteristics of the key-frame as a whole. However, they sometimes do not provide sufficient resolution for object-based retrievals and queries based on the properties of objects. As a result, key-frame retrieval is limited to searches on general color, texture, or signal attributes.
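To illustrate the simple temporal segmentation described above, the following sketch flags a shot boundary wherever the color-histogram difference between consecutive frames exceeds a fixed threshold. This is a minimal stand-in rather than any specific published algorithm; the bin count and threshold are illustrative choices.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Per-channel color histogram of a uint8 frame, normalized to sum to 1."""
    hist = np.concatenate(
        [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
         for c in range(frame.shape[-1])]
    )
    return hist / hist.sum()

def detect_shot_boundaries(frames, threshold=0.4):
    """Flag a boundary at frame t when the L1 histogram difference
    between frames t-1 and t exceeds the threshold."""
    hists = [color_histogram(f) for f in frames]
    boundaries = []
    for t in range(1, len(hists)):
        diff = 0.5 * np.abs(hists[t] - hists[t - 1]).sum()  # in [0, 1]
        if diff > threshold:
            boundaries.append(t)
    return boundaries
```

In practice, as the text notes, discontinuities in motion would be checked alongside color to reduce false positives from lighting changes.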
People tend to group regions of similar motion into one semantic object; thus motion coherence might capture some aspects of objects that are desirable in retrieval. The attributes that we attach to key-objects include color, texture, shape, motion, and life cycle. The color and texture attributes might be computed with algorithms described by previous authors [3].

We incorporate our key-objects within the shot-based key-frame representation. In the augmented representation, each shot is represented by one or more key-frames, which are further decomposed into key-objects. We also provide a general description of motion activity within the shot. In shots where key-objects cannot be reliably detected, a motion activity descriptor provides information about likely actions within the shot. Motion activity captures the general motions in the shot, such as global motions arising from camera pan or zoom. For example, our motion activity descriptor can be used to distinguish "shaky" sequences captured with a hand-held camera from professionally captured sequences.

Furthermore, there are advantages in decomposing key-object motion into a global component and a local/object-based component. In this decomposition, the key-object motion can more easily be used to reflect motion relative to the background and to other key-objects in the scene. Without this distinction, the key-object motion would instead represent motion relative to the image frame. Thus this decomposition provides a more meaningful and effective description for retrieval.

We summarize our descriptors for video indexing as follows:

Sequence
    Sequence-ID: unique index key of the sequence;
    Shots: { Shot(1), Shot(2), ..., Shot(N) }

Shot
    Shot-ID: unique index key of the shot;
    Motion-Activity: mean/dominant motion, activity based on variance;
    Objects: { Object(1), Object(2), ..., Object(N) }

Object
    Object-ID: identification number of an object within the shot;
    Object-Shape: alpha map of the object;
    Object-Life: relative time frame when the object appears in and disappears from the shot;
    Object-Color/Texture: the color and texture of the object;
    Object-Motion: the trajectory and motion model parameters of the object.

In a decomposition where object attributes do not change drastically over the entire shot, we need only one representative description for each attribute. However, because object attributes do often change, a single descriptor may not be representative. In this case, we derive a set of attributes from various instances in time; thus, each key-object is a collection of "key-instances".
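The descriptor hierarchy above could be realized as plain data structures. The sketch below mirrors the Sequence/Shot/Object listing; the field types are illustrative assumptions (e.g. a color histogram as a list of floats), not prescribed by the representation itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class KeyObject:
    object_id: int                        # Object-ID within the shot
    shape: List[List[int]] = field(default_factory=list)  # alpha map rows
    life: Tuple[int, int] = (0, 0)        # (appear, disappear) frames, relative to shot
    color_texture: List[float] = field(default_factory=list)  # e.g. a color histogram
    motion: List[float] = field(default_factory=list)         # trajectory / model params

@dataclass
class Shot:
    shot_id: int
    motion_activity: Tuple[float, float] = (0.0, 0.0)  # (mean/dominant motion, variance)
    objects: List[KeyObject] = field(default_factory=list)

@dataclass
class Sequence:
    sequence_id: int
    shots: List[Shot] = field(default_factory=list)
```

A key-object with several key-instances would simply hold one `color_texture`/`motion` entry per instance.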
Because of difficulties in motion segmentation and the limitations of current algorithms, complex scenes often cannot be decomposed into distinct coherent regions. Under these circumstances, we resort to computing a simple measure of motion activity. This motion activity includes measurements of the mean, variance, and distribution of the observed optic-flow motion. These measurements provide information about the complexity and distribution of motion that might be useful in queries.

The Wang and Adelson approach of model estimation from optic-flow data [2] nicely complements our analysis of key-objects and motion activity. The local motion is computed once and used both to compute activity and to identify key-objects. Thus the amount of computation is greatly reduced compared with an approach that uses a dominant-motion estimation algorithm, which requires many costly image warps and motion estimations.

Fig.1: (a) One key-frame and (b) its corresponding optic-flow. The key-object of a man riding a horse is clearly visible.

Fig.2: Key-frames selected based on moving objects: (a) the horse enters the scene; (b) the horse leaves the scene.
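The motion-activity measurements (mean, variance, and distribution of optic-flow magnitudes) can be sketched as follows, assuming a dense flow field is already available as an H x W x 2 array; the bin count and magnitude range are illustrative.

```python
import numpy as np

def motion_activity(flow, bins=8, max_mag=16.0):
    """Summarize a dense optic-flow field (H x W x 2, in pixels/frame)
    by the mean, variance, and histogram of its motion magnitudes."""
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
    hist, _ = np.histogram(mag, bins=bins, range=(0.0, max_mag))
    return {
        "mean": float(mag.mean()),
        "variance": float(mag.var()),
        "distribution": (hist / hist.sum()).tolist(),
    }
```

A static shot yields near-zero mean and variance; a "shaky" hand-held shot yields a broad distribution with high variance even when the mean is moderate.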
Another approach to presenting 2-D images to the user for effective video browsing involves computing image mosaics [5]. Image mosaics can be used effectively for retrieval because the mosaic, which condenses information from many image frames to produce an expanded view or "panorama" of the scene, quickly conveys the elements of the scene. However, mosaics are not generally applicable or computable for all shots. In situations where they are computable, our proposed video description and analysis of key-objects help us build these mosaics.
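For intuition, a toy sketch of mosaic construction in the simplest case, assuming per-frame global integer translations (e.g. recovered from a camera pan) are already known; real mosaicking would warp under a full motion model and blend overlaps rather than overwrite them.

```python
import numpy as np

def build_mosaic(frames, offsets):
    """Paste frames onto a common canvas using known integer (dx, dy)
    global translations per frame. Later frames overwrite earlier
    ones where they overlap."""
    h, w = frames[0].shape[:2]
    xs = [dx for dx, _ in offsets]
    ys = [dy for _, dy in offsets]
    W = w + max(xs) - min(xs)
    H = h + max(ys) - min(ys)
    canvas = np.zeros((H, W) + frames[0].shape[2:], dtype=frames[0].dtype)
    for frame, (dx, dy) in zip(frames, offsets):
        x0, y0 = dx - min(xs), dy - min(ys)
        canvas[y0:y0 + h, x0:x0 + w] = frame
    return canvas
```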
The similarity between two shots S_i and S_j, based on their key-objects, is defined as

    S_k(S_i, S_j) = (1/M) Σ_{m=1}^{M} max[ s_k(o_{i,m}, o_{j,1}), s_k(o_{i,m}, o_{j,2}), ..., s_k(o_{i,m}, o_{j,N}) ]    (1)

where s_k is a similarity metric between two objects; there are in total M x N similarity values, from which the maximum is selected for each of the M key-objects of S_i. This definition states that the similarity between two shots is the average of the similarities of the most similar key-object pairs. Note that one might be interested in a similarity measure between two objects defined by any combination of the object attributes. Likewise, when key-objects consist of several key-instances, we can easily extend the similarity measure to compare the attributes of the various instances.

The proposed object-based representation also provides a unified framework for both video indexing and compression. Using this framework in video coding, we may achieve higher compression ratios and support content-based data access. As shown in Figure 3, the major difference from traditional video coding schemes is that a content analysis module is added to the compression process. Video is first segmented into shots; each shot is then decomposed into objects with extracted content features; and, finally, the encoding module encodes the objects. The size of the resultant meta-data (content features) is relatively small compared with the image data and can be further compressed with conventional algorithms. The emerging video standard, MPEG-4, also incorporates a similar object-based scheme in compression; however, retrieval has not been its focus [6].

Fig.3: Data and process flow diagram of a content-based video compression and retrieval process. (Video → Content Analyzer → Object encoder and Meta-data encoder → Index → Channel.)

At the decoder end, the compressed data are decoded into object image maps and meta-data. At this point, visual browsing and query can be performed without requiring sophisticated analysis of the compressed video, as is needed when browsing through MPEG-2 videos. For example, we can use the color histograms coded with key-objects to help retrieve shots with similar key-objects. Furthermore, with key-objects and their motion features, key-frames or image mosaics of retrieved shots can be generated efficiently for browsing.

5. Conclusion

Video retrieval, browsing, and compression face similar problems. In this paper, we proposed a compact representation that enables efficient search and browsing. Our approach involves key-objects, which extend current key-frame techniques and support more descriptive queries about objects and their actions. Furthermore, we use key-object analysis to assist in the selection of key-frames. Techniques for using key-objects in retrieval are discussed, and the advantages over key-frame techniques demonstrated.
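For concreteness, the shot similarity of Eq. (1) can be sketched in code. Here each key-object is reduced to a normalized feature vector and histogram intersection stands in for the object metric s_k; as noted above, any attribute-based metric could be substituted.

```python
def object_similarity(a, b):
    """Illustrative object metric s_k: histogram intersection of two
    normalized feature vectors (any attribute combination could be
    substituted here)."""
    return sum(min(x, y) for x, y in zip(a, b))

def shot_similarity(shot_i, shot_j):
    """Eq. (1): average, over the M key-objects of shot_i, of the
    best-matching similarity against the N key-objects of shot_j."""
    M = len(shot_i)
    return sum(max(object_similarity(o_i, o_j) for o_j in shot_j)
               for o_i in shot_i) / M
```

Note that the measure is asymmetric in general: each key-object of the query shot seeks its best match in the candidate shot, but not vice versa.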
6. References
1. R. Picard, "Content Access for Image/Video Coding: The Fourth Criterion," Proc. of ICPR, Oct. 1994.
2. J. Y. A. Wang and E. H. Adelson, "Representing Moving Images with Layers," IEEE Trans. on Image Processing, 3(5):625-638, Sept. 1994.
3. H. J. Zhang et al., "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution," Proc. of ACM Multimedia '95, San Francisco, Nov. 7-9, 1995, pp. 15-24.
4. M. M. Yeung and B. Liu, "Efficient Matching and Clustering of Video Shots," Proc. of ICIP '95, Oct. 1995, pp. 338-341.
5. L. Teodosio and W. Bender, "Salient Video Stills: Content and Context Preserved," Proc. of ACM Multimedia '93, Aug. 1993.
6. "Description of MPEG-4," ISO/IEC JTC1/SC29/WG11 N1410, Oct. 1996.