
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 4, Issue 3, May-June (2013), pp. 432-440
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com


VIDEO INDEXING USING SHOT BOUNDARY DETECTION APPROACH AND SEARCH TRACKS IN VIDEO
Reshma R. Gulwani 1, Sudhirkumar D. Sawarkar 2
1 (Computer Engineering, Ramrao Adik Institute of Technology, Mumbai University, Mumbai, India)
2 (Computer Engineering, Datta Meghe College of Engineering, Mumbai University, Mumbai, India)

ABSTRACT

Video indexing and retrieval are important processes for searching within videos. A shot boundary detection approach is proposed to perform video indexing. To reduce the computational cost, frames that are clearly not shot boundaries are first removed from the original video. Key points are then found by dividing each frame into n*n blocks and applying an average function to each block. A supervised learning classifier, the support vector machine (SVM), is used for keypoint matching to capture different kinds of transitions, such as abrupt (cut) and gradual (fade, wipe, dissolve) transitions. Frames showing transitions are represented in the form of thumbnails. Audio characteristics such as signal energy are used to detect sound (tracks) in videos. The applications chosen for these approaches are CCTV and film videos.

Keywords: Keypoint Extraction, Key Frame Extraction, Shot Boundary Detection, Support Vector Machine (SVM), Video Retrieval.

1. INTRODUCTION

Videos are an important form of multimedia information. Advances in digital and network technology have produced a flood of information; the amount of video information in particular has led to unprecedentedly high volumes of data. When fast-forwarding through videotape, a user searches for an image or sequence similar to the one in their imagination. In more complex cases queries are not that simple, but a system that can locate and present keys relevant to the video content, instead of depending on the user's imagination, will make handling extensive videos easier. The essential issues involve assisting users by extracting

physical features from video data and designing an effective application interface. We can extract physical features to partition video data into useful footage segments and store the segment attribute information, or annotations, as indexes. These indexes should describe the essential information pertaining to the video segments and should be content based. Indexes can be visualized through the interface so users can perform various functions. We extract physical features such as inter-frame differences, motion vectors, and color distributions from image data, obtaining useful indexes related to editing information.

The foundational step of content-based video retrieval is shot boundary detection. A shot is a consecutive sequence of frames captured by one camera action taking place between start and stop operations, which mark the shot boundaries [15]. There is strong content correlation between the frames within a shot, so shots are considered the fundamental units for organizing the contents of video sequences. Shot boundaries can be broadly classified into two types: abrupt transitions and gradual transitions. An abrupt transition is an instantaneous transition from one shot to the subsequent shot. A gradual transition occurs over multiple frames and is generated by applying more elaborate editing effects involving several frames, so that a frame f_i belongs to one shot, a frame f_(i+N) to the second, and the N-1 frames in between represent a gradual transformation of f_i into f_(i+N) [5]. Gradual transitions can be further classified into fade out/in (FOI), dissolve, wipe, and other transitions, according to the characteristics of the different editing effects [1][3].

Many different methods have been proposed in the last few years, such as pixel-by-pixel comparison, histogram-based approaches, and the edge change ratio. In the pixel comparison method, direct pixel comparisons of two consecutive frames are performed; if the number of differing pixels is large enough, the two processed frames are declared to belong to different shots. The pixel-based method is easy and fast, but it is extremely sensitive because it captures every detail of a frame: it is highly sensitive to local motion, camera motion, and minor changes in illumination [1][3]. To handle these drawbacks, several improved methods have been proposed, for example the luminance/color histogram-based method and the edge-based method. The histogram-based method uses the statistics of color/luminance. Xue L et al. [12] proposed a shot boundary detection measure whose features are obtained from the color histogram of the hue and saturation image of the video frame. The advantage of histogram-based shot change detection is that it is quite discriminant, easy to compute, and mostly insensitive to translational, rotational, and zooming camera motions. Its weakness is that it does not incorporate the spatial distribution information of the various colors, so it will fail in cases where frames have similar histograms but different structures [1]. A better tradeoff between pixel and global color histogram methods can be achieved by block-matching methods [6][13], in which each frame is divided into several non-overlapping blocks and the luminance/color histogram feature of each block is extracted.
Edge information is an obvious choice for characterizing images [1][12][14]. The advantage of this feature is that it is sufficiently invariant to illumination changes and several types of motion, and it is related to the human visual perception of a scene. Its main disadvantages are computational cost and noise sensitivity [1].

Our proposed method first uses block color histogram differences between two frames to find the key frames in the original video, in order to reduce the detection time. Key frames are frames that represent the salient content and information, such as the boundaries of a shot. A new frame sequence, NSEQ, is then constructed based on the key frames. Next, features are extracted from each frame of NSEQ to detect shot boundaries. Frames are

divided into n*n blocks, and the keypoint of each block is found by applying an average function. Those keypoints are then matched by a support vector machine (SVM). Furthermore, our system uses different algorithms for different kinds of shot transitions. Frames showing transitions are represented in the form of thumbnails. Audio characteristics such as signal energy are used to detect the sound (tracks) in a video. Experiments are carried out on CCTV videos and film videos.

2. KEYPOINT EXTRACTION

Each frame is divided into n*n blocks. The key points are then found by computing the average of each n*n block in each frame.
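Since the paper gives no implementation, the following is a minimal sketch of this block-averaging step in Python; the grayscale input, the default block count n, and the function name are illustrative assumptions:

```python
import numpy as np

def extract_keypoints(frame, n=8):
    """Divide a grayscale frame into an n*n grid of blocks and return
    the average intensity of each block as that block's keypoint."""
    h, w = frame.shape
    bh, bw = h // n, w // n              # block height and width
    keypoints = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            block = frame[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            keypoints[i, j] = block.mean()
    return keypoints.ravel()             # one flat feature vector per frame
```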

3. SUPPORT VECTOR MACHINE

Having obtained keypoints from two images, an important issue is to find the matched keypoints between them. Traditionally, keypoint matching is computed based on the Euclidean distance between feature vectors; however, this has several difficulties in achieving successful results, so we propose machine learning methods for keypoint matching in this paper. The support vector machine (SVM) is the machine learning method preferred here. The SVM analyzes data and recognizes patterns, and is used for classification and regression analysis. It is a useful technique for data classification, based on the concept of structural risk minimization using the Vapnik-Chervonenkis (VC) dimension [8]. A classification task usually involves training and testing data consisting of data instances; each instance in the training set contains one "target value" (class label) and several "attributes" (features). The goal of the SVM is to produce a model which predicts the target values of the data instances in the testing set, given only their attributes. The keypoints extracted from frames are compared using SVM methods. To train an SVM model for keypoint matching, we have annotated a training set consisting of positive and negative examples:

F = {(x_1, y_1), ..., (x_m, y_m)},

where x_i is an input feature vector and y_i in {1, -1} is the output label. We assume that class label 1 corresponds to correct keypoint matches and class label -1 to incorrect keypoint matches. The number of keypoint matches is regarded as the similarity score of two images, denoted NKM (number of keypoint matches).
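A minimal sketch of this matching scheme, assuming scalar block averages as keypoint features and using scikit-learn's SVC in place of the paper's LIBSVM setup (names and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def train_matcher(keypoint_pairs, labels):
    """Fit an SVM on annotated keypoint pairs. Each sample pairs one
    keypoint from frame A with one from frame B; label 1 marks a
    correct match, -1 an incorrect one (the training set F above)."""
    X = np.asarray(keypoint_pairs, dtype=float)   # shape (m, 2)
    return SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, labels)

def nkm(model, kps_a, kps_b):
    """NKM: number of keypoint matches the classifier accepts between
    two frames, used as the similarity score of the two images."""
    X = np.column_stack([kps_a, kps_b])
    return int(np.sum(model.predict(X) == 1))
```

The count returned by nkm plays the role of the similarity score used by the transition detectors in the next section.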

4. SHOT BOUNDARY DETECTION

It is inefficient and extremely time-consuming to apply the boundary detection process to all frames [4]. Our method therefore removes frames that are clearly not shot boundaries from the original video and processes only those frames that are likely to contain shot boundaries. Different algorithms are used to detect the different kinds of shot transitions.


The details of each detection process are explained in the following sections.

4.1. KEYFRAME EXTRACTION

There is great redundancy among the frames in the same shot; therefore certain frames that best reflect the shot contents are selected as key frames to succinctly represent the shot [9][10][11]. In our paper, the method for key frame extraction consists of three steps. First, each frame is decomposed into n x n blocks.

Step 1: Calculate the block color histogram difference: if the difference in hue value between the same block of two adjacent frames is greater than a threshold, the block color histogram difference is set to 1; otherwise it is set to 0.

Step 2: Calculate the frame color histogram difference of the two adjacent frames: it is computed by adding the block color histogram differences of all the blocks of the two adjacent frames (already calculated in Step 1).

Step 3: If the frame color histogram difference is above a threshold, the frame is judged to be a shot transition candidate. A new sequence, known as NSEQ, is created: assign the value -1 to the new sequence if the frame shows a shot boundary; otherwise assign the value 1.

4.2. CUT TRANSITION

A cut transition is an instantaneous transition from one shot to the subsequent shot, involving just two consecutive frames of different shots. Cut transitions can be detected from the similarity between adjacent frames, found using the SVM approach described above. To detect a cut transition:

Cut(f_i, f_(i+1)) = 1 if NKM(f_i, f_(i+1)) < Threshold, and 0 otherwise.    (1)

If NKM(f_i, f_(i+1)) is lower than Threshold, a cut transition is detected:

If Cut(f_i, f_(i+1)) = 1, then NSEQ(f_i, f_(i+1)) = 1.    (2)
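A sketch of the candidate filtering and the cut test of Eq. (1), under stated assumptions: mean block intensity stands in for the paper's hue histogram comparison, both thresholds are illustrative, and extract_keypoints and nkm refer to the sketches above:

```python
def build_nseq(frames, n=8, block_th=10.0, frame_th=None):
    """Label each adjacent frame pair: -1 = shot-boundary candidate,
    1 = clearly not a boundary (skipped by the later detectors)."""
    if frame_th is None:
        frame_th = (n * n) // 4          # illustrative threshold
    nseq = []
    for a, b in zip(frames, frames[1:]):
        h, w = a.shape
        bh, bw = h // n, w // n
        # count blocks whose difference exceeds the block threshold
        changed = sum(
            abs(float(a[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean())
                - float(b[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean())) > block_th
            for i in range(n) for j in range(n))
        nseq.append(-1 if changed > frame_th else 1)
    return nseq

def detect_cuts(frames, nseq, model, nkm_th=20):
    """Eq. (1): declare a cut where the SVM-based NKM similarity of a
    candidate pair falls below the threshold."""
    return [i for i, flag in enumerate(nseq)
            if flag == -1
            and nkm(model, extract_keypoints(frames[i]),
                    extract_keypoints(frames[i + 1])) < nkm_th]
```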

4.3. FADE TRANSITION DETECTION

A fade in a video sequence is a shot transition in which the first shot gradually disappears (fade out) before the second shot appears (fade in) [1]. During a fade out/in, the two shots are spatially and temporally well separated by some monochrome frames [5]. During a fade-out, the images gradually disappear into a monochrome, often black, image; during a fade-in, the images gradually appear from a monochrome image. Visually, during a fade out the image becomes cloudy [1] until a monochrome frame appears, and during a fade in the image becomes clear. The clearer the image, the larger the number of frame keypoints; the number of frame keypoints is therefore reduced as the image becomes cloudy. For a monochrome frame, the average value of all pixels in the frame is less than a monochrome threshold. When an image becomes clear, the number of frame keypoints increases.

The details of fade-out/in detection are explained in the following. First, determine whether the current frame is monochrome or not, as shown in Eq. (3):

Mono(f_i) = 1 if F(f_i) < MonoThreshold, and 0 otherwise.    (3)

where F(f_i) is the average of all pixels in the current frame. If the current frame is not a monochrome frame, processing is stopped; otherwise, whether the current frame is the starting point of a fade out or the ending point of a fade in is determined. A fade in/out section is detected based on consecutive monotonic increases/decreases in the average frame pixel value. The following formulas are used for the determination: Eq. (4) is for monotonic increases and Eq. (5) is for monotonic decreases.

INCF(f_i, f_(i+1)) = 1 if F(f_i) < F(f_(i+1)), and 0 otherwise.    (4)

DECF(f_i, f_(i+1)) = 1 if F(f_i) > F(f_(i+1)), and 0 otherwise.    (5)
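The fade logic of Eqs. (3)-(5) could be combined as in the following sketch; the monochrome threshold is an assumed value, and frames are taken as grayscale arrays:

```python
def detect_fades(frames, mono_threshold=20.0):
    """Locate fade sections: a run of monotonically decreasing frame
    averages ending in a monochrome frame (fade out, Eq. (5)), or a run
    of monotonically increasing averages starting from one (fade in,
    Eq. (4)). Returns (kind, start_index, end_index) tuples."""
    avgs = [float(f.mean()) for f in frames]
    fades = []
    for i, avg in enumerate(avgs):
        if avg >= mono_threshold:        # Eq. (3): not a monochrome frame
            continue
        # scan backwards over monotonic decreases: fade-out start
        s = i
        while s > 0 and avgs[s - 1] > avgs[s]:
            s -= 1
        if s < i:
            fades.append(('fade-out', s, i))
        # scan forwards over monotonic increases: fade-in end
        e = i
        while e + 1 < len(avgs) and avgs[e + 1] > avgs[e]:
            e += 1
        if e > i:
            fades.append(('fade-in', i, e))
    return fades
```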

4.4. WIPE TRANSITION DETECTION

A wipe is a transition from one shot to another in which the images of the new shot are revealed by a moving boundary. The boundaries can generally be of any geometric shape; most of the time they are lines or sets of lines. It is a shot transition in which one scene or picture gradually enters across the view while another gradually leaves. During a wipe, the appearing and disappearing shots coexist in different spatial regions of the intermediate video frames, and the region occupied by the former grows until it entirely replaces the latter [2]. One important property of the change during a wipe, used to detect all kinds of wipe transitions, is that one portion of a frame matches the starting frame while the remaining portion matches the ending frame. First, the starting point and the ending point of the wipe need to be determined. On a series of frames where NSEQ(f_i) = -1, the starting frame of the series is regarded as the wipe-begin frame F_wb and the ending frame as the wipe-end frame F_we.


To detect the beginning frame of a wipe:

F_wb(f_i) = 1 if NSEQ(f_(i-1)) = 1, NSEQ(f_i) = -1, and NSEQ(f_(i+1)) = -1.    (6)

To detect the ending frame of a wipe:

F_we(f_i) = 1 if NSEQ(f_i) = -1 and NSEQ(f_(i+1)) = 1.    (7)
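Since NSEQ marks boundary candidates with -1, Eqs. (6) and (7) amount to finding the edges of runs of -1; a sketch, where keeping only multi-frame runs is an added assumption to separate wipes from single-pair cuts:

```python
def wipe_boundaries(nseq):
    """Find wipe start/end frames as the edges of runs of -1 in NSEQ.
    Eq. (6) corresponds to a 1 -> -1 edge, Eq. (7) to a -1 -> 1 edge."""
    spans = []
    start = None
    for i, flag in enumerate(nseq):
        if flag == -1 and start is None:
            start = i                        # Eq. (6): run of -1 begins
        elif flag == 1 and start is not None:
            spans.append((start, i - 1))     # Eq. (7): run of -1 ends
            start = None
    if start is not None:
        spans.append((start, len(nseq) - 1))
    return [(s, e) for s, e in spans if e > s]   # keep multi-frame runs
```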

4.5. DISSOLVE TRANSITION DETECTION

A dissolve in a video sequence is a shot transition in which the first shot gradually disappears while the second shot gradually appears [1]. In the proposed method for dissolve detection, we are interested in the similarity between frames that are a specific distance apart from each other. The similarity between two frames is calculated from the difference between their gray values, which is taken as the distance between the frames. Maximum and minimum thresholds are set for the dissolve transition: if the distance between the frames is higher than the maximum threshold, a dissolve transition is detected; otherwise there is no dissolve transition.

Dist(f_i, f_(i+k)) = sqrt( sum_(x,y) (g_i(x,y) - g_(i+k)(x,y))^2 )    (8)

Dissolve = 1 if Dist > dissMaxTh, and -1 if Dist < dissMinTh.    (9)
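A sketch of Eqs. (8)-(9); the per-pixel RMS normalization, the gap k, and both thresholds are illustrative choices, not the paper's values:

```python
import numpy as np

def detect_dissolves(frames, k=10, diss_min_th=5.0, diss_max_th=40.0):
    """Flag dissolves from the gray-value distance between frames k
    apart. Returns (index, 1) where a dissolve is detected and
    (index, -1) where one is ruled out."""
    flags = []
    for i in range(len(frames) - k):
        g1 = frames[i].astype(np.float64)
        g2 = frames[i + k].astype(np.float64)
        dist = np.sqrt(np.mean((g1 - g2) ** 2))   # per-pixel RMS difference
        if dist > diss_max_th:
            flags.append((i, 1))      # dissolve detected
        elif dist < diss_min_th:
            flags.append((i, -1))     # no dissolve
    return flags
```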

5. DETECTION OF SOUND (TRACKS) IN VIDEO

To detect tracks in a video, the audio is first extracted from the video file. Matlab does not support fetching audio from video files directly, so video files such as .avi or .wmv are first converted into .wav files using a third-party utility such as dBpoweramp Music Converter. The .wav file can then be read in Matlab to compute the signal energy. This energy is expected to be high during a song, so a song can be detected in the video based on configured thresholds.
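The paper's pipeline uses Matlab after the dBpoweramp conversion; the following Python sketch shows the same windowed-energy idea (16-bit mono PCM input and the energy threshold are assumptions):

```python
import wave
import numpy as np

def detect_songs(wav_path, window_s=50, energy_th=1e6):
    """Average the signal energy over consecutive 50-second windows and
    flag windows above the threshold as song; energy_th is illustrative,
    the paper does not state its value."""
    with wave.open(wav_path, 'rb') as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()),
                                dtype=np.int16).astype(np.float64)
    win = window_s * rate                 # window length in samples
    return [bool(np.mean(samples[s:s + win] ** 2) > energy_th)
            for s in range(0, len(samples) - win + 1, win)]
```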

6. EXPERIMENTAL RESULTS

In this section, we carry out experiments on CCTV videos and film videos. All experiments are conducted in Matlab. First, some parameters in the experiment must be decided. For SVM classification we use the LIBSVM software developed at National Taiwan University [7]. We have chosen the RBF (Radial Basis Function) kernel for creating the model. The RBF kernel has two parameters, C and gamma, and it is not known beforehand which values are best for a given problem; the goal is to identify a good (C, gamma) pair so that the classifier can accurately predict unknown data. We use cross-validation to obtain C and gamma in this paper.
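A sketch of the (C, gamma) search with cross-validation, using scikit-learn rather than LIBSVM's grid tool; the grid follows the usual exponential ranges from the LIBSVM guide [7], not necessarily the paper's exact values:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_rbf_svm(X, y):
    """Grid search over C and gamma with 5-fold cross-validation,
    mirroring the parameter selection described above."""
    grid = {'C': [2 ** k for k in range(-5, 16, 2)],
            'gamma': [2 ** k for k in range(-15, 4, 2)]}
    search = GridSearchCV(SVC(kernel='rbf'), grid, cv=5)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```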

The following are the different transitions detected in this experiment:

Figure 1. Cut detection

Figure 2. Fade out/in detection

Figure 3. Dissolve detection

Figure 4. Wipe detection

For the audio, we first convert the .wmv or .avi video file into a .wav file to extract the audio, using the dBpoweramp software. To track a song, the average over the audio frames within a continuous 50-second window is computed; if that average is greater than the threshold, a song is detected. We carry out the experiments on two short videos: the first contains only one song, as shown in Fig. 5, and the second contains three songs, as shown in Fig. 6.


[Figure 5 plot: nearbyFrameAvg versus time]

Figure 5. Graph for detecting single song in video

[Figure 6 plot: nearbyFrameAvg versus time]

Figure 6. Graph for detecting three songs in video

7. CONCLUSION

A method is proposed that detects shot boundaries without calculating features for all frames: it skips the processing of frames that are clearly not shot boundaries and calculates features only for the parts of the video that are likely to contain shot boundaries. SVM approaches are used for keypoint matching. Different algorithms are used to capture the characteristics of the different kinds of shot transitions, and sound (tracks) is also detected.


REFERENCES

Journal Papers
[1] C. Cotsaces, N. Nikolaidis, and I. Pitas, Video Shot Detection and Condensed Representation, IEEE Signal Processing Magazine, pp. 28-37, March 2006.
[2] H. H. Yu and W. Wolf, A hierarchical multiresolution video shot transition detection scheme, Computer Vision and Image Understanding, vol. 75, no. 1/2, pp. 196-213, 1999.
[3] J. H. Yuan, H. Y. Wang, and B. Zhang, A formal study of shot boundary detection, IEEE Transactions on Circuits and Systems for Video Technology, 17(2), pp. 168-186, February 2007.
[4] Y. Kawai, H. Sumiyoshi, and N. Yagi, Shot Boundary Detection at TRECVID 2007, in TRECVID 2007 Workshop. http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html
[5] A. Hanjalic, Shot-Boundary Detection: Unraveled and Resolved?, IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 2, pp. 90-105, 2002.
[6] J. Bescós, G. Cisneros, J. M. Martínez, J. M. Menéndez, and J. Cabrera, A Unified Model for Techniques on Video-Shot Transition Detection, IEEE Transactions on Multimedia, 7(2), pp. 293-306, April 2005.
[7] C. W. Hsu, C. C. Chang, and C. J. Lin, A Practical Guide to Support Vector Classification, http://www.csie.ntu.edu.tw/~cjlin
[8] V. Vapnik, Statistical Learning Theory (John Wiley, New York, 1998).
[9] K. W. Sze, K. M. Lam, and G. P. Qiu, A new key frame representation for video segment retrieval, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 9, pp. 1148-1155, September 2005.
[10] B. T. Truong and S. Venkatesh, Video abstraction: A systematic review and classification, ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 3, no. 1, art. 3, pp. 1-37, February 2007.
[11] D. P. Mukherjee, S. K. Das, and S. Saha, Key frame estimation in video using randomness measure of feature point pattern, IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 5, pp. 612-620, May 2007.

Proceedings Papers
[12] L. Xue, C. Li, H. Li, and Z. Xiong, A general method for shot boundary detection, in Proceedings of the 2008 International Conference on Multimedia and Ubiquitous Engineering, pp. 394-397, 2008.
[13] Z. P. Zong, K. Liu, and J. H. Peng, Shot Boundary Detection Based on Histogram of Mismatching-Pixel Count of FMB, in Proceedings of ICIEA 2006, pp. 24-26, 2006.
[14] H. Zhao and X. H. Li, Shot Boundary Detection Based on Mutual Information and Canny Edge Detector, in Proceedings of the 2008 International Conference on Computer Science and Software Engineering, pp. 1124-1128, 2008.
[15] C. H. Yeo, Y. W. Zhu, Q. B. Sun, and S. F. Chang, A framework for sub-window shot detection, in Proc. Int. Multimedia Modelling Conf., pp. 84-91, January 2005.
