
Query Processing for Video Subsequence Identification

A Project report submitted in partial fulfillment of the requirements for the award of the Degree of Master of Technology in Computer Science

By M. HARI CHARAN GUPTA Roll No: 07012D0507 Under the guidance of Mrs. SUPREETHI K P Assistant Professor

Department of Computer Science and Engineering JNTUH College of Engineering, Hyderabad Jawaharlal Nehru Technological University Hyderabad Kukatpally, Hyderabad-500 085. 2011

Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad College of Engineering, Kukatpally, Hyderabad-500 085.

DECLARATION BY THE CANDIDATE


I, Mr. M. Hari Charan Gupta, bearing Roll No 07012D0507, hereby certify that the project report entitled Query Processing for Video Subsequence Identification, carried out under the guidance of Mrs. Supreethi K P is submitted in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Science. This is a record of bonafide work carried out by me and the results embodied in this project have not been reproduced/copied from any source. The results embodied in this project report have not been submitted to any other university or institute for the award of any other degree or diploma.

M. Hari Charan Gupta Roll No: 07012D0507 Department of Computer Science & Engineering, JNTUH College of Engineering, Hyderabad.

Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad College of Engineering, Kukatpally, Hyderabad-500 085.

CERTIFICATE BY THE SUPERVISOR


This is to certify that the project report entitled Query Processing for Video Subsequence Identification, being submitted by M. Hari Charan Gupta, having Roll No: 07012D0507, in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Science, is a record of bonafide work carried out by him. The results embodied in this project report have not been submitted to any other university or institute for the award of any other degree or diploma.

Mrs. Supreethi K P Assistant Professor, Department of Computer Science & Engineering, JNTUH College of Engineering, Hyderabad. Date:

Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad College of Engineering, Kukatpally, Hyderabad-500 085.

CERTIFICATE BY THE HEAD


This is to certify that the project report entitled Query Processing for Video Subsequence Identification, being submitted by Mr. M. Hari Charan Gupta, having Roll No: 07012D0507, in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Science, is a record of bonafide work carried out by him.

Dr. O.B.V. Ramanaiah Professor & Head of the Department, Department of Computer Science & Engineering, JNTUH college of Engineering, Hyderabad. Date:

ACKNOWLEDGEMENTS

It is with great reverence that I wish to express my sincere gratitude towards Mrs. Supreethi K P, Assistant Professor, JNTUH College of Engineering, JNTUH Hyderabad, for her astute guidance, constant motivation and trust, without which this work would never have been possible. I am sincerely indebted to her for her constructive criticism and suggestions for improvement at various stages of the work. I would like to express my sincere thanks to Dr. O.B.V. Ramanaiah, Professor and Head of the Department of Computer Science & Engineering, JNTUH College of Engineering, JNTUH Hyderabad, for his kind cooperation in the completion of this project. I am grateful to my family for their perennial inspiration. Last but not least, I would like to thank all my friends and batch mates.

M. Hari Charan Gupta

Abstract
In the current Internet community, secure data transfer is limited by attacks made on data communication, so more robust methods are needed to ensure secure data transfer. One solution that has come to the rescue is audio steganography. However, existing audio steganographic systems have poor interfaces and very low-level implementations, are difficult to understand, and are valid only for certain audio formats with a restricted message size. Enhanced Audio Steganography (EAS) is a proposed system, based on audio steganography and cryptography, that ensures secure data transfer between the source and the destination. EAS uses a powerful encryption algorithm in the first level of security, which is very complex to break. In the second level it uses a more powerful modified LSB (Least Significant Bit) algorithm to encode the message into audio, performing bit-level manipulation to encode the message. The basic idea behind this work is to provide a good, efficient method for hiding data from hackers so that it reaches the destination in a safer manner. Although it is well-modulated software, it is limited by certain restrictions: the quality of the sound depends on the size of the audio the user selects and on the length of the message. Although it shows bit-level deviations in the frequency chart, as a whole the change in the audio cannot be detected.


Contents

1. Introduction .......... 2
   1.1 Motivation .......... 2
   1.2 Problem Definition .......... 3
   1.3 Organization of Report .......... 3

2. Literature Survey .......... 5
   2.1 Video .......... 5
      2.1.1 Number of Frames per Second .......... 6
      2.1.2 Interlacing .......... 6
      2.1.3 Display Resolution .......... 6
      2.1.4 Color Space and Bits per Pixel .......... 7
      2.1.5 Aspect Ratio .......... 7
      2.1.6 Video Compression Method .......... 8
   2.2 Video Formats .......... 8
      2.2.1 AVI .......... 8
      2.2.2 MP4 .......... 9
      2.2.3 MOV .......... 9
      2.2.4 WMV .......... 10
      2.2.5 3GP .......... 10
      2.2.6 3G2 .......... 10
      2.2.7 FLV .......... 10
   2.3 Video Splitting Strategies .......... 11
   2.4 Image File Formats .......... 11
      2.4.1 JPEG/JFIF .......... 13
      2.4.2 JPEG 2000 .......... 13
      2.4.3 Exif .......... 13
      2.4.4 TIFF .......... 14
      2.4.5 RAW .......... 14
      2.4.6 GIF .......... 14
      2.4.7 PNG .......... 15
      2.4.8 BMP .......... 15
      2.4.9 PPM, PGM, PBM, PNM .......... 15
   2.5 Distance Measures .......... 16
      2.5.1 Minkowski Distance Measure for Numeric Attributes .......... 16
      2.5.2 Euclidean Distance Measure for Numeric Attributes .......... 17
      2.5.3 Manhattan Distance Measure for Numeric Attributes .......... 17
      2.5.4 Distance Measures for Binary Attributes .......... 18
      2.5.5 Distance Measures for Nominal Attributes .......... 19
      2.5.6 Distance Metrics for Ordinal Attributes .......... 20
      2.5.7 Distance Metrics for Mixed-Type Attributes .......... 20
      2.5.8 Earth Mover's Distance .......... 22
   2.6 Time Series Similarity Measures .......... 25
      2.6.1 Euclidean Distances and Lp Norms .......... 25
      2.6.2 Dynamic Time Warping .......... 27
      2.6.3 Longest Common Subsequence Similarity .......... 28
      2.6.4 Probabilistic Methods .......... 30
      2.6.5 General Transformations .......... 30
   2.7 Bipartite Graph .......... 31
   2.8 k-Nearest Neighborhood .......... 32
   2.9 Existing System .......... 34
      2.9.1 Video Copy Detection .......... 34
      2.9.2 Visual Similarities of Near Duplicates .......... 35
      2.9.3 Retrieving Entries in a Broad Sense of High-Level Concept Similarity .......... 36

3. Proposed System .......... 40
   3.1 Video Subsequence Identification .......... 40
   3.2 Applications of Video Subsequence Identification .......... 41
      3.2.1 Recognition for Copyright Enforcement .......... 41
      3.2.2 TV Commercial Detection .......... 42

4. Implementation .......... 44
   4.1 Video Fragmentation .......... 44
      4.1.1 Supported Video Types .......... 44
      4.1.2 Capturing Video Properties .......... 44
      4.1.3 Capturing Images from Video at Periodic Intervals .......... 44
      4.1.4 Image File Type Supported .......... 45
   4.2 Similar Frames Retrieval Using K-Nearest Neighborhood .......... 45
   4.3 Graph Transformation and Matching .......... 47
      4.3.1 Bipartite Graph Transformation .......... 47
      4.3.2 Dense Segment Extraction .......... 50
      4.3.3 Filtering by Maximum Size Matching .......... 52
      4.3.4 Refinement by Sub-Maximum Similarity Matching .......... 54
   4.4 Identifying Configurable Thresholds .......... 55
   4.5 Implementation of Graphical User Interface .......... 56

5. Testing and Results .......... 59
   5.1 Test Plan Approach .......... 59
   5.2 Results .......... 61
      5.2.1 Test Case 1 Results .......... 61
      5.2.2 Test Case 2 Results .......... 64
      5.2.3 Test Case 3 Results .......... 67
      5.2.4 Test Case 4 Results .......... 70
      5.2.5 Test Case 5 Results .......... 73
      5.2.6 Test Case 6 Results .......... 76
      5.2.7 Test Case 7 Results .......... 79

6. Conclusion .......... 82
   6.1 Future Work .......... 82

References .......... 83
Appendix A: openCV APIs .......... 84
Appendix B: Definitions/Acronyms .......... 85
Appendix C: Software and Hardware Specification .......... 86

List of Figures

Figure 2.1 The Intuition Behind the Euclidean Distance Metric .......... 26
Figure 2.2 Two Time Series that Require a Warping Measure .......... 27
Figure 2.3 Simple Bipartite Graph .......... 31
Figure 4.4 Construction of Bipartite Graph .......... 48
Figure 4.5 1:M Mapping .......... 52
Figure 4.6 1:1 Mapping .......... 53
Figure 4.7 Graphical User Interface .......... 57
Figure 5.8 Test Plan Approach .......... 60
Figure 5.9 GUI Screen Showing the Test Case 1 Input Parameters .......... 61
Figure 5.10 Results of Test Case 1 (screen 1 of 2) .......... 62
Figure 5.11 Results of Test Case 1 (screen 2 of 2) .......... 63
Figure 5.12 GUI Screen Showing the Test Case 2 Input Parameters .......... 64
Figure 5.13 Results of Test Case 2 (screen 1 of 2) .......... 65
Figure 5.14 Results of Test Case 2 (screen 2 of 2) .......... 66
Figure 5.15 GUI Screen Showing the Test Case 3 Input Parameters .......... 67
Figure 5.16 Results of Test Case 3 (screen 1 of 2) .......... 68
Figure 5.17 Results of Test Case 3 (screen 2 of 2) .......... 69
Figure 5.18 GUI Screen Showing the Test Case 4 Input Parameters .......... 70
Figure 5.19 Results of Test Case 4 (screen 1 of 2) .......... 71
Figure 5.20 Results of Test Case 4 (screen 2 of 2) .......... 72
Figure 5.21 GUI Screen Showing the Test Case 5 Input Parameters .......... 73
Figure 5.22 Results of Test Case 5 (screen 1 of 2) .......... 74
Figure 5.23 Results of Test Case 5 (screen 2 of 2) .......... 75
Figure 5.24 GUI Screen Showing the Test Case 6 Input Parameters .......... 76
Figure 5.25 Results of Test Case 6 (screen 1 of 2) .......... 77
Figure 5.26 Results of Test Case 6 (screen 2 of 2) .......... 78
Figure 5.27 GUI Screen Showing the Test Case 7 Input Parameters .......... 79
Figure 5.28 Results of Test Case 7 (screen 1 of 1) .......... 80

CHAPTER 1 INTRODUCTION

1. Introduction
Video subsequence identification aims at finding whether any subsequence of a long database video shares content similar to that of a query clip.

1.1 Motivation
With the growing demand for visual information of rich content, effective and efficient manipulation of large video databases is increasingly desired. Many investigations have been made into content-based video retrieval. However, despite its importance, video subsequence identification, which is to find content similar to a short query clip within a long video sequence, has not been well addressed.

Nowadays, the rapid advances in multimedia and network technologies popularize many applications of video databases, and sophisticated techniques for representing, matching, and indexing videos are in high demand. A video sequence is an ordered set of a large number of frames, and from the database research perspective, each frame is usually represented by a high-dimensional vector extracted from some low-level content features, such as color distribution, texture pattern, or shape structure within the original media domain. Matching of videos is therefore often translated into searches among these feature vectors. In practice, it is often undesirable to manually check whether a video is part of a long stream by browsing its entire length, so a reliable solution for automatically finding similar content is imperative.

The conventional video retrieval task returns similar clips from a large collection of videos which have been either chopped up into similar lengths or cut at content boundaries; the clips to be searched have already been segmented and are always ready for similarity ranking. The video subsequence identification task, in contrast, aims at finding whether any subsequence of a long database video shares content similar to a query clip. Because the boundary, and even the length, of the target subsequence is not available initially, which fragments should be evaluated for similarity cannot be known in advance.

1.2 Problem Definition
The aim of the project is to develop an application that locates the position of the part most similar to a user-specified query clip Q within a long prestored video sequence S. The video subsequence identification task aims at finding whether any subsequence of a long database video shares content similar to a query clip. The thresholds used for video similarity search need to be configurable so that the similarity search can be adjusted based on user feedback.

1.3 Organization of Report
This report is divided into the following chapters.
Chapter 2 - Provides an overview of the literature survey and details of the existing systems relevant to this project.
Chapter 3 - Provides details of the proposed system.
Chapter 4 - Provides the design and implementation details of the project.
Chapter 5 - Deals with testing and results of the proposed system.
Chapter 6 - Gives the conclusions drawn from the results and the scope for further work, followed by references and Appendices A, B and C.

CHAPTER 2 LITERATURE SURVEY

2. Literature Survey
Literature Survey elaborates on Video properties, Video formats, Video splitting strategies, Image file formats, Distance Measures, Time Series Similarity Measures, Bipartite Graph, k-Nearest Neighborhood and Existing System.

2.1 Video
Video [4] is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion. The term video is derived from the Latin videre, "to see". Video commonly refers to several storage formats for moving pictures: digital video formats, including Blu-ray Disc, DVD, QuickTime, and MPEG-4; and analog videotapes, including VHS and Betamax. Video can be recorded and transmitted in various physical media: on magnetic tape when recorded as PAL or NTSC electric signals by video cameras, or in MPEG-4 or DV digital media when recorded by digital cameras. The quality of video essentially depends on the capturing method and the storage used. A video stream has the following characteristics:
Number of Frames per Second
Interlacing
Display Resolution
Aspect Ratio
Color Space and Bits per Pixel
Video Quality
Video Compression Method
Bit Rate

2.1.1 Number of Frames per Second
Frame rate, the number of still pictures per unit of time of video, ranges from six or eight frames per second (frame/s) for old mechanical cameras to 120 or more frames per second for new professional cameras. The PAL (Europe, Asia, Australia, etc.) and SECAM (France, Russia, parts of Africa, etc.) standards specify 25 frame/s, while NTSC (USA, Canada, Japan, etc.) specifies 29.97 frame/s. The minimum frame rate required to achieve the illusion of a moving image is about fifteen frames per second.

2.1.2 Interlacing
Video can be interlaced or progressive. Interlacing was invented as a way to achieve good visual quality within the limitations of a narrow bandwidth. The horizontal scan lines of each interlaced frame are numbered consecutively and partitioned into two fields: the odd field (upper field) consisting of the odd-numbered lines and the even field (lower field) consisting of the even-numbered lines. NTSC, PAL and SECAM are interlaced formats. The PAL video format is often specified as 576i50, where 576 indicates the vertical line resolution, i indicates interlacing, and 50 indicates 50 fields (half-frames) per second. In progressive scan systems, each refresh period updates all of the scan lines. The result is a higher spatial resolution and a lack of various artifacts that can make parts of a stationary picture appear to be moving or flashing.

2.1.3 Display Resolution
The size of a video image is measured in pixels for digital video, or in horizontal scan lines and vertical lines of resolution for analog video. In the digital domain, standard-definition television (SDTV) is specified as 720/704/640×480i60 for NTSC and 768/720×576i50 for PAL or SECAM resolution. New high-definition televisions (HDTV) are capable of resolutions up to 1920×1080p60, i.e. 1920 pixels per scan line by 1080 scan lines, progressive, at 60 frames per second.

2.1.4 Color Space and Bits per Pixel
In digital imaging, a pixel is the smallest addressable screen element in a display device; it is the smallest unit of picture that can be represented or controlled. Each pixel has its own address, which corresponds to its coordinates. Pixels are normally arranged in a two-dimensional grid and are often represented using dots or squares. Each pixel is a sample of an original image; more samples typically provide a more accurate representation of the original. The intensity of each pixel is variable. In color image systems, a color is typically represented by three or four component intensities such as red, green, and blue, or cyan, magenta, yellow, and black. The number of distinct colors that can be represented by a pixel depends on the number of bits per pixel (bpp). A 1 bpp image uses 1 bit for each pixel, so each pixel can be either ON or OFF. Each additional bit doubles the number of colors available, so a 2 bpp image can have 4 colors and a 3 bpp image can have 8 colors.

1 bpp: 2^1 = 2 colors (monochrome)
2 bpp: 2^2 = 4 colors
3 bpp: 2^3 = 8 colors
8 bpp: 2^8 = 256 colors
16 bpp: 2^16 = 65,536 colors (High Color)
24 bpp: 2^24 ≈ 16.8 million colors (True Color)

2.1.5 Aspect Ratio Aspect ratio describes the dimensions of video screens and video picture elements. All popular video formats are rectilinear, and so can be described by a ratio between width and height. The screen aspect ratio of a traditional television screen is 4:3, or about 1.33:1. High definition televisions use an aspect ratio of 16:9, or about 1.78:1.

Ratios where the height is greater than the width are uncommon in general everyday use, but do have application in computer systems where the screen may be better suited to a vertical layout. The most common tall aspect ratio of 3:4 is referred to as portrait mode and is created by physically rotating the display device 90 degrees from the normal position.

2.1.6 Video Compression Method
A wide variety of methods are used to compress video streams. Video data contains spatial and temporal redundancy, making uncompressed video streams extremely inefficient. Broadly speaking, spatial redundancy is reduced by registering differences between parts of a single frame; this task is known as intraframe compression and is closely related to image compression. Likewise, temporal redundancy can be reduced by registering differences between frames; this task is known as interframe compression, and includes motion compensation and other techniques.

2.2 Video Formats
Following are some of the video formats [5]:
AVI (Audio Video Interleave)
MP4
MOV
WMV (Windows Media Video)
3GP
3G2
FLV (Flash Video)

2.2.1 AVI
AVI is a multimedia container format introduced by Microsoft in November 1992 as part of its Video for Windows technology. AVI files can contain both audio and video data in a file container that allows synchronous audio-with-video playback. Like the DVD video format, AVI files support multiple audio and video streams, although these features are seldom used. AVI is a derivative of the Resource Interchange File Format (RIFF), which divides a file's data into blocks, or chunks. Each chunk is identified by a FourCC tag. An AVI file takes the form of a single chunk in a RIFF-formatted file, which is then subdivided into two mandatory chunks and one optional chunk. The first sub-chunk is identified by the hdrl tag; this sub-chunk is the file header and contains metadata about the video, such as its width, height and frame rate. The second sub-chunk is identified by the movi tag; this chunk contains the actual audio/visual data that make up the AVI movie. The third, optional sub-chunk is identified by the idx1 tag, which indexes the offsets of the data chunks within the file. By way of the RIFF format, the audio-visual data contained in the movi chunk can be encoded or decoded by software called a codec, an abbreviation of (en)coder/decoder. Upon creation of the file, the codec translates between raw data and the (compressed) data format used inside the chunk.

2.2.2 MP4 MP4 is a multimedia container format standard specified as a part of MPEG-4. It is most commonly used to store digital video and digital audio streams, especially those defined by MPEG, but can also be used to store other data such as subtitles and still images. Like most modern container formats, MP4 allows streaming over the Internet. A separate hint track is used to include streaming information in the file.

2.2.3 MOV
The native file format for QuickTime video specifies a multimedia container file that contains one or more tracks, each of which stores a particular type of data: audio, video, effects, or text (e.g. for subtitles). Each track either contains a digitally encoded media stream (using a specific format) or a data reference to the media stream located in another file. The ability to contain abstract data references for the media data, and the separation of the media data from the media offsets and the track edit lists, means that QuickTime is particularly suited for editing, as it is capable of importing and editing in place (without data copying).

2.2.4 WMV Windows Media Video (WMV) is a video compression format for several proprietary codecs developed by Microsoft. The original video format, known as WMV, was originally designed for Internet streaming applications, as a competitor to RealVideo. The other formats, such as WMV Screen and WMV Image, cater for specialized content.

2.2.5 3GP 3GP (3GPP file format) is a multimedia container format defined by the Third Generation Partnership Project (3GPP) for 3G UMTS multimedia services. It is used on 3G mobile phones but can also be played on some 2G and 4G phones.

2.2.6 3G2 3G2 (3GPP2 file format) is a multimedia container format defined by the 3GPP2 for 3G CDMA2000 multimedia services. It is very similar to the 3GP file format, but has some extensions and limitations in comparison to 3GP.

2.2.7 FLV
Flash Video is a container file format used to deliver video over the Internet using Adobe Flash Player versions 6 through 10. Flash Video content may also be embedded within SWF files. There are two different video file formats known as Flash Video: FLV and F4V. The audio and video data within FLV files are encoded in the same way as they are within SWF files. The latter F4V file format is based on the ISO base media file format and is supported starting with Flash Player 9 update 3. Both formats are supported in Adobe Flash Player and are currently developed by Adobe Systems. FLV was originally developed by Macromedia.

2.3 Video Splitting Strategies


A video sequence is an ordered set of a large number of frames. Each frame is usually represented by a high-dimensional vector, which has been extracted from some low-level content features, such as color distribution, texture pattern, or shape structure within the original media domain. Videos can be split into images (frames) by the following techniques:
Off-the-shelf tools that can split video files into images. Aoao Video to Picture Converter is one such tool. Evaluation or free downloadable versions of these tools place a watermark on the images; if the full version is purchased, no watermark is placed on the images.
OpenCV (Open Source Computer Vision) [12], a library of programming functions for real-time computer vision. This library can be used to split video files into images, as illustrated in the sketch below.
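As an illustration of the second approach, the following is a minimal Python sketch that saves one frame per fixed time interval using the OpenCV (cv2) Python bindings. It is not taken from the project code; the function name, the output file naming, the one-second interval, and the use of cv2.CAP_PROP_FPS (available in recent OpenCV releases) are assumptions made for the example.

```python
# Illustrative sketch only: split a video into periodically sampled frames
# with the OpenCV Python bindings. Names and parameters are hypothetical.
import cv2

def split_video(video_path, out_prefix="frame", interval_sec=1.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back to 25 fps if the property is unavailable
    step = max(int(round(fps * interval_sec)), 1)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                             # end of stream or read error
            break
        if index % step == 0:                  # keep one frame per interval
            cv2.imwrite(f"{out_prefix}_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```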

2.4 Image File Formats


Image file formats are standardized means of organizing and storing digital images. Image files are composed of either pixel or vector (geometric) data. The pixels that constitute an image are ordered as a grid (columns and rows). Each pixel consists of numbers representing magnitudes of brightness and color. Image file size is expressed as the number of bytes. Image file size increases with the number of pixels composing an image, and the color depth of the pixels. The greater the number of rows and columns, the greater the image resolution, and the larger the file.


Image compression uses algorithms to decrease the size of a file. High-resolution cameras produce large image files, ranging from hundreds of kilobytes to megabytes, depending on the camera's resolution and the image-storage format. High-resolution digital cameras record images of 12 megapixels (1 MP = 1,000,000 pixels) or more in truecolor. Faced with such large file sizes, both within the camera and on a storage disc, image file formats were developed to store these large images. There are two types of image file compression algorithms:
Lossless compression algorithms reduce file size without losing image quality, though they do not compress into as small a file as a lossy compression algorithm. When image quality is valued above file size, lossless algorithms are typically chosen.
Lossy compression algorithms take advantage of the inherent limitations of the human eye and discard invisible information. Most lossy compression algorithms allow for variable quality levels (compression), and as these levels are increased, file size is reduced. At the highest compression levels, image deterioration becomes noticeable as compression artifacting.
Following are some of the image formats [6]:
JPEG/JFIF
JPEG 2000
Exif
TIFF
RAW
GIF
PNG
BMP
PPM, PGM, PBM, PNM

2.4.1 JPEG/JFIF JPEG (Joint Photographic Experts Group) is a compression method. JPEG-compressed images are usually stored in the JFIF (JPEG File Interchange Format) file format. JPEG compression is (in most cases) lossy compression. The JPEG/JFIF filename extension is JPG or JPEG. Nearly every digital camera can save images in the JPEG/JFIF format, which supports 8 bits per color (red, green, blue) for a 24-bit total, producing relatively small files. The compression does not noticeably detract from the image's quality, but JPEG files suffer generational degradation when repeatedly edited and saved. The JPEG/JFIF format also is used as the image compression algorithm in many PDF files.

2.4.2 JPEG 2000 JPEG 2000 is a compression standard enabling both lossless and lossy storage. The compression methods used are different from the ones in standard JFIF/JPEG. They improve quality and compression ratios, but also require more computational power to process. JPEG 2000 also adds features that are missing in JPEG. It is not nearly as common as JPEG, but it is used currently in professional movie editing and distribution.

2.4.3 Exif The Exif (Exchangeable image file format) format is a file standard similar to the JFIF format with TIFF extensions. It is incorporated in the JPEG-writing software used in most cameras. Its purpose is to record and to standardize the exchange of images with image metadata between digital cameras and editing and viewing software. The metadata are recorded for individual images and include such things as camera settings, time and date, shutter speed, exposure, image size, compression, name of camera, color information, etc. When images are viewed or edited by image editing software, all of this image information can be displayed.


2.4.4 TIFF The TIFF (Tagged Image File Format) format is a flexible format that normally saves 8 bits or 16 bits per color (red, green, blue) for 24-bit and 48-bit totals respectively. It usually uses either the TIFF or TIF filename extension. TIFF's flexibility can be both an advantage and disadvantage, since a reader that reads every type of TIFF file does not exist. TIFFs can be lossy and lossless. Some offer relatively good lossless compression for bi-level (black&white) images. Some digital cameras can save in TIFF format, using the LZW compression algorithm for lossless storage. TIFF image format is not widely supported by web browsers. TIFF remains widely accepted as a photograph file standard in the printing business. TIFF can handle device-specific color spaces, such as the CMYK defined by a particular set of printing press inks.

2.4.5 RAW RAW refers to a family of raw image formats that are options available on some digital cameras. These formats usually use a lossless or nearly-lossless compression, and produce file sizes much smaller than the TIFF formats of full-size processed images from the same cameras. Although there is a standard raw image format, the raw formats used by most cameras are not standardized or documented, and differ among camera manufacturers. Many graphic programs and image editors may not accept some or all of them, and some older ones have been effectively orphaned already. Adobe's Digital Negative (DNG) specification is an attempt at standardizing a raw image format to be used by cameras, or for archival storage of image data converted from undocumented raw image formats, and is used by several niche and minority camera manufacturers including Pentax, Leica, and Samsung.

2.4.6 GIF
GIF (Graphics Interchange Format) is limited to an 8-bit palette, or 256 colors. This makes the GIF format suitable for storing graphics with relatively few colors such as simple diagrams, shapes, logos and cartoon-style images. The GIF format supports animation and is still widely used to provide image animation effects. It also uses a lossless compression that is more effective when large areas have a single color, and ineffective for detailed images or dithered images.

2.4.7 PNG The PNG (Portable Network Graphics) file format was created as the free, open-source successor to the GIF. The PNG file format supports truecolor (16 million colors) while the GIF supports only 256 colors. The PNG file excels when the image has large, uniformly colored areas. The lossless PNG format is best suited for editing pictures, and the lossy formats, like JPG, are best for the final distribution of photographic images, because in this case JPG files are usually smaller than PNG files. PNG provides a patent-free replacement for GIF and can also replace many common uses of TIFF. Indexed-color, grayscale, and truecolor images are supported, plus an optional alpha channel. PNG is designed to work well in online viewing applications like web browsers so it is fully streamable with a progressive display option. PNG is robust, providing both full file integrity checking and simple detection of common transmission errors.

2.4.8 BMP The BMP file format (Windows bitmap) handles graphics files within the Microsoft Windows OS. Typically, BMP files are uncompressed. So, they are large. The advantage is their simplicity and wide acceptance in Windows programs.

2.4.9 PPM, PGM, PBM, PNM
The Netpbm format is a family including the portable pixmap file format (PPM), the portable graymap file format (PGM) and the portable bitmap file format (PBM). These are either pure ASCII files or raw binary files with an ASCII header that provide very basic functionality and serve as a lowest common denominator for converting pixmap, graymap, or bitmap files between different platforms. Several applications refer to them collectively as the PNM format (Portable Any Map).

2.5 Distance Measures
Answering queries based on objects that are alike but not necessarily identical is known as similarity search. It has been widely used in image retrieval, where it is required to determine whether two images are similar or dissimilar. Distance measures [1, 2, 3, 9, 10] are used to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances $x_i$ and $x_j$ as $d(x_i, x_j)$. A valid distance measure should be symmetric and should attain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:

Triangle inequality: $d(x_i, x_k) \leq d(x_i, x_j) + d(x_j, x_k)$ for all $x_i, x_j, x_k \in S$.

Identity of indiscernibles: $d(x_i, x_j) = 0 \Rightarrow x_i = x_j$ for all $x_i, x_j \in S$.

2.5.1 Minkowski Distance Measure for Numeric Attributes
Given two p-dimensional instances, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, the distance between the two data instances can be calculated using the Minkowski metric:

$$d(x_i, x_j) = \left( |x_{i1} - x_{j1}|^g + |x_{i2} - x_{j2}|^g + \cdots + |x_{ip} - x_{jp}|^g \right)^{1/g}$$

The commonly used Euclidean distance between two objects is obtained when g = 2. Given g = 1, the sum of absolute paraxial distances (Manhattan metric) is obtained, and with g = ∞ one gets the greatest of the paraxial distances. The measurement unit used can affect the clustering analysis. To avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. However, if each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

$$d(x_i, x_j) = \left( w_1 |x_{i1} - x_{j1}|^g + w_2 |x_{i2} - x_{j2}|^g + \cdots + w_p |x_{ip} - x_{jp}|^g \right)^{1/g}, \qquad w_i \in [0, \infty)$$
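The following is a minimal Python sketch of the (optionally weighted) Minkowski distance; it is illustrative only, with hypothetical function and variable names. Setting g = 2 yields the Euclidean metric and g = 1 the Manhattan metric discussed in the next two subsections.

```python
# Minimal sketch of the weighted Minkowski distance between two p-dimensional instances.
def minkowski(x_i, x_j, g=2, weights=None):
    if weights is None:
        weights = [1.0] * len(x_i)            # unweighted case: all weights equal to 1
    total = sum(w * abs(a - b) ** g for w, a, b in zip(weights, x_i, x_j))
    return total ** (1.0 / g)

# Example: Euclidean (g = 2) and Manhattan (g = 1) distances between two made-up vectors.
print(minkowski([1.0, 2.0, 3.0], [4.0, 0.0, 3.0], g=2))   # ~3.606
print(minkowski([1.0, 2.0, 3.0], [4.0, 0.0, 3.0], g=1))   # 5.0
```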

2.5.2 Euclidean Distance Measure for Numeric Attributes
Given two p-dimensional instances, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, the distance between the two data instances can be calculated using the Euclidean metric:

$$d(x_i, x_j) = \left( |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2 \right)^{1/2}$$

If each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

$$d(x_i, x_j) = \left( w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_p |x_{ip} - x_{jp}|^2 \right)^{1/2}, \qquad w_i \in [0, \infty)$$

2.5.3 Manhattan Distance Measure for Numeric Attributes
Given two p-dimensional instances, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$, the distance between the two data instances can be calculated using the Manhattan metric:

$$d(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

If each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

$$d(x_i, x_j) = w_1 |x_{i1} - x_{j1}| + w_2 |x_{i2} - x_{j2}| + \cdots + w_p |x_{ip} - x_{jp}|, \qquad w_i \in [0, \infty)$$

2.5.4 Distance Measures for Binary Attributes
The distance measures above may be easily computed for continuous-valued attributes. In the case of instances described by categorical, binary, ordinal or mixed-type attributes, the distance measure should be revised. In the case of binary attributes, the distance between objects may be calculated based on a contingency table. A binary attribute is symmetric if both of its states are equally valuable. In that case, dissimilarity between two objects can be assessed using the simple matching coefficient:

$$d(x_i, x_j) = \frac{r + s}{q + r + s + t}$$

where q is the number of attributes that equal 1 for both objects, t is the number of attributes that equal 0 for both objects, and r and s are the numbers of attributes that are unequal between the two objects.


A binary attribute is asymmetric if its states are not equally important (usually the positive outcome is considered the more important one). In this case, the denominator ignores the unimportant negative matches (t). This is called the Jaccard coefficient:

$$d(x_i, x_j) = \frac{r + s}{q + r + s}$$
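Both binary-attribute measures can be computed directly from the q, r, s, t counts. The following Python sketch is illustrative only (names are hypothetical): it returns the simple matching coefficient for symmetric attributes and the Jaccard coefficient for asymmetric ones.

```python
# Sketch of the contingency-table distances for binary attribute vectors (values 0/1).
def binary_distance(x_i, x_j, symmetric=True):
    q = sum(1 for a, b in zip(x_i, x_j) if a == 1 and b == 1)   # both 1
    r = sum(1 for a, b in zip(x_i, x_j) if a == 1 and b == 0)   # 1 in x_i, 0 in x_j
    s = sum(1 for a, b in zip(x_i, x_j) if a == 0 and b == 1)   # 0 in x_i, 1 in x_j
    t = sum(1 for a, b in zip(x_i, x_j) if a == 0 and b == 0)   # both 0
    denom = (q + r + s + t) if symmetric else (q + r + s)       # Jaccard drops t
    return (r + s) / denom if denom else 0.0
```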
2.5.5 Distance Measures for Nominal Attributes
When the attributes are nominal, two main approaches may be used:

Simple matching:

$$d(x_i, x_j) = \frac{p - m}{p}$$

where p is the total number of attributes and m is the number of matches.


Creating a binary attribute for each state of each nominal attribute and computing their dissimilarity as described above.

2.5.6 Distance Metrics for Ordinal Attributes When the attributes are ordinal, the sequence of the values is meaningful. In such cases, the attributes can be treated as numeric ones after mapping their range onto [0,1]. Such mapping may be carried out as follows:

$$z_{i,n} = \frac{r_{i,n} - 1}{M_n - 1}$$

where $z_{i,n}$ is the standardized value of attribute $a_n$ of object i, $r_{i,n}$ is that value before standardization, and $M_n$ is the upper limit of the domain of attribute $a_n$ (assuming the lower limit is 1).

2.5.7 Distance Metrics for Mixed-Type Attributes
In cases where the instances are characterized by attributes of mixed type, one may calculate the distance by combining the methods mentioned above. For instance, when calculating the distance between instances i and j using a metric such as the Euclidean distance, one may calculate the difference between nominal and binary attributes as 0 or 1 (match or mismatch, respectively), and the difference between numeric attributes as the difference between their normalized values. The square of each such difference is added to the total distance.

The dissimilarity $d(x_i, x_j)$ between two instances, containing p attributes of mixed types, is defined as:

$$d(x_i, x_j) = \frac{\sum_{n=1}^{p} \delta_{ij}^{(n)} d_{ij}^{(n)}}{\sum_{n=1}^{p} \delta_{ij}^{(n)}}$$

where the indicator $\delta_{ij}^{(n)} = 0$ if one of the values is missing, and $\delta_{ij}^{(n)} = 1$ otherwise. The contribution of attribute n to the distance between the two objects, $d_{ij}^{(n)}$, is computed according to its type:

If the attribute is binary or categorical, $d_{ij}^{(n)} = 0$ if $x_{in} = x_{jn}$; otherwise $d_{ij}^{(n)} = 1$.

If the attribute is continuous-valued,

$$d_{ij}^{(n)} = \frac{|x_{in} - x_{jn}|}{\max_h x_{hn} - \min_h x_{hn}}$$

where h runs over all non-missing objects for attribute n. If the attribute is ordinal, the standardized values of the attribute are computed first and then, zi,n is treated as continuous-valued. 2.5.8 Earth Mover's Distance The Earth Mover's Distance (EMD) [1, 3, 11] is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space where a distance measure between single features, known as ground distance is given. The EMD lifts this distance from individual features to full distributions. Intuitively, given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance. A distribution can be represented by a set of clusters where each cluster is represented by its mean (or mode), and by the fraction of the distribution that belongs to that cluster. Such a representation is called as the signature of the distribution. The two signatures


can have different sizes; for example, simple distributions have shorter signatures than complex ones. Let $P = \{(p_1, w_{p1}), \ldots, (p_m, w_{pm})\}$ be the first signature with m clusters, where $p_i$ is the cluster representative and $w_{pi}$ is the weight of the cluster; let $Q = \{(q_1, w_{q1}), \ldots, (q_n, w_{qn})\}$ be the second signature with n clusters; and let $D = [d_{ij}]$ be the ground distance matrix, where $d_{ij}$ is the ground distance between clusters $p_i$ and $q_j$. The goal is to find a flow $F = [f_{ij}]$, with $f_{ij}$ the flow between $p_i$ and $q_j$, that minimizes the overall cost

$$\mathrm{WORK}(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}$$

subject to the following constraints:

$$f_{ij} \geq 0, \quad 1 \leq i \leq m, \ 1 \leq j \leq n$$

$$\sum_{j=1}^{n} f_{ij} \leq w_{pi}, \quad 1 \leq i \leq m$$

$$\sum_{i=1}^{m} f_{ij} \leq w_{qj}, \quad 1 \leq j \leq n$$

$$\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_{pi}, \ \sum_{j=1}^{n} w_{qj} \right)$$

The first constraint allows supplies to be moved from P to Q and not vice versa. The next two constraints limit the amount of supplies that can be sent by the clusters in P to their weights, and the amount the clusters in Q can receive to their weights, and the last constraint forces the maximum possible amount of supplies to be moved. The earth mover's distance is then defined as the work normalized by the total flow:

$$\mathrm{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} \, d_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}$$
The normalization factor is introduced in order to avoid favoring smaller signatures in the case of partial matching. The EMD has the following advantages:
It naturally extends the notion of a distance between single elements to a distance between sets, or distributions, of elements.
It can be applied to the more general variable-size signatures, which subsume histograms. Signatures are more compact, and the cost of moving earth reflects the notion of nearness properly, without the quantization problems of most other measures.
It allows for partial matches in a very natural way. This is important, for instance, for image retrieval and for dealing with occlusions and clutter.
It is a true metric if the ground distance is a metric and the total weights of the two signatures are equal. This allows image spaces to be endowed with a metric structure.
It is bounded from below by the distance between the centers of mass of the two signatures when the ground distance is induced by a norm. Using this lower bound in retrieval systems significantly reduces the number of EMD computations.
It matches perceptual similarity better than other measures when the ground distance is perceptually meaningful.
A small computational sketch of the transportation formulation above is given below.
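The sketch below solves the flow problem above as a linear program using scipy.optimize.linprog and returns the normalized work. It is an assumption-laden illustration rather than a typical implementation (the EMD is usually computed with a dedicated transportation-simplex solver), and it assumes a SciPy version that provides the "highs" method.

```python
# Illustrative EMD sketch: solve the transportation LP for two signatures given
# their weight vectors and the ground distance matrix D (m x n).
import numpy as np
from scipy.optimize import linprog

def emd(w_p, w_q, D):
    m, n = len(w_p), len(w_q)
    c = np.asarray(D, dtype=float).reshape(m * n)      # cost coefficient of each flow f_ij
    A_ub, b_ub = [], []
    for i in range(m):                                  # sum_j f_ij <= w_p[i]
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
        A_ub.append(row); b_ub.append(w_p[i])
    for j in range(n):                                  # sum_i f_ij <= w_q[j]
        row = np.zeros(m * n); row[j::n] = 1.0
        A_ub.append(row); b_ub.append(w_q[j])
    A_eq = [np.ones(m * n)]                             # total flow equals min of total weights
    b_eq = [min(sum(w_p), sum(w_q))]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun / b_eq[0]                            # work normalized by total flow
```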

2.6 Time Series Similarity Measures


A time-series database consists of sequences of values or events changing with time. The values are measured at equal time intervals. Time series databases are popular in many applications, such as studying daily fluctuations of a stock market, traces of a dynamic production process, scientific parameters, medical treatments, and so on. A time-series database is also a sequence database. However, a sequence database is any database that consists of sequences of ordered events, with or without concrete notions of time. Given a set of time-series sequences, there are two types of similarity search. Subsequence matching finds all of the data sequences that are similar to the given sequence, while whole sequence matching finds those sequences that are similar to one other. Similarity search in time-series analysis is useful for the analysis of financial markets, medical diagnosis and in scientific or engineering databases.

2.6.1 Euclidean Distances and Lp Norms
One of the simplest similarity measures for time series is the Euclidean distance measure [1, 2, 3, 9]. If both time sequences are of the same length n, then each sequence can be viewed as a point in n-dimensional Euclidean space, and the dissimilarity between sequences C and Q can be defined as D(C, Q) = Lp(C, Q), i.e. the distance between the two points measured by the Lp norm (when p = 2, this reduces to the familiar Euclidean distance). Figure 2.1 gives a visual intuition behind the Euclidean distance metric: the two sequences Q and C appear to have approximately the same shape, but have different offsets along the Y-axis.


Figure 2.1 The Intuition Behind the Euclidean Distance Metric

Such a measure is simple to understand and easy to compute, which has ensured that the Euclidean distance is the most widely used distance measure for similarity search. However, one major disadvantage is that it is very brittle. It does not allow for a situation where two sequences are alike, but one has been stretched or compressed in the Y-axis. For example, a time series may fluctuate with small amplitude between 10 and 20, while another may fluctuate in a similar manner with larger amplitude between 20 and 40; the Euclidean distance between the two time series will be large. This problem can be dealt with easily by offset translation and amplitude scaling, which requires normalizing the sequences before applying the distance operator. More formally, let $\mu(C)$ and $\sigma(C)$ be the mean and standard deviation of sequence $C = \{c_1, \ldots, c_n\}$. The sequence C is replaced by the normalized sequence C', where

$$c'_i = \frac{c_i - \mu(C)}{\sigma(C)}$$

Even after normalization, the Euclidean distance measure may still be unsuitable for some time series domains, since it does not allow for acceleration and deceleration along the time axis.
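A small Python sketch of this offset/amplitude normalization followed by the Euclidean distance is shown below; it is illustrative only, and the example series are made up.

```python
# Sketch of z-normalization followed by the Euclidean (L2) distance between
# two equal-length time series.
import math

def z_normalize(seq):
    mu = sum(seq) / len(seq)
    sigma = math.sqrt(sum((c - mu) ** 2 for c in seq) / len(seq)) or 1.0  # guard constant series
    return [(c - mu) / sigma for c in seq]

def euclidean(c_seq, q_seq):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_seq, q_seq)))

# Two series with the same shape but different offset and amplitude become
# close to each other after normalization.
print(euclidean(z_normalize([10, 12, 14, 12]), z_normalize([20, 24, 28, 24])))
```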

2.6.2 Dynamic Time Warping
In some time series domains, a very simple distance measure such as the Euclidean distance will suffice. However, it is often the case that the two sequences have approximately the same overall component shapes, but these shapes do not line up along the X-axis. Figure 2.2 shows a simple example.

Figure 2.2 Two Time Series that Require a Warping Measure

In order to find the similarity between such sequences, or as a preprocessing step before averaging them, the time axis of one (or both) sequences must be warped to achieve a better alignment. Dynamic Time Warping (DTW) [1, 3] is a technique for effectively achieving this warping. Dynamic time warping is an extensively used technique in speech recognition, and allows acceleration and deceleration of signals along the time dimension. Consider two sequences (of possibly different lengths), C = {c1, . . . , cm} and Q = {q1, . . . , qn}. When computing the similarity of the two time series using Dynamic Time Warping, each sequence is allowed to be extended by repeating elements. A straightforward algorithm for computing the Dynamic Time Warping distance between two sequences uses a bottom-up dynamic programming approach, where the smaller subproblems D(i, j) are first determined and then used to solve the larger subproblems, until D(m, n) is finally obtained.
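The following is a minimal dynamic-programming sketch of the D(i, j) recurrence described above, without any warping-window constraint; function and variable names are illustrative only.

```python
# Minimal DTW sketch between two numeric sequences of possibly different lengths.
def dtw(c_seq, q_seq):
    m, n = len(c_seq), len(q_seq)
    INF = float("inf")
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(c_seq[i - 1] - q_seq[j - 1])
            D[i][j] = cost + min(D[i - 1][j],       # repeat an element of Q
                                 D[i][j - 1],       # repeat an element of C
                                 D[i - 1][j - 1])   # advance both sequences
    return D[m][n]
```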


2.6.3 Longest Common Subsequence Similarity
The longest common subsequence similarity measure, or LCSS [1, 3], is a variation of edit distance used in speech recognition and text pattern matching. The basic idea is to match two sequences while allowing some elements to be unmatched. The advantage of the LCSS method is that some elements may be unmatched or left out (e.g. outliers), whereas in the Euclidean distance and DTW, all elements from both sequences must be used, even the outliers. Let C and Q be two sequences of length m and n, respectively. As with dynamic time warping, a recursive definition of the length of the longest common subsequence of C and Q is given. Let L(i, j) denote the length of the longest common subsequence of {c1, . . . , ci} and {q1, . . . , qj}. L(i, j) may be defined recursively as follows:

IF ci = qj THEN L(i, j) = 1 + L(i-1, j-1)
ELSE L(i, j) = max {L(i-1, j), L(i, j-1)}

The dissimilarity between C and Q is defined as

$$\mathrm{LCSS}(C, Q) = \frac{m + n - 2l}{m + n}$$

where l is the length of the longest common subsequence. Intuitively, this quantity determines the minimum (normalized) number of elements that should be removed from and inserted into C to transform C into Q. As with dynamic time warping, the LCSS measure can be computed by dynamic programming in O(mn) time. This can be improved to O((n+m)w) time if a matching window of length w is specified (i.e. where |i - j| is allowed to be at most w).


With time series data, the requirement that the corresponding elements in the common subsequence should match exactly is rather rigid. This problem is addressed by allowing some tolerance (say ε > 0) when comparing elements. Thus, two elements a and b are said to match if a(1 - ε) < b < a(1 + ε).
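A short Python sketch of the LCSS dissimilarity with such a tolerance is given below. It is illustrative only: the match test uses a relative tolerance |a - b| <= ε|a| (a slight simplification of the condition above), and no matching window is applied.

```python
# Sketch of the LCSS dissimilarity (m + n - 2l) / (m + n) with tolerant matching.
def lcss_dissimilarity(c_seq, q_seq, eps=0.1):
    m, n = len(c_seq), len(q_seq)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = c_seq[i - 1], q_seq[j - 1]
            if abs(a - b) <= eps * abs(a):              # elements match within tolerance
                L[i][j] = 1 + L[i - 1][j - 1]
            else:                                        # skip an element of one sequence
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    l = L[m][n]                                          # length of the LCSS
    return (m + n - 2 * l) / (m + n)
```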

2.6.3.1 Local Scaling Functions

The basic idea is that two sequences are similar if they have enough non-overlapping time-ordered pairs of contiguous subsequences that are similar. Two contiguous subsequences are similar if one can be scaled and translated appropriately to approximately resemble the other. The scaling and translation function is local, i.e. it may be different for other pairs of subsequences. The algorithmic challenge is to determine how and where to cut the original sequences into subsequences so that the overall similarity is minimized. The first step is to find all pairs of atomic subsequences in the original sequences A and Q that are similar (atomic implies subsequences of a certain small size, say a parameter w). This step is done by a spatial self-join (using a spatial access structure such as an R-tree) over the set of all atomic subsequences. The next step is to stitch similar atomic subsequences to form pairs of larger similar subsequences. The last step is to find a non-overlapping ordering of subsequence matches having the longest match length. The stitching and subsequence ordering steps can be reduced to finding longest paths in a directed acyclic graph, where vertices are pairs of similar subsequences, and a directed edge denotes their ordering along the original sequences.

2.6.3.2 Global Scaling Function

Instead of different local scaling functions that apply to different portions of the sequences, a simpler approach is to incorporate a single global scaling function into the LCSS similarity measure. An obvious method is to first normalize both sequences and then apply LCSS similarity to the normalized sequences. However, the disadvantage of this approach is that the normalization function is derived from all data points, including outliers. This defeats the very objective of the LCSS approach, which is to ignore outliers in the similarity calculations. The basic idea is that two sequences C and Q are similar if there exist constants a and b, and long common subsequences C' and Q', such that Q' is approximately equal to aC' + b. The scale + translation linear function (i.e. the constants a and b) is derived from the subsequences, and not from the original sequences. Thus, outliers cannot taint the scale + translation function.

2.6.4 Probabilistic Methods

A different approach to time-series similarity is the use of a probabilistic similarity measure [3]. Whereas the methods above are distance based, probabilistic methods are model based. Since time series similarity is inherently a fuzzy problem, probabilistic methods are well suited for handling noise and uncertainty. They are also suitable for handling scaling and offset translations. Finally, they provide the ability to incorporate prior knowledge into the similarity measure. However, it is not clear whether other problems such as time-series indexing, retrieval, and clustering can be efficiently accomplished under probabilistic similarity measures. Given a sequence C, the basic idea is to construct a probabilistic generative model MC, i.e., a probability distribution on waveforms. Given a new sequence Q, similarity is measured by computing p(Q | MC), i.e., the likelihood that MC generates Q.

2.6.5 General Transformations

Recognizing the importance of the notion of shape in similarity computations, an alternate approach can be considered: a general similarity framework involving a transformation rule language [3]. Each rule in the transformation language takes an input sequence and produces an output sequence at a cost associated with the rule. The similarity of sequence C to sequence Q is the minimum cost of transforming C into Q by applying a sequence of such rules.

2.7 Bipartite Graph

In the mathematical field of graph theory, a bipartite graph (or bigraph) [1, 7] is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V; that is, U and V are independent sets. Equivalently, a bipartite graph is a graph that does not contain any odd-length cycles.

Figure 2.3 Simple Bipartite Graph

The two sets U and V may be thought of as a coloring of the graph with two colors. If all nodes in U are colored blue, and all nodes in V are colored green, each edge has endpoints of differing colors, as is required in the graph coloring problem. In contrast, such a coloring is impossible in the case of a non-bipartite graph, such as a triangle. After one node is colored blue and another green, the third vertex of the triangle is connected to vertices of both colors, preventing it from being assigned either color. One often writes G = (U, V, E) to denote a bipartite graph whose partition has the parts U and V. If |U| = |V|, that is, if the two subsets have equal cardinality, then G is called a balanced bipartite graph.


Bipartite graphs can model the more general multigraph. Given a multigraph M, take U as the vertex set of M and V as the edge set of M. Then join an element of V to precisely the two elements of U which are the ends of that edge in M. Thus every multigraph is described entirely by a bipartite graph which is one-sided regular of degree 2, and vice versa. Similarly, every directed hypergraph can be represented as a bipartite digraph. Take U as the vertex set of the hypergraph, and V as its set of edges. For each u in U and v in V, connect u to v if the hypergraph edge v contains u as an input, and connect v to u if v contains u as an output.

The following are properties of a bipartite graph:
A graph is bipartite if and only if it does not contain an odd cycle. Therefore, a bipartite graph cannot contain a clique of size 3 or more.
A graph is bipartite if and only if it is 2-colorable, i.e., its chromatic number is less than or equal to 2 (a small check based on this property is sketched after this list).
The size of the minimum vertex cover is equal to the size of the maximum matching.
The size of the maximum independent set plus the size of the maximum matching is equal to the number of vertices.
For a connected bipartite graph, the size of the minimum edge cover is equal to the size of the maximum independent set.
For a connected bipartite graph, the size of the minimum edge cover plus the size of the minimum vertex cover is equal to the number of vertices.
Every bipartite graph is a perfect graph.
The spectrum of a graph is symmetric if and only if it is a bipartite graph.
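To make the 2-colorability characterization concrete, the following C sketch checks whether a graph given as an adjacency matrix is bipartite by breadth-first 2-coloring; the fixed array bounds and the function name are assumptions made for illustration only.

    #define MAXV 100  /* assumed maximum number of vertices for this sketch */

    /* Returns 1 if the graph (adjacency matrix adj, n vertices) is bipartite,
     * i.e., 2-colorable, and 0 otherwise. */
    int is_bipartite(int n, int adj[MAXV][MAXV])
    {
        int color[MAXV];              /* -1 = uncolored, 0 = blue, 1 = green */
        int queue[MAXV], head, tail;
        int s, u, v;

        for (v = 0; v < n; v++)
            color[v] = -1;

        for (s = 0; s < n; s++) {     /* handle disconnected graphs */
            if (color[s] != -1)
                continue;
            color[s] = 0;
            head = tail = 0;
            queue[tail++] = s;
            while (head < tail) {
                u = queue[head++];
                for (v = 0; v < n; v++) {
                    if (!adj[u][v])
                        continue;
                    if (color[v] == -1) {
                        color[v] = 1 - color[u];   /* assign the opposite color */
                        queue[tail++] = v;
                    } else if (color[v] == color[u]) {
                        return 0;   /* odd cycle found: not bipartite */
                    }
                }
            }
        }
        return 1;
    }

If the function returns 1, the color array itself gives a valid partition of the vertices into the two sides U and V.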

2.8 k-Nearest Neighborhood

k-Nearest Neighbor search [1, 2, 3, 8] identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.

In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor. The same method can be used for regression, by simply assigning to the object the average of the property values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. (A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is a generalization of linear interpolation.) The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data. Nearest neighbor rules in effect compute the decision boundary implicitly. It is also possible to compute the decision boundary explicitly, and to do so efficiently, so that the computational complexity is a function of the boundary complexity. k-NN is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have been proposed that seek to reduce the number of distance evaluations actually performed.


Using an appropriate nearest neighbor search algorithm makes k-NN computationally tractable even for large data sets.
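The following C sketch illustrates the plain majority-vote k-NN classifier described above. The Euclidean distance, the selection-style search for the k smallest distances, the small-integer class labels, and all names are illustrative assumptions rather than the implementation used in this project.

    #include <math.h>

    #define DIM 3        /* assumed feature dimensionality for this sketch */
    #define MAXTRAIN 100 /* assumed bound on the training-set size */
    #define MAXCLASS 10  /* assumed number of class labels */

    static double euclidean(const double *a, const double *b)
    {
        double sum = 0.0;
        int d;
        for (d = 0; d < DIM; d++)
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sqrt(sum);
    }

    /* Classify one query point by majority vote among its k nearest
     * training samples; labels are small non-negative integers. */
    int knn_classify(const double train[][DIM], const int label[], int n,
                     const double *query, int k)
    {
        double dist[MAXTRAIN];
        int used[MAXTRAIN] = {0};
        int votes[MAXCLASS] = {0};
        int i, j, best, winner = 0;

        for (i = 0; i < n; i++)
            dist[i] = euclidean(train[i], query);

        for (j = 0; j < k && j < n; j++) {      /* pick the k smallest distances */
            best = -1;
            for (i = 0; i < n; i++)
                if (!used[i] && (best < 0 || dist[i] < dist[best]))
                    best = i;
            used[best] = 1;
            votes[label[best]]++;
        }
        for (i = 1; i < MAXCLASS; i++)          /* majority vote over the k labels */
            if (votes[i] > votes[winner])
                winner = i;
        return winner;
    }

With k = 1 this reduces to nearest-neighbor classification; larger k smooths the decision over the local neighborhood.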

2.9 Existing System

According to different user requirements, video search intentions can be generally classified into three main categories:
Detecting instances of copies
Identifying visual similarities of near duplicates
Retrieving entries in a broad sense of high-level concept similarity

2.9.1 Video Copy Detection

Extensive research efforts have been made on extracting and matching content-based signatures to detect copies of videos. Several methods use a sequence of frame features (ordinal, motion, or color signatures) to leverage the characteristics of sequence-to-sequence matching: the query sequence slides frame by frame over the database video with a fixed-length window. Since the process of video transformation can give rise to several distortions, techniques circumventing these variations by means of global signatures have been considered. They tend to describe a video globally rather than focusing on its sequential details. Some properties that are likely to be preserved even under these variations (e.g., shot length information) were suggested as compact signatures, and a string matching technique can be used to report such a copy. This approach is efficient, but has limitations with blurry shot boundaries or a very limited number of shots. Moreover, in reality, a query video clip can be just a shot or even a subshot, whereas this approach is only applicable to queries which consist of multiple shots.


2.9.1.1 Video Copy Detection Based on Watermarks

Watermarking consists in introducing a non-visible signal into the whole video with the goal of recognizing that video as genuine, so that copied videos can easily be detected. Watermarking is a widely used technique in the photography field. It allows the owner to detect whether an image has been copied or not. Sometimes, this watermark is visible in the image and located in the background. The limitation of watermarks is that if the original image is not watermarked, then it is not possible to know whether other images are copies of it.

2.9.1.2 Video Copy Detection Based on Content

In this case, the signature which defines the video is its content. Algorithms for content-based video copy detection extract a fingerprint from the features of the visual content. The fingerprint is then compared with the fingerprints of videos in a database. This type of algorithm faces a difficult problem: it is hard to decide whether a video is a copy or merely a similar video, because the content features can be very similar from one video to another, and the system may conclude that the video is copied when in fact it is not.

2.9.2 Visual Similarities of Near Duplicates

In this method, the query and long videos are fragmented into frames. Each frame of the query video is compared with the long video frames, and nearest neighbors are identified for the query video frames. The approximate position in the long video can be identified from the nearest neighbors of the query video frames. However, this method does not focus on sequence matching between the query and long videos. It can report a successful match even if only one frame is similar, and is therefore not effective.


2.9.3 Retrieving Entries in a Broad Sense of High-Level Concept Similarity

User needs determine both the effectiveness and efficiency of video search engines. A video is created for entertainment, information, communication, or data analysis, and for each of these purposes the user needs and demands vary substantially. Search engines like YouTube can provide a list of relevant videos for a given query. Since humans perceive video as a complex interplay of cognitive concepts, the all-important step forward in such video retrieval approaches is to provide access at the semantic level. This is achievable by labeling all combinations of people, objects, settings, and events appearing in the audiovisual content. Labeling video content is a grand challenge, as humans use approximately half of their cognitive capacity to achieve such tasks. There are two types of semantic labeling solutions:
Labels assigned manually after audiovisual inspection
Machine-driven solutions with automatic assignment of labels to video segments

Video retrieval task conventionally returns similar clips from a large collection of videos which have been either chopped up into similar lengths or cut at content boundaries. The clips for search have already been segmented and are always ready for similarity ranking.

2.9.3.1 Human-Driven Labeling

Manual labeling of video has traditionally been the realm of professionals. In cultural heritage institutions, for example, library experts label archival videos for future disclosure using controlled vocabularies. Because expert labeling is tedious and costly, it typically results in only a brief description of a complete video.


In contrast to expert labeling, social tagging has recently emerged: amateur consumers label, mostly personal, visual content on web sites like YouTube, Flickr, and Facebook. Alternatively, the manual concept-based labeling process can be transformed into a computer game or a tool facilitating volunteer-based labeling. Since the labels were never meant to meet professional standards, amateur labels are known to be ambiguous, overly personalized, and limited. Moreover, unlabeled video segments remain notoriously difficult to find. Manual labeling, whether by experts or amateurs, is geared toward one specific type of use and is, therefore, inadequate to cater for alternative video retrieval needs, especially those user needs targeting retrieval of video segments.

2.9.3.2 Machine-Driven Labeling

Machine-driven labeling aims to derive meaningful descriptors from video data. These descriptors are the basis for searching large video collections. Most commercial video search engines provide access to video based on text, as this is still the easiest way for a user to describe an information need. The labels of these search engines are based on the filename, surrounding text, social tags, closed captions, or a speech transcript. Text-based video search using speech transcripts has proven especially effective for segment-level retrieval from broadcast news, interviews, political speeches, and video blogs featuring talking heads. However, a video search method based on just speech transcripts results in disappointing retrieval performance when the audiovisual content is neither mentioned nor properly reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China and the Netherlands, querying the content becomes much harder, as robust automatic speech recognition results and their accurate machine translations are difficult to achieve. It might seem that video retrieval is the trivial extension of text retrieval, but it is in fact often more complex. Most of the data is of sensory origin (image, sound, video) and hence techniques from digital signal processing and computer vision are required to extract relevant descriptions. In addition to the important and valuable text data derived from audio analysis, much information is captured in the visual stream. Hence, a vast body of research in machine-driven video labeling has investigated the role of visual content, with or without text.

Early systems depended on low-level visual labels such as color, texture, shape, and spatiotemporal features. Most of those early systems are based on query-by-example, where users query an archive based on images rather than visual feature values. They do so by sketches, or by providing example images through a browser interface. Query-by-example can be fruitful when users search for the same object under slightly varying circumstances and when the target images are indeed available. If proper example images are unavailable, content-based image retrieval techniques are not effective at all. Moreover, users often do not understand similarity expressed in low-level visual features; they expect semantic similarity. This expected semantic similarity is exactly the major problem video retrieval is facing. The source of the problem lies in the semantic gap: the lack of correspondence between the low-level features that machines extract from video and the high-level conceptual interpretations a human gives to the data in a given situation. The existence of the gap has various causes. One reason is that different users interpret the same video data in different ways. This is especially true when the user makes subjective interpretations of the video data related to feelings or emotions, for example, by describing a scene as romantic or hilarious.


CHAPTER 3 PROPOSED SYSTEM


3. Proposed System
The proposed system is video subsequence identification. Video subsequence identification [1] aims at finding whether there exists any subsequence of a long database video that shares similar content with a query clip.

3.1 Video Subsequence Identification

Answering queries based on content that is alike but perhaps not exactly the same is known as similarity search. It has been widely used to simulate the process of object proximity ranking performed by human specialists, for example in image retrieval and time series matching. Rapid advancement in multimedia and network technologies has popularized many applications of video databases, and sophisticated techniques for representing, matching, and indexing videos are in high demand. A video sequence is an ordered set of a large number of frames, and from the database research perspective, each frame is usually represented by a high-dimensional vector extracted from some low-level content features, such as color distribution, texture pattern, or shape structure within the original media domain. Matching of videos is therefore often translated into searches among these feature vectors. In practice, it is often undesirable to manually check whether a video is part of a long stream by browsing its entire length, so a reliable solution for automatically finding similar content is imperative. Video subsequence identification involves locating the position of the most similar part, with respect to a user-specified query clip Q, in a long prestored video sequence S. The primary difference between video retrieval and video subsequence identification is that the retrieval task conventionally returns similar clips from a large collection of videos which have been either chopped up into similar lengths or cut at content boundaries, whereas the subsequence identification task aims at finding whether there exists any subsequence of a long database video that shares similar content with the query clip. In other words, the clips for search have already been segmented and are always ready for similarity ranking in video retrieval.


Whereas in video subsequence identification, the boundary and even the length of the target subsequence are not available initially, so it is not known in advance which fragments should be evaluated for similarity. Video subsequence identification uses a graph transformation and matching approach to process variable-length comparisons of the query against the database video. It facilitates safely pruning a large portion of irrelevant parts and rapidly locating some promising candidates for further similarity evaluation. By constructing a bipartite graph representing the similar-frame mapping relationship between the query video (Q) and the long video (S) with an efficient batch kNN search algorithm, all the possibly similar video subsequences along the 1D temporal line can be extracted. Then, to effectively but still efficiently identify the most similar subsequence, the proposed query processing is conducted in a coarse-to-fine style. A one-to-one mapping constraint similar in spirit to that of Maximum Size Matching (MSM) is employed to rapidly filter out some actually nonsimilar subsequences at low computational cost. The smaller number of candidates which contain eligible numbers of similar frames are then further evaluated, with relatively higher computational cost, for accurate identification. Since measuring the video similarities for all the possible 1:1 mappings in a subgraph is computationally intractable, a heuristic method, Sub-Maximum Similarity Matching (SMSM), is devised to quickly identify the subsequence corresponding to the most suitable 1:1 mapping.

3.2 Applications of Video Subsequence Identification

Video subsequence identification can be used in applications such as recognition for copyright enforcement and TV commercial detection.

3.2.1 Recognition for Copyright Enforcement

Video content owners would like to be aware of any use of their material, in any media or representation. For example, the producers of certain movie scenes may want to identify whether or where their original films have been reused by others.


3.2.2 TV Commercial Detection

Some companies would like to track their TV commercials when they are aired on different channels during a certain time period, for statistical purposes. They can verify whether their commercials have actually been broadcast as contracted, and it is also valuable to monitor how their competitors conduct advertisements in order to understand their marketing strategies.


CHAPTER 4 IMPLEMENTATION


4. Implementation
This chapter provides the design and implementation details of video subsequence identification.

4.1 Video Fragmentation

Video fragmentation elaborates on the aspects to be considered for splitting a video file into image files.

4.1.1 Supported Video Types

There are many types of videos; this project mainly focuses on the following video types:
AVI (Audio Video Interleave)
MP4
MOV

4.1.2 Capturing Video Properties

The following properties of the query video (Q) and the long video (S) are captured using openCV APIs:
Frames Per Second
Total Number of Frames
Frame Width
Frame Height

4.1.3 Capturing Images from Video at Periodic Intervals

Images are captured from the query and long videos at periodic time intervals. The periodic rate is a configurable parameter.
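A minimal sketch of this step, assuming the openCV C API listed in Appendix A, is shown below. The file-name pattern, the splitting rate, and the error handling are illustrative assumptions; note that older openCV versions take only two arguments in cvSaveImage, whereas the third argument shown here holds optional encoding parameters.

    #include <stdio.h>
    #include "cv.h"
    #include "highgui.h"

    int fragment_video(const char *path, const char *prefix, int split_rate)
    {
        CvCapture *capture = cvCaptureFromFile(path);
        IplImage *frame;
        char name[256];
        int index = 0, saved = 0;
        double fps, total, width, height;

        if (capture == NULL)
            return -1;

        /* Capture the video properties used later in the pipeline. */
        fps    = cvGetCaptureProperty(capture, CV_CAP_PROP_FPS);
        total  = cvGetCaptureProperty(capture, CV_CAP_PROP_FRAME_COUNT);
        width  = cvGetCaptureProperty(capture, CV_CAP_PROP_FRAME_WIDTH);
        height = cvGetCaptureProperty(capture, CV_CAP_PROP_FRAME_HEIGHT);
        printf("fps=%.2f frames=%.0f size=%.0fx%.0f\n", fps, total, width, height);

        /* Save every split_rate-th frame as a JPEG image. */
        while ((frame = cvQueryFrame(capture)) != NULL) {
            if (index % split_rate == 0) {
                sprintf(name, "%s_%05d.jpg", prefix, saved);
                cvSaveImage(name, frame, 0);  /* 0 = default encoding parameters */
                saved++;
            }
            index++;
        }
        cvReleaseCapture(&capture);
        return saved;   /* number of JPEG images written */
    }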


4.1.4 Image File Type Supported

JPEG images are captured from the query and long videos. JPEG images are generally in a compressed format, so less storage space is required for storing them.

4.2 Similar Frames Retrieval Using k-Nearest Neighborhood

Generally, visual content represented by feature vectors is the primary factor in computing the similarity of videos. Video subsequence identification typically takes the temporal order inherent in videos into account as well. However, an approach that strictly adheres to temporal order is too restrictive. In practice, similar video sequences may have different orderings due to content editing for postproduction; in human perception, for example, different versions of promotion videos for the same supermarket are still regarded as quite similar. Therefore, taking temporal order into consideration does not mean it has to be preserved rigidly; it may have some flexibility as long as the cost of reordering is considered. Frame alignment is another factor affecting relevance. Unlike detecting video copies, the lengths of the query and long videos do not have to be exactly the same. The difference could arise from the presence of transmission distortion, which results in some random frame repeating, skipping, or noise. More often, frames can be added or deleted during content editing. For example, special effects such as fast forward and slow motion will lead two video sequences sharing the same content to have different frame numbers. Two videos of different lengths can be aligned for approximate matching so that they are still regarded as near-duplicates. Another convincing and promising application is that, when alignment has been considered, finding similar content can still be performed even when the frame rates of the query and long videos are different. The similarity of video frames can be captured by some Lp distance in d-dimensional vector space. Other metrics can be applied as well, such as the more robust but complex Earth Mover's Distance (EMD). Similar frame retrieval in S for each element qi in Q is processed as a range or kNN search.


Given qi and S, the following algorithm gives the framework for retrieving similar frames.

Algorithm: Retrieve similar frames
Input: qi, S
Output: F(qi) - similar frame set of qi
Description:
1: if kNN search is defined then
2:     F(qi) = { sj | sj in kNN(qi) };
3:     return F(qi);
4: else
5:     F(qi) = { sj | sj in range(qi) };
6:     return F(qi);
7: end if

In the range search, frames are regarded as similar if their distance is under a threshold. It is better to have each qi retrieve the same number of similar frames, even though the maximum retrieval distances may then vary substantially. Therefore, kNN search is preferred, and the k nearest neighbors are identified for all the query video frames.
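As an illustrative sketch of the kNN branch above, the following C code retrieves, for one query frame feature vector, the IDs of its k nearest long-video frames by exhaustive Euclidean distance comparison. The feature dimensionality, array bounds, and function names are assumptions for illustration; the actual project may use a more efficient batch search.

    #include <math.h>

    #define FDIM 64      /* assumed frame feature dimensionality */
    #define MAXDB 10000  /* assumed bound on long-video frame count */

    static double frame_distance(const double *a, const double *b)
    {
        double sum = 0.0;
        int d;
        for (d = 0; d < FDIM; d++)
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        return sqrt(sum);
    }

    /* Fill knn_id[0..k-1] with the IDs of the k long-video frames closest
     * to the query frame feature vector q (brute-force search over S). */
    void retrieve_similar_frames(const double *q,
                                 const double s[][FDIM], int n_frames,
                                 int k, int *knn_id)
    {
        static double dist[MAXDB];
        static int used[MAXDB];
        int i, j, best;

        for (i = 0; i < n_frames; i++) {
            dist[i] = frame_distance(q, s[i]);
            used[i] = 0;
        }
        for (j = 0; j < k && j < n_frames; j++) {
            best = -1;
            for (i = 0; i < n_frames; i++)
                if (!used[i] && (best < 0 || dist[i] < dist[best]))
                    best = i;
            used[best] = 1;
            knn_id[j] = best;   /* frame ID sj of the j-th nearest neighbor */
        }
    }

Calling this routine once per query frame qi produces the sets F(qi) used by the bipartite graph construction in the next section.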


4.3 Graph Transformation and Matching

While similar frame retrieval is the preliminary step, analogous to building element-to-element correspondence in time series matching, this work emphasizes effective temporal localization of the target subsequence. The temporal characteristic is essential to videos. Following their temporal orders, the query video (Q) and the long video (S) can be placed along two 1D temporal lines. The mapping relationship between Q and S can then be investigated by a bipartite graph. Graph transformation has been widely used to describe complex structures in a natural and intuitive way. Bipartite graphs are useful for generalized matching problems, such as assigning advertisements to be displayed with queries of search engines. The frame-level mapping relationship can be exploited to transform video subsequence identification into a more manageable problem of bipartite graph matching. Note that the elements in each F(qi), which are all close to qi spatially in high-dimensional space, are sometimes in close proximity to each other temporally, but may also disperse along the 1D temporal line quite separately. This provides an opportunity to prune irrelevant parts.

4.3.1 Bipartite Graph Transformation

Each frame can be placed as a node along the temporal line of a video. Given a query clip Q and a database video S, a short line and a long line can be abstracted, respectively. Q and S, which are two finite sets of nodes ordered along the temporal lines, are treated as the two sides of a bipartite graph. An edge is drawn from a frame of Q to a frame of S if the Euclidean distance between the two frames is less than the Bipartite Similarity Threshold. Hereafter, each frame is no longer modeled as a high-dimensional point as in the preliminary step, but simply as a node.


Figure 4.4 Construction of Bipartite Graph

Let G = {V, E} be a bipartite graph representing the similar frame mappings between Q and S. V = Q ∪ S is the vertex set representing frames, while E ⊆ Q × S is the edge set representing similar frame mappings. The following algorithm outlines how the bipartite graph is constructed.

Algorithm: Construct bipartite graph
Input: F(qi) for each qi in Q
Output: E - edge set, SB - set of sj with nonzero count
Description:
1: for each qi in Q do
2:     for each sj in F(qi) do
3:         count(sj) = 0;
4:     end for
5: end for
6: for each qi in Q do
7:     for each sj in F(qi) do
8:         add qi → sj to E; count(sj)++;
9:         update SB with (sj, count(sj));
10:    end for
11: end for
12: return E, SB;

For each sj in F(qi), its count, which indicates the number of similar query frames, is first initialized to 0. Then, for each qi in Q, if sj is in F(qi), there is an edge from qi to sj in G, and the count of sj is increased by 1. At the same time, the set of sj with nonzero count, denoted SB, is also updated. In the case that kNN search is used to retrieve F(qi), k * |Q| edges will be formed. Observing the similar frame mappings along the 1D temporal line on the S side, only a small portion is densely matched, while most parts are unmatched or only sparsely matched. Intuitively, the unmatched and sparsely matched parts can be directly discarded, as they clearly suggest there is no possible subsequence similar to Q there, because a necessary condition for a subsequence to be similar to Q is that they share a sufficient number of similar frames. In view of this, comparing all the possible subsequences in S, which is infeasible, is avoided; instead, a large portion of irrelevant parts is safely and rapidly filtered out prior to similarity evaluations. To do so, the densely matched segments of S containing all the possibly similar video subsequences have to be identified. Note that it is unnecessary to maintain the entire graph.


4.3.2 Dense Segment Extraction

Starting from the first frame of the long video, the Euclidean distance between two adjacent frames of the long video is calculated. If the Euclidean distance between adjacent frames is less than the Adjacent Similarity Threshold, the frames are added to the same segment. In this way, similarity within the long video sequence is traced, and sparse or empty parts can be learned. The algorithm below extracts dense segments.

Algorithm: Extract dense segments
Input: SB
Output: SD - set of dense segments
Description:
1: order SB by sj;
2: k = 1; add first(SB) to Sk*; add Sk* to SD;
3: while first(SB) ≠ last(SB) do
4:     δ = next(SB).sj - first(SB).sj;
5:     if δ < τ then
6:         add next(SB) to Sk*;
7:     else
8:         k++; add next(SB) to Sk*; add Sk* to SD;
9:     end if
10:    first(SB) = next(SB);
11: end while
12: return SD;


SB contains tuples of the form (sj, count(sj)) returned by the bipartite graph construction algorithm, where sj is the frame ID and count(sj) > 0 reflects the number of similar query frames of sj. Additionally, a threshold τ determining the number of consecutive zero counts required to separate two high-density segments is employed. However, even after dense segment extraction, it is likely that two successive or overlapping subsequences, both somewhat similar to Q, are grouped into one longer segment. To further filter dense but nonsimilar segments, and to accurately extract the most similar subsequence in a relatively long segment, a filter-and-refine search strategy needs to be applied.
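The grouping step can be illustrated with the following C sketch, which operates on an array of matched frame IDs already sorted in ascending order; the segment representation, the gap threshold parameter, and the assumption that the caller provides enough output capacity are simplifications made for illustration only.

    struct segment {
        int start;   /* first matched frame ID in the segment */
        int end;     /* last matched frame ID in the segment  */
    };

    /* Group sorted matched frame IDs sb[0..n-1] into dense segments:
     * a new segment is started whenever the gap between consecutive
     * matched frames reaches the threshold tau. Returns the segment count. */
    int extract_dense_segments(const int *sb, int n, int tau,
                               struct segment *sd)
    {
        int i, k = 0;

        if (n == 0)
            return 0;
        sd[k].start = sd[k].end = sb[0];
        for (i = 1; i < n; i++) {
            if (sb[i] - sb[i - 1] < tau) {
                sd[k].end = sb[i];          /* extend the current segment */
            } else {
                k++;                        /* gap too large: start a new segment */
                sd[k].start = sd[k].end = sb[i];
            }
        }
        return k + 1;
    }

Segments whose span or matched-frame count is too small can then be discarded before the filter-and-refine phase.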


4.3.3 Filtering by Maximum Size Matching

After locating the dense segments, there will be k separate subgraphs. However, high density of a segment alone cannot sufficiently indicate high similarity to the query, due to the neglect of the actual number of similar frames, temporal order, and frame alignment. Since the query Q is composed of a series of frames, a frame sj in Sk* can also have multiple similar frames in Q due to continuity. The original similar frame mapping relationship between Q and Sk* is therefore usually multiple-to-multiple (M:M). Even when videos are represented by shots rather than frames, this M:M mapping phenomenon is still common, though not as significant as at the frame level. In fact, not all dense segments of acceptable length are indeed relevant. With a 1:M mapping, one frame of the query clip can be found similar to a number of frames of the long video, which leads to higher computational cost. With a 1:1 mapping, Maximum Size Matching (MSM) is employed to rapidly screen subsequences at lower computational cost: a subsequence is retained only if the number of matched frames in it is greater than the Maximum Size Matching Threshold.

Figure 4.5 1:M Mapping


Figure 4.6 1:1 Mapping

The algorithm below outlines the filtering process by MSM. Based on the matching size, i.e., the number of edges or the number of saturated vertices in Gk under the 1:1 mapping constraint, which reflects the number of similar frames, some actually nonsimilar video subsequences can be filtered out, namely those whose maximum matching sizes are not greater than a given threshold θ, where θ is a parameter related to |Q|. This threshold can be tuned to trade off effectiveness against efficiency.

Algorithm: Filter by MSM
Input: SD
Output: SM - set of actually similar S*
Description:
1: SM = ∅;
2: for each segment Sk* in SD do
3:     MkMSM = MaximumSizeMatching(Gk);
4:     if |MkMSM| > θ then
5:         add Sk* to SM;
6:     else
7:         discard Sk*;
8:     end if
9: end for
10: return SM;
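For illustration, the maximum matching size used in the filter can be computed with the standard augmenting-path (Kuhn/Hungarian) method on each subgraph Gk. The adjacency-matrix representation, the array bounds, and the function names below are assumptions of this sketch, not the project's actual implementation.

    #include <string.h>

    #define MAXQ 200   /* assumed bound on query frames in the subgraph  */
    #define MAXS 2000  /* assumed bound on segment frames in the subgraph */

    static int try_augment(int qi, int nS, const int adj[MAXQ][MAXS],
                           int visited[MAXS], int match_of[MAXS])
    {
        int sj;
        for (sj = 0; sj < nS; sj++) {
            if (!adj[qi][sj] || visited[sj])
                continue;
            visited[sj] = 1;
            /* sj is free, or its current partner can be re-matched elsewhere. */
            if (match_of[sj] < 0 ||
                try_augment(match_of[sj], nS, adj, visited, match_of)) {
                match_of[sj] = qi;
                return 1;
            }
        }
        return 0;
    }

    /* Size of the maximum 1:1 matching in the bipartite subgraph Gk,
     * given as an adjacency matrix adj[qi][sj]. */
    int maximum_size_matching(int nQ, int nS, const int adj[MAXQ][MAXS])
    {
        int match_of[MAXS];   /* match_of[sj] = matched query frame, or -1 */
        int visited[MAXS];
        int qi, size = 0;

        memset(match_of, -1, sizeof(match_of));
        for (qi = 0; qi < nQ; qi++) {
            memset(visited, 0, sizeof(visited));
            if (try_augment(qi, nS, adj, visited, match_of))
                size++;
        }
        return size;
    }

The returned size is compared against the threshold θ above; segments failing the test are discarded without any further similarity evaluation.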

4.3.4 Refinement by Sub-Maximum Similarity Matching

The maximum size matching filtering step can be viewed as a rough similarity evaluation disregarding temporal information. SM is further refined by considering visual content, temporal order, and frame alignment simultaneously.

4.3.4.1 Video Similarity Measure

Defining a similarity measure consistent with human perception is crucial for similarity search. First, a score function needs to be presented that integrates three factors in judging video relevance, so as to resemble human perception more accurately. The video similarity is computed based on an arbitrary 1:1 mapping Mk out of all the possible 1:1 mappings between Q and Sk*.

Visual content similarity: To locate the most visually similar subsequence, the distance between two similar frames is considered, instead of simply judging whether they are similar or not. Let the weight of edge wij denote the detailed similarity between frames qi and sj in Mk. This information is already available from the preliminary stage. A larger total weight of edges means a larger sum of interframe similarities, which consequently gives a higher degree of visual similarity.

Temporal order similarity: Some frames may be reordered to achieve much higher visual content similarity with an acceptable sacrifice in preserving temporal order, giving a higher overall similarity from the angle of human perception. Given a 1:1 mapping Mk of Gk corresponding to Sk*, the number of frame pairs that have kept the temporal order, i.e., matched along the same direction, can be calculated by Longest Common Subsequence Similarity.


Frame alignment similarity: Considering only the factors of visual content and temporal order is still problematic, so the number of frames that need alignment operations (frames added, deleted, or substituted) is also taken into account.
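The project does not fix a particular weighting formula here, but as a rough illustration the three factors can be combined into a single score for each candidate 1:1 mapping. The linear weighting, the normalizations, and all names in the following C sketch are assumptions made for illustration only.

    /* A minimal sketch, assuming each factor has already been normalized to
     * [0, 1]: content = normalized sum of the edge weights wij of the mapping,
     * order = normalized LCSS length of the mapping, alignment = fraction of
     * frames needing no add/delete/substitute operation. The weights alpha,
     * beta, and gamma are hypothetical tuning parameters. */
    double mapping_score(double content, double order, double alignment,
                         double alpha, double beta, double gamma)
    {
        return alpha * content + beta * order + gamma * alignment;
    }

For example, alpha = beta = gamma = 1/3 weighs the three factors equally; the 1:1 mapping with the highest score within a segment would then be reported by the refinement step.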

4.4 Identifying Configurable Thresholds

During the implementation of this project, it was observed that several thresholds can affect the outcome of query processing. All of these thresholds are therefore exposed as configurable parameters. The following are the configurable thresholds:
Video Splitting Rate
Number of KNN Samples
Bipartite Similarity Threshold
Adjacent Similarity Threshold
Maximum Size Matching Threshold

The provision of configurable thresholds allows the user to either tighten or loosen the passing criteria for query processing, so the video similarity decision can be adjusted by user feedback.
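As an implementation convenience, the configurable thresholds can be grouped into a single structure passed through the processing pipeline; the field names and default values in this C sketch are illustrative assumptions, not recommendations from the project.

    /* Illustrative grouping of the configurable thresholds. */
    struct query_config {
        int    video_splitting_rate;          /* frames skipped between captures      */
        int    knn_samples;                   /* k in the kNN frame retrieval          */
        double bipartite_similarity_thresh;   /* max distance for a graph edge         */
        double adjacent_similarity_thresh;    /* max distance between adjacent frames  */
        int    msm_threshold;                 /* min matching size to keep a segment   */
    };

    static const struct query_config default_config = { 5, 10, 0.5, 0.5, 20 };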


4.5 Implementation of Graphical User Interface

The graphical user interface is implemented using a C# Forms application. It provides the following options for the user:
Browse button for selection of the query video
Browse button for selection of the long / database video
Remove button for deleting JPEG files
Text box for entering the Video Splitting Rate (configurable threshold)
Text box for entering the Number of KNN Samples (configurable threshold)
Text box for entering the Bipartite Similarity Threshold (configurable threshold)
Text box for entering the Adjacent Similarity Threshold (configurable threshold)
Text box for entering the Maximum Size Matching Threshold (configurable threshold)
Search button for query processing


Figure 4.7 shows a screenshot of the graphical user interface.

Figure 4.7 Graphical User Interface


CHAPTER 5 TESTING AND RESULTS


5. Testing and Results


This chapter provides the details of the test plan and the test results.

5.1 Test Plan Approach

An effective test plan has always been the foundation of an effective test strategy. Testing is performed with different video file types (i.e., avi, mp4, and mov). Testing is performed with videos of different kinds, such as advertisements, movies, and animation movies. Testing is performed with videos of different frame rates and display resolutions. Testing is performed with both positive and negative scenarios.

Figure 5.1 provides the test plan approach, listing the long video, the query video, and the remarks for each test case.

Test Case 1: Positive test case with technical presentation videos (avi files).
Test Case 2: Positive test case with advertisement videos (avi files). Test case where two occurrences of the query video are present in the long video.
Test Case 3: Positive test case with advertisement videos (mov files).
Test Case 4: Positive test case with animation movies (avi files).
Test Case 5: Positive test case with movies (avi files).
Test Case 6: Positive test case with advertisements of different display resolution (mp4 files).
Test Case 7: Negative test case with movies (avi files).

Figure 5.1 Test Plan Approach


5.2 Results

This section provides the results of the test cases executed as per the test plan approach.

5.2.1 Test Case 1 Results

Test Case 1 input parameters are shown in Figure 5.2.

Figure 5.2 GUI Screen Showing the Test Case 1 Input Parameters


Test Case 1 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.3.

Figure 5.3 Results of Test Case 1 (screen 1 of 2)


Test Case 1 results for Maximum Size Matching and Results Display stages are shown in Figure 5.4.

Figure 5.4 Results of Test Case 1 (screen 2 of 2)


5.2.2 Test Case 2 Results

Test Case 2 input parameters are shown in Figure 5.5.

Figure 5.5 GUI Screen Showing the Test Case 2 Input Parameters


Test Case 2 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.6.

Figure 5.6 Results of Test Case 2 (screen 1 of 2)


Test Case 2 results for Maximum Size Matching and Results Display stages are shown in Figure 5.7.

Figure 5.7 Results of Test Case 2 (screen 2 of 2)


5.2.3 Test Case 3 Results

Test Case 3 input parameters are shown in Figure 5.8.

Figure 5.8 GUI Screen Showing the Test Case 3 Input Parameters


Test Case 3 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.9.

Figure 5.9 Results of Test Case 3 (screen 1 of 2)


Test Case 3 results for Maximum Size Matching and Results Display stages are shown in Figure 5.10.

Figure 5.10 Results of Test Case 3 (screen 2 of 2)


5.2.4 Test Case 4 Results

Test Case 4 input parameters are shown in Figure 5.11.

Figure 5.11 GUI Screen Showing the Test Case 4 Input Parameters


Test Case 4 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.12.

Figure 5.12 Results of Test Case 4 (screen 1 of 2)


Test Case 4 results for Maximum Size Matching and Results Display stages are shown in Figure 5.13.

Figure 5.13 Results of Test Case 4 (screen 2 of 2)


5.2.5 Test Case 5 Results

Test Case 5 input parameters are shown in Figure 5.14.

Figure 5.14 GUI Screen Showing the Test Case 5 Input Parameters


Test Case 5 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.15.

Figure 5.15 Results of Test Case 5 (screen 1 of 2)


Test Case 5 results for Maximum Size Matching and Results Display stages are shown in Figure 5.16.

Figure 5.16 Results of Test Case 5 (screen 2 of 2)


5.2.6 Test Case 6 Results

Test Case 6 input parameters are shown in Figure 5.17.

Figure 5.17 GUI Screen Showing the Test Case 6 Input Parameters


Test Case 6 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.18.

Figure 5.18 Results of Test Case 6 (screen 1 of 2)


Test Case 6 results for Maximum Size Matching and Results Display stages are shown in Figure 5.19.

Figure 5.19 Results of Test Case 6 (screen 2 of 2)


5.2.7 Test Case 7 Results

Test Case 7 input parameters are shown in Figure 5.20.

Figure 5.20 GUI Screen Showing the Test Case 7 Input Parameters


Test Case 7 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction, Dense Segment Extraction, Maximum Size Matching and Results Display stages are shown in Figure 5.21.

Figure 5.21 Results of Test Case 7 (screen 1 of 1)


CHAPTER 6 CONCLUSION


6. Conclusion
In this project, the similar frames of the query clip are retrieved by a batch query algorithm. Then, a bipartite graph is constructed to exploit the opportunity of spatial pruning. Thus, the high-dimensional query and the database video sequence can be transformed into the two sides of a bipartite graph. Only the dense segments are roughly obtained as possibly similar subsequences. In the filter-and-refine phase, some nonsimilar segments are first filtered out; the remaining relevant segments are then processed to quickly identify the most suitable 1:1 mapping by optimizing the factors of visual content, temporal order, and frame alignment together. This is an effective and efficient query processing strategy for temporal localization of similar content in a long unsegmented video stream, considering that the target subsequence may be an approximate occurrence with potentially different ordering or length from the query clip. With the provision of configurable thresholds, the video similarity decision can be adjusted by user feedback. Selection of the Bipartite Similarity Threshold and the Adjacent Similarity Threshold depends on the display resolution, color distribution, and Video Splitting Rate, and a heuristic approach needs to be used while setting these thresholds. The hit ratio of the query depends on the Number of KNN Samples.

6.1 Future Work

This project can be extended by incorporating enhanced image processing techniques, such as filtering and image transformation, to process queries more effectively. With these image processing techniques, videos with slightly different backgrounds could still pass the video similarity search criteria. Currently, video subsequence identification is performed at the frame level. New strategies can be implemented to perform video subsequence identification at the shot/scene level.


REFERENCES
[1] Heng Tao Shen, Jie Shao, Zi Huang, and Xiaofang Zhou, "Effective and Efficient Query Processing for Video Subsequence Identification," IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 3, March 2009, pp. 321-334.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier.
[3] Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer.
[4] http://en.wikipedia.org/wiki/Video
[5] http://en.wikipedia.org/wiki/Comparison_of_container_formats#Video_formats_supported
[6] http://en.wikipedia.org/wiki/Image_file_formats
[7] http://en.wikipedia.org/wiki/Bipartite_graph
[8] http://en.wikipedia.org/wiki/KNN
[9] http://en.wikipedia.org/wiki/Euclidean_distance
[10] http://en.wikipedia.org/wiki/Minkowski_distance
[11] http://en.wikipedia.org/wiki/Earth_mover%27s_distance
[12] http://sourceforge.net/projects/opencvlibrary/


Appendix A: openCV APIs


cvCaptureFromFile: This API allocates and initializes the CvCapture structure for reading the video stream from the specified file.
cvGetCaptureProperty: This API retrieves the specified property of the video file. Properties such as frame rate, frame width, and frame height can be retrieved.
cvSetCaptureProperty: This API sets the specified property of video capturing.
cvGrabFrame: This API grabs a frame from a video file. The grabbed frame is stored internally.
cvRetrieveFrame: This API returns the pointer to the image grabbed with the cvGrabFrame API.
cvQueryFrame: This API grabs a frame from a video file, decompresses it, and returns it. It is simply a combination of cvGrabFrame and cvRetrieveFrame.
cvReleaseCapture: This API releases the CvCapture structure allocated by the cvCaptureFromFile API.
cvLoadImage: This API loads an image from the specified file and returns a pointer to the loaded image. Image file formats such as bmp, jpg, png, and tiff are supported.
cvSaveImage: This API saves an image to the specified file. Image file formats such as bmp, jpg, png, and tiff are supported.


Appendix B: Definitions/Acronyms
KNN: K Nearest Neighborhood
DTW: Dynamic Time Warping
LCSS: Longest Common Subsequence Similarity
EMD: Earth Mover's Distance
PNG: Portable Network Graphics
GIF: Graphics Interchange Format
TIFF: Tagged Image File Format
Exif: Exchangeable Image File Format
JPEG: Joint Photographic Experts Group
JFIF: JPEG File Interchange Format
RIFF: Resource Interchange File Format
WMV: Windows Media Video
AVI: Audio Video Interleave
MPEG: Moving Picture Experts Group
PAL: Phase Alternating Line
SECAM: Sequential Color with Memory
NTSC: National Television System Committee
SDTV: Standard-Definition Television
HDTV: High-Definition Television
GUI: Graphical User Interface


Appendix C: Software and Hardware Specification


Operating System: Microsoft Windows XP
Software: Microsoft Visual Studio, openCV2.1, QuickTime Player
Hard Disk Space: Minimum 80 GB
Memory: Minimum 1 GB RAM
Languages: C, C#

