A Project report submitted in partial fulfillment of the requirements for the award of the Degree of Master of Technology in Computer Science
By M. HARI CHARAN GUPTA Roll No: 07012D0507 Under the guidance of Mrs. SUPREETHI K P Assistant Professor
Department of Computer Science and Engineering JNTUH College of Engineering, Hyderabad Jawaharlal Nehru Technological University Hyderabad Kukatpally, Hyderabad-500 085. 2011
Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad College of Engineering, Kukatpally, Hyderabad-500 085.
M. Hari Charan Gupta Roll No: 07012D0507 Department of Computer Science & Engineering, JNTUH College of Engineering, Hyderabad.
Mrs. Supreethi K P Assistant Professor, Department of Computer Science & Engineering, JNTUH College of Engineering, Hyderabad. Date:
Dr. O.B.V. Ramanaiah Professor & Head of the Department, Department of Computer Science & Engineering, JNTUH college of Engineering, Hyderabad. Date:
ACKNOWLEDGEMENTS
It is with great reverence that I wish to express my sincere gratitude towards Mrs. Supreethi K P, Assistant Professor, JNTUH College of Engineering, JNTUH Hyderabad, for her astute guidance, constant motivation, and trust, without which this work would never have been possible. I am sincerely indebted to her for her constructive criticism and suggestions for improvement at various stages of the work. I would like to express my sincere thanks to Dr. O.B.V. Ramanaiah, Professor and Head of the Department of Computer Science & Engineering, JNTUH College of Engineering, JNTUH Hyderabad, for his kind cooperation in the completion of this project. I am grateful to my family for their perennial inspiration. Last but not least, I would like to thank all my friends and batchmates.
Abstract
In the current Internet community, secure data transfer is limited by the attacks made on data communication, so more robust methods are needed to ensure secure data transfer. One solution that has come to the rescue is audio steganography. However, existing audio steganographic systems have poor interfaces and very low-level implementations, are difficult to understand, and are valid only for certain audio formats with restricted message sizes. Enhanced Audio Steganography (EAS) is a proposed system, based on audio steganography and cryptography, that ensures secure data transfer between the source and the destination. EAS uses a powerful encryption algorithm in the first level of security, which is very complex to break. In the second level it uses a modified LSB (Least Significant Bit) algorithm to encode the message into audio, performing bit-level manipulation. The basic idea behind this work is to provide a good, efficient method for hiding data from hackers and sending it to the destination in a safe manner. Though it is well-modulated software, it is subject to certain restrictions: the quality of the sound depends on the size of the audio the user selects and on the length of the message. Though it shows bit-level deviations in the frequency chart, as a whole the change in the audio cannot be detected.
Contents
1. Introduction........................................................................................2
1.1 Motivation.......................................................................................................2
1.2 Problem Definition..........................................................................................3
1.3 Organization of Report....................................................................................3
3. Proposed System..............................................................................40
3.1 Video Subsequence Identification................................................................40
3.2 Applications of Video Subsequence Identification.......................................41
3.2.1 Recognition for Copyright Enforcement.....................................................................41
3.2.2 TV Commercial Detection...........................................................................................42
4. Implementation ..............................................................................44
4.1 Video Fragmentation....................................................................................44
4.1.1 Supported Video Types...............................................................................................44
4.1.2 Capturing Video Properties.........................................................................................44
4.1.3 Capturing Images from Video at Periodic Intervals....................................................44
4.1.4 Image File Type Supported.........................................................................................45
4.2 Similar Frames Retrieval Using K-Nearest Neighborhood..........................45
4.3 Graph Transformation and Matching............................................................47
4.3.1 Bipartite Graph Transformation..................................................................................47
4.3.2 Dense Segment Extraction...........................................................................................50
4.3.3 Filtering by Maximum Size Matching.........................................................................52
4.3.4 Refinement by Sub-Maximum Similarity Matching...................................................54
6. Conclusion........................................................................................82
6.1 Future Work..................................................................................................82
List of Figures

Figure 2.1 The Intuition Behind the Euclidean Distance Metric....26
Figure 2.2 Two Time Series that Require a Warping Measure......27
Figure 2.3 Simple Bipartite Graph.....................................................31
Figure 4.4 Construction of Bipartite Graph......................................48
Figure 4.5 1:M Mapping.....................................................................52
Figure 4.6 1:1 Mapping.......................................................................53
Figure 4.7 Graphical User Interface..................................................57
Figure 5.8 Test Plan Approach...........................................................60
Figure 5.9 GUI Screen Showing the Test Case 1 Input Parameters....61
Figure 5.10 Results of Test Case 1 (screen 1 of 2).............................62
Figure 5.11 Results of Test Case 1 (screen 2 of 2).............................63
Figure 5.12 GUI Screen Showing the Test Case 2 Input Parameters....64
Figure 5.13 Results of Test Case 2 (screen 1 of 2).............................65
Figure 5.14 Results of Test Case 2 (screen 2 of 2).............................66
Figure 5.15 GUI Screen Showing the Test Case 3 Input Parameters....67
Figure 5.16 Results of Test Case 3 (screen 1 of 2).............................68
Figure 5.17 Results of Test Case 3 (screen 2 of 2).............................69
Figure 5.18 GUI Screen Showing the Test Case 4 Input Parameters....70
Figure 5.19 Results of Test Case 4 (screen 1 of 2).............................71
Figure 5.20 Results of Test Case 4 (screen 2 of 2).............................72
Figure 5.21 GUI Screen Showing the Test Case 5 Input Parameters....73
Figure 5.22 Results of Test Case 5 (screen 1 of 2).............................74
Figure 5.23 Results of Test Case 5 (screen 2 of 2).............................75
Figure 5.24 GUI Screen Showing the Test Case 6 Input Parameters....76
Figure 5.25 Results of Test Case 6 (screen 1 of 2).............................77
Figure 5.26 Results of Test Case 6 (screen 2 of 2).............................78
Figure 5.27 GUI Screen Showing the Test Case 7 Input Parameters....79
Figure 5.28 Results of Test Case 7 (screen 1 of 1).............................80
CHAPTER 1 INTRODUCTION
1. Introduction
Video subsequence identification aims at finding whether any subsequence of a long database video shares content similar to a query clip.
1.1 Motivation
With the growing demand for visual information of rich content, effective and efficient manipulation of large video databases is increasingly desired. Many investigations have been made into content-based video retrieval. However, despite its importance, video subsequence identification, which is to find content similar to a short query clip within a long video sequence, has not been well addressed.

Nowadays, rapid advances in multimedia and network technologies have popularized many applications of video databases, and sophisticated techniques for representing, matching, and indexing videos are in high demand. A video sequence is an ordered set of a large number of frames, and from the database research perspective each frame is usually represented by a high-dimensional vector extracted from low-level content features, such as color distribution, texture pattern, or shape structure, within the original media domain. Matching of videos is thus often translated into searches among these feature vectors. In practice, it is undesirable to manually check whether a video is part of a long stream by browsing its entire length, so a reliable solution for automatically finding similar content is imperative.

The conventional video retrieval task returns similar clips from a large collection of videos that have been either chopped into similar lengths or cut at content boundaries; the clips to be searched have already been segmented and are always ready for similarity ranking. The video subsequence identification task, in contrast, aims at finding whether any subsequence of a long database video shares similar content with a query clip. Because neither the boundary nor the length of the target subsequence is available initially, which fragments to evaluate for similarity cannot be known in advance.
1.2 Problem Definition
The aim of the project is to develop an application that locates the position of the part most similar to a user-specified query clip Q within a long prestored video sequence S. The video subsequence identification task aims at finding whether any subsequence of a long database video shares similar content with the query clip. The thresholds used for video similarity search need to be configurable so that the similarity search can be adjusted by user feedback.
1.3 Organization of Report
This report is divided into the following chapters. Chapter 2 provides an overview of the literature survey and details of the existing systems that are relevant to this project. Chapter 3 provides details of the proposed system. Chapter 4 provides the design and implementation details of the project. Chapter 5 deals with testing and results of the proposed system. Chapter 6 gives the conclusions deduced from the results and the scope for further work, followed by the references and Appendices A, B, and C.
2. Literature Survey
The literature survey elaborates on video properties, video formats, video splitting strategies, image file formats, distance measures, time series similarity measures, bipartite graphs, k-nearest neighborhood, and the existing system.
2.1 Video
Video [4] is the technology of electronically capturing, recording, processing, storing, transmitting, and reconstructing a sequence of still images representing scenes in motion. The term video is derived from the Latin verb videre, "to see" (video is literally "I see"). Video commonly refers to several storage formats for moving pictures: digital video formats, including Blu-ray Disc, DVD, QuickTime, and MPEG-4; and analog videotapes, including VHS and Betamax. Video can be recorded and transmitted in various physical media: on magnetic tape when recorded as PAL or NTSC electric signals by video cameras, or in MPEG-4 or DV digital media when recorded by digital cameras. The quality of video essentially depends on the capturing method and the storage used. A video stream has the following characteristics:

Number of Frames per Second
Interlacing
Display Resolution
Aspect Ratio
Color Space and Bits per Pixel
Video Quality
Video Compression Method
Bit Rate
2.1.1 Number of Frames per Second

Frame rate, the number of still pictures per unit of time of video, ranges from six or eight frames per second (frame/s) for old mechanical cameras to 120 or more frames per second for new professional cameras. The PAL (Europe, Asia, Australia, etc.) and SECAM (France, Russia, parts of Africa, etc.) standards specify 25 frame/s, while NTSC (USA, Canada, Japan, etc.) specifies 29.97 frame/s. The minimum frame rate needed to achieve the illusion of a moving image is about fifteen frames per second.

2.1.2 Interlacing

Video can be interlaced or progressive. Interlacing was invented as a way to achieve good visual quality within the limitations of a narrow bandwidth. The horizontal scan lines of each interlaced frame are numbered consecutively and partitioned into two fields: the odd field (upper field) consisting of the odd-numbered lines and the even field (lower field) consisting of the even-numbered lines. NTSC, PAL, and SECAM are interlaced formats. The PAL video format is often specified as 576i50, where 576 indicates the vertical line resolution, i indicates interlacing, and 50 indicates 50 fields (half-frames) per second. In progressive scan systems, each refresh period updates all of the scan lines. The result is a higher spatial resolution and a lack of the various artifacts that can make parts of a stationary picture appear to be moving or flashing.

2.1.3 Display Resolution

The size of a video image is measured in pixels for digital video, or in horizontal scan lines and vertical lines of resolution for analog video. In the digital domain, standard-definition television (SDTV) is specified as 720/704/640×480i60 for NTSC and 768/720×576i50 for PAL or SECAM resolution. New high-definition televisions (HDTV) are capable of resolutions up to 1920×1080p60, i.e. 1920 pixels per scan line by 1080 scan lines, progressive, at 60 frames per second.

2.1.4 Color Space and Bits per Pixel

In digital imaging, a pixel is the smallest addressable screen element in a display device; it is the smallest unit of a picture that can be represented or controlled. Each pixel has its own address, which corresponds to its coordinates. Pixels are normally arranged in a two-dimensional grid and are often represented using dots or squares. Each pixel is a sample of an original image; more samples typically provide a more accurate representation of the original. The intensity of each pixel is variable. In color image systems, a color is typically represented by three or four component intensities such as red, green, and blue, or cyan, magenta, yellow, and black. The number of distinct colors that can be represented by a pixel depends on the number of bits per pixel (bpp). A 1 bpp image uses 1 bit for each pixel, so each pixel can be either on or off. Each additional bit doubles the number of colors available, so a 2 bpp image can have 4 colors and a 3 bpp image can have 8 colors:

16 bpp: 2^16 = 65,536 colors (High Color)
24 bpp: 2^24 ≈ 16.8 million colors (True Color)
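The doubling rule above can be sketched in a few lines of Python; the helper name `distinct_colors` is illustrative, not part of any standard library:

```python
def distinct_colors(bpp: int) -> int:
    """Number of distinct colors representable with bpp bits per pixel: 2**bpp."""
    return 2 ** bpp

for bpp in (1, 2, 3, 16, 24):
    print(bpp, "bpp ->", distinct_colors(bpp), "colors")
```

Running this reproduces the High Color and True Color figures quoted above.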
2.1.5 Aspect Ratio

Aspect ratio describes the dimensions of video screens and video picture elements. All popular video formats are rectilinear and so can be described by a ratio between width and height. The screen aspect ratio of a traditional television screen is 4:3, or about 1.33:1. High-definition televisions use an aspect ratio of 16:9, or about 1.78:1. Ratios where the height is greater than the width are uncommon in everyday use but do have application in computer systems where the screen may be better suited to a vertical layout. The most common tall aspect ratio of 3:4 is referred to as portrait mode and is created by physically rotating the display device 90 degrees from the normal position.

2.1.6 Video Compression Method

A wide variety of methods are used to compress video streams. Video data contains spatial and temporal redundancy, making uncompressed video streams extremely inefficient. Broadly speaking, spatial redundancy is reduced by registering differences between parts of a single frame; this task is known as intraframe compression and is closely related to image compression. Likewise, temporal redundancy can be reduced by registering differences between frames; this task is known as interframe compression and includes motion compensation and other techniques.

2.2 Video Formats
Following are some of the video formats [5]:

AVI (Audio Video Interleave)
MP4
MOV
WMV (Windows Media Video)
3GP
3G2
FLV (Flash Video)
2.2.1 AVI AVI is a multimedia container format introduced by Microsoft in November 1992 as part of its Video for Windows technology. AVI files can contain both audio and video data in a file container that allows synchronous audio-with-video playback. Like the
DVD video format, AVI files support multiple streaming audio and video, although these features are seldom used. AVI is a derivative of the Resource Interchange File Format (RIFF), which divides a file's data into blocks, or chunks. Each chunk is identified by a FourCC tag. An AVI file takes the form of a single chunk in a RIFF formatted file, which is then subdivided into two mandatory "chunks" and one optional chunk. The first sub-chunk is identified by the hdrl tag. This sub-chunk is the file header and contains metadata about the video, such as its width, height and frame rate. The second sub-chunk is identified by the movi tag. This chunk contains the actual audio/visual data that make up the AVI movie. The third optional sub-chunk is identified by the idx1 tag which indexes the offsets of the data chunks within the file. By way of the RIFF format, the audio-visual data contained in the movi chunk can be encoded or decoded by software called a codec, which is an abbreviation for (en)coder/decoder. Upon creation of the file, the codec translates between raw data and the (compressed) data format used inside the chunk.
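As an illustration of the RIFF layout described above, the following Python sketch parses the 12-byte header that opens every AVI file (the FourCC tag, the payload size, and the form type). The function name and the synthetic header bytes are illustrative, not taken from this project's implementation; walking the hdrl, movi, and idx1 sub-chunks would follow the same unpack pattern.

```python
import struct

def parse_riff_header(data: bytes):
    """Parse the 12-byte RIFF header that starts every AVI file.

    Returns (chunk_id, payload_size, form_type), e.g. (b'RIFF', n, b'AVI ').
    """
    if len(data) < 12:
        raise ValueError("need at least 12 bytes")
    chunk_id, size = struct.unpack("<4sI", data[:8])  # little-endian FourCC + size
    form_type = data[8:12]                            # b'AVI ' for AVI files
    return chunk_id, size, form_type

# A minimal synthetic header: FourCC 'RIFF', payload size 4, form type 'AVI '
header = b"RIFF" + struct.pack("<I", 4) + b"AVI "
print(parse_riff_header(header))  # (b'RIFF', 4, b'AVI ')
```

In a real AVI file the bytes would come from `open(path, "rb").read(12)`, and the size field would cover the rest of the file.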
2.2.2 MP4 MP4 is a multimedia container format standard specified as a part of MPEG-4. It is most commonly used to store digital video and digital audio streams, especially those defined by MPEG, but can also be used to store other data such as subtitles and still images. Like most modern container formats, MP4 allows streaming over the Internet. A separate hint track is used to include streaming information in the file.
2.2.3 MOV The native file format for Quicktime video specifies a multimedia container file that contains one or more tracks, each of which stores a particular type of data: audio, video,
effects, or text (e.g. for subtitles). Each track either contains a digitally-encoded media stream (using a specific format) or a data reference to the media stream located in another file. The ability to contain abstract data references for the media data, and the separation of the media data from the media offsets and the track edit lists means that QuickTime is particularly suited for editing, as it is capable of importing and editing in place (without data copying).
2.2.4 WMV Windows Media Video (WMV) is a video compression format for several proprietary codecs developed by Microsoft. The original video format, known as WMV, was originally designed for Internet streaming applications, as a competitor to RealVideo. The other formats, such as WMV Screen and WMV Image, cater for specialized content.
2.2.5 3GP 3GP (3GPP file format) is a multimedia container format defined by the Third Generation Partnership Project (3GPP) for 3G UMTS multimedia services. It is used on 3G mobile phones but can also be played on some 2G and 4G phones.
2.2.6 3G2 3G2 (3GPP2 file format) is a multimedia container format defined by the 3GPP2 for 3G CDMA2000 multimedia services. It is very similar to the 3GP file format, but has some extensions and limitations in comparison to 3GP.
2.2.7 FLV

Flash Video is a container file format used to deliver video over the Internet using Adobe Flash Player versions 6-10. Flash Video content may also be embedded within SWF files. There are two different video file formats known as Flash Video: FLV and F4V. The audio and video data within FLV files are encoded in the same way as they are
within SWF files. The latter F4V file format is based on the ISO base media file format and is supported starting with Flash Player 9 update 3. Both formats are supported in Adobe Flash Player and currently developed by Adobe Systems. FLV was originally developed by Macromedia.
2.3 Video Splitting Strategies

2.4 Image File Formats
Image compression uses algorithms to decrease the size of a file. High-resolution cameras produce large image files, ranging from hundreds of kilobytes to megabytes, depending on the camera's resolution and the image-storage format. High-resolution digital cameras record images of 12 megapixels (1 MP = 1 million pixels) or more in truecolor. Faced with large file sizes, both within the camera and on a storage disc, image file formats were developed to store such large images.

There are two types of image file compression algorithms. Lossless compression algorithms reduce file size without losing image quality, though they do not compress to as small a file as a lossy algorithm. When image quality is valued above file size, lossless algorithms are typically chosen. Lossy compression algorithms take advantage of the inherent limitations of the human eye and discard invisible information. Most lossy compression algorithms allow for variable quality levels (compression), and as these levels are increased, file size is reduced. At the highest compression levels, image deterioration becomes noticeable as compression artifacting.

Following are some of the image formats [6]:

JPEG/JFIF
JPEG 2000
Exif
TIFF
RAW
GIF
PNG
BMP
PPM, PGM, PBM, PNM
2.4.1 JPEG/JFIF JPEG (Joint Photographic Experts Group) is a compression method. JPEG-compressed images are usually stored in the JFIF (JPEG File Interchange Format) file format. JPEG compression is (in most cases) lossy compression. The JPEG/JFIF filename extension is JPG or JPEG. Nearly every digital camera can save images in the JPEG/JFIF format, which supports 8 bits per color (red, green, blue) for a 24-bit total, producing relatively small files. The compression does not noticeably detract from the image's quality, but JPEG files suffer generational degradation when repeatedly edited and saved. The JPEG/JFIF format also is used as the image compression algorithm in many PDF files.
2.4.2 JPEG 2000 JPEG 2000 is a compression standard enabling both lossless and lossy storage. The compression methods used are different from the ones in standard JFIF/JPEG. They improve quality and compression ratios, but also require more computational power to process. JPEG 2000 also adds features that are missing in JPEG. It is not nearly as common as JPEG, but it is used currently in professional movie editing and distribution.
2.4.3 Exif The Exif (Exchangeable image file format) format is a file standard similar to the JFIF format with TIFF extensions. It is incorporated in the JPEG-writing software used in most cameras. Its purpose is to record and to standardize the exchange of images with image metadata between digital cameras and editing and viewing software. The metadata are recorded for individual images and include such things as camera settings, time and date, shutter speed, exposure, image size, compression, name of camera, color information, etc. When images are viewed or edited by image editing software, all of this image information can be displayed.
13
2.4.4 TIFF

The TIFF (Tagged Image File Format) is a flexible format that normally saves 8 bits or 16 bits per color (red, green, blue), for 24-bit and 48-bit totals respectively. It usually uses either the TIFF or TIF filename extension. TIFF's flexibility can be both an advantage and a disadvantage, since no reader exists that handles every type of TIFF file. TIFFs can be lossy or lossless; some offer relatively good lossless compression for bi-level (black-and-white) images. Some digital cameras can save in TIFF format, using the LZW compression algorithm for lossless storage. The TIFF image format is not widely supported by web browsers, but TIFF remains widely accepted as a photograph file standard in the printing business. TIFF can handle device-specific color spaces, such as the CMYK defined by a particular set of printing press inks.
2.4.5 RAW RAW refers to a family of raw image formats that are options available on some digital cameras. These formats usually use a lossless or nearly-lossless compression, and produce file sizes much smaller than the TIFF formats of full-size processed images from the same cameras. Although there is a standard raw image format, the raw formats used by most cameras are not standardized or documented, and differ among camera manufacturers. Many graphic programs and image editors may not accept some or all of them, and some older ones have been effectively orphaned already. Adobe's Digital Negative (DNG) specification is an attempt at standardizing a raw image format to be used by cameras, or for archival storage of image data converted from undocumented raw image formats, and is used by several niche and minority camera manufacturers including Pentax, Leica, and Samsung.
2.4.6 GIF GIF (Graphics Interchange Format) is limited to an 8-bit palette, or 256 colors. This makes the GIF format suitable for storing graphics with relatively few colors such as simple diagrams, shapes, logos and cartoon style images. The GIF format supports
animation and is still widely used to provide image animation effects. It also uses a lossless compression that is more effective when large areas have a single color, and ineffective for detailed images or dithered images.
2.4.7 PNG The PNG (Portable Network Graphics) file format was created as the free, open-source successor to the GIF. The PNG file format supports truecolor (16 million colors) while the GIF supports only 256 colors. The PNG file excels when the image has large, uniformly colored areas. The lossless PNG format is best suited for editing pictures, and the lossy formats, like JPG, are best for the final distribution of photographic images, because in this case JPG files are usually smaller than PNG files. PNG provides a patent-free replacement for GIF and can also replace many common uses of TIFF. Indexed-color, grayscale, and truecolor images are supported, plus an optional alpha channel. PNG is designed to work well in online viewing applications like web browsers so it is fully streamable with a progressive display option. PNG is robust, providing both full file integrity checking and simple detection of common transmission errors.
2.4.8 BMP

The BMP file format (Windows bitmap) handles graphics files within the Microsoft Windows OS. Typically, BMP files are uncompressed and hence large; the advantage is their simplicity and wide acceptance in Windows programs.
2.4.9 PPM, PGM, PBM, PNM

Netpbm format is a family including the portable pixmap file format (PPM), the portable graymap file format (PGM), and the portable bitmap file format (PBM). These are either pure ASCII files or raw binary files with an ASCII header that provide very basic functionality and serve as a lowest common denominator for converting pixmap, graymap, or bitmap files between different platforms. Several applications refer to them collectively as the PNM (Portable Any Map) format.
2.5 Distance Measures
Answering queries based on objects that are alike but may not be exactly the same is known as similarity search. It has been widely used in image retrieval, where it is required to determine whether two images are similar or dissimilar. Distance measures [1, 2, 3, 9, 10] are used to determine the similarity or dissimilarity between any pair of objects. It is useful to denote the distance between two instances x_i and x_j as d(x_i, x_j). A valid distance measure should be symmetric and should obtain its minimum value (usually zero) for identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:

Triangle inequality: d(x_i, x_k) ≤ d(x_i, x_j) + d(x_j, x_k) for all x_i, x_j, x_k in X

Identity: d(x_i, x_j) = 0 implies x_i = x_j, for all x_i, x_j in X
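A minimal Python sketch, assuming plain Euclidean distance on small illustrative vectors, can check these metric properties numerically (the tiny epsilon guards against floating-point rounding in the triangle-inequality check):

```python
import math
import itertools

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical sample feature vectors, standing in for frame features
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 2.0)]

# Symmetry, and minimum value zero exactly for identical vectors
for x, y in itertools.product(points, repeat=2):
    assert euclidean(x, y) == euclidean(y, x)
    assert (euclidean(x, y) == 0) == (x == y)

# Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
for x, y, z in itertools.product(points, repeat=3):
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```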
2.5.1 Minkowski Distance Measure for Numeric Attributes

Given two p-dimensional instances, x_i = (x_i1, x_i2, ..., x_ip) and x_j = (x_j1, x_j2, ..., x_jp), the distance between the two data instances can be calculated using the Minkowski metric:

d(x_i, x_j) = (|x_i1 - x_j1|^g + |x_i2 - x_j2|^g + ... + |x_ip - x_jp|^g)^(1/g)
The commonly used Euclidean distance between two objects is obtained when g = 2. Given g = 1, the sum of absolute paraxial distances (Manhattan metric) is obtained, and with g = ∞ one gets the greatest of the paraxial distances.

The measurement unit used can affect the clustering analysis. To avoid dependence on the choice of measurement units, the data should be standardized. Standardizing measurements attempts to give all variables an equal weight. However, if each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

d(x_i, x_j) = (w_1|x_i1 - x_j1|^g + w_2|x_i2 - x_j2|^g + ... + w_p|x_ip - x_jp|^g)^(1/g)

where w_i ∈ [0, ∞).
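The Minkowski family above, including the weighted variant, can be sketched as one Python function; the vectors and weights here are illustrative. The g = ∞ case is handled as a special case: in that limit the (strictly positive) weights drop out and only the largest paraxial distance survives.

```python
import math

def minkowski(x, y, g=2.0, w=None):
    """Weighted Minkowski distance of order g between equal-length vectors.

    g = 1 -> Manhattan, g = 2 -> Euclidean, g = inf -> greatest paraxial distance.
    Weights w_i >= 0 default to 1 (the unweighted metric).
    """
    if w is None:
        w = [1.0] * len(x)
    if math.isinf(g):
        # Limit case: finite positive weights vanish as g -> infinity
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(wi * abs(a - b) ** g for wi, a, b in zip(w, x, y)) ** (1.0 / g)

xi, xj = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(xi, xj, g=1))         # 7.0 (Manhattan)
print(minkowski(xi, xj, g=2))         # 5.0 (Euclidean)
print(minkowski(xi, xj, g=math.inf))  # 4.0 (greatest paraxial distance)
```

The g = 2 case is the Euclidean metric of Section 2.5.2 and g = 1 the Manhattan metric of Section 2.5.3, so one function covers all three.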
2.5.2 Euclidean Distance Measure for Numeric Attributes

Given two p-dimensional instances, x_i = (x_i1, x_i2, ..., x_ip) and x_j = (x_j1, x_j2, ..., x_jp), the distance between the two data instances can be calculated using the Euclidean metric:

d(x_i, x_j) = (|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)^(1/2)
If each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

d(x_i, x_j) = (w_1|x_i1 - x_j1|^2 + w_2|x_i2 - x_j2|^2 + ... + w_p|x_ip - x_jp|^2)^(1/2)

where w_i ∈ [0, ∞).
2.5.3 Manhattan Distance Measure for Numeric Attributes

Given two p-dimensional instances, x_i = (x_i1, x_i2, ..., x_ip) and x_j = (x_j1, x_j2, ..., x_jp), the distance between the two data instances can be calculated using the Manhattan metric:

d(x_i, x_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

If each variable is assigned a weight according to its importance, then the weighted distance can be computed as:

d(x_i, x_j) = w_1|x_i1 - x_j1| + w_2|x_i2 - x_j2| + ... + w_p|x_ip - x_jp|

where w_i ∈ [0, ∞).
2.5.4 Distance Measures for Binary Attributes The distance measure for numeric attributes may be easily computed for continuousvalued attributes. In the case of instances described by categorical, binary, ordinal or mixed type attributes, the distance measure should be revised. In the case of binary attributes, the distance between objects may be calculated based on a contingency table. A binary attribute is symmetric if both of its states are equally valuable. In that case, using the simple matching coefficient can assess dissimilarity between two objects:
r+ s d ( xi , x j ) = q+ r+ s+ t
where q is the number of attributes that equal 1 for both objects; t is the number of attributes that equal 0 for both objects; and s and r are the number of attributes that are unequal for both objects.
A binary attribute is asymmetric, if its states are not equally important (usually the positive outcome is considered more important). In this case, the denominator ignores the unimportant negative matches (t). This is called the Jaccard coefficient:
d(xi, xj) = (r + s) / (q + r + s)
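Both coefficients can be computed from the same contingency counts; a small illustrative sketch (the function name and vectors are assumptions):

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Simple matching coefficient if symmetric, else Jaccard coefficient."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1-1 matches
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # 0-0 matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = q + r + s + t if symmetric else q + r + s      # Jaccard drops t
    return (r + s) / denom

xi = [1, 0, 1, 0, 0, 0]
xj = [1, 1, 0, 0, 0, 0]
print(binary_dissimilarity(xi, xj))                   # (1+1)/6
print(binary_dissimilarity(xi, xj, symmetric=False))  # (1+1)/3
```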
2.5.5 Distance Measures for Nominal Attributes

When the attributes are nominal, two main approaches may be used:

1. Simple matching:

d(xi, xj) = (p - m) / p

where p is the total number of attributes and m is the number of matches.

2. Creating a binary attribute for each state of each nominal attribute and computing the dissimilarity as described above.
2.5.6 Distance Metrics for Ordinal Attributes

When the attributes are ordinal, the sequence of the values is meaningful. In such cases, the attributes can be treated as numeric ones after mapping their range onto [0,1]. Such mapping may be carried out as follows:

zi,n = (ri,n - 1) / (Mn - 1)

where zi,n is the standardized value of attribute an of object i, ri,n is that value before standardization, and Mn is the upper limit of the domain of attribute an (assuming the lower limit is 1).
2.5.7 Distance Metrics for Mixed-Type Attributes

In the cases where the instances are characterized by attributes of mixed type, one may calculate the distance by combining the methods mentioned above. For instance, when calculating the distance between instances i and j using a metric such as the Euclidean distance, one may calculate the difference between nominal and binary attributes as 0 or 1 (match or mismatch, respectively), and the difference between numeric attributes as the difference between their normalized values. The square of each such difference will be added to the total distance.
The dissimilarity d(xi, xj) between two instances, containing p attributes of mixed types, is defined as:

d(xi, xj) = [ Σ(n=1..p) δij(n) d(n)(xi, xj) ] / [ Σ(n=1..p) δij(n) ]

where the indicator δij(n) = 0 if one of the two values is missing, and δij(n) = 1 otherwise. The contribution of attribute n to the distance between the two objects, d(n)(xi, xj), is computed according to its type:
- If the attribute is binary or categorical, d(n)(xi, xj) = 0 if xin = xjn; otherwise d(n)(xi, xj) = 1.
- If the attribute is continuous-valued,

d(n)(xi, xj) = |xin - xjn| / (max_h xhn - min_h xhn)

where h runs over all non-missing objects for attribute n.
- If the attribute is ordinal, the standardized values of the attribute are computed first and then zi,n is treated as continuous-valued.

2.5.8 Earth Mover's Distance

The Earth Mover's Distance (EMD) [1, 3, 11] is a method to evaluate dissimilarity between two multi-dimensional distributions in some feature space, where a distance measure between single features, known as the ground distance, is given. The EMD lifts this distance from individual features to full distributions. Intuitively, given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance. A distribution can be represented by a set of clusters, where each cluster is represented by its mean (or mode) and by the fraction of the distribution that belongs to that cluster. Such a representation is called the signature of the distribution. The two signatures
can have different sizes; for example, simple distributions have shorter signatures than complex ones. Let P = { (p1, wp1), ..., (pm, wpm) } be the first signature with m clusters, where pi is the cluster representative and wpi is the weight of the cluster; Q = { (q1, wq1), ..., (qn, wqn) } the second signature with n clusters; and D = [dij] the ground distance matrix, where dij is the ground distance between clusters pi and qj. The goal is to find a flow F = [fij], with fij the flow between pi and qj, that minimizes the overall cost
WORK(P, Q, F) = Σ(i=1..m) Σ(j=1..n) fij dij

subject to the constraints:

fij ≥ 0,  1 ≤ i ≤ m, 1 ≤ j ≤ n

Σ(j=1..n) fij ≤ wpi,  1 ≤ i ≤ m

Σ(i=1..m) fij ≤ wqj,  1 ≤ j ≤ n

Σ(i=1..m) Σ(j=1..n) fij = min( Σ(i=1..m) wpi, Σ(j=1..n) wqj )
The first constraint allows moving supplies from P to Q and not vice versa. The next two constraints limit the amount of supplies that can be sent by the clusters in P to their weights, and the clusters in Q to receive no more supplies than their weights, and the last constraint forces moving the maximum amount of supplies possible. The earth mover's distance is defined as the work normalized by the total flow:
EMD(P, Q) = ( Σ(i=1..m) Σ(j=1..n) fij dij ) / ( Σ(i=1..m) Σ(j=1..n) fij )
The normalization factor is introduced in order to avoid favoring smaller signatures in the case of partial matching. The EMD has the following advantages:
- It naturally extends the notion of a distance between single elements to that of a distance between sets, or distributions, of elements.
- It can be applied to the more general variable-size signatures, which subsume histograms. Signatures are more compact, and the cost of moving earth reflects the notion of nearness properly, without the quantization problems of most other measures.
- It allows for partial matches in a very natural way. This is important, for instance, for image retrieval and in order to deal with occlusions and clutter.
- It is a true metric if the ground distance is a metric and the total weights of the two signatures are equal. This allows endowing image spaces with a metric structure.
- It is bounded from below by the distance between the centers of mass of the two signatures when the ground distance is induced by a norm. Using this lower bound in retrieval systems significantly reduced the number of EMD computations.
- It matches perceptual similarity better than other measures, when the ground distance is perceptually meaningful.
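As a concrete special case: for one-dimensional histograms of equal total weight with ground distance |i - j| between bins, the EMD reduces to the L1 distance between cumulative sums. A sketch of that special case only (not a general EMD solver, which requires solving a transportation problem):

```python
def emd_1d(p, q):
    """EMD between two 1D histograms of equal total weight,
    with ground distance |i - j| between bins."""
    assert abs(sum(p) - sum(q)) < 1e-9, "weights must match for this shortcut"
    work, carry = 0.0, 0.0
    for pi, qi in zip(p, q):
        carry += pi - qi   # earth carried past this bin boundary
        work += abs(carry) # each carried unit costs one bin of distance
    return work

print(emd_1d([1, 0, 0], [0, 0, 1]))  # move 1 unit across 2 bins -> 2.0
```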
2.6 Time Series Similarity Measures
2.6.1 Euclidean Distances and Lp Norms

One of the simplest similarity measures for time series is the Euclidean distance measure [1, 2, 3, 9]. If both time sequences are of the same length n, then each sequence can be viewed as a point in n-dimensional Euclidean space, and the dissimilarity between sequences C and Q can be defined as D(C, Q) = Lp(C, Q), i.e., the distance between the two points measured by the Lp norm (when p = 2, it reduces to the familiar Euclidean distance). Figure 2.1 shows the visual intuition behind the Euclidean distance metric. The two sequences Q and C appear to have approximately the same shape, but have different offsets in the Y-axis.
Figure 2.1 The Intuition Behind the Euclidean Distance Metric

Such a measure is simple to understand and easy to compute, which has ensured that the Euclidean distance is the most widely used distance measure for similarity search. However, one major disadvantage is that it is very brittle. It does not allow for a situation where two sequences are alike, but one has been stretched or compressed in the Y-axis. For example, a time series may fluctuate with small amplitude between 10 and 20, while another may fluctuate in a similar manner with larger amplitude between 20 and 40. The Euclidean distance between the two time series will be large. This problem can be dealt with easily with offset translation and amplitude scaling, which requires normalizing the sequences before applying the distance operator. More formally, let μ(C) and σ(C) be the mean and standard deviation of sequence C = {c1, ..., cn}. The sequence C is replaced by the normalized sequence C', where

c'i = (ci - μ(C)) / σ(C)
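The normalization above might be sketched as follows (a plain z-normalization; the function name is illustrative):

```python
import math

def z_normalize(c):
    """Replace sequence C by C' with zero mean and unit standard deviation."""
    mu = sum(c) / len(c)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in c) / len(c))
    return [(v - mu) / sigma for v in c]

# Two series with the same shape but different offset and amplitude
a = z_normalize([10, 20, 10, 20])
b = z_normalize([20, 40, 20, 40])
print(a == b)  # True: normalization removes offset and scale
```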
Even after normalization, the Euclidean distance measure may still be unsuitable for some time series domains since it does not allow for acceleration and deceleration along the time axis.
2.6.2 Dynamic Time Warping

In some time series domains, a very simple distance measure such as the Euclidean distance will suffice. However, it is often the case that two sequences have approximately the same overall component shapes, but these shapes do not line up on the X-axis. Figure 2.2 shows a simple example.
Figure 2.2 Two Time Series that Require a Warping Measure

In order to find the similarity between such sequences, or as a preprocessing step before averaging them, warping the time axis of one (or both) sequences is required to achieve a better alignment. Dynamic Time Warping (DTW) [1, 3] is a technique for effectively achieving this warping. Dynamic time warping is an extensively used technique in speech recognition, and allows acceleration-deceleration of signals along the time dimension. Consider two sequences (of possibly different lengths), C = {c1, . . . , cm} and Q = {q1, . . . , qn}. When computing the similarity of the two time series using Dynamic Time Warping, each sequence is allowed to be extended by repeating elements. A straightforward algorithm for computing the Dynamic Time Warping distance between two sequences uses a bottom-up dynamic programming approach, where the smaller subproblems D(i, j) are first determined, and then used to solve the larger subproblems, until D(m, n) is finally reached.
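The bottom-up dynamic programming scheme described above might be sketched as follows, assuming a squared-difference local cost (implementations vary on this choice):

```python
def dtw(c, q):
    """Dynamic Time Warping distance via bottom-up dynamic programming."""
    m, n = len(c), len(q)
    INF = float('inf')
    # D[i][j] = cost of the best warping path aligning c[:i] with q[:j]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (c[i - 1] - q[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],      # repeat element of q
                                 D[i][j - 1],      # repeat element of c
                                 D[i - 1][j - 1])  # advance both
    return D[m][n]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated 2
print(dtw([1, 2, 3], [1, 2, 4]))     # 1.0
```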
2.6.3 Longest Common Subsequence Similarity

The longest common subsequence similarity measure, or LCSS [1, 3], is a variation of edit distance used in speech recognition and text pattern matching. The basic idea is to match two sequences by allowing some elements to be unmatched. The advantage of the LCSS method is that some elements may be unmatched or left out (e.g. outliers), whereas in Euclidean and DTW, all elements from both sequences must be used, even the outliers.

Let C and Q be two sequences of length m and n, respectively. As was done with dynamic time warping, a recursive definition of the length of the longest common subsequence of C and Q is given. Let L(i, j) denote the length of the longest common subsequence of {c1, . . . , ci} and {q1, . . . , qj}. L(i, j) may be recursively defined as follows:

IF ci = qj THEN L(i, j) = 1 + L(i-1, j-1)
ELSE L(i, j) = max { L(i-1, j), L(i, j-1) }

The dissimilarity between C and Q is defined as
LCSS(C, Q) = (m + n - 2l) / (m + n)
where l is the length of the longest common subsequence. Intuitively, this quantity determines the minimum (normalized) number of elements that should be removed from and inserted into C to transform C into Q. As with dynamic time warping, the LCSS measure can be computed by dynamic programming in O(mn) time. This can be improved to O((n + m)w) time if a matching window of length w is specified (i.e., where |i - j| is allowed to be at most w).
With time series data, the requirement that the corresponding elements in the common subsequence should match exactly is rather rigid. This problem is addressed by allowing some tolerance (say ε > 0) when comparing elements. Thus, two elements a and b are said to match if a(1 - ε) < b < a(1 + ε).
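A sketch of the LCSS dissimilarity with the tolerance-based matching rule above (the function name is illustrative; eps = 0 recovers exact matching):

```python
def lcss_dissimilarity(c, q, eps=0.0):
    """LCSS dissimilarity (m + n - 2l) / (m + n); elements a and b match
    if a == b or a*(1 - eps) < b < a*(1 + eps)."""
    m, n = len(c), len(q)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = c[i - 1], q[j - 1]
            if a == b or a * (1 - eps) < b < a * (1 + eps):
                L[i][j] = 1 + L[i - 1][j - 1]
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    l = L[m][n]
    return (m + n - 2 * l) / (m + n)

print(lcss_dissimilarity([1, 2, 3], [1, 2, 3]))      # 0.0
print(lcss_dissimilarity([1, 99, 2, 3], [1, 2, 3]))  # outlier 99 skipped: 1/7
```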
2.6.3.1 Using Local Scaling Functions
The basic idea is that two sequences are similar if they have enough non-overlapping time-ordered pairs of contiguous subsequences that are similar. Two contiguous subsequences are similar if one can be scaled and translated appropriately to approximately resemble the other. The scaling and translation function is local, i.e., it may be different for other pairs of subsequences. The algorithmic challenge is to determine how and where to cut the original sequences into subsequences so that the overall similarity is maximized.

The first step is to find all pairs of atomic subsequences in the original sequences C and Q that are similar (atomic implies subsequences of a certain small size, say a parameter w). This step is done by a spatial self-join (using a spatial access structure such as an R-tree) over the set of all atomic subsequences. The next step is to stitch similar atomic subsequences to form pairs of larger similar subsequences. The last step is to find a non-overlapping ordering of subsequence matches having the longest match length. The stitching and subsequence ordering steps can be reduced to finding longest paths in a directed acyclic graph, where vertices are pairs of similar subsequences, and a directed edge denotes their ordering along the original sequences.
2.6.3.2 Using a Global Scaling Function
Instead of different local scaling functions that apply to different portions of the sequences, a simpler approach is to try and incorporate a single global scaling function with the LCSS similarity measure. An obvious method is to first normalize both sequences and then apply LCSS similarity to the normalized sequences. However, the disadvantage of this approach is that the normalization function is derived from all data
points, including outliers. This defeats the very objective of the LCSS approach, which is to ignore outliers in the similarity calculations. The basic idea is that two sequences C and Q are similar if there exist constants a and b, and long common subsequences C' and Q', such that Q' is approximately equal to aC' + b. The scale + translation linear function (i.e., the constants a and b) is derived from the subsequences, and not from the original sequences. Thus, outliers cannot taint the scale + translation function.
2.6.4 Probabilistic Methods

A different approach to time-series similarity is the use of a probabilistic similarity measure [3]. The methods above are distance based, whereas probabilistic methods are model based. Since time series similarity is inherently a fuzzy problem, probabilistic methods are well suited for handling noise and uncertainty. They are also suitable for handling scaling and offset translations. Finally, they provide the ability to incorporate prior knowledge into the similarity measure. However, it is not clear whether other problems such as time-series indexing, retrieval and clustering can be efficiently accomplished under probabilistic similarity measures. Given a sequence C, the basic idea is to construct a probabilistic generative model MC, i.e., a probability distribution on waveforms. Given a new sequence pattern Q, similarity is measured by computing p(Q | MC), i.e., the likelihood that MC generates Q.
2.6.5 General Transformations

Recognizing the importance of the notion of shape in similarity computations, an alternate approach can be considered: a general similarity framework involving a transformation rule language [3]. Each rule in the transformation language takes an input sequence and produces an output sequence, at a cost that is associated with the rule. The similarity of sequence C to sequence Q is the minimum cost of transforming C to Q by applying a sequence of such rules.
2.7 Bipartite Graph
In the mathematical field of graph theory, a bipartite graph (or bigraph) [1, 7] is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V, that is, U and V are independent sets. Equivalently, a bipartite graph is a graph that does not contain any odd-length cycles.
Figure 2.3 Simple Bipartite Graph

The two sets U and V may be thought of as a coloring of the graph with two colors. If all nodes in U are colored blue, and all nodes in V green, each edge has endpoints of differing colors, as is required in the graph coloring problem. In contrast, such a coloring is impossible in the case of a non-bipartite graph, such as a triangle. After one node is colored blue and another green, the third vertex of the triangle is connected to vertices of both colors, preventing it from being assigned either color. One often writes G = (U, V, E) to denote a bipartite graph whose partition has the parts U and V. If |U| = |V|, that is, if the two subsets have equal cardinality, then G is called a balanced bipartite graph.
Bipartite graphs can model the more general multigraph. Given a multigraph M, take U as the vertex set of M and V as the edge set of M. Then join an element of V to precisely the two elements of U which are the ends of that edge in M. Thus every multigraph is described entirely by a bipartite graph which is one-sided regular of degree 2, and vice versa. Similarly, every directed hypergraph can be represented as a bipartite digraph: take U as the vertex set of the hypergraph and V as its set of edges, and join u ∈ U to v ∈ V if the hyperedge v contains u (as an input or an output).

The following are properties of bipartite graphs:
- A graph is bipartite if and only if it does not contain an odd cycle. Therefore, a bipartite graph cannot contain a clique of size 3 or more.
- A graph is bipartite if and only if it is 2-colorable (i.e., its chromatic number is less than or equal to 2).
- The size of the minimum vertex cover is equal to the size of the maximum matching.
- The size of the maximum independent set plus the size of the maximum matching is equal to the number of vertices.
- For a connected bipartite graph, the size of the minimum edge cover is equal to the size of the maximum independent set.
- For a connected bipartite graph, the size of the minimum edge cover plus the size of the minimum vertex cover is equal to the number of vertices.
- Every bipartite graph is a perfect graph.
- The spectrum of a graph is symmetric if and only if it is a bipartite graph.
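The 2-colorability property suggests a simple bipartiteness test; a sketch using breadth-first 2-coloring (the adjacency-list representation is an assumption):

```python
from collections import deque

def is_bipartite(adj):
    """BFS 2-coloring: a graph is bipartite iff no edge joins equal colors."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]  # opposite color
                    queue.append(v)
                elif color[v] == color[u]:
                    return False             # odd cycle found
    return True

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}  # even cycle
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}           # odd cycle
print(is_bipartite(square), is_bipartite(triangle))    # True False
```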
2.8 k-Nearest Neighborhood
k-Nearest Neighbor search [1, 2, 3, 8] identifies the top k nearest neighbors to the query. This technique is commonly used in predictive analytics to estimate or classify a point based on the consensus of its neighbors. k-nearest neighbor graphs are graphs in which every point is connected to its k nearest neighbors.
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space. k-NN is a type of instance-based learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms.

An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor. The same method can be used for regression, by simply assigning the property value for the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. (A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is a generalization of linear interpolation.)

The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data.

Nearest neighbor rules in effect compute the decision boundary in an implicit manner. It is also possible to compute the decision boundary itself explicitly, and to do so in an efficient manner so that the computational complexity is a function of the boundary complexity. It is easy to implement k-NN by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows.
Many nearest neighbor search algorithms have been proposed that seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest
neighbor search algorithm makes k-NN computationally tractable even for large data sets.
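The classification rule described above (majority vote, with the optional 1/d weighting scheme) might be sketched as follows; the function name and toy data are illustrative:

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=False):
    """Classify `query` by a vote of its k nearest training points.
    `train` is a list of (point, label); with weighted=True each neighbor
    votes with weight 1/d (an exact match at d = 0 keeps weight 1 here)."""
    dists = sorted((math.dist(p, query), label) for p, label in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / d if weighted and d > 0 else 1.0
    return max(votes, key=votes.get)

train = [((0, 0), 'a'), ((0, 1), 'a'),
         ((5, 5), 'b'), ((5, 6), 'b'), ((6, 5), 'b')]
print(knn_classify(train, (1, 1), k=3))  # 'a': two of three nearest are 'a'
```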
2.9 Existing System
According to different user requirements, video search intentions can be generally classified into three main categories:
- Detecting instances of copies
- Identifying visual similarities of near duplicates
- Retrieving entries in a broad sense of high-level concept similarity
2.9.1 Video Copy Detection

Extensive research efforts have been made on extracting and matching content-based signatures to detect copies of videos. Several methods use a sequence of frame features (ordinal, motion, or color signatures) to leverage the characteristics of sequence-to-sequence matching: the query sequence slides frame by frame over the database video with a fixed-length window. Since the process of video transformation could give rise to several distortions, techniques circumventing these variations via global signatures have been considered. They tend to depict a video globally rather than focusing on its sequential details. Some properties that are likely to be preserved even with these variations (e.g., shot length information) were suggested to be generated as compact signatures, and string matching techniques could be used to report such a copy. This method is efficient, but has limitations with blurry shot boundaries or a very limited number of shots. Moreover, in reality, a query video clip can be just a shot or even a subshot, whereas this method is only applicable to queries which consist of multiple shots.
2.9.1.1 Video Copy Detection Based on Watermark

Watermarking consists of introducing a non-visible signal into the whole video with the goal of recognizing that video as genuine, so that copied videos can easily be detected. Watermarking is a widely used technique in the photography field. It allows the owner to detect whether an image has been copied or not. Sometimes this watermark is visible in the image, located in the background. The limitation of watermarks is that if the original image is not watermarked, then it is not possible to know whether other images are copies or not.
2.9.1.2 Video Copy Detection Based on the Content
In this case, the signature which defines the video is derived from the content itself. Algorithms for content-based video copy detection extract a fingerprint from the features of the visual content. The fingerprint is then compared with the fingerprints of videos in a database. This type of algorithm faces a difficult problem: it is really hard to decide whether a video is a copy or merely a similar video. The features of the content can be very similar from one video to another, so the system may conclude that a video is copied when in fact it is not.
2.9.2 Visual Similarities of Near Duplicates

In this method, the query and long videos are fragmented into frames. Each frame of the query video is compared with the long video frames, and nearest neighbors are identified for the query video frames. The approximate position in the long video can be identified by getting the nearest neighbor of the query video frame. However, this method does not focus on the sequence matching between the query and long videos. It can report a match even if only a single frame is similar, so it is not effective.
2.9.3 Retrieving Entries in a Broad Sense of High-Level Concept Similarity

User needs determine both the effectiveness and efficiency of video search engines. The purpose for which a video is created is either entertainment, information, communication, or data analysis, and for all these purposes the user needs and demands vary substantially. Search engines like YouTube can provide a list of relevant videos for a given query. Since humans perceive video as a complex interplay of cognitive concepts, the all-important step forward in such video retrieval approaches will be to provide access at the semantic level. This is achievable by labeling all combinations of people, objects, settings, and events appearing in the audiovisual content. Labeling video content is a grand challenge, as humans use approximately half of their cognitive capacity to achieve such tasks. There are two types of semantic labeling solutions:
- Labels are assigned manually after audiovisual inspection
- Machine-driven, with automatic assignment of labels to video segments
Video retrieval task conventionally returns similar clips from a large collection of videos which have been either chopped up into similar lengths or cut at content boundaries. The clips for search have already been segmented and are always ready for similarity ranking.
2.9.3.1 Human-Driven Labeling Manual labeling of video has traditionally been the realm of professionals. In cultural heritage institutions, for example, library experts label archival videos for future disclosure using controlled vocabularies. Because expert labeling is tedious and costly, it typically results in a brief description of a complete video only.
In contrast to expert labor, social tagging has emerged: a recent trend letting amateur consumers label, mostly personal, visual content on web sites like YouTube, Flickr, and Facebook. Alternatively, the manual concept-based labeling process can be transformed into a computer game or a tool facilitating volunteer-based labeling. Since the labels were never meant to meet professional standards, amateur labels are known to be ambiguous, overly personalized, and limited. Moreover, unlabeled video segments remain notoriously difficult to find. Manual labeling, whether by experts or amateurs, is geared toward one specific type of use and is therefore inadequate to cater for alternative video retrieval needs, especially those user needs targeting retrieval of video segments.
2.9.3.2 Machine-Driven Labeling

Machine-driven labeling aims to derive meaningful descriptors from video data. These descriptors are the basis for searching large video collections. Most commercial video search engines provide access to video based on text, as this is still the easiest way for a user to describe an information need. The labels of these search engines are based on the filename, surrounding text, social tags, closed captions, or a speech transcript. Text-based video search using speech transcripts has proven itself especially effective for segment-level retrieval from broadcast news, interviews, political speeches, and video blogs featuring talking heads. However, a video search method based on just speech transcripts results in disappointing retrieval performance when the audiovisual content is neither mentioned nor properly reflected in the associated text. In addition, when the videos originate from non-English speaking countries, such as China and the Netherlands, querying the content becomes much harder, as robust automatic speech recognition results and their accurate machine translations are difficult to achieve. It might seem that video retrieval is the trivial extension of text retrieval, but it is in fact often more complex. Most of the data is of sensory origin (image, sound, video) and hence techniques from digital signal processing and computer vision are required to extract relevant descriptions. In addition to the important and valuable text data derived from audio analysis, much information is captured in the visual stream. Hence, a vast
body of research in machine-driven video labeling has investigated the role of visual content, with or without text.

Early systems depended on low-level visual labels such as color, texture, shape, and spatiotemporal features. Most of those early systems are based on query-by-example, where users query an archive based on images rather than the visual feature values. They do so by sketches, or by providing example images using a browser interface. Query-by-example can be fruitful when users search for the same object under slightly varying circumstances and when the target images are indeed available. If proper example images are unavailable, content-based image retrieval techniques are not effective at all. Moreover, users often do not understand similarity expressed in low-level visual features; they expect semantic similarity.

This expected semantic similarity is exactly the major problem video retrieval is facing. The source of the problem lies in the semantic gap: the lack of correspondence between the low-level features that machines extract from video and the high-level conceptual interpretations a human gives to the data in a given situation. The existence of the gap has various causes. One reason is that different users interpret the same video data in different ways. This is especially true when the user is making subjective interpretations of the video data related to feelings or emotions, for example, by describing a scene as romantic or hilarious.
3. Proposed System
Video subsequence identification is the proposed system. Video subsequence identification [1] aims at finding whether there exists any subsequence of a long database video that shares similar content with a query clip.

3.1
retrieval. Whereas in video subsequence identification, the boundary and even the length of the target subsequence are not available initially, so choosing which fragments to evaluate for similarity is not known in advance.

Video subsequence identification uses a graph transformation and matching approach to process variable-length comparison of the database video with the query. It facilitates safely pruning a large portion of irrelevant parts and rapidly locating some promising candidates for further similarity evaluation. By constructing a bipartite graph representing the similar frame mapping relationship between the Query video (Q) and the Long video (S) with an efficient batch kNN search algorithm, all the possibly similar video subsequences along the 1D temporal line can be extracted. Then, to effectively but still efficiently identify the most similar subsequence, the proposed query processing is conducted in a coarse-to-fine style. A one-to-one mapping constraint similar in spirit to that of Maximum Size Matching (MSM) is imposed to rapidly filter some actually nonsimilar subsequences at lower computational cost. The smaller number of candidates which contain eligible numbers of similar frames are then further evaluated, with relatively higher computational cost, for accurate identification. Since measuring the video similarities for all the possible 1:1 mappings in a subgraph is computationally intractable, a heuristic method, Sub-Maximum Similarity Matching (SMSM), is devised to quickly identify the subsequence corresponding to the most suitable 1:1 mapping.

3.2
3.2.1 Recognition for Copyright Enforcement Video content owners would like to be aware of any use of their material, in any media or representation. For example, the producers of certain movie scenes may want to identify whether or where their original films have been reused by others.
3.2.2 TV Commercial Detection

Some companies would like to track their TV commercials when they are aired on different channels during a certain time period, for statistical purposes. They can verify whether their commercials have actually been broadcast as contracted, and it is also valuable to monitor how their competitors conduct advertising in order to understand their marketing strategies.
CHAPTER 4 IMPLEMENTATION
4. Implementation
This chapter provides the design and implementation details of video subsequence identification.
4.1
Video Fragmentation
Video fragmentation elaborates on the aspects to be considered for splitting a video file into image files.
4.1.1 Supported Video Types

There are different types of videos, and this project mainly focuses on the following video types:
- AVI (Audio Video Interleave)
- MP4
- MOV
4.1.2 Capturing Video Properties

The following properties of the Query video (Q) and Long video (S) are captured using OpenCV APIs:
- Frames Per Second
- Total Number of Frames
- Frame Width
- Frame Height
4.1.3 Capturing Images from Video at Periodic Intervals

Images are captured from the query and long videos at periodic time intervals. The periodic rate can be a configurable parameter.
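One way to realize the configurable sampling rate is to translate it into frame indices; a small sketch (the function name is an assumption, and the actual frame decoding via OpenCV is omitted):

```python
def frame_indices(total_frames, fps, interval_sec):
    """Frame indices to capture when sampling a video every `interval_sec`
    seconds; `fps` and `total_frames` come from the video's properties."""
    step = max(1, round(fps * interval_sec))  # frames between captures
    return list(range(0, total_frames, step))

# A 10-second clip at 25 fps, sampled every 2 seconds
print(frame_indices(total_frames=250, fps=25, interval_sec=2))
# [0, 50, 100, 150, 200]
```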
4.1.4 Image File Type Supported

JPEG images are captured from the query and long videos. JPEG images are generally in a compressed format, so less storage space is required for storing them.

4.2
Q is
Given qi and S, the following algorithm gives the framework for retrieving similar frames.

Algorithm: Retrieve similar frames
Input: qi, S
Output: F(qi) - similar frame set of qi
Description:
1: if kNN search is defined then
2:   F(qi) ← { sj | sj ∈ kNN(qi) };
3:   return F(qi);
4: else
5:   F(qi) ← { sj | sj ∈ range(qi) };
6:   return F(qi);
7: end if

Frames are regarded as similar if their distance is under a threshold. It is better to have each qi retrieve the same number of similar frames, since the maximum distances can vary substantially. Therefore, kNN search is preferred, and the k-nearest neighbors are identified for all the query video frames.
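A naive stand-in for the kNN retrieval step, treating frames as feature vectors (the real system uses an efficient batch kNN search; names here are illustrative):

```python
import math

def knn_frames(qi, S, k):
    """F(qi): indices of the k frames of S nearest to query frame qi
    (brute force; the actual system uses an efficient batch kNN search)."""
    ranked = sorted(range(len(S)), key=lambda j: math.dist(qi, S[j]))
    return ranked[:k]

S = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (1.2, 0.9)]  # long-video frames
qi = (1.0, 1.0)                                        # a query frame
print(knn_frames(qi, S, k=2))  # indices of the 2 nearest frames: [1, 3]
```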
4.3
4.3.1 Bipartite Graph Transformation
Each frame can be placed as a node along the temporal line of a video. Given a query clip Q and a database video S, a short line and a long line can be abstracted, respectively. Q and S, which are two finite sets of nodes ordered along the temporal lines, are treated as the two sides of a bipartite graph. An edge is drawn from a frame of Q to a frame of S if the Euclidean distance between the two frames is less than the Bipartite Similarity Threshold. Hereafter, each frame is no longer modeled as a high-dimensional point as in the preliminary step, but simply as a node.
Figure 4.4 Construction of Bipartite Graph

Let G = {V, E} be a bipartite graph representing the similar frame mappings between Q and S. V = Q ∪ S is the vertex set representing frames, while E ⊆ Q × S is the edge set representing similar frame mappings. The following algorithm outlines how a bipartite graph is constructed.

Algorithm: Construct bipartite graph
Input: F(qi)
Output: E - edge set, SB - set of sj with nonzero count
Description:
1: for each qi in Q do
2:   for each sj ∈ F(qi) do
3:     count(sj) = 0;
4:   end for
5: end for
6: for each qi in Q do
7:   for each sj ∈ F(qi) do
8:     add qi -- sj to E;
9:     count(sj)++;
10:  end for
11: end for
12: return E, SB;

For each sj ∈ F(qi), its count, which indicates the number of similar query frames, is first initialized to zero. Then, for each qi ∈ Q, if sj ∈ F(qi), an edge between qi and sj is added to E and the count of sj is increased by 1. At the same time, the set of sj with nonzero count, denoted as SB, is also updated. In the case that kNN search is used to retrieve F(qi), k * |Q| edges will be formed.

Observing the similar frame mappings along the 1D temporal line of the S side, only a small portion is densely matched, while most parts are not matched at all or only sparsely matched. Intuitively, the unmatched and sparsely matched parts can be directly discarded, as they clearly suggest that no subsequence similar to Q occurs there: a necessary condition for a subsequence to be similar to Q is that the two share a sufficient number of similar frames. In view of this, comparing all possible subsequences in S, which is infeasible, is avoided; instead, a large portion of irrelevant parts is safely and rapidly filtered out prior to similarity evaluation. To do so, the densely matched segments of S containing all the possibly similar video subsequences have to be identified. Note that it is unnecessary to maintain the entire graph.
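A minimal Python sketch of the construction above, with F given as a mapping from query frame IDs to their similar S-frame IDs (function and variable names are illustrative, not from the project code):

```python
def build_bipartite(F):
    """F: dict qi -> list of similar S frame IDs.
    Returns edge set E and SB, the (sj, count(sj)) tuples with
    nonzero count, ordered by sj."""
    E, count = [], {}
    for qi, similar in F.items():
        for sj in similar:
            E.append((qi, sj))             # edge qi -- sj
            count[sj] = count.get(sj, 0) + 1
    SB = sorted(count.items())             # ordered by frame ID sj
    return E, SB
```

With kNN retrieval, each qi contributes exactly k edges, so |E| = k * |Q| as stated above.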
4.3.2 Dense Segment Extraction
Starting from the first frame in the long video, the Euclidean distance between two adjacent frames of the long video is calculated. If the Euclidean distance between adjacent frames is less than the Adjacent Similarity Threshold, the frames are added to the same segment. In this way, similarity within the long video sequence is traced, and the sparse or empty parts can be identified. The algorithm below extracts the dense segments.

Algorithm: Extract dense segments
Input: SB
Output: SD - set of dense segments
Description:
1: order SB by sj;
2: k = 1; add first(SB) to Sk*; add Sk* to SD;
3: while first(SB) ≠ last(SB) do
4:   if next(SB) - first(SB) ≤ gap threshold then
5:     add next(SB) to Sk*;
6:   else
7:     k++;
8:     add next(SB) to Sk*; add Sk* to SD;
9:   end if
10:  first(SB) = next(SB);
11: end while
12: return SD;

SB contains tuples of the form (sj, count(sj)) returned by the bipartite graph construction algorithm, where sj is the frame ID and count(sj) is the number of query frames similar to sj. Additionally, a threshold determining the number of consecutive zero counts required to separate two high-density segments is employed. However, even after dense segment extraction it is likely that two successive or overlapping subsequences, both actually somewhat similar to Q, are grouped into one longer segment. To further filter dense but nonsimilar segments, and to extract the most similar subsequence accurately from a relatively long segment, a filter-and-refine search strategy needs to be applied.
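The segmentation logic can be sketched in a few lines of Python: consecutive matched frame IDs stay in one segment while the gap between them is within the zero-count threshold, and a larger gap starts a new segment. Names here are illustrative only.

```python
def extract_segments(SB, max_gap):
    """SB: list of (sj, count(sj)) tuples ordered by sj.
    Returns SD, a list of dense segments (lists of frame IDs)."""
    segments = []
    for sj, _count in SB:
        if segments and sj - segments[-1][-1] <= max_gap:
            segments[-1].append(sj)        # continue current dense segment
        else:
            segments.append([sj])          # gap too large: new segment
    return segments
```

For example, matched frames {1, 2, 3, 10, 11} with max_gap = 2 split into two dense segments, [1, 2, 3] and [10, 11], since the gap of 7 exceeds the threshold.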
4.3.3 Filtering by Maximum Size Matching
After locating the dense segments, there will be k separate subgraphs. However, the high density of a segment alone cannot indicate high similarity to the query, since it neglects the actual number of similar frames, the temporal order, and the frame alignment. Since the query Q is composed of a series of frames, a frame sj in Sk* can also have multiple similar frames in Q due to continuity. The original similar frame mapping relationship between Q and Sk* is therefore usually multiple-to-multiple (M:M). Even when videos are represented by shots rather than frames, this M:M mapping phenomenon is still common, though not as significant as at frame level. In fact, not all dense segments of acceptable length are indeed relevant. With a 1:M mapping, one frame of the query clip can be found similar to a number of frames of the long video, which leads to higher computational cost. With a 1:1 mapping, Maximum Size Matching (MSM) is employed to rapidly identify subsequences at a lower computational cost. A subsequence is identified if the number of matched frames in it is greater than the Maximum Size Matching Threshold.
Figure 4.6 1:1 Mapping

The algorithm below outlines the filtering process by MSM. Based on the matching size, i.e., the number of edges or the number of saturated vertices in Gk under the 1:1 mapping constraint, which reflects the number of similar frames, some actually nonsimilar video subsequences can be filtered out: those whose maximum matching sizes are not greater than a given threshold θ, where θ is a parameter related to |Q|. This threshold can be tuned to trade off effectiveness against efficiency.

Algorithm: Filter by MSM
Input: SD
Output: SM - set of actually similar Sk*
Description:
1: SM = ∅;
2: for each segment Sk* in SD do
3:   MkMSM = MaximumSizeMatching(Gk);
4:   if |MkMSM| > θ then
5:     add Sk* to SM;
6:   else
7:     discard Sk*;
8:   end if
9: end for
10: return SM;
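A compact sketch of the MSM filter, assuming small per-segment subgraphs: maximum bipartite matching is computed with the classical augmenting-path (Kuhn's) algorithm, which is O(V·E) and adequate at this scale. The threshold name theta and all function names are introduced here for illustration; the project's own implementation may differ.

```python
def max_matching(adj, n_right):
    """adj[qi] = list of S-frame indices similar to query frame qi.
    Returns the maximum 1:1 matching size of the bipartite subgraph."""
    match_right = [-1] * n_right            # S frame -> matched query frame

    def try_augment(qi, seen):
        for sj in adj[qi]:
            if sj not in seen:
                seen.add(sj)
                # sj is free, or its current partner can be re-matched
                if match_right[sj] == -1 or try_augment(match_right[sj], seen):
                    match_right[sj] = qi
                    return True
        return False

    return sum(try_augment(qi, set()) for qi in range(len(adj)))

def filter_by_msm(segment_graphs, n_right, theta):
    """Keep only segments whose maximum matching size exceeds theta."""
    return [adj for adj in segment_graphs
            if max_matching(adj, n_right) > theta]
```

For instance, the adjacency [[0, 1], [0], [1]] (three query frames, two S frames) has maximum matching size 2, so it survives a threshold of θ = 1.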
4.3.4 Refinement by Sub-Maximum Similarity Matching
The maximum size matching filtering step can be viewed as a rough similarity evaluation that disregards temporal information. SM is further refined by considering visual content, temporal order, and frame alignment simultaneously.
4.3.4.1 Video Similarity Measure
Defining a similarity measure consistent with human perception is crucial for similarity search. First, a score function is presented that integrates three factors in judging video relevance, so as to resemble human perception more accurately. The video similarity is computed based on an arbitrary 1:1 mapping Mk out of all the possible 1:1 mappings between Q and Sk*.

Visual content similarity: To locate the most visually similar subsequence, the distance between two similar frames is calculated, instead of simply judging whether they are similar or not. Let the weight wij of an edge denote the detailed similarity between frames qi and sj in Mk. This information is already available from the preliminary stage. A larger total edge weight means a larger sum of interframe similarities, which consequently achieves a higher degree of visual similarity.

Temporal order similarity: Reordering some frames can be tolerated if it achieves much higher visual content similarity, with an acceptable sacrifice in preserving temporal order, yielding a higher overall similarity from the angle of human perception. Given a 1:1 mapping Mk of Gk corresponding to Sk*, the number of frame pairs that keep the temporal order, i.e., that are matched along the same direction, can be calculated by the Longest Common Subsequence Similarity (LCSS).
Frame alignment similarity: Considering only the factors of visual content and temporal order is still problematic. Therefore, the number of frames that need alignment operations, i.e., frames that must be added, deleted, or substituted, is also taken into account.
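The temporal order and frame alignment factors can both be sketched with standard dynamic programming: LCSS counts matched frame pairs that preserve temporal order, and edit distance counts the add/delete/substitute operations needed to align the two matched sequences. This is a generic illustration of the two measures, not the project's exact scoring code.

```python
def lcss(a, b):
    """Length of the longest common subsequence of two frame-ID sequences:
    the number of matched pairs that keep temporal order."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    """Minimum number of add/delete/substitute operations to align a with b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                       # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # delete
                           dp[i][j - 1] + 1,       # insert
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[m][n]
```

For example, the mapped sequences [1, 2, 3, 4] and [2, 1, 3, 4] have LCSS length 3 (one pair breaks temporal order), and their edit distance reflects the alignment cost of that swap.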
4.4 Configurable Thresholds

With the provision of configurable thresholds, the user can either tighten or loosen the passing criteria for query processing. Thus, video similarity can be adjusted by user feedback.
4.5
CHAPTER 5 TESTING AND RESULTS
5.1 Test Plan
Figure 5.1 provides the test plan approach. For each test case, the test plan records the long video, the query video, and remarks:
Test Case 1: Positive test case with technical presentation videos (avi files).
Test Case 2: Positive test case with advertisement videos (avi files), where two occurrences of the query video are present in the long video.
Test Case 3: Positive test case with advertisement videos (mov files).
Test Case 4: Positive test case with animation movies (avi files).
Test Case 6: Positive test case with advertisements of different display resolution (mp4 files).
5.2
Results
This section provides the results of the test cases executed as per the test plan approach.
5.2.1 Test Case 1 Results Test Case 1 Input Parameters are shown in Figure 5.2.
Figure 5.2 GUI Screen Showing the Test Case 1 Input Parameters
Test Case 1 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.3.
Test Case 1 results for Maximum Size Matching and Results Display stages are shown in Figure 5.4.
5.2.2 Test Case 2 Results Test Case 2 Input Parameters are shown in Figure 5.5.
Figure 5.5 GUI Screen Showing the Test Case 2 Input Parameters
Test Case 2 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.6.
Test Case 2 results for Maximum Size Matching and Results Display stages are shown in Figure 5.7.
5.2.3 Test Case 3 Results Test Case 3 Input Parameters are shown in Figure 5.8.
Figure 5.8 GUI Screen Showing the Test Case 3 Input Parameters
Test Case 3 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.9.
Test Case 3 results for Maximum Size Matching and Results Display stages are shown in Figure 5.10.
5.2.4 Test Case 4 Results Test Case 4 Input Parameters are shown in Figure 5.11.
Figure 5.11 GUI Screen Showing the Test Case 4 Input Parameters
Test Case 4 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.12.
Test Case 4 results for Maximum Size Matching and Results Display stages are shown in Figure 5.13.
5.2.5 Test Case 5 Results Test Case 5 Input Parameters are shown in Figure 5.14.
Figure 5.14 GUI Screen Showing the Test Case 5 Input Parameters
Test Case 5 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.15.
Test Case 5 results for Maximum Size Matching and Results Display stages are shown in Figure 5.16.
5.2.6 Test Case 6 Results Test Case 6 Input Parameters are shown in Figure 5.17.
Figure 5.17 GUI Screen Showing the Test Case 6 Input Parameters
Test Case 6 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction and Dense Segment Extraction stages are shown in Figure 5.18.
Test Case 6 results for Maximum Size Matching and Results Display stages are shown in Figure 5.19.
5.2.7 Test Case 7 Results Test Case 7 Input Parameters are shown in Figure 5.20.
Figure 5.20 GUI Screen Showing the Test Case 7 Input Parameters
Test Case 7 results for Video Fragmentation, k-Nearest Neighborhood, Bipartite Graph Construction, Dense Segment Extraction, Maximum Size Matching and Results Display stages are shown in Figure 5.21.
CHAPTER 6 CONCLUSION
6. Conclusion
In this project, the similar frames of the query clip are retrieved by a batch query algorithm. Then, a bipartite graph is constructed to exploit the opportunity of spatial pruning. Thus, the high-dimensional query and database video sequences can be transformed into the two sides of a bipartite graph, and only the dense segments are roughly obtained as possibly similar subsequences. In the filter-and-refine phase, some nonsimilar segments are first filtered out; the remaining relevant segments are then processed to quickly identify the most suitable 1:1 mapping by optimizing the factors of visual content, temporal order, and frame alignment together. This is an effective and efficient query processing strategy for temporal localization of similar content in a long unsegmented video stream, considering that the target subsequence may be an approximate occurrence with potentially different ordering or length from the query clip. With the provision of configurable thresholds, video similarity can be adjusted by user feedback. The selection of the Bipartite Similarity Threshold and the Adjacent Similarity Threshold depends on the display resolution, the color distribution, and the Video Splitting Rate; a heuristic approach needs to be taken while setting these thresholds. The hit ratio of a query depends on the number of kNN samples.
6.1 Future Work
This project can be extended by incorporating enhanced image processing techniques, such as filtering and image transformation, to process queries more effectively. With these techniques, videos with slightly different backgrounds can still pass the video similarity search criteria. Currently, video subsequence identification is performed at frame level; new strategies can be implemented to perform it at shot/scene level.
REFERENCES
[1] Heng Tao Shen, Jie Shao, Zi Huang, and Xiaofang Zhou, "Effective and Efficient Query Processing for Video Subsequence Identification," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 3, pp. 321-334, March 2009.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Elsevier.
[3] Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, Second Edition, Springer.
[4] http://en.wikipedia.org/wiki/Video
[5] http://en.wikipedia.org/wiki/Comparison_of_container_formats#Video_formats_supported
[6] http://en.wikipedia.org/wiki/Image_file_formats
[7] http://en.wikipedia.org/wiki/Bipartite_graph
[8] http://en.wikipedia.org/wiki/KNN
[9] http://en.wikipedia.org/wiki/Euclidean_distance
[10] http://en.wikipedia.org/wiki/Minkowski_distance
[11] http://en.wikipedia.org/wiki/Earth_mover%27s_distance
[12] http://sourceforge.net/projects/opencvlibrary/
Appendix B: Definitions/Acronyms
KNN: k-Nearest Neighbors
DTW: Dynamic Time Warping
LCSS: Longest Common Subsequence Similarity
EMD: Earth Mover's Distance
PNG: Portable Network Graphics
GIF: Graphics Interchange Format
TIFF: Tagged Image File Format
Exif: Exchangeable Image File Format
JPEG: Joint Photographic Experts Group
JFIF: JPEG File Interchange Format
RIFF: Resource Interchange File Format
WMV: Windows Media Video
AVI: Audio Video Interleave
MPEG: Moving Picture Experts Group
PAL: Phase Alternating Line
SECAM: Sequential Color with Memory
NTSC: National Television System Committee
SDTV: Standard-Definition Television
HDTV: High-Definition Television
GUI: Graphical User Interface