TABLE OF CONTENTS
- Duplicate content
- Why detect duplicates
- Reasons for duplicates
- Types of duplicates
- Methods for detecting duplicates
- Similarity
- Categories of duplicate detection methods
DUPLICATE CONTENT
Duplicate content is a term used in search engine optimization to describe content that appears on more than one web page, possibly on different web sites.
Types of duplicate content
Non-malicious content: may include multiple views of the same document, such as an HTML page alongside PDF or Word versions.
WHY DETECT DUPLICATES
Conserve resources: in a web crawl, many duplicate and near-duplicate web pages are encountered, e.g. http://www.cs.umd.edu/~pugh vs http://www.cs.umd.edu/users/pugh
TYPES OF DUPLICATES
Exact duplicates: bitwise-identical documents.
Near duplicates: not bitwise identical; they differ in dates or signatures, slight edits or modifications, different versions or updates, etc.
Shingle
A shingle is a set of contiguous tokens in a document. Each document is divided into multiple shingles, and one hash value is assigned to each shingle. By sorting these hash values, shingles with the same hash value are grouped together; the resemblance of two documents can then be calculated from the number of matching shingles. A k-gram shingle is a shingle of k consecutive tokens.
Hashing
Assign a hash value to each document. If the hash values of two documents are the same, the documents are exact duplicates.
Example: the 3-gram shingles of "Once upon a midnight dreary while I pondered" and a hash value for each:
{ Once upon a, upon a midnight, a midnight dreary, midnight dreary while, dreary while I, while I pondered } → { 357, 192, 755, 123, 987, 345 }
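The shingling-and-hashing procedure above can be sketched in Python. This is a minimal sketch, not the lecture's exact method: it assumes whitespace tokenization with lowercasing, and the hash values come from `hashlib` reduced mod 1000, so they will not match the illustrative numbers in the example.

```python
import hashlib

def shingles(text, k=3):
    """Divide a document into k-gram shingles of consecutive tokens."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def shingle_hashes(text, k=3):
    """Assign one (small, illustrative) hash value to each shingle."""
    return {int(hashlib.md5(s.encode()).hexdigest(), 16) % 1000
            for s in shingles(text, k)}

def resemblance(a, b, k=3):
    """Resemblance: matching shingle hashes / all distinct shingle hashes."""
    ha, hb = shingle_hashes(a, k), shingle_hashes(b, k)
    return len(ha & hb) / len(ha | hb)

doc = "Once upon a midnight dreary while I pondered"
print(sorted(shingles(doc)))  # the six 3-gram shingles (lowercased)
print(resemblance(doc, doc))  # 1.0 for exact duplicates
```

Sorting is implicit here: set intersection on the hash values plays the role of grouping equal hashes together.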
SIMILARITY
Also known as the Jaccard similarity.
Pros: simple to describe, easy to compute.
Cons: ignores repetition of trigrams, e.g. "a rose is a rose" and "a rose is a rose is a rose" have 100% similarity.
Offline methods:
Calculate document similarities over a large set of web pages and detect all duplicates at the pre-processing stage. Viewed as global methods, since they detect duplicates in the whole collection. Because detection is done in the data-preparation phase, the response time and throughput of the search engine are not affected.
Online methods:
Detect duplicates in the search result at run time. Viewed as local methods, since they detect duplicate documents only within the scope of each query's search result.
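The offline/online distinction is about when the same filter runs, which a short sketch can make concrete. This is a hedged illustration, not an algorithm from the lecture: the function names are hypothetical, the shingling helpers are repeated for self-containment, and the 0.9 threshold is an arbitrary choice.

```python
def shingles(text, k=3):
    """k-gram word shingles of a document (whitespace tokenization)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two documents' shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def dedupe_offline(collection, threshold=0.9):
    """Offline/global: scan the whole collection at pre-processing time,
    keeping one representative per near-duplicate group."""
    kept = []
    for doc in collection:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

def dedupe_online(results, threshold=0.9):
    """Online/local: the same filter, applied only to one query's
    search results at run time."""
    return dedupe_offline(results, threshold)
```

The code paths are identical; what differs is the input scope (whole collection vs one result list) and when the cost is paid (at data preparation vs per query).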