You are on page 1of 12

DUPLICATE DETECTION

TABLE OF CONTENTS

Duplicate content Why detect duplicate Reason for duplicates Type of duplicates Method for detecting duplicates Similarity Categories of Duplicate detection methods

DUPLICATE CONTENT

Duplicate content is a term used in the field of search engine optimization to describe content that appears on more than one web page, even on different web sites. Types of Duplicate Content Non-malicious Content: It may include multiple
views of the same page, such as normal HTML, pdf, word, etc.

Malicious content: It refers to intentionally


generated search spam.

WHY DETECT DUPLICATES?

Conserve resources

reduced index size less memory, faster computations, etc.

REASON FOR DUPLICATION

In a web crawl, many duplicate and near duplicate web pages are encountered

one study suggested than 30+% are duplicates

Mirroring: Multiple URLs for same page


http://www.cs.umd.edu/~pugh vs http://www.cs.umd.edu/users/pugh

Same web site hosted on multiple host names Web spammers

TYPES OF DUPLICATES

Exact Duplicates

Same web-pages hosted at different URLs

Near Duplicates

Not bitwise identical Different dates or signatures, Slight edits, modifications, Different versions, updates, etc.

METHOD FOR DETECTING DUPLICATES

Shingle

It is a set of contiguous tokens in a document. Each document is divided into multiple shingles and one hash value is assigned to each shingle. By sorting these hash values, shingles with same hash values are grouped together. Then the resemblance of two documents can be calculated based on the number of matching shingles. A k-gram shingle is a shingle of k consecutive tokens.

e.g.{ he is a good boy}


shingle= {he is a, is a good, a good boy}

METHOD FOR DETECTING DUPLICATES

Hashing

Provide a hash values to each document If the hash values of two documents is same, then the documents are exact duplicates.

Take a hashing function, H. Hash each shingle:


e.g. {Once upon a midnight dreary while I pondered }

{ Once upon a, upon a midnight, a midnight dreary, midnight dreary while, dreary while I, while I pondered } { 357, 192, 755, 123, 987, 345 }

SIMILARITY

CONTD

CONTD

Also known as the Jaccard similarity Yes: Simple to describe, easy to compute No: Ignores repetition of trigrams: e.g. a rose is a rose and a rose is a rose is a rose have 100% similarity.

Is this a good similarity measure?


CATEGORIES OF DUPLICATE DETECTION


METHODS

Offline Methods :

It calculates document similarities of a large set of web pages and detects all duplicates at the pre-processing stage. Viewed as global methods since they detect duplicates in the whole collection. detection is done at the data preparation phase. response time and throughput of search engines will not be affected. online method detects duplicates in the search result at run time. viewed as local methods since they detect duplicate documents in the scope of the search result of each query.

Online Methods :

You might also like