
A Simple Scratch of Search Engine

chunqi.shi@hotmail.com
http://hi.baidu.com/shichunqi

1 07/06
2 07/08
3 07/16 1/3 07/30
4
5
http://sewm.pku.edu.cn/IR-Guide.txt
--
Contents

A Simple Scratch of Search Engine

1. Spider
2. Spider
3. Spider
4. Spider
5. Spider

1. Quality Selection
2. De-duplicate
3. Anti-spam


1,

GoogleBaidu
http://www.baidu.com/more/ Google
http://www.google.com.hk/intl/en/options/
Lab
Wiki
http://en.wikipedia.org/wiki/List_of_search_engines 10
1 2 3
45P2P
6Email78910
14
1[] 23456
78910 1112
13 14
2

Yahoo
InfoseekGoogle Baidu
Google 1 1000
3171

Grassroots

--

1.

(Sequential File)(Random File)

2.

(Index) (Hash)

3.

Storage Pyramid
Register
CacheInternal StorageExternal Storage



Index
Keywords/Term
Retrieval




1.1

entirelytimely


fast indexing

efficient accessibility small storage space


valuable

Terms
resemblance ranking
Internet

WWW

1.1



IO
/
/

CACHE->->->
/->/ Pyramid Hierarchy

ClusterDistributed
Inverted IndexSequential
Hashing

WEB
MVC(Model-View-Controller) WEB DATA

ResemblanceRank WEB DATA
Retrieval
Web-Data-Retrieval

Google Baidu
Yahoo
TRECSIGIRWWW


(Information Retrieval) WEB Web Technology
1.2
Spider
Spider CrawlerSpider Schedule
Spider Update

Indexer Indexer
IR
Indexer Analyze
Data Base
Retrieval Retrieval
Retrieval Query Resemblance
Rank

User Interface
Frontend

Internet
Query
Schedule

Spider

Indexer

Update Preprocess

Retrieval

Analyze

Rank

Backend

1.2



Modeling the Internet and the
Web. Probabilistic Methods and Algorithm
http://book.douban.com/subject/1756106/
http://ibook.ics.uci.edu/slides.html
PDF

http://bib.tiera.ru/DVD-010/Baldi_P.,_Frasconi_P.,_Smyth_P._Modeling_the_Internet_and_the_Web
._Probabilistic_Methods_and_Algorithms_(2003)(en)(285s).pdf
Internet Intranet LAN
RJ-45
ISO TCP/IP
Internet
InternetIntranet LAN
TCP/IP
World Wide Web HTTP hyperlinks
Net Web WWWWebsite
WWW
1999 Chinaren 263

WWW WWW Web



1
-- CNNIC
16 http://www.cnnic.net.cn/index/0E/21/index.htm
1.
2.
3.
4.
5.

6.
7.
8.
9.
10.
11.
12.

1994 4 20 NCFC Sprint Internet 64K

Internet Internet
1994 5 BBS BBS

1998 6 CERNET IP (IPv6) 6BONE


1999 7 12

2000 12 12

2001 1 1 ""
2001 7 9

2001 12 20

2004 2 3 18 2003

2004 5 13

2004 6 16
2005 8 5

13. 2005 Web2.0 Web2.0

BlogRSSWIKISNS
14. 2006 12 18 Verizon

15. 2007 100

16. 2008 6 30 2.53 7 22 CN


1218.8
BBS
WEB2.0
1.

2.

()

3.

4.

()

5.

WEB2.0
=> => =>

2




2010 1 CNNIC
http://www.cnnic.net.cn/uploadfiles/pdf/2010/1/15/101600.pdf
/
- 11
html/htm shtml
php asp jsp aspx 3:1:5

- 10
75% 55%
30% 8%
1%
- 5 336 ~
- 14 30K
964 Terabytes

- 11

.html

20.1%

htm

6.5%

2.1%

shtml

8.7%

asp

12.6%

php

22.2%

txt

0.0%

nsf

0.0%

xml

0.0%

jsp

1.0%

cgi

0.2%

pl

0.0%

aspx

6.1%

do

0.5%

dll

0.0%

jhtml

0.0%

cfm

0.0%

php3

0.0%

phtml

0.0%

19.7%

100%

- 10

7.7%

21.2%

28.1%

18.8%

24.3%

100%
- 5

2008

2009

16,086,370,233

33,601,732,128

108.88%

7,891,388,272

18,998,243,013

140.75%

49.06%

56.54%

8,194,981,961

14,603,489,115

78.20%

50.94%

43.46%

0.96:1

1.3:1

KB

460,217,386,099

1,059,950,881,533

130.32%

5,588

10,397

86.06%

KB

28.6

31.5

10.30%

- 14

289.5
119.4

32.2
35.2

124.1

26.6

20.7
40.4

29.6
29.7

93.5

30.6

12.8
8.1

31.8
30.5

7.2

27.3

31.3

33.4

18.9

26.5

61.8

30.8

1.1

28.1

30.0

27.7

22.8

33.0

8.0

26.5

2.4

27.4

18.4

27.5

12.1

38.9

0.2

29.5

8.8

34.6

10.9

27.2

7.2

31.9

2.7

31.2

5.5

35.5

1.2

28.6

0.2

26.1

1.2

24.6

2.1

25.1

1.2

26.5

0.1

43.9

964.0

30.8

3


QQ

hao123

- 8 80%

- 8


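The first Spider was written by Matthew Gray at MIT. In 1993, the year NCSA Mosaic was released, he built the World Wide Web Wanderer ("www wanderer"); the Wanderer was written in Perl. David Eichmann later developed the RBSE spider.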

3.1Matthew Gray Google

3.2David EichmannIowa
Spider Spider
EVP() Google --

http://www.cqumzh.cn/att_blog/month_0901/a2fe1b64c99263b246e9d923f1055549_1231307756.p
df

1. Spider
Spider
1
2 Spider

1
2
URI/URL
1 URL URL
JSP/ASP/Servlet/PHPURL URL Mapping

2 URL URL

URL
Spider Trap
URL
URL Spider URI /
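The URL handling described above (normalizing equivalent URLs and guarding against Spider Traps) can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names `normalize_url` and `looks_like_trap`, the default-port table, and the depth and length thresholds are all assumptions chosen for the example.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed default ports; a real spider may need more schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Drop the port when it is the default one for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Resolve "." and ".." segments; keep a trailing slash if there was one.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, netloc, path, query, ""))


def looks_like_trap(url, max_depth=10, max_length=256):
    """Heuristic Spider Trap check: very deep or very long URLs (thresholds assumed)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(url) > max_length or len(segments) > max_depth
```

Under these assumptions, `normalize_url("HTTP://Example.COM:80/a/./b/../c.html?b=2&a=1#frag")` yields `http://example.com/a/c.html?a=1&b=2`, so both spellings map to the same entry in the URL store.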

[Figure 3.1.1: Spider and URI store. Labels: Internet, tcp/ip, Downloader, unfetched URLs, URI (update), URL extractor & normalizer, HTML, Analyzer.]

2. Spider
IP
URL HOST IP
DNS Resolver Spider WebSite
Robots Robots
Win-Win Crawler Trap URL path
robots.txt Spider


(Stress) Access RateMaximum Stress
Block

Website(Domain
Name) DN IP Download

IP Wildcard Domain
Infinite Sub-domain Generator
Domain Uniformization
IP

Multi-Downloader Schedule DNS DNS resolver Robots Robots Protocol
Checker Website Meta-info Collector[Maximum Stress
IP Multi-IP Stress BalanceDomain Uniformization]
HTML
Parser
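The Robots Checker and the per-site stress limit described above can be sketched with Python's standard `urllib.robotparser`. This is an illustrative sketch under stated assumptions: the class name `PolitenessGate`, the one-second default delay, and the idea of passing in an already downloaded robots.txt body are choices made for the example, not details taken from the text.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Per-host robots.txt rules plus a minimum delay between requests."""

    def __init__(self, user_agent, min_delay_seconds=1.0):
        self.user_agent = user_agent
        self.min_delay = min_delay_seconds
        self.rules = {}       # host -> RobotFileParser
        self.last_fetch = {}  # host -> time of the last permitted fetch

    def load_rules(self, host, robots_txt):
        """Register the robots.txt body that was downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def may_fetch(self, url, now=None):
        """Return True if robots.txt allows the URL and the host has rested."""
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        parser = self.rules.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return False
        last = self.last_fetch.get(host)
        if last is not None and now - last < self.min_delay:
            return False
        self.last_fetch[host] = now
        return True
```

A scheduler would call `may_fetch` before handing a URL to a downloader and requeue the URL when it returns False. Keying the delay by host is the simplest choice; keying it by resolved IP would be closer to the Multi-IP stress balance mentioned above.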





[Figure 3.2.1: Spider components. Labels: Internet, robots.txt, DNS Cache, DNS Resolver Client, Robots Checker, Multi-Downloader, Scheduler, unfetched URLs, URI (update), Site/Domain (update), URL extractor & normalizer, Content extractor, HTML Parser, HTML Data Package, Analyzer.]
Spider
//Multi-Downloader
Multi-Downloader
Paralleled Crawler

Partitioning
Crawler Center
Distributed Crawler
Locally DistributedLower Latency
Internet Backbone Traffic Interchange Politeness

Linguistic Validation & Cultural Adaptation

Static ParalleledDynamic Paralleled


Crawler
Center 0

Downloader 0
Downloader 0

Crawler 0

Downloader 0
Multi- Downloader

Crawler 1

Crawler
Center 1

Spider
Architecture

Crawler
Center L

Crawler N
Paralleled Crawler

Distributed Crawler

3.2.2 Spider

Internet

Crawler 0
qq.com/download
qq.com/blog
163.com/news
163.com/mail

Crawler 1
news.sina.com.cn/beijing
news.sina.com.cn/whether
sohu.com/news
sohu.com/maps

Scheduler
Static Paralleled
3.2.3

Crawler N-1

Spider

Internet

Crawler 0

Crawler 1

Crawler N-1

qq.com/download
qq.com/blog
163.com/news
163.com/mail
news.sina.com.cn/beijing
news.sina.com.cn/whether
sohu.com/news
sohu.com/maps

Scheduler

Spider

Dynamic Paralleled
3.2.4
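In the static scheme, every site is bound to one crawler in advance, as with qq.com and 163.com on Crawler 0 in the figure. One common way to realize such a fixed assignment is to hash the host name; the sketch below assumes that approach, and the function name `crawler_for` is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def crawler_for(url, num_crawlers):
    """Assign a URL to a crawler by hashing its host name.

    All pages of one site land on the same crawler, which keeps the
    per-site politeness bookkeeping local to that crawler.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

A dynamic scheme would instead let the Scheduler hand out URLs at run time according to each crawler's current load, at the cost of more coordination traffic.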

3. Spider




Devanshu Dhyani A Survey of Web Metrics

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.5859&rep=rep1&type=pdf

336 Seeds

30% 8%


Yahoo Research Barcelona Lab http://labs.yahoo.com/Yahoo_Labs_Barcelona
Ricardo

Baeza-Yates

B.

Barla

Cambazoglu

Tutorial

Yahoo

http://www.lirmm.fr/~coletta/CaisePresentations/TutorialYAHOO.pdf Spider
Quality Metrics
1 Crawler
Coverage: The percentage of the Web discovered or downloaded by the crawler.

2
Freshness: Measure of out-datedness of the local copy of a page relative to the page's original copy
on the Web

3
Page importance: Percentage of important or popular pages in the repository
Ricardo Baeza-YatesYahooVP
ACM Fellow SIGIR 2009
Quantifying Performance and Quality Gains in Distributed Web Search Engines. In SIGIR

2009http://research.yahoo.com/search/node/Quantifying
http://www.dcc.uchile.cl/~rbaeza/
http://research.yahoo.com/user/70

Spider

Coverage

Internet

Schedule

Freshness

Page importance

Analyzer

3.3.1
2003 :
--
http://sewm.pku.edu.cn/TianwangLiterature/Other/%5B%CD%F5%BC%CC%C3%F1,2003a%5D/032
116.pdf
Poisson Web Poisson
3 :(1);(2);(3)
Web T Web
$1 - e^{-\lambda T} = 0.5$
CNNIC -
10
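Under the Poisson model, the probability that a page has changed within an interval T is 1 - exp(-rate * T), and setting it to 0.5 gives the interval after which half of the pages are expected to have changed. The sketch below works this out; the symbol `rate` (changes per unit time) and both function names are assumptions made for the example.

```python
import math


def change_probability(rate, interval):
    """Probability that a page changed at least once within `interval`."""
    return 1.0 - math.exp(-rate * interval)


def half_life(rate):
    """Interval T that solves 1 - exp(-rate * T) = 0.5."""
    return math.log(2.0) / rate
```

For instance, a page that changes on average once every 10 days (`rate = 0.1` per day) has a half-life of about 6.9 days, which is one plausible basis for choosing a re-fetch interval.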
Sanasam Ranbir

Singh IJCAI07 Estimating the Rate of Web Page Updates

http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-462.pdf


Link Rot
Dead Link / Broken Link Link RotDead Link

Link Rot
Dead Link
Spider
Spider

obsolete

stable

Re-fetch

Fetched

fresh

To-fetch

New Links

Dead Links Internet Status(t)

Internet Status(t + T)

3.3.2T

Outdated/Obsolete Spider
Create-Delete-Update CRUD
Spider Spider
Google Deepbot Freshbot Deepbot
Coverage Freshbot Re-visit/Refresh

Spider

Coverage

Internet
Fetch

Refresh

Schedule

Freshness

Page importance

Dead links

Analyzer

3.3.3

4. Spider
Carlos Castillo, Effective Web Crawling
http://www.webir.org/resources/phd/Castillo_2004.pdf
Carlos Castillo University of Chile
YAHOO
&VPRicardo Baeza-Yates
--

http://www.c-s-a.org.cn/ch/reader/view_abstract.aspx?file_no=20090752&flag=1
Selecting/Ordering Strategies
FetchRefresh

Granularity





Divide Standard

Fetch

Refresh

Regular

Historical/Empirical Feedback

4.1

Granularity Geographically

Website

Page

Link/URL

Tracks

Linguistic

Architecture

Popularity

Group Pattern

Encoding

Traffic

Utility

URL Keyword

Popularity

Relevance

Path Depth

Quantity &
Saturation
4.2
,
Page Importance





Website ArchitectureLink Path Depth
Search Engine
Optimize
Search Engine Cheat
Spam Website
Page Importance
Google .Matt Cutts
HIThis is the great content I has
http://www.mattcutts.com/blog/
http://v.youku.com/v_show/id_XMTY3NTM2ODQ0.html
{


~
.
Is content still the king or has something else (structure) taken over? "Content is necessary. It's
not always sufficient because people have to find out about your content. But if you don't have good
content, it's a lot harder to do good search engine optimization for your site." ~ Matt Cutts.
}
Net

Homepage/Index Link
ContentLink
Content

2.3.4.1

1 Breadth First
a) --
b) /
c)
Baseline
Carlos Castillo, Effective Web Crawling
Crawling the Infinite Web: Five Levels are Enough

3-5
90%

90%

Follow 5
d)
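A Breadth First crawl that stops following links after five levels can be sketched as below. The link source is abstracted into a caller-supplied `get_links` function, which is an assumption of this example; in a real Spider it would download and parse the page.

```python
from collections import deque


def breadth_first_crawl(seeds, get_links, max_depth=5):
    """Visit pages level by level, never following links past max_depth."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # page is kept, but its out-links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Because the queue is first-in first-out, pages close to the seeds are always fetched before deeper ones, which is why this strategy serves as a baseline.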

2Larger Site First


a)--
b)
c)

Winner-Take-All
Pending

d) TOP N

3Skeleton Links
a)--
b)
c) Yida Wang SIGIR08
Exploring Traversal
Strategy for Web Forum Crawling
http://research.microsoft.com/pubs/131117/forumcrawl_sigir08.pdf
Unique

d) Random Sampling Sitemap Construction


Traversal Strategy Exploring
Double-ended Queue 1000


Pruning

4 Poisson Process
a)--
b)
c)
--
http://d.wanfangdata.com.cn/Periodical_xdjsj-xby200912018.aspx
10%
Index
LinkContent



http://www.jos.org.cn/ch/reader/view_abstract.aspx?file_no=20060513

d)Index/Link/Content F(Index)/F(Link)/F(Content)
X 5*X/t

5 Backlink Count
a)
b)/
c) Hyperlink
Backlink/Inlink

Baseline


d)
Link Backlink Count

6 PagerankBatch Pagerank
a)
b)/
c) Pagerank
Pagerank

d) Pagerank K Pagerank
Pagerank

7 PagerankPartial Pagerank
a)
b)/
c) Pagerank Pagerank Pagerank
Pagerank Pagerank Pagerank Pagerank

d) Pagerank K Pagerank
Pagerank Pagerank

8Online Page Importance Computation


a)
b)/
c)Pagerank
Pagerank
online OPIC Pagerank

d)Cash
Pending
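The Cash idea behind Online Page Importance Computation can be illustrated on a small in-memory graph. This is a simplified sketch, not the published algorithm in full: every link target is assumed to be a key of `graph`, and a page without out-links spreads its cash over all pages, which is an assumption made here to keep the total cash constant.

```python
def opic_crawl(graph, steps):
    """Greedy OPIC-style crawl: always fetch the page holding the most cash."""
    pages = list(graph)
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}
    order = []
    for _ in range(steps):
        page = max(pages, key=lambda p: cash[p])
        order.append(page)
        amount = cash[page]
        history[page] += amount   # remember the cash the page has collected
        cash[page] = 0.0
        targets = graph[page] or pages  # assumed handling of dangling pages
        share = amount / len(targets)
        for target in targets:
            cash[target] += share  # pass the cash on along the out-links
    total = sum(history.values()) + sum(cash.values())
    importance = {p: (history[p] + cash[p]) / total for p in pages}
    return order, importance
```

Unlike the batch PageRank variants, the importance estimate is available at any moment during the crawl and never requires a full pass over the link graph.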

9User Centric Crawling


a)
b)
c)Pandey and Olston (user-centric)
User-Centric
Web Crawling
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.6287&rep=rep1&type=pdf
()

d) Phrase/Multi-Words Query
Single Word Query
p q r1 r2 r1,r2 Q(p)
p Q(p) Q(p) P(p,t)=Q(p)*(t-LR(p,t)) LR(p,t)

5. Spider

Maximum Stress Control


IP Multi-IP Stress Balance

Domain UniformizationSpider-Trap HandleWildcard


Subdomain/Infinite Domain Name Generators
Distributed Crawling ArchitectureHTML Parser
ill-formed HTML RestoreJavaScript URL URL NormalizeURI
Hash Redirect ControlLogin-protected
Page Fetch Incremental Crawler
Coverage FreshnessPage Importance
Dead Link CheckSpam ControlLink Farm Penalize


Information Retrieval
Pre-process

Universal/Opening DistributionClean
Pros and Cons


Anti-Spam

Site/Domain

HTML
parser

Anchors

Link-relation

Title

Content
Content
extractor

Indexer

Web Page Cleaning

De-duplicate

Pre-process

Segmentation

Quality Selection

page analyze

Specials

Dictionary

4.1

1. Quality Selection
Page
Importance


GeneralQuality Selection

/


Pros and Cons

4.1.1

Sites Dictionary

2-4

<20%

2-5

>80%

Site Map

Site Topics

3-5 90%

2-4 2-5
20% 80% 3 N
1N N>4
1/5=20% 4/5=80%


Google Baidu

Site Evaluation
Website

1)/Credibility/ Authority

www.whitehouse.gov www.whitehouse.org www.whitehouse.net


Types gov
net
Google whitehouse www.whitehouse.org
whitehouse.georgewbush.org

4.1.2
.gov/.edu /.mil

.org

com

.net/.cn



LuceneHadoop Doug Cutting
http://cutting.wordpress.com/

Semantic Web
Email

2)Reputation

TrafficIndex
Alexa 100

Navigation


3)Audience



4)Completeness

5)Access/Workability

6)Accuracy

7)Currency

8)Uniqueness

9)/Facticity/Objectivity
Encyclopaedia
Wikipedia Ask
Wikipedia Google
10)(Quality of writing)
Typographical errors/spelling mistakes

11)/Browsability and layout



Navigability

12)Multimedia

Google
Sign of Zodiac


4.1.3
1

10

11

12

/Link/Anchor Content
Canon

(Saint)
Authority and Hub
Link
Anchor

Relevant Linkage Principle [Kleinberg 1997]
Link_A Link_B Link_A Link_B
Topical Unity Principle [Kessler 1963, Small 1973]
Link_C Link_A Link_B Link_A Link_B
Lexical Affinity Principle [Maarek et al. 1991]
Link_A Link_B URL Link_A
Link_B Anchor

Link_A

Link_B

Relevant Linkage Principle


Link_A
Link_C
Link_B

Topical Unity Principle


Link_A
Link_B
HTML

Lexical Affinity Principle


4.1.1



Page Clean Site Templates

Pagelets Analysis
HTML DOM TREE HTML
(ordered linear space) two-dimensional space

CSS Visual Tree

DOM Tree


DOM Tree Web Page Cleaning for Web Mining through Feature Weighting
Visual Tree Entropy-Based Visual Tree Evaluation on Block Extraction
Site Templates Joint Optimization of Wrapper Generation and Template Detection
Site Templates Site-Independent Template-Block Detection
Site Templates Page-level Template Detection via Isotonic Smoothing

Pagelets Improving Hypertext Data using Pagelets and Templates


Page Templates

Page-level Template Detection via Isotonic Smoothing

Visual Tree

DOM Tree
CSS

HTML
4.1.2

http://news.sina.com.cn/c/2010-07-29/012020778393.shtml

Site Templates

http://news.sina.com.cn/c/2010-07-29/163620785082.shtml

4.1.3 Site Templates


Pagelets Analysis

4.1.4 Pagelets Analysis




1 PageRank Hilltop
2 HITS SALSA
3 Entropy Analysis
Ranking Link Analysis Ranking

PageRank
The Anatomy of a Large-Scale Hypertextual Web Search Engine

4.1.5 PageRank

Hilltop
When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics

Krishna Bharat

George
Andrei
Mihaila

4.1.6 Hilltop

HITS
Hyperlink-Induced Topic Search --Authoritative Sources in a Hyperlinked Environment

4.1.7 HITS

SALSA
The Stochastic Approach for Link-Structure Analysis The Stochastic Approach for Link Structure
Analysis (SALSA) and the TKC Effect

4.1.8 SALSA

Entropy Analysis
Entropy-Based Link Analysis for Mining Web Informative Structures
Mining Web Informative Structures and Contents Based on Entropy Analysis

4.1.9 Entropy

Text Content


Pros and Cons

Noise

HTML Semi-structured

1Unstructured (Plain Text)


2More structured
Table
List
3Fixed Structured
Multimedia data
Document





Fixed Structured
Multimedia data

...

More structured
Table
List

...

Unstructured

(Plain Text)

...
HTML
4.1.10

Wikipedia 14 Topic
10
1 2 3
45P2P 6Email
78910
Universal Search GoogleBaiduYahooBing
Vertical Searchkooxoo gougou
qihoo

4.1.2

Blog

News

Image

Video

Forum

(P2P)

Wap

2. De-duplicate
Duplicate/Near-Duplicate Detection
Copy Detection / Plagiarism Detection / Duplicate Detection),
In 1976, Ottenstein proposed Attribute Counting for Copy Detection.
In 1993, Udi Manber (University of Arizona) introduced SIFF (Finding Similar Files in a Large File System), which is based on Approximate Fingerprints.



Manber
http://manber.com/
Amazon, Yahoo
Google
VP Manber (Introduction
to Algorithms) Manber

4.2.1 Google VP, Udi Manber

In 1995, Sergey Brin and Garcia-Molina at Stanford proposed COPS (copy protection system).
Garcia-Molina and Shivakumar then proposed SCAM (Stanford copy analysis method);
SCAM uses the Vector Space Model.
Garcia-Molina Google
Sergey Brin, Larry Page

2004
CSDN

http://blog.csdn.net/malefactor/

http://blog.csdn.net/malefactor/archive/2006/06/09/782882.aspx
Google Gurmeet Singh Manku Detecting Near Duplicates for Web Crawling
SimHash
http://infolab.stanford.edu/~manku/papers/07www-duplicates.ppt
Yahoo P Govindarajulu Duplicate and Near Duplicate
Documents Detection: A Review
http://www.eurojournals.com/ejsr_32_4_08.pdf
MIT Shreyes Seshasai 09 Efficient Near Duplicate Document Detection
for Specialized Corpora
http://via.mit.edu/documents/Seshasai.pdf

http://sewm.pku.edu.cn/TianwangLiterature/PhdDissertation/%5BHuang,2008%5D/hle_thesis.pd
f

(Process Introduction)
1,




2,
20% 30%

Precision>90% && Recall > 90% 85%

3,

1
exact duplicates:mirroringplagiarism
near duplicates: Advertisements
Template Frames
timestamps
2
post process:

inline process:
url
4
1.
LOGO Page Clean/Noise Reduction

Abstract Extraction
2. /FingerPrint

3. Resemblance 2 Distance
Fingerprint online

4. Cluster IterativeGraph
Union Find
5. Delegates
Hashing=>Signatures=>Fingerprint; Vector=>Cosine=>Distance=>Resemblance;

Delegates

Cluster

De-duplicate

FingerPrint

Resemblance

Segmentation

HTML

Dictionary

Page Clean

4.2.2

Link-duplicate
outlinksPath Hash
Hash

SEOSPAM
proper
subgraph

Content-duplicate
Fingerprint

FingerPrintMilestones
1, CheckSum: Checks MD5 & SHA & CRCs
2, Longest Common Subsequence
3, Shingling Broder 1997: Jaccard index of tokens
4, SimHash Charikar 2002/ Gurmeet Singh Manku 2007WWW:
5, I Match Chowdhury 2002: IDF tokens
tokens tokens Digest Digest

6, Spotsig Martin Theobald 2008:

Common Words Spot Words

Jaccard
7, Bloom Filter Bloom 1970 / Chazelle 2004
: K Hash m Hash
xk=>m0<=m<=M-1 1
8, Chunk HP LAB 2005/2009: Window Chunk
Chunk

1. CheckSum
URL
URL

2. Longest Common Subsequence


LCS
http://sewm.pku.edu.cn/TianwangLiterature/Report/NCIS_TR_2007012.pdf
CharacterPhrase

Statement

1)Top Keyword Feature


2) Special Statement[ ]
3) Query Phrase[
]
4)

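A direct LCS computation takes O(N*N) time; the Myers algorithm takes O(ND), where N = Length(A) + Length(B) and D is the size of the shortest edit script, a quantity akin to the Levenshtein Distance.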

Sentence A Sentence B LCS Sentence A Sentence B


SES Shortest Edit Script
SES Edit Graph

An O(ND) Difference Algorithm and Its Variations
Tanimoto: R(A,B) = |LCS| / (|A| + |B| - |LCS|)
LCS
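The resemblance R(A,B) = |LCS| / (|A| + |B| - |LCS|) can be computed as in the sketch below. It uses the plain quadratic dynamic program rather than the Myers O(ND) algorithm, purely to keep the example short; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_resemblance(a, b):
    """R(A,B) = |LCS| / (|A| + |B| - |LCS|); 1.0 for two empty inputs."""
    common = lcs_length(a, b)
    denominator = len(a) + len(b) - common
    return common / denominator if denominator else 1.0
```

The inputs can be sequences of characters, phrases, or statements, depending on the granularity chosen for comparison.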

3. Shingling
Broder 1997Jaccard index of tokens
Syntactic Clustering of the Web
www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf
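Shingling compares two documents through the Jaccard index of their sets of contiguous token windows (shingles). A minimal sketch follows; the window size `w=4` is an assumed default, and real systems usually hash and sample the shingles instead of storing them all.

```python
def shingles(tokens, w=4):
    """Set of all contiguous windows of w tokens."""
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a_tokens, b_tokens, w=4):
    """Jaccard index of the two shingle sets."""
    sa, sb = shingles(a_tokens, w), shingles(b_tokens, w)
    return len(sa & sb) / len(sa | sb)
```

For the token sequence "a rose is a rose is a rose" with w = 4, only three distinct shingles remain, which shows how repeated text collapses in the set representation.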

4. SimHash
Google --Detecting Near-Duplicates for Web Crawling
http://infolab.stanford.edu/~manku/papers/07www-duplicates.ppt
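A SimHash fingerprint can be sketched as follows: each token votes on every bit position, and documents count as near-duplicates when their fingerprints differ in only a few bits. The use of MD5 as the per-token hash and the unweighted votes are assumptions of this example; production systems typically weight tokens by importance.

```python
import hashlib


def simhash(tokens, bits=64):
    """SimHash fingerprint of a token list (bits must be at most 64 here)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # A set bit votes +1 for position i, a clear bit votes -1.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

Unlike a CheckSum, a small change in the token list tends to flip only a few bits of the fingerprint, so the Hamming distance can serve as a resemblance measure.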

5. I Match
Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization
www.ir.iit.edu/~abdur/publications/470-kolcz.pdf

6. Spotsig
SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
http://ilpubs.stanford.edu:8090/831/1/2008-14.pdf

7. BloomFilter
Using Bloom Filters to Refine Web Search Results
www.cs.utexas.edu/users/dahlin/papers/webdb-167.pdf
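A Bloom Filter maps each item through K hash functions onto an array of M bits and sets those bits to 1; an item is reported as present only if all of its bits are set. The sketch below derives the K hashes by salting SHA-256 with a counter, which is an assumption of this example rather than a prescribed construction.

```python
import hashlib


class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.k):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A Spider can use such a filter as a compact "already seen" set for URLs or fingerprints, accepting that a small fraction of new items will be wrongly treated as seen.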

8. Chunk
A Framework for Analyzing and Improving Content-Based Chunking Algorithms
http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf
Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
www.hpl.hp.com/personal/Mark_Lillibridge/Extreme/final.pdf

Resemblance Distance
Euclidean,
Manhattan,Chebyshev, Jaccard, Cosine
,Correlation Coefficient
6
1. Cosine Similarity
2. Jaccard Index
3. Tanimoto Index
4. Pearson Correlation Coefficient
5. SimRank
6. Levenshtein distance

1. Cosine Similarity.
TF-IDF

TF-IDF Vector Space Model


When it comes to Cosine Similarity, the TF-IDF weighting model is usually the first choice, and the TF-IDF weighting model in turn builds on the Vector Space Model (VSM).

Word Segmentation


Dictionary 1. 2. 3. 4. 5. 6 7

Stop

Words 1. 2.
Document A
Document B
Document C
Stop Words
1000

1.

**

2.

3.
4.

*
*

5.
6

*
*

*
*

Document A 1 4 1 6 7
Document B 1 3 4 6
Document C 2 7
Document A Document C
1234567
Document A 2 0 0 1 0 1 1
Document B 1 0 1 1 0 1 0
Document C 0 1 0 0 0 0 1

1983 Salton McGill TF-IDF Term Frequency-Inverse Document Frequency,


Term Frequency Document A 2 0 0 1 0 1 1 2
Inverse Document Frequency Inverse
Document Frequency
1234567
Document Frequency 2 1 1 2 0 2 1
TF-IDF A TF-IDF 2/2 = 1
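1) TF, Term Frequency. Let Count(i,j) be the number of times term i occurs in document j. $TF_{i,j}$ is this count divided by the total number of term occurrences:

$$TF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)}$$

For term 1 in Document A: $TF_{1,A} = Count(1,A) / \sum_{i,j} Count(i,j) = 2 / 11$.

2) IDF, Inverse Document Frequency. Let Number(i) be the number of documents that contain term i. The Document Frequency of term i is $DF(i) = Number(i) / \sum_i Number(i)$, and the Inverse Document Frequency is its reciprocal:

$$IDF(i) = \frac{\sum_i Number(i)}{Number(i)}$$

For term 1, which occurs in two documents: $IDF(1) = 9 / 2 = 4.5$.

3) TF-IDF is the product of the two:

$$TF_{i,j} \cdot IDF(i) = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \frac{\sum_i Number(i)}{Number(i)}$$

$TF_{i,j}$ measures the weight of term i in document j, while IDF(i) measures how rare term i is across documents. Because IDF(i) can become very large, a log is applied:

$$TF\text{-}IDF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \log \frac{\sum_i Number(i)}{Number(i)}$$

To avoid a zero denominator for a term that occurs in no document, 1 is added to Number(i):

$$TF\text{-}IDF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \log \frac{\sum_i Number(i)}{1 + Number(i)}$$

For term 1 in Document A, with a base-10 log: $TF\text{-}IDF_{1,A} = \frac{2}{11} \cdot \log \frac{9}{2 + 1} \approx 0.087$.

$TF\text{-}IDF_{i,j}$ can also be interpreted through Mutual Information or the KL (Kullback-Leibler) Divergence.
With a TF-IDF vector for each document, the Cosine Similarity of Document A and Document B can be computed. Each vector has one TF-IDF component per term: (1, *), (2, *), (3, *), (4, *), (5, *), (6, *), (7, *). For Document A the first component is 0.087; the vectors of Document B and Document C are built the same way.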
The similarity of Document A and Document B lies in [0, 1]:

$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{A \cdot B}{\sqrt{A \cdot A} \, \sqrt{B \cdot B}}$$
Google

2. Jaccard Index
Cosine Similarity Jaccard Index
Jaccard Coefficient: for two sets A, B, the size of the intersection of A, B divided by the size of the union of A, B.
Jaccard Coefficient = $|A \cap B| / |A \cup B|$
Jaccard Distance = 1 - Jaccard Coefficient = $(|A \cup B| - |A \cap B|) / |A \cup B|$
Document A 1 4 1 6 7
Document B 1 3 4 6
Jaccard Distance = $1 - |\{1, 4, 6\}| / |\{1, 3, 4, 6, 7\}| = 1 - 3 / 5 = 0.4$

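3. Tanimoto Index
The Tanimoto Index is closely related to 1. Cosine Similarity and 2. Jaccard Index. For vectors:
$T(A, B) = \frac{A \cdot B}{A \cdot A + B \cdot B - A \cdot B}$
For sets:
$T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$
Dice Coefficient
$D(A, B) = \frac{2 \, A \cdot B}{A \cdot A + B \cdot B}$
$D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$

$D = 2J / (1 + J)$ and $J = D / (2 - D)$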
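The set forms of the Jaccard and Dice coefficients, and the conversion between them, can be checked with a few lines. This is a small sketch with hypothetical function names; empty inputs are treated as identical, which is a convention chosen here.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections, compared as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) for two collections, compared as sets."""
    a, b = set(a), set(b)
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 1.0
```

For Document A = {1, 4, 6, 7} and Document B = {1, 3, 4, 6}, J = 3/5 = 0.6 and D = 6/8 = 0.75, which agrees with D = 2J / (1 + J).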

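4. Pearson Correlation Coefficient

$\rho(A, B) = \frac{cov(A, B)}{\sigma_A \sigma_B} = \frac{E[(A - \mu_A)(B - \mu_B)]}{\sigma_A \sigma_B}$
where E is the expectation and cov is the covariance.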

5. SimRank

$R_0(a, b) = 1$ if $a = b$, otherwise $0$;
$R_{k+1}(a, b) = \frac{C}{|I(a)| \, |I(b)|} \sum_{i} \sum_{j} R_k(I_i(a), I_j(b))$
I(x): the set of in-neighbors of x; C is a decay constant between 0 and 1.

6. Levenshtein distance
Levenshtein distance Edit distance

Document A 1 4 1 6 7
Document B 1 3 4 6
Levenshtein distance = 3
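The value above can be reproduced with the standard dynamic program for the edit distance, sketched below. It works on any pair of sequences, so the documents can be given as lists of term IDs.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]
```

One shortest script for the example keeps 1, substitutes 4 with 3 and 1 with 4, keeps 6, and deletes 7, which gives three edits.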

3. Anti-spam
Spam Spam
WEB2.0
Spam
anti-spam
http://blog.csdn.net/malefactor/archive/2006/05/30/762895.aspx
Stanford University Zoltan Gyongyi Hector Garcia-Molina
http://infolab.stanford.edu/~zoltan/
http://infolab.stanford.edu/people/hector.html
Web Spam Taxonomy Spam
BoostingHiding

repetition (dumping), weaving
stitching
honey pot
directory
posting
exchange

farm dir.clone

Web Spam TaxonomyBoosting

Web Spam TaxonomyHiding

Link-spam
Spam cheat

outlinks

Hub pages

inlinks

Spam farm

In-link exchange

/Links posting / Splogging

Web directory

DNS DNS cloaking

Expired domains purchase

Honey pot

Anti-Spam
1, Spam HITS
2, Spam PageRank
3, TrustRank (VLDB2004)
4, BadRank (WWW2005)
5, SpamRank (WWW2005, workshop)
6, ParentRank (WWW2005)

1. Spam HITS
Improvements of HITS algorithms for spam links

2. Spam PageRank
Microsoft --Robust PageRank and Locally Computable Spam Detection Features

3. TrustRank
Yahoo -- Combating Web Spam with TrustRank
Propagating Trust and Distrust to Demote Web Spam

4. BadRank
Google -- PR0 -Google's PageRank 0 penalty.
Generalized BadRank with Graduated Trust

5. SpamRank
SpamRank Fully Automatic Link Spam Detection

6. ParentRank
Identifying link farm spam pages

Content-Spam

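Zipf's law, Heaps' law

1. Zipf's law

Zipf's law was proposed by G. K. Zipf in 1935: the frequency of a word is inversely proportional to its rank in the frequency table (1/f). The second most frequent word occurs about 1/2 as often as the most frequent one, the third about 1/3 as often, and the n-th about 1/n as often.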

4.3.1 GKZipf
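Simon Newcomb observed a similar regularity in leading digits: in base b, the leading digit n occurs with probability $\log_b(n + 1) - \log_b(n)$. In base 10, the leading digit is 1 about 30.1% of the time, 2 about 17.6%, 3 about 12.5%, and 9 only about 4.6%.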
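The leading-digit percentages follow directly from the formula log_b(n + 1) - log_b(n), as the short sketch below shows. The function name is hypothetical.

```python
import math


def leading_digit_probability(n, base=10):
    """Probability that the leading digit equals n under Newcomb's observation."""
    return math.log(n + 1, base) - math.log(n, base)
```

A page whose numbers or word frequencies deviate strongly from such expected distributions is one possible signal of machine-generated Spam content.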

4.3.2 Simon Newcomb


Spam
Spam

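2. Heaps' law

Heaps' law: $V_R(n) = K n^{\beta}$, where $V_R(n)$ is the number of distinct words in a text of n words, with K in [10, 100] and $\beta$ in [0.4, 0.6].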
Spam

3.
Title
AnchorMeta Spam

5% 30%

4.3.2

Microsoft -- Detecting Spam Web Pages through Content Analysis


http://research.microsoft.com/apps/pubs/default.aspx?id=65140

A successful search engine requires more bandwidth to upload query result pages than its
crawler needs to download pages

http://www.seo.com.cn/seopdf/.pdf

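Precision, Recall and F1:
Precision = (relevant documents retrieved) / (documents retrieved)
Recall = (relevant documents retrieved) / (relevant documents)

F1 is the harmonic mean of Precision and Recall:
F1 = 2 * Precision * Recall / (Precision + Recall)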

:
1. Mean Average Precision MAP: MAP
MAP MAP MAP

2. R-Precision: R-Precision R R
R-Precision R-Precision
3. P@10: P@10 10
10
P@10
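The measures listed above can be written down compactly. The sketch below uses the usual textbook definitions: Average Precision averages the precision at each rank where a relevant document appears, MAP averages that over queries, R-Precision is the precision at rank R for a query with R relevant documents, and P@10 is the precision over the first 10 results. Function names are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based Precision, Recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the first k results that are relevant (P@k)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at_k(ranking, relevant, len(relevant)) if relevant else 0.0


def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over (ranking, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Precision, Recall and F1 ignore the order of results, while MAP, R-Precision and P@10 reward rankings that place relevant documents early.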

--
http://www.sales2marketing2.com/PracticalInternetMarketing_VincentCheng.ppt
Online Marketing Channels are:
1.

Search Engine Optimization and Marketing

2.

(Google, Yahoo, MSN)

3.

Social Network Sites

4.

(MySpace, Friendster, YouTube)

5.

Social BookMarking (De.li.cio.us, BlogMarks)

6.

Email Marketing

7.

Viral Marketing

8.

Online Press Release

9.

Blogs

10. Link Building Reciprocal and One Way Paid Links


11. Affiliate Marketing
12. E-Zine / Articles Marketing
13. Paid by Impression/Click Banner Advertising
14. Vertical Search Engines
15. Relevant / Vertical Directories
16. Video Advertising
17. Relevant / Vertical Forums
18. RSS Really Simple Syndication
19. PodCasting / Video Casting
20. Chat and Messengers
21. EBay / Auction Sites
22. B2B Business Networks
23. Mobile Marketing

,,
, ,
WWW


5.1. ""

5.2.
a)

b)

c)
d) e)

5.3.

5.4.

Google Google Answer AnswerBot
"how can kill virus of computer?"

"virus"
"how can kill virus of computer?"

5.5.

FTPFlash
5.6.

5.7.

GoogleYahoo
""

XML
