Search Engine Model

A Simple Scratch of Search Engine
chunqi.shi@hotmail.com
http://hi.baidu.com/shichunqi
1 07/06
2 07/08
3 07/16 1/3 07/30
4
5
http://sewm.pku.edu.cn/IR-Guide.txt
--
A Simple Scratch of Search Engine
Spider
Quality Selection
De-duplicate
Anti-spam

1,
GoogleBaidu
http://www.baidu.com/more/ Google
http://www.google.com.hk/intl/en/options/
Lab
Wiki
http://en.wikipedia.org/wiki/List_of_search_engines 10
1 2 3
45P2P
6Email78910
14
1[] 23456
78910 1112
13 14
2

Yahoo
InfoseekGoogle Baidu
Google 1 1000
3171

Grassroots
--

1.
(Sequential File)(Random File)
2.
(Index) (Hash)
3.
Storage Pyramid
Register
CacheInternal StorageExternal Storage

Index
Keywords/Term
Retrieval

1.1
entirelytimely

fast indexing
efficient accessibility small storage space

valuable

Terms
resemblance ranking
Internet
WWW
1.1

IO
/
/

CACHE->->->
/->/ Pyramid Hierarchy
ClusterDistributed
Inverted IndexSequential
Hashing

WEB
MVC(Model-View-Controller) WEB DATA

ResemblanceRank WEB DATA
Retrieval
Web-Data-Retrieval

Google Baidu
Yahoo
TRECSIGIRWWW

(Information Retrieval) WEB Web Technology
1.2
Spider
Spider CrawlerSpider Schedule
Spider Update
Indexer Indexer
IR
Indexer Analyze
Data Base
Retrieval Retrieval
Retrieval Query Resemblance
Rank

Figure 1.2 (labels): User Interface, Frontend, Internet, Query, Schedule, Spider, Indexer, Update, Preprocess, Retrieval, Analyze, Rank, Backend

Modeling the Internet and the Web: Probabilistic Methods and Algorithms
http://book.douban.com/subject/1756106/
http://ibook.ics.uci.edu/slides.html
PDF
http://bib.tiera.ru/DVD-010/Baldi_P.,_Frasconi_P.,_Smyth_P._Modeling_the_Internet_and_the_Web._Probabilistic_Methods_and_Algorithms_(2003)(en)(285s).pdf
Internet Intranet LAN
RJ-45
ISO TCP/IP
Internet
InternetIntranet LAN
TCP/IP
World Wide Web HTTP hyperlinks
Net Web WWWWebsite
WWW
1999 Chinaren 263

WWW WWW Web

1
-- CNNIC
16 http://www.cnnic.net.cn/index/0E/21/index.htm
1. 1994 4 20 NCFC Sprint Internet 64K Internet Internet
2. 1994 5 BBS BBS
3. 1998 6 CERNET IP (IPv6) 6BONE
4. 1999 7 12
5. 2000 12 12
6. 2001 1 1 ""
7. 2001 7 9
8. 2001 12 20
9. 2004 2 3 18 2003
10. 2004 5 13
11. 2004 6 16
12. 2005 8 5
13. 2005 Web2.0 Web2.0
BlogRSSWIKISNS
14. 2006 12 18 Verizon

15. 2007 100
16. 2008 6 30 2.53 7 22 CN

1218.8
BBS
WEB2.0
1.
2.
()
3.
4.
()
5.
WEB2.0
=> => =>
2

2010 1 CNNIC
http://www.cnnic.net.cn/uploadfiles/pdf/2010/1/15/101600.pdf
/
- 11
html/htm shtml
php asp jsp aspx 3:1:5
- 10
75% 55%
30% 8%
1%
- 5 336 ~
- 14 30K
964 Terabytes

- 11
Suffix        Share
.html         20.1%
htm           6.5%
(unlabeled)   2.1%
shtml         8.7%
asp           12.6%
php           22.2%
txt           0.0%
nsf           0.0%
xml           0.0%
jsp           1.0%
cgi           0.2%
pl            0.0%
aspx          6.1%
do            0.5%
dll           0.0%
jhtml         0.0%
cfm           0.0%
php3          0.0%
phtml         0.0%
(unlabeled)   19.7%
Total         100%
- 10
7.7%
21.2%
28.1%
18.8%
24.3%
100%
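- 5
Item                      2008               2009                 Growth
Total pages               16,086,370,233     33,601,732,128       108.88%
Static pages              7,891,388,272      18,998,243,013       140.75%
  share of total pages    49.06%             56.54%
Dynamic pages             8,194,981,961      14,603,489,115       78.20%
  share of total pages    50.94%             43.46%
Static : dynamic          0.96:1             1.3:1
Total page size (KB)      460,217,386,099    1,059,950,881,533    130.32%
Pages per website         5,588              10,397               86.06%
Average page size (KB)    28.6               31.5                 10.30%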
- 14
Total: 964.0 Terabytes, average page size 30.8 KB
3

QQ
hao123

- 8 80%

- 8

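One of the first Spiders was written by Matthew Gray (MIT) in 1993, the year NCSA Mosaic was released: the World Wide Web Wanderer ("www wanderer"). The Wanderer was written in Perl.
David Eichmann later wrote the RBSE spider.
3.1 Matthew Gray, Google
3.2 David Eichmann, Iowa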
Spider Spider
EVP() Google --
http://www.cqumzh.cn/att_blog/month_0901/a2fe1b64c99263b246e9d923f1055549_1231307756.pdf
1. Spider
Spider
1
2 Spider
1
2
URI/URL
1 URL URL
JSP/ASP/Servlet/PHPURL URL Mapping
2 URL URL
URL
Spider Trap
URL
URL Spider URI /
Figure 3.1.1 Spider (labels): Internet, Downloader (tcp/ip), unfetched URLs, URI (update), URL extractor & normalizer, HTML, Analyzer
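The "URL extractor & normalizer" step can be illustrated with a minimal sketch. It assumes that normalizing means resolving relative links, lower-casing the scheme and host, dropping default ports and dropping fragments; the function name `normalize_url` and the `seen` set are hypothetical, not taken from the source.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(base_url, href):
    """Resolve href against base_url and return one canonical spelling."""
    absolute = urljoin(base_url, href)          # resolves "../" and relative paths
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()       # hostname is already lower-cased
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

seen = set()  # hypothetical store of already-known URLs

def is_new(url):
    """Return True the first time a normalized URL is seen."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

With this sketch, `http://Example.com:80/a/b.html` plus the link `../c.html#top` and a direct link to `http://example.com/c.html` map to the same key, so the page is queued only once.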
2. Spider
IP
URL HOST IP
DNS Resolver Spider WebSite
Robots Robots
Win-Win Crawler Trap URL path
robots.txt Spider

(Stress) Access RateMaximum Stress
Block

Website(Domain
Name) DN IP Download
IP Wildcard Domain
Infinite Sub-domain Generator
Domain Uniformization
IP

Multi-Downloader Schedule DNS DNS resolver Robots Robots Protocol
Checker Website Meta-info Collector[Maximum Stress
IP Multi-IP Stress BalanceDomain Uniformization]
HTML
Parser
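As an illustration of the Robots Protocol Checker, the following sketch uses Python's standard `urllib.robotparser`. The robots.txt content, the user-agent name `ExampleSpider` and the helper `make_checker` are assumptions made for the example. A real Spider would fetch and cache robots.txt per site.

```python
from urllib.robotparser import RobotFileParser

def make_checker(robots_txt, user_agent="ExampleSpider"):
    """Build an allowed(url) function from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed

# Hypothetical robots.txt that closes one directory to every robot.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

allowed = make_checker(ROBOTS_TXT)
```

The Scheduler would call `allowed(url)` before handing a URL to the Multi-Downloader and skip any URL for which it returns False.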

Figure 3.2.1 Spider (labels): Internet, robots.txt, DNS Cache, DNS Resolver, Client, pages, Robots Checker, Multi-Downloader, Scheduler (tcp/ip), unfetched URLs, URI (update), Site/Domain (update), URL extractor & normalizer, Content extractor, HTML Parser, HTML, Data Package, Analyzer
Spider
//Multi-Downloader
Multi-Downloader
Paralleled Crawler
Partitioning
Crawler Center
Distributed Crawler
Locally DistributedLower Latency
Internet Backbone Traffic Interchange Politeness
Linguistic Validation & Cultural Adaptation
Static ParalleledDynamic Paralleled

Figure 3.2.2 Spider Architecture (labels): Crawler Center 0 to L, Crawler 0 to N, Downloader 0, Multi-Downloader, Paralleled Crawler, Distributed Crawler
Figure 3.2.3 Static Paralleled (labels): Internet, Scheduler; Crawler 0: qq.com/download, qq.com/blog, 163.com/news, 163.com/mail; Crawler 1: news.sina.com.cn/beijing, news.sina.com.cn/whether, sohu.com/news, sohu.com/maps; Crawler N-1
Figure 3.2.4 Dynamic Paralleled (labels): Internet, Scheduler, Crawler 0, Crawler 1, Crawler N-1; qq.com/download, qq.com/blog, 163.com/news, 163.com/mail, news.sina.com.cn/beijing, news.sina.com.cn/whether, sohu.com/news, sohu.com/maps
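A Static Paralleled assignment can be sketched as a fixed function from URL to crawler. The sketch below assumes the partition key is the host name, so that all pages of one site go to the same crawler; the hash choice and the name `assign_crawler` are illustrative only.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) by hashing its host."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # A stable hash is used so that the assignment survives restarts.
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

Because the mapping depends only on the host, `qq.com/download` and `qq.com/blog` always land on the same crawler, which keeps per-site politeness control local to one machine. A Dynamic Paralleled scheduler would instead hand out URLs according to the current load.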
3. Spider

Devanshu Dhyani A Survey of Web Metrics
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.5859&rep=rep1&type=pdf
336 Seeds

30% 8%

Yahoo Research Barcelona Lab http://labs.yahoo.com/Yahoo_Labs_Barcelona
Ricardo Baeza-Yates and B. Barla Cambazoglu, Yahoo tutorial:
http://www.lirmm.fr/~coletta/CaisePresentations/TutorialYAHOO.pdf
Spider Quality Metrics
1 Coverage: the percentage of the Web discovered or downloaded by the crawler.
2 Freshness: a measure of how out of date the local copy of a page is relative to the page's original copy on the Web.
3 Page importance: the percentage of important or popular pages in the repository.
Ricardo Baeza-Yates: Yahoo VP, ACM Fellow. SIGIR 2009:
Quantifying Performance and Quality Gains in Distributed Web Search Engines. In SIGIR 2009
http://research.yahoo.com/search/node/Quantifying
http://www.dcc.uchile.cl/~rbaeza/
http://research.yahoo.com/user/70
Figure 3.3.1 (labels): Spider, Internet, Schedule, Analyzer; Coverage, Freshness, Page importance
2003 :
--
http://sewm.pku.edu.cn/TianwangLiterature/Other/%5B%CD%F5%BC%CC%C3%F1,2003a%5D/032116.pdf
Poisson Web Poisson
3 :(1);(2);(3)
Web T Web
$1 - e^{-\lambda T} = 0.5$
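Reading the formula above as the Poisson change model, the probability that a page changes at least once within an interval T is 1 - exp(-rate * T), and setting it to 0.5 gives T = ln 2 / rate. The sketch below works this out. The symbol `rate` (changes per day) and the sample value 0.1 are assumptions for illustration, since the rate symbol did not survive in the text.

```python
import math

def change_probability(rate, interval):
    """Probability of at least one change within `interval` under a Poisson model."""
    return 1.0 - math.exp(-rate * interval)

def interval_for_probability(rate, probability=0.5):
    """Interval T after which a page has changed with the given probability."""
    return -math.log(1.0 - probability) / rate

# Hypothetical page that changes on average once every 10 days.
half_change_interval = interval_for_probability(0.1)   # ln(2) / 0.1 days
```

For this hypothetical rate the page has a 50% chance of having changed after about 6.9 days, which is one way to pick a re-visit interval.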
CNNIC -
10
Sanasam Ranbir Singh, IJCAI07, Estimating the Rate of Web Page Updates
http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-462.pdf

Link Rot
Dead Link / Broken Link Link RotDead Link

Link Rot
Dead Link
Figure 3.3.2 (labels): Spider; Internet Status(t), Internet Status(t + T); Fetched: fresh, stable, obsolete (Re-fetch); To-fetch: New Links; Dead Links

Outdated/Obsolete Spider
Create-Delete-Update CRUD
Spider Spider
Google Deepbot Freshbot Deepbot
Coverage Freshbot Re-visit/Refresh
Figure 3.3.3 (labels): Spider, Internet, Schedule (Fetch, Refresh), Analyzer; Coverage, Freshness, Page importance, Dead links
4. Spider
Carlos Castillo, Effective Web Crawling
http://www.webir.org/resources/phd/Castillo_2004.pdf
Carlos Castillo University of Chile
YAHOO
&VPRicardo Baeza-Yates
--

http://www.c-s-a.org.cn/ch/reader/view_abstract.aspx?file_no=20090752&flag=1
Selecting/Ordering Strategies
FetchRefresh

Granularity

Divide Standard
Fetch
Refresh
Regular
Historical/Empirical Feedback
4.1
Granularity Geographically
Website
Page
Link/URL
Tracks
Linguistic
Architecture
Popularity
Group Pattern
Encoding
Traffic
Utility
URL Keyword
Popularity
Relevance
Path Depth
Quantity &
Saturation
4.2
,
Page Importance

Website ArchitectureLink Path Depth
Search Engine
Optimize
Search Engine Cheat
Spam Website
Page Importance
Google .Matt Cutts
HIThis is the great content I has
http://www.mattcutts.com/blog/
http://v.youku.com/v_show/id_XMTY3NTM2ODQ0.html
{

~
.
Is content still the king or has something else (structure) taken over? "Content is necessary. It's
not always sufficient because people have to find out about your content. But if you don't have good
content, it's a lot harder to do good search engine optimization for your site." ~ Matt Cutts.
}
Net

Homepage/Index Link
ContentLink
Content
2.3.4.1
1 Breadth First
a) --
b) /
c)
Baseline
Carlos Castillo, Effective Web Crawling
Crawling the Infinite Web: Five Levels are Enough

3-5
90%
90%
Follow 5
d)
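A minimal sketch of the Breadth First strategy with a depth limit follows, in the spirit of "Five Levels are Enough". The callback `get_links`, which returns the out-links of a page, is a stand-in for the Downloader and URL extractor and is an assumption of the example.

```python
from collections import deque

def breadth_first_crawl(seeds, get_links, max_depth=5):
    """Visit pages level by level, following links at most max_depth levels deep."""
    visited = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not follow links below the limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order

# Toy link graph standing in for the Web.
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
crawl_order = breadth_first_crawl(["a"], lambda u: TOY_WEB.get(u, []), max_depth=2)
```

With `max_depth=2` the toy crawl visits a, b, c, d and never reaches e, which sits three links away from the seed.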
2 Larger Site First

a)--
b)
c)

Winner-Take-All
Pending
d) TOP N
3 Skeleton Links
a)--
b)
c) Yida Wang, SIGIR08, Exploring Traversal Strategy for Web Forum Crawling
http://research.microsoft.com/pubs/131117/forumcrawl_sigir08.pdf
Unique
d) Random Sampling, Sitemap Construction

Traversal Strategy Exploring
Double-ended Queue 1000

Pruning

4 Poisson Process
a)--
b)
c)
--
http://d.wanfangdata.com.cn/Periodical_xdjsj-xby200912018.aspx
10%
Index
LinkContent

http://www.jos.org.cn/ch/reader/view_abstract.aspx?file_no=20060513
d)Index/Link/Content F(Index)/F(Link)/F(Content)
X 5*X/t
5 Backlink Count
a)
b)/
c) Hyperlink
Backlink/Inlink

Baseline

d)
Link Backlink Count

6 Batch Pagerank
a)
b)/
c) Pagerank
Pagerank
d) Pagerank K Pagerank
Pagerank
7 Partial Pagerank
a)
b)/
c) Pagerank Pagerank Pagerank
Pagerank Pagerank Pagerank Pagerank
d) Pagerank K Pagerank
Pagerank Pagerank
8 Online Page Importance Computation

a)
b)/
c)Pagerank
Pagerank
online OPIC Pagerank

d)Cash
Pending
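The Cash idea behind OPIC can be sketched as follows. This is a simplified illustration, not the published algorithm in full: every page starts with equal cash, the page holding the most cash is fetched next, and its cash is split evenly among its out-links. Treating a page without out-links as linking to every page is an assumption of this sketch.

```python
def opic_order(graph, steps):
    """Simulate `steps` fetches; graph maps each page to its list of out-links."""
    cash = {page: 1.0 / len(graph) for page in graph}
    history = {page: 0.0 for page in graph}      # cash collected so far
    order = []
    for _ in range(steps):
        page = max(cash, key=cash.get)           # richest pending page first
        order.append(page)
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0
        targets = graph[page] or list(graph)     # dangling page: spread everywhere
        for target in targets:
            cash[target] += amount / len(targets)
    return order, cash, history

# Toy site: every page links back to "home".
TOY_SITE = {"home": ["a", "b"], "a": ["home"], "b": ["home", "a"]}
order, cash, history = opic_order(TOY_SITE, steps=6)
```

The total cash in the system stays constant, and the accumulated `history` of a page serves as a running estimate of its importance without a separate batch Pagerank computation.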

9 User Centric Crawling

a)
b)
c)Pandey and Olston (user-centric)
User-Centric
Web Crawling
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.6287&rep=rep1&type=pdf
()
d) Phrase/Multi-Words Query
Single Word Query
p q r1 r2 r1,r2 Q(p)
p Q(p) Q(p) P(p,t)=Q(p)*(t-LR(p,t)) LR(p,t)
5. Spider
Maximum Stress Control, IP Multi-IP Stress Balance,
Domain Uniformization, Spider-Trap Handle, Wildcard Subdomain/Infinite Domain Name Generators,
Distributed Crawling Architecture, HTML Parser,
ill-formed HTML Restore, JavaScript URL, URL Normalize, URI Hash, Redirect Control,
Login-protected Page Fetch, Incremental Crawler,
Coverage, Freshness, Page Importance,
Dead Link Check, Spam Control, Link Farm Penalize

Information Retrieval
Pre-process

Universal/Opening DistributionClean
Pros and Cons

Figure 4.1 (labels): HTML parser, Site/Domain, Anchors, Link-relation, Title, Content, Content extractor, Web Page Cleaning, Quality Selection, De-duplicate, Anti-Spam, Pre-process, Segmentation, Dictionary, page analyze, Specials, Indexer
1. Quality Selection
Page
Importance

GeneralQuality Selection

/

Pros and Cons
4.1.1
Sites Dictionary
2-4
<20%
2-5
>80%
Site Map
Site Topics
3-5 90%
2-4 2-5
20% 80% 3 N
1N N>4
1/5=20% 4/5=80%

Google Baidu
Site Evaluation
Website

1)/Credibility/ Authority

www.whitehouse.gov www.whitehouse.org www.whitehouse.net

Types gov
net
Google whitehouse www.whitehouse.org
whitehouse.georgewbush.org

4.1.2
.gov/.edu /.mil
.org
com
.net/.cn

LuceneHadoop Doug Cutting
http://cutting.wordpress.com/

Semantic Web
Email
2)Reputation
TrafficIndex
Alexa 100
Navigation

3)Audience

4)Completeness

5)Access/Workability

6)Accuracy

7)Currency

8)Uniqueness

9)/Facticity/Objectivity
Encyclopaedia
Wikipedia Ask
Wikipedia Google
10)(Quality of writing)
Typographical errors/spelling mistakes
11)/Browsability and layout

Navigability

12)Multimedia

Google
Sign of Zodiac

4.1.3
1
10
11
12
/Link/Anchor Content
Canon

(Saint)
Authority and Hub
Link
Anchor

Relevant Linkage Principle [Kleinberg 1997]
Link_A Link_B Link_A Link_B
Topical Unity Principle [Kessler 1963, Small 1973]
Link_C Link_A Link_B Link_A Link_B
Lexical Affinity Principle [Maarek et al. 1991]
Link_A Link_B URL Link_A
Link_B Anchor
Figure 4.1.1 (labels): Relevant Linkage Principle (Link_A, Link_B); Topical Unity Principle (Link_A, Link_C, Link_B); Lexical Affinity Principle (Link_A, Link_B, HTML)

Page Clean Site Templates

Pagelets Analysis
HTML DOM TREE HTML
(ordered linear space) two-dimensional space
CSS Visual Tree
DOM Tree

DOM Tree: Web Page Cleaning for Web Mining through Feature Weighting
Visual Tree: Entropy-Based Visual Tree Evaluation on Block Extraction
Site Templates: Joint Optimization of Wrapper Generation and Template Detection
Site Templates: Site-Independent Template-Block Detection
Site Templates: Page-level Template Detection via Isotonic Smoothing
Pagelets: Improving Hypertext Data using Pagelets and Templates

Page Templates
Page-level Template Detection via Isotonic Smoothing
Visual Tree
DOM Tree
CSS
HTML
4.1.2
http://news.sina.com.cn/c/2010-07-29/012020778393.shtml
Site Templates
http://news.sina.com.cn/c/2010-07-29/163620785082.shtml
4.1.3 Site Templates

Pagelets Analysis
4.1.4 Pagelets Analysis

1 PageRank Hilltop
2 HITS SALSA
3 Entropy Analysis
Ranking Link Analysis Ranking
PageRank
The Anatomy of a Large-Scale Hypertextual Web Search Engine
4.1.5 PageRank
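PageRank can be sketched as a short power iteration. The version below uses the common normalized form in which ranks sum to 1. The damping value 0.85, the iteration count and the treatment of pages without out-links (spread over all pages) are assumptions of this example rather than details taken from the text.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            targets = outlinks or list(graph)    # dangling page: link to all
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

TOY_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(TOY_GRAPH)
```

In the toy graph, page c receives links from both a and b and therefore ends up with a higher rank than b, which only receives half of a's vote.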
Hilltop
When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics
Krishna Bharat, George Andrei Mihaila
4.1.6 Hilltop
HITS
Hyperlink-Induced Topic Search --Authoritative Sources in a Hyperlinked Environment
4.1.7 HITS
SALSA
The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect
4.1.8 SALSA
Entropy Analysis
Entropy-Based Link Analysis for Mining Web Informative Structures
Mining Web Informative Structures and Contents Based on Entropy Analysis
4.1.9 Entropy
Text Content

Pros and Cons

Noise
HTML Semi-structured
1Unstructured (Plain Text)

2More structured
Table
List
3Fixed Structured
Multimedia data
Document

Fixed Structured
Multimedia data
...
More structured
Table
List
...
Unstructured
(Plain Text)
...
HTML
4.1.10
Wikipedia 14 Topic
10
1 2 3
45P2P 6Email
78910
Universal Search: Google, Baidu, Yahoo, Bing
Vertical Search: kooxoo, gougou, qihoo

4.1.2
Blog
News
Image
Video
Forum
(P2P)
Wap
2. De-duplicate
Duplicate/Near-Duplicate Detection
Copy Detection / Plagiarism Detection / Duplicate Detection),
1976: Ottenstein, Attribute Counting for Copy Detection.
1993: Udi Manber (University of Arizona), sif (Finding Similar Files in a Large File System), based on Approximate Fingerprints.

Manber
http://manber.com/
Manber worked at Amazon and Yahoo before becoming a Google VP. He is also the author of Introduction to Algorithms: A Creative Approach.
4.2.1 Google VP, Udi Manber
1995: Sergey Brin and Garcia-Molina (Stanford), COPS (copy protection system).
Garcia-Molina and Shivakumar: SCAM (Stanford copy analysis method).
SCAM uses the Vector Space Model.
Garcia-Molina Google
Sergey Brin, Larry Page

2004
CSDN
http://blog.csdn.net/malefactor/
http://blog.csdn.net/malefactor/archive/2006/06/09/782882.aspx
Google, Gurmeet Singh Manku: Detecting Near Duplicates for Web Crawling (SimHash)
http://infolab.stanford.edu/~manku/papers/07www-duplicates.ppt
Yahoo, P Govindarajulu: Duplicate and Near Duplicate Documents Detection: A Review
http://www.eurojournals.com/ejsr_32_4_08.pdf
MIT, Shreyes Seshasai, 09: Efficient Near Duplicate Document Detection for Specialized Corpora
http://via.mit.edu/documents/Seshasai.pdf

http://sewm.pku.edu.cn/TianwangLiterature/PhdDissertation/%5BHuang,2008%5D/hle_thesis.pdf
(Process Introduction)
1,

2,
20% 30%
Precision>90% && Recall > 90% 85%
3,
1
exact duplicates:mirroringplagiarism
near duplicates: Advertisements
Template Frames
timestamps
2
post process:

inline process:
url
4
1. LOGO, Page Clean/Noise Reduction, Abstract Extraction
2. /FingerPrint
3. Resemblance 2 Distance
Fingerprint online

4. Cluster IterativeGraph
Union Find
5. Delegates
Hashing=>Signatures=>Fingerprint; Vector=>Cosine=>Distance=>Resemblance;
Figure 4.2.2 De-duplicate (labels): HTML, Page Clean, Segmentation, Dictionary, FingerPrint, Resemblance, Cluster, Delegates
Link-duplicate
outlinksPath Hash
Hash
SEOSPAM
proper
subgraph

Content-duplicate
Fingerprint

FingerPrint Milestones
1, CheckSum: MD5, SHA and CRC checks
2, Longest Common Subsequence
3, Shingling, Broder 1997: Jaccard index of tokens
4, SimHash, Charikar 2002 / Gurmeet Singh Manku, WWW 2007
5, I-Match, Chowdhury 2002: IDF tokens; tokens, Digest
6, SpotSigs, Martin Theobald 2008: Common Words, Spot Words, Jaccard
7, Bloom Filter, Bloom 1970 / Chazelle 2004: K Hash functions map each element x to K positions m, 0 <= m <= M-1, and each of those positions is set to 1
8, Chunk, HP Labs 2005/2009: Window, Chunk
1. CheckSum
URL
URL
2. Longest Common Subsequence

LCS
http://sewm.pku.edu.cn/TianwangLiterature/Report/NCIS_TR_2007012.pdf
CharacterPhrase
Statement
1)Top Keyword Feature

2) Special Statement[ ]
3) Query Phrase[
]
4)
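LCS by dynamic programming takes O(N*N) time; the Myers algorithm takes O(ND).
N: Length(A) + Length(B)
D: the edit distance (Levenshtein Distance) between Sentence A and Sentence B
Finding the LCS of Sentence A and Sentence B is equivalent to finding the SES (Shortest Edit Script), which is a shortest path in the Edit Graph.
Myers: An O(ND) Difference Algorithm and Its Variations
Tanimoto: R(A,B) = |LCS| / (|A| + |B| - |LCS|)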
LCS
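The LCS-based resemblance R(A,B) = |LCS| / (|A| + |B| - |LCS|) can be tried out with the plain O(N*N) dynamic program. This sketch deliberately does not implement the faster Myers O(ND) algorithm, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_resemblance(a, b):
    """R(A,B) = |LCS| / (|A| + |B| - |LCS|), defined as 1.0 for two empty inputs."""
    common = lcs_length(a, b)
    denominator = len(a) + len(b) - common
    return common / denominator if denominator else 1.0
```

For the term sequences 1 4 1 6 7 and 1 3 4 6 the LCS is 1 4 6, so the resemblance is 3 / (5 + 4 - 3) = 0.5.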

3. Shingling
Broder 1997: Jaccard index of tokens
Syntactic Clustering of the Web
www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf
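A minimal sketch of Shingling: each document becomes the set of its k-token shingles, and resemblance is the Jaccard index of the two sets. Whitespace tokenization and the default k = 3 are assumptions of the example. The production variant described by Broder additionally samples the shingle set, which is omitted here.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive lower-cased tokens in text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def resemblance(text_a, text_b, k=3):
    """Jaccard index of the two shingle sets."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, "the quick brown fox jumps" and "the quick brown fox sleeps" share 2 of 4 distinct 3-shingles, giving a resemblance of 0.5.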
4. SimHash
Google --Detecting Near-Duplicates for Web Crawling
http://infolab.stanford.edu/~manku/papers/07www-duplicates.ppt
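The SimHash idea can be sketched in a few lines: every token votes on each bit of a 64-bit fingerprint, and near-duplicate documents end up with fingerprints that differ in only a few bits. This sketch gives every token the same weight and derives token hashes from MD5. Both choices are assumptions of the example, not details from the paper.

```python
import hashlib

BITS = 64

def simhash(tokens):
    """64-bit SimHash fingerprint of a token sequence (all weights equal to 1)."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:                 # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages are then treated as near-duplicates when the Hamming distance of their fingerprints is below a small threshold. The threshold itself has to be tuned on real data.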
5. I Match
Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization
www.ir.iit.edu/~abdur/publications/470-kolcz.pdf
6. Spotsig
SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
http://ilpubs.stanford.edu:8090/831/1/2008-14.pdf
7. BloomFilter
Using Bloom Filters to Refine Web Search Results
www.cs.utexas.edu/users/dahlin/papers/webdb-167.pdf
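A Bloom Filter can be sketched as a bit array plus K hash functions. In this illustration the K functions are derived by salting SHA-256 with the function index, and the sizes are arbitrary; both are assumptions of the example.

```python
import hashlib

class BloomFilter:
    """Set membership with no false negatives and a small false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per hash function, each in [0, num_bits).
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

url_filter = BloomFilter()
url_filter.add("http://example.com/")
```

A Spider can use such a filter as a compact "already seen" test for URLs or fingerprints. An item that was added is always reported as present, while an item that was never added is occasionally reported as present too.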
8. Chunk
A Framework for Analyzing and Improving Content-Based Chunking Algorithms
http://www.hpl.hp.com/techreports/2005/HPL-2005-30R1.pdf
Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
www.hpl.hp.com/personal/Mark_Lillibridge/Extreme/final.pdf
Resemblance Distance
Euclidean,
Manhattan,Chebyshev, Jaccard, Cosine
,Correlation Coefficient
6
1. Cosine Similarity
2. Jaccard Index
3. Tanimoto Index
4. Pearson Correlation Coefficient
5. SimRank
6. Levenshtein distance
1. Cosine Similarity.
TF-IDF
TF-IDF Vector Space Model

When it comes to Cosine Similarity, the TF-IDF weighting model is always the first choice. And once the TF-IDF weighting model is mentioned, the Vector Space Model (VSM) is a must.

Word Segmentation

Dictionary 1. 2. 3. 4. 5. 6 7
Stop
Words 1. 2.
Document A
Document B
Document C
Stop Words
1000
1.
**
2.
3.
4.
*
*
5.
6
*
*
*
*
Document A 1 4 1 6 7
Document B 1 3 4 6
Document C 2 7
Document A to Document C as term-count vectors:
Term       1 2 3 4 5 6 7
Document A 2 0 0 1 0 1 1
Document B 1 0 1 1 0 1 0
Document C 0 1 0 0 0 0 1

1983 Salton McGill TF-IDF Term Frequency-Inverse Document Frequency,

Term Frequency: Document A is 2 0 0 1 0 1 1, so term 1 has Term Frequency 2.
Inverse Document Frequency: the inverse of the Document Frequency.
Term               1 2 3 4 5 6 7
Document Frequency 2 1 1 2 0 2 1
TF-IDF: for term 1 in Document A, the unnormalized TF-IDF is 2/2 = 1.
1 TF, Term Frequency. For term i in document j, with Count(i,j) the number of occurrences of term i in document j:
$$TF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)}$$
For term 1 in Document A: TF(1,A) = Count(1,A) / \sum_{i,j} Count(i,j) = 2 / 11
2 IDF, Inverse Document Frequency, the inverse of the Document Frequency. With Number(i) the number of documents that contain term i:
$$DF(i) = \frac{Number(i)}{\sum_i Number(i)}, \qquad IDF(i) = \frac{\sum_i Number(i)}{Number(i)}$$
For term 1, IDF(1) = 9 / 3 = 3;
3 TF-IDF:
$$TF\text{-}IDF_{i,j} = TF_{i,j} \cdot IDF(i) = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \frac{\sum_i Number(i)}{Number(i)}$$
Taking the log of IDF(i):
$$TF\text{-}IDF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \log \frac{\sum_i Number(i)}{Number(i)}$$
With 1 added to Number(i), so that a term with Number(i) = 0 does not cause a division by zero:
$$TF\text{-}IDF_{i,j} = \frac{Count(i,j)}{\sum_{i,j} Count(i,j)} \cdot \log \frac{\sum_i Number(i)}{1 + Number(i)}$$
For term 1 in Document A: TF-IDF_{1,A} = (2 / 11) * log(9 / (3 + 1)) = 0.06

TF-IDFi,j Mutual Information
KL Kullback-Leibler Divergence
TF-IDF Document A Document B Cosine Similarity
TF-IDF (1, *), (2, *), (3, *), (4, *), (5, *), (6, *), (7, *)
Document A 0.06
Document B ?
Document C
Document A Document B [0~1]
$$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{A \cdot B}{\sqrt{A \cdot A} \, \sqrt{B \cdot B}}$$
Google
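To make the TF-IDF plus Cosine Similarity pipeline concrete, here is a small sketch on the three example documents. Note that it uses the common textbook weighting (term count divided by document length, times the natural log of N over the document frequency), which is an assumption of this example and differs from the normalization used in the formulas above, so the numbers are not meant to match the 0.06 in the worked example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a name to a list of term ids; returns sparse TF-IDF vectors."""
    n_docs = len(docs)
    df = Counter(term for doc in docs.values() for term in set(doc))
    vectors = {}
    for name, doc in docs.items():
        counts = Counter(doc)
        vectors[name] = {term: (count / len(doc)) * math.log(n_docs / df[term])
                         for term, count in counts.items()}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

DOCS = {"A": [1, 4, 1, 6, 7], "B": [1, 3, 4, 6], "C": [2, 7]}
VECTORS = tfidf_vectors(DOCS)
```

Under this weighting Document A is closer to Document B, with which it shares terms 1, 4 and 6, than to Document C, with which it shares only term 7. Documents B and C share no term and have cosine 0.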
2. Jaccard Index
Cosine Similarity Jaccard Index
Jaccard Coefficient A, B A, B A,B
$$\text{Jaccard Coefficient} = \frac{|A \cap B|}{|A \cup B|}$$
$$\text{Jaccard Distance} = 1 - \text{Jaccard Coefficient} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
Document A 1 4 1 6 7
Document B 1 3 4 6
Jaccard Distance = 1 - |{1, 4, 6}| / |{1, 3, 4, 6, 7}| = 1 - 3 / 5 = 0.4
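3. Tanimoto Index
The Tanimoto Index builds on 1. Cosine Similarity and 2. Jaccard Index. For vectors:
$$T(A, B) = \frac{A \cdot B}{A \cdot A + B \cdot B - A \cdot B}$$
For sets:
$$T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
Dice Coefficient, for vectors and for sets:
$$D(A, B) = \frac{2 \, A \cdot B}{A \cdot A + B \cdot B}$$
$$D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$$
$$D = \frac{2J}{1 + J} \quad \text{and} \quad J = \frac{D}{2 - D}$$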
4. Pearson Correlation Coefficient

$$\rho(A, B) = \frac{\mathrm{cov}(A, B)}{\sigma_A \sigma_B} = \frac{E[(A - \mu_A)(B - \mu_B)]}{\sigma_A \sigma_B}$$
E: expectation; cov: covariance
5. SimRank

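$$R_0(a, b) = \begin{cases} 1 & a = b \\ 0 & a \neq b \end{cases}$$
$$R_{k+1}(a, b) = \frac{C}{|I(a)| \, |I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b))$$
I(x): the set of in-neighbors of x; C: a decay constant between 0 and 1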
6. Levenshtein distance
Levenshtein distance is also called Edit distance.
Document A 1 4 1 6 7
Document B 1 3 4 6
Levenshtein distance = 3
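The value 3 can be checked with the classic dynamic program for edit distance. The sketch assumes the distance is taken between the term sequences of Document A (1 4 1 6 7) and Document B (1 3 4 6), with insertions, deletions and substitutions each costing 1.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or keep if equal
        prev = cur
    return prev[-1]

distance_ab = levenshtein([1, 4, 1, 6, 7], [1, 3, 4, 6])
```

One shortest script is: insert 3 after the first 1, delete the second 1, delete the trailing 7.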
3. Anti-spam
Spam Spam
WEB2.0
Spam
anti-spam
http://blog.csdn.net/malefactor/archive/2006/05/30/762895.aspx
Stanford University Zoltan Gyongyi Hector Garcia-Molina
http://infolab.stanford.edu/~zoltan/
http://infolab.stanford.edu/people/hector.html
Web Spam Taxonomy Spam
Boosting, Hiding
Term spamming: repetition, dumping, weaving, stitching
Link spamming: honey pot, directory, posting, exchange, farm, dir. clone
Figure: Web Spam Taxonomy, Boosting
Figure: Web Spam Taxonomy, Hiding
Link-spam
Spam cheat
outlinks
Hub pages
inlinks
Spam farm
In-link exchange
/Links posting / Splogging
Web directory
DNS DNS cloaking
Expired domains purchase
Honey pot
Anti-Spam
1, Spam HITS
2, Spam PageRank
3, TrustRank (VLDB2004)
4, BadRank (WWW2005)
5, SpamRank (WWW2005, workshop)
6, ParentRank (WWW2005)
1. Spam HITS
Improvements of HITS algorithms for spam links
2. Spam PageRank
Microsoft --Robust PageRank and Locally Computable Spam Detection Features
3. TrustRank
Yahoo -- Combating Web Spam with TrustRank
Propagating Trust and Distrust to Demote Web Spam
4. BadRank
Google -- PR0 -Google's PageRank 0 penalty.
Generalized BadRank with Graduated Trust
5. SpamRank
SpamRank Fully Automatic Link Spam Detection
6. ParentRank
Identifying link farm spam pages
Content-Spam

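Zipf's law, Heaps' law

1. Zipf's law

Zipf's law (G. K. Zipf, 1935): the frequency of a word is inversely proportional to its rank, 1/f. The second most frequent word occurs 1/2 as often as the most frequent one, the third 1/3 as often, and the n-th 1/n as often.
4.3.1 G. K. Zipf
Simon Newcomb: in base b, the probability that the leading digit of a number is n is
$$P(n) = \log_b(n + 1) - \log_b(n)$$
In base 10 the leading digit is 1 about 30.1% of the time, 2 about 17.6%, 3 about 12.5%, and 9 only 4.6%.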
4.3.2 Simon Newcomb
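The leading-digit percentages follow directly from log_b(n + 1) - log_b(n). The helper below, whose name is illustrative, reproduces them for base 10.

```python
import math

def leading_digit_probability(digit, base=10):
    """Probability that a number has the given leading digit."""
    return math.log(digit + 1, base) - math.log(digit, base)

distribution = {d: leading_digit_probability(d) for d in range(1, 10)}
```

A content-spam check could compare the leading-digit distribution of the numbers on a page against this expected distribution. That usage is only suggested here as an illustration.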

Spam
Spam
2. Heaps' law

Heaps' law: $V_R(n) = K n^{\beta}$, with K in [10, 100] and \beta in [0.4, 0.6]
Spam
3.
Title
AnchorMeta Spam
5% 30%
4.3.2

Microsoft -- Detecting Spam Web Pages through Content Analysis

http://research.microsoft.com/apps/pubs/default.aspx?id=65140
A successful search engine requires more bandwidth to upload query result pages than its
crawler needs to download pages
http://www.seo.com.cn/seopdf/.pdf
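Precision
Recall
F1:
Precision = relevant documents retrieved / documents retrieved
Recall = relevant documents retrieved / relevant documents
F1 = 2 * Precision * Recall / (Precision + Recall)
Ranking metrics:
1. Mean Average Precision (MAP): for each query, average the precision measured at every rank where a relevant document appears; MAP is the mean of these averages over all queries.
2. R-Precision: the precision after R results, where R is the number of documents relevant to the query.
3. P@10: the precision over the first 10 results.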
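These evaluation metrics are easy to compute once a ranked result list and a set of relevant documents are available. The sketch below uses hypothetical document ids and illustrative function names. MAP is the mean of `average_precision` over all queries.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based Precision, Recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k results that are relevant (P@k)."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, 1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

RANKED = ["d1", "d2", "d3", "d4"]      # hypothetical result list
RELEVANT = {"d1", "d3", "d9"}          # hypothetical relevance judgments
```

For this toy query, Precision is 2/4, Recall is 2/3, and the average precision is (1/1 + 2/3) / 3.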
--
http://www.sales2marketing2.com/PracticalInternetMarketing_VincentCheng.ppt
Online Marketing Channels are:
1. Search Engine Optimization and Marketing (Google, Yahoo, MSN)
2. Social Network Sites (MySpace, Friendster, YouTube)
3. Social Bookmarking (Del.icio.us, BlogMarks)
4. Email Marketing
5. Viral Marketing
6. Online Press Release
7. Blogs
8. Link Building: Reciprocal and One Way Paid Links
9. Affiliate Marketing
10. E-Zine / Articles Marketing
11. Paid by Impression/Click Banner Advertising
12. Vertical Search Engines
13. Relevant / Vertical Directories
14. Video Advertising
15. Relevant / Vertical Forums
16. RSS (Really Simple Syndication)
17. PodCasting / Video Casting
18. Chat and Messengers
19. EBay / Auction Sites
20. B2B Business Networks
21. Mobile Marketing
,,
, ,
WWW

5.1. ""

5.2.
a)

b)
c)
d) e)
5.3.

5.4.

Google Google Answer AnswerBot
"how can kill virus of computer?"

"virus"
"how can kill virus of computer?"
5.5.

FTPFlash
5.6.
5.7.

GoogleYahoo
""
XML
