

Vol.37 No.3 Computer Engineering February 2011

Article ID: 1000-3428(2011)03-0064-03  Document code: A  CLC number: TP18


Webpage Content Extraction Based on DBSCAN


OUYANG Jia, LIN Pi-yuan
(College of Informatics, South China Agricultural University, Guangzhou 510642, China)

Abstract: To address the problem of webpage content extraction, this paper presents a section-factor-based method that filters a webpage to obtain its plain-text paragraphs. Each paragraph is treated as a point in two-dimensional space, and the DBSCAN clustering algorithm clusters these points to identify the real content. The method has low complexity, does not depend on the layout style of the site, and adapts well to different sites. Experiments on Chinese and international news websites show that it achieves high average accuracy on both Chinese and English news sites.
Key words: topic-focused crawler; content extraction; DBSCAN; density
DOI: 10.3969/j.issn.1000-3428.2011.03.023

1 /HTML

Web
Web
Web
[7]
DBSCAN
HTML
Web HTML HTML
Web [1]
DBSCAN

2
Wrapper
[1] DBSCAN
[2]

DBSCAN 2
[3]
1
VIPS(Vision-based Page Segmentation)[4][5] DOM

[6]
1 DBSCAN
HTML DOM
(60573043)
[7]/HTML (1986)

/HTML 2010-05-13 E-mail: 13824498818@139.com

3 DBSCAN <p>
DBSCAN </p>

HTML
[8] HTML
1()
Eps
2(Eps ) Eps 4((TagNum))
Eps- TagNum
3() Eps-
MinPts
<a href="#" target="_blank"></a>
<p></p>
DBSCAN 2 TagNum=1 TagNum=2
2
1
2
TagNum=3
1
1
2 DBSCAN
4 1
4
4.2
DBSCAN HTML
HTML
2 HTML Point = (i, Ci ) i
0 i N ci i N
<></> DBSCAN
<></> 2
3
$$\mathrm{Dis}(x, y) = \lVert x - y \rVert = \sqrt{\sum_{k=1}^{t} (x_k - y_k)^2} \tag{1}$$
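As an illustration of Eq. (1) in the two-dimensional case (t=2), the following Python sketch computes the Euclidean distance between two paragraph points. The function name `dis` and the sample points are assumptions chosen for illustration; they do not come from the paper.

```python
import math


def dis(x, y):
    """Euclidean distance between two points of equal dimension t, as in Eq. (1)."""
    if len(x) != len(y):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


# Hypothetical paragraph points of the form (paragraph index, character count).
p = (3, 120)
q = (6, 124)
print(dis(p, q))  # 5.0
```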

(1)HTML (TAG) t=2


<><=></> 3
<p></p>
(2)(1) (1)
(3)(Script)
function doPostBack(eventTarget, eventArgument){}
(2) 1
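The filtering idea, removing tags and script content so that only plain-text paragraphs remain, can be sketched with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions, not the paper's section-factor filter: the class name `ParagraphExtractor`, the set of block-level tags, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect plain-text paragraphs, skipping <script> and <style> content."""

    SKIP = {"script", "style"}
    # Assumed set of tags that end a paragraph; the paper's own rules may differ.
    BLOCK = {"p", "div", "br", "li", "td", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty paragraphs.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


html = (
    '<div><a href="#" target="_blank">Home</a></div>'
    "<script>function doPostBack(eventTarget, eventArgument){}</script>"
    "<p>First paragraph of the story.</p><p>Second one.</p>"
)
parser = ParagraphExtractor()
parser.feed(html)
parser.close()

# Each paragraph becomes a point (paragraph index, character count).
points = [(i, len(text)) for i, text in enumerate(parser.paragraphs)]
print(points)  # [(0, 4), (1, 29), (2, 11)]
```

The resulting list of points has the (i, c_i) shape that the clustering step expects.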

4.1 1 ~ 54 1 ~5
HTML

[Figure 3: plot; horizontal axis 0 to 324, vertical axis 0 to 250]

5 DBSCAN (3)
4 3 DBSCAN 6 (2)

6
()()
Eps

DBSCAN Eps MinPts
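To make the roles of Eps and MinPts concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure applied to paragraph points. It is an illustration under assumptions, not the authors' implementation: the sample points are hypothetical, a point is counted as a member of its own Eps-neighborhood (the usual convention), and the values Eps=10 and MinPts=3 are taken from the setting reported for Table 1.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], the point itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical (paragraph index, character count) points.
points = [(0, 12), (1, 15), (2, 180), (3, 184), (4, 178), (5, 186), (6, 14), (7, 90)]
print(dbscan(points, eps=10, min_pts=3))  # [0, 0, 1, 1, 1, 1, 0, -1]
```

In this toy input the short paragraphs form one cluster, the long adjacent paragraphs form another, and the isolated paragraph is labeled as noise. How the content cluster is then selected is not shown here.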


5.1

1 (DataSet)

6 3
2
3 (4)
3

6
4

5.2 http://www.163.com; http://www.qq.com; http://www.sina.com.cn; http://www.people.com.cn; http://www.sohu.com
3
4 1
()()
$$\text{Accuracy} = \frac{\text{Correct}}{\text{Total}} \tag{2}$$

1 2
Table 1 (Eps=10, MinPts=3)
Website                    Total  Correct  Incorrect  Accuracy/(%)
http://www.163.com           120      115          5          95.8
http://www.qq.com             82       79          3          96.3
http://www.sina.com.cn        93       87          6          93.5
http://www.people.com.cn     104       98          6          94.2
http://www.sohu.com          115      110          5          95.7
Total                        514      489         25          95.1
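The accuracy figures can be checked with a short Python sketch. It reads Eq. (2) as the ratio of correctly extracted items to the total, which is consistent with the numbers in Table 1. The function name `accuracy` and the tuple layout are assumptions for illustration; the values are copied from Table 1.

```python
# (site, total, correct) rows copied from Table 1 (Eps=10, MinPts=3).
TABLE_1 = [
    ("http://www.163.com", 120, 115),
    ("http://www.qq.com", 82, 79),
    ("http://www.sina.com.cn", 93, 87),
    ("http://www.people.com.cn", 104, 98),
    ("http://www.sohu.com", 115, 110),
]


def accuracy(correct, total):
    """Accuracy in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


for site, total, correct in TABLE_1:
    print(site, accuracy(correct, total))

all_total = sum(total for _, total, _ in TABLE_1)
all_correct = sum(correct for _, _, correct in TABLE_1)
print("overall", accuracy(all_correct, all_total))  # overall 95.1
```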

Table 2 (Eps=20, MinPts=1)
Website                    Total  Correct  Incorrect  Accuracy/(%)
http://www.163.com           120      113          7          94.2
http://www.qq.com             82       77          5          93.9
http://www.sina.com.cn        93       85          8          91.4
http://www.people.com.cn     104       94         10          90.4
http://www.sohu.com          115      108          7          93.9
Total                        514      477         37          92.8
(1)
DBSCAN 2 Eps

MinPtsEps
4
Eps

Eps
(2)
[10, 15]MinPts [3, 6]

Eps
5
MinPts Eps MinPts


4 http://www.buzzle.com/; http://www.nytimes.com/; http://www.buzzle.com 3
Table 3 (Eps=10, MinPts=3)
Website                    Total  Correct  Incorrect  Accuracy/(%)
http://www.cnn.com            85       83          2          97.6
http://www.nytimes.com/       75       70          5          93.3
http://www.buzzle.com         90       88          2          97.8
Total                        240      230         10          95.8

5 2 (continued on page 69)
