Professional Documents
Culture Documents
and
and
(1)
For simplicity
will be denoted as I
l
. Here, one should note that num-
ber of matches discovered at level l also contains the matches found in finer level l+1.
Yet, the new found matches can be calculated by getting the difference of I
l
I
l+1
for
l=0, 1, .., L-1. One fundamental principle of pyramid matching schema is penaliz-
ing the importance of matches in higher coarser levels by providing a weight factor to
each intersection level l. Thus, for each level l the weight is calculated as
that is
inversely proportional to cell width at that level [14]. Combining all of these, we ob-
tain the pyramid matching kernel as follows:
(2)
In this study we employ orthogonal approach as used in Lazebnik et al.[14]s study
which means performing pyramid matching kernel in 2D space. In details, quantiza-
tion stage (labeling all HTML leaf nodes in five categories) produces D discrete
channels with the assumption that the items of one channel can be matched only with
items belonging to same channel. Each channel c outputs two sets of vectors
and
which contains coordinates of type c found in feature space. Finally, we obtain the
kernel by summing up all individual channel kernels as can be seen below in Eq.3
[14]. As stated in [14], this kernel has the property of maintaining the continuity to
visual vocabulary paradigm. A more detailed explanation of pyramid matching kernel
can be found in [25]. Besides, comprehensive information about spatial pyramid ker-
nel is given in [14].
(3)
In our study we employ 3 levels pyramidal histogram intersection which results
1+4+16+64 = 85 cells which overlay a grid of 1024*1024 pixels. The regions below
the top most 1024 pixels are discarded. By using 5 different channels it results a 425
dimensional vector. This vector actually represents the visual layout signature of web
page. Finally the comparison between any two pages S
x
and S
y
is done by histogram
intersection of these two pages. More similar pages yield higher histogram intersec-
tion values. This value constitutes the main similarity metric of visual similarity.
4 Application
SimiLay contains three main parts: (1) a wrapper, (2) web based search engine inter-
face and (3) a distributer. All parts are implemented on C#.NET. Wrapper is an appli-
cation which loads the web page, extracts the DOM tree, classifies the tags and maps
HTML nodes on grid. It also takes the snapshots of pages with four seconds interval.
Finally, the signatures of the web pages are generated and written to central corpus
database. Mozilla GeckoFx [26] has been employed for HTML rendering and DOM
tree extraction.
A leaf tag or group of leaf tags may be child of the same ancestor type. For this
reason we developed a tree pruning function which scans all leaf nodes and their
ancestors and prunes the child leaf node(s) to capture bigger region in the name of
better mapping. In the lack of this function, it is indispensible sometimes possible to
incorrectly validate a textual region as blank region. On an Intel Core i7 1,73Ghz 4
Core, 4 GB memory computer, it is determined that creating the signature of a page
takes 25 seconds in average including motion capturing (page loading is excluded due
to variable network conditions and page size). Similay search engine (see Fig. 3) is
implemented on ASP.NET [27] platform. With ones query page input (Url of the
page), it retrieves the similar page results and rank them according to similarity score.
Fig. 3. Search engine module
For efficient computation and utilizing the cores of CPU, we implemented a wrap-
per distributor. The main function of distributor is reading the page Urls and distribut-
ing them to newly run wrappers. Degree of parallelization can be set on distributor
user interface. The system schema and work flow of SimiLay is depicted on Fig. 4.
Fig. 4. SimiLay system schema
5 Evaluation and Discussion
According to our best knowledge, there exist no standard well-known dataset for web
page visual similarity measurement and as can be seen in the literature of anti-
phishing, several studies utilize the web site phishtank.com for gathering daily phish
cases. In phishtank, the legitimate and fake web pages are listed. However, for mea-
suring our approachs success, phishtanks lists are not adequate. As phishing pages
share same color and page layout in almost perfect condition and we are primarily
searching the layout similarity among different web pages, we concluded to build our
test dataset. For this reason, we randomly picked up 100 pages. During the process of
selection, apart from the other pages, we chose four different layout groups totally
containing 26 pages. The in-group pages are validated and marked as visually similar
by human users. These 26 pages build our ground truth dataset. The remaining pages
are neither so different nor so similar to our ground truth web pages.
To evaluate the success of retrieval results, we first queried each page against to
ground-truth dataset and whole corpus at the second stage. Our expectation was al-
ways to retrieve the in-group sets on the top of the retrieval. For the performance
measure, we applied Average Normalized Rank (ANR) [15]. As the expected pages
are listed on top of retrieval result, ANR value tends to 0. For instance, if ANR of a
query result is 0.8, this can be interpreted as relevant pages of this query is in the top
of 8%. A perfect ANR result is always zero. A detailed explanation about ANR can
be found in [15].
The retrieval results are listed in Table 1. First column lists the web page URLs.
While second column presents the retrieval order of the in-group pages by consider-
ing only ground truth data, third column gives the calculated ANR for this page
query. Similarly, fourth and fifth columns are providing same information by consi-
dering whole corpus (100 pages).
According to results, 46% of the queries retrieved the pages with the success of
0.25 ANR score in corpus showing that relevant web pages are retrieved in top of
25% of the whole corpus. This ratio is rising up to 61% in ground truth only set.
As can be seen on Table 1, while group 3 shows the best performance among the
other groups, group 2 presents the worst one. If we investigate the reasons of this, we
reveal one truth. To date, HTML has evolved and reached to its fifth release. During
the years it has been enriched with new standards such as CSS [28]. With the advent
of CSS, web page developers have started to locate image elements with different
techniques. As the IMG tag is still being used by majority, many developer have
chose the way of embedding images as background images via CSS rules. This me-
thod may have advantages for better and modern design. However this technique
makes current SimiLay wrapper impossible to detect and map image regions. Due to
implicit declaration, it is a very exhaustive task to detect them correctly. In essence,
current SimiLay wrapper uses some smart tricks to detect the boundary rectangle of
items. A similar problem sometimes occurs in some web pages which employs java-
script based animation/motion regions. Therefore, we can conclude that new style
HTML implementations can cause indispensable problems. For the future work, we
are planning to apply a smarter way of tag coordinate calculation.
Table 1. Experiment result with ground truth and whole corpus data
Web Page Url Retrieval
order in GT
dataset
ANR in
GT
dataset
Retrieval order
in whole corpus
dataset
ANR in
whole
corpus
Group 1
http://www.rinze.com/ 1,2,5,9,16,19 0,192 1,5,10,35,58,78 0,276
http://www.brandynfidel.com/blog/ 1,2,3,4,7,8 0,025 1,2,3,4,8,10 0,011
http://blog.artversion.com/web-
design-company/why-is-web-design-
the-most-important-part-of-marketing/
1,2,6,7,8,9 0,076 1,2,10,15,17,20 0,073
http://johnwilhelm.ch/ 1,2,3,5,6,7 0,019 1,2,3,5,6,8 0,006
http://ojodesaturno.tumblr.com/ 1,2,8,15,16,18 0,250 1,2,21,60,61,67 0,318
http://david-quinn.co.uk/ 1,3,6,13,20,23 0,288 1,16,33,53,81,88 0,418
http://www.alejandropg.com/ 1,3,5,6,7,14 0,096 1,3,11,19,21,54 0,146
Group 2
http://www.loiclequere.com/new-
york-city
1,2,8,18 0,121 1,2,37,67 0,242
http://www.ks-
image.com/iceland.html
1,2,11,22 0,166 1,3,57,88 0,347
https://marion-
luttenberger.squarespace.com/
13,15,18,19 0,352 65,78,84,85 0,755
http://www.mitchellkphotos.com/
photography.html
1,2,14,19 0,166 2,8,68,85 0,382
http://www.yourbeautiful
photography.com/content.php
15,16,23,24 0,435 60,63,87,94 0,735
Group 3
http://www.dolusepetim.com/ 1,2,3,4,5,7,8 0,006 4,5,12,15,16,21 0,086
http://www.hepsiburada.com/ 1,2,3,4,7,12 0,051 1,2,3,4,16,59 0,106
http://www.fotografium.com/ 1,2,3,4,7,12 0,051 1,2,3,4,17,57 0,105
http://www.annelutfen.com/ 1,2,3,4,7,11 0,044 2,3,4,5,12,51 0,093
http://www.ereyon.com.tr/ 1,2,3,4,6,12 0,044 1,2,3,4,14,53 0,093
http://www.724tikla.com/ 1,2,3,4,7,12 0,051 1,2,3,4,24,63 0,126
http://www.sanalpazar.com/ 1,2,3,5,6,12 0,330 1,4,5,7,9,56 0,101
Group 4
http://www.magiclick.com 5,9,10,11,12,14 0,256 8,20,25,26,32,53 0,238
http://www.cognex.com/ 4,9,10,12,21,24 0,378 8,42,45,60,87,96 0,528
http://www.noritsu.com/ 2,3,6,10,18,21 0,250 4,6,13,31,83,93 0,348
http://www.interneer.com/ 1,2,3,12,15,17 0,185 6,10,13,67,84,89 0,413
http://www.decathlon.com.tr/ 1,8,10,12,20,24 0,346 62,85,89,93,94,100 0,836
http://www.god.de/god/ 7,8,9,11,12,13 0,250 16,17,22,39,43,45 0,268
http://www.ikea.com.tr/ 1,2,5,14,18,20 0,250 2,8,21,54,84,90 0,396
As we stated before, one of the objectives of our study and approach is to provide a
flexible way to weighting the importance of each visual word. In essence, histogram
vectors of spatial pyramid match kernel provide this flexibility. Currently we are set-
ting the weights of categories as follows: (1) blank: 1.55, (2) text: 0.75, (3) image:
1.1, (4): form elements: 1 and (5) animation: 1. As the textual regions may contain
different font size and types we decided to decrease the importance of textual regions
which means providing textual property invariance. In contrast to text areas, we in-
creased the weight of blank spaces as several studies suggest the importance of hori-
zontal and vertical separators. With this setting, we believe that we are boosting the
importance of white spaces on pages and make the algorithm to work more sensible
against to visual separators. However we have not optimized these parameters. As a
future work, we are also planning to make an optimization on visual words weight
settings.
6 Conclusion
In this study we designed and implemented a web page similarity search engine with
a novel method combined with a bag of features approach. According to our best
knowledge, there exists no study which considers and tries to solve this problem with
the followed approaches. Viewing the web page as a collection of different functional
object categories and mapping them to a multi resolution pyramid schema provides us
several benefits and flexibilities. Yet, it yields also good and promising results. By
this study, we can conclude that spatial pyramid matching kernel schema is a suitable
solution with the combination of our visual word decomposition approach.
References
1. Y. Yang and H.J. Zhang, HTML Page Analysis Based on Visual Cues. In Proceed-
ings of Sixth International Conference on Document Analysis and Recognition,
2001.
2. M. Alpuante, D. Romero, A Visual Technique for Web Pages Comparison, Elec-
tronic Notes in Theoretical Computer Science, Vol. 235: 3-18, 2009.
3. V. Eglin and S. Bres, Document Page Similarity based on Layout visual saliency:
Application to query by example and document classification, In Proceedings of
the Seventh International Conference on Document Analysis and Recognition.
2003.
4. X. Wan, A Novel Documents Similarity Measure based on Earth Movers Dis-
tance, Information Sciences. Vol: 177, pages: 3718-3730. 2007.
5. J. Kang and J. Choi, Recognising Informative Web Page Blocks Using Visual
Segmentation for Efficient Information Extraction, Journal of Universal Computer
Science. 14(11) .pages: 1893-1910 .2008
6. M. Hara, A. Yamada and Y. Miyake, Visual Similarity-based Phishing Detection
without Victim Site Information. In Proceedings of Computational Intelligence in
Cyber Security 09. Pages: 30-36. 2009.
7. M.T. Law, C.S. Gutierrez, N. Thome and S. Ganarski and M. Cord, Structural and
Visual Similarity Learning for Web Page Archiving, In Proceeding of CBMI. pag-
es: 1-6, 2012.
8. P. Bohunsky and W. Gatterbauer, Visual Structure-based Web Page Clustering and
Retrieval, In Proceddings of the 19th International Conference on World Wide
Web. Pages: 1067-1068. 2010.
9. E. Medvet, E. Kirda and C. Kruegel. Visual-Similarity-Based Phishing Detection,
In Proceedings of SecureComm 2008. 2008
10. M. Alpuente and D. Romero, A Tool for Computing the Visual Similarity of Web
Pages, In Proceedings of Applications and the Internet (SAINT), 2010.
11. A.P. Rosiello, E. Kirda, C. Kruegel and F. Ferrandi, A Layout-Similarity-Based
Approach for Detecting Phishing Pages, In Proceeding of Security and Privacy in
Communications Networks and the Workshops, pages: 457-463. 2007.
12. Gartner Press Release. Gartner Says Number of Phishing E-mails Sent to U.S.
Adults early Doubles in Just Two Years.
http://www.gartner.com/it/page.jsp?id=498245, 2006.
13. D. Gehrke, E. Turban, Determinant of successful website design: Relative impor-
tance and recommendations for effectiveness. In Proceedings of the 32th Hawaii
International Conference on System Sciences, 1999.
14. S. Lazebnik, C. Schmid and J. Ponce, Beyond Bags of Features: Spatial Pyramid
Matching Recognizing Natural Scene Categories, In Proceedings of IEEE Comput-
er Society Conference on Computer Vision and Pattern Recognition, 2006.
15. S. OHara amd B.A. Draper, Introduction to The Bag of Features Paradigm for Im-
age Classification and Retrieval, CoRR abs/1101.3354, 2011.
16. D. Cai, S. Yu, J.R. Wen, and W.Y. Ma, VIPS: a Vision-based Page Segmentation
Algorithm, Technical Report MSR-TR-2003-79, Microsoft Research. 2003.
17. H. Guo, J. Mahmud , Y. Borodin, A. Stent and I.V. Ramakrishnan, A General Ap-
proach for Partioning Web Page Content Based on Geometric and Style Informa-
tion, In Proceedings of Document Analysis and Recognition ICDAR 2007. 2007.
18. A. Tombros and Z. Ali, Factors Affecting Web Page Similarity, In Proceedings of
ICIR 2005, pages: 487-501, 2005.
19. ImgSeek, http:///www.imageseek.net. (28.1.2014)
20. A. Pnueli, R. Bergman, S Schein and O. Barkol, Web Page Layout Via Visual
Segmentation, Technical Report HPL-2009-160. 2009.
21. M. Kudelka, T. Takama, V. Snasel and K. Klos, Visual Similarity of Web Pages,
Advances In Intelligent and Soft Computing. Vol: 67, pages: 135-146, 2010.
22. T.C. Chen, S. Dick and J. Miller, Detecting Visually Similar Web Pages: Applica-
tion to Phishing Detection, ACM Transactions on Internet Technology. Vol. 10 (2),
2010.
23. J. Koenderink and A. V. Doorn, The structure of locally orderless images, IJVC,
31(2/3), pages: 159-168. 1999.
24. S. Lazebnik, C. Schmid and J. Ponce, Spatial Pyramid Matching,
http://www.cs.unc.edu/~lazebnik/publications/pyramid_chapter.pdf.
25. K. Grauman and T. Darrell, Pyramid match kernels: Discriminative classification
with sets of image features. In Proceedings of ICCV, 2005.
26. Mozilla GeckoFx.Net, https://code.google.com/p/geckofx/. 28.1.2014
27. ASP.NET, http://www.asp.net/ 30.1.2014
28. What is CSS?, http://www.w3.org/Style/CSS/ 1.2.2014