Volume 2 Special Issue ISSN 2079-8407

Journal of Emerging Trends in Computing and Information Sciences

©2010-11 CIS Journal. All rights reserved.

http://www.cisjournal.org

Web Document Clustering Using Fuzzy Equivalence Relations


Anjali B. Raut #1, G. R. Bamnote *2
#1 Department of Computer Science & Engineering, HVPM's COET, Amravati, SGB Amravati University (M.S.), India
*2 Department of Computer Science & Engineering, PRMITR, Badnera, SGB Amravati University (M.S.), India
#1 anjali_dahake@rediffmail.com, *2 grbamnote@rediffmail.com

ABSTRACT
Conventional clustering classifies the given data objects into exclusive subsets (clusters), so that we can discriminate clearly whether an object belongs to a cluster or not. However, such a partition is insufficient to represent many real situations. Therefore, a fuzzy clustering method is offered to construct clusters with uncertain boundaries, allowing one object to belong to overlapping clusters with some membership degree. In other words, the essence of fuzzy clustering is to consider not only whether an object belongs to a cluster, but also to what degree it belongs. In this paper, a technique called “Retrieval of Web documents using a fuzzy hierarchical clustering” is proposed that creates clusters of web documents using fuzzy hierarchical clustering.

Keywords: Web Mining, Clustering, Search Engine


I. INTRODUCTION

Over the last decade there has been tremendous growth of information on the World Wide Web (WWW), which has become a major source of information. The Web creates new challenges for information retrieval as the amount of information on the web and the number of web users grow rapidly. It is practically impossible to search through this extremely large database for the information a user needs; hence the need for search engines. Search engines use crawlers to gather information and store it in a database maintained on the search engine side. For a given user query, the search engine searches its local database and very quickly displays the results.

The ability to form meaningful groups of objects is one of the most fundamental modes of intelligence. Humans perform this task with remarkable ease. Cluster analysis is a tool for exploring the structure of data. The core of cluster analysis is clustering: the process of grouping objects into clusters such that objects from the same cluster are similar and objects from different clusters are dissimilar. The need to structure and learn from a vigorously growing amount of data has been a driving force in making clustering a highly active research area.

Web Mining is the use of Data Mining techniques to automatically discover and extract information from the web. Clustering is one of the possible techniques for improving the efficiency of the information finding process. It is a Data Mining tool for grouping objects into clusters such that objects from the same cluster are similar and objects from different clusters are dissimilar.

Web Mining has fuzzy characteristics, so fuzzy clustering is sometimes better suited to Web Mining than conventional clustering. Fuzzy clustering is a relevant technique for information retrieval: as a document might be relevant to multiple queries, it should be included in each of the corresponding response sets; otherwise, users would not be aware of it. Fuzzy clustering therefore seems a natural technique for document categorization. There are two basic methods of fuzzy clustering: one, based on fuzzy c-partitions, is called the fuzzy c-means clustering method; the other, based on fuzzy equivalence relations, is called the fuzzy equivalence clustering method. The purpose of this research is to propose a search methodology for finding relevant information on the WWW. In this paper, a document clustering method based on fuzzy equivalence relations is proposed that helps information retrieval in terms of time and relevance.

The paper is structured as follows: Section 2 describes related work on Web Mining and clustering algorithms. Section 3 presents the proposed method, and Section 4 gives an example of how to retrieve relevant information from the WWW. Section 5 shows the results. Section 6 presents the conclusion and future work.

II. RELATED WORK

Data Mining has emerged as a new discipline in a world of increasingly massive datasets. Data Mining is the process of extracting or mining knowledge from data, and it is becoming an increasingly important tool for transforming data into information. Knowledge Discovery from Data (KDD) is a synonym for Data Mining.

2.1 Web Mining
The World Wide Web is a major source of information, and it creates new challenges for information retrieval as the amount of information on the web increases exponentially. Web Mining is the use of Data Mining techniques to automatically discover and extract information from web documents and services [1]. Oren Etzioni first coined the term Web Mining. Initially, two different approaches were taken to defining Web Mining. The first was a “process-centric view”, which defined Web Mining as a sequence of different processes [1], whereas the second was a “data-centric view”, which defined Web Mining in terms of the type of data being used in the mining process [2].
The second definition has become more widely accepted, as is evident from the approach adopted in most research papers [3][5]. Web Mining also lies at the intersection of databases, information retrieval and artificial intelligence [4].

2.2 Web Mining Process
Web mining may be decomposed into the following subtasks:
1. Resource Discovery: retrieving the web resources.
2. Information Pre-processing: transforming the results of resource discovery.
3. Information Extraction: automatically extracting specific information from newly discovered Web resources.
4. Generalization: uncovering general patterns at individual Web sites and across multiple sites [3].

Fig 2.1: Web mining process (Resource Discovery, Information Pre-processing, Information Extraction, Generalization)

2.3 Web Mining Taxonomy
The Web has different facets that yield different approaches to the mining process:
1. Web pages consist of text.
2. Web pages are linked via hyperlinks.
3. User activity can be monitored via Web server logs.

These three facets lead to a distinction between three categories: Web content mining, Web structure mining and Web usage mining [4-7]. Fig 2.2 shows the Web Mining taxonomy.

Fig 2.2: Web mining taxonomy

Web Content Mining (WCM):
Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to web content has been the most widely researched.

Web Structure Mining (WSM):
The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web.

Web Usage Mining (WUM):
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data. Web usage data includes data from web server logs, browser logs, user profiles, registration data, cookies, etc.

WCM and WSM use the real or primary data on the web, whereas WUM mines the secondary data derived from users' interactions with the web.

2.4 Clustering
The Web is the largest information repository in the history of mankind. Finding relevant information on the WWW is not an easy task. A user can encounter the following problems when interacting with the web [2]:

• Low precision: Today’s search tools suffer from low precision because many search results are irrelevant. This makes it difficult to find the relevant information.
• Low recall: Not all the information available on the web can be indexed. This makes it difficult to find relevant information that is unindexed.

Clustering is one of the Data Mining techniques for improving the efficiency of the information finding process. Many clustering algorithms have been developed and used in many fields. A. K. Jain, M. N. Murty and P. J. Flynn [8] provide an extensive survey of various data clustering techniques. Clustering algorithms can be broadly categorized into partitional and hierarchical techniques.

Agglomerative hierarchical clustering (AHC) algorithms are the most commonly used. They use a bottom-up methodology to merge smaller clusters into larger ones, using techniques such as the minimal spanning tree. These algorithms are found to be slow when applied to large document collections. AHC has different variants such as single-link, group-average and complete-link. Single-link and group-average methods typically take O(n^2) time, while the complete-link method typically takes O(n^3) time.

Partitional algorithms such as K-means are linear-time algorithms. They try to divide the data into subgroups such that the partition optimizes certain criteria, such as inter-cluster or intra-cluster distances. They typically take an iterative approach. The time complexity of this
algorithm is O(nkt), where n is the number of objects, k is the number of desired clusters and t is the number of iterations.

Most document clustering algorithms work on the BOW (Bag Of Words) model [5]. Oren Zamir and Oren Etzioni [9] listed the key requirements of web document clustering methods as relevance, browsable summaries, overlap, snippet tolerance, speed and accuracy. They proposed the STC (Suffix Tree Clustering) algorithm, which creates clusters based on phrases shared between documents. Michael Steinbach, George Karypis and Vipin Kumar [10] presented the results of an experimental study of some common document clustering algorithms. They compared the two main approaches to document clustering: agglomerative hierarchical clustering and the K-means method. Nicholas O. Andrews and Edward A. Fox [11] presented recent developments in document clustering. A single object often contains multiple themes; for example, a web document on the topic of Web Mining may contain themes such as Data Mining, clustering and information retrieval. Many traditional clustering algorithms assign each document to a single cluster, which makes it difficult for the user to retrieve information. On this basis, clustering algorithms can be divided into hard and soft clustering algorithms. In a traditional (hard) clustering algorithm each object belongs to exactly one cluster, whereas in a soft clustering algorithm each object can belong to multiple clusters [12].

The conventional clustering algorithms in Data Mining have difficulty handling the challenges posed by collections of natural data, which are often vague and uncertain. The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy clustering methods were therefore introduced to construct clusters with uncertain boundaries, allowing one object to belong to multiple clusters with some membership degree.

Pawan Lingras, Rui Yan and Chad West [13] applied a fuzzy technique to discover usage patterns from web data. Fuzzy c-means clustering was applied to the web visitors of educational websites. The analysis shows the ability of fuzzy c-means clustering to distinguish different user characteristics. Anupam Joshi and Raghu Krishnapuram [14] developed a prototype Web Mining system which analyzes web access logs from a server and tries to mine typical user access patterns. Maofu Liu, Yanxiang He and Huijun Hu [15] proposed a web fuzzy clustering model. In their paper, the experimental results of web fuzzy clustering applied to web user clustering demonstrate the feasibility of web fuzzy clustering in web usage mining.

III. PROPOSED WORK

A clustering method based upon fuzzy equivalence relations is proposed for web document clustering. The downloaded documents and the keywords contained therein are stored in a database by the crawler. The indexer extracts all words from the entire set of documents and eliminates stop words such as “a”, “and”, “the”, etc. from each document. The remaining keywords are used to fetch the related documents, which are stored in the indexed database according to their keywords. The proposed fuzzy clustering method based upon fuzzy equivalence relations is then applied to the indexed database. A list of common words, called keywords, is given in Table 3.1.

TABLE 3.1 DOCUMENT NO. AND KEYWORDS

Document No    Keywords
0              Web
1              Fuzzy
2              Cluster
3              Fuzzy
4              Web

Each keyword is assigned a Keyword ID as shown in Table 3.2.

TABLE 3.2 KEYWORDS AND KEYWORD ID

Keywords    Keyword ID
Web         0
Fuzzy       1
Database    2
Cluster     3

The information contained in Table 3.1 and Table 3.2 is used to generate the document clustering data to which the fuzzy equivalence relation is applied. Since a fuzzy equivalence relation cannot be obtained directly from the data, we first determine a fuzzy compatibility relation (reflexive and symmetric) in terms of an appropriate distance function applied to the given data. Then, a meaningful fuzzy equivalence relation is defined as the transitive closure of this fuzzy compatibility relation.

The data set X consists of the following points in $\mathbb{R}^2$ (in general, p-tuples in $\mathbb{R}^p$), as shown in Fig 3.1.

Fig 3.1: Web document data

The data X is shown in Table 3.3.
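The indexing step and the construction of the data points described above can be sketched in Python. This is only an illustration under stated assumptions: the stop-word set holds just the three words named in the text, `extract_keywords` and `build_points` are hypothetical helper names, and each point is assumed to pair a document number from Table 3.1 with the keyword ID from Table 3.2, a reading that is consistent with the five points used in the example of Section IV.

```python
# Stop words named in the text; a real indexer would use a much longer list.
STOP_WORDS = {"a", "and", "the"}

# In-memory copies of Table 3.1 and Table 3.2 (hypothetical representation).
doc_keywords = {0: "Web", 1: "Fuzzy", 2: "Cluster", 3: "Fuzzy", 4: "Web"}
keyword_ids = {"Web": 0, "Fuzzy": 1, "Database": 2, "Cluster": 3}


def extract_keywords(text):
    """Split a document into lowercase words and drop the stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_points(doc_keywords, keyword_ids):
    """Return one (document number, keyword ID) point per document (assumed mapping)."""
    return [(doc, keyword_ids[kw]) for doc, kw in sorted(doc_keywords.items())]


points = build_points(doc_keywords, keyword_ids)
print(extract_keywords("The Web and a Cluster"))
print(points)
```

Under this assumed mapping, document 2 (keyword "Cluster", ID 3) becomes the point (2, 3), and so on for the other documents.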
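The two-step construction outlined above (a fuzzy compatibility relation from a distance function, followed by its transitive closure) can be sketched as follows. The sketch assumes the Minkowski-class form of equation (i), with δ taken as the inverse of the largest pairwise distance, and max-min composition for the closure. The function names are illustrative and not part of the paper, and the points are the five used in the example of Section IV.

```python
from itertools import combinations

# Data points x1..x5 from the example (stored with 0-based indices).
points = [(0, 0), (1, 1), (2, 3), (3, 1), (4, 0)]


def compatibility_relation(points, q):
    """R(x_i, x_k) = 1 - delta * d_q(x_i, x_k), with delta = 1 / largest distance."""
    def dist(a, b):
        return sum(abs(u - v) ** q for u, v in zip(a, b)) ** (1.0 / q)

    delta = 1.0 / max(dist(a, b) for a, b in combinations(points, 2))
    n = len(points)
    return [[1.0 - delta * dist(points[i], points[k]) for k in range(n)]
            for i in range(n)]


def transitive_closure(r):
    """Repeat R' = R U (R o R), with max-min composition, until R' equals R."""
    n = len(r)
    while True:
        composed = [[max(min(r[i][j], r[j][k]) for j in range(n))
                     for k in range(n)] for i in range(n)]
        merged = [[max(r[i][k], composed[i][k]) for k in range(n)]
                  for i in range(n)]
        if merged == r:
            return r
        r = merged


def alpha_cut_partition(rt, alpha):
    """Group indices whose membership in the closure is at least alpha."""
    classes, seen = [], set()
    for i in range(len(rt)):
        if i in seen:
            continue
        block = [k for k in range(len(rt)) if rt[i][k] >= alpha]
        classes.append(block)
        seen.update(block)
    return classes


rt_euclid = transitive_closure(compatibility_relation(points, q=2))
rt_hamming = transitive_closure(compatibility_relation(points, q=1))
print(alpha_cut_partition(rt_euclid, 0.6))
print(alpha_cut_partition(rt_hamming, 0.5))
```

Index 0 corresponds to x1, index 1 to x2, and so on. Because max and min only select existing entries, the equality test in `transitive_closure` is exact and the loop terminates once no entry changes.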

TABLE 3.3

k       1  2  3  4  5
x_k1    0  1  2  3  4
x_k2    0  1  3  1  0

Let a fuzzy compatibility relation, R, on X be defined in terms of an appropriate distance function of the Minkowski class by the formula

$$R(x_i, x_k) = 1 - \delta \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^{q} \right)^{1/q} \qquad \text{(i)}$$

for all pairs of points $x_i, x_k \in X$, where $q \in \mathbb{R}^{+}$ and $\delta$ is a constant that ensures that $R(x_i, x_k) \in [0, 1]$. Clearly, $\delta$ is the inverse of the largest distance in X. In general, R defined by equation (i) is a fuzzy compatibility relation, but not necessarily a fuzzy equivalence relation. Hence, the transitive closure of R needs to be determined.

Given a relation R(X,X), its transitive closure $R_T(X,X)$ can be determined by a simple algorithm that consists of the following three steps:

1. R' = R ∪ (R ∘ R)
2. If R' ≠ R, make R = R' and go to step 1
3. Stop: R' = $R_T$

This algorithm is applicable to both crisp and fuzzy relations. However, the type of composition and set union in step 1 must be compatible with the definition of transitivity employed. After applying this algorithm, a hierarchical cluster tree is generated. Each cluster contains similar documents, which helps to find related documents in terms of time and relevancy.

IV. EXAMPLE

To illustrate the method based on fuzzy equivalence relations, consider an example with five web documents and four keywords, as shown in Fig 3.1. Applying the above algorithm, we analyze the data for q = 1 and q = 2.

First, for q = 2, which corresponds to the Euclidean distance, the value of δ in equation (i) must be determined. The largest Euclidean distance between any pair of the given data points is 4, so δ = 1/4 = 0.25. The data points are

x_1 = (0,0), x_2 = (1,1), x_3 = (2,3), x_4 = (3,1), x_5 = (4,0)

Now calculate the membership grades of R using equation (i). For example,

$$R(x_1, x_3) = 1 - 0.25\,(2^2 + 3^2)^{0.5} \approx 0.1$$

Once determined, relation R may conveniently be represented by the following matrix:

$$R = \begin{pmatrix} 1 & .65 & .1 & .21 & 0 \\ .65 & 1 & .44 & .5 & .21 \\ .1 & .44 & 1 & .44 & .1 \\ .21 & .5 & .44 & 1 & .65 \\ 0 & .21 & .1 & .65 & 1 \end{pmatrix}$$

This relation is not max-min transitive; its transitive closure is

$$R_T = \begin{pmatrix} 1 & .65 & .44 & .5 & .5 \\ .65 & 1 & .44 & .5 & .5 \\ .44 & .44 & 1 & .44 & .44 \\ .5 & .5 & .44 & 1 & .65 \\ .5 & .5 & .44 & .65 & 1 \end{pmatrix}$$

This relation induces four distinct partitions through its α-cuts:

α ∈ [0, 0.44] : { {x_1, x_2, x_3, x_4, x_5} }
α ∈ (0.44, 0.5] : { {x_1, x_2, x_4, x_5}, {x_3} }
α ∈ (0.5, 0.65] : { {x_1, x_2}, {x_3}, {x_4, x_5} }
α ∈ (0.65, 1] : { {x_1}, {x_2}, {x_3}, {x_4}, {x_5} }

Now repeat the analysis for q = 1 in equation (i), which corresponds to the Hamming distance. Since the largest Hamming distance in the data is 5, we have δ = 1/5 = 0.2. The matrix form of relation R given by equation (i) is now

$$R = \begin{pmatrix} 1 & .6 & 0 & .2 & .2 \\ .6 & 1 & .4 & .6 & .2 \\ 0 & .4 & 1 & .4 & 0 \\ .2 & .6 & .4 & 1 & .6 \\ .2 & .2 & 0 & .6 & 1 \end{pmatrix}$$

and its transitive closure is

$$R_T = \begin{pmatrix} 1 & .6 & .4 & .6 & .6 \\ .6 & 1 & .4 & .6 & .6 \\ .4 & .4 & 1 & .4 & .4 \\ .6 & .6 & .4 & 1 & .6 \\ .6 & .6 & .4 & .6 & 1 \end{pmatrix}$$

This relation gives the following partitions through its α-cuts:

α ∈ [0, 0.4] : { {x_1, x_2, x_3, x_4, x_5} }
α ∈ (0.4, 0.6] : { {x_1, x_2, x_4, x_5}, {x_3} }
α ∈ (0.6, 1] : { {x_1}, {x_2}, {x_3}, {x_4}, {x_5} }

V. RESULTS AND SNAPSHOTS

The result for q = 2 agrees with our visual perception of geometric clusters in the data. This is undoubtedly due to the use of the Euclidean distance. The dendrogram is a
graphical representation of the results of hierarchical cluster analysis. It is a tree-like plot in which each step of hierarchical clustering is represented as a fusion of two branches of the tree into a single one. The branches represent the clusters obtained at each step of hierarchical clustering. The result of the above example is presented in the form of dendrograms in the snapshots shown in Fig 5.1 and Fig 5.2.

[Figure: Dendrogram (α-cut)]
Fig 5.1: Snapshot of dendrogram

Fig 5.2: Snapshot of dendrogram

VI. CONCLUSION AND FUTURE RESEARCH

Web data has fuzzy characteristics, so fuzzy clustering is sometimes better suited to Web Mining than conventional clustering. The proposed technique for document retrieval on the web, based on a fuzzy logic approach, improves the relevancy factor. The technique keeps related documents in the same cluster, so that searching for documents becomes more efficient in terms of time complexity. In future work, the relevancy factor for retrieving web documents can be improved further.

REFERENCES

[1] Oren Etzioni, “The World Wide Web: quagmire or gold mine?”, Communications of the ACM, Nov. 1996.
[2] R. Cooley, B. Mobasher and J. Srivastava, “Web Mining: Information and Pattern Discovery on the World Wide Web”, in Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), 1997.
[3] Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, “Data Mining: Next Generation Challenges and Future Directions”, MIT Press, USA, 2004.
[4] Wang Bin and Liu Zhijing, “Web Mining Research”, in Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA’03), 2003.
[5] R. Kosala and H. Blockeel, “Web Mining Research: A Survey”, SIGKDD Explorations, ACM SIGKDD, July 2000.
[6] Sankar K. Pal, Varun Talwar and Pabitra Mitra, “Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions”, IEEE Transactions on Neural Networks, Vol. 13, No. 5, Sept. 2002.
[7] Andreas Hotho and Gerd Stumme, “Mining the World Wide Web - Methods, Applications and Perspectives”, in Künstliche Intelligenz, July 2007. (Available at http://kobra.bibliothek.uni-kassel.de/)
[8] A. K. Jain, M. N. Murty and P. J. Flynn, “Data clustering: A review”, ACM Computing Surveys, 31(3):264-323, Sept. 1999.
[9] O. Zamir and O. Etzioni, “Web document clustering: A feasibility demonstration”, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1998.
[10] Michael Steinbach, George Karypis and Vipin Kumar, “A Comparison of Document Clustering Techniques”, KDD Workshop on Text Mining, 2000.
[11] Nicholas O. Andrews and Edward A. Fox, “Recent Developments in Document Clustering Techniques”, Dept. of Computer Science, Virginia Tech, 2007.
[12] King-Ip Lin and Ravikumar Kondadadi, “A Similarity Based Soft Clustering Algorithm for Documents”, in Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA-2001), April 2001.
[13] Pawan Lingras, Rui Yan and Chad West, “Fuzzy C-Means Clustering of Web Users for Educational Sites”, Springer Publication, 2003.
[14] Anupam Joshi and Raghu Krishnapuram, “Robust Fuzzy Clustering Methods to Support Web Mining”, Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD, 1998.
[15] Maofu Liu, Yanxiang He and Huijun Hu, “Web Fuzzy Clustering and Its Applications In Web Usage Mining”, Proceedings of the 8th International Symposium on Future Software Technology (ISFST-2004).
[16] Klir & Yuan, “Fuzzy Sets and Fuzzy Logic: Theory and Applications”, Prentice Hall Publication.
