http://www.cisjournal.org
ABSTRACT
Conventional clustering classifies the given data objects into exclusive subsets (clusters). That means we can clearly discriminate whether an object belongs to a cluster or not. However, such a partition is insufficient to represent many real situations. Therefore, a fuzzy clustering method is offered to construct clusters with uncertain boundaries, allowing one object to belong to overlapping clusters with some membership degree. In other words, the essence of fuzzy clustering is to consider not only whether an object belongs to a cluster, but also to what degree it belongs. In this paper, a technique called “Retrieval of Web documents using a fuzzy hierarchical clustering” is proposed that creates clusters of web documents using fuzzy hierarchical clustering.
The second definition has become more widely accepted, as is evident from the approach adopted in most research papers [3][5]. Web Mining also lies at the intersection of databases, information retrieval and artificial intelligence [4].

2.2 Web Mining Process
Web mining may be decomposed into the following subtasks:
1. Resource Discovery: the process of retrieving web resources.
2. Information Pre-processing: the transformation of the results of resource discovery.
3. Information Extraction: automatically extracting specific information from newly discovered Web resources.
4. Generalization: uncovering general patterns at individual Web sites and across multiple sites [3].

Resource Discovery → Information Pre-processing → Information Extraction → Generalization
Fig 2.1: Web mining process

2.3 Web Mining Taxonomy
The Web has different facets that yield different approaches to the mining process:
1. Web pages consist of text.
2. Web pages are linked via hyperlinks.
3. User activity can be monitored via Web server logs.

These three facets lead to a distinction between three categories, i.e. Web content mining, Web structure mining and Web usage mining [4-7]. Fig 2.2 shows the Web Mining Taxonomy.

Fig 2.2: Web mining taxonomy

Web Content Mining (WCM):
Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. The application of text mining to web content has been the most widely researched.

Web Structure Mining (WSM):
The structure of a typical web graph consists of web pages as nodes, and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web.

Web Usage Mining (WUM):
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data. Web usage data includes data from web server logs, browser logs, user profiles, registration data, cookies etc.

WCM and WSM use the real or primary data on the web, whereas WUM mines the secondary data derived from users' interactions with the web.

2.4 Clustering
The Web is the largest information repository in the history of mankind. Finding relevant information on the WWW is not an easy task. A user can encounter the following problems when interacting with the web [2]:

• Low precision: Today’s search tools have a low precision problem, which is due to the irrelevance of many search results. This makes it difficult to find the relevant information.
• Low recall: This is due to the inability to index all the information available on the web. This makes it difficult to find relevant information that is unindexed.

Clustering is one of the Data Mining techniques used to improve the efficiency of the information finding process. Many clustering algorithms have been developed and used in many fields. A. K. Jain, M. N. Murty and P. J. Flynn [8] provide an extensive survey of various data clustering techniques. Clustering algorithms can be broadly categorized into partitional and hierarchical techniques.

Agglomerative hierarchical clustering (AHC) algorithms are the most commonly used. They use a bottom-up methodology to merge smaller clusters into larger ones, using techniques such as the minimal spanning tree. These algorithms are found to be slow when applied to large document collections. AHC has different variants such as single-link, group-average and complete-link. Single-link and group-average methods typically take $O(n^2)$ time, while the complete-link method typically takes $O(n^3)$ time.

Partitional algorithms such as K-means are linear time algorithms. They try to divide the data into subgroups such that the partition optimizes certain criteria, like inter-cluster distance or intra-cluster distance. They typically take an iterative approach. The time complexity of this
Volume 2 Special Issue ISSN 2079-8407
Journal of Emerging Trends in Computing and Information Sciences
algorithm is $O(nkt)$, where k is the number of desired clusters and t is the number of iterations.

Most document clustering algorithms work on the BOW (Bag Of Words) model [5]. Oren Zamir and Oren Etzioni [9] listed the key requirements of web document clustering methods as relevance, browsable summaries, overlap, snippet tolerance, speed and accuracy. They proposed the STC (Suffix Tree Clustering) algorithm, which creates clusters based on phrases shared between documents. Michael Steinbach, George Karypis and Vipin Kumar [10] presented the results of an experimental study of some common document clustering algorithms. They compare the two main approaches to document clustering, i.e. agglomerative hierarchical clustering and the K-means method. Nicholas O. Andrews and Edward A. Fox [11] presented the recent developments in document clustering.

A single object often contains multiple themes; for example, a web document on the topic Web Mining may contain different themes like Data Mining, clustering and information retrieval. Many traditional clustering algorithms assign each document to a single cluster, thus making it difficult for the user to retrieve information. On this basis, clustering algorithms can be divided into hard and soft clustering algorithms. In a hard (traditional) clustering algorithm each object belongs to exactly one cluster, whereas in a soft clustering algorithm each object can belong to multiple clusters [12].

The conventional clustering algorithms in Data Mining have difficulties in handling the challenges posed by collections of natural data, which are often vague and uncertain. The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Therefore a fuzzy clustering method was offered to construct clusters with uncertain boundaries; this method allows one object to belong to multiple clusters with some membership degree.

Pawan Lingras, Rui Yan and Chad West [13] applied a fuzzy technique to discover usage patterns from web data. Fuzzy c-means clustering was applied to the web visitors of educational websites. The analysis shows the ability of fuzzy c-means clustering to distinguish different user characteristics. Anupam Joshi and Raghu Krishnapuram [14] developed a prototype Web Mining system which analyzes web access logs from a server and tries to mine typical user access patterns. Maofu Liu, Yanxiang He and Huijun Hu [15] proposed a web fuzzy clustering model. In their paper, the experimental results of web fuzzy clustering applied to web user clustering demonstrate the feasibility of web fuzzy clustering in web usage mining.

entire set of documents and eliminates stop words, such as “a”, “and”, “the” etc., from each document. These keywords are used to fetch the related documents, which are stored in the indexed database according to their keywords. Now, the proposed fuzzy clustering method based upon fuzzy equivalence relations is applied to the indexed database. A list of common words called keywords is given in table 3.1.

TABLE 3.1 DOCUMENT NO. AND KEYWORDS

Document No    Keywords
0              Web
1              Fuzzy
2              Cluster
3              Fuzzy
4              Web

Each keyword is assigned a Keyword ID as shown in table 3.2.

TABLE 3.2 KEYWORDS AND KEYWORD ID

Keywords       Keyword ID
Web            0
Fuzzy          1
Database       2
Cluster        3

The information contained in table 3.1 and table 3.2 is used to generate the document clustering data required for applying a fuzzy equivalence relation. Since a fuzzy equivalence relation cannot be obtained directly, we first determine a fuzzy compatibility relation (reflexive and symmetric) in terms of an appropriate distance function applied to the given data. Then, a meaningful fuzzy equivalence relation is defined as the transitive closure of this fuzzy compatibility relation.

Let the data set X consist of points in $\mathbb{R}^2$ (in general, p-tuples of $\mathbb{R}^p$), as shown in figure 3.1.
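As an illustration of the keyword indexing step summarized in Tables 3.1 and 3.2, the following minimal Python sketch removes stop words from each document and assigns consecutive keyword IDs. The function names (`extract_keywords`, `build_index`), the three-word stop list and the two sample documents are assumptions made for illustration only; they are not taken from the proposed system, whose actual stop list and storage format are not specified here.

```python
# Illustrative sketch only: names, stop list and sample texts are assumptions.
STOP_WORDS = {"a", "and", "the"}  # the three stop words named in the text


def extract_keywords(text):
    """Lower-case the text, strip surrounding punctuation and drop stop words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_index(documents):
    """Return (keyword -> keyword ID, keyword ID -> sorted document numbers)."""
    keyword_ids = {}
    postings = {}
    for doc_no, text in enumerate(documents):
        for word in extract_keywords(text):
            # A new keyword receives the next free ID; a known one keeps its ID.
            kid = keyword_ids.setdefault(word, len(keyword_ids))
            postings.setdefault(kid, set()).add(doc_no)
    return keyword_ids, {kid: sorted(docs) for kid, docs in postings.items()}


# Hypothetical documents, numbered 0 and 1 in order of appearance.
documents = ["The web and a fuzzy cluster", "A fuzzy database"]
keyword_ids, index = build_index(documents)
```

Here `keyword_ids` plays the role of Table 3.2 (keyword to Keyword ID), while `index` maps each Keyword ID to the document numbers that contain it, which is one plausible form for the indexed database.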
Let a fuzzy compatibility relation, R, on X be defined in terms of an appropriate distance function of the Minkowski class by the formula

$$R(x_i, x_k) = 1 - \delta \left( \sum_{j=1}^{p} |x_{ij} - x_{kj}|^{q} \right)^{1/q} \qquad \text{(i)}$$

for all pairs $(x_i, x_k) \in X \times X$, where $q \in \mathbb{R}^{+}$ and $\delta$ is a constant that ensures that $R(x_i, x_k) \in [0, 1]$. Clearly, $\delta$ is the inverse of the largest distance in X. In general, R defined by equation (i) is a fuzzy compatibility relation, but not necessarily a fuzzy equivalence relation. Hence, there is a need to determine the transitive closure of R.

Given a relation R(X,X), its transitive closure $R_T(X,X)$ can be determined by a simple algorithm that consists of the following three steps:

1. $R' = R \cup (R \circ R)$
2. If $R' \neq R$, make $R = R'$ and go to step 1.
3. Stop: $R' = R_T$.

This relation is not max-min transitive; its transitive closure is

$$R_T = \begin{bmatrix}
1 & .65 & .44 & .5 & .5 \\
.65 & 1 & .44 & .5 & .5 \\
.44 & .44 & 1 & .44 & .44 \\
.5 & .5 & .44 & 1 & .65 \\
.5 & .5 & .44 & .65 & 1
\end{bmatrix}$$

This relation induces four distinct partitions through its α-cuts:

$\alpha \in [0, 0.44]$: $\{\{x_1, x_2, x_3, x_4, x_5\}\}$
$\alpha \in (0.44, 0.5]$: $\{\{x_1, x_2, x_4, x_5\}, \{x_3\}\}$
$\alpha \in (0.5, 0.65]$: $\{\{x_1, x_2\}, \{x_3\}, \{x_4, x_5\}\}$
$\alpha \in (0.65, 1]$: $\{\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}\}$

Now repeat the analysis for q = 1 in eq. (i), which represents the Hamming distance. Since the largest Hamming distance in the data is 5, we have δ = 1/5 = 0.2. The matrix form of the relation R given by eq. (i) is now
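The compatibility relation of equation (i), the three-step closure procedure and the α-cut partitions can be checked with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, relations are stored as plain lists of lists, and the 3×3 relation `R_small` is an invented example. Only the matrix `R_T` is copied from the text. In `compatibility_relation`, δ is taken as the inverse of the largest pairwise distance, as stated above.

```python
def compatibility_relation(points, q=2.0):
    """Equation (i): R(x_i, x_k) = 1 - delta * (sum_j |x_ij - x_kj|^q)^(1/q)."""
    dist = [[sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)
             for y in points] for x in points]
    largest = max(max(row) for row in dist)
    delta = 1.0 / largest  # inverse of the largest distance; assumes two distinct points exist
    return [[1.0 - delta * d for d in row] for row in dist]


def max_min_closure(R):
    """Repeat R' = R U (R o R), with max-min composition, until R' equals R."""
    n = len(R)
    while True:
        R_next = [[max(R[i][k], max(min(R[i][j], R[j][k]) for j in range(n)))
                   for k in range(n)] for i in range(n)]
        if R_next == R:
            return R_next
        R = R_next


def alpha_cut_partition(R_T, alpha):
    """Clusters of the alpha-cut of a fuzzy equivalence relation (0-based indices)."""
    clusters, seen = [], set()
    for i in range(len(R_T)):
        if i not in seen:
            cluster = [k for k in range(len(R_T)) if R_T[i][k] >= alpha]
            seen.update(cluster)
            clusters.append(cluster)
    return clusters


# Transitive closure R_T as printed in the text (case q = 2).
R_T = [
    [1.0, 0.65, 0.44, 0.5, 0.5],
    [0.65, 1.0, 0.44, 0.5, 0.5],
    [0.44, 0.44, 1.0, 0.44, 0.44],
    [0.5, 0.5, 0.44, 1.0, 0.65],
    [0.5, 0.5, 0.44, 0.65, 1.0],
]

# Hypothetical compatibility relation that is not max-min transitive:
# min(0.8, 0.6) = 0.6 exceeds the direct degree 0.2 between items 0 and 2.
R_small = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.6], [0.2, 0.6, 1.0]]
closure_small = max_min_closure(R_small)

partitions = {alpha: alpha_cut_partition(R_T, alpha)
              for alpha in (0.44, 0.5, 0.65, 1.0)}
```

With these inputs, `partitions[0.5]` groups indices 0, 1, 3 and 4 together and leaves index 2 alone, which corresponds to the partition listed for α in (0.44, 0.5] (indices are 0-based, so index 0 stands for x_1). Because the closure only ever selects existing membership values through `max` and `min`, the exact equality test in `max_min_closure` is safe for floating-point entries.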
VI. CONCLUSION AND FUTURE RESEARCH

Web data has fuzzy characteristics, so fuzzy clustering is sometimes better suited to Web Mining than conventional clustering. The proposed technique for document retrieval on the web, based on a fuzzy logic approach, improves the relevancy factor. The technique keeps related documents in the same cluster, so that searching for documents becomes more efficient in terms of time complexity. In future work, the relevancy factor for retrieving web documents can be improved further.

[10] Michael Steinbach, George Karypis and Vipin Kumar, “A Comparison of Document Clustering Techniques”, KDD Workshop on Text Mining, 2000.

[11] Nicholas O. Andrews and Edward A. Fox, “Recent Development in Document Clustering Techniques”, Dept of Computer Science, Virginia Tech, 2007.

[12] King-Ip Lin and Ravikumar Kondadadi, “A Similarity Based Soft Clustering Algorithm for Documents”, in Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA-2001), April 2001.
[13] Pawan Lingras, Rui Yan and Chad West, “Fuzzy C-Means Clustering of Web Users for Educational Sites”, Springer Publication, 2003.

[14] Anupam Joshi and Raghu Krishnapuram, “Robust Fuzzy Clustering Methods to Support Web Mining”, Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD, 1998.

[15] Maofu Liu, Yanxiang He and Huijun Hu, “Web Fuzzy Clustering and Its Applications In Web Usage Mining”, Proceedings of 8th International Symposium on Future Software Technology (ISFST-2004).

[16] Klir & Yuan, “Fuzzy Sets and Fuzzy Logic: Theory and Applications”, Prentice Hall Publication.