
VedaVyasa Institute of Technology

WEB MINING

1. INTRODUCTION

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools in order to find, extract, filter, and evaluate the desired information and resources. In addition, with the transformation of the web into the primary tool for electronic commerce, it is imperative for organizations and companies, who have invested millions in Internet and Intranet technologies, to track and analyze user access patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge both across the Internet and in particular web localities.

Figure(i). Total number of sites across all domains: 224,749,695 (Netcraft survey, Mar 2009)

At present most users rely on search engines such as www.google.com and www.yahoo.com to find the information they need. However, the target of a Web search engine is only to discover resources on the Web. Each search engine has its own characteristics and employs different algorithms to index, rank, and present web documents. But because all of these search engines are


built on exact keyword matching, and their query languages are artificial, with restricted syntax and vocabulary rather than natural language, there are defects that no search engine can overcome.

Narrow searching scope: the web pages indexed by any search engine are only a tiny part of all the pages on the WWW, and the pages returned when a user submits a query are another tiny part of the pages the engine has indexed.

Low precision: the user cannot browse all the returned pages one by one, and most of the pages are irrelevant to the user's intent; they are returned and highlighted by the search engine merely because they contain the query keywords.

Web mining techniques can be used to solve these information overload problems directly or indirectly. However, web mining techniques are not the only tools; techniques from other research areas, such as databases (DB), information retrieval (IR), natural language processing (NLP), and the web document community, can also be used.

INFORMATION RETRIEVAL

Information retrieval is the art and science of searching for information in documents, searching for documents themselves, searching for metadata which describes documents, or searching within databases, whether relational standalone databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data.

NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is concerned with the interactions between computers and human (natural) languages. NLP is a form of human-computer interaction in which the elements of human language, spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Natural language understanding is sometimes referred to as an AI-complete problem, because natural language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it.


2. OVERVIEW OF WEB MINING METHODOLOGIES


Web mining is the application of machine learning (data mining) techniques to web-based data for the purpose of learning or extracting knowledge. Web mining encompasses a wide variety of techniques, including soft computing. Web mining methodologies can generally be classified into one of three distinct categories: web usage mining, web structure mining, and web content mining.

In web usage mining the goal is to examine web page usage patterns in order to learn about a web system's users or the relationships between the documents. For example, the tool presented by Masseglia et al. [MPC99] creates association rules from web access logs, which store the identity of pages accessed by users along with other information such as when the pages were accessed and by whom; these logs, rather than the actual web pages, are the focus of the data mining effort. Rules created by their method could include, for example, "70% of the users that visited page A also visited page B." Similarly, the method of Nasraoui et al. [NFJK99] also examines web access logs; it performs a hierarchical clustering in order to determine the usage patterns of different groups of users. Beeferman and Berger [BB00] described a process they developed which determines topics related to a user query using click-through logs and agglomerative clustering of bipartite graphs. The transaction-based method developed in Ref. [Mer99] creates links between pages that are frequently accessed together during the same session. Web usage mining is useful for providing personalized web services, an area of web mining research that has lately become active; it promises to help tailor web services, such as web search engines, to the preferences of each individual user.

In the second category of web mining methodologies, web structure mining, we examine only the relationships between web documents by utilizing the information conveyed by each document's hyperlinks. As in the web usage mining methods described above, the other content of the web pages is often ignored. In Ref. [KRRT99]


Kumar et al. examined utilizing a graph representation of web page structure. Here nodes in the graph are web pages and edges indicate hyperlinks between pages. By examining these "web graphs" it is possible to find documents or areas of interest through the use of certain graph-theoretical measures or procedures. Structures such as web rings, portals, or affiliated sites can be identified by matching the characteristics of these structures (e.g. we can identify portal pages because they have an unusually high out-degree). Graph models are also used in other web structure mining approaches. For example, in Ref. [CvdBD99] the authors' method examines linked URLs and performs classification using a Bayesian method; the graph is processed to determine groups of pages that relate to a common topic.

In web content mining we examine the actual content of web pages (most often the text contained in the pages) and then perform some knowledge discovery procedure to learn about the pages themselves and their relationships. Most typically this is done to organize a group of documents into related categories. This is especially beneficial for web search engines, since it allows users to find the information they are looking for more quickly than with the usual "endless" ranked list. There are several examples of web or text mining approaches [AHKV99] that are content-oriented and attempt to cluster documents for browsing. The Athena system of Agrawal et al. [AJS00] creates groupings of e-mail messages. The goal is to create folders (classes) for different topics and route e-mail messages automatically to the appropriate folders. Athena uses a clustering algorithm called C-Evolve to create topics (folders), while the classification of documents to each cluster is supervised and requires manual interaction with the user; the classification method is based on Naive Bayes. Some notable papers that deal with clustering for web search include Ref. [BGG+99a], which describes two partitional methods, and Ref. [CD97], which is a hierarchical clustering approach. Nahm and Mooney [NM00] described a methodology where information extraction and data mining can be combined to improve upon each other: information extraction provides the data mining process with access to textual documents (text mining), and in turn data mining provides learned rules to the information extraction portion to improve its performance. An important paper that is strongly related to the current work is that of Strehl et al. [SGM00], which examined the clustering performance of different clustering algorithms when using various similarity measures on web document collections. Clustering methods examined in


the paper included k-means, graph partitioning, and self-organizing maps (SOMs). Vector-based representations were used in the experiments along with distance measures based on Euclidean distance, cosine similarity, and Jaccard similarity. One of the data sets used in this paper is publicly available and we will use it in our experiments.

3. TRADITIONAL INFORMATION RETRIEVAL TECHNIQUES


Traditional information retrieval methods represent plain-text documents using a series of numeric values for each document. Each value is associated with a specific term (word) that may appear in a document, and the set of possible terms is shared across all documents. The values may be binary, indicating the presence or absence of the corresponding term. The values may also be non-negative integers, representing the number of times a term appears in a document (i.e. term frequency). Non-negative real numbers can also be used, in this case indicating the importance or weight of each term. These values are derived through a method such as the popular inverse document frequency model, which reduces the importance of terms that appear in many documents. Regardless of the method used, each series of values represents a document and corresponds to a point (i.e. a vector) in a Euclidean feature space; this is called the vector-space model of information retrieval. This model is often used when applying machine learning techniques to documents, as there is a strong mathematical foundation for performing distance measure and centroid calculations using vectors.
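To make the vector-space model concrete, here is a minimal Python sketch of term-frequency and inverse-document-frequency weighting under the common tf × log(N/df) scheme; the toy corpus and the helper name tfidf_vectors are illustrative assumptions, not taken from the report:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build TF-IDF vectors for a list of tokenized documents.

    Returns (vocabulary, vectors): one weight per vocabulary term
    per document, with weight tf * log(N / df).
    """
    n_docs = len(documents)
    vocabulary = sorted({term for doc in documents for term in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors

docs = [
    "web mining extracts knowledge from web data".split(),
    "data mining finds patterns in data".split(),
]
vocab, vecs = tfidf_vectors(docs)
print(vocab)
print(vecs)
```

Note that a term appearing in every document gets weight log(1) = 0, which is exactly the de-emphasis of common terms described above.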

Vector-based distance measures


Here we briefly review some of the most popular vector-related distance measures, which will also be used in the experiments we perform. First, we have the well-known Euclidean distance:

$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$


where $x_i$ and $y_i$ are the $i$th components of vectors $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$, respectively. Euclidean distance measures the direct distance between two points in the space $\mathbb{R}^n$. For applications in text and document clustering and classification, the cosine similarity measure [Sal89] is often used. We can convert this to a distance measure by the following:

$$d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$$

Here $\cdot$ indicates the dot product operation and $\|\cdot\|$ indicates the magnitude (length) of a vector. If we take each document to be a point in $\mathbb{R}^n$ formed by the values assigned to it under the vector model, each value corresponding to a dimension, then the direction of the vector formed from the origin to the point representing the document indicates the content of the document. Under the Euclidean distance, documents with large differences in size have a large distance between them, regardless of whether or not their content is similar, because the lengths of the vectors differ. The cosine distance is length invariant, meaning only the direction of the vectors is compared; the magnitude of the vectors is ignored. Another popular distance measure for determining document similarity is the extended Jaccard similarity [Sal89], which is converted to a distance measure as follows:

$$d_J(x, y) = 1 - \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$$


Jaccard distance has properties of both the Euclidean and cosine distance measures. At high distance values, Jaccard behaves more like the cosine measure; at lower values, it behaves more like Euclidean.
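The three distance measures above translate directly into code. Below is a minimal Python sketch; the function names and example vectors are illustrative assumptions:

```python
import math

def euclidean(x, y):
    # Direct straight-line distance in R^n.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    # 1 - cosine similarity: length-invariant, compares direction only.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)

def jaccard_distance(x, y):
    # 1 - extended Jaccard similarity.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    sq_x = sum(xi * xi for xi in x)
    sq_y = sum(yi * yi for yi in y)
    return 1.0 - dot / (sq_x + sq_y - dot)

# A short document (x) and a long document (y) with similar content:
x, y = [1.0, 2.0, 0.0], [10.0, 20.0, 1.0]
print(euclidean(x, y))         # large: penalizes the size difference
print(cosine_distance(x, y))   # small: directions are nearly identical
print(jaccard_distance(x, y))
```

The example illustrates the length-invariance point made above: the two vectors are far apart under Euclidean distance but close under cosine distance.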

4. WEB MINING
With the recent explosive growth of the amount of content on the Internet, it has become increasingly difficult for users to find and utilize information and for content providers to classify and catalog documents. Traditional web search engines often return hundreds or thousands of results for a search, which is time consuming for users to browse. On-line libraries, search engines, and other large document repositories (e.g. customer support databases, product specification databases, press release archives, news story archives, etc.) are growing so rapidly that it is difficult and costly to categorize every document manually. In order to deal with these problems, researchers look toward automated methods of working with web documents so that they can be more easily browsed, organized, and cataloged with minimal human intervention.

In contrast to the highly structured tabular data upon which most machine learning methods are expected to operate, web and text documents are semi-structured. Web documents have well-defined structures such as letters, words, sentences, paragraphs, sections, punctuation marks, HTML tags, and so forth. We know that words make up sentences, sentences make up paragraphs, and so on, but many of the rules governing the order in which the various elements are allowed to appear are vague or ill-defined and can vary dramatically between documents. It is estimated that as much as 85% of all digital business information, most of it web-related, is stored in non-structured formats (i.e. non-tabular formats, such as those used in databases and


spreadsheets) [Pet]. Developing improved methods of performing machine learning techniques on this vast amount of non-tabular, semi-structured web data is therefore highly desirable.

Clustering and classification have been useful and active areas of machine learning research that promise to help us cope with the problem of information overload on the Internet. With clustering, the goal is to separate a given group of data items (the data set) into groups called clusters such that items in the same cluster are similar to each other and dissimilar to the items in other clusters. In clustering methods no labeled examples are provided in advance for training (this is called unsupervised learning). Under classification we attempt to assign a data item to a predefined category based on a model that is created from pre-classified training data (supervised learning). In more general terms, both clustering and classification come under the area of knowledge discovery in databases or data mining. Applying data mining techniques to web page content is referred to as web content mining, a new sub-area of web mining partially built upon the established field of information retrieval.

When representing text and web document content for clustering and classification, a vector-space model is typically used. In this model, each possible term that can appear in a document becomes a feature dimension [Sal89]. The value assigned to each dimension of a document may indicate the number of times the corresponding term appears in it, or it may be a weight that takes into account other frequency information, such as the number of documents in which the term appears. This model is simple and allows the use of traditional machine learning methods that deal with numerical feature vectors in a Euclidean feature space. However, it discards information such as the order in which the terms appear, where in the document the terms appear, how close the terms are to each other, and so forth. By keeping this kind of structural information we could possibly improve the performance of various machine learning algorithms. The problem is that traditional data mining methods are often restricted to working on purely numeric feature vectors, due to the need to compute distances between data items or to calculate some representative of a cluster of items (i.e. a centroid or center of a cluster), both of which are easily accomplished in a Euclidean space. Thus either the original data needs to be converted to a vector of numeric values by discarding possibly useful structural information (which is what we


are doing when using the vector model to represent documents) or we need to develop new, customized methodologies for the specific representation.

Graphs are important and effective mathematical constructs for modeling relationships and structural information. Graphs (and their more restrictive form, trees) are used in many different problems, including sorting, compression, traffic flow analysis, resource allocation, etc. [CLR97]. In addition to problems where the graph itself is processed by some algorithm (e.g. sorting by the depth-first method or finding the minimum spanning tree), it would be extremely desirable in many applications, including those related to machine learning, to model data as graphs, since these graphs can retain more information than sets or vectors of simple atomic features. Thus much research has been performed in the area of graph similarity, in order to exploit the additional information allowed by graph representations by introducing mathematical frameworks for dealing with graphs. Some application domains where graph similarity techniques have been applied include face [WFKvdM97] and fingerprint [WJH00] recognition as well as anomaly detection in communication networks [DBD+01].

In the literature, this work comes under several different topic names, including graph distance, (exact) graph matching, inexact graph matching, error-tolerant graph matching, and error-correcting graph matching. In exact graph matching we are attempting to determine whether two graphs are identical. Inexact graph matching implies we are attempting to find not a perfect matching but rather a "best" or "closest" matching. Error-tolerant and error-correcting matching are special cases of inexact matching where the imperfections (e.g. missing nodes) in one of the graphs, called the data graph, are assumed to be the result of some errors (e.g. from transmission or recognition); we attempt to match the data graph to the most similar model graph in our database. Graph distance is a numeric measure of dissimilarity between graphs, with larger distances implying more dissimilarity. By graph similarity, we mean we are interested in some measurement that tells us how similar graphs are, regardless of whether an exact matching between them exists.

The web, as we all know, is the single largest source of data available. Web mining aims to extract and mine useful knowledge from the web. It is used to understand customer behavior, evaluate the effectiveness of a website, and help quantify the success of a marketing campaign. Web mining is the integration of information


gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web.
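As a toy illustration of the graph-based document representation and graph distance discussed above, here is a hedged Python sketch. It follows a maximum-common-subgraph style of distance, d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|); the term-adjacency encoding and all names are illustrative assumptions (with unique term labels per node, the common subgraph reduces to simple set intersection):

```python
def doc_graph(tokens):
    """Model a document as (nodes, edges): unique terms plus
    directed edges between immediately adjacent terms."""
    nodes = set(tokens)
    edges = {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}
    return nodes, edges

def graph_distance(g1, g2):
    """MCS-style distance: 1 - |mcs(G1, G2)| / max(|G1|, |G2|),
    sizing a graph by nodes + edges. Unique node labels make the
    maximum common subgraph a simple intersection."""
    n1, e1 = g1
    n2, e2 = g2
    mcs_size = len(n1 & n2) + len(e1 & e2)
    max_size = max(len(n1) + len(e1), len(n2) + len(e2))
    return 1.0 - mcs_size / max_size

g_a = doc_graph("web mining extracts knowledge from web data".split())
g_b = doc_graph("data mining extracts patterns from data".split())
print(graph_distance(g_a, g_b))  # 0.0 = identical, 1.0 = disjoint
```

Unlike the vector model, this representation keeps word-order information (the edges), which is exactly the structural information the vector model discards.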

Figure(ii). Searching the Web

4.1. WEB MINING vs. DATA MINING


Data mining operates on structured data with well-defined tables, columns, rows, keys and constraints, whereas web data is dynamic and rich in features and patterns. Web mining involves analysis of the web server logs of a website, while data mining involves using techniques to find relationships in large amounts of data. Web mining often needs to react to evolving usage patterns in real time (e.g. merchandizing). Data mining is also called knowledge discovery in databases (KDD). It is the extraction of useful patterns from data sources, e.g. databases, texts, the web, images, etc. Patterns must be valid, novel, potentially useful, and understandable. The classic data mining tasks are:
Classification: mining patterns that can classify future (new) data into known classes.


Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items, e.g. {Cheese, Milk} → {Bread} [sup = 5%, conf = 80%].

Clustering: identifying a set of similarity groups in the data.

Sequential pattern mining: mining sequential rules of the form A → B, which say that event A will be immediately followed by event B with a certain confidence.
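As a concrete illustration of the support and confidence figures quoted in the association rule example above, here is a minimal Python sketch; the basket data and function name are illustrative assumptions:

```python
def rule_stats(transactions, x, y):
    """Support and confidence for the rule X -> Y over a list of
    transactions (each transaction is a set of items)."""
    x, y = set(x), set(y)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)         # X occurs
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # X and Y occur
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

baskets = [
    {"cheese", "milk", "bread"},
    {"cheese", "milk"},
    {"milk", "bread"},
    {"cheese", "milk", "bread", "eggs"},
]
sup, conf = rule_stats(baskets, ["cheese", "milk"], ["bread"])
print(f"sup = {sup:.0%}, conf = {conf:.0%}")  # sup = 50%, conf = 67%
```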

Figure (iii). The Data Mining (KDD) Process

Just as data mining aims at discovering valuable information that is hidden in conventional databases, the emerging field of web mining aims at finding and extracting relevant information that is hidden in Web-related data, in particular hypertext documents published on the Web. Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web content mining, Web structure mining, and Web usage mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions; web document text mining, resource discovery based on concept indexing, and agent-based technology also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents on the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.


The Web is a collection of inter-related files on one or more Web servers, and web mining is a multi-disciplinary effort that draws techniques from fields like information retrieval, statistics, machine learning, natural language processing, and others. Web mining has new characteristics compared with traditional data mining. First, the objects of web mining are a large number of web documents that are heterogeneously distributed, with each data source itself heterogeneous; second, the web document is semi-structured or unstructured and lacks semantics that a machine can understand.

4.2. HISTORY

The term Web Mining was first used in [E1996], where it was defined in a task-oriented manner; an alternate data-oriented definition was given in [CMS1997]. The first panel discussion on the topic took place at ICTAI 1997 [SM1997], and the area remains a continuing forum: WebKDD workshops with ACM SIGKDD (1999, 2000, 2001, 2002; 60-90 attendees), SIAM web analytics workshops (2001, 2002), special issues of the DMKD journal and SIGKDD Explorations, papers in various data mining conferences and journals, and surveys [MBNL 1999, BL 1999, KB2000].

4.3. WEB MINING SUBTASKS

Web mining can be decomposed into the following subtasks:


1. Resource finding: the task of retrieving intended Web documents. By

resource finding we mean the process of retrieving the data that is either online or offline from the text sources available on the web such as electronic newsletters, electronic newswire, the text contents of HTML documents obtained by removing HTML tags, and also the manual selection of Web resources.


2. Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources. This is a kind of transformation process applied to the original data retrieved in the IR process. These transformations could be either a kind of pre-processing such as removing stop words, stemming, etc., or a pre-processing aimed at obtaining the desired representation, such as finding phrases in the training corpus, transforming the representation to relational or first-order logic form, etc.

3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine learning or data mining techniques are typically used in the process of generalization. Humans play an important role in the information or knowledge discovery process on the Web, since the Web is an interactive medium.

4. Analysis: validating and/or interpreting the mined patterns.


5. CHALLENGES OF WEB MINING

1. Today the World Wide Web is flooded with billions of static and dynamic web pages created with technologies such as HTML, PHP and ASP. It is a significant challenge to find useful and relevant information on the web.

2. Creating knowledge from the available information.
3. As the coverage of information is very wide and diverse, personalization of the information is a tedious process.

4. Learning customer and individual user patterns.


5. The complexity of Web pages far exceeds the complexity of any conventional text document. Web pages on the internet lack uniformity and standardization.


6. Much of the information present on the web is redundant, as the same piece of information or its variants appear in many pages.


7. The web is noisy, i.e. a page typically contains a mixture of many kinds of information: main content, advertisements, copyright notices, and navigation panels.

8. The web is dynamic; information keeps changing constantly. Keeping up with the changes and monitoring them is very important.


9. The Web is not only about disseminating information; it is also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.


10. The most important challenge faced is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated without their knowledge or consent.

6. TAXONOMY OF WEB MINING

Figure(iv). Taxonomy of Web mining

In general, Web mining tasks can be classified into three categories: 1. Web content mining, 2. Web structure mining and 3. Web usage mining.


However, there are two other approaches to categorizing Web mining, and in both the categories are reduced from three to two: Web content mining and Web usage mining. In one, Web structure is treated as part of Web content, while in the other Web usage is treated as part of Web structure. All three categories focus on the process of knowledge discovery of implicit, previously unknown and potentially useful information from the Web; each of them focuses on different mining objects of the Web.

Figure(v). Detailed classification of Taxonomy of Web mining

6.1. WEB CONTENT MINING

Web content mining is the discovery of useful information from web contents/data/documents. Web data contents include text, images, audio, video, metadata and hyperlinks. Web content mining is an automatic process that goes beyond keyword extraction. Since the content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content in a representation that


could be exploited by machines. The usual approach to exploiting known structure in documents is to use wrappers to map documents to some data model; techniques using lexicons for content interpretation are yet to come. There are two groups of web content mining strategies: those that directly mine the content of documents, and those that improve on the content search of other tools like search engines.

Web content mining deals with discovering useful information or knowledge from web page contents, analyzing the content of Web resources. Content data is the collection of facts that are contained in a web page. It consists of unstructured data such as free text, images, audio and video, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. The primary Web resources mined in Web content mining are individual pages, which can be grouped, categorized, analyzed, and retrieved. Issues in Web content mining include:

Developing intelligent tools for IR: finding keywords and key phrases, discovering grammatical rules and collocations, hypertext classification/categorization, extracting key phrases from text documents, learning extraction models/rules, hierarchical clustering, predicting word relationships, and developing Web query systems such as WebOQL and XML-QL.

Mining multimedia data: e.g. mining satellite images (Fayyad, et al. 1996) and mining images to identify small volcanoes on Venus (Smyth, et al. 1996).

Web content mining can be differentiated from two points of view:

6.1.1. Agent-Based Approach

This approach aims to assist or improve information finding and the filtering of information for users. It can be placed into the following three categories:
a. Intelligent Search Agents: These agents search for relevant information using

domain characteristics and user profiles to organize and interpret the discovered information.
b. Information Filtering/ Categorization: These agents use information retrieval

techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them.


c. Personalized Web Agents: These agents learn user preferences and discover Web information based on these preferences and the preferences of other users with similar interests.

6.1.2. Database Approach

The database approach aims at modeling the data on the Web in a more structured form, in order to apply standard database querying mechanisms and data mining applications to analyze it. The two main categories are:
a. Multilevel databases: The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web sources, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases.
b. Web query systems: Many Web-based query systems and languages utilize

standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches.

6.2. WEB STRUCTURE MINING

The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of the information conveyed by the links to find pertinent web pages. By means of counters, higher levels cumulate the number of artifacts subsumed by the concepts they hold; counters of hyperlinks, into and out of documents, retrace the structure of the web artifacts summarized.

Web structure mining is the process of discovering structure information from the web. The structure of a typical web graph consists of web pages as nodes, and


hyperlinks as edges connecting related pages. This can be further divided into two kinds based on the kind of structure information used.

Figure(vi). Web graph structure

Hyperlinks

A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink.

Document Structure

In addition, the content within a Web page can be organized in a tree-structured format, based on the various HTML and XML tags within the page. Mining efforts here have focused on automatically extracting document object model (DOM) structures out of documents.

Web structure mining focuses on the hyperlink structure within the Web itself. The different objects are linked in some way, so simply applying traditional processes and assuming that the events are independent can lead to wrong conclusions. Appropriate handling of the links, however, can uncover potential correlations and thereby improve the predictive accuracy of the learned models. Two algorithms that have been proposed to deal with these potential correlations are: 1. HITS and


2. Page Rank.
6.2.1. Page Rank

PageRank is a metric for ranking hypertext documents that determines the quality of these documents. The key idea is that a page has high rank if it is pointed to by many highly ranked pages, so the rank of a page depends upon the ranks of the pages pointing to it. This process is carried out iteratively until the rank of every page is determined. The rank of a page p can thus be written as:

$$PR(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in E} \frac{PR(q)}{OutDegree(q)}$$

Here, n is the number of nodes in the graph, E is its set of hyperlink edges, OutDegree(q) is the number of hyperlinks on page q, and the damping factor d is the probability that at each page the random surfer will get bored and request another random page.

6.2.2. HITS

Hyperlink-induced topic search (HITS) is an iterative algorithm for mining the Web graph to identify topic hubs and authorities. Authorities are pages with good sources of content that are referred to by many other pages, or highly ranked pages for a given topic; hubs are pages with good sources of links. The algorithm takes as input search results returned by traditional text indexing techniques and filters these results to identify hubs and authorities. The number and weight of the hubs pointing to a page determine the page's authority, while the weight assigned to a hub is based on the authoritativeness of the pages it points to: if many good hubs point to a page p, the authority of p increases, and if a page p points to many good authorities, the hub weight of p increases. After the computation, HITS outputs the pages with the largest hub weights and the pages with the largest authority weights as the search result for the given topic.
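Below is a minimal Python sketch of both link-analysis computations on a toy graph: an iterative PageRank matching the formula above, and the HITS hub/authority updates just described. The example graph, iteration counts, and function names are illustrative assumptions:

```python
def pagerank(graph, d=0.15, iters=50):
    """graph: {page: [pages it links to]}. The damping factor d is the
    probability that the random surfer gets bored and jumps to a
    random page, matching the formula above."""
    n = len(graph)
    pr = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        pr = {p: d / n + (1 - d) * sum(pr[q] / len(out)
                                       for q, out in graph.items() if p in out)
              for p in graph}
    return pr

def hits(graph, iters=50):
    """Iterative HITS: authorities are pointed to by good hubs,
    hubs point to good authorities; weights are normalized each round."""
    hub = {p: 1.0 for p in graph}
    auth = {p: 1.0 for p in graph}
    for _ in range(iters):
        auth = {p: sum(hub[q] for q, out in graph.items() if p in out)
                for p in graph}
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        for w in (auth, hub):  # L1 normalization keeps the weights bounded
            s = sum(w.values()) or 1.0
            for p in w:
                w[p] /= s
    return hub, auth

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))
print(hits(links))
```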

6.3. WEB USAGE MINING


Web usage mining is the process of extracting useful information from server logs, i.e. user history. It is the process of finding out what users are looking for on the Internet, focusing on techniques that can predict the behavior of users while they interact with the WWW. It collects data from Web log records to discover user access patterns of Web pages. Usage data captures the identity or origin of web users along with their browsing behavior at a web site; web servers record and accumulate such data about user interactions whenever requests for resources are received. Analyzing the web access logs of different web sites can help understand user behavior and web structure, thereby improving the design of this colossal collection of resources.

There are two main tendencies in Web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends; these analyses can shed light on better structure and grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. We have designed a web log data mining tool, Web Log Miner, and proposed techniques for using data mining and On-Line Analytical Processing (OLAP) on treated and transformed web access files. Applying data mining techniques to access logs unveils interesting access patterns that can be used to restructure sites into more efficient groupings, pinpoint effective advertising locations, and target specific users with specific selling ads. Customized usage tracking analyzes individual trends; its purpose is to customize web sites to users. The information displayed, the depth of the site structure and the format of the resources can all be dynamically customized for each user over time, based on their access patterns.

While it is encouraging and exciting to see the various potential applications of web log file analysis, it is important to know that the success of such applications depends on what and how much valid and reliable knowledge one can discover from the large raw log data. Current web servers store limited information about the accesses; some scripts custom-tailored for particular sites may store additional information. However, for effective web usage mining, an important cleaning and data transformation step may be needed before analysis.


In the use and mining of Web data, the most direct source of data is the Web log files on the Web server. Web log files record the visitor's browsing behavior very clearly. They include the server log, agent log and client log (IP address, URL, page reference, access time, cookies, etc.). Web servers have the ability to log all requests; the most widely used web server log format is the Common Log Format (CLF), and the newer Extended Log Format allows configuration of the log file. Design of a Web Log Miner: the web log is filtered to generate a relational database; a data cube is generated from the database; OLAP is used to drill down and roll up in the cube; and OLAM (On-Line Analytical Mining) is used for mining interesting knowledge.
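As a small illustration of working with such logs, here is a Python sketch that parses Common Log Format lines into records suitable for loading into a relational table; the sample log line and the field handling are illustrative assumptions:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line is malformed."""
    m = CLF.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

line = ('127.0.0.1 - alice [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
print(parse_clf(line))
```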


Figure(vii). Web usage mining process


Figure (viii). Web Logs

There are several available research projects and commercial products that analyze those patterns for different purposes. The applications generated from this analysis can be classified as personalization, system improvement, site modification, business intelligence and usage characterization.


Figure (ix). A general Architecture for Web usage mining

Web usage mining can be decomposed into the following three main subtasks:

Figure (x). Diagram of Web usage mining process

6.3.1. Pre-processing

It is necessary to perform data preparation to convert the raw data for further processing. The data actually collected is generally incomplete, redundant and ambiguous, so in order to mine knowledge more effectively, preprocessing the collected data is essential; preprocessing provides accurate, concise data for data mining. Data preprocessing includes data cleaning, user identification, user session identification, access path supplement and transaction identification.
The main task of data cleaning is to remove from the Web log the redundant data that is not associated with the useful data, narrowing the scope of data objects.
Identifying the individual user must be done after data cleaning. The purpose of user identification is to identify each user uniquely. It can be accomplished by


means of cookie technology, user registration techniques and investigative rules.


User session identification should be done on the basis of user identification. The purpose is to divide each user's access information into several separate session processes. The simplest way is to use a time-out estimation approach: when the time interval between page requests exceeds a given value, the user is assumed to have started a new session (see the sketch after this list).
Because of the widespread use of page caching technology and proxy servers, the access path recorded in the Web server access logs may not be the complete access path of users. An incomplete access log does not accurately reflect the user's access patterns, so it is necessary to supplement the access path. Path supplement can be achieved by using the Web site topology in the page analysis.
Transaction identification is based on the recognition of user sessions; its purpose is to divide or combine transactions according to the demands of the data mining tasks, in order to make the data appropriate for data mining analysis.
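Here is a minimal Python sketch of the time-out session identification rule described in the list above; the 30-minute threshold and the (timestamp, url) data layout are illustrative assumptions:

```python
TIMEOUT = 30 * 60  # 30 minutes, in seconds

def sessionize(requests, timeout=TIMEOUT):
    """Split one user's requests, given as (timestamp, url) pairs sorted
    by timestamp, into sessions: a gap longer than `timeout` starts a
    new session."""
    sessions = []
    current = []
    last_t = None
    for t, url in requests:
        if last_t is not None and t - last_t > timeout:
            sessions.append(current)
            current = []
        current.append(url)
        last_t = t
    if current:
        sessions.append(current)
    return sessions

clicks = [(0, "/"), (120, "/courses"), (300, "/faculty"),
          (300 + 31 * 60, "/"), (300 + 32 * 60, "/news")]
print(sessionize(clicks))  # [['/', '/courses', '/faculty'], ['/', '/news']]
```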

6.3.2. Pattern discovery

Pattern discovery mines effective, novel, potentially useful and ultimately understandable information and knowledge using mining algorithms. Its methods include statistical analysis, classification analysis, association rule discovery, sequential pattern discovery, clustering analysis, and dependency modeling.
Statistical Analysis: Analysts may perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) based on variables such as page views, viewing time and length of a navigational path when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report


can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
Association Rules: In the web domain, the pages that are most often referenced together can be put in one single server session by applying association rule generation. Association rule mining techniques can be used to discover unordered correlations between items found in a database of transactions.
Clustering analysis: Clustering analysis is a technique for grouping together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies.
Classification analysis: Classification is the technique of mapping a data item into one of several predefined classes. Classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc.
Sequential Pattern: This technique intends to find inter-session patterns, such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Sequential patterns also include other types of temporal analysis such as trend analysis, change point detection, or similarity analysis.

Dependency Modeling: The goal of this technique is to establish a model that is able to represent significant dependencies among the various variables in the web domain. The modeling technique provides a theoretical framework for analyzing the behavior of users, and is potentially useful for predicting future web resource consumption.

6.3.3. Pattern Analysis

Pattern analysis is the final stage of the whole web usage mining process. The goal of this stage is to eliminate the irrelevant rules or patterns and to understand, visualize and extract the interesting rules or patterns from the output of the pattern discovery


process. The output of web mining algorithms is often not in a form suitable for direct human consumption, and thus needs to be transformed into a format that can be assimilated easily. There are two common approaches to pattern analysis: one is to use a knowledge query mechanism such as SQL, while the other is to construct a multi-dimensional data cube before performing OLAP operations.

Figure (xi). Web Usage Mining


Figure (xii). Potential Data Sources

7. WEB CRAWLER
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web spiders and Web robots. Search engines use spidering as


a means of providing up-to-date data. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code, and for gathering specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam); this is why addresses are often published in obfuscated forms such as anita at cs dot sunysb dot edu or mueller{remove this}@cs.sunysb.edu. A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.
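A minimal Python sketch of the seed-and-frontier loop just described, using only the standard library; the seed URL, page limit, and the naive regex link extraction are illustrative assumptions (a real crawler would also honor robots.txt and rate limits):

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, max_pages=10):
    """Breadth-first crawl: visit URLs from the frontier, extract
    hyperlinks, and append unseen ones back onto the frontier."""
    frontier = deque(seeds)
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        pages[url] = html
        # Naive href extraction; a real crawler would use an HTML parser.
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

pages = crawl(["https://example.com/"], max_pages=3)
print(list(pages))
```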

Figure (xiii). Web crawler

8. APPLICATIONS OF WEB MINING


Web mining techniques can be applied to understand and analyze such data and turn it into actionable information that can support a web-enabled electronic business in improving its marketing, sales and customer support operations. Based on the patterns


found and the original cache and log data, many applications can be developed. Some of them are:

8.1. Personalized Services

Personalized service means that when users browse a Web site, the site tries as far as possible to meet each user's browsing interests and constantly adjusts as those interests change, so that each user feels he or she is a unique user of the site. In order to achieve personalized service, the site first has to obtain and collect information on its clients to grasp each customer's spending habits, hobbies, consumer psychology, etc.; it can then provide targeted, personalized service. Obtaining consumer spending behavior patterns is very difficult with the traditional marketing approach, but it can be done using Web mining techniques.

Figure(xiv). Personalization of Web pages

Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed: "In a traditional (brick-and-mortar) store, the main effort is in getting a customer to the store. Once a customer is in the store they are likely to make a purchase, since the cost of going to another store is high, and thus the marketing budget (focused on getting the


customer to the store) is in general much higher than the in-store customer experience budget (which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store." This fundamental observation has been the driving force behind Amazon's comprehensive approach to personalized customer experience, based on the mantra "a personalized store for every customer". A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a store visit. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as instant recommendations, purchase circles, wish-lists, etc.

8.2. Improve the website design


The attractiveness of a site depends on the reasonable design of its content and organizational structure. Web mining can provide details of user behavior, giving web site designers a basis for decision making to improve the design of the site.

8.3. System Improvement

Performance and other service quality attributes are crucial to user satisfaction with services such as databases, networks, etc., and similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used for developing policies for Web caching, network transmission, load balancing, or data distribution. Security is an acutely growing concern for Web-based services, especially as electronic commerce continues to grow at an exponential rate. Web usage mining can also provide patterns which are useful for detecting intrusion, fraud, attempted break-ins, etc.

8.4. Predicting trends

Web mining can predict trends within the retrieved information to indicate future values. For example, an electronic auction company provides information about items to auction, previous auction details, etc. Predictive modeling can be utilized to analyze the existing information and to estimate the values of auctioned items or the number of people participating in future auctions.


The predicting capability of the mining application can also benefit society by identifying criminal activities.

8.5. To carry out intelligent business

A customer's visit cycle in network marketing activities can be divided into four steps: being attracted, presence, purchase and departure. Web mining technology can dig out customers' motivations by analyzing customer click-stream information, in order to help the sales side make reasonable strategies, build custom personalized pages for customers, and carry out targeted information feedback and advertising. In short, in e-commerce network marketing, using Web mining techniques to analyze large amounts of data can uncover the laws governing the consumption of goods and the customers' access patterns, help businesses develop effective marketing strategies, and enhance enterprise competitiveness. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer needs better and react to them faster. They can find, attract and retain customers, and save on production costs by utilizing the acquired insight into customer requirements. They can increase profitability by target pricing based on the profiles created. They can even identify a customer who might default to a competitor and try to retain the customer by providing promotional offers, thus reducing the risk of losing that customer.

9. WEB MINING PROS & CONS


9.1. PROS

Web mining has many advantages, which make this technology attractive to corporations and government agencies. It has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies use this technology to classify threats and fight against terrorism, and the predicting capability of mining applications can benefit society by identifying criminal activities. As described in the applications above, companies can establish better customer relationships, understand and react to customer needs faster, find, attract and retain customers, save on production costs through insight into customer requirements, increase profitability by target pricing, and reduce the risk of losing customers by making timely promotional offers.

9.2. CONS

Web mining itself does not create issues, but when this technology is used on data of a personal nature it can cause concerns. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. The obtained data is analyzed and clustered to form profiles; the data is made anonymous before clustering so that there are no personal profiles. Yet these applications de-individualize the users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of on their own individual characteristics and merits. Another important concern is that companies collecting data for a specific purpose might use the data for a totally different purpose, which essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies which buy the data are obliged to make it


anonymous, and these companies are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release; any inaccuracies in the release will result in serious lawsuits, but there is no law preventing them from trading the data. Some mining algorithms might use controversial attributes like sex, race, religion, or sexual orientation to categorize individuals, practices which might be against anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or a privilege to an individual based on race, religion or sexual orientation; right now this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy; in reality, much extra information can be inferred by combining separate pieces of data about the user.

With the rapid growth of the World Wide Web, web mining has become a very hot and popular topic, and it now plays an important role in helping e-commerce websites and e-services understand how their websites and services are used and provide better services for their customers and users. Recently, many researchers have focused on developing new web mining techniques and algorithms that have not yet been applied in real application environments; therefore, it may be time to shift the research focus to application areas such as e-commerce and e-services.


10. VISUAL WEB MINING

Analysis of web site usage data involves two significant challenges:


1. The volume of data arising from the growth of the web.
2. The structural complexity of Web sites.

Visual web mining is the application of information visualization techniques to the results of web mining, in order to further amplify the perception of extracted patterns and to visually explore new ones in the web domain. It applies data mining and information visualization techniques to the web domain in order to benefit from the power of both human visual perception and computing: data mining techniques are applied to large web data sets and information visualization methods are used on the results. Its goal is to correlate the outcomes of mining web usage logs and the extracted web structure by visually superimposing the results.

Information visualization enables visual representations of abstract data, using computer-supported, interactive visual interfaces to reinforce human cognition, thus enabling the viewer to gain knowledge about the internal structure of the data and the relationships in it. The Visual Web Mining Framework provides a prototype implementation for applying information visualization techniques to the results of data mining. A user session gives a compact sequence of web accesses by a user. Visualization serves to understand the structure of a particular website and to learn web surfers' behavior when visiting that website. Due to the large data sets and the structural complexity of the sites, 3D visual representations are used. The framework is implemented using an open source toolkit called the Visualization Tool Kit (VTK), which consists of a C++ class library and several interpreted interface layers including Tcl/Tk, Java, and Python.


Figure (xv). Visual web mining architecture

The input is web pages and web server log files. A web robot (webbot) is used to retrieve the pages of the website. The webbot is a very fast web walker with support for regular expressions, SQL logging facilities, and many other features; it can be used to check links, find bad HTML, map out a web site, download images, etc. In parallel, web server log files are downloaded and processed through a sessionizer, and a LOGML file is generated. The integration engine is a suite of programs for data preparation, i.e., extracting, cleaning, transforming, and integrating data, loading it into a database, and later generating graphs in XGML. User sessions are extracted from the web logs, which yields results roughly related to a specific user. The sessions are then converted into a special format for sequence mining using cSPADE (constrained SPADE: Sequential PAttern Discovery using Equivalence classes). The outputs are frequent contiguous sequences with a given minimum support. These are imported into a database, and non-maximal frequent sequences are removed. Different queries are executed against this data according to criteria such as the support of each pattern or the length of the patterns, and different URLs that correspond to the same web page are unified in the final results.
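The sessionizer step can be sketched as follows. This is a minimal illustration, assuming a 30-minute idle timeout and a simplified record layout of (visitor, timestamp, url); real server logs would be parsed first, and the framework's actual sessionizer may differ.

    from collections import defaultdict

    TIMEOUT = 30 * 60  # seconds of inactivity that ends a session

    def sessionize(records):
        """records: iterable of (visitor, timestamp_seconds, url) tuples."""
        by_visitor = defaultdict(list)
        for visitor, ts, url in records:
            by_visitor[visitor].append((ts, url))
        sessions = []
        for visitor, hits in by_visitor.items():
            hits.sort()
            current = [hits[0][1]]
            for (prev, _), (ts, url) in zip(hits, hits[1:]):
                if ts - prev > TIMEOUT:          # idle gap: close the session
                    sessions.append((visitor, current))
                    current = []
                current.append(url)
            sessions.append((visitor, current))
        return sessions

    log = [("10.0.0.1", 0, "/"), ("10.0.0.1", 60, "/a.html"),
           ("10.0.0.1", 4000, "/b.html")]   # the last hit starts a new session
    print(sessionize(log))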


The visualization stage maps the extracted data and attributes into visual images, realized through VTK extended with support for graphs. The result is interactive 3D/2D visualizations that analysts can use to compare actual web surfing patterns with expected patterns.
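The removal of non-maximal frequent sequences mentioned above can be sketched like this: since cSPADE's output here consists of contiguous sequences, a sequence is dropped if it appears as a contiguous run inside a longer frequent sequence. A minimal, illustrative implementation:

    def is_contiguous_subseq(small, big):
        """True if 'small' appears as a contiguous run inside 'big'."""
        n = len(small)
        return any(big[i:i + n] == small for i in range(len(big) - n + 1))

    def maximal_only(sequences):
        """Keep only sequences not contained in a longer one."""
        return [s for s in sequences
                if not any(len(t) > len(s) and is_contiguous_subseq(s, t)
                           for t in sequences)]

    seqs = [["/", "/a"], ["/", "/a", "/b"], ["/c"]]
    print(maximal_only(seqs))  # [['/', '/a', '/b'], ['/c']]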

10.1. Visual Representation


Structures:

Graphs: A spanning tree is extracted from the site structure and used as the framework for presenting access-related results through glyphs (small graphical symbols) and color mapping; a sketch of the extraction follows below.

Stream Tubes: Variable-width tubes showing access paths with different traffic are introduced on top of the web graph structure.
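A minimal sketch of the spanning-tree extraction, using networkx for brevity (the actual framework builds its own graph structures): a breadth-first search from the front page keeps one tree edge per reachable page.

    import networkx as nx

    site = nx.DiGraph()
    site.add_edges_from([("/", "/a"), ("/", "/b"),
                         ("/a", "/c"), ("/b", "/c")])  # "/c" reachable two ways

    tree = nx.bfs_tree(site, "/")   # spanning tree rooted at the front page
    print(sorted(tree.edges()))     # keeps only one of the two edges into "/c"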

10.2. Design and Implementation of Diagrams


Below is a visualization of the web graph of the Computer Science department of Rensselaer Polytechnic Institute. Strahler numbers are used for assigning colors to edges.

Figure(xvi). 2D visualization layout with Strahler Coloring applied on web usage logs


One can see user access paths scattering from the first page of the website (the node in the center) to clusters of web pages corresponding to faculty pages, course home pages, etc. The Strahler number is a numerical measure of branching complexity; here it is used to assign colors to the edges.
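The Strahler number of a node can be computed recursively: a leaf has number 1, and an internal node takes the maximum of its children's numbers, plus one if that maximum is attained by two or more children. A minimal sketch, with the tree encoded as an assumed parent-to-children mapping:

    def strahler(tree, node):
        """Strahler number of 'node' in a tree given as {parent: [children]}."""
        children = tree.get(node, [])
        if not children:
            return 1                      # leaves have Strahler number 1
        ranks = sorted((strahler(tree, c) for c in children), reverse=True)
        if len(ranks) >= 2 and ranks[0] == ranks[1]:
            return ranks[0] + 1           # two maximal subtrees tie: rank rises
        return ranks[0]                   # otherwise inherit the maximum

    tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": []}
    print(strahler(tree, "root"))  # -> 2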

Figure(xvii). 3D visualization layout with Strahler Coloring applied on web usage logs

Adding a third dimension enables the visualization of more information and clarifies user behavior within and between clusters. The center node of the circular base is the first page of the web site, from which users scatter to different clusters of web pages. A color spectrum from red (entry points into clusters) to blue (exit points) clarifies the behavior of users. The cylinder-like part of this figure visualizes the web usage of surfers as they browse a long HTML document.


Figure(xviii). User log sessions

Left: One can observe long user sessions as strings falling off. These are a special type of long session in which a user navigates a sequence of web pages that come one after the other, e.g., sections of a long document. In many cases, web pages with many nodes connected by Next/Up/Previous hyperlinks were found. Right: An enlarged view of the same visualization.

Frequent access patterns extracted by the web mining process are visualized as a white graph on top of an embedded and colorful graph of web usage.


Figure(xix). Superimposition of frequent patterns extracted from web mining on top of web usage

This is similar to the previous picture, with the addition of another attribute: the frequency of each pattern, which is rendered as the thickness of the white tubes. This helps in the analysis of the results.
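The frequency-to-thickness mapping can be sketched in VTK (again assuming the Python bindings; the data here is a toy path, not the framework's actual pipeline): per-point scalars carry the pattern frequency, and vtkTubeFilter varies the tube radius by those scalars.

    import vtk

    points = vtk.vtkPoints()
    for i in range(3):                    # a toy three-point access path
        points.InsertNextPoint(float(i), 0.0, 0.0)

    line = vtk.vtkPolyLine()
    line.GetPointIds().SetNumberOfIds(3)
    for i in range(3):
        line.GetPointIds().SetId(i, i)

    cells = vtk.vtkCellArray()
    cells.InsertNextCell(line)

    freq = vtk.vtkFloatArray()            # pattern frequency as point scalars
    freq.SetName("frequency")
    for value in (1.0, 2.0, 3.0):
        freq.InsertNextValue(value)

    poly = vtk.vtkPolyData()
    poly.SetPoints(points)
    poly.SetLines(cells)
    poly.GetPointData().SetScalars(freq)

    tubes = vtk.vtkTubeFilter()           # thicker tube where frequency is higher
    tubes.SetInputData(poly)
    tubes.SetRadius(0.05)                 # minimum radius
    tubes.SetVaryRadiusToVaryRadiusByScalar()
    tubes.SetRadiusFactor(4.0)            # ratio of maximum to minimum radius
    tubes.Update()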


Figure(xx). White tubes: the thickness of the tubes represents the frequency of the discovered patterns

The next figure shows the superimposition of web usage on top of the web structure with a higher-order layout. The top node is the first page of the website. The hierarchical layout makes analysis easier.


Figure(xxi). Higher Order layout for clear visualization and easier analysis


Figure(xxii). Left: Superimposition of website dynamics (colored) on top of its static structure (gray). Right: Zoomed view of the colored region, with the layout of web usage taken from the web graph basement; the basement itself is removed for clarity.

Using these visualizations, a web analyst can easily identify which parts of the website are cold parts with few hits and which are hot parts with many hits, and classify them accordingly. This also paves the way for making exploratory changes to the website, e.g., adding links from hot parts of the web site to cold parts and then extracting, visualizing, and interpreting the changes in access patterns.
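The hot/cold classification can be sketched with a simple hit count over sessionized logs; the threshold and session format below are illustrative assumptions, since in practice the cut-off would be chosen by the analyst.

    from collections import Counter

    sessions = [["/", "/a.html", "/b.html"], ["/", "/a.html"], ["/c.html"]]
    hits = Counter(url for session in sessions for url in session)

    THRESHOLD = 2  # assumed cut-off between "hot" and "cold"
    hot = {url for url, n in hits.items() if n >= THRESHOLD}
    cold = set(hits) - hot
    print("hot:", sorted(hot), "cold:", sorted(cold))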


Figure(xxiii). Amplification of a user session: clickstream (Bottom Left) in a drill-down cylinder, cone scatter (Top Right), and funnel backoff to the main page of the website (Top Right)

A user's browsing access pattern is amplified by a different coloring. Depending on the link structure of the underlying pages, we can see the vertical access pattern of a user drilling down into a cluster, forming a cylinder shape. Users following links down a hierarchy of web pages form a cone shape, and users moving up hierarchies, e.g., back to the main page of the website, form a funnel shape.


11. CONCLUSION

As the web and its usage continue to grow, so does the opportunity to analyze web data and extract all manner of useful knowledge from it. The past few years have seen the emergence of web mining as a rapidly growing area, due to the efforts of the research community as well as the various organizations practicing it. The key component of web mining is the mining process itself. This area of research is so active today because of the tremendous growth of information sources available on the web and the recent interest in e-commerce. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular web site, and help quantify the success of a marketing campaign. This report has described the key computer science contributions made in the field, including an overview of web mining, a taxonomy of web mining, and its prominent successful applications, and has outlined some promising areas of future research.


12. REFERENCES

[1] http://en.wikipedia.org/wiki/Web_mining
[2] http://www.galeas.de/webmining.html
[3] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", SIGKDD Explorations, ACM SIGKDD, Jan 2000.
[4] Miguel Gomes da Costa Júnior, Zhiguo Gong, "Web Structure Mining: An Introduction", Proceedings of the 2005 IEEE International Conference on Information Acquisition.
[5] R. Cooley, B. Mobasher, and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", ICTAI '97.
[6] Brijendra Singh, Hemant Kumar Singh, "Web Data Mining Research: A Survey", IEEE, 2010.
[7] Soumen Chakrabarti, "Mining the Web: Discovering Knowledge from Hypertext Data", Part 2, 2003 edition.
[8] Anthony Scime (ed.), "Web Mining: Applications and Techniques".
