
A Personalized Ontology Model for Web Information Gathering

Abstract:
As a model for knowledge description and formalization, ontologies are
widely used to represent user profiles in personalized web information gathering.
Current web information gathering systems attempt to satisfy user requirements by
capturing their information needs. User profiles represent the concept models
possessed by users when gathering web information. However, when representing
user profiles, many models have utilized only knowledge from either a global knowledge base or from user local information. A personalized ontology model is
proposed for knowledge representation and reasoning over user profiles. This
model learns ontological user profiles from both a world knowledge base and user
local instance repositories. The ontology model is evaluated by comparing it
against benchmark models in web information gathering. The proposed ontology
model provides a solution to emphasizing global and local knowledge in a single
computational model. The findings in this project can be applied to the design of
web information gathering systems. The model also makes extensive contributions to the fields of Information Retrieval, Web Intelligence, Recommendation Systems, and Information Systems.





Introduction
Overview of the Project:

The advent of the Internet has made it a primary source of information, widely used by users from a variety of fields. The need of the hour is to make the process of searching the Internet for information more and more efficient. With the size of the Internet increasing exponentially, the volume of data to be crawled increases proportionally, and it becomes increasingly necessary to have appropriate crawling mechanisms in order to make crawls efficient. Search engines have to answer millions of queries every day, which makes engineering a search engine a highly challenging task. Search engines primarily perform three basic tasks: they search the Internet or select pages based on important words; they keep an index of the words they find and where they find them; and they allow users to look for words or combinations of words found in that index. A web crawler is a computer program that browses the Internet in a methodical, automated manner. The crawler typically follows links, grabbing content from websites and adding it to the search engine's indexes. The World Wide Web provides a vast source of information of almost all types; however, this information is often scattered among many web servers and hosts, using many different formats, while users want the best possible search results in the least time. In this paper, we introduce the working of a focused ontology-based crawler, which is combined with a procedure for detecting copyright infringement. Any crawler must consider two issues. First, the crawler should have the capability to plan, i.e., a strategy to decide which pages to download next. Second, it needs a highly optimized and robust system architecture so that it can download a large number of pages per second while remaining resilient against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on deciding which important pages the crawler should fetch first. In contrast, less work has been done on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although the workings and documentation of these systems usually remain proprietary. The crawling system that is known in detail and described in the literature is the Mercator system, which was used by AltaVista. It is easy to build a crawler that works slowly and downloads a few pages per second for a short period of time; in contrast, it is a big challenge to achieve good system design, I/O and network efficiency, robustness, and manageability at scale. Every search engine is divided into different modules, and among these the user interface module is the one the search engine relies on the most, because it helps present the best possible results to the user.

Crawlers are small programs that browse the web on the search engine's behalf, similarly to how a human user would follow links to reach different pages. The programs are given a set of starting seed URLs, whose pages they retrieve from the web. The crawler extracts the URLs appearing in the retrieved pages and gives this information to the crawler control module. This module determines which links to visit next, and feeds the links to visit back to the crawlers. The crawler also passes the retrieved pages into a page repository. Crawlers continue visiting the web until local resources, such as storage, are exhausted.
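To make this crawling loop concrete, the following is a minimal single-threaded sketch in Java, the implementation language used in this project. It is only an illustration of the frontier/fetch/extract cycle described above, not the crawler used in the model: the regular-expression link extraction, the in-memory frontier, and the page limit are simplifying assumptions, and a real crawler would add politeness delays, robots.txt handling, and a persistent page repository.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(String seed, int maxPages) throws Exception {
        Deque<String> frontier = new ArrayDeque<String>();   // plays the role of the crawler control module
        Set<String> visited = new HashSet<String>();         // pages already fetched
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                  // skip URLs seen before
            String page = fetch(url);                         // retrieved page would go to the page repository
            Matcher m = LINK.matcher(page);
            while (m.find()) frontier.add(m.group(1));        // extracted URLs are fed back to the frontier
        }
    }

    private static String fetch(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) sb.append(line).append('\n');
        in.close();
        return sb.toString();
    }
}

For example, crawl("http://example.com", 100) would fetch up to 100 pages reachable from the seed; the seed URL here is only a placeholder.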






Literature Survey:

Over the last decade, the amount of web-based information available has increased dramatically. How to gather useful information from the web has become a challenging issue for users. Current web information gathering systems attempt to satisfy user requirements by capturing their information needs. For this purpose, user profiles are created for user background knowledge description. User profiles represent the concept models possessed by users when gathering web information. A concept model is implicitly possessed by users and is generated from their background knowledge. While this concept model cannot be proven in laboratories, many web ontologists have observed it in user behavior. When users read through a document, they can easily determine whether or not it is of interest or relevance to them, a judgment that arises from their implicit concept models. If a user's concept model can be simulated, then a superior representation of user profiles can be built. To simulate user concept models, ontologies, a model for knowledge description and formalization, are utilized in personalized web information gathering. Such ontologies are called ontological user profiles or personalized ontologies. To represent user profiles, many researchers have attempted to discover user background knowledge through global or local analysis. Global analysis uses existing global knowledge bases for user background knowledge representation. Commonly used knowledge bases include generic ontologies (e.g., WordNet), thesauruses (e.g., digital libraries), and online knowledge bases (e.g., online categorizations and Wikipedia). Global analysis techniques produce effective performance for user background knowledge extraction. However, global analysis is limited by the quality of the knowledge base used. For example, WordNet has been reported as helpful for capturing user interest in some areas but useless for others.
Local analysis investigates user local information or observes user behavior in user profiles. In some works, users were provided with a set of documents and asked for relevance feedback. User background knowledge was then discovered from this feedback for user profiles. However, because local analysis techniques rely on data mining or classification techniques for knowledge discovery, the discovered results occasionally contain noisy and uncertain information. As a result, local analysis suffers from ineffectiveness at capturing formal user knowledge. From this, we can hypothesize that user background knowledge can be better discovered and represented if global and local analysis are integrated within a hybrid model. The knowledge formalized in a global knowledge base constrains the background knowledge discovery from the user local information. Such a personalized ontology model should produce a superior representation of user profiles for web information gathering. An ontology model to evaluate this hypothesis is proposed. This model simulates users' concept models by using personalized ontologies, and attempts to improve web information gathering performance by using ontological user profiles. The world knowledge and a user's local instance repository (LIR) are used in the proposed model. World knowledge is commonsense knowledge acquired by people from experience and education. An LIR is a user's personal collection of information items. From a world knowledge base, we construct personalized ontologies by adopting user feedback on interesting knowledge. A multidimensional ontology mining method, Specificity and Exhaustivity, is also introduced in the proposed model for analyzing concepts specified in ontologies. The user's LIRs are then used to discover background knowledge and to populate the personalized ontologies. The proposed ontology model is evaluated by comparison against some benchmark models through experiments using a large standard data set. The evaluation results show that the proposed ontology model is successful. The research contributes to knowledge engineering, and has the potential to improve the design of personalized web information gathering systems. The contributions are original and increasingly significant, considering the rapid explosion of web information and the growing accessibility of online documents.
Global knowledge bases were used by many existing models to learn ontologies for web information gathering. On the basis of the Dewey Decimal Classification, King et al. developed IntelliOnto to improve performance in distributed web information retrieval. Wikipedia was used by Downey et al. These works effectively discovered user background knowledge; however, their performance was limited by the quality of the global knowledge bases. Aiming at learning personalized ontologies, many works mined user background knowledge from user local information. Pattern recognition and association rule mining techniques have been used to discover knowledge from user local documents for ontology construction, and OntoLearn has been used to discover semantic concepts and relations from web documents. Web content mining techniques were used by Jiang and Tan to discover semantic knowledge from domain-specific text documents for ontology learning. Finally, Shehata et al. captured user information needs at the sentence level rather than the document level, and represented user profiles by the Conceptual Ontological Graph. The use of data mining techniques in these models led to more user background knowledge being discovered. However, the knowledge discovered in these works contained noise and uncertainties. Additionally, ontologies were used in many works to improve the performance of knowledge discovery; using a fuzzy domain ontology extraction algorithm, a mechanism was developed by Lau et al. These works attempted to explore a route to model world knowledge more efficiently.

An Integrated Architecture for Personalized Query Expansion in Web
Search Alexander Salamanca and Elizabeth Leon

Personalization has been seen as one of the most promising trends for significantly improving the search experience on the Web in the near future. The main idea is to deliver quality results prepared uniquely for different users, who do not necessarily share the same long-term interests with other people. The approach described in this paper exploits relations between keywords of the user's search history and a more general set of keywords that expand the user's search scope. It works through a three-stage cycle, consisting of extracting key terms by local analysis, extracting further key terms through an automatic recommendation system, and applying an algorithm to personalize the final list of suggested terms. Thus, we can produce high-quality and relevant query suggestions, i.e., reformulations of the user intent verbalized as a web query, which increase the chances of retrieving better results.

Query expansion is the process of adding additional terms to a user's original query, with the purpose of improving retrieval performance (Efthimiadis 1995). Although query expansion can be conducted manually by the searcher, or automatically by the information retrieval system, the focus here is on interactive query expansion, which provides computer support for users to make choices that result in the expansion of their queries. A common method for query expansion is the relevance feedback technique (Salton 1979), in which users judge the relevance of the results of a search. This information is then used to modify the vector-model query, with increased weights on the terms found in the relevant documents. In (Salton and Buckley 1990), these techniques were shown to be effective. A comprehensive literature review of approaches for extracting keywords based on term frequency, document frequency, and similar statistics is presented in (Efthimiadis 1995).
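As an illustration of this relevance feedback step, the following is a minimal sketch of a classical Rocchio-style query update in Java. The alpha, beta, and gamma weighting parameters and the dropping of non-positive weights are common illustrative choices, not values prescribed by the cited works.

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class RocchioFeedback {
    // q' = alpha*q + (beta/|Dr|) * sum(relevant docs) - (gamma/|Dn|) * sum(non-relevant docs)
    public static Map<String, Double> expand(Map<String, Double> query,
            List<Map<String, Double>> relevant, List<Map<String, Double>> nonRelevant,
            double alpha, double beta, double gamma) {
        Map<String, Double> expanded = new HashMap<String, Double>();
        add(expanded, query, alpha);
        for (Map<String, Double> doc : relevant)
            add(expanded, doc, beta / relevant.size());
        for (Map<String, Double> doc : nonRelevant)
            add(expanded, doc, -gamma / nonRelevant.size());
        // Terms driven to a non-positive weight are usually dropped before reissuing the query.
        Iterator<Map.Entry<String, Double>> it = expanded.entrySet().iterator();
        while (it.hasNext())
            if (it.next().getValue() <= 0.0) it.remove();
        return expanded;
    }

    // Accumulate a scaled term-weight vector into the expanded query.
    private static void add(Map<String, Double> acc, Map<String, Double> vec, double scale) {
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            Double old = acc.get(e.getKey());
            acc.put(e.getKey(), (old == null ? 0.0 : old) + scale * e.getValue());
        }
    }
}

The highest-weighted terms of the expanded vector can then be shown to the user as interactive expansion suggestions.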


Link Analysis Ranking: Algorithms, Theory, and Experiments
ALLAN BORODIN

Ranking is an integral component of any information retrieval system. In the
case of Web search, because of the size of the Web and the special nature of the
Web users, the role of ranking becomes critical. It is common for Web search
queries to have thousands or millions of results. On the other hand, Web users do
not have the time and patience to go through them to find the ones they are
interested in. It has actually been documented [Broder 2002; Silverstein et al.
1998; Jansen et al. 1998] that most Web users do not look beyond the first page of
results. Therefore, it is important for the ranking function to output the desired
results within the top few pages, otherwise the search engine is rendered useless.
Furthermore, the needs of the users when querying the Web are different from
traditional information retrieval. For example, a user that poses the query
microsoft to a Web search engine is most likely looking for the home page of
Microsoft Corporation, rather than the page of some random user that complains
about Microsoft products. In a traditional information retrieval sense, the page
of the random user may be highly relevant to the query.
However, Web users are most interested in pages that are not only relevant,
but also authoritative, that is, trusted sources of correct information that have a
strong presence in the Web. In Web search, the focus shifts from relevance to
authoritativeness. The task of the ranking function is to identify and rank highly
the authoritative documents within a collection of Web pages. To this end, the Web
offers a rich context of information which is expressed through the hyperlinks. The
hyperlinks define the context in which a Web page appears. Intuitively, a link
from page p to page q denotes an endorsement for the quality of page q. We can
think of the Web as a network of recommendations which contains information
about the authoritativeness of the pages. The task of the ranking function is to
extract this latent information and produce a ranking that reflects the relative
authority of Web pages. Building upon this idea, the seminal papers of Kleinberg
[1998], and Brin and Page [1998] introduced the area of Link Analysis Ranking,
where hyperlink structures are used to rank Web pages.
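To make the link analysis idea concrete, the following is a minimal sketch of the PageRank power iteration in the spirit of Brin and Page [1998]. The uniform teleportation term, the treatment of dangling pages, and the fixed iteration count are simplifying assumptions of this sketch.

import java.util.List;

public class PageRank {
    // links.get(i) is the list of pages that page i links to.
    // Returns an authority score for every page after a fixed number of iterations.
    public static double[] compute(List<List<Integer>> links, double damping, int iterations) {
        int n = links.size();
        double[] rank = new double[n];
        for (int i = 0; i < n; i++) rank[i] = 1.0 / n;               // uniform starting distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) next[i] = (1.0 - damping) / n; // random-jump term
            for (int p = 0; p < n; p++) {
                List<Integer> out = links.get(p);
                if (out.isEmpty()) {
                    // Dangling page: spread its rank uniformly over all pages.
                    for (int q = 0; q < n; q++) next[q] += damping * rank[p] / n;
                } else {
                    for (int q : out) next[q] += damping * rank[p] / out.size();
                }
            }
            rank = next;
        }
        return rank;
    }
}

A typical call might use a damping factor of 0.85 and a few dozen iterations; hub and authority scores in the style of Kleinberg [1998] can be computed with a similar iterative scheme over the same link structure.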


Learning to Map between Ontologies on the Semantic Web
AnHai Doan, Jayant Madhavan, Pedro Domingos, and Alon Halevy

The current World-Wide Web has well over 1.5 billion pages [3], but the vast majority of them are in human-readable format only (e.g., HTML). As a consequence software agents (softbots) cannot understand and process this information, and much of the potential of the Web has so far remained untapped. In response, researchers have created the vision of the Semantic Web [6], where data has structure and ontologies describe the semantics of the data. Ontologies allow users to organize information into taxonomies of concepts, each with their attributes, and describe relationships between concepts. When data is marked up using ontologies, softbots can better understand the semantics and therefore more intelligently locate and integrate data for a wide variety of tasks. The following example illustrates the vision of the Semantic Web.

On the World-Wide Web of today you will have trouble finding this person. The above information is not contained within a single Web page, thus making keyword search ineffective. On the Semantic Web, however, you should be able to quickly find the answers. A marked-up directory service makes it easy for your personal softbot to find nearby Computer Science departments. These departments have marked up data using some ontology. Here the data is organized into a taxonomy that includes courses, people, and professors. Professors have attributes such as name, degree, and degree-granting institution. Such marked-up data makes it easy for your softbot to find a professor with the last name Cook. Then by examining the attribute "granting institution", the softbot quickly finds the alma mater CS department in Australia.


Ontology-Based Personalized Search and Browsing
Susan Gauch, Jason Chaffee
The Web has experienced continuous growth since its creation. As of March
2002, the largest search engine contained approximately 968 million indexed pages
in its database [SES 02]. As the number of Internet users and the number of
accessible Web pages grows, it is becoming increasingly difficult for users to find
documents that are relevant to their particular needs. Users of the Internet basically
have two ways to find the information for which they are looking: they can browse
or they can search with a search engine. Browsing is usually done by clicking
through a hierarchy of concepts, or ontology, until the area of interest has been
reached. The corresponding node then provides the user with links to related Web
sites. Search engines allow users to enter keywords to retrieve documents that
contain these keywords. The browsing and searching algorithms are essentially the
same for all users. The ontologies that are used for browsing content at a Web site
are generally different for each site that a user visits. Even if there are similarly
named concepts in the ontology, they may contain different types of pages.
Frequently, the same concepts will appear with different names and/or in different
areas of the ontology. Not only are there differences between sites, but between
users as well. One user may consider a certain topic to be an Arts topic, while a
different user might consider the same topic to be a Recreation topic. Thus,
although browsing provides a very simple mechanism for information navigation,
it can be time consuming for users when they take the wrong paths through the
ontology in search of information. The alternate navigation strategy, search, has its
own problems. Indeed, approximately one half of all retrieved documents have
been reported to be irrelevant [Casasola 98]. One of the main reasons for obtaining
poor search results is that many words have multiple meanings [Krovetz 92]. For
instance, two people searching for wildcats may be looking for two completely
different things (wild animals and sports teams), yet they will get exactly the same
results. It is highly unlikely that the millions of users with access to the Internet are
so similar in their interests that one approach to browsing or searching,
respectively, fits all needs. What is needed is a solution that will personalize the
information selection and presentation for each user. This paper explores the
OBIWAN project's use of ontologies as the key to providing personalized information access. Our goal is to automatically create ontology-based user profiles and to use these profiles both to personalize the results from an Internet search engine and to create personalized navigation hierarchies of remote Web sites.


SYSTEM ANALYSIS

Existing System:
User profiles represent the concept models possessed by users when
gathering web information. A concept model is implicitly possessed by users and is
generated from their background knowledge. While this concept model cannot be
proven in laboratories, many web ontologists have observed it in user behavior.
When users read through a document, they can easily determine whether or not it is
of interest or relevance to them, a judgment that arises from their implicit concept models. If a user's concept model can be simulated, then a superior
representation of user profiles can be built.


Drawbacks:
It is a blind search.
If the search space is large, then the search performance will be poor.
Using web documents for training sets has one severe drawback: web information contains much noise and many uncertainties.







Proposed System:

In our project, an ontology model to evaluate this hypothesis is proposed. This model simulates users' concept models by using personalized ontologies, and attempts to improve web information gathering performance by using ontological user profiles. The world knowledge and a user's local instance repository (LIR) are used in the proposed model. World knowledge is commonsense knowledge acquired by people from experience and education; an LIR is a user's personal collection of information items. From a world knowledge base, we construct personalized ontologies by adopting user feedback on interesting knowledge. A multidimensional ontology mining method, Specificity and Exhaustivity, is also introduced in the proposed model for analyzing concepts specified in ontologies. The user's LIRs are then used to discover background knowledge and to populate the personalized ontologies. The proposed ontology model is evaluated by comparison against some benchmark models through experiments using a large standard data set. The evaluation results show that the proposed ontology model is successful.
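To illustrate how such a multidimensional analysis could be computed, the sketch below assumes that the Exhaustivity of a subject aggregates the instance support of the subject and its descendants, while Specificity decreases as a subject's subtree becomes broader. These formulas, the Subject class, and the 0.5 discount factor are simplified assumptions made for illustration only; the exact definitions used in the model may differ.

import java.util.List;

public class OntologyMining {
    // A subject node in the personalized ontology.
    public static class Subject {
        String label;
        List<Subject> children;
        double support;   // support gathered from the user's LIR instances

        Subject(String label, List<Subject> children, double support) {
            this.label = label; this.children = children; this.support = support;
        }
    }

    // Exhaustivity: how broadly a subject (with its descendants) covers the user's interest.
    public static double exhaustivity(Subject s) {
        double sum = s.support;
        for (Subject child : s.children) sum += exhaustivity(child);
        return sum;
    }

    // Specificity: how focused a subject is; leaf subjects are the most specific here.
    public static double specificity(Subject s) {
        if (s.children.isEmpty()) return 1.0;
        double total = 0.0;
        for (Subject child : s.children) total += specificity(child);
        // A subject is assumed less specific than the average of its children by a constant factor.
        return 0.5 * (total / s.children.size());
    }
}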








Advantages:
A multidimensional ontology mining method, Specificity and Exhaustivity, is
also introduced in the proposed model for analyzing concepts specified in
ontologies.
The user's LIRs are then used to discover background knowledge and to populate the personalized ontologies.

DEVELOPMENT ENVIRONMENT
SYSTEM CONFIGURATION

HARDWARE REQUIREMENTS
Processor : 733 MHz Pentium III Processor
RAM : 128 MB
Hard Drive : 10 GB
Monitor : 14" VGA Color Monitor
Mouse : Logitech Serial Mouse
Disk Space : 1 GB

SOFTWARE REQUIREMENTS:
Platform : JDK 1.6
Operating System : Microsoft Windows NT 4.0, Windows 2000, or XP
Programming Language : Java
Database : MySQL 5.1
Tools : NetBeans 6.8, SQLyog



Module Description:

Modules:
Knowledge Based Grouping
User Query for Category Identification
Algorithm Implementation
Gathering Information





1. Knowledge Based Grouping:
From a world knowledge base and the user's personal collection of information items, we construct personalized ontologies by adopting user feedback on interesting knowledge. Analyzing the concepts specified in ontologies has been used in many works to improve the performance of knowledge discovery, and such a personalized ontology model should produce a superior representation of user profiles for web information gathering. The user requests a URL for gathering information through the ontology model, which partitions the relevant links given by the user. Information gathered from the relevant URLs is separated into categories such as general, education, health, and entertainment; a simplified sketch of this partitioning is given below.
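The following sketch shows one simple way such category partitioning could be performed by keyword matching. The hard-coded keyword lists are purely illustrative assumptions; in the model, the category evidence would come from the world knowledge base and the user's feedback rather than from fixed lists.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryPartitioner {
    // Illustrative keyword lists; in the model these would come from the world knowledge base.
    private static final Map<String, List<String>> CATEGORY_KEYWORDS = new HashMap<String, List<String>>();
    static {
        CATEGORY_KEYWORDS.put("education", Arrays.asList("university", "course", "tutorial", "lecture"));
        CATEGORY_KEYWORDS.put("health", Arrays.asList("medicine", "doctor", "disease", "hospital"));
        CATEGORY_KEYWORDS.put("entertainment", Arrays.asList("movie", "music", "game", "sport"));
    }

    // Assigns the page text of a user-given URL to the best-matching category, or "general".
    public static String categorize(String pageText) {
        String text = pageText.toLowerCase();
        String best = "general";
        int bestHits = 0;
        for (Map.Entry<String, List<String>> e : CATEGORY_KEYWORDS.entrySet()) {
            int hits = 0;
            for (String keyword : e.getValue())
                if (text.contains(keyword)) hits++;
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;
    }
}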



2. User Query for Category Identification:
User profiles are used in web information gathering to interpret the semantic meanings of queries and to capture user information needs. User profiles can be categorized into four groups: general, education, health, and entertainment. When a user profile is matched, the given URL link is used for gathering information through the corresponding category. If the user's query concerns an education-oriented link, for example, the link is used to identify the category, and the gathered information is saved under that particular category. The user query collects information by searching the subject catalog of the user-given link. The categories are used to separate the links, and the gathered information is then saved.

3. Algorithm Implementation:
User background knowledge can be discovered from user local information collections, such as a user's stored documents, browsed web pages, and composed or received relevant details. The constructed ontology has only subject labels and semantic relations specified; we populate the ontology with the instances generated from user local information collections. Classification is another strategy to map the unstructured or semi-structured documents in a user profile to the representation in the global knowledge base. Because ontology mapping and text classification/clustering are beyond the scope of the work presented in our project, the present work assumes that all user local instance repositories have content-based descriptors referring to the subjects; however, a large volume of documents existing on the web may not have such content-based descriptors. For this problem, strategies like ontology mapping and text classification/clustering have been suggested, and they will be investigated in future work. Here we use the naive Bayes algorithm for text classification when gathering information, and a hierarchical clustering algorithm to cluster the data from the link given by the user.
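The following is a minimal sketch of a naive Bayes text classifier of the kind mentioned above. The whitespace tokenization, the per-category vocabulary size, and the add-one smoothing are simplifying assumptions to keep the example self-contained.

import java.util.HashMap;
import java.util.Map;

public class NaiveBayesClassifier {
    private final Map<String, Integer> docCounts = new HashMap<String, Integer>();   // documents per category
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<String, Map<String, Integer>>();
    private final Map<String, Integer> totalWords = new HashMap<String, Integer>();
    private int totalDocs = 0;

    // Train on one document belonging to the given category.
    public void train(String category, String text) {
        totalDocs++;
        docCounts.put(category, get(docCounts, category) + 1);
        Map<String, Integer> counts = wordCounts.get(category);
        if (counts == null) { counts = new HashMap<String, Integer>(); wordCounts.put(category, counts); }
        for (String word : text.toLowerCase().split("\\s+")) {
            counts.put(word, get(counts, word) + 1);
            totalWords.put(category, get(totalWords, category) + 1);
        }
    }

    // Returns the category with the highest log-posterior for the given text.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(category) / totalDocs);   // log prior
            Map<String, Integer> counts = wordCounts.get(category);
            int vocab = counts.size();
            for (String word : text.toLowerCase().split("\\s+")) {
                // Add-one (Laplace) smoothing so unseen words do not zero out the score.
                score += Math.log((get(counts, word) + 1.0) / (totalWords.get(category) + vocab));
            }
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }

    private static int get(Map<String, Integer> map, String key) {
        Integer v = map.get(key);
        return v == null ? 0 : v;
    }
}

For example, after calling train("education", ...) and train("health", ...) on labeled documents, classify(pageText) returns the category with the highest posterior probability; a hierarchical clustering step could be sketched similarly over the same term vectors.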





4. Gathering Information:
Good URLs are the concepts relevant to the information need, and bad URLs are the concepts resolving to paradoxical or ambiguous interpretations of the information need. The user wants to retrieve web information from the given URL link. In the gathering process, each URL undergoes checking and validation to identify it as a good URL or a bad URL. Information is gathered only from the good URLs; the bad URLs are omitted by the validation process. A bad URL is an irrelevant link and is not considered by the web information gathering process when requesting pages. After URL checking, the information is downloaded. The downloaded links are processed by the algorithms and the URL validation process, so the gathered information contains only relevant details.
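The following is a minimal sketch of the good/bad URL checking step. It reduces the notion of a bad URL to malformed links, unreachable hosts, and non-HTML responses, which is only one possible reading of the validation process described above; relevance-based filtering against the ontology is not shown.

import java.net.HttpURLConnection;
import java.net.URL;

public class UrlValidator {
    // Returns true if the link is well formed, reachable, and serves HTML content.
    public static boolean isGoodUrl(String link) {
        try {
            URL url = new URL(link);                      // rejects malformed URLs
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("HEAD");                // check without downloading the body
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            int status = conn.getResponseCode();
            String type = conn.getContentType();
            return status >= 200 && status < 400 && type != null && type.contains("text/html");
        } catch (Exception e) {
            return false;                                 // unreachable or invalid: treat as a bad URL
        }
    }
}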














Architecture Diagram:












Clear Overall DFD (Data Flow Diagram)






[Data flow diagram: Read URL -> Pattern Recognition -> URL Validation Checking -> Identification Process -> Downloading Process -> Identify the Required URL -> Get the Required URL]


Data Flow Diagram










[Diagram: web sites W1-W4, facts f1-f4, and objects o1-o2 (Web sites, Facts, Objects)]
Data Flow Diagram:


























[Data flow diagram: Input -> Enter the URL -> Searching URL -> classify links (internal URLs in gathering order, internal URLs in order of size, external URLs, other URLs, bad URLs, exceptions, CSS classes) -> Yes/No decision -> Create Output as HTML files -> Exit]

References:

[1] B. Amento, L.G. Terveen, and W.C. Hill, "Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents," Proc. ACM SIGIR '00, July 2000.
[2] M. Blaze, J. Feigenbaum, and J. Lacy, "Decentralized Trust Management," Proc. IEEE Symp. Security and Privacy (ISSP '96), May 1996.
[3] A. Borodin, G.O. Roberts, J.S. Rosenthal, and P. Tsaparas, "Link Analysis Ranking: Algorithms, Theory, and Experiments," ACM Trans. Internet Technology, vol. 5, no. 1, pp. 231-297, 2005.
[4] J.S. Breese, D. Heckerman, and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," technical report, Microsoft Research, 1998.
[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins, "Propagation of Trust and Distrust," Proc. 13th Int'l Conf. World Wide Web (WWW), 2004.
[6] G. Jeh and J. Widom, "SimRank: A Measure of Structural-Context Similarity," Proc. ACM SIGKDD '02, July 2002.
[7] J.M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[8] "Logistic Equation," Wolfram MathWorld, http://mathworld.wolfram.com/LogisticEquation.html, 2008.
[9] T. Mandl, "Implementation and Evaluation of a Quality-Based Search Engine," Proc. 17th ACM Conf. Hypertext and Hypermedia, Aug. 2006.
[10] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," technical report, Stanford Digital Library Technologies Project, 1998.


Sites Referred:
http://java.sun.com
http://www.sourcefordgde.com
http://www.networkcomputing.com/









Conclusion:

In this paper, an ontology model is proposed for representing user
background knowledge for personalized web information gathering. The model
constructs user personalized ontologies by extracting world knowledge from the
LCSH system and discovering user background knowledge from user local
instance repositories. A multidimensional ontology mining method, Specificity and Exhaustivity, is also introduced for user background knowledge discovery. In
evaluation, the standard topics and a large test bed were used for experiments. The
model was compared against benchmark models by applying it to a common
system for information gathering. The experimental results demonstrate that our
proposed model is promising. A sensitivity analysis was also conducted for the
ontology model. In this investigation, we found that the combination of global and
local knowledge works better than using any one of them. In addition, the ontology
model using knowledge with both is-a and part-of semantic relations works better
than using only one of them. When using only global knowledge, these two kinds of relations contribute equally to the performance of the ontology model. When using both global and local knowledge, the knowledge with part-of relations is more important than that with is-a relations.
The proposed ontology model in this paper provides a solution to
emphasizing global and local knowledge in a single computational model. The
findings in this paper can be applied to the design of web information gathering
systems. The model also makes extensive contributions to the fields of Information Retrieval, Web Intelligence, Recommendation Systems, and Information Systems.


Future Work:

In our future work, we will investigate the methods that generate user
local instance repositories to match the representation of a global knowledge base.
The present work assumes that all user local instance repositories have content-
based descriptors referring to the subjects; however, a large volume of documents
existing on the web may not have such content-based descriptors. For this problem,
strategies like ontology mapping and text classification/clustering were suggested.
These strategies will be investigated in future work to solve this problem. The
investigation will extend the applicability of the ontology model to the majority of
the existing web documents and increase the contribution and significance of the
present work.
