
ISBN 978-952-5726-06-0

Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009)
Qingdao, China, November 21-22, 2009

Personalized Intelligent Search Engine Based on Web Data Mining
Hong Zhang1, Yanhong Ma2, Qiuyu Zhang1, Pengshou Xie1, Zhongxian Bao1
1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, P.R. China.
2. Gansu Electric Power Corporation Wind Power Technology Center, Lanzhou 730050, P.R. China.

Abstract—This paper introduces and discusses a personalized intelligent search engine based on web data mining and establishes its theoretical model. Its critical part, the algorithm and procedure for mining users' interests, is described in detail. By creating a users' interest database, the system realizes personalized information retrieval, searching for information according to each user's requests. As a result, the recall and precision ratios of the search engine are substantially improved.

Index terms—data mining, personalization, users' interests, users' pattern, search engine

This work is supported by Gansu Provincial Natural Science Foundation (No. 2007GS04864)

I. INTRODUCTION
Search engines have long been a hot research topic, largely because current search engines have many limitations [1]. Here are some examples.
i. A search engine cannot truly "understand" what users want to search for; it can only mechanically match the keywords or the sentence that the user has entered [2].
ii. A search engine is not personalized [3]. No matter who the searcher is (a researcher, a businessperson, a student, a doctor and so on), as long as the input keywords are the same, the returned results are the same.
iii. A search engine is not interactive. Users may want to express their own wishes in response to the returned results, but they cannot do so [4].
To overcome these shortcomings of traditional search engines, an intelligent, special-purpose tool for retrieving Internet information urgently needs to be developed to help users quickly get the information they need from the Internet. On this basis, an intelligent, personalized search engine based on Web data mining is proposed [5]. It mines the user's web history and tracks the user's web activity through web data mining to create a users' interest pattern database, in which each user's interests and hobbies are stored. The interest pattern database is used to filter the user's initial query results [6], so that the information that meets users' needs is returned to them and the system realizes personalized information retrieval.

II. THEORETICAL MODEL OF PERSONALIZED INTELLIGENT SEARCH ENGINE BASED ON WEB DATA MINING [7]
The theoretical model of the personalized intelligent search engine based on web data mining is shown in Fig. 1. As Fig. 1 shows, it contains five components: gathering information, processing and indexing information, retrieving and matching information, creating the users' interests database, and the user interface. Each is described as follows.

A. Gathering Information
Starting from an initial URL and using a standard transport protocol, the Robot (that is, a Web spider or crawler) traverses the WWW space by following the hyperlinks in each web page, gathers web page information and stores it in the web page database [8]. Compared with other robots, this Robot can discover dead links and find newly added links by scanning the web space constantly. There are two ways of obtaining initial URLs: the Robot collects them itself at regular intervals, or web sites submit them.

B. Processing And Indexing Information
The goal of this step is to extract the available information from the web pages gathered by the Robot, index it, and build an Index Database of web pages that users can search. To a large extent, it is the Index Database that determines the quality of a search engine, so its design is pivotal. To put this principle into practice, two measures are introduced in this system: a knowledge database technique and a new weighting algorithm, which is introduced in reference [9].

C. Retrieving And Matching Information
When users retrieve information, their interests or hobbies are taken into account [10]. Retrieval is therefore highly specific to the user rather than a simple match against the user's input. According to the query input and the user's interests or hobbies, the system constructs the user's search vector [11]. That is, it refers to the user's interests to determine whether the retrieved information is precise and satisfies the user. In addition, it can actively push information by judging the user's interests, which is somewhat like the relevant-information push of the web site Baidu. There is, however, an essential difference between them: compared with the information push of Baidu, this information push pays more attention to users' interests. It is not only

© 2009 ACADEMY PUBLISHER


AP-PROC-CS-09CN004 584
an information push that resembles the user's retrieval input.

D. Creating Users' Interests Database
As we all know, if the system is to achieve personalized information retrieval, the first task is to know what the user's interests are. The system must therefore provide functions for storing the user's interests, feeding back the user's interests, reasoning about and judging the user's interests, and so on [12]. The user's pattern is used to solve this problem. This module mainly integrates the client user's web information and feedback information and mines the user's web log to obtain the user's interests or hobbies, using a dedicated web data mining algorithm that is presented in the following text. A users' interests database is then created. Because a user's interests may change over time, the users' interests database must be updated according to the user's web record. It addresses problems such as the following: first, the information received is difficult to understand or is not precise; second, users do not know how to express their requirements for Internet resources appropriately or how to find the information they need effectively.
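As a rough illustration of a users' interests database that is updated from the user's web record, the following Python sketch keeps one table of interest words per user. The class name `InterestDatabase`, the `decay` parameter and the update rule (age old interests, then reinforce the words seen in the new record) are assumptions made for this example only; they are not the mining algorithm of this paper.

```python
from collections import defaultdict


class InterestDatabase:
    """Toy in-memory store: user id -> {interest word: interest degree}.

    Hypothetical structure for illustration; the update rule is an assumption.
    """

    def __init__(self, decay=0.9):
        self.decay = decay                # how much of an old interest survives one update
        self._store = defaultdict(dict)   # user id -> {word: degree}

    def update_from_record(self, user_id, words):
        """Age the stored interests, then reinforce the words seen in a new web record."""
        interests = self._store[user_id]
        for word in interests:
            interests[word] *= self.decay
        for word in words:
            interests[word] = interests.get(word, 0.0) + (1.0 - self.decay)

    def top_interests(self, user_id, k=3):
        """Return the k strongest (word, degree) pairs for a user."""
        ranked = sorted(self._store[user_id].items(),
                        key=lambda item: item[1], reverse=True)
        return ranked[:k]
```

Calling `update_from_record` once per visited page keeps recent interests strong while older ones fade, which is one simple way to reflect the remark that a user's interests may change over time.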

Fig. 1 Theoretical model of personalized intelligent search engine based on web data mining

E. User Interface
The user interface uses a browser, such as Internet Explorer, to exchange data between users and servers. Users enter query requirements, initial information and feedback information at the client, and the results are also returned to users through the browser. By downloading a Java Applet, the client communicates with the server to deliver users' feedback and transfer results. Users can evaluate the retrieval results with ratings such as best, better, good, no good and so on. These evaluations are fed back to the system to adjust the user's interest information, so user interests are updated continually and always kept up to date. On the retrieval interface, users can also express their interests themselves and correct or renew their interests database.
The personalized intelligent search engine based on web data mining is designed to realize personalized information retrieval, in which different users issuing the same query receive different results, and a user issuing the same query at different times also receives different results.
Thus, the personalized intelligent search engine based on web data mining is a typical case of applying web data mining technology to a personalized intelligent search engine. In the system, the most critical problem is to create the user's interest pattern database by using data mining techniques. Once the user's interest pattern database is created, the system can combine the user's interests with his retrieval input to provide him with more accurate and personalized retrieval results. Therefore, this paper gives the detailed process of using web data mining techniques to create the users' interests pattern database.

III. CREATING USERS' INTERESTS PATTERN DATABASE
Each web server keeps information about users' accesses to it. This information is usually called the web log; it includes web server access logs, proxy server log records, browser log records, users' brief introductions, users' registration information, users' dialogue or transaction information, and so on [13]. The target of web data mining is to find users' access patterns in vast amounts of web log data and, finally, to dig out the available user information.
To obtain users' pattern information and update it in real time, the system takes two steps: establishing the users' interests model and mining users' interests.

A. Establishing User Interests Model
As we all know, the real intent of this system is to achieve personalized information retrieval, so a data model must be created for it. In this paper, the users' interest model is expressed by an ordered triple: interested word, word weight, word fresh degree. Each interest node is marked with a triple $(p_i, w_i, x_i)$, abbreviated Node$(p_i)$ [14].
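As a concrete picture of the triple, the following Python sketch holds one interest node as a small record. It also evaluates the weight, the fresh degree and the interest degree in the way the rest of this subsection defines them in Eqs. (1) and (2). The names `InterestNode`, `fresh_degree` and `build_node` are hypothetical, and reading the fresh degree function as f(n) = n/(n+1) is an assumption chosen because the text requires f to be non-decreasing.

```python
from dataclasses import dataclass


@dataclass
class InterestNode:
    word: str         # p_i, the interested word
    weight: float     # w_i, the word weight of Eq. (1)
    freshness: float  # x_i, the word fresh degree of Eq. (2)

    @property
    def interest_degree(self) -> float:
        # t_i = w_i * f(x_i), with the influence function f(x_i) = x_i
        return self.weight * self.freshness


def fresh_degree(n: int) -> float:
    """Assumed fresh degree function f(n) = n / (n + 1), increasing in n."""
    return n / (n + 1)


def build_node(word, location_weights, interest_coeffs):
    """Build Node(p_i) from the buffered documents d_1..d_n.

    location_weights[j-1] is the location weight of the word in document d_j,
    interest_coeffs[j-1] is the interest coefficient E_j in [0, 1].
    """
    contributions = [tf * e for tf, e in zip(location_weights, interest_coeffs)]
    weight = sum(contributions)                      # Eq. (1)
    if weight == 0:
        return InterestNode(word, 0.0, 0.0)          # word never contributed
    freshness = sum(c / weight * fresh_degree(j)     # Eq. (2)
                    for j, c in enumerate(contributions, start=1))
    return InterestNode(word, weight, freshness)
```

For example, `build_node("football", [1.0, 1.0], [0.5, 0.5])` describes a word that appears with equal location weight in two buffered documents; the pair `(node.word, node.interest_degree)` is then the ordered pair that would be stored in the interests database.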

In the above expression, $p_i$ takes values in $P$, written $p_i \in P$, where $P = \{p_1, p_2, \ldots, p_m\}$ is the word set, $p_1, p_2, \ldots, p_m$ are the interested words and $m$ is the number of words. $w_i$ is the weight of interested word $p_i$, and $x_i$ is the fresh degree of word $p_i$.
Because different locations of a word in a document reflect different importance, the location where the word appears is taken into account; this is called the location weight and is denoted $tf^{w}_{i,j}$ [9]. When calculating the fresh degree of words, we apply a fresh degree function $f(n)$ to document $d_n$ ($d_n \in D$, where $n$ refers to the $n$th document in the buffer and $D$ is the document collection in the buffer). The function $f(n)$ is monotonically non-decreasing, which ensures that the more recently a document was visited, the more the user is interested in it. The weight and fresh degree of Node$(p_i)$ are then calculated as follows.

$$\mathrm{Node}(p_i).w_i = \sum_{j=1}^{n} tf^{w}_{i,j} \times E_j \tag{1}$$

$$\mathrm{Node}(p_i).x_i = \sum_{j=1}^{n} \frac{tf^{w}_{i,j} \times E_j}{\mathrm{Node}(p_i).w_i} \times f(j) \tag{2}$$

In the above formulas, $tf^{w}_{i,j}$, $p_i$, $w_i$, $x_i$, $f(n)$ and $n$ are as explained above. $E_j$ ($E_j \in [0, 1]$) is the interest coefficient of document $d_j$, and $f(n)$ can be calculated by the formula $f(n) = \frac{n}{n+1}$. After the weight and fresh degree of word $p_i$ are calculated, the formula $t_i = w_i \times f(x_i)$ is adopted to calculate the interest degree of word $p_i$, where $f(x_i)$ is an influence function of the fresh degree on the weight of word $p_i$, calculated by $f(x_i) = x_i$. Finally, this information is stored in the users' interests database as ordered pairs of interest word and interest degree. The interest degree of words is the ultimate basis for making the search engine intelligent and personalized.

B. Mining Users Interest Pattern
Several data mining methods are used to find users' interests.

a. Mining Association Rule [15]
Through correlation analysis, for example with the Apriori algorithm, relationships hidden among data are uncovered. Here is an example. When mining association rules on web site server logs, we find that, among users who have accessed sports news pages, 70% have accessed the football pages and 15% have accessed the diving pages. The following conclusion can then be drawn: if a user likes sports, we can predict that the probability he likes football is 0.70 and the probability he likes diving is 0.15. So if his query words contain the word sport, the system will push the football pages to him but filter out the diving pages. The system then rectifies the user's interest parameters: the interest degree for football is 0.7 and the interest degree for diving is 0.15.

b. Classification Analysis [16]
In web log mining, the input of classification analysis is a collection of records and several type tags. First, each record is given a type tag. The system then examines these tags and describes the common features of each tag. For example, among users who have submitted MP4 orders, 50% live in large cities and are between 18 and 28 years old. With this information, we can provide pertinent, personalized service to users aged between 18 and 28 who live in large cities.

c. Clustering Analysis [17]
Clustering analysis is different from classification analysis. It is the process of grouping data items or users with similar characteristics. For example, if some users often browse pages about "TOEFL", "GRE", "application" or "visa", these users will be clustered as a group: they may be users who expect to go overseas. The system will therefore send them e-mail about going abroad and provide personalized service to them.

d. Sequential Pattern
Sequential pattern mining finds data items that follow one another in time in time-series data sets. In web log mining, sequential pattern recognition means finding the user's page requests that are successive in time within user sessions. For example, if 60% of the users who order a baby sleeping bag online order baby clothes within 2 months, the system will predict the web pages these users may request and actively provide web pages about baby clothes to users who order baby sleeping bags.
Of course, there are many data mining methods; only some web mining methods are introduced here. Once the user interest patterns are recognized, they must be expressed in a formal language, formed into knowledge or rules, and stored in a knowledge database that can be used when users retrieve information.

IV. CONCLUSIONS
This paper analyzes the development status of search engines, puts forward a new kind of search engine based on web data mining and gives its theoretical model. Each part of the personalized intelligent search engine based on web data mining is described in detail, with emphasis on several web data mining methods used in it. By creating the users' interests pattern database, the system combines the user's interests with his retrieval to achieve personalized information retrieval and an information push service. So

users can quickly get the exact information they want. The recall and precision ratios of the search engine are improved.
Personalized intelligent search engines based on web data mining involve not only data mining techniques but also artificial intelligence, pattern recognition, natural language retrieval, formal description and other related disciplines [18]. Achieving truly personalized information retrieval therefore depends on the development of these related disciplines.

REFERENCES
[1] B. Carterette and R. Jones. Evaluating search engines by modeling the relationship between relevance and clicks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 217–224. 2008.
[2] G. E. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 331–338, 2008.
[3] Morris, M. R., Teevan, J., and Bush, S. (2008). Enhancing collaborative Web search with personalization: Groupization, smart splitting, and group hit-highlighting. Proc. of CSCW ’08.
[4] Danny Sullivan, etc. Fifth Annual Search Engine Meeting Report, Boston, MA, Apr. 1999.
[5] Fu ZhongQian, Wang XinYue, Zhou PeiLing, etc. Realization of intelligent body on network personalized information filter. Computer Application, 2000, 20(3): 26-29.
[6] Q. Mei and K. Church. Entropy of search logs: how hard is search? with personalization? with backoff? In Proceedings of the international conference on Web search and web data mining, pages 45–54, 2008.
[7] Hong Zhang, Yanhong Ma, Qiuyu Zhang. Research on intelligent personalized search engine. ICICT2006:168~172.
[8] Tong ZhaoFeng. Java programming guide for Robot [M]. Beijing: Publishing House of Electronics Industry, 2002.
[9] Hong Zhang, Yanhong Ma, Qiuyu Zhang, Pengshou Xie. Study and Design of Chinese Concept-Based Search Engine. ISCIT2005:38~41.
[10] H. B. Liu and V. Kešelj, “Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users’ future requests,” Data & Knowledge Engineering, Vol. 61, No. 2, May 2007, pp. 304-330.
[11] R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In Proceedings of the 15th international conference on World Wide Web, pages 387–396, 2006.
[12] V. S. Tseng and K. W. Lin, “Efficient mining and prediction of user behavior patterns in mobile web systems,” Information and Software Technology, Vol. 48, No. 6, June 2006, pp. 357-369.
[13] Srivastava J. Web usage mining: Discovery and application of usage patterns from Web data. 2000.
[14] YANG Jing-jing; JU Shi-guang; WANG Xiu-hong. Research of individuation search engine based on web. Computer Engineering and Design, 2008, 29(20):5206~5208.
[15] Zhu Ming. Data Mining. Hefei: China Science & Technology University Press, 2002. 230~231.
[16] Y. Li and N. Zhong. Mining rough association from text documents. In RSCTC, pages 368–377, 2006.
[17] O. Zamir, O. Etzioni. Web Document Clustering: A Feasibility Demonstration. SIGIR, 1998.
[18] Nils J. Nilsson, written; Zheng Kougen, etc., translated. Artificial Intelligence. Beijing: Mechanical Industry Press, 2000. 277~281.

