
Kaustubh S. Raval et al. / International Journal of Engineering Science and Technology (IJEST)

The Anatomy of a Small-Scale Document Search Engine Tool: Incorporating a new Ranking Algorithm
KAUSTUBH S. RAVAL (1)
Research Scholar
raval_kaustubh@yahoo.co.in

RANJEETSINGH S. SURYAWANSHI (2)
Research Scholar
ranjeetsuryawanshi06@gmail.com

J. NAVEENKUMAR (3)
Research Scholar
pro_naveen@hotmail.com

DEVENDRA M. THAKORE (4)
Professor, Department of Computer Engineering
dmthakore@bvucoep.edu.in

(1, 2, 3, 4) Bharati Vidyapeeth Deemed University, College of Engineering, Pune 411043, Maharashtra, INDIA.

Abstract: A search engine is an information retrieval system that helps find information contained in documents stored on a computer system. The results of such a system are usually presented as a list. Search engines are built on text mining, a variation on data mining that refers to the process of deriving high-quality information from unstructured text. In this paper we describe an intelligent agent based search engine tool that takes a keyword from the user, finds the documents matching that keyword, and shows them to the user in the form of links. The tool uses a new ranking algorithm to rank the documents.

Keywords: Search Engine, Text Mining, Intelligent Agent, Ranking Algorithm.

I. Introduction

First of all, we need basic information about the terms on which this work is based.

Search Engine: A search engine is an information retrieval system that helps find information contained in documents stored on a computer system. The results of such a system are usually presented as a list. Search engines provide a user interface to a collection of items that lets users specify criteria (i.e. keywords) about an item of interest and have the engine find the matching items. The keywords are referred to as a search query. The search engine then identifies the desired concept that one or more documents may contain. The list of items that meet the criteria specified by the query is typically sorted, or ranked. Ranking items by relevance (from highest to lowest) reduces the time required to find the desired information. [1]

ISSN : 0975-5462

Vol. 3 No. 7 July 2011

Text Mining: Text mining is a variation on a field called data mining and refers to the process of deriving high-quality information from unstructured text. High quality in text mining usually refers to some combination of relevance, novelty and interestingness. [2]

Intelligent Agents: Intelligent agents are software entities that carry out some set of operations on behalf of a user with some degree of independence or autonomy and, in doing so, employ some knowledge or representation of the user's goals or desires. Software agents are useful for automating repetitive tasks, finding and filtering information, intelligently summarizing complex data, and so on. More importantly, just like their human counterparts, intelligent agents can have the capability to learn from the managers and even make recommendations to them regarding a particular course of action. [3]

Ranking Algorithm: Ranking algorithms are mainly used to rank or index documents according to some measure of relevance. A ranking algorithm is a formal set of instructions that can be followed to perform a ranking or indexing task, such as a mathematical formula or a set of instructions in a computer program. A ranking is a relationship between a set of items such that, for any two items, the first is either 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second. The rankings themselves are totally ordered and make it possible to evaluate complex information according to certain criteria. [4]

Motivation: The literature study of various research papers and my interest in the field of data mining motivated me to take this up as my dissertation topic for post-graduation. Studying different existing ranking algorithms gave me insight into how ranking algorithms work and ultimately gave me the idea of developing a new algorithm. The working of the Google search engine was also a motivating factor in taking up this topic as my dissertation work. The Google search engine is the best example of an optimized, intelligent software agent based text-mining system encompassing a very large domain of the web.

II. System Description

The system description covers the details of the overall working of the existing or proposed system.

Why Agents? Text mining mainly includes the field of information retrieval, which means finding the documents that contain answers to questions, not finding the answers themselves. To achieve this, statistical measures and methods are used: they make it possible to process text data automatically and compare it to a given question. The issue is how to automate the processing of the text data, and that is where agents come into the picture.


System Architecture: Fig. 1 shows the architectural diagram of the intelligent agent based text-mining system. It includes all the components required to make the system work, together with the relationships and interactions between them. There are three agents, one dataset, the user category, and one cache/log component. We assume that all pre-processing has already been done on the dataset and that the pre-processing steps leave us with the following:
i) A list of keywords.
ii) A table or file consisting of contexts (categories) and their respective keywords.
iii) A table or file consisting of contexts and their respective documents (articles).
The system works as follows:
Step 1: When the user types in a word, take that keyword and look it up in the list of keywords.
Step 2: If the keyword appears in the list, look for the corresponding category.
Step 3: After getting the corresponding category, find all the documents in that category in which the keyword appears.
Step 4: Weigh each of these documents for the given keyword using the term frequency method and assign the weight values to the respective documents (articles).
Step 5: Using the ranking algorithm (presented in the next section), rank all the documents and show the ranked documents to the user.
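Steps 1 to 4 can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the two pre-processed tables are assumed to be available as in-memory maps with lower-case keyword keys, each document is assumed to be a plain string, and a keyword is assumed to match whole words case-insensitively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of Steps 1-4; all names here are illustrative assumptions.
final class KeywordLookupSketch {

    /** Step 4: counts whole-word, case-insensitive occurrences of keyword in text. */
    static int termFrequency(String text, String keyword) {
        String target = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        // Split on runs of non-word characters so punctuation does not hide a match.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(target)) {
                count++;
            }
        }
        return count;
    }

    /**
     * Steps 1-3: look the keyword up, find its context (category), and return
     * the documents of that context in which the keyword appears.
     * Assumes the keys of keywordToContext are stored in lower case.
     */
    static List<String> matchingDocuments(String keyword,
                                          Map<String, String> keywordToContext,
                                          Map<String, List<String>> contextToDocuments) {
        List<String> matches = new ArrayList<>();
        String context = keywordToContext.get(keyword.toLowerCase(Locale.ROOT));
        if (context == null) {
            return matches; // Step 1 failed: the keyword is not in the keyword list.
        }
        List<String> candidates = contextToDocuments.getOrDefault(context, List.of());
        for (String document : candidates) {
            if (termFrequency(document, keyword) > 0) {
                matches.add(document);
            }
        }
        return matches;
    }
}
```

Restricting the scan to the documents of one context is what keeps the lookup cheap: only the articles filed under the keyword's category are tokenized, rather than the whole dataset.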

Figure 1: Architecture of Intelligent Agent Based Small-Scale Search Engine


III. Algorithm

According to Webster's Dictionary, "an algorithm is a precise set of rules specifying how to solve some problem." This section gives the steps of a new ranking algorithm that ranks documents based on their weights. The weights of the documents are calculated using the tf*idf equation, where tf represents term frequency (word frequency) and idf represents inverse document frequency. The equation for weighing a particular document $d_1$ is therefore

$$\mathrm{Weight}(d_1) = tf(d_1) \times idf = tf(d_1) \times \log\left(\frac{n}{df}\right) \tag{1}$$

where $tf(d_1)$ is the term frequency in document $d_1$, $n$ is the total number of documents, and $df$ is the number of documents in which the term appears. We assume that all the documents that contain the required word have been weighted using Equation (1). The ranking algorithm below ranks the documents based on the calculated weight values.
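Equation (1) can be sketched in Java as follows. This is an illustrative sketch, not the authors' code: the class and parameter names are hypothetical, and the natural logarithm is an assumption, since the paper does not state the base of the logarithm.

```java
// Hypothetical sketch of Equation (1); names and log base are assumptions.
final class TfIdfWeight {

    /**
     * Returns tf * log(n / df) for one document and one term.
     *
     * @param termFrequency     tf: occurrences of the term in the document
     * @param totalDocuments    n: total number of documents
     * @param documentFrequency df: number of documents containing the term
     */
    static double weight(int termFrequency, int totalDocuments, int documentFrequency) {
        if (documentFrequency <= 0 || totalDocuments < documentFrequency) {
            throw new IllegalArgumentException("expected 0 < df <= n");
        }
        // Cast before dividing so that n / df is not truncated to an integer.
        double idf = Math.log((double) totalDocuments / documentFrequency);
        return termFrequency * idf;
    }
}
```

Because idf is the same for every document once the word is fixed, documents differ only by tf, so the weights for one word are integer multiples of a single idf value. As a sanity check on the assumed base: with a natural logarithm, a ratio n/df of 1078 and tf = 5 give about 34.914, which agrees with the largest weight in Table 1. The paper does not report n or df, so this ratio is only an inference.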

Ranking Algorithm
Inputs: list l, int dw[], int num
// l is the list of documents in which the inputted word appears.
// dw is an integer array containing the weight values of the documents in l.
// num is the number of documents containing the inputted word.
Output: list final
// final will contain all the ranked documents.

Step 1: Find the average weight by adding the weights of all the documents in l and dividing the sum by num.

Step 2: For every document in l, compare the weight of the document with the average weight.

Step 3: Add the documents whose weight is less than the average weight to a list M and their weights to an array weight1. Add the documents whose weight is more than the average weight to a list N and their weights to an array weight2.

Step 4: If there is more than one document in list N, recursively call the algorithm with the parameters related to the documents in list N. If all the documents in list N have the same weight, there is no need to repeat the process: the average weight would be the same on every call, so the recursion would go into an infinite loop. In that case, and when there is only one document in list N, add the documents to the list final.

Step 5: If there is more than one document in list M, recursively call the algorithm with the parameters related to the documents in list M. If all the documents in list M have the same weight, there is no need to repeat the process: the average weight would be the same on every call, so the recursion would go into an infinite loop. In that case, and when there is only one document in list M, add the documents to the list final.

Step 6: Display the documents in the list final to the user. The list final now holds all the ranked documents with their titles and data, which are displayed to the user who issued the search query for the keyword.
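Steps 1 to 5 can be sketched in Java as follows. This is an illustrative sketch rather than the authors' program: the class, the Doc holder and the method names are hypothetical, and weights are stored as double rather than int. Two details that the steps leave open are filled in as assumptions: a document whose weight equals the average is placed in list N, and a call stops when a split leaves one side empty, which can only happen through floating-point rounding of the average.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the average-split ranking steps; names are assumptions.
final class AverageSplitRanker {

    /** Hypothetical holder for a document title and its weight. */
    static final class Doc {
        final String title;
        final double weight;

        Doc(String title, double weight) {
            this.title = title;
            this.weight = weight;
        }
    }

    /** Returns the documents ordered from highest to lowest weight. */
    static List<Doc> rank(List<Doc> docs) {
        List<Doc> finalList = new ArrayList<>();
        rankInto(docs, finalList);
        return finalList;
    }

    private static void rankInto(List<Doc> docs, List<Doc> finalList) {
        if (docs.isEmpty()) {
            return;
        }
        // Stop conditions of Steps 4 and 5: one document, or all weights equal.
        if (docs.size() == 1 || allSameWeight(docs)) {
            finalList.addAll(docs);
            return;
        }
        // Step 1: average weight of the current list.
        double sum = 0.0;
        for (Doc d : docs) {
            sum += d.weight;
        }
        double average = sum / docs.size();

        // Steps 2-3: split around the average (assumption: ties go to N).
        List<Doc> listM = new ArrayList<>(); // below average
        List<Doc> listN = new ArrayList<>(); // at or above average
        for (Doc d : docs) {
            if (d.weight < average) {
                listM.add(d);
            } else {
                listN.add(d);
            }
        }
        // Guard (assumption): rounding can push the computed average outside
        // the weights, leaving one side empty; stop instead of recursing forever.
        if (listM.isEmpty() || listN.isEmpty()) {
            finalList.addAll(docs);
            return;
        }
        rankInto(listN, finalList); // Step 4: higher weights first
        rankInto(listM, finalList); // Step 5: then lower weights
    }

    private static boolean allSameWeight(List<Doc> docs) {
        double first = docs.get(0).weight;
        for (Doc d : docs) {
            if (d.weight != first) {
                return false;
            }
        }
        return true;
    }
}
```

Processing list N before list M on every call is what produces a descending order in the list final: every document appended from N has a weight at or above the average, and every document appended afterwards from M has a weight below it.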

IV. Results

We implemented the above algorithm in Java and tested it on the Reuters-21578 news articles dataset. We compiled and ran the code from the command prompt and obtained the results shown below for the inputted word APPLE.
Table 1: Weighted Documents (Inputted Word: Apple)

Weight         Title
34.91431376    APPLE COMPUTER <AAPL> UPGRADES MACINTOSH LINE
20.94858825    APPLE COMPUTER <AAPL> HAS NEW MACINTOSH MODELS
6.982862751    LOTUS <LOTS> INTRODUCES NEW SOFTWARE
20.94858825    APPLE <AAPL>, AST <ASTA> OFFER MS-DOS PRODUCTS
6.982862751    SOFTWARE COS SUPPORT NEW APPLE <AAPL> PRODUCTS
6.982862751    COMPUTER COMPANIES FORM NETWORKING GROUP
13.9657255     FCOJ SUPPLIES SIGNIFICANTLY ABOVE YEAR AGO-USDA
13.9657255     BERTELSMANN TO MARKET APPLE SOFTWARE IN GERMANY
6.982862751    MOTOROLA (MOT) SEES CONTINUED GROWTH FOR CHIPS
6.982862751    TECHNOLOGY/DESKTOP PUBLISHING
20.94858825    APPLE <AAPL> EXPANDS NETWORK CAPABILITIES
6.982862751    RJR <RJR> UNIT SELLS FOUR TOBACCO BRANDS
6.982862751    NYNEX <NYN> TO SELL NEW IBM <IBM> COMPUTERS
6.982862751    WALL STREET STOCKS/COMPAQ COMPUTER <CPQ>
13.9657255     THORN-EMI WINS U.S. RULING IN BEATLES SUIT
6.982862751    DIGITAL COMMUNICATIONS <DCAI> INTRODUCES ITEMS

The graph below was generated from the results in the table above. It shows the weight of the inputted word in each of the weighted documents, in descending order.


Figure 2: Weighted document graph.

The weighted documents shown above in Table 1 are just a sample of the results of the implemented algorithm. Some more results for other words inputted to the algorithm are shown below. The graph below shows the highest ranked document for each searched word.
Table 2: Highest Ranked documents for various words

Weight         Title
34.91431376    APPLE COMPUTER <AAPL> UPGRADES MACINTOSH LINE
31.59958129    PORSCHE EXPECTS IMPROVEMENT IN U.S. SALES
21.61913187    REGAN DEPARTURE MAKES 3RD VOLCKER TERM LIKELY
21.61913187    DONALD REGAN SAYS U.S. SHOULD EASE CREDIT SUPPLY
36.35504269    QATAR'S BANKS SET FOR FURTHER LEAN SPELL
13.77510514    ESSO UK PLANNING SLIGHTLY LESS OIL EXPLORATION


Figure 3: Graphs for highest ranked documents

Conclusion

Information retrieval from documents has become very tedious work because of the very large amount of data involved. If the documents are weighted and ranked properly, however, the relevant documents can be retrieved easily and in less time. The results in this paper show that both document weighing and ranking work correctly. We can therefore conclude that a small-scale search engine tool can incorporate a new ranking algorithm and serve the purpose of information retrieval from documents.

References
[1] http://en.wikipedia.org/wiki/Search_engine_(computing)
[2] Vishal Gupta and Gurpreet S. Lehal, A Survey of Text Mining Techniques and Applications, Journal of Emerging Technologies in Web Intelligence, vol. 1, pages 60-76, August 2009.
[3] Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Chapter 2: Intelligent Agents.
[4] http://en.wikipedia.org/wiki/Ranking
[5] Kaustubh S. Raval, Ranjeetsingh S. Suryawanshi and Prof. Devendra M. Thakore, An Intelligent Agent Based Text-Mining System: Presenting Concept through Design Approach, International Journal of Computer Science and Information Security, Vol. 9 No. 4, pages 112-117, April 2011.
[6] Stephen Robertson, Microsoft Research, Understanding Inverse Document Frequency: On Theoretical Arguments for IDF.
[7] http://en.wikipedia.org/wiki/Tf-idf
