
International Journal of Wisdom Based Computing, Vol. 1 (2), August 2011


A Note on NLP based Search Engines


A. Geetha
Assistant Professor, Department of Computer Science, Government Arts College, Udumalpet
gee_sam@yahoo.com

Abstract- Retrieval technology has moved from Data Retrieval through Text Retrieval and Information Retrieval (IR), and today we are in the midst of Knowledge Retrieval. Sophisticated information retrieval tools now operate largely by machine, with little human intervention. NLP based IR tools are available that narrow searches down to the user's requirements. The objective of this paper is to show how intelligent search engines can be constructed. Intelligent search engines evolved as descendants of Meta search engines and incorporate machine-learning techniques; that is, they are semantically empowered search engines, and NLP makes this possible.

Keywords- Search engine optimization (SEO), Intelligent search engines, Intelligent search, Information retrieval (IR), Natural Language processing (NLP), Machine Learning (ML), Semantic base, L2L.

I. INTRODUCTION

Search has arguably become the dominant paradigm for finding information on the WWW. Building a successful search engine involves a number of challenges. Today, in the era of data explosion, the web is multiplying rapidly, and so are search engine technologies and search engine optimizations. Despite this growth and development, problems remain: search engines are handicapped because they lack NLP. NLP and ML are among the techniques of Artificial Intelligence (AI), which makes a system behave intelligently, as human beings do. Our idea is that NLP can parse the user's query, understand it, and decide how to present it to the SE, while ML technologies are incorporated to develop rule bases and knowledge bases automatically.

The following NLP tasks [2] can be merged with search engines and deployed to provide functions such as performing context based search [1], identifying parts of a sentence [2], language conversion [2], image to text conversion and vice versa [2], relationship extraction [2], concept extraction and mapping [3], machine aided and automatic indexing [3], automatic summarization [3], cross-language retrieval [3], entity extraction [3], text mining [3], question answering (QA) [3], etc.

There are tools such as WordNet, REAP, BBC Bitesize, Mesh, Multilingual IR, Weekly Reader, Powerset, Hakia, Textwise, NetBase, etc. for some of the above functions. They all function based on NLP and are [...] large lexical database for English. It functions on thesauri and on lexical and semantic relations. Here, however, we propose to integrate these tools with search engines (SE). Search engines based on keyword matching are successful, but they cannot interpret meaning. As a result, they deliver high Recall but very low Precision. We are not going to replace the SE, but upgrade it. With NLP based search engines it is possible to find information, filter it, organize it, keep it up to date, generate patterns, and produce visual presentations for quick understanding. Above all, the results have to match the user's query.

II. INTELLIGENT SEARCH

The term Intelligent Search [4] is associated with the Semantic web, Ontology, Intelligent information retrieval (IIR), Intelligent data mining (IDM), Image processing (IP), Artificial Intelligence (AI), Web mining, Knowledge base Information Retrieval (KBIR), etc. [5]. The objective of this paper is to make search intelligent and more knowledgeable, in a multi-dimensional way [6]. It must be a context / semantic based search. Hence the term Intelligent Search, which can be achieved with NLP based search engines [7]. NLP improves the efficiency of search by optimizing query performance and query evaluation. In order to build a state-of-the-art information system, one must extract as much meaning as possible from each document. Context and meaning must be preserved. A good natural-language-based system provides the foundation for this, because it parses sentences thoroughly, extracts meaning from context, and is smart enough to differentiate between all entities. In a nutshell, the proposed system falls under L2L (Language to Logic) for parsing and processing queries and arriving at an answer [8].

III. NLP BASED SEARCH ENGINE TASKS

A. Identifying the parts of a sentence (noun, verb, object, adjective, adverb, ...), as WordNet does [9]. WordNet's purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications.
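As an illustration of task A, the following is a minimal sketch of lexicon-based tagging. It assumes a tiny hand-written lexicon; the names LEXICON and tag_parts_of_speech are hypothetical, and a real system would draw its word classes from a resource such as WordNet rather than from a fixed table.

```python
# Hypothetical mini-lexicon: these entries are assumptions made for illustration only.
LEXICON = {
    "rama": "noun",
    "reads": "verb",
    "a": "determiner",
    "book": "noun",
    "quickly": "adverb",
    "good": "adjective",
}


def tag_parts_of_speech(sentence):
    """Return (word, tag) pairs; words missing from the lexicon are tagged 'unknown'."""
    words = sentence.lower().replace(".", " ").split()
    return [(word, LEXICON.get(word, "unknown")) for word in words]


if __name__ == "__main__":
    for word, tag in tag_parts_of_speech("Rama reads a book."):
        print(word, tag)
```

A table lookup cannot resolve words that belong to more than one class, which is one reason a richer lexical resource would be needed in practice.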


B. Tense conversion (present tense, past tense, future tense, perfect tense, ...). The user can type a sentence in one form and the system will help convert it to other tenses as the user wishes. More importantly, this will help the system function well for QA based queries.

C. Semantic based search [10]. E.g. "The brown barks". The word bark as a noun is polysemous [11], i.e. it has several meanings:
1. bark (tough protective covering of the woody stems and roots of trees and other woody plants)
2. bark (a noise resembling the bark of a dog)
3. bark (a sailing ship with 3 (or more) masts)
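One way a system might choose among such senses is to count how many context words match cue words attached to each sense. The sketch below is a swapped-in heuristic for illustration, not the paper's method; the cue words and the names SENSES and disambiguate are assumptions and are not taken from WordNet.

```python
# Cue words per sense are hypothetical assumptions chosen for this illustration.
SENSES = [
    ("tree covering", {"tree", "trunk", "wood", "brown", "rough"}),
    ("dog noise", {"dog", "loud", "noise", "animal", "sound"}),
    ("sailing ship", {"ship", "mast", "masts", "sail", "voyage"}),
]


def disambiguate(context_words):
    """Pick the sense whose cue words overlap the context most; ties keep the earlier sense."""
    context = {word.lower() for word in context_words}
    best_label, best_score = SENSES[0][0], -1
    for label, cues in SENSES:
        score = len(context & cues)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    print(disambiguate(["the", "brown", "bark"]))
```

With no matching cue at all, the function falls back to the first listed sense, so the order of SENSES acts as a default ranking.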

In the sentence "The brown barks", bark refers only to the first sense. Resolving this requires a large amount of knowledge to be embedded in the system. In our case the choice is possible because the context does not relate bark to animals, dogs, ships, masts, sound or voyages.

D. Image to text and text to image conversion. If the user types Apple, the NLP User Interface (UI) pops up to establish what Apple actually means: the fruit or Apple the computer company. If the user selects fruit, the SE lists only images of the apple fruit, and not Apple's iPad, iPod, iTunes, etc.

E. Multilingual retrieval / cross-language retrieval [12]. Similar to Google Translate, if a query relates to other languages, the system must, with the user's consent, fetch results in those languages too.

F. Relationship finder (father, son, relationship, association among objects, duration, period, ... from a given sentence). This is a Herculean task, but if the associations are set up properly, the system will answer. Consider "Hari married Mary last year". From the word Mary the system will recognize a female, from Hari a male, and the word married binds them as a couple. So when the query "Who is the wife of Hari?" is posed, the system will respond with Mary. For this, we must set up a hierarchy consisting of a relationship tree, gender, associations (mother, father, brother, ...) and interrelationships. The ISA relationship defines who or what the subject is: "Sachin is a Good Cricket Player" [3]. The AGENTOF relationship describes who or what caused an event to happen or had a causal relationship: "Increased ozone in the Southern Hemisphere causes severe sunburns" [3].

G. Question answer based [13]. We often lose sight of the purpose of information retrieval, which is usually to answer questions, not just retrieve documents. Question-answering systems look within documents or knowledge bases to find answers. The aim is to build sites like ask.com, answers.com and askjeeves.com. For example, to the question "Who won the Cricket World Cup 2011?" the system has to respond with India. That is, it needs to respond with words as answers, not with sentences. If the user requires more information, it will link to the source documents. For a question such as "Who is Rajiv Gandhi?", however, a future upgrade will respond with the probable answers, such as son of Smt. Indira Gandhi, former Prime Minister of India, husband of Sonia Gandhi, and so on. For this, the system must be capable of updating its knowledge base and must find interrelationships between entities.

H. Word character segmentation [14][15]. In linguistics, morphology is the identification, analysis and description of the structure of morphemes and other units of meaning in a language, such as words, affixes, parts of speech, intonation/stress and implied context (words in a lexicon are the subject matter of lexicology). In order to understand what a user is searching for, word sense disambiguation must occur, because an ambiguous term can have several meanings.

I. Multimedia information retrieval [16]. Similar to hakia.com, our system will fetch information in all media. The UI will assist in deciding the specific mode.

J. Automatic summarization [17]. Automatic summarization involves reducing a text document, or a larger corpus of multiple documents, into a short set of words or a paragraph that conveys the main meaning of the text. There are two kinds of automatic summarization. The first summarizes whole documents, either by extracting important sentences or by rephrasing and shortening the original text. The second summarizes across multiple documents. Cross-document summarization is harder, but potentially more valuable.

IV. METHODOLOGY

Existing SEs depend on a fixed list of rules to decide which path or pattern to navigate. Our goal, in contrast, is a system that generates its own rules and decides its own path. A search engine must allow users to compose their own search queries rather than simply follow pre-specified search paths or hierarchies [18][19]. Search Engine Optimization (SEO) is the process of improving the visibility of a website or a web page in a search engine [20]. It can be achieved through site maps, page ranking algorithms, keyword / phrase density, meta tags, link farms, indexing / clustering, etc. To make an NLP based SE successful and sustainable in the market, we must perform the following:



1. The grammar of a particular language has to be populated into the database, along with synonyms, antonyms, usage, origin, examples, etc. That is, the entire dictionary / encyclopedia has to be loaded. This is referred to as a thesaurus, which is a set of items (phrases or words) with a set of relations between these items.
2. The meta database of the SE must update its knowledge base frequently. For example, the statement "India's 11th President, Dr. A.P.J. Abdul Kalam, assumed office on July 25, 2002" prompts the system to infer automatically that Smt. Pratibha Patil is the 12th President of India. Also, Dr. A.P.J. Abdul Kalam is called the missile man of India. The proposed system must correlate such statements. As news about Dr. A.P.J. Abdul Kalam gets updated in the database, our system must extract knowledge and must be capable of making inferences.
3. Each user's search patterns, domain of interest, depth of knowledge exploration and specialized entities may be discovered by the system, which must then assist the user in future searches.
4. The user is allowed to update his or her meta database with new terminologies, relationships, cross-references and vocabularies as the language grows.
5. Above all, the user can design macros / shortcut terms for easy reference and retrieval.
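The thesaurus of step 1, a set of items with a set of relations between them, could be held in a structure along the following lines. This is a minimal sketch under stated assumptions: the class name Thesaurus, its methods and the relation labels are hypothetical, and the sample entries are illustrative only.

```python
from collections import defaultdict


class Thesaurus:
    """Items (words or phrases) linked by named relations such as 'synonym' or 'antonym'."""

    def __init__(self):
        # relation name -> item -> set of related items
        self._relations = defaultdict(lambda: defaultdict(set))

    def add(self, relation, item, other, symmetric=True):
        """Record that `item` stands in `relation` to `other` (both directions by default)."""
        self._relations[relation][item].add(other)
        if symmetric:
            self._relations[relation][other].add(item)

    def related(self, relation, item):
        """Return the sorted items related to `item` under `relation` (empty list if none)."""
        return sorted(self._relations.get(relation, {}).get(item, set()))


if __name__ == "__main__":
    thesaurus = Thesaurus()
    thesaurus.add("synonym", "bug", "error")
    thesaurus.add("synonym", "bug", "flaw")
    print(thesaurus.related("synonym", "bug"))
```

Keeping relations as named sets means that a user-supplied relationship or cross-reference, as in step 4, is just another call to add with a new relation label.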


With these measures in place, the search engine becomes tailored or customized to the individual user's needs. Hence, we propose that searches be carried out through NLP, so that we land on relevant and expected web pages.

V. EXAMPLES

As a result, our system will enable the following:

1. The word bug invokes the NLP UI to select the discipline. Once the user selects computer, the system takes the meaning as error or flaw and does the further processing.
2. Tsunami, given in English: the system finds the origin of the word in Japanese and then, with the user's approval, extends the search to Japanese databases.
3. "Rama reads a book." If the user wants the sentence in other tenses, the system will provide them.
4. When a user raises the query "Who is the first woman Prime Minister of India?", the system will respond with the answer Smt. Indira Gandhi. No more nonsense web pages about women, prime minister, India, first women, ... which resulted from keyword searches.
5. If the user prefers an image, then an image of Smt. Indira Gandhi will be displayed.
6. Given the word crane, the system's UI differentiates between noun and verb. When the user selects noun, it offers bird or construction tool. If the user selects bird, the NLP based SE crawls only for the bird crane.
7. Given the word NLP, the UI must categorize it as follows [2] and then proceed with the search:
   Linguistics - Natural language processing
   Mathematics - Nonlinear programming
   Medicine and biology - Neoplastic lumbosacral plexopathy

VI. CONCLUSION

The proposed system redefines the computer to perceive, think and process like human beings. It is a step towards wisdom computing. The user decides the pattern / path of the search trace based on taxonomy. The system largely minimizes irrelevant web pages, giving higher user satisfaction in less time and with greater accuracy / relevancy of web pages. The NLP tasks described above already exist, but we propose to integrate all these tools with search engines. This work is part of a larger system to investigate ways in which several new NLP technologies can be integrated to improve access to large scale knowledge resources and to make smart web decisions. To conclude, an NLP based SE seeks to return information in a structured form, consistent with human cognitive processes, as opposed to simple lists of data items. Tomorrow's IR will be knowledge centric; that is Wisdom Computing.

REFERENCES

[1] Hendler, J., T.B-Lee and E.Miller., Integrating Applications on the Semantic web. J. Institute Elec. Eng. Japan, 2002. 122: 676-680.
[2] http://wikipedia.org/wiki/NLP/
[3] http://www.infotoday.com/searcher/jan00/feldman.htm
[4] Zhong, N., J.Liu and Y.Yao, 2002. In search of the Wisdom web. IEEE Computer, 35: 27-31.
[5] Kevin Curran., Web Intelligence In Information Retrieval in Information Technology Journal 3(2):196-201, 2004 ISSN 1682-6027.
[6] http://wi-consortium.org/
[7] P.C.Reghu Raj and S.Raman., Applied Artificial Intelligence, in Taylor and Francis Inc. 19:559-599 2005.
[8] http://trec.nist.gov
[9] G. A. Miller, R. Beckwith, C. D. Fellbaum, D. Gross, K. Miller. 1990. WordNet: An online lexical database. Int. J. Lexicograph. 3, 4, pp. 235-244.
[10] Information Retrieval and Semantic Web in Proceedings of the 38th Hawaii International Conference on System Sciences 2005.
[11] P.C.Reghu Raj and S.Raman., Applied Artificial Intelligence in Taylor and Francis Inc. 19:559-599 2005.
[12] M.Mitra., B.B.Chaudhuri., Multilingual IR in Information Retrieval 2, 141-163 (2000).
[13] Marius Pasca and Sanda Harabagiu., High performance question/answering In the proceedings of the 24th International Conference on R&D on IR, pages 366-374, 2001.
[14] J.T. Tou and R.C. Gonzalez., Pattern Recognition Principles, Addison-Wesley Publishing Company, Inc., Reading, Massachusetts, 1974.
[15] M. Szmurlo, Masters Thesis, Oslo, May 1995.
[16] Maybury, M., Intelligent Multimedia Information Retrieval in AAAI Press/MIT Press London 1987.
[17] http://en.wikipedia.org/wiki/Automatic_summarization
[18] H.Chu., M. Rosenthal., Search engines for WWW: A comparative study and evaluation methodology 2007.
[19] http://wordnet.princeton.edu/
[20] http://wikipedia.org/wiki/SEO