
Chapter 1

Introduction

Although search over World Wide Web pages has recently received much academic and
commercial attention, surprisingly little research has been done on how to search the web pages
within large and small, diverse intranets. Intranets contain the information associated with the
internal workings of an organization, and they create new challenges for information
retrieval. The amount of information on the intranet is growing rapidly, as is the number of
new users inexperienced in the art of intranet research. Earlier works that compared intranets and
the Internet from the viewpoint of keyword search have pointed to several reasons why the search
problem is quite different in these two domains. In this project, we address the problem of
providing quality answers to navigational queries over the intranet. As intranets grow, providing
access to more and more documents, their value grows. The larger the collection, the harder it
becomes to find that important presentation, contract, or HR form. Enterprise
Information Portals provide a starting point to intranets, and a search engine helps locate
information, including archives and unstructured data. Search engines need to be tuned and
indexed to provide the best answers.

Our approach is based on crawler identification of navigational pages, intelligent generation
of term variants to associate with each page, and the construction of separate indices exclusively
devoted to answering navigational queries.

This Chapter outlines the aims of the project and motivation behind its implementation.

1.1 What is an Intranet?

An intranet is a private computer network that uses Internet protocols and network
connectivity to securely share any part of an organization's information or operational systems
with its employees.

1.1.1 Features of Intranet

 Sometimes the term refers only to the organization's internal website, but often it is a more
extensive part of the organization's computer infrastructure, and private websites are an
important component and focal point of internal communication and collaboration.

 An intranet is built from the same concepts and technologies used for the Internet, such as
clients and servers running on the Internet Protocol Suite (TCP/IP). Any of the well known
Internet protocols may be found in an intranet, such as HTTP (web services), SMTP (email),
and FTP (file transfer).

 Intranets differ from extranets in that the former are generally restricted to employees of the
organization while extranets may also be accessed by customers, suppliers, or other approved
parties.

 Intranets are being used to deliver tools and applications, e.g., collaboration (to facilitate
working in groups and teleconferencing) or sophisticated corporate directories, sales and
customer relationship management tools, project management, etc., to advance productivity.

 Intranets are also being used as corporate culture-change platforms. For example, large
numbers of employees discussing key issues in an intranet forum application could lead to
new ideas in management, productivity, quality, and other corporate issues.

Just one example of improved usability from taking advantage of managed diversity: an
intranet search engine can take advantage of weighted keywords to increase precision. Weights
are impossible on the open Internet, since every site about widgets will claim to have the highest
possible relevance weight for the keyword "widget." On an intranet, even a light touch of
information management should ensure that authors assign weights reasonably fairly and that
they use, say, a controlled vocabulary correctly to classify their pages.

An intranet is a network of computers that can be accessed only by an authorized set of users
within an organization. Its purpose is typically to share information and computing resources
among employees within an organization. The term “search engine” is often used generically to
describe both crawler-based search engines and human-powered directories. These two types of
search engines gather their listings in radically different ways. Crawler-based search engines,
such as Google, create their listings automatically. Human-powered directories, such as the Open
Directory, depend on humans for their listings. The search looks for matches only in the
descriptions submitted; in this case, changes to any of the web pages have no effect
on the listing. The only exception is that a good site, with good content, is more likely
to get reviewed. There are two types of intranet search, namely desktop-based and web-based.
Desktop-based search addresses the whole spectrum of electronic information that might be found
in an organization, including video, images, databases, etc.

Figure 1.1: A model of Intranet

1.2 Scope of project


This project can be used by the various clients who want to search for shared documents
scattered all over the intranet.

1.3 Requirement Specifications
1.3.1 Functional Requirements
Query Box:
There should be a query box where the user types in the name of the file that he is searching
for.

Search Button:
This button initiates the search operation over the intranet with the text typed by the user.

Result Box:
The results obtained for the user's query are displayed along with the names of the systems
where the files are stored.

1.3.2 Non-Functional Requirements


Security:
Only files stored on systems for which we have prior permission of access are displayed,
thus preventing unauthorized access.

Database:
Integrity should be maintained and all the constraints should be satisfied.

Platform Independence:
Written using 100 percent Pure Java Code.

1.3.3 Software Requirements


The following software has been used for the project.

Windows Platform:
Microsoft Windows has been used as the platform for coding.

NetBeans:
The NetBeans IDE has been used for developing the code for the project in Java.

1.3.4 Hardware Requirements


PC with 2 GB Hard disk and 256 MB RAM

RJ-45 LAN cables and LAN connectors

1.4 The Difference between Intranet and Internet Design
The intranet and the public website on the open Internet are two different information spaces
and should have two different user interface designs. It is tempting to try to save design
resources by reusing a single design, but it is a bad idea to do so because the two types of site
differ along several dimensions:

 Users differ. Intranet users are the organization's own employees who know a lot about the company, its
organizational structure, and special terminology and circumstances. The Internet site is
used by customers who will know much less about the company and also care less about
it.

 The tasks differ. The intranet is used for everyday work inside the company, including
some quite complex applications; the Internet site is mainly used to find out information
about the company's products.

 The type of information differs. The intranet will have many draft reports, project
progress reports, human resource information, and other detailed information, whereas
the Internet site will have marketing information and customer support information.

 The amount of information differs. Typically, an intranet has between ten and a
hundred times as many pages as the same company's public website. The difference is
due to the extensive amount of work-in-progress that is documented on the intranet and
the fact that many projects and departments never publish anything publicly even though
they have many internal documents.

 Bandwidth and cross-platform needs differ. Intranets often run between a hundred and
a thousand times faster than most Internet users' Web access which is stuck at low-band
or mid-band, so it is feasible to use rich graphics and even multimedia and other
advanced content on intranet pages. Also, it is sometimes possible to control what
computers and software versions are supported on an intranet, meaning that designs need
to be less cross-platform compatible (again allowing for more advanced page content).

Most basically, the intranet and the website are two different information spaces. They
should look different in order to let employees know when they are on the internal net and when
they have ventured out to the public site. Different looks will emphasize the sense of place and
thus facilitate navigation. Also, making the two information spaces feel different will facilitate
an understanding of when an employee is seeing information that can be freely shared with the
outside and when the information is internal and confidential.

An intranet design should be much more task-oriented and less promotional than an Internet
design. An organization should only have a single intranet design, so users only have to learn it
once. Therefore it is acceptable to use a much larger number of options and features on an
intranet since users will not feel intimidated and overwhelmed as they would on the open
Internet where people move rapidly between sites. An intranet will need a much stronger
navigational system than an Internet site because it has to encompass a larger amount of
information. In particular, the intranet will need a navigation system to facilitate movement
between servers, whereas a public website only needs to support within-site navigation.

Chapter 2
Problem definition

Today's age is better known as the “Information Age”. The world runs on information.
According to the Data Warehousing Institute, the data available today doubles every six
months. A great deal of information resides on the private LANs, or intranets, of organizations,
so considerable manpower is needed to extract the proper information from data scattered across
the intranet. An obvious reason for poor enterprise search is that a high-performing text retrieval
algorithm developed in the laboratory cannot be applied without extensive engineering to the
enterprise search problem, because of the complexity of typical enterprise information spaces.

As organizations develop more and more information, there is a need to sort the data and
information in a systematic manner and make it available to users on the intranet as requested, so
that a user can decide what is necessary and take the appropriate action. Our project provides a
helping hand in this regard, putting the information present on the intranet at our fingertips.
Our aim is to provide an effective, efficient and systematic search engine that works for a local
area network. In other words, our INTRANET SEARCH ENGINE is effective in terms of search,
efficient in terms of time, and systematic in representation.

2.1 The Need

1. The need to respect fine-grained individual access-control rights, typically at the document
level; thus two users issuing the same search/navigation request may see differing sets of
documents due to the differences in their privileges.

2. The need to index and search a large variety of file types (formats), such as PDF, Microsoft
Word and Power-point files, etc.

3. The need to seamlessly and scalably combine structured (e.g. relational) as well as
unstructured information in a document for search, as well as for organizational purposes
(clustering, classification, etc.) and for personalization.
An effective search tool on an intranet can make an enormous difference to its usability. A
good search engine ensures that users find what they're looking for, first time, regardless of the
format or location of the information. This means that a wide variety of information can be
effectively dispersed and made available to staff, without the need for complex navigation
systems or filing conventions.

Our project aims to help the user search and access text information. The search will be a
content-based search. As stated earlier, there is a wealth of information available for the user to
access on the intranet, but only the specific required information is to be searched, sorted and
represented in a systematic manner to the user, thus increasing the availability of useful
information. Access will be given only to data that is shared, thus preventing unauthorized
access.

2.2 Objectives of project

To implement a centrally managed intranet search engine that helps a client search for
files over the intranet. The client can execute a search operation as per his needs. The files, if
present, shall be displayed on the same search form. By making use of this project we can
provide an enhanced capability for searching over the intranet.

Chapter 3
Mechanization of Search Engine

An intranet search engine is much the same as the Web-wide search engines. The
search engine locates the documents, extracts the text, and stores it in an index file, making an
entry for each word. When an end-user or employee types a word into a form and clicks the
Search button, the browser sends it to the server. The search engine receives the search query,
looks for matching words in the index file, gathers related document information, sorts the
documents by relevance, formats the results into appropriate format, and sends the page back to
the user. Several indexing aspects require attention from the intranet site manager. Indexing
integrates content from many sources: pages on internal sites, content management systems etc.

3.1 Processes of Search Operation


There are various processes and entities involved in finding the results for the user as per the
query he has input.

3.1.1 Gathering

The index should be kept current. As soon as new content is published, it should be
indexed. Publishing or content management systems can notify the indexer of new data;
otherwise, index the frequently changing areas more often. If the search engine cannot respond to
queries when updating, use mirrored servers or switch search engines.

3.1.2 Indexing

In addition to HTML, XML, and text, intranet search engines deal with binary file formats
such as PDF, MS Office formats, including Word, Excel, and PowerPoint, WordPerfect, and
others. The index should store the entire content of every file, even very long documents. It
should keep every word and the word position in the document, for later phrase searching and
match highlighting.
Intranets generally include various levels of security and access controls, and the index
should store this information, so it can show only the accessible content in the search results. For
high-security content, it is a good idea to create a separate index file to avoid co-mingling private
and public text.

Figure 3.1: Components of Search Engine

3.1.3 Crawling
The general algorithm involves backtracking to the root directory and penetrating new web
pages via their links. The process continues until the entire website (Intranet) is indexed.
In addition, our crawler is able to recognize duplicate pages and discard them accordingly.
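The crawl with duplicate discarding can be sketched as follows. Since a real crawler fetches pages over HTTP, this sketch substitutes an in-memory map of page contents and links; all names are illustrative, and duplicates are recognized by hashing page content.

```java
import java.util.*;

// Simplified crawl over an in-memory site graph; a real crawler would fetch
// pages over HTTP. Duplicate pages are recognized by hashing their content.
class MiniCrawler {
    // Breadth-first walk from the root, following links and skipping
    // already-visited URLs and pages whose content was seen before.
    static List<String> crawl(String root,
                              Map<String, String> pageContent,
                              Map<String, List<String>> links) {
        List<String> indexed = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Set<Integer> seenContent = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;                     // already crawled
            String content = pageContent.getOrDefault(url, "");
            if (!seenContent.add(content.hashCode())) continue;  // duplicate page, discard
            indexed.add(url);
            queue.addAll(links.getOrDefault(url, List.of()));
        }
        return indexed;
    }
}
```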

3.1.4 Searching agent


This is the tool that resides on the client side and is triggered by the server with a keyword, so
it searches only the client on which it resides and returns the result to the server. Each client has
its own searching agent.

When a new search reaches the server, the server first searches its own index (database); if a
match is found, it returns it as the response, and if not, it triggers all the clients' searching agents
and gathers the replies from them. When a user enters a query into a search engine (typically by
using keywords), the engine examines its index and provides a listing of best-matching
pages/files according to its criteria, usually with a short summary containing the document's title
and sometimes parts of the text. Most search engines support the use of the Boolean operators
AND, OR and NOT to further specify the search query. Boolean operators are for literal searches
that allow the user to refine and extend the terms of the search. The engine looks for the words or
phrases exactly as entered. Natural language queries allow the user to type a question in the same
form one would ask it to a human.
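The literal Boolean operators amount to set operations over per-term result sets. A minimal sketch follows; the names are illustrative, and the per-term sets would in practice come from the index.

```java
import java.util.*;

// Literal Boolean search over per-term result sets. The sets would come
// from the index; here they are passed in directly for illustration.
class BooleanSearch {
    // AND: documents containing both terms.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a); r.retainAll(b); return r;
    }
    // OR: documents containing either term.
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a); r.addAll(b); return r;
    }
    // NOT: restricts relative to the full collection of documents.
    static Set<String> not(Set<String> all, Set<String> a) {
        Set<String> r = new TreeSet<>(all); r.removeAll(a); return r;
    }
}
```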


Figure 3.2: Basic Information Retrieval Process

3.2 Analysis of Search Engine
The search engine needs to be analyzed so that the software can be optimized. For this
purpose, we need to understand its pros and cons.

3.2.1 Pros:
Search engines provide access to a fairly large portion of the publicly available pages over
the internet and intranet, which are themselves growing exponentially.

Search engines are the best means devised yet for searching the internet and intranet.
Stranded in the middle of this global electronic library of information without either a card
catalog or any recognizable structure, how else are you going to find what you're looking for?

3.2.2 Cons:
On the down side, the sheer number of words indexed by search engines increases the
likelihood that they will return hundreds of thousands of responses to simple search requests.
Remember, they will return lengthy documents in which your keyword appears only once.
Additionally, many of these responses will be irrelevant to your search.

Chapter 4
Evaluation of Intranet Search Engine

Any intranet search engine should be developed as per the requirements of the environment
in which it will be used. But as per our studies, for the overall deployment of any intranet search
engine, there are some generic functions that are almost the same for all of them.

4.1 Important Features for Intranet Search


Search functionality is divided into several parts: the search form and query options, the
search engine retrieval and relevance ranking, and results display.

1. Search Functionality

When the user clicks the Search button, the search form sends a query to the search engine
server, which looks for the words in the index file. Some search engines use stemming to locate
singular and plural forms of words. Once it locates the matches, the search engine gets
information about the associated documents, such as URLs and titles. It sorts the documents by
relevance, as defined by an internal set of rules: by frequency of matched terms in the
documents, by phrases, and by location in the document.
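Ranking by frequency of matched terms, one of the relevance rules mentioned above, can be sketched as follows. The names are illustrative, and a real engine would combine several signals rather than raw term frequency alone.

```java
import java.util.*;

// Rank documents by how often the matched term occurs in them.
class FrequencyRanker {
    // Count occurrences of the query term in a document's text.
    static int termCount(String text, String term) {
        int count = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.equals(term.toLowerCase())) count++;
        }
        return count;
    }

    // Sort document ids by descending frequency of the query term.
    static List<String> rank(Map<String, String> docs, String term) {
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort(Comparator.comparingInt(
                (String id) -> termCount(docs.get(id), term)).reversed());
        return ids;
    }
}
```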

2. Search Results Pages

Search results are not a place to surprise users with experimental interfaces. It is best to
conform to the basic conventions of Web search results, with a listing of documents showing
titles and descriptions. The Internet can be used to identify useful features.

3. Search Problems and No-Matches Pages

Searches fail for various reasons:

 The user forgets to type anything in the search field.


 The user is searching for text that is not in the scope of the index.
 The user is using a term that is not used in the index.
 The user has made a spelling or typing mistake.
 The user is doing a search in which all the query requirements are not met (for example,
one word was matched but the other was not).

To avoid common search failures, create a page that explains these errors and helps users
understand what is within the scope of the search engine. If a taxonomy or hierarchy exists,
display it on the page to allow users to drill down through the category.
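A no-matches page can be driven by classifying the failure reasons listed above. The category names and the vocabulary lookup below are assumptions made for the example, not part of the project.

```java
// Classify common search failures so a helpful no-matches page can
// explain what went wrong. The category names are illustrative.
class SearchFailure {
    static String diagnose(String query, int matches, java.util.Set<String> vocabulary) {
        if (query == null || query.trim().isEmpty()) {
            return "EMPTY_QUERY";          // nothing typed in the search field
        }
        if (matches > 0) {
            return "OK";
        }
        // Partial-match failure: some query words are known, others are not.
        boolean anyKnown = false, anyUnknown = false;
        for (String w : query.toLowerCase().split("\\W+")) {
            if (vocabulary.contains(w)) anyKnown = true; else anyUnknown = true;
        }
        if (anyKnown && anyUnknown) return "PARTIAL_MATCH_FAILED";
        if (anyUnknown) return "TERM_NOT_IN_INDEX"; // possibly a typo or out of scope
        return "NO_MATCHES";
    }
}
```

The returned category can then select which explanation, and which part of the taxonomy, to show on the error page.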

4. Search Log Analysis

Search logs are a great window into the minds of intranet users. If the search log tracks the
query and the number of matches, this is good. This makes it possible to count the 25 or 100
most popular search terms and to make sure these topics are adequately covered. It is also
possible to track the most common terms that do not find matches and to address these problems.
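Counting the most popular search terms and collecting the queries that found no matches can be sketched as below; the log entry shape is an assumption for the example.

```java
import java.util.*;
import java.util.stream.*;

// Analyze a search log that records each query and its number of matches.
class SearchLog {
    record Entry(String query, int matches) {}

    // The n most frequently issued queries, most popular first.
    static List<String> topQueries(List<Entry> log, int n) {
        Map<String, Long> counts = log.stream()
                .collect(Collectors.groupingBy(Entry::query, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n).map(Map.Entry::getKey).collect(Collectors.toList());
    }

    // Queries that returned no matches, i.e. the problems to address.
    static Set<String> failedQueries(List<Entry> log) {
        return log.stream().filter(e -> e.matches() == 0)
                .map(Entry::query).collect(Collectors.toSet());
    }
}
```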

5. The Indexer

Full-text indexing literally creates a virtual copy of the entire website. The option is still
feasible as it only encompasses intranet searches. With this, content can be subjected to further
scrutiny and hopefully yield more precise information. The first step is to initiate the creation of
an index; this index will contain location information for each and every word in all of your
documents. The index is created externally to the files and does not affect them in any way.
Indexed documents are typically specified according to directory and extension. There can either
be one index for all of the files, or several separate indexes, each for a different project. The
indexes are automatically updated when new documents are created or existing documents are
changed. However, any change to the table's structure, such as configuration data, will require a
complete rebuild of the full-text index. Once there is an index, it can be used to locate, view
and retrieve information. Using the indexes created, the search query can locate the
required information in your documents. Results are displayed almost instantly, despite the
index's relatively large size, thus proving the speed and advantages of implementing indexes.
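Specifying indexed documents by directory and extension can be sketched as a simple filter. Paths are plain strings here, and a real indexer would walk the file system; the class name is illustrative.

```java
import java.util.*;

// Select which documents to index, by directory and file extension.
class IndexScope {
    static List<String> select(List<String> paths, String directory, Set<String> extensions) {
        List<String> chosen = new ArrayList<>();
        for (String p : paths) {
            int dot = p.lastIndexOf('.');
            String ext = dot >= 0 ? p.substring(dot + 1).toLowerCase() : "";
            // Keep files under the given directory with an allowed extension.
            if (p.startsWith(directory) && extensions.contains(ext)) chosen.add(p);
        }
        return chosen;
    }
}
```

Running this selection per project directory yields the separate per-project indexes described above.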

4.2 Multi-level Approach


Herein we develop a multi-level approach that comprises four levels.

4.2.1 Data gathering

Most organizations have legacy data in formats other than HTML, e.g. Adobe's PDF,
MS Office, FrameMaker, Lotus Notes, PostScript, and plain ASCII text. The spider should at least
be able to correctly interpret and index the most frequently used or the most important of these
formats. If meta-information and XML tags are likely to show up within the documents, the
spider must be able to interpret such tags, and it would also be useful if RDF-formatted
information could be gathered intelligently. If USENET newsgroups need to be indexed, the
spider must be able to crawl through them. That also goes for client-side image maps,
CGI scripts, ASP-generated pages, pages using frames, and Lotus Domino servers. Although
frames are frequently used within many companies, spiders, which generally work their way
round the net by picking up and following hypertext links, may not be able to correctly interpret
the different syntax used for framed pages. These links could end up ignored. Spidering Domino
servers using the above HTTP requests requires the search engine to be able to intelligently filter
out the many collapsed/expanded versions of the same page, or the index will quickly be filled
with duplicates. Another, and arguably better, way would be to access Domino servers via the
provided APIs.

Another situation that is likely to require access via APIs rather than having to crawl through
HTTP is when Content Management (CM) systems are used. In CM tools, the actual content of a
page is stored separated from the page layout information. Since pages are rendered dynamically
only when requested by a user (via her browser), the spider may not be able to pick up the link
information that is embedded in the page code. Without those links, the spider will not be able to
find the information. Even if the information is found and indexed correctly it might be difficult
for the search engine to understand how to display a search result since the information that has
been indexed may belong to several dynamic pages. This is an area not yet fully explored by
search engine vendors and proposed solutions should be investigated carefully.

Intelligent robots are able to detect copies or replicas of already indexed data while crawling
and advanced search engines can index “active” sites, e.g. sites that update frequently, more
often than sites that are more “passive”. If this is not supported, some manual means of
determining time-to-live should be provided. There should be some means of restricting the
robot from entering certain areas of the net, including any desired domain, sub-net, server,
directory, or file level. Also, check if search depth can be set to avoid loops when indexing
dynamically generated pages. Support for proxy servers and password handling can be useful, as
can the ability to not only follow links but also detect directories and thus find files not linked to
from other pages. The spider should be easy to set up and start. Check how the URLs from which
to start are specified as well as if the users may add URLs.

Finally, the Robot Exclusion Protocol provides a way for the webmaster to tell the robot not
to index a certain part of a server. This should be used to avoid indexing temporary files, caches,
test or backup copies, as well as classified information such as password files.
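Honoring such exclusions can be sketched as a prefix check against a disallow list; a full implementation would parse robots.txt according to the Robot Exclusion Protocol, which this sketch does not do.

```java
import java.util.*;

// Honor a simple exclusion list in the spirit of the Robot Exclusion
// Protocol: paths under any disallowed prefix are not crawled or indexed.
class RobotExclusion {
    private final List<String> disallowedPrefixes;

    RobotExclusion(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = disallowedPrefixes;
    }

    // True when no disallowed prefix covers this path.
    boolean allowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```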

4.2.2 Index

Although a good index alone does not make a good search engine, the index is an essential
part of a search tool. One of the most important issues is keeping the index up-to-date, and the
best way to do that is to allow real-time updates. There is a big difference between indexing the
full text or just a portion. Though partial indexing saves disk space it may prevent people from
finding what they are looking for. The portion of text being indexed also affects the data that is
presented as the search result. Some tools only show the first few lines while others may
generate an automatic abstract or use meta-information.

If the organization consists of several sub-domains, users might only want to search their
specific sub-domain. Allowing the index to be divided into multiple collections might then speed
up the search. It may also prove useful to be able to split the index into several collections even
though they are kept at one physical location. For example, one may want separate collections
for separate topics or business areas.

Some tools support linguistic features such as automatic truncation or stemming of the search
terms, where the latter is a more sophisticated form that usually performs better. If the
organization is located in non-English speaking countries the ability to correctly handle national
characters becomes important. Also, note that some products cannot handle numbers. If number
searching is required, e.g. serial numbers, this limitation should be taken into consideration.
Should words that occur too frequently be removed from the index? Some engines have
automatically generated stop-lists, while others require the administrator to remove such words
manually.
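Stop-word removal and simple right-hand truncation can be sketched as below. The stop-list is illustrative, and real stemming is considerably more sophisticated than the prefix matching shown here.

```java
import java.util.*;

// Filter stop-words and apply simple right-hand truncation of search terms.
class TermProcessor {
    // An illustrative stop-list; production engines often generate these
    // automatically from term frequencies.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

    static List<String> removeStopWords(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String t : terms) {
            if (!STOP_WORDS.contains(t.toLowerCase())) kept.add(t);
        }
        return kept;
    }

    // Truncation: does the indexed word begin with the (truncated) term?
    static boolean truncatedMatch(String term, String word) {
        return word.toLowerCase().startsWith(term.toLowerCase());
    }
}
```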

Search engines are of little use if an overview of the indexed data is wanted, unless they are
able to categorize the data and present that data as a table of content. Automatic categorization
may also be used to focus in on the right sub-topic after having received too many documents. If
information about when a particular URL is due for indexing is available, it is useful to make it
accessible to the user.

4.2.3 Search features

The user query and the search result interfaces are often sadly confusing and unpredictable;
it has been argued that the text-search community would greatly benefit from a more consistent
terminology. Since we do not yet have this concordance, evaluation of the search features must
be done with great care: different vendors use different names for the same feature, or the same
name for different features.

Though Boolean-type search language is often offered, most users do not feel comfortable
with Boolean expressions. Instead, studies have shown that the average user only enters 1.5
keywords. Due to the vocabulary problem, the user is likely to receive many irrelevant
documents as a result of a one-keyword search. Natural language queries have been shown to
yield more search terms and better search results, even when performed by skilled IR personnel.

Apart from Boolean operators, a number of more or less sophisticated options (e.g. full text
search, fuzzy search, require/exclude, case sensitivity, field search, stemming, phrase
recognition, thesaurus, or query-by-example) are usually offered. One feature to look for in
particular is proximity search, which lets the user search for words that appear relatively close
together in a document. Proximity search capability has been noted to have a positive influence
on precision.
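Proximity search relies on the word positions stored in the index. A minimal sketch follows, with illustrative names; here the positions are computed directly from text rather than read from an index.

```java
import java.util.*;

// Proximity search over word positions: two terms match when they occur
// within a given number of words of each other in the same document.
class ProximitySearch {
    // Record each word's positions in the document.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> pos = new HashMap<>();
        String[] words = text.toLowerCase().split("\\W+");
        for (int i = 0; i < words.length; i++) {
            pos.computeIfAbsent(words[i], k -> new ArrayList<>()).add(i);
        }
        return pos;
    }

    // True when some occurrence of a and some occurrence of b are within
    // the given window of words of each other.
    static boolean near(Map<String, List<Integer>> pos, String a, String b, int window) {
        for (int i : pos.getOrDefault(a, List.of())) {
            for (int j : pos.getOrDefault(b, List.of())) {
                if (Math.abs(i - j) <= window) return true;
            }
        }
        return false;
    }
}
```

The same positional information supports the phrase searching and match highlighting mentioned in Chapter 3.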

Many organizations prefer to have a common “company look” on all their intranet pages.
This requires customization that may include anything from changing a logo to replacing entire
pages or chunks of code. Again, this is an aspect irrelevant to public search services but
something an intranet search engine might benefit from. Sometimes a built-in option allows the
user to choose a simple or an advanced interface. It should also be possible to customize the
result page.

The user could be given the opportunity to select the level of output, e.g., by specifying
compact or summary. Further, search terms may be highlighted in the retrieved text, the
individual word count can be shown, or the last modification date of the documents may be
displayed. It can also be possible to restrict the search to a specific domain or server, or to search
previously retrieved documents only. For the latter, relevance feedback is a very important way
to improve results and increase user satisfaction.

Ranking is usually done according to relevancy of some form. However, the true meaning of
the ranking is normally hidden from the user, and only presented as a number or percentage on the
result page. More sophisticated ways to communicate this important information to the user have
been developed, but not many of the commercially available products have yet incorporated such
features. However, the possibility to switch between relevancy and date is often supported.

Dividing the results into specific categories might help the user to interpret the returned result.
Finally, ensure the product comes with good and extensive online user documentation.

4.2.4 Operation and maintenance

Hosting a search service requires considerations not necessary when using a public search
engine. For example, operations and maintenance issues are of no importance to public search
engine evaluations, but for an internal search service, they are of course highly interesting.

Start by checking if the product is available on many platforms or if it requires the
organization to invest in new and unfamiliar hardware. If the intranet consists of one server only,
a spider is not needed, but as the web grows, crawling capabilities become essential. A spider
allows the net to grow without forcing the webmasters to install indexing software locally. An
intranet search engine often runs on a single machine and is operated and maintained by people
with knowledge about servers, but not necessarily experts in spider technology. This suggests
that a good intranet spider should be designed specifically for an intranet and not just be a ported
version of an Internet spider. Still, the spider and the index must be able to handle large amounts
of data without letting response times degrade or the users will be upset. For example, a product
that can take advantage of multi-processor hardware scales better as the intranet grows. The
product should therefore have been tested to handle an intranet of the intended size. Running the
spider should not interfere with how the index is operated. Both these components need to be
active simultaneously.

It was found that great differences exist in how straightforward the products were to install,
set up, and operate. Some required an external HTTP server while others had a built-in web server.
The latter were consistently less complicated to install. However, installation is probably
something done once while indexing and searching is done daily. This ratio suggests that
indexing and searching features should be weighed higher than installation routines.

It is difficult to estimate data collection time since it depends on the network, but during the
test installation, this activity should be clocked. Also, try to determine how query response times
grow with the size of the index. If an index in every city, state, or country where the organization
is represented is wanted, ensure the product supports this kind of distributed operation, and check
whether any bandwidth-saving technique is used.

Having technical support locally is an advantage only if the local support also has local
competence; if questions have to be sent to a lab elsewhere, the advantage is lost. An important
feature is the ability to automatically detect links to pages that have been moved or removed. If
dead links cannot be detected automatically, the links should at least be easy to remove,
preferably by the end-user. Allowing end-users to add links is a feature that will off-load the
administrator. Functions such as email notification to an operator, should any of the main
processes die, and good logging and monitoring capabilities are features to look for. We found
that products with a graphical administrator interface were more easily and intuitively handled,
though the possibility of operating the engine via command line may sometimes be desired. It
should also be possible to administer the product remotely via any standard browser.
Documentation should be comprehensive and accurate.

Finally, consider the price: is it a fixed fee, or is it correlated to the size of the intranet? In
addition, what kind of support is offered, and at what cost? Sometimes installation and training
are included in the price. How long the products have been available and how often they are
updated are important factors that indicate the stability of the product, and it is also important to
ask about future plans and directions.

Chapter 5
Deploying an Effective Intranet Search Engine
A search engine is often the first method used to find a page, and yet most users suffer
frustration and failure. More still are put off by the complexity of the search engine and the
confusing manner in which the results are displayed. An effective search tool on an intranet can
make an enormous difference to its usability. In fact, usability expert Jakob Nielsen found that
"Poor search was the greatest single cause of reduced usability across intranets". A good search
engine ensures that users find what they're looking for the first time, regardless of the format or
location of the information. This means that a wide variety of information can be effectively
dispersed and made available to staff, without the need for complex navigation systems or filing
conventions. Most intranets evolve over time, and implementing search functionality need not be
a daunting task. A search tool can be implemented quickly, and then refined as the intranet grows
and the needs of the organization change. It is important to recognize that every intranet is
different, with its own objectives, requirements and environment.

A good search engine must:

 Be easy to use.
 Assist users to find the correct information.
 Display results in a meaningful way.
 Help authors to improve the site.

5.1 Use of Search Engine


Search engines are best at finding unique keywords, phrases, quotes, and information buried
in the full text of web pages. Because they index word by word, search engines are also useful for
retrieving large numbers of documents. If you want a wide range of responses to specific queries,
use a search engine.
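The word-by-word indexing just described can be sketched as a minimal inverted index, which maps each word to the set of documents containing it. The page names and contents below are invented purely for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical intranet pages, for illustration only.
docs = {
    "hr-form": "annual leave request form",
    "contract": "supplier contract renewal form",
}
index = build_inverted_index(docs)
print(sorted(index["form"]))  # ['contract', 'hr-form']
```

Answering a keyword query then reduces to a set lookup, which is what makes full-text retrieval over many documents fast.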

Today, the line between search engines and subject directories is blurring. Search engines no
longer limit themselves to a search mechanism alone. Across the Web, they are partnering with
subject directories, or creating their own directories, and returning results gathered from a variety
of other guides and services as well.

Selecting a Search Engine

Before deciding on the type of search engine, we need to determine our technical
requirements. Once this is complete, we can research the currently available engines and build an
effective search engine that caters to our needs.

5.2 Data Sources & File Types

Once we have our objectives clearly defined, we can work out what file formats and
data sources our search engine will need to support. The next step is to list every file type
used in creating the information that we want to share on our intranet. These usually fall into
one of three categories:

1. Unstructured formats
File formats that contain primarily text-based information. These include text files, word
processor files, PDFs, emails and formats used to create most documents. There is no real
structure to these file formats and few relationships exist between elements within them.

2. Semi-structured formats
File formats that contain a mixture of text-based and database information, with a basic
structure. These include file types such as HTML, spreadsheets, and XML. There may be
relationships between elements within these files; however, they are not as rigidly defined
as they are in structured formats, and there may be sections of textual information where
no structure exists.

3. Structured formats
File formats where the information is contained in a well-defined structure, such as a
relational database. Many enterprise systems have a structured architecture, such as ERP
and CRM systems, as well as many legacy databases. An effective intranet search engine
should be able to support the wide range of file types found in the intranet data
repository.
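As a rough sketch, incoming files can be routed into these three categories by extension so that the right parser is applied. The extension lists below are illustrative assumptions, not a complete mapping:

```python
# Illustrative extension lists only; a real deployment would extend these.
UNSTRUCTURED = {".txt", ".doc", ".pdf", ".eml"}
SEMI_STRUCTURED = {".html", ".htm", ".xml", ".xls", ".csv"}

def classify(filename):
    """Assign a file to one of the three broad categories by its extension."""
    ext = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in UNSTRUCTURED:
        return "unstructured"
    if ext in SEMI_STRUCTURED:
        return "semi-structured"
    return "structured or unknown"

print(classify("budget.xls"))  # semi-structured
```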

5.3 Processing of Query

For most intranets, there will be a wide spectrum of users, from very basic all the way
through to highly technical power users. The search function needs to cater for all of these
people, with a simple yet powerful interface that provides options for advanced searching if
required. There should be three steps to the search process, and a range of features work to
streamline each of these steps:

(1) Entering the Query, or asking the initial question,

(2) Getting the Search Results, or receiving the list of found documents back from the search
engine, and

(3) Finding the Right Answer, or examining and refining the search results to find the
information you were looking for.

Step 1: Entering the Query.


When a user enters their query, they should have the option to do this using a natural
language approach; that is, by simply entering the question as they would ask it, such as "What
is the cost of double-deck refrigerators?" There should also be the option to build queries using
Boolean operators, so that users who know exactly what they want can be extremely specific
with their search, for example "returns within 10 words of refrigerator but not freezer".
Building a search engine with a simple user interface that is intuitive for basic users, while also
providing powerful advanced search functionality for more experienced users, is a definite aim
of ours. A good search engine should enable you to group logical chunks of information together
so that searches can be conducted on specific areas of interest.
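The natural-language option described above typically begins by reducing the question to its content words. A minimal sketch, with a deliberately tiny, invented stop-word list:

```python
# Tiny illustrative stop-word list; real engines use much larger ones.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in", "to", "for"}

def normalise_query(query):
    """Strip punctuation and stop words, leaving the content keywords."""
    words = query.lower().replace("?", " ").replace(",", " ").split()
    return [w for w in words if w not in STOP_WORDS]

print(normalise_query("What is the cost of double-deck refrigerators?"))
# ['cost', 'double-deck', 'refrigerators']
```

The surviving keywords are then matched against the index in the usual way.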

Step 2: Getting the Search Results.


If there is specifically defined data, such as legal documents, a high degree of precision may
be required to identify and return specific information. In other situations, however, it may be
better to return a wider range of documents for a given query. The accuracy we require depends
on the role of the search engine and the nature of the data. If we want to make a large volume of
data available on our intranet, providing a fast search engine is important; otherwise, users find
it frustrating to wait for the search engine to bring back the results. With smaller amounts
of data this is less of a concern; it all depends on the volume of data that we intend to make
available on the intranet.

Any good search engine should use some form of intelligent relevancy determination. This is
where the search engine, based on the query entered, makes a judgment about which results will
be the most relevant, and ranks them accordingly.
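In its simplest form, such a relevancy judgment can be a count of query-term occurrences per document. Production engines combine many more signals, but the basic idea can be sketched as follows (the document set is invented for illustration):

```python
def rank_by_hits(query_terms, docs):
    """Order document ids by how often the query terms occur in them."""
    def score(text):
        words = text.lower().split()
        return sum(words.count(term.lower()) for term in query_terms)
    return sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)

# Hypothetical pages, for illustration only.
docs = {
    "policy": "refrigerator returns policy for refrigerator purchases",
    "news": "company news and events",
    "faq": "refrigerator faq",
}
print(rank_by_hits(["refrigerator"], docs))  # ['policy', 'faq', 'news']
```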

Step 3: Finding the Right Answer.


The search process doesn't stop once the user receives the list of results. They then need to
refine and manipulate the results list until they find exactly what they were looking for. There are
many features that can assist in this task, some of which include:

 Document summary information


The display of useful document attributes such as file type, file size, date last changed,
relevancy rating and the number of 'hits' (keywords found) in the document. The display
of an extract of the document, say several lines above and below the first hit, is helpful for
determining the context in which the document has been returned.

 Re-sorting
The ability to re-sort the results list using different criteria, such as title, number of hits,
relevancy, date changed, file type, or any other criteria that makes sense for your
organization.

 Hit-to-hit navigation
The provision of navigation buttons enabling users to go directly to the first hit in the
returned document, and from there to the next or previous hit as required. This means users
avoid having to read through pages of a document before finding the relevant section,
making searching much more efficient.

 Hit highlighting
A familiar concept from searching the web, hit highlighting is when the keywords, or
'hits', in a document are highlighted in a different colour. This feature is often not
available in an intranet search engine, but it really should be: combined with hit-to-hit
navigation, it enables users to immediately see the relevant sections of the document.

 Fast preview
The ability to preview large non-HTML documents in a basic HTML format, without the
need for downloading the whole document. This function enables users to view a few lines
above and below each hit, and then to expand up or down to continue reading.

 Search within
The ability to search within the current set of results, to further narrow them.

Although just some of the features available in intranet search engines, these are the main
features required to ensure that users have the best overall experience. Others that may be
relevant to your organization might include intelligent agents that automatically advise users
when relevant content appears in the data repository, or the ability to save or export search
results.
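Two of the features above, document extracts and hit highlighting, can be sketched together: find the lines around the first hit and wrap every hit in a markup tag (the HTML `<mark>` tag is one common choice, assumed here for illustration):

```python
import re

def snippet_with_highlights(text, term, context=1):
    """Return the lines around the first hit, with every hit highlighted."""
    lines = text.splitlines()
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            extract = "\n".join(lines[start:i + context + 1])
            return pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>",
                               extract)
    return ""

page = "Returns policy\nRefrigerator returns take 14 days\nFreezers differ"
print(snippet_with_highlights(page, "refrigerator"))
```

The same extract can drive the fast-preview feature by expanding `context` on demand.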

5.4 Designing the interface


Take extra time and effort when designing your search pages. They should be clear, easy, and
above all, simple. Don't bother with an 'advanced search' facility: your users won't understand
it.

Behind the scenes

Make your search engine quietly work for the user, to correct their mistakes, and to help
them find the right page. While much of the work of deploying a search engine goes on behind
the scenes, the design of the user interface greatly influences how successful the system will be.
While the interface design must be consistent with the rest of your online material, we
recommend the following guidelines:

5.4.1 Search Page

 Keep it simple
There are two key elements on a search page: a field to enter the search terms, and a
'search' button. There is no reason to make the page any more complex than this.

 Provide hints
A list of tips and examples on the main search page helps users when they first use the
search engine. This list should be written in plain English, and should cover the common
issues and questions.

 No advanced searching
Normal users have enough difficulty with search engines without confronting them with a
complex set of 'advanced search' methods. Users want to quickly find a single page, and
therefore we must design our interface to meet this need.

 Always ‘and’

Few users understand the concept of 'Boolean operators'. Instead, they expect that when
they type in three words, they will be given only those documents that contain all three.
Furthermore, typing in more words should produce fewer hits, not more.

The search engine must therefore default to 'and-ing' the words together. In fact,
eliminate support for Boolean operators altogether, unless there is a clear case that they
will be of value to your users.

 Place the cursor

When the search page is opened, the cursor should already be in the search field (this is
known as 'setting the focus'). This allows the user to simply type in their words and hit
enter. It's a small point, but it took only days for our users to specifically ask us for it.
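The 'always and' rule above can be sketched as returning only the documents that contain every query term, so that adding words narrows the result list. The document set is invented for illustration:

```python
def search_all_terms(terms, docs):
    """Implicit AND: keep only documents containing every query term."""
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        if all(term.lower() in words for term in terms):
            results.append(doc_id)
    return results

# Hypothetical pages, for illustration only.
docs = {
    "hr": "annual leave request form",
    "it": "password reset request",
}
print(search_all_terms(["request"], docs))          # ['hr', 'it']
print(search_all_terms(["request", "form"], docs))  # ['hr']
```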

5.4.2 Result Page

 Make it attractive

A results page should encourage users, not frighten them off with tiny text, difficult
layouts, and hard-to-read fonts. We expect users to spend time browsing through the list
of results, so it is worth spending some extra time making the pages easy on the eye.

 Keep it simple

There are only three things that we need to present for each hit: title (a hyperlink to the
actual page), page summary and ranking. Why, for example, would the user want to
know the size of the page in kilobytes? The less we say for each hit, the easier it is for the
user to scan through the list and find the page they want.

 Make the description meaningful

Ideally, each hit should provide a useful description of the page, obtained from the 'meta'
tags within the page. If this information is not available, we shall provide a brief extract,
highlighting where the search terms are used.

Care must also be taken to ensure that the extract always shows some useful text and not the
standard headings on every page (how many listings have you seen that start with '[Home]
[Contents] [Index] …'?).
Behind the scenes

Effort should be spent „behind the scenes‟ to improve the effectiveness of your search
engine. Most engines have capabilities that, when implemented carefully, will help users to find
the pages they are looking for.

These features must operate transparently, so that the user is not even aware of their impact.
They should simply find the search engine both easy to use and effective.

Fuzzy searching, stemming, and more

Our selected search engine provided a number of powerful searching capabilities:

Fuzzy searching, or 'sounds-like'

There were three closely related options which were essentially designed to find terms which
'sounded like' those entered by the user. In this way, it becomes possible to handle spelling
mistakes and other inconsistencies.
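One classic 'sounds-like' technique is Soundex, which maps similar-sounding words to the same four-character code; the report does not say which algorithm the evaluated product used, so this is a generic sketch of the classic rules:

```python
def soundex(word):
    """Classic Soundex: similar-sounding words share a 4-character code."""
    mapping = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            mapping[ch] = digit
    word = word.lower()
    code = word[0].upper()
    prev = mapping.get(word[0], "")
    for ch in word[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate duplicate codes
            prev = digit
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

At query time, terms are matched on their codes rather than their spelling, so "Rupert" finds pages mentioning "Robert".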

Stemming

This feature takes the terms entered by the user, and tries other combinations of endings. For
example, searching for 'walks' would also find 'walk', 'walking', and 'walked'.

We found this to be very effective, and it eliminated differences in singular versus plural uses
of terms in our pages.
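The behaviour described above can be sketched as a toy suffix-stripping stemmer; real engines use far more careful rule sets such as Porter's, so the suffix list here is an illustrative assumption:

```python
# Toy suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("walks"), stem("walking"), stem("walked"))  # walk walk walk
```

Because 'walks', 'walking', and 'walked' all reduce to the same stem, singular and plural uses of a term collide in the index, as we observed with our engine.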

There are a wide variety of other tools available in modern search engines, beyond those
mentioned above. From our evaluation and study we noted that just because a feature exists, it
doesn't mean it will help the users.

Weightings and rankings

The order in which results are displayed by a search engine is the product of a number of
complex weighting and ranking factors behind the scenes. These vary from engine to engine.
They also have a big impact on how effective the search engine is.

The main aim is to understand our search engine, and to configure it (if required) to
meet our specific requirements. The key is to have the search engine work in a 'transparent' and
understandable way.
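One widely used weighting factor is TF-IDF, which rewards terms that are frequent in a document but rare across the collection. Whether the engine evaluated here uses it is not stated, so this is a generic sketch (the 1 added in the denominator is a common smoothing choice):

```python
import math

def tf_idf(term, doc, collection):
    """Term frequency in the document, discounted by collection-wide rarity."""
    tf = doc.count(term) / len(doc)
    containing = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / (1 + containing))
    return tf * idf

# Hypothetical tokenised pages, for illustration only.
pages = [["form", "leave"], ["form", "contract"],
         ["form", "policy"], ["refund", "leave"]]
# "refund" is rarer than "form", so it carries more weight.
print(tf_idf("refund", pages[3], pages) > tf_idf("form", pages[0], pages))
```

Understanding which factors like this the engine applies is what makes its ranking 'transparent' enough to tune.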

Figure 5.1: Search Engine User Interface

Chapter 6
Conclusion and Future work

6.1 Conclusion
We have discussed the concept of an intranet search engine. In this project, the mechanics of
intranets and search engines were thoroughly examined. Developing a search engine for an
intranet requires thorough research into the organization's needs. In brief, we learned the
following lessons as a result of this project:

- Spend a lot of time identifying your needs, and researching the right search engine.
Choosing the wrong search engine is a costly mistake that is not easy to rectify half way
through a project.

- Keep the interface simple. The search page should have a field to type in and a 'search'
button. Complex interfaces and advanced searches will confuse users: by default, your
search engine should simply do what the users expect.

- Take the time to configure the intelligence 'under the hood'. The search engine should
quietly assist the user to find the desired page (via synonyms, fuzzy searching, and so
forth).

- Track the usage of your search engine, and use this to assess how well it is working. You
should be gathering enough information to allow you to refine the engine's configuration
to better meet user needs.

6.2 Future work
The following modifications and upgrades could be integrated into this project to make it a
better search engine.

(a) Enable better query understanding

Although intelligence can be built in to find the correct word and correct typographical errors,
search engines today still lack the intelligence to understand the semantics, rather than merely
the syntax, of a search query.

(b) A ranking algorithm

Ranks are currently based on the number of occurrences of words in the content and title. Thus
the results are accurate with respect to content. However, this alone is insufficient when the
content searched is not purely document-based, as is the case on the Internet.

(c) Multimedia Search Engine

The current version of our intranet search engine is only capable of searching documents in
text format. This version could be enhanced by supporting searches for various types of files,
including images, audio, and video.

Bibliography

[1] Cynthia P. Ruppel and Susan J. Harrington. Sharing Knowledge Through Intranets: A Study
of Organizational Culture and Intranet Implementation, 2000.

[2] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. An Introduction to
Information Retrieval, Online edition, 2009.

[3] Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan and Alexander Löser. Navigating
the Intranet with High Precision, 2007.

[4] Dick Stenmark. A Method for Intranet Search Engine Evaluations, Proceedings of IRIS22,
1999.

[5] Michael Chen, Marti Hearst and Jason Hong. Cha-Cha: A System for Organizing Intranet
Search Results, 2002.
