

Lightning FAST Enterprise Searches in Sharepoint 2010

Gustavo Vélez


© 2012 KRASIS CONSULTING, S. L. www.campusmvp.net ALL RIGHTS RESERVED. NO PART OF THIS BOOK MAY BE REPRODUCED, IN ANY FORM OR BY ANY MEANS, WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER.

ISBN ELECTRONIC FORMAT: 978-84-939659-1-4

Acknowledgments
In recognition and appreciation, I would like to acknowledge the people involved in making this project possible. Juan Carlos Gonzalez Martin of the Centro de Innovación en Integración (Integration and Innovation Center, CIIN, http://www.ciin.es, Cantabria, Spain) and Fabian Imaz of Siderys (http://www.siderys.com, Montevideo, Uruguay), both SharePoint MVPs, recognized SharePoint experts and amazing friends, agreed to read the text and sift out errors and inconsistencies. And Jose Manuel Alarcón, editor at Krasis, ensured the book's publication and had the patience to hear my complaints throughout the writing of the book and to wait until all problems were solved.

Gustavo Vélez

Table of Contents
ACKNOWLEDGMENTS
TABLE OF CONTENTS
PREFACE

CHAPTER 1: INTRODUCTION
  1.- Search in the IT world
  2.- Short history of FAST
  3.- Positioning of FAST in the Microsoft stack
    3.1.- Windows, SQL, SharePoint, SCOM
    3.2.- Microsoft Search Products
  4.- Some Important Documentation

CHAPTER 2: FAST IN THE CONTEXT OF SEARCH
  1.- Goals of search
  2.- Internet search vs. Enterprise search
  3.- Search terminology and concepts
  4.- FAST versions

CHAPTER 3: ARCHITECTURE AND DESIGN
  1.- Conceptual Design
  2.- Logical Design
  3.- Physical Design
    3.1.- Extra-Small Farm
    3.2.- Medium Farm
    3.3.- Large Farm
    3.4.- Extra-Large Farm
    3.5.- Virtualization of FAST farms

CHAPTER 4: FAST REQUIREMENTS AND INSTALLATION
  1.- Requirements
    1.1.- Hardware Requirements
    1.2.- Software Requirements
    1.3.- Environment preparation
  2.- Manual Installation
    2.1.- Prerequisites
    2.2.- Software Installation
    2.3.- Post-Setup Configuration
      2.3.1.- Single-server FAST Post-Setup Configuration
      2.3.2.- Farm FAST Post-Setup Configuration
  3.- Scripted Installation
    3.1.- Prerequisites
    3.2.- Software Installation
    3.3.- Post-Setup Configuration

CHAPTER 5: CONFIGURATION AND ADMINISTRATION
  1.- Configuration
    1.1.- SharePoint 2010 Content Search Service Application
    1.2.- SharePoint 2010 Query Search Service Application
    1.3.- Certificates
      1.3.1.- Create a new Content Self-Signed Certificate
      1.3.2.- Replace a Content Self-Signed Certificate with a CA Certificate
      1.3.3.- Query Certificate
      1.3.4.- Certificate for Security Trimming
    1.4.- SharePoint 2010 Site Collection Configuration
    1.5.- Content Indexing
  2.- Administration and Configuration with PowerShell
    2.1.- Administration cmdlets
    2.2.- Index Schema cmdlets
    2.3.- Installation cmdlets
    2.4.- Spell Tuning cmdlets
    2.5.- Security cmdlets
  3.- Administration
    3.1.- SharePoint 2010 Central Administration
      3.1.1.- General Administration
      3.1.2.- Crawling Administration
      3.1.3.- Query Administration
      3.1.4.- Reporting and Reporting Administration
    3.2.- Administration Command-line Tools
    3.3.- Backup and Recovery
      3.3.1.- Backup and Restore Prerequisites
      3.3.2.- Configuration Backup and Restore
      3.3.3.- Full Backup and Restore
  4.- Monitoring
    4.1.- FAST Logs
    4.2.- WMI for monitoring
    4.3.- Performance Counters for monitoring
    4.4.- Monitoring Command-line Tools

CHAPTER 6: USER INTERFACE
  1.- FAST Search Center
  2.- WebParts for SharePoint 2010
    2.1.- Search WebPart Gallery
      2.1.1.- Search Box WebPart
      2.1.2.- Core Results WebPart
      2.1.3.- Refinement Panel
    2.2.- Customizing Search WebParts
      2.2.1.- XSLT Transformations
      2.2.2.- Properties Manipulation
    2.3.- Customizing Non-sealed Search WebParts

CHAPTER 7: PROGRAMMING
  1.- Working with the Search API
    1.1.- Administrating FAST Programmatically
    1.2.- Querying FAST Programmatically
      1.2.1.- The Federated Object Model
      1.2.2.- The Query Object Model
      1.2.3.- The Query WebService
    1.3.- Content API for FAST
  2.- Customize The Content Pipeline with the Extensibility Stage
    2.1.- Crawled properties, Managed properties and Crawled property categories
    2.2.- Creating the Logic of the Pipeline Extension
    2.3.- Configuration of the Pipeline Stage
  3.- Adding custom Refinement Panels
  4.- Building Search WebParts providing FQL capabilities

INDEX

Preface
FAST is Microsoft's Enterprise Search solution, and it is quickly taking on a very important role in the company's enterprise server offering. With its integration in the SharePoint 2010 family, FAST offers a scalable, flexible and powerful search server that not only contends with similar commercial software but can pick up the gauntlet and easily surpass any other product.

This book is oriented to technical audiences that need to design, install, configure and customize a FAST Search implementation. More general themes are handled in the first chapters: wide-ranging information about search, the past and future of search, a short history of FAST and explanations of the very specific definitions and concepts used by search engines. Because search is intimately related to human linguistics and to how people organize information, special attention has been given to how the internal algorithms can be interpreted from an information technology perspective rather than from a purely technical point of view.

Installation and configuration are covered in the following chapters. Although the installation procedure follows the traditional friendly installation routines of all Microsoft products, there are some important aspects that must be taken into consideration, especially for an enterprise FAST farm. The different configuration options (SharePoint Central Administration, SharePoint Site Collection Administration, FAST Object Model and FAST PowerShell console) are reviewed to explain the several available ways to adapt the system to the enterprise requirements.

Finally, the default Search User Interface is assessed. Although the SharePoint Search WebParts can be used by both SharePoint Enterprise Search and FAST, the different WebParts are analyzed and their configuration and customization possibilities are described, because they form the main components that day-to-day users will experience.

Customization of the core search engine is one of the points that make FAST different from the SharePoint Enterprise Search engine. In the current FAST version most customizations take place by modifying XML files, but some programming is allowed and is sometimes indispensable to ensure FAST behaves as required. The last chapter of the book deals with programming and customizing the engine and is mainly oriented to developers.

All in all, the book offers a 360-degree view of FAST, and it is intended to be a reference work for those who are curious about FAST and for those who must deal with the server for the first time. And remember: if you cannot find it, it doesn't exist...

Gustavo Vélez

CHAPTER 1

Introduction

Wikipedia defines "Search" as "software for finding information". That is a short, concise definition of something that is becoming indispensable in our information-driven society: how to discover the necessary data and distinguish relevant from irrelevant material. Search is at this moment one of the most important components of any information system. Because computer systems are able to generate huge amounts of information, every day it becomes more difficult (and more expensive) to reach the appropriate information. Search technologies enable us to work in a smarter way, reusing the data that already exists. On the other hand, our society is also becoming more and more dependent on search technologies, which makes the knowledge society reliant on search services and their quality in order to work correctly. In other words, if you cannot find it, it doesn't exist.

1.- SEARCH IN THE IT WORLD


Search is not a new issue in the IT world. Ever since computers began saving data electronically, it has been necessary to get the correct records back. Theoretical work started as early as 1945, when Vannevar Bush, Director of the Office of Scientific Research and Development in the USA, stressed at the end of the Second World War the necessity of creating an information device (which he called a "Memex") to provide a memory storage and retrieval system without limits, flexible and associative. Gerard Salton from Harvard University is considered the father of modern search technologies. After the publication of his book "A Theory of Indexing", where base concepts such as Document Frequency, Term Frequency, Term Discrimination and Relevancy were defined, the mathematical and theoretical foundations for search algorithms were in place.

With the creation of ARPANet in 1969 and the start of the Internet as we currently know it in 1993, the necessity for a search mechanism became urgent. In 1990 the first search engine was created: Archie, built by Alan Emtage, a student at McGill University in Montreal. Archie was merely a script-based data collector that used regular expressions to retrieve file names matching the user queries. Because Archie was a big success, new search systems started to appear to fill the gaps it left. Veronica, created at the University of Nevada, served the same purpose as Archie and was also able to index the content of plain text files. In a short time Jughead, a clone of Veronica, was created with a more advanced user interface. At this time Gopher and FTP were the main transfer protocols used and ARPANet was principally an academic initiative.

On August 6, 1991 Tim Berners-Lee at CERN created the first page using the WWW protocol; at the same time, the Virtual Library (http://vlib.org/, still existing), the first and oldest catalog of sites, went online. Very soon the first crawlers were implemented, and in June 1993 Matthew Gray presented the "World Wide Web Wanderer", initially to measure the number of active web servers, but soon giving rise to "Wandex", the first database created to capture URLs. By the beginning of 1994 the Internet had three search engines: the "World Wide Web Worm", JumpStation and RBSE (the "Repository Based Software Engineering" spider). The only one that had a ranking mechanism was RBSE. The other two listed their results as they were found, without any discrimination, making them impossible to use once the WWW grew exponentially. In 1993 Excite was also born, the first search engine that used statistical analysis and word relationships to improve the search mechanism.
Excite was a huge success and was sold in 1999 for $6.5 billion (and sold again in 2001 for $10 million, after the Internet crash). 1994 also saw the birth of Yahoo! (David Filo and Jerry Yang), initially as a collection of web pages that shortly after its creation made the jump to commercialization in the model that we know today. Lycos, Hotbot and Altavista went online the same year, marking the change from web page catalogues to crawled search mechanisms and enabling new technologies such as natural language queries. All of these search engines eventually became irrelevant for technical, financial and management reasons.

Finally, in 1998 Larry Page and Sergey Brin launched Google at Stanford University, based on their earlier work BackRub. The same year Microsoft put MSN Search online, and in 2009 Microsoft announced Bing, using its own search technologies (MSN Search was based mainly on Yahoo!, Overture, Looksmart and Inktomi).

Although web search is very important, enterprise search is occupying a prominent role in the search market. Currently all the big software companies (IBM, Google, EMC, SAP and, of course, Microsoft) have one or more enterprise search offerings. Some extra technical information about the similarities and differences between web and enterprise search is analyzed in the second chapter.
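The Term Frequency and Document Frequency concepts mentioned in this section can be made concrete with a few lines of code. The following is only a minimal sketch in Python: the three-document corpus, the whitespace tokenizer and the function names are invented for illustration, and the weighting shown (term frequency multiplied by the logarithm of the inverse document frequency) is one common textbook variant, not necessarily the formula used by any particular engine.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def term_frequency(term, tokens):
    """Share of the document's tokens that are equal to the term."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def document_frequency(term, corpus_tokens):
    """Number of documents in the corpus that contain the term at least once."""
    return sum(1 for tokens in corpus_tokens if term in tokens)


def tf_idf(term, tokens, corpus_tokens):
    """Term frequency weighted by the log of the inverse document frequency."""
    df = document_frequency(term, corpus_tokens)
    if df == 0:
        return 0.0
    idf = math.log(len(corpus_tokens) / df)
    return term_frequency(term, tokens) * idf


# Hypothetical three-document corpus, invented for this illustration.
corpus = [
    "fast search for sharepoint",
    "fast food and fast cars",
    "enterprise search with sharepoint",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
```

In this toy corpus a term that appears in only one document, such as "enterprise", receives a higher weight in that document than a term shared by several documents, such as "search". That is the intuition behind term discrimination: rare terms separate documents better than common ones.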


Search seems to be a static world from the perspective of the users, but it is a very dynamic world from the technical and business perspective. On the web search front, the battle between Google and Bing is becoming legendary: the underdog against the huge establishment. On the enterprise search front, the roles are more equally distributed, with FAST gaining momentum, especially because of the growing influence of Microsoft in the business world.

Technically the future is completely open. Currently search is still essentially about finding primary topics or noun phrases: a person's name, a city, a product and so on. The future of search should be about finding verbs, in what Microsoft calls the "decision engine" (as opposed to the "search engine"): search will try to give the user the knowledge to complete tasks, doing the initial computational discerning automatically. New classes of information are also becoming more important: social network data, for example, or location data, and the interconnection between all layers of information.

Currently a user normally searches for a term or a number of terms, such as "fast search", and the result is a mash-up of information that has something to do with "fast" (any kind of fast: fast food, fast cars, FAST search) and "search". As search engines become more "intelligent", they should add other layers of information, for example the kind of user ("user is an IT pro") and the current geographical location ("user is at the office"), and filter the results to show a much more relevant and usable set of information. Additionally, the search engine could prepare the information in report form, delivering it directly in Word format, for example. The search engine should become part intelligent software and part assistant, and less of a mere information reader. The progression of search is from mere data to useful information to knowledge that answers questions.
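The idea of layering context on top of a plain term query can be sketched in a few lines. The example below is purely illustrative and does not describe how FAST or any other engine works: the `Document` and `SearchContext` types, the `audience` and `location` tags and the sample documents are all hypothetical names and data invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Document:
    title: str
    audience: str  # hypothetical tag, e.g. "it-pro" or "consumer"
    location: str  # hypothetical tag, e.g. "office" or "anywhere"


@dataclass(frozen=True)
class SearchContext:
    audience: str
    location: str


def matches_terms(doc: Document, query: str) -> bool:
    """True when every query term appears as a word in the title (case-insensitive)."""
    words = doc.title.lower().split()
    return all(term in words for term in query.lower().split())


def search(docs, query: str, context: Optional[SearchContext] = None):
    """Plain term matching first; an optional context then narrows the result set."""
    hits = [d for d in docs if matches_terms(d, query)]
    if context is None:
        return hits
    return [
        d for d in hits
        if d.audience == context.audience
        and d.location in (context.location, "anywhere")
    ]


# Hypothetical documents, invented for this illustration.
DOCS = [
    Document("FAST Search Server deployment guide", "it-pro", "office"),
    Document("Fast food search near you", "consumer", "anywhere"),
    Document("Fast cars search listings", "consumer", "anywhere"),
]
```

Calling `search(DOCS, "fast search")` returns all three documents, the mash-up described above, while passing `SearchContext("it-pro", "office")` as the third argument leaves only the deployment guide.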

2.- SHORT HISTORY OF FAST


Until Google provoked a landslide in the search world in 2002, Microsoft was not really aware of the importance of search for the IT industry. Until then, Microsoft used different third-party technologies for web search and had one "enterprise" search engine used locally for Windows and for some of its servers (namely Search Server for SharePoint). Gartner's Magic Quadrant for Information Access reflects this position in its 2006 report, as Figure 01 shows: Microsoft is nowhere to be found in the diagram.


Figure 01.- Gartner Magic Quadrant for Information Access 2006

In 2006 Microsoft stepped up its search strategy for the following years, announcing that search would be of vital importance for the company and all its servers. Three years later, the Gartner Magic Quadrant showed a very different panorama, as shown in Figure 02: Microsoft is in the most important part of the diagram, the "Leaders" quadrant. That was possible thanks to the acquisition of FAST in 2008.

Figure 02.- Gartner Magic Quadrant for Information Access 2009


From that date until the present day, Google and Microsoft FAST have remained in approximately the same position in the quadrants. Google remains the leading player in the web search market, and Microsoft is very busy converting the base technologies originally used by FAST to Microsoft technologies and integrating FAST in the Microsoft Stack, principally SharePoint.

FAST was originally a Norwegian company focused on enterprise data search technologies and their application. Microsoft bought the company on April 24, 2008. FAST was born at the desks of the Department of Computer and Information Science of the Norwegian University of Science and Technology (NTNU) in 1997 and launched the first version of the engine in 1999. Initially FAST had versions for web and enterprise search, but in 2003 the company decided to focus exclusively on enterprise search. At the beginning of 2004 FAST launched the FAST Enterprise Search Platform (FAST ESP). The following year FAST established its reputation in the enterprise search world as probably the best and technologically most advanced engine on the market, and FAST appeared in the "Leaders" quadrant of the Gartner Magic Quadrant for Information Access Technologies for a number of years in a row. Nevertheless, FAST was almost never financially profitable and legal problems troubled the company continuously, culminating in the suspension of trading of FAST shares on the Oslo Stock Exchange in December 2007. On January 8, 2008 Microsoft announced the acquisition of FAST Search & Transfer for $1.2 billion, creating a separate division in the company to house FAST.
FAST ESP was probably the technological leader among enterprise search engines, offering Contextual Insight (a group of technologies that add linguistic and statistical analytics to improve search precision), semantic indexes (to recognize and retain the inherent structure of documents), entity metadata, taxonomic navigation, faceted browsing and entity discovery (to extract textual entities from the results of previous searches), among other advances. Originally, FAST ESP was a platform-agnostic system: it could be installed on Windows, Unix and Linux systems, both 32-bit and 64-bit, and it was written in Java, PHP and Python. It had its own proprietary administration interface, user interface, alerting system, connector mechanism and various other subsystems, but it was possible to integrate the query and results in SharePoint 2007 using WebParts.

Since FAST was bought by Microsoft, the main change in the server has been the attempt to integrate its code base with the Microsoft toolset and make it work smoothly with the rest of the Microsoft Stack. That means Java and Python code has been changed to Microsoft .NET-compatible technologies, SQL is used extensively and SharePoint 2010 is becoming the default interface.

3.- POSITIONING OF FAST IN THE MICROSOFT STACK


Although Microsoft currently has different search engines and versions, FAST is its most powerful engine and its preferred enterprise offering. Like each of its enterprise servers, FAST is part of an ecosystem and cannot work as a stand-alone product. FAST relies on Windows as operating system, SQL Server as its repository mechanism and SharePoint as user and administration web interface. In addition, products such as Microsoft System Center Operations Manager (SCOM) can be used to monitor the availability, performance, configuration and security of FAST, and Microsoft Forefront Threat Management Gateway (Forefront TMG) would be necessary to protect FAST from outside threats. Other Microsoft products such as IIS may be necessary as underlying services for one or more of the servers.

3.1.- Windows, SQL, SharePoint, SCOM


Originally, FAST was developed as a platform-agnostic system that could be installed on Windows, Unix or Linux systems. Being the key Microsoft search technology means that it is now specifically targeted at Windows, namely Windows Server 2008 (64-bit) and up. FAST 2010 can be installed only on Windows (FAST ESP is considered legacy software and is no longer supported). FAST doesn't demand special conditions from the operating system; the requirements are more hardware oriented, as explained in the design chapter.

A SQL Server is required by FAST to maintain the configuration information. SQL Server 2008 and up (64-bit) can be used, and FAST should require only a modest part of the database server's performance and capacity. The data necessary for indexing is not stored in the database.

SharePoint 2010 is the user interface and administration interface of FAST and is therefore necessary to run FAST properly; but FAST is independent of SharePoint and could (and should) be installed on separate servers. Both standalone and farm installations of SharePoint 2010 can be used, and an Enterprise license of SharePoint 2010 is indispensable. If document preview is desired, to see thumbnails of Microsoft Office Word and PowerPoint documents in the search results from FAST, Microsoft Office Web Apps must be installed on the SharePoint servers.

SCOM is not required for the normal working of FAST, SQL Server or SharePoint, but as Microsoft's strategic operations management product, SCOM is the recommended monitoring solution for FAST. FAST supports a number of monitoring services that provide data through standardized Windows interfaces; SCOM can consume this data, giving the required protection from the operations perspective.

3.2.- Microsoft Search Products


As of 2011, Microsoft has a variety of search offerings, ranging from the low-cost, low-functionality Search Server Express to the high-end FAST:

Microsoft SharePoint Foundation 2010 Search. Integrated in SharePoint Foundation 2010, it allows search scoped to single SharePoint Site Collections and cannot crawl external data sources. It has no administration User Interface and all configuration happens automatically. Scales to approximately 10 million items (using SQL Server) for each search server.

Microsoft Search Server Express. Free product that allows search over enterprise content. Can crawl external data sources (web sites, file shares, Exchange, Lotus Notes) and can federate query results from any OpenSearch system. Deployment is limited to one server and can use SQL Server Express (300,000 search items) or SQL Server (10 million search items).

Microsoft Search Server 2010. Provides almost the same search functionality as Microsoft SharePoint Server 2010 and can be deployed across multiple servers for redundancy and increased capacity and performance. Supports multiple crawl servers and query servers and scales to approximately 100 million items.

Microsoft SharePoint Server 2010. The search engine embedded in SharePoint Server 2010, making use of all the social networking and managed taxonomy features of SharePoint: indexing of the people Profile database, search in MySites, use of user-generated tags, managed taxonomy to influence ranking, etc. Scales to approximately 100 million items, can be installed on multiple servers and can be used in multi-tenant hosting environments.

Microsoft FAST Search Server 2010 for SharePoint. Includes all the search features of the other Microsoft search systems (except the Social features of SharePoint 2010), adding almost unlimited scalability and performance. Content processing is much more flexible and customizable. FAST comes in three different versions:

o FS4SP: FAST Search for SharePoint 2010, the version packaged for the SharePoint 2010 environment
o FSIA: FAST Search for Internal Applications, aimed at organizations that must crawl internal content
o FSIS: FAST Search for Internet Sites, which allows crawling of online information

It is important to note that the differences between the versions are merely a licensing issue: the kernel engine and the functionality are the same in all versions. FAST ESP is considered legacy software and is no longer available, but customers who currently have Maintenance & Support contracts can upgrade to either FSIA or FSIS.


Although little is known about the technology behind Bing, the web search engine of Microsoft, it is indisputable that some aspects of Bing are directly related to FAST. MSN Search (and later Windows Live Search), the web search engine before Bing, used mixed technologies from AltaVista, Yahoo! and Inktomi. Bing uses query suggestions and related searches based on semantic technology from Powerset, a company purchased by Microsoft in 2008, but its search algorithms are proprietary and closely guarded.

A question that always proves difficult is which is the right choice: SharePoint Search or FAST Search? The answer is always specific to the data landscape of the organization and to the needs of its users. A quick way to differentiate between the two products is to look at the required search capacity and the need for customization. SharePoint Search has a theoretical limit of 100 million search items, but the real-life limit should be expected to be much lower. The theoretical limit of FAST is 500 million items, but with the right hardware and topology it can go beyond this figure.

Customization is the second criterion, but it could be the most important one. The search engine of SharePoint cannot be modified or adapted, meaning that it is not possible to adapt the ranking mechanism or the indexing and querying tools. FAST allows many customizations, making it much more flexible and adaptable to enterprise requirements.

In any case, choosing between SharePoint Search and FAST must follow the indispensable steps of any design: gain a full understanding of the business requirements, understand the data background in terms of quality, format and volume, and capture the search needs of the customer. The analysis of these factors should indicate the right technology to use and provide estimates of the costs, the risks and the cost of ownership.

4.- SOME IMPORTANT DOCUMENTATION


Information about FAST is becoming better and more accessible. Because of the closed nature of FAST ESP, the FAST version that existed before Microsoft bought the company, it was nearly impossible to find any kind of information about installation, configuration, programming or use. Since the release of the latest version together with SharePoint 2010, the flow of information from Microsoft has improved considerably.

The following list of Microsoft documents is short, but it represents the most important information delivered by Microsoft about FAST for SharePoint 2010. The list is limited to official Microsoft information, but more and more independent information from other sources is slowly appearing on the Internet.

FAST Software Development Kit (SDK) - Probably the most important technical document about FAST. Several parts of this book are based on the information delivered in the SDK, especially the configuration chapter. The SDK can be found online on the TechNet Library site (http://technet.microsoft.com/en-us/library/ee781286.aspx)

Microsoft FAST Search Server 2010 for SharePoint Enterprise Search Evaluation Guide - This evaluation guide is designed to give business decision makers and IT professionals an understanding of the design goals and the details of the enterprise search features provided by Microsoft FAST Search Server 2010 for SharePoint (http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24972)

FAST Search Server 2010 for SharePoint Capacity Planning - This white paper describes the performance and capacity impacts of FAST Search Server 2010 for SharePoint. It includes information about the performance and capacity characteristics of the feature and how it was tested by Microsoft (http://www.microsoft.com/downloads/details.aspx?FamilyID=65B799E3-825C-4398-8CD7-3311D3297997&displaylang=e&displaylang=en)

Download Microsoft FAST Search Server 2010 for SharePoint Trial - 120-day trial version of FAST (http://technet.microsoft.com/en-us/evalcenter/ee424282)

CHAPTER 2: FAST IN THE CONTEXT OF SEARCH

Search is intrinsic to human nature: humans search continuously and, as a consequence, the concept of search is intuitively recognized. The term search is related to the process of finding solutions to as yet unsolved problems. In computers, search is used almost as generally as in the human context: each algorithm searches for the completion of a given task.

1.- GOALS OF SEARCH


Search has been an important part of computing since its very beginning, as the core technique to solve problems. In general, search can be applied to many problems, from solving games to industrial route planning. Chess has an expected search space of about 10^44 possibilities, yet the "Deep Fritz" chess program was able to find the right moves to win against the human world champion in 2006, evaluating about 10 million options per second. Many industrial route planning systems use search to answer shortest-route and quickest-route queries in a fraction of the time that other algorithms need. Search algorithms can be used to solve sequence alignment problems optimally in biology, to guide industrial robots in unknown environments or to find bugs in software (using search patterns very similar to those used to find the most successful strategy to win in chess). In summary, search and search algorithms are used extensively in the real world, although the scope of this book is limited to searching for information in the search space of data stored in computers.



2.- INTERNET SEARCH VS. ENTERPRISE SEARCH


For the purpose of information search, a division is traditionally made between Internet search and Database search. The difference has mostly to do with unstructured information (information whose parts have no formal relationship to each other) versus information that has a more relational character. Lately, however, it has become very clear that both concepts are fusing into something more general: Enterprise search.

Internet search is designed to go across web pages and documents, looking for new and changed information and indexing everything it can find. The engines for this kind of search are made to follow a process (a "Pipeline") that goes from crawling to discover the sources of information, through indexing their content in a structured way (in a database, XML files or any other form), to the mechanism that resolves the user queries and delivers the results.

Database search is an integrated mechanism of every modern database (see for example the "Full-Text Search" functionality of Microsoft SQL Server). Technically speaking, this search mechanism also needs an index, built on one or more columns of a table. The databases that allow this type of search also provide language-specific linguistic components, including word breakers, stemmers and thesaurus files, and allow the use of queries with full-text predicates such as "contains" or "freetext", so that the user can perform a variety of searches (for a single word or for a phrase, for example).

Most of the search products currently in use are large Internet search engines, optimized to crawl web pages and documents, that also use the capabilities of database search engines: these are the Enterprise search engines. They can search both structured and unstructured data sources: web pages and documents are discovered, crawled and indexed in separate indexes. Search results are generated on the fly for the users by querying the indexes in parallel and organizing the results following predetermined rules.
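The three stages of the process described above (crawling, indexing and resolving queries) can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: the `SOURCES` dictionary, the addresses in it and the function names are hypothetical, and a real engine such as FAST works very differently internally.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "sources" standing in for web pages and file shares.
SOURCES = {
    "http://intranet/home": "Welcome to the intranet home page",
    "file://share/report.docx": "Quarterly sales report for the intranet team",
}


def crawl(sources):
    """Crawling: discover every item and yield (address, raw content) pairs."""
    for address, content in sources.items():
        yield address, content


def index(crawled_items):
    """Indexing: map each lowercased word to the addresses that contain it."""
    inverted = defaultdict(set)
    for address, content in crawled_items:
        for word in re.findall(r"\w+", content.lower()):
            inverted[word].add(address)
    return inverted


def query(inverted, text):
    """Querying: return the addresses that contain every word of the query."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    result = set(inverted.get(words[0], set()))
    for word in words[1:]:
        result &= inverted.get(word, set())
    return result


search_index = index(crawl(SOURCES))
print(sorted(query(search_index, "intranet report")))
```

The query is answered from the index alone; the sources are read only once, during the crawl.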
One further difference between web-page search engines (like Bing or Google) and document-oriented search engines lies in the capabilities and speed of document indexing: the second type is built and optimized to open and understand the structure of documents natively (using iFilters, for example). On the other hand, the crawlers of Internet search engines are considerably different from those of document search engines, because they must travel across a completely different environment (IP addresses and WWW technologies).

A main point of distinction is the Relevance of search results. The ranking factors that apply within the context of Enterprise search are very different from those applied to Internet search. Enterprise search cannot take advantage of the very rich link structure found in the hyperlinked content of the WWW. Algorithms that exploit the hyperlink structure to build rankings are therefore better suited to Internet search, whereas Enterprise search must rely on query-independent factors such as document date or popularity.


3.- SEARCH TERMINOLOGY AND CONCEPTS


Search engines use their own concepts and terminology. Because search has a very strong language component, almost all the terminology is borrowed from linguistics and philology.

Authoritative Page: A page designated as more relevant than other pages (for example, the home page of the intranet of an organization). The higher the authoritative level assigned, the higher the ranking of the page in the search results.

Best Bets: A hand-created list of keywords for common queries that can dramatically improve the search experience, particularly on information-rich sites such as intranets. Best Bets are presented prominently at the beginning of the search results, followed by the rest of the matching pages. Implementing Best Bets is an effective way to improve the quality of search results.

Content Source: The set of options that defines specific content to be crawled, including its start addresses. A Content Source for SharePoint can contain up to 500 start addresses.

Crawl: The methodical and automated process used by search engines to find information. Crawlers are the computer programs in charge of crawling. Crawlers mainly create a copy of all the visited web pages and documents for later processing by the search engine, which indexes them to make them searchable. Crawlers are often used to automate other tasks as well, such as link and source code (HTML) validation, or to gather specific types of information such as e-mail addresses. The crawlers are responsible for the freshness of the information that the search engine can use. Because crawling a huge amount of information can take weeks or months, by the time a crawler has finished its crawl many events could have happened (creation, update or deletion of information); for the search engine there is a cost associated with not detecting these events and having outdated copies of the information. Crawlers can also have an impact on the servers that host the information: if the crawlers request huge numbers of web pages or documents from a system, they can cripple its performance. Generally speaking, the architecture of web search crawlers is largely unknown (Yahoo! Slurp, the Yahoo Search crawler; Googlebot from Google; or Bingbot from Bing), but the crawlers for Enterprise search are well documented (the FAST crawler, for example).

Crawl Queue: The data structure that stores the list of items to be crawled.

Crawl Rule: A set of preferences that applies to a specific Content Source and is used to include or exclude items in a crawl.

Crawled Property: A type of metadata that can be discovered during a crawl and applied to one or more items, and that can be promoted to a Managed Property.

Managed Property: A specific property in the metadata schema that can be made available for queries.

Duplicate and Duplicate Result Removal: Refers to identical or nearly identical content that should be removed from the search results.
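The relationship between start addresses, the Crawl Queue and Crawl Rules can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the rule format (a wildcard pattern plus an include flag, first match wins), the URLs and the `links` dictionary that replaces real page downloads. It is not how the SharePoint or FAST crawlers are implemented.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical crawl rules: (wildcard pattern, include?) pairs; first match wins.
CRAWL_RULES = [
    ("http://intranet/private/*", False),  # exclude the private area
    ("http://intranet/*", True),           # include the rest of the intranet
]


def is_allowed(url, rules=CRAWL_RULES):
    """Apply the crawl rules to one URL; anything not covered is skipped."""
    for pattern, include in rules:
        if fnmatchcase(url, pattern):
            return include
    return False


def crawl(start_addresses, links):
    """Breadth-first crawl; `links` maps a URL to the URLs it references."""
    queue = deque(start_addresses)  # the crawl queue
    seen = set(start_addresses)     # avoids queueing the same item twice
    crawled = []
    while queue:
        url = queue.popleft()
        if not is_allowed(url):
            continue
        crawled.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return crawled


LINKS = {
    "http://intranet/": [
        "http://intranet/news",
        "http://intranet/private/hr",
        "http://example.com/",
    ],
    "http://intranet/news": ["http://intranet/"],
}
print(crawl(["http://intranet/"], LINKS))
```

The `seen` set also gives a very crude form of duplicate handling: an address that is linked from several pages is crawled only once.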


Entity Extraction: Seeks to locate and classify elements in text into predefined categories such as names of people, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Faceted Search: A filtering technique to access collections of information that are classified along some common dimensions. It allows users to narrow down the search results. Facets are also known as Navigators or Refiners.

Federation: Allows the simultaneous search of multiple searchable resources. A Federation establishes a collaborative link between different search systems, allowing one system to query other search engines without having to maintain indexes of the external systems, arranging the results from the various sources into a useful form and presenting them to the user. When the search data model of the search system is different from the data model of the foreign target system, the query must first be translated, and the credentials of the user must be passed along to maintain the appropriate security. On the way back, the results need to be mapped from the form of the foreign system to the form of the search engine in order to be rendered to the user. Scalability and performance are always a source of concern in Federation: the query performance and the quality of the results are totally dependent on the foreign search engine.

High Confidence Property: A Managed Property identified as a good indicator of a highly relevant item.

iFilter: A translator that teaches the search engine the structure of the documents to be indexed. Without an appropriate iFilter, the contents of a file cannot be parsed and indexed by the search engine. Windows Indexing Service, MSN Desktop Search, Internet Information Server, SharePoint Server, Site Server, Exchange Server, SQL Server and all other products based on Microsoft Search technology support indexing based on iFilters.
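The idea behind Faceted Search (Refiners) described above can be sketched in a few lines of Python. The result set, the property names and the two helper functions are hypothetical; the sketch only shows the principle of counting the values of a property and then narrowing the results by one of them.

```python
from collections import Counter

# Hypothetical search results, each carrying some metadata properties.
RESULTS = [
    {"title": "Budget 2011", "filetype": "xlsx", "author": "Ana"},
    {"title": "Budget notes", "filetype": "docx", "author": "Ana"},
    {"title": "Budget review", "filetype": "docx", "author": "Luis"},
]


def refiners(results, prop):
    """Count how many results fall under each value of a property."""
    return Counter(item[prop] for item in results)


def refine(results, prop, value):
    """Narrow the result set to the items matching the selected value."""
    return [item for item in results if item[prop] == value]


print(refiners(RESULTS, "filetype"))
print([item["title"] for item in refine(RESULTS, "filetype", "docx")])
```

The counts are what a search page typically shows next to each refiner value; selecting a value simply filters the current result set.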
Index: Indexing is the process of extracting information from the original data source and saving it in a format that the search engine can understand. The index is structured in such a way that the engine can quickly find the information that contains a particular term. Indexing can be a complex process that uses a lot of resources on the search servers. During indexing, not only are the constituent words of the source extracted, but the language, the boundaries of sentences and paragraphs, changes in case and the stems of the words are determined as well. Normally the indexing process runs continuously, so that the complete index is refreshed frequently. For Internet search it is usual to have a limit on the amount of information indexed for each page, and an algorithm decides which sections of the page are relevant enough to be indexed (to prevent overloading the web servers that host very large pages such as technical manuals). For Document search, on the contrary, it is important to index as much information as possible, and normally the limit (if it exists) is very high.
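As an illustration of an index structured so that a term can be found quickly, the following Python sketch builds a small positional inverted index. It is a simplification under stated assumptions: the `max_tokens` parameter is a hypothetical stand-in for the per-page limit mentioned above, and no language detection, sentence boundaries or stemming are handled.

```python
import re
from collections import defaultdict


def build_index(documents, max_tokens=None):
    """Map each term to {doc_id: [positions of the term in that document]}.

    max_tokens is a hypothetical per-document cap, mimicking the limit an
    Internet search engine may place on how much of a page it indexes.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        tokens = re.findall(r"\w+", text.lower())
        if max_tokens is not None:
            tokens = tokens[:max_tokens]
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


docs = {
    "d1": "Search engines index documents",
    "d2": "Engines need an index",
}
full = build_index(docs)
capped = build_index(docs, max_tokens=2)
print(full["index"])
print("index" in capped)
```

With the cap set to two tokens, the word "index" never reaches the index, which is the trade-off a limit of this kind implies.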


Information Extraction: The field of study that attempts to identify semantic structures in order to extract relevant data. It describes the techniques used to develop systems that index and search vast amounts of data effectively. The goal is to extract structured information from unstructured documents automatically.

Inverse Document Frequency (IDF): A measure of how rare a term is in a collection of documents, calculated as the total collection size divided by the number of documents containing the term. Common terms ("the", "and", etc.) have a very low IDF and are often excluded from searches. These low-IDF words are commonly referred to as "stop words".

Keyword: A word used in a query. In web search, Keywords are targeted based on what users are looking for in the HTML of the pages. In Enterprise search, Keywords can be configured to target terms that are relevant for the specific company.

Keyword Density: The percentage of the words in a document that are keywords, relative to the total number of words in the document. The ranking is based on (amongst many other things) the percentage of words on a page that are similar to the words used in the query.

Latent Semantic Indexing (LSI): Also known as Latent Semantic Analysis. An indexing technique that moves the search engine from purely lexical matching toward semantic matching. It uses a mathematical technique (Singular Value Decomposition) to identify patterns and relations between the terms contained in a text. In this way it is possible for a query to return results that do not contain the keywords searched for. Search engines are moving toward LSI to provide results that are closer to human judgment.

Lemmatization: The process of grouping the different forms of a word so that they can be analyzed as a single item. A lemmatization algorithm determines the "lemma" of a word; that means it understands the context of the word and determines its role in a sentence. Following the example given for Stemming, "playing", "player" and "play" should be lemmatized to the lemma "play" as well. The difference with Stemming is that the stemmer has no knowledge of the context of the word in the sentence and therefore cannot discriminate between words that have different meanings depending on their position or use in the sentence. To take a different example, the word "better" should have "good" as its lemma, but a completely different stem. Lemmatization can be very difficult to implement, as it is not only language-dependent but also culture-dependent (a lemma can differ between countries that share the same language).

Link Map: A graph structure of the nodes connected by links in Internet search. The map facilitates fast access to the data, the calculation of the popularity score of a page and the ranking algorithm.
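The Inverse Document Frequency defined above can be computed with a short Python sketch. Note one assumption that goes beyond the definition in the text: the ratio is passed through a logarithm, which is a common convention, so that a term present in every document gets a weight of exactly zero. The four sample sentences are invented for the example.

```python
import math


def idf(term, documents):
    """Inverse document frequency of a term.

    The ratio (collection size / documents containing the term) follows the
    definition in the text; taking the logarithm is a common convention and
    an assumption of this sketch.
    """
    containing = sum(1 for words in documents if term in words)
    if containing == 0:
        return 0.0  # assumption: a term absent from the collection gets no weight
    return math.log(len(documents) / containing)


# Each document is reduced to the set of words it contains.
docs = [set(text.lower().split()) for text in (
    "the crawler reads the page",
    "the index stores the terms",
    "the query hits the index",
    "the pipeline enriches the data",
)]

for term in ("the", "index", "crawler"):
    print(term, round(idf(term, docs), 3))
```

The stop word "the" appears in all four documents and scores zero, while the rarer a term is, the higher its IDF becomes.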


Natural Language Processing (NLP): A system that allows search engine users to type a question rather than keywords. At the simplest level, this can be achieved by making the search engine remove the stop words from the question, leaving keywords that are then processed as if they were a regular query. At the other end of the scale are advanced systems that use statistics and linguistic analysis to match the available indexes accurately to the question of the user.

Partial Word Matching: Some search engines consider not only exact matches but also partial matches. This means that if the search term is contained within a word of a document in the index, the search engine considers the document a match. Strongly related to Lemmatization and Stemming.

Phrase Search: A type of search that allows users to search for documents containing an exact sentence or phrase, rather than single keywords. The important point here is that in a phrase search the words have to appear side by side in the document (exactly as in the query) to be considered a match. If the words appear dispersed, or side by side but in the wrong sequence, the document is not considered a match. Phrase searching can be done on most search engines by simply enclosing the phrase in quotation marks. Anti-phrasing refers to phrases for which there is no value in indexing (for example "What xxx means").

Pipeline: The specially tailored FAST architecture that addresses the challenge of flexibility versus the inherent shortcomings of any search engine. The FAST Pipeline (format conversion, language detection, stemming, entity extraction, lemmatization) allows the introduction of custom plug-ins (stages) to enrich the data to be indexed; for example, the entity extractor can be programmed to recognize entities that are important to an organization.

Polysemy: One word can have several meanings. Polysemy is language-dependent and very difficult to address in algorithms.
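The rule for Phrase Search given above (the words must appear side by side and in the same order) can be checked with a small Python sketch. The function names are hypothetical, and the sketch compares token lists directly; a real engine would use the positions stored in its index instead of scanning the document.

```python
import re


def tokenize(text):
    """Split a text into lowercase words."""
    return re.findall(r"\w+", text.lower())


def phrase_match(document, phrase):
    """True only if the phrase words appear side by side and in the same order."""
    doc_tokens = tokenize(document)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


text = "FAST is an enterprise search engine"
print(phrase_match(text, "enterprise search"))   # words adjacent, right order
print(phrase_match(text, "search enterprise"))   # right words, wrong order
print(phrase_match("enterprise wide search", "enterprise search"))  # dispersed
```

Only the first call reports a match; the other two fail for exactly the two reasons named in the definition.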
Precision and Recall: Strongly related to search accuracy, a simple metric that computes the fraction of instances for which the correct result is returned. Search engines often consider a document a match to a query when that document is not really relevant to the query. These mistakes happen because search engines have to guess what the user means. Search engines must find a balance between recall (the ability to find all relevant documents) and precision (the ability to find only relevant documents). The aim is to retrieve all relevant documents and nothing else. Precision is calculated by dividing the number of relevant pages found by the total number of pages found. For example, if in a collection of 1000 documents 100 documents are found and 60 of them are relevant, the precision of the search engine is 60/100 = 60%. In the same example, if the document collection contains 70 relevant documents but only 60 were found, the Recall is 60/70 = 86%.
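The worked example above can be reproduced with two one-line Python functions. They use the standard definitions (relevant items found divided by items found for precision, relevant items found divided by all relevant items for recall); the function and parameter names are chosen only for this sketch.

```python
def precision(relevant_found, total_found):
    """Fraction of the returned documents that are relevant."""
    return relevant_found / total_found


def recall(relevant_found, total_relevant):
    """Fraction of all relevant documents that were returned."""
    return relevant_found / total_relevant


# Figures from the example: 100 documents found, 60 of them relevant,
# and 70 relevant documents in the whole collection.
print(f"precision = {precision(60, 100):.0%}")
print(f"recall    = {recall(60, 70):.1%}")
```

Returning more documents tends to raise recall and lower precision, which is the balance the definition refers to.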


Promotion / Demotion: Promotion means moving a search result to the top of the result rankings. Demotion is the opposite. Enterprise search engines always offer some configuration for the Promotion and Demotion of terms. Internet search engines have many different security mechanisms to prevent users from promoting or demoting sites in illegitimate ways.

Property Extraction: Allows the extraction of language-specific properties for names (locations, companies, people).

Ranking: The ordering of the search results by Relevance, so that the most relevant ones come first. Relevance ranking mainly refers to the different features and algorithms used to estimate the weight of documents and to sort them appropriately. The most basic retrieval function is a Boolean query on the occurrence of terms in the information. Assuming a query "word1 word2", the Boolean AND query would return all documents containing both word1 and word2 at least once. These documents represent the set of potentially relevant documents: all documents not in this set can be considered irrelevant and ignored. This step usually reduces the number of documents to be considered for ranking, but it does not order the documents in the result set. After that, each document needs to be scored: the relevance of the document must be estimated as a function of its relevance features. Contemporary search engines use hundreds of features as parameters to estimate the Ranking.

Relevance: How closely the search results returned to the user match what the user wanted to find. Ideally, the results returned at the top are the most relevant, so that the user does not have to look through several pages of results to find the best matches for the search. In other words, Relevance describes how well a given search satisfies the information needs of a user. The problem that search relevance addresses is to estimate how pertinent a result is to a query. Commercial search engines combine hundreds of features to estimate relevance. The specific features and the way they are combined are often kept secret to prevent users from manipulating the results. Nevertheless, the main types of features in use, as well as the methods for combining them, are publicly known and are the subject of scientific investigation.

Spelling Suggestions ("Did you mean"): Typing mistakes are very common when users enter search terms. The linguistic capabilities of modern search engines allow the detection of these mistakes and the suggestion of related terms, improving the quality of searches. Spell-checking exceptions can also be defined in FAST: words that are not found in the default spell-checking dictionary but are still valid.

Stemming: The process of reducing words to their stem or root form. An English stemming algorithm should reduce the words "playing", "player" and "play" to the root word "play". Stemming is a challenging algorithmic task and is considered a difficult field of linguistic research. Each language needs its own stemming algorithms; some of them are simpler than others, but the more complicated the morphology and orthography of the language are, the more complex the stemming becomes. Stemming is closely related to Lemmatization. FAST maps one form of a word to its variants to enrich the query results.

Synonyms: Different words with identical or very similar meanings. Depending on language, geographical origin and socio-cultural status, synonyms can carry very different nuances because of etymology, orthography, phonic qualities, ambiguous meanings, usage, etc., which makes each of them unique; this problem makes Synonyms difficult for search engines to process. Normally Synonyms are presented to the search engine as Thesauruses, lists of related words.

Term Frequency (TF): A measure of how often a term is found in a collection of documents. TF is combined with Inverse Document Frequency (IDF) as a means of determining which documents are most relevant to a query. TF is also used to measure how often a word appears in a specific document.

Tokenization: The process of splitting a text into individual words or tokens to be indexed. All separation characters (spaces, commas, dashes, periods, etc.) are considered delimiting characters and are excluded from the indexes. Tokenization is language-dependent and very important for Relevance.
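Several of the concepts above (Tokenization, the Boolean AND query, Term Frequency and Ranking) can be combined in one minimal Python sketch. It is an illustration under stated assumptions, not the ranking model of FAST or SharePoint: the score uses a single feature, a TF multiplied by a log-scaled IDF, where real engines combine hundreds, and the sample documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase tokens; separator characters are dropped."""
    return re.findall(r"\w+", text.lower())


def rank(query, documents):
    """Boolean AND filter followed by a simple TF-IDF score."""
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    terms = tokenize(query)
    # Step 1: keep only the documents that contain every query term.
    candidates = [
        doc_id for doc_id, counts in doc_counts.items()
        if all(term in counts for term in terms)
    ]
    # Step 2: score each candidate as the sum of TF * IDF over the query terms.
    n_docs = len(documents)
    scores = {}
    for doc_id in candidates:
        counts = doc_counts[doc_id]
        length = sum(counts.values())
        score = 0.0
        for term in terms:
            doc_freq = sum(1 for c in doc_counts.values() if term in c)
            score += (counts[term] / length) * math.log(n_docs / doc_freq)
        scores[doc_id] = score
    # Most relevant first.
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "fast search, fast results",
    "b": "search server setup guide for fast search",
    "c": "sharepoint server farm",
}
print(rank("fast search", docs))
```

Document "c" is discarded by the Boolean step because it contains neither term, and "a" is placed before "b" because the query terms make up a larger share of its text.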

4.- FAST VERSIONS


FS4SP: FAST Search for SharePoint 2010 is the FAST version packaged for the SharePoint 2010 environment. Licensing is per Client Access License (CAL)/server.

FSIA: FAST Search for Internal Applications is aimed at organizations looking for a standalone FAST implementation (not integrated with SharePoint) for internal use. It is generally sold on a CAL/server basis.

FSIS: FAST Search for Internet Sites is aimed at online search applications. FSIS is licensed per server.

FAST ESP: Release 5.3, the version that existed when Microsoft bought the product, is the last version of FAST before its integration into the Microsoft server stack. FAST customers who currently have FAST ESP Maintenance & Support can upgrade to either FSIS or FSIA.

Microsoft divides the FAST family into two groups:

Search solutions for Business Productivity:

o Microsoft FAST Search Server 2010 for SharePoint (FS4SP)
o Microsoft FAST Search Server 2010 for Internal Applications (FSIA)

Search solutions for Internet Sites:

o Microsoft FAST Search Server 2010 for Internet Sites (FSIS)
o Microsoft SharePoint Server 2010 for Internet Sites, Enterprise (FIS-E). This product includes rights to Microsoft FAST Search Server 2010 for SharePoint Internet Sites (FS4FIS).

From the Microsoft sales information about FAST:

FSIS and FSIA must be purchased from FAST or FAST resellers. They are offered only from the FAST price list, under a FAST EULA, and FAST maintenance and support options are available.

FS4SP and FIS-E will be available through Microsoft VL only. FAST maintenance and support are not available for these VL products, but Microsoft support and SA are available.

All servers need license coverage, just like for SharePoint or ESP for SharePoint. The appropriate way to achieve license coverage depends on what the server is used for.

Production (includes active and fault-tolerance servers), staging, admin, and hot and warm stand-by servers all require product licensing.

Cold stand-by servers used for disaster recovery do not require product licenses as long as the customer is current on M or SA. This is a benefit of M/SA, and customers who drop M/SA lose this benefit.

Development and testing servers can be covered in a few ways. Under Microsoft VL, customers can choose to cover them with product licenses (server/CAL for FS4SP; server for FIS-E) or via MSDN subscriptions. Under FAST, these rights will be included in the base licenses for FSIS and FSIA.

Each user of FSIA must be covered by a CAL.

Each virtual machine on a physical server counts as a server and requires a separate license. This matches Microsoft licensing for server technology hosted in a virtual environment.

(http://www.microsoft.com/pathways/fast/FAST%20License%20Grants.htm)
