Chap 2

Search Engines
Information Retrieval in Practice
All slides Addison Wesley, 2008
Search Engine Architecture
A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
It describes a system at a particular level of abstraction
The architecture of a search engine is determined by 2 requirements
effectiveness (quality of results) and efficiency (response time and throughput)
Indexing Process
Text acquisition
identifies and stores documents for indexing
Text transformation
transforms documents into index terms or features
Index creation
takes index terms and creates data structures (indexes) to support fast searching
Query Process
User interaction
supports creation and refinement of query, display of results
Ranking
uses query and indexes to generate ranked list of documents
Evaluation
monitors and measures effectiveness and efficiency (primarily offline)
Details: Text Acquisition
Crawler
Identifies and acquires documents for search engine
Many types: web, enterprise, desktop
Web crawlers follow links to find documents
Must efficiently find huge numbers of web pages (coverage) and keep them up to date (freshness)
Single site crawlers for site search
Topical or focused crawlers for vertical search
Document crawlers for enterprise and desktop search
Follow links and scan directories
Text Acquisition
Feeds
Real-time streams of documents
e.g., web feeds for news, blogs, video, radio, TV
RSS is a common standard
An RSS reader can provide new XML documents to the search engine
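To make the feed step concrete, here is a minimal sketch of an RSS reader that turns feed items into documents for indexing. The sample feed, the example.com URLs, and the `read_feed` helper are assumptions made for illustration; a real reader would fetch the XML over HTTP and handle more of the RSS format.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 payload; a real reader would download this from a feed URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <description>Text of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <description>Text of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return one dict per <item>, ready to hand to the indexing process."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return docs


feed_docs = read_feed(SAMPLE_FEED)
```

Each dictionary in `feed_docs` plays the role of a newly acquired document.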
Conversion
Convert variety of documents into a consistent text plus metadata format
e.g., HTML, XML, Word, PDF, etc. converted to XML
Convert text encoding for different languages
Using a Unicode standard like UTF-8
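As a small illustration of encoding conversion, the sketch below re-encodes bytes from a known source encoding into UTF-8. The helper name `to_utf8` is hypothetical, and the sketch assumes the source encoding is already known; detecting it is a separate problem.

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")


# "caf\u00e9" is the word "cafe" with an acute accent on the final letter.
latin1_bytes = "caf\u00e9".encode("latin-1")   # one byte for the accented letter
utf8_bytes = to_utf8(latin1_bytes, "latin-1")  # two bytes for the accented letter
```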
Text Acquisition
Document data store
Stores text, metadata, and other related content for documents
Metadata is information about a document, such as type and creation date
Other content includes links, anchor text
Provides fast access to document contents for search engine components
e.g., result list generation
Could use relational database system
More typically, a simpler, more efficient storage system is used due to huge numbers of documents
Text Transformation
Parser
Processing the sequence of text tokens in the document to recognize structural elements
e.g., titles, links, headings, etc.
Tokenizer recognizes "words" in the text
must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
Markup languages such as HTML, XML often used to specify structure
Tags used to specify document elements
e.g., <h2>Overview</h2>
Document parser uses syntax of markup language (or other formatting) to identify structure
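The parser and tokenizer described above can be sketched with Python's standard `html.parser` module. The class name `StructureParser`, the set of tracked tags, and the tokenization rule (lowercase, split on anything that is not a letter or digit) are assumptions chosen for illustration; real tokenizers treat hyphens and apostrophes with more care.

```python
import re
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects (element, text) pairs for a few structural tags."""

    TRACKED = {"h1", "h2", "a"}  # headings and links; an illustrative subset

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Record text only when it sits inside a tracked element.
        if self.open_tags and data.strip():
            self.elements.append((self.open_tags[-1], data.strip()))


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


parser = StructureParser()
parser.feed("<html><body><h1>Search</h1><h2>Overview</h2><p>Body text</p></body></html>")
structure = parser.elements   # [('h1', 'Search'), ('h2', 'Overview')]
tokens = tokenize("Tropical fish-keeping: it's EASY!")
# tokens == ['tropical', 'fish', 'keeping', 'it', 's', 'easy']
```

The token list shows why hyphens and apostrophes are listed as issues: this simple rule splits "fish-keeping" in two and leaves a stray "s" from "it's".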
Text Transformation
Stopping
Remove common words
e.g., and, or, the, in
Some impact on efficiency and effectiveness
Can be a problem for some queries
Stemming
Group words derived from a common stem
e.g., computer, computers, computing, compute
Usually effective, but not for all queries
Benefits vary for different languages
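Stopping and stemming can be sketched in a few lines. The stopword list below contains only the slide's four examples, and the suffix-stripping rule is a deliberately crude stand-in written for this example; it is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}  # the slide's examples; real lists are longer

# Crude suffix list, longest first; for illustration only.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(tokens):
    """Drop stopwords, then stem what remains."""
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = transform(["the", "computers", "and", "computing", "compute", "computer"])
# terms == ['comput', 'comput', 'comput', 'comput']
```

All four word forms from the slide collapse to the same index term, which is the grouping effect stemming aims for.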
Text Transformation
Link Analysis
Makes use of links and anchor text in web pages
Link analysis identifies popularity and community information
e.g., PageRank
Anchor text can significantly enhance the representation of pages pointed to by links
Significant impact on web search
Less importance in other applications
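As an illustration of link analysis, here is a minimal power-iteration sketch of PageRank on a toy link graph. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are common textbook choices assumed here, not details taken from these slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to C, nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph page C ends up with the highest score because it has the most incoming links, and D has the lowest because no page links to it.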
Text Transformation
Information Extraction
Identify classes of index terms that are important for some applications
e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
Classifier
Identifies class-related metadata for documents
i.e., assigns labels to documents
e.g., topics, reading levels, sentiment, genre
Use depends on application
Index Creation
Document Statistics
Gathers counts and positions of words and other features
Used in ranking algorithm
Weighting
Computes weights for index terms
Used in ranking algorithm
e.g., tf.idf weight
Combination of term frequency in document and inverse document frequency in the collection
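The tf.idf idea can be sketched as follows. There are many tf.idf variants; this one assumes length-normalized term frequency and idf = log(N / n_t), where N is the number of documents and n_t the number of documents containing the term. The function name and the two toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: doc_id -> non-empty list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


toy_docs = {
    "d1": ["tropical", "fish", "tank"],
    "d2": ["fish", "fish", "food"],
}
weights = tf_idf(toy_docs)
```

Under this variant "fish" gets weight 0 in both documents because it occurs in every document, while "tropical" gets a positive weight in d1.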
Index Creation
Inversion
Core of indexing process
Converts document-term information to term-document for indexing
Difficult for very large numbers of documents
Format of inverted file is designed for fast query processing
Must also handle updates
Compression used for efficiency
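Inversion can be illustrated with a small in-memory sketch that turns document-term information into term-document postings. The structure below (term mapped to document ids and word positions) is one plausible layout assumed for illustration; it ignores the scale, update, and compression issues the slide mentions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: doc_id -> list of tokens. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id]):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


collection = {
    1: ["tropical", "fish", "tank"],
    2: ["fish", "food", "fish"],
}
inverted = build_inverted_index(collection)
# inverted["fish"] == {1: [1], 2: [0, 2]}
```

A query for "fish" now needs one dictionary lookup instead of a scan over every document.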
Index Creation
Index Distribution
Distributes indexes across multiple computers and/or multiple sites
Essential for fast query processing with large numbers of documents
Many variations
Document distribution, term distribution, replication
P2P and distributed IR involve search across multiple sites
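Document distribution can be sketched as hash partitioning: each document id is assigned to one shard, and each shard would then build its own index. The use of CRC32 as the hash and the helper names are assumptions for this example; real systems choose partitioning schemes based on load and data size.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Stable hash partitioning: the same doc_id always maps to the same shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Group document ids by the shard that will index them."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = [f"doc{i}" for i in range(10)]
shards = distribute(doc_ids, 3)
```

Term distribution would instead partition by index term, so that each machine holds the complete postings for a subset of the vocabulary.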
User Interaction
Query input
Provides interface and parser for query language
Most web queries are very simple, other applications may use forms
Query language used to describe more complex queries and results of query transformation
e.g., Boolean queries, Indri and Galago query languages
similar to SQL language used in database applications
IR query languages also allow content and structure specifications, but focus on content
User Interaction
Query transformation
Improves initial query, both before and after initial search
Includes text transformation techniques used for documents
Spell checking and query suggestion provide alternatives to original query
Query expansion and relevance feedback modify the original query with additional terms
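Two of the query transformations named above can be sketched with the standard library. Spelling suggestion here uses `difflib.get_close_matches` against a hypothetical index vocabulary, and query expansion simply appends related terms supplied by the caller; where such related terms come from (a thesaurus, relevance feedback, query logs) is outside this sketch.

```python
import difflib

# Hypothetical vocabulary taken from an index.
VOCABULARY = ["tropical", "fish", "aquarium", "tank", "food"]


def suggest_spelling(query, vocabulary=VOCABULARY):
    """Replace each unknown word with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
        else:
            matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.7)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


def expand_query(query, related_terms):
    """Query expansion: append related terms that are not already in the query."""
    words = query.lower().split()
    extra = [t for t in related_terms if t not in words]
    return " ".join(words + extra)


suggestion = suggest_spelling("tropicl fish")                      # 'tropical fish'
expanded = expand_query("tropical fish", ["aquarium", "fish"])     # 'tropical fish aquarium'
```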
User Interaction
Results output
Constructs the display of ranked documents for a query
Generates snippets to show how queries match documents
Highlights important words and passages
Retrieves appropriate advertising in many applications
May provide clustering and other visualization tools
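Snippet generation and highlighting can be sketched as picking a window of words around the first query-term match. The bracket markers, the window size, and the function name are assumptions for illustration; production snippet generators score several candidate passages rather than taking the first match.

```python
import re


def make_snippet(text, query_terms, width=5):
    """Return a window of words around the first match, with matches in [brackets]."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    # Normalized copies of the words, used only for matching.
    norm = [re.sub(r"\W+", "", w).lower() for w in words]
    hits = [i for i, w in enumerate(norm) if w in terms]
    if not hits:
        return " ".join(words[: 2 * width])  # fall back to the start of the text
    start = max(0, hits[0] - width)
    end = min(len(words), hits[0] + width + 1)
    marked = [f"[{w}]" if norm[i] in terms else w for i, w in enumerate(words)]
    return " ".join(marked[start:end])


snippet = make_snippet("Keeping tropical fish is a popular hobby.", ["fish"], width=2)
# snippet == 'Keeping tropical [fish] is a'
```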
Ranking
Scoring
Calculates scores for documents using a ranking algorithm
Core component of search engine
Basic form of score is $\sum_i q_i d_i$
$q_i$ and $d_i$ are query and document term weights for term $i$
Many variations of ranking algorithms and retrieval models
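The basic score, a sum of products of query and document term weights, can be written directly. The weights below are made-up numbers for illustration; in practice they would come from a weighting scheme such as tf.idf, and the function names are hypothetical.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over terms present in both the query and the document."""
    return sum(q * doc_weights[t] for t, q in query_weights.items() if t in doc_weights)


def rank(query_weights, collection_weights):
    """Return document ids ordered by descending score, ties broken by id."""
    scored = [(score(query_weights, dw), doc_id)
              for doc_id, dw in collection_weights.items()]
    return [doc_id for s, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


query = {"tropical": 1.0, "fish": 1.0}
collection_weights = {
    "d1": {"tropical": 0.5, "fish": 0.2},
    "d2": {"fish": 0.4},
    "d3": {"tank": 0.9},
}
ranking = rank(query, collection_weights)   # ['d1', 'd2', 'd3']
```

Terms missing from a document contribute nothing to its score, which is why an inverted index only needs to visit the postings of the query terms.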
Ranking
Performance optimization
Designing ranking algorithms for efficient processing
Term-at-a-time vs. document-at-a-time processing
Safe vs. unsafe optimizations
Distribution
Processing queries in a distributed environment
Query broker distributes queries and assembles results
Caching is a form of distributed searching
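The query broker's role can be sketched as sending a query to every shard and merging the partial top-k lists. To keep the example short, each shard is represented by a dictionary of already computed scores for one fixed query; the shard contents and function names are assumptions, and network calls are left out entirely.

```python
import heapq


def search_shard(shard_scores, k):
    """Each shard returns its own top-k (score, doc_id) pairs."""
    return heapq.nlargest(k, ((s, d) for d, s in shard_scores.items()))


def broker(shards, k):
    """Query every shard and merge the partial results into a global top-k."""
    partial = []
    for shard_scores in shards:
        partial.extend(search_shard(shard_scores, k))
    return [doc_id for s, doc_id in heapq.nlargest(k, partial)]


# Hypothetical per-shard scores for a single query.
shards = [
    {"d1": 0.7, "d2": 0.1},
    {"d3": 0.9, "d4": 0.4},
]
top = broker(shards, 2)   # ['d3', 'd1']
```

Because every shard returns its full local top-k, the merged list is the same as the top-k over the whole collection.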
Evaluation
Logging
Logging user queries and interaction is crucial for improving search effectiveness and efficiency
Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
Measuring and tuning ranking effectiveness
Performance analysis
Measuring and tuning system efficiency
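One use of query logs mentioned above, query suggestion, can be sketched by counting how often logged queries share a typed prefix. The log entries and the function name are invented for illustration; real logs also carry timestamps, sessions, and clickthrough data.

```python
from collections import Counter

# Hypothetical query log, one entry per submitted query.
QUERY_LOG = [
    "tropical fish",
    "tropical fish tank",
    "tropical fish",
    "trout fishing",
    "tropical storm",
]


def suggest_from_log(prefix, log, n=2):
    """Return up to n logged queries starting with prefix, most frequent first."""
    counts = Counter(q for q in log if q.startswith(prefix))
    return [q for q, _ in counts.most_common(n)]


suggestions = suggest_from_log("tropical", QUERY_LOG)
```

The most frequent matching query, "tropical fish", comes first in `suggestions`.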
How Does It Really Work?
This course explains these components of a search engine in more detail
Often many possible approaches and techniques for a given component
Focus is on the most important alternatives
i.e., explain a small number of approaches in detail rather than many approaches
Importance based on research results and use in actual search engines
Alternatives described in references
