
Search Engines

Information Retrieval in Practice

All slides © Addison Wesley, 2008
Information Needs
An information need is the underlying cause of the query that a person submits to a search engine
sometimes called an information problem, to emphasize that an information need is generally related to a task
Categorized using a variety of dimensions
e.g., number of relevant documents being sought
type of information that is needed
type of task that led to the requirement for information
Queries and Information Needs
A query can represent very different information needs
May require different search techniques and ranking algorithms to produce the best rankings
A query can be a poor representation of the information need
User may find it difficult to express the information need
User is encouraged to enter short queries both by the search engine interface, and by the fact that long queries don't work
Interaction
Interaction with the system occurs
during query formulation and reformulation
while browsing the results
Key aspect of effective retrieval
users can't change the ranking algorithm but can change results through interaction
helps refine the description of the information need
e.g., same initial query, different information needs
how does a user describe what they don't know?
ASK Hypothesis
Belkin et al. (1982) proposed a model called Anomalous State of Knowledge
ASK hypothesis:
difficult for people to define exactly what their information need is, because that information is a gap in their knowledge
Search engine should look for information that fills those gaps
Interesting ideas, little practical impact (yet)
Keyword Queries
Query languages in the past were designed for professional searchers (intermediaries)
Keyword Queries
Simple, natural language queries were designed to enable everyone to search
Current search engines do not perform well (in general) with natural language queries
People trained (in effect) to use keywords
compare average of about 2.3 words/web query to average of 30 words/CQA (community-based question answering) query
Keyword selection is not always easy
query refinement techniques can help
Query-Based Stemming
Make decision about stemming at query time rather than during indexing
improved flexibility, effectiveness
Query is expanded using word variants
documents are not stemmed
e.g., "rock climbing" expanded with "climb", not stemmed to "climb"
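A minimal sketch of query-time expansion, assuming stem classes are already available as a word-to-variants mapping. The `STEM_CLASSES` table and the `expand_query` helper are hypothetical names used only for illustration; the index itself is assumed to hold unstemmed words.

```python
# Hypothetical stem classes: each word maps to the variants it may be expanded with.
STEM_CLASSES = {
    "climbing": {"climbing", "climb", "climbs", "climbed"},
    "climb": {"climbing", "climb", "climbs", "climbed"},
}

def expand_query(query, stem_classes=STEM_CLASSES):
    """Return one list of alternatives per query term.

    The original term is always kept; documents are not stemmed,
    so the variants are matched as ordinary words (e.g. as synonyms).
    """
    expanded = []
    for term in query.lower().split():
        variants = stem_classes.get(term, set()) | {term}
        expanded.append(sorted(variants))
    return expanded

print(expand_query("rock climbing"))
```

Each inner list can be treated as a group of interchangeable terms by whatever query operators the retrieval system provides.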
Stem Classes
A stem class is the group of words that will be transformed into the same stem by the stemming algorithm
generated by running stemmer on large corpus
e.g., Porter stemmer on TREC News
Stem Classes
Stem classes are often too big and inaccurate
Modify using analysis of word co-occurrence
Assumption:
Word variants that could substitute for each other should co-occur often in documents
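Modifying Stem Classes
Dice's Coefficient is an example of a term association measure
$\mathrm{Dice}(a, b) = \dfrac{2 \cdot n_{ab}}{n_a + n_b}$
where $n_x$ is the number of windows containing $x$ (so $n_{ab}$ is the number of windows containing both $a$ and $b$)
Two vertices are in the same connected component of a graph if there is a path between them
forms word clusters
[Figure: Example output of modification]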
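The refinement described above can be sketched as follows: compute Dice's coefficient for every pair of words in a stem class, link pairs whose score reaches a threshold, and take the connected components of the resulting graph as the new classes. The window sets, the threshold value, and the function names are assumptions made for illustration.

```python
from itertools import combinations

def dice(n_a, n_b, n_ab):
    """Dice's coefficient from window counts."""
    return 2 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0

def refine_stem_class(words, window_sets, threshold):
    """Split one stem class into connected components of the co-occurrence graph.

    window_sets maps each word to the set of window ids that contain it.
    """
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        wa = window_sets.get(a, set())
        wb = window_sets.get(b, set())
        if dice(len(wa), len(wb), len(wa & wb)) >= threshold:
            parent[find(a)] = find(b)  # add an edge: merge the two components

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return sorted(sorted(c) for c in clusters.values())

# Toy data: window ids are invented for the example.
windows = {
    "policy": {1, 2, 3}, "policies": {2, 3, 4},
    "police": {10, 11, 12}, "policed": {11, 12}, "policing": {10, 12, 13},
}
print(refine_stem_class(list(windows), windows, threshold=0.5))
```

Note that "policed" and "policing" end up together even though their own Dice score is below the threshold, because both are linked to "police"; that is the effect of using connected components rather than pairwise tests.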
Spell Checking
Important part of query processing
10-15% of all web queries have spelling errors
Errors include typical word processing errors but also many other types
[Figure: examples of query spelling errors]
Spell Checking
Basic approach: suggest corrections for words not found in spelling dictionary
Suggestions found by comparing word to words in dictionary using similarity measure
Most common similarity measure is edit distance
number of operations required to transform one word into the other
Edit Distance
Damerau-Levenshtein distance
counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required
[Figure: example word pairs at Damerau-Levenshtein distance 1 and distance 2]
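As a sketch, the distance can be computed with dynamic programming. The version below is the restricted form often called optimal string alignment, in which a transposed pair of characters is not edited again; the misspelled word pairs in the usage lines are illustrative.

```python
def damerau_levenshtein(a, b):
    """Edit distance with insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(damerau_levenshtein("poiner", "pointer"))        # one insertion
print(damerau_levenshtein("brimingham", "birmingham")) # one transposition
print(damerau_levenshtein("doceration", "decoration")) # two substitutions
```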
Edit Distance
Number of techniques used to speed up calculation of edit distances
restrict to words starting with same character
restrict to words of same or similar length
restrict to words that sound the same
Last option uses a phonetic code to group words
e.g. Soundex
Soundex Code
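One common formulation of the Soundex code can be sketched as below. It assumes the following rules: keep the first letter, map vowels and h, w, y to a placeholder, map the remaining consonants to digits 1 to 6, collapse adjacent repeats of a digit, drop the placeholders, and keep three digits (padding with zeros). Other Soundex variants treat the first letter and h/w slightly differently.

```python
# Consonant groups that share a digit; every other letter becomes "-".
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a four-character code: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = [CODES.get(ch, "-") for ch in word[1:]]
    collapsed = []
    for sym in encoded:
        # Drop a digit that directly repeats the previous symbol.
        if sym != "-" and collapsed and collapsed[-1] == sym:
            continue
        collapsed.append(sym)
    digits = [s for s in collapsed if s != "-"]
    return word[0].upper() + "".join(digits[:3]).ljust(3, "0")

for w in ("extenssions", "extensions", "poiner", "pointer"):
    print(w, soundex(w))
```

Words with the same code can be grouped so that edit distance is only computed within a group. The "poiner"/"pointer" pair shows the limitation: a missing consonant changes the code, so such errors are missed by this restriction.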
Spelling Correction Issues
Ranking corrections
"Did you mean..." feature requires accurate ranking of possible corrections
Context
Choosing right suggestion depends on context (other words)
e.g., lawers → lowers, lawyers, layers, lasers, lagers
but trial lawers → trial lawyers
Run-on errors
e.g., "mainscourcebank"
missing spaces can be considered another single character error in right framework
Noisy Channel Model
User chooses word $w$ based on probability distribution $P(w)$
called the language model
can capture context information, e.g. $P(w_1 \mid w_2)$
User writes word, but noisy channel causes word $e$ to be written instead with probability $P(e \mid w)$
called error model
represents information about the frequency of spelling errors
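Noisy Channel Model
Need to estimate probability of correction
$P(w \mid e) \propto P(e \mid w)\,P(w)$ (the normalizing constant $P(e)$ is the same for every candidate correction)
Estimate language model using context
e.g., $P(w) = \lambda P(w) + (1 - \lambda) P(w \mid w_p)$
$w_p$ is previous word
e.g., "fish tink"
"tank" and "think" both likely corrections, but $P(\text{tank} \mid \text{fish}) > P(\text{think} \mid \text{fish})$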
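The ranking step can be sketched as follows, assuming the language model and error model are given as lookup tables. All probabilities below are invented toy values, and `rank_corrections` and its parameters are hypothetical names; the mixing weight `lam` plays the role of the interpolation parameter in the language model above.

```python
def rank_corrections(error, prev_word, candidates, unigram, bigram, error_model, lam=0.5):
    """Score each candidate w by P(e|w) * [lam * P(w) + (1 - lam) * P(w|prev_word)]."""
    scored = []
    for w in candidates:
        p_language = lam * unigram.get(w, 0.0) + (1 - lam) * bigram.get((prev_word, w), 0.0)
        p_error = error_model.get((error, w), 0.0)
        scored.append((w, p_error * p_language))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy estimates (assumed values, not measured from any corpus or query log).
unigram = {"tank": 0.0004, "think": 0.002}                      # P(w)
bigram = {("fish", "tank"): 0.05, ("fish", "think"): 0.0001}    # P(w | previous word)
error_model = {("tink", "tank"): 0.01, ("tink", "think"): 0.01} # P(e | w)

print(rank_corrections("tink", "fish", ["tank", "think"], unigram, bigram, error_model))
```

With `lam=1.0` the context is ignored and "think" ranks first because it is the more frequent word; with context mixed in, "tank" wins for the query "fish tink".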
Noisy Channel Model
Language model probabilities estimated using corpus and query log
Both simple and complex methods have been used for estimating error model
simple approach: assume all words with same edit distance have same probability, only edit distance 1 and 2 considered
more complex approach: incorporate estimates based on common typing errors
Example Spellcheck Process
The Thesaurus
Used in early search engines as a tool for indexing and query formulation
specified preferred terms and relationships between them
also called controlled vocabulary
Particularly useful for query expansion
adding synonyms or more specific terms using query operators based on thesaurus
improves search effectiveness
MeSH Thesaurus
Query Expansion
A variety of automatic or semi-automatic query expansion techniques have been developed
goal is to improve effectiveness by matching related terms
semi-automatic techniques require user interaction to select best expansion terms
Query suggestion is a related technique
alternative queries, not necessarily more terms
Query Expansion
Approaches usually based on an analysis of term co-occurrence
either in the entire document collection, a large collection of queries, or the top-ranked documents in a result list
query-based stemming also an expansion technique
Automatic expansion based on general thesaurus not effective
does not take context into account
Term Association Measures
Dice's Coefficient
Mutual Information
Term Association Measures
Mutual Information measure favors low frequency terms
Expected Mutual Information Measure (EMIM)
actually only 1 part of full EMIM, focused on word occurrence
Term Association Measures
Pearson's Chi-squared ($\chi^2$) measure
compares the number of co-occurrences of two words with the expected number of co-occurrences if the two words were independent
normalizes this comparison by the expected number
also limited form focused on word co-occurrence
Association Measure Summary
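The four measures are usually written in terms of window counts. As a sketch, assume $N$ is the total number of windows, $n_a$ and $n_b$ are the numbers of windows containing words $a$ and $b$, $n_{ab}$ is the number containing both, and probabilities are estimated as $P(a) = n_a/N$ and $P(a,b) = n_{ab}/N$. Under those assumptions the standard definitions, and the simpler forms that produce the same ranking, are:

```latex
\begin{align*}
\text{Dice:} \quad & \frac{2\,n_{ab}}{n_a + n_b}
  \;\overset{\text{rank}}{=}\; \frac{n_{ab}}{n_a + n_b} \\[4pt]
\text{MIM:} \quad & \log \frac{P(a,b)}{P(a)\,P(b)} = \log \frac{N\,n_{ab}}{n_a\,n_b}
  \;\overset{\text{rank}}{=}\; \frac{n_{ab}}{n_a\,n_b} \\[4pt]
\text{EMIM:} \quad & P(a,b) \log \frac{P(a,b)}{P(a)\,P(b)}
  = \frac{n_{ab}}{N} \log \frac{N\,n_{ab}}{n_a\,n_b}
  \;\overset{\text{rank}}{=}\; n_{ab} \log \frac{N\,n_{ab}}{n_a\,n_b} \\[4pt]
\chi^2\text{:} \quad & \frac{\bigl(n_{ab} - N\,P(a)\,P(b)\bigr)^2}{N\,P(a)\,P(b)}
  = \frac{\bigl(n_{ab} - \tfrac{1}{N}\,n_a n_b\bigr)^2}{\tfrac{1}{N}\,n_a n_b}
  \;\overset{\text{rank}}{=}\; \frac{\bigl(n_{ab} - \tfrac{1}{N}\,n_a n_b\bigr)^2}{n_a\,n_b}
\end{align*}
```

The rank-equivalent forms drop constant factors and monotonic transformations, which do not change the order of candidate words for a fixed collection. The EMIM and $\chi^2$ lines are the limited forms that consider only the case where both words occur.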
Association Measure Example
[Table] Most strongly associated words for "tropical" in a collection of TREC news stories. Co-occurrence counts are measured at the document level.
Association Measure Example
[Table] Most strongly associated words for "fish" in a collection of TREC news stories.
Association Measure Example
[Table] Most strongly associated words for "fish" in a collection of TREC news stories. Co-occurrence counts are measured in windows of 5 words.
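Rankings like those in the examples above come from scoring every candidate word against the target word. A minimal sketch, assuming the usual count-based definitions (with `n_total` the number of windows or documents) and using made-up counts:

```python
import math

def association_scores(n_a, n_b, n_ab, n_total):
    """Dice, MIM, (partial) EMIM and chi-squared from co-occurrence counts.

    n_a, n_b: windows containing each word; n_ab: windows containing both.
    Assumes n_a > 0 and n_b > 0.
    """
    expected = n_a * n_b / n_total  # co-occurrences expected under independence
    ratio = n_total * n_ab / (n_a * n_b)
    return {
        "dice": 2 * n_ab / (n_a + n_b),
        "mim": math.log(ratio) if n_ab else float("-inf"),
        "emim": (n_ab / n_total) * math.log(ratio) if n_ab else 0.0,
        "chi2": (n_ab - expected) ** 2 / expected,
    }

# Invented counts: a fairly frequent pair and a very rare pair.
frequent = association_scores(n_a=100, n_b=50, n_ab=25, n_total=10_000)
rare = association_scores(n_a=2, n_b=2, n_ab=2, n_total=10_000)
print(frequent)
print(rare)
```

The rare pair gets the higher MIM score while the frequent pair gets the higher EMIM score, which illustrates why mutual information favors low-frequency terms and why EMIM weights the log ratio by the joint probability.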
Association Measures
Associated words are of little use for expanding the query "tropical fish"
Expansion based on whole query takes context into account
e.g., using Dice with term "tropical fish" gives the following highly associated words:
goldfish, reptile, aquarium, coral, frog, exotic, stripe, regent, pet, wet
Impractical for all possible queries, other approaches used to achieve this effect
Other Approaches
Pseudo-relevance feedback
expansion terms based on top retrieved documents for initial query
Context vectors
Represent words by the words that co-occur with them
e.g., top 35 most strongly associated words for "aquarium" (using Dice's coefficient):
[Figure: list of associated words]
Rank words for a query by ranking context vectors
Other Approaches
Query logs
Best source of information about queries and related terms
short pieces of text and click data
e.g., most frequent words in queries containing "tropical fish" from MSN log:
stores, pictures, live, sale, types, clipart, blue, freshwater, aquarium, supplies
query suggestion based on finding similar queries
group based on click data
Relevance Feedback
User identifies relevant (and maybe non-relevant) documents in the initial result list
System modifies query using terms from those documents and reranks documents
example of simple machine learning algorithm using training data
but, very little training data
Pseudo-relevance feedback just assumes top-ranked documents are relevant, with no user input
Relevance Feedback Example
[Figure: Top 10 documents for "tropical fish"]
Relevance Feedback Example
If we assume top 10 are relevant, most frequent terms are (with frequency):
a (926), td (535), href (495), http (357), width (345), com (343), nbsp (316), www (260), tr (239), htm (233), class (225), jpg (221)
too many stopwords and HTML expressions
Use only snippets and remove stopwords
tropical (26), fish (28), aquarium (8), freshwater (5), breeding (4), information (3), species (3), tank (2), Badman's (2), page (2), hobby (2), forums (2)
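The term counting in this example can be sketched as below: treat the snippets of the top-ranked results as relevant, drop stopwords, and take the most frequent remaining terms as expansion candidates. The snippets and the small stopword list are invented for illustration, and real systems weight terms according to the retrieval model rather than using raw counts.

```python
import re
from collections import Counter

# A deliberately tiny, assumed stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "is", "with"}

def expansion_terms(snippets, k=5, stopwords=STOPWORDS):
    """Return the k most frequent non-stopword terms across the snippets."""
    counts = Counter()
    for text in snippets:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts.most_common(k)

top_snippets = [
    "Tropical fish and aquarium information",
    "Breeding tropical fish in a freshwater aquarium",
    "The tropical fish hobby",
]
print(expansion_terms(top_snippets, k=3))
```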
Relevance Feedback Example
If document 7 ("Breeding tropical fish") is explicitly indicated to be relevant, the most frequent terms are:
breeding (4), fish (4), tropical (4), marine (2), pond (2), coldwater (2), keeping (1), interested (1)
Specific weights and scoring methods used for relevance feedback depend on retrieval model
Relevance Feedback
Both relevance feedback and pseudo-relevance feedback are effective, but not used in many applications
pseudo-relevance feedback has reliability issues, especially with queries that don't retrieve many relevant documents
Some applications use relevance feedback
filtering, "more like this"
Query suggestion more popular
may be less accurate, but can work if initial query fails
Context and Personalization
If a query has the same words as another query, results will be the same regardless of
who submitted the query
why the query was submitted
where the query was submitted
what other queries were submitted in the same session
These other factors (the context) could have a significant impact on relevance
difficult to incorporate into ranking
User Models
Generate user profiles based on documents that the person looks at
such as web pages visited, email messages, or word processing documents on the desktop
Modify queries using words from profile
Generally not effective
imprecise profiles, information needs can change significantly
Query Logs
Query logs provide important contextual information that can be used effectively
Context in this case is
previous queries that are the same
previous queries that are similar
query sessions including the same query
Query history for individuals could be used for caching
Local Search
Location is context
Local search uses geographic information to modify the ranking of search results
location derived from the query text
location of the device where the query originated
e.g.,
"underworld 3 cape cod"
"underworld 3" from mobile device in Hyannis
Local Search
Identify the geographic region associated with web pages
use location metadata that has been manually added to the document,
or identify locations such as place names, city names, or country names in text
Identify the geographic region associated with the query
10-15% of queries contain some location reference
Rank web pages using location information in addition to text and link-based features
Extracting Location Information
Type of information extraction
ambiguity and significance of locations are issues
Location names are mapped to specific regions and coordinates
Matching done by inclusion, distance
Snippet Generation
Query-dependent document summary
Simple summarization approach
rank each sentence in a document using a significance factor
select the top sentences for the summary
first proposed by Luhn in the 1950s
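Sentence Selection
Significance factor for a sentence is calculated based on the occurrence of significant words
If $f_{d,w}$ is the frequency of word $w$ in document $d$, then $w$ is a significant word if it is not a stopword and
$$f_{d,w} \ge \begin{cases} 7 - 0.1 \times (25 - s_d), & \text{if } s_d < 25 \\ 7, & \text{if } 25 \le s_d \le 40 \\ 7 + 0.1 \times (s_d - 40), & \text{otherwise} \end{cases}$$
where $s_d$ is the number of sentences in document $d$
text is bracketed by significant words (limit on number of non-significant words in bracket)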
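Sentence Selection
Significance factor for bracketed text spans is computed by dividing the square of the number of significant words in the span by the total number of words
e.g., for a bracketed span of 7 words that contains 4 significant words:
Significance factor = $4^2/7 \approx 2.3$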
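A minimal sketch of the significance factor for one sentence. It assumes each word has already been marked as significant or not, that a bracket may contain at most `max_gap` consecutive non-significant words (4 is an assumed value), and that the frequency threshold takes the piecewise form commonly given for Luhn's method (7 for documents of 25 to 40 sentences, adjusted by 0.1 per sentence outside that range).

```python
def significance_threshold(num_sentences):
    """Minimum document frequency for a non-stopword to count as significant
    (assumed piecewise form; see the lead-in)."""
    if num_sentences < 25:
        return 7 - 0.1 * (25 - num_sentences)
    if num_sentences <= 40:
        return 7
    return 7 + 0.1 * (num_sentences - 40)

def significance_factor(flags, max_gap=4):
    """Best score over bracketed spans: (significant words)^2 / span length.

    flags[i] is True when the i-th word of the sentence is significant.
    """
    positions = [i for i, is_sig in enumerate(flags) if is_sig]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1                      # still inside the same bracket
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1           # open a new bracket
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

# 11-word sentence; the bracket runs from index 2 to 8 (7 words, 4 significant).
flags = [False, False, True, False, True, True, False, False, True, False, False]
print(round(significance_factor(flags), 1))
```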
Snippet Generation
Involves more features than just significance factor
e.g. for a news story, could use
whether the sentence is a heading
whether it is the first or second line of the document
the total number of query terms occurring in the sentence
the number of unique query terms in the sentence
the longest contiguous run of query words in the sentence
a density measure of query words (significance factor)
Weighted combination of features used to rank sentences
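Three of the query-dependent features listed above, and their weighted combination, can be sketched as follows. The feature names, the whitespace tokenization, and the weights are assumptions for illustration; in practice the weights would be tuned or learned.

```python
def sentence_features(sentence, query):
    """Count-based query features for one candidate snippet sentence."""
    words = sentence.lower().split()
    query_terms = set(query.lower().split())
    total = sum(1 for w in words if w in query_terms)
    unique = len(query_terms & set(words))
    longest = run = 0
    for w in words:
        run = run + 1 if w in query_terms else 0  # length of current run of query words
        longest = max(longest, run)
    return {"total": total, "unique": unique, "longest_run": longest}

def sentence_score(features, weights):
    """Weighted linear combination of the features."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights.
WEIGHTS = {"total": 1.0, "unique": 2.0, "longest_run": 1.5}

feats = sentence_features(
    "tropical fish include fish found in tropical environments", "tropical fish")
print(feats, sentence_score(feats, WEIGHTS))
```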
Snippet Generation
Web pages are less structured than news stories
can be difficult to find good summary sentences
Snippet sentences are often selected from other sources
metadata associated with the web page
e.g., <meta name="description" content=...>
external sources such as web directories
e.g., Open Directory Project, http://www.dmoz.org
Snippets can be generated from text of pages like Wikipedia
Snippet Guidelines
All query terms should appear in the summary, showing their relationship to the retrieved page
When query terms are present in the title, they need not be repeated
allows snippets that do not contain query terms
Highlight query terms in URLs
Snippets should be readable text, not lists of keywords
Advertising
Sponsored search: advertising presented with search results
Contextual advertising: advertising presented when browsing web pages
Both involve finding the most relevant advertisements in a database
An advertisement usually consists of a short text description and a link to a web page describing the product or service in more detail
Searching Advertisements
Factors involved in ranking advertisements
similarity of text content to query
bids for keywords in query
popularity of advertisement
Small amount of text in advertisement
dealing with vocabulary mismatch is important
expansion techniques are effective
Example Advertisements
[Figure: Advertisements retrieved for query "fish tank"]
Searching Advertisements
Pseudo-relevance feedback
expand query and/or document using the Web
use ad text or query for pseudo-relevance feedback
rank exact matches first, followed by stem matches, followed by expansion matches
Query reformulation based on search sessions
learn associations between words and phrases based on co-occurrence in search sessions
Clustering Results
Result lists often contain documents related to different aspects of the query topic
Clustering is used to group related documents to simplify browsing
[Figure: Example clusters for query "tropical fish"]
Result List Example
[Figure: Top 10 documents for "tropical fish"]
Clustering Results
Requirements
Efficiency
clusters must be specific to each query and are based on the top-ranked documents for that query
typically based on snippets
Easy to understand
Can be difficult to assign good labels to groups
Monothetic vs. polythetic classification
Types of Classification
Monothetic
every member of a class has the property that defines the class
typical assumption made by users
easy to understand
Polythetic
members of classes share many properties but there is no single defining property
most clustering algorithms (e.g. K-means) produce this type of output
Classification Example
[Figure: example documents D1 to D4 and their terms]
Possible monothetic classification
{D1, D2} (labeled using a) and {D2, D3} (labeled e)
Possible polythetic classification
{D2, D3, D4}, D1
labels?
Result Clusters
Simple algorithm
group based on words in snippets
Refinements
use phrases
use more features
whether phrases occurred in titles or snippets
length of the phrase
collection frequency of the phrase
overlap of the resulting clusters
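The simple algorithm can be sketched as a monothetic grouping: every non-stopword that occurs in at least `min_size` snippets defines a cluster, and the word itself serves as the label. The snippets, the choice to treat the query words as stopwords, and the function name are assumptions for illustration; clusters may overlap.

```python
from collections import defaultdict

def cluster_by_words(snippets, stopwords, min_size=2):
    """Group result ids by the words their snippets share.

    snippets maps a result id to its snippet text.
    Returns (label, ids) pairs, largest clusters first.
    """
    clusters = defaultdict(set)
    for doc_id, text in snippets.items():
        for word in set(text.lower().split()):
            if word not in stopwords:
                clusters[word].add(doc_id)
    kept = {w: ids for w, ids in clusters.items() if len(ids) >= min_size}
    return sorted(kept.items(), key=lambda item: (-len(item[1]), item[0]))

results = {
    1: "tropical fish aquarium supplies",
    2: "breeding tropical fish",
    3: "aquarium supplies store",
    4: "tropical fish pictures",
}
# Query words appear in nearly every snippet, so they make poor labels.
print(cluster_by_words(results, stopwords={"tropical", "fish"}))
```

Every member of a cluster contains its label word, which is what makes the grouping easy for users to interpret.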
Faceted Classification
A set of categories, usually organized into a hierarchy, together with a set of facets that describe the important properties associated with the category
Manually defined
potentially less adaptable than dynamic classification
Easy to understand
commonly used in e-commerce
Example Faceted Classification
[Figure: Categories for "tropical fish"]
Example Faceted Classification
[Figure: Subcategories and facets for "Home & Garden"]
Cross-Language Search
Query in one language, retrieve documents in multiple other languages
Involves query translation, and probably document translation
Query translation can be done using bilingual dictionaries
Document translation requires more sophisticated statistical translation models
similar to some retrieval models
Cross-Language Search
Statistical Translation Models
Models require parallel corpora for training
probability estimates based on aligned sentences
Translation of unusual words and phrases is a problem
also use transliteration techniques
e.g., Qathafi, Kaddafi, Qadafi, Gadafi, Gaddafi, Kathafi, Kadhafi, Qadhafi, Qazzafi, Kazafi, Qaddafy, Qadafy, Quadhaffi, Gadhdhafi, al-Qaddafi, Al-Qaddafi
Translation
Web search engines also use translation
e.g. for query "pecheur france"
[Figure: search result with translation link]
translation link translates web page
uses statistical machine translation models
