Search Engine and Social Network

Vaibhav Daga
13114067
Computer Science Department, IIT Roorkee
1. How a Search Engine Works
Search engine is the popular term for an information retrieval (IR) system. While researchers and
developers take a broader view of IR systems, consumers think of them more in terms of what
they want the systems to do, namely search the Web, or an intranet, or a database. Actually,
consumers would really prefer a finding engine, rather than a search engine.
The web creates new challenges for information retrieval. The amount of information on
the web is growing rapidly, as well as the number of new users inexperienced in the art of web
research. People are likely to surf the web using its link graph, often starting with high-quality
human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover
popular topics effectively but are subjective, expensive to build and maintain, slow to improve,
and cannot cover all esoteric topics. Automated search engines that rely on keyword matching
usually return too many low-quality matches.
Crawling
Before a search engine can tell where a file or document is, it must be found. To find information
on the hundreds of millions of Web pages that exist, a search engine employs special software
robots, called spiders, to build lists of the words found on Web sites. When a spider is building
its lists, the process is called Web crawling.
Crawling is the acquisition of data about a website. This involves scanning the site and
getting a complete list of everything on there: the page title, images, keywords it contains, and
any other pages it links to, at a bare minimum. Modern crawlers may cache a copy of the whole
page, as well as look for some additional information such as the page layout, where the
advertising units are, and where the links are on the page.
An automated bot, a spider, visits each page very quickly. Even in the earliest days,
Google reported that they were reading a few hundred pages a second. The crawler then adds all
the new links it found to a list of places to crawl next, in addition to recrawling sites again to
see if anything has changed. It's a never-ending process.
Any site that is linked to from another site already indexed, or any site that has manually
asked to be indexed, will eventually be crawled, some sites more frequently than others and
some to a greater depth. If the site is huge and its content is hidden many clicks away from the
homepage, the crawler bots may actually give up. There are ways to ask search engines NOT to
index a site, though this is rarely used to block an entire website.
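The crawl loop described above can be sketched as a breadth-first traversal over a list of places to crawl next. This is a minimal illustration, not how any production crawler is built: the `WEB` dictionary is a made-up in-memory stand-in for fetching pages, and `max_depth` is an assumed cutoff that models a crawler giving up on content many clicks away from the homepage.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs the page links to.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/", "a.example/deep1"],
    "a.example/deep1": ["a.example/deep2"],
    "a.example/deep2": [],
    "b.example/": ["a.example/"],
}


def crawl(seeds, max_depth):
    """Breadth-first crawl from the seed URLs; returns {url: depth} for visited pages."""
    frontier = deque((url, 0) for url in seeds)  # the list of places to crawl next
    seen = {}
    while frontier:
        url, depth = frontier.popleft()
        if url in seen or depth > max_depth:
            continue  # already crawled, or too many clicks away from a seed
        seen[url] = depth
        for link in WEB.get(url, []):  # stand-in for fetching and parsing the page
            if link not in seen:
                frontier.append((link, depth + 1))
    return seen


visited = crawl(["a.example/"], max_depth=2)
```

With `max_depth=2`, the page `a.example/deep2` sits three clicks from the seed and is never visited, which mirrors the "crawler may give up" behaviour in the text.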
[Figure: Architecture of a web crawler]
Indexing
Indexing is the process of taking all of that data you have from a crawl, and placing it in a big
database. All of this data is stored in vast data centres with thousands of petabytes' worth of
drives. There are two key components involved in making the gathered data accessible to users:
The information stored with the data
The method by which the information is indexed
In the simplest case, a search engine could just store the word and the URL where it was
found. In reality, this would make for an engine of limited use, since there would be no way of
telling whether the word was used in an important or a trivial way on the page, whether the word
was used once or many times, or whether the page contained links to other pages containing the
word. In other words, there would be no way of building the ranking list that tries to present the
most useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and
URL. An engine might store the number of times that the word appears on a page. The engine
might assign a weight to each entry, with increasing values assigned to words as they appear near
the top of the document, in subheadings, in links, in the meta tags or in the title of the page. Each
commercial search engine has a different formula for assigning weight to the words in its index.
This is one of the reasons that a search for the same word on different search engines will
produce different lists, with the pages presented in different orders.
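As a rough illustration of storing more than the word and the URL, the sketch below builds a small weighted index. The weights `TITLE_WEIGHT` and `BODY_WEIGHT` and the page data are assumptions chosen for the example; real engines use their own undisclosed formulas.

```python
from collections import defaultdict

# Assumed weights: a word in the title counts more than one in the body.
TITLE_WEIGHT = 5.0
BODY_WEIGHT = 1.0


def build_index(pages):
    """Map each word to {url: weight} for pages of the form {url: (title, body)}."""
    index = defaultdict(lambda: defaultdict(float))
    for url, (title, body) in pages.items():
        for word in title.lower().split():
            index[word][url] += TITLE_WEIGHT
        for word in body.lower().split():
            index[word][url] += BODY_WEIGHT  # each repeat adds to the weight
    return index


def search(index, word):
    """Return the URLs containing the word, highest weight first."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=lambda url: (-postings[url], url))


pages = {
    "x.example": ("Python tutorial", "learn python step by step"),
    "y.example": ("Cooking", "python is mentioned once here"),
}
index = build_index(pages)
results = search(index, "python")
```

Here `x.example` ranks first because the word appears in its title as well as its body. Changing the two weights changes the ordering, which is the effect the paragraph above attributes to differing formulas.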
Important data structures for storing data
BigFiles: BigFiles are virtual files spanning multiple file systems and are addressable by
64-bit integers. The BigFiles package also handles allocation and deallocation of file
descriptors.
Repository: The repository contains the full HTML of every web page. Each page is
compressed using zlib.
Document Index: The document index keeps information about each document. The
information stored in each entry includes the current document status, a pointer into the
repository, a document checksum, and various statistics.
Hit List: A hit list corresponds to a list of occurrences of a particular word in a particular
document including position, font, and capitalization information.
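The repository, document index, and hit list can be sketched together in a few lines. This is a toy model under stated assumptions: the class and field names (`Repository`, `DocEntry`, `hit_list`) are invented for illustration, CRC-32 is an assumed choice of checksum, and font information is omitted because the example works on plain text.

```python
import zlib
from dataclasses import dataclass


@dataclass
class DocEntry:
    status: str    # current document status, e.g. "crawled"
    offset: int    # pointer into the repository
    length: int    # size of the compressed record
    checksum: int  # checksum of the uncompressed HTML (CRC-32 assumed)


class Repository:
    """Append-only store of zlib-compressed pages plus a document index."""

    def __init__(self):
        self.data = bytearray()
        self.doc_index = {}  # doc_id -> DocEntry

    def add(self, doc_id, html):
        raw = html.encode("utf-8")
        packed = zlib.compress(raw)
        self.doc_index[doc_id] = DocEntry(
            "crawled", len(self.data), len(packed), zlib.crc32(raw)
        )
        self.data += packed

    def get(self, doc_id):
        entry = self.doc_index[doc_id]
        packed = bytes(self.data[entry.offset:entry.offset + entry.length])
        raw = zlib.decompress(packed)
        if zlib.crc32(raw) != entry.checksum:
            raise ValueError("checksum mismatch")
        return raw.decode("utf-8")


def hit_list(text, word):
    """Occurrences of `word` in `text` as (position, is_capitalized) pairs."""
    return [
        (pos, token[0].isupper())
        for pos, token in enumerate(text.split())
        if token.lower() == word.lower()
    ]


repo = Repository()
repo.add(1, "<html>Web search and the web</html>")
repo.add(2, "<html>second page</html>")
hits = hit_list("Web search and the web", "web")
```

The document index lets a page be fetched by seeking directly to its offset, and the hit list keeps the position and capitalization details that a later ranking step could use.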
Retrieval of query
The last step is that you type in a search query, and the search engine attempts to display the most
relevant documents it finds that match your query. This is the most complicated step, but also the
most relevant to you or me, as web developers and users. It is also the area in which search engines
differentiate themselves. Some work with keywords, some allow you to ask a question, and some
include advanced features like keyword proximity or filtering by age of content.
The ranking algorithm checks your search query against billions of pages to determine
how relevant each one is. This operation is so complex that companies closely guard their own
ranking algorithms as industry secrets, for two reasons. The first is competitive advantage: so long as
they are giving you the best search results, they can stay on top of the market. The second is to
prevent gaming of the system, which would give an unfair advantage to one site over another.
Searching through an index involves a user building a query and submitting it through the
search engine. The query can be quite simple, a single word at minimum. Building a more
complex query requires the use of Boolean operators that allow you to refine and extend the
terms of the search.
The Boolean operators most often seen are:
AND: All the terms joined by "AND" must appear in the pages or documents.
Some search engines substitute the operator "+" for the word AND.
OR: At least one of the terms joined by "OR" must appear in the pages or
documents.
NOT: The term or terms following "NOT" must not appear in the pages or
documents. Some search engines substitute the operator "-" for the word NOT.
FOLLOWED BY: One of the terms must be directly followed by the other.
NEAR: One of the terms must be within a specified number of words of the
other.
Quotation Marks: The words between the quotation marks are treated as a
phrase, and that phrase must be found within the document or file.
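The operators listed above map naturally onto set operations over an index that records word positions. The following is a small sketch under assumptions: the function names (`q_and`, `q_near`, and so on) and the three sample documents are made up for the example, and queries are limited to two terms.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} for docs of the form {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def docs_with(index, word):
    return set(index.get(word, {}))


def q_and(index, a, b):
    return docs_with(index, a) & docs_with(index, b)


def q_or(index, a, b):
    return docs_with(index, a) | docs_with(index, b)


def q_not(index, a, b):
    """Documents containing a but not b."""
    return docs_with(index, a) - docs_with(index, b)


def q_near(index, a, b, k):
    """Documents where a and b occur within k words of each other."""
    return {
        d for d in q_and(index, a, b)
        if any(abs(pa - pb) <= k for pa in index[a][d] for pb in index[b][d])
    }


def q_followed_by(index, a, b):
    """Documents where b directly follows a (a two-word quoted phrase)."""
    return {
        d for d in q_and(index, a, b)
        if any(pa + 1 in index[b][d] for pa in index[a][d])
    }


docs = {
    1: "fast search engine",
    2: "search the web with a fast engine",
    3: "social network",
}
idx = build_positional_index(docs)
```

For instance, `q_and(idx, "fast", "engine")` matches documents 1 and 2, while `q_followed_by(idx, "fast", "engine")` matches only document 2, where the two words are adjacent.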
2. Working of a Social Network
Real-time presence notification
The most resource-intensive operation performed in a chat system is not sending messages. It is
rather keeping each online user aware of the online-idle-offline states of their friends, so that
conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes
online or goes offline has a worst-case cost of O(average friend list size * peak users * churn rate)
messages/second, where churn rate is the frequency with which users come online and go offline,
in events/second. This is wildly inefficient to the point of being untenable, given that the average
number of friends per user is measured in the hundreds, and the number of concurrent users
during peak site usage is on the order of several millions.
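To get a feel for the scale of that product, the cost formula can be evaluated with some plausible inputs. The figures below are assumptions picked for illustration only (they are not reported numbers), and the churn rate is interpreted here as events per second per user.

```python
def naive_presence_rate(avg_friends, peak_users, churn_per_user_per_sec):
    """Worst-case notifications/second if every state change goes to every friend."""
    return avg_friends * peak_users * churn_per_user_per_sec


# Assumed figures: 200 friends on average, 5 million concurrent users,
# and each user coming online or going offline once every 10 minutes.
rate = naive_presence_rate(200, 5_000_000, 1 / 600)
```

Under these assumptions the naive scheme would generate roughly 1.7 million notifications per second before a single chat message is delivered.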
Surfacing connected users' idleness greatly enhances the chat user experience but further
compounds the problem of keeping presence information up to date. Each Facebook Chat user
now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loading a Facebook page
(b) transitions between idleness states
Real-time messaging
Another challenge is ensuring the timely delivery of the messages themselves. The method we
chose to get text from one user to another involves loading an iframe on each Facebook page,
and having that iframe's JavaScript make an HTTP GET request over a persistent connection that
doesn't return until the server has data for the client. The request gets re-established if it's
interrupted or times out.
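The long-polling pattern described above can be imitated in a single process. The sketch below replaces the HTTP GET with a blocking queue read, so it shows only the control flow: a request that waits for data and a client that re-issues it after every response or timeout. The names `Mailbox` and `client_loop` and the timeout values are assumptions, not part of the system described.

```python
import queue
import threading


class Mailbox:
    """Server side: holds pending messages for one user."""

    def __init__(self):
        self._pending = queue.Queue()

    def send(self, text):
        self._pending.put(text)

    def poll(self, timeout):
        """Block until a message arrives; return None if the timeout expires."""
        try:
            return self._pending.get(timeout=timeout)
        except queue.Empty:
            return None


def client_loop(mailbox, wanted, timeout=0.05):
    """Client side: re-issue the poll after every response or timeout."""
    received = []
    while len(received) < wanted:
        message = mailbox.poll(timeout)  # stands in for the long-lived HTTP GET
        if message is not None:
            received.append(message)
    return received


box = Mailbox()
sender = threading.Timer(0.1, box.send, args=["hello"])  # a friend replies later
sender.start()
messages = client_loop(box, wanted=1)
sender.join()
```

The client's first polls time out with nothing to report and are simply re-established, and the message is handed over as soon as it is sent.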
Having a large number of long-running concurrent requests makes the Apache part of the
standard LAMP stack a dubious implementation choice. Even without accounting for the
sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute
before reporting that no one has sent the user a message, the waiting time could be spent
servicing 60-some requests for regular Facebook pages.
Distribution, Isolation, and Failover
Fault tolerance is a desirable characteristic of any big system: if an error happens, the system
should try its best to recover without human intervention before giving up and informing the
user. The results of inevitable programming bugs, hardware failures, and the like should be hidden
from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model
and the view: data is persisted in a database (perhaps with a separate in-memory cache), with
each short-lived request retrieving only the parts relevant to that request. Because the data is
persisted, a failed read request can be reattempted. Cache misses and database failures can be
detected by the non-database layers and either reported to the user or worked around using
replication.
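A hedged sketch of that recovery path follows: a read first checks the cache, then retries against the primary database and falls back to a replica. The function `read_with_failover`, the `StoreDown` exception, and the two stand-in stores are hypothetical names introduced for this example.

```python
class StoreDown(Exception):
    """Raised by a store that cannot serve a read."""


def read_with_failover(key, cache, stores, attempts_per_store=2):
    """Try the cache, then each database copy in turn, retrying failed reads."""
    if key in cache:
        return cache[key]
    for store in stores:  # primary first, then replicas
        for _ in range(attempts_per_store):
            try:
                value = store(key)
            except StoreDown:
                continue  # the data is persisted, so the read can be reattempted
            cache[key] = value  # fill the cache after a miss
            return value
    raise StoreDown(f"no store could serve {key!r}")  # report to the user


def broken_primary(key):
    raise StoreDown("primary unavailable")


def healthy_replica(key):
    return {"user:1": "Alice"}[key]


cache = {}
value = read_with_failover("user:1", cache, [broken_primary, healthy_replica])
```

The caller never sees the primary's failure. Only when every copy has failed does the error surface, which is the "either reported to the user or worked around" choice in the text.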
While this architecture works pretty well in general, it isn't as successful in a chat
application due to the high volume of long-lived requests, the non-relational nature of the data
involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as
well as an epoll-driven web server (in Erlang) that holds online users' conversations in memory
and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for
reliability and efficient failover.
Scaling the Messages Application
Facebook Messages seamlessly integrates many communication channels: email, SMS,
Facebook Chat, and the existing Facebook Inbox. Combining all this functionality and offering a
powerful user experience involved building an entirely new infrastructure stack from the ground
up.
To simplify the product and present a powerful user experience, integrating and
supporting all the above communication channels requires a number of services to run together
and interact. The system needs to:
Scale, as we need to support millions of users with existing message history.
Operate in real time.
Be highly available.
Each application server comprises:
API: The entry point for all get and set operations, which every client calls. An
application server is the sole entry point for any given user into the system. Any data
written to or read from the system needs to go through this API.
Distributed logic: To understand the distributed logic we need to understand what a cell
is. The entire system is divided into cells, and each cell contains only a subset of users. A
cell looks like this:
[Figure: structure of a cell]
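One way to picture the division into cells is a function that sends every request for a given user to the same cell. The text does not say how users are assigned to cells, so the hash-based rule below is purely an assumption for illustration, as are the name `cell_for_user` and the cell count.

```python
import zlib


def cell_for_user(user_id, num_cells):
    """Deterministically map a user to one cell (hash partitioning is assumed)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % num_cells


NUM_CELLS = 4  # assumed number of cells
assignments = {uid: cell_for_user(uid, NUM_CELLS) for uid in range(1000, 1010)}
```

Because the mapping is deterministic, all reads and writes for one user land in the same cell, so each cell only ever has to hold its own subset of users.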
References
https://www.facebook.com/notes/facebookengineering/facebookchat/14218138919
https://www.facebook.com/notes/facebookengineering/scalingthemessagesapplicationbackend/10150148835363920
http://infolab.stanford.edu/~backrub/google.html
http://computer.howstuffworks.com/internet/basics/searchengine.htm
http://www.makeuseof.com/tag/howdosearchenginesworkmakeuseofexplains/
