
Search Engine and Social Network

Vaibhav Daga
13114067
Computer Science Department, IIT Roorkee

1. How Search Engine Works
Search engine is the popular term for an information retrieval (IR) system. While researchers and
developers take a broader view of IR systems, consumers think of them more in terms of what
they want the systems to do, namely search the Web, or an intranet, or a database. Actually,
consumers would really prefer a finding engine, rather than a search engine.
The web creates new challenges for information retrieval. The amount of information on
the web is growing rapidly, as well as the number of new users inexperienced in the art of web
research. People are likely to surf the web using its link graph, often starting with high-quality
human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover
popular topics effectively but are subjective, expensive to build and maintain, slow to improve,
and cannot cover all esoteric topics. Automated search engines that rely on keyword matching
usually return too many low-quality matches.

Crawling
Before a search engine can tell where a file or document is, it must be found. To find information
on the hundreds of millions of Web pages that exist, a search engine employs special software
robots, called spiders, to build lists of the words found on Web sites. When a spider is building
its lists, the process is called Web crawling.
Crawling is the acquisition of data about a website. This involves scanning the site and
getting a complete list of everything on there: the page title, images, keywords it contains, and
any other pages it links to, at a bare minimum. Modern crawlers may cache a copy of the whole
page, as well as look for some additional information such as the page layout, where the
advertising units are, and where the links are on the page.
An automated bot, a spider, visits each page very quickly. Even in the earliest days,
Google reported that they were reading a few hundred pages a second. The crawler then adds all
the new links it found to a list of places to crawl next, in addition to re-crawling sites to
see if anything has changed. It's a never-ending process.
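The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not how any particular engine does it: the `TOY_WEB` dictionary is a hypothetical in-memory stand-in for real pages fetched over HTTP, and the `max_depth` cut-off is an assumed way of modelling a crawler giving up on deeply buried content.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> list of URLs it links to.
# A real crawler would fetch and parse HTML over HTTP instead.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/deep/1"],
    "b.example/deep/1": ["b.example/deep/2"],
    "b.example/deep/2": [],
}


def crawl(seed, web, max_depth=2):
    """Breadth-first crawl from `seed`, giving up beyond `max_depth` clicks."""
    frontier = deque([(seed, 0)])   # list of places to crawl next
    seen = {seed}                   # avoid queueing the same page twice
    visited = []                    # pages in the order they were crawled
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                # too many clicks from the seed: give up
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


order = crawl("a.example/", TOY_WEB)
```

With these toy pages the page three clicks from the seed is never reached, which mirrors the point that content far from the homepage may be skipped.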
Any site that is linked to from another site already indexed, or any site that has manually
asked to be indexed, will eventually be crawled, some sites more frequently than others and
some to a greater depth. If the site is huge and content is hidden many clicks away from the
homepage, the crawler bots may actually give up. There are ways to ask search engines NOT to
index a site, though this is rarely used to block an entire website.
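The text does not name a mechanism for opting out. One widely used one is a robots.txt file, which well-behaved crawlers consult before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the bot name `ExampleBot`, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site: block one
# directory for every crawler, leave the rest of the site open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching each URL.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/notes.html")
```

Here only part of the site is excluded, which matches the observation that blocking an entire website is uncommon.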
Architecture of a web crawler:

Indexing
Indexing is the process of taking all of that data you have from a crawl, and placing it in a big
database. All of this data is stored in vast data centres with thousands of petabytes worth of
drives. There are two key components involved in making the gathered data accessible to users:
- The information stored with the data
- The method by which the information is indexed

In the simplest case, a search engine could just store the word and the URL where it was
found. In reality, this would make for an engine of limited use, since there would be no way of
telling whether the word was used in an important or a trivial way on the page, whether the word
was used once or many times, or whether the page contained links to other pages containing the
word. In other words, there would be no way of building the ranking list that tries to present the
most useful pages at the top of the list of search results.
To make for more useful results, most search engines store more than just the word and
URL. An engine might store the number of times that the word appears on a page. The engine
might assign a weight to each entry, with increasing values assigned to words as they appear near
the top of the document, in subheadings, in links, in the meta tags or in the title of the page. Each
commercial search engine has a different formula for assigning weight to the words in its index.
This is one of the reasons that a search for the same word on different search engines will
produce different lists, with the pages presented in different orders.
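To make the idea of weighted index entries concrete, here is a small sketch of an inverted index that stores a score per word and page rather than just the URL. The field names and the numbers in `FIELD_WEIGHTS` are assumptions chosen for illustration; as noted above, real engines use their own formulas.

```python
from collections import defaultdict

# Assumed weights per field; real engines use their own undisclosed formulas.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def build_index(pages):
    """Map each word to {url: weighted score} across all pages."""
    index = defaultdict(lambda: defaultdict(float))
    for url, fields in pages.items():
        for field, text in fields.items():
            for word in text.lower().split():
                # Every occurrence adds the weight of the field it appears in.
                index[word][url] += FIELD_WEIGHTS[field]
    return index


def search(index, word):
    """Return URLs containing `word`, highest weighted score first."""
    scores = index.get(word.lower(), {})
    return sorted(scores, key=lambda url: (-scores[url], url))


PAGES = {
    "p1": {"title": "python crawler", "body": "a crawler written in python"},
    "p2": {"title": "cooking", "body": "python python python"},
    "p3": {"title": "snakes", "heading": "python care", "body": "feeding a python"},
}
index = build_index(PAGES)
results = search(index, "python")
```

In this toy data the page with the word in its title outranks the page that merely repeats the word in its body, and a different choice of weights would produce a different ordering.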

Important data structures for storing data
- BigFiles: BigFiles are virtual files spanning multiple file systems and are addressable by
  64-bit integers. The BigFiles package also handles allocation and deallocation of file
  descriptors.
- Repository: The repository contains the full HTML of every web page. Each page is
  compressed using zlib.
- Document Index: The document index keeps information about each document. The
  information stored in each entry includes the current document status, a pointer into the
  repository, a document checksum, and various statistics.
- Hit List: A hit list corresponds to a list of occurrences of a particular word in a particular
  document including position, font, and capitalization information.
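The structures listed above could be modelled along these lines. This is a simplified sketch under stated assumptions: the class and field names are invented, a `bytearray` stands in for the on-disk repository, CRC-32 is used as the checksum because it is available in `zlib`, and font information is left out of the hit list.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Hit:
    position: int       # word offset within the document
    capitalized: bool   # whether the word started with a capital letter


@dataclass
class DocEntry:
    status: str         # e.g. "crawled"
    repo_offset: int    # pointer into the repository
    repo_length: int    # size of the compressed record
    checksum: int       # checksum of the uncompressed page


repository = bytearray()   # stands in for one large compressed file
doc_index = {}             # doc id -> DocEntry


def store(doc_id, html):
    """Compress a page with zlib, append it, and record where it lives."""
    raw = html.encode("utf-8")
    packed = zlib.compress(raw)
    doc_index[doc_id] = DocEntry("crawled", len(repository), len(packed), zlib.crc32(raw))
    repository.extend(packed)


def load(doc_id):
    """Follow the document-index pointer and decompress the page."""
    entry = doc_index[doc_id]
    packed = bytes(repository[entry.repo_offset:entry.repo_offset + entry.repo_length])
    return zlib.decompress(packed).decode("utf-8")


def hit_list(text, word):
    """List every occurrence of `word` in `text` with position and capitalization."""
    return [
        Hit(i, token[0].isupper())
        for i, token in enumerate(text.split())
        if token.lower() == word.lower()
    ]


store(1, "<html>Search engines search the Web</html>")
hits = hit_list("Search engines search the Web", "search")
```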

Retrieval of query
The last step is that you type in a search query, and the search engine attempts to display the most
relevant documents it finds that match your query. This is the most complicated step, but also the
most relevant to you and me, as web developers and users. It is also the area in which search engines
differentiate themselves. Some work with keywords, some allow you to ask a question, and some
include advanced features like keyword proximity or filtering by age of content.
The ranking algorithm checks your search query against billions of pages to determine
how relevant each one is. This operation is so complex that companies closely guard their own
ranking algorithms as patented industry secrets. There are two reasons for the secrecy. The first
is competitive advantage: so long as they are giving you the best search results, they can stay on
top of the market. The second is to prevent gaming of the system, which would give an unfair
advantage to one site over another.

Searching through an index involves a user building a query and submitting it through the
search engine. The query can be quite simple, a single word at minimum. Building a more
complex query requires the use of Boolean operators that allow you to refine and extend the
terms of the search.
The Boolean operators most often seen are:
- AND: All the terms joined by "AND" must appear in the pages or documents.
  Some search engines substitute the operator "+" for the word AND.
- OR: At least one of the terms joined by "OR" must appear in the pages or
  documents.
- NOT: The term or terms following "NOT" must not appear in the pages or
  documents. Some search engines substitute the operator "-" for the word NOT.
- FOLLOWED BY: One of the terms must be directly followed by the other.
- NEAR: One of the terms must be within a specified number of words of the
  other.
- Quotation Marks: The words between the quotation marks are treated as a
  phrase, and that phrase must be found within the document or file.
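The operators above can be illustrated with a few small functions that test a single document. This is only a sketch of their meaning: the function names are invented, the document is scanned directly rather than looked up in an index, and a real engine would also parse the query string.

```python
def positions(doc, term):
    """Word offsets at which `term` occurs in `doc` (case-insensitive)."""
    return [i for i, w in enumerate(doc.lower().split()) if w == term.lower()]


def has_and(doc, a, b):
    """`a AND b`: both terms must appear."""
    return bool(positions(doc, a)) and bool(positions(doc, b))


def has_or(doc, a, b):
    """`a OR b`: at least one term must appear."""
    return bool(positions(doc, a)) or bool(positions(doc, b))


def has_not(doc, a, b):
    """`a NOT b`: a must appear, b must not."""
    return bool(positions(doc, a)) and not positions(doc, b)


def near(doc, a, b, distance):
    """True if some occurrence of `a` is within `distance` words of `b`."""
    return any(abs(i - j) <= distance
               for i in positions(doc, a) for j in positions(doc, b))


def followed_by(doc, a, b):
    """True if `a` is directly followed by `b`."""
    return any(j - i == 1 for i in positions(doc, a) for j in positions(doc, b))


def phrase(doc, text):
    """True if the words of `text` appear consecutively in `doc`."""
    words, target = doc.lower().split(), text.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))


DOC = "the web crawler visits each web page very quickly"
```

For example, `near(DOC, "crawler", "page", 4)` holds because the two words are four positions apart, while `followed_by(DOC, "crawler", "web")` does not, since the order matters.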

2. Working of a Social Network

Real-time presence notification
The most resource-intensive operation performed in a chat system is not sending messages. It is
rather keeping each online user aware of the online-idle-offline states of their friends, so that
conversations can begin.
The naive implementation of sending a notification to all friends whenever a user comes
online or goes offline has a worst-case cost of O(average friend list size * peak users * churn rate)
messages/second, where churn rate is the frequency with which users come online and go offline,
in events/second. This is wildly inefficient to the point of being untenable, given that the average
number of friends per user is measured in the hundreds, and the number of concurrent users
during peak site usage is on the order of several millions.
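A quick back-of-the-envelope calculation shows why the naive approach is untenable. The numbers below are assumptions chosen only to match the orders of magnitude mentioned above (friends in the hundreds, users in the millions), and the churn rate is read here as a per-user rate, with one state change per user every ten minutes.

```python
def naive_presence_cost(avg_friends, peak_users, churn_per_user_per_sec):
    """Worst-case notifications per second when every state change is
    pushed to every friend."""
    return avg_friends * peak_users * churn_per_user_per_sec


AVG_FRIENDS = 200        # assumed: "measured in the hundreds"
PEAK_USERS = 3_000_000   # assumed: "several millions"
CHURN = 1 / 600          # assumed: one online/offline event per user per 10 minutes

cost = naive_presence_cost(AVG_FRIENDS, PEAK_USERS, CHURN)
```

Under these assumptions the system would have to push roughly one million presence messages every second before a single chat message is sent.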
Surfacing connected users' idleness greatly enhances the chat user experience but further
compounds the problem of keeping presence information up to date. Each Facebook Chat user
now needs to be notified whenever one of his/her friends
(a) takes an action such as sending a chat message or loading a Facebook page
(b) transitions between idleness states

Real-time messaging
Another challenge is ensuring the timely delivery of the messages themselves. The method we
chose to get text from one user to another involves loading an iframe on each Facebook page,
and having that iframe's JavaScript make an HTTP GET request over a persistent connection that
doesn't return until the server has data for the client. The request gets re-established if it's
interrupted or times out.
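The long-polling pattern described above can be imitated in a single process. The sketch below is not the Facebook implementation: a `queue.Queue` stands in for the server-side connection, and the `Mailbox` class, `client_loop` function, and timeout values are invented for illustration.

```python
import queue
import threading


class Mailbox:
    """Per-user message queue that a long-poll request blocks on."""

    def __init__(self):
        self._messages = queue.Queue()

    def send(self, text):
        self._messages.put(text)

    def long_poll(self, timeout):
        """Block until a message arrives or `timeout` seconds pass.

        Returns the message, or None on timeout (the client then re-polls).
        """
        try:
            return self._messages.get(timeout=timeout)
        except queue.Empty:
            return None


def client_loop(mailbox, wanted, timeout=0.2, max_polls=50):
    """Re-establish the poll after every response or timeout until
    `wanted` messages have arrived (or `max_polls` is exhausted)."""
    received = []
    for _ in range(max_polls):
        message = mailbox.long_poll(timeout)
        if message is not None:
            received.append(message)
        if len(received) == wanted:
            break
    return received


mailbox = Mailbox()
# Deliver a message from another thread after a short delay.
threading.Timer(0.05, mailbox.send, args=["hello"]).start()
received = client_loop(mailbox, wanted=1)
```

The request does nothing while it waits, which is why holding one open per online user is cheap in an event-driven server but costly when each one ties up a whole process.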
Having a large number of long-running concurrent requests makes the Apache part of the
standard LAMP stack a dubious implementation choice. Even without accounting for the
sizeable overhead of spawning an OS process that, on average, twiddles its thumbs for a minute
before reporting that no one has sent the user a message, the waiting time could be spent
servicing 60-some requests for regular Facebook pages.

Distribution, Isolation, and Failover
Fault tolerance is a desirable characteristic of any big system: if an error happens, the system
should try its best to recover without human intervention before giving up and informing the
user. The results of inevitable programming bugs, hardware failures, and the like should be hidden
from the user as much as possible and isolated from the rest of the system.
The way this is typically accomplished in a web application is by separating the model
and the view: data is persisted in a database (perhaps with a separate in-memory cache), with
each short-lived request retrieving only the parts relevant to that request. Because the data is
persisted, a failed read request can be re-attempted. Cache misses and database failures can be
detected by the non-database layers and either reported to the user or worked around using
replication.
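As a rough sketch of the retry-and-replicate idea, the function below tries a cache first and then each replica in turn, reporting an error only when every copy has failed. All names here (`ReplicaDown`, `read_with_failover`, the two toy replicas) are hypothetical, and replicas are modelled as plain functions rather than database connections.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the read."""


def read_with_failover(replicas, key, cache):
    """Try the cache, then each replica in turn; raise only if all fail."""
    if key in cache:
        return cache[key]
    last_error = None
    for replica in replicas:
        try:
            value = replica(key)
        except ReplicaDown as error:
            last_error = error      # hide this failure and try the next copy
            continue
        cache[key] = value          # repopulate the cache for later requests
        return value
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error


DATA = {"user:1": "Alice"}


def broken_replica(key):
    """Toy replica that is always unreachable."""
    raise ReplicaDown("primary unreachable")


def healthy_replica(key):
    """Toy replica that serves reads from DATA."""
    return DATA[key]


cache = {}
value = read_with_failover([broken_replica, healthy_replica], "user:1", cache)
```

The caller sees the value even though the first replica failed, so the fault is worked around instead of being reported to the user.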
While this architecture works pretty well in general, it isn't as successful in a chat
application due to the high volume of long-lived requests, the non-relational nature of the data
involved, and the statefulness of each request.
For Facebook Chat, we rolled our own subsystem for logging chat messages (in C++) as
well as an epoll-driven web server (in Erlang) that holds online users' conversations in memory
and serves the long-polled HTTP requests. Both subsystems are clustered and partitioned for
reliability and efficient failover.

Scaling Messages Application
Facebook Messages seamlessly integrates many communication channels: email, SMS,
Facebook Chat, and the existing Facebook Inbox. Combining all this functionality and offering a
powerful user experience involved building an entirely new infrastructure stack from the ground
up.
To simplify the product and present a powerful user experience, integrating and
supporting all the above communication channels requires a number of services to run together
and interact. The system needs to:
- Scale, as we need to support millions of users with existing message history.
- Operate in real time.
- Be highly available.
Each application server comprises:
- API: The entry point for all get and set operations, which every client calls. An
  application server is the sole entry point for any given user into the system. Any data
  written to or read from the system needs to go through this API.
- Distributed logic: To understand the distributed logic we need to understand what a cell
  is. The entire system is divided into cells, and each cell contains only a subset of users. A
  cell looks like this:

References

https://www.facebook.com/notes/facebookengineering/facebookchat/14218138919
https://www.facebook.com/notes/facebookengineering/scalingthemessagesapplicationbackend/10150148835363920
http://infolab.stanford.edu/~backrub/google.html
http://computer.howstuffworks.com/internet/basics/searchengine.htm
http://www.makeuseof.com/tag/howdosearchenginesworkmakeuseofexplains/
