Sridhar Rajagopalan
What is new?
Some graphs are getting very, very large.
Several Web crawls have over 2 billion pages (nodes) and 10 times as many edges. Some social networks (telephone call graphs, for instance) are huge.
Why and how does scale change the game? It is not about an instance, it is about THE instance.
Systems and software are designed from scratch to solve one (or a few) instances of the problem. Properties of the instance are important. Generality and genericity of the code base are not critical. An answer to a single instance can be worth a lot.
This calls for a new kind of algorithm and system design that is very engineering-oriented.
Non-viability of the random access model. Hardware and software co-design.
What does a system and programming model for processing large data sets have to do?
System and programming model for processing large graphs. Exceptions will occur. Components (both hard and soft) will fail. Data structures will exceed available memory. The system must be aware of statistical issues. Approximate or incomplete results are usually good enough. What happens when you string (even well-understood) approximate techniques together?
Issues:
Memory usage. Passes required over the data. Many variations.
Sorting is hard, as are almost all interesting combinatorial/graph-theoretic problems. Exact computation of statistical functions is hard. Approximation is possible. Relationships to communication complexity and information complexity.
Sorting
But sorting well requires a great deal of care and customization to the hardware platform. What is the cost of indexing the Web? 2B documents, each with 500 words = 1 trillion records. At Penny Sort rates, the cost of the index build is under 50 bucks. Networking speeds make sharing large quantities of streaming data possible.
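The record count quoted above is simple arithmetic, sketched here for concreteness. The document and word counts come from the text; the per-record size `BYTES_PER_RECORD` is a hypothetical figure chosen only to give a feel for the data volume.

```python
# Back-of-envelope size of a Web index build, using the figures quoted above.
documents = 2_000_000_000      # 2B documents (from the text)
words_per_document = 500       # words per document (from the text)

# One (word, document) record per word occurrence.
records = documents * words_per_document
print(f"{records:,} records")

# Hypothetical record size (an assumption, not from the source).
BYTES_PER_RECORD = 16
total_bytes = records * BYTES_PER_RECORD
print(f"about {total_bytes / 1e12:.0f} TB at {BYTES_PER_RECORD} bytes per record")
```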
Model 1: Stream + Sort. Basic multi-pass data stream model with access to a Sort box. Quite powerful.
Can do entire index build (including PageRank-like calculations). Spanning Tree, Connected Components, MinCut, STCONN, Bipartiteness. Exact computations of order statistics and frequency moments. Suffix Tree/Suffix Array build. Red/Blue segment intersection. So strictly stronger than just streaming.
Theorem: NC ⊆ StrSort
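As a small illustration of the Stream + Sort style, the sketch below computes indegrees with one sort followed by one streaming pass that keeps only a run counter. It is a toy under stated assumptions: the in-memory `sorted` call stands in for the Sort box, and the function name `indegrees_stream_sort` and the edge list are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def indegrees_stream_sort(edges):
    """Yield (node, indegree) using one sort followed by one streaming pass."""
    # The "Sort box": a real system would use an external (on-disk) sort here.
    sorted_edges = sorted(edges, key=itemgetter(1))
    # Streaming pass: equal destinations are now adjacent, so counting each
    # run needs only constant working memory.
    for dst, run in groupby(sorted_edges, key=itemgetter(1)):
        yield dst, sum(1 for _ in run)

# Hypothetical edge list (source, destination).
edges = [("a", "b"), ("c", "b"), ("a", "c"), ("b", "c"), ("d", "b")]
print(list(indegrees_stream_sort(edges)))
```

Nodes with no incoming edges never appear in the sorted stream, so they are simply absent from the output.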
Power Laws: A Curious Statistic About the Web. Indegree and outdegree distributions of the web graph follow a power law. Component size distributions also follow a power law.
Power Laws
Inverse polynomial tail.
Word frequency in text: Yule (later Mandelbrot), statistical study of the literary vocabulary [Yule, 1944].
Citation analysis [Lotka, 1926].
Zipf, human behavior and the principle of least effort [Zipf, 1947].
Pareto, Cours d'économie politique [Pareto, 1897].
Network graph [Faloutsos-Faloutsos-Faloutsos, 1999].
Oligonucleotide sequences [Martindale-Konopka, 1996].
Access statistics for web pages (from server logs) [Glassman, 1997].
User behavior (instrumented browsers and proxies) [Lukose-Huberman, 1998; Crovella and others, 1997-99].
Many other instances.
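An inverse polynomial tail shows up as a straight line on a log-log plot of frequency against value. The sketch below estimates the exponent with an ordinary least-squares fit on the log-log histogram. This is a crude estimator chosen for brevity, not a method taken from the talk, and the function name and the synthetic data are assumptions made for the example.

```python
import math
from collections import Counter

def fit_power_law_exponent(values):
    """Least-squares slope of log(count) against log(value), negated."""
    counts = Counter(v for v in values if v > 0)
    xs = [math.log(v) for v in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

# Synthetic degree data in which the number of nodes with degree k is
# proportional to k^-2 (a made-up distribution for illustration).
degrees = [k for k in range(1, 31) for _ in range(round(100_000 / k**2))]
print(f"estimated exponent: {fit_power_law_exponent(degrees):.2f}")
```

The fit needs at least two distinct positive values; on real data the noisy tail usually calls for binning or a maximum-likelihood estimate instead.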
Namespace tree: T.
Nodes = URLs. Directed (labeled) edges = parent to child (labeled by extension).
Host graph: H.
Nodes = websites/webhosts. Directed (weighted) edges = (number of) links from one host to the other.
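The host graph H can be derived from page-level links by collapsing each URL to its host and counting. A minimal sketch follows, assuming links arrive as (source URL, destination URL) pairs; dropping links that stay within one host is an assumption of this example, and the URLs are placeholders.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_graph(page_links):
    """Collapse page-level links (src_url, dst_url) into weighted host edges."""
    weights = Counter()
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Assumption: links inside a single host are not edges of H.
        if src_host and dst_host and src_host != dst_host:
            weights[(src_host, dst_host)] += 1
    return weights

# Placeholder page-level links.
links = [
    ("http://a.example/x.html", "http://b.example/"),
    ("http://a.example/y.html", "http://b.example/z"),
    ("http://a.example/x.html", "http://a.example/y.html"),
    ("http://b.example/", "http://c.example/"),
]
print(dict(host_graph(links)))
```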
Estimation methods.
Sampling. Random walks.
Data mining.
Communities. Focused crawling. Mirrors.
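\[
A_{i,j} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are not adjacent in } G \\ 1 & \text{if } i \text{ and } j \text{ are adjacent in } G \end{cases}
\]
Markov chain M(G).
\[
M_{i,j} = \begin{cases} 0 & \text{if } i \text{ and } j \text{ are not adjacent in } G \\ \dfrac{1}{d_i} & \text{if } i \text{ and } j \text{ are adjacent in } G \end{cases}
\]
where \(d_i\) is the outdegree of vertex \(i\).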
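The definition of M(G) above (zero for non-adjacent pairs, one over the outdegree of i for adjacent ones) amounts to normalizing each row of the adjacency matrix. A minimal sketch, assuming a dense 0/1 matrix as input; how to treat vertices with no out-links is not stated in the source, so the zero row used here is an assumption.

```python
def transition_matrix(adjacency):
    """Build M(G) from a 0/1 adjacency matrix: M[i][j] = A[i][j] / d_i."""
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumption: a vertex with no out-links keeps an all-zero row.
            matrix.append([0.0] * len(row))
        else:
            matrix.append([a / out_degree for a in row])
    return matrix

# Hypothetical 3-vertex graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
M = transition_matrix(A)
for row in M:
    print(row)
```

Every row with at least one out-link sums to 1, which is what makes M(G) a Markov chain on the vertices.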
Search: Eigenvectors of M. Page rank comes from the web graph. [Brin and Page, 1998]
Principal eigenvector of (1 - P) M(G) + P U, where U is the uniform random-jump matrix.
Hub rank comes from bibliographic coupling. [Kleinberg, 1998]
Principal eigenvector of A(B)
Authority rank comes from co-citations. [Kleinberg, 1998]
Eigenspace of M(S).
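The page rank item above is a principal eigenvector of the Markov chain mixed with a random jump, and power iteration is the usual way to approximate it. The sketch below is one possible reading, not the talk's implementation: the jump probability of 0.15, the uniform jump target, and the uniform redistribution of rank from pages without out-links are all assumptions, as are the function name and the tiny graph.

```python
def pagerank(out_links, jump=0.15, iterations=100):
    """Power iteration on (1 - jump) * M(G) + jump * (uniform jump)."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Assumption: rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (jump + (1.0 - jump) * dangling) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            for w in targets:
                # Each page passes its rank equally along its out-links.
                new_rank[w] += (1.0 - jump) * rank[v] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph; every link target must also appear as a key.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The ranks stay normalized to sum to 1 in every iteration, so the result can be read as the stationary distribution of the modified chain.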
Co-citation and Web Communities. Social networks: Milgram's six degrees of separation (6DOS), routing. Bibliographic coupling thesis: frequently co-cited web pages are related. Pages with large bibliographic overlap are related. CS problem: enumerate all frequently co-cited groups of web pages (complete bipartite subgraphs). We call these cores.
K(3,3)
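To make the core definition concrete, the sketch below finds complete bipartite subgraphs by brute force: for every set of j candidate centers it intersects their in-link sets and keeps the set if at least i pages point to all of them. This exhaustive search is only workable on tiny graphs and is not the enumeration method used in the talk; the function name and sample links are made up.

```python
from itertools import combinations

def find_cores(out_links, i, j):
    """Return (fans, centers) pairs where at least i fans link to all j centers."""
    # Invert the link structure: for each page, who points to it?
    in_links = {}
    for page, targets in out_links.items():
        for target in set(targets):
            in_links.setdefault(target, set()).add(page)
    cores = []
    for centers in combinations(sorted(in_links), j):
        # Fans are the pages that cite every chosen center.
        fans = set.intersection(*(in_links[c] for c in centers))
        if len(fans) >= i:
            cores.append((sorted(fans), list(centers)))
    return cores

# Hypothetical links: three fans co-cite the same three centers.
links = {
    "f1": ["c1", "c2", "c3"],
    "f2": ["c1", "c2", "c3"],
    "f3": ["c1", "c2", "c3"],
    "f4": ["c1", "x"],
}
print(find_cores(links, 3, 3))
```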
The Cores Are Interesting.
Explicit communities: Yahoo!, Excite, Infoseek; webrings; news groups; mailing lists.
Implicit communities: hotels in Costa Rica; clipart; Japanese elementary schools; Turkish student associations; oil spills off the coast of Japan; Australian fire brigades; aviation/aircraft vendors; guitar manufacturers.
(1) Implicit communities are defined by cores. (2) There are an order of magnitude more implicit communities than explicit ones. (3) Very reliable. Over 97% (sampled) make sense.
More Applications. Find similar: look at neighborhood in co-citation graph (C). Bidirectional browsing: edges are reversed and a list of pages which point to the currently visible one is made available. Mirror detection: find identical sub-trees in the namespace tree.
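Mirror detection as described above comes down to spotting identical sub-trees in the namespace tree. One simple way to sketch it is to give every sub-tree a canonical signature built from its sorted edge labels and then group paths by signature. The nested-dict tree representation, the `min_size` cutoff, and the host names are assumptions of this example.

```python
from collections import defaultdict

def find_mirrors(tree, min_size=2):
    """Group paths whose sub-trees are identical (same edge labels and shape)."""
    groups = defaultdict(list)

    def visit(node, path):
        # Signature: sorted (label, child signature) pairs; size counts nodes.
        parts, size = [], 1
        for label in sorted(node):
            child_sig, child_size = visit(node[label], path + "/" + label)
            parts.append((label, child_sig))
            size += child_size
        signature = tuple(parts)
        if size >= min_size:  # skip trivial sub-trees such as single leaves
            groups[signature].append(path)
        return signature, size

    visit(tree, "")
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]

# Hypothetical namespace tree: hostA and hostB carry the same content.
namespace = {
    "hostA": {"docs": {"index.html": {}, "faq.html": {}}},
    "hostB": {"docs": {"index.html": {}, "faq.html": {}}},
    "hostC": {"about.html": {}},
}
print(find_mirrors(namespace))
```

Nested matches are reported too: if two hosts mirror each other, their matching sub-directories also appear as a group.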
Random Graphs. Erdős and Rényi's G(n, p) model: a graph with n vertices in which each of the n(n-1) arcs appears independently with probability p. Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1.
A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices. Indegrees and outdegrees are concentrated around the mean and have exponentially declining tails. Most vertices in the graph are close to most others (small world).
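The giant component claim above is easy to probe with a small simulation. The sketch below samples a random graph and reports the share of vertices in its largest component, tracked with a union-find structure. For simplicity it uses the undirected variant of G(n, p) with p chosen so the expected degree equals `mean_degree`; that simplification, the function name, and the parameter values are assumptions of the example rather than details from the talk.

```python
import random

def giant_component_fraction(n, mean_degree, seed=0):
    """Sample undirected G(n, p), p = mean_degree / (n - 1); return largest share."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each unordered pair becomes an edge independently with probability p.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                parent[find(u)] = find(v)

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

fraction = giant_component_fraction(500, 4.0)
print(f"largest component holds {fraction:.0%} of the vertices")
```

Rerunning with a smaller `mean_degree` (below 1, for example) shows the giant component disappearing, which is the graphical evolution picture in miniature.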
New Models
1. Ad hoc [Aiello, Chung and Lu, 2000]: pick a random graph which satisfies degree distribution constraints.
2. Copying models [Barabasi-Albert, KRRT, 2000-01].
3. Local optimization models [Papadimitriou et al., 2001].
4. See the survey [Mitzenmacher, 2000].