You are on page 1of 22

The Web as a Graph, Models and Algorithms Sridhar Rajagopalan IBM Almaden Research Center

Sridhar Rajagopalan

Graphs in Computer Science: How do they help? What do they model?


An abstraction in Core CS.
Examples: VLSI Circuits, Communication Networks, Logical flow of a computer program, Data structures.

An abstraction for data and relationships.


Examples: The Web, Social Networks, Flows and Flow Networks, Biological Data, Taxonomies, Citations, Explicit relations within a DB system.

What aspects are studied?


Algorithms, Data Structures and Complexity Theory. Characterization and Modeling of Graphs. Implementations of Graph Algorithms in Specific contexts.

What is new?
Some graphs are getting very very large.
Several Web crawls have over 2 Billion pages (nodes) and 10 times as many edges. Some social networks (Telephone call graphs, for instance) are huge.

Sensors and small devices will output enormous amounts of data.


Sridhar Rajagopalan

Why and how does scale change the game? It is not about an instance, it is about THE instance.
Systems and Software is designed from scratch to solve one (or a few) instances of the problem. Properties of the instance are important. Generality and genericity of the code base are not critical. An answer to a single instance can be worth a lot.

A new kind of algorithm and system design which is very engineering orientated.
Non-viability of the random access model. Hardware and software co-design..

Time to market considerations encourage careful approximations.


Statistical approaches are valuable. Do not need the complete answer, even a partial high quality result can be very valuable.
Sridhar Rajagopalan

What does a system and programming model for processing large data sets have to do?

System and programming model for processing large graphs. Exceptions will occur. Components (both hard and soft) will fail. Data structures will exceed available memory. Aware of statistical issues. Approximate or incomplete results are usually good enough. What happens when you string (even well understood) approximate techniques together.

Sridhar Rajagopalan

Commodity Computing Platforms 1.

Disk based storage is plentiful and cheap.


Current price less than $1.5 dollars/gigabyte.

2.

Memory hierarchies are complicated and pipelines are deep.


Large gap between random and sequential access, even within main memory. Random access in main is slower than sequential access to disk. CPU is underutilized, especially in data intensive applications. Where have all the cycles gone?

3.

Parallelism is great in theory, but hard in practice.


Nave data partitioning is the only practical option to keep system complexity under control.

Sridhar Rajagopalan

The Streaming Model [Munro-Paterson]

Data is seen as a sequence of elements.


Proposed as a model for large data sets, as well as data from sensors.

Issues:
Memory usage. Passes required over the data. Many variations.

Sorting is hard. As are almost all interesting combinatorial/graph theoretic problems. Exact computation of statistical functions are hard. Approximation is possible. Relationships to communication complexity and information complexity.
Sridhar Rajagopalan

Sorting

Sorting large a large data set on commodity hardware is a solved problem.


Google sorts more than 100 B terms in its index. SPSort. Penny Sort, Minute Sort.

But Sorting well requires a great deal of care and customization to the hardware platform. What is the cost of indexing the Web? 2B documents, each with 500 words = 1 Trillion records. Cost of index build per Penny Sort is under 50 bucks. Networking speeds make sharing large quantities of streaming data possible.

Sridhar Rajagopalan

Model 1: Stream + Sort Basic multi-pass data stream model with access to a Sort box. Quite powerful.
Can do entire index build (including PageRank like calculations). Spanning Tree, Connected Components, MinCut, STCONN, Bipartiteness. Exact computations of order statistics and frequency moments. Suffix Tree/Suffix Array build. Red/Blue segment intersection. So strictly stronger than just streaming.

Theorem : NC StrSort

Sridhar Rajagopalan

Questions About the Web Graph


Size: how big is the graph?How many links on a page (outdegree)? How many links to a page (indegree)? Connectedness: can one browse from any web page to any other? How many clicks? Sampling: can we pick a random page on the web? Browsing: how different is browsing from a random walk? Applications: can we exploit the structure of the web graph for searching and mining? Models: what does the web graph reveal about social processes which result in its creation and dynamics?

Sridhar Rajagopalan

A Picture of the Web Graph

Sridhar Rajagopalan

Broder et.al., 2000

Power Laws: A Curious Statistic About the Web Indegree outdegree distributions of the web graph are distributed by the power law. Component size distributions are distributed by the power law.

Sridhar Rajagopalan

Broder et.al., 2000

Power Laws Inverse polynomial tail. Word frequency in text. Yule (later Mandelbrot) statistical study of the literary vocabulary.[Yule, 1944]. Citation analysis [Lotka, 1926]. Zipf human behavior and the principle of least effort. [Zipf, 1947]. Pareto Cours deconomie politique. [Pareto,1897]. Network graph. [Faloutsos-Faloutsos-Faloutsos, 1999]. Oligonucleotide sequences [Martindale-Konopka, 1996]. Access statistics for web pages. (From server logs) [Glassman, 1997]. User behavior (instrument browsers and proxies) [Lukose-Huberman, 1998, Crovella and others,1997-99]. Many other instances.
Sridhar Rajagopalan

The Web Is a Rich Source of Graphs


Co-citation graph: C.
Nodes = web pages. Undirected (weighted) edges = co-citation (weight).

Bibliographic coupling graph: B.


Nodes = web pages. Undirected (weighted) edges = bibliographic overlap.

Content similarity graph: S.


Nodes = web pages. Undirected or directed (weighted) edges = similarity of text on the pages.

Namespace tree: T.
Nodes = URLs. Directed (labeled) edges = parent to child (labeled by extension).

Host graph: H.
Nodes = websites/webhosts. Directed (weighted) edges = (number) links from one host to the other.

Data mining graph: D.


Bipartite. Two nodes per web page, one per partition. Edges = from LHS to RHS corresponding to each edge in the web graph. Nodes on LHS are adjacent vertex sets. On RHS are items.
Sridhar Rajagopalan

Applications of Graph Methods


Searching.
Page rank. (Google). Hits. Clever and variants.

Estimation methods.
Sampling. Random walks.

Browsing and foraging.


Back links. Find similar.

Data mining.
Communities. Focused crawling. Mirrors.

Sridhar Rajagopalan

Spectral Analysis: Matrices Adjacency matrix A(G).

0 Ai , j ! 1
Markov chain M(G).

if i and j are not adjacent in G if i and j are adjacent in G

0 if i and j are not adjacent in G M i, j ! 1 if i and j are adjacent in G i d d i is the outdegree of vertex i
Sridhar Rajagopalan

Search: Eigenvectors of M. Page rank comes from the web graph. [Brin and Page, 1998]

Principal Eigenvector of (1  P ) M (G )  P
Hub rank comes from bibliographic coupling. [Kleinberg, 1998]

rincipal igenvector of A( B )
Authority rank comes from co-citations. [Kleinberg, 1998]

rincipal igenvector of A(C ).


Latent semantic indexing builds on textual similarity. [Dumais et.al.]

igen space of M ( S ).

Sridhar Rajagopalan

Co-citation and Web Communities. Social networks: Milgram 6DOS Routing. Bibliographic coupling thesis: frequently co-cited web pages are related. Pages with large bibliographic overlap are related. CS problem: enumerate all frequently co-cited groups of web pages. (Complete bipartite subgraphs). We call these cores.

K ( 3, 3)
Sridhar Rajagopalan

The Cores Are Interesting. Explicit communities. Implicit communities Yahoo!, Excite, Infoseek Hotels in costa rica webrings Clipart news groups Japanese elementary schools Turkish student associations mailing lists
Oil spills off the coast of japan Australian fire brigades Aviation/aircraft vendors Guitar manufacturers

(1) Implicit communities are defined by cores. (2) There are an order of magnitude more implicit communities. (3) Very reliable. Over 97% (sampled) make sense.

Sridhar Rajagopalan

More Applications. Find similar: look at neighborhood in co-citation graph (C). Bidirectional browsing: edges are reversed and a list of pages which point to the currently visible one is made available. Mirror detection: find identical sub-trees in the namespace tree.

Sridhar Rajagopalan

Random Graphs Erdos and RenyisGn , p model. Graph with n vertices. Each of n(n-1) arcs appear independently with probability p. Graphical evolution [Palmer]: study properties of the resulting random graph as p is increased from 0 to 1.

Sridhar Rajagopalan

Facts About the Erdos-Renyi Model

A random graph with average degree 4 has a giant connected component containing almost all (90%) of the vertices. Indegrees and outdegrees are concentrated around the mean. And have exponentially declining tails. Most vertices in the graph are close to most others (small world).

Sridhar Rajagopalan

New Models 1. 2. 3. 4. Ad Hoc. [Aiello, Chung and Lu, 2000] Pick a random graph which satisfies degree distribution constraints. Copying models. [Barabasi-Albert, KRRT, 2000-01]. Local optimization models. [Papadimitriou et.al. , 2001]. See [Mitzenmacher 2000] survey.

Sridhar Rajagopalan

You might also like