Mining Social and Semantic Network Data on the Web

Markus Schatten, PhD


University of Zagreb Faculty of Organization and Informatics

May 4, 2011

Outline

Introduction
Web 2.0, Semantic Web, Web 3.0
Network science
Data collection
    Semistructured data
    XPath
    Regular expressions
    Ajax
    APIs
Data analysis
    Network matrices
    Folksonomy model
    Logic programming
Case study - CROSBI

Questions

Why mine social and semantic network data?
How to collect these data?
What are the common obstacles and how to solve them?
How to analyze these data?

Web 2.0, Semantic Web, Web 3.0

Why should one bother with these buzzwords?

Why mine (social) web data?

Millions of people, on thousands of sites, across hundreds of diverse cultures, leave trails about various aspects of their lives...

Social systems are systems of communication (Niklas Luhmann)

By analyzing the trails (the results of social systems' structural coupling) we can analyze the social system itself: politics, culture, media, business, technology, science ...

The new science of networks

source: The New York Times (April 3rd, 1933, p. 17).

source: An Attraction Network in a Fourth Grade Class (Moreno, Who shall survive?, 1934).

Political and financial networks

Mark Lombardi (1980s & 1990s)

Terrorist networks

Board of directors membership networks

source: http://theyrule.net

On-line social networks

Internet

source: Bill Cheswick http://www.cheswick.com/ches/map/gallery/index.html

Air traffic networks

source: Northwest Airlines WorldTraveler Magazine

Railway networks

source: TRTA, March 2003 - Tokyo rail map

Semantic networks

source: http://wordnet.princeton.edu/man/wnlicens.7WN

Gene networks

source: http://www.zaik.uni-koeln.de/bioinformatik/regulatorynets.html.en

Food chain networks

source: http://marinebio.org/Oceans/Biotic-Structure.asp

On-line social and semantic networks

Where can we find networks?
    social networking sites
    forums, bulletin boards, discussion groups, mailing lists
    blogs, microblogs, vlogs
    wikis, semantic wikis
    social tagging
    RSS feeds, RDF repositories, metadata collections ...

A social network can be found on any application where users communicate.
A semantic network can be found in any specification where concepts are meaningfully connected.

Common obstacles in data collection

Semi-structured data (HTML, XHTML, XML, RDF, OWL ...)
The need for pattern matching
Client-side content generation

Semistructured data and XPath

query language for semistructured data
a W3C standard
uses path expressions to navigate through documents
has a standard library with over 100 useful functions

Example:

/html/body/div[@id='background']/div/div[@id='mainBar']/div/p[last()-1]
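The path expression above can be tried out with a minimal sketch like the following. It assumes Python and the standard-library `xml.etree.ElementTree` module, which implements only a subset of XPath. The `PAGE` document is invented for illustration, and real-world HTML is often not well-formed XML, so it would usually need an HTML-aware parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page that mirrors the structure the path expects.
PAGE = """<html><body>
<div id="background"><div><div id="mainBar"><div>
<p>first paragraph</p>
<p>second paragraph</p>
<p>third paragraph</p>
</div></div></div></div>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# ElementTree paths are relative to the element they are called on,
# so the leading /html step of the original expression is dropped.
PATH = "body/div[@id='background']/div/div[@id='mainBar']/div/p[last()-1]"

# p[last()-1] selects the second-to-last <p> child.
texts = [p.text for p in root.findall(PATH)]
print(texts)
```

The predicate `[@id='...']` filters on an attribute value, while `[last()-1]` filters on position, which is what makes such paths robust against extra paragraphs being prepended.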

XPather (Firefox add-on)


Regular expressions

Useful for extracting variously formatted data and searching for keywords

Example (date matching):
27-9-2001
[0-9]{1,2}-[0-9]{1,2}-[1-2][0-9]{3}
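As a rough illustration, the date pattern can be applied with Python's standard `re` module. The sample text is invented, and the `\b` word boundaries are an addition not present on the slide, included so that longer digit runs are not matched partially.

```python
import re

# Day-month-year with hyphens, e.g. 27-9-2001; \b anchors are an assumption
# added here to avoid matching inside longer numbers.
DATE_RE = re.compile(r"\b[0-9]{1,2}-[0-9]{1,2}-[1-2][0-9]{3}\b")

# Hypothetical scraped text.
text = "Posted on 27-9-2001, edited on 3-10-2002; ticket 123-45-678 is not a date."

dates = DATE_RE.findall(text)
print(dates)
```

Note that the pattern only checks the shape of the string: it would also accept impossible dates such as 99-99-1999, so validation has to happen after extraction.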

Client-side generated content

Some sites use Ajax to generate data on page load or on user interaction (e.g. mouse click, mouse over etc.)
In combination with obfuscated JavaScript code this is hard to scrape!
Current solutions:
    Browser scripting (can be slow and cumbersome) - e.g. with Selenium
    Headless browser - e.g. mechanize, SpiderMonkey, HtmlUnit

Open APIs

Most popular social web sites have an open API that allows you to connect your application to their site (Facebook, Twitter, Wikipedia, ...)
Still, such APIs often have drawbacks for network data collection:
    Limited access rights (e.g. Facebook)
    Limited queries per time interval (e.g. Twitter)
    etc.
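One common way to live with a query limit is to space requests out on the client side. The sketch below is only an assumption about how a collector might do that: the `Throttle` class, its parameters and the example limit are hypothetical and do not correspond to any particular site's API.

```python
import time


class Throttle:
    """Space calls evenly so that at most max_calls happen per period (seconds)."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock            # injectable for testing
        self.sleep = sleep            # injectable for testing
        self.next_allowed = float("-inf")

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.clock()
        self.next_allowed = max(now, self.next_allowed) + self.min_interval


# Usage sketch with a fake clock, so the example runs instantly.
fake_time = [0.0]
slept = []


def fake_clock():
    return fake_time[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_time[0] += seconds


throttle = Throttle(max_calls=2, period=10.0, clock=fake_clock, sleep=fake_sleep)
for _ in range(3):
    throttle.wait()   # a real collector would issue its API request here
print(slept)
```

With the default `clock` and `sleep` arguments the same object would actually pause the collector between requests.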

Networks are most often analyzed using graph theory:

Definition
A graph $G$ is the pair $(N, E)$, whereby $N$ represents the set of vertices or nodes, and $E \subseteq N \times N$ the set of edges connecting pairs from $N$.

Definition
A directed graph or digraph $G$ is the pair $(N, A)$, whereby $N$ represents the set of nodes, and $A \subseteq N \times N$ the set of ordered pairs of elements from $N$ that represent the set of graph arcs.

Definition
A valued or weighted graph $G_V$ is the triple $(N, A, V)$, whereby $N$ represents the set of nodes or vertices, $A \subseteq N \times N$ the set of pairs of elements from $N$ that represent the set of graph arcs, and $V : N \to \mathbb{R}$ a function that attaches values or weights to nodes.

Definition
A bipartite graph $G = (N, E)$ is a graph whose set of nodes admits a partition $\{X, Y\}$ of $N$ such that every edge has one end in $X$ and the other in $Y$.

Definition
A multigraph $G = (N, E)$ is a graph in which $E$ is a multiset, i.e. there can exist multiple edges between two nodes.

Matrix representation

Definition
Let $G$ be a graph defined with the set of nodes $\{n_1, n_2, \ldots, n_m\}$ and edges $\{e_1, e_2, \ldots, e_l\}$. For every $i, j$ ($1 \le i \le m$ and $1 \le j \le m$) we define

$$a_{ij} = \begin{cases} 1, & \text{if there is an edge between nodes } n_i \text{ and } n_j \\ 0, & \text{otherwise} \end{cases}$$

Matrix $A = [a_{ij}]$ is then the adjacency matrix of graph $G$. The matrix is symmetric, since if there is an edge between nodes $n_i$ and $n_j$ then clearly there is also an edge between $n_j$ and $n_i$. Thus $A = [a_{ij}] = [a_{ji}]$.
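The definition can be made concrete with a small sketch in plain Python. The node names and the edge list are invented for illustration, and `adjacency_matrix` is a hypothetical helper rather than part of any library.

```python
def adjacency_matrix(nodes, edges):
    """Build the adjacency matrix of an undirected graph as a list of rows."""
    index = {node: i for i, node in enumerate(nodes)}
    m = len(nodes)
    a = [[0] * m for _ in range(m)]
    for u, v in edges:
        # An undirected edge is recorded in both directions.
        a[index[u]][index[v]] = 1
        a[index[v]][index[u]] = 1
    return a


# Hypothetical graph: a simple chain n1 - n2 - n3 - n4.
nodes = ["n1", "n2", "n3", "n4"]
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]

A = adjacency_matrix(nodes, edges)
A_transposed = [list(column) for column in zip(*A)]
is_symmetric = A == A_transposed

for row in A:
    print(row)
print("symmetric:", is_symmetric)
```

For a digraph one would drop the second assignment inside the loop, and the symmetry check would then fail in general.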

Simple graph

Digraph

Weighted graph

Bipartite graph

Multigraph

Matrix algebra and interpretation

Matrices can be transposed, multiplied ...
Examples:
    $P$ - process flow matrix ($p \times p$, $p$ - number of processes)
    $W$ - co-worker matrix ($w \times w$, $w$ - number of workers)
    $O$ - process responsibility matrix ($p \times z$)
    $WW$ - co-workers of co-workers
    $P^T$ - reversed process flow
    $O^T O$ - matrix of workers who are responsible for the same processes
    $PO$ - matrix of workers who are responsible for the forthcoming processes
    $P^T O$ - matrix of workers who are responsible for the previous processes
    ...
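A few of these products can be worked through on a toy example. The sketch below uses plain Python lists. The three processes, the two workers and all matrix entries are invented, and it assumes that the columns of $O$ correspond to workers.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


# Hypothetical process flow: p1 -> p2 -> p3 (rows = from, columns = to).
P = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]

# Hypothetical responsibilities (rows = processes, columns = workers w1, w2):
# w1 handles p1 and p2, w2 handles p2 and p3.
O = [[1, 0],
     [1, 1],
     [0, 1]]

reversed_flow = transpose(P)                  # P^T
same_process = matmul(transpose(O), O)        # O^T O, worker x worker
next_process = matmul(P, O)                   # P O, process x worker
previous_process = matmul(transpose(P), O)    # P^T O, process x worker

print(same_process)
print(next_process)
print(previous_process)
```

In `same_process` the diagonal counts how many processes each worker is responsible for, and the off-diagonal entry counts the processes two workers share. Row `i` of `next_process` lists the workers responsible for the processes that directly follow process `i`.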

ACI model (Mika 2007)

$A$ - actors
$C$ - concepts
$I$ - instances
$T \subseteq A \times C \times I$ - folksonomy

A tripartite hypergraph which can be reduced into three bipartite graphs:
    $AC$ - network of actors and concepts
    $AI$ - network of actors and instances
    $CI$ - network of concepts and instances
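The reduction can be sketched by projecting each (actor, concept, instance) triple onto two of its three components and counting how often each pair occurs. The tagging data below is invented, and using the pair count as an edge weight is one possible convention, stated here as an assumption.

```python
from collections import Counter

# Hypothetical folksonomy T: who tagged which resource with which tag.
T = {
    ("alice", "python", "url1"),
    ("alice", "web", "url1"),
    ("bob", "python", "url1"),
    ("bob", "python", "url2"),
}

# Each bipartite graph is stored as a weighted edge list:
# the weight is the number of triples that contain the pair.
AC = Counter((actor, concept) for actor, concept, instance in T)
AI = Counter((actor, instance) for actor, concept, instance in T)
CI = Counter((concept, instance) for actor, concept, instance in T)

print(AC[("bob", "python")])    # bob used "python" on two instances
print(AI[("alice", "url1")])    # alice attached two concepts to url1
print(CI[("python", "url1")])   # two actors tagged url1 with "python"
```

Each `Counter` can be turned into a bipartite adjacency matrix by indexing its two node sets, after which the matrix products from the previous slides apply.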

Delicious - analysis

Problems with matrix representation

Only the 1s contain information

Sparse matrix
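A minimal sketch of the idea, assuming a coordinate-style sparse representation that stores only the positions of the 1s. The matrix is invented and `to_sparse` is a hypothetical helper.

```python
def to_sparse(matrix):
    """Return the set of (row, column) positions that hold a non-zero entry."""
    return {
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value
    }


# Hypothetical adjacency matrix with two undirected edges.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]

S = to_sparse(A)
stored = len(S)
total = len(A) * len(A[0])
print(sorted(S))
print(stored, "of", total, "cells stored")
```

The saving grows with the size of the network, since real social networks typically have far fewer links than the square of the number of nodes.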

Problems with sparse matrix

If A is connected to B, then B is connected to A (undirected graphs)
How to find friends of friends?
Is there a path from node A to node B? Which path is that?
How to compute the previously outlined graphs?

Logic programming as a solution

Deductive languages like Prolog, FLORA-2

Advantages:
    Declarative problem solving (on a higher level)
    Very appropriate for expressing mathematical structures

Disadvantages:
    Combinatorial explosion (use heuristics and predicate tabling)

Examples

If A is connected to B, then B is connected to A
    % left-recursive: loops in plain Prolog, so link/2 should be tabled
    link( A, B ) :- link( B, A ).

Friend of a friend
    ff( A, B ) :- link( A, C ), link( C, B ).

Is there a path from node A to node B?
    path( A, B ) :- link( A, B ).
    path( A, B ) :- link( A, C ), path( C, B ).

Which path is that?
    whichpath( X, Y, [ X, Y ] ) :- link( X, Y ).
    whichpath( X, Y, [ X, Z | T ] ) :- link( X, Z ), whichpath( Z, Y, [ Z | T ] ).
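For readers without a Prolog system at hand, the same four questions can be approximated in Python. This is a sketch, not a translation of the deductive semantics: the link facts are invented, and a breadth-first search stands in for Prolog's backtracking, so it returns one shortest path instead of enumerating all of them.

```python
from collections import deque

# Hypothetical link facts.
links = {("ana", "boris"), ("boris", "cvita"), ("cvita", "dino")}

# Rule 1: if A is connected to B, then B is connected to A.
links |= {(b, a) for a, b in links}

neighbours = {}
for a, b in links:
    neighbours.setdefault(a, set()).add(b)


def ff(a):
    """Friends of friends of a; like the Prolog rule, this can include a itself."""
    return {c for b in neighbours.get(a, ()) for c in neighbours.get(b, ())}


def whichpath(start, goal):
    """Return one shortest path with at least one link, or None if there is none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        current = queue.popleft()
        for nxt in sorted(neighbours.get(current[-1], ())):
            if nxt == goal:
                return current + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(current + [nxt])
    return None


def path(a, b):
    """Is there a path from a to b?"""
    return whichpath(a, b) is not None


print(ff("ana"))
print(path("ana", "dino"), whichpath("ana", "dino"))
print(path("ana", "eva"))
```

The `seen` set plays roughly the role that tabling plays in the logic programs: it keeps the search from revisiting nodes and looping forever on cyclic or symmetric links.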

Project outline

Croatian Scientific Bibliography (CROSBI, http://bib.irb.hr)
Scientists enter data about their research activities themselves, which makes it a social application
Research hypotheses:
1. analyze two social systems (the Croatian scientific community and the Croatian public) and see how the most important research concepts (as viewed by the scientific community) are interpreted by the public (does the public understand the most important concepts from Croatian science in the last decade?).
2. analyze how various scientific fields are conceptually and socially interrelated (do scientists do interdisciplinary research together if they have conceptually connected fields?)

Methodology

Social systems are represented by their trails: CROSBI as the Croatian scientific community, Croatian Wikipedia as the Croatian public
One system can interpret concepts from another if the concept exists within its conceptual network (e.g. keyword or wiki page)
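The interpretation criterion amounts to a membership test between two vocabularies. The sketch below assumes that concepts are compared as case-insensitive strings, which is a simplification introduced here. The keyword and page-title sets are invented and are not taken from the study's data.

```python
def normalize(term):
    """Assumed matching rule: ignore case and surrounding whitespace."""
    return term.strip().lower()


# Hypothetical vocabularies of the two social systems.
crosbi_keywords = {"graph theory", "semantic web", "folksonomy"}
wikipedia_titles = {"Graph theory", "Semantic Web", "Zagreb"}

public_concepts = {normalize(title) for title in wikipedia_titles}

# A scientific concept counts as interpretable by the public
# if it also exists in the public's conceptual network.
interpretable = {k for k in crosbi_keywords if normalize(k) in public_concepts}
coverage = len(interpretable) / len(crosbi_keywords)

print(sorted(interpretable))
print(round(coverage, 2))
```

In practice the matching rule would likely need to be more tolerant than this one, for instance to handle inflection and synonyms.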

Data collection

CROSBI data was collected using Scrapy in November 2010 (a total of 285,234 entries)
Wikipedia data was collected using the Wikipedia API, with Scrapy for parsing

Results


Bibliography
1. Adamic, L.: Why networks are interesting to study, University of Michigan, School of Information, https://open.umich.edu/education/si/si508/fall2008
2. Barrat, A., Barthélemy, M., Vespignani, A.: Dynamical Processes on Complex Networks, Cambridge University Press, 2008.
3. Carley, K.M.: Dynamic Network Analysis, Summary of the NRC workshop on social network modeling and analysis, Committee on Human Factors, National Research Council, 133-145, Eds: Breiger, R., Carley, K.M., and Pattison, P., 2003.
4. Divjak, B., Lovrenčić, A.: Diskretna matematika s teorijom grafova, TIVA & FOI, 2005.
5. Krackhardt, D., Carley, K.M.: A PCANS Model of Structure in Organization, Proceedings of the 1998 International Symposium on Command and Control Research and Technology, Evidence Based Research, Vienna, VA, June 1998.
6. Mika, P.: Ontologies are us: A unified model of social networks and semantics, Web Semantics: Science, Services and Agents on the World Wide Web 5(1), 5-15, March 2007.
7. Newman, M., Barabási, A.-L., Watts, D.J.: The Structure and Dynamics of Networks, Princeton University Press, 2006.
8. Various web sites (images, graphs)
