
Analyzing Multi-Source Social Data for Extracting and Mining Social Networks

I-Hsien Ting
Department of Information Management, National University of Kaohsiung, Kaohsiung City, Taiwan
iting@nuk.edu.tw

Hui-Ju Wu
Institute of Human Resource Management, National Changhua University of Education, Changhua, Taiwan
d94311001@mail.ncue.edu.tw

Pei-Shan Chang
Department of Information Management, National University of Kaohsiung, Kaohsiung City, Taiwan
purplemio@gmail.com

Abstract: In recent years, social computing has become a very popular application on the Internet, and a large amount of social (communication) data has therefore been collected by different social computing applications. This paper introduces a methodology to collect and analyze multi-source social data and, by this means, to extract social networks from the data. A system architecture is also presented to show how the data can be collected, pre-processed and analyzed. Furthermore, the system will allow users to use the data as a resource for personal decision support.

Keywords-Social Networking; Instant Messenger; E-mail; Data Mining; Social Network Analysis

I. INTRODUCTION

With the rapid growth of Internet and communication technologies, many communication and social activities have moved to Internet-based platforms, e.g. e-mail, instant messaging software and social networking websites (such as blogs and web albums) [8]. As a result, a large amount of personal communication and social data has been aggregated and stored in different locations [15]. However, these valuable data have not been well organized, treated and used. How to use current information techniques, such as artificial intelligence, data mining or visualization, to process and analyze these data is therefore an interesting research issue [18].

Social network analysis and construction originated in the research field of Sociology. In recent years, many research issues in information science and social networking have received attention due to the development of information techniques and the growing requirements for data processing ability [13]. The target of social network analysis and construction is relationship data, and it is therefore suitable for processing and analyzing the communication and social data discussed above [7].

Communication data, such as e-mails and instant messenger logs, are very common in our daily life. However, little work has focused on how to organize, process and analyze these data [31]. In this paper, we therefore propose a method to analyze common social data and discuss how to extract social networks dynamically from multi-source social data. Furthermore, this paper proposes a system architecture that uses three techniques (social network analysis, social network construction and visualization) to process and analyze those valuable data. The system will allow users to input tasks for dynamic social network analysis and construction, and the final results will be presented through a visual interface for decision support.

The structure of this paper is organized as follows. Section 1 gives the background and introduction. Related literature on social network extraction, social networks and data mining, and social network analysis is reviewed in section 2. A system architecture for extracting dynamic social networks from multi-source data is proposed in section 3, together with an introduction to the components of the system. In section 4, we focus on how to extract social networks from social data and how to use data mining and AI techniques for decision support. Section 5 concludes the paper with suggestions for future research.

II. LITERATURE REVIEW

In this section, related literature is reviewed, including social network analysis, social network extraction and social networking for decision support.

A. Social Networks Analysis

The research methodology of social network analysis was developed to understand the relationships between actors, where an actor can be a person, an organization, an event or an object [4]. In a social network, each actor is represented as a node, and each pair of nodes can be connected by a line to show their relationship. The social network structure graph is the graph formed by those lines and nodes, and social network analysis is therefore a methodology used to understand the graph, the relationships and the actors in the social network [11][34][27].
There are three important elements in a social network: actors, ties, and relationships. Actors are the essential elements of the social network and represent people, events or objects. Ties construct the relationship between actors by means of a path that establishes the relationship directly or indirectly. Ties can also be divided into strong ties and weak ties according to the strength of the relationship; they are also useful for discovering the subgroups of the social network. Relationships illustrate the interactions between two actors. Furthermore, different relationships may cause the network to reflect different characteristics [32][33].

The most important measurements of SNA include network size, diameter, density, centrality and structural holes [5]. Size measures the number of nodes or links in a network, diameter measures the longest of the shortest paths between any two nodes in the network, and density measures how closely connected the network is, i.e. the proportion of possible links that are actually present [23][28]. These measurements are commonly used in social network research and will be used in this paper as well.

Traditionally, research on SNA has mainly focused on small groups of actors, and the data were processed manually in most cases [6]. However, with the rapid growth of Internet and web techniques, more and more data have been collected, and processing these data manually has become a hard task [9]. Therefore, scholars in information technology and computer science have started to devote research effort to these issues [12][26]. Currently, computer science research in SNA can be divided into four main topics: social network construction, social network extraction, social network analysis and visualization [24].

B. Social Networks Extraction

In information technology and computer science research on social networking, social network extraction is a subfield that focuses on extracting social networks from large amounts of communication data. With the rapid growth of the Internet and the WWW, various kinds of data have been generated for communication purposes. Commonly used communication data include email communication data, web usage logs, event logs, instant messenger logs, telecommunication logs, etc. [22][29].

Some research currently focuses on the extraction of these social data. For example, Bird et al. proposed a method to extract social networks from e-mail communications [3]. Agrawal et al. used web mining techniques to understand the behavior of users in newsgroups [2]. The web is considered the biggest database in the world, so various social networks can be extracted from this resource. For instance, Furukawa et al. tried to identify social networks in the blogspace [14][19]. Jin et al. and Matsuo et al. developed systems to extract social networks from the web [17][21]. Adamic and Adar developed a method to discover the relationships of friends and neighbours on the web [1].

Most of the research discussed above focuses on a single source for social network extraction. However, the issue of how to extract social networks from different sources has not been discussed well in the related literature, and integrating multi-source data for social network extraction is also a hard task. In addition to the problem of multi-source data, instant messengers have recently become very popular software for sending messages and communicating. However, recent research has not addressed how to extract social networks from instant messenger data. These research issues will be discussed in this paper.

C. Web Mining Techniques for Social Networking

According to their analysis targets and resources, web mining techniques can be divided into three types: Web Content Mining, Web Structure Mining and Web Usage Mining [30].

Web content mining is a web mining technique for analyzing the contents of the web, such as texts, graphs, graphics, etc. [2]. Recently, most web content mining research has focused on text data processing, and little on other multimedia data. Natural language processing is therefore the main technology used in this area. The concepts and techniques of the Semantic Web and ontology also have to be studied [16][20].

Web structure mining is a technique for analyzing the links and structure of websites [10]. Graph theory is usually the main concept and theory used in web structure mining to analyze and explain the structure of websites. In addition, extracting the structure of websites is always essential in this research area [12]. Therefore, a usual concern is how to design and implement a crawler (or spider, bot) to extract and construct the structure of websites, as in Deep Web research.

Web usage mining is a web mining technique for analyzing how websites have been used, such as the navigation behavior of users. Server-side clickstream data (log files) are the main source used for web usage mining. Client-side data (such as client-side log files and cookies) are sometimes used as well, for example in order to record more complete user behavior. Web usage mining analyses include basic statistical analysis of the navigation behavior of users in a website, such as how many times the website has been browsed, where the users come from, etc. Furthermore, advanced web usage mining analyses can also be provided, such as more complex analysis for understanding the navigation history of users in a website, or cross-website analysis [25].
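The basic usage statistics mentioned in the web usage mining discussion of Section II.C (how often a page is browsed and where visitors come from) could be computed along the following lines. This is only a minimal sketch: the record layout `(client_ip, page, referrer)`, the sample data and the function name `basic_usage_stats` are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter

# Hypothetical clickstream records: (client_ip, requested_page, referrer).
# The field layout and values are assumptions for illustration only.
SAMPLE_LOG = [
    ("10.0.0.1", "/blog/entry1", "http://search.example/"),
    ("10.0.0.2", "/blog/entry1", "http://portal.example/"),
    ("10.0.0.1", "/blog/entry2", "/blog/entry1"),
    ("10.0.0.3", "/blog/entry1", "http://search.example/"),
]


def basic_usage_stats(records):
    """Count page views per page and visits per referrer."""
    page_views = Counter(page for _, page, _ in records)
    referrers = Counter(ref for _, _, ref in records)
    return page_views, referrers


page_views, referrers = basic_usage_stats(SAMPLE_LOG)
print(page_views.most_common())   # pages ordered by number of views
print(referrers.most_common(1))   # the most frequent referrer
```

More advanced analyses, such as reconstructing navigation paths per user, would need session identification on top of these simple counts.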
III. SYSTEM ARCHITECTURE

According to the research background and motivation of this paper, we have designed a system architecture to address the issues raised above. The system allows us to collect social data from different sources, such as e-mail, instant messenger and blogs. The multi-source social data will then be pre-processed and analyzed. The processed data can be used to extract social networks automatically and dynamically. The system can then be further developed into a decision support system. However, this paper does not focus on the decision support system; it focuses only on the methodology for collecting multi-source social data, pre-processing the data, and integrating them to generate social networks, which is the main contribution of this paper. The architecture of the social network extraction system is presented in Figure 1.

Figure 1. The architecture of the social network extraction system

As shown in Figure 1, the system can be divided into two major phases according to the characteristics of the processing: offline data collection and processing, and online processing. The elements and processes of the two phases are introduced in detail below.

A. Offline data collection and processing

The first phase of the system works mainly offline, and it has three elements: multi-source social data collection, a data extraction engine and a database.

1) Multi-source social data collection
This is the first step of the system. In this paper, we intend to collect the social data that are most related to personal daily communication. Thus, three types of data will be collected: instant messenger data, email data and blog browsing data. The messaging history of MSN or other messengers is recorded in a structured format, such as XML. The details of the message contents, the communication target and the time are stored in the file. In this paper, the history file of MSN Messenger is used as the instant messenger social data.

For data collection, a data collection system will be introduced in section 4 of the paper. The system is web-based, and it allows users to upload email and the history file of MSN Messenger either automatically or manually. To collect blog browsing data, we use a client-side agent to record the navigation history of a user when using a particular blog. With client-side logging, the complete browsing data can be recorded, including the behavior of browsing, posting a message and responding to a message.

2) Data extraction engine
The second step of the system is the data extraction engine. The engine first processes the data collected in the previous step. Then, useful data are extracted and filtered out from the raw data. The details of how the data are processed and extracted will be discussed in section 4 of the paper.

3) Database
After data collection and extraction, the output of the data extraction engine is stored in a database. The database is designed according to the characteristics of the different sources of social data.

B. Online processing

The second phase of the system is a possible future application of the paper. The work in this phase is mainly processed online, based on the data collected offline. The elements of this phase are introduced as follows, even though they are not the main focus of the paper.

4) Ontology-base
The second phase of the system is a possible application of the paper. In the system, the user can use the collected social data for personal decision support. The system allows the user to input a keyword and other parameters to ask for decision support. A social network for dealing with the problem will then be generated, which provides possible solutions for the user.
The ontology base in the system is used to illustrate the semantics of the keyword that the user inputs. It is also a very important element of the system for social network extraction.

5) User input:
There are two essential user inputs to the system: a keyword and other parameters. The parameters are limitations and conditions that the system uses to scale down the extracted social network. For example, if the input keyword is "BBQ", it means the user may want to create a network about BBQ from the social data in order to find previous communications about BBQ in email, MSN or blog browsing. The other parameters could then be "red wine" and "relationship > 2". This means that the user wants to find friends to attend a BBQ party, but the participants must have attended a BBQ party more than twice before and must bring red wine with them.

6) SNA engine:
After the keywords and parameters have been input, the system uses SNA methods to obtain the information needed for social network construction, such as the nodes, relationships, closeness, etc. The details of how to calculate the relationship value from the multi-source social data will be discussed in the next section.

7) SN Construction engine:
The SN construction engine uses the results of SNA to prepare the visualization of the SN, for example how many nodes will be used in the SN, the characteristics of the links, the details of the network, etc. The gathered information will be used for SN construction and visualization.

8) Visualization engine:
In this research, the OpenGL library will be used to visualize the network based on the information provided by the SNA engine and the SN construction engine.

9) Dynamic and task-oriented social network:
Finally, the system generates a social network based on the user's input, which is considered a dynamic and task-oriented social network.

IV. DATA COLLECTION SYSTEM AND EXTRACTION ENGINE

In this section, the paper focuses on three main tasks of the system architecture introduced previously: the data collection system, the data extraction methodology and the relationship calculation methods for social network construction.

A. The data collection system

Data collection is the first step of the system, and we therefore designed a sub-system for uploading the related data. The system is web-based, and it allows the user to upload email files (in *.eml or text format), MSN history data (in *.xml format) and client-side logging data (in *.log or text format). The format and a sample file of the email and MSN history data are introduced below.

For email files, the system allows users to upload files manually or automatically. Although different email agents or servers may produce different email files, the system accepts any email file, and the different email file formats share some common fields. The extraction of fields from the uploaded email file will be discussed later in this paper to show which fields are useful for the research. Figure 2 shows a sample email file saved by an email agent, and Figure 3 shows the collected mail in the system.

    Return-path: <eri@xx.xx.xx.xx>
    Envelope-to: RSs@xx.xxxx.xx.xx
    Received: from funnelweb.cs.york.ac.uk ([144.32.161.232]
    Message-ID: <47552CF4.70806@xx.xxx.xx.xx>
    Date: Tue, 04 Dec 2007 10:33:24 +0000
    From: E Rid <eri@xx.xxx.xx.xx>
    Reply-To: eri@xx.xx.xx.xx
    User-Agent: Thunderbird 2.0.0.9 (Windows/20071031)
    MIME-Version: 1.0
    To: RSs@xx.xxx.xx.xx
    Subject: java versus C benchmarks
    Content-Type: text/plain; charset=ISO-8859-1; format=flowed
    Content-Transfer-Encoding: 7bit
    Status: RO

Figure 2. A sample email file

Figure 3. The email collection system

In this research, only the instant messenger history file of MSN is accepted. The MSN history file is stored as an .xml file in XML format. Each contact in the MSN contact list has an independent history file, and the information stored in the file includes the session ID, date and time, sender (from), receiver (to) and message content. Figure 4 shows a sample MSN history file used in the paper.

    <?xml version="1.0"?>
    <?xml-stylesheet type='text/xsl' href='MessageLog.xsl'?>
    <Log FirstSessionID="1" LastSessionID="12">
    <Message Date="2009/7/4" Time=" 12:50:55" DateTime="2009-07-03T16:50:55.390Z" SessionID="31">
    <From><User FriendlyName="Want SAP, Netweaver, J2EE consultant"/></From>
    <To><User FriendlyName="Derrick-"/></To><Text Style="font-family:; color:#000000; ">
    ok...I got it...
    </Text>
    </Message>
    </Log>

Figure 4. A sample MSN history file
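The fields listed for the MSN history file (session ID, date and time, from, to and message content) can be read from a log shaped like the Figure 4 sample with a standard XML parser. The following is a minimal sketch using Python's standard library; the function name `extract_messages` and the dictionary keys are assumptions made for illustration, and the paper does not describe how its own extraction engine is implemented.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Figure 4 sample (stylesheet line and Style attribute omitted).
SAMPLE_LOG = """<?xml version="1.0"?>
<Log FirstSessionID="1" LastSessionID="12">
<Message Date="2009/7/4" Time=" 12:50:55" DateTime="2009-07-03T16:50:55.390Z" SessionID="31">
<From><User FriendlyName="Want SAP, Netweaver, J2EE consultant"/></From>
<To><User FriendlyName="Derrick-"/></To><Text>
ok...I got it...
</Text>
</Message>
</Log>"""


def extract_messages(xml_text):
    """Return one dict per <Message> element with the fields named in the text."""
    root = ET.fromstring(xml_text)
    records = []
    for msg in root.iter("Message"):
        senders = [u.get("FriendlyName") for u in msg.findall("From/User")]
        receivers = [u.get("FriendlyName") for u in msg.findall("To/User")]
        records.append({
            "session_id": msg.get("SessionID"),
            "datetime": msg.get("DateTime"),
            "from": senders,
            "to": receivers,  # more than one entry would indicate a multi-user message
            "content": (msg.findtext("Text") or "").strip(),
        })
    return records


messages = extract_messages(SAMPLE_LOG)
print(messages[0]["from"], "->", messages[0]["to"])
```

Keeping senders and receivers as lists is a design choice of this sketch: it leaves room for conversations with several participants without changing the record shape.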
B. Data extraction methodology

In order to filter out unnecessary data from the collected email and MSN history files, we developed a data extraction methodology to extract the data that are useful for constructing social networks. In section 4.A, we introduced a sample email file format. However, different email agents and servers may use different email formats. Thus, we have selected some important fields in the email file that are useful for calculating the social relationship between the communicators in the collected emails and MSN history.

From the email file, the following fields are extracted: deliver-to, receive-id, date, to, from, subject, msg-id, priority, reply-to, mailer (agent), encode, content-type, content, cc. These fields are extracted from the original file. Some of the extracted fields are used to identify emails, and some are important for relationship measurement.

In addition to the email fields, we also extract useful fields from the MSN history file: msn-from, msn-to, msn-content, msn-datetime, msn-id, msn-sessionid, msn-totage. Among the extracted fields, the msn-sessionid field records the session number, and the msn-totage field identifies a communication with multiple users.

C. From social data to social networking

After the useful information has been collected and extracted from the multi-source social data, the relationship can be calculated. When the user inputs keywords and parameters, the system matches the keyword against the ontology base to find related email records, MSN messages and blog content.

Figure 5. Communication frequency table

The relationship (communication frequency) is the most important element in forming a social network. Thus, a series of formulations for measuring the relationship in different kinds of social communication is discussed in this section. The data collection system introduced in section 4.A can also measure the frequency for the input keyword, as presented in Figure 5.

The relationship of two nodes can be measured by formulation (1). $R_i$ denotes the relationship from a specific node (a keyword with parameters) to a node $i$ in a social network. $E_i$ denotes the email relationship for node $i$, $M_i$ the MSN relationship for node $i$, and $B_i$ the Blog relationship for node $i$. $W_1$, $W_2$ and $W_3$ are three different weight values that measure the importance of each relationship.

$$R_i = W_1 E_i + W_2 M_i + W_3 B_i \qquad (1)$$

The details of relationships E, M and B can be measured by formulations 2, 3, 4 and 5. In formulation 2, $E_{\mathrm{send}}$ is the number of mails sent to node $i$, $E_{\mathrm{receive}}$ is the number of mails received from $i$, $E_{\mathrm{forward}}$ is the number of mails forwarded from $i$, $E_{\mathrm{cc}}$ is the number of mails received as cc from $i$, and $E_{\mathrm{co\text{-}receiver}}$ is the number of mails received as co-receiver from $i$. Furthermore, $W_1$, $W_2$, $W_3$, $W_4$ and $W_5$ are the weight values for each email relationship.

$$E_i = W_1 E_{\mathrm{send}} + W_2 E_{\mathrm{receive}} + W_3 E_{\mathrm{forward}} + W_4 E_{\mathrm{cc}} + W_5 E_{\mathrm{co\text{-}receiver}} \qquad (2)$$

Formulation 4 is the complete formulation for MSN relationship measurement. $M_{\mathrm{send}}$ is a message sent to node $i$, $M_{\mathrm{receive}}$ is a message received from $i$, $M_{\mathrm{multi}}$ denotes a message sent to multiple communicators at the same time, and $M_{\mathrm{interaction}}$ is an advanced interaction with node $i$, such as a video conference, file sharing, etc. However, only the measurements $M_{\mathrm{send}}$, $M_{\mathrm{receive}}$ and $M_{\mathrm{multi}}$ are used in this paper, as in formulation 3. In formulations 3 and 4, $W_1$, $W_2$, $W_3$ and $W_4$ are the weight values for each MSN relationship.

$$M_i = W_1 M_{\mathrm{send}} + W_2 M_{\mathrm{receive}} + W_3 M_{\mathrm{multi}} \qquad (3)$$

$$M_i = W_1 M_{\mathrm{send}} + W_2 M_{\mathrm{receive}} + W_3 M_{\mathrm{multi}} + W_4 M_{\mathrm{interaction}} \qquad (4)$$

For the Blog relationship, the measurement is shown in formulation 5. $B_{\mathrm{browsing}}$ is the frequency of browsing the Blog of node $i$, $B_{\mathrm{bookmarking}}$ is the frequency of adding a Blog of node $i$ to bookmarks, and $B_{\mathrm{interaction}}$ is the frequency of interaction in the Blog of node $i$, such as responding to a Blog entry.

$$B_i = W_1 B_{\mathrm{browsing}} + W_2 B_{\mathrm{bookmarking}} + W_3 B_{\mathrm{interaction}} \qquad (5)$$

The formulations above are helpful in most cases for measuring the relationships of the nodes to a specific node in a social network. The system can then use the analysis results to generate and visualize the social network.

V. CONCLUSION AND FUTURE RESEARCH

Social and communication data are very common in our daily life; however, these data have not been used well for decision making. In this paper, we first provide an overview of the characteristics of these data and illustrate how to use the concepts of social networking and web mining to analyze the data. A system architecture is then provided to give the reader a picture of how to use multi-source social data to generate dynamic and task-oriented social networks and thereby assist decision making. More detailed processes for data collection, data extraction and relationship measurement in the system are also provided.

In the future, we will try to move our research focus to the second part (online processing) of the system and to implement the decision support system. Furthermore, we will try to study how to use the techniques of web mining to get better analysis
results and to enhance the accuracy of the decision support system. In addition, we will also try to identify more social data sources that would be useful to include in the system and in our future research.

ACKNOWLEDGMENT

This work is partially supported by a NSC research grant, TAIWAN (NSC 97-2410-H-390-022).

REFERENCES

[1] Adamic, L. A., and Adar, E. (2007) Friends and Neighbors on the Web. Social Networks, Vol. 25, 2007, pp. 211-230.
[2] Agrawal, R., Rajagopalan, S., Srikant, R., and Xu, Y. (2003) Mining Newsgroup Using Networks Arising From Social Behavior. In Proceedings of World Wide Web 2003 Conference, Budapest, Hungary, pp. 529-535.
[3] Bird, C., Gourley, A., Devanbu, P., Gertz, M., and Swaminathan, A. (2006) Mining Email Social Networks. In Proceedings of MSR 2006, May 22-23, 2006, Shanghai, China.
[4] Borgatti, S. P., and Everett, M. G. (2002) Ucinet for Windows: Software for Social Network Analysis. Harvard: Analytic Technologies.
[5] Burt, R. S. (1992) Structural Holes. Harvard University Press, Cambridge, MA.
[6] Cooley, R., Mobasher, B., and Srivastava, J. (1997) Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 558-567, Newport Beach, CA, USA.
[7] Cross, R., and Parker, A. (2004) The Hidden Power of Social Networks. Harvard University Press.
[8] Chin, A., and Chignell, M. (2006) Finding Evidence of Community from Blogging Co-Citations: A Social Network Analytic Approach. In Proceedings of the IADIS International Conference on Web Based Communities 2006, San Sebastian, Spain, February 26-28, 2006.
[9] Churchill, E. F., and Halverson, C. A. (2005) Social Networks and Social Networking. IEEE Internet Computing, September/October 2005, pp. 14-19.
[10] Ding, C. H. Q., Zha, H., Husbands, P., and Simon, H. D. (2004) Link Analysis: Hubs and Authorities on the World Wide Web. SIAM Review, Vol. 46, No. 2, pp. 256-268.
[11] Freeman, L. (1979) Centrality in Social Networks: Conceptual Clarification. Social Networks.
[12] Fu, F., Chen, X., Liu, L., and Wang, L. (2007) Social Dilemmas in An Online Social Network: The Structure and Evolution of Cooperation. Physics Letters A, Vol. 371, 2007, pp. 58-64.
[13] Fu, F., Liu, L., and Wang, L. (2008) Empirical Analysis of Online Social Networks in the Age of Web 2.0. Physics Letters A, Vol. 387, 2008, pp. 675-684.
[14] Furukawa, T., Matsuo, Y., Ohmukai, I., Uchiyama, K., and Ishizuka, M. (2007) Social Networks and Reading Behavior in the Blogosphere. In Proceedings of ICWSM 2007, Boulder, Colorado, USA, pp. 51-58.
[15] Garton, L., Haythornthwaite, C., and Wellman, B. (1997) Studying Online Social Networks. Journal of Computer Mediated Communication (3:1).
[16] Godbole, N., Srinivasaiah, M., and Skiena, S. (2007) Large-Scale Sentiment Analysis for News and Blogs. In Proceedings of ICWSM 2007, Boulder, Colorado, USA.
[17] Jin, Y. Z., Matsuo, Y., and Ishizuka, M. (2007) Extracting Social Networks among Various Entities on the Web. In Proceedings of the Fourth European Semantic Web Conference, 2007.
[18] Kumar, R., Novak, J., and Tomkins, A. (2006) Structure and Evolution of Online Social Networks. In Proceedings of KDD 2006 Conference, August 20-23, 2006, Philadelphia, Pennsylvania, USA, pp. 611-617.
[19] Lento, T., Welser, H. T., Gu, L., and Smith, M. (2006) The Ties that Blog: Examining the Relationship Between Social Ties and Continued Participation in the Wallop Weblogging System. In Proceedings of the 15th International World Wide Web Conference, May 23-26, 2006, Edinburgh, Scotland.
[20] Mika, P. (2005) Flink: Semantic Web Technology for the Extraction and Analysis of Social Networks. Web Semantics 3(2-3), pp. 211-223.
[21] Matsuo, Y., Mori, J., and Hamasaki, M. (2006) POLYPHONET: An Advanced Social Network Extraction System from the Web. In Proceedings of 2006 Internet World Wide Web Conference, May 23-26, Edinburgh, Scotland.
[22] Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P., and Bhattacharjee, B. (2007) Measurement and Analysis of Online Social Networks. In Proceedings of 2007 Internet Measurement Conference, October 24-26, 2007, San Diego, California, USA, pp. 29-42.
[23] Mitchell, J. C. (1969) Social Networks and Urban Situations. Manchester University Press, England.
[24] Mutton, P. (2004) Inferring and Visualizing Social Networks on Internet Relay Chat. In Proceedings of the Eighth International Conference on Information Visualization, pp. 25-43.
[25] Pierrakos, D., Paliouras, G., Papatheodorou, C., and Spyropoulos, C. D. (2003) Web Usage Mining As A Tool for Personalization: A Survey. User Modelling and User Adapted Interaction 13(4), pp. 311-372.
[26] Sarkar, P., and Moore, A. W. Dynamic Social Network Analysis Using Latent Space Models. SIGKDD Explorations, Vol. 7, Issue 2, pp. 31-40.
[27] Scott, J. (2000) Social Network Analysis: A Hand Book (2nd ed.). SAGE publication, 2000.
[28] Scott, J. (2002) Social Network Analysis: Critical Concepts in Sociology. Routledge, New York, USA.
[29] Tang, J., Zhang, D., and Yao, L. (2007) Social Networking Extraction of Academic Researchers. In Proceedings of the Seventh IEEE International Conference on Data Mining, pp. 292-301.
[30] Ting, I. H. (2008) Web Mining Techniques for On-line Social Networks Analysis. In Proceedings of the 5th International Conference on Service Systems and Service Management, Melbourne, Australia, 30 June-2 July 2008, pp. 696-700.
[31] Turoff, M., Hiltz, S. R., Cho, H. K., Li, Z., and Wang, Y. (2002) Social Decision Support Systems (SDSS). In Proceedings of the 35th Hawaii International Conference on System Sciences, pp. 1-10.
[32] Wasserman, S., and Faust, K. (2003) Social Network Analysis: Method and Applications. Cambridge University Press, Great Britain, 2003.
[33] Wellman, B., and Berkowitz, S. D. (ed.) (1988) Social Structures: A Network Approach. Cambridge University Press, pp. 19-61.
[34] Wasserman, S., and Faust, K. (1994) Social Network Analysis: Methods and Applications. New York: Cambridge University Press.
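
As a supplementary illustration of formulations (1), (2), (3) and (5) in Section IV.C, the weighted relationship score for one node can be sketched as follows. The paper does not report concrete weight values, so every weight and frequency count below is a hypothetical placeholder, and the function names are chosen here for illustration only.

```python
def weighted_sum(weights, counts):
    """Sum of weight * count pairs; both sequences must have the same length."""
    if len(weights) != len(counts):
        raise ValueError("weights and counts must have the same length")
    return sum(w * c for w, c in zip(weights, counts))


def email_relationship(send, receive, forward, cc, co_receiver, weights):
    """Formulation (2): E_i from five email frequency counts."""
    return weighted_sum(weights, (send, receive, forward, cc, co_receiver))


def msn_relationship(send, receive, multi, weights):
    """Formulation (3): M_i from the three MSN counts used in the paper."""
    return weighted_sum(weights, (send, receive, multi))


def blog_relationship(browsing, bookmarking, interaction, weights):
    """Formulation (5): B_i from three blog frequency counts."""
    return weighted_sum(weights, (browsing, bookmarking, interaction))


def relationship(e_i, m_i, b_i, weights):
    """Formulation (1): R_i as a weighted sum of E_i, M_i and B_i."""
    return weighted_sum(weights, (e_i, m_i, b_i))


# Hypothetical counts and weights for a single node i (not data from the paper).
e_i = email_relationship(3, 2, 1, 0, 4, weights=(1, 1, 1, 1, 1))
m_i = msn_relationship(5, 5, 1, weights=(1, 1, 1))
b_i = blog_relationship(2, 0, 1, weights=(1, 1, 1))
r_i = relationship(e_i, m_i, b_i, weights=(2, 1, 1))
print(e_i, m_i, b_i, r_i)
```

A user parameter such as "relationship > 2" could then be applied as a simple threshold on the resulting score, although how the system actually applies it is not specified in the paper.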
