
Detecting and Searching System for Event on Internet Blog Data Using Cluster Mining Algorithm

Robin Singh Bhadoria1, Manish Dixit2, Rohit Bansal3, and Abhishek Singh Chauhan4
1 Dept of Computer Science & Engineering, IITM, Gwalior (MP), ssr_robin@yahoo.co.in
2 Dept of Computer Science & Engineering, MITS, Gwalior (MP), dixitmits@gmail.com
3 NIIST, Bhopal (MP), rohitbansal.cse@gmail.com
4 SATI, Vidisha (MP), abhichauhan78@gmail.com

Abstract. The popularity of the Internet is growing every day, with exponential growth in the information published on it. Apart from static content, dynamic content on the Web is also growing at an increasing rate thanks to blogs, news forums and the like. Users of such blogs and forums write about their personal lives, their professional lives and events happening in the real world, such as a cricket match, elections, a product release or disasters. The number of blog entries published on an event is proportional to its popularity. Using this as the basis, we designed a system called EventDS (Event Detection and Searching), which detects major events by analyzing blogs with a novel clustering algorithm called PDDPHAC. We also propose a new representation for events: each event is represented as a Topic Tree in which sub-topics are treated as children of their super-topics.

1 Introduction
With today's advances in technology, the Internet is easily available to most people. Apart from all the other activities that take place on the Internet, a lot of people express their feelings and views and share their knowledge over it. They use blogs, social networking sites and discussion forums to express and discuss their opinions, as these are available to everyone for free and provide unlimited space. People no longer have to host their own websites to express something; they can use these media instead, and hence the information published on the Internet through them is growing at a very fast rate. It has been reported that more than 112 million blogs exist today, that more than 175,000 new blogs are created every day, and that more than 1.6 million posts are created per day [23].

2 Algorithm for System


2.1 Clustering Algorithms
In this paper, we study and use two well-known clustering algorithms: PDDP (Principal Direction Divisive Partition) and HAC (Hierarchical Agglomerative Clustering). Both are hierarchical clustering algorithms, but PDDP works top-down while HAC works bottom-up. We discuss them in detail in the following subsections.

S.C. Satapathy et al. (Eds.): Proceedings of the InConINDIA 2012, AISC 132, pp. 83–91. springerlink.com © Springer-Verlag Berlin Heidelberg 2012

2.1.1 Principal Direction Divisive Partition (PDDP)
PDDP is a hierarchical divisive partitioning algorithm introduced by Boley [5]. Being a top-down algorithm, it first treats the entire document set as a single cluster and splits it into two clusters. It then repeatedly selects the cluster with maximum distortion and splits it into two, until a terminating condition is reached. PDDP thus forms a binary tree of clusters in which the root contains all documents and the leaf nodes are the final output clusters.

2.1.2 Hierarchical Agglomerative Clustering (HAC)
HAC is a hierarchical bottom-up clustering algorithm [24] and one of the most widely studied, because it is very intuitive. It first treats every document in the document set as its own cluster and then merges the two most similar clusters in each iteration. It uses a document-document distance matrix to find the two most similar clusters and updates the matrix after each merge. How the matrix is updated depends on the linkage. The most well-known and widely used linkages are discussed below.
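To make the role of the linkage concrete, the following minimal Python sketch computes the cluster-to-cluster distance under three linkage criteria that are standard in the literature: single, complete and average. Whether these are exactly the linkages the authors have in mind is an assumption, and the Euclidean distance and the toy two-dimensional points are illustrative only (the paper works on TF-IDF document vectors).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(c1, c2, dist=euclidean):
    """Distance between the closest pair of members of the two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)


def complete_linkage(c1, c2, dist=euclidean):
    """Distance between the farthest pair of members of the two clusters."""
    return max(dist(a, b) for a in c1 for b in c2)


def average_linkage(c1, c2, dist=euclidean):
    """Mean distance over all cross-cluster pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


# Toy clusters (hypothetical data, not from the paper's dataset).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
```

The choice of linkage only changes how the distance between two clusters is derived from the pairwise document distances; the merge loop of HAC stays the same.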

3 Event Representation, Tracking and Updater


Event identification is now straightforward: we consider clusters that contain more than EventTh documents to be event clusters. Although we designed PDDPHAC to produce better event detection results than PDDP and HAC, its performance may depend to some extent on the performance of the individual algorithms. In particular, its results can sometimes be affected by the threshold that HAC uses to combine the clusters returned by PDDP. To obtain reliable results, we use a tight threshold for HAC, which yields compact clusters, so some events may not be detected.

3.1 Event Representation
The main goal of representing an event as a topic tree is to capture the different topics discussed as part of the event. We capture them as follows. First, we treat the documents of an event as an entire dataset and recompute the TF-IDF scores of their terms. We then apply PDDPHAC to this event document set to obtain the topic clusters, and use these clusters to construct the topic tree of the event.

3.1.1 Topic Tree Structure
The topic tree for an event is constructed according to the following rule: a node's children are its subtopics. The root of the tree contains information about all the documents of the event, and each child of the root represents one of its subtopics. Not all subtopics of a node are equally important: some contain a larger number of documents than others, and we call these prominent topics. Prominent topics can contain subtopics of their own, so we apply PDDPHAC to them again to obtain those subtopics. We proceed in this way until there are no prominent topics left.
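The recursive construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `TopicNode`, `build_topic_tree`, the `prominent_th` parameter and `toy_cluster_fn` are hypothetical names, and the toy clustering function (grouping strings by their next character) merely stands in for PDDPHAC.

```python
import os
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicNode:
    docs: List[str]                                   # documents covered by this topic
    children: List["TopicNode"] = field(default_factory=list)


def build_topic_tree(docs, cluster_fn, prominent_th):
    """Cluster the documents of a node; expand only prominent clusters further."""
    node = TopicNode(docs=list(docs))
    clusters = [c for c in cluster_fn(docs) if c]     # ignore empty clusters
    if len(clusters) < 2:
        return node                                   # nothing to split: leaf topic
    for cluster in clusters:
        if len(cluster) > prominent_th:               # prominent topic: recurse
            node.children.append(build_topic_tree(cluster, cluster_fn, prominent_th))
        else:
            node.children.append(TopicNode(docs=list(cluster)))
    return node


def toy_cluster_fn(docs):
    """Stand-in for PDDPHAC: group strings by the first character after
    their common prefix."""
    k = len(os.path.commonprefix(list(docs)))
    groups = {}
    for d in docs:
        groups.setdefault(d[k:k + 1], []).append(d)
    return list(groups.values())


tree = build_topic_tree(["aa1", "aa2", "ab1", "ab2", "b1"], toy_cluster_fn, prominent_th=2)
```

In this toy run the root has two children; only the four-document child exceeds the threshold and is split again, so the recursion stops once no prominent topic remains.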

Fig. 1. Topic Tree

Figure 1 shows a sample topic tree. T0 is the root of the tree, so it contains all the information about the event. T1, T2 and T3 are subtopics of T0 and are therefore more specific than T0. Further, T4 and T5 are subtopics of T1, and T6, T7 and T8 are subtopics of T2. The leaf nodes T4, T5, T6, T7 and T8 are the most specific topics in the topic tree.

3.2 Event Tracking
Event tracking means identifying newly written blogs that are related to already detected events. Event tracking can be done in the following ways:

3.3 Event Updater
We now describe the possible update scenarios. In all cases we make sure that the key property of the topic tree mentioned earlier is not violated.

1. Updating a very similar topic: This scenario arises when we update a topic node with newly posted blogs related to that topic. Sometimes the newly tracked blogs contain very similar information; in this case creating a new topic node is not necessary, and we simply update the existing topic node. Figure 2 illustrates this case.

Fig. 2. Updating a very similar topic

In Figure 2, ST is a similar topic, NT is a new topic and UT is an updated topic. The keyword set is updated by recomputing the overall principal direction vector and ranking the keywords according to it. The link set is also updated by taking the top l blog posts from the link sets of ST and NT.


2. Updating a leaf topic: This case occurs when we have tracked sufficient documents related to a leaf topic node of the topic tree and the new topic created from these documents does not satisfy the previous update condition. We create a new topic node from the tracked documents and update the corresponding related node as shown in Figure 3, where LT is a leaf topic, NT is a new topic and NPT is a new parent topic. The reason for creating a new node here is to preserve the key property of the topic tree: if we made NT the single child of LT, this property would be violated, as discussed earlier. NT is also not very similar to LT; if it were, it would have fallen under the first case. So in this case we update the topic tree by creating a new parent node NPT of both LT and NT, and the keyword and link sets of NPT are computed by combining the keyword and link sets of LT and NT respectively.

3. Updating a topic node other than a leaf node: This case occurs when neither of the above update conditions is satisfied. Here we have to update a node that has at least 2 children. This update is performed as shown in Figure 4.

Fig. 3. Updating a leaf topic

Fig. 4. Updating a topic node that has children
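The three update scenarios can be summarized in a short Python sketch. Several parts are assumptions made only for illustration: the `Topic` class, the Jaccard similarity on keyword sets, the `similar_th` and `top_l` parameters, and the representation of links as hypothetical `(score, url)` pairs. The keyword update is simplified to a set union instead of the principal-direction ranking described in the text, and case 3 is modelled as attaching NT as an additional child, since the text defers the details of that case to Figure 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(eq=False)  # identity-based equality avoids parent/child comparison cycles
class Topic:
    keywords: Set[str]
    links: List[Tuple[float, str]]            # hypothetical (score, url) pairs
    children: List["Topic"] = field(default_factory=list)
    parent: Optional["Topic"] = None


def jaccard(a, b):
    """Keyword-set overlap, used here as a stand-in similarity measure."""
    union = a.keywords | b.keywords
    return len(a.keywords & b.keywords) / len(union) if union else 0.0


def merged_links(a, b, top_l):
    """Top-l links taken from both link sets, highest score first."""
    return sorted(set(a.links) | set(b.links), reverse=True)[:top_l]


def update_topic(target, new_topic, similar_th=0.8, top_l=3):
    """Apply one update; return the node that now occupies target's position."""
    if jaccard(target, new_topic) >= similar_th:
        # Case 1: very similar topic, update the existing node in place.
        target.keywords |= new_topic.keywords
        target.links = merged_links(target, new_topic, top_l)
        return target
    if not target.children:
        # Case 2: leaf topic LT, create a new parent NPT of both LT and NT.
        npt = Topic(keywords=target.keywords | new_topic.keywords,
                    links=merged_links(target, new_topic, top_l),
                    children=[target, new_topic], parent=target.parent)
        if target.parent is not None:
            siblings = target.parent.children
            siblings[siblings.index(target)] = npt
        target.parent = new_topic.parent = npt
        return npt
    # Case 3 (assumed behaviour): internal node, attach NT as another child.
    target.children.append(new_topic)
    new_topic.parent = target
    return target
```

In case 2 the new parent takes LT's place among its siblings, so LT never ends up with NT as a single child. Refreshing the keyword and link sets of ancestors after a case 3 update is left out of this sketch.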


4 Result and Analysis


4.1 PDDP Analysis
In this section, we discuss the experiments conducted on PDDP to test its performance on our dataset. Table 1 presents the results produced by PDDP using CSV as the termination condition.
Table 1. PDDP clustering results

Parameters                    Values
Actual Events                 5
Detected Events               1
Number of clusters            12
Purity of clusters            90.74%
Purity of event clusters      100%
Execution time (in seconds)   0.0005
Clustered Documents           303
Remaining Documents           2197
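For readers unfamiliar with the purity measure reported in the table, the sketch below computes it in the way it is commonly defined in the information retrieval literature: each cluster is credited with its majority class, and the credited documents are divided by the total number of clustered documents. That the authors compute purity exactly this way is an assumption, and the labels in the example are made up.

```python
from collections import Counter


def purity(clusters):
    """clusters: list of clusters, each a list of true class labels.
    Returns the fraction of documents that belong to the majority class
    of their cluster."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total


# Hypothetical clustering of 8 documents from three event classes.
example = [
    ["quake", "quake", "quake", "budget"],   # majority: quake (3 of 4)
    ["budget", "budget"],                    # majority: budget (2 of 2)
    ["movie", "quake"],                      # majority: 1 of 2
]
example_purity = purity(example)             # (3 + 2 + 1) / 8
```

Under this definition, the purity of event clusters would be obtained by passing only the clusters identified as events to the same function.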

This experiment was meant to understand how PDDP splits the data. Figure 5 shows how the dataset was partitioned into two clusters after the first iteration. PDDP puts documents with a projection value ≤ 0 in one cluster and the rest in the other cluster. Accordingly, we can see in Figure 5 that PDDP partitioned three events properly, namely Haiti Earthquake, Sachin Double Century and Iceland Volcano, but the other two, Avatar Movie release and Indian Budget 2010, are improperly partitioned. As we discussed earlier, PDDP does not perform well on ill-separated datasets, and the figure shows the same.

Fig. 5. Projection values of the documents in PDDP first iteration
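The splitting step examined in this experiment can be sketched in pure Python. The sketch centers the document vectors, approximates the principal direction by power iteration (a stand-in for the SVD normally used), and assigns each document to a side by the sign of its projection. The stopping rule (a fixed number of clusters instead of the CSV criterion), the use of scatter as the distortion measure, and the toy vectors are assumptions for illustration.

```python
import math
import random


def principal_direction(rows, iterations=100, seed=0):
    """Approximate the leading principal direction of already-centered rows
    by power iteration on A^T A."""
    dim = len(rows[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break                              # no variance left to follow
        v = [x / norm for x in w]
    return v


def pddp_split(docs):
    """Split docs into two index lists by the sign of the projection."""
    n, dim = len(docs), len(docs[0])
    mean = [sum(d[j] for d in docs) / n for j in range(dim)]
    centered = [[d[j] - mean[j] for j in range(dim)] for d in docs]
    v = principal_direction(centered)
    left, right = [], []
    for i, row in enumerate(centered):
        projection = sum(row[j] * v[j] for j in range(dim))
        (left if projection <= 0 else right).append(i)
    return left, right


def scatter(docs, idx):
    """Sum of squared distances of the selected documents to their centroid."""
    dim = len(docs[0])
    mean = [sum(docs[i][j] for i in idx) / len(idx) for j in range(dim)]
    return sum((docs[i][j] - mean[j]) ** 2 for i in idx for j in range(dim))


def pddp(docs, max_clusters):
    """Repeatedly split the cluster with the largest scatter."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < max_clusters:
        target = max(clusters, key=lambda c: scatter(docs, c))
        left, right = pddp_split([docs[i] for i in target])
        if not left or not right:
            break                              # cluster cannot be split further
        clusters.remove(target)
        clusters.append([target[i] for i in left])
        clusters.append([target[i] for i in right])
    return clusters


# Three well-separated toy groups (hypothetical data).
toy_docs = [[0.0, 0.0], [0.1, 0.0], [4.0, 0.0], [4.1, 0.1], [10.0, 0.0], [10.1, 0.0]]
toy_clusters = pddp(toy_docs, max_clusters=3)
```

Because the split is a single cut along one direction, groups whose projections straddle zero are divided between the two sides, which matches the behaviour observed above on ill-separated data.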


In Table 2, we present details of the same experiment, now showing the number of documents from each event class in each cluster.
Table 2. Data Distribution after first PDDP split

HAC Analysis
In this section, we discuss the experiments that have been conducted to test HAC on our dataset.
Table 3. Results produced by HAC on our test dataset

Parameters                  DT=0.5     DT=0.6     DT=0.7
Actual Events               5          5          5
Detected Events             0          2          1
Number of clusters          1746       1154       531
Purity of clusters          99.32%     85.48%     40.56%
Purity of event clusters    --------   62.05%     21.79%
Execution time (in sec.)    0.1751     0.2420     0.2454
Clustered Documents         0          917        1886
Remaining Documents         2500       1583       614
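The effect of the DT parameter can be illustrated with a small threshold-controlled HAC sketch. It assumes that DT is a distance threshold (merging stops once the closest pair of clusters is farther apart than DT), which is consistent with larger DT values producing fewer clusters in the table, and it assumes cosine distance with average linkage. For brevity it recomputes linkage distances in every iteration instead of maintaining a distance matrix.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def hac(docs, dt):
    """Agglomerative clustering that stops when the closest pair of
    clusters is farther apart than dt. Returns lists of document indices."""
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):  # average linkage (an assumption)
        total = sum(cosine_distance(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > dt:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Toy vectors (hypothetical): two tight pairs and one document in between.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
```

On the toy data, `hac(toy, 0.0)` leaves all five documents unmerged, `hac(toy, 0.05)` merges only the two tight pairs, and `hac(toy, 0.3)` also absorbs the in-between document, mirroring how a looser threshold reduces the number of clusters.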

PDDPHAC Analysis
In the previous sections, we presented the experiments conducted to test the PDDP and HAC clustering algorithms on our test dataset. As we can see, PDDP performed better than HAC on our dataset, but its event detection accuracy was poor.
Table 4. PDDPHAC clustering results
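The text describes two ingredients of PDDPHAC: the clusters returned by PDDP are combined by HAC under a tight threshold, and clusters with more than EventTh documents are reported as events. The sketch below illustrates only that combine-then-filter step. It is not the authors' algorithm: the centroid-based cosine distance and the names `merge_and_detect`, `merge_th` and `event_th` are assumptions.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def merge_and_detect(clusters, merge_th, event_th):
    """Merge the closest clusters while their centroids are within merge_th,
    then keep clusters with more than event_th documents as events."""
    merged = [list(c) for c in clusters if c]

    def gap(pair):
        return cosine_distance(centroid(merged[pair[0]]), centroid(merged[pair[1]]))

    while len(merged) > 1:
        a, b = min(combinations(range(len(merged)), 2), key=gap)
        if gap((a, b)) > merge_th:
            break
        merged[a] = merged[a] + merged[b]   # a < b, so deleting b keeps a valid
        del merged[b]
    events = [c for c in merged if len(c) > event_th]
    return merged, events


# Hypothetical output of a divisive step: one topic split across two clusters.
initial = [[[1.0, 0.0], [0.9, 0.1]], [[0.95, 0.05]], [[0.0, 1.0]]]
merged_clusters, event_clusters = merge_and_detect(initial, merge_th=0.05, event_th=2)
```

With a tight `merge_th`, only clusters that are nearly identical are combined, which keeps clusters compact at the cost of possibly leaving a real event split into pieces that each stay below `event_th`.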


From Table 4, we can see that PDDPHAC performs much better than both PDDP and HAC. Its event detection accuracy is higher than that of the other two, and its clustering time is comparable to that of PDDP and much lower than that of HAC.

4.2 Architecture of EventDS
We have discussed the main parts of EventDS, the Back-end and the Front-end. We now present the architecture of the complete EventDS in Figure 6. The figure shows the architecture of both the Back-end and the Front-end, all the elements of EventDS and their integration, and how the two parts are connected through a MySQL database. Observe that the arrows between the Back-end and the database are bidirectional, i.e. the Back-end reads data from and writes data to the database. The arrows between the Front-end and the database, however, are unidirectional, i.e. the Front-end only reads data from the database and never modifies it. From the figure, we can also observe that the tasks performed by the Back-end are more complex and substantial than those of the Front-end. The Front-end just provides an interface to the functionality provided by the Back-end.

Fig. 6. Architecture of EventDS
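The read/write asymmetry between the two parts can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for the MySQL database of EventDS, and the `events` table, its columns and the function names are hypothetical. The Back-end opens a normal read-write connection, while the Front-end opens the same database in read-only mode so that any write attempt fails.

```python
import os
import sqlite3
import tempfile


def open_backend(path):
    """Back-end connection: may read and write (creates the demo table)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, title TEXT)")
    return conn


def open_frontend(path):
    """Front-end connection: opened read-only via an SQLite URI."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


db_path = os.path.join(tempfile.mkdtemp(), "eventds_demo.db")

backend = open_backend(db_path)
backend.execute("INSERT INTO events (title) VALUES (?)", ("Haiti Earthquake",))
backend.commit()

frontend = open_frontend(db_path)
titles = [row[0] for row in frontend.execute("SELECT title FROM events")]

try:
    frontend.execute("INSERT INTO events (title) VALUES ('not allowed')")
    frontend_can_write = True
except sqlite3.OperationalError:
    frontend_can_write = False       # the read-only connection rejects writes
```

With a MySQL deployment, the same separation would more typically be enforced by giving the Front-end a database account that has only read privileges.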


5 Conclusion
We have studied two well-known clustering algorithms, Principal Direction Divisive Partitioning (PDDP) and Hierarchical Agglomerative Clustering (HAC), and tested their applicability to detecting events from blogs. Their results were not very satisfactory, so we designed a new clustering algorithm, PDDPHAC, by combining the two, and showed experimentally that it outperforms both on our dataset. Using the proposed clustering algorithm, we designed a system called EventDS (Event Detection and Searching), and the results it produced were very satisfactory. We also proposed a new representation for events, the topic tree, and discussed its advantages over the weighted keyword set representation. Finally, we implemented an event tracker for tracking newly written blogs about already detected events and an event updater for updating the events.

References
[1] Aksyonof, A.: Sphinx: free open-source SQL full-text search engine, http://sphinxsearch.com/
[2] Hastie, T., Tibshirani, R., Friedman, J.: Hierarchical clustering. In: The Elements of Statistical Learning, 2nd edn., pp. 520–528. Springer, Heidelberg (2009)
[3] Belmonte, N.G.: JIT: JavaScript InfoVis Toolkit, http://thejit.org/
[4] Boley, D.: Principal Direction Divisive Partitioning. Data Mining and Knowledge Discovery 2, 325–344 (1997)
[5] Boley, D.: Hierarchical Taxonomies using Divisive Partitioning. Technical report (1998)
[6] Boley, D., Borst, V.: Unsupervised Clustering: A Fast Scalable Method for Large Datasets. Technical report (1999)
[7] Thierer, A.: Internet Statistics, http://techliberation.com/2008/05/06/need-help-how-many-blogs-are-there-out-there/
[8] Denton, N.: Weblogs: Blogspot updates provider, http://www.weblogs.com/
[9] Fiscus, J.G., Doddington, G.R.: Topic detection and tracking evaluation overview. In: Topic Detection and Tracking: Event-Based Information Organization, pp. 17–31. Kluwer Academic Publishers (2002)
[10] Manning, C.D., Raghavan, P., Schütze, H.: Evaluation of clustering. In: Introduction to Information Retrieval, pp. 356–360. Cambridge University Press (2008)
[11] Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA: practical automatic keyphrase extraction. In: DL 1999: Proceedings of the Fourth ACM Conference on Digital Libraries, pp. 254–255. ACM, New York (1999)
[12] Seo, Y.-W., Sycara, K.: Text Clustering for Topic Detection. Technical report (2004)
[13] Kruengkrai, C.: Implementation of PDDP Algorithm in JAVA, http://www.tcllab.org/canasai/software/omniclusterer
[14] Kruengkrai, C., Sornlertlamvanich, V., Isahara, H.: Document Clustering Using Linear Partitioning Hyperplanes and Reallocation. In: Myaeng, S.-H., Zhou, M., Wong, K.-F., Zhang, H.-J. (eds.) AIRS 2004. LNCS, vol. 3411, pp. 36–47. Springer, Heidelberg (2005)
[15] Lenz, H.J.: Proximities in Statistics: Similarity and Distance. In: Preferences and Similarities. CISM International Centre for Mechanical Sciences, vol. 504, pp. 161–177. Springer, Vienna (2008)
[16] Zhang, K., Xu, H., Tang, J., Li, J.: Keyword Extraction Using Support Vector Machine. In: Yu, J.X., Kitsuregawa, M., Leong, H.-V. (eds.) WAIM 2006. LNCS, vol. 4016, pp. 85–96. Springer, Heidelberg (2006)
[17] Matsuo, Y., Ishizuka, M.: Keyword Extraction from a Single Document using Word Co-occurrence Statistical Information (2003)
[18] Hastie, T., Tibshirani, R., Friedman, J.: Hierarchical clustering. In: The Elements of Statistical Learning, 2nd edn., pp. 520–528. Springer, Heidelberg (2009)
[19] Palshikar, G.K.: Keyword Extraction from a Single Document Using Centrality Measures. In: Ghosh, A., De, R.K., Pal, S.K. (eds.) PReMI 2007. LNCS, vol. 4815, pp. 503–510. Springer, Heidelberg (2007)
