
Topic Models from Twitter Hashtags

December 9, 2013

Method
Social network data reflects homogenization. It has been shown that users act much like information transmitters when there is hot news that they consider of interest to their audience. Events that capture public attention are more likely to be spread, which increases their popularity and scope. A lateral effect of this phenomenon are trending topics, which represent events (news, people, happenings) that people discuss intensively at a certain time. When this happens, the mass of information related to the topic becomes concentrated in time periods ranging from hours to days, as a function of the impact such events have on the audience, and besides retransmitting the hot information that originates the discussion, people share opinions and comments about it. These discussions rarely involve just one point of view, and it is common for the thematic content to evolve over time, giving origin to new topics, merging existing ones, or simply vanishing.

We are concerned with capturing topic models from Twitter data, so our approach consists in using relations between words as features that persist for longer, in order to avoid or reduce the decay of the models, allowing us to identify relevant new messages accurately for more time, even when there are thematic variations in the stream of incoming data generated after the model was trained. Our proposal is to use Latent Associations, which represent high-order non-linear relations between the words that occur in our short documents. Using Latent Associations as features, our models can capture patterns observed in the training data as pattern environments, that is, families of variations of the identified pattern augmented with noise or subject to corruption (variation).

Two hypotheses sustain our procedure: that such relations can be obtained, and that they allow us to better represent the topic. To demonstrate the former we use an implementation based on a connectionist model, the Restricted Boltzmann Machine, which is trained and used to transform count vectors into a new feature space; we then train one-class classifiers and evaluate them in a proposed hashtag stream filtering environment. For the latter we analyse and compare the decay of our classifiers and relate their performance to a proposed measure that captures the thematic (lexical) degree of variation in a stream. Details of each procedure are described in the following subsections.
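As a rough illustrative sketch of the pipeline just described (count vectors transformed by an RBM, then a one-class classifier trained on the transformed vectors), the following code uses scikit-learn's BernoulliRBM and OneClassSVM as stand-in components. The class choices, hyperparameters, vectorizer, and example messages are assumptions for illustration, not the exact implementation used in this work.

```python
# Sketch of the proposed pipeline:
# count vectors -> RBM hidden activations ("Latent Associations") -> one-class filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import OneClassSVM

# Hypothetical training messages socially labelled with the target hashtag.
topic_messages = ["message text ...", "another message ..."]

vectorizer = CountVectorizer(binary=True)          # presence/absence of words
X = vectorizer.fit_transform(topic_messages)

rbm = BernoulliRBM(n_components=100, learning_rate=0.05, n_iter=20)
H = rbm.fit_transform(X)                           # latent-association features

clf = OneClassSVM(nu=0.1, gamma="scale").fit(H)    # one-class topic filter

def is_relevant(message):
    """Filter an incoming message: project it and ask the one-class classifier."""
    h = rbm.transform(vectorizer.transform([message]))
    return clf.predict(h)[0] == 1
```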


Latent Associations

Traditionally a topic, as defined in techniques like latent semantic indexing or its generalization, latent Dirichlet allocation, consists of a probability distribution over terms. The words in that distribution are grouped together because they appear in similar contexts a number of times that is statistically significant. When describing which terms form a topic, words with probabilities over a threshold are listed. It can be said that these models consider only a few linear co-occurrence relations between words, which clearly does not cover other kinds of relations that may exist between topically related words. Latent Associations represent non-linear relations between all the words in the vocabulary, defined recursively as functions of their appearance/absence in each of the possible contexts. In that sense, Latent Associations are more like pattern environments, that is, families of presence/absence patterns in document vectors over the vocabulary, which tolerate variations due to noise and deformation and include versions of the pattern with additional unseen words as well as incomplete patterns; a toy example is sketched below. So we argue that this kind of flexible relation can help to construct new representations for documents that allow us to better identify documents within a topic. Calculating an optimal set of relations of this kind appears to be an intractable problem, due to the large number of parameters that must be optimized, so in our approach an approximate solution is obtained by stochastic training of a connectionist model, which has been shown to have the ability to capture such relations in different problems. We use Restricted Boltzmann Machines trained with Contrastive Divergence to capture these latent associations in the hidden nodes. The next sections specify their principles and training.
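To make the notion of a pattern environment concrete, the following toy sketch (with an invented vocabulary and invented documents, not data from this work) shows the presence/absence vectors that such a family must tolerate.

```python
# Illustrative only: presence/absence document vectors over a fixed vocabulary,
# and two noisy variants of the same underlying word pattern.
vocabulary = ["election", "vote", "president", "goal", "match", "debate"]

def to_binary_vector(doc):
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

pattern   = to_binary_vector("election vote president debate")
variant_a = to_binary_vector("election vote debate")            # incomplete pattern
variant_b = to_binary_vector("election vote president match")   # pattern plus noise

# A latent association should respond to the whole family
# {pattern, variant_a, variant_b}, not only to the exact pattern;
# a single pairwise co-occurrence count cannot do this.
```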

Restricted Boltzmann Machines


Restricted Boltzmann Machines (RBMs) have received a lot of attention for their use as basic modular processing blocks in deep architectures, thanks to a set of algorithms discovered in recent years that allow them to be trained efficiently. They are stochastic connectionist models with two layers of processing units: a visible layer that acts as the input, and a hidden layer which is the output of the net and structurally captures the relations between inputs. Figure depicts the classical Restricted Boltzmann Machine architecture.

Restricted Boltzmann Machines are capable of learning the underlying constraints that characterize a domain simply by being shown examples from it. Training modifies the strength of their connections so as to construct an internal generative model that produces examples with the same probability distribution as the training examples. RBMs are composed of computing elements called units that are connected from one layer to the other by bidirectional links. A unit is always in one of two states, on or off, and it adopts these states as a probabilistic function of the states of the units in the layer it is connected to and the weights on its respective links. Weights can take real values of either sign. A unit being on or off is taken to mean that the system currently accepts or rejects some elemental hypothesis about the domain. The weight of a link represents a weak pairwise constraint between two hypotheses: a positive weight indicates that the two tend to support each other, so if one is currently accepted, accepting the other should be more likely; a negative weight suggests that the two hypotheses should not both be accepted. Each global state of the net can be assigned a single number called the energy of that state. With the right configuration, the effect of individual states can be made to act towards minimizing the global energy. If some units are externally forced into particular states to represent a particular input, the system will find the minimum-energy configuration that is compatible with that input. The energy of a configuration can be interpreted as the extent to which the combination of hypotheses violates the constraints implicit in the problem domain [Hinton].

If we feed the visible units with vectors of presence/absence of words in documents and interpret the activation of the hidden units as active relations between all those words, then after properly training an RBM its hidden units implicitly express the Latent Associations present in the training examples. When trained with documents that are topically or semantically related, these associations manifest the implicit high-order relations between terms. Each vector is projected onto the new representational space in order to obtain its latent-association representation. Our approach consists in training RBMs that capture relations between terms in a topic collection, transforming the documents into a Latent Associations space, and using them to perform the tasks.
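Since the text above describes the energy of a joint configuration and training by Contrastive Divergence, a compact numpy sketch may help. The binary-unit energy E(v, h) = -a·v - b·h - v·W·h and the CD-1 update shown here are the standard formulations [Hinton]; the variable names and hyperparameters are illustrative choices, not the exact settings used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary-binary RBM with energy E(v, h) = -a.v - b.h - v.W.h."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.a)

    def cd1_update(self, v0, lr=0.05):
        """One Contrastive Divergence (CD-1) step on a batch of binary rows v0."""
        ph0 = self.hidden_probs(v0)                       # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
        v1 = self.visible_probs(h0)                       # one reconstruction step
        ph1 = self.hidden_probs(v1)                       # negative phase
        n = v0.shape[0]
        self.W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (ph0 - ph1).mean(axis=0)

# After training, hidden_probs(v) gives a document's latent-association features.
```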

Hashtag Stream Filtering Environment


Evaluating topic models is difficult because clearly determining whether words are relevant or not depends, in most cases, on contextual and subjective criteria. Evaluating models that rely on implicit high-order associations between terms, which cannot be visualized or enumerated, is also hard, so we propose an evaluation scheme that indirectly measures the power of our models through a one-class classifier which, in turn, filters messages relevant to a topic model trained from socially labelled messages at a certain time. As mentioned before, some of the features that make it difficult to capture and use topic models in socially generated streams of data are related to the fast life cycle that topics have in such environments. Authors like [Leskov] have identified different patterns; in most of them a certain common behaviour is recognizable, such as the rise-sustain-decay components that some topics have clearly defined. While topics related to events/news have short life cycles spanning from hours to days, topics related to more stable entities like people, places, and organizations exhibit longer durations, with the discourse around them constantly changing, being updated with more recent information while abandoning lapsed past words. One of the hypotheses about Latent Associations concerns that ephemeral nature of the context of a topic at a certain time. We ask whether it is possible for relations between words to capture evidence about the changing direction a topic follows at a certain time. If this is the case, our model should be able to identify relevant messages with acceptable accuracy over longer time spans; if it is not, the proposed model will perform similarly to models that do not consider changing information. To probe these ideas the filtering environment was set up. Let M...
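A minimal sketch of how such a filtering evaluation could be organized: train on an early window of hashtag-labelled messages, then measure filtering accuracy on successive later windows to observe decay. The window length, the per-window accuracy metric, and the `is_relevant` predicate (for example, the RBM plus one-class classifier sketched earlier) are hypothetical choices for illustration, not the exact protocol of this work.

```python
from datetime import timedelta

def evaluate_decay(is_relevant, stream, window=timedelta(days=1), n_windows=7):
    """is_relevant: predicate text -> bool, e.g. the filter sketched earlier.
    stream: list of (timestamp, text, on_topic) tuples sorted by timestamp;
    the filter is assumed to have been trained on data preceding stream[0]."""
    t0 = stream[0][0]
    accuracies = []
    for k in range(n_windows):
        lo, hi = t0 + k * window, t0 + (k + 1) * window
        batch = [(text, on_topic) for t, text, on_topic in stream if lo <= t < hi]
        if not batch:
            accuracies.append(None)  # no messages fell in this window
            continue
        hits = sum(is_relevant(text) == on_topic for text, on_topic in batch)
        accuracies.append(hits / len(batch))
    return accuracies  # one accuracy per window: the decay curve of the filter
```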

Stream Broadness Index
