December 9, 2013
Method
Social network data reflects homogenization. It has been shown that users act much like information transmitters when hot news arises that they consider of interest to their audience. Events that capture public attention are more likely to be spread, which increases their popularity and scope. A side effect of this phenomenon are trending topics, which represent events (news, people, happenings) that people discuss intensively during a certain period. When this happens, the mass of information related to the topic concentrates in time windows ranging from hours to days, as a function of the impact such events have on the audience, and besides retransmitting the hot information that originates the discussion, people share opinions or comments about it. These discussions rarely involve just one point of view, and it is common for the thematic content to evolve over time, giving rise to new topics, merging existing ones, or simply vanishing.

We are concerned with capturing topic models from Twitter data, so our approach consists in using relations between words as features that persist for longer, in order to avoid or reduce the decay of the models while allowing us to use them to identify relevant new messages accurately for more time, even when there are thematic variations in the stream of incoming data generated after the model is trained. Our proposal is to use Latent Associations, which represent high-order non-linear relations between the words that occur in our short documents. Using Latent Associations as features, our models can capture patterns observed in the training data as pattern environments, that is, families of variations of the identified pattern, including versions with added noise or subject to corruption (variation). Two hypotheses sustain our procedure: that such relations can be obtained, and that they allow us to better represent the topic.
To demonstrate the former, we use an implementation based on a connectionist model, the Restricted Boltzmann Machine, which is trained and used to transform counting vectors into a new feature space. We then train one-class classifiers and evaluate them in a proposed hashtag-stream filtering environment. For the second hypothesis, we analyse and compare the decay of our classifiers and relate their performance to a proposed measure that captures the degree of thematic (lexical) variation in a stream. Details of each procedure are described in the next subsection.
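The evaluation pipeline described above can be sketched with off-the-shelf components. This is only an illustrative stand-in, not the paper's implementation: scikit-learn's BernoulliRBM plays the role of the connectionist feature extractor, OneClassSVM the one-class classifier, and the toy tweets and hashtag topic are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import OneClassSVM

# Hypothetical training messages gathered for one hashtag topic.
on_topic = [
    "new phone launch event tonight",
    "phone launch livestream starts now",
    "watching the big phone launch event",
]
# Hypothetical incoming stream: one related message, one unrelated.
incoming = [
    "phone event stream is live",
    "my cat is sleeping on the couch",
]

# Binary bag-of-words (counting) vectors over the training vocabulary.
vec = CountVectorizer(binary=True)
V = vec.fit_transform(on_topic).toarray()

# Transform the counting vectors into the latent feature space.
rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=200,
                   random_state=0)
H = rbm.fit_transform(V)

# One-class classifier trained only on the topic's feature vectors.
clf = OneClassSVM(nu=0.5, gamma="scale").fit(H)

# Filter the stream: +1 = accepted as on-topic, -1 = rejected.
preds = clf.predict(rbm.transform(vec.transform(incoming).toarray()))
print(preds)
```

A decay analysis would repeat the last step over successive time windows of the stream and track how acceptance accuracy degrades as the topic's vocabulary drifts.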
0.1 Latent Associations
Traditionally, a topic, as defined in techniques like latent semantic indexing or its generalization latent Dirichlet allocation, consists of a probability distribution over terms. The words in that distribution are grouped together because they appear in similar contexts a number of times that is statistically significant. When describing which terms form a topic, the words with probabilities above a threshold are listed. It can be said that these models consider only a few linear co-occurrence relations between words, which clearly does not cover other kinds of relations that may exist between topically related words.

Latent associations represent non-linear relations between all the words in the vocabulary, defined recursively as functions of their appearance/absence in each of the possible contexts. In that sense, latent associations are more like pattern environments, that is, families of presence/absence patterns in document vectors over the vocabulary, which tolerate variations due to noise and deformation and include versions of the pattern with additional unseen words as well as incomplete patterns. So we argue that this kind of flexible relation can help construct a new representation for documents that allows us to better identify documents within a topic. Computing an optimal set of relations of this kind appears to be an intractable problem, due to the large number of parameters that must be optimized, so in our approach an approximate solution is obtained by stochastic training of a connectionist model, which has shown the ability to capture such relations in different problems. We use Restricted Boltzmann Machines trained with Contrastive Divergence to capture these latent associations in the hidden nodes. The next sections specify their principles and training.
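As a concrete illustration of the training referred to above, the following is a minimal CD-1 sketch for a binary RBM over a toy vocabulary. It is not the paper's implementation; the batch size, learning rate, and dimensions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b, c, lr=0.1):
    """One Contrastive Divergence (CD-1) step on a batch of binary
    document vectors V (rows = documents over the vocabulary)."""
    # Positive phase: hidden activations given the data.
    ph = sigmoid(V @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs step back to a reconstruction.
    pv = sigmoid(h @ W.T + b)
    ph2 = sigmoid(pv @ W + c)
    # Gradient estimate: data correlations minus model correlations.
    W += lr * (V.T @ ph - pv.T @ ph2) / V.shape[0]
    b += lr * (V - pv).mean(axis=0)
    c += lr * (ph - ph2).mean(axis=0)
    return W, b, c

# Toy run: 6 "documents" over a 10-word vocabulary, 4 hidden units.
V = (rng.random((6, 10)) < 0.3).astype(float)
W = 0.01 * rng.standard_normal((10, 4))
b = np.zeros(10)   # visible (word) biases
c = np.zeros(4)    # hidden biases
for _ in range(100):
    W, b, c = cd1_update(V, W, b, c)

# Hidden activations serve as the latent-association features.
features = sigmoid(V @ W + c)
print(features.shape)  # (6, 4)
```

The hidden units behave as the pattern environments described above: a unit responds to a family of presence/absence patterns, so documents with noisy or partial versions of a pattern still map to similar feature vectors.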