
Exploring Trends in a Topic-Based Search Engine

Wray Buntine, Jukka Perkiö, Sami Perttu


Complex Systems Computation Group,
Helsinki Institute for Information Technology
P.O. Box 9800, FIN-02015 HUT, Finland.
{firstname}.{lastname}@hiit.fi

Abstract

Topic-based search engines are an alternative to the simple keyword search engines that are common in today's intranets. Trend analysis is an important research goal for many different industries. The temporal behaviour of the topics in a search engine based on a topic model can be used for trend analysis. We apply a topic model to data from an online financial newspaper and show that the resulting topics can be used to explore prevailing trends. Furthermore, we show that these trends are consistent with common understanding.
1 Introduction

The huge amount of information available on the Web and other sources makes information search and retrieval a critical task. Topic models are potentially a more elegant approach than pure keyword search. In topic models a topical representation is used for documents, and thus each document can be described in terms of different topics. This provides a complement to the keyword search and TF-IDF methods of traditional information retrieval [1]; both topics and keywords can be relevant when retrieving a document. Topic structure can be given, as is the case for large directory engines such as Yahoo or Dmoz, or it can be learned from the data. A topic structure can be either hierarchical or flat. Hierarchical structure is more natural, as that is the way people normally construct their view of the world. A single document, rather than being explained by a single topic, is normally explained by several different topics in both the library sciences and the newspaper business. For this reason, standard statistical clustering methods, which perform a mutually exclusive and exhaustive partitioning, are not adequate for developing topical structures automatically. Moreover, this implies that topic models, when not supplied, are statistical in nature, and statistical learning is used for the learning phase.

Queries to a topic model are performed in principle in the following way: given a query, its topical representation is calculated, i.e. its distribution over the topic space. That distribution is compared to the corresponding distributions for the documents, and documents are retrieved and ranked according to the statistical distance between the query and the documents over the topic space. The presence of keywords needs to be folded into the process so that topic and keyword information can be combined during retrieval. In principle topic models should work better than simple word-index searches, as they are supposed to represent the semantics of a document better than single word frequencies. A sketch of such a ranking is given below.
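As an illustration of this retrieval scheme, consider the following minimal sketch. The paper does not fix a particular statistical distance; Jensen-Shannon divergence and all names below are our own illustrative choices.

import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two topic distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_documents(query_topics, doc_topics):
    """query_topics: (K,) topic distribution inferred for the query.
    doc_topics: (N, K) per-document topic distributions.
    Returns document indices ordered from closest to farthest."""
    distances = [js_divergence(query_topics, d) for d in doc_topics]
    return np.argsort(distances)

In a deployed engine this topic-space ranking would be combined with a keyword score, as the text notes, but the combination rule is not specified in the paper.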
Having motivated our interest in topic-based search engines, we now turn to the main subject of this paper. Trend detection and trend analysis are of much interest to many different groups, such as the financial sector, political parties and security services. Note that an extensive body of research exists in the area of Topic Detection and Tracking (TDT, http://www.nist.gov/speech/tests/tdt/) and in the TREC information filtering tasks. These are supervised tracking tasks, however, not the unsupervised tracking we pursue here.

In trend analysis we are interested in the temporal behaviour of a variable or a group of variables. When doing trend analysis, different aspects are of interest. The linear component of the trend tells whether the trend is ascending or descending. The quadratic component tells whether the speed of change of the trend is increasing or decreasing. We are mainly interested in the linear component of the trends, as that is enough to tell whether some things are gaining momentum or fading. The change in the speed of increase or decrease would also be interesting, but at this point we concentrate only on the general direction of the trends; a sketch of reading off these components is given below.
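As a concrete reading of these two components, one can fit low-order polynomials to a topic's strength series and inspect the coefficients. This is a minimal sketch with our own variable names, not the paper's procedure.

import numpy as np

def trend_components(strengths):
    """strengths: 1-D array of a topic's strength per time bin."""
    t = np.arange(len(strengths))
    slope = np.polyfit(t, strengths, deg=1)[0]      # linear component
    curvature = np.polyfit(t, strengths, deg=2)[0]  # quadratic component
    return {
        "direction": "ascending" if slope > 0 else "descending",
        "speed_of_change": "increasing" if curvature > 0 else "decreasing",
    }

# Example: a steadily accelerating topic.
print(trend_components([0.010, 0.012, 0.015, 0.020, 0.028]))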
The rest of this paper is organized as follows. In Section 2 we explain our statistical topic model, which uses MPCA (multinomial principal component analysis) [5, 3]. This is a model that is proving successful in a variety of modes of text analysis [4, 8], and we use it as the method for creating the topics and measuring trends. In Section 3 we explain our data, its preprocessing and our empirical experiments, and finally in Section 4 we draw some conclusions about our approach and discuss future research directions.

2 The topic model

The topic model we use is based on a recent discrete or multinomial version of Principal Components Analysis (PCA). These so-called multi-aspect topic models are statistical models for documents that allow multiple topics to co-exist in one document [5, 2, 4], as is the case in most news-wire collections. They are directly analogous to the Gaussian basis of PCA, which in its form of Latent Semantic Indexing (LSI) has not proven successful in information retrieval.

The simplest version consists of a linear admixture of different multinomials, and can be thought of as a generative model for sampling words to make up a bag, i.e. the bag-of-words representation of a document [1]:
• We have a total count L of words to sample.

• We partition these words into K topics, components or aspects: c1, c2, ..., cK, where c1 + c2 + ... + cK = L. This is done using a hidden proportion vector m = (m1, m2, ..., mK). The intention is that, for instance, a sporting article may have 50 general vocabulary words, 40 words relevant to Germany, 50 relevant to football, and 30 relevant to people's opinions. Thus L = 170 words are in the document and the topic partition is (50, 40, 50, 30).

• In each partition, we then sample words according to the multinomial for the topic, component or aspect. This is the base model for each component. This yields a bag of word counts for the k-th partition, w_k = (w_k1, w_k2, ..., w_kJ). Here J is the dictionary size, the size of the basic multinomials on words. Thus the 50 football words are now sampled into actual dictionary entries: "forward", "kicked", "covered" etc.

• The partitions are then combined additively, hence the term admixture, to make a distinction with classical mixture models. This yields the final sample of words r = (r1, r2, ..., rJ) by totalling the corresponding counts in each partition, rj = w_1j + w_2j + ... + w_Kj. Thus if an instance of "forward" is sampled twice, as a football word and as a general vocabulary word, then we return the count of 2; its actual topical assignments are lost, they are hidden data.

There is a full generative probability model for the bag of words in a document. The hidden or latent variables here are m and w for each document, whereas c is derived. The proportions m correspond to the components for a document, and the counts w are the original word counts broken out into word counts per component. A sketch of this sampling process is given below.
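The sampling process above can be written out directly. The following sketch assumes a Dirichlet prior on the proportions m, which the description does not specify; all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def sample_document(L, topic_word_probs, alpha=0.1):
    """Sample one bag of words with L words from a K-topic admixture.

    topic_word_probs: (K, J) array, each row a multinomial over the dictionary.
    Returns the observed counts r (J,) and the hidden per-topic counts w (K, J)."""
    K, J = topic_word_probs.shape
    # Hidden proportion vector m over the K components (Dirichlet is our assumption).
    m = rng.dirichlet(alpha * np.ones(K))
    # Partition the L words into per-topic counts c, with sum(c) == L.
    c = rng.multinomial(L, m)
    # Sample each partition from its topic's word multinomial.
    w = np.stack([rng.multinomial(c[k], topic_word_probs[k]) for k in range(K)])
    # Observed counts are the totals over topics; the assignments are lost.
    r = w.sum(axis=0)
    return r, w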

There are two computationally viable schemes for learning these models from data: the mean field approach [2, 3] and Gibbs sampling [6, 4]. Gibbs sampling is usually not considered feasible for large problems, but in this application it can be used to hone the results of faster methods, and it is also moderately fast due to the specifics of the model. We have our own implementation of these methods; an illustrative sampler in this style is sketched below.
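For concreteness, here is a minimal collapsed Gibbs sampler in the style of Griffiths and Steyvers [4], written for the closely related latent Dirichlet allocation form of the model [2]. It is an illustration of the general technique, not the authors' implementation; the hyperparameters and names are our own.

import numpy as np

def gibbs_topics(docs, K, J, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in 0..J-1."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, size=len(d)) for d in docs]   # topic assignments
    ndk = np.zeros((len(docs), K))                        # doc-topic counts
    nkw = np.zeros((K, J))                                # topic-word counts
    nk = np.zeros(K)                                      # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + J * beta)
                k = rng.choice(K, p=p / p.sum())          # resample the topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return theta, nkw   # per-document proportions and topic-word counts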
2.1 Temporal topics

If the documents' date information is known, there is a very simple way to track changes in the topics' strength in the model: given the desired resolution, one simply calculates a histogram over the documents. The number of bins in the histogram depends on the resolution. The probabilities are scaled according to the model and according to the number of documents in each bin, so that they sum to unity. Note that each document has some proportion in each topic. These proportions are usually sparse: for instance, a single document might include 5 topics out of the 200 in the model. The proportions are averaged over all documents at a given time point, as sketched below.
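Here is a minimal sketch of this computation under our reading of the normalization; the function and variable names are illustrative.

import numpy as np

def topic_strength_over_time(doc_topic, doc_bin, n_bins):
    """doc_topic: (N, K) per-document topic proportions (rows sum to 1).
    doc_bin: (N,) time-bin index (e.g. month number) of each document.
    Returns an (n_bins, K) array of average topic strengths per bin."""
    doc_topic = np.asarray(doc_topic, dtype=float)
    doc_bin = np.asarray(doc_bin)
    sums = np.zeros((n_bins, doc_topic.shape[1]))
    np.add.at(sums, doc_bin, doc_topic)              # sum proportions per bin
    counts = np.bincount(doc_bin, minlength=n_bins).astype(float)
    nonempty = counts > 0
    sums[nonempty] /= counts[nonempty, None]         # average; rows sum to 1
    return sums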
3 Experiments

In our experiments we used articles from the Finnish financial newspaper Kauppalehti (http://www.kauppalehti.fi/) from the years 1994 to 2003. Kauppalehti is a leading financial news provider in Finland. Although its content is mainly financial, it also has some coverage of non-financial topics. In addition to news articles it contains analyses of stock markets, companies, equities etc. The average number of news articles per day is about 80. The paper is published five days a week. Breaking news is updated constantly on Kauppalehti's web site.

3.1 Data

The dataset contains about 200000 documents from the years 1994-2003, varying in length from about 20 to about 500 words. For the year 2003 we have only the first nine months' documents. The number of documents for each year is shown in Figure 1. Note that all our time plots are scaled according to these frequencies, so that 2002, for instance, is not always the most prominent year in a model simply because it has the most documents.

Figure 1. The number of documents for each year.

As the documents are newspaper articles, they are classified into predefined categories, but the classification is quite coarse and thus not very useful for topical analysis. For example, there is only one "news" category, under which all the news articles go. Because of this we did not use the predefined categories but concentrated on the topics we learned with the topic model.
3.1.1 Pre-processing

The data is first run through an external parser to normalize word forms to their lemmas and to obtain part-of-speech tags and some other information needed to detect nominal phrases, which we have a use for later. Note that this is essential in Finnish: with over 2000 distinct forms of the verb "to shop," information retrieval is poor at best without lemmatization or similar preprocessing. We use Connexor's Functional Dependency Grammar (FDG) parser (http://www.connexor.com), which is a commercial parser for Finnish and other languages. For the topic model we retain only verbs, nouns, adjectives and adverbs. Common stop-words are also removed. Then we further remove all words that appear three or fewer times in the data. After this we have a lexicon of about 190000 lexemes, which is about 22% of the original 900000 lexemes contained in the documents. This leaves about 21 million words in total in the full dataset, a fairly small size for our system. A sketch of the frequency pruning step is given below.
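A minimal sketch of the rare-word pruning step follows; it assumes the documents are already lemmatized and stop-word filtered, and the names are ours.

from collections import Counter

def prune_rare_words(docs, min_count=4):
    """Drop words appearing three or fewer times across all documents."""
    freq = Counter(w for doc in docs for w in doc)
    keep = {w for w, n in freq.items() if n >= min_count}
    return [[w for w in doc if w in keep] for doc in docs], sorted(keep)

# Example usage on toy lemmatized documents.
docs = [["nokia", "revenue", "grow"], ["nokia", "euro", "grow"]]
pruned_docs, lexicon = prune_rare_words(docs, min_count=2)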
3.2 Topics

We built a 111-component hierarchical topic model of the data described above. The model is a three-level balanced tree: below the root node there are ten child nodes, each having ten children. The topics are generally very descriptive and of high quality, in the sense that they describe the documents well.

Topics are named semi-automatically. We obtain descriptive nominal phrases for a topic automatically, but these alone cannot be used as names: even though the generated phrases are generally very clean and descriptive, they are not by themselves good names for topics, and some conceptual generalization is usually required. The phrases are nevertheless very valuable as an intermediate level in the naming process. We therefore name the topics in two stages, the first being phrase generation and the second manual naming using the generated nominal phrases as the basis. The phrase generation is explained in Section 3.2.1.

3.2.1 Obtaining Descriptive Phrases

The phrases shown in the component tables are generated by a simple algorithm that looks for descriptive nominal phrases, i.e., phrases that have a noun as a headword with zero or more attributes. Note that nouns appearing alone are also considered nominal phrases. When generating phrases for a component, each document in the collection is classified according to the proportion allocated to the component: if the proportion is lower than a threshold α, the document is considered a negative example; if the proportion is higher than another threshold β > α, the document is considered a positive example; and if the proportion is in the grey area between α and β, the document is ignored. For the experiments in this paper we have used the magic values α = 1/(5·K) and β = 5/K, where K is the number of components in the topic model. The naming process is not particularly sensitive to these values.

The score for a noun phrase is the fraction of positive examples it appears in minus the fraction of negative examples it appears in. Thus we think of the phrases as attempting to predict the appearance of a component; the ones with the highest scores are the most descriptive (a minimal sketch of this scoring is given below).

There are many other possible choices for the score; in fact a whole range of them has been investigated as so-called co-occurrence scores. The problem is mainly one of balancing the weight given to positive and negative examples, and popular scores include statistical independence tests such as mutual information, the T-score and the log-likelihood test. In our case, in order to obtain descriptive phrases with sufficient generality, we are prepared to tolerate many appearances in negative examples in exchange for a single appearance in a positive example. This is evident in the score, since negative examples outnumber positive examples approximately by a factor of K, yet the accuracy in both samples is weighted equally.
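The following is a minimal sketch of the scoring just described; the thresholds follow the paper, while the data structures and names are our own illustration.

def phrase_scores(doc_phrases, doc_proportion, K):
    """doc_phrases: per-document sets of nominal phrases.
    doc_proportion: per-document proportion allocated to one component."""
    alpha, beta = 1.0 / (5 * K), 5.0 / K
    pos = [p for p, m in zip(doc_phrases, doc_proportion) if m > beta]
    neg = [p for p, m in zip(doc_phrases, doc_proportion) if m < alpha]
    # Documents in the grey area between alpha and beta are ignored.
    phrases = set().union(*doc_phrases)
    return {
        ph: sum(ph in d for d in pos) / max(len(pos), 1)
            - sum(ph in d for d in neg) / max(len(neg), 1)
        for ph in phrases
    }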

Revenue              East-Europe           Raw materials         Taxation        Sales
Ltd.                 in Russia             companies             taxation        year <x>
CEO                  year <x>              price                 taxes           percents
euros                in a country          price of raw oil      companies       growth
company              in Estonia            incline of prices     marks           banks
revenue              in Poland             price of oil          acquittances    last year
Nokia                last year             Finnish companies     coming          sales
revenue per share    Baltic countries      price of electricity  tax authority   next year
company's turnover   revenue per share     decline of prices     percents        Sonera
net revenue          revenue of <x> euros  public sector         VAT             till the end of year
yearly revenue       Russian markets       small companies       dividend        <x> percents' growth

Table 1. Five of the ten second-level topics and ten of their most important phrases.

Some topics and their most important phrases are shown in Tables 2, 3 and 4. The phrases for the topic "Telecommunication companies" in Table 2 are very consistent, with only two of them being somewhat different from the others. It is not surprising that Nokia, Motorola and Ericsson appear among them, as they were the biggest mobile phone manufacturers during the whole period 1994-2003. Benefon is a Finnish mobile phone manufacturer, and lately there has been a lot of writing about it because of its precarious financial situation. The phrase "Jorma Ollila" is also very distinctive, as he is the CEO of Nokia.

Phrase            Score
Ericsson          0.1185
mobile phone      0.0911
Motorola          0.0638
phone             0.0584
in the markets    0.0504
Siemens           0.0380
Ollila            0.0351
Jorma Ollila      0.0301
of the world      0.0258
Benefon           0.0178

Table 2. Topic "Telecommunication companies".

The phrases for the topic "EU states" in Table 3 are even more consistent, all but two of them containing EU country names, "franc" being the French currency before the euro and "in a country" not contradicting the other phrases.

Phrase            Score
in France         0.1390
in Germany        0.1019
in a country      0.0850
in Europe         0.0840
in Italy          0.0689
in Spain          0.0675
franc             0.0647
in Netherlands    0.0590
in Switzerland    0.0588
in Belgium        0.0445

Table 3. Topic "EU states".

The top phrases for the topic "Vacations" in Table 4 are the least consistent of these three, but they are nevertheless descriptive of the documents in the topic. Phrases like "on the road", "in summer", "hotel" and "in the beach" clearly share some vacation aspects, and the rest of them are also typically found in documents that discuss vacation-related matters.

Phrase            Score
a place           0.1245
one day           0.0992
on the road       0.0940
in summer         0.0880
hotel             0.0775
a city            0.0770
some hours        0.0553
in the beach      0.0454
an island         0.0425
a kilometer       0.0358

Table 4. Topic "Vacations".

In all the examples it has to be remembered that only the top ten phrases are shown; for the naming, about 30-50 phrases were used.
3.3 Trend analysis

We use the topic model for our trend analysis. The most interesting trends are naturally the ones that show some temporal change, but it may also be valuable to know that the strength of some topics remains constant. At this point we do not perform a complete analysis of the topical trends; rather, we are interested in exploring the topics and seeing whether they can be used as a means for trend analysis, and whether they can be explained by events that happened during the period our material covers.

In Figure 2 the behaviour of the topic "Telecommunication companies" is shown. It is interesting to notice that until the middle of 1999 the volume of material about telecommunication companies seems to have been fairly constant. From that time on it started to grow quite steeply, and in 2003 it shows some signs of slowing down, although that cannot be confirmed yet. This topic seems very consistent with common understanding, as we remember the telecommunication technology boom that really got noticed by everyone at about that time. Since then the writing about the topic has remained at a high level but is probably slowing down. It will be interesting to see how this topic behaves in the future and whether it starts to decrease.

Figure 2. Topic "Telecommunication companies" (temporal strength by month, 01/94-09/03).
In Figure 3 the topic "EU states" is shown. This topic is clearly going down. This behaviour also seems reasonable, as we remember that Finland joined the EU in 1995 and at that time debate about the issue was very heated. Since then the topic has decreased in strength, and the EU is these days an everyday matter that few people get excited about. Things like Finland joining the common currency and the EU's coming expansion are not visible in this topic, although one could see some increase in the topic at the beginning of 2000; that increase is so small, however, that nothing can be concluded from it.

Figure 3. Topic "EU states" (temporal strength by month, 01/94-09/03).

Topics that show cyclical behaviour are interesting. In Figure 4 the behaviour of the topic "Vacations" is shown. This topic shows clear cyclical behaviour with a cycle of one year. This is very intuitive, as there is one main holiday season per year. Based on this topic, writing about vacations seems to be least active around the new year. In Finland, given the climate, the summer holiday is by far the most important one, and it is normal for people to have a one-month holiday during the summer. Against this background the topic's behaviour seems very natural.

Figure 4. Topic "Vacations" (temporal strength by month, 01/94-09/03).

Another topic with clear cyclical behaviour is shown in Figure 5. This topic is "Finnish political leaders", and its cycle is two years. In Finland there are four major elections: the parliamentary elections that take place every fourth year, the presidential elections that take place every sixth year, the municipal elections that take place every fourth year, and the EU parliament elections. During the ten years our material covers there have been two presidential elections, three parliamentary elections, two municipal elections and two EU parliament elections. The last presidential elections were held in 2000 and, before that, in 1994. The last parliamentary elections were in 2003 and, before that, in 1999 and 1995. Municipal elections were held in 2000 and 1996. EU parliament elections were held in 1996 and 1999. This topic seems able to capture the effect of the parliamentary elections very well, except for the 2003 elections, which for some reason are not visible in the trend. The reason may be that the number of documents for that year is much lower than for the previous years. The presidential elections are not that visible in the topic, but it has to be remembered that this topic is about political leaders in general, while the presidential elections focus on just a few persons. The municipal elections do not seem to get much attention in this newspaper either, which is natural, as Kauppalehti concentrates mainly on financial issues and the parliamentary elections are by far the most important ones in that respect. It is also important to remember that writing about politicians is not explained by the elections alone; there are also other factors, like political scandals. For example, the peak around 2001-2002 may be explained by the huge discussion at that time about certain political appointments.

Figure 5. Topic "Finnish political leaders" (temporal strength by month, 01/94-09/03).

The topic "International stock markets" in Figure 6 is also very intuitive, as it shows almost exactly the behaviour one would expect. The topic first ascends almost linearly until 2000-2001 and then descends rapidly. This behaviour is well justified if we remember the stock market bubble and its bursting. It is also interesting to notice that, even though the articles come from a financial newspaper, the writing about international stock markets is affected that strongly by the state of the markets.

Figure 6. Topic "International stock markets" (temporal strength by month, 01/94-09/03).

Topic trends do seem very promising and intuitive, but as we are dealing with data that comes from a newspaper, it has to be remembered that many factors affect the writing. Editorial matters are one: the newspaper's publishing policy may change, and the editors change as well. When explaining the behaviour of a particular topic it must be remembered that not only the reported events have an effect but also the reporters. From a pure trend-analysis perspective the selection of data sources is an important question, which is discussed e.g. in [7]. For our task this is not a problem, as our aim is to analyze the topics that are present in the data that a search engine handles.

4 Conclusions

We have applied a statistical topic model to data from an online financial newspaper. The model is used for performing queries on the documents and for exploring the temporal behaviour of the topics, which is the main contribution of this paper. We have shown that the topics acquired by the model are able to reflect changes in prevailing trends, and that these trends are consistent with common understanding of events. This approach of using a search engine to gain a better understanding of temporal changes in the world is very natural but also very convenient, as the need for search technology is growing rapidly.
For future work there are many interesting possibilities. The predictive quality of the topics should be investigated and compared to other methodologies. The trend information could also be used for refining queries and thus improving search results, and the query model could take into account the temporal information that is available.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[3] W. Buntine. Variational extensions to EM and multinomial PCA. In ECML 2002, 2002.

[4] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS Colloquium, 2004.
[5] T. Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, pages 50-57, 1999.

[6] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945-959, 2000.

[7] Z. Yan and A. Buchmann. Evaluating and selecting web sources as external information resources of a data warehouse. In Web Information Systems Engineering, WISE 2002, 2002.

[8] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 81-88. ACM Press, 2002.
