text analysis [4, 8], and we use it as the method for creating the topics and measuring trends. In Section 3 we explain our data, its preprocessing and our empirical experiments, and finally in Section 4 we draw some conclusions on our approach and discuss future research directions.

2 The topic model

The topic model we use is based on a recent discrete or multinomial version of Principal Components Analysis (PCA). These so-called multi-aspect topic models are statistical models for documents that allow multiple topics to co-exist in one document [5, 2, 4], as is the case in most newswire collections. They are directly analogous to the Gaussian basis of PCA, which in its form of Latent Semantic Indexing (LSI) has not proven successful in information retrieval.

The simplest version consists of a linear admixture of different multinomials, and can be thought of as a generative model for sampling words to make up a bag, for the Bag of Words representation of a document [1].

• We have a total count $L$ of words to sample.

• We partition these words into $K$ topics, components or aspects: $c_1, c_2, \ldots, c_K$ where $\sum_{k=1,\ldots,K} c_k = L$. This is done using a hidden proportion vector $\vec{m} = (m_1, m_2, \ldots, m_K)$. The intention is that, for instance, a sporting article may have 50 general vocabulary words, 40 words relevant to Germany, 50 relevant to football, and 30 relevant to people's opinions. Thus $L = 170$ words are in the document and the topic partition is $(50, 40, 50, 30)$.

• In each partition, we then sample words according to the multinomial for the topic, component or aspect. This is the base model for each component. This then yields a bag of word counts for the $k$-th partition, $\vec{w}_{k,\cdot} = (w_{k,1}, w_{k,2}, \ldots, w_{k,J})$. Here $J$ is the dictionary size, the size of the basic multinomials on words. Thus the 50 football words are now sampled into actual dictionary entries, "forward", "kicked", "covered" etc.

• The partitions are then combined additively, hence the term admixture, to make a distinction with classical mixture models. This yields the final sample of words $\vec{r} = (r_1, r_2, \ldots, r_J)$ by totalling the corresponding counts in each partition, $r_j = \sum_{k=1,\ldots,K} w_{k,j}$. Thus if an instance of "forward" is sampled twice, as a football word and a general vocabulary word, then we return the count of 2 and its actual topical assignments are lost; they are hidden data.

There is a full generative probability model for the bag of words in a document. The hidden or latent variables here are $\vec{m}$ and $\vec{w}$ for each document, whereas $\vec{c}$ is derived. The proportions $\vec{m}$ correspond to the components for a document, and the counts $\vec{w}$ are the original word counts broken out into word counts per component.

There are two computationally viable schemes for learning these models from data: the mean field approach [2, 3] and Gibbs sampling [6, 4]. Gibbs sampling is usually not considered feasible for large problems, but in this application it can be used to hone the results of faster methods, and it is also moderately fast due to the specifics of the model. We have our own implementation of these methods.

2.1 Temporal topics

If the documents' date information is known then there is a very simple way to track the changes in the topics' strength in the model. Given the desired resolution, one simply calculates a histogram over the documents. The number of bins in the histogram depends on the resolution. The probabilities are scaled according to the model and according to the number of documents in the particular bin so that they sum up to unity. Note that each document has some proportion in each topic. These proportions are usually sparse: for instance a single document might include 5 topics out of the 200 in the model. These proportions are averaged over all documents in a given time point.

3 Experiments

In our experiments we used articles from the Finnish financial newspaper Kauppalehti (http://www.kauppalehti.fi/) from the years 1994 to 2003. Kauppalehti is a leading financial news provider in Finland. Although its content is mainly financial, it also has some coverage of non-financial topics. In addition to news articles it also contains analysis of stock markets, different companies, equities etc. The average number of news articles per day is about 80. The paper is published 5 days a week. Breaking news is updated constantly on Kauppalehti's web site.

3.1 Data

The dataset contains about 200,000 documents from the years 1994-2003, varying in length from about 20 words to about 500 words. For the year 2003 we have only the documents of the first nine months. The number of documents for each year is shown in Figure 1. Note that all our time plots are scaled according to these frequencies so that 2002 is not always the most common year in a model.
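The generative process of Section 2 (partition the words by topic, sample words within each partition, then add the partitions together) can be sketched as follows. This is only a minimal illustration, not the implementation used in the paper: the function name `sample_document`, the toy topics and the choice to draw the partition sizes as a multinomial sample with proportions m are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_document(L, m, topics, seed=0):
    """Sample one bag of words from a linear admixture of multinomials.

    L      -- total number of words to sample
    m      -- hidden proportion vector (m_1, ..., m_K)
    topics -- list of K dicts mapping word -> probability (hypothetical toy topics)
    Returns (c, w, r): partition sizes, per-topic word counts, total word counts.
    """
    rng = random.Random(seed)
    K = len(m)
    # Partition the L words into K components (assumption: multinomial draw with weights m).
    assignments = rng.choices(range(K), weights=m, k=L)
    c = [assignments.count(k) for k in range(K)]
    # Within each partition, sample words from that topic's multinomial.
    w = []
    for k in range(K):
        words = list(topics[k])
        probs = [topics[k][word] for word in words]
        w.append(Counter(rng.choices(words, weights=probs, k=c[k])))
    # Combine the partitions additively; the topical assignments are lost in r.
    r = Counter()
    for w_k in w:
        r.update(w_k)
    return c, w, r


# Toy usage: a "general vocabulary" topic and a "football" topic share the word "forward".
general = {"the": 0.5, "forward": 0.5}
football = {"forward": 0.4, "kicked": 0.6}
c, w, r = sample_document(170, [0.5, 0.5], [general, football], seed=1)
```

In this sketch only `r` would be observed in a real collection; `c` and `w` play the role of the derived and hidden quantities, so the count of "forward" in `r` is the sum of its counts in the two partitions.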
[Figure 1: plot titled "Number of documents per year"; x-axis: Year (1994 to 2003), y-axis: Number of documents (ticks from 5000 to 35000).]

Figure 1. The number of documents for each year.

[...] balanced tree so that below the root node there are ten child nodes, each having ten children. Topics are generally very descriptive and of high quality as they describe documents well.

Topics are named semi-automatically. We obtain descriptive nominal phrases for a topic automatically, but these alone cannot be used as names. The problem with phrase generation is that even though the phrases are generally descriptive, some conceptual generalization is usually required to turn them into good topic names. Phrases are very valuable if they are used as the intermediate level in the naming process. We name the topics using a two-level process: the first level is the phrase generation and the second is manual naming using the generated nominal phrases as the basis. The phrase generation is explained in Section 3.2.1.
Revenue            | East-Europe           | Raw materials        | Taxation      | Sales
Ltd.               | in Russia             | companies            | taxation      | year <x>
CEO                | year <x>              | price                | taxes         | percents
euros              | in a country          | price of raw oil     | companies     | growth
company            | in Estonia            | incline of prices    | marks         | banks
revenue            | in Poland             | price of oil         | acquittances  | last year
Nokia              | last year             | Finnish companies    | coming        | sales
revenue per share  | Baltic countries      | price of electricity | tax authority | next year
company's turnover | revenue per share     | decline of prices    | percents      | Sonera
net revenue        | revenue of <x> euros  | public sector        | VAT           | till the end of year
yearly revenue     | Russian markets       | small companies      | dividend      | <x> percents' growth
Table 1. Five of the ten second level topics and ten of their most important phrases.
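The temporal strength of a topic, as described in Section 2.1 (average the per-document topic proportions within each time bin), can be sketched as follows. This is a minimal sketch under assumptions: the function name `topic_strength`, the monthly binning and the example proportions are hypothetical, and the paper's own scaling may differ in detail.

```python
from collections import defaultdict
from datetime import date


def topic_strength(docs, bin_key=lambda d: (d.year, d.month)):
    """Average sparse topic proportions per time bin.

    docs    -- iterable of (date, {topic_name: proportion}) pairs
    bin_key -- maps a date to its histogram bin (assumption: one bin per month)
    Returns {bin: {topic_name: mean proportion over the documents in that bin}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for day, proportions in docs:
        b = bin_key(day)
        counts[b] += 1
        for topic, p in proportions.items():
            totals[b][topic] += p
    # Dividing by the number of documents in the bin makes years with many
    # documents comparable to years with few.
    return {b: {t: s / n for t, s in totals[b].items()} for b, n in counts.items()}


# Hypothetical documents; the proportions are made up for illustration.
docs = [
    (date(1995, 1, 3), {"EU states": 0.6, "Taxation": 0.4}),
    (date(1995, 1, 20), {"EU states": 0.2, "Sales": 0.8}),
    (date(1995, 2, 1), {"Sales": 1.0}),
]
strength = topic_strength(docs)
```

If each document's proportions sum to one, the averaged strengths within a bin also sum to one, which matches the scaling to unity mentioned in Section 2.1. Passing a different `bin_key` (for example `lambda d: d.year`) changes the resolution of the histogram.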
We use the topic model for our trend analysis. The most interesting trends are naturally the ones that show some temporal change, but it may also be valuable to know that the strength of some topics remains constant. At this point we do not do a complete analysis of the topical trends; rather, we are interested in exploring the topics and seeing whether they can be used as a means for trend analysis and whether they can be explained by events that happened during the period our material covers.

Table 3. Topic "EU states".

In Figure 2 the behaviour of the topic "Telecommunication companies" is shown. It is interesting to notice that until the middle of 1999 the volume of material about telecommunication companies seems to have been fairly constant. From that time it started to grow quite steeply, and in 2003 it shows some signs of slowing down, although that cannot be said with certainty yet. This topic seems to be very consistent with the common understanding, as we remember the
telecommunication technology boom that really got noticed by everyone at about that time. Since then the writing about that topic has been high but is probably slowing down. It will be interesting to see how this topic behaves in the future and whether it starts decreasing.

In Figure 3 the topic "EU states" is shown. This topic is clearly going down. This behaviour also seems reasonable, as we remember that Finland joined the EU in 1995 and at that time debate about the issue was very heated. Since then the topic has decreased in strength, and the EU is these days an everyday matter that nobody is really interested in. Things like Finland joining the common currency and the EU's coming expansion are not visible in this topic; one could see some increase in the topic at the beginning of 2000, but it is so small that nothing can be said about it.

Topics that show cyclical behaviour are interesting. In Figure 4 the behaviour of the topic "Vacations" is shown. This topic shows clear cyclical behaviour with a cycle of one year. This behaviour is very intuitive as there is one main holiday season a year. Based on this topic, writing about vacations seems to be least active around the new year. In Finland, given the climate, the summer holiday is by far the most important holiday. It is normal that people have one month's holiday during the summer. Against this background this topic seems very interesting.

Phrase         Score
a place        0.1245
one day        0.0992
on the road    0.0940
in summer      0.0880
hotel          0.0775
a city         0.0770
some hours     0.0553
in the beach   0.0454
an island      0.0425
a kilometer    0.0358

Table 4. Topic "Vacations".

Another topic with clear cyclical behaviour is shown in Figure 5. This topic is "Finnish political leaders". The cycle in this topic is two years. In Finland there are four major elections: the parliamentary elections that take place every fourth year, the presidential elections that take place every sixth year, the municipal elections that take place every fourth year, and the EU parliament elections. During the ten years that our material covers there have been two presidential elections, three parliamentary elections, two municipal elections and two EU parliament elections. The last presidential elections were held in 2000 and before that in 1994. The last parliamentary elections were in 2003 and before that in 1999 and 1995. Municipal elections were held in 2000 and 1996. EU parliament elections were held in 1996 and 1999. This topic seems to capture the effect of parliamentary elections very well, except for the 2003 elections, which are not visible in the trend for some reason. The reason may be that the number of documents for that year is much lower than for the previous years. Presidential elections seem not to be as visible in the topic, but it has to be remembered that this topic is about political leaders in general and the presidential elections are focused on just a few persons. Municipal elections seem not to get much attention in this newspaper either, which is natural as Kauppalehti concentrates mainly on financial issues and the parliamentary elections are by far the most important ones in that respect. It is also important to remember that writing about politicians is not explained only by elections; there are also other factors such as political scandals. For example, the peak around 2001-2002 may be explained by the fact that at that time there was a huge discussion about some political promotions.

The topic "International stock markets" in Figure 6 is also very intuitive as it shows almost exactly the behaviour one would expect. The topic first ascends almost linearly until 2000-2001 and then descends rapidly. This behaviour can be explained well if we remember the stock market bubble and its bursting. It is also interesting to notice that, even though the articles are from a financial newspaper, the writing about international stock markets is affected that strongly by the state of the markets.

Topic trends do seem very promising and intuitive, but as we are dealing with data that comes from a newspaper it has to be remembered that many factors affect the writing. Editorial matters are one: the newspaper's publishing policy may change, and so may the editors. When explaining the behaviour of a particular topic it must be remembered that not only the reported events but also the reporters have an effect. Solely from the trend analysis perspective the selection of data sources is an important question, which is discussed e.g. in [7]. For our task that is not a problem, as our aim is to analyze the topics that are present in the data that a search engine handles.

4 Conclusions

We have applied a statistical topic model to data from a financial online newspaper. This model is then used for querying the documents and exploring the temporal behaviour of the topics, which is the main contribution of this paper. We have shown that the topics acquired by the model are able to reflect changes in the prevailing trends and that they are consistent with the common understanding of events. This approach of using a search engine for getting a better understanding of the temporal changes in the world is very natural but also very convenient, as the need
for search technology is growing rapidly.

For future work there are many interesting possibilities. The predictive quality of the topics should be investigated and compared to other methodologies. Also, the trend information could be used for refining the queries and thus improving the search results, and the query model could take into account the temporal information that is available.

[Plots of the temporal strength of a topic; x-axis: Month/Year (01/94 to 09/03), y-axis: Strength. Surviving panel titles: "Topic: EU states", "Topic: Finnish political leaders", "Topic: Vacations".]

Figure 4. Topic "Vacations".

Figure 6. Topic "International stock markets".

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] W. Buntine. Variational extensions to EM and multinomial PCA. In ECML 2002, 2002.
[4] T. Griffiths and M. Steyvers. Finding scientific topics. PNAS Colloquium, 2004.
[5] T. Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, pages 50–57, 1999.
[6] J. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, 2000.
[7] Z. Yan and A. Buchmann. Evaluating and selecting web sources as external information resources of a data warehouse. In Web Information Systems Engineering, WISE2002, 2002.
[8] Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 81–88. ACM Press, 2002.