Professional Documents
Culture Documents
Mathias Niepert
After the corpus has been loaded with the given settings, information gain
the distribution of term and document frequencies can be X
plotted (Menu: Corpus → Distribution Graphs → Docu- w(tk ) = − P (C)logP (C)
C
ment Frequency/Term Frequency”) and the category prop-
erties can be examined in a table (Menu: Corpus → Cate- X
gory Table). The Distribution Graph is a plot of the gaussian +P (tk ) P (C|tk )logP (C|tk )
distribution of the document or term frequency in the cor- C
Results
dimensionality of feature space = 500
chi square information gain MSF
break even 0.511 0.686 0.519
11 point 0.561 0.716 0.548
avg precision 0.563 0.741 0.550