
Classifying Web Texts Based on Syntactic and Grammatical Modeling
Miaomiao Zhang
Department of Language and Communication Studies
NTNU Norwegian University of Science and Technology
NO-7491 Trondheim, Norway
miaozhan@gmail.com

Qinghua Wang
Department of Communications and Networking
Aalto University, School of Electrical Engineering
P.O. Box 13000, FI-00076 AALTO, Finland
qinghua.wang@aalto.fi

Abstract—The Internet has become an indispensable part of human life. One annoyance on the Internet is the presence of malicious web texts, such as web sites containing viruses and spam emails. The contribution of this paper is a web text classification system, with accompanying algorithms, that helps differentiate malicious web texts from normal ones. Syntactic and grammatical features of web texts are used as inputs to the system; to the best of our knowledge, this is the first time such features have been exploited for a web text classification system.

Keywords—Computational linguistics; syntactic modeling; grammatical modeling; Internet security; web text classification

I. INTRODUCTION

The Internet is anarchic, and different views (good or bad) can be expressed freely. It is a great human invention, and it protects democracy and freedom of speech. However, there is content people may not want to access on the Internet, such as unhealthy websites, websites containing viruses, spam emails, etc. We call such content malicious web texts. In contrast, texts other than the malicious ones are called normal texts. The contribution of this paper is an innovative web text classification system that helps differentiate malicious web texts from normal ones.
Much previous work has focused on detecting malicious web texts, such as spam emails [1-4]. Among them, Xie et al. [4] and Thomas et al. [2] use URL features to distinguish spam messages from legitimate ones. Using personal social networks, Li et al. [3] adopt spam keywords to enhance a Bayesian filter. In addition, Gao et al. [1] use a combination of features, such as cluster size, the sender's social degree, interaction history, average time interval, and average number of URLs per message, to detect spam on online social networks (e.g., Facebook). This paper innovates by proposing to use syntactic and grammatical features of web texts, which have rarely received attention in earlier research.
In the remainder of this paper, Section II presents the system model for web text classification. Section III presents our algorithm for profiling and comparing web texts, which is the core of the classification system. Section IV concludes the paper.

II. SYSTEM MODEL

The purpose of this paper is to design a web text classification system that automatically classifies web texts of interest into pre-defined categories, e.g., malicious web texts and normal texts. The core of the system is a web text profiling algorithm that uses a decision tree to profile selected web texts based on syntactic and grammatical analyses. The details of the algorithm are elaborated in Section III. The system works by comparing the similarity between an unknown profile and a pre-learned profile. If the unknown profile is similar to the pre-learned profile, then the web texts described by the unknown profile share the same category as the web texts described by the pre-learned profile.
Three steps are needed for the system to work. The details are shown in Figure 1. The first step is to learn the profiles for labeled web texts (presumably labeled by humans) which belong to a particular pre-defined category. In this way, each category of web texts gets a standard profile. This step is called the training phase. The second step is to learn the profiles for unlabeled web texts of interest, in the same way as in the training phase. The web texts profiled in this step are those waiting to be classified (e.g., an incoming email which has not been labeled by other means). This step is called the processing phase, or the detection phase if the purpose is to detect malicious texts or spam emails. The last step is to compare the profile learned for the unlabeled texts with the pre-learned profiles for labeled texts. If the profiles are similar, the unlabeled texts are considered to belong to the same category as the texts used for the pre-learned profile.

Figure 1: System Model

III. PROFILING WEB TEXTS USING DECISION TREES

The decision tree is a machine learning and data mining technique used to assist decision making or to profile sequences of operations. It builds a tree-like graph based on the features of training items, and it can later be used to classify different groups of items (i.e., make a decision) or to predict a missing feature of an item given its other features. In linguistics, a message consists of different types of words, and these words appear in sequences according to grammar rules. We can thus model and classify different categories of messages using decision trees. In the rest of this section, the decision tree technique is used in a slightly different way from its traditional use: instead of making decisions, a decision tree is used to build message pattern profiles for different categories of web texts.
A. Building Decision Trees Based on Syntactic
Decomposition
Tagging word classes and modeling the syntactic rules are
two important tasks in the field of computational linguistics
which deals with the processing and computation of natural
languages. Assume that a computer has acquired different
word classes and syntactic rules of a certain language (e.g.
English) after training by linguists and computer scientists. A sentence can thus be described as a sequence of word classes. Take the following sentence, which comes from a spam message, as an example:
Your bank information is not valid.
The phrase structure of this sentence is shown in Figure 2.

Figure 2: Syntactic decomposition of a spam message


If we only consider the syntactic categories of the words
forming this sentence, then the message can be abstracted as:
Det (determiner) - N (noun) - N (noun) - V (verb) - Adv
(adverb) - Adj (adjective)
The purpose is to learn whether the appearance of this abstracted pattern is normal or whether it represents a malicious message, i.e., a spam in this case. In order to learn all patterns of malicious (or normal) messages and their probabilities of appearance, we can train a decision tree using labeled training data.
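The abstraction step above can be sketched in a few lines of Python. The toy lexicon below is a stand-in for a real part-of-speech tagger and only covers the example sentence; a production system would use a trained tagger.

```python
# Toy lexicon standing in for a real POS tagger (covers only the example).
LEXICON = {
    "your": "Det", "bank": "N", "information": "N",
    "is": "V", "not": "Adv", "valid": "Adj",
}

def abstract_sentence(sentence):
    """Map a sentence to its sequence of word classes (its pattern)."""
    words = sentence.lower().rstrip(".").split()
    # Unknown words get the N/A value, as in Table 1.
    return [LEXICON.get(w, "N/A") for w in words]

pattern = abstract_sentence("Your bank information is not valid.")
print(pattern)  # ['Det', 'N', 'N', 'V', 'Adv', 'Adj']
```

The resulting list is exactly the Det - N - N - V - Adv - Adj pattern derived above, ready to be used as one training item.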
Suppose that a sentence has a maximum of N words. We can then have an example training set of messages as shown in Table 1.

Table 1: An example training set of malicious/normal sentences with syntactic features

No. | 1st Word | 2nd Word | 3rd Word | ... | Nth Word | Label
1   | N        | V        | Adv      | ... | N/A      | Normal
2   | N        | V        | V        | ... | N/A      | Malicious
3   | V        | V        | Adv      | ... | Adv      | Malicious
4   | Det      | N        | N        | ... | Adj      | Malicious
5   | Det      | N        | V        | ... | N/A      | Normal
6   | N        | V        | Adv      | ... | N/A      | Malicious
7   | Det      | N        | V        | ... | N/A      | Malicious
8   | N        | V        | Adv      | ... | N/A      | Normal

In Table 1, each word can be one of the word types: {N (noun), V (verb), Adj (adjective), P (preposition), Adv (adverb), Det (determiner), Deg (degree word), Qual (qualifier), Aux (auxiliary), Con (conjunction), N/A (not available)}. The kth word is assigned the value N/A if there is no kth word in a sentence. In addition, a message can be labeled either as a malicious message or as a normal message.
When it comes to the identification of a malicious
message, we may say that a message is malicious because it
exhibits a pattern (e.g. in the form of the items in Table 1)
which has been observed in other malicious messages (e.g. in
the training set provided by Table 1). But this kind of
malicious message identification is very inaccurate because a
pattern observed in malicious messages may also be observed
in normal messages. For example, items 5 and 7 in Table 1 have the same set of attribute values, but one has been observed as normal and the other as malicious. We thus
need a more advanced technique which could describe the
difference between malicious messages and normal messages
from a holistic point of view.
The decision tree provides a method to build complete pattern profiles either for malicious messages or for normal messages. In Table 1, each item represents a message, and it has a series of attributes, namely the 1st word, 2nd word, 3rd word, ..., Nth word. Each attribute can take values from the set: {N, V, Adj, P, Adv, Det, Deg, Qual, Aux, Con, N/A}. From Table 1, we can construct a decision tree for malicious messages by iteratively partitioning the data (i.e., those items which are labeled as malicious) into subsets that share the same attribute values. A possible tree is as follows:

Figure 3: Decision tree model for malicious messages based on syntactic analysis

For brevity's sake, only the attributes of the first three words are analyzed, and tree branches with probability 0 have been omitted. In Figure 3, each branch from top down
represents a pattern that has been observed for malicious
messages. For example, the left-most branch in Figure 3 tells
the pattern N-V-V. At the place of each node in the tree, there
is an observation probability associated with the node. If it is a
non-leaf node, the probability associated with it tells the
marginal probability that a malicious message exhibits a
partial pattern from the root to the node of interest. If it is a
leaf node, the probability associated with it tells the
probability that a malicious message has the full pattern from
the top to the bottom. Because of the adoption of the N/A
attribute value, variable-length patterns can be easily handled
under the framework of a fixed-depth decision tree.
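The profiling step can be sketched as follows, assuming each training message has already been abstracted into a fixed-length pattern (padded with "N/A"). For compactness, the tree of Figure 3 is represented flatly as the set of its root-to-leaf patterns with empirical probabilities; the probability of a non-leaf node is then the sum over the patterns extending that prefix, which carries the same information as the tree.

```python
from collections import Counter

def build_profile(patterns):
    """Estimate P(pattern) from a list of training patterns (leaf probabilities)."""
    counts = Counter(tuple(p) for p in patterns)
    total = sum(counts.values())
    return {pat: c / total for pat, c in counts.items()}

def prefix_probability(profile, prefix):
    """Marginal probability of a partial pattern (root-to-node path in Figure 3)."""
    return sum(p for pat, p in profile.items()
               if pat[:len(prefix)] == tuple(prefix))

# The five malicious patterns from Table 1 (first three words only).
malicious = [
    ["N", "V", "V"], ["V", "V", "Adv"], ["Det", "N", "N"],
    ["N", "V", "Adv"], ["Det", "N", "V"],
]
profile = build_profile(malicious)
print(prefix_probability(profile, ["N", "V"]))  # 0.4
```

Two of the five malicious patterns begin with N - V, so the marginal probability at that non-leaf node is 0.4.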
Similarly, we can also build a decision tree model for the
normal messages shown in Table 1. The result is not shown
due to space limitation.
Because of the different categories of messages, we shall
expect that the decision tree built for malicious messages is
different from the one built for normal messages. If there is a
set of messages (e.g. extracted from a web site or from an
email) whose category is not determined, we can also build a
decision tree for them. In order to determine the category of
this set of messages, a similarity test can be performed
between the newly built decision tree and the ones which have
been trained for malicious messages and for normal messages.
We shall say that the set of messages is malicious if its
decision tree is more similar to the one trained for malicious
messages than to the one trained for normal messages.
Otherwise, we say that the set of messages is normal.
As mentioned earlier, the probability associated with a leaf node in a decision tree is the probability that the pattern represented by the branch from the root to the leaf node is observed. If each attribute is considered as one
dimension in the value space of the patterns, then a specific
pattern is a data point in the N-dimensional (supposing there
are N attributes) value space. From this perspective, the
probability of observing a specific pattern is also the
probability that a specific discrete data point (i.e. a value) is
taken by a random pattern variable. Considering a complete
decision tree, it represents the joint probability distribution of
the random pattern variable which is associated with the
category of messages that have been used to build the decision
tree. This kind of interpretation is very useful as we can now
compare the similarities of two decision trees by comparing
the similarities of their respective probability distributions.
Let Tm and Tn be respectively the probability distributions
represented by the decision tree of malicious messages and by
the decision tree of normal messages. Let P be the probability
distribution represented by the decision tree trained by an
unknown set of messages. The similarities or the distances
between Tm and P and between Tn and P can be measured
using the Kullback-Leibler (K-L) divergence:
DKL(Tm || P) = Σi Tm(i) ln(Tm(i) / P(i))    (1)

DKL(Tn || P) = Σi Tn(i) ln(Tn(i) / P(i))    (2)

As can be seen from the definition, the K-L divergence is the expectation, under the first distribution, of the logarithmic difference between the two probability distributions. In (1) and (2), i is a data point in the pattern value space for which P(i) is non-zero. In order to have a meaningful measurement of the K-L divergence, each distribution involved must sum to 1. That means each must be a probability distribution represented by a complete decision tree in which all non-zero probability tree branches have been included.
In order to determine the category of an unknown set of messages, its K-L divergences with respect to Tm and Tn must be compared. A K-L divergence tells the difference between two distributions, and it is always a non-negative value according to the Gibbs inequality. Therefore, the larger a K-L divergence is, the less similar the two compared distributions (or their associated decision trees) are. We say that an unknown set of messages is malicious if

DKL(Tm || P) < DKL(Tn || P),    (3)

where P is the pattern probability distribution represented by the decision tree built from the unknown set of messages. Otherwise, we say that the unknown set of messages is normal.
If only one of the distributions Tm and Tn is known, the category of an unknown set of messages can be determined by comparing DKL(Tm || P) or DKL(Tn || P) with an empirical threshold.
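The comparison in Equations (1)-(3) can be sketched as follows. The three pattern distributions below are illustrative values, not trained profiles. Following the text, the sum runs over points i where P(i) is non-zero; note that any pattern carried by the reference distribution but absent from P would make the divergence infinite, which is why complete trees are required.

```python
import math

def kl_divergence(t, p):
    """K-L divergence D(T||P), summed over points where P(i) is non-zero.
    Terms with T(i) = 0 contribute nothing (the 0*ln(0) convention)."""
    return sum(t[i] * math.log(t[i] / p[i])
               for i in p if t.get(i, 0.0) > 0.0)

def classify(tm, tn, p):
    """Equation (3): the unknown set is malicious iff D(Tm||P) < D(Tn||P)."""
    return "malicious" if kl_divergence(tm, p) < kl_divergence(tn, p) else "normal"

# Illustrative pattern distributions over two syntactic patterns.
tm = {("N", "V", "V"): 0.6, ("Det", "N", "V"): 0.4}  # malicious profile
tn = {("N", "V", "V"): 0.1, ("Det", "N", "V"): 0.9}  # normal profile
p  = {("N", "V", "V"): 0.5, ("Det", "N", "V"): 0.5}  # unknown texts
print(classify(tm, tn, p))  # malicious
```

Here P is closer to Tm than to Tn, so by Equation (3) the unknown set is classified as malicious.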
B. Building Decision Trees Based on Grammatical
Decomposition
Grammatical roles specify relations between words in
sentences. They are bounded categories and well defined
(compared to semantic roles which are inherently unbounded
and generally not clearly defined). In Section III.A, it is
assumed that a syntactic decomposition can be performed by
computers. If we proceed one step further, we can assume that
computers are able to understand grammatical relations in
sentences. This is actually a practical assumption. The
grammatical role of a word in a sentence can be determined by
the syntactic category it belongs to and the position it appears
in the sentence. For example, the subject in English is the
nominal element that the verb agrees with. It comes right
before the verb in unmarked, declarative clauses, and when
pronominalized, employs subjective pronouns. As can be seen, the definition of a grammatical role is quite clear, without referring to the meanings of words, and can be easily
understood by a computer program. Because grammatical
relations are language specific and vary from one language to
another, the grammatical roles we mention in the following
only apply to English.
Once again, we take the sentence from a spam message as
an example: Your bank information is not valid. The
grammatical structure of this message is shown in Figure 4.

Figure 4: Grammatical decomposition of a spam message
From Figure 4, we know this spam message has the pattern Subject - Verb - Complement in terms of grammatical relationships.
In English, a complete set of grammatical roles is defined
as: {Subject, Verb, Indirect Object, (Direct) Object,
Complement, and Adverbial}. English is actually a quite
structured language in terms of the strict orders among
different grammatical roles. For a declarative sentence (and
other sentence types include: interrogative, imperative and
exclamative sentences) consisting of a single clause, its
construction appears this way in its entirety:
Adverbial - Subject - Adverbial - Verb - Indirect Object - Object - Complement - Adverbial
In the above, the Verb element is the most central element, and it is normally obligatory in all sentences. The Subject element is likewise indispensable. The other elements are mainly optional.
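The positional rule described above can be illustrated with a deliberately simplistic sketch that collapses a word-class sequence into grammatical-role chunks: pre-verbal nominal material becomes the Subject and post-verbal material becomes the Complement. Real grammatical-relation labeling is considerably more involved; negation and post-verbal adverbials are lumped into the Complement chunk here purely for simplicity.

```python
def clause_pattern(pos_pattern):
    """Collapse a word-class sequence into grammatical-role chunks
    for a single declarative clause (toy positional heuristic)."""
    v = pos_pattern.index("V")  # the (first) verb anchors the clause
    roles = (["Subject"] * v) + ["Verb"] + \
            (["Complement"] * (len(pos_pattern) - v - 1))
    # Merge consecutive identical roles into one chunk per role.
    return [r for i, r in enumerate(roles) if i == 0 or roles[i - 1] != r]

# "Your bank information is not valid." -> Det N N V Adv Adj
print(clause_pattern(["Det", "N", "N", "V", "Adv", "Adj"]))
# ['Subject', 'Verb', 'Complement']
```

Applied to the spam example, the heuristic recovers the Subject - Verb - Complement pattern of Figure 4 from syntactic categories and positions alone.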
As in Section III.A, we hope that different groups of web texts exhibit different patterns, or at least different pattern distributions, in terms of grammatical relations. The grammatical patterns profiling different groups of web texts can also be learned in the form of decision trees. Similar to Table 1 in Section III.A, we have an example training set shown in Table 2, where N is assumed to be the maximum number of grammatical parts in a sentence (or a clause) and N/A is again used to represent an empty value for absent grammatical parts.
Table 2: An example training set of malicious/normal sentences with grammatical features

No. | 1st Part | 2nd Part | 3rd Part | ... | Nth Part | Label
1   | Subject  | Verb     | N/A      | ... | N/A      | Normal
2   | Subject  | Verb     | Object   | ... | Compl.   | Normal
3   | Subject  | Verb     | Compl.   | ... | N/A      | Normal
4   | Subject  | Verb     | Adv.     | ... | N/A      | Normal
5   | Adv.     | Subject  | Verb     | ... | N/A      | Normal
6   | Subject  | Adv.     | Compl.   | ... | N/A      | Malicious
7   | Subject  | Verb     | Compl.   | ... | N/A      | Malicious
8   | Subject  | Compl.   | Adv.     | ... | Adv.     | Malicious

Due to space considerations, we only show the decision tree learned for normal messages. In Figure 5, the decision tree for normal messages is drawn by iteratively partitioning the items in Table 2 into subsets which share the same attribute values (in this case, the attributes are 1st Part, 2nd Part, etc.). Only three levels of attributes are drawn in Figure 5.

Figure 5: Decision tree model for normal messages based on grammatical analysis
As the decision tree in Figure 5 can also be interpreted as a joint probability distribution, just like the one in Figure 3, the methods used to compare the similarity of probability distributions can also be used to compare the similarity of decision trees (or of web texts, since decision trees are used to profile web texts in this paper). If an unknown group of web texts is also profiled with a decision tree, comparing that decision tree with a pre-learned one (such as the one in Figure 5) tells whether the unknown group of web texts belongs to the same category as the web texts that were used to build the pre-learned decision tree. The details of this comparison-based classification have been shown in Equations (1)-(3) and their surrounding text.
IV. CONCLUSIONS

This paper presents a web text classification system which uses decision trees to profile texts. Syntactic analysis and grammatical analysis of texts are used to help build precise profiles. The technique presented in this paper can be used to detect malicious and spam messages on the Internet. It can also be used to improve the efficiency of language documentation by automating the identification of interesting texts among samples gathered (or crawled) from the Internet.
REFERENCES
[1] Hongyu Gao, Yan Chen, Kathy Lee, Diana Palsetia and Alok Choudhary, "Towards online spam filtering in social networks," in Proc. of the 19th Network & Distributed System Security Symposium (NDSS), 2012.
[2] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, "Design and Evaluation of a Real-Time URL Spam Filtering Service," in Proc. of the IEEE Symposium on Security and Privacy, May 2011.
[3] Z. Li and H. Shen, "SOAP: A Social Network Aided Personalized and Effective Spam Filter to Clean Your E-mail Box," in Proc. of IEEE INFOCOM, April 2011.
[4] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten and I. Osipkov, "Spamming botnets: signatures and characteristics," in Proc. of SIGCOMM, 2008.
