Saurav Bose
2009042 IIIT-Delhi
saurav09042@iiitd.ac.in
Saurabh Yadav
2010077 IIIT-Delhi
saurabh10077@iiitd.ac.in
ABSTRACT
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. This report provides an overview of authorship analysis and the process of authorship identification of online messages. It explains in detail the different types of writing-style features that are extracted to build feature-based classification models for identifying the authorship of online messages. It reviews the efficiency and robustness of these models in a multi-language context (English and Arabic). It also discusses the limitations that arise in authorship analysis of Arabic messages and how they can be overcome.
Keywords
Authorship analysis, authorship identification, online messages, writing-style features.
1. INTRODUCTION
With the advent of the Internet, it has become easier to share information between people across time and space. This brings both advantages and disadvantages, the latter including a new venue for criminal activities, collectively known as cybercrime. Examples include the distribution of illegal content in cyberspace, such as pornography and pirated software, as well as terrorism and hate speech. Of late, cybercriminals have been extensively involved in distributing such illegal content and hate speech via Web-based channels such as websites, newsgroups, and forums. The Internet's anonymity gives them an upper hand in performing such activities. Participating in online activities is easy because people usually do not have to provide their real identity information, such as name, address, or gender. As a result, criminal identity tracing poses complex challenges for law enforcement agencies. To make matters worse, the sheer number of users and activities makes a manual approach to criminal identity tracing inadequate for cybercrime investigation. The need of the hour is to automate criminal identity tracing in cyberspace, allowing investigators to prioritize their tasks and focus on the major criminals. Authorship analysis can assist this activity by automatically extracting linguistic features from online messages and evaluating stylistic details for patterns of terrorist communication. However, related work on authorship analysis techniques has mostly remained on paper, with little real-world implementation, particularly for online communication. Furthermore, the global nature of terrorist activity has made it necessary to analyze multilingual content. Arabic has garnered specific attention in recent years for sociopolitical reasons that include possible ties between certain Middle Eastern groups and terrorism. Arabic has morphological characteristics that pose several critical problems for current authorship analysis techniques.
Character-based lexical features include the total number of characters, characters per sentence, the usage frequency of individual letters, etc. Structural features deal mainly with the text's organization and layout and have proved to be important in analyzing online messages. Researchers traditionally focused on word structures such as greetings and signatures, or on the number of paragraphs and the average paragraph length. These features are useful discriminators, but they do not capture the additional stylistic information present in online messages. For example, the use of various font sizes and colors requires a conscious effort, which makes it a style marker. Content-specific features are words that are important within a specific topic domain. For example, in a discussion on computers, words such as RAM and laptop would occur most frequently.
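The character-based lexical features named above can be illustrated with a minimal Python sketch. The function name `char_lexical_features`, the punctuation-based sentence splitter, and the returned keys are assumptions made for illustration; they are not taken from the studies this report reviews.

```python
import re
from collections import Counter

def char_lexical_features(text):
    """Compute three character-based lexical features for one message."""
    # Assumption: sentences end at '.', '!' or '?'; real messages may need more care.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c.lower() for c in text if c.isalpha()]
    counts = Counter(letters)
    n_letters = len(letters)
    return {
        "total_chars": len(text),
        "chars_per_sentence": len(text) / len(sentences) if sentences else 0.0,
        # Relative usage frequency of each individual letter.
        "letter_freq": {c: n / n_letters for c, n in counts.items()},
    }

features = char_lexical_features("Hello world. Hi there!")
print(features["total_chars"], features["chars_per_sentence"])
```

Each message would yield one such feature vector; structural and content-specific features would be appended in the same way.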
The drastic increase in computational power over the years has led to the emergence of machine learning techniques for authorship analysis. These techniques include Support Vector Machines (SVMs), neural networks, and decision trees. They scale better than statistical techniques to large feature sets and are less susceptible to noisy data. As a result, they have gained wider acceptance in authorship analysis studies in recent years. These benefits are important for working with online messages, which involve classification of many authors and a large feature set.
3. ARABIC CHARACTERISTICS
Arabic is a Semitic language belonging to the Afro-Asiatic family. Its stylistic and structural properties pose several challenges for authorship analysis. Inflection, diacritics, word length, and elongation are characteristics that need to be handled when applying authorship analysis to Arabic messages.
3.1 Inflection
Arabic has approximately 5000 roots from which words and sentences are formed, making it a highly inflected language. These roots are themselves composed of 3-5 consonants. The orthographical and morphological properties of Arabic result in significant lexical variation [1], because words can take on numerous forms. Inflection creates feature extraction problems owing to the larger number of possible words, which weakens vocabulary richness measures.
3.2 Diacritics
Diacritics are the markings above or below letters that indicate special phonetic values [1]. In English, for example, a diacritic is the little mark on top of the letter e in the word résumé [1]. Diacritics are used in Arabic to represent short vowels, consonant lengths, and relationships between words. However, diacritics are rarely used in online communication. Although readers can use the sentence semantics to decipher the proper meaning, this isn't feasible for an automated extraction program. For instance, without diacritics the words resume and résumé would look identical to a computer. The lack of diacritics can significantly impact the effectiveness of word-usage-based features such as function words. In Arabic, for example, it's impossible without diacritics to distinguish between the words "who" and "from".
4. EXPERIMENT DESIGN
The test bed of messages was taken from Web forums. The Arabic data set came from the Al-Aqsa Martyrs group, while the English data set came from the White Knights group. There were 400 messages in each language. The average message length was 76.6 words for the English data set and 580.69 words for the Arabic data set. The White Knights content revolved around political, racial, and religious issues. Members commonly used profanities and advocated the use of violence against groups they disliked. The Al-Aqsa Martyrs messages were mostly anti-America messages featuring lengthy arguments espousing the group's views. The messages contained abundant embedded images and links relating to the war in Iraq and the treatment of Al-Qaeda prisoners.
Two classifiers were used in the experiment: C4.5 and SVM. C4.5 is a decision-tree classifier whose analytical and explanatory power makes it well suited to assessing key differences between the English and Arabic feature sets. SVM, on the other hand, is a computational learning method based on structural risk minimization [1]. It has gained popularity over the years owing to its classification power and robustness, and it readily handles many input values because of its capacity for dealing with noisy data.
The English and Arabic feature sets consisted of 301 and 418 features respectively. Of the 301 English features, 87 were lexical, 158 syntactic, 45 structural, and 11 content-specific. The 418 Arabic features were distributed as 79 lexical, 262 syntactic, 62 structural, and 15 content-specific. The Arabic feature set was designed with the language's morphological and orthographical properties in mind. Inflection was handled by using usage frequencies for a selected set of word roots, which compensated for the losses in vocabulary richness measures. Root frequencies were tracked with a clustering algorithm designed by De Roeck and Al-Fares.
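As a quick arithmetic check on the feature-set composition reported above, the following Python sketch sums the category counts per language and computes each category's share. The counts are the ones given in the text; the dictionary layout and the `summarize` helper are illustrative assumptions.

```python
# Category counts as reported in the text above.
FEATURE_COUNTS = {
    "English": {"lexical": 87, "syntactic": 158, "structural": 45, "content-specific": 11},
    "Arabic": {"lexical": 79, "syntactic": 262, "structural": 62, "content-specific": 15},
}

def summarize(counts):
    """Return the total feature count and each category's share in percent."""
    total = sum(counts.values())
    shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}
    return total, shares

for language, counts in FEATURE_COUNTS.items():
    total, shares = summarize(counts)
    print(language, total, shares)
```

The totals match the reported 301 and 418 features. Syntactic features make up a larger share of the Arabic set (roughly 63%) than of the English set (roughly 52%).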
Root frequencies were extracted by calculating similarity scores for each word against a dictionary containing more than 4,500 roots. Each word was assigned to the root with the highest similarity score, and that root's usage frequency was incremented. An important issue was determining how many roots to include in the final feature set. A trial-and-error approach was used, as in other multilingual authorship studies, because previous research has not yielded more definitive techniques. Between 0 and 500 of the most frequently occurring roots were added to the complete Arabic feature set, their classification power was tested with SVM, and the optimal number (50 roots) was integrated into the feature set. To capture word length precisely, a filter embedded in the Arabic feature extractor removed elongation after it had been tracked. Diacritics could not be tracked because no feasible semantic tagger was available.
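The elongation filter and the root-assignment step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the De Roeck and Al-Fares algorithm: the character-bigram Dice coefficient stands in for their similarity score, the two-root dictionary is a toy example, skipping words with zero similarity is a choice made here, and all function names are hypothetical.

```python
from collections import Counter

TATWEEL = "\u0640"  # Arabic elongation character (tatweel / kashida)

# Toy data, written with escapes to avoid right-to-left display issues.
KTB = "\u0643\u062a\u0628"                 # root k-t-b
DRS = "\u062f\u0631\u0633"                 # root d-r-s
KITAB = "\u0643\u062a\u0627\u0628"         # kitab
MADRASA = "\u0645\u062f\u0631\u0633\u0629" # madrasa

def strip_elongation(word):
    """Return (word without tatweel, number of tatweel characters removed)."""
    return word.replace(TATWEEL, ""), word.count(TATWEEL)

def bigrams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient over character bigrams (assumed similarity measure)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def root_frequencies(words, roots):
    """Track elongation, strip it, then credit each word to its most similar root."""
    freq = Counter()
    elongation_total = 0
    for raw in words:
        word, n = strip_elongation(raw)
        elongation_total += n
        best = max(roots, key=lambda r: dice(word, r))
        if dice(word, best) > 0:  # assumption: ignore words matching no root
            freq[best] += 1
    return freq, elongation_total

words = [KITAB, MADRASA, "\u0643" + TATWEEL * 3 + "\u062a\u0628"]
freq, elongation = root_frequencies(words, [KTB, DRS])
print(freq.most_common(), elongation)
```

Selecting the most frequent roots for the feature set would then amount to calling `freq.most_common(k)` for each candidate `k` and comparing classifier accuracy.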
5. IDENTIFICATION PROCESS
The complete online authorship identification process involves three main steps: collection, extraction, and experimentation.
5.2 Experiments
After the feature values are extracted, the next step is to form feature sets. These sets are formed in a stepwise manner. The first set consisted of lexical features only, and the second encompassed lexical and syntactic features. The third set consisted of lexical, syntactic, and structural features, while the fourth consisted of all features: lexical, syntactic, structural, and content-specific. Such a stepwise increment of features helps to identify the relevance of each writing-style characteristic in authorship analysis. Lexical and syntactic features are the most important categories and hence form the foundation for structural and content-specific features. For the experiment on English and Arabic messages, 30 random samples of five authors each were drawn. Each sample of five authors was evaluated using all 20 messages per author [1]. The two classifiers were applied one at a time, and 30-fold cross-validation was used in all experiments. Accuracy, recall, and precision measures are used to evaluate the prediction performance. These measures have been commonly adopted in the information retrieval and authorship analysis literature. Accuracy indicates the overall prediction performance of a particular classifier [2]:
$$\text{Accuracy} = \frac{\text{Number of messages with correctly identified author}}{\text{Total number of messages}}$$
For a particular author, precision and recall measure the effectiveness of the approach in identifying messages written by that author. They are defined as [2]:
$$\text{Precision} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}}$$
$$\text{Recall} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}}$$
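The stepwise feature sets and the accuracy, precision, and recall measures used in the experiments can be sketched in Python as follows. The function names and the toy author labels are assumptions for illustration; the formulas are the standard definitions of these measures.

```python
CATEGORIES = ["lexical", "syntactic", "structural", "content-specific"]

def stepwise_feature_sets(categories=CATEGORIES):
    """Return cumulative category lists: [lexical], [lexical, syntactic], ..."""
    return [categories[:i] for i in range(1, len(categories) + 1)]

def accuracy(true_authors, predicted_authors):
    """Fraction of messages whose author was identified correctly."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)

def precision_recall(true_authors, predicted_authors, author):
    """Per-author precision and recall."""
    pairs = list(zip(true_authors, predicted_authors))
    correct = sum(t == author and p == author for t, p in pairs)
    assigned = sum(p == author for _, p in pairs)  # messages assigned to the author
    written = sum(t == author for t, _ in pairs)   # messages written by the author
    precision = correct / assigned if assigned else 0.0
    recall = correct / written if written else 0.0
    return precision, recall

# Toy labels (hypothetical): five messages, three authors.
true_authors = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "A"]
print(stepwise_feature_sets())
print(accuracy(true_authors, predicted))
print(precision_recall(true_authors, predicted, "A"))
```

In a full experiment, each of the four cumulative feature sets would be passed to the classifier in turn, and these measures would be averaged over the cross-validation folds.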
Al-Aqsa Martyrs messages were longer, too. They used a plethora of font colors and sizes, often as tools to emphasize a certain point. Red, blue, and navy were almost as common as black. This was in sharp contrast to the KKK messages, where black, 10-to-12-point type was a fixture, with only an occasional deviation to green or blue.
8. REFERENCES
[1] Abbasi, Chen. 2005. Applying Authorship Analysis to Extremist Group Web Forum Messages. University of Arizona.
[2] Zheng, Huang, Chen. Authorship Analysis in Cybercrime Investigation. Department of Management Information Systems, University of Arizona.
[3] Zheng, Chen, Huang. 2006. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques.
[4] Abbasi, Chen. 2006. Visualizing Authorship for Identification. Department of Management Information Systems, University of Arizona.