
Authorship Analysis and Identification

Saurav Bose
2009042, IIIT-Delhi
saurav09042@iiitd.ac.in

Saurabh Yadav
2010077, IIIT-Delhi
saurabh10077@iiitd.ac.in

ABSTRACT
With the rapid proliferation of Internet technologies and applications, misuse of online messages for inappropriate or illegal purposes has become a major concern for society. The anonymous nature of online-message distribution makes identity tracing a critical problem. This report provides an overview of authorship analysis and the process of authorship identification of online messages. It explains in detail the different types of writing-style features that are extracted to build feature-based classification models for identifying the authorship of online messages. It reviews the efficiency and robustness of these models in a multilingual context (English and Arabic). It also discusses the limitations that arise when performing authorship analysis of Arabic messages and how they can be overcome.

Keywords
Authorship analysis, authorship identification, online messages, writing-style features.

1. INTRODUCTION
With the advent of the Internet, it has become easier to share information between people across time and space. This has brought both advantages and disadvantages, the latter including a new venue for criminal activities, collectively known as cybercrime. Examples include the distribution of illegal content in cyberspace, such as pornography and pirated software, as well as terrorism and hate speech. Cyber criminals have been extensively involved in distributing such illegal content and hate speech through Web-based channels such as websites, newsgroups, and forums. The Internet's anonymity gives them an upper hand in performing these activities: participating in cyber activities is easy because people usually do not have to provide real identity information, such as name, address, or gender. As a result, criminal identity tracing poses complex challenges for law enforcement agencies. To make matters worse, the sheer volume of cyber users and activities makes a manual approach to identity tracing incapable of meeting cybercrime-investigation requirements.

The need of the hour is to automate criminal identity tracing in cyberspace, allowing investigators to prioritize their tasks and focus on the major criminals. Authorship analysis can assist this activity by automatically extracting linguistic features from online messages and evaluating stylistic details for patterns of terrorist communication. However, related work on authorship analysis techniques has mostly remained on paper, with little real-life implementation, particularly in online communication. Furthermore, the global nature of terrorist activity has made it necessary to analyze multilingual content. Arabic has garnered specific attention in recent years for sociopolitical reasons that include possible ties between certain Middle Eastern groups and terrorism. Arabic also has morphological characteristics that pose several critical problems for current authorship analysis techniques.

2. LITERATURE REVIEW

2.1 Authorship Analysis


Authorship analysis is the process of examining the characteristics of a piece of work in order to draw conclusions about its authorship [2]. The problem can be broken down into three sub-fields. Author identification determines the likelihood that a particular author wrote a piece of work by examining other works produced by that author. Author characterization summarizes the characteristics of an author and generates an author profile based on his or her work; such characteristics include gender, educational and cultural background, and language familiarity. Similarity detection compares multiple pieces of work and determines whether or not they were produced by a single author, without actually identifying that author.


2.2 Feature Selection


Primarily, there are four types of writing-style features that facilitate authorship attribution: syntactic, lexical, structural, and content-specific.

Syntactic features refer to the patterns used to form sentences. They consist of the tools used to structure sentences, such as punctuation and function words (for example, "while" and "upon"). Usage patterns of function words can be effective features for authorship identification [1]: the difference between using "thus" or "hence" might seem subtle, but it can constitute a significant stylistic difference.

Lexical features can be either word-based or character-based. Word-based lexical features include characteristics such as the total number of words, words per sentence, and vocabulary richness. Vocabulary-richness measures include the number of words that occur once (hapax legomena) and twice (hapax dislegomena), as well as several statistical measures defined by previous studies. Character-based lexical features, on the other hand, include the total number of characters, characters per sentence, and the usage frequency of individual letters.

Structural features deal mainly with a text's organization and layout and have proved to be important in analyzing online messages. Researchers traditionally focused on word-based structures such as greetings and signatures, or on the number of paragraphs and average paragraph length. Such features provide useful discriminators but do not capture the additional information present in online messages; for example, the use of various font sizes and colors requires a conscious effort, making it a style marker.

Content-specific features are words that are important within a specific topic domain. For example, in a discussion about computers, words such as RAM and laptop would be among those encountered most often.
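To make these feature categories concrete, here is a minimal Python sketch (standard library only) that computes a few word-based lexical features and a simple function-word frequency for a single message; the small FUNCTION_WORDS list is illustrative and not the inventory used in the cited studies.

import re
from collections import Counter

# Illustrative function-word list; the cited studies use much larger sets.
FUNCTION_WORDS = {"while", "upon", "thus", "hence", "although", "whereas"}

def lexical_features(message):
    """Compute a few word-based and character-based lexical features."""
    words = re.findall(r"[A-Za-z']+", message.lower())
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    counts = Counter(words)
    return {
        "total_words": len(words),
        "total_chars": len(message),
        "words_per_sentence": len(words) / max(len(sentences), 1),
        "vocabulary_richness": len(counts) / max(len(words), 1),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),    # words used once
        "hapax_dislegomena": sum(1 for c in counts.values() if c == 2), # words used twice
        "function_word_count": sum(counts[w] for w in FUNCTION_WORDS),
    }

print(lexical_features("Thus the plan failed. Hence, while we waited, we wrote again."))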

2.3 Techniques for Authorship Analysis

In early studies, most analytical methods used in authorship analysis were statistical. The basic idea is that different authors produce different text compositions, characterized by a probability distribution of word usage. More specifically, given a population of an author's texts, the identification of a new text can be treated as a statistical hypothesis test or as a classification problem. Brainerd used chi-squared and related distributions to perform lexical data analysis. An important statistical test was introduced in Thisted and Efron's paper. Farringdon first applied the CUSUM technique to authorship analysis. Francis gave a summary of early statistical approaches used to resolve the Federalist Papers dispute. Baayen proposed a linguistic evaluation of diverse statistical models of word frequency. Although statistical methods achieved much success in authorship analysis, particular methods have constraints. For example, Holmes found CUSUM analysis to be unreliable because the stability of the measured characteristics across multiple texts is not warranted. Moreover, the predictive capability of statistical methods, such as attributing a new text to a certain author, is limited.

The drastic increase in computational power over the years has led to the emergence of machine-learning techniques, including Support Vector Machines (SVMs), neural networks, and decision trees. They provide greater scalability than statistical techniques for handling large feature sets, and they are less susceptible to noisy data. As a result, they have gained wider acceptance in authorship analysis studies in recent years. These benefits are important for working with online messages, which involve classifying many authors with a large feature set.
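As a rough illustration of this machine-learning approach (not the experimental setup of the cited studies), the sketch below compares an SVM with a decision tree on a placeholder feature matrix using scikit-learn; DecisionTreeClassifier is a CART-style stand-in for C4.5, and X and y are assumed to hold writing-style feature vectors and author labels.

import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 20))           # placeholder: one feature vector per message
y = rng.integers(0, 5, size=100)    # placeholder: labels for five authors

for name, clf in [("SVM", SVC(kernel="linear")),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(name, "mean accuracy:", round(scores.mean(), 3))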

3. ARABIC CHARACTERISTICS

Arabic is a Semitic language belonging to the Afro-Asiatic group. When it comes to stylistic and structural properties, Arabic poses some particular challenges. Inflection, diacritics, word length, and elongation are characteristics that need to be taken care of when applying authorship analysis to Arabic messages.

3.1 Inflection
Arabic is a highly inflected language: it consists of approximately 5,000 roots, each composed of three to five consonants, from which words and sentences are formed. The orthographic and morphological properties of Arabic result in significant lexical variation [1], because words can take on numerous forms. Inflection creates feature-extraction problems owing to the larger number of possible words, which weakens vocabulary-richness measures.

3.2 Diacritics
Diacritics are markings above or below letters used to indicate special phonetic values [1]. In English, for example, a diacritic is the small mark on top of the letter e in the word résumé [1]. Diacritics are used in Arabic to represent short vowels, consonant lengths, and relationships between words. However, diacritics are rarely used in online communication. Although readers can use sentence semantics to decipher the proper meaning, this is not feasible for an automated extraction program. For instance, without diacritics the words resume and résumé would look identical to a computer. The lack of diacritics can significantly impact the effectiveness of word-usage-based features such as function words. In Arabic, for example, it is impossible without diacritics to distinguish between the words for "who" and "from".
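To illustrate how missing diacritics collapse distinct words, the following Python sketch strips the common Arabic short-vowel marks (the Unicode range U+064B to U+0652); after this normalization, the words for "who" and "from", which differ only in their diacritics, become identical strings. This is a simplified illustration, not the preprocessing used in the cited studies.

import re

# Arabic harakat (short vowels and related marks) occupy U+064B..U+0652.
HARAKAT = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text):
    """Remove Arabic short-vowel diacritics, as most online messages omit them."""
    return HARAKAT.sub("", text)

who = "\u0645\u064E\u0646"    # 'who'  (meem + fatha + noon)
from_ = "\u0645\u0650\u0646"  # 'from' (meem + kasra + noon)
print(strip_diacritics(who) == strip_diacritics(from_))   # True: the words collide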


3.3 Word Length and Elongation


Arabic words are shorter than English words, which reduces the effectiveness of many lexical features in identifying authorship. For example, word-length features are less discriminating because they are distributed over a smaller range. The use of long, complex words in a sentence usually indicates how well versed a person is in his or her language, but because Arabic words are of roughly the same length whether simple or complex, this assumption does not hold. Elongation presents a further complication: Arabic words are at times elongated for purely stylistic reasons, using a special character that resembles a dash [1]. Arabic characters are joined during writing, so elongation is possible by lengthening the joins between letters; for example, the word MZKR (remind) can be extended with four short dashes between the M and the Z, doubling the word's size. Elongation has both pros and cons: it provides an important style marker, but it significantly inflates the values of word-length features.
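One practical way to handle elongation is to count the elongation character (as a style marker) and then strip it before computing word-length features. The sketch below assumes the elongation character is the Arabic tatweel (kashida, U+0640); it is a minimal illustration, not the exact filter used in [1].

TATWEEL = "\u0640"   # Arabic tatweel (kashida), used purely for stylistic elongation

def track_and_strip_elongation(word):
    """Return (elongation count, word with elongation removed)."""
    count = word.count(TATWEEL)
    return count, word.replace(TATWEEL, "")

# A three-letter word elongated with four kashidas after its first letter.
elongated = "\u0643" + TATWEEL * 4 + "\u062A\u0628"
print(track_and_strip_elongation(elongated))   # (4, the original 3-letter word)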


4. EXPERIMENT DESIGN
The test bed of relevant messages was taken from Web forums. The Arabic data set was taken from the Al-Aqsa Martyrs group, while the English data set came from the White Knights group. There were 400 messages for each language. The average message length was 76.6 words for the English data set and 580.69 words for the Arabic data set. The White Knights content revolved around political, racial, and religious issues; members commonly used profanities and advocated violence against groups they disliked. The Al-Aqsa Martyrs messages were mostly anti-America messages featuring lengthy arguments espousing the group's views, and they contained abundant embedded images and links relating to the war in Iraq and the treatment of Al-Qaeda prisoners.

Two classification techniques were used in the experiment: C4.5 and SVM. C4.5 is a powerful decision-tree-based classifier with strong analytical and explanatory potential for assessing key differences between the English and Arabic feature sets. SVM, on the other hand, is a computational learning method based on structural risk minimization [1]. It has gained popularity over the years because of its classification power and robustness, and it readily handles many input features owing to its capacity for dealing with noisy data.

English and Arabic feature sets were formed, consisting of 301 and 418 features respectively. Of the 301 English features, 87 were lexical, 158 syntactic, 45 structural, and 11 content-specific. The Arabic features were distributed as 79 lexical, 262 syntactic, 62 structural, and 15 content-specific.

To build the Arabic feature set, the language's morphological and orthographic properties were taken into consideration. The issue of inflection was handled using usage frequencies for a selected set of word roots, which compensated for the losses in vocabulary-richness measures. Root frequencies were tracked with a clustering algorithm designed by De Roeck and Al-Fares, which calculates similarity scores for each word against a dictionary containing more than 4,500 roots; each word was assigned to the root with the highest similarity score, and that root's usage frequency was incremented. An important issue was determining the number of roots to include in the final feature set. A trial-and-error approach was used, as in other multilingual authorship studies, because previous research has not yielded more definitive techniques: between 0 and 500 of the most frequently occurring roots were added to the complete Arabic feature set, their classification power was tested with SVM, and the optimal number (50 roots) was integrated into the feature set. To capture word length precisely, a filter was embedded in the Arabic feature extractor to remove elongation after it had been tracked. The absence of a feasible semantic tagger restricted the tracking of diacritics.
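The root-frequency idea described above can be sketched as follows: each word is scored against a root dictionary and assigned to the best-matching root, whose usage counter is then incremented. The similarity function (a sequence-similarity ratio from difflib) and the tiny transliterated root list below are placeholders; the actual algorithm of De Roeck and Al-Fares and the 4,500-root dictionary are not reproduced here.

from collections import Counter
from difflib import SequenceMatcher

# Placeholder transliterated roots; the real dictionary has more than 4,500 entries.
ROOTS = ["ktb", "drs", "slm", "qtl"]

def best_root(word):
    """Assign a word to the root with the highest similarity score."""
    return max(ROOTS, key=lambda r: SequenceMatcher(None, word, r).ratio())

def root_frequencies(words):
    counts = Counter()
    for w in words:
        counts[best_root(w)] += 1    # increment the selected root's usage frequency
    return counts

print(root_frequencies(["kitab", "maktab", "madrasa", "salam"]))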

5. IDENTIFICATION PROCESS

The complete online authorship-identification process involves three main steps: collection, extraction, and experimentation.

5.1 Collection and Extraction


The Web forums of interest are identified with the help of spidering programs, which crawl the Internet searching for potentially dangerous or abusive content that might relate to cybercrime and homeland-security issues. Once a forum is identified, collection programs take over the job of storing its messages in text and HTML formats. Extraction programs then facilitate the process by deriving writing-style characteristics such as lexical, syntactic, structural, and content-specific features. The complexity of the extraction programs varies across languages because of the special features present in each language (word length, elongation, and inflection in the case of Arabic).
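A minimal sketch of the collection step, assuming a list of forum page URLs has already been identified (the spidering and relevance-filtering logic is omitted); each page is stored both as raw HTML and as a rough plain-text copy, using only the Python standard library. The output directory name and example URL are hypothetical.

import pathlib
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Very rough HTML-to-text conversion for the plain-text copy."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def collect(url, out_dir="collected"):
    """Store one forum page as both HTML and plain text."""
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8", "replace")
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    name = url.rstrip("/").split("/")[-1] or "index"
    (out / (name + ".html")).write_text(html, encoding="utf-8")   # raw HTML copy
    extractor = TextExtractor()
    extractor.feed(html)
    (out / (name + ".txt")).write_text(" ".join(extractor.chunks), encoding="utf-8")

# collect("https://example.com/forum/thread-123")   # hypothetical URL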

5.2 Experiments
After extracting the feature values, the next step is to form feature sets. These sets are formed in a stepwise manner: the first set consisted of lexical features only; the second encompassed lexical and syntactic features; the third consisted of lexical, syntactic, and structural features; and the fourth consisted of all four categories, namely lexical, syntactic, structural, and content-specific features. Such a stepwise increment helps to identify the relevance of each writing-style characteristic to authorship analysis. Lexical and syntactic features are the most important categories and hence form the foundation on which structural and content-specific features are added. For the experiments on English and Arabic messages, 30 randomly selected samples of five authors were used, and each sample was evaluated using all 20 messages per author [1]. The two classifiers were applied one at a time, and a 30-fold cross-validation testing method was used in all experiments. Accuracy, recall, and precision measures were used to evaluate prediction performance; these measures have been commonly adopted in the information retrieval and authorship analysis literature. Accuracy indicates the overall prediction performance of a particular classifier [2]:
Accuracy = (number of messages with correctly identified author) / (total number of messages)

For a particular author, precision and recall measure the effectiveness of the approach at identifying messages written by that author. They are defined as [2]:

Precision = (number of messages correctly assigned to the author) / (total number of messages assigned to the author)

Recall = (number of messages correctly assigned to the author) / (total number of messages written by the author)
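These measures can be computed directly from a classifier's predictions; the sketch below computes overall accuracy plus precision and recall for one author, following the definitions above (the variable names are illustrative).

def evaluate(true_authors, predicted_authors, author):
    """Accuracy over all messages, plus precision and recall for one author."""
    pairs = list(zip(true_authors, predicted_authors))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    assigned = [t for t, p in pairs if p == author]   # messages assigned to the author
    written = [p for t, p in pairs if t == author]    # messages written by the author
    precision = sum(t == author for t in assigned) / max(len(assigned), 1)
    recall = sum(p == author for p in written) / max(len(written), 1)
    return accuracy, precision, recall

print(evaluate(["A", "A", "B", "C"], ["A", "B", "B", "C"], "A"))   # (0.75, 1.0, 0.5)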


6. ANALYSIS AND RESULTS

6.1 Feature Type Comparison


The authorship-prediction performance varied significantly across feature combinations. Pair-wise t-test results indicated the following.

Using style markers and structural features outperformed using style markers only [2]: significantly higher accuracies were achieved for all three data sets (p-values were all below 0.05) by adopting the structural features. This result might be explained by the fact that an author's consistent writing patterns show up in a message's structural features.

Using style markers, structural features, and content-specific features did not outperform using style markers and structural features [2]: adding content-specific features did not improve authorship-prediction performance significantly. Their weaker performance could be attributable to their smaller representation in the feature set; there were only 11 and 15 content-specific features for English and Arabic messages respectively, far fewer than in any other feature category.

Overall, the impact of the different feature types for Arabic was consistent with the results obtained on English messages.
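The pair-wise comparison can be reproduced with a paired t-test over per-sample accuracies of two feature-set configurations, for example with SciPy; the accuracy lists below are placeholders, not the reported results.

from scipy.stats import ttest_rel

# Placeholder per-sample accuracies for two feature sets evaluated on the same samples.
acc_style_only = [0.81, 0.78, 0.84, 0.80, 0.79, 0.83]
acc_style_plus_structural = [0.88, 0.85, 0.90, 0.86, 0.87, 0.89]

stat, p_value = ttest_rel(acc_style_plus_structural, acc_style_only)
print("t =", round(stat, 2), "p =", round(p_value, 4))   # p < 0.05 indicates a significant gain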

6.2 Classification Technique Comparison

The SVM technique significantly outperformed the decision-tree classifier in terms of accuracy, and it also attained better recall and precision. These results are consistent with previous studies, in which neural networks and SVM typically performed better than decision-tree algorithms. The difference in accuracy between the classifiers was far greater for Arabic messages than for English messages: SVM outperformed C4.5 by more than 20 percent on all feature-set combinations.

6.3 Decision Tree Analysis

The C4.5 decision tree is an effective analytical tool because of its descriptive nature. The results were obtained in percentage units signifying the importance a particular feature played in the analysis. Elongation features and nearly half of the word roots were among the important attributes to be considered in future analyses. Word length played a larger role in the White Knights messages (40 percent) than in the Al-Aqsa Martyrs messages (20 percent) [1]. The importance of punctuation, function words, and word-based structural features was fairly consistent across both languages, suggesting that syntactic and structural characteristics are fairly robust feature categories across languages. The messages extracted from the Al-Aqsa group made heavy use of various font sizes, colors, hyperlinks, and embedded images, leading to a greater disparity in feature importance within the technical-structure category.

The Al-Aqsa messages tended to be considerably longer than the KKK messages; in addition to overall length, their sentence lengths were longer, too. Al-Aqsa messages used a plethora of font colors and sizes, often as tools to emphasize a certain point: red, blue, and navy were almost as common as black. This was in sharp contrast to the KKK messages, in which black 10-to-12-point type was a fixture, with the exception of an occasional deviation to green or blue.

7. CONCLUSION AND FUTURE WORK

The experiments demonstrate that, with a set of carefully selected features and an effective learning algorithm, the authors of Internet newsgroup and email messages can be identified with reasonably high accuracy. There was a significant improvement in performance when structural features were added on top of style markers, and the SVM classifier outperformed the other classifiers on all occasions. The experimental results show a promising future for applying automatic authorship analysis in cybercrime investigation to address the identity-tracing problem. Using such techniques, investigators would be able to identify major cyber criminals who post illegal messages on the Internet, even though they may use different identities.

This study can be expanded in the future to include more authors and messages, to further demonstrate the scalability and feasibility of the SVM classifier and to find other suitable classifiers. Several potential directions can be pursued. Current authorship-identification methodologies are limited in the number of authors they can handle; they require significant upward scalability to discriminate between hundreds of potential authors, so the development of more complex methodologies for differentiating between a larger set of authors is an important future endeavor. A more comprehensive analysis of English and Arabic extremist-group authorship tendencies is also planned, to distinguish group-level differences from the linguistic disparities inherent between English and Arabic; for example, do the persuasive tendencies observed in the Al-Aqsa Martyrs messages have broader applicability to other extremist Arabic groups? The current approach can also be extended to analyze the authorship of other cybercrime-related materials, such as bomb threats, hate speech, and child-pornography images [2]. Another, more challenging, future direction is to automatically generate an optimal feature set specifically suited to a given data set.


8. REFERENCES
[1] Abbasi and Chen, 2005. Applying Authorship Analysis to Extremist Group Web Forum Messages. University of Arizona.

[2] Zheng, Huang, and Chen. Authorship Analysis in Cybercrime Investigation. Department of Management Information Systems, University of Arizona.

[3] Zheng, Chen, and Huang, 2006. A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques.

[4] Abbasi and Chen, 2006. Visualizing Authorship for Identification. Department of Management Information Systems, University of Arizona.

