Data Description
In this homework, we will work on real HTTP logs, which include 7837 HTTP requests. Most of these HTTP requests are normal, while some of them are attacks. Figure 1 shows examples of normal requests as well as of attacks in the log.
Figure 1: Examples of normal requests and attack requests in the HTTP logs. (a) A common request. (b) A JS XSS attack. It gathers the user's cookie and then sends a request to anewweb.com with the cookie in the query. (c) Remote file inclusion attack. The attacker grabs the password file. (d) DoS attack.
To separate the attacks from the normal HTTP requests, we transform each HTTP request into a vector that represents its character distribution. An 8-bit character code can take 256 values, but only 95 codes (33 to 127) appear in the HTTP requests, since unprintable characters are not allowed. The characters corresponding to ASCII codes 33 to 127 are listed at http://www.williamrobertson.net/documents/ascii.html. The character distribution is computed as the frequency of each ASCII code in the path source of an HTTP request. For example,
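The transformation described above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the data set: the function name `char_distribution` and the sample path are hypothetical, and it assumes that "frequency" means the count of each code divided by the number of counted characters.

```python
from collections import Counter

FIRST_CODE = 33   # first ASCII code counted, as stated in the text
LAST_CODE = 127   # last ASCII code counted, as stated in the text


def char_distribution(path):
    """Return a 95-element list of relative character frequencies."""
    counts = Counter(ord(ch) for ch in path)
    n_bins = LAST_CODE - FIRST_CODE + 1  # 95 bins, one per code
    in_range = [counts.get(code, 0) for code in range(FIRST_CODE, LAST_CODE + 1)]
    total = sum(in_range)
    if total == 0:
        # No countable character: return an all-zero vector.
        return [0.0] * n_bins
    # Assumption: frequency = count / number of counted characters.
    return [count / total for count in in_range]


# Hypothetical request path, used only for illustration.
vector = char_distribution("/index.php?id=1")
```

Entry `vector[ord(c) - 33]` holds the share of character `c` in the path. If the data set stores raw counts rather than relative frequencies, the final division would simply be dropped.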
The 7837 HTTP vectors can be downloaded from http://www.lri.fr/~xlzhang/KAUST/CS340_slides/data/http_vector.mat. The data file contains two variables:
data: the HTTP vectors (7837*95)
label: the class label of each HTTP request (7837*1); 0 denotes a normal request and 1 an attack (only 234 of the requests are attacks).
In this classification task, we randomly divide the data into two disjoint parts, e.g., 40% for training and the remaining 60% for testing. You may choose the training/testing proportion according to the complexity of your classification method. You can use any classification method we have discussed, e.g., k-NN, SVM, or Naive Bayes.
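One way to produce the random, disjoint split is sketched below in Python. The helper `split_indices` and its `seed` parameter are hypothetical names introduced for illustration; the 40% training fraction is the example value from the text.

```python
import random


def split_indices(n_samples, train_fraction=0.4, seed=0):
    """Shuffle row indices and split them into disjoint train/test lists."""
    indices = list(range(n_samples))
    # A fixed seed makes the random split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


# 7837 is the number of HTTP vectors stated in the text.
train_idx, test_idx = split_indices(7837, train_fraction=0.4)
```

The two index lists can then be used to select rows of data and label. If the file is read in Python, `scipy.io.loadmat` is one option for loading the `.mat` variables. Because only 234 of the requests are attacks, it may be worth checking that both parts contain a reasonable number of them.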
(2) Use your classification method on the training data, described by the selected features, to learn a classifier. Report the false positive rate (FPR) and true positive rate (TPR) of this classifier on the test data set.
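The two reported quantities can be computed as in the following Python sketch, which uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), with attacks (label 1) as the positive class. The function name `fpr_tpr` and the toy labels are hypothetical and do not come from the data set.

```python
def fpr_tpr(y_true, y_pred):
    """Return (FPR, TPR), treating label 1 (attack) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1          # attack correctly flagged
        elif truth == 1:
            fn += 1          # attack missed
        elif pred == 1:
            fp += 1          # normal request wrongly flagged
        else:
            tn += 1          # normal request correctly passed
    # Guard against empty classes to avoid division by zero.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


# Toy labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0]
fpr, tpr = fpr_tpr(y_true, y_pred)
```

In this toy case one of four normal requests is flagged and one of two attacks is detected, so FPR is 0.25 and TPR is 0.5.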
(2) Select the k principal components that cover 90% of the variance. Project all the data onto these k principal components, so that the data are represented by an N*k matrix. Divide the projected data (N*k) into training and testing parts. Use your classification method on the k-dimensional training data to learn a classifier. Report the FPR and TPR of this classifier on the test data set (in k dimensions).
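The selection of k and the projection can be sketched in Python with NumPy as follows. This is one possible implementation, based on the singular value decomposition of the centered data; the function name `pca_project`, its `variance_to_keep` parameter, and the synthetic matrix are assumptions made for illustration, standing in for the real N*95 data.

```python
import numpy as np


def pca_project(data, variance_to_keep=0.90):
    """Project data onto the fewest principal components that reach the target variance."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2          # proportional to per-component variance
    ratio = np.cumsum(explained) / explained.sum()
    # Smallest k whose cumulative variance ratio reaches the target.
    k = int(np.searchsorted(ratio, variance_to_keep) + 1)
    k = min(k, len(ratio))
    return centered @ vt[:k].T, k


# Synthetic stand-in for the real data matrix (not the HTTP vectors).
rng = np.random.default_rng(0)
demo = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 0.1, 0.1, 0.1])
projected, k = pca_project(demo)
```

The returned matrix has one row per sample and k columns, matching the N*k representation in the task, and can then be divided into training and testing parts.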