
AMCS/CS 340 Data Mining, Fall 2010

Homework 6: Feature Selection/Dimensionality Reduction


Due Wednesday Nov 30, 11:59pm
Submit by the blackboard system
The goal of this homework is to become familiar with Feature Selection and Dimensionality Reduction algorithms.

Data Description
In this homework, we will work on real HTTP logs, which include 7837 HTTP requests. Most of these HTTP requests are normal, while some of them are attacks. Figure 1 shows examples of normal requests as well as of attacks in the log.

Figure 1: Examples of normal requests and attack requests in the HTTP logs. (a) A common request. (b) A JS XSS attack. It gathers the user's cookie and then sends a request to anewweb.com with the cookie in the query. (c) Remote file inclusion attack. The attacker grabs the password file. (d) DoS attack.

In order to distinguish the attacks from the normal HTTP requests, we transform each HTTP request into a vector, namely its character distribution. There are 256 possible byte values in total, but only the 95 ASCII codes between 33 and 127 appear in the HTTP requests (unprintable characters are not allowed). See the characters corresponding to ASCII codes 33 through 127 at http://www.williamrobertson.net/documents/ascii.html. The character distribution is computed as the frequency of each ASCII code in the path of an HTTP request.
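The worked example from the original handout is not reproduced in this copy. As an illustration of the computation just described, here is a minimal NumPy sketch (the helper name char_distribution and the sample path are ours, not part of the assignment):

```python
import numpy as np

def char_distribution(request_path):
    """Frequency of each ASCII code in 33..127 (the range used in the handout).

    Returns a 95-dimensional vector; for non-empty input it sums to 1.
    """
    counts = np.zeros(95)
    for ch in request_path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

vec = char_distribution("/index.php?id=42")
print(vec.shape)             # (95,)
print(round(vec.sum(), 6))   # 1.0
```

Each request path is mapped to the same fixed-length vector regardless of its length, which is what makes the 7837*95 data matrix possible.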

As a consequence, each HTTP request is represented by a vector of 95 dimensions.



The 7837 HTTP vectors can be downloaded from http://www.lri.fr/~xlzhang/KAUST/CS340_slides/data/http_vector.mat. The data file contains two variables:

data --- HTTP vectors (7837*95)
label --- class labels of the HTTP requests (7837*1); 0: normal request, 1: attack (only 234 attacks)

For this classification task, we randomly divide the data into 2 disjoint parts, e.g., 40% for training and the remaining 60% for testing. Choose the training/testing proportion according to the (computational) complexity of your classification method. For the classification method, you may use any one we have discussed, e.g., k-NN, SVM, or Naïve Bayes.
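The loading-and-splitting step can be sketched as follows (assuming the .mat file has been downloaded, in which case scipy.io.loadmat reads it; the helper split_data and the small synthetic stand-in below are ours):

```python
import numpy as np

def split_data(data, label, train_frac=0.4, seed=0):
    """Randomly split rows into disjoint training and testing parts."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    idx = rng.permutation(n)
    n_train = int(train_frac * n)
    tr, te = idx[:n_train], idx[n_train:]
    return data[tr], label[tr], data[te], label[te]

# With the real file:
#   from scipy.io import loadmat
#   m = loadmat("http_vector.mat")
#   data, label = m["data"], m["label"].ravel()
data = np.random.default_rng(1).random((100, 95))   # synthetic stand-in
label = (np.arange(100) < 5).astype(int)            # 5 "attacks"
Xtr, ytr, Xte, yte = split_data(data, label, train_frac=0.4)
print(Xtr.shape, Xte.shape)  # (40, 95) (60, 95)
```

Because attacks are rare (234 of 7837), it is worth checking that both splits actually contain some attack examples.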

Question 1 (15 points) (full features)


Use your classification method on the HTTP vectors to learn a classifier from the training data. Report the False Positive Rate (FPR) and True Positive Rate (TPR) of this classifier on the test data set.
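Whatever classifier you choose, FPR and TPR follow from the four confusion counts, with label 1 (attack) as the positive class. A minimal sketch (the helper name fpr_tpr is ours):

```python
import numpy as np

def fpr_tpr(y_true, y_pred):
    """FPR = FP/(FP+TN), TPR = TP/(TP+FN); label 1 = attack (positive)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return fp / (fp + tn), tp / (tp + fn)

fpr, tpr = fpr_tpr([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
print(fpr, tpr)  # one false alarm out of 3 normals, one attack of 2 caught
```

Note that with only 234 attacks among 7837 requests, plain accuracy is misleading (predicting "normal" everywhere is ~97% accurate), which is why the assignment asks for FPR and TPR.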

Question 2 (20 points) (selected features by ranking)


Use one of the individual feature ranking methods we have discussed (e.g., AUC, mutual information, the t-test) to rank all the features. Choose the k highest-ranked features, and remove the unselected features from both the training data set and the test data set. (1) Report the features you selected.

(2) Use your classification method on the training data, described by the selected features, to learn a classifier. Report the FPR and TPR of this classifier on the test data set.
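As one concrete instance of such a ranking criterion, the sketch below scores each feature by the absolute Welch t-statistic between the two classes and sorts features by it (the helper rank_by_tstat and the synthetic data are ours; AUC or mutual information would plug into the same ranking loop):

```python
import numpy as np

def rank_by_tstat(X, y):
    """Absolute Welch t-statistic of each feature between the two classes."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(t)[::-1]          # feature indices, best first

# synthetic data: feature 7 is made clearly discriminative
rng = np.random.default_rng(0)
X = rng.random((200, 95))
y = np.zeros(200, dtype=int)
y[:20] = 1
X[y == 1, 7] += 5.0
order = rank_by_tstat(X, y)
print(order[0])  # 7
```

Keeping the top k features is then a single slice: X[:, order[:k]], applied identically to the training and test sets.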

Question 3 (25 points) (selected features by forward selection)


Use the forward feature selection method to select a subset of features. You may use Matlab, Weka, or another available toolbox. (1) Report the features you selected.

(2) Use your classification method on the training data, described by the selected features, to learn a classifier. Report the FPR and TPR of this classifier on the test data set.
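If you prefer not to use a toolbox, forward selection is short to write directly: start empty and greedily add the feature that most improves validation accuracy of some base classifier. The nearest-centroid scorer below is only a stand-in for your actual classification method, and all names are ours:

```python
import numpy as np

def centroid_score(X, y, feats, Xv, yv):
    """Validation accuracy of a nearest-centroid rule on the chosen features."""
    c0 = X[y == 0][:, feats].mean(axis=0)
    c1 = X[y == 1][:, feats].mean(axis=0)
    d0 = ((Xv[:, feats] - c0) ** 2).sum(axis=1)
    d1 = ((Xv[:, feats] - c1) ** 2).sum(axis=1)
    return np.mean((d1 < d0).astype(int) == yv)

def forward_select(X, y, Xv, yv, k):
    """Greedily add the feature that most improves validation accuracy."""
    chosen = []
    for _ in range(k):
        remaining = [f for f in range(X.shape[1]) if f not in chosen]
        scores = [centroid_score(X, y, chosen + [f], Xv, yv) for f in remaining]
        chosen.append(remaining[int(np.argmax(scores))])
    return chosen

# synthetic data: features 2 and 5 are made informative
rng = np.random.default_rng(0)
X = rng.random((300, 10))
y = np.zeros(300, dtype=int)
y[:60] = 1
X[y == 1, 2] += 2.0
X[y == 1, 5] += 2.0
Xtr, ytr, Xv, yv = X[::2], y[::2], X[1::2], y[1::2]
chosen = forward_select(Xtr, ytr, Xv, yv, 2)
print(chosen)
```

Forward selection evaluates candidate subsets with the classifier itself (a wrapper method), which is why it is costlier than the individual ranking of Question 2 but can capture interactions between features.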

Question 4 (20 points) (Dimensionality reduction by PCA)


Apply PCA to the whole data set of 7837 HTTP requests (without the labels). (1) Find the first 2 principal components and use them as the x-axis and y-axis. Project all the HTTP requests onto these 2 dimensions. Plot the normal and abnormal records in different colors.

(2) Select the k principal components that cover 90% of the variance. Project all the data onto these k principal components, representing the data as an N*k matrix. Divide the data (N*k) into training and testing parts. Use your classification method on the k-dimensional training data to learn a classifier. Report the FPR and TPR of this classifier on the test data set (in k dimensions).
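Choosing k from the 90%-of-variance criterion can be sketched via an eigendecomposition of the covariance matrix (the helper pca_components is ours; the synthetic matrix concentrates variance in a few coordinates so that k comes out small):

```python
import numpy as np

def pca_components(X, var_frac=0.90):
    """Return (k, W): the smallest k whose components cover var_frac of the
    variance, and the d x k projection matrix W of top principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(vals)[::-1]            # sort descending
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, var_frac)) + 1
    return k, vecs[:, :k]

# synthetic stand-in: variance concentrated in the first few directions
rng = np.random.default_rng(0)
X = rng.random((500, 10)) @ np.diag([5, 3, 1, .1, .1, .1, .1, .1, .1, .1])
k, W = pca_components(X, 0.90)
Z = (X - X.mean(axis=0)) @ W     # N x k representation
print(k, Z.shape)
```

The same centering and projection must be reused on the test rows; only the labels are ignored while fitting the components.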

Question 5 (20 points) (Dimensionality reduction by SVD)


Apply SVD to the whole data set of 7837 HTTP requests (without the labels): X = U*S*V', where S is the diagonal matrix of singular values. Select the k largest singular values (k can be the same as in Question 4). Remove the small singular values and the corresponding columns of U and V; after removing, U -> U_k, S -> S_k, V -> V_k. Reconstruct the data as Y = U_k*S_k*V_k'. Divide the reconstructed data Y into training and testing parts. Use your classification method on the training data to learn a classifier. Report the FPR and TPR of this classifier on the test data set.
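The truncated reconstruction can be sketched in a few lines (the helper svd_reconstruct is ours; note that numpy.linalg.svd returns V already transposed, with singular values in descending order):

```python
import numpy as np

def svd_reconstruct(X, k):
    """Rank-k reconstruction Y = U_k * S_k * V_k' via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k, :]   # scale columns of U_k, then project

# an exactly rank-3 matrix reconstructs perfectly with k = 3
rng = np.random.default_rng(0)
A = rng.random((50, 3)) @ rng.random((3, 95))
Y = svd_reconstruct(A, 3)
print(np.allclose(A, Y))  # True
```

Unlike Question 4, the reconstructed Y keeps the original 95 dimensions; only the rank (the effective information content) is reduced.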