
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 8, NO. 1, JANUARY 2013

Internet Traffic Classification by Aggregating Correlated Naive Bayes Predictions


Jun Zhang, Member, IEEE, Chao Chen, Yang Xiang, Senior Member, IEEE, Wanlei Zhou, Senior Member, IEEE, and Yong Xiang, Senior Member, IEEE

Abstract: This paper presents a novel traffic classification scheme to improve classification performance when few training data are available. In the proposed scheme, traffic flows are described using discretized statistical features, and flow correlation information is modeled by bag-of-flow (BoF). We solve the BoF-based traffic classification in a classifier combination framework and theoretically analyze the performance benefit. Furthermore, a new BoF-based traffic classification method is proposed to aggregate the naive Bayes (NB) predictions of the correlated flows. We also present an analysis of the prediction error sensitivity of the aggregation strategies. Finally, a large number of experiments are carried out on two large-scale real-world traffic datasets to evaluate the proposed scheme. The experimental results show that the proposed scheme can achieve much better classification performance than existing state-of-the-art traffic classification methods.

Index Terms: Traffic classification, network security, naive Bayes.

I. INTRODUCTION

Application-oriented traffic classification is a fundamental technology for modern network security. It is useful for tackling a number of network security problems, including lawful interception and intrusion detection [1]. For example, traffic classification can be used to detect patterns indicative of denial-of-service attacks, worm propagation, intrusions [2], and spam spread. In addition, traffic classification plays an important role in modern network management, such as quality of service (QoS) control. Many open source and commercial tools [3], [4] with traffic classification functions have been deployed, and there is an increasing demand for the development of modern traffic classification techniques [1], [5].

While traditional traffic classification techniques may rely on the port numbers specified by different applications or the signature strings in the payload of IP packets, modern techniques normally utilize host/network behavior analysis or flow-level statistical features by taking emerging and encrypted applications into account [6], [7]. Recently, substantial attention has been paid to the application of machine learning techniques to statistical-feature-based traffic classification [1]. In the state-of-the-art traffic classification methods, Internet traffic is characterized by a set of flow statistical properties, and machine learning techniques are applied to automatically search for structural patterns. These methods can address the problems that the traditional methods suffer from, such as dynamic port numbers and user privacy protection.

Recent research shows that flow-statistical-feature-based traffic classification can be enhanced by feature discretization. In particular, feature discretization is able to dramatically affect the performance of naive Bayes (NB). NB, one of the earliest classification methods applied to Internet traffic classification [7], is a simple and effective probabilistic classifier that applies Bayes' theorem with naive feature independence assumptions [8]. Since independent features are assumed, an advantage of the NB classifier is that it requires only a small amount of training data to estimate the parameters of a classification model. However, performance degradation of the NB traffic classifier is reported in existing works [5], [9]. Lim et al. found that the main reason for the underperformance of a number of traditional classifiers, including NB, is the lack of a feature discretization process [10]. For example, feature discretization can effectively improve the accuracies of the support vector machine (SVM) and k-NN algorithms at the price of lower classification speed. More interestingly, NB with feature discretization demonstrates not only significantly higher accuracy but also much faster classification speed.

Manuscript received December 06, 2011; revised April 29, 2012; accepted October 03, 2012. Date of publication October 09, 2012; date of current version December 26, 2012. This work was supported in part by ARC Discovery Project DP1095498, in part by ARC Linkage Project LP100100208, and in part by ARC Linkage Project LP120200266. The associate editor coordinating the review of this manuscript and approving it for publication was C.-C. Jay Kuo. The authors are with the School of Information Technology, Deakin University, Melbourne, 3125, Australia (e-mail: jun.zhang@deakin.edu.au; chao.chen@deakin.edu.au; yang.xiang@deakin.edu.au; wanlei@deakin.edu.au; yong.xiang@deakin.edu.au). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIFS.2012.2223675
Considering complex network situations, a difficult question is how to obtain a high-performance statistical-feature-based traffic classifier using a small set of training data. Solutions to this question are essential for addressing a number of difficult problems in the field of network security and management. For instance, in practice, we may only be able to manually label very few samples as supervised training data, since traffic labelling is time-consuming, especially for new applications and encrypted applications. Moreover, a big challenge for current network management is to handle a large number of emerging applications, where it is almost impossible to collect sufficient training samples in a limited time. These observations motivate our work. In this paper, we provide a solution to effectively improve an NB-based traffic classifier with a small set of training samples.

1556-6013/$31.00 © 2012 IEEE


The idea is to seamlessly incorporate flow correlation [11] into the NB-based classification process with feature discretization. Our major contributions are as follows.

We propose a new traffic classification scheme to utilize the information among the correlated traffic flows generated by an application. In the proposed scheme, bag-of-flow (BoF) is introduced for modelling correlated flows, and the new BoF-based traffic classification is solved by aggregating correlated NB predictions.

We provide a theoretical study of the proposed scheme. First, we explain why the proposed scheme works in a theoretical framework of classifier combination. Second, we analyze the sensitivities to prediction errors of the different aggregation rules employed in the proposed scheme.

We present a comprehensive evaluation of the proposed scheme on two large-scale real-world network datasets. The empirical study shows that the proposed scheme can effectively improve the traffic classification performance with a small set of training data, and that it outperforms the existing state-of-the-art traffic classification methods.

All code and data related to this work will be available on request.

The remainder of the paper is structured as follows. Section II reviews related work. The new traffic classification scheme is proposed in Section III. Section IV presents the experimental results, followed by a theoretical analysis of error sensitivity in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORK

In the area of network traffic classification, the state-of-the-art methods employ flow statistical features and machine learning techniques [1]. Many supervised classification algorithms and unsupervised clustering algorithms have been applied to categorize Internet traffic. In supervised traffic classification, the traffic classes are predefined according to real applications, and a set of labelled training samples is manually collected for classifier construction.
In contrast, the clustering-based methods can automatically group a set of unlabeled training samples and use the clustering results to train a traffic classifier. However, the number of clusters has to be set large enough to obtain useful and accurate traffic clusters, which results in a problem of mapping from a large number of traffic clusters to a small number of real applications [12]-[16]. This problem is very difficult to solve without knowing any information about real applications.

A lot of effort has been made to develop effective supervised methods with the consideration of various network applications and situations. In early work, Moore and Zuev [7] applied naive Bayes techniques to classify network traffic based on flow statistical features. Later, several well-known algorithms were also applied to traffic classification, such as Bayesian neural networks [17] and support vector machines [18]. Erman et al. [19] proposed to use unidirectional statistical features to facilitate traffic classification in the network core. Taking real-time requirements into account, several supervised classification methods [20], [21] were proposed that use only the first few packets. Other existing works include the Pearson's chi-square test based technique [22], probability density function (PDF) based protocol fingerprints [23], and small time-window based packet counts [24]. Different methods may have their own advantages in different network situations.

Some empirical studies [25], [9], [5], [26] evaluated the traffic classification performance of different methods for practical usage. Roughan et al. [25] tested NN and LDA methods for traffic classification using five categories of statistical features. Williams et al. [9] compared supervised algorithms including naive Bayes with discretization, naive Bayes with kernel density estimation, C4.5 decision tree, Bayesian network and naive Bayes tree. Kim et al. [5] extensively evaluated the ports-based CoralReef method, the host-behavior-based BLINC method and seven common statistical-feature-based methods using supervised algorithms on seven different traffic traces.

A recent research finding is that feature discretization is critical and essential for Internet traffic classification [10]. By investigating the reasons for C4.5 performing very well under any circumstances, Lim et al. discovered that feature discretization can substantially improve the classification accuracy of every tested machine learning algorithm [10].

Since the performance of supervised methods is sensitive to the size of the training data, some proposals have tried to address this problem. Erman et al. [27] proposed to use a set of supervised training data in an unsupervised approach to address the problem of mapping from flow clusters to real applications. However, the mapping method will produce a large proportion of unknown clusters, especially when the supervised training data set is very small.

Another recent research finding is that flow correlation can be beneficial to traffic classification. Ma et al. [11] proposed a payload-based clustering method for protocol inference, in which they grouped flows into equivalence clusters using a 3-tuple heuristic, i.e., the flows sharing the same destination IP, destination port and transport layer protocol are generated by the same application. Canini et al. [28] tested the correctness of the 3-tuple heuristic with real-world traces. In our previous work [29], we applied the heuristic to improve unsupervised traffic clustering. However, it is unclear why flow correlation is helpful to traffic classification and how to apply flow correlation in the supervised classification approach. The problem of how to effectively classify network traffic using a small set of training data is still to be solved.

III. PROPOSED CLASSIFICATION SCHEME

This section presents a novel NB-based classification scheme to deal with correlated flows in an effective way, which can significantly improve the classification performance even with a small set of supervised training data.

A. Classification Process

Fig. 1 illustrates the classification process of our proposed scheme, which is focused on flow-level traffic classification. In the preprocessing, the system captures IP packets crossing a target network and constructs traffic flows by checking the headers of IP packets. A flow consists of successive IP packets with the same 5-tuple: source IP, source port, destination IP, destination port, and transport layer protocol. We apply a heuristic to determine the correlated flows and model them

ZHANG et al.: INTERNET TRAFFIC CLASSIFICATION BY AGGREGATING CORRELATED NAIVE BAYES PREDICTIONS

Fig. 1. Classification process of correlated traffic.

using bag-of-flows (BoF). If the flows observed in a certain period of time share the same destination IP, destination port, and transport layer protocol, they are determined to be correlated flows and form a BoF. For classification purposes, a set of flow statistical features is extracted and discretized to represent traffic flows. A novel approach is proposed for traffic classification, namely aggregation of correlated NB predictions, which consists of two steps. In the first step, the single NB predictor produces the a posteriori class probabilities for each flow. In the second step, the aggregated predictor aggregates the flow predictions (a posteriori probabilities) to determine the final class for BoFs.

B. A BoF-Based Classification Framework

In the proposed scheme, a set of correlated flows generated by the same application is modelled using a bag of flows (BoF). Since the flows in a BoF belong to the same application-based class, such correlation information can be utilized to improve the classification results. Therefore, we aim to aggregate the individual predictions of the correlated flows so as to conduct more accurate classification. Our research shows that the goal can be achieved by following the approach of classifier combination.

The BoF-based classification can be fitted into Kittler's theoretical framework [30] for classifier combination. Consider a traffic classification problem where a pattern (BoF) is to be assigned to one of the possible traffic classes. Let us assume that we have a single classifier, but the given pattern can be represented by using distinct measurement vectors (the flows in this BoF). This is a typical classifier combination architecture of repeated measurements [31]. In the measurement space, each class is modelled by a probability density function and has an a priori probability of occurrence. According to Bayesian decision theory, given the measurements, the pattern (BoF) should be assigned to the class for which the a posteriori probability of that interpretation is maximum, as stated in (1), where the assignment means assigning all flows in the BoF to that traffic class in this work. Kittler et al. have proved that a number of classifier combination rules can be derived from Bayesian decision theory [30]. Specifically, the complex prediction can be computed by combining simple predictions. Therefore, classification of a BoF can be addressed by aggregating the flow predictions produced by a conventional classifier.

We derive a proof to explain why the aggregation of flow predictions works. Suppose we have a simple predictor and a training set. The aggregation can be described as in (2), where the aggregated predictor is defined through an expectation. Consider the class label of a flow which belongs to a BoF. Both the flow and its class label are random variables drawn from a distribution independent of the training set. The average classification error on BoFs, estimated by the simple predictor, is given in (3). The corresponding classification error estimated by the aggregated predictor is given in (4). Using the inequality (5), we have (6). Through (3), (4) and (6), we can obtain (7).

Analyses of classifier combination using bagging and random subspace are provided in [32] and [33]. There is a strong assumption [33] that the average performance of all the individual classifiers, each trained on a subset of features and the training set replicas, is similar to that of a classifier which uses the full feature set and the whole training set. This assumption is not always true, and we do not make such an assumption here. From the inequality in (7), one can see that a more accurate aggregated classifier can be obtained with higher diversity of the simple predictor. In our work, the simple predictor is unstable due to the small set of training data. Consequently, aggregating correlated flow predictions into the aggregated predictor can improve the performance.
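The two ideas above, forming BoFs with the 3-tuple heuristic and combining the per-flow predictions inside each BoF, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dictionary keys (`dst_ip`, `dst_port`, `proto`), the function names, and the representation of a flow prediction as a dictionary of class posteriors are assumptions made for illustration. The time window in which flows are observed is not modelled: all flows passed in are assumed to fall in the same period. The combination rules shown are the generic sum, max, median and majority vote rules from the classifier combination literature, under equal priors.

```python
from collections import defaultdict
from statistics import median


def group_into_bofs(flows):
    """Group flows that share (destination IP, destination port, protocol).

    Each flow is a dict with the hypothetical keys 'dst_ip', 'dst_port'
    and 'proto'. All flows are assumed to come from one time window.
    """
    bofs = defaultdict(list)
    for flow in flows:
        key = (flow["dst_ip"], flow["dst_port"], flow["proto"])
        bofs[key].append(flow)
    return dict(bofs)


def aggregate(posteriors, rule="sum"):
    """Pick one class for a BoF from the per-flow posteriors.

    posteriors: list of dicts mapping class name -> P(class | flow),
    one dict per flow in the BoF, all with the same class names.
    """
    classes = list(posteriors[0])
    if rule == "sum":
        score = {c: sum(p[c] for p in posteriors) for c in classes}
    elif rule == "max":
        score = {c: max(p[c] for p in posteriors) for c in classes}
    elif rule == "median":
        score = {c: median(p[c] for p in posteriors) for c in classes}
    elif rule == "vote":
        # Each flow votes for its own most probable class.
        score = dict.fromkeys(classes, 0)
        for p in posteriors:
            score[max(classes, key=p.get)] += 1
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(classes, key=score.get)
```

In this sketch the final label returned by `aggregate` would be assigned to every flow in the BoF, which mirrors the statement that all flows in a BoF are assigned to the same traffic class.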

C. Aggregation of Correlated NB Predictions

We present a new approach, BoF-based NB (BoF-NB), to aggregate correlated NB predictions in this work, which results in a more accurate aggregated predictor for traffic classification.

1) Single NB Predictor: The naive Bayes classifier is chosen for our scheme for two reasons. First, it has demonstrated high classification speed and good performance using discretized statistical features in traffic classification. Second, it is easy for the naive Bayes classifier to produce the posterior probability that a testing flow belongs to a traffic class. According to Bayesian decision theory [8], the maximum a posteriori classifier minimizes the average classification error. The key point is to estimate the posterior probability that a testing flow belongs to a traffic class. Given a flow, the posterior probability corresponding to a class is given in (8). Using Bayes' theorem, we have (9). Under the naive conditional independence assumption that each feature is conditionally independent of every other feature, (8) becomes (10), which includes a scaling factor.

In the proposed scheme, we use the NB algorithm to produce a set of posterior probabilities as predictions for each testing flow. This differs from the conventional NB classifier, which directly assigns a testing flow to the class with the maximum posterior probability. Considering correlated flows, the predictions of multiple flows will be aggregated to make a final prediction.

2) Aggregated Predictor: Under Kittler's theoretical framework [30], a number of combination methods can be derived from Bayesian decision theory and used for the aggregated predictor. The aggregated classifier can be expressed as in (11), in terms of the combination method. In this paper, we use the equal prior assumption for all combination rules. Based on previous research [30] and our empirical experience, the product rule and the min rule are quite sensitive to noisy samples and weak classifiers. Therefore, we use the sum rule, the max rule, the median rule and the majority vote rule for flow aggregation and evaluate these rules in the experiments.

The aggregated classifier using the sum rule is given in (12), from which we obtain the decision rule (13). The aggregated classifier using the max rule is given in (14), from which one can obtain the decision rule (15). The aggregated classifier using the median rule is given in (16), and the corresponding decision rule is (17). The aggregated classifier using the majority vote rule is given in (18), which relies on the binary-valued function defined in (19). This results in the decision rule (20).

The effects of different aggregation rules on traffic classification are evaluated in the next section.

IV. EXPERIMENTAL EVALUATION

In this section, we evaluate the proposed BoF-NB scheme on two real-world traffic datasets. The proposed BoF-NB scheme is compared to four state-of-the-art traffic classification methods


TABLE I UNIDIRECTIONAL STATISTICAL FEATURES

Fig. 2. Impact of feature discretization (a) on isp dataset and (b) on wide dataset.

including C4.5, k-NN, NB [10] and Erman's semisupervised method [27] in the situation of a small number of supervised training samples.

To establish the ground truth for the testing datasets, we have developed a deep packet inspection (DPI) tool that matches regular expression signatures against flow payload content [29]. A number of application signatures are developed based on previous experience and some well-known tools such as l7-filter (http://l7-filter.sourceforge.net) and Tstat (http://tstat.tlc.polito.it). Also, several encrypted and new applications are investigated by manual inspection of the unidentified traffic.

Our empirical study uses two testing datasets which are created from two real-world network traffic traces, wide [34] and isp [29], respectively. The wide dataset consists of 182 k traffic flows which are randomly selected from the wide trace and carefully recognized by the DPI tool and manual inspection. All flows in the wide dataset are categorized into 6 application-oriented classes. For the wide dataset, there are only a small number of classes and the HTTP flows dominate the whole dataset. The other is the isp dataset created from our isp trace. The isp dataset consists of 200 k flows randomly sampled from 11 major classes. To avoid dominating classes, we randomly select up to 30 k flows from every class. The wide and isp datasets can well represent the different natures of two real-world network traffic traces. A large number of experimental results obtained on the two datasets with different characteristics are statistically significant. The experimental results can effectively demonstrate the classification capability of various traffic classification methods.

In the experiments, 20 unidirectional flow statistical features are extracted and used to represent traffic flows, which are listed in Table I. We apply feature selection to remove irrelevant and redundant features from the feature set [35], [5].
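To make the notion of unidirectional flow statistical features concrete, the following Python sketch computes a few statistics of the kind commonly used for this purpose from the packets of one direction of a flow. It is only an illustration: the feature names and the exact set computed here are assumptions, and the 20 features actually used in the experiments are those listed in Table I. The function name `flow_features` and the packet representation as `(timestamp, size)` pairs are likewise hypothetical.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple statistics for one direction of a flow.

    packets: non-empty list of (timestamp_seconds, size_bytes) tuples
    in arrival order. The returned feature names are illustrative only.
    """
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    # Inter-packet times; a single-packet flow gets one zero gap.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "packets": len(sizes),
        "bytes": sum(sizes),
        "size_min": min(sizes),
        "size_max": max(sizes),
        "size_mean": mean(sizes),
        "size_std": pstdev(sizes),
        "ipt_min": min(gaps),
        "ipt_max": max(gaps),
        "ipt_mean": mean(gaps),
        "ipt_std": pstdev(gaps),
    }
```

A feature vector of this form would then go through feature selection and discretization before being passed to a classifier.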
The correlation-based feature subset selection is used in the experiments, which searches for a subset of features with high class-specific correlation and low intercorrelation. A Best First search [36] is used to create candidate sets of features. The process of feature selection [36] yields 6 features for the isp dataset and 6 features for the wide dataset, respectively.

Feature discretization can significantly improve the classification performance of many supervised classification algorithms [10]. We also incorporate feature discretization [37] into our proposed scheme.

Two common metrics are used to measure the classification performance [5]: overall accuracy and F-Measure. Overall accuracy is the ratio of the sum of all correctly classified flows to the sum of all testing flows. This metric is used to measure the

Fig. 3. Impact of aggregation methods (a) on isp dataset and (b) on wide dataset.

accuracy of a classifier on the whole testing data. F-Measure is calculated by

$$\text{F-Measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{21}$$

where precision is the ratio of correctly classified flows over all predicted flows in a class and recall is the ratio of correctly classified flows over all ground truth flows in a class. F-Measure is used to evaluate the per-class performance. In this paper, we report the average performance of 100 random runs.

A. Impact of Feature Discretization

First, a set of experiments is carried out to evaluate the effect of feature discretization. Fig. 2 reports the classification accuracy of NB with and without feature discretization on the isp and wide datasets. As shown in Fig. 2(a), on the isp dataset, feature discretization can improve the classification accuracy by approximately 5 percent when only 10 training samples are available for each class. The improvement increases with the number of training samples and can reach up to 20 percent. The results on the wide dataset (see Fig. 2(b)) are similar to those on the isp dataset, while the maximum improvement can be 30 percent. The experimental results demonstrate the benefit of feature discretization, i.e., feature discretization can significantly improve the classification accuracy of the NB classifier. Therefore, similar to [10], we apply feature discretization in our proposed scheme.

B. Impact of Aggregation Methods

We perform a set of experiments to evaluate the proposed BoF-NB scheme with different aggregation methods. The original NB classifier with feature discretization is used in the experiments as a baseline. Fig. 3 shows the classification accuracy with different training data sizes. One can find that the proposed BoF-NB scheme outperforms NB whichever aggregation method is used. On the isp dataset, the classification accuracy of BoF-NB is higher than that of NB by about 10 percent. The


Fig. 4. F-Measures of BoF-NB with different aggregation rules on isp dataset. (a) bt, (b) dns, (c) ftp, (d) http, (e) imap, (f) msn, (g) pop3, (h) smtp, (i) ssh, (j) ssl, and (k) xmpp.

reason is that BoF-NB can effectively utilize the flow correlation information. Regarding the aggregation methods, the sum rule is slightly better than the majority vote rule and the median rule. The max rule is the worst of the four competing aggregation methods; its accuracy is lower than that of the sum rule by approximately 4 percent. Similar results are obtained on the wide dataset, as shown in Fig. 3(b). BoF-NB exhibits better classification capability than NB, and the sum rule is superior to the max rule by 8 percent.

Figs. 4 and 5 report the F-Measures of BoF-NB and NB for each class on the two datasets. In general, our BoF-NB scheme, especially with the sum rule, the majority vote rule, and the median rule, can significantly improve the F-Measure for any application-based traffic class. The degree of improvement varies across classes. For example, on the isp dataset as shown in Fig. 4, the F-Measure of BoF-NB with the sum rule is more than 15 percent greater than that of NB for the dns class. For the pop3 class, the improvement is about 10 percent. Among the four aggregation methods, the max rule does not work as well as the other aggregation methods for many traffic classes. For instance, the F-Measure obtained with the max rule is lower than that of the other rules by up to 15 percent for imap on the isp dataset. On the wide dataset as shown in Fig. 5, BoF-NB with the max rule has similar accuracy to NB. However, the sum rule consistently demonstrates good classification performance for all traffic classes on the two datasets.

C. Impact of BoF Intradiversity

As discussed in Section III, the diversity of the simple predictor can affect the performance of the aggregation of flow predictions. In this work, the diversity is related to two factors, a small set


Fig. 5. F-Measures of BoF-NB with different aggregation rules on wide dataset. (a) bt, (b) dns, (c) http, (d) smtp, (e) ssh, and (f) ssl.

Fig. 6. Impact of BoF intradiversity. (a) Sum, (b) Max, (c) Median, and (d) Majority Vote.

of training data and the size of BoFs (namely BoF intradiversity). Since the flows in a BoF share the same destination but come from different sources and networks, we argue that the more flows exist in a BoF, the higher the BoF intradiversity is. Fig. 6 shows the impact of BoF intradiversity on the accuracy improvement. The improvement is measured by comparing the classification accuracy of the BoF-based scheme with that of the original NB classification method on the same testing set. When the size of BoFs is 1, no improvement is obtained because there is no diversity in BoFs. As the size of BoFs increases, more improvement is achieved, thanks to the growth of the BoF intradiversity. The experimental results match the theoretical analysis in Section III well.

D. Comparison With State-of-the-Art Methods

We conduct a number of experiments to compare the classification performance of the proposed BoF-NB scheme with three state-of-the-art methods: C4.5, k-NN, and Erman's semisupervised method [27]. C4.5 and k-NN demonstrate superior traffic classification performance in recent research [5], [10]. Erman's semisupervised method [27] employs the k-means clustering algorithm and a supervised cluster-application mapping strategy. A large proportion of testing flows will be labelled as unknown by the semisupervised method when only a small supervised training set is available. For a fair comparison, we implement Erman's semisupervised method so that it ignores the unknown class in the training stage. In the experiments, the sum rule is selected for our BoF-NB scheme based on the experimental results in Subsection IV-B.

Fig. 7 shows the classification accuracy of the four competing classification methods versus training data size. One can see that BoF-NB outperforms the other three state-of-the-art methods. For example, the classification accuracy of BoF-NB is higher than that of the second-best method, the semisupervised method, by approximately 9 percent on the isp dataset. C4.5 and k-NN have similar performance, which is slightly worse than that of the semisupervised method. On the wide dataset, BoF-NB is better than the second-best method, C4.5, by up to 8 percent. The performance of the semisupervised method is close to that of C4.5 as the supervised training data size increases. k-NN is always worse than C4.5 by about 6 percent. The results show that BoF-NB can effectively improve the classification accuracy by aggregating correlated NB predictions.

The F-Measures of the four competing methods are shown in Figs. 8 and 9. A general observation is that BoF-NB is superior to C4.5, k-NN and the semisupervised method. On the isp dataset, the F-Measures of BoF-NB are significantly higher than those of the other three competing methods for the bt, ftp, imap, msn, pop3, smtp and ssl classes.
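The overall accuracy and per-class F-Measure used in these comparisons can be computed directly from the predicted and ground-truth labels of the testing flows. The Python sketch below follows the definitions given in Section IV. The function names and the convention of returning 0.0 when a class has no correctly classified flows are assumptions made for this illustration.

```python
def overall_accuracy(y_true, y_pred):
    """Ratio of correctly classified flows to all testing flows."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure(y_true, y_pred, cls):
    """Per-class F-Measure: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == cls and p == cls for t, p in pairs)
    if true_pos == 0:
        # Assumed convention: no correct flows in this class gives 0.0.
        return 0.0
    predicted = sum(p == cls for p in y_pred)   # flows predicted as cls
    actual = sum(t == cls for t in y_true)      # ground-truth flows of cls
    precision = true_pos / predicted
    recall = true_pos / actual
    return 2 * precision * recall / (precision + recall)
```

For BoF-based classification, `y_pred` would hold the BoF-level decision copied to each flow in the bag, so that both metrics remain flow-level quantities.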
For instance, the F-Measure of BoF-NB is dramatically higher than that of the second-best method, the semisupervised method, by up to 20 percent for the imap class. For the other 4 classes, BoF-NB has F-Measures comparable to those of the best method. On the wide dataset, BoF-NB can achieve a higher F-Measure than the other three methods for


every class. For example, BoF-NB has a higher F-Measure for the dns class than the second-best method, C4.5, by approximately 10 percent. The experimental results confirm that BoF-NB has better classification performance than the state-of-the-art classification methods due to its capability of utilizing flow correlation information.

Fig. 7. Classification accuracy of four methods (a) on isp dataset and (b) on wide dataset.

V. ANALYSIS ON ERROR SENSITIVITY

In order to explain why the sum rule works better than the max rule, we investigate the error sensitivity. An empirical finding reported in Section IV is that the sum rule (13) appears to produce more reliable decisions than the max rule (15). We shall show that the sum rule is much less affected by prediction errors. This theoretical result is consistent with the experimental finding.

In Section III, we assumed that the a posteriori class probabilities for a flow are computed correctly. In fact, each flow will produce only an estimate of the a posteriori class probability for a BoF. The estimate deviates from the true probability by an error, as expressed in (22). These estimated probabilities, rather than the true probabilities, are used in the aggregated predictor rules.

We consider the effect of the estimation errors on the aggregation rules. Substituting (22) into (13), we have (23). Under a further assumption, we obtain (24). Substituting (24) into (23) yields (25). Comparing (13) and (25), we find that each class-based term in the aggregation rule (13) is affected by the error factor (26).

A similar analysis of the max rule (15) commences with (27), which can be rewritten as (28). A comparison of (15) and (28) shows that the aggregation rule (15) is affected by the error factor (29).
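The difference in error sensitivity between the two rules can also be explored numerically. The following Python sketch is a simplified illustration and not the simulation reported in the paper: it assumes that the true per-flow posteriors of one class are drawn uniformly from a hypothetical range, that the estimation errors are zero-mean normal with a hypothetical standard deviation `sigma`, and that sensitivity is measured as the relative deviation of the aggregated quantity (the sum or the max over the flows of a BoF) caused by those errors.

```python
import random


def relative_deviation(n_flows, sigma, rng):
    """Relative change of the sum and of the max caused by estimation errors.

    True posteriors are drawn uniformly from [0.3, 0.9] (an assumption);
    each estimate adds zero-mean normal noise with standard deviation sigma.
    """
    true_p = [rng.uniform(0.3, 0.9) for _ in range(n_flows)]
    est_p = [p + rng.gauss(0.0, sigma) for p in true_p]
    sum_dev = abs(sum(est_p) - sum(true_p)) / sum(true_p)
    max_dev = abs(max(est_p) - max(true_p)) / max(true_p)
    return sum_dev, max_dev


def simulate(n_flows, sigma=0.1, runs=100, seed=0):
    """Average the two relative deviations over several random BoFs."""
    rng = random.Random(seed)
    results = [relative_deviation(n_flows, sigma, rng) for _ in range(runs)]
    mean_sum = sum(r[0] for r in results) / runs
    mean_max = sum(r[1] for r in results) / runs
    return mean_sum, mean_max


if __name__ == "__main__":
    for size in (10, 50):
        mean_sum, mean_max = simulate(size)
        print(f"BoF size {size}: sum {mean_sum:.4f}, max {mean_max:.4f}")
```

Under these assumptions, positive and negative errors tend to cancel in the sum, whereas the max tends to pick an estimate with a large positive error, so the sum is disturbed less as the BoF grows.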


Fig. 8. F-Measure comparison of four methods on isp dataset. (a) bt, (b) dns, (c) ftp, (d) http, (e) imap, (f) msn, (g) pop3, (h) smtp, (i) ssh, (j) ssl, and (k) xmpp.

Comparing the error factors (26) and (29) suggests that the difference between the two error factors depends on two components, given in (30) and (31). We observe that the sum operation is able to cancel the effect of positive and negative error values, so the value of the sum-based component should be close to the expected value of the error distribution. In contrast, the max operation chooses a large value, so the value of the max-based component should be far from the expected value of the error distribution.

We design and perform a simulation to illustrate the behaviour of the two components. In the simulation, the normal distribution is used to model the errors. Considering that the estimated probability is an estimate of the true probability, the parameters of the distribution are set accordingly. Furthermore, we randomly create a number of error values and use them to compute the values of the two components for BoFs. Fig. 10 shows the corresponding absolute values of the two components with two different BoF sizes. For each BoF size, 100 runs are carried out. One can see from Fig. 10 that the sum-based component is smaller than the max-based component in most runs when the BoF size is 10. When the BoF size increases to 50, the sum-based component becomes much smaller than the max-based component in every run. The sum rule demonstrates less sensitivity to prediction errors than the max rule, which is in line with our experimental results in Section IV.

VI. CONCLUSION

In this paper, we proposed a new traffic classification scheme which can effectively improve the classification performance in


the situation where only a few training data are available. The proposed scheme is able to incorporate flow correlation information into the classification process. We presented a theoretical analysis of why and how the proposed scheme works. A new BoF-NB method was also proposed to effectively aggregate the correlated naive Bayes (NB) predictions. The experiments performed on two real-world network traffic datasets demonstrated the effectiveness of the proposed scheme. The experimental results showed that BoF-NB with the sum rule outperforms existing state-of-the-art methods by large margins. This study provides a solution for achieving high-performance traffic classification without time-consuming labelling of training samples.

Fig. 9. Average F-Measure per-class on wide dataset. (a) bt, (b) dns, (c) http, (d) smtp, (e) ssh, and (f) ssl.

Fig. 10. Illustration of error sensitivity. (a) BoF size 10. (b) BoF size 50.
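The error-sensitivity simulation of Section V can be approximated with a short script. This is a sketch under stated assumptions rather than the authors' experiment: the errors are drawn from a zero-mean normal distribution with an assumed standard deviation of 0.1, the sum-side quantity is taken to be the mean error over the BoF, and the max-side quantity is taken to be the largest error in the BoF. The names `error_components` and `count_sum_wins` are hypothetical.

```python
import random


def error_components(bof_size, sigma, rng):
    """Draw one BoF worth of additive errors and summarise them two ways."""
    errors = [rng.gauss(0.0, sigma) for _ in range(bof_size)]
    sum_side = abs(sum(errors) / bof_size)  # averaging lets +/- errors cancel
    max_side = abs(max(errors))             # max keeps one extreme error
    return sum_side, max_side


def count_sum_wins(bof_size, runs=100, sigma=0.1, seed=0):
    """Count the runs in which the sum-side value is the smaller one."""
    rng = random.Random(seed)  # seeded so repeated calls give the same count
    wins = 0
    for _ in range(runs):
        sum_side, max_side = error_components(bof_size, sigma, rng)
        if sum_side < max_side:
            wins += 1
    return wins


if __name__ == "__main__":
    # Same BoF sizes and number of runs as described in the text.
    for size in (10, 50):
        print(f"BoF size {size}: sum-side smaller in "
              f"{count_sum_wins(size)} of 100 runs")
```

Averaging shrinks the spread of the sum-side value as the BoF grows, whereas the largest of many draws tends to move further from zero, which is the qualitative behaviour the text attributes to Fig. 10.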

Chao Chen received the Bachelor of Information Technology degree with first class honors from Deakin University, Australia, in 2012. He is currently working toward the Ph.D. degree at the School of Information Technology, Deakin University. His research interests include network management and security, especially in network traffic classification.

Yang Xiang (A'08–M'09–SM'12) received the Ph.D. degree in computer science from Deakin University, Australia. He is currently with the School of Information Technology, Deakin University. His research interests include network and system security, distributed systems, and networking. In particular, he is currently leading a research group developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council (ARC). He has published more than 100 research papers in many international journals and conferences, such as IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, and IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS. He has served as the Program/General Chair for many international conferences such as ICA3PP 12/11, IEEE/IFIP EUC 11, IEEE TrustCom 11, IEEE HPCC 10/09, IEEE ICPADS 08, and NSS 11/10/09/08/07. He has been a PC member for more than 50 international conferences in distributed systems, networking, and security. He serves as the Associate Editor of IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS and the Editor of Journal of Network and Computer Applications.

Wanlei Zhou (M'92–SM'09) received the Ph.D. degree in 1991 from the Australian National University, Canberra, Australia, and the D.Sc. degree from Deakin University, Victoria, Australia, in 2002. He is currently the chair professor of Information Technology and the Head of the School of Information Technology, Deakin University, Melbourne. His research interests include distributed and parallel systems, network security, mobile computing, bioinformatics, and e-learning. He has published more than 200 papers in refereed international journals and refereed international conference proceedings. Since 1997, he has been involved in more than 50 international conferences as the general chair, a steering chair, a PC chair, a session chair, a publication chair, and a PC member.

Jun Zhang (M'12) received the Ph.D. degree in 2011 from the University of Wollongong, Australia. He is currently with the School of Information Technology at Deakin University, Melbourne, Australia. His research interests include network and system security, pattern recognition, and multimedia processing. He has published more than 30 research papers in international journals and conferences, such as IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, and IEEE International Conference on Image Processing. Dr. Zhang received the 2009 Chinese Government award for outstanding self-financed student abroad.

Yong Xiang (M'12–SM'12) received the B.E. and M.E. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 1983 and 1989, respectively. In 2003, he received the Ph.D. degree from The University of Melbourne, Melbourne, Australia. He was with the Southwest Institute of Electronic Equipment of China, Chengdu, from 1983 to 1986. In 1989, he joined the University of Electronic Science and Technology of China, where he was a Lecturer from 1989 to 1992 and an Associate Professor from 1992 to 1997. He was a Senior Communications Engineer with Bandspeed Inc., Melbourne, Australia, from 2000 to 2002. He is currently an Associate Professor with the School of Information Technology at Deakin University, Melbourne, Australia. His research interests include blind signal/system estimation, information and network security, communication signal processing, multimedia processing, pattern recognition, and biomedical signal processing.
