
AI-BASED TCP PERFORMANCE MODELLING

K. Mahmoud

M.Sc. September 2012

AI-BASED TCP PERFORMANCE MODELLING

A thesis submitted to the University of Plymouth in partial fulfilment of the requirements for the degree of

Master of Science

Project Supervisor: Dr Bogdan Ghita

Karim Mahmoud
September 2012 School of Computing and Mathematics Faculty of Science and Technology University of Plymouth, UK


Abstract
Different mathematical models exist for modelling TCP algorithms and the interrelations between TCP and network parameters. In this research, an artificial neural network modelling approach was used to model TCP performance, represented by the transmission time needed to transfer the data payload within TCP flows. Two models were developed, one for lossless and one for lossy TCP connections. A baseline was defined by a mathematical model in order to compare the accuracy obtained in estimating the transmission time, both in terms of regression between actual and estimated values and in terms of the cumulative distribution of relative error. The neural models initially gave better results than the mathematical model for the same conditions and datasets. Manual analysis was performed on poorly estimated samples, which revealed the presence of additional prolonged idle periods within flows; these were not accounted for in the mathematical model and were not sufficiently estimated by the neural models. The effect of idle time on modelling accuracy was thoroughly investigated, in particular its effect on reinitialising the congestion window and how different TCP implementations deal with idle periods when resuming transmission. Further filtering criteria were applied to the traffic to exclude statistical outliers and non-standard TCP connections, which improved the results for both models. Nevertheless, the neural network approach outperformed the mathematical modelling of TCP throughput at all stages of this research. Finally, it is suggested that the available mathematical model be revised to take idle time into consideration.

Declaration
This is to certify that the candidate, Karim Mahmoud, carried out the work submitted herewith.

Candidate's Signature: Karim Mahmoud . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

Supervisor's Signature: Dr Bogdan Ghita . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

Second Supervisor's Signature: Dr David Lancaster . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

Copyright & Legal Notice


This copy of the dissertation has been supplied on the condition that anyone who consults it is understood to recognise that its copyright rests with its author, and that no part of this dissertation, or information derived from it, may be published without the author's prior written consent. The names of actual companies and products mentioned throughout this dissertation are trademarks or registered trademarks of their respective owners.


Acknowledgements
I wish to express my deep and sincere appreciation to Dr Bogdan Ghita for his guidance, assistance, patience, and consistently constructive feedback. Working under his supervision has been inspiring and has given me deeper confidence in my research and intellectual abilities. I feel privileged to have been one of Dr Ghita's students. I also wish to express my gratitude to all the teaching staff at the School of Computing and Mathematics at Plymouth University for a wonderful learning experience. To all those who supported my decision to pursue a master's degree: Dr Mahmoud Khalil at Ain Shams University, and Rami Mohamed and Walid Refaat at Orange Business Services. To my loving parents and sister for their persistent encouragement and support.

Table of Contents
1 Introduction
  1.1 Project Aim and Objectives
  1.2 Thesis Structure
2 Literature Review
  2.1 Background
  2.2 TCP Protocol
    2.2.1 TCP Transition States
      2.2.1.1 Connection Establishment
      2.2.1.2 Data Transfer
      2.2.1.3 Connection Termination
    2.2.2 TCP Flow Control
      2.2.2.1 Sliding Window Protocol
    2.2.3 TCP Congestion Control
      2.2.3.1 Slow Start
      2.2.3.2 Congestion Avoidance
      2.2.3.3 Retransmission Timeout, Fast Retransmit, and Fast Recovery
      2.2.3.4 Fast Retransmit
      2.2.3.5 Fast Recovery
    2.2.4 Idle Time Considerations
    2.2.5 TCP Timers
  2.3 Formula-Based Modelling
    2.3.1 Cardwell Mathematical Model
  2.4 Previous Research and Machine Learning Approaches
    2.4.1 Performance Estimation
    2.4.2 Performance Prediction
    2.4.3 History-Based Models
    2.4.4 Artificial Neural Networks
  2.5 Summary
3 Research Methodology
  3.1 Data Acquisition
  3.2 Data Pre-processing
    3.2.1 TCPTRACE
    3.2.2 Data Processing in MATLAB
  3.3 Neural Network Modelling in MATLAB
  3.4 Statistical Analysis in MATLAB
    3.4.1 Regression
    3.4.2 MSE
    3.4.3 Absolute Relative Error
  3.5 Base Line for Analysing Model Accuracy
  3.6 Summary
4 Data Pre-processing and Traffic Analysis
  4.1 Types of Traffic
  4.2 Extracting TCP Parameters
  4.3 TCP Parameters Pre-processing
    4.3.1 Identifying Valid TCP Flows
    4.3.2 Selection of Forward Direction
    4.3.3 Classification of Lossless and Lossy Flows
    4.3.4 Computing the Mathematical Throughput Estimate
    4.3.5 Normalisation of TCP Parameters
  4.4 Statistical Distribution of TCP Parameters
    4.4.1 Throughput
    4.4.2 Data Transmitted
    4.4.3 Initial Window Size
    4.4.4 Maximum Segment Size
    4.4.5 Data Transmission Time
    4.4.6 Average RTT
    4.4.7 Maximum Idle Time
  4.5 Summary
5 Neural Network Modelling
  5.1 Backpropagation Feed Forward Neural Networks
  5.2 Backpropagation Neural Network Parameters
    5.2.1 Initialization of Weights
    5.2.2 Initialization of Bias
    5.2.3 Learning Rate
    5.2.4 Momentum
    5.2.5 Hidden Layers and Nodes
    5.2.6 Number of Samples
    5.2.7 Stopping Criteria
  5.3 Neural Network Model Structure
    5.3.1 Lossless Model
    5.3.2 Lossy Model
  5.4 Summary
6 Results and Analysis
  6.1 Results from the Combined Dataset (UNIBS-2009 and MAWI)
    6.1.1 Considering All Valid TCP Connections
      6.1.1.1 Results for the Lossless Dataset
      6.1.1.2 Results for the Lossy Dataset
    6.1.2 Idle Time Investigation
    6.1.3 Filtering TCP Connections with High Relative Idle Time
      6.1.3.1 Results for the Lossless Dataset
      6.1.3.2 Results for the Lossy Dataset
    6.1.4 Investigation of Non-Standard Flows
    6.1.5 Filtering Non-Standard Flows
      6.1.5.1 Results for the Lossless Dataset
      6.1.5.2 Results for the Lossy Dataset
    6.1.6 Throughput and Estimation Error
  6.2 Manual Analysis of Connections with Poorly Estimated Throughput
  6.3 Results from the Plymouth University Campus Dataset
  6.4 Summary
7 Conclusions and Future Research Directions
  7.1 Conclusions
  7.2 Research Limitations
  7.3 Direction of Future Research
8 References
A Data Sources
  A.1 UNIBS
  A.2 MAWI
B Results Using the Dataset from Plymouth University Campus
  B.1 Considering All Valid TCP Connections
    B.1.1 Results for the Lossless Dataset
    B.1.2 Results for the Lossy Dataset
  B.2 Filtering TCP Connections with High Relative Idle Time and Non-Standard TCP Flows
    B.2.1 Results for the Lossless Dataset
    B.2.2 Results for the Lossy Dataset
C MATLAB Scripts
  C.1 Cardwell Mathematical Model Implementation
  C.2 Neural Network Modelling

List of Tables
Table 4.1 TCP parameters of interest as collected by tcptrace.
Table 4.2 Number of valid TCP flows for both lossless and lossy subsets.
Table 4.3 Mean values of TCP parameters evaluated for the three datasets used.
Table 5.1 Stopping criteria used for the neural network during the learning process.
Table 5.2 Neural network structure and input parameters for the lossless model.
Table 5.3 Neural network structure and input parameters for the lossy model.
Table 6.1 MSE and regression results post filtering samples with high maximum idle time to average RTT (lossless combined dataset).
Table 6.2 MSE and regression results post filtering samples with high maximum idle time to average RTT (lossy combined dataset).
Table 6.3 Neural network results obtained post filtering non-standard flows from the lossless subset.
Table 6.4 Mathematical results obtained post filtering different non-standard flows from the lossless subset.
Table 6.5 Neural network results obtained post filtering different non-standard flows from the lossy subset.
Table 6.6 Mathematical results obtained post filtering different non-standard flows from the lossy subset.
Table 6.7 Accuracy results for the lossless dataset of Plymouth University.
Table 6.8 Accuracy results for the lossy dataset of Plymouth University.
Table A.1 Composition of the UNIBS 2009 trace (UNIBS: Data sharing, 2011).
Table A.2 Composition of the MAWI traces (UNIBS: Data sharing, 2011).

List of Figures
Figure 2.1 TCP state transition diagram for both client and server.
Figure 2.2 Timeline of TCP connection establishment and termination.
Figure 2.3 Slow start and congestion avoidance sending patterns.
Figure 2.4 Slow start and congestion avoidance, as implemented for TCP Tahoe and TCP Reno.
Figure 3.1 Process diagram of research stages.
Figure 3.2 Regression analysis showing regression fitting line and residual values.
Figure 4.1 Percentages of both lossless and lossy TCP connections within the network traffic captured.
Figure 4.2 Cumulative distribution of throughput.
Figure 4.3 Cumulative distribution of data transmitted.
Figure 4.4 Cumulative distribution of initial window bytes.
Figure 4.5 Cumulative distribution of initial window packets.
Figure 4.6 Cumulative distribution of MSS.
Figure 4.7 Cumulative distribution of data transmission time.
Figure 4.8 Cumulative distribution of RTT.
Figure 4.9 Cumulative distribution of maximum idle time.
Figure 4.10 Box-and-whisker diagrams of TCP time parameters (UNIBS traffic).
Figure 4.11 Box-and-whisker diagrams of TCP time parameters (MAWI traffic).
Figure 4.12 Box-and-whisker diagrams of TCP time parameters (Plymouth University traffic).
Figure 5.1 Simplified neural network structure.
Figure 5.2 Computations at a single neural perceptron.
Figure 5.3 Backpropagation of error signal to update neural network weights.
Figure 5.4 MSE performance measures for learning, validating, and testing subsets.
Figure 5.5 Neural network model developed for lossless TCP traffic.
Figure 5.6 Neural network model developed for lossy TCP traffic.
Figure 6.1 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network models.
Figure 6.2 CDF of absolute relative error for lossless connections (combined dataset).
Figure 6.3 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network models.
Figure 6.4 CDF of absolute relative error for lossless connections for the combined dataset.
Figure 6.5 Time-sequence graph for a TCP connection with relatively high idle time.
Figure 6.6 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network models, after filtering connections with maximum idle time larger than twice the average RTT.
Figure 6.7 CDF of absolute relative error for lossless connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.
Figure 6.8 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network models, after filtering connections with maximum idle time larger than twice the average RTT.
Figure 6.9 CDF of absolute relative error for lossy connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.
Figure 6.10 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network models, post filtering various non-standard flows.
Figure 6.11 CDF of absolute relative error for lossless connections for the combined dataset, prior and post filtering various non-standard flows.
Figure 6.12 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network models, post filtering various non-standard flows.
Figure 6.13 CDF of absolute relative error for lossy connections for the combined dataset, prior and post filtering various non-standard flows.
Figure 6.14 Scatter plot of actual throughput and corresponding relative error of estimated throughput, for lossless connections of the combined dataset, prior to any filtering.
Figure 6.15 Scatter plot of actual throughput and corresponding relative error of estimated throughput, for lossless connections of the combined dataset, after filtering connections with a high idle time to RTT ratio.
Figure 6.16 Trace 1.
Figure 6.17 Trace 2.
Figure B.1 Regression obtained for lossless connections for the Plymouth dataset using both mathematical and neural network models, prior to any filtering.
Figure B.2 CDF of absolute relative error for lossless connections for the Plymouth dataset, prior to any filtering.
Figure B.3 Regression obtained for lossy connections for the Plymouth dataset using both mathematical and neural network models, prior to any filtering.
Figure B.4 CDF of absolute relative error for lossless connections for the Plymouth dataset, prior to any filtering.
Figure B.5 Regression obtained for lossless connections for the Plymouth dataset using both mathematical and neural network models, after filtering all non-standard TCP flows and connections with high relative idle time.
Figure B.6 CDF of absolute relative error for lossless connections for the Plymouth dataset, after filtering all non-standard TCP flows and connections with high relative idle time.
Figure B.7 Regression obtained for lossy connections for the Plymouth dataset using both mathematical and neural network models, after filtering all non-standard TCP flows and connections with high relative idle time.
Figure B.8 CDF of absolute relative error for lossless connections for the Plymouth dataset, after filtering all non-standard TCP flows and connections with high relative idle time.

Acronyms and Abbreviations


ACK Acknowledgement Segment
AI Artificial Intelligence
CDF Cumulative Distribution Function
CWND Congestion Window
FIN Finish Segment
IP Internet Protocol
IW Initial Congestion Window
MSE Mean Squared Error
MSS Maximum Segment Size
RFC Request for Comments
RST Reset Segment
RTO Retransmission Timeout
RTT Round Trip Time
RTTVAR Round Trip Time Variation
RWND Receiving Window
SMSS Sender Maximum Segment Size
SVR Support Vector Regression
SYN Synchronisation Segment
TCP Transmission Control Protocol

Introduction
The majority of Internet traffic is carried by the Transmission Control Protocol (TCP), which accounts for about 90 percent of exchanged traffic (Shah et al., 2007). Because of this dominant role, the performance of TCP reflects directly on the general performance of IP networks and the Internet. Hence the need to provide realistic and efficient performance modelling of the TCP transport protocol in particular, and to find relationships between the protocol's performance, TCP parameters, and the network conditions under which traffic is transferred. Many traditional mathematical models have been developed to model the behaviour of TCP, yet despite their complexity, the accuracy obtained when modelling short-lived TCP connections is not always valid (Ghita and Furnell, 2008). Artificial Intelligence models such as neural networks have been applied in several previous studies to obtain more accurate performance models of TCP connections, not only for the steady-state period of a connection but for short-lived connections as well, where slow start has a primary effect on the connection's performance. The motivation for this project was to explore how efficiently artificial neural networks can be used for TCP throughput estimation, what level of accuracy such models can reach, and whether further improvements can be made with respect to previous approaches.

1.1

Project Aim and Objectives

The goal of the project is to develop a robust artificial neural network model in MATLAB that can accurately estimate TCP throughput for a wide range of TCP transfers with varied network path characteristics. The research will attempt to extend and contribute to previous research approaches; where no further improvement can be reached, it remains essential to comprehend the challenges and obstacles that prevent achieving that ideal estimation model. The objectives of this project are as follows:

1. Obtain a full understanding of the TCP transport protocol, its operation, and its different network algorithms.
2. Gain a thorough overview of the different available mathematical models used for TCP performance (i.e. throughput) evaluation, and analyse their efficiency against the actual observed TCP performance.
3. Analyse traffic traces from various live networks, and initially perform statistical analysis on these traces in order to understand the nature of the traffic, the distribution of the TCP parameters collected, and their significance to TCP throughput.
4. Based on the traffic analysis, select suitable TCP parameters to be considered as inputs for the neural network model.
5. Develop an artificial neural network in MATLAB that models TCP performance with regard to the selected TCP parameters.
6. Evaluate the efficiency of the developed neural network when modelling TCP connections with various characteristics, such as short-lived connections, where the slow start congestion control strategy consumes a large part of the connection time; consider filtering of statistical and behavioural outliers, and observe the effect these exclusions have on the estimation accuracy obtained.
7. Study the effect of packet loss on TCP performance, and how the developed neural network model reacts to such network impairment.
8. Identify any other relationships or patterns between network parameters and TCP throughput, and how these relationships can be represented in a model.
9. Provide an evaluation of how suitable neural networks are for modelling TCP performance, and under which conditions they perform better or worse.

1.2

Thesis Structure

The thesis is structured as follows:

Chapter 2 includes a complete overview of the theoretical fields of study addressed in this research, as well as a background on related, previously published research and findings. It starts with a review of the TCP standards and an explanation of the different stages within the lifespan of a typical TCP connection, taking into consideration different scenarios and network conditions. The algorithms associated with each stage of a TCP connection are also briefly presented as documented in the relevant RFCs (Requests for Comments). It then reviews different approaches to modelling TCP, such as mathematical models and history-based models. A short introduction to artificial neural networks is included, as this is the main approach taken to develop a history-based model in this research.

Chapter 3 gives a breakdown of the methodology and approaches used to carry out this research, the data acquisition approach, and a description of the traffic traces collected.

Chapter 4 presents the pre-processing stages applied to the collected datasets, and the fundamental statistical and trend analysis performed on them in order to understand the nature of the TCP connections used for modelling and to recognise the characteristics of the different TCP parameters considered for modelling TCP performance.

Chapter 5 describes the artificial neural networks developed, their structure, and the training and validation criteria selected in MATLAB, as well as an explanation of the backpropagation neural network as the technique chosen for AI-based modelling.

Chapter 6 presents all the results obtained while developing the neural network models in MATLAB, explains the results and findings of each step taken, and shows how different TCP parameters had varying effects on the estimates obtained. It also describes the various filtering conditions used to reach satisfactory results, while analysing and explaining the significance of these improved results.

Chapter 7 draws conclusions from the analysis performed using neural network modelling, and discusses how this analysis suggests the consideration of new TCP parameters when modelling TCP performance. The chapter then sheds light on some limitations encountered in this research, and provides some future research directions.

Literature Review
This chapter starts by providing a brief review of the TCP protocol and the various TCP algorithms associated with the different stages and conditions encountered by TCP connections. The chapter then gives an overview of the different techniques used for modelling TCP throughput, such as formula-based and history-based models. Finally, previous research using AI-based methods - particularly artificial neural networks - is analysed by comparing the methodological approaches taken and presenting the results obtained by each.

2.1

Background

When evaluating the performance of a network path, we are mostly concerned with the performance of TCP connections, since they represent the majority of the overall traffic in a network. Among the different types of traffic, bulk TCP transfers lasting more than a few seconds can be considered of greater importance, and are more suitable for TCP throughput prediction than short-lived TCP connections, which are highly affected by the slow-start congestion control mechanism (He et al., 2007). However, it remains essential to have robust models that can also estimate the performance of short-lived transfers efficiently. The importance of quality and performance provisioning has been rising, hence the need to develop models that replicate network protocols such as TCP in order to estimate or predict the performance of data transfers (Ghita et al., 2005). Such models can improve our understanding of the behaviour of Internet traffic and the interrelations between various TCP and network parameters. Additionally, these performance predictions have several applications, such as dynamically selecting the best path for a particular data transfer between end-hosts where multiple paths are available (as with distributed content and multi-homed networks), or selecting among mirrored resources and servers in a grid network architecture (Mirza et al., 2010).

Chapter 2. Literature Review

2.2

TCP Protocol

The following sections provide an overview of the well-known algorithms used over the lifetime of TCP connections, which are of particular interest in the context of this research.

2.2.1

TCP Transition States

The states a TCP connection goes through can be summarised in three main stages, as described in RFC 793 (Postel, 1981): connection establishment, data transfer, and connection termination. The transition from one state to another is accomplished by the exchange of a specific sequence of segments. These states, transitions, and segments are represented in Figure 2.1, and are elaborated in the following sections.
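As an illustration only (this lookup table is my own sketch, not part of the thesis), the client-side path through the state diagram of Figure 2.1 can be expressed as a transition map keyed on the segments exchanged:

```python
# Illustrative sketch, not thesis code: the client-side path through the
# TCP state diagram of Figure 2.1. Events are informal labels describing
# the segments sent or received at each transition.
CLIENT_TRANSITIONS = {
    ("CLOSED", "send SYN"): "SYN_SENT",
    ("SYN_SENT", "recv SYN,ACK / send ACK"): "ESTABLISHED",
    ("ESTABLISHED", "send FIN"): "FIN_WAIT_1",
    ("FIN_WAIT_1", "recv ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv FIN / send ACK"): "TIME_WAIT",
}

def replay(events, state="CLOSED"):
    """Step through a sequence of events and return the resulting state."""
    for event in events:
        state = CLIENT_TRANSITIONS[(state, event)]
    return state

print(replay(["send SYN", "recv SYN,ACK / send ACK"]))  # → ESTABLISHED
```

Replaying the active-close events from the ESTABLISHED state ends in TIME_WAIT, matching the path shown in the diagram.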

Figure 2.1: TCP state transition diagram for both client and server.

2.2.1.1 Connection Establishment

As demonstrated in Figure 2.2, the TCP protocol uses a three-way handshake mechanism in order to establish a connection. In this mechanism, the client initiates the handshake by sending a SYN segment to the server, specifying the port number on which a connection is needed and its starting sequence number. The server responds to this request by sending a similar SYN segment with its own starting sequence number, and acknowledging the sequence number sent by the client. Finally, the client responds to the server, acknowledging its sequence number. At this point, a TCP connection is established, and both client and server transition to the data transfer state (Stevens, 1993).
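The acknowledgement arithmetic of the handshake - each side acknowledges the peer's initial sequence number plus one - can be checked mechanically. The helper below is a hypothetical sketch (not from the thesis), applied to the initial sequence numbers of the trace in Figure 2.2:

```python
# Sketch (assumed helper, not thesis code): verify the three-way handshake
# acknowledgement rule on the initial sequence numbers of a captured trace.

def handshake_ok(client_isn, server_isn, synack_ack, final_ack):
    """Each side must acknowledge the peer's initial sequence number plus one."""
    return synack_ack == client_isn + 1 and final_ack == server_isn + 1

# Values taken from the trace in Figure 2.2.
print(handshake_ok(client_isn=1337909654, server_isn=289962769,
                   synack_ack=1337909655, final_ack=289962770))  # → True
```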
Time   146.15.88.126 (port 1513)            164.133.140.237 (port 80)
0.000  SYN       ------------------>        Seq = 1337909654
0.010            <------------------        SYN, ACK   Seq = 289962769    Ack = 1337909655
0.318  ACK       ------------------>        Seq = 1337909655   Ack = 289962770
6.973            <------------------        FIN, ACK   Seq = 290028713    Ack = 1337910152
7.282  ACK       ------------------>        Seq = 1337910152   Ack = 290028714
7.532  FIN, ACK  ------------------>        Seq = 1337910152   Ack = 290028714
7.541            <------------------        ACK        Seq = 290028714    Ack = 1337910153

Figure 2.2: Timeline of TCP connection establishment and termination.

2.2.1.2 Data Transfer

Once a connection has been established, both sides of the connection are able to exchange data segments. This data transfer stage is regulated by the TCP congestion control algorithm, starting with the slow start phase and, if necessary, continuing with congestion avoidance and segment recovery at the occurrence of any retransmission timeout, as explained in section 2.2.3.

2.2.1.3 Connection Termination

A TCP connection can be interrupted and terminated at any time if a RST segment is sent in either direction. However, the normal behaviour is to initiate a graceful termination, as demonstrated in Figure 2.1. Considering that TCP connections are full-duplex, once a connection is established, it requires four segments to be fully terminated. Some applications may require only keeping the TCP connection in a half-close state, which justifies the need for two segments in each direction in order to select which direction is to be closed and which is to be kept open, or to simply fully terminate the connection.

The terminal side, usually the client, that initiates the termination of a TCP connection is said to enter an active close termination. The client sends a FIN segment to the server, entering a FIN_WAIT_1 state and waiting for an ACK and a FIN from the server side, either within individual segments or within a single segment. Once it receives an ACK from the server, the client enters a FIN_WAIT_2 state; once a FIN is received, it then enters a TIME_WAIT state and sends an ACK back to the server.

On the other hand, the side responding to a termination request by receiving a FIN segment is said to enter a passive close termination. It responds by sending an ACK and entering a CLOSE_WAIT state, then sending a FIN segment and entering LAST_ACK, waiting for the last ACK to be received from the client. The TCP connection is considered closed once the last ACK has been received.

In brief, a complete TCP connection is bounded by a SYN segment and a FIN segment in each direction. This is important in order to identify any incomplete or interrupted TCP connection. The timeline of a simple TCP connection establishment and termination is shown in Figure 2.2, excluding any data transfer segments.

2.2.2 TCP Flow Control

This section briefly describes the TCP algorithms used to handle both types of traffic (i.e. bulk transfer flows and short-lived flows) within the lifespan of a TCP connection.

2.2.2.1 Sliding Window Protocol

TCP flow control is based on the sliding window mechanism. During data transfer, the order of segments sent and received is controlled using sequence numbers, and both sender and receiver keep track of these numbers. Each side of the connection also maintains and advertises its window size, which determines the maximum number of segments it can receive and buffer successfully before processing them. This in turn defines the number of segments a sender will transmit before receiving acknowledgements. This mechanism is maintained using a sliding window at each side, and the window is moved forward whenever a segment is received in the correct sequence.

2.2.3 TCP Congestion Control

TCP has no prior knowledge of the limitations and conditions of the network path. Accordingly, TCP algorithms must anticipate and adjust their behaviour continuously with respect to the status of the network. This is basically achieved using two associated mechanisms: slow start and congestion avoidance. Both mechanisms aim to limit the number of unacknowledged packets from sender to receiver, to avoid swamping the receiver or the network with a number of packets it cannot process or buffer. Slow start and congestion avoidance are implemented at the sender side. Fast retransmission and fast recovery are two algorithms that are meant to deal with segment losses within a TCP connection. According to RFC 5681 (Allman et al., 2009), these four algorithms are the principles of congestion control, and are described in detail in the following sections.

2.2.3.1 Slow Start

In order to avoid congestion along a TCP connection, two windows are used. One is kept at the sending side and is referred to as the congestion window (cwnd); it limits the number of unacknowledged segments the sender can transmit. The cwnd is evaluated and maintained by the sender and never advertised. A similar window is used by the receiving side and is referred to as the receiving window (rwnd), which is constantly advertised to the sender to update it about the maximum number of outstanding segments it can support. When transmitting, TCP on the sender side is always bounded by the minimum value of both cwnd and rwnd (Allman et al., 2009).

The slow start algorithm is used to gradually increase the cwnd. Slow start is engaged in two phases of the TCP connection: initially, once a TCP connection is established, and subsequently whenever a retransmission timeout, usually resulting from a lost segment, occurs.

As described in RFC 5681 (Allman et al., 2009), the initial value of the cwnd, referred to as the initial window (IW), is decided at the sender side according to the following conditions, where SMSS is the sender's maximum segment size:

IW = 4 * SMSS bytes (maximum 4 segments),  if SMSS <= 1095 bytes;
     3 * SMSS bytes (maximum 3 segments),  if 1095 bytes < SMSS <= 2190 bytes;
     2 * SMSS bytes (maximum 2 segments),  if SMSS > 2190 bytes.        (2.1)

The cwnd is then incremented by the SMSS value for each ACK received. This behaviour leads to an effect of doubling the cwnd value every RTT, as shown in Figure 2.3.
[Figure: sender-receiver timeline in which cwnd grows as 1, 2, 4, 8 segments over successive RTTs during slow start, then to 9 during congestion avoidance.]

Figure 2.3: Slow start and congestion avoidance sending patterns.

The slow start process continues to increment the cwnd exponentially, as shown in Figure 2.4, which is meant to be an efficient way to determine a reasonable window size to be used along a particular TCP connection (Stallings, 2001). This process terminates either when the cwnd reaches a threshold value called the slow start threshold (ssthresh), after which the TCP connection transitions to the congestion avoidance phase, or when a retransmission timeout occurs, indicating a probable segment loss. Both cases are explained in the next sections.
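As an illustration of the slow start rules described above, the following Python sketch (not part of the thesis; all names are illustrative) selects the initial window per Equation 2.1 and grows the cwnd by one SMSS per ACK:

```python
def initial_window(smss: int) -> int:
    """Initial cwnd in bytes for a given sender MSS, per RFC 5681 (Equation 2.1)."""
    if smss <= 1095:
        return 4 * smss      # maximum 4 segments
    if smss <= 2190:
        return 3 * smss      # maximum 3 segments
    return 2 * smss          # maximum 2 segments

def slow_start_round(cwnd: int, acks: int, smss: int) -> int:
    """cwnd grows by one SMSS per ACK; one ACK per segment doubles cwnd each RTT."""
    return cwnd + acks * smss

smss = 1460
cwnd = initial_window(smss)             # 3 * 1460 = 4380 bytes
cwnd = slow_start_round(cwnd, 3, smss)  # doubled after one RTT: 8760 bytes
```

In practice, delayed ACKs mean fewer than one ACK arrives per segment, so the real growth is somewhat slower than a clean doubling.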

2.2.3.2 Congestion Avoidance

Figure 2.4 illustrates the transition from slow start to congestion avoidance based on the current threshold value ssthresh. During congestion avoidance, the sender adopts an additive increase approach for adjusting the cwnd. Depending on the TCP implementation, it should increment the current cwnd by at most one SMSS per round-trip time.

[Figure: congestion window size (multiples of SMSS, 0-20) plotted against time (multiples of RTT, 0-18) for TCP Tahoe and TCP Reno, annotated with the ssthresh levels, a timeout event, and the slow start and congestion avoidance regions.]

Figure 2.4: Slow start and congestion avoidance, as implemented for TCP Tahoe and TCP Reno.

2.2.3.3 Retransmission Timeout, Fast Retransmit, and Fast Recovery

In early implementations of TCP, the detection of segment losses was performed using a retransmission timeout timer that triggers a retransmission once the timer elapses, assuming a segment loss, as implemented in TCP Tahoe. The duration of this timer is referred to as the RTO, and is evaluated in terms of the Round Trip Time (RTT) measured within a TCP connection, as documented in RFC 793 (Postel, 1981). The fact that TCP Tahoe only sends cumulative acknowledgements does increase the time needed for the RTO timer to expire and detect a segment loss.

2.2.3.4 Fast Retransmit

A more efficient approach for inferring segment loss was proposed by Jacobson (1990), called fast retransmit. The fast retransmit algorithm suggested that whenever the TCP receiver detects an out-of-order segment, it should send or resend an ACK for the last segment received in correct order. The receiver should continue sending these duplicate ACKs as long as the missing segment has not been received and the correct sequence of segments has not been restored. From the sending side, receiving a duplicate ACK would imply either a congested network or a segment loss. The fast retransmit algorithm states that once three duplicate ACKs have been received by the sender, it should then retransmit the last unacknowledged segment. The fact that the receiver has been sending duplicate ACKs implies that it has been receiving segments subsequent to the lost segment. Hence, the sender is only supposed to retransmit the assumed lost segment (Stallings, 2001).

2.2.3.5 Fast Recovery

At any moment within the lifetime of a TCP connection, whether during the slow start or the congestion avoidance phase, the occurrence of a timeout causes the slow start process to be reinitialised. The limiting threshold (ssthresh) has to be modified to either half the value of the maximum amount of unacknowledged data in the network, or twice the SMSS value, whichever is larger (Allman et al., 2009). This is to anticipate further possible congestion of the kind previously experienced when using a higher ssthresh. The adjustment of the cwnd itself is dependent on the TCP implementation. It may be reset to one SMSS, as in TCP Tahoe, in which case slow start is reinitialised until the new ssthresh is reached, or it may be set directly to the new ssthresh, in which case congestion avoidance is directly invoked, as implemented in TCP Reno. The approach taken by both TCP Tahoe and TCP Reno at timeout is illustrated in Figure 2.4.
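The timeout reaction described in this section can be sketched as follows (an illustrative Python fragment following the simplified description above, with windows counted in segments; `flight_size`, the amount of unacknowledged data at the timeout, is an assumed name):

```python
def on_timeout(flight_size: int, variant: str) -> tuple[int, int]:
    """Return (ssthresh, cwnd) in segments after a retransmission timeout."""
    # New ssthresh: half the outstanding data, but at least two segments
    ssthresh = max(flight_size // 2, 2)
    if variant == "tahoe":
        cwnd = 1            # restart slow start from one SMSS
    elif variant == "reno":
        cwnd = ssthresh     # continue directly in congestion avoidance
    else:
        raise ValueError(variant)
    return ssthresh, cwnd
```

This reproduces the divergence visible in Figure 2.4: after the same timeout, Tahoe climbs again from one segment while Reno resumes at the new ssthresh.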

2.2.4 Idle Time Considerations

As later demonstrated in the traffic analysis, idle time within the lifespan of a TCP connection may be substantial in some cases, and accordingly may lead to significant estimation error in the throughput or transmission time. It is then essential to explore how TCP implementations deal with these idle times.

2.2.5 TCP Timers

In addition to purely logical idle time, the conditional transitioning from one TCP stage to another can be time consuming, may negatively affect the evolution of a TCP connection, and may increase the overhead observed either in reaching a smooth data transfer stage or in efficiently terminating the transmission. Accordingly, these transitions have to be bounded by different TCP timers to ensure TCP does not remain stuck in a certain stage. The understanding of these timers was particularly useful when performing manual analysis of TCP connections in further stages of the research.

TCP implementations use two different clocks (tick counters): a slower clock with its interval set to 500 ms, and a faster clock with a 200 ms interval. In order to regulate the value of each TCP timeout, these timers are triggered after a multiple number of ticks (500 ms or 200 ms) as needed (Stevens and Wright, 1995). According to the implementation specifications described by Stevens and Wright (1995), seven types of timers are used, listed as follows:

Connection Establishment Timer: At connection establishment, the first SYN segment sent from the client times out after around 6 seconds (12 ticks). After that, the client sends a second and a third SYN segment, which time out after 24 seconds and 48 seconds respectively. In typical implementations of TCP, after a total period of 75 seconds without a response from the server, the TCP connection is aborted (Stevens and Wright, 1995).

Retransmission Timer: As previously mentioned, the retransmission timer is used to infer a segment loss. For each segment sent, once the RTO timer has elapsed before receiving an ACK from the receiver, the segment is resent. The RTO value is dynamically calculated based on previous values of the smoothed RTT (SRTT) and the variation in RTT (RTTVAR), as described in RFC 6298 (Paxson et al., 2011). Initially, when neither a previous RTT has been measured nor a previous RTO has been calculated, the RTO is set to one second. Once an RTT measurement has been made, SRTT is set to the measured RTT, and RTTVAR is set to half the measured RTT. The current RTO is then calculated as follows:

RTO = SRTT + K * RTTVAR;  where K = 4        (2.2)

With the measurement of further RTTs, the values of SRTT and RTTVAR are adjusted as follows:

RTTVAR = (1 - β) * RTTVAR + β * |SRTT - RTT|;  where β = 1/4        (2.3)


SRTT = (1 - α) * SRTT + α * RTT;  where α = 1/8        (2.4)
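Equations 2.2-2.4 can be sketched as a small estimator (an illustrative Python fragment, not from the thesis; note that RFC 6298 additionally rounds the RTO up to a minimum of one second, which is omitted here):

```python
class RtoEstimator:
    """RTO computation following RFC 6298 (Equations 2.2-2.4)."""
    K, ALPHA, BETA = 4, 1 / 8, 1 / 4

    def __init__(self):
        self.srtt = None      # smoothed RTT
        self.rttvar = None    # RTT variation
        self.rto = 1.0        # one second before any RTT sample exists

    def sample(self, rtt: float) -> float:
        if self.srtt is None:                  # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar \
                + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + self.K * self.rttvar
        return self.rto

est = RtoEstimator()
rto = est.sample(0.2)   # SRTT = 0.2 s, RTTVAR = 0.1 s, so RTO = 0.6 s
```

With stable RTT samples, RTTVAR decays and the RTO shrinks towards the SRTT, which is the adaptive behaviour the retransmission timer relies on.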

Delayed Acknowledgement Timer: Whenever segments have been received but not yet acknowledged, and do not require a direct ACK, the delayed acknowledgement timer is started. Once the timer expires, a cumulative ACK is sent to acknowledge all received segments. The aim of this mechanism is to reduce the overhead resulting from sending direct ACKs for every segment. The typical timer value is 200 ms, and it can be increased up to 500 ms.

Persist Timer: The window size at the receiver has to be continuously advertised to the sender. Since ACK segments are not reliably transmitted, the connection may enter a deadlock state in which the receiver is waiting for further data from the sender, while the sender is waiting for the receiver's window to be advertised. In order to avoid this deadlock, TCP triggers the persist timer whenever a null window size is advertised. If the timer elapses without a non-null value being received, the sender responds by issuing a probe to the receiver.

Keep Alive Timer: The keep-alive timer is an optional service, run at the application layer, which allows one side of the TCP connection to probe the other side to check whether it is still alive after a prolonged period of idle time without any data or acknowledgements being exchanged. The default status of this timer is off, and its default value is set to two hours, as documented in RFC 1122 (Braden, 1989).

FIN_WAIT_2 Timer: A TCP connection termination is initiated by sending a FIN segment from one side to the other and receiving an ACK segment, then waiting for a similar FIN segment in the opposite direction, as previously demonstrated in Figure 2.1. As TCP transitions to the FIN_WAIT_2 stage, a timer of 10 minutes is first started, and then reinitialised to 75 seconds. On the expiration of the second timer, the TCP connection is dropped if no FIN segment has been received.
TIME_WAIT Timer: As mentioned in section 2.2.1.3, TIME_WAIT is the last state that a TCP client ends up in, once FIN and ACK segments have been exchanged in each direction. The client remains in this state for a period equivalent to twice the MSL (Maximum Segment Lifetime), which is the maximum amount of time a segment can remain valid in a network without being discarded, and which has a default value of two minutes. Accordingly, this timer is referred to as the 2MSL timer, and has a default value of four minutes (Stevens and Wright, 1995). At this point, the TCP connection is considered cleanly and logically terminated; however, the TCP socket does not transition to a closed state until the 2MSL period has passed. The purpose of using such a timer is to ensure that no delayed segments will be wrongly considered as part of a subsequent TCP connection (Postel, 1981). This state timer is not of particular interest from a layer 3 (IP) perspective.

2.3 Formula-Based Modelling

Formula-based models depend on mathematical expressions to evaluate the expected TCP throughput from the TCP parameters. The following mathematical model was proposed by Mathis et al. (1997) and is referred to as the square-root formula:

E[R] = M / (T * sqrt(2bp/3))        (2.5)

Where E[R] is the expected TCP throughput, R is the actual throughput, M is the maximum segment size, b is the number of TCP segments per new ACK, T is the RTT, and p is the loss rate. Another mathematical model was proposed by Padhye et al. (1998):

E[R] = min( M / ( T * sqrt(2bp/3) + T0 * min(1, 3 * sqrt(3bp/8)) * p * (1 + 32p^2) ), W/T )        (2.6)

Where T0 is the TCP retransmission timeout period and W is the maximum window size. The value of the TCP throughput is evaluated from the TCP parameters while the TCP flow is in progress, and hence the value is considered an estimate rather than a prediction. A slight modification was introduced by He et al. (2007) by using TCP transfer probes prior to the transfer of the original flow. The probes used could be ping sessions or very small TCP transfers (64KB). These probes are sent periodically in order to determine the TCP parameters on the path, such as RTT and loss rate. These parameters would then be fed into the proposed mathematical model in order to predict the TCP throughput value prior to any flow transfer. The accuracy results obtained from this formula-based model were relatively low, with a median Root Mean Square Relative Error (RMSRE) of 2. The RMSRE was less than 0.4 for only 20% of the traces.
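For illustration, the two formula-based estimators in Equations 2.5 and 2.6 can be sketched in Python as follows (not part of the thesis; the window cap is taken as W/T with W in bytes, following the form of Equation 2.6, and the sample parameter values are assumptions):

```python
import math

def mathis_throughput(M, T, b, p):
    """Square-root formula of Mathis et al. (1997), Equation 2.5 (bytes/s)."""
    return M / (T * math.sqrt(2 * b * p / 3))

def padhye_throughput(M, T, b, p, T0, W):
    """Model of Padhye et al. (1998), Equation 2.6, capped by the window limit."""
    denom = T * math.sqrt(2 * b * p / 3) \
        + T0 * min(1, 3 * math.sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2)
    return min(M / denom, W / T)

# 1460-byte segments, 100 ms RTT, delayed ACKs (b = 2), 1% loss, 1 s RTO
est_mathis = mathis_throughput(1460, 0.1, 2, 0.01)
est_padhye = padhye_throughput(1460, 0.1, 2, 0.01, 1.0, 65535)
```

The Padhye estimate is always at or below the Mathis one, since the added timeout term only enlarges the denominator.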

2.3.1 Cardwell Mathematical Model

A mathematical model was proposed by Cardwell et al. (2000). This model was used as a reference and baseline in this research in order to evaluate the performance results obtained from the AI-based model developed in later stages of the research. The reason for selecting this model in particular was its complete modelling of all the stages observed in a TCP connection, as well as its applicability to both lossless and lossy TCP flows. The model defines and aggregates the time spent during four phases of the TCP connection: slow start, recovery of lost segments, congestion avoidance, and time spent due to delayed acknowledgements, as expressed in Equation 2.7. Each phase and its associated mathematical representation as described in (Cardwell et al., 2000) is explained in the following sections. The script implementation of the model in MATLAB is documented in Appendix C.1.

E[T] = E[T_ss] + E[T_loss] + E[T_ca] + E[T_delack]        (2.7)

Connection Establishment: Cardwell et al. (2000) proposed an estimation of the time cost of connection establishment. Nevertheless, this period was not within the scope of this research, as the TCP throughput or transmission time was mainly evaluated for the data transfer period of TCP connections. The transmission time evaluated by tcptrace was likewise exclusively the time spent from the first data segment to the last data segment observed, excluding connection establishment and termination.

Initial Slow Start: As the occurrence of a segment loss ends the slow start phase, Cardwell et al. (2000) first evaluate this probability in terms of the loss rate, as per Equation 2.8, and then calculate the number of segments expected to be sent during slow start in terms of this probability, as per Equation 2.9.


l_ss = 1 - (1 - p)^d        (2.8)

E[d_ss] = ((1 - (1 - p)^d)(1 - p)) / p + 1,   if loss rate (p) > 0
          d,                                   if loss rate (p) = 0        (2.9)

Knowing the number of segments sent during slow start, the expected window size by the end of slow start is calculated as per Equation 2.10, where γ is the rate of exponential growth of the congestion window during slow start and w1 is the initial window in segments:

E[W_ss] = E[d_ss] * (γ - 1)/γ + w1/γ        (2.10)

The total time spent in slow start is then calculated as per Equation 2.11:

E[T_ss] = RTT * [ log_γ(Wmax/w1) + 1 + (1/Wmax) * (E[d_ss] - (γ*Wmax - w1)/(γ - 1)) ],   when E[W_ss] > Wmax
          RTT * log_γ( E[d_ss] * (γ - 1)/w1 + 1 ),                                       when E[W_ss] <= Wmax        (2.11)

Occurrence of First Loss: Cardwell et al. (2000) then evaluate the probability that losses are detected through a retransmission timeout (RTO), as per Equation 2.12, and the expected time cost of an RTO, as per Equation 2.14.

Q(p, w) = min( 1, (1 + (1 - p)^3 * (1 - (1 - p)^(w-3))) / ((1 - (1 - p)^w) / (1 - (1 - p)^3)) )        (2.12)

G(p) = 1 + p + 2p^2 + 4p^3 + 8p^4 + 16p^5 + 32p^6        (2.13)

E[Z^TO] = G(p) * T0 / (1 - p)        (2.14)

T_loss, calculated in Equation 2.15, is then the expected time spent recovering from segment loss.


Chapter 2. Literature Review

T_loss = l_ss * ( Q(p, E[W_ss]) * E[Z^TO] + (1 - Q(p, E[W_ss])) * RTT )        (2.15)

Transferring the Remainder: The amount of data segments to be transmitted after slow start and loss recovery is calculated by Equation 2.16, where d_ca is the amount of data left to be transmitted after slow start and the loss occurrence; the expected size of the congestion window W(p) at a segment loss event is calculated by Equation 2.17.

E[d_ca] = d - E[d_ss]        (2.16)

W(p) = (2 + b)/(3b) + sqrt( 8(1 - p)/(3bp) + ((2 + b)/(3b))^2 )        (2.17)

Cardwell et al. (2000) evaluate the steady-state throughput (R) using Equation 2.18, and accordingly deduce the time needed for transmitting the remainder of the segments at that throughput, as per Equation 2.19. This is the time spent in the congestion avoidance phase.

R = ( (1 - p)/p + W(p)/2 + Q(p, W(p)) ) / ( RTT * ((b/2) * W(p) + 1) + Q(p, W(p)) * G(p) * T0 / (1 - p) ),   if W(p) < Wmax

    ( (1 - p)/p + Wmax/2 + Q(p, Wmax) ) / ( RTT * ((b/8) * Wmax + (1 - p)/(p * Wmax) + 2) + Q(p, Wmax) * G(p) * T0 / (1 - p) ),   otherwise        (2.18)

T_ca = d_ca / R        (2.19)

Delayed Acknowledgements: The delayed ACK timer is meant to delay the transmission of ACKs, combining several ACKs into a single one in order to minimise the overhead. This timer depends on the TCP implementation, and usually ranges from 150 to 200 ms.
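Parts of the Cardwell model above can be sketched in Python as follows (an illustrative fragment covering Equations 2.8 and 2.12-2.15 only; the thesis's reference implementation is the MATLAB script in Appendix C.1, and the guard for p = 0 is an added assumption):

```python
def l_ss(p, d):
    """Probability that slow start ends with a loss (Equation 2.8)."""
    return 1 - (1 - p) ** d

def G(p):
    """Equation 2.13: factor accounting for successive timeout backoffs."""
    return 1 + p + 2*p**2 + 4*p**3 + 8*p**4 + 16*p**5 + 32*p**6

def Q(p, w):
    """Equation 2.12: probability that a loss in a window of w segments
    is detected by a retransmission timeout rather than fast retransmit."""
    if p <= 0 or w <= 3:
        return 1.0   # assumed guard: the closed form is undefined here
    num = 1 + (1 - p) ** 3 * (1 - (1 - p) ** (w - 3))
    den = (1 - (1 - p) ** w) / (1 - (1 - p) ** 3)
    return min(1.0, num / den)

def expected_rto_cost(p, T0):
    """Equation 2.14: expected duration of a timeout sequence."""
    return G(p) * T0 / (1 - p)

def t_loss(p, d, E_Wss, T0, rtt):
    """Equation 2.15: expected time spent recovering the first loss."""
    q = Q(p, E_Wss)
    return l_ss(p, d) * (q * expected_rto_cost(p, T0) + (1 - q) * rtt)
```

For a lossless flow (p = 0) the loss term vanishes, which is consistent with the separate treatment of lossless and lossy connections adopted later in this research.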



2.4 Previous Research and Machine Learning Approaches

Machine learning is an area of artificial intelligence (AI) in which computers are able to modify and adapt their behaviour, taking actions, making decisions, or making predictions in such a way that these actions and decisions become progressively more accurate (Freeman and Skapura, 1991). This part of the literature review provides a survey of the previous research approaches taken in developing models for TCP performance evaluation using machine learning techniques, highlighting their research methodologies and associated results and findings. Then, an overview of artificial intelligence modelling is presented, which particularly describes the concepts behind artificial neural networks. Detailed information about backpropagation artificial neural networks is included in Chapter 5, where the modelling approach taken and the techniques used within this research project are described.

Research was carried out by He et al. (2007) to develop a model for predicting the TCP throughput of bulk TCP transfers in particular. As a testbed, their research made use of the MIT RON (Resilient Overlay Networks) project, whose architecture is made up of 50-60 nodes distributed across universities, research labs, and ISPs in the US, Europe, and Asia. In their research, they initially emphasised the difference between performance estimation and performance prediction for a network path.

2.4.1 Performance Estimation

The estimation of TCP performance is performed after the TCP flow has started, and can be evaluated all along the transmission. For a certain flow, the TCP parameters and path characteristics are fed into the TCP performance evaluation model in order to estimate the value of the TCP throughput. This approach is considered non-intrusive, as no additional traffic is generated on the network path, as opposed to the approach taken in performance prediction.

2.4.2 Performance Prediction

The objective of predicting the performance of a TCP transmission is to evaluate the expected TCP throughput value prior to the start of the TCP flow. This approach is usually performed using probes, such as the ping utility or small TCP transfers, generated and scheduled periodically. The measurements obtained from these probes are then used as inputs for TCP throughput evaluation models. This probing approach can be considered highly intrusive if it leads to the saturation of the network path, and hence the probes should be limited in both size and frequency as much as possible. He et al. (2007) classified the models used to evaluate TCP performance into two classes: formula-based or mathematical models, such as those previously described in this chapter, and history-based models, each approach having its own advantages and drawbacks.

2.4.3 History-Based Models

History-based models mainly depend on previous knowledge acquired from historical TCP transfers. The models use adaptive learning in order to form relationships between observed path characteristics and the resulting TCP throughput of each transfer. Accordingly, history-based models are independent of the TCP implementation used at the server and the receiver ends, which is considered a great advantage over mathematical models. He et al. (2007) developed a history-based prediction model based on linear predictors such as the Moving Average, the Exponentially Weighted Moving Average, and non-seasonal Holt-Winters. Such linear predictors perform mathematical operations to estimate future values of the TCP throughput as a linear function of previous samples. Their history-based model gave better accuracy, with an RMSRE of less than 0.4 for 90% of the traces. He et al. (2007) suggested utilising hybrid predictors, which would consider TCP transfer characteristics as well as throughput history in order to obtain more accurate throughput estimates. It was also suggested to develop TCP throughput models that consider the path's load, buffering, and cross-traffic nature as inputs, in a way that would make the model independent of TCP connection characteristics.

Further research was carried out by Mirza et al. (2010), in which a machine learning approach was adopted to predict TCP throughput. They used Support Vector Regression (SVR), a supervised learning method used for pattern classification that depends on the dataset used for training the classification model. They used a laboratory testbed consisting of end hosts connected through a dumbbell topology with a bottleneck point to create and control congestion throughout their experiments. Monitoring cards were placed at the congestion point to capture packets leaving and entering the bottleneck. The measurements used in their models were the available bandwidth on the congested link, the queuing at the bottleneck node, and the loss rate. They used both passive and active path measurements: for the passive measurements, the parameters (available bandwidth, queuing, and loss rate) were obtained from pre-captured TCP flows, and for the active measurements the same parameters were obtained from the active monitoring cards. The results obtained from their experiments indicated that for bulk TCP transfers, the predicted TCP throughput was within 10% of the actual value 87% of the time. For possible future work, Mirza et al. (2010) suggested considering machine learning tools other than the SVR approach. They also emphasised the importance of finely tuning the training sets used in the supervised learning process.

A research approach for estimating TCP performance using neural networks was adopted by Ghita et al. (2005). They used three sources of captured traffic: synthetic connections generated by network simulators, semi-supervised connections captured from automatic retrieval tools, and unsupervised traffic captured from real network traffic traces. They divided their training datasets into two categories, one without packet losses and another with packet losses, and used the Stuttgart Neural Network Simulator (SNNS) for training the datasets and developing the neural network model. The results obtained from the neural network model revealed a significant improvement over the mathematical model, with nearly a tenfold improvement in the relative error.
On the other hand, for the traffic with packet losses, the mathematical model showed better performance.
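The linear history-based predictors evaluated by He et al. (2007), such as the EWMA, can be sketched as follows (an illustrative Python fragment; the smoothing factor is an assumed value, not one reported in their study):

```python
def ewma_forecast(history, alpha=0.3):
    """Predict the next transfer's throughput as an exponentially
    weighted moving average of previously observed throughputs."""
    if not history:
        raise ValueError("need at least one past measurement")
    forecast = history[0]
    for sample in history[1:]:
        # Recent transfers weigh more; older history decays geometrically
        forecast = alpha * sample + (1 - alpha) * forecast
    return forecast

# Throughputs (Mbit/s) of past transfers on the same path
prediction = ewma_forecast([10.0, 12.0, 11.0, 13.0])
```

Because only completed transfers feed the forecast, no probe traffic is injected while a flow is in progress, which is what makes such predictors non-intrusive.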

2.4.4 Artificial Neural Networks

An artificial neural network is a machine learning tool used for recognition or classification tasks. The use of artificial neural networks can drastically simplify the complex mathematical models needed for modelling. Additionally, neural networks are recognised for improving estimation accuracy, by inferring new relationships among inputs, and between inputs and associated target outputs. Neural networks are also recognised for being able to extend their recognition and classification knowledge, by associating new estimated output values with inputs that have not been previously encountered by the neural network, either during training or validation, and hence being applicable to extended and larger datasets.

In this research, backpropagation feed-forward neural networks will primarily be considered for the modelling process. Backpropagation, or propagation of error, is a common method of teaching artificial neural networks how to perform a given task. It is a supervised learning method, and is an implementation of the Delta rule. It is most useful for feed-forward networks, i.e. networks that have no feedback (Freeman and Skapura, 1991).
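As a minimal illustration of backpropagation in a feed-forward network (a toy Python sketch, not the SNNS- or MATLAB-based models used in this research; the 2-4-1 layer sizes, learning rate, and XOR-style data are illustrative assumptions):

```python
import math
import random

random.seed(0)
sigmoid = lambda z: 1 / (1 + math.exp(-z))

# Toy supervised dataset (XOR-style inputs and targets)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2-4-1 network: hidden and output weights plus biases, randomly initialised
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b1 = [0.0] * 4
W2 = [random.uniform(-1, 1) for _ in range(4)]
b2 = 0.0
lr, errors = 0.5, []

for _ in range(4000):
    sse = 0.0
    for x, target in data:                      # per-pattern updates
        # Forward pass through the hidden and output layers
        h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
             for j in range(4)]
        out = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
        sse += (out - target) ** 2
        # Backward pass: delta rule at the output, error propagated to hidden
        d_out = (out - target) * out * (1 - out)
        d_h = [d_out * W2[j] * h[j] * (1 - h[j]) for j in range(4)]
        for j in range(4):
            W2[j] -= lr * d_out * h[j]
            for i in range(2):
                W1[j][i] -= lr * d_h[j] * x[i]
            b1[j] -= lr * d_h[j]
        b2 -= lr * d_out
    errors.append(sse / len(data))
```

The mean squared error recorded in `errors` falls as the weights are adjusted, which is the supervised-training behaviour the modelling stage in Chapter 5 relies on.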

2.5 Summary

In this chapter, the theoretical fields of study and research related to the project have been covered. A general overview of the transitions between TCP stages, and a brief description of each stage, has been provided. The different timers and associated periods of idle time have been justified. This was essential in order to obtain a good understanding of the different conditions encountered by TCP connections.

An overview has also been given of previous research based on machine learning approaches, including the conditions considered, such as the network topologies from which traffic was captured and the number of testbed samples used. The findings of each study have been presented, and the accuracy results obtained by the models developed have been documented. This was necessary in order to establish an initial baseline for the expected accuracy and performance of artificial intelligence methods in modelling TCP connections. Some assumptions have been made regarding the approaches to consider when developing neural network models, such as the type of traffic to be used, and how to categorise traffic according to the presence of any lost segments. These results and findings are expected to be compared with the results obtained by the completion of this project.


Chapter 3. Research Methodology
After completing the literature review, the practical part of the project was carried out in four main stages: collecting and analysing different traffic captures, extracting the relevant TCP parameters needed, filtering several data subsets as required, and finally using these subsets to train the neural network. The main stages of the project are shown in the process diagram in Figure 3.1 and are explained in the following sections:

[Figure: process diagram in which TCP traffic feeds parameter extraction; the extracted TCP input parameters drive both the mathematical model (calculated performance) and the neural network models (lossless and lossy, giving estimated performance); TCP performance (transmission time and throughput) is then assessed through post-processing traffic analysis, accuracy analysis (correlation, MSE) and manual analysis, with feedback into the modelling.]

Figure 3.1: Process diagram of research stages.

3.1 Data Acquisition

Using synthetic connections for testing was a possible option. However, the main aim of the research was to study and investigate the behaviour of everyday Internet traffic, and hence to rely solely on connections captured from either large enterprises or T1 lines. Several sources of captured traffic were considered in analysing and training the AI-based neural network model. The initial purpose of this was to cover as many types of connections, under various conditions, as possible in the training process, aiming to obtain better learning rates and faster convergence of the neural model. Another reason was to cover different traffic types, in order to develop a robust neural model that provides better estimation accuracy. It was observed that the throughput of TCP connections depends not only on the TCP parameters of each TCP connection, but also on the network conditions of each trace; conditions that are not accounted for in conventional TCP mathematical models, such as the behavioural sending characteristics of the TCP servers, resulting in varied and inconsistent idle time periods. The following sections describe the characteristics of the three sources of captured traffic used in this research.

All analysis and modelling within the research was initially performed on an aggregated dataset including both the traffic captured from the Brescia University campus and the few traffic traces collected from the MAWI Group. The purpose of aggregating these traces into one dataset was to obtain a single dataset large enough for neural network modelling, and to ensure that the number of TCP connections considered as training and validation samples when developing the neural network model would be sufficient to avoid over-fitting of the models to the data available. The total number of connections aggregated from these two sources was 1,900,440 TCP connections. At later stages of the research, the dataset collected from the campus of Plymouth University was used in order to validate the results and analysis performed. The total number of connections captured on the campus of Plymouth University and used for results validation was 6,355,344 TCP connections.

Campus Network of Brescia University (UNIBS): These traces were collected on the edge router of the campus network of the University of Brescia on three consecutive working days. They are mainly composed of TCP (99%) and UDP traffic, corresponding to around 79,000 flows in total (UNIBS: Data sharing, 2011). More information on these traffic traces is listed in Appendix A.1.

MAWI Working Group Traffic Archive: These are daily traces captured at a trans-Pacific line (150 Mbps link) (MAWI Working Group Traffic Archive, n.d.). Several traffic traces were selected and aggregated into a single trace. Detailed information about these traces is listed in Appendix A.2.


Campus Network of Plymouth University: Hundreds of traffic captures were collected at the campus of Plymouth University. These captures were aggregated into a single dataset after excluding incomplete connections. The dataset included 5,665,167 TCP connections made by local clients to remote servers, and 690,186 TCP connections made by remote clients to local servers.

3.2

Data Pre-processing

The following tools and software were used during the research.

3.2.1

TCPTRACE

All collected traces have been initially processed through tcptrace in order to produce datasets of TCP flow records associated with each trace. tcptrace is a network analysis tool running under Linux, which accepts traffic captures from other tools such as tcpdump and produces information about each TCP connection seen in that traffic. Further information on TCP parameter extraction is given in Chapter 4.

3.2.2

Data Processing in MATLAB

Once datasets of TCP records were available, they were imported into MATLAB for further processing. Each dataset was then divided into two subsets according to whether segment loss was identified or not. The lossless and lossy subsets have been used separately in all stages of the research in order to provide clearer results and analysis of the capability of the models to estimate the performance of lossy TCP traffic. Detailed steps of the different stages of data pre-processing are included in Chapter 4.
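The lossless/lossy split described above can be sketched as follows (a minimal Python illustration; the research itself performed this step in MATLAB, and the record field names here are hypothetical, not tcptrace output names):

```python
# Partition a dataset of TCP flow records into lossless and lossy subsets,
# using the count of triple duplicate ACKs as the loss indicator.
def split_by_loss(records):
    lossless = [r for r in records if r["triple_dupacks"] == 0]
    lossy = [r for r in records if r["triple_dupacks"] > 0]
    return lossless, lossy

flows = [
    {"id": 1, "triple_dupacks": 0},
    {"id": 2, "triple_dupacks": 3},
    {"id": 3, "triple_dupacks": 0},
]
lossless, lossy = split_by_loss(flows)
```

Keeping the two subsets separate from the outset means each model only ever sees the parameter set appropriate to it (the lossy model additionally receives loss-related inputs, as defined in Chapter 5).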

3.3

Neural Network Modelling in MATLAB

The MATLAB Neural Network Toolbox was used to develop the neural network models due to its efficiency and simplicity in designing different models. The toolbox also provides automated visualisation tools for detailed performance measures of the models developed.



3.4

Statistical Analysis in MATLAB

All statistical analyses in this research were performed using the Statistics Toolbox in MATLAB. The output of the developed neural network model, representing the transmission time estimated for each TCP connection, was compared with the actual value of transmission time as collected by tcptrace. This comparison was done using correlation analysis in MATLAB.

3.4.1

Regression

According to Hair et al. (1995), regression analysis is a general statistical technique used to analyse and identify a relationship between a single dependent parameter and a set of other independent parameters. In this research, regression analysis is mainly applied between the actual throughput and the throughput estimated by each model. This regression is represented by a simple fitting line as shown in Figure 3.2. Residual values are represented as the deviations from the fitted regression line.
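The fitting line and residuals can be computed as in the following minimal Python sketch (illustrative data; the research used MATLAB's regression tooling):

```python
# Ordinary least-squares fitting line between targets (actual values) and
# outputs (model estimates); residuals are deviations from the fitted line.
def fit_line(targets, outputs):
    n = len(targets)
    mt = sum(targets) / n
    mo = sum(outputs) / n
    slope = sum((t - mt) * (o - mo) for t, o in zip(targets, outputs)) \
        / sum((t - mt) ** 2 for t in targets)
    intercept = mo - slope * mt
    return slope, intercept

targets = [1.0, 2.0, 3.0, 4.0, 5.0]   # actual transmission times
outputs = [1.1, 1.9, 3.2, 3.8, 5.0]   # model estimates
slope, intercept = fit_line(targets, outputs)
residuals = [o - (slope * t + intercept) for t, o in zip(targets, outputs)]
```

A perfect model would give the line Y = T (slope 1, intercept 0) with all residuals zero; deviations of the fitted line from Y = T indicate systematic bias in the estimates.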

Figure 3.2: Regression analysis showing regression tting line and residual values.

3.4.2

MSE

The performance of the neural network model was continuously evaluated during the learning process using the Mean Squared Error (MSE) between actual and estimated throughput values. According to Kleinbaum et al. (1997), the MSE is expressed as the sum of squared errors divided by the corresponding degrees of freedom (n - k - 1), where k is the number of independent variables in the model and n is the number of samples. The MSE is expressed in Equation 3.1:

MSE = S^2 = (1 / (n - k - 1)) * Σ (e_i)^2   (3.1)
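A small numeric sketch of this MSE definition, assuming k independent variables and n samples (illustrative errors only):

```python
# MSE with (n - k - 1) degrees of freedom, where n is the number of
# samples and k the number of independent variables in the model.
def mse(errors, k):
    n = len(errors)
    return sum(e * e for e in errors) / (n - k - 1)

errors = [0.5, -0.5, 1.0, -1.0, 0.0, 0.5, -0.5, 1.0]  # e_i = actual - estimated
value = mse(errors, k=2)  # n = 8, k = 2, so divide the sum of squares by 5
```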

3.4.3

Absolute Relative Error

The statistical cumulative distribution function (CDF) of the absolute relative error was used to study the accuracy obtained by each model and to compare it with other models or other modelling criteria.
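An empirical CDF of the absolute relative error can be sketched as follows (illustrative values; the research used MATLAB's Statistics Toolbox for this analysis):

```python
# Empirical CDF of the absolute relative error between actual and
# estimated transmission times: the fraction of flows whose error does
# not exceed a given threshold.
actual = [2.0, 4.0, 8.0, 1.0]
estimated = [2.2, 3.0, 8.0, 1.5]

errors = sorted(abs(a - e) / a for a, e in zip(actual, estimated))

def cdf(threshold):
    return sum(1 for e in errors if e <= threshold) / len(errors)

within_25pct = cdf(0.25)  # fraction of flows estimated within 25% error
```

Reading off such a curve at fixed error levels (e.g. 10% or 25%) gives a direct, distribution-wide accuracy comparison between models.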

3.5

Base Line for Analysing Model Accuracy

The TCP throughput estimation accuracy of the mathematical model defined by Cardwell et al. (2000) was primarily used as a baseline to evaluate the performance of the neural network models developed in MATLAB. The same parameters used for the mathematical model were used for the neural network models, in order to evaluate accuracy measurements under the same modelling conditions. The estimation accuracy of each of the mathematical and neural network models prior to any filtering of the TCP connections was also considered as a second baseline, to evaluate the change in estimation accuracy before and after applying filtering conditions to each model individually.

3.6

Summary

This chapter provided an overview of the principal stages of the project and of the flow of data and feedback between the processing, modelling and analysis steps. Brief descriptions were given of the tools used for data acquisition, data pre-processing, and neural network modelling. Finally, a brief explanation was given of the different statistical analysis methods used to evaluate the performance of the models developed, and of how the mathematical model was used as a baseline for modelling performance evaluation.


Data Pre-processing and Traffic Analysis


This chapter aims to provide an overview of the sources of network traffic used during the research, and to explain the different stages of pre-processing performed on the traces prior to any analysis and prior to modelling the neural network. The chapter also includes basic statistical analysis performed to understand the typical distribution of TCP parameter values, in order to anticipate any filtering criteria that may be further investigated when modelling the neural network.

4.1

Types of Traffic

TCP applications can be categorised into two major types producing two different trends of traffic data flow: interactive data flow, which is characterised by smaller segment sizes and bidirectional flow of data, and bulk data flow, which is characterised by large segment sizes usually in one direction, typically from server to client (Stevens, 1993). TCP algorithms are expected to deal with both kinds of traffic efficiently using the different algorithms summarised in Chapter 2. The proportion of each type of data flow may be expressed either in the number of packets exchanged on the Internet or in the size of these data flows in bytes. A study by Caceres et al. (1991) implied that interactive data flow packets are responsible for 25-45% of all Internet traffic; in terms of network bytes, however, bulk transfers represent 90-95% of the overall traffic. In this research no prior classification based on traffic type has been applied to TCP flows, aiming to develop a robust model applicable to all sorts of connections. Nevertheless, in later stages of the project, interactive data flows were excluded from the training datasets in order to evaluate the contribution of these flows to the inaccuracy of throughput estimation for both the mathematical and neural network models.



4.2

Extracting TCP Parameters

All collected traffic traces have been processed using tcptrace in order to generate a complete dataset of records, each record containing information about a TCP connection as identified by the source and destination IP addresses and ports. Among the many parameters produced by tcptrace, the parameters listed in Table 4.1, as defined by Ramadas (2003), were of particular interest for this research.

Table 4.1: TCP parameters of interest as collected by tcptrace.

SYN/FIN pkts sent: The count of all packets with the SYN/FIN flags set.
actual data sent: The total bytes transmitted during the data transfer stage, including any retransmissions.
total packets: The total number of packets, including packets exchanged during connection establishment and termination.
RTT avg: The average value of the RTT.
avg segm size: The average segment size over the lifetime of a TCP connection.
initial window pkts: The number of segments within the initial window advertised.
max owin: The maximum amount of unacknowledged data in bytes observed during the connection lifetime. As the TCP congestion window at the sender side cannot be determined directly, it is estimated using the outstanding unacknowledged data.
max segm size: The maximum segment size.
triple dupacks: The total number of triple duplicate acknowledgements received by the sender, usually used to represent the number of assumed lost segments over a TCP connection.
avg retr time: The average retransmission time between successive transmission and retransmission of a segment.
data xmit time: The total data transmission time, excluding time spent during connection establishment and termination.

4.3

TCP Parameters Pre-processing

The following stages of pre-processing have been applied to the collected traffic traces, in order to obtain a set of variables representing each TCP flow that can subsequently be used for neural network modelling.



4.3.1

Identifying Valid TCP Flows

The available datasets have been filtered to exclude all TCP flows that are either incomplete or have TCP parameters which would be considered invalid for modelling. The following filtering criteria have been applied:

1. Exclude incomplete TCP connections by investigating the number of SYN and FIN segments exchanged. A complete TCP connection would normally include at least a single SYN/FIN set in each direction.
2. Exclude TCP flows with no data transmitted in either direction.
3. Exclude TCP flows for which the measured average RTT is null.
4. Exclude TCP flows for which the initial sending window is null.

The total number of valid TCP flows after filtering invalid and incomplete connections is listed in Table 4.2.

Table 4.2: Number of valid TCP flows for both lossless and lossy subsets.

Dataset               Total Flows   Valid Flows   Valid Lossless   Valid Lossy
UNIBS                 79,630        45,944        45,024           920
MAWI                  1,820,810     83,733        76,297           7,436
Plymouth University   6,355,344     2,471,273     2,413,974        57,299
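The four validity filters can be sketched as follows (Python for illustration; the field names are hypothetical, not tcptrace output names):

```python
# A flow is kept only if it passes all four validity criteria.
def is_valid(flow):
    return (
        flow["syn"] >= 1 and flow["fin"] >= 1   # 1. complete connection
        and flow["data_bytes"] > 0              # 2. data actually transmitted
        and flow["rtt_avg"] > 0                 # 3. measurable average RTT
        and flow["initial_window"] > 0          # 4. non-null initial window
    )

flows = [
    {"syn": 1, "fin": 1, "data_bytes": 512, "rtt_avg": 0.08, "initial_window": 2920},
    {"syn": 1, "fin": 0, "data_bytes": 512, "rtt_avg": 0.08, "initial_window": 2920},  # incomplete
    {"syn": 1, "fin": 1, "data_bytes": 0, "rtt_avg": 0.08, "initial_window": 2920},    # no data
]
valid = [f for f in flows if is_valid(f)]
```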

4.3.2

Selection of Forward Direction

TCP connections are bidirectional and, according to RFC 3449 (Balakrishnan et al., 2002), the forward direction of a connection is characterised by the more voluminous data flow. For server-client connections, this direction is usually from the server to the client. The reverse direction, on the other hand, is characterised by less data being transmitted and is usually used for acknowledging data sent in the forward direction. At this stage of data pre-processing, all TCP connections have been processed to select the forward direction based on the amount of data transmitted in each direction. Only the forward direction has been considered for modelling.
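A minimal sketch of the direction selection (illustrative Python; the byte counts are hypothetical):

```python
# Choose the forward direction of a bidirectional connection as the
# direction carrying more data bytes, per the RFC 3449 convention.
def forward_direction(bytes_a2b, bytes_b2a):
    return ("a2b", bytes_a2b) if bytes_a2b >= bytes_b2a else ("b2a", bytes_b2a)

direction, payload = forward_direction(bytes_a2b=1460, bytes_b2a=120)
```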



4.3.3

Classification of Lossless and Lossy Flows

The occurrence of segment loss on a network path is indicated once the TCP sender receives triple duplicate acknowledgements (Kurose and Ross, 2009), as demonstrated in Chapter 2. Valid TCP flows have been classified into two different subsets according to the number of triple duplicate ACKs sent in the reverse direction: a subset with lossless connections, and another subset with lossy connections. The segment loss rate for lossy connections has been evaluated as per Equation 4.1. Although this method of calculation may not be perfectly accurate, it has shown realistic results when introducing the calculated loss rate (p) to the mathematical model as described by Cardwell et al. (2000).

Loss rate (p) = Triple duplicate ACKs / Total number of actual data segments   (4.1)
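A numeric sketch of Equation 4.1 (illustrative values):

```python
# Approximate segment loss rate from the triple duplicate ACKs observed
# in the reverse direction of a connection.
def loss_rate(triple_dupacks, data_segments):
    return triple_dupacks / data_segments

p = loss_rate(triple_dupacks=3, data_segments=600)  # p = 0.005
```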

Figure 4.1: Percentages of both lossless and lossy TCP connections within the network traffic captured.

4.3.4

Computing the Mathematical Throughput Estimate

Cardwell's mathematical model, demonstrated in Chapter 2, has been used to evaluate the estimated throughput of each TCP connection. The accuracy of the mathematical model was considered as a baseline to evaluate the performance and accuracy of the neural network model developed.



4.3.5

Normalisation of TCP Parameters

The actual values of TCP parameters as produced by tcptrace were found to vary over very large ranges, and to have different scales for each parameter. Hence it was essential to either normalise or standardise these values. Several standardisation techniques were experimented with, scaling the inputs and targets to values ranging from 0 to 1 according to the CDF of each TCP parameter. However, these techniques did not lead to much improvement. Normalising the TCP parameters using a natural logarithmic function was found to provide better accuracy figures and faster learning convergence when training the neural network.
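The natural-log normalisation can be sketched as follows (illustrative values; the research applied this step in MATLAB):

```python
# Natural-log normalisation of TCP parameters spanning several orders of
# magnitude, compressing them into a comparable range for training.
import math

rtt_ms = [1.0, 10.0, 100.0, 1000.0]           # four orders of magnitude
data_bytes = [100.0, 10_000.0, 1_000_000.0]

rtt_norm = [math.log(v) for v in rtt_ms]       # ln(1) = 0 .. ln(1000) ≈ 6.9
bytes_norm = [math.log(v) for v in data_bytes]
```

The compression is the point: raw values spanning six orders of magnitude map into a span of roughly 0 to 14, which keeps no single input from dominating the weight updates.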

4.4

Statistical Distribution of TCP Parameters

4.4.1

Throughput

The TCP throughput of the traffic collected at both Brescia University and Plymouth University shows a similar and even distribution. However, the TCP throughput values of the MAWI traffic were relatively lower, with a higher proportion of connections having a throughput of around 10 Bps, as shown in Figure 4.2.

Figure 4.2: Cumulative distribution of throughput.



4.4.2

Data Transmitted

The amount of actual data transmitted over connections at Plymouth University was relatively very low, with a mean value of 329 bytes, compared to 245,257 bytes and 100,905 bytes at Brescia University and the MAWI group respectively. The CDF of data transmitted is shown in Figure 4.3.


Figure 4.3: Cumulative distribution of data transmitted.

4.4.3

Initial Window Size

The initial window size (IW) in the three sources of data was observed to comply with the slow start algorithm. The mean values of the IW are listed in Table 4.3. The majority of TCP flows were observed to use either two or three segments for the IW. Interestingly, however, some connections used an IW larger than four segments, sometimes reaching 12 segments, as shown in Figure 4.5. This contradicts the guidelines set in RFC 5681, which define a maximum limit of four segments for the IW if the MSS value is less than 1095 bytes (Allman et al., 2009), implying that some TCP implementations do not exactly follow the RFC guidelines.





Figure 4.4: Cumulative distribution of initial window bytes.


Figure 4.5: Cumulative distribution of initial window packets.

4.4.4

Maximum Segment Size

Figure 4.6 shows that the majority of TCP flows in both the UNIBS and MAWI datasets had an MSS value of either 1430 bytes or 1460 bytes. The dataset from Plymouth University was particularly constrained, with an MSS of 1368 bytes. These observations were taken into consideration when filtering outliers at later stages of the research.

Figure 4.6: Cumulative distribution of MSS.

Table 4.3: Mean values of TCP parameters evaluated for the three datasets used.

TCP Parameter                    UNIBS       MAWI        Plymouth University
MSS (bytes)                      968.76      1162.02     1283.86
Maximum Idle Time (sec)          30.06       6.02        23.54
Average RTT (msec)               92.01       229.29      38.17
Throughput (Bps)                 6528.35     17751.86    33596.33
Initial Window Size (bytes)      1496.65     1922.08     2930.56
Initial Window Size (packets)    1.69        1.81        2.66
Actual Data Bytes (bytes)        245256.75   100905.68   329.48
Transmission Time (sec)          33.97       6.24        13.73

4.4.5

Data Transmission Time

The distribution of data transmission times of TCP connections in the three datasets was found to be even, without any outstanding observations, as shown in Figure 4.7.




Figure 4.7: Cumulative distribution of data transmission time.

4.4.6

Average RTT

The highest average RTT values were observed on the MAWI dataset, with a mean value of 229 msec; the mean values of the average RTT at Brescia University and Plymouth University were 92 msec and 38 msec respectively. Figure 4.8 shows the comparatively lower response times at Plymouth University.
Figure 4.8: Cumulative distribution of RTT.



4.4.7

Maximum Idle Time

As described in Section 2.2.5, there are seven conditions in which a TCP connection may encounter significant idle time, and the timer values as implemented may be considered substantially large. The only measure of idle time evaluated by tcptrace is the maximum period of idle time in a TCP connection, and it is only considered if it occurred during data transfer, neither at connection establishment nor after data transfer is complete while waiting for connection termination. Nevertheless, the maximum idle time still provides a good estimate or representation of the total idle time within TCP flows. Figure 4.9 demonstrates the CDF of the maximum idle time for the three sources of data. The highest mean value observed was 30.06 seconds at Brescia University, compared to 23.54 seconds at Plymouth University, and only 6.02 seconds for the MAWI dataset.
Figure 4.9: Cumulative distribution of maximum idle time.

Further statistical distribution analysis was performed on the TCP timing parameters (i.e. transmission time, average RTT, and maximum idle time) using box-and-whisker diagrams, as shown in Figures 4.10, 4.11 and 4.12. The purpose of this analysis was to:

- Identify the 2nd percentile and the 98th percentile of each TCP parameter, and exclude statistical outliers from the dataset used for analysis and neural network training in later stages of the research.


- Identify and demonstrate the significance of the maximum idle time observed in a TCP connection with respect to both the average RTT and the data transmission time. This observation implied the need to study the effect on throughput estimation accuracy of excluding connections with relatively high idle time periods from the datasets.
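The percentile-based outlier exclusion can be sketched as follows (a minimal Python illustration; the research performed this in MATLAB, and the simple fractional-rank cut points here are an illustrative choice):

```python
# Keep only values between the 2nd and 98th percentile of a timing
# parameter, discarding statistical outliers at both tails.
def trim_outliers(values, low_pct=2, high_pct=98):
    s = sorted(values)
    n = len(s)
    lo_i = int(n * low_pct / 100.0)    # samples trimmed below the 2nd percentile
    hi_i = int(n * high_pct / 100.0)   # samples trimmed above the 98th percentile
    return s[lo_i:hi_i]

values = list(range(1, 101))  # synthetic timing values 1..100
kept = trim_outliers(values)  # drops the bottom 2 and top 2 values
```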

Figure 4.10: Box-and-whisker diagrams of TCP time parameters (UNIBS Traffic)

Figure 4.11: Box-and-whisker diagrams of TCP time parameters (MAWI Traffic)


Figure 4.12: Box-and-whisker diagrams of TCP time parameters (Plymouth University Traffic)



4.5

Summary

The first part of this chapter reviewed and justified the different stages of data pre-processing applied to the collected traces: identifying connections with valid TCP parameters to ensure correct modelling of the developed neural network, excluding incomplete TCP connections, and then selecting the forward direction of each connection and discarding the reverse direction of flows. The criteria upon which segment loss was assumed in a TCP connection were explained, along with how the datasets were divided into two subsets of lossless and lossy TCP flows, after which the mathematical throughput estimate was evaluated for each sample. The process of parameter normalisation was defined and justified. The second part of the chapter presented the statistical analysis performed on all the TCP parameters used for modelling, the conclusions made based on this analysis, and how these affected the modelling criteria in later stages of the project.


Neural Network Modelling


Artificial neural networks have been chosen as a supervised machine learning approach, providing the neural network model with the TCP flow parameters as inputs and the actual data transmission time as the target. This chapter gives a brief overview of backpropagation feed-forward neural networks and of how the associated model parameters have been defined. Two separate neural networks have been used for modelling TCP throughput, as previously justified: one model for lossless TCP flows, and another for lossy flows. The structure and training parameters of each model are selected and justified in this chapter. The selection of these parameters was based upon best practices, as well as performance measures observed during the research.

5.1

Backpropagation Feed Forward Neural Networks

Backpropagation neural networks are multilayer perceptron networks structured in three kinds of layers, as shown in Figure 5.1: a single input layer, a single output layer, and a number (n) of hidden layers between inputs and outputs. The number of hidden layers and nodes is purely dependent on the order of complexity of the system being modelled. According to Mehrotra et al. (1997), backpropagation neural networks are based on gradient descent: during the learning process, the weights between neurons are updated in the direction associated with the negative gradient of the error values between outputs and targets.





Figure 5.1: Simplied neural network structure

The backpropagation process, or propagation of error, aims to update the weights between neurons in order to minimise the squared error resulting from the simulation of input values through the network (Callan, 1998). The algorithm adopts the delta rule to adjust weights in feed-forward neural networks, where no feedback is given from the output layer to the input layer. The backpropagation algorithm can be summarised into two iterative processes as follows:

Forward Direction: The neural network is simulated with a training sample from the available dataset, along with the desired target value(s) for that sample. At each neuron, the summation of the inputs multiplied by their associated weights is evaluated and then entered as input to a predefined activation function f(x); the result is considered the neuron's output, as expressed in Equation 5.1. The activation of neurons is executed in the forward direction, where the output signal from each neuron is propagated to the next layer until the output layer is reached, at which point the error signal (δ) between the output (y) and target (z) values is calculated as per Equation 5.2.

y = f(x1 w1 + x2 w2 + x3 w3 + ... + xN wN)   (5.1)



δ = z - y   (5.2)
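A single-neuron sketch of Equations 5.1 and 5.2 (Python for illustration, assuming a sigmoid activation function; the thesis models themselves were built in MATLAB):

```python
# One perceptron: weighted sum of inputs passed through f(x), then the
# output-layer error signal as the difference between target and output.
import math

def neuron(inputs, weights):
    x = sum(i * w for i, w in zip(inputs, weights))   # x1*w1 + ... + xN*wN
    return 1.0 / (1.0 + math.exp(-x))                 # y = f(x), sigmoid here

y = neuron(inputs=[1.0, 0.5], weights=[0.4, -0.2])    # weighted sum x = 0.3
delta = 1.0 - y                                       # δ = z - y, with target z = 1
```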

Figure 5.2: Computations at a single neural perceptron.

Backward Direction: Once the error at the output layer has been evaluated, the error signal is propagated in the backward direction in order to calculate the error internally and assign a blame factor to each neuron in the network. This process is referred to as backpropagation, and aims to adapt the weights in the neural network to minimise the final error signal, as shown in Figure 5.3. The function by which the local error is optimised is called the optimisation function. Levenberg-Marquardt optimisation has been used in this research, as it is known to provide stable and fast convergence of error optimisation for non-linear systems (Wilamowski and Irwin, 2011; Demuth and Beale, 2011).

w_ij(n+1) = w_ij(n) + η · δ_j · (df(e)/de) · x_i
Figure 5.3: Backpropagation of error signal to update neural network weights.



5.2

Backpropagation Neural Network Parameters

The following approaches are considered as guidelines for the initialisation and selection of the parameters used in model development. These guidelines are not set in stone, and are usually changed based on performance results obtained experimentally. The network parameters are also subject to modification during the learning process.

5.2.1

Initialization of Weights

The initial weights between nodes are generally chosen randomly, as no prior significance of these weights is known before the training process. However, these weights are usually set to be small, varying between -1 and 1 (Mehrotra et al., 1997). In this research, initial weights were randomly initialised in MATLAB.

5.2.2

Initialization of Bias

Bias values are mainly used at the input layer in order to add a biasing term in addition to the input value. This is done in cases where the inputs of some training samples have null values (zeros), which would cause the null value to propagate to all weights in the forward direction; bias values are used to counter this condition (Haykin, 1998). Similar to the weights, initial biases were also randomly initialised in MATLAB.

5.2.3

Learning Rate

The learning rate affects the magnitude by which the weights are updated in each backpropagation of errors, as per Equation 5.3, and consequently affects the magnitude and direction of the errors at the neuron level. According to Mehrotra et al. (1997), if the learning rate is set too high, the weight updates become relatively large, so that when the neural network is simulated with different samples with different target values, the error signal tends to oscillate strongly for each training sample. On the other hand, if the learning rate is set too low, the weights tend never to update, leading to very slow or even no convergence.

Δw = η · δ · x   (5.3)



5.2.4

Momentum

The momentum factor describes the amount of influence the change in weights from simulating one sample has on the subsequent changes in weights when simulating further samples. In order to account for the momentum factor, Equation 5.3 is revised as in Equation 5.4, where n denotes the learning epoch number.

Δw_ji(n+1) = η · δ_pj · a_pi + α · Δw_ji(n)   (5.4)

5.2.5

Hidden Layers and Nodes

The selection of the number of hidden layers and the number of nodes in each hidden layer is subject to experimental observation. A single hidden layer is usually sufficient to represent any non-linear function mapping variables from one space to another. According to Mehrotra et al. (1997), having very few hidden nodes can lead to under-fitting, where the neural network does not have a sufficient number of nodes and weights to represent complex datasets. Choosing too many hidden nodes can cause over-fitting, where the neural model has too much information represented in nodes and weights with respect to smaller training sets, and hence insufficient data to train all the nodes it includes. The following guidelines are usually used to determine the number of hidden nodes in each layer:

H = √(M^2 + N^2)   (5.5)

or

H = (2/3) · (M + N)   (5.6)

where H, M and N are the numbers of hidden, input and output nodes respectively.
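A numeric sketch of these guidelines for this research's lossless model (M = 6 inputs, N = 1 output), reading Equation 5.5 as H = √(M² + N²):

```python
# Hidden-node count heuristics evaluated for a 6-input, 1-output network.
import math

M, N = 6, 1
h_sqrt = math.sqrt(M**2 + N**2)      # Equation 5.5: √(36 + 1) ≈ 6.08
h_twothirds = (2.0 / 3.0) * (M + N)  # Equation 5.6: (2/3) * 7 ≈ 4.67
```

Rounded to whole nodes, the two heuristics bracket a plausible hidden-layer size of roughly 5 to 6 nodes, which is then refined experimentally.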

5.2.6

Number of Samples

The number of training samples needed to train a neural network must be sufficient in order to obtain satisfactory accuracy results. Equation 5.7 provides a guideline for selecting the number of samples with respect to the desired accuracy, where P denotes the needed number of training samples, W denotes the total number of weights in the neural network, and a represents the desired accuracy.

P = W / (1 - a)   (5.7)
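A numeric sketch of Equation 5.7, with an illustrative weight count and target accuracy:

```python
# Rough training-set size needed for a network with W total weights and
# a desired accuracy a.
W = 50        # total weights in the network (illustrative)
a = 0.95      # desired accuracy (95%)
P = W / (1 - a)  # roughly 1000 training samples needed
```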

Additionally, two other subsets of samples, distinct from the samples used in the training process, are required. One subset is used to validate the estimation results at each epoch, and the other is used to test the robustness of the final developed neural model. During the learning process, these three subsets are used to evaluate the performance of the model at each epoch and to decide which set of weights provides the best results, as demonstrated in Figure 5.4. During this research, the number of samples used at any stage was bounded to 100,000 samples due to processing time limitations.
(Best validation performance: 0.05385 at epoch 65; MSE curves shown for the training, validation and test subsets over 115 epochs.)
Figure 5.4: MSE performance measures for learning, validating, and testing subsets.

5.2.7

Stopping Criteria

Several stopping criteria for the learning process have been considered, in order to obtain acceptable accuracy results while minimising the learning and processing time needed. These criteria are described in Table 5.1.

Table 5.1: Stopping criteria used for the neural network during the learning process.

Maximum number of training epochs (value: 1000 epochs): Limits the number of epochs considered if the network has not converged to the desired performance; used as a bound on training time.

Minimum performance value (value: null): Expressed in terms of MSE. Usually set to null in order to obtain the best performance results; hence this stopping condition is practically never met, but rather defines the target goal.

Minimum gradient descent (value: 0.00001): Stops the training if the gradient of the error signal falls below the minimum gradient, indicating that performance is no longer improving. This value should be slightly higher than zero.

Maximum number of validation checks (value: 50 checks): Stops the training if the performance measures have not improved for a number of successive validation checks.

5.3

Neural Network Model Structure

The following sections describe the structure of the neural networks used for modelling the lossless and lossy subsets. All other training parameters have been defined in the previous section.

5.3.1

Lossless Model

The structure of the model developed for the lossless subset of TCP samples is shown in Figure 5.5. Six input variables and a single output variable were considered, as listed in Table 5.2.

47

Chapter 5. Neural Network Modelling

Figure 5.5: Neural network model developed for lossless TCP traffic.

Table 5.2: Neural network structure and input parameters for the lossless model.
Input variables: (1) actual data bytes transmitted; (2) average RTT; (3) average segment size; (4) initial window size; (5) maximum congestion window; (6) maximum segment size.
Output variable: estimated data transmission time.
Target variable: actual data transmission time.

5.3.2

Lossy Model

For the lossy model, two additional input variables have been considered: the segment loss rate, as calculated using triple duplicate acknowledgements, and the average retransmission time, as evaluated by tcptrace. The structure of the model is shown in Figure 5.6, and the detailed input and output variables are listed in Table 5.3.

Figure 5.6: Neural network model developed for lossy TCP traffic.

48

Table 5.3: Neural network structure and input parameters for the lossy model.
Input variables: (1) actual data bytes transmitted; (2) average RTT; (3) average segment size; (4) initial window size; (5) maximum congestion window; (6) maximum segment size; (7) segment loss rate; (8) average retransmission time.
Output variable: estimated data transmission time.
Target variable: actual data transmission time.
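The two input layouts of Tables 5.2 and 5.3 can be captured in a small helper; the dictionary keys below are illustrative names, not field names taken from the thesis tooling:

```python
LOSSLESS_FEATURES = [
    "data_bytes",        # actual data bytes transmitted
    "avg_rtt",           # average RTT
    "avg_segment_size",  # average segment size
    "initial_window",    # initial window size
    "max_cwnd",          # maximum congestion window
    "mss",               # maximum segment size
]
LOSSY_FEATURES = LOSSLESS_FEATURES + [
    "loss_rate",         # segment loss rate (triple duplicate ACKs)
    "avg_retx_time",     # average retransmission time (from tcptrace)
]

def feature_vector(conn, lossy=False):
    """Assemble the model input vector for one connection record (a dict)."""
    keys = LOSSY_FEATURES if lossy else LOSSLESS_FEATURES
    return [float(conn[k]) for k in keys]
```

The same record type can thus feed either model, with the lossy model simply appending the two loss-related inputs.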

5.4

Summary

This chapter has explained the concepts behind backpropagation neural networks and the convenience of using neural networks as a supervised learning approach, owing to the delta rule, which updates the network weights in the reverse direction after the simulation of each sample from the dataset. The structure of the models used for the lossless and lossy datasets has been defined, and the initialisation of the neural parameters and learning criteria has been justified for each model. The design of the neural models was based on well-known guidelines, in addition to experimental trials and the observed model performance for each trial.


Results and Analysis


In this chapter, estimation performance results from the developed neural network models are presented and compared with both the actual performance value for each TCP connection and the performance estimated by Cardwell's mathematical model. Initially, the results obtained using the combined dataset (i.e. UNIBS and MAWI) are fully demonstrated and analysed. Different filtering conditions have been applied to the collected TCP traffic in order to investigate the effect of various TCP parameters on the performance results; each filtering criterion has been investigated individually and compared with the other filtering and analysis criteria. Finally, the same analysis methodology was applied to the dataset collected from the University of Plymouth campus for validation purposes. An overview of the performance results obtained is presented at the end of the chapter, followed by a summary.

6.1

Results from the Combined Dataset (UNIBS-2009 and MAWI)

The following sections present all the results obtained for the dataset which combines the UNIBS and MAWI traffic traces.

6.1.1

Considering All Valid TCP Connections

After filtering out all invalid TCP samples, the lossless traffic subset was identified and comprised 121,321 TCP samples, while the lossy subset comprised 8,356 TCP samples. The results obtained for each subset are as follows.


6.1.1.1 Results for the Lossless Dataset

The regression value obtained from the mathematical model was 0.3216, while the neural network model achieved 0.7680, a considerably higher figure which shows that the neural network model outperforms the mathematical model even at this initial stage of analysis. The distribution of scattered actual and estimated transmission times for both models, shown in Figure 6.1, was important to analyse. For the mathematical model, only a relatively small subset of the samples lies along the ideal fit line (Y=T), while the majority of the scattered samples fall below this line with high residual values. This clearly indicates that the mathematical model assumes ideal data transfer conditions and takes no account of any possible additional time (idle time) within the lifespan of the connection, whereas the actual (target) transmission time tends to be higher in most of the connections. The neural network model, on the other hand, appears to account for this possible additional time, and hence the samples are evenly scattered above and below the ideal fit line (Y=T). It is also worth noticing that the slope of the fitting line resulting from the neural network model is slightly closer to the ideal fit than the one resulting from the mathematical model.
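The fit line (Output ≈ a·Target + b) and the regression value R reported in these plots can be reproduced with a plain least-squares sketch; the function name is illustrative:

```python
def fit_line_and_r(targets, outputs):
    """Least-squares fit outputs ~ a*targets + b, plus Pearson correlation R."""
    n = len(targets)
    mt = sum(targets) / n
    mo = sum(outputs) / n
    cov = sum((t - mt) * (o - mo) for t, o in zip(targets, outputs))
    var_t = sum((t - mt) ** 2 for t in targets)
    var_o = sum((o - mo) ** 2 for o in outputs)
    a = cov / var_t                       # slope of the fit line
    b = mo - a * mt                       # intercept of the fit line
    r = cov / (var_t ** 0.5 * var_o ** 0.5)  # regression value R
    return a, b, r
```

A slope a near 1 and intercept b near 0 correspond to the ideal fit Y=T discussed in the text.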
[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossless Dataset Mathematical Model: R=0.32156" and "Lossless Dataset Neural Network Model: R=0.768"; fit lines: Output ~= 0.59*Target + 0.18 and Output ~= 0.39*Target + 2.3; the ideal fit Y=T is shown for reference.]

Figure 6.1: Regression obtained for lossless connections for the combined dataset using both the mathematical and neural network models.

The MSE of performance estimation was 11.8910 using the mathematical model,


and 1.8253 using the neural network model. Both figures are relatively high, considering that the performance values were not raw values but were normalised using the natural logarithmic function (base e). Nevertheless, the neural network model still provided better estimation accuracy. This is also shown by the cumulative distribution of absolute relative estimation error resulting from each model in Figure 6.2.
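A sketch of the two error measures used throughout this chapter: MSE over the ln-normalised transmission times, and the absolute relative error plotted in the CDFs. Function names are illustrative:

```python
import math

def mse_log(actual_times, estimated_times):
    """MSE between ln-transformed actual and estimated transmission times,
    matching the natural-log normalisation described in the text."""
    errs = [math.log(a) - math.log(e)
            for a, e in zip(actual_times, estimated_times)]
    return sum(err * err for err in errs) / len(errs)

def abs_relative_error(actual, estimated):
    """Absolute relative estimation error, as used for the CDF plots."""
    return abs(estimated - actual) / actual
```

Because the times are log-transformed, an MSE of 1.0 already corresponds to an average mis-estimation of a factor of e, which is why the unfiltered figures quoted above are described as relatively high.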
[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural network model and mathematical model.]

Figure 6.2: CDF of absolute relative error for lossless connections (combined dataset).

6.1.1.2

Results for the Lossy Dataset

Introducing the lossy subset of TCP connections to Cardwell's mathematical model provided a regression of 0.5232 and an MSE value of 4.2733, while the neural network model provided a regression of 0.9446 and an MSE value of only 0.3150. The scattering of estimated and actual transmission times for both models is shown in Figure 6.3. The same observation was made for the mathematical model, as most scattered samples were distributed below the ideal fit line (Y=T), indicating its inability to predict any additional time within a TCP connection besides pure data


transfer time, and its tendency to provide accurate results only for connections with ideal conditions. The cumulative distribution of absolute relative error in Figure 6.4 shows lower estimation error for the neural network model.

[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossy Dataset Mathematical Model: R=0.52319" and "Lossy Dataset Neural Network Model: R=0.94463"; fit lines: Output ~= 0.46*Target + 0.18 and Output ~= 0.9*Target + 0.23; the ideal fit Y=T is shown for reference.]

Figure 6.3: Regression obtained for lossy connections for the combined dataset using both the mathematical and neural network models.

Interestingly, comparing these results with those previously presented for the lossless dataset, both models seem to provide better accuracy for the lossy subset. This may be due to the relatively lower number of samples in the lossy dataset (8,356) compared to the lossless dataset (121,321): a dataset with fewer samples, prior to any filtering, is likely to contain fewer outliers as well, which would naturally yield better estimation accuracy. Regarding the neural network model specifically, the inclusion of the loss rate and the average retransmission time among the inputs during the learning process may in fact induce the model to account for extra transmission time, and hence improve estimation accuracy compared to lossless conditions, where none of the inputs would suggest the need for extra transmission time.




[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural network model and mathematical model.]

Figure 6.4: CDF of absolute relative error for lossy connections for the combined dataset.

6.1.2

Idle Time Investigation

Results obtained using unfiltered traffic in the previous sections have suggested the presence of idle time within TCP connections. This time is not considered at all when calculating the data transmission time mathematically, and it is not sufficiently estimated when using neural networks. By identifying the samples with the highest error figures, retrieving these connections from their corresponding traces, and manually investigating them using tcptrace, it was noted that many TCP connections suffered from prolonged idle time while the connection was still active but simply no data was being transmitted; no irregularity or segment loss was observed during data transmission. As shown by the time-sequence graph in Figure 6.5, the connection suffered from two periods of consecutive idle time (3 minutes and 2 minutes).



Figure 6.5: Time-sequence graph for a TCP connection with relatively high idle time.

According to RFC 5681 (Allman et al., 2009), TCP connections which suffer from large idle time periods in the midst of bursts of data transfer are expected to logically reset the connection and reinitialise the slow start phase with a cwnd equal to the initial cwnd as defined by Equation 2.1. The connection is considered to be in an idle state if no segments have been exchanged for a period longer than the retransmission timeout (RTO) (Allman et al., 2009). However, as was observed from the manual analysis of various traces, this mechanism is not applied in many cases. As the cwnd is only evaluated at the sender side and never carried on a TCP connection (Allman et al., 1999), the only way to deduce the cwnd for a connection is to observe the size of the segments flowing from sender to receiver and analyse whether these segments are small with respect to the rwnd advertised by the receiver. This was, however, very difficult to investigate, as it was observed that the receiver usually generates an ACK segment shortly after each batch of data segments received from the sender, hence not following the flow control guidelines of windowing.
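The RFC 5681 idle-restart rule described above can be sketched as follows. The initial window follows RFC 5681's formula min(4·SMSS, max(2·SMSS, 4380)), all values are in bytes, and the function name is illustrative:

```python
def restart_cwnd(cwnd, idle_time, rto, smss):
    """Apply the RFC 5681 idle-restart rule: if the connection has been idle
    for longer than the RTO, cwnd is reduced to the restart window RW
    before transmission resumes."""
    iw = min(4 * smss, max(2 * smss, 4380))  # initial window (RFC 5681)
    if idle_time > rto:
        return min(iw, cwnd)                 # restart window RW = min(IW, cwnd)
    return cwnd
```

For a typical SMSS of 1460 bytes, an idle period longer than the RTO therefore collapses even a large cwnd back to 4380 bytes, which is the effect whose absence was observed in many of the analysed traces.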



6.1.3

Filtering TCP Connections with High Relative Idle Time

Since the idle time within a TCP connection does not depend on other TCP parameters, but rather on the behaviour of both the TCP server and client, neither a mathematical model nor an AI-based model would be able to accurately predict the time cost of these idle periods. Filtering conditions were therefore set in order to investigate the effect of idle time on performance estimation accuracy. The maximum idle time in a connection, relative to the average RTT of that connection as detected by tcptrace, was used as a representative value of the total idle time. Different filtering criteria were used for both the lossless and lossy subsets of TCP samples.

6.1.3.1 Results for the Lossless Dataset

Multiples of the ratio of the maximum idle time to the average RTT were considered as filtering thresholds; the regression and MSE results for each filtering criterion are listed in Table 6.1.

Table 6.1: MSE and regression results post filtering samples with a high maximum idle time to average RTT ratio (lossless combined dataset).
Max idle / avg RTT   Samples   NN MSE   NN regression   Math MSE   Math regression
2                    19458     0.0229   0.9892          0.2220     0.9067
4                    24621     0.0410   0.9849          0.2405     0.9262
6                    27342     0.0699   0.9759          0.2945     0.9200
8                    29617     0.0963   0.9685          0.3788     0.9073
10                   30913     0.1108   0.9649          0.4286     0.9019
12                   32105     0.1361   0.9578          0.4848     0.8958
14                   33338     0.1588   0.9510          0.5629     0.8857
16                   34410     0.1619   0.9505          0.6255     0.8789
18                   35627     0.1675   0.9496          0.6982     0.8717
20                   36804     0.1956   0.9415          0.7840     0.8634
22                   37840     0.1995   0.9409          0.8549     0.8574
24                   38761     0.2144   0.9369          0.9140     0.8518
26                   39487     0.2181   0.9363          0.9684     0.8468
28                   40186     0.2301   0.9330          1.0250     0.8413
30                   40848     0.2421   0.9299          1.0821     0.8371

The regression and MSE results obtained when retaining only connections with a maximum idle time of less than twice the average RTT were 0.9892 and 0.0229 respectively for the neural network model, and 0.9067 and 0.2220 respectively for the mathematical model. This is a recognisable improvement in estimation accuracy compared to the results prior to any filtering. However, it was observed that the number of samples remaining after filtering (19,458) represented only 16.04 per cent of the number of samples prior to filtering (121,321).
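The filter swept in Table 6.1 keeps a connection only when its maximum idle time stays within a chosen multiple of its average RTT; a sketch with illustrative record fields:

```python
def filter_by_idle_ratio(conns, max_ratio=2.0):
    """Keep only connections whose maximum idle time does not exceed
    max_ratio times the connection's average RTT (both in seconds)."""
    return [c for c in conns if c["max_idle"] <= max_ratio * c["avg_rtt"]]
```

Sweeping max_ratio from 2 to 30 reproduces the trade-off visible in the table: looser thresholds retain more samples but admit more idle time, degrading both models' accuracy.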

[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossless Dataset Mathematical Model: R=0.90667" and "Lossless Dataset Neural Network Model: R=0.98918"; fit lines: Output ~= 0.98*Target + 0.0015 and Output ~= 0.76*Target + 0.15; the ideal fit Y=T is shown for reference.]

Figure 6.6: Regression obtained for lossless connections for the combined dataset using both the mathematical and neural network models, after filtering connections with maximum idle time larger than twice the average RTT.

The regression analysis for both models is shown in Figure 6.6. The uniform scattering of the transmission times estimated by both models along the ideal fit line (Y=T) has clearly improved, and the residual values have significantly decreased. The CDF of absolute relative error is shown in Figure 6.7. Even at these near-ideal conditions, the neural network model still provides better accuracy.




[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural network model and mathematical model, each prior to and post filtering.]

Figure 6.7: CDF of absolute relative error for lossless connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.

6.1.3.2

Results for the Lossy Dataset

The same filtering conditions based on the maximum idle time have been applied to the lossy subset of TCP connections. The regression and MSE results are listed in Table 6.2. By filtering out connections with a maximum idle time larger than twice the average RTT, the number of samples was reduced to 2,339 connections; the regression and MSE figures were 0.9888 and 0.0257 respectively for the neural network model, and 0.9491 and 0.3225 respectively for the mathematical model. The regression analysis for both models is shown in Figure 6.8, and the CDF of absolute relative error is shown in Figure 6.9.


Table 6.2: MSE and regression results post filtering samples with a high maximum idle time to average RTT ratio (lossy combined dataset).
Max idle / avg RTT   Samples   NN MSE   NN regression   Math MSE   Math regression
2                    2339      0.0257   0.9888          0.3225     0.9491
4                    3642      0.0322   0.9865          0.4999     0.9359
6                    4214      0.0389   0.9851          0.5990     0.9324
8                    4578      0.0415   0.9845          0.6700     0.9248
10                   4923      0.0459   0.9827          0.7113     0.9181
12                   5214      0.0494   0.9815          0.7642     0.9102
14                   5388      0.0549   0.9794          0.7997     0.9046
16                   5550      0.0590   0.9781          0.8337     0.8995
18                   5699      0.0677   0.9753          0.8693     0.8948
20                   5842      0.0686   0.9753          0.8942     0.8930
22                   5969      0.0699   0.9750          0.9195     0.8897
24                   6078      0.0686   0.9756          0.9451     0.8862
26                   6171      0.0729   0.9743          0.9617     0.8851
28                   6247      0.0731   0.9746          0.9853     0.8828
30                   6319      0.0784   0.9728          1.0044     0.8798

[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossy Dataset Mathematical Model: R=0.94908" and "Lossy Dataset Neural Network Model: R=0.98606"; fit lines: Output ~= 0.84*Target + 0.25 and Output ~= 0.98*Target + 0.03; the ideal fit Y=T is shown for reference.]

Figure 6.8: Regression obtained for lossy connections for the combined dataset using both the mathematical and neural network models, after filtering connections with maximum idle time larger than twice the average RTT.




[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural model and mathematical model, each prior to and post filtering idle time.]

Figure 6.9: CDF of absolute relative error for lossy connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.

6.1.4

Investigation of Non-Standard Flows

TCP connection samples for which performance was poorly estimated, even after excluding connections with relatively high idle time, were identified and manually analysed. It was observed that the behaviour of some of these connections deviated from the normal behaviour of a TCP connection. The following parameters were considered for excluding non-standard connections:

1. RST packets sent: According to RFC 793 (Postel, 1981), an RST packet is sent whenever a terminal receives a segment that is unexpected for the current state of the TCP connection, as explained in section 2.2.1. Most commonly, an RST packet is sent when TCP stops receiving ACK packets, implying that the connection has become half-open and no segments are expected to be received in the closed direction. The connection is then considered unsynchronised and has to be fully closed.

2. SYN/FIN packets exchanged: Normally, a single SYN and a single FIN segment are sent in each direction. Sending more than one SYN/FIN indicates an unusual condition (e.g. congestion or segment loss) at either side of the TCP connection.

3. Data packets in the reverse direction: TCP performance is best estimated for unidirectional connections. According to RFC 3449 (Balakrishnan et al., 2002), the reverse direction usually suffers from worse path characteristics than the forward direction. The existence of data in the reverse direction along with TCP ACKs is likely to reduce the frequency of ACKs and accordingly reduce the throughput in the forward direction.

4. Non-HTTP traffic: Although the aim of this project is to research robust TCP performance modelling, non-HTTP traffic has been excluded at this stage in order to study the effect of mixed TCP traffic on overall performance.

5. Small maximum segment size (MSS): As the evolution of slow start in TCP is initially based on the MSS value, small values of MSS may cause TCP to reach the optimal cwnd rapidly. According to the statistical distribution of MSS performed in Chapter 4, the minimum value of MSS has been set to 1400 bytes for the combined dataset, and to 1368 bytes for the dataset collected at Plymouth University.

Following the statistical analysis performed in Chapter 4, the statistical distribution graphs indicated the presence of statistical outliers for each TCP parameter in the dataset. Samples below the 2nd percentile and above the 98th percentile of each TCP parameter have therefore also been excluded at this stage of the research.
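The five exclusion rules can be combined into a single predicate; the field names are illustrative, and the 1400-byte MSS threshold is the combined-dataset value given in the text:

```python
def is_standard(conn, min_mss=1400):
    """Return True when a connection record passes all five exclusion rules
    for non-standard flows described above."""
    return (conn["rst_packets"] == 0                # no RST packets observed
            and conn["syn_count"] <= 1              # at most one SYN per direction
            and conn["fin_count"] <= 1              # at most one FIN per direction
            and conn["reverse_data_packets"] == 0   # no client-to-server data
            and conn["server_port"] == 80           # HTTP traffic only
            and conn["mss"] >= min_mss)             # exclude small MSS
```

For the Plymouth dataset the threshold would be lowered to 1368 bytes; the percentile-based outlier trimming is applied separately, per parameter.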

6.1.5

Filtering Non-Standard Flows

In order to focus the analysis on the effect of each type of non-standard flow on the estimation of TCP performance, samples with relatively high idle time (maximum idle time larger than six times the average RTT) were filtered out in advance. The total number of samples used was 51,088. The following sections present the results obtained for both the lossless and lossy subsets.

6.1.5.1 Results for the Lossless Dataset

Table 6.3 demonstrates the improvement in performance measures obtained when filtering different non-standard types of flows individually, for the subset of lossless


TCP connections using the neural network model, and how the filtering of each type of flow contributes to the improvement of the error figures. The same comparison using the mathematical model is demonstrated in Table 6.4.

Table 6.3: Neural network results obtained post filtering non-standard flows from the lossless subset.
Filtering criteria                                      Connections   MSE      Min error   Max error   Mean error   Regression
All valid and complete TCP connections                  121321        1.7069   9.3243      9.0602      0.0025       0.7851
Maximum idle time less than 6 x average RTT             27342         0.0646   1.1533      2.1055      0.0001       0.9778
- Excluding connections with RESET packets              25075         0.0593   1.2408      2.7408      0.0023       0.9802
- Excluding more than one SYN/FIN per direction         26823         0.0647   1.2254      2.1574      0.0001       0.9777
- Excluding reverse-direction data packets              15602         0.0569   1.1738      2.8547      0.0021       0.9833
- Web traffic only (source port 80)                     12480         0.0398   1.0736      1.4581      0.0004       0.9871
- Excluding small MSS (less than 1400 bytes)            15776         0.0661   1.3908      2.0806      0.0009       0.9784
Filtering all non-standard flows                        11048         0.0325   1.1061      1.4688      0.0010       0.9881

Table 6.4: Mathematical model results obtained post filtering different non-standard flows from the lossless subset.
Filtering criteria                                      Connections   MSE       Min error   Max error   Mean error   Regression
All valid and complete TCP connections                  121321        11.8910   8.0849      13.4185     2.0735       0.3216
Maximum idle time less than 6 x average RTT             27342         0.2945    1.1011      2.1954      0.2597       0.9200
- Excluding connections with RESET packets              25075         0.2937    1.1011      2.1954      0.2450       0.9200
- Excluding more than one SYN/FIN per direction         26823         0.2929    1.1011      2.1954      0.2564       0.9200
- Excluding reverse-direction data packets              15602         0.1638    1.1011      2.1886      0.0061       0.9548
- Web traffic only (source port 80)                     12480         0.1034    1.1011      1.8631      0.0916       0.9692
- Excluding small MSS (less than 1400 bytes)            15776         0.1531    1.1011      2.1886      0.0063       0.9515
Filtering all non-standard flows                        11048         0.0921    1.1011      1.8631      0.1143       0.9707

After filtering all types of non-standard flows, the MSE of the neural network model decreased from 0.0646 to 0.0325 (i.e. to 50.31% of its previous value), and the regression between outputs and targets increased from 0.9778 to 0.9881 (101.05%). For the mathematical model, the MSE decreased from 0.2945 to 0.0921 (31.27% of its previous value) after filtering non-standard flows, and the regression improved from 0.9200 to 0.9707 (105.51%). The filtering criterion with the most positive effect on improving both the MSE and regression figures was to include only TCP connections carrying HTTP traffic (port 80). The regression analysis is shown in Figure 6.10, and the cumulative distribution of the absolute relative errors for both models, prior and


post filtering statistical outliers and non-standard flows, is demonstrated in Figure 6.11.
[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossless Dataset Mathematical Model: R=0.97024" and "Lossless Dataset Neural Network Model: R=0.98796"; fit lines: Output ~= 0.98*Target + 0.008 and Output ~= 0.96*Target + 0.092; the ideal fit Y=T is shown for reference.]

Figure 6.10: Regression obtained for lossless connections for the combined dataset using both the mathematical and neural network models, post filtering various non-standard flows.

[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural network model and mathematical model, each prior to and post filtering.]

Figure 6.11: CDF of absolute relative error for lossless connections for the combined dataset, prior to and post filtering various non-standard flows.


6.1.5.2 Results for the Lossy Dataset

The same analysis approach for studying the effect of non-standard flows was taken for the lossy subset of TCP connections. The results obtained using the neural network model are shown in Table 6.5 and Figure 6.13, and the results obtained using the mathematical model are demonstrated in Table 6.6 and Figure 6.13. The performance measures prior to any filtering were considered as a baseline for evaluating the performance improvements of each model.

Table 6.5: Neural network results obtained post filtering different non-standard flows from the lossy subset.
Filtering criteria                                      Connections   MSE      Min error   Max error   Mean error   Regression
All valid and complete TCP connections                  8356          0.2735   7.1937      5.7939      0.0250       0.9522
Maximum idle time less than 6 x average RTT             4214          0.0371   1.0218      1.4864      0.0035       0.9857
- Excluding connections with RESET packets              4102          0.0358   1.7615      1.4194      0.0011       0.9857
- Excluding more than one SYN/FIN per direction         3682          0.0331   1.1945      1.3616      0.0005       0.9866
- Excluding reverse-direction data packets              4163          0.0352   1.1635      1.5705      0.0033       0.9860
- Web traffic only (source port 80)                     4052          0.0343   0.6390      1.6490      0.0016       0.9860
- Excluding small MSS (less than 1400 bytes)            4022          0.0356   0.8103      1.4909      0.0017       0.9853
Filtering all non-standard flows                        3341          0.0301   0.6951      1.4656      0.0017       0.9863

Table 6.6: Mathematical model results obtained post filtering different non-standard flows from the lossy subset.
Filtering criteria                                      Connections   MSE      Min error   Max error   Mean error   Regression
All valid and complete TCP connections                  8356          4.2733   2.9911      11.9338     1.3394       0.5232
Maximum idle time less than 6 x average RTT             4214          0.5990   0.4120      3.0915      0.6500       0.9324
- Excluding connections with RESET packets              4102          0.5951   0.4120      3.0915      0.6488       0.9307
- Excluding more than one SYN/FIN per direction         3682          0.5607   0.4120      3.0915      0.6233       0.9311
- Excluding reverse-direction data packets              4163          0.5968   0.4120      3.0915      0.6490       0.9308
- Web traffic only (source port 80)                     4052          0.5871   0.4120      2.2595      0.6453       0.9308
- Excluding small MSS (less than 1400 bytes)            4022          0.5897   0.4120      3.0915      0.6444       0.9283
Filtering all non-standard flows                        3341          0.5381   0.4120      1.9999      0.6111       0.9253

For the lossy subset, filtering all types of non-standard flows improved the MSE of the neural network model from 0.0371 to 0.0301 (i.e. to 81.13% of its previous value), and slightly increased the regression between outputs and targets from 0.9857 to 0.9863 (100.06%). Excluding only non-HTTP traffic led to better regression results (0.9866), while further filtering of all the other non-standard flows did not provide regression results as good as excluding non-HTTP traffic alone (0.9863). As for the mathematical model, by filtering all non-standard flows, the MSE measure


had decreased from 0.5990 to 0.5381 (89.83% of its previous value). However, this did not improve the regression results; the best regression result was obtained prior to filtering the non-standard flows (0.9324). The regression results are shown in Figure 6.12.

[Figure: scatter plots of estimated against actual (target) transmission time. Panels: "Lossy Dataset Mathematical Model: R=0.9287" and "Lossy Dataset Neural Network Model: R=0.98584"; fit lines: Output ~= 0.79*Target + 0.32 and Output ~= 0.97*Target + 0.041; the ideal fit Y=T is shown for reference.]

Figure 6.12: Regression obtained for lossy connections for the combined dataset using both the mathematical and neural network models, post filtering various non-standard flows.

At this stage of filtering, the variation in performance figures may not be consistent, owing to the reduction of the number of training samples to only 3,341 TCP connections. This may have led to under-training of the neural network, and hence the degraded accuracy results when simulating the neural model with the testing samples. This degradation is demonstrated by the CDF of absolute relative error in Figure 6.13.




[Figure: empirical CDF of absolute relative error (log scale) against the percentage of connections; curves: neural network model and mathematical model, each prior to and post filtering outliers.]

Figure 6.13: CDF of absolute relative error for lossy connections for the combined dataset, prior to and post filtering various non-standard flows.

6.1.6

Throughput and Estimation Error

This section investigates the relationship between the actual throughput of TCP connections and the estimation error of the developed neural model, as well as the effect of the presence of idle time on that relationship. This was done bearing in mind that large bulk transfers are usually characterised by high throughput, while short-lived connections are characterised by lower throughput owing to the limitation imposed by slow start. The following graphs show the relationship observed between transfer throughput and the error of the throughput estimated by the neural network model; the error is represented relative to the actual value of throughput. Figure 6.14 demonstrates a high relative error in estimated throughput for flows with small actual throughput. This observation was made prior to filtering any TCP flows.
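Since throughput is payload divided by transmission time, the relative throughput error used in these scatter plots can be derived directly from the estimated transmission time; a sketch, with an illustrative function name:

```python
def throughput_relative_error(data_bytes, actual_time, estimated_time):
    """Relative error of the estimated throughput, where
    throughput = payload bytes / transmission time (Bps)."""
    actual_tp = data_bytes / actual_time
    estimated_tp = data_bytes / estimated_time
    return abs(estimated_tp - actual_tp) / actual_tp
```

Note that the payload size cancels only partially: the relative error depends on the ratio of the two times, which is why under-estimated idle time inflates the error most for low-throughput (long-duration) flows.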




[Figure: log-log scatter plot of actual throughput (Bps) against the relative error in estimated throughput.]

Figure 6.14: Scatter plot of actual throughput and the corresponding relative error of estimated throughput, for lossless connections of the combined dataset, prior to any filtering.

After filtering out TCP flows with a maximum idle time larger than twice the average RTT, the first observation made was the improvement in the estimation accuracy of the neural network model: the estimation error relative to actual throughput decreased significantly, as shown in Figure 6.15. It was also observed that the neural model performed better when modelling TCP connections with high actual throughput.




[Figure: log-log scatter plot of actual throughput (Bps) against the relative error in estimated throughput.]

Figure 6.15: Scatter plot of actual throughput and the corresponding relative error of estimated throughput, for lossless connections of the combined dataset, after filtering connections with a high idle time to RTT ratio.

6.2

Manual Analysis of Connection with Poorly Estimated Throughput

It was observed that excluding TCP connections with many data packets sent from client to server had led to better MSE performance. By analysing one of these TCP connections, shown in Figure 6.16, it was observed that many consecutive idle-time periods occurred after each packet was sent from the client and acknowledged by the server. For this connection, the neural network estimated a transmission time of 0.44 seconds and the mathematical model predicted 0.59 seconds, while the actual transmission time was 5.27 seconds. On the other hand, by analysing a connection for which the transmission time was accurately estimated by the neural network, shown in Figure 6.17, it was found that packets were sent from the server to the client in a normal sequence, with no idle time periods between data transmissions. Excluding connections in which data packets are sent from client to server had therefore greatly improved the performance of both the neural and mathematical models, as previously mentioned.


Figure 6.16: Trace 1


06:09:26.375138 IP 207.56.171.90.50192 > 35.76.77.0.80: S 3238724546:3238724546(0) win 5840 <mss 1460,sackOK,wscale 5>
06:09:26.493072 IP 35.76.77.0.80 > 207.56.171.90.50192: S 1458449122:1458449122(0) ack 3238724547 win 4380 <mss 1460,wscale 0,sackOK,eol>
06:09:27.126592 IP 207.56.171.90.50192 > 35.76.77.0.80: . 1:1449(1448) ack 1 win 183
06:09:27.245157 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 1449 win 5828
06:09:27.915636 IP 207.56.171.90.50192 > 35.76.77.0.80: P 2978:4426(1448) ack 1 win 183
06:09:28.033710 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 1449 win 5828 <sack 1 2978:4426>
06:09:28.650841 IP 207.56.171.90.50192 > 35.76.77.0.80: . 4426:5874(1448) ack 1 win 183
06:09:28.768898 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 1449 win 5828 <sack 1 2978:5874>
06:09:29.308848 IP 207.56.171.90.50192 > 35.76.77.0.80: P 1449:1530(81) ack 1 win 183
06:09:29.426904 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 1530 win 5909 <sack 1 2978:5874>
06:09:30.060412 IP 207.56.171.90.50192 > 35.76.77.0.80: . 1530:2978(1448) ack 1 win 183
06:09:30.178609 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 5874 win 10253
06:09:30.835876 IP 207.56.171.90.50192 > 35.76.77.0.80: . 5874:7322(1448) ack 1 win 183
06:09:31.053228 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 7322 win 11701
06:09:31.673373 IP 207.56.171.90.50192 > 35.76.77.0.80: P 7322:8770(1448) ack 1 win 183
06:09:31.697234 IP 207.56.171.90.50192 > 35.76.77.0.80: . 8770:10218(1448) ack 1 win 183
06:09:31.815411 IP 35.76.77.0.80 > 207.56.171.90.50192: . ack 10218 win 14597

Figure 6.17: Trace 2


06:02:40.658849 IP 45.26.181.94.50230 > 180.148.237.165.80: S 2940382284:2940382284(0) win 8192 <mss 1460,nop,nop,sackOK>
06:02:40.667219 IP 180.148.237.165.80 > 45.26.181.94.50230: S 178968040:178968040(0) ack 2940382285 win 65535 <mss 1460,sackOK,eol>
06:02:40.860109 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 1 win 64240
06:02:40.865996 IP 45.26.181.94.50230 > 180.148.237.165.80: P 1:490(489) ack 1 win 64240
06:02:40.874803 IP 180.148.237.165.80 > 45.26.181.94.50230: . 1:1461(1460) ack 490 win 65535
06:02:40.874815 IP 180.148.237.165.80 > 45.26.181.94.50230: . 1461:2921(1460) ack 490 win 65535
06:02:41.079359 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 2921 win 64240
06:02:41.087734 IP 180.148.237.165.80 > 45.26.181.94.50230: . 2921:4381(1460) ack 490 win 65535
06:02:41.087742 IP 180.148.237.165.80 > 45.26.181.94.50230: . 4381:5841(1460) ack 490 win 65535
06:02:41.087749 IP 180.148.237.165.80 > 45.26.181.94.50230: . 5841:7301(1460) ack 490 win 65535
06:02:41.279251 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 5841 win 64240
06:02:41.287740 IP 180.148.237.165.80 > 45.26.181.94.50230: . 7301:8761(1460) ack 490 win 65535
06:02:41.287747 IP 180.148.237.165.80 > 45.26.181.94.50230: . 8761:10221(1460) ack 490 win 65535
06:02:41.287754 IP 180.148.237.165.80 > 45.26.181.94.50230: . 10221:11681(1460) ack 490 win 65535
06:02:41.475389 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 7301 win 64240
06:02:41.481255 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 10221 win 64240
06:02:41.483753 IP 180.148.237.165.80 > 45.26.181.94.50230: . 11681:13141(1460) ack 490 win 65535
06:02:41.489757 IP 180.148.237.165.80 > 45.26.181.94.50230: . 13141:14601(1460) ack 490 win 65535
06:02:41.489764 IP 180.148.237.165.80 > 45.26.181.94.50230: . 14601:16061(1460) ack 490 win 65535
06:02:41.675431 IP 45.26.181.94.50230 > 180.148.237.165.80: . ack 11681 win 64240

6.3 Results from the Plymouth University Campus Dataset

The same testing and analysis approaches were applied to the traffic traces collected on the Plymouth University campus. The results obtained at each stage are summarised in Table 6.7 and Table 6.8 for the lossless and lossy subsets respectively. The results confirm and validate the previous analysis performed on the combined dataset for all filtering criteria. Detailed results


and graphs have been included in Appendix B.

Table 6.7: Accuracy results for the lossless dataset of Plymouth University.
Filtering Criteria | Connections | NN MSE | NN Regression | Math MSE | Math Regression
All valid and complete TCP connections | 100000 | 3.3045 | 0.7722 | 19.5918 | 0.3408
Only include connections with idle time less than 6 x average RTT | 100000 | 0.2765 | 0.9617 | 1.8263 | 0.7205
Filtering different non-standard flows individually:
- Exclude connections with RESET packets observed | 100000 | 0.2749 | 0.9615 | 1.8160 | 0.7197
- Exclude connections with more than one SYN/FIN packet received in each direction | 100000 | 0.2182 | 0.9697 | 1.8132 | 0.7211
- Exclude connections with one or more data packets sent in the reverse direction (client to server) | 100000 | 0.1841 | 0.9738 | 1.8516 | 0.7065
- Include only connections transmitting web traffic (source port 80) | 100000 | 0.2016 | 0.9722 | 1.8263 | 0.7205
- Exclude connections with small MSS (less than 1400 bytes) | 100000 | 0.2627 | 0.9403 | 1.0025 | 0.7590
Filtering all non-standard flows | 100000 | 0.5102 | 0.8697 | 0.9951 | 0.7428

Interestingly, after filtering TCP flows with high idle time periods, the estimation accuracy of the neural model for lossy connections was better than that of the model for lossless connections. A possible explanation for this behaviour is over-fitting of the weights of the neural model developed for lossy connections, considering that the number of training samples was significantly decreased to 4,347 lossy samples compared to the 100,000 lossless samples.

Table 6.8: Accuracy results for the lossy dataset of Plymouth University.
Filtering Criteria | Connections | NN MSE | NN Regression | Math MSE | Math Regression
All valid and complete TCP connections | 57299 | 1.3533 | 0.8697 | 12.2968 | 0.5886
Only include connections with idle time less than 6 x average RTT | 4347 | 0.1127 | 0.9856 | 0.5830 | 0.9343
Filtering different non-standard flows individually:
- Exclude connections with RESET packets observed | 4046 | 0.0890 | 0.9887 | 0.5274 | 0.9429
- Exclude connections with more than one SYN/FIN packet received in each direction | 2270 | 0.1071 | 0.9827 | 0.7820 | 0.9361
- Exclude connections with one or more data packets sent in the reverse direction (client to server) | 4033 | 0.1292 | 0.9829 | 0.5390 | 0.9347
- Include only connections transmitting web traffic (source port 80) | 4347 | 0.4897 | 0.9404 | 0.5830 | 0.9343
- Exclude connections with small MSS (less than 1400 bytes) | 4130 | 0.0842 | 0.9886 | 0.5322 | 0.9372
Filtering all non-standard flows | 1913 | 0.1066 | 0.9828 | 0.7173 | 0.9419
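The over-fitting suspicion raised above can be quantified by comparing training-set error against held-out error. The helper below is a hypothetical sketch of that check, not part of the thesis code, and the numbers are invented for illustration:

```python
def overfit_gap(train_mse, val_mse):
    """Relative gap between validation and training MSE; a large positive
    value suggests the model memorised a (too small) training set."""
    return (val_mse - train_mse) / train_mse

# Invented illustrative numbers: a model whose validation error is five
# times its training error is a strong over-fitting candidate.
gap = overfit_gap(0.01, 0.05)  # 4.0
```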

6.4 Summary

This chapter demonstrated the results obtained at all the testing stages along the timeline of the project. The combined dataset (i.e. UNIBS and MAWI) was considered first. A baseline was set by modelling all valid TCP flows in the dataset, and the accuracy results obtained were kept for comparison with the results of later stages. Different filtering criteria were applied to the dataset, based on the


presence of idle time within the TCP lifespan. The effect of non-standard TCP flows was investigated by filtering these flows individually. At each stage of testing, accuracy results were also compared with the accuracy obtained from Cardwell's mathematical model. The neural network model outperformed the mathematical model under every testing criterion, under different filtering conditions on the traffic, and for both lossless and lossy connections. Finally, the same testing and analysis approaches were applied to the dataset from Plymouth University, which revealed similar findings.


Conclusions and Future Research Directions


7.1 Conclusions

This section presents the findings and conclusions from the modelling and analysis carried out during the project.

Initially, lossless TCP flows were considered for modelling. When applying regression analysis between the actual transmission time and the transmission time estimated by the mathematical model, it was noticed that the mathematical model consistently produced underestimates. This indicated the presence of an additional element of time within the lifespan of TCP connections. These underestimates led to high estimation inaccuracy: the MSE of the estimated throughput was up to 11.8910 and the regression value only around 0.32156.

When neural networks were used to model the transmission time, an improvement in estimation accuracy was immediately observed, as residual error values were equally distributed in the positive and negative directions. In a way, the neural model had accounted for the additional time observed in TCP flows. The MSE of the estimated throughput was up to 1.8253 and the regression value around 0.768. These regression results were similar to those obtained in a previous modelling approach by Ghita and Furnell (2008), where the regression value obtained using a neural network under the same conditions was 0.70072.

TCP connections for which throughput was poorly estimated were manually investigated, which revealed the presence of prolonged idle time periods within these connections, impacting the estimation accuracy. Further approaches were taken to investigate the effect of excluding samples with relatively high idle time from the training dataset, and this criterion brought noticeable improvement: regression values increased to 0.9067 and 0.9892 for the mathematical model and neural network respectively, and the estimation MSE decreased to 0.2220 and 0.0229 respectively.
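The two accuracy measures used throughout (MSE and the regression value R) can be reproduced with a short sketch. Here R is computed as the Pearson correlation between actual and estimated values, which is an assumption about how the MATLAB regression plots report it:

```python
import math

def mse(actual, estimated):
    # Mean squared error between actual and estimated values
    return sum((a - e) ** 2 for a, e in zip(actual, estimated)) / len(actual)

def regression_r(actual, estimated):
    # Pearson correlation coefficient between the two series
    n = len(actual)
    ma, me = sum(actual) / n, sum(estimated) / n
    cov = sum((a - ma) * (e - me) for a, e in zip(actual, estimated))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    se = math.sqrt(sum((e - me) ** 2 for e in estimated))
    return cov / (sa * se)

# Toy example: estimates close to the actual values give a low MSE
# and a regression value near 1
actual = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
```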


After modelling the throughput of lossy TCP flows, before any filtering of the available samples, the results obtained were interestingly slightly better than those obtained from the lossless dataset. The neural model provided a regression value of 0.94463. This was probably due to over-fitting of the neural model, caused by the relatively lower number of training samples compared to the samples in the lossless dataset.

Traffic preprocessing was investigated further in order to filter different sorts of non-standard TCP flows from the dataset and to observe the effect of excluding these flows on the estimation performance of both models, for both lossless and lossy traffic. By excluding all non-standard flows, regression results improved to 0.9881 and the estimation MSE decreased to 0.0325.

7.2 Research Limitations

The number of training samples in the dataset was limited by processing time and resources in MATLAB; the dataset size was usually limited to nearly 100,000 samples. This may not have represented all actual variations and conditions in real network traffic. On the other hand, the number of samples for lossy traffic was insufficient for training, which seems to have led to over-fitting, reflected in the better estimation accuracy of the lossy neural model over the lossless neural model.

Manual analysis of traces during the research was not always successful, as identifying the TCP algorithm and implementation was very difficult, and the adjustment of the congestion window was not always the same. At some points it was also difficult to clearly identify the congestion window by looking at the batches of segments sent by the sender, as it was observed that the receiver would tend to regularly acknowledge bursts of data rather than following the standard flow control windowing mechanism and waiting to receive all segments as expected by the algorithm. Although regulations and recommendations are defined in many RFCs, there appear to be substantial differences between current implementations, which makes it more difficult to analyse traffic traces.



7.3 Direction of Future Research

The observations made about the transmission time estimated by the neural network model, and the way the model anticipated the idle time periods in TCP connections, suggest modifying the available mathematical model to include an additional average idle time in the total transmission time. This average could be derived as a function of the average RTT, loss rate and congestion window. Such a modified model could be implemented and evaluated in a simulated environment such as NS2.

In this research, only the maximum idle time as calculated by tcptrace was considered. Although this value may give a good representation of total idle time, especially using AI-based methods, further work could modify the output from tcptrace to iteratively evaluate the total idle time over the complete lifetime of a TCP connection, and consider this value as an input at the neural network modelling stages. This is expected to provide better estimation accuracy.

Identifying the TCP algorithm for manual analysis and trace investigation proved difficult. Yang et al. (2011) proposed an approach to identify the TCP congestion algorithm used in captured TCP traffic. This approach could be adopted to investigate how each implementation deals with the presence of idle time, and how the congestion window is modified after an idle period. A comparison could then be made between different TCP congestion implementations to evaluate how well each finds the ideal congestion window after an idle time, which should result in faster transmission after these idle periods.

Different neural network models should also be considered, with different learning and optimisation functions. The selection of AI-based models other than neural networks is also to be considered.
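The suggested revision can be sketched as follows: keep the four components of Cardwell's latency model and add an idle-time term. The functional form of idle_estimate below is entirely hypothetical; it only illustrates where a term driven by average RTT, loss rate and congestion window would enter the model:

```python
def total_time(t_ss, t_loss, t_ca, t_delack, t_idle=0.0):
    """Cardwell's T = T_ss + T_loss + T_ca + T_delack,
    extended with a candidate idle-time component T_idle."""
    return t_ss + t_loss + t_ca + t_delack + t_idle

def idle_estimate(avg_rtt, loss_rate, cwnd, k=1.0):
    # Hypothetical form: idle cost grows with RTT and loss rate and
    # shrinks as the congestion window grows; k is a fitted constant
    return k * avg_rtt * (1.0 + loss_rate) / max(cwnd, 1)
```

Such a term would then be fitted against the idle periods observed in real traces, for example the maximum idle time reported by tcptrace.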
The use of larger training and validation datasets is also advisable, ensuring that the traces collected come from different sources, in order to obtain datasets of various types and conditions. This should lead to the development of a more robust AI-based model.


References
Allman, M., Paxson, V. and Blanton, E. (2009), TCP Congestion Control, RFC 5681 (Draft Standard). http://www.ietf.org/rfc/rfc5681.txt (accessed 23/08/2012)

Allman, M., Paxson, V. and Stevens, W. (1999), TCP Congestion Control, RFC 2581 (Proposed Standard). Obsoleted by RFC 5681, updated by RFC 3390. http://www.ietf.org/rfc/rfc2581.txt (accessed 23/08/2012)

Balakrishnan, H., Padmanabhan, V., Fairhurst, G. and Sooriyabandara, M. (2002), TCP Performance Implications of Network Path Asymmetry, RFC 3449 (Best Current Practice). http://www.ietf.org/rfc/rfc3449.txt (accessed 23/08/2012)

Braden, R. (1989), Requirements for Internet Hosts - Communication Layers, RFC 1122. http://tools.ietf.org/html/rfc1122 (accessed 23/08/2012)

Caceres, R., Danzig, P. B., Jamin, S. and Mitzel, D. J. (1991), Characteristics of wide-area TCP/IP conversations, SIGCOMM Comput. Commun. Rev. 21(4), 101-112. http://doi.acm.org/10.1145/115994.116003 (accessed 23/08/2012)

Callan, R. (1998), Essence of Neural Networks, Prentice Hall PTR, Upper Saddle River, NJ, USA.

Cardwell, N., Savage, S. and Anderson, T. (2000), Modeling TCP latency, Proceedings of IEEE INFOCOM 2000, pp. 1742-1751.

Demuth, H. and Beale, M. (2011), Neural Network Toolbox User's Guide, The MathWorks, Inc., Natick, MA.

Freeman, J. A. and Skapura, D. M. (1991), Neural Networks: Algorithms, Applications, and Programming Techniques, Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA.


Ghita, B. and Furnell, S. (2008), Neural network estimation of TCP performance, International Conference on Communication Theory, Reliability, and Quality of Service, pp. 53-58.

Ghita, B. V., Furnell, S. M., Lines, B. L. and Ifeachor, E. (2005), TCP performance estimation using neural networks modelling, Proceedings of the Fifth International Network Conference, pp. 19-30.

Hair, Jr., J. F., Anderson, R. E., Tatham, R. L. and Black, W. C. (1995), Multivariate Data Analysis (4th ed.): With Readings, Prentice-Hall, Inc., Upper Saddle River, NJ, USA.

Haykin, S. (1998), Neural Networks: A Comprehensive Foundation, 2nd edn, Prentice Hall PTR, Upper Saddle River, NJ, USA.

He, Q., Dovrolis, C. and Ammar, M. (2007), On the predictability of large transfer TCP throughput, Comput. Netw. 51(14), 3959-3977. http://dx.doi.org/10.1016/j.comnet.2007.04.013 (accessed 23/08/2012)

Jacobson, V. (1990), Modified TCP congestion avoidance algorithm, end2end-interest mailing list. ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail (accessed 23/08/2012)

Kleinbaum, D. G., Nizam, A., Kupper, L. L. and Muller, K. E. (1997), Applied Regression Analysis and Other Multivariable Methods, 3rd edn, Duxbury Press, Pacific Grove, CA, USA.

Kurose, J. F. and Ross, K. W. (2009), Computer Networking: A Top-Down Approach, 5th edn, Addison-Wesley Publishing Company, USA.

Mathis, M., Semke, J., Mahdavi, J. and Ott, T. (1997), The macroscopic behavior of the TCP congestion avoidance algorithm, SIGCOMM Comput. Commun. Rev. 27(3), 67-82. http://doi.acm.org/10.1145/263932.264023 (accessed 23/08/2012)

MAWI Working Group Traffic Archive (n.d.). http://mawi.wide.ad.jp/mawi/


Mehrotra, K., Mohan, C. K. and Ranka, S. (1997), Elements of Artificial Neural Networks, MIT Press, Cambridge, MA, USA.

Mirza, M., Sommers, J., Barford, P. and Zhu, X. (2010), A machine learning approach to TCP throughput prediction, IEEE/ACM Trans. Netw. 18(4), 1026-1039. http://dx.doi.org/10.1109/TNET.2009.2037812 (accessed 23/08/2012)

Padhye, J., Firoiu, V., Towsley, D. and Kurose, J. (1998), Modeling TCP throughput: a simple model and its empirical validation, SIGCOMM Comput. Commun. Rev. 28(4), 303-314. http://doi.acm.org/10.1145/285243.285291 (accessed 23/08/2012)

Paxson, V., Allman, M., Chu, J. and Sargent, M. (2011), Computing TCP's Retransmission Timer, RFC 6298 (Proposed Standard). http://www.ietf.org/rfc/rfc6298.txt (accessed 23/08/2012)

Postel, J. (1981), Transmission Control Protocol, RFC 793 (Standard). Updated by RFCs 1122, 3168, 6093, 6528. http://www.ietf.org/rfc/rfc793.txt (accessed 23/08/2012)

Ramadas, M. (2003), TCPTRACE Manual, Internetworking Research Group, Ohio University. http://tcptrace.org/manual (accessed 23/08/2012)

Shah, S., Rehman, A., Khan, A. and Shah, M. (2007), TCP throughput estimation: a new neural networks model, in Emerging Technologies, 2007. ICET 2007. International Conference on, pp. 94-98.

Stallings, W. (2001), High Speed Networks and Internets: Performance and Quality of Service, 2nd edn, Prentice Hall PTR, Upper Saddle River, NJ, USA.

Stevens, W. R. (1993), TCP/IP Illustrated (Vol. 1): The Protocols, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Stevens, W. R. and Wright, G. R. (1995), TCP/IP Illustrated (Vol. 2): The Implementation, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

UNIBS: Data sharing (2011). http://www.ing.unibs.it/ntw/tools/traces


Wilamowski, B. M. and Irwin, J. D. (2011), Intelligent Systems, 2nd edn, CRC Press, Inc., Boca Raton, FL, USA, chapter 12, pp. 12.1-12.15.

Yang, P., Luo, W., Xu, L., Deogun, J. and Lu, Y. (2011), TCP congestion avoidance algorithm identification, in Proceedings of the 2011 31st International Conference on Distributed Computing Systems, ICDCS '11, IEEE Computer Society, Washington, DC, USA, pp. 310-321. http://dx.doi.org/10.1109/ICDCS.2011.27 (accessed 23/08/2012)


Data Sources
A.1 UNIBS

These traces were provided after contacting the telecommunication networks group at the University of Brescia. According to (UNIBS: Data sharing, 2011), the traces were collected on the edge router of the campus network of the University of Brescia on three consecutive working days (09/30, 10/01 and 10/02). They comprise traffic generated by a set of twenty workstations. The composition of the traffic is shown in Table A.1. All traces were bi-directional, anonymised, and payload-stripped.

Table A.1: Composition of the UNIBS 2009 trace (UNIBS: Data sharing, 2011).
Class of protocols | Flows | Bytes
Web | 61.20% | 12.50%
Mail | 5.70% | 0.20%
P2P (Bittorrent) | 9.30% | 15.90%
P2P (Edonkey) | 18.40% | 70.20%
Skype (TCP) | 1.40% | 1.00%
Skype (UDP) | 3.80% | 0.00%
Other | 0.20% | 0.20%
Total | 78998 flows | 27 GB

A.2 MAWI

These traces were obtained from the MAWI Working Group Traffic Archive (MAWI Working Group Traffic Archive, n.d.), and aggregated into the combined dataset along with the traces provided by Brescia University. Statistical information about the traces is listed in Table A.2.


Table A.2: Composition of the MAWI traces (MAWI Working Group Traffic Archive, n.d.).
Dump File | Total Time | Capture Size | Number of Packets | Average Rate | Number of Flows
200612311400.dump | 899.82 sec | 390.31 MB | 7150380 | 42.38 Mbps | 406366
200701090800.dump | 899.28 sec | 554.83 MB | 10000589 | 62.24 Mbps | 407150
200812291400.dump | 899.93 sec | 828.95 MB | 15014930 | 101.83 Mbps | 502579
201001011400.dump | 900.51 sec | 642.74 MB | 11452979 | 63.43 Mbps | 771738
201001021400.dump | 900.58 sec | 807.98 MB | 14648937 | 91.27 Mbps | 747759


Results Using the Dataset from Plymouth University Campus

B.1 Considering All Valid TCP Connections

B.1.1 Results for the Lossless Dataset

[Regression plots. Mathematical model: R = 0.34075, fit Output ~= 0.23*Target + 4.1. Neural network model: R = 0.77412, fit Output ~= 0.6*Target + 0.37.]
Figure B.1: Regression obtained for lossless connections for the Plymouth dataset using both the mathematical and neural network models, prior to any filtering.





Figure B.2: CDF of absolute relative error for lossless connections for the Plymouth dataset, prior to any filtering.

B.1.2 Results for the Lossy Dataset

[Regression plots. Mathematical model: R = 0.58858, fit Output ~= 0.63*Target + 2. Neural network model: R = 0.87173, fit Output ~= 0.76*Target + 0.49.]
Figure B.3: Regression obtained for lossy connections for the Plymouth dataset using both the mathematical and neural network models, prior to any filtering.





Figure B.4: CDF of absolute relative error for lossy connections for the Plymouth dataset, prior to any filtering.



B.2 Filtering TCP Connections with High Relative Idle Time and Non-Standard TCP Flows

B.2.1 Results for the Lossless Dataset

[Regression plots. Mathematical model: R = 0.84477, fit Output ~= 0.65*Target + 0.14. Neural network model: R = 0.98526, fit Output ~= 0.97*Target + 0.056.]

Figure B.5: Regression obtained for lossless connections for the Plymouth dataset using both the mathematical and neural network models, after filtering all non-standard TCP flows and connections with high relative idle time.



Figure B.6: CDF of absolute relative error for lossless connections for the Plymouth dataset, after filtering all non-standard TCP flows and connections with high relative idle time.

B.2.2 Results for the Lossy Dataset

[Regression plots. Mathematical model: R = 0.96781; neural network model: R = 0.9915. Fits: Output ~= 0.97*Target + 0.032 and Output ~= 0.89*Target + 0.12.]

Figure B.7: Regression obtained for lossy connections for the Plymouth dataset using both the mathematical and neural network models, after filtering all non-standard TCP flows and connections with high relative idle time.





Figure B.8: CDF of absolute relative error for lossy connections for the Plymouth dataset, after filtering all non-standard TCP flows and connections with high relative idle time.


MATLAB Scripts
C.1 Cardwell Mathematical Model Implementation
%% This function implements the TCP mathematical model defined in Cardwell, N.,
%% Savage, S. and Anderson, T. (2000), Modeling TCP latency, IEEE INFOCOM, pp. 1742-1751.
%% Argument(s): a vector with the TCP statistics of a single valid TCP connection
%% Return variable(s): TCP transmission time (T)
function T = MathModel(sample)

%% Initialise and get parameters
T_ss = 0;
T_loss = 0;
T_ca = 0;
T_delack = 0;
b = 2;             % Delayed acknowledgements are sent every b data segments. Typical value is 2
alpha = 1 + 1/b;   % 1.5
d_bytes = sample(24);
d_pkts = sample(23);
RTT = sample(61);
avg_seg_size = sample(41);
w1 = sample(51);
initial_win = sample(50);
max_owin = sample(46);
MSS = sample(39);
W_max = ceil(max_owin/MSS);
p = sample(155);
T_0 = sample(81);

%% Calculate T_ss (time spent during slow start)
% Number of data segments expected to be sent during slow start
if (p > 0)
    d_ss = (1-(1-p)^d_pkts)*(1-p)/p + 1;
else
    d_ss = d_pkts;
end
% Expected window size by the end of slow start
W_ss = d_ss*(alpha-1)/alpha + w1/alpha;
% Time spent in slow start
if (W_ss > W_max)
    T_ss = (RTT/1000)*(logbase((W_max/w1),alpha) + 1 + 1/W_max*(d_ss - (alpha*W_max-w1)/(alpha-1)));
else
    T_ss = (RTT/1000)*logbase(((d_ss*(alpha-1)/w1)+1), alpha);
end

%% Calculate T_loss (time spent during segment loss recovery)
% l_ss is the probability that slow start ends due to a packet loss. Equation (16)
l_ss = 1-(1-p)^d_pkts;
% Q is the probability of packet losses due to retransmission timeouts (RTO). Equation (17)
function prob = Q(a,b)
    prob = min(1, (1+((1-a)^3)*(1-(1-a)^(b-3)))/((1-(1-a)^b)/(1-(1-a)^3)));
end
% Z_TO is the expected cost of an RTO. Equations (18,19)
G_p = 1 + p + 2*p^2 + 4*p^3 + 8*p^4 + 16*p^5 + 32*p^6;
Z_TO = G_p*T_0/(1-p);
% T_loss is the expected cost of any RTO or fast recovery at the end of the
% initial slow start phase. Equation (20)
T_loss = l_ss * (Q(p,W_ss)*Z_TO + (1-Q(p,W_ss))*RTT);

%% Calculate T_ca (time spent sending the remainder of the data in congestion avoidance)
% d_ca is the amount of data left to be transmitted after slow start and loss occurrence. Equation (21)
d_ca = d_pkts - d_ss;
% W_p is the expected cwnd at the time of loss events. Equation (23)
W_p = (2+b)/(3*b) + sqrt(8*(1-p)/(3*b*p) + ((2+b)/(3*b))^2);
% R is the steady-state throughput. Equation (22)
if (W_p < W_max)
    R = (((1-p)/p)+(W_p/2)+Q(p,W_p)) / ((RTT*(b/2*W_p+1)) + (Q(p,W_p)*G_p*T_0)/(1-p));
else
    R = (((1-p)/p)+(W_max/2)+Q(p,W_max)) / ((RTT*(b/8*W_max+((1-p)/(p*W_max))+2)) + (Q(p,W_max)*G_p*T_0)/(1-p));
end
% T_ca is the time spent sending the remaining data
if (p > 0)
    T_ca = d_ca/R;
end

%% Calculate T_delack (cost due to delayed acknowledgements)
% if (initial_win == MSS)
%     T_delack = 0.2;
% end

%% Calculate total transmission time (T). Equation (25)
T = T_ss + T_loss + T_ca + T_delack;

end
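As a cross-check of the slow-start expressions in the script above, d_ss and W_ss can be transcribed directly. This is a sketch only; the default values w1 = 2 and b = 2 are illustrative assumptions, since in the MATLAB function w1 comes from the per-connection statistics vector:

```python
def slow_start_expectations(d_pkts, p, w1=2, b=2):
    """Expected number of segments sent in slow start (d_ss) and the
    congestion window at the end of slow start (W_ss), mirroring the
    expressions in the MATLAB function above; w1 is the initial window
    and b the delayed-ACK factor."""
    alpha = 1 + 1 / b
    if p > 0:
        d_ss = (1 - (1 - p) ** d_pkts) * (1 - p) / p + 1
    else:
        d_ss = d_pkts
    w_ss = d_ss * (alpha - 1) / alpha + w1 / alpha
    return d_ss, w_ss
```

With no loss (p = 0) every data segment is sent in slow start, so d_ss equals d_pkts; with a non-zero loss rate, d_ss drops below the number of packets in the flow.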


C.2 Neural Network Modelling

%% This script solves an input-output fitting problem with a neural network.
% math: throughput estimates from the mathematical model, used for the comparison plots
function [net,tr] = NeuralNet(inputs, targets, math)

%% Decide the number of hidden neurons
inputs_size = size(inputs);
outputs_size = size(targets);
num_hidden_neurons = ceil((inputs_size(1)+outputs_size(1))*2/3);

%% Create a fitting network
hiddenLayerSize = [num_hidden_neurons];
net = fitnet(hiddenLayerSize);

%% Choose input and output pre/post-processing functions
net.inputs{1}.processFcns = {'removeconstantrows','mapminmax'};
net.outputs{2}.processFcns = {'removeconstantrows','mapminmax'};

%% Set up division of data for training, validation and testing
net.divideFcn = 'dividerand';  % Divide data randomly
net.divideMode = 'sample';     % Divide up every sample
net.divideParam.trainRatio = 75/100;
net.divideParam.valRatio = 15/100;
net.divideParam.testRatio = 15/100;

%% Training function
net.trainFcn = 'trainlm';  % Levenberg-Marquardt backpropagation

%% Choose a performance function
net.performFcn = 'mse';  % Mean squared error

%% Choose plot functions
net.plotFcns = {'plotperform','plottrainstate','ploterrhist', ...
    'plotregression','plotfit'};

%% Choose the maximum number of validation increases
net.trainParam.max_fail = 200;

%% Train the network
[net,tr] = train(net,inputs,targets);

%% Test the network
outputs = net(inputs);
errors = gsubtract(targets,outputs);
performance = perform(net,targets,outputs);

%% Recalculate training, validation and test performance
trainTargets = targets .* tr.trainMask{1};
valTargets = targets .* tr.valMask{1};
testTargets = targets .* tr.testMask{1};
trainPerformance = perform(net,trainTargets,outputs);
valPerformance = perform(net,valTargets,outputs);
testPerformance = perform(net,testTargets,outputs);

%% View the network
view(net)

%% Plots
figure, plotregression(targets,outputs,'Neural Network',targets,math,'Mathematical model');

%% Plot the CDF of absolute relative errors for both the neural network and mathematical models
relativeErrorsNN = errors ./ targets;
absRelativeErrorsNN = abs(relativeErrorsNN);
mathErrors = gsubtract(targets, math);
relativeErrorsMath = mathErrors ./ targets;
absRelativeErrorsMath = abs(relativeErrorsMath);
cdfplot(absRelativeErrorsNN);
set(gca,'XScale','log');
hold on;
cdfplot(absRelativeErrorsMath);

