
IDENTIFICATION AND DETECTION OF OUTLIERS USING SUPERVISED LEARNING TECHNIQUES

A DISSERTATION Submitted in partial fulfillment of the requirements for the award of the degree of MASTER OF TECHNOLOGY in INFORMATION TECHNOLOGY

by HIMANSHU UNIYAL (En No-GE-11160488)

DEPARTMENT OF INFORMATION TECHNOLOGY
GRAPHIC ERA UNIVERSITY, DEHRADUN
DEHRADUN 248002 (INDIA)
JULY, 2013

DEDICATED TO My Parents and My Supervisor Dr. Bhaskar Pant


566/6, Bell Road, Clement Town, Dehradun, Uttarakhand. Website: www.geu.ac.in

CANDIDATE'S DECLARATION
I hereby certify that the work presented in the dissertation entitled IDENTIFICATION AND DETECTION OF OUTLIERS USING SUPERVISED LEARNING TECHNIQUES, in partial fulfillment of the requirements for the award of the Degree of Master of Technology in Information Technology and submitted in the Department of Information Technology of Graphic Era University, Dehradun, is an authentic record of my own work, carried out during the period from August 2012 to July 2013 under the supervision of Dr. Bhaskar Pant, Associate Professor, Department of Information Technology, Graphic Era University, Dehradun. The matter presented in this dissertation has not been submitted by me for the award of any other degree of this or any other Institute.

(HIMANSHU UNIYAL)

This is to certify that the above statement made by the candidate is correct to the best of my knowledge.

Signature Head of Department

(Dr. Bhaskar Pant) Supervisor

The Viva-Voce examination of Mr. Himanshu Uniyal has been held on

Signature of Internal Examiner

Signature of External Examiner


ABSTRACT

Outliers are data points that do not belong to the rest of the data set according to some criterion; they do not conform to the normal points that characterize the data set. Traditional data mining techniques look for general patterns that apply to the majority of the data, whereas outlier detection focuses on the data objects whose behavior differs from the rest of the data set. In other words, outlier detection approaches discover patterns that occur infrequently in the data, as opposed to conventional data mining techniques, which look for patterns that occur frequently. Outliers are often treated as noisy data, and they are typically the points whose removal leaves the data set as a whole with less uncertainty or disorder. Outliers differ from noise, however: noise is random error or variance in a measured variable, and it should be removed before outlier detection. Outliers are interesting because they violate the mechanism that generates the normal data. Outliers can be detected using supervised and unsupervised learning techniques. Supervised learning takes place in the presence of a teacher, where the class labels are known, whereas unsupervised learning takes place in the absence of a teacher, with no knowledge of class labels. Classification is an example of a supervised learning technique and clustering is an example of an unsupervised learning technique. In this project, outlier detection is carried out using both kinds of techniques: a greedy algorithm is used as the classification technique and fuzzy c-means clustering is used as the fuzzy clustering technique.

KEYWORDS: Outliers, Outlier Detection, Supervised Learning, Unsupervised Learning, Classification, Greedy Algorithm, Support Vector Machine, Clustering, Fuzzy Clustering, Fuzzy C-Means Clustering.
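As a minimal illustration of the entropy-based idea described in the abstract, the sketch below greedily removes the records whose removal most reduces the data set's entropy and flags them as candidate outliers. It is an assumed, simplified rendering of that idea, not the dissertation's own implementation: the per-attribute entropy measure, the categorical example data, and the function names (entropy, greedy_outliers) are all hypothetical.

```python
# Minimal sketch (assumed, not the dissertation's code): greedily remove the
# points whose removal most lowers the data set's entropy, treating them as
# candidate outliers. Attributes are assumed to be categorical/discretized.
from collections import Counter
from math import log2

def entropy(dataset):
    """Sum of per-attribute Shannon entropies over the records."""
    if not dataset:
        return 0.0
    total = 0.0
    n = len(dataset)
    for col in range(len(dataset[0])):
        counts = Counter(row[col] for row in dataset)
        total += -sum((c / n) * log2(c / n) for c in counts.values())
    return total

def greedy_outliers(dataset, k):
    """Remove k records one at a time, each time choosing the record whose
    removal yields the largest drop in entropy (hypothetical illustration)."""
    data = list(dataset)
    outliers = []
    for _ in range(k):
        base = entropy(data)
        best_idx, best_gain = None, 0.0
        for i in range(len(data)):
            reduced = data[:i] + data[i + 1:]
            gain = base - entropy(reduced)
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:
            break
        outliers.append(data.pop(best_idx))
    return outliers, data

# Example usage with a tiny made-up categorical data set:
records = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "x"), ("b", "z")]
flagged, remaining = greedy_outliers(records, k=1)
print(flagged)  # the record whose removal most reduces the entropy
```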


ACKNOWLEDGEMENT

This project is by far the most significant accomplishment of my life, and it would have been impossible without the people who supported me and believed in me. I would like to extend my gratitude and sincere thanks to my honorable, esteemed supervisor, Dr. Bhaskar Pant, for his immeasurable guidance and the valuable time he devoted to this project. I sincerely thank him for his exemplary guidance and encouragement. His trust and support inspired me at the most important moments to make the right decisions, and I am glad to have worked with him.

I want to express my sincere thanks to Prof. R. C. Joshi, Dr. Santosh Kumar, and Dr. Emmanuel Shubhankar Pillai for their continuous motivation, help, and guidance.

I would also like to thank our Director of Research, Dr. Ankush Mittal, the Vice Chancellor of G.E.U., Dr. M. R. Tyagi, and the Head of Department, Dr. D. Borodoloi, who provided me with all the facilities and coordination.

I would like to thank all my friends, and especially my classmates, for all the thoughtful and mind-stimulating discussions we had, which prompted us to think beyond the obvious. I have enjoyed their companionship very much during my stay at G.E.U. Dehradun.

I would like to thank all those who made my stay at G.E.U. Dehradun an unforgettable and rewarding experience.

HIMANSHU UNIYAL

TABLE OF CONTENTS
CANDIDATE'S DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
   1.1 Outliers
   1.2 Types of Outliers
      1.2.1 Global Outliers
      1.2.2 Contextual Outliers
      1.2.3 Collective Outliers
   1.3 Challenges of Outlier Detection
      1.3.1 Modeling Normal Objects and Outliers Effectively
      1.3.2 Application-specific Outlier Detection
      1.3.3 Handling Noise in Outlier Detection
      1.3.4 Understandability
   1.4 Outlier Detection Methods
      1.4.1 Supervised, Semi-Supervised, and Unsupervised Methods
         1.4.1.1 Supervised Methods
         1.4.1.2 Unsupervised Methods
         1.4.1.3 Semi-Supervised Methods
      1.4.2 Statistical Methods, Proximity-Based Methods, and Clustering-Based Methods
         1.4.2.1 Statistical Methods
         1.4.2.2 Proximity-Based Methods
         1.4.2.3 Clustering-Based Methods
   1.5 Outline of Dissertation
2. BACKGROUND AND LITERATURE SURVEY
   2.1 Introduction
   2.2 Outlier Detection Approaches
      2.2.1 Distance-based Approach
         2.2.1.1 Distance-based Definitions for Outliers
      2.2.2 Distribution-based Approach
         2.2.2.1 A Method for High Dimensional Data
      2.2.3 Density-based Approach
         2.2.3.1 Local Outlier Factor
   2.3 Classification
      2.3.1 Classification by Decision Tree Induction
      2.3.2 Bayesian Classification
         2.3.2.1 Bayes Theorem
         2.3.2.2 Naive Bayesian Classification
   2.4 Clustering
      2.4.1 Partitional Clustering
      2.4.2 Hierarchical Methods
      2.4.3 Density-based Methods
      2.4.4 Grid-based Methods
3. METHODS AND METHODOLOGIES
   3.1 Greedy Algorithm
   3.2 Fuzzy C-Means Clustering
   3.3 Support Vector Machine
      3.3.1 Finding Optimal Parameter Values
   3.4 LIBSVM - Library for Support Vector Machine
4. EXPERIMENTAL RESULT
   4.1 Introduction
   4.2 Experiment
      4.2.1 Steps for finding Entropy and Detection of Outliers
5. CONCLUSION AND SCOPE FOR FUTURE WORK
   5.1 Conclusion
   5.2 Scope for Future Work
PUBLICATION OUT OF THIS WORK
REFERENCES

LIST OF ABBREVIATIONS
FCM - Fuzzy C-Means
SVM - Support Vector Machine
SVR - Support Vector Regression
SVC - Support Vector Classification
RBF - Radial Basis Function
SRM - Structural Risk Minimization
ERM - Empirical Risk Minimization
LIBSVM - Library for Support Vector Machine


LIST OF FIGURES
Figure 1.1: An outlier data point.
Figure 1.2: Outlier Detection Issues with Uncertain Data.
Figure 1.3: Outlier detection process in Data Mining.
Figure 2.1: Illustration of outlier definition by Knorr and Ng.
Figure 3.1: A linear SVM. The circled data points are the support vectors, the examples that are closest to the decision boundary; they determine the margin with which the two classes are separated.
Figure 3.2: Soft margin showing misclassification between the two categories.
Figure 4.1: The first scan of the benign data set.
Figure 4.2: 70th instance removed after first scan.
Figure 4.3: 152nd instance removed after first scan.
Figure 4.4: 161st instance removed after first scan.
Figure 4.5: 392nd instance removed after first scan.
Figure 4.6: Second scan of the data set after removal of the 70th, 152nd, 161st, and 392nd outlier instances.
Figure 4.7: 163rd instance removed after second scan.
Figure 4.8: Third scan of the data set after removal of the 163rd outlier instance.
Figure 4.9: The first scan of the malignant data set.
Figure 4.10: 132nd instance removed after first scan.
Figure 4.11: 178th instance removed after first scan.
Figure 4.12: 192nd instance removed after first scan.
Figure 4.13: Second scan of the data set after removal of the 132nd, 178th, and 192nd outlier instances.
Figure 4.14: 166th instance removed after second scan.
Figure 4.15: 228th instance removed after second scan.
Figure 4.16: Third scan of the data set after removal of the 166th and 228th outlier instances.
Figure 4.17: Outlier data points with 0 membership for benign tumors.
Figure 4.18: Outlier data points with 0 membership for malignant tumors.


LIST OF TABLES
Table 4.1 Comparison of FCM and SVM.

