
Association rule mining for suspicious email detection: A data mining approach

1. PROBLEM DEFINITION
It is hard to remember what our lives were like without email. Email ranks alongside the web as one of the most useful features of the Internet, and billions of messages are sent each year. Although email was originally developed for sending simple text messages, it has become more robust in the last few years. It is therefore one possible source of data from which potential problems can be detected. The problem is to build a system that identifies deception in communication through emails. Even after deceptive emails have been classified, we must be able to differentiate informative emails from alert emails. We refer to informative emails as those that give details about hazardous events that have already happened, and to alert emails as those that remind us to prevent such hazardous events from occurring in the coming days. As the Internet grows at a phenomenal rate, email has become a widely used electronic form of communication. Every day, a large number of people exchange messages in this fast and inexpensive way. With the excitement about electronic commerce growing, the use of email will increase even faster.

1 Dept. of Computer Engg. DYPCOE,Akurdi

2. LITERATURE SURVEY
EXISTING SYSTEM
In the existing system, mails are delivered to the authenticated users for whom they are intended. Some defects in the existing system are:
1) Suspicious mails cannot be detected.
2) Offensive users cannot be identified.
This system is a fresh idea and has not yet been implemented in industry. The basic idea was conceived by a group of professional network administrators, and the concept was then developed.
Association rule mining has been extensively investigated in the data mining literature. Many efficient algorithms have been proposed, the most popular being the Apriori algorithm and FP-growth. Association rule mining typically aims at discovering associations between items in a transactional database. It is carried out in two steps. In the first step, frequent item sets are discovered, i.e. item sets whose support is no less than the minimum support. In the second step, association rules are derived from the frequent item sets. We constrain the association rules to be discovered such that the antecedent of each rule is a conjunction of features from the email, while the consequent is always the category to which the email belongs.
PROPOSED SYSTEM
In the proposed system, suspicious users are detected and offensive mails are blocked. Features of the proposed system:
1) It helps in finding anti-social elements.
2) It provides security to the system that adopts it.
3) It also helps the intelligence bureau, crime branch, etc.
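The two steps of association rule mining described in this survey (finding frequent item sets, then deriving rules whose consequent is the email category) can be illustrated with a small Java sketch. It is not the project's implementation: the class name `AprioriSketch`, the toy transactions and the thresholds are assumptions made for illustration, and the usual Apriori subset-pruning optimization is left out for brevity, so every joined candidate is simply checked against the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class AprioriSketch {

    // Hypothetical toy data: each transaction holds an email's terms plus its category label.
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> tx = new ArrayList<>();
        tx.add(new HashSet<>(Arrays.asList("bomb", "attack", "tomorrow", "alert")));
        tx.add(new HashSet<>(Arrays.asList("bomb", "attack", "yesterday", "informative")));
        tx.add(new HashSet<>(Arrays.asList("bomb", "tomorrow", "alert")));
        tx.add(new HashSet<>(Arrays.asList("meeting", "tomorrow", "normal")));
        return tx;
    }

    // Support: fraction of transactions that contain every item of the item set.
    static double support(List<Set<String>> transactions, Set<String> itemSet) {
        if (transactions.isEmpty()) return 0.0;
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemSet)) hits++;
        }
        return (double) hits / transactions.size();
    }

    // Step 1: level-wise search for item sets whose support is no less than minSupport.
    static Set<Set<String>> frequentItemSets(List<Set<String>> transactions, double minSupport) {
        Set<Set<String>> result = new HashSet<>();
        Set<Set<String>> level = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                Set<String> single = new TreeSet<>();
                single.add(item);
                if (support(transactions, single) >= minSupport) level.add(single);
            }
        }
        while (!level.isEmpty()) {
            result.addAll(level);
            List<Set<String>> current = new ArrayList<>(level);
            Set<Set<String>> next = new HashSet<>();
            for (int i = 0; i < current.size(); i++) {
                for (int j = i + 1; j < current.size(); j++) {
                    Set<String> candidate = new TreeSet<>(current.get(i));
                    candidate.addAll(current.get(j));
                    // Keep only joins that grow the item set by exactly one item.
                    if (candidate.size() == current.get(i).size() + 1
                            && support(transactions, candidate) >= minSupport) {
                        next.add(candidate);
                    }
                }
            }
            level = next;
        }
        return result;
    }

    // Step 2: confidence of the class-constrained rule "antecedent -> category".
    static double confidence(List<Set<String>> transactions, Set<String> antecedent, String category) {
        Set<String> both = new TreeSet<>(antecedent);
        both.add(category);
        double antecedentSupport = support(transactions, antecedent);
        return antecedentSupport == 0.0 ? 0.0 : support(transactions, both) / antecedentSupport;
    }

    public static void main(String[] args) {
        List<Set<String>> tx = sampleTransactions();
        System.out.println("Frequent item sets: " + frequentItemSets(tx, 0.5));
        Set<String> antecedent = new TreeSet<>(Arrays.asList("bomb", "tomorrow"));
        System.out.println("confidence = " + confidence(tx, antecedent, "alert"));
    }
}
```

In this sketch the category label is stored as an ordinary item of the transaction, which keeps the support computation uniform; a rule is then accepted only when its consequent is one of the category labels.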

We present an association rule mining algorithm (the Apriori algorithm) to detect suspicious email. The proposed method is implemented in Java and has three parts: email preprocessing, building the associative classifier, and validation.
1) Email preprocessing: this transforms email messages into a representation suitable for the Apriori algorithm. The concepts used are email data extraction and presentation, text term extraction, lexical analysis and feature selection.
2) Building the associative classifier: this involves the email classification process, in which the Apriori algorithm is applied to suspicious email detection. Association rule mining searches for interesting association or correlation relationships among items in a given large data set. The association rules generated are expressed as numerical values, so the output is visualized with respect to the output column of the preprocessing step.
Processing such as lexical analysis, stop-list word removal and stemming should be applied to the extracted data.
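The preprocessing steps named above (lexical analysis, stop-list word removal and stemming) can be sketched as follows. This is a minimal illustration under stated assumptions: the class `EmailPreprocessingSketch`, the very short stop list and the crude suffix-stripping rule are all hypothetical, and the suffix stripping only stands in for a proper stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class EmailPreprocessingSketch {

    // Hypothetical, deliberately tiny stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "was", "will", "be", "to", "of", "in", "on", "at", "and"));

    // Lexical analysis: lower-case the text and split it on anything that is not a letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Crude suffix stripping (assumption): removes "ing", "ed" or "s" from longer words.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Feature selection in its simplest form: distinct stemmed terms that are not stop words.
    static Set<String> features(String emailText) {
        Set<String> features = new LinkedHashSet<>();
        for (String token : tokenize(emailText)) {
            if (!STOP_WORDS.contains(token)) features.add(stem(token));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(features("Two bombs are loaded on the trucks"));
    }
}
```

The resulting set of terms is the kind of transaction that an association rule miner can consume once the email's category label is attached to it.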

3. SOFTWARE REQUIREMENTS SPECIFICATION
3.1 INTRODUCTION
3.1.1 PROJECT SCOPE
The scope of the project is to allow users to detect wrongdoing within their web mail accounts, so that they can easily identify which emails contain threatening messages that may harm the security of the country and which emails contain normal messages.

3.1.2 USER CLASSES AND CHARACTERISTICS
The major user classes that are expected to use this product are as follows:
1) The defense system of the country: The defense system of the country will benefit the most from this software tool. With it, they can scan thousands of emails and classify them as normal or suspicious.
2) Open Source Community: The Open Source Community is expected to be the main user class of this product. These users are expected to share their code and ideas by uploading and downloading source code over the p2p network.
3) General Users: General users are expected to use our tool to secure their web accounts.

3.1.3 OPERATING ENVIRONMENT
Client Side Requirements
OS: Windows
Software Packages: Java
Server Side Requirements
OS: Windows
Software Packages:
Java technology: The Java environment includes a large number of tools that are part of the system known as the Java Development Kit (JDK), and hundreds of classes, methods and interfaces grouped into packages that form the Java Standard Library (JSL).
Oracle: Oracle is a relational database management system (RDBMS), which organizes data in the form of tables. Oracle is one of many database servers based on the RDBMS model, which manages a set of data with respect to three specific things: data structures, data integrity and data manipulation. With Oracle cooperative server technology we can realize the benefits of open, relational systems for all applications. Oracle makes efficient use of all system resources, on all hardware architectures, to deliver unmatched performance, price performance and scalability. Any DBMS to be called an RDBMS has to satisfy Dr. E. F. Codd's rules.
Java Servlet Technology: Servlets are platform-independent, 100% pure Java server-side modules that fit seamlessly into a web server framework and can be used to extend the capabilities of a web server with minimal overhead, maintenance and support.
JDBC (Java Database Connectivity): Provides a uniform interface to a wide range of relational databases, and provides a common base on which higher-level tools and interfaces can be built.
JMS (Java Mail Service)

3.1.4 DESIGN AND IMPLEMENTATION CONSTRAINTS
1) Software update constraints
2) Hardware constraints

3.1.5 ASSUMPTIONS AND DEPENDENCIES
We assume that users will not exploit their knowledge of the Apriori algorithm to misuse the mail detection program, for example by finding a loophole that would render the software useless.

3.2 SYSTEM FEATURES

3.2.1 FUNCTIONAL REQUIREMENTS

1) Read the input mail from a mail server such as Gmail or Yahoo Mail, remove the tags from the mail and get the contents.
2) Split the email content into a set of digests and get the feature list from the digests.
3) Compute the Transformed Feature Set of the message, TFSet(Ma).
4) Get the feature set, read the spam and ham knowledge bases, check the mail against them, compute the score and classify the email as spam or ham.
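The flow of these requirements (remove tags, obtain a feature set, compare it with the spam and ham knowledge bases, compute a score, classify) can be sketched in Java. The report does not define the digest splitting, the TFSet transformation or the scoring formula, so this sketch skips the first two and assumes a simple hypothetical score: one point for each feature found in the spam knowledge base, minus one point for each feature found in the ham knowledge base. The class name `MailScoringSketch` and the sample knowledge bases are also assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class MailScoringSketch {

    // Naive tag removal: drop anything between '<' and '>' and collapse whitespace.
    static String removeTags(String rawMail) {
        return rawMail.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Feature set (assumption): the distinct lower-cased words of the mail content.
    static Set<String> featureSet(String content) {
        Set<String> features = new HashSet<>();
        for (String w : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) features.add(w);
        }
        return features;
    }

    // Hypothetical score: spam knowledge base hits minus ham knowledge base hits.
    static int score(Set<String> features, Set<String> spamKb, Set<String> hamKb) {
        int score = 0;
        for (String f : features) {
            if (spamKb.contains(f)) score++;
            if (hamKb.contains(f)) score--;
        }
        return score;
    }

    // A positive score is classified as spam, anything else as ham.
    static String classify(String rawMail, Set<String> spamKb, Set<String> hamKb) {
        int s = score(featureSet(removeTags(rawMail)), spamKb, hamKb);
        return s > 0 ? "spam" : "ham";
    }

    public static void main(String[] args) {
        Set<String> spamKb = new HashSet<>(Arrays.asList("lottery", "winner", "free"));
        Set<String> hamKb = new HashSet<>(Arrays.asList("meeting", "report", "agenda"));
        System.out.println(classify("<html><b>Free</b> lottery winner!</html>", spamKb, hamKb));
        System.out.println(classify("<p>Agenda for the meeting</p>", spamKb, hamKb));
    }
}
```

The regular expression used for tag removal is only adequate for simple markup; a production system would more likely rely on an HTML parser.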

3.3 EXTERNAL INTERFACE REQUIREMENTS


3.3.1 USER INTERFACES:
1. Know your user
2. Pay attention to patterns
3. Stay consistent
4. Provide feedback
5. Speak their language
6. Keep moving forward
3.3.2 HARDWARE INTERFACES:
The selection of hardware is very important for the existence and proper working of any software. In selecting hardware, the size and capacity requirements are also important. The suspicious email detection system can run efficiently on a Pentium system with at least 128 MB of RAM and a 20 GB hard disk drive. A 1.44 MB floppy disk drive and a 14 inch Samsung color monitor suit the information system operation. (A printer is required for hard copy output.)
1) Processor: Pentium, 233 MHz or above
2) RAM capacity: 256 MB
3) Hard disk: 20 GB

3.3.3 SOFTWARE INTERFACES

One of the most difficult tasks, once the system requirements are known, is determining whether a particular software package fits the requirements. After the initial selection, further scrutiny is needed to determine the desirability of a particular software package compared with other candidates. This section first summarizes the application requirement question and then suggests more detailed comparisons.
Operating System :: Windows XP/7
Server Side :: JSP with Tomcat Server
Client Side :: HTML, JavaScript
Services :: JDBC
Database :: Oracle 10g/XE

3.4 NON FUNCTIONAL REQUIREMENTS


The project requires internet connectivity to send and receive emails between client and server.

3.4.1 PERFORMANCE REQUIREMENTS:
The performance of a system is indicated by its response time, its utilization of resources and its throughput. Care is taken to ensure a system with comparatively high performance.
3.4.2 SAFETY REQUIREMENTS:
A mixture containing 1000 informative emails, 1000 alert emails and 1000 normal emails was used. The system was trained with the training dataset, and the default support and confidence thresholds were used. When the training process was finished, the top 20 best quality rules were taken as the final classification rules.
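Selecting the top-ranked rules after training can be sketched as below. The report does not say how rule quality is measured, so this sketch assumes one common ordering: higher confidence first, with ties broken by higher support. The class `RuleRankingSketch`, the `Rule` type and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RuleRankingSketch {

    // A class-constrained rule: antecedent terms imply an email category.
    static final class Rule {
        final String antecedent;
        final String category;
        final double support;
        final double confidence;

        Rule(String antecedent, String category, double support, double confidence) {
            this.antecedent = antecedent;
            this.category = category;
            this.support = support;
            this.confidence = confidence;
        }
    }

    // Returns the k best rules, ordered by confidence and then support, both descending.
    static List<Rule> topRules(List<Rule> rules, int k) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparingDouble((Rule r) -> r.confidence)
                .thenComparingDouble(r -> r.support)
                .reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule("bomb, tomorrow", "alert", 0.20, 0.95),
                new Rule("meeting", "normal", 0.40, 0.80),
                new Rule("attack, yesterday", "informative", 0.30, 0.95));
        for (Rule r : topRules(rules, 20)) {
            System.out.println(r.antecedent + " -> " + r.category);
        }
    }
}
```

Calling `topRules(rules, 20)` mirrors the cut-off of 20 rules mentioned above; when fewer rules exist, all of them are returned.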

3.4.3 SECURITY REQUIREMENTS:
These are the factors that protect the software from accidental or malicious access, use, modification, destruction or disclosure. Security is ensured because the project involves authenticating the users.
3.4.4 SOFTWARE QUALITY ATTRIBUTE:
We find that a simple Apriori algorithm can provide good classification results for suspicious email detection. In the near future, we plan to incorporate other techniques, such as different ways of feature selection and classification using other methods. One major advantage of the association rule based classifier is that it does not assume that terms are independent, and its training is relatively fast. Furthermore, the rules are understandable to humans and easy for them to maintain or prune.

3.6 ANALYSIS MODEL


3.6.1 DATA FLOW DIAGRAM:
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. DFDs can also be used for the visualization of data processing (structured design). A DFD provides no information about the timing of processes, or about whether processes will operate in sequence or in parallel. It is therefore quite different from a flowchart, which shows the flow of control through an algorithm, allowing a reader to determine what operations will be performed, in what order, and under what circumstances, but not what kinds of data will be input to and output from the system, nor where the data will come from and go to, nor where the data will be stored.

3.6.2 CLASS DIAGRAM
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static structure diagram that describes the structure of a system by showing the system's classes, their attributes, and the relationships between the classes. Fig 3.2 shows the various classes or main entities involved in the system and their relationships with each other. It depicts the attributes and operations each class can carry out, individually and with the help of other classes in the system designed.

3.6.3 STATE-TRANSITION DIAGRAMS:
This data model is based on a perception of the real world that consists of a collection of basic objects, called entities, and of relationships among them. An entity is an object in the real world that is distinguishable from other objects. Entities are described in a database by a set of attributes. A relationship is an association among entities. An attribute is a descriptive property of each member of an entity set. The entity-relationship structure consists of:
Rectangles: represent entity sets.
Ellipses: represent attributes.
Diamonds: represent relationships among entity sets.
Lines: link attributes to entity sets and entity sets to relationships.


3.7 SYSTEM IMPLEMENTATION PLAN

Implementation includes all those activities that take place to convert the old system to the new system. The new system will replace the existing system. The aspects of implementation are conversion and post-implementation review.
1) Conversion
Conversion means changing from one system to another. The objective is to put the tested system into operation. It involves proper installation of the software package developed and training of the operating staff. The software has been installed and found to be functioning properly. The users have to be trained to handle the system effectively. Sample data was provided to the operating staff, who were asked to operate the system. The operating staff now have a clear outlook on the software and are ready for practical implementation of the package.
2) Post-Implementation Review
A post-implementation review is an evaluation of the system in terms of the extent to which it accomplishes the stated objectives. It starts after the system is implemented and conversion is complete.


4. SYSTEM DESIGN
# | Task | Starting | Finish | No. of days
1 | Project selection | 15/07/2011 | 25/07/2011 | 11
2 | Discuss about various projects | 15/07/2011 | 18/07/2011 | 4
3 | Project topic searching | 19/07/2011 | 21/07/2011 | 3
4 | Obstacle Detection System for UAV as a project | 22/07/2011 | 22/07/2011 | 1
5 | Synopsis creation | 23/07/2011 | 25/07/2011 | 3
6 | Review 1st | 26/07/2011 | 26/07/2011 | 1
7 | Approval of project | 26/07/2011 | 27/07/2011 | 2
8 | Information gathering | 28/07/2011 | 10/09/2011 | 45
9 | Collecting required documents | 28/07/2011 | 06/08/2011 | 10
10 | Study of white papers and references | 07/08/2011 | 28/08/2011 | 22
11 | Estimation of resources required | 29/08/2011 | 10/09/2011 | 13
12 | Documentation and designing | 11/09/2011 | 03/10/2011 | 23
13 | UML, DFD, Activity Diagrams | 11/09/2011 | 03/10/2011 | 23
14 | Review 2nd | 04/10/2011 | 04/10/2011 | 1
15 | SRS generation | 05/10/2011 | 06/10/2011 | 2
16 | Partial Report generation | 06/10/2011 | 07/10/2011 | 2
17 | Understanding and study of simple image processing in Java | 08/10/2011 | 11/10/2011 | 4
18 | Study of AWT classes, Event Handling and their methods | 21/12/2011 | 27/12/2011 | 7
19 | Study of opening of images in Java | 28/12/2011 | 05/01/2012 | 9
20 | Understanding of how to use pixels instead of images | 06/01/2012 | 10/01/2012 | 5
21 | Study of pixel grabber and image producer concept | 11/01/2012 | 17/01/2012 | 7
22 | Study of conversion of image into pixel array | 18/01/2012 | 19/01/2012 | 2
23 | Study of erosion function | 20/01/2012 | 22/01/2012 | 3
24 | Implementation of erosion function | 23/01/2012 | 24/01/2012 | 2
25 | Study of dilation function | 25/01/2012 | 28/01/2012 | 4
26 | Implementation of dilation function | 29/01/2012 | 03/02/2012 | 6
27 | Preparation for review III | 04/02/2012 | 07/02/2012 | 4
28 | Study of opening function | 08/02/2012 | 10/02/2012 | 3
29 | Implementation of opening function | 11/02/2012 | 15/02/2012 | 5
30 | Study of closing function | 16/02/2012 | 18/02/2012 | 3
31 | Implementation of closing function | 19/02/2012 | 24/02/2012 | 6
32 | Study of CMO filter and PS filter | 25/02/2012 | 28/02/2012 | 4
33 | Implementation of PS filter | 01/03/2012 | 07/03/2012 | 7
34 | Study and implement log in form | 08/03/2012 | 10/03/2012 | 3
35 | Study and implementation of Thresholding | 11/03/2012 | 15/03/2012 | 5
36 | Implemented remaining GUI for other function | 16/03/2012 | 26/03/2012 | 11
37 | Review IV | 27/03/2012 | 30/03/2012 | 4

Design is essentially a blueprint; it acts as a bridge between the requirement specification and the final solution for satisfying the requirements. Based on the work-flow described above, we can draw the following conclusions for the software system that has to be developed:
1) The system needs to be web-based so that it allows the admin and clients to access the secure mail over the Internet.
2) Being a web-based system, it enables users to send e-mails to other users who are already registered. An added advantage is that, since e-mail is delivered instantly, the admin can respond instantly if any suspicious emails are detected.
3) The whole process depends on communication between the admin and the users. If all this communication is done through a web-based system, the time period for the whole process can be brought down considerably.
4) The system needs to store the details of all the users.
5) The system needs to store the details of all the information (sent mails, composed mails, etc.) held by all the users.
6) The system needs to store the details of all the requirements held by the different users.
7) Since it is a web-based system, login authorization should be provided so that the admin and users can look up and use the options that are specific to them.
8) The system should allow the users to enter their details.
9) The system should provide an option to generate a user report.
10) The system should provide an option to generate a blocked mails report.
11) The system should provide an option to generate a selected users report.

4.1 SYSTEM ARCHITECTURE


[Figure: block diagram with the stages Input Email, Pre-Processing, ALPACAS Framework, Feature Preserving Fingerprint and Privacy Preserving Collaboration Protocol]

Fig 4.1 System architecture

Fig 4.2 The standard text classification set up.


4.2 UML DIAGRAMS


4.2.1 USE CASE DIAGRAM
[Use case diagram. Actor: User Agent. Use cases: Input Mail, Remove tags, Get the feature list, Compute TFSet, Compute Score, Classify spam or ham.]


4.2.2 ACTIVITY DIAGRAM
[Activity diagram. Stages: PreProcessing, Feature Preserving Fingerprint, Privacy Preserving Collaboration Protocol. Activities: Input Mail, Read Mail, Remove tags, Get Content, Split content, Get Feature List, Compute TFset, Get Feature Set, Read Ham / Spam knowledgebase, Compute Score, Classify ham or spam.]


4.2.3 SEQUENCE DIAGRAM
[Sequence diagram. Participants: User, Preprocessing, Feature Extraction, Mail classification. Messages in order:]
1 : Input Mail()
2 : Read Mail()
3 : Remove Tag()
4 : Get Content()
5 : Mail Content()
6 : Get set of digest()
7 : Get Feature List()
8 : Compute TFSet()
9 : Feature Set()
10 : Read Ham/Spam knowledgebase()
11 : Compute Score()
12 : Classify the mail()
13 : Spam/ham mail()


5 TECHNICAL SPECIFICATION
5.1 ADVANTAGES
1) JavaScript can be used for client-side applications.
2) JavaScript provides a means to contain multi-frame windows for presentation of the web.
3) JavaScript provides basic data validation before data is sent to the server, e.g. login and password checking, checking whether the values entered are correct, or checking whether all fields in a form are filled in. This reduces network traffic.
4) It creates interactive forms and client-side lookup tables.

5.2 DISADVANTAGES
The server should be continuously running while executing the application.

5.3 APPLICATIONS
1) Suspicious email detection, to classify emails as normal or threatening.
2) Identification of fraudulent email and phishing schemes.

6 APPENDIX A

Definitions and Acronyms
Term or Acronym | Definition
Java Servlet Technology | Servlets are platform-independent, 100% pure Java server-side modules that fit seamlessly into a web server framework and can be used to extend the capabilities of a web server with minimal overhead, maintenance, and support.
JDBC (Java Database Connectivity) | Provides a uniform interface to a wide range of relational databases, and provides a common base on which higher-level tools and interfaces can be built.

Abbreviations:
JMS : Java Mail Service

