Lung Disease Prediction System Using Naive Bayes and K Means Clustering

LUNG DISEASE PREDICTION SYSTEM USING K-MEAN
CLUSTERING AND NAÏVE BAYES
Dissertation Part-I Progress Report

of
Master of Technology (Computer Science)
3rd Semester
SUBMITTED BY:
MD FARHAN HAIDER
M.Tech (CS)3rd Semester
Enrollment No.: A160025
UNDER THE SUPERVISION OF:
Mr. Abdul Wahid

Professor
Department of CS&IT, MANUU, Hyderabad
DEPARTMENT OF COMPUTER SCIENCE & IT

SCHOOL OF COMPUTER SCIENCE & INFORMATION TECHNOLOGY
MAULANA AZAD NATIONAL URDU UNIVERSITY, HYDERABAD
GACHIBOWLI, HYDERABAD - 500032, INDIA
MAULANA AZAD NATIONAL URDU UNIVERSITY
Gachibowli, Hyderabad-500032
Certificate
This is to certify that the Dissertation Part-1 Progress Report entitled “LUNG DISEASE PREDICTION
SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” submitted by MD FARHAN HAIDER bearing Roll No
A160025 in partial fulfillment of the requirements for the award of Master of Technology (CS) Degree during
2017-2019 at the Department of CS&IT is an authentic work carried out by him/her under my guidance and
supervision.
The results presented in this report have been verified and are found to be satisfactory. The results
embodied in this dissertation have not been submitted to any other University or Institute for the award of any
other degree or diploma.
Supervisor’s Signature
DRC Member’s Signature Head

Department of CS&IT
CANDIDATE’S DECLARATION
I hereby declare that the thesis work presented in this Dissertation Part-1 Progress Report entitled “LUNG
DISEASE PREDICTION SYSTEM USING K-MEAN CLUSTERING AND NAÏVE BAYES” towards the partial fulfillment
of the requirement for the award of the degree of Master of Technology (Computer Science) submitted in the
Department of CS&IT, Maulana Azad National Urdu University, Hyderabad, Telangana, India is an authentic
record of my own work carried out under the guidance of MR. ABDUL WAHID, Professor, Department of CS&IT,
Maulana Azad National Urdu University, Hyderabad (Telangana).
I have not submitted the matter embodied in this progress report for the award of any other degree or diploma
to any other University or Institute.
Date:
Place: MD FARHAN HAIDER
ACKNOWLEDGEMENT
I express my sincere gratitude towards my Supervisor MR. ABDUL WAHID, Professor, Department of CS&IT,
MANUU Hyderabad for consistently providing me with the required guidance to help me in the
timely and successful completion of this report.
I am deeply indebted to Coordinator MR. MOHAMMAD ISLAM, Department of CS&IT, MANUU for his
valuable suggestions and support. In spite of his extremely busy schedule in the Department, he was
always available to share with me his deep insights, wide knowledge and extensive experience.
Again I sincerely thank Professor Abdul Wahid, Dean School of Computer Science & Information Technology,
Dr. Pradeep Kumar, Head Department of Computer Science & Information Technology and all other faculty
members of our department for their valuable feedback during internal evaluations.
ABSTRACT
Data mining techniques started gaining popularity nearly three decades ago. Until
the last few years, however, the data mining approach had not been used in health
care organizations. Researchers have now started paying attention to this field
and have found that the health care sector possesses a very large volume of data,
but that this data is highly unorganized. If it is organized properly using data
mining techniques, it can be used for the prediction of various diseases.
I will develop a hybrid approach using two techniques, Naïve Bayes and the
K-means algorithm. Different parameters are considered for prediction of lung
disease. The system predicts lung disease from various patient attributes: it uses
the k-means algorithm to group the attributes and the naïve Bayes algorithm to
make the prediction.
TABLE OF CONTENTS
DESCRIPTION
Contents
List of Figures
List of Tables
1. Introduction
1.1) Introduction
1.2) K-means clustering
1.3) Naïve Bayes
1.4) K-Means – Naïve Bayes Hybrid
2. Objectives
2.1) Objectives
3. Literature Survey
3.1) Literature Survey
4. Proposed Method
4.1) Methodologies
4.2) Proposed Method
4.3) Performance Measurement
5. Time Table (Plan of Work)
5.1) Plan of Work
6. Tools
6.1) Tools
7. Tentative Outcomes
7.1) Tentative Outcomes
References
LIST OF FIGURES
Figure No. Name of the Figure
Figure 4.1 K-means clustering process
Figure 4.2 Taking dataset and preprocess
Figure 4.3 Clustering using k-means
Figure 4.4 Classification using Naïve Bayes
Figure 6.1 WEKA tool
LIST OF TABLES
Table No. Name of the Table
Table 3.1 Base Papers
Table 3.2 Research Papers using k-means clustering
Table 3.3 Research Papers using naïve Bayes
Table 3.4 Research Papers using WEKA
Table 5.1 Plan of Work
CHAPTER 1
INTRODUCTION
1.1) Introduction
Lung cancer is the leading cause of cancer-related death and is responsible for more than a quarter
of all deaths due to cancer in the United States. It accounts for 13-14% of all cancer diagnoses,
making it the second most commonly diagnosed malignancy in both men and women (not counting
skin cancers). Until the 20th century, however, lung cancer was a relatively rare disease. That
changed with the advent of wide-scale cigarette smoking, which remains the leading cause of lung
cancer today.
There are two main types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung
cancer (SCLC). The majority of lung cancer patients have NSCLC, which usually grows and
spreads more slowly and has a better 5-year overall survival rate than SCLC.
Lung cancer accounts for more deaths than any other cancer in both men and women. It has been
the fifth leading cause of death in the world over the past 10 years (World Health Organization
2016). According to the WHO (World Health Organization) report, lung cancer is the leading cause
of cancer death across the world, accounting for 1.58 million deaths, about 27% of all cancer
deaths. The death rate began declining in 1991 in men and in 2003 in women.
Early detection of lung cancer is essential in reducing loss of life. Early treatment, however,
requires the ability to detect lung cancer in its early stages, and early diagnosis requires an
accurate and reliable diagnostic procedure that allows physicians to distinguish benign lung
disease from malignant disease.
Health data is increasing rapidly around the world. It is so large and complex that processing it
with traditional data processing techniques is very difficult. Machine-learning techniques such as
k-nearest neighbors (KNN), support vector machines (SVM) and decision trees (DT) have therefore
been used. Tools such as Python (pandas) and Weka are widely used in the data analytics field.
The two main concepts that we will come across repeatedly throughout this work are:
• K-Means Clustering
• Naïve Bayes
1.2) K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms for solving the clustering
problem. The procedure is simple: it partitions a given data set into a chosen number k of
clusters. It first defines k centroids, one for each cluster. These should be placed as far away
from each other as possible. Each point of the data set is then assigned to its nearest centroid.
When no point is pending, the first grouping is done. We then re-calculate k new centroids as the
centers of the clusters resulting from the previous step. Once we have these k new centroids, a
new binding is made between the same data points and the nearest new centroid. This produces a
loop in which the k centroids change their location step by step until no more changes occur.
The advantages of the k-means clustering algorithm are simplicity and speed.

Algorithm:-
1) Select k initial centers from the data (at random).
2) Divide the data into k clusters by assigning each point to its nearest center.
3) Calculate the mean of each of the k clusters to find the new centers.
4) Repeat steps 2 and 3 until the centers do not change.
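The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in this work (which relies on WEKA): the function name `k_means`, the `seed` parameter and the sample points are hypothetical, and Euclidean distance is assumed.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-dimensional data with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(data, k=2)
print(centers)
```

With well-separated data such as this, the loop usually stops after two or three passes; on real data the result can depend on the random initial centers.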
1.3) Naïve Bayes
The Naïve Bayes classifier is based on Bayes' theorem with a strong (naïve) independence
assumption between the attributes; it is also known as an independent feature model. Naïve Bayes
is particularly suited to problems where the dimensionality of the inputs is high. Its output is a
probability for each class (predictable state), computed from the probability of each input
attribute value given that class.
Bayes theorem:-
P(H|X) = P(X|H) P(H) / P(X)

Where
P(H|X) is the posterior probability of H conditioned on X
P(X|H) is the likelihood, i.e. the probability of X conditioned on H
P(H) is the prior probability of H
P(X) is the prior probability of X
In this work, Naïve Bayes will predict whether or not the patient is likely to have lung
disease.
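A small worked example may help to make the formula concrete. The numbers below are hypothetical and chosen only for illustration (they are not taken from any dataset): H is "the patient has lung disease" and X is "the patient is a smoker". The helper name `posterior` is likewise an assumption.

```python
def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """Return P(H|X) from P(H), P(X|H) and P(X|not H) using Bayes' theorem."""
    # Total probability of the evidence:
    # P(X) = P(X|H) P(H) + P(X|not H) P(not H)
    p_x = p_x_given_h * prior_h + p_x_given_not_h * (1 - prior_h)
    return p_x_given_h * prior_h / p_x


# Hypothetical values: P(H) = 0.10, P(X|H) = 0.80, P(X|not H) = 0.20.
p = posterior(0.10, 0.80, 0.20)
print(round(p, 4))
```

Here P(X) = 0.08 + 0.18 = 0.26, so P(H|X) = 0.08 / 0.26, roughly 0.31: observing X raises the probability of H from the prior 0.10 to about 0.31.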
1.4) K-Means – Naïve Bayes Hybrid:
The hybrid of k-means clustering and naïve Bayes has been used for the prediction of some other
diseases and has been reported to produce better results than either simple approach alone.
The dataset obtained after applying the K-Means algorithm serves as the trained dataset, and the
values of a new record are compared against it. Bayes' theorem is then applied to obtain the
probability that the patient has lung disease.
• K-means clustering can handle massive data and cluster it efficiently and quickly.
• The Naive Bayes algorithm will be used as the classifier.
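One possible way to combine the two stages is sketched below. It is an assumption made for illustration, not the design fixed by this report: k-means turns a numeric attribute (age) into a cluster label, and that label is then used as an ordinary nominal attribute by a Naïve Bayes classifier with Laplace smoothing. All names (`cluster_1d`, `nearest`, `predict`) and all records are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_1d(values, centers, iters=20):
    """Tiny one-dimensional k-means; returns the final centers."""
    for _ in range(iters):
        groups = defaultdict(list)
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        centers = [sum(groups[i]) / len(groups[i]) if groups[i] else c
                   for i, c in enumerate(centers)]
    return centers


def nearest(value, centers):
    """Index of the center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


# Hypothetical records: (age, smoker, has_disease)
records = [(25, "no", "no"), (30, "no", "no"), (28, "yes", "no"),
           (62, "yes", "yes"), (70, "yes", "yes"), (66, "no", "yes"),
           (35, "yes", "no"), (58, "yes", "yes")]

# Stage 1: k-means replaces the numeric age by a cluster label.
centers = cluster_1d([r[0] for r in records], centers=[20.0, 80.0])
rows = [((nearest(age, centers), smoker), label) for age, smoker, label in records]

# Stage 2: Naive Bayes counts on the attributes (age cluster, smoker).
class_counts = Counter(label for _, label in rows)
feature_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
for features, label in rows:
    for j, value in enumerate(features):
        feature_counts[(label, j)][value] += 1


def predict(age, smoker):
    """Return the class with the highest (unnormalized) posterior."""
    features = (nearest(age, centers), smoker)
    scores = {}
    for label, n in class_counts.items():
        score = n / len(rows)  # prior P(class)
        for j, value in enumerate(features):
            # Laplace smoothing; each attribute here has two possible values.
            score *= (feature_counts[(label, j)][value] + 1) / (n + 2)
        scores[label] = score
    return max(scores, key=scores.get)


print(predict(65, "yes"), predict(27, "no"))
```

Other combinations are possible, for example training one classifier per cluster; the report does not commit to a particular one.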
CHAPTER 2
OBJECTIVES
2.1) Objectives
The objectives of this research are as follows:
• To study different disease prediction algorithms and review the literature.
• To study and analyze existing systems for lung disease and identify issues and challenges.
• To design a system for lung disease prediction based on patient data.
• To design a system that predicts lung disease more accurately than existing systems.
• To implement a system using hybrid algorithms for increased efficiency.
• To test and validate the proposed system.
Prediction of lung disease is a very complicated task, and at present it depends mainly on the
individual medical practitioner. If the knowledge of individual medical practitioners is combined
in one data set, it will be very useful for the younger generation of medical practitioners and
will ultimately help patients. As in earlier work on heart disease prediction, a hybrid approach
is used: the popular clustering technique ‘K-Means' is combined with the ‘Naïve Bayes' algorithm
as the classifier. Because it is a hybrid, this technique is expected to suit complex problems and
to produce results with good accuracy.
CHAPTER 3
LITERATURE SURVEY
3.1) Literature Survey:
A variety of research papers were studied and analyzed during the literature survey, covering
disease prediction methods that have been employed over the years using k-means clustering or
Naïve Bayes. The methodologies used in these studies and their findings are presented below.
Table 3.1: Base Papers
Table 3.2: Research Papers using k-means clustering
Table 3.3: Research Papers using naïve Bayes
Table 3.4: Research Papers using WEKA
[1] Data mining techniques are widely used for computation and for discovering patterns in large
data sets. The data mining approach was established by researchers in the mid-1990s, and it has
been observed to be a very important technique for extracting unknown patterns and vital
information from large data sets.
[2] Rucha Shinde et al. proposed a heart disease prediction system using naïve Bayes and k-means
clustering. K-means clustering is used to increase the efficiency of the output. The authors
report this as the most effective model for predicting patients with heart disease. The model
could answer complex queries, each technique having its own strength with respect to ease of model
interpretation, access to detailed information and accuracy.
[3] Priyanka D proposed a system that implements the K-Means clustering algorithm. It performs a
certain number of iterations from random starting points, assigning each observation to the
nearest of the k clusters, in order to achieve a fast run time and stable, accurate results. The
research uses compactness and connectedness as complementary measures of accuracy and finds,
through a software prototype, that the efficiency and effectiveness of the method for predicting
heart disease are better than those of the other three techniques considered.
[4] The main aim of this study is to develop a prototype Health Care Prediction System using
Naive Bayes. The system discovers and extracts hidden knowledge related to diseases (heart attack,
cancer and diabetes) from a historical heart disease database. It can answer complicated queries
for diagnosing disease and so assist health care practitioners in making intelligent clinical
decisions, which traditional decision support systems cannot. By providing effective treatments,
it also helps to reduce treatment costs. It further aims to enhance visualization and ease of
interpretation.
[5] Some implementations of K-means only allow numerical values for attributes. In that case, it
may be necessary to convert the data set into the standard spreadsheet format and convert
categorical attributes to binary. It may also be necessary to normalize the values of attributes
that are measured on substantially different scales (e.g., "age" and "income"). While WEKA
provides filters to accomplish all of these preprocessing tasks, they are not necessary for
clustering in WEKA, because the WEKA SimpleKMeans algorithm automatically handles a mixture of
categorical and numerical attributes. SimpleKMeans uses the Euclidean distance measure to compute
distances between instances and clusters.
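The preprocessing described in [5] can be illustrated with a short sketch. It shows what "converting categorical attributes to binary" and normalizing differently scaled attributes mean before a Euclidean distance is computed. The attribute ranges, the helper names `one_hot`, `min_max` and `encode`, and the two sample records are assumptions made for this example; they do not describe how WEKA implements these steps internally.

```python
import math


def one_hot(value, categories):
    """Encode a categorical value as a list of 0/1 indicators."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, lo, hi):
    """Rescale a numeric value to the range [0, 1]."""
    return (value - lo) / (hi - lo)


def encode(age, income, smoker):
    # Hypothetical attribute ranges, chosen only for this illustration.
    numeric = [min_max(age, 20, 80), min_max(income, 10_000, 100_000)]
    return numeric + one_hot(smoker, ["no", "yes"])


a = encode(50, 55_000, "yes")
b = encode(50, 55_000, "no")
distance = math.dist(a, b)  # Euclidean distance between the two records
print(a, b, distance)
```

Without the rescaling step, a difference of a few thousand in "income" would dominate the distance and a difference in "age" would hardly count.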
CHAPTER 4
PROPOSED METHOD
4.1) Methodologies
The methodology of our proposed system is based on the combination of the k-means clustering and
naïve Bayes algorithms, applied in two stages:
i) Clustering
ii) Classification
• Data related to lung diseases will be analyzed for data mining through Weka.
• K-means clustering and naïve Bayes techniques will be used.
• K-means clustering can handle massive data and cluster it efficiently and quickly.
• A simple and straightforward iterative method will be used to partition the data set into
k clusters.
• The Naive Bayes algorithm will be used as the classification algorithm.
Figure 4.1: K-means clustering process
4.2) Proposed Method
First, I will preprocess the data, because data in the real world is dirty: incomplete, noisy and
inconsistent. It is incomplete when it lacks attribute values, lacks attributes of interest or
contains only aggregate values; noisy when it contains errors or outliers; and inconsistent when
it contains discrepancies in names or codes. A clustering algorithm is then applied to the
dataset, and after clustering we use classification for prediction.
Data preprocessing steps in Weka:

First, run the Weka software, launch the Explorer window and select the “Preprocess” tab. Then
open the lung dataset and note what information is available about it (e.g. the number of
instances, attributes and classes). What type of attributes does this dataset contain (nominal or
numeric)? What are the classes in this dataset? Which attribute has the greatest standard
deviation? What does this tell you about that attribute? After loading the data set, under
“Filter”, choose the Standardize filter and apply it to all attributes. What does it do? How does
it affect the attributes’ statistics? Click “Undo” to restore the data, and now choose the
“Normalize” filter and apply it to all the attributes. What does it do? How does it affect the
attributes’ statistics? How does it differ from “Standardize”? Click “Undo” again to return the
data to its original state. At the bottom right of the window there should be a graph that
visualizes the dataset. Making sure “Class: class (Nom)” is selected in the drop-down box, click
“Visualize All”. What can you interpret from these graphs? Which attribute(s) discriminate best
between the classes in the dataset? How do the Standardize and Normalize filters affect these
graphs? Under “Filter”, choose the Attribute Selection filter. What does it do? Are the attributes
it selects the same as the ones you chose as discriminatory above? How does its behavior change as
you alter its parameters?
Figure 4.2: Taking dataset and preprocess
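The difference between the two filters named above can be seen on a single numeric column. The sketch below uses the usual textbook definitions: normalization maps the values onto [0, 1], and standardization gives zero mean and unit standard deviation. The population standard deviation is used here as an assumption; whether WEKA divides by n or n-1 has not been checked for this report. The column `ages` is hypothetical.

```python
import statistics


def normalize(column):
    """Min-max scaling: map the column onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]


def standardize(column):
    """Z-score scaling: zero mean and unit standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumption)
    return [(x - mean) / std for x in column]


ages = [30, 40, 50, 60, 70]  # hypothetical attribute values
normalized = normalize(ages)
standardized = standardize(ages)
print(normalized)
print(standardized)
```

Normalized values always lie between 0 and 1, while standardized values are centered on 0 and can be negative; neither changes the ordering of the records.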
Clustering in WEKA:
Clustering divides the records in a database into different groups. Records in the same group
have similar properties: the differences between groups should be as large as possible, and the
differences within a group should be as small as possible. There are no predefined classes, which
is why clustering comes under unsupervised learning.
Steps involved in WEKA

Load the data file browsers .arff into WEKA using the same steps we used to load data into the
Preprocess tab. Take a few minutes to look around the data in this tab: look at the columns, the
attribute data, the distribution of the columns, etc. With this data set we are looking to create
clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and
select a technique from the choices that appear.
Figure 4.3: Clustering by using k-means
Classification in WEKA:
Classification is the process of finding a set of models that describe and distinguish data
classes and concepts, so that the model can be used to predict the class of objects whose label is
unknown. Classification is a two-step process.
The first step is model construction, which describes a set of predetermined classes. A
classification model is built using training data in which every object is pre-classified, i.e.
its class label is known: each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute. The model is represented as classification rules,
decision trees, or mathematical formulae.
The second step is model usage, for classifying future or unknown objects. The accuracy of the
model is estimated first: the model generated in the preceding step is tested by assigning class
labels to the data objects in a test data set, and the known label of each test sample is compared
with the classified result from the model. The accuracy rate is the percentage of test set samples
that are correctly classified by the model. The test set must be independent of the training set;
otherwise over-fitting will make the estimate too optimistic. If the accuracy is acceptable, the
model is used to classify data tuples whose class labels are not known.
Steps involved in WEKA

Basically there are four steps involved in WEKA for classification.
• Preparing the data
• Choosing Classify and applying an algorithm
• Generating trees
• Analysing the result or output
First, prepare and load the data, which should be in .arff format. After loading the data, choose
the Classify tab, then choose the classification algorithm and generate the trees.
Figure 4.4: Classification by using naïve bayes
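Since the data has to be supplied in .arff format, the following sketch shows how a small table of records could be written out as ARFF text: an `@relation` line, one `@attribute` line per column, and the rows after `@data`. The helper `to_arff`, the relation name `lung` and the attribute names are hypothetical; the actual attributes of the lung dataset are not specified in this report.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    `attributes` is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical attributes and records, for illustration only.
arff_text = to_arff(
    "lung",
    [("age", "numeric"), ("smoker", ["no", "yes"]), ("class", ["no", "yes"])],
    [(62, "yes", "yes"), (30, "no", "no")],
)
print(arff_text)
```

The resulting text can be saved with a .arff extension and opened from the Preprocess tab.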
4.3) Performance Measurement
After the training process is completed, the system will be tested for its performance. The
testing will be done on the test dataset, a part of the dataset that has not been used for
training. The ratio of training dataset to testing dataset can be 75:25, 65:35, 60:40 and so on.
The ratio is decided on the basis of the size of the available dataset, so that enough data is
available both for training the system and for testing it.
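A 75:25 split of the kind described above can be sketched as follows. The function name `train_test_split`, the fixed `seed` and the stand-in records are assumptions for illustration; the records are shuffled first so that the test part is not simply the tail of the file.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=42):
    """Shuffle a copy of `rows` and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 patient records
train, test = train_test_split(records, train_fraction=0.75)
print(len(train), len(test))
```

Changing `train_fraction` to 0.65 or 0.60 gives the other ratios mentioned in the text.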
The performance of the proposed method will then be measured using a confusion matrix. As both
the input data required and the output of the system are discrete, a confusion matrix is a
suitable choice for evaluating the final performance of our system. The final performance will be
measured by comparing the total number of True Positives and True Negatives with the total number
of False Positives and False Negatives as predicted by the system, giving a clear idea of the
performance of the system.
• True Positive (TP): Observation is positive, and is predicted to be positive.
• False Negative (FN): Observation is positive, but is predicted negative.
• True Negative (TN): Observation is negative, and is predicted to be negative.
• False Positive (FP): Observation is negative, but is predicted positive.
Classification rate or accuracy is given by the relation:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
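The four counts and the accuracy relation can be computed directly from a list of actual and predicted labels. The labels below are hypothetical and serve only to show the arithmetic; the function names `confusion_counts` and `accuracy` are likewise assumptions.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FN, TN, FP) for a two-class problem."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive observation, predicted positive
        elif a == positive:
            fn += 1  # positive observation, predicted negative
        elif p == positive:
            fp += 1  # negative observation, predicted positive
        else:
            tn += 1  # negative observation, predicted negative
    return tp, fn, tn, fp


def accuracy(tp, fn, tn, fp):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test-set labels.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
predicted = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]
tp, fn, tn, fp = confusion_counts(actual, predicted)
acc = accuracy(tp, fn, tn, fp)
print(tp, fn, tn, fp, acc)
```

For these eight samples TP = 3, FN = 1, TN = 3 and FP = 1, so the accuracy is 6/8 = 0.75.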
CHAPTER 5
TIMETABLE
(PLAN OF WORK)
5.1) Plan of Work:
Table 5.1: Plan of Work
Academic Calendar | Activity | Status
Week 1-2 | Literature Searching | Done
Week 3-12 | Literature Survey and Review | Done
Week 12-17 | Start work on the first draft. Aim to complete chapter 1. | Done
Week 18 | Submit draft of chapter 1 to the supervisor. | In Progress
Week 18-28 | Work on the first draft of the remaining chapters. | Pending
Week 29 | Submit the first draft to the supervisor. Receive feedback on previous work. | Pending
Week 30 | Receive feedback on the first draft of the main chapters. | Pending
CHAPTER 6
TOOLS
6.1) Tools
The WEKA tool will be used to implement:
• K-means for clustering
• Naïve Bayes for classification
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. It is
also well-suited for developing new machine learning schemes.
WEKA contains “clusterers” for finding groups of similar instances in a dataset. The clustering
schemes available in WEKA are k-Means, EM, Cobweb, X-means and FarthestFirst. Clusters can be
visualized and compared to “true” clusters (if given). Evaluation is based on log-likelihood if
the clustering scheme produces a probability distribution.
In the ‘Clusterer’ box click on the ‘Choose' button. In the pull-down menu select WEKA →
Clusterers, and select the cluster scheme ‘SimpleKMeans’. Some implementations of K-means only
allow numerical values for attributes, but SimpleKMeans automatically handles a mixture of
categorical and numerical attributes; therefore, we do not need to use a filter.
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning
schemes available in WEKA include decision trees and lists, instance-based classifiers, support
vector machines, multi-layer perceptrons, logistic regression, and Bayes' nets. “Meta”- classifiers
include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
Figure 6.1: WEKA Tools
CHAPTER 7
TENTATIVE OUTCOMES
7.1) Tentative Outcomes
The proposed system aims to use a hybrid of K-Means and NB for lung disease prediction that is
more efficient, can be trained more easily and in less time than the existing simple algorithms,
and gives better results. The results from the proposed system are expected to be more precise and
more accurate than those produced by the existing simple algorithms for lung disease.
This research aims to extend the capabilities of K-Means clustering with the help of NB, by taking
the clusters produced by K-Means and classifying them for the lung disease problem.
The primary outcomes that are expected from the proposed system are as follows:
• A lung disease prediction system will be developed by combining the Naïve Bayes and K-Means
algorithms.
• Weka tools will be used to reduce the execution time of the algorithms.
• The prediction system may be faster, less computationally expensive and more time efficient,
and may produce more accurate results.
• The proposed system will help doctors to efficiently predict lung diseases in the initial
stages for better treatment.
REFERENCES
World Health Organization (2011) The top ten causes of death.
World Health Organization (2013) Deaths from coronary heart disease.
[1] P. V. Maral, “Heart Disease Prediction Using Naive Bayes and K-Means Techniques,”
Novat. Publ. Int. J. Res. Publ. Eng. Technol., vol. 3, no. 6, pp. 2454–7875, 2017.
[2] R. Shinde, S. Arjun, P. Patil, and P. J. Waghmare, “An Intelligent Heart Disease
Prediction System Using K-Means Clustering and Naïve Bayes Algorithm,” Int. J. Comput.
Sci. Inf. Technol., vol. 6, no. 1, pp. 637–639, 2015.
[3] D. Priyanka and M. S. S. Banu, “Prediction on Lung Disease Using K means
Algorithm,” vol. 1, no. 11, pp. 239–242, 2015.
[4] G. Singh, K. Bagwe, S. Shanbhag, S. Singh, and S. Devi, “Heart disease prediction
using Naïve Bayes,” Int. Res. J. Eng. Technol., vol. 4, no. 3, pp. 1–3, 2017.
[5] S. Jain, M. Aalam, and M. Doja, “K-means clustering using weka interface,” Proc.
4th Natl. Conf., 2010.
[6] W. Zhang and F. Gao, “An improvement to naive bayes for text classification,”
Procedia Eng., vol. 15, pp. 2160–2164, 2011.
[7] K. Vanitha and G. R. L. Rani, “Analysis of Classification and Clustering
Algorithms using Weka For Banking Data,” no. 0976, pp. 104–107.
[8] S. Singhal and M. Jena, “W-06. Study on WEKA Tool for Data Preprocessing,
Classification and Clustering,” India - WEKA, vol. 2, no. 6, pp. 250–253, 2013.
[9] P. Ramachandran, N. Girija, and T. Bhuvaneswari, “Early
Detection and Prevention of Cancer using Data Mining Techniques,” Int. J. Comput.
Appl., vol. 97, no. 13, pp. 975–8887, 2014.
[10] S. Vijiyarani and S. Sudha, “Disease Prediction in Data
Mining Technique – A Survey,” Int. J. Comput. Appl. Inf. Technol., vol. II, no. I, pp.
2278–7720, 2013.
[11] T. Karthikeyan and P. Thangaraju, “PCA-NB Algorithm to Enhance the
Predictive Accuracy,” vol. 6, no. 1, pp. 381–387, 2014.
[12] D. Kavinya, “Lung Disease Classification Using Support Vector Machine,” vol.
3, no. 3, pp. 84–86, 2015.
[13] A. Trivedi, “Evaluation of Student Classification Based On Decision
Tree,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 4, no. 2, pp. 111–112, 2014.
[14] U. Sharma, “Suitability of neural network for disease prediction: a
comprehensive literature review,” vol. 5, no. 6, pp. 12–20, 2017.
[15] C. H. Chen, W. T. Huang, T. H. Tan, C. C. Chang, and Y. J. Chang, “Using K-
nearest neighbor classification to diagnose abnormal lung sounds,” Sensors
(Switzerland), vol. 15, no. 6, pp. 13132–13158, 2015.
[16] M. Makinaci, “Support vector machine approach for classification of cancerous
prostate regions,” Int. Enformatika Conf., vol. 1, no. 7, pp. 166–169, 2005.
[17] A. Kumar, M. Kamaleshwar, S. K. K, S. K. R. S, and J. Arunnehru, “An
Improved Disease Prediction System Using Machine Learning,” no. 4, 2018.
[18] P. Mirajkar and A. Pradesh, “An Integrated Cancer Prediction System Using
Data Mining Techniques,” vol. 3, no. 1, pp. 1497–1501, 2018.
[19] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary, “A Lung
Cancer Outcome Calculator Using Ensemble Data Mining on SEER Data Categories
and Subject Descriptors,” Kdd, 2011.
[20] R. Ada, & Kaur, “A Study of Detection of Lung Cancer Using Data Mining
Classification Techniques,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 3, pp.
131–134, 2013.
[21] B. Sciences, B. G. Krishna, and A. Pradesh, “a Predictive Model for Heart
Disease Using clustering techniques,” vol. 8, no. 3, pp. 529–534, 2017.
[22] V. Krishnaiah, G. Narsimha, and N. Subhash Chandra, “Diagnosis of Lung Cancer
Prediction System Using Data Mining Classification Techniques,” Int. J. Comput. Sci.
Inf. Technol., vol. 4, no. 1, pp. 39–45, 2013.
[23] Nur Hafieza Ismail, Fadhilah Ahmad, Azwa Abdul Aziz, “Implementing WEKA as a
data mining tool to analyze students academic performance using naïve Bayes classifier”,
