International Journal of Computational Intelligence and Information Security, May 2013, Vol. 4 No. 5, ISSN: 1837-7823

An Implementation of Data Mining Techniques in Health Care Industries


R. Anand¹, Dr. S.K. Srivatsa²

¹ Research Scholar, Sri Chandra Sekarendra Viswa Maha Vidyalaya, Enathur, Kanchipuram-531 602.
² Sr. Professor, St. Joseph College of Engineering, Chennai-600 119.

Abstract
Data mining is an emerging field. Every day the health care industry produces a huge volume of data, and finding the right data for the right purpose is a tedious task. In other words, the industry is rich in data but poor in its utilization. Health care organizations today generate large amounts of complex data about patients, hospital resources, disease diagnosis, electronic patient records, medical devices, and so on. This data is a key resource to be processed and analyzed for knowledge extraction that supports cost savings and decision making. Data mining brings a set of tools and techniques that can be applied to the processed data to discover hidden patterns, giving health care professionals an additional source of knowledge for making decisions. With continuous advances in technology, an increasing number of clinicians use electronic medical records to accumulate substantial amounts of data about their patients, along with the associated clinical conditions and treatment details. The hidden relationships and patterns within this information could further our medical knowledge, including its efficiencies and deficiencies. Methodologies that are being used with increasing effectiveness in parallel industries need to be modified and applied to discover this knowledge. In this paper we discuss how data mining can be used in the medical field, how it addresses business issues, the challenges in data mining, data mining techniques, and the SEMMA methodology.

Keywords: Data Mining, Health Care, Neural Networks, Neuro Fuzzy, Decision Tree algorithm

1. Introduction
Data mining can be defined as the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. Alternatively, it can be defined as the process of data selection and exploration and of building models using vast data stores to uncover previously unknown patterns. Data mining is not new: it has been used intensively and extensively by financial institutions, for credit scoring and fraud detection; marketers, for direct marketing and cross-selling or up-selling; retailers, for market segmentation and store layout; and manufacturers, for quality control and maintenance scheduling. In healthcare, data mining is becoming increasingly popular, if not increasingly essential. Several factors have motivated the use of data mining applications in healthcare. The existence of medical insurance fraud and abuse, for example, has led many healthcare insurers to attempt to reduce their losses by using data mining tools to help them find and track offenders. Fraud detection using data mining applications is prevalent in the commercial world, for example in the detection of fraudulent credit card transactions. Recently, there have been reports of successful data mining applications in healthcare fraud and abuse detection. Another factor is that the huge amounts of data generated by healthcare transactions are too complex and voluminous to be processed and analyzed by traditional methods. Data mining can improve decision making by discovering patterns and trends in large amounts of complex data. Such analysis has become increasingly essential as financial pressures have heightened the need for healthcare organizations to make decisions based on the analysis of clinical and financial data. Insights gained from data mining can influence cost, revenue, and operating efficiency while maintaining a high level of care.

2. Data Mining in Medical Data Bases


Data mining is an essential step of knowledge discovery. In recent years it has attracted a great deal of interest in the information industry. The knowledge discovery process consists of an iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition, and knowledge presentation. In particular, data mining may accomplish class description, association, classification, clustering, prediction, and time series analysis. In contrast to traditional data analysis, data mining is discovery driven. It is a young interdisciplinary field closely connected to data warehousing, statistics, machine learning, neural networks, and inductive logic programming.

Fig 1 Data Base Architecture


Data mining provides automatic pattern recognition and attempts to uncover patterns in data that are difficult to detect with traditional statistical methods. Without data mining it is difficult to realize the full potential of the data collected within a healthcare organization, because the data under analysis is massive, highly dimensional, distributed, and uncertain. Massive healthcare data needs to be converted into information and knowledge that can help control costs and maintain a high quality of patient care. Healthcare data includes patient-centric data and aggregate data. For a healthcare organization to succeed, it must be able to capture, store, and analyze data (Fig. 1). Online analytical processing (OLAP) provides one way for data to be analyzed in a multi-dimensional capacity. With the adoption of data warehousing and data analysis/OLAP tools, an organization can make strides in leveraging data for better decision making. Many healthcare organizations struggle to use the data collected through their online transaction processing (OLTP) systems, which are not integrated for decision making and pattern analysis. For a healthcare organization to be successful, it is important to empower management and staff with data warehousing based on critical thinking and knowledge management tools for strategic decision making.

Data warehousing can be supported by decision support tools such as data marts, OLAP, and data mining tools. A data mart is a subset of a data warehouse that focuses on selected subjects. An OLAP solution provides a multi-dimensional view of the data found in relational databases. Although the data is stored in a two-dimensional format, OLAP makes it possible to analyze potentially large amounts of data with very fast response times, and it lets users move through the data, drilling down or rolling up through the various dimensions defined by the data structure.
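The roll-up and drill-down operations described above can be illustrated with a minimal Python sketch. It is not an OLAP engine: the `ADMISSIONS` rows, the dimension names, and the `aggregate` helper are all hypothetical and the figures are made up. The sketch only shows how the same two-dimensional rows can be summarized at a coarser or finer level.

```python
from collections import defaultdict

# Hypothetical fact rows: (department, year, quarter, treatment_cost).
# All values are invented for illustration only.
ADMISSIONS = [
    ("Cardiology", 2012, "Q1", 1200.0),
    ("Cardiology", 2012, "Q2", 800.0),
    ("Cardiology", 2013, "Q1", 1500.0),
    ("Oncology", 2012, "Q1", 2000.0),
    ("Oncology", 2013, "Q1", 2500.0),
]

# Names of the dimension columns, in row order (the last column is the measure).
DIMENSIONS = ("department", "year", "quarter")


def aggregate(rows, group_by):
    """Sum the cost measure over the chosen dimensions."""
    idx = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# Roll up: total cost per department only.
by_department = aggregate(ADMISSIONS, ["department"])

# Drill down: add the year dimension for a finer-grained view.
by_department_year = aggregate(ADMISSIONS, ["department", "year"])

print(by_department)
print(by_department_year)
```

A real OLAP tool would typically precompute or index such aggregates to achieve fast response times; here they are recomputed on each call for simplicity.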

3. Solving Business Issues


The SAS data mining solution has been used in health care to overcome a wide range of business issues and problems. Some of these include:

- Segmenting customers/patients accurately into groups with similar health patterns.
- Rapidly identifying the most profitable customers and the underlying reasons.
- Understanding why customers leave for competitors (attrition, churn analysis).
- Planning for effective information systems management.
- Preparing for demand of resources.
- Anticipating customers'/patients' future actions, given their history and characteristics.
- Predicting medical diagnosis.
- Forecasting treatment costs.
- Predicting length of stay in a hospital.
- Identifying medical procedure expenditures and utilization by analyzing claims and point-of-care data.
- Predicting total cost of patient care.

Once the business problems have been defined and agreed upon, the next logical step is to determine the type and amount of data that will be necessary for making business decisions. As a precursor to data mining, a data warehouse strategy and implementation is suggested. Integration with SAS software gives the SAS data mining solution several distinguishing characteristics that allow faster, easier, and more accurate conversion of data into knowledge useful to decision makers.

Data diversity: The SAS data mining solution is designed to accept a wider range of data formats than any other data mining product currently on the market. It will accept data from relational and hierarchical databases, flat files, and other data formats, and it will accept this data from all major hardware platforms.

Distributed client/server: The SAS data mining solution supports both the data server model of client/server computing, in which data located on a remote machine can be accessed, and the compute server model, which allows data to be processed on a remote server and then forwarded to a client. This is particularly well suited to analytical tasks involving large volumes of data that require superior processing capabilities.

Consistent implementation on multiple platforms: The SAS data mining solutions are fully integrated with SAS software and give users the flexibility to use their platforms of choice, ranging from desktop machines to powerful servers.

Integrated data management: SAS software's data management facilities guarantee data integrity without the need for re-keying or additional validation of data. No other data mining solution includes seamless integration with such a comprehensive range of data management functionality.

Once the business objectives and data issues have been resolved, the methodology and approach to data mining can begin.

4. Data Mining Challenges


Data mining faces the following challenges:

- Efficiency and scalability of data mining algorithms
- Parallel, distributed, stream, and incremental mining methods
- Handling high dimensionality
- Handling noise, uncertainty, and incompleteness of data
- Incorporation of constraints, expert knowledge, and background knowledge in data mining
- Pattern evaluation and knowledge integration
- Mining diverse and heterogeneous kinds of data, e.g., bioinformatics, Web, software/system engineering, information networks
- Application-oriented and domain-specific data mining
- Invisible data mining (embedded in other functional modules)
- Protection of security, integrity, and privacy in data mining

5. SEMMA Methodology
The methodology and approach that SAS Institute proposes is referred to as SEMMA, which stands for Sample, Explore, Modify, Model, and Assess. Beginning with a statistically representative sample of data, users can apply exploratory statistical and visualization techniques, select and transform the most significant predictive variables, model the variables to predict outcomes, and affirm the model's accuracy.

Sample: The first step is to extract a portion of a large data set big enough to contain the significant information yet small enough to manipulate quickly.

Explore: This phase involves searching speculatively for unanticipated trends and anomalies so as to gain understanding and ideas. This can reveal which subset of attributes will be the most productive to work with in the modeling phase. Data visualization delivers intuitive tools for business professionals, while statistical techniques offer added detail for specialists.

Modify: The insights gained from the exploration phase enable knowledge workers to group the most productive subsets and clusters of data together for further analysis and exploration.

Model: This process involves searching automatically for a variable combination that reliably predicts a desired outcome. Data mining techniques such as neural networks, tree-based models, and traditional statistical techniques can help reveal patterns in the data and provide a best-fitting predictive model.

Assess: During this evaluation process, assessment of the results gained from modeling indicates which results should be conveyed to senior management and how to model new questions raised by the previous results, which leads back to the exploration phase.

According to SAS Institute, SEMMA distinguishes it as the only vendor that can offer all of these components, as well as the ability to integrate them seamlessly with a company's existing hardware and software strategy.
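The five SEMMA phases can be illustrated with a minimal Python sketch. It is a stand-in under stated assumptions, not SAS software: the `DATA` records (a hypothetical numeric predictor with a 0/1 outcome), the function names, and the simple threshold model are all invented for illustration. The threshold model replaces the neural networks and tree-based models mentioned in the text.

```python
import random
import statistics

# Hypothetical records: (predictor_value, outcome), outcome is 0 or 1.
# Ten records per class, with distinct predictor values.
DATA = [(float(d), 0) for d in range(1, 11)] + [(float(d), 1) for d in range(11, 21)]


def sample(records, k, seed=0):
    """Sample: draw a random subset small enough to work with quickly."""
    return random.Random(seed).sample(records, k)


def explore(records):
    """Explore: compute the mean predictor value per outcome class."""
    by_label = {}
    for value, label in records:
        by_label.setdefault(label, []).append(value)
    return {label: statistics.mean(vals) for label, vals in by_label.items()}


def modify(records):
    """Modify: rescale the predictor to [0, 1] (assumes values are not all equal)."""
    values = [v for v, _ in records]
    lo, hi = min(values), max(values)
    return [((v - lo) / (hi - lo), label) for v, label in records]


def model(records):
    """Model: place a threshold midway between the two class means."""
    means = explore(records)
    return (means[0] + means[1]) / 2


def assess(records, threshold):
    """Assess: fraction of records that the threshold classifies correctly."""
    hits = sum((v > threshold) == bool(label) for v, label in records)
    return hits / len(records)


# Sampling 12 of 20 records guarantees that both classes are present.
subset = sample(DATA, 12)
print(explore(subset))
prepared = modify(subset)
threshold = model(prepared)
accuracy = assess(prepared, threshold)
print(threshold, accuracy)
```

In practice the assessment would be made on data held out from modeling, and a poor result would send the analyst back to the exploration phase, as the Assess step describes.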

Fig. 2 SEMMA Methodology

6. Data Mining Techniques


There are various data mining techniques available, and their suitability depends on the domain application. Statistics provides a strong fundamental background for the quantification and evaluation of results. However, algorithms based on statistics need to be modified and scaled before they are applied to data mining. We now describe a few classification techniques, with illustrations of their application to healthcare.

Rule set classifiers: Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. An alternative formalism consists of a list of rules of the form "if A and B and C and ... then class X", where the rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

IF conditions THEN conclusion

This kind of rule consists of two parts. The rule antecedent (the IF part) contains one or more conditions on the values of predictor attributes, whereas the rule consequent (the THEN part) contains a prediction about the value of a goal attribute. An accurate prediction of the value of a goal attribute improves the decision-making process. IF-THEN prediction rules are very popular in data mining; they represent discovered knowledge at a high level of abstraction. In the health care system they can be applied as follows:

(Symptoms) (Previous history) ----> (Cause of disease)

Example 1: IF-THEN rule induced in the diagnosis of the level of alcohol in blood

IF Sex = MALE AND Unit = 8.9 AND Meal = FULL THEN Diagnosis = Blood_alcohol_content_HIGH.

Decision tree algorithms: Decision tree algorithms include CART (Classification and Regression Tree) and ID3 (Iterative Dichotomiser 3). These algorithms differ in the selection of splits, in when to stop a node from splitting, and in the assignment of a class to a non-split node. CART uses the Gini index to measure the impurity of a partition or set of training tuples. It can handle high-dimensional categorical data. Decision trees can also handle continuous data (as in regression), although some algorithms, such as ID3, require it to be converted to categorical data first.

Neural network architecture: The architecture of the neural network used in this study is a multilayered feed-forward network with 20 input nodes, 10 hidden nodes, and 10 output nodes. The number of input nodes is determined by the finalized data, the number of hidden nodes is determined through trial and error, and the output nodes represent a range showing the disease classification. The most widely used neural network learning method is the backpropagation (BP) algorithm. Learning in a neural network involves modifying the weights and biases of the network in order to minimize a cost function. The cost function always includes an error term, a measure of how close the network's predictions are to the class labels for the examples in the training set. Additionally, it may include a complexity term that reflects a prior distribution over the values that the parameters can take. Neural networks have been proposed as useful tools in decision making in a variety of medical applications. Neural networks will never replace human experts, but they can help in screening and can be used by experts to double-check their diagnosis. In general, the results of a disease classification or prediction task are true only with a certain probability.

Neuro-Fuzzy: A stochastic backpropagation algorithm is used to construct the fuzzy-based neural network.
The steps involved in the algorithm are as follows. First, initialize the weights of the connections with random values. Second, for each unit compute the net input value, the output value, and the error rate. Third, to handle uncertainty, calculate a certainty measure (c) for each node; the decision is made based on this measure. The level of certainty is determined using the following conditions:

a. If 0.8 < c ≤ 1, then there exists very high certainty
b. If 0.6 < c ≤ 0.8, then there exists high certainty
c. If 0.4 < c ≤ 0.6, then there exists average certainty
d. If 0.1 < c ≤ 0.4, then there exists low certainty
e. If c ≤ 0.1, then there exists very low certainty

The network constructed consists of three layers: an input layer, a hidden layer, and an output layer. A sample trained neural network consisting of 9 input nodes, 3 hidden nodes, and 1 output node is shown in Figure 2. When a thrombus or blood clot occupies more than 75% of the surface area of the lumen of an artery, the expected result may be a prediction of cell death or heart disease according to medical guidelines, i.e., R is generated with reference to the given set of input data.
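The first-match behaviour of the rule set classifier described earlier in this section can be sketched in a few lines of Python. The first rule is the one from Example 1. The second rule, the `DEFAULT_CLASS` value, and the `classify` helper are hypothetical and serve only to show rule ordering and the default class.

```python
# Each rule is (conditions, predicted_class); conditions map an attribute
# name to the value it must have for the rule to fire.
RULES = [
    # Rule taken from Example 1 in the text.
    ({"Sex": "MALE", "Unit": 8.9, "Meal": "FULL"}, "Blood_alcohol_content_HIGH"),
    # Hypothetical second rule, present only to illustrate ordering.
    ({"Meal": "NONE"}, "Hypothetical_other_class"),
]

# Hypothetical default class, used when no rule is satisfied.
DEFAULT_CLASS = "UNKNOWN"


def classify(case, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, predicted in rules:
        if all(case.get(attr) == value for attr, value in conditions.items()):
            return predicted
    return default


print(classify({"Sex": "MALE", "Unit": 8.9, "Meal": "FULL"}))
print(classify({"Sex": "FEMALE", "Unit": 2.0, "Meal": "FULL"}))
```

Because the rules are tried in order, placing a more specific rule before a more general one changes which class a case receives.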
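The Gini index that CART uses to measure impurity, mentioned under decision tree algorithms, is commonly defined as one minus the sum of squared class proportions. The following sketch uses that standard definition; the function names and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Impurity of a binary split, weighted by partition size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy labels for illustration only.
print(gini(["sick", "sick", "healthy", "healthy"]))
print(gini_split(["sick", "sick"], ["healthy", "healthy"]))
```

A tree builder in this style would evaluate candidate splits and prefer the one with the lowest weighted impurity.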
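The certainty levels used in the neuro-fuzzy algorithm can be expressed as a small lookup function. This sketch makes two assumptions that the text does not state explicitly: the upper bound of each interval is inclusive, and c lies in the range 0 to 1. The function name and the short labels are hypothetical.

```python
def certainty_level(c):
    """Map a certainty measure c to a qualitative level."""
    # Assumption: c is restricted to [0, 1].
    if not 0.0 <= c <= 1.0:
        raise ValueError("certainty measure must lie in [0, 1]")
    # Assumption: each interval includes its upper bound.
    if c > 0.8:
        return "very high"
    if c > 0.6:
        return "high"
    if c > 0.4:
        return "average"
    if c > 0.1:
        return "low"
    return "very low"


for value in (0.9, 0.8, 0.5, 0.3, 0.1):
    print(value, certainty_level(value))
```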

7. Summary
The effective use of information and technology is crucial for health care organizations to stay competitive in today's complex, evolving environment. The challenges faced when trying to make sense of large, diverse, and often complex data sources are considerable. In an effort to turn information into knowledge, health care organizations are implementing data mining technologies to help control costs and improve the efficacy of patient care. Data mining can be used to help predict future patient behavior and to improve treatment programs. By identifying high-risk patients, clinicians can better manage the care of patients today so they do not become the problems of tomorrow.

We studied the problem of constraining and summarizing different data mining algorithms, focusing on their use for predicting combinations of several target attributes. We conclude that applying appropriate data mining algorithms in the health care industry can produce better results and can help prevent several diseases.

