Data Mining Paper

1
SEMINAR 2005
A Technical Presentation On
DATA MINING
P.CHANAKYA. Y9MC95008, IV Semester, II MCA, St.Anns College of P.G. Studies, Chirala 523 187, Contact: funnychanu@gmail.com www.funnychanu.weebly.com/seminor
ABSTRACT
Commercial databases are growing at unprecedented rates. This explosive growth in stored data has generated an urgent need for new techniques and automated tools that can intelligently assist us in transforming the vast amounts of data into useful information and knowledge. Data mining is an automated or convenient tool for the extraction of patterns representing knowledge implicitly stored in large databases, data warehouses and other massive information repositories.
Here we describe basic ideas behind the Data mining and Warehousing, the need for Data mining & Warehousing, the actual process of Data mining, the architecture of Data mining system, the different techniques involved, the architecture of Data Warehouse, the advantages and problem areas in Data Warehousing.
Data mining:
Data Mining is the extraction of hidden predictive information from large databases.
Need for Data mining:

Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by Data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Architecture for Data Mining:
The starting point in this architecture is Data Warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. An OLAP (OnLine Analytical Processing) server enables a more sophisticated enduser business model to be applied when navigating the Data Warehouse. The Data Mining Server must be integrated with the Data Warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. Integration with the Data Warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new Decisions and results, the organization can continually mine the best practices and apply them to future decisions.
Data mining process:
The phases depicted start with the raw data and finish with the extracted knowledge, which was acquired as a result of the following stages:
Selection
Selecting or segmenting the data according to some criteria, e.g. all those people who own a car, in this way subsets of the data can be determined.
Preprocessing
This is the data cleaning stage where certain information is removed, which is deemed unnecessary and may slow down queries e.g., it is unnecessary to note the sex of a patient when studying pregnancy. Also the data is reconfigured to ensure a consistent format as there is a possibility of inconsistent formats because the data is drawn from several sources. e.g. sex may recorded as f or m and also as 1 or 0.
Transformation
The data is not merely transferred across but transformed in that overlays may added such as the demographic overlays commonly used in market research. The data is made usable and navigable.
Data mining
data.
This stage deals with the extraction of patterns from the
Interpretation and evaluation

The patterns identified by the system are interpreted into knowledge, which can then be used to support human decision-making e.g. prediction and classification tasks, summarizing the contents of a database or explaining observed phenomena.
Data mining techniques:

1. Cluster Analysis: The first step is to discover subsets of related objects and then find descriptions e.g., D1, D2, D3 etc. which describe each of these subsets.
discover
construct
D1
subsets
descriptions
D2
D3
Clustering and Segmentation basically partition the database so that each partition or group is similar according to some criteria or metric.
2. Induction: A database is a store of information but more important is

the information, which can be inferred from it. Induction has been used in the following ways within data mining.
2.1. Decision trees: Decision trees are simple knowledge representation

and they classify examples to a finite number of classes, the nodes are labeled with attribute names, the edges are labeled with possible values for this attribute and the leaves labeled with different classes. Objects are classified by following a path down the tree, by taking the edges, corresponding to the values of the attributes in an object.
2.2. Rule induction: A data mine system has to infer a model from the
database i.e. it may define classes such that the database contains one or more attributes that denote the class of a tuple i.e. the predicted attributes while the remaining attributes are the predicting attributes. Class can then be defined by condition on the attributes. When the classes are defined the system should be able to infer the rules that govern classification, in other words the system should find the description of each class.
3. Neural networks: Neural networks have the remarkable ability to derive

meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:
Sales forecasting Industrial process control Customer research Data validation Risk management Target marketing etc.
4. On-line Analytical processing :OLAP was a term coined by E F Codd

(1993) and was defined by him as the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data. Codd has developed rules or requirements for an OLAP system: Multidimensional conceptual view
Transparency Accessibility Consistent reporting performance Client/server architecture Generic dimensionality Dynamic sparse matrix handling Multi-user support Unrestricted cross dimensional operations Intuitative data manipulation Flexible reporting Unlimited dimensions and aggregation levels
5. Data Visualizations: Data visualization makes it possible for the

analyst to gain a deeper, more intuitive understanding of the data and as such can work well along side data mining.
Need for Data Warehousing: A Data Warehouse is nothing but a

repository of data collected from the various operational systems of an organization. This data is then comprehensively analyzed to gain competitive advantage. The analysis is basically used in decision making at the top level.
10
Data
Warehouse:
Data
Warehouse
is
Subject-oriented,
Integrated,Time-variant and Non-volatile collection of data in support of management decisions.
Data Warehousing Systems: Legacy Systems: Any information system currently in use that was built
using previous technology generations. Most legacy Systems are
operational in nature, largely because the automation of transactionoriented business process had long been the priority of IT projects.
Source Systems: Any system from which data is taken for a data
warehouse. A source system is often called a legacy system in a mainframe environment.
Operational Data Stores (ODS):- An ODS is a collection of integrated

External Data Sources databases designed to support the monitoring of operations. The ODS
contains subject oriented, volatile, and current enterprise-wide Reports detailed information.
EXTRACT CLEAN TRANSFOR M LOAD REFRESH
Meta Data Repository serves OLAP
Data Mining
DW
Operational Systems
11
Operational Systems Vs Data Warehousing Systems:

Operational Holds current data Data is dynamic Read/Write accesses Repetitive processing Transaction driven Application oriented Used by clerical staff for day-to-day operations Normalized data model (ER model)
DATA WAREHOUSE ARCHITECTURE
Data Warehouse Holds historic data Data is largely static Read only accesses Adhoc complex queries Analysis driven Subject oriented Used by top managers for analysis Denormalized data for model queries of the
(Dimensional model) Must be optimized for writes and Must be optimized small queries. involving a large
portion
warehouse.
Advantages of Data Warehousing:

Potential high Return on Investment Competitive Advantage Increased Productivity of Corporate Decision Makers
12
Problems with Data Warehousing:

Underestimation of resources for data loading hidden problems with source systems required data not captured Increased end-user demands High maintenance Long duration projects Complexity of integration
CONCLUSION
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new
13
technological leap is needed to structure and prioritize information for specific end-user problems. The data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.
REFERENCES
1. Building the Data Warehouse 2. Data Warehouse Architecture and Implementation 4. http://www.thearling.com/dmintro/dmintro.htm 5. http://www.sdgcomputing.com/glossary.htm - W.H. Inmon. - Barry Devlin.

Data Mining Paper

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Paper

Uploaded by

Copyright:

Available Formats

1

Need for Data mining:

Architecture for Data Mining:

Data mining process:

This stage deals with the extraction of patterns from the

Interpretation and evaluation

Data mining techniques:

2. Induction: A database is a store of information but more important is

2.1. Decision trees: Decision trees are simple knowledge representation

3. Neural networks: Neural networks have the remarkable ability to derive

4. On-line Analytical processing :OLAP was a term coined by E F Codd

5. Data Visualizations: Data visualization makes it possible for the

Need for Data Warehousing: A Data Warehouse is nothing but a

Integrated,Time-variant and Non-volatile collection of data in support of management decisions.

Operational Data Stores (ODS):- An ODS is a collection of integrated

Meta Data Repository serves OLAP

Operational Systems Vs Data Warehousing Systems:

DATA WAREHOUSE ARCHITECTURE

Advantages of Data Warehousing:

Problems with Data Warehousing:

You might also like