
DATA MINING AND DATA WAREHOUSING

Sri Indu College of Engineering and Technology

V. ASHVINI, B-952, NGOS Colony, Vanasthalipuram, Hyderabad-70, ashvini_reddy@yahoo.com

ABSTRACT

In today's world, the competitive edge is coming less from optimization and more from the proactive use of the information that operational systems have been collecting over the years. Companies are beginning to realize the vast potential of the information that they hold in their organizations. If they can tap into this information, they can significantly improve the quality of their decision making and the profitability of the organization through focused actions. Data warehousing and data mining are techniques for storing and retrieving data in an effective and efficient manner. A data warehouse is designed especially for decision support queries. Data mining, in turn, is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This paper explains in detail the architecture, characteristics, and tools of a data warehouse and the process of mining. It also explains the different stages in data mining and the modeling of a data warehouse, including dimensional modeling. Finally, it presents the applications of data warehousing and mining and closes with the references.

INTRODUCTION:

Nowadays there is a lot of confusion concerning the terms data mining and data warehousing (also referred to as business intelligence in the marketplace today). To my chagrin, many IT professionals use the two terms interchangeably, with little hesitation or regard for the differences between the two types of applications. While the goals of both are related and often overlap, data mining and data warehousing are dedicated to furnishing different types of analytics for different types of users, and therefore merit their own space. Data warehouses and data mining techniques are becoming indispensable parts of business intelligence programs.

DATA WAREHOUSING:

The data warehouse now makes it possible to get answers to business questions that have been very difficult, if not impossible, to answer, especially those that are time sensitive or that cross subject areas. Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse. Once the data is loaded, it is accessible to decision makers via desktop query and analysis tools. A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented
Integrated
Nonvolatile
Time Variant

Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"

Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Functionality of OLAP: OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. It comes in many varieties: ROLAP, MOLAP, HOLAP, etc.

ROLAP: Relational OLAP
Uses an RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
The OLAP tool manipulates RDBMS star schema data
Called "slow-LAP" by MOLAP vendors
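The star schema that ROLAP relies on can be illustrated with a small sketch. The example below uses Python's built-in sqlite3 module as a stand-in RDBMS. The table names (fact_sales, dim_customer, dim_time), the sample rows, and the helper functions are all hypothetical and chosen only for illustration; a production schema would normally model the item as its own dimension as well.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY,
                       month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    item        TEXT,   -- kept inline for brevity; usually a dimension too
    amount      REAL    -- the measure
);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Asha"), (2, "Ravi")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 2023), (2, 4, 2, 2023), (3, 2, 1, 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    (1, 1, "widget", 100.0),
    (2, 1, "widget", 250.0),
    (1, 2, "widget", 300.0),
    (2, 3, "widget", 900.0),  # falls in 2024, so a 2023 query ignores it
])


def best_customer(conn, item, year):
    """Answer 'who was our best customer for this item in a given year?'."""
    return conn.execute("""
        SELECT c.name, SUM(f.amount) AS total
        FROM fact_sales f
        JOIN dim_customer c ON c.customer_id = f.customer_id
        JOIN dim_time t     ON t.time_id = f.time_id
        WHERE f.item = ? AND t.year = ?
        GROUP BY c.name
        ORDER BY total DESC
        LIMIT 1
    """, (item, year)).fetchone()


def sales_by_quarter(conn, year):
    """Roll the sales measure up the time hierarchy to quarter level."""
    return conn.execute("""
        SELECT t.quarter, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_time t ON t.time_id = f.time_id
        WHERE t.year = ?
        GROUP BY t.quarter
        ORDER BY t.quarter
    """, (year,)).fetchall()


print(best_customer(conn, "widget", 2023))
print(sales_by_quarter(conn, 2023))
```

Every analytical question here is a join from the fact table to one or more dimension tables followed by an aggregation, which is the multidimensional behavior a ROLAP tool generates on top of a relational database.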

MOLAP: Multidimensional OLAP
Uses an MDDBS (e.g., Essbase) to store and access data
Usually requires a proprietary (non-SQL) data access tool
Provides exceptionally fast response times

HOLAP: Hybrid OLAP technologies attempt to combine the advantages of MOLAP and ROLAP.

Comparison of OLAP and OLTP: OLAP applications are quite different from On-line Transaction Processing (OLTP) applications, which consist of a large number of relatively simple transactions. The transactions usually retrieve and update a small number of records that are contained in several distinct tables. The relationships between the tables are generally simple. The difference between OLAP and OLTP has been summarized as follows: OLTP servers handle mission-critical production data accessed through simple queries, while OLAP servers handle management-critical data accessed through an iterative analytical investigation. Both OLAP and OLTP have specialized requirements and therefore require servers optimized for the respective type of processing.

DIMENSIONAL MODELING:

Dimensional modeling uses three basic concepts: measures, facts, and dimensions. It is powerful in representing the requirements of the business user in the context of database tables. It focuses on numeric data, such as values, counts, weights, balances, and occurrences. The designer must identify the business process to be supported and the grain (level of detail).

CONVENTIONS USED IN DIMENSIONAL MODELING:

Facts: A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process. Facts are measured, continuously valued, rapidly changing information. They can be calculated and/or derived. A fact table is a table that stores business information (measures) that can be used in mathematical equations.

Dimensions: A dimension is a collection of members or units of the same type of views. Dimensions determine the contextual background for the facts. Dimensions represent the way businesspeople talk about the data resulting from a business process, e.g., who, what, when, where, why, how.

Measures (Variables): A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. The actual numbers are called variables.

Dimension members: A member is a distinct name that determines a data item's position (e.g., Time: month, quarter).

Dimension hierarchies: Allow for the rollup of data to more summarized levels, for example in the Time dimension: day, month, quarter, year.

DATA WAREHOUSE ARCHITECTURE:

Architecture choices depend on the current infrastructure, the business environment, the desired management and control structure, resources, commitment, and whether a data warehouse or a data mart is being built. There are three choices: global, independent, interconnected, or a combination of these three.

Global Architecture:
"Global" relates to the scope of data access and storage; it does not only mean centralized
Can be physically centralized or distributed
Provides an enterprise view of data
Time-consuming and costly to implement

Independent Architecture:

Considered to be a stand-alone warehouse controlled by a department
Includes minimal integration, with no global view
Very fast to implement

Interconnected Architecture:
Distributed, integrated, and interconnected
Gives a global view of the enterprise
More complexity in managing and controlling data
Needs another tier in the architecture to share common data between multiple data marts
Uses a data sharing schema across data marts

TYPES OF DATA WAREHOUSING:

Enterprise Data Warehouse:
Contains data drawn from multiple operational systems
Supports time-series and trend analysis across different business areas
Can be used as a transient storage area to clean all data and ensure consistency
Can be used to populate data marts
Can be used for everyday and strategic decision making

Data Mart:
Departmental subsets that focus on selected subjects: customers, products, and sales
Faster roll out, but complex integration in the long run
Logical subset of the enterprise data warehouse
Organized around a single business process
Less expensive and much smaller than a full-blown corporate data warehouse

WAREHOUSE TOOLS AND APPLICATIONS:

Sales and marketing analysis across all industries
Inventory turn and product tracking in manufacturing
Category management, vendor analysis, and marketing program effectiveness analysis in retail
Profitable lane or driver risk analysis in transportation
Profitability analysis or risk assessment in banking
Claims analysis or fraud detection in insurance, etc.

DATA MINING:

Data mining implies digging through tons of data to identify the patterns and relationships contained within the business activity and history. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goals are to provide a single version of the truth and to improve decision making.

PROCESS OF MINING:

The analogy with the mining process is described as follows: data mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful" (Clementine User Guide, a data mining toolkit).

The above diagram summarizes some of the stages/processes identified in data mining and knowledge discovery by Usama Fayyad and Evangelos Simoudis, two of the leading exponents of this area. The phases depicted start with the raw data and finish with the extracted knowledge, which is acquired as a result of the following stages:

Selection - selecting or segmenting the data according to some criteria, e.g. all those people who own a car. In this way subsets of the data can be determined.

Preprocessing - this is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, as inconsistent formats are possible when the data is drawn from several sources; e.g. sex may be recorded as f or m and also as 1 or 0.

Transformation - the data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.

Data mining - this stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.

Interpretation and evaluation - the patterns identified by the system are interpreted into knowledge which can then be used to support human decision-making e.g. prediction and classification tasks, summarizing the contents of a database or explaining observed phenomena.
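The five stages above can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the records, the field names, the postcode-to-region overlay, and the mapping of the numeric sex codes (1 taken to mean male, 0 female) are all hypothetical, and the "mining" step is a simple group average rather than a real mining algorithm.

```python
from collections import defaultdict

# Hypothetical raw records drawn from two sources with inconsistent encodings.
RAW = [
    {"id": 1, "sex": "f", "owns_car": True,  "postcode": "500070", "spend": 120.0},
    {"id": 2, "sex": 1,   "owns_car": True,  "postcode": "500001", "spend": 80.0},
    {"id": 3, "sex": "m", "owns_car": False, "postcode": "500070", "spend": 40.0},
    {"id": 4, "sex": 0,   "owns_car": True,  "postcode": "500070", "spend": 200.0},
]

# Assumed code mapping; the real meaning of 1 and 0 depends on the source system.
SEX_CODES = {"f": "F", "m": "M", 0: "F", 1: "M"}

# Hypothetical demographic overlay keyed by postcode.
REGION_BY_POSTCODE = {"500070": "suburban", "500001": "urban"}


def select(records):
    """Selection: keep only the people who own a car."""
    return [r for r in records if r["owns_car"]]


def preprocess(records):
    """Preprocessing: bring the sex field into one consistent format."""
    return [{**r, "sex": SEX_CODES[r["sex"]]} for r in records]


def transform(records):
    """Transformation: add a demographic overlay to every record."""
    return [{**r, "region": REGION_BY_POSTCODE.get(r["postcode"], "unknown")}
            for r in records]


def mine(records):
    """Data mining (toy version): average spend per region."""
    spend = defaultdict(list)
    for r in records:
        spend[r["region"]].append(r["spend"])
    return {region: sum(v) / len(v) for region, v in spend.items()}


def interpret(pattern):
    """Interpretation: name the region with the highest average spend."""
    return max(pattern, key=pattern.get)


prepared = transform(preprocess(select(RAW)))
pattern = mine(prepared)
print(pattern, "->", interpret(pattern))
```

Keeping each stage as its own function mirrors the description: the output of one stage is the input of the next, and any stage can be replaced (for instance, swapping the toy averaging step for a real mining algorithm) without touching the others.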

Rule induction: The extraction of useful if-then rules from data based on statistical significance.

The Data-Mining Communities: They are Data, Information, and Knowledge.

Data: Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. This includes:

operational or transactional data, such as sales, cost, inventory, payroll, and accounting
nonoperational data, such as industry sales, forecast data, and macroeconomic data
metadata - data about the data itself, such as logical database design or data dictionary definitions

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point of sale transaction data can yield information on which products are selling and when.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data. With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history.

How does data mining work?

While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
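To make the association relationship concrete, the sketch below scores the rule "diapers implies beer" with three measures commonly used in association mining: support, confidence, and lift. These measures are not named in the text above and are brought in here as an assumption about how such a rule would typically be evaluated; the baskets are invented sample data.

```python
from fractions import Fraction

# Hypothetical market baskets; each set is one customer transaction.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))


def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the share that also has the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's baseline frequency (above 1 means positive association)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


rule = ({"diapers"}, {"beer"})
print("support   ", support(BASKETS, rule[0] | rule[1]))
print("confidence", confidence(BASKETS, *rule))
print("lift      ", lift(BASKETS, *rule))
```

A lift above 1 in this toy data says that beer appears more often in baskets with diapers than in baskets overall, which is the kind of observation the beer-diaper example refers to. Exact fractions are used instead of floats only to keep the arithmetic easy to check by hand.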

Data mining problems/issues: Data mining systems rely on databases to supply the raw data for input, and this raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other problems arise as a result of the adequacy and relevance of the information stored.

APPLICATIONS:

Some examples of successes:
1. Decision trees constructed from bank-loan histories to produce algorithms to decide whether to grant a loan.
2. Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc.
3. Diapers and beer. The observation that customers who buy diapers are more likely to buy beer than average allowed supermarkets to place beer and diapers nearby, knowing many customers would walk between them. Placing potato chips between them increased sales of all three items.
4. SKICAT and the Sloan Sky Survey: clustering sky objects by their radiation levels in different bands allowed astronomers to distinguish between galaxies, nearby stars, and many other kinds of celestial objects.
5. Comparison of the genotypes of people with and without a condition allowed the discovery of a set of genes that together account for many cases of diabetes. This sort of mining will become much more important as the human genome is constructed.
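The first success above, deciding on loans from past histories, can be illustrated with the smallest possible decision tree: a single split, often called a decision stump. The sketch below picks the yes/no feature with the lowest weighted Gini impurity, which is one common splitting criterion and is an assumption here, since the text does not say how the trees were built. The loan histories and feature names are invented.

```python
# Hypothetical loan histories: two yes/no features and whether the loan was repaid.
HISTORY = [
    ({"has_job": True,  "owns_home": True},  1),
    ({"has_job": True,  "owns_home": False}, 1),
    ({"has_job": True,  "owns_home": True},  1),
    ({"has_job": False, "owns_home": True},  0),
    ({"has_job": False, "owns_home": False}, 0),
    ({"has_job": False, "owns_home": False}, 1),
]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_score(history, feature):
    """Weighted Gini impurity after splitting the histories on one feature."""
    yes = [label for row, label in history if row[feature]]
    no = [label for row, label in history if not row[feature]]
    return (len(yes) * gini(yes) + len(no) * gini(no)) / len(history)


def majority(labels):
    """Most common label in a branch (ties go to 1)."""
    return int(2 * sum(labels) >= len(labels))


def train_stump(history, features):
    """Pick the best single feature and the decision for each of its branches."""
    best = min(features, key=lambda f: split_score(history, f))
    decisions = {
        value: majority([label for row, label in history if row[best] == value])
        for value in (True, False)
    }
    return best, decisions


def decide(stump, applicant):
    """Return 1 to grant the loan, 0 to decline, for a new applicant."""
    feature, decisions = stump
    return decisions[bool(applicant[feature])]


stump = train_stump(HISTORY, ["has_job", "owns_home"])
print(stump)
print(decide(stump, {"has_job": True, "owns_home": False}))
```

A full decision tree repeats the same split selection recursively inside each branch; the stump shows only the first level, which is enough to see how an if-then rule falls out of historical data.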

CONCLUSION:

Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision support data. A good data warehouse will provide the RIGHT data to the RIGHT people at the RIGHT time: RIGHT NOW! The two application types are similar in that they rely on historical data to drive profitability in the future. Data mining and data warehousing concepts are already implemented, but the name itself is new for this market. However, we will have to wait and see whether they behave like a crystal ball.

REFERENCES:

http://en.wikipedia.org/wiki/Data_mining
http://en.wikipedia.org/wiki/Data_warehouse
William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Sons
