UNIT-1 Introduction: Fundamentals of Data Mining, Data Mining Functionalities, Classification of Data Mining Systems, Major Issues in Data Mining. Data Preprocessing: Need for Preprocessing the Data, Data Cleaning, Data Integration and Transformation, Data Reduction, Discretization and Concept Hierarchy Generation.

Fundamentals of data mining:

Evolution of Database Technology:
1. 1960s: Data collection, database creation, IMS and network DBMS
2. 1970s: Relational data model, relational DBMS implementation
3. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)
4. 1990s: Data mining, data warehousing, multimedia databases, and Web databases
5. 2000s: Stream data management and mining, data mining with a variety of applications, Web technology and global information systems
Introduction: What is DATA MINING?
Extracting or mining knowledge from large amounts of data: the data-driven discovery and modeling of hidden patterns (patterns we never knew existed) in large volumes of data; the extraction of implicit, previously unknown, and potentially very useful information from data. Data mining is also known by other names:
* "gold mining" from data
* knowledge mining from databases
* knowledge extraction
* data/pattern analysis
* Knowledge Discovery in Databases (KDD)

DATA MINING APPLICATIONS:
1. Science: chemistry, physics
2. Bioscience: sequence-based analysis, protein structure and function prediction, protein family classification, microarray gene expression
3. Financial industry: banks, businesses, e-commerce
4. Stock and investment analysis
5. Pharmaceutical companies
6. Health care
7. Sports and entertainment

What is a Data Warehouse?
A single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. (or) A process of transforming data into information and making it available to users in a timely enough manner to make a difference. (or) A data warehouse is a subject-oriented + integrated + time-varying + non-volatile collection of data that is used primarily in organizational decision making. (Note: data warehousing is also called OLAP, online analytical processing.)
NOTE: Data Mining works with Warehouse Data
* Data Warehousing provides the Enterprise with a memory
* Data Mining provides the Enterprise with intelligence

Knowledge Discovery & Data Mining:
Knowledge Discovery in Databases (KDD) is the process of extracting previously unknown, valid, and actionable (understandable) information from large databases. Data mining is a step in the KDD process: the application of data analysis and discovery algorithms.
KDD Process:
1. Selection: obtain data from various sources.
2. Preprocessing: cleanse the data.
3. Transformation: convert to a common format; transform to the new format.
4. Data Mining: obtain the desired results.
5. Interpretation/Evaluation: present results to the user in a meaningful manner.
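The five KDD steps above can be sketched end to end in plain Python. This is a toy illustration: the record fields and values below are invented for the example, not part of the notes.

```python
# A minimal sketch of the KDD steps (selection, preprocessing,
# transformation, mining, interpretation) on made-up purchase records.
raw = [
    {"item": "milk",  "price": "2.50", "country": "Canada"},
    {"item": "bread", "price": None,   "country": "Canada"},   # dirty record
    {"item": "milk",  "price": "2.50", "country": "USA"},
]

# Selection: keep only task-relevant data (say, purchases made in Canada).
selected = [r for r in raw if r["country"] == "Canada"]

# Preprocessing: cleanse the data (drop records with missing values).
cleaned = [r for r in selected if r["price"] is not None]

# Transformation: convert to a common format (price as a float).
transformed = [{**r, "price": float(r["price"])} for r in cleaned]

# Data mining: obtain the desired result (here, a trivial frequency count).
counts = {}
for r in transformed:
    counts[r["item"]] = counts.get(r["item"], 0) + 1

# Interpretation/Evaluation: present results in a meaningful manner.
for item, n in sorted(counts.items()):
    print(f"{item}: bought {n} time(s)")
```

Each stage consumes the previous stage's output, which is the point of viewing KDD as a process rather than a single algorithm.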
Example (Web usage mining), Interpretation/Evaluation: identify and display frequently accessed sequences. Potential user applications: cache prediction, personalization.

Architecture: Typical Data Mining System:
The architecture of a typical data mining system has the following components:
Step 1: Data sources. Consider various databases, such as a relational database, data warehouse, spatial database, etc.
Step 2: Preprocessing. Preprocessing techniques are used to clean the data, i.e., to remove redundant, noisy, and unwanted data. This improves data quality and hence the quality of the mining results.
Step 3: Database or data warehouse server. Every data mining system maintains a database/data warehouse server to store the cleaned data.
Step 4: Data mining engine. Every data mining system maintains a data mining engine, which acts like a query evaluation engine: it takes a query based on the user's request and returns the result (i.e., different patterns based on the user's view).
Step 5: Pattern evaluation. Patterns generated by the data mining engine are evaluated for interestingness.
Step 6: Graphical user interface. The end user interacts through the GUI and receives the requested patterns.

Data Mining: On What Kinds of Data?
Data mining can be applied to the following kinds of databases:
1. Relational database
2. Data warehouse
3. Transactional database
4. Advanced database and information repository: object-relational database, spatial and temporal data, time-series data, stream data, multimedia database, heterogeneous and legacy database, text databases & WWW
Data Mining Functionalities: Data mining functionalities are 1. Concept description: Characterization and discrimination 2. Association (correlation and causality)
3. Classification and Prediction 4. Cluster analysis 5. Outlier analysis 6. Trend and evolution analysis
1. Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics.
1. Characterization: provides a concise and succinct summarization of a given collection of data.
2. Comparison/discrimination: provides descriptions comparing two or more collections of data.
Data generalization: a process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
2. Association (correlation and causality)
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories.
Applications: 1) basket data analysis, 2) cross-marketing, 3) catalog design, 4) loss-leader analysis.
3. Classification and Prediction
Classification:
* Used for prediction (future analysis), to determine unknown attribute values, using classifier algorithms and decision trees.
* Constructs models (such as decision trees) which then classify the attributes.
* Attributes may be 1. categorical or 2. numerical; classification can work on both kinds.
Prediction: prediction is also used to determine unknown or missing values; it likewise uses models, such as neural networks, if-then rules, and other mechanisms, to predict attribute values.
Classification and prediction are used in applications like:
* credit approval
* target marketing
* medical diagnosis
4. Cluster analysis:
What is Cluster Analysis?
Cluster: a collection of data objects that are
* similar to one another within the same cluster, and
* dissimilar to the objects in other clusters.
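The "similar within, dissimilar across" idea can be made concrete by comparing average pairwise distances. A minimal sketch on invented 2-D points (the clusters and coordinates are assumptions for illustration):

```python
import math

# Two hand-picked clusters of 2-D points: members of cluster_a sit close
# together, far away from the members of cluster_b.
cluster_a = [(1.0, 1.0), (1.5, 1.2), (0.8, 0.9)]
cluster_b = [(8.0, 8.0), (8.5, 7.6)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def avg_pairwise(points, others=None):
    """Average distance within one cluster, or between two clusters."""
    pairs = ([(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
             if others is None else
             [(p, q) for p in points for q in others])
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

intra = avg_pairwise(cluster_a)             # small: members are similar
inter = avg_pairwise(cluster_a, cluster_b)  # large: clusters are dissimilar
print(intra < inter)  # True for a good clustering
```

High intra-class similarity corresponds to a small `intra` value, low inter-class similarity to a large `inter` value.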
General Applications of Clustering:
1. Pattern recognition
2. Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining
3. Image processing
4. Economic science (especially market research)
5. WWW: document classification; clustering Weblog data to discover groups of similar access patterns
Examples of Clustering Applications:
1. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
2. Land use: identification of areas of similar land use in an earth observation database
3. Insurance: identifying groups of motor insurance policy holders with a high average claim cost
4. City planning: identifying groups of houses according to their house type, value, and geographical location
5. Earthquake studies: observed earthquake epicenters should be clustered along continent faults

What Is Good Clustering?
A good clustering method will produce high-quality clusters with
* high intra-class similarity
* low inter-class similarity
1. The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
2. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

5. Outlier analysis:
1. What are outliers? Objects that are considerably dissimilar from the remainder of the data.
Example (sports statistics): most players cluster together, while an exceptional player such as Michael Jordan or Wayne Gretzky stands apart from the rest of the data as an outlier.
2. Problem: find the top n outlier points.
3. Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.
Note: to solve the outlier analysis problem we can use the clustering methods above; apply any one of the clustering methods to detect outliers.
6. Trend and evolution analysis: a combination of all the above analyses. It keeps the user up to date with the database; the user should apply the appropriate methods and approaches based on the type of data.

Classification of Data Mining Systems:
Note: Data mining is a combination of multiple disciplines.
Classification of data mining can be divided into two types:
* Descriptive data mining
* Predictive data mining
Descriptive data mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms.
Note: descriptive data mining means describing the general properties of the data.
Descriptive data mining is further classified by views such as:
1. Kinds of data to be mined
2. Kinds of knowledge to be discovered
3. Kinds of techniques utilized
4. Kinds of applications adapted

1. Kinds of data to be mined:
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW.
2. Kinds of knowledge to be discovered:
Knowledge can be discovered in terms of the following properties: 1. characterization, 2. discrimination, 3. association, 4. classification, 5. clustering, 6. trend/deviation, 7. outlier analysis, etc.
Note: KDD is the process of extracting previously unknown, valid, and actionable (understandable) information from large databases, or: the process of finding useful information and patterns in data.
Data mining is a step in the KDD process: the application of data analysis and discovery algorithms.
3. Kinds of techniques utilized:
Data mining techniques depend on the discipline used. Some disciplines: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Note: techniques means operations, methods, approaches, etc.
4. Kinds of applications adapted:
Data mining is used in applications such as retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.
Predictive data mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data.
Data analysis can be done by the following:
1. Concept description: Characterization and discrimination
2. Association (correlation and causality)
3. Classification and Prediction 4. Cluster analysis 5. Outlier analysis 6. Trend and evolution analysis
1. Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics.
1. Characterization: provides a concise and succinct summarization of a given collection of data.
2. Comparison/discrimination: provides descriptions comparing two or more collections of data.
Data generalization: a process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
2. Association (correlation and causality)
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transactional databases, relational databases, and other information repositories.
Applications: 1) basket data analysis, 2) cross-marketing, 3) catalog design, 4) loss-leader analysis.
3. Classification and Prediction
Classification:
* Used for prediction (future analysis), to determine unknown attribute values, using classifier algorithms and decision trees.
* Constructs models (such as decision trees) which then classify the attributes.
* Attributes may be 1. categorical or 2. numerical; classification can work on both kinds.
Prediction: prediction is also used to determine unknown or missing values; it likewise uses models, such as neural networks, if-then rules, and other mechanisms, to predict attribute values.
Classification and prediction are used in applications like:
* credit approval
* target marketing
* medical diagnosis
4. Cluster analysis:
What is Cluster Analysis?
Cluster: a collection of data objects that are
* similar to one another within the same cluster, and
* dissimilar to the objects in other clusters.
General Applications of Clustering:
1. Pattern recognition
2. Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining
3. Image processing
4. Economic science (especially market research)
5. WWW: document classification; clustering Weblog data to discover groups of similar access patterns
Examples of Clustering Applications:
1. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
2. Land use: identification of areas of similar land use in an earth observation database
3. Insurance: identifying groups of motor insurance policy holders with a high average claim cost
4. City planning: identifying groups of houses according to their house type, value, and geographical location
5. Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?
A good clustering method will produce high-quality clusters with
* high intra-class similarity
* low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

5. Outlier analysis:
1. What are outliers? Objects that are considerably dissimilar from the remainder of the data.
Example (sports statistics): most players cluster together, while an exceptional player such as Michael Jordan or Wayne Gretzky stands apart from the rest of the data as an outlier.
2. Problem: find the top n outlier points.
3. Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis.
Note: to solve the outlier analysis problem we can use the clustering methods above; apply any one of the clustering methods to detect outliers.
6. Trend and evolution analysis: a combination of all the above analyses. It keeps the user up to date with the database; the user should apply the appropriate methods and approaches based on the type of data.
Major issues in Data Mining:
The three major issues of data mining are:
1. Mining methodology issues
2. User interaction issues
3. Applications and social impacts issues

1. Mining methodology issues:
1. Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
2. Performance: efficiency, effectiveness, and scalability
3. Pattern evaluation: the interestingness problem
4. Incorporation of background knowledge
5. Handling noise and incomplete data
6. Parallel, distributed, and incremental mining methods
7. Integration of the discovered knowledge with existing knowledge: knowledge fusion
2. User interaction issues:
1. Data mining query languages and ad-hoc mining
2. Expression and visualization of data mining results
3. Interactive mining of knowledge at multiple levels of abstraction
3. Applications and social impacts issues:
1. Domain-specific data mining & invisible data mining
2. Protection of data security, integrity, and privacy
UNIT II
Data Warehouse and OLAP Technology for Data Mining: Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation, Further Development of Data Cube Technology, From Data Warehousing to Data Mining.
What is a data warehouse?
A single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. (or) A process of transforming data into information and making it available to users in a timely enough manner to make a difference. (or) A data warehouse is a subject-oriented + integrated + time-varying + non-volatile collection of data that is used primarily in organizational decision making. (Note: data warehousing is also called OLAP, online analytical processing.)
NOTE: Data Mining works with Warehouse Data
* Data Warehousing provides the Enterprise with a memory
* Data Mining provides the Enterprise with intelligence

Data Warehouse is Subject-Oriented:
1. Organized around major subjects, such as customer, product, sales.
2. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
3. Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse is Integrated:
1. Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
2. Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., hotel price: currency, tax, whether breakfast is covered, etc. When data is moved to the warehouse, it is converted.
Data Warehouse is Time-Variant:
1. The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years).
2. Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
Data Warehouse is Non-Volatile:
1. A physically separate store of data transformed from the operational environment.
2. Operational update of data does not occur in the data warehouse environment. It does not require transaction processing, recovery, and concurrency control mechanisms. It requires only two operations in data accessing: initial loading of the data and access of the data.
Differences between OLTP and OLAP
Here OLTP is the transactional/operational database side, and OLAP is the data warehousing side.

Feature         OLTP                                 OLAP
users           clerk, IT professional               knowledge worker
function        day-to-day operations                decision support
DB design       application-oriented                 subject-oriented
data            current, up-to-date, detailed,       historical, summarized,
                flat relational, isolated            multidimensional, integrated,
                                                     consolidated
usage           repetitive                           ad-hoc
access          read/write, index/hash on prim. key  lots of scans
unit of work    short, simple transaction            complex query
# users         thousands                            hundreds
DB size         100 MB - GB                          100 GB - TB
metric          transaction throughput               query throughput
1. DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery.
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
2. Different functions and different data:
* missing data: decision support requires historical data which operational DBs do not typically maintain
* data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
* data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled
A multi-dimensional data model: 1. A data warehouse is based on a multidimensional data model which views data in the form of a data cube 2. A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables 3. In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
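The fact-table/dimension-table structure just described can be sketched in plain Python. All table contents below are invented for illustration; only the general shape (dimension tables, a fact table with a dollars_sold measure, and a cuboid of aggregated cells) follows the text.

```python
# Hypothetical item and time dimension tables, keyed by surrogate keys.
item_dim = {1: {"item_name": "TV", "brand": "A", "type": "electronics"},
            2: {"item_name": "CD", "brand": "B", "type": "media"}}
time_dim = {10: {"quarter": "Q1", "year": 2024},
            20: {"quarter": "Q2", "year": 2024}}

# Fact table: (item_key, time_key, dollars_sold measure).
fact = [(1, 10, 500.0), (1, 20, 700.0), (2, 10, 30.0)]

# One 2-D cuboid of the cube: total dollars_sold by (type, quarter).
cuboid = {}
for item_key, time_key, dollars in fact:
    cell = (item_dim[item_key]["type"], time_dim[time_key]["quarter"])
    cuboid[cell] = cuboid.get(cell, 0.0) + dollars

for cell in sorted(cuboid):
    print(cell, cuboid[cell])
```

Each dict key is one cell of the cuboid; aggregating over further dimensions (or dimension levels) yields the coarser cuboids in the lattice, up to the apex.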
Conceptual Modeling of Data Warehouses: 1. Modeling data warehouses: dimensions & measures
1. Star schema: a fact table in the middle connected to a set of dimension tables.
2. Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
3. Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, therefore also called a galaxy schema.
Example of a star schema: a central sales fact table joined to dimension tables such as time (time_key, day, day_of_the_week, month, ...) and item (item_name, brand, type).
Example of a snowflake schema: the star schema above with its dimension tables normalized.

Four views regarding the design of a data warehouse:
1. Top-down view: allows selection of the relevant information necessary for the data warehouse.
2. Data source view: exposes the information being captured, stored, and managed by operational systems.
3. Data warehouse view: consists of fact tables and dimension tables.
4. Business query view: sees the perspectives of data in the warehouse from the view of the end user.

Data warehouse architecture:
The architecture is also called a multi-tiered architecture.
OLAP Server Architectures:
1. Relational OLAP (ROLAP): uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services. Greater scalability.
2. Multidimensional OLAP (MOLAP): an array-based multidimensional storage engine (sparse-matrix techniques); fast indexing to pre-computed summarized data.
3. Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high level: array.
4. Specialized SQL servers: specialized support for SQL queries over star/snowflake schemas.

Data warehouse implementation
Data warehouse implementation is based on:
1. Efficient data cube computation
2. Cube operation
3. Cube computation: ROLAP-based method, etc.

1. Efficient Data Cube Computation:
The data cube can be viewed as a lattice of cuboids.
a. The bottom-most cuboid is the base cuboid.
b. The top-most cuboid (apex) contains only one cell.
c. How many cuboids are there in an n-dimensional cube where dimension i has L_i levels? The total is the product of (L_i + 1) over all n dimensions.
d. Materialization choices: materialize every cuboid (full materialization), none (no materialization), or some (partial materialization).
e. Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
2. Cube Operation:
1. Cube definition and computation in DMQL:
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
2. Transformed into a SQL-like language (with a new operator CUBE BY, introduced by Gray et al., 1996):
SELECT item, city, year, SUM(amount)
FROM SALES
CUBE BY item, city, year
3. This needs to compute the following group-bys:
(date, product, customer), (date, product), (date, customer), (product, customer), (date), (product), (customer), ()
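The group-by list above is simply the set of all subsets of the three dimensions. A short sketch that enumerates them with the standard library (one level per dimension here, so the cuboid count is 2**n; with L_i levels per dimension it generalizes to the product of (L_i + 1)):

```python
from itertools import combinations

# The dimensions of the cube from the CUBE BY example above.
dims = ["date", "product", "customer"]

# Every subset of the dimensions is one group-by (one cuboid in the lattice),
# from the base cuboid (all dimensions) down to the apex cuboid ().
groupbys = [subset
            for r in range(len(dims), -1, -1)
            for subset in combinations(dims, r)]

for g in groupbys:
    print(g if g else "()")   # () is the apex cuboid
```

Running this prints the same eight group-bys listed in the text, in order of decreasing detail.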
3. Cube Computation: ROLAP-Based Method:
1. Efficient cube computation methods:
* ROLAP-based cubing algorithms (Agarwal et al., 1996)
* Array-based cubing algorithm (Zhao et al., 1997)
* Bottom-up computation method (Beyer & Ramakrishnan, 1999)
* H-cubing technique (Han, Pei, Dong & Wang, SIGMOD 2001)
2. ROLAP-based cubing algorithms:
* Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples.
* Grouping is performed on some sub-aggregates as a partial grouping step.
* Aggregates may be computed from previously computed aggregates, rather than from the base fact table.
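The last point, computing a coarser aggregate from an already-computed finer one instead of rescanning the base fact table, can be shown in a few lines. The (date, product) totals below are invented example values.

```python
# Hypothetical sub-aggregate already computed for the (date, product) cuboid.
date_product = {("d1", "tv"): 300, ("d1", "cd"): 50, ("d2", "tv"): 400}

# The coarser (date) cuboid is rolled up from it, not from the (much larger)
# base fact table -- the reuse that ROLAP-based cubing algorithms exploit.
date_only = {}
for (date, _product), total in date_product.items():
    date_only[date] = date_only.get(date, 0) + total

print(date_only)  # {'d1': 350, 'd2': 400}
```

This works for distributive aggregates like SUM and COUNT, where partial results combine directly.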
Further development of data cube technology:
Data mining and data warehousing use some advanced data cube technology; one such development is the iceberg cube.
Iceberg Cube:
1. Compute only the cuboid cells whose count or other aggregate satisfies a condition such as: HAVING COUNT(*) >= minsup
2. Motivation:
a. Only a small portion of the cube's cells may be "above the water" in a sparse cube.
b. Only calculate interesting data: data above a certain threshold.
c. Suppose 100 dimensions and only 1 base cell: how many aggregate (non-base) cells are there if count >= 1? What about count >= 2?
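A brute-force sketch of the iceberg condition: compute every cuboid of a tiny invented data set, but keep only the cells that satisfy COUNT(*) >= minsup. (Real iceberg algorithms such as BUC prune instead of filtering after the fact; this only illustrates the definition.)

```python
from itertools import combinations

# Toy rows of (month, city, cust_grp) -- invented data.
rows = [("jan", "rome", "a"), ("jan", "rome", "a"),
        ("jan", "oslo", "b"), ("feb", "rome", "a")]
n_dims, minsup = 3, 2

iceberg = {}
for r in range(n_dims + 1):
    for keep in combinations(range(n_dims), r):
        counts = {}
        for row in rows:
            # '*' marks a generalized ("all") dimension in this cuboid.
            cell = tuple(row[i] if i in keep else "*" for i in range(n_dims))
            counts[cell] = counts.get(cell, 0) + 1
        for cell, count in counts.items():
            if count >= minsup:          # the iceberg condition
                iceberg[cell] = count

for cell in sorted(iceberg):
    print(cell, iceberg[cell])
```

Cells like ("jan", "oslo", "b"), which occur only once, fall "below the water" and are never stored.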
The iceberg cube can be computed efficiently with Bottom-Up Computation (BUC), which builds the cube starting from the apex cuboid and expanding downward, so the iceberg condition can be used to prune cells as the computation proceeds.
Drawbacks of BUC:
1. Requires a significant amount of memory (on par with most other CUBE algorithms, though).
2. Does not obtain good performance with dense cubes.
3. Overly skewed data, or a bad choice of dimension ordering, reduces performance.
4. Cannot compute iceberg cubes with complex measures.
Example iceberg cube query:
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBE BY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50

From data warehousing to data mining:
Data Warehouse Usage: three kinds of data warehouse applications
1. Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs.
2. Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations: slice and dice, drilling, pivoting.
3. Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Note the differences among the three tasks.

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM), i.e., from data warehousing to data mining:
1. Why online analytical mining? High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data.
2. Available information processing infrastructure surrounding data warehouses:
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools.
3. OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
4. On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks.
5. Architecture of OLAM:
Summary:
1. Data warehouse
2. A multi-dimensional model of a data warehouse: star schema, snowflake schema, fact constellations; a data cube consists of dimensions & measures
3. OLAP operations: drilling, rolling, slicing, dicing, and pivoting
4. OLAP servers: ROLAP, MOLAP, HOLAP
5. Efficient computation of data cubes: partial vs. full vs. no materialization; multiway array aggregation; bitmap index and join index implementations
6. Further development of data cube technology: discovery-driven and multi-feature cubes; from OLAP to OLAM (on-line analytical mining)
UNIT - III
Data Mining Primitives, Languages, and System Architectures: Data Mining Primitives, Data Mining Query Languages, Designing Graphical User Interfaces Based on a Data Mining Query Language, Architectures of Data Mining Systems.

Data mining primitives: what defines a data mining task?
Why Data Mining Primitives and Languages?
1. Finding all the patterns autonomously in a database is unrealistic, because the patterns could be too many and mostly uninteresting.
2. Data mining should be an interactive process: the user directs what is to be mined.
3. Users must be provided with a set of primitives to communicate with the data mining system.
4. Incorporating these primitives in a data mining query language gives:
* more flexible user interaction
* a foundation for the design of graphical user interfaces
* standardization of data mining industry and practice
What Defines a Data Mining Task?
1. Task-relevant data: typically we are interested in only a subset of the entire database. Specify:
1. the name of the database/data warehouse (AllElectronics_db)
2. the names of the tables/data cubes containing the relevant data (item, customer, purchases, items_sold)
3. the conditions for selecting the relevant data (purchases made in Canada for the relevant year)
4. the relevant attributes or dimensions (name and price from item; income and age from customer)
2. Type of knowledge to be mined: concept description, association, classification, prediction, clustering, or evolution analysis. Examples:
* Studying the buying habits of customers: mine associations between customer profiles and the items they like to buy; use this information to recommend items to put on sale to increase revenue.
* Studying real estate transactions: mine clusters to determine house characteristics that make for fast sales; use this information to make recommendations to house sellers who want or need to sell quickly.
* Studying the relationship between individuals' sports statistics and salary: use this information to help sports agents and team owners negotiate an individual's salary.
Pattern templates that all discovered patterns must match:
P(X:Customer, W) and Q(X, Y) => buys(X, Z)
* X is the key of the customer relation
* P & Q are predicate variables, instantiated to relevant attributes
* W & Z are object variables that can take on the values of their respective predicates.
The search for association rules is confined to those matching some set of rules, such as:
age(X, 30..39) & income(X, 40K..49K) => buys(X, VCR) [2.2%, 60%]
i.e., customers in their thirties with an annual income of 40-49K are likely (with 60% confidence) to purchase a VCR, and such cases represent about 2.2% of the total number of transactions.

What Defines a Data Mining Task? It defines:
1. Task-relevant data
2. Type of knowledge to be mined
3. Background knowledge
4. Pattern interestingness measurements
5. Visualization of discovered patterns
1. Task-Relevant Data (Minable View): consider the following for task-relevant data:
a. Database or data warehouse name
b. Database tables or data warehouse cubes
c. Condition for data selection
d. Relevant attributes or dimensions
e. Data grouping criteria
2. Types of knowledge to be mined: consider the following:
a. Characterization
b. Discrimination
c. Association
d. Classification/prediction
e. Clustering
f. Outlier analysis
g. Other data mining tasks
3. Background Knowledge: Concept Hierarchies:
a. Allow discovery of knowledge at multiple levels of abstraction.
b. Represented as a set of nodes organized in a tree; each node represents a concept; a special node, "all", is reserved for the root of the tree.
c. Concept hierarchies allow raw data to be handled at a higher, more generalized level of abstraction.
d. There are four major types of concept hierarchies: schema, set-grouping, operation-derived, and rule-based.
1. Schema hierarchy: a total or partial order among attributes in the database schema; formally expresses existing semantic relationships between attributes.
a. Table address:
create table address (
  street char(50),
  city char(30),
  province_or_state char(30),
  country char(40));
b. Concept hierarchy for location:
street < city < province_or_state < country
2. Set-grouping hierarchy: organizes the values of a given attribute or dimension into groups or constant value ranges.
a. Example for age: {young, middle_aged, senior} is a subset of all(age)
i. {20-39} = young
ii. {40-59} = middle_aged
iii. {60-89} = senior
3. Operation-derived hierarchy: based on operations specified by users, experts, or the data mining system.
a. An email address or a URL contains hierarchy information relating departments, universities (or companies), and countries.
b. E-mail address: dmbook@cs.sfu.ca
c. Partial concept hierarchy: login-name < department < university < country
4. Rule-based hierarchy: a whole concept hierarchy, or a portion of it, is defined by a set of rules and is evaluated dynamically based on the current data and the rule definitions.
a. The following rules categorize items as low, medium, or high profit margin:
* low profit margin: < $50
* medium profit margin: between $50 and $250
* high profit margin: > $250
b. Rule-based concept hierarchy:
* low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
* medium_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) >= $50 and (P1 - P2) <= $250
* high_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) > $250
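The three rules above translate directly into a plain function: the category depends only on the difference between price (P1) and cost (P2). The sample price/cost pairs are invented.

```python
# Rule-based concept hierarchy for profit margin, as defined above.
def profit_margin(price, cost):
    margin = price - cost
    if margin < 50:
        return "low_profit_margin"
    if margin <= 250:        # i.e. $50 <= margin <= $250
        return "medium_profit_margin"
    return "high_profit_margin"

print(profit_margin(60, 30))    # low_profit_margin    (margin $30)
print(profit_margin(300, 100))  # medium_profit_margin (margin $200)
print(profit_margin(800, 100))  # high_profit_margin   (margin $700)
```

Because the hierarchy is rule-based, re-evaluating the function on current data is all it takes to keep the categorization up to date.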
Measurements of Pattern Interestingness:
1. After specification of the task-relevant data and the kind of knowledge to be mined, the data mining process may still generate a large number of patterns.
2. Typically, only a small portion of these patterns will actually be of interest to a given user.
3. The user needs to further confine the number of uninteresting patterns returned, by utilizing interestingness measures.
4. There are four types: simplicity, certainty, utility, novelty.
1. Simplicity: objective measures viewed as functions of the pattern structure, such as the number of attributes or operators. The more complex a rule, the more difficult it is to interpret, and thus the less interesting. Example measures: rule length or the number of leaves in a decision tree.
2. Certainty: a measure of certainty associated with a pattern that assesses its validity or trustworthiness.
*A confidence of 85% for the association rule buys (X, computer) => buys (X, software) means that 85% of all customers who bought a computer also bought software.
3. Utility:
*Estimated by a utility function such as support, the percentage of task-relevant data tuples for which the pattern is true.
*support (A => B) = (# tuples containing both A and B) / (total # of tuples)
4. Novelty: those patterns that contribute new information or increased performance to the pattern set
*not previously known, surprising
Visualization of Discovered Patterns:
1. Different backgrounds/usages may require different forms of representation, e.g., rules, tables, crosstabs, pie/bar charts, etc.
2. Concept hierarchy is also important: discovered knowledge might be more understandable when represented at a high level of abstraction; interactive drill-up/down, pivoting, slicing and dicing provide different perspectives on the data.
3. Different kinds of knowledge require different representations: association, classification, clustering, etc.
Data Mining Query Language (DMQL):
1. A DMQL can provide the ability to support ad-hoc and interactive data mining by providing a standardized language like SQL
*Hope to achieve an effect similar to that of SQL on relational databases
*Foundation for system development and evolution
*Facilitates information exchange, technology transfer, commercialization and wide acceptance
2. Design: DMQL is designed with the primitives described earlier
Syntax for DMQL:
1. Syntax for specification of: task-relevant data, the kind of knowledge to be mined, concept hierarchy specification, interestingness measures, pattern presentation and visualization
2. Putting it all together: a DMQL query
Syntax for task-relevant data specification:
1. use database database_name, or use data warehouse data_warehouse_name directs the data mining task to the database or data warehouse specified
2. from relation(s)/cube(s) [where condition]: specify the database tables or data cubes involved and the conditions defining the data to be retrieved
3. in relevance to att_or_dim_list: lists the attributes or dimensions for exploration
order by order_list: specifies the sorting order of the task-relevant data
group by grouping_list: specifies criteria for grouping the data
having condition: specifies the condition by which groups of data are considered relevant
Top-level syntax:
<DMQL> ::= <DMQL_Statement>; {<DMQL_Statement>}
Syntax for specifying the kind of knowledge to be mined: a separate clause is defined for each kind of knowledge, for example:
*Characterization: mine characteristics [as pattern_name] analyze measure(s)
*Discrimination: mine comparison [as pattern_name] for target_class where target_condition versus contrast_class where contrast_condition analyze measure(s)
*Association: mine associations [as pattern_name]
Syntax for concept hierarchy specification: use hierarchy hierarchy_name for attribute_or_dimension
Syntax for interestingness measure specification:
1. Interestingness measures and thresholds can be specified by the user with the statement: with <interest_measure_name> threshold = threshold_value *Example: with support threshold = 0.05 with confidence threshold = 0.7 Syntax for pattern presentation and visualization specification:
We have syntax which allows users to specify the display of discovered patterns in one or more forms display as <result_form>
Result_form = Rules, tables, crosstabs, pie or bar charts, decision trees, cubes, curves, or surfaces
2.To facilitate interactive viewing at different concept level, the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension | drill down on attribute_or_dimension | add attribute_or_dimension | drop attribute_or_dimension
Putting it all together — a DMQL query: a complete query combines the clauses above, beginning with use database database_name (or use data warehouse data_warehouse_name), followed by the knowledge specification, concept hierarchy, interestingness measure, and presentation clauses.
Other issues — designing graphical user interfaces based on a data mining query language:
1. What tasks should be considered in the design of GUIs based on a data mining query language?
*Data collection and data mining query composition
*Presentation of discovered patterns
*Hierarchy specification and manipulation
*Manipulation of data mining primitives
*Interactive multilevel mining
*Other miscellaneous information
Data Mining System Architectures:
1. Coupling a data mining system with a DB/DW system:
*No coupling: flat file processing; not recommended
*Loose coupling: fetching data from the DB/DW
*Semi-tight coupling: enhanced DM performance; provides efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
*Tight coupling: a uniform information processing environment; DM is smoothly integrated into the DB/DW system, and mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.
Summary:
1. Five primitives for specification of a data mining task: task-relevant data, kind of knowledge to be mined, background knowledge, interestingness measures, and knowledge presentation and visualization techniques to be used for displaying the discovered patterns
2. Data mining query languages: DMQL, MS OLE DB for DM, etc.
3. Data mining system architectures: no coupling, loose coupling, semi-tight coupling, tight coupling
UNIT - IV Concepts Description : Characterization and Comparison : Data Generalization and Summarization-Based Characterization, Analytical Characterization: Analysis of Attribute Relevance, Mining Class Comparisons: Discriminating between Different Classes, Mining Descriptive Statistical Measures in Large Databases.
What is concept description?:
1. Concept description deals with descriptive mining and predictive data mining.
Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
Concept description consists of two important aspects: 1. characterization and 2. comparison
1. Characterization: provides a concise and succinct summarization of the given collection of data
2. Comparison: provides descriptions comparing two or more collections of data
Data generalization and summarization-based characterization:
Data generalization: a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher ones.
*Data generalization and summarization-based characterization uses a technique called ATTRIBUTE-ORIENTED INDUCTION (AOI)
*The goal is to describe the details of the entities and their attributes without error.
*When users run OLAP over a large database, they may find it difficult to describe its contents clearly and without conflict, so the AOI technique is used.
Attribute-Oriented Induction:
1. Proposed in 1989 (KDD '89 workshop)
2. AOI is done by the following steps
For example, to describe the general characteristics of graduate students in Big_University_DB, the
DMQL query can be written as:
use Big_University_DB
mine characteristics as Science_Students
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"Msc", "MBA", "PhD"}
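The generalize-and-merge idea behind attribute-oriented induction can be sketched as follows. The concept hierarchy (city to country) and the tuples are hypothetical, not taken from the Big_University_DB example:

```python
from collections import Counter

# Hypothetical concept hierarchy: a generalization operator mapping city -> country
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}

# Hypothetical initial relation (task-relevant data): (birth_city, gpa_band) tuples
initial_relation = [("Vancouver", "excellent"), ("Toronto", "excellent"),
                    ("Seattle", "good"), ("Vancouver", "excellent")]

# Generalize each tuple to the higher-level concept, then merge identical
# generalized tuples and accumulate their counts
generalized = Counter((city_to_country[city], gpa) for city, gpa in initial_relation)
for row, count in generalized.items():
    print(row, count)  # e.g. ('Canada', 'excellent') 3
```

The counts attached to the merged tuples are exactly the aggregation performed in Step 3 of the procedure below.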
Step 1: Collect the task-relevant data (the initial relation) using a relational database query, i.e. as mentioned above.
Step 2: Perform generalization by attribute removal or attribute generalization.
Step 3: Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
Step 4: Interactive presentation with users.
Basic Principles of Attribute-Oriented Induction:
1. Data focusing: task-relevant data, including dimensions; the result is the initial relation.
2. Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
3. Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
4. Attribute-threshold control: typically 2-8; specified or default.
5. Generalized relation threshold control: controls the final relation/rule size.
Analytical characterization: Analysis of attribute relevance
First we discuss the why and what of attribute relevance analysis.
Why?
*Which dimensions should be included?
*How high a level of generalization?
*Is it interactive?
What? We use statistical methods to preprocess the data:
*filter out irrelevant or weakly relevant attributes
*retain or rank the relevant attributes
*Relevance analysis is performed relative to dimensions and levels: analytical characterization, analytical comparison
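Ranking attributes by relevance is commonly done with information gain. A minimal sketch with a made-up four-tuple data set, where attribute 0 perfectly separates the classes and attribute 1 carries no information:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Expected reduction in entropy from partitioning on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical data: attribute 0 separates the classes, attribute 1 does not
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, 0, labels))  # 1.0: highly relevant, retain
print(information_gain(rows, 1, labels))  # 0.0: irrelevant, filter out
```

Sorting attributes by this score gives the ranking used in the relevance-analysis step described next.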
How? The steps for analysis of attribute relevance:
Step 1: Data collection
Step 2: Analytical generalization
Step 3: Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
Step 4: Relevance analysis: sort and select the most relevant dimensions and levels. (Relevance analysis uses mechanisms such as entropy, normalization and decision trees to describe the details clearly.)
Step 5: Attribute-oriented induction for class description on the selected dimensions/levels
Step 6: OLAP operations (e.g. drilling, slicing) on the relevance rules
Mining class comparisons: Discriminating between different classes
Comparison: comparing two or more classes.
Method:
*Partition the set of relevant data into the target class and the contrasting class(es)
*Generalize both classes to the same high-level concepts
*Compare tuples with the same high-level descriptions
*Present, for every tuple, its description and two measures: support (distribution within a single class) and comparison (distribution between classes)
*Highlight the tuples with strong discriminant features
Relevance Analysis:
*Find attributes (features) which best distinguish different classes.
Example: Analytical comparison
1. Data collection: target and contrasting classes
2. Attribute relevance analysis: remove attributes name, gender, major, phone#
3. Synchronous generalization: controlled by user-specified dimension thresholds; prime target and contrasting class relations/cuboids
4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust the levels of abstraction of the resulting descriptions
5. Presentation: as generalized relations, crosstabs, bar charts, pie charts, or rules, with contrasting measures to reflect the comparison between target and contrasting classes, e.g. count%
Mining descriptive statistical measures in large databases
*To better understand the details of entities and attributes, we use statistical measures.
*For large OLAP databases, statistical measures allow characterization and comparison to be combined in one table using the measures t_weight (Tw) and d_weight (Dw).
*Characterization rules are also called quantitative characterization rules; comparison rules are also called quantitative comparison rules.
*Using these rules we can compare the target class and the contrasting class within the same table.
Note: the main motivation of statistical measures and mining data dispersion characteristics is to better understand the data; operations such as median, max and min are used to describe the data.
UNIT - V Mining Association Rules in Large Databases : Association Rule Mining, Mining Single-Dimensional Boolean Association Rules from Transactional Databases, Mining Multilevel Association Rules from Transaction Databases, Mining Multidimensional Association Rules from Relational Databases and Data Warehouses, From Association Mining to Correlation Analysis, Constraint-Based Association Mining.
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: 1) basket data analysis, 2) cross-marketing, 3) catalog design,
4) loss-leader analysis, 5) clustering, classification, etc.
Types of association rules:
1) Boolean association rule
2) Single-dimensional association rule
3) Multidimensional association rule
4) Multilevel association rule
5) Quantitative association rule
1) Boolean association rule: a rule that concerns associations between the presence and the absence of items (each item is treated as a Boolean attribute).
2) Single-dimensional association rule: a rule which contains only a single predicate/dimension. **Also called an intra-dimensional association rule.
3) Multidimensional association rule: a rule which contains two or more predicates/dimensions. **Also called an inter-dimensional association rule.
4) Multilevel association rule: association rules applied over different levels of data abstraction.
5) Quantitative association rule: association rules over quantitative attributes, mined with minimum support and minimum confidence.
Mining single-dimensional Boolean association rules from transactional databases:
*Consider the following simple transactional database; we mine single-dimensional Boolean association rules based on two measures, support and confidence, for each frequent pattern.
*Here we use the APRIORI ALGORITHM to find the frequent patterns in the transactional database.
The Apriori Algorithm — Example: in the example below, L1, L2, ... refer to frequent itemset tables and C1, C2, ... refer to candidate generation tables derived from database D.
STEPS TO PERFORM THE APRIORI ALGORITHM:
Step 1: Consider the simple transactional database D.
Step 2: Scan database D to generate the candidate table C1 with the support count of each item. (Note: here the minimum support count is 2.)
Step 3: Remove the items whose support count is below the minimum support; the result is the frequent 1-itemset table L1.
Step 4: Join L1 with itself to generate the candidate 2-itemsets C2.
Step 5: Again scan database D and verify the support count of each candidate; keeping those with support >= 2 gives L2:
C2 ------------------------ generates L2
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2
Step 6: Join L2 with itself to generate the candidate 3-itemsets C3: {2 3 5}.
Step 7: Again scan database D and verify the support count of each candidate; {2 3 5} has support 2, so L3 = {{2 3 5}}.
Finally, we conclude that the Apriori algorithm is used to find the frequent patterns in a repository, using the measures support and confidence.
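The whole procedure can be sketched in a few lines of Python. The four transactions below are an assumed example database, chosen so that they reproduce the L2 table above ({1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2); the function is a minimal illustration, not an optimized Apriori (it omits the subset-pruning step):

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Minimal Apriori: all itemsets appearing in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]  # candidate table C1
    frequent, k = {}, 1
    while current:
        # Scan the database and count support for each candidate (Ck -> Lk)
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation Ck+1: join frequent k-itemsets with each other
        k += 1
        current = list({a | b for a, b in combinations(level, 2) if len(a | b) == k})
    return frequent

# Assumed transactional database D (consistent with the L2 table above)
D = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
freq = apriori(D, min_support=2)
print(freq[frozenset({2, 3, 5})])  # 2: the support count of {2 3 5}
```

Each pass of the while loop corresponds to one scan of D, matching Steps 2-7 above.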
Note: if the transactional database contains many frequent items, the candidate generation table becomes difficult to construct; the solution is the FP-TREE algorithm (the Frequent Pattern tree algorithm follows a tree structure and is more efficient than Apriori).
**The efficiency of Apriori can be improved by the following methods:
1. Hash-based itemset counting: uses a hashing bucket technique
2. Transaction reduction: reduces the transactions scanned in later passes
3. Partitioning: applies a partition technique to find frequent itemsets
4. Sampling: mines frequent patterns on a sample of the data
5. Dynamic itemset counting: adds new candidate itemsets during the scan
FP-TREE ALGORITHM:
**The FP-tree algorithm is also used to find the frequent patterns in a transactional database, relational database or data warehouse.
*It solves the main problem of the Apriori algorithm: in FP-tree we find the frequent patterns WITHOUT the use of a candidate generation table.
*It uses a divide-and-conquer methodology: decompose mining tasks into smaller ones
*Avoids candidate generation
Here Minimum Support count is 2 Steps: 1. Scan DB once, find frequent 1-itemset (single item pattern) 2. Order frequent items in frequency descending order 3. Scan DB again, construct FP-tree
Note: the FP-tree uses the L-order (descending frequency) method.
*Mining the FP-tree then identifies the frequent patterns.
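Steps 1-2 (one DB scan plus the descending-frequency L-order) can be sketched as below; the transactions are illustrative assumptions, while the minimum support count of 2 follows the text above:

```python
from collections import Counter

# Hypothetical transactional database; minimum support count of 2 as above
transactions = [["f", "a", "c", "m", "p"], ["f", "a", "c", "b", "m"],
                ["f", "b"], ["c", "b", "p"], ["f", "a", "c", "m", "p"]]
min_support = 2

# Step 1: scan the DB once and find the frequent 1-itemsets
counts = Counter(item for t in transactions for item in t)
frequent = {item for item, n in counts.items() if n >= min_support}

# Step 2: rewrite each transaction in L-order (descending global frequency);
# the FP-tree is then built by inserting these reordered transactions (Step 3)
def reorder(t):
    return sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))

for t in transactions:
    print(reorder(t))
```

Because every transaction is reordered the same way, frequent prefixes are shared, which is what keeps the FP-tree compact.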
*FP-growth is faster than Apriori and also faster than the tree-projection method.
Benefits of the FP-tree Structure:
Completeness:
*never breaks a long pattern of any transaction
*preserves complete information for frequent pattern mining
Compactness:
*reduces irrelevant information: infrequent items are gone
*frequency descending ordering: more frequent items are more likely to be shared
*never larger than the original database
Mining multilevel association rules from transactional databases:
Multilevel association rule: association rules applied over different levels of data abstraction
Ex: Steps to perform multilevel association rules from transactional database are: Step1:consider frequent item sets Step2: arrange the items in hierarchy form Step3: find the Items at the lower level ( expected to have lower support)
Step 4: Apply association rules on the frequent itemsets.
Step 5: Use some method to identify the frequent itemsets at each level.
Note: support is categorized into two types:
1. Uniform support: the same minimum support for all levels
2. Reduced support: reduced minimum support at lower levels
*Multilevel association rules can be applied with both kinds of support.
Example figure: multi-level mining with uniform support
Mining multidimensional association rules from relational databases and data warehouses:
Multidimensional association rule: a rule which contains two or more predicates/dimensions (also called an inter-dimensional association rule).
The Boolean, single-dimensional and multidimensional association rules defined earlier can all be mined from a relational database or data warehouse.
**Multidimensional association rules can be applied to different types of attributes:
1. Categorical attributes: a finite number of possible values, no ordering among values
2. Quantitative attributes: numeric, with an implicit ordering among values
Note 1: a relational database can be viewed in the form of tables, and on tables we apply concept hierarchies. In a relational database we use the concept hierarchy, i.e. generalization, in order to find the frequent itemsets.
Generalization: replacing low-level attributes with high-level attributes.
Note 2: a data warehouse can be viewed as a multidimensional data model (using data cubes) in order to find the frequent patterns.
From association mining to correlation analysis:
*Correlation analysis is performed based on interestingness measures of the frequent items.
*Among frequent items we perform correlation analysis.
*Correlation analysis asks whether one frequent item is dependent on another frequent item.
For each frequent item we consider these measures to perform mining and correlation analysis.
Note: association mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Constraint-based association mining:
*Constraint-based association mining applies different types of constraints on different types of knowledge.
*The kinds of constraints used in mining are:
1. Knowledge type constraints
2. Data constraints
3. Dimension/level constraints
4. Rule constraints
5. Interestingness constraints
1. Knowledge type constraints: specify the kind of knowledge to be mined, e.g. classification or association.
2. Data constraints: SQL-like queries. Ex: find product pairs sold together in Vancouver in Dec. '98.
3. Dimension/level constraints: in relevance to region, price, brand, customer category.
4. Rule constraints: on the form of the rules to be mined (e.g., the number of predicates). Ex: small sales (price < $10) trigger big sales (sum > $200).
5. Interestingness constraints: thresholds on measures such as minimum support and minimum confidence.
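Returning to the correlation analysis described above: a common measure of the correlation between two frequent items is lift, P(A and B) / (P(A) P(B)); a value above 1 indicates positive correlation, 1 independence, and below 1 negative correlation. A minimal sketch with hypothetical baskets:

```python
def lift(transactions, a, b):
    """Correlation (lift) of items a and b: P(a and b) / (P(a) * P(b))."""
    n = len(transactions)
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    return p_ab / (p_a * p_b)

# Hypothetical baskets: coffee always appears together with milk
baskets = [{"coffee", "milk"}, {"coffee", "milk"}, {"tea"}, {"milk"}]
print(lift(baskets, "coffee", "milk"))  # about 1.33 (> 1: positively correlated)
```

A high-confidence rule can still have lift below 1, which is why correlation analysis complements the support/confidence framework.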
UNIT - VI Classification and Prediction : Issues Regarding Classification and Prediction, Classification by Decision Tree Induction, Bayesian Classification, Classification by Backpropagation, Classification Based on Concepts from Association Rule Mining, Other Classification Methods, Prediction, Classifier Accuracy. What is classification? What is prediction?: Classification:
*Classification is used for prediction (future analysis), to determine unknown attribute values, using classifier algorithms and decision trees (in data mining).
*It constructs models (such as decision trees) which then classify the attributes.
*We already know the types of attributes: 1. categorical attributes and 2. numerical attributes; classification can work on both.
Prediction:
*Prediction is also used to determine unknown or missing values.
*It also uses models in order to predict the attributes.
*Models include neural networks, if-then rules and other mechanisms.
Classification and prediction are used in applications such as:
*credit approval
*target marketing
*medical diagnosis
Issues regarding classification and prediction:
There are two issues regarding classification and prediction:
Issues (1): Data Preparation
Issues (2): Evaluating Classification Methods
Issues (1): Data Preparation: data preparation includes the following:
1) Data cleaning: preprocess the data in order to reduce noise and handle missing values (refer to the preprocessing techniques, i.e. data cleaning notes)
2) Relevance analysis (feature selection):
Remove the irrelevant or redundant attributes (refer to unit-IV AOI relevance analysis)
3) Data transformation: generalize and/or normalize the data (refer to the preprocessing techniques, i.e. data cleaning notes)
Issues (2): Evaluating Classification Methods: considering classification methods should satisfy the following properties 1. Predictive accuracy 2. Speed and scalability *time to construct the model *time to use the model 3. Robustness *handling noise and missing values 4. Scalability *efficiency in disk-resident databases 5. Interpretability: *understanding and insight provided by the model 6. Goodness of rules *decision tree size *compactness of classification rules Classification by decision tree induction: Already we know decision tree is one of the models of classification first we see what decision tree is Decision tree :
*A flow-chart-like tree structure
*Internal nodes denote a test on an attribute
*Branches represent outcomes of the test
*Leaf nodes represent class labels or class distributions
*Decision tree generation consists of two phases:
1. Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
2. Tree pruning: identifies and removes branches that reflect noise or outliers (a greedy algorithm is used to remove the outliers)
Basic algorithm (a greedy algorithm):
*The tree is constructed in a top-down recursive divide-and-conquer manner
*At the start, all the training examples are at the root
*Attributes are categorical (if continuous-valued, they are discretized in advance)
*Examples are partitioned recursively based on selected attributes
*Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Note: Why decision tree induction in data mining?
1. Relatively fast learning speed (compared with other classification methods)
2. Convertible to simple and easy-to-understand classification rules
3. Can use SQL queries for accessing databases
4. Comparable classification accuracy with other methods
Bayesian Classification:
Bayesian classification is used to predict attribute values and to determine missing values.
Bayesian Classification: Why? Because it satisfies some properties which are helpful for classifying and predicting attributes:
1. Probabilistic learning
2. Incremental
3. Probabilistic prediction
4. Standard
1. Probabilistic learning: calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems.
2. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct.
3. Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
4. Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Classification by backpropagation:
Classification by backpropagation is done by neural networks; the neural network is one model used to predict attribute values.
Advantages of the neural network model:
1. prediction accuracy is generally high
2. robust: works when training examples contain errors
3. output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
4. fast evaluation of the learned target function
Drawbacks/criticisms of the neural network model:
1. long training time
2. difficult to understand the learned function (weights)
3. not easy to incorporate domain knowledge
In neural networks, a neuron can be represented as a set of weighted inputs combined with a bias and passed through an activation function.
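A single neuron of this kind can be sketched as follows; the weights, bias and sigmoid activation are illustrative assumptions (in a real network, the weights are what backpropagation learns by repeatedly reducing the error on the training examples):

```python
from math import exp

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias, squashed by a sigmoid activation
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-total))

# Hypothetical two-input neuron with assumed weights
output = neuron([1.0, 0.5], [0.4, -0.2], bias=0.1)
print(output)  # a value between 0 and 1
```

Layers of such neurons, with the output of one layer feeding the next, form the networks trained by backpropagation.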
Classification based on concepts from association rule mining:
1. Association-Based Classification: several methods exist:
1. ARCS: quantitative association mining and clustering of association rules; mainly focused on scalability and accuracy
2. Associative classification: mines high-support, high-confidence rules of the form cond_set => y, where y is a class label
3. CAEP (classification by aggregating emerging patterns): emerging patterns (EPs) are itemsets whose support increases significantly from one class to another
k-nearest neighbor classifier (or) Instance-Based Methods:
*Instance-based learning: store the training examples and delay the processing (lazy evaluation) until a new instance must be classified
*K-nearest neighbor approach: instances are represented as points in a Euclidean space.
The k-Nearest Neighbor Algorithm:
1. All instances correspond to points in the n-D space.
2. The nearest neighbors are defined in terms of Euclidean distance.
3. The target function could be discrete- or real-valued. For discrete-valued functions, k-NN returns the most common value among the k training examples nearest to the query point.
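A minimal sketch of the discrete-valued case; the 2-D training points and class labels are made up for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(training, query, k=3):
    """k-NN: return the most common class among the k nearest training points."""
    nearest = sorted(training, key=lambda pc: dist(pc[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D training points with class labels
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B")]
print(knn_classify(training, (1.5, 1.5), k=3))  # A
print(knn_classify(training, (8, 8.5), k=3))    # B
```

All of the work happens at query time, which is exactly the lazy evaluation described above.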
1. The k-NN algorithm for continuous-valued target functions: calculate the mean value of the k nearest neighbors.
2. Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors, e.g. w = 1 / d(x_q, x_i)^2; similarly for real-valued target functions.
3. Robust to noisy data by averaging the k nearest neighbors.
4. Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
2. Case-Based Reasoning:
1. Also uses lazy evaluation and analyzes similar instances
2. Difference: instances are not points in a Euclidean space
3. Genetic Algorithms:
GA: based on an analogy to biological evolution
1. Each rule is represented by a string of bits
2. An initial population is created consisting of randomly generated rules
3. Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
4. The fitness of a rule is represented by its classification accuracy on a set of training examples
5. Offspring are generated by crossover and mutation
4. Rough Set Approach:
1. Rough sets are used to approximately or roughly define equivalence classes
2. A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
3. Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computation intensity
5. Fuzzy Sets:
1. Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
2. Attribute values are converted to fuzzy values, e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
3. For a given new sample, more than one fuzzy value may apply
4. Each applicable rule contributes a vote for membership in the categories
5. Typically, the truth values for each predicted category are summed
Prediction:
**Prediction is similar to classification:
*First, construct a model
*Second, use the model to predict unknown values
Predictive Modeling in Databases:
1. Predictive modeling: predict data values or construct generalized linear models based on the database data.
2. One can only predict value ranges or category distributions.
3. Method outline:
*Minimal generalization
*Attribute relevance analysis
*Generalized linear model construction
*Prediction
4. Prediction allows several types of regression analysis.
Regress Analysis and Log-Linear Models in Prediction:
1. Linear regression: Y = α + β X
*Two parameters, α and β, specify the line; they are estimated using the data at hand, by applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ...
2. Multiple regression: Y = b0 + b1 X1 + b2 X2; many nonlinear functions can be transformed into the above.
3. Log-linear models:
*The multi-way table of joint probabilities is approximated by a product of lower-order tables.
*Probability: p(a, b, c, d) = αab βac χad δbcd
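The least-squares estimates for the linear regression case can be sketched directly; the data points below are an assumed example lying exactly on Y = 2 + 3X:

```python
def linear_regression(xs, ys):
    """Least-squares estimates of alpha and beta in Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
           / sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical data lying exactly on Y = 2 + 3X
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
alpha, beta = linear_regression(xs, ys)
print(alpha, beta)  # 2.0 3.0
```

With the two parameters estimated, predicting a value for an unseen X is just alpha + beta * X.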
Classification Accuracy: Estimating Error Rates:
1. Partition: training-and-testing
*use two independent data sets, e.g., training set (2/3) and test set (1/3)
*used for data sets with a large number of samples
2. Cross-validation
*divide the data set into k subsamples
*use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
*for data sets of moderate size
3. Bootstrapping (leave-one-out)
*for small data sets
Note: Boosting and Bagging are two techniques that increase classification accuracy
*Applicable to decision trees or Bayesian classifiers
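The k-fold cross-validation scheme above can be sketched as follows; the ten-sample data set and k = 5 are illustrative assumptions:

```python
def k_fold_splits(data, k):
    """Partition data into k folds; each fold serves once as the test set
    while the remaining k-1 folds form the training set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

samples = list(range(10))  # hypothetical data set of 10 samples
for train, test in k_fold_splits(samples, k=5):
    print(len(train), len(test))  # 8 2 on every fold
```

Averaging the error rate over the k test folds gives the cross-validated estimate of classifier accuracy.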
UNIT - VII Cluster Analysis Introduction : Types of Data in Cluster Analysis, A Categorization of Major Clustering Methods, Partitioning Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods, Outlier Analysis.
Cluster Analysis Introduction: What is Cluster Analysis?: Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters
Note: cluster analysis is a preprocessing step for data mining algorithms General Applications of Clustering: 6. Pattern Recognition 7. Spatial Data Analysis a. create thematic maps in GIS by clustering feature spaces b. detect spatial clusters and explain them in spatial data mining 8. Image Processing 9. Economic Science (especially market research) 10. WWW a. Document classification b. Cluster Weblog data to discover groups of similar access patterns Examples of Clustering Applications:
1. Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
2. Land use: identification of areas of similar land use in an earth observation database
3. Insurance: identifying groups of motor insurance policy holders with a high average claim cost
4. City-planning: identifying groups of houses according to their house type, value, and geographical location
5. Earthquake studies: observed earthquake epicenters should be clustered along continent faults
What Is Good Clustering?:
1. A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity
2. The quality of a clustering result depends on both the similarity measure used by the method and its implementation
3. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Type of data in clustering analysis:
1. Interval-scaled variables:
To standardize the data:
- Calculate the mean absolute deviation: s_f = (1/n) (|x_1f - m_f| + ... + |x_nf - m_f|), where m_f is the mean of variable f
- Calculate the standardized measurement (z-score), z_if = (x_if - m_f) / s_f, to know the similarity among the objects
These two measurements are used to measure the similarity or dissimilarity between two data objects.
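The two measurements can be written out directly; a minimal sketch (the function name is illustrative):

```python
def standardize(values):
    # z-scores computed with the mean absolute deviation, which is less
    # sensitive to outliers than the standard deviation.
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n   # mean absolute deviation s_f
    return [(v - mean) / mad for v in values]      # z_if = (x_if - m_f) / s_f
```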
2. Binary variables:
Coefficient mechanisms are used to measure the similarity or dissimilarity between two data objects on binary variables; the Jaccard coefficient and the simple matching coefficient are the two coefficient mechanisms. A binary variable takes only the values 0 and 1; for two objects i and j, the values of the p binary variables are cross-tabulated as follows:
                object j
                1        0        sum
object i   1    a        c        a + c
           0    b        d        b + d
         sum    a + b    c + d    p

simple matching coefficient (symmetric binary variables): d(i, j) = (b + c) / (a + b + c + d)
Jaccard coefficient (asymmetric binary variables): d(i, j) = (b + c) / (a + b + c)
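Reading the counts a, b, c, d off the table gives both coefficients directly; a small sketch (the function name is illustrative):

```python
def binary_dissimilarity(x, y, asymmetric=False):
    # x, y: 0/1 value lists for objects i and j over the same binary variables.
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # 1-1 matches
    b = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # i=0, j=1
    c = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # i=1, j=0
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # 0-0 matches
    if asymmetric:
        return (b + c) / (a + b + c)      # Jaccard coefficient: 0-0 matches ignored
    return (b + c) / (a + b + c + d)      # simple matching coefficient
```

The Jaccard form is used for asymmetric binary variables, where a 1 (e.g., a positive test result) is more informative than a 0.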
3. Nominal variables:
A generalization of the binary-variable approach and coefficient mechanisms are used to measure the similarity or dissimilarity between two data objects on nominal variables. A nominal variable can take more than two states, e.g., red, yellow, blue, green.
4. Ordinal variables:
- An ordinal variable can be discrete or continuous
- order is important, e.g., rank (gold, silver, bronze)
- coefficient mechanisms and deviation mechanisms are used to measure the similarity or dissimilarity between two data objects on ordinal variables: replace each value x_if by its rank r_if in {1, ..., M_f}, map it onto [0, 1] by z_if = (r_if - 1) / (M_f - 1), and then treat z_if as interval-scaled
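The rank mapping can be sketched as follows (the medal example follows the notes; the function name is illustrative):

```python
def ordinal_rank_score(value, levels):
    # levels: the states in order, lowest to highest, e.g. bronze < silver < gold.
    r = levels.index(value) + 1     # rank r_if in 1..M_f
    M = len(levels)
    return (r - 1) / (M - 1)        # z_if in [0, 1], treated as interval-scaled
```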
5. Ratio-scaled variables:
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt) (A and B are positive constants)
- apply a method such as a logarithmic transformation, y_if = log(x_if), and then measure the similarity or dissimilarity between two data objects on the transformed, interval-scaled values
6. Variables of mixed types:
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
- to measure the similarity or dissimilarity between objects of mixed types, use one mechanism: the weighted formula
d(i, j) = [ Σ_{f=1..p} δ_ij^(f) d_ij^(f) ] / [ Σ_{f=1..p} δ_ij^(f) ]

where the indicator δ_ij^(f) is 0 if x_if or x_jf is missing (and 1 otherwise), and the contribution d_ij^(f) of variable f depends on its type:
- f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and 1 otherwise
- f is interval-based: use the normalized absolute difference
- f is ordinal or ratio-scaled: compute the rank r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
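A direct transcription of the weighted formula, assuming interval and ordinal values have already been normalized to [0, 1] and using None for missing values (the function name and type labels are illustrative):

```python
def mixed_dissimilarity(x, y, types):
    # types[f] is one of "nominal", "binary", "interval", "ordinal".
    num = den = 0.0
    for xf, yf, t in zip(x, y, types):
        if xf is None or yf is None:     # delta_ij^(f) = 0: skip missing values
            continue
        if t in ("nominal", "binary"):
            d = 0.0 if xf == yf else 1.0  # mismatch indicator
        else:                             # interval or ordinal, already in [0, 1]
            d = abs(xf - yf)
        num += d                          # sum of delta * d
        den += 1.0                        # sum of delta
    return num / den if den else 0.0
```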
A Categorization of Major Clustering Methods:
1. Partitioning algorithms: construct various partitions and then evaluate them by some criterion
2. Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
3. Density-based: based on connectivity and density functions
4. Grid-based: based on a multiple-level granularity structure
5. Model-based: a model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster
- k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
The k-means algorithm:
a. Partition the objects into k nonempty subsets
b. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
c. Assign each object to the cluster with the nearest seed point
d. Go back to step b; stop when no new assignments occur
K-Means example:
1. Consider 2, 3, 6, 8, 9, 12, 15, 18, 22, to be broken into 3 clusters:
   Cluster 1 - 2, 8, 15; mean = 8.3
   Cluster 2 - 3, 9, 18; mean = 10
   Cluster 3 - 6, 12, 22; mean = 13.3
2. Re-assign:
   Cluster 1 - 2, 3, 6, 8, 9; mean = 5.6
   Cluster 2 - (empty); mean = 0
   Cluster 3 - 12, 15, 18, 22; mean = 16.75
3. Re-assign:
   Cluster 1 - 3, 6, 8, 9; mean = 6.5
   Cluster 2 - 2; mean = 2
   Cluster 3 - 12, 15, 18, 22; mean = 16.75
4. Re-assign:
   Cluster 1 - 6, 8, 9; mean = 7.67
   Cluster 2 - 2, 3; mean = 2.5
   Cluster 3 - 12, 15, 18, 22; mean = 16.75
5. Re-assign (12 is now closer to 7.67 than to 16.75):
   Cluster 1 - 6, 8, 9, 12; mean = 8.75
   Cluster 2 - 2, 3; mean = 2.5
   Cluster 3 - 15, 18, 22; mean = 18.33
6. Re-assign: no change, so we are done
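The iteration can be checked with a short one-dimensional k-means sketch, following the notes' convention of resetting an empty cluster's mean to 0:

```python
def kmeans_1d(points, clusters):
    # clusters: initial partition as a list of lists; iterate until stable.
    means = [sum(c) / len(c) if c else 0.0 for c in clusters]
    while True:
        new = [[] for _ in clusters]
        for p in points:
            # assign p to the cluster with the nearest mean
            i = min(range(len(means)), key=lambda j: abs(p - means[j]))
            new[i].append(p)
        if new == clusters:          # assignments stable: converged
            return clusters
        clusters = new
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
```

Starting from the initial partition {2, 8, 15}, {3, 9, 18}, {6, 12, 22}, this converges to {6, 8, 9, 12}, {2, 3}, {15, 18, 22}: once cluster 1 is {6, 8, 9}, its mean 7.67 is closer to 12 (distance 4.33) than 16.75 is (distance 4.75), so 12 joins cluster 1.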
k-medoids algorithm (PAM):
1. Use a real object (a medoid) to represent each cluster; arbitrarily choose k objects as the initial medoids
2. repeat
   * Assign each remaining object to the cluster of the nearest medoid
   * Randomly select a nonmedoid object o_random
   * Compute the total cost, S, of swapping a medoid o_j with o_random
   * If S < 0 then swap o_j with o_random
3. until there is no change
K-Medoids example:
1. Consider 1, 2, 6, 7, 8, 10, 15, 17, 20, to be broken into 3 clusters; pick 6, 7, 8 as the initial medoids:
   Cluster (medoid 6) - 1, 2
   Cluster (medoid 7) - (empty)
   Cluster (medoid 8) - 10, 15, 17, 20
2. Random non-medoid 15 replaces 7 (total cost = -13):
   Cluster (medoid 6) - 1 (cost 0), 2 (cost 0), 7 (cost 1 - 0 = 1)
   Cluster (medoid 8) - 10 (cost 0)
   New cluster (medoid 15) - 17 (cost 2 - 9 = -7), 20 (cost 5 - 12 = -7)
3. Replace medoid 7 with new medoid 15 and reassign:
   Cluster (medoid 6) - 1, 2, 7
   Cluster (medoid 8) - 10
   Cluster (medoid 15) - 17, 20
4. Random non-medoid 1 replaces 6 (total cost = 2):
   Cluster (medoid 8) - 7 (cost 6 - 1 = 5), 10 (cost 0)
   Cluster (medoid 15) - 17 (cost 0), 20 (cost 0)
   New cluster (medoid 1) - 2 (cost 1 - 4 = -3)
5. 2 replaces 6 (total cost = 1)
6. Don't replace medoid 6:
   Cluster (medoid 6) - 1, 2, 7
   Cluster (medoid 8) - 10
   Cluster (medoid 15) - 17, 20
7. Random non-medoid 7 replaces 6 (total cost = 2):
   Cluster (medoid 8) - 10 (cost 0)
   Cluster (medoid 15) - 17 (cost 0), 20 (cost 0)
   New cluster (medoid 7) - 6 (cost 1 - 0 = 1), 2 (cost 5 - 4 = 1)
8. Don't replace medoid 6:
   Cluster (medoid 6) - 1, 2, 7
   Cluster (medoid 8) - 10
   Cluster (medoid 15) - 17, 20
9. Random non-medoid 10 replaces 8 (total cost = 2): don't replace
   Cluster (medoid 6) - 1 (cost 0), 2 (cost 0), 7 (cost 0)
   Cluster (medoid 15) - 17 (cost 0), 20 (cost 0)
   New cluster (medoid 10) - 8 (cost 2 - 0 = 2)
10. Random non-medoid 17 replaces 15 (total cost = 0): don't replace
   Cluster (medoid 6) - 1 (cost 0), 2 (cost 0), 7 (cost 0)
   Cluster (medoid 8) - 10 (cost 0)
   New cluster (medoid 17) - 15 (cost 2 - 0 = 2), 20 (cost 3 - 5 = -2)
11. Random non-medoid 20 replaces 15 (total cost = 3): don't replace
   Cluster (medoid 6) - 1 (cost 0), 2 (cost 0), 7 (cost 0)
   Cluster (medoid 8) - 10 (cost 0)
   New cluster (medoid 20) - 15 (cost 5 - 0 = 2), 17 (cost 3 - 2 = 1)
12. Other possible changes all have high costs:
    1 replaces 15, 2 replaces 15, 1 replaces 8, ...
13. No changes; final clusters:
   Cluster (medoid 6) - 1, 2, 7
   Cluster (medoid 8) - 10
   Cluster (medoid 15) - 17, 20
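Each candidate swap in such a run is judged by the total cost, i.e., the summed distance of every object to its nearest medoid; a minimal sketch over the same data:

```python
def total_cost(points, medoids):
    # Sum, over all points, of the distance to the nearest medoid.
    return sum(min(abs(p - m) for m in medoids) for p in points)
```

Swapping medoid 7 for 15 lowers the total cost from 39 to 19, so that swap is kept; a swap that raises the cost, e.g., 20 replacing 15, is rejected.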
Hierarchical Clustering:
1. Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
2. Uses a dendrogram approach: decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
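The dendrogram can be built bottom-up (agglomeratively): start with singleton clusters and repeatedly merge the closest pair until the termination condition, here a target number of clusters k, is met. Single-link distance is an assumed choice, and the sketch is restricted to 1-D points:

```python
def agglomerative(points, k):
    # Bottom-up hierarchical clustering of 1-D points with single-link distance.
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > k:
        # single link: the distance between two clusters is their closest pair
        def link(i, j):
            return min(abs(a - b) for a in clusters[i] for b in clusters[j])
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(*ij))
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters
```

Cutting the dendrogram at a different level simply corresponds to stopping at a different k.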
Density-Based Clustering:
1. Clustering based on density (a local cluster criterion), such as density-connected points
2. Major features:
   a. Discovers clusters of arbitrary shape
   b. Handles noise
   c. One scan
   d. Needs density parameters as a termination condition
3. Uses density functions
DBSCAN is a typical density-based algorithm: it measures the similarity or dissimilarity between objects by growing clusters from core objects whose neighborhood of radius Eps contains at least MinPts points; objects belonging to no cluster are treated as noise.
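A compact sketch of the DBSCAN idea for 1-D points; Eps and MinPts are the two density parameters, and this is an illustration rather than a full implementation:

```python
def dbscan(points, eps, min_pts):
    # Labels: None = unvisited, -1 = noise, otherwise a cluster id (0, 1, ...).
    labels = {p: None for p in points}

    def neighbors(p):
        # all points within radius eps of p (including p itself)
        return [q for q in points if abs(q - p) <= eps]

    cid = -1
    for p in points:
        if labels[p] is not None:
            continue
        nbrs = neighbors(p)
        if len(nbrs) < min_pts:
            labels[p] = -1               # not a core point: tentatively noise
            continue
        cid += 1                         # p is a core point: start a new cluster
        labels[p] = cid
        seeds = [q for q in nbrs if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cid          # noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cid
            if len(neighbors(q)) >= min_pts:   # q is also core: keep expanding
                seeds.extend(neighbors(q))
    return labels
```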
Grid-Based Clustering Methods:
- Use a multi-resolution grid data structure
- Several interesting methods measure the similarity or dissimilarity between the objects of different types:
- STING (a STatistical INformation Grid approach) by Wang, Yang, and Muntz (1997)
- WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
- CLIQUE by Agrawal et al. (SIGMOD'98)
Consider any one approach; here we consider STING: A Statistical Information Grid Approach
- The spatial area is divided into rectangular cells
- Each cell at a high level is partitioned into a number of smaller cells at the next lower level
- Statistical info of each cell is calculated and stored beforehand and is used to answer queries
- Parameters of higher-level cells can be easily calculated from parameters of lower-level cells
- Start from a pre-selected layer, typically with a small number of cells
- For each cell in the current level, compute the confidence interval
- Remove the irrelevant cells from further consideration
- When finished examining the current layer, proceed to the next lower level
- Repeat this process until the bottom layer is reached
- Advantages: query-independent, easy to parallelize, incremental update
- Disadvantages: all the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
Model-Based Clustering Methods:
1. Attempt to optimize the fit between the data and some mathematical model
2. Statistical and AI approaches
Conceptual clustering (a statistical approach):
1. Produces a classification scheme for a set of unlabeled objects
2. Finds a characteristic description for each concept (class)
COBWEB (Fisher, 1987):
1. A popular and simple method of incremental conceptual learning
2. Creates a hierarchical clustering in the form of a classification tree
3. Each node refers to a concept and contains a probabilistic description of that concept
Other model-based approaches:
1. Neural network approach
   a. Represents each cluster as an exemplar, acting as a prototype of the cluster
   b. New objects are distributed to the cluster whose exemplar is the most similar according to some distance measure
2. Competitive learning
   a. Involves a hierarchical architecture of several units (neurons)
   b. Neurons compete in a winner-takes-all fashion for the object currently being presented
Outlier Analysis:
1. What are outliers?
   - A set of objects that are considerably dissimilar from the remainder of the data
   - Example: sports stars such as Michael Jordan, Wayne Gretzky, ...
2. Problem: find the top n outlier points
3. Applications:
   - Credit card fraud detection
   - Telecom fraud detection
   - Customer segmentation
   - Medical analysis
Note: any of the clustering methods above can be used to solve the outlier analysis problem; objects that fall outside all clusters are candidate outliers
Summary:
1. Cluster analysis groups objects based on their similarity and has wide applications
2. Measures of similarity can be computed for various types of data
3. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
4. Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
5. There are still lots of research issues on cluster analysis, such as constraint-based clustering
UNIT - VIII Mining Complex Types of Data : Multidimensional Analysis and Descriptive Mining of Complex Data Objects, Mining Spatial Databases, Mining Multimedia Databases, Mining Time-Series and Sequence Data, Mining Text Databases, Mining the World Wide Web.
Generalization of structured data: depending on the type of attribute, generalization is performed differently; consider two types of attributes, set-valued attributes and list-valued (sequence-valued) attributes.
For a set-valued attribute:
- Generalize each value in the set into its corresponding higher-level concepts
- Derive the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
- E.g., a set value such as violin can be generalized to the higher-level concept music
For a list-valued or sequence-valued attribute:
- Same as set-valued attributes, except that the order of the elements in the sequence should be observed in the generalization
Mining Spatial Databases (mining a spatial data warehouse):
1. Spatial data warehouse: an integrated, subject-oriented, time-variant, and nonvolatile spatial data repository for data analysis and decision making
2. A spatial data cube contains both dimensions and measures for spatial components
3. Example: BC weather-probe analysis
   Input:
   - A map with about 3,000 weather probes scattered in B.C.
   - Daily data for temperature, precipitation, wind velocity, etc.
   - Concept hierarchies for all attributes
   Output:
   - A map that reveals patterns: merged (similar) regions
   Goals:
   - Interactive analysis (drill-down, slice, dice, pivot, roll-up)
   - Fast response time
   - Minimizing storage space used
   Challenge:
   - A merged region may contain hundreds of primitive regions (polygons)
Figure: star schema of the BC weather spatial data warehouse
- Dimension table: region_name, time, temperature, precipitation
- Fact table measurements: region_map, area, count
Note: a spatial data warehouse uses OLAP operations such as drill-down, slice, dice, pivot, and roll-up
Mining multimedia databases:
Multimedia database: a multimedia database is a collection of media objects and their descriptions, such as keywords, captions, size, audio, video, and other media
* A multimedia database uses content-based retrieval systems
* Content-based retrieval means performing query analysis based on the type of media
For example: Find all of the images that are similar to the given image sample
Compare the feature vector (signature) extracted from the sample with the feature vectors of images that have already been extracted and indexed in the image database. A multimedia database uses methods such as:
- Color histogram-based signature
- Multifeature composed signature
- Wavelet-based signature
These signatures are used for performing queries on color images (color images being one type of media).
C-BIRD: Content-Based Image Retrieval from Digital libraries
Figure: mining a multimedia database. Using the above methods we can query the multimedia database, i.e., search for any one type of media within it.
For example, the C-BIRD software above shows that we can search the multimedia database:
- by image colors
- by color percentage
- by color layout
- by texture density
- by texture layout
- by object model
- by illumination invariance
- by keywords
A multimedia database also contains dimensions and measurements, and uses a multimedia data cube.
Mining time-series and sequence data:
1. Time-series database
   - Consists of sequences of values or events changing with time
   - Data is recorded at regular intervals
   - Characteristic time-series components: trend, cycle, seasonal, irregular
2. Applications
   - Financial: stock price, inflation
   - Biomedical: blood pressure
   - Meteorological: precipitation
A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time.
Categories of time-series movements:
- Long-term or trend movements (trend curve)
- Cyclic movements or cyclic variations, e.g., business cycles
- Seasonal movements or seasonal variations, i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
- Irregular or random movements
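The trend component is commonly exposed by smoothing out the shorter movements, e.g., with a centered moving average; a minimal sketch:

```python
def moving_average(series, window):
    # Centered moving average of odd window size; smooths out short-term
    # (seasonal and irregular) movements so the long-term trend shows through.
    half = window // 2
    return [sum(series[i - half:i + half + 1]) / window
            for i in range(half, len(series) - half)]
```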
Mining sequence data:
Mining sequence data can be done by sequential pattern mining.
Sequential pattern mining:
1. Mining of frequently occurring patterns related to time or other sequences
2. Sequential pattern mining usually concentrates on symbolic patterns
3. Examples
   - Renting Star Wars, then Empire Strikes Back, then Return of the Jedi, in that order
   - A collection of ordered events within an interval
4. Applications
   - Targeted marketing
   - Customer retention
   - Weather prediction
Mining text databases:
1. Text databases or document databases
   - Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
   - Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
2. Information retrieval
   - A field developed in parallel with database systems
   - Information is organized into (a large number of) documents
   - Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Information Retrieval
1. Typical IR systems
   - Online library catalogs
   - Online document management systems
2. Information retrieval vs. database systems
   - Some DB problems are not present in IR, e.g., update, transaction management, complex objects
   - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
1. Keyword-based association analysis
2. Automatic document classification
3. Similarity detection
   - Cluster documents by a common author
   - Cluster documents containing information from a common source
4. Link analysis: unusual correlations between entities
5. Sequence analysis: predicting a recurring event
6. Anomaly detection: finding information that violates usual patterns
7. Hypertext analysis
   - Patterns in anchors/links
   - Anchor-text correlations with linked objects
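Keyword-based similarity between two documents is often computed as the cosine of their term-frequency vectors; a minimal sketch (real IR systems add stemming, stop-word removal, and TF-IDF weighting):

```python
import math
from collections import Counter

def cosine_sim(doc_a, doc_b):
    # term-frequency vectors over whitespace-separated, lowercased tokens
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```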
Mining the World-Wide Web:
1. The WWW is a huge, widely distributed, global information service center for
   - Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
   - Hyperlink information
   - Access and usage information
2. The WWW provides rich sources for data mining
3. Challenges
   - Too huge for effective data warehousing and data mining
   - Too complex and heterogeneous: no standards and structure
Note: Web mining is a more challenging task:
1. It searches for
   - Web access patterns
   - Web structures
   - Regularity and dynamics of Web contents
2. Problems
   - The abundance problem
   - Limited coverage of the Web: hidden Web sources, majority of data in DBMS
   - Limited query interfaces based on keyword-oriented search
   - Limited customization to individual users
Figure: a multiple-layered Web information base (layer 0 up to layer n)
Web Usage Mining:
1. Mining Web log records to discover user access patterns of Web pages
2. Applications
   - Target potential customers for electronic commerce
   - Enhance the quality and delivery of Internet information services to the end user
   - Improve Web server system performance
   - Identify potential prime advertisement locations
3. Web logs provide rich information about Web dynamics
   - A typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp
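Such an entry can be pulled apart with a regular expression; a sketch using a hypothetical sample line in the common log format:

```python
import re

# Hypothetical sample entry in the common log format.
line = '192.168.1.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

pattern = r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\d+)'
ip, timestamp, method, url, status, size = re.match(pattern, line).groups()
# ip and timestamp identify who asked and when; url is the requested page
```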
Mining th
Design of a W