
UNIT 1

2 Marks

1. What is a data warehouse? A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

2. What is the significant use of a subject-oriented data warehouse? It focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing, and it provides a simple, concise view around particular subject issues by excluding data that are not useful in the decision-support process.

3. Why do we use the integrated version of a data warehouse? It is constructed by integrating multiple, heterogeneous data sources (relational databases, flat files, on-line transaction records). Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources.

When data is moved to the warehouse, it is converted.

4. What is the role of the time-variant feature in a data warehouse? The time horizon for the data warehouse is significantly longer than that of operational systems. An operational database holds current-value data, while data warehouse data provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.

5. What is meant by the nonvolatile nature of a data warehouse? It is a physically separate store of data transformed from the operational environment. Operational updates of data do not occur in the data warehouse environment.

6. State the difference between a data warehouse and an operational DBMS. Traditional heterogeneous DB integration builds wrappers/mediators on top of heterogeneous databases and is query-driven: a dictionary is used to translate a query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.

A data warehouse is update-driven and high-performance: information from heterogeneous sources is integrated in advance and stored in the warehouse for direct querying and analysis.

7. List the distinct features of OLTP vs. OLAP.
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries

8. Why do we need a separate data warehouse? The two environments serve different functions and hold different data:
- Missing data: decision support requires historical data, which operational DBs do not typically maintain.
- Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
- Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

9. Give the conceptual modeling of a data warehouse. Data warehouses are modeled in terms of dimensions and measures.
- Star schema: a fact table in the middle connected to a set of dimension tables.
- Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake.
- Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.

10. Define the distributive category of data warehouse measures. Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning; e.g., count(), sum(), min(), max().

11. Define the algebraic category of data warehouse measures. Algebraic: the measure can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function; e.g., avg() = sum()/count().

12. Define the holistic category of data warehouse measures. Holistic: there is no constant bound on the storage size needed to describe a subaggregate; e.g., median(), mode(), rank(). The three categories are contrasted in the sketch below.
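A minimal Python sketch contrasting the three categories on hypothetical partitioned data: sum is distributive, avg is algebraic (computable from the distributive sum and count), and median is holistic (partition medians are not enough).

```python
import statistics

# Hypothetical data split into three partitions.
partitions = [[10, 20, 30], [40], [1, 2]]
all_data = [x for part in partitions for x in part]

# Distributive: sum of partition sums equals the sum over all data.
assert sum(sum(p) for p in partitions) == sum(all_data)

# Algebraic: avg is derived from two distributive measures, sum and count.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
assert total / count == sum(all_data) / len(all_data)

# Holistic: the median of partition medians differs from the true median,
# so no bounded-size subaggregate per partition suffices.
print(statistics.median(all_data))                                  # 15.0
print(statistics.median(statistics.median(p) for p in partitions))  # 20
```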

13. List the OLAP operations and their functionality (a pandas sketch follows the list).
- Roll-up (drill-up): summarize data by climbing up a concept hierarchy or by dimension reduction.
- Drill-down (roll-down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient the cube; visualization; 3D to a series of 2D planes.
- Other operations: drill-across involves more than one fact table; drill-through goes through the bottom level of the cube to its back-end relational tables (using SQL).
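A minimal sketch of the first four operations using pandas on a hypothetical sales table; the table, column names, and values are illustrative, not from the source.

```python
import pandas as pd

# Hypothetical sales cube: dimensions (city, quarter, item), measure (amount).
sales = pd.DataFrame({
    "city":    ["Chennai", "Chennai", "Mumbai", "Mumbai"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["TV", "TV", "Phone", "TV"],
    "amount":  [400, 300, 250, 500],
})

# Roll-up: climb the location hierarchy (drop city, aggregate by quarter).
rollup = sales.groupby("quarter")["amount"].sum()

# Slice: select on one dimension (quarter = Q1).
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Chennai")]

# Pivot: reorient the cube into a 2-D cross-tab.
pivot = sales.pivot_table(values="amount", index="city",
                          columns="quarter", aggfunc="sum")
print(rollup, q1_slice, dice, pivot, sep="\n\n")
```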

14. What are the different views in the data warehouse design process? There are four views regarding the design of a data warehouse:
a. Top-down view: allows selection of the relevant information necessary for the data warehouse.
b. Data source view: exposes the information being captured, stored, and managed by operational systems.
c. Data warehouse view: consists of fact tables and dimension tables.
d. Business query view: sees the perspectives of data in the warehouse from the viewpoint of the end user.

15. Define enterprise warehouse. An enterprise warehouse collects all of the information about subjects spanning the entire organization.

16. What is meant by a data mart? A data mart is a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart. Data marts can be independent or dependent (sourced directly from the warehouse).

17. Define virtual warehouse. A virtual warehouse is a set of views over operational databases; only some of the possible summary views may be materialized.

18. What are the back-end tools and utilities of a data warehouse?
- Data extraction: get data from multiple, heterogeneous, and external sources.
- Data cleaning: detect errors in the data and rectify them when possible.
- Data transformation: convert data from legacy or host format to warehouse format.
- Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
- Refresh: propagate the updates from the data sources to the warehouse.

19. What are the applications of data warehousing? Three kinds of data warehouse applications:
- Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs.
- Analytical processing: multidimensional analysis of data warehouse data; supports basic OLAP operations such as slice-and-dice, drilling, and pivoting.
- Data mining: knowledge discovery from hidden patterns; supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

20. Why do we need on-line analytical mining?
- High quality of data in data warehouses: the DW contains integrated, consistent, cleaned data.
- Available information processing structure surrounding data warehouses: ODBC, OLE DB, Web accessing, service facilities, reporting and OLAP tools.
- OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
- On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks.

16 Marks

1. Give the architecture of a data warehouse and explain its usage.
2. State the difference between OLTP and OLAP in detail.
3. Explain the operations performed on a data warehouse with examples.
4. Write short notes on data warehouse metadata.
5. Explain the conceptual modeling of data warehouses.

Unit 2

1. Why do we need data preprocessing? Data in the real world is dirty: i) incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data; ii) noisy: containing errors or outliers; iii) inconsistent: containing discrepancies in codes or names.

2. List the multidimensional measures of data quality. i) Accuracy ii) Completeness iii) Consistency iv) Timeliness v) Believability vi) Value added vii) Interpretability viii) Accessibility

3. What is meant by Data cleaning? Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

4. Define Data integration. Integration of multiple databases, data cubes, or files

5. Why do we need data transformation? Data must be transformed into forms appropriate for mining, typically by normalization (sketched below):
- min-max normalization
- z-score normalization
- normalization by decimal scaling
New attributes can also be constructed from the given ones.
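A minimal sketch of the three normalization methods on hypothetical values:

```python
import statistics

# Hypothetical attribute values to be normalized.
data = [200, 300, 400, 600, 1000]

# Min-max normalization: map [min, max] onto a new range (here [0, 1]).
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

# Z-score normalization: (v - mean) / standard deviation.
mean, std = statistics.mean(data), statistics.pstdev(data)
z_scores = [(v - mean) / std for v in data]

# Decimal scaling: divide by 10^j, the smallest power that brings
# the largest absolute value below 1.
j = len(str(max(abs(v) for v in data)))
scaled = [v / 10 ** j for v in data]

print([min_max(v, min(data), max(data)) for v in data])
print(z_scores)
print(scaled)
```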

6. Define data reduction. Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same or similar analytical results.

7. What is meant by data discretization? It can be defined as part of data reduction, with particular importance for numerical data.

8. What discretization process is involved in data preprocessing? It reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

9. Define concept hierarchy. It reduces the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior), as in the sketch below.
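A minimal sketch that discretizes a numeric attribute (age) into intervals and maps the intervals to higher-level concepts; the cut points and labels are hypothetical.

```python
import bisect

cut_points = [30, 50]                        # intervals: <30, 30-49, >=50
labels = ["young", "middle-aged", "senior"]  # one concept per interval

def generalize(age):
    # bisect_right finds which interval the age falls into.
    return labels[bisect.bisect_right(cut_points, age)]

ages = [23, 35, 47, 61]
print([generalize(a) for a in ages])
# ['young', 'middle-aged', 'middle-aged', 'senior']
```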

10. Why do we need data mining primitives and languages? Finding all the patterns autonomously in a database is unrealistic because the patterns could be too many but uninteresting. Data mining should be an interactive process in which the user directs what is to be mined. Users must therefore be provided with a set of primitives to communicate with the data mining system, and incorporating these primitives in a data mining query language allows flexible user interaction.

11. What are the types of knowledge to be mined? Characterization, discrimination, association, classification, prediction, clustering, and outlier/trend analysis.

12. Define data mining query language (DMQL). i) A DMQL can provide the ability to support ad-hoc and interactive data mining. ii) By providing a standardized language like SQL, we a) hope to achieve an effect similar to that of SQL on relational databases, b) provide a foundation for system development and evolution, and c) facilitate information exchange, technology transfer, commercialization, and wide acceptance.

13. What tasks should be considered in the design of GUIs based on a data mining query language?

i) Data collection and data mining query composition ii) Presentation of discovered patterns iii) Hierarchy specification and manipulation iv) Manipulation of data mining primitives v) Interactive multilevel mining

vi) Other miscellaneous information

14. What are the types of coupling of a data mining system with a DB/DW system?
1. No coupling: flat file processing; not recommended.
2. Loose coupling: fetching data from the DB/DW.
3. Semi-tight coupling: enhanced DM performance; efficiently implement a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions.
4. Tight coupling: a uniform information processing environment; DM is smoothly integrated into the DB/DW system, and mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.

15. List the five primitives for specification of a data mining task.
- The set of task-relevant data to be mined
- The kind of knowledge to be mined
- The background knowledge to be used in the discovery process
- The interestingness measures and thresholds for pattern evaluation
- The representation and visualization techniques to be used for displaying the discovered patterns

16. Descriptive vs. predictive data mining i) Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms ii) Predictive mining: Based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data

17. What is the strength of Data Characterization? i) An efficient implementation of data generalization

ii) Computation of various kinds of measures a) e.g., count( ), sum( ), average( ), max( ) iii) Generalization and specialization can be performed on a data cube by roll-up and drill-down

18. Give the list of limitations of data characterization. a. It can handle only dimensions of simple nonnumeric data and measures of simple aggregated numeric values. b. It lacks intelligent analysis: it can't tell which dimensions should be used and what levels the generalization should reach.

19. How is attribute-oriented induction done?
1. Collect the task-relevant data (initial relation) using a relational database query.
2. Perform generalization by attribute removal or attribute generalization.
3. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
4. Present the results interactively to users.

20. Give the basic principle of attribute-oriented induction.
- Data focusing: collect the task-relevant data, including dimensions; the result is the initial relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
- Attribute-threshold control: typically 2-8, specified or default.

21. What is the basic algorithm for attribute-oriented induction?
1. InitialRel: query processing of task-relevant data, deriving the initial relation.
2. PreGen: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize.
3. PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.
4. Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.

22. State the difference between characterization and OLAP.
Similarity:
- Presentation of data summarization at multiple levels of abstraction.
- Interactive drilling, pivoting, slicing and dicing.
Differences:
- Automated allocation of the desired level.
- Dimension relevance analysis and ranking when there are many relevant dimensions.
- Sophisticated typing on dimensions and measures.
- Analytical characterization: data dispersion analysis.

23. What is meant by boxplot analysis?
- Data is represented with a box.
- The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
- The median is marked by a line within the box.
- Whiskers: two lines outside the box extend to the minimum and maximum.

24. Define histogram analysis. Histograms are graph displays of basic statistical class descriptions. A frequency histogram is a univariate graphical method consisting of a set of rectangles that reflect the counts or frequencies of the classes present in the given data.

25. What is meant by a quantile plot?
1. It displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences.
2. It plots quantile information.
3. For data x_i sorted in increasing order, f_i indicates that approximately 100*f_i% of the data are below or equal to the value x_i.

26. Define quantile-quantile (Q-Q) plot. It graphs the quantiles of one univariate distribution against the corresponding quantiles of another, allowing the user to view whether there is a shift in going from one distribution to the other.

27. Define scatter plot. Each pair of values is treated as a pair of coordinates and plotted as points in the plane.

28. Give the definition of a loess curve. A loess curve adds a smooth curve to a scatter plot to give better perception of the pattern of dependence; it is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression.

29. How can we measure the dispersion of data? Quartiles, outliers, and boxplots (sketched below):
- Quartiles: Q1 (25th percentile), Q3 (75th percentile).
- Inter-quartile range: IQR = Q3 - Q1.
- Five-number summary: min, Q1, median, Q3, max.
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend to the extremes, and outliers are plotted individually.
- Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1.
Also: variance and standard deviation.
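A minimal sketch of these dispersion measures on hypothetical data (note that quartile conventions vary slightly between textbooks and libraries):

```python
import statistics

data = sorted([6, 7, 8, 9, 10, 11, 12, 13, 45])
q1, med, q3 = statistics.quantiles(data, n=4)  # the three quartile cuts
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low or v > high]

print("five-number summary:", min(data), q1, med, q3, max(data))
print("IQR:", iqr, "outliers:", outliers)      # 45 is flagged as an outlier
print("variance:", statistics.variance(data))
print("std dev:", statistics.stdev(data))
```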

16 Marks

1. Explain the major tasks in data preprocessing.
2. What is data cleaning? List and explain various techniques used for data cleaning.
3. How is attribute-oriented induction implemented? Explain with an example.
4. Why do we preprocess the data? Explain how data preprocessing techniques can improve the quality of the data.
5. List out and describe the primitives for specifying a data mining task.
6. Describe how concept hierarchies and data generalization are useful in data mining.

Unit 3

1. What are the applications of association rule mining? Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc. State the rule measures for finding associations: support, s, is the probability that a transaction contains {X, Y, Z}; confidence, c, is the conditional probability that a transaction containing {X, Y} also contains Z.

2. What are the different ways to find associations? Boolean vs. quantitative associations (based on the types of values handled):
- buys(x, "SQLServer") ^ buys(x, "DMBook") => buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ^ income(x, "42..48K") => buys(x, "PC") [1%, 75%]
Single-dimension vs. multi-dimensional associations (see the examples above); single-level vs. multiple-level analysis (e.g., what brands of beers are associated with what brands of diapers?).

3. Why is counting the supports of candidates a problem? The total number of candidates can be very huge, and one transaction may contain many candidates.

4. Give the method to find the supports of candidates.
- Candidate itemsets are stored in a hash-tree.
- A leaf node of the hash-tree contains a list of itemsets and counts.
- An interior node contains a hash table.
- Subset function: finds all the candidates contained in a transaction.

5. List the methods to improve Apriori's efficiency (a sketch of the basic Apriori loop follows).
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
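For reference, a minimal sketch of the basic Apriori loop itself (join, prune, and support counting by repeated DB scans; none of the efficiency tricks above). The transactions reuse the four from the 16-mark exercise later in this unit; min_sup = 3 corresponds to 60% of 4 transactions rounded up.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)
    k = 2
    while current:
        # Join: merge frequent (k-1)-itemsets into k-candidates;
        # prune any candidate with an infrequent (k-1)-subset.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Support counting: one full scan of the transactions per level.
        current = {c: sum(1 for t in transactions if c <= t)
                   for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# The four transactions from the 16-mark exercise; min_sup = 3 (60% of 4).
db = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
      {"C", "A", "B", "E"}, {"B", "A", "D"}]
print(apriori(db, 3))  # frequent: {A},{B},{D},{A,B},{A,D},{B,D},{A,B,D}
```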

6. What are the advantages of the FP-tree structure?
Completeness:
- It never breaks a long pattern of any transaction.
- It preserves complete information for frequent pattern mining.
Compactness:
- It reduces irrelevant information: infrequent items are gone.
- Frequency-descending ordering: more frequent items are more likely to be shared.
- It is never larger than the original database (not counting node-links and counts).

7. Give the method for mining frequent patterns using the FP-tree structure.
- For each item, construct its conditional pattern base, and then its conditional FP-tree.
- Repeat the process on each newly created conditional FP-tree.
- Stop when the resulting FP-tree is empty, or when it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).

8. List the major steps to mine an FP-tree.
1. Construct the conditional pattern base for each node in the FP-tree.
2. Construct the conditional FP-tree from each conditional pattern base.
3. Recursively mine conditional FP-trees and grow the frequent patterns obtained so far.
4. If the conditional FP-tree contains a single path, simply enumerate all the patterns.

9. What is meant by the node-link property? For any frequent item ai, all possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header table.

10. Define the prefix path property. To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

13. What is the principle of frequent pattern growth? Pattern growth property: let a be a frequent itemset in DB, B be a's conditional pattern base, and b be an itemset in B. Then a ∪ b is a frequent itemset in DB iff b is frequent in B.

14. Why is frequent pattern growth fast?
- FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
- No candidate generation, no candidate test.
- It uses a compact data structure.
- It eliminates repeated database scans.
- The basic operations are counting and FP-tree building.

15. What is meant by an iceberg query? It computes aggregates over one attribute or a set of attributes only for those groups whose aggregate values are above a certain threshold.

16. Define multiple-level association rules. Items often form a hierarchy, and items at lower levels are expected to have lower support. Rules regarding itemsets at appropriate levels can be quite useful. A transaction database can be encoded based on dimensions and levels, and shared multi-level mining can then be explored.

17. What is meant by uniform support? Uniform support: the same minimum support is used for all levels.
+ One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.
- Lower-level items do not occur as frequently: if the support threshold is too high, we miss low-level associations; if it is too low, we generate too many high-level associations.

18. What do you mean by reduced support? Reduced minimum support is used at lower levels. There are 4 search strategies:
a) Level-by-level independent
b) Level-cross filtering by k-itemset
c) Level-cross filtering by single item
d) Controlled level-cross filtering by single item

19. Why is progressive refinement suitable for reduced support? A mining operator can be expensive or cheap, fine or rough; we trade speed for quality with step-by-step refinement.

20. What is the functionality of superset mining? It preserves all the positive answers: it allows a false-positive test but not a false-negative test.

21. Define two-step (multi-step) mining. First apply a rough/cheap operator (superset coverage); then apply an expensive algorithm on the substantially reduced candidate set.

22. What are categorical and quantitative attributes? Categorical attributes have a finite number of possible values, with no ordering among the values. Quantitative attributes are numeric, with an implicit ordering among values.

23. What are the limitations of ARCS?
- Only quantitative attributes can appear on the LHS of rules.
- Only 2 attributes can appear on the LHS (2D limitation).
- An alternative to ARCS: non-grid-based equi-depth binning, with clustering based on a measure of partial completeness.

24. Define quantitative association rules. Quantitative attributes are dynamically discretized into bins based on the distribution of the data.

25. State distance-based association rules. This is a dynamic discretization process that considers the distance between data points.

26. Give the two-step mining of spatial associations. Step 1: rough spatial computation (as a filter), using MBR or R-tree structures for rough estimation. Step 2: detailed spatial algorithm (as refinement), applied only to those objects which have passed the rough spatial association test (no less than min_support).

16 Marks

1. Discuss the following in detail: association mining, support, confidence, rule measures.

2. Explain how mining will be done on frequent item sets with an example.
3. Describe the join and prune steps in the Apriori algorithm.
4. Discuss the approaches for mining multi-dimensional association rules from transactional databases. Give suitable examples. A database has four transactions. Let min_sup = 60% and min_conf = 80%.

TID   Date      Items_bought
T100  10/15/08  {K, A, D, B}
T200  10/15/08  {D, A, C, E, B}
T300  10/16/08  {C, A, B, E}
T400  10/16/08  {B, A, D}

(i) Find all frequent itemsets using Apriori and FP-growth respectively.
(ii) List all strong association rules matching the following meta-rule: for all X in transactions, buys(X, item1) ^ buys(X, item2) => buys(X, item3), where X is a customer and each item_i is an item such as A, B, etc.

5. (i) Explain the methods to improve Apriori's efficiency. (ii) Construct the FP-tree for the given transaction DB:

TID   Frequent Itemsets
100   f, c, a, m, p
200   f, c, a, b, m
300   f, b
400   c, b, p
500   f, c, a, m, p

Unit 4

1. What is the functionality of the classification process?
a. It predicts categorical class labels.
b. It classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.

2. Give the role of prediction in data mining. It models continuous-valued functions, i.e., predicts unknown or missing values.

3. List the typical applications of classification and prediction: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

4. Define Supervised learning (classification) a. Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

b. New data is classified based on the training set

5. What is meant by unsupervised learning (clustering)? a. The class labels of the training data are unknown. b. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

6. What is the process involved in Data Preparation?

Preprocess data in order to reduce noise and handle missing values

Remove the irrelevant or redundant attributes

Generalize and/or normalize data

7. How can we evaluate classification methods?
- Predictive accuracy.
- Speed and scalability: time to construct the model; time to use the model.
- Robustness: handling noise and missing values.
- Scalability: efficiency in disk-resident databases.
- Interpretability: understanding and insight provided by the model.
- Goodness of rules: decision tree size; compactness of classification rules.

8. What is meant by a decision tree? A decision tree is a flow-chart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent class labels or class distributions.

9. What are the 2 phases involved in decision tree induction?
a. Tree construction: at the start, all training examples are at the root; examples are then partitioned recursively based on selected attributes.
b. Tree pruning: identify and remove branches that reflect noise or outliers.

10. What are the conditions to stop the partitioning? a. All samples for a given node belong to the same class. b. There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf). c. There are no samples left.

11. State the functionality of the greedy algorithm for decision tree induction.
- The tree is constructed in a top-down recursive divide-and-conquer manner.
- At the start, all training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

12. What is meant by information gain? It is an attribute selection measure: at each node, the attribute with the highest information gain is chosen as the split attribute. a. All attributes are assumed to be categorical. b. It can be modified for continuous-valued attributes.

13. Define the Gini index. In the Gini index approach, all attributes are assumed to be continuous-valued, and several possible split values are assumed to exist for each attribute; one may need other tools, such as clustering, to get the possible split values. It can be modified for categorical attributes. A sketch of both measures follows.
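A minimal sketch of both attribute-selection measures over class-label lists; the 10-sample data and split are hypothetical, while the formulas are the standard entropy and Gini definitions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information: I = -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(parent_labels, partitions):
    """Gain(A) = I(parent) - sum(|S_v|/|S| * I(S_v)) over the partitions."""
    n = len(parent_labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - remainder

# Hypothetical split of 10 labels on a binary attribute.
parent = ["yes"] * 6 + ["no"] * 4
split = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]
print(info_gain(parent, split))           # higher gain = better split
print(gini(parent), [gini(p) for p in split])
```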

14. State the two approaches to avoid overfitting.
a. Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
b. Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then decide which is the best pruned tree.

15. Why use decision tree induction in data mining?
a. Relatively fast learning speed (compared with other classification methods).
b. Convertible to simple and easy-to-understand classification rules.
c. Can use SQL queries for accessing databases.
d. Comparable classification accuracy with other methods.

16. Why do we need Bayesian classification? 1. Probabilistic learning: calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems. 2. Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.

3. Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities 4. Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

17. State Bayes' theorem. Given training data D, the posteriori probability of a hypothesis h, P(h|D), follows Bayes' theorem:

P(C|X) = P(X|C) P(C) / P(X)

The (naive) Bayesian classifier assigns a sample X to the class C such that P(X|C)P(C) is maximum. A minimal sketch follows.
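A minimal naive Bayes sketch over categorical attributes; the training data is hypothetical, and the usual conditional-independence assumption of the naive classifier is applied (no Laplace smoothing, so unseen values zero out the product).

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Estimate P(C) and P(x_j = v | C) by relative frequency."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # (class, attr index) -> value counts
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1
    return prior, cond, len(labels)

def classify(x, prior, cond, n):
    """Pick the class maximizing P(X|C) * P(C)."""
    best, best_score = None, -1.0
    for c, pc in prior.items():
        score = pc / n
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / pc   # P(x_j = v | C = c)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training set: (outlook, windy) -> play?
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
y = ["yes", "yes", "yes", "no"]
model = train(X, y)
print(classify(("rain", "yes"), *model))   # -> 'no'
```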

18. What is meant by the k-nearest neighbor algorithm?
1. All instances correspond to points in n-D space.
2. The nearest neighbors are defined in terms of Euclidean distance.
3. The target function may be discrete- or real-valued.
4. For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query xq.
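A minimal k-NN sketch for a discrete-valued target on hypothetical 2-D points:

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(xq, training, k=3):
    """Majority vote among the k training points nearest to query xq."""
    nearest = sorted(training, key=lambda pc: dist(pc[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (point, class label).
examples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), examples))  # -> 'A'
print(knn_classify((7, 8), examples))  # -> 'B'
```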

19. Define case based reasoning approach. a. Instances represented by rich symbolic descriptions (e.g., function graphs) b. Multiple retrieved cases may be combined c. Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

20. What is the role of genetic algorithms?
1. GA: based on an analogy to biological evolution.
2. Each rule is represented by a string of bits.
3. An initial population is created consisting of randomly generated rules, e.g., "IF A1 and not A2 THEN C2" can be encoded as 100.
4. Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring.
5. The fitness of a rule is represented by its classification accuracy on a set of training examples.
6. Offspring are generated by crossover and mutation.
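A minimal sketch of the crossover and mutation operators on bit-string rules; the encodings are the hypothetical 3-bit rules from the notes above, and fitness evaluation is omitted.

```python
import random

def crossover(parent_a, parent_b):
    """Swap the bit-string tails of two rules at a random cut point."""
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with a small probability."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if random.random() < rate else b for b in rule)

# "IF A1 and not A2 THEN C2" encoded as 100 (per the notes above).
a, b = "100", "011"
child1, child2 = crossover(a, b)
print(child1, child2, mutate(child1))
```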

21. State the functionality of the rough set approach. Rough sets are used to approximately or "roughly" define equivalence classes. A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C). Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix is used to reduce the computation intensity.

22. Compare prediction with classification.
i) Prediction is similar to classification:
a. First, construct a model.
b. Second, use the model to predict unknown values; the major method for prediction is regression (linear and multiple regression, non-linear regression).
ii) Prediction is different from classification:
a. Classification refers to predicting a categorical class label.
b. Prediction models continuous-valued functions.

23. How can we estimate error rates?
1. Partition (training-and-testing): use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of samples.
2. Cross-validation: divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); for data sets of moderate size.
3. Bootstrapping (leave-one-out): for small-size data.
A k-fold split sketch follows.
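A minimal sketch of generating k-fold cross-validation splits by index arithmetic (a real project would typically use a library such as scikit-learn):

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

# 10 samples, 3 folds: each sample appears in exactly one test fold.
for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```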

24. What are the types of prediction (regression)?
Linear regression: Y = a + b X.
a. The two parameters, a and b, specify the line and are to be estimated using the data at hand.
b. They are found by applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ....
Multiple regression: Y = b0 + b1 X1 + b2 X2; many nonlinear functions can be transformed into this form.
Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = αab βac χad δbcd.
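A minimal least-squares fit for Y = a + bX on hypothetical data, using the closed-form estimates b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and a = ȳ - b·x̄:

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b for Y = a + b*X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Hypothetical (X, Y) observations lying near Y = 2 + 3X.
xs = [1, 2, 3, 4, 5]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")   # approximately Y = 2.09 + 2.97 X
```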

25. What is meant by boosting?
1. Boosting increases classification accuracy; it is applicable to decision trees or Bayesian classifiers.
2. A series of classifiers is learned, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
3. Boosting requires only linear time and constant space.

26. State the role of cluster analysis. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis groups a set of data objects into clusters. Clustering is unsupervised classification: there are no predefined classes.

27. Give the applications of clustering.
i. Pattern recognition.
ii. Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining.
iii. Image processing.
iv. Economic science (especially market research).
v. WWW: document classification; clustering Web log data to discover groups of similar access patterns.

28. What are the requirements for clustering in data mining?
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

29. What algorithms are used for clustering?
1. Partitioning algorithms: construct various partitions and then evaluate them by some criterion (see the k-means sketch below).
2. Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
3. Density-based: based on connectivity and density functions.
4. Grid-based: based on a multiple-level granularity structure.
5. Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the model.
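A minimal sketch of one partitioning method, k-means (Lloyd's algorithm), on hypothetical 1-D points; it illustrates the partitioning approach above, not the only such algorithm.

```python
import random

def k_means(points, k, iters=20):
    """Assign points to the nearest center, then recompute centers as means."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups, around 2 and around 50.
data = [1, 2, 3, 2, 48, 50, 52, 49]
centers, clusters = k_means(data, k=2)
print(centers, clusters)
```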

30. What are outliers? Outliers are data objects that are considerably dissimilar from the remainder of the data.

16 Marks

1. Discuss Bayesian classification with its theorem.
2. What is prediction? Explain various prediction techniques.
3. Briefly outline the major steps of decision tree classification.
4. Discuss the different types of clustering methods.
5. Describe the working of the PAM (Partitioning Around Medoids) algorithm.
6. Explain the measures of attributes in decision tree induction and outline the major steps involved in it.

Unit 5

1. Which attribute is said to be a set-valued attribute? A set-valued attribute is generalized either by generalizing each value in the set into its corresponding higher-level concepts, or by deriving the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data (e.g., a hobby set might generalize to {sports, music, video_games}).

2. How is a sequence-valued attribute generalized? In the same way as set-valued attributes, except that the order of the elements in the sequence should be observed in the generalization.

3. Define Plan mining. Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)

4. Give the methods for computing a spatial data cube.
- On-line aggregation: collect and store pointers to spatial objects in a spatial data cube; expensive and slow, and needs efficient aggregation techniques.
- Precompute and store all the possible combinations: huge space overhead.
- Precompute and store rough approximations in a spatial data cube: an accuracy trade-off.

5. What are the 2 steps needed to mine spatial associations? Two-step mining of spatial association:
Step 1: rough spatial computation (as a filter), using MBR or R-tree structures for rough estimation.
Step 2: detailed spatial algorithm (as refinement), applied only to those objects which have passed the rough spatial association test (no less than min_support).

6. What is meant by spatial trend analysis? It detects changes and trends along a spatial dimension, i.e., how non-spatial or spatial attributes change with space (e.g., climate or vegetation changes with increasing distance from an ocean).

7. Define description-based retrieval systems. They build indices and perform object retrieval based on image descriptions, such as keywords, captions, size, and time of creation; this is labor-intensive if performed manually.

8. State the process involved in content-based retrieval systems. They support retrieval based on the image content itself, such as color histograms, texture, shape, objects, and wavelet transforms.

9. List the descriptors present in multidimensional analysis of multimedia data.
- Feature descriptor: a set of vectors for each visual characteristic, e.g., a color vector (color histogram), a most-frequent-color vector, and a most-frequent-orientation vector.
- Layout descriptor: a color layout vector and an edge layout vector.

10. What is required to maintain a time-series database? A time-series database consists of sequences of values or events changing with time, typically recorded at regular intervals. Its time-series components (movements) include trend, cyclic, seasonal, and irregular movements.

16 Marks

1. Give some examples of text-based databases and explain how they are implemented using a data mining system.
2. Explain the process of mining the WWW.
3. Discuss some of the applications using data mining systems.
4. Describe the trends in data mining systems in detail.
5. Describe how multidimensional analysis is performed in a data mining system.
6. Explain the ways in which descriptive mining of complex data objects is identified, with an example.
7. How is a spatial database helpful in a data mining system?
8. Explain the concepts involved in multimedia databases.
