
Q1. Explain the meaning of data cleaning and data formatting.
Ans- Data cleaning
This step complements the previous one. It is also the most time-consuming, because many possible techniques can be applied to optimize data quality for the later modelling stage.
Possible techniques for data cleaning include:
Data normalization. For example, decimal scaling into the range (0,1), or standard deviation normalization.
Data smoothing. Discretization of numeric attributes is one example; this is helpful or even necessary for logic-based methods.
Treatment of missing values. There is no simple and safe solution for cases where some attributes have a significant number of missing values. Generally, it is good to experiment with and without these attributes in the modelling phase, in order to find out the importance of the missing values.
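To make this concrete, here is a minimal, hedged Python sketch of decimal scaling, standard deviation (z-score) normalization and a simple mean-imputation treatment of missing values; the attribute values used are invented for illustration.

```python
# A minimal sketch of two normalization techniques and a simple missing-value
# treatment; the attribute values below are hypothetical.
import math

values = [120.0, 250.0, None, 990.0, 430.0]           # raw attribute with one missing value

# Treatment of missing values: impute the mean of the known values.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)
filled = [v if v is not None else mean for v in values]

# Decimal scaling: divide by 10^j so positive values fall into the range (0, 1).
j = len(str(int(max(abs(v) for v in filled))))         # digits of the largest magnitude
decimal_scaled = [v / (10 ** j) for v in filled]

# Standard deviation (z-score) normalization: subtract the mean, divide by the std deviation.
std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
z_scored = [(v - mean) / std for v in filled]

print(decimal_scaled)
print(z_scored)
```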
Data formatting
This is the final data preparation step. It covers syntactic modifications to the data that do not change its meaning, but are required by the particular modelling tool chosen for the DM task. These include:
reordering of the attributes or records: some modelling tools require reordering of the attributes (or records) in the dataset, such as putting the target attribute at the beginning or at the end, or randomizing the order of records (required by neural networks, for example);
changes related to the constraints of modelling tools: removing commas, tabs or special characters, trimming strings to the maximum allowed number of characters, and replacing special characters with an allowed set of characters.
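The small Python sketch below illustrates this kind of purely syntactic formatting: replacing special characters, trimming strings, putting the target attribute last and randomizing record order. The record layout, field names and length limit are hypothetical.

```python
# A sketch of syntactic data formatting; the records and the MAX_LEN limit are invented.
import random
import re

records = [
    {"name": "Shirt, cotton\t(white)", "qty": 12, "target": "yes"},
    {"name": "DVD player  #42",        "qty": 3,  "target": "no"},
]

MAX_LEN = 15
clean = []
for rec in records:
    name = re.sub(r"[^A-Za-z0-9 ]", " ", rec["name"])   # replace commas, tabs and special characters
    name = " ".join(name.split())[:MAX_LEN]             # collapse spaces, trim to the allowed length
    # reorder attributes: ordinary attributes first, target attribute last
    clean.append({"name": name, "qty": rec["qty"], "target": rec["target"]})

random.shuffle(clean)                                   # randomize record order (e.g. for neural networks)
print(clean)
```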
Q2. What is metadata? Explain the various purposes for which metadata is used.
Ans- Metadata is data about data. Since the data in a data warehouse is both voluminous and dynamic, it needs constant monitoring. This can be done only if a separate set of data about the data is stored. This is the purpose of metadata.
Metadata is useful for data transformation and loading, data management and query generation. This chapter introduces a few of the commonly used metadata functions for each of them. Metadata, by definition, is data about data, or data that describes the data. In simple terms, the data warehouse contains data that describes different situations. But there should also be some data that gives details about the data stored in the data warehouse. This data is metadata. Metadata, apart from other things, is used for the following purposes:
1. data transformation and loading
2. data management
3. query generation
Q3. Write the steps in the design of fact tables.
Ans-DESIGNING OF FACT TABLES
The above listed methods, when iterated repeatedly, will help to finally arrive at a set of entities that go into a fact table. The next question is: how big can a fact table be? An answer could be that it should be big enough to store all the facts, while still making the task of collecting data from this table reasonably fast. Obviously, this depends on the hardware architecture as well as the design of the database. A suitable hardware architecture can ensure that the cost of collecting data is reduced by the inherent capability of the hardware; on the other hand, the database design should ensure that whenever a piece of data is asked for, the time needed to search for it is minimal. In other words, the designer should be able to balance the value of the information made available by the database against the cost of making that data available to the user. A larger database obviously stores more details, and so is definitely useful, but the cost of storing a larger database, as well as the cost of searching and evaluating it, becomes higher. Technologically, there is perhaps no limit on the size of the database. How does one optimize the cost-benefit ratio? There are no standard formulae, but the following facts can be taken note of.
i) Understand the significance of the data stored with respect to time. Only data that is still needed for processing needs to be stored. For example, customer details may become irrelevant after a period of time, and salary details paid in the 1980s may be of little use in analyzing the employee cost of the 21st century. As and when data becomes obsolete, it can be removed.
ii) Find out whether maintaining statistical samples of each of the subsets could be resorted to instead of storing the entire data. For example, instead of storing the sales details of all 200 towns over the last 5 years, one can store details of 10 smaller towns, five metros, 10 bigger cities and 20 villages. After all, data warehousing is most often resorted to for trends and not exact figures; the subsets of these individual details can always be extrapolated.
Q3. List and explain the aspects to be looked into while designing the summary tables.
Ans-ASPECTS TO BE LOOKED INTO WHILE DESIGNING THE
SUMMARY TABLES
The main purpose of using summary tables is to cut down the time taken to execute a specific query. The main methodology involves minimizing the volume of data being scanned each time the query is to be answered. In other words, partial answers to the query are already made available. For example, in the above cited example of the mobile market, suppose one expects that
i) citizens above 18 years of age,
ii) with salaries greater than 15,000, and
iii) with professions that involve travelling
are the potential customers. Then, every time the query is to be processed (maybe every month or every quarter), one will have to look at the entire database to compute these values and then combine them suitably to get the relevant answers. The other method is to prepare summary tables, which hold the values pertaining to each of these sub-queries beforehand, and then combine them as and when the query is raised.
It can be noted that the summaries can be prepared in the background (or when the number of queries running is relatively small) and only the aggregation need be done on the fly. Summary tables are designed by following the steps given below:
i) Decide the dimensions along which aggregation is to be done.
ii) Determine the aggregation of multiple facts.
iii) Aggregate multiple facts into the summary table.
iv) Determine the level of aggregation and the extent of embedding.
v) Design time into the table.
vi) Index the summary table.
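As a rough illustration of the first three steps, the following Python sketch aggregates a toy fact table along two chosen dimensions into a summary table; the fact rows and dimension names are invented.

```python
# A toy sketch of building a summary table from a fact table; all data is made up.
from collections import defaultdict

# fact table rows: (region, month, age_band, sales_amount)
facts = [
    ("north", "2024-01", "18-30", 1500.0),
    ("north", "2024-01", "31-50", 2200.0),
    ("south", "2024-01", "18-30",  900.0),
    ("north", "2024-02", "18-30", 1700.0),
]

# step i) dimensions chosen for aggregation: region and month
# steps ii)/iii) aggregate the sales fact into the summary table
summary = defaultdict(float)
for region, month, _age, amount in facts:
    summary[(region, month)] += amount

for (region, month), total in sorted(summary.items()):
    print(region, month, total)
```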

Q4. Explain the role of access control issues in data mart design.
Ans-ROLE OF ACCESS CONTROL ISSUES IN DATA MART
DESIGN
This is one of the major constraints in data mart design. Any data warehouse, with its huge volume of data, is more often than not subject to various access controls as to who can access which part of the data. The easiest case is where the data is partitioned so clearly that a user of each partition cannot access any other data. In such cases, each partition can be put in a data mart and the user of each can access only his own data. In the data warehouse, the data pertaining to all these marts is stored, but the partitioning is retained. If a super user wants an overall view of the data, suitable aggregations can be generated. However, in certain other cases the demarcation may not be so clear. In such cases, a judicious analysis of the privacy constraints is needed so that the privacy of each data mart is maintained.
Data marts, as described in the previous sections, can be designed based on several splits noticeable either in the data, in the organization or in privacy laws. They may also be designed to suit the user access tools. In the latter case, there is not much choice available for the design parameters. In the other cases, it is always desirable to design the data mart to suit the design of the warehouse itself. This helps to maintain maximum control over the database instances, by ensuring that the same design is replicated in each of the data marts. Similarly, the summary information of each data mart can be a smaller replica of the summary of the data warehouse itself.
Q5. List the applications and reasons for the growing popularity of data mining.
Ans-REASONS FOR THE GROWING POPULARITY OF DATA
MINING
a) Growing Data Volume
The main reason for the necessity of automated computer systems for intelligent data analysis is the enormous volume of existing and newly appearing data that requires processing. The amount of data accumulated each day by various business, scientific and governmental organizations around the world is daunting. It becomes impossible for human analysts to cope with such overwhelming amounts of data.
b) Limitations of Human Analysis
Two other problems that surface when human analysts process data are the inadequacy
of the human brain when searching for complex multifactor dependencies in data, and
the lack of objectiveness in such an analysis. A human expert is always a hostage of the
previous experience of investigating other systems. Sometimes this helps, sometimes
this hurts, but it is almost impossible to get rid of this fact.
c) Low Cost of Machine Learning
One additional benefit of using automated data mining systems is that this process has a much lower cost than hiring many highly trained professional statisticians. While data mining does not eliminate human participation in solving the task completely, it significantly simplifies the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.
Q6. What is data mining? What kind of data can be mined?
Ans- There are many definitions of data mining. A few important definitions are given below.
Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is the process of exploration and analysis, by automatic or semiautomatic
means, of large quantities of data in order to discover meaningful patterns and rules.
WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of media or data. Data mining
should be applicable to any kind of information repository. However, algorithms and
approaches may differ when applied to different types of data. Indeed, the challenges
presented by different types of data vary significantly.
Data mining is being put into use and studied for databases, including relational
databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail.
Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, or scientific measurements.
Relational Databases: A relational database consists of a set of tables containing either
values of entity attributes, or values of attributes from entity relationships. Tables have
columns and rows, where columns represent attributes and rows represent tuples.
Q7. Give the top-level syntax of the data mining query language DMQL.
Ans-A data mining language helps in effective knowledge discovery from the data
mining systems. Designing a comprehensive data mining language is challenging
because data mining covers a wide spectrum of tasks from data characterization to
mining association rules, data classification and evolution analysis.
Each task has different requirements. The design of an effective data mining query language requires a deep understanding of the power, limitations and underlying mechanisms of the various kinds of data mining tasks.
Q8. Explain the meaning of data mining with the Apriori algorithm.
Ans- The Apriori algorithm discovers items that are frequently associated
together. Let us look at the example of a store that sells DVDs, Videos, CDs, Books
and Games. The store owner might want to discover which of these items customers
are likely to buy together. This can be used to increase the store's cross-sell and up-sell ratios. Customers in this particular store may like buying a DVD and a Game in 10 out of every 100 transactions, or the sale of Videos may hardly ever be associated with the sale of a DVD. With this information, the store could strive for better placement of DVDs and Games, as the sale of one of them may improve the chances of the sale of the other frequently associated item. On the other hand, mailing campaigns may be fine-tuned to reflect the fact that offering discount coupons on Videos may even negatively impact the sales of DVDs offered in the same campaign. A better decision could be not to offer both DVDs and Videos in a campaign. To arrive at these decisions, the store may have had to analyze 10,000 past transactions of customers, using calculations that separate frequent and consequently important associations from weak and unimportant ones.
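A compact, hedged Python sketch of the Apriori idea over store transactions like those in the example is given below; the transactions and the minimum-support threshold are invented for illustration.

```python
# An illustrative Apriori sketch: find itemsets that appear together often enough.
from itertools import combinations  # imported for clarity; candidate joins below use set unions

transactions = [
    {"DVD", "Game"}, {"DVD", "Game", "CD"}, {"Video", "Book"},
    {"DVD", "Game"}, {"CD", "Book"}, {"DVD", "CD"},
]
min_support = 3   # an itemset is "frequent" if it appears in at least 3 transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent single items
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Lk: join frequent (k-1)-itemsets, keep candidates that meet the minimum support
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in frequent:
    for itemset in level:
        print(set(itemset), support(itemset))
```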

Q9. Explain the working principle of decision trees used for data mining.
Ans-DATA MINING WITH DECISION TREES
Decision trees are powerful and popular tools for classification and prediction. The attractiveness of tree-based methods is due in large part to the fact that they are simple and that decision trees represent rules. Rules can readily be expressed so that humans can understand them, or in a database access language like SQL, so that records falling into a particular category may be retrieved. In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, it may not care how or why the model works.
Decision tree working concept
A decision tree is a classifier in the form of a tree structure, where each node is either:
a leaf node, indicating a class of instances, or
a decision node that specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
Example: Decision making in the Bombay stock market
Suppose that the major factors affecting the Bombay stock market are:
what it did yesterday;
what the New Delhi market is doing today;
bank interest rate;
unemployment rate;
India's prospects at cricket.
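The following is a tiny, hand-built decision tree in Python over factors like those listed above; the split rules and the class labels ("rise"/"fall") are invented purely to show how an instance is classified from the root down to a leaf.

```python
# A toy decision tree: decision nodes test one attribute, leaf nodes return a class.
def classify(sample):
    # decision node: what the market did yesterday
    if sample["yesterday"] == "up":
        # decision node: what the New Delhi market is doing today
        if sample["new_delhi_today"] == "up":
            return "rise"          # leaf node
        return "fall"              # leaf node
    # decision node: bank interest rate
    if sample["interest_rate"] == "falling":
        return "rise"              # leaf node
    return "fall"                  # leaf node

print(classify({"yesterday": "up", "new_delhi_today": "up", "interest_rate": "rising"}))
print(classify({"yesterday": "down", "new_delhi_today": "up", "interest_rate": "rising"}))
```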
Q10. What is Bayes' theorem? Explain the working procedure of the Bayesian classifier.
Ans- Bayes' Theorem
Let X be a data sample whose class label is unknown. Let H be some hypothesis, such as that the data sample X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given data sample is an apple, regardless of how the data sample looks. The posterior probability, P(H|X), is based on more information (such as background knowledge) than the prior probability, P(H), which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that X is red and round given that we know that X is an apple. P(X) is the prior probability of X. P(X), P(H), and P(X|H) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X), and P(X|H). Bayes' theorem is
P(H|X) = P(X|H) P(H) / P(X)
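As a hedged numeric illustration of the theorem applied to the apple example, the Python snippet below computes the posterior from assumed values of P(H), P(X) and P(X|H); all three probabilities are invented.

```python
# A worked application of Bayes' theorem; the probability values are invented.
p_h = 0.10           # P(H): prior probability that a random fruit sample is an apple
p_x_given_h = 0.80   # P(X|H): probability a fruit is red and round given that it is an apple
p_x = 0.25           # P(X): prior probability that a fruit is red and round

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
print(p_h_given_x)                      # 0.32 -> posterior belief that the fruit is an apple
```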
Q11. Explain how neural networks can be used for data mining.
Ans- A neural processing element receives inputs from other connected processing
elements. These input signals or values pass through weighted connections, which
either amplify or diminish the signals. Inside the neural processing element, all of these
input signals are summed together to give the total input to the unit. This total input
value is then passed through a mathematical function to produce an output or decision
value ranging from 0 to 1. Notice that this is a real valued (analog) output, not a digital
0/1 output. If the input signal matches the connection weights exactly, then the output
is close to 1. If the input signal totally mismatches the connection weights then the
output is close to 0. Varying degrees of similarity are represented by the intermediate
values. Now, of course, we can force the neural processing element to make a binary
(1/0) decision, but by using analog values ranging between 0.0 and 1.0 as the outputs,
we are retaining more information to pass on to the next layer of neural processing
units. In a very real sense, neural networks are analog computers.
Each neural processing element acts as a simple pattern recognition machine. It checks
the input signals against its memory traces (connection weights) and produces an
output signal that corresponds to the degree of match between those patterns. In typical
neural networks, there are hundreds of neural processing elements whose pattern
recognition and decision making abilities are harnessed together to
solve problems.
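A minimal Python sketch of such a processing element is shown below: the inputs pass through weighted connections, are summed, and the total is squashed by a sigmoid into a value between 0 and 1. The weights and inputs are arbitrary.

```python
# A neural processing element: weighted sum of inputs squashed into (0, 1).
import math

def processing_element(inputs, weights, bias=0.0):
    total_input = sum(i * w for i, w in zip(inputs, weights)) + bias  # total input to the unit
    return 1.0 / (1.0 + math.exp(-total_input))                      # sigmoid output in (0, 1)

# an input pattern that matches the weights gives an output near 1; a mismatch gives a low output
print(processing_element([1.0, 0.0, 1.0], [2.5, -3.0, 2.0]))   # close to 1
print(processing_element([0.0, 1.0, 0.0], [2.5, -3.0, 2.0]))   # close to 0
```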
Backpropagation
Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the backwards direction, that is, from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops. The steps of the algorithm are described below.
Initialize the weights: The weights in the network are initialized to small random numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated with it, as explained below. The biases are similarly initialized to small random numbers.
Each training sample, X, is then processed by the following steps.
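The per-sample steps are not spelled out above, so the following is only a hedged, minimal Python sketch of one possible realization for a tiny 2-2-1 network: propagate the inputs forward, backpropagate the error terms, and update the weights and biases. The training data (logical OR) and the learning rate are illustrative choices, not taken from the text.

```python
# A minimal backpropagation sketch for a 2-input, 2-hidden-unit, 1-output network.
import math
import random

random.seed(1)

def rnd():
    return random.uniform(-0.5, 0.5)    # small random initial weights and biases

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid squashing function

w1 = [[rnd(), rnd()] for _ in range(2)]   # hidden-layer weights
b1 = [rnd(), rnd()]                       # hidden-layer biases
w2 = [[rnd(), rnd()]]                     # output-layer weights
b2 = [rnd()]                              # output-layer bias

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]   # logical OR (illustrative)
lr = 0.5                                                      # learning rate (assumed)

def forward(x):
    h = [sig(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j]) for j in range(2)]
    o = sig(sum(w * hi for w, hi in zip(w2[0], h)) + b2[0])
    return h, o

for _ in range(5000):
    for x, target in data:
        h, o = forward(x)                                                # propagate inputs forward
        err_o = o * (1 - o) * (target - o)                               # output-layer error term
        err_h = [h[j] * (1 - h[j]) * err_o * w2[0][j] for j in range(2)] # error propagated back to hidden layer
        for j in range(2):                                               # update weights in the backwards direction
            w2[0][j] += lr * err_o * h[j]
            b1[j] += lr * err_h[j]
            for i in range(2):
                w1[j][i] += lr * err_h[j] * x[i]
        b2[0] += lr * err_o

print([round(forward(x)[1], 2) for x, _ in data])   # outputs should approach 0, 1, 1, 1
```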

K-means algorithm
This algorithm takes as input a predefined number of clusters, that is, the k from its name. "Means" stands for an average: the average location of all the members of a particular cluster. When dealing with clustering techniques, one has to adopt the notion of a high-dimensional space, a space in which the orthogonal dimensions are all the attributes from the table of data we are analyzing. The value of each attribute of an example represents the distance of the example from the origin along that attribute's axis. The coordinates of a centroid are the averages of the attribute values of all examples that belong to its cluster. The steps of the K-means algorithm are given below.
1. Select randomly k points (it can be also examples) to be the
seeds for the centroids of k clusters.
2. Assign each example to the centroid closest to the example,
forming in this way k exclusive clusters of examples.
3. Calculate new centroids of the clusters. For that purpose average
all attribute values of the examples belonging to the same cluster (centroid).
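A short illustrative Python sketch of these three steps is given below; the 2-D points and the choice of k = 2 are arbitrary.

```python
# A small K-means sketch: seed centroids, assign points, recompute centroids.
import random

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]
k = 2

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance

# step 1: select k points at random as the seeds for the centroids
centroids = random.sample(points, k)

for _ in range(10):                                  # a few assignment / re-centering passes
    # step 2: assign each example to the closest centroid, forming k exclusive clusters
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
        clusters[nearest].append(p)
    # step 3: recompute each centroid as the average of the examples in its cluster
    for c, cluster in enumerate(clusters):
        if cluster:
            centroids[c] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))

print(centroids)
```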
Q12. Explain the star-flake schema in detail.
Ans- STAR-FLAKE SCHEMAS
One of the key concerns for a database designer is to ensure that the database is able to answer all types of queries, even those that were not initially visualized by the developer. To do this, it is essential to understand how the data within the database is used. In a decision support system, which is basically what a data warehouse is supposed to provide, a large number of different questions are asked about the same set of facts. For example, given sales data, questions like the following can be asked:
i) What is the average sales quantum of a particular item?
ii) Which were the most popular brands in the last week?
iii) Which item has the least turnaround time?
iv) How many customers returned to procure the same item within one month?
They are all based on the sales data, but the way of viewing the data to answer each question is different. The answers need to be given by rearranging or cross-referencing different facts.
Q13.Explain the method for designing dimension tables.
Ans-DESIGNING DIMENSION TABLES
After the fact tables have been designed, it is essential to design the dimension tables.
However, the design of dimension tables need not be considered a critical activity,
though a good design helps in improving the performance. It is also desirable to keep
the volumes relatively small, so that restructuring cost will be
less. Now we see some of the commonly used dimensions.
Star dimension
Star dimensions speed up query performance by denormalising reference information into a single table. They presume that the bulk of queries are such that they analyze the facts by applying a number of constraints to single-dimensional data.
For example, the details of sales from a store can be stored as horizontal rows, with queries selecting one or a few of the attributes. Suppose a cloth store stores details of sales one below the other, and questions like how many white shirts of size 85 were sold in one week are asked. All that the query has to do is apply the relevant constraints to get the information. This technique works well in solutions where there are a number of entities, all related to the key dimension entity.
Q14. Explain horizontal partitioning briefly.
Ans- Needless to say, the data warehouse design process should try to maximize the performance of the system. One of the ways to ensure this is to optimize the database design with respect to a specific hardware architecture. Obviously, the exact details of optimization depend on the hardware platform.
Normally the following guidelines are useful:
i. Make maximum use of the available processing, disk and I/O capacity.
ii. Reduce bottlenecks at the CPU and I/O.
The following mechanisms become handy.
Maximising the processing and avoiding bottlenecks
One of the ways of ensuring faster processing is to split the data query into several parallel sub-queries, convert them into parallel threads and run them in parallel. This method will work only when there is a sufficient number of processors, or sufficient processing power, to ensure that they can actually run in parallel. (Again, note that to run five threads it is not always necessary to have five processors; even a smaller number of processors can do the job, provided they are fast enough to avoid bottlenecks at the processor.)
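The toy Python sketch below illustrates the idea of splitting one query into parallel sub-queries run as threads and then combining the partial answers; the "fact table" is just an in-memory list and the chunking scheme is an assumption.

```python
# Splitting one scan into parallel sub-queries run as threads, then combining results.
from concurrent.futures import ThreadPoolExecutor

sales = list(range(1_000_000))            # stand-in for a large fact table

def partial_sum(chunk):
    return sum(chunk)                     # one sub-query scanning one partition

chunks = [sales[i::4] for i in range(4)]  # split the scan into 4 parallel sub-queries
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total == sum(sales))                # the combined answer matches a full scan
```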
Normalisation
The usual approach to normalization in database applications is to ensure that the data is divided into two or more tables, such that when the data in one of them is updated, it does not lead to anomalies in the data (the student is advised to refer to any book on database management systems for details, if interested). The idea is to ensure that, when combined, the data available is consistent.
However, in data warehousing, one may even tend to break a large table into several denormalized smaller tables. This may lead to a lot of extra space being used, but it helps in an indirect way: it avoids the overhead of joining the data during queries.
Q16. Explain the need for data marts in detail.
Ans-THE NEED FOR DATA MARTS
In a crude sense, if you consider a data warehouse as a storehouse of data, a data mart is a retail outlet of data. Searching for any data in a huge storehouse is difficult, but if the data is available, you should positively be able to get it. On the other hand, in a retail outlet, since the volume to be searched is small, you are able to access the data fast. But it is possible that the data you are searching for may not be available there, in which case you have to go back to your main storehouse to search for it. Coming back to technical terminology, one can say the following are the reasons for which data marts are created.
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for a user access tool.
iii) Data can be segmented or partitioned so that it can be used on different platforms, and also different control strategies become applicable.

IDENTIFY THE ACCESS TOOL REQUIREMENTS


Data marts are required to support internal data structures that support the user access tools. Data within those structures is not directly controlled by the warehouse, but it has to be rearranged and updated by the warehouse. This arrangement (called populating the data) is made to suit the existing requirements of data analysis. While the requirements are few and not very complicated, any populating method may be suitable, but as the demands increase (as happens over a period of time), the populating methods should match the tools used.
As a rule, this rearrangement (or populating) is to be done by the warehouse after acquiring the data from the source. In other words, the data received from the source should not be arranged directly into the structures needed by the access tools. This is because each piece of data is likely to be used by several access tools, which need different populating methods. Also, additional requirements may come up later. Hence each data mart is to be populated from the warehouse based on the access tool requirements.
Q17. Explain the data warehouse process managers in detail.
Ans- DATA WAREHOUSE PROCESS MANAGERS
These are responsible for the smooth flow, maintenance and upkeep of data into and out of the database. The main types of process managers are:
Load manager
Warehouse manager
Query manager
We shall look into each of them briefly.
Load manager
This is responsible for any data transformations and for loading data into the database. It should effect the following:
Data source interaction
Data transformation
Data load.
The actual complexity of each of these modules depends on the size of the database. The load manager should be able to interact with the source systems to verify the received data. This is a very important aspect, and any improper operation leads to invalid data affecting the entire warehouse. This is normally achieved by making the source and data warehouse systems compatible.
Warehouse Manager
The warehouse manager is responsible for maintaining the data of the warehouse. It should also create and maintain a layer of metadata. Some of the responsibilities of the warehouse manager are:
o Data movement
o Metadata management
o Performance monitoring
o Archiving.
Data movement includes the transfer of data within the warehouse, aggregation, and the creation and maintenance of tables, indexes and other objects of importance. The warehouse manager should be able to create new aggregations as well as remove old ones. Creation of additional rows/columns, keeping track of the aggregation processes and creating metadata are also its functions.
Query Manager
We shall look at the last of the managers, but not the least important: the query manager. Its main responsibilities include control of the following:
o User access to data
o Query scheduling
o Query monitoring
These jobs are varied in nature and have not been fully automated as yet. The main job of the query manager is to control the users' access to data and also to present the data resulting from query processing in a format suitable to the user. The raw data, often from different sources, needs to be compiled into a format suitable for querying. The query manager has to act as a mediator between the user on one hand and the metadata on the other.
Q18. Explain the data warehouse delivery process in detail.
Ans-THE DATA WAREHOUSE DELIVERY PROCESS
This section deals with the data warehouse from a different viewpoint: how the different components that go into it enable the building of a data warehouse. The study helps us in two ways:
i) to have a clear view of the data warehouse building process;
ii) to understand the working of the data warehouse in the context of the components.
Now we look at the concepts in detail.
i. IT Strategy
The company must have an overall IT strategy, and data warehousing has to be a part of that overall strategy. This would not only ensure that adequate backing in terms of data and investment is available, but would also help in integrating the warehouse into the strategy. In other words, a data warehouse cannot be visualized in isolation.
ii. Business case analysis
This looks like an obvious thing, but it is most often misunderstood. An overall understanding of the business and the importance of the various components therein is a must. This will ensure that one can clearly justify the appropriate level of investment that goes into the data warehouse design and also the amount of returns accruing.
Unfortunately, in many cases, the returns from the warehousing activity are not quantifiable. At the end of the year, one cannot make statements of the sort "I have saved / generated Rs. 2.5 crore because of data warehousing". A data warehouse affects the business and strategy plans indirectly, giving scope for undue expectations on one hand and total neglect on the other. Hence, it is essential that the designer has a sound understanding of the overall business and of the scope for his concept (the data warehouse) in the project, so that he can answer probing questions.

Q19. Briefly explain the system management tools.
Ans- SYSTEM MANAGEMENT TOOLS
The most important jobs done by this class of managers include the following:
1. Configuration managers
2. Schedule managers
3. Event managers
4. Database managers
5. Backup and recovery managers
6. Resource and performance monitors.
We shall look into the working of the first five classes, since the last type of manager is less critical in nature.
Configuration manager
This tool is responsible for setting up and configuring the hardware. Since several types
of machines are being addressed, several concepts like machine configuration,
compatibility etc. are to be taken care of, as also the platform on which the system
operates.
Schedule manager
Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse.
Event manager
An event is defined as a measurable, observable occurrence of a defined action. If this
definition is quite vague, it is because it encompasses a very large set of operations.
The event manager is a piece of software that continuously monitors the system for the occurrence of events and then takes any suitable action (note that an event is a measurable and observable occurrence). The action to be taken is normally specific to the event.
Database manager
The database manager normally also has a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows:
o Ability to add/remove users
o User management
o Manipulate user quotas
o Assign and de-assign user profiles
Q20. What is a schema? Distinguish between facts and dimensions.
Ans- Schema: A schema, by definition, is a logical arrangement of facts that facilitates ease of storage and retrieval, as described by the end users. The end user is not bothered about the overall arrangement of the data or the fields in it. For example, a sales executive trying to project the sales of a particular item is only interested in the sales details of that item, whereas a tax practitioner looking at the same data will be interested only in the amounts received by the company and the profits made.
Distinguish between facts and dimensions
The star schema looks like a good solution to the problem of warehousing. It simply states that one should identify the facts and store them in the read-only area, with the dimensions surrounding that area. Whereas the dimensions are liable to change, the facts are not. But given a set of raw data from the sources, how does one identify the facts and the dimensions? It is not always easy, but the following steps can help in that direction.
i) Look for the fundamental transactions in the entire business process. These basic entities are the facts.
ii) Find out the important dimensions that apply to each of these facts. They are the candidates for dimension tables.
iii) Ensure that the facts do not include candidates that are actually dimensions with a set of facts attached to them.
iv) Ensure that the dimensions do not include candidates that are actually facts.
Q21. Explain how to categorize data mining systems.
Ans- CATEGORIZE DATA MINING SYSTEMS
There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria; among other classifications are the following:
a) Classification according to the type of data source mined: this classification
categorizes data mining systems according to the type of data handled such as spatial
data, multimedia data, time-series data, text data, World Wide Web, etc.
b) Classification according to the data model drawn on: this classification
categorizes data mining systems based on the data model involved such as relational
database, object-oriented database, data warehouse, transactional, etc.
c) Classification according to the kind of knowledge discovered: this classification
categorizes data mining systems based on the kind of knowledge discovered or data
mining functionalities, such as characterization, discrimination, association,
classification, clustering, etc. Some systems tend to be comprehensive systems offering
several data mining functionalities together.
Q22. A DATA MINING QUERY LANGUAGE
A data mining query language provides the necessary primitives that allow users to communicate with data mining systems. But novice users may find a data mining query language difficult to use and its syntax difficult to remember. Instead, users may prefer to communicate with data mining systems through a graphical user interface (GUI). In relational database technology, SQL serves as a standard core language for relational systems, on top of which GUIs can easily be designed. Similarly, a data mining query language may serve as a core language for data mining system implementations, providing a basis for the development of GUIs for effective data mining. A data mining GUI may consist of the following functional components:
a) Data collection and data mining query composition - This component allows the user to specify task-relevant data sets and to compose data mining queries. It is similar to GUIs used for the specification of relational queries.
b) Presentation of discovered patterns - This component allows the display of the discovered patterns in various forms, including tables, graphs, charts, curves and other visualization techniques.


Q. Enlist the desirable schemes required for a good architecture of data mining systems.
Ans- ARCHITECTURES OF DATA MINING SYSTEMS
A good system architecture will enable the system to make the best use of the software environment, accomplish data mining tasks in an efficient and timely manner, interoperate and exchange information with other information systems, be adaptable to users' different requirements, and evolve with time. To see what the desired architectures for data mining systems are, we consider how data mining is integrated with a database/data warehouse system, coupled under the following schemes:
a) no coupling
b) loose coupling
c) semi-tight coupling
d) tight coupling
No coupling: This means that the data mining system does not utilize any function of a database or data warehouse system. It fetches data from a particular source such as a file, processes the data using some data mining algorithms, and then stores the mining results in another file. This scheme has some disadvantages:
1) A database system provides a great deal of flexibility and efficiency at storing, organizing, accessing and processing data. Without it, a data mining system may spend a considerable amount of time finding, collecting, cleaning and transforming data.
Qno. CLUSTERING IN DATA MINING
Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups. Representing data by fewer clusters necessarily loses certain fine
details (akin to lossy data compression), but achieves
simplification. It represents many data objects by few clusters, and hence, it models
data by its clusters. Data modeling puts clustering in a historical perspective rooted in
mathematics, statistics, and numerical analysis. From a machine learning perspective
clusters correspond to hidden patterns, the search for clusters is unsupervised learning,
and the resulting system represents a data concept. Therefore, clustering
is unsupervised learning of a hidden data concept. Data mining deals with large
databases that impose on clustering analysis additional severe computational
requirements.
Requirements for clustering
Clustering is a challenging and interesting field; potential applications pose their own special requirements. The following are typical requirements of clustering in data mining.
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects. However, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to
cluster interval-based (numerical) data. However, applications may require clustering
other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures
of these data types.
Nominal, Ordinal and Ratio-Scaled Variables
Nominal Variables
A nominal variable is a generalization of the binary variable in that it can take on more
than two states. For example, map_color is a nominal variable that may have, say, five
states: red, yellow, green, pink and blue.
Nominal variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M nominal states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the nominal variable map_color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.
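A small Python sketch of this encoding of map_color into one binary variable per state follows.

```python
# Encode a nominal variable as one binary variable per state.
states = ["red", "yellow", "green", "pink", "blue"]

def encode(color):
    return {s: (1 if s == color else 0) for s in states}

print(encode("yellow"))   # the 'yellow' variable is 1, the other four are 0
```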
Ordinal Variables
A discrete ordinal variable resembles a nominal variable, except that the M states of the
ordinal value are ordered in a meaningful sequence. Ordinal variables are very useful
for registering subjective assessments of qualities that cannot be measured objectively.
For example professional ranks are often enumerated in a sequential order, such as
assistant, associate, and full. A continuous ordinal variable
looks like a set of continuous data of an unknown scale; that is, the relative ordering of
the values is essential but their actual magnitude is not. For example, the relative
ranking in a particular sport (e.g., gold, silver, bronze) is often more essential than the
actual values of a particular measure.
Ratio-Scaled Variables
A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants. Typical examples include the growth of a bacteria population or the decay of a radioactive element. There are three methods to handle ratio-scaled variables when computing the dissimilarity between objects.
Neural Network Topologies
The arrangement of neural processing units and their interconnections can have a
profound impact on the processing capabilities of the neural networks. In general, all
neural networks have some set of processing units that receive inputs from the outside
world, which we refer to appropriately as the input units. Many neural networks also
have one or more layers of hidden processing units that receive inputs only from
other processing units. A layer or slab of processing units receives a vector of data or
the outputs of a previous layer of units and processes them in parallel. The set of
processing units that represents the final result of the neural network computation is
designated as the output units. There are three major connection topologies that
define how data flows between the input, hidden, and output processing units.
Feed-Forward Networks
Feed-forward networks are used in situations when we can bring all of the information to bear on a problem at once, and we can present it to the neural network. It is like a pop quiz, where the teacher walks in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take the data, process it, and jump to a conclusion. In this type of neural network, the data flows through the network in one direction, and the answer is based solely on the current set of inputs.

Q24. With the help of a block diagram, explain the typical process flow in a data warehouse.
Ans- TYPICAL PROCESS FLOW IN A DATA WAREHOUSE
Any data warehouse must support the following activities:
i) Populating the warehouse (i.e. inclusion of data).
ii) Day-to-day management of the warehouse.
iii) Ability to accommodate changes.
The processes that populate the warehouse have to be able to extract the data, clean it up, and make it available to the analysis systems. This is done on a daily or weekly basis, depending on the quantum of data to be incorporated.
The day-to-day management of the data warehouse is not to be confused with maintenance and management of hardware and software. When large amounts of data are stored and new data is continually added at regular intervals, maintenance of the quality of data becomes an important element. The ability to accommodate changes implies that the system is structured in such a way as to be able to cope with future changes without the entire system being remodelled. Based on these, we can view the processes that a typical data warehouse scheme should support as follows.
Q25. How does naive Bayesian classification work?
Ans- Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample from n attributes, respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample X (i.e., having no class label), the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naive Bayesian classifier assigns an unknown sample X to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X|Ci) P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci) P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples.
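The following is a hedged, minimal Python sketch of such a naive Bayesian classifier over categorical attributes, following steps 1-3 above; the toy training samples and class labels are invented for illustration.

```python
# A minimal naive Bayesian classifier over categorical attributes (toy data).
from collections import Counter, defaultdict

# training samples: (attribute vector X, class label C)
train = [
    (("red", "round"), "apple"), (("red", "round"), "apple"),
    (("yellow", "long"), "banana"), (("yellow", "round"), "apple"),
    (("yellow", "long"), "banana"),
]

class_counts = Counter(label for _, label in train)   # estimates P(Ci) = si / s
attr_counts = defaultdict(Counter)                    # value counts per (class, attribute index)
for x, label in train:
    for i, value in enumerate(x):
        attr_counts[(label, i)][value] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / len(train)                           # P(Ci)
        for i, value in enumerate(x):
            score *= attr_counts[(label, i)][value] / count  # P(xk | Ci), attributes assumed independent
        if score > best_score:
            best_class, best_score = label, score            # maximize P(X|Ci) P(Ci)
    return best_class

print(predict(("red", "round")))    # expected: apple
print(predict(("yellow", "long")))  # expected: banana
```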
Training Bayesian Belief Networks
In the learning or training of a belief network, a number of scenarios are possible. The
network structure may be given in advance or inferred from the data. The network
variables may be observable or hidden in all or some of the training samples. The case
of hidden data is also referred to as missing values or incomplete data.
If the network structure is known and the variables are observable, then training the network is straightforward. It consists of computing the CPT entries, as is similarly done when computing the probabilities involved in naive Bayesian classification.
Nonlinear Regression
Polynomial regression can be modeled by adding polynomial terms to the basic linear
model. By applying transformations to the variables, we can convert the nonlinear
model into a linear one that can then be solved by the method of least squares.
Transformation of a polynomial regression model to a linear regression model.
Consider a cubic polynomial relationship given by
Y = a + b1 X + b2 X^2 + b3 X^3
To convert this equation to linear form, we define new variables:
X1 = X, X2 = X^2, X3 = X^3
The above equation can then be converted to the linear form Y = a + b1 X1 + b2 X2 + b3 X3 by applying these assignments, and this is solvable by the method of least squares.
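A brief Python sketch of this transformation is given below: the cubic is fitted by building the columns X1 = X, X2 = X^2, X3 = X^3 and solving by least squares; the sample data points are synthetic.

```python
# Fit a cubic model by transforming it into a linear least-squares problem.
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3           # data generated from a known cubic

A = np.column_stack([np.ones_like(x), x, x**2, x**3])  # columns: 1, X1, X2, X3
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)         # method of least squares

print(np.round(coeffs, 3))                             # approximately [1.0, 2.0, -0.5, 0.3]
```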

5. Explain the concept of data warehousing and data mining.
Ans. A data warehouse is a collection of a large amount of data, and this data provides the pieces of information used to make suitable managerial decisions (a storehouse of data), e.g. student data, the details of the citizens of a city, the sales of previous years, or the number of patients that came to a hospital with different ailments. Such data becomes a storehouse of information.
Data mining is the process of exploration and analysis, by automatic or semiautomatic
means, of large quantities of data in order to discover meaningful patterns and rules.
The main concept of data mining is using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation.
Q15. Define data mining query in terms of primitives.
Ans: a) Growing Data Volume: The main reason for necessity of automated
computer systems for intelligent data analysis is the enormous volume of existing and
newly appearing data that require processing. The amount of data accumulated each
day by various business, scientific, and governmental organizations around the world is
daunting.
b) Limitations of Human Analysis: Two other problems that surface when human analysts process data are the inadequacy of the human brain when searching for
complex multifactor dependencies in data, and the lack of objectiveness in such an
analysis.
c) Low Cost of Machine Learning: While data mining does not eliminate human
participation in solving the task completely, it significantly simplifies the job and
allows an analyst who is not a professional in statistics and programming to manage
the process of extracting knowledge from data.
Qno. List the various applications of data mining in various fields, or explain in brief the data mining applications.
Ans: Data mining has many and varied fields of application, which are listed below:
Retail/Marketing:
Identify buying patterns from customers.
Find associations among customer demographic characteristics.
Predict response to mailing campaigns.
Market basket analysis.
Banking:
Detect patterns of fraudulent credit card use.
Identify loyal customers.
Predict customers, determine credit card spending.
Identify stock trading.
Insurance and Health Care:
Claims analysis.
Identify behavior patterns of risky customers.
Identify fraudulent behavior.
Transportation:
Determine the distribution schedules among outlets.
Analyze loading patterns.
Medicine:
Characterize patient behavior to predict office visits.
Identify successful medical therapies for different illnesses.
Q20. What are the guidelines for a KDD environment?
Ans: The following are the guidelines for a KDD environment:
1. Support extremely large data sets: Data mining deals with extremely large data sets consisting of billions of records, and without proper platforms to store and handle these volumes of data, no reliable data mining is possible. Parallel servers with databases optimized for decision-support-oriented queries are useful. Fast and flexible access to large data sets is very important.
2. Support hybrid learning: Learning tasks can be divided into three areas: a. classification tasks, b. knowledge engineering tasks, c. problem-solving tasks. Not all algorithms perform well in all the above areas, as discussed in previous chapters. Depending on the requirement, one has to choose the appropriate one.
3. Establish a data warehouse: A data warehouse contains historic data and is subject-oriented and static; that is, users do not update the data, but it is created on a regular time-frame on the basis of the operational data of an organization.
4. Introduce data cleaning facilities: Even when a data warehouse is in operation, the data is certain to contain all sorts of heterogeneous mixtures. Special tools for cleaning data are necessary, and some advanced tools are available, especially in the field of de-duplication of client files.
5. Facilitate working with dynamic coding: Creative coding is the heart of the knowledge discovery process. The environment should enable the user to experiment with different coding schemes, store partial results, make attributes discrete, create time series out of historic data, select random sub-samples, separate test sets and so on.
Q21. Explain data mining for financial data analysis.
Ans: Financial data collected in the banking and financial industries are often relatively
complete, reliable and of high quality, which facilitates systematic data analysis and
data mining. The various issues are:
a) Design and construction of data warehouses for multidimensional data analysis and data mining: Data warehouses need to be constructed for banking and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses, and outlier analyses all play important roles in financial data analysis and mining.
b) Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan payment performance and customer credit rating. Data mining methods, such as feature selection and attribute relevance ranking, may help identify important factors and eliminate irrelevant ones.
c) Classification and clustering of customers for targeted marketing: Classification and clustering methods can be used for customer group identification and targeted marketing.

Q23. What is the importance of period of retention of data?


Ans: A businessman may say he wants the data to be retained for as long as possible: 5, 10, 15 years, the longer the better. The more data we have, the better the information generated. But such a view is unnecessarily simplistic. If a company wants to have an idea of its reorder levels, details of sales for the last 6 months to one year may be enough. A sales pattern from 5 years ago is unlikely to be relevant today. So, it is important to determine the retention period for each function; once it is drawn up, it becomes easy to decide on the optimum volume of data to be stored.
Q25. Give the advantages and disadvantages of equal segment partitioning.
Ans: The advantage is that the slots are reusable. Suppose we are sure that we will no longer need data from 10 years back; then we can simply delete the data in that slot and use it again. Of course, there is a serious drawback in the scheme if the partitions tend to differ too much in size. The number of visitors to a hill station, say, will be much larger in the summer months than in the winter months, and hence the corresponding partitions will differ greatly in size.
37. Define aggregation. Explain the steps required in designing a summary table.
Ans: Association: Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. Summary tables are designed by following these steps: a) Decide the dimensions along which aggregation is to be done. b) Determine the aggregation of multiple facts. c) Aggregate multiple facts into the summary table. d) Determine the level of aggregation and the extent of embedding. e) Design time into the table. f) Index the summary table.
Q30.Explain horizontal and vertical partitioning and differentiate them.
Ans: HORIZONTAL PARTITIONING - This essentially means that the table is partitioned after the first few thousand entries, then the next few thousand entries, and so on. This is because, in most cases, not all the information in the fact table is needed all the time. Thus horizontal partitioning helps to reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
a) Partition by time into equal segments: This is the most straightforward method of partitioning, by months or years etc. This will help if the queries often concern fortnightly or monthly performance / sales etc.
b) Partitioning by time into different-sized segments: This is a very useful technique to keep the physical table small and also the operating cost low.
VERTICAL PARTITIONING - A vertical partitioning scheme divides the table vertically: each row is divided into 2 or more partitions. We may not need to access all the data pertaining to a student all the time. For example, we may need either only his personal details like age, address etc., or only the examination details of marks scored etc. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This will speed up access.
Q27. Explain data mining for retail industry application.
Ans: The retail industry is a major application area for data mining, since it collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service records, and so on. The quantity of data collected continues to expand rapidly due to the web and e-commerce.
a) Design and construction of data warehouses based on the benefits of data mining: The first aspect is to design the warehouse. This involves deciding which dimensions and levels to include and what preprocessing to perform in order to facilitate quality and efficient data mining.
b) Multidimensional analysis of sales, customers, products, time and region: The retail industry requires timely information regarding customer needs, product sales, trends and fashions, as well as the quality, cost, profit and service of commodities. It is therefore important to provide powerful multidimensional analysis and visualization tools, including the construction of sophisticated data cubes according to the needs of data analysis. Purchase recommendations can be advertised on the web, in weekly flyers or on sales receipts to help improve customer service, aid customers in selecting items, and increase sales.
36. Explain multidimensional schemas.
Ans: This is a very convenient method of analyzing data when it goes beyond normal tabular relations. For example, a store maintains a table of each item it sells over a month, in each of its 10 outlets. This is a 2-dimensional table. On the other hand, if the company wants data on all items sold by its outlets, it can be obtained simply by superimposing the 2-dimensional tables for each of these items one behind the other. Then it becomes a 3-dimensional view. Then the query, instead of looking for a 2-dimensional rectangle of data, will look for a 3-dimensional cuboid of data. There is no reason why the dimensioning should stop at 3 dimensions. In fact, almost all queries can be thought of as approaching a multi-dimensional unit of data from a multidimensional volume of the schema. A lot of design effort goes into optimizing such searches.
Q26. Explain the Query generation.
Ans: Metadata is also required to generate queries. The query manager uses the metadata to build a history of all queries run and to generate a query profile for each user or group of users. We simply list a few of the commonly used metadata items for queries; the names are self-explanatory.
o Query: Table accessed - Column accessed, Name, Reference identifier.
o Restrictions applied - Column name, Table name, Reference identifier, Restrictions.
o Join criteria applied - Column name, Table name, Reference identifier, Column name, Table name, Reference identifier.
o Aggregate function used - Column name, Reference identifier, Aggregate function.
o Syntax
o Resources
o Disk