K-means algorithm
This algorithm takes as input a predefined number of clusters, which is the k in its
name. "Means" stands for an average: the average location of all the members of a
particular cluster. When dealing with clustering techniques, one has to adopt the notion
of a high-dimensional space, i.e. a space in which the orthogonal dimensions are the
attributes of the table of data we are analyzing. The value of each attribute of an
example represents the distance of the example from the origin along that attribute's
axis. The centroid of a cluster is the point whose coordinates are the averages of the
attribute values of all examples that belong to the cluster. The steps of the K-means
algorithm are given below.
1. Randomly select k points (these can also be examples) to be the
seeds for the centroids of the k clusters.
2. Assign each example to the centroid closest to it,
forming in this way k exclusive clusters of examples.
3. Calculate the new centroid of each cluster by averaging
the attribute values of all examples belonging to that cluster.
Repeat steps 2 and 3 until the centroids no longer move.
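A minimal sketch of these steps in Python with NumPy is given below. The function name k_means, the data matrix X (examples as rows, attributes as columns) and the stopping rule are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of the K-means steps described above, using NumPy.
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k examples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each example to the closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (keep the old centroid if a cluster becomes empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```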
Q12.Explain the STAR-FLAKE schema in detail.
Ans-STAR FLAKE SCHEMAS
One of the key factors for a database designer is to ensure that the database is able to
answer all types of queries, even those that were not initially visualized by the
developer. To do this, it is essential to understand how the data within the database is
used. In a decision support system, which is what a data warehouse is basically
supposed to provide, a large number of different questions are asked about the same set
of facts. For example, given sales data, questions like
i) What is the average sales quantum of a particular item?
ii) Which were the most popular brands in the last week?
iii) Which item has the least turnaround time?
iv) How many customers returned to procure the same item within one month?
can all be asked. They are all based on the same sales data, but the method of viewing
the data to answer each question is different. The answers need to be obtained by
rearranging or cross-referencing different facts.
Q13.Explain the method for designing dimension tables.
Ans-DESIGNING DIMENSION TABLES
After the fact tables have been designed, it is essential to design the dimension tables.
However, the design of dimension tables need not be considered a critical activity,
though a good design helps in improving performance. It is also desirable to keep
their volumes relatively small, so that the cost of restructuring is low.
We now look at some of the commonly used dimensions.
Star dimension
Star dimensions speed up query performance by denormalising reference information
into a single table. They presume that the bulk of incoming queries analyze the facts by
applying a number of constraints to a single dimension of the data.
For example, the details of sales from a store can be stored as horizontal rows, and a
query selects one or a few of the attributes. Suppose a cloth store records the details of
its sales one below the other, and questions like how many white shirts of size 85" were
sold in one week are asked. All that the query has to do is apply the relevant constraints
to retrieve the information, as the sketch below illustrates. This technique works well in
situations where there are a number of entities, all related to the key dimension entity.
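A minimal sketch of this kind of single-table (denormalized) query, using pandas, is given below. The table and column names (sales, item, colour, size, week, qty) are illustrative assumptions, not taken from the text.

```python
# Sketch: answering a constrained question against one denormalized sales table.
import pandas as pd

sales = pd.DataFrame({
    "item":   ["shirt", "shirt", "trouser", "shirt"],
    "colour": ["white", "blue", "black", "white"],
    "size":   [85, 90, 80, 85],
    "week":   [1, 1, 1, 2],
    "qty":    [3, 2, 1, 4],
})

# "How many white shirts of size 85 were sold in week 1?" becomes a set of
# constraints applied directly to the single table, with no joins.
mask = (sales["item"] == "shirt") & (sales["colour"] == "white") \
       & (sales["size"] == 85) & (sales["week"] == 1)
print(sales.loc[mask, "qty"].sum())
```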
Q14. Explain horizontal partitioning briefly.
Ans- Needless to say, the data warehouse design process should try to maximize the
performance of the system. One of the ways to ensure this is to optimize the database
design with respect to the specific hardware architecture. Obviously, the exact details
of optimization depend on the hardware platform.
Normally the following guidelines are useful:
i. Maximize the processing, disk and I/O operations.
ii. Reduce bottlenecks at the CPU and I/O.
The following mechanisms come in handy.
4.3.1 Maximising the processing and avoiding bottlenecks
One of the ways of ensuring faster processing is to split a data query into several
parallel sub-queries, convert them into parallel threads, and run them in parallel. This
method will work only when there is a sufficient number of processors, or sufficient
processing power, to ensure that they can actually run in parallel. (Again, note that to
run five threads it is not always necessary to have five processors; even a smaller
number of processors can do the job, provided they are fast enough to avoid
bottlenecks at the processor.)
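A minimal sketch of splitting one query into parallel sub-queries is given below. The partitioning of the table and the scan_partition() helper are illustrative assumptions, not a specific product's API.

```python
# Sketch: run one scan per horizontal partition in parallel and merge the results.
from concurrent.futures import ThreadPoolExecutor

def scan_partition(rows, predicate):
    # Scan one partition and return only the matching rows.
    return [r for r in rows if predicate(r)]

def parallel_query(partitions, predicate, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scan_partition, p, predicate) for p in partitions]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

# Example: four monthly partitions of a sales table, queried in parallel.
partitions = [
    [{"month": m, "item": "shirt", "qty": q} for q in range(3)]
    for m in (1, 2, 3, 4)
]
print(len(parallel_query(partitions, lambda r: r["item"] == "shirt")))
```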
Normalisation
The usual approach to normalization in database applications is to ensure that the data
is divided into two or more tables, such that when the data in one of them is updated it
does not lead to anomalies of data (the student is advised to refer to any book on
database management systems for details, if interested).
The idea is to ensure that, when combined, the data available is consistent.
However, in data warehousing, one may even tend to break a large table into several
denormalized smaller tables. This may lead to a lot of extra space being used, but it
helps in an indirect way: it avoids the overhead of joining the data during queries.
To make things clear consider the following table.
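The table referred to above is not reproduced in these notes. As an illustration only, the sketch below contrasts a normalized pair of tables with a single denormalized table; the table and column names are assumptions.

```python
# Sketch of the normalized-vs-denormalized trade-off described above.
import pandas as pd

# Normalized form: two tables, joined at query time (join overhead, less space).
customers = pd.DataFrame({"cust_id": [1, 2], "city": ["Delhi", "Pune"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "cust_id": [1, 1, 2],
                       "amount": [200, 150, 300]})
joined = orders.merge(customers, on="cust_id")

# Denormalized form: one wider table (more space, no join needed at query time).
orders_denorm = joined.copy()
print(orders_denorm[orders_denorm["city"] == "Delhi"]["amount"].sum())
```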
Q16. Explain the need for data marts in detail.
Ans-THE NEED FOR DATA MARTS
In a crude sense, if you consider a data warehouse as a storehouse of data, a data mart
is a retail outlet of data. Searching for any data in a huge storehouse is difficult, but if
the data is available at all, you will definitely be able to get it. On the other hand, in a
retail outlet, since the volume to be searched is small, you can access the data fast. But
it is possible that the data you are searching for is not available there, in which case
you have to go back to the main storehouse to search for it. Coming back to technical
terminology, the following are the reasons for which data marts are created.
i) Since the volume of data scanned is small, they speed up query processing.
ii) Data can be structured in a form suitable for user access.
iii) Data can be segmented or partitioned so that it can be used on different platforms,
and different control strategies also become applicable.
Q24. With the help of a block diagram, explain the typical process flow in a data
warehouse.
Ans- TYPICAL PROCESS FLOW IN A DATA WAREHOUSE
Any data warehouse must support the following activities:
i) Populating the warehouse (i.e. inclusion of data).
ii) Day-to-day management of the warehouse.
iii) Ability to accommodate changes.
The processes that populate the warehouse have to be able to extract the data, clean it
up, and make it available to the analysis systems. This is done on a daily or weekly
basis, depending on the quantum of data to be incorporated.
The day-to-day management of the data warehouse is not to be confused with the
maintenance and management of hardware and software. When large amounts of data
are stored and new data are continually being added at regular intervals, maintaining
the quality of the data becomes an important element. The ability to accommodate
changes implies that the system is structured in such a way as to be able to cope with
future changes without the entire system being remodeled. Based on these, we can
view the processes that a typical data warehouse scheme should support as follows.
Q25. How does naive Bayesian classification work?
Ans- Naive Bayesian Classification
The naive Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1,
x2, ..., xn), depicting n measurements made on the sample from n attributes,
respectively A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample X
(i.e., one having no class label), the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X. That is, the naive Bayesian
classifier assigns an unknown sample X to the class Ci if and only if
P(Ci | X) > P(Cj | X) for 1 <= j <= m, j != i.
Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called
the maximum posteriori hypothesis. By Bayes' theorem,
P(Ci | X) = P(X | Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. If
the class prior probabilities are not known, then it is commonly assumed that the
classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore
maximize P(X | Ci). Otherwise, we maximize P(X | Ci) P(Ci). Note that the class prior
probabilities may be estimated by P(Ci) = si / s, where si is the number of training
samples of class Ci, and s is the total number of training samples.
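A minimal sketch of the naive Bayesian classifier described above, for categorical attributes, is given below. The training data and attribute values are illustrative assumptions; P(Ci) is estimated as si/s and P(X | Ci) as the product of per-attribute conditional probabilities (the "naive" independence assumption).

```python
# Sketch of a naive Bayesian classifier for categorical attributes.
from collections import Counter, defaultdict

def train(samples, labels):
    s = len(labels)
    priors = {c: si / s for c, si in Counter(labels).items()}   # P(Ci) = si / s
    cond = defaultdict(Counter)   # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            cond[(c, i)][v] += 1
    return priors, cond

def classify(x, priors, cond):
    best_class, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for i, v in enumerate(x):
            counts = cond[(c, i)]
            score *= counts[v] / max(sum(counts.values()), 1)   # P(xi | Ci)
        if score > best_score:                                   # maximize P(X|Ci)P(Ci)
            best_class, best_score = c, score
    return best_class

# Tiny illustrative example (hypothetical data).
X_train = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y_train = ["no", "yes", "yes"]
priors, cond = train(X_train, y_train)
print(classify(("sunny", "cool"), priors, cond))
```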
Training Bayesian Belief Networks
In the learning or training of a belief network, a number of scenarios are possible. The
network structure may be given in advance or inferred from the data. The network
variables may be observable or hidden in all or some of the training samples. The case
of hidden data is also referred to as missing values or incomplete data.
If the network structure is known and the variables are observable, then training the
network is straightforward. It consists of computing the CPT entries, as is similarly
done when computing the probabilities involved in naive Bayesian classification.
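A minimal sketch of estimating CPT entries from fully observed data, as described above, is given below: P(node = v | parents = p) is estimated by counting. The variables (Rain, WetGrass) and the sample records are illustrative assumptions.

```python
# Sketch: compute conditional probability table (CPT) entries by counting.
from collections import Counter

samples = [
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 1, "WetGrass": 1},
    {"Rain": 0, "WetGrass": 0},
    {"Rain": 0, "WetGrass": 1},
]

def cpt(samples, node, parents):
    joint = Counter()
    parent_counts = Counter()
    for s in samples:
        p = tuple(s[v] for v in parents)
        joint[(s[node], p)] += 1
        parent_counts[p] += 1
    # P(node = v | parents = p) = count(v, p) / count(p)
    return {(v, p): c / parent_counts[p] for (v, p), c in joint.items()}

print(cpt(samples, "WetGrass", ["Rain"]))
```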
Neural Network Topologies
The arrangement of neural processing units and their interconnections can have a
profound impact on the processing capabilities of the neural networks. In general, all
neural networks have some set of processing units that receive inputs from the outside
world, which we refer to appropriately as the input units. Many neural networks also
have one or more layers of hidden processing units that receive
inputs only from other processing units. A layer or slab of processing units receives a
vector of data or the outputs of a previous layer of units and processes them in parallel.
The set of processing units that represents the final result of the neural network
computation is designated as the output units. There are three major connection
topologies that define how data flows between the input, hidden, and output
processing units.
Backpropagation
Backpropagation learns by iteratively processing a set of training samples, comparing
the network's prediction for each sample with the actual known class label. For each
training sample, the weights are modified so as to minimize the mean squared error
between the network's prediction and the actual class. These modifications are made
in the backwards direction, that is, from the output layer, through each hidden layer,
down to the first hidden layer (hence the name backpropagation). Although it is not
guaranteed, in general the weights will eventually converge, and the learning process
stops. The algorithm is summarized in the figure; each step is described below.
Initialize the weights: The weights in the network are initialized to small random
numbers (e.g., ranging from -1.0 to 1.0, or -0.5 to 0.5). Each unit has a bias associated
with it, as explained below. The biases are similarly initialized to small random
numbers.
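A minimal sketch of backpropagation for a network with one hidden layer is given below, assuming sigmoid units and mean squared error as described above. The toy XOR data, layer sizes and learning rate are illustrative assumptions.

```python
# Sketch of backpropagation: forward pass, backward error propagation, weight update.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR inputs and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Initialize weights and biases to small random numbers (e.g., -0.5 to 0.5).
W1 = rng.uniform(-0.5, 0.5, size=(2, 4)); b1 = rng.uniform(-0.5, 0.5, size=4)
W2 = rng.uniform(-0.5, 0.5, size=(4, 1)); b2 = rng.uniform(-0.5, 0.5, size=1)

lr = 0.5
for epoch in range(5000):
    # Forward pass: propagate the inputs through the hidden and output layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: errors flow from the output layer back to the hidden layer.
    err_out = (out - y) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)
    # Update weights and biases so as to reduce the mean squared error.
    W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)

print(np.round(out, 2))   # predictions after training
```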
Nonlinear Regression
Polynomial regression can be modeled by adding polynomial terms to the basic linear
model. By applying transformations to the variables, we can convert the nonlinear
model into a linear one that can then be solved by the method of least squares.
Transformation of a polynomial regression model to a linear regression model:
consider a cubic polynomial relationship given by
Y = w0 + w1*X + w2*X^2 + w3*X^3.
To convert this equation to linear form, we define new variables
X1 = X, X2 = X^2, X3 = X^3.
Applying these assignments, the equation becomes
Y = w0 + w1*X1 + w2*X2 + w3*X3,
which is linear in X1, X2 and X3 and is therefore solvable by the method of least
squares.
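A minimal sketch of this transformation is given below: the new variables X1 = x, X2 = x^2, X3 = x^3 are built as columns and the resulting linear model is solved by least squares. The sample data are illustrative assumptions.

```python
# Sketch: fit a cubic polynomial by transforming it to a linear least-squares problem.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.5, 9.0, 28.5, 65.0])      # roughly 1 + x^3

# Design matrix: a column of ones (intercept) plus X1 = x, X2 = x^2, X3 = x^3.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Solve for the coefficients w0..w3 by the method of least squares.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(w, 3))
```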
In a feed-forward topology, we can bring all of the information to bear on a problem at
once and present it to the neural network. It is like a pop quiz, where the teacher walks
in, writes a set of facts on the board, and says, "OK, tell me the answer." You must take
the data, process it, and jump to a conclusion. In this type of neural network, the data
flows through the network in one direction, and the answer is based solely on the
current set of inputs.
5. Explain the concepts of data warehousing and data mining.
Ans. A data warehouse is a collection of a large amount of data, and these data are
pieces of information used to make suitable managerial decisions (a storehouse of
data); e.g. student data, the details of the citizens of a city, the sales of previous years,
or the number of patients that came to a hospital with different ailments. Such data
becomes a storehouse of information.
Data mining is the process of exploration and analysis, by automatic or semi-automatic
means, of large quantities of data in order to discover meaningful patterns and rules.
The main concept of data mining is to use a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data, and to extract them in
such a way that they can be put to use in areas such as decision support, prediction,
forecasting and estimation.
Q15. Define a data mining query in terms of primitives.
Ans: a) Growing Data Volume: The main reason for the necessity of automated
computer systems for intelligent data analysis is the enormous volume of existing and
newly appearing data that require processing. The amount of data accumulated each
day by various business, scientific, and governmental organizations around the world is
daunting.
b) Limitations of Human Analysis: Two other problems that surface when human
analysts process data are the inadequacy of the human brain when searching for
complex multifactor dependencies in data, and the lack of objectivity in such an
analysis.
c) Low Cost of Machine Learning: While data mining does not eliminate human
participation in solving the task completely, it significantly simplifies the job and
allows an analyst who is not a professional in statistics and programming to manage
the process of extracting knowledge from data.
Qno-List various applications of Data mining in various fields.
credit rating. Data mining methods, such as feature selection and attribute relevance
ranking, may help identify important factors and eliminate irrelevant ones.
c) Classification and clustering of customers for targeted marketing: Classification
and clustering methods can be used for customer group identification and targeted
marketing.