
MBA

INFORMATION SYSTEMS
DATA WAREHOUSING AND DATA MINING

Assignment
Enrolment number: MBISMCT13727119
May 2015

Self Declaration
I declare that the assignment submitted by me is not a verbatim/photo static
copy from the website/books/journals/manuscripts.

Signature of the student

Signature of the faculty concerned

Q.1 Discuss in detail, the major issues in data mining.


Answer:
Data mining is the process of analyzing data from different perspectives
and summarizing it into useful information: information that can be used to
increase revenue, cut costs, or both. Data mining is an analytical tool that
allows users to analyze data from many different dimensions or
angles, categorize it and summarize the relationships identified. Technically,
data mining is the process of finding correlations or patterns among dozens
of fields in large relational databases.

The objective of data mining is to identify valid, novel, potentially useful and
understandable correlations and patterns in existing data. In the popular
mind, data mining refers to finding answers in a company's data to questions that an
analyst or executive has not thought to ask. Data mining does, however,
create both data and insights that add to the knowledge of the organization.

Every new field has initial successes that encourage others to investigate it
further. Early results from data mining concern questions and findings such as:
People who buy Rolls-Royce cars
Men who buy chicken on Friday night also tend to buy beer
Who is likely to repay a loan?
Appropriate play selection and match-ups in IPL games
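
A finding such as the chicken-and-beer pattern above is usually expressed as an association rule and judged by its support (how often both items occur together) and confidence (how often the second item occurs when the first does). The sketch below is illustrative only: the baskets are invented data and `rule_stats` is a hypothetical helper, not part of any library.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(baskets)
    with_a = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_a if consequent in b]
    support = len(with_both) / n if n else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


# Hypothetical Friday-night shopping baskets (invented for illustration).
baskets = [
    {"chicken", "beer"},
    {"chicken", "beer", "chips"},
    {"chicken", "bread"},
    {"milk", "bread"},
]

support, confidence = rule_stats(baskets, "chicken", "beer")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence but very low support may still be uninteresting, which is one reason pattern evaluation is treated as an issue in its own right.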

Usually, data mining leads to steady incremental changes rather than major
transformations. It leads to small advantages each year with each customer
and each project. Over time, these changes accumulate like compound
interest. Although breakthroughs occur from time to time, they cannot be
counted on to happen on a regular basis.

Major Issues in data mining:


Mining methodology and user interaction issues: These reflect the
kinds of knowledge mined, the ability to mine knowledge at multiple
granularities, the use of domain knowledge, ad hoc mining and knowledge
visualization.
Mining different kinds of knowledge in databases: Data mining should
cover a wide spectrum of data analysis and knowledge discovery tasks,
including data characterization, discrimination, association,
classification, clustering, trend and deviation analysis, and similarity
analysis.
Interactive mining of knowledge at multiple levels of abstraction: The
data mining process should be interactive. Interactive mining allows
users to focus the search for patterns, providing and refining data
mining requests based on returned results.
Incorporation of background knowledge: Background knowledge may
be used to guide the discovery process and allow discovered patterns
to be expressed in concise terms and at different levels of abstraction.
Data mining query languages and ad hoc mining: Relational query
languages (such as SQL) allow users to pose ad hoc queries for data
retrieval. Data mining query languages should similarly let users
describe ad hoc mining tasks.
Presentation and visualization of data mining results: Discovered
knowledge should be expressed in high-level languages, visual
representations or other expressive forms so that it can be
easily understood and directly used by humans.
Handling noisy or incomplete data: Data stored in a database may contain
noise, exceptional cases or incomplete data objects. When mining data
regularities, these objects may confuse the process, causing the
knowledge model constructed to overfit the data.
Pattern evaluation (the interestingness problem): A data mining system
can uncover thousands of patterns. Many of the patterns discovered
may be uninteresting to the given user, representing common
knowledge or lacking novelty.
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness and scalability
Pattern evaluation: the interestingness problem
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge fusion
User interaction: Here the issues are related to the mode of user
interaction or how much of user interaction is involved:
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts:
There are also some issues concerning the applications used for data mining
and its social impacts which can be summarized as:
Domain-specific data mining & invisible data mining
Protection of data security, integrity and privacy
Data Collection and Data Organization:
What data has been collected and where is it?
How do I combine legacy systems with current data systems?
Customer Story
What is the meaning of some of these data values?
Modeling Issues and Data Difficulties:
Data Preparation
Rare or Unknown Targets
Over Sampling
Under coverage
Dirty Data
Errors
Missing Values
Dimension Reduction (Variable Selection)
Under and Over Fitting
Temporal Infidelity
Model Evaluation

Skepticism and Communication:


Skepticism
Breaking the Rules (statisticians)
Magic (non-analytical individuals)
Communication
Performance issues: These include efficiency, scalability, and
parallelization of data mining algorithms.
Efficiency and scalability of data mining algorithms: To effectively
extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed and incremental mining algorithms: The huge size
of many databases, the wide distribution of data, and the
computational complexity of some data mining methods are factors
motivating the development of algorithms that divide data into
partitions that can be processed in parallel.
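
The partition-then-merge idea described above can be sketched as follows. This is a minimal illustration under stated assumptions: the records are invented, the "mining" step is just item counting, and threads are used only to keep the example self-contained (a real system would more likely use separate processes or machines).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split the records into n_parts roughly equal slices."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_items(chunk):
    """Local mining step: count item occurrences within one partition."""
    counts = Counter()
    for basket in chunk:
        counts.update(basket)
    return counts


def parallel_item_counts(records, n_parts=2):
    """Mine each partition concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_items, partition(records, n_parts)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical transaction data (invented for illustration).
records = [["pen", "pencil"], ["pen"], ["pencil", "eraser"], ["pen", "eraser"]]
totals = parallel_item_counts(records, n_parts=2)
print(totals)
```

The merge step works here because counts from separate partitions can simply be added; algorithms whose partial results do not combine this easily are harder to parallelize.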
Issues relating to the diversity of database types:
Handling of relational and complex types of data: Specific data mining
systems should be constructed for mining specific kinds of data.
Mining information from heterogeneous databases and global
information systems: Local- and wide-area computer networks connect
many sources of data, forming huge, distributed and heterogeneous
DBs.
Research Issues: Many issues still need to be addressed to reap quality
knowledge from the sophisticated algorithms available for data mining. For
example:
How good is the quality of discovered knowledge?
Does the same method always produce the same results?
Are different tools required for different application domains?
What factors affect tool performance?
How do human cognitive factors affect the results?
Primary research issues include:
Methodological issues in applying the tools
Performance improvement through the integration of tools and
techniques
Exogenous factors to consider in modeling
Interdisciplinary perspectives
The economics of data mining.
Other Issues in Data Mining:
Data is an important issue. Dealing with incomplete raw data or
erroneous input is not a trivial task.
Algorithms differ in the ways the models are generated. How the
quality of an algorithm is assessed, its robustness, scalability,
preprocessing, generalizability, and reliability are some of the critical
issues.
The way model performance is measured is an important
consideration. The same model performs differently in different
domains due to the quality of the data, normative criteria, and the
decision-maker factors involved.

Data integrity: Data analysis can only be as good as the data that
is being analyzed.
Cost: The more powerful the data mining queries, the
greater the utility of the information, which increases the pressure for
larger, faster systems, which are more expensive.
Limited information: A database is often designed for purposes
different from data mining, and sometimes the properties or attributes
that would simplify the learning task are not present, nor can they be
requested from the real world. Inconclusive data causes problems.
Noise and missing values: Databases are usually contaminated by
errors so it cannot be assumed that the data they contain is entirely
correct. Missing data can be treated by discovery systems in a number
of ways such as:
Simply disregard missing values
Omit the corresponding records
Infer missing values from known values
Treat missing data as a special value to be included additionally
in the attribute domain
Average over the missing values using Bayesian techniques.
Data Uncertainty: Uncertainty refers to the severity of the error and
the degree of noise in the data. Data precision is an important
consideration in a discovery system.
Size, updates, and irrelevant fields: Databases tend to be large
and dynamic in that their contents are ever-changing as information is
added, modified or removed. The problem with this from the data
mining perspective is how to ensure that the rules are up-to-date and
consistent with the most current information.
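
Three of the missing-value treatments listed above (omitting records, inferring values from known values, and treating missing data as a special value) can be sketched in a few lines. The rows, the field names and the choice of the mean as the inferred value are assumptions made for illustration only.

```python
from statistics import mean

MISSING = None  # how a missing value is represented in this sketch


def omit_records(rows, field):
    """Omit the records whose value for `field` is missing."""
    return [r for r in rows if r[field] is not MISSING]


def infer_from_known(rows, field):
    """Infer missing values from known ones (here: the mean of known values)."""
    known = [r[field] for r in rows if r[field] is not MISSING]
    fill = mean(known)
    return [{**r, field: fill if r[field] is MISSING else r[field]} for r in rows]


def as_special_value(rows, field, marker="UNKNOWN"):
    """Treat missing data as an extra value in the attribute domain."""
    return [{**r, field: marker if r[field] is MISSING else r[field]} for r in rows]


# Hypothetical customer rows (invented for illustration).
rows = [
    {"customer": "Joe", "age": 30},
    {"customer": "Mei", "age": None},
    {"customer": "Ana", "age": 40},
]

print(omit_records(rows, "age"))
print(infer_from_known(rows, "age"))
print(as_special_value(rows, "age"))
```

Each treatment has a cost: omitting records shrinks the data, inferred values can hide real variation, and a special value adds a category the model must learn to interpret.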
Steps to overcome data mining problems:
Challenge: Dirty Data
For dirty data, use informed intuition and data profiling.
Informed intuition requires analysts to really get to know the
data. Data profiling entails checking whether the data falls within predefined norms.
Look at a missing-data plot to easily identify systematic patterns of
missing data.
Use an anomaly detector to flag records to put before subject matter
experts, who can clean the data further.
Calculate descriptive statistics about the data and visualize it before
starting the modeling process.
Discuss the data with its business owners to better understand its
quality.
Try to understand the complexity of the data by looking at multivariate
combinations of data values.
Train a decision tree for each variable, given the remaining variables,
either to replace NULL values or to check deviating values.
Create an artificial multidimensional definition of outliers and of virtual
clusters using fuzzy sets, and use it to trap dirty data. Examining the
trapped data provides clues for writing programs that clean specific
types of "dirtiness".
Communicate the presence of "dirty" data to clients.
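
Two of the steps above, data profiling against predefined norms and flagging anomalies for expert review, can be sketched with the standard library alone. The sample values, the norm range of 0 to 120 and the two-standard-deviation threshold are assumptions chosen for illustration, not recommended settings.

```python
from statistics import mean, stdev


def profile(rows, field, low, high):
    """Data profiling: return rows whose value is missing or outside [low, high]."""
    return [r for r in rows if r[field] is None or not (low <= r[field] <= high)]


def flag_anomalies(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical records (invented for illustration).
rows = [
    {"customer": "Joe", "age": 34},
    {"customer": "Mei", "age": None},
    {"customer": "Ana", "age": 212},
]
suspect_rows = profile(rows, "age", low=0, high=120)

call_counts = [10, 12, 11, 13, 12, 11, 95]
anomalies = flag_anomalies(call_counts)

print(suspect_rows)
print(anomalies)
```

The flagged rows are candidates only; as the list above suggests, they are meant to be put before subject matter experts rather than corrected automatically.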
Challenge: Explaining Data Mining to Others
Taking small impactful projects internally and then promoting those
projects throughout the organization helps adoption. Finally, serving
the data up in a meaningful application - BI tool - shows the
stakeholders what data mining is capable of delivering.
Initiate Knowledge Sharing Sessions about Data Mining basics and
purposes.
Graphical representations are very helpful.
Try to get management to buy into the solution developed.
Focus on dollars: the overall benefit of applying the model to the Balance
Sheet and P&L.
Measuring results against control groups is the best way to convince
people about data mining results.
Explaining results and their business impact with visual and graphical
presentations, together with historical trends and variance analysis,
helps business users understand the trends in their data.
Visualize and explain models and model spaces. Explain and interpret
results. Show and explain evaluation and significance of results.
Challenge: Unavailability of Data / Difficult Access to Data
Try to find out which data is necessary for the business
Provide a dedicated resource for data collection
Organize meetings with client and develop a project plan for data
collection.
If the complete data set is unavailable, start work with the data that is available.
Provide necessary authorization to collect data
Design and implement a dedicated database model for data mining
purposes
Conclusion:
Thus, from the above discussion it can be concluded that although there are
some major issues and challenges in data mining, there are also
the right tools and techniques to overcome those issues and still
make data mining very useful to any business.

Q.2 How does data integration happen and what are its transformations?

Answer:
Data Integration
In today's world, volumes of data grow exponentially in all realms, from
personal data to enterprise and global data. It is therefore becoming extremely
important to be able to understand data sets and organize them. Disciplines
such as data integration, migration, synchronization and business
intelligence make this possible.
Data integration involves combining data residing in different sources and
providing users with a unified view of these data, which are stored using
various technologies. A complete data integration solution encompasses
discovery, cleansing, monitoring, transforming and delivery of data from a
variety of sources.
Data integration becomes increasingly important in cases of merging
systems of two companies or consolidating applications within one company
to provide a unified view of the company's data assets. The latter initiative is
often called a data warehouse. This process becomes significant in a variety
of situations, which include both commercial and scientific domains.
The basic steps to be followed during data integration are:
Understand the information and foster collaboration between
business and IT: Encourage a standardized approach to discovering your
IT assets and establishing a common business language.
Cleanse data and monitor data quality: Analyze, cleanse, monitor and
manage data, enabling better business decisions and improved business
process execution.
Transform data in any style and deliver it to any system: Integrate data
on demand across multiple sources and targets, while satisfying the most
complex requirements with the most scalable runtime available.

Data integration Architecture overview

Challenges of Data Integration


At first glance, the biggest challenge is the technical implementation of
integrating data from disparate, often incompatible sources. However, a
much bigger challenge lies in managing data integration as a whole. It has to
include the following phases:

Design
The data integration initiative in a company must be an initiative of
business, not IT. There should be a champion who understands the data
assets of the enterprise and will be able to lead the discussion about the
long-term data integration initiative in order to make it consistent,
successful and beneficial.
Analysis of the requirements (BRS), i.e. why is the data integration being
done, what are the objectives and deliverables. From what systems will
the data be sourced? Is all the data available to fulfill the requirements?
What are the business rules? What is the support model and SLA?
Analysis of the source systems, i.e. what are the options of extracting the
data from the systems (update notification, incremental extracts, full
extracts), what is the required/available frequency of the extracts? What
is the quality of the data? Are the required data fields populated properly
and consistently? Is the documentation available? What are the data
volumes being processed? Who is the system owner?
Any other non-functional requirements such as data processing window,
system response time, estimated number of (concurrent) users, data
security policy, backup policy.
What is the support model for the new system? What are the SLA
requirements?
And last but not least, who will be the owner of the system and what is
the funding of the maintenance and upgrade expenses?
The results of the above steps need to be documented in the form of an SRS
document, confirmed and signed off by all parties which will be
participating in the data integration project.
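
The difference between the full and incremental extract options named in the source-system analysis above can be sketched as follows. The row layout, the `updated_at` field and the idea of remembering the time of the last run are assumptions made for illustration; real source systems may expose changes differently (for example through update notifications).

```python
from datetime import datetime

# Hypothetical source table (invented for illustration).
source_rows = [
    {"id": 1, "updated_at": datetime(2015, 5, 1)},
    {"id": 2, "updated_at": datetime(2015, 5, 10)},
    {"id": 3, "updated_at": datetime(2015, 5, 20)},
]


def full_extract(rows):
    """Full extract: copy every row from the source system."""
    return list(rows)


def incremental_extract(rows, last_run):
    """Incremental extract: only rows changed since the previous run."""
    return [r for r in rows if r["updated_at"] > last_run]


last_run = datetime(2015, 5, 5)  # assumed time of the previous extract
everything = full_extract(source_rows)
changed = incremental_extract(source_rows, last_run)

print(len(everything), [r["id"] for r in changed])
```

An incremental extract moves less data per run but depends on the source reliably recording when each row changed, which is one of the data quality questions raised above.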

Implementation
Based on the business requirement, a feasibility study should be performed
to select the tools to implement the data integration system. The larger
enterprises which already have started other projects of data integration are
in an easier position as they already have experience and can extend the
existing system and exploit the existing knowledge to implement the system
more effectively. There are cases, however, when using a new, better suited
platform or technology makes a system more effective compared to staying
with existing company standards. Examples include finding a more suitable tool
that scales better for future growth or expansion, choosing a solution that
lowers the implementation, support or license costs, and
migrating the system to a newer platform.

Testing
Along with the implementation, proper testing is a must to ensure that
the unified data are correct, complete and up-to-date.

Both technical (IT) and business staff need to participate in the testing to ensure
that the results are as expected. Therefore, the testing should incorporate at
least Performance Stress test (PST), Technical Acceptance Testing (TAT) and
User Acceptance Testing (UAT).

Data Integration Techniques


There are several organizational levels at which integration can be
performed. Going down the list below, the level of automated integration increases.

Manual Integration or Common User Interface - users work with all
the relevant information by accessing each source system or web page
interface directly. No unified view of the data exists.

Application Based Integration - requires the particular applications to
implement all the integration efforts. This approach is manageable only
when the number of applications is very limited.

Middleware Data Integration - transfers the integration logic from
particular applications to a new middleware layer. Although the integration
logic is no longer implemented in the applications, the applications still
need to participate partially in the data integration.

Uniform Data Access or Virtual Integration - leaves data in the source
systems and defines a set of views that give the customer a unified view
across the whole enterprise. For example, when a user accesses
the customer information, the particular details of the customer are
transparently acquired from the respective system. The main benefits of
virtual integration are nearly zero latency in propagating data updates
from the source system to the consolidated view and no need for a separate store
for the consolidated data. However, the drawbacks include limited support
for data history and version management, applicability
only to 'similar' data sources (e.g. the same type of database), and the fact that
access to the user data generates extra load on the source systems,
which may not have been designed to accommodate it.
Common Data Storage or Physical Data Integration - usually means
creating a new system which keeps a copy of the data from the source
systems to store and manage it independently of the original systems. The
best-known example of this approach is the Data Warehouse (DW).
The benefits include data version management and the ability to combine data from very
different sources (mainframes, databases, flat files, etc.). Physical
integration, however, requires a separate system to handle the vast volumes
of data.
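
The trade-off between virtual integration and common data storage can be made concrete with a small sketch. The two classes, the `crm` and `billing` sources and their contents are hypothetical names invented for illustration; they are not taken from any particular product.

```python
class VirtualView:
    """Uniform data access: data stays in the sources and is read on demand."""

    def __init__(self, *sources):
        self.sources = sources  # each source maps customer id -> dict of fields

    def customer(self, cid):
        merged = {}
        for source in self.sources:
            merged.update(source.get(cid, {}))
        return merged


class Warehouse:
    """Common data storage: keeps its own copy, refreshed by an explicit load."""

    def __init__(self):
        self.store = {}

    def load(self, *sources):
        for source in sources:
            for cid, fields in source.items():
                self.store.setdefault(cid, {}).update(fields)

    def customer(self, cid):
        return dict(self.store.get(cid, {}))


# Hypothetical source systems (invented for illustration).
crm = {1: {"name": "Joe"}}
billing = {1: {"balance": 120}}

view = VirtualView(crm, billing)
dw = Warehouse()
dw.load(crm, billing)

billing[1]["balance"] = 80  # an update made in the source system

print(view.customer(1))  # reflects the update immediately
print(dw.customer(1))    # still holds the copy from the last load
```

The virtual view sees the update at once but queries the sources on every access, while the warehouse serves its own copy and only catches up at the next load.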

Data Transformation:
In metadata and data warehousing, a data transformation converts a set of
data values from the data format of a source data system into the data
format of a destination data system.

Data transformation can be divided into two steps:

Data mapping, which maps data elements from the source data system to the
destination data system and captures any transformation that must occur
Code generation, which creates the actual transformation program

Data element to data element mapping is frequently complicated by complex
transformations that require one-to-many and many-to-one transformation rules.
The code generation step takes the data element mapping specification and creates
an executable program that can be run on a computer system. Code generation can
also create transformations in easy-to-maintain computer languages.
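
The two steps above can be illustrated with a toy example in which a mapping specification is turned into a generated program. Everything here is an assumption made for illustration: the field names, the sample row, and the choice to represent each mapping rule as a Python expression string. Generated code like this should only ever be built from trusted specifications.

```python
def generate_transform(mapping):
    """Code generation: emit Python source for a transform(row) function."""
    lines = ["def transform(row):", "    out = {}"]
    for dest, expr in mapping.items():
        lines.append(f"    out[{dest!r}] = {expr}")
    lines.append("    return out")
    return "\n".join(lines)


# Data mapping: destination field -> rule over source fields (hypothetical).
MAPPING = {
    # many-to-one: two source fields feed one destination field
    "full_name": "row['first_name'] + ' ' + row['last_name']",
    # one-to-many: one source field feeds two destination fields
    "city": "row['location'].split(',')[0].strip()",
    "country": "row['location'].split(',')[1].strip()",
}

source_code = generate_transform(MAPPING)
namespace = {}
exec(source_code, namespace)  # turn the generated source into a callable
transform = namespace["transform"]

result = transform(
    {"first_name": "Joe", "last_name": "Smith", "location": "Pune, India"}
)
print(source_code)
print(result)
```

Keeping the mapping separate from the generated program means a change in the source or destination format only requires editing the specification and regenerating the code.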

A simple Data Transformation example:


In customer-focused data mining projects, over 80% of the
time is spent preparing and transforming the customer data into a usable
format. Often the data is transformed to a 'single row per customer' or
similar summarized format, and many columns are created to act as inputs
into predictive or clustering models. Such data transformation can also be
referred to as ETL (extract, transform, and load).
Data Transformation Sample

Apart from the data processing steps that are mandatory or are statistical
requirements of the modeling algorithm, there are relatively simple data
processing steps that can yield significantly better results than
tweaking algorithm parameters. Some of these steps are
likely to be industry or data specific, but many are widely useful. They don't
necessarily have to be statistical in nature.

Example data processing steps:


Creation of additional dummy columns:
Where the data has a single category column that contains one of several
values (for example voice calls, SMS calls, data calls, etc.) we can use a
CASE statement to create a new column for each category. We can use 0
or 1 as an indicator of whether the category value occurs in a specific row, but we
can also use the value of a numeric field (for example call count or
duration, if the data is already partly summarized). A new column is
created for each category value.

Example:

Initial Data

Customer  Category  Score
Joe       Pen       10
Joe       Pencil    20
Mei       Pen       15
Joe       Pencil    25
Mei       Pencil    20

Converted Data #01

Customer  Category  Score  Pen_Ind  Pencil_Ind
Joe       Pen       10     1        0
Joe       Pencil    20     0        1
Mei       Pen       15     1        0
Joe       Pencil    25     0        1
Mei       Pencil    20     0        1

Converted Data #02

Customer  Category  Score  Pen score  Pencil score
Joe       Pen       10     10         0
Joe       Pencil    20     0          20
Mei       Pen       15     15         0
Joe       Pencil    25     0          25
Mei       Pencil    20     0          20
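
The dummy-column step can be sketched in Python using the same Joe and Mei rows as the example. The function names and the dictionary layout are hypothetical choices for this illustration; in a database the same result would typically come from CASE expressions, as described earlier.

```python
# Initial data from the example: (customer, category, score).
rows = [
    ("Joe", "Pen", 10),
    ("Joe", "Pencil", 20),
    ("Mei", "Pen", 15),
    ("Joe", "Pencil", 25),
    ("Mei", "Pencil", 20),
]
categories = ["Pen", "Pencil"]


def add_indicator_columns(rows, categories):
    """Converted Data #01 style: one 0/1 indicator column per category value."""
    return [
        {"customer": c, "category": cat, "score": s,
         **{f"{k}_Ind": int(cat == k) for k in categories}}
        for c, cat, s in rows
    ]


def add_score_columns(rows, categories):
    """Converted Data #02 style: the score placed in its category's column."""
    return [
        {"customer": c, "category": cat, "score": s,
         **{f"{k} score": s if cat == k else 0 for k in categories}}
        for c, cat, s in rows
    ]


converted_01 = add_indicator_columns(rows, categories)
converted_02 = add_score_columns(rows, categories)

for row in converted_02:
    print(row)
```

Cells that do not belong to a row's category are filled with 0 here, which is an assumption of this sketch; it keeps the columns numeric so they can be summed or averaged later.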

Summarization
Aggregate the data so that we have only one row per customer and sum
or average the dummy and/or raw columns. So we could change the
previously converted data #02 to:

Customer  Pen score  Pencil score
Joe       10         45
Mei       15         20
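
The summarization step, one row per customer with the score columns summed, can be sketched as below. The input repeats the per-category score columns from the example, with 0 assumed for cells that do not apply; `summarize` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# Per-category score columns from the example (0 assumed where not applicable).
converted = [
    {"customer": "Joe", "Pen score": 10, "Pencil score": 0},
    {"customer": "Joe", "Pen score": 0, "Pencil score": 20},
    {"customer": "Mei", "Pen score": 15, "Pencil score": 0},
    {"customer": "Joe", "Pen score": 0, "Pencil score": 25},
    {"customer": "Mei", "Pen score": 0, "Pencil score": 20},
]


def summarize(rows, key, columns):
    """Aggregate to one row per key value by summing the given columns."""
    totals = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        for col in columns:
            totals[row[key]][col] += row[col]
    return dict(totals)


summary = summarize(converted, "customer", ["Pen score", "Pencil score"])
print(summary)
```

Summing is only one choice; averaging the same columns would follow the same pattern with a count kept per customer.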

Steps to overcome Data Transformation Challenges:


Identifying measurable and tangible business benefits
Document the benefits which the business needs to attain so that the
scope and requirements are defined properly
Do not ignore business process reengineering
Do not underestimate the role of the system integrator
Ensure proper governance, communication and stakeholder management
Thoroughly study and prepare a risk and mitigation plan for the big-bang
approach
Adopt a proven implementation methodology
Seek customer-specific customizations
Develop seamless integration with associated applications
Prepare a well-established procedure for data migration

Conclusion:
From the above discussion it can be concluded that although there are some
challenges in data integration and transformation, there are also
the right tools, techniques and decisions to overcome
those issues and still make data integration very useful to any business.
