INFORMATION SYSTEMS
DATA WAREHOUSING AND DATA MINING
Assignment
Enrolment number: MBISMCT13727119
Self Declaration
I declare that the assignment submitted by me is not a verbatim/photostatic
copy from any website, book, journal or manuscript.
MBA
INFORMATION SYSTEMS
Enrolment number:
MBISMCT13727119
The objective of data mining is to identify valid, novel, potentially useful and
understandable correlations and patterns in existing data. In the popular
mind, data mining refers to finding answers in a company's data to questions
that an analyst or executive has not thought to ask. Data mining, however, does
create both data and insights that add to the knowledge of the organization.
Every new field has its initial successes that intrigue others to investigate it
further. Early results from data mining indicate:
People who buy Rolls Royce Car
Men who buy chicken on Friday night also tend to buy beer
Who is likely to repay a loan?
Appropriate play selection and match-ups in IPL games.
Usually, data mining leads to steady incremental changes rather than major
transformations. It leads to small advantages each year with each customer
and each project. Over time, these changes accumulate like compound
May 2015
stream, Web
Performance: efficiency, effectiveness and scalability
Pattern evaluation: the interestingness problem
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing one: knowledge
fusion
User interaction: Here the issues are related to the mode of user
interaction or how much of user interaction is involved:
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts:
There are also some issues concerning the applications used for data mining
and its social impacts which can be summarized as:
Domain-specific data mining & invisible data mining
Protection of data security, integrity and privacy
Data Collection and Data Organization:
What data has been collected and where is it?
How do I combine legacy systems with current data systems?
Customer Story
What is the meaning of some of these data values?
Modeling Issues and Data Difficulties:
Data Preparation
Rare or Unknown Targets
Over Sampling
Under coverage
Dirty Data
Errors
Missing Values
Dimension Reduction (Variable Selection)
Under and Over Fitting
Temporal Infidelity
Model Evaluation
Data integrity issue. Data analysis can only be as good as the data that
is being analyzed.
The issue of cost. The more powerful the data mining queries, the
greater the utility of the information, which increases the pressure for
larger, faster systems, which are more expensive.
Limited information: A database is often designed for purposes
different from data mining, and sometimes the properties or attributes
that would simplify the learning task are not present and cannot be
requested from the real world. Inconclusive data causes problems.
Noise and missing values: Databases are usually contaminated by
errors so it cannot be assumed that the data they contain is entirely
correct. Missing data can be treated by discovery systems in a number
of ways such as:
Simply disregard missing values
Omit the corresponding records
Infer missing values from known values
Treat missing data as a special value to be included additionally
in the attribute domain
Average over the missing values using Bayesian techniques.
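Three of the treatments listed above (omitting records, inferring values from
known values, and treating missing data as a special value) can be sketched in
a few lines of Python. This is only an illustrative sketch: the records, field
names and the "UNKNOWN" marker are hypothetical, the mean is just one possible
way to infer a value, and the Bayesian averaging option is not shown.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"customer": "A", "age": 34, "income": 52000},
    {"customer": "B", "age": None, "income": 48000},
    {"customer": "C", "age": 45, "income": None},
    {"customer": "D", "age": 29, "income": 61000},
]

def omit_incomplete(rows):
    """Omit every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def infer_with_mean(rows, field):
    """Infer missing values of `field` from the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = mean(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def mark_as_special(rows, field, marker="UNKNOWN"):
    """Treat missing data as an extra value in the attribute domain."""
    return [{**r, field: marker if r[field] is None else r[field]} for r in rows]

complete = omit_incomplete(records)        # keeps customers A and D only
imputed = infer_with_mean(records, "age")  # B's age becomes the mean of 34, 45, 29
marked = mark_as_special(records, "income")
```

Omitting records is the simplest option but shrinks the data set; the other two
keep every record at the cost of introducing values that were never observed.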
Data Uncertainty: Uncertainty refers to the severity of the error and
the degree of noise in the data. Data precision is an important
consideration in a discovery system.
Size, updates, and irrelevant fields: Databases tend to be large
and dynamic in that their contents are ever-changing as information is
added, modified or removed. The problem with this from the data
mining perspective is how to ensure that the rules are up-to-date and
consistent with the most current information.
Steps to overcome data mining problems:
Challenge: Dirty Data
For dirty data, use informed intuition and data profiling. Informed
intuition requires the analysts to really get to know the data. Data
profiling entails checking whether the data falls within predefined norms.
Look at a missing data plot to easily identify systematic patterns of
missing data.
Use an anomaly detector to flag records to put before subject matter
experts to further clean the data.
Calculate descriptive statistics about the data and visualize before
starting the modeling process.
Discussions with the business owners of the data can help to better
understand the quality.
Try to understand the complexity of the data by looking at multivariate
combinations of data values.
Training a decision tree for each variable, given the remaining variables,
makes it possible either to replace NULL values or to check deviating values.
Create an artificial multidimensional definition of outliers and of virtual
clusters using fuzzy sets to trap dirty data. Examining the trapped data
provides clues for writing programs that clean specific types of
"dirtiness".
Communicate the presence of "dirty" data to clients.
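As a rough illustration of the profiling steps above, the following Python
sketch computes descriptive statistics per field and flags records that are
missing a value or fall outside predefined norms, so that they can be put
before subject matter experts. The field names, the ranges in NORMS and the
sample rows are all hypothetical, and a simple range check stands in for a
real anomaly detector.

```python
from statistics import mean, stdev

# Hypothetical predefined norms: allowed (low, high) range per numeric field.
NORMS = {"age": (18, 100), "monthly_spend": (0, 10000)}

rows = [
    {"id": 1, "age": 34, "monthly_spend": 220},
    {"id": 2, "age": 150, "monthly_spend": 180},   # age outside the norm
    {"id": 3, "age": None, "monthly_spend": 205},  # missing age
    {"id": 4, "age": 41, "monthly_spend": -50},    # negative spend
    {"id": 5, "age": 29, "monthly_spend": 240},
]

def profile(rows, norms):
    """Return descriptive statistics per field and the ids of records to review."""
    report = {}
    flagged = set()
    for field, (low, high) in norms.items():
        values = [r[field] for r in rows if r[field] is not None]
        report[field] = {
            "missing": len(rows) - len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "stdev": stdev(values),
        }
        for r in rows:
            value = r[field]
            # Flag records that are missing the field or violate the norm.
            if value is None or not (low <= value <= high):
                flagged.add(r["id"])
    return report, sorted(flagged)

report, to_review = profile(rows, NORMS)
print(to_review)  # [2, 3, 4]
```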
Challenge: Explaining Data Mining to Others
Taking small impactful projects internally and then promoting those
projects throughout the organization helps adoption. Finally, serving
the data up in a meaningful application - BI tool - shows the
stakeholders what data mining is capable of delivering.
Initiate Knowledge Sharing Sessions about Data Mining basics and
purposes.
Graphical representations are very helpful.
Try to get management to buy into the solution developed.
Focus on dollars: the overall benefit of model application to the Balance
Sheet and P&L.
Measuring results against control groups is the best way to convince
people about data mining results.
Explaining results and their business impact with visual and graphical
presentations, together with historical trends and variance analysis,
helps explain business trends in the data to business users.
Visualize and explain models and model spaces. Explain and interpret
results. Show and explain evaluation and significance of results.
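The control group comparison mentioned above comes down to simple arithmetic.
The sketch below, with made-up campaign numbers and hypothetical function
names, compares the response rate of a group selected by a model with that of
a control group; it ignores statistical significance, which a real evaluation
would also need to address.

```python
def response_rate(responders, group_size):
    """Share of a group that responded."""
    return responders / group_size

def lift_over_control(target_resp, target_size, control_resp, control_size):
    """Absolute and relative improvement of the model-selected group."""
    target_rate = response_rate(target_resp, target_size)
    control_rate = response_rate(control_resp, control_size)
    absolute = target_rate - control_rate   # difference in response rates
    relative = target_rate / control_rate   # ratio of response rates
    return absolute, relative

# Hypothetical campaign: 1,000 customers in each group,
# 80 responders in the model-selected group and 50 in the control group.
absolute, relative = lift_over_control(80, 1000, 50, 1000)
print(f"absolute lift: {absolute:.1%}, relative lift: {relative:.2f}x")
```

With these numbers the model-selected group responds about 3 percentage
points more often, roughly 1.6 times the control rate.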
Challenge: Unavailability of Data / Difficult Access to Data
Try to find out which data is necessary for the business
Provide a dedicated resource for data collection
Organize meetings with client and develop a project plan for data
collection.
If the complete data set is unavailable, start work with the available data.
Provide necessary authorization to collect data
Design and implement a dedicated database model for data mining
purposes
Conclusion:
From the above discussion it can be concluded that, although there are
some major issues and challenges in data mining, there are also the
right tools and techniques to overcome those issues and still
make data mining very useful to any business.
data integration happen and what are its
Answer:
Data Integration
In today's world, volumes of data grow exponentially in all realms, from
personal data to enterprise and global data. It is therefore becoming
extremely important to be able to understand data sets and organize them.
Disciplines such as data integration, migration, synchronization and
business intelligence make this possible.
Design
The data integration initiative in a company must be an initiative of
business, not IT. There should be a champion who understands the data
assets of the enterprise and will be able to lead the discussion about the
long-term data integration initiative in order to make it consistent,
successful and beneficial.
Analysis of the requirements (BRS), i.e. why is the data integration being
done, and what are the objectives and deliverables? From what systems will
the data be sourced? Is all the data available to fulfill the requirements?
What are the business rules? What is the support model and SLA?
Analysis of the source systems, i.e. what are the options for extracting the
data from the systems (update notification, incremental extracts, full
extracts), and what is the required/available frequency of the extracts? What
is the quality of the data? Are the required data fields populated properly?
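Two of the source-system questions above, full versus incremental extracts and
whether required fields are populated, can be illustrated with a short Python
sketch. It assumes, purely for illustration, that each source row carries an
updated_at timestamp; the rows, field names and function names are
hypothetical.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "name": "Joe", "updated_at": datetime(2015, 5, 1, 9, 0)},
    {"id": 2, "name": "Mei", "updated_at": datetime(2015, 5, 3, 14, 30)},
    {"id": 3, "name": "", "updated_at": datetime(2015, 5, 4, 8, 15)},
]

def full_extract(rows):
    """Return every row, regardless of when it changed."""
    return list(rows)

def incremental_extract(rows, last_run):
    """Return only the rows changed after the previous extract."""
    return [r for r in rows if r["updated_at"] > last_run]

def unpopulated(rows, required_fields):
    """List (id, field) pairs where a required field is missing or empty."""
    return [(r["id"], f) for r in rows for f in required_fields
            if r.get(f) in (None, "")]

last_run = datetime(2015, 5, 2)
changed = incremental_extract(source_rows, last_run)  # rows 2 and 3
gaps = unpopulated(source_rows, ["name"])             # row 3 has no name
```

An incremental extract moves less data per run but depends on the source
system recording reliable change timestamps; a full extract does not.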
Implementation
Based on the business requirements, a feasibility study should be performed
to select the tools to implement the data integration system. Larger
enterprises that have already started other data integration projects are
in an easier position, as they already have experience and can extend the
existing system and exploit existing knowledge to implement the system
more effectively. There are cases, however, when using a new, better suited
platform or technology makes a system more effective than staying
with existing company standards. Examples include finding a more suitable
tool that scales better for future growth/expansion, choosing a solution
that lowers the implementation/support cost, lowering the license costs, or
migrating the system to a new/modern platform.
Testing
Along with the implementation, proper testing is a must to ensure that
the unified data are correct, complete and up-to-date.
least Performance Stress test (PST), Technical Acceptance Testing (TAT) and
User Acceptance Testing (UAT).
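One simple way to check that the unified data are correct and complete is to
reconcile them against the source by key. The Python sketch below is only an
illustration of that idea; the reconcile function, the "id" key and the sample
rows are hypothetical, and it does not represent any of the named test types.

```python
def reconcile(source_rows, unified_rows, key="id"):
    """Compare source and unified data by key and by content."""
    source = {r[key]: r for r in source_rows}
    unified = {r[key]: r for r in unified_rows}
    return {
        # Completeness: keys present in the source but not in the unified data.
        "missing_in_unified": sorted(source.keys() - unified.keys()),
        # Keys that appear in the unified data without a source record.
        "unexpected_in_unified": sorted(unified.keys() - source.keys()),
        # Correctness: records present on both sides whose content differs.
        "mismatched": sorted(k for k in source.keys() & unified.keys()
                             if source[k] != unified[k]),
    }

source_rows = [
    {"id": 1, "score": 10},
    {"id": 2, "score": 20},
    {"id": 3, "score": 15},
]
unified_rows = [
    {"id": 1, "score": 10},
    {"id": 2, "score": 25},  # value differs from the source
]

result = reconcile(source_rows, unified_rows)
print(result)
```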
Data Transformation:
In metadata and data warehousing, a data transformation converts a set of
data values from the data format of a source data system into the data
format of a destination data system.
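A minimal sketch of such a conversion in Python is shown below. The source and
destination formats are invented for illustration: the field names in
FIELD_MAP, the DD/MM/YYYY source date format and the string-valued amount are
all assumptions, not part of any particular system.

```python
from datetime import datetime

# Hypothetical mapping from source field names to destination field names.
FIELD_MAP = {"cust_nm": "customer_name", "ord_dt": "order_date", "amt": "amount"}

def transform(source_record):
    """Convert one source-format record into the destination format."""
    # Rename the fields and drop anything the destination does not know about.
    target = {FIELD_MAP[k]: v for k, v in source_record.items() if k in FIELD_MAP}
    # Assumption: source dates are DD/MM/YYYY strings; destination uses ISO 8601.
    target["order_date"] = (
        datetime.strptime(target["order_date"], "%d/%m/%Y").date().isoformat()
    )
    # Assumption: source amounts are strings; destination stores numbers.
    target["amount"] = float(target["amount"])
    return target

row = transform({"cust_nm": "Joe", "ord_dt": "15/05/2015", "amt": "120.50"})
print(row)
# {'customer_name': 'Joe', 'order_date': '2015-05-15', 'amount': 120.5}
```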
Apart from the data processing steps that are mandatory or statistical
requirements of the modeling algorithm, there are relatively simple data
processing steps that can yield significantly better results than
tweaking algorithm parameters. Some of these data processing steps are
likely to be industry or data specific, but they are widely useful. They
don't necessarily have to be statistical in nature.
Example:
Initial Data
Customer  Category  Score
Joe       Pen       10
Joe       Pencil    20
Mei       Pen       15
Joe       Pencil    25
Mei       Pencil    20

Converted data #01
Customer  Category  Score  Pen_Ind  Pencil_Ind
Joe       Pen       10     1        0
Joe       Pencil    20     0        1
Mei       Pen       15     1        0
Joe       Pencil    25     0        1
Mei       Pencil    20     0        1

Converted data #02
Customer  Category  Score  Pen score  Pencil score
Joe       Pen       10     10         -
Joe       Pencil    20     -          20
Mei       Pen       15     15         -
Joe       Pencil    25     -          25
Mei       Pencil    20     -          20
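The two conversions in the example (indicator columns and per-category score
columns) could be produced along the following lines. This is a sketch using
the example's five rows; the function names are hypothetical, and storing 0
where a score column does not apply is an assumption made so that the columns
can later be summed.

```python
# The initial data from the example: (customer, category, score).
initial = [
    ("Joe", "Pen", 10),
    ("Joe", "Pencil", 20),
    ("Mei", "Pen", 15),
    ("Joe", "Pencil", 25),
    ("Mei", "Pencil", 20),
]

def add_indicators(rows):
    """Add a 0/1 indicator column per category (Pen_Ind, Pencil_Ind)."""
    return [
        {"customer": c, "category": cat, "score": s,
         "Pen_Ind": int(cat == "Pen"), "Pencil_Ind": int(cat == "Pencil")}
        for c, cat, s in rows
    ]

def add_score_columns(rows):
    """Spread the score into one column per category (0 where not applicable)."""
    return [
        {"customer": c, "category": cat, "score": s,
         "pen_score": s if cat == "Pen" else 0,
         "pencil_score": s if cat == "Pencil" else 0}
        for c, cat, s in rows
    ]

with_indicators = add_indicators(initial)
with_scores = add_score_columns(initial)
```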
Summarization
Aggregate the data so that we have only one row per customer and sum
or average the dummy and/or raw columns. So we could change the
previously converted data #02 to:
Customer  Pen score  Pencil score
Joe       10         45
Mei       15         20
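The summarization step can be sketched in Python as a group-by-customer sum
over the per-category score columns. The row layout and the summarize function
are illustrative assumptions; cells that are empty in the example are
represented as 0 so they do not affect the totals.

```python
from collections import defaultdict

# Per-category score columns from the example; 0 stands for an empty cell.
converted = [
    {"customer": "Joe", "pen_score": 10, "pencil_score": 0},
    {"customer": "Joe", "pen_score": 0, "pencil_score": 20},
    {"customer": "Mei", "pen_score": 15, "pencil_score": 0},
    {"customer": "Joe", "pen_score": 0, "pencil_score": 25},
    {"customer": "Mei", "pen_score": 0, "pencil_score": 20},
]

def summarize(rows):
    """Aggregate to one row per customer by summing the score columns."""
    totals = defaultdict(lambda: {"pen_score": 0, "pencil_score": 0})
    for r in rows:
        totals[r["customer"]]["pen_score"] += r["pen_score"]
        totals[r["customer"]]["pencil_score"] += r["pencil_score"]
    return dict(totals)

summary = summarize(converted)
print(summary)
# {'Joe': {'pen_score': 10, 'pencil_score': 45},
#  'Mei': {'pen_score': 15, 'pencil_score': 20}}
```

Summing is used here to match the example; averaging the columns instead would
only require dividing each total by the number of rows per customer.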
Document the benefits which the business need to attain so that the
theory
Adopt a proven implementation methodology
Seek customer-specific customizations
Develop seamless integration with associated applications
Prepare a well-established procedure for data migration
Conclusion:
From the above discussion it can be concluded that, although there are some
challenges in data integration and transformation, there are also the
right tools, techniques and decisions to overcome those issues and
still make data integration very useful to any business.