John B. Rollins, Ph.D.

IBM Analytics | IBM Corporation

Foundational Data Science Methodology

2015 IBM Corporation

Introduction
Why we are interested in data science
- Solve problems and answer questions
- Gain useful insights through modeling to predict outcomes or discover
underlying patterns

Rapidly evolving technologies
- Platform growth
- In-database analytics
- Text analysis
- Automation

Data science methodology
Why?
- To provide a guiding strategy

What?
- General strategy that guides the processes and activities within a given
domain
- Does not depend on particular technologies or tools
- Not a set of techniques or recipes
- Provides the data scientist with a framework for how to proceed to obtain
answers

Methodology diagram
[Figure: cycle of ten stages, in the order covered on the following slides: Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]

Business understanding
Every project begins with business understanding.
- Clearly defined project objectives and requirements from the business perspective are key to a successful solution
- Business sponsors are most critical in this stage
  - Define problem and solution requirements
- Business sponsors are involved throughout the project
  - Provide domain expertise
  - Review intermediate findings
  - Ensure that the work generates the intended solution

Analytic approach
With a clear definition of the business problem, we define the analytic approach to solving the problem.
- Express the problem in the context of statistical and machine learning techniques
- Identify suitable technique(s)
- Examples
  - Classification to predict response to a promotion ("yes" or "no")
  - Clustering and associations for customer segmentation and market basket analysis

Data compilation
The chosen analytic approach determines the
data requirements.
- Content, formats, representations

Initial data collection is performed.
- Available data resources (structured, unstructured, semi-structured) relevant to the problem domain
- Decide whether to obtain less-accessible data elements
- Revise data requirements or collect more data, if needed

Then data understanding is gained.
- Descriptive statistics and visualization
- Content, quality, initial insights about data
- Additional data collection to fill gaps, if needed
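
As a rough illustration of the descriptive statistics step, the sketch below summarizes a single numeric column and counts its gaps. The `ages` sample and the `describe` helper are hypothetical names introduced here for illustration only; a real project would more likely use a data analysis library.

```python
import statistics

# Hypothetical sample column: customer ages, with None marking a missing value.
ages = [34, 45, None, 29, 45, 61, None, 38]

def describe(values):
    """Summarize one numeric column: size, gaps, and basic statistics."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }

summary = describe(ages)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A high `missing` count or an implausible `min` or `max` is the kind of initial insight that may prompt additional data collection.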

Data preparation
Data preparation encompasses all activities to construct the data set.
- Data cleaning
  - Handling missing or invalid values
  - Eliminating duplicate rows
  - Formatting properly
- Combining multiple data sources
- Transforming data
- Feature engineering
- Text analysis
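
The data cleaning activities listed above can be sketched in a few lines. This is a minimal example under stated assumptions: the `raw` records, their field names, and the `clean` helper are all hypothetical, and marking invalid ages as `None` is just one possible policy.

```python
# Hypothetical raw records; field names are assumptions for illustration.
raw = [
    {"id": 1, "age": "34", "city": " boston "},
    {"id": 2, "age": "", "city": "Austin"},        # missing age
    {"id": 1, "age": "34", "city": " boston "},    # exact duplicate of the first row
    {"id": 3, "age": "-5", "city": "DENVER"},      # invalid age
]

def clean(records):
    """Drop duplicate rows, mark missing or invalid ages as None, tidy text."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:              # eliminate duplicate rows
            continue
        seen.add(key)
        age_text = rec["age"].strip()
        # Only non-negative whole numbers count as valid ages here.
        age = int(age_text) if age_text.isdigit() else None
        cleaned.append({
            "id": rec["id"],
            "age": age,
            "city": rec["city"].strip().title(),   # consistent formatting
        })
    return cleaned

rows = clean(raw)
for row in rows:
    print(row)
```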

Accelerate data preparation by automating common steps

Modeling
Modeling focuses on developing models.
- Predictive or descriptive models
- According to the previously defined analytic approach
- Training set for predictive modeling

Highly iterative process
- Intermediate insights lead to refinements in data preparation and model specification
- Multiple algorithms and parameters are tried to find the best model for a given technique
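
To make the idea of trying multiple parameters concrete, here is a toy sketch, not a recommended modeling technique. It assumes a hypothetical training set of prior-purchase counts labeled with promotion response, and a deliberately simple one-parameter rule; `training_set`, `predict`, `accuracy`, and the candidate thresholds are all illustrative names and values.

```python
# Hypothetical training set: (number of prior purchases, responded to promotion?)
training_set = [
    (0, "no"), (1, "no"), (2, "no"), (2, "no"), (3, "yes"),
    (3, "yes"), (4, "no"), (5, "yes"), (6, "yes"), (8, "yes"),
]

def predict(purchases, threshold):
    """Toy classifier: predict "yes" at or above the threshold."""
    return "yes" if purchases >= threshold else "no"

def accuracy(data, threshold):
    """Fraction of records whose predicted label matches the actual label."""
    hits = sum(predict(x, threshold) == label for x, label in data)
    return hits / len(data)

# Try each candidate parameter value and keep the best-scoring one.
candidates = [1, 2, 3, 4, 5, 6]
scores = {t: accuracy(training_set, t) for t in candidates}
best_threshold = max(scores, key=scores.get)

print(scores)
print("best threshold:", best_threshold)
```

Scores measured on the training set tend to be optimistic, which is why a separate testing set is used during evaluation.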

Model evaluation
Model evaluation is performed during model development and before
model deployment.
- Understand the model's quality
- Ensure that it properly addresses the business problem

Diagnostic measures
- Suitable to the modeling technique used
- Testing set
- Refine model as needed

Statistical significance tests
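
For a yes/no classification model, diagnostic measures on a testing set might look like the sketch below. The `actual` and `predicted` labels are made-up values, and accuracy, precision, and recall are only one common choice of measures; other modeling techniques call for different diagnostics.

```python
# Hypothetical testing set: actual outcomes and the model's predictions.
actual    = ["yes", "no", "yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no"]

def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(actual, predicted)
metrics = {
    "accuracy": (tp + tn) / (tp + fp + fn + tn),
    # Guard against division by zero when a class is never predicted or present.
    "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    "recall": tp / (tp + fn) if (tp + fn) else 0.0,
}
print(metrics)
```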

Deployment and feedback


Once finalized, the model is deployed into a production environment.
- May be in a limited / test environment until model is proven
- Involves additional groups, skills, and technologies
  - Solution owner
  - Marketing
  - Application developers
  - IT administration

Feedback to assess model performance
- Gathering and analysis of feedback for assessment of the model's performance and impact
- Iterative process for model refinement and redeployment
- Accelerate through automated processes
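
One way to automate part of the feedback loop is to compare deployed predictions with actual outcomes as they arrive and flag periods where performance drops. The sketch below assumes hypothetical monthly feedback batches and an arbitrary `MIN_ACCURACY` threshold chosen purely for illustration.

```python
# Hypothetical feedback: per-period lists of (predicted, actual) outcomes.
feedback_batches = {
    "month 1": [("yes", "yes"), ("no", "no"), ("yes", "yes"), ("no", "no")],
    "month 2": [("yes", "yes"), ("no", "yes"), ("yes", "yes"), ("no", "no")],
    "month 3": [("yes", "no"), ("no", "yes"), ("yes", "yes"), ("no", "yes")],
}
MIN_ACCURACY = 0.7  # hypothetical threshold, chosen for illustration only

def batch_accuracy(pairs):
    """Share of predictions in one batch that matched the actual outcome."""
    return sum(p == a for p, a in pairs) / len(pairs)

# Record each period's accuracy and whether it fell below the threshold.
report = {}
for period, pairs in feedback_batches.items():
    acc = batch_accuracy(pairs)
    report[period] = (acc, acc < MIN_ACCURACY)

needs_refinement = [period for period, (_, flagged) in report.items() if flagged]
print(report)
print("refine and redeploy after:", needs_refinement)
```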

Ongoing value through good methodology


The methodology diagram illustrates the iterative nature of problem-solving in
a data science project.
Through feedback, refinement, and redeployment, models are continually
improved and adapted to evolving conditions.
The model continues to provide value to the organization for as long as
the solution is needed.
