
What are common mistakes in Data Science projects?
(and how to avoid them?)
Artur Suchwałko, Ph.D., QuantUp

AI & Big Data 2018, March 10, 2018, Lviv, Ukraine


Real-world Data Science projects

• Kaggle competitions and real Data Science projects are two quite
different disciplines
• Once a data frame is prepared, the rest is easy
• What is done incorrectly and can be corrected?
• Analysis of a business problem
• Data
• Process
• Methods, models
• Hardware, software
• People

(Everything based on practical experience: 20 years, 100 projects, 3,000
hours of workshops.
For the majority of topics I could add quotes from talks.)
Analysis of a business problem
No. We don’t want to build a model of
production and storage in our factory
Problem:
• We’d just like to optimize cutting a log (the trunk of a dead tree) into
planks
• Let’s do it in the simplest way. Why should we waste time and
money?
• The others can do it. Why do you make it complicated?!?

Solution:
• Build the production and storage model
• Otherwise you will optimize log cutting in a different sawmill
• or something completely different
Solving the wrong analytical problem

Problem:
• Stating the wrong problem and solving it can decrease the predictive
ability of a model
• Similarly, removing so-called false predictors (leaks from the future)
lowers the measured predictive power
• But we never want pure predictive power. Usually the business
wants actionability and real value
Solution:
• Focus on what influences your business
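
One way to look for false predictors is to check whether any single feature separates the target almost perfectly on its own, which is rarely legitimate. The sketch below is only an illustration using the Python standard library. The feature names (`age`, `closing_fee`), the synthetic churn data, and the 0.95 threshold are assumptions made for the example and do not come from the talk.

```python
import random

def auc(scores, labels):
    """Probability that a random positive case scores above a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def suspicious_features(columns, labels, threshold=0.95):
    """Return names of features that separate the target almost perfectly alone."""
    flagged = []
    for name, values in columns.items():
        a = auc(values, labels)
        # An AUC near 0 is as suspicious as one near 1 (the sign is just flipped).
        if max(a, 1.0 - a) >= threshold:
            flagged.append(name)
    return flagged

# Synthetic churn data (hypothetical): 1 = customer left, 0 = customer stayed.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
columns = {
    # Unrelated to the target, so its AUC should be close to 0.5.
    "age": [rng.uniform(18, 80) for _ in labels],
    # Only charged after the customer has left: a leak from the future.
    "closing_fee": [10.0 * y + rng.random() for y in labels],
}

flagged = suspicious_features(columns, labels)
print(flagged)
```

A flagged feature is not proof of a leak. It is a prompt to ask when, in the real process, the value becomes known.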
Data
Preparation of a development sample is not
very important

Problem:
• Let’s take a sample and model!
• Preparation of the development sample decides if the model will fit
the reality we model or not
• The data, and thus the sample, are generated (or influenced) by a
process that must be well known and understood
Solution:
• Think it over really carefully.
We have Big Data. We need to implement
Big Data solutions
Problem:
• If you can email your data or fit it on a pendrive, you don’t
have Big Data!
• Many Data Science tasks on millions of records can be completed
on (powerful) laptops
• Decisions are either data-driven or not. It’s not about the magnitude of
the data but about the way decisions are made
Solution:
• Be (more than) sure that we need Big Data technologies for storing
and processing
• During PoC / prototype stage don’t use Big Data tools
• Important: Not valid for some problems
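
A quick back-of-the-envelope check can show whether a dataset is likely to fit on a laptop. The sketch below assumes a purely numeric table stored as 8-byte values, a 16 GiB machine, and a headroom factor of 3 for working copies. All three figures are illustrative assumptions, not recommendations from the talk.

```python
def estimated_size_gib(n_rows, n_columns, bytes_per_value=8):
    """Rough in-memory size of a purely numeric table, in GiB."""
    return n_rows * n_columns * bytes_per_value / 1024 ** 3

def fits_in_memory(n_rows, n_columns, ram_gib=16, headroom=3.0):
    """True if the table, with room for working copies, fits in RAM."""
    return estimated_size_gib(n_rows, n_columns) * headroom <= ram_gib

# Hypothetical example: 10 million records with 50 numeric columns.
size = estimated_size_gib(10_000_000, 50)
print(f"{size:.1f} GiB, fits in 16 GiB of RAM: {fits_in_memory(10_000_000, 50)}")
```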
Use social media data

Problem:
• It’s a tremendous effort if you don’t use an off-the-shelf solution
• Usually the business value is small

Solution:
• Be sure that the effort will be rewarded
Process
Let’s build a model in one week

Problem:
• It’s possible (in theory)
• If you don’t analyze the process thoughtfully and don’t detect false
predictors then the model will not work in production
• We will be really happy to see how well it performs on our
development sample
Solution:
• Take enough time
• Be sure that the process is correct
There is too little time to complete the task /
model
Problem:
• Data problems
• Getting stuck in preprocessing
• The implementation takes too long
• Too little experience
Solution:
• Prepare a full product as soon as possible, e.g.:
  • a cut-down version of all the functionality, e.g. a scoring application
    with a simple / dummy model
  • the full code for building the model, but using simpler methods
  • improve it in the next iterations
• Use CRISP-DM / a checklist to support your memory
• Usually you can start implementation from the first product version
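
The idea of a full product with a dummy model can be sketched as follows. `BaseRateModel`, `score`, and the sample records are hypothetical names and data invented for this illustration. The point is only that the scoring step depends on a small interface (`fit`, `predict_proba`), so a real model can replace the dummy in a later iteration.

```python
class BaseRateModel:
    """Dummy model: predicts the share of positives seen in training for everyone."""

    def fit(self, rows, labels):
        self.rate_ = sum(labels) / len(labels)
        return self

    def predict_proba(self, rows):
        return [self.rate_ for _ in rows]

def score(model, rows):
    """Scoring step of the product; it only relies on fit/predict_proba."""
    probabilities = model.predict_proba(rows)
    return [{"id": r["id"], "score": p} for r, p in zip(rows, probabilities)]

# Hypothetical training data and new customers.
train_rows = [{"id": i, "income": 1000 * i} for i in range(1, 5)]
train_labels = [1, 0, 0, 0]
new_rows = [{"id": 101, "income": 2500}, {"id": 102, "income": 7000}]

model = BaseRateModel().fit(train_rows, train_labels)
scores = score(model, new_rows)
print(scores)
```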
The way you prepare the result (a model, a data
product) doesn’t matter

Problem:
• I want a model. It must work. I don’t care how you’ll build it. Just
build it!
• The process is crucial
• If it is wrong then the analysis is not fully reproducible
• We take on technical debt
• and sooner or later we will be forced to pay it back
Solution:
• Build models in a fully reproducible way
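
A minimal sketch of what reproducibility can mean in code: fix the random seed, fingerprint the input data, and store both with the result. The function names and the record layout are assumptions made for this example. A real project would also record code and library versions.

```python
import hashlib
import json
import random

def fingerprint(rows):
    """SHA-256 of the input data, so a run can be tied to the exact data used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def run_experiment(rows, seed, train_fraction=0.8):
    """Split the data with a seeded generator and return a record of the run."""
    rng = random.Random(seed)  # local generator: no hidden global state
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return {
        "seed": seed,
        "train_fraction": train_fraction,
        "data_sha256": fingerprint(rows),
        "train_ids": sorted(r["id"] for r in shuffled[:cut]),
    }

# Hypothetical data: running twice with the same seed gives the same record.
rows = [{"id": i, "x": i * i} for i in range(10)]
first = run_experiment(rows, seed=42)
second = run_experiment(rows, seed=42)
print(first == second)
```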
Implementation – I’m sure it’ll work out
somehow

Problem:
• Implementations without planned tests usually fail
• What is really painful is that it takes time to realize that they have
failed (the model works and generates risk)
Solution:
• Plan both implementation and tests
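
Planned tests can be as simple as automatic sanity checks on every batch of scores. The sketch below assumes the model outputs probabilities and that the mean score on the development sample is known. The function name, the tolerance of 0.10, and the example batches are hypothetical.

```python
import math

def check_scores(scores, expected_mean, tolerance=0.10):
    """Return a list of problems found in a batch of model scores (empty = OK)."""
    problems = []
    if not scores:
        return ["no scores produced"]
    if any(s is None or math.isnan(s) for s in scores):
        problems.append("missing scores")
        return problems
    if any(s < 0.0 or s > 1.0 for s in scores):
        problems.append("scores outside [0, 1]")
    mean = sum(scores) / len(scores)
    if abs(mean - expected_mean) > tolerance:
        problems.append("mean score drifted from the development sample")
    return problems

# Hypothetical batches: one healthy, one clearly broken.
healthy = check_scores([0.20, 0.25, 0.30], expected_mean=0.25)
broken = check_scores([0.90, 0.95, 1.20], expected_mean=0.25)
print(healthy, broken)
```

Checks like these do not prove the model is right, but they shorten the time it takes to notice that an implementation has failed.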
Methods & models
AI. We desperately need AI!

Problem:
• We don’t need it
• Predictive modeling is not AI!
• It happens that full control over a model is more important than
predictive power
Solution:
• Let’s think about what we’d like to achieve and how to do it
• Data-driven decision making is more important
A model just learns everything it is exposed
to

Problem:
• You need to promise self-learning to sell a service / a software
• But it will not learn automatically if not fed by suitable data
• In many situations you don’t have such data to design a feedback loop

Solution:
• Analyze a process that generates the data for the development sample
• Put aside an untouched sample
• The model will be trained on a sample and refined on an ongoing
basis
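
One possible way to keep a sample untouched is to assign records to it by a hash of their identifier, so the assignment never changes when the data is reloaded or extended. This is a sketch under assumptions: the identifier format, the 20% fraction, and the salt string are invented for the example.

```python
import hashlib

def in_holdout(record_id, fraction=0.2, salt="holdout-v1"):
    """Deterministically assign a record to the untouched sample by its id."""
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).hexdigest()
    # Map the hash to a number in [0, 1) and compare it with the fraction.
    return int(digest[:8], 16) / 16 ** 8 < fraction

# Hypothetical customer identifiers.
record_ids = [f"customer-{i}" for i in range(1000)]
holdout = [r for r in record_ids if in_holdout(r)]
development = [r for r in record_ids if not in_holdout(r)]
print(len(development), len(holdout))
```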
Start modeling with Deep Learning!

Problem:
• But everybody uses it…
• No!!!
• Many problems are too simple for DL
• In particular, problems where the data fits in a data frame
Solution:
• Random Forest, xgboost
If we have 3000 classes then let’s build a
BIG classifier

Problem:
• For example when we’d like to recommend bank products
• A random classifier for such a problem has an error of 2999/3000 = 99.97% (not 50%)
• Usually the dataset is too small

Solution:
• It’s good to use a simpler method (usually)
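
The arithmetic behind the 3000-class example can be checked directly: guessing uniformly among K classes is right with probability 1/K, so the error is 1 - 1/K. The second function shows why the dataset is usually too small. The figure of 100,000 records is a hypothetical number chosen for illustration.

```python
from fractions import Fraction

def random_guess_error(n_classes):
    """Expected error rate of guessing uniformly at random among n_classes classes."""
    return 1 - Fraction(1, n_classes)

def average_examples_per_class(n_records, n_classes):
    """How many training examples each class gets on average."""
    return n_records / n_classes

error = random_guess_error(3000)
print(error, f"{float(error):.2%}")
# Hypothetical dataset of 100,000 records spread over 3000 classes.
print(average_examples_per_class(100_000, 3000))
```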
Hardware & software
You can do calculations using a laptop

Problem:
• Sometimes yes, you can
• But usually you cannot
• Usually it doesn’t make any sense: human time is more expensive
than machine time
Solution:
• It is good to invest some money in hardware
• or use AWS from Amazon (or something similar)
Commercial software is excellent

Problem:
• Users often say that it is excellent until it is bought
• The problems appear later

Solution:
• Test it in conditions similar to those in which it will be used
• Think seriously about using open source
Free software is excellent (and it’s free!)

Problem:
• It’s free – in terms of a buying cost
• It’s not just excellent: the cost is the necessity to have qualified people
on board and to develop software
• Inconvenient problems do happen

Solution:
• Use it as it should be used
• i.e. write clear and clean code, use additional tools, e.g. VCS
• Make sure the team has the skills needed
People
All companies have Data Science teams.
Let’s build one for us!
Problem:
• It’s possible to build a team. It will take a lot of time and lots of
money.
• If the results are wasted, people will leave
• They need to have fun working on projects
• If I need a plank then do I really need to buy a sawmill?

Solution:
• Be sure that:
• we know how to use their results
• it will give value to the business

• PoC can be outsourced. The first data science project can be
outsourced.
A student or a freshman is enough to deliver
profits from deep analytics to the business
Problem:
• If someone can cut with a scalpel then will we call him a surgeon?
• Why is someone who can (technically) build a model from a data
frame called a Data Scientist?
• Data Scientist is a profession – experience matters!
• People without experience usually don’t give any business value for
a company. Even after spending a year working with data (!)
Solution:
• Hire experienced people, especially at the beginning of a DS journey
• let them teach the freshmen
• But what if you don’t have experienced people?
• Invest time, effort, and money in your team. Let a more business-oriented
analyst control the team
The team will learn everything from online
courses

Problem:
• I’ll give each of you $20 (OK, even $50): go and learn everything online
• It’s true. The team will learn some things
• But not the most important ones
• A good hands-on training cannot be substituted
Solution:
• Learning by doing (and applying)
• Control and stimulate learning
• Buy knowledge
Summary
• To avoid mistakes it is good to ask ourselves these questions (and
answer them), e.g.:
• What business problem are we solving?
• What business value can we get from the results?
• What could be lost in translation from business into analytics?
• Do we have adequate and representative data?
• What process generates them? What are they influenced by?
• What is the model-building process?
• What analytical tools should be used? Could we apply simpler
approaches?
• How do we control all the risk?

• It is good to do it repeatedly
• It’s best to involve someone experienced
• It’s beneficial to educate the receivers of the results
Contact

• During the conference!


• After the conference: artur [at] quantup [dot] eu