
- 3+ years of software development experience

- 3+ years of working experience in machine learning and/or AI

- Good fundamentals in mathematics

- Proficient in Java, Python, and/or other development languages

- Experience with TensorFlow or similar software

Libraries in Python for ML -

Python's machine-learning libraries, such as scikit-learn, take advantage of Python's numerical and scientific computing libraries, NumPy and SciPy.
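
As a minimal illustration (the choice of scikit-learn here is an assumption; the text names no specific library), a model can be fit directly on NumPy arrays:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # features as a NumPy array
y = np.array([2.1, 4.0, 6.2, 7.9])           # targets

model = LinearRegression().fit(X, y)         # scikit-learn builds on NumPy/SciPy
print(model.predict(np.array([[5.0]])))      # roughly 10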

Data acquisition - This may involve acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction and transfer routines would be in place, and new sources, once identified, would be acquired following the established processes.
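
A minimal sketch of such an extraction routine, assuming a CSV endpoint; the URL and file are placeholders, not from the original text:

import pandas as pd

SOURCE_URL = "https://example.com/exports/daily_metrics.csv"  # hypothetical source

# Pull an extract from an external source into a DataFrame; in a steady
# state this routine would run on a schedule as part of the established process.
raw = pd.read_csv(SOURCE_URL)
print(raw.shape)  # quick sanity check before handing off to preparation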

Data preparation - Usually referred to as "data wrangling", this step involves cleaning the data and reshaping it into a form readily usable for data science. It is similar in some respects to the traditional ETL steps in data warehousing, but involves more exploratory analysis and is aimed primarily at extracting features in usable formats.
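
A short wrangling sketch using pandas; the file and column names are hypothetical:

import numpy as np
import pandas as pd

df = pd.read_csv("daily_metrics.csv")                # output of the acquisition step
df = df.drop_duplicates()                            # clean: remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())     # clean: impute missing values
df["log_income"] = np.log1p(df["income"])            # reshape: derive a feature in usable form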

Hypothesis and modeling are the traditional data mining steps; in a data science project, however, these are not limited to statistical samples - the idea is to apply machine learning techniques to all the data. A key sub-step here is model selection: a training set is separated out for training the candidate machine-learning models, while validation and test sets are used to compare model performance, select the best-performing model, gauge model accuracy, and prevent over-fitting.
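
A sketch of that separation, using scikit-learn's train_test_split (the library choice and the synthetic data are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)  # stand-in data

# First hold out a test set, then split the remainder into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
# Candidates are trained on the training set, compared on the validation set,
# and the selected model is scored once on the untouched test set.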

Steps 2 through 4 are repeated as many times as needed; as the understanding of the data and the business becomes clearer, and as results from the initial models and hypotheses are evaluated, further tweaks are made. These iterations may sometimes include Step 5 (deployment) and be performed in a pre-production or "limited" / "pilot" environment before the actual full-scale "production" deployment, or they may include fast tweaks after deployment, following the continuous deployment model.

Once the model has been deployed in production, it is time for regular maintenance and operations. This operations phase can also follow a target DevOps model, which gels well with continuous deployment given the rapid time-to-market requirements of big data projects. Ideally, the deployment includes performance tests that measure model performance and can trigger alerts when it degrades beyond an acceptable threshold.
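
One way that alerting idea could look in code; the metric, threshold, and alert hook are illustrative assumptions:

from sklearn.metrics import accuracy_score

ACCEPTABLE_ACCURACY = 0.85  # assumed degradation threshold

def check_model_health(y_true, y_pred):
    """Performance test run periodically against fresh labelled data."""
    accuracy = accuracy_score(y_true, y_pred)
    if accuracy < ACCEPTABLE_ACCURACY:
        # A real deployment would page on-call or raise a ticket here.
        print(f"ALERT: accuracy {accuracy:.3f} fell below {ACCEPTABLE_ACCURACY}")

check_model_health([1, 0, 1, 1], [1, 0, 0, 0])  # 0.50 accuracy, triggers the alert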

The optimization phase is the final step in the data science project life-cycle. It could be triggered by failing performance, by the need to add new data sources and retrain the model, or by the opportunity to deploy improved versions of the model based on better algorithms.
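
An illustrative retraining step for this phase; the model choice and the synthetic data are assumptions for the sketch:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(500, 4)), rng.integers(0, 2, 500)  # existing data
X_new, y_new = rng.normal(size=(100, 4)), rng.integers(0, 2, 100)  # new source

# Retrain on the old data plus the newly acquired source, then redeploy.
model = LogisticRegression().fit(np.vstack([X_old, X_new]),
                                 np.concatenate([y_old, y_new]))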

Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. As mentioned before, with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of a data science project early in the life-cycle. This early comparison helps the data science team change approaches, refine hypotheses, and even discard the project if the business case is nonviable or the benefits of the predictive models are not worth the effort to build them.

Maths for ML -

The basic mathematical skills required are linear algebra, matrix algebra, probability, and some basic calculus.
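
A small worked example tying those skills to ML: ordinary least squares solved with matrix algebra via the normal equation w = (X^T X)^(-1) X^T y (the data is synthetic):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                 # design matrix
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)         # normal equation, no explicit inverse
print(w)                                      # recovers roughly [2.0, -1.0, 0.5]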
