
CE 451: Applied Artificial Intelligence

Lecture 18 – Predictive modeling and classification strategies


[Flowchart: simplified scikit-learn algorithm cheat sheet for choosing between classification, clustering, regression and dimensionality reduction; reproduced in full on the final slide, "Applying predictive modeling".]

Instructor: Dr. Hashim Ali


Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi
[Spring 2019]

Predictive modeling
 Predictive modeling is a commonly used term for the family of
statistical techniques that predict future behavior.
 Predictive modeling solutions are a form of data-mining
technology that works by analyzing historical and current data
and generating a model to help predict future outcomes from
the same or new data.
 Simply put, predictive analytics uses past trends and applies
them to the future.
 For example, if a customer purchases a smartphone from an
e-commerce website,
 he might be interested in its accessories immediately,
 he might be a potential customer for a phone battery a few years down the line,
 currently, the chances of him buying an accessory for a competitor's
smartphone are relatively slim.


Usefulness of predictive modeling


 While the example might sound simple, imagine doing this for
thousands of categories you might be selling.
 Within those thousands of categories, there might be multiple
options (hundreds of covers, pouches, styluses…).
 Further, even if you have a thousand visitors every day (a small
number for many e-retailers), predicting the next purchase
without data-based decisioning for these customers might
become impossible.
 This is exactly where predictive modeling and data analytics come
to your help (how many of you are impressed by these?).
 Remember Daraz helping you out with "You might also like…".
 Remember YouTube recommending videos for you.
 Remember Facebook suggesting "People you may know".
 Remember ….

Power of predictive modeling


[Diagram] Data source 1 (browsing history), data source 2 (past
transactions) and data source 3 (demographics) feed into a data
extraction & transformation step, whose output drives predictive
modeling combined with business understanding.

Predictive modeling versus search problems


 Search problems
 Start with a state (the start state).
 Traverse the state space.
 Infer a solution (possibly with/without the path to it from the start state).
 The problem is "Given a search strategy, search for an (unknown best)
solution within limited time and resources, or for a path to a known
solution (goal state)".
 Divided into optimization and state-space search problems.
 Predictive modeling
 Start with an unclassified data point.
 Determine the features on which to classify/predict.
 Train/select a classifier/predictor that can classify/predict this unseen
data point into one of the possible classes.
 Infer the class to which this data point belongs.
 Predictive modeling can be described as the mathematical problem of
approximating a mapping function (f) from input variables (X) to output
variables (y). This is called the problem of function approximation.
 The job of the modeling algorithm is to find the best mapping function we
can, given the time and resources available.
 The problem is "Given a function approximation (a classifier/predictor,
and possibly pre-existing classes and data), assign a class to a previously
unseen data point".

Predictive modeling versus search strategies

 Classification example: [image of an animal] – to which class in
animals does it belong?
 Search example: [8-puzzle start and goal boards] – what are the moves
to travel from the start configuration to the goal configuration, given
that you can swap an adjacent tile with the blank space in a move?

Terminology in predictive modeling


 Classification problems
 Assigning a class to unseen data point(s).
 Regression analysis problems
 When classification is done in a continuous domain, typically to estimate
relationships between variables, the problem becomes a regression analysis
problem.
 Statistical classification problems
 When classification is done in a discrete domain with supervised learning (for
example, identifying gender in an image), the problem becomes a statistical
classification problem.
 Clustering problems
 The primary objective is to create groups (clusters) based on the similarities of
the examples, in a discrete domain but with unsupervised learning.
 Dimensionality reduction problems
 Identify the important features necessary for predictive modeling and reduce
multi-dimensional data to fewer dimensions.

Regression Analysis

 Regression analysis is a set of statistical processes for estimating the relationships
among variables.
 It includes many techniques for modeling and analyzing several variables, when the
focus is on the relationship between a dependent variable and one or
more independent variables (or 'predictors').


Regression Analysis

 Regression analysis helps one understand how the typical value of the dependent
variable (or 'criterion variable') changes when any one of the independent variables
is varied, while the other independent variables are held fixed.
 Regression analysis estimates the conditional expectation of the dependent variable
given the independent variables – that is, the average value of the dependent
variable when the independent variables are fixed.
 In all regression analysis problems, a function of the independent variables called
the regression function is to be estimated.


Regression Analysis – Linear regression


 Linear regression has been around for more than 200 years and has been
extensively studied.
 It is one of the most well-known and well-understood algorithms in statistics and
machine learning.
 Predictive modeling is primarily concerned with minimizing the error of a model or
making the most accurate predictions possible, at the expense of explainability.
 The representation of linear regression is an equation that describes a line that best
fits the relationship between the input variables (x) and the output variables (y), by
finding specific weightings for the input variables called coefficients (B).
 Different techniques can be used to learn the linear regression model from data,
such as a linear algebra solution for ordinary least squares and gradient descent
optimization.
 Some good rules of thumb when using this technique are to remove variables that
are very similar (correlated) and to remove noise from your data, if possible.
 A fast and simple technique and good first algorithm to try!
 One of the most commonly known statistical methods is ANOVA (ANalysis Of
VAriance) – a special case of Linear Regression.

Regression Analysis – Linear regression


 For example: y = B0 + B1 * x

 We will predict y given the input x and the goal of the linear
regression learning algorithm is to find the values for the coefficients
B0 and B1.
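A minimal numpy sketch of learning these coefficients by ordinary least
squares (the data points below are made up for illustration):

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones so the intercept B0 is learned too.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: solves min ||X @ B - y||^2 for B = [B0, B1].
B0, B1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"y = {B0:.3f} + {B1:.3f} * x")
print("prediction at x = 6:", B0 + B1 * 6)
```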


Regression Analysis – Logistic regression


 Go-to method for binary classification
problems (problems with two class values).
 Like linear regression, the goal is to
find the values for the coefficients that
weight each input variable.
 Unlike linear regression, the prediction
for the output is transformed using a
non-linear function called the logistic function.
 The logistic function looks like a big S and will
transform any value into the range 0 to 1.
 This is useful because a rule can be applied to the output of the
logistic function to snap values to 0 and 1
(e.g., IF output < 0.5 THEN predict class 0, ELSE class 1)
and predict a class value.
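A minimal sketch of the logistic transform and the snap-to-class rule
described above (the coefficients are hypothetical, as if already learned
from data):

```python
import numpy as np

def logistic(z):
    # The S-shaped logistic function: maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(x, B0, B1, threshold=0.5):
    # Linear combination of the input, squashed into (0, 1), then snapped
    # to a class label with the threshold rule.
    p = logistic(B0 + B1 * x)
    return int(p >= threshold), p

# Hypothetical coefficients, as if already learned from data.
label, prob = predict_class(x=2.0, B0=-4.0, B1=1.5)
print(label, round(prob, 3))  # class 0: probability is below 0.5
```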

Regression Analysis - Example

 The marketing manager of Habib Bank Limited has a large marketing
force at his office and wants to determine whether there is a
relationship between the number of calls made to potential customers
in a month and the number of new accounts opened during the month.
 The manager selects a random sample of 15 representatives and
determines the number of publicity calls each representative made
last month and the number of new accounts opened.

Marketing representative    Sales calls   Accounts opened
MUJAHID HUSSAIN                  96             41
TALHA MUHAMMAD                   40             41
ABDUL MALIK                     104             51
AHMAD AYUB                      128             60
AHMED NAWAZ KHAN                164             61
HAMZA AYOUB                      76             29
HASSAN IFTIKHAR                  72             39
IBRAHIM AHMAD                    80             50
INSHA WAMIQ                      36             28
JUSHAIB MAHMOOD                  84             43
MAAZ BIN ALIM                   180             70
MAHEEN AYUB VINE                132             56
MIAN SHAWAL REHAN               120             45
MUHAMMAD AQEEL                   44             31
MUHAMMAD DANIYAL DANISH          84             30
Total                          1440            675


Regression Analysis - Example

 What is the expected number of accounts opened by a representative who made 20
calls?
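One way to answer this is to fit the line by ordinary least squares on the
table above; a minimal numpy sketch (nothing beyond the table's numbers is
assumed):

```python
import numpy as np

# (sales calls, accounts opened) for the 15 representatives from the table.
calls = np.array([96, 40, 104, 128, 164, 76, 72, 80, 36, 84,
                  180, 132, 120, 44, 84], dtype=float)
accounts = np.array([41, 41, 51, 60, 61, 29, 39, 50, 28, 43,
                     70, 56, 45, 31, 30], dtype=float)

# Fit accounts = B0 + B1 * calls by ordinary least squares.
X = np.column_stack([np.ones_like(calls), calls])
B0, B1 = np.linalg.lstsq(X, accounts, rcond=None)[0]

print(f"accounts = {B0:.2f} + {B1:.4f} * calls")
print("expected accounts for 20 calls:", round(B0 + B1 * 20, 1))
```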



Statistical Classification and Clustering

 Statistical classification is the problem of identifying to which of a set
of categories (sub-populations) a new observation belongs, on the
basis of a training set of data containing observations (or instances)
whose category membership is known. Examples are
 assigning a given email to the "spam" or "non-spam" class, and
 assigning a diagnosis to a given patient based on observed characteristics of the
patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
 Classification is an example of pattern recognition.
 Classification is considered an instance of supervised learning, i.e.,
learning where a training set of correctly identified observations is
available.
 The corresponding unsupervised procedure is known as clustering,
and involves grouping data into categories based on some measure of
inherent similarity or distance.

Statistical Classification and Clustering

 Often, the individual observations are analyzed into a set of quantifiable
properties, known variously as explanatory variables or features.
 These properties may variously be
 Categorical (e.g., "A", "B", "AB" or "O", for blood type),
 Ordinal (e.g., "large", "medium" or "small"),
 Integer-valued (e.g., the number of occurrences of a particular word in an email) or
 Real-valued (e.g., a measurement of blood pressure).
 Other classifiers work by comparing observations to previous observations
by means of a similarity or distance function.
 An algorithm that implements classification, especially in a concrete
implementation, is known as a classifier.
 The term "classifier" sometimes also refers to the mathematical function,
implemented by a classification algorithm, that maps input data to a
category.

Statistical Classification – Linear Discriminant Analysis

 LDA belongs to the family of linear algorithms.
 Another famous linear algorithm is the Perceptron.
 The LDA representation consists of statistical properties of your data,
calculated for each class. For a single input variable this includes:
 The mean value for each class.
 The variance calculated across all classes.

Statistical Classification – Linear Discriminant Analysis


 Predictions are made by calculating a discriminant value for each class and
predicting the class with the largest value.
 The technique assumes that the data has a Gaussian distribution (bell curve),
so it is a good idea to remove outliers from your data beforehand.
 It is a simple and powerful method for classification predictive modeling
problems.
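A minimal sketch of the per-class discriminant calculation for a single
input variable, assuming class means, a shared variance and class priors
have already been estimated (the numbers here are hypothetical):

```python
import numpy as np

def lda_discriminant(x, mean_k, variance, prior_k):
    # Single-variable LDA discriminant for class k:
    # delta_k(x) = x * mu_k / var - mu_k^2 / (2 * var) + ln(prior_k)
    return x * mean_k / variance - mean_k**2 / (2 * variance) + np.log(prior_k)

# Hypothetical per-class means, shared variance and priors for two classes.
means, variance, priors = [1.0, 3.0], 0.8, [0.5, 0.5]

x = 2.4
scores = [lda_discriminant(x, m, variance, p) for m, p in zip(means, priors)]
print("predicted class:", int(np.argmax(scores)))  # class with largest value
```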


Statistical Classification – Decision trees

 Decision trees are an important type of algorithm for predictive modeling.
 The representation of the decision tree model is a binary tree (from data
structures and algorithms, nothing too fancy).
 Each node represents a single input variable (x) and a split point on that
variable (assuming the variable is numeric).

Statistical Classification – Decision Trees


 The leaf nodes of the tree contain an output variable (y) which is used
to make a prediction.
 Predictions are made by walking the splits of the tree until arriving at
a leaf node and outputting the class value at that leaf node (a sketch of
this walk is given below).
 Advantages:
 Trees are fast to learn and very fast for making predictions.
 They are also often accurate for a broad range of problems, and
 they do not require any special preparation of your data.
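A minimal sketch of the prediction walk with a hand-built tree (the Node
layout here is illustrative, not a library API):

```python
class Node:
    def __init__(self, feature=None, split=None, left=None, right=None, label=None):
        self.feature, self.split = feature, split  # which x and where it splits
        self.left, self.right = left, right        # child subtrees
        self.label = label                         # class value if a leaf node

def predict(node, x):
    # Walk the splits until a leaf is reached, then output its class value.
    while node.label is None:
        node = node.left if x[node.feature] < node.split else node.right
    return node.label

# Tiny hand-built tree: split on input variable 0 at 2.5.
tree = Node(feature=0, split=2.5, left=Node(label="A"), right=Node(label="B"))
print(predict(tree, [1.7]))  # "A"
```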


Statistical Classification – K-Nearest Neighbors (KNN)


 The KNN algorithm is very simple and very effective.
 The model representation for KNN is the entire training dataset.
 Predictions are made for a new data point by searching through the
entire training set for the K most similar instances (the neighbors) and
summarizing the output variable for those K instances.
 For regression problems, this might be the mean output variable; for
classification problems, this might be the mode (most common) class value.
 The trick is in how to determine the similarity between the data
instances.
 The simplest technique, if your attributes are all on the same scale (all
in inches, for example), is to use the Euclidean distance, a number you
can calculate directly from the differences between each input variable
(see the sketch below).
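A minimal KNN sketch using the Euclidean distance and the mode of the
neighbors' classes (the data is a toy example):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training instance.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # The K most similar instances (the neighbors).
    neighbors = np.argsort(dists)[:k]
    # Classification: summarize with the mode (most common class value).
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# Toy training data, for illustration only.
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["cat", "cat", "dog", "dog"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0])))  # "cat"
```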

Statistical Classification – K-Nearest Neighbors (KNN)

 Advantages:
 You can update and curate your training instances over time to keep
predictions accurate  addition and update operations are simple.
 KNN only performs a calculation (or learns) when a prediction is
needed, just in time.
 Disadvantages:
 KNN needs to retain the complete training dataset.
 KNN can require a lot of memory or space to store all of the data.
 The idea of distance or closeness can break down in very high
dimensions (lots of input variables), which can negatively affect the
performance of the algorithm on your problem.

Statistical Classification – Learning vector quantization

 The Learning Vector Quantization algorithm (LVQ) is an artificial neural
network algorithm that allows you to choose how many training instances to
hang onto and learns exactly what those instances should look like.
 The representation for LVQ is a collection of codebook vectors.
 These are selected randomly in the beginning and adapted to best summarize the
training dataset over a number of iterations of the learning algorithm.
 After training, the codebook vectors can be used to make predictions just like
K-Nearest Neighbors.
 The most similar neighbor (best matching codebook vector) is found by calculating
the distance between each codebook vector and the new data instance.
 The class value (or real value in the case of regression) for the best matching unit
is then returned as the prediction.
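A minimal sketch of the LVQ prediction step, assuming the codebook vectors
have already been adapted by training (the vectors here are hypothetical):

```python
import numpy as np

def lvq_predict(codebooks, codebook_labels, x_new):
    # Best matching unit: the codebook vector closest to the new instance.
    dists = np.linalg.norm(codebooks - x_new, axis=1)
    return codebook_labels[np.argmin(dists)]

# Hypothetical codebook vectors, as if adapted over training iterations.
codebooks = np.array([[1.0, 1.0], [6.0, 6.0]])
labels = np.array(["cat", "dog"])
print(lvq_predict(codebooks, labels, np.array([5.5, 6.3])))  # "dog"
```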


Statistical Classification – Learning vector quantization

 If KNN gives good results on your dataset, try using LVQ to reduce the
memory requirements of storing the entire training dataset.
 Best results are achieved if you rescale your data to have the same
range, such as between 0 and 1.


Statistical Classification – Support Vector Machines (SVM)

 Support Vector Machines (SVM) are perhaps among the most popular and
talked-about classification algorithms.
 A hyperplane is a line that splits the input variable space.
 In SVM, a hyperplane is selected to best separate the points in the input
variable space by their class, either class 0 or class 1.
 In two dimensions, you can visualize this as a line; let's assume that all of
our input points can be completely separated by this line.
 The SVM learning algorithm finds the coefficients that result in the best
separation of the classes by the hyperplane.


Statistical Classification – Support Vector Machines (SVM)

 The distance between the hyperplane and the closest data points is
referred to as the margin.
 The best or optimal hyperplane that can separate the two classes is
the line that has the largest margin.
 Only these closest points are relevant in defining the hyperplane and in
the construction of the classifier.
 These points are called the support vectors.
 They support or define the hyperplane.
 In practice, an optimization algorithm is used to find the values for the
coefficients that maximize the margin.
 SVM might be one of the most powerful out-of-the-box classifiers and is
worth trying on your dataset.
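A minimal sketch of hyperplane prediction and the margin computation, with
hypothetical coefficients standing in for the output of the optimization:

```python
import numpy as np

def svm_predict(w, b, x):
    # A learned hyperplane is w . x + b = 0; the sign picks the class side.
    return 0 if np.dot(w, x) + b < 0 else 1

def distances_to_hyperplane(w, b, X):
    # Geometric distance of each point to the hyperplane; the margin is the
    # smallest such distance, attained at the support vectors.
    return np.abs(X @ w + b) / np.linalg.norm(w)

# Hypothetical coefficients, as if found by the optimization step.
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[1.0, 1.0], [2.5, 2.0], [3.0, 1.5]])
print(svm_predict(w, b, X[0]))                 # class 0
print(distances_to_hyperplane(w, b, X).min())  # the margin
```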

Statistical Classification – Random forests/Bagging


 Random Forest is one of the most popular and most powerful
machine learning algorithms.
 It is a type of ensemble machine learning algorithm called Bootstrap
Aggregation, or bagging.
 The bootstrap is a powerful statistical method for estimating a
quantity from a data sample, such as a mean.
 You take lots of samples of your data, calculate the mean of each, then average
all of your mean values to give a better estimate of the true mean value.
 In bagging, the same approach is used, but for estimating
entire statistical models, most commonly decision trees.
 Multiple samples of your training data are taken, followed by model
construction for each data sample.
 When you need to make a prediction for new data, each model makes a
prediction and the predictions are averaged to give a better estimate of the
true output value.

Statistical Classification – Random forests/Bagging


 Random forest is a tweak on this approach: decision trees are created
so that, rather than selecting optimal split points, suboptimal splits are made
by introducing randomness.
 The models created for each sample of the data are therefore more different
than they otherwise would be, but still accurate in their unique and different
ways. Combining their predictions results in a better estimate of the true
underlying output value (a bagging sketch follows below).
 If you get good results with an algorithm with high variance (like decision
trees), you can often get better results by bagging that algorithm.
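A minimal bagging sketch, assuming scikit-learn is available (the same
library as the cheat sheet on the final slide): many trees are fit on
bootstrap samples of the data and their predictions averaged.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))               # toy inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)  # noisy target

# 50 trees, each fit on a bootstrap sample; predictions are averaged.
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50)
model.fit(X, y)
print(model.predict([[5.0]]))  # ensemble average, near sin(5)
```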


Clustering - Example
 I was teaching the CS101 course to Civil Engineering students last semester.
 At the end of the semester, I have to grade them on their total marks after
the final exams.
 Suppose that the total marks sheet has been compiled and I want to give them
grades between A, A-, B+, B, B-, C+, C, C-, D+, D and F.
 Can I apply clustering? Classification? Regression?
 Why (not)?

[Table: student names, numbers and total marks for the 36 students;
reproduced with the assigned grades on the next slide.]


Clustering - Example
 One strategy is to sort the students by score, then determine the grade
clusters such that the difference between scores within a cluster is
minimized and the difference across clusters is maximized.

[Chart: marks distribution of CS101 students – marks obtained versus
student ID.]

Student Name              St. Number   Total Marks   Grade
AALIYAN AHMED KHAN             1         82.675        A
ABDULLAH ALI                   2         81.675        A
ABDULLAH MUHAMMAD              3         68.3          A-
ABDULLAH NADEEM AZHAR          4         65.125        A-
AHSAN ARFAN MIANA              5         63.75         B+
AHSAN SALAM                    6         63.25         B+
ASAD WADOOD                    7         62.775        B+
ATA UL MUNIM TAHIR             8         62.175        B+
BILAL NAZIR                    9         61.5          B+
FAYIZ AMIN                    10         61.35         B+
HAMID ALI                     11         61.3          B+
HAROON JAMSHED                12         60.2          B
HASNA ARSHAD                  13         59.975        B
IJAZ ALI                      14         59.95         B
ILYAS MUHAMMAD                15         58.625        B
LAIBA SARFRAZ                 16         58.55         B
LATEEF-UR-REHMAN              17         58.175        B
M. AHMED CHUNDRIGAR           18         53.7          B-
M. WAQAR AZEEM                19         53.25         B-
MOEZZA TEHSEEN                20         53.2          B-
MOHAMMAD HASSAN               21         52.125        C+
MUHAMMAD ABBAS                22         51.95         C+
MUHAMMAD ABDULLAH             23         51.625        C+
MUHAMMAD AMMAR                24         51.375        C+
MUHAMMAD DAANIAL KHAN         25         51.125        C+
MUHAMMAD IBRAHIM              26         50.85         C+
MUHAMMAD KHIZAR               27         50.375        C
MUHAMMAD NAVEED ZAFAR         28         50.375        C
MUHAMMAD USMAN                29         50.025        C
SARA FATIMA KAZI              30         49.575        C
SHAHZAIB HAIDER               31         49            C-
SYED JAHANZAIB BUKHARI        32         41.175        D+
TAYYABA JAVED                 33         39.775        D
VARUN MALANI                  34         39.175        D
YUNEEB AHMAD                  35         37.3          F
ZAIN ASHFAQ                   36         44.575        D+
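A minimal 1-D k-means sketch of this strategy; k-means is one clustering
method that could form such score groups (the marks passed in are a subset
from the table, and the cluster count is illustrative):

```python
import numpy as np

def kmeans_1d(scores, k, iters=100, seed=0):
    # Plain k-means in one dimension: alternately assign each score to the
    # nearest centroid, then move each centroid to its cluster's mean.
    rng = np.random.default_rng(seed)
    centroids = rng.choice(scores, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(scores[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = scores[labels == j].mean()
    return labels, centroids

marks = np.array([82.675, 81.675, 68.3, 65.125, 63.75, 49.0, 41.175, 37.3])
labels, centers = kmeans_1d(marks, k=3)
print(labels, np.round(centers, 2))
```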

Dimensionality reduction

 Dimensionality reduction or dimension reduction is the process of reducing
the number of random variables under consideration by obtaining a set of
principal variables.
 Useful because of the "curse of dimensionality": the idea of distance or
closeness can break down in very high dimensions (lots of input variables),
which can negatively affect the performance of the algorithm on your problem.
 It suggests you only use those input variables that are most relevant to
predicting the output variable.
 It can be divided into two categories based on the type of approach used:
 feature selection and
 feature extraction.

Dimensionality reduction

 Feature selection approaches try to find a subset of the original variables
(also called features or attributes).
 There are three strategies:
 the filter strategy (e.g., information gain),
 the wrapper strategy (e.g., search guided by accuracy), and
 the embedded strategy (features are added or removed while building the
model, based on the prediction errors).
 Feature projection transforms the data in the high-dimensional space to a
space of fewer dimensions (a PCA sketch follows below).
 The data transformation may be linear, as in principal component analysis (PCA), but
many nonlinear dimensionality reduction techniques also exist.
 For multidimensional data, tensor representation can be used in dimensionality
reduction through multilinear subspace learning.
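A minimal PCA sketch via the singular value decomposition of the centered
data (the data is a toy example):

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then take the SVD: the top right-singular vectors
    # are the principal directions; projecting onto them reduces dimension.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # data expressed in the reduced space

# Toy 3-D data that mostly varies along a single direction.
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + rng.normal(scale=0.05, size=(100, 3))
print(pca(X, n_components=1).shape)  # (100, 1)
```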


Predictive modeling techniques in a nutshell


Stochastic Gradient Descent regressor, ensemble classifiers, quadratic
classifiers, hierarchical clustering, linear regression, locally linear
embedding, Support Vector Machines, decision trees, non-linear regression,
ensemble regressor, spectral embedding, centroid-based clustering, learning
vector quantization, neural networks, distribution-based clustering, ridge
regression, Principal Component Analysis, perceptrons, linear classifiers,
boosting, kernel approximation, density-based clustering, random forests,
lasso, Isomap, elastic net, self-organizing maps, non-linear dimensionality
reduction, K-means, kernel estimation, spectral clustering and logistic
regression.

Clustering of predictive modeling techniques


Classification techniques: ensemble classifiers, quadratic classifiers,
random forests, Support Vector Machines, decision trees, learning vector
quantization, neural networks, perceptrons, linear classifiers, boosting.

Regression analysis: linear regression, non-linear regression, ridge
regression, logistic regression, lasso, elastic net, kernel estimation,
Stochastic Gradient Descent regressor, ensemble regressor.

Clustering techniques: centroid-based clustering, density-based clustering,
distribution-based clustering, hierarchical clustering, spectral clustering,
K-means.

Dimensionality reduction techniques: Principal Component Analysis, kernel
approximation, spectral embedding, locally linear embedding, self-organizing
maps, Isomap, non-linear dimensionality reduction.

Applying predictive modeling


START
 More than 50 samples?
 No: get more data.
 Yes: are you predicting a category?
 Yes: do you have labeled data?
 Yes: it is a classification problem.
 No: it is a clustering problem.
 No: are you predicting a quantity?
 Yes: it is a regression problem.
 No: are you just looking (producing structure)?
 Yes: it is a dimensionality reduction problem.
 No: tough luck.

Scikit-learn algorithm cheat sheet!
