
CO3 & CO4

• Structured Models:
Bayesian Network,
Hidden Markov Models,
Reinforcement Learning,

• Applications of ML to Perception:
Computer Vision,
Natural Language Processing,
Design and implementation of Machine Learning Algorithms,

• Feedforward Networks for Classification:


Convolutional Neural Network based Recognition using Keras, Tensorflow and OpenCV

• Simulation:
Use VGG Net and AlexNet pre-trained models for face recognition and human pose
estimation problems
Questions 11:
Feed-Forward Neural Networks
Roman Belavkin

Middlesex University

Question 1

Below is a diagram of a single artificial neuron (unit):

[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 feed a summing node v whose output is y = φ(v).]
Figure 1: Single unit with three inputs.

The node has three inputs x = (x1, x2, x3) that receive only binary signals (either 0 or 1). How many different input patterns can this node receive? What if the node had four inputs? Five? Can you give a formula that computes the number of binary input patterns for a given number of inputs?

Answer: For three inputs the number of combinations of 0 and 1 is 8:

x1 0 1 0 1 0 1 0 1
x2 0 0 1 1 0 0 1 1
x3 0 0 0 0 1 1 1 1

and for four inputs the number of combinations is 16:

x1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x2 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x3 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1


You may check that for five inputs the number of combinations will be 32.
Note that 8 = 2³, 16 = 2⁴ and 32 = 2⁵ (for three, four and five inputs).
Thus, the formula for the number of binary input patterns is:

2ⁿ, where n is the number of inputs
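A quick brute-force check of this formula (a minimal sketch in plain Python):

```python
from itertools import product

# Enumerate all binary input patterns for n inputs and compare with 2^n.
for n in (3, 4, 5):
    patterns = list(product((0, 1), repeat=n))
    print(f"{n} inputs: {len(patterns)} patterns (2^{n} = {2 ** n})")
```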

Question 2

Consider the unit shown on Figure 1. Suppose that the weights correspond-
ing to the three inputs have the following values:

w1 = 2
w2 = −4
w3 = 1

and the activation of the unit is given by the step-function:



φ(v) = 1 if v ≥ 0, and 0 otherwise

Calculate the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
x1 1 0 1 1
x2 0 1 0 1
x3 0 1 1 1

Answer: To find the output value y for each pattern we have to:

a) Calculate the weighted sum: v = Σᵢ wᵢ·xᵢ = w1·x1 + w2·x2 + w3·x3

b) Apply the activation function to v

The calculations for each input pattern are:

P1: v = 2·1 − 4·0 + 1·0 = 2, (2 > 0), y = φ(2) = 1
P2: v = 2·0 − 4·1 + 1·1 = −3, (−3 < 0), y = φ(−3) = 0
P3: v = 2·1 − 4·0 + 1·1 = 3, (3 > 0), y = φ(3) = 1
P4: v = 2·1 − 4·1 + 1·1 = −1, (−1 < 0), y = φ(−1) = 0
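The same calculation can be checked with a few lines of Python (a sketch using the weights and patterns of Question 2):

```python
# Step-activation unit from Question 2: weights (2, -4, 1), threshold at 0.
def phi(v):
    return 1 if v >= 0 else 0

weights = (2, -4, 1)
patterns = {"P1": (1, 0, 0), "P2": (0, 1, 1), "P3": (1, 0, 1), "P4": (1, 1, 1)}
for name, x in patterns.items():
    v = sum(w * xi for w, xi in zip(weights, x))
    print(name, "v =", v, "y =", phi(v))   # matches the table above
```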

Question 3

Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks of any computational device. Logical functions return only two possible values, true or false, based on the truth values of their arguments. For example, the operator AND returns true only when all its arguments are true; otherwise (if any of the arguments is false) it returns false. If we denote truth by 1 and falsity by 0, then the logical function AND can be represented by the following table:
x1 : 0 1 0 1
x2 : 0 0 1 1
x1 AND x2 : 0 0 0 1
This function can be implemented by a single unit with two inputs:

[Diagram: a single unit with two inputs x1 and x2, weights w1 and w2, weighted sum v, and output y = φ(v).]
if the weights are w1 = 1 and w2 = 1 and the activation function is:



φ(v) = 1 if v ≥ 2, and 0 otherwise

Note that the threshold level is 2 (v ≥ 2).

a) Test how the neural AND function works.

Answer:
P1: v = 1·0 + 1·0 = 0, (0 < 2), y = φ(0) = 0
P2: v = 1·1 + 1·0 = 1, (1 < 2), y = φ(1) = 0
P3: v = 1·0 + 1·1 = 1, (1 < 2), y = φ(1) = 0
P4: v = 1·1 + 1·1 = 2, (2 ≥ 2), y = φ(2) = 1

b) Suggest how to change either the weights or the threshold level of this single unit in order to implement the logical OR function (true when at least one of the arguments is true):

x1 : 0 1 0 1
x2 : 0 0 1 1
x1 OR x2 : 0 1 1 1

Answer: One solution is to increase the weights of the unit: w1 = 2 and w2 = 2:

P1: v = 2·0 + 2·0 = 0, (0 < 2), y = φ(0) = 0
P2: v = 2·1 + 2·0 = 2, (2 ≥ 2), y = φ(2) = 1
P3: v = 2·0 + 2·1 = 2, (2 ≥ 2), y = φ(2) = 1
P4: v = 2·1 + 2·1 = 4, (4 > 2), y = φ(4) = 1

Alternatively, we could reduce the threshold to 1:



φ(v) = 1 if v ≥ 1, and 0 otherwise

c) The XOR function (exclusive or) returns true only when one of the arguments is true and the other is false. Otherwise, it always returns false. This can be represented by the following table:

x1 : 0 1 0 1
x2 : 0 0 1 1
x1 XOR x2 : 0 1 1 0

Do you think it is possible to implement this function using a single unit? A network of several units?

Answer: This is a difficult question, and it puzzled scientists for some time because it is actually impossible to implement the XOR function either by a single unit or by a single-layer feed-forward network (a single-layer perceptron). This was known as the XOR problem. The solution was found using a feed-forward network with a hidden layer. The XOR network uses two hidden nodes and one output node, as sketched below.
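A minimal sketch of one such network in Python. The text does not give the weights, so the values below are one common illustrative choice: hidden node h1 computes OR, hidden node h2 computes AND, and the output node fires for "OR but not AND":

```python
# Threshold unit: fires when the weighted sum reaches the threshold.
def unit(xs, ws, threshold):
    v = sum(w * x for w, x in zip(ws, xs))
    return 1 if v >= threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = unit((x1, x2), (1, 1), threshold=1)   # OR
        h2 = unit((x1, x2), (1, 1), threshold=2)   # AND
        y = unit((h1, h2), (1, -1), threshold=1)   # h1 AND NOT h2 = XOR
        print(x1, "XOR", x2, "=", y)
```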

Question 4

The following diagram represents a feed-forward neural network with one hidden layer:
[Diagram: input nodes 1 and 2 feed hidden nodes 3 and 4, which feed output nodes 5 and 6.]

A weight on the connection between nodes i and j is denoted by wij; for example, w13 is the weight on the connection between nodes 1 and 3. The following table lists all the weights in the network:

w13 = −2    w35 = 1
w23 = 3     w45 = −1
w14 = 4     w36 = −1
w24 = −1    w46 = 1
Each of the nodes 3, 4, 5 and 6 uses the following activation function:

φ(v) = 1 if v ≥ 0, and 0 otherwise
where v denotes the weighted sum of a node. Each of the input nodes (1
and 2) can only receive binary values (either 0 or 1). Calculate the output
of the network (y5 and y6 ) for each of the input patterns:
Pattern: P1 P2 P3 P4
Node 1: 0 1 0 1
Node 2: 0 0 1 1

Answer: In order to find the output of the network it is necessary to calculate the weighted sums of hidden nodes 3 and 4:
v3 = w13·x1 + w23·x2, v4 = w14·x1 + w24·x2
Then find the outputs from the hidden nodes using the activation function φ:
y3 = φ(v3), y4 = φ(v4).
Use the outputs of the hidden nodes y3 and y4 as the input values to the output layer (nodes 5 and 6), and find the weighted sums of output nodes 5 and 6:
v5 = w35·y3 + w45·y4, v6 = w36·y3 + w46·y4.
Finally, find the outputs from nodes 5 and 6 (also using φ):
y5 = φ(v5), y6 = φ(v6).
The output pattern will be (y5, y6). Perform these calculations for each input pattern:
P1 : Input pattern (0, 0)
v3 = −2·0 + 3·0 = 0, y3 = φ(0) = 1
v4 = 4·0 − 1·0 = 0, y4 = φ(0) = 1
v5 = 1·1 − 1·1 = 0, y5 = φ(0) = 1
v6 = −1·1 + 1·1 = 0, y6 = φ(0) = 1
The output of the network is (1, 1).

P2 : Input pattern (1, 0)

v3 = −2·1 + 3·0 = −2, y3 = φ(−2) = 0
v4 = 4·1 − 1·0 = 4, y4 = φ(4) = 1
v5 = 1·0 − 1·1 = −1, y5 = φ(−1) = 0
v6 = −1·0 + 1·1 = 1, y6 = φ(1) = 1

The output of the network is (0, 1).


P3 : Input pattern (0, 1)

v3 = −2·0 + 3·1 = 3, y3 = φ(3) = 1
v4 = 4·0 − 1·1 = −1, y4 = φ(−1) = 0
v5 = 1·1 − 1·0 = 1, y5 = φ(1) = 1
v6 = −1·1 + 1·0 = −1, y6 = φ(−1) = 0

The output of the network is (1, 0).


P4 : Input pattern (1, 1)

v3 = −2·1 + 3·1 = 1, y3 = φ(1) = 1
v4 = 4·1 − 1·1 = 3, y4 = φ(3) = 1
v5 = 1·1 − 1·1 = 0, y5 = φ(0) = 1
v6 = −1·1 + 1·1 = 0, y6 = φ(0) = 1

The output of the network is (1, 1).
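The whole forward pass can be reproduced with NumPy (a sketch using the weights from the table above):

```python
import numpy as np

# Weight matrices for the Question 4 network.
W_hidden = np.array([[-2.0, 4.0],    # w13, w14 (from input node 1)
                     [ 3.0, -1.0]])  # w23, w24 (from input node 2)
W_output = np.array([[ 1.0, -1.0],   # w35, w36 (from hidden node 3)
                     [-1.0,  1.0]])  # w45, w46 (from hidden node 4)

def phi(v):
    return (v >= 0).astype(int)      # step activation

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h = phi(np.array(x) @ W_hidden)  # hidden outputs y3, y4
    y = phi(h @ W_output)            # network outputs y5, y6
    print("input", x, "-> output", tuple(y))
```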

Question 5

What is a training set and how is it used to train neural networks?

Answer: A training set is a set of pairs of input patterns and corresponding desired output patterns. Each pair represents how the network is supposed to respond to a particular input. The network is trained to respond correctly to each input pattern from the training set. Training algorithms that use training sets are called supervised learning algorithms. We may think of supervised learning as learning with a teacher, and of the training set as a set of examples. During training, the network, when presented with input patterns, gives 'wrong' answers (not the desired output). The error is used to adjust the weights in the network so that the error is smaller next time. This procedure is repeated using many examples (pairs of inputs and desired outputs) from the training set until the error becomes sufficiently small.

Question 6

What is an epoch?

Answer: An epoch is one complete presentation of the whole training set to the network during training; training usually runs for many epochs.
Dimensionality Reduction and Feature Construction

• Principal components analysis (PCA)


– Reading: L. I. Smith, A tutorial on principal components analysis (on class
website)

– PCA is used to reduce the dimensionality of data without much loss of information.

– Used in machine learning and in signal processing and image compression


(among other things).
PCA is “an orthogonal linear transformation that transfers the data to a new
coordinate system such that the greatest variance by any projection of the data
comes to lie on the first coordinate (first principal component), the second
greatest variance lies on the second coordinate (second principal component),
and so on.”
Background for PCA
• Suppose attributes are A1 and A2, and we have n training
examples. x’s denote values of A1 and y’s denote values of
A2 over the training examples.

• Variance of an attribute:

var(A1) = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
• Covariance of two attributes:
cov(A1, A2) = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

• If covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. If zero, the two dimensions are uncorrelated (there is no linear relationship between them).
• Covariance matrix
– Suppose we have n attributes, A1, ..., An.

– Covariance matrix:

C(n×n) = (cᵢ,ⱼ), where cᵢ,ⱼ = cov(Aᵢ, Aⱼ)

Example with two attributes H and M:

( cov(H,H)  cov(H,M) )   =   ( var(H)  104.5  )   =   ( 47.7   104.5 )
( cov(M,H)  cov(M,M) )       ( 104.5   var(M) )       ( 104.5  370   )

Covariance matrix
• Eigenvectors:
– Let M be an n×n matrix.
• v is an eigenvector of M if M·v = λ·v
• λ is called the eigenvalue associated with v

– For any eigenvector v of M and scalar a, M·(a·v) = λ·(a·v)

– Thus you can always choose eigenvectors of length 1:
v₁² + … + vₙ² = 1

– If M is symmetric (as a covariance matrix is), it has n such eigenvectors, and they are orthogonal to one another.

– Thus the eigenvectors can be used as a new basis for an n-dimensional vector space.
PCA

1. Given original data set S = {x1, ..., xk}, produce new set
by subtracting the mean of attribute Ai from each xi.

(In the running example, the original attribute means 1.81 and 1.91 become 0 and 0 after the subtraction.)


2. Calculate the covariance matrix of the mean-adjusted data.

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix. (The eigenvector with the largest eigenvalue traces the main linear pattern in the data.)
4. Order eigenvectors by eigenvalue, highest to lowest.

v1 = (−0.677873399, −0.735178956)ᵀ, λ1 = 1.28402771
v2 = (−0.735178956, 0.677873399)ᵀ, λ2 = 0.0490833989

In general, you get n components. To reduce dimensionality to p, ignore the n − p components at the bottom of the list.
Construct the new feature vector:
FeatureVector = (v1, v2, …, vp)

FeatureVector1 = ( −0.677873399  −0.735178956 )
                 ( −0.735178956   0.677873399 )

or the reduced-dimension feature vector:

FeatureVector2 = ( −0.677873399 )
                 ( −0.735178956 )
5. Derive the new data set.

TransformedData = RowFeatureVector × RowDataAdjust

RowFeatureVector1 = ( −0.677873399  −0.735178956 )
                    ( −0.735178956   0.677873399 )

RowFeatureVector2 = ( −0.677873399  −0.735178956 )

RowDataAdjust = ( 0.69  −1.31  0.39  0.09  1.29  0.49   0.19  −0.81  −0.31  −0.71 )
                ( 0.49  −1.21  0.99  0.29  1.09  0.79  −0.31  −0.81  −0.31  −1.01 )

This gives the original data in terms of the chosen components (eigenvectors), that is, along these axes.
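The steps above can be reproduced with NumPy on the mean-adjusted example data (a sketch; eigenvector signs may differ from the printed values, which does not affect the result):

```python
import numpy as np

# PCA steps 2-5 on the mean-adjusted example data (RowDataAdjust above).
X = np.array([[ .69, -1.31, .39, .09, 1.29, .49,  .19, -.81, -.31, -.71],
              [ .49, -1.21, .99, .29, 1.09, .79, -.31, -.81, -.31, -1.01]])

C = np.cov(X)                            # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # unit eigenvectors (as columns)
order = np.argsort(eigvals)[::-1]        # highest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print("eigenvalues:", eigvals)           # ~1.284 and ~0.049

row_feature_vector = eigvecs.T           # eigenvectors as rows
transformed = row_feature_vector @ X     # data expressed along the new axes

# Reconstruct from the first principal component only (small residual,
# because the second eigenvalue is tiny):
approx = row_feature_vector[:1].T @ transformed[:1]
print("max reconstruction error:", np.abs(approx - X).max())
```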
Reconstructing the original data
We did:
TransformedData = RowFeatureVector × RowDataAdjust

so we can do

RowDataAdjust = RowFeatureVector⁻¹ × TransformedData
              = RowFeatureVectorᵀ × TransformedData

(the inverse equals the transpose because the rows are orthonormal unit eigenvectors), and

RowDataOriginal = RowDataAdjust + OriginalMean
Example: Linear discrimination using PCA for face
recognition

1. Preprocessing: “Normalize” faces

• Make images the same size

• Line up with respect to eyes

• Normalize intensities
2. Raw features are pixel intensity values (2061 features)

3. Each image is encoded as a vector Γᵢ of these features

4. Compute the “mean” face in the training set:

Ψ = (1/M) Σᵢ₌₁ᴹ Γᵢ

• Subtract the mean face from each face vector: Φᵢ = Γᵢ − Ψ

• Compute the covariance matrix C

• Compute the (unit) eigenvectors vi of C

• Keep only the first K principal components (eigenvectors)


The eigenfaces encode the principal sources of variation
in the dataset (e.g., absence/presence of facial hair, skin tone,
glasses, etc.).

We can represent any face as a linear combination of these


“basis” faces.

Use this representation for:


• Face recognition
(e.g., Euclidean distance from known faces)

• Linear discrimination
(e.g., “glasses” versus “no glasses”,
or “male” versus “female”)
Linear Discriminant Analysis
(LDA)
Linear Discriminant Analysis (LDA) is used for dimensionality reduction of data with many attributes.

• Pre-processing step for pattern-classification and machine


learning applications.
• Used for feature extraction.
• Linear transformation that maximize the separation between
multiple classes.
• “Supervised”: uses class labels to guide the projection (unlike PCA)
Feature Subspace :

To reduce the dimensions of a d-dimensional data set by


projecting it onto a (k)-dimensional subspace
(where k < d)

How do we check that the data is well represented in the feature subspace?

• Compute eigenvectors from the dataset
• Collect them in the scatter matrices
• Generate k-dimensional data from the d-dimensional dataset.
Scatter Matrix:

• Within-class scatter matrix

• Between-class scatter matrix

Maximize the between-class measure and minimize the within-class measure.
LDA steps:

1. Compute the d-dimensional mean vectors.


2. Compute the scatter matrices
3. Compute the eigenvectors and corresponding
eigenvalues for the scatter matrices.
4. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k matrix
5. Transform the samples onto the new subspace.
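As a sketch, scikit-learn's LinearDiscriminantAnalysis performs these steps internally; the iris data here is just a stand-in example, not the tic-tac-toe dataset below:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

# LDA as a supervised dimensionality reduction step.
X, y = load_iris(return_X_y=True)            # d = 4 attributes, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_k = lda.fit_transform(X, y)                # project onto k = 2 axes
print(X.shape, "->", X_k.shape)              # (150, 4) -> (150, 2)
```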
Dataset

Attributes :
• X
• O
• Blank

Class:
• Positive(Win for X)
• Negative(Win for O)
Dataset

Each attribute is one of the nine board squares (x, o, or b for blank):

top-left  top-middle  top-right  middle-left  middle-middle  middle-right  bottom-left  bottom-middle  bottom-right  Class
x         x           x          x            o              o             x            o              o             positive
x         x           x          x            o              o             o            x              o             positive
x         x           x          x            o              o             o            o              x             positive
o         x           x          b            o              x             x            o              o             negative
o         x           x          b            o              x             o            x              o             negative
o         x           x          b            o              x             b            b              o             negative
Reinforcement Learning

“Reinforcement Learning (RL) is


a type of machine learning
technique that enables an agent
to learn in an interactive
environment by trial and error
using feedback from its own
actions and experiences.”
• Though both supervised and reinforcement learning map inputs to outputs, they differ in the feedback: in supervised learning the agent is given the correct set of actions for performing a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior.

• As compared to unsupervised learning, reinforcement learning is different in terms of


goals. While the goal in unsupervised learning is to find similarities and differences
between data points, in the case of reinforcement learning the goal is to find a
suitable action model that would maximize the total cumulative reward of the agent.
Some key terms that describe the basic elements of an RL problem are:

• Environment— Physical world in which the agent operates


• State — Current situation of the agent
• Reward— Feedback from the environment
• Policy— Method to map agent’s state to actions
• Value — Future reward that an agent would receive by taking an action in a
particular state
Reinforcement Learning algorithms
• Markov Decision Processes(MDPs) are mathematical frameworks to describe an
environment in RL and almost all RL problems can be formulated using MDPs. An
MDP consists of a set of finite environment states S, a set of possible actions A(s) in each state, a real-valued reward function R(s) and a transition model P(s′ | s, a).
However, real world environments are more likely to lack any prior knowledge of
environment dynamics. Model-free RL methods come handy in such cases.

• Q-learning is a commonly used model-free approach which can be used for building a self-playing PacMan agent. It revolves around the notion of updating Q-values, where Q(s, a) denotes the value of performing action a in state s. The following value update rule is the core of the Q-learning algorithm:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a′ Q(s′, a′) − Q(s, a)]
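A minimal tabular Q-learning sketch of this update rule; the tiny chain environment and hyperparameters are made up for illustration:

```python
import random

# Tabular Q-learning on a 5-state chain: reward waits at the right end.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.3
N_STATES, ACTIONS = 5, (0, 1)                  # action 0 = left, 1 = right
GOAL = N_STATES - 1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def env_step(s, a):
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

for episode in range(300):
    s = random.randrange(GOAL)                 # random start state
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = env_step(s, a)
        best_next = max(Q[(s2, act)] for act in ACTIONS)
        # The core update rule: Q(s,a) += alpha * (r + gamma*max Q(s',.) - Q(s,a))
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy should point right (action 1) in every state.
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)})
```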
Applications of Reinforcement Learning
Since RL requires a lot of data, it is most applicable in domains where simulated data is readily available, such as gameplay and robotics.

• RL is quite widely used in building AI for playing computer games.

• AlphaGo is the first computer program to defeat a world champion in the ancient Chinese game of Go. Other examples include ATARI games and Backgammon.

• In robotics and industrial automation, RL is used to enable the robot to create an efficient adaptive
control system for itself which learns from its own experience and behavior.

• DeepMind's work on Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Policy Updates is a good example of this.
Bayesian Network
• A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or
probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of
statistical model) that represents a set of variables and their conditional dependencies via a
directed acyclic graph (DAG).

• Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that
any one of several possible known causes was the contributing factor.

• For example, a Bayesian network could represent the probabilistic relationships between
diseases and symptoms. Given symptoms, the network can be used to compute the probabilities
of the presence of various diseases.
• Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense:
they may be observable quantities, latent variables, unknown parameters or hypotheses.

• Edges represent conditional dependencies; nodes that are not connected (no path
connects one node to another) represent variables that are conditionally independent of
each other.

• Each node is associated with a probability function that takes, as input, a particular set of
values for the node's parent variables, and gives (as output) the probability (or
probability distribution, if applicable) of the variable represented by the node.
Conditionally Independent
Example
Hidden Markov Models
Computer Vision

Make computers understand images and video.

What kind of scene is it? Where are the cars? How far is the building?
What is Computer Vision?
• To extract useful information about real physical
objects and scenes from sensed images/video.
– 3D reconstruction from images
– Object detection/recognition
• Automatic understanding of images and video
– Computing properties of the 3D world from visual data
(measurement)
– Algorithms and representations to allow a machine to
recognize objects, people, scenes, and activities.
(perception and interpretation)
Vision for measurement

Real-time stereo (NASA Mars Rover), structure from motion (Pollefeys et al.), multi-view stereo for community photo collections (Goesele et al.).

Slide credit: L. Lazebnik
Vision for perception, interpretation

Example: an annotated amusement-park photo (Cedar Point on Lake Erie) with labels such as sky, water, trees, Ferris wheel, the “Wicked Twister” and “maxair” rides, carousel, deck, bench, umbrellas, people waiting in line, people sitting on a ride, pedestrians.

What we want to recognize: Objects, Activities, Scenes, Locations, Text / writing, Faces, Gestures, Motions, Emotions…
Related Disciplines

Computer vision sits among: artificial intelligence, machine learning, graphics, image processing, cognitive science, algorithms.
Why computer vision?
• As image sources multiply, so do applications
– Relieve humans of boring, easy tasks
– Enhance human abilities: human-computer interaction,
visualization
– Perception for robotics / autonomous agents
– Organize and give access to visual content
Why computer vision?
• Images and videos are everywhere!

Personal photo albums Movies, news, sports

Surveillance and security Medical and scientific images


Slide credit: L. Lazebnik
Why computer vision matters

Safety Health Security

Comfort Fun Access


Again, what is computer vision?
• Mathematics of geometry of image formation?
• Statistics of the natural world?
• Models for neuroscience?
• Engineering methods for matching images?
• Science Fiction?
Applications of Computer Vision
• Robot Vision / Autonomous Vehicles
• Biometric Identification / Recognition
• Industrial Inspection
• Video Surveillance
• Digital Camera
• Medical Image Analysis/Processing
• Remote Sensing
• Multimedia Retrieval
• Augmented Reality
Biometric Recognition: Vision-based Biometrics

How the Afghan girl was identified by her iris patterns:


http://www.cl.cam.ac.uk/~jgd1000/afghan.html

Who is she?
Natural Language Processing
Aspects of language processing
• Word, lexicon: lexical analysis
– Morphology, word segmentation
• Syntax
– Sentence structure, phrase, grammar, …
• Semantics
– Meaning
– Execute commands
• Discourse analysis
– Meaning of a text
– Relationship between sentences (e.g. anaphora)
Applications
• Detect new words
• Language learning
• Machine translation
• NL interface
• Information retrieval
• …
Brief history
• 1950s
– Early MT: word translation + re-ordering
– Chomsky's generative grammar
– Bar-Hillel's argument
• 1960-80s
– Applications
• BASEBALL: use NL interface to search in a database on baseball games
• LUNAR: NL interface to a database on lunar rock samples
• ELIZA: simulation of a conversation with a psychoanalyst
• SHRDLU: use NL to manipulate a blocks world
• Message understanding: understand a newspaper article on terrorism
• Machine translation
– Methods
• ATN (augmented transition networks): extended context-free grammar
• Case grammar (agent, object, etc.)
• DCG – Definite Clause Grammar
• Dependency grammar: an element depends on another
• 1990s-now
– Statistical methods
– Speech recognition
– MT systems
– Question-answering
– …
Classical symbolic methods
• Morphological analyzer
• Parser (syntactic analysis)
• Semantic analysis (transform into a logical form, semantic
network, etc.)
• Discourse analysis
• Pragmatic analysis
Morphological analysis
• Goal: recognize the word and category

• Using a dictionary: word + category


• Input form (e.g. “computed”)
• Morphological rules:
Lemma + “ed” → Lemma + “e” (verb in past form), e.g. “computed” → “compute”

• Is the lemma in the dictionary? If yes, the transformation is possible
• Form → a set of possible lemmas
Parsing (in DCG)
s --> np, vp.
np --> det, noun.
np --> proper_noun.
vp --> v, np.
vp --> v.
det --> [a].
det --> [an].
det --> [the].
noun --> [apple].
noun --> [orange].
proper_noun --> [john].
proper_noun --> [mary].
v --> [eats].
v --> [loves].
E.g. “john eats an apple.” parses bottom-up as:
john → proper_noun → np; eats → v; an → det; apple → noun;
det + noun → np (“an apple”); v + np → vp; np + vp → s.
Semantic analysis

For “john eats an apple.”, each constituent maps to a semantic form:
john → proper_noun → [person: john]; eats → v → λYλX eat(X, Y); an apple → np → [apple].
Combining: vp = eat(X, [apple]); s = eat([person: john], [apple]).

Semantic categories come from an ontology, e.g.:
object → animated / non-animated; animated → person, animal, …;
non-animated → food, …; food → fruit, …; animal → vertebrate, …
Parsing & semantic analysis
• Rules: syntactic rules or semantic rules
– What component can be combined with what
component?
– What is the result of the combination?
• Categories
– Syntactic categories: Verb, Noun, …
– Semantic categories: Person, Fruit, Apple, …
• Analyses
– Recognize the category of an element
– See how different elements can be combined into a
sentence
– Problem: The choice is often not unique
Write a semantic analysis grammar
S(pred(obj)) -> NP(obj) VP(pred)
VP(pred(obj)) -> Verb(pred) NP(obj)
NP(obj) -> Name(obj)
Name(John) -> John
Name(Mary) -> Mary
Verb(λyλx Loves(x,y)) -> loves
Discourse analysis
• Anaphora
He hits the car with a stone. It bounces back.
• Understanding a text
– Who/when/where/what … are involved in an event?
– How to connect the semantic representations of different
sentences?
– What is the cause of an event and what is the consequence of
an action?
–…
Pragmatic analysis
• Practical usage of language: what a sentence means in
practice
– Do you have time?
– How do you do?
– It is too cold to go outside!
–…
Problems
• Ambiguity
– Lexical/morphological: change (V,N), training (V,N), even (ADJ,
ADV) …
– Syntactic: Helicopter powered by human flies
– Semantic: He saw a man on the hill with a telescope.
– Discourse: anaphora, …
• Classical solution
– Using a later analysis to solve ambiguity of an earlier step
– Eg. He gives him the change.
(change as verb does not work for parsing)
He changes the place.
(change as noun does not work for parsing)
– However: He saw a man on the hill with a telescope.
• Multiple correct parsings
• Multiple correct semantic interpretations → semantic ambiguity
• Use contextual information to disambiguate (does a sentence in the text
mention that “He” holds a telescope?)
Statistical analysis to help solve ambiguity
• Choose the most likely solution

solution* = argmax solution P(solution | word, context)

e.g. argmax cat P(cat | word, context)


argmax sem P(sem | word, context)

Context varies widely (preceding word, following word, category of the preceding word, …)

• How to obtain P(solution | word, context)?


– Training corpus
Statistical language modeling

• Goal: create a statistical model so that one can calculate the probability of a sequence of tokens s = w1, w2, …, wn in a language.

• General approach: estimate the probabilities of the elements observed in a training corpus; the resulting model assigns a probability P(s) to any sequence s.
Prob. of a sequence of words
P(s) = P(w1, w2, …, wn)
     = P(w1) · P(w2 | w1) · … · P(wn | w1, …, wn−1)
     = ∏ᵢ₌₁ⁿ P(wᵢ | hᵢ)

Elements to be estimated: P(wᵢ | hᵢ) = P(hᵢ wᵢ) / P(hᵢ)

– If hᵢ is too long, one cannot observe (hᵢ, wᵢ) in the training corpus, and (hᵢ, wᵢ) is hard to generalize.
– Solution: limit the length of hᵢ.
n-grams

• Limit hᵢ to the n − 1 preceding words. Most used cases:

– Uni-gram: P(s) = ∏ᵢ₌₁ⁿ P(wᵢ)
– Bi-gram: P(s) = ∏ᵢ₌₁ⁿ P(wᵢ | wᵢ₋₁)
– Tri-gram: P(s) = ∏ᵢ₌₁ⁿ P(wᵢ | wᵢ₋₂ wᵢ₋₁)
A simple example
(corpus = 10 000 words, 10 000 bi-grams)
wi (freq)   P(wi)              wi-1 (freq)   (wi-1 wi) (freq)    P(wi|wi-1)
I (10)      10/10000 = 0.001   # (1000)      (# I) (8)           8/1000 = 0.008
                               that (10)     (that I) (2)        0.2
talk (8)    0.0008             I (10)        (I talk) (2)        0.2
                               we (10)       (we talk) (1)       0.1
talks (8)   0.0008             he (5)        (he talks) (2)      0.4
                               she (5)       (she talks) (2)     0.4
she (5)     0.0005             says (4)      (she says) (2)      0.5
                               laughs (2)    (she laughs) (1)    0.5
                               listens (2)   (she listens) (2)   1.0

Uni-gram: P(I, talk) = P(I) · P(talk) = 0.001 · 0.0008
          P(I, talks) = P(I) · P(talks) = 0.001 · 0.0008
Bi-gram:  P(I, talk) = P(I | #) · P(talk | I) = 0.008 · 0.2
          P(I, talks) = P(I | #) · P(talks | I) = 0.008 · 0
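A small sketch of this kind of estimation in Python (the toy corpus is made up; '#' marks sentence starts as in the table):

```python
from collections import Counter

# MLE estimation of bi-gram probabilities from a toy corpus.
corpus = [["#", "I", "talk"], ["#", "we", "talk"], ["#", "he", "talks"]]

bigrams, contexts = Counter(), Counter()
for sent in corpus:
    bigrams.update(zip(sent[:-1], sent[1:]))   # count (w_{i-1}, w_i) pairs
    contexts.update(sent[:-1])                 # count contexts w_{i-1}

def p(w, prev):
    """P(w | prev) by maximum likelihood: #(prev w) / #(prev)."""
    return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0

# P(I, talk) = P(I | #) * P(talk | I); unseen bi-grams get probability 0.
print(p("I", "#") * p("talk", "I"))        # 1/3 * 1 = 0.333...
print(p("they", "#") * p("talk", "they"))  # 0 -> motivates smoothing
```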
Estimation
• History: short ↔ long; modeling: coarse ↔ refined; estimation: easy ↔ difficult
• Maximum likelihood estimation (MLE):

P(wᵢ) = #(wᵢ) / |C_uni|    P(hᵢ wᵢ) = #(hᵢ wᵢ) / |C_n-gram|

– If (hᵢ wᵢ) is not observed in the training corpus, P(wᵢ | hᵢ) = 0
– P(they, talk) = P(they | #) · P(talk | they) = 0
• (they talk) is never observed in the training data
– Hence smoothing
Smoothing

• Goal: assign a low probability to words


or n-grams not observed in the training
corpus
(Plot: probability P versus word, comparing the MLE estimate with the smoothed estimate; smoothing moves a little probability mass from observed words to unseen ones.)
Smoothing methods
For an n-gram a:
• Change the frequency of occurrences
– Laplace smoothing (add-one):

P_add_one(a | C) = (|a| + 1) / Σ_{aᵢ ∈ V} (|aᵢ| + 1)

– Good-Turing: change the frequency r to

r* = (r + 1) · n_{r+1} / n_r

where n_r = number of n-grams of frequency r
Smoothing

• Combine a model with a lower-order


model
– Backoff (Katz):

P_Katz(wᵢ | wᵢ₋₁) = P_GT(wᵢ | wᵢ₋₁) if #(wᵢ₋₁ wᵢ) > 0, and α(wᵢ₋₁) · P_Katz(wᵢ) otherwise

– Interpolation (Jelinek-Mercer):

P_JM(wᵢ | wᵢ₋₁) = λ_{wᵢ₋₁} · P_ML(wᵢ | wᵢ₋₁) + (1 − λ_{wᵢ₋₁}) · P_JM(wᵢ)
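A sketch of add-one smoothing on the same kind of toy corpus (names and counts are illustrative):

```python
from collections import Counter

# Laplace (add-one) smoothing for bi-grams: every count is incremented by 1,
# so unseen bi-grams get a small non-zero probability.
corpus = [["#", "I", "talk"], ["#", "we", "talk"], ["#", "he", "talks"]]
vocab = {w for sent in corpus for w in sent}

bigrams, contexts = Counter(), Counter()
for sent in corpus:
    bigrams.update(zip(sent[:-1], sent[1:]))
    contexts.update(sent[:-1])

def p_laplace(w, prev):
    return (bigrams[(prev, w)] + 1) / (contexts[prev] + len(vocab))

print(p_laplace("I", "#"))      # seen bi-gram:   (1 + 1) / (3 + 6)
print(p_laplace("they", "#"))   # unseen bi-gram: (0 + 1) / (3 + 6)
```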
Examples of utilization
• Predict the next word
– argmax w P(w | previous words)
• Used in input (predict the next letter/word on cellphone)
• Use in machine aided human translation
– Source sentence
– Already translated part
– Predict the next translation word or phrase
argmax w P(w | previous trans. words, source sent.)
Quality of a statistical language model
• Test a trained model on a test collection
– Try to predict each word
– The more precisely a model can predict the words,
the better is the model
• Perplexity (the lower, the better)
– Given P(wᵢ) and a test text of length N:

Perplexity = 2^(−(1/N) · Σᵢ₌₁ᴺ log₂ P(wᵢ))

– The inverse of the geometric mean of the word probabilities
– At each word, how many choices does the model propose?
• Perplexity = 32 ≈ 32 words could fit this position
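A short sketch of the perplexity computation (the model probabilities are made up):

```python
import math

# Perplexity of a uni-gram model on a tiny test text (the lower, the better).
probs = {"the": 0.06, "cat": 0.002, "sat": 0.001}   # illustrative model
test = ["the", "cat", "sat"]

log_sum = sum(math.log2(probs[w]) for w in test)
perplexity = 2 ** (-log_sum / len(test))
print(perplexity)   # roughly how many words could fit each position
```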
State of the art
• Sufficient training data
– The longer the n (n-gram), the lower the perplexity
• Limited data
– When n is too large, the estimates become unreliable and perplexity rises again
– Data sparseness (sparsity)
• Much NLP research uses 5-grams or 6-grams
• Google Books n-grams (up to 5-grams): https://books.google.com/ngrams
More than predicting words
• Speech recognition
– Training corpus = signals + words
– probabilities: P(signal|word), P(word2|word1)
– Utilization: signals → sequence of words

• Statistical tagging
– Training corpus = words + tags (n, v)
– Probabilities: P(word|tag), P(tag2|tag1)
– Utilization: sentence → sequence of tags
Example of utilization
• Speech recognition (simplified)
argmax_{w1,…,wn} P(w1, …, wn | s1, …, sn)
= argmax_{w1,…,wn} P(s1, …, sn | w1, …, wn) · P(w1, …, wn)
= argmax_{w1,…,wn} ∏ᵢ P(sᵢ | w1, …, wn) · P(wᵢ | wᵢ₋₁)
= argmax_{w1,…,wn} ∏ᵢ P(sᵢ | wᵢ) · P(wᵢ | wᵢ₋₁)
– argmax computed by Viterbi search
– probabilities:
• P(signal|word),
P(*** | ice-cream)=P(*** | I scream)=0.8;
• P(word2 | word1)
P(ice-cream | eat) > P(I scream | eat)
– Input speech signals s1, s2, …, sn
• I eat ice-cream. > I eat I scream.
Example of utilization
• Statistical tagging
– Training corpus = word + tag (e.g. Penn Tree Bank)
– For w1, …, wn:
argmax_{tag1,…,tagn} ∏ᵢ P(wᵢ | tagᵢ) · P(tagᵢ | tagᵢ₋₁)
– probabilities:
• P(word|tag)
P(change|noun)=0.01, P(change|verb)=0.015;
• P(tag2|tag1)
P(noun|det) >> P(verb|det)
– Input words: w1, …, wn
• I give him the change.
pronoun verb pronoun det noun >
pronoun verb pronoun det verb
Some improvements of the model
• Class model
– Instead of estimating P(w2|w1), estimate P(w2|Class1)
– P(me | take) vs. P(me | Verb)
– More general model
– Less data sparseness problem
• Skip model
– Instead of P(wi|wi-1), allow P(wi|wi-k)
– Allow to consider longer dependence
State of the art on POS-tagging
• POS = Part of speech (syntactic category)
• Statistical methods
• Training based on annotated corpus (text with tags
annotated manually)
– Penn Treebank: a set of texts with manual annotations
http://www.cis.upenn.edu/~treebank/
Statistical machine translation
argmax F P(F|E) = argmax F P(E|F) P(F) / P(E)
= argmax F P(E|F) P(F)

• P(E|F): translation model


• P(F): language model, e.g. trigram model
• More to come later on translation model
Summary
• Traditional NLP approaches: symbolic, grammar, …
• More recent approaches: statistical
• For some applications: statistical approaches are better (tagging, speech
recognition, …)
• For some others, traditional approaches are better (MT)
• Trend: combine statistics with rules (grammar)
E.g.
– Probabilistic Context Free Grammar (PCFG)
– Consider some grammatical connections in statistical approaches
• NLP still a very difficult problem
Feedforward Networks for Classification
Convolutional Neural Network based Recognition using Keras, Tensorflow and OpenCV

• A feedforward neural network is an artificial


neural network wherein connections
between the nodes do not form a cycle. As
such, it is different from recurrent neural
networks.
• The feedforward neural network was the
first and simplest type of artificial neural
network devised.
• In this network, the information moves in
only one direction, forward, from the input
nodes, through the hidden nodes (if any)
and to the output nodes. There are no cycles
or loops in the network.
Convolutional neural networks (CNN)

“Convolutional neural networks (CNN, ConvNet) are a class of deep, feed-forward (not recurrent) artificial neural networks that are applied to analyzing visual imagery.”

Use: Images are high-dimensional vectors. It would take a huge number of parameters to characterize such a network. To address this problem, bionic convolutional neural networks are proposed to reduce the number of parameters and adapt the network architecture specifically to vision tasks. Convolutional neural networks are usually composed of a set of layers that can be grouped by their functionalities.
Introduction to Convolutional Neural Networks
Convolutional neural networks (CNNs) are the current state-of-the-art model architecture for image classification tasks. CNNs apply a
series of filters to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for
classification. CNNs contain three components:

• Convolutional layers, which apply a specified number of convolution filters to the image. For each subregion, the layer performs
a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a
ReLU activation function to the output to introduce nonlinearities into the model.
• Pooling layers, which downsample the image data extracted by the convolutional layers to reduce the dimensionality of the
feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions
of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
• Dense (fully connected) layers, which perform classification on the features extracted by the convolutional layers and
downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.

Typically, a CNN is composed of a stack of convolutional modules that perform feature extraction. Each module consists of a
convolutional layer followed by a pooling layer. The last convolutional module is followed by one or more dense layers that perform
classification. The final dense layer in a CNN contains a single node for each target class in the model (all the possible classes the
model may predict), with a softmax activation function to generate a value between 0 and 1 for each node (the sum of all these softmax
values is equal to 1). We can interpret the softmax values for a given image as relative measurements of how likely it is that the image
falls into each target class.
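A minimal Keras sketch of this architecture, with illustrative hyperparameters (two convolution + pooling modules, a dense layer, and a softmax output):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g. grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                   # 2x2 max pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # dense (fully connected)
    layers.Dense(10, activation="softmax"),        # one node per target class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```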
Convolutional neural networks (CNN)
Convolution Layer

• The process is a 2D convolution on the inputs.


• The “dot products” between weights and inputs are “integrated” across
“channels”.
• Filter weights are shared across receptive fields. The filter has the same number of layers as the input volume has channels, and the output volume has the same “depth” as the number of filters.
Convolution Layer
Activation Layer

• Used to increase non-linearity of the


network without affecting receptive
fields of conv layers

• Prefer ReLU; it results in faster training

• Leaky ReLU addresses the vanishing-gradient (“dying ReLU”) problem
Softmax

• A special kind of activation layer, usually at


the end of FC layer outputs
• Can be viewed as a fancy normalizer (a.k.a.
Normalized exponential function)
• Produces a discrete probability distribution vector.
• Very convenient when combined with
cross-entropy loss.
Pooling Layer

• Convolutional layers provide activation


maps.
• Pooling layer applies non-linear
downsampling on activation maps.
• Pooling is aggressive (it discards information); the trend is to use smaller filter sizes and to abandon pooling
Fully Connected Layers
• Regular neural network
• Can be viewed as the final learning phase, which maps extracted visual features to desired outputs
• Usually adaptive to classification/encoding tasks
• Common output is a vector, which is then passed through softmax to represent the confidence of classification
• The outputs can also be used as a “bottleneck” feature representation
• In the example above, the FC layer generates a number which is then passed through a sigmoid to represent grasp success probability
Keras
• Keras is an open-source neural-network library written in Python.
• It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML.
• Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly,
modular, and extensible.
• It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic
Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a
Google engineer.
• Chollet is also the author of the Xception deep neural network model.
• In 2017, Google's TensorFlow team decided to support Keras in TensorFlow's core library.
• Chollet explained that Keras was conceived to be an interface rather than a standalone machine-
learning framework.
• It offers a higher-level, more intuitive set of abstractions that make it easy to develop deep learning
models regardless of the computational backend used.
• Microsoft added a CNTK backend to Keras as well, available as of CNTK v2.0.
Features of Keras

• Keras contains numerous implementations of commonly used neural-network building blocks


such as layers, objectives, activation functions, optimizers, and a host of tools to make working
with image and text data easier.
• The code is hosted on GitHub, and community support forums include the GitHub issues page,
and a Slack channel.
• In addition to standard neural networks, Keras has support for convolutional and recurrent
neural networks.
• It supports other common utility layers like dropout, batch normalization, and pooling.
• Keras allows users to productize deep models on smartphones (iOS and Android), on the web, or
on the Java Virtual Machine.
• It also allows use of distributed training of deep-learning models on clusters of Graphics
Processing Units (GPU) and Tensor processing units (TPU).
Tensorflow

• TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks.
• It is used for both research and production at Google.
• It is a standard expectation in the industry to have experience in TensorFlow to work in machine
learning.
• TensorFlow was developed by the Google Brain team for internal Google use.
• It was released under the Apache 2.0 open-source license on November 9, 2015.
• TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on February
11, 2017.
Tensorflow

• While the reference implementation runs on single devices, TensorFlow can run on multiple
CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose computing on
graphics processing units).
• TensorFlow is available on 64-bit Linux, macOS, Windows, and mobile computing platforms
including Android and iOS.
• Its flexible architecture allows for the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge
devices.
• TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow
derives from the operations that such neural networks perform on multidimensional data arrays,
which are referred to as tensors. During the Google I/O Conference in June 2016, Jeff Dean
stated that 1,500 repositories on GitHub mentioned TensorFlow, of which only 5 were from
Google.
OpenCV
• OpenCV (Open source computer vision) is a library of programming functions mainly aimed at
real-time computer vision.
• Originally developed by Intel, it was later supported by Willow Garage then Itseez (which was
later acquired by Intel).
• The library is cross-platform and free for use under the open-source BSD license.
• OpenCV supports the deep learning frameworks TensorFlow, Torch/PyTorch and Caffe.
• OpenCV is written in C++ and its primary interface is in C++, but it still retains a less
comprehensive though extensive older C interface. There are bindings in Python, Java and
MATLAB/OCTAVE.
• The API for these interfaces can be found in the online documentation. Wrappers in other
languages such as C#, Perl, Ch, Haskell, and Ruby have been developed to encourage adoption
by a wider audience.
• Since version 3.4, OpenCV.js is a JavaScript binding for a selected subset of OpenCV functions
for the web platform.
• All of the new developments and algorithms in OpenCV are now developed in the C++
interface.
Applications of OpenCV
• openFrameworks running the OpenCV add-on example
• 2D and 3D feature toolkits
• Egomotion estimation
• Facial recognition system
• Gesture recognition
• Human–computer interaction (HCI)
• Mobile robotics
• Motion understanding
• Object identification
• Segmentation and recognition
• Stereopsis (stereo vision): depth perception from 2 cameras
• Structure from motion (SFM)
• Motion tracking
• Augmented reality

To support some of the above areas, OpenCV includes a statistical machine learning library that contains:
• Boosting
• Decision tree learning
• Gradient boosting trees
• Expectation-maximization algorithm
• k-nearest neighbor algorithm
• Naive Bayes classifier
• Artificial neural networks
• Random forest
• Support vector machine (SVM)
• Deep neural networks (DNN)
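A small OpenCV sketch tying a few of these pieces together: face detection with one of the Haar cascades shipped with the library ("image.jpg" is a placeholder path):

```python
import cv2

# Read an image, convert to grayscale, and detect faces.
img = cv2.imread("image.jpg")            # placeholder input file
assert img is not None, "image not found"
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:               # draw a box around each face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```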
Working with Keras Models
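A self-contained sketch of the typical Keras workflow (define, compile, fit, evaluate, predict), using random stand-in data:

```python
import numpy as np
from tensorflow.keras import layers, models

# Random stand-in data: 100 samples with 20 features, binary labels.
x_train = np.random.rand(100, 20)
y_train = np.random.randint(0, 2, size=(100,))

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=16, verbose=0)
loss, acc = model.evaluate(x_train, y_train, verbose=0)
preds = model.predict(x_train[:3])        # class probabilities
print(loss, acc, preds.ravel())
```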
Pretrained Models

• AlexNet
• VGGNet
Pretrained Models Progress

LeNet (1995)
• 2 convolution + pooling layers
• 2 hidden dense layers

AlexNet
• Bigger and deeper LeNet
• ReLU, dropout, preprocessing

VGG
• Bigger and deeper AlexNet (repeated VGG blocks)
A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural network, designed to recognize visual patterns directly from pixel images with minimal pre-processing. The ImageNet project is a large visual database designed for use
in visual object recognition software research. The ImageNet project runs an annual
software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC),
where software programs compete to correctly classify and detect objects and scenes.
AlexNet (2012)
• In 2012, AlexNet significantly outperformed all the prior competitors and won the challenge by
reducing the top-5 error from 26% to 15.3%. The second place top-5 error rate, which was not a
CNN variation, was around 26.2%.
• The network had a very similar architecture as LeNet by Yann LeCun et al but was deeper, with
more filters per layer, and with stacked convolutional layers.
• It consisted of 11x11, 5x5 and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum.
• It attached ReLU activations after every convolutional and fully-connected layer. AlexNet was trained for 6 days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two pipelines.
• AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey
Hinton, and Ilya Sutskever.
AlexNet Architecture

• Deeper and bigger LeNet


• Key modifications
– Dropout (regularization)
– ReLU (training)
– MaxPooling
AlexNet
VGGNet (2014)
• The runner-up at the ILSVRC 2014 competition is dubbed VGGNet by the community and was
developed by Simonyan and Zisserman.
• VGGNet consists of 16 convolutional layers and is very appealing because of its very uniform
architecture.
• Similar to AlexNet, only 3x3 convolutions, but lots of filters.
• Trained on 4 GPUs for 2–3 weeks. It is currently the most preferred choice in the community for
extracting features from images.
• The weight configuration of the VGGNet is publicly available and has been used in many other
applications and challenges as a baseline feature extractor.
• However, VGGNet consists of 138 million parameters, which can be a bit challenging to handle.
VGGNet
VGGNet
• AlexNet is deeper and bigger than LeNet to get better performance
• Go even bigger & deeper?
• Options
– More dense layers (too expensive)
– More convolutions
– Group into blocks
VGGNet
VGG Net

• VGG (from Oxford's Visual Geometry Group, https://arxiv.org/abs/1409.1556).
• It was introduced in 2014, when it became a runner-up in the ImageNet challenge of that year.
• The VGG family of networks remains popular today and is often used as a benchmark against newer
architectures.
• Prior to VGG and AlexNet, the initial convolutional layers of a network used filters with large receptive
fields, such as 7 x 7.
• Additionally, the networks usually had alternating single convolutional and pooling layers.
• The authors of the paper observed that a convolutional layer with a large filter size can be replaced with a
stack of two or more convolutional layers with smaller filters (factorized convolution).
VGG Net
• For example, we can replace one 5 x 5 layer with a stack of two 3 x 3 layers, or
a 7 x 7 layer with a stack of three 3 x 3 layers.

This structure has several advantages:

• The neurons of the last of the stacked layers have the equivalent receptive field
size of a single layer with a large filter.
• The number of weights and operations of stacked layers is smaller, compared to
a single layer with large filter size.
• Let's assume we want to replace one 5 x 5 layer with two 3 x 3 layers. Let's also
assume that all layers have an equal number of input and output channels
(slices), M. The total number of weights (excluding biases) of the 5 x 5 layer is
5×5×M×M = 25M².
VGG Net

• On the other hand, the total number of weights of a single 3 x 3 layer is 3×3×M×M = 9M², and simply 2×(3×3×M×M) = 18M² for two layers, which makes this arrangement 28% more efficient (18/25 = 0.72).
• The efficiency will increase further with larger filters.
• Stacking multiple layers makes the decision function more discriminative.
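The arithmetic can be checked in a couple of lines (M is an illustrative channel count; the ratio is independent of M):

```python
# Weight counts for a 5x5 layer vs. two stacked 3x3 layers, biases excluded.
M = 64
single_5x5 = 5 * 5 * M * M              # 25 * M^2
two_3x3 = 2 * (3 * 3 * M * M)           # 18 * M^2
print(two_3x3 / single_5x5)             # 0.72 -> 28% fewer weights

# Larger filters benefit even more, e.g. one 7x7 vs. three stacked 3x3 layers:
print(3 * (3 * 3 * M * M) / (7 * 7 * M * M))   # 27/49 ~ 0.55
```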
VGG with Keras, PyTorch, and TensorFlow
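A sketch of loading the pretrained VGG16 weights in Keras (one of the three frameworks named above) and classifying a single image ("elephant.jpg" is a placeholder path):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import (VGG16, preprocess_input,
                                                 decode_predictions)
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")          # downloads the ImageNet weights

img = image.load_img("elephant.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])  # top-3 (class, name, score)
```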
