• Structured Models:
Bayesian Network,
Hidden Markov Models,
Reinforcement Learning,
• Applications of ML to Perception:
Computer Vision,
Natural Language Processing,
Design and implementation of Machine Learning Algorithms,
• Simulation:
Use VGG Net and AlexNet pre-trained models for face recognition and human pose
estimation problems
Questions:
Feed-Forward Neural Networks
Roman Belavkin
Middlesex University
Question 1
[Diagram: a single unit with inputs x1, x2, x3, weights w1, w2, w3, weighted sum v, and output y = φ(v).]
Figure 1: Single unit with three inputs.
The node has three inputs x = (x1, x2, x3) that receive only binary signals
(either 0 or 1). How many different input patterns can this node receive?
What if the node had four inputs? Five? Can you give a formula that
computes the number of binary input patterns for a given number of inputs?
For three inputs there are 8 patterns:
x1: 0 1 0 1 0 1 0 1
x2: 0 0 1 1 0 0 1 1
x3: 0 0 0 0 1 1 1 1
For four inputs there are 16 patterns:
x1: 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
x2: 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x3: 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x4: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
You may check that for five inputs the number of combinations will be 32.
Note that 8 = 2³, 16 = 2⁴ and 32 = 2⁵ (for three, four and five inputs).
Thus, the formula for the number of binary input patterns for n inputs is 2ⁿ.
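As a quick sanity check, this can be verified with a few lines of Python (a minimal sketch):

    # A minimal sketch: enumerate all binary input patterns for n inputs
    # and confirm that the count is 2**n.
    from itertools import product

    for n in (3, 4, 5):
        patterns = list(product((0, 1), repeat=n))  # all binary n-tuples
        print(n, len(patterns), 2 ** n)             # len(patterns) == 2**n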
Question 2
Consider the unit shown on Figure 1. Suppose that the weights correspond-
ing to the three inputs have the following values:
w1 = 2
w2 = 4
w3 = 1
Calculate what the output value y of the unit will be for each of the
following input patterns:
Pattern:  P1  P2  P3  P4
x1:        1   0   1   1
x2:        0   1   0   1
x3:        0   1   1   1
Answer: To find the output value y for each pattern we have to:
a) Calculate the weighted sum: v = Σ_i wi·xi = w1·x1 + w2·x2 + w3·x3
b) Apply the activation function to obtain the output: y = φ(v)
Question 3
Logical operators (i.e. NOT, AND, OR, XOR, etc.) are the building blocks
of any computational device. Logical functions return only two possible
values, true or false, based on the truth values of their arguments.
For example, the operator AND returns true only when all its arguments are
true; otherwise (if any of the arguments is false) it returns false. If we denote
truth by 1 and falsity by 0, then the logical function AND can be represented by
the following table:
x1 : 0 1 0 1
x2 : 0 0 1 1
x1 AND x2 : 0 0 0 1
This function can be implemented by a single unit with two inputs:
[Diagram: a single unit with inputs x1, x2, weights w1, w2, weighted sum v, and output y = φ(v).]
Answer: with weights w1 = w2 = 1 and threshold θ = 2:
P1: v = 1·0 + 1·0 = 0, (0 < 2), y = φ(0) = 0
P2: v = 1·1 + 1·0 = 1, (1 < 2), y = φ(1) = 0
P3: v = 1·0 + 1·1 = 1, (1 < 2), y = φ(1) = 0
P4: v = 1·1 + 1·1 = 2, (2 ≥ 2), y = φ(2) = 1
b) Suggest how to change either the weights or the threshold level of this
single-unit in order to implement the logical OR function (true when
at least one of the arguments is true):
x1 : 0 1 0 1
x2 : 0 0 1 1
x1 OR x2 : 0 1 1 1
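One way to experiment with parts (a) and (b) is a minimal sketch of a threshold unit, assuming a step activation φ(v) = 1 if v ≥ θ and 0 otherwise, as in the answers above; w = (1, 1) with θ = 2 gives AND, and lowering θ to 1 gives OR:

    # A minimal sketch of a threshold unit with a step activation:
    # phi(v) = 1 if v >= theta else 0.
    def unit(x, w, theta):
        v = sum(wi * xi for wi, xi in zip(w, x))  # weighted sum
        return 1 if v >= theta else 0             # threshold activation

    for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        # w = (1, 1), theta = 2 implements AND; theta = 1 implements OR
        print(x, unit(x, (1, 1), 2), unit(x, (1, 1), 1))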
c) The XOR function (exclusive or) returns true only when one of the
arguments is true and the other is false. Otherwise, it always returns
false. This can be represented by the following table:
x1 : 0 1 0 1
x2 : 0 0 1 1
x1 XOR x2 : 0 1 1 0
Question 4
(Units 3 and 4 receive the inputs x1, x2; units 5 and 6 receive y3, y4.)
For input (x1, x2) = (1, 0):
v3 = -2·1 + 3·0 = -2,  y3 = φ(-2) = 0
v4 = 4·1 - 1·0 = 4,    y4 = φ(4) = 1
v5 = 1·0 - 1·1 = -1,   y5 = φ(-1) = 0
v6 = -1·0 + 1·1 = 1,   y6 = φ(1) = 1
For input (x1, x2) = (0, 1):
v3 = -2·0 + 3·1 = 3,   y3 = φ(3) = 1
v4 = 4·0 - 1·1 = -1,   y4 = φ(-1) = 0
v5 = 1·1 - 1·0 = 1,    y5 = φ(1) = 1
v6 = -1·1 + 1·0 = -1,  y6 = φ(-1) = 0
For input (x1, x2) = (1, 1):
v3 = -2·1 + 3·1 = 1,   y3 = φ(1) = 1
v4 = 4·1 - 1·1 = 3,    y4 = φ(3) = 1
v5 = 1·1 - 1·1 = 0,    y5 = φ(0) = 1
v6 = -1·1 + 1·1 = 0,   y6 = φ(0) = 1
Question 5
Question 6
What is an epoch?
Dimensionality Reduction
and Feature Construction
• Variance of an attribute:
var(A1) = Σ_{i=1}^{n} (xi - x̄)² / (n - 1)
• Covariance of two attributes:
cov(A1, A2) = Σ_{i=1}^{n} (xi - x̄)(yi - ȳ) / (n - 1)
– Covariance matrix:
C = ( var(H)  104.5  )  =  ( 47.7   104.5 )
    ( 104.5   var(M) )     ( 104.5  370   )
Covariance matrix
• Eigenvectors:
– Let M be an n×n matrix.
• v is an eigenvector of M if M·v = λv
• λ is called the eigenvalue associated with v
1. Given original data set S = {x1, ..., xk}, produce new set
by subtracting the mean of attribute Ai from each xi.
v1 = (-0.677873399, -0.735178956)ᵀ,  λ1 = 1.28402771
v2 = (-0.735178956, 0.677873399)ᵀ,   λ2 = 0.0490833989
FeatureVector1 = ( -0.677873399  -0.735178956 )
                 ( -0.735178956   0.677873399 )
FeatureVector2 = ( -0.677873399 )
                 ( -0.735178956 )
5. Derive the new data set.
RowFeatureVector1 = ( -0.677873399  -0.735178956 )
                    ( -0.735178956   0.677873399 )
RowDataAdjust = ( 0.69  -1.31  0.39  0.09  1.29  0.49   0.19  -0.81  -0.31  -0.71 )
                ( 0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01 )
so we can do
RowDataAdjust = RowFeatureVector⁻¹ × TransformedData
              = RowFeatureVectorᵀ × TransformedData
and
RowDataOriginal = RowDataAdjust + OriginalMean
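The whole procedure can be sketched in a few lines of NumPy, reusing the mean-adjusted data (RowDataAdjust) shown above; this is a minimal sketch, not the tutorial's exact code:

    # A minimal PCA sketch in NumPy, reusing the mean-adjusted data above.
    import numpy as np

    adjusted = np.array([
        [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71],
        [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01],
    ])  # rows = attributes, columns = samples (already mean-subtracted)

    cov = np.cov(adjusted)                   # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    row_feature_vector = eigvecs[:, ::-1].T  # eigenvectors as rows, largest first
    transformed = row_feature_vector @ adjusted      # the new data set
    recovered = row_feature_vector.T @ transformed   # back to adjusted data

    print(eigvals[::-1])                   # ~[1.284, 0.049], matching the values above
    print(np.allclose(recovered, adjusted))  # True: V^-1 equals V^T here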
Example: Linear discrimination using PCA for face
recognition
• Normalize intensities
2. Raw features are pixel intensity values (2061 features)
• Compute the mean face:
Ψ = (1/M) Σ_{i=1}^{M} Γi
• Subtract the mean face from each face vector:
Φi = Γi - Ψ
• Linear discrimination
(e.g., “glasses” versus “no glasses”,
or “male” versus “female”)
Linear Discriminant Analysis
(LDA)
Linear Discriminant Analysis (LDA) is a dimensionality-reduction
technique for data with a large number of attributes.
Attributes:
• X
• O
• Blank
Class:
• Positive (win for X)
• Negative (win for O)
Dataset
Reinforcement Learning
• Q-learning is a commonly used model-free approach which can be used for building a
self-playing Pac-Man agent. It revolves around the notion of updating Q-values, which
denote the value of performing action a in state s. The following value update rule is the
core of the Q-learning algorithm:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]
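A minimal sketch of this update rule in Python, with hypothetical states and actions (alpha is the learning rate, gamma the discount factor):

    # A minimal sketch of the tabular Q-learning update.
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(state, action)] -> estimated value

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    # Example: one transition in a hypothetical grid world
    q_update(s=(0, 0), a="right", r=-1.0, s_next=(0, 1),
             actions=["up", "down", "left", "right"])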
Applications of Reinforcement Learning
Since RL requires a lot of data, it is most applicable in domains where simulated data is
readily available, such as gameplay and robotics.
• AlphaGo was the first computer program to defeat a world champion in the ancient Chinese
game of Go. Other examples include Atari games and backgammon.
• In robotics and industrial automation, RL is used to enable the robot to create an efficient adaptive
control system for itself which learns from its own experience and behavior.
• DeepMind's work on Deep Reinforcement Learning for Robotic Manipulation with Asynchronous
Policy Updates is a good example of the same.
Bayesian Network
• A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or
probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of
statistical model) that represents a set of variables and their conditional dependencies via a
directed acyclic graph (DAG).
• Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that
any one of several possible known causes was the contributing factor.
• For example, a Bayesian network could represent the probabilistic relationships between
diseases and symptoms. Given symptoms, the network can be used to compute the probabilities
of the presence of various diseases.
• Bayesian networks are DAGs whose nodes represent variables in the Bayesian sense:
they may be observable quantities, latent variables, unknown parameters or hypotheses.
• Edges represent conditional dependencies; nodes that are not connected (no path
connects one node to another) represent variables that are conditionally independent of
each other.
• Each node is associated with a probability function that takes, as input, a particular set of
values for the node's parent variables, and gives (as output) the probability (or
probability distribution, if applicable) of the variable represented by the node.
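As an illustration of such node probability functions, here is a minimal sketch of the classic rain/sprinkler/wet-grass network with hypothetical numbers, factorizing the joint distribution along the DAG:

    # A minimal sketch of a tiny Bayesian network (hypothetical numbers):
    # Rain -> Sprinkler, and both Rain and Sprinkler -> WetGrass.
    P_rain = {True: 0.2, False: 0.8}
    P_sprinkler = {True: {True: 0.01, False: 0.99},    # P(Sprinkler | Rain)
                   False: {True: 0.40, False: 0.60}}
    P_wet = {(True, True): 0.99, (True, False): 0.80,  # P(Wet | Rain, Sprinkler)
             (False, True): 0.90, (False, False): 0.00}

    def joint(rain, sprinkler, wet):
        # The DAG factorizes the joint: P(R, S, W) = P(R) P(S|R) P(W|R,S)
        p_w = P_wet[(rain, sprinkler)]
        return (P_rain[rain] * P_sprinkler[rain][sprinkler]
                * (p_w if wet else 1 - p_w))

    print(joint(True, False, True))  # P(rain, no sprinkler, wet grass) = 0.1584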
Conditionally Independent
Example
Hidden Markov Models
Computer Vision
Make computers understand images and video.
What kind of scene? Where are the cars? How far is the building? …
What is Computer Vision?
• To extract useful information about real physical
objects and scenes from sensed images/video.
– 3D reconstruction from images
– Object detection/recognition
• Automatic understanding of images and video
– Computing properties of the 3D world from visual data
(measurement)
– Algorithms and representations to allow a machine to
recognize objects, people, scenes, and activities.
(perception and interpretation)
Vision for measurement
[Images: real-time stereo; structure from motion (Pollefeys et al.); multi-view stereo for
community photo collections (Goesele et al.).]
Slide credit: L. Lazebnik
Vision for perception, interpretation
[Annotated photo of the Cedar Point amusement park on Lake Erie: labels include sky, water,
trees, rides (Wicked Twister, Ferris wheel, maxair, carousel), people waiting in line, people
sitting on a ride, umbrellas, deck, bench, pedestrians.]
What can be recognized: objects, activities, scenes, locations, text / writing, faces, gestures,
motions, emotions, …
Related Disciplines
[Diagram: computer vision overlaps with artificial intelligence, machine learning, graphics,
image processing, cognitive science, and algorithms.]
Why computer vision?
• As image sources multiply, so do applications
– Relieve humans of boring, easy tasks
– Enhance human abilities: human-computer interaction,
visualization
– Perception for robotics / autonomous agents
– Organize and give access to visual content
Why computer vision?
• Images and videos are everywhere!
Natural Language Processing
Aspects of language processing
• Word, lexicon: lexical analysis
– Morphology, word segmentation
• Syntax
– Sentence structure, phrase, grammar, …
• Semantics
– Meaning
– Execute commands
• Discourse analysis
– Meaning of a text
– Relationship between sentences (e.g. anaphora)
Applications
• Detect new words
• Language learning
• Machine translation
• NL interface
• Information retrieval
• …
Brief history
• 1950s
– Early MT: word translation + re-ordering
– Chomsky's generative grammar
– Bar-Hillel's argument
• 1960-80s
– Applications
• BASEBALL: use NL interface to search in a database on baseball games
• LUNAR: NL interface to search in a database of lunar rock samples
• ELIZA: simulation of conversation with a psychoanalyst
• SHRDLU: use NL to manipulate a blocks world
• Message understanding: understand a newspaper article on terrorism
• Machine translation
– Methods
• ATN (augmented transition networks): extended context-free grammar
• Case grammar (agent, object, etc.)
• DCG – Definite Clause Grammar
• Dependency grammar: an element depends on another
• 1990s-now
– Statistical methods
– Speech recognition
– MT systems
– Question-answering
– …
Classical symbolic methods
• Morphological analyzer
• Parser (syntactic analysis)
• Semantic analysis (transform into a logical form, semantic
network, etc.)
• Discourse analysis
• Pragmatic analysis
Morphological analysis
• Goal: recognize the word and category
Semantic analysis
Example: "John eats an apple."
proper_noun [person: john]  +  v (λYλX eat(X,Y))  +  det  +  noun [apple]
→ np [person: john],  vp eat(X, [apple])  (from np [apple])
→ s: eat([person: john], [apple])
[Alongside: an ontology of semantic categories: object → animated (person, animal →
vertebrate, …) and non-animated (food → fruit, …).]
Parsing & semantic analysis
• Rules: syntactic rules or semantic rules
– What component can be combined with what
component?
– What is the result of the combination?
• Categories
– Syntactic categories: Verb, Noun, …
– Semantic categories: Person, Fruit, Apple, …
• Analyses
– Recognize the category of an element
– See how different elements can be combined into a
sentence
– Problem: The choice is often not unique
Write a semantic analysis grammar
S(pred(obj)) -> NP(obj) VP(pred)
VP(pred(obj)) -> Verb(pred) NP(obj)
NP(obj) -> Name(obj)
Name(John) -> John
Name(Mary) -> Mary
Verb(λyλx Loves(x,y)) -> loves
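A minimal sketch of how these rules compose meanings, using Python functions for the semantic attachments (the grammar above applied to "John loves Mary"):

    # A minimal sketch of semantic composition for the grammar above,
    # parsing "John loves Mary" bottom-up.
    names = {"John": "John", "Mary": "Mary"}                # Name(obj) -> word
    verbs = {"loves": lambda y: lambda x: f"Loves({x},{y})"}  # Verb(pred)

    def parse(sentence):
        subj, verb, obj = sentence.split()
        np_subj = names[subj]      # NP(obj) -> Name(obj)
        np_obj = names[obj]
        vp = verbs[verb](np_obj)   # VP(pred(obj)) -> Verb(pred) NP(obj)
        return vp(np_subj)         # S(pred(obj)) -> NP(obj) VP(pred)

    print(parse("John loves Mary"))  # Loves(John,Mary)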
Discourse analysis
• Anaphora
He hits the car with a stone. It bounces back.
• Understanding a text
– Who/when/where/what … are involved in an event?
– How to connect the semantic representations of different
sentences?
– What is the cause of an event and what is the consequence of
an action?
–…
Pragmatic analysis
• Practical usage of language: what a sentence means in
practice
– Do you have time?
– How do you do?
– It is too cold to go outside!
–…
Problems
• Ambiguity
– Lexical/morphological: change (V,N), training (V,N), even (ADJ,
ADV) …
– Syntactic: Helicopter powered by human flies
– Semantic: He saw a man on the hill with a telescope.
– Discourse: anaphora, …
• Classical solution
– Using a later analysis to solve ambiguity of an earlier step
– E.g., "He gives him the change."
(change as verb does not work for parsing)
"He changes the place."
(change as noun does not work for parsing)
– However: "He saw a man on the hill with a telescope." has
• multiple correct parsings
• multiple correct semantic interpretations → semantic ambiguity
• Use contextual information to disambiguate (does a sentence in the text
mention that "He" holds a telescope?)
Statistical analysis to help solve ambiguity
• Choose the most likely solution
– Context varies greatly (preceding word, following word, category of the preceding word, …)
• General approach: estimate the probabilities of the observed elements from a
training corpus, then use them to compute P(s) for a sentence s.
Prob. of a sequence of words:
P(s) = P(w1, w2, …, wn)
     = P(w1) P(w2|w1) … P(wn|w1,…,wn-1)
     = ∏_{i=1}^{n} P(wi|hi)
– Uni-gram: P(s) = ∏_{i=1}^{n} P(wi)
– Bi-gram:  P(s) = ∏_{i=1}^{n} P(wi|wi-1)
– Tri-gram: P(s) = ∏_{i=1}^{n} P(wi|wi-2 wi-1)
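A minimal sketch of a bi-gram model estimated by MLE from a toy corpus (hypothetical sentences; "#" marks the sentence start):

    # A minimal bi-gram language model sketch (MLE) on a toy corpus.
    from collections import Counter

    corpus = ["# I talk", "# I talk", "# we talk", "# he talks"]
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        words = sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def p_bigram(sentence):
        words = ["#"] + sentence.split()
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            p *= bigrams[(w1, w2)] / unigrams[w1]  # P(w2 | w1)
        return p

    print(p_bigram("I talk"))   # (2/4) * (2/2) = 0.5
    print(p_bigram("I talks"))  # 0.0: bi-gram (I talks) never observed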
A simple example
(corpus = 10,000 words, 10,000 bi-grams)

wi (freq)   P(wi)               wi-1 (freq)   (wi-1 wi) (freq)    P(wi|wi-1)
I (10)      10/10,000 = 0.001   # (1000)      (# I) (8)           8/1000 = 0.008
                                that (10)     (that I) (2)        0.2
talk (8)    0.0008              I (10)        (I talk) (2)        0.2
                                we (10)       (we talk) (1)       0.1
talks (8)   0.0008              he (5)        (he talks) (2)      0.4
                                she (5)       (she talks) (2)     0.4
she (5)     0.0005              says (4)      (she says) (2)      0.5
                                laughs (2)    (she laughs) (1)    0.5
                                listens (2)   (she listens) (2)   1.0

Uni-gram: P(I talk) = P(I) · P(talk) = 0.001 × 0.0008
          P(I talks) = P(I) · P(talks) = 0.001 × 0.0008
Bi-gram:  P(I talk) = P(I|#) · P(talk|I) = 0.008 × 0.2
          P(I talks) = P(I|#) · P(talks|I) = 0.008 × 0
Estimation
• History:    short ←→ long
  modeling:   coarse ←→ refined
  estimation: easy ←→ difficult
• Maximum likelihood estimation (MLE):
P(wi) = #(wi) / |C_uni|,    P(hi wi) = #(hi wi) / |C_n-gram|
– If (hi wi) is not observed in the training corpus, P(wi|hi) = 0
– P(they talk) = P(they|#) P(talk|they) = 0
• never observed (they talk) in training data
– → smoothing
Smoothing
[Plot: frequency of each word before and after smoothing.]
Smoothing methods
(a denotes an n-gram)
• Change the freq. of occurrences
– Laplace smoothing (add-one):
P_add-one(a|C) = (|a| + 1) / Σ_{ai ∈ V} (|ai| + 1)
– Good-Turing:
change the freq. r to r* = (r + 1) n_{r+1} / n_r
nr = no. of n-grams of freq. r
Smoothing (cont.)
– Interpolation (Jelinek-Mercer):
P_JM(wi|wi-1) = λ_{wi-1} P_ML(wi|wi-1) + (1 - λ_{wi-1}) P_JM(wi)
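For instance, a minimal sketch of add-one smoothing on bi-gram counts (toy counts and vocabulary, all hypothetical):

    # A minimal add-one (Laplace) smoothing sketch for bi-gram counts.
    V = ["I", "talk", "talks", "they"]       # toy vocabulary
    bigram_count = {("I", "talk"): 2}        # all other bi-grams unseen
    unigram_count = {"I": 10, "they": 3, "talk": 8, "talks": 8}

    def p_add_one(w2, w1):
        # (count(w1 w2) + 1) / (count(w1) + |V|)
        return (bigram_count.get((w1, w2), 0) + 1) / (unigram_count[w1] + len(V))

    print(p_add_one("talk", "I"))     # (2+1)/(10+4)
    print(p_add_one("talk", "they"))  # unseen bi-gram, but no longer zero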
Examples of utilization
• Predict the next word
– argmax_w P(w | previous words)
• Used in input (predict the next letter/word on cellphone)
• Use in machine aided human translation
– Source sentence
– Already translated part
– Predict the next translation word or phrase
argmax_w P(w | previous trans. words, source sent.)
Quality of a statistical language model
• Test a trained model on a test collection
– Try to predict each word
– The more precisely a model can predict the words,
the better the model
• Perplexity (the lower, the better)
– Given P(wi) and a test text of length N:
Perplexity = 2^( -(1/N) Σ_{i=1}^{N} log2 P(wi) )
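A minimal sketch of this computation, assuming per-word probabilities produced by some trained model:

    # A minimal perplexity sketch; the per-word probabilities would come
    # from a trained language model (hypothetical values here).
    import math

    def perplexity(word_probs):
        # 2 ** (-(1/N) * sum(log2 P(wi)))
        n = len(word_probs)
        return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

    print(perplexity([0.1, 0.25, 0.5]))  # higher (worse) ...
    print(perplexity([0.5, 0.5, 0.5]))   # ... than a more confident model: 2.0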
• Statistical tagging
– Training corpus = words + tags (n, v)
– Probabilities: P(word|tag), P(tag2|tag1)
– Utilization: sentence → sequence of tags
Example of utilization
• Speech recognition (simplified)
argmax_{w1,…,wn} P(w1,…,wn | s1,…,sn)
= argmax_{w1,…,wn} P(s1,…,sn | w1,…,wn) · P(w1,…,wn)
= argmax_{w1,…,wn} ∏ P(si | w1,…,wn) · P(wi | wi-1)
= argmax_{w1,…,wn} ∏ P(si | wi) · P(wi | wi-1)
– argmax → Viterbi search
– probabilities:
• P(signal|word),
P(*** | ice-cream)=P(*** | I scream)=0.8;
• P(word2 | word1)
P(ice-cream | eat) > P(I scream | eat)
– Input speech signals s1, s2, …, sn
• I eat ice-cream. > I eat I scream.
Example of utilization
• Statistical tagging
– Training corpus = words + tags (e.g. Penn Treebank)
– For w1, …, wn:
argmax_{tag1,…,tagn} ∏ P(wi|tagi) · P(tagi|tagi-1)
– probabilities:
• P(word|tag)
P(change|noun)=0.01, P(change|verb)=0.015;
• P(tag2|tag1)
P(noun|det) >> P(verb|det)
– Input words: w1, …, wn
• I give him the change.
pronoun verb pronoun det noun >
pronoun verb pronoun det verb
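A minimal Viterbi sketch for this tagging model (all probability tables are hypothetical; unseen pairs get a tiny floor probability):

    # A minimal Viterbi sketch for HMM tagging (hypothetical tables).
    def viterbi(words, tags, p_wt, p_tt, floor=1e-6):
        pw = lambda w, t: p_wt.get((w, t), floor)   # P(word | tag)
        pt = lambda a, b: p_tt.get((a, b), floor)   # P(tag2 | tag1)
        # best[t] = (prob, sequence) of the best path ending in tag t
        best = {t: (pt("#", t) * pw(words[0], t), [t]) for t in tags}
        for w in words[1:]:
            best = {t: max(((p * pt(prev, t) * pw(w, t), seq + [t])
                            for prev, (p, seq) in best.items()),
                           key=lambda x: x[0])
                    for t in tags}
        return max(best.values(), key=lambda x: x[0])[1]

    tags = ["pronoun", "verb", "det", "noun"]
    p_wt = {("I", "pronoun"): 0.1, ("give", "verb"): 0.01,
            ("him", "pronoun"): 0.05, ("the", "det"): 0.5,
            ("change", "noun"): 0.01, ("change", "verb"): 0.015}
    p_tt = {("#", "pronoun"): 0.3, ("pronoun", "verb"): 0.4,
            ("verb", "pronoun"): 0.2, ("pronoun", "det"): 0.3,
            ("det", "noun"): 0.5, ("det", "verb"): 0.01}
    print(viterbi("I give him the change".split(), tags, p_wt, p_tt))
    # ['pronoun', 'verb', 'pronoun', 'det', 'noun']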
Some improvements of the model
• Class model
– Instead of estimating P(w2|w1), estimate P(w2|Class1)
– P(me|take) vs. P(me|Verb)
– More general model
– Less data sparseness problem
• Skip model
– Instead of P(wi|wi-1), allow P(wi|wi-k)
– Allows longer-range dependencies to be considered
State of the art on POS-tagging
• POS = Part of speech (syntactic category)
• Statistical methods
• Training based on annotated corpus (text with tags
annotated manually)
– Penn Treebank: a set of texts with manual annotations
http://www.cis.upenn.edu/~treebank/
Statistical machine translation
argmax_F P(F|E) = argmax_F P(E|F) P(F) / P(E)
= argmax_F P(E|F) P(F)
“Convolutional neural networks (CNN, ConvNet) are a class of deep, feed-forward (not recurrent)
artificial neural networks that are applied to analyzing visual imagery.”
Use: Images are high-dimensional vectors, so it would take a huge number of parameters to characterize the network. To address
this problem, bionic convolutional neural networks were proposed to reduce the number of parameters and adapt the network
architecture specifically to vision tasks. Convolutional neural networks are usually composed of a set of layers that can be
grouped by their functionalities.
Introduction to Convolutional Neural Networks
Convolutional neural networks (CNNs) are the current state-of-the-art model architecture for image classification tasks. CNNs apply a
series of filters to the raw pixel data of an image to extract and learn higher-level features, which the model can then use for
classification. CNNs contain three components:
• Convolutional layers, which apply a specified number of convolution filters to the image. For each subregion, the layer performs
a set of mathematical operations to produce a single value in the output feature map. Convolutional layers then typically apply a
ReLU activation function to the output to introduce nonlinearities into the model.
• Pooling layers, which downsample the image data extracted by the convolutional layers to reduce the dimensionality of the
feature map in order to decrease processing time. A commonly used pooling algorithm is max pooling, which extracts subregions
of the feature map (e.g., 2x2-pixel tiles), keeps their maximum value, and discards all other values.
• Dense (fully connected) layers, which perform classification on the features extracted by the convolutional layers and
downsampled by the pooling layers. In a dense layer, every node in the layer is connected to every node in the preceding layer.
Typically, a CNN is composed of a stack of convolutional modules that perform feature extraction. Each module consists of a
convolutional layer followed by a pooling layer. The last convolutional module is followed by one or more dense layers that perform
classification. The final dense layer in a CNN contains a single node for each target class in the model (all the possible classes the
model may predict), with a softmax activation function to generate a value between 0 and 1 for each node (the sum of all these softmax
values is equal to 1). We can interpret the softmax values for a given image as relative measurements of how likely it is that the image
falls into each target class.
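To make the softmax point concrete, a minimal sketch over hypothetical class scores:

    # A minimal softmax sketch over hypothetical class scores (logits).
    import math

    def softmax(logits):
        exps = [math.exp(z) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]  # values in (0, 1), summing to 1

    probs = softmax([2.0, 1.0, 0.1])  # e.g., scores for 3 target classes
    print(probs, sum(probs))          # ~[0.659, 0.242, 0.099], sum == 1.0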
Convolutional neural networks (CNN)
Convolution Layer
• TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks.
• It is used for both research and production at Google.
• It is a standard expectation in the industry to have experience in TensorFlow to work in machine
learning.
• TensorFlow was developed by the Google Brain team for internal Google use.
• It was released under the Apache 2.0 open-source license on November 9, 2015.
• TensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on February
11, 2017.
TensorFlow
• While the reference implementation runs on single devices, TensorFlow can run on multiple
CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose computing on
graphics processing units).
• TensorFlow is available on 64-bit Linux, macOS, Windows, and mobile computing platforms
including Android and iOS.
• Its flexible architecture allows for the easy deployment of computation across a variety of
platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge
devices.
• TensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow
derives from the operations that such neural networks perform on multidimensional data arrays,
which are referred to as tensors. During the Google I/O Conference in June 2016, Jeff Dean
stated that 1,500 repositories on GitHub mentioned TensorFlow, of which only 5 were from
Google.
OpenCV
• OpenCV (Open source computer vision) is a library of programming functions mainly aimed at
real-time computer vision.
• Originally developed by Intel, it was later supported by Willow Garage then Itseez (which was
later acquired by Intel).
• The library is cross-platform and free for use under the open-source BSD license.
• OpenCV supports the deep learning frameworks TensorFlow, Torch/PyTorch and Caffe.
• OpenCV is written in C++ and its primary interface is in C++, but it still retains a less
comprehensive though extensive older C interface. There are bindings in Python, Java and
MATLAB/Octave.
• The API for these interfaces can be found in the online documentation. Wrappers in other
languages such as C#, Perl, Ch, Haskell, and Ruby have been developed to encourage adoption
by a wider audience.
• Since version 3.4, OpenCV.js is a JavaScript binding for a selected subset of OpenCV functions
for the web platform.
• All of the new developments and algorithms in OpenCV are now developed in the C++
interface.
Applications of OpenCV
[Image: openFrameworks running the OpenCV add-on example.]
• 2D and 3D feature toolkits
• Egomotion estimation
• Facial recognition system
• Gesture recognition
• Human–computer interaction (HCI)
• Mobile robotics
• Motion understanding
• Object identification
• Segmentation and recognition
• Stereopsis (stereo vision): depth perception from 2 cameras
• Structure from motion (SFM)
• Motion tracking
• Augmented reality
To support some of the above areas, OpenCV includes a statistical machine learning library that contains:
• Boosting
• Decision tree learning
• Gradient boosting trees
• Expectation-maximization algorithm
• k-nearest neighbor algorithm
• Naive Bayes classifier
• Artificial neural networks
• Random forest
• Support vector machine (SVM)
• Deep neural networks (DNN)
Working with Keras Models
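Since the accompanying figures are not reproduced here, a minimal sketch of building and compiling a small CNN with the Keras Sequential API (hypothetical layer sizes, 28x28 grayscale inputs, 10 classes):

    # A minimal Keras sketch: a small CNN for 28x28 grayscale images
    # (hypothetical layer sizes and 10 target classes).
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolution + ReLU
        layers.MaxPooling2D(pool_size=2),                     # downsample
        layers.Conv2D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),                 # dense layer
        layers.Dense(10, activation="softmax"),               # one node per class
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    # Training would then be: model.fit(x_train, y_train, epochs=5)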
Pretrained Models
• AlexNet
• VGGNet
Pretrained Models Progress
LeNet (1995)
• 2 convolution + pooling layers
• 2 hidden dense layers
AlexNet
• Bigger and deeper LeNet
• ReLU, Dropout, preprocessing
VGG
• Bigger and deeper AlexNet (repeated VGG blocks)
A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer
neural network, designed to recognize visual patterns directly from pixel images with
minimal pre-processing. The ImageNet project is a large visual database designed for use
in visual object recognition software research. The ImageNet project runs an annual
software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC),
where software programs compete to correctly classify and detect objects and scenes.
AlexNet (2012)
• In 2012, AlexNet significantly outperformed all the prior competitors and won the challenge by
reducing the top-5 error from 26% to 15.3%. The second place top-5 error rate, which was not a
CNN variation, was around 26.2%.
• The network had a very similar architecture to LeNet by Yann LeCun et al., but was deeper, with
more filters per layer, and with stacked convolutional layers.
• It consisted of 11x11, 5x5, and 3x3 convolutions, max pooling, dropout, data augmentation, ReLU
activations, SGD with momentum.
• It attached ReLU activations after every convolutional and fully-connected layer. AlexNet was
trained for 6 days simultaneously on two Nvidia GeForce GTX 580 GPUs, which is why the
network is split into two pipelines.
• AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey
Hinton, and Ilya Sutskever.
AlexNet Architecture
• VGG (from Oxford's Visual Geometry Group, https://arxiv.org/abs/1409.1556).
• It was introduced in 2014, when it became a runner-up in the ImageNet challenge of that year.
• The VGG family of networks remains popular today and is often used as a benchmark against newer
architectures.
• Prior to VGG and AlexNet, the initial convolutional layers of a network used filters with large receptive
fields, such as 7 x 7.
• Additionally, the networks usually had alternating single convolutional and pooling layers.
• The authors of the paper observed that a convolutional layer with a large filter size can be replaced with a
stack of two or more convolutional layers with smaller filters (factorized convolution).
VGG Net
• For example, we can replace one 5 x 5 layer with a stack of two 3 x 3 layers, or
a 7 x 7 layer with a stack of three 3 x 3 layers.
• The neurons of the last of the stacked layers have the equivalent receptive field
size of a single layer with a large filter.
• The number of weights and operations of stacked layers is smaller, compared to
a single layer with large filter size.
• Let's assume we want to replace one 5 x 5 layer with two 3 x 3 layers. Let's also
assume that all layers have an equal number of input and output channels
(slices), M. The total number of weights (excluding biases) of the 5 x 5 layer is
5x5xMxM = 25M². The two stacked 3 x 3 layers have 2x(3x3xMxM) = 18M²
weights, i.e., 28% fewer, while covering the same receptive field.
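A quick sketch of this parameter arithmetic (weights only, biases excluded; the channel count M is hypothetical):

    # A quick sketch of the parameter arithmetic for factorized convolutions
    # (weights only, biases excluded; M input and M output channels).
    def conv_weights(kernel, channels_in, channels_out):
        return kernel * kernel * channels_in * channels_out

    M = 64  # hypothetical channel count
    single_5x5 = conv_weights(5, M, M)       # 25 * M^2
    stacked_3x3 = 2 * conv_weights(3, M, M)  # 18 * M^2
    print(single_5x5, stacked_3x3)           # 102400 73728
    print(1 - stacked_3x3 / single_5x5)      # 0.28 fewer weights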