Version 3.1
Budapest, 2010
© 2010 by the CELLULAR SENSORY WAVE COMPUTERS LABORATORY, HUNGARIAN ACADEMY OF SCIENCES
(MTA SZTAKI), and the JEDLIK LABORATORIES OF THE PÁZMÁNY UNIVERSITY, BUDAPEST
EDITED BY K. KARACS, GY. CSEREY, Á. ZARÁNDY, P. SZOLGAY, CS. REKECZKY, L. KÉK, V. SZABÓ, G. PAZIENZA
AND T. ROSKA
BUDAPEST, HUNGARY
TABLE OF CONTENTS
INTRODUCTION .............................................................................................................................. 1
1. TEMPLATES/INSTRUCTIONS............................................................................................... 3
1.1. BASIC IMAGE PROCESSING............................................................................................. 4
GradientIntensityEstimation ........................................................................................... 4
Estimation of the gradient intensity in a local neighborhood
Smoothing ....................................................................................................................... 5
Smoothing with binary output
DiagonalHoleDetection .................................................................................................. 7
Detects the number of diagonal holes from each diagonal line
HorizontalHoleDetection................................................................................................ 8
Horizontal connected component detector
VerticalHoleDetection .................................................................................................... 9
Detects the number of vertical holes from each vertical column
MaskedCCD.................................................................................................................. 10
Masked connected component detector
CenterPointDetector ..................................................................................................... 11
Center point detection
ConcentricContourDetector ......................................................................................... 13
Concentric contour detection (DTCNN)
GlobalConnectivityDetection ....................................................................................... 14
Deletes marked objects
GlobalConnectivityDetection1 ..................................................................... 16
Detects the one-pixel thick closed curves and deletes the open curves from a
binary image
ContourExtraction ........................................................................................................ 17
Grayscale contour detector
CornerDetection ........................................................................................................... 18
Convex corner detector
DiagonalLineRemover .................................................................................................. 20
Deletes one pixel wide diagonal lines
VerticalLineRemover .................................................................................................... 21
Deletes vertical lines
ThinLineRemover.......................................................................................................... 22
Removes thin (one-pixel thick) lines from a binary image
ApproxDiagonalLineDetector ...................................................................................... 23
Detects approximately diagonal lines
DiagonalLineDetector .................................................................................................. 24
Diagonal-line-detector template
GrayscaleDiagonalLineDetector .................................................................................. 25
Grayscale diagonal line detector
RotationDetector .......................................................................................................... 26
Detects the rotation of compact objects in a binary image, having only
horizontal and vertical edges; removes all inclined objects or objects having
at least one inclined edge
HeatDiffusion ............................................................................................................... 27
Heat-diffusion
EdgeDetection .............................................................................................................. 28
Binary edge detection template
OptimalEdgeDetector ................................................................................................... 30
Optimal edge detector template
MaskedObjectExtractor................................................................................................ 31
Masked erase
GradientDetection ........................................................................................................ 32
Locations where the gradient of the field is smaller than a given threshold
value
PointExtraction ............................................................................................................ 33
Extracts isolated black pixels
PointRemoval ............................................................................................................... 34
Deletes isolated black pixels
SelectedObjectsExtraction............................................................................................ 35
Extracts marked objects
FilledContourExtraction .............................................................................................. 36
Finds solid black framed areas
ThresholdedGradient.................................................................................................... 37
Finds the locations where the gradient of the field is higher than a given
threshold value
3x3Halftoning ............................................................................................................... 38
3x3 image halftoning
5x5Halftoning1 ............................................................................................................. 40
5x5 image halftoning
5x5Halftoning2 ............................................................................................................. 42
5x5 image halftoning
Hole-Filling .................................................................................................................. 44
Hole-Filling
ObjectIncreasing .......................................................................................................... 45
Increases the object by one pixel (DTCNN)
3x3InverseHalftoning ................................................................................................... 46
Inverts the halftoned image by a 3x3 template
5x5InverseHalftoning ................................................................................................... 48
Inverts the halftoned image by a 5x5 template
LocalSouthernElementDetector ................................................................................... 50
Local southern element detector
PatternMatchingFinder ................................................................................................ 51
Finds matching patterns
LocalMaximaDetector .................................................................................................. 52
Local maxima detector template
MedianFilter ................................................................................................................. 53
Removes impulse noise from a grayscale image
LeftPeeler...................................................................................................................... 55
Peels one pixel from the left
RightEdgeDetection ...................................................................................................... 56
Extracts right edges of objects
MaskedShadow ............................................................................................................. 57
Masked shadow
ShadowProjection ......................................................................................................... 58
Projects onto the left the shadow of all objects illuminated from the right
VerticalShadow ............................................................................................................. 59
Vertical shadow template
DirectedGrowingShadow ............................................................................................. 60
Generate growing shadows starting from black points
Threshold ...................................................................................................................... 61
Grayscale to binary threshold template
LE3pixelLineDetector .................................................................................................. 79
Lines-not-longer-than-3-pixels-detector template
PixelSearch ................................................................................................................... 80
Pixel search in a given range
LogicANDOperation .................................................................................................... 81
Logic "AND" operation
LogicDifference1 .......................................................................................................... 82
Logic Difference and Relative Set Complement (P2 \ P1 = P2 - P1) Template
LogicNOTOperation..................................................................................................... 83
Logic "NOT" and Set Complementation (P → Pᶜ) template
LogicOROperation ....................................................................................................... 84
Logic "OR" and Set Union (Disjunction) template
LogicORwithNOT ......................................................................................................... 85
Logic "OR" function of the initial state and the logic "NOT" of the input
PatchMaker .................................................................................................................. 86
Patch maker template
SmallObjectRemover .................................................................................................... 87
Deletes small objects
BipolarWave ................................................................................................................. 88
Generates black and white waves
4.1. TACTILE SENSOR MODELING BY USING EMULATED DIGITAL CNN-UM ........ 265
5. SIMULATORS........................................................................................................................ 285
5.1. MATCNN SIMULATOR .................................................................................................. 287
Linear templates specification.................................................................................... 287
Nonlinear function specification in ............................................................................ 288
Running a CNN Simulation ........................................................................................ 290
Sample CNN Simulation with a Linear Template ...................................................... 292
Sample CNN Simulation with a Nonlinear ................................................................. 292
Sample CNN Simulation with a Nonlinear ................................................................. 294
Sample Analogic CNN Algorithm .............................................................................. 295
MATCNN simulator references .................................................................................. 297
5.2. 1D CELLULAR AUTOMATA SIMULATOR ................................................................. 299
Brief notes about 1D binary Cellular Automata ........................................................ 299
1D CA Simulator ........................................................................................................ 300
APPENDIX 1 UMF ALGORITHM DESCRIPTION ................................................................ 305
APPENDIX 2: VIRTUAL AND PHYSICAL CELLULAR MACHINES ................................ 309
TEMPLATE ROBUSTNESS ........................................................................................................ 316
REFERENCES ............................................................................................................................... 317
INDEX ............................................................................................................................................. 323
INDEX (OLD NAMES) .................................................................................................................. 327
INTRODUCTION
*
T. Roska, "Cellular wave computers for nano-tera-scale technology - beyond Boolean, spatial-temporal
logic in million processor devices", Electronics Letters, Vol. 43, No. 8, April 2007.
C. Baatar, W. Porod, T. Roska (Eds.), Cellular Nanoscale Sensory Wave Computing, Springer, 2009,
ISBN 978-1-4419-1010-3.
+
L. O. Chua and T. Roska, Cellular Neural Networks and Visual Computing: Foundations and
Applications, Cambridge University Press, 2002 (paperback: 2005).
Chapter 1. Templates/Instructions
1.1. Basic Image Processing
0 0 0 b b b
A= 0 0 0 B= b 0 b z= 0
0 0 0 b b b
I. Global Task
II. Examples
Example 1: image name: avergra1.bmp, image size: 64x64; template name: avergrad.tem .
input output
Example 2: image name: avergra2.bmp, image size: 64x64; template name: avergrad.tem.
input output
Smoothing: Smoothing with binary output
0 1 0 0 0 0
A= 1 2 1 B= 0 0 0 z= 0
0 1 0 0 0 0
I. Global Task
II. Example: image name: madonna.bmp, image size: 59x59; template name: avertrsh.tem .
input output
0 1.2 0 0 0.9 0
A1 = 1.2 1.8 1.2 A2 = 0.9 1.8 0.9
0 1.2 0 0 0.9 0
The transients of the examined cell are presented in the following figure corresponding to the
templates A, A1 and A2.
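Each entry in this chapter specifies a template as the triple (A, B, z). As a minimal sketch (our own helper, not part of the library; the forward-Euler step size, the image size, and zero-valued virtual cells are all assumptions), the standard CNN state equation ẋij = −xij + (A∗y)ij + (B∗u)ij + z can be integrated numerically. Here it is applied to the smoothing template of this entry (A with center 2 and the four nearest neighbors 1, B = 0, z = 0):

```python
import numpy as np

def f(x):
    # Standard CNN output nonlinearity: piecewise-linear saturation to [-1, +1].
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))

def correlate2d(img, kernel):
    # Tiny dependency-free 'same'-size 2D correlation; virtual cells are 0.
    r = kernel.shape[0] // 2
    p = np.pad(img, r)
    out = np.zeros_like(img, dtype=float)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def run_template(A, B, z, u, x0, steps=400, dt=0.05):
    """Forward-Euler integration of x' = -x + A*y + B*u + z; returns f(x)."""
    x = x0.astype(float).copy()
    u = u.astype(float)
    for _ in range(steps):
        y = f(x)
        x += dt * (-x + correlate2d(y, A) + correlate2d(u, B) + z)
    return f(x)

# Smoothing template (avertrsh): binary output by majority-like diffusion.
A = np.array([[0., 1., 0.], [1., 2., 1.], [0., 1., 0.]])
B = np.zeros((3, 3))
smoothed = run_template(A, B, 0.0, np.zeros((5, 5)),
                        x0=np.full((5, 5), 0.3))
```

With a mostly positive grayscale initial state, every cell saturates to +1 (black); the output is binary, as this entry's description promises.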
DiagonalHoleDetection: Detects the number of diagonal holes from each diagonal line [6]
1 0 0 0 0 0
A= 0 2 0 B= 0 0 0 z= 0
0 0 -1 0 0 0
I. Global Task
II. Example: image name: a_letter.bmp, image size: 117x121; template name: ccd_diag.tem .
input output
HorizontalHoleDetection: Detects the number of horizontal holes from each horizontal row
[6]
0 0 0 0 0 0
A= 1 2 -1 B= 0 0 0 z= 0
0 0 0 0 0 0
I. Global Task
II. Example: image name: a_letter.bmp, image size: 117x121; template name: ccd_hor.tem .
input output
VerticalHoleDetection: Detects the number of vertical holes from each vertical column [6]
0 1 0 0 0 0
A= 0 2 0 B= 0 0 0 z= 0
0 -1 0 0 0 0
I. Global Task
II. Example: image name: a_letter.bmp, image size: 117x121; template name: ccd_vert.tem .
input output
Left-to-right
0 0 0 0 0 0
A= 1 2 -1 B= 0 -3 0 z= -3
0 0 0 0 0 0
I. Global Task
II. Example: Right-to-left shifting. Image names: ccdmsk1.bmp, ccdmsk2.bmp; image size:
40x20; template name: ccdmaskr.tem .
1 0 0 0 0 0
A1 = 1 4 -1 B1 = 0 0 0 z1 = -1
1 0 0 0 0 0
1 1 1 0 0 0
A2 = 1 6 0 B2 = 0 0 0 z2 = -1
1 0 -1 0 0 0
1 1 1 0 0 0
A3 = 0 4 0 B3 = 0 0 0 z3 = -1
0 -1 0 0 0 0
1 1 1 0 0 0
A4 = 0 6 1 B4 = 0 0 0 z4 = -1
-1 0 1 0 0 0
...
1 0 -1 0 0 0
A8 = 1 6 0 B8 = 0 0 0 z8 = -1
1 1 1 0 0 0
I. Global Task
CENTER1:
0 0 0 1 0 0
A1 = 0 1 0 B1 = 1 4 -1 z1 = -1
0 0 0 1 0 0
CENTER2:
0 0 0 1 1 1
A2 = 0 1 0 B2 = 1 6 0 z2 = -1
0 0 0 1 0 -1
CENTER3:
0 0 0 1 1 1
A3 = 0 1 0 B3 = 0 4 0 z3 = -1
0 0 0 0 -1 0
CENTER4:
0 0 0 1 1 1
A4 = 0 1 0 B4 = 0 6 1 z4 = -1
0 0 0 -1 0 1
...
CENTER8:
0 0 0 1 0 -1
A8 = 0 1 0 B8 = 1 6 0 z8 = -1
0 0 0 1 1 1
The robustness values of the templates CENTER1 and CENTER2 are σ(CENTER1) = 0.22 and
σ(CENTER2) = 0.15, respectively. The other templates are rotated versions of CENTER1 and
CENTER2; their robustness values are therefore the same.
II. Example: image name: chineese.bmp, image size: 16x16; template name: center.tem .
input output
0 -1 0 0 0 0
A= -1 3.5 -1 B= 0 4 0 z= -4
0 -1 0 0 0 0
I. Global Task
II. Examples
Example 1: image name: conc1.bmp, image size: 16x16; template name: concont.tem .
input output
Example 2: image name: conc2.bmp, image size: 100x100; template name: concont.tem .
0 0.5 0 0 -0.5 0
A= 0.5 3 0.5 B= -0.5 3 -0.5 z= -4.5
0 0.5 0 0 -0.5 0
I. Global Task
Given: two static binary images P1 (mask) and P2 (marker). The mask contains
some black objects against the white background. The marker contains
the same objects, except for some objects being marked. An object is
considered to be marked, if some of its black pixels are changed into
white.
Input: U(t) = P1
Initial State: X(0) = P2
Boundary Conditions: Fixed type, uij = -1, yij = -1 for all virtual cells, denoted by
[U]=[Y]=[-1]
Output: Y(t) → Y(∞) = Binary image containing the unmarked objects only.
Remark:
The template determines whether a given geometric pattern is "globally" connected in one
contiguous piece or is composed of two or more disconnected components.
II. Example: image names: connect1.bmp, connect2.bmp; image size: 500x200; template name:
connecti.tem .
t=125 t=250
t=375 t=500
t=625 t=750
t=875 t=1000
t=1125 t=1250
GlobalConnectivityDetection1: Detects the one-pixel thick closed curves and deletes the open
curves from a binary image [61]
6 6 6 -3 -3 -3
A= 6 9 6 B= -3 9 -3 z= -4.5
6 6 6 -3 -3 -3
I. Global Task
Remarks: The binary image P containing closed and open curves (one-pixel thick) is applied both at
the input and loaded as initial state. If one pixel is removed from a closed curve, it becomes an open
curve and is deleted, as shown in the second image. The compact (solid) objects from the image are
not modified.
0 0 0 a a a
A= 0 2 0 B= a 0 a z= 0.7
0 0 0 a a a
[Plot of the nonlinear function a (characteristic value 0.5) not reproduced.]
I. Global Task
II. Example: image name: madonna.bmp, image size: 59x59; template name: contour.tem .
input output
0 0 0 -1 -1 -1
A= 0 1 0 B= -1 4 -1 z= -5
0 0 0 -1 -1 -1
I. Global Task
II. Example: image name: chineese.bmp, image size: 16x16; template name: corner.tem .
input output
CORNCH_1
0 0 0 -1 -1 -1
A= 0 1 0 B= -1 3.9 -1 z= -5
0 0 0 -1 -1 -1
CORNCH_2
0 0 0 -1 -1 -1
A= 0 0.5 0 B= -1 3.6 -1 z= -5
0 0 0 -1 -1 -1
The transients of the examined cell are presented in the following figure corresponding to the
templates CornerDetector, CORNCH_1 and CORNCH_2.
0 0 0 -1 0 -1
A= 0 1 0 B= 0 1 0 z= -4
0 0 0 -1 0 -1
I. Global Task
II. Example: image name: deldiag1.bmp, image size: 21x21; template name: deldiag1.tem .
input output
0 0 0 0 -1 0
A= 0 1 0 B= 0 1 0 z= -2
0 0 0 0 -1 0
I. Global Task
II. Example: image name: delvert1.bmp, image size: 21x21; template name: delvert1.tem .
input output
2 2 2 0 0 0
A= 2 8 2 B= 0 0 0 z= -2
2 2 2 0 0 0
I. Global Task
0 0 0 0 0 -1 -1 -1 0.5 1
0 0 0 0 0 -1 -1 1 1 0.5
A= 0 0 2 0 0 B= -1 1 5 1 -1 z= -13
0 0 0 0 0 0.5 1 1 -1 -1
0 0 0 0 0 1 0.5 -1 -1 -1
I. Global Task
II. Example: image name: diag.bmp, image size: 246x191; template name: diag.tem .
input output
0 0 0 -1 0 1
A= 0 1 0 B= 0 1 0 z= -4
0 0 0 1 0 -1
I. Global Task
II. Example: image name: diag1liu.bmp, image size: 21x21; template name: diag1liu.tem .
input output
0 0 0 0 0 b b a a a
0 0 0 0 0 b b b a a
A= 0 0 1 0 0 B= a b 0 b a z= -1.8
0 0 0 0 0 a a b b b
0 0 0 0 0 a a a b b
[Plots of the nonlinear functions a and b (characteristic values ±0.5) not reproduced.]
I. Global Task
II. Example: image name: diaggray.bmp, image size: 61x61; template name: diaggray.tem .
input output
RotationDetector: Detects the rotation of compact objects in a binary image, having only
horizontal and vertical edges; removes all inclined objects or objects having
at least one inclined edge [61]
I. Global Task
HeatDiffusion: Heat-diffusion
I. Global Task
II. Example: image name: diffus.bmp, image size: 106x106; template name: diffus.tem .
input output
EdgeDetection: Binary edge detection template
0 0 0 -1 -1 -1
A= 0 1 0 B= -1 8 -1 z= -1
0 0 0 -1 -1 -1
I. Global Task
Given: static binary image P
Input: U(t) = P
Initial State: X(0) = Arbitrary (in the examples we choose xij(0)=0)
Boundary Conditions: Fixed type, uij = 0 for all virtual cells, denoted by [U]=0
Output: Y(t) → Y(∞) = Binary image showing all edges of P in black
Template robustness: σ = 0.12.
Remark:
Black pixels having at least one white neighbor compose the edge of the object.
II. Examples
Example 1: image name: logic05.bmp, image size: 44x44; template name: edge.tem .
input output
Example 2: image name: michelan.bmp, image size: 627x253; template name: edge.tem .
input
output
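The remark above is a one-line pixel rule, and it admits a direct array formulation. A sketch in the library's ±1 coding (black = +1; virtual cells are treated as white here, which classifies border pixels the same way as the template's uij = 0 boundary):

```python
import numpy as np

def binary_edges(img):
    """Keep the black pixels (+1) that have at least one white (-1)
    8-neighbor; every other pixel becomes white."""
    h, w = img.shape
    p = np.pad(img, 1, constant_values=-1)   # virtual cells as white
    has_white = np.zeros((h, w), dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di or dj:
                has_white |= p[1 + di:1 + di + h, 1 + dj:1 + dj + w] == -1
    return np.where((img == 1) & has_white, 1, -1)
```

For a solid 3x3 black square, the interior pixel is removed and the eight boundary pixels survive, matching the template's behavior in the examples.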
0 0 0 -0.11 0 0.11
A= 0 0 0 B= -0.28 0 0.28 z= 0
0 0 0 -0.11 0 0.11
I. Global Task
II. Example: image name: bird.bmp, image size: 256x256; template name: optimedge.tem .
input output
I. Global Task
II. Example: Left-to-right erase. Image names: ccdmsk3.bmp, ccdmsk2.bmp; image size: 40x20;
template name: erasmask.tem .
GradientDetection: Finds the locations where the gradient of the field is smaller than a given
threshold value [9]
0 0 0 a a a
A= 0 1 0 B= a 0 a z= z*
0 0 0 a a a
where z* is a given threshold value, and a is defined by the following nonlinear function:
[Plot of the nonlinear function a(vuij − vukl) not reproduced; characteristic values ±0.2, −1, −2.]
I. Global Task
II. Example: image name: circles.bmp, image size: 60x60; template name: extreme.tem .
Threshold value z* = 3.9 .
input output
0 0 0 -1 -1 -1
A= 0 1 0 B= -1 1 -1 z= -8
0 0 0 -1 -1 -1
I. Global Task
II. Example: image name: figdel.bmp, image size: 20x20; template name: figdel.tem .
input output
0 0 0 1 1 1
A= 0 1 0 B= 1 8 1 z= -1
0 0 0 1 1 1
I. Global Task
II. Example: image name: figdel.bmp, image size: 20x20; template name: figextr.tem .
input output
SelectedObjectsExtraction: Extracts marked objects
I. Global Task
Given: two static binary images P1 (mask) and P2 (marker). P2 contains just a
part of P1 (P2 ⊆ P1).
Input: U(t) = P1
Initial State: X(0) = P2
Boundary Conditions: Fixed type, yij = 0 for all virtual cells, denoted by [Y]=0
Output: Y(t) → Y(∞) = Binary image representing those objects of P1 which are
marked by P2.
Template robustness: σ = 0.12.
II. Example: image names: figdel.bmp, figrec.bmp; image size: 20x20; template name: figrec.tem
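Functionally, this extraction is a binary reconstruction of the mask P1 from the marker P2. The CNN reaches the result through its transient; the equivalent iterative sketch below (our formulation, not the CNN dynamics; boolean arrays with True = black) grows the marker by one pixel at a time and clips it to the mask until a fixed point:

```python
import numpy as np

def extract_marked_objects(mask, marker):
    """Binary reconstruction: grow the marker (8-connectivity) inside the
    mask until stable; returns the mask objects touched by the marker."""
    cur = marker & mask
    while True:
        p = np.pad(cur, 1)
        grown = np.zeros_like(cur)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                grown |= p[1 + di:1 + di + cur.shape[0],
                           1 + dj:1 + dj + cur.shape[1]]
        nxt = grown & mask
        if np.array_equal(nxt, cur):
            return cur
        cur = nxt
```

A marker pixel inside one of two separate objects recovers that object only, which is exactly what the example with figdel.bmp / figrec.bmp illustrates.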
0 1 0 0 0 0
A= 1 5 1 B= 0 2 0 z = -5.25
0 1 0 0 0 0
I. Global Task
II. Example: image names: findare1.bmp, findare2.bmp; image size: 270x246; template name:
findarea.tem .
ThresholdedGradient: Finds the locations where the gradient of the field is higher than a given
threshold value [9]
0 0 0 b b b
A= 0 1 0 B= b 0 b z= z*
0 0 0 b b b
where z* is a given threshold value, and b is defined by the following nonlinear function:
[Plot of the nonlinear function b(vuij − vukl) not reproduced; characteristic values ±2.]
I. Global Task
II. Example: image name: circles.bmp, image size: 60x60; template name: gradient.tem .
Threshold value z* = -4.8 .
input output
I. Global Task
II. Examples
Example 1: image name: baboon.bmp, image size: 512x512; template name: hlf3.tem .
input output
Example 2: image name: peppers.bmp, image size: 512x512; template name: hlf3.tem .
input output
I. Global Task
II. Examples
Example 1: image name: baboon.bmp, image size: 512x512; template name: hlf5kc.tem .
input output
Example 2: image name: peppers.bmp, image size: 512x512; template name: hlf5kc.tem .
input output
-0.02 -0.07 -0.10 -0.07 -0.02 0.02 0.07 0.10 0.07 0.02
-0.07 -0.32 -0.46 -0.32 -0.07 0.07 0.32 0.46 0.32 0.07
A= -0.10 -0.46 1.05 -0.46 -0.10 B= 0.10 0.46 0.81 0.46 0.10 z= 0
-0.07 -0.32 -0.46 -0.32 -0.07 0.07 0.32 0.46 0.32 0.07
-0.02 -0.07 -0.10 -0.07 -0.02 0.02 0.07 0.10 0.07 0.02
I. Global Task
II. Examples
Example 1: image name: baboon.bmp, image size: 512x512; template name: hlf5.tem .
input output
Example 2: image name: peppers.bmp, image size: 512x512; template name: hlf5.tem .
input output
Hole-Filling: Fills the holes of objects in a binary image
0 1 0 0 0 0
A= 1 3 1 B= 0 4 0 z= -1
0 1 0 0 0 0
I. Global Task
II. Example: image name: a_letter.bmp, image size: 117x121; template name: hole.tem .
input output
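An equivalent sequential formulation of hole filling (a sketch, not the CNN dynamics): flood-fill the white background from the image border; white pixels the flood never reaches are interior holes and get blackened.

```python
import numpy as np
from collections import deque

def fill_holes(obj):
    """obj: boolean array, True = black. Flood-fill the white background
    from the border (4-connectivity); unreached white pixels are holes."""
    h, w = obj.shape
    outside = np.zeros((h, w), dtype=bool)
    dq = deque((i, j) for i in range(h) for j in range(w)
               if (i in (0, h - 1) or j in (0, w - 1)) and not obj[i, j])
    for i, j in dq:
        outside[i, j] = True
    while dq:
        i, j = dq.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and not obj[ni, nj] \
                    and not outside[ni, nj]:
                outside[ni, nj] = True
                dq.append((ni, nj))
    return obj | ~outside
```

For a hollow square, the interior 3x3 hole is filled while the background stays white.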
I. Global Task
II. Example: image name: a_letter.bmp, image size: 117x121; template name: increase.tem . One
iteration step of a DTCNN is performed.
input output
I. Global Task
II. Examples
Example 1: image name: invhlf3_1.bmp, image size: 512x512; template name: invhlf3.tem .
input output
Example 2: image name: invhlf3_2.bmp, image size: 512x512; template name: invhlf3.tem .
input output
I. Global Task
II. Examples
Example 1: image name: invhlf5_1.bmp, image size: 512x512; template name: invhlf5.tem .
input output
Example 2: image name: invhlf5_2.bmp, image size: 512x512; template name: invhlf5.tem .
input output
0 0 0 0 0 0
A= 0 1 0 B= 0 1 0 z= -3
0 0 0 -1 -1 -1
I. Global Task
II. Example: image name: lcp_lse.bmp, image size: 17x17; template name: lse.tem .
input output
PatternMatchingFinder: Finds matching patterns
0 0 0 b b b
A= 0 1 0 B= b b b z = -N+0.5
0 0 0 b b b
where
 1, if the corresponding pixel is required to be black
b =  0, if the corresponding pixel is don't care
-1, if the corresponding pixel is required to be white
N = number of pixels required to be either black or white, i.e. the number of non-zero values
in the B template
I. Global Task
Given: static binary image P possessing the 3x3 pattern prescribed by the
template
Input: U(t) = P
Initial State: X(0) = Arbitrary (in the examples we choose xij(0)=0)
Boundary Conditions: Fixed type, uij = 0 for all virtual cells, denoted by [U]=0
Output: Y(t) → Y(∞) = Binary image representing the locations of the 3x3 pattern
prescribed by the template. The pattern having a black/white pixel where the
template value is +1/-1, respectively, is detected.
II. Example: image name: match.bmp, image size: 16x16; template name: match.tem .
0 0 0 1 -1 1
A= 0 1 0 B= 0 1 0 z= -6.5
0 0 0 1 -1 1
input output
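The bias rule z = −N + 0.5 makes a cell fire exactly when all N prescribed pixels agree with the mask: each agreeing pixel contributes +1 to the correlation B∗u and each disagreeing one −1, so the sum reaches N only on a perfect match. A static sketch of the same test (our helper; ±1 image coding, virtual cells at 0, so a pattern overhanging the border never matches):

```python
import numpy as np

def match_pattern(img, B):
    """Locate 3x3 patterns: B entries are +1 (must be black), -1 (must be
    white) or 0 (don't care); img uses +1/-1. A pixel matches when the
    correlation B*u exceeds N - 0.5, i.e. all N prescribed pixels agree."""
    N = np.count_nonzero(B)
    h, w = img.shape
    p = np.pad(img, 1)            # virtual cells at 0: never a full match
    corr = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            corr += B[i, j] * p[i:i + h, j:j + w]
    return np.where(corr > N - 0.5, 1, -1)
```

With a plus-shaped mask (N = 5) and an image containing exactly one plus, only its center pixel is marked black.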
0 0 0 b b b
A= 0 3 0 B= b 0 b z= -3.5
0 0 0 b b b
where
b = 0.5 if vuij - vukl ≥ 0, and b = 0 otherwise
I. Global Task
II. Example: image name: avergra1.bmp, image size: 64x64; template name: maxloc.tem .
input output
0 0 0 d d d
A= 0 1 0 DM = d 0 d z= 0
0 0 0 d d d
I. Global Task
II. Example: image name: median.bmp, image size: 256x256; template name: median.tem .
input output
0 0 0 0 0 0
A= 0 1 0 B= 1 1 0 z= -1
0 0 0 0 0 0
I. Global Task
II. Example: image name: peelhor.bmp, image size: 12x12; template name: peelhor.tem .
input output
0 0 0 0 0 0
A= 0 1 0 B= 1 1 -1 z= -2
0 0 0 0 0 0
I. Global Task
II. Example: image name: chineese.bmp, image size: 16x16; template name: rightcon.tem .
input output
Left-to-right
0 0 0 0 0 0
A= 1.5 1.8 0 B= 0 -1.2 0 z= 0
0 0 0 0 0 0
I. Global Task
ShadowProjection: Projects onto the left the shadow of all objects illuminated from the right [6]
Old names: LeftShadow, SHADOW
0 0 0 0 0 0
A= 0 2 2 B= 0 2 0 z= 0
0 0 0 0 0 0
I. Global Task
II. Examples
Example 1: Left shadow. Image name: a_letter.bmp, image size: 117x121; template name:
shadow.tem .
input output
Example 2: Shadow in the east-western direction. Image name: a_letter.bmp, image size:
117x121.
0 0 0 0 0 0
A= 0 2 0 B= 0 2 0 z= 0
0 0 2 0 0 0
input output
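Row-wise, the left shadow of Example 1 is a cumulative OR taken from the right edge toward the left; a sketch in the ±1 coding (our helper, not the CNN transient):

```python
import numpy as np

def left_shadow(img):
    """Each row: a pixel becomes black if it or any pixel to its right is
    black (objects illuminated from the right cast shadows to the left)."""
    black = img == 1
    # Cumulative OR from the right edge toward the left, per row.
    shadow = np.logical_or.accumulate(black[:, ::-1], axis=1)[:, ::-1]
    return np.where(shadow, 1, -1)
```

A single black pixel blackens everything to its left in the same row, and rows without black pixels stay white.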
0 1 0 0 0 0
A= 0 2 0 B= 0 0 0 z= 2
0 1 0 0 0 0
I. Global Task
II. Example: image name: chineese.bmp, image size: 16x16; template name: shadsim.tem .
input output
SHADOW0:
0.4 0.3 0 0 0 0
A= 1 2 -1 B= 0 1.4 0 z= 2.5
0.4 0.3 0 0 0 0
SHADOW45:
0 0 -1 0 0 0
A= 1 2 0 B= 0 1.4 0 z= 2.5
1 1 0 0 0 0
I. Global Task
II. Example: image name: points.bmp; image size: 100x100; template names: shadow0.tem,
shadow45.tem.
Threshold: Grayscale to binary threshold template
0 0 0 0 0 0
A= 0 2 0 B= 0 0 0 z= -z* , -1< z* <1
0 0 0 0 0 0
I. Global Task
II. Example: Threshold value z* = 0.4 . Image name: madonna.bmp, image size: 59x59; template
name: treshold.tem .
input output
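With the grayscale image loaded as the initial state, each cell's dynamics x' = −x + 2f(x) − z* are bistable with the unstable equilibrium at x = z*, so every pixel converges to +1 (black) if its initial value exceeds z* and to −1 (white) otherwise. The steady state can therefore be written down directly (a sketch of the limit, not the transient):

```python
import numpy as np

def threshold(img, zstar):
    """Grayscale-to-binary threshold: the steady state of the bistable
    per-cell dynamics x' = -x + 2 f(x) - z* for -1 < z* < 1."""
    return np.where(img > zstar, 1, -1)
```

For z* = 0.4, as in the example, pixels above 0.4 go black and the rest go white.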
1.2. MATHEMATICAL MORPHOLOGY
The basic operations of binary mathematical morphology [32] are erosion and dilation. These
operations are defined by two binary images, one being the operand, the other the structuring
element. In the CNN implementation, the former image is the input, while the function (the
templates) itself depends on the latter image. If the structuring element set does not exceed the size
of the CNN template, the dilation and erosion operators can be implemented with a single CNN
template. The implementation method is the following: the A template matrix is zero in every
position. The structuring element set should be directly mapped to the B template (See Figure). If it
is an erosion operator, the z value is equal to (1-n), where n is the number of 1s in the B template
matrix. If it is a dilation operator, the B template must be reflected to the center element, and the z
value is equal to (n-1), where n is the number of 1s in the B template matrix. When calculating the
operator, the image should be put to the input of the CNN, and the initial condition is zero
everywhere. The next figure shows an example of template synthesis.
[Figure: structuring element S = (0 1 0; 0 1 1; 0 0 0) and its reflection (0 0 0; 1 1 0; 0 1 0);
not reproduced.]

                        0 0 0        0 1 0
Erosion template:  A =  0 0 0   B =  0 1 1   z = -2
                        0 0 0        0 0 0

                        0 0 0        0 0 0
Dilation template: A =  0 0 0   B =  1 1 0   z = 2
                        0 0 0        0 1 0

Example: Erosion and dilation with the given structuring element set. Image name:
binmorph.bmp; image size: 40x40; template names: eros_bin.tem, dilat_bin.tem .
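The synthesis rule above (erosion: B = S, z = 1 − n; dilation: B = S reflected about its center, z = n − 1) can be sketched directly. In this helper (ours, not the library's), images are boolean with True = black, the input is coded ±1, and virtual cells are treated as white, which is an assumption of the sketch:

```python
import numpy as np

def synthesize(S, op):
    """Build (B, z) from a 3x3 0/1 structuring element S per the rule above."""
    n = int(S.sum())
    if op == "erosion":
        return S.astype(float), 1 - n
    return S[::-1, ::-1].astype(float), n - 1     # dilation: reflect S

def apply_morph(img, B, z):
    """Sign of B*u + z with u in {-1,+1}; A is all-zero, so no feedback.
    Virtual cells are treated as white (-1), an assumption of this sketch."""
    u = np.where(img, 1.0, -1.0)
    h, w = img.shape
    p = np.pad(u, 1, constant_values=-1.0)
    s = np.full((h, w), float(z))
    for i in range(3):
        for j in range(3):
            s += B[i, j] * p[i:i + h, j:j + w]
    return s > 0
```

Eroding a 3x3 block with a cross-shaped element leaves only its center; dilating grows it by the cross.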
The basic operations of grayscale mathematical morphology [32] are erosion and dilation. These
operations are defined by two grayscale images, one being the operand, the other the structuring
element set (S). In the CNN implementation, the former image is the input, while the function (the
templates) itself depends on the latter image. If the structuring element set does not exceed the size
of the CNN template the dilation and erosion operators can be implemented with a single CNN
template. The implementation method is the following: single template grayscale mathematical
morphology is implemented on a slightly modified CNN. The state equation of the modified CNN is
the following:
v̇xij(t) = −vxij(t) + vyij(t) + Σkl∈Nr(ij) Dij;kl( vukl(t) − vyij(t) ) + z
This means that A = [1], and the input of each nonlinear function of the D template is the
difference between the input value at the corresponding neighborhood position and the output
value of the center cell. The morphological operation can be implemented with a single template
on this CNN structure.
The erosion template is the following:
The erosion template is the following:

      0 0 0          d-1,1   d0,1   d1,1
A =   0 1 0     D =  d-1,0   d0,0   d1,0      z = 1
      0 0 0          d-1,-1  d0,-1  d1,-1

where each nonlinear function dij;kl(vukl − vyij) saturates at +1 and −1 with its breakpoint at
S(σ), the real value of the structuring element set (S) at the point σ = (k,l). If S(σ) is not
defined, dij;kl ≡ 0. [Plot of dij;kl not reproduced.]
Example for grayscale erosion with a 3x3 square shaped zero value structuring element set. The
black areas have shrunk. Image name: grmorph.bmp; image size: 288x272.
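With a flat (zero-valued) structuring element, as in the example above, grayscale erosion reduces to a local minimum filter: black (+1) areas shrink because every pixel is pulled toward the lowest (whitest) value in its 3x3 neighborhood. A dependency-free sketch (border handling by edge replication is our assumption):

```python
import numpy as np

def gray_erode_flat3x3(img):
    """Grayscale erosion with a flat 3x3 structuring element: each pixel
    becomes the minimum of its 3x3 neighborhood (edges replicated)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.full((h, w), np.inf)
    for i in range(3):
        for j in range(3):
            out = np.minimum(out, p[i:i + h, j:j + w])
    return out
```

A single dark pixel in a bright field spreads into a 3x3 dark patch, i.e. the bright (black-coded) region around it shrinks.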
0 0 0 0 0 0
A= 0 2 0 B= 0 0 0 z= 4
0 0 0 0 0 0
I. Global Task
II. Example: image name: madonna.bmp, image size: 59x59; template name: filblack.tem .
0 0 0 0 0 0
A= 0 2 0 B= 0 0 0 z= -4
0 0 0 0 0 0
I. Global Task
II. Example: image name: madonna.bmp, image size: 59x59; template name: filwhite.tem .
I. Global Task
II. Example: image name: points.bmp, image size: 50x50; template name: bprop.tem .
input output
I. Global Task
II. Example: image name: patches.bmp, image size: 50x50; template name: wprop.tem .
input output
1.3. Spatial Logic
I. Global Task
II. Example: image name: hollow.bmp, image size: 180x160; template name: hollow.tem .
FILL35:
1 0 1 0 0 0
A= 0 2 0 B= 0 1 0 z= 2
1 1 0 0 0 0
FILL65:
1 0 0 0 2 0
A= 1 2 0 B= 0 0 0 z= 3
0 0 2 0 0 0
I. Global Task
II. Example: image name: arcs.bmp, image size: 100x100; template name: arc_fill.tem .
0 0 -2 0 0 0 0 0 0 0
0 -4 16 -4 0 0 0 0 0 0
A= -2 16 -39 16 -2 B= 0 0 0 0 0 z= 0
0 -4 16 -4 0 0 0 0 0 0
0 0 -2 0 0 0 0 0 0 0
I. Global Task
II. Examples
Example 1: Ball surface reconstruction (10% of the points is known). Image names: interp1.bmp,
interp2.bmp; image size: 80x80; template name: interp.tem .
0 0 0 1 1 1
A= 0 1 0 B= 1 6 1 z= -3
0 0 0 1 1 1
I. Global Task
II. Example: image name: junction.bmp, image size: 140x120; template name: junction.tem .
input output
JunctionExtractor1: Finding the intersection points of thin (one-pixel thick) lines from two
binary images
I. Global Task
Given: two static binary images P1 and P2 containing thin (one-pixel thick) lines
or curves, among other (compact) objects
Input: U(t) = P1
Initial State: X(0) = P2
Boundary Conditions: Fixed type, yij = 0 for all virtual cells, denoted by [Y]=0
Output: Y(t) ⇒ Y(T) = Binary image containing all the intersection points
between the thin lines contained in the binary images P1 and P2
Remarks: The two binary images can be interchanged, i.e. we can apply P2 at the input and load P1
as the initial state. The feedback and control templates are identical. Even if other (compact) objects
are present in the two images, their overlap is not detected; only the intersection points of thin lines are.
0 0 0 0 0 0
A= 0 1 0 B= 2 2 2 z= -7
0 0 0 1 -2 1
II. Example: image name: lcp_lse.bmp, image size: 17x17; template name: lcp.tem .
input output
0 0 1 0 0 0 0 1 0 0
0 0 0.5 0 0 0 0 1 0 0
A= 0 0 2 0 0 B= 0 0 1 0 0 z= -5.5
0 0 0.5 0 0 0 0 1 0 0
0 0 1 0 0 0 0 1 0 0
I. Global Task
II. Example: image name: lincut7v.bmp, image size: 20x20; template name: lincut7v.tem
input output
0 0 0 b a a
A= 0 1.5 0 B= b 0 a z= -4.5
0 0 0 a b b
I. Global Task
II. Examples
Example 1 (simple): image name: line3060.bmp, image size: 41x42; template name:
line3060.tem .
input output
Example 2 (complex): image name: michelan.bmp, image size: 625x400; template name:
line3060.tem .
input
output
0 0 0 0 0 -1 0 -1 0 -1
0 0.3 0.3 0.3 0 0 -1 -1 -1 0
A= 0 0.3 3 0.3 0 B= -1 -1 4 -1 -1 z= -2
0 0.3 0.3 0.3 0 0 -1 -1 -1 0
0 0 0 0 0 -1 0 -1 0 -1
I. Global Task
II. Example: image name: linextr3.bmp, image size: 25x25; template name: linextr3.tem .
input output
              0 0 0 0 0 0 0
              0 0 0 0 0 0 0
0 0 0         0 0 0 0 0 0 0
A = 0 2 0   B = 0 0 0 0 0 0 0   z = -1
0 0 0         0 0 0 0 0 0 0
              0 0 0 0 0 0 0
              0 0 0 1 0 0 0
I. Global Task
Remark:
This operation keeps those pixels of the initial state where there is a black pixel on the
input at the place of the nonzero element of B. Based on the above example arbitrary pixel search
operations can be implemented by extending the applied template to NxN. Although the
operation requires NxN templates (7x7 in the example), members of this template class can be
decomposed into a sequence of 3x3 linear templates (see [73]).
0 0 0 0 0 0
A= 0 2 0 B= 0 1 0 z= -1
0 0 0 0 0 0
I. Global Task
II. Example: image names: logic01.bmp, logic02.bmp; image size: 44x44; template name:
logand.tem .
LogicDifference1: Logic Difference and Relative Set Complement (P2 \ P1 = P2 - P1) template
[7]
0 0 0 0 0 0
A= 0 2 0 B= 0 -1 0 z= -1
0 0 0 0 0 0
I. Global Task
II. Example: image names: logic05.bmp, logic01.bmp; image size: 44x44; template name:
logdif.tem .
0 0 0 0 0 0
A= 0 1 0 B= 0 -2 0 z= 0
0 0 0 0 0 0
I. Global Task
II. Example: image name: chineese.bmp; image size: 16x16; template name: lognot.tem .
input output
0 0 0 0 0 0
A= 0 2 0 B= 0 1 0 z= 1
0 0 0 0 0 0
I. Global Task
II. Example: image names: logic01.bmp, logic02.bmp; image size: 44x44; template name:
logor.tem .
LogicORwithNOT: Logic "OR" function of the initial state and logic "NOT" function of the
input [24]
0 0 0 0 0 0
A= 0 2 0 B= 0 -1 0 z= 1
0 0 0 0 0 0
I. Global Task
II. Example: image names: logic06.bmp, logic02.bmp; image size: 44x44; template name:
logorn.tem .
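In the bipolar (-1/+1) image coding, all of these single-pixel logic templates settle to the sign of a weighted sum of the initial state, the input and the bias. The sketch below uses this steady-state shortcut as an illustration; it is a simplification of the CNN dynamics, not the dynamics themselves:

```python
import numpy as np

def cnn_logic(x0, u, b, z):
    """Steady-state shortcut for the one-pixel logic templates
    (A-center 2, B-center b, bias z): in the bipolar -1/+1 coding
    each cell settles to sign(x0 + b*u + z).  Illustrative
    simplification, not the full CNN transient."""
    return np.sign(x0 + b * u + z)

p1 = np.array([ 1,  1, -1, -1])
p2 = np.array([ 1, -1,  1, -1])
logic_and  = cnn_logic(p1, p2, b=+1, z=-1)   # LogicAND
logic_or   = cnn_logic(p1, p2, b=+1, z=+1)   # LogicOR
difference = cnn_logic(p2, p1, b=-1, z=-1)   # LogicDifference: P2 \ P1
```

The same pattern with z = +1 and b = -1 gives the LogicORwithNOT behaviour, and A-center 1 with B-center -2 inverts the input (LogicNOT).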
0 1 0 0 0 0
A= 1 2 1 B= 0 1 0 z= 4.5
0 1 0 0 0 0
I. Global Task
II. Example: image name: patchmak.bmp; image size: 245x140; template name: patchmak.tem .
input output
1 1 1 0 0 0
A= 1 2 1 B= 0 0 0 z= 0
1 1 1 0 0 0
I. Global Task
II. Example: image name: smkiller.bmp; image size: 115x95; template name: smkiller.tem .
input output
I. Global Task
Given: image P containing three gray levels: +1, 0, -1 (black, gray, white)
Input: U(t) = P
Initial State: X(0) = P
Boundary Conditions: Zero-flux boundary condition (duplicate)
Output: Y(t) ⇒ Y(T) = Black and white areas, the boundary of which is located at
positions where the waves collided.
Remark:
The wave starts from black and white pixels and propagates on cells which have zero state
(gray color). The final image will contain black and white areas.
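The collision behaviour described in the remark can be emulated on a digital computer by growing the black and white regions ring by ring until they meet (a breadth-first sketch of the wavefronts, not the actual CNN dynamics):

```python
from collections import deque

def propagate(grid):
    """Grow +1 (black) and -1 (white) seed regions into 0 (gray) cells,
    one ring per step; the boundary freezes where opposing waves meet."""
    h, w = len(grid), len(grid[0])
    g = [row[:] for row in grid]
    q = deque((i, j) for i in range(h) for j in range(w) if g[i][j] != 0)
    while q:
        for _ in range(len(q)):          # advance one wavefront step
            i, j = q.popleft()
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w and g[ni][nj] == 0:
                    g[ni][nj] = g[i][j]
                    q.append((ni, nj))
    return g

g = propagate([[1, 0, 0, 0, -1]])   # black and white seeds at the two ends
```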
-3.44 0.86 -1.64 -0.16 -1.02 -2.19 -0.23 0.16 -0.63 -0.78
-1.09 0.16 -2.19 -3.2 3.51 1.64 2.27 -3.2 1.09 2.03
A= 2.50 1.56 3.91 2.66 2.42 B= 0.08 0.55 0.86 3.52 0.08 z = 3.28
0.55 2.89 -0.62 0.47 3.67 0.39 -3.83 -3.12 -2.34 -2.11
-1.80 -0.55 2.50 -0.23 2.34 0.78 -2.66 -1.17 -1.41 1.02
I. Global Task
II. Example: image name: tx_hclc.bmp, image size: 296x222; template name: tx_hclc.tem .
input output
I. Global Task
II. Example: image name: tx_racc.bmp, image size: 296x222; template name: tx_racc3.tem .
input output
1.4. Texture Segmentation and Detection 91
I. Global Task
II. Example: image name: tx_racc.bmp, image size: 296x222; template name: tx_racc5.tem .
input output
TEXTURE DETECTION
TextureDetector1 (T1_RACC3)
2.27 1.80 3.36 -3.91 1.25 3.05
A= -0.70 -4.45 1.41 B= 0.86 -3.05 3.36 z = -1.64
3.20 3.98 -0.31 1.72 -0.63 -4.61
TextureDetector2 (T2_RACC3)
1.56 4.38 2.42 -2.81 2.42 -3.75
A= 4.69 -3.13 1.41 B= -5 -0.39 -5 z = -3.20
2.19 -5 0.86 3.67 4.22 3.13
TextureDetector3 (T3_RACC3)
1.64 -1.02 1.33 -3.91 -2.66 -3.13
A= 1.88 -4.61 2.89 B= 0.94 1.48 -3.13 z = -2.42
3.28 2.03 3.75 1.33 0.55 2.34
TextureDetector4 (T4_RACC3)
3.13 4.30 2.19 -3.52 4.38 -5
A= -2.81 3.13 0.16 B= -0.94 -3.05 -3.67 z = -2.42
1.88 4.92 4.53 1.41 -0.63 -4.38
I. Global Task
Given: static grayscale image P representing textures having the same flat
grayscale histograms. One of them is identical to a texture shown above.
Input: U(t) = P
Initial State: X(0) = P
Boundary Conditions: Fixed type, uij = 0, yij = 0 for all virtual cells, denoted by [U]=[Y]=0
Output: Y(t) ⇒ Y(T) = Nearly binary image where the detected texture becomes
darker than the others.
These templates can be used in a classification problem when the number of different examined
textures is, for instance, more than 10 and the input textures have the same flat grayscale histograms.
input output
Example 2: Texture detection with the TextureDetector2 (t2_racc3.tem) template.
input output
Example 3: Texture detection with the TextureDetector3 (t3_racc3.tem) template.
input output
Example 4: Texture detection with the TextureDetector4 (t4_racc3.tem) template.
input output
1.5. MOTION
ImageDifferenceComputation: Logic difference between the initial state and the input
pictures with noise filtering [7]
I. Global Task
II. Example: image names: logdnfi0.bmp, logdnfi1.bmp; image size: 44x44; template name:
logdifnf.tem .
0 0 0       0 0 0
A = 0 0 0   B = 1.5 0 0   τ = 10 τCNN
0 0 0       0 0 0
I. Global Task
II. Example: image names: motdep1.bmp, motdep2.bmp; image size: 20x20; template name:
motdepen.tem .
0 0 0 0 0 0
A= 0 1 0 B= 0 6 0 z= -2
0 0 0 0 0 0
I. Global Task
Given: static binary image P
Input: U(t) = P
Initial State: X(0) = P
Boundary Conditions: Fixed type, yij = 0 for all virtual cells, denoted by [Y]=0
Output: Y(t) ⇒ Y(T) = Binary image representing only the objects of P moving
slower than 1 pixel/delay time in arbitrary directions.
II. Example: image names: motind1.bmp, motind2.bmp, motind3.bmp, motind4.bmp; image size:
16x16; template name: motindep.tem .
SPEED CLASSIFICATION
The algorithm is capable of classifying the speed of black-and-white objects moving parallel to the
image plane. It extracts objects moving faster than a given speed determined by the current of the
threshold template. Grayscale image sequences can be converted to black-and-white using the
Smoothing template.
The flow-chart of the algorithm:
[Flow-chart: both frames are passed through Smoothing; Temporal differentiation produces the
DIFFERENCE IMAGE, which is then classified.]
Smoothing:
0.06 0.13 0.06
Bsmoothing = 0.13 0.24 0.13
0.06 0.13 0.06
Temporal differentiation:
0 0 0           0 0 0              0 0 0
Adiff = 0 1 0   Bdiff = 0 0.4 0    Bdiff(delayed) = 0 -0.4 0
0 0 0           0 0 0              0 0 0
(the third template acts on the delayed input frame)
Classifying speed:
0 0 0 0 0 0
Aclass = 0 2 0 Bclass = 0 0 0 zclass = -threshold
0 0 0 0 0 0
Recalling objects:
Example: Input and output pictures. Image names: speed1.bmp, speed2.bmp; image size:
163x105.
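The flow-chart above can be sketched in NumPy for illustration (the smoothing kernel is the Bsmoothing template given above; the 0.4 gain mimics the temporal differentiation templates, and the threshold value is an arbitrary stand-in for the classification bias):

```python
import numpy as np

def smooth(img):
    # 3x3 kernel of the Smoothing step (weights sum to 1)
    k = np.array([[0.06, 0.13, 0.06],
                  [0.13, 0.24, 0.13],
                  [0.06, 0.13, 0.06]])
    p = np.pad(img, 1, mode='edge')
    out = np.zeros_like(img, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += k[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def fast_objects(frame_prev, frame_curr, threshold=0.1):
    """Pixels whose smoothed intensity changed more than `threshold`
    between the two frames: a crude digital stand-in for the temporal
    differentiation + classification steps."""
    diff = 0.4 * (smooth(frame_curr) - smooth(frame_prev))
    return np.abs(diff) > threshold

a = np.zeros((8, 8)); a[2:4, 2:4] = 1.0   # object at the left
b = np.zeros((8, 8)); b[2:4, 5:7] = 1.0   # moved right between frames
mask = fast_objects(a, b)
```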
0 0 0 0 0 0
A22 = 0 2 0 A21 = 0 3 0 z2 = 2
0 0 0 0 0 0
I. Global Task
Given: a binary image sequence P1 and a binary image P2. P1 represents the
objects to be traced, P2 consists of black pixels marking the objects to be
traced.
Input: U1(t) = P1
Initial State: X1(0) = P2
Boundary Conditions: Fixed type, yij = 0 for all virtual cells, denoted by [Y]=0
Output: Y1(t) ⇒ Y1(T) = Binary image representing the actual position of the
marked objects
Y2(t) ⇒ Y2(T) = Binary image showing the whole path of the marked
objects
II. Example: image names: trace1.bmp, trace2.bmp; image size: 140x80; template name: trace.tem
In the retina and the visual cortex there are single and double color opponent cells [23]. Their
receptive fields are as follows:
[Figure: (a) receptive field of the single opponent cell (R+ center, G- surround); (b) receptive field
of the double opponent cell (R+G- center, G+R- surround).]
where (a) belongs to the single and (b) belongs to the double opponent cell. The template simulating
the single opponent cell has two layers. The input of the first layer is the monochromatic red map,
while the second layer gets the green map. The result appears on the second layer. The template is
the following:
0 0 0 -0.25 -0.25 -0.25
B12 = 0 2 0 B22 = -0.25 0 -0.25
0 0 0 -0.25 -0.25 -0.25
By swapping the layers we get the template generating the G+R- single opponents. The output of the
R+G- and G+R- layers provide the input for the first and second layer of the double opponent
structure, respectively. The output appears on the second layer. The template is as follows:
DEPTH CLASSIFICATION
The algorithm determines the depth of black-and-white objects based on a pair of stereo images. It
determines whether an object is closer than a given distance or not. The first step of the algorithm is
to reduce the objects in both input images to a single pixel; then the distance between corresponding
points is calculated. The distance can be thresholded to determine whether the object is too close or
not. As a first preprocessing step, grey-scale images can be converted into black-and-white using the
Smoothing template.
The flow-chart of the algorithm:
elongate object
calculate distance
classify depth
recall
Templates:
Elongate objects: add pixels to the top and bottom of each object (use the left image as input)
0 0 0 0 3 0
Aelongate = 0 1 0 Belongate = 0 3 0 zelongate = 4.5
0 0 0 0 3 0
0 0 0
Adistance = b 1 b
0 0 0
where b is defined by the following nonlinear function of (vyij - vykl): b = 0.5 for
-1.05 < vyij - vykl < -0.05, and 0 otherwise.
Read out depth: (use the right center points as a fixed state map)
0 0 0
Aread = 0 1 0 zread = -2
0 0 0
Classify depth:
0 0 0
Aclass = 0 2 0 zclass = -threshold
0 0 0
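The thresholding step at the heart of the depth classification can be sketched as follows (pairing corresponding points by their order is a simplifying assumption of this sketch; the CNN algorithm obtains the correspondence from the elongation and distance steps):

```python
def disparity_too_close(left_xs, right_xs, max_disparity):
    """Flag objects whose horizontal disparity between the stereo
    views exceeds the threshold, i.e. objects closer than the chosen
    distance.  Points are matched by order for illustration."""
    return [abs(l - r) > max_disparity for l, r in zip(left_xs, right_xs)]

# x coordinates of the reduced object points in the left and right images
flags = disparity_too_close([10, 40], [7, 39], max_disparity=2)
```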
a a a 0 0 0
A= a 1 a B= 0 0 0 z= 0
a a a 0 0 0
where a is a nonlinear function of (vyij - vykl) taking values between 0.125 and 0.25, with
breakpoints at -2 and -1 (see the plot in the original).
I. Global Task
II. Example: image name: globmax.bmp, image size: 51x51; template name: globmax.tem .
input output
1.9. GAME OF LIFE AND COMBINATORICS
0 0 0 0 0 0
A= a 1 b B= 0 0 0 z= 0
0 0 0 0 0 0
where a and b are nonlinear functions of (vyij - vykl), saturating at ±3 with breakpoints at ±1.5
(see the plots in the original).
I. Global Task
II. Example: image name: histogr.bmp, image size: 7x5; template name: histogr.tem .
INPUT OUTPUT
0 0 0 -1 -1 -1
A11 = 0 1 0 B11 = -1 0 -1 z= -1
0 0 0 -1 -1 -1
0 0 0 -1 -1 -1
A22 = 0 1 0 B21 = -1 -1 -1 z= -4
0 0 0 -1 -1 -1
I. Global Task
II. Example: image name: life_1.bmp, image size: 16x16; template name: life_1.tem .
input output
a a a 0 0 0
A= a b a B= 0 0 0 z= 0
a a a 0 0 0
where a and b are piecewise-linear nonlinear functions of vykl and vyij, respectively (see the plots
in the original).
I. Global Task
II. Example
See the example of the GameofLife1Step template (template name: life_1l.tem).
I. Global Task
II. Example
See the example of the GameofLife1Step template (template name: life_dt.tem).
The goal of the one-dimensional majority vote-taker template is to decide whether a row of an input
image contains more black or white pixels, or whether their numbers are equal. The effect is realized
in two phases. The first template (with the initial state set to 0) gives rise to an image where the sign
of the rightmost pixel corresponds to the dominant color: it is positive if there are more black pixels
than white ones, negative in the opposite case, and 0 in the case of equality. Using the second
template, this information can be extracted: it drives the whole network into black or white,
depending on the dominant color, or leaves the rightmost pixel unchanged otherwise. The method
can easily be extended to two or even three dimensions.
First template
0 0 0 0 0 0
A= 1 0 0 B= 0 0.05 0 z= 0
0 0 0 0 0 0
Second template
0 0 0 0 0 0
A= 0 a 2 B= 0 0 0 z= 0
0 0 0 0 0 0
where a is a nonlinear function (see the plot in the original).
Example: image name: histogr.bmp, image size: 7x5; template names: majvot1.tem,
majvot2.tem.
INPUT OUTPUT
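The two-phase behaviour described above can be emulated in a few lines (the 0.05 gain is the B-template value given above; broadcasting the sign in one step stands in for the second template's dynamics):

```python
def majority_vote(row):
    """Emulates the two-phase 1-D majority vote: phase 1 accumulates
    0.05 * pixel left-to-right, so the rightmost cell ends up with the
    sign of the black/white surplus; phase 2 broadcasts that sign over
    the row.  A tie (surplus 0) leaves the row unchanged."""
    acc = 0.0
    for p in row:                    # phase 1: sweep towards the rightmost cell
        acc += 0.05 * p
    sign = (acc > 1e-12) - (acc < -1e-12)
    return [sign] * len(row) if sign else row[:]   # phase 2

out_black = majority_vote([1, 1, -1, 1, -1])   # more black (+1) pixels
out_tie   = majority_vote([1, -1, 1, -1])      # equal counts
```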
The template determines whether the number of black pixels in a row is even or odd. As a result, the
leftmost pixel in the output image codes the parity of the row: black represents odd, while white
means even parity. More generally, each pixel codes the parity of the pixels to the right of it,
together with the pixel itself. Naturally, the parity of a column or a diagonal can be computed in the
same manner. The parity of a whole array can also be determined by counting columnwise parity on
the result of the rowwise parity operation. The initial state should be set to -0.5.
0 0 0 0 0 0
A= 0 a b B= 0 c 0 z= 0
0 0 0 0 0 0
Example: image name: histogr.bmp, image size: 7x5; template name: parity1.tem .
INPUT OUTPUT
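A digital sketch of the parity computation (right-to-left sweep in the bipolar coding, black = +1):

```python
def row_parity(row):
    """out[i] = parity of the black (+1) pixels in row[i:], coded
    black (+1) for odd and white (-1) for even; the leftmost output
    pixel is the parity of the whole row."""
    out = [0] * len(row)
    parity = -1                              # even so far -> white
    for i in range(len(row) - 1, -1, -1):    # sweep right-to-left
        if row[i] == 1:
            parity = -parity                 # flip on every black pixel
        out[i] = parity
    return out

p = row_parity([1, -1, 1, 1])   # three black pixels -> leftmost output is black
```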
0 0 0 0 0 0
A= 0 1 0 D= d 0 0 z= 0
0 0 0 0 0 0
I. Global Task
Remark:
A particular pixel in the output is black if an odd number of black pixels can be found at the
left of the particular pixel in the input (including the position of the pixel itself).
II. Example: image name: parity2.bmp, image size: 7x5; template name: parity2.tem .
INPUT OUTPUT
A one-dimensional array of n values in the [-1,+1] interval can be sorted in descending order in n
steps with the following time- and space-varying template. In each odd step, the templates should be
applied in the ARALARALARAL ... pattern, while in each even step in the ALARALARALAR ... pattern.
To suppress side effects, the left and right boundaries should be set to +1 and -1, respectively.
0 0 0 0 0 0
AL = a 1 0 AR = 0 1 b
0 0 0 0 0 0
where a and b are nonlinear functions (see the plot in the original).
Input
Output
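The alternating AR/AL pattern is the CNN analogue of odd-even transposition sort: in each step, alternating neighbour pairs are compare-exchanged towards descending order. A plain Python sketch of the serial equivalent:

```python
def odd_even_sort_desc(values):
    """Odd-even transposition sort, the serial analogue of the
    alternating AR/AL template pattern: n sweeps over alternating
    neighbour pairs, each swapping towards descending order."""
    v = list(values)
    n = len(v)
    for step in range(n):
        start = step % 2                 # alternate the pair alignment each step
        for i in range(start, n - 1, 2):
            if v[i] < v[i + 1]:          # compare-exchange towards descending
                v[i], v[i + 1] = v[i + 1], v[i]
    return v

s = odd_even_sort_desc([-0.5, 1.0, 0.25, -1.0])
```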
1.10. PATTERN FORMATION
Spatio-temporal pattern formation in two-layer oscillatory CNNs was studied in [56], where it was
shown that exotic types of spiral waves exist on this type of network. For example, spiral waves may
consist not only of two types of motif (black and white patches) but also, for instance, of
checkerboard patterns. These three types of motifs propagate like spiral waves and transform
continuously into each other.
I. Global Task
III. Templates
Turing pattern generating templates, extended with a coupling parameter for each layer (its value is
given next to each template). As an example, template A11 generates cow patches while template
A22 generates a checkerboard pattern.
1 0.1 1 0 0 0
A11 = 0.1 -2 0.1 B11 = 0 0 0 z= 0 1 = 2
1 0.1 1 0 0 0
1 -0.1 1 0 0 0
A22 = -0.1 -2 -0.1 B21 = 0 0 0 z= 0 2 = 2
1 -0.1 1 0 0 0
IV. Example
Output of the 1st layer (2nd layer behaves in a very similar way):
The template class analyzed in [57] produces novel spatio-temporal patterns that exhibit complex dynamics.
The character of these propagating patterns depends on the self-feedback and on the sign of the coupling
below the self-feedback template element.
0 0 0 0 0 0
A= s p q B= 0 b 0 z= z
0 r 0 0 0 0
I. Global Task
II. Examples
Examples are generated for the sign-antisymmetric case with sq < 0 (s = -q).
The patterns generated depend on the sign of the extra template element r:
if r > 0, a pattern is formed which is solid inside, but its right border is oscillating;
if r < 0, a texture-like oscillating pattern is formed.
The input & initial state is shown in the upper left corner. It is a three-pixel wide bar. The pictures in
the different regions show a few typical snapshots of outputs belonging to that region. The
arrangement and size of the different regions give only qualitative information.
1.11. NEUROMORPHIC ILLUSIONS AND SPIKE GENERATORS
I. Global Task
II. Example: image name: herring.bmp, image size: 256x256; template name: herring.tem .
input output
I. Global Task
II. Example: image name: muller.bmp, image size: 44x44; template name: muller.tem .
input output
0 0 0 0 0 0
A= 0 0 0 B= 0 1 0 z= 0
0 0 0 0 0 0
[Figure: the sigmoid-like nonlinearities of the spike generator, with saturation levels 3 and 4 and
breakpoints at 0.2 and 0.6 on the v axis.]
Example: Input and output waveforms
1. 11. Neuromorphic Illusions And Spike Generators 119
0 0 0 0 0 0
A= 0 0 0 B= 0 1 0 z= 0
0 0 0 0 0 0
[Figure: cell circuit with state vxij driven by two nonlinear conductances g1(v) and g2(v) of slope
10, with the bias values 1.8 and -2.3 and a breakpoint of g2 at 0.8.]
0 0 0 0 0 0
A= 0 0 0 B= 0 1 0 z= 0
0 0 0 0 0 0
[Figure: cell circuit with state vxij (τ = 3) driven by two nonlinear conductances g1(v) and g2(v) of
slope 10, with the bias values 2.5 and -1.0.]
0 0 0 0 0 0
A= 0 a 0 B= 0 1 0 z= 0
0 0 0 0 0 0
where a is a nonlinear function (see the plot in the original).
Short Description
The following simple algorithm simulates the functioning of a cellular automaton. Input and
output pictures are binary. The input of the nth iteration is replaced by the output of the (n-1)th
iteration.
Typical Example
The following example shows a few consecutive states of the simulated cellular automata.
[Flow-chart: i = 1; in each iteration the INPUT is replaced by the previous OUTPUT, several shifted
copies are combined by XOR operations, and i is incremented for the (i+1)th iteration.]
SHIFT_NE:
0 0 0 0 0 1
A= 0 0 0 B= 0 0 0 z= 0
0 0 0 0 0 0
SHIFT_E:
0 0 0 0 0 0
A= 0 0 0 B= 0 0 1 z= 0
0 0 0 0 0 0
SHIFT_SE:
0 0 0 0 0 0
A= 0 0 0 B= 0 0 0 z= 0
0 0 0 0 0 1
SHIFT_S:
0 0 0 0 0 0
A= 0 0 0 B= 0 0 0 z= 0
0 0 0 0 1 0
SHIFT_SW:
0 0 0 0 0 0
A= 0 0 0 B= 0 0 0 z= 0
0 0 0 1 0 0
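The flow-chart can be emulated digitally: each SHIFT_* template moves the binary image by one pixel, and the shifted copies are combined by XOR. The sketch below uses an arbitrary two-shift rule purely for illustration; the library's algorithm combines several SHIFT_* outputs the same way:

```python
def shift(grid, di, dj):
    """Shift a binary (0/1) grid by (di, dj); cells shifted in from
    outside read 0 (white), mimicking a fixed boundary."""
    h, w = len(grid), len(grid[0])
    return [[grid[i - di][j - dj] if 0 <= i - di < h and 0 <= j - dj < w else 0
             for j in range(w)] for i in range(h)]

def ca_step(grid):
    """One iteration: XOR of the east- and south-shifted copies with
    the current image (an illustrative rule, not the library's)."""
    e = shift(grid, 0, 1)
    s = shift(grid, 1, 0)
    h, w = len(grid), len(grid[0])
    return [[grid[i][j] ^ e[i][j] ^ s[i][j] for j in range(w)] for i in range(h)]

g = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]
g1 = ca_step(g)   # the output of iteration n becomes the input of n+1
```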
ALPHA source
/* CELLAUT.ALF */
/* Performs a function of the cellular automata; */
PROGRAM cellaut(in);
CONSTANT
ONE = 1;
TIME = 5;
TIMESTEP = 1.0;
ENDCONST;
CHIP_SET simulator.eng;
/* Chip definition section */
A_CHIP
SCALARS
IMAGES
im1: BINARY;
im2: BINARY;
im3: BINARY;
ENDCHIP;
PROCESS cellaut;
USE (shift_nw, shift_ne, shift_e, shift_se, shift_s, shift_sw);
SwSetTimeStep (TIMESTEP);
HostLoadPic(in, input);
im1:= input;
cycle:=1;
REPEAT /* loop condition and the shift and XOR template executions are omitted in the source */
im1:=im2;
cycle := cycle + 1;
ENDREPEAT;
ENDPROCESS;
ENDPROG;
Short Description
The following simple algorithm simulates the functioning of a general cellular automaton. Input
and output pictures are binary. In each consecutive step the initial state is equal to zero. The input
of the nth iteration is replaced by the output of the (n-1)th iteration.
Typical Example
The following example shows a few consecutive steps of the simulated general cellular
automaton.
t = 10 t = 15 t = 20
t = 25 t = 45 t = 99
1. 12. Cellular Automata 127
[Flow-chart: i = 1; in each iteration the INPUT is replaced by the previous OUTPUT, the Template
is executed, and i is incremented for the (i+1)th iteration.]
0 0.5 0 0 0.5 0
A= 0.5 2 -1 B= 0.5 -0.5 0.5 z= 0.5
0 -1 0 0 0.5 0
1.13. OTHERS
PathFinder: Finding all paths between two selected points through a labyrinth [61]
0.5 4 0.5 0 0 0
A= 4 12 4 B= 0 8 0 z= 8
0.5 4 0.5 0 0 0
I. Global Task
where d = λ·sign(yij - xkl) with λ ∈ [-1, 0], and (B = 0, z = 0).
I. Global Task
II. Examples
Example 1: image name: inphole.bmp, image size: 64x64; template name: nel_aintpol3.tem .
Example 2: image name: inpeye.bmp, image size: 64x64; template name: nel_aintpol3.tem .
ImageDenoising: Image denoising based on the total variational (TV) model of Rudin-Osher-
Fatemi [59, 60]
0 a 0 0 0 0
A= a 1 a D= 0 d 0
0 a 0 0 0 0
where a = λ1·sign(xij - xkl) with λ1 ∈ [-1, 0], and d = λ2·(xij - uij) with λ2 ∈ [0, 1], and
(B = 0, z = 0).
I. Global Task
II. Examples
Example 1: image name: osrufa5.bmp, image size: 214x216; templates name: osrufa2.tem.
input output
Example 2: image name: cameraman10.bmp, image size: 256x256; templates name: osrufa.tem.
input output
0         e^(jωy)     0           0 0 0
A = e^(jωx)  -(3+λ²)  e^(jωx)   B = 0 2 0   z = 0
0         e^(jωy)     0           0 0 0

where ωx and ωy control the spatial frequency tuning of the filter and λ controls the bandwidth.
Note that the off-center elements are complex valued (j = √-1). The state is also assumed to be
complex valued. Note that the center element of the template presented here differs from that
presented in [53] by 1 because we assume the standard CNN equation here, whereas [53] used an
equation without the resistive loss term.
I. Global Task
II. Example: image name: annulus.bmp, image size: 64x64; template name: cgabor.tem .
Two-Layer Gabor: Two-layer template implementing even and odd Gabor-type filters
0          cos(ωy)     0            0 0 0
A11 = cos(ωx)  -(3+λ²)  cos(ωx)   B1 = 0 2 0   z1 = 0
0          cos(ωy)     0            0 0 0

0          cos(ωy)     0            0 0 0
A22 = cos(ωx)  -(3+λ²)  cos(ωx)   B2 = 0 0 0   z2 = 0
0          cos(ωy)     0            0 0 0

0          -sin(ωy)    0              0         sin(ωy)    0
A12 = -sin(ωx)    0    -sin(ωx)   A21 = sin(ωx)     0     sin(ωx)
0          -sin(ωy)    0              0         sin(ωy)    0

where ωx and ωy control the spatial frequency tuning of the filter and λ controls the bandwidth.
This template is equivalent to the Complex-Gabor template, where we have separated the real and
imaginary parts into two layers.
I. Global Task
II. Example: image name: annulus.bmp, image size: 64x64; template name: cgabor.tem .
P (input) Y1 Y2
where λ = 0.2, ωx = 2π/8, ωy = 0.
Example: image names: LenaS.bmp; image size: 128x128; template name: CS2.tem.
Old names: Linear Template Inverse (Ai=1-B; Bi=1-A; A and B see above)
I. Global Task
Given: a linear template as well as two static gray scale images P1 (the result of the
linear template operation; see the test template above and its output) and
P2 (a masked version of the original image). P3 is a binary version of P2
providing the fixed-state mask for the CNN operation: P3 indicates the
positions of the supporting pixels where the interpolation is fixed. The result
of the inverse of a linear template operation is computed rapidly using
masked diffusion even if the template cannot be inverted (i.e., the linear
template convolution kernel has zero eigenvalues).
Input: U(t) = P1
Initial State: X(0) = P2
The inverse operation of a linear template can be easily performed by using a dense support
even if the theoretical inverse converges too slowly or if theoretically the template operation cannot
be inverted.
II. Example: image names: LenaSCs.bmp, LenaSMask.bmp, MaskS.bmp; image size: 128x128;
template name: DiffM2.tem .
0 0 0 abHV (1 - a) b H a b H (1 - V)
A= 0 0 0 B= a (1 - b) V (1 - a ) (1 - b ) a (1 - b ) (1 - V) z= 0
0 0 0 a b (1 - H) V (1 - a ) b (1 - H) a b (1 - H) (1 - V)
where
H = 1, if dy > 0; H = 0, otherwise
V = 1, if dx > 0; V = 0, otherwise
a = |dx| b = |dy|
I. Global Task
Remark:
Translations by a value greater than one pixel can be achieved by applying the same
template many times. For example, if dx > dy > 1:
ny = trunc(dy); fy = dy - ny;
nx = trunc(dx); fx = dx - nx;
- Apply template Translation(fx, fy)
- repeat ny times Translation(1,1)
- repeat nx-ny times Translation(1,0)
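The products a·b·H·V, (1-a)·b·H, … in the B template are exactly the bilinear interpolation weights of a fractional shift. A NumPy sketch of one application of the translation (out-of-range samples read 0, standing in for the boundary condition):

```python
import numpy as np

def translate(img, dx, dy):
    """Shift `img` by (dx, dy) with bilinear interpolation: the four
    weights (1-a)(1-b), a(1-b), (1-a)b, ab are the products that fill
    the Translation B template.  Pixels pulled from outside read 0."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            si, sj = i - dy, j - dx                  # source coordinate
            i0, j0 = int(np.floor(si)), int(np.floor(sj))
            b, a = si - i0, sj - j0                  # fractional parts
            for (ii, jj), wgt in {(i0, j0): (1 - a) * (1 - b),
                                  (i0, j0 + 1): a * (1 - b),
                                  (i0 + 1, j0): (1 - a) * b,
                                  (i0 + 1, j0 + 1): a * b}.items():
                if 0 <= ii < h and 0 <= jj < w:
                    out[i, j] += wgt * img[ii, jj]
    return out

img = np.zeros((5, 5)); img[2, 2] = 1.0
shifted = translate(img, 0.5, 0.0)   # half-pixel shift to the right
```

A single bright pixel is split evenly between its old and new positions, which is what the template computes for a = 0.5, b = 0.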
II. Example image name: lenna.bmp, image size: 256x256; dx = -10.5; dy = 4.7
INPUT OUTPUT
0 0 0 abHV (1 - a) b H a b H (1 - V)
A= 0 0 0 B= a (1 - b) V (1 - a ) (1 - b ) a (1 - b ) (1 - V) z= 0
0 0 0 a b (1 - H) V (1 - a ) b (1 - H) a b (1 - H) (1 - V)
where
dx(i,j) = Ox + (j - Ox)·cos θ - (i - Oy)·sin θ - j
dy(i,j) = Oy + (j - Ox)·sin θ + (i - Oy)·cos θ - i
H = 1, if dy(i,j) > 0; H = 0, otherwise
V = 1, if dx(i,j) > 0; V = 0, otherwise
a = |dx(i,j)|  b = |dy(i,j)|
If |dx|>1 or |dy|>1 for some i,j, an exact rotation is not possible with a neighborhood order 1.
- for k = 1 to m:
I. Global Task
INPUT OUTPUT
Chapter 2. Subroutines and Simpler Programs
SKELBW1:
0 0 0 1 1 0
A1 = 0 1 0 B1 = 1 5 -1 z1 = -1
0 0 0 0 -1 0
SKELBW2:
0 0 0 2 2 2
A2 = 0 1 0 B2 = 0 9 0 z2 = -2
0 0 0 -1 -2 -1
SKELBW3:
0 0 0 0 1 1
A3 = 0 1 0 B3 = -1 5 1 z3 = -1
0 0 0 0 -1 0
SKELBW4:
0 0 0 -1 0 2
A4 = 0 1 0 B4 = -2 9 2 z4 = -2
0 0 0 -1 0 2
SKELBW5:
0 0 0 0 -1 0
A5 = 0 1 0 B5 = -1 5 1 z5 = -1
0 0 0 0 1 1
SKELBW6:
0 0 0 -1 -2 -1
A6 = 0 1 0 B6 = 0 9 0 z6 = -2
0 0 0 2 2 2
SKELBW7:
0 0 0 0 -1 0
A7 = 0 1 0 B7 = 1 5 -1 z7 = -1
0 0 0 1 1 0
SKELBW8:
0 0 0 2 0 -1
A8 = 0 1 0 B8 = 2 9 -2 z8 = -2
0 0 0 2 0 -1
The robustness values of templates SKELBW1 and SKELBW2 are ρ(SKELBW1) = 0.18 and
ρ(SKELBW2) = 0.1, respectively. The other templates are rotated versions of SKELBW1 and
SKELBW2, thus their robustness values equal the ones above.
Example: image name: skelbwi.bmp, image size: 100x100; template names: skelbw1.tem,
skelbw2.tem, , skelbw8.tem.
input output
UMF diagram
[UMF diagram: SKELBW1 through SKELBW8 applied in sequence, producing Y.]
GRAYSCALE SKELETONIZATION
Old names: SKELGS
0 0 0        a a 0                  c c 0
A1 = 0 1 0   B1 = a 0 b   z1 = -4.5   Â1 = c 1 0
0 0 0        0 b 0                  0 0 0
0 0 0        a a a                  c c c
A2 = 0 1 0   B2 = 0 0 0   z2 = -4.5   Â2 = 0 1 0
0 0 0        b b 0                  0 0 0
0 0 0        0 a a                  0 c c
A3 = 0 1 0   B3 = b 0 a   z3 = -4.5   Â3 = 0 1 c
0 0 0        0 b 0                  0 0 0
... ...
0 0 0        a 0 0                  c 0 0
A8 = 0 1 0   B8 = a 0 b   z8 = -4.5   Â8 = c 1 0
0 0 0        a 0 b                  c 0 0
where a, b and c are nonlinear functions (with the characteristic values 1, 1 and 0.33; see the plots
in the original).
input output
input output
GREY-SCALE IMAGE
multiply
add
diffusion template
input output
SHORTEST PATH
Explore:
0 a 0 0 0 0
Aexplore= a 1 a Bexplore = 0 3 0 zexplore = 3
0 a 0 0 0 0
where a is defined by the following nonlinear function of (vyij - vykl): it takes the values 0.005 and
1.005, with a breakpoint at -0.25 (see the plot in the original).
Select:
0 0 0 0 1 0 0 0 0 0 0 0
Aright= 1 3 0 Adown= 0 3 0 Aleft = 0 3 1 Aup = 0 3 0
0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 b 0 0 0 0 0 0 0
Bright= b 0 0 Bdown= 0 0 0 Bleft = 0 0 b Bup = 0 0 0
0 0 0 0 0 0 0 0 0 0 b 0
The task is to find the shortest path from a given startpoint to a given endpoint. The task can be
solved through a so-called J-function, which gives for every point the length of the shortest path
from the startpoint to the given point in some appropriate measure (the length of the path is
scaled down into the interval [-1,+1] in the CNN solution).
The minimum in a given neighborhood can be computed through a difference-controlled
template. Thus the computation can be carried out on a two-layer nonlinear CNN. The first
layer represents the J-function, and on the second layer the actual value of the J-function
increased by the unit length is computed. The startpoint is given by a binary image, where a
white pixel represents the startpoint and the other points are black. The J-function is obtained as
an output picture whose values are related to the value of J. Obstacles have J=1, i.e. they are
black.
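The J-function is the shortest-path distance map from the startpoint. Its discrete counterpart can be sketched with Dijkstra's algorithm on a 4-neighbour grid (unit step cost; the CNN computes a version of this map scaled into [-1,+1]):

```python
import heapq

def j_function(free, start):
    """Shortest-path length from `start` to every free cell
    (4-neighbour grid, unit step cost); obstacles and unreachable
    cells get None.  Discrete counterpart of the J-function."""
    h, w = len(free), len(free[0])
    dist = [[None] * w for _ in range(h)]
    pq = [(0, start)]
    while pq:
        d, (i, j) = heapq.heappop(pq)
        if dist[i][j] is not None:
            continue                       # already settled
        dist[i][j] = d
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and free[ni][nj] and dist[ni][nj] is None:
                heapq.heappush(pq, (d + 1, (ni, nj)))
    return dist

free = [[1, 1, 1],        # 1 = free cell, 0 = obstacle
        [0, 0, 1],
        [1, 1, 1]]
J = j_function(free, (0, 0))
```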
0 0 0 d d d 0 0 0
A11= 0 1 0 D12 = d 0 d B11 = 0 a 0 z= -0.8
0 0 0 d d d 0 0 0
where a is a nonlinear function (values 0.1 and 2; see the plot in the original) and d is defined by a
nonlinear function of (yij - ykl) (value 0.001, with breakpoints at -0.1, 1 and 2; see the plot in the
original).
0 0 0
A21= 0 1 0 z= c
0 0 0
where c = 2 / (maximal path length)
Input Parameters
U1, U2 Input binary images (object set A and set B)
B_WAVE Binary wave propagation
WAVE_MAPPING Depending on the machine, it can be a nonlinear template or a simple
linear one
WAVE_MAP Spatial map encoding dynamics of wave propagation
DIFFUS and THRESHOLD Approximation of the Integrated Hausdorff metric [1]
Output Parameters
MARKERS Result of the comparison: marker points of the selected objects, to be
used for further processing. An application example can be found in [2].
Dynamic implementation
[Block diagram of the dynamic implementation. 1st layer: masked trigger wave with state
X[0] = P1 AND P2 and mask M = P1 OR P2, weighted by (-a); 2nd layer: X[0] = 0, weighted by
(+a), driven by the Wave Map; cell value = a*(Input-Output)*Time during the trigger wave;
Average & Threshold (Weighted Hamming); Recall of the objects (P1).]
0 0 0        0 0 0
A1,2 = 0 a 0   B = 0 0 0   z = 0
0 0 0        0 0 0
where a is a nonlinear function of (vuij - vykl) (values 0.1 and 1; see the plot in the original).
[Flow-chart of the iterative implementation: the object images (P1) and (P2) are combined by OR
and AND; the iterative part repeats DILATION, RECALL, AND, XOR and FILLING steps; a final
RECALL yields the selected objects.]
0 0 0 1 1 1
A= 0 2 0 B= 1 1 1 z= 8.5
0 0 0 1 1 1
Example 1
Wave Map generation: a) outlines of two partially overlapping point sets; b) a trigger wave
spreads from the intersection through the union of the contiguous parts of the point sets until all the
points become triggered; c) wave map generated by increasing pixel intensities until the trigger
wave reaches them (simulation result); d) consecutive steps of generating the Wave Map on the
64x64 I/O CNN-UM chip.
Example 2
A possible application of the nonlinear wave metric was presented in [48]. The study addressed
the problem of condition-based maintenance of engines, concerning an on-line monitoring
system that should forecast engine malfunctions. Optical sensing of the engine's oil flow was
applied, and debris particles were detected using model-based object recognition. The
comparison of object-model pairs was performed via the nonlinear wave metric.
Bubble-debris classification algorithm (a) original gray-scale image, (b) adaptive threshold, (c)
bubble models, (d) wave map, (e) detected debris particles
UMF diagram
Input image sequence (last frame) Output image sequence (last frame)
Input Parameters
Output Parameters
UMF diagram
Input Parameters:
Remarks:
The hardware complexity (chip area) of the (A) solution is much lower than the (B) solution
The theoretical scalability in space (increasing the array size) is better in the (B) solution
The practical scalability in space (tiling the input) is better in the (A) solution
The computational time (speed and power consumption) depends on the available boundary
conditions
UMF diagram
[UMF diagram of the two solutions, A and B, built from the Pr1, Pr2, AND, Diffusion_hor,
Shadow_ver, GlobMax, ERO_hor, Threshold and Shadow_hor operations on U and X0,
producing Y.]
Input Parameters
UMF diagram
[UMF diagram: Diffusion, Erosion A, Trigger wave, SUB, Erosion B, Thres1, Thres2, OR and
NOTOR operations in an iterative loop (counter i starts at 0 and is incremented until the exit
condition i < N fails), producing Y.]
Input Parameters
U Input image
I,J Indexes
K Number of morphological steps
M (Def = 1) Heuristic value
T(LEVEL) Number of levels
Output Parameters
Y Output image
UMF diagram
OBJECT COUNTER
5 objects
Input Parameters
U Input image
Output Parameters
UMF diagram
[UMF diagram: a loop testing GW; the counter i starts at 0 and is incremented while the Object
Remover deletes the objects one by one.]
Input Parameters
U Input image
uline / dline Upper / lower baselines
Output Parameters
Y1 Holes detected
Y2 Small holes
Y3 / Y4 Upper / lower holes
UMF diagram
Input Parameters
U Face image
Output Parameters
UMF diagram
Input Parameters
U Input image
U1 Previous prediction image
Output Parameters
Y Current prediction
UMF diagram
[UMF diagram: GW tests select between branches; U and U1 are combined by AND, RECALL,
XOR and OR operations to produce the current prediction Y.]
Uprev
Input Parameters
U Input image
Ucurr The second frame of the pair
Uprev The first frame of the pair
Output Parameters
Y Output image
Y_PDFFull_x X coordinate of the most likely motion vectors
Y_PDFFull_y Y coordinate of the most likely motion vectors
UMF diagram
[UMF diagram of the PDFFull(scale) and PDF(scale) computation: Thres(0); Shift W/N/E/S;
Mul(-2); Add; Diffusion(scale); exp(-x); *2; the loop repeats while the change exceeds a limit,
then Select MAP produces the result.]
Input Parameters
Output Parameters
Remarks
UMF diagram
[UMF diagram: the noisy peak layer is processed by the subroutine find terminating points and by the edge_w1, edge_w2 and edge_w3 templates; a logic OR produces the connected noisy peak layer.]
COMMON AM
Input Parameters
peak layer Layer containing the representative frequency bands (peak layer)
Output Parameters
Remarks
The time tolerance of offsets can be controlled through the iteration number of the vertical wave
propagation step producing the propagating waves layer.
UMF diagram
[UMF diagram: the peak layer is processed by the subroutines find common onset groups, find terminating points and find synchronized offsets; a recall step produces the common AM group. The subroutine find terminating points iterates a vertical_dilation on the propagating waves layer, XORs consecutive results to obtain the propagating wavefronts, marks collisions with a match template (two black pixels in a 3×3 region), and repeats until the waves reach the boundary; a recall on the collision layer and the subroutine fill nonempty columns yield the synchronized offsets.]
Input Parameters
peak layer Input image containing lines, some of which keep a constant vertical distance from each other (parallel curves)
Output Parameters
Remarks
The registration of the curves is performed by a digital computer, which selects the reference curves whose initial points are still present in the maintained pixel list.
UMF diagram
Input Parameters
Output Parameters
Remarks
The tolerance of time asynchrony is set by the width of the vertical line shifted on the selector layer.
UMF diagram
[UMF diagram: the subroutine find initial points and a shift_e step are iterated, ANDed with the selector layer, until the rightmost column is selected; a recall then produces the result.]
CONTINUITY
Input Parameters
Output Parameters
Remarks
The spanned distance depends on the iteration count of the dilation_right template.
UMF diagram
[UMF diagram: from the peak layer, the subroutines find initial points and find terminating points start eastward and westward waves (dilation_right, two iterations); the OR and AND of the eastward and westward layers give the union and intersect layers; after a recall, the subroutine skeletonization (applied three times) and an OR with the peak layer produce the continuity layer.]
Input Parameters
Output Parameters
Remarks
The search region is controlled through the black pixel search template class.
UMF diagram
[UMF diagram: columns are recalled with the ppl template until the end of the complete object layer is reached; an XOR keeps the almost complete curves; the subroutine small object removal is applied; a column-wise recall on the peak layer and the parallel curves layer, an XOR with the almost complete curves (acc pcl) and a final OR yield the common FM group.]
PEAK-AND-PLATEAU DETECTOR
Input Parameters
Output Parameters
Remarks
There can be a few-pixel-wide gaps in the detected peaks and plateaus where the vertical interconnecting pixels do not produce black output pixels. These gaps can be filled by the broken-line connector algorithm.
UMF diagram
[UMF diagram: the magnitude image is processed by a Shift_S step and by Masked_Shadow templates to the South and to the North; a logic AND combines the two shadow images.]
Input Parameters
U Input image
Output Parameters
Yh Horizontal displacement
Yv Vertical displacement
UMF diagram
[UMF diagram (vertical branch): vertical diffusion of the frames Ht and Ht-1, absolute difference, averaging, and a MIN selection give the vertical displacement Yv.]
Input Parameters
Output Parameters
UMF diagram
[UMF diagram: the background Bt-1 and the input U are subtracted (SUB); the difference is thresholded at a high and a low level, multiplied and added back to form the updated background Bt; the absolute difference of U and Bt is thresholded and morphologically filtered to give the foreground Ft.]
BANK-NOTE RECOGNITION
This algorithm identifies American bank-notes in color images. The bank-notes can appear in the image with arbitrary offset and rotation. The algorithm finds the green and black circles common to all US bank-notes, analysing their color, shape and size. The algorithm can be separated into three parts, which are indicated in the flow-chart. The detailed description can be found in [22]. The templates can be found in this template library.
Example: A grayscale version of a color input image, and the extracted black circle.
[Flow-chart: Recall operations extract the right-sized-or-larger objects and the larger objects; a logic XOR of the two yields the right-sized objects, which are passed to classification.]
Short Description
The CNN analogic program described here performs the basic operations needed to calculate a
hash value, when a set of binary images and a key vector are given. The current version of the
Alpha compiler cannot interpret sequences of key bits or image sequences, therefore the Alpha
source code listed below contains only two images and two key bits.
The first image is loaded to the chip, and its columns (as binary vectors) are multiplied by the key vector. This multiplication is performed as a sequence of shift-add operations. Then the next binary image is added modulo 2, and the multiplication is performed once more.
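The shift-add multiplication described above can be modelled in ordinary code. Below is a minimal, illustrative Python sketch (not the Alpha program itself): each column is packed into an integer, the multiplication by the key vector is realized as an XOR accumulation of circularly down-shifted copies selected by the key bits, and the next image is added modulo 2 with XOR. The column height N, the exact combination rule, and all function names are assumptions for illustration.

```python
N = 8  # column height in pixels (chip row count); assumed for illustration

def circ_shift_down(col, k, n=N):
    """Circularly shift an n-bit column (bit i = pixel i) down by k pixels."""
    k %= n
    mask = (1 << n) - 1
    return ((col << k) | (col >> (n - k))) & mask

def hash_step(columns, key_bits):
    """Multiply each column by the key vector via shift-XOR accumulation."""
    out = []
    for col in columns:
        acc = 0
        for k, bit in enumerate(key_bits):
            if bit:                       # shift-add: add (mod 2) a shifted copy
                acc ^= circ_shift_down(col, k)
        out.append(acc)
    return out

def hash_images(images, key_bits):
    """images: column-packed binary images; XOR in each image, then multiply."""
    state = [0] * len(images[0])
    for img in images:
        state = [s ^ c for s, c in zip(state, img)]  # add next image modulo 2
        state = hash_step(state, key_bits)
    return state
```

With a key vector of (1, 0, …, 0) the multiplication degenerates to the identity, which is a convenient sanity check.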
Typical Example
Gray-scale images (or video sequences) must be quantized and cut into pieces according to the
chip size. The following pictures show an input sequence and a typical hash result; the latter, of
course, heavily depends on the key bits.
[Diagram: circularly down-shifted copies (by 1, 2, …, n) of the input layer are accumulated in local memory.]
SHIFTSOU template:
0 0 0 0 2 0
A= 0 0 0 B= 0 0 0 z= 0
0 0 0 0 0 0
Short Description
The algorithm has three major steps. In the first step, only a part of the noise is discarded, but the main features come out fine. In the second step a more aggressive filter is applied; after this step, only some parts of the largest objects remain in the image. In the last step, the main characters are reconstructed from the previous two results.
Typical Example
The speciality of this example is that the input image was stored in an extremely high-density optical memory [40]. It is heavily corrupted with noise. In this example, the image size is 318×93. This image was automatically cut into about 300 tiles of 20×22 pixels, which were processed one after the other on the chip. The algorithm was executed on the 20×22 CNN chip [41]. Experimental results of the main feature extractor algorithm are as follows.
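The dilation and erosion steps of the pipeline can be modelled with ordinary set operations. A minimal Python sketch, assuming black pixels are stored as a set of (row, col) coordinates and a 4-neighbour (cross-shaped) structuring element, matching the B matrix of the DILATION and EROSION templates listed with this algorithm; the function names are illustrative.

```python
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # cross-shaped 3x3 element

def dilate(black, se=CROSS):
    """A pixel becomes black if the element placed on it hits any black pixel."""
    return {(r + dr, c + dc) for (r, c) in black for (dr, dc) in se}

def erode(black, se=CROSS):
    """A pixel stays black only if every element position over it is black."""
    return {(r, c) for (r, c) in black
            if all((r + dr, c + dc) in black for (dr, dc) in se)}
```

Applying dilate twice and then erode twice, as in the flow-chart, closes small gaps in the characters while isolated noise pixels are handled by the smallkiller steps.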
[Flow-chart: Input image → slicing → smallkiller_ch1 / smallkiller_ch2 → dilation → dilation → erosion → erosion → reconstruction_ch1 → merging → slicing → reconstruction_ch1 → merging → Result.]
SMALLKILLER_CH1:
1 1 1 0 0 0
A= 1 2 1 B= 0 0 0 z = -1.7
1 1 1 0 0 0
SMALLKILLER_CH2:
1 1 1 0 0 0
A= 1 2 1 B= 0 0 0 z= -2
1 1 1 0 0 0
DILATION:
0 0 0 0 1 0
A= 0 0 0 B= 1 1 1 z= 4.5
0 0 0 0 1 0
201
EROSION:
0 0 0 0 1 0
A= 0 0 0 B= 1 1 1 z = -5.5
0 0 0 0 1 0
RECONSTRUCTION_CH:
0 1 0 0 0 0
A= 1 3 1 B= 0 3 0 z = -1.25
0 1 0 0 0 0
Short Description
An idea of fault-tolerant template decomposition in the case of local Boolean operators (binary input/output templates) is outlined here. Due to the parameter deviations of current analog VLSI implementations, theoretically generated templates do not work properly. A solution to this problem is to apply a sequence of so-called fault-tolerant templates, which make no faults. Here two examples of such a decomposition are presented: the Local Concave Place (LCP) detector template and the JUNCTION template from this template library are decomposed into a sequence of two fault-tolerant templates.
Typical Examples
Example 1: LocalConcavePlaceDetector (LCP) template decomposition
LCP:
0 0 0 0 0 0
A= 0 1 0 B= 2 2 2 z= -5
0 0 0 1 -2 1
IMAGES
P: BINARY;
ENDBOARD;
HostLoadPic(input, P);
HostDisplay(P, ONE);
LLM1 := P;
junc1 (LLM1, LLM1, LLM2, TIME, White);
Comments
Fault tolerant template generation results in a sequence of reliable templates.
GAME OF LIFE
Short Description
The following simple algorithm simulates the Game of Life. Both input and output pictures are
binary. Rules of the game:
a black pixel turns white if it has more than three or less than two black neighbors.
a white pixel turns black if it has exactly three black neighbors.
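As a reference for the rules above, one generation can be written directly in ordinary code; the CNN program realizes the same update with the two templates GLIFE1/GLIFE2 and an XOR. A minimal Python sketch on a 0/1 grid (cells outside the border are taken as dead; the function names are illustrative):

```python
def life_step(grid):
    """One Game of Life generation; grid is a list of rows of 0/1 cells."""
    h, w = len(grid), len(grid[0])

    def live_neighbors(r, c):
        return sum(grid[rr][cc]
                   for rr in range(max(0, r - 1), min(h, r + 2))
                   for cc in range(max(0, c - 1), min(w, c + 2))
                   if (rr, cc) != (r, c))

    nxt = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = live_neighbors(r, c)
            # birth: exactly 3 black neighbors; survival: 2 or 3
            nxt[r][c] = 1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0
    return nxt
```

For example, a horizontal "blinker" of three black cells turns into a vertical one after a single step.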
Typical Example
The following example shows two consecutive generations of the Game of Life simulated by
CNN.
[Flow-chart: starting from the initial generation (i = 0), each iteration applies GLIFE1.TEM and GLIFE2.TEM to the current generation; the XOR of OUTPUT 1 and OUTPUT 2 gives the (i+1)-th generation, and i is incremented.]
NULL = 0;
NUM_GEN = 9; /* Number of generations */
ONE = 1;
WHITE = -1.0;
TIME = 5;
TIMESTEP = 0.5;
ENDCONST;
CHIP_SET simulator.eng;
A_CHIP
SCALARS
IMAGES
L1: BINARY; /* LLM1 */
L2: BINARY; /* LLM2 */
L3: BINARY; /* LLM3 */
L4: BINARY; /* LLM4 */
ENDCHIP;
E_BOARD
SCALARS
var: INTEGER;
IMAGES
LargeInp: BINARY;
LargeOut: BINARY;
ENDBOARD;
OPERATIONS FROM gameoflife.tms;
FUNCTION game_of_life;
USE (glife1, glife2);
L4 := NULL; /* Zero state */
glife1 (L3, L4, L1, TIME, WHITE); /* Template1 execution */
glife2 (L3, L4, L2, TIME, WHITE); /* Template2 execution */
L4 := L1 XOR L2; /* XOR operation */
ENDFUNCT;
PROCESS freichen; /* here starts the main routine */
USE ();
SwSetTimeStep (TIMESTEP);
HostLoadPic(inputFC, LargeInp);
L4 := LargeInp;
REPEAT var := 1 TO NUM_GEN BY 1;
L3 := L4; /* Reload i-th generation to input */
game_of_life; /* Simulating one generation of the Game of Life */
LargeOut := L4; /* copying the image from chip to board */
HostDisplay(LargeOut, ONE);
ENDREPEAT;
ENDPROCESS;
ENDPROG;
Comments
The templates can be generated by using the TemMaster [38] template design software package.
0 0 0 0 b 0 0 0 0
A2 = 0 0 1 A3 = 0 1 0 A4 = 0 2 0
0 0 0 0 b 0 0 0 0
0 0 0 0 0 0
B2 = 0 a 0 B4 = 0 -1 0
0 0 0 0 0 0
z4 = 0.02
[Plot: the nonlinear interaction weight as a function of vuij − vukl, with levels 0.05, 1 and −0.5.]
Example: the legal codes are compared with the input code; their Hamming distances (4, 1, 2, 3) determine the best match.
OBJECT COUNTING
This algorithm counts the connected objects in a grayscale image. The algorithm is detailed in [11]. The cited templates can be found in this template library.
The flow-chart of the algorithm: grey-scale image → Average template → black-and-white image → Hole-filler template → preprocessed image → Horizontal CCD, which yields the counts Ns and Nc used to obtain the number of objects No.
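As a plain-software reference for the counting step, connected objects in a binary image can be counted with a flood fill; the CNN algorithm arrives at the same number through CCD projections instead. A minimal Python sketch (8-connectivity, black pixels as a coordinate set; all names are illustrative):

```python
def count_objects(black):
    """Count 8-connected components in a set of (row, col) black pixels."""
    remaining = set(black)
    count = 0
    while remaining:
        count += 1
        stack = [remaining.pop()]          # seed a new component
        while stack:
            r, c = stack.pop()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    p = (r + dr, c + dc)
                    if p in remaining:     # grow the component
                        remaining.remove(p)
                        stack.append(p)
    return count
```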
The input and result of the wire break detection analogic CNN algorithm
[Flow-chart: input image (grey scale) → averaging (5 iterations) → horizontal skeletonization and vertical skeletonization (4·w iterations each) → OR, to unify the errors.]
0 0 0 0.5 0 0.125
A= 0 3 0 B= 0.5 0.5 -0.5 z= -1
0 0 0 0.5 0 0.125
I. Global Task
Given: static binary image P
Input: U(t) = P
Remark:
The template HorSkelR (horizontal skeleton from right) can be obtained by rotating HorSkelL by 180°. The VerSkelT and VerSkelB templates (obtained by rotating the HorSkelR and HorSkelL templates by 90°) are used for vertical line skeletonization.
I. Global Task
Given: static binary image P
Input: U(t) = P
Initial State: X(0) = Arbitrary (in the examples we choose xij(0)=0)
Boundary Conditions: Fixed type, uij = 0 for all virtual cells, denoted by [U]=[0]
Output: Y(t); Y(∞) = binary image of the endings of the vertical wires
Remark:
The DeadEndH templates (obtained by rotating the DeadEndV template by 90°) are used to detect the endings of horizontal wires.
Concavities
[Flow-chart: INPUT → Threshold (5τ) → Hollow (50τ) → XOR (5τ) → Erosion (10τ) → OUTPUT. The total is 70τ; if τ = 250 ns, the running time is 17.5 µs.]
0 0 0 1 1 1
A= 0 2 0 B= 1 1 1 z = -8.5
0 0 0 1 1 1
ALPHA source
/* THE PROGRAM DETECTS CONCAVITIES OF OBJECTS */
PROGRAM concave (in; out);
CONSTANT
ONE = 1;
TWO = 2;
WHITE = -1.0;
TIME1 = 50;
TIME2 = 10;
ENDCONST;
/* Chip set definition section */
CHIP_SET simulator.eng;
A_CHIP
SCALARS
IMAGES
c1: BINARY;
c2: BINARY;
c3: BINARY;
c4: BINARY;
ENDCHIP;
E_BOARD
SCALARS
IMAGES
bi1: BINARY;
bi2: BINARY;
ENDBOARD;
OPERATIONS FROM concave.tms;
PROCESS concave;
USE (thres, hollow, erosion);
HostLoadPic(in, bi1);
HostDisplay(bi1, ONE);
c1 := bi1;
thres (c1, c1, c1, TIME1, WHITE);
hollow (c1, c1, c2, TIME1, WHITE);
c3 := c1 XOR c2;
erosion(c3, c3, c4, TIME2, WHITE);
bi2 := c4;
HostDisplay(bi2, TWO);
ENDPROCESS;
ENDPROG;
SCRATCH REMOVAL
On photocopier machines, the glass panel often gets scratched, and the scratch is then copied together with the document, resulting in a visually annoying copy. The following algorithm is capable of removing such scratches, assuming that the location of the scratch is known in advance. This is a valid assumption, since the scratches can be detected automatically, e.g. by copying a blank sheet of paper. The algorithm removes the scratches gradually, peeling off pixels circularly [19].
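The peeling loop can be modelled in ordinary code: in each pass, every scratch pixel that touches at least one known pixel is filled with the average of its known neighbours, until nothing is left of the scratch. A minimal Python sketch (the names and the 4-neighbourhood are illustrative assumptions):

```python
def remove_scratch(img, scratch):
    """img: dict (r, c) -> gray value for known pixels; scratch: set of unknown
    pixel coordinates. Peels the scratch from its rim inwards."""
    img = dict(img)
    scratch = set(scratch)
    while scratch:
        rim = []
        for (r, c) in scratch:
            known = [img[p] for p in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if p in img]
            if known:                      # fillable: has a known neighbour
                rim.append(((r, c), sum(known) / len(known)))
        if not rim:                        # nothing fillable remains
            break
        for p, v in rim:                   # peel one ring of the scratch
            img[p] = v
            scratch.discard(p)
    return img
```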
The flow-chart of the algorithm:
[Flow-chart: a logic XOR marks the peeled pixels; the pixels corresponding to the peeled ones are filled in by averaging their known neighbors; while anything is left from the scratch (remaining unfilled pixels), the loop repeats; otherwise the restored, enhanced image is ready.]
Smoothing:
Selection templates: Fill templates:
0 0 0 -0.5 0 0 0.33 0 0
A1 = 0 1 0 B1 = -0.5 0.5 0 z1 = -1.5 B1 = 0.34 0 0
0 0 0 -0.5 0 0 0.33 0 0
This algorithm finds the knots and fiber breakings in a loosely woven textile. The algorithm is detailed in [10]. The cited templates can be found in this template library. (The lincut7h template is the rotated version of lincut7v (LE7pixelVerticalLineRemover).)
The flow-chart of the algorithm:
[Flow-chart: AVERTRSH template → LINCUT7V template → vertical fibers.]
Example: The algorithm is demonstrated on a piece of table cloth. (a): the original image, (b): the southern ends of the fibers (the inside one indicates the fault), (c): the knot. Image name: textpatt.bmp; image size: 170×145.
220 2. Subroutines and simpler programs
Typical Example
INPUT OUTPUT
Binary image: the erosion template is applied 4 times in a row.
-3.44 0.86 -1.64 -0.16 -1.02 -2.19 -0.23 0.16 -0.63 -0.78
-1.09 0.16 -2.19 -3.2 3.51 1.64 2.27 -3.2 1.09 2.03
A = 2.50 1.56 3.91 2.66 2.42 B= 0.08 0.55 0.86 3.52 0.08 z= 4.8
0.55 2.89 -0.62 0.47 3.67 0.39 -3.83 -3.12 -2.34 -2.11
-1.80 -0.55 2.50 -0.23 2.34 0.78 -2.66 -1.17 -1.41 1.02
0 0 0 0.5 1 0.5
A= 0 1 0 B= 1 1 1 z= -6
0 0 0 0.5 1 0.5
ALPHA source
/* TEXTURE_1.ALF */
/* Separates two predefined types of binary textures */
PROGRAM texture_1(in; out);
CONSTANT
ONE = 1;
TWO = 2;
THREE = 3;
FOUR = 4;
FIVE = 5;
SIX = 6;
BLACK = 1;
TimeStep005 = 0.05;
TimeStep01 = 0.1;
TimeStep03 = 0.3;
TimeStep1 = 1;
TimeStep2 = 2;
TIME1 = 1;
TIME2 = 2;
TIME4 = 4;
TIME03 = 0.3;
ENDCONST;
/* Chip set definition section */
CHIP_SET simulator.eng;
A_CHIP
SCALARS
IMAGES
im1: BINARY;
im2: BINARY;
im3: BINARY;
im4: BINARY;
ENDCHIP;
E_BOARD
SCALARS
Loop: INTEGER;
IMAGES
input: BINARY;
output: BINARY;
ENDBOARD;
/* Definition of analog operation symbol table */
OPERATIONS FROM texture_1.tms;
PROCESS texture_1;
USE (tx_hclc1, smkiller, erosion1);
HostLoadPic(in, input);
HostDisplay(input, ONE);
im1:= input;
SwSetTimeStep (TimeStep005);
tx_hclc1(im1, im1, im2, TIME1, ZEROFLUX);
output:=im2;
HostDisplay(output, TWO);
SwSetTimeStep (TimeStep01);
smkiller (im2, im2, im3, TIME03, BLACK);
output:=im3;
HostDisplay(output, THREE);
SwSetTimeStep (TimeStep01);
REPEAT Loop:= 1 TO 4 BY 1;
erosion1(im3, im3, im3, TIME2, BLACK);
ENDREPEAT;
output:=im3;
HostDisplay(output, FOUR);
SwSetTimeStep (TimeStep01);
smkiller(im3, im3, im4, TIME4, ZEROFLUX);
output:=im4;
HostDisplay(output, FIVE);
ENDPROCESS;
ENDPROG;
Comments
Using the genetic template learning algorithm, a wide range of texture types can be segmented with high accuracy. In order to achieve flat segments, propagating-type filter templates could be used.
[Flow-chart fragment: input image → Smoothing template.]
The Alpha language description of the core of the texture segmentation algorithm is as follows:
ALPHA source
/* TEXTURE_2.ALF */
/* Classifies 4 different textured images */
PROGRAM texture_2(in);
CONSTANT
ONE = 1;
TWO = 2;
THREE = 3;
FOUR = 4;
WHITE = -1.0;
RUNTIME_3 = 3;
TIMESTEP = 0.2;
ENDCONST;
/* Chip description file */
CHIP_SET simulator.eng;
/* Chip variables */
A_CHIP
SCALARS
IMAGES
ci1: ANALOG;
ci2: BINARY;
ENDCHIP;
/* Board variables */
E_BOARD
SCALARS
GLOB_COUNT: REAL;
IMAGES
input: BYTE;
display: BINARY;
ENDBOARD;
/* Template list */
OPERATIONS FROM texture_2.tms;
PROCESS texture_2;
USE (tx_hclc1);
HostLoadPic(in, input);
SwSetTimeStep (TIMESTEP);
ci1:=input;
tx_hclc1 (ci1, ci1, ci2, RUNTIME_3, ZEROFLUX);
display:=ci2;
HostDisplay(display, ONE);
ENDPROCESS;
ENDPROG;
Short Description
One of the characteristic features of objects, on which human recognition is based, is the local curvature. This algorithm detects this property, namely, the locations of a binary image where the local edges are convex from the north. By this method, for example, the wing endings of an airplane can be detected.
In the first part of the algorithm, local shadows are created in the image with appropriate templates in the 35°, 65°, 125°, 155°, −155°, −115°, −65° and −25° directions. The generation of shadows depends on the local curvature of the edges. As a result we get eight images. Then the eight images are grouped four by four, and the logic AND of the four images within each group is performed. With this step we can enhance the direction selectivity of the fill operation. In the next step we take the logic difference of each of these images and the original image. Then the undesired arc locations (whose orientations are orthogonal to the preferred one) are subtracted from the resulting images. This way we get two images containing patches which denote the possible wing endings. In the last phase, shadows are created, starting from these patches, in directions appropriate to the direction represented by each patch. The logic AND of the two shadow images with the arc location images, one by one, yields two images whose union gives the final result.
Some incorrectly detected points can be seen in the resulting image. More sophisticated subtracting and shadowing methods are able to remove these points. The algorithm is invariant to small rotations and distortions.
FILL35:
1 0 1 0 0 0
A= 0 2 0 B= 0 1 0 z= 2
1 1 0 0 0 0
FILL65:
1 0 0 0 2 0
A= 1 2 0 B= 0 0 0 z= 3
0 0 2 0 0 0
FILL125:
1 0 0 0 0 0
A= 0 2 1 B= 0 1 0 z= 2
1 0 1 0 0 0
FILL155:
0 0 2 0 0 0
A= 0 2 0 B= 2 0 0 z= 3
1 1 0 0 0 0
227
FILL-155:
0 1 1 0 0 0
A= 0 2 0 B= 0 1 0 z= 2
1 0 1 0 0 0
FILL-115:
2 0 0 0 0 0
A= 0 2 1 B= 0 0 0 z= 3
0 0 1 0 2 0
FILL-65:
1 0 1 0 0 0
A= 1 2 0 B= 0 1 0 z= 2
0 0 1 0 0 0
FILL-25:
0 1 1 0 0 0
A= 0 2 0 B= 0 0 0 z= 3
2 0 0 0 2 0
SHADOW90:
0 -1 0 0 0 0
A= 0.3 2 0.3 B= 0 1.4 0 z= 2.5
0.4 1 0.4 0 0 0
SHADOW270:
0.4 1 0.4 0 0 0
A= 0.3 2 0.3 B= 0 1.4 0 z= 2.5
0 -1 0 0 0 0
LOGANDN:
0 0 0 0 0 0
A= 0 2 0 B= 0 -1 0 z= -1
0 0 0 0 0 0
The other templates used in the algorithm are available in this library.
Typical Example
Input image
After subtracting
(logic AND) the orthogonal
directions
The result
(masked with the original)
[Flow-chart: the input image is processed by the fill35 … fill−25 templates; within each group the outputs are combined as A AND NOT B1 AND NOT B2; the SHADOW90 and SHADOW270 templates create shadows; logic AND operations with the arc images, followed by a logic OR, yield the result image.]
ALPHA source
/* airplane.ALF */
/* Performs airplane wing ending detection */
IMAGES
input: BINARY; /* input */
arc35: BINARY; /* arc */
arc65: BINARY; /* arc */
arc125: BINARY; /* arc */
arc155: BINARY; /* arc */
arc_35: BINARY; /* arc */
arc_65: BINARY; /* arc */
arc_125: BINARY; /* arc */
arc_155: BINARY; /* arc */
wing_up: BINARY;
wing_left: BINARY;
wing_down: BINARY;
wing_right: BINARY;
shadow_up: BINARY;
shadow_down: BINARY;
shadow_intrsct: BINARY;
output: BINARY;
ENDCHIP;
IMAGES
ENDBOARD;
PROCESS airplane;
USE (fill35, fill65, fill125, fill155, fill_115, fill_65, fill_25, fill_155, logdif, smkiller, shadow0,
shadow90, shadow180, shadow270);
SwSetTimeStep(TS);
HostLoadPic(in, input);
HostDisplay(input, ONE);
/* Distance classification*/
shadow_intrsct:=shadow_up AND shadow_down;
HostDisplay(shadow_intrsct, EIGHT);
ENDPROCESS;
ENDPROG;
This algorithm detects the possible location of a pedestrian crosswalk in an image and, based on that, estimates the
Input Parameters
U Input image
RGB filter intervals Possible RGB values of the road surface
Output Parameters
UMF diagram
Chapter 3. IMPLEMENTATION ON PHYSICAL
CELLULAR MACHINE
The chapter will be organized as follows. First, seven different basic architectures are briefly
described. Then, the operators are grouped according to their execution methods on the different
architectures. It is followed by the analysis of the implementation. Finally, an architecture selection
guide is shown.
32×1 segment of the image. The algorithm is simple: the DSP has to prepare the 9 different
operands, and apply bit-wise OR operations on them.
Figure 1 shows the generation method of the first three operands. In the figure a 32×3
segment of a binary image is shown (9 times), as it is represented in the DSP memory. Some
fractions of the horizontally neighboring segments are also shown. The first operand can be calculated
by shifting the upper line by one bit position to the left and filling in the empty MSB with the
LSB of the word from its right neighbor. The second operand is the un-shifted upper line. The
position and the preparation of the remaining operands are also shown in Figure 1a.
[Figure 1(a): the shifted upper, central and lower lines form the operands o1 … o9, combined by OR. (b): the 3×3 neighborhood. (c): e1 = o1 OR o2 OR o3 OR o4 OR o5 OR o6 OR o7 OR o8 OR o9.]
Figure 1. Illustration of the binary erosion operation on a DSP. (a) shows the 9 pieces of
32×1 segments of the image (operands), as the DSP uses them. The operands are
the shaded segments. The arrows indicate the shifting of the segments. To make it
clearer, consider a 3×3 neighborhood as shown in (b). For one pixel, the form
of the erosion calculation is shown in (c); o1, o2, …, o9 are the operands. The DSP
does the same, but on 32 pixels in parallel.
This means that we have to apply 10 memory accesses, 6 shifts, 6 replacements, and 8 OR
operations to execute a binary morphological operation on 32 pixels. Due to the multiple cores
and the internal parallelism, the Texas DaVinci spends 0.5 clock cycles on the calculation of
one pixel.
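The word-level operand preparation can be sketched in ordinary code. Below is a minimal Python model of the scheme in Figure 1 (bit W−1 taken as the leftmost pixel of a segment; whether the OR combination acts as an erosion or a dilation depends on the polarity of the pixel coding, and all function names are illustrative). It performs exactly the counted work: 6 shifts, 6 bit replacements from the neighbor words, and the OR of the nine operands.

```python
W = 32
MASK = (1 << W) - 1

def shift_seg_left(word, right_word):
    """Shift a 32-pixel segment one pixel left; the vacated rightmost pixel
    is filled from the leftmost pixel of the right-neighbor segment."""
    return ((word << 1) & MASK) | (right_word >> (W - 1))

def shift_seg_right(word, left_word):
    """Shift one pixel right; the vacated leftmost pixel comes from the
    rightmost pixel of the left-neighbor segment."""
    return (word >> 1) | ((left_word & 1) << (W - 1))

def combine_3x3(rows, left, right):
    """rows, left, right: (upper, central, lower) words of a segment and of
    its left/right neighbors. ORs the nine operands of Figure 1(c)."""
    out = 0
    for row, lw, rw in zip(rows, left, right):
        out |= shift_seg_left(row, rw) | row | shift_seg_right(row, lw)
    return out
```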
In low-power, low-cost embedded DSP technology the trend is to increase the clock frequency further, but most probably not beyond 1 GHz, otherwise the power budget cannot be kept. Moreover, a drawback of these DSPs is that their cache memory is too small, and it cannot be increased significantly without a significant cost increase. The only way to increase the speed significantly is to implement a larger number of processors; however, that requires a new way of algorithmic thinking, and new software tools.
The DSP-memory architecture is the most versatile from the point of view of both functionality and programmability. It is easy to program, and there is no limit on the size of the processed images, though it is important to mention that when an operation is executed on an image stored in the external memory, its execution time increases roughly by an order of magnitude. Though the DSP-memory architecture is considered to be very slow, as is shown
later, it outperforms even the processor arrays in some operations. At QVGA frame size, it can solve quite complex tasks, such as video analytics in security applications, at video rate [95]. Its power consumption is in the 1-3 W range. Relatively small systems can be built by using this architecture. The typical chip count is around 16 (DSP, memory, flash, clock, glue logic, sensor, 3 near-sensor components, 3 communication components, 4 power components), while this can be reduced to half in a very basic system configuration.
Pass-through architectures
The basic idea of the pass-through architecture is to process the images line-by-line, and to minimize both the internal memory capacity and the external IO requirements. Most of the early image processing operations are based on 3×3 neighborhood processing, hence 9 image data are needed to calculate each new pixel value. However, these 9 data would require a very high data throughput from the device. As we will see, this requirement can be significantly reduced by applying a smart feeder arrangement.
Figure 2 shows the basic building blocks of the pass-through architecture. It contains two parts,
the memory (feeder) and the neighborhood processor. Both the feeder and the neighborhood
processor can be configured 8 or 1 bit/pixel wide, depending on whether the unit is used for
grayscale or binary image processing. The feeder contains, typically, two consecutive whole rows
and a row fraction of the image. Moreover, it optionally contains two more rows of the mask
image, depending on the input requirements of the implemented neighborhood operator. In each
pixel clock period, the feeder provides 9 pixel values for the neighborhood processor and the
mask value optionally if the operation requires it. The neighborhood processor can perform
convolution, rank order filtering, or other linear or nonlinear spatial filtering on the image
segment in each pixel clock period. Some of these operators (e.g., hole finder, or a CNN
emulation with A and B templates) require two input images. The second input image is stored in
the mask. The outputs of the unit are the resulting and, optionally, the input and the mask images.
Note that the unit receives and releases synchronized pixel flows sequentially. This enables cascading multiple pieces of the described units. The cascaded units form a chain. In such a chain, only the first and the last units require external data communication; the rest of them receive data from the previous member of the chain and release the output towards the next one.
An advantageous implementation of the row storage is the application of FIFO memories, where
the first three positions are tapped to be able to provide input data for the neighborhood
processor. The last positions of rows are connected to the first position of the next row (Figure
2). In this way, pixels in the upper rows are automatically marching down to the lower rows.
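The FIFO feeder idea can be sketched in ordinary code: two full rows are buffered, and as each new row streams in, a 3×3 window slides over the three available rows. A minimal Python sketch (border handling omitted; all names are illustrative):

```python
def pass_through(rows, op):
    """rows: an iterator of equal-length pixel rows, arriving one per line time.
    op: a neighborhood operator mapping a 3x3 window to one output pixel.
    Yields one output row per input row from the third row on (interior only)."""
    buf = []                               # the two-rows-and-a-fraction feeder
    for row in rows:
        buf.append(row)
        if len(buf) == 3:
            width = len(row)
            yield [op([r[x - 1:x + 2] for r in buf])
                   for x in range(1, width - 1)]
            buf.pop(0)                     # oldest row marches out of the FIFO
```

With `op = lambda w: min(min(r) for r in w)`, for example, this computes a binary erosion of the streamed image.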
The neighborhood processor is of special purpose: it can implement one or a few different kinds of operators with various attributes and parameters. These can be convolution, rank-order filters, grayscale or binary morphological operations, or any local image processing function (e.g. Harris corner detection, Laplace operator, gradient calculation, etc.). In the CASTLE [99][98] and Falcon [91] architectures, e.g., the processors are dedicated to convolution processing, where the template values are the attributes. The pixel clock is matched with that of the applied sensor. In the case of a 1-megapixel frame at video rate (30 FPS), the pixel clock is about 30 MHz (depending on the readout protocol). This means that all parts of the unit should be able to operate at least at this clock frequency. In some cases the neighborhood processor operates at an integer multiple of this frequency, because it might need multiple clock cycles to complete a complex calculation, such as a 3×3 convolution. Considering ASIC or FPGA implementations, a clock frequency between 100-300 MHz is a feasible target for the neighborhood processors within a tolerable power budget.
The multi-core pass-through architecture is built up from a sequence of such processors. The processor arrangement follows the flow-chart of the algorithm. In case of multiple iterations of the same operation, we need to apply as many processor kernels as iterations.
This easily ends up requiring a few dozen kernels. Fortunately, these kernels, especially in the black-and-white domain, are relatively inexpensive, either on silicon or in FPGA. Depending on the application, the data-flow may contain either sequential segments or parallel branches. It is important to emphasize, however, that the frame scanning direction cannot be changed unless the whole frame is buffered, which can be done in external memory only. This introduces a relatively long (dozens of milliseconds) additional latency.
[Figure 2 blocks: a Feeder holding two rows of the image to be processed (FIFO) and, optionally, two rows of the mask image (FIFO) receives the data in and provides 9 pixel values per clock to the low-latency 3×3 neighborhood processor, which produces the data out.]
Figure 2. One processor and its memory arrangement in the pass-through architecture.
For capability analysis, here we use the Spartan 3ADSP FPGA (XC3SD3400A) from Xilinx as a
reference, because this low-cost, medium performance FPGA was designed especially for
embedded image processing. It is possible to implement roughly 120 grayscale processors within
this chip, as long as the image row length is below 512, or 60 processors, when the row length is
between 512 and 1024.
AD converter; an 8-bit processor with 512 bytes of memory; and a communication unit of local and global connections. The processor can handle images in 1, 8, and 16 bit/pixel representations; however, it is optimized for 1 and 8 bit/pixel operations. Each processor can execute addition, subtraction, multiplication, multiply-add operations, and comparison in a single clock cycle on 8 bit/pixel data. It can also perform 8 logic operations on 1 bit/pixel data in packed-operation mode in a single cycle. Therefore, in binary mode, one line of the 8×8 sub-array is processed jointly, similarly to the way we have seen in the DSP. However, the Xenon chip supports the data shifting and swapping in hardware, which means that the operation sequence that we have seen in Figure 1 takes only 9 clock cycles. (The swapping and the accessing of the neighbors' memory do not need extra clock cycles.) Besides the local processor core functions, Xenon can also perform a global OR function. The processors in the array are driven in a single instruction multiple data (SIMD) mode.
[Figure: the XENON chip — an array of cells (C), each with a processor (P), plus a scheduler, external I/O and address generator, and a multiplexed AD converter.]
processors can handle two types of data (image) representations: grayscale and binary. The instruction set of these processors includes addition, subtraction, scaling (with a few discrete factors only), comparison, thresholding, and logic operations. Since it is a discrete time architecture, the processing is clocked. Each operation takes 1-4 clock cycles. The individual cells can be masked. Basic spatial operations, such as convolution, median filtering, or erosion, can be put together as sequences of these elementary processor operations. In this way the clock cycle count of a convolution, a rank order filtering, or a morphologic filter is between 20 and 40, depending on the number of weighting coefficients.
It is important to note that in the case of the discrete time architectures (both coarse- and fine-grain), the operation set is more elementary (lower level) than on the continuous time cores (see the next section). While in the continuous time case (CNN-like processors) the elementary operations are templates (convolution, or feedback convolution) [77][78], in the discrete time case the processing elements can be viewed as RISC (reduced instruction set) processor cores with addition, subtraction, scaling, shift, comparison, and logic operations. When a full convolution is to be executed, the continuous time architectures are more efficient. In the case of operations where both architectures apply a sequence of elementary instructions in an iterative manner (e.g., rank order filters), the RISC approach is superior, because its elementary operators are more versatile, more accurate, and faster.
The internal analog data representation has both architectural and functional advantages. From an architectural point of view, the most important feature is that no AD converter is needed at the cell level, because the sensed optical image can be directly saved in the analog memories, leading to significant silicon space savings. Moreover, the analog memories require a smaller silicon area than the equivalent digital counterparts. From the functional point of view, the topographic analog and logic data representations make the implementation of efficient diffusion, averaging, and global OR networks possible.
The drawback of the internal analog data representation and processing is the signal degradation
during operation or over time. According to experience, accuracy degradation was more
significant in the old ACE16k design [84] than in the recent Q-Eye [94] or SCAMP [90] ones.
While in the former case 3-5 grayscale operations led to significant degradation, in the latter
ones even 10-20 grayscale operations conserve the original image features. This makes it
possible to implement complex nonlinear image processing functions (e.g., rank order filters) on
discrete time architectures, while it is practically impossible on the continuous ones (ACE16k).
The two representatives of discrete time solutions, SCAMP and Q-Eye, are similar in
design. The SCAMP chip was fabricated using 0.35 micron technology. The cell array size is
128×128. The cell size is 50×50 micron, and the maximum power consumption is about 200 mW
at 1.25 MHz clock rate. The array of the Q-Eye chip has 144×176 cells. It was fabricated on 0.18
micron technology. The cell size is about 30×30 micron. Its speed and power consumption range
is similar to that of the SCAMP chip. Both the SCAMP and Q-Eye chips are equipped with single-
step mean, diffusion, and global OR calculator circuits. The Q-Eye chip also provides hardware
support for single-step binary 3×3 morphologic operations.
waves in a programmable way. While the output of the first one (ACE-16k [84]) can be in the
grayscale domain, the output of the second one (ACLA [87][88]) is always in the binary domain.
The ACE-16k [84] is a classical CNN Universal Machine type architecture equipped with
feedback and feed-forward template matrices [78], sigmoid type output characteristics,
dynamically changing state, optical input, local (cell level) analog and logic memories, local
logic, and a diffusion and averaging network. It can perform full-signal-range type CNN operations
[79]. Therefore, it can be used in retina simulations and other spatial-temporal dynamical system
emulations as well. Its typical feed-forward convolution execution time is in the 5-8
microsecond range, while the cell-to-cell wave propagation time is up to 1 microsecond.
Though its internal memories, easily re-programmable convolution matrices, logic operations,
and conditional execution options make it attractive at first sight as a general purpose high-
performance sensor-processor chip, its limited accuracy, large silicon area
occupation (~80×80 micron/cell on 0.35 micron 1P5M STM technology), and high power
consumption (4-5 Watts) prevent its immediate usage in various vision application areas.
The other architecture in this category is the Asynchronous Cellular Logic Array (ACLA) [87],
[88]. This architecture is based on spatially interconnected logic gates with some cell-level
asynchronous control mechanisms, which allow ultra high-speed binary spatial wave
propagation only. Typical binary functionalities implemented on this network are trigger wave,
reconstruction, hole finder, shadow, etc. Assuming a more sophisticated control mechanism at the
cell level, it can even perform skeletonization or centroid calculations. The implementation is
based on a few minimal-size logic transistors, which makes these arrays hyper-fast, extremely small,
and power-efficient. They can reach 500 ps/cell wave propagation speed, with 0.2 mW power
consumption for a 128×128 sized array. Their very small area requirement (16×8 micron/cell on
0.35 micron 3M1P AMS technology) makes them a good choice to be implemented as a co-
processor in any fine-grain array processor architecture.
point numbers or eight 16-bit integers. The SPEs also support logic operations; they can handle
up to 128 bits in one single step. The SPEs can only address their local 256 kB SRAM memory,
while they can access the main memory of the system by DMA instructions.
(Figure: the CELL processor with eight SPEs, each with a 256 kB local store (LS), connected by 16 B/cycle links to a 128 B/cycle ring bus.)
3. Implementation on physical cellular machine
Categorization of 2D operators
Due to their different spatial-temporal dynamics, different 2D operators require different
computational approaches. The categorization (Figure 6) was done according to their
implementation methods on different architectures. It is important to emphasize that we
categorize operators (functionalities) here, rather than wave types, because the wave type is not
necessarily inherent to the operator itself, but rather to its implementation method on a
particular architecture. As we will see, the same operator is implemented with different spatial
wave dynamic patterns on different architectures. The most important 2D operators, including all
the CNN operators [97], are considered here.
The first distinguishing feature is the location of active pixels [97]. If the active pixels are located
along one or a few one-dimensional stationary or propagating curves at a time, we call the operator
front-active. If the active pixels are everywhere in the array, we call it area-active.
The common property of the front-active propagations is that the active pixels are located only at
the propagating wave fronts [80]. This means that at the beginning of the wave dynamics
(transient) some pixels become active, while others remain passive. The initially active pixels may
initialize wave fronts which start propagating. A propagating wave front can activate some
further passive pixels; this is the mechanism by which the wave proceeds. However, pixels apart
from a wave front cannot become active [97]. This theoretically enables us to compute only the
pixels which are along the front lines, and not waste effort on the others. The question is which
architectures can take advantage of such spatially selective computation.
(Figure 6: categorization tree of the 2D operators, with front-active operators split into content-dependent and content-independent classes.)
image. Their common feature is that they reduce the dimension of the input 2D matrices to
vectors (CCD, shadow, profile, histogram) or scalars (global maximum, global average, global
OR). It is worth mentioning that on the coarse- and fine-grain topographic array processors the
shadow, profile, and CCD are content-dependent operators, and the number of iterations (or the
analog transient time) depends on the image content only. The operation is completed when the
output ceases to change. Generally, however, it is less efficient to include a test to detect a
stabilized output than to let the operator run for as many cycles as it would run in the worst case.
The area-active operator category contains the operators where all the pixels are to be updated
continuously (or in each iteration). A typical example is heat diffusion. Some of these operators
can be solved in a single update of all the pixels (e.g., all the CNN B templates [102]), while
others need a limited number of updates (halftoning, constrained heat diffusion, etc.).
The fine-grain architectures update every pixel location fully in parallel in each time instance.
Therefore, the area-active operators are naturally the best fit for these computing architectures.
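A single fully parallel update of an area-active operator can be sketched as follows; this is an illustrative numpy stand-in (the function name `diffusion_step`, the coupling factor `lam`, and the replicated-edge boundary are assumptions, not a specific chip's diffusion network):

```python
import numpy as np

def diffusion_step(state, lam=0.2):
    """One fully parallel update of an area-active operator (heat
    diffusion): every cell moves toward the mean of its four
    neighbors; boundary values are replicated (edge padding)."""
    p = np.pad(state, 1, mode='edge')
    neigh_mean = (p[:-2, 1:-1] + p[2:, 1:-1] +
                  p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    return state + lam * (neigh_mean - state)
```

Since every pixel does useful work in every update, a fine-grain array runs such a step at full processor utilization.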
(Figure 7: frame overwriting versus pixel overwriting, with a row-wise, left-to-right, top-to-down sequence.)
overwriting), or to wait until the new state value is calculated for all the pixels in the frame
(frame overwriting). In this context, update means the calculation of the new state for an entire
frame. Figure 7 and Figure 8 illustrate the difference between the two overwriting schemes. In
the case of an execution-sequence-variant operation, the result depends on the overwriting
scheme.
Here the calculation is done pixel-wise, left to right, and row-wise, top to down. As we can see,
overwriting each pixel before the next pixel's state is calculated (pixel overwriting) speeds up the
propagation in the directions corresponding to the direction in which the calculation proceeds.
Based on the above, it is easy to draw the conclusion that the two updating schemes lead to two
completely different propagation dynamics, and to different final results in execution-variant cases.
Frame overwriting is slower but controlled; pixel overwriting is faster but uncontrolled. The latter
can be used when speed maximization is the only criterion, while the former is needed when the
shape and the dynamics of the propagating wave front count. We call the operators whose result
does not depend on this choice execution-sequence-invariant operators, and the sensitive ones
execution-sequence-variant operators (Figure 6).
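The two schemes can be contrasted on a simple binary growing wave; this is a minimal sketch (the names `grow_frame` and `grow_pixel` are hypothetical, and the mask-constrained OR-growth stands in for any trigger-wave-like operator):

```python
import numpy as np

def grow_frame(state, mask):
    """Frame overwriting: every new value is computed from the old
    frame only, so the front advances one pixel per update."""
    p = np.pad(state, 1)
    grown = state | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    return grown & mask                     # mask: where growth is allowed

def grow_pixel(state, mask):
    """Pixel overwriting: pixels are updated row-wise, left to right,
    top to down; each freshly written value is visible to the pixels
    processed after it, so the wave can sweep across the whole frame
    in the down/right directions within a single update."""
    s = state.copy()
    rows, cols = s.shape
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    s[r, c] |= s[rr, cc]    # up/left are new, down/right old
    return s
```

Starting from a single seed pixel, `grow_frame` moves the front one pixel per update, while `grow_pixel` fills everything reachable in the sweep direction in one update.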
(Figure 8: frame overwriting versus pixel overwriting, with a row-wise, left-to-right, top-to-down sequence.)
of the propagation of the calculation, without paying a significant penalty for it in memory size
and latency time.
A special feature of content-dependent operators is that the path, and the length of the path, of the
propagating wave front depend drastically on the image content itself. For example, the number of
necessary frame overwritings for a hole finder operation varies from zero to n/2
on a fine-grain architecture, assuming an n×n pixel array size. Hence, neither the propagation time
nor the efficiency can be calculated without knowing the actual image.
Since the gap between the worst and best case is extremely large, it is not meaningful to provide
these limits. Rather, it makes more sense to provide approximations for certain image types. But
before that, we examine how to implement these operators on the studied architectures. For this
purpose, we will use the hole finder operator as an example. Here we will clearly see how the
wave propagation follows different paths, as a consequence of the propagation speed varying
with direction. Since this is an execution-sequence-invariant operation, it
is certain that wave fronts with different trajectories lead to the same correct result.
The hole finder operation that we will study here is a grass-fire operation, in which the fire
starts from all the boundaries at the beginning of the calculation, and the boundaries of the
objects behave like firewalls. In this way, at the end of the operation, only the holes inside the
objects remain unfilled.
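The grass-fire scheme above can be sketched with frame overwriting in a few lines of numpy; `find_holes` is a hypothetical name, and the iterate-until-stable loop makes the content-dependent run time explicit:

```python
import numpy as np

def find_holes(objects):
    """Grass-fire hole finder: the fire is lit on every background
    pixel of the image boundary and spreads through the background;
    object pixels act as firewalls. Background pixels the fire never
    reaches are the holes. `objects` is a 0/1 array (1 = object)."""
    fire = np.zeros_like(objects)
    fire[0, :] = fire[-1, :] = fire[:, 0] = fire[:, -1] = 1
    fire &= 1 - objects                       # fire only on background
    while True:                               # content-dependent: run
        p = np.pad(fire, 1)                   # until the wave stops
        grown = fire | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
        grown &= 1 - objects                  # firewalls block the fire
        if np.array_equal(grown, fire):
            break
        fire = grown
    return 1 - (objects | fire)               # unburnt background = holes
```

The number of loop iterations is exactly the number of frame overwritings discussed in the following paragraphs.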
The hole finder operation may propagate in any direction. On a fine-grain architecture the wave
fronts propagate by one pixel in each update. Since the wave fronts start from all the edges,
they meet in the middle of the image in typically n/2 updates, unless there are large structured
objects with long bays which fold the grass fire into long paths. In the case of text, for
example, where there are relatively small non-overlapping objects (with diameter k) with large
but not spiral-like holes, the wave stops after n/2+k updates. In the case of an arbitrary camera
image with an outdoor scene, in most cases 3n updates are enough to complete the operation,
because the image may easily contain large objects blocking the straight paths of the wave front.
On a pass-through architecture, thanks to the pixel overwriting scheme, the first update fills up
most of the background (Figure 9). Filling in the remaining background typically requires k
updates, assuming a largest concavity size of k pixels. This means that on a pass-through
architecture roughly k+1 updates are enough, considering small, non-overlapping objects of size
k.
Figure 9. Hole finder operation calculated with a pass-through architecture. (a): original
image. (b): result of the first update. (The freshly filled-up areas are indicated
in grey, just to make the figure more comprehensible; on the black-and-white
image they are black, the same as the objects.)
On the coarse-grain architecture we can also apply the pixel overwriting scheme within the
N×N sub-arrays (Figure 10). Therefore, within a sub-array, the wave front can propagate in the
same way as in the pass-through architecture. However, it cannot propagate beyond the
boundary of the sub-array in a single update. In this way, the wave front can propagate N
positions per update in the direction that corresponds to the calculation direction, and one pixel
per update in the other directions. Thus, in n/N updates, the wave front can propagate n
positions in the supported directions. However, the k-sized concavities in the other directions would
require k more updates. To avoid these extra updates, without compromising the speed of the wave
front, we can switch between the top-down and the bottom-up calculation directions after each
update. The resulting wave-front dynamics is shown in Figure 11. This means that for an image
containing only a few non-overlapping small objects with concavities, we need about n/N+k updates
to complete the operation.
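The direction-switching trick can be sketched as alternating pixel-overwriting sweeps; `fill_sweep` and `fill_alternating` are hypothetical Python stand-ins for the coarse-grain update, not its firmware:

```python
import numpy as np

def fill_sweep(state, mask, top_down=True):
    """One pixel-overwriting update; the row order sets the direction
    in which the wave can travel many pixels within a single update."""
    s = state.copy()
    rows, cols = s.shape
    order = range(rows) if top_down else range(rows - 1, -1, -1)
    for r in order:
        for c in range(cols):
            if not mask[r, c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    s[r, c] |= s[rr, cc]
    return s

def fill_alternating(state, mask, updates):
    """Switch between top-down and bottom-up sweeps after each update,
    as suggested for the coarse-grain architecture, so concavities
    facing the unsupported direction do not cost k extra updates."""
    for i in range(updates):
        state = fill_sweep(state, mask, top_down=(i % 2 == 0))
    return state
```

A seed in the bottom row reaches the top of the frame after just two alternating updates, whereas repeated top-down sweeps would advance it upward only one row per update.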
N pixels
n pixels
Figure 10. Coarse-grain architecture with nn pixels. Each cell is to process an NN pixel
sub-array.
The DSP-memory architecture offers several choices, depending on the internal structure of the
image. The simplest is to apply the pixel overwriting scheme and switch the direction of the
calculation. In the case of binary image representation, only the vertical directions (up or down) can
be efficiently selected, due to the packed 32-pixel line segment storage and handling. In this way
the clean vertical segments (columns of background with at most one object) are filled up after
the second update, while filling up the horizontal concavities would require k updates.
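Why the vertical directions are the efficient ones in the packed representation can be shown with a small sketch (the name `fill_down_packed` and the one-word-per-row layout are illustrative assumptions):

```python
import numpy as np

def fill_down_packed(packed):
    """Binary rows stored as packed 32-bit words (32 pixels per word):
    a downward fill is one word-wide OR per row, so all 32 pixels of a
    segment are handled by a single machine operation. Horizontal
    propagation would need per-bit shifts and masks instead."""
    out = packed.copy()
    for r in range(1, out.shape[0]):
        out[r] |= out[r - 1]              # 32 pixels ORed at once
    return out
```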
The CELL architecture can be considered as the combination of the pass-through architecture
and the DSP. Each SPE buffers 2j+1 consecutive lines of the image, executes j
updates, and sends the lines over to the next SPE. The number of updates depends on
the processor load of the operator: if the operator is simple, then multiple updates are needed,
otherwise the data transfer between the processors would cause a bottleneck. Since there are bus-rings
in both directions, two data flows can be started in parallel. The upper part of the image can be
processed and passed from left to right, while the lower part can be processed and passed from
right to left. In this way, the wave front starts propagating from both the top and the bottom in
parallel. The result is calculated in (k+1)/8 updates, similarly to the pass-through case. The
eightfold speedup comes from the number of SPEs processing the image in parallel.
Figure 11. Hole finder operation calculated in a coarse-grain architecture. The first picture
shows the original image. The rest shows the sequence of updates, one after the
other. The freshly filled-up areas are indicated with grey (instead of black) to
make it easier to follow the dynamics of calculation.
In the 1D content-independent front-active category, we use the vertical shadow (north to south)
operation as an example. In this category, varying the orientation of the propagation may cause
drastic efficiency differences on the non-topographic architectures.
On a fine-grain discrete time architecture the operator is implemented in such a way that in each time
instance, each processor checks the value of its upper neighbor. If it is +1 (black), it
changes its state to +1 (black), otherwise the state does not change. This can be implemented in
one single step: each cell executes an OR operation with its upper neighbor and
overwrites its state with the result. This means that in each time instance the processor array
executes n² operations, assuming an n×n pixel array size.
In discrete time architectures, each time instance can be considered as a single iteration. In each
iteration the shadow wave front moves by one pixel to the south, that is, we need n steps for the
wave front to propagate from the top row to the bottom (assuming a boundary condition above the
top row). In this way, the total number of operations executed during the calculation is n³.
However, the strictly required number of operations is n², because it is enough to do these
calculations at the wave front only, once in each row, starting from the top row and going down
row by row, rolling the results over from the front line to the next one. In this way, the efficiency
of the processor utilization in vertical shadow calculation in the case of fine-grain discrete time
architectures is
η = 1/n (2)
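The two implementations contrasted above can be sketched side by side; the function names are illustrative, and the full-array version models one parallel array step per loop iteration:

```python
import numpy as np

def shadow_south_finegrain(img, steps):
    """Fine-grain implementation: in every step each cell ORs in its
    upper neighbor, so all n*n processors fire but the shadow front
    advances only one row -- n steps, n^3 operations in total."""
    s = img.copy()
    for _ in range(steps):
        s[1:] = s[1:] | s[:-1]      # one parallel step of the array
    return s

def shadow_south_scan(img):
    """The strictly necessary n^2 operations: roll the front down the
    rows once, top to bottom."""
    s = img.copy()
    for r in range(1, s.shape[0]):
        s[r] |= s[r - 1]
    return s
```

Both produce the same shadow; the ratio of the operation counts, n²/n³, is exactly the 1/n efficiency of equation (2).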
exploited due to the inadequate amount of data transfer. In the case of the horizontal shadow, the
processor load enables the usage of multiple SPEs.
The operators belonging to the 2D content-independent front-active category require simple
scanning of the frame. In the global max operation, for example, the actual maximum value is
passed from one pixel to the next. After all the pixels have been scanned, the last pixel carries the
global maximum pixel value.
In fine-grain architectures this can be done in two phases. First, in n comparison steps, each
pixel takes over the value of its upper neighbor if that is larger than its own value. After n steps,
each pixel in the bottom row contains the largest value of its column. Then, in the second phase,
after the next n horizontal comparison steps, the global maximum appears at the end of the
bottom row. Thus, obtaining the final result requires 2n steps. However, as a fine-grain
architecture executes n×n operations in each step, the total number of executed operations is
2n³, while the minimum number of operations required to find the largest value is only n².
Therefore, the efficiency in this case is:
η = 1/(2n) (6)
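The two-phase scheme can be sketched with numpy, each loop iteration standing for one parallel comparison step of the array (the function name is illustrative):

```python
import numpy as np

def global_max_two_phase(img):
    """Fine-grain global maximum: vertical comparison steps push the
    column maxima into the bottom row, then horizontal steps push the
    row maximum to its last element -- about 2n steps, each executing
    n*n parallel compares."""
    s = img.astype(float)                         # working copy
    for _ in range(s.shape[0] - 1):               # phase 1: vertical
        s[1:] = np.maximum(s[1:], s[:-1])
    for _ in range(s.shape[1] - 1):               # phase 2: bottom row
        s[-1, 1:] = np.maximum(s[-1, 1:], s[-1, :-1])
    return s[-1, -1]
```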
The most frequently used operation in this category is the global OR. To speed up this operation in
the fine-grain arrays, a global OR net is usually implemented [84][78]. This n×n input OR gate
requires minimal silicon space, and enables us to calculate the global OR in a single step (a few
microseconds).
When a fine-grain architecture is equipped with a global OR, the global
maximum can be calculated as a sequence of iterated threshold and global OR operations with the
interval halving (successive approximation) method, applied in parallel to the whole array. This
means that a global threshold is applied first to the whole image at level 1/2, and if there are
pixels which are larger than this, the next global thresholding is done at 3/4, otherwise at 1/4, and so on.
Assuming 8-bit accuracy, this means that the global maximum can
be found in 8 iterations (16 operations). The efficiency is much better in this case:
η = 1/16
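The successive approximation loop can be written down directly; here `global_or` is a software stand-in for the single-step wired OR of the array, and the function names are illustrative:

```python
import numpy as np

def global_or(binary):
    """Stand-in for the single-step hard-wired global OR network."""
    return bool(binary.any())

def global_max_sar(img, bits=8):
    """Global maximum by interval halving: threshold the whole image,
    ask the global OR whether any pixel survived, and settle one bit
    of the result per iteration -- `bits` threshold + global OR pairs,
    starting at the half-range level (128 for 8-bit data)."""
    level = 0
    for b in reversed(range(bits)):
        trial = level | (1 << b)
        if global_or(img >= trial):   # does any pixel reach this level?
            level = trial
    return level
```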
In coarse-grain architectures, each cell calculates the global maximum of its sub-array in N²
steps. Then n/N vertical steps follow, and finally n/N horizontal steps, to find the largest value in
the entire array. The total number of steps in this case is N² + 2n/N, and in each step (n/N)²
operations are executed. The efficiency is:
η = n² / [(N² + 2n/N) · (n/N)²] = 1/(1 + 2n/N³) (7)
Since the sequence of the execution does not matter in this category, it can be solved with 100%
efficiency on the pass-through, the DSP-memory, and the Cell architectures.
We have to note, however, that this task is memory bandwidth limited on the CELL architecture.
The area-active operators require some computation in each pixel in each update; hence, all the
architectures work with 100% efficiency. Since the computational load is very high here, this
category is the most advantageous for the many-core architectures, because the speed advantage
of the many processors can be efficiently utilized.
As we have stated in the previous section, front-active wave operators run well under 100%
efficiency on topographic architectures, since only the wave fronts need calculation, and the
processors of the array in non-wave-front positions do dummy cycles only, or may be switched
off. On the other hand, the computational capability (GOPS) and the power efficiency (GOPS/W)
of multi-core arrays are significantly higher than those of DSP-memory architectures. In this
section, we show the efficiency figures of these architectures in the different categories. To make a fair
comparison with relevant industrial devices, we have selected three market-leading video
processing units: a DaVinci video processing DSP from Texas Instruments (TMS320DM6443)
[93], a Spartan 3 DSP FPGA from Xilinx (XC3SD3400A) [103], and the GTX280 from
NVIDIA [104]. The functionalities, capabilities, and prices of all three products were
optimized to efficiently perform embedded video analytics.
Table I summarizes the basic parameters of the different architectures, and indicates the
processing times of a 3×3 convolution and a 3×3 erosion. To make the comparison easier, the values
are calculated for images of 128×128 resolution. For this purpose, we considered 128×128
Xenon and Q-Eye chips. Some of these data are from catalogues, others are from
measurements or estimations. As fine-grain architecture examples, we included both the SCAMP
and Q-Eye architectures.
As we can see from Table I, the DSP was implemented on 90 nm, while the FPGA, the GPU, and the
CELL on 65 nm technologies. In contrast, Xenon, Q-Eye, and SCAMP were implemented on
more conservative technologies (180 nm, 180 nm, and 350 nm, respectively), and their power
budget is an order of magnitude smaller than those of the DSP and FPGA, and two orders of
magnitude smaller than those of the CELL and GPU. When we compare the computational power figures, we
also have to take these parameters into consideration.
Table I shows the speed advantages of the different architectures compared to the DSP-memory
architecture, both in the 3×3 neighborhood arithmetic (8 bit/pixel) and morphologic (1 bit/pixel)
cases. This indicates the speed advantage for the area-active single-step and the front-active
content-dependent execution-sequence-variant operators. In Table II, we summarize the speed
relations of the rest of the wave-type operations. The table indicates the values computed using
the formulas that we derived in the previous section. In some cases, however, the coarse-
and especially the fine-grain arrays contain special accelerator circuits, which take
advantage of the topographic arrangement and the data representation (e.g., global OR network,
mean network, diffusion network). These are marked by notes, and the real speed-up with the
special hardware is shown in parentheses.
Among the low-power multi-core processor architectures, the pass-through is the only one that
can handle both high-resolution and low-resolution images, due to its relatively small
memory demand. While the coarse- and fine-grain architectures require the storage of 6-8 entire
frames, the pass-through architecture needs only a few lines for each processor. In the case of a
mega-pixel image, this can be less than one third of the frame. This means that, as opposed to the
coarse- and fine-grain architectures, the pass-through architecture can trade speed for
resolution. This is very important, because the main criticism of the topographic architectures is
that they cannot handle large images, and many users do not need their 1000+ FPS. The
price the pass-through architectures pay for this trade-off is their rigidity. Once the
architecture is downloaded to an FPGA (or an ASIC is fabricated), it cannot be flexibly
reprogrammed; only the computational parameters can be varied. It is very difficult to introduce
conditional branching, unless all the paths of the branching are implemented on silicon (multi-
thread pipeline), or significant delay or latency is introduced.
Table I Computational parameters of the different architectures for arithmetic (3×3 convolution)
and logic (3×3 binary erosion) operations.

                               DSP          Pass-through  Coarse-grain  Fine-grain      Cell          GPU
                               (DaVinci+)   (FPGA++)      (Xenon)       (SCAMP/Q-Eye)   architecture  (GTX280)
Silicon technology             90 nm        65 nm         180 nm        350/180 nm      65 nm         65 nm
Silicon area (mm²)             -            -             100           100/50          -             576
Power consumption              1.25 W       2-3 W         0.08 W        0.20 W          86 W          236 W (board)
Arithmetic proc. clock speed   600 MHz      250 MHz       100 MHz       1.2/2.5 MHz     3200 MHz      1300 MHz
Number of arithmetic proc.     8            120           256           16384           8x4           240
Nominal arithmetic comp.       4.8 GMAC     30 GMAC       25.6 GMAC     19 GOPS****     102 GMAC      324 GMAC
  power                        (8 bit int)  (8 bit int)   (8 bit int)                   (32 bit float) (32 bit float)
Reached arithmetic comp.       3.5 GMAC     30 GMAC       12.2 GMAC     6.7 GOPS****    48 GMAC       14 GMAC
  power                        (8 bit int)  (8 bit int)   (8 bit int)                   (32 bit float) (32 bit float)
Efficiency of arithmetic calc. 73%*         100%          48%***        41%**           47%           4%
3×3 convolution time           42.3 µs***** 4.9 µs        12.1 µs       22 µs****       3.1 µs        14 µs
  (128×128 pixels)
Arithmetic speed-up            1            8.6           3.5           1.9             13.7          4
Morph. proc. clock speed       600 MHz      83 MHz        100 MHz       1.2/5 MHz       3200 MHz      1300 MHz
Number of morphologic proc.    64           864           2048          147456          1024          7680
Morph. processor kernel type   2×32 bit     96×9 bit      256×8 bit     16384×9 bit     8×128         240×32
Nominal morphologic comp. power 38 GLOPS#   216 GLOPS     205 GLOPS     737 GLOPS       3280 GLOPS    10400 GLOPS
Reached morphologic comp. power 10 GLOPS    72 GLOPS      134 GLOPS     737 GLOPS       1540 GLOPS    -
Efficiency of morphological calc. 28%*      33%           65%***        100%            47%           -
3×3 morphologic operation time 13.6 µs***** 2.05 µs       1.1 µs        0.2 µs          0.1 µs        -
Morphologic speed-up           1            6.6           12.4          68.0            142           -

+ Texas Instruments DaVinci video processor (TMS320DM64x)
++ Xilinx Spartan 3A DSP FPGA (XC3SD3400A)
* processors are faster than cache access
** data access from a neighboring cell takes an additional clock cycle
*** due to pass-through stages in the processor kernel (no effective calculation in each clock cycle)
**** no multiplication; scaling with a few discrete values only
***** these data-intensive operators slow down to 1/3 or even 1/5 when the image does not fit into the internal
memory (typically above 128×128 with a DaVinci, which has 64 kByte internal memory)
# LOPS: logic operations per second (1 bit)
The CELL and the GPU can also handle high-resolution images due to their relatively high
external memory bandwidth. As can be seen from the comparison, the efficiency of the GPU is
very low. This is due to the small convolution kernel size; in convolutions with larger kernels, or in
other local operations, the GPU is much more efficient.
In our comparison tables, we have used a typical FPGA as a vehicle to implement the
pass-through architectures, for the simple reason that all the currently available pass-through
architectures are implemented in FPGAs, which is mainly attributed to much lower costs and quicker
time-to-market development cycles. However, they could certainly also be implemented in ASIC,
which would significantly reduce their power consumption, and decrease their large-volume
prices, making it possible to process even multi-mega-pixel images at video rate.
Table II Speed relations in the different function groups, calculated for 128×128 sized images.
The notes indicate the functionalities by which the topographic arrays are speeded up
with special purpose devices.

                                      DSP         Pass-through  Coarse-grain   Fine-grain      Fine-grain
                                      (DaVinci+)  (FPGA++)      (Xenon)        discrete time   continuous time
                                                                               (SCAMP/Q-Eye)   (ACLA)
1D content-independent front-active operators (e.g. shadow)
  processor util. efficiency          100%        100%          N/n: 6.25%     1/n: 0.8%       1/n: 0.8%
  speed-up in advantageous
    direction (vertical)              1           6.6           0.77           0.53            188
  speed-up in disadvantageous
    direction (horizontal)            1           1             2              10.6            3750
2D content-independent front-active operators
  processor util. efficiency          100%        100%          1/(1+2n/N³): 66%  1/(2n): 0.4% -
  speed-up (global OR)                1           6.6           8.2 (13*)      0.27 (20*)      n/a
  speed-up (global max)               1           8.6           2.3            n/a             n/a
  speed-up (average)                  1           8.6           2.3            n/a (2.5)**     n/a
Execution-sequence-invariant content-dependent front-active operators
  hole finder with k=10 sized         4 updates   k+1 updates   n/N+k updates  n/2+k updates   n/2+k updates
    small objects                                 (11)          (26)           (74)            (74)
  speed-up                            1           2.4           1.9            3.7             1500
Area-active operators
  processor util. efficiency          100%        100%          100%           100%            -
  speed-up                            1           8.6           3.5            1.9 (210***)    n/a
Multi-scale
  1:4 scaling                         1           8.6           3.5            0.1             n/a

+ Texas Instruments DaVinci video processor (TMS320DM64x)
++ Xilinx Spartan 3A DSP FPGA (XC3SD3400A)
* hard-wired global OR device speeds up this function (<1 µs for the whole array)
** hard-wired mean calculator device makes this function available (~2 µs for the whole array)
*** diffusion calculated on a resistive network (<2 µs for the whole array)
Table III shows the computational power, the consumed power, and the power efficiency of the
selected architectures. As we can see, the three topographic arrays have an over-hundredfold
power efficiency advantage over the DSP-memory architectures. This is due to their local
data access and relatively low clock frequency. In the case of an ASIC implementation, the power
efficiency of the pass-through architecture would also increase by a similar factor.
Table III Computational power, consumed electric power, and their ratio in the different
architectures for convolution operations.

                      GOPS   W      GOPS/W  Accuracy
DaVinci               3.6    1.25   2.88    1/8 int
Pass-through (FPGA)   30     3      10      1/8 int
Xenon (64×64)         10     0.02   500     1/8 int
SCAMP (128×128)       20     0.2    100     6-7 analog
Q-Eye                 25     0.2    125     6-7 analog
Cell multiprocessor   225    85     2.6     32 float
GPU                   324    236    1.37    32 float
Figure 12 shows the relation between the frame-rate and the resolution in a video analysis task.
Each of the processors had to calculate 20 convolutions, 2 diffusions, 3 means, 40 morphologies,
and 10 global ORs. Only the DSP-memory and pass-through architectures support trading
between resolution and frame-rate; the characteristics of these architectures form lines. The
chart shows the performance of the three discussed chips too; the chips are represented with
their physical array sizes. Naturally, this chart belongs to this particular task with the given
operation basket; with a different basket, a different chart would result.
(Figure 12 chart: frame-rate (logarithmic, 10 to 10,000 FPS) versus resolution; the DSP and the pass-through/pipe-line architectures appear as lines, while the Xenon, SCAMP, and Q-Eye chips appear as points; the video-rate level is marked.)
Figure 12. Frame-rate versus resolution in a typical image analysis task. Both of the axes are
in logarithmic scale.
As can be seen in Figure 12, both SCAMP and Xenon have about the same speed as the DSP. In the
case of Xenon, this is because its array size is only 64×64. In the case of SCAMP, the
processor was designed for very accurate, low-power calculation using a conservative
technology. In this particular task, the Q-Eye chip was almost as fast as the pass-through
architectures, thanks to its integrated diffusion circuitry and its support of binary morphology.
So far, we have studied how to implement the different wave-type operators on different
architectures, identified constraints and bottlenecks, and analyzed the efficiency of these
implementations. With these results in hand, we can define rules for optimal image
processing architecture selection for topographic problems. This section considers the low-power
devices, where embedded operation is a viable option.
Image processing devices are usually special purpose architectures, optimized for solving
specific problems or a family of similar algorithms. Figure 13 shows a method of special purpose
processor architecture selection. It always starts with understanding the problem in all its
aspects. Then, different algorithms suitable for solving the problem are derived. The algorithms
are described with a flowchart, with the list of the used operations, and with the specification of
the most important parameters. In this way, a set of formal data describes the algorithms:
resolution, frame-rate, pixel clock, latency, computational demand (type and
number of operators), and flowchart. Other application-specific (secondary) parameters are also
given: maximal power consumption, maximal volume, economy, etc. The algorithm derivation is
a human activity supported by various simulators for evaluation and verification purposes.
The next step is the architecture selection. Using the previously compiled data, we can define
a methodology for the architecture selection step. As we will see, based on the formal
specifications, we can derive the possible architectures. There might not be any, there might be
exactly one, or there might be several, according to the demands of the specification of the
algorithm.
Figure 13. Special-purpose processor architecture selection: the verbally described problem
leads to several candidate algorithms (Algorithm 1, Algorithm 2, ..., Algorithm k), each of
which maps to one or more candidate architectures (Architecture 1, Architecture 2a,
Architecture 2b, ..., Architecture k).
The matrix is divided into 16 segments, and each segment indicates the potential architectures
that can operate in that particular parameter environment. The matrix also shows the minimal
pixel clock figures (red) at the grid points.
In Figure 14, the pass-through and the DSP architectures can be positioned freely between
frame-rate and resolution, without constraints. Thus they appear everywhere under a certain
pixel clock rate. The digital coarse-grain sensor-processor arrays appear in the low-resolution
part (left column), while the analog (mixed-signal) fine-grain sensor-processor arrays appear in
both the low- and medium-resolution columns.
The next important parameter is the latency. Latency is critical when the vision device is in a
control loop, because large delays might make the control loop unstable. It is worth
distinguishing three latency requirement regions:
very low latency (latency < 2 ms; e.g. missile, UAV, high-speed robot control);
low latency (2 ms < latency < 50 ms; e.g. robotics, automotive);
high latency (50 ms < latency; e.g. security, industrial quality check).
Latency has two components. The first is the readout time of the sensor, and the second is the
completion of the processing on the entire frame. The readout time is negligible in the fine-grain
mixed-signal architectures, since the analog sensor readout is transferred to an analog memory
through a fully parallel bus. The readout time is also very small (~100 μs) in the coarse-grain
digital processor array, because an embedded AD converter array performs the conversion in
parallel. The DSPs and the pass-through processor arrays use external image sensors, in which
the readout time is usually in the millisecond range. Therefore, in case of very low latency
requirements, the mixed-signal and the digital focal plane arrays can be used. (There are some
ultra-high frame-rate sensors with high-speed readout, which can be combined with pass-through
processors. However, these can be applied only in very special applications, due to their high
complexity and cost.)
Figure 14. Architecture selection matrix: frame-rate [log FPS] (low speed 1, video speed 15,
high speed 100, ultra-high speed 2000) versus resolution [log # pixels] (low res. 1k (32×32),
medium 16k (128×128), video 76.8k (320×240), megapixel 983k (1280×768)). Each segment
lists the feasible architectures: CG_D (coarse-grain digital focal plane array processor
architecture), FG_A (fine-grain analog focal plane array processor architecture), PT
(pass-through architecture, possibly multiple PTs), and DSP. The minimal pixel clock [MHz] is
given at the grid points, and the segments are graded as standard, challenging, extremely
challenging, or not possible.
In the low latency category, only those architectures can be used in which the sensor readout
time plus the processing time is smaller than the latency requirement. In the high latency region,
the latency does not pose any bottleneck.
The next descriptor of the algorithms is the computational demand. It is a list of the applied
operations. Using the execution time figures that we calculated for the different operations on the
examined architectures, we can simply calculate the total execution time. (In case of the pass-
through architecture, the delays of the individual stages should be summed up.) The total
processing time should satisfy the following two relations:
t_total_processing < t_latency − t_readout
t_total_processing < 1/frame_rate
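The two relations above can be turned into a simple feasibility check. The sketch below is illustrative only; the operator list with its per-architecture execution times (`op_t`, `exec_time_us`) is an assumed input taken from the timing tables of the earlier sections, and all names are ours:

```c
#include <stdbool.h>

/* One entry of the algorithm's operation list: the execution time of
   the operator on the candidate architecture, in microseconds. */
typedef struct {
    double exec_time_us;
} op_t;

/* Total processing time: the sum of the operator execution times.
   (For a pass-through architecture the per-stage delays are summed
   the same way.) */
double total_processing_us(const op_t *ops, int n_ops)
{
    double t = 0.0;
    for (int i = 0; i < n_ops; ++i)
        t += ops[i].exec_time_us;
    return t;
}

/* An architecture is feasible for the algorithm if
     t_total < t_latency - t_readout   and   t_total < 1/frame_rate. */
bool architecture_feasible(const op_t *ops, int n_ops,
                           double latency_us, double readout_us,
                           double frame_rate_hz)
{
    double t_total = total_processing_us(ops, n_ops);
    double frame_period_us = 1.0e6 / frame_rate_hz;
    return t_total < latency_us - readout_us
        && t_total < frame_period_us;
}
```

Running the check over every candidate architecture yields the (possibly empty) set of feasible ones, as described above.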
The last primary parameter is the program flow. The array processors and the DSP are not
sensitive to branches in the program flow. However, the pass-through architectures are challenged
by conditional branches in the program flow, because the branch should be taken at the
calculation of the first pixel of the frame, but at that time the branching condition is not yet
calculated. (The condition is calculated during the processing of the entire frame.) Therefore,
a frame buffer is required before the branch, which generates significant hardware overhead
and increases the latency.
There are three secondary design parameters. The first is the power consumption. Generally, the
ASIC solutions need much less power than the FPGA or DSP solutions. The second is the
volume (cubature) of the circuit. A smaller volume can be achieved with sensor-processor arrays,
because the combination of the two functionalities reduces the chip count. The third parameter is
the economy. In case of low volume, the DSP is the cheapest, because the invested engineering
cost is the smallest there. In case of medium volume, the FPGA is the most economical, while in
case of high volume, the ASIC solutions are the cheapest.
3. Implementation on physical cellular machine
Chapter 4. PDE SOLVERS
      0        1/h²        0             0 0 0
A =  1/h²  −4/h² + 1/R   1/h²      B =   0 0 0      z = 0
      0        1/h²        0             0 0 0

      0   1   0        0 0 0
A =   1  −3   1   B =  0 0 0    z = 0
      0   1   0        0 0 0

Example: image name: laplace.bmp, image size: 100×100; template name: laplace.tem.

      0        1/h²        0             0 0 0
A =  1/h²  −4/h² + 1/R   1/h²      B =   0 0 0      z = 0
      0        1/h²        0             0 0 0
ρ·h·∂²w/∂t² = p − D·( ∂⁴w/∂x⁴ + 2·∂⁴w/∂x²∂y² + ∂⁴w/∂y⁴ )     (1)
where w is the displacement of the plate, p is the applied pressure, h is the thickness of the plate and ρ is the density
of the plate. The flexural rigidity D can be computed by the following expression:
D = E·h³ / ( 12·(1 − ν²) )     (2)
where E is Young's modulus and ν is Poisson's ratio.
The dimensions of the plate are 100 μm × 100 μm and its thickness is 2.85 μm. The width of the suspension bridges is
12.5 μm. The tactile sensor is made from silicon, so the material constants are the following: E = 47 GPa, ν = 0.278 and
ρ = 2330 kg/m³.
                                                                    0   0   1   0   0
                                                                    0   2  −8   2   0
D·∇⁴w = D·( ∂⁴w/∂x⁴ + 2·∂⁴w/∂x²∂y² + ∂⁴w/∂y⁴ ) ≈ (D/Δx⁴)·           1  −8  20  −8   1     (3)
                                                                    0   2  −8   2   0
                                                                    0   0   1   0   0
where Δx is the distance between the grid points. Equation (1) cannot be solved on the current analog VLSI chips,
because 5×5 sized templates are not supported in these architectures. Using the Falcon configurable emulated digital
CNN-UM architecture, the limitations of the analog VLSI chips can be overcome. The Falcon architecture can be
configured to support two CNN layers and 5×5 sized templates. To achieve better numerical stability, the leapfrog
method is used instead of the forward Euler method during the computation of the new cell value. The leapfrog
method computes the new cell value using the data from the previous time step according to the following equation:

w^(n+1) = w^(n−1) + 2·Δt·f(w^n)     (4)

where Δt is the time step value and w^(n−1), w^n, w^(n+1) are the cell values at the previous, current and next time
step, respectively. f(·) is the derivative computed by using template (3) at the given point. The implementation of the
leapfrog method requires additional memory elements and doubles the required bandwidth of the processor, but
these modifications are worthwhile because a much larger time step can be used. Additionally, the symmetry of the
required template operator makes it possible to optimize the arithmetic unit of the Falcon architecture. Using the
original Falcon arithmetic unit, the template operation is computed in 5 clock cycles in row-wise order and 5
multipliers are used. However, multiplication by 2, −8 and 20 can be done by shifts in a radix-2 number system.
Multiplication by 20 can be computed by multiplying the value by 16 and by 4 and summing the partial results. After
this optimization, just one clock cycle and only one multiplier are required during the computation. The resources
required to implement one processor which can compute (1) on a 512×512 sized grid with different precisions are
summarized in Table I.
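The shift-based optimization described above can be sketched in C; the helper names are ours, and 32-bit integers stand in for the Falcon fixed-point word:

```c
#include <stdint.h>

/* In a radix-2 number system, multiplication by 2, -8 and 20 reduces
   to shifts and additions: 20*x = 16*x + 4*x, so only the boundary
   weight of the stencil needs a real multiplier. */
int32_t mul2(int32_t x)  { return x << 1; }
int32_t mul8(int32_t x)  { return x << 3; }
int32_t mul20(int32_t x) { return (x << 4) + (x << 2); } /* 16x + 4x */

/* Contribution of the center row of the 5x5 stencil in (3),
   with weights 1, -8, 20, -8, 1 -- no multiplier needed at all. */
int32_t center_row(int32_t wm2, int32_t wm1, int32_t w0,
                   int32_t wp1, int32_t wp2)
{
    return wm2 - mul8(wm1) + mul20(w0) - mul8(wp1) + wp2;
}
```

In hardware the same decomposition turns each constant multiplier into wiring plus one adder, which is why a single real multiplier suffices per cell update.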
Results, performance
The proposed architecture will be implemented on our RC200 prototyping board from Celoxica Ltd. The Virtex-II
1000 (XC2V1000) FPGA on this board can host four Falcon processor cores using 35-bit precision, which makes it
possible to compute four iterations in one clock cycle. The performance of the system is limited by the speed of the
on-board memory, resulting in a maximum clock frequency of 90 MHz. The theoretical performance of the four
processor cores is 360 million cell updates/s. Unfortunately, the board has a 72-bit wide data bus, so 4 clock cycles are
required to read a new cell value and to store the results; this reduces the achievable performance to 90 million cell
updates/s. The size of the memory is also a limiting factor, because the state values must fit into the 4 Mbyte memory
of the board.
By using the new Virtex-II Pro devices with larger and faster memory, the architecture can reach a
230 MHz clock rate and can compute a new cell value in each clock cycle. Additionally, the huge amount of on-chip
memory and multipliers on the largest XC2VP125 FPGA makes it possible to implement 45 processor cores,
resulting in a computing performance of 10,350 million cell updates/s. On the other hand, the large number of arithmetic
units makes it possible to implement higher-order and more accurate numerical methods. The achievable
performance and the speedup compared to conventional microprocessors are summarized in Table II. The results show
that even the limited implementation of the modified Falcon processor on our RC200 prototyping board can
outperform a high-performance desktop PC. If adequate memory bandwidth (a 288-bit wide memory bus running at a
230 MHz clock frequency) is provided, the emulated digital solution is 1400 times faster!
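The throughput figures above follow a simple model: cell updates per second equal the number of cores times the clock rate, divided by the cycles needed per update (set by the memory bus when it cannot feed the arithmetic units). A minimal sketch, with names of our choosing:

```c
/* Back-of-the-envelope throughput model:
   updates/s = cores * clock_hz / cycles_per_update. */
double cell_updates_per_s(int cores, double clock_hz,
                          double cycles_per_update)
{
    return cores * clock_hz / cycles_per_update;
}
```

With four cores at 90 MHz and one cycle per update the model gives the theoretical 360 million cell updates/s; with the 4 cycles per update forced by the 72-bit bus it reproduces the 90 million figure quoted above.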
[Figure: comparison of the 64-bit floating point, 32-bit fixed point and 18-bit fixed point computations.]
In conformity with the key prerequisites of the heat transfer model, the main differential equations can be
expressed in the following way:

∂/∂x( λ₁·∂t₁/∂x ) + ∂/∂y( λ₁·∂t₁/∂y ) + ∂/∂z( λ₁·∂t₁/∂z ) = ρ₁c₁·∂t₁/∂τ     (3)

∂/∂x( λ₂·∂t₂/∂x ) + ∂/∂y( λ₂·∂t₂/∂y ) + ∂/∂z( λ₂·∂t₂/∂z ) − ρc·Vx·∂t₂/∂x − ρc·Vy·∂t₂/∂y = m·ρ₂c₂·∂t₂/∂τ     (4)

∂/∂x( λ₃·∂t₃/∂x ) + ∂/∂y( λ₃·∂t₃/∂y ) + ∂/∂z( λ₃·∂t₃/∂z ) = ρ₃c₃·∂t₃/∂τ     (5)
where the symbols are as follows:
- t1, t2 and t3 denote the temperatures in the 1st, 2nd and 3rd areas and are functions of x, y and z;
- λ1, λ2 and λ3 are the coefficients of conductivity in the same areas and are functions of x, y and z;
- c1, c2 and c3 are the heat capacities of the rocks;
- ρ1, ρ2 and ρ3 are the rock densities in the respective areas;
- ρ and c are the density and heat capacity of water.
Equations (3) and (5) describe the heat transfer in the upper and lower argillaceous, impermeable layers, while
equation (4) describes the process in the transitional, water-saturated calcareous layer.
The additional advection terms in the update of t2 are

− τ·( ρc / ρ₂c₂ )·( Vx_{x,y} / Δx )·( t2ᵏ_{x,y,z} − t2ᵏ_{x+1,y,z} )
− τ·( ρc / ρ₂c₂ )·( Vy_{x,y} / Δy )·( t2ᵏ_{x,y,z} − t2ᵏ_{x,y+1,z} )

and

t3ᵏ⁺¹_{x,y,z} = t3ᵏ_{x,y,z}
  + τ/(ρ₃c₃·Δx²)·( λ3_{x−1,y,z}·t3ᵏ_{x−1,y,z} − (λ3_{x−1,y,z} + λ3_{x,y,z})·t3ᵏ_{x,y,z} + λ3_{x,y,z}·t3ᵏ_{x+1,y,z} )
  + τ/(ρ₃c₃·Δy²)·( λ3_{x,y−1,z}·t3ᵏ_{x,y−1,z} − (λ3_{x,y−1,z} + λ3_{x,y,z})·t3ᵏ_{x,y,z} + λ3_{x,y,z}·t3ᵏ_{x,y+1,z} )     (8)
  + τ/(ρ₃c₃·Δz²)·( λ3_{x,y,z−1}·t3ᵏ_{x,y,z−1} − (λ3_{x,y,z−1} + λ3_{x,y,z})·t3ᵏ_{x,y,z} + λ3_{x,y,z}·t3ᵏ_{x,y,z+1} )
where τ is the time step, and Δx, Δy and Δz are the distances between grid points in directions x, y and z, respectively.
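A 1-D slice of the explicit update (8) can be sketched as follows; the function and parameter names are ours, and the real solver sums the analogous x, y and z contributions at each grid point:

```c
/* Contribution of the x-neighbours to the explicit update of t3 at grid
   index x, with space-variant conductivity lam[] (lam[x] couples point x
   to point x+1). tau is the time step, dx the grid spacing, rho3 and c3
   the density and heat capacity of the layer. */
double diffusion_term_x(const double *t, const double *lam, int x,
                        double tau, double dx, double rho3, double c3)
{
    return tau / (rho3 * c3 * dx * dx)
         * (lam[x - 1] * t[x - 1]
            - (lam[x - 1] + lam[x]) * t[x]
            + lam[x] * t[x + 1]);
}
```

Summing this term with its y and z counterparts and adding the result to the current value gives one explicit (forward Euler) step of the space-variant heat equation.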
The third-generation blade system is the IBM BladeCenter QS22, equipped with new-generation PowerXCell 8i
processors manufactured using 65 nm technology. The double precision performance of the SPEs is significantly
improved, providing extraordinary computing density: up to 6.4 TFLOPS single precision and up to 3.0 TFLOPS
double precision in a single BladeCenter house. These blades are the main building blocks of the world's fastest
supercomputer at Los Alamos National Laboratory, the first to break through the "petaflop barrier" of 1,000 trillion
operations per second.
To model the process of reinjection on an emulated digital CNN architecture [8], a space-variant CNN model has
been developed based on equations (6)-(8), which operates with 3.5-dimensional templates. The second equation,
which describes the behavior of the water-saturated transitional layer, contains two additional parts, which were
derived from the time-independent filtration equation and make the connection between the process of filtration and
heat transfer.
In the process of filtration, only the temperature values have to be calculated and updated during the iterations, so
zero-input CNN templates can be used, with the given initial values as the initial state of the template run. To
design space-variant, non-linear templates for the three-dimensional medium, we have designed 3 coupled 2D
templates with an r = 1 neighborhood for each of the three physical layers, so every feedback template triplet
contains 27 elements.
The structure of the coupled templates (An1, An2, An3) for one physical layer can be seen in Figure 3, where n
denotes the physical layer described.
The coupled templates of the second layer, which were determined from equation (7), are as follows. A21 and A23
each contain a single non-zero (center) element, λ_{x,y,z−1}/(m_{x,y}·ρ₂c₂·Δz²) and λ_{x,y,z}/(m_{x,y}·ρ₂c₂·Δz²),
respectively. A22 collects the in-plane terms: its off-center elements hold the diffusion coefficients of the
corresponding neighbors, λ_{x−1,y,z}/(m_{x,y}·ρ₂c₂·Δx²), λ_{x,y,z}/(m_{x,y}·ρ₂c₂·Δx²),
λ_{x,y−1,z}/(m_{x,y}·ρ₂c₂·Δy²) and λ_{x,y,z}/(m_{x,y}·ρ₂c₂·Δy²), while its center element combines the self term
with the negative coefficient sums −(λ_{x−1,y,z}+λ_{x,y,z})/(m_{x,y}·ρ₂c₂·Δx²),
−(λ_{x,y−1,z}+λ_{x,y,z})/(m_{x,y}·ρ₂c₂·Δy²), −(λ_{x,y,z−1}+λ_{x,y,z})/(m_{x,y}·ρ₂c₂·Δz²) and the advection
terms ρc·Vx_{x,y}/(ρ₂c₂·Δx) and ρc·Vy_{x,y}/(ρ₂c₂·Δy).
The space-variant templates for the first and third physical layers can be determined similarly; only the appropriate
λ, ρ and c multiplier coefficients need to be used.
By using the previously described discretization method, a C-based solver was developed, which is optimized for the
SPEs of the Cell architecture.
Reference
[1] S. Kocsárdi, Z. Nagy, S. Kostianev, P. Szolgay, "FPGA based implementation of water reinjection in geothermal structure", Proc. of
CNNA 2006, pp. 323-327, Istanbul, 2006
x gH
A=
Rc h)
At the edges of the model, closed boundary conditions are used, i.e. there is no mass transport across the
boundaries. In this case ux and uy are both zero at the edges of the CNN cell array.
The circulation in the barotropic ocean is generally the result of the wind stress at the ocean's surface and the
source-sink mass flows at the basin boundaries. In this paper we use a steady wind to force our model. In this case
the ocean will generally arrive at a steady circulation after an initial transient behavior.
CNN-UM solution
Solution of equations a)-c) on a CNN-UM architecture requires a finite difference approximation on a uniform
square grid. The spatial derivatives can be approximated by the following well-known finite difference schemes and
CNN templates:
                    0  0  0
∂/∂x ≈ 1/(2Δx) ·   −1  0  1   = Adx     i)
                    0  0  0

                    0  1  0
∂/∂y ≈ 1/(2Δy) ·    0  0  0   = Ady     j)
                    0 −1  0

                0  1  0
∇² ≈ 1/Δx² ·    1 −4  1   = An     k)
                0  1  0
Using these templates, the pressure and lateral viscosity terms can easily be computed on a CNN-UM architecture.
However, the computation of the advection terms requires the following non-linear CNN template, which cannot be
implemented on the present analog CNN-UM architectures.
                                     0  0  0
u_x,ij·∂/∂x ≈ u_x,ij · 1/(2Δx) ·    −1  0  1   = A_{x,x,ij}     l)
                                     0  0  0
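The central-difference and Laplacian templates above correspond to simple stencil evaluations. A minimal C sketch, assuming a row-major array, interior points, and function names of our choosing:

```c
/* Central difference in x at index i of a row-major array (template Adx):
   (v[i+1] - v[i-1]) / (2*dx). */
double adx(const double *v, int i, double dx)
{
    return (v[i + 1] - v[i - 1]) / (2.0 * dx);
}

/* 5-point Laplacian at index i (template An); n is the row length,
   so the y-neighbours are i-n and i+n. */
double an(const double *v, int i, int n, double dx)
{
    return (v[i + 1] + v[i - 1] + v[i + n] + v[i - n] - 4.0 * v[i])
         / (dx * dx);
}
```

The nonlinear template (l) is the same `adx` stencil with a space-variant multiplier u_x,ij in front, which is exactly what the analog chips cannot realize.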
Most ocean models arrange the time-dependent variables ux, uy and η on a staggered grid called the C-grid. In this
case the pressure p and height H variables are located at the centers of the mesh boxes, and the mass transports ux and uy
at the centers of the box boundaries facing the x and y directions, respectively. In this case the state equation of the
ocean model can be solved by a one-layer CNN, but the required template size is 5×5 and space-variant templates
should be used. Another approach is to use 3 layers for the 3 time-dependent variables. In this case the CNN-UM
solution can be described by the following equations:
du_x,ij/dt = f_ij·u_y,ij − g·H_ij·(Adx ∗ η) + w_x,ij − ε·u_x,ij
           + A_ij·(An ∗ u_x) − (1/H_ij)·( A_{x,x,ij} ∗ u_x + A_{x,y,ij} ∗ u_y )     m)

du_y,ij/dt = −f_ij·u_x,ij − g·H_ij·(Ady ∗ η) + w_y,ij − ε·u_y,ij
           + A_ij·(An ∗ u_y) − (1/H_ij)·( A_{y,x,ij} ∗ u_x + A_{y,y,ij} ∗ u_y )     n)

dη_ij/dt = −(Adx ∗ u_x) − (Ady ∗ u_y)     o)
Figure 2. Dataflow graph of the arithmetic unit computing the derivative of the u_x layer; its inputs are w_x,ij,
f_ij, u_y,ij, g, H_ij, η_{i−1,j}, η_{i+1,j}, A_ij, u_x,ij and its neighbors u_x,i±1,j, u_x,i,j±1, combined by
multiplier and adder stages and a 1/Δx scaling.
This trick makes it possible to eliminate several multipliers from the arithmetic unit and greatly reduces the area
requirements.
The arithmetic unit requires 20 input values in each clock cycle to compute a new cell value. It is obvious that
these values cannot be provided from a separate memory. To resolve this I/O bottleneck, a belt of the cell array should
be stored on the chip. In the case of ux, uy and η, two lines should be stored, because these lines are required in the
computation of the spatial derivatives. From the remaining values, such as f, H, A, wx and wy, only one line should
be stored, for synchronization purposes.
Inside the arithmetic unit, fixed-point number representation is used, because fixed-point arithmetic requires a much
smaller area than floating-point arithmetic. The bit widths of the different input values can be configured
independently before synthesis. The widths of the values are computed using simple rules and heuristics. For
example, the deepest point of the Pacific Ocean is about 11,000 meters deep, so 14 bits are required to represent this
depth H with one-meter accuracy. This is far more accurate than the available data of the ocean bottom, so the last
one or two bits can be safely removed. In this case we always have to multiply H by 2 or 4 when it is used in the
computations. Fortunately, this multiplication can be implemented by shifts. Similar considerations can be used to
determine the required bit width and the position of the radix point for the remaining constant values, such as f,
A, g, wx and wy. Using fixed-point numbers with various widths and radix point positions makes it harder to design
the arithmetic unit, but it is worthwhile because the area can be reduced and the clock speed increased.
In our recent implementation, predefined configurable multipliers from Xilinx are used to simplify circuit design.
The maximum input width of this IP core is 64 bits, thus the bit widths of the u and η values cannot be larger than 31
and 34 bits, respectively, to avoid rounding errors inside the arithmetic unit. In this case, 41 dedicated 18-bit by 18-bit
signed multipliers are required to implement the arithmetic unit. Of course, the bit width can be further increased by
using custom multipliers. The resources required to implement this arithmetic unit on Xilinx Virtex series FPGAs are
summarized in the General column of Table I.
Results
The proposed architecture will be implemented on our RC200 prototyping board from Celoxica Ltd. The
XC2V1000 FPGA on this board can host one arithmetic unit, which makes it possible to compute a new cell value in
one clock cycle. Unfortunately, the board has a 72-bit wide data bus, so 5 clock cycles are required to read a new cell
value and to store the results. The performance can be increased by slightly lowering the precision, as shown in the
RC200 column of Table I, and by implementing three memory units which use the arithmetic unit alternately. In this
case, 4 clock cycles are required to compute 3 new cell values and the utilization of the arithmetic unit is 75%. The
performance of the system is limited by the speed of the on-board memory, resulting in a maximum clock frequency
of 90 MHz. In this case the performance of the chip is 67.5 million cell updates/s. The size of the memory is also a
limiting factor, because the input and state values must fit into the 4 Mbyte memory of the board. The cell array is
restricted to 512×512 cells by the limited amount of on-board memory; however, the XC2V1000 FPGA could handle
1024 or even 2048 cell wide arrays.
By using the new Virtex-II Pro devices with larger and faster memory, the architecture can reach a 200 MHz clock
rate and can compute a new cell value in each clock cycle. Additionally, the huge amount of on-chip memory and
multipliers on the largest XC2VP125 FPGA makes it possible to implement 14 separate arithmetic units. These
arithmetic units work in parallel, and the cumulative performance is 2800 million cell updates/s. On the other hand,
the large number of arithmetic units makes it possible to implement more accurate numerical methods.
The results of the different fixed-point computations are compared to the 64-bit floating-point results. To evaluate
our solution, a simple model is used. The size of the modeled ocean is 2097 km, the boundaries are closed, the grid
size is 512×512 and the grid resolution is 4096 m. The model is forced by a wind blowing from the west, whose value
is constant along the x direction. The wind speed in the y direction is described by a reversed parabolic curve, where
the speed is zero at the edges and 8 m/s in the center. The results of the 36-bit fixed-point computations after 72
hours of simulation time, using a 32 s time step, are shown in Figure 3.
Fig. 3. Results after 72 hours of simulation: (a) bottom topography (seamount), (b) flow direction, (c) elevation
TABLE I
RESOURCE REQUIREMENTS AND DEVICE UTILIZATION OF THE ARITHMETIC UNIT

Bit width
Variable              General   RC200
f                     18        10
H                     17        10
A                     17        10
w                     18        12
η                     34        30
u                     31        30
I/O bus width (bit)   184       144

Required resources
18x18 bit multiplier  41        41
18 kbit Block RAM     16        13

18x18 bit multiplier utilization
Part number           General   RC200   Available resources
XC2V1000              103%      103%    40
XC2V8000              24%       24%     168
XC2VP125              7%        7%      556

18 kbit Block RAM utilization
Part number           General   RC200   Available resources
XC2V1000              40%       33%     40
XC2V8000              10%       8%      168
XC2VP125              3%        2%      556
Reference
Z. Nagy, Zs. Vörösházi, P. Szolgay, "Emulated Digital CNN-UM Solution of Partial Differential Equations", Int. J.
of Circuit Theory and Applications, Vol. 34, No. 4, pp. 445-470, 2006
For engineering applications in the field of flow simulation, the basis of development is the well-known Navier-
Stokes system of equations. This system is derived from the fundamental laws of mass conservation,
momentum conservation and energy conservation.
In this paper we concentrate on inviscid, compressible fluids, where the dissipative transport phenomena of
viscosity, mass diffusion and thermal conductivity can be neglected. Therefore, if we simply drop all the terms
involving friction and thermal conduction from the Navier-Stokes equations, then we obtain the Euler equations,
the governing equations for inviscid flows.
Euler equations
The equations resulting from the above-mentioned physical principles for an unsteady, two-dimensional,
compressible, inviscid flow are displayed below, without external sources. In order to take the compressibility and
the variations of density in high-speed flows into account, we utilize the conservation form of the governing
equations, using the density-based formulation.
Continuity equation

∂ρ/∂t + ∇·(ρV) = 0     (1)

Momentum equations

x component:   ∂(ρu)/∂t + ∇·(ρuV) = −∂p/∂x     (2)

y component:   ∂(ρv)/∂t + ∇·(ρvV) = −∂p/∂y     (3)

where t denotes time, x and y are the space coordinates, and in two-dimensional Cartesian coordinates the
vector operator ∇ is defined as

∇ ≡ i·∂/∂x + j·∂/∂y     (4)
The dependent variables are ρ, V(u, v) and p; they denote the density, the velocity vector field and the scalar
pressure field, respectively. We can see in the above equations that we have three equations in terms
of four unknown flow-field variables. In aerodynamics, it is generally reasonable to assume that the gas is a perfect
gas, so the equation of state can be written in the following form:

p = ρRT     (5)

where R is the specific gas constant, whose value in case of air is 286.9 J/(kg·K), and T is the absolute temperature;
the temperature value can be defined as an initial condition, because an isothermal system is assumed.
For this reason the fourth governing equation, the energy equation, can be neglected.
In the next sections we will examine these methods applied to two-dimensional first order hyperbolic Euler
equations from the point of view of their stability, realizability with CNN templates, usability in different
engineering applications and their hardware utilization on an emulated digital CNN-UM solution.
During the discretization of Euler equations we are interested in replacing the different partial derivatives with
suitable algebraic difference quotients.
∂u/∂t |_{i,j} = ( uⁿ⁺¹_{i,j} − uⁿ_{i,j} ) / Δt − (Δt/2)·∂²u/∂t² + ...     (6)

∂u/∂x |_{i,j} = ( uⁿ_{i+1,j} − uⁿ_{i−1,j} ) / (2Δx) − ((Δx)²/6)·∂³u/∂x³ + ...     (7)
Substituting (6) and (7) into (1)-(3), the following discretized formulas can be derived:

ρⁿ⁺¹_{i,j} = ρⁿ_{i,j} − (Δt/2Δx)·( ρⁿ_{i+1,j}uⁿ_{i+1,j} − ρⁿ_{i−1,j}uⁿ_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ_{i,j+1}vⁿ_{i,j+1} − ρⁿ_{i,j−1}vⁿ_{i,j−1} )     (8)

ρⁿ⁺¹_{i,j}uⁿ⁺¹_{i,j} = ρⁿ_{i,j}uⁿ_{i,j} − RT·(Δt/2Δx)·( ρⁿ_{i+1,j} − ρⁿ_{i−1,j} )
  − (Δt/2Δx)·( ρⁿ_{i+1,j}uⁿ_{i+1,j}uⁿ_{i+1,j} − ρⁿ_{i−1,j}uⁿ_{i−1,j}uⁿ_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ_{i,j+1}uⁿ_{i,j+1}vⁿ_{i,j+1} − ρⁿ_{i,j−1}uⁿ_{i,j−1}vⁿ_{i,j−1} )     (9)

ρⁿ⁺¹_{i,j}vⁿ⁺¹_{i,j} = ρⁿ_{i,j}vⁿ_{i,j} − RT·(Δt/2Δy)·( ρⁿ_{i,j+1} − ρⁿ_{i,j−1} )
  − (Δt/2Δx)·( ρⁿ_{i+1,j}uⁿ_{i+1,j}vⁿ_{i+1,j} − ρⁿ_{i−1,j}uⁿ_{i−1,j}vⁿ_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ_{i,j+1}vⁿ_{i,j+1}vⁿ_{i,j+1} − ρⁿ_{i,j−1}vⁿ_{i,j−1}vⁿ_{i,j−1} )     (10)

where Δt is the time step, while Δx and Δy denote the differences between grid points in directions x and y.
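As a sketch, the discretized continuity equation (8) evaluated at one interior point; the row-major layout, the index convention (x-neighbours at idx±1, y-neighbours at idx±n), and the function name are our assumptions:

```c
/* One FTCS update of the density (equation (8)) at interior index idx of
   row-major n-column arrays rho, u, v holding time level n. */
double ftcs_rho(const double *rho, const double *u, const double *v,
                int idx, int n, double dt, double dx, double dy)
{
    return rho[idx]
         - dt / (2.0 * dx) * (rho[idx + 1] * u[idx + 1]
                              - rho[idx - 1] * u[idx - 1])
         - dt / (2.0 * dy) * (rho[idx + n] * v[idx + n]
                              - rho[idx - n] * v[idx - n]);
}
```

The momentum updates (9) and (10) follow the same pattern, with the momentum products ρu and ρv in place of ρ and the extra RT pressure term.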
The von Neumann stability analysis for hyperbolic equations shows that solutions applying the FTCS method
will be unconditionally unstable.
ρⁿ⁺¹_{i,j}vⁿ⁺¹_{i,j} = (1/4)·( ρⁿ_{i,j+1}vⁿ_{i,j+1} + ρⁿ_{i,j−1}vⁿ_{i,j−1} + ρⁿ_{i+1,j}vⁿ_{i+1,j} + ρⁿ_{i−1,j}vⁿ_{i−1,j} )
  − RT·(Δt/2Δy)·( ρⁿ_{i,j+1} − ρⁿ_{i,j−1} )
  − (Δt/2Δx)·( ρⁿ_{i+1,j}uⁿ_{i+1,j}vⁿ_{i+1,j} − ρⁿ_{i−1,j}uⁿ_{i−1,j}vⁿ_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ_{i,j+1}vⁿ_{i,j+1}vⁿ_{i,j+1} − ρⁿ_{i,j−1}vⁿ_{i,j−1}vⁿ_{i,j−1} )     (13)
The Lax-Wendroff method computes the new values of the dependent variables in two steps. First it evaluates the
values at the half time step using the Lax method, and in the second step it applies the leapfrog method with a half
step. The computation formula for (2) can be seen below.
First step:

ρⁿ⁺¹ᐟ²_{i,j}uⁿ⁺¹ᐟ²_{i,j} = (1/4)·( ρⁿ_{i+1,j}uⁿ_{i+1,j} + ρⁿ_{i,j+1}uⁿ_{i,j+1} + ρⁿ_{i−1,j}uⁿ_{i−1,j} + ρⁿ_{i,j−1}uⁿ_{i,j−1} )
  − RT·(Δt/2Δx)·( ρⁿ_{i+1,j} − ρⁿ_{i−1,j} )
  − (Δt/2Δx)·( ρⁿ_{i+1,j}uⁿ_{i+1,j}uⁿ_{i+1,j} − ρⁿ_{i−1,j}uⁿ_{i−1,j}uⁿ_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ_{i,j+1}uⁿ_{i,j+1}vⁿ_{i,j+1} − ρⁿ_{i,j−1}uⁿ_{i,j−1}vⁿ_{i,j−1} )     (14)

Second step:

ρⁿ⁺¹_{i,j}uⁿ⁺¹_{i,j} = ρⁿ_{i,j}uⁿ_{i,j}
  − (Δt/2Δx)·( ρⁿ⁺¹ᐟ²_{i+1,j}uⁿ⁺¹ᐟ²_{i+1,j}uⁿ⁺¹ᐟ²_{i+1,j} − ρⁿ⁺¹ᐟ²_{i−1,j}uⁿ⁺¹ᐟ²_{i−1,j}uⁿ⁺¹ᐟ²_{i−1,j} )
  − (Δt/2Δy)·( ρⁿ⁺¹ᐟ²_{i,j+1}uⁿ⁺¹ᐟ²_{i,j+1}vⁿ⁺¹ᐟ²_{i,j+1} − ρⁿ⁺¹ᐟ²_{i,j−1}uⁿ⁺¹ᐟ²_{i,j−1}vⁿ⁺¹ᐟ²_{i,j−1} )
  − RT·(Δt/2Δx)·( ρⁿ⁺¹ᐟ²_{i+1,j} − ρⁿ⁺¹ᐟ²_{i−1,j} )     (15)
The partial differential equations of the density and of the y-directional velocity component can be discretized in a
similar way.
In equations (11) through (15), the notations are the same as in the discrete form of FTCS.
The von Neumann stability analysis shows, in the case of both the Lax and the Lax-Wendroff methods, that these
schemes are conditionally stable for ordinary hyperbolic equations. Although there is no analytical stability analysis
to determine the limiting time step requirement, because of the nonlinear nature of the Euler equations, the
following empirical formula can be applied:

Δt ≤ σ·(Δt)_CFL / ( 1 + 2/Re_Δ )     (16)

where 0.7 ≤ σ ≤ 0.9, (Δt)_CFL can be determined using the Courant-Friedrichs-Lewy condition, and
Re_Δ = min(Re_Δx, Re_Δy) ≥ 0, where Re_Δ denotes the Reynolds number.
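Formula (16) can be written as a one-line helper; the names are ours, and the choice of the CFL-limited step dt_CFL is left to the caller, as in the text:

```c
/* Empirical stable time step of (16): dt = sigma * dt_CFL / (1 + 2/re),
   with the safety factor sigma in [0.7, 0.9] and re the (mesh) Reynolds
   number; dt_CFL comes from the Courant-Friedrichs-Lewy condition. */
double stable_dt(double sigma, double dt_cfl, double re)
{
    return sigma * dt_cfl / (1.0 + 2.0 / re);
}
```

For strongly viscous-dominated cells (small re) the denominator grows and the step shrinks well below the pure CFL limit, which is the intended safety margin.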
Considering the above-mentioned methods, it is remarkable that regular, rectangular grids can be used during the
computation, so we have the possibility to implement these solutions on a CNN-UM architecture. Furthermore,
the last two examined methods have some valuable advantages over FTCS in the case of hyperbolic partial
differential equations:
they conserve the mass in a closed system;
the computational accuracy of the Lax-Wendroff method is second order in time and space;
the hardware utilization of the implemented Lax formula is very similar to that of the
FTCS method; and
although oscillations caused by sharp points, such as corners, appear in the case of the Lax-
Wendroff method, and the solution of the Lax method is dissipative, the effect of these
properties does not blow up, unlike the FTCS scheme, if the time step is defined properly.
In 1995 a former CNN-based Navier-Stokes solver was published, in which it was possible to divide both
momentum equations by the density, because that solution was designed to model the behavior of incompressible
fluids. In our solution, however, due to the compressibility of the medium, it is necessary to use dividers in order to
obtain the u and v values after the steps.
The Euler equations were solved by a modified Falcon processor array, in which the arithmetic unit has been
changed according to the discretized continuity and momentum equations. For the momentum equations, the values
of R and T were defined as initial conditions, because we assumed slow motions and thus an isothermal condition
in the fluid flow.
Since each CNN cell has only one real output value, three layers are needed to represent the variables u, v and ρ in
the case of the FTCS and Lax approximations. In these cases the CNN templates acting on the u layer can easily be
taken from (9) and (12), in conformity with the FTCS or the Lax scheme. Equations (17)-(19) show templates, in
which cells of different layers at positions (k, l) are connected to the cell of layer u at position (i, j). The terms in (9)
and (12) including only ρ are realized by

                        0  0  0
A_{ρ,u} = −RT/(2Δx) ·  −1  0  1   [ρ_kl]     (17)
                        0  0  0

The nonlinear terms are

                         0  0  0
A_{u²,u} = −1/(2Δx) ·   −1  0  1   [ρ_kl·u²_kl]     (18)
                         0  0  0

                         0  1  0
A_{uv,u} = −1/(2Δy) ·    0  0  0   [ρ_kl·u_kl·v_kl]     (19)
                         0 −1  0
Of course, the value of u can be obtained only after division by the density value. The templates for the ρ and v
layers can be defined analogously.
In the case of the Lax-Wendroff scheme, u′, v′ and ρ′ denote the state variables of the first step, while u, v and ρ are
the state variables of the second computational step, so the number of required layers is six. The linear and nonlinear
templates can be determined in a very similar way as in the case of the FTCS or Lax methods. The only difference is
that the layers without the prime will be interconnected with the layers having the prime notation. For example, the
nonlinear connection between the cell of layer u at position (i, j) and the cells of u′² at positions (k, l) in (15) can be
described by the following template:
                          0  0  0
A′_{u²,u} = −1/(2Δx) ·   −1  0  1   [ρ′_kl·u′²_kl]     (20)
                          0  0  0
In accordance with the different discretized equation systems, we have designed three complex circuits which are
able to update the state values of a cell in every clock cycle on an emulated digital CNN-UM architecture.
The proposed arithmetic unit computing the derivative of the u layer using the Lax method is shown in Fig. 1. A
pipelining technique has been used so that the different hardware units of this arithmetic unit can work with maximal
parallelism and thus achieve the highest possible clock speed during the computation.
Figure 1. The proposed arithmetic unit computing the derivative of the u layer
in the solution using the Lax approximation method
Another trick can be applied if we choose the values of Δt, Δx and Δy to be integer powers of two, because
multiplication by these values can be done by shifts; thus several multipliers can be eliminated from the hardware,
and the area requirements are greatly reduced.
Comparing the equation systems (8)-(10) and (11)-(13), one can see that the only modification needed for the
solution applying FTCS is that two additional multipliers have to be built in, while some adders can be eliminated.
The implementation of the Euler equations discretized by the Lax-Wendroff scheme requires about twice as large a
hardware area as the previous solutions, because in this case the arithmetic units of both the Lax and the FTCS
solutions need to be implemented in one circuit.
The input and some result images of a 32-bit precision simulation computing 1 million iterations with a 2^-10 s time
step on a model having 128x128 grid points can be seen in Fig. 10. The red areas denote larger, while the blue areas
denote smaller density, and the arrows show the direction of the flow. Using the Lax method, the computational time
was 1772.7 seconds on an Athlon 64 3500+ processor and 1098.9 seconds on one core of an Intel Core2Duo T7200
using floating-point numbers. This is equivalent to approximately 9.3 million and 14.7 million cell updates/s,
respectively. The Lax-Wendroff method is about 60% slower than the Lax method: in this case the computation takes
2852.4 s and 1803.5 s on the AMD and Intel processors, respectively.
Using the Lax method, the previous simulation takes only about 123.18 s on our XC2V3000 FPGA and
65.5 s on the XUP2Pro, so the computation is accelerated approximately 8.9 and 16.7 times compared
to the performance of the Core2Duo. Using the high-performance XC4SX55 FPGA, the simulation lasts 8.19 seconds,
so the computation is 134-fold faster. The expected computing time of the Lax-Wendroff method on the XC4SX55
FPGA is 20.9 s, which is 86.3 times faster than the Core2Duo microprocessor.
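As a quick arithmetic check, the quoted speedup factors can be reproduced from the runtimes above (illustrative Python; small differences from the quoted throughput figures come from rounding of the runtimes):

```python
# Reproduce the speedup figures from the runtimes quoted in the text.
updates = 128 * 128 * 1_000_000          # total cell updates in the simulation

rate_core2duo = updates / 1098.9         # cell updates per second (Lax, Core2Duo)
speedup_xc2v = 1098.9 / 123.18           # XC2V3000 FPGA vs. one Core2Duo core
speedup_xup = 1098.9 / 65.5              # XUP2Pro vs. one Core2Duo core
speedup_xc4 = 1098.9 / 8.19              # XC4SX55 FPGA vs. one Core2Duo core
speedup_lw = 1803.5 / 20.9               # Lax-Wendroff: XC4SX55 vs. Core2Duo

print(round(speedup_xc2v, 1), round(speedup_xup, 1),
      round(speedup_xc4), round(speedup_lw, 1))  # 8.9 16.8 134 86.3
```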
In this section, two freely downloadable simulators are introduced. The first is the MATCNN
simulator, which enables the simulation of any linear or non-linear CNN and CNN template
sequence. The second is the CA simulator, which supports the simulation of cellular automata.
d/dt v_xij(t) = -v_xij(t) + Σ_{kl∈Nr(ij)} A_{ij,kl} v_ykl(t) + Σ_{kl∈Nr(ij)} B_{ij,kl} v_ukl(t) + I_ij
              + Σ_{kl∈Nr(ij)} Â_{ij,kl}(v_ykl, v_yij) + Σ_{kl∈Nr(ij)} B̂_{ij,kl}(v_ukl, v_uij)        (3)
MATCNN format:
GRADT_A = [ 0 0 0;
0 2 0;
0 0 0 ];
GRADT_Bb = [ 1 1 1;
1 0 1;
1 1 1 ];
GRADT_b = [ 1 3 -3 3 0 0 3 3 ]; (function of the nonlinear interaction, see later)
GRADT_I = -1.8;
d/dt v_xij(t) = -v_xij(t) + Σ_{kl∈Nr(ij)} A_{ij,kl} v_ykl(t) + Σ_{kl∈Nr(ij)} B_{ij,kl} v_ukl(t) + I_ij
              + Σ_{kl∈Nr(ij)} D̂_{ij,kl}(Δv)        (4)
MATCNN format:
MEDIAN_A = [ 0 0 0;
0 1 0;
0 0 0 ];
MEDIAN_Dd = 0.5 *[ 1 1 1;
1 0 1;
1 1 1 ];
MEDIAN_d = [ 0 2 0 -1 2 1 12 ]; (function of the nonlinear interaction, see later)
NLINDIFF_A = [ 0 0 0;
0 1 0;
0 0 0 ];
NLINDIFF_Dd = [ 0.5 1 0.5;
1 0 1;
0.5 1 0.5 ];
NLINDIFF_d = [ 1 5 -2 0 -0.1 0 0 1 0.1 0 2 0 122 ]
A unique nonlinear function a, b, or d is assigned to each nonlinear operator Â, B̂, and D̂, respectively.
These nonlinear functions determine the nonlinear cell interaction and are allowed to be piecewise-constant (pwc) or
piecewise-linear (pwl). Their specification in MATCNN is as follows:
For a and b: [ interp p_num x_1 y_1 x_2 y_2 ... x_n y_n ]
Where: interp - interpolation method (0 - pwc; 1 - pwl)
p_num - number of points
x_1 y_1 - abscissa and ordinate of the first point
x_n y_n - abscissa and ordinate of the last point
For d: [ interp p_num x_1 y_1 x_2 y_2 ... x_n y_n intspec ]
Where: interp - interpolation method (0 - pwc; 1 - pwl)
p_num - number of points
x_1 y_1 - abscissa and ordinate of the first point
x_n y_n - abscissa and ordinate of the last point
intspec - interaction specification
In case of nonlinearity d the interaction should also be specified; the valid codes are as follows:
For the interaction d(Δv):
11: Δv = v_u2,kl - v_u1,ij
12: Δv = v_u,kl - v_x,ij
13: Δv = v_u,kl - v_y,ij
21: Δv = v_x,kl - v_u,ij
22: Δv = v_x,kl - v_x,ij
23: Δv = v_x,kl - v_y,ij
31: Δv = v_y,kl - v_u,ij
32: Δv = v_y,kl - v_x,ij
33: Δv = v_y,kl - v_y,ij
For the interaction d(Δv)·(Δv), 100 should be added to the above codes.
Remark: it can be seen that in the case of codes 11 and 33 the interaction is exactly that of a and b,
respectively (this also explains why D̂ is called the generalized nonlinear operator, since it includes both Â and
B̂). The above specification of the interactions makes it possible to formulate the operators of gray-scale
morphology, statistical filtering and nonlinear diffusion as simple CNN templates.
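The pwc/pwl specification above can be interpreted as in the following Python sketch (an illustrative reimplementation; `eval_nonlin` is not a MATCNN function, and a trailing `intspec` value is simply ignored here):

```python
import bisect

def eval_nonlin(spec, x):
    """Evaluate a MATCNN-style nonlinear function given as
    [interp, p_num, x1, y1, ..., xn, yn(, intspec)].
    interp: 0 = piecewise-constant, 1 = piecewise-linear.
    Outside [x1, xn] the boundary value is held."""
    interp, p_num = int(spec[0]), int(spec[1])
    xs = spec[2:2 + 2 * p_num:2]          # abscissas
    ys = spec[3:3 + 2 * p_num:2]          # ordinates
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x) - 1    # segment containing x
    if interp == 0:                       # pwc: hold the left point's value
        return ys[i]
    t = (x - xs[i]) / (xs[i + 1] - xs[i])  # pwl: linear interpolation
    return ys[i] + t * (ys[i + 1] - ys[i])

# GRADT_b = [1 3 -3 3 0 0 3 3] is pwl through (-3,3), (0,0), (3,3), i.e. |x|:
gradt_b = [1, 3, -3, 3, 0, 0, 3, 3]
print(eval_nonlin(gradt_b, -1.5))  # 1.5
print(eval_nonlin(gradt_b, 2.0))   # 2.0
```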
The MATCNN Toolbox includes a default CNN template library (temlib.m) in which a number of tested CNN
templates (linear, nonlinear AB-type and nonlinear D-type) are given in the format discussed in subsection 1.3.
The demonstration examples of 2.2 use this template library. The user can define a similar library assigned to
his or her own algorithms.
The names of the images (global variables) assigned to all CNN models in MATCNN are as follows:
INPUT1 - U or U1 (primary input image of the CNN model)
INPUT2 - U2 (secondary input image of the CNN model)
STATE - X (state image of the CNN model)
OUTPUT - Y (output of the CNN model)
BIAS - B (bias image of the CNN model: bias map)
MASK - M (mask image of the CNN model: fixed-state map)
Remark: the sum of the constant cell current (I) specified in the CNN template and the space-variant bias
value (Bij) gives the space-variant current of the CNN model (Iij = I + Bij). The default bias is zero. The mask is a binary
map (Mij = 1 or -1) and determines whether a CNN cell is active (operating) or inactive (fixed). The default mask
value is 1 (all cells are active). (See also footnotes 1 and 2.) Arithmetical, logical and D̂-type operators can have
two input values (U = U1 and U2).
Global Variables
Scripts and functions of the MATCNN assume that certain global variables exist and are initialized in the
environment.
The global variables that should be modified by the user when simulating a CNN model are as follows:
UseMask
UseBiasMap
Boundary
TemGroup
TimeStep
IterNum
Subsection 1.7 explains the role of the above variables; more details and the initial settings can be found in
subsection 2.1.
Remark: There exists another group of global environmental variables used and modified by MATCNN
scripts and functions that should not be modified by the user. These global variables are listed and explained in
subsection 2.1.
Remark: the environment always has to be set; if the template library is not specified, the default library
temlib.m of MATCNN is used.
initialize the input and state images assigned to the CNN model
Remark: the input initialization is optional (depends on the template), but the state layer should always be
initialized. In this phase noise can also be added to the images for test purposes (see CImNoise).
Remark: if the UseBiasMap global is set to 1, the BIAS image should be initialized; similarly, when the
UseMask global is set to 1, the MASK image should also be initialized. The Cnn_SetEnv MATCNN script sets
these global variables to zero.
Remark: the boundary condition specified can be constant (-1 ≤ Boundary ≤ 1), zero-flux (Boundary = 2) or
torus (Boundary = 3).
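The three boundary types correspond to different virtual border cells around the array; a NumPy sketch (illustrative, not MATCNN code):

```python
import numpy as np

def pad_boundary(img, boundary):
    """Add one ring of virtual border cells around a CNN layer.
    boundary in [-1, 1]: constant value; 2: zero-flux (replicate edge);
    3: torus (wrap around)."""
    if boundary == 2:
        return np.pad(img, 1, mode='edge')
    if boundary == 3:
        return np.pad(img, 1, mode='wrap')
    return np.pad(img, 1, mode='constant', constant_values=boundary)

img = np.array([[1., 2.], [3., 4.]])
print(pad_boundary(img, 3)[0, 0])  # 4.0 (torus: top-left border = bottom-right cell)
```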
set the simulation parameters (time step and number of iteration steps)
Remark: the default values TimeStep = 0.2 and IterNum = 25, which correspond to an analog transient length of
approximately 5τ (when R = 1 and C = 1), guarantee that in a non-propagating CNN transient all cells reach
the steady-state value. However, for a number of templates different settings should be used.
e.g. LoadTem('EDGE'); - load the EDGE template from the specified library
Remark: All templates of a project or algorithm can be stored in a common library (M-file). The LoadTem
function activates the given one from the specified library that will determine the CNN model of the simulation.
e.g. RunTem; - run the CNN simulation with the specified template
Remark: the ODE of the CNN model is simulated using the forward Euler formula.
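For the linear part of (3), one forward-Euler step might be sketched as follows (illustrative Python/NumPy with an assumed 3x3 template size and constant-zero boundary; not MATCNN source code):

```python
import numpy as np

def euler_step(x, y, u, A, B, I, h):
    """One forward-Euler step of the linear CNN state equation
    dx/dt = -x + sum(A*y) + sum(B*u) + I, 3x3 templates, zero boundary."""
    def tconv(img, tem):
        p = np.pad(img, 1)                  # constant-zero virtual border
        out = np.zeros(img.shape)
        for di in range(3):
            for dj in range(3):             # accumulate one template tap at a time
                out += tem[di, dj] * p[di:di + img.shape[0], dj:dj + img.shape[1]]
        return out
    return x + h * (-x + tconv(y, A) + tconv(u, B) + I)

# A feedback template with center element 2 feeds twice the output into the state:
A = np.zeros((3, 3)); A[1, 1] = 2.0
B = np.zeros((3, 3))
x1 = euler_step(np.zeros((3, 3)), np.ones((3, 3)), np.zeros((3, 3)), A, B, 0.0, 0.1)
print(x1[0, 0])  # 0.2
```

A full simulator would also apply the output nonlinearity after each step and loop for IterNum iterations; this sketch shows only the state update itself.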
Remark: since the CNN is primarily referred to as a visual microprocessor, and the inputs and outputs are images, in
most cases it might be helpful to visualize these images using different magnification rates, palettes, etc. The
user can exploit the capabilities of MATLAB's graphical interface and the Image Processing Toolbox when
evaluating the CNN performance.
A simple example using the EDGE template from the default template library to perform edge detection on
a black and white image (road.bmp) stored in the user's directory. The output is visualized and saved as roadout.bmp.
-------------------------------------------
Cnn_SetEnv;
INPUT1 = LBmp2CNN('Road');
STATE = INPUT1;
Boundary = -1;
TimeStep = 0.1;
IterNum = 50;
LoadTem('EDGE');
RunTem;
CNNShow(OUTPUT);
SCNN2Bmp('RoadOut', OUTPUT);
-------------------------------------------
Subsection 2.1 gives an example for the linear, nonlinear AB-type and nonlinear D-type templates. It is
also shown how the template operations can be combined to build up an analogic CNN algorithm.
Consider an algorithm defined, for example, by a UMF diagram. This algorithm contains elementary instructions
defined by various templates and local logic operators. Demonstration algorithms are included in the MATCNN
simulation program. See subsections 1.8.1-1.8.4 for details on running these algorithms made up of elementary
instructions.
The second algorithm demonstrates a non-linear gradient template using the zero-flux boundary condition.
The third algorithm demonstrates how to filter a noisy image with a D-type non-linear median template using the
zero-flux boundary condition.
The fourth algorithm is an example of a simulation using multiple templates. Noise is added to the original image,
and a series of templates (threshold, median, edge) is executed.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% D_ALGO Sample ANALOGIC CNN ALGORITHM with linear and nonlinear templates
%
% set CNN environment
Cnn_SetEnv % default environment
TemGroup = 'TemLib'; % default template library
% load images, initialize layers
load pic2; % loads the image from pic2.mat to the INPUT1
STATE = INPUT1;
% put noise in the image
STATE = cimnoise(STATE, 'salt & pepper',0.05);
INPUT1 = STATE; % 1st input
INPUT2 = STATE; % 2nd input
LAM1 = STATE; % the noisy original image is also stored in LAM1
% set boundary condition
Boundary = 2; % zero flux boundary condition
% run nonlinear D-type template (filtering)
LoadTem('MEDIAN'); % loads the MEDIAN template
TimeStep = 0.02;
IterNum = 50;
RunTem; % runs the CNN simulation
LAM2 = OUTPUT; % the CNN output is stored in LAM2
% run linear templates (thresholding and edge detection)
LoadTem('THRES'); % loads the THRES template
TimeStep = 0.4;
IterNum = 15;
RunTem; % runs the CNN simulation
INPUT1 = OUTPUT;
STATE = OUTPUT;
LoadTem('EDGE'); % loads the EDGE template
RunTem; % runs the CNN simulation
% show results
subplot(2,2,1); cnnshow(LAM1); % displays the noisy original image
xlabel('Input');
subplot(2,2,2); cnnshow(LAM2); % displays the result of noise filtering
xlabel('1. O: Median');
subplot(2,2,3); cnnshow(INPUT1); % displays the result of thresholding
xlabel('2. O: Threshold');
subplot(2,2,4); cnnshow(OUTPUT); % displays the result of edge detection
xlabel('3. O: Edge');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section a short reference is given to the MATCNN scripts, functions and global variables and some
CNN examples are shown.
Basic:
Miscellaneous:
MATCNN MEX-files:
Basic MEX-files:
MATCNN demos:
The default template library is specified in the TemLib template group, where the syntax of linear and nonlinear
CNN templates is also given.
Type HELP "FName" in MATLAB environment to learn the details on the MATCNN M-files (scripts and
functions) and MEX-files!
Remarks:
The installation guide and detailed information about the available UNIX script files (slink - creates a soft link
to the existing M-files to deal with the case sensitivity problem, cunix - compiles all *.c source files and creates the
MEX-kernels in UNIX environment) and the Windows batch file (cwindows.bat - compiles all *.c source files and
creates the DLL MEX-kernels in Windows environment) can be found in the file readme.txt where the known bug
reports are also included.
User guide
version 1.0
Usually, the bit string is considered to be rolled into a ring, in the sense that the right
neighbor of the last bit is the first bit of the string, while the left neighbor of the first bit
is the last bit.
The 8-tuple of output bits (one for each of the eight possible neighborhoods) defines the dynamics of the
Cellular Automaton, and hence it is called the CA rule. There are 2^8 = 256 possible rules, which can be numbered
from 0 to 255 by reading the 8-tuple of output bits as a binary number.
Therefore, a CA rule transforms a given bit string xn into another bit string xn+1 at each
iteration; given an initial state x0, the union of all xn, for every n up to infinity, is called the
space-time pattern obtained from x0.
Even though Cellular Automata were introduced long before Cellular
Neural/Nonlinear Networks, CA are actually a special case of CNNs. Indeed, it is
possible to prove that every local rule is a code for attractors of a dynamical system
which can be written as a Universal CNN cell.
Example
x0 = 00010111
(The figure shows the space-time pattern: the rows x0, x1, ..., x5 obtained by iterating a CA rule on x0.)
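The CA dynamics described above can be sketched in a few lines of Python (an illustrative sketch; the downloadable simulator itself is a Mathematica notebook):

```python
def ca_step(bits, rule):
    """One iteration of a 1D binary CA on a ring of cells.

    bits: list of 0/1 values; rule: Wolfram rule number 0..255.
    Each cell's new value is the rule's output bit for the 3-bit
    neighborhood (left, center, right), read as a number 0..7."""
    n = len(bits)
    out = []
    for i in range(n):
        left, center, right = bits[i - 1], bits[i], bits[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right  # neighborhood as 0..7
        out.append((rule >> idx) & 1)              # pick the rule's output bit
    return out

# Rule 170 copies each cell's right neighbor, i.e. a cyclic left shift:
x0 = [0, 0, 0, 1, 0, 1, 1, 1]
print(ca_step(x0, 170))  # [0, 0, 1, 0, 1, 1, 1, 0]
```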
1D CA Simulator
There exist numerous solid results about CA which have been found through a
rigorous theoretical approach. However, empirical experiments can help to understand
what kind of dynamics may arise in CA and under what conditions certain phenomena
can be verified.
For this reason, here we describe the working principle of a simulator which can
be downloaded from the webpage
http://sztaki.hu/~gpazienza/CAsimulator.nbp
The main window of the simulator is divided into three parts (from left to
right): 1) controls; 2) space-time pattern of the CA without input; 3) space-time pattern of
the CA with input.
Control panel
There are only 88 fundamental 1D binary CA rules, whose numbers are specified
in (Chua & Pazienza, 2009). All other rules are equivalent to one of them, and the
equivalence tables can be found in (Chua, 2009).
The length of the bit string can vary from 3 to 99 in this version, even though it
should be kept under 25; otherwise the resolution of a single cell may become too
small to be appreciated.
The initial bit string can be set either randomly or by the user (only the first option
is active in the current version). The random initial bit string can be changed by
modifying the random seed.
The number of iterations for which the given rule should be run is controlled by
the iterations slider. Given any deterministic CA rule, at least one bit string must repeat
itself after at most 2^L iterations. When such a repetition occurs, it is possible to determine
the ω-limit orbit the initial bit string belongs to. In
practice, when L is equal to or lower than 20, it is sufficient to set the number of
iterations to 30 to find out the basin of attraction.
Visualization
The states of a binary Cellular Automaton can be represented by only two colors:
either white (for 0) and black (for 1), as Wolfram proposes, or blue (for 0) and red (for
1), as Chua proposes. It is possible to select either color combination through the
color box in the simulator.
Furthermore, the cells appear divided by thin lines when the option mesh is
selected. The following figures show the four possible combinations of color and mesh.
Space-time patterns
In the space-time pattern, the initial bit string is displayed on the first row, while
the following rows contain the successive iterations. The bottom part shows the
kind of ω-limit orbit the initial bit string belongs to.
The standard model of Cellular Automaton, extensively studied by Wolfram and
Chua among others, does not include an input: the evolution depends only on the
initial state. However, it can be interesting to see what happens when an input is
introduced into the system, according to the scheme
in which the rule is executed on the pattern resulting from the logic OR between
the state and the input.
An emblematic example of the dramatic changes that the introduction of an input
may imply is the simulation of rule 170 with an arbitrary input bit string. Rule 170 acts as
a bit shift, in the sense that at each iteration it shifts all bits left by one position. This kind
of ω-limit orbit is called a Bernoulli στ-shift with σ = 1 and τ = 1.
When an arbitrary input is introduced, the left-shift operation performed by rule
170 implies that the input, which is constant both in space and time, quickly spreads over
the whole bit string. In other words, after a few iterations (at most L), we necessarily
obtain a constant output.
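This behavior can be checked with a small Python sketch (illustrative): with an input, rule 170 reduces to "OR, then cyclic left shift", and any input containing a 1 floods the string within at most L iterations.

```python
def step170_with_input(bits, inp):
    """Rule 170 with input: the rule (a cyclic left shift) acts on (state OR input)."""
    n = len(bits)
    ored = [b | u for b, u in zip(bits, inp)]
    return [ored[(i + 1) % n] for i in range(n)]

x = [0] * 8
u = [0, 0, 0, 0, 1, 0, 0, 0]   # constant input with a single 1
for _ in range(8):             # after at most L = 8 iterations...
    x = step170_with_input(x, u)
print(x)  # [1, 1, 1, 1, 1, 1, 1, 1] - a constant output
```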
References
L. O. Chua, A Nonlinear Dynamics Perspective of Wolfram's New Kind of Science, Vol. I, II, III, World Scientific,
2005-2009.
L. O. Chua, G. E. Pazienza, L. Orzo, V. Sbitnev, and J. Shin, "A Nonlinear Dynamics Perspective of Wolfram's New
Kind of Science, Part IX: Quasi-Ergodicity", International Journal of Bifurcation and Chaos, Vol. 18, No. 9, pp. 2487-2642,
2008.
L. O. Chua, G. E. Pazienza, and J. Shin, "A Nonlinear Dynamics Perspective of Wolfram's New Kind of Science, Part
X: Period-1 Rules", International Journal of Bifurcation and Chaos, Vol. 19, No. 5, pp. 1425-1655, 2009.
L. O. Chua, G. E. Pazienza, and J. Shin, "A Nonlinear Dynamics Perspective of Wolfram's New Kind of Science, Part
XI: Period-2 Rules", International Journal of Bifurcation and Chaos, Vol. 19, No. 6, pp. 1751-1931, 2009.
L. O. Chua and G. E. Pazienza, "A Nonlinear Dynamics Perspective of Wolfram's New Kind of Science, Part XII:
Period-3, Period-6 and Permutive Rules", International Journal of Bifurcation and Chaos, Vol. 19, No. 9, 2009.
Appendix 1: UMF Algorithm Description
Version 1.3
A step delay operation performs a value delay. The delay time
can be specified either in τ or in GAPU instruction
steps. If neither is given, the delay time defaults to a single
GAPU instruction step.
Parametric templates
(Symbol: a parametric template TEM(p) with input U, initial state X0, bias z and output Y.)
Parallel
A typical parallel structure with two parallel flows is shown below; the flows are combined in the final
step. (Diagram: inputs U1 and U2 processed by two template flows with states X1 and X2, merged into a single output Y.)
Example: multiplication with the constant p, MULT(p):
A = [ 0 0 0; 0 1 0; 0 0 0 ],  B = [ 0 0 0; 0 p 0; 0 0 0 ]
Operators
We can use a compact symbol for the compact cell and use it in cascade, parallel and combined
cascade-parallel structures. C-like imperative operators (e.g. i = 0, i++) are also available.
Decisions
(Diagram: a decision element testing a condition, e.g. GW < 0.5, with Y and N output branches.)
Triggers
(Diagram: a subroutine Subroutine_1, built from templates TEM1 and TEM2, started by a trigger condition GW.)
Usage
(Diagram: Subroutine_1 invoked on one branch of a Y/N decision.)
Depending on the output of the decision (which can be either yes or no) the proper dataflow continues.
Iterations and vectors of arrays
(Diagram: a TEMPLATE iterated 10 times.) The equivalent imperative form:
flow = U;
repeat
    flow = T(flow);
until GW == 1;
Y = flow;
Merging arrays
(Diagram: two input arrays U1 and U2 merged into a single output Y.)
Appendix 2: Virtual and Physical Cellular Machine 309
Core = Cell
Core and cell will be used as synonyms; a core/cell is defined as a unit implementing a well-defined operator (with
input, output and state) on binary, real or string variables (also called logic, arithmetic/analog or
symbolic variables, respectively). Cores/cells are typically used in arrays, mostly with well-defined
interaction patterns with their neighboring cores/cells, although sparse longer
wires/communications/interactions are also allowed.
Core is used when we emphasize the digital implementation; cell is used when the meaning is more general.
A Logic (L), Arithmetic/analog (A) or Symbolic (S) elementary array instruction is defined via r
input (u(t)), m output (y(t)) and n state (x(t)) variables (t is the time instant), operating on binary, real or
symbolic variables, respectively. Each dynamic cell is connected mainly locally, in the simplest case to
its neighbor cells.
L: A typical logic elementary array instruction might be a binary logic function on n or n×n (2D)
binary variables (special cases: a disjunctive normal form, a memory look-up table array, a binary
state machine, an integer machine);
A: a typical arithmetic/analog elementary array instruction is a multiply-and-accumulate (MAC)
core/cell array or a dynamic cell array generating a spatial-temporal wave; and
S: a typical symbolic elementary array instruction might be a string manipulation core/cell array,
mainly locally connected.
Mainly local connectedness means that the local connection has a speed preference compared to a global
connection via a crossbar path.
We have three elementary cell processor (cell core) array implementation types:
D: A digital algorithm with input, state and output vectors of real/arithmetic (finite-precision
analog), binary/digital logic, and symbolic variables (typically implemented via digital circuits).
R: A real-valued dynamical system cell with analog/continuous or arithmetic variables (typically
implemented via mixed-mode/analog-and-logic circuits and digital control processors), placed in a
mainly locally connected array.
G: A physical dynamic entity with well defined Geometric Layout and I/O ports (function in layout)
(typical implementations are CMOS and/or nanoscale designs, or optical architectures with
programmable control), placed in a mainly locally connected array.
3. Physical parameters of array processor units (typically a chip or a part of a chip) and
interconnections
The cores/cells can be placed on a single Chip, typically in a square grid, with input and output
physical connectors typically at the corners (sometimes at the bottom and top corners in a 3D
packaging) of the Chip; altogether there are K input/output connectors. The maximal value of dissipation
of the Chip is W. The physics is represented by the maximal values of the array size, K, and W (as well as the
operating frequency). The operating frequency might be global for the whole Chip (Fo) or local
within the Chip (fi; some parts might be switched off, fi = 0); a partially local frequency fo > Fo is also possible.
The interconnection pathways between the arrays and the other major building blocks are characterized by
their delay and bandwidth (B).
4. Virtual and Physical Cellular Machine architectures and their building blocks
In the homogeneous Virtual Cellular Machine, the basic problem is to execute a task, for example a
Cellular Wave Computer algorithm, on a bigger topographic Virtual Cellular Array using a smaller-size
physical cellular array. Four different types of algorithms have already been developed (Zarándy,
2008).
Among the many different, sometimes problem-oriented heterogeneous Virtual Cellular Machine
architectures we define two typical ones. Their five building blocks are as follows.
(i) CP1, CP2 - cellular processor arrays, one-dimensional (CP1) and two-dimensional (CP2)
(ii) P - classical digital computer with memory & I/O, for example a classical microprocessor
(iii) T - topographic fully parallel 2D (or 1D) input
(iv) M - memory with high-speed I/O, single port or dual port (L1, L2, L3 parts as cache and/or local
memories with different access times)
(v) B - data bus with different speed ranges (B1, B2, ...)
The CP1 and CP2 types of cellular arrays may be composed of cell/core arrays of simple and complex
cells. In the CNN Universal Machine, each complex cell contains logic and analog/arithmetic
components, as well as local memories, plus local communication and control units. Each array has its
own controlling processor; in the CNN Universal Machine it is called the Global Analog/arithmetic-and-
logic Programming Unit (GAPU).
The size of the arrays in the Virtual Cellular Machines is typically large enough to handle all the
practical problems that might arise in the minds of the designers. In the physical implementation,
however, we confront the finite, reasonable, cost-effective sizes and other physical parameters.
The Physical Cellular Machine architecture is defined by the same kind of five building blocks,
however, with well-defined physical parameters, either in an architecture similar to that of the Virtual
Cellular Machine or in a different one.
A building block could be physically implemented as a separate chip or as a part of a chip. The
geometry of the architecture reflects the physical layout within a chip and of the chips within the
Machine (multi-chip machine).
This architectural geometry also defines the communication (interaction) speed ranges: physical
closeness means higher speed ranges and smaller delays.
The spatial location or topographic address of each elementary cell or core within a building block,
as well as that of each building block within a chip, and of each chip within the Virtual Cellular Machine
(Machine) architecture, plays a crucial role. This is one of the most dramatic differences compared to
classical computer science.
In the Physical Cellular Machine models we can use exact, typical or qualitative values for size, speed,
delay, power, and other physical parameters. The simulators can use these values for performance
evaluation.
We are not considering here the problems and design issues within the building blocks; these have been
fairly well studied in the Cellular Wave Computing or CNN Technology literature, as has the implementation of a
virtual 1D or 2D Cellular Wave Computer on a smaller physical machine. The decomposition of bigger
memories into smaller physical memories is the subject of the extensively used virtual memory concept.
We mention that sometimes a heterogeneous machine can be implemented on a single chip by using
different areas for different building blocks (Rekeczky et al., 2008).
The architecture of the Virtual Cellular Machine and the Physical Cellular Machine might be the same,
though the latter might have completely different physical parameters. On the other hand they might have
completely different architectures.
The internal functional operation of the cellular building blocks is not considered here. On one hand,
it is well studied in the recent Cellular Wave Computer literature, as well as in the recent
implementations (ACE16k, ACE25k = Q-Eye, XENON, etc.); on the other hand, it can be modeled
based on the Graphics Processing Unit (GPU) and FPGA literature. The functional models are
described elsewhere (see also the OpenCL language description).
The two basic types of multi-cellular heterogeneous Virtual Machine architectures are defined next.
Fig. 1. and Fig. 2. (Block diagrams of the two architectures: processor arrays P0-P7 and T, memories M0-M3
and MII, buses B0-B3, I/O port IO1, with global operating frequency F0 and local frequencies f0.)
Fig. 3. and Fig. 4.
Fig. 5. (CELL organization: 1D and 2D cell arrays with relative cell costs ranging from 1c to ~100c.)
Fig. (Physical Implementation vs. Task/Problem/Workload.)
6. The dynamic operational graph and its use for acyclic UMF diagrams
Extending the UMF diagrams (Roska, 2003) describing Virtual Cellular Machines leads to digraphs,
with processor array and memory nodes, and signal array pathways as branches with bandwidth weights.
These graphs, with the dissipation side-constraint, define optimization problems representing the design
task, under well-defined equivalent transformations.
In some well-defined cases, especially within a 1D or 2D homogeneous array, the recently introduced
method via Genetic Programming with Indexed Memory (GP-IM) using UMF diagrams with Directed
Acyclic Graphs (DAG) seems a promising tool, showing good results in simpler cases (Pazienza, 2008).
Appendix 3: Template Robustness 316
TEMPLATE ROBUSTNESS
Here we give the definition of template robustness for the case of uncoupled binary input/output
templates.
It is known that the set of uncoupled binary input/output templates is isomorphic to the set of linearly
separable Boolean functions of 9 variables [44]. Such functions can be described by 9-dimensional
hyper-cubes [45]. If the function is linearly separable, a hyper-plane exists which separates the set of
-1s from the set of 1s.
Definition: The robustness of the template T, denoted by ρ, is defined as the minimal distance of the
hyper-plane, which separates the set of -1s from the set of 1s, from the vertices of the hyper-cube
(see Figure 1a for an illustration in 2 dimensions).
The robustness of T can be increased by choosing the optimal template Topt, for which the minimal
distance of the separating hyper-plane from the vertices is maximal [45] (see Figure 1b).
Figure 1. These diagrams illustrate the separation of vertices for the 2-dimensional logic function
F(u1, u2) = u1 AND u2. Logic TRUE and FALSE are represented by filled and empty circles,
respectively. The concept of robustness is also illustrated. Figure a) shows a few possible
separating lines; the thick line corresponds to the template T with robustness ρ = 0.18. Figure b)
depicts the optimal separating line corresponding to the template Topt; its robustness is ρ = 0.71.
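The optimal value ρ = 0.71 in Figure 1b can be checked numerically; a Python sketch assuming F is the two-input AND on {-1, +1} inputs and that w = (1, 1), b = -1 is the max-margin separator (both assumptions, not taken from the text):

```python
import itertools
import math

# Assumed max-margin separating line w·u + b = 0 for 2-input AND on {-1, +1}.
w = (1.0, 1.0)
b = -1.0

def margin(w, b, points):
    """Minimal distance from the line w·u + b = 0 to the given vertices."""
    norm = math.hypot(*w)
    return min(abs(w[0] * u1 + w[1] * u2 + b) / norm for u1, u2 in points)

vertices = list(itertools.product((-1, 1), repeat=2))
# The separator must classify AND correctly: positive side only at (1, 1).
for u1, u2 in vertices:
    assert (w[0] * u1 + w[1] * u2 + b > 0) == (u1 == 1 and u2 == 1)
print(round(margin(w, b, vertices), 2))  # 0.71
```

The computed margin is 1/sqrt(2) ≈ 0.707, matching the ρ = 0.71 quoted for the optimal template.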
References 317
REFERENCES
[1] L. O. Chua and L. Yang, Cellular neural networks: Theory and Applications, IEEE Transactions on Circuits
and Systems, Vol. 35, pp. 1257-1290, October 1988.
[2] L. O. Chua and L. Yang, The CNN Paradigm, IEEE Transactions on Circuits and Systems I: Fundamental
Theory and Applications, Vol. 40, pp. 147-156, March 1993.
[3] T. Roska and L. O. Chua, The CNN Universal Machine: An Analogic Array Computer, IEEE Transactions
on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 40, pp. 163-173, March 1993.
[4] The CNN Workstation Toolkit, Version 6.0, MTA SzTAKI, Budapest, 1994.
[5] P. L. Venetianer, A. Radványi, and T. Roska, "ACL (an Analogical CNN Language), Version 2.0", Research
report of the Analogical and Neural Computing Laboratory, Computer and Automation Research Institute,
Hungarian Academy of Sciences (MTA SzTAKI), DNS-3-1994, Budapest, 1994.
[6] T. Matsumoto, T. Yokohama, H. Suzuki, R. Furukawa, A. Oshimoto, T. Shimmi, Y. Matsushita, T. Seo and
L. O. Chua, "Several Image Processing Examples by CNN", Proceedings of the International Workshop on
Cellular Neural Networks and their Applications (CNNA-90), pp. 100-112, Budapest, 1990.
[7] T. Roska, T. Boros, A. Radványi, P. Thiran, L. O. Chua, "Detecting Moving and Standing Objects Using
Cellular Neural Networks", International Journal of Circuit Theory and Applications, October 1992, and
Cellular Neural Networks, edited by T. Roska and J. Vandewalle, 1993.
[8] T. Boros, K. Lotz, A. Radványi, and T. Roska, "Some Useful New Nonlinear and Delay-type Templates",
Research report of the Analogical and Neural Computing Laboratory, Computer and Automation Research
Institute, Hungarian Academy of Sciences (MTA SzTAKI), DNS-1-1991, Budapest, 1991.
[9] S. Fukuda, T. Boros, and T. Roska, "A New Efficient Analysis of Thermographic Images by using Cellular
Neural Networks", Research report of the Analogical and Neural Computing Laboratory, Computer and
Automation Research Institute, Hungarian Academy of Sciences (MTA SzTAKI), DNS-11-1991, Budapest,
1991.
[10] L. O. Chua, T. Roska, P. L. Venetianer, and Á. Zarándy, "Some Novel Capabilities of CNN: Game of Life and
Examples of Multipath Algorithms", Research report of the Analogical and Neural Computing Laboratory,
Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SzTAKI), DNS-3-1992,
Budapest, 1992.
[11] L. O. Chua, T. Roska, P. L. Venetianer, and Á. Zarándy, "Some Novel Capabilities of CNN: Game of Life and
Examples of Multipath Algorithms", Proceedings of the International Workshop on Cellular Neural Networks
and their Applications (CNNA-92), pp. 276-281, Munich, 1992.
[12] T. Roska, K. Lotz, J. Hámori, E. Lábos, and J. Takács, "The CNN Model in the Visual Pathway - Part I: The
CNN-Retina and some Direction- and Length-selective Mechanisms", Research report of the Analogical and
Neural Computing Laboratory, Computer and Automation Research Institute, Hungarian Academy of
Sciences (MTA SzTAKI), DNS-8-1991, Budapest, 1991.
[13] T. Roska, J. Hámori, E. Lábos, K. Lotz, L. Orzó, J. Takács, P. L. Venetianer, Z. Vidnyánszky, and Á. Zarándy,
"The Use of CNN Models in the Subcortical Visual Pathway", Research report of the Analogical and Neural
Computing Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences
(MTA SzTAKI), DNS-16-1992, Budapest, 1992.
[14] P. Szolgay, I. Kispál, and T. Kozek, "An Experimental System for Optical Detection of Layout Errors of
Printed Circuit Boards Using Learned CNN Templates", Proceedings of the International Workshop on
Cellular Neural Networks and their Applications (CNNA-92), pp. 203-209, Munich, 1992.
[15] K. R. Crounse, T. Roska, and L. O. Chua, "Image halftoning with Cellular Neural Networks", IEEE
Transactions on Circuits and Systems - II: Analog and Digital Signal Processing, Vol. 40, No. 4, pp. 267-283,
1993.
318 References
[16] H. Harrer and J. A. Nossek, "Discrete-Time Cellular Neural Networks", TUM-LNS-TR-91-7, Technical
University of Munich, Institute for Network Theory and Circuit Design, March 1991.
[17] T. Szirányi and M. Csapodi, "Texture classification and Segmentation by Cellular Neural Network using
Genetic Learning", Computer Vision and Image Understanding, Vol. 71, No. 3, pp. 255-270, September 1998.
[18] A. Schultz, I. Szatmári, Cs. Rekeczky, T. Roska, and L. O. Chua, "Bubble-debris classification via binary
morphology and autowave metric on CNN", International Symposium on Nonlinear Theory and its
Applications, Hawaii, 1997.
[19] P. L. Venetianer, F. Werblin, T. Roska, and L. O. Chua, "Analogic CNN Algorithms for some Image
Compression and Restoration Tasks", IEEE Transactions on Circuits and Systems, Vol. 42, No.5, 1995.
[20] P. L. Venetianer, K. R. Crounse, P. Szolgay, T. Roska, and L. O. Chua, "Analog Combinatorics and Cellular
Automata - Key Algorithms and Layout Design using CNN", Proceedings of the International Workshop on
Cellular Neural Networks and their Applications (CNNA-94), pp. 249-256, Rome, 1994.
[21] H. Harrer, P. L. Venetianer, J. A. Nossek, T. Roska, and L. O. Chua, "Some Examples of Preprocessing
Analog Images with Discrete-Time Cellular Neural Networks", Proceedings of the International Workshop on
Cellular Neural Networks and their Applications (CNNA-94), pp. 201-206, Rome, 1994.
[22] Á. Zarándy, F. Werblin, T. Roska, and L. O. Chua, "Novel Types of Analogic CNN Algorithms for
Recognizing Bank-notes", Proceedings of the International Workshop on Cellular Neural Networks and their
Applications (CNNA-94), pp. 273-278, Rome, 1994.
[23] E. R. Kandel and J. H. Schwartz, "Principles of Neural Science", Elsevier, New York, 1985.
[24] A. Radványi, "Using Cellular Neural Network to 'See' Random-Dot Stereograms" in Computer Analysis of
Images and Patterns, Lecture Notes in Computer Science 719, Springer Verlag, 1993.
[25] M. Csapodi, Diploma Thesis, Technical University of Budapest, 1994.
[26] K. Lotz, Z. Vidnyánszky, T. Roska, and J. Hámori, "The receptive field ATLAS for the visual pathway",
Report NIT-4-1994, Neuromorphic Information Technology, Graduate Center, Budapest, 1994.
[27] G. Tóth, Diploma Thesis, Technical University of Budapest, 1994.
[28] T. Boros, K. Lotz, A. Radványi, and T. Roska, "Some useful, new, nonlinear and delay-type templates",
Research report of the Analogical and Neural Computing Laboratory, Computer and Automation Research
Institute, Hungarian Academy of Sciences (MTA SzTAKI), DNS-1-1991, Budapest, 1991.
[29] G. Tóth, "Analogic CNN Algorithm for 3D Interpolation-Approximation", Research report of the Analogical
and Neural Computing Laboratory, Computer and Automation Research Institute, Hungarian Academy of
Sciences (MTA SzTAKI), DNS-2-1995, Budapest, 1995.
[30] P. Perona and J. Malik, "Scale space and edge detection using anisotropic diffusion", Proceedings of the IEEE
Computer Society Workshop on Computer Vision, 1987.
[31] F. Werblin, T. Roska, and L. O. Chua, "The Analogic Cellular Neural Network as a Bionic Eye", International
Journal of Circuit Theory and Applications, Vol. 23, No. 6, pp. 541-569, 1995.
[32] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image Analysis Using Mathematical Morphology", IEEE
Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 4, pp. 532-550, July 1987.
[33] L. O. Chua, T. Roska, T. Kozek, and Á. Zarándy, "The CNN Paradigm - A Short Tutorial", Cellular Neural
Networks, T. Roska and J. Vandewalle, editors, John Wiley & Sons, New York, 1993, pp. 1-14.
[34] Cs. Rekeczky, Y. Nishio, A. Ushida, and T. Roska, "CNN Based Adaptive Smoothing and Some Novel Types
of Nonlinear Operators for Grey-Scale Image Processing", in Proceedings of NOLTA'95, Las Vegas,
December 1995.
[35] T. Szirányi, "Robustness of Cellular Neural Networks in image deblurring and texture segmentation",
International Journal of Circuit Theory and Applications, Vol. 24, pp. 381-396, May 1996.
[36] Á. Zarándy, "The Art of CNN Template Design", International Journal of Circuit Theory and Applications,
Vol. 27, No. 1, pp. 5-23, 1999.
[37] M. Csapodi, J. Vandewalle, and T. Roska, "Applications of CNN-UM chips in multimedia authentication",
ESAT-COSIC Report / TR 97-1, Department of Electrical Engineering, Katholieke Universiteit Leuven, 1997.
[38] L. Nemes and L. O. Chua, "TemMaster - Template Design and Optimization Tool for Binary Input-Output CNNs,
User's Guide", Analogical and Neural Computing Laboratory, Computer and Automation Research Institute,
Hungarian Academy of Sciences (MTA-SzTAKI), Budapest, 1997.
[39] P. Szolgay and K. Tömördi, "Optical detection of breaks and short circuits on the layouts of printed circuit boards
using CNN", Proceedings of the International Workshop on Cellular Neural Networks and their Applications
(CNNA-96), pp. 87-91, Seville, 1996.
[40] S. Hvilsted and P. S. Ramanujam, "Side-chain liquid crystalline azobenzene polyesters with unique reversible
optical storage properties", Current Trends in Polymer Science, Vol. 1, pp. 53-63, 1996.
[41] S. Espejo, A. Rodríguez-Vázquez, R. A. Carmona, P. Földesy, Á. Zarándy, P. Szolgay, T. Szirányi, and
T. Roska, "0.8 μm CMOS Two Dimensional Programmable Mixed-Signal Focal-Plane Array Processor with
On-Chip Binary Imaging and Instruction Storage", IEEE Journal on Solid State Circuits, Vol. 32, No. 7,
pp. 1013-1026, July 1997.
[42] G. Liñán, S. Espejo, R. Domínguez-Castro, E. Roca, and A. Rodríguez-Vázquez, "CNNUC3: A Mixed-Signal
64x64 CNN Universal Chip", Proceedings of the International Conference on Microelectronics for Neural,
Fuzzy and Bio-inspired Systems (MicroNeuro'99), pp. 61-68, Granada, Spain, 1999.
[43] S. Ando, "Consistent Gradient Operations", IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 22, No. 3, pp. 252-265, March 2000.
[44] L. O. Chua, "CNN: a paradigm for complexity", World Scientific Series On Nonlinear Science, Series A, Vol.
31, 1998.
[45] L. Nemes, L.O. Chua, and T. Roska, Implementation of Arbitrary Boolean Functions on the CNN Universal
Machine, International Journal of Circuit Theory and Applications - Special Issue: Theory, Design and
Applications of Cellular Neural Networks: Part I: Theory, (CTA Special Issue - I), Vol. 26, No. 6, pp. 593-
610, 1998.
[46] I. Szatmári, Cs. Rekeczky, and T. Roska, "A Nonlinear Wave Metric and its CNN Implementation for Object
Classification", Journal of VLSI Signal Processing, Special Issue: Spatiotemporal Signal Processing
with Analogic CNN Visual Microprocessors, Vol. 23, No. 2/3, pp. 437-448, Kluwer, 1999.
[47] I. Szatmári, "The implementation of a Nonlinear Wave Metric for Image Analysis and Classification on the
64x64 I/O CNN-UM Chip", CNNA 2000, 6th IEEE International Workshop on Cellular Neural Networks and
their Applications, May 23-25, 2000, University of Catania, Italy.
[48] I. Szatmári, A. Schultz, Cs. Rekeczky, T. Roska, and L. O. Chua, "Bubble-Debris Classification via Binary
Morphology and Autowave Metric on CNN", IEEE Trans. on Neural Networks, Vol. 11, No. 6, pp.1385-1393,
November 2000.
[49] P. Földesy, L. Kék, T. Roska, Á. Zarándy, and G. Bártfai, "Fault Tolerant CNN Template Design and
Optimization Based on Chip Measurements", Proceedings of the IEEE International Workshop on Cellular
Neural Networks and their Applications (CNNA'98), pp. 404-409, London, 1998.
[50] P. Földesy, L. Kék, Á. Zarándy, T. Roska, and G. Bártfai, "Fault Tolerant Design of Analogic CNN Templates
and Algorithms - Part I: The Binary Output Case", IEEE Transactions on Circuits and Systems special issue
on Bio-Inspired Processors and Cellular Neural Networks for Vision, Vol. 46, No. 2, pp. 312-322, February
1999.
[51] Á. Zarándy, T. Roska, P. Szolgay, S. Zöld, P. Földesy, and I. Petrás, "CNN Chip Prototyping and Development
Systems", European Conference on Circuit Theory and Design - ECCTD'99, Design Automation Day
proceedings, (ECCTD'99-DAD), Stresa, Italy, 1999.
[52] I. Petrás and T. Roska, "Application of Direction Constrained and Bipolar Waves for Pattern Recognition",
Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications
(CNNA2000), pp. 3-8, Catania, Italy, 23-25 May, 2000.
[53] B. E. Shi, "Gabor-type filtering in space and time with cellular neural networks," IEEE Transactions on
Circuits and Systems-I: Fundamental Theory and Applications, vol. 45, pp. 121-132, 1998.
[54] G. Tímár, K. Karacs, and Cs. Rekeczky, "Analogic Preprocessing and Segmentation Algorithms for Off-line
Handwriting Recognition", IEEE Journal on Circuits, Systems and Computers, Vol. 12, No. 6, pp. 783-804,
Dec. 2003.
[55] L. Orzó and T. Roska, "A CNN image-compression algorithm for improved utilization of on-chip resources",
Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications
(CNNA2004), pp. 297-302, Budapest, Hungary, 22-24 July, 2004.
[56] I. Szatmári, "Synchronization Mechanism in Oscillatory Cellular Neural Networks", Research report of the
Analogical and Neural Computing Laboratory, Computer and Automation Research Institute, Hungarian
Academy of Sciences (MTA SzTAKI), DNS-1-2006, Budapest, 2006.
[57] I. Petrás, T. Roska, and L. O. Chua, "New Spatial-Temporal Patterns and The First Programmable On-Chip
Bifurcation Test-Bed", IEEE Trans. on Circuits and Systems I (TCAS I), Vol. 50, No. 5, pp. 619-633, May 2003.
[58] A. Gacsádi and P. Szolgay, "Image Inpainting Methods by Using Cellular Neural Networks", Proceedings of
the IEEE International Workshop on Cellular Neural Networks and their Applications (CNNA2005),
ISBN 0780391853, pp. 198-201, Hsinchu, Taiwan, 2005.
[59] L. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms", Physica D,
Vol. 60, pp. 259-268, 1992.
[60] A. Gacsádi and P. Szolgay, "A variational method for image denoising by using cellular neural networks",
Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications
(CNNA2004), ISBN 963-311-357-1, pp. 213-218, Budapest, Hungary, 2004.
[61] R. Matei, "New Image Processing Tasks On Binary Images Using Standard CNNs", Proceedings of the
International Symposium on Signals, Circuits and Systems, SCS'2001, pp. 305-308, July 10-11, 2001, Iaşi,
Romania.
[62] R. Matei, "Design Method for Orientation-Selective CNN Filters", Proceedings of the IEEE International
Symposium on Circuits and Systems, ISCAS 2004, May 23-26, 2004, Vancouver, Canada.
[63] CNN Young Researcher Contest, Analogic CNN Algorithm Design, 7th IEEE International Workshop on
Cellular Neural Networks and their Applications, Frankfurt-Germany, July 2002.
[64] D. Bálya, "CNN Universal Machine as Classification Platform: an ART-like Clustering Algorithm", Int.
Journal of Neural Systems, Vol. 13, No. 6, pp. 415-425, 2003.
[65] B. Roska and F. Werblin, "Rapid global shifts in natural scenes block spiking in specific ganglion cell types",
Nature Neuroscience, 2003.
[66] D. Bálya, "Sudden Global Spatial-Temporal Change Detection and its Applications", Journal of Circuits,
Systems, and Computers (JCSC), Vol. 12, No. 5, Aug-Dec 2003.
[67] Gy. Cserey, Cs. Rekeczky, and P. Földesy, "PDE Based Histogram Modification With Embedded
Morphological Processing of the Level-Sets", Journal of Circuits, Systems and Computers (JCSC), 2002.
[68] K. Karacs, G. Prószéky, and T. Roska, "Intimate Integration of Shape Codes and Linguistic Framework in
Handwriting Recognition via Wave Computers", European Conference on Circuit Theory and Design,
Kraków, Poland, Sept. 2003.
[69] Z. Szlávik and T. Szirányi, "Face Identification with CNN-UM", European Conference on Circuit Theory and
Design, Kraków, Poland, Sept. 2003.
[70] Cs. Rekeczky, G. Tímár, and Gy. Cserey, "Multi-Target Tracking With Stored Program Adaptive CNN
Universal Machines", in Proc. 7th IEEE International Workshop on Cellular Neural Networks and their
Applications, Frankfurt am Main, Germany, July 22-24, 2002, pp. 299-306.
[71] L. Török and Á. Zarándy, "CNN Optimal Multiscale Bayesian Optical Flow Calculation", European Conference
on Circuit Theory and Design, Kraków, Poland, Sept. 2003.
[72] Z. Fodróczi and A. Radványi, "Computational Auditory Scene Analysis in Cellular Wave Computing Framework",
International Journal of Circuit Theory and Applications, Vol. 34, No. 4, pp. 489-515, ISSN 0098-9886,
July 2006.
[73] L. Kék and Á. Zarándy, "Implementation of Large-Neighborhood Nonlinear Templates on the CNN Universal
Machine", International Journal of Circuit Theory and Applications, Vol. 26, No. 6, pp. 551-566, 1998.
[74] G. Constantini, D. Casali, M. Carota, and R. Perfetti, "Translation and Rotation of Grey-Scale Images by means
of Analogic Cellular Neural Network", Proceedings of the IEEE International Workshop on Cellular Neural
Networks and their Applications (CNNA2004), ISBN 963-311-357-1, pp. 213-218, Budapest, Hungary, 2004.
[75] M. Radványi, G. E. Pazienza, and K. Karacs, "Crosswalks Recognition through CNNs for the Bionic Camera:
Manual vs. Automatic Design, in Proc. of the 19th European Conference on Circuit Theory and Design,
Antalya, Turkey, 2009.
[76] L. O. Chua and L. Yang, "Cellular Neural Networks: Theory and Applications", IEEE Transactions
on Circuits and Systems, Vol. 35, No. 10, pp. 1257-1290, October 1988.
[77] L. O. Chua and T. Roska, "The CNN Paradigm", IEEE Transactions on Circuits and Systems - I, Vol. 40, No. 3,
pp. 147-156, March 1993.
[78] T. Roska and L. O. Chua, "The CNN Universal Machine: An Analogic Array Computer", IEEE Transactions on
Circuits and Systems - II, Vol. 40, pp. 163-173, March 1993.
[79] S. Espejo, R. Carmona, R. Domínguez-Castro, and A. Rodríguez-Vázquez, "A VLSI-Oriented Continuous-Time
CNN Model", International Journal of Circuit Theory and Applications, Vol. 24, pp. 341-356, May-June 1996.
[80] Cs. Rekeczky and L. O. Chua, "Computing with Front Propagation: Active Contour and Skeleton Models in
Continuous-time CNN", Journal of VLSI Signal Processing Systems, Vol. 23, No. 2/3, pp. 373-402,
November-December 1999.
[81] J. M. Cruz, L. O. Chua, and T. Roska, "A Fast, Complex and Efficient Test Implementation of the CNN
Universal Machine", Proc. of the Third IEEE Int. Workshop on Cellular Neural Networks and their
Applications (CNNA-94), pp. 61-66, Rome, Dec. 1994.
[82] H. Harrer, J. A. Nossek, T. Roska, and L. O. Chua, "A Current-mode DTCNN Universal Chip", Proc. of IEEE
Int. Symposium on Circuits and Systems, pp. 135-138, 1994.
[83] A. Paasio, A. Dawidziuk, K. Halonen, and V. Porra, "Minimum Size 0.5 Micron CMOS Programmable 48x48
CNN Test Chip", European Conference on Circuit Theory and Design, Budapest, pp. 154-15, 1997.
[84] Gustavo Liñán Cembrano, Ángel Rodríguez-Vázquez, Servando Espejo-Meana, and Rafael Domínguez-Castro,
"ACE16k: A 128x128 Focal Plane Analog Processor with Digital I/O", Int. J. Neural Syst., Vol. 13, No. 6,
pp. 427-434, 2003.
[85] S. Espejo, R. Carmona, R. Domínguez-Castro, and A. Rodríguez-Vázquez, "CNN Universal Chip in CMOS
Technology", Int. J. of Circuit Theory & Appl., Vol. 24, pp. 93-111, 1996.
[86] S. Espejo, R. Carmona, R. Domínguez-Castro, and A. Rodríguez-Vázquez, "A VLSI-Oriented Continuous-Time
CNN Model", International Journal of Circuit Theory and Applications, Vol. 24, pp. 341-356, May-June 1996.
[87] P. Dudek, "An asynchronous cellular logic network for trigger-wave image processing on fine-grain massively
parallel arrays", IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing,
Vol. 53, No. 5, pp. 354-358, 2006.
[88] A. Lopich and P. Dudek, "Implementation of an Asynchronous Cellular Logic Network As a Co-Processor for a
General-Purpose Massively Parallel Array", ECCTD 2007, Seville, Spain.
[89] A. Lopich and P. Dudek, "Architecture of asynchronous cellular processor array for image skeletonization",
European Conference on Circuit Theory and Design (ECCTD 2005), Vol. 3, pp. 81-84, 2005.
[90] P. Dudek and S. J. Carey, "A General-Purpose 128x128 SIMD Processor Array with Integrated Image Sensor",
Electronics Letters, Vol. 42, No. 12, pp. 678-679, June 2006.
[91] Z. Nagy and P. Szolgay, "Configurable Multi-Layer CNN-UM Emulator on FPGA", IEEE Transactions on
Circuits and Systems I: Fundamental Theory and Applications, Vol. 50, pp. 774-778, 2003.
[92] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell
multiprocessor", IBM Journal of Research and Development, Vol. 49, No. 4/5, July/September 2005.
[93] www.ti.com
[94] 176x144 Q-Eye chip, www.anafocus.com
[95] Video security application: http://www.objectvideo.com/
[96] Cs. Rekeczky, J. Mallett, and Á. Zarándy, "Security Video Analytics on Xilinx Spartan-3A DSP", Xcell
Journal, Issue 66, Fourth Quarter 2008, pp. 28-32.
[97] Á. Zarándy, "The Art of CNN Template Design", Int. J. Circuit Theory and Applications - Special Issue:
Theory, Design and Applications of Cellular Neural Networks: Part II: Design and Applications, (CTA Special
Issue - II), Vol.17, No.1, pp.5-24, 1999
[98] Á. Zarándy, P. Keresztes, T. Roska, and P. Szolgay, "CASTLE: An emulated digital architecture; design issues,
new results", Proceedings of 5th IEEE International Conference on Electronics, Circuits and Systems
(ICECS'98), Vol. 1, pp. 199-202, Lisboa, 1998.
[99] P. Keresztes, Á. Zarándy, T. Roska, P. Szolgay, T. Bezák, T. Hidvégi, P. Jónás, and A. Katona, "An emulated
digital CNN implementation", Journal of VLSI Signal Processing Special Issue: Spatiotemporal Signal
Processing with Analogic CNN Visual Microprocessors (JVSP Special Issue), Kluwer, November-December
1999.
[100] P. Földesy, Á. Zarándy, Cs. Rekeczky, and T. Roska, "Configurable 3D integrated focal-plane sensor-processor
array architecture", Int. J. Circuit Theory and Applications (CTA), pp. 573-588, 2008.
[101] L. O. Chua, T. Roska, T. Kozek, and Á. Zarándy, "CNN Universal Chips Crank up the Computing Power",
IEEE Circuits and Devices, pp. 18-28, July 1996.
[102] T. Roska, L. Kék, L. Nemes, Á. Zarándy, M. Brendel, and P. Szolgay, "CNN Software Library (Templates and
Algorithms) Version 7.2", (DNS-1-1998), Budapest, MTA SZTAKI, 1998,
http://cnn-technology.itk.ppke.hu/Library_v2.1b.pdf
[103] http://www.xilinx.com/support/documentation/data_sheets/ds706.pdf
Index 323
INDEX

1
1D CA Simulator, 300
1-DArraySorting, 111
5x5Halftoning2, 42
5x5InverseHalftoning, 48
5x5TextureSegmentation1, 89
5x5TextureSegmentation2, 91

A
ApproxDiagonalLineDetector, 23

B

C
Categorization of 2D operators, 244
CELLULAR AUTOMATA, 122
CenterPointDetector, 11
Classic DSP-memory architecture, 236
CNN MODELS OF SOME COLOR VISION PHENOMENA: SINGLE AND DOUBLE OPPONENCIES, 100
ContourExtraction, 17
CornerDetection, 18

D
DEPTH CLASSIFICATION, 101
DETECTION OF MAIN CHARACTERS, 199
DiagonalLineRemover, 20

E
EdgeDetection, 28

FUNCTION, 197

G
GAME OF LIFE, 206
GameofLife1Step, 105
GameofLifeDTCNN1, 106
GameofLifeDTCNN2, 107
GENERALIZED CELLULAR AUTOMATA, 126
GLOBAL DISPLACEMENT DETECTOR, 190
GlobalConncetivityDetection1, 16
GlobalConnectivityDetection, 14
GlobalMaximumFinder, 103
GRADIENT CONTROLLED DIFFUSION, 146
GradientDetection, 32
GradientIntensityEstimation, 4
GRAYSCALE MATHEMATICAL MORPHOLOGY, 63
GRAYSCALE SKELETONIZATION, 144
GrayscaleDiagonalLineDetector, 25
GrayscaleLineDetector, 77

H
HAMMING DISTANCE COMPUTATION, 208
HeatDiffusion, 27
HerringGridIllusion, 116
HISTOGRAM MODIFICATION WITH EMBEDDED MORPHOLOGICAL PROCESSING OF THE LEVEL-SETS, 164
HistogramGeneration, 104
HOLE DETECTION IN HANDWRITTEN WORD IMAGES, 168
Hole-Filling, 44
HorizontalHoleDetection, 8

I
ImageDenoising, 131
ImageDifferenceComputation, 94
ImageInpainting, 129
ISOTROPIC SPATIO-TEMPORAL PREDICTION CALCULATION BASED ON PREVIOUS IMAGES, 170

L
LaplacePDESolver, 263
LE3pixelLineDetector, 79
LE7pixelVerticalLineRemover, 76
LeftPeeler, 55
Linear templates specification, 287
LinearTemplateInversion, 136
LocalConcavePlaceDetector, 75
LocalMaximaDetector, 52
LocalSouthernElementDetector, 50
LogicANDOperation, 81
LogicDifference1, 82
LogicNOTOperation, 83
LogicOROperation, 84
LogicORwithNOT, 85

M
MajorityVoteTaker, 108
Many-core hierarchical graphic processor unit (GPU), 243
MaskedCCD, 10
MaskedObjectExtractor, 31
MaskedShadow, 57
MATCNN simulator references, 297
MAXIMUM ROW(S) SELECTION, 160
MedianFilter, 53
MotionDetection, 95
MüllerLyerIllusion, 117
MULTI SCALE OPTICAL FLOW, 174
Multi-core heterogeneous processors array with high-performance kernels (CELL), 242
MULTIPLE TARGET TRACKING, 158

O
ObjectIncreasing, 45

P
PointRemoval, 34
PoissonPDESolver, 264

R
RightEdgeDetection, 56
Rotation, 139
RotationDetector, 26
ROUGHNESS MEASUREMENT VIA FINDING CONCAVITIES, 213
Running a CNN Simulation, 290

S
Sample Analogic CNN Algorithm, 295
Sample CNN Simulation with a Nonlinear Template, 292, 294
Sample CNN Simulation with a Linear Template, 292
SCRATCH REMOVAL, 217
SelectedObjectsExtraction, 35
ShadowProjection, 58
SHORTEST PATH, 147
SmallObjectRemover, 87
Smoothing, 5

T
TextureDetector2, 92
TextureDetector3, 92
Threshold, 61
ThresholdedGradient, 37
Translation(dx,dy), 138
Two-Layer Gabor, 135

V
VERTICAL WING ENDINGS DETECTION OF AIRPLANE-LIKE OBJECTS, 226
VerticalHoleDetection, 9
VerticalLineRemover, 21
VerticalShadow, 59

W
WhiteFiller, 66
WhitePropagation, 68
Index (old names) 327

B
BLACK, 65

C
CCD_DIAG, 7
CCD_HOR, 8
CCD_VERT, 9
CCDMASK, 10
CENTER, 11
CONCCONT, 13
Connectivity, 14
CONTOUR, 17
ContourDetector, 17
CORNER, 18
CornerDetector, 18
CUT7V, 76

D
DELDIAG1, 20
DELVERT1, 21
DIAG, 23
DIAG1LIU, 24
DIAGGRAY, 25
DIFFUS, 27

E
EDGE, 28
EdgeDetector, 28
ERASMASK, 31
EXTREME, 32

F
FIGDEL, 33
FIGEXTR, 34
FINDAREA, 36
FramedAreasFinder, 36

G
GLOBMAX, 103
GRADIENT, 37

H
HISTOGR, 104
HistogramComputation, 104
HLF3, 38
HLF33, 38
HLF5, 42
HLF55, 42
HLF55_KC, 40
HLF5KC, 40
HOLE, 44
HoleFiller, 44
HOLLOW, 69
HorizontalCCD, 8

I
INCREASE, 45
INTERP, 71
INTERPOL, 71
INV, 83
INVHLF3, 46
INVHLF33, 46
INVHLF5, 48
INVHLF55, 48
INV-OR, 85

J
JUNCTION, 73

L
LCP, 75
LeftShadow, 58
LGTHTUNE, 79
LIFE_1, 105
LIFE_1L, 106
LIFE_DT, 107
LINCUT7V, 76
LINE3060, 77
LINEXTR3, 79
LOGAND, 81
LOGDIF, 82
LOGDIFNF, 94
LogicAND, 81
LogicDifference2, 94
LogicNOT, 83
LogicOR, 84
LOGNOT, 83
LSE, 50

M
MAJVOT, 108
MASKSHAD, 57
MATCH, 51
MAXLOC, 52
MD_CONT, 96
MEDIAN, 53
MOTDEPEN, 95
MOTINDEP, 96
MotionDetection1, 95
MotionDetection2, 96
MOVEHOR, 95

N
NEL_AINTPOL3, 129

O
OR, 84

P
PA-PB, 82
PA-PB_F1, 94
PARITY, 109
PATCHMAK, 86
PEELHOR, 55

R
RECALL, 35
RIGHTCON, 56
RightContourDetector, 56

S
SHADMASK, 57
SHADOW, 58
SHADSIM, 59
SMKILLER, 87
SORTING, 111
SUPSHAD, 59

T
TRACE, 99
TRESHOLD, 61
TX_HCLC, 89
TX_RACC3, 90
TX_RACC5, 91

V
VerticalCCD, 9

W
WHITE, 66