Feature Assignment in DDAG


Santosh Arvind Adimoolam

CONTENTS

I. Error of DDAG architecture
   I-A. Error of paths
II. Feature Assignment
   II-A. Feature assignment to subDAGs
III. Algorithm
   III-A. Complexity
   III-B. Alternate algorithm
IV. Feature extraction techniques
   IV-A. Principal Component Analysis
   IV-B. Multiclass linear discriminant analysis
   IV-C. Discrete Fourier transform
   IV-D. Discrete cosine transform
   IV-E. Central Moments
   IV-F. Distance transform

I. ERROR OF DDAG ARCHITECTURE

Definition I-.1. Let $N_0$ be the root node of a DAG and let $N_1$ and $N_2$ be its daughter nodes, with $N_1$ on the left and $N_2$ on the right as shown in figure 1. The transmission probability $t_{N_0}$ of the node is defined as the fraction of samples in the union of all classes that pass to $N_1$. We may also call this the left transmission probability. The right transmission probability is 1 minus the left transmission probability.

Fig. 1. Transmission probability.

Remark I-.2. Let $Q_{N_0}$ be the probability of misclassification at the node $N_0$. A change in $Q_{N_0}$ affects the transmission probability $t_{N_0}$, and vice versa.

A. Error of paths

Implicitly, we will always be referring to paths from the root to the end nodes of the DAG unless otherwise specified. Consider a DAG architecture for n classes consisting of ${}^nC_2$ nodes. A sample passes through a path of n nodes in the DAG. However, if it is misclassified, then the misclassification has occurred at only one particular node, not at multiple nodes. In the DAG hierarchy, we write $N_j < N_i$ if the node $N_i$ is higher than $N_j$ in the hierarchy. Let X be any node in a path P. We denote X as $X_l$ or $X_r$ depending on whether X is a left or right daughter of its parent node in the path. The root node is included among the left nodes, and the transmission probability over an empty set of nodes is taken to be 1. Then the error of misclassification at a node $N_j$ via a path P in the DAG is
$$\Bigl(\prod_{X_l \in P,\; X_l < N_j} t_{X_l}\Bigr)\Bigl(\prod_{X_r \in P,\; X_r < N_j} (1 - t_{X_r})\Bigr)\, Q_{N_j} \tag{1}$$
Call the above expression $Q(P, N_j)$, where $Q(P, N_j)$ is the probability of misclassification at a node $N_j$ via the path P. Then the total error of misclassification of the DAG is given by
$$\sum_{j=0}^{{}^nC_2 - 1} \;\sum_{P} Q(P, N_j) \tag{2}$$
Remark I-A.1. The total error of the DAG architecture is the summation of the errors of the paths in the DAG over all possible paths from the root to the end nodes. Do not confuse the misclassification probability at a node, $Q_N$, with the total probability of misclassification $Q_{(D,N)}$ at that node in the DAG. Also, $Q_{D(N)}$, the misclassification probability of the entire subDAG with root node N, is different from $Q_{(D,N)}$; this notation is introduced later. $Q_{(D,N)}$ is given by

$$Q_{(D,N)} = \sum_{P} Q(P, N)$$
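To make equations (1) and (2) concrete, here is a minimal Python sketch for a 3-class DDAG with root $N_0$ and daughter nodes $N_1$ (left) and $N_2$ (right). The node errors Q and transmission probabilities t are made-up illustrative values, not numbers from the paper.

    # Hypothetical per-node misclassification probabilities Q_N and left
    # transmission probabilities t_N for a 3-class DDAG (illustrative only).
    Q = {"N0": 0.05, "N1": 0.02, "N2": 0.03}
    t = {"N0": 0.60}

    def q_path_node(prefix, node):
        """Equation (1): probability of misclassifying at `node` after the
        given path prefix.  `prefix` lists (ancestor, side) pairs; each left
        turn contributes t_{X_l}, each right turn contributes (1 - t_{X_r})."""
        p = 1.0
        for ancestor, side in prefix:
            p *= t[ancestor] if side == "left" else 1.0 - t[ancestor]
        return p * Q[node]

    q_root = q_path_node([], "N0")                  # empty prefix: just Q_N0
    q_left = q_path_node([("N0", "left")], "N1")    # 0.6 * 0.02 = 0.012
    q_right = q_path_node([("N0", "right")], "N2")  # 0.4 * 0.03 = 0.012

    # Summing Q(P, N_j) over every node with its path prefix, as in
    # equation (2), gives the total error of this small DDAG:
    print(q_root + q_left + q_right)   # 0.05 + 0.012 + 0.012 = 0.074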
II. FEATURE ASSIGNMENT

Definition II-.2. Let S be the set of nodes of a DAG D and let C be a set of classes of available features. Then a feature assignment to D among C is a function $Fa_D : S \to C$.
The error of misclassification $Q_D$ depends on the function $Fa_D$. Given a DAG architecture D, our problem is to find the function $Fa_D$ for which $Q_D$ is least.
We use DAG(N) to denote a DAG with root node N. If N is a daughter node in a DAG, then DAG(N) denotes the subDAG with root node N. $Q(Fa_D)$ denotes the total misclassification probability of a DAG D with respect to the feature assignment $Fa_D$. If N is any node in D, then $Fa_{D(N)}$ is the function $Fa_D$ restricted to the subDAG D(N), while $Fa_D(N)$ is the value of the function $Fa_D$ at the node N. The two expressions look very similar but have different meanings.
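As a minimal sketch of this notation (the node and feature-class names below are hypothetical), a feature assignment can be held as a mapping from nodes to feature classes; then $Fa_D(N)$ is a single lookup, while $Fa_{D(N)}$ is that mapping restricted to the nodes of the subDAG rooted at N:

    from dataclasses import dataclass
    from typing import Dict, Optional, Set

    @dataclass(frozen=True)
    class Node:
        name: str
        left: Optional["Node"] = None   # left daughter
        right: Optional["Node"] = None  # right daughter

    def subdag_nodes(n: Optional[Node]) -> Set[str]:
        """Names of all nodes in the subDAG D(n) rooted at n."""
        if n is None:
            return set()
        return {n.name} | subdag_nodes(n.left) | subdag_nodes(n.right)

    # A 3-class DDAG: root N0 with daughters N1 and N2 (names are made up).
    n1, n2 = Node("N1"), Node("N2")
    n0 = Node("N0", left=n1, right=n2)

    fa_d: Dict[str, str] = {"N0": "DFT", "N1": "PCA", "N2": "moments"}

    fa_at_n0 = fa_d["N0"]                       # Fa_D(N0): a single value
    fa_d_n1 = {k: v for k, v in fa_d.items()
               if k in subdag_nodes(n1)}        # Fa_{D(N1)}: a restricted map
    print(fa_at_n0, fa_d_n1)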
Definition II-.3. A feature assignment $Fa_D$ to D is said to be optimal if $Q(Fa_D) \le Q(Fa'_D)$ for every other feature assignment $Fa'_D$.
Fig. 2. DDAG.
A. Feature assignment to subDAGs

The following theorem leads us to propose an algorithm for optimal feature assignment.

Theorem II-A.1. Consider a DAG D with root node $N_0$ and with $N_1$ and $N_2$ as its two daughter nodes, as shown in figure 2. Let $Fa_D$ be the feature assignment to D. If $Fa_{D(N_1)}$ and $Fa_{D(N_2)}$ are optimal feature assignments, then by changing only $Fa_D(N_0)$ to a value for which $Q(Fa_D)$ is least, we get an optimal feature assignment.

The theorem states that if the two subDAGs at the daughter nodes of the root are optimal with respect to the feature assignment of the DAG, then optimizing the DAG by changing only the value at the root node gives an optimal DAG.

Proof: Let t be the transmission probability of $N_0$. Using equations (1) and (2) we arrive at

$$Q_D = Q_{N_0} + t\,Q_{D(N_1)} + (1-t)\,Q_{D(N_2)} \tag{3}$$
Let $Fa^1_D$ be an optimal feature assignment to D which is different from the $Fa_D$ in the theorem. Define $Fa^2_D$ by

$$Fa^2_{D(N_1)} = Fa_{D(N_1)}, \qquad Fa^2_{D(N_2)} = Fa_{D(N_2)}, \qquad Fa^2_D(N_0) = Fa^1_D(N_0).$$

The transmission probability of a node depends only on the feature assigned to that node and not on any other node. The transmission probability of $N_0$ is therefore the same under $Fa^2_D$ and $Fa^1_D$, since $Fa^2_D(N_0) = Fa^1_D(N_0)$. Let t be this transmission probability. By equation (3),
$$Q(Fa^2_D) = Q\bigl(Fa^2_D(N_0)\bigr) + t\,Q\bigl(Fa^2_{D(N_1)}\bigr) + (1-t)\,Q\bigl(Fa^2_{D(N_2)}\bigr) \tag{4}$$

$$Q(Fa^1_D) = Q\bigl(Fa^1_D(N_0)\bigr) + t\,Q\bigl(Fa^1_{D(N_1)}\bigr) + (1-t)\,Q\bigl(Fa^1_{D(N_2)}\bigr) \tag{5}$$
$Q(Fa^1_D(N_0)) = Q(Fa^2_D(N_0))$ by the definition of the function $Fa^2_D$. Subtracting equation (5) from (4), we get

$$Q(Fa^2_D) - Q(Fa^1_D) = t\bigl[Q(Fa^2_{D(N_1)}) - Q(Fa^1_{D(N_1)})\bigr] + (1-t)\bigl[Q(Fa^2_{D(N_2)}) - Q(Fa^1_{D(N_2)})\bigr] \tag{6}$$
$Fa^2_{D(N_1)}$ and $Fa^2_{D(N_2)}$ are optimal because they are the same as $Fa_{D(N_1)}$ and $Fa_{D(N_2)}$. Hence, equation (6) implies

$$Q(Fa^2_D) - Q(Fa^1_D) \le 0.$$

However, $Fa^1_D$ is an optimal assignment, so $Q(Fa^2_D) = Q(Fa^1_D)$ and $Fa^2_D$ is also optimal. Now $Fa_{D(N_1)} = Fa^2_{D(N_1)}$ and $Fa_{D(N_2)} = Fa^2_{D(N_2)}$; only $Fa_D(N_0)$ is different. So, by changing $Fa_D(N_0)$ to $Fa^2_D(N_0)$, we get an optimal assignment, the same as $Fa^2_D$. This proves the theorem.

Remark II-A.2. If $Fa_D$ is a feature assignment to D, changing $Fa_D(N)$ does not change the total probability of misclassification $Q(Fa_{D(M)})$ of a subDAG D(M) if N is not a node in D(M).

III. ALGORITHM

Remark II-A.2 implies that all nodes at the same level can be optimized simultaneously with respect to their subDAGs. This result is used in the algorithm.

Algorithm III-.1. For i = n − 2 down to 0: optimize all nodes at level i with respect to their subDAGs.

The algorithm follows directly from the theorem. Notice that it proceeds by optimizing from the end nodes to the start node. By optimizing a node we mean reducing the error of the entire subDAG with that node as its root by changing only the feature assigned to that one particular node. This does not entail that the error at the node itself reduces.

A. Complexity

At every node we run a subDAG to assign a feature to the node. So the total number of classifications is

$$\sum_{i=1}^{n-1} (n-i)\,{}^{i+1}C_2 = \frac{n^4 - 7n^2 + 6n}{12} \tag{7}$$

This is multiplied by the number of feature classes C. As the complexity is high, a large number of samples cannot be used for training. If, however, we can estimate the transmission probability of every node, then we can use the simpler algorithm proposed below.

B. Alternate algorithm

Consider a node N and its two daughter nodes, $N_l$ and $N_r$. If t is the transmission probability of N, the following equation holds:

$$Q_{D(N)} = Q_N + t\,Q_{D(N_l)} + (1-t)\,Q_{D(N_r)}.$$

Algorithm III-B.1. For i = n − 2 down to 0: at every node N at level i, choose the feature minimizing $Q_N + t\,Q_{D(N_l)} + (1-t)\,Q_{D(N_r)}$ and store $Q_{D(N)} = Q_N + t\,Q_{D(N_l)} + (1-t)\,Q_{D(N_r)}$.
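A minimal Python sketch of Algorithm III-B.1 follows, under its stated assumption that the node error and transmission probability of each candidate feature can be estimated; estimate_q and estimate_t are hypothetical estimator callbacks, not functions defined in the paper.

    def optimize_ddag(levels, feature_classes, estimate_q, estimate_t):
        """levels[i] lists the nodes at level i; each node has .left and
        .right daughters at level i + 1 (None below the end nodes).
        Returns the chosen feature per node and the stored Q_{D(N)}."""
        fa, q_sub = {}, {}                      # assignment and Q_{D(N)} table
        for level in reversed(levels):          # i = n - 2 down to 0
            for node in level:
                best_f, best_q = None, float("inf")
                for f in feature_classes:
                    q_n = estimate_q(node, f)   # estimated Q_N for feature f
                    t = estimate_t(node, f)     # estimated transmission prob.
                    q = (q_n
                         + t * q_sub.get(node.left, 0.0)           # t Q_{D(N_l)}
                         + (1.0 - t) * q_sub.get(node.right, 0.0)) # (1-t) Q_{D(N_r)}
                    if q < best_q:              # optimize over feature classes
                        best_f, best_q = f, q
                fa[node] = best_f
                q_sub[node] = best_q            # store Q_{D(N)}
        return fa, q_sub

Because each level only reads the stored $Q_{D(N)}$ values of the level below, every node is visited once per feature class, which is where the ${}^nC_2 \cdot C$ count below comes from.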
The complexity of Algorithm III-B.1 is ${}^nC_2 \cdot C$, which is second degree in n. This is much faster than the previous algorithm. However, it may be less accurate, because it is difficult to estimate the transmission probability of every node accurately.
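As an illustration, take n = 10 classes and C = 5 feature classes (arbitrary sample values); the count of equation (7) multiplied by C, versus ${}^nC_2 \cdot C$, is then

$$\frac{10^4 - 7\cdot 10^2 + 6\cdot 10}{12} \cdot 5 = 780 \cdot 5 = 3900 \qquad\text{versus}\qquad {}^{10}C_2 \cdot 5 = 45 \cdot 5 = 225.$$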
IV. FEATURE EXTRACTION TECHNIQUES

A. Principal Component Analysis

Pseudocode:
1) Calculate the d × d covariance matrix Σ.
2) Find the d′ largest eigenvalues of Σ.
3) Find the d′ eigenvectors corresponding to those eigenvalues.
4) Project the data matrix to d′ dimensions.

Σ has d² entries and each entry requires n multiplications, so the number of multiplications for step 1 is d²n. In steps 2 and 3, calculating eigenvalues and the corresponding eigenvectors in general requires a QR decomposition, which takes about 25d³ operations. Projecting to d′ dimensions in the final step requires n·d·d′ multiplications. Total number of multiplications: d²n + 25d³ + n·d·d′.

B. Multiclass linear discriminant analysis

Pseudocode:
1) Find the mean of the samples in every class.
2) Find the covariance matrix of the means, Σ_b.
3) Find the covariance matrix of all samples, Σ.
4) Calculate A = Σ⁻¹Σ_b.
5) Calculate the eigenvalues and eigenvectors of A.

The eigenvectors are used for feature extraction by projection. Calculating the covariance matrix Σ_b in step 2 requires C·d² multiplications, where C is the number of classes and d is the dimension of the vectors. Step 3 requires n·d² multiplications, where n is the total number of samples. In step 4, calculating the inverse of Σ by Gauss–Jordan elimination requires d(d−1)(4d+1)/6 multiplications and d(3d−1)/2 divisions, nearly d³ operations in all. Multiplying Σ⁻¹ and Σ_b requires d³ multiplications. Calculating the eigenvalues and eigenvectors by QR decomposition requires about 25d³ multiplications. The total number of multiplications is (26 + 2/3)d³ + (n + C − 0.5)d², and the total number of divisions is d(3d−1)/2.
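A minimal numpy sketch of the PCA pseudocode in IV-A follows (the data matrix X, with n = 200 and d = 16, and the reduced dimension d′ = 4 are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))        # n = 200 samples of dimension d = 16
    d_prime = 4                           # reduced dimension d'

    Xc = X - X.mean(axis=0)               # center the data
    cov = Xc.T @ Xc / (len(X) - 1)        # step 1: d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)      # steps 2-3: eigenvalues/eigenvectors
    W = vecs[:, np.argsort(vals)[::-1][:d_prime]]  # d' leading eigenvectors
    Y = Xc @ W                            # step 4: project to d' dimensions
    print(Y.shape)                        # (200, 4)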
C. Discrete Fourier transform

Let $(x_1, x_2, \ldots, x_d)$ be a vector of dimension d. If we want to obtain a d′-dimensional vector by eliminating the higher frequencies, a discrete Fourier transform can be used.

Pseudocode:
1) Initialize $y_1, y_2, \ldots, y_{d'}$.
2) For t = 1 to d′:

$$y_t = \sum_{j=1}^{d} x_j \exp\Bigl(\frac{-2\pi i\,(j-1)(t-1)}{d}\Bigr)$$

The number of multiplications for the discrete Fourier transform is d·d′.

D. Discrete cosine transform

Let $(x_1, x_2, \ldots, x_d)$ be a vector of dimension d. If we want to obtain a d′-dimensional vector by eliminating the higher frequencies, a discrete cosine transform can be used.

Pseudocode:
1) Initialize $y_1, y_2, \ldots, y_{d'}$.
2) For t = 1 to d′:

$$y_t = 0.5\bigl(x_1 + (-1)^{t-1} x_d\bigr) + \sum_{j=2}^{d-1} x_j \cos\Bigl(\frac{\pi (t-1)(j-1)}{d-1}\Bigr)$$

The number of multiplications for the discrete cosine transform is d·d′.
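A minimal numpy sketch of both truncations above (the input vector and d′ = 3 are illustrative; numpy's FFT replaces the explicit DFT loop, and the DCT-I style formula of IV-D is written out directly):

    import numpy as np

    x = np.array([1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 1.5, -2.0])  # d = 8
    d_prime = 3

    # DFT truncation: keep the first d' frequency coefficients.
    y_dft = np.fft.fft(x)[:d_prime]
    print(y_dft)

    # DCT truncation, following the formula in IV-D.
    d = len(x)
    t0 = np.arange(d_prime)[:, None]      # t - 1 for each output coefficient
    j0 = np.arange(1, d - 1)[None, :]     # j - 1 for the interior terms
    y_dct = (0.5 * (x[0] + (-1.0) ** t0[:, 0] * x[-1])
             + (x[1:-1] * np.cos(np.pi * t0 * j0 / (d - 1))).sum(axis=1))
    print(y_dct)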
E. Central Moments

If an image is two-dimensional, let f(x, y) be the pixel value at a point (x, y). Then, for integers j and k, the central moment $\mu_{j,k}$ is given by:

$$\mu_{j,k} = \frac{\sum_{x,y} (x - \bar{x})^j (y - \bar{y})^k f(x, y)}{\sum_{x,y} f(x, y)}$$

The number of multiplications equals the number of pixels in the image; the number of divisions is 1.

F. Distance transform

The distance transform maps each pixel in the image to its nearest boundary pixel. If we take an n × m binary image, then nm searches have to be performed. Each search takes nm/d² queries, where d is the reduced dimension. The complexity is therefore n²m²/d² queries for an n × m binary image reduced to a d-dimensional vector.
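A minimal numpy sketch of the central-moment formula in IV-E (the 3 × 4 image and the choice j = k = 2 are illustrative):

    import numpy as np

    img = np.arange(12, dtype=float).reshape(3, 4)   # f(x, y) on a 3 x 4 grid
    x, y = np.indices(img.shape)
    total = img.sum()
    x_bar = (x * img).sum() / total                  # intensity-weighted mean x
    y_bar = (y * img).sum() / total                  # intensity-weighted mean y

    j, k = 2, 2
    mu_jk = ((x - x_bar) ** j * (y - y_bar) ** k * img).sum() / total
    print(mu_jk)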