
AGE AND GENDER ESTIMATION FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN FRAMEWORK

A thesis submitted in partial fulfilment of the requirements for the award of the degree of

M.Tech. in COMMUNICATION SYSTEMS

By PUJARI SUJAY GIRISH

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

NATIONAL INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI - 620 015

MAY 2011

BONAFIDE CERTIFICATE

This is to certify that the project titled AGE AND GENDER ESTIMATION FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN FRAMEWORK is a bonafide record of the work done by

PUJARI SUJAY GIRISH (208109013)

in partial fulfilment of the requirements for the award of the degree of Master of Technology in Communication Systems of the NATIONAL INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2010-2011.

S. DEIVALAKSHMI (Guide)                         Head of the Department

Project Viva-voce held on _____________________________

Internal Examiner

External Examiner

ABSTRACT
In the field of speech processing, applications such as interactive voice response (IVR) systems and artificial intelligence need to replicate human behaviour; one such behaviour is the human auditory faculty of perceiving the sex and approximate age of a speaker. Using selected features extracted from an unknown speaker's voice, the proposed automated system estimates the age group and gender of that person. For classification, seven classes are defined, namely young male/female, adult male/female, senior male/female, and child. The system consists of two main parts: feature extraction from real-time samples captured from a microphone, followed by a two-stage feature classification. Pitch, MFCC and delta-MFCC features are considered for extraction. For classification, a combination of canonical discriminant analysis and an NN framework is applied. For the experiments, the required stimulus databases were collected from 192 different speakers.

Keywords: Speech Processing, Age, Gender, Discriminant Analysis, Neural Network.

ACKNOWLEDGEMENTS

I take this opportunity to express my sincere thanks and deep sense of gratitude to my project guide Mrs S. Deivalakshmi, Assistant Professor, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, for her guidance and kind co-operation.

With immense pleasure, I record my profound gratitude and indebtedness to Prof. Sanjay Patil, Department of Electronics and Telecommunication, Maharashtra Academy of Engineering, Pune University, for his helpful suggestions and guidance.

I would like to express my sincere thanks to Prof. P. Somaskandan, Professor and Head of the Department, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, for providing all departmental facilities for the successful completion of this project.

I express my deep sense of gratitude to Dr S. Raghavan, Professor, Department of ECE, and Mr M. Bhaskar, Associate Professor, Department of ECE, for giving me the much-required lab facilities; I would also like to thank them for their motivation and support.

My special thanks to Anil, Jamuna, Nithyananth, Senkathir, Kishore and Pardu for their encouragement and invaluable help to collect audio database.

I would like to thank all the teaching staff, my classmates, and the computer support group staff for their sincere help. Last but not least, I dedicate this work to my parents and my family.

Sujay Pujari
May 2011

TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1  INTRODUCTION
  1.1 Objectives and Approach
  1.2 Database Collection
  1.3 Study Outline
CHAPTER 2  LITERATURE REVIEW
CHAPTER 3  FEATURE EXTRACTION
  3.1 Pitch
  3.2 MFCC (Mel Frequency Cepstral Coefficient)
  3.3 Windowing
CHAPTER 4  FEATURE CLASSIFICATION
  4.1 First Stage with Discriminant Analysis
  4.2 Second Stage with NN Frameworks
CHAPTER 5  RESULTS AND DISCUSSION
  5.1 Unknown Stimuli Results
  5.2 Classification Stage One Results
  5.3 Classification Stage Two Results
CHAPTER 6  CONCLUSION AND FURTHER WORK
REFERENCES

LIST OF FIGURES
1.1  Proposed system
1.2  Snapshot of recording with Wave Surfer
3.1  An example input sinusoidal signal
3.2  Autocorrelation of a given input frame
3.3  Number of samples between two maxima
3.4  MFCC feature extraction steps
3.5  Mel frequency vs. frequency
3.6  Mel filter bank
3.7  Hamming window
3.8  Overlapped frames followed by windowing function
3.9  Reconstructed waveform (above) after windowing and original wave (below)
3.10 Cross-correlation between original signal and reconstructed one
3.11 Welch method: all periodograms and their average
4.1  Abstract flow of proposed classification stages
4.2  Classification based on discriminant analysis followed by decision based on Euclidean distance for stage-one classification
4.3  Equivalent decision C1 of NN framework based on 3 neural network outputs
4.4  Euclidean distance method for decision in stage 2
4.5  Neural network structure for NNA, NNB and NNC
4.6  Classification algorithm in stage 2
4.7  Matlab nprtool for NN implementation
5.1  Average pitch for females of all 4 classes from Database 1
5.2  Average pitch for males of all 4 classes from Database 1
5.3  Waveform of one of the records from Database 1
5.4  Pitch track for waveform shown in Fig. 5.3
5.5  Pitch track - unknown stimulus
5.6  13 MFCC coefficients - unknown stimulus
5.7  12 dMFCC coefficients - unknown stimulus
5.8  11 ddMFCC coefficients - unknown stimulus
5.9  Feature vector of 37 x 1 for unknown stimulus
5.10 Discriminant score plot for all 3 groups
5.11 NN1 framework, NNA network
5.12 NN1 framework, NNB network
5.13 NN1 framework, NNC network
5.14 NN2 framework, NNA network
5.15 NN2 framework, NNB network
5.16 NN2 framework, NNC network
5.17 Stage 1 + NN2 framework, all females
5.18 Stage 1 + NN1 framework, all males
5.19 Overall classification result
6.1  Comparison chart for successful estimation of class

LIST OF TABLES
1.1  Classification groups
5.1  Neural network outputs - unknown stimulus
5.2  Canonical discriminant function coefficients
5.3  Functions at group centroids
5.4  Classification result of stage 1 with Database 1

ABBREVIATIONS

CDF    Canonical Discriminant Function
DA     Discriminant Analysis
DS     Discriminant Score
MFCC   Mel Frequency Cepstrum Coefficients
NN     Neural Network
YM     Young Male
YF     Young Female
AM     Adult Male
AF     Adult Female
SM     Senior Male
SF     Senior Female
DCT    Discrete Cosine Transform
PSD    Power Spectral Density

CHAPTER-1 INTRODUCTION

Automatic speech recognition (ASR) based algorithms are widely deployed for customer care and service applications. ASR research is currently moving from mere speech-to-text (STT) systems towards rich transcription (RT) systems, which annotate recognized text with non-verbal information such as speaker identity and emotional state. In interactive voice response systems, this approach is already being used to identify dialogs involving angry customers, which can then be analyzed with the goal of automatically identifying problematic dialogs, transferring unsatisfied customers to an agent, and other purposes. The first adaptive dialog systems are also now appearing, particularly in systems exposed to inhomogeneous user groups. These can adapt the degree of automation, the order of presentation, the waiting-queue music, or other properties to properties of the caller such as age or gender. As an example, it would be possible to offer different advertisements to children and adults in the waiting queue. In non-personalized services, speaker classification will be based on the caller's speech data. While classifier performance is only one factor influencing the utility of the above approach in an IVR system, it is certainly a major one. The algorithm proposed here for automatic age and gender estimation serves the same purpose: it classifies a speaker's voice into one of several classes, predicting the speaker's gender and approximate age group.

1.1 Objectives and Approach

The ultimate aim of the proposed system is to predict the age group and gender of a speaker from a stimulus of any length, in real time. Such systems mainly consist of two stages: feature extraction and selection, followed by classification based on the extracted features. To find features that take distinct values for different classes, a database is needed; so one of the first tasks was to collect an audio database, followed by feature extraction and classification. For feature extraction, pitch, MFCC and delta-MFCC coefficients are worked out. In the following classification stage we adopted two different methods, namely CDA and an NN framework. The neural networks were trained with the help of all 290 stimuli present in the database, and such trained networks are then used for real-time classification.

[Figure: audio input (human voice) -> feature set extraction -> classification stage 1 based on discriminant analysis -> classification stage 2 based on NN frameworks -> one of seven classes: Child (1), Young Female (2), Adult Female (3), Senior Female (4), Young Male (5), Adult Male (6), Senior Male (7)]

Fig. 1.1 Proposed system

1.2 Database Collection

For classification purposes we had collected data for the following 8 different groups:

I.    Child Boy (age <15)
II.   Child Girl (age <15)
III.  Young Men (age <30)
IV.   Young Women (age <30)
V.    Adult Gents (age <55)
VI.   Adult Ladies (age <55)
VII.  Senior Citizen Male (age >55)
VIII. Senior Citizen Female (age >55)

At the start, we collected 2 stimuli from 105 speakers, namely: 1) singing HAPPY BIRTHDAY, and 2) telling your name in your mother tongue, for example <My Name is Sujay>, or <Maz nav ... ahe.> (Marathi), or <En peyar ...> (Tamil), or <Naa peru ...> (Telugu), with the specifications:
a) Sampling rate Fs = 8000 samples/sec
b) Bits per sample = 16
c) Mono channel

By the end of the experiments on these 2 stimuli it was quite clear that distinguishing between groups I and II was practically impossible. So we finally fixed the classification to 7 classes by merging groups I and II into a single group known as Child, and adopted the classification groups given in Table 1.1.

Table 1.1 Classification groups

Group no   Group Symbol   Category
1          C              Child (age <18)
2          YF             Young Female (age <30)
3          AF             Adult Female (age <55)
4          SF             Senior Female (age >55)
5          YM             Young Male (age <30)
6          AM             Adult Male (age <55)
7          SM             Senior Male (age >55)

3) After this we adopted a stimulus of OM, with the condition that it be extended to more than 10 seconds on a single breath, like OOOOOOOOOOOMMMMMMMMMMMmmmmmmmm. We have 87 such sample recordings with the following specifications:
A. Sampling rate Fs = 16000 samples/sec
B. Bits per sample = 16
C. Mono channel

Fig. 1.2 Snapshot of recording with Wave Surfer

For recording we used an open-source tool known as Wave Surfer 1.8.8p3 for recording and editing with the desired specifications. We refer to these 3 databases as Database 1, Database 2 and Database 3, in which all files are stored in the form Name_Age.wav.

1.3 Study Outline

This thesis is organized as follows. Chapter 2 reviews the literature and the background of the algorithms adopted towards estimation of age and gender. The materials and methods used in this study are discussed in Chapters 3 and 4: Chapter 3 deals with the first part, feature extraction, and Chapter 4 with feature classification. Chapter 5 provides the results and discussion, and Chapter 6 concludes the thesis with future directions.

CHAPTER-2 LITERATURE REVIEW


In this chapter, the important literature used to implement the proposed algorithm is reviewed.

Minematsu, N. et al. (1993), in Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers, proposed a technique to identify subjectively elderly speakers with prosodic features such as MFCC-based speech rate.

William R. Klecka (1980), in Discriminant Analysis, presents a lucid and simple introduction to the related statistical procedures known as discriminant analysis, and introduces the canonical discriminant function (CDF) of the variables. Professor Klecka derives the canonical discriminant function coefficients, provides a spatial interpretation of them, and gives a clear discussion of the interpretation of CDFs and of unstandardized and standardized coefficients.

The SPSS ver. 14 manual on algorithms, in the section titled Discriminant, explains all the steps involved in classification based on CDF coefficients.

Braun, A. et al. (1999), in Estimating speaker age across languages, conducted an analysis showing the correlation between calendar age and perceived age with the help of Italian and Dutch stimuli, and further concluded that male and female listeners can safely be combined.

Cerrato, L. et al. (2000), in Subjective age estimation of telephonic voices, carried out a statistical analysis showing that listeners are capable of assigning a general chronological age category to a voice without seeing or knowing the speaker, and that they are able to distinguish between male and female voices transmitted over a telephone line.

Krauss, R. M. et al. (2002), in Inferring speakers' physical attributes from their voices, examined listeners' ability to make accurate inferences about speakers from the non-linguistic content of their speech.

Shafran, I. et al. (2003), in Voice signatures, explores the problem of extracting a voice signature from a speaker's voice, and found standard Mel-warped cepstral features, speaking rate and shimmer to be useful.

Rabiner, L. et al. (1976), in A comparative performance study of several pitch detection algorithms, discusses the available pitch detection algorithms. According to him, pitch can be as low as 40 Hz (for a very low pitched male) or as high as 600 Hz (for a very high pitched female or child's voice).

Rabiner, L. et al. (1977), in On the use of autocorrelation analysis for pitch detection, explains the pitch detection technique with the help of short-time autocorrelation analysis.

McLeod, P. and Wyvill, G. (2005), in A smarter way to find pitch, found that existing pitch algorithms that use the Fourier domain suffer from spectral leakage, and suggested windowing as a remedy.

Metze, F. et al. (2007), in Comparison of four approaches to age and gender recognition for telephone applications, compares different approaches to age and gender classification on telephone speech with small and large utterance lengths.

Welch, P. D. (1967), in The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms, gives the use of the FFT for PSD estimation.

Moller, M. (1993), in A scaled conjugate gradient algorithm for fast supervised learning, introduces SCG, whose performance is benchmarked against that of back propagation. It is fully automated, includes no critical user-dependent parameters, and avoids a time-consuming line search.

Huang, X. et al. (2001), in "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development," Prentice Hall, describes prosodic phenomena like pitch along with the available algorithms.

Childers, D. et al. (1977), in The Cepstrum: A guide to processing, give the pragmatic details of cepstrum concepts.

Spiegl, W. et al. (2009), in Analyzing features for automatic age estimation on cross-sectional data, developed an acoustic feature set for the estimation of a person's age from a recorded speech signal, and demonstrated that age can be effectively estimated using a feature vector of prosodic, spectral and cepstral features.

CHAPTER-3 FEATURE EXTRACTION

In this chapter the feature extraction algorithms are explained. In this work two types of features were found most suitable, namely pitch and MFCC coefficients.

3.1 PITCH

Pitch represents the perceived fundamental frequency of a sound, and it may be quantified as a frequency in cycles per second (hertz); however, pitch is not a purely objective physical property, but a subjective psychoacoustical attribute of sound. According to Huang, X. [ref 10], prosody is a complex weave of physical and phonetic effects that is employed to express attitude, assumptions, and attention as a parallel channel in our daily speech communication. The semantic content of a spoken or written message is referred to as its denotation, while the emotional and attentional effects intended by the speaker or inferred by a listener are part of the message's connotation. Prosody has an important supporting role in guiding a listener's recovery of the basic message (denotation) and a starring role in signalling connotation, or the speaker's attitude toward the message, toward the listener(s), and toward the whole communication event. From the listener's point of view, prosody consists of systematic perception and recovery of a speaker's intentions based on:

I.   Pauses: to indicate phrases and to avoid running out of air.
II.  Pitch: rate of vocal-fold cycling (fundamental frequency) as a function of time.
III. Rate/relative duration: phoneme durations, timing, and rhythm.
IV.  Loudness: relative amplitude/volume.

Pitch is the most expressive of the prosodic phenomena. As we speak, we systematically vary our fundamental frequency to express our feelings about what we are saying, or to direct the listener's attention to especially important aspects of our spoken message.

3.1.1 PITCH DETECTION

According to Naotoshi Seo [ref 15], pitch can be detected in the following ways:
a. Autocorrelation method
b. Cepstrum method
c. Harmonic product spectrum (HPS) method
d. Linear predictive coding (LPC)

In our work we have adopted the first method. In order to calculate pitch, we need at least two peaks within the block over which pitch is measured. To ensure that at least two peaks fall within the block, the block size must be greater than three wavelengths of the lowest possible frequency, which is known as the pitch floor. The minimum number of samples required per frame is therefore

Nmin = 3 * Fs / pitch_floor    samples/frame
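For example, with Fs = 8000 samples/sec and a pitch floor of 40 Hz (the lower end of the pitch range quoted from Rabiner in Chapter 2), Nmin = 3 * 8000 / 40 = 600 samples per frame.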

According to this method, to get the pitch we take the autocorrelation of the signal for the given block or frame. Then the sample distance K between the value at the first sample (zero lag) and the second-highest peak can be used to find the fundamental frequency, where Fs is the sampling frequency:

Pitch = Fs / K    hertz

where K is the number of samples covered between the two maximum peaks.

Fig. 3.1: An example input sinusoidal signal (amplitude vs. number of samples).

Fig. 3.2: Autocorrelation of the given input frame (amplitude vs. sample number).

Fig. 3.3: Number of samples between two maxima (second-highest peak marked at sample 41, amplitude 7980).

For example, as shown in Fig. 3.3, K = (41 - 1) = 40. For Fs = 8000, Pitch = 8000/40 = 200 Hz.
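To make the procedure concrete, the following Matlab sketch implements the autocorrelation pitch detector described above. It is an illustration under stated assumptions, not the thesis code: the function name, the peak-search details, and the 40-600 Hz pitch range (taken from Rabiner's figures quoted in Chapter 2) are assumptions.

    % Autocorrelation pitch detector (sketch). frame: one block of
    % speech samples; Fs: sampling rate in samples/sec.
    function pitch = pitch_autocorr(frame, Fs)
        pitchFloor   = 40;                    % lowest expected pitch, Hz
        pitchCeiling = 600;                   % highest expected pitch, Hz
        Nmin = ceil(3 * Fs / pitchFloor);     % minimum samples per frame
        if length(frame) < Nmin
            error('frame shorter than 3 wavelengths of the pitch floor');
        end
        r = xcorr(frame);                     % full autocorrelation
        r = r(length(frame):end);             % keep lags >= 0
        minLag = floor(Fs / pitchCeiling);    % skip the zero-lag maximum
        [~, k] = max(r(minLag+1:end));        % second-highest peak
        K = k + minLag;                       % lag K, in samples
        pitch = Fs / K;                       % fundamental frequency, Hz
    end

For the worked example above, a 200 Hz tone sampled at Fs = 8000 would give K = 40 and hence pitch = 200 Hz.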

3.2 MFCC (Mel Frequency Cepstral Coefficient)

We use MFCC because it is popular and efficient to compute, it incorporates a perceptual Mel frequency scale, and it separates source and filter. The IDFT (DCT) decorrelates the features, which in turn makes them more discriminative.

Fig. 3.4 MFCC feature extraction steps

3.2.1 MEL SCALE

Human hearing is not equally sensitive to all frequency bands; it is less sensitive at higher frequencies, roughly above 1000 Hz. That is, human perception of frequency is non-linear: the Mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.

Fig. 3.5 Mel frequency vs. frequency
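For reference, one commonly used analytic form of this mapping (an assumption here, since the thesis does not state which variant of the Mel scale it uses) is

    mel(f) = 2595 * log10(1 + f / 700)

which is approximately linear below 1 kHz and logarithmic above, matching the curve of Fig. 3.5.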

For our work we use 13 Mel filter banks, as shown in Fig. 3.6, which in turn gives 13 MFCC coefficients.

Fig. 3.6 Mel filter bank

3.2.2 LOG ENERGY

The logarithm compresses the dynamic range of values: the human response to signal level is logarithmic, and humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes. It also makes the frequency estimates less sensitive to slight variations in the input (e.g., power variation due to the speaker's mouth moving closer to the microphone). Phase information is not helpful in speech.

3.2.3 CEPSTRUM

According to Childers, D. [ref 4], the cepstrum is nothing but the spectrum of a spectrum. The cepstrum requires Fourier analysis, but since we are going from frequency space back to time, we actually apply the inverse DFT. Since the log power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT). The independent variable of the cepstral domain is known as quefrency.

3.2.4 DELTA MFCC AND DOUBLE DELTA MFCC

These are nothing but the variations in the MFCCs and the variations in those variations. For 13 MFCC coefficients we get 12 delta-MFCC and 11 delta-delta-MFCC coefficients.
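A minimal Matlab sketch of the MFCC pipeline of Fig. 3.4 follows, under stated assumptions: melfb is a hypothetical helper returning the 13-band triangular Mel filter bank of Fig. 3.6 as a matrix, meanPitch (the average pitch over the utterance) is assumed to be computed separately with the detector of Section 3.1, and the delta coefficients are taken as simple successive differences over the coefficient index, which is what the 13 -> 12 -> 11 counts in the text suggest (textbook deltas over time would keep the count at 13).

    % MFCC + delta features (sketch). x: speech signal, Fs: sampling rate.
    N = 1024;                            % frame length (Section 3.3)
    w = hamming(N);                      % Hamming window
    Pxx = pwelch(x, w, N/2, N, Fs);      % Welch PSD, (N/2+1) x 1
    H = melfb(13, N, Fs);                % hypothetical: 13 x (N/2+1) Mel bank
    melE = H * Pxx;                      % energy in each Mel band
    mfcc = dct(log(melE));               % log compression + DCT -> 13 MFCCs
    dmfcc  = diff(mfcc);                 % 12 delta-MFCC coefficients
    ddmfcc = diff(dmfcc);                % 11 delta-delta-MFCC coefficients
    fvec = [meanPitch; mfcc; dmfcc; ddmfcc];   % 37 x 1 vector (cf. Fig. 5.9)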

3.3 Windowing

Instead of processing the whole audio signal at once, we use windowing with overlapping, which limits the buffer length to 1024 samples and lets us reuse previous frame results such as the PSD. We can then apply methods like Welch's to find the periodogram of non-stationary signals.

Fig. 3.7 Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n/N)

In our case we adopt windowing for the estimation of pitch (in the pitch track) and for the PSD estimation used in extracting the MFCC coefficients. Here we adopt a Hamming window of length 1024 with 50% overlap, as sketched below.
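The framing itself can be sketched in a few lines of Matlab (an illustration with assumed variable names; note that Matlab's stock hamming uses a denominator of N-1 rather than the N of the formula in Fig. 3.7, a negligible difference at N = 1024):

    % Framing with a 1024-sample Hamming window and 50% overlap (sketch).
    N = 1024; hop = N/2;
    w = hamming(N);                         % stock Hamming window
    nFrames = floor((length(x) - N) / hop) + 1;
    frames = zeros(N, nFrames);
    for m = 1:nFrames
        seg = x((m-1)*hop + (1:N));         % m-th overlapped segment
        frames(:, m) = w .* seg(:);         % windowed frame
    end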

Fig. 3.8 Overlapped frames followed by windowing function

Fig. 3.9: Reconstructed waveform (above) after windowing and original wave (below)

Fig. 3.10: Cross-correlation between the original signal and the reconstructed one (peak at sample 2560, value 15.53).

3.3.1 PSD ESTIMATION - Welch's Method

Welch's method for estimating power spectra is carried out by dividing the time signal into successive blocks, forming the periodogram of each block, and averaging. Denote the m-th windowed, zero-padded frame from the signal x by

x_m(n) = w(n) * x(n + m*R),    n = 0, 1, ..., N-1

where R is the window hop size, and let K denote the number of available frames. Then the periodogram of the m-th block is given by

P_m(w_k) = (1/N) * | sum_{n=0}^{N-1} x_m(n) * e^(-j*2*pi*n*k/N) |^2

and the Welch estimate of the PSD is given by

S(w_k) = (1/K) * sum_{m=0}^{K-1} P_m(w_k)

In our work we use a Hamming window of N = 1024, with 50% overlap.
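In Matlab, this estimate is available directly through the stock pwelch routine; a one-line sketch with the parameters used here, assuming the signal x and sampling rate Fs are already loaded from a .wav file:

    N = 1024;
    [Pxx, f] = pwelch(x, hamming(N), N/2, N, Fs);   % 50% overlap
    plot(f, Pxx); xlabel('Hz'); ylabel('W/Hz');     % cf. Fig. 3.11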

Fig. 3.11: Welch method - periodograms of all blocks and their average (PSD in W/Hz vs. frequency in Hz; the average has its maximum of about 1.18e-4 W/Hz near 993 Hz).

CHAPTER-4 FEATURE CLASSIFICATION

In this chapter we propose a combination of two classification algorithms as a two-stage classification. There are 7 classes to classify: C, YF, AF, SF, YM, AM and SM. It was found that, with the help of a statistical classifier such as canonical discriminant analysis, we can predict whether a speaker is male, female, or a child; that is the first stage of classification. Then, to decide among young, adult, or senior within the male and female groups, we use an NN framework, which is the second stage.

[Figure: feature set -> classification based on discriminant analysis -> either C (1) directly, MALE (groups 5 to 7) via NN Framework 1 -> YM (5), AM (6), SM (7), or FEMALE (groups 2 to 4) via NN Framework 2 -> YF (2), AF (3), SF (4)]

Fig. 4.1 Abstract flow of the proposed classification stages

4.1 First Stage with Discriminant Analysis

Feature classification stage 1 is done using Discriminant Analysis (DA). In this method, 2 canonical discriminant functions are determined from the extracted features. Only two features, namely pitch and delta2-MFCC(10), are used as the input vector for this stage. For training we used 39 female, 27 child and 37 male stimuli from Database 1. After extracting the features for all training cases, we determined the unstandardized coefficients along with the group centroids for the 2 functions. We can then determine the discriminant score for an unknown feature set, and classification is done based on the Euclidean distance rule.

4.1.1 Steps for Canonical Discriminant Analysis

The selected 103 samples can be referred to as the training database. Using the SPSS package we can find the canonical discriminant functions; in our case we have 3 classes, so we end up with 2 functions. For that we need to give 3 feature matrices (each of size: number of samples from that group x 2 feature values) as input, along with a (total samples x 1) vector giving the true class of each sample. After following the steps explained in Klecka [ref 5], we have the following information for each function:

1. Unstandardized coefficients D
2. Constant D0
3. Function values at the group centroids

Now the canonical discriminant function can be determined as

f = D0 + X * D

where X is the (1 x P) feature vector for the given stimulus.

4.1.2 Classification Based on CDF

After substituting the value of X_input into the obtained CDFs we get f_input; this value is nothing but the discriminant score (DS) for the given input feature vector. For the 2 functions we get 2 different values of f_input; therefore

F1 = [f_input1  f_input2]

We have 3 group centroid values (each 2 x 1), and the group whose centroid has the minimum Euclidean distance from F1 is selected as the classified group.

[Figure: the 2 trained functions (unstandardized coefficients D, constant D0, function values at group centroids) applied to the input X = [pitch, d2MFCC(10)] give F1; the nearest centroid decides class M, F, or C]

Fig. 4.2 Classification based on discriminant analysis followed by a decision based on Euclidean distance, for stage-one classification
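The stage-one decision rule can be summarised in a few lines of Matlab. This is a sketch under stated assumptions: D (2 x 2 unstandardized coefficients), D0 (1 x 2 constants) and centroids (3 x 2 group centroids) are taken from the SPSS output described above, and the assumed row order of the groups (Child, Female, Male) follows Tables 5.3 and 5.4.

    % Stage-1 classification (sketch). X is the 1 x 2 input vector
    % [pitch, d2MFCC(10)].
    F1 = X * D + D0;                                % the two discriminant scores
    d = sqrt(sum((centroids - repmat(F1, 3, 1)).^2, 2));   % Euclidean distances
    [~, g] = min(d);                                % nearest group centroid
    labels = {'Child', 'Female', 'Male'};           % assumed group order
    stage1 = labels{g};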

4.2 Second Stage with NN frameworks


Now we apply the NN framework applicable to a male or female speaker. For both NN frameworks we apply the same algorithm, as shown in the figure: the [37 x 1] feature vector is applied simultaneously to 3 neural networks, and the output obtained from each network is treated as coordinates in a 3-D space, with the missing coordinate assumed to be zero. From these 3 position vectors P1, P2 and P3 we get the centroid coordinates C1. Among [1 0 0], [0 1 0] and [0 0 1], which are the target values for the 3 subclasses of male and female, the minimum distance between the centroid C1 and these three points decides the selection of one of the three classes.

Fig. 4.3 Equivalent decision C1 of the NN framework based on the 3 neural network outputs

Fig. 4.4 Euclidean distance method for the decision in classification stage 2

These 3 neural networks are trained by considering 2 classes as target outputs at a time, so 3C2 = 3 such neural networks are required.

[Figure: feedforward network with a 37-element input layer, a 40-element hidden layer with weights W(i,j), and a 2-element target/output layer giving op1 and op2]

Fig. 4.5 Neural network structure for NNA, NNB and NNC

Then,
From NNA: P1 = [op1  op2  0]
From NNB: P2 = [0  op1  op2]
From NNC: P3 = [op1  0  op2]
C1 = centroid of (P1, P2, P3)
L1 = dist(C1, [1 0 0])
L2 = dist(C1, [0 1 0])
L3 = dist(C1, [0 0 1])

The following method is applicable to both the NN1 and NN2 frameworks:
NNA is a neural network trained with only Young and Adult (male or female) samples;
NNB is a neural network trained with only Senior and Adult (male or female) samples;
NNC is a neural network trained with only Young and Senior (male or female) samples.
A sketch of this decision rule is given below.
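This is a Matlab sketch of the stage-2 decision, following the P1/P2/P3 layout above; the variable names and the sim-based invocation are assumptions, with netA, netB and netC the three trained networks and x the 37 x 1 feature vector.

    % Stage-2 decision (sketch).
    oA = sim(netA, x);  oB = sim(netB, x);  oC = sim(netC, x);  % each 2 x 1
    P1 = [oA(1) oA(2) 0];
    P2 = [0 oB(1) oB(2)];
    P3 = [oC(1) 0 oC(2)];
    C1 = (P1 + P2 + P3) / 3;               % centroid of the three points
    T = eye(3);                            % targets [1 0 0], [0 1 0], [0 0 1]
    L = sqrt(sum((T - repmat(C1, 3, 1)).^2, 2));   % L1, L2, L3
    [~, cls] = min(L);                     % smallest L decides the subclass
                                           % (1 = Young in the Chapter 5 example)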

[Figure: the input X, a [37 x 1] feature vector {pitch, 13 MFCC, 12 dMFCC, 11 ddMFCC}, is fed to NNA, NNB and NNC; their outputs form P1, P2 and P3, whose centroid C1 gives distances L1, L2 and L3 to the three targets; the smallest L decides the class (Y, A, or S)]

Fig. 4.6 Classification algorithm in stage 2

4.2.2 Neural Network Implementation

For neural network implementation we used the Matlab tool specially designed for pattern recognition, the Neural Network Pattern Recognition tool, which can be invoked with the nprtool command. This tool uses the scaled conjugate gradient backpropagation method, as explained in Moller [ref 7], via the trainscg function. The speciality of this SCG method is that it can train any network as long as its weight, net input and transfer functions have derivative functions. The algorithm is based on conjugate directions and does not perform a line search at each iteration.

Training stops when any one of the following occurs:
1. The maximum number of epochs is reached.
2. The maximum amount of time is reached.
3. The performance is minimised to the goal.
4. The performance gradient falls below the minimum gradient.
5. The validation performance has increased more than max_fail times.

We take the number of neurons in the hidden layer equal to 40, and for training we use all 3 databases combined. In the tool itself we can specify the percentage of samples for training, validation and testing; we use a ratio of 70%, 15% and 15% respectively. For the neuron model it uses the hyperbolic tangent sigmoid transfer function.

After satisfactory training, i.e. a good classification rate, the network can be saved and invoked at any time during testing. A script-level sketch of this set-up is given below.
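The same set-up can be scripted without the GUI. The sketch below uses the stock Neural Network Toolbox functions patternnet, trainscg and train; the variable names features (37 x number-of-samples) and targets (2 x number-of-samples, one column per sample) are assumptions.

    % Training one of the NNA/NNB/NNC networks (sketch).
    net = patternnet(40);                  % 40 hidden neurons, tansig layers
    net.trainFcn = 'trainscg';             % scaled conjugate gradient
    net.divideParam.trainRatio = 0.70;     % 70% training
    net.divideParam.valRatio   = 0.15;     % 15% validation
    net.divideParam.testRatio  = 0.15;     % 15% testing
    [net, tr] = train(net, features, targets);
    save('nnA.mat', 'net');                % reload later with load('nnA.mat')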

Fig. 4.7 Matlab nprtool for NN implementation

CHAPTER-5 RESULTS AND DISCUSSION


From Database 1 we extracted the pitch feature. It was found that groups F1 and M1 do not show any distinct features and can safely be combined into a single Child class. At the same time, using pitch one can clearly distinguish between children and men; but between children and women, pitch alone was found not to be completely reliable. The groups here are:

o Children: <15 years, male (M1) and female (F1)
o Young people: 15-30 years, male (M2) and female (F2)
o Adults: 30-55 years, male (M3) and female (F3)
o Seniors: >55 years, male (M4) and female (F4)

Fig. 5.1 Average pitch for Females of all 4 classes from database 1

Fig. 5.2 Average pitch for Males of all 4 classes from database 1

Fig. 5.3 Waveform of one of the records from Database 1 (Happy birthday)

Fig. 5.4 Pitch track (Hz vs. frame no.) for the waveform shown in Fig. 5.3

While plotting the pitch track, i.e. the pitch contour, for Databases 1 and 2, it showed dramatic variations in pitch within a stimulus. This led to the collection of Database 3, in which the pitch contour stays near the average pitch value at all times. With the help of Database 3 we came to know one more fact: males < 12 years and females < 18 years show distinct results compared to the others. It was observed that for boys there is a change in pitch after the age of 12 years, whereas for girls this happens at 18 years. This was the deciding factor for fixing the classification groups as given in Table 1.1, according to which any human less than 18 years of age comes under the Child category. In the following section, with one example stimulus, we give the results obtained and the calculations involved in following the algorithm.

5.1 Unknown stimuli results


Fig. 5.5 Pitch track (Hz vs. frame no.) - unknown stimulus

Fig. 5.6: 13 MFCC coefficients - unknown stimulus

Fig. 5.7: 12 dMFCC coefficients - unknown stimulus

Fig. 5.8: 11 ddMFCC coefficients - unknown stimulus

Fig. 5.9 Feature vector of 37 x 1 for the unknown stimulus: [mean(pitch), 13 MFCC, 12 dMFCC, 11 ddMFCC]

F_input: pitch = 106.4090, ddMFCC(10) = 0.1201

Discriminant scores:
DS1 = [106.4090  0.1201] * [0.027  0.779]^T - 5.939 = -2.9325
DS2 = [106.4090  0.1201] * [-0.006  2.3797]^T + 0.9263 = 0.5349

F1 = [-2.93  0.53]
Centroid c0 (Child)  = [ 2.5286  -0.3146]
Centroid c1 (Female) = [ 0.4167   0.3881]
Centroid c2 (Male)   = [-2.2844  -0.1795]

Here the distance between c2 and F1 is smaller than the other distances, and c2 belongs to the Male group, so the classification stage 1 result is: Male. In classification stage 2 the sample therefore goes through the NN1 framework, where it again passes through the NNA, NNB and NNC networks.

Table 5.1 Neural network outputs for the unknown stimulus

Network   Op1      Op2
NNA       0.7367   0.1617
NNB       0.9899   0.0201
NNC       0.9628   0.0194

Therefore,
P1 = [0.7367  0.1617  0]
P2 = [0  0.9628  0.0194]
P3 = [0.9899  0  0.0201]
C1 = [0.5755  0.3748  0.0132]  (centroid)
L1 = 0.5664
L2 = 0.8499
L3 = 1.2023

L1 is the smallest, which again means class 1, the Young group. The final classification is therefore Male-Young, i.e. group 5 (YM).

The output of our algorithm in the Matlab environment is:

--------Group no----------------
Child      = 1
Female<30  = 2
Female<55  = 3
Female>55  = 4
Male<30    = 5
Male<55    = 6
Male>55    = 7
--------------------------------
And answer is
group = 5
--------------------------------

Group 5 is nothing but YM, so this was a true positive result.

5.2 Classification Results: Stage One (Canonical Discriminant Analysis)

Classification results of stage one, using Database 1 as the training database, are given below.

Table 5.2 Canonical discriminant function coefficients (unstandardized)

             Function 1   Function 2
pitch         .027         -.006
ddmfcc10      .779         2.380
(Constant)   -5.939         .926

Table 5.3 Functions at group centroids

Group   Function 1   Function 2
.00      2.529        -.315
1.00      .417         .388
2.00    -2.284        -.179

(Unstandardized canonical discriminant functions evaluated at group means.)

Fig. 5.10 Discriminant score plot for all 3 groups

Table 5.4 Classification results of stage 1 with Database 1 (b, c)

                               Predicted Group Membership
                      Group    .00     1.00    2.00    Total
Original (count)      .00      22      5       0       27
                      1.00     4       31      4       39
                      2.00     0       1       36      37
Original (%)          .00      81.5    18.5    .0      100.0
                      1.00     10.3    79.5    10.3    100.0
                      2.00     .0      2.7     97.3    100.0
Cross-validated (a)   .00      21      6       0       27
  (count)             1.00     4       31      4       39
                      2.00     0       1       36      37
Cross-validated (%)   .00      77.8    22.2    .0      100.0
                      1.00     10.3    79.5    10.3    100.0
                      2.00     .0      2.7     97.3    100.0

a. Cross-validation is done only for those cases in the analysis. In cross-validation, each case is classified by the functions derived from all cases other than that case.
b. 86.4% of original grouped cases correctly classified.
c. 85.4% of cross-validated grouped cases correctly classified.

5.3 Classification Results (Confusion Matrices): Stage Two - Neural Network

Fig. 5.11 NN1 framework, NNA network (deals with the YM and AM categories)

Fig. 5.12 NN1 framework, NNB network (deals with the SM and AM categories)

Fig. 5.13 NN1 framework, NNC network (deals with the YM and SM categories)

Fig. 5.14 NN2 framework, NNA network (deals with the YF and AF categories)

Fig. 5.15 NN2 framework, NNB network (deals with the SF and AF categories)

Fig. 5.16 NN2 framework, NNC network (deals with the YF and SF categories)

Fig. 5.17 Stage 1 + NN2 framework, all females

Fig. 5.18 Stage 1 + NN1 framework, all males

Fig. 5.19 Overall classification result with the whole of Database 3 as testing samples

CHAPTER-6 CONCLUSION AND FUTURE WORK

The proposed automatic age and gender estimation system was implemented with the help of Matlab toolboxes. Figure 6.1 compares the classification rates obtained by applying Database 3 for testing at the end of the second (last) classification stage. It was found that the male categories have good classification rates overall. Except for AF, the results are quite satisfactory, including the overall classification rate, which was 69.4%.

[Figure: bar chart of classification rate (%) for Child, YF, AF, SF, YM, AM and SM]

Fig. 6.1 Comparison chart for successful estimation of each class.

As further work, the neural networks should be retrained whenever the true class of a user is known but the system reports a different class. There is a need not only to collect more stimuli but also to explore more features.

REFERENCES

1. Welch, P. D. (1967); "The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms," IEEE Trans. on Audio and Electroacoustics, Volume AU-15, pages 70-73.

2. Rabiner, L. et al. (1976); "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoustics, Speech and Signal Processing, Volume 24, Issue 5, pages 399-418.

3. Rabiner, L. et al. (1977); "On the use of autocorrelation analysis for pitch detection," IEEE Trans. Acoustics, Speech and Signal Processing, Volume 25, Issue 1, pages 24-33.

4. Childers, D. et al. (1977); "The Cepstrum: A guide to processing," Proc. IEEE, Volume 65, Issue 10, pages 1428-1443.

5. William R. Klecka (1980); "Discriminant Analysis," Sage University Paper.

6. Minematsu, N. et al. (1993); "Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers," Proc. ICASSP-93, IEEE International Conference on Acoustics, Speech, and Signal Processing.

7. Moller, M. (1993); "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, Volume 6(4), pages 523-533.

8. Braun, A. et al. (1999); "Estimating speaker age across languages," Proc. International Congress of Phonetic Sciences (ICPhS 99).

9. Cerrato, L. et al. (2000); "Subjective age estimation of telephonic voices," Speech Communication, Volume 31, Issue 2-3 (June 2000), Elsevier.

10. Huang, X. et al. (2001); "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development," Prentice Hall.

11. Krauss, R. M. et al. (2002); "Inferring speakers' physical attributes from their voices," Journal of Experimental Social Psychology, 38, pages 618-625.

12. Shafran, I. et al. (2003); "Voice signatures," Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2003).

13. McLeod, P. and Wyvill, G. (2005); "A smarter way to find pitch," Proc. International Computer Music Conference, Barcelona, July 2005, pages 300-303.

14. Metze, F. et al. (2007); "Comparison of four approaches to age and gender recognition for telephone applications," Proc. ICASSP.

15. Naotoshi Seo (2008); "ENEE632 Project 4 Part I: Pitch Detection," ECE Dept., University of Maryland.

16. Spiegl, W. et al. (2009); "Analyzing features for automatic age estimation on cross-sectional data," Proc. 10th Annual Conference of the International Speech Communication Association (Interspeech), Brighton, pages 1-4.

17. SPSS ver. 14 manual on algorithms, section titled "Discriminant."
