A thesis submitted in partial fulfilment of the requirements for the award of the degree of Master of Technology in Communication Systems
BONAFIDE CERTIFICATE This is to certify that the project titled AGE AND GENDER ESTIMATION FROM AUDIO FEATURES USING DISCRIMINANT ANALYSIS AND NN FRAMEWORK is a bonafide record of the work done by
in partial fulfilment of the requirements for the award of the degree of Master of Technology in Communication Systems of the NATIONAL INSTITUTE OF TECHNOLOGY, TIRUCHIRAPPALLI, during the year 2010-2011.
Internal Examiner
External Examiner
ABSTRACT
In the field of speech processing, applications such as Interactive Voice Response (IVR) systems and artificial intelligence need to replicate human behaviour; one such ability is the auditory perception of a speaker's sex and approximate age. Using selected features extracted from an unknown speaker's voice, the proposed automated system estimates the age group and gender of that person. For classification, seven classes are defined: young male/female, adult male/female, senior male/female, and child. The system consists of two main parts: feature extraction from real-time samples captured with a microphone, followed by two-stage feature classification. Features such as pitch, MFCC and delta-MFCC are extracted, and a combination of canonical discriminant analysis and a neural network framework is applied for classification. For the experiments, the required stimuli databases were collected from 192 different speakers.

Keywords: Speech Processing, Age, Gender, Discriminant Analysis, Neural Network.
ACKNOWLEDGEMENTS
I take this opportunity to express my sincere thanks & deep sense of gratitude to my project guide Mrs S. Deivalakshmi, Assistant Professor, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli for her guidance, and kind co-operation.
With immense pleasure, I record my profound gratitude and indebtedness to Prof. Sanjay Patil, Department of Electronics and Telecommunication, Maharashtra Academy of Engineering, Pune University for his needful suggestions & guidance.
I would like to express my sincere thanks to Prof. P. Somaskandan, Professor and Head of the Department, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli for providing with all facilities from the part of department for the successful completion of this project.
I express my deep sense of gratitude to Dr S. Raghavan, Professor, Department of ECE, and Mr M. Bhaskar, Associate Professor, Department of ECE, for giving me the much-needed lab facilities; I would also like to thank them for their motivation and support.
My special thanks to Anil, Jamuna, Nithyananth, Senkathir, Kishore and Pardu for their encouragement and invaluable help to collect audio database.
I would like to thank all the teaching staff, my classmates and the computer support group staff for their sincere help. Last but not least, I dedicate this work to my parents and my family.

Sujay Pujari
May 2011
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
CHAPTER 1 INTRODUCTION
  1.1 Objectives and Approach
  1.2 Database Collection
  1.3 Study Outline
CHAPTER 2 LITERATURE REVIEW
CHAPTER 3 FEATURE EXTRACTION
  3.1 Pitch
  3.2 MFCC (Mel Frequency Cepstral Coefficients)
  3.3 Windowing
CHAPTER 4 FEATURE CLASSIFICATION
  4.1 First Stage with Discriminant Analysis
  4.2 Second Stage with NN Frameworks
CHAPTER 5 RESULTS AND DISCUSSION
  5.1 Unknown Stimuli Results
  5.2 Classification Stage One Results
  5.3 Classification Stage Two Results
CHAPTER 6 CONCLUSION & FURTHER WORK
REFERENCES
LIST OF FIGURES
1.1 Proposed system
1.2 Snapshot of recording with Wave Surfer
3.1 An example input sinusoidal signal
3.2 Autocorrelation of the given frame input
3.3 Number of samples between two maxima
3.4 MFCC feature extraction steps
3.5 Mel frequency vs. frequency
3.6 Mel filter bank
3.7 Hamming window
3.8 Overlapped frames followed by windowing function
3.9 Reconstructed waveform (above) after windowing and original wave (below)
3.10 Cross-correlation between original signal and reconstructed one
3.11 Welch method: all periodograms and their average
4.1 Abstract flow of proposed classification stages
4.2 Classification based on discriminant analysis followed by decision based on Euclidean distance for stage-one classification
4.3 Equivalent output C1 of NN framework based on 3 neural networks' outputs
4.4 Euclidean distance method for decision in stage 2
4.5 Neural network structure for NNA, NNB and NNC
4.6 Classification algorithm in stage 2
4.7 Matlab nprtool for NN implementation
5.1 Average pitch for females of all 4 classes from database 1
5.2 Average pitch for males of all 4 classes from database 1
5.3 Waveform of one of the records from database 1
5.4 Pitch track for waveform shown in Fig. 5.3
5.5 Pitch track, unknown stimulus
5.6 13 MFCC coefficients, unknown stimulus
5.7 12 dMFCC coefficients, unknown stimulus
5.8 11 ddMFCC coefficients, unknown stimulus
5.9 Feature vector of 37 x 1 for unknown stimulus
5.10 Discriminant score plot for all 3 groups
5.11 NN1 framework, NNA network
5.12 NN1 framework, NNB network
5.13 NN1 framework, NNC network
5.14 NN2 framework, NNA network
5.15 NN2 framework, NNB network
5.16 NN2 framework, NNC network
5.17 Stage 1 + NN2 framework, all females
5.18 Stage 1 + NN1 framework, males
5.19 Overall classification result
6.1 Comparison chart for successful estimation of class
LIST OF TABLES
1.1 Classification groups
5.1 Neural network outputs, unknown stimulus
5.2 Canonical discriminant function coefficients
5.3 Functions at group centroids
5.4 Classification result of stage 1 with database 1
ABBREVIATIONS
CDF - Canonical Discriminant Function
DA - Discriminant Analysis
DS - Discriminant Score
MFCC - Mel Frequency Cepstrum Coefficients
NN - Neural Network
YM - Young Male
YF - Young Female
AM - Adult Male
AF - Adult Female
SM - Senior Male
SF - Senior Female
DCT - Discrete Cosine Transform
PSD - Power Spectral Density
CHAPTER-1 INTRODUCTION
Automatic speech recognition (ASR) based algorithms are widely deployed for customer care and service applications. ASR research is currently moving from mere speech-to-text (STT) systems towards rich transcription (RT) systems, which annotate recognized text with non-verbal information such as speaker identity and emotional state. In Interactive Voice Response systems, this approach is already being used to identify dialogs involving angry customers, which can then be analyzed with the goal of automatically identifying problematic dialogs, transferring unsatisfied customers to an agent, and other purposes. The first adaptive dialogs are also now appearing, particularly in systems exposed to inhomogeneous user groups. These can adapt the degree of automation, the order of presentation, the waiting-queue music, or other properties to properties of the caller such as age or gender. For example, it would be possible to offer different advertisements to children and adults in the waiting queue. In non-personalized services, speaker classification must be based on the caller's speech data alone. While classifier performance is only one factor influencing the utility of this approach in an IVR system, it is certainly a major one. The proposed algorithm for automatic age and gender estimation helps in the same regard: it classifies a speaker's voice into one of the defined classes, thereby predicting the speaker's gender and approximate age group.
The ultimate aim of the proposed system is to predict the age group and gender of a speaker from a stimulus of any length in real time. Such systems mainly consist of two stages: feature extraction and selection, followed by classification based on the extracted features. To identify features that give distinct values for the different classes, we first need a database, so one of the first tasks was to collect an audio database, followed by feature extraction and classification. For feature extraction, features such as pitch, MFCC and delta-MFCC coefficients are worked out. In the classification stage we adopted two different methods, namely CDA and a NN framework. With the help of all 290 stimuli present in the database, the neural networks are trained, and these trained networks are then used for real-time classification.
Fig. 1.1 Proposed system
For classification purposes we collected data for the following 8 groups:
I. Child Boy (age < 15)
II. Child Girl (age < 15)
III. Young Men (age < 30)
IV. Young Women (age < 30)
V. Adult Gents (age < 55)
VI. Adult Ladies (age < 55)
VII. Senior Citizen Male (age > 55)
VIII. Senior Citizen Female (age > 55)
At the start, we collected 2 stimuli from 105 speakers: 1) "HAPPY BIRTHDAY", and 2) the speaker's name in his or her mother tongue, for example <My name is Sujay>, <Maz nav ... ahe.> (Marathi), <En peyar ...> (Tamil) or <Naa peru ...> (Telugu), with the specifications: a) sampling rate Fs = 8000 samples/sec, b) 16 bits per sample, c) mono channel.
By the end of the experiments on these 2 stimuli it was clear that distinguishing between groups I and II was practically impossible. So we finally fixed the classification to 7 classes by merging groups I and II into a single group called Child, and adopted the classification groups given in Table 1.1.
Table 1.1 Classification groups

Group no.  Group symbol  Category
1          C             Child (age < 18)
2          YF            Young Female (age < 30)
3          AF            Adult Female (age < 55)
4          SF            Senior Female (age > 55)
5          YM            Young Male (age < 30)
6          AM            Adult Male (age < 55)
7          SM            Senior Male (age > 55)
3) After this we adopted a stimulus of "OM", with the condition that it be extended to more than 10 seconds in a single breath, like OOOOOOOOOOOMMMMMMMMMMMmmmmmmmm. We have 87 such sample recordings with the following specifications: A. sampling rate Fs = 16000 samples/sec, B. 16 bits per sample, C. mono channel.
For recording we used an open-source tool, Wave Surfer 1.8.8p3, for recording and editing with the desired specifications. We refer to these three databases as Database 1, Database 2 and Database 3, in which all files are stored in the form Name_Age.wav.
This thesis is organized as follows. Chapter 2 reviews the literature and the background of the algorithms adopted for estimation of age and gender. The materials and methods used in this study are discussed in Chapters 3 and 4: Chapter 3 deals with feature extraction and Chapter 4 with feature classification. Chapter 5 provides the results and discussion, and Chapter 6 concludes the thesis with future directions.
CHAPTER-2 LITERATURE REVIEW

Minematsu, N. et al. (1993), in Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers, proposed a technique to identify subjectively elderly speakers using prosodic features such as MFCC-based speech rate.
William R. Klecka (1980), in Discriminant Analysis, presents a lucid and simple introduction to the related statistical procedures known as discriminant analysis, and introduces the canonical discriminant function (CDF) of variables. Professor Klecka derives the canonical discriminant function coefficients, provides a spatial interpretation of them, and gives a clear discussion of the interpretation of CDFs, covering both unstandardized and standardized coefficients.
The SPSS ver. 14 algorithms manual, in the section titled Discriminant, explains all steps involved in classification based on CDF coefficients.
Braun, A. et al. (1999), in Estimating speaker age across languages, conducted an analysis showing the correlation between chronological age and perceived age with the help of Italian and Dutch stimuli, and further concluded that male and female listeners can safely be combined.
Cerrato, L. et al. (2000), in Subjective age estimation of telephonic voices, carried out a statistical analysis showing that listeners are capable of assigning a general chronological age category to a voice without seeing or knowing the speaker, and that they are able to distinguish between male and female voices transmitted over a telephone line.
Krauss, R. M. et al. (2002), in Inferring speakers' physical attributes from their voices, examined listeners' ability to make accurate inferences about speakers from the non-linguistic content of their speech.
Shafran, I. et al. (2003), in Voice signatures, explored the problem of extracting a voice signature from a speaker's voice and found standard Mel-warped cepstral features, speaking rate and shimmer to be useful.
Rabiner, L. et al. (1976), in A comparative performance study of several pitch detection algorithms, discuss several pitch detection algorithms. According to them, pitch can be as low as 40 Hz (for a very low-pitched male voice) or as high as 600 Hz (for a very high-pitched female or child's voice).
Rabiner, L. et al. (1977), in On the use of autocorrelation analysis for pitch detection, explain a pitch detection technique based on short-time autocorrelation analysis.
McLeod, P. and Wyvill, G. (2005), in A smarter way to find pitch, found that existing pitch algorithms that work in the Fourier domain suffer from spectral leakage, and suggested windowing as a remedy.
Metze, F. et al. (2007), in Comparison of four approaches to age and gender recognition for telephone applications, compare different approaches to age and gender classification on telephone speech with small and large utterance lengths.
Welch, P. D. (1967), in The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms, describes the use of the FFT for PSD estimation.
Moller, M. (1993), in A scaled conjugate gradient algorithm for fast supervised learning, introduces SCG, whose performance is benchmarked against that of back-propagation. The method is fully automated, includes no critical user-dependent parameters, and avoids a time-consuming line search.
Huang, X. et al. (2001), in "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development" (Prentice Hall), describe prosodic phenomena like pitch together with the available algorithms.
Childers, D. et al. (1977), in The Cepstrum: A guide to processing, give the pragmatic details of cepstrum concepts.
Spiegl, W. et al. (2009), in Analysing Features for Automatic Age Estimation on Cross-Sectional Data, developed an acoustic feature set for the estimation of a person's age from a recorded speech signal, and demonstrated that age can be effectively estimated using a feature vector of prosodic, spectral and cepstral features.
CHAPTER-3 FEATURE EXTRACTION

In this chapter the feature extraction algorithms are explained. In this work two types of features were found most suitable, namely pitch and MFCC coefficients.

3.1 PITCH

Pitch represents the perceived fundamental frequency of a sound and may be quantified as a frequency in cycles per second (hertz); however, pitch is not a purely objective physical property, but a subjective psychoacoustic attribute of sound. According to Huang, X. [ref 10], prosody is a complex weave of physical and phonetic effects that is employed to express attitude, assumptions, and attention as a parallel channel in our daily speech communication. The semantic content of a spoken or written message is referred to as its denotation, while the emotional and attentional effects intended by the speaker or inferred by a listener are part of the message's connotation. Prosody has an important supporting role in guiding a listener's recovery of the basic message (denotation) and a starring role in signalling connotation, i.e. the speaker's attitude toward the message, toward the listener(s), and toward the whole communication event. From the listener's point of view, prosody consists of systematic perception and recovery of a speaker's intentions based on:
I. Pauses: to indicate phrases and to avoid running out of air.
II. Pitch: rate of vocal-fold cycling (fundamental frequency) as a function of time.
III. Rate/relative duration: phoneme durations, timing, and rhythm.
IV. Loudness: relative amplitude/volume.
Pitch is the most expressive of the prosodic phenomena. As we speak, we systematically vary our fundamental frequency to express our feelings about what we are saying, or to direct the listener's attention to especially important aspects of our spoken message.
3.1.1 PITCH DETECTION

According to Naotoshi Seo [ref 15], pitch can be detected in the following ways:
a. Autocorrelation method
b. Cepstrum method
c. Harmonic product spectrum (HPS) method
d. Linear predictive coding (LPC)
In our work we adopted the first method. To calculate pitch, at least two peaks must lie within the block over which pitch is measured. To guarantee this, the block size must be greater than 3 wavelengths of the lowest possible frequency, which is known as the pitch floor. The minimum number of samples required per frame is therefore

Nmin = (3 * Fs) / (pitch floor)   samples/frame
In this method, we first compute the autocorrelation of the signal for the given block or frame. The number of samples K between the first sample and the second-highest peak then gives the fundamental frequency, where Fs is the sampling frequency:

Pitch = Fs / K   Hz

where K is the number of samples between the two maximum peaks.
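A minimal Python sketch of this autocorrelation method (illustrative only; the thesis implementation is in Matlab, and the 40-600 Hz search range is the one quoted from Rabiner et al. [ref 2]):

```python
import numpy as np

def detect_pitch(frame, fs, pitch_floor=40, pitch_ceil=600):
    """Estimate the pitch of one frame by the autocorrelation method:
    the lag K of the strongest peak after lag 0 gives pitch = fs / K."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / pitch_ceil)                  # smallest lag to search
    hi = min(int(fs / pitch_floor), len(ac) - 1)
    k = lo + int(np.argmax(ac[lo:hi]))         # lag of the second-highest peak
    return fs / k

# sanity check on a 200 Hz sinusoid sampled at Fs = 8000;
# block length = 3 * Fs / pitch_floor = 600 samples, as required above
fs = 8000
t = np.arange(3 * fs // 40) / fs
pitch = detect_pitch(np.sin(2 * np.pi * 200 * t), fs)
```

For the 200 Hz test tone the strongest peak falls at lag K = 40, giving 8000/40 = 200 Hz.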
Fig. 3.1 An example input sinusoidal signal
Fig. 3.2 Autocorrelation of the given frame input
Fig. 3.3 Number of samples between two maxima (data cursor at sample K = 41)
3.2 MFCC (MEL FREQUENCY CEPSTRAL COEFFICIENTS)

We use MFCCs because they are popular and efficient to compute, they incorporate a perceptual Mel frequency scale, they separate source and filter, and the IDFT (DCT) decorrelates the features, which in turn improves class separability.
3.2.1 MEL SCALE

Human hearing is not equally sensitive to all frequency bands; it is less sensitive at higher frequencies, roughly above 1000 Hz. In other words, human perception of frequency is non-linear: the Mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.
For our work we use 13 Mel filter banks, as shown in Fig. 3.6, which in turn give 13 MFCC coefficients.
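The Hz-to-Mel mapping can be sketched as follows; the specific 2595 * log10(1 + f/700) formula is a common convention and an assumption here, since the thesis does not state which Mel variant it uses:

```python
import math

def hz_to_mel(f_hz):
    # common MFCC convention: ~linear below 1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # inverse mapping, used when placing triangular filter centres
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With this formula, 1000 Hz maps to roughly 1000 mel, and the 13 filter-bank centres would be spaced uniformly on the mel axis.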
3.2.2
LOG ENERGY
The logarithm compresses the dynamic range of values:
- The human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes.
- It makes frequency estimates less sensitive to slight variations in the input (e.g. power variation due to the speaker's mouth moving closer to the microphone).
- Phase information is not helpful in speech.
3.2.3
CEPSTRUM
According to Childers, D. [ref 4], the cepstrum is simply the spectrum of a spectrum. The cepstrum requires Fourier analysis, but since we are going from the frequency domain back to the time domain, we actually apply the inverse DFT; the independent variable of the cepstrum is called quefrency. Since the log power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT).
3.2.4 DELTA AND DELTA-DELTA MFCC

These are the variations in the MFCCs, and the variations in those variations. For 13 MFCC coefficients we obtain 12 delta-MFCC and 11 delta-delta-MFCC coefficients.
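Note that the 13 -> 12 -> 11 counts imply differencing along the coefficient index (rather than the more common frame-to-frame deltas); a sketch under that reading:

```python
import numpy as np

def delta_features(coeffs):
    """First differences of a coefficient vector: 13 MFCCs -> 12 deltas,
    and applying it again gives 11 delta-deltas, matching the
    13 + 12 + 11 (+ mean pitch) entries of the 37 x 1 feature vector."""
    return np.diff(np.asarray(coeffs, dtype=float))

mfcc = np.linspace(-30.0, 30.0, 13)   # stand-in for 13 MFCC coefficients
d = delta_features(mfcc)              # 12 delta-MFCC values
dd = delta_features(d)                # 11 delta-delta-MFCC values
```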
3.3 Windowing
Instead of processing the entire audio signal at once, we use windowing with overlapping, which limits the buffer length to 1024 samples plus previous-frame results such as the PSD. We can then apply methods like Welch's to estimate the periodogram of non-stationary signals.

In our case we adopt windowing both for pitch estimation (for the pitch track) and for PSD estimation when extracting MFCC coefficients.

Here we use a Hamming window of length 1024 with 50% overlap.
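The framing step just described can be sketched as (illustrative Python, not the thesis code):

```python
import numpy as np

def frame_signal(x, frame_len=1024, overlap=0.5):
    """Split x into Hamming-windowed frames of 1024 samples
    with 50% overlap (hop of 512 samples)."""
    hop = int(frame_len * (1 - overlap))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[m * hop : m * hop + frame_len] * window
                     for m in range(n_frames)])

x = np.random.randn(8000)        # one second of audio at Fs = 8000
frames = frame_signal(x)         # 14 frames of 1024 samples each
```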
Fig. 3.9 Reconstructed waveform (above) after windowing and original wave (below)
Fig. 3.10 Cross-correlation between the original signal and the reconstructed one (peak value 15.53 at sample 2560)

3.3.1 PSD ESTIMATION - Welch's Method
Welch's method for estimating power spectra divides the time signal into successive blocks, forms the periodogram of each block, and averages them. Denote the m-th windowed, zero-padded frame of the signal x by

x_m(n) = w(n) x(n + mR),   n = 0, 1, ..., N-1

where R is the window hop size and w(n) is the window of length N, and let K denote the number of available frames. The periodogram of the m-th block is then

P_m(k) = (1/N) |FFT_N(x_m)(k)|^2

and the Welch PSD estimate is the average of P_m(k) over the K frames.
In our work we are using Hamming window of N=1024, with 50% overlapping.
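A compact sketch of Welch's estimator with these settings (illustrative; a production implementation such as scipy.signal.welch would normally be used instead):

```python
import numpy as np

def welch_psd(x, fs, n=1024):
    """Welch PSD estimate: Hamming windows of length n, 50% overlap,
    periodograms of all frames averaged (Welch, 1967)."""
    hop = n // 2
    w = np.hamming(n)
    norm = fs * np.sum(w ** 2)                 # window power normalisation
    frames = [x[m * hop : m * hop + n] * w
              for m in range(1 + (len(x) - n) // hop)]
    pgrams = [np.abs(np.fft.rfft(f)) ** 2 / norm for f in frames]
    return np.fft.rfftfreq(n, d=1.0 / fs), np.mean(pgrams, axis=0)

# a 1 kHz tone at Fs = 8000 should peak at the 1000 Hz bin
fs = 8000
freqs, psd = welch_psd(np.sin(2 * np.pi * 1000 * np.arange(8192) / fs), fs)
```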
Fig. 3.11 Welch method: all periodograms and their average (average periodogram peak of about 1.18e-4 Watt/Hz near 993 Hz)
CHAPTER-4 FEATURE CLASSIFICATION

In this chapter we propose a combination of two classification algorithms as a two-stage classification, since there are 7 classes to separate: C, YF, AF, SF, YM, AM and SM. It was found that with a statistical classifier, Canonical Discriminant Analysis, we can predict a speaker as Male, Female or Child; that is the first stage of classification. Then, to decide young, adult or senior within the Male and Female groups, we use a NN framework.
Fig. 4.1 Abstract flow of the proposed classification stages: the feature set is first classified as Male (groups 5 to 7), Female (groups 2 to 4) or Child (group 1); NN framework 1 then separates YM (5), AM (6) and SM (7), while NN framework 2 separates YF (2), AF (3) and SF (4).
4.1 First Stage with Discriminant Analysis

Feature classification stage 1 is done using Discriminant Analysis (DA). In this method, 2 canonical discriminant functions are determined from the extracted features. Only two features, pitch and delta2-MFCC(10), are used as the input vector for this stage. For training we used 39 Female, 27 Child and 37 Male stimuli from Database 1. After extracting the features for all training cases, we determined the unstandardized coefficients along with the group centroids for the 2 functions. We can then determine the discriminant score for an unknown feature set, and classification is done using the Euclidean distance rule.

4.1.1 Steps for Canonical Discriminant Analysis

The selected 103 samples form the training database. Using the SPSS package we find the canonical discriminant functions; in our case we have 3 classes, so we end up with 2 functions. For that we supply 3 feature matrices of size (no. of samples from that group x 2 feature values) as input, along with a (total samples x 1) vector holding the true class of each sample. After following the steps explained in Klecka [ref 5], we have the following information for each function:
1. Unstandardized coefficients D
2. Constant D0
3. Function values at group centroids
The canonical discriminant function can then be evaluated as
f = D0 + XD
where X is the (1 x P) feature vector for a given speech sample.
4.1.2 Classification based on CDF

After substituting the input value X_input into the obtained CDFs we get f_input; this value is the discriminant score (DS) for the given input feature vector. For the 2 functions we get 2 different values, f_input1 and f_input2, so

F1 = [f_input1  f_input2]

We have 3 group centroid values of size [2 x 1]; the group whose centroid has the minimum Euclidean distance from F1 is selected as the classified group.
Fig. 4.2 Classification based on discriminant analysis followed by a decision based on Euclidean distance for stage-one classification: the two trained functions (unstandardized coefficients D, constant D0 and function values at group centroids) are evaluated on the input X = [pitch, d2MFCC(10)] to give F1, and the nearest centroid decides class M, F or C.
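Stage one can be sketched end-to-end with the trained values from Tables 5.2 and 5.3 (the mapping of the numeric group codes to Child/Female/Male is inferred from the per-group sample counts in Table 5.4, so treat it as an assumption):

```python
import numpy as np

# unstandardized coefficients D and constants D0 (Table 5.2)
D = np.array([[0.027, -0.006],    # row: pitch
              [0.779,  2.380]])   # row: ddMFCC(10)
D0 = np.array([-5.939, 0.926])

# function values at group centroids (Table 5.3)
centroids = {"Child":  np.array([ 2.529, -0.315]),
             "Female": np.array([ 0.417,  0.388]),
             "Male":   np.array([-2.284, -0.179])}

def classify_stage1(pitch, ddmfcc10):
    """Discriminant scores f = D0 + X.D, then nearest group centroid
    by Euclidean distance."""
    f = D0 + np.array([pitch, ddmfcc10]) @ D
    return min(centroids, key=lambda g: np.linalg.norm(f - centroids[g]))

# the worked example of Chapter 5: mean pitch 106.4090 Hz, ddMFCC(10) 0.1201
group = classify_stage1(106.4090, 0.1201)
```

The scores come out near F1 = [-2.93, 0.53], whose nearest centroid is the Male group, matching the stage-1 result reported in Chapter 5.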
Fig. 4.3 Equivalent output C1 of the NN framework based on the outputs P1, P2 and P3 of the three neural networks (targets [1 0 0], [0 1 0] and [0 0 1])
Fig. 4.4 Euclidean distance method for the decision in classification stage 2

These 3 neural networks are trained by considering 2 classes as target output at a time, so 3C2 = 3 neural networks are required.
Fig. 4.5 Neural network structure for NNA, NNB and NNC (each network has two outputs, op1 and op2)
Then:
From NNA, P1 = [op1  op2  0]
From NNB, P2 = [0  op1  op2]
From NNC, P3 = [op1  0  op2]
C1 = centroid of (P1, P2, P3)
L1 = dist(C1, [1 0 0]), L2 = dist(C1, [0 1 0]), L3 = dist(C1, [0 0 1])
For both the NN1 and NN2 frameworks the following method is applicable:
NNA is the neural network trained with only Young and Adult (male or female) samples;
NNB is the neural network trained with only Senior and Adult (male or female) samples;
NNC is the neural network trained with only Young and Senior (male or female) samples.
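The centroid-and-distance decision of stage 2 can be sketched as (illustrative Python; the P values below are the Table 5.1 outputs for the unknown stimulus):

```python
import numpy as np

def stage2_decision(p1, p2, p3):
    """Combine the padded outputs of the three pairwise networks:
    take the centroid C1 of P1, P2, P3, then pick the one-hot target
    ([1 0 0]=Young, [0 1 0]=Adult, [0 0 1]=Senior) nearest to C1."""
    c1 = np.mean([p1, p2, p3], axis=0)
    dists = np.linalg.norm(np.eye(3) - c1, axis=1)   # L1, L2, L3
    return ["Young", "Adult", "Senior"][int(np.argmin(dists))]

cls = stage2_decision([0.7367, 0.1617, 0.0],
                      [0.0, 0.9628, 0.0194],
                      [0.9899, 0.0, 0.0201])
```

Here C1 = [0.5755, 0.3748, 0.0132] and L1 = 0.5664 is the smallest distance, so the decision is Young.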
Fig. 4.6 Classification algorithm in stage 2: the outputs of NNA, NNB and NNC give P1, P2 and P3; their centroid C1 is compared against the three targets via the distances L1, L2 and L3 to decide Young, Adult or Senior.
For the neural network implementation we used the Matlab tool specially designed for pattern recognition, the Neural Network Pattern Recognition tool, which can be invoked with the nprtool command. This tool uses scaled conjugate gradient back-propagation, as explained in Moller [ref 7], via the trainscg command. The speciality of this SCG method is that it can train any network as long as its weight, net input and transfer functions have derivative functions. The algorithm is based on conjugate directions and does not perform a line search at each iteration.
Training stops when any one of the following occurs:
1. The maximum number of epochs is reached.
2. The maximum amount of time is exceeded.
3. Performance is minimized to the goal.
4. The performance gradient falls below the minimum gradient.
5. Validation performance has increased more than max_fail times.
We take the number of neurons in the hidden layer equal to 40, and we use all 3 databases combined for training. In the tool itself we can specify the percentage of samples used for training, validation and testing; we use a ratio of 70%, 15% and 15% respectively. For training, the tool uses the hyperbolic tangent sigmoid transfer function for neuron modelling.
After satisfactory training, i.e. a good classification rate, the network can be saved and invoked at any time for testing.
CHAPTER-5 RESULTS AND DISCUSSION

Fig. 5.1 Average pitch for females of all 4 classes from database 1
Fig. 5.2 Average pitch for Males of all 4 classes from database 1
Fig. 5.3 Waveform of one of the records from database 1 ("Happy birthday")
Fig. 5.4 Pitch track for the waveform shown in Fig. 5.3

While plotting the pitch track (pitch contour) for databases 1 and 2, we observed dramatic variations in pitch within a stimulus. This motivated the collection of database 3, in which the pitch contour stays near the average pitch value at all times. With the help of database 3 we also learned that males under 12 and females under 18 show distinct results compared to the others: for boys there is a change in pitch after the age of 12 years, and for girls after 18 years. This was the deciding factor in fixing the classification groups as given in Table 1.1, according to which Child covers any human less than 18 years of age. In the following section, with one stimulus as an example, we present the results obtained and the calculations performed while following the algorithm.
Fig. 5.5 Pitch track, unknown stimulus
Fig. 5.6 13 MFCC coefficients, unknown stimulus
Fig. 5.7 12 dMFCC coefficients, unknown stimulus
Fig. 5.8 11 ddMFCC coefficients, unknown stimulus
Fig. 5.9 Feature vector of 37 x 1 for Unknown stimuli [Mean (pitch) 13-MFCC 12-dMFCC 11-ddMFCC]
For the unknown stimulus (mean pitch 106.4090 Hz, ddMFCC(10) = 0.1201), the discriminant scores are

DS1 = 0.027 * 106.4090 + 0.779 * 0.1201 - 5.939
DS2 = -0.006 * 106.4090 + 2.380 * 0.1201 + 0.926

F1 = [-2.93  0.53]

Centroid c0 (Child)  = [2.5286  -0.3146]
Centroid c1 (Female) = [0.4167   0.3881]
Centroid c2 (Male)   = [-2.2844  -0.1795]

Here the distance between c2 and F1 is smaller than the other distances, and c2 belongs to the Male group.

Classification stage 1 result: Male

Now, in classification stage 2, the stimulus goes through the NN1 framework, where it again passes through the NNA, NNB and NNC networks.

Table 5.1 Neural network outputs for the unknown stimulus
P1 = [0.7367  0.1617  0]
P2 = [0  0.9628  0.0194]
P3 = [0.9899  0  0.0201]
C1 = [0.5755  0.3748  0.0132]  (centroid)
L1 = 0.5664, L2 = 0.8499, L3 = 1.2023

L1 is the smallest distance, so the result is again class 1, the Young group, and the final classification is Male-Young, i.e. group 5 (YM).
So the output of our algorithm in the Matlab environment is:

--------Group no----------------
Child     = 1
Female<30 = 2
Female<55 = 3
Female>55 = 4
Male<30   = 5
Male<55   = 6
Male>55   = 7
--------------------------------
And the answer is
group = 5
--------------------------------

Group 5 is YM, so this was a true positive result.
5.2 Classification Stage One Result - Canonical Discriminant Analysis

Classification results of stage one, using database 1 as the training database, are given in Tables 5.2-5.4.
Table 5.2
Canonical Discriminant Function Coefficients

              Function 1   Function 2
pitch            .027        -.006
ddmfcc10         .779        2.380
(Constant)     -5.939         .926
Table 5.3
Functions at Group Centroids

group     Function 1   Function 2
.00          2.529       -.315
1.00          .417        .388
2.00        -2.284       -.179
Fig. 5.10 Discriminant score plot for all 3 groups

Table 5.4 Classification result of stage 1 with database 1
Classification Results (b, c)

                              Predicted Group Membership
                              .00     1.00    2.00
Original         Count  .00    22       5       0
                       1.00     4      31       4
                       2.00     0       1      36
                 %      .00   81.5    18.5     .0
                       1.00   10.3    79.5    10.3
                       2.00    .0      2.7    97.3
Cross-validated  Count  .00    21       6       0
                       1.00     4      31       4
                       2.00     0       1      36
                 %      .00   77.8    22.2     .0
                       1.00   10.3    79.5    10.3
                       2.00    .0      2.7    97.3

a. Cross-validation is done only for those cases in the analysis. In cross-validation, each case is classified by the functions derived from all cases other than that case.
b. 86.4% of original grouped cases correctly classified.
c. 85.4% of cross-validated grouped cases correctly classified.
Fig. 5.11 NN1 framework, NNA network (deals with the YM and AM categories)
Fig. 5.12 NN1 framework, NNB network (deals with the SM and AM categories)
Fig. 5.13 NN1 framework, NNC network (deals with the YM and SM categories)
Fig. 5.14 NN2 framework, NNA network (deals with the YF and AF categories)
Fig. 5.15 NN2 framework, NNB network (deals with the SF and AF categories)
Fig. 5.16 NN2 framework, NNC network (deals with the YF and SF categories)
Fig. 5.19 Overall classification result with the whole of database 3 as testing samples
CHAPTER-6 CONCLUSION & FURTHER WORK

The proposed automatic age and gender estimation system is implemented with the help of the Matlab toolbox. Figure 6.1 compares the classification rates obtained by applying database 3 for testing at the end of the second (last) classification stage. It is found that the male categories have good classification rates overall. Except for AF, the results are quite satisfactory, including the overall classification rate of 69.4%.
Fig. 6.1 Comparison chart for successful estimation of class

As part of further work, the neural networks should be retrained whenever the true class of a user is known but the system reports a different class. In addition, not only do more stimuli need to be collected, but more features also need to be explored.
REFERENCES
1. Welch, P. D. (1967); The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms, IEEE Trans. on Audio and Electroacoustics, Volume AU-15, pages 70-73.
2. Rabiner, L. et al. (1976); A comparative performance study of several pitch detection algorithms, IEEE Trans. Acoustics, Speech and Signal Processing, Volume 24, Issue 5, pages 399-418.
3. Rabiner, L. et al. (1977); On the use of autocorrelation analysis for pitch detection, IEEE Trans. Acoustics, Speech and Signal Processing, Volume 25, Issue 1, pages 24-33.
4. Childers, D. et al. (1977); The Cepstrum: A guide to processing, Proc. IEEE, Volume 65, Issue 10, pages 1428-1443.
5. Klecka, W. R. (1980); Discriminant Analysis, Sage Publications.
6. Minematsu, N. et al. (1993); Automatic estimation of one's age with his/her speech based upon acoustic modelling techniques of speakers, Proc. ICASSP-93, IEEE International Conference on Acoustics, Speech, and Signal Processing.
7. Moller, M. (1993); A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, Volume 6(4), pages 523-533.
8. Braun, A. et al. (1999); Estimating speaker age across languages, Proc. International Congress of Phonetic Sciences (ICPhS-99).
9. Cerrato, L. et al. (2000); Subjective age estimation of telephonic voices, Speech Communication, Volume 31, Issue 2-3 (June 2000), Elsevier.
10. Huang, X. et al. (2001); Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall.
11. Krauss, R. M. et al. (2002); Inferring speakers' physical attributes from their voices, Journal of Experimental Social Psychology, 38, pages 618-625.
12. Shafran, I. et al. (2003); Voice signatures, Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2003).
13. McLeod, P. and Wyvill, G. (2005); A smarter way to find pitch, Proc. International Computer Music Conference, Barcelona, July 2005, pages 300-303.
14. Metze, F. et al. (2007); Comparison of four approaches to age and gender recognition for telephone applications, Proc. ICASSP.
15. Naotoshi Seo (2008); ENEE632 Project 4 Part I: Pitch Detection, ECE Dept., University of Maryland.
16. Spiegl, W. et al. (2009); Analysing Features for Automatic Age Estimation on Cross-Sectional Data, Proc. 10th Annual Conference of the International Speech Communication Association, Brighton, pages 1-4.
14. Metze, F. et.al. (2007); Comparison of four approaches to age and gender recognition for telephone application, ICASSP. 15. Naotoshi Seo (2008); ENEE632 Project4 Part I: Pitch Detection, ECE dept., Maryland university. 16. Spiegl, W. et. al. (2009); Analysing Features for Automatic Age Estimation on Cross-Sectional Data 10th Annual Conference of the International Speech Communication Association, Brighton. Page 1-4.