Topics so far: historical introduction; mathematical background (e.g., pattern classification, acoustics); feature extraction for speech recognition (and some neural processing); what sound units are typically defined; audio signal processing topics (pitch extraction, perceptual audio coding, source separation, music analysis). Now: back to pattern recognition, but including time.
Time Normalization
Linear time normalization; nonlinear time normalization; Dynamic Time Warp (DTW).
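As a point of comparison for DTW, linear time normalization can be sketched as a uniform stretch or compression of the frame sequence to a fixed length. This is a minimal illustration, not taken from the slides: the function name `linear_time_normalize` is hypothetical, and nearest-frame selection is an assumed (very simple) resampling rule.

```python
def linear_time_normalize(frames, target_len):
    """Uniformly stretch or compress a frame sequence to target_len frames.

    frames: list of per-frame features (scalars or vectors).
    Assumption: the nearest input frame is copied; no interpolation is done.
    """
    if not frames or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    n = len(frames)
    if target_len == 1:
        return [frames[0]]
    out = []
    for k in range(target_len):
        # Map output index k linearly onto the input time axis [0, n - 1].
        pos = k * (n - 1) / (target_len - 1)
        # Round half up to pick the nearest input frame.
        out.append(frames[int(pos + 0.5)])
    return out


compressed = linear_time_normalize([0, 1, 2, 3, 4, 5, 6, 7, 8], 5)
stretched = linear_time_normalize([10, 20, 30], 5)
```

Because the mapping is the same straight line for every utterance, it cannot absorb local variations in speaking rate, which is the motivation for the nonlinear (DTW) approach.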
Dynamic programming
Bellman optimality principle (1962): an optimal policy is built from the optimal policies of its subproblems. Best path through a grid: if the best path goes through a grid point, it includes the best partial path to that grid point. Classic example: the knapsack problem.
Knapsack problem
Stuffing a sack with items of different value. Goal: maximize the value in the sack. Key point 1: if the max size is 10, and we know the values of the solutions for a max size of 9, we can compute the final answer from the value of adding items. Key point 2: point 1 sounds recursive, but it can be made efficiently nonrecursive by building a table.
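The table-building idea in key point 2 can be sketched as follows. This is an illustrative version under stated assumptions: item sizes are positive integers, each item type may be used any number of times (the unbounded variant), and the function name `knapsack_table` is hypothetical. The table keeps the best value for every capacity up to the maximum, so each entry is computed from already-solved smaller capacities.

```python
def knapsack_table(capacity, items):
    """Return best[c] = maximum value achievable with capacity c, for c = 0..capacity.

    items: list of (size, value) pairs with positive integer sizes.
    Assumption: unbounded knapsack (an item type can be reused).
    """
    best = [0] * (capacity + 1)
    for c in range(1, capacity + 1):
        # Leaving one unit of space unused is always allowed.
        best[c] = best[c - 1]
        for size, value in items:
            if size <= c:
                # Add this item on top of the best solution for the remaining space.
                best[c] = max(best[c], best[c - size] + value)
    return best


table = knapsack_table(10, [(3, 4), (4, 5), (5, 8)])
```

No recursion is needed: the loop fills the table from small capacities to large ones, exactly once per capacity.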
[Figure: basic DTW step with simple local constraints. Each (i,j) cell has a local distance d and a cumulative distortion D; the accompanying equation gives the basic computational step.]
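The equation itself is not reproduced in this text. A general form consistent with the description (local distance plus the minimum over allowed predecessors) is sketched here; the symbol P(i,j) for the predecessor set, and the specific three-predecessor instance, are assumptions for illustration and may differ from the constraint drawn on the slide.

```latex
% General step: P(i,j) is the set of predecessors allowed by the local constraints.
D(i,j) = d(i,j) + \min_{(i',j') \in P(i,j)} D(i',j')

% One possible simple constraint (an assumption, not taken from the slide):
P(i,j) = \{\, (i-1,\,j),\ (i-1,\,j-1),\ (i-1,\,j-2) \,\}
```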
DTW steps
(1) Compute the local distance d in the 1st column (1st frame of input) for each reference template. Let D(0,j) = d(0,j) for each cell in each template.
(2) For i=1 (2nd column), j=0, compute d(i,j) and add it to the min of all possible predecessor values of D to get the local value of D; repeat for each frame in each template.
(3) Repeat (2) for each column to the end of the input.
(4) For each template, find the best D in the last column of the input.
(5) Choose the word for the template with the smallest D.
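The five steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the names `dtw_cost` and `recognize` are hypothetical, frames are scalars with absolute difference as the local distance, and the predecessor set (i-1, j), (i-1, j-1), (i-1, j-2) is an assumed local constraint. Following the steps literally, any reference frame may start or end the path.

```python
def dtw_cost(inp, ref, dist):
    """Best cumulative distortion D in the last input column for one template.

    inp, ref: sequences of frames; dist(a, b): local distance d.
    Assumed predecessors of cell (i, j): (i-1, j), (i-1, j-1), (i-1, j-2).
    """
    # Step (1): first column, D(0, j) = d(0, j).
    prev = [dist(inp[0], r) for r in ref]
    # Steps (2)-(3): fill the remaining columns, one input frame at a time.
    for i in range(1, len(inp)):
        cur = []
        for j, r in enumerate(ref):
            # Minimum over the allowed predecessors in the previous column.
            best_pred = min(prev[max(0, j - 2): j + 1])
            cur.append(dist(inp[i], r) + best_pred)
        prev = cur
    # Step (4): best D in the last column of the input.
    return min(prev)


def recognize(inp, templates, dist):
    """Step (5): return the word whose template gives the smallest D."""
    return min(templates, key=lambda word: dtw_cost(inp, templates[word], dist))


def abs_dist(a, b):
    return abs(a - b)


templates = {"up": [1, 2, 3, 4], "down": [4, 3, 2, 1]}
utterance = [1, 1, 2, 3, 3, 4]
word = recognize(utterance, templates, abs_dist)
```

Only the previous column is kept while the current one is built, so the sketch never holds the full grid in memory.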
DTW Complexity
Computation: O(N_frames,ref · N_frames,in · N_templates). Storage, though, can be just O(N_frames,ref · N_templates) (store only the current column and the previous column). Constant reduction: global constraints. Constant reduction: local constraints.
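To make the "constant reduction" from a global constraint concrete, the sketch below counts how many grid cells are visited for one template with and without a diagonal band. The band shape, its half-width parameter `band`, and the function name `cells_evaluated` are assumptions chosen for illustration; other global constraints (for example, a parallelogram) would give different counts.

```python
def cells_evaluated(n_in, n_ref, band=None):
    """Count the grid cells visited for one template.

    band: optional half-width (in reference frames) of a band around the
    diagonal, used here as a simple global constraint; None means full grid.
    """
    count = 0
    for i in range(n_in):
        # Position of the diagonal in the reference for input frame i.
        center = i * (n_ref - 1) / max(1, n_in - 1)
        for j in range(n_ref):
            if band is None or abs(j - center) <= band:
                count += 1
    return count


full = cells_evaluated(50, 50)
banded = cells_evaluated(50, 50, band=5)
```

For a 50-by-50 grid the band of half-width 5 visits 520 of the 2500 cells. If the band width grows in proportion to the template length, the order of the computation is unchanged and only the constant shrinks.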
DTW-based K-means
(1) Initialize (how many centers, and where).
(2) Assign each example to the closest center (by DTW distance).
(3) For each cluster, find the template with the minimum value of its maximum distance to the other members, and call it the center.
(4) Repeat (2) and (3) until some stopping criterion is reached.
(5) Use the center templates as references for ASR.
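A compact sketch of this clustering loop is given here. It is illustrative only: the names `dtw`, `minimax_center`, and `dtw_kmeans` are hypothetical, frames are scalars, the DTW variant uses fixed endpoints and symmetric steps (assumed here so that the distance is symmetric), and the stopping criterion is "centers unchanged or iteration limit reached".

```python
def dtw(a, b):
    """DTW distance between two scalar sequences (fixed endpoints, symmetric steps)."""
    inf = float("inf")
    D = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[len(a)][len(b)]


def minimax_center(cluster, dist):
    """Step (3): the member whose maximum distance to the others is smallest."""
    return min(cluster, key=lambda c: max(dist(c, x) for x in cluster))


def dtw_kmeans(examples, centers, dist=dtw, max_iters=20):
    """Steps (2)-(4), starting from the given initial centers (step (1))."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iters):
        # Step (2): assign each example to its closest center.
        clusters = [[] for _ in centers]
        for x in examples:
            k = min(range(len(centers)), key=lambda k: dist(x, centers[k]))
            clusters[k].append(x)
        # Step (3): pick the minimax member; an empty cluster keeps its center.
        new_centers = [minimax_center(c, dist) if c else old
                       for c, old in zip(clusters, centers)]
        # Step (4): stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


examples = [[1, 2, 3], [1, 2, 4], [1, 2, 5], [9, 8, 7], [9, 8, 6], [9, 8, 5]]
centers, clusters = dtw_kmeans(examples, [examples[0], examples[3]])
```

Unlike ordinary K-means, the center is always an actual example rather than an average, since averaging sequences of different lengths is not directly defined.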
Connected Algorithm
In principle: one big distortion matrix (for 20,000 words, 50 frames/word, and a 1000-frame input [10 seconds], that would be 10^9 cells!). Also required: a backtracking matrix (since the word segmentation is not known). Get the best distortion, then backtrack to get the words. Fundamental principle: find the best segmentation and classification as part of the same process, not as sequential steps.
Storage efficiency
Distortion matrix -> 2 columns. Backtracking matrix -> 2 rows: "from template" points to the template with the lowest cost ending here; "from frame" points to the end frame of the previous word.
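The storage scheme can be sketched as a one-pass connected-word search that keeps only the previous and current distortion columns plus two backtracking rows indexed by input frame. This is a simplified illustration under several assumptions: the function name `connected_dtw` is hypothetical, each word must start at its first reference frame and end at its last, the within-word predecessors are (i-1, j), (i-1, j-1), (i-1, j-2), and each cell carries the input frame at which its word started so that the "from frame" row can be filled in.

```python
def connected_dtw(inp, templates, dist):
    """Return (word sequence, total distortion) for a connected-word input."""
    words = list(templates)
    inf = float("inf")
    n = len(inp)
    # The two backtracking rows, one entry per input frame.
    from_template = [None] * n  # template with the lowest cost ending here
    from_frame = [-1] * n       # end frame of the previous word (-1: none)
    end_cost = [inf] * n        # that lowest cost

    prev = None  # previous column: word -> list of (D, start frame of word)
    for i, x in enumerate(inp):
        # Cost of the best word sequence ending just before frame i.
        entry_cost = 0.0 if i == 0 else end_cost[i - 1]
        cur = {}
        for w in words:
            col = []
            for j, r in enumerate(templates[w]):
                cands = []
                if j == 0:
                    cands.append((entry_cost, i))  # start a new word here
                if prev is not None:
                    cands.extend(prev[w][max(0, j - 2): j + 1])
                best, start = min(cands) if cands else (inf, -1)
                col.append((dist(x, r) + best, start))
            cur[w] = col
        # Record the template with the lowest cost ending at this frame.
        best_w = min(words, key=lambda w: cur[w][-1][0])
        end_cost[i] = cur[best_w][-1][0]
        from_template[i] = best_w
        from_frame[i] = cur[best_w][-1][1] - 1
        prev = cur

    if end_cost[n - 1] == inf:
        raise ValueError("no complete word sequence fits the input")
    # Backtrack through the two rows to recover the words.
    result = []
    i = n - 1
    while i >= 0:
        result.append(from_template[i])
        i = from_frame[i]
    result.reverse()
    return result, end_cost[n - 1]


def abs_dist(a, b):
    return abs(a - b)


templates = {"a": [1, 2, 3], "b": [7, 8, 9]}
decoded, cost = connected_dtw([1, 2, 3, 7, 8, 9, 1, 2, 3], templates, abs_dist)
```

Segmentation and classification fall out of the same search: the word boundaries are read off the "from frame" row during backtracking rather than being decided beforehand.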
Knowledge-based segmentation
DTW combines segmentation, time normalization, and recognition; all segmentations are considered. The same feature vectors are used everywhere. Could instead segment separately, using acoustic-phonetic features cleverly. Example: FEATURE, Ron Cole (1983).