
SPEECH

Detection

Voice activity detection (VAD), also known as speech activity detection or speech detection, is a
technique used in speech processing in which the presence or absence of human speech is
detected.[1] The main uses of VAD are in speech coding and speech recognition. It can facilitate
speech processing, and can also be used to deactivate some processes during non-speech sections
of an audio session: it can avoid unnecessary coding/transmission of silence packets in Voice
over Internet Protocol applications, saving on computation and on network bandwidth.

VAD is an important enabling technology for a variety of speech-based applications. Therefore,
various VAD algorithms have been developed that provide varying features and compromises
between latency, sensitivity, accuracy and computational cost. Some VAD algorithms also
provide further analysis, for example whether the speech is voiced, unvoiced or sustained. Voice
activity detection is usually language independent.

It was first investigated for use on time-assignment speech interpolation (TASI) systems.

The typical design of a VAD algorithm is as follows:

There may first be a noise reduction stage, e.g. via spectral subtraction.

Then some features or quantities are calculated from a section of the input signal.

A classification rule is applied to classify the section as speech or non-speech – often this
classification rule finds when a value exceeds a threshold.

There may be some feedback in this sequence, in which the VAD decision is used to improve the
noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback
operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a
lot).[1]
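
For the noise reduction stage mentioned above, a minimal magnitude spectral-subtraction sketch
in Python might look as follows. This is a sketch, not a definitive implementation: the noise
magnitude spectrum noise_mag is assumed to be precomputed from noise-only frames (for
example, frames the VAD itself labelled as non-speech, as in the feedback loop just described).

import numpy as np

def spectral_subtraction(frame, noise_mag, floor=0.02):
    # Analyse one windowed frame, subtract the noise magnitude estimate,
    # and resynthesise using the noisy phase.
    window = np.hanning(len(frame))
    spectrum = np.fft.rfft(frame * window)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    # A spectral floor keeps the result non-negative and limits musical noise.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))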

A representative set of recently published VAD methods formulates the decision rule on a
frame-by-frame basis using instantaneous measures of the divergence distance between speech and
noise. The different measures which are used in VAD methods include spectral slope, correlation
coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.

Regardless of the choice of VAD algorithm, a compromise must be made between having voice
detected as noise and noise detected as voice (between false negatives and false positives). A VAD
operating in a mobile phone must be able to detect speech in the presence of a range of very
diverse types of acoustic background noise. In these difficult detection conditions it is often
preferable that a VAD should fail-safe, indicating speech detected when the decision is in doubt,
to lower the chance of losing speech segments. The biggest difficulty in the detection of speech
in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It may be
impossible to distinguish between speech and noise using simple level detection techniques when
parts of the speech utterance are buried below the noise.

The basic principle of a VAD device is that it extracts measured features or quantities from the
input signal and then compares these values with thresholds usually extracted from noise-only
periods. Voice activity (VAD = 1) is declared if the measured values exceed the thresholds;
otherwise no speech activity is present and silence (VAD = 0) is declared. A general block
diagram of a VAD design is shown in Fig. 1.
VAD design involves extracting acoustic features that can appropriately indicate the probability
of target speech signals existing in observed signals. Based on these acoustic features, a decision
stage then determines whether the target speech signals are present in the observed signals, using
a well-adjusted threshold value. Most VAD algorithms output a binary decision on a
frame-by-frame basis, where a frame of the input signal is a short unit of time, 5–40 ms in length.
The accuracy and reliability of a VAD algorithm depend heavily on the decision thresholds.
Adapting the threshold value helps to track time-varying changes in the acoustic environments,
and hence provides a more reliable voice detection result.

VAD algorithms based on energy thresholding

In energy-based VAD, the energy of the signal is compared with a threshold that depends on the
noise level. Speech is detected when the energy estimate lies above the threshold:

IF (Ej > k · Er), where k > 1, the frame is ACTIVE;
ELSE the frame is INACTIVE.   (1)

In the equation, Er represents the energy of the noise frames, while k · Er is the threshold used in
the decision-making. The scaling factor k provides a safety band for adapting Er and, therefore,
for adapting the threshold. Different energy-based VADs differ in the way the thresholds are
updated. The simplest energy-based method, the Linear Energy-Based Detector (LED), was first
described in [8]. Its rule for updating the threshold value is

Er,new = (1 − p) · Er,old + p · Esilence   (2)

where Er,new is the updated value of the threshold, Er,old is the previous energy threshold, and
Esilence is the energy of the most recent unvoiced frame. The reference Er is thus updated as a
convex combination of the old threshold and the current noise update. The parameter p is a
constant (0 < p < 1).
Estimation

A speech signal can be classified into voiced, unvoiced and silence regions. The near-periodic
vibration of the vocal folds is the excitation for the production of voiced speech. A random,
noise-like excitation is present for unvoiced speech. There is no excitation during silence regions.
The majority of speech regions are voiced in nature; these include vowels, semivowels and other
voiced components. Voiced regions look like a near-periodic signal in the time-domain
representation. Over a short-term window, we may treat voiced speech segments as periodic for
all practical analysis and processing. The periodicity associated with such segments is defined as
the 'pitch period T0' in the time domain and the 'pitch frequency or fundamental frequency F0' in
the frequency domain. Unless specified otherwise, the term 'pitch' refers to the fundamental
frequency F0. Pitch is an important attribute of voiced speech: it carries speaker-specific
information and is also needed for speech coding tasks. Thus the estimation of pitch is one of the
important issues in speech processing.

A large set of methods has been developed in the speech processing area for the estimation of
pitch. The three most widely used are autocorrelation of speech, cepstrum pitch determination
and simplified inverse filter tracking (SIFT) pitch estimation. Part of the success of these
methods is due to the simple steps involved in the estimation of pitch. Even though the
autocorrelation method is mainly of theoretical interest, it provides the framework for the SIFT
method.

Pitch estimation by Autocorrelation method

The information about the pitch period 'T0' is more pronounced in the autocorrelation sequence
of voiced speech than in the speech segment itself. Fig 1 shows a 30 msec segment of voiced
speech and its autocorrelation sequence. Since the autocorrelation sequence is symmetric with
respect to zero lag, only positive lag values are shown in the figure. By 'more pronounced' we
mean that the second largest peak in the autocorrelation sequence represents T0 and can be
picked up easily by a simple peak-picking algorithm, whereas finding 'T0' from the speech
segment itself is harder. Hence the autocorrelation method is preferred over other direct methods
of pitch estimation from speech. Fig 2 shows a 30 msec segment of unvoiced speech and its
autocorrelation sequence. There is no prominent peak as in the case of voiced speech. This is the
fundamental distinction between voiced and unvoiced speech. Further, human speech pitch is
typically in the range 100-400 Hz, and accordingly the pitch period is in the range 2.5-10 msec.
Therefore, for the estimation of pitch, the largest peak in the autocorrelation sequence starting
from a lag of 2.5 msec is picked out, and its distance with respect to zero lag is measured as the
pitch period 'T0'. This is illustrated in Fig 3. Once we know T0, the pitch can be computed as
F0 = Fs / T0, where Fs is the sampling frequency of the speech signal and 'T0' is the pitch period
in samples.

For instance, if T0 = 10 msec and Fs = 8000 Hz, then T0 in samples = 10 × 8 = 80, and
F0 = 8000/80 = 100 Hz.
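
A minimal sketch of this autocorrelation-based pitch estimator follows, assuming the input frame
is already known to be voiced; a fuller version would also compare the peak height with the
zero-lag value to reject unvoiced frames, as the discussion of Fig 2 suggests.

import numpy as np

def pitch_autocorrelation(frame, fs, fmin=100.0, fmax=400.0):
    # Autocorrelation for non-negative lags only (the sequence is symmetric).
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    # Restrict the search to pitch-period lags of 2.5-10 msec (400-100 Hz).
    lag_min, lag_max = int(fs / fmax), int(fs / fmin)
    T0 = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))  # largest peak
    return fs / T0                                          # F0 = Fs / T0

With fs = 8000 Hz and a 100 Hz voiced frame, T0 comes out near 80 samples, matching the
worked example above.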

IMAGE

Detection

Edge detection is a fundamental tool in image processing, machine vision and computer vision,
particularly in the areas of feature detection and feature extraction. Edge detection includes a
variety of mathematical methods that aim at identifying points in a digital image at which
the image brightness changes sharply or, more formally, has discontinuities. The points at which
image brightness changes sharply are typically organized into a set of curved line segments
termed edges. The purpose of detecting sharp changes in image brightness is to capture
important events and changes in properties of the world. It can be shown that under rather
general assumptions for an image formation model, discontinuities in image brightness are likely
to correspond to:[2][3]

discontinuities in depth,

discontinuities in surface orientation,

changes in material properties and

variations in scene illumination.

In the ideal case, the result of applying an edge detector to an image may lead to a set of
connected curves that indicate the boundaries of objects, the boundaries of surface markings as
well as curves that correspond to discontinuities in surface orientation. Thus, applying an edge
detection algorithm to an image may significantly reduce the amount of data to be processed and
may therefore filter out information that may be regarded as less relevant, while preserving the
important structural properties of an image. If the edge detection step is successful, the
subsequent task of interpreting the information contents in the original image may therefore be
substantially simplified. However, it is not always possible to obtain such ideal edges from real
life images of moderate complexity.

Edges extracted from non-trivial images are often hampered by fragmentation, meaning that the
edge curves are not connected, missing edge segments as well as false edges not corresponding
to interesting phenomena in the image – thus complicating the subsequent task of interpreting the
image data.
Edge detection is one of the fundamental steps in image processing, image analysis, image
pattern recognition, and computer vision techniques.
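
As an informal illustration of finding the points at which the image brightness changes sharply, a
minimal gradient-magnitude edge detector can be sketched as follows. This is a sketch under
assumptions: Sobel derivative filters are used via SciPy, and the threshold is an image-dependent
value that must be tuned.

import numpy as np
from scipy import ndimage

def edge_map(image, threshold):
    # Gradient components from the Sobel derivative filters.
    gx = ndimage.sobel(image.astype(float), axis=1)
    gy = ndimage.sobel(image.astype(float), axis=0)
    magnitude = np.hypot(gx, gy)       # brightness change per pixel
    return magnitude > threshold       # binary map of sharp-change points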

Edge properties

The edges extracted from a two-dimensional image of a three-dimensional scene can be
classified as either viewpoint dependent or viewpoint independent. A viewpoint independent
edge typically reflects inherent properties of the three-dimensional objects, such as surface
markings and surface shape. A viewpoint dependent edge may change as the viewpoint changes,
and typically reflects the geometry of the scene, such as objects occluding one another.

A typical edge might for instance be the border between a block of red color and a block of
yellow. In contrast a line (as can be extracted by a ridge detector) can be a small number
of pixels of a different color on an otherwise unchanging background. For a line, there may
therefore usually be one edge on each side of the line.

A simple edge model

Although certain literature has considered the detection of ideal step edges, the edges obtained
from natural images are usually not at all ideal step edges. Instead they are normally affected by
one or several of the following effects:

focal blur caused by a finite depth-of-field and finite point spread function.

penumbral blur caused by shadows created by light sources of non-zero radius.

shading at a smooth object

A number of researchers have used a Gaussian smoothed step edge (an error function) as the
simplest extension of the ideal step edge model for modeling the effects of edge blur in practical
applications. Thus, a one-dimensional image f which has exactly one edge placed at x=0 may
be modeled as:

f(x) = (Ir − Il)/2 · (erf(x / (√2 · σ)) + 1) + Il

At the left side of the edge, the intensity is Il = lim x→−∞ f(x), and at the right of the edge it is
Ir = lim x→+∞ f(x). The scale parameter σ is called the blur scale of the edge. Ideally this scale
parameter should be adjusted based on the quality of the image, to avoid destroying the true
edges of the image.
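
A minimal sketch of this Gaussian-smoothed step edge model, using the error-function form
given above (function and parameter names are illustrative):

import numpy as np
from scipy.special import erf

def smoothed_step_edge(x, I_left, I_right, sigma):
    # Error-function profile: tends to I_left as x -> -inf and to I_right
    # as x -> +inf; sigma is the blur scale of the edge.
    return (I_right - I_left) / 2.0 * (erf(x / (np.sqrt(2.0) * sigma)) + 1.0) + I_left
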
Estimation

This paper explains the use of a sharpening filter to calculate the depth of an object from a
blurred image of it. It presents a technique which is independent of edge orientation. The
technique is based on the assumption that a defocused image of an object is the convolution of a
sharp image of the same object with a two-dimensional Gaussian function whose spread
parameter (SP) is related to the object depth. A sharp image of an object is obtained from a
defocused image of the same object by applying sharpening filters. The defocused and sharp
images of the object are used to calculate the SP which is then related to the object depth.

The Laplacian sharpening filter

Averaging of pixels over an area blurs detail in an image. As the averaging or blurring operation
is similar to the integration operation, the differentiation operation can be expected to have the
opposite effect. Therefore, a blurred image can be sharpened by performing differentiation
operations [13]. Because the blurred features which are to be sharpened (such as lines and edges)
can have any orientation in an image, it is important to employ a derivative operator whose
output is not biased by a particular feature orientation. Therefore, the operator should be
isotropic, i.e. rotation invariant. The Laplacian is a linear derivative operator that is rotationally
invariant. The Laplacian of an image i is the second-order spatial derivative defined as

∇²i = ∂²i/∂x² + ∂²i/∂y²

How the Laplacian is used for sharpening a blurred image can be shown by assuming that the
blur in the image is the result of a diffusion process which satisfies the well-known partial
differential equation

∂i/∂t = c · ∇²i

where c is a constant and i is a function of x, y and t (time). i(x,y,0) is the sharp image s(x,y) at
t = 0. The blurred image i(x,y,t) is obtained at some t = τ > 0. Then, i(x,y,t) is approximated at
t = τ by the following Taylor polynomial:

i(x,y,τ) = i(x,y,0) + τ · (∂i/∂t) + (τ²/2) · (∂²i/∂t²) + …

By ignoring the quadratic and higher-order terms and substituting s for i(x,y,0), i(x,y) for i(x,y,τ)
and c∇²i for ∂i/∂t, a mathematical expression can be derived for s(x,y) as

s(x,y) = i(x,y) − cτ · ∇²i(x,y)   (15)

The above equation indicates that the sharp image s can be obtained by subtracting from the
blurred image i a positive multiple of its Laplacian. If higher-order approximations based on the
Taylor series expansion are used, better results can be achieved. However, this will increase the
computational cost. The aim of this paper is to find a relation between blur and depth rather than
restoring the exact sharp image and the above first-order approximation is sufficient to derive
that relation. Although diffusion may not be an appropriate model for image blur, it is possible
that the sharp image can be computed by a subtractive combination of the blurred image and its
Laplacian. According to the diffusion model, a point source blurs into a spot with a brightness
distribution whose SP is proportional to c. Therefore, c can be estimated by fitting a Gaussian to
the PSF [14]. By convolving both sides of Eq. (15) with the PSF h(x,y) and substituting σ for cτ,
the following formula is obtained:

Substituting Eq. (3) into Eq. (16) gives

h(x,y) can be searched iteratively to minimize the difference between the left and right hand
sides of Eq. (17) over a region P, namely:

As stated in Section 2, h(x,y) is the unique indicator of the depth of a scene. Thus, when the
h(x,y) that minimizes the above expression is obtained, the depth can be computed using the SP
of that h(x,y). By taking the Laplacian of the blurred edge and subtracting the result from the
blurred edge (c = 1), the sharpened edge is obtained. However, it also produces overshoot or
“ringing” on either side of the edge. This problem can be solved by “clipping” the extreme low
and high grey level values.
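
A minimal sketch of this sharpening-and-clipping step, per Eq. (15) with cτ = 1: the discrete
Laplacian is taken from SciPy, and the 0-255 clip range is an assumption for 8-bit grey-level
images.

import numpy as np
from scipy import ndimage

def sharpen(blurred, c_tau=1.0, lo=0.0, hi=255.0):
    # Eq. (15): subtract a positive multiple of the Laplacian from the
    # blurred image to approximate the sharp image.
    lap = ndimage.laplace(blurred.astype(float))
    sharp = blurred - c_tau * lap
    # Clip the extreme low/high grey levels to suppress overshoot ("ringing")
    # on either side of the edge, as described above.
    return np.clip(sharp, lo, hi)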
