Speech synthesizers

by Chris Woodford. Last updated: December 23, 2018.
How long will it be before your computer gazes deep into your eyes and, with all the electronic sincerity it can muster, mutters those three little words that mean so much: "I love you"? In theory, it could happen right this minute: virtually every modern Windows PC has a speech synthesizer (a computerized voice that turns written text into speech) built in, mostly to help people with visual disabilities who can't read tiny text printed on a screen. How exactly do speech synthesizers go about converting written language into spoken? Let's take a closer look!
Artwork: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? Thanks to smartphone agents like Siri, Cortana, and "OK Google," people are slowly getting used to the idea of speaking commands to a computer and getting back spoken replies.
What is speech synthesis?
Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or
mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or
enhancing the colors on a photo you scanned), and output (where you get to see how the computer has processed your
input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other
machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often
called text-to-speech (TTS).
Talking machines are nothing new—somewhat surprisingly, they date back to the 18th century—but computers that
routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized
navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies
on railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking was a truly unique individual—in more ways than one: can you think of any other person famous for talking with a computerized voice? All that may change in future as computer-generated speech becomes less robotic and more human.
How does speech synthesis work?
Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.
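Under the assumption that each stage can be modeled as a simple function, the whole pipeline can be sketched in a few lines of Python. The stage bodies below are illustrative stand-ins (a hard-coded number expansion, a four-word pronouncing dictionary, and a symbol-joining "renderer"), not real synthesis code:

```python
# Toy sketch of the three-stage text-to-speech pipeline described above.
# Each function is a simplified stand-in for a much more complex stage.

def text_to_words(text):
    """Stage 1: normalize raw text into speakable words."""
    return text.replace("1843", "eighteen forty three").lower().split()

def words_to_phonemes(words):
    """Stage 2: look each word up in a (tiny, illustrative) pronouncing dictionary."""
    lexicon = {
        "in": ["IH", "N"],
        "eighteen": ["EY", "T", "IY", "N"],
        "forty": ["F", "AO", "R", "T", "IY"],
        "three": ["TH", "R", "IY"],
    }
    return [p for w in words for p in lexicon.get(w, ["?"])]

def phonemes_to_sound(phonemes):
    """Stage 3: a real system renders audio; here we just join the symbols."""
    return " ".join(phonemes)

print(phonemes_to_sound(words_to_phonemes(text_to_words("In 1843"))))
```

A real synthesizer replaces each stand-in with serious machinery (statistical normalizers, full pronouncing dictionaries, audio generation), but the chain of stages is the same.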
1. Text to words
Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them,
you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing, and usually you have to understand the meaning or make an educated guess to read it correctly. So the initial stage in speech synthesis, which is generally called pre-processing or normalization, is all about reducing ambiguity: it's about narrowing down the many different ways you could read a piece of text into the one that's the most appropriate.
Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words—and that's harder than it sounds. The number 1843 might refer to a quantity of items ("one thousand eight hundred and forty three"), a year or a time ("eighteen forty three"), or a padlock combination ("one eight four three"), each of which is read out slightly differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be reasonable to guess this is a date and pronounce it "eighteen forty three." If there were a decimal point before the numbers (".843"), they would need to be read differently, as "point eight four three."
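The context-based guessing described above can be illustrated with a toy Python heuristic. Everything here is hypothetical: the `expand_number` function and its "year" keyword rule are simple stand-ins, where a real normalizer would use a trained statistical model rather than keyword lookups:

```python
# Toy context-sensitive number expansion: "1843" as a date vs. as digits.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digit(n):
    """Spell out a number from 0-99 in words."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def expand_number(token, context):
    """Choose a reading for a four-digit number from its context.
    A toy heuristic: the presence of 'year' triggers a date-style reading."""
    if "year" in context.lower():
        # Date style: split into two two-digit groups ("eighteen forty three").
        return f"{two_digit(int(token[:2]))} {two_digit(int(token[2:]))}"
    # Fall back to digit-by-digit ("one eight four three").
    return " ".join(ONES[int(d)] for d in token)

print(expand_number("1843", "in the year 1843"))   # eighteen forty three
print(expand_number("1843", "the padlock code 1843"))  # one eight four three
```

A production system would score many candidate readings and pick the most probable one, but the shape of the decision (context in, chosen pronunciation out) is the same.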
Artwork: Context matters: a speech synthesizer needs some understanding of what it's reading.

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast...," the smart money should be on "I read [reed] a book."
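Here is a minimal sketch of that past-tense trick in Python, assuming a tiny hand-made set of past-tense verbs stands in for a real part-of-speech tagger:

```python
# A tiny hand-made stand-in for a real part-of-speech tagger.
PAST_TENSE = {"got", "took", "had", "went", "ate", "was", "were"}

def pronounce_read(preceding_words):
    """Choose between 'red' and 'reed' for the homograph 'read'
    by checking whether the preceding context contains past-tense verbs.
    A toy rule; real systems use trained statistical taggers."""
    if any(w.lower().strip(".,") in PAST_TENSE for w in preceding_words):
        return "red"
    return "reed"

print(pronounce_read("I got up . I took a shower . I had breakfast .".split()))  # red
print(pronounce_read("I get up . I take a shower . I have breakfast .".split()))  # reed
```

The same idea scales up: the more reliably the synthesizer can tag the tense and part of speech of the surrounding words, the better its guesses about homographs become.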
2. Words to phonemes
Having figured out the words that need to be said, the speech synthesizer now has to generate the speech sounds that
make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words and