Speech synthesizers

by Chris Woodford. Last updated: December 23, 2018.

How long will it be before your computer gazes deep into your eyes and, with all the electronic sincerity it can muster, mutters those three little words that mean so much: "I love you"? In theory, it could happen right this minute: virtually every modern Windows PC has a speech synthesizer (a computerized voice that turns written text into speech) built in, mostly to help people with visual disabilities who can't read tiny text printed on a screen. How exactly do speech synthesizers go about converting written language into spoken? Let's take a closer look!

Artwork: Humans don't communicate by printing words on their foreheads for other people to read, so why should computers? Thanks to smartphone agents like Siri, Cortana, and "OK Google," people are slowly getting used to the idea of speaking commands to a computer and getting back spoken replies.

What is speech synthesis?

Computers do their jobs in three distinct stages called input (where you feed information in, often with a keyboard or mouse), processing (where the computer responds to your input, say, by adding up some numbers you typed in or enhancing the colors on a photo you scanned), and output (where you get to see how the computer has processed your input, typically on a screen or printed out on paper). Speech synthesis is simply a form of output where a computer or other machine reads words to you out loud in a real or simulated voice played through a loudspeaker; the technology is often called text-to-speech (TTS).

Talking machines are nothing new: somewhat surprisingly, they date back to the 18th century. But computers that routinely speak to their operators are still extremely uncommon. True, we drive our cars with the help of computerized navigators, engage with computerized switchboards when we phone utility companies, and listen to computerized apologies at railroad stations when our trains are running late. But hardly any of us talk to our computers (with voice recognition) or sit around waiting for them to reply. Professor Stephen Hawking was a truly unique individual, in more ways than one: can you think of any other person famous for talking with a computerized voice? All that may change in future as computer-generated speech becomes less robotic and more human.

How does speech synthesis work?

Let's say you have a paragraph of written text that you want your computer to speak aloud. How does it turn the written words into ones you can actually hear? There are essentially three stages involved, which I'll refer to as text to words, words to phonemes, and phonemes to sound.

1. Text to words

Reading words sounds easy, but if you've ever listened to a young child reading a book that was just too hard for them, you'll know it's not as trivial as it seems. The main problem is that written text is ambiguous: the same written information can often mean more than one thing, and usually you have to understand the meaning or make an educated guess to read it correctly.

So the initial stage in speech synthesis, which is generally called pre-processing or normalization, is all about reducing ambiguity: it's about narrowing down the many different ways you could read a piece of text into the one that's the most appropriate. Preprocessing involves going through the text and cleaning it up so the computer makes fewer mistakes when it actually reads the words aloud. Things like numbers, dates, times, abbreviations, acronyms, and special characters (currency symbols and so on) need to be turned into words, and that's harder than it sounds. The number 1843 might refer to a quantity of items ("one thousand eight hundred and forty three"), a year or a time ("eighteen forty three"), or a padlock combination ("one eight four three"), each of which is read out slightly differently. While humans follow the sense of what's written and figure out the pronunciation that way, computers generally don't have the power to do that, so they have to use statistical probability techniques (typically Hidden Markov Models) or neural networks (computer programs structured like arrays of brain cells that learn to recognize patterns) to arrive at the most likely pronunciation instead. So if the word "year" occurs in the same sentence as "1843," it might be reasonable to guess this is a date and pronounce it "eighteen forty three." If there were a decimal point before the numbers (".843"), they would need to be read differently, as "eight four three."
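To make this concrete, here's a minimal, rule-based sketch in Python of how a normalizer might choose between those readings of "1843". The function names and the context cue (simply spotting the word "year" in the sentence) are my own illustrative assumptions; as explained above, real synthesizers rely on statistical models or neural networks rather than hand-written rules like these.

```python
# A minimal, rule-based sketch of number normalization. The context
# cue (looking for the word "year") is an illustrative assumption;
# real systems use HMMs or neural networks, not hand-written rules.
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def under_hundred(n):
    """Spell out a number below 100, e.g. 43 -> 'forty three'."""
    if n < 20:
        return ONES[n]
    tens, units = divmod(n, 10)
    return TENS[tens] + ("" if units == 0 else " " + ONES[units])

def as_quantity(n):
    """Quantity reading: 1843 -> 'one thousand eight hundred and forty three'."""
    parts = []
    if n >= 1000:
        parts.append(under_hundred(n // 1000) + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(ONES[n // 100] + " hundred")
        n %= 100
    if n:
        parts.append(("and " if parts else "") + under_hundred(n))
    return " ".join(parts)

def as_year(n):
    """Date reading: 1843 -> 'eighteen forty three'."""
    return under_hundred(n // 100) + " " + under_hundred(n % 100)

def as_digits(s):
    """Digit-by-digit reading: '843' -> 'eight four three'."""
    return " ".join(ONES[int(d)] for d in s)

def normalize(token, sentence):
    """Pick a spoken form for a numeric token using crude context cues."""
    if token.startswith("."):                       # ".843": digits read one by one
        return as_digits(token[1:])
    if re.search(r"\byear\b", sentence) and len(token) == 4:
        return as_year(int(token))                  # "the year 1843" -> date reading
    return as_quantity(int(token))                  # default: a quantity of items

print(normalize("1843", "in the year 1843"))   # eighteen forty three
print(normalize("1843", "1843 items sold"))    # one thousand eight hundred and forty three
print(normalize(".843", "the code is .843"))   # eight four three
```

Notice how the same token gets three different spoken forms depending on context; scaling this beyond a toy is exactly why real synthesizers reach for the probabilistic techniques described above.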
Artwork: Context matters: a speech synthesizer needs some understanding of what it's reading.

Preprocessing also has to tackle homographs, words pronounced in different ways according to what they mean. The word "read" can be pronounced either "red" or "reed," so a sentence such as "I read the book" is immediately problematic for a speech synthesizer. But if it can figure out that the preceding text is entirely in the past tense, by recognizing past-tense verbs ("I got up... I took a shower... I had breakfast... I read a book..."), it can make a reasonable guess that "I read [red] a book" is probably correct. Likewise, if the preceding text is "I get up... I take a shower... I have breakfast...," the smart money should be on "I read [reed] a book."

2. Words to phonemes

Having figured out the words that need to be said, the speech synthesizer now has to generate the speech sounds that make up those words. In theory, this is a simple problem: all the computer needs is a huge alphabetical list of words and details of how to pronounce each one.
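Here's an equally minimal sketch of that dictionary-lookup idea. The tiny pronunciation table is a made-up stand-in (its entries imitate the style of the freely available CMU Pronouncing Dictionary, a real resource many systems build on), and the lookup logic is an illustrative assumption, not any particular engine's code.

```python
# A minimal sketch of the words-to-phonemes lookup stage: map each
# word to a list of phonemes. The tiny table below is illustrative;
# real systems use a full pronouncing dictionary (e.g. CMUdict) plus
# letter-to-sound rules for words the dictionary doesn't cover.

PRONUNCIATIONS = {
    "i":    ["AY"],
    "read": ["R", "EH", "D"],   # past-tense reading, "red"; the earlier
                                # pre-processing stage is what decides
                                # between the "red" and "reed" homographs
    "a":    ["AH"],
    "book": ["B", "UH", "K"],
}

def words_to_phonemes(words):
    """Look up each word and concatenate the phoneme sequences."""
    phonemes = []
    for word in words:
        entry = PRONUNCIATIONS.get(word.lower())
        if entry is None:
            # Real synthesizers fall back on letter-to-sound rules here;
            # this sketch just flags the gap.
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(entry)
    return phonemes

print(words_to_phonemes(["I", "read", "a", "book"]))
# ['AY', 'R', 'EH', 'D', 'AH', 'B', 'UH', 'K']
```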
