Speech technology


Technology is changing all our lives, and one aspect of technology is soon to become a leading influence in this revolution. One of the most fundamental attributes of the human intellect is the ability to communicate by speaking, and we have at last achieved the goal of developing machines that are themselves capable of spoken communication. To many ordinary people, speech technology seems an idea out of science fiction, yet the work done in recent decades by scientists around the world has resulted in products with real commercial and industrial potential at relatively low cost.

For most speech scientists, speech technology comprises two principal areas: automatic speech recognition and speech synthesis. In principle, however, any area where technology is involved in the process of spoken communication should be regarded as an example of speech technology. Recognition and synthesis are, to a large extent, complementary fields, and might be thought of as two sides of the same coin. There are, nevertheless, major differences between them. We look first at speech recognition.


a. Applications of speech recognition

The most frequently quoted application for speech recognition is in office dictation systems. It is believed that there will be major economic benefits when a fully reliable system is on the market. Currently users must choose between systems which recognise a small vocabulary (one or two thousand words) reliably in a reasonably natural speaking style, and systems which recognise a large vocabulary (tens of thousands of words) only in an unnatural speaking style in which words are separated by pauses. We can expect soon to see an office dictation system capable of taking dictation from many speakers using a large (though not unlimited) vocabulary and more or less natural connected speech. Such a system will receive the spoken input and produce a letter or report with proper formatting and spelling. It must be remembered that correct spelling is not easy to achieve in English: the difficulty of converting spelling to sound and sound to spelling is one of the problems that receives most effort in English-speaking countries, and it is a problem that could be avoided if English spelling were reformed. In this context we should note that most people can speak more rapidly than they can type, so a speech-input system is likely to speed up work in some areas.

            An important application area for speech technology, and one with a value that everyone can see, is in helping disabled people. There are many people who are physically unable to operate a keyboard but have the power of speech. To be able to control their environment by spoken commands (open the door, switch on the heating, operate an alarm) would be a big help to such people, and voice-operated devices can provide this.


b. Techniques in speech recognition

It is a striking fact that the most lasting developments of speech technology have been the result of partnership between specialists in computer science and electronic engineering on the one hand and specialists in speech science and linguistics on the other. Attempts to solve the many problems of speech recognition simply by advanced engineering have resulted in systems that work satisfactorily within the laboratory for an ideal speaker, but have been unable to survive exposure to the enormous variability of speech in the real world. The input of speech science has been of different types in different applications, but I believe phonetic expertise is always an essential component of a successful system.

            In earlier times, it seemed obvious that we humans, having learned the principal characteristics of the speech sounds that must be recognised, must then instruct the computer on how to perform the same task: such systems were known as knowledge-based systems. More recently, we have been able to work with computer systems that learn for themselves what characteristics distinguish the various units of speech. In the case of knowledge-based systems, the relevant input of the speech scientist was to provide the engineer with the best possible descriptions of the data. But in self-teaching systems, the input is completely different.

            If we look at how a human child learns to understand speech, we can see that the process is one of repeated exposure to the data, day after day, with regular feedback on whether understanding has taken place correctly. There is no sudden transition in the child’s learning which is equivalent to the moment when a complex computer program begins to perform correctly. Training a self-teaching recognition system is similar: the process is one of providing the computer with very large bodies of carefully prepared training data, so that it will become familiar with each particular unit (such as a phoneme or syllable) that it must learn, in every context in which it may occur, spoken by a large and representative set of speakers. If the data is badly prepared, the learning will never be successful. As a result, there has been an enormous growth in the development of speech databases to be used for training (and later on for testing) recognition systems. These databases comprise carefully controlled recordings of many speakers (sometimes recorded under a variety of conditions), and expert transcriptions of the data made in such a way that the computer can link each symbol in the transcription with a particular part of the sound recording. Thus any present-day attempt to develop an effective speech recognition system must have a suitable speech database, and if such a database does not exist, it must be created.
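The link between transcription symbols and stretches of the recording can be pictured with a small data structure. The following is only a minimal sketch: the class names, the speaker identifiers, the file names, the symbols and the timings are all invented for illustration and do not describe any particular database.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Segment:
    label: str    # transcription symbol (placeholder symbols, not a real alphabet)
    start: float  # start time within the recording, in seconds
    end: float    # end time within the recording, in seconds


@dataclass(frozen=True)
class Utterance:
    speaker: str                    # hypothetical speaker identifier
    recording: str                  # hypothetical file name of the sound recording
    segments: Tuple[Segment, ...]   # time-aligned transcription


def is_well_aligned(utt: Utterance) -> bool:
    """Return True if every segment has positive length and none overlaps the previous one."""
    previous_end = 0.0
    for seg in utt.segments:
        if seg.start < previous_end or seg.end <= seg.start:
            return False
        previous_end = seg.end
    return True


def speakers_per_label(database: List[Utterance]) -> Dict[str, Set[str]]:
    """For each symbol, collect the set of speakers who produced it."""
    coverage = defaultdict(set)
    for utt in database:
        for seg in utt.segments:
            coverage[seg.label].add(utt.speaker)
    return dict(coverage)


# Invented toy database: two speakers, made-up timings.
DATABASE = [
    Utterance("spk01", "spk01_001.wav",
              (Segment("k", 0.00, 0.08), Segment("a", 0.08, 0.21), Segment("t", 0.21, 0.30))),
    Utterance("spk02", "spk02_001.wav",
              (Segment("a", 0.00, 0.12), Segment("t", 0.12, 0.20))),
]

if __name__ == "__main__":
    print(all(is_well_aligned(u) for u in DATABASE))
    print(speakers_per_label(DATABASE))
```

A coverage table of this kind is one simple way to check whether each unit has been recorded from a representative set of speakers before training begins.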


c. Speech synthesis applications

In talking about speech recognition, we noted that we can speak more rapidly than we can type. One disadvantage of speech synthesis as a way of providing information is that, in general, we can read a screen more rapidly than we can listen to a voice. Receiving information from a synthesiser can be frustratingly slow, so we need to look carefully to find applications where the advantages of speech output compensate for this. Clearly we should look at cases where the user’s eyes are not available. In-car information is one example which is developing rapidly: as cars become stuck in congestion more and more often, there is a growing market for systems which advise on the least congested route.

            Speech synthesis can also help disabled people. One of the most attractive applications is that of reading machines for the blind. A printed page is scanned, the text is converted into phonetic symbolic form and speech is synthesised. This requires a synthesis-by-rule program, and improving synthesis-by-rule is probably the most important activity in this field. Speech synthesis can, of course, also help those who are unable to speak. One of Britain’s greatest scientists, Professor Stephen Hawking, is only able to speak by means of a “pointer” keyboard and speech synthesiser. Sadly, it must be admitted that the application of speech synthesis which is most likely to make money is that of talking toys.
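The step in a reading machine where text is converted into phonetic symbolic form can be sketched as a lookup in an exceptions dictionary followed by letter-to-sound rules. This is a toy illustration only: the rule table, the exception list and the function names are invented here, the symbols are loosely SAMPA-style, and a real synthesis-by-rule program would need far more rules than this.

```python
# Words whose pronunciation cannot be derived from the toy rules below (assumed examples).
EXCEPTIONS = {
    "the": ["D", "@"],
    "one": ["w", "V", "n"],
}

# Toy letter-to-sound rules: two-letter spellings are tried before single letters.
RULES = {
    "sh": "S", "ch": "tS", "th": "T", "ee": "i:",
    "a": "{", "e": "e", "i": "I", "o": "Q", "u": "V",
    "b": "b", "c": "k", "d": "d", "f": "f", "g": "g", "h": "h",
    "k": "k", "l": "l", "m": "m", "n": "n", "p": "p", "r": "r",
    "s": "s", "t": "t", "v": "v", "w": "w", "z": "z",
}


def word_to_symbols(word):
    """Convert one written word to a list of phonetic symbols."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    symbols = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RULES:
            symbols.append(RULES[pair])
            i += 2
        elif word[i] in RULES:
            symbols.append(RULES[word[i]])
            i += 1
        else:
            i += 1  # characters with no rule (e.g. punctuation) are skipped
    return symbols


def text_to_symbols(text):
    """Convert whitespace-separated text to one symbol list per word."""
    return [word_to_symbols(w) for w in text.split()]


if __name__ == "__main__":
    print(text_to_symbols("the ship"))
```

The exceptions dictionary reflects the difficulty of English spelling: without its entry for "one", the letter rules alone would produce the symbols Q, n, e, which is not how the word is pronounced.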


d. Synthesis techniques

The self-teaching processes described under speech recognition above also work for synthesis, so our work on constructing speech databases has value in this field too. There are many applications where it has been found that a completely artificial synthetic voice is not necessarily the best solution. As signal processing techniques develop, it can be more practical to manipulate “real” speech signals to generate new messages without the old problem of noticeable discontinuities in the signal where pieces of speech are joined together. Finally, it is important to remember that while high-quality synthesis of small sections of speech is important in the speech research context, it is synthesis-by-rule, which takes written text as its input and produces connected speech as its output, that represents the commercial future of this field. At present there is still a long way to go before the goal of truly natural-sounding synthetic speech from synthesis-by-rule is achieved.
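The problem of discontinuities where pieces of recorded speech are joined can be illustrated with a very simple technique, a linear crossfade over a short overlap. This is a sketch under stated assumptions, not a description of any particular synthesis system: the function names are invented, the signals are plain lists of sample values, and practical systems use considerably more refined signal processing.

```python
def crossfade_join(first, second, overlap):
    """Join two sample sequences, blending the last `overlap` samples of
    `first` with the first `overlap` samples of `second` (linear weights)."""
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit inside both signals")
    head = list(first[:len(first) - overlap])
    tail = list(second[overlap:])
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the second signal rises across the overlap
        a = first[len(first) - overlap + i]
        b = second[i]
        blended.append((1.0 - w) * a + w * b)
    return head + blended + tail


def largest_jump(signal):
    """Largest difference between neighbouring samples, a crude measure of discontinuity."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


if __name__ == "__main__":
    # Artificial test signals (not real speech): two constant levels that do not match.
    first = [1.0] * 8
    second = [-1.0] * 8
    abrupt = first + second
    smooth = crossfade_join(first, second, overlap=4)
    print(largest_jump(abrupt), largest_jump(smooth))
```

In this artificial case the abrupt join produces a jump of 2.0 between neighbouring samples, while the crossfaded join spreads the change over several smaller steps.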


e. Education

There is growing interest in speech technology as a way of providing additional teaching for advanced-level language learners who need practice in using the spoken language. Computer systems are being developed which give learners tasks, evaluate their spoken performance and diagnose errors. It is not realistic to think of these as replacing teachers, but rather as providing extra work for students who require further practice outside the classroom.


What to read

P.B. DENES AND E. PINSON, The Speech Chain, W.H. Freeman & Co., 2nd edn., 1993.

Chapters 10 and 11 contain a reasonably up-to-date review of synthesis and automatic recognition.