Invited paper given at the Romanian Awareness Seminar on Language and Technology, Bucharest, January 1996. An updated version was published in ‘Recent Advances in Romanian Language Technology’, eds. D. Tufis and P. Andersen, Editura Academiei Romane, Bucharest 1997.
SPEECH TECHNOLOGY: A LOOK TO THE FUTURE
Professor of Phonetics and Director of the Speech Research Laboratory,
University of Reading, U.K.
Technology is changing all our lives, and one aspect of technology is soon to become a leading influence in this revolution. One of the most fundamental attributes of the human intellect is the ability to communicate by speaking, and we have at last achieved the goal of developing machines that are themselves capable of spoken communication. To many ordinary people, speech technology seems an idea out of science fiction, yet the dedicated work of scientists around the world involved in this area of technology in recent decades has resulted in products with real commercial and industrial potential at relatively low cost.
Many surveys of the field have been written (see for example Lea, 1980a and b; Laver, 1994; Keller, 1994; Bernstein and Franco, 1996; Javkin, 1996), and innumerable conferences such as ICASSP, EUROSPEECH, ICSLP and meetings of the ASA have reported developments, some dramatic, others more mundane, in the scientific work which has enabled the growth of this technology. It is not my purpose here to write yet another general survey, nor am I (as a specialist in phonetics and speech science, rather than in electronic engineering or computer science) competent to write a detailed summary of the latest technological advances. My aim is simply to list the most obvious ways in which speech technology is likely to affect our lives, and to look at the challenges which face researchers in this field today.
For most speech scientists, speech technology comprises two principal areas: automatic speech recognition, and speech synthesis. We should not underestimate the importance of another area, that of speech compression and coding, but the involvement of conventional speech science in this area is rather less obvious. In principle, any area where technology is involved in the process of spoken communication should be regarded as an example of Speech Technology. Recognition and synthesis are, to a large extent, complementary fields, and might be thought of as two sides of the same coin. There are, however, major differences. One of the most influential figures in the development of speech technology in Britain, Dr. John Holmes, who has made major advances in both fields, once jokingly said that if speech synthesis is comparable with getting toothpaste out of a tube, speech recognition is like trying to get the toothpaste back in. Speech recognition has the biggest potential for economic success, but presents the biggest technical challenges.
Let us first look at speech recognition.
a. Applications of speech recognition
The most frequently quoted application for speech recognition is in office dictation systems. It is believed that there will be major economic benefits when a fully reliable system is on the market. Currently users must choose between systems which recognise a small vocabulary (one or two thousand words) reliably and with reasonably natural speaking style, and systems which recognise a large vocabulary (tens of thousands of words) in an unnatural speaking style in which words are separated by pauses. The DragonDictate system has an active vocabulary of 30,000 words, with the capability of using an additional 80,000 words from a well-known dictionary. It adapts to individual speakers, whereas most systems have to be trained to work with one particular person’s voice. It is clear that we can expect soon to see an office dictation system which will be capable of taking dictation from many speakers using a large (though not unlimited) vocabulary and more or less natural connected speech. Such a system will receive the spoken input, and produce a letter or report with proper formatting and spelling. It must be remembered that correct spelling is not easy to achieve in English, and the difficulty of converting spelling to sound and sound to spelling is one of the problems that receives most effort in English-speaking countries - a problem that could be avoided if English spelling were reformed. In this context we should note that most people can speak more rapidly than they can type, so a speech-input system is likely to speed up work in some areas.
I have to say that I do not regard automatic dictation machines as an unmixed blessing: when the technology is fully established in the market, and computer companies are making good profits from selling their machines, tens of thousands of secretarial jobs will be lost, while in general letters will not be typed any better than they were before. There are other applications where the benefits are clearer. One very large field is that of telephone interaction with information systems. Although it is often possible to connect a computer to a remote server over telephone lines, this can be inconvenient, and to be able to use speech over the telephone is a real advantage. I have had experience of using a telephone airline flight enquiry system developed by the Marconi company and found it very effective. Enquiries about trains, weather, financial movements, entertainment, bank accounts, even elementary medical diagnosis could be made to work in this way; imagine a remote community with no nearby doctor being able to dictate the symptoms of an illness to a computer and receive advice on how serious the condition was likely to be, with the option of having the computer call a doctor in if the diagnosis was sufficiently urgent. The advantage of this becomes even more obvious if you imagine a community where most people are illiterate and would be unable to use a keyboard even if they had one. The computer which receives the telephone call and gives out the information is, of course, available 24 hours per day, every day. There are two major technical challenges here: one is that most telephone systems still effectively band-pass filter the signal between about 300 and 3,300 Hz, severely reducing the amount of acoustic information available to the recognizer.
The second is that the system must be able to work with any voice presented to it: the full range of speaker types (female and male, young and old), accent types and speaking styles must be recognized, and while in an office environment it may be assumed that operators of voice-input technology will be well-trained and co-operative, information system users calling in by telephone may well be less easy to work with.
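To give a rough feeling for the first of these challenges, the following sketch estimates how much of a signal's energy falls inside the telephone band of about 300 to 3,300 Hz. It is only an illustration under stated assumptions: the 16 kHz sampling rate, the 20 ms frame and the three test tones (standing in for voice pitch, a mid-band resonance and fricative noise) are hypothetical choices, and the code measures in-band energy with a naive DFT rather than simulating any real telephone channel.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz; an assumed wideband sampling rate
FRAME_LEN = 320      # 20 ms frame, so DFT bins fall every 50 Hz


def dft(samples):
    """Naive discrete Fourier transform (adequate for one short frame)."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def band_energy_fraction(samples, low_hz, high_hz, rate=SAMPLE_RATE):
    """Fraction of the frame's energy lying between low_hz and high_hz."""
    spectrum = dft(samples)
    n = len(samples)
    total = kept = 0.0
    for k in range(1, n // 2):  # positive frequencies, skipping DC
        power = abs(spectrum[k]) ** 2
        total += power
        if low_hz <= k * rate / n <= high_hz:
            kept += power
    return kept / total


def make_tones(freqs_hz, n=FRAME_LEN, rate=SAMPLE_RATE):
    """Sum of equal-amplitude sine tones, a crude stand-in for speech."""
    return [
        sum(math.sin(2 * math.pi * f * t / rate) for f in freqs_hz)
        for t in range(n)
    ]


# Hypothetical components: 150 Hz, 1000 Hz and 5000 Hz tones
frame = make_tones([150, 1000, 5000])
fraction = band_energy_fraction(frame, 300, 3300)
print(f"Energy surviving the telephone band: {fraction:.2f}")
```

With these three equal tones only the 1000 Hz component lies inside the band, so about one third of the energy survives. Real speech is of course distributed quite differently, but the principle that information below and above the band is lost to the recognizer is the same.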
Another big area within the applications field is the "hands and eyes busy" situation - situations where someone needs to interact with a computer but is not in a position to use a keyboard. The example most often quoted is that of aircraft pilots, but I think there are many less exotic applications. One is certainly the car phone: it is well known that drivers dialling telephone numbers while at the wheel are unsafe, and the technology exists already to allow drivers to request a telephone number by voice. Manual dialling while driving should be made illegal, and this would dramatically boost the sale of "voice-dialling" systems. There are many other applications: I have been involved in several projects with the Institute for Transport Studies in Leeds University. Research in transport engineering requires a lot of observation of traffic in motion, and researchers often have to stand on motorway bridges or railway platforms manually recording what they see. Studies of urban car parking may require regular recording of information on all the cars in a car park, while street fittings such as warning notices and road markings also have to be surveyed regularly. We found these tasks could be made much easier if the researchers were equipped with hand-held portable computers with voice recognition capability - the spoken data was entered directly into a database. In our research, we found that recognizer performance could be unsatisfactory, and could deteriorate over time as the speaker became tired, unless the speaker received immediate feedback from the computer confirming what had been said.
We also looked at the work of geologists: in inspecting core samples, the geologists were often working in difficult and dirty conditions which would have damaged most computers and left keyboards covered in mud; however, using a radio microphone connected to a speech recognizer allowed observations of the samples to be entered instantly into the computer being used for the survey work.
Another application area for speech technology, and one with a value that everyone can see, is in helping the disabled. There are many people who are physically unable to operate a keyboard but have the power of speech. To be able to control their environment by spoken commands (open the door, switch on the heating, operate an alarm) would be a big help to such people, and voice-operated devices can provide this.
b. Techniques in speech recognition
Many books and papers on speech technology devote considerable space to reviewing the heroic days of the 1970s and 1980s when many of the computational techniques in use today were laboriously worked out. There is not space in this paper to go through such a historical review. I would simply like to make two basic points. Firstly, the most lasting developments of speech technology have been the result of partnership between specialists in computer science and electronic engineering on the one hand and specialists in speech science and linguistics on the other. Attempts to solve the many problems of speech recognition simply by advanced engineering have resulted in systems that work satisfactorily within the laboratory for an ideal speaker, but have been unable to survive exposure to the enormous variability of speech in the real world. The input of speech science has been of different types in different applications, but I believe phonetic expertise is always an essential component of a successful system.
The second point to make is that we have made a fundamental and, I believe, permanent change in the way we train computers. In earlier times, it seemed obvious that we humans, having learned what the principal characteristics of the speech sounds to be recognized are, must then instruct the computer on how to perform the same task: such systems were known as knowledge-based systems. But more recently, we have been able to work with computer systems that learn for themselves what characteristics distinguish the various units of speech; these may be called self-teaching systems. In the case of knowledge-based systems, the relevant input of the speech scientist was to provide the engineer with the best possible descriptions of the data. But in self-teaching systems, the input is completely different. Since this is a matter of great concern for present-day research, let us look at it in some detail.
There are two main computational techniques which could be called self-teaching. One makes use of Hidden Markov Models (see for example Lee et al, 1990), a technique which essentially exploits the statistical nature of transitional probabilities to construct automatically a mathematical model of each unit to be recognized. This technique has been of enormous value in improving the design and performance of recognition systems. The other technique is that of Artificial Neural Networks, a technique which has fascinated academic researchers with its promise of computational simulation of interaction between neurons in a nervous system. Although many people have questioned the appropriateness of equating this computational technique with the function of real nervous systems, there are many striking parallels (Rumelhart and McClelland, 1986). These techniques have many points in common, but the most important one is that in order to learn, they must be given very large amounts of appropriate training data (Kohonen and Torkkola, 1990). If we look at how a human child learns to understand speech, we can see that the process is one of repeated exposure to the data, day after day, with regular feedback on whether understanding has taken place correctly. There is no sudden transition in the child’s learning which is equivalent to the moment when a complex computer program begins to perform correctly. Training a recognition system is similar: the computer must be provided with very large bodies of carefully prepared training data, so that it will become familiar with each particular unit (such as a phoneme or syllable) that it must learn, in every context in which it may occur, spoken by a large and representative set of speakers. If the data is badly prepared, the learning will never be successful. As a result, there has been an enormous growth in the development of speech databases to be used for training (and later on for testing) recognition systems.
These databases comprise carefully-controlled recordings of many speakers (sometimes recorded under a variety of conditions), and expert transcriptions of the data made in such a way that the computer can link each symbol in the transcription with a particular part of the sound recording. Thus any present-day attempt to develop an effective speech recognition system must have a suitable speech database, and if such a database does not exist, it must be created. For the languages which have been extensively worked on (such as English, French, German and Japanese), general databases already exist and much effort is going into constructing more specialized databases of particular kinds of speech (for example, in my laboratory we are working on a database of emotional speech). But at the same time, there is a strong growth in the compilation of speech databases for languages which have not received so much attention in the past. This is the main reason for the existence of the BABEL project, a three-year project funded by the European Union (COPERNICUS Project #1304) which brings together speech technology and speech science specialists in many different European countries, and is putting together a database of some languages of Central and Eastern Europe (Roach et al, 1996). We hope that the number of languages in BABEL will grow, but the present list comprises Bulgarian, Estonian, Hungarian, Polish and Romanian; it is a great pleasure for me to be speaking in one of the BABEL partner nations, and to mention the valuable contribution of my colleague Marian Boldea, of the Technical University of Timisoara, who is leading the work on Romanian. This database should be of great value in the development of Romanian-language speech technology applications. It is based on the design of the speech database EUROM1 (Chan et al, 1995).
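To make the idea of a Hidden Markov Model a little more concrete, here is a minimal sketch of how such models can be used to choose between two words. Everything in it is a hypothetical toy: the two word models, their states, the coarse acoustic labels ("glide", "vowel", "nasal", "fric") and all the probabilities are invented for illustration, whereas in a real system they would be estimated automatically from a training database of the kind described above. The scoring method is the standard forward algorithm, which sums over all state sequences the probability that a model generated the observed labels.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) computed with the forward algorithm."""
    states = list(start)
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p].get(s, 0.0) for p in states)
            * emit[s].get(symbol, 0.0)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical left-to-right word models: (start, transition, emission)
YES = (
    {"y": 1.0, "e": 0.0, "s": 0.0},
    {"y": {"y": 0.5, "e": 0.5}, "e": {"e": 0.6, "s": 0.4}, "s": {"s": 1.0}},
    {
        "y": {"glide": 0.8, "vowel": 0.2},
        "e": {"vowel": 0.9, "glide": 0.1},
        "s": {"fric": 0.9, "vowel": 0.1},
    },
)
NO = (
    {"n": 1.0, "o": 0.0},
    {"n": {"n": 0.5, "o": 0.5}, "o": {"o": 1.0}},
    {"n": {"nasal": 0.9, "vowel": 0.1}, "o": {"vowel": 0.9, "glide": 0.1}},
)
MODELS = {"yes": YES, "no": NO}


def recognise(obs, models=MODELS):
    """Pick the word whose model gives the observations the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


print(recognise(["glide", "vowel", "vowel", "fric"]))
print(recognise(["nasal", "vowel", "vowel"]))
```

The quality of the decision depends entirely on the probabilities in the models, which is why the preparation of the training data matters so much.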
Before leaving the subject, I would like to mention the importance of prosody: factors in speech such as intonation, stress and rhythm. There is much evidence that humans make extensive use of prosody in speech understanding, yet artificial systems so far make little use of this information. Many other non-segmental factors need to be considered (Young, 1990).
We can now turn to Speech Synthesis.
c. Speech synthesis applications
In talking about speech recognition, we noted that we can speak more rapidly than we can type. One disadvantage of speech synthesis as a way of providing information is that, in general, we can read a screen more rapidly than we can listen to a voice. Receiving information from a synthesiser can be frustratingly slow, so we need to look carefully to find applications where the advantages of speech output compensate for this. Clearly we should look at cases where the user’s eyes are not available. In-car information is one example which is developing rapidly: as cars become stuck in congestion more and more often, there is a growing market for systems which advise on the least congested route. This is more useful than the rudimentary in-car synthesis systems of a decade ago, when mechanical voices informed the driver about the car’s oil level or an unfastened seat-belt. Drivers did not like them (indeed, it has been reported that some drivers actually paid garages to disconnect the voice module). A recent example from the UK of the difficulty of getting synthetic voices into everyday use comes from a story in the Guardian newspaper, reporting on trials of a new "talking bus stop" in Leeds. Passengers waiting for a bus could press buttons on a panel to ask for spoken information about the various destinations of buses. Unfortunately, a young computer "hacker" managed to penetrate the computer system providing this service and substituted some messages that should not be spoken in polite society, and certainly never by a talking bus stop. When this breach of security had been fixed, another problem remained: the synthesizer could only speak in a refined Southern accent ("Received Pronunciation") which is not liked in the Yorkshire city of Leeds, and there were so many complaints about this accent that the service has been withdrawn until a Yorkshire accent can be substituted.
Speech synthesis can, however, help the disabled. One of the most attractive applications is that of reading machines for the blind. A printed page is scanned, the text is converted into phonetic symbolic form and speech is synthesised. This requires a synthesis-by-rule program, and the improvement of synthesis-by-rule is probably the most important activity in this field. Of course, speech synthesis can also help those disabled who are unable to speak. One of Britain’s greatest scientists, Professor Stephen Hawking, is only able to speak by means of a "pointer" keyboard and speech synthesizer.
Sadly, it must be admitted that the application of speech synthesis which is most likely to make money is that of talking toys.
d. Synthesis techniques
As with recognition, it is not possible here to review the whole range of synthesis techniques. I would like to mention a few important points, however. Firstly, the self-teaching processes described under speech recognition above work also for synthesis - Hidden Markov Models and Artificial Neural Networks can be used for synthesis, and it follows that our work on constructing speech databases has value in this field also. Secondly, there are many applications where it has been found that a completely artificial synthetic voice is not necessarily the best solution. As signal processing techniques develop, it can be more practical to manipulate "real" speech signals to generate new messages without the old problem of noticeable discontinuities in the signal where pieces of speech are joined together. Finally, it is important to remember that while high-quality synthesis of small sections of speech is important in the speech research context, it is synthesis-by-rule which represents the commercial future of this field, and at present there is still a long way to go before the goal of truly natural-sounding synthetic speech from synthesis-by-rule is achieved. Synthesis-by-rule takes as its input written text and produces as its output connected speech. One of the most important components of such a system is a dictionary which gives the pronunciations of words in computer-readable form. I have recently finished working on a 90,000-word pronouncing dictionary of English (Roach and Hartman, 1997) which exists in computer-readable form, and it is our intention to exploit this dictionary for such purposes. Another pronunciation dictionary (Wells, 1991) is also available in computer-readable form.
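The role of the pronouncing dictionary in synthesis-by-rule can be sketched as follows. This is a minimal illustration, not a description of any existing system: the four-word lexicon, the SAMPA-style transcriptions and the one-letter-one-sound fallback table are all hypothetical, and a real dictionary would contain tens of thousands of entries. The sketch looks each word up in the lexicon and falls back on naive letter-to-sound rules only when the word is missing.

```python
import re

# Hypothetical miniature lexicon with SAMPA-style transcriptions
LEXICON = {
    "the": "D @",
    "cat": "k { t",
    "sat": "s { t",
    "on": "Q n",
}

# Naive one-letter-one-sound fallback; deliberately crude
FALLBACK = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "{ b k d e f g h I dZ k l m n Q p k r s t V v w ks j z".split(),
))


def letter_to_sound(word):
    """Spell out a word sound by sound using the fallback table."""
    return " ".join(FALLBACK[letter] for letter in word)


def transcribe(text):
    """Convert text into a list of phonetic transcriptions, one per word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [LEXICON.get(word) or letter_to_sound(word) for word in words]


print(transcribe("The cat sat on the mat."))
print(letter_to_sound("the"))
```

The last line shows why the dictionary is needed for English: applied to "the", the letter-by-letter rules produce "t h e", which is nothing like the real pronunciation.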
One of the most significant challenges is the production of realistic prosody. As with recognition, this is an area waiting for more research work to be done. We need large amounts of prosodically-transcribed data. One example of such data is the Spoken English Corpus, and its computer-readable version MARSEC (Roach et al, 1994; Roach and Arnfield, 1995). This makes it possible to train a prosody-generating program, and to simulate attitudes and emotions (Murray and Arnott, 1993).
Although I have spoken of specific issues in recognition and synthesis, I would like to speak of education as a separate area. There is growing interest in speech technology as a way of providing additional teaching for advanced-level language learners who need practice in using the spoken language. Computer systems are being developed which give learners tasks, evaluate their spoken performance and diagnose errors. It is not realistic to think of these as replacing teachers, but rather as providing additional work for students who require additional practice outside the classroom. A good example of a specific research project which made important progress in this area is the SPELL project, funded by the European Union’s ESPRIT programme (see papers by Bagshaw et al, 1993; Hiller et al, 1993 a,b,c; Rooney et al, 1993). For a different approach, with the suggestion that speech files could be moved over the Internet for pronunciation training, see Hiester and Abercrombie (1994).
For most of my research career, speech recognition and speech synthesis have been areas of technology that were only available in well-funded research laboratories. The first speech recognition system that I worked with in the 1980s cost about as much as a good new car. In the last few years the price of this technology has dropped dramatically, and some of the major manufacturers of speech recognition technology are selling small-scale systems for as little as $50 which are nevertheless capable of being used for serious dictation work. The breakthrough into widespread public use of speech technology is already beginning to happen. One of the most urgent tasks facing us is to carry out the large amount of work that remains to be done on languages that are not major world languages. I hope that speech research on languages such as Romanian will be encouraged and accelerated by these developments.
References
Ainsworth, W.A. (1988) Speech Recognition by Machine, IEE, Peter Peregrinus.
Asadi, A., Lubensky, D. et al (1995) ‘Combining speech algorithms into a natural application of speech technology for telephone network services’, Proceedings of Eurospeech, Madrid, vol.1, 273-276.
Bagshaw, P., Hiller, S.M. and Jack, M.A. (1993) ‘Enhanced pitch tracking and the processing of F0 contours for computer aided intonation teaching’, Proceedings of Eurospeech 93, Berlin.
Bernstein, J. and Franco, H. (1996) ‘Speech recognition by computer’, in Lass, N.J. (ed.) Principles of Experimental Phonetics, Mosby, pp. 408-434.
Billi, R. et al (1990) ‘A PC-based very large vocabulary isolated word speech recognition system’, Speech Communication 9, pp. 521-530.
Bloothooft, G., Hazan, V., Huber, D. and Llisterri, J. (1995) European Studies in Phonetics and Speech Communication, Utrecht: OTS.
Chan, D., Fourcin, A.J. et al (1995) ‘EUROM - a Spoken Language Resource for the EU’, Proceedings of EUROSPEECH, Madrid, vol.1, 867-871.
Chen, J-K., Lee, L-S and Soong, F.K (1995) ‘Large vocabulary, word-based Mandarin dictation system’, Proceedings of EUROSPEECH, Madrid, vol.1, 285-291.
Cooke, M., Beet, S. and Crawford, M. (1993) Visual Representation of Speech Signals, Wiley.
Dalsgaard, P. and Baekgaard, A. (1990) ‘Recognition of continuous speech using neural nets and expert system processing’, Speech Comm. 9.
Fallside, F. and Woods, W.A. (eds.) (1985) Computer Speech Processing, Prentice-Hall.
Ferretti, M. et al (1990) ‘Measuring information provided by language model and acoustic model in probabilistic speech recognition: theory and experimental results’, Speech Comm. 9.
Hiester, C. and Abercrombie, J. (1994) ‘Penn’s virtual language lab on the Internet’, (Internet under address http://philae/sas.upenn.edu)
Hiller, S., Rooney, E., Lefevre, J-P and Jack, M. (1993a) ‘SPELL: A pronunciation training device based on speech technology’, Proceedings of ESCA/NATO Workshop on Applications of Speech Technology, Lautrach, Germany.
Hiller, S.M., Rooney, E., Lefevre, J-P. and Jack, M. (1993b) ‘SPELL: an automated system for computer-aided pronunciation teaching’, Proceedings of Eurospeech 93, Berlin.
Hiller, S.M., Rooney, E., Vaughan, R., Eckert, M., Laver, J. and Jack, M. (1993c) ‘An automated system for computer-aided pronunciation learning’, paper presented at CALL 93: "Reactive and Creative CALL", University of Exeter.
Jack, M.A. and Laver, J. (1988) Aspects of Speech Technology, Edinburgh University Press.
Javkin, H. (1996) ‘Speech analysis and synthesis’, in Lass, N.J. (ed.) Principles of Experimental Phonetics, Mosby, pp. 245-273.
Keller, E. (ed.) (1994) Fundamentals of Speech Synthesis and Speech Recognition, London: John Wiley.
Kohonen, M. and Torkkola, K. (1990) ‘Using self-organizing maps and multi-layered feed-forward nets to obtain phonemic transcriptions of spoken utterances’, Speech Comm. 9.
Laver, J. (1994) ‘Speech technology overview’, in Asher, R. (ed) Encyclopaedia of Linguistics.
Lea, W.A. (1980a) ‘The value of speech recognition systems’, in Lea 1980b, pp. 3-18.
Lea, W.A. (ed.) (1980b) Trends in Speech Recognition, Prentice-Hall.
Lee, K-F et al (1990) ‘Speech recognition using Hidden Markov Models: a CMU perspective’, Speech Comm. 9.
Leech, G., Myers, G. and Thomas, J. (1995) Spoken English on Computer, Longman.
Linggard, R. (1985) Electronic Synthesis of Speech, Cambridge University Press.
Misheva, A., Dimitrova, S. et al (1995) ‘Bulgarian Speech Database - a Pilot Study’, Proceedings of EUROSPEECH, Madrid, vol.1, 859-863.
Murray, I. and Arnott, J. (1993) ‘Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion’, JASA 93(2), pp.1097-1108.
Niedermair, G., Streit, M. and Tropf, H. (1990) ‘Linguistic processing related to speech understanding in SPICOS II’, Speech Comm. 9.
Roach, P. (ed.) (1992) Computing in Linguistics & Phonetics, Academic Press.
Roach, P. and Arnfield, S. (1995) ‘Aligning prosodic transcription with the time dimension’, in G.N.Leech, G.Myers and J.Thomas (eds.) Spoken English on Computer, Longman.
Roach, P. and Hartman, J. (1997) The Daniel Jones English Pronouncing Dictionary, 15th edition, Cambridge University Press.
Roach, P.J., Knowles, G.O., Varadi, T. and Arnfield, S.C. (1994) ‘MARSEC: a MAchine-Readable Speech Database’, Journal of the International Phonetic Association vol.23:2, pp.47-54.
Roach, P.J., Knowles, G.O., Varadi, T., Ghali, N. and Arnfield, S.C. (1992) MARSEC Speech Database: CD-ROM disk.
Roach, P.J., Arnfield, S., Barry, W., Baltova, J., Boldea, M., Fourcin, A., Gonet, W., Gubrynowicz, R., Hallum, E., Lamel, L., Marasek, K., Marchal, A., Meister, E. and Vicsi, K. (1996) ‘BABEL: an Eastern European Multi-Language Database’, Proceedings of 4th International Conference on Spoken Language Processing, Philadelphia, SaP2P1.1.
Rooney, E., Vaughan, R., Hiller, S., Carraro, F. and Laver, J. (1993) ‘Training vowel pronunciation using a computer-aided teaching system’, Proceedings of Eurospeech 93, Berlin.
Rumelhart, D.E. and McClelland, J. (1986) Parallel Distributed Processing, M.I.T. Press.
Silverman, K. (1984) ‘F0 perturbations as a function of voicing of prevocalic and postvocalic stops and fricatives, and of syllable stress’, Proceedings of the Institute of Acoustics.
Silverman, K. (1990) ‘The separation of prosodies: comments on Kohler’s paper’, in LabPhon 1, eds. J.Kingston and M.Beckman, pp. 139-151.
Wang, H.D., Degryse, D. and Carraro, F. (1993) ‘A prosody modification approach for auditory user feedback in the SPELL pronunciation teaching system’, Proceedings of Eurospeech 93, Berlin.
Wells, J.C. (1991) Longman Pronunciation Dictionary, Longman.
Young, S. (1990) ‘Use of dialogue, pragmatics and semantics to enhance speech recognition’, Speech Comm. 9.