Speech technology
Technology is changing all our lives, and speech technology is soon
to become a leading influence in this revolution. One of
the most fundamental attributes of the human intellect is the ability to
communicate by speaking, and we have at last achieved the goal of developing
machines that are themselves capable of spoken communication. To many ordinary
people, speech technology seems an idea out of science fiction, yet the work of
scientists around the world involved in this area of technology in recent
decades has resulted in products with real commercial and industrial potential at relatively low cost.
For most speech scientists, speech technology
comprises two principal areas: automatic speech recognition, and speech
synthesis. In principle, any area where technology is involved in the process
of spoken communication should be regarded as an example of speech technology.
Recognition and synthesis are, to a large extent, complementary fields, and
might be thought of as two sides of the same coin. There are, however, major
differences. We look first at speech recognition.
a. Applications
of speech recognition
The most frequently quoted application for speech
recognition is in office dictation systems. It is believed that there will be
major economic benefits when a fully reliable system is on the market.
Currently users must choose between systems which recognise a small vocabulary
(one or two thousand words) reliably and with a reasonably natural speaking
style, and systems which handle a large vocabulary (tens of thousands of words)
but only in an unnatural speaking style in which words are separated by
pauses. It is clear that we can
expect soon to see an office dictation system which will be capable of taking
dictation from many speakers using a large (though not unlimited) vocabulary
and more or less natural connected speech. Such a system will receive the
spoken input, and produce a letter or report with proper formatting and
spelling. It must be remembered that correct spelling is not easily achieved
in English, and the difficulty of converting spelling to sound and
sound to spelling is one of the problems that receives most effort in
English-speaking countries - a problem that could be avoided if English
spelling were reformed. In this context we should note that most people can
speak more rapidly than they can type, so a speech-input system is likely to
speed up work in some areas.
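The irregularity of English spelling mentioned above can be made concrete with a small illustration. The sketch below (invented respellings, not a real pronouncing system) shows why simple letter-to-sound rules break down: a single spelling pattern such as "ough" maps to several different sounds, so practical systems must fall back on a lexicon of known words.

```python
# Illustration only: the same spelling pattern maps to many sounds in
# English, so a dictation system cannot rely on letter-to-sound rules
# alone. Pronunciations here are rough respellings, not IPA.
OUGH_WORDS = {
    "though":  "thoh",
    "through": "throo",
    "rough":   "ruff",
    "cough":   "coff",
    "bough":   "bow",
}

def ough_sound(word):
    """Look the word up; any single rule for 'ough' would fail here."""
    return OUGH_WORDS[word]

# Five words, five different sounds for the same four letters.
print(len(set(OUGH_WORDS.values())))  # 5
```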
An
important application area for speech technology, and one with a value that
everyone can see, is in helping the
disabled. There are many people who are physically unable to operate a keyboard
but have the power of speech. To be able to control their environment by spoken
commands (open the door, switch on the heating, operate an alarm) would be a
big help to such people, and voice-operated devices can provide this.
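The kind of voice control described above can be sketched as a simple dispatch from recognised text to actions. The command set, function names, and device state below are invented for illustration; the speech recogniser itself is assumed to exist and to deliver its result as plain text.

```python
# A minimal sketch (hypothetical command set) of how recognised speech
# could drive environmental controls: the recogniser's text output is
# matched against a table of known commands, each bound to an action.
def make_controller():
    state = {"door": "closed", "heating": "off", "alarm": "off"}

    def handle(utterance):
        commands = {
            "open the door":         lambda: state.update(door="open"),
            "switch on the heating": lambda: state.update(heating="on"),
            "operate the alarm":     lambda: state.update(alarm="on"),
        }
        action = commands.get(utterance.lower().strip())
        if action is None:
            return "command not recognised"
        action()
        return "done"

    return handle, state

handle, state = make_controller()
handle("Open the door")
print(state["door"])  # open
```

A real system would of course need to cope with recognition errors and paraphrases, but the dispatch structure is the same.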
b. Techniques
in speech recognition
It is a striking fact that the most lasting
developments of speech technology have been the result of partnership between
specialists in computer science and
electronic engineering on the one hand and specialists in speech science and
linguistics on the other. Attempts to solve the many problems of speech
recognition simply by advanced engineering have resulted in systems that work
satisfactorily within the laboratory for an ideal speaker, but have been unable
to survive exposure to the enormous variability of speech in the real world.
The input of speech science has been of different types in different
applications, but I believe phonetic expertise is always an essential component
of a successful system.
In
earlier times, it seemed obvious that we humans, having learned the
principal characteristics of the speech sounds that must be recognised, must
then instruct the computer on how to perform the same task: such systems were
known as knowledge-based systems. But
more recently, we have been able to work with computer systems that can
learn for themselves what characteristics distinguish the various units of
speech. In the case of
knowledge-based systems, the relevant input of the speech scientist was to
provide the engineer with the best possible descriptions of the data. But in
self-teaching systems, the input is completely different.
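The contrast between the two approaches can be sketched with a toy example. Everything below is invented for illustration (one acoustic feature, made-up first-formant values for two vowels): the "knowledge-based" classifier applies a threshold written down by the expert, while the "self-teaching" one derives its answer directly from labelled training examples.

```python
# Toy contrast between knowledge-based and self-teaching recognition.
# Hypothetical first-formant (F1) values in Hz for two vowels.
TRAINING = [(300, "i"), (320, "i"), (340, "i"),
            (700, "a"), (720, "a"), (760, "a")]

def knowledge_based(f1):
    # Rule supplied by the speech scientist: low F1 means a close vowel.
    return "i" if f1 < 500 else "a"

def self_teaching(f1):
    # Learn from the data: take the label of the nearest training example.
    nearest = min(TRAINING, key=lambda ex: abs(ex[0] - f1))
    return nearest[1]

print(knowledge_based(330), self_teaching(330))  # i i
print(knowledge_based(740), self_teaching(740))  # a a
```

In the knowledge-based case the expert's description *is* the system; in the self-teaching case the expert's contribution shifts to preparing good training data, which is the point developed below.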
If
we look at how a human child learns to understand speech, we can see that the
process is one of repeated exposure to
the data, day after day, with regular feedback on whether understanding has
taken place correctly. There is no sudden transition in the child’s learning
which is equivalent to the moment when a complex computer program begins to
perform correctly. The process is one of providing the computer with very large
bodies of carefully prepared training data, so that it will become familiar
with each particular unit (such as a phoneme or syllable) that it must learn,
in every context in which it may occur, spoken by a large and representative
set of speakers. If the data is badly prepared, the learning will never be
successful. As a result, there has been an enormous growth in the
development of speech databases used for
training (and later for testing) recognition systems. These
databases comprise carefully-controlled recordings of many speakers (sometimes
recorded under a variety of conditions), and expert transcriptions of the data
made in such a way that the computer can link each symbol in the transcription
with a particular part of the sound recording. Thus any present-day attempt to develop an effective speech
recognition system must have a suitable speech database, and if such a database
does not exist, it must be created.
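What "linking each symbol in the transcription with a particular part of the sound recording" means in practice can be sketched as a time-aligned transcription: each phoneme label carries the start and end time of the stretch of audio it describes. The labels and timings below are invented for illustration.

```python
# A sketch of a time-aligned transcription entry in a speech database.
from dataclasses import dataclass

@dataclass
class Segment:
    label: str    # phoneme symbol
    start: float  # seconds into the recording
    end: float

# Invented alignment for a recording of the word "cat": /k ae t/
alignment = [
    Segment("k",  0.00, 0.08),
    Segment("ae", 0.08, 0.23),
    Segment("t",  0.23, 0.31),
]

def segment_at(time, segments):
    """Return the phoneme being spoken at a given time, if any."""
    for seg in segments:
        if seg.start <= time < seg.end:
            return seg.label
    return None

print(segment_at(0.15, alignment))  # ae
```

Given many such alignments from many speakers, a learning system can collect every stretch of audio labelled "ae" and discover for itself what that unit sounds like in different contexts.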
c. Speech
synthesis applications
In talking about speech recognition, we noted that
we can speak more rapidly than we can type. One disadvantage of speech
synthesis as a way of providing information is that, in general, we can read a
screen more rapidly than we can listen to a voice. Receiving information from a
synthesiser can be frustratingly slow, so we need to look carefully to find
applications where the advantages of speech output compensate for this. Clearly
we should look at cases where the user’s eyes are not available. In-car
information is one example which is developing rapidly: as cars become stuck in
congestion more and more often, there is a growing market for systems which
advise on the least congested route. Speech
synthesis can help the disabled. One of the most attractive applications is
that of reading machines for the blind. A printed page is scanned, the text is
converted into phonetic symbolic form and speech is synthesised. This requires
a synthesis-by-rule program, and improving synthesis-by-rule is
probably the most important activity in this field. Of course, speech
synthesis can also help disabled people who are unable to speak. One of
Britain’s greatest scientists, Professor Stephen Hawking, is only able to speak
by means of a “pointer” keyboard and speech synthesiser. Sadly, it must be admitted
that the application of speech synthesis which is most likely to make money is
that of talking toys.
d. Synthesis
techniques
The self-teaching processes described under speech
recognition above work also for synthesis - our work on constructing speech
databases has value in this field also. There are many applications where it
has been found that a completely artificial synthetic voice is not necessarily
the best solution. As signal processing techniques develop, it can be more
practical to manipulate “real” speech signals to generate new messages without
the old problem of noticeable discontinuities in the signal where pieces of
speech are joined together. Finally, it is important to remember that while
high-quality synthesis of small sections of speech is valuable in the speech
research context, it is synthesis-by-rule, which takes written text as its
input and produces connected speech as its output, that represents the
commercial future of this field; at present there is still a long way to go
before the goal of truly natural-sounding synthetic speech from
synthesis-by-rule is achieved.
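The first stage of synthesis-by-rule, converting written text into a phonetic symbol string, can be sketched as follows. Real systems combine a large pronouncing lexicon with elaborate letter-to-sound rules; the tiny lexicon and crude fallback rule below are invented purely for illustration.

```python
# Sketch of the text-to-phoneme stage of synthesis-by-rule.
# Invented mini-lexicon using rough ARPAbet-like symbols.
LEXICON = {
    "the": "dh ax",
    "cat": "k ae t",
    "sat": "s ae t",
}

def letter_to_sound(word):
    # Crude fallback: one symbol per letter (a real rule set is far richer).
    return " ".join(word)

def text_to_phonemes(text):
    words = text.lower().split()
    return " | ".join(LEXICON.get(w, letter_to_sound(w)) for w in words)

print(text_to_phonemes("The cat sat"))  # dh ax | k ae t | s ae t
```

The later stages, turning the phoneme string into an acoustic signal, are where the "natural-sounding" problem discussed above really lies.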
e. Education
There is growing interest in speech technology as a
way of providing additional teaching for advanced-level language learners who need
practice in using the spoken language. Computer systems are being developed
which give learners tasks, evaluate their spoken performance and diagnose
errors. It is not realistic to think of these as replacing teachers, but rather
as providing extra work for students who require additional practice
outside the classroom.
What to read
P. B. Denes and E. Pinson, The Speech Chain, 2nd edn., W. H. Freeman & Co., 1993.
Chapters 10 and 11 contain a reasonably up-to-date review
of synthesis and automatic recognition.