Executive Summary

The task in speech recognition is to be able to speak into a computer microphone and have the computer type out what was said. While speech recognition systems are commercially available for limited domains, state-of-the-art systems achieve only about a 60%-65% word recognition rate on casual speech, i.e., telephone conversations. Since speaking rates of 200 words per minute are not uncommon in casual speech, a 60% word recognition accuracy implies approximately 80 errors per minute -- an unacceptably high rate for many applications. Furthermore, recognition performance is not improving rapidly. Improvements in word recognition accuracy of a few percent are considered "big" improvements, and recognition rates of the best systems on the Switchboard data have stayed between 61.2% and 64.9% for three consecutive years, although they have improved from only 52% four years ago.
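The error-rate arithmetic above can be checked with a few lines of code. This is only a minimal sketch: the function name is ours, and it assumes that every misrecognized word counts as exactly one error (insertion errors are ignored).

```python
def errors_per_minute(words_per_minute: float, word_accuracy: float) -> float:
    """Expected number of misrecognized words per minute.

    Assumption (ours): each spoken word is either recognized correctly or
    counted as one error, so errors = words * (1 - accuracy).
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    return words_per_minute * (1.0 - word_accuracy)


if __name__ == "__main__":
    # 200 words per minute is the casual-speech rate quoted in the text.
    for accuracy in (0.60, 0.65):
        rate = errors_per_minute(200, accuracy)
        print(f"{accuracy:.0%} accuracy -> {rate:.0f} errors per minute")
```

At 200 words per minute, the 60% case reproduces the figure of roughly 80 errors per minute quoted above.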

For various reasons, including poor performance in real world situations, several agencies have been looking for alternatives to the hidden Markov models (HMMs) that are the best current tool for speech recognition -- particularly alternatives that incorporate more knowledge of speech production. Thus, Maximum Likelihood Continuity Mapping (MALCOM), which has been shown to be able to find a mapping between speech acoustics and speech articulator positions (e.g. tongue, jaw and lip positions), while sharing the probabilistic framework that makes HMMs so powerful, is an attractive approach.

An important achievement of the last year was to invent a workable acoustic model capable of being incorporated into the current state-of-the-art speech recognition packages in place of hidden Markov models. This is important in that the acoustic models we created are based on articulation, and so should have a good chance of eventually outperforming HMMs (although they are not yet as good as HMMs).

The word model we devised incorporates a new invention called Multiple-Observable MALCOM (MO-MALCOM). Research funded through this grant showed that MO-MALCOM has the ability to distinguish phonemes better than measured articulator positions. This is an important finding, especially since recent work by Roweis at Caltech demonstrated that articulator data could be used in conjunction with acoustic data to get nearly perfect recognition performance on a speaker-dependent data set (Roweis, 1998).

In fact, the ability of MO-MALCOM to distinguish phonemes is remarkable considering that MO-MALCOM was not using information from word-level models or information about the phonetic context. While we know of no way to directly compare MO-MALCOM's results to HMM results, it seems doubtful that HMM phoneme models that were not incorporated into word or multi-phone models could perform with comparable accuracy.

The work quantifying MO-MALCOM's ability to discriminate phonemes argues that our current speech recognition system is limited not by the MO-MALCOM word model but by the components of the recognition algorithm that were not given much attention by this project, such as the word lattices, the prior word probabilities, and the dictionary. Sophisticated versions of the non-MALCOM components have been implemented by other labs around the country (at considerable expense) and, perhaps, should not be re-invented at Los Alamos National Laboratory.

We tested our speaker-dependent, isolated-word recognition system on a data set extracted from Switchboard telephone conversations and on a set of German sentences. We tried three different methods for creating word models (models that estimate the probability of the speech acoustics given the words). The difficulties encountered with each model led to refinements incorporated in the subsequent model.
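To illustrate how a word model of this kind is typically used in isolated-word recognition, the sketch below applies the standard Bayes decision rule: choose the word that maximizes log P(acoustics | word) + log P(word). The report does not spell out its scoring procedure, so the rule, the function name, and all the numbers here are assumptions made purely for illustration.

```python
import math
from typing import Dict, Optional


def recognize(acoustic_loglik: Dict[str, float],
              prior: Optional[Dict[str, float]] = None) -> str:
    """Return the candidate word with the highest posterior score.

    acoustic_loglik: log P(acoustics | word) for each candidate word, as a
        word model would supply it (values below are hypothetical).
    prior: optional P(word); if omitted, all words are treated as equally
        likely and only the acoustic score matters.
    """
    if not acoustic_loglik:
        raise ValueError("need at least one candidate word")

    def score(word: str) -> float:
        log_prior = math.log(prior[word]) if prior is not None else 0.0
        return acoustic_loglik[word] + log_prior

    return max(acoustic_loglik, key=score)


if __name__ == "__main__":
    # Hypothetical acoustic log-likelihoods for one spoken word.
    loglik = {"yes": -41.2, "no": -40.9, "maybe": -47.5}
    print(recognize(loglik))
    # Hypothetical prior word probabilities can change the decision.
    print(recognize(loglik, prior={"yes": 0.6, "no": 0.2, "maybe": 0.2}))
```

With these made-up scores, the acoustic evidence alone slightly favors one word, while adding prior word probabilities tips the decision to another. This shows how much a close acoustic call can depend on the non-acoustic components of a recognizer.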

We currently have word recognition rates of around 40% on the training set, and we can expect recognition performance to be worse on the test set. It is difficult to compare our results to other results because the training and testing sets are so different from anything that has been used before, but these results are very probably worse than those of state-of-the-art systems. As mentioned above, state-of-the-art, speaker-independent, continuous-speech recognizers working on the Switchboard data set (from which our data set was extracted) show recognition rates around 60%-65%. However, our task is much more limited than current speech recognition work, being speaker-dependent, isolated-word recognition instead of speaker-independent, continuous-speech recognition. It might be tempting to compare our results to other speaker-dependent, isolated-word recognition tests, but our data came from the Switchboard data set, which is one of the most difficult data sets yet used for recognition evaluation.

There are several factors that contribute to our low recognition performance. The biggest disadvantage our system has compared to state-of-the-art systems is the lack of 30 years of parameter tweaking. We have made a great number of arbitrary decisions without the time to evaluate their consequences. Just performing further parametric tests (e.g. further adjusting the dimensions and cut-off frequency of the continuity map to optimize performance) would, in all likelihood, greatly improve recognition performance. Six other factors contribute to low recognition performance. First, our training set is much smaller than the speaker-independent, continuous-speech recognition training sets commonly used today (we use about 3 minutes of speech as opposed to, say, 65 hours in the complete Switchboard training set). Second, since we are doing isolated-word recognition, we are unable to take advantage of a language model. Third, the model that estimates the probability of sequences of phonemes given a word is simpler than those in state-of-the-art recognition systems. Fourth, our current dictionary contains only canonical pronunciations of words, as opposed to the pronunciations that commonly occur in casual speech. This problem is particularly severe since the word extraction process sometimes deletes phonemes or adds phonemes to the beginning or the end of the word. Fifth, we are not currently using any methods for combining the outputs of different recognizers. Sixth, we did not use cepstral mean subtraction or variance normalization.
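For readers unfamiliar with the sixth factor, the sketch below shows per-utterance cepstral mean subtraction and variance normalization in its most common form: each cepstral coefficient is shifted to zero mean and scaled to unit variance across the frames of an utterance. It is a generic illustration, not code from this project. The function name, the `eps` guard, and the toy input are our own assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def cmvn(frames: Sequence[Sequence[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize each cepstral coefficient over one utterance.

    frames: one sequence of cepstral coefficients per analysis frame.
    eps: small constant (our choice) that avoids division by zero when a
        coefficient is constant across the utterance.
    """
    if not frames:
        return []
    columns = list(zip(*frames))             # one tuple per coefficient
    means = [fmean(col) for col in columns]  # mean subtraction term
    stds = [pstdev(col) for col in columns]  # variance normalization term
    return [
        [(x - m) / (s + eps) for x, m, s in zip(frame, means, stds)]
        for frame in frames
    ]


if __name__ == "__main__":
    # Toy utterance: three frames, two coefficients (hypothetical values).
    utterance = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
    for frame in cmvn(utterance):
        print(frame)
```

Subtracting the mean removes a constant channel offset in the cepstral domain, which is one reason the technique is widely used on telephone speech.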

In the course of doing this work, we also performed preliminary tests on compressing speech from sentences spoken by a German speaker. Our best estimates suggest that we may be able to achieve around a 25% reduction in the number of bits needed to transmit vector quantization codes (which are already a highly compressed representation of speech). This could result in significant cost savings in military and civilian communications.

In summary, we tested more word models than we proposed and came up with a better model than we expected. Furthermore, we found evidence that MALCOM may be useful for speech compression as well. While recognition performance in the first year was not good, we have every reason to believe that the system can be improved just by using components currently used in other speech recognition systems.

J. Hogden, D. Nix, and P. Valdez. An articulatorily constrained, maximum likelihood approach to speech recognition. Los Alamos Technical Report LA-UR-96-3518, Los Alamos National Laboratory, Los Alamos, NM, 1998.