"A high-level introduction to ASR"
In the last post, we discussed the acoustic basis of speech. In this post, we'll build on those concepts to discuss automatic speech recognition in preparation for extracting the features we'll use as data in the next post.
Automatic speech recognition
Automatic speech recognition is the process of having a computer recognize speech, that is, recognize what words are contained in an acoustic signal. There are a number of strategies one could use to approach this problem, but the successful neural network papers I have come across use the acoustic information in the signal to determine what segments the signal contains. Researchers in this area borrow from phonology an abstraction over phones called the phoneme as the object to predict. This choice is simultaneously pragmatic and problematic.
The choice is pragmatic because it reduces the number of classes to predict between: a phoneme is a grouping of phones that, in a given context, don't change the meaning/identity of a word, and there are, as far as I'm aware, always fewer phonemes than phones for a language. The phoneme /t/ can be realized as a number of different phones depending on the phones that surround it, but choosing another member of the group would not change the meaning of an utterance. For example, kitten is usually pronounced more like [kɪʔn̩], but pronouncing it as [kɪtn̩] would not change the meaning of the word. So the phonemic representation of kitten could be said to be /kɪtn̩/, and this would be the target output of a speech recognition system presented with a recording of kitten.
However, the choice is problematic because phonemic representations of words are theoretical and may not closely match the phones actually present in the acoustic signal. As such, there is the potential to introduce error into the training data by forcing the network to output predictions for a phoneme that isn't acoustically present in a word. In usage scenarios, a strict reliance on a mapping from phoneme sequences to words will be fragile as well, because the phone sequence a user utters may not map neatly onto the phoneme sequences the system uses to look up words. Additionally, phoneticians do not agree that phonemes actually exist cognitively or physically (see [Port & Leary (2005)](https://www.cs.indiana.edu/~port/pap/AgainstFormalPhonology.June2.05.pdf) for a discussion of why phonemes are problematic in linguistic theory).
With that caveat out of the way, the idea of the phoneme is still useful for our purposes, so we'll leave it at this: phonemes are useful tools, regardless of whether or not they exist cognitively or physically.
So, where are we? We've defined the output type for our neural network: phoneme labels for whatever data we feed into it. Now we need to determine what kind of data that will be.
Representations of speech data
There are myriad ways of representing speech data. There is simply the vector of intensity measures (essentially, the waveform), which has been used in some recent speech recognition papers to decent effect (see, for example, Palaz, Collobert, & Doss, 2013). Often, however, researchers include some sort of frequency information in their representation. Because the human perceptual system responds in an approximately logarithmic way to different frequencies (Johnson, 2012), a variety of nonlinear frequency scales have been devised. Some examples include the Mel scale (Wikipedia contributors, 2018) and the Bark scale (Wikipedia contributors, 2017). However, for speech recognition, I have only encountered features derived from the Mel scale. Primarily, Mel filterbanks and Mel-frequency cepstral coefficients (MFCCs) seem to have been used.
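To make the Mel scale a little more concrete, here is a minimal sketch in Julia of the common Hz-to-Mel conversion (the constants 2595 and 700 are the usual ones, though implementations vary slightly; the function names are just mine for illustration):

```julia
# Convert between Hz and the Mel scale using a common formula
# (exact constants differ slightly between implementations).
hz_to_mel(f) = 2595 * log10(1 + f / 700)
mel_to_hz(m) = 700 * (10^(m / 2595) - 1)

hz_to_mel(1000)   # ≈ 1000; the scale is anchored so 1000 Hz ≈ 1000 mel
hz_to_mel(8000)   # ≈ 2840; higher frequencies get compressed
```

Notice that equal steps in Hz correspond to ever smaller steps in mels as frequency increases, mirroring the roughly logarithmic character of perception mentioned above.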
A filterbank is effectively a quantification of how much energy is contained in certain bins of frequency ranges. The bin boundaries can be chosen arbitrarily. As an example, you could have a two-filter filterbank consisting of two numbers: the amount of energy between 0 and 5,000 Hz, and the amount of energy between 5,000 and 10,000 Hz. A Mel filterbank is just a filterbank where the bin boundaries are spaced along the Mel scale instead of a linear frequency scale. Lyons (n.d.) has a good discussion on Practical Cryptography of the math behind calculating these filterbank values. Just stop before taking the discrete cosine transform, or you'll end up with MFCCs instead of a Mel filterbank.
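To illustrate how those bin boundaries end up spaced on the Mel scale, here is a rough sketch (not Lyons' exact procedure; the function name, signature, and parameter choices are mine) of building the usual triangular Mel filters over the bins of an FFT:

```julia
# Hz <-> Mel conversions (same common formula as in the sketch above).
hz_to_mel(f) = 2595 * log10(1 + f / 700)
mel_to_hz(m) = 700 * (10^(m / 2595) - 1)

# Build an nfilters × (nfft ÷ 2 + 1) matrix of triangular filters whose
# edges are evenly spaced on the Mel scale between fmin and fmax.
# Assumes nfft is large enough that adjacent edges fall on distinct bins.
function mel_filterbank(nfilters, nfft, samplerate; fmin = 0.0, fmax = samplerate / 2)
    mels = range(hz_to_mel(fmin), hz_to_mel(fmax); length = nfilters + 2)
    bins = floor.(Int, (nfft + 1) .* mel_to_hz.(mels) ./ samplerate)
    fbank = zeros(nfilters, nfft ÷ 2 + 1)
    for m in 1:nfilters
        lo, mid, hi = bins[m], bins[m + 1], bins[m + 2]
        for k in lo:mid                      # rising edge of the triangle
            fbank[m, k + 1] = (k - lo) / (mid - lo)
        end
        for k in mid:hi                      # falling edge of the triangle
            fbank[m, k + 1] = (hi - k) / (hi - mid)
        end
    end
    fbank
end

# Example: 26 filters over a 512-point FFT at 16 kHz. Multiplying this
# matrix by a frame's power spectrum gives that frame's filterbank energies.
# fbank = mel_filterbank(26, 512, 16000)
# energies = fbank * power_spectrum   # power_spectrum computed elsewhere
```

Applying this matrix to the power spectrum of each windowed frame (and, typically, taking the log) yields the filterbank features.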
Mel-frequency cepstral coefficients have been used for a long time in speech recognition research, and they are just one step beyond the Mel filterbanks: take the discrete cosine transform of the log Mel filterbank energies, and you have the MFCCs. I have heard them described as a spectrum of a spectrum, which still doesn't have a clear meaning to me, but perhaps it will for you. I believe a lot of speech researchers share this difficulty of knowing what MFCCs mean beyond the simplistic notion that they represent frequency information at some level.
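As a sketch of that last step, assuming filterbank energies computed along the lines above and FFTW.jl for the discrete cosine transform (the function name and the conventional choice of 13 coefficients are mine):

```julia
using FFTW  # provides dct()

# MFCCs for one frame: the discrete cosine transform of the log
# filterbank energies, keeping only the first few coefficients.
function frame_mfcc(fbank_energies::AbstractVector; ncoeffs = 13)
    c = dct(log.(fbank_energies))
    c[1:min(ncoeffs, length(c))]
end
```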
There is sometimes talk of delta and delta-delta coefficients; these are representations of how the features change over time, and of how the rate of change of the features changes over time. Conceptually, this is similar to first- and second-order derivatives, but they are calculated slightly differently. The formula for them can be found in Lyons' (n.d.) discussion of calculating MFCCs as well.
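For concreteness, here is a small sketch of that delta calculation as Lyons describes it, where each delta frame is a weighted sum of differences over N neighbouring frames on either side (the function name, the clamping at the edges, and N = 2 are my choices for illustration):

```julia
# features is a (coefficients × frames) matrix, e.g. filterbank or MFCC frames.
# Delta frame t is sum_{n=1}^{N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1}^{N} n^2),
# with frame indices clamped at the edges of the utterance.
function delta(features::AbstractMatrix; N = 2)
    nframes = size(features, 2)
    denom = 2 * sum(n^2 for n in 1:N)
    d = zeros(size(features))
    for t in 1:nframes
        d[:, t] = sum(n .* (features[:, clamp(t + n, 1, nframes)] .-
                            features[:, clamp(t - n, 1, nframes)]) for n in 1:N) ./ denom
    end
    return d
end

# Delta-deltas are just the same operation applied to the deltas:
# ddeltas = delta(delta(features))
```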
I have seen some studies that use spectrograms (like those included above) and perform image processing on them, but, as a speech researcher, I am not personally interested in using machine vision per se to perform audition. As such, I don't have much experience with this approach and can't provide much of a summary.
Regardless, for the model we will implement, Zhang et al. (2017) chose to use Mel filterbanks, so that is what we will proceed with. If you are interested in other potential speech features, Meftah, Alotaibi, & Selouani (2016) compare a variety of options for recognizing Arabic phonemes. And, lucky for us, there is already a package in Julia that will calculate these features for us! We'll cover that in the next post.
References
Lyons, J. (n.d.). Mel frequency cepstral coefficient (MFCC) tutorial. Retrieved May 26, 2018, from https://web.archive.org/web/20180527045302/http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/.
Meftah, A., Alotaibi, Y. A., & Selouani, S. A. (2016). A comparative study of different speech features for Arabic phonemes classification. In 2016 European Modelling Symposium (EMS) (pp. 47-52). IEEE.
Palaz, D., Collobert, R., & Doss, M. M. (2013). Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. arXiv preprint arXiv:1304.1018.
Port, R. F., & Leary, A. P. (2005). Against formal phonology. Language, 81(4), 927-964.
Wikipedia contributors. (2018). Mel scale. In Wikipedia, the free encyclopedia. Retrieved May 26, 2018, from https://en.wikipedia.org/w/index.php?title=Mel_scale&oldid=832004295.
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., & Courville, A. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720.