"Extracting speech features in Julia"

Extracting the speech features in Julia

God, finally! The code! Up until now, I have been trying to situate automatic speech recognition in the context of what we know about human speech because I believe this is important to be able to reason about the kind of data we're working with, and also to demonstrate some of the complexity of this problem. But now, we can begin solving the problem.

Much speech recognition research uses the TIMIT data set (Garofolo et al., 1993) as a benchmark for system performance, and that is the data set we will be using ourselves.

The data set is distributed through the Linguistic Data Consortium website. It comes separated into a training set and a test set, with a number of recordings for each speaker in the corpus. Each recording is mono and sampled at 16,000 Hz (meaning that measurements were taken from the microphone 16,000 times per second during recording), and is stored in the SPHERE format with a .WAV file extension.

Each recording is accompanied by transcriptions of what is said in it: an orthographic transcription and a phonological transcription. We're concerned with the phonological transcription, which is stored in a file with the extension ".PHN". It is formatted in three columns: the first contains the sample point at which the label starts, the second contains the sample point at which the label ends, and the third contains the phonetic label.
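To make the three-column layout concrete, here is a minimal sketch that parses a few lines in that layout and converts the sample points to seconds. The lines themselves are made-up illustrative values, and parse_phn_line and sample_rate are names I've introduced just for this example.

```julia
# Illustrative lines in the three-column PHN layout (values are made up)
phn_text = """
0 3050 h#
3050 4559 sh
4559 5723 ix
"""

sample_rate = 16_000  # TIMIT sampling rate in Hz

# Parse one line into (start_sample, end_sample, label)
function parse_phn_line(line::AbstractString)
    start_s, end_s, label = split(line)
    return (parse(Int, start_s), parse(Int, end_s), String(label))
end

segments = [parse_phn_line(l) for l in split(strip(phn_text), '\n')]

for (start_sample, end_sample, label) in segments
    # Dividing a sample index by the sampling rate gives a time in seconds
    println(label, ": ", start_sample / sample_rate, " s to ",
            end_sample / sample_rate, " s")
end
```

Because the boundaries are given in samples rather than seconds, everything that lines frames up with labels later on has to work in samples too.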

The first task is to convert the files from SPHERE format WAV files to RIFF format WAV files, because modern scientific software packages don't read the now-antiquated SPHERE format. There are a number of ways to do this, and I will leave it to you to determine how exactly you wish to accomplish it. There are some solutions on StackOverflow using SoX (StackOverflow contributors, 2017), and the Linguistic Data Consortium provides its own tools (Linguistic Data Consortium, n.d.) for performing this conversion.
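If you go the SoX route, one possible way to drive it from Julia is sketched below. This is an assumption-laden example rather than the method from the StackOverflow thread: it assumes SoX is installed with SPHERE support, that the files are uncompressed, and the file names and the helper sphere_to_riff_cmd are hypothetical. The snippet only builds the command; running it is left as a commented-out step.

```julia
# Build (but do not run) a SoX command that reads a SPHERE file and
# writes a RIFF WAV file. "-t sph" tells SoX the input is NIST SPHERE,
# which is needed here because the .WAV extension would otherwise mislead it.
sphere_to_riff_cmd(src::AbstractString, dst::AbstractString) =
    `sox -t sph $src -t wav $dst`

# Hypothetical input and output file names
cmd = sphere_to_riff_cmd("SA1.WAV", "SA1_riff.wav")
println(cmd)

# To actually perform the conversion (requires SoX on your PATH):
# run(cmd)
```

Writing to a new file name and then moving it over the original is one way to end up with the converted files replacing the old ones.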

All converted now? Good. The code I'm going to be presenting assumes that the converted files replaced the old files. If this is not the case for you, you will need to modify the code accordingly. There are two main tasks now: calculating the features, and determining the labels for each time step in the features. The script that performs all of this is on my GitHub.

Calculating the speech features

The features Zhang et al. (2017) chose are a Mel filterbank with 40 filters plus an energy coefficient, as well as delta and delta-delta coefficients for all of those. Filterbanks are always calculated over some period of time. While we could pass an entire audio file to the function that calculates the filterbank, it would not be helpful to do so because we would lose any time-related information about the audio, which is crucial because audio signals unfold over time.

To avoid this, speech recognition programs tend to calculate their features over small sections of time called windows or frames, and they move the window over the audio until all the different frames have been processed. The paper we're implementing does not, as far as I can tell, specify the parameters of this process, so I assume they're using the standard window length of 25 ms and moving the window 10 ms forward each time the features are calculated. This is what we're going to do.
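To see what those numbers mean in samples, here is a small sketch of the windowing arithmetic at TIMIT's 16,000 Hz sampling rate. The helper frame_ranges is a name I've made up for illustration, and it only counts complete frames; a real feature extraction library may pad the signal and report a slightly different number of frames.

```julia
sample_rate = 16_000     # Hz, as in TIMIT
frame_length = 0.025     # seconds (the assumed 25 ms window)
frame_interval = 0.010   # seconds (the assumed 10 ms step)

frame_length_samples = round(Int, frame_length * sample_rate)      # 400
frame_interval_samples = round(Int, frame_interval * sample_rate)  # 160

# Sample ranges covered by the complete frames that fit in a signal
function frame_ranges(n_samples::Int, len::Int, step::Int)
    n_frames = n_samples < len ? 0 : 1 + div(n_samples - len, step)
    return [(1 + (i - 1) * step):((i - 1) * step + len) for i in 1:n_frames]
end

# One second of audio
ranges = frame_ranges(sample_rate, frame_length_samples, frame_interval_samples)
length(ranges)  # 98 complete frames
first(ranges)   # 1:400
ranges[2]       # 161:560
```

Consecutive frames overlap by 240 samples, so each sample contributes to more than one frame.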

Luckily, there is already a library that will do this for us called MFCC.jl. However, at the time of writing, there is a small bug in the function that we need to calculate the features, so you will need to use the fork I've made of it until my pull request to fix the bug is resolved:


The function we want to use is called audspec, which takes the power spectrum of the audio, computed by calling powspec on the Array of samples read from a WAV file, and produces the filterbank. The energy coefficient is calculated by taking the log of the sum of the power spectrum in each frame. Then, we use the function deltas to calculate the delta and delta-delta coefficients. This process might look something like this:

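samps, sr = wavread(wavFname)
samps = vec(samps)  # flatten the mono channel matrix into a vector
# Power spectrum with one column per frame
frames = powspec(samps, sr; wintime=FRAME_LENGTH, steptime=FRAME_INTERVAL)
# Log energy per frame (Julia 0.6 and earlier: sum(frames', 2))
energies = log.(sum(frames', dims=2))
# 40 Mel filters, transposed so that each row is a frame
fbanks = audspec(frames, sr; nfilts=40, fbtype=:mel)'
fbanks = hcat(fbanks, energies)
# Delta and delta-delta coefficients
fbank_deltas = deltas(fbanks)
fbank_deltadeltas = deltas(fbank_deltas)
features = hcat(fbanks, fbank_deltas, fbank_deltadeltas)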

In this snippet, I assume that you've already imported the correct packages and have a string representing the path to the WAV file to be processed stored in wavFname. I also use constants to represent what the frame length and frame interval are supposed to be. In this case, they are set to 0.025 and 0.010, respectively, to represent the values in seconds.
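If you're curious what a delta coefficient is, the sketch below computes one with a common regression formula over the two frames on either side of each frame. To be clear, this is my own illustrative stand-in, not the code inside MFCC.jl, whose deltas function may use a different window width and scaling. The function name regression_deltas and the random stand-in features are assumptions for the example.

```julia
# Illustrative delta computation (regression over ±N frames).
# `feats` is nframes × ncoeffs, with one row per frame.
function regression_deltas(feats::AbstractMatrix{<:Real}, N::Int=2)
    nframes = size(feats, 1)
    denom = 2 * sum(n^2 for n in 1:N)
    out = zeros(Float64, size(feats))
    for t in 1:nframes, n in 1:N
        # Clamp indices so that edge frames reuse the first or last frame
        ahead = feats[min(t + n, nframes), :]
        behind = feats[max(t - n, 1), :]
        out[t, :] .+= n .* (ahead .- behind) ./ denom
    end
    return out
end

# Stand-in for 41 static features (40 filters + energy) over 100 frames
static = randn(100, 41)
d = regression_deltas(static)
dd = regression_deltas(d)
features = hcat(static, d, dd)
size(features)  # (100, 123)
```

The shape is the useful thing to notice: 41 static coefficients plus their deltas and delta-deltas gives 123 features per frame.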

There are a couple extra steps I take in the script I provide because standard approaches throw away any frames that are associated with the label 'q', denoting a glottal stop, and they also exclude any WAV files that are part of the speaker accent (SA) recordings. But this is the bulk of the work.

(As a note, I have not seen anyone justify why they remove items labeled with 'q' or the speaker accent recordings, but I imagine the 'q' label is too infrequent to be learned easily, and removing the speaker accents makes it easier for the network since the data set becomes a bit more homogeneous. If I'm right, these are artificial inflations to the accuracy rates of the networks, which would make sense because this data set is often used for benchmarking new approaches to speech recognition.)
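The two filtering steps described above could be sketched roughly as follows. This is not the code from my script; the helper names, the assumption that SA files can be recognized by an "SA" prefix in the file name, and the toy paths and data are all just for illustration.

```julia
# True when a file name belongs to the SA recordings
# (assumes their file names start with "SA")
is_sa_recording(path::AbstractString) =
    startswith(uppercase(basename(path)), "SA")

# Drop every frame whose label is "q", keeping features and labels aligned
function drop_q_frames(features::AbstractMatrix, labels::AbstractVector)
    keep = labels .!= "q"
    return features[keep, :], labels[keep]
end

# Hypothetical file paths
wav_paths = ["DR1/FCJF0/SA1.WAV", "DR1/FCJF0/SI648.WAV", "DR1/FCJF0/SX37.WAV"]
kept_paths = filter(!is_sa_recording, wav_paths)

features = [1.0 2.0; 3.0 4.0; 5.0 6.0]  # three toy frames, two features each
labels = ["sh", "q", "ix"]
kept_features, kept_labels = drop_q_frames(features, labels)
```

Filtering the feature rows and the labels with the same mask keeps the two in step, which matters once they are fed to a network together.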

These frames can be laid out to represent the filterbank signals over time, which is what we'll be doing. Now, we have to determine what the labels are for each of the frames.

Determining labels for the frames

Because the TIMIT data set comes phonetically aligned, we can determine what the label is for each individual frame. When the entire frame is within the boundaries defined by the start and end times of a label, the frame is associated with the label whose region it's in.

However, if the frame ends up split between two different labels, we have to make a decision. I have not encountered any explicit explanations of what researchers have done. However, what makes the most sense to me is to say that the associated label is the one whose region has the majority of the frame in it. And in the case where the frame is split 50-50 over a boundary, I choose the current label over the next one to favor longer segments over shorter ones.
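Stated directly, that decision rule might look like the sketch below. It spells out the rule itself rather than the incremental approach I take in the script, and the function names and segment boundaries are made up for the example.

```julia
# Number of samples shared by two inclusive sample ranges
overlap(a::UnitRange{Int}, b::UnitRange{Int}) = length(intersect(a, b))

# Pick the label whose segment covers most of the frame.
# Segments are assumed to be in time order, so a tie goes to the earlier one.
function majority_label(frame::UnitRange{Int}, segments)
    best_label, best_overlap = "", 0
    for (seg, label) in segments
        o = overlap(frame, seg)
        if o > best_overlap  # strict > keeps the earlier segment on a tie
            best_label, best_overlap = label, o
        end
    end
    return best_label
end

segments = [(1:500, "sh"), (501:900, "ix")]  # made-up boundaries
majority_label(1:400, segments)    # "sh": frame lies entirely in the first segment
majority_label(301:700, segments)  # "sh": split 200/200, so the current label wins
majority_label(401:800, segments)  # "ix": 300 of the 400 samples are in the second
```

Checking every segment for every frame is wasteful for long recordings, which is one reason to walk through the labels in step with the frames instead.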

The easiest way I've found to determine the sequence of labels is to simulate moving the window through the signal: iterate through the indices of the collection of frames, convert each index to a sample number, and use that sample number to find the relevant label in the TIMIT PHN files.

First, however, we need to read in the PHN files to determine the labels and their boundaries:

local lines
open(phnFname, "r") do f
    lines = readlines(f)
end

boundaries = Int64[]  # sample number at which each label ends
labels = String[]     # phonetic label of each segment

# first field in the file is the beginning sample number, which isn't
# needed for calculating where the labels are
for line in lines
    _, boundary, label = split(line)
    boundary = parse(Int64, boundary)
    push!(boundaries, boundary)
    push!(labels, label)
end

Then, we can perform the iteration through the items in the collection of frames to build the sequence of labels. Note that FRAME_LENGTH and FRAME_INTERVAL are constants containing the width of the frame in seconds and what the interval is between frames in seconds, respectively. If you're using the default values of 25 ms frames at every 10 ms for the speech features, they will be 0.025 and 0.010, respectively, and sr will be 16,000 samples per second.

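frameLengthSamples = FRAME_LENGTH * sr
frameIntervalSamples = FRAME_INTERVAL * sr
# Half a frame, in samples, so it is comparable to the sample offsets below
halfFrameLength = frameLengthSamples / 2

# Pair each label with the sample number at which it ends
labelInfo = collect(zip(boundaries, labels))
nSegments = length(labelInfo)
labelInfoIdx = 1
boundary, label = labelInfo[labelInfoIdx]

labelSequence = Vector() # Holds the sequence of labels

# (Assumes this runs inside a function, so the loop can update
# labelInfoIdx, boundary, and label)
for i=1:size(fbanks, 1)
    win_end = frameLengthSamples + (i-1)*frameIntervalSamples

    # Move on to next label if current frame of samples is more than half
    # way into next labeled section and there are still more labels to
    # iterate through
    if labelInfoIdx < nSegments && win_end - boundary > halfFrameLength
        labelInfoIdx += 1
        boundary, label = labelInfo[labelInfoIdx]
    end

    push!(labelSequence, label)
end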

After this step, your features and labels should be ready to go for you to save or continue processing however you like.


After reading this post, I hope you come away understanding how speech data can be represented for deep learning networks, as well as how to extract features like Mel filterbanks using the Julia programming language and its ecosystem of packages.

If you're interested, I have a more in-depth implementation of feature extraction in my Google Summer of Code 2018 repository. It hews closer to implementing the paper I've selected in that it performs phoneme folding and removes those labeled as 'q'.


Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., & Dahlgren, N. L. (1993). The DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. Linguistic Data Consortium.

Linguistic Data Consortium. (n.d.). SPHERE conversion tools. Retrieved May 27, 2018 from https://web.archive.org/web/20180528014420/https://www.ldc.upenn.edu/language-resources/tools/sphere-conversion-tools

StackOverflow contributors. (2017). Change huge amount of data from NIST to RIFF wav file. Retrieved May 27, 2018 from https://web.archive.org/web/20180528013655/https://stackoverflow.com/questions/47370167/change-huge-amount-of-data-from-nist-to-riff-wav-file

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., & Courville, A. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720.