"The network architecture"
UPDATE July 25, 2018: I have changed the network into a single chain of functions instead of two chains that had to be tied together with another function. This version should also have the kernels oriented the right way to replicate the paper.
Using convolutional layers for speech data
There are a number of different layer types that you could use for speech data. What I've seen widely used are convolutional and recurrent layers, or a mix of them. Fully-connected layers only ever seem to appear in conjunction with the convolutional and/or recurrent layers.
As with most uses of convolutional layers, the idea is that they will learn features useful for the task the network is being asked to do. For speech data, we might hope that they learn filterbanks if they are used with raw audio data, or that they track something important to speech perception, like formants.
In the context of the speech data that we have (Mel filterbanks), we're going to perform the convolutions in a specific way. Zhang et al. (2017) designed a network architecture that they argue is able to learn temporal dependencies while also reducing dimensionality in the frequency domain. They accomplished this by using only one max-pooling layer, placed right after the first convolutional layer, and that layer pooled only in the frequency domain. Subsequent convolutions did not reduce dimensionality in the frequency domain (for example, through stride length), nor did they reduce the number of time steps present in the signal.
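To make the pooling behavior concrete, here's what that looks like with Flux's maxpool on a dummy array (the 100-frame, 40-filterbank, 3-channel shape is just an assumption for illustration, though it matches the layout the network below expects):

using Flux  # maxpool comes with Flux (re-exported from NNlib)

# Dummy input: 100 time steps × 40 Mel filterbanks × 3 channels × 1 utterance
x = rand(Float32, 100, 40, 3, 1)

# A (1, 3) window pools by a factor of 3 in frequency and not at all in time
y = maxpool(x, (1, 3))

size(y)  # (100, 13, 3, 1): time steps preserved, frequency bins reduced 3×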
They used 10 convolutional layers in total, and it is precisely this depth that the authors argue allowed the network to learn temporal dependencies. They did not provide any visualizations to support this claim, but their network did achieve results comparable to those of networks that used recurrent layers.
After the convolutions, they had 3 fully-connected layers. They did not say exactly how they connected these, but they got predictions at each time step. The only readily apparent way to do this is to flatten the results of the final convolution and then feed each time step in that array into the fully-connected layers individually, as sketched below.
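Here's a rough sketch of that flattening step for a single utterance, assuming the convolution output is laid out as (time, frequency, channels, batch), which is how the Chain below produces it (the helper name is just something I made up for illustration):

# Sketch: collapse a (time, freq, channels, 1) convolution output into a matrix
# whose columns are per-time-step feature vectors; Dense layers in Flux operate
# on each column of a matrix independently.
function flatten_timesteps(x)
    flat = reshape(x, size(x, 1), prod(size(x)[2:end]))  # (time, freq * channels)
    return transpose(flat)                               # (features, time)
end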
If that final part seems complex, it really isn't. At least, if you're using Flux. So, let's get down to it!
Implementing the network architecture in Julia using Flux
It's quite easy to set up the network architecture that the authors describe using Flux. Let's start with setting up the convolutional section, as described in Zhang et al.'s (2017) paper.
net = Chain(Conv((5, 3), 3=>128, relu; pad=(2, 1)),
            # pool only in the frequency dimension; the number of time steps is preserved
            x -> maxpool(x, (1, 3)),
            Dropout(0.3),
            Conv((5, 3), 128=>128, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 128=>128, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 128=>128, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 128=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 256=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 256=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 256=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 256=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            Conv((5, 3), 256=>256, relu; pad=(2, 1)),
            Dropout(0.3),
            # flatten (time, freq, channels, batch) so each column is one time step's features
            x -> transpose(reshape(x, size(x, 1), prod(size(x)[2:end]))),
            Dense(3328, 1024, relu),
            Dropout(0.3),
            Dense(1024, 1024, relu),
            Dropout(0.3),
            Dense(1024, 1024, relu),
            Dropout(0.3),
            Dense(1024, 62),
            # log-probabilities over the 62 output classes at each time step
            logsoftmax) |> gpu
Note that the actual architecture specification for the best-performing network uses maxout instead of relu, but I have not yet implemented the maxout activation function. Once I have implemented maxout, I will update this post accordingly.
Using the Chain function, we can declare these layers in a remarkably clean, almost declarative fashion. Evaluating this particular Chain gives us a function called net that takes in data and feeds it through each of those layers in turn.
Also note the |> gpu at the end. This is actually one of Flux's coolest features, in my opinion. If you've set up CUDA support in your Julia installation, and if you've put using CuArrays at the top of your script, |> gpu will automatically push this whole chain to the GPU, with its parameters stored as CuArrays! Then any operations that are called on it or use it will run with CUDA on the GPU!! It's so easy! (Can you tell by my exclamation marks that I'm excited about this?)
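For concreteness, the GPU-specific pieces of the whole script boil down to something like this (assuming a working CUDA setup; the dummy input shape is just for illustration):

using Flux
using CuArrays  # makes |> gpu move models and arrays onto the GPU

# net is the Chain defined above, already pushed to the GPU by |> gpu.
# The input data has to be moved over as well before calling the network.
x = rand(Float32, 100, 40, 3, 1) |> gpu
out = net(x)    # the convolutions, Dense layers, and logsoftmax all run via CUDA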
Overall, what the network is structured to do is send whatever data is contained in x into the convolutional layers, reshape the result into time steps, and then send the reshaped array into the dense section.
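As a quick sanity check on the shapes involved (again assuming 100 frames of 40 Mel filterbanks with 3 channels for a single utterance):

x = rand(Float32, 100, 40, 3, 1)   # (time, frequency, channels, batch)

# After the first convolution and the (1, 3) max-pool: (100, 13, 128, 1)
# After the last convolution:                          (100, 13, 256, 1)
# After the reshape/transpose:                         (3328, 100), i.e. 13 * 256 features per frame
# After the final Dense layer and logsoftmax:          (62, 100)

out = net(x)
size(out)  # (62, 100): a column of 62 log-probabilities for each time step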
Conclusion
That's it for building the architecture of our network. Pretty simple, right? That's the beauty of using Flux!
If you would like to see the script where I use this network for automatic speech recognition, feel free to check it out on my GitHub. Note that it is still under active development, so it is liable to change. However, the general structure and ideas should be present.
References
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., & Courville, A. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720.