"First status update"
The first evaluation period is upon us, so it's a good time to check in. My time so far in the Google Summer of Code program has been a joy. It's been wonderful to interact with a coding community and to spend a significant amount of time working on a coding project. So far, I am on track with my proposal to implement the neural network speech recognition system from Zhang et al. (2017) in Flux and Julia and contribute it to the model zoo. These first few weeks have been devoted to extracting the data, setting up the network architecture, and beginning the implementation of the connectionist temporal classification loss function. I will explain a bit more about each of these below.
Extracting speech data
As I've written about before, the first task I accomplished was writing the code to extract speech data from the TIMIT speech corpus. This is something I have done a few times before in both Python and Julia, so it was a relatively easy task to accomplish.
I want to emphasize how nice it was to write this in Julia, however. I appreciate that I don't have to import modules the way I do in Python, where, for example, walking through a directory means calling os.walk after importing the os module; the Julia counterpart, walkdir, requires no import and is, I think, easier to remember. The dot-broadcasting functionality also made it easy to write what I consider to be more intelligible code in less space. Compare, for example, filling a Vector with the logged values of another. If the values you want to log are in a variable called a, all you need to do is this:
v = log.(a)
It's so simple! And I believe this is cognitively easier for a human to understand while reading through code. There is simply less information to work through to see what's going on, whereas a for-loop would require reading all the lines involved and then, hopefully, recognizing that together they produce a variable v holding the logged values of a.
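For comparison, here is roughly the loop that the one-liner replaces (a trivial sketch with a made-up input vector):

```julia
a = rand(10)            # some positive values to take the log of

# without broadcasting: allocate the output and fill it element by element
v = similar(a)
for i in eachindex(a)
    v[i] = log(a[i])
end

# with broadcasting, the same thing is a one-liner
v = log.(a)
```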
In short, I've come to enjoy working in Julia even more because of how it removes some overhead that you might have to deal with in other languages, and it also provides practical syntactic sugar that makes it easier to write and read code, when used correctly.
The resulting script can be found in my GitHub repository. I've linked to the one in the master branch, which is supposed to be cleaner, but you can take a look at the daily branch if you would like to see any incremental updates I make as I make them.
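To give a flavor of what the directory traversal looks like, here is a simplified sketch rather than the actual extraction script; the root path and file extension are just placeholders:

```julia
# Collect the paths of the audio files under a (hypothetical) TIMIT directory.
timit_root = "TIMIT/TRAIN"

wav_paths = String[]
for (root, dirs, files) in walkdir(timit_root)
    for f in files
        endswith(lowercase(f), ".wav") && push!(wav_paths, joinpath(root, f))
    end
end
```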
Network architecture
The next task I accomplished was implementing the network architecture, which I discussed in a previous blog post. Flux makes this a pretty straightforward experience. It allows for an almost declarative style of designing the network architecture when that's convenient, but it also makes it easy to drill down and modify parts of the learning process as necessary. This is what I did when creating the model function in my script, where I was able to take the time steps from the convolutional output and feed each of them individually to the fully-connected layers. Implementing this architecture was a very enjoyable process, and it really showed me some of the flexibility of Flux. I come from a Keras background, so I find Flux's approach refreshing.
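To illustrate the pattern, here is a simplified sketch, not the actual network from my repository; the filter sizes, feature dimensions, and label count are placeholders I've chosen for the example:

```julia
using Flux

# Convolve over the time-frequency input (time × freq × channels × batch),
# then apply the same fully-connected stack to every time step.
convs = Chain(
    Conv((3, 5), 3 => 64, relu; pad=(1, 2)),
    Conv((3, 5), 64 => 64, relu; pad=(1, 2)),
)

dense = Chain(
    Dense(41 * 64, 512, relu),   # 41 frequency features assumed for illustration
    Dense(512, 62),              # one output per label (61 phones plus a blank)
)

function model(x)
    c = convs(x)                               # (time, freq, channels, batch)
    T = size(c, 1)
    # feed each time step through the fully-connected layers individually
    ys = [dense(vec(c[t, :, :, 1])) for t in 1:T]
    softmax(hcat(ys...))                       # (labels, time), per-frame probabilities
end

x = rand(Float32, 200, 41, 3, 1)   # 200 frames, 41 features, 3 channels, batch of 1
ŷ = model(x)                       # 62 × 200 matrix that would feed the CTC loss
```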
I will note that I have not yet been using the Dropout layers while testing the network because they have been found to run slowly on GPUs. However, there has recently been a patch that I don't believe I've installed yet, so I will be getting the Dropout layers running in the very near future.
The current implementation of the network is up on my GitHub. The incremental updates can be tracked on the daily branch, if you're interested.
Implementing the connectionist temporal classification loss
This was my biggest bugbear coming into this coding project. My previous looks at the loss function suggested that it would be difficult to implement, and I've heard from other posters around the internet that it can be hard to implement at all, let alone on a GPU for fast computation.
There is a piece of advice that I hewed closely to:
Give me six hours to chop down a tree and I will spend the first four sharpening the axe. (Anonymous)
The quote is often attributed to Abraham Lincoln, but there is no evidence that he ever said it. In any case, I spent quite a while sharpening my axe, so to speak, by poring over the specification of the connectionist temporal classification loss in Graves (2012) and writing out a rough code sketch of what I wanted the functions to look like. It was very easy to move from the sketches into the actual code. Or so it seemed...
I was plagued by two problems, both of which I brought on myself and neither of which had anything to do with Julia or Flux themselves. The first problem was interfacing with Flux's tracker to take advantage of automatic differentiation. I had originally written an iterative implementation of the connectionist temporal classification algorithm that filled in matrices instead of performing recursion, but I realized that I had inadvertently divorced the values in those matrices from Flux's tracker, so the gradients could no longer be computed automatically. This was my own doing, from not understanding how the tracker worked at first and not paying enough attention to the error messages. To resolve the issue, I rewrote the code to be recursive instead of iterative and memoized the relevant functions using the Memoize.jl package, paying careful attention not to strip the values I was working with away from Flux's tracker. It's possible that the matrix implementation would have worked if I had been more careful, but I haven't attempted it again.
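To show the shape of the recursion I mean, here is a much-simplified sketch of the memoized forward variable α(t, u) from Graves (2012). This is not the code in my repository: it ignores Flux's tracker entirely, and the function names and blank index are made up for the example.

```julia
using Memoize

const blank = 62   # hypothetical index of the CTC blank label

# numerically stable log(exp(a) + exp(b)), guarding against -Inf inputs
function logadd(a, b)
    isinf(a) && a < 0 && return b
    isinf(b) && b < 0 && return a
    return max(a, b) + log1p(exp(-abs(a - b)))
end

# Forward variable in log space. `lp` is a (labels × time) matrix of log label
# probabilities and `z` is the label sequence with blanks interleaved.
@memoize function logα(lp, z, t, u)
    (t < 1 || u < 1) && return -Inf
    if t == 1
        # only the initial blank or the first label can start a path
        return u > 2 ? -Inf : lp[z[u], 1]
    end
    s = logadd(logα(lp, z, t - 1, u), logα(lp, z, t - 1, u - 1))
    # skipping over a blank is only allowed between two distinct labels
    if u > 2 && z[u] != blank && z[u] != z[u - 2]
        s = logadd(s, logα(lp, z, t - 1, u - 2))
    end
    return s + lp[z[u], t]
end

# Paths may end in the final label or the trailing blank; the loss is the
# negative log-likelihood of their sum.
ctc_loss(lp, z) = -logadd(logα(lp, z, size(lp, 2), length(z)),
                          logα(lp, z, size(lp, 2), length(z) - 1))
```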
The second issue I encountered was numerical stability. The values this loss function works with become incredibly small because it involves repeated multiplication of probabilities, so Graves's (2012) implementation notes suggest working as much as possible in log space. This is all well and good, but some of the computations still involve linear-space operations. Because of the softmax activation function at the end of the network, the values going into the log-space operations were sometimes zero, or effectively zero, which broke those operations. I believe I've resolved this by adding a small epsilon value of 1e-7 to the output of the softmax function. This is a recent attempt at a solution, so the jury's still out on whether it will allow the network to train, but I have been able to get further with training examples than I have before.
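In code, the fix amounts to something like the following sketch; the function and variable names here are just for illustration:

```julia
using Flux: softmax

# Nudge the softmax output away from exact zero before moving into log space,
# using the 1e-7 epsilon mentioned above.
stable_log_probs(x; ϵ = 1f-7) = log.(softmax(x) .+ ϵ)

scores = randn(Float32, 62)      # unnormalized network outputs for one time step
lp = stable_log_probs(scores)    # finite log probabilities, safe for the log-space math
```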
You can follow my implementation of the connectionist temporal classification on my GitHub. It's still under very active development, so the daily branch should be more up to date with the current state of the function than the master is.
Where to go from here
At present, my main goal is to finish the implementation of the connectionist temporal classification function, which my proposal projected would be done by the end of this week. Along with that, I am working to make sure the evaluation routines are working correctly so that I can accurately assess how the network is doing.
It may be worth considering calling out to an existing implementation of connectionist temporal classification at some point, such as Baidu's warp-ctc. While I believe there is a pedagogical reason to have a Julia implementation of the function for a model that will appear in Flux's model zoo, if I can't get its performance to match that of an off-the-shelf, well-tested implementation, I think there would be practical benefits from demonstrating how to connect to such a library as well.
References
Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Berlin: Springer. Preprint available at https://www.cs.toronto.edu/~graves/preprint.pdf.
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., & Courville, A. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. arXiv preprint arXiv:1701.02720.