"Coding period two update: GPU coding and gradients"
I'd like to begin this post with a word of advice: never assume that the problem you're working on is too obscure to find useful information about with a quick Google search. I'm sure you're already aware of this, but take a moment to reinforce it for yourself. It will become relevant shortly.
This coding period has been devoted almost entirely to getting the neural net to train with the connectionist temporal classification (CTC) loss. This process principally involved two tasks: getting a GPU implementation of the loss function to work properly, and getting the gradients from the loss to backpropagate correctly. Below, I detail the work on both of these tasks.
Implementing CTC on the GPU
The first task I confronted was the GPU implementation of the loss function, which I have blogged about previously. I used Baidu's warp-ctc implementation as a starting point for the Julia implementation of the algorithm, but I diverged from it in a couple of ways. First, warp-ctc uses some memory tricks on the GPU, moving data between global and shared memory, which I have not coded into the Julia implementation; I wanted to make sure the algorithm was working properly before attempting what appeared to be purely a speedup. Second, the Julia implementation differs in how it calculates the sum values used in calculating the gradients. Warp-ctc uses a key-value sort to arrange the data and make it possible to map from string values to integers representing classes. In the Julia implementation, I already had integers representing the classes, so I could use them to index the matrix holding the sum values directly, instead of having to map from strings to integers.
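To make that indexing difference concrete, here is a minimal CPU-side sketch of the idea; the names and array shapes are illustrative, not the actual GPU kernel code. Because the labels are already integers, each α·β product can be added straight into a per-class accumulator.

```julia
# Illustrative sketch only: accumulate the per-class sums used in the CTC gradient
# by indexing with the integer labels directly (no string-to-integer sort needed).
function accumulate_label_sums(alphas::AbstractMatrix, betas::AbstractMatrix,
                               labels::AbstractVector{<:Integer}, nclasses::Integer)
    S, T = size(alphas)   # S = positions in the extended label sequence, T = time steps
    sums = zeros(eltype(alphas), nclasses, T)
    for t in 1:T, s in 1:S
        # labels[s] is the integer class at position s, so it can index `sums` directly
        sums[labels[s], t] += alphas[s, t] * betas[s, t]
    end
    return sums
end
```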
Overall, the coding process here has been a consistent daily effort, because it often felt like fighting a hydra: every time I fixed one issue (such as an indexing error in one of the kernels), a new one would crop up. At the time of this writing, however, I believe the implementation is operating correctly, as it calculates values that match a small problem I have worked by hand.
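For a flavor of what that check looks like, here is a hypothetical sanity test: the function names (`ctc_gpu`, `ctc_cpu`) and the toy inputs are placeholders rather than my real test code, but the idea is to compare the GPU kernels against a small example that can also be verified by hand.

```julia
using Test

# Hypothetical check: the GPU loss should match a plain CPU reference
# on a toy example small enough to also work through by hand.
@testset "GPU CTC matches CPU reference" begin
    ŷ = log.([0.6 0.3; 0.1 0.5; 0.3 0.2])   # toy log-probabilities: 3 classes × 2 time steps
    labels = [2]                             # toy integer label sequence
    @test isapprox(ctc_gpu(ŷ, labels), ctc_cpu(ŷ, labels); atol=1e-5)
end
```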
Correct backpropagation
The second task was to ensure that the gradients were backpropagating correctly through the network. This does not mean that I was checking Flux's backpropagation implementation; rather, I was making sure that the data I was using were formatted correctly so that backpropagation would work properly. One error I found was that at one point I had separated a matrix of data from its record of the functions that had been applied to it. This was a sneaky error because I hadn't expected it, and it caused the backpropagation to reach only a certain layer, because one of the layers had no record of what came before it. The problem vexed me for a while, but I eventually found the solution, which had to do with an extra call to `cpu` or `gpu` (I don't remember which, unfortunately). It's worth mentioning that part of the reason this is harder is that the gradients are calculated in the GPU kernel, so I can't just call Flux's wonderful `back!` function on the loss value, since the loss value isn't tracked and can't be automatically differentiated.
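To illustrate what feeding the kernel's gradients back into Flux looks like, here is a minimal sketch assuming the Tracker-era Flux API, where `back!` can be seeded with a gradient; `ctc_gpu`, `model`, `x`, and `labels` are placeholders rather than my actual training code.

```julia
using Flux

# Tracked forward pass; the CTC loss and its gradient with respect to the
# network output come from the hand-rolled GPU kernels, not from Flux.
ŷ = model(x)
loss, grads = ctc_gpu(Flux.data(ŷ), labels)   # plain arrays in, plain loss and ∂loss/∂ŷ out

# Seed backpropagation at ŷ with the externally computed gradients so that
# the rest of the network's parameters receive their gradients as usual.
Flux.Tracker.back!(ŷ, grads)
```

After that call, the parameter gradients can be applied with whatever optimizer update the training loop uses.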
But even after I got the gradients flowing correctly, I was still greatly concerned about the network, because within only one or two training utterances it came to consistently output only one class label. This felt incredibly wrong to me, but it also made sense: that class, the blank label, shows up just over 50% of the time, and virtually every frame could plausibly be assigned the blank label.
This is where the lesson about Googling that I started with comes in. I was sure there wouldn't be any meaningful results if I Googled this problem, so I refrained from doing so for a while. Today, though, I finally swallowed my pride and searched for information on it. And what do you know? There was [an extensive post](https://web.archive.org/web/20180709041144/http://www.tbluche.com/ctc_and_blank.html) about this very issue with blank labels. It turns out that it is documented behavior for networks trained with connectionist temporal classification loss to quickly begin outputting only the blank label, and only after more training learn to occasionally output other class labels. I've already said it once, but don't assume that a quick Google search won't pull up useful results; it only takes a few seconds.
What's done at this point
At this point, it feels prudent to summarize what has been accomplished so far.
- The data extraction routine works, and the data appear to be extracted properly.
- The tricky CTC loss has been implemented on a GPU now, which is necessary since running it on the CPU is painfully slow.
- The gradient backpropagation routine should be working, though it's hard to rule out hidden bugs at this point because I haven't yet gotten the network to run through all the data.
What remains to be done
In this last stretch, there are two tasks I see that remain to be done for the network itself.
- Get the network to run through all the data. At this point, I'm running into memory issues on the GPU, which I'm hopeful can be resolved by loading fewer files into GPU memory at a time.
- Implement the maxout activation function, which the final architecture of the network uses instead of ReLU (see the sketch after this list).
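For reference, maxout replaces a fixed nonlinearity with the elementwise maximum over k separate linear transformations of the same input. Below is a minimal sketch of how such a layer might look in Flux; the `MaxoutSketch` name is mine, and depending on the Flux version the struct would also need to be registered (e.g. with `@treelike` or `@functor`) so its parameters get collected for training.

```julia
using Flux

# Maxout: k independent linear pieces over the same input, combined by an
# elementwise maximum instead of a fixed activation such as ReLU.
struct MaxoutSketch{T}
    pieces::Vector{T}
end

# Build k Dense layers with identity activation (the default for Dense).
MaxoutSketch(in::Integer, out::Integer, k::Integer) =
    MaxoutSketch([Dense(in, out) for _ in 1:k])

# Apply every piece and take the elementwise max across them.
(m::MaxoutSketch)(x) = reduce((a, b) -> max.(a, b), (p(x) for p in m.pieces))
```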
In trying to be a good code custodian, there are a few tasks remaining in terms of documentation, as well.
- Write a post describing the connectionist temporal classification loss at a high level.
- Write a post describing the maxout activation function.
- Clean up and document the code I've written.
- Create a full demonstration of the speech recognition system that maps from acoustics to phoneme labels.
Onward
The middle of any journey, in the monomyth fashion, is the hardest part. It can feel like a place of despair and hopelessness, and I will be the first to attest to experiencing at least a modicum of those feelings during this coding period. But I can see the way forward now, and I'm feeling much more confident about the project.
You can check out the daily versions of the training routine and CTC implementation on my GitHub. They are still undergoing debugging and such, so I don't yet want to push them to my master branch. As far as I can tell, however, the kernels for the CTC code are working properly. The relevant portion of the `02-speech-cnn.jl` file is the `ctctrain!` function, and the kernels are the interesting bits in the `warp-ctc.jl` file.