Second status update
These last two weeks have been spent working on the connectionist temporal classification loss function. Last post, I mentioned that this was my bugbear for this project because it is difficult to implement correctly. Coding it up has been an odyssey, to say the least. It's not hard in the sense that it's undoable, but rather that there are many moving parts, and a working implementation doesn't guarantee one that parallelizes correctly or runs efficiently. That's what I want to spend this post on.
I've spent a good amount of time during these past two weeks getting a GPU version of the connectionist temporal classification loss working. The implementation is based on Baidu's warp-ctc, though I have not written code for all of their tricks for moving data between register and shared memory (and may not have the time for it before the end of the Summer of Code). It's been an ongoing process because this algorithm is not embarrassingly parallel, so getting it running on a GPU has been by turns fascinating and frustrating. Fascinating, because I hadn't actually written GPU kernels before now, and learning how they work is quite interesting. Frustrating, because it's a new way of thinking about programming for me, which means lots of banging my head against a wall and brute-force attempts to get something to work. But at the end of the day, it's impressive (to me, at least) to think about splitting a nontrivial problem into many parallel processes for a GPU to work on.
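To make the parallelization challenge concrete, here is the standard forward recursion for connectionist temporal classification (the one warp-ctc implements), where l' is the label sequence with blanks b inserted between symbols, y^t_k is the network's output probability for symbol k at time t, and alpha_t(s) is the forward variable:

$$
\alpha_t(s) =
\begin{cases}
\left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\right) y^t_{l'_s} & \text{if } l'_s = b \text{ or } l'_s = l'_{s-2},\\[4pt]
\left(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\right) y^t_{l'_s} & \text{otherwise.}
\end{cases}
$$

Every time step depends on the previous one, so the loop over t is inherently sequential; what can be done in parallel is the loop over s (and over utterances in a batch). That mismatch between the algorithm's structure and the GPU's strengths is exactly where the head-banging comes from.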
Nonetheless, Julia thankfully makes GPU programming easier than it might otherwise be. The CUDAnative.jl package makes it as pleasant as possible to write GPU kernels using Julia code. (I mean, imagine having to write a kernel in C++! Yuck!*).
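For readers who haven't seen CUDAnative.jl before, here is roughly what writing a kernel in it feels like. This is an illustrative vector-add sketch, not my actual CTC kernel, and the package versions and launch syntax are those current as I write this, so treat it as a sketch rather than gospel:

```julia
using CUDAnative, CuArrays  # CuArrays provides GPU arrays; CUDAnative compiles the kernel

# A kernel is just a Julia function that returns nothing.
# Each thread computes its global index and handles one element.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)          # guard against out-of-bounds threads
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CuArray(rand(Float32, 1024))
b = CuArray(rand(Float32, 1024))
c = similar(a)

# Launch 4 blocks of 256 threads to cover all 1024 elements.
@cuda threads=256 blocks=4 vadd!(c, a, b)
```

The pleasant part is that the kernel body is ordinary Julia; the unpleasant part, as I'm learning, is that you still have to think in terms of threads, blocks, and memory layout yourself.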
I'm nearing the end of testing the GPU kernels for correctness, so I should be able to start training the network in earnest very soon. I also anticipate writing a post about connectionist temporal classification so that newcomers can benefit from the knowledge I've gained while implementing this algorithm on both the CPU and the GPU.
If you would like to follow my development of this algorithm, you can look at the warp-ctc.jl file on my GitHub. I try to push daily to the daily branch, though it's often messy until I'm ready to merge it into the master branch. You've been warned.
* - To all the C++ programmers out there, you are wonderful human beings and I am impressed with your ability to understand that process. It's not for me, though. (Please don't @ me.)