Here we are at the penultimate blog post. Things finally look like they're wrapping up! I noticed a key discrepancy between the connectionist temporal classification (CTC) algorithm as described by Graves (2012) and the way it is implemented in Baidu's warp-ctc library, the code I was working from. I want to preface my further remarks by saying that I have not investigated whether the discrepancy makes sense in context in the warp-ctc library; it's possible that this discrepancy, taken together with other modifications they made, returns correct results. What I can say for now is that it did not give me correct results in the basic port I made.
The beta coefficients
Part of the CTC algorithm involves calculating beta coefficients, which are essentially the probabilities of getting to a specific label at a specific time, but working from the end backward instead of from the beginning forward. Graves (2012) gives the definition as follows:
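Written out in LaTeX, and as I understand the book's formulation, the recursion is the one below. Here $y^{t}_{k}$ is the network output for label $k$ at time $t$, $l'$ is the target sequence with blanks inserted around every label, and $g(u)$ is the furthest position reachable from position $u$ in one time step; the wording of these descriptions is my own paraphrase.

$$
\beta(t, u) = \sum_{i=u}^{g(u)} \beta(t+1, i)\, y^{t+1}_{l'_i},
\qquad
g(u) =
\begin{cases}
u + 1 & \text{if } l'_u = \text{blank or } l'_{u+2} = l'_u \\
u + 2 & \text{otherwise}
\end{cases}
$$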
However, the code in the warp-ctc library, as I read it, evaluates as follows:
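In LaTeX, my reading of the code amounts to the recursion below. This is a transcription of how I read the code, not a formula from the library's documentation. I am assuming the shared factor is the network output $y^{t}_{l'_u}$ for the current label at the current time step, with $l'$ the blank-augmented target sequence and $g(u)$ the furthest position reachable from $u$ in one step.

$$
\beta(t, u) = y^{t}_{l'_u} \sum_{i=u}^{g(u)} \beta(t+1, i)
$$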
It should be apparent that these two formulations are not equivalent: in the first, each term in the summation is multiplied by its own factor, whereas in the second, every term is multiplied by the same factor. Once I changed my code to match the first formulation instead of the second, the loss values dropped significantly and the network finally began to output predictions. The predictions are not very usable at the moment, but this is still substantially better than seeing all blanks being output.
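To make the difference concrete, here is a small NumPy sketch of the two recursions on a toy problem. It is a simplified illustration, not the warp-ctc code itself. The names `betas`, `reachable`, `extend_with_blanks` and the `per_term` flag are made up for this example, the blank is assumed to be label 0, and both variants share the same initialisation at the last time step so that only the recursion differs.

```python
import numpy as np

BLANK = 0  # assumed index of the blank label


def extend_with_blanks(labels):
    """Insert a blank before, between and after the labels."""
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]
    return ext


def reachable(ext, u):
    """Positions that may follow position u at the next time step."""
    nxt = [u]
    if u + 1 < len(ext):
        nxt.append(u + 1)
    # Skipping a blank is only allowed between two different non-blank labels.
    if u + 2 < len(ext) and ext[u] != BLANK and ext[u + 2] != ext[u]:
        nxt.append(u + 2)
    return nxt


def betas(probs, labels, per_term=True):
    """Backward variables for `probs` of shape (time steps, labels).

    per_term=True  : each summand is scaled by its own output at time t + 1.
    per_term=False : the whole sum is scaled by one shared output at time t.
    """
    ext = extend_with_blanks(labels)
    T, U = probs.shape[0], len(ext)
    beta = np.zeros((T, U))
    # A valid path must end in the final blank or the final label.
    beta[T - 1, U - 1] = 1.0
    beta[T - 1, U - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for u in range(U):
            nxt = reachable(ext, u)
            if per_term:
                beta[t, u] = sum(beta[t + 1, i] * probs[t + 1, ext[i]] for i in nxt)
            else:
                beta[t, u] = probs[t, ext[u]] * sum(beta[t + 1, i] for i in nxt)
    return beta


# Toy example: 4 time steps, 3 labels (blank plus two phones), random outputs.
rng = np.random.default_rng(0)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)
labels = [1, 2]

first = betas(probs, labels, per_term=True)
second = betas(probs, labels, per_term=False)
print(np.abs(first - second).max())  # largest gap between the two tables
```

With the per-term version, the total probability of the labelling can be read off the first time step as `first[0, 0] * probs[0, BLANK] + first[0, 1] * probs[0, labels[0]]`, which is a handy sanity check against a brute-force sum over all paths on problems this small.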
I only noticed this difference between the two formulations when I was trying to get a CPU implementation of this loss function to match my GPU version but couldn't get them to match completely. When I investigated the differences, I saw that the beta coefficients didn't match, and that led me to check Graves' (2012) specification again.
Here is an example of what the network is outputting now.
Target sequence of phones:
h# dh ix r iy z ax n z f axr dh ih s dcl d ay v s iy m dcl d f uw l ix sh epi n aw h#
Predicted sequence of phones:
h# pau w iy bcl r iy ux z bcl b iy bcl b uw z ay n pcl p z iy n dcl d v w iy er h#
The phone error rate of the predicted sequence against the target sequence is approximately 84% (the edit distance divided by the length of the longer sequence). On inspection, the network is making some errors that are expected, like confusing [ix] with [iy], which are acoustically similar (they are both high vowels).
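For reference, the error rate described above can be sketched in a few lines of Python. This is a minimal illustration assuming a plain Levenshtein distance with unit costs for insertions, deletions and substitutions; the function names `edit_distance` and `phone_error_rate` are made up for this example, and the short sequences at the bottom are a toy case, not the utterance above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone labels."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference phones
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(target, predicted):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(target), len(predicted))
    if longest == 0:
        return 0.0
    return edit_distance(target, predicted) / longest


# Toy example: one substitution (ix -> iy) and one deletion (r).
target = "h# dh ix r h#".split()
predicted = "h# dh iy h#".split()
print(phone_error_rate(target, predicted))  # 2 edits / 5 phones = 0.4
```

Dividing by the longer of the two sequences keeps the rate between 0 and 1, which differs from the more common convention of dividing by the target length.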
Where to now?
Now that the CTC loss function seems to be working correctly, what's left to do is simply train the network.
Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Springer, Berlin, Heidelberg.