Update: Bug-fixing and interfacing

We're coming into the final stretch of coding for Google Summer of Code 2018 here. The last post I made was not necessarily the most hopeful, but I hope this post will reinvigorate things a bit. I have accomplished quite a lot in these last two weeks, so let's talk about that. In this post, I will discuss getting the loss function to interface with Flux's tracking interface for backpropagation, checking the output of my connectionist temporal classification (CTC) loss against another package's for correctness, and resolving a numerical instability in the function that computes \(\text{ln}(a+b)\), which is used throughout the calculation of the CTC loss and gradients.

Connecting CTC loss to Flux's tracking interface

I think this was the biggest example of me getting in my own way during the coding process. I had convinced myself that it wasn't going to be feasible within the duration of the project to tie the loss function into Flux's tracker, so I instead backpropagated the gradients to the network outputs directly, rather than writing the code that would start the backpropagation with the loss value. As it turned out, it was remarkably easy to tie into Flux's tracking system. The documentation is actually quite clear on how this should work; I simply didn't read it right the first few times through.

Since the GPU kernels for the CTC loss already calculate the gradients with respect to the network outputs, we just have to return them through the appropriate interface. As it turns out, it is only a few lines of code:

using Flux.Tracker: @grad, TrackedArray

ctc(ŷ::TrackedArray, y::AbstractArray) = Flux.Tracker.track(ctc, ŷ, y)

@grad function ctc(ŷ, y)
    ls, gs = ctc(Flux.Tracker.data(ŷ), y)
    return mean(ls), Δ -> (Δ .* gpu(gs), Δ)
end

The first function catches the general use case where ŷ is the output from a Flux network, i.e., a TrackedArray. It strips the tracking from the array before passing it to the second function, which runs the ctc function and returns the loss value and gradients as appropriate. The first gradient returned is the change in the loss value with respect to the values in ŷ, which is the important one. The values of \(y\) aren't used in gradient backpropagation, so the gradient returned for it does nothing.

But isn't that nice and simple? It was just a few lines of code to set up the CTC loss to interface with the tracker. After making this change, the code runs significantly faster, so I'm quite pleased with getting this to work.
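As a usage sketch, here is how the loss now slots into training. The model, data shapes, and layer sizes below are hypothetical stand-ins (the real network and TIMIT pipeline aren't shown in this post); only ctc and the tracker behavior come from the code above:

```julia
using Flux

# Hypothetical model and data; layer sizes and shapes are illustrative only.
m = Chain(Dense(26, 100, relu), Dense(100, 62)) |> gpu
x = rand(Float32, 26, 50) |> gpu   # 26 features × 50 time steps
y = rand(Float32, 62, 50) |> gpu   # label matrix in whatever format ctc expects

l = ctc(m(x), y)   # m(x) is a TrackedArray, so the first method above fires
Flux.back!(l)      # backpropagation now flows through Flux's tracker
```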

Checking the output of the CTC loss

One of the things that was concerning was that the network wasn't really learning to predict anything other than the blank label, as I mentioned in the last post. One way of debugging whether this was the fault of the CTC loss function itself was to test it against another implementation. It was helpfully suggested to me to test against the Chainer package's implementation, since it is written in Python and NumPy, making it easier to get up and running than something that calls out to an external library.

I set up a script that would calculate the loss for an utterance using my implementation as well as the Chainer implementation, and then ran it over a sample of 400 utterances from the TIMIT training set. The mean absolute error between the two implementations is 0.93, and the standard deviation of the absolute errors is 0.13, relative to loss values on the order of 800 to 1000. This difference is insignificant in practice, and could potentially be attributed to floating-point imprecision and different implementations of mathematical functions like log. So, the CTC loss I've implemented seems to be operating correctly. Yay!
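The comparison can be sketched roughly as follows. The names my_ctc, chainer_ctc, and utterances are hypothetical stand-ins for my implementation, a wrapper around Chainer's loss, and the data iterator; none of them appear verbatim in the actual script:

```julia
# Compare the two CTC implementations over a sample of utterances.
errors = Float64[]
for (x, y) in utterances
    push!(errors, abs(my_ctc(x, y) - chainer_ctc(x, y)))
end

using Statistics
println("MAE: ", mean(errors), "; std: ", std(errors))
```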

Numerical stability of calculating ln(a+b)

The values used in the calculation of the CTC loss are extremely small, since they are products of many probabilities, which means they tend toward zero. In theoretical math, this is fine, but when working numerically it poses a problem: we only have so many bits of precision, and if a number gets too small, we get an underflow error. The standard remedy is to work in log space. This is straightforward when the algorithm requires a multiplication, because \(\text{ln}(a \cdot b) = \text{ln}(a) + \text{ln}(b)\). However, it is a bit more difficult to translate addition from linear space into log space.
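To make the underflow concrete, here is a small demonstration (the specific probability 0.01 and the count of 300 are just illustrative values, not numbers from the actual model): multiplying many small probabilities in linear space underflows a 32-bit float to zero, while summing their logs stays finite:

```julia
p = Float32(0.01)      # a small per-step probability
probs = fill(p, 300)   # 300 such probabilities in a row

prod(probs)        # 0.0f0 -- the running product underflows
sum(log.(probs))   # about -1381.6f0 -- the same quantity, finite in log space
```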

Graves (2012) provides the equation:

$$\text{ln}(a+b) = \text{ln}\,a + \text{ln}(1+e^{\text{ln}\,b - \text{ln}\,a})\,\,.$$

Because we are working in log space from the start, \(a\) and \(b\) are already in log space, which means the equation becomes:

$$\text{ln}(e^a + e^b) = a + \text{ln}(1 + e^{b - a})\,\,.$$

This works fine when \(a\) and \(b\) are close to each other. However, when they are far apart, the term \(e^{b-a}\) can overflow when \(b\) is much larger than \(a\). As an example, imagine that \(a = -275\) and \(b = -180\), which are values similar to what I have observed. If we put these into the second equation, we get

\begin{align} \text{ln}(e^{-275} + e^{-180}) &= -275 + \text{ln}(1 + e^{-180 - (-275)}) \\ &= -275 + \text{ln}(1 + e^{95}) \\ &= -275 + \text{ln}(1 + 1.8112390828890233 \times 10^{41})\,\,. \end{align}

I'm going to stop the derivation here because the issue is apparent. The number \(1.8112390828890233 \times 10^{41}\) cannot be represented by a 32-bit floating-point number, so Julia gives Inf32 as its value. (Note that we usually work with 32-bit numbers on GPUs.) This problem can be solved rather easily, however, by swapping the values of \(a\) and \(b\) so that we get \(a = -180\) and \(b = -275\). Plugging these into the second equation again, we get

\begin{align} \text{ln}(e^{-180} + e^{-275}) &= -180 + \text{ln}(1 + e^{-275 - (-180)}) \\ &= -180 + \text{ln}(1 + e^{-95})\,\,. \end{align}

I'll stop the derivation here again because it's apparent that this will evaluate to a finite number. Since \(\lim_{x\to-\infty} e^x = 0\), the term \(e^{-95}\) is so small that adding it to 1 yields exactly 1 in 32-bit floating point, and \(\text{ln}(1) = 0\), which can safely be added to \(-180\). There is some loss of precision in this calculation, but that has more to do with using numerical methods at all than with the formula itself.
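This whole story is easy to reproduce in Julia with 32-bit values: the naive order overflows to Inf32, while the swapped order gives the finite answer:

```julia
a, b = -275.0f0, -180.0f0

# Naive order: exp(b - a) = exp(95f0) overflows Float32, so the result is Inf32
a + log(1 + exp(b - a))   # Inf32

# Swapped so the larger value comes first: exp(-95f0) is negligibly small
b + log(1 + exp(a - b))   # -180.0f0
```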

Making this variable swap in the code stopped the CTC loss calculation from returning Inf for intermediate loss values when the numbers being summed are themselves finite. The code, with checks for infinite inputs, looks like this:

function logadd(a, b)

    # ln(0 + e^b) = b, so if either argument is -Inf, return the other
    isinf(a) && return b
    isinf(b) && return a

    # ensure a is the larger value so that exp(b-a) cannot overflow
    if a < b
        a, b = b, a
    end

    return a + log(1+exp(b-a))
end
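A quick sanity check, using the same numbers as in the derivation above:

```julia
logadd(-275.0f0, -180.0f0)   # -180.0f0 -- finite in either argument order
logadd(-180.0f0, -275.0f0)   # -180.0f0
logadd(-Inf32, -42.0f0)      # -42.0f0, since ln(0 + e^{-42}) = -42
```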


So what was the point of these tasks? Well, the network is still not making useful predictions during training, so each of these tasks eliminates a potential cause of the mistakes or bugs keeping the network from learning. Tying the loss into Flux's tracker ensures I'm backpropagating the loss in a more expected and well-documented way. Checking the CTC loss values for correctness ensures that the GPU kernels are working correctly (at least for calculating the loss value itself, as opposed to the gradients). And adding numerical stability to the log addition function keeps Inf from showing up in the code where it's not supposed to. What's more, resolving these issues will make it easier to incorporate the CTC loss into Flux, or into a package designed for good interoperability with Flux, in the future.

At present, the network has pushed beyond predicting only blanks (though still not necessarily anything useful), so I've set it up to train for longer than I have in the past, which I have seen suggested for others using the CTC loss who experienced this problem. It's looking promising.


Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Springer, Berlin, Heidelberg.