If you search for Loefflers algorithm you get a few academic papers, and for Futhark you get the Germanic runes. This post is a SEO masterpiece.

Discrete Cosine Transform is a variation on the Discrete Fourier Transform. It is used in basically every lossy compression format ever. The reason DCT is preferred is that discrete transforms are cyclic. So the DFT has a jump at the edges of the data, where it wraps around. (this is why windowing is frequently used in DFT) This jump at the edges leads to a fat tail in the frequency spectrum, which does not compress well.

The DCT constructs an “even” signal (mirrored around the 0 axis), so the signal is continuous at the edges. This leads to much lower high frequency coefficients. Lossy compression basically works by quantizing/masking/thresholding those coefficients, which produces many zeros at high frequencies. Long runs of zeros compress really well, so that’s what happens in most compression algorithms.

I was playing a bit with compression, but found that `scipy.fftpack.dct`

was not fast enough to my liking. Since I had recently discovered Futhark, which is an amazing ML-like functional programming language for GPU programming that compiles to OpenCL, I thought it’d be fun to implement the DCT in Futhark. Little did I know what I was getting myself into.

After some searching, I found that Loefflers algorithm is the way to go. It’s what *everyone* seems to be using, because for an 8-point DCT it obtains the theoretical lower bound of 11 multiplications. After chasing some references in more recent papers, I found the original: Practical fast 1-D DCT algorithms with 11 multiplications, and after days of struggling, I almost understood it.

I knew that a Fast Fourier Transform is based on taking the DFT equation, and splitting it up in odd and even parts. If you keep doing this recursively (called decimation in time/decimation in frequency), you end up with this “butterfly” structure, which are additions of two “branches” scaled by some factor. For the DCT there are also butterflies, but also rotation blocks.

It took a few mental leaps to understand that you can write the DCT or DFT in matrix form, express elementary row operations in matrix form, use those to factorize the DCT matrix, and derive an optimal implementation from this matrix factorization.

The Futhark side of things was really fun. If you know a bit of functional programming, it’s really not hard, and you don’t need to know anything about OpenCL or GPU’s. I hopped on Gitter, and Troels Henriksen was super helpful. I’d come up with a problem, and a few hours later I’d `git pull`

and the compiler got better.

There were a few surprises though. Many array operations are basically free, by returning a view of the same array. But there is an underlying assumption that arrays are big, heap allocated, and parallelized relentlessly. Tuples, on the other hand, are assumed to be small, and register allocated. By rewriting my inner DCT structure from (tiny) arrays to using tuples, performance more than doubled.

At first I tied to optimize my code to use in-place updates, but this was actually significantly *slower* than out-of-place. By doing in-place updates, I force the compiler to do the operations completely sequentially, while normally it could do a lot in parallel. It turns out that moving data around is by far the slowest thing, and arithmetic is basically free. So the best way to write fast code is to move less data, not to worry about every addition.

Actually implementing the DCT was really hard though. As I mentioned, searching for it brought up only academic papers, barely any working code. I managed to find two eventually: dct_simd and mozjpeg. I actually ported the first one to Python to compare intermediate results with my own implementation.

From this code and staring at the paper, I learned a few things. First of all **figure 1 is wrong**. The rotate block should be `sqrt(2)c6`

instead of `sqrt(2)c1`

. Another small detail is the dashed lines, meaning that **some butterflies are upside down**. Another one is the rotate block symbol. It says `kcn`

, which are the `k`

and `n`

in the block equation, *not* the one in the DCT equation, which confused me a lot. So for `sqrt(2)c6`

you just substitute `sqrt(2)`

and 6 in the rotation block. I noted down some more insights in response to a two year old question about the paper on the DSP StackExchange

Having implemented the forward DCT from the paper, I moved on to the inverse. All information the paper has about this is “just do everything backwards”. Thanks, paper. It turns out you use the **same** blocks, but in the reverse order, except… the rotate block `n`

becomes `-n`

. The inverse cosine transform has a negative angle, and this translates to `cos(-x)=cos(x)`

, `sin(-x)=-sin(x)`

.