Wishful Coding

Didn't you ever wish your computer understood you?

Partial Decoding of 360° HD Virtual Reality Video

I’m doing some mental cleaning, putting some ideas out there that I had saved up for a master’s thesis, startup, or other ambition. Starting with this VR-related idea.

I got this idea from a post by John Carmack about 5k video decoding on VR headsets, where he talks about the challenges of 360° HD video. Basically, it’s a lot of data, and the user is only looking at about 1/6 of it. The problem with partial decoding is that conventional video codecs use key frames and motion prediction. John’s solution is to slice the video into tiles with many extra key frames and decode only the visible tiles, with an extra low-resolution backdrop for quick head motions.

I thought there must be a better way, so I made a new video codec that supports efficient partial decoding. It’s based on the 3D discrete cosine transform, which I implemented on the GPU in Futhark. It’s the same transform used in JPEG, with the third dimension being time.

Think of it like this: if you put all the video frames behind one another, you basically get a cube of pixels. So, similar to how JPEG compresses areas of the same color, you can now compress volumes of the same color over time.

The way compression like this works is that you take blocks of 8x8(x8) pixels and transform them to the frequency domain. (The cosine transform is a relative of the Fourier transform.) A property of the cosine transform is that most of the important information ends up at low frequencies, so you can basically set the high-frequency parts to zero. Then you apply lossless compression, which is great at compressing long runs of zeros.
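Here is a sketch of that pipeline in Python, using scipy’s dct on the CPU rather than my Futhark code, and a made-up threshold where a real codec would use a quantization matrix:

import numpy as np
from scipy.fftpack import dct, idct

cube = np.random.rand(8, 8, 8)  # one 8x8x8 block: two spatial axes plus time

# 3D DCT: the transform is separable, so apply the 1D DCT along each axis
freq = dct(dct(dct(cube, axis=0, norm='ortho'),
               axis=1, norm='ortho'), axis=2, norm='ortho')

# zero out small high-frequency coefficients (crude thresholding)
freq[np.abs(freq) < 0.1] = 0

# the result, mostly zeros, is what you'd hand to the lossless coder;
# inverting the three axes gives back an approximation of the block
restored = idct(idct(idct(freq, axis=2, norm='ortho'),
                     axis=1, norm='ortho'), axis=0, norm='ortho')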

Well, that’s how JPEG and 3D-DCT video compression work, which has been written about a lot. That’s not a new thing. But what’s really great about 3D-DCT compared to motion prediction is that you can decode an arbitrary 8x8x8 cube without any extra data. This makes it great for VR video, I think.

What’s even cooler: the DC component of the DCT is the average of the whole cube, so without any decoding, you can take the DC component to get your low-resolution backdrop. This backdrop is also at 1/8th of the frame rate, so it may be desirable to partially decode the cube in time, which is totally possible. You just apply the 1D inverse DCT along the time dimension and take the DC components of the resulting 2D frames.
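Continuing the sketch above (with time on axis 0, which is my layout assumption):

import numpy as np
from scipy.fftpack import dct, idct

cube = np.random.rand(8, 8, 8)
freq = dct(dct(dct(cube, axis=0, norm='ortho'),
               axis=1, norm='ortho'), axis=2, norm='ortho')

# backdrop: the DC coefficient is the cube average, up to a known scale,
# with no inverse transform needed
backdrop = freq[0, 0, 0] / np.sqrt(8.0)**3

# partial decode in time: invert only the time axis, then take the DC
# component of each of the 8 resulting 2D planes, giving a low-resolution
# pixel at the full frame rate
planes = idct(freq, axis=0, norm='ortho')
lowres_over_time = planes[:, 0, 0] / np.sqrt(8.0)**2

assert np.isclose(backdrop, cube.mean())
assert np.allclose(lowres_over_time, cube.mean(axis=(1, 2)))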

After implementing a proof of concept in Futhark (for the DCT) and Python (for the IO and interface), I sent an email to John Carmack with the video above. His reply:

There are at least three companies working full time on schemes for partial video decode in VR. I have been in communication with Visbit and TiledMedia, and I know there are a couple others. An algorithm isn’t going to be worth much of anything, but a functioning service, like they are trying to do, may have some kind of acquisition exit strategy, but it isn’t looking great for them right now.

Long ago, I did some investigation of 3D DCT for video compression, and it wasn’t as competitive as I hoped – 2D motion prediction winds up being rather more flexible than the DCT basis functions, and video frames are actually aliased in time due to shutter exposures being a fraction of the time duration, so it isn’t as smooth as the spatial dimensions.

Though for me, the main reason I shelved this idea is that there is not really a viable path to get it onto VR headsets. A mobile video codec pretty much has to be implemented in hardware, but for such a niche market, it’s hard to imagine a way to realize this hardware. If there were a CPU manufacturer interested in licensing my 3D-DCT IP block into their products, I’d be more than happy to finish the thing.

Pepijn de Vos

Futhark: Python gotta go faster

While discussing the disappointing performance of my Futhark DCT on my “retro GPU” (an Nvidia NVS 4200M) with Troels Henriksen, it came up that the Python backend has quite a lot of calling overhead.

Futhark can compile high-level functional code to very fast OpenCL, but Futhark is meant to be embedded in larger programs, so it provides host libraries in C and Python that set up the GPU, transfer the memory, and run the code. It turns out that the Python backend, based on PyOpenCL, is quite a bit slower at this than the C backend.

I wondered why the Python backend did not use the C one via FFI, and Troels mentioned that someone had done this for a specific program and saw modest performance gains. However, this does require a working compiler and OpenCL installation, rather than just a pip install PyOpenCL, so he argued that PyOpenCL is the easiest solution for the average data scientist.

I figured I might be able to write a generic wrapper for the generated C code by feeding the generated header directly to CFFI. That worked on the first try, so that was nice. The hard part was writing a generic, yet efficient and Pythonic wrapper around the CFFI module.
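The core of that first try looked roughly like this (a simplified sketch, not the actual build_futhark_ffi code):

import cffi

ffibuilder = cffi.FFI()

# feed the Futhark-generated declarations straight to CFFI
with open('test.h') as f:
    ffibuilder.cdef(f.read())

# compile the generated C code into an extension module _test
ffibuilder.set_source('_test', '#include "test.h"',
                      sources=['test.c'], libraries=['OpenCL'])

if __name__ == '__main__':
    ffibuilder.compile()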

The first proof of concept required quite a few fragile hacks (pattern matching on function names, and relying on the type and number of arguments to infer other things). But it worked! My DCT ran over twice as fast. Then Troels, helpful as always, modified the generated code to reduce the number of required hacks. He then proceeded to port some of the demos and benchmarks, request some features, and contribute Python 2 support.

futhark-ffi now supports all Futhark types on both Python 2 and 3, resulting in speedups of anywhere between 20% and 100% compared to the PyOpenCL backend. Programs that make many short calls benefit a lot, while programs that call large, long-running code benefit very little. The OpenCL code that runs is the same, only the calling overhead is reduced.

One interesting change suggested by Troels is to not automatically convert Futhark types to Python types. For my use case I just wanted to pass a Numpy array to Futhark and get a Numpy array back, but in a lot of other programs, Futhark values are passed between functions unchanged, so not copying them between the GPU and CPU saves a lot of time, as sketched below. There is even a compatibility shim that lets you use futhark-ffi with existing PyOpenCL code by merely changing the imports; an example of this can be seen here.
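The no-copy style looks roughly like this (step1 and step2 are hypothetical entry points):

import numpy as np
import _test
from futhark_ffi import Futhark

test = Futhark(_test)
gpu_data = test.step1(np.arange(10))   # result stays on the GPU
gpu_data = test.step2(gpu_data)        # consumed directly, no copies
result = test.from_futhark(gpu_data)   # one transfer back, at the end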

After installing Futhark, you can simply get my library with pip (a working OpenCL installation is required):

pip install futhark-ffi

Usage is as follows. First generate a C library, then build a Python binding for it:

futhark-opencl --library test.fut
build_futhark_ffi test

From there you can import both the CFFI-generated module and the library to run your Futhark code even faster!

import numpy as np
import _test                     # the CFFI-compiled Futhark library
from futhark_ffi import Futhark

test = Futhark(_test)            # initialize the GPU context
res = test.test3(np.arange(10))  # call the Futhark entry point test3
test.from_futhark(res)           # copy the result back as a Numpy array
Pepijn de Vos

Loeffler’s Discrete Cosine Transform algorithm in Futhark

If you search for Loeffler’s algorithm you get a few academic papers, and for Futhark you get the Germanic runes. This post is an SEO masterpiece.

The Discrete Cosine Transform is a variation on the Discrete Fourier Transform. It is used in basically every lossy compression format ever. The reason the DCT is preferred is that discrete transforms treat the data as cyclic, so the DFT sees a jump at the edges of the data, where the signal wraps around. (This is why windowing is frequently used with the DFT.) This jump at the edges leads to a fat tail in the frequency spectrum, which does not compress well.

The DCT instead constructs an “even” signal (mirrored around the 0 axis), so the signal is continuous at the edges. This leads to much smaller high-frequency coefficients. Lossy compression basically works by quantizing/masking/thresholding those coefficients, which produces many zeros at the high frequencies. Long runs of zeros compress really well, so that’s what happens in most compression algorithms.
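You can see the difference with a quick experiment (my own illustration): a ramp looks smooth, but wrapped around it has a jump at the edge.

import numpy as np
from scipy.fftpack import dct

x = np.arange(8, dtype=float)
print(np.abs(np.fft.fft(x)))         # DFT: coefficients fall off slowly
print(np.abs(dct(x, norm='ortho')))  # DCT: high frequencies are much smaller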

I was playing a bit with compression, but found that scipy.fftpack.dct was not fast enough for my liking. Since I had recently discovered Futhark, an amazing ML-like functional programming language for GPU programming that compiles to OpenCL, I thought it’d be fun to implement the DCT in Futhark. Little did I know what I was getting myself into.

After some searching, I found that Loeffler’s algorithm is the way to go. It’s what everyone seems to be using, because for an 8-point DCT it achieves the theoretical lower bound of 11 multiplications. After chasing some references in more recent papers, I found the original, Practical fast 1-D DCT algorithms with 11 multiplications, and after days of struggling, I almost understood it.

I knew that a Fast Fourier Transform is based on taking the DFT equation and splitting it up into odd and even parts. If you keep doing this recursively (called decimation in time or decimation in frequency), you end up with a “butterfly” structure: additions of two “branches”, scaled by some factor. The DCT also has butterflies, but rotation blocks as well.
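To make the butterfly concrete (my own illustration): a 2-point DFT is exactly one butterfly, the sum and difference of two branches.

import numpy as np

a, b = 3.0, 5.0
assert np.allclose(np.fft.fft([a, b]), [a + b, a - b])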

It took a few mental leaps to understand that you can write the DCT or DFT in matrix form, express elementary row operations in matrix form, use those to factorize the DCT matrix, and derive an optimal implementation from this matrix factorization.
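Concretely (my own sketch, not from the paper): the matrix form just means y = Cx, and a factorization of C into sparse stage matrices (butterflies and rotations) is the fast algorithm.

import numpy as np
from scipy.fftpack import dct

N = 8
n = np.arange(N)
C = np.cos(np.pi * np.outer(n, 2 * n + 1) / (2 * N))  # the 8-point DCT-II matrix

x = np.random.rand(N)
assert np.allclose(C @ x, dct(x) / 2)  # scipy's unnormalized dct is 2*C @ x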

The Futhark side of things was really fun. If you know a bit of functional programming, it’s really not hard, and you don’t need to know anything about OpenCL or GPUs. I hopped on Gitter, and Troels Henriksen was super helpful. I’d come up with a problem, and a few hours later I’d git pull and the compiler got better.

There were a few surprises though. Many array operations are basically free, because they return a view of the same array. But there is an underlying assumption that arrays are big, heap-allocated, and parallelized relentlessly. Tuples, on the other hand, are assumed to be small and register-allocated. By rewriting my inner DCT structure from (tiny) arrays to tuples, performance more than doubled.

At first I tried to optimize my code to use in-place updates, but this was actually significantly slower than out-of-place updates. By doing in-place updates, I forced the compiler to perform the operations completely sequentially, while normally it could do a lot in parallel. It turns out that moving data around is by far the slowest thing, and arithmetic is basically free. So the best way to write fast code is to move less data, not to worry about every addition.

Actually implementing the DCT was really hard though. As I mentioned, searching for it brings up mostly academic papers and barely any working code. I eventually managed to find two implementations: dct_simd and mozjpeg. I ported the first one to Python to compare intermediate results with my own implementation:

import numpy as np

def pydct(x):
    # r[n] = sqrt(2) * cos(n*pi/16)
    r = [1.414214, 1.387040, 1.306563, 1.175876, 1.000000, 0.785695, 0.541196, 0.275899]
    y = np.zeros(8)

    invsqrt2 = 0.707107  # 1/sqrt(2)

    # stage 1: butterflies on mirrored input pairs
    c1 = x[0]; c2 = x[7]; t0 = c1 + c2; t7 = c1 - c2;
    c1 = x[1]; c2 = x[6]; t1 = c1 + c2; t6 = c1 - c2;
    c1 = x[2]; c2 = x[5]; t2 = c1 + c2; t5 = c1 - c2;
    c1 = x[3]; c2 = x[4]; t3 = c1 + c2; t4 = c1 - c2;

    # even part: two more butterfly stages and one rotation
    c0 = t0 + t3; c3 = t0 - t3;
    c1 = t1 + t2; c2 = t1 - t2;

    y[0] = c0 + c1;
    y[4] = c0 - c1;
    y[2] = c2 * r[6] + c3 * r[2];
    y[6] = c3 * r[6] - c2 * r[2];

    # odd part: rotations followed by butterflies
    c3 = t4 * r[3] + t7 * r[5];
    c0 = t7 * r[3] - t4 * r[5];
    c2 = t5 * r[1] + t6 * r[7];
    c1 = t6 * r[1] - t5 * r[7];

    y[5] = c3 - c1; y[3] = c0 - c2;
    c0 = (c0 + c2) * invsqrt2;
    c3 = (c3 + c1) * invsqrt2;
    y[1] = c0 + c3; y[7] = c0 - c3;
    return y

def pyidct(y):
    # same coefficients as the forward transform
    r = [1.414214, 1.387040, 1.306563, 1.175876, 1.000000, 0.785695, 0.541196, 0.275899]
    x = np.zeros(8)

    # odd part: reconstruct from y[1], y[3], y[5], y[7]
    z0 = y[1] + y[7]; z1 = y[3] + y[5]; z2 = y[3] + y[7]; z3 = y[1] + y[5];
    z4 = (z0 + z1) * r[3];

    z0 = z0 * (-r[3] + r[7]);
    z1 = z1 * (-r[3] - r[1]);
    z2 = z2 * (-r[3] - r[5]) + z4;
    z3 = z3 * (-r[3] + r[5]) + z4;

    b3 = y[7] * (-r[1] + r[3] + r[5] - r[7]) + z0 + z2;
    b2 = y[5] * ( r[1] + r[3] - r[5] + r[7]) + z1 + z3;
    b1 = y[3] * ( r[1] + r[3] + r[5] - r[7]) + z1 + z2;
    b0 = y[1] * ( r[1] + r[3] - r[5] - r[7]) + z0 + z3;
    #return np.array([z0, z1, z2, z3])  # debug: compare intermediate results

    # even part: reconstruct from y[0], y[2], y[4], y[6]
    z4 = (y[2] + y[6]) * r[6];
    z0 = y[0] + y[4]; z1 = y[0] - y[4];
    z2 = z4 - y[6] * (r[2] + r[6]);
    z3 = z4 + y[2] * (r[2] - r[6]);
    a0 = z0 + z3; a3 = z0 - z3;
    a1 = z1 + z2; a2 = z1 - z2;
    #return np.array([a0, a1, a2, a3, b0, b1, b2, b3])  # debug

    # final butterflies recombine the even and odd parts
    x[0] = a0 + b0; x[7] = a0 - b0;
    x[1] = a1 + b1; x[6] = a1 - b1;
    x[2] = a2 + b2; x[5] = a2 - b2;
    x[3] = a3 + b3; x[4] = a3 - b3;
    return x

From this code and from staring at the paper, I learned a few things. First of all, figure 1 is wrong: the rotate block should be sqrt(2)c6 instead of sqrt(2)c1. Another small detail is the dashed lines, which mean that some butterflies are upside down. Another one is the rotate block symbol. It says kcn, where k and n refer to the block equation, not to the DCT equation, which confused me a lot. So for sqrt(2)c6 you just substitute sqrt(2) and 6 into the rotation block. I noted down some more insights in response to a two-year-old question about the paper on the DSP StackExchange.
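For reference, here is the rotation block with those k and n written out, as I read it from the paper (my paraphrase of its notation; it matches the rotate function in the Futhark code further down):

$$O_0 = I_0 \, k\cos(n\pi/16) + I_1 \, k\sin(n\pi/16)$$

$$O_1 = I_1 \, k\cos(n\pi/16) - I_0 \, k\sin(n\pi/16)$$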

Having implemented the forward DCT from the paper, I moved on to the inverse. All the information the paper has about this is “just do everything backwards”. Thanks, paper. It turns out you use the same blocks in the reverse order, except that in the rotate block n becomes -n. The inverse cosine transform has a negative angle, and this translates to cos(-x)=cos(x), sin(-x)=-sin(x), so only the sine coefficient flips sign.
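To convince myself of that sign flip, here is a quick check in Python (my own sketch; rotate mirrors the Futhark function below). For the unit-scale (k=1) rotations, negating the sine coefficient exactly undoes the block; for the sqrt(2)-scaled blocks, the leftover scale factors end up in the final division by 8.

import numpy as np

def rotate(sin_coef, cos_coef, a, b):
    # same structure as the Futhark rotate block
    return a * cos_coef + b * sin_coef, b * cos_coef - a * sin_coef

c, s = np.cos(3 * np.pi / 16), np.sin(3 * np.pi / 16)  # k = 1, n = 3
u, v = rotate(s, c, 1.0, 2.0)
assert np.allclose(rotate(-s, c, u, v), (1.0, 2.0))  # n -> -n inverts the block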

type octet 't = [8]t
type f32octet = octet f32

let butterfly (a: f32) (b: f32) : (f32, f32) =
  (a+b, a-b)

let mk_coef (k:f32) (n:i32) : f32 =
  k*f32.cos ((r32 n)*f32.pi/16)

let coefr = map (mk_coef (f32.sqrt 2)) <| iota 8
let coef1 = map (mk_coef 1) <| iota 8

let rotate (sin_coef: f32) (cos_coef: f32) (a: f32) (b: f32): (f32, f32) =
  (a*cos_coef + b*sin_coef,
   b*cos_coef - a*sin_coef)

entry fdct8 (a: f32octet) : f32octet  =
  -- stage 1
  let (st1_0, st1_7) = butterfly a[0] a[7]
  let (st1_1, st1_6) = butterfly a[1] a[6]
  let (st1_2, st1_5) = butterfly a[2] a[5]
  let (st1_3, st1_4) = butterfly a[3] a[4]
  -- even part, stage 2
  let (st2_0, st2_3) = butterfly st1_0 st1_3
  let (st2_1, st2_2) = butterfly st1_1 st1_2
  -- stage 3
  let (y0, y4)   = butterfly st2_0 st2_1
  let (y2, y6)   = rotate coefr[2] coefr[6] st2_2 st2_3
  -- odd part, stage 2
  let (st2_4, st2_7)   = rotate coef1[5] coef1[3] st1_4 st1_7
  let (st2_5, st2_6)   = rotate coef1[7] coef1[1] st1_5 st1_6
  -- stage 3
  let (st3_4, st3_6)   = butterfly st2_4 st2_6
  let (st3_7, st3_5)   = butterfly st2_7 st2_5
  -- stage 4
  let (y1, y7)   = butterfly st3_7 st3_4
  let y3  = f32.sqrt(2)*st3_5
  let y5  = f32.sqrt(2)*st3_6
  in [y0, y4, y2, y6, y7, y3, y5, y1]


entry idct8 (a: f32octet) : f32octet  =
  -- odd part, stage 4
  let (st4_7, st4_4)   = butterfly a[7] a[4]
  let st4_5 = f32.sqrt(2)*a[5]
  let st4_6 = f32.sqrt(2)*a[6]
  -- stage 3
  let (st3_4, st3_6)   = butterfly st4_4 st4_6
  let (st3_7, st3_5)   = butterfly st4_7 st4_5
  -- stage 2
  let (st2_4, st2_7)   = rotate (-coef1[5]) coef1[3] st3_4 st3_7
  let (st2_5, st2_6)   = rotate (-coef1[7]) coef1[1] st3_5 st3_6
  -- even part, stage 3
  let (st3_0, st3_1)   = butterfly a[0] a[1]
  let (st3_2, st3_3)   = rotate (-coefr[2]) coefr[6] a[2] a[3]
  -- stage 2
  let (st2_0, st2_3) = butterfly st3_0 st3_3
  let (st2_1, st2_2) = butterfly st3_1 st3_2
  -- stage 1
  let (st1_0, st1_7) = butterfly st2_0 st2_7
  let (st1_1, st1_6) = butterfly st2_1 st2_6
  let (st1_2, st1_5) = butterfly st2_2 st2_5
  let (st1_3, st1_4) = butterfly st2_3 st2_4
  in [st1_0/8, st1_1/8, st1_2/8, st1_3/8, st1_4/8, st1_5/8, st1_6/8, st1_7/8]
Pepijn de Vos