Wishful Coding

Didn't you ever wish your
computer understood you?

Loefflers Discrete Cosine Transform algorithm in Futhark

If you search for Loefflers algorithm you get a few academic papers, and for Futhark you get the Germanic runes. This post is a SEO masterpiece.

Discrete Cosine Transform is a variation on the Discrete Fourier Transform. It is used in basically every lossy compression format ever. The reason DCT is preferred is that discrete transforms are cyclic. So the DFT has a jump at the edges of the data, where it wraps around. (this is why windowing is frequently used in DFT) This jump at the edges leads to a fat tail in the frequency spectrum, which does not compress well.

The DCT constructs an “even” signal (mirrored around the 0 axis), so the signal is continuous at the edges. This leads to much lower high frequency coefficients. Lossy compression basically works by quantizing/masking/thresholding those coefficients, which produces many zeros at high frequencies. Long runs of zeros compress really well, so that’s what happens in most compression algorithms.

I was playing a bit with compression, but found that scipy.fftpack.dct was not fast enough to my liking. Since I had recently discovered Futhark, which is an amazing ML-like functional programming language for GPU programming that compiles to OpenCL, I thought it’d be fun to implement the DCT in Futhark. Little did I know what I was getting myself into.

After some searching, I found that Loefflers algorithm is the way to go. It’s what everyone seems to be using, because for an 8-point DCT it obtains the theoretical lower bound of 11 multiplications. After chasing some references in more recent papers, I found the original: Practical fast 1-D DCT algorithms with 11 multiplications, and after days of struggling, I almost understood it.

I knew that a Fast Fourier Transform is based on taking the DFT equation, and splitting it up in odd and even parts. If you keep doing this recursively (called decimation in time/decimation in frequency), you end up with this “butterfly” structure, which are additions of two “branches” scaled by some factor. For the DCT there are also butterflies, but also rotation blocks.

It took a few mental leaps to understand that you can write the DCT or DFT in matrix form, express elementary row operations in matrix form, use those to factorize the DCT matrix, and derive an optimal implementation from this matrix factorization.

The Futhark side of things was really fun. If you know a bit of functional programming, it’s really not hard, and you don’t need to know anything about OpenCL or GPU’s. I hopped on Gitter, and Troels Henriksen was super helpful. I’d come up with a problem, and a few hours later I’d git pull and the compiler got better.

There were a few surprises though. Many array operations are basically free, by returning a view of the same array. But there is an underlying assumption that arrays are big, heap allocated, and parallelized relentlessly. Tuples, on the other hand, are assumed to be small, and register allocated. By rewriting my inner DCT structure from (tiny) arrays to using tuples, performance more than doubled.

At first I tied to optimize my code to use in-place updates, but this was actually significantly slower than out-of-place. By doing in-place updates, I force the compiler to do the operations completely sequentially, while normally it could do a lot in parallel. It turns out that moving data around is by far the slowest thing, and arithmetic is basically free. So the best way to write fast code is to move less data, not to worry about every addition.

Actually implementing the DCT was really hard though. As I mentioned, searching for it brought up only academic papers, barely any working code. I managed to find two eventually: dct_simd and mozjpeg. I actually ported the first one to Python to compare intermediate results with my own implementation.

def pydct(x):
    r = [1.414214, 1.387040, 1.306563, 1.175876, 1.000000, 0.785695, 0.541196, 0.275899]
    y = np.zeros(8)

    invsqrt2= 0.707107;

    c1 = x[0]; c2 = x[7]; t0 = c1 + c2; t7 = c1 - c2;
    c1 = x[1]; c2 = x[6]; t1 = c1 + c2; t6 = c1 - c2;
    c1 = x[2]; c2 = x[5]; t2 = c1 + c2; t5 = c1 - c2;
    c1 = x[3]; c2 = x[4]; t3 = c1 + c2; t4 = c1 - c2;

    c0 = t0 + t3; c3 = t0 - t3;
    c1 = t1 + t2; c2 = t1 - t2;

    y[0] = c0 + c1;
    y[4] = c0 - c1;
    y[2] = c2 * r[6] + c3 * r[2];
    y[6] = c3 * r[6] - c2 * r[2];

    c3 = t4 * r[3] + t7 * r[5];
    c0 = t7 * r[3] - t4 * r[5];
    c2 = t5 * r[1] + t6 * r[7];
    c1 = t6 * r[1] - t5 * r[7];

    y[5] = c3 - c1; y[3] = c0 - c2;
    c0 = (c0 + c2) * invsqrt2;
    c3 = (c3 + c1) * invsqrt2;
    y[1] = c0 + c3; y[7] = c0 - c3;
    return y

def pyidct(y):
    r = [1.414214, 1.387040, 1.306563, 1.175876, 1.000000, 0.785695, 0.541196, 0.275899]
    x = np.zeros(8)
    
    z0 = y[1] + y[7]; z1 = y[3] + y[5]; z2 = y[3] + y[7]; z3 = y[1] + y[5];
    z4 = (z0 + z1) * r[3];

    z0 = z0 * (-r[3] + r[7]);
    z1 = z1 * (-r[3] - r[1]);
    z2 = z2 * (-r[3] - r[5]) + z4;
    z3 = z3 * (-r[3] + r[5]) + z4;

    b3 = y[7] * (-r[1] + r[3] + r[5] - r[7]) + z0 + z2;
    b2 = y[5] * ( r[1] + r[3] - r[5] + r[7]) + z1 + z3;
    b1 = y[3] * ( r[1] + r[3] + r[5] - r[7]) + z1 + z2;
    b0 = y[1] * ( r[1] + r[3] - r[5] - r[7]) + z0 + z3;
    #return np.array([z0, z1, z2, z3])

    z4 = (y[2] + y[6]) * r[6];
    z0 = y[0] + y[4]; z1 = y[0] - y[4];
    z2 = z4 - y[6] * (r[2] + r[6]);
    z3 = z4 + y[2] * (r[2] - r[6]);
    a0 = z0 + z3; a3 = z0 - z3;
    a1 = z1 + z2; a2 = z1 - z2;
    #return np.array([a0, a1, a2, a3, b0, b1, b2, b3])

    x[0] = a0 + b0; x[7] = a0 - b0;
    x[1] = a1 + b1; x[6] = a1 - b1;
    x[2] = a2 + b2; x[5] = a2 - b2;
    x[3] = a3 + b3; x[4] = a3 - b3;
    return x

From this code and staring at the paper, I learned a few things. First of all figure 1 is wrong. The rotate block should be sqrt(2)c6 instead of sqrt(2)c1. Another small detail is the dashed lines, meaning that some butterflies are upside down. Another one is the rotate block symbol. It says kcn, which are the k and n in the block equation, not the one in the DCT equation, which confused me a lot. So for sqrt(2)c6 you just substitute sqrt(2) and 6 in the rotation block. I noted down some more insights in response to a two year old question about the paper on the DSP StackExchange

Having implemented the forward DCT from the paper, I moved on to the inverse. All information the paper has about this is “just do everything backwards”. Thanks, paper. It turns out you use the same blocks, but in the reverse order, except… the rotate block n becomes -n. The inverse cosine transform has a negative angle, and this translates to cos(-x)=cos(x), sin(-x)=-sin(x).

type octet 't = [8]t
type f32octet = octet f32

let butterfly (a: f32) (b: f32) : (f32, f32) =
  (a+b, a-b)

let mk_coef (k:f32) (n:i32) : f32 =
  k*f32.cos ((r32 n)*f32.pi/16)

let coefr = map (mk_coef (f32.sqrt 2)) <| iota 8
let coef1 = map (mk_coef 1) <| iota 8

let rotate (sin_coef: f32) (cos_coef: f32) (a: f32) (b: f32): (f32, f32) =
  (a*cos_coef + b*sin_coef,
   b*cos_coef - a*sin_coef)

entry fdct8 (a: f32octet) : f32octet  =
  -- stage 1
  let (st1_0, st1_7) = butterfly a[0] a[7]
  let (st1_1, st1_6) = butterfly a[1] a[6]
  let (st1_2, st1_5) = butterfly a[2] a[5]
  let (st1_3, st1_4) = butterfly a[3] a[4]
  -- even part, stage 2
  let (st2_0, st2_3) = butterfly st1_0 st1_3
  let (st2_1, st2_2) = butterfly st1_1 st1_2
  -- stage 3
  let (y0, y4)   = butterfly st2_0 st2_1
  let (y2, y6)   = rotate coefr[2] coefr[6] st2_2 st2_3
  -- odd part, stage 2
  let (st2_4, st2_7)   = rotate coef1[5] coef1[3] st1_4 st1_7
  let (st2_5, st2_6)   = rotate coef1[7] coef1[1] st1_5 st1_6
  -- stage 3
  let (st3_4, st3_6)   = butterfly st2_4 st2_6
  let (st3_7, st3_5)   = butterfly st2_7 st2_5
  -- stage 4
  let (y1, y7)   = butterfly st3_7 st3_4
  let y3  = f32.sqrt(2)*st3_5
  let y5  = f32.sqrt(2)*st3_6
  in [y0, y4, y2, y6, y7, y3, y5, y1]


entry idct8 (a: f32octet) : f32octet  =
  -- odd part, stage 4
  let (st4_7, st4_4)   = butterfly a[7] a[4]
  let st4_5 = f32.sqrt(2)*a[5]
  let st4_6 = f32.sqrt(2)*a[6]
  -- stage 3
  let (st3_4, st3_6)   = butterfly st4_4 st4_6
  let (st3_7, st3_5)   = butterfly st4_7 st4_5
  -- stage 2
  let (st2_4, st2_7)   = rotate (-coef1[5]) coef1[3] st3_4 st3_7
  let (st2_5, st2_6)   = rotate (-coef1[7]) coef1[1] st3_5 st3_6
  -- even part, stage 3
  let (st3_0, st3_1)   = butterfly a[0] a[1]
  let (st3_2, st3_3)   = rotate (-coefr[2]) coefr[6] a[2] a[3]
  -- stage 2
  let (st2_0, st2_3) = butterfly st3_0 st3_3
  let (st2_1, st2_2) = butterfly st3_1 st3_2
  -- stage 1
  let (st1_0, st1_7) = butterfly st2_0 st2_7
  let (st1_1, st1_6) = butterfly st2_1 st2_6
  let (st1_2, st1_5) = butterfly st2_2 st2_5
  let (st1_3, st1_4) = butterfly st2_3 st2_4
  in [st1_0/8, st1_1/8, st1_2/8, st1_3/8, st1_4/8, st1_5/8, st1_6/8, st1_7/8]
Published on

sysenv: virtualenv for system packages

I try to keep my Linux system as clean as possible, but for almost every project I inadvertently have to install some packages with apt-get, build some from source with make install, or even install some proprietary program that in turn requires more packages. So over time my system acquires more and more junk.

To solve this problem, I made “virualenv for system packages”, a little script that makes a chroot with an unholy combination of OverlayFS and bind mounts. From inside the chroot, it looks and behaves exactly like your /, with no overhead or isolation whatsoever. The only difference is that writes to all system directories go to an overlay directory.

$ sudo bash activate.sh
(env)$ sudo apt-get install everything
(env)$ sudo make install
(env)$ echo "Hello world" > hello.txt
(env)$ ^D
$ everything
Command not found
$ cat hello.txt
Hello world

In the Python world this is pretty much the standard. You make a virtualenv, pip install all the things you need, and delete then environment after your project is done. I want this kind of behavior for all of my software.

Of course you can run everything in a VM, a Docker image, or a chroot. But these typically provide isolation that I don’t want or need. They also have a lot of overhead in RAM, disk space, and most importantly, effort on my part. However, the git-like overlay filesystems used by Docker gave me an idea.

What if I made a chroot, but instead of putting an entire Debian installation inside it, make an overlay on my own system. The only trouble is that special directories like /proc, /sys, and /dev should work as usual, and preferably my /home folder should also persist outside the chroot.

The solution turns out to be relatively simple: My /home and all the special directories are mounted with mount --bind. All system directories like /lib and /bin are mounted with mount -t overlay.

It works great for from-source installs and proprietary software. It also works for apt-get, but I’m sure weird things will happen once you upgrade your system. Once it gets wonky, just nuke the env and start over.

Get the code here, pull requests welcome.

Published on

EV3 soccer robot with ROS

When I said that ROS would be a steep learning curve, I did not expect to become a Docker expert and a kernel developer in the progress. But in the end it worked, with only a few seconds of lag.

Meanwhile RoboTeam Twente won a game against Robodragons, but that’s sadly the only game they won. The finals looked a bit more fast-paced than this game, but not by a lot. It’s 10 minutes game time in an hour clock time. These top teams have impressive ball control though. There is some work to be done for Twente… But for now, back to ROS and EV3.

Installing ROS on the EV3

I headed over the ROS installation page, which states that on Debian they only provide x64 packages, not ARM. So I was kindly redirected to their page about compiling from source. However, the EV3 is far from powerful enough to compile ROS on the device, so a cross-compiler is needed.

Ev3dev uses Docker images with Qemu to do this. But they provide several kinds of images, useful for either cross-compilation or for generating boot images for the EV3. However, ROS does not have a nicely self-contained installation process, so what I needed is a cross-compilation image that could also generate a boot image.

ROS is like a fractal of package management, breaking at every stage. First you install Python packages to install stuff in /etc and then you install a bunch of stuff into a workspace, and then you tell it to install system packages for the dependencies of that stuff. Then you combine this into a standalone workspace, which can be used install more packages in more workspaces.

So at this point I was learning about Dockerfiles and base images and Qemu and ROS releases and contexts and environments and source lists. And after a day of trying and hours of waiting, I had a shiny boot image with my cross-compiled ROS installation. And then David Lechner casually mentioned Debian does actually provide ARM packages, available with a simple apt-get install ros-robot.

Learning ROS

graph

So after you’re three levels deep in workspaces that you have sourced, you can begin creating your own packages, by – you guessed it – more package management. So you edit your package XML file for the dependencies of your package and run rosmake which is like CMake for ROS. You also need to run rosmake on Python projects to resolve dependencies and generate more code.

In exchange for all this work, ROS provides a lot of powerful tools, like roscd, which is like cd but for ROS, rosed, which is like vim but for ROS, and rosrun which is like running your code, but for ROS. I think you can guess what rosls is for.

After you have created a package, you can start creating nodes. Nodes can be written in a number of languages, as long as they can talk to the central ROS server which routes all the messages on different topics to all the interested nodes. As per Greenspun’s tenth rule, ROS contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Erlang. An actor system is not provided, only callbacks.

So now you have all these packages and nodes, and of course it would be silly to rosrun all of them individually. So they provide roslaunch, which is like init but for ROS. So you write some more XML, and roslaunch will do the rest. It will even ssh into remote machines to launch nodes on it.

The goal was to run ROS on the EV3, but not all of ROS. To be more specific, one single node. The EV3 does not have enough RAM for two operating systems at the same time. So on the EV3 you source all the workspaces and roscd into your package where you can rosrun the node. But before doing that, you have to tell it where the server lives with export ROS_MASTER_URI=http://pepijn-Latitude-E6420.local:11311. I tried to get roslaunch to do it, but it’d run out of RAM and crash.

So next I ran the listener from the tutorial on the EV3 and the talker on my laptop. Everything would appear to be fine, but no messages arrived. Long story short, it turns out you also need to set ROS_HOSTNAME or ROS_IP on both ends so that they can actually reach each other.

Writing a ROS node

With ROS running on my laptop and the EV3, and the basic listener working, it was time to actually write some code. But first, some more package management!

ROS has a joy package, to read joysticks, and a teleop_twist_joy package that provides a node that converts joy to twist messages. These packages are supposed to be in your system package manager, but they are not. So I installed them from source, but… package management… fast forward… I apt-get purge the Ubuntu packages and use the official ROS Melodic packages.

<launch>

  <node pkg="joy" name="joy" type="joy_node" output="screen">
    <param name="dev" type="str" value="/dev/input/js0" />
    <param name="deadzone" type="double" value="0.1" />
  </node>

  <node pkg="teleop_twist_joy" name="twist" type="teleop_node" output="screen">
    <param name="enable_button" type="int" value="4" />
    <param name="axis_linear/x" type="int" value="0" />
    <param name="axis_linear/y" type="int" value="1" />
    <param name="scale_linear/x" type="double" value="1" />
    <param name="scale_linear/y" type="double" value="1" />
    <param name="axis_angular" type="int" value="3" />
    <param name="scale_angular" type="double" value="1" />
  </node>

</launch>

Now I can finally start writing code on the EV3. So I copy the Python code from last time, and replace the joystick code with rospy listener code. As I get ready to test my code, it dawns on my that rospy is Python 2, while ev3dev-lang-python is Python 3. I go back to my Docker image and change it to Python 3. An hour later it crashes with an Unicode error, and another hour later with a package version mismatch. The next day it finally works, but it seems to be missing half the packages.

I consider writing my node in C++, but worry about all the package management needed to get ev3dev-lang-c++ into the ROS universe. I consider my alternatives, and without further package management, I succeed in using a Vala/GTK/GObject binding called Ev3devKit from Python 2.

import gi
gi.require_version('Ev3devKit', '0.5')
from gi.repository import Ev3devKit

manager = Ev3devKit.DevicesDeviceManager()
motors = {m.get_address(): m for m in manager.get_tacho_motors()}

def run(port, speed):
    m = motors['ev3-ports:'+port]
    if speed==0:
        m.send_command('stop')
    else:
        m.set_speed_sp(speed)
        m.send_command('run-forever')

With that in place, all that is left is a few trivial changes to the code from last time.

#!/usr/bin/env python2

import threading
import numpy as np
from geometry_msgs.msg import Twist
from sensor_msgs.msg import Joy
import rospy
import kit

## Initializing ##

angles = np.deg2rad([-36.8, -90, 36.8])
coef = np.array([np.sin(angles), np.cos(angles), [-1,1,1]]).T

speed = np.zeros(3)
kick = 0
running = True

class MotorThread(threading.Thread):
    def run(self):
        print("Engine running!")
        while running:
            sp = coef.dot(speed)
            kit.run('outA', sp[0])
            kit.run('outB', sp[1])
            kit.run('outD', sp[2])
            kit.run('outC', kick)

        kit.run('outA', 0)
        kit.run('outB', 0)
        kit.run('outD', 0)
        kit.run('outC', 0)

motor_thread = MotorThread()
motor_thread.start()

def callback(data):
    #rospy.loginfo(data)
    speed[0] = data.linear.x*-300
    speed[1] = data.linear.y*300
    speed[2] = data.angular.z*100

def btn_callback(data):
    global kick, running
    btn = data.buttons
    if btn[1]:
        kick = 1000
    elif btn[3]:
        kick = -1000
    else:
        kick = 0
    
    if btn[2]:
        running = False
        rospy.signal_shutdown('button pressed')
    
def listener():
    rospy.init_node('ev3dev', anonymous=True)
    rospy.Subscriber("cmd_vel", Twist, callback)
    rospy.Subscriber("joy", Joy, btn_callback)
    print("subscribed")
    rospy.spin()

if __name__ == '__main__':
    listener()
Published on