Wishful Coding

Didn't you ever wish your computer understood you?

USBTinyISP debug oscilloscope

When I’m using a normal Arduino board I usually just do printf debugging. But when I design a PCB with an AVR chip on it, I don’t always want to use serial.

Now if you have an expensive official AVR debugger, you can use debugWIRE with Atmel Studio, but that is a proprietary protocol that is supported by neither avrdude nor the USBTinyISP I have.

For my guitar tuner project I really needed a way to debug the program I’m writing. So I looked around to see whether anyone had figured out a way to do at least basic printf debugging via SPI.

The best thing I found is people using another Arduino to forward SPI commands to the Arduino serial console. That seemed a bit clumsy when I already have a perfectly fine SPI chip in my ISP.

So I took a break from my guitar tuner to write an SPI debugging tool. I configured the AVR as an SPI slave, and then used PyUSB to send SPI commands with the USBTinyISP. In this particular case it was more insightful to plot numerical values than to print text, so I used matplotlib to make a very basic “SPI oscilloscope” to probe some variables on the AVR.
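To give an idea of the "oscilloscope" part, here is a minimal sketch of the polling side only. It is not the code from the repository: the `Scope` class and the `transfer` callable are hypothetical names, and `transfer` stands in for whatever PyUSB call actually clocks one value out of the AVR. A fake sawtooth source is used so the sketch runs without hardware.

```python
from collections import deque
from itertools import count


class Scope:
    """Keep the most recent samples polled from a device."""

    def __init__(self, transfer, width=200):
        # transfer: hypothetical callable that returns one integer sample,
        # standing in for the real SPI transfer over USB.
        self.transfer = transfer
        # A bounded deque drops the oldest sample once `width` is reached.
        self.samples = deque(maxlen=width)

    def poll(self, n=1):
        """Read n new samples and return the current trace as a list."""
        for _ in range(n):
            self.samples.append(self.transfer())
        return list(self.samples)


# Fake device for demonstration: a sawtooth counting 0..255.
_counter = count()


def fake_transfer():
    return next(_counter) % 256


scope = Scope(fake_transfer, width=4)
trace = scope.poll(6)  # six reads, only the last four are kept
print(trace)
```

The returned list can be handed straight to matplotlib's `plot` and redrawn after each poll to get a scrolling trace.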

I put the code on GitHub. It’s pretty basic right now, but if people besides me actually need this kind of thing, it may eventually grow into a nice Arduino library and command line tool.

EV3 Puppy, retail edition

I wanted to make an EV3 tamagotchi with a friend, but after various experiments I found the Puppy that comes with the EV3 education set, which is way cuter than anything we had designed so far. But the education version contains different parts than the retail version, so we could not build that one.

So I sat down to create a puppy with the parts from the retail edition of the EV3. The end result looks very similar, but the construction is quite different in places.

The only functional difference is that I left out the touch sensor and added the infrared sensor instead. So you can’t pet this puppy, but it can detect when you’re near and track the beacon with its eyes.

Other than that it’s mostly the same as the educational version. It can tilt its head up and down, stand up and sit down, pee and display different emotions.

The code I wrote so far is only a dozen lines of Python, which I might upload later. The good news is that due to the Christmas holiday, I had time to make some nice building instructions. Check them out!

Creating a Gigabyte of NOPs

At the university we’re learning about computer architecture. The professor said that registers are used because reading from main memory is too slow. But then he also said the program is stored in main memory. How can the CPU ever execute an instruction every cycle if memory is so slow?

The answer appears to be caching. The CPU can store parts of the program in the cache, and apparently access it fast enough from there. But what if your program is bigger than the cache, or even bigger than memory? I assume it will be limited by the throughput of the RAM/disk. Let’s find out.

As a baseline I created a loop that executes a few NOPs: enough to neglect the loop overhead, but not so many that they spill out of the cache. I was very surprised to find that on my 2.2GHz i7, it executed NOPs at 11.5GHz. Wat? I thought I had made an error in my math or that my NOPs got optimized away, but this was not the case.

It turns out that my CPU has a turbo frequency of 3GHz and 4 execution units that can execute an (independent) instruction each. 3×4=12GHz of single-core performance. Not bad.
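As a quick sanity check of that arithmetic, here is a small Python sketch. The variable names are mine; the numbers are the ones quoted above.

```python
# Peak NOP throughput estimate, using the figures quoted in the text.
turbo_hz = 3.0e9       # turbo clock frequency
nops_per_cycle = 4     # independent instructions per cycle, per the text
measured = 11.5e9      # measured NOPs per second

peak = turbo_hz * nops_per_cycle   # theoretical NOPs per second
efficiency = measured / peak       # fraction of the peak actually reached

print(f"peak: {peak / 1e9:.1f} G NOPs/s, measured: {efficiency:.0%} of peak")
```

So the measured 11.5GHz is within a few percent of what the hardware could do at best, which leaves little room for a math error.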

Now what if it does not fit in cache? Let’s create a GB of NOPs. This was not so easy, and I used several “amplification” steps. First I compiled a C file to assembly with a CPP macro that generated 100 NOPs. Then I saved the NOPs in a separate file and used “100dd10000p” in vim to turn them into a million NOPs. Then I used cat to concatenate 10 copies of that file, then 10 copies of the result, and finally included that file 10 times in the original assembly file. Then I compiled with gcc -pipe -Wall -O0 -o bin10 wrap.S, which took a good number of minutes.
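In hindsight, a short script could probably replace the vim and cat steps. The sketch below is not what I actually ran: `write_nops` is a hypothetical helper that writes a file of "nop" lines in chunks, and the demo only writes a few thousand lines to a temporary file. It also multiplies out the amplification factors to show they really do add up to a billion.

```python
import os
import tempfile


def write_nops(path, count, chunk=1_000_000):
    """Write `count` lines of "nop" to `path`, `chunk` lines at a time."""
    with open(path, "w") as out:
        remaining = count
        while remaining > 0:
            n = min(chunk, remaining)
            out.write("nop\n" * n)   # one instruction per line
            remaining -= n


# Amplification steps from the text:
# 100 (macro) x 10000 (vim paste) x 10 (cat) x 10 (cat) x 10 (includes)
factors = [100, 10_000, 10, 10, 10]
total = 1
for factor in factors:
    total *= factor

# Small demo: write 2500 NOPs in chunks of 1000, then count the lines.
fd, path = tempfile.mkstemp(suffix=".S")
os.close(fd)
write_nops(path, 2500, chunk=1000)
with open(path) as src:
    line_count = sum(1 for _ in src)
os.remove(path)

print(total, line_count)
```

Calling `write_nops` with the full count would produce a source file of several gigabytes, so the assembler would likely still need a good number of minutes.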

The resulting file still runs at a respectable 7GHz. I was expecting it to be much slower, but in retrospect this was to be expected, since the throughput of my DDR3 RAM is apparently 10GB/s. That is much higher than I thought it would be.
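The comparison works because an x86 NOP is a single byte (0x90), so NOPs per second translate directly into bytes of code fetched per second. A small sketch of that conversion, with variable names of my own choosing and the RAM figure taken from the text:

```python
# Convert the measured NOP rate into the memory bandwidth it implies.
nop_bytes = 1              # the plain x86 NOP is one byte (0x90)
nops_per_second = 7e9      # measured rate of the 1GB binary
ram_bytes_per_second = 10e9  # DDR3 throughput quoted in the text

code_bytes_per_second = nops_per_second * nop_bytes
ram_share = code_bytes_per_second / ram_bytes_per_second

print(f"{code_bytes_per_second / 1e9:.0f} GB/s of code, "
      f"{ram_share:.0%} of the quoted RAM throughput")
```

In other words, the big binary already uses a large part of the quoted RAM bandwidth, which fits the idea that it is limited by throughput.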

To really slow down the program, I would need to make it so big that it doesn’t fit in RAM. Seeing how hard it was and how long it took to make a 1GB binary, a 20GB binary would require a new technique.

The other option is to generate a lot of jumps so that the speed becomes limited by the latency of RAM rather than the throughput. But again, generating a GB of jumps requires a completely new technique.
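One way this could be approached is to generate the assembly with a script instead of by hand. The sketch below is untested on real hardware and only illustrates the idea: `jump_chain` is a hypothetical helper that lays out numbered blocks in order, but makes each block jump to the next one in a shuffled visiting order, so execution hops around the file instead of running straight through. Whether that actually makes the program latency-bound is something I have not measured.

```python
import random


def jump_chain(n, seed=0):
    """Return (assembly text, visiting order) for n blocks, n >= 1."""
    order = list(range(n))
    random.Random(seed).shuffle(order)   # the order in which blocks run

    # Each block jumps to its successor in the visiting order;
    # the last block visited jumps to the exit label.
    target = {}
    for here, there in zip(order, order[1:]):
        target[here] = f"L{there}"
    target[order[-1]] = "done"

    lines = [f"    jmp L{order[0]}"]     # enter the chain at the first block
    for i in range(n):                   # blocks are laid out in numeric order
        lines.append(f"L{i}:")
        lines.append(f"    jmp {target[i]}")
    lines.append("done:")
    return "\n".join(lines) + "\n", order


asm, order = jump_chain(8, seed=1)
print(asm)
```

For a real experiment the blocks would also need padding between them, so that consecutive jumps land far apart in memory rather than in the same cache line.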

I’ll leave it at this for now, as I’ve already spent quite some time and learned a few interesting things. Now it doesn’t seem so odd anymore that apparently for large codebases -Os generally performs better than -O3.