# Wishful Coding

Didn't you ever wish your computer understood you?

## Introduction To Game Boy Hacking at SHA2017

At SHA2017 I gave a workshop about Game Boy assembly programming. Despite the projector not working, it was fun to do, and I got some nice feedback. We looked at some code from Pokemon Red, made some small changes, and more. Unfortunately there was not enough time to dive into Super Mario Land and write code from scratch.

For those who want to continue, or who were not there, the PDF can be downloaded here.

As a bonus, someone asked how you could disassemble Super Mario Land and extract images from it. In theory this should be easy by using pokemontools, but I could not get it to work right away.

What I did instead was search the code for the location of the sprites. The VRAM tile data is located at 0x8000, so searching for that address in the debugger gives a few places that copy data from ROM to VRAM. Using pokemontools, I was able to at least decode a block of tiles at 0x4032, shown below. Other blocks should be possible to find as well, but that’s for another time.
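For reference, the 2bpp tile format itself is simple enough to decode by hand, without pokemontools. A minimal sketch in Python (my own illustration, not code from the workshop): each 8×8 tile is 16 bytes, two bytes per row, with the low and high bitplane interleaved.

```python
def decode_tile(tile):
    """Decode one 8x8 tile from 16 bytes of Game Boy 2bpp data.

    Each row is two bytes: the low bitplane, then the high bitplane.
    A pixel's 2-bit color index combines one bit from each plane.
    """
    rows = []
    for y in range(8):
        lo, hi = tile[2 * y], tile[2 * y + 1]
        rows.append([
            (((hi >> (7 - x)) & 1) << 1) | ((lo >> (7 - x)) & 1)
            for x in range(8)
        ])
    return rows

# Example: a tile whose low plane is all 1s and high plane all 0s
# decodes to color index 1 everywhere.
tile = bytes([0xFF, 0x00] * 8)
print(decode_tile(tile)[0])  # [1, 1, 1, 1, 1, 1, 1, 1]
```

To extract sprites you would slice the ROM into 16-byte chunks starting at the offset found in the debugger (0x4032 in this case) and map the color indices through a palette.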

## USBTinyISP debug oscilloscope

When I’m using a normal Arduino board I usually just do printf debugging. But when I design a PCB with an AVR chip on it, I don’t always want to use serial.

Now if you have an expensive official AVR debugger, you can use debugWIRE with Atmel Studio, but that is a proprietary protocol not supported by avrdude or the USBTinyISP I have.

For my guitar tuner project I really needed a way to debug the program I was writing. So I looked around to see if anyone had figured out a way to do at least basic printf debugging via SPI.

The best thing I found was people using another Arduino to forward SPI commands to the Arduino serial console. That seemed a bit clumsy when I already have a perfectly fine SPI chip in my ISP.

So I took a break from my guitar tuner to write an SPI debugging tool. I configured the AVR as an SPI slave, and then used PyUSB to send SPI commands with the USBTinyISP. In this particular case it was more insightful to plot numerical values than to print text, so I used matplotlib to make a very basic “SPI oscilloscope” to probe some variables on the AVR.
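The host side of such a tool boils down to polling samples and keeping a rolling buffer for the plot. A rough Python sketch of that structure; `read_sample` is a hypothetical stand-in for the actual PyUSB transfer to the USBTinyISP (synthesized here so the sketch runs on its own):

```python
from collections import deque
import math

BUFFER_SIZE = 256  # how many samples the "oscilloscope" shows at once

def read_sample(t):
    # Hypothetical stand-in for reading one byte from the AVR SPI slave
    # through the USBTinyISP with PyUSB; here we just synthesize a sine
    # wave so the sketch is self-contained.
    return int(127 + 127 * math.sin(t / 10))

buf = deque(maxlen=BUFFER_SIZE)
for t in range(1000):
    buf.append(read_sample(t))

# Handing `buf` to matplotlib's plot() on a timer gives the rolling
# numerical display described above.
print(len(buf), min(buf), max(buf))
```

The `deque` with `maxlen` is what makes the display scroll: old samples fall off the left as new ones arrive.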

I put the code on GitHub. It’s pretty basic right now, but if people besides me actually need this kind of thing, it may eventually grow into a nice Arduino library and command line tool.

## Creating a Gigabyte of NOPs

At the university we’re learning about computer architecture. The professor said that registers are used because reading from main memory is too slow. But then he also said the program is stored in main memory. How can the CPU ever execute an instruction every cycle if memory is so slow?

The answer appears to be caching. The CPU can store parts of the program in the cache, and apparently access it fast enough from there. But what if your program is bigger than the cache, or even bigger than memory? I assume it will be limited by the throughput of the RAM/disk. Let’s find out.

As a baseline I created a loop that executes a few NOPs: enough to neglect the loop overhead, but not so many that they spill the cache. I was very surprised to find that on my 2.2 GHz i7, it executed 11.5 billion NOPs per second. Wat? I thought I had made an error in my math, or that my NOPs got optimized away, but this was not the case.

It turns out that my CPU has a turbo frequency of 3 GHz and 4 execution units that can each execute an (independent) instruction per cycle. 3×4 = 12 billion instructions per second of single-core performance. Not bad.

Now what if it does not fit in cache? Let’s create a GB of NOPs. This was not so easy, and I used several “amplification” steps. First I compiled a C file to assembly with a CPP macro that generated 100 NOPs. Then I saved the NOPs in a separate file and used vim (`100dd` followed by `10000p`) to create a million NOPs. Then I used cat to concatenate ten of those files, then ten of those, and finally included the result ten times in the original assembly file. Then I compiled with `gcc -pipe -Wall -O0 -o bin10 wrap.S`, which took a good number of minutes.
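In retrospect, the same amplification could have been scripted. A sketch in Python that writes the NOPs in chunks to keep memory use sane (the filename, counts, and the bare `main:`/`ret` wrapper are illustrative and assume an x86-64 GNU toolchain):

```python
N = 1_000_000      # scale toward 1_000_000_000 for the full gigabyte
CHUNK = 100_000    # write in chunks to avoid building one giant string

with open("nops.S", "w") as f:
    f.write(".globl main\nmain:\n")
    for _ in range(N // CHUNK):
        f.write("nop\n" * CHUNK)
    f.write("ret\n")

# Then assemble and link it much like before:
#   gcc -pipe -O0 -o bignop nops.S
```

Since `nop` encodes to a single byte on x86, a billion of them yields roughly a 1 GB binary.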

The resulting file still runs at a respectable 7 billion NOPs per second. I was expecting much slower, but in retrospect this was to be expected, since the throughput of my DDR3 RAM is apparently around 10 GB/s. Much higher than I thought it would be.

To really slow down the program, I would need to make it so big that it doesn’t fit in RAM. Seeing how hard it was and how long it took to make a 1GB binary, a 20GB binary would require a new technique.

The other option is to generate a lot of jumps so that the speed becomes limited by the latency of RAM rather than the throughput. But again, generating a GB of jumps requires a completely new technique.

I’ll leave it at this for now, as I’ve already spent quite some time and learned a few interesting things. Now it doesn’t seem so odd anymore that for large codebases `-Os` apparently performs better than `-O3`.