Wishful Coding

Didn't you ever wish your
computer understood you?

Bugasnoo: rock your baby to sleep with Lego

I have recently become a dad, which is sometimes amazing and sometimes pure suffering, such as when the baby is in a fussy mood and refuses to sleep. Some friends were very enthusiastic about a $1.5k smart crib that makes white noise and rocks your baby to sleep.

You know what also makes noise, rocks your baby to sleep, and doesn’t cost $1.5k? That’s right, this Lego model. This is already by far my most useful Lego creation, and we’re using it regularly to great effect.

At the core of this creation are two bogies, driven by Lego motors, strapped to the wheel of a stroller with rubber bands. Each bogie is driven by two motors for a total of 4 driven wheels and 4 motors for maximum traction and power.

You could honestly get away with two motors and maybe gear them down 12:20. This design has excess torque and velocity and is limited by traction. The traction is provided by the rubber bands, and kept straight with coaster wheels.

This particular design uses parts from the Lego Mindstorms Robot Inventor kit, which has been discontinued. It should however be possible to construct a similar model from Powered Up motors using the Technic Hub.

[Building instructions]

ps: bugasnoo is obviously from buggy and snooze and not from any similar sounding trademarks ;)
pps: I have considered turning this into a product but I’d need a cofounder who’s more into Industrial Design Engineering and business.

Claude by the token in Open WebUI

Last month I subscribed to Claude Pro, but was dismayed to learn it doesn’t give you API access to use it in VS Code or Home Assistant or whatever. So I didn’t renew my subscription and instead bought API access, thinking I’d just use some chat app. Turns out it’s not that easy to find a good chat app where you can just plug in your API token.

The solution I settled on is to use LiteLLM with Open WebUI. Open WebUI is a great chat interface that is primarily used with Ollama, but it also supports OpenAI-compatible APIs. LiteLLM is a proxy that translates a ton of LLM APIs into a unified OpenAI-compatible API. Badabing badaboom: give LiteLLM your Anthropic key, plug it into Open WebUI, and Bob’s your uncle.

It’s actually great if you are a very heavy or a very casual user, because you pay by the token. That means if you use it only a little, it’s cheaper than Claude Pro, and if you use it a lot, you aren’t limited to a certain number of messages. Surprisingly, it also does better RAG than Claude, letting you do web searches and include more and bigger documents than would fit in the context window.

Here is my Docker Compose file to set it all up. It is modified from ollama-intel-gpu to include LiteLLM with an Anthropic config.yaml. If you’re on team green or red, you can just change the first image to ollama/ollama, I suppose.

services:
  ollama-intel-gpu:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ollama-intel-gpu
    image: ollama-intel-gpu:latest
    restart: always
    devices:
      # pass the Intel GPU through to the container
      - /dev/dri:/dev/dri
    volumes:
      - ollama-intel-gpu:/root/.ollama
    ports:
      - "11434:11434"
  ollama-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: ollama-webui
    volumes:
      - ollama-webui:/app/backend/data
    depends_on:
      - ollama-intel-gpu
      - litellm
    ports:
      - ${OLLAMA_WEBUI_PORT-3000}:8080
    environment:
      # point Open WebUI at both the local Ollama and the LiteLLM proxy
      - OLLAMA_BASE_URL=http://ollama-intel-gpu:11434
      - OPENAI_API_BASE_URL=http://litellm:4000
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: always
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    ports:
      - 4000:4000
    environment:
      # replace with your own Anthropic API key
      - ANTHROPIC_API_KEY=YOURKEYHERE
    restart: always
    command: --config /app/config.yaml
volumes:
  ollama-webui: {}
  ollama-intel-gpu: {}
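
For reference, the litellm_config.yaml that gets mounted into the LiteLLM container looks something like this. Take it as a sketch: the model names are just examples, so check the LiteLLM docs for the current Anthropic identifiers.

model_list:
  # model_name is what shows up in the Open WebUI model picker
  - model_name: claude-3-5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      # reads the key from the ANTHROPIC_API_KEY environment variable
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-3-haiku
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY

Open WebUI then shows these model names in its model picker right next to your local Ollama models.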

JPEG compress your LLM weights

So quantization is kinda bad lossy compression right? JPEG is good lossy compression. This may sound stupid, and maybe it is, but hear me out.

I’ve read that LLM performance is usually constrained by memory bandwidth, and for us plebs also by memory size. There is precedent in, for example, ZFS compression, which has been shown to increase disk performance when you’re IO constrained rather than compute constrained. So it might be beneficial to decompress LLM parameters on the fly, and if you’re doing that, you might want to use a good lossy compression algorithm instead of blunt quantization. It is said that compression is equivalent to general intelligence, so in that sense lossy compression would be expected to reduce intelligence, which is why you’d want a good compression ratio with minimal loss.

The way JPEG works is basically:

  • break the pixels down into chunks - after decompression, chunk boundaries are visible as JPEG artifacts
  • Discrete Cosine Transform them - a lossless transformation in the family of Fourier transforms
  • quantize them - this is where the data loss happens, creating longer runs
  • Run Length Encode them - this is where the compression happens

RLE is a lossless compression technique, which gets turbocharged by discarding some data to create longer runs. In the case of image data, the DCT concentrates most of the information in the low frequencies, so you can quantize the high frequencies heavily with only minor loss in image quality. Now, I don’t expect LLM parameters to be “smooth” like image data, so naive JPEG compression of LLM weights is not likely to be effective.
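
To make that concrete, here is a toy Python/SciPy sketch of the DCT-and-quantize step on a single smooth chunk of eight values. The quantization table is made up, nothing like real JPEG’s, it just shows the mechanism.

import numpy as np
from scipy.fft import dct, idct

# one smooth, image-like chunk of eight values
chunk = np.array([10.0, 10.5, 11.0, 11.2, 11.5, 11.8, 12.0, 12.1])

# the DCT concentrates the energy in the low-frequency coefficients
coeffs = dct(chunk, norm="ortho")

# quantization is where the data loss happens; coarse steps for the high
# frequencies turn them into zeros, which is what gives RLE its long runs
q_step = np.array([0.1, 0.5, 1.0, 1.0, 2.0, 2.0, 4.0, 4.0])
quantized = np.round(coeffs / q_step).astype(int)
print(quantized)  # roughly [319, -4, 0, 0, 0, 0, 0, 0]: one long zero run

# decompression: dequantize and inverse DCT; close to the original, but lossy
restored = idct(quantized * q_step, norm="ortho")
print(np.max(np.abs(restored - chunk)))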

BUT!

You can reorder the columns and rows of a matrix without affecting the result, provided you permute the input and output vectors to match. It’s like \(a+b+c=d \rightarrow c+b+a=d\). So you could reorder your rows and columns to maximize clustering of similar values. I’m not sure how you’d do this, maybe just sort by vector sum, or use a genetic algorithm, or some other cleverness.
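
Here is a quick NumPy sanity check of that, with sorting by row and column sums standing in for whatever clustering heuristic you would actually use:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))   # stand-in for a weight matrix
x = rng.normal(size=16)

# "cluster" by sorting rows and columns by their sums (a crude heuristic)
row_perm = np.argsort(W.sum(axis=1))
col_perm = np.argsort(W.sum(axis=0))
W_sorted = W[row_perm][:, col_perm]

# the reordered matrix gives the same result, as long as the input vector is
# permuted to match and the output is permuted back afterwards
y_sorted = W_sorted @ x[col_perm]
y = np.empty_like(y_sorted)
y[row_perm] = y_sorted           # undo the row permutation

assert np.allclose(y, W @ x)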

So my proposed LLM compression would work like this

  • reorder the matrices to improve value clustering
  • break down the values in chunks
  • DCT them
  • quantize them
  • RLE them

And then inference would

  • RLE expand a chunk
  • inverse DCT it
  • perform the multiplications

So the compressed data would exist in VRAM and be decompressed on the fly, chunk by chunk, to perform a matrix-vector product. It’d take more compute (11 multiplications per 8-value chunk for a fast inverse DCT, to be precise), but if you’re memory constrained it could be worth it.
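
To make the shape of the idea concrete, here is a rough NumPy sketch of both lists, the compression pass and the chunk-by-chunk matrix-vector product. The chunk size, the quantization table, and the Python-level RLE are all made up for illustration, and the reordering step is left out.

import numpy as np
from scipy.fft import dct, idct

CHUNK = 8
# made-up quantization table: fine steps for low frequencies, coarse for high ones
Q = np.array([0.02, 0.05, 0.1, 0.1, 0.2, 0.2, 0.5, 0.5])

def rle_encode(values):
    """Collapse a sequence into (value, run length) pairs."""
    runs, prev, count = [], values[0], 1
    for v in values[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

def rle_decode(runs):
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return np.array(out, dtype=float)

def compress(W):
    """Store each row as a list of RLE'd, quantized DCT chunks."""
    compressed = []
    for row in W:
        chunks = row.reshape(-1, CHUNK)             # break the row into chunks
        coeffs = dct(chunks, axis=1, norm="ortho")  # DCT each chunk
        q = np.round(coeffs / Q).astype(int)        # quantize: the lossy step
        compressed.append([rle_encode(list(c)) for c in q])
    return compressed

def matvec(compressed, x):
    """Compute y = W @ x, expanding one chunk at a time."""
    y = np.zeros(len(compressed))
    for i, row_chunks in enumerate(compressed):
        for j, runs in enumerate(row_chunks):
            coeffs = rle_decode(runs) * Q                 # RLE expand, dequantize
            w = idct(coeffs, norm="ortho")                # inverse DCT the chunk
            y[i] += w @ x[j * CHUNK:(j + 1) * CHUNK]      # multiply only this chunk
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 32))    # toy stand-in for a weight matrix
x = rng.normal(size=32)
print(np.max(np.abs(matvec(compress(W), x) - W @ x)))
# how big that error is, versus how long the zero runs get, is exactly the
# open question: it depends on how compressible the weights really are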

I guess the real question is whether you can obtain any useful clustering in LLM data. In a sense the parameters are already compressed (= intelligence), but there is no information in their order, so reordering and transforming the parameters could improve RLE compression without incurring extra quantization loss.