Wishful Coding

Didn't you ever wish your
computer understood you?

Claude by the token in Open WebUI

Last month I subscribed to Claude Pro, but was dismayed to learn it doesn’t give you API access to use it in VS Code or Home Assistant or whatever. So I didn’t renew my subscription and instead bought API access, thinking I’d just use some chat app. Turns out it’s not that easy to find a good chat app where you can just plug in your API token.

The solution I settled on is to use LiteLLM with Open WebUI. Open WebUI is a great chat interface that is primarily used with Ollama, but it also supports OpenAI compatible APIs. LiteLLM is a proxy that translates a ton of LLMs to a unified OpenAI compatible API. Badabing badaboom, give LiteLLM your Anthropic key, plug it into Open WebUI and Bob’s your uncle.

It’s actually great if you are a very heavy or very casual user because you pay by the token. That means if you use it only a little, it’s cheaper than Claude Pro, and if you use it a lot, you aren’t limited to a certain number of messages. Surprisingly it also does better RAG than the Claude app, letting you do web searches and include more and bigger documents than would fit in the context window.

Here is my Docker Compose file to set it all up. It is modified from ollama-intel-gpu to include LiteLLM with an Anthropic config.yaml. But if you’re on team green or red, you can just change the first service to use the ollama/ollama image I suppose.

services:
  ollama-intel-gpu:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ollama-intel-gpu
    image: ollama-intel-gpu:latest
    restart: always
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - ollama-intel-gpu:/root/.ollama
    ports:
      - "11434:11434"
  ollama-webui:
    image: ghcr.io/open-webui/open-webui:latest
    container_name: ollama-webui
    volumes:
      - ollama-webui:/app/backend/data
    depends_on:
      - ollama-intel-gpu
      - litellm
    ports:
      - ${OLLAMA_WEBUI_PORT-3000}:8080
    environment:
      - OLLAMA_BASE_URL=http://ollama-intel-gpu:11434
      - OPENAI_API_BASE_URL=http://litellm:4000
    extra_hosts:
      - host.docker.internal:host-gateway
    restart: always
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    ports:
      - 4000:4000
    environment:
      - ANTHROPIC_API_KEY=YOURKEYHERE
    restart: always
    command: --config /app/config.yaml
volumes:
  ollama-webui: {}
  ollama-intel-gpu: {}
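
The compose file mounts ./litellm_config.yaml, which isn't shown above. Here is a minimal sketch of what it could contain, following LiteLLM's model_list config format. The alias and the Anthropic model ID are assumptions on my part, so swap in whichever model your key has access to.

```yaml
# litellm_config.yaml (sketch; the model names below are assumptions)
model_list:
  - model_name: claude-3-5-sonnet            # alias that shows up in Open WebUI's model picker
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY  # read from the container's environment
```

With this in place the key only lives in the compose file's environment section, not in the config itself.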

JPEG compress your LLM weights

So quantization is kinda bad lossy compression right? JPEG is good lossy compression. This may sound stupid, and maybe it is, but hear me out.

I’ve read that LLM performance is usually constrained by memory bandwidth, and for us plebs also by memory size. There is a precedent in, for example, ZFS compression, which has been shown to increase disk performance because you’re IO constrained rather than compute constrained. So it might be beneficial to decompress LLM parameters on the fly, and if you’re doing that you might want to use a good lossy compression algorithm instead of blunt quantization. It is said that compression is equivalent to general intelligence, so in that sense lossy compression would be expected to reduce intelligence, and you’d want to get a good compression ratio with minimal loss.

The way JPEG works is basically

  • break down the pixels in chunks - after decompression chunk boundaries are visible as JPEG artifacts.
  • Discrete Cosine Transform them - lossless transformation in the family of Fourier transforms
  • quantize them - data loss happens here, creating longer runs
  • Run Length Encode them - compression happens here
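
The four steps above can be sketched in a few lines of Python. This is a toy 1-D version under my own assumptions (8-value chunks, one uniform quantization step, plain value/count RLE), not what a real JPEG encoder does, which works on 2-D blocks with per-frequency quantization tables and entropy coding on top.

```python
import math

def dct(block):
    """Orthonormal DCT-II of a 1-D block (lossless up to float rounding)."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(block))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def quantize(coeffs, step):
    """Round each coefficient to a multiple of `step`. Data loss happens here."""
    return [round(c / step) for c in coeffs]

def rle(values):
    """Run-length encode into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def compress(signal, chunk=8, step=1.0):
    """Chunk, DCT, quantize and RLE a flat list of numbers."""
    return [rle(quantize(dct(signal[i:i + chunk]), step))
            for i in range(0, len(signal), chunk)]

# A perfectly flat chunk collapses to one DC value plus a run of zeros.
print(compress([5.0] * 8))
```

The smoother the chunk, the more of the quantized coefficients end up as zeros, which is exactly what makes the runs long.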

RLE is a lossless compression technique, which gets turbocharged by discarding some data to create longer runs. In the case of image data, the DCT concentrates most information in the low frequencies so you can quantize high frequencies with minor loss in image quality. Now, I don’t expect LLM parameters to be “smooth” like image data, so naive JPEG compression of LLM weights is not likely to be effective.

BUT!

You can reorder the columns and rows of a matrix without affecting the result, as long as you reorder the input vector to match the columns and undo the row order on the output. It’s like \(a+b+c=d \rightarrow c+b+a=d\). So you could reorder your rows and columns to maximize clustering of similar values. Not sure how you’d do this, maybe just sort by vector sum, or some genetic algorithm, or other cleverness.
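
To make the reordering idea concrete, here is a small sketch using the "sort by vector sum" guess from above. The helper names are made up for illustration. The point is only that permuting the columns of W together with the entries of x leaves W·x unchanged, and permuting rows just permutes the output.

```python
def matvec(w, x):
    """Plain matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def column_order_by_sum(w):
    """Column indices sorted by column sum (one naive clustering heuristic)."""
    return sorted(range(len(w[0])), key=lambda j: sum(row[j] for row in w))

def row_order_by_sum(w):
    """Row indices sorted by row sum."""
    return sorted(range(len(w)), key=lambda i: sum(w[i]))

w = [[4, 1, 3],
     [5, 0, 2],
     [0, 1, 0]]
x = [10, 20, 30]

cols = column_order_by_sum(w)
rows = row_order_by_sum(w)

# Reorder the matrix, and reorder x the same way as the columns.
w_perm = [[w[i][j] for j in cols] for i in rows]
x_perm = [x[j] for j in cols]

# The result comes out in the permuted row order, so put it back.
y_perm = matvec(w_perm, x_perm)
y = [0] * len(rows)
for pos, i in enumerate(rows):
    y[i] = y_perm[pos]

print(y == matvec(w, x))
```

In a real network the permutation of x would have to be folded into whatever layer produces it, which I'm assuming can be done once, offline.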

So my proposed LLM compression would work like this

  • reorder the matrices to improve value clustering
  • break down the values in chunks
  • DCT them
  • quantize them
  • RLE them

And then inference would

  • RLE expand a chunk
  • inverse DCT it
  • perform the multiplications

So the compressed data would exist in VRAM and be decompressed on the fly chunk by chunk to perform a matrix vector product. It’d take more compute, 11 multiplications per 8-point inverse DCT to be precise if you use a fast algorithm like Loeffler’s, but if you’re memory constrained it could be worth it.
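
Here is a rough Python sketch of that inference loop, assuming the same toy scheme as before: 1-D chunks of 8, one uniform quantization step, value/count RLE. All function names are hypothetical, and the DCT is the slow textbook version rather than a fast one. Only one chunk of weights is ever expanded at a time.

```python
import math

def _scale(k, n):
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def dct(block):
    """Orthonormal DCT-II, used here only to build the compressed weights."""
    n = len(block)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above (an orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    return [v for v, count in runs for _ in range(count)]

def compress_row(row, chunk, step):
    """Offline step: chunk, DCT, quantize and RLE one matrix row."""
    return [rle_encode([round(c / step) for c in dct(row[i:i + chunk])])
            for i in range(0, len(row), chunk)]

def matvec_compressed(rows, x, chunk, step):
    """Matrix-vector product that decompresses one chunk at a time."""
    y = []
    for comp_row in rows:
        acc = 0.0
        for c, runs in enumerate(comp_row):
            coeffs = [q * step for q in rle_decode(runs)]   # RLE expand, dequantize
            weights = idct(coeffs)                          # inverse DCT
            xs = x[c * chunk:(c + 1) * chunk]
            acc += sum(w * v for w, v in zip(weights, xs))  # the multiplications
        y.append(acc)
    return y

w = [[0.10, 0.12, 0.11, 0.13, 0.50, 0.52, 0.51, 0.49],
     [0.90, 0.88, 0.91, 0.89, 0.20, 0.21, 0.19, 0.20]]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

compressed = [compress_row(row, chunk=8, step=0.01) for row in w]
approx = matvec_compressed(compressed, x, chunk=8, step=0.01)
exact = [sum(a * b for a, b in zip(row, x)) for row in w]
print(approx, exact)
```

The two printed vectors should be close but not identical. The gap is the quantization loss, and it grows with the step size.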

I guess the real question is if you can obtain any useful clustering in LLM data. In a sense the parameters are already compressed (= intelligence), but there is no information in their order, so reordering and transforming parameters could improve RLE compression without incurring extra quantization loss.

Backwards Game of Life

I got a little bit nerd sniped by the following video and decided to implement game of life in clojure.core.logic, because any logic program can be evaluated forwards and backwards.

Without further ado here is my implementation:

(ns pepijndevos.lifeclj
  (:refer-clojure :exclude [==])
  (:require [clojure.core.logic :refer :all])
  (:gen-class))

;; A helper to get the neighbouring cells.
;; Clips to zero.
(defn get-neighbours [rows x y]
  (for [dx (range -1 2)
        dy (range -1 2)
        :when (not (= dx dy 0))]
    (get-in rows [(+ x dx) (+ y dy)] 0)))

;; Produces binary vectors of a certain number of bits.
;; This is used to generate all neighbour combinations.
(defn bitrange [n]
  (sort-by #(apply + %)
           (for [i (range (bit-shift-left 1 n))]
             (vec (map #(bit-and 1 (bit-shift-right i %)) (range n))))))

;; Encode the game of life rules as a 256 element conde.
;; Depending on the number of ones in a vector,
;; the corresponding rule is generated
;; that equates the pattern to the neighbours
;; and the appropriate next state.
;;
;; This can be asked simply what the next state is for
;; given neighbours and current state.
;; OR you could drive it backwards any way you like.
(defn lifegoals [neigh self next]
  (or*
   (for [adj (bitrange 8)
         :let [n (apply + adj)]]
     (cond
       (or (< n 2) (> n 3)) (all (== next 0) (== neigh adj))
       (= n 3)              (all (== next 1) (== neigh adj))
       :else             (all (== next self) (== neigh adj))))))

;; Relate two grids to each other according to the above rules.
;; Applies lifegoals to every cell and its neighbours.
;; in the forwards direction executes one life step,
;; in the backwards direction generates grids
;; that would produce the next step.
(defn stepo [size vars next]
  (let [rows (->> vars (partition size) (map vec) (into []))
        neig (for [x (range size)
                   y (range size)]
               (get-neighbours rows x y))]
    (everyg #(apply lifegoals %) (map vector neig vars next))))

;; Make a grid of unbound variables.
(defn grid [size] (repeatedly (* size size) lvar))

;; Simply execute a life step on the state.
(defn fwdlife [size state]
  (let [vars (grid size)
        next (grid size)]
    (run 1 [q]
         (== q next)
         (== vars state)
         (stepo size vars next))))

;; Produce three backward steps on state.
(defn revlife [size state]
  (let [start (grid size)
        s1 (grid size)
        s2 (grid size)
        end (grid size)]
    (run 1 [q]
         (== q [start s1 s2 end])
         (== end state)
         (stepo size s2 end)
         (stepo size s1 s2)
         (stepo size start s1))))

;; Nicely print the board.
(defn printlife [size grids]
  (doseq [g grids]
    (doseq [row (->> g (partition size) (map vec) (into []))]
      (doseq [t row]
        (print t ""))
      (print "\n"))
    (print "\n")))

;; Test with a glider.
(defn -main [& args]
  (->> [0 0 0 0 0 0
        0 0 0 0 0 0
        0 0 0 1 1 0
        0 0 1 1 0 0
        0 0 0 0 1 0
        0 0 0 0 0 0]
       (revlife 6)
       first
       (printlife 6)))

output:

$ clj -Mrun
1 0 1 0 1 1 
1 0 0 0 0 1 
0 0 1 0 0 0 
0 0 0 0 0 1 
1 0 1 1 0 0 
1 0 1 1 1 1 

0 1 0 0 1 1 
0 0 0 1 1 1 
0 0 0 0 0 0 
0 1 1 1 0 0 
0 0 1 0 0 1 
0 0 1 0 1 0 

0 0 0 1 0 1 
0 0 0 1 0 1 
0 0 0 0 0 0 
0 1 1 1 0 0 
0 0 0 0 1 0 
0 0 0 1 0 0 

0 0 0 0 0 0 
0 0 0 0 0 0 
0 0 0 1 1 0 
0 0 1 1 0 0 
0 0 0 0 1 0 
0 0 0 0 0 0
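
As a sanity check on the output above, here is a small forward-only life step in plain Python (not core.logic, so it's independent of the solver). It assumes the same conventions as the Clojure code: a flat row-major grid, and cells outside the board counting as dead. Stepping the third printed grid forward should give back the glider.

```python
def step(grid, size):
    """One forward Game of Life step on a flat, row-major size x size grid."""
    def cell(r, c):
        # Out-of-bounds cells are treated as dead, like get-neighbours does.
        return grid[r * size + c] if 0 <= r < size and 0 <= c < size else 0

    nxt = []
    for r in range(size):
        for c in range(size):
            n = sum(cell(r + dr, c + dc)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            if n == 3:
                nxt.append(1)            # birth or survival
            elif n == 2:
                nxt.append(cell(r, c))   # state unchanged
            else:
                nxt.append(0)            # under- or overpopulation
    return nxt

# Third grid from the printed output.
before = [0, 0, 0, 1, 0, 1,
          0, 0, 0, 1, 0, 1,
          0, 0, 0, 0, 0, 0,
          0, 1, 1, 1, 0, 0,
          0, 0, 0, 0, 1, 0,
          0, 0, 0, 1, 0, 0]

# The glider that was fed to revlife.
glider = [0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0,
          0, 0, 0, 1, 1, 0,
          0, 0, 1, 1, 0, 0,
          0, 0, 0, 0, 1, 0,
          0, 0, 0, 0, 0, 0]

print(step(before, 6) == glider)
```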

Sadly, this is nowhere near fast enough to solve the play button problem.