<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>marin_.log</title>
        <link>https://velog.io/</link>
        <description>크로아티아 출신 Go 개발자입니다. 한국 문화와 홈 오토메이션, 셀프 호스팅에 관심이 많습니다. 한국어는 아직 배우는 중이라 서툴지만, 커피챗이나 언어 교환은 언제든 환영합니다!</description>
        <lastBuildDate>Fri, 27 Feb 2026 07:55:45 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>marin_.log</title>
            <url>https://velog.velcdn.com/images/marin_/profile/ab47226c-3d54-44ea-8975-42ca23e52c32/image.jpg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. marin_.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/marin_" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[Building an LLM from Scratch in Go Part 1: BPE Tokenizer]]></title>
            <link>https://velog.io/@marin_/Building-an-LLM-from-Scratch-in-Go-Part-1-BPE-Tokenizer</link>
            <guid>https://velog.io/@marin_/Building-an-LLM-from-Scratch-in-Go-Part-1-BPE-Tokenizer</guid>
            <pubDate>Fri, 27 Feb 2026 07:55:45 GMT</pubDate>
            <description><![CDATA[<p>I recently picked up <a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build a Large Language Model (From Scratch)</a> by Sebastian Raschka. The book walks you through building a GPT-style language model step by step — from raw text all the way to a trained model. It&#39;s Python-first, but I&#39;m a Go developer, so I&#39;m implementing everything in Go as I follow along.</p>
<h2 id="the-book">The Book</h2>
<p>The book is structured around understanding, not shortcuts. Instead of importing a library and calling it a day, you build each component from scratch and understand why it works. Chapter 1 covers the big picture of LLMs, what they are, how they differ from earlier models, and the overall architecture. Chapter 2 gets into the first real implementation: text tokenization.</p>
<p>By the end of chapter 2, you have a working BPE tokenizer (the same kind used in GPT-2 and GPT-3). That&#39;s what this post covers.</p>
<hr>
<h2 id="what-is-a-tokenizer">What is a Tokenizer?</h2>
<p>Before a language model can process text, the text needs to be converted into numbers. A tokenizer does this job. It takes a string like <code>&quot;Hello, world!&quot;</code> and converts it into a list of integers:</p>
<pre><code>&quot;Hello, world!&quot; → [15496, 11, 995, 0]</code></pre><p>Each integer is a <strong>token ID</strong> — an index into the model&#39;s vocabulary. The model never sees the raw text, only these numbers. Decoding is the reverse: given a list of token IDs, reconstruct the original string.</p>
<hr>
<h2 id="the-vocabulary-r50k_basetiktoken">The Vocabulary: r50k_base.tiktoken</h2>
<p>Instead of training a vocabulary from scratch, I&#39;m using OpenAI&#39;s pre-built <code>r50k_base</code> vocabulary. It contains ~50,000 tokens and comes in the tiktoken format, where each line is:</p>
<pre><code>&lt;base64_encoded_token_bytes&gt; &lt;token_id&gt;</code></pre><p>For example:</p>
<pre><code>SGVsbG8= 15496</code></pre><p><code>SGVsbG8=</code> is the base64 encoding of the bytes for <code>Hello</code>, and <code>15496</code> is its token ID.</p>
<p>The first 256 entries cover every possible single byte (0x00–0xFF), which guarantees that any input can always be encoded. The remaining ~49,700 entries are multi-byte merge tokens built up during BPE training.</p>
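<p>To sanity-check the format, a few lines of Go can decode the example entry above (a standalone snippet, not part of the tokenizer itself):</p>
<pre><code class="language-go">package main

import (
    &quot;encoding/base64&quot;
    &quot;fmt&quot;
)

func main() {
    // Decode the token-bytes half of the line &quot;SGVsbG8= 15496&quot;.
    tokenBytes, err := base64.StdEncoding.DecodeString(&quot;SGVsbG8=&quot;)
    if err != nil {
        panic(err)
    }
    fmt.Printf(&quot;%s\n&quot;, tokenBytes) // Hello
}</code></pre>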
<h3 id="loading-the-vocabulary-in-go">Loading the vocabulary in Go</h3>
<p>I use Go&#39;s <code>embed</code> package to bundle the file directly into the binary, then parse it line by line:</p>
<pre><code class="language-go">//go:embed resources/r50k_base.tiktoken
var resource embed.FS

var (
    Vocabulary    = make(map[string]int)
    VacabularyInv = make(map[int]string)
)

func LoadVocabulary() error {
    fd, err := resource.Open(&quot;resources/r50k_base.tiktoken&quot;)
    if err != nil {
        return fmt.Errorf(&quot;fail to open file: %w&quot;, err)
    }
    defer fd.Close()

    reader := bufio.NewReader(fd)
    for {
        line, err := reader.ReadString(&#39;\n&#39;)
        if err != nil &amp;&amp; err != io.EOF {
            return fmt.Errorf(&quot;fail to read line: %w&quot;, err)
        }

        // strings.Fields also handles the final line when the file has
        // no trailing newline.
        if fields := strings.Fields(line); len(fields) == 2 {
            tokenBytes, decErr := base64.StdEncoding.DecodeString(fields[0])
            if decErr != nil {
                return fmt.Errorf(&quot;fail to decode token: %w&quot;, decErr)
            }
            tokenID, convErr := strconv.Atoi(fields[1])
            if convErr != nil {
                return fmt.Errorf(&quot;fail to parse token id: %w&quot;, convErr)
            }

            key := string(tokenBytes)
            Vocabulary[key] = tokenID
            VacabularyInv[tokenID] = key
        }

        if err == io.EOF {
            break
        }
    }
    return nil
}</code></pre>
<p>Two maps: one for encoding (<code>token bytes → ID</code>) and one for decoding (<code>ID → token bytes</code>).</p>
<hr>
<h2 id="step-1-pre-tokenization-with-regex">Step 1: Pre-tokenization with Regex</h2>
<p>BPE doesn&#39;t run on the entire input at once. First the text is split into chunks using a regex. This keeps merges from crossing word boundaries (you don&#39;t want the trailing <code>&quot;s&quot;</code> of <code>&quot;dogs&quot;</code> merging with a token from the following word).</p>
<p>GPT-2 uses a specific regex for this. Go&#39;s standard <code>regexp</code> package (RE2) does handle Unicode categories like <code>\p{L}</code> (letters) and <code>\p{N}</code> (numbers), but it doesn&#39;t support the negative lookahead <code>(?!\S)</code> that the pattern relies on, so I reached for <code>github.com/dlclark/regexp2</code>:</p>
<pre><code class="language-go">var expr = regexp2.MustCompile(
    `&#39;s|&#39;t|&#39;re|&#39;ve|&#39;m|&#39;ll|&#39;d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
    0,
)</code></pre>
<p>For input <code>&quot;Hello, world!&quot;</code> this produces: <code>[&quot;Hello&quot;, &quot;,&quot;, &quot; world&quot;, &quot;!&quot;]</code></p>
<p>Each chunk is then encoded <strong>independently</strong> by the BPE algorithm.</p>
<hr>
<h2 id="step-2-byte-pair-encoding-bpe">Step 2: Byte Pair Encoding (BPE)</h2>
<p>BPE is the core algorithm. The key insight: you don&#39;t train anything here; the training already happened when OpenAI built the r50k_base vocabulary. Your job is just the <strong>encoding</strong> side: given a chunk of text, split it into vocabulary tokens by replaying the merges in training-priority order.</p>
<h3 id="how-it-works">How it works</h3>
<p>Take the chunk <code>&quot; world&quot;</code> (note the leading space — GPT-2 treats <code>&quot; world&quot;</code> and <code>&quot;world&quot;</code> as different tokens).</p>
<p><strong>1. Initialize as individual bytes</strong></p>
<pre><code>parts = [ [0x20], [0x77], [0x6F], [0x72], [0x6C], [0x64] ]
          &quot; &quot;      &quot;w&quot;     &quot;o&quot;     &quot;r&quot;     &quot;l&quot;     &quot;d&quot;</code></pre><p>Every single byte is guaranteed to be in the vocabulary, so this starting state is always valid.</p>
<p><strong>2. Find the best merge candidate</strong></p>
<p>Look at every adjacent pair and check if their concatenation exists in the vocabulary. Among all valid pairs, pick the one with the <strong>lowest token ID</strong> — lower ID means it was merged earlier during training, giving it higher priority.</p>
<p><strong>3. Merge that pair</strong></p>
<p>Replace the two adjacent tokens with their concatenation. The slice shrinks by one element.</p>
<p><strong>4. Repeat until no valid merges remain</strong></p>
<p><strong>5. Return token IDs</strong></p>
<p>Look up each remaining token in the vocabulary and return the IDs.</p>
<h3 id="the-go-implementation">The Go implementation</h3>
<pre><code class="language-go">func (b *BPE) Encode(chunk []byte) []int {
    // Step 1: one element per byte
    parts := make([][]byte, len(chunk))
    for i, c := range chunk { // c, not b: avoids shadowing the receiver
        parts[i] = []byte{c}
    }

    for {
        bestRank := math.MaxInt
        bestIdx  := -1

        // Step 2: scan all adjacent pairs
        for i := 0; i &lt; len(parts)-1; i++ {
            merged := append(append([]byte{}, parts[i]...), parts[i+1]...)
            rank, ok := Vocabulary[string(merged)]
            if ok &amp;&amp; rank &lt; bestRank {
                bestRank = rank
                bestIdx  = i
            }
        }

        // No valid merge found — done
        if bestIdx == -1 {
            break
        }

        // Step 3: merge at bestIdx
        merged := append(append([]byte{}, parts[bestIdx]...), parts[bestIdx+1]...)
        parts = append(parts[:bestIdx+1], parts[bestIdx+2:]...)
        parts[bestIdx] = merged
    }

    // Step 5: convert to token IDs
    result := make([]int, len(parts))
    for i, part := range parts {
        result[i] = Vocabulary[string(part)]
    }
    return result
}</code></pre>
<h3 id="one-subtle-go-gotcha">One subtle Go gotcha</h3>
<p>You&#39;ll notice the merging always uses:</p>
<pre><code class="language-go">append(append([]byte{}, parts[i]...), parts[i+1]...)</code></pre>
<p>instead of the simpler:</p>
<pre><code class="language-go">append(parts[i], parts[i+1]...)</code></pre>
<p>The simpler version is a bug. If <code>parts[i]</code> has spare capacity in its underlying array (which can happen after a merge), the <code>append</code> will write <code>parts[i+1]</code> directly into that memory, silently corrupting another token&#39;s data. The double-append forces a fresh allocation every time, so nothing is shared.</p>
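<p>The aliasing is easy to reproduce outside the tokenizer. This standalone demonstration (my own, not from the book) shows an in-place <code>append</code> overwriting a neighbouring slice&#39;s bytes:</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

func main() {
    buf := []byte(&quot;hello world&quot;)
    a := buf[:5] // len 5, cap 11: spare capacity reaches into buf

    // append finds room in place (5+3 &lt;= 11), so it writes straight
    // over buf&#39;s bytes instead of allocating.
    _ = append(a, []byte(&quot;XYZ&quot;)...)
    fmt.Printf(&quot;%q\n&quot;, buf) // &quot;helloXYZrld&quot;

    // The double-append pattern copies into a fresh array first,
    // so buf is left alone.
    merged := append(append([]byte{}, a...), []byte(&quot;QQ&quot;)...)
    fmt.Printf(&quot;%q\n&quot;, merged) // &quot;helloQQ&quot;
}</code></pre>
<p>In the tokenizer&#39;s merge loop this failure mode would be intermittent and data-dependent, which is exactly why the defensive copy is worth the extra allocation.</p>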
<hr>
<h2 id="step-3-wiring-it-together">Step 3: Wiring it Together</h2>
<p>The <code>Tokenizer</code> owns a <code>BPE</code> instance and orchestrates the full pipeline:</p>
<pre><code class="language-go">type Tokenizer struct {
    bpe *BPE
}

func NewTokenizer() *Tokenizer {
    if err := LoadVocabulary(); err != nil {
        panic(err)
    }
    return &amp;Tokenizer{bpe: new(BPE)}
}

func (t *Tokenizer) Encode(input string) []int {
    match, err := expr.FindStringMatch(input)
    if err != nil {
        panic(err)
    }
    var result []int
    for match != nil {
        ids := t.bpe.Encode([]byte(match.String()))
        result = append(result, ids...)
        match, _ = expr.FindNextMatch(match)
    }
    return result
}

func (t *Tokenizer) Decode(input []int) string {
    var result []byte
    for _, id := range input {
        result = append(result, []byte(VacabularyInv[id])...)
    }
    return string(result)
}</code></pre>
<p>The flow is:</p>
<ol>
<li>Regex splits input into chunks</li>
<li>Each chunk goes through BPE → token IDs</li>
<li>All IDs are concatenated into one slice</li>
</ol>
<p>Decoding is simpler: just look up each ID in the inverse vocabulary and concatenate the bytes.</p>
<hr>
<h2 id="does-it-work">Does it Work?</h2>
<pre><code class="language-go">t := NewTokenizer()

ids := t.Encode(&quot;Hello, world!&quot;)
fmt.Println(ids)          // [15496 11 995 0]

fmt.Println(t.Decode(ids)) // Hello, world!</code></pre>
<p>The round-trip works. <code>Decode(Encode(s)) == s</code> for any input.</p>
<hr>
<h2 id="whats-next">What&#39;s Next</h2>
<p>Part 2 will continue into the next chapter: building the data loader that feeds tokenized text into the model in fixed-size chunks with a sliding window. That&#39;s where the training pipeline starts to take shape.</p>
<p>The full source is on GitHub: <a href="https://github.com/MarinX/llm-from-scratch">github.com/MarinX/llm-from-scratch</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Controlling LG Heat Pump via Modbus on Home Assistant]]></title>
            <link>https://velog.io/@marin_/Controlling-LG-Heat-Pump-via-Modbus-on-Home-Assistant</link>
            <guid>https://velog.io/@marin_/Controlling-LG-Heat-Pump-via-Modbus-on-Home-Assistant</guid>
            <pubDate>Thu, 29 Jan 2026 12:05:50 GMT</pubDate>
            <description><![CDATA[<p>I have an LG Therma V heat pump (inside unit: HN0916T.NB1, outside unit: HU091MR.U44) with a 200L DHW (Domestic Hot Water) tank, and I wanted to integrate it with Home Assistant. After some research, I found that the heat pump supports Modbus communication, which opened up a lot of possibilities.</p>
<p>In this post, I&#39;ll share how I connected everything and what configuration worked for me.</p>
<h2 id="the-architecture">The Architecture</h2>
<p>Here&#39;s an overview of my setup:</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/9a5e08d3-65a9-459e-a379-f05d400e3527/image.png" alt=""></p>
<p>The data flows like this:</p>
<ul>
<li><strong>Modbus</strong> ↔ <strong>Home Assistant</strong>: Two-way communication with the heat pump (reading sensors, sending commands)</li>
<li><strong>Home Assistant</strong> → <strong>InfluxDB</strong>: Storing historical data for analysis</li>
<li><strong>Grafana</strong> ← <strong>InfluxDB</strong>: Visualizing the data in dashboards</li>
</ul>
<h2 id="what-is-modbus">What is Modbus?</h2>
<p>Modbus is an industrial communication protocol that&#39;s been around since 1979. It&#39;s simple, reliable, and widely used in industrial equipment – including heat pumps.</p>
<p>There are two main types:</p>
<ul>
<li><strong>Modbus RTU</strong>: Uses serial communication (RS485 wires)</li>
<li><strong>Modbus TCP</strong>: Uses ethernet/IP network</li>
</ul>
<p>The LG heat pump speaks Modbus RTU, but for Home Assistant it&#39;s far more convenient to talk Modbus TCP over the network than to run a serial line all the way to the server. To bridge the gap, I needed a Modbus TCP/IP gateway module.</p>
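<p>To make the RTU/TCP difference concrete, here&#39;s a small Go sketch (purely illustrative, not part of my setup) that builds the request frame a Modbus TCP client sends for a single input-register read. The 7-byte MBAP header at the front is what replaces RTU&#39;s slave-address byte and CRC on the wire:</p>
<pre><code class="language-go">package main

import (
    &quot;encoding/binary&quot;
    &quot;fmt&quot;
)

// readInputRegistersFrame builds a Modbus TCP frame (MBAP header + PDU)
// for function 0x04, &quot;Read Input Registers&quot;.
func readInputRegistersFrame(txID uint16, unitID byte, addr, count uint16) []byte {
    frame := make([]byte, 12)
    binary.BigEndian.PutUint16(frame[0:2], txID) // transaction ID
    binary.BigEndian.PutUint16(frame[2:4], 0)    // protocol ID, always 0
    binary.BigEndian.PutUint16(frame[4:6], 6)    // bytes remaining after this field
    frame[6] = unitID                            // slave/unit ID
    frame[7] = 0x04                              // function code
    binary.BigEndian.PutUint16(frame[8:10], addr)
    binary.BigEndian.PutUint16(frame[10:12], count)
    return frame
}

func main() {
    // One register at address 5, from unit 1.
    fmt.Printf(&quot;% x\n&quot;, readInputRegistersFrame(1, 1, 5, 1))
    // 00 01 00 00 00 06 01 04 00 05 00 01
}</code></pre>
<p>The gateway&#39;s job is essentially this translation: strip the MBAP header, add the RTU address byte and CRC, and forward the PDU over RS485.</p>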
<h2 id="hardware-i-used">Hardware I Used</h2>
<p>Here&#39;s what I gathered for this project:</p>
<img src="https://velog.velcdn.com/images/marin_/post/1afd41f4-e25b-4c83-8dbb-4e7b4b0627c9/image.jpg" width=500 />


<ul>
<li><strong>Ethernet cable</strong> – To connect the gateway to my network</li>
<li><strong>RS485 cable</strong> – The communication line between the heat pump and the gateway</li>
<li><strong>Waveshare RS485 to PoE ETH module</strong> – This converts Modbus RTU to Modbus TCP</li>
</ul>
<h2 id="making-the-connection">Making the Connection</h2>
<p>Here&#39;s how I connected the RS485 cable inside the heat pump&#39;s control board:</p>
<img src="https://velog.velcdn.com/images/marin_/post/2d49cfbb-d279-40e0-b7dc-7e7f7fe254ba/image.jpg" width=300 />

<h2 id="home-assistant-configuration">Home Assistant Configuration</h2>
<p>With the hardware in place, I moved on to configuring Home Assistant.</p>
<h3 id="basic-modbus-connection">Basic Modbus Connection</h3>
<p>I added this to my <code>/homeassistant/configuration.yaml</code>:</p>
<pre><code class="language-yaml">modbus:
  - name: &quot;LG Therma V&quot;
    delay: 1
    timeout: 14
    message_wait_milliseconds: 200
    host: &quot;device-ip-address-on-local-lan&quot;
    port: 4196
    type: tcp</code></pre>
<p>Here&#39;s what each setting does:</p>
<ul>
<li><strong>name</strong>: A friendly name for the connection</li>
<li><strong>delay</strong>: Wait time in seconds before the first request</li>
<li><strong>timeout</strong>: How long to wait for a response (14 seconds works well for heat pumps)</li>
<li><strong>message_wait_milliseconds</strong>: Pause between messages to avoid overwhelming the device</li>
<li><strong>host</strong>: The Modbus gateway&#39;s IP address (I set a static IP for this)</li>
<li><strong>port</strong>: The TCP port (4196 is the default for Waveshare modules)</li>
<li><strong>type</strong>: TCP connection type</li>
</ul>
<h2 id="writing-modbus-queries">Writing Modbus Queries</h2>
<p>To find the correct register addresses, I consulted the LG manual for my model. It lists all the Modbus registers and what data they contain.</p>
<h3 id="binary-sensors">Binary Sensors</h3>
<p>For monitoring on/off states like whether the pump is running, I configured a binary sensor:</p>
<pre><code class="language-yaml">modbus:
  binary_sensors:
    - name: &quot;LG Therma V Pump Running&quot;
      unique_id: &quot;lg_therma_v_pump_running&quot;
      address: 1
      slave: 1
      scan_interval: 20
      device_class: running
      input_type: discrete_input</code></pre>
<p>Key settings:</p>
<ul>
<li><strong>address</strong>: The Modbus register address from the LG manual</li>
<li><strong>slave</strong>: Set to 1 for single-device setups</li>
<li><strong>scan_interval</strong>: Polling frequency in seconds (20 seconds is reasonable)</li>
<li><strong>input_type</strong>: <code>discrete_input</code> for read-only binary values</li>
</ul>
<h3 id="sensors">Sensors</h3>
<p>For reading temperature values, I set up sensors like this one for DHW temperature:</p>
<pre><code class="language-yaml">modbus:
  sensors:
    - name: &quot;LG Therma V DHW Temp&quot;
      unique_id: &quot;lg_therma_v_dhw_temperature&quot;
      scale: 0.1
      precision: 1
      scan_interval: 20
      address: 5 # reg 6
      slave: 1
      unit_of_measurement: °C
      device_class: temperature
      input_type: input</code></pre>
<p>Important settings:</p>
<ul>
<li><strong>scale</strong>: LG reports temperature as integers (e.g., 445 = 44.5°C), so I multiply by 0.1</li>
<li><strong>precision</strong>: Number of decimal places to display</li>
<li><strong>device_class</strong>: Tells Home Assistant this is a temperature sensor</li>
<li><strong>input_type</strong>: <code>input</code> for read-only registers</li>
</ul>
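<p>The <code>scale</code> math, spelled out in Go terms (my own sketch; the signed reading for sub-zero temperatures is an assumption worth checking against the LG manual):</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

func main() {
    // Modbus registers are 16-bit; LG reports tenths of a degree.
    raw := uint16(445)

    // Casting through int16 would also cover sub-zero values encoded
    // as two&#39;s complement (assumption; verify against the manual).
    temp := float64(int16(raw)) * 0.1
    fmt.Printf(&quot;%.1f°C\n&quot;, temp) // 44.5°C
}</code></pre>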
<h3 id="switches">Switches</h3>
<p>To control the heat pump, I configured switches like:</p>
<pre><code class="language-yaml">modbus:
  switches:
    - name: &quot;LG Therma V Underfloor&quot;
      unique_id: &quot;lg_therma_v_underfloor_on_off&quot;
      slave: 1
      address: 0
      write_type: coil
      command_on: 1
      command_off: 0
      verify:
        input_type: coil
        address: 0
        state_on: 1
        state_off: 0</code></pre>
<p>Key settings:</p>
<ul>
<li><strong>write_type</strong>: <code>coil</code> for boolean writes</li>
<li><strong>verify</strong>: Reads back the state to confirm the command was executed</li>
</ul>
<h3 id="climate-entities">Climate Entities</h3>
<p>For a complete thermostat experience with current temperature and target adjustment, I used a climate entity:</p>
<pre><code class="language-yaml">modbus:
  climates:
    - name: &quot;LG Therma V Underfloor&quot;
      unique_id: &quot;lg_therma_v_underfloor&quot;
      address: 7
      slave: 1
      input_type: input
      max_temp: 33
      min_temp: 16
      offset: 0
      precision: 0
      scale: 0.1
      target_temp_register: 2
      temp_step: 1
      temperature_unit: C
      hvac_mode_register:
        address: 0
        values:
          state_heat: 4</code></pre>
<p>This creates a proper thermostat card in Home Assistant where I can see the current temperature and adjust the target.</p>
<hr>
<h2 id="bonus-recording-data-to-influxdb">Bonus: Recording Data to InfluxDB</h2>
<p>I wanted to keep historical data for analysis, so I added InfluxDB to my setup.</p>
<h3 id="what-is-influxdb">What is InfluxDB?</h3>
<p>InfluxDB is a time-series database designed specifically for data that changes over time (like temperatures and power consumption). It handles large amounts of time-stamped data efficiently.</p>
<h3 id="home-assistant--influxdb-integration">Home Assistant + InfluxDB Integration</h3>
<p>Home Assistant has built-in support for InfluxDB. I added this to my <code>configuration.yaml</code>:</p>
<pre><code class="language-yaml">influxdb:
  api_version: 2
  ssl: false
  host: your-ip
  port: 8086
  token: influxdb-token
  organization: your-org
  bucket: homeassistant
  tags:
    source: HA
  tags_attributes:
    - friendly_name
  default_measurement: units</code></pre>
<p>Configuration notes:</p>
<ul>
<li><strong>api_version</strong>: Version 2 for modern InfluxDB installations</li>
<li><strong>ssl</strong>: Set to <code>true</code> if using HTTPS</li>
<li><strong>token</strong>: Generated in InfluxDB&#39;s web interface</li>
<li><strong>organization</strong> &amp; <strong>bucket</strong>: These need to be created in InfluxDB first</li>
<li><strong>tags</strong>: Useful for filtering data later</li>
</ul>
<p>With this configuration, every sensor update in Home Assistant gets automatically logged to InfluxDB.</p>
<hr>
<h2 id="grafana-visualizing-the-data">Grafana: Visualizing the Data</h2>
<p>To create dashboards from the stored data, I use Grafana. It connects to InfluxDB and provides flexible visualization options.</p>
<p>Here&#39;s my current dashboard:</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/12cfe85a-3b35-495a-af51-f00b49dd5be5/image.png" alt=""></p>
<p>I can monitor:</p>
<ul>
<li>Pump and compressor status</li>
<li>DHW temperature over time</li>
<li>Power consumption</li>
<li>Flow rates</li>
<li>And more</li>
</ul>
<p>Setting up Grafana involves:</p>
<ol>
<li>Installing Grafana (I used Docker)</li>
<li>Adding InfluxDB as a data source</li>
<li>Creating dashboards and panels</li>
</ol>
<hr>
<h2 id="summary">Summary</h2>
<p>This setup gives me full visibility and control over my LG Therma V heat pump through Home Assistant. I can:</p>
<ul>
<li>Monitor temperatures and status in real-time</li>
<li>Control the heat pump remotely</li>
<li>Store historical data for analysis</li>
<li>Visualize everything in Grafana dashboards</li>
</ul>
<p>The total cost was much lower than LG&#39;s official smart home solutions, and I have complete control over my data.</p>
<hr>
<p>감사합니다 for reading! If you have questions or spot any issues, feel free to leave a comment.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Getting Started with Local LLM Code Completion in Neovim using Ollama]]></title>
            <link>https://velog.io/@marin_/Getting-Started-with-Local-LLM-Code-Completion-in-Neovim-using-Ollama</link>
            <guid>https://velog.io/@marin_/Getting-Started-with-Local-LLM-Code-Completion-in-Neovim-using-Ollama</guid>
            <pubDate>Mon, 19 Jan 2026 13:40:35 GMT</pubDate>
            <description><![CDATA[<p>So you want AI-powered code completion but you&#39;re not keen on sending your code to the cloud? Maybe you work on proprietary code, maybe you value privacy, or maybe you just want to avoid that awkward moment when your internet dies mid-completion and you realize you&#39;ve forgotten how to write a for loop. Whatever your reason, running LLMs locally is the answer.</p>
<p>In this guide, we&#39;ll set up <a href="https://ollama.com/">Ollama</a> with the Qwen2.5-Coder model and hook it into Neovim using the <a href="https://github.com/milanglacier/minuet-ai.nvim">minuet-ai.nvim</a> plugin. By the end, you&#39;ll have GitHub Copilot-like completions running entirely on your machine.</p>
<h2 id="why-local-llms">Why Local LLMs?</h2>
<ul>
<li><strong>Privacy</strong>: Your code never leaves your machine. Your embarrassing variable names stay between you and your GPU.</li>
<li><strong>No subscription fees</strong>: Once set up, it&#39;s free forever. Your wallet will thank you.</li>
<li><strong>Offline capability</strong>: Works on airplanes, in basements, and during apocalyptic internet outages.</li>
<li><strong>Latency</strong>: With a decent GPU, local inference can actually be faster than cloud APIs.</li>
</ul>
<h2 id="installing-ollama">Installing Ollama</h2>
<p>Ollama is a lightweight runtime for running LLMs locally. Installation is refreshingly simple:</p>
<pre><code class="language-sh">curl -fsSL https://ollama.com/install.sh | sh</code></pre>
<h3 id="verify-the-installation">Verify the installation</h3>
<pre><code class="language-sh">ollama -v</code></pre>
<p>If you see a version number, congratulations—you&#39;ve successfully installed software. The bar was low, but you cleared it.</p>
<h2 id="configuring-ollama-for-optimal-performance">Configuring Ollama for Optimal Performance</h2>
<p>The default Ollama configuration works, but we can do better. On Linux, we&#39;ll override the systemd service with custom environment variables:</p>
<pre><code class="language-sh">sudo systemctl edit ollama</code></pre>
<p>This opens an override file. Add the following configuration:</p>
<pre><code class="language-conf">[Service]
Environment=&quot;OLLAMA_HOST=0.0.0.0&quot;
Environment=&quot;OLLAMA_FLASH_ATTENTION=1&quot;
Environment=&quot;OLLAMA_KV_CACHE_TYPE=q4_0&quot;
Environment=&quot;OLLAMA_NUM_PARALLEL=2&quot;
Environment=&quot;OLLAMA_KEEP_ALIVE=10m&quot;</code></pre>
<p>Let me explain what each of these does:</p>
<table>
<thead>
<tr>
<th>Variable</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>OLLAMA_HOST=0.0.0.0</code></td>
<td>Makes Ollama accessible from any network interface, not just localhost. Useful if you want to access it from other devices on your network.</td>
</tr>
<tr>
<td><code>OLLAMA_FLASH_ATTENTION=1</code></td>
<td>Enables <a href="https://github.com/ollama/ollama/pull/6279">Flash Attention</a>, which significantly reduces memory usage as context size grows. This is required for KV cache quantization.</td>
</tr>
<tr>
<td><code>OLLAMA_KV_CACHE_TYPE=q4_0</code></td>
<td>Quantizes the key-value cache to 4-bit, using approximately <strong>1/4 the memory</strong> compared to the default fp16. There&#39;s a slight quality trade-off, but for code completion it&#39;s barely noticeable. Use <code>q8_0</code> if you have more VRAM and want higher quality.</td>
</tr>
<tr>
<td><code>OLLAMA_NUM_PARALLEL=2</code></td>
<td>Allows processing multiple requests simultaneously. Useful if you&#39;re trigger-happy with your completions.</td>
</tr>
<tr>
<td><code>OLLAMA_KEEP_ALIVE=10m</code></td>
<td>Keeps the model loaded in memory for 10 minutes after the last request. Default is 5 minutes. Longer means faster first response, but uses more memory.</td>
</tr>
</tbody></table>
<p>Don&#39;t forget to restart the service:</p>
<pre><code class="language-sh">sudo systemctl restart ollama</code></pre>
<h2 id="choosing-the-right-model-qwen25-coder">Choosing the Right Model: Qwen2.5-Coder</h2>
<p>We&#39;re using <a href="https://qwenlm.github.io/blog/qwen2.5-coder-family/">Qwen2.5-Coder</a>, and here&#39;s why it&#39;s an excellent choice:</p>
<ul>
<li><strong>Fill-in-the-Middle (FIM) support</strong>: Essential for code completion—it can complete code in the middle of a function, not just at the end.</li>
<li><strong>40+ programming languages</strong>: From Python to Haskell to Racket (yes, really).</li>
<li><strong>5.5 trillion tokens of training data</strong>: It has seen more code than any human ever will.</li>
</ul>
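<p>To see what FIM actually means on the wire: the model is prompted with the text before and after your cursor wrapped in sentinel tokens, and it generates the middle. A sketch of the prompt assembly (sentinel strings as published for the Qwen2.5-Coder family; double-check them for your model version):</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// buildFIMPrompt wraps the code before and after the cursor in the
// fill-in-the-middle sentinels used by Qwen2.5-Coder.
func buildFIMPrompt(prefix, suffix string) string {
    return &quot;&lt;|fim_prefix|&gt;&quot; + prefix +
        &quot;&lt;|fim_suffix|&gt;&quot; + suffix +
        &quot;&lt;|fim_middle|&gt;&quot;
}

func main() {
    // The completion the model streams back is the function body.
    fmt.Println(buildFIMPrompt(&quot;func add(a, b int) int {\n\t&quot;, &quot;\n}&quot;))
}</code></pre>
<p>Either the client or the model&#39;s prompt template applies this wrapping for you; the sketch is just to demystify what the model ultimately sees.</p>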
<p>Pull the model:</p>
<pre><code class="language-sh">ollama pull qwen2.5-coder:latest</code></pre>
<p>This downloads the 7B parameter version (~4.7GB). If you have more VRAM, consider <code>qwen2.5-coder:14b</code> or <code>qwen2.5-coder:32b</code> for even better results.</p>
<blockquote>
<p><strong>Note</strong>: The 7B model needs ~6GB VRAM, 14B needs ~10GB, and 32B needs ~20GB. Plan accordingly, or your GPU will plan for you (by crashing).</p>
</blockquote>
<h2 id="setting-up-neovim-with-minuet-ainvim">Setting Up Neovim with minuet-ai.nvim</h2>
<p><a href="https://github.com/milanglacier/minuet-ai.nvim">minuet-ai.nvim</a> is a fantastic plugin that brings LLM-powered completions to Neovim. Unlike some alternatives, it doesn&#39;t require any proprietary background processes—just curl and your LLM provider.</p>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>Multiple frontends</strong>: Virtual-text, nvim-cmp, blink-cmp, built-in completion (Neovim 0.11+)</li>
<li><strong>Streaming support</strong>: See completions as they generate</li>
<li><strong>Fill-in-the-Middle</strong>: Proper code completion that understands context before AND after the cursor</li>
<li><strong>Incremental acceptance</strong>: Accept completions word-by-word or line-by-line</li>
</ul>
<h3 id="installation-with-lazynvim">Installation with lazy.nvim</h3>
<p>Create or edit your plugin spec in your Neovim lua folder:</p>
<pre><code class="language-lua">---@type LazySpec
return {
  &quot;milanglacier/minuet-ai.nvim&quot;,
  dependencies = {
    &quot;nvim-lua/plenary.nvim&quot;,
  },
  opts = {
    provider = &quot;openai_fim_compatible&quot;,
    n_completions = 1, -- Use 1 for local models to save resources
    context_window = 4096, -- Adjust based on your GPU&#39;s capability
    throttle = 500, -- Minimum time between requests in ms
    debounce = 300, -- Wait time after typing stops before requesting
    provider_options = {
      openai_fim_compatible = {
        api_key = &quot;TERM&quot;, -- Ollama doesn&#39;t need a real API key
        name = &quot;Ollama&quot;,
        end_point = &quot;http://localhost:11434/v1/completions&quot;,
        model = &quot;qwen2.5-coder:latest&quot;,
        optional = {
          max_tokens = 256, -- Maximum tokens to generate
          stop = { &quot;\n\n&quot; }, -- Stop at double newlines
          top_p = 0.9, -- Nucleus sampling parameter
        },
      },
    },
    -- Virtual text display settings
    virtualtext = {
      auto_trigger_ft = { &quot;*&quot; }, -- Enable for all filetypes
      keymap = {
        accept = &quot;&lt;Tab&gt;&quot;,
        accept_line = &quot;&lt;C-y&gt;&quot;,
        next = &quot;&lt;C-n&gt;&quot;,
        prev = &quot;&lt;C-p&gt;&quot;,
        dismiss = &quot;&lt;C-e&gt;&quot;,
      },
    },
  },
}</code></pre>
<h3 id="configuration-notes">Configuration Notes</h3>
<ul>
<li><strong><code>api_key = &quot;TERM&quot;</code></strong>: Ollama doesn&#39;t require authentication, but minuet-ai needs some environment variable name here. <code>TERM</code> exists on all systems.</li>
<li><strong><code>context_window = 4096</code></strong>: Start with a smaller window and increase if your GPU can handle it. The plugin author recommends starting at 512 to benchmark your hardware.</li>
<li><strong><code>throttle</code> and <code>debounce</code></strong>: These prevent spamming your GPU with requests. Tune based on how aggressive you want completions to be.</li>
</ul>
<h2 id="results">Results</h2>
<p>Once everything is set up, you&#39;ll see ghost text completions appear as you type. Press <code>&lt;Tab&gt;</code> to accept the full completion, or <code>&lt;C-y&gt;</code> to accept just the current line.</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/374ff306-4ef9-437c-a8fe-a5d1f32f9259/image.gif" alt=""></p>
<p>The first completion after loading the model takes a bit longer (cold start), but subsequent completions are fast. If you find it slow, try:</p>
<ol>
<li>Reducing <code>context_window</code></li>
<li>Using a smaller model (<code>qwen2.5-coder:7b</code> instead of larger variants)</li>
<li>Increasing <code>debounce</code> to reduce request frequency</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>You now have a fully local, privacy-respecting, subscription-free AI code completion setup. Your code stays on your machine, your completions work offline, and you&#39;re not paying monthly fees for the privilege.</p>
<p>Happy coding!</p>
]]></description>
        </item>
    </channel>
</rss>