<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>marin_.log</title>
        <link>https://velog.io/</link>
        <description>크로아티아 출신 Go 개발자입니다. 한국 문화와 홈 오토메이션, 셀프 호스팅에 관심이 많습니다. 한국어는 아직 배우는 중이라 서툴지만, 커피챗이나 언어 교환은 언제든 환영합니다!</description>
        <lastBuildDate>Fri, 27 Feb 2026 07:55:45 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>marin_.log</title>
            <url>https://velog.velcdn.com/images/marin_/profile/ab47226c-3d54-44ea-8975-42ca23e52c32/image.jpg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. marin_.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/marin_" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[Building an LLM from Scratch in Go Part 1: BPE Tokenizer]]></title>
            <link>https://velog.io/@marin_/Building-an-LLM-from-Scratch-in-Go-Part-1-BPE-Tokenizer</link>
            <guid>https://velog.io/@marin_/Building-an-LLM-from-Scratch-in-Go-Part-1-BPE-Tokenizer</guid>
            <pubDate>Fri, 27 Feb 2026 07:55:45 GMT</pubDate>
            <description><![CDATA[<p>I recently picked up <a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build a Large Language Model (From Scratch)</a> by Sebastian Raschka. The book walks you through building a GPT-style language model step by step — from raw text all the way to a trained model. It&#39;s Python-first, but I&#39;m a Go developer, so I&#39;m implementing everything in Go as I follow along.</p>
<h2 id="the-book">The Book</h2>
<p>The book is structured around understanding, not shortcuts. Instead of importing a library and calling it a day, you build each component from scratch and understand why it works. Chapter 1 covers the big picture of LLMs, what they are, how they differ from earlier models, and the overall architecture. Chapter 2 gets into the first real implementation: text tokenization.</p>
<p>By the end of chapter 2, you have a working BPE tokenizer (the same kind used in GPT-2 and GPT-3). That&#39;s what this post covers.</p>
<hr>
<h2 id="what-is-a-tokenizer">What is a Tokenizer?</h2>
<p>Before a language model can process text, the text needs to be converted into numbers. A tokenizer does this job. It takes a string like <code>&quot;Hello, world!&quot;</code> and converts it into a list of integers:</p>
<pre><code>&quot;Hello, world!&quot; → [15496, 11, 995, 0]</code></pre><p>Each integer is a <strong>token ID</strong> — an index into the model&#39;s vocabulary. The model never sees the raw text, only these numbers. Decoding is the reverse: given a list of token IDs, reconstruct the original string.</p>
<hr>
<h2 id="the-vocabulary-r50k_basetiktoken">The Vocabulary: r50k_base.tiktoken</h2>
<p>Instead of training a vocabulary from scratch, I&#39;m using OpenAI&#39;s pre-built <code>r50k_base</code> vocabulary. It contains ~50,000 tokens and comes in the tiktoken format, where each line is:</p>
<pre><code>&lt;base64_encoded_token_bytes&gt; &lt;token_id&gt;</code></pre><p>For example:</p>
<pre><code>SGVsbG8= 15496</code></pre><p><code>SGVsbG8=</code> is the base64 encoding of the bytes for <code>Hello</code>, and <code>15496</code> is its token ID.</p>
<p>The first 256 entries cover every possible single byte (0x00–0xFF), which guarantees that any input can always be encoded. The remaining ~49,700 entries are multi-byte merge tokens built up during BPE training.</p>
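<p>To sanity-check the format, a few lines of Go can decode the example entry above (a standalone snippet, not part of the tokenizer itself):</p>
<pre><code class="language-go">package main

import (
    &quot;encoding/base64&quot;
    &quot;fmt&quot;
)

func main() {
    // Decode the token-bytes half of the line &quot;SGVsbG8= 15496&quot;.
    tokenBytes, err := base64.StdEncoding.DecodeString(&quot;SGVsbG8=&quot;)
    if err != nil {
        panic(err)
    }
    fmt.Printf(&quot;%s\n&quot;, tokenBytes) // Hello
}</code></pre>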
<h3 id="loading-the-vocabulary-in-go">Loading the vocabulary in Go</h3>
<p>I use Go&#39;s <code>embed</code> package to bundle the file directly into the binary, then parse it line by line:</p>
<pre><code class="language-go">//go:embed resources/r50k_base.tiktoken
var resource embed.FS

var (
    Vocabulary    = make(map[string]int)
    VacabularyInv = make(map[int]string)
)

func LoadVocabulary() error {
    fd, err := resource.Open(&quot;resources/r50k_base.tiktoken&quot;)
    if err != nil {
        return fmt.Errorf(&quot;fail to open file: %w&quot;, err)
    }
    defer fd.Close()

    reader := bufio.NewReader(fd)
    for {
        line, err := reader.ReadString(&#39;\n&#39;)
        if err != nil &amp;&amp; err != io.EOF {
            return fmt.Errorf(&quot;fail to read line: %w&quot;, err)
        }

        // strings.Fields also handles the final line when the file has
        // no trailing newline.
        if fields := strings.Fields(line); len(fields) == 2 {
            tokenBytes, decErr := base64.StdEncoding.DecodeString(fields[0])
            if decErr != nil {
                return fmt.Errorf(&quot;fail to decode token: %w&quot;, decErr)
            }
            tokenID, convErr := strconv.Atoi(fields[1])
            if convErr != nil {
                return fmt.Errorf(&quot;fail to parse token id: %w&quot;, convErr)
            }

            key := string(tokenBytes)
            Vocabulary[key] = tokenID
            VacabularyInv[tokenID] = key
        }

        if err == io.EOF {
            break
        }
    }
    return nil
}</code></pre>
<p>Two maps: one for encoding (<code>token bytes → ID</code>) and one for decoding (<code>ID → token bytes</code>).</p>
<hr>
<h2 id="step-1-pre-tokenization-with-regex">Step 1: Pre-tokenization with Regex</h2>
<p>BPE doesn&#39;t run on the entire input at once. First the text is split into chunks using a regex. This keeps merges from crossing word boundaries (you don&#39;t want the trailing <code>&quot;s&quot;</code> of <code>&quot;dogs&quot;</code> merging with a token from the following word).</p>
<p>GPT-2 uses a specific regex for this. Go&#39;s standard <code>regexp</code> package (RE2) does handle Unicode categories like <code>\p{L}</code> (letters) and <code>\p{N}</code> (numbers), but it doesn&#39;t support the negative lookahead <code>(?!\S)</code> that the pattern relies on, so I reached for <code>github.com/dlclark/regexp2</code>:</p>
<pre><code class="language-go">var expr = regexp2.MustCompile(
    `&#39;s|&#39;t|&#39;re|&#39;ve|&#39;m|&#39;ll|&#39;d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`,
    0,
)</code></pre>
<p>For input <code>&quot;Hello, world!&quot;</code> this produces: <code>[&quot;Hello&quot;, &quot;,&quot;, &quot; world&quot;, &quot;!&quot;]</code></p>
<p>Each chunk is then encoded <strong>independently</strong> by the BPE algorithm.</p>
<hr>
<h2 id="step-2-byte-pair-encoding-bpe">Step 2: Byte Pair Encoding (BPE)</h2>
<p>BPE is the core algorithm. The key insight: you don&#39;t train anything here; the training already happened when OpenAI built the r50k_base vocabulary. Your job is just the <strong>encoding</strong> side: given a chunk of text, split it into vocabulary tokens by replaying the merges in training-priority order.</p>
<h3 id="how-it-works">How it works</h3>
<p>Take the chunk <code>&quot; world&quot;</code> (note the leading space — GPT-2 treats <code>&quot; world&quot;</code> and <code>&quot;world&quot;</code> as different tokens).</p>
<p><strong>1. Initialize as individual bytes</strong></p>
<pre><code>parts = [ [0x20], [0x77], [0x6F], [0x72], [0x6C], [0x64] ]
          &quot; &quot;      &quot;w&quot;     &quot;o&quot;     &quot;r&quot;     &quot;l&quot;     &quot;d&quot;</code></pre><p>Every single byte is guaranteed to be in the vocabulary, so this starting state is always valid.</p>
<p><strong>2. Find the best merge candidate</strong></p>
<p>Look at every adjacent pair and check if their concatenation exists in the vocabulary. Among all valid pairs, pick the one with the <strong>lowest token ID</strong> — lower ID means it was merged earlier during training, giving it higher priority.</p>
<p><strong>3. Merge that pair</strong></p>
<p>Replace the two adjacent tokens with their concatenation. The slice shrinks by one element.</p>
<p><strong>4. Repeat until no valid merges remain</strong></p>
<p><strong>5. Return token IDs</strong></p>
<p>Look up each remaining token in the vocabulary and return the IDs.</p>
<h3 id="the-go-implementation">The Go implementation</h3>
<pre><code class="language-go">func (b *BPE) Encode(chunk []byte) []int {
    // Step 1: one element per byte
    parts := make([][]byte, len(chunk))
    for i, c := range chunk { // c, not b: avoids shadowing the receiver
        parts[i] = []byte{c}
    }

    for {
        bestRank := math.MaxInt
        bestIdx  := -1

        // Step 2: scan all adjacent pairs
        for i := 0; i &lt; len(parts)-1; i++ {
            merged := append(append([]byte{}, parts[i]...), parts[i+1]...)
            rank, ok := Vocabulary[string(merged)]
            if ok &amp;&amp; rank &lt; bestRank {
                bestRank = rank
                bestIdx  = i
            }
        }

        // No valid merge found — done
        if bestIdx == -1 {
            break
        }

        // Step 3: merge at bestIdx
        merged := append(append([]byte{}, parts[bestIdx]...), parts[bestIdx+1]...)
        parts = append(parts[:bestIdx+1], parts[bestIdx+2:]...)
        parts[bestIdx] = merged
    }

    // Step 5: convert to token IDs
    result := make([]int, len(parts))
    for i, part := range parts {
        result[i] = Vocabulary[string(part)]
    }
    return result
}</code></pre>
<h3 id="one-subtle-go-gotcha">One subtle Go gotcha</h3>
<p>You&#39;ll notice the merging always uses:</p>
<pre><code class="language-go">append(append([]byte{}, parts[i]...), parts[i+1]...)</code></pre>
<p>instead of the simpler:</p>
<pre><code class="language-go">append(parts[i], parts[i+1]...)</code></pre>
<p>The simpler version is a bug. If <code>parts[i]</code> has spare capacity in its underlying array (which can happen after a merge), the <code>append</code> will write <code>parts[i+1]</code> directly into that memory, silently corrupting another token&#39;s data. The double-append forces a fresh allocation every time, so nothing is shared.</p>
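<p>The aliasing is easy to reproduce outside the tokenizer. This standalone demonstration (my own, not from the book) shows an in-place <code>append</code> overwriting a neighbouring slice&#39;s bytes:</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

func main() {
    buf := []byte(&quot;hello world&quot;)
    a := buf[:5] // len 5, cap 11: spare capacity reaches into buf

    // append finds room in place (5+3 &lt;= 11), so it writes straight
    // over buf&#39;s bytes instead of allocating.
    _ = append(a, []byte(&quot;XYZ&quot;)...)
    fmt.Printf(&quot;%q\n&quot;, buf) // &quot;helloXYZrld&quot;

    // The double-append pattern copies into a fresh array first,
    // so buf is left alone.
    merged := append(append([]byte{}, a...), []byte(&quot;QQ&quot;)...)
    fmt.Printf(&quot;%q\n&quot;, merged) // &quot;helloQQ&quot;
}</code></pre>
<p>In the tokenizer&#39;s merge loop this failure mode would be intermittent and data-dependent, which is exactly why the defensive copy is worth the extra allocation.</p>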
<hr>
<h2 id="step-3-wiring-it-together">Step 3: Wiring it Together</h2>
<p>The <code>Tokenizer</code> owns a <code>BPE</code> instance and orchestrates the full pipeline:</p>
<pre><code class="language-go">type Tokenizer struct {
    bpe *BPE
}

func NewTokenizer() *Tokenizer {
    if err := LoadVocabulary(); err != nil {
        panic(err)
    }
    return &amp;Tokenizer{bpe: new(BPE)}
}

func (t *Tokenizer) Encode(input string) []int {
    match, err := expr.FindStringMatch(input)
    if err != nil {
        panic(err)
    }
    var result []int
    for match != nil {
        ids := t.bpe.Encode([]byte(match.String()))
        result = append(result, ids...)
        match, _ = expr.FindNextMatch(match)
    }
    return result
}

func (t *Tokenizer) Decode(input []int) string {
    var result []byte
    for _, id := range input {
        result = append(result, []byte(VacabularyInv[id])...)
    }
    return string(result)
}</code></pre>
<p>The flow is:</p>
<ol>
<li>Regex splits input into chunks</li>
<li>Each chunk goes through BPE → token IDs</li>
<li>All IDs are concatenated into one slice</li>
</ol>
<p>Decoding is simpler: just look up each ID in the inverse vocabulary and concatenate the bytes.</p>
<hr>
<h2 id="does-it-work">Does it Work?</h2>
<pre><code class="language-go">t := NewTokenizer()

ids := t.Encode(&quot;Hello, world!&quot;)
fmt.Println(ids)          // [15496 11 995 0]

fmt.Println(t.Decode(ids)) // Hello, world!</code></pre>
<p>The round-trip works. <code>Decode(Encode(s)) == s</code> for any input.</p>
<hr>
<h2 id="whats-next">What&#39;s Next</h2>
<p>Part 2 will continue into the next chapter: building the data loader that feeds tokenized text into the model in fixed-size chunks with a sliding window. That&#39;s where the training pipeline starts to take shape.</p>
<p>The full source is on GitHub: <a href="https://github.com/MarinX/llm-from-scratch">github.com/MarinX/llm-from-scratch</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Controlling LG Heat Pump via Modbus on Home Assistant]]></title>
            <link>https://velog.io/@marin_/Controlling-LG-Heat-Pump-via-Modbus-on-Home-Assistant</link>
            <guid>https://velog.io/@marin_/Controlling-LG-Heat-Pump-via-Modbus-on-Home-Assistant</guid>
            <pubDate>Thu, 29 Jan 2026 12:05:50 GMT</pubDate>
            <description><![CDATA[<p>I have an LG Therma V heat pump (inside unit: HN0916T.NB1, outside unit: HU091MR.U44) with a 200L DHW (Domestic Hot Water) tank, and I wanted to integrate it with Home Assistant. After some research, I found that the heat pump supports Modbus communication, which opened up a lot of possibilities.</p>
<p>In this post, I&#39;ll share how I connected everything and what configuration worked for me.</p>
<h2 id="the-architecture">The Architecture</h2>
<p>Here&#39;s an overview of my setup:</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/9a5e08d3-65a9-459e-a379-f05d400e3527/image.png" alt=""></p>
<p>The data flows like this:</p>
<ul>
<li><strong>Modbus</strong> ↔ <strong>Home Assistant</strong>: Two-way communication with the heat pump (reading sensors, sending commands)</li>
<li><strong>Home Assistant</strong> → <strong>InfluxDB</strong>: Storing historical data for analysis</li>
<li><strong>Grafana</strong> ← <strong>InfluxDB</strong>: Visualizing the data in dashboards</li>
</ul>
<h2 id="what-is-modbus">What is Modbus?</h2>
<p>Modbus is an industrial communication protocol that&#39;s been around since 1979. It&#39;s simple, reliable, and widely used in industrial equipment – including heat pumps.</p>
<p>There are two main types:</p>
<ul>
<li><strong>Modbus RTU</strong>: Uses serial communication (RS485 wires)</li>
<li><strong>Modbus TCP</strong>: Uses ethernet/IP network</li>
</ul>
<p>The LG heat pump speaks Modbus RTU, but for Home Assistant it&#39;s far more convenient to talk Modbus TCP over the network than to run a serial line all the way to the server. To bridge the gap, I needed a Modbus TCP/IP gateway module.</p>
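<p>To make the RTU/TCP difference concrete, here&#39;s a small Go sketch (purely illustrative, not part of my setup) that builds the request frame a Modbus TCP client sends for a single input-register read. The 7-byte MBAP header at the front is what replaces RTU&#39;s slave-address byte and CRC on the wire:</p>
<pre><code class="language-go">package main

import (
    &quot;encoding/binary&quot;
    &quot;fmt&quot;
)

// readInputRegistersFrame builds a Modbus TCP frame (MBAP header + PDU)
// for function 0x04, &quot;Read Input Registers&quot;.
func readInputRegistersFrame(txID uint16, unitID byte, addr, count uint16) []byte {
    frame := make([]byte, 12)
    binary.BigEndian.PutUint16(frame[0:2], txID) // transaction ID
    binary.BigEndian.PutUint16(frame[2:4], 0)    // protocol ID, always 0
    binary.BigEndian.PutUint16(frame[4:6], 6)    // bytes remaining after this field
    frame[6] = unitID                            // slave/unit ID
    frame[7] = 0x04                              // function code
    binary.BigEndian.PutUint16(frame[8:10], addr)
    binary.BigEndian.PutUint16(frame[10:12], count)
    return frame
}

func main() {
    // One register at address 5, from unit 1.
    fmt.Printf(&quot;% x\n&quot;, readInputRegistersFrame(1, 1, 5, 1))
    // 00 01 00 00 00 06 01 04 00 05 00 01
}</code></pre>
<p>The gateway&#39;s job is essentially this translation: strip the MBAP header, add the RTU address byte and CRC, and forward the PDU over RS485.</p>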
<h2 id="hardware-i-used">Hardware I Used</h2>
<p>Here&#39;s what I gathered for this project:</p>
<img src="https://velog.velcdn.com/images/marin_/post/1afd41f4-e25b-4c83-8dbb-4e7b4b0627c9/image.jpg" width=500 />


<ul>
<li><strong>Ethernet cable</strong> – To connect the gateway to my network</li>
<li><strong>RS485 cable</strong> – The communication line between the heat pump and the gateway</li>
<li><strong>Waveshare RS485 to PoE ETH module</strong> – This converts Modbus RTU to Modbus TCP</li>
</ul>
<h2 id="making-the-connection">Making the Connection</h2>
<p>Here&#39;s how I connected the RS485 cable inside the heat pump&#39;s control board:</p>
<img src="https://velog.velcdn.com/images/marin_/post/2d49cfbb-d279-40e0-b7dc-7e7f7fe254ba/image.jpg" width=300 />

<h2 id="home-assistant-configuration">Home Assistant Configuration</h2>
<p>With the hardware in place, I moved on to configuring Home Assistant.</p>
<h3 id="basic-modbus-connection">Basic Modbus Connection</h3>
<p>I added this to my <code>/homeassistant/configuration.yaml</code>:</p>
<pre><code class="language-yaml">modbus:
  - name: &quot;LG Therma V&quot;
    delay: 1
    timeout: 14
    message_wait_milliseconds: 200
    host: &quot;device-ip-address-on-local-lan&quot;
    port: 4196
    type: tcp</code></pre>
<p>Here&#39;s what each setting does:</p>
<ul>
<li><strong>name</strong>: A friendly name for the connection</li>
<li><strong>delay</strong>: Wait time in seconds before the first request</li>
<li><strong>timeout</strong>: How long to wait for a response (14 seconds works well for heat pumps)</li>
<li><strong>message_wait_milliseconds</strong>: Pause between messages to avoid overwhelming the device</li>
<li><strong>host</strong>: The Modbus gateway&#39;s IP address (I set a static IP for this)</li>
<li><strong>port</strong>: The TCP port (4196 is the default for Waveshare modules)</li>
<li><strong>type</strong>: TCP connection type</li>
</ul>
<h2 id="writing-modbus-queries">Writing Modbus Queries</h2>
<p>To find the correct register addresses, I consulted the LG manual for my model. It lists all the Modbus registers and what data they contain.</p>
<h3 id="binary-sensors">Binary Sensors</h3>
<p>For monitoring on/off states like whether the pump is running, I configured a binary sensor:</p>
<pre><code class="language-yaml">modbus:
  binary_sensors:
    - name: &quot;LG Therma V Pump Running&quot;
      unique_id: &quot;lg_therma_v_pump_running&quot;
      address: 1
      slave: 1
      scan_interval: 20
      device_class: running
      input_type: discrete_input</code></pre>
<p>Key settings:</p>
<ul>
<li><strong>address</strong>: The Modbus register address from the LG manual</li>
<li><strong>slave</strong>: Set to 1 for single-device setups</li>
<li><strong>scan_interval</strong>: Polling frequency in seconds (20 seconds is reasonable)</li>
<li><strong>input_type</strong>: <code>discrete_input</code> for read-only binary values</li>
</ul>
<h3 id="sensors">Sensors</h3>
<p>For reading temperature values, I set up sensors like this one for DHW temperature:</p>
<pre><code class="language-yaml">modbus:
  sensors:
    - name: &quot;LG Therma V DHW Temp&quot;
      unique_id: &quot;lg_therma_v_dhw_temperature&quot;
      scale: 0.1
      precision: 1
      scan_interval: 20
      address: 5 # reg 6
      slave: 1
      unit_of_measurement: °C
      device_class: temperature
      input_type: input</code></pre>
<p>Important settings:</p>
<ul>
<li><strong>scale</strong>: LG reports temperature as integers (e.g., 445 = 44.5°C), so I multiply by 0.1</li>
<li><strong>precision</strong>: Number of decimal places to display</li>
<li><strong>device_class</strong>: Tells Home Assistant this is a temperature sensor</li>
<li><strong>input_type</strong>: <code>input</code> for read-only registers</li>
</ul>
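<p>The <code>scale</code> math, spelled out in Go terms (my own sketch; the signed reading for sub-zero temperatures is an assumption worth checking against the LG manual):</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

func main() {
    // Modbus registers are 16-bit; LG reports tenths of a degree.
    raw := uint16(445)

    // Casting through int16 would also cover sub-zero values encoded
    // as two&#39;s complement (assumption; verify against the manual).
    temp := float64(int16(raw)) * 0.1
    fmt.Printf(&quot;%.1f°C\n&quot;, temp) // 44.5°C
}</code></pre>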
<h3 id="switches">Switches</h3>
<p>To control the heat pump, I configured switches like:</p>
<pre><code class="language-yaml">modbus:
  switches:
    - name: &quot;LG Therma V Underfloor&quot;
      unique_id: &quot;lg_therma_v_underfloor_on_off&quot;
      slave: 1
      address: 0
      write_type: coil
      command_on: 1
      command_off: 0
      verify:
        input_type: coil
        address: 0
        state_on: 1
        state_off: 0</code></pre>
<p>Key settings:</p>
<ul>
<li><strong>write_type</strong>: <code>coil</code> for boolean writes</li>
<li><strong>verify</strong>: Reads back the state to confirm the command was executed</li>
</ul>
<h3 id="climate-entities">Climate Entities</h3>
<p>For a complete thermostat experience with current temperature and target adjustment, I used a climate entity:</p>
<pre><code class="language-yaml">modbus:
  climates:
    - name: &quot;LG Therma V Underfloor&quot;
      unique_id: &quot;lg_therma_v_underfloor&quot;
      address: 7
      slave: 1
      input_type: input
      max_temp: 33
      min_temp: 16
      offset: 0
      precision: 0
      scale: 0.1
      target_temp_register: 2
      temp_step: 1
      temperature_unit: C
      hvac_mode_register:
        address: 0
        values:
          state_heat: 4</code></pre>
<p>This creates a proper thermostat card in Home Assistant where I can see the current temperature and adjust the target.</p>
<hr>
<h2 id="bonus-recording-data-to-influxdb">Bonus: Recording Data to InfluxDB</h2>
<p>I wanted to keep historical data for analysis, so I added InfluxDB to my setup.</p>
<h3 id="what-is-influxdb">What is InfluxDB?</h3>
<p>InfluxDB is a time-series database designed specifically for data that changes over time (like temperatures and power consumption). It handles large amounts of time-stamped data efficiently.</p>
<h3 id="home-assistant--influxdb-integration">Home Assistant + InfluxDB Integration</h3>
<p>Home Assistant has built-in support for InfluxDB. I added this to my <code>configuration.yaml</code>:</p>
<pre><code class="language-yaml">influxdb:
  api_version: 2
  ssl: false
  host: your-ip
  port: 8086
  token: influxdb-token
  organization: your-org
  bucket: homeassistant
  tags:
    source: HA
  tags_attributes:
    - friendly_name
  default_measurement: units</code></pre>
<p>Configuration notes:</p>
<ul>
<li><strong>api_version</strong>: Version 2 for modern InfluxDB installations</li>
<li><strong>ssl</strong>: Set to <code>true</code> if using HTTPS</li>
<li><strong>token</strong>: Generated in InfluxDB&#39;s web interface</li>
<li><strong>organization</strong> &amp; <strong>bucket</strong>: These need to be created in InfluxDB first</li>
<li><strong>tags</strong>: Useful for filtering data later</li>
</ul>
<p>With this configuration, every sensor update in Home Assistant gets automatically logged to InfluxDB.</p>
<hr>
<h2 id="grafana-visualizing-the-data">Grafana: Visualizing the Data</h2>
<p>To create dashboards from the stored data, I use Grafana. It connects to InfluxDB and provides flexible visualization options.</p>
<p>Here&#39;s my current dashboard:</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/12cfe85a-3b35-495a-af51-f00b49dd5be5/image.png" alt=""></p>
<p>I can monitor:</p>
<ul>
<li>Pump and compressor status</li>
<li>DHW temperature over time</li>
<li>Power consumption</li>
<li>Flow rates</li>
<li>And more</li>
</ul>
<p>Setting up Grafana involves:</p>
<ol>
<li>Installing Grafana (I used Docker)</li>
<li>Adding InfluxDB as a data source</li>
<li>Creating dashboards and panels</li>
</ol>
<hr>
<h2 id="summary">Summary</h2>
<p>This setup gives me full visibility and control over my LG Therma V heat pump through Home Assistant. I can:</p>
<ul>
<li>Monitor temperatures and status in real-time</li>
<li>Control the heat pump remotely</li>
<li>Store historical data for analysis</li>
<li>Visualize everything in Grafana dashboards</li>
</ul>
<p>The total cost was much lower than LG&#39;s official smart home solutions, and I have complete control over my data.</p>
<hr>
<p>감사합니다 for reading! If you have questions or spot any issues, feel free to leave a comment.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Getting Started with Local LLM Code Completion in Neovim using Ollama]]></title>
            <link>https://velog.io/@marin_/Getting-Started-with-Local-LLM-Code-Completion-in-Neovim-using-Ollama</link>
            <guid>https://velog.io/@marin_/Getting-Started-with-Local-LLM-Code-Completion-in-Neovim-using-Ollama</guid>
            <pubDate>Mon, 19 Jan 2026 13:40:35 GMT</pubDate>
            <description><![CDATA[<p>So you want AI-powered code completion but you&#39;re not keen on sending your code to the cloud? Maybe you work on proprietary code, maybe you value privacy, or maybe you just want to avoid that awkward moment when your internet dies mid-completion and you realize you&#39;ve forgotten how to write a for loop. Whatever your reason, running LLMs locally is the answer.</p>
<p>In this guide, we&#39;ll set up <a href="https://ollama.com/">Ollama</a> with the Qwen2.5-Coder model and hook it into Neovim using the <a href="https://github.com/milanglacier/minuet-ai.nvim">minuet-ai.nvim</a> plugin. By the end, you&#39;ll have GitHub Copilot-like completions running entirely on your machine.</p>
<h2 id="why-local-llms">Why Local LLMs?</h2>
<ul>
<li><strong>Privacy</strong>: Your code never leaves your machine. Your embarrassing variable names stay between you and your GPU.</li>
<li><strong>No subscription fees</strong>: Once set up, it&#39;s free forever. Your wallet will thank you.</li>
<li><strong>Offline capability</strong>: Works on airplanes, in basements, and during apocalyptic internet outages.</li>
<li><strong>Latency</strong>: With a decent GPU, local inference can actually be faster than cloud APIs.</li>
</ul>
<h2 id="installing-ollama">Installing Ollama</h2>
<p>Ollama is a lightweight runtime for running LLMs locally. Installation is refreshingly simple:</p>
<pre><code class="language-sh">curl -fsSL https://ollama.com/install.sh | sh</code></pre>
<h3 id="verify-the-installation">Verify the installation</h3>
<pre><code class="language-sh">ollama -v</code></pre>
<p>If you see a version number, congratulations—you&#39;ve successfully installed software. The bar was low, but you cleared it.</p>
<h2 id="configuring-ollama-for-optimal-performance">Configuring Ollama for Optimal Performance</h2>
<p>The default Ollama configuration works, but we can do better. On Linux, we&#39;ll override the systemd service with custom environment variables:</p>
<pre><code class="language-sh">sudo systemctl edit ollama</code></pre>
<p>This opens an override file. Add the following configuration:</p>
<pre><code class="language-conf">[Service]
Environment=&quot;OLLAMA_HOST=0.0.0.0&quot;
Environment=&quot;OLLAMA_FLASH_ATTENTION=1&quot;
Environment=&quot;OLLAMA_KV_CACHE_TYPE=q4_0&quot;
Environment=&quot;OLLAMA_NUM_PARALLEL=2&quot;
Environment=&quot;OLLAMA_KEEP_ALIVE=10m&quot;</code></pre>
<p>Let me explain what each of these does:</p>
<table>
<thead>
<tr>
<th>Variable</th>
<th>What it does</th>
</tr>
</thead>
<tbody><tr>
<td><code>OLLAMA_HOST=0.0.0.0</code></td>
<td>Makes Ollama accessible from any network interface, not just localhost. Useful if you want to access it from other devices on your network.</td>
</tr>
<tr>
<td><code>OLLAMA_FLASH_ATTENTION=1</code></td>
<td>Enables <a href="https://github.com/ollama/ollama/pull/6279">Flash Attention</a>, which significantly reduces memory usage as context size grows. This is required for KV cache quantization.</td>
</tr>
<tr>
<td><code>OLLAMA_KV_CACHE_TYPE=q4_0</code></td>
<td>Quantizes the key-value cache to 4-bit, using approximately <strong>1/4 the memory</strong> compared to the default fp16. There&#39;s a slight quality trade-off, but for code completion it&#39;s barely noticeable. Use <code>q8_0</code> if you have more VRAM and want higher quality.</td>
</tr>
<tr>
<td><code>OLLAMA_NUM_PARALLEL=2</code></td>
<td>Allows processing multiple requests simultaneously. Useful if you&#39;re trigger-happy with your completions.</td>
</tr>
<tr>
<td><code>OLLAMA_KEEP_ALIVE=10m</code></td>
<td>Keeps the model loaded in memory for 10 minutes after the last request. Default is 5 minutes. Longer means faster first response, but uses more memory.</td>
</tr>
</tbody></table>
<p>Don&#39;t forget to restart the service:</p>
<pre><code class="language-sh">sudo systemctl restart ollama</code></pre>
<h2 id="choosing-the-right-model-qwen25-coder">Choosing the Right Model: Qwen2.5-Coder</h2>
<p>We&#39;re using <a href="https://qwenlm.github.io/blog/qwen2.5-coder-family/">Qwen2.5-Coder</a>, and here&#39;s why it&#39;s an excellent choice:</p>
<ul>
<li><strong>Fill-in-the-Middle (FIM) support</strong>: Essential for code completion—it can complete code in the middle of a function, not just at the end.</li>
<li><strong>40+ programming languages</strong>: From Python to Haskell to Racket (yes, really).</li>
<li><strong>5.5 trillion tokens of training data</strong>: It has seen more code than any human ever will.</li>
</ul>
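<p>To see what FIM actually means on the wire: the model is prompted with the text before and after your cursor wrapped in sentinel tokens, and it generates the middle. A sketch of the prompt assembly (sentinel strings as published for the Qwen2.5-Coder family; double-check them for your model version):</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// buildFIMPrompt wraps the code before and after the cursor in the
// fill-in-the-middle sentinels used by Qwen2.5-Coder.
func buildFIMPrompt(prefix, suffix string) string {
    return &quot;&lt;|fim_prefix|&gt;&quot; + prefix +
        &quot;&lt;|fim_suffix|&gt;&quot; + suffix +
        &quot;&lt;|fim_middle|&gt;&quot;
}

func main() {
    // The completion the model streams back is the function body.
    fmt.Println(buildFIMPrompt(&quot;func add(a, b int) int {\n\t&quot;, &quot;\n}&quot;))
}</code></pre>
<p>Either the client or the model&#39;s prompt template applies this wrapping for you; the sketch is just to demystify what the model ultimately sees.</p>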
<p>Pull the model:</p>
<pre><code class="language-sh">ollama pull qwen2.5-coder:latest</code></pre>
<p>This downloads the 7B parameter version (~4.7GB). If you have more VRAM, consider <code>qwen2.5-coder:14b</code> or <code>qwen2.5-coder:32b</code> for even better results.</p>
<blockquote>
<p><strong>Note</strong>: The 7B model needs ~6GB VRAM, 14B needs ~10GB, and 32B needs ~20GB. Plan accordingly, or your GPU will plan for you (by crashing).</p>
</blockquote>
<h2 id="setting-up-neovim-with-minuet-ainvim">Setting Up Neovim with minuet-ai.nvim</h2>
<p><a href="https://github.com/milanglacier/minuet-ai.nvim">minuet-ai.nvim</a> is a fantastic plugin that brings LLM-powered completions to Neovim. Unlike some alternatives, it doesn&#39;t require any proprietary background processes—just curl and your LLM provider.</p>
<h3 id="key-features">Key Features</h3>
<ul>
<li><strong>Multiple frontends</strong>: Virtual-text, nvim-cmp, blink-cmp, built-in completion (Neovim 0.11+)</li>
<li><strong>Streaming support</strong>: See completions as they generate</li>
<li><strong>Fill-in-the-Middle</strong>: Proper code completion that understands context before AND after the cursor</li>
<li><strong>Incremental acceptance</strong>: Accept completions word-by-word or line-by-line</li>
</ul>
<h3 id="installation-with-lazynvim">Installation with lazy.nvim</h3>
<p>Create or edit your plugin spec in your Neovim lua folder:</p>
<pre><code class="language-lua">---@type LazySpec
return {
  &quot;milanglacier/minuet-ai.nvim&quot;,
  dependencies = {
    &quot;nvim-lua/plenary.nvim&quot;,
  },
  opts = {
    provider = &quot;openai_fim_compatible&quot;,
    n_completions = 1, -- Use 1 for local models to save resources
    context_window = 4096, -- Adjust based on your GPU&#39;s capability
    throttle = 500, -- Minimum time between requests in ms
    debounce = 300, -- Wait time after typing stops before requesting
    provider_options = {
      openai_fim_compatible = {
        api_key = &quot;TERM&quot;, -- Ollama doesn&#39;t need a real API key
        name = &quot;Ollama&quot;,
        end_point = &quot;http://localhost:11434/v1/completions&quot;,
        model = &quot;qwen2.5-coder:latest&quot;,
        optional = {
          max_tokens = 256, -- Maximum tokens to generate
          stop = { &quot;\n\n&quot; }, -- Stop at double newlines
          top_p = 0.9, -- Nucleus sampling parameter
        },
      },
    },
    -- Virtual text display settings
    virtualtext = {
      auto_trigger_ft = { &quot;*&quot; }, -- Enable for all filetypes
      keymap = {
        accept = &quot;&lt;Tab&gt;&quot;,
        accept_line = &quot;&lt;C-y&gt;&quot;,
        next = &quot;&lt;C-n&gt;&quot;,
        prev = &quot;&lt;C-p&gt;&quot;,
        dismiss = &quot;&lt;C-e&gt;&quot;,
      },
    },
  },
}</code></pre>
<h3 id="configuration-notes">Configuration Notes</h3>
<ul>
<li><strong><code>api_key = &quot;TERM&quot;</code></strong>: Ollama doesn&#39;t require authentication, but minuet-ai needs some environment variable name here. <code>TERM</code> exists on all systems.</li>
<li><strong><code>context_window = 4096</code></strong>: Start with a smaller window and increase if your GPU can handle it. The plugin author recommends starting at 512 to benchmark your hardware.</li>
<li><strong><code>throttle</code> and <code>debounce</code></strong>: These prevent spamming your GPU with requests. Tune based on how aggressive you want completions to be.</li>
</ul>
<h2 id="results">Results</h2>
<p>Once everything is set up, you&#39;ll see ghost text completions appear as you type. Press <code>&lt;Tab&gt;</code> to accept the full completion, or <code>&lt;C-y&gt;</code> to accept just the current line.</p>
<p><img src="https://velog.velcdn.com/images/marin_/post/374ff306-4ef9-437c-a8fe-a5d1f32f9259/image.gif" alt=""></p>
<p>The first completion after loading the model takes a bit longer (cold start), but subsequent completions are fast. If you find it slow, try:</p>
<ol>
<li>Reducing <code>context_window</code></li>
<li>Using a smaller model (<code>qwen2.5-coder:7b</code> instead of larger variants)</li>
<li>Increasing <code>debounce</code> to reduce request frequency</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>You now have a fully local, privacy-respecting, subscription-free AI code completion setup. Your code stays on your machine, your completions work offline, and you&#39;re not paying monthly fees for the privilege.</p>
<p>Happy coding!</p>
]]></description>
        </item>
    </channel>
</rss>