While experimenting with coding agents and local models like Qwen Coder, I started wondering what actually happens between an agent and a model at the protocol and code level. I also wanted to brush up on Go — against the trend of "giving up" on code in favor of writing prompts — and that's how a working agent prototype came to be.
What's on the wire — a look at the protocol
All communication between the agent and the model is plain HTTP. As an example, let's take Ollama with the Qwen Coder model.
A standard HTTP POST to /v1/chat/completions
{
  "model": "qwen3-coder",
  "stream": true,
  "messages": [
    {"role": "system", "content": "You are a coding assistant..."},
    {"role": "user", "content": "What does the main.go file do?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "...",
        "parameters": {...}
      }
    }
  ]
}
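Sending that request from Go needs nothing beyond net/http. A minimal sketch, assuming ctx is the request context and payload holds the JSON above as a byte slice (Ollama's default local address is used):
// build the request; with "stream": true the interesting part is that
// resp.Body stays open and delivers tokens as they are generated
req, err := http.NewRequestWithContext(ctx, http.MethodPost,
    "http://localhost:11434/v1/chat/completions", bytes.NewReader(payload))
if err != nil {
    return err
}
req.Header.Set("Content-Type", "application/json")

resp, err := http.DefaultClient.Do(req)
if err != nil {
    return err
}
defer resp.Body.Close()
// resp.Body is now the event stream described below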
The "magic" of tokens appearing as the model thinks is hidden in the stream: true field. The server responds with Content-Type: text/event-stream and doesn't close the body — tokens arrive in chunks as event: + data: pairs:
event: content_block_delta
data: {"delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"delta": {"type": "text_delta", "text": " world!"}}

event: message_stop
data: {}
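On the agent side, reading this stream boils down to scanning the response body line by line. A minimal sketch (the function name and callback are illustrative, not the actual provider code):
import (
    "bufio"
    "io"
    "strings"
)

// readSSE tracks the current "event:" name and hands every "data:" payload
// to a callback. Sketch only; the real provider decodes the JSON into typed events.
func readSSE(body io.Reader, handle func(event, data string)) error {
    sc := bufio.NewScanner(body)
    var event string
    for sc.Scan() {
        line := sc.Text()
        switch {
        case strings.HasPrefix(line, "event: "):
            event = strings.TrimPrefix(line, "event: ")
        case strings.HasPrefix(line, "data: "):
            handle(event, strings.TrimPrefix(line, "data: "))
        }
        // blank lines separate events; nothing to do for them
    }
    return sc.Err()
}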
Most events are text_delta — fragments of the response. (The exact event names differ by provider: the fragment above uses Anthropic-style content_block_delta events, while OpenAI-compatible endpoints such as Ollama's send bare data: chunks with a choices[].delta payload; the principle is the same.) But what happens when the model wants to do something instead of generating text?
Tool call — the model asks, the agent executes
There's no special protocol here — the model simply generates JSON instead of text:
1. Agent → model: POST with tools[]
2. Model → agent: a tool_call block: { name: "read_file", arguments: {"path": "main.go"} }
3. Agent executes the tool (reads the file)
4. Agent → model: a new POST with the same history + a role: "tool" message with the result
5. Model → agent: a text response based on the file contents
{
  "role": "tool",
  "tool_call_id": "call_01abc",
  "content": "package main\n\nfunc main() {..."
}
Every request includes a tools[] field with the full list of available tools and their schemas (JSON Schema). The model has no access to any "registry" — it only sees what we send it. If the agent supports MCP, tools from MCP servers are appended to the same list — the model doesn't know that some of them come from an external source.
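On the Go side, a single entry in that list is just a name, a description, and a JSON Schema for the arguments. The struct below is an illustrative shape, not the agent's actual types; the schema follows the function-calling format from the request example above:
// ToolDef is roughly what gets serialized into the tools[] field.
type ToolDef struct {
    Name        string         `json:"name"`
    Description string         `json:"description"`
    Parameters  map[string]any `json:"parameters"` // JSON Schema
}

var readFileTool = ToolDef{
    Name:        "read_file",
    Description: "Read a file from the working directory and return its contents.",
    Parameters: map[string]any{
        "type": "object",
        "properties": map[string]any{
            "path": map[string]any{"type": "string", "description": "relative file path"},
        },
        "required": []string{"path"},
    },
}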
- Tool arguments may arrive in fragments via SSE (input_json_delta) — we assemble them as a string and only parse after content_block_stop (see the sketch after this list)
- The model can request multiple tools at once (parallel tool calls) — the agent executes all of them, sends back all the results, and only then does the model continue
- Why does the model "know" when to call a tool? That comes from the system prompt and training — not from any protocol on the agent's side
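Assembling those argument fragments is a small bookkeeping exercise: keep a builder per tool call and parse only when the block stops. A rough sketch with simplified types:
import (
    "encoding/json"
    "strings"
)

// pendingCall collects the argument JSON for one tool call as it streams in.
type pendingCall struct {
    id, name string
    args     strings.Builder
}

// on input_json_delta: append the fragment
func (p *pendingCall) addDelta(fragment string) {
    p.args.WriteString(fragment)
}

// on content_block_stop: the JSON is finally complete and can be parsed
func (p *pendingCall) parse() (map[string]any, error) {
    var parsed map[string]any
    err := json.Unmarshal([]byte(p.args.String()), &parsed)
    return parsed, err
}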
Implementation in code
At the code level, the idea is simple — a while loop with an HTTP client that sends conversation history, receives a stream of tokens, uses tools, and starts over.
user input → messages[] → Chat() → <-chan Event → terminal output
                 ↑                      |
                 └── tool results ──────┘
As the implementation grows, things get more complex — the code should stay readable and logically split into files and modules. For instance, LLM providers expose their models differently: Anthropic has /v1/messages with its own SSE format, while OpenAI and Ollama use /v1/chat/completions — the details need to be hidden behind a common interface. A factory pattern helps here, returning a ready instance for the selected provider (a minimal sketch below).
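A minimal version of that factory, assuming each backend has a constructor returning a type that implements the Provider interface shown further down (the constructor names are illustrative):
import "fmt"

// NewProvider returns a ready-to-use client for the selected backend.
func NewProvider(name, baseURL, apiKey, model string) (Provider, error) {
    switch name {
    case "anthropic":
        return NewAnthropic(apiKey, model), nil
    case "openai":
        return NewOpenAI(baseURL, apiKey, model), nil
    case "ollama":
        return NewOllama(baseURL, model), nil
    default:
        return nil, fmt.Errorf("unknown provider: %q", name)
    }
}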
Core agent logic
In the code, responsibility for the conversation flow is spread across three files, each with a clearly defined role:
- cmd/root.go → runREPL() — composition root. This is where everything comes together: flag parsing, config loading, provider creation, building the tool registry, connecting MCP servers. No business logic here — just wiring up the dependencies injected into the agent.
- agent.go → Run() — the REPL loop. Reads user input, handles commands (/compact, /model, /provider, etc.), calls turn() for each query, manages sessions (save, resume, title generation), auto-compacts when the context window exceeds 80%.
- turn.go → turn() — a single agent cycle. Builds a ChatRequest, calls Chat(), passes the stream to consumeStream(), drives the agentic loop (tool_use → execute → repeat). Handles Ctrl+C (SIGINT → cancel context → abort the cycle).
- stream.go → consumeStream() — consumes the event channel from the provider. Reads <-chan Event, prints tokens to the terminal, accumulates tool calls, filters <think> blocks, highlights code, watches for idle timeout.
Go and channels — non-blocking streaming
The HTTP response from the model can take tens of seconds — how do you avoid blocking the rest of the program? Chat() returns <-chan Event immediately, while SSE parsing happens in a background goroutine. The channel is buffered (64 elements), so the provider doesn't block on the consumer.
// Provider interface — one contract for all providers
type Provider interface {
    Chat(ctx context.Context, req ChatRequest) (<-chan Event, error)
}

// Inside Anthropic.Chat():
ch := make(chan Event, 64) // buffered — don't block on the agent
go a.streamResponse(ctx, resp.Body, ch)
return ch, nil // return immediately, goroutine runs in the background
The channel is unidirectional — the provider writes, the agent reads. If the user presses Ctrl+C, context.WithCancel cancels the context, the HTTP body gets closed, and the goroutine terminates naturally. If the model simply stops responding, consumeStream watches for an idle timeout via time.NewTimer (a sketch follows the loop below). Goroutines are cheap and channels are a built-in synchronization mechanism — Go is simply a natural fit here.
// The agent consumes the stream
for event := range ch {
    switch event.Type {
    case provider.EventTextDelta:
        fmt.Print(event.Text) // print the token in real time
    case provider.EventToolUseStart:
        // the model wants a tool — remember it
    case provider.EventDone:
        // done, we have usage stats
    }
}
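The idle-timeout part fits into the same consuming loop as a select on the event channel and a timer that is reset on every event. A sketch under those assumptions (idleTimeout and handle are placeholder names; time is the standard library package):
// every event resets the timer; if nothing arrives for idleTimeout, give up on this turn
idle := time.NewTimer(idleTimeout)
defer idle.Stop()

for {
    select {
    case event, ok := <-ch:
        if !ok {
            return // provider closed the channel, the stream is done
        }
        idle.Reset(idleTimeout)
        handle(event) // text delta, tool call, usage stats...
    case <-idle.C:
        return // the model went silent, abort the turn
    }
}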
Agentic loop — the full conversation cycle
The mechanism is simple: the agent sends Chat() with the history and tool list, consumes the response stream, and checks what it got. If the model returned text — the cycle ends, the result goes to the terminal. But if the response contains a tool_use block, it means the model needs something from the outside — it wants to read a file, check the directory structure, query an MCP server. The agent executes the requested tool, appends the result to messages[], and calls Chat() again. The model now sees a longer history — its own previous response plus the tool result — and decides what to do next. It can respond, or it can ask for another tool.
In code, it's literally a for {} in turn.go:
for {
    req := provider.ChatRequest{
        System:   a.system,
        Messages: a.messages,
        Tools:    a.toolDefs(),
    }
    ch, err := a.provider.Chat(turnCtx, req)
    // ...
    result := a.consumeStream(ch, spin)

    // build the assistant message with text and/or tool calls
    a.messages = append(a.messages, assistantMsg)

    // no tool calls → end of cycle
    if len(result.toolCalls) == 0 {
        return nil
    }

    // tool calls present → execute and loop back
    for _, tc := range result.toolCalls {
        a.executeTool(turnCtx, tc)
    }
}
Each iteration is a separate HTTP request with the full conversation history. The history grows — messages[] is accumulated memory where the model sees the entire course of the cycle. What happens after a tool executes? executeTool() calls it through the registry and appends the result to the history:
result, err := a.registry.Execute(ctx, tc.name, params)
// ...
a.appendToolResult(tc.id, result, false)
Where appendToolResult is simply:
func (a *Agent) appendToolResult(toolUseID, content string, isError bool) {
    msg := provider.NewToolResultMessage(toolUseID, content, isError)
    a.messages = append(a.messages, msg)
}
On the next loop iteration, the model sees this message and decides — respond, or ask for another tool.
The loop runs until the model stops requesting tools. During longer sessions, context overflow is prevented by compression (/compact manually, or automatically when we exceed 80% of the window). The model can also request multiple tools at once in a single pass (parallel tool calls) — consumeStream() accumulates them all in a calls array, the agent executes them in sequence, and the results are sent to the model in a single request.
What it looks like in practice
The process above is illustrated by logs saved in ~/.go-agent/agent.log. For example, for the query: "describe the @main.go file":
# startup: config + MCP
time=2026-03-28T17:45:16.248+01:00 level=DEBUG msg="config loaded" provider=ollama
time=2026-03-28T17:45:16.248+01:00 level=INFO msg="mcp connecting" server=context7
time=2026-03-28T17:45:17.380+01:00 level=INFO msg="mcp ready" servers=2 tools=3
# the cycle begins — 1 message in history (just the query), 9 tools in tools[]
time=2026-03-28T17:45:33.588+01:00 level=DEBUG msg="turn start" messages=1 query="opisz plik @main.go"
time=2026-03-28T17:45:33.610+01:00 level=INFO msg="[flow] → request" provider=ollama messages=1 tools=9
# model responds after ~8s: wants to call read_file with argument {"path":"main.go"}
time=2026-03-28T17:45:41.352+01:00 level=INFO msg="[flow] ← tool_call" name=read_file id=ollama_1
time=2026-03-28T17:45:41.353+01:00 level=INFO msg="[flow] input" json="{\"path\":\"main.go\"}"
# agent executes the tool — file has 11 lines
time=2026-03-28T17:45:41.371+01:00 level=DEBUG msg="tool call" name=read_file path=main.go
time=2026-03-28T17:45:41.371+01:00 level=INFO msg="[flow] executing" name=read_file
time=2026-03-28T17:45:41.463+01:00 level=DEBUG msg="tool done" name=read_file lines=11 preview="package main"
time=2026-03-28T17:45:41.463+01:00 level=INFO msg="[flow] result" name=read_file length=122
# result goes into messages[] and the agent sends a second request — now 3 messages: query + model response + tool result
time=2026-03-28T17:45:41.463+01:00 level=INFO msg="[flow] → result" tool_id=ollama_1 is_error=false
time=2026-03-28T17:45:41.502+01:00 level=INFO msg="[flow] → request" provider=ollama messages=3 tools=9
# model responds with text — cycle complete: 1 tool call, 6444 input tokens
time=2026-03-28T17:45:46.876+01:00 level=DEBUG msg="turn done" tool_calls=1 in=6444 out=280
time=2026-03-28T17:45:46.876+01:00 level=INFO msg="[flow] ← response" chars=720
Everything else
The sections above describe the core — the protocol, streaming, the agent loop. But even a basic prototype needs much more:
Providers and runtime switching. The agent supports Anthropic, OpenAI, and Ollama behind a common interface. The provider and model can be changed mid-session (/provider, /model) — no restart, no loss of history.
Sessions. Every conversation is saved as a JSONL file with metadata (working directory, provider, title generated by the LLM). A session can be resumed (/resume) — the agent replays the history and displays a compact summary of previous steps.
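Persisting a session as JSONL is pleasantly boring: one message, one JSON line, appended to the file. A minimal sketch, assuming a provider.Message type like the one used in the snippets above (the function name is made up):
import (
    "encoding/json"
    "os"
)

// appendToSession writes one message as a single JSON line at the end of the session file.
func appendToSession(path string, msg provider.Message) error {
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()
    return json.NewEncoder(f).Encode(msg) // Encode appends the trailing newline
}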
Context compression. During longer sessions, messages[] grows until it starts filling the context window. When it exceeds 80%, the agent automatically asks the model to summarize older messages — taking care not to split tool_use/tool_result pairs. Manual compression is also available via the /compact command.
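The trigger itself is a simple ratio check after each turn; the delicate part is the summarization that keeps tool_use/tool_result pairs intact. A sketch of just the trigger (method and field names assumed, not the actual code):
import "context"

// maybeCompact compares tokens used so far with the model's context window.
func (a *Agent) maybeCompact(ctx context.Context, usedTokens, contextWindow int) error {
    if contextWindow == 0 || float64(usedTokens)/float64(contextWindow) < 0.8 {
        return nil // still comfortably inside the window
    }
    // summarize older messages, keeping tool_use/tool_result pairs together,
    // then replace them with the summary
    return a.compact(ctx)
}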
Execution confirmation. Potentially destructive tools (bash, edit, write_file) require user confirmation before the agent executes them. A single keypress: y / n / a (approve all).
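The real agent reads the keypress through the TUI; a plain-stdin sketch of the same decision looks like this (prompt wording made up):
import (
    "bufio"
    "fmt"
    "strings"
)

// confirm asks before running a destructive tool: y = yes, n = no, a = approve all
func confirm(r *bufio.Reader, toolName string) (approved, approveAll bool) {
    fmt.Printf("run %s? [y/n/a] ", toolName)
    line, _ := r.ReadString('\n')
    switch strings.TrimSpace(strings.ToLower(line)) {
    case "y":
        return true, false
    case "a":
        return true, true
    default:
        return false, false
    }
}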
Filtering <think>. Some models (e.g., Qwen Coder) generate <think>...</think> blocks with internal reasoning. The agent saves them in the history but doesn't display them in the terminal.
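Filtering them for display is plain string surgery on the assembled text. A simplified sketch (the streaming case, where a tag can arrive split across chunks, needs more state):
import "strings"

// stripThink removes <think>...</think> blocks from the text shown in the terminal.
// Assumes complete, non-nested tags in an already-assembled string.
func stripThink(s string) string {
    for {
        start := strings.Index(s, "<think>")
        if start == -1 {
            return s
        }
        end := strings.Index(s[start:], "</think>")
        if end == -1 {
            return s[:start] // unterminated block: hide the rest
        }
        s = s[:start] + s[start+end+len("</think>"):]
    }
}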
TUI. The presentation layer is built on Bubble Tea — multiline input, syntax highlighting (Chroma), markdown rendering (Glamour), an interactive picker for selecting models and sessions, a spinner with a status bar.
MCP, skills, AGENT.md. The agent can connect external tool servers via MCP, load pre-built instruction sets (skills) from SKILL.md files, and inject project context from AGENT.md into the system prompt.