-7 min read

What Is a Harness?

A visual guide to the infrastructure that turns language models into agents.

blogharness-designfundamentals

An LLM, on its own, is a function. You hand it text. It hands you back text.

That's not what people mean when they say "AI agent." An agent books flights. It reads files, runs tests, opens pull requests, drives a browser. It does things.

So how do you get from the first to the second? That's what this post is about.


Part 1: Building the harness

The model, alone

The obvious place to start is to hand the task straight to the model and see how far that gets.

model only

The model knows which airlines fly the route and can ballpark prices. What it can't do is find specific flights or book anything. There's no airline API on the other side of the text box. There's just more text coming back. The request arrives, a reply goes out, and nothing in the real world moves.

For that to change, the model needs a way to trigger code outside the text box.

Giving it tools

The fix is to let the model ask for help. We describe a few things it's allowed to do (searching flights, booking them) and pass those descriptions in with the user's message. Each description has a name, a short explanation, and the shape of the arguments it takes. From the model's point of view, they're just options it can pick.

There's no function-calling magic here. The model still only does one thing: generate text. What changes is what we do with that text. We tell the model: if you want one of these tools, write it out in this exact format.

<search_flights>
{
  from: "CPH",
  to: "AMS",
  date: "2026-05-24",
}
</search_flights>

As the reply streams back, we watch for those tags. The moment we see a <search_flights> opening, we stop treating the output as conversation and start treating it as a request. We parse the JSON, call the real search_flights function, and carry on. The model wrote text in a particular shape. Our code noticed the shape and did something with it. That's all tool use really is.Note: In a real system you wouldn't use a format this simple. A tag like <search_flights> can collide with what the model is actually trying to write. A coding agent might emit <div> or <table> as part of its real answer, and a naive parser would grab it. Production systems use formats that can't plausibly appear in normal output. OpenAI's Harmony is a good example. It defines special tokens like <|start|>, <|message|>, and <|channel|> that are trained into the model and never show up in regular text, so the harness can scan for them without false positives.

model + tools
Tools availablesearch_flightsbook_flight

The model picked the right tool, filled in the arguments, and we got flight data back. Open the Code tab to see how little code this takes.

Then it stops. The search ran, but nobody passed the result back to the model, so it never got to react to what it found. The return flight never got looked up. Nothing got booked.

The agent loop

What's missing is a way to keep going. Once a tool runs, the model has to see the result and decide what to do next. And it has to be able to do that as many times as the task takes.Note: In real systems there is usually a cap on how many turns the agent can run, to prevent runaway loops.

The fix is small. Wrap the whole thing in a loop. Call the model. If it asked for a tool, run the tool, append the result to the conversation, and call the model again. Exit when it stops asking for tools and gives a normal reply.

model + tools + loop
Tools availablesearch_flightsbook_flight

Same model, same tools. The only thing we added was a while and an append. That loop is the agent.

You could stop here and ship something useful. The rest of this post is about what goes on inside that growing conversation. Once the loop works, most of the hard problems in agent design are about the list of messages you keep handing back to the model.


Part 2: Looking inside the harness

What the model sees on every call

The model is stateless. Each call starts from nothing. If you want it to know what happened on the previous turn, you include that turn in the next call. Same for the turn before, and the one before that. What looks like memory is just us re-sending the whole history every time.

What we've been loosely calling "the conversation" is just a messages array we keep in a variable and replay. Below is the full transcript of the flight-booking task. Send a follow-up to see how it grows when the loop keeps going.

the context window7 messages
System prompt
You are a travel assistant. Today is 2026-04-24. You have access to the following tools: <tool name="search_flights"> <description> Search for flights between two airports on a specific date. Returns a list of flights with id, airline, times, duration, and price. </description> <parameters> from (string, required) Origin IATA code, e.g. "CPH" to (string, required) Destination IATA code, e.g. "AMS" date (string, required) ISO 8601 date, e.g. "2026-05-24" </parameters> </tool> <tool name="book_flight"> <description> Book a flight and return a confirmation (PNR). </description> <parameters> flight_id (string, required) An id from a search_flights result date (string, required) ISO 8601 date </parameters> </tool> Always search first, present the options with prices, and ask the user to confirm before booking.
User
Can you book me flights from Copenhagen to Amsterdam for May 24th to 30th?
Assistant
<thinking>
The user wants flights in both directions. Let me start with the outbound from Copenhagen to Amsterdam on May 24th.
</thinking>
<search_flights>
{
from: "CPH",
to: "AMS",
date: "2026-05-24",
}
</search_flights>
Tool result
3 flights found KL1124 KLM 08:05 → 09:30 €132 direct SK1551 SAS 11:20 → 12:45 €148 direct U21802 easyJet 14:40 → 16:05 €119 direct
Assistant
<thinking>
Got the outbound options. Now the return from Amsterdam on May 30th.
</thinking>
<search_flights>
{
from: "AMS",
to: "CPH",
date: "2026-05-30",
}
</search_flights>
Tool result
2 flights found KL1133 KLM 17:15 → 18:40 €141 direct SK1558 SAS 19:55 → 21:20 €156 direct
Assistant
Found two solid options: - Outbound — KL1124, 24/05 at 08:05 (€132) - Return — KL1133, 30/05 at 17:15 (€141) Total €273. Want me to book them?

The system prompt is the first message. It's where you tell the model what it's for, what tools it has, and how to behave. Models are trained to weight what's in the system prompt above what users say, so this is also your main lever for steering behavior. When a user tries to jailbreak their way past the rules, the system prompt usually wins. Not always, but usually. It stays at the top of every call, unchanged. Everything below it (user messages, assistant replies, tool calls, tool results) just piles up. Nothing gets removed automatically.

That pile is the messages array from the loop we wrote in Part 1. It's the only memory the agent has. Whatever is in there is what the model can reason over on the next call.

The cost of remembering

Because the full transcript goes out on every call, the input the model has to read grows with every turn.

context size per LLM call (tokens)
Call 1
180
Call 2
520
Call 3
880
Call 4
1,640
Call 5
2,900
Call 6
5,100

Every tool result sticks around. Every assistant reply sticks around. By the sixth call the model is re-reading everything it has already done — which is why managing this context is the real engineering problem.

Every noisy tool call keeps costing you. A test runner that dumps 200 lines of output, or a file read that returns a 4,000-line source file. They sit in the transcript for the rest of the task and get re-read on every call after they first fired. One careless tool call at step 2 can double the input the model has to wade through at step 10.

That's where most of the design work happens. Once the loop works, what's left is figuring out what goes into the transcript and what stays out. Compaction, sub-agents, skills, memory hierarchies. All different answers to the same question: what belongs in the context, and what doesn't?


Where this leads

That's the whole thing. An agent is:

  • A model
  • A set of tools the model can call
  • A loop that keeps calling the model with the growing transcript

Everything else (system prompts, sub-agents, context compaction, skills, permission gates, verification loops, memory hierarchies, etc.) is a refinement of one of those three pieces.

Reshaping the tools starts with which tools you hand the agent, how many, and how well they're designed. Too few and the agent can't do the job. Too many and it gets lost picking between them. A tool that dumps too much output, takes too many arguments, or just does the wrong thing will wreck a run on its own. The rest of the harness can't save you from it.

Reshaping the loop usually means injecting something between model calls: a verifier, a compaction pass, a handoff to a sub-agent.

When you read about a new agent framework or a clever prompt technique, you can usually locate it on this map: is it poking at the tools, the loop, or the transcript?

That's the map. Most of what follows in this series is just working out the details.


This is the first in a series on harness design. Next post: experiments on progressive disclosure. How agents best navigate large context, and what shape that context should take. Reach out at noah@schenktechnology.com if you're building agents and want to compare notes.

All posts