What Is Harness Engineering: From Prompts to a Full Agent Wrapper
In 2023 everyone was memorizing "how to write a good prompt." In 2025 the conversation shifted to "context engineering." In 2026 OpenAI dropped yet another term — Harness Engineering.
It's tempting to roll your eyes at every new buzzword, but if you look closely each one expands the engineering boundary outward: from talking to a model in a single sentence, to organizing the full input context, to embedding the model in a complete execution loop.
This article maps that evolution clearly — what each term actually solves, how they relate, and how Coding Agents like Claude Code put them all together.
The Starting Point: What an LLM Actually Is
Strip away the wrappers around ChatGPT, Claude, DeepSeek, and what's left is an LLM — fundamentally just a giant parameter file on disk.
| What you see | What it actually is |
|---|---|
| ChatGPT website | LLM + chat UI |
| OpenAI API | LLM + HTTP interface |
| Claude Code | LLM + terminal + filesystem + MCP |
| Cursor | LLM + code editor + project index |
The model itself does exactly one thing: predict the next token given the current input. It's a high-dimensional "guess-the-next-word" machine: feed it a prefix, get a likely continuation sampled from its predicted distribution.
This single property is the root of everything: the model doesn't know what you actually want, it can only infer from input. The vaguer the instruction, the more divergent the output.
Drop a code snippet on it and say "add sorting" — you might get back an isolated sort function with the surrounding code stripped. Say "without changing other logic, add created_at DESC ordering and return the entire modified function" — now the result is on target.
This is the starting point of every layer below: how do you get the guess-machine to consistently guess what you mean?
Layer 1: Prompt Engineering
Prompt engineering was the first wave. The core move: state the constraints explicitly in the input.
Anatomy of a Minimal Prompt
Every line narrows the model's guess space.
Common Prompt Components
| Component | Purpose | Example |
|---|---|---|
| Role | Restrict tone and depth | "You are an SRE skilled in K8s" |
| Task | Primary goal | "Help me debug this OOM" |
| Background | Current state | "Service runs on EKS, 4 GB limit" |
| Input data | What to operate on | Logs, code, tables |
| Output format | Structural limits | JSON / Markdown / code block |
| Boundary | Forbidden actions | "No speculation, only facts" |
| Few-shot examples | Style anchoring | One or two demo I/O pairs |
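The components in the table can be assembled mechanically. The following is a minimal sketch, not a standard API: `PromptSpec` and `build_prompt` are hypothetical names introduced here, and the label wording ("Role:", "Task:") is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the prompt components listed in the table."""
    role: str
    task: str
    background: str = ""
    input_data: str = ""
    output_format: str = ""
    boundary: str = ""
    examples: list[tuple[str, str]] = field(default_factory=list)


def build_prompt(spec: PromptSpec) -> str:
    """Render only the components that were provided, in a fixed order."""
    sections = [
        ("Role", spec.role),
        ("Task", spec.task),
        ("Background", spec.background),
        ("Input", spec.input_data),
        ("Output format", spec.output_format),
        ("Boundary", spec.boundary),
    ]
    lines = [f"{label}: {text}" for label, text in sections if text]
    # Few-shot examples go last so they anchor the style of the answer.
    for example_input, example_output in spec.examples:
        lines.append(f"Example input: {example_input}")
        lines.append(f"Example output: {example_output}")
    return "\n".join(lines)


prompt = build_prompt(PromptSpec(
    role="You are an SRE skilled in K8s",
    task="Help me debug this OOM",
    background="Service runs on EKS, 4 GB limit",
    output_format="Markdown",
    boundary="No speculation, only facts",
))
print(prompt)
```

Empty components are skipped, so the same template serves both a two-line prompt and a fully specified one.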
Where Prompt Engineering Hits Its Limits
It solves "the model speaks without guidance" — but that's not enough.
With prompts alone, every conversation is isolated: the model doesn't know what just happened, doesn't know your project structure, doesn't know what error showed up after the last code change. Prompts only carry static information. Once tasks get complex, hand-written prompts can't keep up.
Layer 2: Context Engineering
No matter how good your prompt is, the model only sees one input. The moment your task involves multi-turn dialogue, multi-file codebases, or dynamic environments, prompts alone fall short — you need to feed the model all the relevant information.
Context = A Bigger Concept Than Prompt
The prompt is just one part of context. What actually goes to the model is the full context.
Context Window and Context Rot
The total information the model can "see" at once is the context window:
| Model | Context Window |
|---|---|
| GPT-3.5 (early) | 4K tokens |
| GPT-4 Turbo | 128K |
| Claude Sonnet 4 | 200K |
| Claude Opus 4.7 | 1M |
| Gemini 2.5 Pro | 1M |
Bigger every year — but in real engineering it's easy to fill up. Reading a few files in a large codebase quickly hits tens of thousands of tokens; a few rounds of tool results and the window's done.
Once full, you have two paths: compress (let the model summarize history) or drop (truncate old messages). Either way, you get context rot:
- Early information gets compressed into a one-liner; details vanish
- Critical constraints get truncated; the model forgets "don't change the schema"
- Decision causality breaks; the model repeats earlier mistakes
The result: the model gets less accurate over time, contradicts itself, forgets its initial goals.
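The "drop" path is easy to reproduce. The sketch below assumes a crude estimate of four characters per token (real tokenizers differ) and uses a made-up history; `estimate_tokens` and `truncate_oldest` are hypothetical helpers, not part of any tool named in this article.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: assume roughly 4 characters per token."""
    return max(1, len(text) // 4)


def truncate_oldest(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages that fit the budget; everything older is dropped."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))


# Made-up conversation: the constraint comes first, bulky tool output follows.
history = [
    "Constraint: don't change the schema.",
    "Tool result: " + "x" * 400,
    "Tool result: " + "y" * 400,
    "User: now add the created_at column handling.",
]
window = truncate_oldest(history, budget=150)
print(history[0] in window)  # the constraint did not survive truncation
```

The oldest message is the first casualty, and in this example the oldest message is exactly the constraint that mattered.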
The Three Moves of Context Engineering
To fight context rot, the work boils down to three steps:
1. Retrieve
Pull the most relevant information from history, codebase, error logs, doc stores. RAG (Retrieval-Augmented Generation) is the canonical implementation.
2. Compress
Not naive truncation — structured compression. For example:
- Distill 50 turns of dialogue into 10 lines of decisions
- Compress an entire file read into "this file defines class X with method Y"
- Reduce 5000 lines of logs to "OOM at line 42, root cause: cache not freed"
3. Compose
Adjust ordering. Put the most important content near the end: models tend to weight recent input more heavily, an empirically observed tendency often called recency bias. Content buried in the middle of a long context is the most likely to be overlooked.
A typical context layout:
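One way to picture the three moves working together is the sketch below. It is illustrative only: word overlap stands in for real retrieval (embeddings or RAG), a `DECISION:` prefix is an assumed convention for marking what must survive compression, and every function name and sample string is hypothetical.

```python
def retrieve(query: str, documents: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query; drop zero-score hits."""
    query_words = set(query.lower().split())

    def score(name: str) -> int:
        return len(query_words & set(documents[name].lower().split()))

    ranked = sorted(documents, key=score, reverse=True)
    return [name for name in ranked[:top_k] if score(name) > 0]


def compress(history: list[str], marker: str = "DECISION:") -> list[str]:
    """Structured compression: keep only the lines flagged as decisions."""
    return [line for line in history if line.startswith(marker)]


def compose(system: str, retrieved: list[str], decisions: list[str], request: str) -> str:
    """Order the sections so the current request lands last."""
    return "\n\n".join([system, *retrieved, *decisions, request])


documents = {
    "auth.py": "login handler checks the password hash",
    "deploy.md": "deployment steps for staging and production",
    "README.md": "project overview",
}
history = [
    "user: hello",
    "DECISION: keep the schema unchanged",
    "assistant: some chit-chat",
    "DECISION: use bcrypt for hashes",
]
query = "fix the login password check"

hits = retrieve(query, documents)
context = compose(
    "System: you are a coding assistant",
    [f"{name}: {documents[name]}" for name in hits],
    compress(history),
    f"Request: {query}",
)
print(context)
```

The layout that comes out is: stable instructions first, retrieved material and distilled decisions in the middle, the live request at the very end.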
Prompt engineering is a subset of context engineering: prompt engineering handles only the system prompt layer; context engineering manages the entire dynamic information flow.
Different Tools, Different Context Strategies
The same GPT-4 / Claude wrapped in different shells produces wildly different results:
| Tool | Context Strategy |
|---|---|
| Cursor | Project-wide embedding index; retrieves on demand |
| Claude Code | Terminal-native; proactively reads files; tool results flow into context |
| Trae | Cursor-like, optimized for multi-file editing |
| WorkBuddy | Workflow orchestration with memory and skill injection |
It's not which model is stronger — it's which shell does context engineering better.
Layer 3: Harness Engineering
By this point the model is still just chatting — it understands context, gives suggestions, but can't actually do things.
To make it a worker, it needs hands and feet.
Give the Model Execution Capabilities
Add these around the model:
| Capability | What It Does |
|---|---|
| Bash sandbox | Run shell commands |
| Filesystem | Read, write, create files |
| HTTP client | Call APIs, fetch web pages |
| MCP (Model Context Protocol) | Standardized external tool protocol |
| Code execution env | Run tests, run lint |
| Browser automation | Real browser interaction |
Together these form the Execution Layer. The model decides "what to do"; the wrapper executes "how to do it".
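The split between "what to do" and "how to do it" can be sketched as a small tool registry. This assumes the model emits its request as a JSON object with `tool` and `args` keys; that wire format, the `tool` decorator, and `dispatch` are all assumptions made for illustration, not the format of any particular agent or of MCP.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# Registry mapping tool names to plain Python functions.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(func: Callable[..., str]) -> Callable[..., str]:
    """Register a function so the wrapper can call it by name."""
    TOOLS[func.__name__] = func
    return func


@tool
def write_file(path: str, content: str) -> str:
    Path(path).write_text(content, encoding="utf-8")
    return f"wrote {len(content)} characters"


@tool
def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def dispatch(model_output: str) -> str:
    """Parse a JSON tool request from the model and run the matching tool."""
    request = json.loads(model_output)
    name = request["tool"]
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**request["args"])


# The "model output" here is hand-written JSON standing in for a real model.
workdir = Path(tempfile.mkdtemp())
target = str(workdir / "notes.txt")
dispatch(json.dumps({"tool": "write_file", "args": {"path": target, "content": "hello"}}))
result = dispatch(json.dumps({"tool": "read_file", "args": {"path": target}}))
print(result)
```

Unknown tools return an error string instead of raising, so the failure can be handed back to the model as ordinary text.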
The ReAct Loop: Reason + Act
Wire execution to the model in a loop and you get ReAct:
This is the skeletal pattern of every AI Agent: a think-act loop that repeats until the model declares the task done or a step limit is reached.
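A minimal sketch of such a loop follows. The model is replaced by a scripted function so the example runs without any API; the action dictionary shape (`type`, `tool`, `input`, `answer`) and the `max_steps` cap are assumptions chosen for illustration.

```python
from typing import Callable

Action = dict[str, str]


def react_loop(
    model: Callable[[list[str]], Action],
    tools: dict[str, Callable[[str], str]],
    goal: str,
    max_steps: int = 10,
) -> list[str]:
    """Alternate between asking the model for an action and running it."""
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action["type"] == "finish":
            transcript.append(f"Answer: {action['answer']}")
            return transcript
        observation = tools[action["tool"]](action["input"])
        transcript.append(f"Action: {action['tool']}({action['input']!r})")
        transcript.append(f"Observation: {observation}")
    transcript.append("Stopped: step budget exhausted")
    return transcript


def scripted_model(transcript: list[str]) -> Action:
    """Hypothetical stand-in for an LLM: call one tool, then report its result."""
    if not any(line.startswith("Observation:") for line in transcript):
        return {"type": "act", "tool": "add", "input": "2 3"}
    last = transcript[-1].removeprefix("Observation: ")
    return {"type": "finish", "answer": last}


tools = {"add": lambda text: str(sum(int(part) for part in text.split()))}
trace = react_loop(scripted_model, tools, "add 2 and 3")
print("\n".join(trace))
```

Note that the transcript grows by two lines per step, which is exactly the bloat the next section describes.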
But Long Loops Break Things
ReAct's downside: the longer the loop, the more context bloats. Even with great context engineering, 50 rounds of tool calls accumulate. Symptoms:
- The agent reaches step 30 and forgets the original requirement
- After reading too many files, it loses track of "which is the entry point"
- Earlier prohibitions get compressed away; the agent starts doing forbidden things
Compression alone isn't enough — you need to pin core information into every call.
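Pinning can be as simple as keeping two lists and rebuilding the input on every call. In this sketch `build_call` is a hypothetical helper and the history is synthetic; the point is only that the pinned lines are never subject to trimming.

```python
def build_call(pinned: list[str], history: list[str], max_history: int) -> list[str]:
    """Assemble one model call: pinned lines always survive, history is trimmed."""
    recent = history[-max_history:] if max_history > 0 else []
    return [*pinned, *recent]


pinned = ["Goal: implement login", "Rule: don't change the schema"]
history = [f"step {n}: tool output" for n in range(1, 51)]

call = build_call(pinned, history, max_history=5)
print(call)
```

After 50 steps the call still opens with the goal and the rule, followed by only the five most recent steps.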
The Four Layers of Harness
Here's the complete Harness Engineering architecture:
Execution Layer
Lets the model "act". Filesystem, Bash, HTTP, browsers, MCP — covered above.
Feedback Layer
Execution doesn't always go well, and errors are the agent's eyes. The feedback layer routes errors back into context properly:
A good feedback layer doesn't just paste raw errors — it also:
- Filters noise (in a Node.js stack trace, most frames often point into node_modules; drop them)
- Highlights signals ("most relevant failure point from last run")
- Provides environment context (OS, Node version, file existence)
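The three jobs above can be sketched in a few lines. The stack trace below is an invented sample (including the `some-lib` package name), and `filter_stack` and `build_feedback` are hypothetical helpers rather than the behavior of any specific agent.

```python
import platform


def filter_stack(raw_error: str, noise_marker: str = "node_modules") -> str:
    """Drop frames that point into third-party code and note how many were cut."""
    lines = raw_error.splitlines()
    kept = [line for line in lines if noise_marker not in line]
    dropped = len(lines) - len(kept)
    if dropped:
        kept.append(f"({dropped} frames from {noise_marker} omitted)")
    return "\n".join(kept)


def build_feedback(raw_error: str, command: str) -> str:
    """Wrap the filtered error with the command and basic environment facts."""
    return "\n".join([
        f"Command failed: {command}",
        f"Environment: {platform.system()}",
        "Most relevant output:",
        filter_stack(raw_error),
    ])


# Invented sample trace for illustration only.
raw = """TypeError: cannot read properties of undefined (reading 'id')
    at getUser (src/user.js:42:17)
    at handle (node_modules/some-lib/layer.js:95:5)
    at next (node_modules/some-lib/route.js:144:13)"""

feedback = build_feedback(raw, "npm test")
print(feedback)
```

Recording how many frames were omitted keeps the filtering honest: the model can still ask for the full trace if the remaining lines are not enough.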
Memory Layer
Put project core information into a dedicated file that gets injected on every call. This is Claude Code's CLAUDE.md, Cursor's .cursorrules, WorkBuddy's MEMORY.md:
Memory layer keys: lightweight, extensible, lazy-loaded. Split if it grows too long:
The entry says "for deployment, see deploy.md" — the model loads details on demand instead of dumping 50 KB into context up front.
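Lazy loading can be sketched as an entry file that points at detail files. The `topic -> file` line syntax below is an assumption invented for this example (it is not a CLAUDE.md or MEMORY.md feature), as are `load_memory` and the file contents.

```python
import tempfile
from pathlib import Path


def load_memory(entry: Path, task: str) -> str:
    """Return the entry file plus any linked detail file whose topic the task mentions."""
    entry_text = entry.read_text(encoding="utf-8")
    parts = [entry_text]
    for line in entry_text.splitlines():
        if " -> " not in line:
            continue
        topic, filename = (piece.strip() for piece in line.split(" -> ", 1))
        if topic.lower() in task.lower():
            parts.append((entry.parent / filename).read_text(encoding="utf-8"))
    return "\n".join(parts)


# Throwaway directory with made-up file contents for the demo.
root = Path(tempfile.mkdtemp())
(root / "MEMORY.md").write_text(
    "Project: web shop\ndeployment -> deploy.md\ntesting -> testing.md\n",
    encoding="utf-8",
)
(root / "deploy.md").write_text("Deploy with the staging script first.\n", encoding="utf-8")
(root / "testing.md").write_text("Run unit tests before every commit.\n", encoding="utf-8")

memory = load_memory(root / "MEMORY.md", task="Fix the deployment script")
print(memory)
```

Only the deployment notes are pulled in for a deployment task; the testing notes stay on disk until a task mentions them.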
Orchestration Layer
The top layer manages "what comes first, what comes next." Without it, agents handed a big requirement either wander aimlessly or loop forever.
The orchestration layer does three things:
| Action | Example |
|---|---|
| Decompose | "Implement login" → ① schema ② backend API ③ frontend form ④ unit tests |
| Acceptance criteria | Each subtask has a "done means..." |
| Sequencing | Schema must come before API implementation |
This is the heart of Spec-Driven Development (SDD) — write the spec first, then have the agent execute against it step by step.
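Decomposition, acceptance criteria, and sequencing map naturally onto a small dependency graph. The sketch below uses the standard-library `graphlib.TopologicalSorter`; the `Subtask` class, the acceptance wording, and the dependency edges for the login example are assumptions added for illustration.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Subtask:
    """Hypothetical record: a step, its acceptance criterion, and its prerequisites."""
    name: str
    done_means: str
    depends_on: tuple[str, ...] = ()


def plan_order(subtasks: list[Subtask]) -> list[str]:
    """Return subtask names in an order that respects every dependency."""
    graph = {task.name: set(task.depends_on) for task in subtasks}
    return list(TopologicalSorter(graph).static_order())


# Deliberately listed out of order; the sorter restores a valid sequence.
login = [
    Subtask("unit tests", "all tests pass", ("backend API", "frontend form")),
    Subtask("frontend form", "form submits credentials", ("backend API",)),
    Subtask("backend API", "login endpoint returns a token", ("schema",)),
    Subtask("schema", "users table migration applies", ()),
]

order = plan_order(login)
print(order)
```

`TopologicalSorter` raises `CycleError` if two subtasks depend on each other, which is a useful early signal that the spec itself is inconsistent.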
Formulas and Causality
OpenAI's stated formula:
And the harness itself:
Drilling into the relationship between the three terms:
| Engineering | Object | Problem It Solves |
|---|---|---|
| Prompt | A single input's wording | The model doesn't understand instructions |
| Context | The full input's organization | Information doesn't fit or isn't relevant |
| Harness | The complete loop around the model | The model can't execute continuously |
Each layer doesn't replace the previous one — it wraps around it.
In Practice: Implementing All Four Layers with Claude Code
Claude Code is one of the most "out-of-the-box harness" Coding Agents available: all four layers are natively supported.
Minimum Viable: Write a Good CLAUDE.md
Just create CLAUDE.md at your project root:
This file is loaded into context automatically at the start of every Claude Code session. That's your memory layer.
Going Further: Orchestration Plugins like Spec-Kit
Manual planning is tedious — use a plugin like Spec-Kit that forces development into three phases:
| Phase | Output |
|---|---|
| Specify | A clear constraint document (spec.md) |
| Plan | A step-by-step plan (plan.md) |
| Execute | Code changes with tests |
Each phase updates CLAUDE.md, ensuring the next phase receives only the distilled essentials. This is the engineering form of orchestration + memory layers, fundamentally Spec-Driven Development.
Feedback Layer Comes Free
When Claude Code runs npm run lint and it fails, stderr automatically flows into the next round of context. Zero work on your part — that's standard feedback layer.
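If you are building your own wrapper rather than relying on Claude Code, the same behavior takes only a few lines. This is a sketch of the general idea, not Claude Code's implementation: the failing command is a Python one-liner standing in for a linter, and `run_and_capture` is a hypothetical helper.

```python
import subprocess
import sys


def run_and_capture(command: list[str]) -> str:
    """Run a command and return its exit code plus the stream worth reading."""
    completed = subprocess.run(command, capture_output=True, text=True)
    if completed.returncode == 0:
        return f"exit 0\n{completed.stdout}"
    return f"exit {completed.returncode}\n{completed.stderr}"


# Stand-in for a failing lint run: writes one message to stderr and exits with 1.
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('lint: unused variable x\\n'); sys.exit(1)",
]

context = ["User: run the linter"]
context.append(run_and_capture(failing))
print(context[-1])
```

The next model call receives `context`, so the failure text is in front of the model without any manual copy-paste.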
Execution Layer's Extension Points
Need Claude Code to operate Notion / GitHub / Slack? Install an MCP server:
Execution layer expanded.
Common Misconceptions
"Prompt engineering is dead"
Wrong. Prompts are the innermost core of harness — every outer layer eventually delegates to a well-crafted prompt. They didn't disappear; they became one part of a bigger system.
"Context engineering = RAG"
RAG is just the retrieve step. Context engineering also covers compression, composition, and pacing of model interaction.
"Harness equals Agent frameworks"
Frameworks like LangChain / LlamaIndex do harness work, but having a framework isn't the same as having a real harness. A hand-written CLI with CLAUDE.md + tool calling + task planning is also complete harness engineering.
"Stronger models will make harnesses obsolete"
The opposite. Stronger models make harness more valuable — a model that can write code + a strong harness = an agent that genuinely delivers; a model that can write code + no harness = a chatbot that gives suggestions.
A Practical Checklist
If you're building an agent, work through this in order:
- Write a solid prompt template — role, task, output format, boundaries
- Add a project rules file (CLAUDE.md / .cursorrules) — pin core constraints
- Wire in tool calling — file I/O, Bash, HTTP at minimum
- Design a feedback path — errors must flow back into the next round
- Add task decomposition — big tasks must split into steps with acceptance criteria
- Watch context window utilization: as a rule of thumb, once it passes roughly 60%, start compressing or splitting the work into phases
- Run evaluations — measure pass rate on real tasks each iteration, not "looks right"
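The last two checklist items are simple enough to compute directly. In this sketch the 0.6 threshold mirrors the rule of thumb in the checklist, the token counts and task results are made-up numbers, and both function names are hypothetical.

```python
def should_compress(used_tokens: int, window_tokens: int, threshold: float = 0.6) -> bool:
    """True once context utilization exceeds the chosen threshold."""
    return used_tokens / window_tokens > threshold


def pass_rate(results: list[bool]) -> float:
    """Fraction of evaluation tasks that passed; 0.0 when there are no results."""
    return sum(results) / len(results) if results else 0.0


print(should_compress(130_000, 200_000))  # 65% utilization
print(should_compress(100_000, 200_000))  # 50% utilization
print(pass_rate([True, True, False, True]))
```

Tracking the pass rate per iteration turns "looks right" into a number you can compare before and after each harness change.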
Summary
The three terms aren't replacements for each other — the engineering boundary is just expanding outward:
| Era | Key Capability | Key Artifact |
|---|---|---|
| 2023 | Prompt engineering | A good instruction |
| 2025 | Context engineering | A well-organized input |
| 2026 | Harness engineering | A wrapper that delivers continuously |
Formulas:
Boiled down to three lines:
- Prompt engineering — make the model understand the requirement and output spec
- Context engineering — feed the model precise, relevant context
- Harness engineering — let the model execute against spec until delivery
The next term will likely come from the multi-agent collaboration space — but the underlying logic stays the same: the model thinks; engineering does everything else.