What Is Harness Engineering: From Prompts to a Full Agent Wrapper

In 2023 everyone was memorizing "how to write a good prompt." In 2025 the conversation shifted to "context engineering." In 2026 OpenAI dropped yet another term — Harness Engineering.

It's tempting to roll your eyes at every new buzzword, but if you look closely each one expands the engineering boundary outward: from talking to a model in a single sentence, to organizing the full input context, to embedding the model in a complete execution loop.

This article maps that evolution clearly — what each term actually solves, how they relate, and how Coding Agents like Claude Code put them all together.

The Starting Point: What an LLM Actually Is

Strip away the wrappers around ChatGPT, Claude, DeepSeek, and what's left is an LLM — fundamentally just a giant parameter file on disk.

What you see	What it actually is
ChatGPT website	LLM + chat UI
OpenAI API	LLM + HTTP interface
Claude Code	LLM + terminal + filesystem + MCP
Cursor	LLM + code editor + project index

The model itself does exactly one thing: predict the next token given the current input. It's a high-dimensional "guess-the-next-word" machine — feed it a prefix, get the most likely continuation.

This single property is the root of everything: the model doesn't know what you actually want, it can only infer from input. The vaguer the instruction, the more divergent the output.

Drop a code snippet on it and say "add sorting" — you might get back an isolated sort function with the surrounding code stripped. Say "without changing other logic, add created_at DESC ordering and return the entire modified function" — now the result is on target.

This is the starting point of every layer below: how do you get the guess-machine to consistently guess what you mean?

Layer 1: Prompt Engineering

Prompt engineering was the first wave. The core move: state the constraints explicitly in the input.

Anatomy of a Minimal Prompt

Every line narrows the model's guess space.

Common Prompt Components

Component	Purpose	Example
Role	Restrict tone and depth	"You are an SRE skilled in K8s"
Task	Primary goal	"Help me debug this OOM"
Background	Current state	"Service runs on EKS, 4 GB limit"
Input data	What to operate on	Logs, code, tables
Output format	Structural limits	JSON / Markdown / code block
Boundary	Forbidden actions	"No speculation, only facts"
Few-shot examples	Style anchoring	One or two demo I/O pairs

Where Prompt Engineering Hits Its Limits

It solves "the model speaks without guidance" — but that's not enough.

With prompts alone, every conversation is isolated: the model doesn't know what just happened, doesn't know your project structure, doesn't know what error showed up after the last code change. Prompts only carry static information. Once tasks get complex, hand-written prompts can't keep up.

Layer 2: Context Engineering

No matter how good your prompt is, the model only sees one input. The moment your task involves multi-turn dialogue, multi-file codebases, or dynamic environments, prompts alone fall short — you need to feed the model all the relevant information.

Context = A Bigger Concept Than Prompt

The prompt is just one part of context. What actually goes to the model is the full context.

Context Window and Context Rot

The total information the model can "see" at once is the context window:

Model	Context Window
GPT-3.5 (early)	4K tokens
GPT-4 Turbo	128K
Claude Sonnet 4	200K
Claude Opus 4.7	1M
Gemini 2.5 Pro	2M

Bigger every year — but in real engineering it's easy to fill up. Reading a few files in a large codebase quickly hits tens of thousands of tokens; a few rounds of tool results and the window's done.

Once full, you have two paths: compress (let the model summarize history) or drop (truncate old messages). Either way, you get context rot:

Early information gets compressed into a one-liner; details vanish
Critical constraints get truncated; the model forgets "don't change the schema"
Decision causality breaks; the model repeats earlier mistakes

The result: the model gets less accurate over time, contradicts itself, forgets its initial goals.

The Three Moves of Context Engineering

To fight context rot, the work boils down to three steps:

1. Retrieve

Pull the most relevant information from history, codebase, error logs, doc stores. RAG (Retrieval-Augmented Generation) is the canonical implementation.

2. Compress

Not naive truncation — structured compression. For example:

Distill 50 turns of dialogue into 10 lines of decisions
Compress an entire file read into "this file defines class X with method Y"
Reduce 5000 lines of logs to "OOM at line 42, root cause: cache not freed"

3. Compose

Adjust ordering. The most important content goes near the end — models pay highest attention to recent input (this is a transformer architectural property called recency bias).

A typical context layout:

Prompt engineering is a subset of context engineering: prompt engineering handles only the system prompt layer; context engineering manages the entire dynamic information flow.

Different Tools, Different Context Strategies

The same GPT-4 / Claude wrapped in different shells produces wildly different results:

Tool	Context Strategy
Cursor	Project-wide embedding index; retrieves on demand
Claude Code	Terminal-native; proactively reads files; tool results flow into context
Trae	Cursor-like, optimized for multi-file editing
WorkBuddy	Workflow orchestration with memory and skill injection

It's not which model is stronger — it's which shell does context engineering better.

Layer 3: Harness Engineering

By this point the model is still just chatting — it understands context, gives suggestions, but can't actually do things.

To make it a worker, it needs hands and feet.

Give the Model Execution Capabilities

Add these around the model:

Capability	What It Does
Bash sandbox	Run shell commands
Filesystem	Read, write, create files
HTTP client	Call APIs, fetch web pages
MCP (Model Context Protocol)	Standardized external tool protocol
Code execution env	Run tests, run lint
Browser automation	Real browser interaction

Together these form the Execution Layer. The model decides "what to do"; the wrapper executes "how to do it".

The ReAct Loop: Reason + Act

Wire execution to the model in a loop and you get ReAct:

This is the skeletal pattern of every AI Agent — a never-ending think-act loop.

But Long Loops Break Things

ReAct's downside: the longer the loop, the more context bloats. Even with great context engineering, 50 rounds of tool calls accumulate. Symptoms:

The agent reaches step 30 and forgets the original requirement
After reading too many files, it loses track of "which is the entry point"
Earlier prohibitions get compressed away; the agent starts doing forbidden things

Compression alone isn't enough — you need to pin core information into every call.

The Four Layers of Harness

Here's the complete Harness Engineering architecture:

Execution Layer

Lets the model "act". Filesystem, Bash, HTTP, browsers, MCP — covered above.

Feedback Layer

Execution doesn't always go well, and errors are the agent's eyes. The feedback layer routes errors back into context properly:

A good feedback layer doesn't just paste raw errors — it also:

Filters noise (90% of stack frames are node_modules, drop them)
Highlights signals ("most relevant failure point from last run")
Provides environment context (OS, Node version, file existence)

Memory Layer

Put project core information into a dedicated file that gets injected on every call. This is Claude Code's CLAUDE.md, Cursor's .cursorrules, WorkBuddy's MEMORY.md:

Memory layer keys: lightweight, extensible, lazy-loaded. Split if it grows too long:

The entry says "for deployment, see deploy.md" — the model loads details on demand instead of dumping 50 KB into context up front.

Orchestration Layer

The top layer manages "what comes first, what comes next." Without it, agents handed a big requirement either wander aimlessly or loop forever.

The orchestration layer does three things:

Action	Example
Decompose	"Implement login" → ① schema ② backend API ③ frontend form ④ unit tests
Acceptance criteria	Each subtask has a "done means..."
Sequencing	Schema must come before API implementation

This is the heart of Spec-Driven Development (SDD) — write the spec first, then have the agent execute against it step by step.

Formulas and Causality

OpenAI's stated formula:

And the harness itself:

Drilling into the relationship between the three terms:

Engineering	Object	Problem It Solves
Prompt	A single input's wording	The model doesn't understand instructions
Context	The full input's organization	Information doesn't fit or isn't relevant
Harness	The complete loop around the model	The model can't execute continuously

Each layer doesn't replace the previous one — it wraps around it.

In Practice: Implementing All Four Layers with Claude Code

Claude Code is currently the most "out-of-the-box harness" Coding Agent — all four layers are natively supported.

Minimum Viable: Write a Good CLAUDE.md

Just create CLAUDE.md at your project root:

This file gets auto-injected as a system prompt on every Claude Code conversation — that's your memory layer.

Going Further: Orchestration Plugins like Spec-Kit

Manual planning is tedious — use a plugin like Spec-Kit that forces development into three phases:

Phase	Output
Specify	A clear constraint document (spec.md)
Plan	A step-by-step plan (plan.md)
Execute	Code changes with tests

Each phase updates CLAUDE.md, ensuring the next phase receives only the distilled essentials. This is the engineering form of orchestration + memory layers, fundamentally Spec-Driven Development.

Feedback Layer Comes Free

When Claude Code runs npm run lint and it fails, stderr automatically flows into the next round of context. Zero work on your part — that's standard feedback layer.

Execution Layer's Extension Points

Need Claude Code to operate Notion / GitHub / Slack? Install an MCP server:

Execution layer expanded.

Common Misconceptions

"Prompt engineering is dead"

Wrong. Prompts are the innermost core of harness — every outer layer eventually delegates to a well-crafted prompt. They didn't disappear; they became one part of a bigger system.

"Context engineering = RAG"

RAG is just the retrieve step. Context engineering also covers compression, composition, and pacing of model interaction.

"Harness equals Agent frameworks"

Frameworks like LangChain / LlamaIndex do harness work, but having a framework isn't the same as having a real harness. A hand-written CLI with CLAUDE.md + tool calling + task planning is also complete harness engineering.

"Stronger models will make harnesses obsolete"

The opposite. Stronger models make harness more valuable — a model that can write code + a strong harness = an agent that genuinely delivers; a model that can write code + no harness = a chatbot that gives suggestions.

A Practical Checklist

If you're building an agent, work through this in order:

Write a solid prompt template — role, task, output format, boundaries
Add a project rules file (CLAUDE.md / .cursorrules) — pin core constraints
Wire in tool calling — file I/O, Bash, HTTP at minimum
Design a feedback path — errors must flow back into the next round
Add task decomposition — big tasks must split into steps with acceptance criteria
Watch context window utilization — over 60% means start compressing or phasing
Run evaluations — measure pass rate on real tasks each iteration, not "looks right"

Summary

The three terms aren't replacements for each other — the engineering boundary is just expanding outward:

Era	Key Capability	Key Artifact
2023	Prompt engineering	A good instruction
2025	Context engineering	A well-organized input
2026	Harness engineering	A wrapper that delivers continuously

Formulas:

Boiled to one paragraph:

Prompt engineering — make the model understand the requirement and output spec
Context engineering — feed the model precise, relevant context
Harness engineering — let the model execute against spec until delivery

The next term will likely come from the multi-agent collaboration space — but the underlying logic stays the same: the model thinks; engineering does everything else.

In 2023 everyone was memorizing "how to write a good prompt." In 2025 the conversation shifted to "context engineering." In 2026 OpenAI dropped yet another term — Harness Engineering.

This article maps that evolution clearly — what each term actually solves, how they relate, and how Coding Agents like Claude Code put them all together.

The Starting Point: What an LLM Actually Is

Strip away the wrappers around ChatGPT, Claude, DeepSeek, and what's left is an LLM — fundamentally just a giant parameter file on disk.

What you see	What it actually is
ChatGPT website	LLM + chat UI
OpenAI API	LLM + HTTP interface
Claude Code	LLM + terminal + filesystem + MCP
Cursor	LLM + code editor + project index

This single property is the root of everything: the model doesn't know what you actually want, it can only infer from input. The vaguer the instruction, the more divergent the output.

This is the starting point of every layer below: how do you get the guess-machine to consistently guess what you mean?

Layer 1: Prompt Engineering

Prompt engineering was the first wave. The core move: state the constraints explicitly in the input.

Anatomy of a Minimal Prompt

Every line narrows the model's guess space.

Common Prompt Components

Component	Purpose	Example
Role	Restrict tone and depth	"You are an SRE skilled in K8s"
Task	Primary goal	"Help me debug this OOM"
Background	Current state	"Service runs on EKS, 4 GB limit"
Input data	What to operate on	Logs, code, tables
Output format	Structural limits	JSON / Markdown / code block
Boundary	Forbidden actions	"No speculation, only facts"
Few-shot examples	Style anchoring	One or two demo I/O pairs

Where Prompt Engineering Hits Its Limits

It solves "the model speaks without guidance" — but that's not enough.

Layer 2: Context Engineering

Context = A Bigger Concept Than Prompt

The prompt is just one part of context. What actually goes to the model is the full context.

Context Window and Context Rot

The total information the model can "see" at once is the context window:

Model	Context Window
GPT-3.5 (early)	4K tokens
GPT-4 Turbo	128K
Claude Sonnet 4	200K
Claude Opus 4.7	1M
Gemini 2.5 Pro	2M

Once full, you have two paths: compress (let the model summarize history) or drop (truncate old messages). Either way, you get context rot:

Early information gets compressed into a one-liner; details vanish
Critical constraints get truncated; the model forgets "don't change the schema"
Decision causality breaks; the model repeats earlier mistakes

The result: the model gets less accurate over time, contradicts itself, forgets its initial goals.

The Three Moves of Context Engineering

To fight context rot, the work boils down to three steps:

1. Retrieve

Pull the most relevant information from history, codebase, error logs, doc stores. RAG (Retrieval-Augmented Generation) is the canonical implementation.

2. Compress

Not naive truncation — structured compression. For example:

Distill 50 turns of dialogue into 10 lines of decisions
Compress an entire file read into "this file defines class X with method Y"
Reduce 5000 lines of logs to "OOM at line 42, root cause: cache not freed"

3. Compose

Adjust ordering. The most important content goes near the end — models pay highest attention to recent input (this is a transformer architectural property called recency bias).

A typical context layout:

Prompt engineering is a subset of context engineering: prompt engineering handles only the system prompt layer; context engineering manages the entire dynamic information flow.

Different Tools, Different Context Strategies

The same GPT-4 / Claude wrapped in different shells produces wildly different results:

Tool	Context Strategy
Cursor	Project-wide embedding index; retrieves on demand
Claude Code	Terminal-native; proactively reads files; tool results flow into context
Trae	Cursor-like, optimized for multi-file editing
WorkBuddy	Workflow orchestration with memory and skill injection

It's not which model is stronger — it's which shell does context engineering better.

Layer 3: Harness Engineering

By this point the model is still just chatting — it understands context, gives suggestions, but can't actually do things.

To make it a worker, it needs hands and feet.

Give the Model Execution Capabilities

Add these around the model:

Capability	What It Does
Bash sandbox	Run shell commands
Filesystem	Read, write, create files
HTTP client	Call APIs, fetch web pages
MCP (Model Context Protocol)	Standardized external tool protocol
Code execution env	Run tests, run lint
Browser automation	Real browser interaction

Together these form the Execution Layer. The model decides "what to do"; the wrapper executes "how to do it".

The ReAct Loop: Reason + Act

Wire execution to the model in a loop and you get ReAct:

This is the skeletal pattern of every AI Agent — a never-ending think-act loop.

But Long Loops Break Things

ReAct's downside: the longer the loop, the more context bloats. Even with great context engineering, 50 rounds of tool calls accumulate. Symptoms:

The agent reaches step 30 and forgets the original requirement
After reading too many files, it loses track of "which is the entry point"
Earlier prohibitions get compressed away; the agent starts doing forbidden things

Compression alone isn't enough — you need to pin core information into every call.

The Four Layers of Harness

Here's the complete Harness Engineering architecture:

Execution Layer

Lets the model "act". Filesystem, Bash, HTTP, browsers, MCP — covered above.

Feedback Layer

Execution doesn't always go well, and errors are the agent's eyes. The feedback layer routes errors back into context properly:

A good feedback layer doesn't just paste raw errors — it also:

Filters noise (90% of stack frames are node_modules, drop them)
Highlights signals ("most relevant failure point from last run")
Provides environment context (OS, Node version, file existence)

Memory Layer

Put project core information into a dedicated file that gets injected on every call. This is Claude Code's CLAUDE.md, Cursor's .cursorrules, WorkBuddy's MEMORY.md:

Memory layer keys: lightweight, extensible, lazy-loaded. Split if it grows too long:

The entry says "for deployment, see deploy.md" — the model loads details on demand instead of dumping 50 KB into context up front.

Orchestration Layer

The top layer manages "what comes first, what comes next." Without it, agents handed a big requirement either wander aimlessly or loop forever.

The orchestration layer does three things:

Action	Example
Decompose	"Implement login" → ① schema ② backend API ③ frontend form ④ unit tests
Acceptance criteria	Each subtask has a "done means..."
Sequencing	Schema must come before API implementation

This is the heart of Spec-Driven Development (SDD) — write the spec first, then have the agent execute against it step by step.

Formulas and Causality

OpenAI's stated formula:

And the harness itself:

Drilling into the relationship between the three terms:

Engineering	Object	Problem It Solves
Prompt	A single input's wording	The model doesn't understand instructions
Context	The full input's organization	Information doesn't fit or isn't relevant
Harness	The complete loop around the model	The model can't execute continuously

Each layer doesn't replace the previous one — it wraps around it.

In Practice: Implementing All Four Layers with Claude Code

Claude Code is currently the most "out-of-the-box harness" Coding Agent — all four layers are natively supported.

Minimum Viable: Write a Good CLAUDE.md

Just create CLAUDE.md at your project root:

This file gets auto-injected as a system prompt on every Claude Code conversation — that's your memory layer.

Going Further: Orchestration Plugins like Spec-Kit

Manual planning is tedious — use a plugin like Spec-Kit that forces development into three phases:

Phase	Output
Specify	A clear constraint document (spec.md)
Plan	A step-by-step plan (plan.md)
Execute	Code changes with tests

Feedback Layer Comes Free

When Claude Code runs npm run lint and it fails, stderr automatically flows into the next round of context. Zero work on your part — that's standard feedback layer.

Execution Layer's Extension Points

Need Claude Code to operate Notion / GitHub / Slack? Install an MCP server:

Execution layer expanded.

Common Misconceptions

"Prompt engineering is dead"

Wrong. Prompts are the innermost core of harness — every outer layer eventually delegates to a well-crafted prompt. They didn't disappear; they became one part of a bigger system.

"Context engineering = RAG"

RAG is just the retrieve step. Context engineering also covers compression, composition, and pacing of model interaction.

"Harness equals Agent frameworks"

"Stronger models will make harnesses obsolete"

A Practical Checklist

If you're building an agent, work through this in order:

Write a solid prompt template — role, task, output format, boundaries
Add a project rules file (CLAUDE.md / .cursorrules) — pin core constraints
Wire in tool calling — file I/O, Bash, HTTP at minimum
Design a feedback path — errors must flow back into the next round
Add task decomposition — big tasks must split into steps with acceptance criteria
Watch context window utilization — over 60% means start compressing or phasing
Run evaluations — measure pass rate on real tasks each iteration, not "looks right"

Summary

The three terms aren't replacements for each other — the engineering boundary is just expanding outward:

Era	Key Capability	Key Artifact
2023	Prompt engineering	A good instruction
2025	Context engineering	A well-organized input
2026	Harness engineering	A wrapper that delivers continuously

Formulas:

Boiled to one paragraph:

Prompt engineering — make the model understand the requirement and output spec
Context engineering — feed the model precise, relevant context
Harness engineering — let the model execute against spec until delivery

The next term will likely come from the multi-agent collaboration space — but the underlying logic stays the same: the model thinks; engineering does everything else.