How OpenAI built Codex: inside the agent loop and harness
OpenAI's Codex powers a cross-platform coding agent (CLI, web, VS Code, macOS app) from a single shared harness. Two engineering posts reveal exactly how that harness works: the agent loop that orchestrates model inference and tool calls, the prompt structure and caching strategy that keeps it efficient, and the App Server JSON-RPC protocol that lets every client surface share the same core.
TL;DR
- Every Codex surface (CLI, web, VS Code, macOS) runs the same "Codex harness" — the shared agent loop, thread lifecycle, tool execution, and auth logic — so features ship once and work everywhere.
- The agent loop sends HTTP requests to the Responses API, building an ever-growing JSON prompt from system instructions, tool definitions, sandbox permissions, and conversation history.
- Prompt caching is critical for efficiency: static content (instructions, tools) lives at the front of the prompt so each new turn gets a cache hit on all prior context. Tool ordering bugs cause expensive cache misses.
- When the context window fills, Codex calls a /responses/compact endpoint that returns an opaque encrypted_content item encoding the model's latent understanding — smaller than a text summary and privacy-preserving.
- The App Server is a JSON-RPC protocol over stdio exposing Codex core to any client. Three primitives power the protocol: Item (atomic I/O with a lifecycle), Turn (one unit of agent work), and Thread (persistent conversation container).
Why this matters for interviews
Agent architecture is becoming a standard system design topic. This post gives you concrete production patterns: how tool calls are orchestrated, how context is managed at scale, and how to design a stable protocol that decouples a complex agent runtime from multiple client surfaces.
Breakdown
1. One harness, many surfaces
Codex ships as a CLI, a web app, a VS Code extension, and a macOS desktop app. Despite looking different, they all run the same "Codex harness" — a shared Rust library called Codex core that contains the agent loop, thread lifecycle (create/resume/fork/archive), config and auth management, and sandboxed tool execution.
The harness is not just a module you call once; it's a runtime you spin up, and it manages the full lifecycle of one conversation thread, including persisting the event history so clients can reconnect and render a consistent timeline.
The App Server is the stable protocol that exposes this runtime to any client, regardless of language or platform.
Interview angle: This is a textbook platform extraction story. Interviewers often ask how you'd serve multiple clients (mobile, web, desktop) from one backend.
The key insight here is that you don't just expose an API — you define a protocol with a well-typed event stream. That lets you decouple client release cycles from server improvements, which is exactly how Xcode can stay on a stable client release while the App Server binary updates independently.
2. The agent loop: inference, tool calls, repeat
At the core of every AI agent is the agent loop. In Codex, it works like this: (1) build a prompt from the current conversation state; (2) send it to the Responses API and receive a stream of Server-Sent Events; (3) if the model requests a tool call (e.g., "run ls"), execute it and append the result to the prompt; (4) re-query the model with the updated prompt; (5) repeat until the model emits an assistant message instead of a tool call.
That final assistant message — something like "I added architecture.md" — signals the end of one turn. A single turn can involve many of these inference/tool-call iterations before that message arrives.
The prompt grows with each round trip because the full conversation history must be included in every request for the model to have context. This makes the loop technically O(n²) in bytes sent over the course of a conversation, which is why caching and compaction are non-negotiable.
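The loop above can be sketched in a few lines. This is a minimal illustration, not Codex's actual code: `query_model` and `run_tool` stand in for the Responses API call and sandboxed tool execution, and the toy model below exists only to make the example runnable.

```python
from dataclasses import dataclass

@dataclass
class Event:
    type: str          # "user_message", "tool_call", "tool_result", "assistant_message"
    content: str = ""
    name: str = ""     # tool name, for tool calls

def run_turn(history, query_model, run_tool):
    """One turn of the agent loop: query the model, execute any requested
    tool, append the result, and repeat until the model emits an
    assistant message instead of a tool call."""
    while True:
        event = query_model(history)
        history.append(event)               # append-only: keeps the cache prefix stable
        if event.type == "tool_call":
            result = run_tool(event.name)   # e.g. run "ls" in the sandbox
            history.append(Event("tool_result", content=result))
        else:
            return event                    # final assistant message ends the turn

# Toy model: request one tool call, then answer.
def toy_model(history):
    if not any(e.type == "tool_call" for e in history):
        return Event("tool_call", name="ls")
    return Event("assistant_message", content="I added architecture.md")

history = [Event("user_message", content="document the architecture")]
final = run_turn(history, toy_model, run_tool=lambda name: "architecture.md  src/")
# final.content == "I added architecture.md"; history now holds 4 events
```

Note that the history is only ever appended to, never rewritten in place — that detail is what makes the caching strategy in the next sections work.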
Interview angle: The agent loop is the foundation of any LLM agent interview answer. Know that it's not just "call the model once" — the key complexity is managing the growing context across many tool-call iterations.
The interview follow-up is almost always "what happens when you hit the context limit?" — the answer is compaction, covered later.
3. Building the initial prompt
When Codex starts a conversation, it constructs a structured JSON payload for the Responses API with three fields: instructions (a system/developer message with model-specific guidance, e.g., gpt-5.2-codex_prompt.md), tools (a list of tool definitions including the shell tool, the update_plan tool, web search, and any MCP tools the user configured), and input (an ordered list of messages).
The input is built in a specific order:
- a developer message describing sandbox permissions and which folders Codex can write to;
- optionally, a developer message with the user's personal config from ~/.codex/config.toml;
- optionally, a user message aggregating AGENTS.md files found from the git root up to the current working directory;
- a user message with the current environment context (cwd and shell);
- and finally the actual user message.
The role hierarchy — system > developer > user > assistant — determines how the model weights each piece of content. Critically, static content (instructions, tools) comes first so prompt caching can work.
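The resulting payload shape, sketched as a Python dict. The field names match the three fields described above; every value here is an illustrative placeholder, not the real prompt content:

```python
# Illustrative shape of the Responses API payload described above.
# All values are placeholders for the example.
payload = {
    "instructions": "<model-specific guidance, e.g. gpt-5.2-codex_prompt.md contents>",
    "tools": [
        {"type": "function", "name": "shell"},
        {"type": "function", "name": "update_plan"},
        # ...plus web search and any user-configured MCP tools
    ],
    "input": [
        # Ordered by stability: most static first, most dynamic last.
        {"role": "developer", "content": "<sandbox permissions, writable folders>"},
        {"role": "developer", "content": "<personal config from ~/.codex/config.toml>"},
        {"role": "user", "content": "<aggregated AGENTS.md files>"},
        {"role": "user", "content": "<environment context: cwd and shell>"},
        {"role": "user", "content": "Add an architecture.md describing this repo"},
    ],
}
```

The ordering inside `input` is the whole point: everything above the final user message changes rarely, so it can be cached as a shared prefix across turns.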
Interview angle: Prompt structure is a systems problem, not just a prompt engineering problem. Interviewers testing agent design will ask: how do you inject context without blowing up the prompt, and how do you keep it cacheable?
The answer: layer messages by stability — the most static things go first (system instructions), the most dynamic things go last (user message). This maximizes cache hit rate on every turn.
4. Prompt caching: making the loop efficient
LLM inference is expensive because the model must process every input token on every call. Prompt caching lets the server reuse computation from a previous call when the new prompt shares an exact prefix with a cached one.
Because Codex appends to the prompt rather than rebuilding it, each new turn automatically shares a prefix with the previous call — so caching delivers near-linear cost instead of quadratic. But cache hits only happen for exact prefix matches, so anything that changes earlier in the prompt poisons the cache for everything after it.
Codex guards against this carefully: if the sandbox configuration changes mid-conversation, a new developer message is appended at the end (not inserted earlier). If the cwd changes, a new environment_context message is appended. The team also found that MCP tools had a subtle bug where tool definitions were emitted in non-deterministic order, causing cache misses on every turn until fixed.
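A toy model of exact-prefix caching makes both failure modes concrete. Treat the prompt as a list of chunks; the cache can only reuse the computation for the longest shared prefix:

```python
def cached_prefix_len(prev_prompt, new_prompt):
    """Length of the shared exact prefix between two prompts — the portion
    a server-side prompt cache could reuse. Any change earlier in the
    prompt drops this to the point of the change."""
    n = 0
    for a, b in zip(prev_prompt, new_prompt):
        if a != b:
            break
        n += 1
    return n

turn1 = ["instructions", "tool:shell", "tool:update_plan", "user:hi"]
turn2 = turn1 + ["assistant:ok", "user:next"]   # append-only: full reuse
# Non-deterministic tool ordering (the MCP bug described above):
reordered = ["instructions", "tool:update_plan", "tool:shell", "user:hi", "assistant:ok"]

cached_prefix_len(turn1, turn2)      # 4 -> the entire previous prompt is a cache hit
cached_prefix_len(turn1, reordered)  # 1 -> everything after the first chunk is a miss
```

This is why Codex appends new sandbox or environment messages at the end rather than editing earlier ones: an in-place edit would invalidate the cache for every token that follows it.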
Interview angle: Treating cache-friendliness as a design constraint, not just a performance optimization, is a strong interview signal. The concrete question is: given that you must send the full conversation history each time, how do you avoid O(n²) cost?
The answer — maintain a stable prefix, append-only history, and use server-side prefix caching — directly maps to how Redis, CDNs, and SQL query caches all work. You can draw the same analogy in interviews.
5. Context window management: compaction
Every model has a finite context window — the maximum number of tokens for one inference call, counting both input and output. In a long agentic session with many tool calls, the conversation history can easily exhaust this limit. Codex solves this with compaction.
Originally, users had to manually run /compact, which summarized the conversation into a short text. Now Codex calls a dedicated /responses/compact endpoint when the token count exceeds auto_compact_limit.
This endpoint returns a new, smaller list of input items that represents the conversation. Crucially, this list includes a special type=compaction item containing an opaque encrypted_content blob that encodes the model's latent understanding of everything that happened — richer than any text summary. For Zero Data Retention customers, OpenAI keeps the decryption key (not the data), so the model's reasoning is preserved without storing conversation content in plain text.
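A hedged sketch of the auto-compaction decision. The helper names (`count_tokens`, `call_compact`) and the threshold are stand-ins; the real trigger is the `auto_compact_limit` config and the real replacement list comes from the /responses/compact endpoint:

```python
def maybe_compact(input_items, count_tokens, call_compact, auto_compact_limit):
    """Hypothetical auto-compaction check: when the prompt exceeds the
    limit, replace the history with the smaller item list returned by
    the compact endpoint (which includes a type="compaction" item)."""
    if count_tokens(input_items) <= auto_compact_limit:
        return input_items           # still fits: keep the append-only history
    return call_compact(input_items) # swap in the compacted representation

# Toy stand-ins for the tokenizer and the /responses/compact call:
count = lambda items: sum(len(i.get("text", "")) for i in items)
compact = lambda items: [{"type": "compaction", "encrypted_content": "<opaque blob>"}]

short = [{"type": "message", "text": "hi"}]
long_history = [{"type": "message", "text": "x" * 100}] * 50  # 5000 "tokens"

maybe_compact(short, count, compact, auto_compact_limit=1000)         # returned unchanged
maybe_compact(long_history, count, compact, auto_compact_limit=1000)  # -> [compaction item]
```

The important property is that compaction replaces the history wholesale rather than trimming it, so the next turn starts from a fresh, short prefix.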
Interview angle: Context window management will come up any time you discuss production LLM systems.
Know the three levers: (1) summarization (cheap, lossy), (2) selective truncation (drop old messages, risky), (3) model-native compaction like Codex uses (preserves latent state, best quality).
The ZDR angle is also worth knowing — "how do you do compaction without retaining user data?" is a real enterprise requirement.
6. The App Server: a stable client protocol
The App Server is a long-lived process that hosts Codex core threads and exposes them over a bidirectional JSON-RPC protocol (JSONL over stdio). It has four internal components: a stdio reader (transport layer), a Codex message processor (translates JSON-RPC requests into Codex core operations and transforms low-level internal events into stable UI-ready notifications), a thread manager (one core session per thread), and the core threads themselves.
The protocol was born out of necessity: when the VS Code extension was built, the team tried using MCP to expose the agent but found MCP semantics couldn't represent rich session state like diffs and streaming progress. They designed a custom JSON-RPC protocol instead.
As internal teams and partners (JetBrains, Xcode) wanted to embed Codex, it became the official surface. The design goal was backward compatibility — older clients can talk to newer App Server binaries safely, so partners can update server-side without waiting for a client release.
Interview angle: Designing a stable protocol for a complex stateful system is a recurring system design problem.
The pattern here is: (1) define a small set of well-typed primitives with explicit lifecycles; (2) make the protocol backward compatible so clients and servers can be deployed independently; (3) use a transport (stdio/JSONL) that works across languages.
The tradeoff versus REST is that you get bidirectional streaming but require clients to speak JSON-RPC.
7. Conversation primitives: Item, Turn, Thread
The App Server protocol is built on three primitives.
An Item is the atomic unit of input/output — typed (user message, agent message, tool execution, approval request, diff) with an explicit lifecycle: item/started fires when the item begins, optional item/*/delta events stream in content incrementally, and item/completed fires with the terminal payload. This lets clients render UI immediately on started without waiting for the full content.
A Turn is one unit of agent work: it begins when the client submits input and ends when the agent finishes all outputs for that input. A Turn contains a sequence of Items representing intermediate steps.
A Thread is the durable container for an ongoing session: it holds multiple turns, persists event history to disk, and can be resumed after a disconnect. The server can also pause a turn mid-execution by sending an approval request to the client — the agent waits until the client responds allow or deny before continuing.
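The item lifecycle maps directly to how a client renders a timeline. This sketch flattens the `item/*/delta` family into a single `item/delta` event type for brevity; the event names are illustrative, not the exact protocol schema:

```python
def render_timeline(events):
    """Assemble a client-side view from lifecycle events: show an item as
    soon as item/started arrives, stream deltas in, and finalize with the
    terminal payload on item/completed."""
    items = {}
    for ev in events:
        item_id = ev["item_id"]
        if ev["type"] == "item/started":
            items[item_id] = {"status": "in_progress", "text": ""}
        elif ev["type"] == "item/delta":
            items[item_id]["text"] += ev["delta"]      # incremental streaming
        elif ev["type"] == "item/completed":
            items[item_id]["status"] = "completed"
            items[item_id]["text"] = ev["payload"]     # terminal payload wins
    return items

stream = [
    {"type": "item/started",   "item_id": "i1"},
    {"type": "item/delta",     "item_id": "i1", "delta": "I added "},
    {"type": "item/delta",     "item_id": "i1", "delta": "architecture.md"},
    {"type": "item/completed", "item_id": "i1", "payload": "I added architecture.md"},
]
render_timeline(stream)
# {"i1": {"status": "completed", "text": "I added architecture.md"}}
```

Because every item carries an id and a terminal event, a reconnecting client can replay the persisted event history and arrive at exactly the same timeline.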
Interview angle: Defining primitives with explicit lifecycles is a pattern that shows up in streaming API design generally (think WebSockets, SSE, gRPC streams).
The key interview insight: rather than returning a single response, you emit a stream of typed events with lifecycle markers. This makes partial rendering, error recovery, and audit logging all much easier to implement.
The approval/pause mechanism is also worth discussing — it's how you make an autonomous agent safely interruptible.
8. Client integrations: local, web, and beyond
Different surfaces integrate with the App Server in different ways. Local clients (VS Code, macOS desktop) bundle a platform-specific App Server binary, launch it as a child process, and keep a bidirectional stdio channel open. The shipped binary is pinned to a tested version, but partners like Xcode can decouple their release cycle by pointing to a newer binary when needed.
Codex Web runs the App Server inside a provisioned container — a worker checks out the workspace, launches the App Server binary, and maintains a JSON-RPC channel; the browser talks to the backend over HTTP+SSE. This means the agent keeps running even if the browser tab closes, and a reconnecting session can catch up from the persisted thread history.
The CLI's TUI historically ran in the same process as the agent loop (no protocol layer), but the team plans to refactor it to use the App Server too, which would enable connecting the TUI to a remote agent session.
Interview angle: The web vs. local integration is a classic "where does state live?" problem. If you run the agent in the browser tab, the tab closing kills the session. If you run it server-side, you get persistence but need a reconnect mechanism.
The Codex answer — run server-side, stream events over SSE, persist thread history — is exactly how you'd design any long-running job system (think CI pipelines, batch processing, or async workflows).
Key Concepts
Agent loop
The core orchestration cycle in an AI agent: build a prompt, query the model, execute any tool calls the model requests, append results to the prompt, and repeat until the model returns a final assistant message.
Analogy: Think of it like a while loop in code: the exit condition is the model saying "I'm done," and each iteration adds more context to a shared scratchpad.
Responses API
OpenAI's server API that Codex uses for model inference. It accepts a JSON payload with instructions, tool definitions, and an input list, then returns a Server-Sent Events stream of typed response events.
Prompt caching
A server-side optimization where the model reuses computation from a previous inference call when the new prompt shares an exact prefix with a cached one. Only works if the prompt is append-only and static content comes first.
Analogy: Like a CDN cache: if the URL (prefix) matches exactly, you get the cached response instantly. Any change to an earlier part of the URL busts the cache for everything after it.
Compaction
The process of replacing a long conversation history with a shorter, semantically equivalent representation when the context window limit approaches. The /responses/compact endpoint returns an opaque encrypted_content item that preserves the model's latent understanding.
Codex core
The Rust library and runtime that contains all agent logic: the agent loop, thread lifecycle management, configuration and auth, and sandboxed tool execution. It is the implementation shared across all Codex surfaces.
App Server
A long-lived process that hosts Codex core threads and exposes them to clients via a bidirectional JSON-RPC protocol over stdio. It acts as both the transport layer and the translation layer between client requests and low-level agent events.
Item (App Server primitive)
The atomic unit of input/output in the App Server protocol. Each item has a type (message, tool call, diff, approval request) and an explicit lifecycle: started, optional delta stream, then completed.
Turn (App Server primitive)
One unit of agent work initiated by a single user input. A turn begins when the client submits a message and ends when the agent finishes producing all outputs, potentially after many model inference/tool-call cycles.
Thread (App Server primitive)
The durable container for an ongoing Codex session. A thread holds multiple turns, persists event history so clients can reconnect, and supports operations like create, resume, fork, and archive.
Zero Data Retention (ZDR)
A configuration where the provider does not store conversation data. Codex supports ZDR by keeping requests stateless (no previous_response_id) and by encrypting reasoning content — the provider stores the decryption key, not the data.