Inside Claude Code: The Architecture
Add to your library first to use in Claude Code
About
An architectural analysis of Claude Code -- the internal agent schema, built-in agents, coordinator system, hooks, memory, and token-optimization decisions. Written from source analysis, in our own words.
Part of the Pro Library
Get full access to this book and 5 others with CandleKeep Pro. Preview the first 2 pages below.
Preview
Inside Claude Code: The Architecture
Version: 2.0 Based on: Source analysis of Claude Code Note: All content is original analysis and description. No verbatim code or prompts are reproduced.
Table of Contents
Part I: The Core Runtime
- Bootstrap & State — The Singleton Nervous System
- The Query Engine — The Core Loop
- Context Management — The Five Layers
- System Prompt Construction
Part II: The Security & Control Plane
- The Permission System
- Auto Mode & the YOLO Classifier
- The Hooks System
Part III: The Extension Surface
- Tool Architecture Patterns
- The Agent Definition Schema
- The Agent Sourcing Hierarchy
- The Five Built-in Agents
- How the Parent Model Chooses — The Agent Tool Prompt
- Coordinator Mode — The Full Orchestration System
- Fork Subagent — The Experimental Parallelism Feature
- The Memory System
- The Skill System
- The Command System
- The Plugin System
- MCP Integration
Part IV: Infrastructure & Connectivity
- The Transport Layer
- The Bridge & Remote Control System
Appendices
- Appendix A: The Buddy System
- Appendix B: Glossary of Architectural Patterns
- Appendix C: Agent Definition Quick Reference
Chapter 1: Bootstrap & State — The Singleton Nervous System
Claude Code's entire runtime revolves around a single mutable state object initialized at module load time. Every subsystem reads from and writes to this object, making it the central nervous system of the application. The design is shaped by one hard constraint: the state module must sit at the leaf of the import dependency graph.
The State Object
At the core of Claude Code is a large, flat state structure with over ninety fields. These fields span session identity, cost tracking, model configuration, permission modes, cache management, skill tracking, conversation metadata, and more. The object is created once — at module load time — and every other part of the system holds a reference to it.
This is not an oversight or a shortcut. It is a deliberate architectural choice driven by the need to share mutable state across deeply nested subsystems without threading it through every function signature. The query engine needs to know the current model. The permission system needs to know the current mode. The cost tracker needs to update running totals that the UI reads on every render. A singleton makes all of this possible without prop-drilling or dependency injection frameworks.
The Leaf Constraint
The most important structural rule governing this module is its position in the import graph: it sits at the very bottom. The state module cannot import from the rest of the application. No utility functions, no helper modules, no shared constants from other packages.
This constraint exists to prevent circular dependencies. Because every subsystem imports from the state module, if the state module were to import from any of those subsystems, the resulting cycle would create initialization-order bugs that are nearly impossible to diagnose at scale. The cost of this rule is real — all state transformations must be self-contained, or achieved through getter/setter patterns that accept injected logic from outside. Any developer adding a new field to the state object must accept this trade-off.
The Sticky-On Latch Pattern
Certain configuration flags in the state object exhibit a behavior that is easy to miss on first reading: once activated, they are never deactivated for the remainder of the session. These are sticky-on latches.
The flags in question control HTTP headers sent with API requests — headers related to experimental features, fast-output mode, and cache optimization hints. The reason for latching them is subtle and rooted in how the server-side prompt cache works.
The API's prompt cache uses the full set of request parameters — including headers — as part of its cache key. Consider what happens without latching: a header appears in request three, disappears in request four, and reappears in request five. Requests three and five have identical headers, but request four differs. The cache sees three distinct key patterns, and no two consecutive requests share a cache hit.
By latching headers permanently on first activation, every subsequent request in the session carries an identical header set. The cache key stabilizes, and hit rates climb toward their theoretical maximum. This is a case where a small piece of client-side discipline produces outsized savings at fleet scale.
Atomic Session Switching
When a user switches sessions — moving from one conversation to another — two pieces of state must change together: the session identifier and the transcript storage path. These are updated atomically, and a notification signal is emitted to all listeners immediately after.
The atomicity prevents a specific race condition. If the session identifier changed first but the transcript path lagged behind, any transcript write that occurred in the gap would land in the old session's directory. The user would later open their new session and find messages missing. By bundling the update and signaling listeners only after both fields are set, the system guarantees consistency.
Path Normalization at Startup
During initialization, the system resolves symbolic links and normalizes all file paths to Unicode NFC form. This seemingly minor step addresses a real cross-platform hazard.
macOS uses NFD (decomposed) Unicode normalization for filenames by default, while Linux uses NFC (composed). A file named with an accented character — say, "resume.md" — is stored as different byte sequences on the two platforms. If a session starts on one platform and paths are compared against values generated on another (as happens with synced configuration files or shared project directories), lookups fail silently.
By normalizing to NFC at the earliest possible moment — before any path is stored, compared, or transmitted — the system ensures that every downstream consumer sees a consistent representation regardless of the operating system that created the original path.
Why This Matters
The bootstrap and state system is not glamorous, but it is load-bearing. Every optimization described in later chapters — prompt caching, context compaction, session resumption — depends on the state object being correct, consistent, and available from the first moment of execution. The leaf constraint, the latch pattern, and the atomic switching logic are all consequences of taking that responsibility seriously.
Chapter 2: The Query Engine — The Core Loop
The query engine is the heartbeat of Claude Code. Every interaction — from the user pressing Enter to the final character of the assistant's response — flows through a single processing loop. Its design choices around persistence, streaming, and token management reveal how a conversational AI system handles the messy realities of network failures, resource constraints, and multi-turn agentic workflows.
Persistence Before Transmission
The first non-obvious behavior of the query engine occurs before any API call is made: the user's message is written to the transcript file on disk. This write-before-send pattern seems backward — why persist a message before you know the outcome? — but it solves a critical reliability problem.
If the process is killed mid-request — by a power outage, an out-of-memory kill, or the user pressing Ctrl-C — the session must be resumable. Without the pre-write, a crash during an API call would lose the user's last message entirely. The user would reopen Claude Code, see their previous conversation, and have no record of what they just asked. By writing first, the worst case after a crash is a message with no response, which the system can detect and handle gracefully on the next startup.
The Async Generator Pattern
The query engine's main processing method is implemented as an async generator — a function that yields discrete events as they occur rather than accumulating results and returning them all at once. Each event represents a meaningful moment in the lifecycle of a query: the API stream has started, a chunk of the response has arrived, a tool call has been requested, a tool result is ready.
This design is not merely an aesthetic preference. It allows three fundamentally different consumers to share the same underlying loop. The terminal UI subscribes to events to render incremental updates — characters appearing as the model generates them, spinner animations during tool execution. The bridge system (used for remote and SDK access) forwards events over a network connection to stream results to external clients. The SDK's programmatic interface processes events to build structured responses. All three consume the same event stream, which means behavioral changes to the core loop automatically propagate to every surface.
The Infinite Loop
At the heart of query processing is an unbounded loop. Each iteration follows the same sequence: assemble the conversation history, make an API call, stream and process the response, execute any tool calls that the model requested, then decide whether to continue or stop.
The stopping condition is simple: if the model's response ended without requesting any tool calls, the loop terminates. If tool calls were made, their results are appended to the conversation history and the loop iterates again, giving the model the opportunity to act on the results.
This structure handles multi-turn agentic workflows naturally. A single user message — "refactor this module and update all the tests" — can drive dozens of tool calls across dozens of loop iterations before the model decides its work is complete. There is no artificial cap on iteration count built into the loop itself; the model's judgment about when to stop is the only termination condition. The loop trusts the model to converge.
Output Token Capping
A subtle optimization lives in how the engine configures its API requests. By default, the system requests a maximum output size significantly smaller than what the API actually supports. This is not a limitation — it is a deliberate performance trade-off informed by empirical data.
Analysis of production usage patterns shows that the vast majority of responses — on the order of ninety-nine percent — fall comfortably under a modest token limit. By advertising a smaller maximum, the engine allows the API infrastructure to release its internal resource reservation sooner. When the API allocates capacity for a request, it must reserve enough compute to generate the full requested output. A smaller reservation frees resources faster, improving throughput across the fleet for all concurrent users.
When a response legitimately hits the reduced cap — the model was generating a long file or a detailed explanation — the engine detects this condition and retries the request with the full, unrestricted limit. This retry path is rare but essential: it ensures that the optimization never silently truncates output.
The Rules of Thinking
Extended thinking — where the model exposes its internal reasoning process before generating a response — introduces a set of invariants that the query engine must carefully maintain across the entire conversation history. These invariants govern the ordering and placement of thinking blocks relative to other message types, particularly tool results.
Violating these invariants causes the API to reject the request, often with error messages that do not directly indicate the root cause. The development team discovered these constraints through significant debugging effort, and the system now includes explicit safeguards to ensure that conversation histories are always well-formed with respect to thinking block placement.
The constraints are particularly tricky during conversation compaction (covered in the next chapter), where the history is rewritten. The compaction logic must preserve thinking-block invariants even as it dramatically restructures the conversation.
Token Budget Tracking Across Compaction
The API server tracks how many tokens the model has used and how many remain in its budget based on the conversation history it receives. Under normal operation, this is invisible to the client — the server computes everything it needs from the messages array.
But compaction breaks this assumption. When the conversation is compacted, the detailed history that the server used to compute the token budget is replaced with a brief summary. The server, seeing only the summary, has no way to reconstruct how many tokens were consumed before compaction.
The query engine solves this by carrying the token budget counter across the compaction boundary. Before compaction, the engine records the current budget state. After compaction produces its summary, the engine injects the preserved counter back into the request context. Without this carry-forward, the model would behave as if it had a fresh, full budget after every compaction — defeating the purpose of budget limits in long-running sessions.
Chapter 3: Context Management — The Five Layers
Claude Code manages context window utilization through five distinct systems, applied in sequence from least aggressive to most aggressive. Each layer addresses a different scale of the problem, and together they explain why long sessions rarely hit hard context limits — and what happens when they do.
Layer 1: Context Window Detection
Before any management can occur, the system must know how much space it has to work with. The first layer determines the effective context window for the current model. Different models expose different limits, and the system detects these at session start.
Models with very large context windows — one million tokens or more — receive special handling. The compaction thresholds, safety buffers, and budget calculations all scale relative to the detected window size. A session running against a large-context model can tolerate far more accumulated history before intervention is needed.
There is also an enterprise override mechanism. Certain regulated environments prohibit sending large volumes of data to external APIs, regardless of what the model supports. When this override is active, the effective context window is artificially constrained, and the compaction layers trigger earlier and more aggressively.
Layer 2: Microcompaction
The lightest intervention happens before each API call. The system scans recent tool results and truncates any that exceed a size threshold, replacing oversized content with a compact placeholder.
Microcompaction is selective — it only applies to tool types known to produce large outputs: file reads, shell command results, and search results. A file read that returned two thousand lines of source code might be trimmed to its first and last sections with a note indicating the omission. A search result listing hundreds of matches might be capped at the most relevant subset.
The thresholds are time-sensitive. Recent tool results receive more generous limits than older ones. The intuition is sound: the model is more likely to need the full content of something it just read than something from thirty turns ago. As messages age, they become candidates for more aggressive truncation, gradually reclaiming context space without losing information the model is actively using.
Layer 3: Auto-Compaction Trigger
When the total conversation size approaches a critical threshold — defined as the context window minus a safety buffer — the system automatically triggers full compaction. This is the transition from incremental trimming to wholesale summarization.
The most important engineering detail in this layer is the circuit breaker. Compaction involves an API call to generate a summary, and that call can fail. In early versions of the system, a failure would simply retry on the next turn, fail again, retry again, and so on. At fleet scale, this pattern was consuming hundreds of thousands of wasted API calls per day — sessions stuck in a compaction-failure loop, retrying every turn without progress.
The circuit breaker tracks consecutive compaction failures. After a fixed number of failures, it trips, stops retrying, and informs the user that their session needs to be manually reset. This simple mechanism eliminated an enormous source of wasted compute. The failure threshold and the trigger point for auto-compaction are both tunable via environment variable, which allows the team to run experiments and adjust behavior without code changes.
Layer 4: Full Compaction
When triggered, full compaction replaces the entire conversation history with an AI-generated summary. The process is more nuanced than a simple "summarize everything" call.
Images and embedded documents are stripped and replaced with descriptive placeholders — the summary notes that an image was present and what it depicted, but the raw data is removed. This is critical because images consume disproportionate amounts of the token budget relative to their informational density in a text-oriented workflow.
After compaction produces the summary, the system performs a context restoration step. Up to a fixed number of recently referenced files are re-read from disk and injected into the post-compaction context. This ensures that the model retains working knowledge of the files it was actively editing, even though the conversation history that originally loaded those files has been replaced.
If skills were invoked during the session, they receive a dedicated portion of the post-compaction token budget. Skills carry behavioral instructions and state that the model needs to continue operating correctly. Losing skill context after compaction would cause the model to "forget" capabilities it had been using, leading to confusing behavioral regressions mid-session.
A reactive compaction path handles the edge case where the API rejects a request as too large before auto-compaction had a chance to trigger. Rather than surfacing the error to the user, the system intercepts it, runs compaction immediately, and retries the request. From the user's perspective, the response is slightly delayed but otherwise normal.
Layer 5: Session Memory Compaction
The most experimental layer takes a fundamentally different approach. Instead of summarizing the entire conversation into a condensed narrative, session memory compaction preserves a chunk of recent raw messages unchanged and relies on separately maintained session memory — structured understanding extracted from prior turns — to fill in the gaps.
The hypothesis is that recent messages are more valuable in their original form than as a summary. A summary of "you edited three files and ran the tests" is less useful than the actual messages containing the file paths, the test output, and the model's reasoning about what to change next. By keeping recent history intact and using extracted memory for older context, this layer aims to preserve more actionable detail.
Session memory compaction runs first in the pipeline, before traditional compaction. If it succeeds and brings the context under the threshold, the heavier traditional path is skipped entirely. If it fails — because the session memory is insufficient, or the recent history alone is still too large — the system falls back to the traditional approach.
The Manual Trigger
The /compact command gives the user direct control over the pipeline. When invoked, it attempts session memory compaction first (unless the user provided custom instructions for the summary, in which case it skips to traditional compaction with those instructions). Pre-compaction hooks can inject additional context into the summary — useful for preserving project-specific state that the default summarizer might not prioritize.
The layered design means that most users never think about context management. Microcompaction silently trims old results, auto-compaction fires when needed, and the session continues. The layers exist for the cases where silence is not enough.
Chapter 4: System Prompt Construction
The system prompt in Claude Code is not a static string pasted into every request. It is a dynamically assembled document composed of independently managed sections, each with its own lifecycle, caching behavior, and update frequency. The architecture is shaped by one overriding concern: prompt cache economics at fleet scale.
The Cache Boundary
The single most important structural decision in system prompt construction is the existence of a hard boundary that divides the prompt into two halves. Everything before the boundary is globally static — identical for every user, every session, every organization. Everything after the boundary is session-dynamic — it varies by user, directory, model, or even by turn.
This boundary is not a conceptual guideline. It is a load-bearing architectural element that directly affects cost. The API's prompt cache uses a prefix-matching strategy: two requests that share an identical prefix can share cached computation for that prefix, even if they diverge afterward. The static half of the system prompt, being identical across all sessions fleet-wide, achieves near-perfect cache hit rates. Every session benefits from every other session's cache entries.
The implication is severe: if any session-variant content — the user's working directory, the current model name, MCP server instructions — were placed before the boundary, each unique combination would create a distinct cache prefix. At Claude Code's scale, with millions of sessions across thousands of distinct environments, even one misplaced dynamic field would fragment the cache into millions of entries with minimal reuse. The cost impact is measured in real money.
The Section Framework
System prompt content is organized into named sections, each registered with a compute function. The framework distinguishes between two categories based on their volatility.
Cached sections are computed once per session — typically at startup — and their values are stored for reuse on every subsequent turn. They are only recomputed when the conversation is cleared or compacted. Most sections fall into this category: the core behavioral instructions, environment details, memory content.
Volatile sections are recomputed on every turn. They carry a naming convention that is deliberately conspicuous — a visual signal to developers that creating a new volatile section is not a casual decision. The reason: when a volatile section's value changes between turns, it invalidates the prompt cache for everything that follows it in the prompt. A volatile section that changes frequently and sits early in the dynamic half can negate much of the caching benefit that the static/dynamic split provides. Every volatile section must justify its existence against this cost.
The Static Half
The content before the cache boundary contains the core identity and behavioral instructions that define Claude Code as a product. This includes: who the assistant is and how it should present itself, what tools are available and how to use them, how to approach software engineering tasks, guidelines for handling risky or destructive operations, tone and style requirements, and output efficiency guidance.
These sections are identical across all deployments. They change only when the Claude Code team ships a new version. Internal builds — used by the team that develops Claude Code — include additional sections not present in the external product: more detailed guidance on avoiding false claims, numeric constraints on inter-tool communication to prevent context bloat, and other operational refinements discovered through internal dogfooding.
The Dynamic Half
After the cache boundary, the prompt assembles session-specific content. The ordering is deliberate — sections that change less frequently are placed earlier to maximize the cacheable prefix length even within the dynamic half.
Session guidance describes how to use agents, skills, and the task system. This is near-static — it changes only when new agent types or skill capabilities are added.
Memory content is loaded from the user's memory files. It is cached per session because memory files rarely change during a conversation.
Environment information includes the working directory, platform, shell, git status, and model identifier. This is computed at startup and cached.
MCP server instructions are the most notable volatile section. MCP (Model Context Protocol) servers can connect and disconnect between turns as the user activates or deactivates integrations. If these instructions were cached at session start, connecting a new MCP server mid-session would not update the model's understanding of available tools — the model would continue operating as if the new server did not exist. Volatility is the correct trade-off here, despite the cache cost.
Custom style instructions, user language preferences, and experimental feature flags round out the dynamic content.
Identity Prefix Variants
The opening line of the system prompt — the very first thing the model sees — varies based on how Claude Code is being invoked. An interactive terminal session, an SDK-managed agent session, and an SDK session that opts out of the standard system prompt each receive a different identity prefix.
The distinction matters for behavioral defaults. An interactive session assumes a human is reading the output and adjusts tone accordingly. An SDK agent session assumes programmatic consumption and may omit conversational pleasantries. A bare SDK session makes minimal assumptions, deferring to whatever system prompt the SDK consumer provides.
The Attestation Header
Every API request includes a header that proves the request originated from a legitimate Claude Code client rather than a modified or spoofed installation. The attestation mechanism computes a hash at a level below the application code — in a component that cannot be modified by editing the application's source files.
The implementation uses a placeholder strategy to avoid a subtle technical problem. The hash must cover the request content, but inserting the hash into the request changes the content, which changes the hash. The system resolves this by writing a fixed-size placeholder into the request, computing the hash over the placeholder-bearing content, and then substituting the real hash value into the placeholder's position. Because the placeholder and the hash are the same size, the substitution does not change the content length, avoiding the need to reallocate the request buffer — a detail that matters at the volume of requests Claude Code generates.
Chapter 5: The Permission System
Every tool call in Claude Code passes through a permission system before it executes. This is not a simple allow/deny toggle — it is a multi-layered decision pipeline that balances safety, usability, and user autonomy. Understanding this pipeline is essential to understanding why Claude Code behaves the way it does when you use it.
The Seven Modes
The permission system operates in one of seven modes at any given time. Five are user-facing, and two are internal.
The default mode is what most users experience: the system asks for confirmation on potentially dangerous operations while silently approving safe ones. The edit-accepting mode automatically approves file modifications without prompting, reducing friction for users who trust the assistant's judgment on code changes. Planning mode disables all tool execution entirely — the assistant can only think and plan, never act. Bypass mode approves everything (with critical exceptions described below). Do-not-ask mode silently denies anything that would normally trigger a prompt, useful for scripted or batch workflows where no human is present to respond.
The two internal modes serve specialized purposes. The AI-classification mode delegates approval decisions to a secondary model (covered in depth in Chapter 6). The bubble mode handles permission propagation in nested agent scenarios — when one agent spawns another, bubble mode ensures the child agent inherits appropriate restrictions from its parent rather than operating with its own independent permission state.
The Three-Step Pipeline
When a tool call is about to execute, the system evaluates it through three sequential steps. The order matters: denial is checked first, then allowance, then everything else falls through to prompting.
Step 1: Denial. The system checks several denial conditions. Is this tool on an explicit deny list? Does a rule exist that requires prompting for this specific tool? Does the tool itself object to running in the current context? Does the operation target a protected path?
Protected paths deserve special attention because they are bypass-immune. Even when the user has enabled full bypass mode — explicitly telling the system to approve everything — operations targeting version control directories, the Claude Code configuration directory, and shell configuration files still trigger a prompt. This is a deliberate safety boundary: these paths control the behavior of the development environment itself, and modifying them without explicit consent could compromise the user's entire workflow. The bypass-immunity design reflects a philosophical position — there are some operations where convenience should never override safety.
Content-specific denial rules add another layer of granularity. A rule can require a prompt not just for a particular tool, but for a particular pattern within that tool's arguments. For example, a rule might flag any shell command that publishes to a package registry. This pattern-matching override applies even in bypass mode, creating narrow safety corridors for high-consequence operations.
Step 2: Allowance. If no denial condition fires, the system checks for explicit approval. Is the session in bypass mode? Does an allow rule cover this specific tool? If either condition holds, the call proceeds without prompting.
Step 3: Passthrough conversion. Anything not explicitly denied or allowed becomes an ask — the user sees a prompt and decides. The architecture here is subtle: tools can return a "passthrough" response, meaning "I have no opinion on whether this should be allowed." Passthrough is semantically distinct from an explicit ask. When a tool passes through, the decision propagates upward through the pipeline until it reaches Step 3, where it converts into a user prompt. This distinction matters because it allows tools to remain policy-neutral while the pipeline enforces a consistent default behavior.
Rule Sources and Persistence
Permission rules come from multiple sources, each with different persistence characteristics. User settings persist across all sessions and projects. Project settings are checked into the repository and shared with the team. Local settings are gitignored and apply only to the current developer's environment. CLI arguments apply to a single invocation. Command-level rules are attached to specific slash commands. Session rules exist only for the duration of the current conversation.
Rules can specify tool names with optional content matchers. For shell commands, the matcher operates on the command string itself. For MCP-provided tools, rules can target either an entire server (all tools from that server) or individual tools within a server.
Sandbox Integration
When Claude Code runs inside a sandboxed environment, the permission system integrates with the sandbox boundary. Shell commands destined for sandboxed execution can be automatically approved, since the operating system's isolation guarantees provide a stronger safety boundary than prompting alone. This integration point allows the permission system to defer to the sandbox rather than duplicating its protections.
Server-Side Killswitches
Two of the permission modes — bypass mode and AI-classification mode — can be remotely disabled via server-side feature flags. These flags are checked once per session at startup. This mechanism allows Anthropic to respond to discovered safety issues without requiring users to update their client. If a vulnerability is found in how bypass mode handles a particular class of operations, the mode can be disabled fleet-wide within minutes, then re-enabled once a client-side fix ships. It is a pragmatic acknowledgment that any sufficiently complex permission system will have edge cases, and the ability to respond quickly matters more than theoretical perfection.
Chapter 6: Auto Mode and the YOLO Classifier
If the permission system described in Chapter 5 is the skeleton, auto mode is the brain. Rather than applying blanket policies — approve everything or deny everything — auto mode delegates each decision to an AI classifier that evaluates tool calls individually. It is the most nuanced permission mode in Claude Code, and the one that best illustrates the system's approach to balancing autonomy with safety.
The Design Problem
The tension auto mode resolves is fundamental. Bypass mode is fast but dangerous — it approves everything, including destructive operations the user might not intend. The default mode is safe but interruptive — it prompts for every potentially risky action, breaking the user's flow. Auto mode sits between these extremes: it tries to approve safe operations silently while catching dangerous ones before they execute.
The challenge is that "safe" and "dangerous" are context-dependent. Running a test suite is usually safe. Running a test suite that drops and recreates a database is not. Editing a source file is usually safe. Editing a shell configuration file that runs on every terminal launch is not. Auto mode must make these distinctions in real time, for every tool call, without access to the full context of what the user intends.
The Fast Path
Before invoking the classifier — which requires an API call and costs tokens — the system checks whether the regular edit-accepting mode would have approved this particular tool call. Most operations in a typical coding session are file reads and writes, and the edit-accepting mode already has well-defined rules for those. If the edit-accepting mode says yes, auto mode says yes immediately, without consulting the classifier.
This fast-path optimization is critical for performance. In practice, it handles the majority of tool calls during a session. The classifier is only invoked for operations that the simpler mode would have prompted on — primarily shell commands, file operations on sensitive paths, and MCP tool calls. This keeps the token cost of auto mode manageable while preserving its decision-making capability for the cases that actually need it.
Two-Stage Classification
When the fast path does not apply, the classifier runs. It operates in two stages, each making a separate API call to a model — not the main conversation model, but a dedicated side call.
The first stage is a fast initial judgment. It receives context about the tool call (what tool, what arguments, what the conversation looks like) and makes a quick approve-or-deny decision. Most calls are resolved here.
The second stage activates for borderline cases — situations where the first stage's confidence is low or the operation is ambiguous. This stage uses a deeper reasoning process to evaluate the call. The system measures and records the time spent on each stage, providing performance telemetry that helps tune the classifier over time.
The two-stage design reflects a common pattern in ML systems: use a cheap, fast model for the easy cases, and reserve the expensive model for the hard ones. In this context, the "models" are not different neural networks but different prompting strategies applied to the same underlying model.
Dangerous Pattern Pre-Screening
Certain categories of tool calls are flagged before the classifier even sees them. These are operations with inherently high risk profiles: launching an interpreter that could execute arbitrary code, making network requests to external systems, or invoking system utilities with elevated privileges.
The pre-screening list is not identical across all builds. Internal builds (used by Anthropic employees) have a longer list that reflects additional security considerations for internal infrastructure. External builds use a shorter list focused on the patterns most likely to cause harm in a typical development environment.
Pre-screened operations are not automatically denied — they are flagged so the classifier applies extra scrutiny. The distinction is between "this might be dangerous, look carefully" and "this is definitely dangerous, block it." Pre-screening implements the former.
The Circuit Breaker
Auto mode includes a circuit breaker mechanism that detects when the classifier is systematically failing and falls back to interactive prompting. The trigger conditions are: three consecutive denials, or twenty total denials in a single session.
Three consecutive denials suggest the classifier is objecting to a pattern in the user's workflow, not a one-off risky command. If the user is trying to do something the classifier does not understand — perhaps working with an unusual tool or following an unconventional workflow — continued silent denial would be worse than asking. Twenty total denials over a session's lifetime suggests something systematic is wrong, even if the denials are not consecutive.
When the circuit breaker trips, the system does not shut down. It reverts to the default interactive mode for the remainder of the session: every potentially risky operation prompts the user. This is a graceful degradation — the user can still work, they just lose the automation. The next session starts fresh with the circuit breaker reset.
The thresholds — three consecutive, twenty total — were chosen through practical experience rather than theoretical analysis. They represent the point where the cost of continued silent denial (user frustration, blocked workflows) exceeds the cost of prompting (interruption, slower pace).
The Name
The classifier was internally nicknamed "YOLO" during development — a tongue-in-cheek reference to the bypass behavior it was designed to replace. Where bypass mode was truly "you only live once" (approve everything, consequences be damned), the classifier was meant to provide the same feeling of speed while actually checking each decision. In formal documentation and user-facing interfaces, it is called the auto-mode classifier. But the internal nickname persists in conversations about the system, and it captures something real about the design intent: auto mode is meant to feel like bypass mode, even though it behaves nothing like it.
Implications for Users
The practical effect of auto mode is that most users experience it as "Claude Code that just works." File edits happen without prompts. Safe shell commands run silently. Only genuinely risky operations — publishing packages, modifying system files, running destructive commands — trigger interruptions. The fast path, two-stage classification, and circuit breaker work together to create an experience that is both safe and fluid. When it works well, users forget the permission system exists. That is the highest compliment a security mechanism can receive.
Chapter 7: The Hooks System
Hooks are shell commands (or callbacks) that execute in response to specific Claude Code events. The hooks system is more extensive than most users realize -- there are fourteen distinct hook events, and several of them expose capabilities that have no other equivalent.
The Fourteen Hook Events
PreToolUse -- Fires before a tool call executes. The hook can approve, block, or modify the tool's input before it runs. Blocking a PreToolUse hook prevents the tool from executing at all. This is the standard use case: validate tool calls, enforce policies, prevent dangerous operations.
PostToolUse -- Fires after a tool call completes. The hook receives both the input and the output. For MCP tools specifically, a PostToolUse hook can modify the tool's output before the model sees it. This is a powerful capability for transforming or filtering external tool results.
PostToolUseFailure -- Fires when a tool call fails. Receives the error. Useful for logging, alerting, or taking recovery actions when tools break.
UserPromptSubmit -- Fires when the user submits a message. The hook can inject additional context into the model's context for that turn. This is a hook-based alternative to CLAUDE.md for dynamic context that changes per-message.
SessionStart -- Fires once at the beginning of a session. Receives the first user message. Can return an override for the initial user message, and can also return a list of absolute paths to watch for file change events.
Setup -- A configuration event for hooks that need to perform initialization. Fires before SessionStart.
SubagentStart -- Fires when a subagent is spawned. The hook can inject additional context into the subagent's session. This enables consistent cross-agent instrumentation -- any hook-configured context is automatically applied to every agent spawn.
Notification -- Fires when Claude Code generates a notification (task completed, system message). The hook can inject additional context.
PermissionDenied -- Fires when a tool call is blocked because it lacks permission. The hook can instruct Claude to retry with a modified approach.
PermissionRequest -- Fires when a tool call triggers a permission prompt. The hook can resolve the permission request -- approve with optional input modifications, or deny with an optional message. A PermissionRequest hook that approves a call prevents the human-in-the-loop prompt from appearing. This enables automated permission workflows: hooks that validate calls against policy and approve/deny without human intervention.
Elicitation -- Fires when the model triggers an elicitation (a structured prompt asking for user input). The hook can handle the elicitation programmatically: accept with content, decline, or cancel.
ElicitationResult -- Fires after an elicitation is resolved, with the user's response.
CwdChanged -- Fires when the current working directory changes. Like SessionStart, can return updated watch paths.
FileChanged -- Fires when a watched file changes. The watch list is maintained by SessionStart and CwdChanged hooks. This enables reactive behavior: hooks that respond to file system changes in real time.
WorktreeCreate -- Fires when a git worktree is created for isolated agent execution.
Sync vs. Async Hooks
Hooks can run synchronously (blocking the main loop) or asynchronously (returning immediately while the hook continues in the background). Async hooks signal their mode immediately, then communicate results through a separate protocol. Sync hooks block the next tool call until they complete.
Async hooks have configurable timeouts. A hook that takes too long can be configured to give up rather than block indefinitely.
What Hooks Can Return
The hook response schema is rich. Beyond the basic continue/decision fields, hooks can return:
- suppressOutput -- hide the hook's stdout from the session transcript
- stopReason -- a message shown to the user when the hook blocks continuation
- systemMessage -- a warning message shown to the user (does not block, just informs)
- updatedInput -- a modified version of the tool's input, applied before execution
- additionalContext -- injected into the model's context as additional information
- initialUserMessage -- for SessionStart hooks, overrides the user's first message
- watchPaths -- paths to add to the file watcher
Chapter 8: Tool Architecture Patterns
Claude Code has over forty tools. Most are straightforward (file read/write, shell execution, search). Several reveal architectural patterns worth understanding.
The Deferred Tool Pattern
Not all tool schemas are loaded upfront. A search tool exists specifically to fetch tool schemas on demand. Deferred tools appear by name in system reminder messages -- the model knows they exist but does not have their full parameter schema. Calling the search tool fetches the schema, making the tool callable.
This pattern reduces the upfront token cost of tool loading. Tools that are rarely used (obscure MCP tools, specialized integrations) do not need to occupy the context on every turn. They are discoverable via search and loaded when needed.
The Synthetic Output Tool
This tool generates synthetic output blocks -- artificial "tool results" injected into the conversation. It is an internal tool used by the coordinator system to inject structured information into the message stream that is not the result of a real tool call.
The Brief Tool
A tool for creating brief, structured summaries. Used internally when a compact representation of complex information is needed.
The LSP Tool
Language Server Protocol integration. This tool connects Claude Code to IDE language servers, enabling it to query type information, find definitions, and check diagnostics directly from the language server rather than through text search. This is how Claude Code can answer questions about types and interfaces without parsing source code manually.
The REPL Tool
A stateful REPL that persists state between calls within a session. Unlike the Bash tool (which runs each command in a fresh shell), the REPL maintains variable state across invocations. Useful for interactive computation where the model needs to build up state incrementally.
Tool Filtering and Cache Optimization
The fork system includes an exact-tools flag that makes a fork child receive byte-for-byte the same tool pool as its parent. This is a cache optimization: if the parent's tool schema and the fork's tool schema are identical, the API request prefix is identical, and the cache hit is guaranteed.
Setting a custom tool list on a fork breaks this guarantee. The fork receives different tool schema bytes, the API requests diverge, and the cache cannot be shared. This is why the fork documentation is explicit: do not customize tools on a fork.
The Permission System Interaction
Tool execution is gated by a permission system. Each tool has a permission rule that determines whether it requires human approval, runs automatically, or is blocked by policy. PreToolUse hooks can override these rules dynamically -- a hook can approve a call that would normally require human intervention, or block a call that would normally run automatically.
The bubble permission mode, used by fork children, sends permission requests up to the parent terminal rather than blocking the fork. This preserves the user's ability to approve or deny operations even when the agent running them is a background fork.
Chapter 9: The Agent Definition Schema
Most developers who define custom agents in Claude Code know three or four fields: a name, a description, maybe a model. The full schema has over fifteen configurable fields -- and the ones most people miss are often the most powerful.
The Complete Field Set
Agent type -- The agent's identifier. Used as the key in the agent registry and as the directory name when memory is enabled. Plugin-namespaced agents use a colon separator, which gets sanitized to a dash for filesystem paths.
When to use (also called description in markdown frontmatter) -- This is the most important and least documented field. It is what the parent model reads to decide which agent to spawn. It is not the agent's system prompt. It is not shown to the user. It is a one-paragraph decision brief injected into the parent's Agent tool description. Every word in this field is competing for the parent model's attention at the moment it chooses between agents. Write it like a label on a tool, not like a README.
Tools -- An explicit allowlist of tool names. When set, the agent only has access to these tools. When combined with a denylist, the denylist is subtracted from the allowlist.
Disallowed tools -- A denylist of tool names. The agent has all tools except these. This is how the read-only built-in agents work: they receive the full tool pool minus the editing tools. The system enforces this at the API level -- it is not just a hint.
Model -- Three valid values: a specific model ID, inherit (uses the parent model's model), or nothing (defaults to the configured default). The Explore agent uses a fast model by default for speed. The Plan and Verification agents use inherit because they need the same reasoning capacity as the parent.
Effort -- Controls extended thinking depth. Relevant for complex reasoning tasks.
Permission mode -- Overrides the session's permission mode for this agent. One special value is bubble -- used by the fork system so permission prompts from child agents surface to the parent terminal rather than blocking silently.
Max turns -- Maximum number of tool call rounds. The fork agent uses a high limit. Bounded agents are cheaper and more predictable.
Skills -- Skills available to this agent. Works like the skill system for the main agent.
Initial prompt -- A string prepended to the first user message in the agent's conversation, not to the system prompt. This is distinct from the system prompt. Use it for task-specific framing that should appear as user context rather than permanent instruction.
Memory -- Enables persistent memory for this agent. Three scope values: user (persisted across all projects), project (project-specific, can be version-controlled), and local (project-specific, gitignored). When memory is enabled, the system auto-injects write and read tools into the agent's tool pool so it can maintain its memory file.
Background -- Boolean. The Verification agent runs in background mode by default, meaning the main agent can continue working while verification runs in parallel.
Isolation -- Two values: worktree (agent runs in a temporary git worktree -- isolated copy of the repo, cleaned up automatically if no changes) or remote (runs in a remote environment for long sandboxed tasks).
Omit project config -- When true, the CLAUDE.md hierarchy is not loaded into this agent's context. This is a significant token optimization. The Explore and Plan agents both set this to true. The reason is practical: a code exploration agent doesn't need commit message conventions, Linear issue format rules, or CI setup. Providing that context wastes tokens and pollutes the system prompt with irrelevant instructions.
Required MCP servers -- An array of MCP server names. If any listed server is not connected, this agent does not appear in the available agents list. This enables conditional agents: a browser automation agent only appears when a browser MCP server is configured.
Color -- Display color in the UI. The Verification agent uses red.
Critical system reminder (experimental) -- A string injected into the conversation after every tool call result, not just at session start. Standard system prompts are static -- they are set once and the model can gradually "forget" them as the conversation grows. This field is a persistent reminder mechanism. The Verification agent uses it to reinforce its two rules: no project file modification, and the response must end with a VERDICT line.
The Key Insight
The "when to use" field and the system prompt serve completely different audiences. The "when to use" is read by the parent model to make spawning decisions -- it needs to be a precise, distinctive description of when this agent is the right choice. The system prompt is read by the spawned agent to understand its role -- it needs to be a detailed behavioral specification. Many custom agent definitions conflate these two. They write a system-prompt-style paragraph as the description and wonder why the parent model keeps choosing the wrong agent.
Chapter 10: The Agent Sourcing Hierarchy
When Claude Code loads agents, it processes them from six sources in order of increasing authority. Later sources win.
Level 1: Built-in -- Agents compiled into the Claude Code binary. These are the baseline agents that every user gets: general-purpose, Explore, Plan, Verification, and a few specialized agents. They cannot be removed by users (though they can be overridden).
Level 2: Plugin -- Agents defined in installed plugins. A plugin can add new agent types or extend built-in ones.
Level 3: User Settings -- Agents defined in the user's global agents directory. These are personal agents that apply across all projects.
Level 4: Project Settings -- Agents defined in the project's agents directory. These are project-specific agents checked into source control.
Level 5: Flag Settings -- Feature-flag-controlled agents. Used for A/B testing and gradual rollouts of new agent capabilities.
Level 6: Policy Settings -- Managed/enterprise policy settings. These override everything else. This is how enterprise administrators can enforce agent configurations that users cannot override.
What This Means in Practice
If a user defines an agent with a given name in their personal agents directory, and the project also defines an agent with the same name in its agents directory, the project version wins. This is intentional -- project-level configuration is more specific and should take precedence.
If an enterprise administrator deploys policy settings that restrict certain agent types or override "when to use" descriptions to control routing behavior, those settings override everything the user has configured. The hierarchy is a trust chain, not just a loading order.
The SDK also provides an escape hatch: an environment variable removes all built-in agents when running in non-interactive (SDK) mode. This lets SDK users start with a blank slate and define exactly the agent roster they want.
Chapter 11: The Five Built-in Agents
Each built-in agent reflects a specific design philosophy. Understanding why they were built the way they were built explains how to build better custom agents.
The General-Purpose Agent
The general-purpose agent is the default -- what you get when you call the Agent tool without specifying a type. Its design philosophy is "complete the task without overthinking it." The system prompt distinguishes between gold-plating (doing more than asked, adding unrequested features) and half-finishing (declaring done before the task is actually complete). The agent is explicitly instructed to walk the line between these two failure modes.
It has access to all tools. It is the Swiss army knife.
The Explore Agent
The Explore agent is optimized for one thing: reading codebases fast. It runs on a fast model rather than the default model. It omits project configuration to avoid loading unnecessary context. It disallows all file editing tools.
The read-only constraint is enforced at two levels: the disallowed tools list prevents editing operations at the API level, and the system prompt contains an explicit prohibition block listing exactly what the agent cannot do. The two-level enforcement matters because the disallowed tools list handles the tool interface, while the system prompt guidance handles emergent workarounds (a model trying to use Bash to redirect output to a file, for example).
The agent's description instructs callers to specify a thoroughness level: "quick" for basic searches, "medium" for moderate exploration, or "very thorough" for comprehensive analysis. This is guidance for the parent model -- the parent model is expected to include the thoroughness level in the prompt it writes.
The Plan Agent
The Plan agent shares the Explore agent's read-only philosophy but focuses on architectural output rather than code navigation. It uses the same model as the parent because planning tasks benefit from full reasoning capacity.
Its output format is structured: it ends every response with a "Critical Files for Implementation" section listing the most important files for the task. This makes the plan immediately actionable -- the coordinator can hand the file list directly to an implementation agent.
The Verification Agent
The Verification agent is the most sophisticated built-in agent.
The agent explicitly names two failure modes in its own system prompt. The first is "verification avoidance": instead of running checks, the agent reads code, narrates what it would test, marks items as passing, and moves on. Code reading is not verification. The second is "seduced by the first 80%": the agent sees a polished UI or a passing test suite and concludes the task is done, missing that half the features don't work. The first 80% of any implementation is the easy part -- the entire value of a verification agent is finding the last 20%.
The system prompt lists specific self-rationalizations the model will feel tempted to use and instructs the agent to recognize them as excuses:
- "The code looks correct based on my reading" -- reading is not verification, run it.
- "The implementer's tests already pass" -- the implementer is an LLM, verify independently.
- "This is probably fine" -- probably is not verified, run it.
Every check must follow a mandatory format: the exact command run, the actual terminal output, and a PASS/FAIL result. A check without command output is not a PASS, it is a skip.
The agent is required to run at least one adversarial probe: concurrency, boundary values, idempotency, or orphan operations. A verification that only confirms the happy path has only confirmed the happy path.
The verdict is machine-readable: the final line of every response must be VERDICT: PASS, VERDICT: FAIL, or VERDICT: PARTIAL. PARTIAL is reserved for environmental limitations only -- not for uncertainty. A critical system reminder field reinforces this rule after every tool call.
The agent runs in background mode, meaning the main agent can continue other work while verification runs in parallel.
The Claude Code Guide Agent
The self-referential agent: it knows Claude Code. It answers questions about Claude Code features, hooks, slash commands, MCP servers, IDE integrations, and the Agent SDK. It does not appear in SDK deployments -- only in interactive CLI sessions where users might ask "how do I do X in Claude Code."
Chapter 12: How the Parent Model Chooses -- The Agent Tool Prompt
When Claude Code injects the Agent tool description into the main model's context, it includes a list of available agents and their descriptions. The parent model reads this to decide which agent to spawn for a given task. Understanding exactly how this injection works reveals important optimization opportunities.
The Static/Dynamic Split
The agent list is injected as an "attachment message" rather than embedded in the tool description. This is a cache optimization. The tool description is part of the tools block sent with every API call. If the agent list is embedded in the tool description, any change to the available agents (new MCP server connected, plugin loaded, permission mode changed) causes the entire tools block to change, which busts the prompt cache for every subsequent API call.
By moving the agent list to an attachment message injected separately, the tool description stays static. The list can change without invalidating the cached prefix. Anthropic measured this at approximately 10% of fleet cache creation tokens before the fix -- a substantial waste.
The Agent Line Format
Each agent in the list is formatted to show: what is this agent for, and what can it do. The tools summary matters -- a parent model that sees an agent cannot create or modify files knows immediately when to use it versus the general-purpose agent.
The "When Not to Use" Section
The Agent tool description includes explicit guidance about when NOT to use agents. For directed lookups (read a specific file, search for a class definition), using search tools directly is faster than spawning an agent. Agents add overhead -- they start a new conversation, load context, and run a full model call. For a simple file read, that overhead is wasteful.
The "when not to use" guidance exists because the parent model, without it, will happily spawn an Explore agent to find a file path when a direct search call would answer the question in milliseconds. Correcting this tendency requires explicit instruction.
Writing the Prompt Guidance
The tool description contains guidance on how to write effective agent prompts. The core principle: fresh agents start with zero context. A fresh agent does not know what you have tried, what the conversation has been about, or why this task matters. Brief it like a smart colleague who just walked into the room.
The guidance distinguishes between two types of prompts:
- For lookups and narrow tasks: hand over the exact command. The agent should execute, not decide.
- For investigations: hand over the question. Prescribing steps becomes "dead weight when the premise is wrong."
The most important rule: never delegate understanding. Phrases like "based on your findings, fix the bug" or "based on the research, implement it" push synthesis onto the agent instead of doing it yourself. A good prompt proves the parent model understood the research by including specific file paths, line numbers, and exactly what to change.
Chapter 13: Coordinator Mode -- The Full Orchestration System
Most users know Claude Code as a single-agent tool. Inside the source, there is a complete multi-agent orchestration framework called Coordinator Mode.
The Coordinator's Role
The coordinator is an orchestrator, not an implementer. Its job is to help the user achieve their goal by directing workers, synthesizing their findings, and communicating results. It explicitly does not implement things directly when workers are available -- it plans, delegates, synthesizes, and reports.
Every message from the coordinator goes to the user. Worker results and system notifications are internal signals, not conversation partners. The coordinator never thanks or acknowledges workers. It summarizes new information for the user as it arrives.
The Four-Phase Workflow
Research phase -- Workers run in parallel to investigate the codebase, find files, and understand the problem. Multiple research workers can run simultaneously because they are read-only. The coordinator fans them out.
Synthesis phase -- This is the coordinator's most important job. After research workers report back, the coordinator reads their findings, understands the problem, and writes specific implementation specs. This phase cannot be delegated. The coordinator must understand before directing.
Implementation phase -- Workers make targeted changes per the coordinator's spec and commit. Write-heavy workers typically run one at a time per file set to avoid conflicts.
Verification phase -- Workers prove the changes work. The coordinator sends a verification worker that runs independently from the implementation worker -- fresh eyes, no implementation assumptions.
The system is explicit: parallelism is the coordinator's superpower. Independent tasks should run simultaneously. The coordinator launches multiple Agent tool calls in a single message to fan out work.
Worker Results Format
When a worker completes, its result arrives as a structured message. The format includes: a task ID (used to continue the worker), a status (completed, failed, or killed), a human-readable summary, the worker's final text response, and usage statistics (tokens, tool uses, elapsed time).
This structured format is machine-readable. The coordinator parses these notifications to decide what to do next: continue the worker, spawn a fresh one, or report to the user.
Continue vs. Spawn -- The Decision Matrix
One of the most practically useful pieces of the coordinator system is a decision matrix for whether to continue an existing worker or spawn a fresh one.
Continue the existing worker when: the research explored exactly the files that need editing (the worker has that context loaded and ready), or when correcting a failure or extending recent work (the worker has the error context).
Spawn fresh when: the research was broad but implementation is narrow (dragging along exploration noise pollutes the focused implementation context), or when verifying code (the verifier needs fresh eyes, not the implementer's assumptions), or when the first attempt used the wrong approach entirely (wrong-context anchoring causes retries to repeat the mistake).
The rule of thumb: think about how much of the worker's existing context overlaps with the next task. High overlap -- continue. Low overlap -- spawn fresh.
The "Never Fabricate Results" Rule
The coordinator prompt contains an explicit prohibition: never fabricate or predict worker results. Workers run asynchronously. Their results arrive as separate messages. The coordinator cannot know what a worker found before the notification arrives. If the user asks a follow-up question about work-in-progress, the correct response is "the worker is still running" -- not a fabricated prediction.
This rule is non-obvious to the model. Without it, a model will sometimes "predict" what the worker probably found based on the task description, presenting confident analysis that may be completely wrong.
Chapter 14: Fork Subagent -- The Experimental Parallelism Feature
The fork system is an experimental feature that introduces a third spawning mode alongside "spawn fresh agent" and "continue existing agent."
What a Fork Is
When a fork is created, the child agent inherits the parent's full conversation context and system prompt. It does not start fresh -- it starts with the complete conversation history up to the moment of forking. The parent continues working independently. Both run in parallel.
This differs from the standard Agent tool in an important way: a fresh agent needs a complete self-contained prompt because it has no context. A fork receives a short directive because it already knows everything the parent knows.
How Prompt Cache Sharing Works
Forks are described as "cheap" specifically because they share the parent's prompt cache. When multiple forks are launched simultaneously, they all share the same cached prefix -- the parent's conversation history plus system prompt. Only the final directive differs between forks.
To make this work, all fork children receive identical placeholder text in their tool result blocks. When the parent spawns multiple forks in one message, each fork gets the same placeholder result for every tool call in that message. Then each fork's unique directive is appended as a separate text block. The result: every fork's API request shares an identical prefix up to the last message, maximizing cache hits.
Setting a model parameter on a fork defeats this -- a different model cannot reuse the parent's cache. The guidance is explicit: don't set a model on a fork.
The Fork Boilerplate
Every fork child receives a mandatory instruction block at the start of its conversation. The key rules: the fork is a worker, not the main agent. It does not converse, ask questions, or suggest next steps. It uses tools silently and reports once at the end. It stays within its directive's scope. Its response must begin with "Scope:" with no preamble. The report should be under 500 words unless specified otherwise.
The output format is structured: Scope, Result, Key files, Files changed (with commit hash if applicable), Issues. This makes fork output easy for the parent to parse and synthesize.
The "Don't Peek" Rule
After launching a fork, the parent has no information about what the fork found. The tool result includes an output file path, but the parent is explicitly instructed not to read it. Reading the transcript mid-flight pulls the fork's tool calls into the parent's context, defeating the purpose of forking. The parent waits for the completion notification, which arrives as a user-role message in a later turn.
Coordinator Mode Exclusion
Fork subagent and coordinator mode are mutually exclusive. In coordinator mode, the orchestration role is already managed by the coordinator prompt. The fork experiment is disabled when coordinator mode is active. The two systems solve similar problems with different approaches and are not designed to coexist.
Chapter 15: The Memory System
Claude Code's memory system has two layers: the session memory everyone knows (project configuration files loaded at startup) and a persistent agent memory system that most users have never configured.
Agent Memory Scopes
Each agent type can have its own persistent memory in one of three scopes:
User scope stores memory in a user-level directory. This memory persists across all projects. It is the right scope for learnings that apply generally -- user preferences, cross-project patterns, things the agent should always remember regardless of which codebase it is working in.
Project scope stores memory in a project-level directory. This is project-specific and can be checked into version control. It is the right scope for project conventions, team decisions, and context that new agents should inherit from prior work on this codebase.
Local scope stores memory in a project-level but gitignored directory. Like project scope but not shared. For machine-specific state that should not be shared with the team -- local environment details, personal preferences for this project.
The Memory Entrypoint
Each agent's memory directory contains an index file as its entrypoint. This is the same pattern as the main agent's memory system. The file is loaded into the agent's system prompt at spawn time. All writes go to this directory.
When memory is enabled for an agent, the system automatically adds write and read tools to the agent's tool pool -- even if the agent's definition does not include them explicitly. The agent needs these tools to maintain its own memory file.
Remote Memory for Container Deployments
For cloud/container deployments where agents run in ephemeral environments, an environment variable redirects all memory writes to a persistent mount. The directory structure is preserved: the mount contains a subdirectory namespaced by the canonical git root, and agent memory is stored within that namespace. This allows agents running in Docker or remote environments to maintain persistent memory across container restarts.
Memory Scope Guidance
Each scope includes a guidance note injected into the agent's memory prompt:
- User scope: "keep learnings general since they apply across all projects"
- Project scope: "tailor your memories to this project" and note that the memory is shared with the team via version control
- Local scope: "tailor your memories to this project and machine"
This guidance appears in the memory prompt alongside the memory contents, so the agent knows how to write appropriate memories for its scope.
Chapter 16: The Skill System
Skills are how Claude Code gains specialized behaviors without hardcoding them into the application. They represent a middle ground between the always-present built-in tools and the heavyweight plugin system: lightweight, contextual, and designed to appear only when relevant. If tools are Claude Code's hands, skills are its training — they teach it how to approach specific kinds of work.
Five Sources, One Ordered List
When a session begins, the system assembles the available skill set by drawing from five sources in a defined precedence order. Bundled skills are compiled into the binary and always available. Togglable built-in plugin skills come from plugins that ship with the application but can be enabled or disabled by the user. User directory skills live in the user's personal configuration. Workflow-based skills are generated from automation definitions. Marketplace plugin skills come from installed third-party plugins.
Precedence matters when names conflict. If a bundled skill and a marketplace skill share a name, the later source in the precedence order wins. This ordering is deliberate: it allows users and plugin authors to override default behaviors without modifying the application itself.
The Skill File Format
A skill is defined as a markdown file with a structured frontmatter header. The header carries all metadata: the skill's description (used to decide when to surface it), any model override (allowing a skill to request a specific model for its execution), an effort level hint, a list of allowed tools (restricting what the skill can use), argument definitions (typed parameters the user can provide), invocation hooks (shell commands to run when the skill activates), and path restriction patterns.
The body of the file contains the skill's instructions — the actual guidance given to the assistant when the skill is active. This separation between metadata and content keeps the instructions clean and readable while enabling rich programmatic configuration in the header.
Conditional Activation
Not every skill should be available at all times. A skill can declare a list of file path patterns. When such patterns are present, the skill starts in an inactive state. It activates only when the assistant touches a file matching one of the declared patterns during the session.
This mechanism enables highly contextual behavior. A test-writing skill can activate when the assistant opens a test file. A database migration skill can activate when it touches a schema definition. A deployment skill can activate when it reads a CI configuration. The user does not need to manually invoke these skills — they appear naturally as part of the workflow, as if the assistant "remembered" the right approach at the right moment.
The activation is per-session and sticky: once a skill activates, it remains active for the rest of the session. This prevents the jarring experience of a skill appearing, disappearing, and reappearing as the assistant moves between files.
Dynamic Discovery
The skill system does not load all skills at session start and call it done. It actively monitors file operations throughout the session. When the assistant reads or writes a file, the system inspects the ancestor directories of that file for skill directories. Any skills found there are loaded into the session on the fly.
This dynamic discovery mechanism means the available skill set evolves as the assistant explores different parts of a codebase. Working in the frontend directory might surface React-specific skills. Moving to the backend might surface API design skills. Navigating to the infrastructure directory might surface deployment skills. Each part of the codebase can carry its own specialized knowledge, and that knowledge activates only when needed.
The design has a practical benefit for monorepo environments: different teams can maintain different skill sets in their respective directories, and the assistant adapts to whichever team's code it is currently working in.
The Security Boundary for External Skills
Skills loaded from external systems — specifically through the MCP protocol — face a critical restriction: they cannot execute inline shell commands. Regular skills, whether bundled or loaded from the filesystem, can include shell commands that run as part of their invocation (useful for setup tasks, environment checks, or build steps). MCP-loaded skills cannot.
This restriction exists because MCP servers are external processes, potentially running on remote machines or provided by third parties. Allowing an MCP server to inject arbitrary shell commands into the user's terminal through skill content would create a severe attack surface. A compromised or malicious MCP server could use skill definitions as a trojan horse to execute commands on the user's machine. By blocking shell execution for MCP-sourced skills, the system maintains a clear trust boundary: external skills can provide instructions and guidance, but they cannot directly act on the user's system.
Bundled Skill File Extraction
Skills compiled into the binary can include companion files — templates, configuration snippets, helper scripts, or other resources that the skill needs at runtime. These files are not extracted at startup (which would slow down launch). Instead, they are extracted to a temporary location when the skill is first invoked.
The extraction process uses operating system primitives that are hardened against symlink attacks. When writing companion files to disk, the system uses flags that refuse to follow symbolic links. This prevents a scenario where a malicious process creates a symlink at the expected extraction path, redirecting the companion files to an attacker-controlled location. The precaution may seem paranoid, but skills that include templates or configuration files could contain sensitive patterns, and the temporary directory is a shared resource on most operating systems.
Design Philosophy
The skill system reflects a conviction that the best developer tools are contextual rather than comprehensive. Rather than loading every possible behavior at startup and hoping the assistant picks the right one, skills appear when the context calls for them and stay quiet otherwise. This keeps the assistant's behavior predictable — users are not surprised by capabilities they did not ask for — while still enabling deep specialization when the work demands it. The system scales with the codebase rather than with the feature list, and that is what makes it practical for real-world use.
Chapter 17: The Command System
Slash commands are the primary way users invoke specific behaviors in Claude Code. Type a forward slash, pick a command, and something happens. Behind this simple interface lies a carefully designed system that handles registration, loading, visibility, and lifecycle — all while keeping startup fast and the codebase maintainable.
Three Command Types
Every command in the system is one of three types, and the type determines how it executes.
Prompt commands expand into text that gets sent to the model as a message. When a user invokes a prompt command, the system generates a prompt string and feeds it into the conversation as if the user had typed it. Skills and several built-in commands work this way. The command itself does not execute logic — it produces instructions that the assistant then follows. This is the simplest and most common command type.
Local commands run client-side logic and return a result directly. Most built-in operations — clearing the conversation, showing help, toggling settings — are local commands. They execute immediately in the CLI process without involving the model. The result might be displayed to the user, used to modify session state, or both.
Local-JSX commands render interactive terminal UI components. These are the most sophisticated command type: they produce rich, interactive interfaces within the terminal. The plugin manager, MCP server manager, and agent editor are all implemented as JSX commands. They can display lists, handle keyboard input, show previews, and manage complex state — all within the terminal's text interface.
JSX commands carry an important constraint: they are blocked in remote and bridge sessions. Because they require a local terminal to render interactive components, they cannot function when Claude Code is accessed through a remote connection or an IDE bridge. This is enforced at the type level — the system does not attempt to render JSX commands remotely and fall back gracefully. It simply prevents them from appearing in contexts where they cannot work.
No Auto-Discovery
Unlike many plugin systems that scan directories or use naming conventions to find commands, Claude Code requires every command to be explicitly registered in a central location. There is no magic directory where dropping a file creates a command. There is no naming convention that the system watches for.
This is an intentional design choice with concrete benefits. Explicit registration enables build-time dead code elimination: commands that are not registered are not included in the compiled binary, reducing its size. It provides explicit control over load order: the system knows exactly which commands exist and in what sequence they should be evaluated. And it creates a single source of truth for the command inventory, making the system easier to audit and reason about.
The tradeoff is that adding a new command requires editing the registry. In practice, this is a minor inconvenience that is outweighed by the benefits — especially in a compiled application where build-time optimization matters.
Lazy Loading
Command implementations can be heavy. Some commands pull in large dependency trees — the plugin manager needs the entire plugin resolution system, the MCP manager needs transport and protocol handling, and so on. Loading all of these at startup would add noticeable delay to every session, even sessions that never use those commands.
The solution is lazy loading. Rather than importing the implementation of each command at startup, the registry stores a function that returns the implementation when called. This function — essentially a deferred import — is only invoked when the user actually uses the command. The first invocation pays the loading cost; subsequent invocations use the cached result.
The impact on startup time is significant. In a system with dozens of commands, each potentially pulling in its own dependency tree, lazy loading keeps the startup path clean. The user pays only for the commands they actually use in a given session.
Feature-Flag Gating
Some commands exist only in specific builds. The build system supports conditional compilation: a command can be guarded by a feature flag, and builds without that flag have the command entirely absent. This is not hiding — the command's code is removed from the binary during compilation.
This mechanism serves several purposes. Experimental features can be developed and tested in internal builds without appearing in public releases. Platform-specific commands (those that only work on certain operating systems) can be excluded from builds targeting other platforms. Internal-only tools used by Anthropic employees can be kept out of the public binary entirely, reducing both binary size and attack surface.
Two-Layer Visibility
Even among registered commands, not all are visible to all users at all times. The system evaluates command visibility through two independent layers.
The first layer is availability, which is authentication-based. Some commands only appear for users authenticated through one service versus another. This check runs fresh on every evaluation because authentication state can change mid-session — a user might log in or log out, and the command list should update accordingly.
The second layer is enablement, which is condition-based. A command can be registered and available but currently disabled due to feature flags, environment conditions, or other runtime checks. A disabled command might still appear in help listings (marked as unavailable) or might be hidden entirely, depending on the implementation.
The two layers are evaluated independently. A command can be available (the user has the right authentication) but disabled (a required feature flag is off). Or it can be enabled (all conditions are met) but unavailable (the user is not authenticated with the right service). Both checks must pass for the command to be usable.
The Migration Pattern
When a built-in command is being moved to a plugin — a transition that happens as the system matures and capabilities are modularized — the system supports a transitional shim. The original command location is preserved, but instead of executing its former logic, it displays a message directing the user to install the plugin that now provides the functionality.
The shim includes the full installation instruction, making the migration as frictionless as possible. The message differs by audience: external users see the standard installation command, while internal users with access to additional distribution channels see a different path. This graceful deprecation pattern ensures that users are not broken by the migration — they get a clear, actionable message rather than a cryptic error.
The shim is temporary by design. Once adoption of the plugin version reaches a sufficient level, the shim can be removed entirely, completing the migration. This pattern makes it possible to evolve the system's architecture without breaking its users, which is essential for a tool that people depend on for daily work.
Chapter 18: The Plugin System
The plugin system is the largest extensibility mechanism in Claude Code. It is the answer to a question every platform eventually faces: how do you let third parties extend the system without compromising its integrity? The answer, in this case, is a carefully structured architecture that gives plugins access to the same building blocks used internally, distributed through a marketplace system with strong security guarantees.
Five Component Types
A plugin can provide any combination of five component types: slash commands, agent definitions, skills, event hooks, and output styles. These are not special plugin-only abstractions — they are the same extension points used by the Claude Code team internally. A plugin author has access to exactly the same capabilities as the application developers.
This symmetry is intentional. It means that anything Claude Code can do internally, a plugin can do externally. It also means that plugin authors learn one set of concepts, not a separate "plugin API." The barrier between built-in and third-party functionality is organizational, not technical.
The Manifest
Each plugin has a manifest file that declares its identity — name, version, author, description — and lists its components. The manifest is validated at two critical moments: when a plugin is installed and when it is published to a marketplace.
The validation strategy uses a deliberate dual-strictness design. At runtime, the manifest parser accepts unknown fields silently. This is for forward compatibility: when a new version of Claude Code introduces new manifest fields, older plugins that do not include those fields continue to work, and newer plugins that include them do not break older clients. But the plugin validation tool (used during development) applies strict parsing — if a developer typos a field name, they get immediate feedback rather than silent acceptance. This asymmetry catches mistakes during development while preserving compatibility in production.
The Marketplace
Plugins are distributed through marketplaces, which are git repositories containing a marketplace manifest. Claude Code ships with a default marketplace, but the system supports multiple marketplaces. Users can configure additional ones — a company might run an internal marketplace for proprietary plugins, or a community might maintain a curated collection.
A reconciler component manages the relationship between declared intent (which marketplaces the user has configured) and materialized state (what is actually cached on disk). The reconciler handles edge cases that simpler systems ignore: a marketplace being removed from configuration, a marketplace changing its source URL, a marketplace being renamed. In each case, the reconciler diffs the desired state against the actual state and resolves the discrepancy.
Anti-Impersonation
The marketplace has explicit defenses against name squatting and impersonation. Official marketplace names — those reserved for Anthropic — can only originate from a known GitHub organization. This is not a convention; it is enforced by the system.
Beyond organizational verification, a pattern-matching system blocks names that visually resemble official names. This catches obvious attempts like adding a hyphen or suffix to an official name. A separate check blocks Unicode homograph attacks — the technique of using visually similar characters from non-Latin alphabets to create names that look identical to official ones but are technically different strings. Together, these defenses make it difficult for malicious actors to publish plugins that users might confuse with official ones.
The Loading Pipeline
Plugins are loaded from two sources in order. Marketplace-based plugins are specified in settings by name and resolved through the local marketplace cache. Session-only plugins are provided as local directory paths and used during development or when running in SDK mode. The two-source design means that production users get the speed of cached marketplace plugins, while developers get the flexibility of testing plugins from their local filesystem.
Plugin files are versioned in the local cache to enable instant startup without network access. Once a plugin is installed, Claude Code does not need to contact the marketplace to use it. Network access is only required when installing new plugins or checking for updates.
Variable Substitution
Plugin components can reference variables that are resolved at invocation time. Path variables point to the plugin's installation directory and a separate persistent data directory that survives plugin updates (important for plugins that store state). A session identifier variable enables session-scoped behavior. User-configuration variables allow plugins to accept settings like API keys or endpoint URLs.
A critical security detail: sensitive configuration values are never exposed to the model directly. When the model sees plugin context, sensitive values are replaced with placeholders. The actual values are only used when executing tool calls. This prevents prompt injection attacks from extracting credentials — even if a malicious prompt tricks the model into echoing its context, the credentials are not there to echo.
Dependency Resolution
Plugins can declare dependencies on other plugins. The resolution algorithm follows the apt model (from Linux package management) rather than the npm model: dependencies are presence guarantees, not imported modules. A plugin that depends on another plugin simply declares the dependency, and the system ensures the required plugin is installed. There is no version pinning, no dependency tree resolution, no lock files. The system checks presence, not compatibility.
Cycle detection prevents circular dependencies. And a critical security constraint blocks cross-marketplace dependencies: a plugin on Marketplace A cannot require a plugin from Marketplace B. This prevents dependency confusion attacks — a class of supply chain vulnerability where an attacker publishes a package with the same name on a different registry, hoping the resolution system will prefer the malicious version.
MCP Server Deduplication
Plugins can provide MCP server configurations, and multiple plugins might provide configurations for the same underlying server. The system detects this using content-based signature matching: it computes a signature from the server's configuration and deduplicates servers with matching signatures. The signature computation is sophisticated enough to unwrap proxy URLs (used in remote sessions) and compare the original addresses underneath. This prevents the same MCP server from being launched multiple times under different names, which would waste resources and potentially cause conflicts.
The Built-In Plugin Registry
The infrastructure for built-in plugins — plugins that ship with the application but can be toggled on and off by users through the plugin UI — is fully implemented. However, the registry currently contains no entries. This scaffolding exists in preparation for a migration strategy: as the system matures, some capabilities that are currently hardcoded will be moved into togglable built-in plugins, giving users control over which features are active without requiring third-party installation. The empty registry is not incomplete — it is ready, waiting for the right moment to be populated.
Chapter 19: MCP Integration
The Model Context Protocol is a standard for connecting AI models to external tools and data sources. Claude Code does not treat MCP as an afterthought or a compatibility layer — it is a first-class extension mechanism, integrated as deeply as the native tool and plugin systems. Understanding the MCP integration reveals how Claude Code bridges the gap between its internal architecture and the broader ecosystem of AI tooling.
Seven Configuration Scopes
MCP servers can be configured at seven levels of specificity, each serving a different use case. Project-scoped configurations are checked into the repository, shared with the team, and version-controlled alongside the code. User-scoped configurations are personal and apply across all projects. Local-scoped configurations are gitignored and specific to one developer's environment — useful for personal API keys or development servers. Plugin-contributed configurations come from installed plugins. Enterprise policy configurations are managed by an organization and cannot be overridden by users. Cloud connector configurations come from the web application. Managed settings provide another administrative override point.
When the same server appears in multiple scopes, well-defined precedence rules resolve conflicts. The key principle: manual configurations always win over plugin-contributed ones. A user who explicitly configures a server should not have their settings overridden by a plugin that contributes the same server with different parameters.
Six Transport Types
The connection between Claude Code and an MCP server can use one of six transport mechanisms. Standard IO — communication over standard input and output of a subprocess — is the default and most common. Server-Sent Events over HTTP enables browser-friendly streaming connections. HTTP streaming provides general-purpose HTTP-based communication. WebSocket transport supports bidirectional real-time connections. Two internal transport types handle communication with IDE extensions (VS Code, JetBrains), where the MCP server runs within the IDE process. An SDK transport supports in-process MCP servers, useful for testing and embedding scenarios where launching a separate process is unnecessary.
The diversity of transports reflects the diversity of MCP server deployments. A local development tool might run as a subprocess. A cloud service might expose an HTTP endpoint. An IDE integration might communicate through the editor's extension API. By supporting all of these, Claude Code can connect to MCP servers regardless of how they are deployed.
Plugin-Provided MCP Servers
The integration between the plugin system and MCP is where the two extensibility mechanisms intersect. A plugin can contribute MCP servers through three channels, evaluated in order of priority.
The first channel is a configuration file at a conventional location within the plugin directory. If this file exists, its contents are parsed as MCP server definitions. The second channel is explicit entries in the plugin manifest, allowing more precise control over how servers are configured. The third channel is packaged MCP server bundles — a format that includes the server binary, its configuration, and any required resources together in a single distributable unit.
The system resolves these three sources in order and merges the results. This layered approach means plugin authors can start simple (drop a configuration file) and add sophistication (explicit manifest entries, bundled servers) as their needs grow.
The Channel System
Plugins can declare that their MCP servers require user-provided configuration before they can start. An MCP server that connects to a third-party API needs an API key. One that authenticates with a private service needs credentials. One that points to a self-hosted instance needs an endpoint URL.
These requirements are expressed through channels — named configuration slots that the plugin declares and the user fills in. The plugin UI detects channels with missing required configuration and prompts the user to complete setup before the server can start. This creates a structured onboarding flow: rather than failing with a cryptic error when a required API key is missing, the system guides the user through providing it.
Channels also interact with the variable substitution system described in Chapter 18. The values users provide are available to the MCP server configuration through variables, but — critically — they are treated as sensitive values and never exposed to the model.
OAuth and Cross-App Authentication
MCP servers that require OAuth authentication can declare their client credentials in the server configuration. The entire OAuth flow — launching the browser, handling the callback, exchanging codes for tokens — is managed by the CLI. The user does not need to manually copy tokens or configure credentials.
Client secrets are stored in the system keychain, not in configuration files. This is a deliberate security decision: configuration files are often committed to repositories, shared between machines, or readable by other processes. The system keychain provides OS-level encryption and access control.
The system also supports cross-app authentication, a mechanism that allows an MCP server to authenticate using credentials from another application the user has already authorized. If the user has logged into a service through their browser, an MCP server can leverage that existing authentication rather than requiring a separate login flow. This reduces authentication friction for users who work with multiple tools connected to the same services.
Enterprise Allowlisting
Enterprise deployments can restrict which MCP servers are permitted through an allowlist. The allowlist supports URL pattern matching with wildcards, enabling policies like "allow any server from our internal domain" without listing every server individually. When this policy is active, any server not matching an allowed pattern is rejected at connection time — it cannot start, cannot connect, and cannot contribute tools.
This mechanism gives enterprise administrators positive control over the MCP ecosystem within their organization. Rather than trying to block known-bad servers (a losing game), they permit known-good servers and block everything else. The wildcard support makes this practical even for organizations with many internal services, and the connection-time enforcement ensures that no unauthorized server can slip through by being added after the session starts.
The Bigger Picture
MCP integration is where Claude Code's architecture meets the outside world. The permission system governs what happens inside a session. The plugin system packages functionality for distribution. But MCP is how Claude Code connects to the vast ecosystem of tools, services, and data sources that developers actually use. The depth of the integration — seven scopes, six transports, plugin bridging, structured configuration, enterprise controls — reflects the conviction that AI coding tools are only as useful as the connections they can make. A brilliant assistant that cannot reach your database, your CI system, or your monitoring dashboard is an assistant that cannot do the job. MCP is how Claude Code ensures it can.
Chapter 20: The Transport Layer
Every networked application must eventually answer the question: how do bytes move between here and there? For Claude Code, "here" is a terminal on the user's machine and "there" is a constellation of remote systems — the Anthropic API, bridge endpoints for web and desktop access, and cloud compute workers for headless execution. The transport layer is the plumbing that connects them, and its design reflects hard-won lessons about real-world network conditions.
The Transport Hierarchy
The transport system is organized as a stack of increasingly specialized implementations, each building on the one below.
At the base sits a WebSocket transport, providing persistent bidirectional connections. This is the natural choice for a system where the server streams tokens back to the client in real time — HTTP request-response cycles would introduce unacceptable latency for interactive conversation.
Above this sits a hybrid transport that makes a surprising architectural choice: it reads from WebSocket but writes via HTTP POST. We will examine why shortly.
A separate SSE (Server-Sent Events) transport serves environments where WebSocket connections are unavailable — certain corporate proxies, restrictive firewalls, or serverless platforms that only support unidirectional streaming.
At the top of the stack, a cloud compute client wraps the SSE transport with worker lifecycle management: heartbeats to prove liveness, delivery receipts to confirm message processing, and state reporting to coordinate distributed execution.
Each layer adds functionality without replacing the layer below. A cloud compute session still uses SSE underneath, which still handles the same streaming protocol. The hierarchy is additive, not substitutive.
Why Split Reads and Writes?
The hybrid transport's design — WebSocket for reads, HTTP POST for writes — appears unnecessarily complex until you understand the concurrency problem it solves.
The bridge backend (discussed in Chapter 21) stores session state in a document-based system. Concurrent writes to this system are not safe; if two writes arrive simultaneously, they can conflict and corrupt state. WebSocket messages are inherently fire-and-forget from the sender's perspective — there is no built-in mechanism to serialize them or confirm ordering at the application layer.
By routing all writes through a serial batch uploader that permits at most one HTTP POST in flight at any time, the conflict is eliminated entirely. Each write waits for the previous write's acknowledgment before dispatching. Fire-and-forget WebSocket writes could arrive out of order; serial HTTP POST writes arrive in guaranteed order. The read path remains on WebSocket because reads have no concurrency constraint — receiving streamed tokens is naturally sequential.
Buffering and Coalescing
Raw streaming produces an enormous volume of small messages. Each token the model generates creates a separate event. Sending each one individually would saturate the network with overhead — HTTP headers, TLS framing, and connection management would dwarf the actual payload.
Two mechanisms address this. First, 100-millisecond buffering: before sending streaming events, the transport accumulates them for 100ms and dispatches as a single batch. A non-streaming write (such as a tool call result) flushes the buffer immediately to preserve causal ordering. This dramatically reduces POST count for verbose responses without introducing perceptible latency.
Second, text delta coalescing at the cloud compute client level. Rather than forwarding each incremental text chunk individually, the client accumulates them into a single "full-so-far" snapshot for each content block. Each snapshot is self-contained — a client connecting mid-stream sees the complete text accumulated to that point, not just the latest fragment. This elegantly solves the reconnection problem: there is no need to replay message history, because the latest snapshot already contains everything.
Sleep, Wake, and Reconnection
Laptops sleep. Wi-Fi drops. Tunnels enter and exit. A transport layer that cannot handle these realities gracefully will frustrate users who expect their session to survive closing a laptop lid.
The WebSocket transport tracks the elapsed time between reconnection attempts. If the gap exceeds a threshold — roughly twice the maximum retry delay, approximately 60 seconds — the system infers that the machine went to sleep rather than experiencing a network failure. When sleep is detected, the reconnection budget is reset. Without this heuristic, a machine waking from a long nap would find its retry budget exhausted and refuse to reconnect, even though from the user's perspective the session should simply resume.
Rate Limits and Backpressure
When the server responds with a rate-limit signal, the batch uploader respects the server's requested retry delay. But a naive implementation creates a thundering herd: many sessions pause simultaneously and then all resume at the same moment, immediately triggering another rate limit. To prevent this, the retry delay is clamped to a reasonable maximum and then jittered — each client waits a slightly different random duration, spreading the resumed traffic across time.
On the client side, backpressure prevents memory exhaustion when writes outpace the network. When the upload queue grows beyond a threshold, new writes block rather than growing the queue unboundedly. This is the transport layer's pressure valve: it converts a memory problem (unbounded queue growth) into a latency problem (writes slow down), which is almost always the preferable failure mode.
Design Philosophy
The transport layer embodies a principle that recurs throughout Claude Code's architecture: each layer handles one concern well and delegates everything else downward. The WebSocket transport handles connection lifecycle. The hybrid transport handles write serialization. The cloud compute client handles distributed worker coordination. No layer tries to do everything, and no concern is split across multiple layers.
This layering also enables graceful degradation. If WebSocket is unavailable, SSE provides a fallback. If the network is slow, buffering and coalescing reduce overhead. If the machine sleeps, the reconnection logic adapts. The transport layer's job is to make unreliable networks feel reliable — and to do so invisibly, so the layers above never need to think about it.
Chapter 21: The Bridge and Remote Control System
Claude Code begins life as a terminal application. But terminals are not always where users want to be. The bridge system is the mechanism by which a Claude Code session running in a terminal becomes accessible from a web browser, a desktop application, or an IDE extension. It transforms a local, single-user process into something that can be observed and controlled remotely.
Two Generations, One Interface
The bridge has two complete implementations coexisting behind a feature flag, reflecting an ongoing architectural migration.
The first generation follows an environment-based pattern: the CLI registers itself with a backend environment system, then polls for work assignments, executes them, and reports results. This design emerged from an era when the bridge was conceived as a way to assign work to agents rather than to observe interactive sessions.
The second generation is simpler and more direct: the CLI creates a session on the backend and establishes a streaming connection. There is no polling, no work assignment — just a persistent channel through which the session's state flows to remote observers and their inputs flow back.
Both implementations produce identical user-facing behavior. The migration from first to second generation is happening gradually across the user base, controlled by feature flags. This coexistence is architecturally expensive — two complete implementations must be maintained — but it allows the transition to proceed without disruption.
Three Spawn Modes
When running as a persistent bridge server rather than as part of an interactive terminal session, the bridge can operate in three modes that reflect different tradeoffs between isolation and resource usage.
Single-session mode creates one session per bridge invocation. When the session ends, the bridge process exits. This is the simplest model and the right choice for most interactive use cases.
Worktree mode creates an isolated git worktree for each new session. This prevents concurrent sessions from interfering with each other's file state — if two sessions are editing files simultaneously, each operates in its own copy of the repository. The cost is disk space and setup time for each worktree.
Same-directory mode runs all sessions against the shared working directory. This risks file conflicts when sessions overlap, but uses less disk space and avoids worktree setup overhead. It is appropriate when sessions are sequential or when the user accepts the coordination risk.
Cross-Process Token Failure Backoff
OAuth tokens expire. Refresh tokens can be revoked. When the token used to authenticate with the bridge backend fails and cannot be refreshed, the system must handle this gracefully — and "gracefully" means more than just retrying.
The failure state is persisted to disk. Subsequent processes — even entirely separate CLI invocations, not just retries within the same process — read this state before attempting authentication. If a failure was recorded recently, new processes skip the attempt entirely and proceed directly to an error state.
This prevents a pernicious scenario in automated environments: dozens of scheduled or scripted processes all holding the same dead token, all simultaneously hammering the authentication server, all receiving the same rejection. The disk-persisted failure state acts as a distributed circuit breaker across processes that share no memory.
The failure record is keyed to the token's expiry time. When a user logs in with a fresh token, the key no longer matches, and the cached failure is automatically invalidated. No explicit cache-clearing step is needed.
Crash Recovery Through Bridge Pointers
After connecting to the bridge backend, the CLI writes a small file — a bridge pointer — recording the session details. If the process crashes, is killed, or the machine loses power, this file remains on disk.
The next time the bridge is started, it checks for this file and offers to resume the interrupted session. The file's modification time serves as a heartbeat: the running process periodically re-writes the file, so its age indicates how recently the process was active. Sessions whose pointer files are older than a threshold are not offered for resume — the backend will have expired them anyway, so attempting to reconnect would fail.
This is a pragmatic recovery mechanism. It does not guarantee seamless continuation (the backend may have discarded the session's state), but it handles the common case — a process killed by a signal, a terminal window accidentally closed — with minimal user friction.
The FlushGate State Machine
When a remote client connects to an in-progress bridge session, it needs to see the conversation history before receiving real-time updates. This creates a sequencing problem: historical messages are being transmitted as a batch while new real-time messages are simultaneously arriving from the model. If these interleave, the client sees a garbled timeline.
A state machine manages this transition through three states:
Flushing: the historical batch is being transmitted. New real-time messages are queued rather than sent immediately. The queue preserves ordering so nothing is lost.
Active: the batch is complete and the queue has been drained. New messages flow directly to the client without buffering. This is the steady state for the remainder of the session.
Dropped: the transport was replaced while a flush was in progress — for example, the user's browser reconnected on a new WebSocket before the old flush completed. Pending queued messages are discarded because they belong to a transport that no longer exists. Without this state, the system would attempt to drain queued messages into a dead connection.
Session Identity Compatibility
Two subsystems within the bridge use different formats for session identifiers. The streaming protocol produces identifiers with one prefix convention; the client-facing interface expects identifiers with a different prefix. Rather than forcing a synchronized migration across both subsystems, a compatibility layer transparently re-tags identifiers at the boundary between the two systems. A feature flag controls whether this shim is active, allowing it to be disabled once the server is updated to accept either format natively.
This is a small detail, but it illustrates a recurring theme in Claude Code's architecture: compatibility layers absorb the cost of incremental migration so that users never experience discontinuity.
Appendix A: The Buddy System
Not every architectural decision serves a production requirement. Sometimes engineering teams build things because they are delightful. The buddy system — a companion creature feature introduced as an April Fools' Day experiment in 2026 — is one of those things. But beneath its whimsy lies genuinely interesting design work.
Bones and Soul
Each user receives a unique ASCII art companion that lives alongside their terminal session, occasionally commenting in a speech bubble. The companion has two distinct identity layers, and the separation between them is architecturally deliberate.
The bones — species, rarity, eye style, hat, and numerical stats — are deterministically generated from a hash of the user's identifier. The generation is never stored; it is recomputed fresh on every access. This design means the companion's appearance cannot be tampered with by editing configuration files, because there is no stored appearance to edit. The hash function is the single source of truth.
The soul — a name and personality — is generated once through a conversation with the model and then persisted. This is the result of actual user interaction: the user participates in naming and characterizing their companion. Unlike the bones, the soul is stored because it cannot be recomputed; it emerged from a specific conversational moment that will not recur identically.
This creates a meaningful distinction: cosmetic identity is deterministic and tamper-proof, while character identity is generative and personal.
Rarity, Stats, and Personality
Species are assigned with weighted probabilities across a rarity spectrum from common to legendary. Rarity affects three things: the companion's stat floor (legendary companions have higher baseline stats than common ones), hat eligibility (common companions do not wear hats), and shininess (a rare independent probability that produces rainbow-colored display).
Each companion has five RPG-style stats representing personality traits. The stat distribution is not uniform — each companion has a peak (one notably high stat) and a dump (one intentionally low stat). This creates personality differentiation. A companion with high curiosity and low caution feels meaningfully different from one with the inverse distribution, even though both are generated from the same system.
The Forbidden String Problem
One species name happens to collide with a model development codename that appears in the build system's list of forbidden strings — unreleased product names that must not appear in build artifacts. The literal string would trigger a false positive in automated scanning.
The solution is to construct the name at runtime from individual character codes rather than including it as a literal string in the source. The name never appears in compiled output, the automated scan passes cleanly, and the feature works correctly. It is a pragmatic workaround for a tooling constraint — the kind of small, unglamorous problem that real systems must solve constantly.
Virality by Time Zone
The companion teaser — shown to users who have not yet hatched a companion — runs only during the first week of April. The display logic uses the user's local time zone rather than a UTC cutoff.
This choice is intentional and reflects thinking about social media dynamics. A UTC-based cutoff would create a synchronized global event: usage spikes at midnight UTC and drops simultaneously everywhere. Local-time-based logic creates a rolling wave — users in different time zones encounter the feature at different local times throughout the day. The effect is a longer sustained period of social media discussion compared to a single synchronized spike. People in early time zones post about it, generating curiosity among those in later time zones who have not yet seen it.
The Theatrical Separation
The companion is introduced to the model through an attachment in the system prompt. The model is told that the companion is a separate observer — not an alter ego of the assistant, but an independent entity watching the session. When the user addresses the companion directly, the model is instructed to step back and let the companion respond.
Of course, the companion's responses are also generated by the model. The separation is entirely theatrical. But theatrical separation is powerful: it creates the impression of an independent entity with its own perspective, even though the underlying intelligence is the same. Users naturally anthropomorphize the companion, talk to it, and develop attachment — precisely because the framing encourages them to treat it as a separate being.
This is perhaps the most interesting design insight in the buddy system: identity is not about the source of intelligence but about the frame placed around it.
Appendix B: Glossary of Architectural Patterns
This appendix defines recurring patterns used throughout the Claude Code architecture. Each pattern appears in multiple subsystems; understanding them here allows readers to recognize them quickly in new contexts without re-deriving the reasoning.
Module-as-Singleton
A module that exports a single mutable state object, initialized at module load time. State is accessed through exported getter and setter functions rather than through direct object references. This pattern appears when a subsystem needs to maintain state that outlives any single request or conversation turn, and when circular import chains would make dependency injection impractical. The tradeoff is global mutable state — with all its testability drawbacks — in exchange for simplicity and freedom from import cycles.
DAG Leaf Constraint
A module that is prohibited from importing anything from the rest of the application. This constraint is imposed on foundational modules that everything else depends on, ensuring they sit at the leaves of the dependency graph and preventing circular imports. The cost is real: utilities and helpers that exist elsewhere in the codebase must be duplicated or inlined rather than imported. The benefit is that the module's initialization is guaranteed to be side-effect-free and its load order is never ambiguous.
Sticky-On Latch
A boolean flag that, once set to true, is never reset during a session. This pattern appears when a behavior must remain consistent throughout a session for correctness — typically because external systems have cached state based on the flag's first observed value. Toggling the flag mid-session would create an inconsistency between local and remote state.
Circuit Breaker
A failure counter paired with a threshold. When consecutive failures exceed the threshold, the system stops retrying and falls back to a safe but potentially degraded behavior. The purpose is to prevent runaway retry storms that waste resources, add latency, and provide no value. Circuit breakers appear in authentication flows, network transports, and tool execution paths throughout the system.
Passthrough
A value in a decision pipeline that means "I have no opinion — let the next layer decide." Distinct from an explicit allow or an explicit deny. This pattern enables modular permission logic where each layer handles only the concerns it understands, without needing to replicate the decision logic of other layers. A permission check might pass through several layers that all return passthrough before reaching the one layer that has a definitive answer.
Bubble Permission Propagation
When a nested agent needs user permission, the permission request travels upward through the nesting hierarchy to the topmost terminal, where a human can actually respond. Without bubbling, nested agents would face an impossible choice: always blocked (too restrictive) or always auto-approved (no human oversight). Bubbling preserves the human-in-the-loop guarantee regardless of nesting depth.
Content-Based Deduplication
Identifying entities as identical based on what they contain rather than what they are named. This pattern appears in server management to prevent launching the same underlying service twice under different names, even when the names originate from different configuration sources. The deduplication key is derived from the entity's configuration — its command, arguments, and environment — rather than its user-facing label.
Reconciler Pattern
A component that takes declared intent (what the user configured) and materialized state (what actually exists on disk or in memory) and brings them into alignment. The reconciler must handle edge cases that simple "apply configuration" logic misses: intent changed while the system was offline, materialized state was manually modified by the user, or intent references something that no longer exists. The reconciler pattern appears in server lifecycle management and configuration synchronization.
Capacity Wake
A pattern for poll loops that need to sleep between iterations but must wake immediately when capacity becomes available. Rather than sleeping for a fixed interval and potentially wasting time, the loop listens for a cancellable signal that external code can trigger when conditions change. This converts a polling loop into an event-driven loop without abandoning the polling structure entirely — the fixed-interval sleep serves as a fallback if the signal mechanism fails.
FlushGate
A state machine that queues new writes while a batch flush of historical data is in progress, then drains the queue after the flush completes. This prevents interleaving between historical batch content and new real-time content, which would produce a garbled timeline for the consumer. The state machine includes a "dropped" state for the edge case where the transport is replaced mid-flush, ensuring queued messages for a dead transport are discarded rather than delivered into the void.
Appendix C: Agent Definition Quick Reference
A summary of all fields available in agent definitions, with their types and defaults.
| Field | Type | Default | Description |
|---|---|---|---|
| Agent type | string | required | Agent identifier |
| When to use | string | required | Decision brief for parent model |
| Tools | string[] | undefined | Tool allowlist |
| Disallowed tools | string[] | undefined | Tool denylist |
| Model | string | default | Model or 'inherit' |
| Effort | string | undefined | Extended thinking effort |
| Permission mode | string | default | Permission mode override |
| Max turns | number | unlimited | Max tool call rounds |
| Skills | string[] | undefined | Available skills |
| Initial prompt | string | undefined | Prepended to first user message |
| Memory | string | undefined | 'user', 'project', or 'local' |
| Background | boolean | false | Run as background task |
| Isolation | string | undefined | 'worktree' or 'remote' |
| Omit project config | boolean | false | Skip project config loading |
| Required MCP servers | string[] | undefined | Conditional visibility |
| Color | string | undefined | UI display color |
| Critical reminder (experimental) | string | undefined | Post-tool-call reminder |