OpenAI Prompting Best Practices
A comprehensive guide to prompting, model selection, and API usage for the May 2026 OpenAI model lineup.
Chapter 1: Introduction & Model Overview
The Current Landscape (May 2026)
OpenAI's model lineup has consolidated into a clear hierarchy. The GPT-5.x family now dominates both the consumer product (ChatGPT) and the developer API, with earlier generations on a deprecation glide path toward shutdown. The flagship models share a common architecture that supports function calling, web search, file search, computer use, and reasoning at adjustable effort levels (none / low / medium / high / xhigh). Meanwhile, a set of specialized models handles image generation, real-time audio, autonomous coding, and video.
For developers, the most consequential shift in the past year has been the retirement of the GPT-4.x era. The GPT-4o, GPT-4.1, and o-series models that dominated 2024-2025 are now deprecated, with API shutdown dates set for late 2026. New projects should target the GPT-5.x family exclusively; migration for existing projects should be underway.
Two API surfaces coexist: the legacy Chat Completions API and the newer Responses API. OpenAI's documentation explicitly recommends the Responses API, particularly for reasoning models, which "perform better and demonstrate higher intelligence" when used with it. This book assumes the Responses API throughout unless otherwise noted.
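A minimal Responses API call with the Python SDK looks like the sketch below. The model name is taken from the table that follows and should be treated as illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One-shot request via the Responses API; output_text concatenates
# the text segments of the response for convenience.
response = client.responses.create(
    model="gpt-5.4",
    input="Summarize the trade-offs between batch and priority pricing tiers.",
)
print(response.output_text)
```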
Flagship Models at a Glance
| Model | Context Window | Max Output | Knowledge Cutoff | Key Strengths |
|---|---|---|---|---|
| GPT-5.5 | 1M tokens | 128K tokens | Dec 1, 2025 | Top-tier intelligence for coding and professional work |
| GPT-5.5 Pro | 1M tokens | 128K tokens | Dec 1, 2025 | Highest-quality single-shot reasoning; extended thinking |
| GPT-5.4 | 1M tokens | 128K tokens | Aug 31, 2025 | Strong general-purpose model at lower cost |
| GPT-5.4 Pro | 1M tokens | 128K tokens | Aug 31, 2025 | Extended reasoning at the 5.4 tier |
| GPT-5.4 mini | 400K tokens | 128K tokens | Aug 31, 2025 | Best mini model for coding, computer use, and subagents |
| GPT-5.4 nano | -- | -- | -- | Lowest-cost model for high-volume, lightweight tasks |
Note: GPT-5.4 nano's context window and knowledge cutoff are not confirmed in current documentation. GPT-5.5 Pro's context window is inferred from the GPT-5.5 family but not independently verified. Check the models page for the latest specifications.
All flagship models support reasoning effort adjustment (none, low, medium, high, xhigh), allowing you to trade latency and cost for depth of thought on a per-request basis. This is one of the most important levers for prompt optimization and is covered in depth in Chapter 4.
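As a quick illustration, reasoning effort is set per request. The sketch below uses the effort labels from this chapter; verify the exact accepted strings against the API reference for your model version:

```python
from openai import OpenAI

client = OpenAI()

# Dial reasoning effort up or down per request. Lower effort cuts
# latency and cost; higher effort buys deeper deliberation.
response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "low"},  # e.g. "high" for hard, multi-step problems
    input="Route this ticket: 'My invoice shows a duplicate charge.'",
)
print(response.output_text)
```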
Pricing
Pricing is per 1 million tokens unless otherwise noted. All flagship models offer multiple tiers:
- Standard -- the default tier; no special latency guarantees.
- Cached -- automatic discount when the prompt prefix matches a recent request (see Prompt Caching below).
- Batch -- asynchronous processing at 50% of standard rates; results returned within 24 hours.
- Priority -- guaranteed low-latency access at a premium.
- Flex -- matches Batch pricing but with synchronous access during off-peak capacity.
Flagship Models -- Short Context
| Model | Standard (In / Out) | Cached Input | Batch (In / Out) | Priority (In / Out) |
|---|---|---|---|---|
| GPT-5.5 | $5.00 / $30.00 | $0.50 | $2.50 / $15.00 | $12.50 / $75.00 |
| GPT-5.5 Pro | $30.00 / $180.00 | -- | $15.00 / $90.00 | -- |
| GPT-5.4 | $2.50 / $15.00 | $0.25 | $1.25 / $7.50 | $5.00 / $30.00 |
| GPT-5.4 Pro | $30.00 / $180.00 | -- | $15.00 / $90.00 | -- |
| GPT-5.4 mini | $0.75 / $4.50 | $0.075 | $0.375 / $2.25 | $1.50 / $9.00 |
| GPT-5.4 nano | $0.20 / $1.25 | $0.02 | $0.10 / $0.625 | -- |
Cached input prices reflect a ~90% reduction from standard input rates. The Pro models and GPT-5.4 nano do not list cached or priority pricing.
Flagship Models -- Long Context
When requests exceed the short-context threshold, pricing increases. The exact token boundary is not published, but the price differential is significant:
| Model | Standard (In / Out) | Cached Input | Batch (In / Out) |
|---|---|---|---|
| GPT-5.5 | $10.00 / $45.00 | $1.00 | $5.00 / $22.50 |
| GPT-5.5 Pro | $60.00 / $270.00 | -- | -- |
| GPT-5.4 | $5.00 / $22.50 | $0.50 | $2.50 / $11.25 |
| GPT-5.4 Pro | $60.00 / $270.00 | -- | $30.00 / $135.00 |
Long-context pricing roughly doubles input costs and increases output costs by ~50% compared to short-context rates. This makes prompt caching and context management critical for cost control at scale -- a theme we return to in Chapters 3 and 5.
Data Residency Surcharge
Models released after March 5, 2026 incur a 10% surcharge when using regional data-residency endpoints. Factor this into cost projections if your deployment requires data to stay in a specific geography.
Specialized Models
GPT-5.3-Codex (Autonomous Coding Agent)
Released February 24, 2026, this model powers OpenAI's Codex product -- an autonomous coding agent that operates in sandboxed cloud environments. It supports the full range of reasoning effort settings (none through xhigh).
| Tier | Input | Cached Input | Output |
|---|---|---|---|
| Standard | $1.75 | $0.175 | $14.00 |
| Priority | $3.50 | $0.35 | $28.00 |
Batch pricing for GPT-5.3-Codex is not separately listed in current documentation.
The Codex model is optimized for long-running, multi-step coding tasks where the agent reads files, writes code, runs tests, and iterates. It is not a general-purpose chat model -- it excels when given a well-defined task specification and access to a codebase. See Chapter 8 for Codex-specific prompting patterns.
GPT Image 2
State-of-the-art image generation with a reasoning/thinking mode. Supports up to 4K resolution with approximately 99% text rendering accuracy -- a dramatic improvement over earlier DALL-E models.
| Tier | Image Input | Text Input | Cached (Image / Text) | Image Output |
|---|---|---|---|---|
| Standard | $8.00 | $5.00 | $2.00 / $1.25 | $30.00 |
| Batch | $4.00 | $2.50 | $1.00 / $0.625 | $15.00 |
GPT Image 2 uses token-based pricing rather than per-image pricing. Actual cost depends on image resolution and complexity. See Chapter 10 for image generation prompting strategies.
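A hedged sketch of an image request follows. The call shape matches the SDK's standard Images API, but the model name comes from this section and GPT Image 2's exact parameters are not confirmed here:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Generate an image and save it locally. Token-based billing means
# cost scales with resolution and prompt complexity.
img = client.images.generate(
    model="gpt-image-2",  # name as used in this section; verify in the docs
    prompt="Cutaway diagram of a jet engine, labeled, technical illustration style",
    size="1024x1024",
)
with open("engine.png", "wb") as f:
    f.write(base64.b64decode(img.data[0].b64_json))
```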
Real-Time Audio
| Model | Audio (In / Out) | Text (In / Out) | Cached Audio Input |
|---|---|---|---|
| gpt-realtime-1.5 | $32.00 / $64.00 | $4.00 / $16.00 | $0.40 |
`gpt-realtime-mini` is deprecated. Migrate to `gpt-realtime-1.5`.
Video Generation (Sora 2)
| Model | 720p | 1024p | 1080p |
|---|---|---|---|
| Sora 2 (Standard) | $0.10/sec | -- | -- |
| Sora 2 (Batch) | $0.05/sec | -- | -- |
| Sora 2 Pro (Standard) | $0.30/sec | $0.50/sec | $0.70/sec |
| Sora 2 Pro (Batch) | $0.15/sec | $0.25/sec | $0.35/sec |
Transcription
`gpt-4o-transcribe` and `gpt-4o-mini-transcribe` remain available for speech-to-text at $0.003-$0.006 per minute.
Embeddings
`text-embedding-3-small` and `text-embedding-3-large` remain available. See the pricing page for current rates.
Web Search Tool
Web search is billed per call, not per token:
- Reasoning models (GPT-5.x): $10.00 per 1,000 calls. Search-retrieved content tokens are billed at the model's standard rates.
- Non-reasoning models: $25.00 per 1,000 calls. Search-retrieved content tokens are included (not billed separately).
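To make per-call billing concrete, here is a sketch of enabling web search in a Responses API request. The tool type string has varied across API revisions, so confirm the current value before relying on it:

```python
from openai import OpenAI

client = OpenAI()

# Each web search invocation is billed per call; retrieved content
# tokens are billed (or not) per the rules above.
response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "web_search"}],  # verify the current tool type string
    input="What changed in the latest stable PostgreSQL release?",
)
print(response.output_text)
```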
Model Selection Decision Tree
Choosing the right model is the single highest-leverage prompt engineering decision. The wrong model wastes money on easy tasks or produces weak results on hard ones. Use this framework:
START
|
v
Is this an autonomous coding task (multi-file edits, test-driven iteration)?
YES --> GPT-5.3-Codex
NO --> continue
|
v
Is this a high-stakes, single-shot task where quality matters more than cost?
(Legal analysis, complex reasoning, critical decisions)
YES --> GPT-5.5 Pro (or GPT-5.4 Pro for budget-conscious)
NO --> continue
|
v
Does the task require frontier intelligence?
(Novel problem-solving, complex multi-step reasoning, hard coding problems)
YES --> GPT-5.5
NO --> continue
|
v
Is this a general production workload?
(Summarization, Q&A, content generation, moderate coding)
YES --> GPT-5.4 (best balance of quality and cost)
NO --> continue
|
v
Is this a high-volume task requiring good intelligence at low cost?
(Chat routing, customer support, content moderation, subagent orchestration)
YES --> GPT-5.4 mini
NO --> continue
|
v
Is this a simple, high-throughput task?
(Classification, extraction, tagging, formatting, simple Q&A)
YES --> GPT-5.4 nano
NO --> Reassess requirements
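The tree translates directly into a routing function. The sketch below is a literal transcription; the boolean task flags are assumptions that, in practice, would come from a classifier or heuristics upstream:

```python
def pick_model(
    autonomous_coding: bool = False,  # multi-file edits, test-driven iteration
    high_stakes: bool = False,        # quality matters more than cost
    frontier: bool = False,           # novel, hard problems
    general: bool = False,            # summarization, Q&A, moderate coding
    high_volume: bool = False,        # routing, support, moderation
) -> str:
    """Walk the decision tree top to bottom; first match wins."""
    if autonomous_coding:
        return "gpt-5.3-codex"
    if high_stakes:
        return "gpt-5.5-pro"
    if frontier:
        return "gpt-5.5"
    if general:
        return "gpt-5.4"
    if high_volume:
        return "gpt-5.4-mini"
    return "gpt-5.4-nano"  # simple, high-throughput default
```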
Cost-Effectiveness Ratios
To put the decision tree in concrete terms, here is the relative cost per million output tokens across models (using GPT-5.4 nano as the baseline):
| Model | Relative Output Cost | Best For |
|---|---|---|
| GPT-5.4 nano | 1x ($1.25) | Classification, extraction, tagging |
| GPT-5.4 mini | 3.6x ($4.50) | Routing, moderation, subagents |
| GPT-5.4 | 12x ($15.00) | General production workloads |
| GPT-5.5 | 24x ($30.00) | Frontier reasoning and coding |
| GPT-5.5 Pro | 144x ($180.00) | Highest-stakes single-shot tasks |
The 144x cost gap between nano and Pro is enormous. A classification pipeline running on GPT-5.5 Pro instead of GPT-5.4 nano is burning 144x the budget for negligible quality improvement on that specific task. Conversely, using nano for complex legal reasoning is false economy -- the quality gap will cost more in rework than the savings.
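A back-of-envelope calculator makes the gap tangible. The rates below are the short-context standard prices from the tables earlier in this chapter; the workload numbers in the final comment are illustrative:

```python
# USD per 1M tokens (input, output), short-context standard tier.
RATES = {
    "gpt-5.4-nano": (0.20, 1.25),
    "gpt-5.4-mini": (0.75, 4.50),
    "gpt-5.4":      (2.50, 15.00),
    "gpt-5.5":      (5.00, 30.00),
    "gpt-5.5-pro":  (30.00, 180.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = RATES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 10M classification calls at ~300 input / 20 output tokens each:
# nano costs about $850; Pro costs about $126,000 for the same work.
```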
When to Use Batch vs. Priority
- Batch (50% discount, 24h turnaround): Ideal for offline processing -- bulk summarization, dataset labeling, content generation pipelines, nightly report generation. If the result doesn't need to be synchronous, use batch.
- Flex (Batch pricing, synchronous): Same cost as batch but results are returned immediately when capacity is available. Good for development and testing.
- Priority (2.5x standard): Use when you need guaranteed low latency under load -- real-time user-facing applications with strict SLAs. Available only for GPT-5.5, GPT-5.4, and GPT-5.4 mini.
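In the API, tier selection is a per-request parameter. This sketch assumes the SDK's service_tier parameter applies to the models in this chapter; availability varies by model, so treat the values as illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Request flex processing: batch pricing, synchronous results when
# off-peak capacity is available. Swap in "priority" for strict SLAs.
response = client.responses.create(
    model="gpt-5.4",
    service_tier="flex",
    input="Label the sentiment of: 'The update broke my workflow again.'",
)
```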
Prompt Caching Mechanics
Prompt caching is one of the most impactful cost-optimization tools in the API. It is automatic -- no code changes required -- and can reduce both latency (up to 80%) and input token costs (up to 90%).
How It Works
- Prefix matching: The system hashes the initial prefix of your prompt. When a subsequent request shares the same prefix, it is routed to a server that has already processed it.
- Cache hit: The cached key-value tensors are reused, skipping redundant computation. You pay the reduced "cached input" rate.
- Cache miss: The full prompt is processed normally and cached for future requests.
Activation Threshold
Caching activates automatically for prompts of 1,024 tokens or longer. Below this threshold, caching metadata is still tracked (you will see `cached_tokens` in the usage response) but no cost reduction applies.
Cache Duration by Model Family
This is where model selection intersects with caching strategy:
| Cache Type | Models | Duration | Notes |
|---|---|---|---|
| In-memory | GPT-5.4, GPT-5.4 mini, GPT-5.4 nano, and older | 5-10 minutes of inactivity, max 1 hour | Evicted automatically on inactivity |
| Extended (24-hour) | GPT-5.5, GPT-5.5 Pro | Up to 24 hours | Key-value tensors offloaded to GPU-local storage |
The extended cache on GPT-5.5 is a significant advantage for applications with large system prompts that serve intermittent traffic. A system prompt cached on GPT-5.4 evicts after 10 minutes of silence; the same prompt on GPT-5.5 persists for up to 24 hours.
Maximizing Cache Hit Rates
- Put static content first. System instructions, few-shot examples, and reference material should be at the beginning of the prompt. Variable, user-specific content goes at the end. The cache matches on prefix -- any change in the early tokens invalidates the cache for everything after it.
- Use `prompt_cache_key` for shared prefixes across different users or sessions. This parameter improves routing to the correct cache shard. Keep request rates below ~15 per minute per key to avoid cache thrashing.
- Maintain steady request volume. The in-memory cache evicts on inactivity. If your traffic is bursty with long gaps, you will see lower hit rates on GPT-5.4 family models. Consider GPT-5.5 for intermittent workloads where the 24-hour cache justifies the higher per-token cost.
- Monitor cache performance. Every API response includes `cached_tokens` in the usage object. Track your cache hit ratio over time and correlate it with prompt structure changes (see the sketch below).
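Putting the list together, a cache-friendly request might look like the following sketch. The `prompt_cache_key` parameter and the usage field layout match recent SDK versions, but verify both against the API reference:

```python
from openai import OpenAI

client = OpenAI()

# Static instructions go first so the cached prefix stays stable;
# the user's message is the only part that changes per request.
STATIC_PREFIX = open("system_prompt.txt").read()  # >= 1,024 tokens to activate caching

response = client.responses.create(
    model="gpt-5.4",
    prompt_cache_key="support-bot-v3",  # shared routing key across sessions
    input=[
        {"role": "developer", "content": STATIC_PREFIX},    # stable prefix
        {"role": "user", "content": "Where is my order?"},  # variable suffix
    ],
)

usage = response.usage
cached = usage.input_tokens_details.cached_tokens
print(f"cache hit ratio: {cached / usage.input_tokens:.0%}")
```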
Retirement & Migration Notes
The GPT-4.x era is winding down. Here is the current deprecation timeline relevant to API users:
Deprecated (API shutdown: October 23, 2026)
| Deprecated Model | Recommended Replacement |
|---|---|
| `gpt-4-0613`, `gpt-4` | `gpt-4.1` |
| `gpt-4-turbo`, `gpt-4-turbo-2024-04-09` | `gpt-4.1` |
| `gpt-4-1106-preview` | `gpt-4.1` |
| `gpt-3.5-turbo-0125`, `gpt-3.5-turbo` | `gpt-4.1-mini` |
| `o1`, `o1-2024-12-17` | `o3` |
| `o1-pro`, `o1-pro-2025-03-19` | `gpt-5.4-pro` |
| `o3-mini`, `o3-mini-2025-01-31` | `o3` |
| `o4-mini`, `o4-mini-2025-04-16` | `gpt-5-mini` (currently resolves to `gpt-5.4-mini`) |
Already Shut Down
| Model | Shutdown Date | Notes |
|---|---|---|
| `gpt-4.5-preview` | July 14, 2025 | Replaced by `gpt-4.1` |
Shutting Down Imminently
| Model | Shutdown Date | Notes |
|---|---|---|
| Realtime API Beta | May 7, 2026 | Migrate to GA Realtime API |
| `dall-e-2`, `dall-e-3` | May 12, 2026 | Replaced by `gpt-image-1` / `gpt-image-2` |
Other Deprecations in Progress
- Assistants API: Shutdown August 26, 2026. Migrate to the Responses API and Conversations API.
- `gpt-4o-audio-preview`: Shutdown July 23, 2026. Migrate to `gpt-audio`.
- `gpt-4o-mini-realtime-preview`: Shutdown July 23, 2026. Migrate to `gpt-realtime-mini` (itself deprecated; ultimate target is `gpt-realtime-1.5`).
- `gpt-realtime-mini`: Deprecated. Migrate to `gpt-realtime-1.5`.
- GPT-4o mini TTS: Deprecated. Check documentation for current TTS replacement.
ChatGPT vs. API timelines differ. Some models (GPT-4o, GPT-4.1, GPT-4.1 mini, o4-mini) were removed from the ChatGPT consumer product in February 2026, but their API endpoints remain functional until the October 2026 shutdown. Do not confuse ChatGPT product changes with API availability.
Note: GPT-4o, GPT-4.1, and GPT-4.1 mini are not separately listed on the current API deprecations page. Consult OpenAI's deprecations page for their latest API status.
Migration Best Practices
- Pin to specific snapshots in production (e.g., `gpt-5.4-2025-08-01` rather than `gpt-5.4`). Floating aliases can change behavior without notice.
- Build evaluations first. Before migrating from one model to another, establish a benchmark suite that captures your quality requirements. Run both models against it and compare (a minimal harness is sketched below).
- Test prompts across model versions. OpenAI's documentation notes that "different models might need to be prompted differently to produce the best results." A prompt optimized for GPT-4o may underperform on GPT-5.4 -- and vice versa.
- Budget for prompt iteration. Model migration is not just a version-number change. Plan for 1-2 weeks of prompt refinement when moving between generations.
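A minimal migration harness can be as simple as the sketch below. The test case and pass criterion are hypothetical placeholders; in practice, use your own benchmark suite or an eval framework:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical benchmark cases; replace with your real suite.
CASES = [
    {"input": "What is our refund window for digital goods?", "must_contain": "14 days"},
]

def run_suite(model: str) -> float:
    """Return the fraction of cases whose output contains the expected phrase."""
    passed = 0
    for case in CASES:
        out = client.responses.create(model=model, input=case["input"]).output_text
        passed += case["must_contain"].lower() in out.lower()
    return passed / len(CASES)

# Compare the incumbent against a pinned snapshot of the candidate.
for model in ("gpt-4o", "gpt-5.4-2025-08-01"):
    print(model, run_suite(model))
```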
Fine-Tuning Availability
Fine-tuning remains available for select models:
- o4-mini (2025-04-16): Training at $100.00/hour; inference at $4.00 input / $16.00 output per MTok (standard).
Fine-tuning options for GPT-5.x models are not yet publicly documented as of May 2026. Check the fine-tuning guide for updates.
Chapter Summary
The OpenAI model landscape in May 2026 is organized around three axes:
- Intelligence tier: nano < mini < 5.4 < 5.5 < Pro. Match the tier to task complexity.
- Cost tier: Priority > Standard > Batch = Flex. Match the tier to latency requirements.
- Context tier: Short context (cheaper) vs. long context (~2x input cost). Minimize context where possible; cache aggressively where you cannot.
The rest of this book focuses on how to write prompts that extract maximum value from whichever model you choose -- but no amount of prompt engineering compensates for choosing the wrong model. Start here. Get the model selection right, and the prompting techniques in subsequent chapters will compound on a solid foundation.
Chapter 2: Universal Prompting Techniques
The Outcome-First Paradigm
The most important shift in modern prompting is this: describe the destination, not the route. Earlier models needed step-by-step hand-holding. Current models perform better when you state what success looks like and let them choose the path.
Instead of:
First search for X. Then read the top 3 results. Then compare them.
Then write a summary. Use bullet points. Keep it under 500 words.
Write:
Resolve the customer's issue end to end. Success means the user
has a working solution and knows why the problem occurred.
Process-heavy prompts add noise. Outcome-first prompts let the model allocate its reasoning where it matters.
The Six Core Strategies
1. Personality and Behavior
Separate two concerns:
- Persistent personality -- tone, warmth, directness, collaboration style. This stays constant across the session.
- Per-response controls -- format, length, register, channel. These change per task.
Example personality block:
Assume the user is competent and acting in good faith.
Respond with patience, respect, and practical helpfulness.
Be candid but constructive, concise but not curt.
2. Preambles for Responsiveness
For longer or tool-heavy tasks, prompt the model to start with a short preamble -- a brief visible update that acknowledges the request and states the first step. This prevents the "silent thinking" problem where users see no output for extended periods.
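An instruction like the following is usually enough (illustrative wording, not quoted from OpenAI's docs):

```
Before starting a long task or your first tool call, post a one- to
two-sentence preamble: restate the goal in your own words and name the
first step you will take. Do not repeat the preamble for later steps.
```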
3. Outcome-First Prompts
Define success criteria and constraints rather than prescribing steps. Let the model decide how to get there.
4. Grounding and Citations
Define what claims need evidence, what counts as sufficient evidence, and how to behave when evidence is missing. Lock the citation format explicitly:
Only cite sources retrieved in the current workflow.
Never fabricate citations, URLs, or reference IDs.
Attach citations to specific claims, not only at the end.
5. Validation and Verification
Give the model access to tools that let it check its own outputs. Include verification commands and rendered inspection steps in your prompt.
6. Creative Drafting Guardrails
When the model is generating creative content that includes factual claims, distinguish between:
- Source-backed facts -- product names, metrics, customer details. Must come from retrieved sources.
- Creative wording -- framing, metaphors, narrative structure. Model has latitude here.
The Structured Prompt Template
This template is the single most reusable artifact from OpenAI's guidance:
Role: [1-2 sentences defining function, context, job]