CandleKeep

OpenAI Prompting Best Practices

by OpenAI

Tags: open-ai, best-practices, agents, llms
Pages: 62
Format: markdown
Listed: February 22, 2026
Updated: May 3, 2026
Subscribers: 36

About

A comprehensive guide to prompting and building with OpenAI's models and APIs, covering model selection, prompt engineering techniques, GPT-4.1 best practices, reasoning models (o3/o4-mini), Responses API, function calling, structured outputs, multimodal capabilities, advanced features, and the Agents SDK. Sourced from official OpenAI documentation.

62 Chapters · 229 Topics · 62 Pages

Preview

OpenAI Prompting Best Practices

A comprehensive guide to prompting, model selection, and API usage for the May 2026 OpenAI model lineup.

Chapter 1: Introduction & Model Overview

The Current Landscape (May 2026)

OpenAI's model lineup has consolidated into a clear hierarchy. The GPT-5.x family now dominates both the consumer product (ChatGPT) and the developer API, with earlier generations on a deprecation glide path toward shutdown. The flagship models share a common architecture that supports function calling, web search, file search, computer use, and reasoning at adjustable effort levels (none / low / medium / high / xhigh). Meanwhile, a set of specialized models handles image generation, real-time audio, autonomous coding, and video.

For developers, the most consequential shift in the past year has been the retirement of the GPT-4.x era. The GPT-4o, GPT-4.1, and o-series models that dominated 2024-2025 are now deprecated, with API shutdown dates set for late 2026. New projects should target the GPT-5.x family exclusively; migration for existing projects should be underway.

Two API surfaces coexist: the legacy Chat Completions API and the newer Responses API. OpenAI's documentation explicitly recommends the Responses API, particularly for reasoning models, which "perform better and demonstrate higher intelligence" when used with it. This book assumes the Responses API throughout unless otherwise noted.


Flagship Models at a Glance

| Model | Context Window | Max Output | Knowledge Cutoff | Key Strengths |
| --- | --- | --- | --- | --- |
| GPT-5.5 | 1M tokens | 128K tokens | Dec 1, 2025 | Top-tier intelligence for coding and professional work |
| GPT-5.5 Pro | 1M tokens | 128K tokens | Dec 1, 2025 | Highest-quality single-shot reasoning; extended thinking |
| GPT-5.4 | 1M tokens | 128K tokens | Aug 31, 2025 | Strong general-purpose model at lower cost |
| GPT-5.4 Pro | 1M tokens | 128K tokens | Aug 31, 2025 | Extended reasoning at the 5.4 tier |
| GPT-5.4 mini | 400K tokens | 128K tokens | Aug 31, 2025 | Best mini model for coding, computer use, and subagents |
| GPT-5.4 nano | -- | -- | -- | Lowest-cost model for high-volume, lightweight tasks |

Note: GPT-5.4 nano's context window and knowledge cutoff are not confirmed in current documentation. GPT-5.5 Pro's context window is inferred from the GPT-5.5 family but not independently verified. Check the models page for the latest specifications.

All flagship models support reasoning effort adjustment (none, low, medium, high, xhigh), allowing you to trade latency and cost for depth of thought on a per-request basis. This is one of the most important levers for prompt optimization and is covered in depth in Chapter 4.
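The effort knob is just a request parameter. Here is a minimal sketch of setting it per request, assuming the Python SDK's Responses API request shape (`model`, `input`, and a `reasoning.effort` field); the helper name is ours, and the kwargs are built as a plain dict so the shape is easy to inspect before passing to `client.responses.create(**kwargs)`.

```python
# Sketch: choosing reasoning effort per request. Effort values are taken
# from the table above; the request shape assumes the Responses API.

def responses_request(model: str, prompt: str, effort: str) -> dict:
    """Build kwargs for a Responses API call with an explicit effort level."""
    allowed = {"none", "low", "medium", "high", "xhigh"}
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},
    }

# Cheap, fast pass for a routine task...
quick = responses_request("gpt-5.4", "Summarize this ticket.", "low")
# ...and a deeper pass for a hard one, with the same call shape.
deep = responses_request("gpt-5.5", "Prove this invariant holds.", "high")
```

The point is that nothing else in the request changes: the same prompt can be run at `low` during development and `high` in production, which makes effort a clean A/B variable.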


Pricing

Pricing is per 1 million tokens unless otherwise noted. All flagship models offer multiple tiers:

  • Standard -- default API access with no special latency guarantees.
  • Cached -- automatic discount when the prompt prefix matches a recent request (see Prompt Caching below).
  • Batch -- asynchronous processing at 50% of standard rates; results returned within 24 hours.
  • Priority -- guaranteed low-latency access at a premium.
  • Flex -- matches Batch pricing but with synchronous access during off-peak capacity.

Flagship Models -- Short Context

| Model | Standard (In / Out) | Cached Input | Batch (In / Out) | Priority (In / Out) |
| --- | --- | --- | --- | --- |
| GPT-5.5 | $5.00 / $30.00 | $0.50 | $2.50 / $15.00 | $12.50 / $75.00 |
| GPT-5.5 Pro | $30.00 / $180.00 | -- | $15.00 / $90.00 | -- |
| GPT-5.4 | $2.50 / $15.00 | $0.25 | $1.25 / $7.50 | $5.00 / $30.00 |
| GPT-5.4 Pro | $30.00 / $180.00 | -- | $15.00 / $90.00 | -- |
| GPT-5.4 mini | $0.75 / $4.50 | $0.075 | $0.375 / $2.25 | $1.50 / $9.00 |
| GPT-5.4 nano | $0.20 / $1.25 | $0.02 | $0.10 / $0.625 | -- |

Cached input prices reflect a ~90% reduction from standard input rates. The Pro models do not list cached or priority pricing, and GPT-5.4 nano does not list priority pricing.

Flagship Models -- Long Context

When requests exceed the short-context threshold, pricing increases. The exact token boundary is not published, but the price differential is significant:

| Model | Standard (In / Out) | Cached Input | Batch (In / Out) |
| --- | --- | --- | --- |
| GPT-5.5 | $10.00 / $45.00 | $1.00 | $5.00 / $22.50 |
| GPT-5.5 Pro | $60.00 / $270.00 | -- | -- |
| GPT-5.4 | $5.00 / $22.50 | $0.50 | $2.50 / $11.25 |
| GPT-5.4 Pro | $60.00 / $270.00 | -- | $30.00 / $135.00 |

Long-context pricing roughly doubles input costs and increases output costs by ~50% compared to short-context rates. This makes prompt caching and context management critical for cost control at scale -- a theme we return to in Chapters 3 and 5.
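To make the arithmetic concrete, here is a sketch of a per-request cost estimate using the GPT-5.4 rates from the tables above. The long-context threshold is not published, so it is a parameter here; the default of 128,000 is an illustrative guess, not a documented value.

```python
# Sketch: estimating request cost from the pricing tables. Prices are in
# dollars per 1M tokens: (input, cached input, output).
PRICES = {
    ("gpt-5.4", "short"): (2.50, 0.25, 15.00),
    ("gpt-5.4", "long"):  (5.00, 0.50, 22.50),
}

def estimate_cost(model, input_tokens, cached_tokens, output_tokens,
                  long_context_threshold=128_000):
    ctx = "long" if input_tokens > long_context_threshold else "short"
    in_price, cached_price, out_price = PRICES[(model, ctx)]
    uncached = input_tokens - cached_tokens
    return (uncached * in_price
            + cached_tokens * cached_price
            + output_tokens * out_price) / 1_000_000

# A 50K-token prompt with 40K cached and 2K output on GPT-5.4 (short):
# 10,000 * 2.50 + 40,000 * 0.25 + 2,000 * 15.00 = 65,000 micro-dollars
# -> $0.065. Without the cache hit the same request costs $0.155.
```

Running the same numbers through the long-context row shows why the chapter keeps returning to context management: crossing the threshold roughly doubles the input term of this sum.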

Data Residency Surcharge

Models released after March 5, 2026 incur a 10% surcharge when using regional data-residency endpoints. Factor this into cost projections if your deployment requires data to stay in a specific geography.


Specialized Models

GPT-5.3-Codex (Autonomous Coding Agent)

Released February 24, 2026, this model powers OpenAI's Codex product -- an autonomous coding agent that operates in sandboxed cloud environments. It supports the full range of reasoning effort settings (none through xhigh).

| Tier | Input | Cached Input | Output |
| --- | --- | --- | --- |
| Standard | $1.75 | $0.175 | $14.00 |
| Priority | $3.50 | $0.35 | $28.00 |

Batch pricing for GPT-5.3-Codex is not separately listed in current documentation.

The Codex model is optimized for long-running, multi-step coding tasks where the agent reads files, writes code, runs tests, and iterates. It is not a general-purpose chat model -- it excels when given a well-defined task specification and access to a codebase. See Chapter 8 for Codex-specific prompting patterns.

GPT Image 2

State-of-the-art image generation with a reasoning/thinking mode. Supports up to 4K resolution with approximately 99% text rendering accuracy -- a dramatic improvement over earlier DALL-E models.

| Tier | Image Input | Text Input | Cached (Image / Text) | Image Output |
| --- | --- | --- | --- | --- |
| Standard | $8.00 | $5.00 | $2.00 / $1.25 | $30.00 |
| Batch | $4.00 | $2.50 | $1.00 / $0.625 | $15.00 |

GPT Image 2 uses token-based pricing rather than per-image pricing. Actual cost depends on image resolution and complexity. See Chapter 10 for image generation prompting strategies.

Real-Time Audio

| Model | Audio (In / Out) | Text (In / Out) | Cached Audio Input |
| --- | --- | --- | --- |
| gpt-realtime-1.5 | $32.00 / $64.00 | $4.00 / $16.00 | $0.40 |

gpt-realtime-mini is deprecated. Migrate to gpt-realtime-1.5.

Video Generation (Sora 2)

| Model | 720p | 1024p | 1080p |
| --- | --- | --- | --- |
| Sora 2 (Standard) | $0.10/sec | -- | -- |
| Sora 2 (Batch) | $0.05/sec | -- | -- |
| Sora 2 Pro (Standard) | $0.30/sec | $0.50/sec | $0.70/sec |
| Sora 2 Pro (Batch) | $0.15/sec | $0.25/sec | $0.35/sec |

Transcription

  • gpt-4o-transcribe and gpt-4o-mini-transcribe remain available for speech-to-text at $0.003-$0.006 per minute.

Embeddings

  • text-embedding-3-small and text-embedding-3-large remain available. See the pricing page for current rates.

Web Search Tool

Web search is billed per call, not per token:

  • Reasoning models (GPT-5.x): $10.00 per 1,000 calls. Search-retrieved content tokens are billed at the model's standard rates.
  • Non-reasoning models: $25.00 per 1,000 calls. Search-retrieved content tokens are included (not billed separately).
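The two billing modes above are easy to get wrong in cost projections, so here is the arithmetic as a small sketch; the helper is illustrative, with rates copied from the bullets.

```python
# Sketch: per-call web search billing. Reasoning models pay $10 per 1,000
# calls plus retrieved-content tokens at the model's standard input rate;
# non-reasoning models pay $25 per 1,000 calls with content tokens included.

def web_search_cost(calls, is_reasoning, retrieved_tokens=0,
                    input_price_per_mtok=0.0):
    per_call = 10.00 / 1000 if is_reasoning else 25.00 / 1000
    cost = calls * per_call
    if is_reasoning:
        cost += retrieved_tokens * input_price_per_mtok / 1_000_000
    return cost

# 1,000 searches on GPT-5.4 ($2.50/MTok input) that pull in 2M content
# tokens: $10.00 for the calls + $5.00 for the tokens = $15.00 total.
```

Note that for reasoning models the token term can dominate the per-call fee once searches return large pages, which is worth modeling before enabling the tool at scale.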

Model Selection Decision Tree

Choosing the right model is the single highest-leverage prompt engineering decision. The wrong model wastes money on easy tasks or produces weak results on hard ones. Use this framework:

START
  |
  v
Is this an autonomous coding task (multi-file edits, test-driven iteration)?
  YES --> GPT-5.3-Codex
  NO  --> continue
  |
  v
Is this a high-stakes, single-shot task where quality matters more than cost?
(Legal analysis, complex reasoning, critical decisions)
  YES --> GPT-5.5 Pro (or GPT-5.4 Pro for budget-conscious)
  NO  --> continue
  |
  v
Does the task require frontier intelligence?
(Novel problem-solving, complex multi-step reasoning, hard coding problems)
  YES --> GPT-5.5
  NO  --> continue
  |
  v
Is this a general production workload?
(Summarization, Q&A, content generation, moderate coding)
  YES --> GPT-5.4 (best balance of quality and cost)
  NO  --> continue
  |
  v
Is this a high-volume task requiring good intelligence at low cost?
(Chat routing, customer support, content moderation, subagent orchestration)
  YES --> GPT-5.4 mini
  NO  --> continue
  |
  v
Is this a simple, high-throughput task?
(Classification, extraction, tagging, formatting, simple Q&A)
  YES --> GPT-5.4 nano
  NO  --> Reassess requirements

Cost-Effectiveness Ratios

To put the decision tree in concrete terms, here is the relative cost per million output tokens across models (using GPT-5.4 nano as the baseline):

| Model | Relative Output Cost | Best For |
| --- | --- | --- |
| GPT-5.4 nano | 1x ($1.25) | Classification, extraction, tagging |
| GPT-5.4 mini | 3.6x ($4.50) | Routing, moderation, subagents |
| GPT-5.4 | 12x ($15.00) | General production workloads |
| GPT-5.5 | 24x ($30.00) | Frontier reasoning and coding |
| GPT-5.5 Pro | 144x ($180.00) | Highest-stakes single-shot tasks |

The 144x cost gap between nano and Pro is enormous. A classification pipeline running on GPT-5.5 Pro instead of GPT-5.4 nano is burning 144x the budget for negligible quality improvement on that specific task. Conversely, using nano for complex legal reasoning is false economy -- the quality gap will cost more in rework than the savings.

When to Use Batch vs. Priority
  • Batch (50% discount, 24h turnaround): Ideal for offline processing -- bulk summarization, dataset labeling, content generation pipelines, nightly report generation. If the result doesn't need to be synchronous, use batch.
  • Flex (Batch pricing, synchronous): Same cost as batch but results are returned immediately when capacity is available. Good for development and testing.
  • Priority (2.5x standard): Use when you need guaranteed low latency under load -- real-time user-facing applications with strict SLAs. Available only for GPT-5.5, GPT-5.4, and GPT-5.4 mini.
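A batch job starts from a JSONL file with one request per line. The sketch below assumes the documented batch line shape (`custom_id`, `method`, `url`, `body`) and that the `/v1/responses` endpoint is batch-eligible; verify both against the current Batch API guide before relying on them.

```python
import json

# Sketch: preparing a Batch API input file. Each JSONL line is one
# independent request, keyed by custom_id so results can be matched back.

def batch_lines(model, prompts):
    for i, prompt in enumerate(prompts):
        yield json.dumps({
            "custom_id": f"task-{i}",
            "method": "POST",
            "url": "/v1/responses",
            "body": {"model": model, "input": prompt},
        })

with open("batch_input.jsonl", "w") as f:
    for line in batch_lines("gpt-5.4", ["Summarize doc A", "Summarize doc B"]):
        f.write(line + "\n")
# Next: upload the file with purpose="batch", create the batch job with a
# 24h completion window, then poll until its status is "completed".
```

Because each line is self-contained, the same generator works for nightly report runs and dataset labeling alike; only the prompt list changes.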

Prompt Caching Mechanics

Prompt caching is one of the most impactful cost-optimization tools in the API. It is automatic -- no code changes required -- and can reduce both latency (up to 80%) and input token costs (up to 90%).

How It Works
  1. Prefix matching: The system hashes the initial prefix of your prompt. When a subsequent request shares the same prefix, it is routed to a server that has already processed it.
  2. Cache hit: The cached key-value tensors are reused, skipping redundant computation. You pay the reduced "cached input" rate.
  3. Cache miss: The full prompt is processed normally and cached for future requests.

Activation Threshold

Caching activates automatically for prompts of 1,024 tokens or longer. Below this threshold, caching metadata is still tracked (you will see cached_tokens in the usage response) but no cost reduction applies.

Cache Duration by Model Family

This is where model selection intersects with caching strategy:

| Cache Type | Models | Duration | Eviction |
| --- | --- | --- | --- |
| In-memory | GPT-5.4, GPT-5.4 mini, GPT-5.4 nano, and older | 5-10 minutes of inactivity, max 1 hour | Automatic on inactivity |
| Extended (24-hour) | GPT-5.5, GPT-5.5 Pro | Up to 24 hours | Key-value tensors offloaded to GPU-local storage |

The extended cache on GPT-5.5 is a significant advantage for applications with large system prompts that serve intermittent traffic. A system prompt cached on GPT-5.4 evicts after 10 minutes of silence; the same prompt on GPT-5.5 persists for up to 24 hours.

Maximizing Cache Hit Rates
  1. Put static content first. System instructions, few-shot examples, and reference material should be at the beginning of the prompt. Variable, user-specific content goes at the end. The cache matches on prefix -- any change in the early tokens invalidates the cache for everything after it.

  2. Use prompt_cache_key for shared prefixes across different users or sessions. This parameter improves routing to the correct cache shard. Keep request rates below ~15 per minute per key to avoid cache thrashing.

  3. Maintain steady request volume. The in-memory cache evicts on inactivity. If your traffic is bursty with long gaps, you will see lower hit rates on GPT-5.4 family models. Consider GPT-5.5 for intermittent workloads where the 24-hour cache justifies the higher per-token cost.

  4. Monitor cache performance. Every API response includes cached_tokens in the usage object. Track your cache hit ratio over time and correlate it with prompt structure changes.
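Point 4 is straightforward to operationalize. The sketch below aggregates usage objects into a hit ratio; the field names assume the Responses API usage shape (`input_tokens`, with `cached_tokens` nested under input token details), shown here as plain dicts. Check your SDK version's actual shape before wiring this into monitoring.

```python
# Sketch: computing a cache hit ratio across a window of API responses.
# Each "usage" dict mirrors the usage object returned with a response.

def cache_hit_ratio(usages) -> float:
    total = sum(u["input_tokens"] for u in usages)
    cached = sum(u.get("input_tokens_details", {}).get("cached_tokens", 0)
                 for u in usages)
    return cached / total if total else 0.0

usages = [
    # First request: cold cache, nothing reused.
    {"input_tokens": 2000, "input_tokens_details": {"cached_tokens": 0}},
    # Later request: the static prefix was reused.
    {"input_tokens": 2000, "input_tokens_details": {"cached_tokens": 1536}},
]
cache_hit_ratio(usages)  # -> 0.384
```

Tracking this number over time, segmented by `prompt_cache_key`, is the cheapest way to see whether a prompt restructuring helped or hurt.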


Retirement & Migration Notes

The GPT-4.x era is winding down. Here is the current deprecation timeline relevant to API users:

Deprecated (API shutdown: October 23, 2026)

| Deprecated Model | Recommended Replacement |
| --- | --- |
| gpt-4-0613, gpt-4 | gpt-4.1 |
| gpt-4-turbo, gpt-4-turbo-2024-04-09 | gpt-4.1 |
| gpt-4-1106-preview | gpt-4.1 |
| gpt-3.5-turbo-0125, gpt-3.5-turbo | gpt-4.1-mini |
| o1, o1-2024-12-17 | o3 |
| o1-pro, o1-pro-2025-03-19 | gpt-5.4-pro |
| o3-mini, o3-mini-2025-01-31 | o3 |
| o4-mini, o4-mini-2025-04-16 | gpt-5-mini (currently resolves to gpt-5.4-mini) |

Already Shut Down

| Model | Shutdown Date | Notes |
| --- | --- | --- |
| gpt-4.5-preview | July 14, 2025 | Replaced by gpt-4.1 |

Shutting Down Imminently

| Model | Shutdown Date | Notes |
| --- | --- | --- |
| Realtime API Beta | May 7, 2026 | Migrate to GA Realtime API |
| dall-e-2, dall-e-3 | May 12, 2026 | Replaced by gpt-image-1 / gpt-image-2 |

Other Deprecations in Progress
  • Assistants API: Shutdown August 26, 2026. Migrate to the Responses API and Conversations API.
  • gpt-4o-audio-preview: Shutdown July 23, 2026. Migrate to gpt-audio.
  • gpt-4o-mini-realtime-preview: Shutdown July 23, 2026. Migrate to gpt-realtime-mini (itself deprecated; ultimate target is gpt-realtime-1.5).
  • gpt-realtime-mini: Deprecated. Migrate to gpt-realtime-1.5.
  • GPT-4o mini TTS: Deprecated. Check documentation for current TTS replacement.

ChatGPT vs. API timelines differ. Some models (GPT-4o, GPT-4.1, GPT-4.1 mini, o4-mini) were removed from the ChatGPT consumer product in February 2026, but their API endpoints remain functional until the October 2026 shutdown. Do not confuse ChatGPT product changes with API availability.

Note: GPT-4o, GPT-4.1, and GPT-4.1 mini are not separately listed on the current API deprecations page. Consult OpenAI's deprecations page for their latest API status.

Migration Best Practices
  1. Pin to specific snapshots in production (e.g., gpt-5.4-2025-08-01 rather than gpt-5.4). Floating aliases can change behavior without notice.
  2. Build evaluations first. Before migrating from one model to another, establish a benchmark suite that captures your quality requirements. Run both models against it and compare.
  3. Test prompts across model versions. OpenAI's documentation notes that "different models might need to be prompted differently to produce the best results." A prompt optimized for GPT-4o may underperform on GPT-5.4 -- and vice versa.
  4. Budget for prompt iteration. Model migration is not just a version-number change. Plan for 1-2 weeks of prompt refinement when moving between generations.

Fine-Tuning Availability

Fine-tuning remains available for select models:

  • o4-mini (2025-04-16): Training at $100.00/hour; inference at $4.00 input / $16.00 output per MTok (standard).

Fine-tuning options for GPT-5.x models are not yet publicly documented as of May 2026. Check the fine-tuning guide for updates.


Chapter Summary

The OpenAI model landscape in May 2026 is organized around three axes:

  1. Intelligence tier: nano < mini < 5.4 < 5.5 < Pro. Match the tier to task complexity.
  2. Cost tier: Priority > Standard > Batch = Flex. Match the tier to latency requirements.
  3. Context tier: Short context (cheaper) vs. long context (~2x input cost). Minimize context where possible; cache aggressively where you cannot.

The rest of this book focuses on how to write prompts that extract maximum value from whichever model you choose -- but no amount of prompt engineering compensates for choosing the wrong model. Start here. Get the model selection right, and the prompting techniques in subsequent chapters will compound on a solid foundation.

Chapter 2: Universal Prompting Techniques

The Outcome-First Paradigm

The most important shift in modern prompting is this: describe the destination, not the route. Earlier models needed step-by-step hand-holding. Current models perform better when you state what success looks like and let them choose the path.

Instead of:

First search for X. Then read the top 3 results. Then compare them.
Then write a summary. Use bullet points. Keep it under 500 words.

Write:

Resolve the customer's issue end to end. Success means the user
has a working solution and knows why the problem occurred.

Process-heavy prompts add noise. Outcome-first prompts let the model allocate its reasoning where it matters.

The Six Core Strategies

1. Personality and Behavior

Separate two concerns:

  • Persistent personality -- tone, warmth, directness, collaboration style. This stays constant across the session.
  • Per-response controls -- format, length, register, channel. These change per task.

Example personality block:

Assume the user is competent and acting in good faith.
Respond with patience, respect, and practical helpfulness.
Be candid but constructive, concise but not curt.
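One way to honor the separation is to keep the personality block as an immutable constant and append per-response controls at request time. This is a sketch of ours, not an official pattern; the helper and field names are illustrative. A side benefit: the stable block stays at the front of the prompt, which is exactly what prompt caching rewards.

```python
# Sketch: persistent personality (constant across the session) composed
# with per-response controls (format, length, register) at request time.

PERSONALITY = """\
Assume the user is competent and acting in good faith.
Respond with patience, respect, and practical helpfulness.
Be candid but constructive, concise but not curt."""

def system_prompt(personality: str, **controls) -> str:
    """Append per-response controls after the stable personality block,
    so the cacheable prefix never changes between requests."""
    control_lines = [f"{key.capitalize()}: {value}"
                     for key, value in controls.items()]
    return personality + "\n\n" + "\n".join(control_lines)

prompt = system_prompt(PERSONALITY,
                       format="bulleted list",
                       length="under 200 words")
```

Swapping the controls per task leaves the personality, and its cached prefix, untouched.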

2. Preambles for Responsiveness

For longer or tool-heavy tasks, prompt the model to start with a short preamble -- a brief visible update that acknowledges the request and states the first step. This prevents the "silent thinking" problem where users see no output for extended periods.

3. Outcome-First Prompts

Define success criteria and constraints rather than prescribing steps. Let the model decide how to get there.

4. Grounding and Citations

Define what claims need evidence, what counts as sufficient evidence, and how to behave when evidence is missing. Lock the citation format explicitly:

Only cite sources retrieved in the current workflow.
Never fabricate citations, URLs, or reference IDs.
Attach citations to specific claims, not only at the end.

5. Validation and Verification

Give the model access to tools that let it check its own outputs. Include verification commands and rendered inspection steps in your prompt.

6. Creative Drafting Guardrails

When the model is generating creative content that includes factual claims, distinguish between:

  • Source-backed facts -- product names, metrics, customer details. Must come from retrieved sources.
  • Creative wording -- framing, metaphors, narrative structure. Model has latitude here.

The Structured Prompt Template

This template is the single most reusable artifact from OpenAI's guidance:

Role: [1-2 sentences defining function, context, job]
