Table of Contents
- What Is Adaptive Thinking and How Does It Differ from Extended Thinking?
- How Do You Enable Adaptive Thinking in Claude’s API?
- Which Effort Level Should You Choose?
- How Does Adaptive Thinking Handle Streaming and Tool Use?
- How Can You Control Costs with Adaptive Thinking?
- Common Questions — Claude Adaptive Thinking API
- Conclusion
Key Takeaways
- Adaptive thinking replaces the deprecated budget_tokens approach — set thinking: {type: "adaptive"} in your request; no beta header is needed.
- Four effort levels (low, medium, high, max) let you balance reasoning depth against cost; at max effort, token spend can exceed 10× the low setting.
- Sonnet 4.6 at medium effort approaches Opus 4.6 quality on complex agent tasks at a fraction of the cost — the right default for most production workloads.
- Adaptive mode automatically enables interleaved thinking between tool calls — unavailable in manual mode on Opus 4.6.
- The old thinking.type: "enabled" with budget_tokens is deprecated on Opus 4.6 and Sonnet 4.6 and will be removed in a future release.
Setting a fixed token budget for Claude’s extended thinking is guesswork. You over-provision for simple queries, under-provision for complex ones, and iterate blindly. The Claude adaptive thinking API changes this: tell the model the effort level you want, and it decides how much to reason on its own.

Adaptive thinking shipped with Claude Opus 4.6 in early 2026 and is now the recommended mode for both Opus 4.6 and Sonnet 4.6. The old approach — thinking.type: "enabled" with a fixed budget_tokens — is now deprecated on these models.
This guide covers enabling adaptive thinking, choosing the right effort level for your workload, streaming thinking blocks in real time, and keeping per-call costs predictable. Browse more developer tutorials in our How-to section.
What Is Adaptive Thinking and How Does It Differ from Extended Thinking?
The original extended thinking API required a hard token budget. You’d set budget_tokens: 10000 and Claude would think up to that ceiling on every call — even for a question as simple as asking the capital of France. This produced unpredictable costs and no way to optimize without tuning each query type manually.
Adaptive thinking removes the fixed budget entirely. Set thinking: {type: "adaptive"} and Claude evaluates the complexity of each request on its own. A trivial question gets an immediate answer. A multi-step architecture problem gets full deliberation. The model makes this call per request, not per session.
There is a second structural advantage: adaptive mode automatically enables interleaved thinking between tool calls. On Opus 4.6 in manual thinking mode, interleaved thinking is completely unavailable. For complex agentic pipelines — where the model needs to reason between each tool invocation — adaptive mode is the only viable option on Opus 4.6. According to Anthropic’s official documentation, adaptive thinking is now the recommended approach for all Opus 4.6 and Sonnet 4.6 workloads.
How Do You Enable Adaptive Thinking in Claude’s API?

The minimum change to your existing code is a single field. Replace any thinking configuration with {type: "adaptive"}. No beta header is required — this is a stable, production-ready feature on Opus 4.6 and Sonnet 4.6.
Here is a complete Python example using the official Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[
        {
            "role": "user",
            "content": "Analyze the tradeoffs between B-trees and LSM-trees for a write-heavy database.",
        }
    ],
)

for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking]: {block.thinking}")
    elif block.type == "text":
        print(f"[Response]: {block.text}")
```
The response includes both thinking blocks and text blocks. On Claude 4 models, the thinking content you receive is a summarized version of the full internal reasoning process. You are billed for the full thinking tokens generated internally — not the summary tokens visible in the response.
For a curl-based request, the structure is identical:
```shell
curl https://api.anthropic.com/v1/messages \
  --header "x-api-key: $ANTHROPIC_API_KEY" \
  --header "anthropic-version: 2023-06-01" \
  --header "content-type: application/json" \
  --data '{
    "model": "claude-opus-4-6",
    "max_tokens": 16000,
    "thinking": {"type": "adaptive"},
    "messages": [
      {"role": "user", "content": "Design a rate-limiting strategy for a public API serving 10 million daily requests."}
    ]
  }'
```
Which Effort Level Should You Choose?
Adaptive thinking pairs with the effort parameter inside output_config. This provides soft guidance on thinking depth — not a hard token cap. The default is high, where Claude almost always engages extended reasoning. Lower levels reduce thinking frequency for workloads where speed outweighs deliberation depth.
| Effort Level | Thinking Behavior | Best For |
|---|---|---|
| low | Skips thinking for most tasks | High-volume chat, latency-sensitive pipelines |
| medium | Moderate thinking; skips for simple queries | Agentic coding, code review, tool workflows |
| high (default) | Always thinks; deep reasoning on complex tasks | Architecture decisions, multi-step analysis |
| max | No constraints on thinking depth | Safety-critical decisions, research, maximum accuracy |
- low: Use for FAQ bots, simple summarization, and any workload where latency matters more than reasoning depth.
- medium: The recommended production default for code review, agentic pipelines, and general assistant tasks.
- high: For architecture decisions, security audits, and multi-step data analysis that benefit from deep reasoning.
- max: Reserve for research tasks, safety-critical evaluations, and scenarios where missing a nuance is unacceptable. At max effort, a single complex prompt can consume more than 10× the tokens of the same prompt at low effort.
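One way to apply this guidance consistently across a codebase is a simple lookup from workload type to effort level. This is a minimal sketch; the workload category names here are our own invention for illustration, not API values.

```python
# Illustrative mapping from workload category to effort level, following the
# guidance above. Category names are hypothetical, not part of the API.
EFFORT_BY_WORKLOAD = {
    "faq_bot": "low",
    "summarization": "low",
    "code_review": "medium",
    "agent_pipeline": "medium",
    "architecture": "high",
    "security_audit": "high",
    "research": "max",
    "safety_critical": "max",
}

def effort_for(workload: str, default: str = "medium") -> str:
    """Return the suggested effort level for a workload category."""
    return EFFORT_BY_WORKLOAD.get(workload, default)

print(effort_for("code_review"))      # medium
print(effort_for("safety_critical"))  # max
print(effort_for("unknown_task"))     # medium (fallback)
```

Centralizing the mapping makes it easy to rebenchmark one workload category and adjust its effort level in a single place.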
Production testing by Resolve.ai found that Sonnet 4.6 at medium effort approached Opus 4.6 quality on complex agent investigations at substantially lower cost. For most developer teams, start with Sonnet 4.6 at medium effort and benchmark against Opus 4.6 before upgrading. On benchmarks requiring long-context reasoning, Opus 4.6 scores 76% on the 8-needle 1M-token MRCR v2 evaluation — compared to 18.5% for Sonnet 4.5 — so the gap widens for genuinely hard retrieval tasks.
Add the effort parameter to any request like this:
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "medium"},
    messages=[{"role": "user", "content": "Review this pull request diff for security vulnerabilities."}],
)
```
How Does Adaptive Thinking Handle Streaming and Tool Use?
Streaming with adaptive thinking works exactly as expected. Thinking blocks arrive as thinking_delta events and text responses arrive as text_delta events. Both stream in real time, letting you display reasoning to users as it arrives:
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-6",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    messages=[{"role": "user", "content": "Explain the CAP theorem with real-world database examples."}],
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)
```
For tool use, the key rule is simple: pass thinking blocks back unchanged when constructing multi-turn messages with tool results. Each thinking block includes an encrypted signature field that the API uses to verify block authenticity. Treat this field as opaque — never modify or parse it. Signatures are compatible across Anthropic’s API, Amazon Bedrock, and Google Vertex AI, so multi-turn conversations transfer cleanly between platforms.
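The pass-through rule can be sketched as follows. The block dicts below are plain-dict stand-ins for the SDK's content blocks, and the field values are hypothetical; the point is that the thinking block, signature included, is forwarded verbatim into the next turn.

```python
# Sketch: building the next turn of a tool-use conversation while passing
# thinking blocks back unchanged. Block shapes are illustrative stand-ins.
def build_next_turn(prior_messages, assistant_blocks, tool_results):
    """Append the assistant turn (thinking blocks intact) and the tool results."""
    messages = list(prior_messages)
    # Forward every assistant block as-is, including "thinking" blocks with
    # their opaque "signature" field. Never modify or re-serialize them.
    messages.append({"role": "assistant", "content": assistant_blocks})
    messages.append({"role": "user", "content": tool_results})
    return messages

assistant_blocks = [
    {"type": "thinking", "thinking": "Need current weather first.", "signature": "opaque-sig"},
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "Paris"}},
]
tool_results = [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": "18°C, clear"},
]
messages = build_next_turn(
    [{"role": "user", "content": "Weather in Paris?"}], assistant_blocks, tool_results
)
# messages[1] now carries the thinking block untouched, signature intact.
```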
If you need to reduce time-to-first-token when streaming, set display: "omitted" in the thinking configuration. Claude still generates the full internal reasoning, but only the signature streams — the text response begins immediately after. You are still billed for full thinking tokens; omitting the display reduces latency, not cost.
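As a sketch, the option can be expressed as a payload builder. This assumes the display field sits inside the thinking configuration, as described above; adjust to the SDK's actual shape if it differs.

```python
# Sketch of a low-latency request payload with thinking display omitted.
# Assumption: "display": "omitted" lives inside the thinking config, per the
# description above. Billing for thinking tokens is unchanged.
def low_latency_payload(prompt: str) -> dict:
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 16000,
        "thinking": {"type": "adaptive", "display": "omitted"},
        "messages": [{"role": "user", "content": prompt}],
    }

payload = low_latency_payload("Summarize this incident report.")
```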
How Can You Control Costs with Adaptive Thinking?
Two parameters drive cost control: max_tokens is the hard ceiling on total output (thinking tokens plus response text combined); effort is the soft guidance lever. Neither alone is sufficient — both should be set deliberately for any production workload.
If you frequently see stop_reason: "max_tokens" in your responses, raise max_tokens rather than lowering the effort level. Hitting the ceiling mid-answer degrades response quality. Opus 4.6’s 1M-token context window became available to Max, Team, and Enterprise users on March 13, 2026, so context size is rarely the binding constraint — per-call cost is.
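A retry wrapper can automate the "raise max_tokens, not effort" rule. This is a sketch: it only assumes the client exposes a create(**kwargs) method returning an object with a stop_reason field, so any response-like object works for testing.

```python
# Sketch: retry a truncated request with a doubled max_tokens instead of
# lowering effort. Assumes client.create(**kwargs) returns an object with a
# stop_reason attribute; the real SDK's client.messages fits this shape.
def create_with_headroom(client, max_retries=2, **kwargs):
    """Re-issue a request with doubled max_tokens whenever it is truncated."""
    response = client.create(**kwargs)
    for _ in range(max_retries):
        if response.stop_reason != "max_tokens":
            break
        kwargs["max_tokens"] = kwargs.get("max_tokens", 16000) * 2
        response = client.create(**kwargs)
    return response
```

Cap max_retries so a pathological prompt cannot double its way to an unbounded per-call cost.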
You can also prompt Claude to think less often on low-complexity tasks. Add this to your system prompt:
Extended thinking adds latency. Use it only when multi-step reasoning will
meaningfully improve the answer quality. For straightforward questions, respond directly.
Test any prompt-based tuning carefully before deploying to production. Reducing thinking frequency cuts cost but can degrade quality on tasks where reasoning matters. The Anthropic documentation recommends testing with lower effort levels before resorting to prompt-based tuning. For deeper coverage of production AI infrastructure decisions, see our Dev/IT Ops section.
Common Questions — Claude Adaptive Thinking API
Q: Is adaptive thinking available on Claude Sonnet 4.5 or older models?
A: No. Adaptive thinking only works on Claude Opus 4.6, Claude Sonnet 4.6, and the Claude Mythos Preview model. Older models still require thinking.type: "enabled" with a budget_tokens value. Anthropic has not announced plans to backport adaptive thinking to prior-generation models.
Q: How does the cost of adaptive thinking compare to the old extended thinking?
A: You are billed for the full thinking tokens Claude generates internally — not the summarized tokens visible in the response. At medium effort, most workloads see lower thinking-token spend versus a high fixed budget_tokens setting, because Claude skips thinking on simpler queries. Exact savings depend on your query distribution and complexity mix.
Q: Can I use adaptive thinking on Amazon Bedrock or Google Vertex AI?
A: Yes. Adaptive thinking is supported on both Amazon Bedrock and Google Vertex AI for Claude Opus 4.6 and Sonnet 4.6. Thinking block signatures are compatible across all platforms, so multi-turn conversations transfer without issues between deployment environments.
Q: What happens if I still use budget_tokens with Opus 4.6?
A: It still works as of April 2026, but it is officially deprecated. Anthropic will remove budget_tokens support on Opus 4.6 and Sonnet 4.6 in a future model release. Migrate to thinking: {type: "adaptive"} with the effort parameter now to avoid future breaking changes.
Conclusion
Adaptive thinking simplifies one of the trickiest decisions in building with Claude: how much reasoning each task actually needs. Start with thinking: {type: "adaptive"} and effort: "medium" on Sonnet 4.6 for most production workloads. Raise to Opus 4.6 at high effort only after benchmarking confirms the quality difference justifies the cost. And migrate away from budget_tokens now — deprecation means a breaking change is coming, and the new API is cleaner.
Explore more developer guides in our How-to section.
Last Updated: April 14, 2026