Perplexity Search as Code: letting models run the search
Why let a search engine guess what context your model needs? Perplexity's new Search as Code hands the reins to the agent.
TL;DR: Perplexity's Search as Code turns search from a monolithic API into programmable primitives that the model orchestrates itself. On Perplexity's own CVE advisory task, that meant 85% fewer tokens and 100% accuracy where every other system tested scored below 25%.
I've been using Perplexity's Search API since it launched. It is fast and the results are high quality. For single-shot lookups... it just works 😎 But I've also hit the wall when my agent needs to do real research: fan out across query variants (for example, one query per row of a data set), filter results, verify sources. In those cases, the standard search API behaves like a black box: it returns whatever its fixed pipeline thinks is relevant. That's fine for a human scanning links. It is less fine for a developer whose agent needs surgical precision.
The monolithic search trap
Search API designers built the current generation of endpoints for humans: query in, ranked results out. The developer gets no say in how retrieval, ranking, or filtering happens, beyond a few tuning parameters. The Perplexity research team describes three failure modes that show up again and again in agentic workflows:
- Coarse context. A fixed pipeline returns the same kind of result set regardless of what the agent actually needs. If the task requires a single surgical fact, the developer receives a bloated context window full of irrelevant noise.
- Wasted domain knowledge. The agent often knows exactly how to search (for example, prefer official vendor pages over aggregators, or blend lexical and semantic signals in a specific ratio). A rigid API gives the developer no way to act on that knowledge.
- Inefficient control flow. Complex research needs parallel fan-outs, deduplication, and adaptive refinement. Forcing that through serial function calls pollutes the context window with intermediate states the model doesn’t need.
In short: the search engine owns the pipeline, and the developer must adapt to it. That worked when AI use cases were simple. It falls apart when agents complete tasks end-to-end over hundreds of retrieval operations.
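To make the "wasted domain knowledge" point concrete, here is a minimal sketch of the kind of ranking logic an agent might want to apply itself: blend lexical and semantic scores in a chosen ratio and boost official vendor pages over aggregators. Everything in it is an assumption for illustration (the `Hit` fields, the `alpha` and `boost` values, the example domains); it is not how any real search API ranks results.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Hit:
    url: str
    lexical: float   # hypothetical normalized keyword score in [0, 1]
    semantic: float  # hypothetical embedding similarity in [0, 1]


def rerank(hits, official_domains, alpha=0.7, boost=0.2):
    """Blend lexical and semantic scores, then boost official vendor pages."""
    def score(hit):
        blended = alpha * hit.lexical + (1 - alpha) * hit.semantic
        host = urlparse(hit.url).hostname or ""
        is_official = any(
            host == domain or host.endswith("." + domain)
            for domain in official_domains
        )
        return blended + (boost if is_official else 0.0)

    return sorted(hits, key=score, reverse=True)


hits = [
    Hit("https://aggregator.example/cve-roundup", lexical=0.9, semantic=0.6),
    Hit("https://security.vendor.example/advisories/123", lexical=0.7, semantic=0.8),
]
ranked = rerank(hits, official_domains={"vendor.example"})
unboosted = rerank(hits, official_domains=set())
print([hit.url for hit in ranked])
```

With the boost, the vendor advisory outranks the aggregator even though its raw blended score is lower. A fixed endpoint gives the agent no place to express that preference.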

What Search as Code changes
The Perplexity team designed Search as Code (SaC) to flip that relationship. Instead of calling a monolithic endpoint, the model generates Python code that orchestrates search primitives directly. The architecture has three layers that give the model both control and safety:
- Models (LLMs) serve as the control plane. They reason about the task, decompose it, and generate code to implement retrieval pipelines.
- Compute sandboxes provide secure, isolated environments where that code executes deterministically.
- The Agentic Search SDK exposes Perplexity's search stack as composable primitives, from low‑level retrieval operations (like site‑scoped queries) to higher‑level semantic parsing.
In their words: "Models should not merely call a search engine. They should be able to orchestrate the individual pieces of the search stack as the specific task demands."
For simple tasks, the generated code might need only a handful of SDK calls. For complex tasks, it can fan out over thousands of operations with custom filtering, ranking, and verification logic, all inside a single inference turn.
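To give a feel for what such generated code could look like, here is a small sketch of a parallel fan-out with deduplication and filtering. I don't know the real Agentic Search SDK names, so `search` is a hypothetical stand-in backed by a tiny in-memory index, and `fan_out` is my own helper, not part of any Perplexity package.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real search primitive: a tiny in-memory "index".
FAKE_INDEX = {
    "acme router firmware advisory": ["https://acme.example/a1", "https://blog.example/x"],
    "acme router security bulletin": ["https://acme.example/a1", "https://acme.example/a2"],
    "acme router patch notes": ["https://acme.example/a3"],
}


def search(query: str) -> list[str]:
    """Hypothetical primitive: return result URLs for one query."""
    return FAKE_INDEX.get(query, [])


def fan_out(queries: list[str], keep) -> list[str]:
    """Run all queries in parallel, then deduplicate and filter the URLs."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        batches = list(pool.map(search, queries))  # map() preserves query order
    seen, merged = set(), []
    for batch in batches:
        for url in batch:
            if url not in seen and keep(url):
                seen.add(url)
                merged.append(url)
    return merged


urls = fan_out(list(FAKE_INDEX), keep=lambda u: u.startswith("https://acme.example/"))
print(urls)
```

The point of the pattern is that only the final, deduplicated list returns to the model. The intermediate batches stay inside the sandbox instead of filling the context window.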
The numbers
The Perplexity team tested SaC on a CVE advisory task: identify 200+ high-severity CVEs from 2023-2025, citing vendor advisories with product and fix version. This task requires not just finding CVE IDs, but tying them to vendor advisories and fix versions. SaC scored 100% accuracy. Every non-Perplexity system tested scored below 25%.
Even more striking: token usage dropped 85.1%, from 288.7K to 42.9K. The generated code implements a three-stage pipeline:
- Fan out over official vendor advisory formats using site-scoped exact-phrase queries.
- Refine and summarize coverage, ask an LLM for targeted queries where coverage is sparse, and validate each proposed query before executing it.
- Verify extracted structured records and validate that each CVE is explicitly bound to a specific fix version in vendor-authored text.
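The first and third stages above can be sketched in a few lines. This is my own illustration, not the code Perplexity's model generated: the `site:` operator syntax, the regular expressions, the record fields, and the placeholder CVE ID are all assumptions, and the LLM-driven refinement stage is left out entirely.

```python
import re


def build_queries(cve_ids, vendor_sites):
    """Stage 1: one site-scoped, exact-phrase query per (site, CVE) pair."""
    return [f'site:{site} "{cve}"' for site in vendor_sites for cve in cve_ids]


CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")
FIX_RE = re.compile(r"fixed in (?:version )?v?(\d+(?:\.\d+)+)", re.IGNORECASE)


def verify(record, advisory_text):
    """Stage 3: accept a record only if a single sentence of the advisory
    names both the CVE ID and the claimed fix version."""
    for sentence in re.split(r"(?<=[.!?])\s+", advisory_text):
        if record["cve"] in CVE_RE.findall(sentence):
            match = FIX_RE.search(sentence)
            if match and match.group(1) == record["fix_version"]:
                return True
    return False


# Placeholder advisory text with deliberately fake CVE IDs.
text = (
    "CVE-2099-0001 affects Acme Router and is fixed in version 2.4.1. "
    "CVE-2099-0002 is still under investigation."
)
queries = build_queries(["CVE-2099-0001"], ["acme.example"])
bound = verify({"cve": "CVE-2099-0001", "fix_version": "2.4.1"}, text)
unbound = verify({"cve": "CVE-2099-0002", "fix_version": "2.4.1"}, text)
print(queries, bound, unbound)
```

The sentence-level check is deliberately strict: a CVE that merely appears on the same page as a version number does not count as bound to it.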
Across five benchmarks (DSQA, BrowseComp, HLE, WideSearch, and the new WANDR benchmark), SaC outperformed all other systems on four and essentially tied on the fifth (HLE, where it scored 0.612 against OpenAI's 0.614). On WANDR, a long‑horizon web research benchmark, SaC scored roughly 2.5x the next-best system.
| Benchmark | SaC | OpenAI | Anthropic | Exa |
|---|---|---|---|---|
| DSQA | 0.871 | 0.733 | 0.815 | 0.530 |
| BrowseComp | 0.805 | 0.720 | 0.598 | 0.380 |
| HLE | 0.612 | 0.614 | 0.566 | 0.387 |
| WideSearch | 0.651 | 0.522 | 0.590 | 0.471 |
| WANDR | 0.386 | 0.130 | 0.152 | 0.057 |
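As a quick sanity check on the figures in this section, the snippet below recomputes the margin over the next-best system for each benchmark and the token reduction on the CVE task. The numbers are copied from the table and text above; only the variable names are mine.

```python
scores = {
    "DSQA":       {"SaC": 0.871, "OpenAI": 0.733, "Anthropic": 0.815, "Exa": 0.530},
    "BrowseComp": {"SaC": 0.805, "OpenAI": 0.720, "Anthropic": 0.598, "Exa": 0.380},
    "HLE":        {"SaC": 0.612, "OpenAI": 0.614, "Anthropic": 0.566, "Exa": 0.387},
    "WideSearch": {"SaC": 0.651, "OpenAI": 0.522, "Anthropic": 0.590, "Exa": 0.471},
    "WANDR":      {"SaC": 0.386, "OpenAI": 0.130, "Anthropic": 0.152, "Exa": 0.057},
}

# Ratio of the SaC score to the best score among the other systems.
ratios = {}
for name, row in scores.items():
    best_other = max(value for system, value in row.items() if system != "SaC")
    ratios[name] = row["SaC"] / best_other
    print(f"{name:<11} SaC / next-best = {ratios[name]:.2f}x")

# Token usage on the CVE task, in thousands of tokens.
tokens_before, tokens_after = 288.7, 42.9
token_reduction = (tokens_before - tokens_after) / tokens_before
print(f"token reduction: {token_reduction:.1%}")
```

The WANDR ratio comes out at about 2.5x and the token reduction at about 85.1%, matching the claims above, while the HLE ratio lands just under 1.0.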
Hands-on: the Agent API Sandbox
The Agent API Sandbox is the runtime that makes SaC possible. It executes Python in isolated containers, gives code access to Web Search, Fetch URL, and People Search tools, and supports background execution with polling. Intermediate files persist across turns using a simple filesystem interface.
Getting started with the sandbox is straightforward:

    from perplexity import Perplexity

    client = Perplexity()

    response = client.responses.create(
        model="openai/gpt-5.5",
        input="Create a CSV with the first 10 Fibonacci numbers and their squares.",
        tools=[{"type": "sandbox"}],
        instructions="Use the sandbox to compute the values before answering.",
    )

Pricing is refreshingly simple for what it does:
| Component | Price |
|---|---|
| Tokens | Standard Agent API rates (per model) |
| Sandbox session | $0.03 per session (up to 20 min active use) |
| Tools used | Per invocation (web_search, fetch_url, people_search) |
Three cents for up to twenty minutes of sandbox time is very cost‑effective for complex multi-step research.
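For budgeting, the sandbox part of the bill is simple arithmetic. The sketch below uses the $0.03 session price from the table; token and tool charges depend on the model and the tools invoked, so they are plain parameters here with no real rates filled in. The function names are my own.

```python
from decimal import Decimal

SANDBOX_SESSION_USD = Decimal("0.03")  # per-session price from the table above


def sandbox_cost(sessions: int) -> Decimal:
    """Sandbox fees only: one flat charge per session."""
    return SANDBOX_SESSION_USD * sessions


def total_cost(sessions: int, token_usd=Decimal("0"), tool_usd=Decimal("0")) -> Decimal:
    """Sandbox fees plus whatever token and tool charges you measured yourself."""
    return sandbox_cost(sessions) + token_usd + tool_usd


thousand_sessions = sandbox_cost(1000)
print(f"1,000 sandbox sessions: ${thousand_sessions}")
```

At that rate, a thousand research sessions cost $30 in sandbox fees before tokens and tool calls.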
Long-running work can use background: true and polling for completion.
How I'm thinking about using it
For my own workflows, the pattern is clear: keep fast local tools for simple queries, delegate complex research to the Agent API. A monolithic search call is still the right tool for 'what's the weather' or 'when was Python 3.14 released'. But for tasks that need parallel retrieval, custom heuristics, or structured verification, the Sandbox is the better bet.
The pattern I have in mind looks like a deep_research tool that switches intelligently:

    def deep_research(query: str, complexity: str = "auto") -> dict:
        # perplexity_search and perplexity_agent_api are placeholders for a
        # Search API wrapper and an Agent API client configured elsewhere.
        if complexity == "low":
            return perplexity_search(query)  # fast path
        response = perplexity_agent_api.responses.create(
            model="openai/gpt-5.5",
            input=query,
            tools=[{"type": "sandbox"}, {"type": "web_search"}],
            instructions="Use the sandbox to orchestrate multi-step research.",
            background=True,
        )
        # poll for completion...
        # Wrap the text so the slow path also returns a dict, as annotated.
        return {"output_text": response.output_text}

The upside: on Perplexity's CVE task, an 85% token reduction and 100% accuracy where every other system tested scored below 25%, plus no need to build, host, or secure your own sandbox environment. The model writes the retrieval script, the sandbox runs it, and I get back structured results.
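The polling step in the deep_research sketch is left open. One generic way to fill it in is a small helper that keeps calling a fetch function until the run reaches a terminal state or a timeout expires. The status names and the `fetch` callable are assumptions on my part, not the documented Agent API contract, so the example runs against a fake backend.

```python
import time


def poll_until_done(fetch, interval_s=2.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it reports a terminal status or the timeout expires."""
    terminal = {"completed", "failed", "cancelled"}  # assumed status names
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result["status"] in terminal:
            return result
        if clock() >= deadline:
            raise TimeoutError("background response did not finish in time")
        sleep(interval_s)


# Fake backend for demonstration: "in_progress" twice, then "completed".
_states = iter(["in_progress", "in_progress", "completed"])


def fake_fetch():
    return {"status": next(_states), "output_text": "done"}


final = poll_until_done(fake_fetch, interval_s=0.0)
print(final["status"], final["output_text"])
```

In a real integration, `fetch` would wrap whatever call retrieves the background response by its ID. Injecting it as a parameter keeps the loop testable without network access.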
And that's it: for complex research, I'd rather let the model orchestrate the search pipeline than accept whatever a fixed endpoint returns. The numbers make that case pretty convincing 😎
