The Cancel Scope Bug: Debugging Intermittent 500s in a pydantic-ai MCP Agent

One of our agents started throwing intermittent 500s in production. Not every request, just some of them, and only on the agent that leans hardest on tool calls. The error message was the kind that makes you tilt your head:

RuntimeError: Attempted to exit cancel scope in a different task than it was entered in

This is the story of how I chased it from that one line all the way to a merged pull request in someone else's library, reproduced it in 40 lines of standalone Python, and pinned the exact version that fixes it. If you run pydantic-ai agents with MCP tools, you probably want to know about this one.

The symptom

The failing agent does a lot of tool calls. A single run fans out to four or more MCP tools to gather context, then asks the model to make a judgment. Each run takes 20 to 25 seconds and the tool calls clearly overlap in the traces.

Under normal traffic everything looked fine. Under load, a slice of requests came back as 500s. Our tracing showed the literal string cancel scope in the error, always on this one agent, always when several tool calls were in flight at once.

Two variants of the message showed up, which turned out to be the same root cause hitting two slightly different code paths inside anyio:

Attempted to exit cancel scope in a different task than it was entered in
Attempted to exit a cancel scope that isn't the current task's current cancel scope

Reading the error

The phrase to anchor on is cancel scope. That is an anyio concept. anyio wraps asyncio (and trio) and gives you structured concurrency primitives. A cancel scope is one of them: a region of async code whose cancellation you can control as a unit.

The important property, and the whole reason this bug exists, is that a cancel scope is task-bound. The task that calls __enter__ on a scope is the only task allowed to call __exit__ on it. Enter in task A, exit in task B, and anyio refuses with exactly the error we were seeing. It is not a soft warning. It raises.

So somewhere, something was opening a cancel scope in one asyncio task and closing it in another. The question was what, and why only sometimes.

The hypothesis: a shared MCP connection

pydantic-ai talks to MCP servers through a connection object (for us, MCPServerStreamableHTTP). That connection is an async context manager. When the agent runs, the connection gets entered, used for tool calls, and exited.

Under the hood, entering that connection opens an anyio cancel scope to bound the lifetime of the underlying HTTP stream. And here is the shape of the old code that mattered:

A _running_count tracked how many things were currently using the connection.
The first user to enter (count goes 0 to 1) opened the cancel scope, in whatever task happened to get there first.
The last user to leave (count goes back to 0) tore it down, by calling aclose() on a stashed exit stack, in whatever task happened to finish last.

Now layer in how the model drives tool calls. When the model decides to call four tools in one turn, pydantic-ai dispatches them concurrently with asyncio.gather. Each one runs in its own child task. They all share the same connection.

So: task A (say, the first tool call) opens the scope. Tasks A through D all do their work. Whichever one finishes last closes the scope. If that is not task A, anyio raises. That is the bug.

Reproducing it on demand

A theory is nice. A reproduction is better. I wrote a small script that fires N parallel requests at the agent and flags any response whose body contains cancel scope.

At one request at a time, it never failed. At ten concurrent requests, it failed 2 out of 10. Roughly a 20 percent hit rate, consistent across runs. That matched what we saw in prod under load, and it gave me a dial I could turn.

It also gave me something to hand the team later: a repeatable, measurable way to say "this is broken" before the fix and "this is fixed" after.
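For readers who want a similar dial, here is a rough sketch of what such a script can look like using only the standard library. It is not the script from this investigation: AGENT_URL, the JSON payload, and the function names are all hypothetical, and the sender is injectable so the counting logic can be exercised without a live agent.

```python
import asyncio
import json
import urllib.error
import urllib.request
from typing import Awaitable, Callable

# Hypothetical endpoint and payload shape; substitute your own agent route.
AGENT_URL = "http://localhost:8000/agent/run"


def _post(i: int) -> str:
    """Blocking POST that returns the response body, even for a 500."""
    req = urllib.request.Request(
        AGENT_URL,
        data=json.dumps({"prompt": f"probe {i}"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.read().decode(errors="replace")
    except urllib.error.HTTPError as err:
        # An error response still carries a body; the error text lives there.
        return err.read().decode(errors="replace")


async def send_http(i: int) -> str:
    # Runs the blocking call in the default thread pool, whose size limits
    # how many requests are truly in flight at once.
    return await asyncio.to_thread(_post, i)


async def count_cancel_scope_failures(
    send: Callable[[int], Awaitable[str]], n: int
) -> int:
    """Fire n requests at once and count bodies that mention 'cancel scope'."""
    bodies = await asyncio.gather(*(send(i) for i in range(n)))
    return sum("cancel scope" in body.lower() for body in bodies)


async def fake_send(i: int) -> str:
    # Stand-in for the real agent so the sketch runs offline: 3 and 7 "fail".
    await asyncio.sleep(0)
    if i in (3, 7):
        return "RuntimeError: Attempted to exit cancel scope in a different task"
    return "ok"


failures = asyncio.run(count_cancel_scope_failures(fake_send, 10))
print(f"{failures} out of 10")
```

Swapping fake_send for send_http points the same counter at a real endpoint.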

What actually goes wrong

I wanted to be sure I understood the mechanism rather than just correlating symptoms. So I stripped it down to the bone. No LLM, no MCP server, no HTTP, no framework. Just asyncio and anyio, with a fake connection that copies the old pydantic-ai shape.

The whole point: the workload is identical in both cases. Four tasks, one shared connection, asyncio.gather. Only the connection's internals change.

import asyncio
from typing import Protocol, runtime_checkable

import anyio


@runtime_checkable
class Connection(Protocol):
    async def __aenter__(self) -> "Connection": ...
    async def __aexit__(self, exc_type, exc, tb) -> None: ...


async def tool_call(conn: Connection, idx: int) -> str:
    """The same workload runs against both connection implementations."""
    async with conn:
        # First-entering task (idx=0) exits FIRST. Last-entering task
        # (idx=3) exits LAST. For the buggy impl, last-exiter != opener
        # so the bug triggers on cleanup.
        await asyncio.sleep(0.01 * (idx + 1))
        return f"tool-{idx}"

Here is the buggy connection. It is a faithful miniature of what pydantic-ai 1.20 was doing: a refcount, a scope opened by the first entrant, a scope closed by the last to leave.

class SharedConnectionBuggy:
    def __init__(self) -> None:
        self._scope: anyio.CancelScope | None = None
        self._count = 0

    async def __aenter__(self) -> "SharedConnectionBuggy":
        if self._count == 0:
            self._scope = anyio.CancelScope()
            self._scope.__enter__()  # opened inside whichever task got here first
        self._count += 1
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._count -= 1
        if self._count == 0:
            assert self._scope is not None
            self._scope.__exit__(exc_type, exc, tb)  # may run in a DIFFERENT task

Run four tool_calls through it with asyncio.gather and, with the sleep timings rigged so the opener is not the closer, it blows up every time:

RuntimeError: Attempted to exit cancel scope in a different task than it was entered in

That is the bug, reproduced in something you can read in one sitting.

Before and after, as a picture

The shared-scope version looks like this. Four worker tasks all reach through the connection and pierce the same cancel scope. T0 opened it. T3 happens to finish last, so T3 is the one that calls __exit__. Different task, so anyio refuses.

Before: four worker tasks share one cancel scope. T0 opens it, T3 closes it, anyio raises.

The fix restructures who owns the scope. Instead of letting whichever worker task happens to be first or last drive the scope's lifecycle, a dedicated long-running task owns it. That owner opens the scope, parks, and closes the scope itself on shutdown. The worker tasks borrow the connection but never touch the scope, so their entry and exit order stops mattering.

After: a dedicated owner task opens and closes the scope. Workers borrow the connection but never touch the scope.

Here is the fixed connection in the same miniature form. Notice that __aenter__ and __aexit__ for workers do nothing to the scope. The scope is opened and closed entirely inside _run, in one task.

class SharedConnectionFixed:
    def __init__(self) -> None:
        self._shutdown = asyncio.Event()
        self._ready = asyncio.Event()
        self._owner: asyncio.Task[None] | None = None

    async def start(self) -> None:
        self._owner = asyncio.create_task(self._run())
        await self._ready.wait()

    async def stop(self) -> None:
        assert self._owner is not None
        self._shutdown.set()
        await self._owner

    async def _run(self) -> None:
        # Scope opened AND closed inside this single owner task.
        with anyio.CancelScope():
            self._ready.set()
            await self._shutdown.wait()

    async def __aenter__(self) -> "SharedConnectionFixed":
        # Workers don't touch the scope, they just borrow the
        # already-open connection. No-op for the scope.
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None

Same four parallel tool_calls, zero crashes, every time.
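To run the comparison end to end, something like the harness below should work. It is a sketch, assuming anyio on the asyncio backend. The two classes are condensed copies of the ones above, repeated only so the block runs on its own; the names Buggy, Fixed, and run_four are shorthand for this example. return_exceptions=True is used so all four outcomes stay visible instead of the first error aborting the gather.

```python
import asyncio

import anyio


class Buggy:
    """Condensed SharedConnectionBuggy: first entrant opens, last leaver closes."""

    def __init__(self) -> None:
        self._scope = None
        self._count = 0

    async def __aenter__(self):
        if self._count == 0:
            self._scope = anyio.CancelScope()
            self._scope.__enter__()
        self._count += 1
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        self._count -= 1
        if self._count == 0:
            self._scope.__exit__(exc_type, exc, tb)


class Fixed:
    """Condensed SharedConnectionFixed: one owner task opens and closes."""

    def __init__(self) -> None:
        self._ready = asyncio.Event()
        self._shutdown = asyncio.Event()
        self._owner = None

    async def start(self) -> None:
        self._owner = asyncio.create_task(self._run())
        await self._ready.wait()

    async def stop(self) -> None:
        self._shutdown.set()
        await self._owner

    async def _run(self) -> None:
        with anyio.CancelScope():
            self._ready.set()
            await self._shutdown.wait()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None


async def tool_call(conn, idx: int) -> str:
    async with conn:
        await asyncio.sleep(0.01 * (idx + 1))
        return f"tool-{idx}"


async def run_four(conn) -> list:
    calls = (tool_call(conn, i) for i in range(4))
    return await asyncio.gather(*calls, return_exceptions=True)


async def main() -> tuple:
    buggy = await run_four(Buggy())
    fixed_conn = Fixed()
    await fixed_conn.start()
    try:
        fixed = await run_four(fixed_conn)
    finally:
        await fixed_conn.stop()
    return buggy, fixed


buggy_results, fixed_results = asyncio.run(main())
print("buggy:", buggy_results)
print("fixed:", fixed_results)
```

With the sleep timings as written, the last slot of the buggy results should hold the RuntimeError raised during cleanup, while the fixed run returns four plain strings.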

It is not about concurrency, it is about task identity

This is the part I got slightly wrong on my first pass, and it is worth getting right because it changes how you reason about the failure rate.

Concurrency is necessary but not sufficient. With a single tool call there is one task, so the opener is trivially also the closer and the bug stays invisible forever. Parallel tool calls are what create the opportunity. But what actually triggers the crash is task identity at cleanup time: is the task that closes the scope the same one that opened it?

That reframing explains the numbers. If you imagine four tasks finishing in a uniformly random order, the opener is equally likely to land in any of the four finishing positions, so "the first to enter is also the last to exit" happens one time in four, and the other three times you crash. Real tool latencies are not uniform, some code paths in the old library swallowed the error, and not every turn batches four calls. All of that pushes the observed rate down to the roughly 20 percent we measured rather than the 75 percent that naive model would predict.

The clean one-liner: the bug is a function of which task closes the scope, not how many calls ran in parallel. Parallelism just turns the dial on probability.
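The naive model mentioned above can be made concrete by brute force. The sketch below assumes task 0 is always the opener and that every finishing order is equally likely, which real traffic does not satisfy; it is only meant to show where the upper bound comes from.

```python
from fractions import Fraction
from itertools import permutations


def naive_crash_probability(n_tasks: int) -> Fraction:
    """Share of finishing orders in which the opener (task 0) is not last."""
    orders = list(permutations(range(n_tasks)))
    crashes = sum(order[-1] != 0 for order in orders)
    return Fraction(crashes, len(orders))


for n in (1, 2, 4):
    print(n, naive_crash_probability(n))
```

One task gives 0, two tasks give 1/2, and four tasks give 3/4. In general the opener finishes last in 1 of n orderings, so the naive crash rate is (n - 1) / n.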

Finding the fix upstream

Once I was confident this was a library problem and not ours, I went looking upstream. It was already known and already fixed. Two issues describe it almost exactly:

pydantic-ai #2355, parallel MCP tool calls cause a runtime error.
pydantic-ai #2818, the same error with concurrent evals.

And the fix: pydantic-ai #4514, titled "fix attempted exit cancel scope in different task by running MCP session in a dedicated task." Reading the diff confirmed everything the miniature implied. The old code had the _running_count plus a stashed exit stack closed in __aexit__. The new code introduces a _session_runner task created with asyncio.create_task, which opens the scope, parks on an event with await stop_event.wait(), and closes the scope itself. The PR's own docstring spells out the intent:

Because the session runs in a dedicated background task, entering and exiting from different tasks (e.g. asyncio.gather children, fasta2a workers, or graph node tasks) is safe: the underlying transport's cancel scopes never cross task boundaries.

That is the same idea as SharedConnectionFixed above, just in the real codebase.

Which version has the fix?

A merged PR does not tell you which release you can actually install. The cleanest way to answer that is to ask git which tags contain the merge commit.

git clone --filter=blob:none --no-checkout https://github.com/pydantic/pydantic-ai.git
cd pydantic-ai

# the merge commit of PR #4514
git tag --contains <merge_sha> --sort=v:refname | head -1
# v1.92.0

git describe --contains <merge_sha>
# v1.92.0~5

git tag --contains lists every tag whose history includes that commit, and the first one in version order is the first release with the fix: pydantic-ai 1.92.0 (released the day after the merge). git describe --contains returning v1.92.0~5 says the commit landed five commits before that tag was cut. This trick works for any commit in any repo, and it beats guessing from a changelog.
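The version ordering is what makes taking the first line safe. A plain lexical sort would rank a tag like v1.100.0 ahead of v1.92.0. Here is a rough Python sketch of the difference, using made-up tag names and a simplified key that only looks at the digits; git's own v:refname handling is more thorough, and version_key and first_release are hypothetical helpers.

```python
import re


def version_key(tag: str) -> tuple:
    """Turn 'v1.92.0' into (1, 92, 0) so tags compare numerically."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def first_release(tags: list) -> str:
    """Lowest tag in version order, not in string order."""
    return min(tags, key=version_key)


# Illustrative tag names only, not the real output of git tag --contains.
tags = ["v1.100.0", "v1.92.0", "v1.93.1"]

lexical_first = sorted(tags)[0]
version_first = first_release(tags)
print("lexical:", lexical_first)
print("version:", version_first)
```

The lexical sort picks v1.100.0 because the character "1" sorts before "9"; the numeric key picks v1.92.0.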

The fix and verifying it

We were on pydantic-ai 1.20.0. That is roughly six months and seventy-some minor versions behind the first release with the fix, and that release had been out for about five weeks before we hit the bug in prod. The bug was probably latent the whole time and only surfaced once this agent's load got high enough to batch parallel tool calls regularly.

The change in our backend was small: raise the floor.

# pyproject.toml
dependencies = [
    "pydantic-ai>=1.92.0",
    # ...
]

I floored at 1.92.0 rather than pinning to a specific version, because 1.92.0 is the minimum that contains the fix and I want resolution to keep picking up newer compatible releases. Resolution landed on 1.107.0, the latest stable in the 1.x line. I deliberately did not move to the 2.0 betas, since a major bump does not belong in a bug-fix change.
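A floor in pyproject.toml only helps if the environment was actually resolved against it. One optional belt-and-braces step is a small check at startup or in the test suite. This is a sketch, not part of the original change: the helper names are hypothetical, it assumes the distribution is installed under the name pydantic-ai, and the parser reads only the leading major.minor.patch digits, so a pre-release of 1.92.0 would count as fixed.

```python
import re
from importlib import metadata
from typing import Optional

MIN_FIXED = (1, 92, 0)  # first release containing the fix, per the tag lookup


def parse_release(version: str) -> tuple:
    """Read the leading major.minor.patch digits from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(group) for group in match.groups())


def has_cancel_scope_fix(version: str) -> bool:
    """Compare numerically, so 1.9.0 is older than 1.92.0."""
    return parse_release(version) >= MIN_FIXED


def installed_has_fix(dist: str = "pydantic-ai") -> Optional[bool]:
    """True or False for the installed version, None if it is not installed."""
    try:
        return has_cancel_scope_fix(metadata.version(dist))
    except metadata.PackageNotFoundError:
        return None


for candidate in ("1.20.0", "1.92.0", "1.107.0"):
    print(candidate, has_cancel_scope_fix(candidate))
```

Comparing tuples of integers rather than strings matters here for the same reason it does with tags: as strings, "1.107.0" would sort before "1.92.0".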

Then I verified both directions:

Does it still work? Ran our end-to-end suite. Same passes as before the bump, no new functional breakage.
Is the bug gone? Ran the parallel reproduction script again. It had been 2 out of 10 before. After the bump it was 0 out of 10, and each call was a couple of seconds faster too.

A jump from 1.20.0 to 1.107.0 is 87 minor versions, which is not a guaranteed clean swap, so this is not a "bump and forget" move. Import the app, run the linter and type checker, run the tests, and actually exercise the thing that was broken. But the diff itself was two lines.

Takeaways

A few things I am keeping from this one:

Anchor on the unusual word in the error. "Cancel scope" was the whole thread to pull. Once I knew it was an anyio task-binding rule, the rest followed.
Reproduce before you fix. The 40-line standalone version proved the mechanism without the framework in the way, and the parallel script gave me a measurable before and after. Both made the fix verifiable instead of hopeful.
Concurrency bugs do not show up in sequential tests. Our end-to-end suite ran tools one after another, so it never exercised the failing path. A concurrency reproduction belongs in the suite, not just in my terminal.
Stale dependencies cost more than the upgrade. The fix existed for weeks before it bit us. git tag --contains makes "is this fixed yet, and in which release" a five-second question.

If you run pydantic-ai with MCP tools and you are on anything before 1.92.0, this is waiting for you the moment your traffic batches parallel tool calls. Floor your dependency and move on.