Draft for Architecture Review

Agent Loop — Technical Spec

Engineer-facing companion to the PM spec. SDK ground-truth, code shapes, state machine, integration patterns, and the reliability primitives that make this shippable at 200 PRs/week.

01 Overview

This document is the engineering depth for TVP-6862. It pairs with pm-spec.html — that doc carries the executive narrative (problem, success metrics, the open-decisions agenda); this one carries the API surface, code patterns, state machine, and failure modes. Read this when you need to know how, not whether.

Audience: Fynn (engineering owner), Sid (PM + part-time IC). Tony can scroll if curious but doesn't need to. The two artifacts together drive tomorrow's 8am review.

Every code sample here is TypeScript / Node 24 LTS targeting the existing CAP EKS environment. Every behavioral claim is anchored to either the official Claude Agent SDK docs or a verifiable source — cited inline.

02 SDK API ground-truth

Common assumptions about @anthropic-ai/claude-agent-sdk we needed to correct, plus the structural primitives we adopt. Each claim is sourced from the published docs at platform.claude.com/docs/en/agent-sdk/overview and the typed exports at github.com/anthropics/claude-agent-sdk-typescript. If we get these wrong the worker won't compile, or worse, will compile and behave incorrectly.

Concern	Ground truth
`permissionMode`	Accepted values: `'default'` | `'acceptEdits'` | `'bypassPermissions'` | `'plan'` | `'dontAsk'` | `'auto'` (6 values as of recent SDK releases; `'dontAsk'` and `'auto'` are recent additions). We use `'bypassPermissions'` + tight `allowedTools`; the latter does the actual enforcement, the former just skips the prompt loop.
Budget cap	The SDK exposes `maxBudgetUsd` as a native typed option (added in a recent SDK release). We use both layers for defense in depth: native `maxBudgetUsd: 2.0` in `options` AND caller-side `result.total_cost_usd` polling + `AbortController` abort on breach. Belt-and-suspenders because budget enforcement is too load-bearing to depend on a single seam. See §04.
Hook events	Real event names: `PreToolUse`, `PostToolUse`, `Notification`, `UserPromptSubmit`, `SessionStart`, `SessionEnd`, `Stop`, `SubagentStop`, `PreCompact`. Not `beforeToolUse` / `afterToolUse` / `onMessage`.
PreToolUse capabilities	Can mutate input via `updatedInput` (inside `hookSpecificOutput`), can block via `permissionDecision: 'deny'`, can defer to user via `permissionDecision: 'ask'`, and (in current SDK) `'defer'` ends the query for later resumption. We don't register hooks on this worker — see §03 for the defensive-design rationale.
AWS auth	Two distinct paths: Claude Platform on AWS (`CLAUDE_CODE_USE_ANTHROPIC_AWS=1` + `ANTHROPIC_AWS_WORKSPACE_ID`) is Anthropic-operated with bare first-party model IDs and same-day API parity. Amazon Bedrock (`CLAUDE_CODE_USE_BEDROCK=1`) is AWS-operated, lags 2-4 weeks, drops some server-side tools. Different products. See §13.
Skills & Subagents	Two first-class structural primitives the SDK gives us — we use one in v1, defer the other to v2. Skills (`.claude/skills/<name>/SKILL.md` per repo) are loaded automatically when we pass `settingSources: ["project"]` and `skills: "all"`. Each team owns their review behavior in a structured format — no custom prompt-composition function. Subagents (defined via the `agents:` option + `"Agent"` in `allowedTools`) would let us decompose review into specialized roles (security / code-quality / test-coverage / architecture). Deferred to v2 after Stage 2 prod data informs the right decomposition. See §08 for v1 wiring.

03 Hooks routing: defensive design

We route reviewId + observability MCP-side, not through SDK hooks. Two reasons:

Design rationale

(1) The MCP server we control already injects headers on every Bitbucket REST call — that's a single observable seam with first-class contract tests (see §07). Adding a parallel hook-based seam would split observability across two surfaces.

(2) Hooks-and-systemPrompt interactions have surprised us in early SDK exploration. The SDK's public docs don't currently call out a known issue, but we'd rather verify the persona / observability interaction with a week-0 spike (see §16) than ship a load-bearing dependency on undocumented behavior.

Operational guidance for this worker:

Use options.canUseTool callback (not PreToolUse hook) for any per-call permission overrides.
For post-tool logging, parse result.messages[] after the query completes — typed, verifiable, and survives any future hook-handling changes.
Skip PreCompact + SessionEnd for observability — Langfuse spans cover the lifecycle (see §12).

Load-bearing engineering decision for TVP-6862: inject reviewId headers inside the MCP server, not via a PreToolUse hook. Observability flows through the typed message stream + caller-side wrapping. The week-0 verification spike (§16) confirms canUseTool + MCP-side injection behave cleanly together before Stage 0 begins.
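
A minimal sketch of the per-call override described above, assuming the canUseTool callback receives a tool name plus its input and returns an allow or deny decision. The types below are local stand-ins that mirror that shape rather than imports from the SDK, and `ALLOWED_TOOL_PREFIX` and `decideToolUse` are hypothetical names. The exact callback signature should be checked against the installed SDK version as part of the §16 spike.

```typescript
// Local stand-ins for the decision shape; NOT imported from the SDK (assumption).
type ToolInput = Record<string, unknown>;
type PermissionDecision =
  | { behavior: "allow"; updatedInput: ToolInput }
  | { behavior: "deny"; message: string };

// Hypothetical policy: only tools served by our Bitbucket MCP server may run.
const ALLOWED_TOOL_PREFIX = "mcp__bitbucket-api__";

// Pure decision function so the policy can be unit-tested without the SDK.
function decideToolUse(toolName: string, input: ToolInput): PermissionDecision {
  if (!toolName.startsWith(ALLOWED_TOOL_PREFIX)) {
    return { behavior: "deny", message: `${toolName} is not permitted for PR review` };
  }
  // Pass the input through unchanged; reviewId is injected MCP-side, not here.
  return { behavior: "allow", updatedInput: input };
}

// Async wrapper in the shape a permission callback would take.
const canUseTool = async (toolName: string, input: ToolInput): Promise<PermissionDecision> =>
  decideToolUse(toolName, input);
```

Keeping the decision logic in a plain function means the policy can be tested in CI independently of how the SDK invokes the callback.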

04 Worker shape: canonical TypeScript pattern

This is the inner loop. One query() call per PR review, wrapped in a Langfuse observation, with caller-side cost enforcement. No hooks.

TypeScript
import { query } from "@anthropic-ai/claude-agent-sdk";
import { startObservation } from "@langfuse/tracing";

// reviewId, repo, prNumber, headSha, userPrompt, and sandboxDir are supplied
// by the surrounding job handler.
const controller = new AbortController();
let cumulativeCost = 0;
const PER_REVIEW_USD_CAP = 2.0;

// Wrap the whole call in a Langfuse observation for trace correlation.
// Manual span instrumentation for the TS worker (see §12); the span processor
// from @langfuse/otel must be registered at process start for spans to export.
const obs = startObservation("pr-review", {
  metadata: { reviewId, repo, prNumber, headSha }
});

try {
  for await (const msg of query({
    prompt: userPrompt,
    options: {
      model: "claude-sonnet-4-6",           // bare first-party ID (Claude Platform on AWS)
      systemPrompt: BASE_REVIEW_PROMPT,     // CLAUDE.md from the repo composes on top as Memory (see §08)
      cwd: sandboxDir,                       // repo checkout at PR head SHA
      settingSources: ["project"],          // SDK auto-loads .claude/skills/ + CLAUDE.md from cwd
      skills: "all",                          // every SKILL.md the repo defines — per-team review behavior
      abortController: controller,            // caller-side cost cap signal
      mcpServers: {
        "bitbucket-api": {
          type: "stdio",
          command: "node",
          args: ["/app/mcp-servers/bitbucket-api/server.js"]
          // reviewId injected MCP-side, NOT via PreToolUse hook (see §03)
        }
      },
      allowedTools: [
        "mcp__bitbucket-api__bb_current_repo",
        "mcp__bitbucket-api__bb_list_pull_requests",
        "mcp__bitbucket-api__bb_get_pull_request",
        "mcp__bitbucket-api__bb_get_pull_request_diff",
        "mcp__bitbucket-api__bb_comment_pull_request"
      ],
      disallowedTools: ["Bash", "Write", "Edit"],
      permissionMode: "bypassPermissions",  // allowedTools enforces; this skips prompt loop
      maxTurns: 25,
      maxBudgetUsd: PER_REVIEW_USD_CAP        // native SDK option (recent release); AbortController polling below is the 2nd layer
      // NO `hooks` block — see §03 for the defensive-design rationale.
    }
  })) {
    if (msg.type === "result") {
      cumulativeCost += msg.total_cost_usd;
      obs.update({ output: {
        cost_usd: msg.total_cost_usd,
        num_turns: msg.num_turns,
        subtype: msg.subtype,                 // "success" | "error_max_turns" | "error_during_execution"
        stop_reason: msg.stop_reason,
        permission_denials: msg.permission_denials?.length ?? 0
      }});
      if (cumulativeCost > PER_REVIEW_USD_CAP) {
        logger.warn({ event: "budget.exceeded", reviewId, cumulativeCost });
        controller.abort();
        throw new BudgetExceededError(reviewId, cumulativeCost);
      }
    }
  }
} finally {
  obs.end();
}

Annotation

permissionMode: "bypassPermissions" — allowedTools is the actual enforcement gate; this just skips the SDK's per-tool prompt loop so the worker doesn't wait on user input. Accepted values today: 'default'|'acceptEdits'|'bypassPermissions'|'plan'|'dontAsk'|'auto'.
Repo customization via cwd + settingSources + skills — the SDK auto-discovers CLAUDE.md (Memory) and .claude/skills/<name>/SKILL.md (Skills) from the working directory. No custom prompt-composition function — the SDK does it natively. See §08 for the three-layer composition order (org-default → org-baseline skills → repo overrides).
No hooks block — observability flows through the typed message stream + MCP-side reviewId injection. See §03 for the defensive-design rationale and §16 for the week-0 verification spike.
Dual cost cap — native maxBudgetUsd: $2.00 in options is the first line; caller-side result.total_cost_usd polling + AbortController is the second. Belt-and-suspenders — budget enforcement is too load-bearing to depend on a single seam.
Bare model ID — "claude-sonnet-4-6" works on Claude Platform on AWS. If we fall back to Bedrock (§13) it becomes "anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0" — exact date suffix to verify before Stage 0.
disallowedTools — explicit deny on Bash / Write / Edit as defense-in-depth. The allowlist already excludes them; this is the belt-and-suspenders second layer.
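
The worker pattern above throws a BudgetExceededError without defining it. A minimal sketch of what that error and the caller-side check could look like follows; the field names and the `assertWithinBudget` helper are hypothetical and not part of the SDK.

```typescript
// Hypothetical error type for the caller-side cap; not part of the SDK.
class BudgetExceededError extends Error {
  readonly reviewId: string;
  readonly costUsd: number;

  constructor(reviewId: string, costUsd: number) {
    super(`review ${reviewId} exceeded its budget at $${costUsd.toFixed(2)}`);
    this.name = "BudgetExceededError";
    this.reviewId = reviewId;
    this.costUsd = costUsd;
  }
}

// Throws once the running total is strictly above the cap, matching the
// `cumulativeCost > PER_REVIEW_USD_CAP` comparison in the worker loop.
function assertWithinBudget(reviewId: string, cumulativeCostUsd: number, capUsd: number): void {
  if (cumulativeCostUsd > capUsd) {
    throw new BudgetExceededError(reviewId, cumulativeCostUsd);
  }
}
```

Carrying reviewId and the observed cost on the error lets the job handler log and alert without re-deriving them.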

05 Per-PR slot state machine

Single Valkey key per in-flight PR: review:slot:{workspace}:{repo}:{prId}. All transitions are atomic via Lua scripts. A commit storm costs at most ~2 reviews per PR (the run in flight plus one rerun at the latest headSha), regardless of push frequency.

        webhook
           │
           ▼
      ┌──────────┐
      │   idle   │
      └────┬─────┘
           │ SETNX slot=debouncing(15s)
           ▼
    ┌─────────────┐   another webhook    ┌─────────────┐
    │ debouncing  │ ───────────────────▶ │ debouncing  │  (extend timer, latest headSha wins)
    └──────┬──────┘                      └─────────────┘
           │ debounce expires
           ▼
      ┌──────────┐   webhook during run   ┌──────────────┐
      │ running  │ ────────────────────▶  │ pending-rerun│
      └────┬─────┘                        └──────┬───────┘
           │ result / error                      │
           ▼                                     │
      ┌──────────┐                               │
      │   idle   │◀──────────────────────────────┘
      └──────────┘   (re-enqueues at latest headSha)

Implementation notes

Lua atomicity — all transitions wrapped in EVAL Lua so check-and-set races can't split a slot.
Heartbeat — worker emits lastHeartbeatAt to the slot every 10s during a review.
Stuck-slot recovery — debounce-drainer detects lastHeartbeatAt < now - 60s on a running slot and force-transitions it back to debouncing for retry (worker pod was likely OOM-killed).
Per-PR isolation — slots are keyed by {workspace}:{repo}:{prId}, so unrelated PRs can run in parallel up to the BullMQ concurrency limit.
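
The transitions and the stuck-slot check can be modelled as pure functions, which is convenient for unit-testing the rules before they are encoded in Lua. This is an in-memory sketch only: `transition`, `isStuck`, and the event names are hypothetical, and the pending-rerun exit follows the §09 description (back to debouncing at the latest headSha).

```typescript
type SlotState = "idle" | "debouncing" | "running" | "pending-rerun";

interface Slot {
  state: SlotState;
  headSha: string | null;
}

type SlotEvent =
  | { type: "webhook"; headSha: string }
  | { type: "debounce-expired" }
  | { type: "review-finished" };

function transition(slot: Slot, event: SlotEvent): Slot {
  switch (event.type) {
    case "webhook":
      // idle and debouncing both (re)start the debounce window; latest headSha wins.
      if (slot.state === "idle" || slot.state === "debouncing") {
        return { state: "debouncing", headSha: event.headSha };
      }
      // A push during a run is remembered as a single pending rerun.
      return { state: "pending-rerun", headSha: event.headSha };
    case "debounce-expired":
      return slot.state === "debouncing" ? { ...slot, state: "running" } : slot;
    case "review-finished":
      if (slot.state === "pending-rerun") return { ...slot, state: "debouncing" };
      return slot.state === "running" ? { state: "idle", headSha: null } : slot;
  }
}

// Stuck-slot rule from the notes above: lastHeartbeatAt < now - 60s.
function isStuck(lastHeartbeatAtMs: number, nowMs: number, staleAfterMs = 60_000): boolean {
  return lastHeartbeatAtMs < nowMs - staleAfterMs;
}
```

The production transitions still need to run inside a single Lua script per event; this model only pins down the intended state table.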

06 MCP integration patterns

The Claude Agent SDK supports four MCP transports. Listed here so the choice is explicit; for TVP-6862 we use stdio.

Option 1 — stdio (what we use)

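TypeScript
mcpServers: {
  "bitbucket-api": {
    type: "stdio",         // optional: stdio is the default
    command: "node",
    args: ["/app/mcp-servers/bitbucket-api/server.js"],
    // Pass credentials explicitly from the worker's environment.
    env: {
      BITBUCKET_APP_PASSWORD: process.env.BITBUCKET_APP_PASSWORD ?? "",
      BITBUCKET_USERNAME: process.env.BITBUCKET_USERNAME ?? ""
    }
  }
}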

Bundled in the worker container image, spawned as a long-lived stdio child per worker process, not per review. Lowest latency, no extra service hop. The MCP server has direct memory access to its own state but cannot see the Agent SDK process state.

Option 2 — SSE (remote MCP)

TypeScript
mcpServers: {
  "remote-mcp": {
    type: "sse",
    url: "https://mcp.example.com/sse",
    // Template literal so the token is interpolated by TypeScript at call time.
    headers: { Authorization: `Bearer ${process.env.TOKEN}` }
  }
}

Option 3 — HTTP (remote alternative)

TypeScript
mcpServers: {
  "http-mcp": {
    type: "http",
    url: "https://api.example.com/mcp",
    // Read the key from the environment rather than a literal "${KEY}" string.
    headers: { "X-API-Key": process.env.KEY ?? "" }
  }
}

Option 4 — SDK in-process (fastest, same memory)

TypeScript
import { createSdkMcpServer, tool } from "@anthropic-ai/claude-agent-sdk";

const bitbucket = createSdkMcpServer({
  name: "bitbucket-api",
  version: "1.0.0",
  tools: [/* tool definitions */]
});

mcpServers: { "bitbucket-api": bitbucket }

SDK transport shares process memory with the agent — fastest, but no isolation. We don't use it because stdio's separation is the more cautious default for a Bitbucket-credentialed component.

07 Bitbucket MCP server design

Read-only tools plus two comment tools, one of which is an idempotent upsert. The agent is a reviewer, not an approver. PR-mutating tools (bb_create_pull_request, bb_approve_pull_request, bb_merge_pull_request) are explicitly excluded from the exposed surface.

Tool	Purpose
bb_current_repo	Returns workspace + repo_slug. Called first in any flow.
bb_list_pull_requests	Open PRs in a repo. Mostly unused in the bot flow (the webhook tells us which PR); useful in dev.
bb_get_pull_request	Full PR details: title, description, author, branches, reviewers.
bb_get_pull_request_diff	Unified diff text. The primary input to the review.
bb_comment_pull_request	Post inline or summary comment. Used by the agent inside the `query()` loop.
bb_upsert_inline_comment	New. Idempotency-marker upsert. Computes `hash = sha256(filePath + lineNumber + commentBodyNormalized)`, prepends `<!-- claude-reviewer:inline:{hash} -->`, lists existing comments, regex-matches the marker, updates in place if present. Wraps the idempotency logic so the agent doesn't have to reason about it.
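
A minimal sketch of the marker logic behind bb_upsert_inline_comment, using Node's built-in crypto module. The function names, the whitespace normalization, and the newline separators between hash inputs are assumptions for illustration; the table above only specifies the hashed fields and the marker format.

```typescript
import { createHash } from "node:crypto";

// Marker format from the tool description; the capture group is the sha256 hex digest.
const MARKER_PATTERN = /<!-- claude-reviewer:inline:([0-9a-f]{64}) -->/;

// Assumed normalization: trim and collapse whitespace so cosmetic changes keep the same hash.
function normalizeBody(body: string): string {
  return body.trim().replace(/\s+/g, " ");
}

// Separators between fields are an assumption, added to avoid ambiguous concatenation.
function markerHash(filePath: string, lineNumber: number, body: string): string {
  return createHash("sha256")
    .update(`${filePath}\n${lineNumber}\n${normalizeBody(body)}`)
    .digest("hex");
}

// Prepend the hidden marker to the comment body before posting.
function withMarker(hash: string, body: string): string {
  return `<!-- claude-reviewer:inline:${hash} -->\n${body}`;
}

interface ExistingComment {
  id: number;
  body: string;
}

// Regex-match the marker on each existing comment and compare hashes.
function findByHash(comments: ExistingComment[], hash: string): ExistingComment | undefined {
  return comments.find((c) => MARKER_PATTERN.exec(c.body)?.[1] === hash);
}
```

If `findByHash` returns a comment, the tool would update it in place; otherwise it would post a new comment built with `withMarker`.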

Auth: the MCP server reads BITBUCKET_APP_PASSWORD + BITBUCKET_USERNAME from env (injected at pod start via External Secrets Operator → AWS Secrets Manager). A dedicated bot Bitbucket account with workspace-level read + PR-comment-write — no human credentials. reviewId is read from a header propagated from the worker and added to every Bitbucket REST call as X-Review-Id for trace correlation.

08 Repo-local config: SDK-native loading

Each repo customizes review behavior through SDK-native primitives. We let the SDK compose at runtime — no custom prompt-composition function, no string-concat of markdown files. Two seams the repo controls:

CLAUDE.md (root or .claude/CLAUDE.md) → loaded as Memory. Repo context — stack, conventions, "this is a payments microservice on Go 1.22, the canonical PR shape is X" — composes on top of our base reviewer system prompt automatically.
.claude/skills/<name>/SKILL.md → loaded as Skills. Each team's review-behavior bundles: review-security, review-test-coverage, review-migration, etc. Owned by the team that owns the repo.

The whole thing wires together via the worker options (§04):

TypeScript
// After checking out the repo at PR head SHA to `sandboxDir`:
options: {
  cwd: sandboxDir,
  settingSources: ["project"],     // SDK auto-discovers .claude/* + CLAUDE.md from cwd
  skills: "all",                  // load every SKILL.md found
  systemPrompt: BASE_REVIEW_PROMPT   // org-default reviewer role string
  // ... mcpServers, allowedTools, maxBudgetUsd, maxTurns as in §04
}

Three layers compose, in this order:

Org-default reviewer role — BASE_REVIEW_PROMPT string passed to systemPrompt. Baked into the worker container.
Org-baseline skills — shipped in the worker image at /etc/reviewer/skills/, mounted to ~/.claude/skills/ for the SDK process. Cover the universal review dimensions (basic security checks, missing tests, dead code, etc.) that apply org-wide.
Repo overrides — CLAUDE.md + .claude/skills/ at the PR head SHA. The team's repo-specific context and tuned skills. Composes on top of org baselines without replacing them.

Cache invariant

Cache the sandbox checkout by sha256(sorted content of CLAUDE.md + .claude/**), NOT by {repoSlug}:{headSha}. If two consecutive PRs in the same repo don't change any config or skill file, they reuse the same checkout — saves a Bitbucket round-trip. SDK reloads from cwd each query() call, so the working directory IS the cache surface.
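
One way to compute that key, sketched under the assumption that the caller has already read CLAUDE.md and every file under .claude/ into a path-to-content map. `configCacheKey` and the length-prefixed encoding are hypothetical choices, not something the SDK prescribes.

```typescript
import { createHash } from "node:crypto";

// Repo-relative path -> file content for CLAUDE.md and .claude/** (read by the caller).
type ConfigFiles = Record<string, string>;

function configCacheKey(files: ConfigFiles): string {
  const hash = createHash("sha256");
  // Sort by path so the key does not depend on filesystem enumeration order.
  const entries = Object.entries(files).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [path, content] of entries) {
    // Length-prefix each part so different path/content splits cannot collide.
    hash.update(`${path.length}:${path}${content.length}:${content}`);
  }
  return hash.digest("hex");
}
```

Two PRs whose config and skill files are byte-identical produce the same key, while any edit to a SKILL.md or CLAUDE.md produces a new one.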

Subagents — the v2 path on the same primitive surface

The SDK also exposes Subagents via the agents: { ... } option (plus "Agent" added to allowedTools). The natural v2 decomposition is four specialized reviewers spawned in parallel from the main agent, each with its own focused prompt + tool surface + budget slice:

security-reviewer — auth, SQL injection, hardcoded secrets, missing rate limits
code-quality-reviewer — style, dead code, readability, naming
test-coverage-reviewer — does new code have tests; do existing tests still pass
architecture-reviewer — fits repo conventions per CLAUDE.md; respects module boundaries

Why deferred to v2: (a) v1 single-agent + skills is simpler to ship, debug, and budget at Q3 timeline; (b) the right decomposition is best informed by Stage 2 prod data on which review dimensions actually hit vs. miss; (c) per-subagent budget slicing (4 × $0.50 = $2 ceiling instead of single $2 cap) needs operational practice before we commit to it. .claude/agents/<name>.md at the repo level becomes the per-team customization surface for subagent roles in v2 — same SDK-native pattern as skills.

09 Multi-commit handling: slot + debounce + incremental diff

Combines the slot machine (§05) with a 15-second debounce window and an incremental-diff context primer. Net effect: a burst of pushes triggers at most 2 reviews per PR (the run in flight plus one rerun), regardless of push frequency.

Algorithm

Webhook arrives → Fastify → Valkey Lua script atomically transitions slot (idle→debouncing | debouncing→debouncing with timer reset | running→pending-rerun).
A debounce-drainer worker polls expired debounce slots every 2s, atomically transitions debouncing → running, enqueues a BullMQ job at the latest headSha.
Review worker pulls the job, runs query(), posts comments via the MCP, releases the slot.
On completion: Lua transitions running → idle OR pending-rerun → debouncing(15s) if a re-push landed mid-run. The latter case re-enqueues at the new headSha.

Incremental diff context

When transitioning pending-rerun → debouncing, persist the head SHA that the prior review ran against (last_reviewed_sha). The next review's prompt is prefixed with:

Prompt fragment
Previous review was at SHA <x>. The following commits have landed since
that review:

<incremental diff between previous reviewed SHA and current head>

Update the existing summary comment in place. Only add NEW inline comments
for new findings — do not re-post comments on lines you already flagged.

This is what makes the re-push experience feel surgical instead of redundant. The agent gets full context (base...head) plus a clear signal about what's actually new (last_reviewed_sha...head), and the bb_upsert_inline_comment tool handles the dedup mechanics on the comment-posting side.
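
A small sketch of how the worker might assemble that primer. `buildReviewPrompt` and the `RerunContext` shape are hypothetical; the wording of the primer follows the fragment above.

```typescript
// Context carried over from the previous review of the same PR (hypothetical shape).
interface RerunContext {
  previousSha: string;      // last_reviewed_sha
  incrementalDiff: string;  // diff for last_reviewed_sha...head
}

// Returns the base prompt unchanged on a first review, or prefixed with the primer on a rerun.
function buildReviewPrompt(basePrompt: string, rerun?: RerunContext): string {
  if (!rerun) return basePrompt;
  const primer = [
    `Previous review was at SHA ${rerun.previousSha}. The following commits have landed since`,
    "that review:",
    "",
    rerun.incrementalDiff,
    "",
    "Update the existing summary comment in place. Only add NEW inline comments",
    "for new findings; do not re-post comments on lines you already flagged.",
  ].join("\n");
  return `${primer}\n\n${basePrompt}`;
}
```

The first review of a PR passes no rerun context and gets the base prompt untouched, so the primer never appears without a previous SHA to anchor it.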

10 Cost controls: defense in depth

Layer	Limit	Mechanism
Per-review hard cap	$2.00 / review	Native `maxBudgetUsd` in `options` plus caller-side `AbortController` + cumulative `result.total_cost_usd` polling (see §04). Aborts the `query()` loop mid-flight on breach.
Per-repo daily budget	$5 / repo / day (default)	Valkey counter checked before `query()` call. Rejects if budget exhausted, emits `BudgetExhausted` event, posts a "review skipped — daily budget hit" comment.
Per-team monthly cap	configurable per team	AWS Budgets on the Claude Platform on AWS workspace. Email + Slack alerts at 50%, 80%, 100% of monthly target.
Kill switch	0 (emergency)	Single Valkey key flips. See §11.

Three layers because no single layer catches every failure shape. Per-call catches runaway loops on a single pathological PR. Per-repo-daily catches sustained prompt-config regressions on one repo before they bleed budget. Per-team-monthly is the org-level backstop. The kill switch is the emergency cut.
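
The per-repo daily layer reduces to a counter keyed by repo and calendar day. The sketch below uses an in-memory Map as a stand-in for the Valkey counter; the key format, the UTC day boundary, and the function names are assumptions.

```typescript
const DAILY_BUDGET_USD = 5; // default from the table above

// In-memory stand-in for the Valkey counter (assumption, for illustration only).
const spend = new Map<string, number>();

// One key per repo per UTC calendar day, e.g. "budget:team/repo:2025-01-01".
function budgetKey(repo: string, day: Date): string {
  return `budget:${repo}:${day.toISOString().slice(0, 10)}`;
}

// Checked before the query() call; false means the review is skipped.
function canStartReview(repo: string, day: Date): boolean {
  return (spend.get(budgetKey(repo, day)) ?? 0) < DAILY_BUDGET_USD;
}

// Called after each review with the cost reported on the result message.
function recordSpend(repo: string, day: Date, costUsd: number): void {
  const key = budgetKey(repo, day);
  spend.set(key, (spend.get(key) ?? 0) + costUsd);
}
```

In the real worker the check and the increment would go through Valkey so that every pod sees the same total.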

11 Kill switch contract

Single Valkey key, checked on every webhook receipt AND every BullMQ job pickup. First action in every runbook.

Shell
# Flip on (any on-call engineer can do this)
redis-cli -h <valkey> SET bot:killswitch:enabled true

# Watch workers drain (≈30s)
kubectl logs -n tvp-loop -l app=loop-agent --tail=50 -f | grep KillSwitchEngaged

# Once all workers are idle, PRs are no longer reviewed and no errors are posted.
# Flip off when ready.
redis-cli -h <valkey> SET bot:killswitch:enabled false

Worker behavior on engaged kill switch:

Webhook ingress returns 503 killswitch_engaged — Bitbucket retries per its own policy.
In-flight query() calls complete (do not abort mid-comment-post — that would create orphan inline comments).
BullMQ worker returns any job it picks up while the switch is engaged to the queue without starting it, and emits a KillSwitchEngaged structured-log event with reviewId.
Pods drain on SIGTERM within a 60-second graceful period.
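
The ingress side of this contract can be sketched as a pure decision on the value read from the bot:killswitch:enabled key. The function names and the 202 response for accepted webhooks are assumptions; only the 503 killswitch_engaged response comes from the contract above.

```typescript
type IngressDecision =
  | { status: 503; body: { error: "killswitch_engaged" } }
  | { status: 202; body: { accepted: true } };

// The key holds the string "true" when engaged; a missing key (null) means off.
function isKillSwitchEngaged(flagValue: string | null): boolean {
  return flagValue === "true";
}

// Decide the webhook response from the flag value the handler just read from Valkey.
function decideIngress(flagValue: string | null): IngressDecision {
  if (isKillSwitchEngaged(flagValue)) {
    return { status: 503, body: { error: "killswitch_engaged" } };
  }
  return { status: 202, body: { accepted: true } };
}
```

The same `isKillSwitchEngaged` check can be reused at BullMQ job pickup so both gates read the flag the same way.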

12 Observability: Langfuse via OpenTelemetry

Two instrumentation paths into the same self-hosted Langfuse instance (per ATF-76). Verify the minimum platform version with the ATF-76 owners before Stage 0 — current self-hosted Langfuse with OTel ingestion enabled is the target.

Python path (eval harness and sub-services)

Python
# Shell, run once before the script below:
#   pip install langfuse claude-agent-sdk \
#     "langsmith[claude-agent-sdk]" "langsmith[otel]"

# env:
#   LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
#   LANGFUSE_BASE_URL=https://<samba-langfuse-host>
#   LANGSMITH_OTEL_ENABLED=true

from langfuse import get_client
from langsmith.integrations.claude_agent_sdk import configure_claude_agent_sdk

langfuse = get_client()
assert langfuse.auth_check()
configure_claude_agent_sdk()  # spans auto-flow to Langfuse via OTel

Note: the instrumentation library is langsmith.integrations.claude_agent_sdk — yes, that's LangSmith's open-source library. The spans flow to Langfuse via OpenTelemetry because both vendors share the OTel surface. We use Langfuse for storage and UI; LangSmith is just the instrumentation seam.

TypeScript path (the worker — Fynn's stack)

No verified TS sibling of the Python instrumentation library exists yet. Day-1 approach: manual span instrumentation with @langfuse/otel. ~50 lines around the query() call (see §04 for the wrapper). If openinference-instrumentation-claude-agent-sdk matures a TS distribution, adopt that and remove the manual layer.

What you get

Every model call — input messages, output, model ID, token counts (input / output / cache).
Every tool use — name, input args, result, latency.
Session/trace correlation by reviewId — filter the dashboard by repo, PR number, or reviewId.
Cost tracking — tokens × pricing table, per-trace.
Replay UI — step through any review turn-by-turn after the fact.

Why self-hosted

ATF-76 standardization + data residency for Security (traces never leave AWS). Bonus: the same Langfuse instance hosts both agent traces AND eval harness runs — one observability backend, two use cases.

13 Claude Platform on AWS vs Bedrock

Different products. Easy to confuse because both are "Claude on AWS". The first is the recommended path; the second is the documented fallback.

Concern	Claude Platform on AWS	Amazon Bedrock
Operator	Anthropic-operated	AWS-operated
Env vars	`CLAUDE_CODE_USE_ANTHROPIC_AWS=1` + `ANTHROPIC_AWS_WORKSPACE_ID`	`CLAUDE_CODE_USE_BEDROCK=1`
Model ID format	`claude-sonnet-4-6` (bare first-party)	`anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0` — exact date suffix to verify via `aws bedrock list-foundation-models` before Stage 0
Auth	SigV4 + AWS IAM (IRSA)	AWS IAM (IRSA) with `bedrock:InvokeModel`
API parity	Same-day with first-party	Lags 2-4 weeks
Managed Agents	Available	Not available
Server-side tools	Available	Some unavailable
Billing	AWS Marketplace	AWS native
Endpoint	`aws-external-anthropic.{region}.api.aws/v1/...`	`bedrock-runtime.{region}.amazonaws.com`

Recommendation: Claude Platform on AWS for new builds (this is what TVP-6862 should target). Bedrock is the documented fallback if the Platform-on-AWS workspace isn't provisioned in Samba's AWS account.

Verification — week 0

The exact IAM action set for Claude Platform on AWS is NOT bedrock:InvokeModel — it's a separate action namespace documented at platform.claude.com/docs/en/api/claude-platform-on-aws-iam-actions.md. We MUST verify those action names against the live doc before applying the IRSA policy; the IAM block in build-and-deploy.md currently has a TODO placeholder.

14 Failure mode catalog

Nine production scenarios. Detection signal is what triggers the alert; mitigation is what we ship on day one to keep blast radius small.

#	Failure	Trigger	Detection	Mitigation
F1	Cost runaway, single review	Pathological diff sends review into recursive tool loop	`result.subtype` + cost spike on the trace	Native `maxBudgetUsd` at $2 plus caller-side `AbortController` (see §04)
F2	Cost runaway, sustained	Prompt change increases avg tokens 5×	Prometheus histogram on `total_cost_usd` p95	Per-repo daily Valkey budget + alert on rolling p95
F3	Runaway turn loop	Agent stuck retrying a failing MCP call	`result.subtype = "error_max_turns"`	`maxTurns: 25` cap; post truncated summary
F4	Duplicate inline comments on re-push	Idempotency marker logic broken	Engineer feedback / Slack reactions	`bb_upsert_inline_comment` with hash marker + contract tests in CI
F5	MCP server crash mid-review	Bitbucket API rate-limit → child exits	`result.subtype = "error_during_execution"` + MCP stderr	DLQ + auto-retry with exponential backoff; post "review temporarily unavailable" summary
F6	Prompt regression ships	Engineer edits `CLAUDE.md` or a `SKILL.md`; comment quality drops 40%	Eval gate failure rate	Required Bitbucket Pipelines status check on prompt-config repo PRs
F7	Prompt injection in diff	Diff includes injection payload trying to escape allowlist	`permission_denials` array on `result`	Hard-deny at SDK layer (`disallowedTools`) + structural eval fixture with injection attempt
F8	Silent webhook backlog	BullMQ consumer crashes	`oldest_unacked_message_age > 5min` alert	DLQ + queue-depth alert + auto-restart on liveness probe
F9	Anthropic API spend cap hit	Per-team monthly cap reached	HTTP 429/403 from the API, surfaced as an error `result` message	AWS Budgets pre-alerts at 80%; fallback to "minimal review" mode

15 Escape hatch: `RUNTIME_MODE=cli`

Insurance against SDK breakage. The worker has two runtime drivers behind a single env-flagged factory:

TypeScript
// services/pr-review-worker/src/runtime/index.ts
import { runWithSdk } from "./sdk";
import { runWithCli } from "./cli";

export const runReview = process.env.RUNTIME_MODE === "cli"
  ? runWithCli
  : runWithSdk;

Default is sdk. If Anthropic ships a breaking SDK change mid-pilot, flip RUNTIME_MODE=cli via ConfigMap, restart pods, and the worker spawns @anthropic-ai/claude-code as a subprocess until we patch.

SDK upgrade discipline

Pin to exact version. Never ^x.y.z. Patch bumps land in the lockfile only after CI verification.
bot-sdk-bump Bitbucket Pipelines job — runs the eval gate against any new SDK version before promotion.
2-week soak in staging before merging SDK upgrades to main.
CLI adapter exercised weekly in CI so it doesn't bitrot. If we ever need it, we know it works.

16 Verification items: week 0

Concrete checks to run before Stage 0 (shadow mode) begins. Each is a 15-minute spike, not a research project.

Confirm Samba has a Claude Platform on AWS workspace provisioned in the AWS account where CAP runs. Set + read ANTHROPIC_AWS_WORKSPACE_ID end-to-end.
Confirm Samba's self-hosted Langfuse instance supports OTel ingestion from CAP pods. Cross-check the minimum platform version with the ATF-76 owners.
Look up the exact IAM action set for Claude Platform on AWS (NOT bedrock:InvokeModel — see §13). Update the IRSA policy from the TODO placeholder in build-and-deploy.md.
End-to-end smoke test the canUseTool callback + MCP-side reviewId injection pattern against a sandbox MCP server. Verify no persona / observability surprises per §03.
Spike-test the maxTurns boundary behavior — confirm the SDK emits the partial result message with subtype: "error_max_turns" when the cap is hit (we rely on this for F3 detection).
Spike-test the new native maxBudgetUsd option against a deliberately-expensive synthetic review — confirm the SDK terminates cleanly at the cap, and that our caller-side AbortController polling fires correctly as the second-layer guard.
Confirm the exact Bedrock model ID for Claude Sonnet 4.6 (anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0) via aws bedrock list-foundation-models --region <cap-region>. Update build-and-deploy.md + this doc's §13 if the date suffix differs from the current placeholder.