Agent Loop — Technical Spec
Engineer-facing companion to the PM spec. SDK ground-truth, code shapes, state machine, integration patterns, and the reliability primitives that make this shippable at 200 PRs/week.
01 Overview
This document is the engineering depth for TVP-6862. It pairs with pm-spec.html — that doc carries the executive narrative (problem, success metrics, the open-decisions agenda); this one carries the API surface, code patterns, state machine, and failure modes. Read this when you need to know how, not whether.
Audience: Fynn (engineering owner), Sid (PM + part-time IC). Tony can scroll if curious but doesn't need to. The two artifacts together drive tomorrow's 8am review.
Every code sample here is TypeScript / Node 24 LTS targeting the existing CAP EKS environment. Every behavioral claim is anchored to either the official Claude Agent SDK docs or a verifiable source — cited inline.
02 SDK API ground-truth
Common assumptions about @anthropic-ai/claude-agent-sdk we needed to correct, plus the structural primitives we adopt. Each claim is sourced from the published docs at platform.claude.com/docs/en/agent-sdk/overview and the typed exports at github.com/anthropics/claude-agent-sdk-typescript. If we get these wrong the worker won't compile, or worse, will compile and behave incorrectly.
| Concern | Ground truth |
|---|---|
| permissionMode | Accepted values: 'default', 'acceptEdits', 'bypassPermissions', 'plan', 'dontAsk', 'auto' (6 values as of recent SDK releases; 'dontAsk' and 'auto' are recent additions). We use 'bypassPermissions' + tight allowedTools; the latter does the actual enforcement, the former just skips the prompt loop. |
| Budget cap | The SDK exposes maxBudgetUsd as a native typed option (added in a recent SDK release). We use both layers for defense in depth: native maxBudgetUsd: 2.0 in options AND caller-side result.total_cost_usd polling + AbortController abort on breach. Belt-and-suspenders because budget enforcement is too load-bearing to depend on a single seam. See §04. |
| Hook events | Real event names: PreToolUse, PostToolUse, Notification, UserPromptSubmit, SessionStart, SessionEnd, Stop, SubagentStop, PreCompact. Not beforeToolUse / afterToolUse / onMessage. |
| PreToolUse capabilities | Can mutate input via updatedInput (inside hookSpecificOutput), can block via permissionDecision: 'deny', can defer to user via permissionDecision: 'ask', and (in current SDK) 'defer' ends the query for later resumption. We don't register hooks on this worker — see §03 for the defensive-design rationale. |
| AWS auth | Two distinct paths: Claude Platform on AWS (CLAUDE_CODE_USE_ANTHROPIC_AWS=1 + ANTHROPIC_AWS_WORKSPACE_ID) is Anthropic-operated with bare first-party model IDs and same-day API parity. Amazon Bedrock (CLAUDE_CODE_USE_BEDROCK=1) is AWS-operated, lags 2-4 weeks, drops some server-side tools. Different products. See §13. |
| Skills & Subagents | Two first-class structural primitives the SDK gives us — we use one in v1, defer the other to v2. Skills (.claude/skills/<name>/SKILL.md per repo) are loaded automatically when we pass settingSources: ["project"] and skills: "all". Each team owns their review behavior in a structured format — no custom prompt-composition function. Subagents (defined via the agents: option + "Agent" in allowedTools) would let us decompose review into specialized roles (security / code-quality / test-coverage / architecture). Deferred to v2 after Stage 2 prod data informs the right decomposition. See §08 for v1 wiring. |
03 Hooks routing: defensive design
We route reviewId + observability MCP-side, not through SDK hooks. Two reasons:
(1) The MCP server we control already injects headers on every Bitbucket REST call — that's a single observable seam with first-class contract tests (see §07). Adding a parallel hook-based seam would split observability across two surfaces.
(2) Hooks-and-systemPrompt interactions have surprised us in early SDK exploration. The SDK's public docs don't currently call out a known issue, but we'd rather verify the persona / observability interaction with a week-0 spike (see §16) than ship a load-bearing dependency on undocumented behavior.
Operational guidance for this worker:
- Use the options.canUseTool callback (not a PreToolUse hook) for any per-call permission overrides.
- For post-tool logging, parse result.messages[] after the query completes: typed, verifiable, and it survives any future hook-handling changes.
- Skip PreCompact + SessionEnd for observability; Langfuse spans cover the lifecycle (see §12).
Load-bearing engineering decision for TVP-6862: inject reviewId headers inside the MCP server, not via a PreToolUse hook. Observability flows through the typed message stream + caller-side wrapping. The week-0 verification spike (§16) confirms canUseTool + MCP-side injection behave cleanly together before Stage 0 begins.
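A minimal sketch of the canUseTool path recommended above, assuming the callback receives a tool name plus its input and returns an allow/deny decision object. The PermissionDecision type is a hypothetical local stand-in for the SDK's own type, the "body" argument name is an assumption, and the empty-comment rule is purely illustrative. All of this should be checked against the SDK's typed exports during the week-0 spike.

```typescript
// Hypothetical local mirror of the SDK's permission result shape.
type PermissionDecision =
  | { behavior: "allow"; updatedInput: Record<string, unknown> }
  | { behavior: "deny"; message: string };

// Same tool names as the allowedTools list in section 04.
const REVIEWER_TOOLS = new Set<string>([
  "mcp__bitbucket-api__bb_current_repo",
  "mcp__bitbucket-api__bb_list_pull_requests",
  "mcp__bitbucket-api__bb_get_pull_request",
  "mcp__bitbucket-api__bb_get_pull_request_diff",
  "mcp__bitbucket-api__bb_comment_pull_request",
]);

// Pure policy function so the rule set can be unit-tested without the SDK.
function decideToolUse(
  toolName: string,
  input: Record<string, unknown>
): PermissionDecision {
  if (!REVIEWER_TOOLS.has(toolName)) {
    return { behavior: "deny", message: `Tool ${toolName} is not on the reviewer allowlist` };
  }
  if (toolName.endsWith("bb_comment_pull_request")) {
    const body = input["body"]; // assumed argument name
    if (typeof body !== "string" || body.trim() === "") {
      return { behavior: "deny", message: "Refusing to post an empty comment" };
    }
  }
  // Pass the input through unchanged.
  return { behavior: "allow", updatedInput: input };
}

// Async adapter in the shape the canUseTool option is assumed to expect.
const canUseTool = async (
  toolName: string,
  input: Record<string, unknown>
): Promise<PermissionDecision> => decideToolUse(toolName, input);
```

Keeping the policy in a synchronous pure function means the contract tests can cover it directly, while the async adapter stays a one-liner.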
04 Worker shape: canonical TypeScript pattern
This is the inner loop. One query() call per PR review, wrapped in a Langfuse observation, with caller-side cost enforcement. No hooks.
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";
// startObservation (from @langfuse/tracing) returns a span handle with
// update() and end(); the @langfuse/otel span processor is assumed to be
// registered once at worker bootstrap.
import { startObservation } from "@langfuse/tracing";

// reviewId, repo, prNumber, headSha, userPrompt, sandboxDir, BASE_REVIEW_PROMPT,
// logger, and BudgetExceededError are assumed to be in scope from the job handler.
const controller = new AbortController();
let cumulativeCost = 0;
const PER_REVIEW_USD_CAP = 2.0;

// Wrap the whole call in a Langfuse observation for trace correlation.
// Manual span instrumentation for the TS worker; see §12.
const obs = startObservation("pr-review", {
  metadata: { reviewId, repo, prNumber, headSha }
});

try {
  for await (const msg of query({
    prompt: userPrompt,
    options: {
      model: "claude-sonnet-4-6",        // bare first-party ID (Claude Platform on AWS)
      systemPrompt: BASE_REVIEW_PROMPT,  // CLAUDE.md from the repo composes on top as Memory (see §08)
      cwd: sandboxDir,                   // repo checkout at PR head SHA
      settingSources: ["project"],       // SDK auto-loads .claude/skills/ + CLAUDE.md from cwd
      skills: "all",                     // every SKILL.md the repo defines; per-team review behavior
      abortController: controller,       // caller-side cost cap signal
      mcpServers: {
        "bitbucket-api": {
          type: "stdio",
          command: "node",
          args: ["/app/mcp-servers/bitbucket-api/server.js"]
          // reviewId injected MCP-side, NOT via PreToolUse hook (see §03)
        }
      },
      allowedTools: [
        "mcp__bitbucket-api__bb_current_repo",
        "mcp__bitbucket-api__bb_list_pull_requests",
        "mcp__bitbucket-api__bb_get_pull_request",
        "mcp__bitbucket-api__bb_get_pull_request_diff",
        "mcp__bitbucket-api__bb_comment_pull_request"
      ],
      disallowedTools: ["Bash", "Write", "Edit"],
      permissionMode: "bypassPermissions", // allowedTools enforces; this skips prompt loop
      maxTurns: 25,
      maxBudgetUsd: PER_REVIEW_USD_CAP     // native SDK option (recent release); AbortController polling below is the 2nd layer
      // NO `hooks` block; see §03 for the defensive-design rationale.
    }
  })) {
    if (msg.type === "result") {
      cumulativeCost += msg.total_cost_usd;
      obs.update({ output: {
        cost_usd: msg.total_cost_usd,
        num_turns: msg.num_turns,
        subtype: msg.subtype, // "success" | "error_max_turns" | "error_during_execution"
        stop_reason: msg.stop_reason,
        permission_denials: msg.permission_denials?.length ?? 0
      }});
      if (cumulativeCost > PER_REVIEW_USD_CAP) {
        logger.warn({ event: "budget.exceeded", reviewId, cumulativeCost });
        controller.abort();
        throw new BudgetExceededError(reviewId, cumulativeCost);
      }
    }
  }
} finally {
  // Always close the span, including on abort or thrown errors.
  obs.end();
}
```
Annotation
- permissionMode: "bypassPermissions". allowedTools is the actual enforcement gate; this just skips the SDK's per-tool prompt loop so the worker doesn't wait on user input. Accepted values today: 'default' | 'acceptEdits' | 'bypassPermissions' | 'plan' | 'dontAsk' | 'auto'.
- Repo customization via cwd + settingSources + skills. The SDK auto-discovers CLAUDE.md (Memory) and .claude/skills/<name>/SKILL.md (Skills) from the working directory. No custom prompt-composition function; the SDK does it natively. See §08 for the three-layer composition order (org-default → org-baseline skills → repo overrides).
- No hooks block. Observability flows through the typed message stream + MCP-side reviewId injection. See §03 for the defensive-design rationale and §16 for the week-0 verification spike.
- Dual cost cap. Native maxBudgetUsd: 2.0 in options is the first line; caller-side result.total_cost_usd polling + AbortController is the second. Belt-and-suspenders, because budget enforcement is too load-bearing to depend on a single seam.
- Bare model ID. "claude-sonnet-4-6" works on Claude Platform on AWS. If we fall back to Bedrock (§13) it becomes "anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0"; exact date suffix to verify before Stage 0.
- disallowedTools. Explicit deny on Bash / Write / Edit as defense-in-depth. The allowlist already excludes them; this is the belt-and-suspenders second layer.
05 Per-PR slot state machine
Single Valkey key per in-flight PR: review:slot:{workspace}:{repo}:{prId}. All transitions atomic via Lua scripts. Caps commit storms at ~2 reviews per PR regardless of push frequency.
Implementation notes
- Lua atomicity: all transitions are wrapped in EVAL Lua scripts so check-and-set races can't split a slot.
- Heartbeat: the worker writes lastHeartbeatAt to the slot every 10s during a review.
- Stuck-slot recovery: debounce-drainer detects lastHeartbeatAt < now - 60s on a running slot and force-transitions it back to debouncing for retry (the worker pod was likely OOM-killed).
- Per-PR isolation: slots are keyed by {workspace}:{repo}:{prId}, so unrelated PRs can run in parallel up to the BullMQ concurrency limit.
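The stuck-slot rule above can be sketched as a pure function. This is an illustration only: the Slot shape and function names are hypothetical, timestamps are assumed to be epoch milliseconds, and in the real system the check and the transition would run inside a single Lua script so they stay atomic.

```typescript
type SlotState = "idle" | "debouncing" | "running" | "pending-rerun";

// Hypothetical in-memory view of the Valkey slot value.
interface Slot {
  state: SlotState;
  headSha: string;
  lastHeartbeatAt: number; // epoch milliseconds (assumption)
}

// The drainer treats 60s of heartbeat silence as a dead worker.
const STALE_AFTER_MS = 60_000;

// Only running slots can be stuck; other states have no live worker.
function isStuck(slot: Slot, nowMs: number): boolean {
  return slot.state === "running" && slot.lastHeartbeatAt < nowMs - STALE_AFTER_MS;
}

// Force a stuck slot back to debouncing so the drainer retries it.
function recoverIfStuck(slot: Slot, nowMs: number): Slot {
  return isStuck(slot, nowMs) ? { ...slot, state: "debouncing" } : slot;
}
```

With a 10s heartbeat and a 60s threshold, a worker has to miss roughly six consecutive heartbeats before its slot is reclaimed, which leaves room for a slow tool call without triggering a false recovery.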
06 MCP integration patterns
The Claude Agent SDK supports four MCP transports. Listed here so the choice is explicit; for TVP-6862 we use stdio.
Option 1 — stdio (what we use)
```typescript
mcpServers: {
  "bitbucket-api": {
    type: "stdio", // optional; stdio is the default
    command: "node",
    args: ["/app/mcp-servers/bitbucket-api/server.js"],
    env: { BITBUCKET_APP_PASSWORD, BITBUCKET_USERNAME }
  }
}
```
Bundled in the worker container image, spawned as a long-lived stdio child per worker process, not per review. Lowest latency, no extra service hop. The MCP server has direct memory access to its own state but cannot see the Agent SDK process state.
Option 2 — SSE (remote MCP)
```typescript
mcpServers: {
  "remote-mcp": {
    type: "sse",
    url: "https://mcp.example.com/sse",
    headers: { "Authorization": "Bearer ${TOKEN}" } // ${TOKEN} is a placeholder
  }
}
```
Option 3 — HTTP (remote alternative)
```typescript
mcpServers: {
  "http-mcp": {
    type: "http",
    url: "https://api.example.com/mcp",
    headers: { "X-API-Key": "${KEY}" } // ${KEY} is a placeholder
  }
}
```
Option 4 — SDK in-process (fastest, same memory)
```typescript
import { createSdkMcpServer, tool } from "@anthropic-ai/claude-agent-sdk";

const bitbucket = createSdkMcpServer({
  name: "bitbucket-api",
  version: "1.0.0",
  tools: [/* tool definitions */]
});

// Then, inside the query() options:
mcpServers: { "bitbucket-api": bitbucket }
```
SDK transport shares process memory with the agent — fastest, but no isolation. We don't use it because stdio's separation is the more cautious default for a Bitbucket-credentialed component.
07 Bitbucket MCP server design
Read-only allowlist plus one idempotent comment-upsert. The agent is a reviewer, not an approver. Write tools (bb_create_pull_request, bb_approve_pull_request, bb_merge_pull_request) are explicitly excluded from the exposed surface.
| Tool | Purpose |
|---|---|
| bb_current_repo | Returns workspace + repo_slug. Called first in any flow. |
| bb_list_pull_requests | Open PRs in a repo. Mostly unused in the bot flow (the webhook tells us which PR); useful in dev. |
| bb_get_pull_request | Full PR details: title, description, author, branches, reviewers. |
| bb_get_pull_request_diff | Unified diff text. The primary input to the review. |
| bb_comment_pull_request | Post inline or summary comment. Used by the agent inside the query() loop. |
| bb_upsert_inline_comment | New. Idempotency-marker upsert. Computes hash = sha256(filePath + lineNumber + commentBodyNormalized), prepends <!-- claude-reviewer:inline:{hash} -->, lists existing comments, regex-matches the marker, updates in place if present. Wraps the idempotency logic so the agent doesn't have to reason about it. |
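The marker logic in the bb_upsert_inline_comment row can be sketched as follows. This is a sketch under stated assumptions, not the server's implementation: the whitespace-collapsing normalization, the ExistingComment shape, and the function names are hypothetical. Only the hash inputs and the marker format come from the table above.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: trim and collapse whitespace runs. The spec only
// names "commentBodyNormalized" without defining the rule.
function normalizeBody(body: string): string {
  return body.trim().replace(/\s+/g, " ");
}

// hash = sha256(filePath + lineNumber + commentBodyNormalized), as in the table.
function markerFor(filePath: string, lineNumber: number, body: string): string {
  const hash = createHash("sha256")
    .update(filePath + lineNumber + normalizeBody(body))
    .digest("hex");
  return `<!-- claude-reviewer:inline:${hash} -->`;
}

// Hypothetical minimal view of a comment returned by the Bitbucket API.
interface ExistingComment {
  id: number;
  content: string;
}

const MARKER_RE = /<!-- claude-reviewer:inline:([0-9a-f]{64}) -->/;

// Returns the id of the comment to update in place, or null to create a new one.
function findExisting(comments: ExistingComment[], marker: string): number | null {
  for (const c of comments) {
    const m = MARKER_RE.exec(c.content);
    if (m && `<!-- claude-reviewer:inline:${m[1]} -->` === marker) return c.id;
  }
  return null;
}
```

Because the fields are concatenated without a separator, distinct inputs could in principle produce the same hash input string; if that matters in practice, a delimiter between fields would be a cheap hardening step.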
Auth: the MCP server reads BITBUCKET_APP_PASSWORD + BITBUCKET_USERNAME from env (injected at pod start via External Secrets Operator → AWS Secrets Manager). A dedicated bot Bitbucket account with workspace-level read + PR-comment-write — no human credentials. reviewId is read from a header propagated from the worker and added to every Bitbucket REST call as X-Review-Id for trace correlation.
08 Repo-local config: SDK-native loading
Each repo customizes review behavior through SDK-native primitives. We let the SDK compose at runtime — no custom prompt-composition function, no string-concat of markdown files. Two seams the repo controls:
- CLAUDE.md (root or .claude/CLAUDE.md) → loaded as Memory. Repo context (stack, conventions, "this is a payments microservice on Go 1.22, the canonical PR shape is X") composes on top of our base reviewer system prompt automatically.
- .claude/skills/<name>/SKILL.md → loaded as Skills. Each team's review-behavior bundles: review-security, review-test-coverage, review-migration, etc. Owned by the team that owns the repo.
The whole thing wires together via the worker options (§04):
```typescript
// After checking out the repo at PR head SHA to `sandboxDir`:
options: {
  cwd: sandboxDir,
  settingSources: ["project"],      // SDK auto-discovers .claude/* + CLAUDE.md from cwd
  skills: "all",                    // load every SKILL.md found
  systemPrompt: BASE_REVIEW_PROMPT  // org-default reviewer role string
  // ... mcpServers, allowedTools, maxBudgetUsd, maxTurns as in §04
}
```
Three layers compose, in this order:
- Org-default reviewer role: the BASE_REVIEW_PROMPT string passed to systemPrompt. Baked into the worker container.
- Org-baseline skills: shipped in the worker image at /etc/reviewer/skills/, mounted to ~/.claude/skills/ for the SDK process. Cover the universal review dimensions (basic security checks, missing tests, dead code, etc.) that apply org-wide.
- Repo overrides: CLAUDE.md + .claude/skills/ at the PR head SHA. The team's repo-specific context and tuned skills. Composes on top of org baselines without replacing them.
Cache the sandbox checkout by sha256(sorted content of CLAUDE.md + .claude/**), NOT by {repoSlug}:{headSha}. If two consecutive PRs in the same repo don't change any config or skill file, they reuse the same checkout — saves a Bitbucket round-trip. SDK reloads from cwd each query() call, so the working directory IS the cache surface.
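One way to compute that cache key is sketched below. The ConfigFiles shape and the function name are hypothetical, and including each path (length-prefixed) in the hash is an assumption added here so that renaming a skill file changes the key; the text above only specifies hashing the sorted content.

```typescript
import { createHash } from "node:crypto";

// path -> file content for CLAUDE.md and everything under .claude/ (hypothetical shape)
type ConfigFiles = Record<string, string>;

function configCacheKey(files: ConfigFiles): string {
  const hash = createHash("sha256");
  // Sort by path so the key does not depend on filesystem listing order.
  const entries = Object.entries(files).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [path, content] of entries) {
    // Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    hash.update(`${path.length}:${path}${content.length}:${content}`);
  }
  return hash.digest("hex");
}
```

Two PRs whose config and skill files are byte-identical produce the same key regardless of head SHA, which is the reuse property the caching rule relies on.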
Subagents — the v2 path on the same primitive surface
The SDK also exposes Subagents via the agents: { ... } option (plus "Agent" added to allowedTools). The natural v2 decomposition is four specialized reviewers spawned in parallel from the main agent, each with its own focused prompt + tool surface + budget slice:
- security-reviewer: auth, SQL injection, hardcoded secrets, missing rate limits
- code-quality-reviewer: style, dead code, readability, naming
- test-coverage-reviewer: does new code have tests; do existing tests still pass
- architecture-reviewer: fits repo conventions per CLAUDE.md; respects module boundaries
Why deferred to v2: (a) v1 single-agent + skills is simpler to ship, debug, and budget at Q3 timeline; (b) the right decomposition is best informed by Stage 2 prod data on which review dimensions actually hit vs. miss; (c) per-subagent budget slicing (4 × $0.50 = $2 ceiling instead of single $2 cap) needs operational practice before we commit to it. .claude/agents/<name>.md at the repo level becomes the per-team customization surface for subagent roles in v2 — same SDK-native pattern as skills.
09 Multi-commit handling: slot + debounce + incremental diff
Combines the slot machine (§05) with a 15-second debounce window and an incremental-diff context primer. Net effect: at most 2 reviews per PR regardless of push frequency.
Algorithm
- Webhook arrives → Fastify → Valkey Lua script atomically transitions the slot (idle → debouncing | debouncing → debouncing with timer reset | running → pending-rerun).
- A debounce-drainer worker polls expired debounce slots every 2s, atomically transitions debouncing → running, and enqueues a BullMQ job at the latest headSha.
- The review worker pulls the job, runs query(), posts comments via the MCP, and releases the slot.
- On completion, Lua transitions running → idle, OR pending-rerun → debouncing (15s) if a re-push landed mid-run. The latter case re-enqueues at the new headSha.
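The transitions listed above can be modeled as a small pure function, which is handy for unit-testing the table before it is ported to Lua. The event names are hypothetical labels for the three triggers (webhook receipt, debounce expiry, review completion), and keeping a pending-rerun slot in pending-rerun on a further webhook is an assumption the list does not spell out.

```typescript
type SlotState = "idle" | "debouncing" | "running" | "pending-rerun";
type SlotEvent = "webhook" | "debounce-expired" | "review-complete";

function nextState(state: SlotState, event: SlotEvent): SlotState {
  switch (event) {
    case "webhook":
      // idle or debouncing: (re)start the debounce timer.
      if (state === "idle" || state === "debouncing") return "debouncing";
      // running or pending-rerun: remember that a newer push landed.
      return "pending-rerun";
    case "debounce-expired":
      // Only the drainer moves a slot into running.
      return state === "debouncing" ? "running" : state;
    case "review-complete":
      if (state === "running") return "idle";
      if (state === "pending-rerun") return "debouncing";
      return state; // completion on idle/debouncing is ignored
  }
}
```

For example, four pushes where two land before the first review starts and two land during it walk the slot through debouncing, running, pending-rerun, debouncing, running, and back to idle, so the worker runs twice.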
Incremental diff context
When transitioning pending-rerun → debouncing, persist the prior review's summary comment SHA. The next review's prompt gets prepended:
Prompt fragment:
```text
Previous review was at SHA <x>. The following commits have landed since
that review:
<incremental diff between previous reviewed SHA and current head>
Update the existing summary comment in place. Only add NEW inline comments
for new findings; do not re-post comments on lines you already flagged.
```
This is what makes the re-push experience feel surgical instead of redundant. The agent gets full context (base...head) plus a clear signal about what's actually new (last_reviewed_sha...head), and the bb_upsert_inline_comment tool handles the dedup mechanics on the comment-posting side.
10 Cost controls: defense in depth
| Layer | Limit | Mechanism |
|---|---|---|
| Per-review hard cap | $2.00 / review | Native maxBudgetUsd option (first layer) plus caller-side AbortController + cumulative result.total_cost_usd polling (second layer, see §04). Aborts the query() loop mid-flight on breach. |
| Per-repo daily budget | $5 / repo / day (default) | Valkey counter checked before query() call. Rejects if budget exhausted, emits BudgetExhausted event, posts a "review skipped — daily budget hit" comment. |
| Per-team monthly cap | configurable per team | AWS Budgets on the Claude Platform on AWS workspace. Email + Slack alerts at 50%, 80%, 100% of monthly target. |
| Kill switch | 0 (emergency) | Single Valkey key flips. See §11. |
Three layers because no single layer catches every failure shape. Per-call catches runaway loops on a single pathological PR. Per-repo-daily catches sustained prompt-config regressions on one repo before they bleed budget. Per-team-monthly is the org-level backstop. The kill switch is the emergency cut.
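The per-repo daily check can be sketched with an in-memory map standing in for the Valkey counter. The key format, function names, and the day-string parameter are all hypothetical; a production version would use an atomic increment with a TTL so the counter expires on its own.

```typescript
const DEFAULT_DAILY_CAP_USD = 5; // $5 / repo / day default from the table

// In-memory stand-in for the Valkey counter.
const spend = new Map<string, number>();

// Hypothetical key format: one counter per repo per calendar day.
function budgetKey(repo: string, day: string): string {
  return `budget:${repo}:${day}`;
}

// Checked before each query() call; false means the review is skipped.
function canStartReview(repo: string, day: string, capUsd = DEFAULT_DAILY_CAP_USD): boolean {
  return (spend.get(budgetKey(repo, day)) ?? 0) < capUsd;
}

// Called after a review finishes, with its total_cost_usd.
function recordSpend(repo: string, day: string, costUsd: number): void {
  const key = budgetKey(repo, day);
  spend.set(key, (spend.get(key) ?? 0) + costUsd);
}
```

Because the check runs before the call and the spend is recorded after it, a repo can overshoot the daily cap by at most one review's cost, which the $2 per-review cap bounds.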
11 Kill switch contract
Single Valkey key, checked on every webhook receipt AND every BullMQ job pickup. First action in every runbook.
```shell
# Flip on: any on-call engineer can do this
redis-cli -h <valkey> SET bot:killswitch:enabled true
# Watch workers drain (≈30s)
kubectl logs -n tvp-loop -l app=loop-agent --tail=50 -f | grep KillSwitchEngaged
# After all workers idle, PRs no longer reviewed and no errors posted.
# Flip off when ready.
redis-cli -h <valkey> SET bot:killswitch:enabled false
```
Worker behavior on engaged kill switch:
- Webhook ingress returns 503 killswitch_engaged; Bitbucket retries per its own policy.
- In-flight query() calls complete (do not abort mid-comment-post; that would create orphan inline comments).
- On job pickup, the BullMQ worker returns the job to the queue instead of starting it and emits a KillSwitchEngaged structured-log event with reviewId.
- Pods drain on SIGTERM within a 60-second graceful period.
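The ingress side of this contract can be sketched as a pure decision over the raw key value. The function names and the 202 status for the accepted path are hypothetical, and treating a missing key as "not engaged" is an assumption; the alternative (fail closed) is a policy choice the runbook would need to state.

```typescript
interface IngressDecision {
  status: number;
  body: string;
}

// Valkey GET returns the stored string, or null when the key is absent.
// Assumption: only the exact string "true" engages the switch.
function isKillSwitchEngaged(raw: string | null): boolean {
  return raw === "true";
}

// Webhook handler decision: reject with 503 while the switch is engaged.
function decideIngress(raw: string | null): IngressDecision {
  return isKillSwitchEngaged(raw)
    ? { status: 503, body: "killswitch_engaged" }
    : { status: 202, body: "accepted" }; // 202 is an illustrative choice
}
```

The same isKillSwitchEngaged check can be reused at BullMQ job pickup, so both checkpoints read the key the same way.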
12 Observability: Langfuse via OpenTelemetry
Two instrumentation paths into the same self-hosted Langfuse instance (per ATF-76). Verify the minimum platform version with the ATF-76 owners before Stage 0 — current self-hosted Langfuse with OTel ingestion enabled is the target.
Python path (eval harness and sub-services)
```shell
pip install langfuse claude-agent-sdk \
  "langsmith[claude-agent-sdk]" "langsmith[otel]"
```
```python
# env:
#   LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY
#   LANGFUSE_BASE_URL=https://<samba-langfuse-host>
#   LANGSMITH_OTEL_ENABLED=true
from langfuse import get_client
from langsmith.integrations.claude_agent_sdk import configure_claude_agent_sdk

langfuse = get_client()
assert langfuse.auth_check()
configure_claude_agent_sdk()  # spans auto-flow to Langfuse via OTel
```
Note: the instrumentation library is langsmith.integrations.claude_agent_sdk — yes, that's LangSmith's open-source library. The spans flow to Langfuse via OpenTelemetry because both vendors share the OTel surface. We use Langfuse for storage and UI; LangSmith is just the instrumentation seam.
TypeScript path (the worker — Fynn's stack)
No verified TS sibling of the Python instrumentation library exists yet. Day-1 approach: manual span instrumentation with @langfuse/otel. ~50 lines around the query() call (see §04 for the wrapper). If openinference-instrumentation-claude-agent-sdk matures a TS distribution, adopt that and remove the manual layer.
What you get
- Every model call — input messages, output, model ID, token counts (input / output / cache).
- Every tool use — name, input args, result, latency.
- Session/trace correlation by reviewId: filter the dashboard by repo, PR number, or reviewId.
- Cost tracking: tokens × pricing table, per-trace.
- Replay UI — step through any review turn-by-turn after the fact.
Why self-hosted
ATF-76 standardization + data residency for Security (traces never leave AWS). Bonus: the same Langfuse instance hosts both agent traces AND eval harness runs — one observability backend, two use cases.
13 Claude Platform on AWS vs Bedrock
Different products. Easy to confuse because both are "Claude on AWS". The first is the recommended path; the second is the documented fallback.
| Concern | Claude Platform on AWS | Amazon Bedrock |
|---|---|---|
| Operator | Anthropic-operated | AWS-operated |
| Env vars | CLAUDE_CODE_USE_ANTHROPIC_AWS=1 + ANTHROPIC_AWS_WORKSPACE_ID | CLAUDE_CODE_USE_BEDROCK=1 |
| Model ID format | claude-sonnet-4-6 (bare first-party) | anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0 — exact date suffix to verify via aws bedrock list-foundation-models before Stage 0 |
| Auth | SigV4 + AWS IAM (IRSA) | AWS IAM (IRSA) with bedrock:InvokeModel |
| API parity | Same-day with first-party | Lags 2-4 weeks |
| Managed Agents | Available | Not available |
| Server-side tools | Available | Some unavailable |
| Billing | AWS Marketplace | AWS native |
| Endpoint | aws-external-anthropic.{region}.api.aws/v1/... | bedrock-runtime.{region}.amazonaws.com |
Recommendation: Claude Platform on AWS for new builds (this is what TVP-6862 should target). Bedrock is the documented fallback if the Platform-on-AWS workspace isn't provisioned in Samba's AWS account.
The exact IAM action set for Claude Platform on AWS is NOT bedrock:InvokeModel — it's a separate action namespace documented at platform.claude.com/docs/en/api/claude-platform-on-aws-iam-actions.md. We MUST verify those action names against the live doc before applying the IRSA policy; the IAM block in build-and-deploy.md currently has a TODO placeholder.
14 Failure mode catalog
Nine production scenarios. Detection signal is what triggers the alert; mitigation is what we ship on day one to keep blast radius small.
| # | Failure | Trigger | Detection | Mitigation |
|---|---|---|---|---|
| F1 | Cost runaway, single review | Pathological diff sends review into recursive tool loop | result.subtype + cost spike on the trace | Caller-side AbortController at $2 (see §04) |
| F2 | Cost runaway, sustained | Prompt change increases avg tokens 5× | Prometheus histogram on total_cost_usd p95 | Per-repo daily Valkey budget + alert on rolling p95 |
| F3 | Runaway turn loop | Agent stuck retrying a failing MCP call | result.subtype = "error_max_turns" | maxTurns: 25 cap; post truncated summary |
| F4 | Duplicate inline comments on re-push | Idempotency marker logic broken | Engineer feedback / Slack reactions | bb_upsert_inline_comment with hash marker + contract tests in CI |
| F5 | MCP server crash mid-review | Bitbucket API rate-limit → child exits | result.subtype = "error_during_execution" + MCP stderr | DLQ + auto-retry with exponential backoff; post "review temporarily unavailable" summary |
| F6 | Prompt regression ships | Engineer edits CLAUDE.md or a SKILL.md; comment quality drops 40% | Eval gate failure rate | Required Bitbucket Pipelines status check on prompt-config repo PRs |
| F7 | Prompt injection in diff | Diff includes injection payload trying to escape allowlist | permission_denials array on result | Hard-deny at SDK layer (disallowedTools) + structural eval fixture with injection attempt |
| F8 | Silent webhook backlog | BullMQ consumer crashes | oldest_unacked_message_age > 5min alert | DLQ + queue-depth alert + auto-restart on liveness probe |
| F9 | Anthropic API spend cap hit | Per-team monthly cap reached | HTTP 429/403 in result.subtype | AWS Budgets pre-alerts at 80%; fallback to "minimal review" mode |
15 Escape hatch: RUNTIME_MODE=cli
Insurance against SDK breakage. The worker has two runtime drivers behind a single env-flagged factory:
```typescript
// services/pr-review-worker/src/runtime/index.ts
import { runWithSdk } from "./sdk";
import { runWithCli } from "./cli";

// Any value other than "cli" (including unset) selects the SDK driver.
export const runReview = process.env.RUNTIME_MODE === "cli"
  ? runWithCli
  : runWithSdk;
```
Default is sdk. If Anthropic ships a breaking SDK change mid-pilot, flip RUNTIME_MODE=cli via ConfigMap, restart pods, and the worker spawns @anthropic-ai/claude-code as a subprocess until we patch.
SDK upgrade discipline
- Pin to exact version. Never ^x.y.z. Patch bumps land in the lockfile only after CI verification.
- bot-sdk-bump Bitbucket Pipelines job: runs the eval gate against any new SDK version before promotion.
- 2-week soak in staging before merging SDK upgrades to main.
- CLI adapter exercised weekly in CI so it doesn't bitrot. If we ever need it, we know it works.
16 Verification items: week 0
Concrete checks to run before Stage 0 (shadow mode) begins. Each is a 15-minute spike, not a research project.
- Confirm Samba has a Claude Platform on AWS workspace provisioned in the AWS account where CAP runs. Set + read ANTHROPIC_AWS_WORKSPACE_ID end-to-end.
- Confirm Samba's self-hosted Langfuse instance supports OTel ingestion from CAP pods. Cross-check the minimum platform version with the ATF-76 owners.
- Look up the exact IAM action set for Claude Platform on AWS (NOT bedrock:InvokeModel; see §13). Update the IRSA policy from the TODO placeholder in build-and-deploy.md.
- End-to-end smoke test the canUseTool callback + MCP-side reviewId injection pattern against a sandbox MCP server. Verify no persona / observability surprises per §03.
- Spike-test the maxTurns boundary behavior: confirm the SDK emits the partial result message with subtype: "error_max_turns" when the cap is hit (we rely on this for F3 detection).
- Spike-test the new native maxBudgetUsd option against a deliberately-expensive synthetic review: confirm the SDK terminates cleanly at the cap, and that our caller-side AbortController polling fires correctly as the second-layer guard.
- Confirm the exact Bedrock model ID for Claude Sonnet 4.6 (anthropic.claude-sonnet-4-6-<YYYYMMDD>-v1:0) via aws bedrock list-foundation-models --region <cap-region>. Update build-and-deploy.md + this doc's §13 if the date suffix differs from the current placeholder.