← Hub · TVP-6862 · Product Spec

Agent Loop — Bitbucket Claude PR Reviewer

Draft for Review
Drafted 2026-05-18
Review 2026-05-19 · 8:00 PT
PM Sid Dani
Engineering Fynn Chen
Sponsor Tony Tseng
Ship an automated Claude review on every Bitbucket PR before peer review starts. First production deployment of the Agent Loop pattern at Samba — one runtime, extending to PRD review, ticket triage, and release notes through Q3–Q4.

01Problem

Samba ships ~200 PRs per week across ~80 active repositories. Peer review is the bottleneck — engineers wait hours-to-days for a first pass, and security or correctness issues sometimes slip past review entirely. The cost compounds at our scale: ~150 engineers each losing an hour per week to review latency is ~600 reviewer-hours per quarter we can never get back.

Every PR should get a competent first pass within seconds of opening — flagging the obvious before a human ever looks. That changes peer review from "first look" to "second opinion," which is what it should have been all along.

02Scope

In Scope (v1)

  • Automated Claude review on Bitbucket Cloud pullrequest:created/updated webhooks
  • Inline + summary comments with idempotency markers (re-pushes update in place)
  • Per-repo behavior via CLAUDE.md (context) + .claude/skills/ (behavior) — loaded natively by the SDK
  • Per-repo daily budget caps + global kill switch
  • Quality gate: offline fixture evals block prompt-config regressions

Out of Scope (v1)

  • Post-merge review (only pre-merge PR review)
  • IDE integration / Cursor-style inline suggestions
  • Auto-apply of fix suggestions — human still merges
  • Comment threading and back-and-forth on follow-ups
  • Bitbucket Server (self-hosted), GitHub, or GitLab support

03Success Metrics (Q3 Commitment)

Metric Target Measurement
Pilot adoption 5 repos onboarded by end of Q3 Per-repo CLAUDE.md or .claude/skills/ commits in Bitbucket
Review latency 80% of PRs reviewed within 30s of webhook Langfuse trace duration, webhook receipt to comment post
Comment quality <2% false-positive rate on high-severity flags Engineer feedback survey + Slack-bot reactions
Cost discipline Zero budget overruns; p95 cost < $0.50 / review AWS Budgets, Langfuse cost histograms
Hours saved ~600 reviewer-hours / Q3 at 4–5× ROI vs. API spend Modeled from PR count × catch-rate × avg review minutes

04Architecture

Single-service shape on Samba's existing CAP EKS cluster. Bitbucket webhooks land on a Fastify endpoint with HMAC verification, get queued through BullMQ on ElastiCache Valkey, and a worker pool processes each PR by calling the Claude Agent SDK against an MCP server that exposes a tight allowlist of Bitbucket REST operations.

1

Webhook ingress

Bitbucket Cloud sends pullrequest:created / updated webhooks. A Fastify endpoint verifies the HMAC signature, then hands off.

15s debounceper-PR slot
2

Job queue

BullMQ on ElastiCache Valkey. Per-repo concurrency limits keep one busy repo from starving the others.

3

Agent loop

A worker pool calls the Claude Agent SDK query() against an in-container MCP server that exposes a tight Bitbucket REST allowlist.

  • maxTurns: 25
  • permissionMode: 'bypassPermissions' + allowedTools allowlist
  • maxBudgetUsd: $2 + caller-side AbortController (two-layer cap)
4

Bitbucket comments

The MCP posts inline + summary comments via Bitbucket REST. Idempotency markers let re-pushes update existing comments in place.

Auth: Claude Platform on AWS — pure AWS IAM via IRSA, no Anthropic API key to rotate. Observability: Langfuse (the same self-hosted instance the AI Task Force evals platform runs on, per ATF-76).

05Key Architecture Decisions

Concern Decision Why this, not alternatives
Runtime Claude Agent SDK query() loop Typed budget / turn / permission controls. CLI subprocess spawn forces reinventing these via stdout parsing and is brittle to CLI version drift.
Auth Claude Platform on AWS One AWS bill, no Anthropic API key to rotate, same-day API feature parity with first-party. Bedrock is the documented fallback but lags 2–4 weeks on features.
Multi-commit policy Per-PR slot + 15s debounce + incremental diff context Caps commit storms at ~2 reviews per PR regardless of push frequency. Without this, every push enqueues a redundant review and cost compounds linearly.
Quality gate Fixture evals + LLM-as-judge, blocking at weighted_score < 0.70 Without an offline eval gate, prompt changes ship blind. ~$0.30 per run, ~5 fixtures to start, expands via prod-miss feedback loop.

06User Stories

Engineer pushing a PR

As an engineer opening a PR, I want Claude's inline review within 30 seconds, so I can iterate on obvious issues before tagging a peer reviewer.

Team lead with repo-specific conventions

As a team lead, I want to drop a CLAUDE.md or a .claude/skills/ entry in my repo to teach Claude my codebase's conventions and review behavior, so I can tune review behavior without filing a central PM-team ticket.

On-call SRE responding to incident

As an on-call engineer, I want to flip a single Valkey kill switch, so I can drain in-flight reviews and stop new ones within 30 seconds.

07Rollout — 4-Stage Canary

Each stage has hard reliability criteria before advancing. The Valkey kill switch is valid at every stage from week −1 onward.

Week −1 · Stage 0

Shadow

30 curated fixture PRs at weighted_score ≥ 0.75. 100 shadow reviews against synthetic webhook traffic. Validate all 9 failure modes.

Week 1 · Stage 1

One pilot repo

Single low-traffic repo with a friendly owner. 25+ real PRs over 5 business days. Exit: zero unhandled errors, p95 cost < $0.50.

Weeks 2–3 · Stage 2

Five repos

Mix of high- and low-traffic, including one team that's a friendly skeptic. Daily alert review. 3+ prompt-config PRs pass eval gate.

Weeks 4–6 · Stage 3

10–20 opt-in

Per-repo CLAUDE.md or .claude/skills/ required (no default-on yet). Weekly cost p95 stable across 14 days, error rate < 2%, zero on-call pages.

Week 7+ · Stage 4

Org-wide opt-out

Default-on with documented opt-out. Conservative per-repo daily budget defaults ($5/day). Per-team AWS Budgets configured.

08Cost Ceiling — Defense in Depth

Baseline (Q3)
~$5,200/yr
200 PRs/week × $0.50 avg/review.
Hard ceiling
~$20,800/yr
Worst case: every review hits the $2 cap. Caller-side AbortController enforces.
Per-repo default
$5/day
Valkey counter checked before each query. Conservative; repo owners can raise.
Three layers: per-call ($2 via AbortController) · per-repo per-day ($5 in Valkey) · per-team per-month (AWS Budgets).

09Risks & Mitigations

ID Risk Mitigation
R1 Anthropic ships a breaking change to the Agent SDK mid-pilot. Pin to exact version (never ^x.y.z); SDK-bump CI job runs eval gate before promote; 2-week soak; RUNTIME_MODE=cli env-flagged escape hatch maintained from day one.
R2 Cost-runaway incident before alerts are wired up. Stage 0 shadow mode + per-review AbortController + per-repo daily budget enforced from day one. Cost can't run away because three layers gate it.
R3 Eval gate ships but the prod-miss feedback loop loses cadence and the fixture set goes stale. Weekly missed_issues review on PM calendar from week 1. Slack bot for engineers to tag missed catches. Auto-generated fixtures from the database.
R4 Capacity: Q3 timeline requires the eval harness + observability work to run in parallel with the runtime + EKS build. Open for discussion in this meeting. Either a second engineering contributor for ~6 weeks, or a 3-week timeline slip with one engineer doing both serially.
Decision Block

Open Decisions for This Meeting

Five calls we need to make today. Recommendations on each line; the room owns the ratify.

D1

Runtime: Agent SDK, Claude Code CLI subprocess, or Flue Framework?

Recommendation: Claude Agent SDK. Typed control surface for budget, turns, and permissions; same vendor incentive as the model. Flue is 3 months old and explicitly experimental — not appropriate for a 150-engineer hot path.

D2

Multi-commit handling — add as new Acceptance Criterion to TVP-6862?

Recommendation: Yes. Per-PR slot + 15s debounce + incremental diff context. Without this, the cost model breaks the first time a developer pushes five commits in a sprint demo.

D3

Eval gate as a blocking Bitbucket Pipelines status check — new AC?

Recommendation: Yes. Blocks prompt-config repo merges if weighted_score < 0.70. At 200 PRs/week scale, prompt regressions need a gate, not a smoke test.

D4

Pilot repo selection — which team goes first?

Tony's call based on repo-owner relationships. Looking for a low-traffic repo with a team lead who'll file good feedback and own the rollback if needed.

D5

Capacity — second engineering contributor for parallel workstreams?

Recommendation: yes, ~6-week scope covering eval harness, multi-commit policy, and observability while Fynn lands the runtime + EKS work. Alternative: 3-week timeline slip with one engineer doing both serially.

11Timeline

Decision day

2026-05-19

Architecture review — this meeting.

First pilot PR reviewed

Early July

~6.5 weeks from decision. One repo, < 25 real PRs over 5 days.

Org-wide opt-out

Mid-August

Default-on with documented opt-out path.