As an engineer opening a PR, I want Claude's inline review within 30 seconds, so I can iterate on obvious issues before tagging a peer reviewer.
Samba ships ~200 PRs per week across ~80 active repositories. Peer review is the bottleneck — engineers wait hours-to-days for a first pass, and security or correctness issues sometimes slip past review entirely. The cost compounds at our scale: ~150 engineers each losing an hour per week to review latency is ~600 reviewer-hours per quarter we can never get back.
Every PR should get a competent first pass within seconds of opening — flagging the obvious before a human ever looks. That changes peer review from "first look" to "second opinion," which is what it should have been all along.
| Metric | Target | Measurement |
|---|---|---|
| Pilot adoption | 5 repos onboarded by end of Q3 | Per-repo CLAUDE.md or .claude/skills/ commits in Bitbucket |
| Review latency | 80% of PRs reviewed within 30s of webhook | Langfuse trace duration, webhook receipt to comment post |
| Comment quality | <2% false-positive rate on high-severity flags | Engineer feedback survey + Slack-bot reactions |
| Cost discipline | Zero budget overruns; p95 cost < $0.50 / review | AWS Budgets, Langfuse cost histograms |
| Hours saved | ~600 reviewer-hours / Q3 at 4–5× ROI vs. API spend | Modeled from PR count × catch-rate × avg review minutes |
Single-service shape on Samba's existing CAP EKS cluster. Bitbucket webhooks land on a Fastify endpoint with HMAC verification, get queued through BullMQ on ElastiCache Valkey, and a worker pool processes each PR by calling the Claude Agent SDK against an MCP server that exposes a tight allowlist of Bitbucket REST operations.
Bitbucket Cloud sends pullrequest:created / updated webhooks. A Fastify endpoint verifies the HMAC signature, then hands off.
BullMQ on ElastiCache Valkey. Per-repo concurrency limits keep one busy repo from starving the others.
A worker pool calls the Claude Agent SDK query() against an in-container MCP server that exposes a tight Bitbucket REST allowlist.
maxTurns: 25permissionMode: 'bypassPermissions' + allowedTools allowlistmaxBudgetUsd: $2 + caller-side AbortController (two-layer cap)The MCP posts inline + summary comments via Bitbucket REST. Idempotency markers let re-pushes update existing comments in place.
Auth: Claude Platform on AWS — pure AWS IAM via IRSA, no Anthropic API key to rotate. Observability: Langfuse (the same self-hosted instance the AI Task Force evals platform runs on, per ATF-76).
| Concern | Decision | Why this, not alternatives |
|---|---|---|
| Runtime | Claude Agent SDK query() loop | Typed budget / turn / permission controls. CLI subprocess spawn forces reinventing these via stdout parsing and is brittle to CLI version drift. |
| Auth | Claude Platform on AWS | One AWS bill, no Anthropic API key to rotate, same-day API feature parity with first-party. Bedrock is the documented fallback but lags 2–4 weeks on features. |
| Multi-commit policy | Per-PR slot + 15s debounce + incremental diff context | Caps commit storms at ~2 reviews per PR regardless of push frequency. Without this, every push enqueues a redundant review and cost compounds linearly. |
| Quality gate | Fixture evals + LLM-as-judge, blocking at weighted_score < 0.70 | Without an offline eval gate, prompt changes ship blind. ~$0.30 per run, ~5 fixtures to start, expands via prod-miss feedback loop. |
As an engineer opening a PR, I want Claude's inline review within 30 seconds, so I can iterate on obvious issues before tagging a peer reviewer.
As a team lead, I want to drop a CLAUDE.md or a .claude/skills/ entry in my repo to teach Claude my codebase's conventions and review behavior, so I can tune review behavior without filing a central PM-team ticket.
As an on-call engineer, I want to flip a single Valkey kill switch, so I can drain in-flight reviews and stop new ones within 30 seconds.
Each stage has hard reliability criteria before advancing. The Valkey kill switch is valid at every stage from week −1 onward.
30 curated fixture PRs at weighted_score ≥ 0.75. 100 shadow reviews against synthetic webhook traffic. Validate all 9 failure modes.
Single low-traffic repo with a friendly owner. 25+ real PRs over 5 business days. Exit: zero unhandled errors, p95 cost < $0.50.
Mix of high- and low-traffic, including one team that's a friendly skeptic. Daily alert review. 3+ prompt-config PRs pass eval gate.
Per-repo CLAUDE.md or .claude/skills/ required (no default-on yet). Weekly cost p95 stable across 14 days, error rate < 2%, zero on-call pages.
Default-on with documented opt-out. Conservative per-repo daily budget defaults ($5/day). Per-team AWS Budgets configured.
| ID | Risk | Mitigation |
|---|---|---|
| R1 | Anthropic ships a breaking change to the Agent SDK mid-pilot. | Pin to exact version (never ^x.y.z); SDK-bump CI job runs eval gate before promote; 2-week soak; RUNTIME_MODE=cli env-flagged escape hatch maintained from day one. |
| R2 | Cost-runaway incident before alerts are wired up. | Stage 0 shadow mode + per-review AbortController + per-repo daily budget enforced from day one. Cost can't run away because three layers gate it. |
| R3 | Eval gate ships but the prod-miss feedback loop loses cadence and the fixture set goes stale. | Weekly missed_issues review on PM calendar from week 1. Slack bot for engineers to tag missed catches. Auto-generated fixtures from the database. |
| R4 | Capacity: Q3 timeline requires the eval harness + observability work to run in parallel with the runtime + EKS build. | Open for discussion in this meeting. Either a second engineering contributor for ~6 weeks, or a 3-week timeline slip with one engineer doing both serially. |
Five calls we need to make today. Recommendations on each line; the room owns the ratify.
Recommendation: Claude Agent SDK. Typed control surface for budget, turns, and permissions; same vendor incentive as the model. Flue is 3 months old and explicitly experimental — not appropriate for a 150-engineer hot path.
Recommendation: Yes. Per-PR slot + 15s debounce + incremental diff context. Without this, the cost model breaks the first time a developer pushes five commits in a sprint demo.
Recommendation: Yes. Blocks prompt-config repo merges if weighted_score < 0.70. At 200 PRs/week scale, prompt regressions need a gate, not a smoke test.
Tony's call based on repo-owner relationships. Looking for a low-traffic repo with a team lead who'll file good feedback and own the rollback if needed.
Recommendation: yes, ~6-week scope covering eval harness, multi-commit policy, and observability while Fynn lands the runtime + EKS work. Alternative: 3-week timeline slip with one engineer doing both serially.
Architecture review — this meeting.
~6.5 weeks from decision. One repo, < 25 real PRs over 5 days.
Default-on with documented opt-out path.