Field Resources · VRJ DevRel Portfolio

When production is the actual product.

Practitioner-grade playbooks and cookbooks for teams closing the gap between AI demos and deployed systems. Built from field observation, not vendor decks.

Playbook · Strategic Motion

The Demo-to-Production Field Playbook

A six-phase motion for DevRel teams shipping AI features from controlled demo into enterprise-grade production. Designed for teams where the gap is the problem, not the model.

Audience: DevRel / PM / Eng · Stage: Post-launch · Scope: 6–12 weeks
  • 1 Signal Audit Map where demo assumptions collide with real data distributions. Instrument before you optimize.
  • 2 Grounding Verification Validate retrieval fidelity against production corpora. Demo RAG ≠ production RAG.
  • 3 Cost Baseline Token cost at demo volume vs. projected production volume. Surface the math early.
  • 4 Failure Mode Taxonomy Classify prompt instability, hallucination vectors, and edge-case exposure before GA.
  • 5 Developer Enablement Layer Cookbooks, SDKs, and reference architectures that meet developers at their real environment.
  • 6 Trust Signal Distribution Publish findings, benchmarks, and practitioner narratives. Trust is earned in the open.
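Phase 3's math can be surfaced in a few lines. A minimal sketch, with entirely illustrative volumes and a placeholder per-token price (substitute your provider's real rates and your real traffic projections):

```python
# Hypothetical cost projection: demo volume vs. projected production volume.
# All numbers below are illustrative placeholders, not real rates or traffic.

def monthly_token_cost(requests_per_day, tokens_per_request, price_per_1k_tokens):
    """Project monthly token spend from daily request volume."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1000 * price_per_1k_tokens

demo_cost = monthly_token_cost(requests_per_day=50, tokens_per_request=1_500,
                               price_per_1k_tokens=0.01)
prod_cost = monthly_token_cost(requests_per_day=20_000, tokens_per_request=1_500,
                               price_per_1k_tokens=0.01)

print(f"demo: ${demo_cost:,.2f}/mo")
print(f"prod: ${prod_cost:,.2f}/mo ({prod_cost / demo_cost:.0f}x the demo figure)")
```

The point of the exercise is the ratio, not the absolute dollars: a demo that costs pocket change can hide a production bill two orders of magnitude larger.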
  • ↓ 40% support tickets from "it worked in demo"
  • faster production onboarding vs. undocumented launch

Cookbook · Executable Recipe

Audit Your AI Demo Before You Ship It

A step-by-step recipe for engineers who inherited a demo that "works great" and need to prove — or disprove — that claim before production go-live.

Audience: Engineers / ML Ops · Time: ~90 min · Tools: CLI + local
  • Step 1 — Baseline the prompt. Run your demo prompt 20× against production data. Log variance in output structure, not just content.
  • Step 2 — Estimate real token cost. Use token-estimator with your actual production payload. Compare to your demo estimate. Delta >2× = red flag.
  • Step 3 — Grounding check. Pull 5 retrieved chunks from your prod corpus. Do they match what your demo assumed? Confirm with contexteval --verify.
  • Step 4 — Inject edge cases. Feed 3 adversarial inputs your demo was never shown. Run trace-bench --profile instability. Document what breaks.
  • Step 5 — Write your honest ship/no-ship memo. One paragraph. Surface it to your team before GA, not after the first production incident.
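Step 1 is the one teams most often skip, so here is a minimal sketch of it. `call_model` is a stand-in for your real SDK call, and the simulated instability is contrived for illustration; the useful idea is reducing each output to a structural signature before counting:

```python
# Step 1 sketch: measure *structural* variance across repeated runs.
# `call_model` is a stub standing in for your real model call.
import json
from collections import Counter

def call_model(prompt, run):
    # Stub: simulates a model that intermittently drops a field.
    out = {"answer": "...", "sources": ["doc-1"]}
    if run % 7 == 0:  # contrived instability for the demo
        out.pop("sources")
    return json.dumps(out)

def structure_signature(raw):
    """Reduce an output to its shape: sorted top-level keys, or 'non-json'."""
    try:
        return tuple(sorted(json.loads(raw)))
    except json.JSONDecodeError:
        return ("non-json",)

signatures = Counter(
    structure_signature(call_model("demo prompt", run)) for run in range(20)
)
print(signatures)  # more than one signature = structural instability
```

More than one signature in the counter is exactly the variance Step 1 tells you to log: content can vary; structure should not.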
  • 1 honest audit doc your team can actually act on
  • 0 surprises post-launch that were actually visible pre-launch
Evaluation Infrastructure

Playbook · Strategic Motion

The Evaluation Practice Playbook

A motion for teams that have run evals once and called it done. Building a repeatable evaluation practice is an infrastructure problem — not a model problem.

Audience: ML Ops / DevRel / Eng · Stage: Pre-scale · Scope: Ongoing
  • 1 Define What You're Actually Measuring Accuracy of what? Against which ground truth? Most teams skip this and wonder why evals disagree with production behavior.
  • 2 Build a Stable Test Corpus Curated, versioned, adversarially representative. If your eval set drifts, your benchmark is lying to you.
  • 3 Instrument for Reproducibility Same prompt, same model version, same temperature — every run. Variance you can't explain is debt you'll pay later.
  • 4 Establish a Regression Gate Before any model update ships, evals run. Not after. Not "when we have time." Gate the pipeline.
  • 5 Publish Internal Benchmarks Share eval results with the team in a format they can act on. A number without context is just noise.
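Phase 4's gate can be sketched in a few lines of pipeline logic. The metric names, baseline values, and thresholds below are illustrative placeholders; the shape that matters is comparing every candidate run against a pinned baseline before anything ships:

```python
# Minimal regression-gate sketch. Metric names, baselines, and thresholds
# are illustrative; pin yours from your own baseline eval run.

BASELINE = {"format_compliance": 0.98, "hallucination_rate": 0.04}
ALLOWED_DRIFT = {"format_compliance": -0.02, "hallucination_rate": 0.02}

def gate(candidate_scores):
    """Return (passed, failures) comparing candidate scores to the baseline."""
    failures = []
    for metric, baseline in BASELINE.items():
        delta = candidate_scores[metric] - baseline
        # hallucination_rate regresses upward; format_compliance downward
        if metric == "hallucination_rate" and delta > ALLOWED_DRIFT[metric]:
            failures.append((metric, delta))
        elif metric == "format_compliance" and delta < ALLOWED_DRIFT[metric]:
            failures.append((metric, delta))
    return (not failures, failures)

passed, failures = gate({"format_compliance": 0.95, "hallucination_rate": 0.05})
print("SHIP" if passed else f"BLOCK: {failures}")
```

Wire the equivalent of `gate` into CI so a model update physically cannot merge without an eval run, and "not after, not when we have time" stops being a policy and becomes a pipeline.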
  • eval run per model change, no exceptions
  • production regressions caught post-deploy

Cookbook · Executable Recipe

Run Your First Repeatable LLM Benchmark

A recipe for engineers who've been told "we need evals" and handed no infrastructure to run them. First benchmark in under two hours, repeatable forever.

Audience: Engineers / ML Ops · Time: ~2 hrs · Tools: BenchKit + CLI
  • Step 1 — Pick one behavior to measure. Not "quality." One behavior: hallucination rate, format compliance, refusal rate. Narrow wins.
  • Step 2 — Build a 20-prompt test set. 15 representative, 5 adversarial. Save to eval-corpus-v1.json. Version it from day one.
  • Step 3 — Run baseline with BenchKit. benchkit run --corpus eval-corpus-v1.json --model [your-model]. Log the output. This is your zero-point.
  • Step 4 — Change one variable. Model version, temperature, or prompt wording. Run again. Compare delta. Now you have signal, not opinion.
  • Step 5 — Write the one-sentence finding. "At temp 0.7, hallucination rate increased 18% vs. temp 0.3." Ship that to your team. That's your eval culture starting.
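BenchKit handles the run loop for you, but it helps to see its shape. A hand-rolled sketch of Steps 2–3, with `call_model` as a stub and a contrived corpus standing in for your real `eval-corpus-v1.json`; the measured behavior here is format compliance (Step 1's "one behavior"):

```python
# Minimal benchmark-harness sketch: versioned corpus, one metric, one baseline.
# `call_model` is a stub; in practice load the corpus from eval-corpus-v1.json.
import json

corpus = {
    "version": "v1",
    "prompts": [
        {"id": i, "text": f"prompt {i}", "adversarial": i >= 15} for i in range(20)
    ],
}

def call_model(prompt):
    # Stub: in this sketch, adversarial prompts break the expected JSON format.
    return '{"answer": "ok"}' if not prompt["adversarial"] else "Sorry, I can't"

def format_compliant(raw):
    """The one behavior under test: does the output parse as JSON?"""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

results = [
    {"id": p["id"], "compliant": format_compliant(call_model(p))}
    for p in corpus["prompts"]
]
rate = sum(r["compliant"] for r in results) / len(results)
print(f"format compliance @ {corpus['version']}: {rate:.0%}")  # your zero-point
```

Log that number, change exactly one variable (Step 4), and run again. The delta between two runs of the same versioned corpus is signal; a single run is trivia.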
  • 1 versioned corpus you can run forever
  • repeatable — same corpus, any model, any time
Model Version Regression

Playbook · Strategic Motion

The Silent Drift Playbook

Your model updated. You didn't get a changelog. Your outputs degraded quietly for days before anyone noticed. This is the motion that catches it before your users do.

Audience: ML Ops / Eng / DevRel · Stage: Always-on · Scope: Continuous
  • 1 Version Lock Your Baselines Every model you call in production gets a pinned baseline run. Behavior at version X is your ground truth. Not "what it should do" — what it actually did.
  • 2 Instrument for Drift Detection Output structure, tone signature, refusal rate, format compliance. The metrics that catch silent changes before accuracy does.
  • 3 Schedule Automated Regression Runs Weekly minimum. Daily if you're in a high-stakes vertical. Same prompts, same corpus, compared against pinned baseline. Variance above threshold = alert.
  • 4 Build a Provider Change Log Track every model version update, announced or not. Cross-reference against your drift alerts. Build institutional memory your team actually owns.
  • 5 Define Your Regression Response Protocol What is the threshold for rollback? Who decides? What's the escalation path? If you don't have answers before drift happens, you'll make them up under pressure.
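One of Phase 2's metrics, refusal rate, is cheap enough to sketch end to end. The marker phrases, baseline, threshold, and sample outputs below are all illustrative placeholders; the pattern is the point: compute the metric on today's outputs, compare against the pinned baseline, alert past a threshold:

```python
# Drift-alert sketch for one cheap metric: refusal rate.
# Markers, baseline, threshold, and sample outputs are illustrative.

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")
BASELINE_REFUSAL_RATE = 0.05   # pinned from your baseline run
ALERT_THRESHOLD = 0.03         # allowed absolute drift before alerting

def refusal_rate(outputs):
    refused = sum(
        any(marker in out.lower() for marker in REFUSAL_MARKERS) for out in outputs
    )
    return refused / len(outputs)

todays_outputs = [
    "Here is the summary.",
    "I can't help with that.",
    "Sure.",
    "I cannot verify this.",
] * 5  # stand-in for a day's logged production outputs

rate = refusal_rate(todays_outputs)
alert = abs(rate - BASELINE_REFUSAL_RATE) > ALERT_THRESHOLD
print(f"refusal rate {rate:.0%} vs baseline {BASELINE_REFUSAL_RATE:.0%} "
      f"-> {'ALERT' if alert else 'ok'}")
```

Run the same comparison for format compliance and output-length distribution and you have a drift tripwire that fires before accuracy metrics even notice.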
  • 0 regressions discovered by users before your team
  • mean time to detect silent model drift

Cookbook · Executable Recipe

Catch the Model Update Nobody Told You About

A recipe for engineers who suspect their model changed and need to prove it. No vendor confirmation required. Your corpus is the witness.

Audience: Engineers / ML Ops · Time: ~1 hr setup · Tools: Trace-bench + CLI
  • Step 1 — Pull your oldest stored outputs. Find the earliest logged responses from your production model. If you haven't been logging, start today. This is your day-one debt.
  • Step 2 — Run the same prompts now. Exact same inputs against your current model. Save to outputs-current.json. Do not change anything else.
  • Step 3 — Run structural diff with Trace-bench. trace-bench --compare outputs-baseline.json outputs-current.json --profile drift. Look at structure and tone variance, not just semantic similarity.
  • Step 4 — Classify what changed. Format drift? Verbosity shift? Refusal rate change? New hedging language? Each pattern points to a different kind of update.
  • Step 5 — Write the drift report. Date range, delta magnitude, behavior classification, and your ship/hold recommendation. One page. Now you have evidence, not suspicion.
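Trace-bench does the heavy lifting in Step 3, but the underlying idea fits in a few lines. A sketch with contrived sample outputs (yours come from `outputs-baseline.json` and `outputs-current.json`), comparing cheap structural signals instead of semantic similarity; the hedge marker is an illustrative example of the "new hedging language" pattern from Step 4:

```python
# Structural-diff sketch in the spirit of Step 3: compare baseline outputs to
# current ones on cheap structural signals. Sample outputs are illustrative.
from statistics import mean

def structural_profile(outputs):
    return {
        "mean_length": mean(len(out) for out in outputs),
        "bullet_rate": mean(out.lstrip().startswith("-") for out in outputs),
        "hedge_rate": mean("as an ai" in out.lower() for out in outputs),
    }

baseline = ["- point one", "- point two", "- point three"]
current = [
    "As an AI, I should note point one.",
    "- point two",
    "As an AI, point three.",
]

base_p, curr_p = structural_profile(baseline), structural_profile(current)
drift = {k: round(curr_p[k] - base_p[k], 3) for k in base_p}
print(drift)  # nonzero deltas = candidate drift signals; classify each one
```

A verbosity jump plus new hedging language plus collapsing format compliance, all on identical inputs, is exactly the fingerprint of an unannounced model update. That dict is the raw material for Step 5's one-page report.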
  • 1 drift report your team can act on immediately
  • reusable detection pipeline for every future update
Built from field observation. No Gartner. No McKinsey. jademelody.com