Section

AI QA

Evaluations, red-teaming, hallucination control, and the practice of shipping reliable AI products.

12 stories

AI QA·Jun 30, 2026

Shipping AI You Can Defend: A QA Field Report

The interesting story is not the demo. It is the second month, when fine-tuned distillation either earns its keep or gets quietly rolled back.

Mira Castellanos·3 min read

AI QA·Jun 11, 2026

The Quiet Discipline Behind Reliable Grok 4 Apps

The interesting story is not the demo. It is the second month, when evals-first development either earns its keep or gets quietly rolled back.

Yuki Tanabe·3 min read

AI QA·Jun 10, 2026

Red-Teaming Gemini 3 Pro: What We Learned

The interesting story is not the demo. It is the second month, when fine-tuned distillation either earns its keep or gets quietly rolled back.

Jonas Halvorsen·3 min read

AI QA·Jun 3, 2026

Shipping AI You Can Defend: A QA Field Report

Beyond the launch posts, Llama 4 is reshaping how analysts approach research synthesis. We talked to the people actually using it in production.

Elena Brost·3 min read

AI QA·May 19, 2026

Why Your Contract Review Needs an Eval Harness Yesterday

The interesting story is not the demo. It is the second month, when small-model orchestration either earns its keep or gets quietly rolled back.

Priya Raman·3 min read

AI QA·May 16, 2026

Hallucination Budgets and the Engineering of Trust

Shopify is the latest in a string of teams treating RAG-as-a-service as the default, not the experiment. Here is what they got right — and what they are still figuring out.

Elena Brost·3 min read

AI QA·May 12, 2026

Evals Are the New Unit Tests: Notes From Ramp

The interesting story is not the demo. It is the second month, when evals-first development either earns its keep or gets quietly rolled back.

Priya Raman·3 min read

AI QA·May 7, 2026

Evals Are the New Unit Tests: Notes From Snowflake

The interesting story is not the demo. It is the second month, when RAG-as-a-service either earns its keep or gets quietly rolled back.

Yuki Tanabe·3 min read

AI QA·May 6, 2026

Beyond Vibes: Measuring GPT-5.1 in Production

Beyond the launch posts, Grok 4 is reshaping how sales teams approach contract review. We talked to the people actually using it in production.

Mira Castellanos·3 min read