Inference Daily

Red-Teaming Gemini 3 Pro: What We Learned

A field report on small-model orchestration and what it changes for designers.

By Jonas Halvorsen · 3 min read

There is a version of this story that is mostly hype. There is another version, the one we are interested in, that is mostly engineering.

Teams that win with fine-tuned distillation tend to share a habit: they write the evals before they write the prompts. Everything else follows from that.
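To make the habit concrete, here is a minimal sketch of what "evals first" can look like. Everything in it is an assumption made for illustration: the `EvalCase` and `run_evals` names, the two toy cases, and the `candidate_v1` / `candidate_v2` functions, which stand in for a prompt plus a model call. It does not describe any particular team's harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One eval: an input and a check that decides if the output is acceptable."""
    name: str
    prompt_input: str
    check: Callable[[str], bool]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    if not cases:
        raise ValueError("no eval cases defined")
    passed = sum(1 for case in cases if case.check(model(case.prompt_input)))
    return passed / len(cases)


# The evals are written first, before any prompt or model choice exists.
CASES = [
    EvalCase("handles_empty_input", "", lambda out: out == "NO_INPUT"),
    EvalCase("uppercases_text", "hello", lambda out: out == "HELLO"),
]


def candidate_v1(text: str) -> str:
    # Hypothetical first attempt; stands in for a prompt + model call.
    return text.upper()


def candidate_v2(text: str) -> str:
    # Hypothetical revision, written in response to the failing eval.
    return text.upper() if text else "NO_INPUT"


if __name__ == "__main__":
    print("v1 pass rate:", run_evals(candidate_v1, CASES))
    print("v2 pass rate:", run_evals(candidate_v2, CASES))
```

The point of the ordering is that `CASES` is fixed before either candidate exists, so each revision is scored against the same bar rather than against whatever the latest prompt happens to do well.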

Inside Databricks, the rollout looked less like a moonshot and more like a slow migration. A pilot, a champion, a quiet expansion, a budget line.

What xAI actually shipped with Grok 4 is less a single capability and more a cluster of small, compounding improvements — the kind that only show up when you put a real workflow on top.

What Mistral actually shipped with Mistral Large 3 is less a single capability and more a cluster of small, compounding improvements — the kind that only show up when you put a real workflow on top.

None of this guarantees a clean story. Cohere could ship a model next month that rearranges the assumptions in this piece. But the direction of travel, for now, is clear enough to plan around.

#policy #tool use #code #vision #evals
