Training AI Models with Synthetic Values: A Precision Alignment Technique

Synthetic data is doing more alignment work than most teams realize. A recent research update from the Google DeepMind Language Model Interpretability team — by Callum McDougall, Arthur Conmy, and Neel Nanda — lays out a pipeline for instilling positive traits into a frontier model using synthetic documents, and the practical details are worth your attention.

The core idea

The team adapts methods from Li et al ("model spec midtraining," or MSM) and Marks et al. The premise: you can shape how a model generalizes by training it on synthetic documents before chat finetuning begins. Critically, teaching the model the reasons behind a behavior — not just the behavior itself — improves generalization.

Their target was Gemini 3 Flash. The motivation they call "deep alignment": training principles into the model that hold even in highly out-of-distribution (OOD) scenarios, rather than surface behaviors that collapse the moment you leave the training distribution.

The pipeline

It starts with a "traits document" — a short, bullet-pointed list of desired positive traits — used as a universe context. From there, two pipelines:

Midtraining generates pretraining-style documents (Reddit threads, blog posts, emails, research papers) describing a world where Gemini exhibits the target traits. This data is not chat-formatted.

SFT generates chat-format prompt/response pairs where the assistant naturally embodies the traits. Responses come from Gemini 3.1 Pro, given the relevant traits in its system prompt and instructed to embody the trait without exaggeration or explicit reference to the document. The system prompt is then removed before training.

The data generation itself is more careful than "ask a bigger model nicely." The team splits the traits document into chunks, generates a scenario per chunk where that trait matters, and runs a critique stage to confirm the scenario is realistic and would genuinely elicit the trait. One useful trick: generate an initial no-system-prompt response first, then use it to sharpen the user prompt — if the default reply is full of platitudes, you rewrite the prompt to force deeper engagement. A final autorater filters low-quality responses, and a deduplication stage removes embedding-similar prompts.

Validating robustness OOD

Training data here was single-turn and narrow ("difficult advice"). So the evals were deliberately chosen to be OOD along at least one axis:

AI Delusion Validation — can a red-teaming "client" induce a therapist-roleplay model to validate delusions?
ODCV — do models violate constraints under strong performance incentives, with tool use?
Agentic Misalignment — will models take actions like information leakage under direct goal conflict or autonomy threat?
Audit Agents — an auditor model tries to elicit a trait violation over a 5-turn conversation, performing a strategy assessment before each step to escalate, de-escalate, or pivot adaptively. Petri-style realism checkers reduce eval-awareness.

What they found

SFT showed mild-to-significant improvement on all alignment evals. Midtraining improved most (and often stacked with SFT), though not all. Capability results — measured via LMSYS and agentic coding evals — stayed mostly flat, suggesting no significant degradation.

They also tried BDPO (bounded direct policy optimization) in place of SFT. The bounded variant was chosen because vanilla DPO drove rejected-response probabilities incredibly low rather than lifting good responses. BDPO results were sometimes marginally better, but inconsistent, and harder to tune for training stability. Their verdict: not worth it over SFT.

The superficial-pattern trap

The most instructive failure: teaching "appropriate agency" by showing the model asking for clarification on underspecified questions accidentally taught it to ask for clarification always — even for "What is 1+1?" Each example was reasonable in isolation; the pattern only emerged across the whole dataset.

Their fix is a 3-pass pipeline run at the end of generation: Scan concatenated transcript batches with an LLM to surface recurring structural or behavioral patterns; Cluster to de-duplicate and merge features appearing across scans; Autorate each surviving feature to count its prevalence.

Why it matters

The signal here isn't a benchmark number — it's a methodology. Instilling traits robustly enough to survive OOD pressure is exactly the alignment problem that surface-level finetuning ducks. The dataset-level pattern audit is the kind of unglamorous step that separates a working pipeline from one that quietly teaches your model bad habits. If you're building with synthetic data, the lesson is plain: inspect your dataset as a whole, not just example by example.

Training AI Models with Synthetic Values: A Precision Alignment Technique

The core idea

The pipeline

Validating robustness OOD

What they found

The superficial-pattern trap

Why it matters

Why it matters

Related reading

TextGrad: Pioneering Textual Backpropagation in Prompt Optimization

Production AI Agent Testing: Strategies for Reliability and Trust

Grounding AI Agents: 4 Critical Strategies for Operational Performance