QPILOTS: Solving Policy Optimization Challenges in Flow-Based RL

If you've tried to train a flow-matching or diffusion policy with temporal-difference RL, you already know where the wheels come off: backpropagating the critic's action gradient through a multi-step denoising process is numerically unstable. A new method called QPILOTS sidesteps that problem entirely by steering the denoising process at inference time, leaving the original policy untouched. On a standard offline-to-online benchmark, it hits a 90% average success rate across 50 tasks.

Here's why that matters, and how it works.

The problem with gradients through denoising

Flow-matching and diffusion policies are expressive action generators. That expressiveness is exactly what makes them attractive for control tasks where the action distribution is multimodal or hard to capture with a simple Gaussian head.

The trouble starts when you want to optimize them with temporal-difference RL. Effective policy extraction depends on exploiting the critic's action gradient — you want to push the policy toward actions the critic rates highly. But the policy doesn't produce actions in one shot. It produces them through a sequence of denoising steps.

Directly backpropagating the critic signal through that whole multi-step chain is numerically unstable. The gradient has to flow through every denoising step, and that path is fragile.
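
To see why a long chain is fragile, consider one simplified mechanism: the gradient through K denoising steps is a product of K per-step Jacobians, so its size can grow or shrink geometrically with K. The sketch below uses hypothetical, hand-picked diagonal Jacobians (nothing here comes from the paper) and NumPy. It illustrates the general chain-rule effect and is not a claim about the specific failure mode the authors analyze.

```python
import numpy as np

def chain_jacobian_norm(step_jacobian, num_steps):
    """Spectral norm of the product of `num_steps` identical per-step Jacobians."""
    total = np.eye(step_jacobian.shape[0])
    for _ in range(num_steps):
        total = step_jacobian @ total
    return np.linalg.norm(total, 2)

# Hypothetical per-step Jacobians of a denoising update a_{k+1} = f(a_k).
expanding = np.diag([1.2, 1.0])    # largest gain slightly above 1
contracting = np.diag([0.8, 0.7])  # largest gain slightly below 1

for steps in (1, 10, 50):
    print(steps,
          chain_jacobian_norm(expanding, steps),
          chain_jacobian_norm(contracting, steps))
```

With a largest per-step gain of 1.2, the norm after 50 steps is 1.2^50, roughly 9,100. With 0.8 it is about 1.4e-5. Either way, the critic's signal arrives at the early steps badly scaled.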

How existing methods work around it

The field has converged on a few imperfect workarounds, each with a real cost:

Discard gradient information entirely, giving up the most useful learning signal the critic provides.
Distill the policy into a one-step actor, trading away the expressiveness of the multi-step generator for something simpler to optimize.
Repeatedly fine-tune the denoising policy as the critic improves, which means continuously retraining the generator and tightly coupling it to a moving target.

None of these is free. You either lose information, lose expressiveness, or take on a heavier training loop.

The QPILOTS idea: steer, don't retrain

QPILOTS takes a different angle. Instead of modifying the policy at all, it steers the denoising process at inference time.

The key insight concerns where you evaluate the critic. The naive approach computes the critic gradient on the noisy intermediate action at each denoising step, but those intermediate states are exactly where the critic's predictions are unreliable. You're asking the critic to score something that isn't a real action yet.

QPILOTS instead projects each intermediate state to an estimate of the final clean action, then computes the critic gradient there. You evaluate the critic where its predictions actually mean something, and use that to nudge the denoising trajectory.

That move keeps the original policy frozen and avoids pushing gradients through the full denoising chain — which is the source of the instability in the first place.
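
As a rough illustration of the idea, here is a minimal sketch under strong assumptions: a toy frozen flow policy on a straight-line path with a single clean action, a toy quadratic critic, a one-step projection to the clean estimate, and the critic gradient simply added to the velocity. All names (`base_velocity`, `project_to_clean`, `guidance_scale`) are hypothetical, and the update rule is a generic guidance-style stand-in, not the paper's actual algorithm.

```python
import numpy as np

def critic_grad(action, best_action):
    # Toy critic Q(a) = -||a - best_action||^2, so dQ/da = -2 (a - best_action).
    return -2.0 * (action - best_action)

def base_velocity(x, t, policy_mode):
    # Toy frozen flow policy on the straight path x_t = (1 - t) x_0 + t x_1,
    # whose only clean action is `policy_mode`.
    return (policy_mode - x) / (1.0 - t)

def project_to_clean(x, t, velocity):
    # One-step estimate of the final action: follow the current velocity to t = 1.
    return x + (1.0 - t) * velocity

def sample_action(x0, policy_mode, best_action, num_steps=10, guidance_scale=0.0):
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = base_velocity(x, t, policy_mode)
        a_hat = project_to_clean(x, t, v)
        # The critic gradient is taken at the clean estimate, not at the noisy x.
        # Simplification: the Jacobian of a_hat with respect to x is ignored.
        v_steered = v + guidance_scale * critic_grad(a_hat, best_action)
        x = x + dt * v_steered
    return x

policy_mode = np.array([0.0, 0.0])  # what the frozen policy would output
best_action = np.array([1.0, 1.0])  # what the toy critic prefers
x0 = np.array([-1.0, 0.5])          # initial noise sample

plain = sample_action(x0, policy_mode, best_action)
steered = sample_action(x0, policy_mode, best_action, guidance_scale=2.0)
print(plain, steered)
```

In this toy setup the policy's weights (here, just `policy_mode`) are never touched, and no gradient is propagated backward through the loop. The steered sample still ends up closer to the action the critic prefers than the unsteered one.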

Two variants for two budgets

The paper introduces two flavors:

QPILOTS-U uses a fast single-point approximation of the clean action. This is the lightweight option when you want speed.
QPILOTS-M draws differentiable posterior samples through a learned auxiliary network, trading a bit more machinery for a richer estimate of the clean action.

The split is a useful tradeoff to keep in mind: a cheap projection versus a learned sampler, depending on how much you're willing to spend at inference.
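
The general tradeoff between one point estimate and many samples can be sketched as follows. This is a generic illustration, not either variant's actual construction: the fixed Gaussian (`mean`, `std`) is a hypothetical stand-in for whatever the learned auxiliary network would produce, and the quartic critic is a toy chosen so the two estimates differ.

```python
import numpy as np

def critic_grad(action):
    # Toy non-quadratic critic Q(a) = -sum(a^4), so dQ/da = -4 a^3.
    return -4.0 * action ** 3

def single_point_grad(mean):
    # Cheap option: evaluate the critic gradient once, at the point estimate.
    return critic_grad(mean)

def sampled_grad(mean, std, num_samples, rng):
    # Richer option: average the gradient over reparameterized samples
    # a_i = mean + std * eps_i, with eps_i drawn from a standard normal.
    eps = rng.standard_normal((num_samples, mean.shape[0]))
    samples = mean + std * eps
    return critic_grad(samples).mean(axis=0)

rng = np.random.default_rng(0)
mean = np.array([1.0, -1.0])
print(single_point_grad(mean))
print(sampled_grad(mean, 0.5, 20000, rng))
```

The single-point gradient costs one critic evaluation but ignores uncertainty about the clean action. The sampled gradient accounts for it: for this toy critic its expected value is -4(m^3 + 3 m std^2) per coordinate, which differs noticeably from the point estimate -4 m^3, at the price of many more critic evaluations.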

What the results show

On a standard offline-to-online RL benchmark, QPILOTS reports the best aggregate performance, reaching a 90% average success rate across 50 tasks.

The more interesting result for practitioners working with large models: QPILOTS was used to steer a large, frozen, pretrained Vision-Language-Action (VLA) foundation model. Across six manipulation tasks in simulation, it outperformed or matched prior inference-time approaches.

That's the part worth dwelling on. Because QPILOTS operates at inference time and leaves the policy unmodified, you can apply it to a frozen foundation model you can't — or don't want to — retrain. You bring a critic and steer the generation, rather than touching the base weights.

The takeaway

QPILOTS reframes a training problem as an inference-time one. The instability of TD-RL on flow and diffusion policies comes from backpropagating critic gradients through denoising; QPILOTS avoids that by projecting to a clean action estimate and steering generation instead of retraining.

If you're working with expressive generative policies — especially large frozen VLA models you'd rather not fine-tune — this is a pattern to watch. The open questions to keep an eye on: how the U and M variants trade off in cost versus quality on real hardware, and how well test-time steering holds up beyond simulation.
