Personalized Programming Learning with Multi-Agent Planning

Anyone who has tried to teach programming knows the core difficulty: the same explanation lands differently depending on who's receiving it. A learner with a strong math background and one coming from a non-technical field need different scaffolding, different pacing, and different examples — even for the same question. Most automated tutoring systems flatten that variation. They produce a generic answer and hope it fits.

A recent paper, PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning, takes aim at exactly this gap. The authors' framing is worth stating plainly: LLM-based multi-agent systems (MAS) are good at complex planning, but the planners used to orchestrate them tend to lack two things — profile-grounding (awareness of who the learner is) and pedagogical scaffolding (the structure that makes instruction teachable, not just correct). Without those, personalization breaks down.

The dataset: MAP-PPL

The first contribution is data. The authors introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned planning dataset of 3,043 query–profile–plan instances. These are drawn from 1,730 Stack Overflow question groups and paired with 2,738 learner profiles.

The structure of each plan is the interesting part. Rather than a single output blob, every plan specifies four things: the agents involved, the subtasks they handle, the executable steps within those subtasks, and the prerequisite dependencies between steps. That last element matters — it encodes ordering and dependency, which is what separates a teaching plan from a flat list of instructions. If you've built agentic systems, you'll recognize this as a dependency graph over work, conditioned on the learner.
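To make the dependency-graph reading concrete, here is a minimal sketch of how such a plan could be represented and ordered. The class names, field names, and the example steps are hypothetical illustrations, not the MAP-PPL schema; the only thing taken from the paper's description is the four-part structure (agents, subtasks, steps, prerequisite dependencies).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Step:
    step_id: str
    agent: str      # which agent executes this step (hypothetical names below)
    subtask: str    # which subtask the step belongs to
    action: str     # what the agent does
    requires: list[str] = field(default_factory=list)  # prerequisite step ids


@dataclass
class Plan:
    query: str
    profile: dict[str, str]
    steps: list[Step]

    def execution_order(self) -> list[str]:
        """Return step ids in an order that respects every prerequisite."""
        known = {s.step_id for s in self.steps}
        graph = {}
        for s in self.steps:
            missing = [r for r in s.requires if r not in known]
            if missing:
                raise ValueError(f"{s.step_id} depends on unknown steps: {missing}")
            graph[s.step_id] = set(s.requires)
        # static_order() raises graphlib.CycleError if the dependencies loop.
        return list(TopologicalSorter(graph).static_order())


# Illustrative plan: all content here is made up for the example.
plan = Plan(
    query="Why does my recursive function hit the recursion limit?",
    profile={"background": "non-technical", "level": "beginner"},
    steps=[
        Step("s1", "profiler", "assess", "Check what the learner knows about call stacks"),
        Step("s2", "explainer", "teach", "Explain recursion depth with an analogy", ["s1"]),
        Step("s3", "coder", "teach", "Show a small worked example", ["s1"]),
        Step("s4", "tutor", "practice", "Give a guided exercise", ["s2", "s3"]),
    ],
)

order = plan.execution_order()
print(" -> ".join(order))
```

A flat list of instructions would lose the fact that s4 cannot start until both s2 and s3 are done; the `requires` field is what carries that information.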

The planner: a two-stage approach

PersonalPlan itself is a two-stage MAS planner.

The first stage is hierarchical supervised fine-tuning (SFT) using separate LoRA adapters for two distinct jobs: profile-aware task decomposition, and step dependency planning. Splitting these into separate adapters is a sensible design choice — decomposing a task with the learner's profile in mind is a different skill from working out the dependency ordering among the resulting steps, and isolating them with dedicated adapters keeps each focused.
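For readers less familiar with LoRA, the standard formulation (from the original LoRA work, not anything specific to this paper) keeps a pretrained weight matrix frozen and learns a low-rank update on top of it. With two adapters, the same frozen base is paired with two independent updates. The subscripts dec (decomposition) and dep (dependency planning) are labels introduced here for illustration; the ranks, scaling, and adapted layers used in PersonalPlan are not stated above.

```latex
W_{\text{dec}} = W_0 + \frac{\alpha}{r}\, B_{\text{dec}} A_{\text{dec}}, \qquad
W_{\text{dep}} = W_0 + \frac{\alpha}{r}\, B_{\text{dep}} A_{\text{dep}}, \qquad
B_{\ast} \in \mathbb{R}^{d \times r},\quad A_{\ast} \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only the small matrices A and B are trained, so each job gets its own parameters while the base model W_0 is shared.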

The second stage applies Reward-Adaptive GRPO, a reinforcement-learning step intended to push the model toward plans that are executable, personalized, and pedagogically scaffolded simultaneously. The three properties pull in different directions — a maximally personalized plan isn't automatically executable, and an executable plan isn't automatically good teaching — so optimizing for all three at once is the harder version of the problem.
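The sketch below shows the general shape of a GRPO-style update signal with a multi-objective reward: several candidate plans are sampled for the same query and profile, each gets a scalar reward, and the reward is standardized within the group to give an advantage. It is a sketch under assumptions, not the paper's method: the fixed equal weights, the score values, and the function names are all hypothetical, and the "reward-adaptive" mechanism itself is not described above and is not modeled here.

```python
from statistics import mean, pstdev


def combined_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-objective scores (each assumed to lie in [0, 1])."""
    return sum(weights[k] * scores[k] for k in weights)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical fixed weights; an adaptive scheme would change these during training.
weights = {"executable": 1 / 3, "personalized": 1 / 3, "scaffolded": 1 / 3}

# Made-up scores for three candidate plans sampled for one query-profile pair.
candidates = [
    {"executable": 0.9, "personalized": 0.4, "scaffolded": 0.5},  # runs, but generic
    {"executable": 0.7, "personalized": 0.8, "scaffolded": 0.8},  # balanced
    {"executable": 0.3, "personalized": 0.9, "scaffolded": 0.3},  # tailored, but brittle
]

rewards = [combined_reward(c, weights) for c in candidates]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

Under these made-up numbers the balanced plan receives the only positive advantage, which illustrates the tension described above: a plan that maximizes a single objective can still lose to one that does reasonably well on all three.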

What the experiments claim

The authors evaluate PersonalPlan on MAP-PPL against three classes of baselines: frontier LLMs, generic MAS frameworks, and agentic planners. They report that PersonalPlan, in 8B and 32B variants, achieves state-of-the-art results on plan executability, personalization, and pedagogical quality, and effectively orchestrates the multi-agent system for agent–student interactions.

Two things stand out for practitioners. First, 8B and 32B are relatively modest model sizes for a claimed frontier-beating result on a specialized task, which is consistent with the broader pattern that targeted fine-tuning on a well-structured dataset can outperform much larger general models within a narrow domain. Second, the evaluation criteria are domain-appropriate: executability, personalization, and pedagogical quality are the three axes that actually matter for tutoring, rather than a single accuracy number.

Why it matters for adaptive ed-tech

The reusable idea here isn't limited to programming education. It's the recipe: condition planning on an explicit profile, represent plans as structured graphs of agents/subtasks/steps/dependencies rather than free text, and optimize against multiple competing objectives rather than one. That pattern generalizes to any agentic system where who the output is for should change the plan, not just the wording.

A few caveats are worth keeping in mind. The dataset is built from Stack Overflow question groups and synthesized learner profiles, so how well it transfers to live learners remains an open question that the paper's offline evaluation doesn't fully settle. And "pedagogical quality" as a metric depends heavily on how it's operationalized — a detail to scrutinize in the full paper before drawing conclusions about classroom readiness.

Still, the framing is clean and the structure is the right one: personalization as a planning problem, grounded in profiles, scaffolded for teaching.
