LLMs as Documentation Assistants for Agent-Based Modeling

Documentation is where rigorous agent-based modeling goes to die. Standards exist—ODD for models, TRACE and EABSS for processes, RAT-RS for data use—but adoption stays low because producing the documentation is treated as supplementary effort nobody has time for. A new feasibility study from Peer-Olaf Siebers and Christopher Frantz asks a practical question: can LLMs do the grunt work and lower that barrier?

The adoption problem

Agent-based modeling (ABM) leans on clear documentation for credibility and transparency. The community has built standards to support that, but having a standard and using it are different things.

The friction is effort. Filling out a structured reporting standard after the modeling work is done feels like homework, so it often doesn't happen. The result is a body of published models that are hard to reproduce, audit, or trust.

The paper targets a specific underused standard: the Rigour and Transparency Reporting Standard (RAT-RS), which covers how data is used in a model. If you can automate part of populating RAT-RS from an existing paper, you remove much of the reason people skip it.

LLMs as supervised extraction assistants

The framing matters. The authors don't pitch LLMs as autonomous documentation generators. They position them as supervised extraction assistants—tools that pull structured reporting fields out of a paper, with a human staying in the loop.

That's the right scope. Extraction is a task LLMs are plausibly good at: read the source, find the relevant claims, map them onto the standard's questions. The open question is how reliable that mapping is across different kinds of questions.

How they tested it

The study is a focused feasibility test, not a sweeping benchmark. The authors took a single published ABM paper and used four LLMs to extract RAT-RS reports from it.

They then assessed two things: consistency (do the models agree, and do they produce coherent output?) and performance across question types. That second axis is the interesting one, because RAT-RS questions aren't uniform—some ask you to describe what a model did, others ask you to explain why or evaluate how well.
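To make the consistency axis concrete, here is a minimal sketch of one way to score agreement between several models' answers to the same question. The article does not say how the authors measured consistency, so the metric (mean pairwise Jaccard overlap of word sets), the function names, and the sample answers are all assumptions for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two answers (an assumed, deliberately crude metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty answers count as agreeing
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_agreement(answers: dict[str, str]) -> float:
    """Average Jaccard score over every pair of model answers to one question."""
    pairs = list(combinations(answers.values(), 2))
    if not pairs:
        return 1.0  # a single answer trivially agrees with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical answers from four models to one descriptive question.
answers = {
    "model_a": "household survey data",
    "model_b": "household survey data",
    "model_c": "survey data from households",
    "model_d": "census records",
}
score = mean_pairwise_agreement(answers)
```

A low score on a question is a cheap signal that the field deserves a closer human look. Lexical overlap cannot tell a paraphrase from a disagreement, so treat it as a triage aid rather than a verdict.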

What they found

The headline result is a clean split by question type. LLMs perform more reliably on descriptive tasks than on explanatory or evaluative ones.

That tracks with what you'd expect. Pulling out a stated fact—what data source was used, what the agents represent—is closer to retrieval. Explaining rationale or judging adequacy requires inference the source text may not fully support, and that's where models get shakier.

The models did generate coherent outputs overall, and the authors conclude that LLMs can improve reporting quality and consistency. They are equally explicit that notable limitations remain, which is why the supervised framing isn't optional.

The practical takeaway: heuristics, not autopilot

The most useful contribution for practitioners is that the paper identifies heuristics for when LLM-assisted documentation is reliable and when human oversight is needed.

Read through the lens of the descriptive/explanatory split, the rule of thumb is straightforward: lean on the model for descriptive extraction, and keep a human firmly in charge of explanatory and evaluative fields. Treat the LLM output as a first draft to verify, not a finished report to file.
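The rule of thumb above can be sketched as a small routing step in a documentation pipeline. This is an illustration under stated assumptions, not the authors' tooling: the `QuestionType` labels, the `DraftAnswer` record, the review levels, and the sample questions are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    DESCRIPTIVE = "descriptive"  # "what happened"
    EXPLANATORY = "explanatory"  # "why"
    EVALUATIVE = "evaluative"    # "how well"


@dataclass
class DraftAnswer:
    question: str
    qtype: QuestionType
    llm_answer: str


def review_level(draft: DraftAnswer) -> str:
    """Assumed policy: every draft is checked by a human, but only
    descriptive fields get the lighter spot-check."""
    if draft.qtype is QuestionType.DESCRIPTIVE:
        return "spot-check"
    return "full-review"


# Hypothetical drafts; the question wording is not taken from RAT-RS.
drafts = [
    DraftAnswer("Which data sources were used?", QuestionType.DESCRIPTIVE, "..."),
    DraftAnswer("Why were these sources chosen?", QuestionType.EXPLANATORY, "..."),
    DraftAnswer("How adequate was the data?", QuestionType.EVALUATIVE, "..."),
]
plan = {d.question: review_level(d) for d in drafts}
```

Nothing is auto-approved in this sketch: the descriptive path only lowers the review effort, which keeps the LLM output in first-draft status for every field.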

This is a sensible default for any structured-extraction task, not just ABM. If your reporting standard mixes "what happened" questions with "why" and "how well" questions, expect the second category to need review.

Why it matters and what to watch

The bet here is that lowering documentation overhead could raise adoption of rigorous reporting standards—and not just in ABM. Any scientific domain with under-used documentation standards faces the same effort-versus-payoff problem.

The authors stop short of claiming a solution. They call for systematic community-level exploration to build on the feasibility result, which is the honest position given a single-paper study with four models.

For ML engineers, this is a clean example of scoping LLMs to what they're actually good at: supervised extraction, descriptive over evaluative, human-in-the-loop. What to watch is whether the community turns these heuristics into reusable tooling, and whether the descriptive/explanatory reliability gap holds up across more papers and standards. If it does, you've got a defensible recipe for partially automating structured documentation—and a clear line for where the human still has to sign off.
