AI-Driven Test Specification Generation for Complex Software Requirements

If you've ever shipped automotive software, you know the bottleneck: writing test specifications that satisfy Automotive SPICE SWE.6 when a project spans thousands of requirements. The manual process can eat weeks of engineering time. A new paper proposes a pipeline that automates it without the failure modes that sink naive LLM approaches—and the core idea generalizes well beyond cars.

Why the obvious LLM approach breaks

There are two tempting ways to throw an LLM at this problem, and both fail at scale.

Process each requirement individually, and you discard inter-requirement dependencies. The model never sees that requirement 412 interacts with requirement 1,309, so integration coverage falls apart and you get redundant test cases.

Feed the entire corpus at once, and you blow past the context window. The model can't hold thousands of requirements in a single prompt, so coverage ends up incomplete again, this time because parts of the corpus never reach the model at all.

Both paths leave you with the same outcome: missed integration behavior and duplicated effort. The structure of the requirements—how they cluster into features—gets thrown away.

Cluster-then-summarize

The authors' answer is a three-stage "Cluster-then-Summarize" pipeline that recovers that structure before generating anything.

Stage one: clustering. Requirements are embedded with sentence transformers, reduced with UMAP, then grouped with HDBSCAN density-based clustering. Rather than hand-tuning the minimum cluster size, the pipeline selects it automatically using a quality criterion that combines normalized Silhouette and Calinski-Harabasz scores. The result is a topology of requirement groups that roughly maps to features.
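To make the selection step concrete, here is a minimal sketch of how two quality scores might be combined. Silhouette lies in [-1, 1] while Calinski-Harabasz is unbounded, so both need rescaling before they can be mixed. The min-max normalization, the equal weighting, the function names, and the example scores are all assumptions for illustration; the article does not spell out the exact criterion.

```python
from typing import Dict, List, Tuple


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_min_cluster_size(scores: Dict[int, Tuple[float, float]]) -> int:
    """Pick the candidate whose normalized scores have the highest mean.

    scores maps a candidate min_cluster_size to the (silhouette,
    calinski_harabasz) pair measured on the clustering it produced.
    """
    candidates = sorted(scores)
    sil = min_max([scores[c][0] for c in candidates])
    ch = min_max([scores[c][1] for c in candidates])
    # Assumption: the two normalized scores are weighted equally.
    combined = {c: (s + h) / 2 for c, s, h in zip(candidates, sil, ch)}
    # max() keeps the first maximum, so ties go to the smaller candidate.
    return max(candidates, key=lambda c: combined[c])


# Made-up scores for illustration only, not results from the paper.
example_scores = {
    5: (0.21, 140.0),
    10: (0.34, 210.0),
    20: (0.30, 260.0),
    40: (0.18, 190.0),
}
best = select_min_cluster_size(example_scores)
print(best)
```

In a real run, each score pair would come from clustering the reduced embeddings once per candidate value and scoring the resulting labels.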

Stage two: summarization. A multi-level map-reduce summarization algorithm distills each cluster into a concise, domain-conformant description. Crucially, it preserves the details that matter for testing—quantitative thresholds and safety integrity levels—rather than smoothing them away the way generic summarizers tend to.
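A rough sketch of what multi-level map-reduce summarization with a fidelity guard could look like. The regular expression, the `summarize` callable (a stand-in for an LLM call), the `fan_in` parameter, and the re-attachment strategy are all hypothetical; the article does not describe how the authors protect thresholds and integrity levels.

```python
import re
from typing import Callable, List, Set

# Assumption: ASIL levels and simple number-plus-unit values stand in for
# "the details that matter for testing".
CRITICAL = re.compile(
    r"ASIL[ -]?[A-D]\b|\d+(?:\.\d+)?\s?(?:ms|s|V|A|km/h|%)(?![A-Za-z])"
)


def critical_tokens(text: str) -> Set[str]:
    """Return the thresholds and integrity levels found in text."""
    return set(CRITICAL.findall(text))


def map_reduce_summarize(
    texts: List[str],
    summarize: Callable[[List[str]], str],
    fan_in: int = 4,
) -> str:
    """Summarize groups of fan_in texts, level by level, until one remains."""
    if not texts:
        raise ValueError("need at least one text")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(texts)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            summary = summarize(group)
            # Fidelity guard: re-attach any critical token the summary lost.
            missing = critical_tokens(" ".join(group)) - critical_tokens(summary)
            if missing:
                summary += " [preserved: " + ", ".join(sorted(missing)) + "]"
            next_level.append(summary)
        level = next_level
    return level[0]


def lossy_stub(group: List[str]) -> str:
    """Stand-in for an LLM: keeps only the first six words of each text."""
    return " ".join(" ".join(t.split()[:6]) for t in group)


# Hypothetical requirements, written for this example.
reqs = [
    "The brake controller shall respond within 100 ms of pedal input (ASIL D).",
    "The system shall log faults when supply voltage drops below 9 V.",
    "Warning lamp shall activate within 2 s of a detected fault (ASIL B).",
]
final = map_reduce_summarize(reqs, lossy_stub, fan_in=2)
print(final)
```

The stub is deliberately lossy so the guard has something to catch. With a real summarizer, a failed check might trigger a retry with a stricter prompt rather than appending the tokens.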

Stage three: dual-level generation. This is where the cluster topology pays off. The pipeline produces specifications at two levels:

Individual requirement verification — tests for each requirement on its own.
Cluster-level integration tests — tests that verify cross-requirement feature behavior, the exact thing the per-requirement approach loses.
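The two levels above can be sketched as a job plan: one generation job per requirement, plus one integration job per cluster that holds more than one requirement. The `GenerationJob` type, the `plan_jobs` function, the prompt wording, and the sample requirements are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GenerationJob:
    level: str      # "requirement" or "cluster"
    target_id: str  # requirement ID or cluster ID
    prompt: str     # text that would be sent to the LLM


def plan_jobs(
    clusters: Dict[str, List[str]],
    requirements: Dict[str, str],
    summaries: Dict[str, str],
) -> List[GenerationJob]:
    """Build requirement-level and cluster-level generation jobs."""
    jobs: List[GenerationJob] = []
    for cluster_id, req_ids in clusters.items():
        summary = summaries[cluster_id]
        # Level 1: verify each requirement on its own.
        for rid in req_ids:
            jobs.append(GenerationJob(
                "requirement", rid,
                f"Feature summary: {summary}\n"
                f"Write a verification test specification for {rid}: "
                f"{requirements[rid]}",
            ))
        # Level 2: integration tests need at least two requirements.
        if len(req_ids) > 1:
            listing = "\n".join(f"- {rid}: {requirements[rid]}" for rid in req_ids)
            jobs.append(GenerationJob(
                "cluster", cluster_id,
                f"Feature summary: {summary}\nRequirements:\n{listing}\n"
                "Write integration test specifications covering how these "
                "requirements interact.",
            ))
    return jobs


# Hypothetical data for the example.
clusters = {"braking": ["REQ-1", "REQ-2"], "lighting": ["REQ-3"]}
requirements = {
    "REQ-1": "Brake pressure shall build within 100 ms.",
    "REQ-2": "Brake faults shall be reported to the driver.",
    "REQ-3": "Headlamps shall switch on below the ambient light threshold.",
}
summaries = {"braking": "Brake actuation and fault reporting.",
             "lighting": "Automatic headlamp control."}
jobs = plan_jobs(clusters, requirements, summaries)
for job in jobs:
    print(job.level, job.target_id)
```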

Two more mechanisms keep the outputs grounded. A nearby-cluster context mechanism gives each LLM call bounded awareness of adjacent features—enough cross-feature context to catch interactions, without dumping the whole corpus back into the prompt. And Retrieval-Augmented Generation grounds outputs in ISO 26262 and ASPICE standards, so the generated specs are anchored to the actual normative requirements rather than the model's recollection of them.
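One plausible reading of the nearby-cluster mechanism is to attach the summaries of the k clusters whose centroids are most similar to the target cluster. The sketch below assumes cosine similarity over centroid vectors and a fixed k; the names `nearby_clusters` and `build_context`, the toy 2-D centroids, and the prompt layout are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearby_clusters(
    target: str, centroids: Dict[str, Sequence[float]], k: int = 2
) -> List[str]:
    """Return the k cluster IDs most similar to target, best first."""
    scored = [
        (cosine(centroids[target], vec), cid)
        for cid, vec in centroids.items()
        if cid != target
    ]
    # Highest similarity first; cluster ID breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored[:k]]


def build_context(
    target: str,
    centroids: Dict[str, Sequence[float]],
    summaries: Dict[str, str],
    k: int = 2,
) -> str:
    """Assemble a bounded prompt context: target summary plus k neighbors."""
    lines = [f"Target feature ({target}): {summaries[target]}"]
    for cid in nearby_clusters(target, centroids, k):
        lines.append(f"Adjacent feature ({cid}): {summaries[cid]}")
    return "\n".join(lines)


# Toy 2-D centroids; real embeddings would have hundreds of dimensions.
centroids = {
    "braking": [1.0, 0.0],
    "abs": [0.9, 0.2],
    "lighting": [0.0, 1.0],
    "infotainment": [-1.0, 0.1],
}
summaries = {cid: f"Summary of {cid} requirements." for cid in centroids}
print(build_context("braking", centroids, summaries, k=2))
```

Because k is fixed, the context added per call stays constant no matter how large the corpus grows, which is what keeps the prompt bounded.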

That nearby-cluster trick is the quiet star here. It's a middle path between "one requirement at a time" and "everything at once," and it's what lets the approach scale while keeping integration awareness.

What the evaluation shows

The authors evaluated on automotive requirement datasets of varying scale. The reported result: the cluster-aware approach improves integration test coverage and maintains summarization fidelity against baseline methods, while scaling efficiently to thousands of requirements.

Read that carefully. The headline isn't raw accuracy. It's coverage of the integration behavior that per-requirement methods structurally miss, sustained at a scale where the full-corpus approach can't even run. The abstract does not report specific coverage numbers, so treat the gains as directional rather than quantified here.

Why this matters beyond automotive

Strip out the ISO 26262 and ASPICE specifics, and what's left is a reusable pattern for any large, interdependent requirement set: cluster to recover structure, summarize to fit the context window, then generate at multiple levels using the cluster graph as scaffolding.

The design choices are the transferable part. Density-based clustering with automatic parameter selection avoids brittle hand-tuning. Map-reduce summarization that protects thresholds is a reminder that summarization fidelity is a hard requirement in regulated domains, not a nice-to-have. And bounded cross-cluster context is a pragmatic answer to the context-window problem that doesn't depend on ever-larger models.

What to watch

The open questions are the usual ones for this class of system. How well does cluster quality hold up on messier, less-structured requirement corpora? How much does integration coverage depend on the clustering getting feature boundaries right? And how do you validate the generated specs themselves? That last one is still a human review problem.

If you're building test-generation tooling for any standards-driven domain, the takeaway is concrete: don't process requirements in isolation and don't try to brute-force the whole corpus. Cluster first, preserve the numbers that matter, and use the topology to drive both unit- and integration-level generation.
