LLM Does Not Write Code


LLM Writes Text Indistinguishable from Code. With Caveats.

"You can only transform something if you can imagine the entire process at the molecular level. In principle it is possible, but devilishly difficult. You would end up with a half-alive toxic sausage covered in fur."

— Jarosław Grzędowicz, The Lord of the Ice Garden

Introduction

This article is an attempt to understand why software development with AI simultaneously looks like a revolution and so frequently disappoints in practice.

We examine the fundamental constraints facing any system that transforms human intent into executable code, and propose an engineering model of those constraints by analogy with the CAP theorem.

Building on this foundation, we explore an architectural approach that, if it cannot eliminate these constraints, can substantially mitigate them — with concrete benefits for each stakeholder in the process: the requirements owner, the developer, the QA engineer, and the DevOps engineer.

The approach is not new — domain-specific languages (DSLs) have existed for a long time. What is new is something else: the combination of LLM and DSL allows the LLM to solve a fundamentally narrower problem — translating human intent into a formal model — while a deterministic engine handles everything else. It is precisely this combination that the article examines.

The arguments presented here are grounded in personal experience developing a large industrial system based on a DSL — built before the era of LLMs, which makes it possible to assess the potential of the approach independently of the current state of language models.

Problems

I think everyone is familiar with the strange feeling you get standing in a boarding queue for an airplane. Intellectually, you know how it flies and could even explain it if asked. But your instincts stubbornly insist that this enormous hollow metal tube full of people simply cannot fly — that the whole thing must be underpinned by some kind of arcane magic.

I get much the same feeling when I observe software development with AI. What we are effectively dealing with is a transformation from "whatever you like" into "whatever you like." The input is a vague human intent expressed in natural language; the output is a formal executable artifact with an arbitrary implementation. Both ends of this transformation are fundamentally unstructured, and it is precisely in this gap that most of the problems are concentrated.

Raw natural language is, in general, a rather poor instrument for writing precise specifications (though it excels at producing foggy discourse designed to conceal the speaker’s incompetence). Even genuine domain experts, in the vast majority of cases, do not think in a sufficiently structured and systematic way for their thoughts to be directly translated into code.

On the input side, we therefore face the following fundamental problems:

  • Words and phrases quite often carry several permissible meanings, and the choice between them is not entirely clear from the context of the argument.

    • Some of these meanings are hidden assumptions — they are merely implied.

    • Moreover, it is human nature to gradually shift context as the argument unfolds, which introduces additional ambiguity.

  • Natural language requires a large number of words to express concepts that could be stated far more concisely using formal approaches.

    • This not only wastes time but also produces inefficiency: context overload, an elevated risk of hallucinations, and increased cost.

    • Writing a fragment of pseudocode is often significantly easier than attempting to describe its logic in prose.

  • The imprecision of natural language permits many different ways of expressing identical concepts, which makes it harder for language models to determine meaning and ultimately leads to unpredictable behavior.

On the output side, the situation is no better:

  • Any programming language is itself highly diverse, and the same behavior can be implemented in a large number of different ways.

  • This diversity is further multiplied by the ability to use, even within a single stack, a large number of libraries.

  • During design we are choosing from dozens of possible architectural decisions that admit different trade-offs (latency vs. consistency, simplicity vs. extensibility).

  • The difference between one implementation or configuration choice and another may appear negligible at first glance, yet prove highly significant in actual production use.

  • AI does not write code in the sense that a human developer does. AI optimizes for text plausibility, and plausible code is not the same thing as correct code.

    • Yes, such code may well compile and work fine in default happy-path scenarios.

    • But a single flag set incorrectly for your specific context, buried deep in some initialization routine, can lead to catastrophic consequences under load or to frequent failures. And that is before we even get to more insidious problems such as race conditions.

The obvious remedy is thorough code review. Unfortunately, problems arise here too:

  • AI writes code faster than we can review it — we have simply moved the human bottleneck to a different stage of the production pipeline. One can, of course, also review with AI — this reduces the probability of problems, but does not fundamentally change the picture.

  • Whoever conducts the review must be sufficiently expert to understand subtleties at both the architectural level and the level of individual libraries.

    • As code authorship shifts toward AI agents, junior engineers (who, it seems, will soon stop being hired altogether) no longer progress from simple tasks to complex ones, and the next generation of such experts simply fails to form.

    • The process by which expertise reproduces itself is breaking down — and this is perhaps the least obvious but most long-term consequence of the current trend.

Another CAP Theorem

Let us attempt to formulate an analogue of the CAP theorem (Brewer's conjecture, later proved by Gilbert and Lynch). This is, of course, not a rigorous theorem in the mathematical sense, but an engineering model useful for reasoning about the limits of LLM applicability in software development.

By "system" in this context we mean not the language model itself, but the entire transformation pipeline:

  • the informal human description of the task;

  • the context we managed to collect and supply;

  • the model’s interpretation of that description;

  • the resulting artifact;

  • the means of verifying, reviewing, executing, and further evolving it.

It is precisely within this pipeline that the central tension arises. The input is an ambiguous statement of intent; the output must be a formal, maintainable artifact. Yet the space of permissible interpretations and the solution space are both far too wide.

For a system that transforms an informal, ambiguous natural-language description into a formal program, it is impossible to simultaneously guarantee all three of the following properties in full:

C — Correctness (semantic correctness)

  • the generated result genuinely corresponds to the user’s intent;

  • the right architecture has been chosen, invariants are correct, trade-offs are appropriate;

  • there are no hidden assumptions that are critical specifically for this deployment context.

A — Adaptability (adaptability / breadth of coverage)

  • the system handles adequately a wide range of formulations, thinking styles, levels of detail, subject domains, and task types;

  • it is portable across contexts and does not require excessively rigid standardization of the input.

P — Predictability (reproducibility)

  • identical or equivalent input consistently produces identical or equivalent output;

  • behavior remains stable under minor changes to the prompt, the composition of the context, the model version, and random generation variance;

  • the result is suitable for diffing, repeatable review, and controlled evolution.

Why these properties begin to conflict with one another:

  • The higher the Adaptability, the wider the set of permissible interpretations of the input.

  • The wider the set of permissible interpretations, the harder it is to guarantee Predictability.

  • The wider both the interpretation space and the solution space simultaneously, the harder it is to guarantee Correctness — especially in the absence of an external verifier capable of confirming that the result matches the original intent.

  • Rigidly fixing Predictability usually means strongly constraining the input and output, that is, paying for it with reduced Adaptability.

  • Attempting to achieve Correctness on arbitrary tasks typically requires external verification, expert review, and iterative refinement, which takes the system outside the fully automated pipeline.

This is precisely why the naïve expectation — "if the model is smart enough, it will simply write correct code" — almost always proves overoptimistic in practice.

It might seem that Predictability is not all that important — what does it matter how the generated code looks, as long as it behaves correctly?

Unfortunately, this impression lasts exactly until the first serious fix to a system already running in production. At that moment it becomes critically important to guarantee that the patch has not broken anything:

  • the change must be sufficiently local, and its business meaning sufficiently clear, that it can be verified and explained;

  • the result must be suitable for normal diff-based review;

  • regression testing must verify precisely the change in behavior, rather than drowning in noise from unpredictably rewritten neighboring fragments.

An unpredictable code generator makes all of this significantly less reliable: even if the new version is formally functional, too large a number of incidental differences sharply complicates maintenance. In regulated environments — finance, healthcare, the public sector — this additionally creates problems during security reviews and compliance audits, where reproducibility of changes is a formal requirement.
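One rough way to make this "noise" visible is to count how many lines differ between the version in production and the candidate fix. The sketch below uses Python's standard `difflib`; the function `changed_lines` and the three code snippets are hypothetical illustrations, not a measurement method taken from a real pipeline.

```python
import difflib

def changed_lines(old: str, new: str) -> int:
    """Count lines that were added or removed between two versions of a text."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

BASE = """def fee(amount):
    rate = 0.01
    return amount * rate
"""

# A local patch: only the business rule (the rate) changes.
LOCAL_PATCH = """def fee(amount):
    rate = 0.02
    return amount * rate
"""

# A regenerated version: the same intended change, but incidentally rewritten.
REGENERATED = """def fee(value):
    percent = 2
    return value * percent / 100
"""

print(changed_lines(BASE, LOCAL_PATCH))
print(changed_lines(BASE, REGENERATED))
```

The local patch touches one line (one removal plus one addition), while the regenerated version touches every line even though the intended business change is the same. A reviewer has to re-verify all of it.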

The options we are left with are as follows:

Correctness + Predictability (−Adaptability)

  • What this looks like: strict prompt templates, rigid constraints on the output format, low model temperature, a narrow class of tasks.

  • The result is reproducible and reasonably correct, but its practical utility is limited precisely because the system does not transfer well to new task classes or non-standard formulations.

  • Examples: generating DTOs, OpenAPI specs, boilerplate code, simple migrations, standard configuration fragments.

This is the mode most developers start with when first encountering LLMs. And many, incidentally, stop here — limiting themselves to using the model as an advanced autocomplete.

Correctness + Adaptability (−Predictability)

  • What this looks like: complex tasks, rich context, output variance is acceptable, the result is then evaluated and refined by a human.

  • The system frequently produces a substantively useful answer, but a somewhat different one each time.

  • Correctness here is achieved not through reproducibility, but through a human closing the external verification and refinement loop.

  • Typical use: "LLM as architect" or "LLM as advisor."

This is a highly valuable mode, especially when developers know how to give the model high-quality context. But it is poorly suited to direct automated application of results in a production pipeline.

Adaptability + Predictability (−Correctness)

  • What this looks like: the system confidently and consistently answers a wide range of queries, but semantic correctness is not guaranteed.

  • Hallucinated APIs appear, implementation details are wrong, assumptions about the environment are false, architectural decisions are incorrect.

  • This is precisely the mode that manifests most often when LLMs are used naïvely and directly.

The result is first unwarranted optimism, followed by equally unwarranted disillusionment with LLMs in industrial software development.

Most current attempts to compensate for these shortcomings are built around this mode:

  • Constraining the output space (structural approaches)

    • Few-shot examples with explicit counterexamples — showing the model not only "what is correct" but also "what counts as a hallucination." Reduces variance without fully moving to templates.

    • Chain-of-thought with an explicit uncertainty section — the instruction: "before answering, list what you do not know." A model that has first acknowledged uncertainty hallucinates less in the subsequent step.

    • Retrieval-Augmented Generation (RAG) — the model answers only on the basis of supplied documents. Works well for domain-specific code (internal APIs, corporate libraries) where the model has no default knowledge.

    • Grammars and constrained decoding — restricting the set of admissible tokens at inference time by a schema (JSON Schema, context-free grammar). Completely eliminates structural hallucinations, but requires direct access to inference. When working through closed-model APIs, structured outputs serve as a partial analogue, though they cover only structural, not semantic, constraints.

  • Output verification (detecting approaches)

    • Compiler / linter / type checker as the first judge — automatically rejecting results that fail static analysis and triggering regeneration. Cheap and effective against syntactic and type errors.

    • Self-review via a second model call — a separate prompt: "find errors in this code, do not fix them." Preferably use a model from a different class or family: the same model tends to reproduce the same systematic errors when reviewing itself. Works better than asking the model to correct itself within a single conversation.

    • Tests generated from the specification (test cases are derived automatically and the generated code is run against them). The invariant is verified formally rather than visually.

  • Process approaches

    • Human-in-the-loop checkpoint — rather than applying the result automatically, route it to a developer for review specifically at points of uncertainty (for example, when unfamiliar external dependencies are used).

    • Iterative narrowing — start with a high-level sketch, verify it, then generate the next level of detail. Hallucinations in the details do not accumulate on top of hallucinations in the architecture.
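As an illustration of the "compiler as the first judge" item above, here is a minimal sketch of a reject-and-regenerate loop. It only checks that the candidate parses as Python, using the standard `ast` module. The `fake_llm` function is a hypothetical stand-in for a real model call, and the names `first_judge` and `generate_until_valid` are made up for this example.

```python
import ast
from typing import Callable, Optional

def first_judge(source: str) -> Optional[str]:
    """Return None if the source parses, otherwise a short error message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None

def generate_until_valid(generate: Callable[[Optional[str]], str],
                         max_attempts: int = 3) -> str:
    """Call the generator until its output passes the static check."""
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)  # previous error is passed back as feedback
        feedback = first_judge(candidate)
        if feedback is None:
            return candidate
    raise RuntimeError(f"no valid candidate after {max_attempts} attempts: {feedback}")

# Hypothetical stand-in for an LLM: the first answer is broken, the second is not.
_answers = iter([
    "def total(items:\n    return sum(items)\n",
    "def total(items):\n    return sum(items)\n",
])

def fake_llm(feedback: Optional[str]) -> str:
    return next(_answers)

result = generate_until_valid(fake_llm)
print(result)
```

A check like this filters out only syntactically invalid output. It says nothing about whether the code matches the intent, which is exactly the Correctness gap discussed above.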

None of these measures eliminates the problem entirely. All of them either reduce Adaptability or introduce an external verification loop that compensates for the deficit of Correctness via a human or a tool.

The practical way out, therefore, is usually not to force the LLM to directly produce final low-level code, but to change the form of the problem itself. Below we examine exactly such an alternative, in the form of an intermediate DSL.

DSL as a Means of Achieving Correctness + Predictability

Intuitively, from a developer’s perspective, the Correctness + Predictability mode is the most promising and comfortable. The only problem is that when applied directly, its practical utility is quite limited.

The root cause is that when generating code, the LLM is simultaneously solving three fundamentally different problems:

  • interpreting human intent

  • designing a solution

  • implementing that solution — in one of a potentially limitless number of formally admissible ways

The third problem is the most obvious source of instability. It is not that "LLMs write bad code" — it is that the solution space is practically unbounded.

However, the first problem also carries serious risk: a misinterpreted intent produces a stably reproducible but fundamentally wrong result — one that is often discovered later and proves more expensive to fix.

The solution is that the LLM should not directly produce executable low-level code.

The process becomes far more robust when the LLM produces a formal statement of intent in the form of a DSL, and a deterministic engine then executes that intent according to strict rules. Determinism here means the following: an identical DSL artifact, given a fixed engine version, always produces identical system behavior — regardless of which model generated that artifact and with what parameters.
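One way to make this notion of determinism checkable is to fingerprint the canonical form of the artifact together with the engine version: if two fingerprints match, a deterministic engine has no reason to behave differently. The sketch below assumes the DSL artifact can be represented as JSON-like data; `ENGINE_VERSION`, `artifact_fingerprint`, and the sample artifacts are hypothetical.

```python
import hashlib
import json

ENGINE_VERSION = "1.4.0"  # hypothetical engine version

def artifact_fingerprint(artifact: dict, engine_version: str = ENGINE_VERSION) -> str:
    """Hash the canonical form of a DSL artifact together with the engine version."""
    # sort_keys and fixed separators remove differences in key order and whitespace.
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    payload = f"{engine_version}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Two artifacts that differ only in key order, as if produced by different models.
from_model_a = {"entity": "PaymentOrder", "fields": [{"name": "amount", "type": "money"}]}
from_model_b = {"fields": [{"type": "money", "name": "amount"}], "entity": "PaymentOrder"}

print(artifact_fingerprint(from_model_a) == artifact_fingerprint(from_model_b))
```

The fingerprint does not prove that the engine is deterministic; it only gives review and audit a stable identifier for "the same intent on the same engine".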

Under this approach, the LLM is responsible for a fundamentally narrower task:

  • extracting intent from the description

  • populating a formal model

  • selecting from a constrained set of patterns

Why this eases the situation:

  • When an LLM works across different programming languages, frameworks, and configuration formats (e.g. Java / Spring / SQL / YAML or JSON configs), the space of variants is enormous.

  • When an LLM writes DSL, the picture is fundamentally different: a finite set of base components, a finite set of their combinations, a strict schema, explicit required fields, a finite set of semantics.

  • Instead of "devise the entire solution from scratch," the model solves a far narrower problem: it classifies the request, selects an appropriate archetype from a finite set, fills in the parameters, and specifies dependencies and constraints.

  • In other words, instead of open-ended generation, the result is constrained generation — generation within defined boundaries.
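To give a feel for what "a strict schema, explicit required fields" can mean in practice, here is a minimal sketch of a validator for a toy DSL. The archetypes, field types, and the `validate` function are invented for illustration and do not describe any particular DSL.

```python
# Hypothetical finite vocabulary of the DSL.
ARCHETYPES = {
    "reference_book": {"required": {"name", "key_field"}},
    "document": {"required": {"name", "fields", "lifecycle"}},
}
FIELD_TYPES = {"string", "date", "money", "reference"}

def validate(artifact: dict) -> list[str]:
    """Return a list of violations; an empty list means the artifact is admissible."""
    archetype = artifact.get("archetype")
    if archetype not in ARCHETYPES:
        return [f"unknown archetype: {archetype!r}"]
    missing = ARCHETYPES[archetype]["required"] - set(artifact)
    errors = [f"missing required field: {name}" for name in sorted(missing)]
    for field in artifact.get("fields", []):
        if field.get("type") not in FIELD_TYPES:
            errors.append(f"field {field.get('name')!r}: unknown type {field.get('type')!r}")
    return errors

good = {
    "archetype": "document",
    "name": "PaymentOrder",
    "fields": [{"name": "amount", "type": "money"}],
    "lifecycle": ["draft", "signed"],
}
bad = {
    "archetype": "document",
    "name": "PaymentOrder",
    "fields": [{"name": "amount", "type": "float"}],
}

print(validate(good))
print(validate(bad))
```

Whatever the model produces either fits this finite vocabulary or is rejected with a precise message that can be fed back into the next generation attempt.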

Size Matters

Beyond everything else, a DSL is significantly more compact as text than artifacts in conventional languages. The context window is consumed more slowly, there is less noise in it, terms are less ambiguous — all of which further increases the predictability of the result.

Less code means fewer tokens. On the surface this seems trivial, but when a system is large, the volume of code becomes very substantial. Consider, for example, a corporate remote banking system for organizations:

  • On the order of a thousand tables in the database (which is, incidentally, very few — SAP R/3 contains on the order of 90,000–100,000 tables, and in real-world installations with customization this figure can be substantially higher)

  • On the order of several hundred entities (documents, reference books), each with its own lifecycle comprising several dozen states

  • On the order of several thousand UI primitives (entity tables, editing / viewing / operation-parameter forms for various use cases and user categories, global / contextual menus, and so on)

For the simplest task of adding a new field to an entity, at minimum the following components must be touched (assuming a classic REST-based CRUD design):

  • The entity editing form in the UI (and often more than one, for different user categories and accounting operations)

  • The entity list table, if the field is used as a column or filter

  • The controller data model and its validation rules

  • The business service data model and its validation rules

  • The fragment of the business service that uses the new field (otherwise why add it at all?)

  • The ORM data model and its validation rules

  • Converters between these three models

  • The database schema migration

  • Tests for all of the above changes

And that is in the fortunate case where the new field is not mandatory. If it is mandatory — things are even worse.

Exactly the same problems arise during refactoring. From personal observation, refactoring is precisely what models find hardest of all, and the reason is trivial: beyond the sheer volume of code to analyze, refactoring requires simultaneously holding two states of the system in context, plus tracking all dependencies between the modified fragments — which increases pressure on the context window in a non-linear fashion.

When a feature touches several entities, the total volume of code that must be kept in context becomes genuinely very large. And as the volume grows, so too — non-linearly — does the number of inter-fragment dependencies the model must track simultaneously.

A DSL is more compact than a raw implementation by orders of magnitude — and this substantially reduces pressure on the context window: the model needs to hold less information, there are fewer connections between its parts, and both accuracy and predictability of the result improve accordingly.
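The "add a new field" example can be made concrete with a toy sketch. Assume a hypothetical DSL in which an entity is described once, and the schema migration, the form controls, and the required-field validation are all derived from that description. All names below (`SQL_TYPES`, `migration`, `form_fields`, `required_names`) are illustrative.

```python
SQL_TYPES = {"money": "NUMERIC(18, 2)", "string": "VARCHAR(255)"}

# Hypothetical DSL description of an entity before and after adding "purpose".
OLD = {"name": "payment_order",
       "fields": [{"name": "amount", "type": "money", "required": True}]}
NEW = {"name": "payment_order",
       "fields": OLD["fields"] + [
           {"name": "purpose", "type": "string", "required": False},  # the only edit
       ]}

def migration(old: dict, new: dict) -> list[str]:
    """Derive ALTER TABLE statements for fields present in `new` but not in `old`."""
    known = {f["name"] for f in old["fields"]}
    return [
        f"ALTER TABLE {new['name']} ADD COLUMN {f['name']} {SQL_TYPES[f['type']]};"
        for f in new["fields"] if f["name"] not in known
    ]

def form_fields(entity: dict) -> list[dict]:
    """Derive draft form controls: one labelled input per field."""
    return [{"label": f["name"].replace("_", " ").capitalize(), "required": f["required"]}
            for f in entity["fields"]]

def required_names(entity: dict) -> list[str]:
    """Derive the fields that validation must reject when empty."""
    return [f["name"] for f in entity["fields"] if f["required"]]

print(migration(OLD, NEW))
print(form_fields(NEW))
print(required_names(NEW))
```

The model (or the developer) edits a single line of the description; everything else is the engine's job and never enters the context window.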

The cost of development deserves separate attention.

  • Individual tokens are cheap and continue to get cheaper — but here it is worth recalling Jevons' paradox: technological progress that increases the efficiency of resource use has historically led not to a reduction but to an increase in total consumption of that resource.

  • Cheaper tokens incentivize offloading an ever-larger volume of work onto the LLM — and total expenditure grows along with the ambition of the tasks. The joke that "carbon-based coding agents in the form of junior engineers" are not such a bad alternative has lately been sounding less and less like a joke.

DSL directly reduces this pressure: a compact formal description requires orders of magnitude fewer tokens than the equivalent low-level code — and this applies equally to generation, review, and iterative refinement. At sufficient system scale, the difference in cost ceases to be negligible.

What Else This Approach Gives Us

There is nothing new in this approach. As David Wheeler observed, all problems in computer science are solved — and generated — by another level of indirection ("All problems in computer science can be solved by another level of indirection"). Each level of abstraction trades breadth of available actions for ease of performing typical operations.

What is most valuable here, however, is that this abstraction is expressed in business terms and formally describes user intent — rather than being yet another technical layer intelligible only to developers. This yields concrete advantages for each party involved:

  • the party stating the intent operates in domain concepts and does not think about implementation details;

  • the party executing it receives an unambiguous formal contract instead of a vague natural-language description.

In addition, both sides of this contract gain the ability to evolve independently of each other — provided the DSL schema itself remains stable. This is the key effect:

  • complexity does not disappear, but it stops multiplying.

  • Without a DSL, the technical complexity of implementation and the business complexity of the description are intertwined in a single codebase: a change in requirements forces a change in implementation; a change in the technology stack forces a change in business logic.

  • The DSL severs this connection, isolating both components on opposite sides of a formal contract.

The total complexity of the system thereby changes from a product of two factors to their sum — which, at sufficient scale, yields a fundamentally different level of controllability and predictability.
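A back-of-the-envelope reading of this claim, under the simplifying assumption that complexity can be counted as B business rules and T technical concerns (both symbols and the numbers are purely illustrative): when the two are intertwined, each rule may have to be revisited for each concern; when they are separated by the DSL contract, each is maintained on its own side.

```latex
C_{\text{intertwined}} \sim B \cdot T, \qquad C_{\text{separated}} \sim B + T

% Illustrative numbers only: B = 200, T = 20
C_{\text{intertwined}} \sim 200 \cdot 20 = 4000, \qquad C_{\text{separated}} \sim 200 + 20 = 220
```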

Let us now look at the concrete benefits each key stakeholder receives. I will try to focus primarily on features related to AI — but I cannot pass over in silence the other remarkable possibilities this approach opens up.

For the Requirements Owner

It becomes easy to analyze generated code: the concentration of business information in DSL descriptions is significantly higher, and the result is far more comprehensible to a non-specialist.

In large systems, different requirements owners often formulate requirements independently and contradictorily. An LLM working with an existing DSL model can identify such contradictions — for example: "the calculations module assumes that a document in status X is available for modification, while the approval module locks it at that same status."
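Once such assumptions are stated explicitly in a formal model, some contradictions of this kind can be caught even by a plain deterministic check, before any LLM is involved. A minimal sketch, assuming hypothetical rules of the form "module M treats document D in status S as editable or not":

```python
from collections import defaultdict

# Hypothetical rules extracted from the DSL descriptions of two modules.
RULES = [
    {"module": "calculations", "document": "invoice", "status": "X", "editable": True},
    {"module": "approval", "document": "invoice", "status": "X", "editable": False},
    {"module": "approval", "document": "invoice", "status": "draft", "editable": True},
]

def find_conflicts(rules: list[dict]) -> list[tuple]:
    """Return (document, status, modules) triples where modules disagree on editability."""
    by_state = defaultdict(dict)
    for rule in rules:
        by_state[(rule["document"], rule["status"])][rule["module"]] = rule["editable"]
    conflicts = []
    for (document, status), verdicts in sorted(by_state.items()):
        if len(set(verdicts.values())) > 1:  # both True and False are present
            conflicts.append((document, status, sorted(verdicts)))
    return conflicts

print(find_conflicts(RULES))
```

The LLM's contribution is then the harder part: extracting such rules from descriptions and explaining the conflict to the requirements owners in their own terms.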

The application is assembled from standard components with measurable complexity — which yields predictable development cost. Cost is still measured in money and time, but tokens are becoming an increasingly significant component of it, and DSL directly affects their consumption. It should also be noted that the need for manual refinement of DSL descriptions and review has not gone away.

A generated DSL description can be translated back into a requirements specification.

  • Through the cycle RS1 → DSL1 → formal validation → DSL2 → RS2 → stakeholder analysis → RS3 → DSL3 and so on, the system surfaces specification errors and ambiguities and makes them explicit.

  • This is not merely Human-in-the-Loop (HITL) development — it is simultaneously a learning process: the requirements owner gradually comes to understand what constitutes a high-quality problem statement from the system’s perspective.

The chronic problem of requirements specification becoming stale is resolved automatically:

  • Since the DSL description is the single source of truth about system behavior, it is always current by definition.

  • This is especially valuable when company processes are not well-established and the details of various feature implementations are scattered across tickets, Confluence pages, Slack discussions, and email threads.

  • In effect, the application source code becomes the specification of its own behavior.

Chat with data becomes possible — asking questions about the current state of the application in natural language:

  • in which forms are particular reference data elements used? which operations and workflows modify property X of a given document type?

  • with a dependency graph between DSL artifacts, it becomes possible to get early feedback that a seemingly simple feature actually touches a dozen components and will be quite expensive to implement.
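The early-feedback idea from the last item can be sketched as a reverse traversal of the dependency graph: given a changed artifact, collect everything that depends on it directly or transitively. The graph and the artifact names below are hypothetical.

```python
from collections import deque

# Hypothetical dependency graph: artifact -> artifacts it depends on.
DEPENDS_ON = {
    "payment_form": ["payment_order"],
    "payment_list": ["payment_order"],
    "payment_order": ["account", "currency"],
    "statement_report": ["account"],
    "account": [],
    "currency": [],
}

def impacted_by(changed: str, depends_on: dict) -> set:
    """Return every artifact that directly or transitively depends on `changed`."""
    # Invert the graph: artifact -> artifacts that depend on it.
    dependents = {}
    for artifact, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(artifact)
    seen, queue = set(), deque([changed])
    while queue:  # breadth-first walk over the inverted graph
        for parent in dependents.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(impacted_by("account", DEPENDS_ON)))
```

The size of the returned set is a crude but immediate estimate of how far a "small" change reaches.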

Investigation of actual product behavior is facilitated through analysis of calls at the DSL boundary. It is, frankly, remarkable how much effort we expend imagining the future user’s behavior — and how little interest is typically shown in finding out what they actually do.

  • It becomes possible to analyze how customers actually use the system: which scenarios carry the heaviest load, where friction arises, which behavioral patterns have emerged organically — and all of this in business rather than technical terms.

  • This data can be used to classify users and the operations they perform — for example, for fraud analysis.

This kind of analysis can be implemented for any system, but with a DSL it becomes available automatically.

Since DSL descriptions contain a wealth of business detail, an LLM can use one set of descriptions to generate realistic prototypes of others.

  • For example, from a data model one can automatically obtain draft editing forms, entity tables, lists of validation checks, and so on.

  • This approach is not new — it is called scaffolding and is widely used in systems such as Grails, Yii, Ruby on Rails, and Django for rapid initial prototyping.

  • The ability to account for data semantics substantially extends its applicability: a field of type "currency amount" automatically receives an appropriate widget and validation rather than a plain text field — and a real-world form may contain dozens of such semantically grounded decisions.
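A minimal sketch of such semantics-aware scaffolding: each semantic type maps to a widget and a validator, and a draft form is assembled from the data model. The type names, widget names, and the rule "non-negative, at most two decimal places" are assumptions made for the example.

```python
from decimal import Decimal, InvalidOperation

def valid_money(text: str) -> bool:
    """Accept non-negative decimal amounts with at most two decimal places."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    return value.is_finite() and value >= 0 and value.as_tuple().exponent >= -2

# Hypothetical mapping from semantic type to UI widget and validation rule.
SEMANTIC_TYPES = {
    "money": {"widget": "currency_input", "validate": valid_money},
    "text": {"widget": "text_input", "validate": lambda text: True},
}

def scaffold_form(fields: list[tuple]) -> list[dict]:
    """Build draft form controls from (name, semantic_type) pairs."""
    form = []
    for name, semantic_type in fields:
        # Unknown semantic types fall back to a plain text field.
        spec = SEMANTIC_TYPES.get(semantic_type, SEMANTIC_TYPES["text"])
        form.append({"field": name, "widget": spec["widget"], "validate": spec["validate"]})
    return form

form = scaffold_form([("amount", "money"), ("purpose", "text")])
print([(control["field"], control["widget"]) for control in form])
```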

For the Product Owner

DSL descriptions can not only independently define the behavior of individual system components from scratch, but can also use existing descriptions as a base or modify their behavior. This opens up the possibility of developing "extension" modules of various types:

  • for a specific client, without touching the components of the core product, one can develop a module that adds new fields to a form and to the database schema;

  • an extension module can carry additional functionality for all clients who have purchased a particular service package;

  • an auditor’s access form for a given entity can use a standard editing form as its base, overriding field visibility and availability and adding a dedicated history tab;

  • a lifecycle fragment common to a group of documents can be extracted into an extension module for reuse and applied to documents conditionally — based on user or organization configuration.

This is essentially Aspect-Oriented Programming (AOP), but at the business level rather than the technical level — with the caveat that extension modules not only add behavior to existing artifacts, as classical aspects do, but can also extend the structure of existing descriptions.

  • The closest technical analogy here is not so much pure AOP as a combination of aspects with trait composition.

  • We control the business aspects of application behavior both statically — through the set of modules included at build time — and dynamically — through conditional application of extension modules based on client / organization data and similar criteria.
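To illustrate how an extension module can both override behavior and extend structure, here is a toy sketch. It assumes a form description is plain data and an extension is a set of additions and overrides guarded by a condition; the module names, the `applies_if` convention, and the sample context are hypothetical.

```python
import copy

BASE_FORM = {
    "entity": "payment_order",
    "fields": {
        "amount": {"visible": True, "editable": True},
        "purpose": {"visible": True, "editable": True},
    },
    "tabs": ["main"],
}

# Overrides behavior for auditors and adds a history tab.
AUDITOR_EXTENSION = {
    "applies_if": lambda ctx: ctx.get("role") == "auditor",
    "override_fields": {"amount": {"editable": False}, "purpose": {"editable": False}},
    "add_tabs": ["history"],
}

# Extends structure for one specific client.
CLIENT_EXTENSION = {
    "applies_if": lambda ctx: ctx.get("client") == "acme",
    "add_fields": {"cost_center": {"visible": True, "editable": True}},
}

def apply_extensions(base: dict, extensions: list, ctx: dict) -> dict:
    """Return a new description with every applicable extension merged in."""
    result = copy.deepcopy(base)  # the base description is never modified
    for ext in extensions:
        if not ext["applies_if"](ctx):
            continue
        for name, props in ext.get("add_fields", {}).items():
            result["fields"][name] = dict(props)
        for name, props in ext.get("override_fields", {}).items():
            result["fields"][name].update(props)
        result["tabs"].extend(ext.get("add_tabs", []))
    return result

EXTENSIONS = [AUDITOR_EXTENSION, CLIENT_EXTENSION]
auditor_form = apply_extensions(BASE_FORM, EXTENSIONS, {"role": "auditor"})
acme_form = apply_extensions(BASE_FORM, EXTENSIONS, {"client": "acme"})
print(auditor_form["tabs"], sorted(acme_form["fields"]))
```

Static control corresponds to which extensions are in the list at all; dynamic control corresponds to the `applies_if` condition evaluated per request.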

This opens up a fundamentally new way of managing large products.

  • Previously, one typically had to either keep everything in a single project and manage feature distribution through feature flags, or fork the repository — with the well-known synchronization problems that entails.

  • The popular plugin architecture partially addresses this, but requires explicit upfront design and is generally limited to the technical level.

  • DSL modules operate at the business level and do not require extension points to be defined in advance: any aspect of behavior can be refined by a module, provided the DSL can express it.

The product can now be split into many separate modules, each developed independently, without the need to modify anyone else’s code. For example, in a corporate remote banking system for large organizations one might identify the following parts:

  • Core product — only the key entities and services: account management, organizations, individuals, and so on;

  • core functional modules — RUB payment order, FX payment order, and so on;

  • optional functional modules — loan servicing, budgeting;

  • cross-functional modules — for example, a multi-level document processing authorization scheme (first / second / endorsing signature);

  • implementation modules — specialized customizations developed separately for various large clients. Based on real-world implementation experience in the Russian corporate sector, such customization could affect up to 10–15% of the total system functionality.

The core product and individual modules are developed independently and in parallel:

  • purely technical complexity is isolated within the DSL engine;

  • industry-specific business complexity resides in the core product and the main functional modules;

  • functionality available according to license resides in the optional and cross-functional modules;

  • client-specific features reside in the implementation modules.

Since developing extension modules does not require modifying the source code of existing modules — and in most cases does not even require access to it — technically advanced clients can develop their own modules given an appropriate SDK.

  • This opens up a qualitatively different model of client engagement: part of the customization naturally shifts to the client side, relieving the development team and shortening the implementation cycle.

  • From the author’s personal experience, approximately 10% of clients made use of this capability.

For the Developer

Many problems can be identified early through formal validation of DSL artifacts and their cross-references. Such problems are the cheapest to fix.
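A minimal sketch of such cross-reference validation, assuming a toy model in which forms refer to entities and their fields by name (the entities, forms, and the `check_references` function are hypothetical):

```python
# Hypothetical DSL model: entities with their fields, and forms that refer to them.
ENTITIES = {"payment_order": {"amount", "purpose", "status"}}

FORMS = [
    {"name": "payment_edit", "entity": "payment_order", "fields": ["amount", "purpose"]},
    {"name": "payment_audit", "entity": "payment_order", "fields": ["amount", "signed_by"]},
    {"name": "loan_edit", "entity": "loan", "fields": ["rate"]},
]

def check_references(entities: dict, forms: list) -> list[str]:
    """Report forms that refer to unknown entities or unknown fields."""
    errors = []
    for form in forms:
        known = entities.get(form["entity"])
        if known is None:
            errors.append(f"{form['name']}: unknown entity '{form['entity']}'")
            continue
        for field in form["fields"]:
            if field not in known:
                errors.append(f"{form['name']}: unknown field '{form['entity']}.{field}'")
    return errors

for error in check_references(ENTITIES, FORMS):
    print(error)
```

In conventional code, a dangling reference like `signed_by` often surfaces only at runtime; here it is reported before anything is deployed.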

Even when the LLM’s output requires manual refinement, the work is significantly easier: less expertise is required, the patterns of application are uniform and transfer readily across tasks. The learning curve is considerably shallower.

Developers have less room for error, and the errors themselves are far more uniform:

  • Error descriptions together with the corresponding fix can accumulate in a knowledge base and be used for automatic diagnosis of new cases.

  • In this way, expertise that previously existed only in the heads of experienced engineers is naturally codified: the system begins to "know" — much like a seasoned senior engineer — exactly what went wrong from a single glance at the log.

  • Errors become classifiable and thereby provide excellent material for improving the DSL itself.
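
A minimal sketch of such a knowledge base, assuming errors are recognized by matching regular expressions against log lines. The `KnownError` structure, the patterns, and the log messages are hypothetical illustrations, not taken from the system described in this article.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownError:
    """One knowledge-base entry: how to recognize an error and how to fix it."""
    pattern: str    # regular expression matched against a log line
    diagnosis: str
    fix: str


# Hypothetical entries; a real knowledge base would grow from resolved incidents.
KNOWLEDGE_BASE = [
    KnownError(
        pattern=r"Unknown field '(?P<field>\w+)' in form '(?P<form>\w+)'",
        diagnosis="The form references a field missing from the data model.",
        fix="Add the field to the data model or remove it from the form.",
    ),
    KnownError(
        pattern=r"No transition from '(?P<src>\w+)' to '(?P<dst>\w+)'",
        diagnosis="An operation requested a transition the lifecycle does not define.",
        fix="Declare the transition in the lifecycle or correct the operation.",
    ),
]


def diagnose(log_line: str) -> Optional[dict]:
    """Return the first matching entry plus the details captured from the log line."""
    for entry in KNOWLEDGE_BASE:
        match = re.search(entry.pattern, log_line)
        if match:
            return {
                "diagnosis": entry.diagnosis,
                "fix": entry.fix,
                "details": match.groupdict(),
            }
    return None  # unknown error: a candidate for a new knowledge-base entry
```

Because DSL-level errors are uniform, a lookup this simple can already cover a useful share of cases. Lines that match nothing are exactly the material for new entries.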

It becomes easier to extract from each PR / release not "how" something was done, but "what" — in business terms.

  • Accordingly, any audit, review, and compliance process becomes significantly easier.

  • Communication between separate developer groups is also facilitated.

The inevitable gap between what the LLM knows and the current state of the API is eliminated.

  • A well-known problem: the LLM does not know the new API. In real-world development of large systems, however, the problem is often the reverse — the LLM knows precisely the new API, while the team has not yet migrated to it.

  • With a DSL engine, migrating the application between API versions, and even between frameworks, becomes fundamentally simpler: only the engine migrates, and the products remain untouched. This naturally holds only for the portion of functionality covered by the DSL. Low-level extensions will require manual work when switching frameworks, but their volume is bounded by definition.

DSL descriptions exist in a rich context, which opens up additional possibilities for automated processing:

  • Refactoring from one DSL version to another becomes easier.

  • Analysis of usage patterns becomes possible, which can inform the evolution of the DSL as a product:

    • if certain parts are almost never used, they can be removed from the DSL;

    • if low-level customization is frequently applied to certain DSL fragments, this is a signal to augment the DSL with the corresponding standard cases.

  • DSL descriptions provide a natural basis for automatically generating regression tests: when the DSL version changes, it becomes possible to automatically verify that the behavior of all existing entities has not changed.

A DSL consists essentially of text descriptions and is easy to version-control.

  • For entities undergoing long processing workflows, it is straightforward to save a snapshot of the system at the moment the workflow began. For example, a contract created a year ago will be rendered correctly in the interface using the form that was current at the time — because the system uses exactly the DSL revision in which the document was created. The same applies, for example, to the lifecycle schema and the set of data validation rules.

  • Different versions can be activated conditionally: per user, per organization, per entity creation time, and so on.
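
The revision pinning described above can be sketched as follows. This is an illustration under stated assumptions: the `DslRevision` structure and the sample forms are hypothetical, and the revision is resolved from the creation date. Storing the revision number on the entity itself would work equally well.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DslRevision:
    """A versioned snapshot of DSL descriptions (hypothetical structure)."""
    number: int
    active_from: date
    forms: dict  # form name -> list of field names shown in that form


REVISIONS = [
    DslRevision(1, date(2023, 1, 1), {"contract": ["number", "amount"]}),
    DslRevision(2, date(2024, 6, 1), {"contract": ["number", "amount", "currency"]}),
]


def revision_for(created_on: date) -> DslRevision:
    """Pick the latest revision that was already active when the entity was created."""
    candidates = [r for r in REVISIONS if r.active_from <= created_on]
    if not candidates:
        raise LookupError(f"no DSL revision was active on {created_on}")
    return max(candidates, key=lambda r: r.active_from)


def form_fields(entity_type: str, created_on: date) -> list:
    """Resolve the form for an entity using the revision it was created under."""
    return revision_for(created_on).forms[entity_type]
```

A contract created under revision 1 keeps its two-field form even after revision 2 adds `currency`. The same lookup could be keyed by user or organization to activate revisions conditionally.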

A DSL fundamentally simplifies onboarding of new developers.

  • Instead of spending months working through thousands of lines of code, a new developer reads compact DSL descriptions in business terms and quickly acquires a coherent understanding of what the system does.

  • The distance between "I understand the system" and "I can make changes" shrinks: the first tasks can be performed without diving into the details of the engine implementation.

  • An LLM supplied via RAG with a corpus of DSL descriptions for a specific product can act as an interactive guide — answering questions about system behavior, explaining dependencies between components, and pointing to exactly where changes need to be made to address a specific task.

The approach naturally fosters specialization within the development team. Developers differentiate by task type in accordance with their inclinations and strengths:

  • technology-oriented engineers — work on the DSL engine, infrastructure, performance;

  • business-task-oriented engineers — design DSL descriptions, develop domain expertise, work with requirements owners;

  • customer-interaction-oriented engineers (yes, these exist too) — handle the problems and specifics of individual implementations.

Each type works with code that matches their natural inclinations. Specialization and division of labor have long been recognized as the foundation of productive manufacturing, and a DSL makes them applicable to software development. Previously, all three roles were inevitably mixed together in a single project and a single codebase.

For DevOps Engineers

Thanks to the single engine, dependency management and security patching of infrastructure components are substantially simplified. A vulnerability in a serialization library or HTTP client is fixed once at the engine level and automatically closed across all products simultaneously — rather than patching each product separately and coordinating the release of multiple deployments.

The DSL boundary naturally defines the contract for chaos engineering and fault injection. Since all calls to external systems, queues, and databases pass through the engine, the DevOps engineer has a single point for injecting latency, errors, and dependency unavailability — without having to negotiate intervention points with each development team separately for each product.

Capacity planning is substantially simplified: since each type of DSL operation has a measurable and stable resource consumption profile, load forecasting becomes a calculation based on statistics rather than an expert estimate. Growth in the number of documents of type X in status Y translates directly into a CPU, memory, and IOPS forecast — through the known characteristics of the corresponding DSL operations.
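
As an illustration of forecasting by calculation, the sketch below multiplies expected operation counts by per-operation resource profiles. The profile numbers, the operation names, and the `OperationProfile` structure are hypothetical. In practice the profiles would come from measurements collected at the DSL boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationProfile:
    """Measured average cost of one DSL operation (all numbers hypothetical)."""
    cpu_ms: float
    io_ops: float


# Keyed by (document type, operation).
PROFILES = {
    ("payment", "sign"): OperationProfile(cpu_ms=12.0, io_ops=4.0),
    ("payment", "post"): OperationProfile(cpu_ms=30.0, io_ops=10.0),
}


def forecast(daily_volume: dict) -> dict:
    """Translate expected operation counts per day into total resource demand."""
    cpu_ms = 0.0
    io_ops = 0.0
    for operation, count in daily_volume.items():
        profile = PROFILES[operation]
        cpu_ms += profile.cpu_ms * count
        io_ops += profile.io_ops * count
    return {"cpu_seconds": cpu_ms / 1000.0, "io_ops": io_ops}
```

For example, 1,000 signing operations and 500 posting operations per day give 12 × 1,000 + 30 × 500 = 27,000 ms of CPU time, or 27 CPU-seconds, and 4 × 1,000 + 10 × 500 = 9,000 I/O operations.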

Configuring the DSL engine simplifies deployment of the system in environments of varying capacity.

  • For example, to test the functionality of a complex distributed system, one can assemble a cheap-to-deploy monolith with an embedded database and in-memory queues (provided the engine is deterministic at initialization and does not pull in external dependencies at startup).

    • Since such a configuration provides a fully reproducible initial state without the need to clean up external dependencies, writing automated tests becomes significantly easier: before each test it is sufficient simply to restart the entire system.

    • Deployment for QA on the principle of "one tested feature — one instance" is also greatly simplified.

Such configuration variability would be prohibitively expensive for each individual product. In the case of a DSL engine, the cost of implementing it is spread across all products at once and ceases to be any individual product’s problem.

The DSL boundary is a natural point for collecting metrics, traces, and logs in a unified format.

  • DevOps engineers gain observability at the DSL boundary without having to negotiate tools and logging formats with each development team. Integration with specific tools — Prometheus, Grafana, OpenTelemetry — is configured once at the engine level and automatically propagates to all products.

  • Particularly valuable is the fact that observability in this case operates on two levels simultaneously.

    • Technical metrics (latency, throughput, error rate) are complemented by metrics in business terms: how many documents of type X transitioned to status Y in a given period, at which workflow step errors occur most frequently, which operations are performed outside typical working hours.

    • This data is valuable not only for DevOps engineers but also for the business — and it appears automatically, without additional instrumentation effort. Management sees not RPS and p99 latency, but "number of applications processed on time" and "share of customers who encountered an error during checkout."

Versioning DSL descriptions enables reliable rollback — not just of code, but of system behavior. It is clearly defined what exactly is being rolled back and which components are affected.

For QA Engineers

Many tests can be executed automatically without involving QA engineers or developers.

  • For example, for an entity collection filtering form, one can verify all possible combinations of filter values — and for each combination verify not only that the resulting query executes correctly against the database, but also that all combinations hit the indexes and none of them triggers a full table scan.

  • When dealing with collections of several million records, this is more than an academic concern.
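
The index check can be sketched with SQLite, whose `EXPLAIN QUERY PLAN` statement describes how a query would be executed. SQLite is used here only because it ships with Python; the table, the filter fields, and the restriction to equality filters are assumptions made for the illustration. Other databases expose query plans through their own mechanisms.

```python
import sqlite3
from itertools import combinations

FILTER_FIELDS = ["status", "created_at", "amount"]  # hypothetical filter form


def combinations_with_full_scan(conn: sqlite3.Connection) -> list:
    """Return every filter combination whose query plan contains a table scan."""
    offenders = []
    for size in range(1, len(FILTER_FIELDS) + 1):
        for combo in combinations(FILTER_FIELDS, size):
            where = " AND ".join(f"{field} = ?" for field in combo)
            sql = f"EXPLAIN QUERY PLAN SELECT id FROM documents WHERE {where}"
            # Placeholder values are enough: only the plan is produced,
            # the query itself is not executed.
            plan = conn.execute(sql, ["x"] * len(combo)).fetchall()
            # The last column of each plan row is a textual description;
            # SQLite labels a pass over all rows as "SCAN ..." and an
            # index lookup as "SEARCH ...".
            if any(row[-1].startswith("SCAN") for row in plan):
                offenders.append(combo)
    return offenders


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY, status TEXT, created_at TEXT, amount REAL
    );
    CREATE INDEX idx_documents_status ON documents (status);
    CREATE INDEX idx_documents_created_at ON documents (created_at);
    """
)
```

With the two indexes defined above, the only combination expected to be reported is the filter on `amount` alone, since every other combination includes an indexed field. Because the filter form is declared in the DSL, the list of fields does not have to be maintained by hand.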

Automatic verification of business logic coverage becomes possible:

  • Since the DSL contains an explicit description of all statuses, transitions, and validation rules, test coverage can be computed automatically — not at the level of lines of code, but at the level of business scenarios.

  • The QA engineer sees not "87% of lines covered" but "transitions from status X to Y under condition Z are not covered" — which resolves the chronic problem of high line coverage coexisting with low coverage of real business scenarios.
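
A minimal sketch of coverage computed over business scenarios rather than lines of code. It assumes the lifecycle is available as a set of declared transitions and that the test run reports which transitions it exercised. The statuses shown are hypothetical.

```python
# Hypothetical lifecycle taken from a DSL description: (from_status, to_status).
LIFECYCLE_TRANSITIONS = {
    ("draft", "signed"),
    ("signed", "sent"),
    ("sent", "executed"),
    ("sent", "rejected"),
}


def transition_coverage(exercised: set) -> dict:
    """Compare transitions exercised by tests with those declared in the DSL."""
    unknown = exercised - LIFECYCLE_TRANSITIONS
    if unknown:
        raise ValueError(f"tests exercised undeclared transitions: {sorted(unknown)}")
    uncovered = LIFECYCLE_TRANSITIONS - exercised
    covered = len(LIFECYCLE_TRANSITIONS) - len(uncovered)
    return {
        "coverage": covered / len(LIFECYCLE_TRANSITIONS),
        "uncovered": sorted(uncovered),
    }
```

The `uncovered` list is the actionable part of the report: it names the missing business scenarios directly. Extending the key with a guard condition would give the "status X to Y under condition Z" granularity described above.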

The DSL definition of the interface automatically provides a DSL for testing based on high-level user actions.

  • It becomes possible to write tests in terms of user actions: "entered a value in a field, clicked a button, observed the document status change." In terms of abstraction level, this is even higher than the classic Page Object pattern, described by Martin Fowler and widely used in UI automation: UI details are completely hidden behind user actions.

  • Such test scenarios can be generated by an LLM from DSL descriptions, and also verified against the requirements specification after manual editing — that is, we again activate the Human-in-the-Loop (HITL) cycle. It should be noted that automatically generated scenarios primarily cover the happy path: edge cases and negative scenarios require explicit involvement of a QA engineer.

In essence, such test scenarios are written in the language of user documentation and can serve as its direct basis: a test looks like "execute procedure N from the user guide with the following data and verify the result." We are testing not just functionality, but the user’s perspective on functionality.

Based on a collection of DSL descriptions, an MCP server can be straightforwardly implemented with human-oriented tools for accessing system functionality.

  • A developer gains the ability to run user scenarios through an LLM agent directly, without involving QA for each iteration.

  • For a QA engineer, a test scenario in text form becomes an executable artifact: the LLM agent interprets it and performs the actions through the tool interface — without writing any code.

By programmatically recording actual user behavior at the DSL boundary, we gain the ability to automatically:

  • generate test scenarios illustrating normal system behavior;

  • record the sequence of actions that leads to an error — given reproducible data state, this provides a reliable way to reproduce production incidents.
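
Recording at the DSL boundary can be sketched as follows. The `UserAction` fields and the session-based grouping are assumptions made for illustration; a real engine would decide what identifies a user session and which payload data may be stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserAction:
    """One user action observed at the DSL boundary (hypothetical structure)."""
    session: str
    component: str  # DSL component the action was addressed to
    action: str
    payload: dict


class ActionRecorder:
    """Collects actions in the order they cross the DSL boundary."""

    def __init__(self) -> None:
        self._actions = []

    def record(self, action: UserAction) -> None:
        self._actions.append(action)

    def scenario(self, session: str) -> list:
        """Return the ordered steps of one session as a replayable scenario."""
        return [
            {"component": a.component, "action": a.action, "payload": a.payload}
            for a in self._actions
            if a.session == session
        ]
```

The extracted steps contain no technical details of the UI, so the same list can serve either as a generated regression scenario or as the reproduction script for an incident.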

With knowledge of data semantics from DSL descriptions, procedures for generating test data and populating the database with initial data can be implemented automatically. In practice this task is rarely done, and rarely done well.

  • The mere existence of initial database population is critical: load tests on a nearly empty database are simply meaningless.

  • It is important to have not just a populated database, but a realistic data distribution — only then will queries execute against adequately formed indexes, and load testing results will correspond to real system behavior under load.

  • Semantically correct data is also important for uncovering edge cases in business logic: many of them simply do not manifest with synthetic data that does not reflect real usage patterns.

Limits of Applicability

The proposed approach is not a universal solution and does not claim to be one.

It is most applicable when:

  • the system’s behavior can be expressed in stable business terms — the domain is sufficiently stable for the investment in a DSL to pay off.

    • Financial documents, HR processes, logistics — typical examples of such stability;

    • A rapidly changing product catalog or an experimental startup — clearly not;

  • the value of the system is concentrated in the composition of standard business patterns rather than in unique algorithms. Where the algorithmic core is unique, the benefit of a DSL is substantially lower;

  • there are many repeating patterns — the DSL pays off through replication;

  • the cost of misalignment between intent and implementation is high — when an error in interpreting requirements proves expensive in production;

  • the product has a long lifecycle and many deployments — this is precisely where the main effect of standardization accumulates. The personal experience described below provides a concrete reference point: a horizon of approximately five years and around thirty deployments is the scale at which the approach reliably pays off.

In addition, when deciding whether to adopt this architecture, the following considerations must be taken into account:

  • Initial investment.

    • Developing the DSL and the engine represents a significant cost that pays off only with sufficient system complexity and a substantial operational lifespan. A practical benchmark: approximately one year of work by a team of five developers on the engine alone — before the first product based on it begins development.

    • A small project is not suited to this approach.

  • Domain complexity.

    • The first version of the DSL requires deep domain expertise. In practice, systems of this kind have been built by people who already had several successful systems behind them, written in the conventional way — without that foundation it is difficult to separate stable business patterns from accidental artifacts of a specific implementation.

    • If business terminology changes rapidly, keeping the DSL current becomes a source of technical debt that can gradually devalue the investment in the approach.

  • Engine tyranny.

    • Everything not provided for in the DSL must be implemented either by extending the engine or through low-level workarounds within the descriptions.

    • The DSL must cover the standard patterns of the domain and provide an extension mechanism for non-standard cases. The practical ratio depends on the maturity of both the DSL and the domain, but if non-standard extensions outnumber standard ones, that is a signal that the DSL is failing to keep pace with reality.

    • To prevent such workarounds from accumulating, code distillation is required — regular work to promote non-standard extensions into standard DSL constructs. Otherwise the DSL gradually loses its primary property: being the single source of truth about system behavior.

  • Expressiveness vs. strictness.

    • A DSL strict enough for deterministic execution is inevitably limited in expressiveness.

    • The richer the DSL’s semantics, the more complex the engine — and the greater the risk that the engine itself becomes a source of unpredictable behavior.

Architecture of AI-Assisted DSL-Based Development

On this basis, we can formulate the following high-level architecture for AI-assisted enterprise development — one in which the LLM does not "write the system" but acts as a managed translator from human intent into the DSL model, while everything else is handled by a deterministic pipeline.

The key idea is that each layer solves its own problem, and the boundary between layers is a formal contract in the form of a DSL schema whose violation is detected by static validation. Complexity is localized within the layer and does not leak across the entire pipeline — not by virtue of developer discipline, but by virtue of the architectural design of the system.

The key infrastructural element of this architecture is the DSL Knowledge Registry — a registry of DSL schemas, policies, and patterns that provides the LLM with structured context at each processing stage.

  • In a certain sense this is metadata about metadata, designed to supply the LLM at each stage with only the necessary slice of information about the DSL.

    • This is not merely documentation, but a retrieval + schema + policy knowledge base.

    • The diametric opposite is the naïve approach based on a giant prompt containing the entire DSL documentation as a single continuous block of text.

  • The stored information may include, in particular:

    • base DSL archetypes (UI descriptions, data models, state machines, and so on);

    • schemas for the main components within those archetypes;

    • permissible references between different components;

    • constraints on component properties;

    • standard functionality patterns;

    • examples of correct and incorrect descriptions.

The most reasonable approach is to implement this registry through a separate meta-DSL for describing the DSL, which then serves as the basis for populating the registry. Such a meta-DSL can be simple and stable enough to be described using standard schema languages (JSON Schema, XSD, or EBNF), so it needs no further meta-level of its own and the description does not become recursive.

In terms of its operating mechanism, the DSL Knowledge Registry also resembles in some ways the skills catalog in LLM agents:

  • each DSL component is annotated with a description of its purpose, and the LLM selects the needed component by semantically matching the user’s intent against these descriptions — exactly as an agent selects an appropriate tool from the available set.

  • There is one fundamental difference:

    • the result is not a tool call but a declarative description that emerges as a development artifact, which the deterministic engine then executes according to strict rules.

    • It is precisely this distinction that ensures Predictability — a property that remains the hardest to achieve in a purely agentic approach.
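
A minimal sketch of component selection from the registry. Keyword overlap stands in for semantic matching here, purely to keep the example self-contained; a real registry would more likely rely on embeddings or on the LLM itself. The `ComponentEntry` structure and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentEntry:
    """One registry record for a DSL component (hypothetical structure)."""
    name: str
    archetype: str
    purpose: str                    # informal description, like a skill description
    required_properties: tuple = ()
    keywords: frozenset = frozenset()


REGISTRY = [
    ComponentEntry(
        "filterForm", "documentTable",
        "Form for filtering the rows of a document table",
        ("fields",), frozenset({"filter", "filterable", "search"}),
    ),
    ComponentEntry(
        "transition", "lifecycle",
        "Status change of a document",
        ("from", "to"), frozenset({"status", "transition", "approve"}),
    ),
]


def select_components(intent: str) -> list:
    """Return only the registry entries relevant to the stated intent."""
    words = {word.strip(".,\"'").lower() for word in intent.split()}
    return [entry for entry in REGISTRY if entry.keywords & words]
```

Only the selected entries, with their schemas and constraints, would then be placed into the LLM's context, which is the opposite of the giant-prompt approach. The output of the next step is still a declarative description, not a tool call.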

Processing then proceeds through the following main layers.

  • Layer 1. Intent Extraction

    • At this stage the LLM does not yet write code; it only translates the human description of intent into a more rigorous form suitable for further validation and formalization.

    • Main tasks: request classification, entity and term extraction, preliminary mapping of the task to relevant DSL archetypes, checking the completeness of the input against the schema of the selected archetypes, and identifying missing information that must be requested from the requirements owner.

    • At this layer we identify incomplete or ambiguous intents and, when necessary, initiate a Human-in-the-Loop cycle to refine the requirements.

  • Layer 2. Formalization

    • The LLM translates the task description into a DSL artifact with an explicit schema, constraints, referential integrity, and so on. This is where interpretation of intent and its formalization still occur together, but the result is no longer free-form text — it is a formal artifact.

    • Since the large task has already been decomposed into a set of required archetypes and their key fragments, generation can be performed incrementally, pulling from the DSL Knowledge Registry only the schema fragments and patterns relevant to the current piece.

    • In other words, open-ended generation immediately gives way to constrained generation: the space of permissible outputs is bounded by the DSL schema, the set of archetypes, and the validation rules.

  • Layer 3. Static Validation

    • The DSL description is checked for completeness, internal consistency, conflicts, mandatory dependencies, and module compatibility.

    • This is where violations of domain invariants, composition errors, and incorrect pattern combinations are detected — it is the central mechanism for ensuring Correctness in this architecture.

    • A significant share of problems that would otherwise surface only in later stages or in production are caught here — at the least expensive stage.

  • Layer 4. Deterministic Execution

    • The engine transforms the DSL description into system behavior strictly according to fixed rules, with no variability at the execution level.

    • This is the only layer where high implementation complexity is intentional and where deep technical specialization is appropriate.

    • This complexity, however, is isolated within the engine and must not spread into DSL descriptions or problem statements.

  • Layer 5. Runtime Observability and Feedback

    • Execution at the DSL boundary automatically yields metrics, logs, and traces simultaneously in technical and business terms.

    • Accumulated data on real system usage closes a three-way feedback loop: it is returned as material for

      • refining intent during requirements formulation. For example: if observability shows that a certain document lifecycle transition is never used, this is a signal to the requirements owner to reconsider the intent.

      • enriching the DSL Knowledge Registry with new patterns and anti-patterns. For example, recurring scenarios such as "draft document approval loop" become standard DSL fragments for reuse.

      • evolving the engine itself.
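
The completeness check from Layer 1 can be illustrated with a small sketch. It assumes the extracted intent has already been mapped to archetypes and that the registry exposes a list of required inputs per archetype. Both the archetype names and the required inputs shown here are hypothetical.

```python
# Hypothetical required inputs per archetype, as a registry might expose them.
REQUIRED_INPUTS = {
    "documentTable": ["entity", "columns"],
    "lifecycle": ["entity", "statuses", "transitions"],
}


def missing_inputs(extracted_intent: dict) -> dict:
    """List, per archetype, the required inputs the intent does not yet provide."""
    gaps = {}
    for archetype, provided in extracted_intent.items():
        missing = [key for key in REQUIRED_INPUTS[archetype] if key not in provided]
        if missing:
            gaps[archetype] = missing
    return gaps
```

A non-empty result is the trigger for the Human-in-the-Loop cycle: each gap becomes a concrete question to the requirements owner before any DSL is generated.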

The interaction of the above layers with the DSL Knowledge Registry can be illustrated as follows.

  • The registry stores a catalog of DSL components — archetypes and their constituent parts, each providing certain functionality with a known interface. Functionality can be described

    • both informally — similarly to a skill description,

    • and through formalized requirements specifying the presence of certain property definitions or child components.

  • When the user’s intent contains the statement "the document table must be filterable by the created_at field,"

    • Layer 1 classifies this as a requirement for the presence of a filterForm component in the future document table DSL.

    • Layer 2 retrieves the description of this component from the registry and generates its DSL with the appropriate content.

    • Layer 3 verifies that the resulting table description is internally consistent — in particular, that the created_at field is actually present in the data model and that an index is defined for it.
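
The Layer 3 check from this example can be sketched as follows. The dictionary layout of the table description and the data model is an assumption made for illustration; the actual DSL format is not specified here.

```python
# Hypothetical data model and table description, mirroring the example above.
DATA_MODEL = {
    "document": {
        "fields": {"number", "amount", "created_at"},
        "indexes": {"number"},
    }
}

TABLE_DSL = {
    "entity": "document",
    "filterForm": {"fields": ["created_at", "author"]},
}


def validate_filter_form(table_dsl: dict, data_model: dict) -> list:
    """Check that every filter field exists in the data model and is indexed."""
    errors = []
    entity = data_model[table_dsl["entity"]]
    for field_name in table_dsl["filterForm"]["fields"]:
        if field_name not in entity["fields"]:
            errors.append(f"filterForm: field '{field_name}' is not in the data model")
        elif field_name not in entity["indexes"]:
            errors.append(f"filterForm: field '{field_name}' has no index")
    return errors
```

With the sample data, the validator reports that `created_at` has no index and that `author` is absent from the data model. Both problems are found before anything is executed, and the messages are specific enough to be fed back to the formalization step or shown to a developer.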

It is critically important that this architecture does not eliminate complexity but redistributes and localizes it. Changes to the engine should, wherever possible, leave DSL descriptions untouched; DSL evolution should minimally affect existing descriptions; and expanding the set of modules should not require revisiting already-correct artifacts.

In other words, the complexity of the system does not disappear — but it concentrates at explicit, controlled points.

Personal Experience

The author has worked on DSL-based projects on several occasions, and on one of them developed his own DSL and engine together with the mass-deployment banking product described earlier.

The system was designed to serve several thousand concurrent users who process up to several million complex financial documents per day. Documents pass through several dozen statuses and accounting operations across various banking systems.

All of this was implemented between 2009 and 2015 (without any LLM, of course) on the basis of the following main types of DSL descriptions:

  • document tables

  • forms — for editing documents, requesting arguments for a processing operation, defining filter conditions for document tables, and so on

  • hierarchical validation rule sets, configurable by system administrators and applied conditionally depending on document status and processing operations

  • content digest schema — used to generate the text over which the digital signature is applied

  • lifecycles — document state machines with synchronous and asynchronous transitions

  • signature schemes — first, second, and endorsing signature, with conditional application per organization and account

  • reports and printable forms

  • diagrams

  • predicates, permissions, and standard security roles

The DSL also covered not only the business layer but also infrastructural aspects of the system, ensuring, in particular, their runtime manageability by an administrator:

  • message queues and handlers

  • periodically executed jobs — local and cluster-level

  • gateways for interacting with external systems

  • declarative application-level event handlers — login, logout, menu selection, form open / close

The final results were as follows:

Technology stack: Java, Groovy (used for writing DSL components that combined foldable declarative descriptions with runnable / debuggable extension code), Spring, JPA (with simultaneous support for multiple databases — Oracle, PostgreSQL, MS SQL, DB2, and H2 for the test environment), ActiveMQ, ZKoss.

DSL engine:

  • developed over approximately one year by a team of 5 developers (a realistic minimum investment benchmark for this type of approach)

  • 1,600 Java classes / 400 Groovy classes / 200 UI artifacts / approximately 50 database tables

  • developer documentation — 800 pages in PDF

Banking application (after approximately 5 years of development):

  • 2,000 Java classes / 7,000 Groovy classes / 2,000 UI artifacts / approximately 800 database tables

  • approximately 20 developers working on core functionality;

    • the overwhelming majority were retrained from the proprietary scripting language of the previous-generation system and reached working productivity in approximately 2 months (an instructive figure for assessing the shallowness of the learning curve when working with a DSL)

    • it was far easier for us to retain existing developers with deep banking domain knowledge than to hire Java developers and train them in the domain to a sufficient degree

Product deployments:

  • approximately 30 installations, each with customization of up to 10–20% of the functionality (though there was one outlier implementation where ultimately more than half the system was rewritten)

  • refinement and deployment were carried out simultaneously by a team of approximately 15 developers

Future Prospects

Conclusion

A DSL is not a silver bullet and is not a way to avoid complexity. Developing the engine will require significant initial investment, and the DSL itself needs maintenance and evolution alongside the domain.

This approach is justified only at sufficient scale and over a sufficient development horizon — the personal experience described above provides a concrete benchmark: approximately one year for the engine, five years of product life, thirty deployments. It is at this scale that the investment pays off and the advantages of the approach are fully realized.

If that scale is achievable, the combination of LLM + DSL changes the character of complexity: it stops being spread across the entire codebase and concentrates at clearly defined locations — in the engine, in the DSL schema, in the validation policies.

The LLM thereby ceases to be a source of unpredictable code and becomes what it is most naturally suited to be: a translator between human intent and a formal model. The deterministic engine handles everything else.

The gap between "the system understands what I want" and "the system does exactly that" narrows. It does not disappear — but it narrows. And that is the most realistic outcome we can expect from AI in industrial software development today.

P.S. I sincerely apologize for the illustrations—they look very realistic, but upon closer inspection, they contain a lot of bullshit. I think this in itself is a good meta-illustration of the issue at hand. So I didn't even try to fight it and left it as is.
