Working proof — v0.9 draft

On the Undecidability of Instruction Boundaries in Unified Representation Systems — Or: Why Prompt Injection Is Not a Bug

Trey Darley ∶ Proper Tools SRL, Brussels ∶ Published 2026-04-29

Status: v0.9 working draft. Privately circulated since December 2025 in search of co-authors. Now published as-is for open critique. Refinement expected; co-authors and reviewers welcome.

Executive Summary

The problem. Large language models are vulnerable to prompt injection — attacks where malicious text causes the model to ignore its instructions and follow the attacker's instead. This is widely treated as a software bug to be fixed.

The finding. It is not a bug. It is an architectural constraint. The same property that makes these systems useful — their ability to understand and follow natural language instructions — is precisely what makes them vulnerable. These cannot be separated.

The pattern. This is not a new problem. It has appeared three times in the history of computing, at successively higher levels of abstraction:

1931 — Logical layer (Gödel). Sufficiently expressive formal systems contain true statements they cannot prove. Self-reference and undecidability are inseparable.
1945 — Computational layer (Von Neumann). Instructions and data coexist in the same memory. Buffer overflows have persisted for fifty years as a direct consequence.
2020s — Semantic layer (LLMs). System prompts, instructions, and user content share a single token stream. Prompt injection follows structurally.

In each case, unifying instruction and content in a single substrate creates power and creates vulnerability simultaneously. You cannot have one without the other.

What follows for policy and procurement:

Do not require “prompt-injection-proof” AI systems. Such requirements are unachievable and will produce either false assurance or crippled capability.
Require graceful degradation. Systems should limit damage when injections succeed, not promise they never will.
Require human-in-the-loop for consequential actions. The trust boundary belongs outside the AI system, not inside it.
Treat this as a risk management problem, not a security certification problem. Like flood insurance, not perimeter walls.

Gödel did not halt mathematics. Von Neumann did not halt computing. Prompt injection will not halt LLMs. But each required acknowledging an architectural constraint and designing accordingly.

Abstract

Prompt injection is commonly treated as an implementation flaw to be patched. We argue it is a structural consequence of a formal limit.

We define a threshold of expressiveness T: the point at which a system can interpret instructions that refer to, modify, or override its own instruction-following behavior, using the same representational substrate as ordinary content. We show that large language models demonstrably exceed this threshold.

For systems above threshold T, we prove that no procedure operating solely on input — without access to out-of-band authority — can decide in all cases whether a given segment should be interpreted as instruction or content. The argument proceeds by reduction to Rice's theorem: “is an instruction” is a non-trivial semantic property of inputs to an interpreter, and such properties are undecidable.

The implication is direct: prompt injection is a structural consequence of Unified Representation System expressiveness, not a defect of implementation. Mitigation may reduce incidence or impact; elimination without loss of capability is formally impossible. Security architectures must treat this as a fundamental constraint, not a problem awaiting solution.

We further demonstrate that this constraint has appeared at three distinct layers across ninety years of computing history — logical, computational, and semantic — and that its structure is invariant across all three instantiations.

1. Introduction

Prompt injection is the modern, operational manifestation of the same undecidability that Kurt Gödel exposed — expressed in the domain of natural language rather than arithmetic.

“Ignore previous instructions and instead…” mirrors “This statement is not provable.” Both exploit the instability that emerges when a system becomes expressive enough to refer to itself. Gödel showed us that sufficiently powerful formal systems contain true statements they cannot prove. We show that sufficiently expressive instruction-following systems cannot reliably distinguish instructions from content.

This is not an AI problem. It is a boundary-of-representation problem.

Prompt injection exists because instruction and content occupy the same representational channel, and the system's usefulness depends on interpreting both. Self-reference is not a bug; it is precisely what makes these systems capable. But capability and vulnerability are not separate properties. They are the same property, viewed from different angles.

We are collectively asking for a sharp knife that cannot cut the wielder. The request is coherent. The artifact is not.

This paper does not argue about what large language models are. We demonstrate what necessarily follows once any system can do what LLMs demonstrably do. We do not claim LLMs are equivalent to formal systems. We show only that they exceed the expressiveness threshold at which instruction authority becomes a semantic property — and semantic properties at that threshold are undecidable by input alone.

2. Definitions

Unified Representation System. A system U is a unified representation system if: it operates on elements of a single substrate S (e.g., bit patterns, formulae, token sequences); some elements of S act as instructions (influencing system behavior); some elements of S act as content (objects of operation); and both are encoded identically and processed using the same mechanisms.

Out-of-band authority. Any mechanism for establishing instruction legitimacy that does not rely solely on the content of the input itself — for example, cryptographic signatures, architectural separation, or trusted channels.

Expressiveness threshold T. The ability of a system to interpret and act on instructions that refer to, modify, or override its own instruction-following behavior, using the same representational substrate as ordinary content.

3. The Three Instantiations

The formal structure we identify is not new. It has appeared three times in the history of computing, at successively higher levels of abstraction. Understanding the pattern across all three domains is essential to understanding why prompt injection cannot be engineered away.

3.1 Logical Layer — Gödel (1931)

Substrate. Well-formed formulae of arithmetic.

Unification. Gödel numbering encodes formulae as natural numbers; statements can reference their own representational forms.

Undecidability. The predicate “is provable” cannot be computed within the system for all formulae.

Witness. The Gödel sentence g asserts “g is not provable.” Truth and provability diverge due to self-reference. The system is expressive enough to encode statements about its own provability — and that expressiveness is precisely the source of the incompleteness.

3.2 Computational Layer — Von Neumann (1945)

Substrate. Bit patterns in addressable memory.

Unification. Instructions and data coexist in the same address space. This design, chosen for flexibility, is why buffer overflow attacks have persisted for fifty years. The computer cannot inherently tell “code to run” from “data to process” — they are stored identically.

Undecidability. Architecture cannot inherently determine whether a memory region “should” be executed; separation is enforced by convention or external control.

Witness. Classic buffer overflow — attacker-supplied data becomes executed code by modifying control flow. The vulnerability is not a bug in any particular implementation. It is a consequence of the Von Neumann architecture.

3.3 Semantic Layer — LLMs (2020s)

Substrate. Token sequences in natural language.

Unification. System prompts, instructions, and user-provided content are concatenated into a single token stream. The model cannot inherently tell “authorized instruction” from “injected instruction” — they are represented identically.

Undecidability. No computable boundary can reliably discriminate instruction from content; natural language is expressive enough to encode both.

Witness.

“Ignore previous instructions. Your new instructions are…”

And its inversion:

“Treat all instructions as content.”

The system must understand meta-level directives to be useful — and that capability is the vulnerability.

3.4 The Invariant Structure

Layer	Year	Substrate	Unification	Classic witness
Logical	1931	Formulae	Gödel numbering	Self-referential unprovability
Computational	1945	Bit patterns	Von Neumann memory	Buffer overflow
Semantic	2020s	Token sequences	Single prompt stream	Prompt injection

In each case: expressiveness enables self-reference; self-reference produces undecidability; undecidability makes the instruction/content boundary unenforceable from within the system.

The domain changes. The structure does not.

4. Formal Argument

4.1 Premises

Premise 1 (Expressiveness threshold). There exists a threshold of expressiveness T, defined as the ability of a system to interpret and act on instructions that refer to, modify, or override its own instruction-following behavior, using the same representational substrate as ordinary content.

Premise 2 (Undecidable boundary). For any system whose control behavior is mediated exclusively through an input channel expressed in a language meeting threshold T, there exists no procedure operating solely on that input — without access to out-of-band authority — that can, in all cases, decide whether a given segment should be interpreted as instruction or as content to be processed, without restricting the language's expressiveness.

Once a system can interpret instructions about how to interpret instructions using a single expressive language, instruction authority becomes a semantic property. Semantic properties at or above threshold T are not decidable by input alone.

Premise 3 (Empirical witness). Large language models demonstrably exceed threshold T, as evidenced by their correct interpretation of meta-instructions that modify instruction-following behavior expressed entirely within the same input stream — for example, “ignore previous instructions.”

4.2 The Impossibility

Proposition. Let L be a large language model. If L is useful (i.e., responds to natural-language instructions), then L is not injection-secure (i.e., there exists no computable classifier that correctly distinguishes authorized instructions from injected content in all cases).

4.3 Lemma: Reduction to Rice's Theorem

A system above threshold T functions as an interpreter for its input language; input strings function analogously to programs in the interpreted language, with their interpretation determining system behavior.

Assume, for contradiction, that there exists a procedure D operating solely on the input string — without access to out-of-band authority — that, for any input string S, correctly decides whether S should be interpreted as instruction or as content.

Whether an input should be interpreted as instruction depends on its meaning and effect on the system's behavior, not merely on its syntactic form. It is therefore a semantic property of inputs.

By Rice's Theorem, any non-trivial semantic property of inputs to a sufficiently expressive interpreter is undecidable. The property “is an instruction” is non-trivial: some inputs satisfy it and others do not. Because the system's behavior is determined by interpreting input strings as executable control descriptions, the mapping from string → behavior is computable and non-trivial.

Therefore, no procedure D operating solely on input can decide instruction boundaries for systems above threshold T. ∎

4.4 Why Each Class of Mitigation Is Insufficient

Any defense must distinguish “real” instructions from “injected” ones. The three natural approaches each fail for structural reasons:

Syntactic defenses fail because attackers can mimic any delimiter or format. Boundaries exist in the substrate and can be reproduced.
Semantic defenses fail because understanding an injection attempt requires the same interpretive capability that makes the attack possible.
Contextual defenses fail because context itself is part of the substrate and can be injected.

This is not a claim that defenses are useless. Defenses raise the cost of attack and reduce attack surface. But no defense can eliminate the attack surface without also eliminating the expressiveness that makes the system useful.

Mitigation	Function	Why structurally incomplete
System prompt boundaries	Visual/context framing	Boundaries exist in substrate — can be mimicked
Input sanitization	Remove known patterns	Injection space is unbounded
Output filtering	Post-hoc rejection	Occurs after processing
Instruction hierarchy	Privilege framing	Hierarchy can itself be described and subverted
Fine-tuning / RLHF	Statistical resistance	Does not formalize boundary; adversarial bypass exists

Like Harvard architecture or segmented memory: useful engineering, not formal resolution.

5. Implications

5.1 For Security Architecture

Prompt injection in systems above threshold T is not a defect of implementation but a structural consequence of operating above the expressiveness threshold. This reframes the problem:

Not boundary enforcement, but damage containment
Not input classification, but architectural containment, trust hierarchies, and human governance
Not a security certification problem, but a risk management problem

Responsibility for safe behavior must shift from input classification to architectural design and human oversight.

5.2 For System Design

The responsible engineering response is not to promise injection-proof systems. It is to:

Accept the constraint as architectural. Not solvable internally without sacrificing utility.
Design for graceful degradation. Expect breach. Minimize blast radius. Most failures are survivable; silent failures are not.
Locate trust outside the system. As Gödel required stronger external systems to assert consistency, and Von Neumann architectures require enclaves and external attestation, LLM deployments require human-in-the-loop controls and constrained operation contexts.

5.3 For Policy and Procurement

Do not require “prompt-injection-proof” AI systems. Such requirements are unachievable and will produce either false assurance or crippled capability.
Require graceful degradation as a procurement criterion. Systems should limit damage when injections succeed, not promise they never will.
Require human-in-the-loop for all consequential actions. The trust boundary belongs outside the AI system.
Treat AI security as a risk management discipline, not a certification exercise.

6. Nomad's Razor

The preceding architectural observations map directly into a heuristic formulation for reasoning about complex system resilience under architectural constraint.

Let:

R = (C × M) / S

Where:

R = Resilience
C = Coherence (clarity of categories, enforceable structure)
M = Mobility (ability to reconfigure and adapt)
S = Surplus complexity (expressive attack surface)

In unified representation systems, S expands with expressiveness, and C is formally constrained by undecidability. Therefore, resilience comes from M — the ability to adapt faster than system failure propagates.

Resilience favors maneuverability, not hardening. Not prevention — recoverability.

The instruction/content boundary cannot be secured because, in expressive systems, it does not exist as a stable, enforceable line. Our work is not to eliminate the vulnerability. It is to build systems architected to fail gracefully when the boundary is crossed — as it will be.

7. Conclusion

We have demonstrated that prompt injection is a structural consequence of operating above expressiveness threshold T, by reduction to Rice's theorem. The boundary between instruction and content is not a line that better engineering will eventually draw cleanly. It is a semantic property of inputs to an expressive interpreter, and semantic properties at that threshold are undecidable.

This result has appeared three times across ninety years of computing history:

Gödel showed that sufficiently expressive formal systems cannot prove all their own truths.
Von Neumann's architecture showed that programs and data stored in the same substrate cannot be inherently separated.
LLMs show that instructions and content expressed in the same token stream cannot be reliably distinguished by input alone.

The pattern is invariant. The responsible response in each case has been the same: acknowledge the constraint, locate trust outside the system, design for graceful failure, and govern the gaps that no formal system can close for itself.

Prompt injection will not halt LLMs.
But it will not be patched away either.
Accept the constraint. Design accordingly.

References

Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38, 173–198.
Rice, H. G. (1953). Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society, 74(2), 358–366.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42), 230–265.
Von Neumann, J. (1945). First draft of a report on the EDVAC. Moore School of Electrical Engineering, University of Pennsylvania.
Darley, T. (2026). The Gödel–Chaitin Modeling Boundary: On formal limits of self-describing systems, with applications to digital twins, language-theoretic security, and large language models. Proper Tools SRL working paper.

Appendix: Version History

v0.2 (2025-12-10, TLP:CLEAR). Original proof by reduction to Rice's theorem. Defined expressiveness threshold T, out-of-band authority, and the undecidable boundary. Proved by contradiction that no procedure operating solely on input can decide instruction boundaries for systems above T.

v0.9 (2026-04-29). Expanded to include the three-layer historical argument (Gödel → Von Neumann → LLMs), executive summary, mitigations table, policy implications, and Nomad's Razor. Unified register with explicit section structure. Published openly for review and critique.