AI Agent Security: What SQL Injection Already Taught Us
When LLMs take real actions, architectural controls matter more than model training.
The New Attack Surface
This isn't about Grok generating nonconsensual images. That's an output problem, and people are right to be angry about it. But it's a different threat model than what I'm addressing here. When LLMs can take actions (browse the web, query databases, call APIs, access your files), the risks change completely. Brave's security team recently demonstrated prompt injections hidden in screenshots and webpage content that could hijack AI browsers to exfiltrate banking credentials. That's the attack surface I'm talking about.
People are genuinely afraid of these systems, and they're not wrong to be. At the same time, some developers are being reckless: connecting models directly to databases, letting them execute arbitrary code, treating user prompts as trusted input. LLMs will be used as components of sensitive enterprise systems. The question is what to do about it.
Right now, the industry's primary answer is in-model safety: RLHF, constitutional AI, safety fine-tuning, alignment research. Enormous resources go into teaching models to refuse certain requests. This work isn't worthless. It stops casual misuse, reduces liability exposure, and provides a real first line of defense.
But it comes at a cost that rarely gets discussed: model performance. Every refusal pattern trained into a model is capability traded away. Too many "I can't help with that" responses are false positives that frustrate legitimate users. We're paying a tax in degraded tools, and the security guarantees we're getting in return are weaker than they appear.
The many jailbreaks discovered in 2025 exploit the same architectural reality: LLMs process all tokens through the same attention mechanism with the same weights. System prompts and user inputs compete on equal footing. Safety instructions are just more text, and text can be drowned out, reframed, or contradicted by other text.
Research from Penn State's Center for Socially Responsible AI found that non-technical users are just as effective at jailbreaking AI chatbots as expert red-teamers. Participants with zero technical training, typing plain English prompts, induced policy violations at the same rate as researchers using sophisticated techniques. If your security depends on the model saying no, you're in a cat-and-mouse game where new attacks are discovered faster than patches can be deployed.
There's a more grounded approach—one we've already figured out for a different class of powerful, dangerous tools.
What SQL Injection Taught Us
SQL injection is a solved problem. Not because we made databases "safer" (databases will still execute any valid SQL you send them) but because we stopped exposing them directly to untrusted input. Parameterized queries, ORMs, and input validation layers sit between the user and the database. The database itself remains a sharp tool. We just control who can use it and how.
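To make the analogy concrete, here's the classic before-and-after, sketched with Python's built-in sqlite3 module; the table and column names are invented for the example.

```python
import sqlite3

# Assumes a database with a simple "users" table; names are illustrative.
conn = sqlite3.connect("shop.db")

def find_user_unsafe(email: str):
    # Vulnerable: the user's input is spliced into the SQL text, so input
    # like "x' OR '1'='1" changes the structure of the query itself.
    query = f"SELECT id, name FROM users WHERE email = '{email}'"
    return conn.execute(query).fetchall()

def find_user_safe(email: str):
    # Parameterized: the query shape is fixed and the driver passes the
    # value as data, never as SQL. The database is just as powerful;
    # it simply never sees untrusted input as code.
    return conn.execute(
        "SELECT id, name FROM users WHERE email = ?", (email,)
    ).fetchall()
```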
The same principle applies to LLMs. If your model has access to backend systems, sensitive data, or agentic capabilities, don't let users prompt it directly. This isn't a philosophical position. It's the same engineering discipline we apply to any powerful system that accepts external input.
We didn't solve SQL injection by making databases refuse to run dangerous queries. We solved it by treating user input as untrusted by default and validating it before it ever reached the database. The same pattern works for LLMs.
A Layered Approach
Defense in depth means multiple independent layers, each with a single responsibility. Here's one way to structure LLM-powered applications:
Layer 1: The Chatbot. Handles natural language interaction with users. This is your interface layer. It can be heavily safety-tuned because its only job is conversation. It never touches your backend directly.
Layer 2: The Content Moderator. Deterministic filters and validators that inspect requests before they reach any agentic component. This layer can use ML classifiers, keyword detection, format validation, or rate limiting, depending on your threat model. Crucially, it operates on structured data, not raw conversation, making it auditable and testable (a short sketch of what this can look like follows after the layer descriptions).
Layer 3: The Agent. The capable model that actually does work. This layer can be less safety-constrained because it only receives validated, filtered requests. It never sees raw user input.
Layer 4: The Abstraction Layer. Even a validated agent shouldn't have direct access to your database, codebase, or document store. Expose a constrained API that limits permissions, restricts available actions, and scopes access to exactly what's needed. The agent can call create_invoice(customer_id, amount) but can't execute arbitrary SQL. The API becomes your contract: a narrow, auditable surface that defines the boundary between what the model wants to do and what it's allowed to do.
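Here's a minimal sketch of what that contract can look like in Python. The tool names, limits, and result type are assumptions made for the example, not a prescribed interface.

```python
from dataclasses import dataclass

# Hypothetical tool surface exposed to the agent. Nothing outside this
# registry is reachable: no raw SQL, no filesystem, no shell.
ALLOWED_TOOLS = {"create_invoice", "get_invoice_status"}
MAX_INVOICE_AMOUNT = 10_000  # illustrative business limit

@dataclass
class ToolResult:
    ok: bool
    detail: str

def create_invoice(customer_id: str, amount: float) -> ToolResult:
    # The tool validates its own inputs regardless of what the agent asked for.
    if not customer_id.isalnum():
        return ToolResult(False, "invalid customer id")
    if not (0 < amount <= MAX_INVOICE_AMOUNT):
        return ToolResult(False, "amount outside allowed range")
    # ... write to the billing system via a service account that is
    # scoped to invoice creation and nothing else ...
    return ToolResult(True, f"invoice created for {customer_id}")

def dispatch(tool_name: str, arguments: dict) -> ToolResult:
    # The only entry point the agent has. Unknown tools are rejected
    # by default: the layer fails closed.
    if tool_name not in ALLOWED_TOOLS:
        return ToolResult(False, f"tool not permitted: {tool_name}")
    if tool_name == "create_invoice":
        return create_invoice(
            str(arguments.get("customer_id", "")),
            float(arguments.get("amount", 0)),
        )
    return ToolResult(False, "tool not implemented in this sketch")
```

From the agent's point of view, anything not listed in the registry simply doesn't exist, and even a permitted call still has to pass the tool's own checks.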
If the agent somehow gets jailbroken despite all prior layers, it can only misuse the tools you've explicitly provided. And those tools enforce their own validation, rate limits, and access controls. The blast radius is contained by design.
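The moderation layer (Layer 2 above) is just as concrete. Here's a minimal sketch of the kind of deterministic, fail-closed checks it might run, assuming the chat layer hands it a structured request rather than raw conversation; the fields and limits are invented for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class AgentRequest:
    # Structured output of the chat layer: an action plus typed
    # parameters, never free-form conversation.
    action: str
    customer_id: str
    amount: float

# Illustrative policy: which actions may pass through, and within what bounds.
PERMITTED_ACTIONS = {"create_invoice", "get_invoice_status"}
CUSTOMER_ID_PATTERN = re.compile(r"^[A-Za-z0-9]{1,32}$")
AMOUNT_LIMIT = 10_000

def moderate(request: AgentRequest) -> tuple[bool, str]:
    # Deterministic checks only. Anything unexpected is rejected,
    # so the layer fails closed instead of guessing.
    if request.action not in PERMITTED_ACTIONS:
        return False, f"action not permitted: {request.action}"
    if not CUSTOMER_ID_PATTERN.match(request.customer_id):
        return False, "malformed customer id"
    if not (0 < request.amount <= AMOUNT_LIMIT):
        return False, "amount outside policy"
    return True, "ok"

# Only requests that pass moderation are ever forwarded to the agent.
ok, reason = moderate(AgentRequest("create_invoice", "ACME42", 250.0))
```

The overlap with the tool-level checks in the previous sketch is deliberate: each layer validates independently, which is what defense in depth means in practice.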
Why This Approach Works
It's auditable. Deterministic code can be reviewed, tested, and formally verified. You can prove properties about it. You can't prove properties about neural network behavior under adversarial conditions.
It doesn't degrade capability. Your model stays capable. The safety constraints live in the system, not the weights. Legitimate use cases don't pay a tax for abuse prevention.
It fails closed. When an external filter encounters something unexpected, it can reject by default. When an in-model safety measure encounters a novel attack, it often fails open. The model just complies.
It scales independently. You can update your content moderation layer without retraining your model. You can swap in a more capable base model without rebuilding your safety infrastructure. The concerns are separated.
It's cheaper. Building deterministic validation logic costs a fraction of what safety fine-tuning costs. You don't need millions of examples and thousands of GPU-hours. You need clear requirements and good software engineering.
Where In-Model Safety Still Matters
None of this means in-model safety is worthless. It serves real purposes:
For consumer-facing chat products without agentic capabilities, in-model safety is often the only practical layer. You can't put an approval workflow in front of a chatbot meant for casual conversation. The model's own judgment is the product.
For open-ended creative and research tasks, you can't fully specify acceptable outputs in advance. The model needs some internal sense of boundaries for genuinely ambiguous cases.
As a first line of defense, in-model safety stops the majority of casual misuse attempts. This has real value even if it doesn't stop sophisticated adversaries. Security is about raising costs, not achieving perfection.
The argument isn't either/or. It's about emphasis. Both matter. But if you're building systems where LLMs touch real data and take real actions, the architectural patterns deserve at least as much attention as the model's training.
Finding the Balance
We've been here before. Every powerful tool (databases, web servers, APIs) went through a phase where we learned, sometimes painfully, how to use it safely. The answer was never to make the tool less capable. It was to build systems that treated external input as potentially hostile and validated it before letting it reach anything important.
LLMs are powerful, and that power is exactly why we need to be thoughtful about how we deploy them. The people who are afraid aren't wrong; there are real risks. The people who are excited aren't wrong either; there's real potential. The path forward is the same boring engineering discipline we've applied to every other dangerous-but-useful technology: deterministic filters, constrained APIs, layered permissions, human-in-the-loop approvals where stakes are high.
The model is a component. Wrap it in a system that ensures it can only do what you've explicitly allowed, and you get the best of both worlds: tools that work and systems that are safe.