The Boring AI Revolution: How LLMs Will Actually Change Work
The real progress isn't flashy—it's forms, workflows, and the friction that quietly disappears.
There is a heated debate about whether Large Language Models will lead to AGI—but no one can agree on what AGI means, or even whether token prediction qualifies as intelligence. What I can tell you is that LLMs as they stand today, assuming no additional improvements in the foundational models themselves, will augment work for most Americans. We deal with paperwork, forms, order and inventory systems, Excel files, CRMs, and databases across just about every industry. Workflows within and between these systems require human labor that LLMs have the potential to augment or even eliminate. However, it will not happen quickly. It will take years, and it will likely start in industries where tasks have well-defined goals and time is expensive.
What Agents Actually Are
LLMs—the foundational models—are prone to failure in real-world tasks. This is why "agents" are not simply API calls to a model provider. Agents are systems that use LLMs as reasoning engines, combined with databases, software, and tools the LLM can access. Each step in a workflow may lean more heavily on one of those elements than the others.
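A minimal sketch of that division of labor, assuming a generic chat-completion API: `call_llm`, `query_db`, and `run_sandboxed` are illustrative stand-ins, not any provider's real interface. The point is the shape of the loop, not the stubs.

```python
import json

def call_llm(messages: list[dict]) -> dict:
    # Stand-in for a provider API call; a real agent would send `messages`
    # to a chat-completion endpoint and parse the structured response.
    return {"answer": "demo"}

def query_db(args: dict) -> str:
    return "0 rows"  # stand-in for a real database client

def run_sandboxed(args: dict) -> str:
    return "ok"      # stand-in for sandboxed code execution

TOOLS = {"query_db": query_db, "run_python": run_sandboxed}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)                   # the LLM only reasons
        if "answer" in step:                        # it decided it is done
            return step["answer"]
        result = TOOLS[step["tool"]](step["args"])  # plain code does the work
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("step budget exceeded")

print(run_agent("How many orders shipped late last week?"))
```

The LLM never touches the database or the sandbox directly; it only chooses the next step, and deterministic code executes it.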
Databases: RAG in All Its Forms
Almost immediately after the release of GPT-3.5's API, people started connecting LLMs to databases—myself included. The required tech stack was already mature and well-optimized thanks to years of work on search infrastructure. With Retrieval-Augmented Generation (RAG), you could dynamically augment your prompt by inserting information relevant to the user query from a database. Once in the context window, an LLM could structure a coherent response based on factual information.
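A toy sketch of the retrieval step, with keyword overlap standing in for the vector search a production system would use; the snippets and scoring are purely illustrative.

```python
import re

# Three stored snippets; a real deployment would hold thousands in a
# vector database, searched by embedding similarity rather than keywords.
DOCS = [
    "Orders over $500 qualify for free shipping.",
    "Returns are accepted within 30 days of delivery.",
    "Net-30 payment terms require an approved credit application.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9$]+", text.lower()))

def score(query: str, doc: str) -> int:
    return len(tokens(query) & tokens(doc))  # crude stand-in for similarity

def build_prompt(query: str, k: int = 2) -> str:
    top = sorted(DOCS, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    # The retrieved facts land in the context window ahead of the question.
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Do you offer free shipping?"))
```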
Over the past few years, we've found limitations, workarounds, and new designs like GraphRAG to address specific shortcomings. Between improvements in database design and improvements in LLMs, RAG can be incredibly accurate, fast, and cheap—if you're willing to put in the work to get it right.
Tools: Making LLMs Reliable
You are probably familiar with ChatGPT being unable to count the Rs in "strawberry" or tell whether 9.11 is greater than 9.9. You may be surprised to learn that it can do those tasks if you tell it to use a tool like Python to answer your question.
Over the past two years, companies like OpenAI, Anthropic, and Cursor started integrating "tool calls" into their software and have even begun training their models specifically to make better use of tools. "Tools" is the umbrella term for everything an LLM can invoke: code execution in sandboxed environments, web browsers, external APIs, and integrations via protocols like MCP (Model Context Protocol). Some tools run locally—a Python interpreter, a browser automation script. Others connect to external services—AWS, Salesforce, your ERP. From the agent's perspective, it's all tool invocation. What matters is whether the tool is reliable, whether its outputs are structured, and whether failures can be handled gracefully.
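From the host's side, a tool call is just a small structured request that ordinary code fulfills. In the sketch below, the model's JSON response is hardcoded where a real completion would arrive, and the tool names are invented for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

def is_greater(a: float, b: float) -> bool:
    return a > b

TOOLS = {"count_letter": count_letter, "is_greater": is_greater}

# In production these requests arrive as the model's structured response;
# here they are hardcoded to show the mechanics.
request = {"tool": "count_letter", "args": {"word": "strawberry", "letter": "r"}}
print(TOOLS[request["tool"]](**request["args"]))  # 3, every time

request = {"tool": "is_greater", "args": {"a": 9.11, "b": 9.9}}
print(TOOLS[request["tool"]](**request["args"]))  # False, every time
```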
Structured Outputs: Getting Valid Data Every Time
OpenAI was among the first to focus on and develop structured outputs, and just about every major model provider has since followed. You can think of this as handing an LLM a form to fill in, where you can force the responses to be valid data types. For example, you could have an LLM extract the order date, shipping date, and delivery date from an image and force those fields to a specific date format. Another example: label something as relevant to a specific topic and return a boolean—true or false—then route the information accordingly.
Structured outputs can be simple or complex, containing deeply nested JSON. This response format enables the reliability and traceability that enterprise deployments require.
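A sketch of what that form can look like, defined here with Pydantic (v2); the field names are invented for illustration. Providers' structured-output modes accept a JSON Schema like the one this class generates and constrain the model's response to match it.

```python
from datetime import date
from pydantic import BaseModel

class OrderDates(BaseModel):
    order_date: date      # forced to a valid ISO date, not free text
    shipping_date: date
    delivery_date: date
    is_rush_order: bool   # a boolean the workflow can route on

# The schema sent along with the request; the response must conform to it.
print(OrderDates.model_json_schema())
```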
Software Integration: LLMs as the Grout Between Tiles
Software is where agents stop being "chatbots" and start being operational. In a typical business, the work is not thinking up an answer; it is moving an order, a record, or an approval through a chain of systems that do not naturally communicate with one another.
Consider a mid-size distributor that receives purchase orders through EDI, emailed PDFs, and customer portals. An agent can watch those channels, translate each order into a clean, structured representation, and then validate it against what the company already knows: real SKUs, customer terms, credit status, inventory, shipping cutoffs, and contract pricing.
When something is ambiguous, the LLM is useful for interpreting intent ("ship what you can now," "substitute if needed"), but software is what actually enforces policy and prevents mistakes.
Once the order is valid, execution is not a paragraph of text. It is a set of controlled writes through APIs and transactions: create or update the sales order in the ERP, reserve inventory, trigger warehouse picking, request a shipment, and log the outcome in the CRM. When the workflow hits exceptions—an unknown SKU, a pricing mismatch, a new ship-to address—the agent does not pretend certainty. It packages the issue with evidence and a proposed resolution and routes it to a human, then resumes once the exception is cleared.
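A compressed sketch of that software layer; the SKU catalog, price table, and order shape are hypothetical. Notice that the LLM appears nowhere in these functions: validation and routing are deterministic code.

```python
KNOWN_SKUS = {"WID-100", "WID-200"}                   # hypothetical catalog
CONTRACT_PRICE = {"WID-100": 9.50, "WID-200": 14.00}  # hypothetical pricing

def validate_line(line: dict) -> list[str]:
    issues = []
    if line["sku"] not in KNOWN_SKUS:
        issues.append(f"unknown SKU {line['sku']}")
    elif abs(line["price"] - CONTRACT_PRICE[line["sku"]]) > 0.01:
        issues.append(f"price {line['price']} off contract {CONTRACT_PRICE[line['sku']]}")
    return issues

def process(order: dict) -> dict:
    issues = [i for line in order["lines"] for i in validate_line(line)]
    if issues:
        # Package the evidence and route to a human; don't pretend certainty.
        return {"status": "needs_review", "issues": issues, "order": order}
    # Only a clean order reaches the controlled ERP/CRM writes.
    return {"status": "ready_to_write", "order": order}

print(process({"lines": [{"sku": "WID-999", "price": 9.50}]}))
```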
That is what "agents using software" means in practice: the LLM helps interpret messy inputs, but the system is built to be reliable, auditable, and accountable because the software layer owns validation, permissions, and state changes.
Putting It All Together
Imagine a ten-step process that needs to be performed correctly every time. A bare LLM call at each step would fail at a rate too high for enterprise use. Instead, an agent is created—an LLM equipped with tools like Python execution, database access, and software integrations.
For some steps, the role of the LLM is minor: selecting which pre-written configuration file to send to a cloud service, where the cloud provider does the real work. For others, the LLM handles the task almost entirely on its own—like writing an executive summary based on the results of a SQL query.
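Two of those steps side by side, as a sketch; `call_llm` is again a stand-in for any provider API, and the config names are invented.

```python
def call_llm(prompt: str) -> str:
    return "small"  # stand-in for any provider's chat-completion API

def step_pick_config(request: str) -> str:
    # Minor LLM role: choose among pre-written, vetted configs;
    # the cloud service that receives the file does the real work.
    choice = call_llm(f"Choose small/medium/large for: {request}")
    return {"small": "cfg-small.yaml", "medium": "cfg-medium.yaml"}.get(choice, "cfg-large.yaml")

def step_write_summary(rows: list[dict]) -> str:
    # Major LLM role: the model drafts the entire summary from SQL results.
    return call_llm(f"Write a one-paragraph executive summary of: {rows}")

print(step_pick_config("batch job, ~1k records"))
```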
The Threshold Effect: Why Small Improvements Matter More Than They Appear
There is something that gets lost in discussions of AI progress: below a certain reliability threshold, automation is worthless. You cannot half-automate a ten-step process. If step four fails 20% of the time, you need a human monitoring the entire workflow anyway, which defeats the purpose. The economics do not work. The process stays manual.
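The arithmetic makes the point concrete: end-to-end success is the product of per-step success rates, so a single unreliable step drags the whole chain below usability.

```python
from math import prod

reliable = [0.99] * 10          # ten steps, each 99% reliable
print(f"{prod(reliable):.0%}")  # 90%: automatable with light oversight

weak = [0.99] * 10
weak[3] = 0.80                  # step four fails 20% of the time
print(f"{prod(weak):.0%}")      # 73%: a human has to watch every run
```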
But once a capability crosses that threshold—once an agent can reliably make a SQL call, parse a PDF, or validate an order—something important happens. It is not just that one task that gets unlocked. It is every workflow that was bottlenecked waiting for that same capability to become reliable.
A single improvement quietly unblocks dozens of use cases across different industries, different companies, different job functions. Then the snowball starts. Each newly reliable capability becomes a building block for the next. An agent that can reliably query a database, plus reliably output structured JSON, plus reliably call an external API, can now do things that none of those pieces could do in isolation. The capabilities combine. The combinatorics explode.
This is why the "AI will take your job" versus "AI is overhyped" framing misses what is actually happening. It is not about replacement. It is not about one dramatic moment. It is about the quiet accumulation of crossed thresholds—capability after capability graduating from "almost works" to "works reliably"—until one day you realize that half of your job looks different than it did two years ago, and you cannot point to a single moment when it changed.
We are in the middle of that accumulation right now. The breakthroughs are not dramatic. They are things like "structured outputs now work consistently" and "tool calls fail less often." Individually, none of them sound revolutionary. But they build on one another, and each one makes the next more powerful.
What This Means for Work
In 2026, a lot of the software you think should be able to talk to each other will start to. The loose or missing connections between pieces of information, the small steps between currently automated tasks—these will be the starting points. Instead of a manager asking an intern to run a SQL query against the sales table, they'll @ Claude in Slack. When a coworker asks over Outlook for access to a Google Doc, you won't have to find the file, click share, plug in their email, and hit send—Copilot will recommend the three most likely matches, you tap the right one, and it handles the rest.
AI is unlikely to take your job in 2026: only 55,000 of the 1.17 million layoffs announced in 2025 were attributed to AI (CNBC). But you will start to see small friction points in your job begin to vanish. For some changes, you may not even notice that an LLM is involved.
These changes, small at first, will compound. More deadlines will be met. Fewer balls will be dropped. And we will all have more time to think and evaluate how we complete our tasks rather than rushing through every one of them.