There’s a version of building an AI agent that looks like this: you write a prompt, connect a few tools, test it in a notebook, and ship it. Within an afternoon, you have something working. That version is real, but it’s not the whole story.

The full story starts the morning after.

That’s when you realize the agent needs somewhere to store context between turns. It needs compute that can scale when ten users become a hundred. It needs a network layer that can reach your data sources without exposing credentials. It needs to compress and manage what it knows so it doesn’t blow past context limits on every other invocation. It needs authentication, logging, rate limiting, and a deployment target that won’t require a ticket to the infrastructure team every time you want to iterate.

You didn’t set out to build any of that. You set out to build an agent.

The hardest part of building agents in production isn’t the agent. It’s everything the agent needs to survive outside a notebook.

Not every agent hits this wall, but the most valuable ones do

It’s worth being precise about which agents run into these problems, because not all of them do. A simple agent that summarizes a document on demand and returns the result has minimal infrastructure requirements. The problems compound as the agent takes on more consequential work.

The agents most likely to create real business value, and most likely to strain a self-built stack, tend to fall into a few categories:

  • Customer-facing agents. Support bots, sales assistants, onboarding guides. These face unpredictable concurrency, must maintain coherent conversation history across sessions, and have zero tolerance for hallucinations about product or policy details. They’re also the ones where a bad experience has an immediate, visible cost.
  • Long-horizon task agents. Agents that execute multi-step workflows (financial reconciliation, compliance review, procurement processing) over minutes or hours. They need to persist state across steps, recover gracefully from failures, and know when to pause and escalate.
  • Data-intensive analytical agents. Agents that query large datasets, synthesize results from multiple sources, and return grounded answers. These stress the data layer hard: the quality of the answer depends entirely on what the agent can reach and how fresh it is.
  • Internal knowledge agents. Agents that help employees navigate policies, surface institutional knowledge, or answer questions across unstructured document stores. The infrastructure challenge here is mostly retrieval quality and access control, making sure the agent finds the right document and only the documents the user is entitled to see.

What these categories share is that they all require the agent to do more than produce a single response. They operate over time, across data, and under load. That’s where the infrastructure surface becomes unavoidable.

What “we built it in a week” actually means

Imagine a team that built a customer support agent in about a week. The demo was impressive: it handled common product questions fluently, pulled from the knowledge base, and escalated gracefully when it didn’t know the answer. Leadership loved it. They moved it to production.

Then the volume arrived.

Under concurrent load, response times degraded. The agent had no rate limiting, so peak traffic from a single promotional campaign caused LLM API costs to spike faster than anyone had modeled. The conversation history implementation, fine for a single session in testing, wasn’t designed for thousands of simultaneous users, and state started leaking between sessions in ways that were subtle but damaging. Customers were occasionally receiving context from someone else’s prior conversation.
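
The leak itself is usually mundane: conversation history kept in a structure that isn’t keyed by session. A minimal sketch of the isolation that was missing, using a hypothetical in-memory store (a real deployment would back this with Redis or a database and a TTL, not process memory):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SessionStore:
    """Conversation history isolated per session (illustrative sketch)."""
    _histories: dict[str, list[dict]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def append(self, session_id: str, role: str, content: str) -> None:
        # Every read and write is keyed by session_id, so one user's
        # context can never bleed into another user's prompt.
        self._histories[session_id].append({"role": role, "content": content})

    def history(self, session_id: str) -> list[dict]:
        # Return a copy so callers can't mutate shared state.
        return list(self._histories[session_id])
```

None of this is clever. It just has to exist before thousands of sessions do.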

Security review surfaced more issues. The agent had broad read access to the product database, necessary for answering questions, but more permissive than any compliance team would have signed off on if they’d been asked. Audit logging was minimal. There was no way to reconstruct what the agent had said to a specific customer on a specific day.
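
An audit trail is one of the cheaper pieces to get right early. A hedged sketch of the record that makes “what did the agent tell this customer on this day” a query rather than a forensic exercise, assuming a JSON-lines file as a stand-in for a real append-only store:

```python
import json
import time

def audit_log(path: str, session_id: str, customer_id: str,
              role: str, content: str) -> None:
    """Append one agent utterance to a JSON-lines audit file.

    Illustrative only: a production system would write to an
    append-only store with retention and access policies.
    """
    record = {
        "ts": time.time(),           # when it was said
        "session_id": session_id,    # which conversation
        "customer_id": customer_id,  # who it was said to
        "role": role,                # agent or user
        "content": content,          # what was said
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```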

In this story, in the best case, the agent went back to limited access while the team spent the next two months retrofitting the infrastructure that should have been there from the start. Rate limiting, session isolation, scoped data access, audit trails, a cost model that didn’t assume low traffic. None of it was the agent. All of it was the prerequisite for running the agent safely.

The agent took a week to build. The infrastructure to run it responsibly took two months. That ratio is more common than teams expect.

The infrastructure you didn’t plan for

This pattern repeats because the infrastructure an agent needs in production is genuinely broad. It’s not one hard problem; it’s six or seven medium-hard problems that interact with each other in ways that only become visible under real conditions.

Consider what a production agent actually requires:

  • Memory, and there are several kinds. Short-term context is what fits in the active conversation window: the current exchange, recent tool results, the task at hand. Long-term memory is retrieved from a store: facts about the user, prior interactions, domain knowledge, surfaced selectively when relevant. Workflow state is different again: the record of what the agent has already done in a multi-step task, what succeeded, what failed, and where to resume. Most teams start by thinking about the first kind and discover the other two in production. (The sketch after this list makes the distinction concrete.)
  • Compute. A single agent handling a handful of requests looks fine. The same agent under concurrent load, with multi-step reasoning chains and external tool calls, looks very different. Scaling agent workloads isn’t like scaling a stateless API. Each request may involve multiple sequential LLM calls, and the latency compounds.
  • Network and connectivity. Agents are only as useful as the tools they can reach. Connecting them to internal data sources, APIs, and enterprise systems (securely, reliably, and with proper credential management) is its own engineering problem.
  • Context compression. Long-running agents accumulate context fast. Without intelligent summarization and compression, you hit context limits early and often, and the agent’s ability to reason degrades as the window fills.
  • Security and governance. In an enterprise, an agent that can query data can also, in principle, query the wrong data. Access controls, audit logs, and policy enforcement aren’t optional. They’re what separates a proof of concept from something a compliance team will sign off on.
  • Observability. When an agent fails, and it will, you need to know exactly what it did, in what order, and where it went wrong. A system that produces results without an audit trail isn’t production-grade.
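
To make the memory item concrete, here is a minimal sketch of the three kinds side by side, with a crude truncation step standing in for the LLM-driven summarization a real agent would use for context compression. Every name and limit here is illustrative, not any platform’s API:

```python
from dataclasses import dataclass, field

MAX_TURNS = 40  # stand-in for a real token budget

@dataclass
class AgentMemory:
    short_term: list[dict] = field(default_factory=list)         # active window
    long_term: dict[str, str] = field(default_factory=dict)      # retrieved facts
    workflow_state: dict[str, str] = field(default_factory=dict)  # step -> status

    def add_turn(self, role: str, content: str) -> None:
        self.short_term.append({"role": role, "content": content})
        if len(self.short_term) > MAX_TURNS:
            self.compress()

    def compress(self) -> None:
        # Context compression: fold the oldest turns into a summary so the
        # active window stays under the model's context limit. A real agent
        # would produce the summary with an LLM call, not a placeholder.
        old, self.short_term = self.short_term[:-10], self.short_term[-10:]
        summary = f"[summary of {len(old)} earlier turns]"
        self.short_term.insert(0, {"role": "system", "content": summary})
```

Nothing here is difficult. The point is that all three structures exist, interact, and need a durable home before the agent meets real traffic.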

None of this is exotic. These are well-understood infrastructure concerns. The problem is that each one represents an engineering investment that has nothing to do with the actual capability you’re trying to deliver.

What it costs to build it yourself

Teams that try to assemble this stack themselves tend to discover the same thing: by the time the infrastructure is stable, the agent use case has moved. The business problem that seemed urgent in January is competing with three newer priorities by March, and the team that was supposed to be building intelligent systems has spent most of its time wiring together memory stores, configuring network policies, and debugging deployment failures.

It’s not that any individual piece is impossibly hard. It’s that each piece requires real expertise, and the combination creates compounding complexity. A change to the memory architecture affects what context the agent can access. A change to the compute configuration affects latency, which affects how users experience multi-step tasks. These dependencies aren’t visible until they break.

When the infrastructure becomes the project, the agent becomes the afterthought.

The question worth asking before you build

The same question I’ve started asking about agents versus workflows applies here: Does this problem require you to build infrastructure, or does it require you to build an agent?

If you’re spending more than a small fraction of your engineering effort on memory management, compute orchestration, and deployment plumbing, the answer is probably that you’re solving the wrong problem first.

This is the premise behind Oracle AI Data Platform. Not that building infrastructure is wrong (someone has to), but that most teams shouldn’t have to build it themselves every time they want to deploy an agent. The platform provides memory, compute, networking, context management, security, and observability as the foundation. The team focuses on the agent: the tools it uses, the decisions it makes, the workflows it supports.

The data problem is worse than it looks

There’s a second structural problem that infrastructure-from-scratch approaches tend to surface late: agents without good data access aren’t very useful. And building good data access for an agent is harder than it looks, because enterprise data is almost never in one place.

A realistic picture of what a data-intensive agent needs to reach looks something like this: a data warehouse (Autonomous AI Database, Snowflake, BigQuery, or Redshift) for structured analytics; a vector database (Autonomous AI Database, Pinecone, Weaviate, or pgvector) for semantic retrieval over documents; a set of internal APIs for transactional or real-time data; and a permission layer that enforces who can see what across all of the above. Each of these comes from a different vendor, has its own authentication model, its own rate limits, its own schema conventions, and its own failure modes.
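
What teams end up writing to tame this is a thin unification layer: a single enforcement point in front of heterogeneous backends. A hedged sketch, where warehouse and vector_store stand in for whatever real clients you have and the permission map is deliberately naive:

```python
class DataAccessLayer:
    """One permission check in front of every backend (illustrative)."""

    def __init__(self, warehouse, vector_store, permissions):
        self.warehouse = warehouse        # e.g. a SQL client with .execute()
        self.vector_store = vector_store  # e.g. a vector DB client with .search()
        self.permissions = permissions    # user -> set of allowed resources

    def query_warehouse(self, user: str, sql: str, table: str):
        self._check(user, table)
        return self.warehouse.execute(sql)

    def search_documents(self, user: str, text: str, collection: str):
        self._check(user, collection)
        return self.vector_store.search(collection, text)

    def _check(self, user: str, resource: str) -> None:
        # One enforcement point, many backends: exactly the consistency
        # that is hard to keep when each system has its own access model.
        if resource not in self.permissions.get(user, set()):
            raise PermissionError(f"{user} may not read {resource}")
```

The check runs before dispatch rather than inside each connector; that placement is what keeps enforcement uniform across backends.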

Building the connectors is one project. Keeping them current as schemas evolve is another. Enforcing consistent access control across four different systems, so that the agent never surfaces data a user isn’t entitled to see, is a third. And when something goes wrong, attributing the failure to the right layer requires understanding all of them simultaneously.

The reason this matters for AI Data Platform specifically is that the data isn’t somewhere else. The platform unifies the agent layer with the data layer (structured data, documents, vectors) under the same roof, with a single permission model. An agent that needs to query a warehouse, retrieve a relevant document, and return a grounded answer doesn’t need to reach across three vendors and three sets of credentials to do it. The data is already there, and access is already governed.

That sounds like a small thing. In practice, it changes what agents can do on day one versus month six, and it changes what your team has to build to get there.

Focus on the part that matters

The teams I’ve seen move fastest with agents aren’t the ones with the most sophisticated infrastructure. They’re the ones that stopped building infrastructure and started building agents.

That distinction is harder to achieve than it sounds. It requires a platform that handles the foundational concerns well enough that you can trust it without understanding it completely. Memory that works across short-term context, long-term recall, and workflow state. Compute that scales. Security that’s on by default. Data that’s already connected and governed.

When those pieces are in place, the conversation changes. Instead of asking “how do we deploy this securely?” the team asks “what should this agent actually do?” That’s a much more interesting problem. It’s also the one they set out to solve.

For more information: