AI Agents Are Winning Internally—But Organizations Can't Scale Them, Experts Warn
Despite widespread adoption of AI agents in pilot projects, a new study from Google Research, DeepMind, and MIT finds that almost no organization has successfully scaled these systems enterprise-wide. The research highlights a critical gap between experimental success and operational deployment, leaving many companies to guess their way through agent organization.
At NVIDIA GTC 2025, one clear takeaway emerged from dozens of conversations: companies are shipping agent systems almost by guessing. "The core challenge is determining the optimal structure—how many agents per team, which model providers to use, and whether to use a 'boss' agent or peer-to-peer coordination," said a developer who attended.
Background
The problem stems from the rapid rise of generative AI and agent architectures. Large Language Models (LLMs) are powerful but limited without proper orchestration. The new paper, Towards a Science of Scaling Agent Systems: When and Why Agent Systems Work, provides a decision framework for structuring agent teams.

LLMs, as described in the paper, are like "very well-read interns who have never left the library." They can generate text, code, and translations, but they lack memory, autonomy, and the ability to act. This is where AI agents come in: they give that well-read intern a desk, a laptop, and a to-do list.
What This Means
Developers need to rethink deployment strategies. Rather than shipping agents and hoping for the best, organizations must invest in structured testing and evaluation (evals). "The future of AI is evals," the authors conclude—moving from guesswork to a scientific approach to agent organization.
The framework addresses questions like: What's the right number of agents? Should they have a supervising agent or coordinate peer-to-peer? The answers depend on specific business cases and require hands-on experimentation.
Prerequisites
You don't need to be an expert developer to create AI agents; no-code tools can help. For full comprehension, though, a general understanding of Python and of what an LLM is will be essential. The code examples use Ollama to run models locally for free, inside a Jupyter notebook (Google Colab is recommended for cloud GPUs).
What Is an LLM?
An LLM is a Large Language Model—a neural network trained on vast amounts of text. It can summarize, translate, and imitate styles, but when unsure it hallucinates, confidently producing plausible-sounding wrong answers. It has no memory between conversations and cannot act independently.
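Statelessness is worth seeing in code. The sketch below uses a stub in place of a real model (no actual LLM call is made): the only "memory" the model ever has is the history you explicitly resend with each prompt.

```python
def stub_llm(prompt: str) -> str:
    """Stand-in for a real model: reports how much context it was given."""
    return f"I see {len(prompt.splitlines())} line(s) of context."

def chat(history: list[str], user_msg: str) -> str:
    """Each turn, the FULL history must be resent -- the model itself remembers nothing."""
    history.append(f"User: {user_msg}")
    reply = stub_llm("\n".join(history))
    history.append(f"Model: {reply}")
    return reply

history: list[str] = []
chat(history, "Hello")         # the model sees 1 line of context
chat(history, "Remember me?")  # 3 lines -- only because we resent them ourselves
```

If you drop the `history` list, every turn starts from zero; that bookkeeping is exactly what agent frameworks automate.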

What Are AI Agents?
An AI agent is an LLM equipped with tools, memory, and autonomy. It can send emails, query databases, or browse the web. The key is how you organize multiple agents: should they have a "boss" agent or collaborate as peers?
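A minimal sketch of the agent idea, with made-up stub tools and a hard-coded `decide` function standing in for the LLM's tool choice (a real agent would parse model output instead):

```python
def lookup_weather(city: str) -> str:
    """Stub tool -- a real one would query a weather API."""
    return f"Sunny in {city}"

def send_email(to: str, body: str) -> str:
    """Stub tool -- a real one would talk to a mail server."""
    return f"Email sent to {to}"

TOOLS = {"lookup_weather": lookup_weather, "send_email": send_email}

def decide(task: str) -> tuple[str, dict]:
    """Stand-in for the LLM choosing a tool and its arguments."""
    if "weather" in task:
        return "lookup_weather", {"city": "Paris"}
    return "send_email", {"to": "team@example.com", "body": task}

def run_agent(task: str) -> str:
    """One step of the agent loop: decide on a tool, then execute it."""
    tool_name, args = decide(task)
    return TOOLS[tool_name](**args)
```

The pattern scales from here: loop until the model signals it is done, and feed each tool result back into the next decision.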
Decision Algorithm for Optimal Agents
The paper proposes a science-based algorithm for creating optimal agent teams. It involves deciding the number of agents, their roles, and how they communicate. The authors provide a step-by-step approach to avoid common pitfalls.
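To make the shape of such a decision procedure concrete, here is a purely illustrative heuristic; the rules below are invented for illustration and are not the paper's actual criteria:

```python
def choose_topology(n_agents: int, tasks_independent: bool) -> str:
    """Illustrative only: pick a coordination structure for an agent team."""
    if n_agents <= 2:
        return "single-or-pair"      # coordination overhead isn't worth it yet
    if tasks_independent:
        return "peer-to-peer"        # agents can work in parallel, no boss needed
    return "supervisor"              # interdependent tasks favor a 'boss' agent
```

The point is the form, not the thresholds: the framework turns "how should we organize agents?" into explicit, testable rules you can validate with evals on your own workload.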
Code Examples
The code uses a Jupyter notebook on Google Colab. Steps include:
- Installing utilities and Python libraries – Setting up the environment.
- Starting the Ollama server – downloading the model and serving it locally.
- Testing the model – Verifying LLM behavior.
- Running AI agents – Deploying a simple multi-agent system.
Each step is accompanied by executable cells. The full notebook is available in the study's supplementary materials.
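As a sketch of the "testing the model" step, the following queries a locally running Ollama server through its documented `/api/generate` HTTP endpoint (the model name `llama3.2` is just an example; the server must already be running with that model pulled):

```python
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    """Build the JSON payload Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask(model: str, prompt: str, host: str = "http://localhost:11434") -> str:
    """Send one prompt to a local Ollama server and return the model's reply."""
    data = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running server):
#   print(ask("llama3.2", "In one sentence, what is an AI agent?"))
```

Port 11434 is Ollama's default; setting `"stream": False` returns the whole reply in one JSON object instead of a token stream.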
Conclusion: The Future of AI Is Evals
The takeaway is clear: successful scaling of AI agents requires rigorous evaluation. Organizations must treat agent systems as scientific experiments—testing structures, measuring outcomes, and iterating. Only then will agents move from isolated wins to enterprise-wide success.
For developers, this means investing in evaluation frameworks and collaboration between AI and operations teams. The era of guessing is over; the science of scaling has begun.
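A minimal sketch of what such an evaluation harness can look like, with a stub agent standing in for a real multi-agent system:

```python
def agent(question: str) -> str:
    """Stub agent -- replace with a real agent pipeline."""
    return "4" if question == "2+2?" else "unknown"

def run_evals(agent_fn, cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the agent's answer matches expected."""
    passed = sum(agent_fn(q) == expected for q, expected in cases)
    return passed / len(cases)

cases = [("2+2?", "4"), ("Capital of France?", "Paris")]
score = run_evals(agent, cases)  # 0.5 with this stub
```

Run the same eval suite against every candidate team structure, and "how should we organize our agents?" becomes an empirical question with a measurable answer.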