Automating Agent Analysis: How eval-agents Transformed Our Research Workflow
In the world of AI research, analyzing coding agents has often meant drowning in endless JSON files. One engineer on the Copilot Applied Science team decided to turn that pain into a powerful tool. By leveraging GitHub Copilot and open-source principles, they created eval-agents—a library that automates the intellectual toil of studying agent trajectories. Here’s how it works, why it matters, and how you can apply similar thinking to your own projects.
What sparked the creation of eval-agents?
The project began when an AI researcher faced a daunting task: analyzing thousands of agent trajectories—detailed logs of how coding agents think and act while solving tasks. These logs, often in JSON format, contain hundreds of lines each. With dozens of tasks per benchmark and multiple runs daily, the researcher was looking at hundreds of thousands of lines of log output. The repetitive nature of this analysis—using GitHub Copilot to surface patterns, then manually investigating—became a bottleneck. The researcher realized they could automate not just manual labor but intellectual toil by building agents that do the heavy lifting of pattern detection and anomaly spotting. This insight led directly to the creation of eval-agents.

What exactly are coding agent trajectories, and why are they so hard to analyze?
Coding agent trajectories are essentially step-by-step records of what an AI agent does when tackling a challenge. Each trajectory shows the agent’s thought process, actions, and outcomes—captured in a JSON file that can be hundreds of lines long. For benchmarks like TerminalBench2 or SWEBench-Pro, each task produces its own trajectory. Multiply that by dozens of tasks and multiple benchmark runs per day, and you’re facing an overwhelming volume of data. Reading through all that manually is impossible. The real challenge isn’t just the quantity—it’s the need to identify subtle patterns, errors, or successes across many runs. This is where automation becomes essential, but traditional tools often fall short.
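To make the scale concrete, here is a minimal Python sketch of tallying a directory of trajectory files. The article doesn't show the actual trajectory schema, so the field names ("steps", "outcome") and the directory path are illustrative assumptions, not the real format:

```python
import json
from pathlib import Path

# Assumed layout: one JSON trajectory per task, each with a list of "steps"
# holding the agent's thought, the action it took, and the observed outcome.
results_dir = Path("benchmark_runs/latest")  # example path, not the real layout

total_lines = 0
for trajectory_file in sorted(results_dir.glob("*.json")):
    text = trajectory_file.read_text()
    total_lines += text.count("\n") + 1
    trajectory = json.loads(text)
    steps = trajectory.get("steps", [])                          # assumed field name
    errors = [s for s in steps if s.get("outcome") == "error"]   # assumed field name
    print(f"{trajectory_file.name}: {len(steps)} steps, {len(errors)} error steps")

print(f"Total lines across this run: {total_lines}")
```

Even this crude tally makes the problem obvious: a single benchmark run produces far more text than anyone can read, and the interesting signal is spread thinly across it.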
How did GitHub Copilot help before the automation project existed?
Before building eval-agents, the researcher used GitHub Copilot as a clever shortcut to tame the data. Instead of reading hundreds of thousands of lines, they’d ask Copilot to surface patterns across trajectory files—for example, highlighting recurring failures or successful strategies. That cut the reading load to just a few hundred lines. However, the process was still manual: each new benchmark run meant repeating the same loop of querying Copilot, interpreting the results, and investigating. The researcher, driven by a desire to remove repetitive tasks, saw an opportunity to turn this ad hoc approach into a reusable, shareable system—planting the seed for eval-agents.
What were the main design goals for the eval-agents project?
The engineer set three core principles to guide development. First, make agents easy to share and use—any team member should be able to run and benefit from them without deep configuration. Second, make it easy to author new agents—lowering the barrier for creating custom analyzers. Third, make coding agents the primary vehicle for contributions, so that adding new capabilities feels natural to developers. These goals align with GitHub’s collaborative DNA, reflecting the researcher’s experience as an open-source maintainer of the GitHub CLI. The result is a library where agents are designed for teamwork: each agent is a self-contained, testable piece of code that can be version-controlled, reviewed, and iterated on by the whole team.
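The article doesn't show the library's actual interface, but "a self-contained, testable piece of code" can be as small as a single function. Here is a hypothetical sketch of an analysis agent that flags "agent loops" (the same action repeated back to back); the name, signature, and trajectory schema are illustrative, not eval-agents' real API:

```python
# Hypothetical analysis agent: flag runs of identical consecutive actions.
# The "steps"/"action" fields are assumed, not the actual trajectory schema.
def detect_loops(trajectory: dict, threshold: int = 3) -> list[str]:
    """Return a warning for each run of identical consecutive actions."""
    warnings = []
    steps = trajectory.get("steps", [])
    run_length = 1
    for prev, curr in zip(steps, steps[1:]):
        if curr.get("action") == prev.get("action"):
            run_length += 1
            if run_length == threshold:
                warnings.append(
                    f"Action {curr.get('action')!r} repeated {threshold}+ times in a row"
                )
        else:
            run_length = 1
    return warnings
```

A function this small is trivial to unit-test with a hand-written trajectory, which is exactly what makes it easy to review and iterate on as a team.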

How does eval-agents actually enable team collaboration?
Eval-agents turns trajectory analysis into a shared, extensible toolkit. Instead of one person manually inspecting data, the library lets multiple researchers contribute agents that automatically detect specific patterns—like agent loops, token waste, or successful strategies. These agents are stored in a central repository, so anyone can run them on new benchmark results with a single command. The design mimics open-source contribution workflows: team members fork, improve, and submit pull requests for new agents. This not only speeds up analysis but also democratizes the process. A colleague who spots a recurring anomaly can quickly code a small agent to track it, then share it with the team. Over time, the library grows smarter, becoming a collective intelligence tool for the entire Copilot Applied Science group.
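The article doesn't document the command itself, so the following is a hypothetical runner rather than the real eval-agents workflow. It shows the general shape of the idea: collect the team's shared agents, apply each one to every trajectory in a benchmark run, and print whatever they flag:

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical runner, not the real eval-agents interface: apply every shared
# agent to every trajectory in a benchmark run and print whatever it flags.
Agent = Callable[[dict], list[str]]

def run_agents(results_dir: str, agents: dict[str, Agent]) -> None:
    for trajectory_file in sorted(Path(results_dir).glob("*.json")):
        trajectory = json.loads(trajectory_file.read_text())
        for name, agent in agents.items():
            for finding in agent(trajectory):
                print(f"[{name}] {trajectory_file.name}: {finding}")

# Example usage: register shared agents (such as the loop detector sketched
# earlier) and point the runner at the latest benchmark output.
# run_agents("benchmark_runs/latest", {"loops": detect_loops})
```

Under this model, a teammate's new detector is just one more entry in the registry, which is what makes the pull-request-style contribution flow described above feel natural.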
What lessons did the researcher learn about using GitHub Copilot effectively?
Through this journey, the researcher discovered that Copilot excels at accelerating repetitive patterns, not just writing code from scratch. The key was to use Copilot interactively—asking it to summarize, compare, and highlight trends in trajectory data—rather than expecting it to produce final answers. This approach unlocked an incredibly fast development loop: a few prompts reduced hours of reading to minutes of investigation. Another lesson was to combine Copilot with human intuition; the tool handles data crunching, while the researcher focuses on interpreting anomalies and designing new experiments. Finally, the project showed that the real power of AI assistants emerges when you build on top of them—creating custom agents that automate the patterns Copilot helped identify, closing the loop between discovery and automation.