Terminal Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

by SkillAiNest

The developers of Terminal Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual release aims to address long-standing pain points for testing and optimizing AI agents, particularly those designed to operate autonomously in realistic developer environments.

With a more difficult and rigorously validated task set, Terminal Bench 2.0 replaces version 1.0 as the benchmark for evaluating frontier model capabilities.

Harbor, the accompanying runtime framework, enables developers and researchers to scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

"Harbor is the package we wished we had when we were building Terminal Bench," co-creator Alex Shaw wrote on X. "It is intended for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models."

High bar, cleaner data

Terminal Bench 1.0 saw rapid adoption after its release in May 2025, becoming the default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with the system through the command line, simulating how developers work behind the scenes of a graphical user interface.

However, its wide scope came with inconsistencies: the community identified a number of tasks as broken or unstable due to external service changes.

Version 2.0 addresses these issues directly. The new suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. Emphasis is placed on making tasks rigorous, realistic, and clearly defined, raising the ceiling of difficulty while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed in 2.0 due to its dependency on unstable third-party APIs.

"Despite this, SOTA performance is comparable to TB 1.0," Shaw noted on X. "We believe this is because the quality of tasks in the new benchmark is substantially higher."

Harbor: Unified rollout at scale

Along with the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers such as Daytona and Modal.

Designed to standardize agent architectures, Harbor supports:

  • Evaluation of any agent that can be installed in a container

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available through harborframework.com, with documentation for evaluating and submitting agents to the public leaderboard.

Preliminary results: GPT-5 leads the leaderboard

Preliminary results from the Terminal Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface), powered by GPT-5, leading with a 49.6 percent success rate.

It is followed by other GPT-5 variants and agents based on Claude Sonnet 4.5.

Top 5 Agent Results (Terminal Bench 2.0):

  1. Codex CLI (GPT-5) – 49.6%

  2. Codex CLI (GPT-5-CODEX) – 44.3%

  3. OpenHands (GPT-5) – 43.8%

  4. Terminus 2 (GPT-5-CODEX) – 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) – 42.8%

Close clustering among top models indicates active competition across platforms, with no single agent solving more than half of the tasks.

Submission and use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. The leaderboard requires five benchmark runs per submission, and the resulting job directories can be emailed to the developers for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
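
For illustration, a filled-in version of that command might look like the sketch below. This is a hedged example: the install step and package name are assumptions not confirmed in the announcement, and the model and agent identifiers ("openai/gpt-5", "codex-cli") are illustrative placeholders rather than official registry names.

pip install harbor                        # assumed install command and package name
harbor run \
  -d terminal-bench@2.0 \                 # dataset: Terminal Bench 2.0
  -m "openai/gpt-5" \                     # model identifier (illustrative placeholder)
  -a "codex-cli" \                        # agent name (illustrative placeholder)
  --n-attempts 5 \                        # five runs, as the leaderboard requires
  --jobs-dir ./tb2-results                # output directory (illustrative path)

Once the runs finish, the directory written by --jobs-dir contains the results that are emailed to the developers for leaderboard validation.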

Terminal Bench 2.0 is already being integrated into research workflows that focus on agentic reasoning, code generation, and tool usage. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed presentation covering the validation process and design methodology behind the benchmark is in the works.

A step toward standardization

The joint release of Terminal Bench 2.0 and Harbor is a step toward a more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has increased.

These tools offer a potential basis for a unified assessment stack.
