RE-Bench

RE-Bench, or Research Engineering Benchmark, is a METR benchmark for evaluating how well AI agents can perform realistic AI R&D and machine-learning research engineering tasks.

benchmark, needs_review, useful

#ai-agents #research-engineering #long-horizon-tasks #software-engineering #agent-capability-assessment #metr

Links

Website: metr.org

Overview

RE-Bench is a benchmark introduced by METR to measure AI systems on research engineering work: the kind of practical, open-ended technical work involved in running experiments, improving ML systems, debugging code, analyzing results, and optimizing performance. Rather than testing only short programming puzzles or static question answering, RE-Bench focuses on longer-horizon tasks that resemble real AI R&D workflows.

💡 What is this?

RE-Bench is like a test suite for AI agents that asks: can this AI do useful machine-learning research work, not just answer questions or write small code snippets? For example, an AI might be placed in a coding environment and asked to improve a model, debug an experiment, optimize performance, or produce a better result under a time limit. The benchmark then scores how well the AI did compared with a target or with human performance.
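As a rough illustration of scoring against a target, one common convention is to rescale a raw task metric so that the provided starting solution maps to 0 and the target solution maps to 1. The sketch below assumes that convention; the function name `normalized_score` and the example numbers are hypothetical and are not taken from RE-Bench's own code.

```python
def normalized_score(raw: float, start: float, target: float) -> float:
    """Rescale a raw task metric so that `start` maps to 0.0 and `target` to 1.0.

    Works for lower-is-better metrics too (for example a loss), because the
    sign of (target - start) flips along with the sign of (raw - start).
    All three arguments are hypothetical per-task values.
    """
    if target == start:
        raise ValueError("target and start scores must differ")
    return (raw - start) / (target - start)


# Higher-is-better metric: halfway between the start (0.5) and the target (1.0).
accuracy_score = normalized_score(raw=0.75, start=0.5, target=1.0)

# Lower-is-better metric: a loss that fell from 4.0 to 3.0 with a target of 2.0.
loss_score = normalized_score(raw=3.0, start=4.0, target=2.0)
```

Under this convention a score above 1.0 means the agent beat the target, and a negative score means it made the starting solution worse.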

⚙️ How it works

RE-Bench evaluates agentic AI systems in controlled research-engineering environments. Tasks are designed to be closer to real ML engineering workflows than typical code benchmarks: agents may need to inspect an existing codebase, run experiments, interpret logs, modify training or evaluation scripts, tune hyperparameters, optimize algorithms, and produce measurable improvements. The benchmark emphasizes end-to-end task performance rather than isolated unit-test correctness.

A key design goal is to measure capabilities relevant to automating AI R&D. This includes not only coding skill but also experiment planning, empirical judgment, debugging, scientific iteration, compute management, and the ability to make progress under time constraints.

RE-Bench can be used to compare AI agents against human baselines or against other agent scaffolds and model backends. Scoring is typically task-specific and outcome-oriented: for example, achieving better model performance, finding a valid solution, or improving an experimental result.
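To make the idea of progress under time constraints concrete, here is a minimal sketch that reads a run's score checkpoints and reports the best score reached within a given time budget. It assumes higher scores are better and that each run logs `(elapsed_hours, score)` pairs; the `Checkpoint` type, the `best_within_budget` helper, and the sample data are all hypothetical and do not describe RE-Bench's actual logging format.

```python
from typing import List, Optional, Tuple

# Hypothetical log entry: (hours elapsed since the run started, score at that point).
Checkpoint = Tuple[float, float]


def best_within_budget(checkpoints: List[Checkpoint], budget_hours: float) -> Optional[float]:
    """Return the best score recorded at or before `budget_hours`.

    Returns None if the run produced no scored checkpoint within the budget.
    Assumes higher scores are better.
    """
    eligible = [score for elapsed, score in checkpoints if elapsed <= budget_hours]
    return max(eligible) if eligible else None


# Made-up checkpoints for one agent run and one human attempt on the same task.
agent_run = [(0.5, 0.25), (2.0, 0.5), (6.0, 0.5)]
human_run = [(2.0, 0.125), (6.0, 0.75)]

for budget in (2.0, 8.0):
    agent_best = best_within_budget(agent_run, budget)
    human_best = best_within_budget(human_run, budget)
    print(f"budget={budget}h agent={agent_best} human={human_best}")
```

Evaluating both parties at the same budget keeps the comparison matched: in this made-up data the agent leads at 2 hours while the human leads at 8 hours, which is the kind of crossover a single end-of-run score would hide.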

🎯 Why it matters

RE-Bench matters because many existing AI benchmarks measure narrow skills, while the economic and safety significance of frontier AI increasingly depends on whether models can automate complex technical work. AI systems that can perform research engineering could accelerate ML development, software infrastructure work, and AI capability research itself. Measuring this capability helps labs, safety researchers, and policymakers understand how close AI systems are to materially contributing to AI R&D automation.

🛠️ Practical use cases

• Evaluating whether an AI agent can perform realistic machine-learning research engineering tasks
• Comparing different frontier models or agent scaffolds on long-horizon technical work
• Studying AI R&D automation risks and forecasting the impact of AI systems on ML research productivity

✅ When to use

Use RE-Bench when you want to evaluate agentic AI systems on realistic, outcome-based AI research engineering tasks rather than on short coding problems, multiple-choice exams, or static QA. It is especially relevant for organizations interested in frontier model evaluation, AI safety, ML productivity, and automation of technical research workflows.

❌ When not to use

Do not use RE-Bench if you only need a quick measure of basic coding ability, general language understanding, chat quality, or mathematical reasoning. It may also be excessive for lightweight model comparisons because research-engineering tasks can be expensive, time-consuming, and sensitive to agent scaffold design, compute budget, tool access, and evaluation environment.

👍 Advantages

+ Focuses on realistic AI R&D and ML engineering workflows rather than toy problems
+ Measures long-horizon agent behavior, including planning, debugging, experimentation, and iteration
+ Outcome-oriented scoring makes it more relevant to real productivity than purely textual evaluations
+ Useful for comparing AI agents with human research-engineering baselines
+ Highly relevant to AI safety analysis because it targets capabilities that could accelerate AI development

👎 Disadvantages

− Likely more expensive and time-consuming to run than standard coding or QA benchmarks
− Results can depend heavily on the agent scaffold, tools, compute limits, and environment configuration
− Task coverage is narrower than the full range of scientific or engineering research work
− May be harder to reproduce or interpret than benchmarks with simple pass/fail unit tests
− Strong performance on RE-Bench does not necessarily imply broad autonomous research capability

⚠️ Limitations

• Benchmarks only a sample of research-engineering tasks and may not represent all AI R&D workflows
• Scores may be sensitive to time limits, hardware access, library versions, and available tools
• Agents can fail for practical reasons such as environment issues, poor exploration strategy, or compute mismanagement
• Human comparison data may depend on the expertise, incentives, and time budget of the human participants
• Task-specific scoring can miss qualitative aspects of good research, such as insight, robustness, documentation, or long-term maintainability

🔄 Alternatives to consider

MLE-bench, MLAgentBench, SWE-bench, SWE-bench Verified, AgentBench, GAIA, DS-1000, HumanEval, Aider polyglot benchmark, METR task suites

📚 Related concepts to learn

AI R&D automation, Research engineering, Agentic AI evaluation, Long-horizon tasks, Machine-learning engineering, Autonomous coding agents, AI safety evaluations, Capability forecasting, Tool-using AI agents, Benchmark contamination, Human baseline comparison, Scaffolded language-model agents

🧪 Suggested experiments

→ Run the same model on RE-Bench with multiple agent scaffolds to measure how much performance comes from the base model versus the surrounding agent system
→ Compare model performance under different time budgets, such as short, medium, and long runs, to estimate whether the system benefits from extended autonomous work
→ Evaluate the same tasks with and without internet access, retrieval tools, or richer debugging tools to measure the importance of tool availability
→ Compare AI-agent performance against human ML engineers with matched time limits and compute budgets
→ Analyze failure modes by categorizing unsuccessful attempts into planning failures, coding bugs, experiment misinterpretation, environment problems, and poor compute management
→ Test whether agents can transfer lessons from one RE-Bench-style task to another without overfitting to a specific benchmark environment
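Several of the experiments above reduce to the same bookkeeping: group run results by condition (scaffold and time budget) and tally labeled failure modes. The following is a minimal sketch of that bookkeeping. The `Run` record, the category names, and the sample data are hypothetical; they mirror the failure categories listed above rather than any official RE-Bench schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Hypothetical labels mirroring the failure categories named in the list above.
FAILURE_MODES = {
    "planning",
    "coding_bug",
    "experiment_misinterpretation",
    "environment",
    "compute_management",
}


@dataclass
class Run:
    scaffold: str                       # hypothetical scaffold identifier
    budget_hours: float                 # time budget given to the run
    score: float                        # task score (assumed higher is better)
    failure_mode: Optional[str] = None  # None means no failure was labeled


def mean_score_by_condition(runs: List[Run]) -> Dict[Tuple[str, float], float]:
    """Average score for each (scaffold, time budget) condition."""
    grouped: Dict[Tuple[str, float], List[float]] = defaultdict(list)
    for run in runs:
        grouped[(run.scaffold, run.budget_hours)].append(run.score)
    return {condition: mean(scores) for condition, scores in grouped.items()}


def failure_counts(runs: List[Run]) -> Counter:
    """Count labeled failure modes, rejecting labels outside FAILURE_MODES."""
    counts: Counter = Counter()
    for run in runs:
        if run.failure_mode is None:
            continue
        if run.failure_mode not in FAILURE_MODES:
            raise ValueError(f"unknown failure mode: {run.failure_mode}")
        counts[run.failure_mode] += 1
    return counts


# Made-up results for two scaffolds, for illustration only.
runs = [
    Run("scaffold_a", 2.0, 0.25, "planning"),
    Run("scaffold_a", 2.0, 0.75),
    Run("scaffold_a", 8.0, 1.0),
    Run("scaffold_b", 2.0, 0.0, "environment"),
    Run("scaffold_b", 2.0, 0.5, "planning"),
]

means = mean_score_by_condition(runs)
failures = failure_counts(runs)
```

Keeping the failure label separate from the score lets a partially successful run still be categorized, which matters when most attempts make some progress but stall for an identifiable reason.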

🗺️ Ecosystem Map: Evals Benchmarks

Evaluation frameworks and benchmarks are essential for understanding AI coding tool capabilities. They provide objective measures of performance across real-world tasks and competitive programming challenges.

Key Concepts

Real-world issue resolution, Competitive programming evals, Agent capability assessment, Data contamination prevention

Major Tools

SWE-bench

Emerging Tools

LiveCodeBench

Metadata

Slug: re-bench

Primary section: evals-benchmarks

Status: active

Review: ai_generated

Setup: moderate

Activity: unknown

Version: 1

Version generated: 2026-05-30 13:33:32 UTC

Version reason: AI discovery

Discovered: 2026-05-30 13:33:32 UTC

Last checked: 2026-05-30 13:57:26 UTC

Stale at: 2026-06-29 13:57:26 UTC

Created: 2026-05-30 13:33:32 UTC

Updated: 2026-05-30 13:57:26 UTC
