RE-Bench
RE-Bench, or Research Engineering Benchmark, is a METR benchmark for evaluating how well AI agents can perform realistic AI R&D and machine-learning research engineering tasks.
Links
Website: metr.org
Overview
RE-Bench is a benchmark introduced by METR to measure AI systems on research engineering work: the kind of practical, open-ended technical work involved in running experiments, improving ML systems, debugging code, analyzing results, and optimizing performance. Rather than testing only short programming puzzles or static question answering, RE-Bench focuses on longer-horizon tasks that resemble real AI R&D workflows.
What is this?
RE-Bench is like a test suite for AI agents that asks: can this AI do useful machine-learning research work, not just answer questions or write small code snippets? For example, an AI might be placed in a coding environment and asked to improve a model, debug an experiment, optimize performance, or produce a better result under a time limit. The benchmark then scores how well the AI did compared with a target or with human performance.
How it works
RE-Bench evaluates agentic AI systems in controlled research-engineering environments. Tasks are designed to be closer to real ML engineering workflows than typical code benchmarks: agents may need to inspect an existing codebase, run experiments, interpret logs, modify training or evaluation scripts, tune hyperparameters, optimize algorithms, and produce measurable improvements. The benchmark emphasizes end-to-end task performance rather than isolated unit-test correctness.
A key design goal is to measure capabilities relevant to automating AI R&D. This includes not only coding skill, but also experiment planning, empirical judgment, debugging, scientific iteration, compute management, and the ability to make progress under time constraints.
RE-Bench can be used to compare AI agents against human baselines or against other agent scaffolds and model backends. Its scoring is typically task-specific and outcome-oriented, such as achieving better model performance, finding a valid solution, or improving an experimental result.
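Because scoring is task-specific, raw scores from different tasks are not directly comparable. One common way to put them on a shared scale is to rescale each raw score linearly between a starting solution and a reference solution. The sketch below illustrates that idea only; the function name, the example task, and all numbers are hypothetical and are not taken from the RE-Bench codebase.

```python
def normalize_score(raw: float, start: float, reference: float) -> float:
    """Linearly rescale a raw task score so that start -> 0.0 and reference -> 1.0.

    Assumption: a linear mapping is adequate for the task's metric.
    Works for both higher-is-better and lower-is-better metrics, because
    the sign of (reference - start) encodes the direction of improvement.
    """
    if reference == start:
        raise ValueError("reference and starting scores must differ")
    return (raw - start) / (reference - start)


# Hypothetical task where lower validation loss is better.
start_loss = 4.0       # score of the unmodified starting solution
reference_loss = 2.0   # score of a strong reference (e.g. human) solution
agent_loss = 2.5       # score the agent reached within its time limit

print(normalize_score(agent_loss, start_loss, reference_loss))  # 0.75
```

Under this convention a value near 0 means the agent made no progress over the starting point, a value near 1 means it matched the reference, and values above 1 mean it exceeded the reference.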
Why it matters
RE-Bench matters because many existing AI benchmarks measure narrow skills, while the economic and safety significance of frontier AI increasingly depends on whether models can automate complex technical work. AI systems that can perform research engineering could accelerate ML development, software infrastructure work, and AI capability research itself. Measuring this capability helps labs, safety researchers, and policymakers understand how close AI systems are to materially contributing to AI R&D automation.
Practical use cases
- Evaluating whether an AI agent can perform realistic machine-learning research engineering tasks
- Comparing different frontier models or agent scaffolds on long-horizon technical work
- Studying AI R&D automation risks and forecasting the impact of AI systems on ML research productivity
When to use
Use RE-Bench when you want to evaluate agentic AI systems on realistic, outcome-based AI research engineering tasks rather than on short coding problems, multiple-choice exams, or static QA. It is especially relevant for organizations interested in frontier model evaluation, AI safety, ML productivity, and automation of technical research workflows.
When not to use
Do not use RE-Bench if you only need a quick measure of basic coding ability, general language understanding, chat quality, or mathematical reasoning. It may also be excessive for lightweight model comparisons because research-engineering tasks can be expensive, time-consuming, and sensitive to agent scaffold design, compute budget, tool access, and evaluation environment.
Advantages
- Focuses on realistic AI R&D and ML engineering workflows rather than toy problems
- Measures long-horizon agent behavior, including planning, debugging, experimentation, and iteration
- Outcome-oriented scoring makes it more relevant to real productivity than purely textual evaluations
- Useful for comparing AI agents with human research-engineering baselines
- Highly relevant to AI safety analysis because it targets capabilities that could accelerate AI development
Disadvantages
- Likely more expensive and time-consuming to run than standard coding or QA benchmarks
- Results can depend heavily on the agent scaffold, tools, compute limits, and environment configuration
- Task coverage is narrower than the full range of scientific or engineering research work
- May be harder to reproduce or interpret than benchmarks with simple pass/fail unit tests
- Strong performance on RE-Bench does not necessarily imply broad autonomous research capability
Limitations
- Benchmarks only a sample of research-engineering tasks and may not represent all AI R&D workflows
- Scores may be sensitive to time limits, hardware access, library versions, and available tools
- Agents can fail for practical reasons such as environment issues, poor exploration strategy, or compute mismanagement
- Human comparison data may depend on the expertise, incentives, and time budget of the human participants
- Task-specific scoring can miss qualitative aspects of good research, such as insight, robustness, documentation, or long-term maintainability
Suggested experiments
- Run the same model on RE-Bench with multiple agent scaffolds to measure how much performance comes from the base model versus the surrounding agent system
- Compare model performance under different time budgets, such as short, medium, and long runs, to estimate whether the system benefits from extended autonomous work
- Evaluate the same tasks with and without internet access, retrieval tools, or richer debugging tools to measure the importance of tool availability
- Compare AI-agent performance against human ML engineers with matched time limits and compute budgets
- Analyze failure modes by categorizing unsuccessful attempts into planning failures, coding bugs, experiment misinterpretation, environment problems, and poor compute management
- Test whether agents can transfer lessons from one RE-Bench-style task to another without overfitting to a specific benchmark environment
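Several of the experiments above reduce to the same bookkeeping: group runs by condition (scaffold, time budget, tool access) and compare average scores, then tally why unsuccessful runs failed. The sketch below shows one possible way to do this in plain Python. The record layout, scaffold names, scores, and failure labels are all invented for illustration and do not come from RE-Bench results.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical run records:
# (scaffold, time_budget_hours, normalized_score, failure_mode or None)
runs = [
    ("scaffold_a", 2, 0.20, "coding_bug"),
    ("scaffold_a", 2, 0.40, None),
    ("scaffold_a", 8, 0.60, None),
    ("scaffold_b", 2, 0.10, "planning_failure"),
    ("scaffold_b", 8, 0.30, "environment_problem"),
    ("scaffold_b", 8, 0.50, None),
]


def mean_score_by_condition(records):
    """Average the score over all runs sharing a (scaffold, time budget) pair."""
    grouped = defaultdict(list)
    for scaffold, hours, score, _failure in records:
        grouped[(scaffold, hours)].append(score)
    return {condition: mean(scores) for condition, scores in grouped.items()}


def failure_counts(records):
    """Count how often each labelled failure mode occurs, skipping unlabelled runs."""
    return Counter(failure for *_rest, failure in records if failure is not None)


for (scaffold, hours), avg in sorted(mean_score_by_condition(runs).items()):
    print(f"{scaffold} @ {hours}h: mean score {avg:.2f}")

for failure, count in failure_counts(runs).most_common():
    print(f"{failure}: {count}")
```

With only a handful of runs per condition, differences between averages can easily be noise, so a real analysis would want many more runs per cell and some measure of spread alongside the mean.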
Ecosystem Map: Evals Benchmarks
Evaluation frameworks and benchmarks are essential for understanding AI coding tool capabilities. They provide objective measures of performance across real-world tasks and competitive programming challenges.