OSWorld
OSWorld is a benchmark for evaluating multimodal AI agents on real operating-system tasks in a virtual desktop environment.
CodeElo is a coding benchmark that evaluates large language models on competitive-programming-style problems and reports ability using an Elo-like rating scale.
Multi-SWE-bench is a benchmark for evaluating AI coding agents on real-world software engineering tasks across multiple programming languages and repositories.
SWE-Gym is a benchmark and training environment for evaluating and improving AI agents on real-world repository-level software engineering tasks.
SWE-smith is a framework from the SWE-bench ecosystem for generating SWE-bench-style software engineering tasks and benchmarks from real code repositories.
RE-Bench, or Research Engineering Benchmark, is a METR benchmark for evaluating how well AI agents can perform realistic AI R&D and machine-learning research engineering tasks.
SWE-Lancer is an OpenAI benchmark that evaluates AI agents on real-world freelance software engineering tasks, scoring outcomes by whether the practical deliverable is met and by the monetary value attached to each task.
MLE-bench is an OpenAI benchmark for evaluating AI agents on end-to-end machine learning engineering tasks using real Kaggle competitions.
Terminal-Bench is a benchmark for evaluating AI agents on realistic tasks that require using a Unix-like terminal, editing files, running commands, debugging, and verifying results.
Aider Polyglot Benchmark is a code-editing benchmark used by Aider to compare how well AI models modify existing code and pass tests across multiple programming languages.
EvalPlus is a code-generation evaluation framework that improves benchmarks like HumanEval and MBPP with many additional test cases to produce more reliable LLM coding scores.
BigCodeBench is a code-generation benchmark designed to evaluate large language models on realistic Python programming tasks involving complex instructions and diverse library/API usage.
SWE-bench Multimodal is a benchmark for evaluating AI software-engineering agents on real GitHub issue-fixing tasks that include visual information such as screenshots, UI bugs, plots, or other images.
SWE-bench Verified is a human-validated subset of SWE-bench designed to more reliably evaluate AI agents on real-world software engineering bug-fixing tasks.
A dynamic evaluation platform that assesses code generation models on competitive programming challenges refreshed in near real time, which limits contamination from problems already present in training data.
A benchmark suite for evaluating AI agents on real-world GitHub issues. It measures an agent's ability to understand, plan, and fix actual software engineering problems in open-source repositories.