
Evals and Benchmarks

16 items · Refresh every 30d · Live Data

OSWorld

benchmark · needs_review

OSWorld is a benchmark for evaluating multimodal AI agents on real operating-system tasks in a virtual desktop environment.

CodeElo

benchmark · needs_review

CodeElo is a coding benchmark that evaluates large language models on competitive-programming-style problems and reports ability using an Elo-like rating scale.

Multi-SWE-bench

benchmark · needs_review

Multi-SWE-bench is a benchmark for evaluating AI coding agents on real-world software engineering tasks across multiple programming languages and repositories.

SWE-gym

benchmark · needs_review

SWE-gym is a benchmark and training environment for evaluating and improving AI agents on real-world repository-level software engineering tasks.

SWE-smith

framework · needs_review

SWE-smith is a framework from the SWE-bench ecosystem for generating SWE-bench-style software engineering tasks and benchmarks from real code repositories.

RE-Bench

benchmark · needs_review

RE-Bench, or Research Engineering Benchmark, is a METR benchmark for evaluating how well AI agents can perform realistic AI R&D and machine-learning research engineering tasks.

SWE-Lancer

benchmark · needs_review

SWE-Lancer is an OpenAI benchmark that evaluates AI agents on real-world freelance software engineering tasks, scoring them by whether they deliver a working result and by the monetary value of the tasks they complete.

MLE-bench

benchmark · needs_review

MLE-bench is an OpenAI benchmark for evaluating AI agents on end-to-end machine learning engineering tasks using real Kaggle competitions.

Terminal-Bench

benchmark · needs_review

Terminal-Bench is a benchmark for evaluating AI agents on realistic tasks that require using a Unix-like terminal, editing files, running commands, debugging, and verifying results.

Aider Polyglot Benchmark

benchmark · needs_review

Aider Polyglot Benchmark is a code-editing benchmark used by Aider to compare how well AI models modify existing code and pass tests across multiple programming languages.

EvalPlus

framework · needs_review

EvalPlus is a code-generation evaluation framework that augments benchmarks such as HumanEval and MBPP with many additional test cases to produce more reliable LLM coding scores.

BigCodeBench

benchmark · needs_review

BigCodeBench is a code-generation benchmark designed to evaluate large language models on realistic Python programming tasks involving complex instructions and diverse library/API usage.

SWE-bench Multimodal

benchmark · needs_review

SWE-bench Multimodal is a benchmark for evaluating AI software-engineering agents on real GitHub issue-fixing tasks that include visual information such as screenshots, UI bugs, plots, or other images.

SWE-bench Verified

benchmark · needs_review

SWE-bench Verified is a human-validated subset of SWE-bench designed to more reliably evaluate AI agents on real-world software engineering bug-fixing tasks.

LiveCodeBench

trending
benchmark · confirmed · beta

LiveCodeBench is a continuously updated benchmark that evaluates code-generation models on newly published competitive-programming problems, which limits contamination from training data.

SWE-bench

popular
benchmark · confirmed · production

SWE-bench is a benchmark for evaluating AI agents on real-world GitHub issues. It measures an agent's ability to understand, plan, and fix actual software engineering problems in open-source repositories.