SWE-bench Verified
SWE-bench Verified is a human-validated subset of SWE-bench designed to more reliably evaluate AI agents on real-world software engineering bug-fixing tasks.
Links
Website: openai.com
Overview
SWE-bench Verified is a benchmark for evaluating AI coding agents on realistic software engineering tasks drawn from real GitHub issues and pull requests. It builds on the original SWE-bench benchmark, which asks models or agents to modify an existing codebase so that hidden tests pass, simulating the process of resolving real software bugs or feature requests.
What is this?
SWE-bench Verified is like a practical exam for AI coding assistants. Instead of asking the AI to solve small toy programming problems, it gives the AI a real software project, a real issue from GitHub, and asks it to edit the code so the problem is fixed. The AI has to understand the repository, locate the relevant files, make a correct change, and pass tests.
How it works
SWE-bench Verified is a curated subset of SWE-bench consisting of human-reviewed software engineering tasks. Each task is derived from a GitHub issue and its corresponding pull request. The evaluation setup provides the model or agent with a repository snapshot and an issue description, then checks the submitted patch against the task's tests: the tests that the reference fix is expected to make pass must pass, and the tests that passed before the change must keep passing.
The 'Verified' subset was created to address quality issues in the original SWE-bench dataset, such as underspecified issue descriptions, incorrect or overly narrow tests, and tasks where the expected patch may not correspond well to the stated problem. Human reviewers, generally experienced software developers, validate that tasks are understandable, testable, and appropriate for evaluating real coding ability.
This makes SWE-bench Verified especially useful for benchmarking agentic systems that combine LLM reasoning, repository search, tool use, code editing, test execution, and iterative debugging.
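The pass/fail rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` class, its field names, and the sample instance IDs are hypothetical, and the sketch assumes every task has at least one test that the fix is supposed to make pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after a candidate patch is applied.

    Hypothetical structure for illustration only.
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix should make pass -> did they pass?
    pass_to_pass: dict[str, bool]  # tests that passed before -> do they still pass?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if both groups of tests pass in full."""
    # Assumption: a task with no fail-to-pass tests is treated as unresolved,
    # because there is nothing to show that the issue was actually fixed.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    demo = [
        TaskResult("demo__repo-1", {"test_fix": True}, {"test_existing": True}),
        # The fix works here, but it breaks a test that used to pass.
        TaskResult("demo__repo-2", {"test_fix": True}, {"test_existing": False}),
    ]
    print(f"resolve rate: {resolve_rate(demo):.2f}")
```

The second demo task shows why both groups matter: a patch that fixes the reported problem but causes a regression is not counted as a solution.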
Why it matters
SWE-bench Verified matters because it evaluates AI systems on realistic software maintenance work rather than isolated coding puzzles. As AI developer tools move from autocomplete toward autonomous agents that can fix bugs and submit patches, the ecosystem needs benchmarks that measure whether these agents can perform meaningful work in real repositories. The verified subset improves confidence in benchmark results by reducing noise from flawed or ambiguous tasks.
Practical use cases
- Benchmarking AI coding agents on realistic GitHub issue resolution tasks
- Comparing different LLMs, prompting strategies, tool-use systems, and agent architectures for software engineering
- Testing whether an AI development workflow can navigate repositories, edit code, run tests, and produce valid patches
When to use
Use SWE-bench Verified when you want to evaluate an AI system's ability to solve real-world software engineering tasks end-to-end, especially bug fixing or issue resolution in existing codebases. It is most appropriate for testing agentic coding systems that can inspect files, reason over repository structure, edit code, and run validation tests.
When not to use
Do not use SWE-bench Verified if you only need to measure basic programming knowledge, syntax completion, algorithmic problem solving, or code generation in isolation. It is also a poor fit for evaluating non-Python ecosystems, because its tasks are drawn from Python repositories, and for UI-heavy development tasks, product design ability, security auditing, or work where interactive human collaboration is the primary target.
Advantages
- Human validation improves reliability compared with noisier benchmark subsets
- Uses real-world GitHub issues and repositories rather than artificial programming puzzles
- Measures practical agent capabilities such as repository understanding, patch generation, and test-driven debugging
- Better suited for evaluating autonomous software engineering agents than simple code-generation benchmarks
- Provides a more credible signal for progress in AI-assisted software maintenance
Disadvantages
- Still represents only a subset of real software engineering work
- Benchmark runs can be computationally expensive and operationally complex
- Performance can depend heavily on agent scaffolding, tools, retrieval, and test execution strategy, not just the base model
- May encourage over-optimization to a fixed benchmark
- Passing tests does not always guarantee a patch is ideal, maintainable, or production-ready
Limitations
- Contains only 500 verified tasks, so small differences in solve rate between systems may not be meaningful
- Primarily measures issue-level code modification rather than broader engineering activities such as architecture, planning, code review, or long-term maintenance
- Hidden or benchmark-specific tests may not capture every real-world correctness criterion
- Repository setup, dependency installation, and test execution can introduce evaluation friction
- Results may not generalize perfectly to private enterprise codebases, unfamiliar stacks, or tasks requiring domain knowledge
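To get a feel for how much uncertainty a task set of this size leaves in a reported score, the sketch below computes a Wilson score interval for a solve rate. Using this interval is a choice made for this example, not something the benchmark prescribes. It also treats tasks as independent trials, which is a simplification, and the function name is hypothetical.

```python
import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a solve rate (Wilson score interval).

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    p = resolved / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    low, high = wilson_interval(resolved=250, total=500)
    print(f"250/500 resolved: about {low:.3f} to {high:.3f}")
```

Under these assumptions, a 50% solve rate on 500 tasks carries an interval of a little over four percentage points on either side. A gap of one or two points between two systems should therefore be read with caution.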
Suggested experiments
- Run the same coding agent on SWE-bench Lite and SWE-bench Verified to compare how task curation changes measured performance
- Evaluate multiple base models with the same agent scaffold to isolate the effect of model capability
- Compare a single-shot patch-generation approach against an iterative agent that can run tests and revise its patch
- Measure how repository retrieval strategy affects solve rate on SWE-bench Verified tasks
- Analyze failed tasks by category, such as misunderstood issue, incorrect file localization, failing tests, incomplete patch, or dependency/setup failure
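The failure analysis in the last experiment can start from a simple tally. The sketch below assumes each run has already been labeled by hand or by a script. The record layout (`instance_id`, `resolved`, `failure_category`) and the category identifiers are hypothetical names chosen to mirror the list above.

```python
from collections import Counter

# Category labels mirroring the experiment description (hypothetical identifiers).
FAILURE_CATEGORIES = (
    "misunderstood_issue",
    "incorrect_file_localization",
    "failing_tests",
    "incomplete_patch",
    "setup_failure",
)


def summarize(runs: list[dict]) -> dict:
    """Compute the solve rate and a per-category count of failed tasks.

    Each record is expected to have 'instance_id', 'resolved' (bool), and
    'failure_category' (one of FAILURE_CATEGORIES when unresolved).
    """
    failures = Counter(r["failure_category"] for r in runs if not r["resolved"])
    unknown = set(failures) - set(FAILURE_CATEGORIES)
    if unknown:
        raise ValueError(f"unrecognized failure categories: {sorted(map(str, unknown))}")
    resolved = sum(1 for r in runs if r["resolved"])
    return {
        "total": len(runs),
        "resolved": resolved,
        "solve_rate": resolved / len(runs) if runs else 0.0,
        "failures": dict(failures),
    }


if __name__ == "__main__":
    demo_runs = [
        {"instance_id": "demo-1", "resolved": True, "failure_category": None},
        {"instance_id": "demo-2", "resolved": False, "failure_category": "incomplete_patch"},
        {"instance_id": "demo-3", "resolved": False, "failure_category": "setup_failure"},
    ]
    print(summarize(demo_runs))
```

Running `summarize` once per configuration (for example, per base model or per retrieval strategy) on the same task list gives directly comparable solve rates and failure profiles.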
Ecosystem Map: Evals Benchmarks
Evaluation frameworks and benchmarks are essential for understanding AI coding tool capabilities. They provide objective measures of performance across real-world tasks and competitive programming challenges.