OpenAI Launches SWE-bench Verified: Enhancing AI Software Engineering Capability Assessment
OpenAI announced the launch of SWE-bench Verified, a code generation evaluation benchmark, on August 13th. This new benchmark aims to more accurately assess the performance of AI models in software engineering tasks, addressing several limitations of the previous SWE-bench.
SWE-bench is an evaluation dataset based on real software issues from GitHub, containing 2,294 Issue-Pull Request pairs from 12 popular Python repositories. However, the original SWE-bench had three main issues: overly strict unit tests that could reject correct solutions, unclear problem descriptions, and unreliable development environment setup.
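For readers who want to inspect the underlying data, the sketch below shows one way to load and compare the original benchmark and the verified subset. It assumes the datasets are published on Hugging Face under the identifiers princeton-nlp/SWE-bench and princeton-nlp/SWE-bench_Verified, and that each record carries fields such as repo, instance_id, and problem_statement; these names come from the public dataset release and are not confirmed by the announcement itself.

```python
# Minimal sketch of inspecting SWE-bench data. Assumes the Hugging Face
# `datasets` library is installed and that the benchmarks are hosted under
# the identifiers below; field names are assumptions, not official API.
from datasets import load_dataset

# Original benchmark: 2,294 Issue-Pull Request pairs from 12 Python repositories.
swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")

# Human-verified subset announced by OpenAI.
swe_bench_verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(len(swe_bench), len(swe_bench_verified))

# Each instance pairs a GitHub issue with the pull request that resolved it.
example = swe_bench_verified[0]
print(example["repo"])                      # source repository (assumed field name)
print(example["instance_id"])               # unique identifier for the issue/PR pair
print(example["problem_statement"][:500])   # the issue text a model is asked to solve
```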