Comprehensive analysis of 19 AI coding agents across 113 real-world software engineering tasks, with 13,424 trials in total (benchmark scores)
Disclaimer: This is an independent community-driven report created for interactive exploration of DeepSWE trial data. It is not officially affiliated with, sponsored by, or endorsed by Datacurve or the DeepSWE benchmark team. Mimo V2.5 was benchmarked independently, and Mimo V2.5 Pro pricing has been adjusted from the official benchmark values to reflect its recent permanent price drop.
All 18 models ranked by pass rate, with cost, token usage, and timing metrics.
| # | Model | Provider | Family | Trials | Passed | Pass Rate | Avg Cost | Cost/Pass | Avg Input Tokens | Avg Output Tokens | Avg Steps | Avg Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
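
The derived columns in this table can be sketched as follows. This is a minimal illustration only: it assumes Pass Rate is passed trials divided by total trials and Cost/Pass is total spend divided by passed trials, which the report does not state explicitly. The `ModelStats` type, its field names, and the sample figures are hypothetical and are not taken from the DeepSWE data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelStats:
    """Hypothetical per-model aggregate; field names are illustrative."""
    model: str
    trials: int
    passed: int
    avg_cost: float  # assumed to be the mean cost per trial, in dollars

    @property
    def pass_rate(self) -> float:
        # Assumed definition: passed trials / total trials.
        return self.passed / self.trials if self.trials else 0.0

    @property
    def cost_per_pass(self) -> Optional[float]:
        # Assumed definition: total spend / passed trials.
        # Undefined (None) when the model never passed.
        if self.passed == 0:
            return None
        return self.avg_cost * self.trials / self.passed


def rank_by_pass_rate(stats: List[ModelStats]) -> List[ModelStats]:
    """Return models sorted from highest to lowest pass rate."""
    return sorted(stats, key=lambda s: s.pass_rate, reverse=True)


# Made-up numbers, for illustration only.
sample = [
    ModelStats("model-a", trials=10, passed=4, avg_cost=0.50),
    ModelStats("model-b", trials=10, passed=7, avg_cost=1.40),
]

for row in rank_by_pass_rate(sample):
    print(row.model, f"{row.pass_rate:.0%}", row.cost_per_pass)
```

Under these assumptions a model with a higher average cost can still be cheaper per pass, which is presumably why the table reports both figures.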
Models ranked across 10 different dimensions. Click the tabs to explore what matters most to your team.
How model families compare in capability, cost-efficiency, and specialization.
How models perform across Python, TypeScript, Go, Rust, and JavaScript.
| Model | Python | TypeScript | Go | Rust | JavaScript |
|---|---|---|---|---|---|
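
A per-language breakdown like this one could be produced by grouping trials by model and language. The sketch below is an assumption about the aggregation, not the report's actual pipeline; the `(model, language, passed)` tuple shape and the sample rows are hypothetical.

```python
from collections import defaultdict


def pass_rate_by_language(trials):
    """Build a {model: {language: pass_rate}} table.

    `trials` is an iterable of (model, language, passed) tuples,
    a hypothetical record shape chosen for this example.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, language) -> [passed, total]
    for model, language, passed in trials:
        cell = counts[(model, language)]
        cell[0] += int(passed)
        cell[1] += 1

    table = defaultdict(dict)
    for (model, language), (n_passed, n_total) in counts.items():
        table[model][language] = n_passed / n_total
    return dict(table)


# Made-up trial records, for illustration only.
sample_trials = [
    ("model-a", "Python", True),
    ("model-a", "Python", False),
    ("model-a", "Go", True),
    ("model-b", "Python", True),
]

print(pass_rate_by_language(sample_trials))
```

Languages a model was never run on are simply absent from its row, so a renderer would need to decide how to display those cells.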
Pass rates for every model on every synthetic task (excluding SWE-bench instances). Scroll horizontally to see all models.
| Task | Language | Repository | Trials | Passed | Pass Rate |
|---|---|---|---|---|---|
| Task | Language | Repository | Trials | Passed | Pass Rate |
|---|---|---|---|---|---|
| Task | Language | Repository | Trials | Best Model | Best Rate | Worst Model | Worst Rate | Spread |
|---|---|---|---|---|---|---|---|---|
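
The Best/Worst/Spread columns can be read as follows, assuming Spread is the best model's pass rate minus the worst model's pass rate on the same task (the report does not define it). The function name and the sample rates are hypothetical.

```python
def task_spread(rates_by_model):
    """Summarize one task from a {model: pass_rate} mapping.

    Assumes Spread = best pass rate - worst pass rate.
    Ties are broken by dictionary order.
    """
    if not rates_by_model:
        raise ValueError("need at least one model")
    best = max(rates_by_model, key=rates_by_model.get)
    worst = min(rates_by_model, key=rates_by_model.get)
    return {
        "best_model": best,
        "best_rate": rates_by_model[best],
        "worst_model": worst,
        "worst_rate": rates_by_model[worst],
        "spread": rates_by_model[best] - rates_by_model[worst],
    }


# Made-up pass rates for a single task, for illustration only.
example = task_spread({"model-a": 0.9, "model-b": 0.2, "model-c": 0.5})
print(example)
```

Under this reading, a large spread marks a task that separates strong models from weak ones, while a spread near zero marks a task that nearly every model passes or nearly every model fails.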
Click any model to see its strengths, weaknesses, best/worst tasks, and family context.