ℹ️ Unofficial Community Analysis

DeepSWE Benchmark Report

Comprehensive analysis of benchmark scores for 19 AI coding agents across 113 real-world software engineering tasks, 13,424 trials in total.

Disclaimer: This is an independent, community-driven report created for interactive exploration of DeepSWE trial data. It is not affiliated with, sponsored by, or endorsed by Datacurve or the DeepSWE benchmark team. MiMo V2.5 was benchmarked independently, and MiMo V2.5 Pro pricing has been adjusted from the official benchmark values to reflect its recent permanent price drop.

📊 Overview

Base Models

Providers

Tasks: 113

Total Trials: 13,424

Overall Pass Rate: 32.9%

Total Cost: $64,697
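The headline figures above imply a few derived numbers. The sketch below is an approximation built only from the rounded values shown in the overview. The variable names are illustrative, and it is an assumption that the total cost covers exactly the same 13,424 trials.

```python
# Headline figures copied from the overview (rounded values).
total_trials = 13_424
overall_pass_rate = 0.329   # 32.9%
total_cost_usd = 64_697.0

# Approximate number of passing trials; inexact because the rate is rounded.
approx_passes = round(total_trials * overall_pass_rate)

# Average spend per trial and per passing trial across the whole benchmark.
avg_cost_per_trial = total_cost_usd / total_trials
approx_cost_per_pass = total_cost_usd / approx_passes

print(f"approx. passes:        {approx_passes}")
print(f"avg cost per trial:    ${avg_cost_per_trial:.2f}")
print(f"approx. cost per pass: ${approx_cost_per_pass:.2f}")
```

These aggregates mix cheap and expensive models, so they say little about any single agent. The per-model Avg Cost and Cost/Pass columns in the leaderboard are the more useful comparison.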

Model Pass Rates — Full Leaderboard

Provider Comparison (Pass Rate vs. Avg Cost)

🏆 Model Leaderboard

All 18 models ranked by pass rate, with cost, token usage, and timing metrics.

#	Model	Provider	Family	Trials	Passed	Pass Rate	Avg Cost	Cost/Pass	Avg Input Tokens	Avg Output Tokens	Avg Steps	Avg Duration

🏆 Multi-Dimensional Rankings

Models ranked across 10 different dimensions, so teams can focus on what matters most to them.

👨‍👩‍👧‍👦 Model Family Analysis

How model families compare in capability, cost-efficiency, and specialization.

Family Pass Rates (Best Model)

Family Cost Efficiency (Best Model)

💻 Language Performance

How models perform across Python, TypeScript, Go, Rust, and JavaScript.

Pass Rate by Language — Top Models

Language Distribution (113 tasks)

Pass Rate by Language — All Models

Model	Python	TypeScript	Go	Rust	JavaScript

💰 Cost & Efficiency Analysis

⚠️ Pricing Update: MiMo V2.5 Pro pricing has been updated to reflect a major price cut. New rates: $0.435/M input (was $1.00), $0.0036/M cached input (was $0.20), $0.870/M output (was $3.00). This reduces cost per pass from $10.20 to $0.69 (93% reduction) and cost per trial from $1.99 to $0.13.
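The reductions quoted in the pricing note can be checked with simple arithmetic. The sketch below recomputes them from the stated rates. The `trial_cost_usd` helper is hypothetical: it assumes cached and uncached input tokens are billed separately at the listed per-million rates, which the report does not spell out.

```python
# Per-million-token rates in USD, copied from the pricing note.
OLD_RATES = {"input": 1.00, "cached_input": 0.20, "output": 3.00}
NEW_RATES = {"input": 0.435, "cached_input": 0.0036, "output": 0.870}


def percent_reduction(old: float, new: float) -> float:
    """Percentage drop when a value goes from `old` to `new`."""
    return (old - new) / old * 100.0


def trial_cost_usd(uncached_input: int, cached_input: int, output: int,
                   rates: dict) -> float:
    """Cost of one trial, assuming each token class is billed separately.

    This billing model is an assumption, not something the report confirms.
    """
    total = (uncached_input * rates["input"]
             + cached_input * rates["cached_input"]
             + output * rates["output"])
    return total / 1_000_000


# Reduction in each published rate.
for kind in OLD_RATES:
    drop = percent_reduction(OLD_RATES[kind], NEW_RATES[kind])
    print(f"{kind}: {drop:.1f}% lower")

# Reductions in the quoted per-pass and per-trial costs.
print(f"cost per pass:  {percent_reduction(10.20, 0.69):.1f}% lower")
print(f"cost per trial: {percent_reduction(1.99, 0.13):.1f}% lower")
```

Both cost figures work out to roughly 93%, matching the note. The overall drop is larger than the cut to the input or output rate alone, which suggests (though the report does not say so) that cached input tokens make up a large share of each trial's usage.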

Cost per Pass vs. Pass Rate (Value Quadrant)

Duration: Pass vs. Fail

Token Efficiency: Input Tokens vs. Pass Rate

Success Rate by Agent Steps

🌡️ Model × Task Heatmap

Pass rates for every model on every synthetic task (excluding SWE-bench instances).

📋 Task Analysis

Task	Language	Repository	Trials	Passed	Pass Rate


Task	Language	Repository	Trials	Best Model	Best Rate	Worst Model	Worst Rate	Spread

Impossible Tasks (0% pass rate across all models)
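As a rough illustration of how the task-level columns could be derived from raw trial records, the sketch below computes per-model pass rates for each task, then the best and worst model, the spread, and an "impossible" flag. The records, task names, and model names are hypothetical, not DeepSWE data, and it is an assumption that Spread means best rate minus worst rate.

```python
from collections import defaultdict

# Hypothetical trial records: (task, model, passed). Not real benchmark data.
trials = [
    ("task-a", "model-x", True),
    ("task-a", "model-x", False),
    ("task-a", "model-y", False),
    ("task-b", "model-x", False),
    ("task-b", "model-y", False),
]


def pass_rates_by_task(records):
    """Return {task: {model: pass_rate}} from (task, model, passed) tuples."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passed, total]
    for task, model, passed in records:
        cell = counts[task][model]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        task: {model: p / n for model, (p, n) in models.items()}
        for task, models in counts.items()
    }


def summarize(rates):
    """Best/worst model, spread, and impossible flag for each task."""
    summary = {}
    for task, by_model in rates.items():
        best = max(by_model, key=by_model.get)
        worst = min(by_model, key=by_model.get)
        summary[task] = {
            "best_model": best,
            "best_rate": by_model[best],
            "worst_model": worst,
            "worst_rate": by_model[worst],
            # Assumed definition: spread = best rate minus worst rate.
            "spread": by_model[best] - by_model[worst],
            # A task is "impossible" when even the best model never passes.
            "impossible": by_model[best] == 0.0,
        }
    return summary


summary = summarize(pass_rates_by_task(trials))
impossible_tasks = [task for task, row in summary.items() if row["impossible"]]
print(impossible_tasks)
```

A high spread marks a task that separates strong models from weak ones, while an impossible task contributes nothing to the ranking but still adds to total cost.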

🔍 Detailed Model Profiles

Each model's strengths, weaknesses, best and worst tasks, and family context.