SWE-AGI
Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents
Overview
Models Evaluated: 11
Tasks: 22
Typical Core LOC per Task: 10^3 to 10^4
Total Tests: 22526
Difficulty Tiers: 3
Model Comparison
Four-model comparison across six dimensions. Tasks Passed is shown out of 22 tasks. Each axis score uses a zero baseline: score = value / axis max * 100.
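The zero-baseline scaling described above can be sketched as follows. This is a minimal illustration, not the page's actual charting code: the function name `axis_score` is hypothetical, the axis max for Tasks Passed is assumed to be 22, and the inputs are the Tasks Passed values listed in the Model Summary table for the models run on all 22 tasks.

```python
def axis_score(value: float, axis_max: float) -> float:
    """Scale a raw value to 0-100 against a zero baseline: value / axis_max * 100."""
    if axis_max <= 0:
        raise ValueError("axis_max must be positive")
    return value / axis_max * 100


# Tasks Passed values from the Model Summary table (models run on all 22 tasks).
tasks_passed = {
    "GPT-5.3 Codex": 19,
    "GPT-5.2 Codex": 17,
    "Claude Opus 4.6": 15,
    "Claude Opus 4.5": 10,
}

# Assumption: the Tasks Passed axis tops out at the 22 available tasks.
AXIS_MAX = 22

scores = {model: round(axis_score(n, AXIS_MAX), 1) for model, n in tasks_passed.items()}

for model, score in scores.items():
    print(f"{model}: {score}")
```

Because the baseline is zero rather than the minimum observed value, a model that passes half the tasks lands at the midpoint of the axis, which keeps the relative gaps between models proportional to the raw numbers.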
Behavior Composition by Model
Each model's bar is normalized to 100%. Color encodes behavior category; each segment corresponds to a percentage of that model's actions and a raw action count.
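The per-model normalization can be sketched as below. The category names and action counts are hypothetical placeholders chosen for illustration; the page's real categories and counts are not reproduced here.

```python
def composition(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw action counts into percentages that sum to 100 for one model."""
    total = sum(counts.values())
    if total == 0:
        # A model with no recorded actions gets an all-zero bar.
        return {category: 0.0 for category in counts}
    return {category: count / total * 100 for category, count in counts.items()}


# Hypothetical action counts for a single model (not taken from the benchmark).
example_counts = {"read": 50, "edit": 30, "test": 20}

shares = composition(example_counts)

for category, pct in shares.items():
    print(f"{category}: {pct:.1f}% ({example_counts[category]} actions)")
```

Normalizing each bar separately makes behavior mixes comparable across models that took very different numbers of actions, at the cost of hiding the absolute totals, which is why the raw counts are kept alongside the percentages.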
Model Summary
Overall performance across all tasks
| Model | Organization | Tasks Passed | Pass Rate | Total Cost | Total Time |
|---|---|---|---|---|---|
| GPT-5.3 Codex | OpenAI | 19/22 | 95.6% | $213.07 | 24.8h |
| GPT-5.2 Codex | OpenAI | 17/22 | 96.4% | $435.72 | 108.6h |
| Claude Opus 4.6 | Anthropic | 15/22 | 90.8% | $2055.81 | 76.4h |
| Claude Opus 4.5 | Anthropic | 10/22 | 81.7% | $507.94 | 26.8h |
| Gemini 3 Flash | Google | 2/6 | 49.8% | $31.61 | 1.5h |
| GLM-4.7 | Zhipu AI | 2/6 | 64.2% | $4.86 | 4.2h |
| Kimi K2.5 | Moonshot AI | 2/6 | 92.0% | N/A | 5.9h |
| DeepSeek V3.2 | DeepSeek | 1/6 | 16.7% | $4.12 | 20.2h |
| Claude Sonnet 4.5 | Anthropic | 0/6 | 76.1% | $40.67 | 1.9h |
| Gemini 3 Pro | Google | 0/6 | 16.5% | N/A | 1.8h |
| Qwen3 Max | Alibaba | 0/6 | 13.9% | $368.37 | 15.5h |
Results by Difficulty
Performance breakdown by task difficulty tier
Easy Tier
| Model | Tasks Passed | Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|
| Claude Opus 4.5 | 6/6 | 100.0% | 0.39h | 1092 | $56.69 |
| Claude Opus 4.6 | 6/6 | 100.0% | 0.45h | 1781 | $48.61 |
| Claude Sonnet 4.5 | 0/6 | 76.1% | 0.32h | 930 | $40.67 |
| DeepSeek V3.2 | 1/6 | 16.7% | 3.4h | 1070 | $4.12 |
| Gemini 3 Flash | 2/6 | 49.8% | 0.25h | 558 | $31.61 |
| Gemini 3 Pro | 0/6 | 16.5% | 0.30h | 710 | N/A |
| GLM-4.7 | 2/6 | 64.2% | 0.70h | 904 | $4.86 |
| GPT-5.2 Codex | 6/6 | 100.0% | 0.81h | 1081 | $33.51 |
| GPT-5.3 Codex | 6/6 | 100.0% | 0.28h | 1305 | $15.00 |
| Kimi K2.5 | 2/6 | 92.0% | 0.99h | 1163 | N/A |
| Qwen3 Max | 0/6 | 13.9% | 2.6h | 850 | $368.37 |
Medium Tier
| Model | Tasks Passed | Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|
| Claude Opus 4.5 | 3/8 | 82.6% | 1.3h | 3304 | $208.43 |
| Claude Opus 4.6 | 5/8 | 93.6% | 3.5h | 4867 | $1183.94 |
| GPT-5.2 Codex | 7/8 | 98.9% | 5.1h | 4702 | $287.17 |
| GPT-5.3 Codex | 8/8 | 100.0% | 1.2h | 2575 | $114.14 |
Hard Tier
| Model | Tasks Passed | Pass Rate | Avg Time | Avg LOC | Cost |
|---|---|---|---|---|---|
| Claude Opus 4.5 | 1/8 | 67.0% | 1.7h | 6603 | $242.82 |
| Claude Opus 4.6 | 4/8 | 81.2% | 5.7h | 10103 | $823.26 |
| GPT-5.2 Codex | 4/8 | 91.2% | 7.8h | 9034 | $115.04 |
| GPT-5.3 Codex | 5/8 | 87.9% | 1.7h | 6255 | $83.94 |
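The tier tables appear to roll up into the Model Summary rows: tasks passed and cost sum across tiers, and total time is consistent with each tier's Avg Time multiplied by its task count. That relationship is inferred from the published numbers rather than stated by the benchmark, so the sketch below is only a consistency check, shown here for GPT-5.3 Codex.

```python
# Per-tier figures for GPT-5.3 Codex, copied from the Easy, Medium, and Hard tables:
# (tasks passed, tasks in tier, average hours per task, cost in USD)
tiers = {
    "easy": (6, 6, 0.28, 15.00),
    "medium": (8, 8, 1.2, 114.14),
    "hard": (5, 8, 1.7, 83.94),
}

total_passed = sum(passed for passed, _, _, _ in tiers.values())
total_tasks = sum(count for _, count, _, _ in tiers.values())
total_cost = sum(cost for _, _, _, cost in tiers.values())

# Assumption: Avg Time is a per-task mean, so tier time = avg time * tasks in tier.
total_time = sum(count * avg_h for _, count, avg_h, _ in tiers.values())

print(f"Tasks passed: {total_passed}/{total_tasks}")
print(f"Total cost:   ${total_cost:.2f}")
print(f"Total time:   {total_time:.2f}h")
```

The recomputed totals match the summary row (19/22, $213.07, 24.8h) up to small differences, which is expected because the per-tier averages and costs are themselves rounded.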