Grok 4.20 0309 v2 (Non-reasoning) vs Qwen3 4B 2507 (Reasoning)

xAI vs Alibaba — side-by-side benchmark comparison

Metric	Grok 4.20 0309 v2 (Non-reasoning)	Qwen3 4B 2507 (Reasoning)
Intelligence Index	29.0	18.2
Coding Index	22.0	9.5
Math Index	—	82.7
Output speed (tokens/s)	175.2	—
Blended price ($ per 1M tokens)	$3.00	—
Time to first token (s)	0.47	—
AIME	—	—
AIME 2025	—	82.7%
GPQA Diamond	77.6%	66.7%
Humanity's Last Exam (HLE)	24.2%	5.9%
IFBench	49.3%	49.8%
AA-LCR (long-context reasoning)	17.3%	37.7%
LiveCodeBench	—	64.1%
MATH-500	—	—
MMLU-Pro	—	74.3%
SciCode	32.8%	25.6%
τ²-Bench	59.9%	25.4%
Terminal-Bench Hard	16.7%	1.5%

Benchmark data from Artificial Analysis.
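One way to read the per-benchmark rows is to count which model leads on the benchmarks that both report. The sketch below is a hypothetical helper (`SCORES` and `head_to_head` are names introduced here, not part of Artificial Analysis's data or any API). It copies the percentages from the table, treats "—" as missing, skips any benchmark reported for only one model, and weights every remaining benchmark equally, which is an assumption of this sketch rather than how the composite indices are built.

```python
"""Head-to-head tally over benchmarks both models report (hypothetical helper)."""
from typing import Dict, Optional, Tuple

# Scores in percent, copied from the comparison table; None stands for "—".
# Each pair is (Grok 4.20 0309 v2, Qwen3 4B 2507).
SCORES: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "GPQA Diamond": (77.6, 66.7),
    "HLE": (24.2, 5.9),
    "IFBench": (49.3, 49.8),
    "AA-LCR": (17.3, 37.7),
    "SciCode": (32.8, 25.6),
    "tau2-Bench": (59.9, 25.4),
    "Terminal-Bench Hard": (16.7, 1.5),
    "AIME 2025": (None, 82.7),
    "LiveCodeBench": (None, 64.1),
    "MMLU-Pro": (None, 74.3),
}


def head_to_head(
    scores: Dict[str, Tuple[Optional[float], Optional[float]]],
) -> Tuple[int, int, Dict[str, float]]:
    """Return (grok_wins, qwen_wins, gaps); a positive gap favours Grok."""
    grok_wins, qwen_wins = 0, 0
    gaps: Dict[str, float] = {}
    for name, (grok, qwen) in scores.items():
        if grok is None or qwen is None:
            continue  # only one model has a score, so there is nothing to compare
        gap = round(grok - qwen, 1)  # difference in percentage points
        gaps[name] = gap
        if gap > 0:
            grok_wins += 1
        elif gap < 0:
            qwen_wins += 1
        # an exact tie counts for neither model
    return grok_wins, qwen_wins, gaps


if __name__ == "__main__":
    grok_wins, qwen_wins, gaps = head_to_head(SCORES)
    print(f"Grok leads on {grok_wins}, Qwen on {qwen_wins} of {len(gaps)} shared benchmarks")
    # Largest absolute gaps first.
    for name, gap in sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True):
        print(f"{name:>20}: {gap:+.1f} points")
```

On the seven shared benchmarks this tally gives Grok the lead on five and Qwen on two (IFBench and AA-LCR). Because it drops the benchmarks reported for only one model, it complements the Intelligence and Coding Index rows rather than replacing them.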