auto-bench: Benchmarking Quantized LLMs for Local Coding Agents Part 2
Edward J. Schwartz, Computer Security Researcher

In Part 1, I introduced auto-bench, a tool for benchmarking quantized LLMs for local coding agents, and shared some results from a preliminary study on a single instance from SWE-bench Verified. The results showed that (1) KL Divergence doesn't predict performance, and (2) quantizations can both outperform and underperform the original model.

In this post, I'll share some new results. Like the Part 1 experiment, this one focuses on Qwen3.5-2B. Unlike that experiment, which made eight attempts at a single instance of SWE-bench Verified, this one makes a single attempt at each of the 500 instances in SWE-bench Verified.

Results

Without further ado, here are the results.

| Quant | Resolved | % | PPL | KL | Runtime | Exceptions |
|---|---|---|---|---|---|---|
| BF16 | 28/500 | 5.6% | 13.38 | n/a | 547m 59s | Timeout(47), ExitCode(22), Verifier(2) |
| IQ4_NL | 26/500 | 5.2% | 13.67 | 0.0309 | 423m 21s | Timeout(29), ExitCode(16), Verifier(1) |
| IQ4_XS | 24/500 | 4.8% | 13.68 | 0.0318 | 493m 7s | Timeout(39), ExitCode(15), Reward(1), Verifier(2) |
| Q3_K_M | 30/500 | 6.0% | 14.33 | 0.0774 | 785m 43s | Timeout(67), ExitCode(21) |
| Q3_K_S | 20/500 | 4.0% | 15.08 | 0.1334 | 742m 25s | Timeout(73), ExitCode(25), Reward(1), Verifier(1) |
| Q4_0 | 24/500 | 4.8% | 13.91 | 0.0454 | 407m 34s | Timeout(25), ExitCode(21), Verifier(1) |
| Q4_1 | 36/500 | 7.2% | 13.68 | 0.0273 | 766m 19s | Timeout(59), ExitCode(16), Reward(1), Verifier(1) |
| Q4_K_M | 27/500 | 5.4% | 13.79 | 0.0230 | 357m 0s | Timeout(20), ExitCode(23), Verifier(2) |
| Q4_K_S | 19/500 | 3.8% | 13.78 | 0.0274 | 519m 39s | Timeout(38), ExitCode(23), Reward(1), Verifier(1) |
| Q5_K_M | 62/500 | 12.4% | 13.46 | 0.0082 | 784m 58s | Timeout(61), ExitCode(23), Reward(1), Verifier(1) |
| Q5_K_S | 46/500 | 9.2% | 13.49 | 0.0100 | 563m 27s | Timeout(30), ExitCode(25), Reward(1), Verifier(3) |
| Q6_K | 58/500 | 11.6% | 13.48 | 0.0035 | 820m 37s | Timeout(62), ExitCode(20), Verifier(1) |
| Q8_0 | 37/500 | 7.4% | 13.39 | 0.0012 | 598m 13s | Timeout(46), ExitCode(17), Verifier(1) |
| UD-IQ2_M | 1/500 | 0.2% | 17.61 | 0.2677 | 1866m 13s | Timeout(300), ExitCode(24), Verifier(1) |
| UD-IQ2_XXS | 1/500 | 0.2% | 27.11 | 0.7018 | 2196m 49s | Timeout(371), ExitCode(19) |
| UD-IQ3_XXS | 5/500 | 1.0% | 15.31 | 0.1549 | 1481m 24s | Timeout(230), ExitCode(24), Verifier(1) |
| UD-Q2_K_XL | 2/500 | 0.4% | 17.15 | 0.2388 | 442m 51s | Timeout(29), ExitCode(27) |
| UD-Q3_K_XL | 48/500 | 9.6% | 13.94 | 0.0520 | 738m 44s | Timeout(57), ExitCode(19), Reward(1), Verifier(1) |
| UD-Q4_K_XL | 57/500 | 11.4% | 13.60 | 0.0164 | 759m 13s | Timeout(48), ExitCode(18), Reward(2), Verifier(2) |
| UD-Q5_K_XL | 62/500 | 12.4% | 13.51 | 0.0077 | 932m 37s | Timeout(71), ExitCode(18), Reward(2), Verifier(1) |
| UD-Q6_K_XL | 29/500 | 5.8% | 13.48 | 0.0020 | 539m 38s | Timeout(35), ExitCode(20), Verifier(1) |
| UD-Q8_K_XL | 36/500 | 7.2% | 13.37 | 0.0011 | 502m 1s | Setup(1), Timeout(35), ExitCode(23), Verifier(2) |
| vllm | 26/500 | 5.2% | n/a | n/a | 521m 14s | Timeout(49), ExitCode(11), Verifier(1) |

And the plot of % Resolved vs KL Divergence:

[Figure: Percentage of instances resolved vs. KL Divergence]

Observations

A few observations:

  1. The overall resolve rates are low across the board, with no configuration exceeding 12.4%. This is not a very powerful model, and the single instance used in Part 1 was one I intentionally selected because it was easy.
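  2. As in Part 1, many of the "mid-range" quantizations outperform the original model: Q5_K_M and UD-Q5_K_XL each resolve 62 instances (12.4%), against 28 (5.6%) for BF16. The smallest quantizations (the 2-bit variants and UD-IQ3_XXS) resolve almost nothing, and the largest (Q8_0, UD-Q6_K_XL, UD-Q8_K_XL) fall well short of the mid-range peak, landing closer to BF16. This is consistent with the idea that some quantizations are actually beneficial, while others are harmful.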
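  3. Also as in Part 1, KL Divergence does not fully explain performance. For example, UD-Q6_K_XL has one of the lowest KL values (0.0020) but resolves only 29 instances, while UD-Q3_K_XL has a much higher KL (0.0520) and resolves 48.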
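
One way to put a number on the last two observations is a rank correlation between KL Divergence and the resolved count. The sketch below is not part of auto-bench. It is a minimal pure-Python illustration using values copied from the table, and the helper names (`average_ranks`, `pearson`, `spearman`) are made up for this example. BF16 and vllm are left out of the correlation because the table reports no KL value for them.

```python
# Values copied from the results table: quant -> (resolved out of 500, KL divergence).
# BF16 and vllm are omitted here because the table has no KL value for them.
RESULTS = {
    "IQ4_NL": (26, 0.0309), "IQ4_XS": (24, 0.0318),
    "Q3_K_M": (30, 0.0774), "Q3_K_S": (20, 0.1334),
    "Q4_0": (24, 0.0454), "Q4_1": (36, 0.0273),
    "Q4_K_M": (27, 0.0230), "Q4_K_S": (19, 0.0274),
    "Q5_K_M": (62, 0.0082), "Q5_K_S": (46, 0.0100),
    "Q6_K": (58, 0.0035), "Q8_0": (37, 0.0012),
    "UD-IQ2_M": (1, 0.2677), "UD-IQ2_XXS": (1, 0.7018),
    "UD-IQ3_XXS": (5, 0.1549), "UD-Q2_K_XL": (2, 0.2388),
    "UD-Q3_K_XL": (48, 0.0520), "UD-Q4_K_XL": (57, 0.0164),
    "UD-Q5_K_XL": (62, 0.0077), "UD-Q6_K_XL": (29, 0.0020),
    "UD-Q8_K_XL": (36, 0.0011),
}
BF16_RESOLVED = 28  # baseline row from the same table


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


resolved = [r for r, _ in RESULTS.values()]
kl = [k for _, k in RESULTS.values()]
rho = spearman(kl, resolved)
beats_baseline = sorted(q for q, (r, _) in RESULTS.items() if r > BF16_RESOLVED)

print(f"Spearman rho (KL vs. resolved): {rho:.2f}")
print("Quantizations above BF16:", ", ".join(beats_baseline))
```

A value near -1 would mean that KL ranks the quantizations almost perfectly, and a value near 0 would mean it says little about the ordering. With one attempt per instance, small differences in resolved counts may be noise, so the list of quantizations above BF16 is best read as a rough guide.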

What next?

This experiment largely confirmed the findings from Part 1 about the Qwen3.5-2B model. An open question is whether these results apply to other models as well. I plan to run similar experiments on larger variants of the Qwen3.5 family next, but I won't be evaluating every quantization. Too much time is wasted on bad quantizations because they get stuck in endless loops. Instead, I'll probably try a select few quantizations, such as BF16, Q8_0, and Q5_K_M. Although I am interested in understanding these peculiar behaviors, my primary goal is actually to find which models and quantizations are usable.
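
The cost of the bad quantizations shows up in the Exceptions column. As a rough sketch, the snippet below parses a few of those cells (copied from the table) and reports the share of instances that hit a timeout. Treating a timeout as a proxy for a run stuck in a loop is an assumption, since the table records only that the run timed out. The function names are hypothetical and not part of auto-bench.

```python
import re

# Exceptions cells copied from a few rows of the results table.
EXCEPTIONS = {
    "BF16": "Timeout(47), ExitCode(22), Verifier(2)",
    "Q5_K_M": "Timeout(61), ExitCode(23), Reward(1), Verifier(1)",
    "UD-IQ3_XXS": "Timeout(230), ExitCode(24), Verifier(1)",
    "UD-IQ2_M": "Timeout(300), ExitCode(24), Verifier(1)",
    "UD-IQ2_XXS": "Timeout(371), ExitCode(19)",
}
TOTAL_INSTANCES = 500  # one attempt per SWE-bench Verified instance


def parse_exceptions(cell):
    """Turn 'Timeout(47), ExitCode(22)' into {'Timeout': 47, 'ExitCode': 22}."""
    return {name: int(count) for name, count in re.findall(r"(\w+)\((\d+)\)", cell)}


def timeout_rate(cell, total=TOTAL_INSTANCES):
    """Fraction of instances that ended in a timeout (0.0 if none are listed)."""
    return parse_exceptions(cell).get("Timeout", 0) / total


for quant, cell in EXCEPTIONS.items():
    print(f"{quant:<12} {timeout_rate(cell):6.1%} of instances timed out")
```

A cutoff on this rate could be one way to decide which quantizations to drop from future runs, though any specific threshold would be a judgment call.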
