In Part 1 and Part 2, I reported some peculiar behavior for quantized Qwen3.5-2B models.
This post shares early results for a larger variant: Qwen3.5-35B-A3B.
vllm currently leads this sweep at 46.0% resolved. BF16 and vllm show many more timeouts than the quantized GGUF variants.

| Run | Resolved | % | PPL | KL | Runtime | Exceptions |
|---|---|---|---|---|---|---|
| BF16 | 181/500 | 36.2% | 6.62 | — | 2103m 41s | Timeout(275), ExitCode(17), Reward(3), Verifier(1) |
| Q5_K_M | 194/500 | 38.8% | 6.62 | 0.0083 | 531m 46s | Timeout(8), ExitCode(23), Reward(3), Verifier(1) |
| Q8_0 | 202/500 | 40.4% | 6.61 | 0.0068 | 575m 45s | Timeout(11), ExitCode(24), Reward(1), Verifier(1) |
| vllm | 230/500 | 46.0% | — | — | 1634m 13s | Timeout(145), ExitCode(22), Reward(6) |
BF16 and vllm both have substantial timeout counts. At this point, it is unclear whether these are true long-horizon failures or loop-like behaviors similar to what I saw with weaker Qwen3.5-2B quantizations. Either way, timeout behavior appears to be a key driver of aggregate score differences.
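To get a rough sense of how much the timeouts weigh on the aggregate scores, one can recompute the resolved rate over only the instances that did not time out. The sketch below uses the counts from the table above and assumes, which the table does not confirm, that every timed-out instance counts as unresolved. If some timed-out instances were actually resolved, the adjusted figure overestimates.

```python
# Counts copied from the results table above; each run covers 500 instances.
RUNS = {
    "BF16":   {"resolved": 181, "timeouts": 275},
    "Q5_K_M": {"resolved": 194, "timeouts": 8},
    "Q8_0":   {"resolved": 202, "timeouts": 11},
    "vllm":   {"resolved": 230, "timeouts": 145},
}
TOTAL = 500


def rate_excluding_timeouts(resolved, timeouts, total=TOTAL):
    """Resolved fraction among instances that did not time out.

    Assumption: a timed-out instance is never counted as resolved.
    """
    finished = total - timeouts
    return resolved / finished


for name, run in RUNS.items():
    overall = run["resolved"] / TOTAL
    adjusted = rate_excluding_timeouts(run["resolved"], run["timeouts"])
    print(f"{name:7s} overall={overall:.1%} excluding_timeouts={adjusted:.1%}")
```

This is only a back-of-the-envelope view; it says nothing about whether the timeouts themselves are legitimate.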
I still do not have enough data to draw firm conclusions about KL divergence trends for this model size. However, I do not observe the same pattern as Qwen3.5-2B. Here, the smaller GGUF quantizations (Q5_K_M, Q8_0) outperform GGUF BF16, which is surprising and may be largely explained by fewer timeouts.
There is a notable gap between GGUF BF16 results and the original model results. That is surprising because the original checkpoint is mostly BF16 with a small number of F32 tensors. I checked the GGUF contents and confirmed F32 tensors are present:
```text
vscode ➜ /workspaces/auto-bench (main) $ gguf-dump /home/vscode/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/BF16/Qwen3.5-35B-A3B-BF16-00002-of-00002.gguf | grep "F32" | head -20
INFO:gguf-dump:* Loading: /home/vscode/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/BF16/Qwen3.5-35B-A3B-BF16-00002-of-00002.gguf
142: 2048 | 2048, 1, 1, 1 | F32 | blk.0.attn_norm.weight
143: 32 | 32, 1, 1, 1 | F32 | blk.0.ssm_a
144: 32768 | 4, 8192, 1, 1 | F32 | blk.0.ssm_conv1d.weight
145: 32 | 32, 1, 1, 1 | F32 | blk.0.ssm_dt.bias
148: 128 | 128, 1, 1, 1 | F32 | blk.0.ssm_norm.weight
149: 524288 | 2048, 256, 1, 1 | F32 | blk.0.ffn_gate_inp.weight
153: 2048 | 2048, 1, 1, 1 | F32 | blk.0.ffn_gate_inp_shexp.weight
154: 2048 | 2048, 1, 1, 1 | F32 | blk.0.post_attention_norm.weight
155: 2048 | 2048, 1, 1, 1 | F32 | blk.1.attn_norm.weight
156: 32 | 32, 1, 1, 1 | F32 | blk.1.ssm_a
157: 32768 | 4, 8192, 1, 1 | F32 | blk.1.ssm_conv1d.weight
158: 32 | 32, 1, 1, 1 | F32 | blk.1.ssm_dt.bias
161: 128 | 128, 1, 1, 1 | F32 | blk.1.ssm_norm.weight
162: 524288 | 2048, 256, 1, 1 | F32 | blk.1.ffn_gate_inp.weight
166: 2048 | 2048, 1, 1, 1 | F32 | blk.1.ffn_gate_inp_shexp.weight
167: 2048 | 2048, 1, 1, 1 | F32 | blk.1.post_attention_norm.weight
168: 2048 | 2048, 1, 1, 1 | F32 | blk.10.attn_norm.weight
169: 32 | 32, 1, 1, 1 | F32 | blk.10.ssm_a
170: 32768 | 4, 8192, 1, 1 | F32 | blk.10.ssm_conv1d.weight
171: 32 | 32, 1, 1, 1 | F32 | blk.10.ssm_dt.bias
```

In fact, some tensors appear upcast to F32 in the GGUF file, so raw dtype alone does not explain the performance gap.
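Because `grep "F32" | head -20` only shows the first twenty matches, it can help to tally the whole dump by dtype. The following is a minimal sketch that parses the text printed by gguf-dump, assuming tensor rows keep the `index: elements | shape | dtype | name` layout seen in the output above. The function name `dtype_summary` is hypothetical, and the sample rows are copied from that output.

```python
import re
from collections import Counter

# Matches tensor rows such as:
# "142: 2048 | 2048, 1, 1, 1 | F32 | blk.0.attn_norm.weight"
ROW = re.compile(r"^\s*\d+:\s*(\d+)\s*\|[^|]*\|\s*(\w+)\s*\|\s*(\S+)\s*$")


def dtype_summary(dump_text):
    """Return {dtype: (tensor_count, element_count)} for tensor rows."""
    counts = Counter()
    elements = Counter()
    for line in dump_text.splitlines():
        match = ROW.match(line)
        if match:
            n_elems, dtype, _name = match.groups()
            counts[dtype] += 1
            elements[dtype] += int(n_elems)
    return {dtype: (counts[dtype], elements[dtype]) for dtype in counts}


sample = """\
142: 2048 | 2048, 1, 1, 1 | F32 | blk.0.attn_norm.weight
143: 32 | 32, 1, 1, 1 | F32 | blk.0.ssm_a
149: 524288 | 2048, 256, 1, 1 | F32 | blk.0.ffn_gate_inp.weight
"""
print(dtype_summary(sample))
```

Feeding it the full, unfiltered dump of each shard would show what share of the elements is stored as F32 versus BF16.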
According to Qwen's benchmark page, Qwen3.5-35B-A3B achieves 69.2% on SWE-bench Verified. That is substantially higher than the 46.0% best result in this sweep.
Interestingly, on the Qwen3-Coder-Flash model page, Qwen reports 51.6% on SWE-bench Verified using OpenHands, the same agent framework I am using.
There is also active community discussion about reproducibility for these numbers here. I have not yet found detailed methodology documentation for Qwen3.5's SWE-bench Verified evaluation setup, which may explain part of the discrepancy.
I think the critical question right now is whether the timeouts are legitimate. After reviewing the logs, I think that they may not be. So I'm going to re-run with fewer agents in parallel.
In Part 1, I introduced auto-bench, a tool for benchmarking quantized LLMs for local coding agents, and shared some results from a preliminary study on a single instance from SWE-bench Verified. The results showed that (1) KL Divergence doesn't predict performance, and (2) quantizations can both outperform and underperform the original model.
In this post, I'll share some new results. Like the other experiment, this one also focuses on Qwen3.5-2B. Unlike the other experiment, which tested a single instance of SWE-bench Verified with eight attempts, this experiment tests all instances of SWE-bench Verified with one attempt.
Without further ado, here are the results.
| Quant | Resolved | % | PPL | KL | Runtime | Exceptions |
|---|---|---|---|---|---|---|
| BF16 | 28/500 | 5.6% | 13.38 | — | 547m 59s | Timeout(47), ExitCode(22), Verifier(2) |
| IQ4_NL | 26/500 | 5.2% | 13.67 | 0.0309 | 423m 21s | Timeout(29), ExitCode(16), Verifier(1) |
| IQ4_XS | 24/500 | 4.8% | 13.68 | 0.0318 | 493m 7s | Timeout(39), ExitCode(15), Reward(1), Verifier(2) |
| Q3_K_M | 30/500 | 6.0% | 14.33 | 0.0774 | 785m 43s | Timeout(67), ExitCode(21) |
| Q3_K_S | 20/500 | 4.0% | 15.08 | 0.1334 | 742m 25s | Timeout(73), ExitCode(25), Reward(1), Verifier(1) |
| Q4_0 | 24/500 | 4.8% | 13.91 | 0.0454 | 407m 34s | Timeout(25), ExitCode(21), Verifier(1) |
| Q4_1 | 36/500 | 7.2% | 13.68 | 0.0273 | 766m 19s | Timeout(59), ExitCode(16), Reward(1), Verifier(1) |
| Q4_K_M | 27/500 | 5.4% | 13.79 | 0.0230 | 357m 0s | Timeout(20), ExitCode(23), Verifier(2) |
| Q4_K_S | 19/500 | 3.8% | 13.78 | 0.0274 | 519m 39s | Timeout(38), ExitCode(23), Reward(1), Verifier(1) |
| Q5_K_M | 62/500 | 12.4% | 13.46 | 0.0082 | 784m 58s | Timeout(61), ExitCode(23), Reward(1), Verifier(1) |
| Q5_K_S | 46/500 | 9.2% | 13.49 | 0.0100 | 563m 27s | Timeout(30), ExitCode(25), Reward(1), Verifier(3) |
| Q6_K | 58/500 | 11.6% | 13.48 | 0.0035 | 820m 37s | Timeout(62), ExitCode(20), Verifier(1) |
| Q8_0 | 37/500 | 7.4% | 13.39 | 0.0012 | 598m 13s | Timeout(46), ExitCode(17), Verifier(1) |
| UD-IQ2_M | 1/500 | 0.2% | 17.61 | 0.2677 | 1866m 13s | Timeout(300), ExitCode(24), Verifier(1) |
| UD-IQ2_XXS | 1/500 | 0.2% | 27.11 | 0.7018 | 2196m 49s | Timeout(371), ExitCode(19) |
| UD-IQ3_XXS | 5/500 | 1.0% | 15.31 | 0.1549 | 1481m 24s | Timeout(230), ExitCode(24), Verifier(1) |
| UD-Q2_K_XL | 2/500 | 0.4% | 17.15 | 0.2388 | 442m 51s | Timeout(29), ExitCode(27) |
| UD-Q3_K_XL | 48/500 | 9.6% | 13.94 | 0.0520 | 738m 44s | Timeout(57), ExitCode(19), Reward(1), Verifier(1) |
| UD-Q4_K_XL | 57/500 | 11.4% | 13.60 | 0.0164 | 759m 13s | Timeout(48), ExitCode(18), Reward(2), Verifier(2) |
| UD-Q5_K_XL | 62/500 | 12.4% | 13.51 | 0.0077 | 932m 37s | Timeout(71), ExitCode(18), Reward(2), Verifier(1) |
| UD-Q6_K_XL | 29/500 | 5.8% | 13.48 | 0.0020 | 539m 38s | Timeout(35), ExitCode(20), Verifier(1) |
| UD-Q8_K_XL | 36/500 | 7.2% | 13.37 | 0.0011 | 502m 1s | Setup(1), Timeout(35), ExitCode(23), Verifier(2) |
| vllm | 26/500 | 5.2% | — | — | 521m 14s | Timeout(49), ExitCode(11), Verifier(1) |
And the plot of % Resolved vs KL Divergence:
A few observations:
This experiment largely confirmed the findings from Part 1 about the Qwen3.5-2B model. An open question is whether these results apply to other models as well. I plan to run similar experiments on larger variants of the Qwen3.5 family next, but I won't be evaluating every quantization. Too much time is wasted on bad quantizations because they get stuck in endless loops. Instead, I'll probably try a select few quantizations, such as BF16, Q8_0, and Q5_K_M. Although I am interested in understanding these peculiar behaviors, my primary goal is actually to find which models and quantizations are usable.
New open LLMs are released constantly and keep improving. Gemma 4 was recently claimed to be groundbreaking, but when I tried the quantized version with coding agents like opencode, it was completely unusable: it got stuck in output loops or failed to call tools with correct syntax. This is the norm, not an exception. Most quantized open LLMs I try for agentic AI simply don't work. Finding a setup that does requires trial and error across model, quantization, and dozens of settings.
I just want a command I can run to get a working LLM for my GPU. No hours of experimentation. No guessing at combinations. Just a proven setup. I couldn't find one, so I built auto-bench.
Auto-bench is a tool that allows you to define experiments, automatically run LLM inference servers with the proper settings, and execute a set of benchmarks against them. Rather than reinventing the wheel, I'm currently using Harbor Framework to run the tests. Auto-bench has first-class support for quantized models. This is important, because most existing benchmarks and leaderboards don't consider quantization, even though that is how many people run models.
My project is in its earliest stages, but I have at least one experiment to share: testing various quantizations of the Qwen3.5-2B model on a single problem instance from SWE-bench Verified (swe-bench/sympy__sympy-22914). I deliberately selected an easy instance to see if quantized models can perform basic tool calls to solve an easy problem.
Here is how this experiment is configured in auto-bench:
```yaml
# Benchmark 22 quants of Qwen3.5-2B on a single SWE-bench instance
# Usage: auto-bench run configs/qwen-2b-quant-sweep.yaml
name: qwen-2b-quant-sweep
backend_type: llamacpp
dataset: SWE-bench/SWE-bench_Verified
instance_ids:
  - swe-bench/sympy__sympy-22914
model:
  name: Qwen3.5-2B
  source: huggingface
  repo_id: unsloth/Qwen3.5-2B-GGUF
sweep:
  - label: BF16
    filename: Qwen3.5-2B-BF16.gguf
  - label: Q3_K_S
    filename: Qwen3.5-2B-Q3_K_S.gguf
  - label: Q5_K_M
    filename: Qwen3.5-2B-Q5_K_M.gguf
  - label: Q5_K_S
    filename: Qwen3.5-2B-Q5_K_S.gguf
  - label: Q6_K
    filename: Qwen3.5-2B-Q6_K.gguf
  # ... 17 more quantizations
sampling:
  temperature: 0.7
  top_p: 0.8
  top_k: 20
  min_p: 0.0
  presence_penalty: 1.5
  repetition_penalty: 1.0
agent:
  agent: openhands
  env: docker
  attempts: 4
  limit: 1
  setup_multiplier: 10.0
evaluation:
  run_evaluation: true
```

Part of my goal is to include all information needed to actually run the models properly. For example, the sampling section includes the sampling parameters that Qwen recommends for best performance, and these kinds of details can have a huge effect! My vision is to eventually have a leaderboard that provides a llama.cpp command line for running the model with the proper settings, so you can just copy and paste that command to get a working LLM for your coding agent.
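As a rough illustration of that idea, the sampling values from the config could be turned into a command line along these lines. This is a sketch, not how auto-bench does it: the helper `llama_server_command` is hypothetical, and the mapping from config keys to `llama-server` flags is my assumption, so check it against `llama-server --help` for your build.

```python
import shlex

# Sampling values copied from the config above.
SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
}

# Assumed mapping from config keys to llama-server flags; verify with
# `llama-server --help` before relying on it.
FLAG_FOR = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "presence_penalty": "--presence-penalty",
    "repetition_penalty": "--repeat-penalty",
}


def llama_server_command(model_path, sampling):
    """Build an argument list for llama-server from a sampling dict."""
    args = ["llama-server", "-m", model_path]
    for key, value in sampling.items():
        args += [FLAG_FOR[key], str(value)]
    return args


cmd = llama_server_command("Qwen3.5-2B-Q5_K_M.gguf", SAMPLING)
print(shlex.join(cmd))
```

Keeping the mapping in one table makes it easy to correct a flag name if a llama.cpp release renames it.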
Before diving into the data, here's what the columns mean:
| Quant | Resolved | % | PPL | KL | Runtime | Exceptions |
|---|---|---|---|---|---|---|
| BF16 | 0/8 | 0% | 13.38 | — | 3m 37s | — |
| IQ4_NL | 1/8 | 12.5% | 13.67 | 0.0309 | 3m 53s | — |
| IQ4_XS | 0/8 | 0% | 13.68 | 0.0318 | 51m 53s | Timeout, ExitCode |
| Q3_K_M | 2/8 | 25% | 14.33 | 0.0774 | 51m 48s | Timeout |
| Q3_K_S | 5/8 | 62.5% | 15.08 | 0.1334 | 51m 39s | Timeout |
| Q4_0 | 4/8 | 50% | 13.91 | 0.0454 | 19m 5s | — |
| Q4_1 | 3/8 | 37.5% | 13.68 | 0.0273 | 11m 14s | — |
| Q4_K_M | 1/8 | 12.5% | 13.79 | 0.0230 | 4m 8s | ExitCode |
| Q4_K_S | 2/8 | 25% | 13.78 | 0.0274 | 4m 43s | — |
| Q5_K_M | 8/8 | 100% | 13.46 | 0.0082 | 51m 48s | Timeout |
| Q5_K_S | 6/8 | 75% | 13.49 | 0.0100 | 8m 53s | — |
| Q6_K | 7/8 | 87.5% | 13.48 | 0.0035 | 51m 54s | Timeout |
| Q8_0 | 4/8 | 50% | 13.39 | 0.0012 | 51m 49s | Timeout |
| UD-IQ2_M | 0/8 | 0% | 17.61 | 0.2677 | 51m 56s | Timeout(5) |
| UD-IQ2_XXS | 0/8 | 0% | 27.11 | 0.7018 | 51m 58s | Timeout(6), ExitCode |
| UD-IQ3_XXS | 0/8 | 0% | 15.31 | 0.1549 | 51m 49s | Timeout |
| UD-Q2_K_XL | 0/8 | 0% | 17.15 | 0.2388 | 51m 49s | Timeout(2) |
| UD-Q3_K_XL | 6/8 | 75% | 13.94 | 0.0520 | 51m 56s | Timeout(2) |
| UD-Q4_K_XL | 6/8 | 75% | 13.60 | 0.0164 | 7m 40s | — |
| UD-Q5_K_XL | 8/8 | 100% | 13.51 | 0.0077 | 18m 2s | — |
| UD-Q6_K_XL | 3/8 | 37.5% | 13.48 | 0.0020 | 6m 51s | — |
| UD-Q8_K_XL | 3/8 | 37.5% | 13.37 | 0.0011 | 5m 49s | — |
| vllm | 1/8 | 12.5% | — | — | 6m 32s | — |
The most striking finding is that the unquantized base model (shown as vllm in the results) achieves only 12.5% resolution—worse than most quantized versions. BF16, which is nearly the original model without quantization, also consistently fails at 0%. This suggests the base model is fundamentally broken for this coding task, but quantization somehow fixes it.
Many medium-sized quantizations (Q5_K_M, UD-Q5_K_XL, Q6_K) achieve 100%, 100%, and 87.5% resolution respectively. Yet larger quantizations like Q8_0 fail again at 50%. This isn't about bigger being better—it's about finding the quantization that repairs the base model's broken reasoning.
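With only eight attempts per quantization, each percentage carries a lot of sampling noise. One standard way to see this is a Wilson score interval for a binomial proportion. The sketch below applies it to a few rows from the table, assuming the eight attempts can be treated as independent trials, which the experiment does not guarantee.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Resolved counts out of 8 attempts, copied from the table above.
for label, resolved in [("BF16", 0), ("vllm", 1), ("Q8_0", 4), ("Q6_K", 7), ("Q5_K_M", 8)]:
    lo, hi = wilson_interval(resolved, 8)
    print(f"{label:7s} {resolved}/8 -> [{lo:.2f}, {hi:.2f}]")
```

The intervals at n = 8 are wide, so neighboring rows in the table may not be distinguishable from each other.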
The KL (Kullback-Leibler) divergence column measures how much a quantized model's output distribution diverges from the original. If the base model is broken for this task, then staying close to the original (low divergence) just means inheriting the same brokenness. That could explain why there's no strong correlation between divergence and success.
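For readers who want the definition in concrete terms, here is a minimal sketch of KL divergence between two next-token distributions, averaged over token positions. I am assuming the KL column is a per-token mean of this kind against the reference model; the toy probabilities are made up for illustration and are not taken from any run.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def mean_token_kl(ref_dists, quant_dists):
    """Average KL over token positions (reference vs. quantized model)."""
    kls = [kl_divergence(p, q) for p, q in zip(ref_dists, quant_dists)]
    return sum(kls) / len(kls)


# Toy 3-token vocabulary at two positions; values are illustrative only.
ref = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
quant = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
print(mean_token_kl(ref, quant))
```

A value of zero means the two models assign identical probabilities at every position; larger values mean the quantized model's predictions have drifted further from the reference.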
Q5_K_M achieves 100% resolution with low divergence (0.0082)—but so does UD-Q5_K_XL with similar low divergence. Meanwhile, BF16 (essentially 0 divergence, nearly the original) fails completely at 0%. Some high-divergence models like UD-IQ2_XXS fail too, but others like Q3_K_S achieve 62.5% with divergence of 0.1334.
The plot tells the story: failing models (0 resolved) scatter across the entire divergence range—some very close to the original, some far away. If you stayed loyal to a broken base model, you'd fail. If you accidentally diverged in the right way, you'd succeed. KL divergence alone can't tell you which happened.
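To put a number on how the two columns move together, one option is a Spearman rank correlation between KL divergence and the resolved count. The sketch below implements it in plain Python over the 21 rows of the table that have a KL value; the choice of Spearman rather than another statistic is mine, not something used in the original analysis.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# (KL divergence, resolved out of 8), copied from the table above.
ROWS = [
    (0.0309, 1), (0.0318, 0), (0.0774, 2), (0.1334, 5), (0.0454, 4),
    (0.0273, 3), (0.0230, 1), (0.0274, 2), (0.0082, 8), (0.0100, 6),
    (0.0035, 7), (0.0012, 4), (0.2677, 0), (0.7018, 0), (0.1549, 0),
    (0.2388, 0), (0.0520, 6), (0.0164, 6), (0.0077, 8), (0.0020, 3),
    (0.0011, 3),
]
kl_values = [kl for kl, _ in ROWS]
resolved = [r for _, r in ROWS]
print(f"Spearman rho = {spearman(kl_values, resolved):.3f}")
```

A single coefficient cannot capture the non-monotonic shape described above, where both very low and very high divergence include failing models, so it is best read alongside the plot rather than instead of it.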
There are two distinct failure modes visible in the results.
Many quantizations exhibit infinite looping behavior, where the agent gets stuck generating the same outputs repeatedly and eventually hits the timeout limit. UD-IQ2_M and UD-IQ2_XXS show multiple AgentTimeoutError instances (5 and 6 respectively), and IQ4_XS and UD-IQ3_XXS also time out. This failure mode is most frequent under extremely aggressive quantization (the IQ2 variants, with KL divergence above 0.26), although timeouts also appear for milder quantizations such as Q5_K_M, Q6_K, and Q8_0.
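One way such loops could be caught before the timeout is a check for a repeated block at the end of the generated sequence. The helper below is a hypothetical sketch, not part of auto-bench, Harbor, or OpenHands, and the thresholds are arbitrary defaults chosen for illustration.

```python
def has_repeating_tail(tokens, min_period=1, max_period=50, min_repeats=4):
    """Return True if the sequence ends with one block repeated min_repeats times.

    tokens can be any list (token ids, words, or whole tool-call strings).
    """
    n = len(tokens)
    for period in range(min_period, max_period + 1):
        if period * min_repeats > n:
            break  # not enough history for this or any longer period
        block = tokens[n - period:]
        if all(
            tokens[n - (r + 1) * period: n - r * period] == block
            for r in range(min_repeats)
        ):
            return True
    return False


print(has_repeating_tail(list("abcabcabcabc")))  # block "abc" repeated 4 times
print(has_repeating_tail(list("abcdefgh")))      # no repetition
```

A check like this only sees exact repetition; loops where the output varies slightly on each pass would need a fuzzier comparison.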
The second failure mode is when the agent runs to completion without timing out, but simply fails to correctly solve the problem. BF16 is the clearest example: it records no timeouts, yet it achieves 0% resolution. UD-IQ3_XXS, which records only a single timeout, also fails on its remaining attempts. This suggests that the model's reasoning ability falls below a critical threshold where it can't effectively reason about code, even if it's still generating syntactically valid tool calls.
The core finding is that the unquantized base model (12.5% resolution) and near-original BF16 (0% resolution) both fail for this coding task. Yet specific quantizations like Q5_K_M and UD-Q5_K_XL achieve 100%. Quantization isn't degrading a working model—it's repairing a broken one.
Notice in the visualization: models that fail (0 resolved) scatter across the entire KL divergence range, from very close to the original all the way to extremely divergent. Models that succeed tend to cluster at low divergence. But the scatter on the left side proves you can't predict failure from divergence—some quantizations stay very close to the original yet still fail.
The lesson is that model quality for coding agents is challenging to predict. This is why auto-bench exists—to empirically measure what actually works for your specific use case.
This experiment demonstrates an interesting phenomenon, but it's based on a single problem instance from a single model family. The findings should be interpreted with appropriate caution:
I'm actively running experiments to answer these questions! The auto-bench framework is designed to scale to hundreds of model/quantization combinations and thousands of problem instances. Stay tuned for results on larger model families, more problem instances, and more task types.