In Part 1 and Part 2, I reported some peculiar behavior for quantized Qwen3.5-2B models:

  1. KL divergence did not reliably predict downstream coding-agent performance.
  2. Some quantizations outperformed the original model, while others underperformed.

This post shares early results for a larger variant: Qwen3.5-35B-A3B.

Key Takeaways

  1. vllm currently leads this sweep at 46.0% resolved.
  2. Q8_0 and Q5_K_M both outperform the GGUF BF16 variant in resolved rate.
  3. BF16 and vllm show many more timeouts than the quantized GGUF variants.
  4. Public benchmark numbers for this model are meaningfully higher than what I observe locally.

Sweep Summary

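| Run    | Resolved | %     | PPL  | KL     | Runtime   | Exceptions                                          |
|--------|----------|-------|------|--------|-----------|-----------------------------------------------------|
| BF16   | 181/500  | 36.2% | 6.62 |        | 2103m 41s | Timeout(275), ExitCode(17), Reward(3), Verifier(1)  |
| Q5_K_M | 194/500  | 38.8% | 6.62 | 0.0083 | 531m 46s  | Timeout(8), ExitCode(23), Reward(3), Verifier(1)    |
| Q8_0   | 202/500  | 40.4% | 6.61 | 0.0068 | 575m 45s  | Timeout(11), ExitCode(24), Reward(1), Verifier(1)   |
| vllm   | 230/500  | 46.0% |      |        | 1634m 13s | Timeout(145), ExitCode(22), Reward(6)               |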

Observations

Timeout behavior is a major differentiator

BF16 and vllm both have substantial timeout counts. At this point, it is unclear whether these are true long-horizon failures or loop-like behaviors similar to what I saw with weaker Qwen3.5-2B quantizations. Either way, timeout behavior appears to be a key driver of aggregate score differences.
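One rough way to see how much the timeouts weigh on the aggregate score is to compare each run's overall resolved rate with its resolved rate among instances that did not time out. The sketch below uses the counts from the sweep summary and assumes that a timed-out instance is never counted as resolved, which I have not verified; the function and variable names are illustrative only.

```python
# Counts copied from the sweep summary table (500 instances per run).
RUNS = {
    "BF16":   {"resolved": 181, "timeouts": 275},
    "Q5_K_M": {"resolved": 194, "timeouts": 8},
    "Q8_0":   {"resolved": 202, "timeouts": 11},
    "vllm":   {"resolved": 230, "timeouts": 145},
}
TOTAL = 500


def timeout_breakdown(resolved: int, timeouts: int, total: int = TOTAL) -> dict:
    """Return the timeout rate and the resolved rate among non-timeout instances.

    ASSUMPTION: a timed-out instance is never counted as resolved, so every
    resolved instance is one that finished before the timeout.
    """
    finished = total - timeouts
    return {
        "timeout_rate": timeouts / total,
        "resolved_rate": resolved / total,
        "resolved_given_finished": resolved / finished,
    }


if __name__ == "__main__":
    for name, counts in RUNS.items():
        stats = timeout_breakdown(**counts)
        print(
            f"{name:7s} timeouts={stats['timeout_rate']:.1%} "
            f"resolved={stats['resolved_rate']:.1%} "
            f"resolved|finished={stats['resolved_given_finished']:.1%}"
        )
```

Under that assumption, BF16 resolves 181 of the 225 instances it finishes (about 80%), while Q5_K_M resolves 194 of 492 (about 39%). If the assumption holds, the ranking of the runs depends heavily on how the timeouts are treated.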

Behavior differs from Qwen3.5-2B

I still do not have enough data to draw firm conclusions about KL divergence trends for this model size. However, I do not observe the same pattern as Qwen3.5-2B. Here, the smaller GGUF quantizations (Q5_K_M, Q8_0) outperform GGUF BF16, which is surprising and may be largely explained by fewer timeouts.

GGUF BF16 vs original checkpoint is still surprising

There is a notable gap between GGUF BF16 results and the original model results. That is surprising because the original checkpoint is mostly BF16 with a small number of F32 tensors. I checked the GGUF contents and confirmed F32 tensors are present:

vscode ➜ /workspaces/auto-bench (main) $ gguf-dump /home/vscode/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/BF16/Qwen3.5-35B-A3B-BF16-00002-of-00002.gguf | grep "F32" | head -20
INFO:gguf-dump:* Loading: /home/vscode/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/BF16/Qwen3.5-35B-A3B-BF16-00002-of-00002.gguf
    142:       2048 |  2048,     1,     1,     1 | F32     | blk.0.attn_norm.weight
    143:         32 |    32,     1,     1,     1 | F32     | blk.0.ssm_a
    144:      32768 |     4,  8192,     1,     1 | F32     | blk.0.ssm_conv1d.weight
    145:         32 |    32,     1,     1,     1 | F32     | blk.0.ssm_dt.bias
    148:        128 |   128,     1,     1,     1 | F32     | blk.0.ssm_norm.weight
    149:     524288 |  2048,   256,     1,     1 | F32     | blk.0.ffn_gate_inp.weight
    153:       2048 |  2048,     1,     1,     1 | F32     | blk.0.ffn_gate_inp_shexp.weight
    154:       2048 |  2048,     1,     1,     1 | F32     | blk.0.post_attention_norm.weight
    155:       2048 |  2048,     1,     1,     1 | F32     | blk.1.attn_norm.weight
    156:         32 |    32,     1,     1,     1 | F32     | blk.1.ssm_a
    157:      32768 |     4,  8192,     1,     1 | F32     | blk.1.ssm_conv1d.weight
    158:         32 |    32,     1,     1,     1 | F32     | blk.1.ssm_dt.bias
    161:        128 |   128,     1,     1,     1 | F32     | blk.1.ssm_norm.weight
    162:     524288 |  2048,   256,     1,     1 | F32     | blk.1.ffn_gate_inp.weight
    166:       2048 |  2048,     1,     1,     1 | F32     | blk.1.ffn_gate_inp_shexp.weight
    167:       2048 |  2048,     1,     1,     1 | F32     | blk.1.post_attention_norm.weight
    168:       2048 |  2048,     1,     1,     1 | F32     | blk.10.attn_norm.weight
    169:         32 |    32,     1,     1,     1 | F32     | blk.10.ssm_a
    170:      32768 |     4,  8192,     1,     1 | F32     | blk.10.ssm_conv1d.weight
    171:         32 |    32,     1,     1,     1 | F32     | blk.10.ssm_dt.bias

In fact, some tensors appear upcast to F32 in the GGUF file, so raw dtype alone does not explain the performance gap.
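To go beyond eyeballing the first 20 rows, the tensor rows of the dump can be tallied by dtype. The following is a minimal sketch that parses text in the format shown above; the regular expression is my own guess at that row layout, and `tally_dtypes` is a hypothetical helper, not part of gguf-dump or auto-bench.

```python
import re
from collections import Counter

# Matches tensor rows such as:
#   142:       2048 |  2048,     1,     1,     1 | F32     | blk.0.attn_norm.weight
# ASSUMPTION: every tensor row follows this "index: count | shape | dtype | name" layout.
ROW = re.compile(
    r"^\s*(?P<index>\d+):\s+(?P<count>\d+)\s+\|[^|]*\|"
    r"\s*(?P<dtype>\S+)\s*\|\s*(?P<name>\S+)\s*$"
)


def tally_dtypes(dump_text: str) -> tuple[Counter, Counter]:
    """Return (tensors per dtype, elements per dtype) for tensor rows in a dump."""
    tensors, elements = Counter(), Counter()
    for line in dump_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # skip log lines, metadata rows, and blank lines
        tensors[match["dtype"]] += 1
        elements[match["dtype"]] += int(match["count"])
    return tensors, elements


# Three rows copied from the dump above, used as a smoke test.
SAMPLE = """\
    142:       2048 |  2048,     1,     1,     1 | F32     | blk.0.attn_norm.weight
    143:         32 |    32,     1,     1,     1 | F32     | blk.0.ssm_a
    149:     524288 |  2048,   256,     1,     1 | F32     | blk.0.ffn_gate_inp.weight
"""

if __name__ == "__main__":
    tensor_counts, element_counts = tally_dtypes(SAMPLE)
    for dtype in tensor_counts:
        print(dtype, tensor_counts[dtype], "tensors,", element_counts[dtype], "elements")
```

Feeding it the full, unfiltered dump of each shard would show what share of the weights is stored as F32 versus BF16, which is easier to compare against the original checkpoint than a grep.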

Vendor-reported benchmark numbers are much higher

According to Qwen's benchmark page, Qwen3.5-35B-A3B achieves 69.2% on SWE-bench Verified. That is substantially higher than the 46.0% best result in this sweep.

Interestingly, on the Qwen3-Coder-Flash model page, Qwen reports 51.6% on SWE-bench Verified using OpenHands, the same agent framework I am using.

[Figure: Qwen3-Coder-Flash benchmarks]

There is also active community discussion about reproducibility for these numbers here. I have not yet found detailed methodology documentation for Qwen3.5's SWE-bench Verified evaluation setup, which may explain part of the discrepancy.

Next Steps

I think the critical question right now is whether the timeouts are legitimate. After reviewing the logs, I think that they may not be, so I'm going to re-run with fewer agents in parallel and see whether the timeout counts drop.

In Part 1, I introduced auto-bench, a tool for benchmarking quantized LLMs for local coding agents, and shared some results from a preliminary study on a single instance from SWE-bench Verified. The results showed that (1) KL Divergence doesn't predict performance, and (2) quantizations can both outperform and underperform the original model.

In this post, I'll share some new results. Like the other experiment, this one also focuses on Qwen3.5-2B. Unlike the other experiment, which tested a single instance of SWE-bench Verified with eight attempts, this experiment tests all instances of SWE-bench Verified with one attempt.
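The difference between the two designs is easier to appreciate with a confidence interval. Below is a minimal sketch of the Wilson score interval for a resolved rate. It treats attempts (or instances) as independent trials with a shared success probability, which is a simplifying assumption on my part rather than something either experiment establishes.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    ASSUMPTION: trials are independent with the same success probability.
    """
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # 8 attempts on one instance (Part 1 design) vs. 500 instances (this post).
    for successes, trials in [(8, 8), (4, 8), (28, 500), (62, 500)]:
        low, high = wilson_interval(successes, trials)
        print(f"{successes}/{trials}: {low:.1%} to {high:.1%}")
```

With only eight attempts the interval spans tens of percentage points, while with 500 instances it narrows to a few points. That is the main reason to prefer the full sweep when comparing quantizations.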

Results

Without further ado, here are the results.

| Quant      | Resolved | %     | PPL   | KL     | Runtime   | Exceptions                                           |
|------------|----------|-------|-------|--------|-----------|------------------------------------------------------|
| BF16       | 28/500   | 5.6%  | 13.38 |        | 547m 59s  | Timeout(47), ExitCode(22), Verifier(2)               |
| IQ4_NL     | 26/500   | 5.2%  | 13.67 | 0.0309 | 423m 21s  | Timeout(29), ExitCode(16), Verifier(1)               |
| IQ4_XS     | 24/500   | 4.8%  | 13.68 | 0.0318 | 493m 7s   | Timeout(39), ExitCode(15), Reward(1), Verifier(2)    |
| Q3_K_M     | 30/500   | 6.0%  | 14.33 | 0.0774 | 785m 43s  | Timeout(67), ExitCode(21)                            |
| Q3_K_S     | 20/500   | 4.0%  | 15.08 | 0.1334 | 742m 25s  | Timeout(73), ExitCode(25), Reward(1), Verifier(1)    |
| Q4_0       | 24/500   | 4.8%  | 13.91 | 0.0454 | 407m 34s  | Timeout(25), ExitCode(21), Verifier(1)               |
| Q4_1       | 36/500   | 7.2%  | 13.68 | 0.0273 | 766m 19s  | Timeout(59), ExitCode(16), Reward(1), Verifier(1)    |
| Q4_K_M     | 27/500   | 5.4%  | 13.79 | 0.0230 | 357m 0s   | Timeout(20), ExitCode(23), Verifier(2)               |
| Q4_K_S     | 19/500   | 3.8%  | 13.78 | 0.0274 | 519m 39s  | Timeout(38), ExitCode(23), Reward(1), Verifier(1)    |
| Q5_K_M     | 62/500   | 12.4% | 13.46 | 0.0082 | 784m 58s  | Timeout(61), ExitCode(23), Reward(1), Verifier(1)    |
| Q5_K_S     | 46/500   | 9.2%  | 13.49 | 0.0100 | 563m 27s  | Timeout(30), ExitCode(25), Reward(1), Verifier(3)    |
| Q6_K       | 58/500   | 11.6% | 13.48 | 0.0035 | 820m 37s  | Timeout(62), ExitCode(20), Verifier(1)               |
| Q8_0       | 37/500   | 7.4%  | 13.39 | 0.0012 | 598m 13s  | Timeout(46), ExitCode(17), Verifier(1)               |
| UD-IQ2_M   | 1/500    | 0.2%  | 17.61 | 0.2677 | 1866m 13s | Timeout(300), ExitCode(24), Verifier(1)              |
| UD-IQ2_XXS | 1/500    | 0.2%  | 27.11 | 0.7018 | 2196m 49s | Timeout(371), ExitCode(19)                           |
| UD-IQ3_XXS | 5/500    | 1.0%  | 15.31 | 0.1549 | 1481m 24s | Timeout(230), ExitCode(24), Verifier(1)              |
| UD-Q2_K_XL | 2/500    | 0.4%  | 17.15 | 0.2388 | 442m 51s  | Timeout(29), ExitCode(27)                            |
| UD-Q3_K_XL | 48/500   | 9.6%  | 13.94 | 0.0520 | 738m 44s  | Timeout(57), ExitCode(19), Reward(1), Verifier(1)    |
| UD-Q4_K_XL | 57/500   | 11.4% | 13.60 | 0.0164 | 759m 13s  | Timeout(48), ExitCode(18), Reward(2), Verifier(2)    |
| UD-Q5_K_XL | 62/500   | 12.4% | 13.51 | 0.0077 | 932m 37s  | Timeout(71), ExitCode(18), Reward(2), Verifier(1)    |
| UD-Q6_K_XL | 29/500   | 5.8%  | 13.48 | 0.0020 | 539m 38s  | Timeout(35), ExitCode(20), Verifier(1)               |
| UD-Q8_K_XL | 36/500   | 7.2%  | 13.37 | 0.0011 | 502m 1s   | Setup(1), Timeout(35), ExitCode(23), Verifier(2)     |
| vllm       | 26/500   | 5.2%  |       |        | 521m 14s  | Timeout(49), ExitCode(11), Verifier(1)               |

And the plot of % Resolved vs KL Divergence:

[Figure: Percentage instances resolved vs. KL Divergence]

Observations

A few observations:

  1. The overall resolve rates are low across the board. This is not a very powerful model. The much higher rates in Part 1 reflect the fact that I intentionally selected an easy problem instance for that experiment.
  2. As in Part 1, many of the "mid-range" quantizations outperform the original model, yet small and large quantizations underperform. This is consistent with the idea that some quantizations are actually beneficial, while others are harmful.
  3. Also as in Part 1, KL Divergence does not fully explain performance.
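One way to put a number on the third observation is a rank correlation between KL divergence and resolved count. The sketch below computes Spearman's coefficient in plain Python from the table values. It is only a summary statistic that I am adding for illustration, not an analysis the experiment itself performed, and the helper names are my own.

```python
import math

# (KL divergence, resolved out of 500) copied from the results table.
# BF16 and vllm are omitted because the table lists no KL value for them.
RESULTS = {
    "IQ4_NL": (0.0309, 26), "IQ4_XS": (0.0318, 24), "Q3_K_M": (0.0774, 30),
    "Q3_K_S": (0.1334, 20), "Q4_0": (0.0454, 24), "Q4_1": (0.0273, 36),
    "Q4_K_M": (0.0230, 27), "Q4_K_S": (0.0274, 19), "Q5_K_M": (0.0082, 62),
    "Q5_K_S": (0.0100, 46), "Q6_K": (0.0035, 58), "Q8_0": (0.0012, 37),
    "UD-IQ2_M": (0.2677, 1), "UD-IQ2_XXS": (0.7018, 1), "UD-IQ3_XXS": (0.1549, 5),
    "UD-Q2_K_XL": (0.2388, 2), "UD-Q3_K_XL": (0.0520, 48), "UD-Q4_K_XL": (0.0164, 57),
    "UD-Q5_K_XL": (0.0077, 62), "UD-Q6_K_XL": (0.0020, 29), "UD-Q8_K_XL": (0.0011, 36),
}


def average_ranks(values):
    """Rank values ascending from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    kl_values = [kl for kl, _ in RESULTS.values()]
    resolved = [count for _, count in RESULTS.values()]
    print(f"Spearman rho (KL vs resolved): {spearman(kl_values, resolved):.3f}")
```

A coefficient near -1 would mean that lower KL reliably goes with more resolved instances; anything well short of that leaves room for the non-monotonic behavior among the low-KL quantizations. Restricting `RESULTS` to a KL range of interest is a quick way to probe that region separately.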

What next?

This experiment largely confirmed the findings from Part 1 about the Qwen3.5-2B model. An open question is whether these results apply to other models as well. I plan to run similar experiments on larger variants of the Qwen3.5 family next, but I won't be evaluating every quantization. Too much time is wasted on bad quantizations because they get stuck in endless loops. Instead, I'll probably try a select few quantizations, such as BF16, Q8_0, and Q5_K_M. Although I am interested in understanding these peculiar behaviors, my primary goal is actually to find which models and quantizations are usable.

Edward J. Schwartz, Computer Security Researcher · 7 min. read

New open LLMs are released constantly and keep improving. Gemma 4 was recently claimed to be groundbreaking, but when I tried the quantized version with coding agents like opencode, it was completely unusable: it got stuck in output loops or was unable to call tools with correct syntax. This is the norm, not the exception. Most quantized open LLMs I try for agentic AI simply don't work. Finding a setup that does requires trial and error across model, quantization, and dozens of settings.

I just want a command I can run to get a working LLM for my GPU. No hours of experimentation. No guessing at combinations. Just a proven setup. I couldn't find one, so I built auto-bench.

auto-bench

Auto-bench is a tool that allows you to define experiments, automatically run LLM inference servers with the proper settings, and execute a set of benchmarks against them. Rather than reinventing the wheel, I'm currently using Harbor Framework to run the tests. Auto-bench has first-class support for quantized models. This is important, because most existing benchmarks and leaderboards don't consider quantization, even though that is how many people run models.

My project is in its earliest stages, but I have at least one experiment to share: testing various quantizations of the Qwen3.5-2B model on a single problem instance from SWE-bench Verified (swe-bench/sympy__sympy-22914). I deliberately selected an easy instance to see if quantized models can perform basic tool calls to solve an easy problem.

Here is how this experiment is configured in auto-bench:

# Benchmark 22 quants of Qwen3.5-2B on a single SWE-bench instance
# Usage: auto-bench run configs/qwen-2b-quant-sweep.yaml

name: qwen-2b-quant-sweep
backend_type: llamacpp
dataset: SWE-bench/SWE-bench_Verified

instance_ids:
  - swe-bench/sympy__sympy-22914

model:
  name: Qwen3.5-2B
  source: huggingface
  repo_id: unsloth/Qwen3.5-2B-GGUF
  sweep:
    - label: BF16
      filename: Qwen3.5-2B-BF16.gguf
    - label: Q3_K_S
      filename: Qwen3.5-2B-Q3_K_S.gguf
    - label: Q5_K_M
      filename: Qwen3.5-2B-Q5_K_M.gguf
    - label: Q5_K_S
      filename: Qwen3.5-2B-Q5_K_S.gguf
    - label: Q6_K
      filename: Qwen3.5-2B-Q6_K.gguf
    # ... 17 more quantizations

sampling:
  temperature: 0.7
  top_p: 0.8
  top_k: 20
  min_p: 0.0
  presence_penalty: 1.5
  repetition_penalty: 1.0
agent:
  agent: openhands
  env: docker
  attempts: 4
  limit: 1
  setup_multiplier: 10.0

evaluation:
  run_evaluation: true

Part of my goal is to include all the information needed to actually run the models properly. For example, the sampling section includes the sampling parameters that Qwen recommends for best performance, and these kinds of details can make a huge difference! My vision is to eventually have a leaderboard that provides a llama.cpp command line for running each model with the proper settings, so you can just copy and paste that command to get a working LLM for your coding agent.
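As a rough illustration of that idea, the sketch below turns the sampling section of the config into a llama-server argument list. The key-to-flag mapping is my assumption based on llama.cpp's common sampling options and should be checked against `llama-server --help` for your build. The function is hypothetical and is not how auto-bench actually launches servers.

```python
import shlex

# Sampling values copied from the `sampling` section of the config.
SAMPLING = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
}

# ASSUMPTION: config key -> llama-server flag. Verify with `llama-server --help`.
FLAG_FOR_KEY = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "presence_penalty": "--presence-penalty",
    "repetition_penalty": "--repeat-penalty",
}


def llama_server_command(model_path: str, sampling: dict) -> list[str]:
    """Build an argument list for llama-server from a sampling dict."""
    command = ["llama-server", "--model", model_path]
    for key, value in sampling.items():
        # An unknown key raises KeyError rather than being silently dropped.
        command += [FLAG_FOR_KEY[key], str(value)]
    return command


if __name__ == "__main__":
    # The model path is a placeholder for wherever the GGUF file lives locally.
    print(shlex.join(llama_server_command("Qwen3.5-2B-Q5_K_M.gguf", SAMPLING)))
```

Failing loudly on an unmapped key is deliberate: a silently ignored sampling parameter is exactly the kind of detail that can make a model look broken.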

Results

Before diving into the data, here's what the columns mean:

  • Resolved: Number of problem instances successfully resolved by the agent
  • Total: Total number of attempts (8 in this case)
  • % Resolved: Resolution rate as a percentage
  • PPL (Perplexity): Measures the model's confidence in its predictions. Lower is generally better, though surprisingly this doesn't always correlate with task success
  • KL: Kullback-Leibler divergence—how much the quantized model's output distribution diverges from the original model's. Lower is better, but as we'll see, it's not a strong predictor of task performance
  • Runtime: Total time to run all attempts
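  • Exceptions: Types of errors encountered (e.g., timeouts, exit code errors)

| Quant      | Resolved | %     | PPL   | KL     | Runtime | Exceptions           |
|------------|----------|-------|-------|--------|---------|----------------------|
| BF16       | 0/8      | 0%    | 13.38 |        | 3m 37s  |                      |
| IQ4_NL     | 1/8      | 12.5% | 13.67 | 0.0309 | 3m 53s  |                      |
| IQ4_XS     | 0/8      | 0%    | 13.68 | 0.0318 | 51m 53s | Timeout, ExitCode    |
| Q3_K_M     | 2/8      | 25%   | 14.33 | 0.0774 | 51m 48s | Timeout              |
| Q3_K_S     | 5/8      | 62.5% | 15.08 | 0.1334 | 51m 39s | Timeout              |
| Q4_0       | 4/8      | 50%   | 13.91 | 0.0454 | 19m 5s  |                      |
| Q4_1       | 3/8      | 37.5% | 13.68 | 0.0273 | 11m 14s |                      |
| Q4_K_M     | 1/8      | 12.5% | 13.79 | 0.0230 | 4m 8s   | ExitCode             |
| Q4_K_S     | 2/8      | 25%   | 13.78 | 0.0274 | 4m 43s  |                      |
| Q5_K_M     | 8/8      | 100%  | 13.46 | 0.0082 | 51m 48s | Timeout              |
| Q5_K_S     | 6/8      | 75%   | 13.49 | 0.0100 | 8m 53s  |                      |
| Q6_K       | 7/8      | 87.5% | 13.48 | 0.0035 | 51m 54s | Timeout              |
| Q8_0       | 4/8      | 50%   | 13.39 | 0.0012 | 51m 49s | Timeout              |
| UD-IQ2_M   | 0/8      | 0%    | 17.61 | 0.2677 | 51m 56s | Timeout(5)           |
| UD-IQ2_XXS | 0/8      | 0%    | 27.11 | 0.7018 | 51m 58s | Timeout(6), ExitCode |
| UD-IQ3_XXS | 0/8      | 0%    | 15.31 | 0.1549 | 51m 49s | Timeout              |
| UD-Q2_K_XL | 0/8      | 0%    | 17.15 | 0.2388 | 51m 49s | Timeout(2)           |
| UD-Q3_K_XL | 6/8      | 75%   | 13.94 | 0.0520 | 51m 56s | Timeout(2)           |
| UD-Q4_K_XL | 6/8      | 75%   | 13.60 | 0.0164 | 7m 40s  |                      |
| UD-Q5_K_XL | 8/8      | 100%  | 13.51 | 0.0077 | 18m 2s  |                      |
| UD-Q6_K_XL | 3/8      | 37.5% | 13.48 | 0.0020 | 6m 51s  |                      |
| UD-Q8_K_XL | 3/8      | 37.5% | 13.37 | 0.0011 | 5m 49s  |                      |
| vllm       | 1/8      | 12.5% |       |        | 6m 32s  |                      |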

The Base Model Is Broken

The most striking finding is that the unquantized base model (shown as vllm in the results) achieves only 12.5% resolution—worse than most quantized versions. BF16, which is nearly the original model without quantization, also consistently fails at 0%. This suggests the base model is fundamentally broken for this coding task, but quantization somehow fixes it.

Many medium-sized quantizations (Q5_K_M, UD-Q5_K_XL, Q6_K) achieve 100%, 100%, and 87.5% resolution respectively. Yet larger quantizations like Q8_0 drop back to 50%. This isn't about bigger being better; it's about finding the quantization that repairs the base model's broken reasoning.

KL Divergence Doesn't Predict Success

The KL (Kullback-Leibler) divergence column measures how much a quantized model's output distribution diverges from the original. If the base model is broken for this task, then staying close to the original (low divergence) just means inheriting the same brokenness. That could explain why there's no strong correlation between divergence and success.
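For readers who want the definition in concrete terms: for next-token distributions P (original) and Q (quantized), KL(P || Q) is the sum over tokens of P(t) * log(P(t) / Q(t)), typically averaged over many positions in a test text. The sketch below implements that definition on toy logits. The numbers are made up for illustration, and this is not necessarily the procedure used to produce the KL column.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(P || Q) in nats. Terms with p == 0 contribute zero.

    Assumes q > 0 wherever p > 0, which holds for softmax outputs.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kl(base_logits, quant_logits):
    """Average KL(base || quant) over token positions."""
    per_token = [
        kl_divergence(softmax(b), softmax(q))
        for b, q in zip(base_logits, quant_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Made-up logits for two positions over a four-token vocabulary.
    base = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
    quant = [[1.8, 1.1, 0.2, -0.9], [0.4, 0.7, 2.7, 0.1]]
    print(f"mean KL(base || quant) = {mean_token_kl(base, quant):.4f} nats")
```

Note that the metric only measures how far the quantized distribution drifts from the original; it says nothing about whether the drift helps or hurts on a given task.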

Q5_K_M achieves 100% resolution with low divergence (0.0082)—but so does UD-Q5_K_XL with similar low divergence. Meanwhile, BF16 (essentially 0 divergence, nearly the original) fails completely at 0%. Some high-divergence models like UD-IQ2_XXS fail too, but others like Q3_K_S achieve 62.5% with divergence of 0.1334.

[Figure: KL Divergence vs number of resolved attempts]

The plot tells the story: failing models (0 resolved) scatter across the entire divergence range—some very close to the original, some far away. If you stayed loyal to a broken base model, you'd fail. If you accidentally diverged in the right way, you'd succeed. KL divergence alone can't tell you which happened.

Multiple Failure Modes

There are two distinct failure modes visible in the results.

Failure Mode 1: Infinite Loops (AgentTimeoutError)

Many quantizations exhibit infinite looping behavior, where the agent gets stuck generating the same outputs repeatedly and eventually hits the timeout limit. Models like UD-IQ2_M, UD-IQ2_XXS, IQ4_XS, and UD-IQ3_XXS all show AgentTimeoutError instances. Interestingly, this failure mode appears to correlate strongly with extremely aggressive quantization (e.g., IQ2 variants with very high KL divergence > 0.26).

Failure Mode 2: Silent Failure

The second failure mode is when the agent runs to completion without timing out, but simply fails to correctly solve the problem. BF16 is the clearest example: it records no timeouts or other exceptions and finishes in 3m 37s, yet achieves 0% resolution. (UD-IQ2_XXS and UD-IQ3_XXS also score 0%, but the table shows timeouts for both, so they belong at least partly to the first failure mode.) This suggests that a model can fall below a critical threshold where it can't effectively reason about code, even if it's still generating syntactically valid tool calls.

Conclusion: The Base Model Is Broken, Quantization Fixes It

The core finding is that the unquantized base model (12.5% resolution) and near-original BF16 (0% resolution) both fail for this coding task. Yet specific quantizations like Q5_K_M and UD-Q5_K_XL achieve 100%. Quantization isn't degrading a working model—it's repairing a broken one.

Notice in the visualization: models that fail (0 resolved) scatter across the entire KL divergence range, from very close to the original all the way to extremely divergent. Models that succeed tend to cluster at low divergence. But the scatter on the left side proves you can't predict failure from divergence—some quantizations stay very close to the original yet still fail.

The lesson is that model quality for coding agents is challenging to predict. This is why auto-bench exists—to empirically measure what actually works for your specific use case.

Open Questions and Limitations

This experiment demonstrates an interesting phenomenon, but it's based on a single problem instance from a single model family. The findings should be interpreted with appropriate caution:

  • Generalization to other models: Do these patterns hold for Llama, Mistral, or other model families? The behaviors might be Qwen-specific.
  • Generalization to other instances: I deliberately chose an easy instance to see if quantized models could work at all. Would the patterns hold on harder instances? SWE-bench Verified spans easy to extremely difficult problems.
  • Generalization to other tasks: Would we see similar results on SWE-bench instances beyond Verified, or on other benchmarks like HumanEval or MBPP?
  • Sampling parameter sensitivity: How much of the improvement from quantized models comes from properly-tuned sampling parameters? A controlled ablation would be valuable.

I'm actively running experiments to answer these questions! The auto-bench framework is designed to scale to hundreds of model/quantization combinations and thousands of problem instances. Stay tuned for results on larger model families, more problem instances, and more task types.
