In Part 1, we showed that tool-calling agents built with open source LLMs and LangChain almost universally performed poorly. They exhibited strange behaviors, such as responding to "Hello." by making nonsensical tool calls. In this blog post, we will try to determine why that happened.

Rather than investigating every model at once, in this blog post I'm going to focus my effort on Llama 3.2.

Hidden: Setup

!pip install langgraph~=0.2.53 langchain-ollama langchain-huggingface python-dotenv
!pip install httpx==0.27.2 # temp
!apt-get install -y jq
debug = False
sample_size = 100
num_ctx = 8192
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage
@tool
def foobar(input: int) -> int:
    """Computes the foobar function on input and returns the result."""
    return input + 2

tools = [foobar]
from langgraph.prebuilt import create_react_agent

def react_chat(prompt, model):
  agent_executor = create_react_agent(model, tools)

  response = agent_executor.invoke({"messages": [("user", prompt)]})
  return response['messages'][-1].content, response

Here we install Ollama.

!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh

Make sure the Ollama server is running.

!ollama -v
!ollama ps 2>/dev/null || (setsid env OLLAMA_DEBUG=1 nohup ollama serve &)
!ollama pull llama3.2 2>/dev/null
basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
  last_msg, _ = react_chat(basic_tool_question, model=model)
  r = "32" in last_msg
  if not r and debug:
    print(f"q1 debug: {last_msg}")
  return r
from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
  _, result = react_chat(basic_arithmetic_question, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
  _, result = react_chat(greeting, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
  return q2a(model) and q2b(model)
def q3a(model):
  result = model.invoke(basic_arithmetic_question)
  return "12243" in result.content

def q3b(model):
  last_msg, _ = react_chat(basic_arithmetic_question, model=model)
  r = "12243" in last_msg
  if not r and debug:
    print(f"q3b debug: {last_msg}")
  return "12243" in last_msg

def q3(model):
  # q3a ==> q3b: If q3a, then q3b ought to be true as well.
  r = not q3a(model) or q3b(model)
  return r
def q4(model):
  last_msg, _ = react_chat(greeting, model=model)
  c1 = any(w in last_msg for w in ["Hi", "hello", "Hello", "help you", "Welcome", "welcome", "Greeting", "assist"])
  c2 = any(w in last_msg for w in ["None of the"])

  r = c1 and not c2
  #if not r:
  if debug: print(f"q4 debug: c1={c1} c2={c2} r={r} greeting? {last_msg}")
  return r
from tqdm.notebook import tqdm
from termcolor import colored

def do_bool_sample(fun, n=10, *args, **kwargs):
  try:
    # tqdm here if desired
    return sum(fun(*args, **kwargs) for _ in (range(n))) / n
  except Exception as e:
    print(e)
    return 0.0

def run_experiment(model, name, n=10):
  do = lambda f: do_bool_sample(f, model=model, n=n)
  d = {
      "q1": do(q1),
      "q2": do(q2),
      "q3": do(q3),
      "q4": do(q4),
      "n": n,
      "model": name
  }
  d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
  return d

def print_experiment(results):
  name = results['model']
  print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
  print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
  print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
  print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name, **kwargs):
  results = run_experiment(model, name, **kwargs)
  print_experiment(results)
  return results

Quantization

You probably know that modern neural networks can be pretty large, and that is why special GPUs with lots of memory are in high demand right now. So how are we able to run some of these models on our computers, which don't have these special GPUs, using Ollama?

One reason is that Ollama uses quantized models, which are numerically compressed to use less memory. For example, the original Llama 3.2-3B-Instruct model uses bfloat16 tensors, which require 16 bits to store each parameter. On Ollama's Llama 3.2 model page, you can see the quantization is listed as Q4_K_M. At a high level, this squeezes each 16-bit parameter down to 4 bits. And somehow it still works!
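To get a feel for the savings, here is a rough back-of-the-envelope calculation. It is only an illustration: Q4_K_M actually mixes quantization types and stores extra scale metadata, so the real file is somewhat larger than this naive estimate.

params = 3.2e9                 # roughly 3.2B parameters in Llama 3.2-3B-Instruct
bf16_gb = params * 2 / 1e9     # bfloat16: 16 bits = 2 bytes per parameter
q4_gb = params * 0.5 / 1e9     # ~4 bits = 0.5 bytes per parameter, ignoring overhead
print(f"bfloat16: ~{bf16_gb:.1f} GB, Q4: ~{q4_gb:.1f} GB")  # bfloat16: ~6.4 GB, Q4: ~1.6 GB

That is roughly a 4x reduction in memory.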

But does it really still work? Maybe quantization is why our tool-calling doesn't work?

One simple way to test this is to evaluate a quantized version versus a non-quantized version. Luckily, this repository on HuggingFace happens to have both quantized and non-quantized models in a format that Ollama can process. So we can evaluate both of them using Ed's Really Dumb Tool-calling Benchmark ™️ that I introduced in Part 1.

from langchain_ollama import ChatOllama

quant_models = [
  "hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16", # non-quantized
  "hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", #quantized
]

for m in quant_models:
  print(f"Model: {m}")
  !ollama pull {m} 2>/dev/null
  r = run_and_print_experiment(ChatOllama(model=m, num_ctx=num_ctx), m, n=sample_size)
  print(r)
Model: hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.91
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.36
Question 4: Does the react agent lose the ability to chat? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.12
{'q1': 0.91, 'q2': 0.0, 'q3': 0.36, 'q4': 0.12, 'n': 100, 'model': 'hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16', 'total': 1.3900000000000001}
Model: hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.99
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.51
Question 4: Does the react agent lose the ability to chat? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.06
{'q1': 0.99, 'q2': 0.0, 'q3': 0.51, 'q4': 0.06, 'n': 100, 'model': 'hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M', 'total': 1.56}

Both models do poorly. In fact, the quantized version does slightly better.

Conclusion: Quantization is probably not the problem.

Prompt Templates

Behind the scenes, LLMs need to be prompted in a very specific format to work well. HuggingFace dubs this problem the silent performance killer. In response, they created "chat templates" which codify the format and live alongside the model to avoid any ambiguity. Note: I call these "prompt templates".
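As a concrete example, here is roughly how you would apply a chat template yourself with HuggingFace's transformers library. This is an illustrative sketch, not code from this notebook, and the model name is a placeholder for any instruct model whose repository you can access (Meta's Llama repositories are gated):

from transformers import AutoTokenizer

# Illustrative: substitute any instruct model you have access to.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# The chat template renders the message list into the exact prompt string the
# model was trained on, including special tokens like <|start_header_id|>.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))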

We haven't had to worry about prompt templates at all, because Ollama has been taking care of templating for us. Maybe its prompt templates are problematic?

To test this theory, we're going to build some code to query Ollama but without using Ollama to format the prompt for us. There are two purposes for this:

  1. We will learn a bit how tool calling works and how it interacts with prompt templates. I have a suspicion that prompt templates have something to do with the problem.

  2. We will avoid a lot of code that has been hiding behind abstraction. This is the downside of abstraction: it makes things easier to build, but it's harder to understand where the blame might lie when something fails.

Llama 3.2 Based Prompt

Let's start by examining the prompt template recommended for Llama 3.2, which is this template for zero-shot function calling from the llama-models repository. We'll talk more about this later, but Meta actually publishes conflicting prompt templates in different locations! So to be clear, this is the llama-models Llama 3.2 prompt template.

Here is the example from that page:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the weather in SF and Seattle?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It's worth adding a few notes here. Modern LLMs are implemented as chat models, which means that they expect a conversation in the form of a list of messages sent by various roles. The primary roles are the user and the assistant. But we can also see that there are system messages: hidden instructions sent to the LLM that tell it how to behave. In this prompt template, the system message also specifies how and when the LLM should interact with tools.

On the same page is an example of the format in which the model should respond:

[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]<|eot_id|>
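As an aside, this pythonic response format is not hard to parse in Python itself. Here is a minimal sketch of how one might do it with the ast module; it is my own illustration, not Ollama's or Meta's actual parser, and it assumes special tokens like <|eot_id|> have already been stripped.

import ast

def parse_pythonic_tool_calls(text):
  """Parse a response like "[get_weather(city='SF')]" into (name, kwargs) pairs."""
  tree = ast.parse(text.strip(), mode="eval")
  calls = []
  for call in tree.body.elts:  # the response is a top-level list literal
    name = call.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    calls.append((name, kwargs))
  return calls

parse_pythonic_tool_calls("[get_weather(city='San Francisco', metric='celsius')]")
# [('get_weather', {'city': 'San Francisco', 'metric': 'celsius'})]

Keep this format in mind; it will come up again when we look at what Ollama can and cannot parse. Now, back to the Llama 3.2 prompt template.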

Let's code this up and try it out.

llama_32_example_funs = """[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]"""

# https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#input-prompt-format-1
def llama_32_prompt_template(user, funs=llama_32_example_funs):
  return """<|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

%s<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>""" % (funs, user)
print(llama_32_prompt_template("What is the weather in new york?", funs=llama_32_example_funs))
<|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the weather in new york?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Looks good. Now we need some code to send the request to the Ollama API.

import json

def build_ollama_request(full_prompt, model="llama3.2"):
  # "raw": true tells Ollama to use our prompt verbatim rather than applying
  # its own prompt template.
  return f"""{{
    "model": \"{model}\",
    "prompt": {json.dumps(full_prompt)},
    "raw": true,
    "stream": false
  }}
  """

def build_ollama_request_from_userprompt(user, template, funs=None, **kwargs):
  return build_ollama_request(template(user, funs), **kwargs)

def send_request(full_request):
  with open("req", "w") as f:
    f.write(full_request)
  !curl -s http://localhost:11434/api/generate --data-ascii "@req" | jq -M '.response'

def send_userprompt(user, **kwargs):
  request = build_ollama_request_from_userprompt(user, **kwargs)
  send_request(request)
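(If you would rather stay in Python than shell out to curl, a roughly equivalent helper using the requests library would look like the sketch below. The curl version above is what I actually ran.)

import requests

def send_userprompt_py(user, template, funs=None, model="llama3.2"):
  payload = {
      "model": model,
      "prompt": template(user, funs),
      "raw": True,      # bypass Ollama's built-in prompt template
      "stream": False,
  }
  resp = requests.post("http://localhost:11434/api/generate", json=payload)
  return resp.json()["response"]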
# A few sample prompts...
prompts = [
    "Use tools to find the weather in new york", # llama 3.2 example
    "Use tools to get latest trending songs", # llama 3.1 example
    basic_tool_question, # Please evaluate foobar(30)
    greeting, # Hello!
    basic_arithmetic_question # What is 12345 - 102?
]

def try_prompts(**kwargs):
  for p in prompts:
    print(f"Prompt: {p}\nResponse: ", end="")
    send_userprompt(p, **kwargs)
    print()

In the list of example prompts, there are three tool-calling examples. We will be passing along the get_weather tool definition. Obviously, common sense tells us that the model should not attempt to list trending songs or evaluate foobar(30) using get_weather. Let's see how it does.

for _ in range(1):
  try_prompts(funs=llama_32_example_funs, template=llama_32_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "\n\n[get_weather(city='New York')]"

Prompt: Use tools to get latest trending songs
Response: "\n\n[get_weather(city='trending songs', metric='') ]"

Prompt: Please evaluate foobar(30)
Response: "\n\n[]"

Prompt: Hello!
Response: "\n\nNothing to see here. Would you like to ask a question or request a function call?"

Prompt: What is 12345 - 102?
Response: "\n\n[]"

On Sunday, the weather is going to be sunny with a chance of rain in the legendary city of "Trending Songs".

Overall, this is pretty disappointing. The model appears overly eager to call tools, even when it makes no sense, such as calling get_weather on the city of "trending songs". Oops. It often responds unnaturally to "Hello!". And for the arithmetic and foobar questions, it returns an empty tool-call list rather than answering at all.

Llama 3.1 Based Prompt

Llama 3.2 is actually compatible with the Llama 3.1 prompt format for tool calling, so next let's try the llama-models Llama 3.1 prompt template. Below is the example from that page.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Answer the user's question by making use of the following functions if needed.
If none of the function can be used, please say so.
Here is a list of functions in JSON format:
{
    "type": "function",
    "function": {
        "name": "trending_songs",
        "description": "Returns the trending songs on a Music site",
        "parameters": {
            "type": "object",
            "properties": [
                {
                    "n": {
                        "type": "object",
                        "description": "The number of songs to return"
                    }
                },
                {
                    "genre": {
                        "type": "object",
                        "description": "The genre of the songs to return"
                    }
                }
            ],
            "required": ["n"]
        }
    }
}

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Use tools to get latest trending songs<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Notice that the Llama 3.2 and 3.1 prompt templates have very little in common!

Let's code up the Llama 3.1 prompt and test it out.

llama_31_example_funs = """{
    "type": "function",
    "function": {
        "name": "trending_songs",
        "description": "Returns the trending songs on a Music site",
        "parameters": {
            "type": "object",
            "properties": [
                {
                    "n": {
                        "type": "object",
                        "description": "The number of songs to return"
                    }
                },
                {
                    "genre": {
                        "type": "object",
                        "description": "The genre of the songs to return"
                    }
                }
            ],
            "required": ["n"]
        }
    }
}
"""

# https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/prompt_format.md#input-prompt-format-5
def llama_31_prompt_template(user, funs=llama_31_example_funs):
   return """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Answer the user's question by making use of the following functions if needed.
If none of the function can be used, please say so.
Here is a list of functions in JSON format:
%s

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>
""" % (funs, user)

Now we'll run the sample prompts again, but this time we'll use the Llama 3.1 example function trending_songs rather than get_weather. As before, common sense tells us that we can't use trending_songs to predict the weather or compute foobar(30). Let's see how it does.

for _ in range(1):
  try_prompts(funs=llama_31_example_funs, template=llama_31_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "I can't directly use the provided function to find the weather in New York as it is a location-based API and the given function is for getting trending songs, not weather information."

Prompt: Use tools to get latest trending songs
Response: "{\"type\": \"function\", \"name\": \"trending_songs\", \"parameters\": {\"n\": \"10\"}}"

Prompt: Please evaluate foobar(30)
Response: "Since there is no `foobar` function available, the answer is: None"

Prompt: Hello!
Response: "Hello! How can I assist you today?"

Prompt: What is 12345 - 102?
Response: "I'm not aware of any function that can perform this calculation. The functions provided only include the `trending_songs` function, which is used to retrieve trending songs based on a specific number of songs and genre. It does not include arithmetic operations like subtraction. If you need help with a different type of calculation, please let me know!"

These responses seem greatly improved compared to the Llama 3.2 prompt we tried. The response to a greeting is more natural. It also didn't do silly things like stuffing "trending songs" into a mismatched function call, as the Llama 3.2 prompt did with get_weather.

The only consistent problem I can see is that it didn't even try to answer the arithmetic question. Let's see if we can fix that by slightly tweaking the wording of the prompt with the following diff:

-Answer the user's question by making use of the following functions if needed.
-If none of the function can be used, please say so.
+Help and converse with the user.  If and only if the user asks a question that
+is relevant to one of the following functions, make use of them.  If none of
+the functions can be used, answer the query as best you can.
def llama_31_custom_prompt_template(user, funs=llama_31_example_funs):
   return """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
%s

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>
""" % (funs, user)

for _ in range(1):
  try_prompts(funs=llama_31_example_funs, template=llama_31_custom_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "Since the provided function `trending_songs` is related to music and not weather, I'll answer your question as best I can.\n\nAccording to current information, New York's weather can be quite varied depending on the time of year. In general, winters are cold and snowy, while summers are hot and humid. Springs and autumns are generally mild.\n\nIf you're looking for more specific or up-to-date weather information, I recommend checking a reliable weather website or app, such as AccuWeather or Weather.com, which can provide more accurate and current forecasts for New York City and surrounding areas."

Prompt: Use tools to get latest trending songs
Response: "Here is a function call in JSON format that uses the `trending_songs` function to get the latest trending songs:\n\n```\n{\n  \"type\": \"function\",\n  \"name\": \"trending_songs\",\n  \"parameters\": {\n    \"n\": \"10\"\n  }\n}\n```"

Prompt: Please evaluate foobar(30)
Response: "Unfortunately, I don't have any information about a `foobar` function. Could you please provide more context or details about this function? Is it defined somewhere else that I can access? Or would you like me to try and infer what the `foobar` function might do based on its name?"

Prompt: Hello!
Response: "Hello! How can I assist you today?"

Prompt: What is 12345 - 102?
Response: "Since no specific function can be used for this question, I'll answer it directly:\n\nThe result of subtracting 102 from 12345 is 12343."

This seemed to work really well if you ignore the fact that it got the subtraction wrong 🤣

What prompt template is Ollama using?

So we found a prompt template that appears to be working well. Let's try to make Ollama use that prompt template. To start, let's investigate what prompt template it's using "by default".

def get_raw_prompt():
  # Scrape the raw prompts that Ollama (run with OLLAMA_DEBUG=1) logged to nohup.out.
  template = !cat nohup.out | fgrep "chat request" | sed -e 's/.*prompt="\(.*\)"/\1/'
  return [s.encode().decode('unicode_escape') for s in template]

!ollama pull llama3.2 2>/dev/null
!>nohup.out # Truncate ollama output
response = react_chat(greeting, model=ChatOllama(model="llama3.2", num_ctx=num_ctx))
print(get_raw_prompt()[0])
<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

When you receive a tool call response, use the output to format an answer to the orginal user question.

You are a helpful assistant with tool calling capabilities.<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{"type":"function","function":{"name":"foobar","description":"Computes the foobar function on input and returns the result.","parameters":{"type":"object","required":["input"],"properties":{"input":{"type":"integer","description":""}}}}}

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>


That doesn't look like either of the prompt templates we used before, which both came from the llama-models repository. After a bit of google-fu, we can see that it originated from the Llama 3.1 "JSON based" tool calling documentation on the llama website. But it's not the same Llama 3.1 prompt template that we used from the llama-models repository.

This raises a few questions:

  1. Why are there multiple prompt templates for Llama 3.1?

  2. Which prompt template is best?

  3. Why not use the prompt template for Llama 3.2, since we are using the Llama 3.2 model?

Let's start by answering the first two questions. In this GitHub issue, a user notes that there are at least three different prompt templates for Llama 3.1:

  1. The Llama 3.1 model website
  2. Meta's llama-models repository
  3. The HuggingFace model's chat template

A Meta employee states that the template in the llama-models repository is the correct one. Fortunately, that's what we have been using in this blog post. (It's almost as if I knew this in advance!) But Ollama has been basing their template on the one from the model website. That seems problematic!

The last question, "Why not use the prompt format for Llama 3.2?" is pretty easy to answer as well. Llama 3.2's default prompt format responds using a pythonic function call syntax that Ollama can't parse. And, as we saw above when we tested it manually, the Llama 3.2 prompt anecdotally did not seem to work well anyway.

Let's build our own Ollama prompt template

Now that we identified a prompt template that seems to work pretty well, let's try to make Ollama use it.

To start, let's look at the default prompt template for the llama3.2 model in Ollama. We already saw the instantiated prompt, but now let's look at the template that Ollama uses to build the prompts. You can see this below, or on the Ollama website here.

<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

{{ if .System }}{{ .System }}
{{- end }}
{{- if .Tools }}When you receive a tool call response, use the output to format an answer to the orginal user question.

You are a helpful assistant with tool calling capabilities.
{{- end }}<|eot_id|>
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>
{{- if and $.Tools $last }}

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{{ range $.Tools }}
{{- . }}
{{ end }}
{{ .Content }}<|eot_id|>
{{- else }}

{{ .Content }}<|eot_id|>
{{- end }}{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
{{ range .ToolCalls }}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}
{{- else }}

{{ .Content }}
{{- end }}{{ if not $last }}<|eot_id|>{{ end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- end }}
{{- end }}

It's not just you. It really is hard to read. The above code is written in the Go template language. I recommend using this interactive editor to better understand how the template language works if you are interested.

At a high level, the prompt template takes a sequence of messages and converts it into a prompt for the model. Some important notes:

  • ToolCalls indicate the calls the model wants to make. Ollama infers these by parsing the model's responses.
  • The tool role contains the output of an executed tool.
  • The user and assistant roles are self-explanatory!
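
To make this concrete, here is roughly the kind of message sequence that LangGraph hands to Ollama after our foobar tool has run; the template's job is to render it into a prompt like the one above. The exact fields and IDs are illustrative:

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

example_messages = [
    HumanMessage("Please evaluate foobar(30)"),
    # The model's tool call, as parsed by Ollama and wrapped by LangChain.
    AIMessage(content="", tool_calls=[{"name": "foobar", "args": {"input": 30}, "id": "call_1"}]),
    # The tool's output, which the template renders under the ipython (tool) role.
    ToolMessage(content="32", tool_call_id="call_1"),
]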

With a lot of trial and error in the interactive template editor, I converted our earlier template into the Ollama format:

<|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

{{ if .System }}{{ .System }}
{{- end -}}
<|eot_id|>{{ if .Tools }}<|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
{{- range $.Tools }}
{{ . }}{{ end }}

Return function calls in JSON format.<|eot_id|>{{ end }}
{{- range $i, $_ := .Messages }}

{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>

{{ .Content }}<|eot_id|>
{{- if $last }}<|start_header_id|>assistant<|end_header_id|>
{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
<|python_tag|>{{- range .ToolCalls -}}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}<|eom_id|>
{{- else }}
{{ .Content }}<|eot_id|>
{{- end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- end }}
{{- end }}

Now the big question: does it actually work?

Does our new Ollama prompt template work as intended?

First, let's make a query and make sure that we get the right answer!

!>nohup.out # Truncate ollama output
!ollama pull ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized 2>/dev/null
response = react_chat(basic_tool_question, model=ChatOllama(model="ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized", num_ctx=num_ctx))
print(response[0])
assert "32" in response[0]
The result of the foobar function when called with 30 as input is 32.

Second, let's take a peek at the prompt we sent to the LLM.

print(get_raw_prompt()[-1])
<|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of 
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
{"type":"function","function":{"name":"foobar","description":"Computes the foobar function on input and returns the result.","parameters":{"type":"object","required":["input"],"properties":{"input":{"type":"integer","description":""}}}}}

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Please evaluate foobar(30)<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<|python_tag|>{"name": "foobar", "parameters": {"input":30}}<|eom_id|><|start_header_id|>ipython<|end_header_id|>

32<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Looks good to me!

Alright, let's run it through Ed's Really Dumb Tool-calling Benchmark ™️. For comparison, we'll also run the original Llama 3.2 model in Ollama (llama3.2) and the unmodified Llama 3.1 tooling prompt.

models = [
  "llama3.2",
  "ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt", # based on the llama 3.1 tooling prompt
  "ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized", # our improved prompt
]

for m in models:
  print(f"Testing model: {m}")
  !ollama pull {m} 2>/dev/null
  r = run_and_print_experiment(ChatOllama(model=m, num_ctx=num_ctx), m, n=sample_size)
  print(r)
Testing model: llama3.2
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 0.97
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.55
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.09
{'q1': 0.97, 'q2': 0.0, 'q3': 0.55, 'q4': 0.09, 'n': 100, 'model': 'llama3.2', 'total': 1.61}
Testing model: ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt
Question 1: Can the react agent use a tool correctly when explicitly asked? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.15
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.51
Question 4: Does the react agent lose the ability to chat? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.49
{'q1': 1.0, 'q2': 0.15, 'q3': 0.51, 'q4': 0.49, 'n': 100, 'model': 'ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt', 'total': 2.15}
Testing model: ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized
Question 1: Can the react agent use a tool correctly when explicitly asked? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.99
Question 2: Does the react agent invoke a tool when it shouldn't? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.99
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.68
Question 4: Does the react agent lose the ability to chat? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 1.0
{'q1': 0.99, 'q2': 0.99, 'q3': 0.68, 'q4': 1.0, 'n': 100, 'model': 'ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized', 'total': 3.66}

The original Ollama prompt scored 1.61/4.0. The official Llama 3.1 tooling prompt scored 2.15/4.0, and my customized prompt scored 3.66/4.0. I'd say that is an improvement!

Coming Clean

I do want to come clean. I did not create that prompt template in one try. It actually took many days of experimenting and debugging.

(I find that this is often a dilemma when blogging. Keeping track of everything you did is difficult, time-consuming, and often not that interesting.)

Many things went wrong along the way, but here are a few issues that I remember:

  • Environment: ipython is supposed to be for enabling Llama's code interpreter, which we aren't using, but I wasn't able to get the Llama 3.1 prompt to work well without it.
  • Llama has some weird rules for when to use <|eom_id|> vs. <|eot_id|>, and I got them wrong. As a result, even though the initial prompt was correct, Llama could not "parse" the results to correctly build the final message.
  • In Llama 3.1 prompts, when the model responds with a tool call, it is supposed to be prefixed with <|python_tag|>. Somewhat oddly (in my opinion), existing Ollama prompts "rebuild" the tool call response from the parsed values, rather than using the original message. As a result, I had to add in the <|python_tag|> or the model would become confused and struggle to build the final message to the user.
  • In one template, I added an extra newline, and this caused a notable decrease in performance! Yes, models really are that sensitive.

Recap

Let's recap all the things we had to do to get to this point:

  1. We used the Llama 3.1 prompt template from the llama-models repository and NOT the one from the Llama 3.1 website, which is what Ollama's prompt was based on.

  2. We modified the wording of the prompt to improve its responses to non-tool-calls like greetings and arithmetic.

And this was all just to improve the performance of one single model. I'm tired, aren't you?

Conclusion

Prompts are definitely part of the reason why building tool calling agents did not work in Part 1. But HuggingFace raised the alarm about this a long time ago! So what went wrong?

What went wrong?

Now that tool-calling is becoming more popular, prompt templates must be considered a fundamental part of a model, just like the weights. The reason for this is simple: the model developer is the only entity who has a clear incentive to ensure that their model works as well as possible. Downstream consumers like Ollama do not have an incentive to make sure that prompt templates work as well as possible. Unfortunately for Llama, Meta did not treat the prompt templates as a fundamental part of the model. Meta did a poor job of documenting the prompt templates: the example-based documentation is vague, and the multiple conflicting sources of information further confused the issue. So right off the bat, Llama models are not bundled with a clear prompt template.

Ollama did not help the situation. Instead of adopting an existing template format such as HuggingFace's, they decided to roll their own format based on Go templates. On one hand, this is a natural decision since Ollama is written in Go. But now someone has to write a new prompt template for every model on Ollama. Currently, it is the Ollama developers themselves who are creating these prompt templates. But as I mentioned above, there is a concerning incentive mismatch: the Ollama developers don't have an incentive to determine the best prompt format for every model. Here is a stark example in which the Ollama developers provided unhelpful and misleading responses when users reported that models were making nonsensical tool calls:

Don't bind tools if you don't want [the Llama model to make] a tool call.

Of course, as evidenced by this blog post, the real problem was that the Ollama developers themselves chose a prompt template that performed poorly on tool-calling. (I don't mean to beat up on the Ollama project. I think it's a great project! But they didn't help themselves out in this area.)

What can we do about it?

Standards

We really do need a standard for prompt templates. HuggingFace's chat template is a good start, but it is not perfect. While it describes how to format messages that should be sent to the model, it doesn't define how to parse the model's responses, which is equally important. As an example, I suspect that a major reason why Ollama's Llama 3.2 model used a Llama 3.1 prompt template is because Ollama's tool-call parser does not support the pythonic format used in the Llama 3.2 prompts. Parsing tool-calls is currently very ad-hoc.
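To illustrate just how ad-hoc it is, here is a minimal sketch of the kind of response parsing that every runtime ends up writing by hand for the JSON-style format we saw earlier. This is my own illustration, not Ollama's actual code:

import json

def try_parse_json_tool_call(text):
  """Return (name, parameters) if the response looks like a JSON tool call, else None."""
  try:
    obj = json.loads(text.strip())
  except json.JSONDecodeError:
    return None  # ordinary chat text, not a tool call
  if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
    return obj["name"], obj["parameters"]
  return None

None of these conventions are captured by the chat template itself, so every runtime has to rediscover them.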

Another problem with HuggingFace's chat templates is language compatibility. I suspect that part of the reason why Ollama chose to use their own template format is convenience. HuggingFace's chat templates are based on Jinja2, which is a templating language for Python. But Ollama is written in Go. Perhaps we need a standard format that is more language agnostic.

Ollama

Ollama should either adopt the HuggingFace template format or create a tool that can convert HuggingFace templates to Ollama templates. The current system of manually converting templates is error-prone and harmful.

Ollama should also add information to their model cards about which prompt templates they adopted and why. For example, the Llama 3.2 model card does not mention that the prompt template is based on the Llama 3.1 prompt format, or why.

Benchmarks

Benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) could also be doing more about the prompt problem. For Llama, it appears that, similar to the Ollama developers, the BFCL developers have simply chosen a prompt and implemented it. Llama 3.1 and 3.3 appear to be based on a Llama HuggingFace chat template while other versions use a generic prompt.

We don't know how or why they selected these prompts. As with the Ollama developers, there is an incentive mismatch: they don't have an incentive or responsibility to experiment. Perhaps there should be more incentive for model developers to fix the prompt templates in order to score better on the BFCL, but it doesn't seem like that is how things work today. Honestly, I don't understand why; I would think that Meta would be embarrassed that the Llama-3.2-3B-Instruct model only scores 5.25% on the BFCL in Overall Multi Turn Accuracy.

Llama

The Llama developers really need to do a better job documenting their prompt templates. The example-based "documentation" is vague. And more critically, there shouldn't be conflicting information. Even after they were notified about it, the problem remains. In the same GitHub issue, it's clear that people can't reproduce Meta's experimental results either.

Closing Thoughts

We largely got to the bottom of the problems in Llama 3.2, but this was just one of many models that performed poorly. Are all of these models suffering from prompt template problems, or are there other problems as well? Stay tuned to find out.

TLDR

Better Ollama models for Llama 3.2 tool calling are available here:

  • ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt is based on this Llama 3.1 prompt
  • ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized is based on this Llama 3.1 prompt but slightly modifies the language to improve behavior on responding to queries unrelated to tools.

Given the recent change in government here in the US, I've been thinking a lot about my personal beliefs and values. As part of this process, I decided to write my own personal manifesto. I hope that even if I do not see eye to eye on political matters with all my neighbors and fellow citizens, we can find some common ground on the values that matter most to me.

As a reminder, everything on this blog is my own opinion and does not reflect the views of my employer or anyone else.

My Personal Manifesto

Life is inherently unequal, shaped by factors beyond our control: genetics, financial circumstances, and sheer luck. I feel that those who are fortunate bear a responsibility to help those who are not, and that principle should be a foundation of our society. A truly fair and compassionate world requires us to ensure that everyone, regardless of gender, race, or any other characteristic, is given a fair and equal chance. When some groups consistently achieve more success than others based on traits like skin color, it signals a deeper flaw in our society. Equality must be more than a principle; it must be a reality reflected in the systems and institutions that govern us.

Freedom is one of life's most important values, but it should not come at the expense of others' well-being. We should accept people as they are, reserving judgment only when someone's actions cause real harm; discomfort, unease, or disagreement with someone's identity, culture, or way of life does not constitute harm. Prejudice and bigotry, when disguised as "freedom of expression," should not be allowed to infringe upon others' dignity and rights under the guise of tolerance.

In making political or social decisions, it is essential to think beyond our immediate reactions and consider the broader impact, including possible side effects. Policies must align with their stated goals and follow through on their commitments. For example, if a policy restricts abortion under the guise of protecting unborn children, it should also provide resources and support to ensure those children are cared for and supported after birth. Making a decision without committing to its consequences is irresponsible and ultimately harmful. True responsibility means considering all dimensions and consequences and committing to the outcomes of our choices.

A society grounded in truth and information is best equipped to address these challenges. Facts and science must take precedence over opinions or political agendas, ensuring our systems and policies are rooted in evidence rather than ideology. Truth is non-negotiable, and it must guide our decisions and perspectives. Science and evidence are the tools we use to uncover systemic injustices, evaluate the impacts of our actions, and create solutions that reflect reality, not rhetoric.

This manifesto is a call for a society that values truth, compassion, fairness, and foresight. It is a vision of a world where integrity, responsibility, and respect for all people guide our actions and our choices.


For the past several years, I've been using Advent of Code as an excuse to do some professional development and learn new languages. The past two years, I used Rust, which is an interesting language.

I remembered seeing funny looking solutions on the Reddit AOC solutions thread that were basically a jumble of symbols. These were for a language called Uiua, which naturally piqued my interest.

For some reason, I decided to try to do AOC 2024 in Uiua. It has not been a smooth ride, and in this blog post I'll briefly touch on some of my thoughts and experiences with Uiua and two other array programming languages, APL and BQN.

Starting in Uiua

The Uiua website describes the language as:

Uiua (wee-wuh 🔉) is a general purpose, stack-based, array-oriented programming language with a focus on simplicity, beauty, and tacit code.

Uiua lets you write code that is as short as possible while remaining readable, so you can focus on problems rather than ceremony.

The language is not yet stable, as its design space is still being explored. However, it is already quite powerful and fun to use!

But this screenshot of a basic Uiua example probably gives you a better idea of how the language works:

Uiua example

I spent a while going through the Uiua tutorials, and I made it through the first few AOC problems with a bit of difficulty.

I eventually got to a problem where I had to write a fold. And I remember getting extremely frustrated with the language. The language does not have (local) variable names. Instead, everything is on the stack. You as the programmer must internally keep track of the stack and how all the operations you perform modify it. Oh, it's a stack-based machine too, so the top of the stack is constantly changing.

I think there were only three or four values I had to juggle in my fold function, but it was too much. Maybe because I work in binary analysis where you can't take local variables for granted, I really want to be able to use them in my "high level" programming languages.

More seriously, I think the lack of local variables just compounds complexity. Simple functions are fine. But complex functions get even more complex, because the programmer now also has to juggle the stack layout. No thanks.

Moving to APL

Uiua is an array-oriented programming language. Most array-oriented programming languages derive from APL (which was created in the 60s!). One benefit of this is that APL is a pretty mature language, and there is a lot of training material available for it.

The most popular implementation is a commercial, non-open-source one called Dyalog APL. I wasn't thrilled to be using a closed-source implementation, but it's just for learning purposes so I supposed it was fine. I started following along with this tutorial. I got about half way through, and started to feel like I was probably competent enough to try some AOC problems in APL.

I immediately ran into trouble again, but this time with APL's tooling. I have two basic requirements for a programming language for AoC:

  1. I can put the code for each day in a file.
  2. I can run the code from inside VS code fairly easily.
  3. I can type the weird symbols of the language from within VS code. (Oops, this one is new this year.)

I forgot to mention that Uiua's tooling was pretty great. No complaints; I installed the extension and everything worked as expected.

APL tooling is weird. It's not really designed like modern programming languages. Instead, all coding is supposed to be done in workspaces. I was pretty frustrated by this and I eventually gave up.

In retrospect, I may have been able to get by with dyalogscript. But the unpolished nature of the tooling, at least for how programming languages are used in this century, was a big turn off.

BQN

Finally, I landed on BQN. Here is the website's description:

Looking for a modern, powerful language centered on Ken Iverson's array programming paradigm? BQN now provides:

  • A simple, consistent, and stable array programming language
  • A low-dependency C implementation using bytecode compilation: installation
  • System functions for math, files, and I/O (including a C FFI)
  • Documentation with examples, visuals, explanations, and rationale for features
  • Libraries with interfaces for common file formats like JSON and CSV

And here's a quick example from the website.

BQN

If you are thinking that all of these languages look pretty similar, you're right.

BQN had a lot going for it. It wasn't stack based. The tooling seemed pretty good. Not only is the language designed to be used from files, you can even use multiple files. Welcome to the 21st century baby!

I also liked the name. The whole point of learning this array-programming paradigm was to be able to write short, concise code. So the idea of answering "Big Questions" was appealing.

BQN has a lot of documentation. There are a few tutorials intended for new users, but most of the documentation is of the, well, documentation variety. It's not a tutorial, but a reference manual. It's written by an absolute array programming expert, for other array programming experts. It's not the most beginner friendly.

So it was a pretty rough learning curve. I quickly joined the APL language discord and started asking a lot of questions. People there are very patient and helpful, thankfully! I also found some other people working on AOC, and I spent a lot of time unraveling their solutions.

I just finished Day 9 of AOC 2024 in BQN. It's December 20th, so obviously I'm pretty far behind. I'm not sure if I'll finish this year; I've been trying to embrace learning the array-oriented way of thinking, which has been challenging and slow.

Readability

I've been slowly getting better at reading others' BQN code, but it's hard. There are a lot of symbols to remember, but that's really not the main problem for me. Instead, it's very difficult to "parse" where parentheses should be placed. It can also be difficult to follow the general flow of very terse code.

Here's a snippet of code from RubenVerg, who is a genius when it comes to tacit coding in BQN.

in←•file.Chars "input/8.txt"

P←(¬-˜⊢×·+`»⊸>)⊸⊔
OutOfBounds←∨´(0>⊢)∾≢⊸≤

Parse←>' '⊸<⊸P

Part1←{𝕊grid: ≠¬∘(grid⊸OutOfBounds)¨⊸/⍷∾⥊{(𝕨(≢∧=○(⊑⟜grid)∧'.'≠grid⊑˜⊢)𝕩)/⋈𝕩-˜2×𝕨}⌜˜⥊↕≢ grid}Parse

(Holy smokes, my formatter actually supports BQN!)

Part1 is a function that is composed with Parse. So it will Parse the input and the result will be bound to grid inside the curly brackets.

I have my doubts that anyone can read this code. Rather, you can reverse engineer it by breaking it down into smaller pieces and understanding each piece. But it's not easy to read, even if you understand what all the symbols mean.

Tacit coding

According to the BQN documentation:

Tacit programming (or "point-free" in some other languages) is a term for defining functions without referring to arguments directly, which in BQN means programming without blocks.

The idea of tacit coding is kind of cool. You basically avoid applying functions and instead compose and otherwise modify them.

BQN has a function composition operator ∘ just like you would imagine. But a lot of tacit code uses trains. For a pretty poor introduction to trains, you can view this page. But let me spell out the basics.

A 2-train is two adjacent functions, and by definition fg evaluates to f∘g (f composed with g). In other words, to evaluate fg on an input 𝕩, we could use f(g(𝕩)). (A BQN programmer would never write that and would instead just use f g 𝕩.)

3-trains, which consist of three adjacent functions, are where things get fun. Again, by definition, evaluating fgh on input 𝕩 evaluates to (f 𝕩) g (h 𝕩). This is not very intuitive, but it's useful for a couple reasons:

  • 𝕩 appears twice, so you can use it to avoid writing a long expression multiple times
  • There's a fork, so you can combine two different behaviors
  • g acts as a summarizer or combiner

Here's an example I used in my solution to Day 9:

(⊣×(↕≠)) arg

The 3-train consists of ⊣, ×, and (↕≠). In my program, arg is actually an extremely long expression, and I did not want to write it twice. Let's expand the train:

(⊣ arg) × (↕≠ arg)

⊣ is the identity function, so it returns arg. × is multiplication. ≠ returns the length of its (right) argument, and ↕ returns the list of numbers from 0 to one less than its argument. So this train multiplies each element in arg (it's an array) by its index. Pretty cool, huh?

The trouble is that when reading and writing BQN code, it can be difficult to identify trains. I've been getting better, but I still find myself inserting a ∘ whenever my code doesn't work, since function composition will "stop" a train from forming when it wasn't intentional. Now look at RubenVerg's code above and think about all the trains. Even if you understand the symbols, it's not easy. This is very much a learned skill!

Here's a very basic example of how parsing influences trains. BQN evaluates from right to left. So if you write fgh 𝕩, that actually means f(g(h(𝕩))) and there is not a train. But (fgh) 𝕩 is completely different and fgh is a 3-train. Now again, look at RubenVerg's code and try to figure out the implied parentheses. Good luck!

Documentation

I found BQN's documentation to be very thorough, but not very beginner friendly. I think it's written by an expert for other experts. In some cases, it seems to pontificate and misses basic definitions. For example, the trains page doesn't directly define 2- and 3-trains. You can probably figure out the definition from the examples, but it's not ideal.

On the plus side, many documentation pages feature very intuitive diagrams. See the below diagram in Group for an example.

Cool features

There are some cool features in BQN. I'm not going to cover all of them, but here are ones that stand out based on my programming career.

Group ⊔

The Group ⊔ operator is pretty nifty. Here is a nice diagram that intuitively depicts an example.

Group

In BQN, ๐•จ is the left argument and ๐•ฉ is the right argument. Usually ๐•ฉ is some existing structure you want to analyze, and ๐•จ is a list of indices that you construct to define the groupings you want. If you want elements to be placed in the same group, you assign them the same index.

This is a powerful capability. For example, in today's AOC problem, there was a string like 00...111...2...333.44.5555.6666.777.888899, where group 0 had size 2, group 1 had size 3, and so on. One easy way to determine the size of each group in BQN is using Group ⊔. If you first change each . to a ¯1, you can use the same array as both arguments with ⊔˜ to get the groupings, and ≠¨⊔˜ to get the length of each grouping. (≠ means length, and ¨ modifies the function to its left so that it applies to each element of the array to its right.)
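
If the BQN is hard to read, here is a rough Python sketch of the same trick; the group helper is mine, and I use -1 where BQN would use ¯1.

# A rough Python analogue of Group ⊔ for computing group sizes; helper names are mine.
def group(indices, values):
    """Collect values[i] into bucket indices[i]; an index of -1 drops the element."""
    buckets = [[] for _ in range(max(indices) + 1)]
    for idx, val in zip(indices, values):
        if idx != -1:
            buckets[idx].append(val)
    return buckets

s = "00.111.2"                                 # a shortened version of the input
ids = [-1 if c == "." else int(c) for c in s]
print([len(g) for g in group(ids, ids)])       # like ≠¨⊔˜ : group sizes [2, 3, 1]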

Under ⌾

Under ⌾ is an interesting capability that is a bit tricky to explain. Here is the official explanation from the documentation:

The Under 2-modifier expresses the idea of modifying part of an array, or applying a function in a different domain, such as working in logarithmic space. It works with a transformation 𝔾 that applies to the original argument 𝕩, and a function 𝔽 that applies to the result of 𝔾 (and if 𝕨 is given, 𝔾𝕨 is used as the left argument to 𝔽). Under does the "same thing" as 𝔽, but to the original argument, by applying 𝔾, then 𝔽, then undoing 𝔾 somehow.

So to restate, there is a transformation or selection operation, 𝔾, and a modification transformation, 𝔽. There are different applications, but I always used this to transform or change part of an array. In that case, 𝔾 might be a filter, and 𝔽 describes how you want to change the array.

Here's an example from today's AOC again:

ReplaceNegWithNegOne ← {¯1¨⌾((𝕩<0)⊸/) 𝕩}

I'm not going to try to explain all the syntax. But 𝔾 is ((𝕩<0)⊸/); this says to filter 𝕩 so that only elements less than 0 remain. 𝔽 is ¯1¨, which means return negative one for each argument. So, put together, replace negative elements with negative one. I then used the resulting array as an index to Group ⊔, which ignores any element with an index of negative one.

This is kind of neat because values are immutable in BQN, and Under provides an efficient way to produce a value with just part of it changed. I assume that the implementation uses this to avoid making copies of the whole array.
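
To spell out what 𝔾 (the selection) and 𝔽 (the modification) do in that example, here is a simplified Python rendering of ReplaceNegWithNegOne; it only captures the mask-style use of Under, not the general "undo 𝔾" machinery.

# A simplified Python analogue of {¯1¨⌾((𝕩<0)⊸/) 𝕩}: apply F only where G selects.
def under_mask(select, modify, xs):
    """Apply modify to the elements chosen by select, leaving the rest alone."""
    return [modify(x) if select(x) else x for x in xs]

replace_neg_with_neg_one = lambda xs: under_mask(lambda x: x < 0, lambda _: -1, xs)
print(replace_neg_with_neg_one([3, -7, 0, -2]))   # [3, -1, 0, -1]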

Not So Cool Features

One thing that annoys me about BQN is that a number of basic functions are not built-in because they can be succinctly expressed.

For example, want to split a string? You'd better memorize

x0((⊢-˜+`×¬)∘=⊔⊢)y1 # Split y1 at occurrences of separators x0, removing the separators

Want to build a number from an array of digits? You can use

10⊸×⊸+˜´⌽d1 # Natural number from base-10 digits

These both come from BQNcrate, a repository of useful functions you could but probably don't want to derive yourself. I'd much rather see this in a standard library of some sort. Most of these are cool, and it's fun to see how they work. But when I'm actually coding, I don't want to look these up or try to derive them. I just want to say split the string by ' ' and move on.
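
For contrast, here is the entire "split a string" operation in the language I'm used to (Python):

# What I actually want to write when splitting a string.
print("split this string".split(' '))   # ['split', 'this', 'string']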

I don't think I'm alone. I've noticed that RubenVerg, for example, likes to use •ParseFloat to parse integers rather than

(10⊸×⊸+˜´∘⌽-⟜'0')d1 # Parse natural number from string

which doesn't exactly roll off the tongue.

Ed's Feelings on BQN

Unfortunately, I don't have fun programming in BQN. There, I said it. I've literally felt very stupid at times trying to figure out how to write a simple function.

BQN is challenging, and I like challenges. But it's a fine line. There is an intense gratification to stringing together a whole bunch of opaque symbols that very few other people can read. But it's also frustrating and demoralizing to spend hours trying to figure out how to solve a basic problem.

It's hard to say how much of this is just part of learning a new paradigm. I remember when I first learned OCaml as a graduate student and had to figure out how to think functionally and decode arcane type errors involving parametric polymorphism. At the time, it was hard (and probably not that fun, but I can't remember). Now it's second nature. Maybe BQN will become second nature if I stick with it.

Conclusion

I probably won't be using BQN for any real projects any time soon. But I haven't given up on it entirely. I may try to finish AOC 2024 in BQN. We'll see. Given the lack of fun I've been having, I can't say I'm extremely motivated to do so. So for now I'll be taking things one day at a time.

If you're curious about my BQN code, you can find my AOC 2024 solutions here.

One of the most exciting possibilities of AI and LLMs is agents: programs that allow LLMs to interact with various tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.

In this blog post, we'll take a look at how to build agents using LangChain. They'll work great using an OpenAI model. And then we'll try to run them locally using Ollama, using a variety of open models. And they will almost all fail miserably. They fail so badly that I created this blog post to convince myself I wasn't imagining things.

In a future blog post, we will examine why.

LangChain

LangChain is a framework that allows you to build LLM applications. Basically, it abstracts a bunch of different components (LLMs, vector stores, and the like) and allows you to focus on your application's logic. So, you might develop your application in LangChain while using a local LLM to run it, but then switch to Claude once you go to production.

Anyway, using LangChain to make a query is pretty simple.

!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage

from langchain_openai import ChatOpenAI

# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

response=zero_temp_gpt35.invoke("Hi!  What is your name?").content

import textwrap
print(textwrap.fill(response))
Hello! I am a language model AI assistant. How can I assist you today?

The beauty of LangChain is that the components are modular. We can replace the gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!

Building an Agent using LangGraph

LangGraph is the part of the LangChain ecosystem for building agents. It allows us to easily define new tools:

!pip install langgraph~=0.2.53
@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]

The @tool decorator automatically transforms the function into a schema that can be used by the LLM to decide whether to invoke the tool, and if so, how.

foobar.tool_call_schema.model_json_schema()
{'description': 'Computes the foobar function.',
 'properties': {'input': {'title': 'Input', 'type': 'integer'}},
 'required': ['input'],
 'title': 'foobar',
 'type': 'object'}

With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:

from langgraph.prebuilt import create_react_agent

def react_chat(prompt, model):
  agent_executor = create_react_agent(model, tools)

  response = agent_executor.invoke({"messages": [("user", prompt)]})
  return response['messages'][-1].content, response

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The result of evaluating foobar(30) is 32.

Yes! We did it, team! 🎉 We could change foobar to be a web search, a database lookup, or you name it.

Let's try a query that doesn't use a tool at all.

last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"
Hello! How can I assist you today?

Great. So, in theory, we have an agent that we can chat with and is able to call tools in order to help us out.

Ollama: Going Local

Now let's try to create a tool-wielding agent using an LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It will use any GPUs that you might have, but it's still usable even if you don't have any. After all, you're just performing inference, not training.

Here's an example of me running Llama 3.2 with Ollama on my work laptop.

root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?

>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!

>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)

You can find instructions on how to install Ollama on the Ollama webpage.

If you don't feel like installing anything, that's fine too. You can follow along with this notebook.

After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.

# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2

Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.

from langchain_ollama import ChatOllama

# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi!  What is your name?").content

print(textwrap.fill(response))
I don't have a personal name, but I'm an AI designed to assist and
communicate with users. I'm often referred to as a "language model" or
a "chatbot." You can think of me as a helpful computer program that's
here to provide information, answer questions, and engage in
conversation. What's your name?

Okay, looking good! This is not bad for a 3B parameter LLM that can easily run locally on our computer. Let's see how it does when driving our tool-wielding agent.

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The output of `foobar(30)` is 32.

🎉 Everything is working well so far. As one final check, let's ask the agent a question that has absolutely nothing to do with tools.

last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"
The input value 42 was doubled, resulting in 84.

So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.

import pprint
pprint.pprint(result)
{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'),
              AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}),
              ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'),
              AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}

We can see there are four messages:

  1. The HumanMessage is the user's message -- "Hi."
  2. In response, in the AIMessage, the LLM indicates that it would like to invoke a tool by setting the tool_calls field. (See the snippet after this list.)
  3. LangChain invokes the tool and records the result in the ToolMessage, which is given back to the LLM.
  4. The final AIMessage includes a written message for the user.
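
If you want to pull that spurious tool call out programmatically, here is a minimal sketch that walks the messages in the result we just printed; the AIMessage.tool_calls field is the thing to look at.

# Minimal sketch: find AI messages that requested tool calls in the conversation.
from langchain_core.messages import AIMessage

for msg in result['messages']:
    if isinstance(msg, AIMessage) and msg.tool_calls:
        print("Tool call requested:", msg.tool_calls)
# Tool call requested: [{'name': 'foobar', 'args': {'input': 42}, ...}]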

The problem, of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!

Ed's Really Dumb Tool-calling Benchmark ™️

I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low-hanging fruit in this area. (I am calling it a "benchmark" facetiously!)

Here are the questions:

  1. Can the react agent use a tool correctly when explicitly asked? (Yes is good.)
  2. Does the react agent invoke a tool when it shouldn't? (No is good.)
  3. Does the react agent lose the ability to answer questions unrelated to tools? (No is good.)
  4. Does the react agent lose the ability to chat? (No is good.)

Question 1: Can the react agent use a tool correctly when explicitly asked?

We'll use our example above to test this.

basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
  last_msg, _ = react_chat(basic_tool_question, model=model)
  return "32" in last_msg

Question 2: Does the react agent invoke a tool when it shouldn't?

We'll perform two simple tests to answer this question. We'll prompt the agent with both a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and a greeting, "Hello!" We'll then check whether the resulting conversation contains a ToolMessage, which indicates that the model chose to invoke a tool. By construction, neither of those prompts should induce a tool call.

from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
  _, result = react_chat(basic_arithmetic_question, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
  _, result = react_chat(greeting, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
  return q2a(model) and q2b(model)

Question 3: Does the react agent lose the ability to answer questions unrelated to tools?

To answer this, we'll ask the basic arithmetic question to the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, ideally, the agent and the underlying model should be able to solve the problem under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it because I'm such a nice guy. 😇

def q3a(model):
  result = model.invoke(basic_arithmetic_question)
  return "12243" in result.content

def q3b(model):
  last_msg, _ = react_chat(basic_arithmetic_question, model=model)
  return "12243" in last_msg

def q3(model):
  # q3a ==> q3b: If q3a, then q3b ought to be true as well.
  return not q3a(model) or q3b(model)

Question 4: Does the react agent retain the ability to chat?

To answer this, we'll greet the agent and attempt to determine if it responds properly. This is a little difficult to do in a comprehensive way.

basic_greeting = "Hi."

def q4(model):
  last_msg, _ = react_chat(basic_greeting, model=model)
  r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
  #if not r:
    #print(f"Debug: Not a greeting? {last_msg}")
  return r

Benchmark code

Here is code to run each experiment several times (n=10 by default) and average the results.

from tqdm.notebook import tqdm

def do_bool_sample(fun, n=10, *args, **kwargs):
  try:
    # tqdm here if desired
    return sum(fun(*args, **kwargs) for _ in (range(n))) / n
  except Exception as e:
    print(e)
    return 0.0

def run_experiment(model, name, n=10):
  do = lambda f: do_bool_sample(f, model=model, n=n)
  d = {
      "q1": do(q1),
      "q2": do(q2),
      "q3": do(q3),
      "q4": do(q4),
      "model": name
  }
  d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
  return d

def print_experiment(results):
  name = results['model']
  print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
  print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
  print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
  print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name):
  results = run_experiment(model, name)
  print_experiment(results)
  return results

Benchmarking Llama 3.2

Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.

llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}

As we saw above, Llama 3.2 is able to call functions (Q1), but does so even when it should not (Q2). Question 3 shows that even though it almost always decides to call a tool, this only stops it from answering basic questions about half of the time. It does, however, largely prevent it from being able to chat (Q4).

Benchmarking OpenAI's gpt-3.5-turbo and gpt-4o

Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.

gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0,
 'q2': 0.0,
 'q3': 0.9,
 'q4': 1.0,
 'model': 'gpt-3.5-turbo',
 'total': 2.9}

Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1), and unlike Llama 3.2, can still engage in chat (Q4). A bit surprisingly, however, it still invokes tools when it shouldn't (Q2). But it is smart enough to ignore their results when constructing its final response.

Let's try a newer model, gpt-4o.

gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}

GPT-4o nailed it! 👍

Benchmarking a Lot Of Ollama Models

Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.

ollama_models = [
    "hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
    "llama3.3:70b",
    "llama3.2:3b",
    "llama3.1:70b",
    "llama3.1:8b",
    "llama3-groq-tool-use:8b",
    "llama3-groq-tool-use:70b",
    "MFDoom/deepseek-v2-tool-calling:16b",
    "krtkygpta/gemma2_tools",
    "interstellarninja/llama3.1-8b-tools",
    "cow/gemma2_tools:2b",
    "mistral:7b",
    "mistral-nemo: 12b",
    "interstellarninja/hermes-2-pro-llama-3-8b-tools",
    "qwq:32b",
    "qwen2.5-coder:7b",
    ]

all = []

for m in ollama_models:
  print(f"Downloading model: {m}...")
  !ollama pull {m} 2>/dev/null
  print("done.")
  r = run_and_print_experiment(ChatOllama(model=m), m)
  !ollama rm {m}
  all.append(r)
  print(r)
Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
deleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}
Downloading model: llama3.3:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
deleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}
Downloading model: llama3.2:3b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
deleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}
Downloading model: llama3.1:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
deleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}
Downloading model: llama3.1:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
deleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}
Downloading model: llama3-groq-tool-use:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
deleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}
Downloading model: llama3-groq-tool-use:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
deleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}
Downloading model: MFDoom/deepseek-v2-tool-calling:16b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
deleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}
Downloading model: krtkygpta/gemma2_tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
deleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}
Downloading model: interstellarninja/llama3.1-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
deleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}
Downloading model: cow/gemma2_tools:2b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
deleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}
Downloading model: mistral:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
deleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}
Downloading model: mistral-nemo: 12b...
done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
Error: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}
Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
deleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}
Downloading model: qwq:32b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
deleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}
Downloading model: qwen2.5-coder:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
deleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])
Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b',
  'q1': 1.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 3.9},
 {'model': 'llama3-groq-tool-use:8b',
  'q1': 1.0,
  'q2': 0.8,
  'q3': 1.0,
  'q4': 1.0,
  'total': 3.8},
 {'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
 {'model': 'qwen2.5-coder:7b',
  'q1': 1.0,
  'q2': 0.4,
  'q3': 0.8,
  'q4': 1.0,
  'total': 3.2},
 {'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S',
  'q1': 0.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 2.9}]

There's a lot going on here. But I have a few general observations:

  1. Most models do really poorly. The average "score" was about 2.3 out of a possible 4.0. And even a perfect score of 4.0 is really the bare minimum of what I would expect a decent model to do. And most of the models I tested claim to "support" tool calling.

  2. Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!

  3. I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.

Berkeley Function-Calling Leaderboard

There are several benchmarks that test tool use in LLMs. But most of them are not designed to test tool usage as part of an interactive agent. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark that tests this type of behavior. And it was only added in September 2024 as part of BFCL V3.

Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.

[Screenshot: BFCL V3 leaderboard showing multi-turn accuracy]

Other Reports

YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, he found that Llama 3.3 did not have the same problem, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.

So What Is The Problem?

Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?

We'll explore the answer in a future blog post. Stay tuned!

Edward J. Schwartz | Computer Security Researcher | 1 min. read

I recently signed up for BlueSky. I just learned of a new service, EchoFeed, that polls RSS feeds and posts the content to BlueSky (and elsewhere). So, this is a test.

Will it post to BlueSky? Will fed.brid.gy mirror it to Mastodon? The suspense is killing me!

Edward J. Schwartz | Computer Security Researcher | 1 min. read

This page documents my experience with "pressure washing" my vinyl fence and siding. I have pressure washing in quotes because it's SH, or sodium hypochlorite (bleach), that does the bulk of the work. Pros often call this "soft washing".

For vinyl fence soft washing, you want around 1-2% SH. Most household bleach is 6% SH, so if you mix 1 part bleach with 5 parts water, you'll get around 1% SH.

You also want to use a surfactant to help the mixture stick to the fence. I used Dawn Ultra. Some people claim that certain dish soaps will cause a bad reaction with the bleach, ranging from "mustard gas" to neutralizing the bleach.

I personally found that at 1-1.5% SH, the mixture was safe to use around grass. I wet the grass before and after applying the mixture, and I didn't see any damage.

Supplies

Recipe

  1. Add 3 cups of 6% SH bleach
  2. Add 0.8 gallons of water
  3. Add 2 fl. oz. of Dawn Ultra

Make sure to put the soap in last, or your mixture will foam up and overflow the sprayer when you try to close it.
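
If you want to double-check the strength of this mix, here is the back-of-the-envelope arithmetic (assuming 6% SH bleach, and 16 cups and 128 fl. oz. per gallon); it works out to roughly 1.1% SH.

# Back-of-the-envelope check of the recipe's SH concentration.
bleach_gal = 3 / 16           # 3 cups of bleach (16 cups per gallon)
water_gal = 0.8               # water
soap_gal = 2 / 128            # 2 fl. oz. of Dawn Ultra (128 fl. oz. per gallon)
sh_gal = 0.06 * bleach_gal    # the bleach is ~6% sodium hypochlorite

total_gal = bleach_gal + water_gal + soap_gal
print(f"{100 * sh_gal / total_gal:.2f}% SH")   # ~1.12%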

Spray the mixture on the fence, let it sit for about five minutes, and then rinse it off. You can use a garden hose, but I personally found that using a Ryobi One+ EZ-Clean worked better. I'm sure a pressure washer would have been even faster, but it is less convenient to use.

That's about it. This removed most of the staining.

For some areas that had large amounts of growth, I used a Ryobi Scrubber to physically remove it before spraying.

The bleach was not able to remove all stain spots. For those remaining spots that were in conspicuous places, I used a magic eraser / melamine sponge.

Before
After
Edward J. Schwartz | Computer Security Researcher | 1 min. read

At some point, I hope to create a Notes section on my website that will turn Markdown files into a list of notes. This is basically how the blog works. But, I'm kind of busy. And since Gatsby seems like it's dead, I'm not sure that I want to invest a whole lot of time into it. (Although putting the notes in markdown seems like a good idea for compatibility.)

Anyway, here is my first very short note on Profiling.

SpeedScope

SpeedScope is an awesome tool for visualizing profiler output. It has a flame graph view that is wonderful. I also like to use the Sandwich view, sorting by total time and simply looking for the first function that I recognize. This is often the culprit.

The documentation is pretty good. It also shows how to record profiles in compatible formats for most platforms. I mostly use py-spy and perf.

Java

The one notably missing platform is Java! Luckily, it's not too hard to convert Java's async-profiler output to a format that SpeedScope can read. Here's how I do it:

  • Download async-profiler
  • Create an output file in collapsed format. You can do this in several ways, such as:
      • ./asprof start -i 1s Ghidra followed by ./asprof stop -o collapsed -f /tmp/out.prof.collapsed Ghidra
      • ./asprof collect -d 60 -o collapsed -f /tmp/out.prof.collapsed Ghidra
  • Then open out.prof.collapsed in SpeedScope.
Screenshot of profiling Ghidra

The collapsed format takes a while to parse, so it might be worth it to export the native SpeedScope format.

Edward J. Schwartz | Computer Security Researcher | 5 min. read

Fostering

My wife and I started fostering rescue dogs mostly by accident. We adopted a Shih-tzu mix, who seemed completely relaxed when we met her at the rescue. When we got home, we eventually figured out that she was terrified, and freezing was her coping mechanism. Once she got a little more comfortable, she started hiding from us in the house. It took many months, but Molly eventually warmed up to us and we gained her trust.

Our first rescue dog, Molly

As frustrating an experience as it is to have a dog be absolutely terrified of you for no reason, it was also incredibly rewarding to see her come out of her shell. We decided that we wanted to help other dogs in similar situations, so we started fostering. It's been several years, and we've fostered dozens of dogs. For a long time, we fostered for A Tail to Tell, which unfortunately recently closed. More recently, we have been fostering for Lucky Dawg Animal Rescue.

Blanche

This week, however, we had a "first", and not a good one. We picked up a new foster dog, Blanche, on Sunday. We have a nice fenced-in yard, and we immediately took Blanche out back into the yard. She was very skittish, and we gave her some space. She immediately dove into a row of large evergreen trees in our yard and hid. Eventually, I had to go in and carry her out, which was not a simple task given the size of the trees.

Blanche

The next morning, I let Blanche out and saw her run into the same trees. It was very hot, and after about an hour, I started to grow concerned and went out to look for her. She was not in the same tree as last time, but I figured she was hiding in another one. There are several trees, and it is fairly difficult to see into them. I started exhaustively searching the trees, and I couldn't find her. I also found a part of our fence that was slightly pushed out, as if something had forced its way out. She had escaped.

The hole

The Search: Monday (Day One)

My wife and I were very upset, but we shifted into action. We reported Blanche as missing on PawBoost and our community Facebook page, called the local police department, and notified our rescue, Lucky Dawg Animal Rescue. We quickly began receiving sighting reports of Blanche. She was initially seen at approximately 10am in a wooded area next to a busy road. I went to the area and searched for her while my daughter and wife started talking to neighbors and handing out flyers. I saw no sign of her in the wooded area.

We spent the rest of the day trying to put up posters on telephone poles, which is harder than it seems! Tape doesn't adhere very well to dirty telephone poles. The trick is to tape all the way around the pole so that the tape sticks to itself.

We received another sighting report at 5pm, this time on the other side of the busy road. The report was of a dog "playing/fighting" with a chicken. Fortunately, I knew which house this was at from the description. Around the same time, two members of our rescue arrived to help. We went to the house and I talked to the owner, who revealed that Blanche had attacked her chicken. She chased Blanche off, and Blanche ran into the woods. While I was talking to the owner, our rescue members saw Blanche in a large field nearby. Blanche spotted them, and ran into a wooded area near a creek. I was able to reach the other side of the wooded area, but the experts decided that it would be better to leave a food station for Blanche so she stayed in the area rather than try to chase her.

The Search: Tuesday (Day Two)

The next morning we received a few sightings of Blanche near our neighborhood again. More surprisingly, my wife left our fence gate open and saw Blanche sniffing around the fence around 8 am. Unfortunately, Blanche ran off. It was very hot, and Blanche presumably slept during the day.

That evening, the rescue returned and put out a trap with lots of food, and a remote camera to monitor it. We cranked up the volume on my phone so that every time the camera detected movement, we would wake up. We were woken up several times, but it was mostly false positives. At 1:59am, a cat wandered by. At some point, the cat triggered the trap, and at 2:28am we were greeted by a picture of Blanche studying the cat in the trap.

Cat in the trap

I snuck outside, and saw Blanche eating food about 10 feet away from me. It was frustrating to be so close but not be able to do anything. But our rescue members told us that it was safer to make her feel safe and comfortable with the trap. She clearly enjoyed the food that we had put out for her. She would be back.

After Blanche finished eating, we freed the cat from the trap and added more food, but we didn't see Blanche again that night. Stupid cat!

The Search: Wednesday (Day Three)

The next morning, Blanche was sighted in many of the same places, including near our house. Unfortunately, she was also seen crossing the road again. We were worried that a car would hit her. She slept during the day again. In the evening, there was heavy rain, and we decided to wait until after the rain to put the trap out. Naturally, Blanche showed up during the rain, and we missed an opportunity to catch her. Fortunately, my wife had put out a couple of pieces of food, so she didn't leave empty-handed.

Blanche in the rain

We armed the trap again and waited. She came back around 9:43pm and began investigating the trap. She was very cautious, and decided to yank out the towel on the bottom of the trap. She stayed for a very long time, as we waited in suspense for her to trigger the trap. Eventually she entered the trap, but it didn't trigger for some reason. It was frustrating, but not the end of the world. She would begin to think the trap was a safe source of food.

Blanche investigating the towel she had removed

Blanche stayed around for a long while, but eventually left. We examined the trap. When she yanked the towel, it actually disarmed the trap without activating it. We fixed the trap and waited again, hoping she would be back later that night.

At 2:05am, she came back and began to investigate the trap again. We waited for what seemed like forever. At 2:15am, we received this picture of her deep in the trap.

Blanche in the trap

The camera we were using would take a burst of three pictures every time it detected motion. After that picture, we didn't receive any more pictures. This could mean that she was in the trap, or that the trap had not triggered and she had left. We waited for a while, and then snuck outside to check. We had caught her! We carried the trap inside and put her back into her exercise pen, and finally got some sleep.

The Aftermath

Blanche is not very happy to be back inside, but she ate, drank, and is safe. She hasn't been very lively yet, but this is not uncommon with mill rescue dogs, who often need a few days before they start to interact and show personality. Hopefully she'll quickly realize that we're not so bad, and that she is safe and sound in our house.

We have a long list of interesting stories from our years fostering rescue dogs, but this was certainly one of the more interesting and stressful ones. On the positive side, we met a lot of our neighbors, and we were pleasantly surprised by how helpful and supportive they were, without exception. Many people wanted to help in whatever way they could. We really live in a cool little community.

Edward J. Schwartz | Computer Security Researcher | 2 min. read

In my last post I talked about how I have been using Ansible for my new laptop configuration, and shared my configuration for notion.

So far, I've been extremely happy with using Ansible for configuring my machine. Prior to using Ansible, I'd spend a fair amount of time creating detailed notes that described what I did. I estimate that creating Ansible recipes takes about the same time as keeping good notes, and maybe even less. That's because there are many existing roles for common settings and software that can be reused. As with any ecosystem, the quality of such roles varies.

The big difference between my notes and Ansible, though, is that Ansible playbooks can be played in minutes, whilst manually following my notes can take hours to set up an entire new machine. I used to dread the idea of configuring a new machine. But now it's fairly effortless.

I just publicly shared my Ansible configuration. I don't expect that anyone will use my configuration as is, any more than I expect anyone to use my notion configuration! I'm extremely opinionated and picky. But I do hope that it might give people some ideas, like how to install llvm, nvidia drivers and so on. I know I personally found other people's repositories to be helpful.

In a very similar vein, I've started using dorothy, which claims to allow you to "... bring your dotfile commands and configuration to any shell." Since I usually but not always use fish, I've always been hesitant to write my own commands in fish. Plus, I have been writing bash scripts for long enough that I'm decent at it, so it tends to be one of my go-tos. Dorothy makes it easy to define variables and commands in such a way that they magically appear in all shells. (Again, this is very useful for fish, which is not a POSIX-compliant shell.) There's also a fair number of useful built-in commands. Dorothy encourages users to split their dotfiles into public and private portions, and you can view my public dotfile here. Specifically, here are my custom commands. Some of these might be useful, such as setup-util-ghidra and setup-util-ghidrathon. I've found that having a designated spot for these types of utility commands encourages me to write them, which ultimately saves me time. Usually.

Edward J. Schwartz | Computer Security Researcher | 3 min. read

Sometime while I was in graduate school, I started using the notion window manager. (Actually, at the time, I think it was ion3.) Notion is a tiling window manager that is geared towards keyboard usage instead of mouse usage. But let's be honest: I've been using notion for so long that I simply prefer it over anything else.

Notion, like most minor window managers, is a bit spartan. It does not provide a desktop environment. It really just manages windows. There are some features of a desktop environment that I don't need, such as a launcher. I know all the commands that I use; I don't need a GUI to list them for me. But it's often the little things that get you, such as locking the screen, or using the media keys on your keyboard to adjust the volume. I used to be (more of) a hardcore nerd and relished my ability to craft a super-complex .xsessionrc file with all kinds of bells, whistles, and utilities connected as if with duct tape. But as I grow older, sometimes I just want my computer to work.

For a long while now, I've found that running notion alongside GNOME for "desktop stuff" works pretty well. For a long time, I followed an old wiki post about how to combine GNOME with Awesome WM. This worked really well with GNOME 2.

Many people say that you can't use GNOME 3 with window managers other than GNOME Shell. I've actually had pretty good luck copying the Ubuntu gnome-session and replacing GNOME Shell with notion. The Awesome WM wiki post above also shows how to do it. Unfortunately, I've found that some features do not work, such as the keyboard media keys, much to my dismay. Do media keys matter that much? Yes, yes, they do. This apparently broke when GNOME Shell started binding the media keys instead of gnome-settings-daemon. There used to be a gnome-fallback-media-keys-helper utility around that would simulate this behavior, but it seems to have disappeared.

As I was trying to fix this problem, I came across a blog post and an unrelated but similar GitHub repo, both describing how to use the i3 window manager with GNOME. TLDR: GNOME Flashback is a still-supported variant of GNOME that is explicitly designed to support third-party window managers. Whereas GNOME Shell incorporates both the window manager and other stuff such as handling media keys, GNOME Flashback has the "other stuff" in a separate component that is designed to be used with a window manager such as metacity. But it works just fine with notion! Best of all, both my media keys and screen locking work. Hurray!

Because I hate setting up stuff like this, I've actually been hard at work packaging up my Linux computer configuration into reusable ansible components. It takes a little longer than doing it manually of course, but it's not too bad and it's pretty easy to read. I'm making my notion role available here in case anyone wants to try out my setup. Most of the logic is here if you are curious what is involved. Below are a few snippets to show how Ansible makes it relatively easy to manipulate configuration files.

# Same thing, but for gnome-flashback

- name: Copy gnome-flashback-metacity.session to notion-gnome-flashback.session
  copy:
    src: /usr/share/gnome-session/sessions/gnome-flashback-metacity.session
    dest: /usr/share/gnome-session/sessions/notion-gnome-flashback.session

- name: 'notion-gnome-flashback.session: Change metacity to notion and add stalonetray'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: 'metacity'
    replace: notion;stalonetray

- name: 'notion-gnome-flashback.session: Remove gnome-panel'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: ';gnome-panel'

- name: Symlink systemd target for notion-gnome-flashback session to gnome-flashback-metacity session
  file:
    src: /usr/lib/systemd/user/gnome-session@gnome-flashback-metacity.target.d
    dest: /usr/lib/systemd/user/gnome-session@notion-gnome-flashback.target.d
    state: link

- name: Install gconf override for notion-gnome-flashback
  copy:
    src: notion-gnome-flashback.gschema.override
    dest: /usr/share/glib-2.0/schemas/01_notion-gnome-flashback.gschema.override
  notify: Compile glib schemas
- name: Set META
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--META='
    line: META="Mod4+"
    backup: true
- name: Set ALTMETA
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--ALTMETA='
    line: ALTMETA="Mod1+"
    backup: true
- name: Disable mod_dock
  lineinfile:
    path: /usr/local/etc/notion/cfg_defaults.lua
    state: absent
    line: 'dopath("mod_dock")'
    backup: true
- name: Enable mod_statusbar
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--dopath("mod_statusbar")'
    line: 'dopath("mod_statusbar")'
    backup: true
