One of the most exciting possibilities of AI and LLMs is agents: programs that let an LLM call external tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.
In this blog post, we'll take a look at how to build agents using LangChain. They'll work great with an OpenAI model. Then we'll try to run them locally using Ollama, with a variety of open models, and almost all of them will fail miserably. They fail so badly that I created this blog post to convince myself I wasn't imagining things.
In a future blog post, we will examine why.
LangChain is a framework for building LLM applications. Basically, it abstracts a bunch of different components like LLMs, vector stores, and the like, and lets you focus on your application's logic. So, you might develop your application in LangChain against a local LLM, and then switch to Claude once you go to production.
Anyway, using LangChain to make a query is pretty simple.
!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage
from langchain_openai import ChatOpenAI
# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
response=zero_temp_gpt35.invoke("Hi! What is your name?").content
import textwrap
print(textwrap.fill(response))
Hello! I am a language model AI assistant. How can I assist you today?
The beauty of LangChain is that the components are modular. We can replace the gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!
LangGraph is the part of the LangChain ecosystem for building agents. It allows us to easily define new tools:
!pip install langgraph~=0.2.53
@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]
The @tool decorator automatically transforms the function into a schema that the LLM can use to decide whether to invoke the tool and, if so, how.
foobar.tool_call_schema.model_json_schema()
{'description': 'Computes the foobar function.', 'properties': {'input': {'title': 'Input', 'type': 'integer'}}, 'required': ['input'], 'title': 'foobar', 'type': 'object'}
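This schema is what ultimately gets handed to the chat model. As a minimal sketch of how that works, LangChain chat models expose a standard bind_tools method that attaches the schemas and lets the model emit tool calls. (This is only an illustration of the mechanism, and the printed output below is an example, not captured from a real run.)

# Bind the tool schemas to the model and let it decide whether to call a tool.
llm_with_tools = zero_temp_gpt35.bind_tools(tools)
msg = llm_with_tools.invoke("Please evaluate foobar(30)")
print(msg.tool_calls)
# e.g. [{'name': 'foobar', 'args': {'input': 30}, 'id': '...', 'type': 'tool_call'}]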
With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:
from langgraph.prebuilt import create_react_agent
def react_chat(prompt, model):
    agent_executor = create_react_agent(model, tools)
    response = agent_executor.invoke({"messages": [("user", prompt)]})
    return response['messages'][-1].content, response
last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The result of evaluating foobar(30) is 32.
Yes! We did it, team! 🎉 We could change foobar to be a web search, a database lookup, or you name it.
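For example, here's a purely hypothetical sketch of what a richer tool could look like; lookup_order and its return value are invented for illustration, but any Python function with type hints and a docstring can be exposed the same way:

@tool
def lookup_order(order_id: str) -> str:
    """Looks up the shipping status of an order by its ID."""
    # A real implementation would query a database or an HTTP API here.
    return f"Order {order_id} shipped yesterday."

# tools = [foobar, lookup_order]  # the agent would then choose between them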
Let's try a query that doesn't use a tool at all.
last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"
Hello! How can I assist you today?
Great. So, in theory, we have an agent that we can chat with and is able to call tools in order to help us out.
Now let's try to create a tool-wielding agent using a LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It
will use any GPUs that you might have, but it's still usable even if you don't
have any. After all, you're just performing inference, not training.
Here's an example of me running Llama 3.2 with Ollama on my work laptop.
root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?
>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!
>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)
You can find instructions on how to install Ollama on the Ollama webpage.
If you don't feel like installing anything, that's fine too. You can follow along with this notebook.
After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.
# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2
Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.
from langchain_ollama import ChatOllama
# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi! What is your name?").content
print(textwrap.fill(response))
I don't have a personal name, but I'm an AI designed to assist and communicate with users. I'm often referred to as a "language model" or a "chatbot." You can think of me as a helpful computer program that's here to provide information, answer questions, and engage in conversation. What's your name?
Okay, looking good! This is not bad for a 3B-parameter LLM that can easily run locally on our computer. Let's see how it does when it's driving our tool-wielding agent.
last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The output of `foobar(30)` is 32.
🎉 Everything is working well so far. As one final check, let's ask the agent a question that has absolutely nothing to do with tools.
last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"
The input value 42 was doubled, resulting in 84.
So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.
import pprint
pprint.pprint(result)
{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'), AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}), ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'), AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}
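That's a lot to squint at. Here's a small sketch that walks the same result and flags which messages carry tool calls, just to make the structure easier to see:

from langchain_core.messages import AIMessage, ToolMessage

for i, msg in enumerate(result["messages"], start=1):
    kind = type(msg).__name__
    note = ""
    if isinstance(msg, AIMessage) and msg.tool_calls:
        note = f" -> wants to call {[tc['name'] for tc in msg.tool_calls]}"
    elif isinstance(msg, ToolMessage):
        note = f" -> tool returned {msg.content!r}"
    print(f"{i}. {kind}{note}")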
We can see there are four messages:

1. The HumanMessage is the user's message -- "Hi."
2. In the first AIMessage, the LLM indicates that it would like to invoke a tool by setting the tool_calls field.
3. The tool is invoked, and its output is returned in a ToolMessage, which is given back to the LLM.
4. The final AIMessage includes a written message for the user.

The problem, of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!
I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low-hanging fruit in this area. (I am calling it a "benchmark" facetiously!)
Here are the questions:

1. Can the react agent use a tool correctly when explicitly asked?
2. Does the react agent invoke a tool when it shouldn't?
3. Does the react agent lose the ability to answer questions unrelated to tools?
4. Does the react agent lose the ability to chat?

Question 1 asks whether the agent can use a tool correctly when explicitly asked. We'll use our example above to test this.
basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
    last_msg, _ = react_chat(basic_tool_question, model=model)
    return "32" in last_msg
Question 2 asks whether the agent invokes a tool when it shouldn't. We'll perform two simple tests to answer this. We'll prompt the agent with a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and with a greeting, "Hello!" We'll then check the response to see whether the model produced a ToolMessage, which indicates that it chose to invoke a tool. By construction, neither of these prompts should induce a tool call.
from langchain_core.messages import ToolMessage
basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"
def q2a(model):
    _, result = react_chat(basic_arithmetic_question, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
    _, result = react_chat(greeting, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
    return q2a(model) and q2b(model)
Question 3 asks whether the agent loses the ability to answer questions unrelated to tools. To answer this, we'll ask the basic arithmetic question to both the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, the agent and the underlying model should ideally be able to solve it under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it, because I'm such a nice guy. 😇
def q3a(model):
    result = model.invoke(basic_arithmetic_question)
    return "12243" in result.content

def q3b(model):
    last_msg, _ = react_chat(basic_arithmetic_question, model=model)
    return "12243" in last_msg

def q3(model):
    # q3a ==> q3b: if the bare model can answer (q3a), the agent should too (q3b).
    return not q3a(model) or q3b(model)
Question 4 asks whether the agent loses the ability to chat. To answer this, we'll greet the agent and attempt to determine whether it responds appropriately. This is a little difficult to do in a comprehensive way.
basic_greeting = "Hi."
def q4(model):
    last_msg, _ = react_chat(basic_greeting, model=model)
    r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
    # if not r:
    #     print(f"Debug: Not a greeting? {last_msg}")
    return r
Here is code to run each experiment several times (10 by default) and report the success rate.
from tqdm.notebook import tqdm
def do_bool_sample(fun, n=10, *args, **kwargs):
    try:
        # Wrap range(n) in tqdm here if a progress bar is desired.
        return sum(fun(*args, **kwargs) for _ in range(n)) / n
    except Exception as e:
        print(e)
        return 0.0
def run_experiment(model, name, n=10):
    do = lambda f: do_bool_sample(f, model=model, n=n)
    d = {
        "q1": do(q1),
        "q2": do(q2),
        "q3": do(q3),
        "q4": do(q4),
        "model": name,
    }
    d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
    return d
def print_experiment(results):
    name = results['model']
    print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
    print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
    print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
    print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")
def run_and_print_experiment(model, name):
    results = run_experiment(model, name)
    print_experiment(results)
    return results
Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.
llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}
As we saw above, Llama 3.2 is able to call functions (Q1) but does so even when it shouldn't (Q2). Question 3 shows that, even though it almost always decides to call a tool, this only stops it from answering basic questions about half the time. It does, however, almost always prevent it from being able to chat (Q4).
Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.
gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0, 'q2': 0.0, 'q3': 0.9, 'q4': 1.0, 'model': 'gpt-3.5-turbo', 'total': 2.9}
Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1) and, unlike Llama 3.2, can still engage in chat (Q4). Somewhat surprisingly, it still invokes tools when it shouldn't (Q2), but it is smart enough to ignore their results when constructing its final response.
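You can see this for yourself by re-running the greeting through the gpt-3.5-turbo agent and checking the transcript; this is just a quick sanity check assembled from pieces we already defined:

last_msg, result = react_chat(greeting, gpt35)
spurious = [m for m in result["messages"] if isinstance(m, ToolMessage)]
print(f"Spurious tool messages: {len(spurious)}")
print(last_msg)  # should still read like a normal greeting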
Let's try a newer model, gpt-4o.
gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}
GPT 4o nailed it! 👏
Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.
ollama_models = [
"hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
"llama3.3:70b",
"llama3.2:3b",
"llama3.1:70b",
"llama3.1:8b",
"llama3-groq-tool-use:8b",
"llama3-groq-tool-use:70b",
"MFDoom/deepseek-v2-tool-calling:16b",
"krtkygpta/gemma2_tools",
"interstellarninja/llama3.1-8b-tools",
"cow/gemma2_tools:2b",
"mistral:7b",
"mistral-nemo: 12b",
"interstellarninja/hermes-2-pro-llama-3-8b-tools",
"qwq:32b",
"qwen2.5-coder:7b",
]
all = []
for m in ollama_models:
    print(f"Downloading model: {m}...")
    !ollama pull {m} 2>/dev/null
    print("done.")
    r = run_and_print_experiment(ChatOllama(model=m), m)
    !ollama rm {m}
    all.append(r)
    print(r)
Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
deleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}

Downloading model: llama3.3:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
deleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}

Downloading model: llama3.2:3b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
deleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}

Downloading model: llama3.1:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
deleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}

Downloading model: llama3.1:8b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
deleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}

Downloading model: llama3-groq-tool-use:8b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
deleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}

Downloading model: llama3-groq-tool-use:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
deleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}

Downloading model: MFDoom/deepseek-v2-tool-calling:16b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
deleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}

Downloading model: krtkygpta/gemma2_tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
deleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}

Downloading model: interstellarninja/llama3.1-8b-tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
deleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}

Downloading model: cow/gemma2_tools:2b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
deleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}

Downloading model: mistral:7b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
deleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}

Downloading model: mistral-nemo: 12b... done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
Error: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}

Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
deleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}

Downloading model: qwq:32b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
deleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}

Downloading model: qwen2.5-coder:7b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
deleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])
Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b', 'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'total': 3.9},
 {'model': 'llama3-groq-tool-use:8b', 'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'total': 3.8},
 {'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
 {'model': 'qwen2.5-coder:7b', 'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'total': 3.2},
 {'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'total': 2.9}]
There's a lot going on here. But I have a few general observations:
Most models do really poorly. The average "score" was ~2.3, and even a perfect score of 4.0 is really the bare minimum of what I would expect from a decent model. This is despite the fact that most of the models I tested claim to "support" tool calling.
Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!
I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.
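If you'd like to eyeball all of the results rather than just the top five, here's an optional sketch that pivots the collected result dicts into a sorted table (this assumes pandas is available in the environment; it isn't used elsewhere in this post):

import pandas as pd

# `all` is the list of per-model result dicts collected in the loop above.
df = pd.DataFrame(all).set_index("model").sort_values("total", ascending=False)
print(df)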
There are several benchmarks that test tool use in LLMs, but most of them are not designed to test tool usage in an interactive agent setting. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark I found that tests this type of behavior, and it was only added in September 2024 as part of BFCL V3.
Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.
YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, though, he found that Llama 3.3 did not have the same problem, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.
Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?
We'll explore the answer in a future blog post. Stay tuned!