In Part 1, we showed that tool-calling agents built with open source LLMs and LangChain almost universally performed poorly. They exhibited strange behaviors, such as responding to "Hello." by making nonsensical tool calls. In this blog post, we will try to determine why that happened.

Rather than investigating every model at once, in this blog post I'm going to focus my effort on Llama 3.2.

Hidden: Setup

!pip install langgraph~=0.2.53 langchain-ollama langchain-huggingface python-dotenv
!pip install httpx==0.27.2 # temp
!apt-get install -y jq
debug = False
sample_size = 100
num_ctx = 8192
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage
@tool
def foobar(input: int) -> int:
    """Computes the foobar function on input and returns the result."""
    return input + 2

tools = [foobar]
from langgraph.prebuilt import create_react_agent

def react_chat(prompt, model):
  agent_executor = create_react_agent(model, tools)

  response = agent_executor.invoke({"messages": [("user", prompt)]})
  return response['messages'][-1].content, response

Here we install Ollama.

!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh

Make sure the Ollama server is running.

!ollama -v
!ollama ps 2>/dev/null || (setsid env OLLAMA_DEBUG=1 nohup ollama serve &)
!ollama pull llama3.2 2>/dev/null
basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
  last_msg, _ = react_chat(basic_tool_question, model=model)
  r = "32" in last_msg
  if not r and debug:
    print(f"q1 debug: {last_msg}")
  return r
from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
  _, result = react_chat(basic_arithmetic_question, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
  _, result = react_chat(greeting, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
  return q2a(model) and q2b(model)
def q3a(model):
  result = model.invoke(basic_arithmetic_question)
  return "12243" in result.content

def q3b(model):
  last_msg, _ = react_chat(basic_arithmetic_question, model=model)
  r = "12243" in last_msg
  if not r and debug:
    print(f"q3b debug: {last_msg}")
  return "12243" in last_msg

def q3(model):
  # q3a ==> q3b: If q3a, then q3b ought to be true as well.
  r = not q3a(model) or q3b(model)
  return r
def q4(model):
  last_msg, _ = react_chat(greeting, model=model)
  c1 = any(w in last_msg for w in ["Hi", "hello", "Hello", "help you", "Welcome", "welcome", "Greeting", "assist"])
  c2 = any(w in last_msg for w in ["None of the"])

  r = c1 and not c2
  #if not r:
  if debug: print(f"q4 debug: c1={c1} c2={c2} r={r} greeting? {last_msg}")
  return r
from tqdm.notebook import tqdm
from termcolor import colored

def do_bool_sample(fun, n=10, *args, **kwargs):
  try:
    # tqdm here if desired
    return sum(fun(*args, **kwargs) for _ in (range(n))) / n
  except Exception as e:
    print(e)
    return 0.0

def run_experiment(model, name, n=10):
  do = lambda f: do_bool_sample(f, model=model, n=n)
  d = {
      "q1": do(q1),
      "q2": do(q2),
      "q3": do(q3),
      "q4": do(q4),
      "n": n,
      "model": name
  }
  d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
  return d

def print_experiment(results):
  name = results['model']
  print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
  print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
  print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
  print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name, **kwargs):
  results = run_experiment(model, name, **kwargs)
  print_experiment(results)
  return results

Quantization

You probably know that modern neural networks can be pretty large, and that is why special GPUs with lots of memory are in high demand right now. So how are we able to run some of these models on our computers, which don't have these special GPUs, using Ollama?

One reason is that Ollama uses quantized models, which are numerically compressed to use less memory. For example, the original Llama 3.2-3B-Instruct model uses bfloat16 tensors, which require 16 bits to store each parameter. On Ollama's Llama 3.2 model page, you can see the quantization is listed as Q4_K_M. At a high level, this squeezes each 16-bit parameter down to 4 bits. And somehow it still works!
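To get a feel for the savings, here is a rough back-of-the-envelope calculation. It is only an illustration: Q4_K_M actually mixes quantization types and stores extra scale metadata, so the real file is somewhat larger than this naive estimate.

params = 3.2e9                 # roughly 3.2B parameters in Llama 3.2-3B-Instruct
bf16_gb = params * 2 / 1e9     # bfloat16: 16 bits = 2 bytes per parameter
q4_gb = params * 0.5 / 1e9     # ~4 bits = 0.5 bytes per parameter, ignoring overhead
print(f"bfloat16: ~{bf16_gb:.1f} GB, Q4: ~{q4_gb:.1f} GB")  # bfloat16: ~6.4 GB, Q4: ~1.6 GB

That is roughly a 4x reduction in memory.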

But does it really still work? Maybe quantization is why our tool-calling doesn't work?

One simple way to test this is to evaluate a quantized version versus a non-quantized version. Luckily, this repository on HuggingFace happens to have both quantized and non-quantized models in a format that Ollama can process. So we can evaluate both of them using Ed's Really Dumb Tool-calling Benchmark ™️ that I introduced in Part 1.

from langchain_ollama import ChatOllama

quant_models = [
  "hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16", # non-quantized
  "hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M", #quantized
]

for m in quant_models:
  print(f"Model: {m}")
  !ollama pull {m} 2>/dev/null
  r = run_and_print_experiment(ChatOllama(model=m, num_ctx=num_ctx), m, n=sample_size)
  print(r)
Model: hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.91
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.36
Question 4: Does the react agent lose the ability to chat? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16) success rate: 0.12
{'q1': 0.91, 'q2': 0.0, 'q3': 0.36, 'q4': 0.12, 'n': 100, 'model': 'hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:F16', 'total': 1.3900000000000001}
Model: hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.99
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.51
Question 4: Does the react agent lose the ability to chat? (hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M) success rate: 0.06
{'q1': 0.99, 'q2': 0.0, 'q3': 0.51, 'q4': 0.06, 'n': 100, 'model': 'hf.co/prithivMLmods/Llama-3.2-3B-Instruct-GGUF:Q4_K_M', 'total': 1.56}

Both models do poorly. In fact, the quantized version does slightly better.

Conclusion: Quantization is probably not the problem.

Prompt Templates

Behind the scenes, LLMs need to be prompted in a very specific format to work well. HuggingFace dubs this problem the silent performance killer. In response, they created "chat templates" which codify the format and live alongside the model to avoid any ambiguity. Note: I call these "prompt templates".
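As a concrete example, here is roughly how you would apply a chat template yourself with HuggingFace's transformers library. This is an illustrative sketch, not code from this notebook, and the model name is a placeholder for any instruct model whose repository you can access (Meta's Llama repositories are gated):

from transformers import AutoTokenizer

# Illustrative: substitute any instruct model you have access to.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
# The chat template renders the message list into the exact prompt string the
# model was trained on, including special tokens like <|start_header_id|>.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))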

We haven't had to worry about prompt templates at all, because Ollama has been taking care of templating for us. Maybe its prompt templates are problematic?

To test this theory, we're going to build some code to query Ollama but without using Ollama to format the prompt for us. There are two purposes for this:

  1. We will learn a bit how tool calling works and how it interacts with prompt templates. I have a suspicion that prompt templates have something to do with the problem.

  2. We will avoid a lot of code that has been hiding behind abstraction. This is the downside of abstraction: it makes things easier to build, but it's harder to understand where the blame might lie when something fails.

Llama 3.2 Based Prompt

Let's start by examining the prompt template recommended for Llama 3.2, which is this template for zero-shot function calling from the llama-models repository. We'll talk more about this later, but Meta actually publishes conflicting prompt templates in different locations! So to be clear, this is the llama-models Llama 3.2 prompt template.

Here is the example from that page:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the weather in SF and Seattle?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

It's worth adding a few notes here. Modern LLMs are implemented as chat models, which means that they expect a conversation in the form of a list of messages sent by various roles. The primary roles are the user and the assistant. But we can also see that there are system messages: hidden instructions sent to the LLM that tell it how to behave. In this prompt template, the system message also specifies how and when the LLM should interact with tools.

On the same page is an example of the format in which the model should respond:

[get_weather(city='San Francisco', metric='celsius'), get_weather(city='Seattle', metric='celsius')]<|eot_id|>
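As an aside, this pythonic response format is not hard to parse in Python itself. Here is a minimal sketch of how one might do it with the ast module; it is my own illustration, not Ollama's or Meta's actual parser, and it assumes special tokens like <|eot_id|> have already been stripped.

import ast

def parse_pythonic_tool_calls(text):
  """Parse a response like "[get_weather(city='SF')]" into (name, kwargs) pairs."""
  tree = ast.parse(text.strip(), mode="eval")
  calls = []
  for call in tree.body.elts:  # the response is a top-level list literal
    name = call.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    calls.append((name, kwargs))
  return calls

parse_pythonic_tool_calls("[get_weather(city='San Francisco', metric='celsius')]")
# [('get_weather', {'city': 'San Francisco', 'metric': 'celsius'})]

Keep this format in mind; it will come up again when we look at what Ollama can and cannot parse. Now, back to the Llama 3.2 prompt template.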

Let's code this up and try it out.

llama_32_example_funs = """[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]"""

# https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/text_prompt_format.md#input-prompt-format-1
def llama_32_prompt_template(user, funs=llama_32_example_funs):
  return """<|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

%s<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>""" % (funs, user)
print(llama_32_prompt_template("What is the weather in new york?", funs=llama_32_example_funs))
<|start_header_id|>system<|end_header_id|>

You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.

[
    {
        "name": "get_weather",
        "description": "Get weather info for places",
        "parameters": {
            "type": "dict",
            "required": [
                "city"
            ],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The name of the city to get the weather for"
                },
                "metric": {
                    "type": "string",
                    "description": "The metric for weather. Options are: celsius, fahrenheit",
                    "default": "celsius"
                }
            }
        }
    }
]<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the weather in new york?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Looks good. Now we need some code to send the request to the Ollama API.

import json

def build_ollama_request(full_prompt, model="llama3.2"):
  # "raw": true tells Ollama to use our prompt verbatim rather than applying
  # its own prompt template.
  return f"""{{
    "model": \"{model}\",
    "prompt": {json.dumps(full_prompt)},
    "raw": true,
    "stream": false
  }}
  """

def build_ollama_request_from_userprompt(user, template, funs=None, **kwargs):
  return build_ollama_request(template(user, funs), **kwargs)

def send_request(full_request):
  with open("req", "w") as f:
    f.write(full_request)
  !curl -s http://localhost:11434/api/generate --data-ascii "@req" | jq -M '.response'

def send_userprompt(user, **kwargs):
  request = build_ollama_request_from_userprompt(user, **kwargs)
  send_request(request)
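(If you would rather stay in Python than shell out to curl, a roughly equivalent helper using the requests library would look like the sketch below. The curl version above is what I actually ran.)

import requests

def send_userprompt_py(user, template, funs=None, model="llama3.2"):
  payload = {
      "model": model,
      "prompt": template(user, funs),
      "raw": True,      # bypass Ollama's built-in prompt template
      "stream": False,
  }
  resp = requests.post("http://localhost:11434/api/generate", json=payload)
  return resp.json()["response"]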
# A few sample prompts...
prompts = [
    "Use tools to find the weather in new york", # llama 3.2 example
    "Use tools to get latest trending songs", # llama 3.1 example
    basic_tool_question, # Please evaluate foobar(30)
    greeting, # Hello!
    basic_arithmetic_question # What is 12345 - 102?
]

def try_prompts(**kwargs):
  for p in prompts:
    print(f"Prompt: {p}\nResponse: ", end="")
    send_userprompt(p, **kwargs)
    print()

In the list of example prompts, there are three tool-calling examples. We will be passing along the get_weather tool definition. Obviously, common sense tells us that the model should not attempt to list trending songs or evaluate foobar(30) using get_weather. Let's see how it does.

for _ in range(1):
  try_prompts(funs=llama_32_example_funs, template=llama_32_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "\n\n[get_weather(city='New York')]"

Prompt: Use tools to get latest trending songs
Response: "\n\n[get_weather(city='trending songs', metric='') ]"

Prompt: Please evaluate foobar(30)
Response: "\n\n[]"

Prompt: Hello!
Response: "\n\nNothing to see here. Would you like to ask a question or request a function call?"

Prompt: What is 12345 - 102?
Response: "\n\n[]"

On Sunday, the weather is going to be sunny with a chance of rain in the legendary city of "Trending Songs".

Overall, this is pretty disappointing. The model appears overly eager to call tools, even when it makes no sense, such as calling get_weather on the city of "trending songs". Oops. It often responds unnaturally to "Hello!". And for the arithmetic and foobar questions, it returns an empty tool-call list rather than answering at all.

Llama 3.1 Based Prompt

Llama 3.2 is actually compatible with the Llama 3.1 prompt format for tool calling, so next let's try the llama-models Llama 3.1 prompt template. Below is the example from that page.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Answer the user's question by making use of the following functions if needed.
If none of the function can be used, please say so.
Here is a list of functions in JSON format:
{
    "type": "function",
    "function": {
        "name": "trending_songs",
        "description": "Returns the trending songs on a Music site",
        "parameters": {
            "type": "object",
            "properties": [
                {
                    "n": {
                        "type": "object",
                        "description": "The number of songs to return"
                    }
                },
                {
                    "genre": {
                        "type": "object",
                        "description": "The genre of the songs to return"
                    }
                }
            ],
            "required": ["n"]
        }
    }
}

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Use tools to get latest trending songs<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Notice that the Llama 3.2 and 3.1 prompt templates have very little in common!

Let's code up the Llama 3.1 prompt and test it out.

llama_31_example_funs = """{
    "type": "function",
    "function": {
        "name": "trending_songs",
        "description": "Returns the trending songs on a Music site",
        "parameters": {
            "type": "object",
            "properties": [
                {
                    "n": {
                        "type": "object",
                        "description": "The number of songs to return"
                    }
                },
                {
                    "genre": {
                        "type": "object",
                        "description": "The genre of the songs to return"
                    }
                }
            ],
            "required": ["n"]
        }
    }
}
"""

# https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/prompt_format.md#input-prompt-format-5
def llama_31_prompt_template(user, funs=llama_31_example_funs):
   return """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Answer the user's question by making use of the following functions if needed.
If none of the function can be used, please say so.
Here is a list of functions in JSON format:
%s

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>
""" % (funs, user)

Now we'll run the sample prompts again, but this time we'll use the Llama 3.1 example function trending_songs rather than get_weather. As before, common sense tells us that we can't use trending_songs to predict the weather or compute foobar(30). Let's see how it does.

for _ in range(1):
  try_prompts(funs=llama_31_example_funs, template=llama_31_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "I can't directly use the provided function to find the weather in New York as it is a location-based API and the given function is for getting trending songs, not weather information."

Prompt: Use tools to get latest trending songs
Response: "{\"type\": \"function\", \"name\": \"trending_songs\", \"parameters\": {\"n\": \"10\"}}"

Prompt: Please evaluate foobar(30)
Response: "Since there is no `foobar` function available, the answer is: None"

Prompt: Hello!
Response: "Hello! How can I assist you today?"

Prompt: What is 12345 - 102?
Response: "I'm not aware of any function that can perform this calculation. The functions provided only include the `trending_songs` function, which is used to retrieve trending songs based on a specific number of songs and genre. It does not include arithmetic operations like subtraction. If you need help with a different type of calculation, please let me know!"

These responses seem greatly improved compared to the Llama 3.2 prompt we tried. The response to a greeting is more natural. It also didn't do silly things like stuffing "trending songs" into a mismatched function call, as the Llama 3.2 prompt did with get_weather.

The only consistent problem I can see is that it didn't even try to answer the arithmetic question. Let's see if we can fix that by slightly tweaking the wording of the prompt with the following diff:

-Answer the user's question by making use of the following functions if needed.
-If none of the function can be used, please say so.
+Help and converse with the user.  If and only if the user asks a question that
+is relevant to one of the following functions, make use of them.  If none of
+the functions can be used, answer the query as best you can.
def llama_31_custom_prompt_template(user, funs=llama_31_example_funs):
   return """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

You are a helpful assistant.
<|eot_id|><|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
%s

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

%s<|eot_id|><|start_header_id|>assistant<|end_header_id|>
""" % (funs, user)

for _ in range(1):
  try_prompts(funs=llama_31_example_funs, template=llama_31_custom_prompt_template, model="llama3.2")
Prompt: Use tools to find the weather in new york
Response: "Since the provided function `trending_songs` is related to music and not weather, I'll answer your question as best I can.\n\nAccording to current information, New York's weather can be quite varied depending on the time of year. In general, winters are cold and snowy, while summers are hot and humid. Springs and autumns are generally mild.\n\nIf you're looking for more specific or up-to-date weather information, I recommend checking a reliable weather website or app, such as AccuWeather or Weather.com, which can provide more accurate and current forecasts for New York City and surrounding areas."

Prompt: Use tools to get latest trending songs
Response: "Here is a function call in JSON format that uses the `trending_songs` function to get the latest trending songs:\n\n```\n{\n  \"type\": \"function\",\n  \"name\": \"trending_songs\",\n  \"parameters\": {\n    \"n\": \"10\"\n  }\n}\n```"

Prompt: Please evaluate foobar(30)
Response: "Unfortunately, I don't have any information about a `foobar` function. Could you please provide more context or details about this function? Is it defined somewhere else that I can access? Or would you like me to try and infer what the `foobar` function might do based on its name?"

Prompt: Hello!
Response: "Hello! How can I assist you today?"

Prompt: What is 12345 - 102?
Response: "Since no specific function can be used for this question, I'll answer it directly:\n\nThe result of subtracting 102 from 12345 is 12343."

This seemed to work really well if you ignore the fact that it got the subtraction wrong 🤣

What prompt template is Ollama using?

So we found a prompt template that appears to be working well. Let's try to make Ollama use that prompt template. To start, let's investigate what prompt template it's using "by default".

def get_raw_prompt():
  # Scrape the raw prompts that Ollama (run with OLLAMA_DEBUG=1) logged to nohup.out.
  template = !cat nohup.out | fgrep "chat request" | sed -e 's/.*prompt="\(.*\)"/\1/'
  return [s.encode().decode('unicode_escape') for s in template]

!ollama pull llama3.2 2>/dev/null
!>nohup.out # Truncate ollama output
response = react_chat(greeting, model=ChatOllama(model="llama3.2", num_ctx=num_ctx))
print(get_raw_prompt()[0])
<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

When you receive a tool call response, use the output to format an answer to the orginal user question.

You are a helpful assistant with tool calling capabilities.<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{"type":"function","function":{"name":"foobar","description":"Computes the foobar function on input and returns the result.","parameters":{"type":"object","required":["input"],"properties":{"input":{"type":"integer","description":""}}}}}

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>


That doesn't look like either of the prompt templates we used before, which both came from the llama-models repository. After a bit of google-fu, we can see that it originated from the Llama 3.1 "JSON based" tool calling documentation on the llama website. But it's not the same Llama 3.1 prompt template that we used from the llama-models repository.

This raises a few questions:

  1. Why are there multiple prompt templates for Llama 3.1?

  2. Which prompt template is best?

  3. Why not use the prompt template for Llama 3.2, since we are using the Llama 3.2 model?

Let's start by answering the first two questions. In this GitHub issue, a user notes that there are at least three different prompt templates for Llama 3.1:

  1. The Llama 3.1 model website
  2. Meta's llama-models repository
  3. The HuggingFace model's chat template

A Meta employee states that the template in the llama-models repository is the correct one. Fortunately, that's what we have been using in this blog post. (It's almost as if I knew this in advance!) But Ollama has been basing their template on the one from the model website. That seems problematic!

The last question, "Why not use the prompt format for Llama 3.2?" is pretty easy to answer as well. Llama 3.2's default prompt format responds using a pythonic function call syntax that Ollama can't parse. And, as we saw above when we tested it manually, the Llama 3.2 prompt anecdotally did not seem to work well anyway.

Let's build our own Ollama prompt template

Now that we identified a prompt template that seems to work pretty well, let's try to make Ollama use it.

To start, let's look at the default prompt template for the llama3.2 model in Ollama. We already saw the instantiated prompt, but now let's look at the template that Ollama uses to build the prompts. You can see this below, or on the Ollama website here.

<|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023

{{ if .System }}{{ .System }}
{{- end }}
{{- if .Tools }}When you receive a tool call response, use the output to format an answer to the orginal user question.

You are a helpful assistant with tool calling capabilities.
{{- end }}<|eot_id|>
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>
{{- if and $.Tools $last }}

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{{ range $.Tools }}
{{- . }}
{{ end }}
{{ .Content }}<|eot_id|>
{{- else }}

{{ .Content }}<|eot_id|>
{{- end }}{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
{{ range .ToolCalls }}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}
{{- else }}

{{ .Content }}
{{- end }}{{ if not $last }}<|eot_id|>{{ end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- end }}
{{- end }}

It's not just you. It really is hard to read. The above code is written in the Go template language. I recommend using this interactive editor to better understand how the template language works if you are interested.

At a high level, the prompt template takes a sequence of messages and converts it into a prompt for the model. Some important notes:

  • ToolCalls indicate the calls the model wants to make. Ollama infers these by parsing the model's responses.
  • The tool role contains the output of an executed tool.
  • The user and assistant roles are self-explanatory!
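
To make this concrete, here is roughly the kind of message sequence that LangGraph hands to Ollama after our foobar tool has run; the template's job is to render it into a prompt like the one above. The exact fields and IDs are illustrative:

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

example_messages = [
    HumanMessage("Please evaluate foobar(30)"),
    # The model's tool call, as parsed by Ollama and wrapped by LangChain.
    AIMessage(content="", tool_calls=[{"name": "foobar", "args": {"input": 30}, "id": "call_1"}]),
    # The tool's output, which the template renders under the ipython (tool) role.
    ToolMessage(content="32", tool_call_id="call_1"),
]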

With a lot of trial and error in the interactive template editor, I converted our earlier template into the Ollama format:

<|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

{{ if .System }}{{ .System }}
{{- end -}}
<|eot_id|>{{ if .Tools }}<|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
{{- range $.Tools }}
{{ . }}{{ end }}

Return function calls in JSON format.<|eot_id|>{{ end }}
{{- range $i, $_ := .Messages }}

{{- $last := eq (len (slice $.Messages $i)) 1 }}
{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>

{{ .Content }}<|eot_id|>
{{- if $last }}<|start_header_id|>assistant<|end_header_id|>
{{ end }}
{{- else if eq .Role "assistant" }}<|start_header_id|>assistant<|end_header_id|>
{{- if .ToolCalls }}
<|python_tag|>{{- range .ToolCalls -}}
{"name": "{{ .Function.Name }}", "parameters": {{ .Function.Arguments }}}{{ end }}<|eom_id|>
{{- else }}
{{ .Content }}<|eot_id|>
{{- end }}
{{- else if eq .Role "tool" }}<|start_header_id|>ipython<|end_header_id|>

{{ .Content }}<|eot_id|>{{ if $last }}<|start_header_id|>assistant<|end_header_id|>

{{ end }}
{{- end }}
{{- end }}

Now the big question: does it actually work?

Does our new Ollama prompt template work as intended?

First, let's make a query and make sure that we get the right answer!

!>nohup.out # Truncate ollama output
!ollama pull ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized 2>/dev/null
response = react_chat(basic_tool_question, model=ChatOllama(model="ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized", num_ctx=num_ctx))
print(response[0])
assert "32" in response[0]
The result of the foobar function when called with 30 as input is 32.

Second, let's take a peek at the prompt we sent to the LLM.

print(get_raw_prompt()[-1])
<|start_header_id|>system<|end_header_id|>

Environment: ipython

Cutting Knowledge Date: December 2023
Today Date: 21 September 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Help and converse with the user.  If and only if the user asks a question that
is relevant to one of the following functions, make use of them.  If none of 
the functions can be used, answer the query as best you can.
Here is a list of functions in JSON format:
{"type":"function","function":{"name":"foobar","description":"Computes the foobar function on input and returns the result.","parameters":{"type":"object","required":["input"],"properties":{"input":{"type":"integer","description":""}}}}}

Return function calls in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Please evaluate foobar(30)<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<|python_tag|>{"name": "foobar", "parameters": {"input":30}}<|eom_id|><|start_header_id|>ipython<|end_header_id|>

32<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Looks good to me!

Alright, let's run it through Ed's Really Dumb Tool-calling Benchmark ™️. For comparison, we'll also run the original Llama 3.2 model in Ollama (llama3.2) and the unmodified Llama 3.1 tooling prompt.

models = [
  "llama3.2",
  "ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt", # based on the llama 3.1 tooling prompt
  "ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized", # our improved prompt
]

for m in models:
  print(f"Testing model: {m}")
  !ollama pull {m} 2>/dev/null
  r = run_and_print_experiment(ChatOllama(model=m, num_ctx=num_ctx), m, n=sample_size)
  print(r)
Testing model: llama3.2
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 0.97
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.55
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.09
{'q1': 0.97, 'q2': 0.0, 'q3': 0.55, 'q4': 0.09, 'n': 100, 'model': 'llama3.2', 'total': 1.61}
Testing model: ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt
Question 1: Can the react agent use a tool correctly when explicitly asked? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.15
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.51
Question 4: Does the react agent lose the ability to chat? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt) success rate: 0.49
{'q1': 1.0, 'q2': 0.15, 'q3': 0.51, 'q4': 0.49, 'n': 100, 'model': 'ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt', 'total': 2.15}
Testing model: ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized
Question 1: Can the react agent use a tool correctly when explicitly asked? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.99
Question 2: Does the react agent invoke a tool when it shouldn't? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.99
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 0.68
Question 4: Does the react agent lose the ability to chat? (ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized) success rate: 1.0
{'q1': 0.99, 'q2': 0.99, 'q3': 0.68, 'q4': 1.0, 'n': 100, 'model': 'ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized', 'total': 3.66}

The original Ollama prompt scored 1.61/4.0. The official Llama 3.1 tooling prompt scored 2.15/4.0, and my customized prompt scored 3.66/4.0. I'd say that is an improvement!

Coming Clean

I do want to come clean. I did not create that prompt template in one try. It actually took many days of experimenting and debugging.

(I find that this is often a dilemma when blogging. Keeping track of everything you did is difficult, time-consuming, and often not that interesting.)

Many things went wrong along the way, but here are a few issues that I remember:

  • Environment: ipython is supposed to be for enabling Llama's code interpreter, which we aren't using, but I wasn't able to get the Llama 3.1 prompt to work well without it.
  • Llama has some weird rules for when to use <|eom_id|> vs. <|eot_id|>, and I got them wrong. As a result, even though the initial prompt was correct, Llama could not "parse" the results to correctly build the final message.
  • In Llama 3.1 prompts, when the model responds with a tool call, it is supposed to be prefixed with <|python_tag|>. Somewhat oddly (in my opinion), existing Ollama prompts "rebuild" the tool call response from the parsed values, rather than using the original message. As a result, I had to add in the <|python_tag|> or the model would become confused and struggle to build the final message to the user.
  • In one template, I added an extra newline, and this caused a notable decrease in performance! Yes, models really are that sensitive.

Recap

Let's recap all the things we had to do to get to this point:

  1. We used the Llama 3.1 prompt template from the llama-models repository and NOT the one from the Llama 3.1 website, which is what Ollama's prompt was based on.

  2. We modified the wording of the prompt to improve its responses to non-tool-calls like greetings and arithmetic.

And this was all just to improve the performance of one single model. I'm tired, aren't you?

Conclusion

Prompts are definitely part of the reason why building tool calling agents did not work in Part 1. But HuggingFace raised the alarm about this a long time ago! So what went wrong?

What went wrong?

Now that tool-calling is becoming more popular, prompt templates must be considered a fundamental part of a model, just like the weights. The reason for this is simple: the model developer is the only entity who has a clear incentive to ensure that their model works as well as possible. Downstream consumers like Ollama do not have an incentive to make sure that prompt templates work as well as possible. Unfortunately for Llama, Meta did not treat the prompt templates as a fundamental part of the model. Meta did a poor job of documenting the prompt templates: the example-based documentation is vague, and the multiple conflicting sources of information further confused the issue. So right off the bat, Llama models are not bundled with a clear prompt template.

Ollama did not help the situation. Instead of adopting an existing template format such as HuggingFace's, they decided to roll their own format based on Go templates. On one hand, this is a natural decision since Ollama is written in Go. But now someone has to write a new prompt template for every model on Ollama. Currently, it is the Ollama developers themselves who are creating these prompt templates. But as I mentioned above, there is a concerning incentive mismatch: the Ollama developers don't have an incentive to determine the best prompt format for every model. Here is a stark example in which the Ollama developers provided unhelpful and misleading responses when users reported that models were making nonsensical tool calls:

Don't bind tools if you don't want [the Llama model to make] a tool call.

Of course, as evidenced by this blog post, the real problem was that the Ollama developers themselves chose a prompt template that performed poorly on tool-calling. (I don't mean to beat up on the Ollama project. I think it's a great project! But they didn't help themselves out in this area.)

What can we do about it?

Standards

We really do need a standard for prompt templates. HuggingFace's chat template is a good start, but it is not perfect. While it describes how to format messages that should be sent to the model, it doesn't define how to parse the model's responses, which is equally important. As an example, I suspect that a major reason why Ollama's Llama 3.2 model used a Llama 3.1 prompt template is because Ollama's tool-call parser does not support the pythonic format used in the Llama 3.2 prompts. Parsing tool-calls is currently very ad-hoc.
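To illustrate just how ad-hoc it is, here is a minimal sketch of the kind of response parsing that every runtime ends up writing by hand for the JSON-style format we saw earlier. This is my own illustration, not Ollama's actual code:

import json

def try_parse_json_tool_call(text):
  """Return (name, parameters) if the response looks like a JSON tool call, else None."""
  try:
    obj = json.loads(text.strip())
  except json.JSONDecodeError:
    return None  # ordinary chat text, not a tool call
  if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
    return obj["name"], obj["parameters"]
  return None

None of these conventions are captured by the chat template itself, so every runtime has to rediscover them.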

Another problem with HuggingFace's chat templates is language compatibility. I suspect that part of the reason why Ollama chose to use their own template format is convenience. HuggingFace's chat templates are based on Jinja2, which is a templating language for Python. But Ollama is written in Go. Perhaps we need a standard format that is more language agnostic.

Ollama

Ollama should either adopt the HuggingFace template format or create a tool that can convert HuggingFace templates to Ollama templates. The current system of manually converting templates is error-prone and harmful.

Ollama should also add information to their model cards about which prompt templates they adopted and why. For example, the Llama 3.2 model card does not mention that the prompt template is based on the Llama 3.1 prompt format, or why.

Benchmarks

Benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) could also be doing more about the prompt problem. For Llama, it appears that, similar to the Ollama developers, the BFCL developers have simply chosen a prompt and implemented it. Llama 3.1 and 3.3 appear to be based on a Llama HuggingFace chat template while other versions use a generic prompt.

We don't know how or why they selected these prompts. As with the Ollama developers, there is an incentive mismatch: they don't have an incentive or responsibility to experiment. Perhaps there should be more incentive for model developers to fix the prompt templates in order to score better on the BFCL, but it doesn't seem like that is how things work today. Honestly, I don't understand why; I would think that Meta would be embarrassed that the Llama-3.2-3B-Instruct model only scores 5.25% on the BFCL in Overall Multi Turn Accuracy.

Llama

The Llama developers really need to do a better job documenting their prompt templates. The example-based "documentation" is vague. And more critically, there shouldn't be conflicting information. Even after they were notified about it, the problem remains. In the same GitHub issue, it's clear that people can't reproduce Meta's experimental results either.

Closing Thoughts

We largely got to the bottom of the problems in Llama 3.2, but this was just one of many models that performed poorly. Are all of these models suffering from prompt template problems, or are there other problems as well? Stay tuned to find out.

TLDR

Better Ollama models for Llama 3.2 tool calling are available here:

  • ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt is based on this Llama 3.1 prompt
  • ejschwar/llama3.2-better-prompts:llama3.1-tooling-prompt-customized is based on this Llama 3.1 prompt but slightly modifies the language to improve behavior on responding to queries unrelated to tools.

Given the recent change in government here in the US, I've been thinking a lot about my personal beliefs and values. As part of this process, I decided to write my own personal manifesto. I hope that even if I do not see eye to eye on political matters with all my neighbors and fellow citizens, we can find some common ground on the values that matter most to me.

As a reminder, everything on this blog is my own opinion and does not reflect the views of my employer or anyone else.

My Personal Manifesto

Life is inherently unequal, shaped by factors beyond our control: genetics, financial circumstances, and sheer luck. I feel that those who are fortunate bear a responsibility to help those who are not, and that principle should be a foundation of our society. A truly fair and compassionate world requires us to ensure that everyone, regardless of gender, race, or any other characteristic, is given a fair and equal chance. When some groups consistently achieve more success than others based on traits like skin color, it signals a deeper flaw in our society. Equality must be more than a principle; it must be a reality reflected in the systems and institutions that govern us.

Freedom is one of life's most important values, but it should not come at the expense of others' well-being. We should accept people as they are, reserving judgment only when someone's actions cause real harm; discomfort, unease, or disagreement with someone's identity, culture, or way of life does not constitute harm. Prejudice and bigotry, when disguised as "freedom of expression," should not be allowed to infringe upon others' dignity and rights under the guise of tolerance.

In making political or social decisions, it is essential to think beyond our immediate reactions and consider the broader impact, including possible side effects. Policies must align with their stated goals and follow through on their commitments. For example, if a policy restricts abortion under the guise of protecting unborn children, it should also provide resources and support to ensure those children are cared for and supported after birth. Making a decision without committing to its consequences is irresponsible and ultimately harmful. True responsibility means considering all dimensions and consequences and committing to the outcomes of our choices.

A society grounded in truth and information is best equipped to address these challenges. Facts and science must take precedence over opinions or political agendas, ensuring our systems and policies are rooted in evidence rather than ideology. Truth is non-negotiable, and it must guide our decisions and perspectives. Science and evidence are the tools we use to uncover systemic injustices, evaluate the impacts of our actions, and create solutions that reflect reality, not rhetoric.

This manifesto is a call for a society that values truth, compassion, fairness, and foresight. It is a vision of a world where integrity, responsibility, and respect for all people guide our actions and our choices.


For the past several years, I've been using Advent of Code as an excuse to do some professional development and learn new languages. The past two years, I used Rust, which is an interesting language.

I remembered seeing funny looking solutions on the Reddit AOC solutions thread that were basically a jumble of symbols. These were for a language called Uiua, which naturally piqued my interest.

For some reason, I decided to try to do AOC 2024 in Uiua. It has not been a smooth ride, and in this blog post I'll briefly touch on some of my thoughts and experiences with Uiua and two other array programming languages, APL and BQN.

Starting in Uiua

The Uiua website describes the language as:

Uiua (wee-wuh 🔉) is a general purpose, stack-based, array-oriented programming language with a focus on simplicity, beauty, and tacit code.

Uiua lets you write code that is as short as possible while remaining readable, so you can focus on problems rather than ceremony.

The language is not yet stable, as its design space is still being explored. However, it is already quite powerful and fun to use!

But this screenshot of a basic Uiua example probably gives you a better idea of how the language works:

Uiua example

I spent a while going through the Uiua tutorials, and I made it through the first few AOC problems with a bit of difficulty.

I eventually got to a problem where I had to write a fold. And I remember getting extremely frustrated with the language. The language does not have (local) variable names. Instead, everything is on the stack. You as the programmer must internally keep track of the stack and how all the operations you perform modify it. Oh, it's a stack-based machine too, so the top of the stack is constantly changing.

I think there were only three or four values I had to juggle in my fold function, but it was too much. Maybe because I work in binary analysis where you can't take local variables for granted, I really want to be able to use them in my "high level" programming languages.

More seriously, I think the lack of local variables just compounds complexity. Simple functions are fine. But complex functions get even more complex, because the programmer now also has to juggle the stack layout. No thanks.

Moving to APL

Uiua is an array-oriented programming language. Most array-oriented programming languages derive from APL (which was created in the 60s!). One benefit of this is that APL is a pretty mature language, and there is a lot of training material available for it.

The most popular implementation is a commercial, non-open-source one called Dyalog APL. I wasn't thrilled to be using a closed-source implementation, but it's just for learning purposes so I supposed it was fine. I started following along with this tutorial. I got about half way through, and started to feel like I was probably competent enough to try some AOC problems in APL.

I immediately ran into trouble again, but this time with APL's tooling. I have two basic requirements for a programming language for AoC:

  1. I can put the code for each day in a file.
  2. I can run the code from inside VS code fairly easily.
  3. I can type the weird symbols of the language from within VS code. (Oops, this one is new this year.)

I forgot to mention that Uiua's tooling was pretty great. No complaints; I installed the extension and everything worked as expected.

APL tooling is weird. It's not really designed like modern programming languages. Instead, all coding is supposed to be done in workspaces. I was pretty frustrated by this and I eventually gave up.

In retrospect, I may have been able to get by with dyalogscript. But the unpolished nature of the tooling, at least for how programming languages are used in this century, was a big turn off.

BQN

Finally, I landed on BQN. Here is the website's description:

Looking for a modern, powerful language centered on Ken Iverson's array programming paradigm? BQN now provides:

  • A simple, consistent, and stable array programming language
  • A low-dependency C implementation using bytecode compilation: installation
  • System functions for math, files, and I/O (including a C FFI)
  • Documentation with examples, visuals, explanations, and rationale for features
  • Libraries with interfaces for common file formats like JSON and CSV

And here's a quick example from the website.

BQN

If you are thinking that all of these languages look pretty similar, you're right.

BQN had a lot going for it. It wasn't stack based. The tooling seemed pretty good. Not only is the language designed to be used from files, you can even use multiple files. Welcome to the 21st century baby!

I also liked the name. The whole point of learning this array-programming paradigm was to be able to write short, concise code. So the idea of answering "Big Questions" was appealing.

BQN has a lot of documentation. There are a few tutorials intended for new users, but most of the documentation is of the, well, documentation variety. It's not a tutorial, but a reference manual. It's written by an absolute array programming expert, for other array programming experts. It's not the most beginner friendly.

So it was a pretty rough learning curve. I quickly joined the APL language discord and started asking a lot of questions. People there are very patient and helpful, thankfully! I also found some other people working on AOC, and I spent a lot of time unraveling their solutions.

I just finished Day 9 of AOC 2024 in BQN. It's December 20th, so obviously I'm pretty far behind. I'm not sure if I'll finish this year; I've been trying to embrace learning the array-oriented way of thinking, which has been challenging and slow.

Readability

I've been slowly getting better at reading others' BQN code, but it's hard. There are a lot of symbols to remember, but that's really not the main problem for me. Instead, it's very difficult to "parse" where parentheses should be placed. It can also be difficult to follow the general flow of very terse code.

Here's a snippet of code from RubenVerg, who is a genius when it comes to tacit coding in BQN.

in←•file.Chars "input/8.txt"

P←(¬-˜⊢×·+`»⊸>)⊸⊔
OutOfBounds←∨´(0>⊢)∾≢⊸≤

Parse←>' '⊸<⊸P

Part1←{𝕊grid: ≠¬∘(grid⊸OutOfBounds)¨⊸/⍷∾⥊{(𝕨(≢∧=○(⊑⟜grid)∧'.'≠grid⊑˜⊢)𝕩)/⋈𝕩-˜2×𝕨}⌜˜⥊↕≢ grid}Parse

(Holy smokes, my formatter actually supports BQN!)

Part1 is a function that is composed with Parse. So it will Parse the input and the result will be bound to grid inside the curly brackets.

I have my doubts that anyone can read this code. Rather, you can reverse engineer it by breaking it down into smaller pieces and understanding each piece. But it's not easy to read, even if you understand what all the symbols mean.

Tacit coding

According to the BQN documentation:

Tacit programming (or "point-free" in some other languages) is a term for defining functions without referring to arguments directly, which in BQN means programming without blocks.

The idea of tacit coding is kind of cool. You basically avoid applying functions and instead compose and otherwise modify them.

BQN has a function composition operator ∘ just like you would imagine. But a lot of tacit code uses trains. For a pretty poor introduction to trains, you can view this page. But let me spell out the basics.

A 2-train is two adjacent functions, and by definition fg evaluates to f∘g (f composed with g). In other words, to evaluate fg on an input 𝕩, we could use f(g(𝕩)). (A BQN programmer would never write that and would instead just use f g 𝕩.)

3-trains, which consist of three adjacent functions, are where things get fun. Again, by definition, evaluating fgh on input 𝕩 evaluates to (f 𝕩) g (h 𝕩). This is not very intuitive, but it's useful for a couple reasons:

  • 𝕩 appears twice, so you can use it to avoid writing a long expression multiple times
  • There's a fork, so you can combine two different behaviors
  • g acts as a summarizer or combiner

Here's an example I used in my solution to Day 9:

(⊣×(↕≠)) arg

The 3-train consists of ⊣, ×, and (↕≠). In my program, arg is actually an extremely long expression, and I did not want to write it twice. Let's expand the train:

(⊣ arg) × (↕≠ arg)

⊣ is the identity function, so it returns arg. × is multiplication. ≠ returns the length of its (right) argument, and ↕ returns the list of numbers from 0 to one less than its argument. So this train multiplies each element in arg (it's an array) by its index. Pretty cool, huh?

The trouble is that when reading and writing BQN code, it can be difficult to identify trains. I've been getting better, but I still find myself inserting a ∘ whenever my code doesn't work, since function composition will "stop" a train from forming when it wasn't intentional. Now look at RubenVerg's code above and think about all the trains. Even if you understand the symbols, it's not easy. This is very much a learned skill!

Here's a very basic example of how parsing influences trains. BQN evaluates from right to left. So if you write fgh 𝕩, that actually means f(g(h(𝕩))) and there is not a train. But (fgh) 𝕩 is completely different and fgh is a 3-train. Now again, look at RubenVerg's code and try to figure out the implied parentheses. Good luck!

Documentation

I found BQN's documentation to be very thorough, but not very beginner friendly. I think it's written by an expert for other experts. In some cases, it seems to pontificate and misses basic definitions. For example, the trains page doesn't directly define 2- and 3-trains. You can probably figure out the definition from the examples, but it's not ideal.

On the plus side, many documentation pages feature very intuitive diagrams. See the below diagram in Group for an example.

Cool features

There are some cool features in BQN. I'm not going to cover all of them, but here are ones that stand out based on my programming career.

Group ⊔

The Group ⊔ operator is pretty nifty. Here is a nice diagram that intuitively depicts an example.

Group

In BQN, ๐•จ is the left argument and ๐•ฉ is the right argument. Usually ๐•ฉ is some existing structure you want to analyze, and ๐•จ is a list of indices that you construct to define the groupings you want. If you want elements to be placed in the same group, you assign them the same index.

This is a powerful capability. For example, in today's AOC problem, there was a string like 00...111...2...333.44.5555.6666.777.888899, where group 0 had size 2, group 1 had size 3, and so on. One easy way to determine the size of each group in BQN is using Group ⊔. If you first change each . to a ¯1, you can use the same array as both arguments with ⊔˜ to get the groupings, and ≠¨⊔˜ to get the length of each grouping. (≠ means length, and ¨ modifies the function to its left so that it applies to each element of the array to its right.)
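
If the BQN is hard to read, here is a rough Python sketch of the same trick; the group helper is mine, and I use -1 where BQN would use ¯1.

# A rough Python analogue of Group ⊔ for computing group sizes; helper names are mine.
def group(indices, values):
    """Collect values[i] into bucket indices[i]; an index of -1 drops the element."""
    buckets = [[] for _ in range(max(indices) + 1)]
    for idx, val in zip(indices, values):
        if idx != -1:
            buckets[idx].append(val)
    return buckets

s = "00.111.2"                                 # a shortened version of the input
ids = [-1 if c == "." else int(c) for c in s]
print([len(g) for g in group(ids, ids)])       # like ≠¨⊔˜ : group sizes [2, 3, 1]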

Under ⌾

Under ⌾ is an interesting capability that is a bit tricky to explain. Here is the official explanation from the documentation:

The Under 2-modifier expresses the idea of modifying part of an array, or applying a function in a different domain, such as working in logarithmic space. It works with a transformation 𝔾 that applies to the original argument 𝕩, and a function 𝔽 that applies to the result of 𝔾 (and if 𝕨 is given, 𝔾𝕨 is used as the left argument to 𝔽). Under does the "same thing" as 𝔽, but to the original argument, by applying 𝔾, then 𝔽, then undoing 𝔾 somehow.

So to restate, there is a transformation or selection operation, 𝔾, and a modification transformation, 𝔽. There are different applications, but I always used this to transform or change part of an array. In that case, 𝔾 might be a filter, and 𝔽 describes how you want to change the array.

Here's an example from today's AOC again:

ReplaceNegWithNegOne ← {¯1¨⌾((𝕩<0)⊸/) 𝕩}

I'm not going to try to explain all the syntax. But 𝔾 is ((𝕩<0)⊸/); this says to filter 𝕩 so that only elements less than 0 remain. 𝔽 is ¯1¨, which means return negative one for each argument. So, put together, replace negative elements with negative one. I then used the resulting array as an index to Group ⊔, which ignores any element with an index of negative one.

This is kind of neat because values are immutable in BQN, and Under provides an efficient way to produce a value with just part of it changed. I assume that the implementation uses this to avoid making copies of the whole array.
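
To spell out what 𝔾 (the selection) and 𝔽 (the modification) do in that example, here is a simplified Python rendering of ReplaceNegWithNegOne; it only captures the mask-style use of Under, not the general "undo 𝔾" machinery.

# A simplified Python analogue of {¯1¨⌾((𝕩<0)⊸/) 𝕩}: apply F only where G selects.
def under_mask(select, modify, xs):
    """Apply modify to the elements chosen by select, leaving the rest alone."""
    return [modify(x) if select(x) else x for x in xs]

replace_neg_with_neg_one = lambda xs: under_mask(lambda x: x < 0, lambda _: -1, xs)
print(replace_neg_with_neg_one([3, -7, 0, -2]))   # [3, -1, 0, -1]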

Not So Cool Features

One thing that annoys me about BQN is that a number of basic functions are not built-in because they can be succinctly expressed.

For example, want to split a string? You'd better memorize

x0((⊢-˜+`×¬)∘=⊔⊢)y1 # Split y1 at occurrences of separators x0, removing the separators

Want to build a number from an array of digits? You can use

10⊸×⊸+˜´⌽d1 # Natural number from base-10 digits

These both come from BQNcrate, a repository of useful functions you could but probably don't want to derive yourself. I'd much rather see this in a standard library of some sort. Most of these are cool, and it's fun to see how they work. But when I'm actually coding, I don't want to look these up or try to derive them. I just want to say split the string by ' ' and move on.
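
For contrast, here is the entire "split a string" operation in the language I'm used to (Python):

# What I actually want to write when splitting a string.
print("split this string".split(' '))   # ['split', 'this', 'string']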

I don't think I'm alone. I've noticed that RubenVerg, for example, likes to use •ParseFloat to parse integers rather than

(10⊸×⊸+˜´∘⌽-⟜'0')d1 # Parse natural number from string

which doesn't exactly roll off the tongue.

Ed's Feelings on BQN

Unfortunately, I don't have fun programming in BQN. There, I said it. I've literally felt very stupid at times trying to figure out how to write a simple function.

BQN is challenging, and I like challenges. But it's a fine line. There is an intense gratification to stringing together a whole bunch of opaque symbols that very few other people can read. But it's also frustrating and demoralizing to spend hours trying to figure out how to solve a basic problem.

It's hard to say how much of this is just part of learning a new paradigm. I remember when I first learned OCaml as a graduate student and had to figure out how to think functionally and decode arcane type errors involving parametric polymorphism. At the time, it was hard (and probably not that fun, but I can't remember). Now it's second nature. Maybe BQN will become second nature if I stick with it.

Conclusion

I probably won't be using BQN for any real projects any time soon. But I haven't given up on it entirely. I may try to finish AOC 2024 in BQN. We'll see. Given the lack of fun I've been having, I can't say I'm extremely motivated to do so. So for now I'll be taking things one day at a time.

If you're curious about my BQN code, you can find my AOC 2024 solutions here.

One of the most exciting possibilities of AI and LLMs is agents: programs that allow LLMs to interact with various tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.

In this blog post, we'll take a look at how to build agents using LangChain. They'll work great using an OpenAI model. And then we'll try to run them locally using Ollama, using a variety of open models. And they will almost all fail miserably. They fail so badly that I created this blog post to convince myself I wasn't imagining things.

In a future blog post, we will examine why.

LangChain

LangChain is a framework that allows you to build LLM applications. Basically, it abstracts a bunch of different components (LLMs, vector stores, and the like) and allows you to focus on your application's logic. So, you might develop your application in LangChain while using a local LLM to run it, but then switch to Claude once you go to production.

Anyway, using LangChain to make a query is pretty simple.

!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage

from langchain_openai import ChatOpenAI

# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

response=zero_temp_gpt35.invoke("Hi!  What is your name?").content

import textwrap
print(textwrap.fill(response))
Hello! I am a language model AI assistant. How can I assist you today?

The beauty of LangChain is that the components are modular. We can replace the gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!

Building an Agent using LangGraph

LangGraph is the part of the LangChain ecosystem for building agents. It allows us to easily define new tools:

!pip install langgraph~=0.2.53
@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]

The @tool decorator automatically transforms the function into a schema that can be used by the LLM to decide whether to invoke the tool, and if so, how.

foobar.tool_call_schema.model_json_schema()
{'description': 'Computes the foobar function.',
 'properties': {'input': {'title': 'Input', 'type': 'integer'}},
 'required': ['input'],
 'title': 'foobar',
 'type': 'object'}

With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:

from langgraph.prebuilt import create_react_agent

def react_chat(prompt, model):
  agent_executor = create_react_agent(model, tools)

  response = agent_executor.invoke({"messages": [("user", prompt)]})
  return response['messages'][-1].content, response

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The result of evaluating foobar(30) is 32.

Yes! We did it, team! 🎉 We could change foobar to be a web search, a database lookup, or you name it.

Let's try a query that doesn't use a tool at all.

last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"
Hello! How can I assist you today?

Great. So, in theory, we have an agent that we can chat with and is able to call tools in order to help us out.

Ollama: Going Local

Now let's try to create a tool-wielding agent using an LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It will use any GPUs that you might have, but it's still usable even if you don't have any. After all, you're just performing inference, not training.

Here's an example of me running Llama 3.2 with Ollama on my work laptop.

root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?

>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!

>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)

You can find instructions on how to install Ollama on the Ollama webpage.

If you don't feel like installing anything, that's fine too. You can follow along with this notebook.

After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.

# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2

Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.

from langchain_ollama import ChatOllama

# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi!  What is your name?").content

print(textwrap.fill(response))
I don't have a personal name, but I'm an AI designed to assist and
communicate with users. I'm often referred to as a "language model" or
a "chatbot." You can think of me as a helpful computer program that's
here to provide information, answer questions, and engage in
conversation. What's your name?

Okay, looking good! This is not bad for a 3B parameter LLM that can easily run locally on our computer. Let's see how it does when driving our tool-wielding agent.

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The output of `foobar(30)` is 32.

🎉 Everything is working well so far. As one final check, let's ask the agent a question that has absolutely nothing to do with tools.

last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"
The input value 42 was doubled, resulting in 84.

So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.

import pprint
pprint.pprint(result)
{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'),
              AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}),
              ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'),
              AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}

We can see there are four messages:

  1. The HumanMessage is the user's message -- "Hi."
  2. In response, in the AIMessage, the LLM indicates that it would like to invoke a tool by setting the tool_calls field. (See the snippet after this list.)
  3. LangChain invokes the tool and records the result in the ToolMessage, which is given back to the LLM.
  4. The final AIMessage includes a written message for the user.
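
If you want to pull that spurious tool call out programmatically, here is a minimal sketch that walks the messages in the result we just printed; the AIMessage.tool_calls field is the thing to look at.

# Minimal sketch: find AI messages that requested tool calls in the conversation.
from langchain_core.messages import AIMessage

for msg in result['messages']:
    if isinstance(msg, AIMessage) and msg.tool_calls:
        print("Tool call requested:", msg.tool_calls)
# Tool call requested: [{'name': 'foobar', 'args': {'input': 42}, ...}]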

The problem, of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!

Ed's Really Dumb Tool-calling Benchmark ™️

I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low-hanging fruit in this area. (I am calling it a "benchmark" facetiously!)

Here are the questions:

  1. Can the react agent use a tool correctly when explicitly asked? (Yes is good.)
  2. Does the react agent invoke a tool when it shouldn't? (No is good.)
  3. Does the react agent lose the ability to answer questions unrelated to tools? (No is good.)
  4. Does the react agent lose the ability to chat? (No is good.)

Question 1: Can the react agent use a tool correctly when explicitly asked?

We'll use our example above to test this.

basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
  last_msg, _ = react_chat(basic_tool_question, model=model)
  return "32" in last_msg

Question 2: Does the react agent invoke a tool when it shouldn't?

We'll perform two simple tests to answer this question. We'll prompt the agent with both a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and a greeting, "Hello!" We'll then check whether the resulting conversation contains a ToolMessage, which indicates that the model chose to invoke a tool. By construction, neither of those prompts should induce a tool call.

from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
  _, result = react_chat(basic_arithmetic_question, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
  _, result = react_chat(greeting, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
  return q2a(model) and q2b(model)

Question 3: Does the react agent lose the ability to answer questions unrelated to tools?

To answer this, we'll ask the basic arithmetic question to the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, ideally, the agent and the underlying model should be able to solve the problem under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it because I'm such a nice guy. 😇

def q3a(model):
  result = model.invoke(basic_arithmetic_question)
  return "12243" in result.content

def q3b(model):
  last_msg, _ = react_chat(basic_arithmetic_question, model=model)
  return "12243" in last_msg

def q3(model):
  # q3a ==> q3b: If q3a, then q3b ought to be true as well.
  return not q3a(model) or q3b(model)

Question 4: Does the react agent retain the ability to chat?

To answer this, we'll greet the agent and attempt to determine if it responds properly. This is a little difficult to do in a comprehensive way.

basic_greeting = "Hi."

def q4(model):
  last_msg, _ = react_chat(basic_greeting, model=model)
  r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
  #if not r:
    #print(f"Debug: Not a greeting? {last_msg}")
  return r

Benchmark code

Here is code to run each experiment several times (n=10 by default) and average the results.

from tqdm.notebook import tqdm

def do_bool_sample(fun, n=10, *args, **kwargs):
  try:
    # tqdm here if desired
    return sum(fun(*args, **kwargs) for _ in (range(n))) / n
  except Exception as e:
    print(e)
    return 0.0

def run_experiment(model, name, n=10):
  do = lambda f: do_bool_sample(f, model=model, n=n)
  d = {
      "q1": do(q1),
      "q2": do(q2),
      "q3": do(q3),
      "q4": do(q4),
      "model": name
  }
  d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
  return d

def print_experiment(results):
  name = results['model']
  print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
  print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
  print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
  print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name):
  results = run_experiment(model, name)
  print_experiment(results)
  return results

Benchmarking Llama 3.2

Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.

llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}

As we saw above, Llama 3.2 is able to call functions (Q1), but does so even when it should not (Q2). Question 3 shows that even though it almost always decides to call a tool, this only stops it from answering basic questions about half of the time. It does, however, largely prevent it from being able to chat (Q4).

Benchmarking OpenAI's gpt-3.5-turbo and gpt-4o

Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.

gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0,
 'q2': 0.0,
 'q3': 0.9,
 'q4': 1.0,
 'model': 'gpt-3.5-turbo',
 'total': 2.9}

Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1), and unlike Llama 3.2, can still engage in chat (Q4). A bit surprisingly, however, it still invokes tools when it shouldn't (Q2). But it is smart enough to ignore their results when constructing its final response.

Let's try a newer model, gpt-4o.

gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}

GPT-4o nailed it! 👍

Benchmarking a Lot Of Ollama Models

Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.

ollama_models = [
    "hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
    "llama3.3:70b",
    "llama3.2:3b",
    "llama3.1:70b",
    "llama3.1:8b",
    "llama3-groq-tool-use:8b",
    "llama3-groq-tool-use:70b",
    "MFDoom/deepseek-v2-tool-calling:16b",
    "krtkygpta/gemma2_tools",
    "interstellarninja/llama3.1-8b-tools",
    "cow/gemma2_tools:2b",
    "mistral:7b",
    "mistral-nemo: 12b",
    "interstellarninja/hermes-2-pro-llama-3-8b-tools",
    "qwq:32b",
    "qwen2.5-coder:7b",
    ]

all = []

for m in ollama_models:
  print(f"Downloading model: {m}...")
  !ollama pull {m} 2>/dev/null
  print("done.")
  r = run_and_print_experiment(ChatOllama(model=m), m)
  !ollama rm {m}
  all.append(r)
  print(r)
Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
deleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}
Downloading model: llama3.3:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
deleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}
Downloading model: llama3.2:3b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
deleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}
Downloading model: llama3.1:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
deleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}
Downloading model: llama3.1:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
deleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}
Downloading model: llama3-groq-tool-use:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
deleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}
Downloading model: llama3-groq-tool-use:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
deleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}
Downloading model: MFDoom/deepseek-v2-tool-calling:16b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
deleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}
Downloading model: krtkygpta/gemma2_tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
deleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}
Downloading model: interstellarninja/llama3.1-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
deleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}
Downloading model: cow/gemma2_tools:2b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
deleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}
Downloading model: mistral:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
deleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}
Downloading model: mistral-nemo: 12b...
done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
Error: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}
Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
deleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}
Downloading model: qwq:32b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
deleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}
Downloading model: qwen2.5-coder:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
deleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])
Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b',
  'q1': 1.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 3.9},
 {'model': 'llama3-groq-tool-use:8b',
  'q1': 1.0,
  'q2': 0.8,
  'q3': 1.0,
  'q4': 1.0,
  'total': 3.8},
 {'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
 {'model': 'qwen2.5-coder:7b',
  'q1': 1.0,
  'q2': 0.4,
  'q3': 0.8,
  'q4': 1.0,
  'total': 3.2},
 {'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S',
  'q1': 0.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 2.9}]

There's a lot going on here. But I have a few general observations:

  1. Most models do really poorly. The average "score" was about 2.3 out of a possible 4.0. And even a perfect score of 4.0 is really the bare minimum of what I would expect a decent model to do. And most of the models I tested claim to "support" tool calling.

  2. Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!

  3. I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.

Berkeley Function-Calling Leaderboard

There are several benchmarks that test tool use in LLMs. But most of them are not designed to test tool usage as part of an interactive agent. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark that tests this type of behavior. And it was only added in September 2024 as part of BFCL V3.

Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.

[Screenshot: BFCL V3 leaderboard showing multi-turn accuracy]

Other Reports

YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, he found that Llama 3.3 did not have the same problem, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.

So What Is The Problem?

Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?

We'll explore the answer in a future blog post. Stay tuned!

Edward J. Schwartz | Computer Security Researcher | 1 min. read

I recently signed up for BlueSky. I just learned of a new service, EchoFeed, that polls RSS feeds and posts the content to BlueSky (and elsewhere). So, this is a test.

Will it post to BlueSky? Will fed.brid.gy mirror it to Mastodon? The suspense is killing me!

Edward J. Schwartz | Computer Security Researcher | 1 min. read

This page documents my experience with "pressure washing" my vinyl fence and siding. I have pressure washing in quotes because it's SH, or sodium hypochlorite (bleach), that does the bulk of the work. Pros often call this "soft washing".

For vinyl fence soft washing, you want around 1-2% SH. Most household bleach is 6% SH, so if you mix 1 part bleach with 5 parts water, you'll get around 1% SH.

You also want to use a surfactant to help the mixture stick to the fence. I used Dawn Ultra. Some people claim that certain dish soaps will cause a bad reaction with the bleach, ranging from "mustard gas" to neutralizing the bleach.

I personally found that at 1-1.5% SH, the mixture was safe to use around grass. I wet the grass before and after applying the mixture, and I didn't see any damage.

Supplies

Recipe

  1. Add 3 cups of 6% SH bleach
  2. Add 0.8 gallons of water
  3. Add 2 fl. oz. of Dawn Ultra

Make sure to put the soap in last, or your mixture will foam up and overflow the sprayer when you try to close it.
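
If you want to double-check the strength of this mix, here is the back-of-the-envelope arithmetic (assuming 6% SH bleach, and 16 cups and 128 fl. oz. per gallon); it works out to roughly 1.1% SH.

# Back-of-the-envelope check of the recipe's SH concentration.
bleach_gal = 3 / 16           # 3 cups of bleach (16 cups per gallon)
water_gal = 0.8               # water
soap_gal = 2 / 128            # 2 fl. oz. of Dawn Ultra (128 fl. oz. per gallon)
sh_gal = 0.06 * bleach_gal    # the bleach is ~6% sodium hypochlorite

total_gal = bleach_gal + water_gal + soap_gal
print(f"{100 * sh_gal / total_gal:.2f}% SH")   # ~1.12%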

Spray the mixture on the fence, let it sit for about five minutes, and then rinse it off. You can use a garden hose, but I personally found that using a Ryobi One+ EZ-Clean worked better. I'm sure a pressure washer would have been even faster, but it is less convenient to use.

That's about it. This removed most of the staining.

For some areas that had large amounts of growth, I used a Ryobi Scrubber to physically remove it before spraying.

The bleach was not able to remove all stain spots. For those remaining spots that were in conspicuous places, I used a magic eraser / melamine sponge.

Before
After
Edward J. Schwartz | Computer Security Researcher | 1 min. read

At some point, I hope to create a Notes section on my website that will turn Markdown files into a list of notes. This is basically how the blog works. But, I'm kind of busy. And since Gatsby seems like it's dead, I'm not sure that I want to invest a whole lot of time into it. (Although putting the notes in markdown seems like a good idea for compatibility.)

Anyway, here is my first very short note on Profiling.

SpeedScope

SpeedScope is an awesome tool for visualizing profiler output. It has a flame graph view that is wonderful. I also like to use the Sandwich view, sorting by total time and simply looking for the first function that I recognize. This is often the culprit.

The documentation is pretty good. It also shows how to record profiles in compatible formats for most platforms. I mostly use py-spy and perf.

Java

The one notably missing platform is Java! Luckily, it's not too hard to convert Java's async-profiler output to a format that SpeedScope can read. Here's how I do it:

  • Download async-profiler
  • Create an output file in collapsed format. You can do this in several ways, such as:
      • ./asprof start -i 1s Ghidra followed by ./asprof stop -o collapsed -f /tmp/out.prof.collapsed Ghidra
      • ./asprof collect -d 60 -o collapsed -f /tmp/out.prof.collapsed Ghidra
  • Then open out.prof.collapsed in SpeedScope.
Screenshot of profiling Ghidra

The collapsed format takes a while to parse, so it might be worth it to export the native SpeedScope format.

Edward J. Schwartz | Computer Security Researcher | 5 min. read

Fostering

My wife and I started fostering rescue dogs mostly by accident. We adopted a Shih-tzu mix, who seemed completely relaxed when we met her at the rescue. When we got home, we eventually figured out that she was terrified, and freezing was her coping mechanism. Once she got a little more comfortable, she started hiding from us in the house. It took many months, but Molly eventually warmed up to us and we gained her trust.

Our first rescue dog, Molly

As frustrating an experience as it is to have a dog be absolutely terrified of you for no reason, it was also incredibly rewarding to see her come out of her shell. We decided that we wanted to help other dogs in similar situations, so we started fostering. It's been several years, and we've fostered dozens of dogs. For a long time, we fostered for A Tail to Tell, which unfortunately recently closed. More recently, we have been fostering for Lucky Dawg Animal Rescue.

Blanche

This week, however, we had a "first", and not a good one. We picked up a new foster dog, Blanche, on Sunday. We have a nice fenced-in yard, and we immediately took Blanche out back into the yard. She was very skittish, and we gave her some space. She immediately dove into a row of large evergreen trees in our yard and hid. Eventually, I had to go in and carry her out, which was not a simple task given the size of the trees.

Blanche

The next morning, I let Blanche out and saw her run into the same trees. It was very hot, and after about an hour, I started to grow concerned and went out to look for her. She was not in the same tree as last time, but I figured she was hiding in another one. There are several trees, and it is fairly difficult to see into them. I started exhaustively searching the trees, and I couldn't find her. I also found a part of our fence that was slightly pushed out, as if something had forced its way out. She had escaped.

The hole

The Search: Monday (Day One)

My wife and I were very upset, but we shifted into action. We reported Blanche as missing on PawBoost and our community Facebook page, called the local police department, and notified our rescue, Lucky Dawg Animal Rescue. We quickly began receiving sighting reports of Blanche. She was initially seen at approximately 10am in a wooded area next to a busy road. I went to the area and searched for her while my daughter and wife started talking to neighbors and handing out flyers. I saw no sign of her in the wooded area.

We spent the rest of the day trying to put up posters on telephone poles, which is harder than it seems! Tape doesn't adhere very well to dirty telephone poles. The trick is to tape all the way around the pole so that the tape sticks to itself.

We received another sighting report at 5pm, this time on the other side of the busy road. The report was of a dog "playing/fighting" with a chicken. Fortunately, I knew which house this was at from the description. Around the same time, two members of our rescue arrived to help. We went to the house and I talked to the owner, who revealed that Blanche had attacked her chicken. She chased Blanche off, and Blanche ran into the woods. While I was talking to the owner, our rescue members saw Blanche in a large field nearby. Blanche spotted them, and ran into a wooded area near a creek. I was able to reach the other side of the wooded area, but the experts decided that it would be better to leave a food station for Blanche so she stayed in the area rather than try to chase her.

The Search: Tuesday (Day Two)

The next morning we received a few sightings of Blanche near our neighborhood again. More surprisingly, my wife left our fence gate open and saw Blanche sniffing around the fence around 8 am. Unfortunately, Blanche ran off. It was very hot, and Blanche presumably slept during the day.

That evening, the rescue returned and put out a trap with lots of food, and a remote camera to monitor it. We cranked up the volume on my phone so that every time the camera detected movement, we would wake up. We were woken up several times, but it was mostly false positives. At 1:59am, a cat wandered by. At some point, the cat triggered the trap, and at 2:28am we were greeted by a picture of Blanche studying the cat in the trap.

Cat in the trap

I snuck outside, and saw Blanche eating food about 10 feet away from me. It was frustrating to be so close but not be able to do anything. But our rescue members told us that it was safer to make her feel safe and comfortable with the trap. She clearly enjoyed the food that we had put out for her. She would be back.

After Blanche finished eating, we freed the cat from the trap and added more food, but we didn't see Blanche again that night. Stupid cat!

The Search: Wednesday (Day Three)

The next morning, Blanche was sighted in many of the same places, including near our house. Unfortunately, she was also seen crossing the road again. We were worried that a car would hit her. She slept during the day again. In the evening, there was heavy rain, and we decided to wait until after the rain to put the trap out. Naturally, Blanche showed up during the rain, and we missed an opportunity to catch her. Fortunately, my wife had put out a couple of pieces of food, so she didn't leave empty-handed.

Blanche in the rain

We armed the trap again and waited. She came back around 9:43pm and began investigating the trap. She was very cautious, and decided to yank out the towel on the bottom of the trap. She stayed for a very long time, as we waited in suspense for her to trigger the trap. Eventually she entered the trap, but it didn't trigger for some reason. It was frustrating, but not the end of the world. She would begin to think the trap was a safe source of food.

Blanche investigating the towel she had removed

Blanche stayed around for a long while, but eventually left. We examined the trap. When she yanked the towel, it actually disarmed the trap without activating it. We fixed the trap and waited again, hoping she would be back later that night.

At 2:05am, she came back and began to investigate the trap again. We waited for what seemed like forever. At 2:15am, we received this picture of her deep in the trap.

Blanche in the trap

The camera we were using would take a burst of three pictures every time it detected motion. After that picture, we didn't receive any more pictures. This could mean that she was in the trap, or that the trap had not triggered and she had left. We waited for a while, and then snuck outside to check. We had caught her! We carried the trap inside and put her back into her exercise pen, and finally got some sleep.

The Aftermath

Blanche is not very happy to be back inside, but she ate, drank, and is safe. She hasn't been very lively yet, but this is not uncommon with mill rescue dogs, who often need a few days before they start to interact and show personality. Hopefully she'll quickly realize that we're not so bad, and that she is safe and sound in our house.

We have a long list of interesting stories from our years fostering rescue dogs, but this was certainly one of the more interesting and stressful ones. On the positive side, we met a lot of our neighbors, and we were pleasantly surprised by how helpful and supportive they were, without exception. Many people wanted to help in whatever way they could. We really live in a cool little community.

Edward J. Schwartz | Computer Security Researcher | 2 min. read

In my last post I talked about how I have been using Ansible for my new laptop configuration, and shared my configuration for notion.

So far, I've been extremely happy with using Ansible for configuring my machine. Prior to using Ansible, I'd spend a fair amount of time creating detailed notes that described what I did. I estimate that creating Ansible recipes takes about the same time as keeping good notes, and maybe even less. That's because there are many existing roles for common settings and software that can be reused. As with any ecosystem, the quality of such roles varies.

The big difference between my notes and Ansible, though, is that Ansible playbooks can be played in minutes, whilst manually following my notes can take hours to set up an entire new machine. I used to dread the idea of configuring a new machine. But now it's fairly effortless.

I just publicly shared my Ansible configuration. I don't expect that anyone will use my configuration as is, any more than I expect anyone to use my notion configuration! I'm extremely opinionated and picky. But I do hope that it might give people some ideas, like how to install llvm, nvidia drivers and so on. I know I personally found other people's repositories to be helpful.

In a very similar vein, I've started using dorothy, which claims to allow you to "... bring your dotfile commands and configuration to any shell." Since I usually but not always use fish, I've always been hesitant to write my own commands in fish. Plus, I have been writing bash scripts for long enough that I'm decent at it, so it tends to be one of my go-tos. Dorothy makes it easy to define variables and commands in such a way that they magically appear in all shells. (Again, this is very useful for fish, which is not a POSIX-compliant shell.) There's also a fair number of useful built-in commands. Dorothy encourages users to split their dotfiles into public and private portions, and you can view my public dotfile here. Specifically, here are my custom commands. Some of these might be useful, such as setup-util-ghidra and setup-util-ghidrathon. I've found that having a designated spot for these types of utility commands encourages me to write them, which ultimately saves me time. Usually.

Edward J. Schwartz | Computer Security Researcher | 3 min. read

Sometime while I was in graduate school, I started using the notion window manager. (Actually, at the time, I think it was ion3.) Notion is a tiling window manager that is geared towards keyboard usage instead of mouse usage. But let's be honest: I've been using notion for so long that I simply prefer it over anything else.

Notion, like most minor window managers, is a bit spartan. It does not provide a desktop environment. It really just manages windows. There are some features of a desktop environment that I don't need, such as a launcher. I know all the commands that I use; I don't need a GUI to list them for me. But it's often the little things that get you, such as locking the screen, or using the media keys on your keyboard to adjust the volume. I used to be (more of) a hardcore nerd and relished my ability to craft a super-complex .xsessionrc file with all kinds of bells, whistles, and utilities connected as if with duct tape. But as I grow older, sometimes I just want my computer to work.

For a long while now, I've found that running notion alongside GNOME for "desktop stuff" works pretty well. For a long time, I followed an old wiki post about how to combine GNOME with Awesome WM. This worked really well with GNOME 2.

Many people say that you can't use GNOME 3 with window managers other than GNOME Shell. I've actually had pretty good luck copying the Ubuntu gnome-session and replacing GNOME Shell with notion. The Awesome WM wiki post above also shows how to do it. Unfortunately, I've found that some features do not work, such as the keyboard media keys, much to my dismay. Do media keys matter that much? Yes, yes, they do. This apparently broke when GNOME Shell started binding the media keys instead of gnome-settings-daemon. There used to be a gnome-fallback-media-keys-helper utility around that would simulate this behavior, but it seems to have disappeared.

As I was trying to fix this problem, I came across a blog post and an unrelated but similar GitHub repo, both describing how to use the i3 window manager with GNOME. TLDR: GNOME Flashback is a still-supported variant of GNOME that is explicitly designed to support third-party window managers. Whereas GNOME Shell incorporates both the window manager and other stuff such as handling media keys, GNOME Flashback has the "other stuff" in a separate component that is designed to be used with a window manager such as metacity. But it works just fine with notion! Best of all, both my media keys and screen locking work. Hurray!

Because I hate setting up stuff like this, I've actually been hard at work packaging up my Linux computer configuration into reusable ansible components. It takes a little longer than doing it manually of course, but it's not too bad and it's pretty easy to read. I'm making my notion role available here in case anyone wants to try out my setup. Most of the logic is here if you are curious what is involved. Below are a few snippets to show how Ansible makes it relatively easy to manipulate configuration files.

# Same thing, but for gnome-flashback

- name: Copy gnome-flashback-metacity.session to notion-gnome-flashback.session
  copy:
    src: /usr/share/gnome-session/sessions/gnome-flashback-metacity.session
    dest: /usr/share/gnome-session/sessions/notion-gnome-flashback.session

- name: 'notion-gnome-flashback.session: Change metacity to notion and add stalonetray'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: 'metacity'
    replace: notion;stalonetray

- name: 'notion-gnome-flashback.session: Remove gnome-panel'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: ';gnome-panel'

- name: Symlink systemd target for notion-gnome-flashback session to gnome-flashback-metacity session
  file:
    src: /usr/lib/systemd/user/gnome-session@gnome-flashback-metacity.target.d
    dest: /usr/lib/systemd/user/gnome-session@notion-gnome-flashback.target.d
    state: link

- name: Install gconf override for notion-gnome-flashback
  copy:
    src: notion-gnome-flashback.gschema.override
    dest: /usr/share/glib-2.0/schemas/01_notion-gnome-flashback.gschema.override
  notify: Compile glib schemas
- name: Set META
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--META='
    line: META="Mod4+"
    backup: true
- name: Set ALTMETA
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--ALTMETA='
    line: ALTMETA="Mod1+"
    backup: true
- name: Disable mod_dock
  lineinfile:
    path: /usr/local/etc/notion/cfg_defaults.lua
    state: absent
    line: 'dopath("mod_dock")'
    backup: true
- name: Enable mod_statusbar
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--dopath("mod_statusbar")'
    line: 'dopath("mod_statusbar")'
    backup: true
