Edward J. SchwartzComputer Security Researcher10 min. read

For the past several years, I've been using Advent of Code as an excuse to do some professional development and learn new languages. The past two years, I used Rust, which is an interesting language.

I remembered seeing funny looking solutions on the Reddit AOC solutions thread that were basically a jumble of symbols. These were for a language called Uiua, which naturally piqued my interest.

For some reason, I decided to try to do AOC 2024 in Uiua. It has not been a smooth ride, and in this blog post I'll briefly touch on some of my thoughts and experiences in Uiua and two other array programming langauges, APL and BQN.

Starting in Uiua

The Uiua website describes the language as:

Uiua (wee-wuh 🔉) is a general purpose, stack-based, array-oriented programming language with a focus on simplicity, beauty, and tacit code.

Uiua lets you write code that is as short as possible while remaining readable, so you can focus on problems rather than ceremony.

The language is not yet stable, as its design space is still being explored. However, it is already quite powerful and fun to use!

But this screenshot of a basic Uiua example probably gives you a better idea how the language works:

Uiua example
Uiua example

I spent a while going through the Uiua tutorials, and I made it through the first few AOC problems with a bit of difficulty.

I eventually got to a problem where I had to write a fold. And I remember getting extremely frustrated with the language. The language does not have (local) variable names. Instead, everything is on the stack. You as the programmer must internally keep track of the stack and how all the operations you perform modify it. Oh, it's a stack-based machine too, so the top of the stack is constantly changing.

I think there were only three or four values I had to juggle in my fold function, but it was too much. Maybe because I work in binary analysis where you can't take local variables for granted, I really want to be able to use them in my "high level" programming languages.

More seriously, I think the lack of local variables just compounds complexity. Simple functions are fine. But complex functions, which are already complex, start to get even more complex as the programmer now has to deal with juggling the stack layout. No thanks.

Moving to APL

Uiua is an array-oriented programming language. Most array-oriented programming languages derive from APL (which was created in the 60s!) One of the benefits of this is that it's a pretty mature language, and there is a lot of training material available for it.

The most popular implementation is a commercial, non-open-source one called Dyalog APL. I wasn't thrilled to be using a closed-source implementation, but it's just for learning purposes so I supposed it was fine. I started following along with this tutorial. I got about half way through, and started to feel like I was probably competent enough to try some AOC problems in APL.

I immediately ran into trouble again, but this time with APL's tooling. I have two basic requirements for a programming language for AoC:

  1. I can put the code for each day in a file.
  2. I can run the code from inside VS code fairly easily.
  3. I can type the weird symbols of the language from within VS code. (Oops, this one is new this year.)

I forgot to mention that Uiua's tooling was pretty great. No complaints; I installed the extension and everything worked as expected.

APL tooling is weird. It's not really designed like modern programming languages. Instead, all coding is supposed to be done in workspaces. I was pretty frustrated by this and I eventually gave up.

In retrospect, I may have been able to get by with dyalogscript. But the unpolished nature of the tooling, at least for how programming languages are used in this century, was a big turn off.

BQN

Finally, I landed on BQN. Here is the website's description:

Looking for a modern, powerful language centered on Ken Iverson's array programming paradigm? BQN now provides:

  • A simple, consistent, and stable array programming language
  • A low-dependency C implementation using bytecode compilation: installation
  • System functions for math, files, and I/O (including a C FFI)
  • Documentation with examples, visuals, explanations, and rationale for features
  • Libraries with interfaces for common file formats like JSON and CSV

And here's a quick example from the website.

BQN
BQN

If you are thinking that all of these languages look pretty similar, you're right.

BQN had a lot going for it. It wasn't stack based. The tooling seemed pretty good. Not only is the language designed to be used from files, you can even use multiple files. Welcome to the 21st century baby!

I also liked the name. The whole point of learning this array-programming paradigm was to be able to write short, concise code. So the idea of answering "Big Questions" was appealing.

BQN has a lot of documentation. There are a few tutorials intended for new users, but most of the documentation is of the, well, documentation variety. It's not a tutorial, but a reference manual. It's written by an absolute array programming expert, for other array programming experts. It's not the most beginner friendly.

So it was a pretty rough learning curve. I quickly joined the APL language discord and started asking a lot of questions. People there are very patient and helpful, thankfully! I also found some other people working on AOC, and I spent a lot of time unraveling their solutions.

I just finished Day 9 of AOC 2024 in BQN. It's December 20th, so obviously I'm pretty far behind. I'm not sure if I'll finish this year; I've been trying to embrace learning the array-oriented way of thinking, which has been challenging and slow.

Readability

I've been slowly getting better at reading others' BQN code, but it's hard. There are a lot of symbols to remember, but that's really not the main problem for me. Instead, it's very difficult to "parse" where parentheses should be placed. It can also be difficult to follow the general flow of very terse code.

Here's a snippet of code from RubenVerg who is a genius when it comes to tacit coding in BQN.

in•file.Chars "input/8.txt"

P(¬-˜×·+`»>)
OutOfBounds´(0>)

Parse>' '<P

Part1{𝕊grid: ¬(gridOutOfBounds)¨/{(𝕨(=(grid)'.'grid˜)𝕩)/𝕩-˜2×𝕨}˜ grid}Parse

(Holy smokes, my formatter actually supports BQN!)

Part1 is a function that is composed with Parse. So it will Parse the input and the result will be bound to grid inside the curly brackets.

I have my doubts that anyone can read this code. Rather, you can reverse engineer it by breaking it down into smaller pieces and understanding each piece. But it's not easy to read, even if you understand what all the symbols mean.

Tacit coding

According to the BQN documentation:

Tacit programming (or "point-free" in some other languages) is a term for defining functions without referring to arguments directly, which in BQN means programming without blocks.

The idea of tacit coding is kind of cool. You basically avoid applying functions and instead compose and otherwise modify them.

BQN has a function composition operator just like you would imagine. But a lot of tacit code uses trains. For a pretty poor introduction to trains, you can view this page. But let me spell out the basics.

A 2-train is two adjacent functions, and by definition fg evaluates to f∘g (f composed with g). In other words, to evaluate fg on an input 𝕩, we could use g(f(𝕩)). (A BQN programmer would never write that and would instead just use g f 𝕩.)

3-trains, which consist of three adjacent functions, are where things get fun. Again, by definition, evaluating fgh on input 𝕩 evaluates to (f 𝕩) g (h 𝕩). This is not very intuitive, but it's useful for a couple reasons:

  • 𝕩 appears twice, so you can use it to avoid writing a long expression multiple times
  • There's a fork, so you can combine two different behaviors
  • g acts as a summarizer or combiner

Here's an example I used in my solution to Day 9:

(×()) arg

The 3-train consists of , ×, and (↕≠). In my program, arg is actually an extremely long expression, and I did not want to write it twice. Let's expand the train:

( arg) × ( arg)

is the identity function, so it returns arg. × is multiplication. returns the length of its (right) argument, and returns the list of numbers from 0 to one less than its argument. So this train multiplies each element in arg (it's an array) by its index. Pretty cool, huh?

The trouble is that when reading and writing BQN code, it can be difficult to identify trains. I've been getting better, but I still find myself inserting a whenever my code doesn't work, since function composition will "stop" a train from forming when it wasn't intentional. Now look at RubenVerg's code above and think about all the trains. Even if you understand the symbols, it's not easy. This is very much a learned skill!

Here's a very basic example of how parsing influences trains. BQN evaluates from right to left. So if you write fgh 𝕩, that actually means f(g(h(𝕩))) and there is not a train. But (fgh) 𝕩 is completely different and fgh is a 3-train. Now again, look at RubenVerg's code and try to figure out the implied parentheses. Good luck!

Documentation

I found BQN's documentation to be very thorough, but not very beginner friendly. I think it's written by an expert for other experts. In some cases, it seems to pontificate and misses basic definitions. For example, the trains page doesn't directly define 2- and 3-trains. You can probably figure out the definition from the examples, but it's not ideal.

On the plus side, many documentation pages feature very intuitive diagrams. See the below diagram in Group for an example.

Cool features

There are some cool features in BQN. I'm not going to cover all of them, but here are ones that stand out based on my programming career.

Group

The Group operator is pretty nifty. Here is a nice diagram that intuitively depicts an example.

Group
Group

In BQN, 𝕨 is the left argument and 𝕩 is the right argument. Usually 𝕩 is some existing structure you want to analyze, and 𝕨 is a list of indices that you construct to define the groupings you want. If you want elements to be placed in the same group, you assign them the same index.

This is a powerful capability. For example, in today's AOC problem, there was a string like 00...111...2...333.44.5555.6666.777.888899 where group 0 had size two, group 1 had size 3, and so on. One easy way to determine the size of each group in BQN is using Group . If you first change each . to a -1, you can use the same array as both arguments with ⊔˜ to get the groupings, and ≠¨⊔˜ to get the length of each grouping. ( means length and ¨ means to modify the function to the left to apply to each element of the array to the right.)

Under

Under is an interesting capability that is a bit tricky to explain. Here is the official explanation from the documentation:

The Under 2-modifier expresses the idea of modifying part of an array, or applying a function in a different domain, such as working in logarithmic space. It works with a transformation 𝔾 that applies to the original argument 𝕩, and a function 𝔽 that applies to the result of 𝔾 (and if 𝕨 is given, 𝔾𝕨 is used as the left argument to 𝔽). Under does the "same thing" as 𝔽, but to the original argument, by applying 𝔾, then 𝔽, then undoing 𝔾 somehow.

So to restate, there is a transformation or selection operation, 𝔾, and a modification transformation 𝔽. There are different applications, but I always used this to transform or change part of an array. In that case, 𝔾 might be a filter, and 𝔽 describes how you want to change the array.

Here's an example from today's AOC again:

ReplaceNegWithNegOne  {¯1¨((𝕩<0)/) 𝕩}

I'm not going to try to explain all the syntax. But 𝔾 is ((𝕩<0)⊸/); this says to filter 𝕩 so that only elements less than 0 remain. 𝔽 is ¯1¨, which means return negative one for each argument. So, put together, replace negative elements with negative one. I then used the resulting array as an index to Group , which ignores any element with an index of negative one.

This is kind of neat because values are immutable in BQN, and this provides an efficient way to change part of them. I assume that the implementation uses this to avoid making copies of the whole array.

Not So Cool Features

One thing that annoys me about BQN is that a number of basic functions are not built-in because they can be succinctly expressed.

For example, want to split a string? You'd better memorize

x0((-˜+`׬)=)y1 #Split y1 at occurrences of separators x0, removing the separators

Want to build a number from an array of digits? You can use

10×+˜´d1 # Natural number from base-10 digits

These both come from BQNcrate, a repository of useful functions you could but probably don't want to derive yourself. I'd much rather see this in a standard library of some sort. Most of these are cool, and it's fun to see how they work. But when I'm actually coding, I don't want to look these up or try to derive them. I just want to say split the string by ' ' and move on.

I don't think I'm alone. I've noticed that RubenVerg for example likes to use •ParseFloat to parse integers rather than

(10×+˜´-'0')d1 #Parse natural number from string

which doesn't exactly roll off the tongue.

Ed's Feelings on BQN

Unfortunately, I don't have fun programming in BQN. There, I said it. I've literally felt very stupid at times trying to figure out how to write a simple function.

BQN is challenging, and I like challenges. But it's a fine line. There is an intense gratification to stringing together a whole bunch of opaque symbols that very few other people can read. But it's also frustrating and demoralizing to spend hours trying to figure out how to solve a basic problem.

It's hard to say how much of this is just part of learning a new paradigm. I remember when I first learned OCaml as a graduate student and had to figure out how to think functionally and decode arcane type errors involving parametric polymorphism. At the time, it was hard (and probably not that fun, but I can't remember). Now it's second nature. Maybe BQN will become second nature if I stick with it.

Conclusion

I probably won't be using BQN for any real projects any time soon. But I haven't given up on it entirely. I may try to finish AOC 2024 in BQN. We'll see. Given the lack of fun I've been having, I can't say I'm extremely motivated to do so. So for now I'll be taking things one day at a time.

If you're curious about my BQN code, you can find my AOC 2024 solutions here.

One of the most exciting possibilities of AI and LLMs are agents: tools that allow LLMs to interact with various tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.

In this blog post, we'll take a look at how to build agents using LangChain. They'll work great using an OpenAI model. And then we'll try to run them locally using Ollama, using a variety of open models. And they will almost all fail miserably. They fail so bad that I created this blog post to convince myself I wasn't imagining things.

In a future blog post, we will examine why.

LangChain

LangChain is a framework that allows you to build LLM applications. Basically, it abstracts a bunch of different components like LLMs, vector stores, and the like, and allows you to focus on your application's logic. So, you might develop your application in LangChain while using a local LLM to run it, but then use Claude once you go to production.

Anyway, using LangChain to make a query is pretty simple.

!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage

from langchain_openai import ChatOpenAI

# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

response=zero_temp_gpt35.invoke("Hi!  What is your name?").content

import textwrap
print(textwrap.fill(response))
Hello! I am a language model AI assistant. How can I assist you today?

The beauty of LangChain is that the components are modular. We can replace gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!

Building an Agent using LangGraph

LangGraph is the portion of LangChain for building agents. It allows us to easily define new tools:

!pip install langgraph~=0.2.53
@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]

The @tool decorator automatically transforms the function into a schema that can be used by the LLM to decide whether to invoke the tool, and if so, how.

foobar.tool_call_schema.model_json_schema()
{'description': 'Computes the foobar function.',
 'properties': {'input': {'title': 'Input', 'type': 'integer'}},
 'required': ['input'],
 'title': 'foobar',
 'type': 'object'}

With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:

from langgraph.prebuilt import create_react_agent

def react_chat(prompt, model):
  agent_executor = create_react_agent(model, tools)

  response = agent_executor.invoke({"messages": [("user", prompt)]})
  return response['messages'][-1].content, response

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The result of evaluating foobar(30) is 32.

Yes! We did it, team! 🎉 We could change magic_function to be a web search, a database lookup, or you name it.

Let's try a query that doesn't use a tool at all.

last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"
Hello! How can I assist you today?

Great. So, in theory, we have an agent that we can chat with and is able to call tools in order to help us out.

Ollama: Going Local

Now let's try to create a tool-wielding agent using a LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It will use any GPUs that you might have, but it's still usable even if you don't have any. After all, you're just performing inference, not training.

Here's an example of me running llama 3.2 with ollama on my work laptop.

root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?

>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!

>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)

You can find instructions on how to install Ollama on the Ollama webpage.

If you don't feel like installing anything, that's fine too. You can follow along with this notebook.

After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.

# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2

Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.

from langchain_ollama import ChatOllama

# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi!  What is your name?").content

print(textwrap.fill(response))
I don't have a personal name, but I'm an AI designed to assist and
communicate with users. I'm often referred to as a "language model" or
a "chatbot." You can think of me as a helpful computer program that's
here to provide information, answer questions, and engage in
conversation. What's your name?

Okay, looking good! This is not bad for a 3B parameter LLM that can easily run locally on our computer. Let's see if it can call our tool-wielding agent.

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The output of `foobar(30)` is 32.

🎉 Everything is working well so far. As one final check, let's ask the agent a question that has absolutely nothing to do with tools.

last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"
The input value 42 was doubled, resulting in 84.

So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.

import pprint
pprint.pprint(result)
{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'),
              AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}),
              ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'),
              AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}

We can see there are four messages:

  1. The HumanMessage is the user's message -- "Hi."
  2. In response, in the AiMessage, the LLM indicates that it would like to invoke a tool by setting the tool_calls field.
  3. LangChain invokes the tool and records the result in the ToolMessage, which is given back to the LLM.
  4. The final AiMessage includes a written message for the user.

The problem of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!

Ed's Really Dumb Tool-calling Benchmark ™️

I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low hanging fruit in this area. (I am calling it a "benchmark" facetiously!)

Here are the questions:

  1. Can the react agent use a tool correctly when explicitly asked? (Yes is good.)
  2. Does the react agent invoke a tool when it shouldn't? (No is good.)
  3. Does the react agent lose the ability to answer questions unrelated to tools? (No is good.)
  4. Does the react agent lose the ability to chat? (No is good.)

Question 1: Can the react agent use a tool correctly when explicitly asked?

We'll use our example above to test this.

basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
  last_msg, _ = react_chat(basic_tool_question, model=model)
  return "32" in last_msg

Question 2: Does the react agent invoke a tool when it shouldn't?

We'll perform two simple tests to answer this question. We'll prompt the agent with both a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and a greeting, "Hello!" We'll then check the response to see if the model produces a ToolMessage, which indicates that the model chose to invoke a tool. By construction, neither of those prompts should induce a tool call.

from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
  _, result = react_chat(basic_arithmetic_question, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
  _, result = react_chat(greeting, model=model)
  return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
  return q2a(model) and q2b(model)

Question 3: Does the react agent lose the ability to answer questions unrelated to tools?

To answer this, we'll ask the basic arithmetic question to the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, ideally, the agent and the underlying model should be able to solve the problem under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it because I'm such a nice guy. 😇

def q3a(model):
  result = model.invoke(basic_arithmetic_question)
  return "12243" in result.content

def q3b(model):
  last_msg, _ = react_chat(basic_arithmetic_question, model=model)
  return "12243" in last_msg

def q3(model):
  # q3a ==> q3b: If q3a, then q3b ought to be true as well.
  return not q3a(model) or q3b(model)

Question 4: Does the react agent retain the ability to chat?

To answer this, we'll greet the agent and attempt to determine if it responds properly. This is a little difficult to do in a comprehensive way.

basic_greeting = "Hi."

def q4(model):
  last_msg, _ = react_chat(basic_greeting, model=model)
  r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
  #if not r:
    #print(f"Debug: Not a greeting? {last_msg}")
  return r

Benchmark code

Here is code to run the experiments a couple of times.

from tqdm.notebook import tqdm

def do_bool_sample(fun, n=10, *args, **kwargs):
  try:
    # tqdm here if desired
    return sum(fun(*args, **kwargs) for _ in (range(n))) / n
  except Exception as e:
    print(e)
    return 0.0

def run_experiment(model, name, n=10):
  do = lambda f: do_bool_sample(f, model=model, n=n)
  d = {
      "q1": do(q1),
      "q2": do(q2),
      "q3": do(q3),
      "q4": do(q4),
      "model": name
  }
  d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
  return d

def print_experiment(results):
  name = results['model']
  print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
  print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
  print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
  print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name):
  results = run_experiment(model, name)
  print_experiment(results)
  return results

Benchmarking Llama 3.2

Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.

llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}

As we saw above, Llama 3.2 is able to call functions (Q1), but does so even when it should not be (Q2). Question 3 shows that even though it almost always decides to call a tool, this usually does not stop it from being able to answer basic questions. It does prevent it from being able to chat (Q4).

Benchmarking OpenAI's gpt-3.5-turbo and gpt-4o

Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.

gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0,
 'q2': 0.0,
 'q3': 0.9,
 'q4': 1.0,
 'model': 'gpt-3.5-turbo',
 'total': 2.9}

Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1), and unlike Llama 3.2, can still engage in chat (Q4). A bit surprisingly, it still invokes tools when it shouldn't, however (Q2). But it is smart enough to ignore their results when constructing its final response.

Let's try a newer model, gpt-4o.

gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}

GPT 4o nailed it! 👏

Benchmarking a Lot Of Ollama Models

Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.

ollama_models = [
    "hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
    "llama3.3:70b",
    "llama3.2:3b",
    "llama3.1:70b",
    "llama3.1:8b",
    "llama3-groq-tool-use:8b",
    "llama3-groq-tool-use:70b",
    "MFDoom/deepseek-v2-tool-calling:16b",
    "krtkygpta/gemma2_tools",
    "interstellarninja/llama3.1-8b-tools",
    "cow/gemma2_tools:2b",
    "mistral:7b",
    "mistral-nemo: 12b",
    "interstellarninja/hermes-2-pro-llama-3-8b-tools",
    "qwq:32b",
    "qwen2.5-coder:7b",
    ]

all = []

for m in ollama_models:
  print(f"Downloading model: {m}...")
  !ollama pull {m} 2>/dev/null
  print("done.")
  r = run_and_print_experiment(ChatOllama(model=m), m)
  !ollama rm {m}
  all.append(r)
  print(r)
Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}
Downloading model: llama3.3:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}
Downloading model: llama3.2:3b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
[?25l[?25l[?25h[?25hdeleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}
Downloading model: llama3.1:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
[?25l[?25l[?25h[?25hdeleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}
Downloading model: llama3.1:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
[?25l[?25l[?25h[?25hdeleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}
Downloading model: llama3-groq-tool-use:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}
Downloading model: llama3-groq-tool-use:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}
Downloading model: MFDoom/deepseek-v2-tool-calling:16b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}
Downloading model: krtkygpta/gemma2_tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}
Downloading model: interstellarninja/llama3.1-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
[?25l[?25l[?25h[?25hdeleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}
Downloading model: cow/gemma2_tools:2b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}
Downloading model: mistral:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
[?25l[?25l[?25h[?25hdeleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}
Downloading model: mistral-nemo: 12b...
done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
[?25l[?25l[?25h[?25hError: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}
Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
[?25l[?25l[?25h[?25hdeleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}
Downloading model: qwq:32b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}
Downloading model: qwen2.5-coder:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
[?25l[?25l[?25h[?25hdeleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])
Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b',
  'q1': 1.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 3.9},
 {'model': 'llama3-groq-tool-use:8b',
  'q1': 1.0,
  'q2': 0.8,
  'q3': 1.0,
  'q4': 1.0,
  'total': 3.8},
 {'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
 {'model': 'qwen2.5-coder:7b',
  'q1': 1.0,
  'q2': 0.4,
  'q3': 0.8,
  'q4': 1.0,
  'total': 3.2},
 {'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S',
  'q1': 0.0,
  'q2': 1.0,
  'q3': 0.9,
  'q4': 1.0,
  'total': 2.9}]

There's a lot going on here. But I have a few general observations:

  1. Most models do really poorly. The average "score" was ~2. And even a perfect score of 4.0 is really the bare minimum of what I would expect a decent model to do. And most of the models I tested claim to "support" tool calling.

  2. Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!

  3. I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.

Berkeley Function-Calling Leaderboard

There are several benchmarks that test tool use in LLMs. But most of them are not designed to test tool usage as in an interactive agent. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark that tests this type of behavior. And it was only added in September 2024 as part of BFCL V3.

Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.

image.png

Other Reports

YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, he also found that Llama 3.3 did not have the same problem though, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.

So What Is The Problem?

Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?

We'll explore the answer in a future blog post. Stay tuned!

Edward J. SchwartzComputer Security Researcher1 min. read

I recently signed up for BlueSky. I just learned of a new service, EchoFeed, that polls RSS feeds and posts the content to BlueSky (and elsewhere). So, this is a test.

Will it post to BlueSky? Will fed.brid.gy mirror it to Mastodon? The suspense is killing me!

blog image
Edward J. SchwartzComputer Security Researcher1 min. read

This page documents my experience with "pressure washing" my vinyl fence and siding. I have pressure washing in quotes, because it's SH or sodium hypochlorite (or bleach) that does the bulk of the work. Pros often call this "soft washing".

For vinyl fence soft washing, you want around 1-2% SH. Most household bleach is 6% SH, so if you mix 1 part bleach with 5 parts water, you'll get around 1% SH.

You also want to use a surfectant to help the mixture stick to the fence. I used Dawn Ultra. Some people claim that some dish soaps will cause a bad reaction with the bleach, ranging from "mustard gas" to neutralizing the bleach.

I personally found that at 1-1.5% SH, the mixture was safe to use around grass. I wet the grass before and after applying the mixture, and I didn't see any damage.

Supplies

Recipe

  1. Add 3 cups of 6% SH bleach
  2. Add 0.8 gallons of water
  3. Add 2 fl. oz. of Dawn Ultra

Make sure to put the soap in last, or your mixture will foam up and overflow the sprayer when you try to close it.

Spray the mixture on the fence, let it sit for about five minutes, and then rinse it off. You can use a garden hose, but I personally found that using a Ryobi One+ EZ-Clean worked better. I'm sure a pressure washer would have been even faster, but it is less convenient to use.

That's about it. This removed most of the staining.

For some areas that had large amounts of growth, I used a Ryobi Scrubber to physically remove it before spraying.

The bleach was not able to remove all stain spots. For those remaining spots that were in conspicuous places, I used a magic eraser / melanine sponge.

Before
Before
After
After
blog image
Edward J. SchwartzComputer Security Researcher1 min. read

At some point, I hope to create a Notes section on my website that will turn Markdown files into a list of notes. This is basically how the blog works. But, I'm kind of busy. And since Gatsby seems like it's dead, I'm not sure that I want to invest a whole lot of time into it. (Although putting the notes in markdown seems like a good idea for compatibility.)

Anyway, here is my first very short note on Profiling.

SpeedScope

SpeedScope is an awesome tool for visualizing profiler output. It has a flame graph view that is wonderful. I also like to use the Sandwich view, sorting by total time and simply looking for the first function that I recognize. This is often the culprit.

The documentation is pretty good. It also shows how to record profiles in compatible formats for most platforms. I mostly use py-spy and perf.

Java

The one notably missing platform is Java! Luckily, it's not too hard to convert Java's async-profiler output to a format that SpeedScope can read. Here's how I do it:

  • Download async-profiler
  • Create an output file in collapsed format
  • You can do this in several ways, such as:
  • ./asprof start -i 1s Ghidra followed by ./asprof stop -o collapsed -f /tmp/out.prof.collapsed Ghidra
  • ./asprof collect -d 60 -o collapsed -f /tmp/out.prof.collapsed Ghidra
  • Then open out.prof.collapsed in SpeedScope.
Screenshot of profiling Ghidra
Screenshot of profiling Ghidra

The collapsed format takes a while to parse, so it might be worth it to export the native SpeedScope format.

blog image
Edward J. SchwartzComputer Security Researcher5 min. read

Fostering

My wife and I started fostering rescue dogs mostly by accident. We adopted a Shih-tzu mix, who seemed completely relaxed when we met her at the rescue. When we got home, we eventually figured out that she was terrified, and freezing was her coping mechanism. Once she got a little more comfortable, she started hiding from us in the house. It took many months, but Molly eventually warmed up to us and we gained her trust.

Our first rescue dog, Molly
Our first rescue dog, Molly

For as frustrating an experience it is to have a dog be absolutely terrified of you for no reason, it was also incredibly rewarding to see her come out of her shell. We decided that we wanted to help other dogs in similar situations, so we started fostering. It's been several years, and we've fostered dozens of dogs. For a long time, we fostered for A Tail to Tell, which unfortunately recently closed. More recently, we have been fostering for Lucky Dawg Animal Rescue.

Blanche

This week, however, we had a "first", and not a good one. We picked up a new foster dog, Blanche, on Sunday. We have a nice fenced in yard, and we immediately took Blanche out back into the yard. She was very skittish, and we gave her some space. She immediately dove into a row of large evergreen trees in our yard and hid. Eventually, I had to go in and carry her out, which was not a simple task given the size of the trees.

Blanche
Blanche

The next morning, I let Blanche out and saw her run into the same trees. It was very hot, and after about an hour, I started to grow concerned and went out to look for her. She was not in the same tree as last time, but I figured she was hiding in another one. There are several trees, and it is fairly difficult to see into them. I started exhaustively searching the trees, and I couldn't find her. I also found a part of our fence that was slightly pushed out, as if something had forced its way out. She had escaped.

The hole
The hole

The Search: Monday (Day One)

My wife and I were very upset, but we shifted into action. We reported Blanche as missing on PawBoost, our community Facebook page, called the local police department, and notified our rescue, Lucky Dawg Animal Rescue. We quickly began receiving sighting reports of Blanche. She was initially seen at approximately 10am in a wooded area next to a busy road. I went to the area and searched for her while my daughter and wife started talking to neighbors and handing out flyers. I saw no sign of her in the wooded area.

We spent the rest of the day trying to put up posters on telephone poles, which is harder than it seems! Tape doesn't adhere very well to dirty telephone poles. The trick is to tape all the way around the pole so that the tape sticks to itself.

We received another sighting report at 5pm, this time on the other side of the busy road. The report was of a dog "playing/fighting" with a chicken. Fortunately, I knew which house this was at from the description. Around the same time, two members of our rescue arrived to help. We went to the house and I talked to the owner, who revealed that Blanche had attacked her chicken. She chased Blanche off, and Blanche ran into the woods. While I was talking to the owner, our rescue members saw Blanche in a large field nearby. Blanche spotted them, and ran into a wooded area near a creek. I was able to reach the other side of the wooded area, but the experts decided that it would be better to leave a food station for Blanche so she stayed in the area rather than try to chase her.

The Search: Tuesday (Day Two)

The next morning we received a few sightings of Blanche near our neighborhood again. More surprisingly, my wife left our fence gate open and saw Blanche sniffing around the fence around 8 am. Unfortunately, Blanche ran off. It was very hot, and Blanche presumably slept during the day.

That evening, the rescue returned and put out a trap with lots of food, and a remote camera to monitor it. We cranked up the volume on my phone so that every time the camera detected movement, we would wake up. We were woken up several times, but it was mostly false positives. At 1:59am, a cat wandered by. At some point, the cat triggered the cat, and at 2:28am we were greeted by a picture of Blanche studying the cat in the trap.

Cat in the trap
Cat in the trap

I snuck outside, and saw Blanche eating food about 10 feet away from me. It was frustrating to be so close but not be able to do anything. But our rescue members told us that it was safer to make her feel safe and comfortable with the trap. She clearly enjoyed the food that we had put out for her. She would be back.

After Blanche finished eating, we freed the cat from the trap and added more food, but we didn't see Blanche again that night. Stupid cat!

The Search: Wednesday (Day Three)

The next morning, Blanche was sighted in many of the same places, including near our house. Unfortunately, she was also seen crossing the road again. We were worried that a car would hit her. She slept during the day again. In the evening, there was heavy rain, and we decided to wait until after the rain to put the trap out. Naturally, Blanche showed up during the rain, and we missed an opportunity to catch her. Fortunately my wife had put out a couple pieces of food, so she didn't leave empty handed.

Blanche in the rain
Blanche in the rain

We armed the trap again and waited. She came back around 9:43pm and began investigating the trap. She was very cautious, and decided to yank out the towel on the bottom of the trap. She stayed for a very long time, as we waited in suspense for her to trigger the trap. Eventually she entered the trap, but it didn't trigger for some reason. It was frustrating, but not the end of the world. She would begin to think the trap was a safe source of food.

Blanche investigating the towel she had removed
Blanche investigating the towel she had removed

Blanche stayed around for a long while, but eventually left. We examined the trap. When she yanked the towel, it actually disarmed the trap without activating it. We fixed the trap and waited again, hoping she would be back later that night.

At 2:05am, she came back and began to investigate the trap again. We waited for what seemed like forever. At 2:15am, we received this picture of her deep in the trap.

Blanche in the trap
Blanche in the trap

The camera we were using would take a burst of three pictures every time it detected motion. After that picture, we didn't receive any more pictures. This could mean that she was in the trap, or that the trap had not triggered and she had left. We waited for a while, and then snuck outside to check. We had caught her! We carried the trap inside and put her back into her exercise pen, and finally got some sleep.

The Aftermath

Blanche is not very happy to be back inside, but she ate, drank, and is safe. She hasn't been very lively yet, but this is not uncommon with mill rescue dogs, who often need a few days before they start to interact and show personality. Hopefully she'll quickly realize that we're not so bad, and that she is safe and sound in our house.

We have a long list of interesting stories from our years fostering rescue dogs, but this was certainly one of the more interesting and stressful ones. On the positive side, we met a lot of our neighbors, and we were pleasantly surprised by how helpful and supportive they were, without exception. Many people wanted to help in whatever way they could. We really live in a cool little community.

Edward J. SchwartzComputer Security Researcher2 min. read

In my last post I talked about how I have been using Ansible for my new laptop configuration, and shared my configuration for notion.

So far, I've been extremely happy with using Ansible for configuring my machine. Prior to using Ansible, I'd spend a fair amount of time creating detailed notes that described what I did. I estimate that creating Ansible recipes takes about the same time as keeping good notes, and maybe even less. That's because there are many existing roles for common settings and software that can be reused. As with any ecosystem, the quality of such roles varies.

The big difference between my notes and Ansible, though, is that Ansible playbooks can be played in minutes, whilst manually following my notes can take hours to set up an entire new machine. I used to dread the idea of configuring a new machine. But now it's fairly effortless.

I just publicly shared my Ansible configuration. I don't expect that anyone will use my configuration as is, any more than I expect anyone to use my notion configuration! I'm extremely opinionated and picky. But I do hope that it might give people some ideas, like how to install llvm, nvidia drivers and so on. I know I personally found other people's repositories to be helpful.

In a very similar vein, I've started using dorothy, which claims to allow you to "... bring your dotfile commands and configuration to any shell." Since I usually but not always use fish, I've always been hesistant to write my own commands in fish. Plus, I have been writing bash scripts for long enough that I'm decent at it, so it tends to be one of my go-tos. Dorothy makes it easy to define variables and commands in such a way that they magically appear in all shells. (Again, this is very useful for fish, which is not a posix-compliant shell.) There's also a fair number of useful built-in commands. Dorothy encourages users to split their dotfiles into public and private portions, and you can view my public dotfile here. Specifically, here are my custom commands. Some of these might be useful, such as setup-util-ghidra and setup-util-ghidrathon. I've found that having a designated spot for these types of utility commands encourages me to write them, which ultimately saves me time. Usually.

Edward J. SchwartzComputer Security Researcher3 min. read

Sometime while I was in graduate school, I started using the notion window manager. (Actually, at the time, I think it was ion3.) Notion is a tiling window manager that is geared towards keyboard usage instead of mouse usage. But let's be honest: I've been using notion for so long that I simply prefer it over anything else.

Notion, like most minor window managers, is a bit spartan. It does not provide a desktop environment. It really just manages windows. There are some features of a desktop environment that I don't need, such as a launcher. I know all the commands that I use; I don't need a GUI to list them for me. But it's often the little things that get you, such as locking the screen, or using the media keys on your keyboard to adjust the volume. I used to be (more of) a hardcore nerd and relished in my ability to craft a super-complex .xsessionrc file with all kinds of bells, whistles and utilities connected as if with duct tape. But as I grow older, sometimes I just want my computer to work.

For a long while now, I've found that running notion alongside GNOME for "desktop stuff" to work pretty well. For a long time, I followed an old Wiki post about how to combine GNOME with Awesome WM. This worked really welll with GNOME 2.

Many people say that you can't use GNOME 3 with window managers other than GNOME Shell. I've actually had pretty good luck copying the Ubuntu gnome-session and replacing Gnome Shell with notion. The above Awesome WM Wiki also shows how to do it. Unfortunately, I've found that some features do not work, such as the keyboard media keys, much to my dismay. Do media keys matter that much? Yes, yes, they do. This apparently broke when GNOME Shell started binding the media keys instead of gnome-settings-daemon. There used to be a gnome-fallback-media-keys-helper utility around that would simulate this behavior, but it seems to have disappeared.

As I was trying to fix this problem, I came across a blog post and an unrelated but similar github repo both describing how to use the i3 window manager with GNOME. TLDR: GNOME Flashback is a still supported variant of GNOME that is explicitly designed to support third-party window managers. Whereas GNOME Shell incorporates both the window manager and other stuff such as handling media keys, GNOME Flashback has the "other stuff" in a separate component that is designed to be used with a window manager such as metacity. But it works just fine with notion! Best of all, both my media keys and screen locking work. Hurray!

Because I hate setting up stuff like this, I've actually been hard at work packaging up my Linux computer configuration into reusable ansible components. It takes a little longer than doing it manually of course, but it's not too bad and it's pretty easy to read. I'm making my notion role available here in case anyone wants to try out my setup. Most of the logic is here if you are curious what is involved. Below are a few snippets to show how Ansible makes it relatively easy to manipulate configuration files.

# Same thing, but for gnome-flashback

- name: Copy gnome-flashback-metacity.session to notion-gnome-flashback.session
  copy:
    src: /usr/share/gnome-session/sessions/gnome-flashback-metacity.session
    dest: /usr/share/gnome-session/sessions/notion-gnome-flashback.session

- name: 'notion-gnome-flashback.session: Change metacity to notion and add stalonetray'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: 'metacity'
    replace: notion;stalonetray

- name: 'notion-gnome-flashback.session: Remove gnome-panel'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: ';gnome-panel'

- name: Symlink systemd target for notion-gnome-flashback session to gnome-flashback-metacity session
  file:
    src: /usr/lib/systemd/user/gnome-session@gnome-flashback-metacity.target.d
    dest: /usr/lib/systemd/user/gnome-session@notion-gnome-flashback.target.d
    state: link

- name: Install gconf override for notion-gnome-flashback
  copy:
    src: notion-gnome-flashback.gschema.override
    dest: /usr/share/glib-2.0/schemas/01_notion-gnome-flashback.gschema.override
  notify: Compile glib schemas
- name: Set META
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--META='
    line: META="Mod4+"
    backup: true
- name: Set ALTMETA
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--ALTMETA='
    line: ALTMETA="Mod1+"
    backup: true
- name: Disable mod_dock
  lineinfile:
    path: /usr/local/etc/notion/cfg_defaults.lua
    state: absent
    line: 'dopath("mod_dock")'
    backup: true
- name: Enable mod_statusbar
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--dopath("mod_statusbar")'
    line: 'dopath("mod_statusbar")'
    backup: true
Edward J. SchwartzComputer Security Researcher5 min. read

I'm a serial car leaser. This December, when my old lease was running out, there were not a lot of very appealing lease deals. Some of the best deals were on electric vehicles (EVs). The US offers a tax credit on some EVs for $7500. To qualify, the vehicles must be assembled in North America. But some EV manufacturers that do not meet this requirement are offering a similar rebate out of their own pocket to avoid losing customers to their competitors who do qualify. In short, because many EVs come with a 7500 dollar rebate, either from a tax credit or otherwise, EVs were some of the best deals that I could find.

Naturally, the benefit to the environment was another selling point as well!

I ended up leasing a Hyundai Ioniq 6. It's a very slick car, and there are many aspects of it that I like that are not EV specific. But I'm not going to talk about those here. This post is about my initial thoughts on owning an EV.

EVs Are Fun to Drive

I knew that EVs could produce a great deal of Torque, but I didn't realize how much that impacted the driving experience. The Ioniq 6 can accelerate way faster than I ever need it to. This is useful for merging onto highways and such, but it's also just a lot of fun. I've owned cars with powerful motors, but the instant torque of an EV is a different experience. It's hard to imagine going back to driving a non-EV at this point.

EVs Are Quiet

The only noise that the Ioniq 6's engine makes is a whine that lets pedestrians know the car is nearby. When you turn on the car, the engine makes no noise. None. No ignition sound, no idling; nothing.

Charging At Home Is Nice

The Ioniq 6 includes a "level 1" charger that plugs into a standard 120V outlet. It's pretty slow; it takes about 48 hours to completely charge the battery from empty. But it's probably sufficient for most people's needs. I did have a problem one day when I didn't have enough time to fully charge the car at home before embarking on a long trip, so I had to stop at a charging station. But we'll talk about that in a bit.

Hyundai has a promotion where they will install a "level 2" charger for you, so I now have a "level 2" charger that charges the battery in a few hours.

My car's range is about 200 miles (we'll get to that in a bit), so unless I'm going on a long trip, I can easily charge my car at home. This is super convenient. I never have to go to a gas station. Every time I get in my car at home, it's fully charged and ready to go.

Public Charging Is Not As Convenient As Gas Stations

My car's range is about 200 miles, so for long trips, I need to stop and charge on the way. There are more public chargers than I thought, but they are not as common as gas stations, and they have different speeds. There are extremely fast, DC chargers that can charge a car in minutes, but they are not as common as the slower, level 2 chargers. In theory, my car can be charged in less than 20 minutes. But in practice, I have yet to experience that. Charging speed is greatly effected by temperature, and I have never experienced anything close to that optimal charging speed. The Ioniq 6 has a preconditioning system that is supposed to prepare the battery for high-speed charging by warming it up, but all I can say is that it hasn't worked well for me, and all of my public charging experiences have been significantly longer than 20 minutes.

The longest trip I have taken with my EV has been to Pittsburgh. This trip is largely on the PA turnpike. There are no chargers on the turnpike itself, but there are several DC charging stations a few minutes drive off the turnpike.

When you combine the time it takes to leave the turnpike, and the longer time it takes to charge an EV versus fill the gas tank, the bottom line is that long trips will take longer in an EV. On my first trip, I expected my car to charge in 20 minutes as advertised, and for the trip to take only slightly longer than usual. But it ended up taking me over an hour longer, which was upsetting.

You may have also seen various stories about charger stations having long lines during cold weather. This is a real problem, and it contributes to the unpredictability of long trips in an EV. It should get better over time as charging infrastructure becomes more adequate.

Range Anxiety

There are so many gas stations that in a internal combustion engine (ICE) car, having an exact range to empty is not that important. But because chargers are less prolific, you would think that all EVs would be able to tell you a fairly accurate estimate until empty. Well, that's simply not the case for the Ioniq 6. Battery performance is greatly effected by temperature, and the car's range drops significantly in cold weather. You would hope that the car, or the car's manual, would provide accurate guidance on how far it can go in various conditions. But they do not. Instead, you need to rely on personal experience or external guidance such as these tests from consumer reports.

I think that car manufacturers are shooting themselves in the foot here by not being more transparent. If you are going to advertise that your car has a 270 mile range, then it should have a 270 mile range in all conditions. If it doesn't, then you should be open about that. I think that the lack of transparency is going to make people less likely to buy EVs. The same principle applies to charging times. It's great if the car can charge in 20 minutes under ideal circumstances. But if you don't openly admit that it performs significantly worse in real-world conditions, then you are going to upset your customers.

I've read that Tesla cars can estimate their range very accurately. But I've been on long trips in which I've steadily watched the buffer between my car's remaining range and the distance to the closest charging station decrease in my Ioniq 6. It's not a good feeling at all, and from an engineering point, I think it's inexcusable.

Fuel Cost

Of course, one of the selling points of an EV is that you don't have to buy gas. You do have to pay for electricity, but it's significantly cheaper. Here's a comparison of how many miles I would get out of my Volvo S60 vs. my Hyundai Ioniq 6 for the cost of one gallon of gas. (The S60 actually takes 91+ fuel, so it would cost even more than the $3.95 listed.)

Comparison of fuel cost for Volvo S60 vs. Hyundai Ioniq 6
Comparison of fuel cost for Volvo S60 vs. Hyundai Ioniq 6

Conclusion

I'm very happy with my decision to lease an EV. I would do so again. But I think that manufacturers' decision to focus only on performance under ideal conditions is a mistake that will ultimately result in more people being leery of buying EVs. I know it has colored my experiences (negatively). My ultimate take away is this: If you are driving long distances in an EV, plan for your trip to take more time. We have a two car household, and I'm not sure that I would be ready to replace both with an EV at this point. But for everything but long trips, I think that EVs are the way to go.

Edward J. SchwartzComputer Security Researcher1 min. read

I was playing around with ChatGPT and it came up with this gem:

In the thrilling world of ones and zeros, where the mere thought of assembly code sends shivers down the spines of mere mortals, there exists a unique breed of masochists—reverse engineers. Are you the kind of person who looks at a complex problem and thinks, "Why not make it harder?" If the idea of disassembling convoluted code into something even remotely understandable without any upfront instructions sounds like a fun Tuesday afternoon, then we have the perfect job for you.

Introducing the glamorous world of reverse engineering, where the nightmares of dealing with assembly code, hexadecimal hellscapes, and cryptic comments—or the lack thereof—don't just haunt you in your sleep; they're your 9 to 5. Ever wanted to spend hours, nay, days, unraveling the mysteries of code that was never meant to be understood, like an archeologist deciphering the Rosetta Stone, except what you're translating doesn't want to be translated? Congratulations, your masochistic tendencies can finally pay off.

In this role, you will dive headfirst into the abyss of other people's code, armed with nothing but your wits and perhaps an unhealthy dose of caffeine. You'll experience the unparalleled joy of finally understanding a function's purpose at 3 AM, only to realize it was a decoy. But fear not, for the fruits of your labor are many—aside from the occasional existential crisis. You'll gain the satisfaction of knowing you're outsmarting attempts to keep you out, one painstaking line at a time.

So, if you're someone who relishes the thought of your brain being tied into knots by assembly language, who finds the idea of decrypting software protections better than a crossword puzzle on a Sunday morning, and who sees a "No Trespassing" sign as a personal invitation, then the lucrative world of reverse engineering awaits. Where else can you get paid for indulging in your penchant for digital masochism? Join us, and turn your nightmares into your day job.

Ready to sign up? 🤣

Powered with by Gatsby 5.0