One of the most exciting possibilities of AI and LLMs is agents: programs that let an LLM call external tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.
In this blog post, we'll take a look at how to build agents using LangChain. They'll work great with an OpenAI model. Then we'll try to run them locally using Ollama, with a variety of open models. And they will almost all fail miserably. They fail so badly that I created this blog post to convince myself I wasn't imagining things.
In a future blog post, we will examine why.
LangChain is a framework that allows you to build LLM applications. Basically, it abstracts a bunch of different components like LLMs, vector stores, and the like, and allows you to focus on your application's logic. So, you might develop your application in LangChain while using a local LLM to run it, but then use Claude once you go to production.
Anyway, using LangChain to make a query is pretty simple.
!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env

from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage
from langchain_openai import ChatOpenAI
# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
response=zero_temp_gpt35.invoke("Hi! What is your name?").content
import textwrap
print(textwrap.fill(response))

Hello! I am a language model AI assistant. How can I assist you today?
The beauty of LangChain is that the components are modular. We can replace the gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!
LangGraph is the part of the LangChain ecosystem for building agents. It allows us to easily define new tools:
!pip install langgraph~=0.2.53

@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]

The @tool decorator automatically transforms the function into a schema that can be used by the LLM to decide whether to invoke the tool, and if so, how.
foobar.tool_call_schema.model_json_schema()

{'description': 'Computes the foobar function.',
 'properties': {'input': {'title': 'Input', 'type': 'integer'}},
 'required': ['input'],
 'title': 'foobar',
 'type': 'object'}
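Before wiring the tool into an agent, we can sanity-check it by calling it directly. Here's a minimal sketch using the tool's standard invoke method (functions decorated with @tool become runnables, so this should just work):

# Tools created with @tool can be invoked directly, which is handy for a quick sanity check.
print(foobar.invoke({"input": 30}))  # expected output: 32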
With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:
from langgraph.prebuilt import create_react_agent
def react_chat(prompt, model):
    agent_executor = create_react_agent(model, tools)
    response = agent_executor.invoke({"messages": [("user", prompt)]})
    return response['messages'][-1].content, response

last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"

The result of evaluating foobar(30) is 32.
Yes! We did it, team! 🎉 We could change foobar to be a web search, a database lookup, or you name it.
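For illustration, here's what a different tool might look like. This lookup_user function and its toy directory are made up for this example and aren't used anywhere in the experiments below:

# A hypothetical second tool (illustrative only; not part of the benchmark below).
@tool
def lookup_user(username: str) -> str:
    """Looks up a user's full name in a toy directory."""
    directory = {"asmith": "Alice Smith", "bjones": "Bob Jones"}
    return directory.get(username, "unknown user")

# It could be registered alongside foobar with: tools = [foobar, lookup_user]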
Let's try a query that doesn't use a tool at all.
last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"Hello! How can I assist you today?
Great. So, in theory, we have an agent that we can chat with and that can call tools in order to help us out.
Now let's try to create a tool-wielding agent using an LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It
will use any GPUs that you might have, but it's still usable even if you don't
have any. After all, you're just performing inference, not training.
Here's an example of me running Llama 3.2 with Ollama on my work laptop.
root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?
>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!
>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)

You can find instructions on how to install Ollama on the Ollama webpage.
If you don't feel like installing anything, that's fine too. You can follow along with this notebook.
After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.
# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2

Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.
from langchain_ollama import ChatOllama
# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi! What is your name?").content
print(textwrap.fill(response))

I don't have a personal name, but I'm an AI designed to assist and communicate with users. I'm often referred to as a "language model" or a "chatbot." You can think of me as a helpful computer program that's here to provide information, answer questions, and engage in conversation. What's your name?
Okay, looking good! This is not bad for a 3B parameter LLM that can easily run locally on our computer. Let's see how it does as the model behind our tool-wielding agent.
last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"The output of `foobar(30)` is 32.
🎉 Everything is working well so far. As one final check, let's say something to the agent that has absolutely nothing to do with tools.
last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"The input value 42 was doubled, resulting in 84.
So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.
import pprint
pprint.pprint(result)

{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'),
AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}),
ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'),
AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}
We can see there are four messages:
1. HumanMessage is the user's message -- "Hi."
2. AIMessage, in which the LLM indicates that it would like to invoke a tool by setting the tool_calls field.
3. ToolMessage, the tool's output, which is given back to the LLM.
4. AIMessage, which includes a written message for the user.

The problem, of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!
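If you want to see exactly which tool calls the model requested, you can walk the returned messages yourself. Here's a small sketch that reuses the result from the "Hi." run above:

# Inspect which tools the model asked for (sketch; uses `result` from the call above).
from langchain_core.messages import AIMessage
for msg in result['messages']:
    if isinstance(msg, AIMessage) and msg.tool_calls:
        for call in msg.tool_calls:
            print(f"Requested tool {call['name']} with args {call['args']}")

For the run above, this prints that the model requested foobar with args {'input': 42} -- a tool call that "Hi." clearly doesn't warrant.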
I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low hanging fruit in this area. (I am calling it a "benchmark" facetiously!)
Here are the questions:

1. Can the react agent use a tool correctly when explicitly asked?
2. Does the react agent invoke a tool when it shouldn't?
3. Does the react agent lose the ability to answer questions unrelated to tools?
4. Does the react agent lose the ability to chat?

To answer Question 1, we'll reuse our foobar example from above.
basic_tool_question = "Please evaluate foobar(30)"

def q1(model):
    last_msg, _ = react_chat(basic_tool_question, model=model)
    return "32" in last_msg

To answer Question 2, we'll perform two simple tests. We'll prompt the agent with both a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and a greeting, "Hello!" We'll then check the response to see if the model produced a ToolMessage, which indicates that the model chose to invoke a tool. By construction, neither of those prompts should induce a tool call.
from langchain_core.messages import ToolMessage

basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"

def q2a(model):
    _, result = react_chat(basic_arithmetic_question, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
    _, result = react_chat(greeting, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
    return q2a(model) and q2b(model)

To answer Question 3, we'll ask the basic arithmetic question to both the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, ideally the agent and the underlying model should be able to solve the problem under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it because I'm such a nice guy. 😇
def q3a(model):
    result = model.invoke(basic_arithmetic_question)
    return "12243" in result.content

def q3b(model):
    last_msg, _ = react_chat(basic_arithmetic_question, model=model)
    return "12243" in last_msg

def q3(model):
    # q3a ==> q3b: If q3a is true, then q3b ought to be true as well.
    return not q3a(model) or q3b(model)

To answer Question 4, we'll greet the agent and attempt to determine whether it responds with a proper greeting. This is a little difficult to do in a comprehensive way.
basic_greeting = "Hi."
def q4(model):
last_msg, _ = react_chat(basic_greeting, model=model)
r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
#if not r:
#print(f"Debug: Not a greeting? {last_msg}")
return rHere is code to run the experiments a couple of times.
from tqdm.notebook import tqdm

def do_bool_sample(fun, n=10, *args, **kwargs):
    try:
        # tqdm here if desired
        return sum(fun(*args, **kwargs) for _ in range(n)) / n
    except Exception as e:
        print(e)
        return 0.0

def run_experiment(model, name, n=10):
    do = lambda f: do_bool_sample(f, model=model, n=n)
    d = {
        "q1": do(q1),
        "q2": do(q2),
        "q3": do(q3),
        "q4": do(q4),
        "model": name
    }
    d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
    return d

def print_experiment(results):
    name = results['model']
    print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
    print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
    print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
    print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")

def run_and_print_experiment(model, name):
    results = run_experiment(model, name)
    print_experiment(results)
    return results

Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.
llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")

Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}
As we saw above, Llama 3.2 is able to call functions (Q1), but does so even when it should not (Q2). Question 3 shows that even though it almost always decides to call a tool, this only stops it from answering basic questions about half the time. It does, however, prevent it from being able to chat (Q4).
Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.
gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")

Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0,
'q2': 0.0,
'q3': 0.9,
'q4': 1.0,
'model': 'gpt-3.5-turbo',
'total': 2.9}
Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1) and, unlike Llama 3.2, can still engage in chat (Q4). A bit surprisingly, it still invokes tools when it shouldn't (Q2), but it is smart enough to ignore their results when constructing its final response.
Let's try a newer model, gpt-4o.
gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")

Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}
GPT 4o nailed it! 👏
Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.
ollama_models = [
"hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
"llama3.3:70b",
"llama3.2:3b",
"llama3.1:70b",
"llama3.1:8b",
"llama3-groq-tool-use:8b",
"llama3-groq-tool-use:70b",
"MFDoom/deepseek-v2-tool-calling:16b",
"krtkygpta/gemma2_tools",
"interstellarninja/llama3.1-8b-tools",
"cow/gemma2_tools:2b",
"mistral:7b",
"mistral-nemo: 12b",
"interstellarninja/hermes-2-pro-llama-3-8b-tools",
"qwq:32b",
"qwen2.5-coder:7b",
]
all = []
for m in ollama_models:
    print(f"Downloading model: {m}...")
    !ollama pull {m} 2>/dev/null
    print("done.")
    r = run_and_print_experiment(ChatOllama(model=m), m)
    !ollama rm {m}
    all.append(r)
    print(r)

Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
deleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}
Downloading model: llama3.3:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
deleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}
Downloading model: llama3.2:3b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
deleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}
Downloading model: llama3.1:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
deleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}
Downloading model: llama3.1:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
deleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}
Downloading model: llama3-groq-tool-use:8b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
deleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}
Downloading model: llama3-groq-tool-use:70b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
deleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}
Downloading model: MFDoom/deepseek-v2-tool-calling:16b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
deleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}
Downloading model: krtkygpta/gemma2_tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
deleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}
Downloading model: interstellarninja/llama3.1-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
deleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}
Downloading model: cow/gemma2_tools:2b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
deleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}
Downloading model: mistral:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
deleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}
Downloading model: mistral-nemo: 12b...
done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
Error: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}
Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
deleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}
Downloading model: qwq:32b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
deleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}
Downloading model: qwen2.5-coder:7b...
done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
deleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])

Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b',
'q1': 1.0,
'q2': 1.0,
'q3': 0.9,
'q4': 1.0,
'total': 3.9},
{'model': 'llama3-groq-tool-use:8b',
'q1': 1.0,
'q2': 0.8,
'q3': 1.0,
'q4': 1.0,
'total': 3.8},
{'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
{'model': 'qwen2.5-coder:7b',
'q1': 1.0,
'q2': 0.4,
'q3': 0.8,
'q4': 1.0,
'total': 3.2},
{'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S',
'q1': 0.0,
'q2': 1.0,
'q3': 0.9,
'q4': 1.0,
'total': 2.9}]
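If you prefer a table to a pile of dicts, the per-model results can also be dumped into a DataFrame. This is just an optional convenience sketch; it assumes pandas is available (it is by default on Colab):

# Optional: view all results as a sorted table (assumes pandas is installed).
import pandas as pd
df = pd.DataFrame(all).set_index("model").sort_values("total", ascending=False)
print(df.to_string())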
There's a lot going on here. But I have a few general observations:
Most models do really poorly. The average "score" was ~2.3 out of 4. And even a perfect score of 4.0 is really the bare minimum of what I would expect a decent model to do. Remember, most of the models I tested claim to "support" tool calling.
Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!
I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.
There are several benchmarks that test tool use in LLMs, but most of them are not designed to test tool usage within an interactive agent. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark that tests this type of behavior, and it was only added in September 2024 as part of BFCL V3.
Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.
YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, he found that Llama 3.3 did not have the same problem, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.
Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?
We'll explore the answer in a future blog post. Stay tuned!
I recently signed up for BlueSky. I just learned of a new service, EchoFeed, that polls RSS feeds and posts the content to BlueSky (and elsewhere). So, this is a test.
Will it post to BlueSky? Will fed.brid.gy mirror it to Mastodon? The suspense is killing me!
This page documents my experience with "pressure washing" my vinyl fence and siding. I have "pressure washing" in quotes because it's SH, or sodium hypochlorite (bleach), that does the bulk of the work. Pros often call this "soft washing".
For vinyl fence soft washing, you want around 1-2% SH. Most household bleach is 6% SH, so if you mix 1 part bleach with 5 parts water, you'll get around 1% SH.
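If you want to double-check the dilution math for a different mix, here's a tiny sketch. It assumes 6% household bleach; check your label, since strengths vary:

# Resulting SH concentration for a given bleach:water ratio (assumes 6% bleach).
def diluted_sh(bleach_parts, water_parts, bleach_strength=0.06):
    return bleach_strength * bleach_parts / (bleach_parts + water_parts)

print(f"{diluted_sh(1, 5):.1%}")  # 1 part bleach to 5 parts water -> 1.0%
print(f"{diluted_sh(1, 3):.1%}")  # a stronger 1:3 mix -> 1.5%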
You also want to use a surfactant to help the mixture stick to the fence. I used Dawn Ultra. Some people claim that some dish soaps will cause a bad reaction with the bleach, ranging from "mustard gas" to neutralizing the bleach.
I personally found that at 1-1.5% SH, the mixture was safe to use around grass. I wet the grass before and after applying the mixture, and I didn't see any damage.
Make sure to put the soap in last, or your mixture will foam up and overflow the sprayer when you try to close it.
Spray the mixture on the fence, let it sit for about five minutes, and then rinse it off. You can use a garden hose, but I personally found that using a Ryobi One+ EZ-Clean worked better. I'm sure a pressure washer would have been even faster, but it is less convenient to use.
That's about it. This removed most of the staining.
For some areas that had large amounts of growth, I used a Ryobi Scrubber to physically remove it before spraying.
The bleach was not able to remove all stain spots. For those remaining spots that were in conspicuous places, I used a magic eraser / melamine sponge.
At some point, I hope to create a Notes section on my website that will turn Markdown files into a list of notes. This is basically how the blog works. But, I'm kind of busy. And since Gatsby seems like it's dead, I'm not sure that I want to invest a whole lot of time into it. (Although putting the notes in markdown seems like a good idea for compatibility.)
Anyway, here is my first very short note on Profiling.
SpeedScope is an awesome tool for visualizing profiler output. It has a flame graph view that is wonderful. I also like to use the Sandwich view, sorting by total time and simply looking for the first function that I recognize. This is often the culprit.
The documentation is pretty good.
It also shows how to record profiles in compatible formats for most platforms.
I mostly use py-spy and perf.
The one notably missing platform is Java! Luckily, it's not too hard to convert Java's async-profiler output to a format that SpeedScope can read. Here's how I do it:
1. Use async-profiler's collapsed output format.
2. Record a profile: ./asprof start -i 1s Ghidra followed by ./asprof stop -o collapsed -f /tmp/out.prof.collapsed Ghidra, or in one step, ./asprof collect -d 60 -o collapsed -f /tmp/out.prof.collapsed Ghidra.
3. Open out.prof.collapsed in SpeedScope.

The collapsed format takes a while to parse, so it might be worth it to export the native SpeedScope format.
My wife and I started fostering rescue dogs mostly by accident. We adopted a Shih-tzu mix, who seemed completely relaxed when we met her at the rescue. When we got home, we eventually figured out that she was terrified, and freezing was her coping mechanism. Once she got a little more comfortable, she started hiding from us in the house. It took many months, but Molly eventually warmed up to us and we gained her trust.
As frustrating an experience as it is to have a dog be absolutely terrified of you for no reason, it was also incredibly rewarding to see her come out of her shell. We decided that we wanted to help other dogs in similar situations, so we started fostering. It's been several years, and we've fostered dozens of dogs. For a long time, we fostered for A Tail to Tell, which unfortunately recently closed. More recently, we have been fostering for Lucky Dawg Animal Rescue.
This week, however, we had a "first", and not a good one. We picked up a new foster dog, Blanche, on Sunday. We have a nice fenced in yard, and we immediately took Blanche out back into the yard. She was very skittish, and we gave her some space. She immediately dove into a row of large evergreen trees in our yard and hid. Eventually, I had to go in and carry her out, which was not a simple task given the size of the trees.
The next morning, I let Blanche out and saw her run into the same trees. It was very hot, and after about an hour, I started to grow concerned and went out to look for her. She was not in the same tree as last time, but I figured she was hiding in another one. There are several trees, and it is fairly difficult to see into them. I started exhaustively searching the trees, and I couldn't find her. I also found a part of our fence that was slightly pushed out, as if something had forced its way out. She had escaped.
My wife and I were very upset, but we shifted into action. We reported Blanche as missing on PawBoost, our community Facebook page, called the local police department, and notified our rescue, Lucky Dawg Animal Rescue. We quickly began receiving sighting reports of Blanche. She was initially seen at approximately 10am in a wooded area next to a busy road. I went to the area and searched for her while my daughter and wife started talking to neighbors and handing out flyers. I saw no sign of her in the wooded area.
We spent the rest of the day trying to put up posters on telephone poles, which is harder than it seems! Tape doesn't adhere very well to dirty telephone poles. The trick is to tape all the way around the pole so that the tape sticks to itself.
We received another sighting report at 5pm, this time on the other side of the busy road. The report was of a dog "playing/fighting" with a chicken. Fortunately, I knew which house this was at from the description. Around the same time, two members of our rescue arrived to help. We went to the house and I talked to the owner, who revealed that Blanche had attacked her chicken. She chased Blanche off, and Blanche ran into the woods. While I was talking to the owner, our rescue members saw Blanche in a large field nearby. Blanche spotted them, and ran into a wooded area near a creek. I was able to reach the other side of the wooded area, but the experts decided that it would be better to leave a food station for Blanche so she stayed in the area rather than try to chase her.
The next morning we received a few sightings of Blanche near our neighborhood again. More surprisingly, my wife left our fence gate open and saw Blanche sniffing around the fence around 8 am. Unfortunately, Blanche ran off. It was very hot, and Blanche presumably slept during the day.
That evening, the rescue returned and put out a trap with lots of food, and a remote camera to monitor it. We cranked up the volume on my phone so that every time the camera detected movement, we would wake up. We were woken up several times, but it was mostly false positives. At 1:59am, a cat wandered by. At some point, the cat triggered the trap, and at 2:28am we were greeted by a picture of Blanche studying the cat in the trap.
I snuck outside, and saw Blanche eating food about 10 feet away from me. It was frustrating to be so close but not be able to do anything. But our rescue members told us that it was safer to make her feel safe and comfortable with the trap. She clearly enjoyed the food that we had put out for her. She would be back.
After Blanche finished eating, we freed the cat from the trap and added more food, but we didn't see Blanche again that night. Stupid cat!
The next morning, Blanche was sighted in many of the same places, including near our house. Unfortunately, she was also seen crossing the road again. We were worried that a car would hit her. She slept during the day again. In the evening, there was heavy rain, and we decided to wait until after the rain to put the trap out. Naturally, Blanche showed up during the rain, and we missed an opportunity to catch her. Fortunately my wife had put out a couple pieces of food, so she didn't leave empty handed.
We armed the trap again and waited. She came back around 9:43pm and began investigating the trap. She was very cautious, and decided to yank out the towel on the bottom of the trap. She stayed for a very long time, as we waited in suspense for her to trigger the trap. Eventually she entered the trap, but it didn't trigger for some reason. It was frustrating, but not the end of the world. She would begin to think the trap was a safe source of food.
Blanche stayed around for a long while, but eventually left. We examined the trap. When she yanked the towel, it actually disarmed the trap without activating it. We fixed the trap and waited again, hoping she would be back later that night.
At 2:05am, she came back and began to investigate the trap again. We waited for what seemed like forever. At 2:15am, we received this picture of her deep in the trap.
The camera we were using would take a burst of three pictures every time it detected motion. After that picture, we didn't receive any more pictures. This could mean that she was in the trap, or that the trap had not triggered and she had left. We waited for a while, and then snuck outside to check. We had caught her! We carried the trap inside and put her back into her exercise pen, and finally got some sleep.
Blanche is not very happy to be back inside, but she ate, drank, and is safe. She hasn't been very lively yet, but this is not uncommon with mill rescue dogs, who often need a few days before they start to interact and show personality. Hopefully she'll quickly realize that we're not so bad, and that she is safe and sound in our house.
We have a long list of interesting stories from our years fostering rescue dogs, but this was certainly one of the more interesting and stressful ones. On the positive side, we met a lot of our neighbors, and we were pleasantly surprised by how helpful and supportive they were, without exception. Many people wanted to help in whatever way they could. We really live in a cool little community.
In my last post I talked about how I have been using Ansible for my new laptop configuration, and shared my configuration for notion.
So far, I've been extremely happy with using Ansible for configuring my machine. Prior to using Ansible, I'd spend a fair amount of time creating detailed notes that described what I did. I estimate that creating Ansible recipes takes about the same time as keeping good notes, and maybe even less. That's because there are many existing roles for common settings and software that can be reused. As with any ecosystem, the quality of such roles varies.
The big difference between my notes and Ansible, though, is that Ansible playbooks can be played in minutes, whilst manually following my notes can take hours to set up an entire new machine. I used to dread the idea of configuring a new machine. But now it's fairly effortless.
I just publicly shared my Ansible configuration. I don't expect that anyone will use my configuration as is, any more than I expect anyone to use my notion configuration! I'm extremely opinionated and picky. But I do hope that it might give people some ideas, like how to install llvm, nvidia drivers and so on. I know I personally found other people's repositories to be helpful.
In a very similar vein, I've started using
dorothy, which claims to allow you to "...
bring your dotfile commands and configuration to any shell." Since I usually
but not always use fish, I've always been hesitant to
write my own commands in fish. Plus, I have been writing bash scripts for long
enough that I'm decent at it, so it tends to be one of my go-tos. Dorothy makes
it easy to define variables and commands in such a way that they magically
appear in all shells. (Again, this is very useful for fish, which is not a
posix-compliant shell.) There's also a fair number of useful built-in commands.
Dorothy encourages users to split their dotfiles into public and private
portions, and you can view my public dotfile
here. Specifically, here are my
custom commands. Some
of these might be useful, such as setup-util-ghidra and
setup-util-ghidrathon. I've found that having a designated spot for these
types of utility commands encourages me to write them, which ultimately saves me
time. Usually.
Sometime while I was in graduate school, I started using the notion window manager. (Actually, at the time, I think it was ion3.) Notion is a tiling window manager that is geared towards keyboard usage instead of mouse usage. But let's be honest: I've been using notion for so long that I simply prefer it over anything else.
Notion, like most minor window managers, is a bit spartan. It does not provide
a desktop environment. It really just manages windows. There are some
features of a desktop environment that I don't need, such as a launcher. I know
all the commands that I use; I don't need a GUI to list them for me. But it's
often the little things that get you, such as locking the screen, or using the
media keys on your keyboard to adjust the volume. I used to be (more of) a hardcore nerd
and relished my ability to craft a super-complex .xsessionrc file with all
kinds of bells, whistles and utilities connected as if with duct tape. But as I
grow older, sometimes I just want my computer to work.
For a long while now, I've found that running notion alongside GNOME for "desktop stuff" works pretty well. For a long time, I followed an old Wiki post about how to combine GNOME with Awesome WM. This worked really well with GNOME 2.
Many people say that you can't use GNOME 3 with window managers other than GNOME
Shell. I've actually had pretty good luck copying the Ubuntu gnome-session and
replacing Gnome Shell with notion. The above Awesome WM Wiki also shows how to
do it. Unfortunately, I've found that some features do not work, such as the
keyboard media keys, much to my dismay. Do media keys matter that much? Yes,
yes, they do. This apparently broke when GNOME Shell started binding the media
keys instead of gnome-settings-daemon. There
used to be a gnome-fallback-media-keys-helper utility around that would
simulate this behavior, but it seems to have disappeared.
As I was trying to fix this problem, I came across a blog post and an unrelated but similar github repo both describing how to use the i3 window manager with GNOME. TLDR: GNOME Flashback is a still supported variant of GNOME that is explicitly designed to support third-party window managers. Whereas GNOME Shell incorporates both the window manager and other stuff such as handling media keys, GNOME Flashback has the "other stuff" in a separate component that is designed to be used with a window manager such as metacity. But it works just fine with notion! Best of all, both my media keys and screen locking work. Hurray!
Because I hate setting up stuff like this, I've actually been hard at work packaging up my Linux computer configuration into reusable ansible components. It takes a little longer than doing it manually of course, but it's not too bad and it's pretty easy to read. I'm making my notion role available here in case anyone wants to try out my setup. Most of the logic is here if you are curious what is involved. Below are a few snippets to show how Ansible makes it relatively easy to manipulate configuration files.
# Same thing, but for gnome-flashback
- name: Copy gnome-flashback-metacity.session to notion-gnome-flashback.session
  copy:
    src: /usr/share/gnome-session/sessions/gnome-flashback-metacity.session
    dest: /usr/share/gnome-session/sessions/notion-gnome-flashback.session

- name: 'notion-gnome-flashback.session: Change metacity to notion and add stalonetray'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: 'metacity'
    replace: notion;stalonetray

- name: 'notion-gnome-flashback.session: Remove gnome-panel'
  replace:
    path: /usr/share/gnome-session/sessions/notion-gnome-flashback.session
    regexp: ';gnome-panel'

- name: Symlink systemd target for notion-gnome-flashback session to gnome-flashback-metacity session
  file:
    src: /usr/lib/systemd/user/gnome-session@gnome-flashback-metacity.target.d
    dest: /usr/lib/systemd/user/gnome-session@notion-gnome-flashback.target.d
    state: link

- name: Install gconf override for notion-gnome-flashback
  copy:
    src: notion-gnome-flashback.gschema.override
    dest: /usr/share/glib-2.0/schemas/01_notion-gnome-flashback.gschema.override
  notify: Compile glib schemas

- name: Set META
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--META='
    line: META="Mod4+"
    backup: true

- name: Set ALTMETA
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--ALTMETA='
    line: ALTMETA="Mod1+"
    backup: true

- name: Disable mod_dock
  lineinfile:
    path: /usr/local/etc/notion/cfg_defaults.lua
    state: absent
    line: 'dopath("mod_dock")'
    backup: true

- name: Enable mod_statusbar
  lineinfile:
    path: /usr/local/etc/notion/cfg_notion.lua
    regexp: '^--dopath("mod_statusbar")'
    line: 'dopath("mod_statusbar")'
    backup: true

I'm a serial car leaser. This December, when my old lease was running out, there were not a lot of very appealing lease deals. Some of the best deals were on electric vehicles (EVs). The US offers a $7,500 tax credit on some EVs. To qualify, the vehicles must be assembled in North America. But some EV manufacturers that do not meet this requirement are offering a similar rebate out of their own pocket to avoid losing customers to their competitors who do qualify. In short, because many EVs come with a $7,500 rebate, either from a tax credit or otherwise, EVs were some of the best deals that I could find.
Naturally, the benefit to the environment was another selling point as well!
I ended up leasing a Hyundai Ioniq 6. It's a very slick car, and there are many aspects of it that I like that are not EV specific. But I'm not going to talk about those here. This post is about my initial thoughts on owning an EV.
I knew that EVs could produce a great deal of torque, but I didn't realize how much that impacted the driving experience. The Ioniq 6 can accelerate way faster than I ever need it to. This is useful for merging onto highways and such, but it's also just a lot of fun. I've owned cars with powerful motors, but the instant torque of an EV is a different experience. It's hard to imagine going back to driving a non-EV at this point.
The only noise the Ioniq 6 makes is a whine that lets pedestrians know the car is nearby. When you turn the car on, the motor makes no noise. None. No ignition sound, no idling; nothing.
The Ioniq 6 includes a "level 1" charger that plugs into a standard 120V outlet. It's pretty slow; it takes about 48 hours to completely charge the battery from empty. But it's probably sufficient for most people's needs. I did have a problem one day when I didn't have enough time to fully charge the car at home before embarking on a long trip, so I had to stop at a charging station. But we'll talk about that in a bit.
Hyundai has a promotion where they will install a "level 2" charger for you, so I now have a "level 2" charger that charges the battery in a few hours.
My car's range is about 200 miles (we'll get to that in a bit), so unless I'm going on a long trip, I can easily charge my car at home. This is super convenient. I never have to go to a gas station. Every time I get in my car at home, it's fully charged and ready to go.
My car's range is about 200 miles, so for long trips, I need to stop and charge on the way. There are more public chargers than I thought, but they are not as common as gas stations, and they have different speeds. There are extremely fast DC chargers that can charge a car in minutes, but they are not as common as the slower level 2 chargers. In theory, my car can be charged in less than 20 minutes. But in practice, I have yet to experience that. Charging speed is greatly affected by temperature, and I have never experienced anything close to that optimal charging speed. The Ioniq 6 has a preconditioning system that is supposed to prepare the battery for high-speed charging by warming it up, but all I can say is that it hasn't worked well for me, and all of my public charging experiences have been significantly longer than 20 minutes.
The longest trip I have taken with my EV has been to Pittsburgh. This trip is largely on the PA turnpike. There are no chargers on the turnpike itself, but there are several DC charging stations a few minutes drive off the turnpike.
When you combine the time it takes to leave the turnpike, and the longer time it takes to charge an EV versus fill the gas tank, the bottom line is that long trips will take longer in an EV. On my first trip, I expected my car to charge in 20 minutes as advertised, and for the trip to take only slightly longer than usual. But it ended up taking me over an hour longer, which was upsetting.
You may have also seen various stories about charging stations having long lines during cold weather. This is a real problem, and it contributes to the unpredictability of long trips in an EV. It should get better over time as charging infrastructure improves.
There are so many gas stations that in an internal combustion engine (ICE) car, having an exact range-to-empty estimate is not that important. But because chargers are less plentiful, you would think that all EVs would give you a fairly accurate estimate of range until empty. Well, that's simply not the case for the Ioniq 6. Battery performance is greatly affected by temperature, and the car's range drops significantly in cold weather. You would hope that the car, or the car's manual, would provide accurate guidance on how far it can go in various conditions. But they do not. Instead, you need to rely on personal experience or external guidance such as these tests from Consumer Reports.
I think that car manufacturers are shooting themselves in the foot here by not being more transparent. If you are going to advertise that your car has a 270 mile range, then it should have a 270 mile range in all conditions. If it doesn't, then you should be open about that. I think that the lack of transparency is going to make people less likely to buy EVs. The same principle applies to charging times. It's great if the car can charge in 20 minutes under ideal circumstances. But if you don't openly admit that it performs significantly worse in real-world conditions, then you are going to upset your customers.
I've read that Tesla cars can estimate their range very accurately. But I've been on long trips in which I've steadily watched the buffer between my car's remaining range and the distance to the closest charging station decrease in my Ioniq 6. It's not a good feeling at all, and from an engineering point, I think it's inexcusable.
Of course, one of the selling points of an EV is that you don't have to buy gas. You do have to pay for electricity, but it's significantly cheaper. Here's a comparison of how many miles I would get out of my Volvo S60 vs. my Hyundai Ioniq 6 for the cost of one gallon of gas. (The S60 actually takes 91+ octane fuel, so it would cost even more than the $3.95 listed.)
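The back-of-the-envelope math is easy to redo with your own numbers. The figures below are illustrative assumptions (gas at $3.95/gal, ~28 mpg for the ICE car, ~$0.15/kWh for home electricity, ~3.5 mi/kWh for the EV), not the measurements from my cars:

# Miles of driving per $3.95 of "fuel" -- all numbers below are assumptions for illustration.
gas_price = 3.95        # $ per gallon
ice_mpg = 28.0          # assumed ICE fuel economy
kwh_price = 0.15        # assumed $ per kWh at home
ev_mi_per_kwh = 3.5     # assumed EV efficiency

ice_miles = ice_mpg                                   # one gallon of gas
ev_miles = gas_price / kwh_price * ev_mi_per_kwh      # the same $3.95 spent on electricity
print(f"ICE: {ice_miles:.0f} miles, EV: {ev_miles:.0f} miles per ${gas_price} of fuel")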
I'm very happy with my decision to lease an EV. I would do so again. But I think that manufacturers' decision to focus only on performance under ideal conditions is a mistake that will ultimately result in more people being leery of buying EVs. I know it has colored my experiences (negatively). My ultimate take away is this: If you are driving long distances in an EV, plan for your trip to take more time. We have a two car household, and I'm not sure that I would be ready to replace both with an EV at this point. But for everything but long trips, I think that EVs are the way to go.
I was playing around with ChatGPT and it came up with this gem:
In the thrilling world of ones and zeros, where the mere thought of assembly code sends shivers down the spines of mere mortals, there exists a unique breed of masochists—reverse engineers. Are you the kind of person who looks at a complex problem and thinks, "Why not make it harder?" If the idea of disassembling convoluted code into something even remotely understandable without any upfront instructions sounds like a fun Tuesday afternoon, then we have the perfect job for you.
Introducing the glamorous world of reverse engineering, where the nightmares of dealing with assembly code, hexadecimal hellscapes, and cryptic comments—or the lack thereof—don't just haunt you in your sleep; they're your 9 to 5. Ever wanted to spend hours, nay, days, unraveling the mysteries of code that was never meant to be understood, like an archeologist deciphering the Rosetta Stone, except what you're translating doesn't want to be translated? Congratulations, your masochistic tendencies can finally pay off.
In this role, you will dive headfirst into the abyss of other people's code, armed with nothing but your wits and perhaps an unhealthy dose of caffeine. You'll experience the unparalleled joy of finally understanding a function's purpose at 3 AM, only to realize it was a decoy. But fear not, for the fruits of your labor are many—aside from the occasional existential crisis. You'll gain the satisfaction of knowing you're outsmarting attempts to keep you out, one painstaking line at a time.
So, if you're someone who relishes the thought of your brain being tied into knots by assembly language, who finds the idea of decrypting software protections better than a crossword puzzle on a Sunday morning, and who sees a "No Trespassing" sign as a personal invitation, then the lucrative world of reverse engineering awaits. Where else can you get paid for indulging in your penchant for digital masochism? Join us, and turn your nightmares into your day job.
Ready to sign up? 🤣
I saw a quote in a recent ACM article that resonated with me:
Security must be a business enabler, not a hinderer.
The article is about product development, but the quote is applicable to IT security as well. Too often, IT departments forget what their purpose is, which is to support the mission of the organization. Security is a means to that end, not an end in itself. The goal of IT security is not to eliminate risk. The most secure computer is one that is not plugged in, but that doesn't really help us, does it? The goal of IT security is to manage risk so that the organization can achieve its mission. Unfortunately, it's significantly easier to just say "No, you can't do that because it would increase our risk" instead of "Let me fully understand why you need or want to do that, have a discussion about the security risks that will impose, and collaboratively make a decision that balances the risks and rewards for the organization."
I'm not saying that we should be cavalier about security. After all, I'm personally passionate about security. But we should be intentional about how and where we apply security when it impacts the mission, which is almost always!
I suspect a lot of people would be surprised at the (relatively low) level of security I employ personally. It's not that I'm lazy or incompetent. It's that I've made a conscious decision to accept certain risks in order to achieve certain rewards. For example, I don't bother trying to defend against nation state adversaries. I'm not a high value target. I don't have the resources (time) to defend against a nation state. I'd rather be nimble and flexible in my work capabilities, so I focus on the easy, low-hanging-fruit mitigations that will stop most (but not all) attackers.