One of the most exciting possibilities of AI and LLMs is agents: programs that let an LLM call external tools in order to solve problems. You've probably seen them before, like when you ask ChatGPT to browse the web for you.
In this blog post, we'll take a look at how to build agents using LangChain. They'll work great with an OpenAI model. Then we'll try to run them locally using Ollama, with a variety of open models, and almost all of them will fail miserably. They fail so badly that I created this blog post to convince myself I wasn't imagining things.
In a future blog post, we will examine why.
LangChain is a framework for building LLM applications. Basically, it abstracts a bunch of different components like LLMs, vector stores, and the like, and lets you focus on your application's logic. So, you might develop your application in LangChain against a local LLM, and then switch to Claude once you go to production.
Anyway, using LangChain to make a query is pretty simple.
!pip install langchain-openai~=0.2.7 python-dotenv
!pip install httpx==0.27.2 # temp
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# We'll load my OpenAI API key using dotenv
%load_ext dotenv
%dotenv drive/MyDrive/.env
from langchain_core.tools import tool
from langchain import hub
from langchain_core.messages import AIMessageChunk, HumanMessage
from langchain_openai import ChatOpenAI
# Remove non-determinism for the blog post
zero_temp_gpt35 = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
response=zero_temp_gpt35.invoke("Hi! What is your name?").content
import textwrap
print(textwrap.fill(response))
Hello! I am a language model AI assistant. How can I assist you today?
The beauty of LangChain is that the components are modular. We can replace the gpt-3.5-turbo model with something else later if we want to, and indeed we will do just that!
LangGraph is the part of the LangChain ecosystem for building agents. It allows us to easily define new tools:
!pip install langgraph~=0.2.53
@tool
def foobar(input: int) -> int:
    """Computes the foobar function."""
    return input + 2

tools = [foobar]
The @tool decorator automatically transforms the function into a schema that the LLM can use to decide whether to invoke the tool and, if so, how.
foobar.tool_call_schema.model_json_schema()
{'description': 'Computes the foobar function.', 'properties': {'input': {'title': 'Input', 'type': 'integer'}}, 'required': ['input'], 'title': 'foobar', 'type': 'object'}
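This schema is what ultimately gets handed to the chat model. As a minimal sketch of how that works, LangChain chat models expose a standard bind_tools method that attaches the schemas and lets the model emit tool calls. (This is only an illustration of the mechanism, and the printed output below is an example, not captured from a real run.)

# Bind the tool schemas to the model and let it decide whether to call a tool.
llm_with_tools = zero_temp_gpt35.bind_tools(tools)
msg = llm_with_tools.invoke("Please evaluate foobar(30)")
print(msg.tool_calls)
# e.g. [{'name': 'foobar', 'args': {'input': 30}, 'id': '...', 'type': 'tool_call'}]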
With that, we can build a generic agent, called a ReAct agent, which can interact with our tools:
from langgraph.prebuilt import create_react_agent
def react_chat(prompt, model):
    agent_executor = create_react_agent(model, tools)
    response = agent_executor.invoke({"messages": [("user", prompt)]})
    return response['messages'][-1].content, response
last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_gpt35)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The result of evaluating foobar(30) is 32.
Yes! We did it, team! 🎉 We could change foobar to be a web search, a database lookup, or you name it.
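For example, here's a purely hypothetical sketch of what a richer tool could look like; lookup_order and its return value are invented for illustration, but any Python function with type hints and a docstring can be exposed the same way:

@tool
def lookup_order(order_id: str) -> str:
    """Looks up the shipping status of an order by its ID."""
    # A real implementation would query a database or an HTTP API here.
    return f"Order {order_id} shipped yesterday."

# tools = [foobar, lookup_order]  # the agent would then choose between them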
Let's try a query that doesn't use a tool at all.
last_msg, result = react_chat("Hi.", zero_temp_gpt35)
print(last_msg)
assert "Hello" in last_msg and "foobar" not in last_msg, "Uh oh, something went wrong"
Hello! How can I assist you today?
Great. So, in theory, we have an agent that we can chat with and is able to call tools in order to help us out.
Now let's try to create a tool-wielding agent using a LLM that runs on our local machine.
We'll do this by using Ollama, which is a (fairly) easy way to run smaller open LLMs on your local machine. It
will use any GPUs that you might have, but it's still usable even if you don't
have any. After all, you're just performing inference, not training.
Here's an example of me running Llama 3.2 with Ollama on my work laptop.
root@be5c1cb9e696:/# ollama run llama3.2
>>> Hi mom!
It's nice to hear from you, sweetie. Is everything okay? What's on your mind?
>>> Are you alive?
I am a computer program, so I don't have feelings or emotions like humans do. But I'm
designed to simulate conversations and answer questions to the best of my ability. I'm
not alive in the way that a living being is, but I'm here to help you with any
questions or topics you'd like to discuss!
>>> 🤯
I know it can be a bit mind-blowing to think about a computer program that can have
conversations and answer questions! But I'm designed to make interactions feel more
natural, so I'm glad you're surprised (in a good way!)
You can find instructions on how to install Ollama on the Ollama webpage.
If you don't feel like installing anything, that's fine too. You can follow along with this notebook.
After installing and running Ollama (ollama serve), we install the langchain-ollama connector package and pull down the Llama 3.2 model from Ollama's repository.
# Install Ollama
!ollama 2>/dev/null || curl -fsSL https://ollama.com/install.sh | sh
!ollama -v
# Make sure Ollama is running
!ollama ps 2>/dev/null || (env OLLAMA_DEBUG=1 nohup ollama serve &)
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
WARNING: systemd is not running
WARNING: Unable to detect NVIDIA/AMD GPU. Install lspci or lshw to automatically detect and install GPU dependencies.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.1
nohup: appending output to 'nohup.out'
!pip install langchain_ollama~=0.2.0
!ollama pull llama3.2
Now we can attempt the same tests we performed on GPT 3.5, but using the local Llama 3.2 LLM.
from langchain_ollama import ChatOllama
# The zero temperature model is to remove non-determinism for the blog post
zero_temp_ollama_model = ChatOllama(model="llama3.2", temperature=0)
response = zero_temp_ollama_model.invoke("Hi! What is your name?").content
print(textwrap.fill(response))
I don't have a personal name, but I'm an AI designed to assist and communicate with users. I'm often referred to as a "language model" or a "chatbot." You can think of me as a helpful computer program that's here to provide information, answer questions, and engage in conversation. What's your name?
Okay, looking good! This is not bad for a 3B-parameter LLM that can easily run locally on our computer. Let's see how it does when it's driving our tool-wielding agent.
last_msg, _ = react_chat("Hi. Please evaluate foobar(30)", zero_temp_ollama_model)
print(last_msg)
assert "32" in last_msg, "Uh oh, something went wrong"
The output of `foobar(30)` is 32.
🎉 Everything is working well so far. As one final check, let's ask the agent a question that has absolutely nothing to do with tools.
last_msg, result = react_chat("Hi.", zero_temp_ollama_model)
print(last_msg)
assert "42" in last_msg, "Uh oh, something went wrong"
The input value 42 was doubled, resulting in 84.
So, we said "Hi." and the agent responded with nonsense. Let's inspect some of the metadata we get back from LangChain to see what's going on.
import pprint
pprint.pprint(result)
{'messages': [HumanMessage(content='Hi.', additional_kwargs={}, response_metadata={}, id='c4dd1ba7-cb15-4d62-a2bb-a543a32a882d'), AIMessage(content='', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.061349558Z', 'done': True, 'done_reason': 'stop', 'total_duration': 294464945, 'load_duration': 22079878, 'prompt_eval_count': 153, 'prompt_eval_duration': 9000000, 'eval_count': 16, 'eval_duration': 261000000, 'message': Message(role='assistant', content='', images=None, tool_calls=[ToolCall(function=Function(name='foobar', arguments={'input': 42}))])}, id='run-be60d0f6-bf62-4336-b028-d37898615e06-0', tool_calls=[{'name': 'foobar', 'args': {'input': 42}, 'id': '4d6b28d7-71bc-4f80-9a2a-e61293bdbb65', 'type': 'tool_call'}], usage_metadata={'input_tokens': 153, 'output_tokens': 16, 'total_tokens': 169}), ToolMessage(content='44', name='foobar', id='aed6b2d6-590d-4bc3-8828-89457178bd11', tool_call_id='4d6b28d7-71bc-4f80-9a2a-e61293bdbb65'), AIMessage(content='The input value 42 was doubled, resulting in 84.', additional_kwargs={}, response_metadata={'model': 'llama3.2', 'created_at': '2024-12-13T21:31:07.305191931Z', 'done': True, 'done_reason': 'stop', 'total_duration': 238035622, 'load_duration': 22280620, 'prompt_eval_count': 85, 'prompt_eval_duration': 5000000, 'eval_count': 14, 'eval_duration': 208000000, 'message': Message(role='assistant', content='The input value 42 was doubled, resulting in 84.', images=None, tool_calls=None)}, id='run-50754cd1-cae9-410d-84d5-64b51bced188-0', usage_metadata={'input_tokens': 85, 'output_tokens': 14, 'total_tokens': 99})]}
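That's a lot to squint at. Here's a small sketch that walks the same result and flags which messages carry tool calls, just to make the structure easier to see:

from langchain_core.messages import AIMessage, ToolMessage

for i, msg in enumerate(result["messages"], start=1):
    kind = type(msg).__name__
    note = ""
    if isinstance(msg, AIMessage) and msg.tool_calls:
        note = f" -> wants to call {[tc['name'] for tc in msg.tool_calls]}"
    elif isinstance(msg, ToolMessage):
        note = f" -> tool returned {msg.content!r}"
    print(f"{i}. {kind}{note}")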
We can see there are four messages:

1. The HumanMessage is the user's message -- "Hi."
2. In the first AIMessage, the LLM indicates that it would like to invoke a tool by setting the tool_calls field.
3. The tool is invoked, and its output is returned in a ToolMessage, which is given back to the LLM.
4. The final AIMessage includes a written message for the user.

The problem, of course, is message #2. Why does the AI want to invoke a tool in response to "Hi."? Is this a problem with Llama 3.2 or something else? Let's do some 🥼 science and find out!
I created a really dumb benchmark to answer four really basic questions. I can't stress enough that this benchmark only tests the lowest of the low-hanging fruit in this area. (I am calling it a "benchmark" facetiously!)
Here are the questions:

1. Can the react agent use a tool correctly when explicitly asked?
2. Does the react agent invoke a tool when it shouldn't?
3. Does the react agent lose the ability to answer questions unrelated to tools?
4. Does the react agent lose the ability to chat?

Question 1 asks whether the agent can use a tool correctly when explicitly asked. We'll use our example above to test this.
basic_tool_question = "Please evaluate foobar(30)"
def q1(model):
    last_msg, _ = react_chat(basic_tool_question, model=model)
    return "32" in last_msg
Question 2 asks whether the agent invokes a tool when it shouldn't. We'll perform two simple tests to answer this. We'll prompt the agent with a basic arithmetic question that does not involve the foobar tool, "What is 12345 - 102?", and with a greeting, "Hello!" We'll then check the response to see whether the model produced a ToolMessage, which indicates that it chose to invoke a tool. By construction, neither of these prompts should induce a tool call.
from langchain_core.messages import ToolMessage
basic_arithmetic_question = "What is 12345 - 102?"
greeting = "Hello!"
def q2a(model):
    _, result = react_chat(basic_arithmetic_question, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2b(model):
    _, result = react_chat(greeting, model=model)
    return not any(isinstance(msg, ToolMessage) for msg in result['messages'])

def q2(model):
    return q2a(model) and q2b(model)
Question 3 asks whether the agent loses the ability to answer questions unrelated to tools. To answer this, we'll ask the basic arithmetic question to both the react agent and its underlying model. Since the available tool does not help with the arithmetic problem, the agent and the underlying model should ideally be able to solve it under the same circumstances. If the model can't do arithmetic in the first place, I chose not to penalize it, because I'm such a nice guy. 😇
def q3a(model):
    result = model.invoke(basic_arithmetic_question)
    return "12243" in result.content

def q3b(model):
    last_msg, _ = react_chat(basic_arithmetic_question, model=model)
    return "12243" in last_msg

def q3(model):
    # q3a ==> q3b: if the bare model can answer (q3a), the agent should too (q3b).
    return not q3a(model) or q3b(model)
Question 4 asks whether the agent loses the ability to chat. To answer this, we'll greet the agent and attempt to determine whether it responds appropriately. This is a little difficult to do in a comprehensive way.
basic_greeting = "Hi."
def q4(model):
    last_msg, _ = react_chat(basic_greeting, model=model)
    r = any(w in last_msg for w in ["hi", "Hi", "hello", "Hello", "help you", "Welcome", "welcome", "greeting", "Greeting", "assist"])
    # if not r:
    #     print(f"Debug: Not a greeting? {last_msg}")
    return r
Here is code to run each experiment several times (10 by default) and report the success rate.
from tqdm.notebook import tqdm
def do_bool_sample(fun, n=10, *args, **kwargs):
    try:
        # Wrap range(n) in tqdm here if a progress bar is desired.
        return sum(fun(*args, **kwargs) for _ in range(n)) / n
    except Exception as e:
        print(e)
        return 0.0
def run_experiment(model, name, n=10):
    do = lambda f: do_bool_sample(f, model=model, n=n)
    d = {
        "q1": do(q1),
        "q2": do(q2),
        "q3": do(q3),
        "q4": do(q4),
        "model": name,
    }
    d['total'] = d['q1'] + d['q2'] + d['q3'] + d['q4']
    return d
def print_experiment(results):
    name = results['model']
    print(f"Question 1: Can the react agent use a tool correctly when explicitly asked? ({name}) success rate: {results['q1']}")
    print(f"Question 2: Does the react agent invoke a tool when it shouldn't? ({name}) success rate: {results['q2']}")
    print(f"Question 3: Does the react agent lose the ability to answer questions unrelated to tools? ({name}) success rate: {results['q3']}")
    print(f"Question 4: Does the react agent lose the ability to chat? ({name}) success rate: {results['q4']}")
def run_and_print_experiment(model, name):
    results = run_experiment(model, name)
    print_experiment(results)
    return results
Let's see what our experiments say for Llama 3.2, which we already know from above does not perform very well.
llama_model = ChatOllama(model="llama3.2")
run_and_print_experiment(llama_model, "llama3.2")
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2) success rate: 0.1
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.1, 'model': 'llama3.2', 'total': 1.6}
As we saw above, Llama 3.2 is able to call functions (Q1) but does so even when it shouldn't (Q2). Question 3 shows that, even though it almost always decides to call a tool, this only stops it from answering basic questions about half the time. It does, however, almost always prevent it from being able to chat (Q4).
Now let's try benchmarking gpt-3.5-turbo, which seemed to do better.
gpt35 = ChatOpenAI(model="gpt-3.5-turbo")
run_and_print_experiment(gpt35, "gpt-3.5-turbo")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-3.5-turbo) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-3.5-turbo) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-3.5-turbo) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (gpt-3.5-turbo) success rate: 1.0
{'q1': 1.0, 'q2': 0.0, 'q3': 0.9, 'q4': 1.0, 'model': 'gpt-3.5-turbo', 'total': 2.9}
Great -- the benchmark showed that gpt-3.5-turbo can call tools (Q1) and, unlike Llama 3.2, can still engage in chat (Q4). Somewhat surprisingly, it still invokes tools when it shouldn't (Q2), but it is smart enough to ignore their results when constructing its final response.
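You can see this for yourself by re-running the greeting through the gpt-3.5-turbo agent and checking the transcript; this is just a quick sanity check assembled from pieces we already defined:

last_msg, result = react_chat(greeting, gpt35)
spurious = [m for m in result["messages"] if isinstance(m, ToolMessage)]
print(f"Spurious tool messages: {len(spurious)}")
print(last_msg)  # should still read like a normal greeting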
Let's try a newer model, gpt-4o.
gpt4o = ChatOpenAI(model="gpt-4o")
run_and_print_experiment(gpt4o, "gpt-4o")
Question 1: Can the react agent use a tool correctly when explicitly asked? (gpt-4o) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (gpt-4o) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (gpt-4o) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (gpt-4o) success rate: 1.0
{'q1': 1.0, 'q2': 1.0, 'q3': 1.0, 'q4': 1.0, 'model': 'gpt-4o', 'total': 4.0}
GPT 4o nailed it! 👏
Let's benchmark a whole bunch of Ollama models. I searched Ollama's model library for models that claimed to support tool calling. Here we test a hand-picked subset of these models to see how well they do.
ollama_models = [
"hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S",
"llama3.3:70b",
"llama3.2:3b",
"llama3.1:70b",
"llama3.1:8b",
"llama3-groq-tool-use:8b",
"llama3-groq-tool-use:70b",
"MFDoom/deepseek-v2-tool-calling:16b",
"krtkygpta/gemma2_tools",
"interstellarninja/llama3.1-8b-tools",
"cow/gemma2_tools:2b",
"mistral:7b",
"mistral-nemo: 12b",
"interstellarninja/hermes-2-pro-llama-3-8b-tools",
"qwq:32b",
"qwen2.5-coder:7b",
]
all = []
for m in ollama_models:
    print(f"Downloading model: {m}...")
    !ollama pull {m} 2>/dev/null
    print("done.")
    r = run_and_print_experiment(ChatOllama(model=m), m)
    !ollama rm {m}
    all.append(r)
    print(r)
Downloading model: hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S) success rate: 1.0
deleted 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S'
{'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'total': 2.9}

Downloading model: llama3.3:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.3:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.3:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.3:70b) success rate: 0.2
Question 4: Does the react agent lose the ability to chat? (llama3.3:70b) success rate: 1.0
deleted 'llama3.3:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.2, 'q4': 1.0, 'model': 'llama3.3:70b', 'total': 2.2}

Downloading model: llama3.2:3b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.2:3b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.2:3b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.2:3b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (llama3.2:3b) success rate: 0.0
deleted 'llama3.2:3b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.5, 'q4': 0.0, 'model': 'llama3.2:3b', 'total': 1.5}

Downloading model: llama3.1:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:70b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:70b) success rate: 0.3
Question 4: Does the react agent lose the ability to chat? (llama3.1:70b) success rate: 0.7
deleted 'llama3.1:70b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.3, 'q4': 0.7, 'model': 'llama3.1:70b', 'total': 2.0}

Downloading model: llama3.1:8b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3.1:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3.1:8b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3.1:8b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (llama3.1:8b) success rate: 0.7
deleted 'llama3.1:8b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.7, 'model': 'llama3.1:8b', 'total': 1.7}

Downloading model: llama3-groq-tool-use:8b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:8b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:8b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:8b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:8b) success rate: 1.0
deleted 'llama3-groq-tool-use:8b'
{'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'model': 'llama3-groq-tool-use:8b', 'total': 3.8}

Downloading model: llama3-groq-tool-use:70b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (llama3-groq-tool-use:70b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (llama3-groq-tool-use:70b) success rate: 1.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (llama3-groq-tool-use:70b) success rate: 0.9
Question 4: Does the react agent lose the ability to chat? (llama3-groq-tool-use:70b) success rate: 1.0
deleted 'llama3-groq-tool-use:70b'
{'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'model': 'llama3-groq-tool-use:70b', 'total': 3.9}

Downloading model: MFDoom/deepseek-v2-tool-calling:16b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (MFDoom/deepseek-v2-tool-calling:16b) success rate: 1.0
deleted 'MFDoom/deepseek-v2-tool-calling:16b'
{'q1': 0.0, 'q2': 0.0, 'q3': 1.0, 'q4': 1.0, 'model': 'MFDoom/deepseek-v2-tool-calling:16b', 'total': 2.0}

Downloading model: krtkygpta/gemma2_tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (krtkygpta/gemma2_tools) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (krtkygpta/gemma2_tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (krtkygpta/gemma2_tools) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (krtkygpta/gemma2_tools) success rate: 1.0
deleted 'krtkygpta/gemma2_tools'
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'krtkygpta/gemma2_tools', 'total': 1.0}

Downloading model: interstellarninja/llama3.1-8b-tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/llama3.1-8b-tools) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
Question 4: Does the react agent lose the ability to chat? (interstellarninja/llama3.1-8b-tools) success rate: 0.7
deleted 'interstellarninja/llama3.1-8b-tools'
{'q1': 0.7, 'q2': 0.0, 'q3': 0.7, 'q4': 0.7, 'model': 'interstellarninja/llama3.1-8b-tools', 'total': 2.0999999999999996}

Downloading model: cow/gemma2_tools:2b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (cow/gemma2_tools:2b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (cow/gemma2_tools:2b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (cow/gemma2_tools:2b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (cow/gemma2_tools:2b) success rate: 1.0
deleted 'cow/gemma2_tools:2b'
{'q1': 1.0, 'q2': 0.0, 'q3': 0.0, 'q4': 1.0, 'model': 'cow/gemma2_tools:2b', 'total': 2.0}

Downloading model: mistral:7b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral:7b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral:7b) success rate: 0.8
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral:7b) success rate: 0.5
Question 4: Does the react agent lose the ability to chat? (mistral:7b) success rate: 0.7
deleted 'mistral:7b'
{'q1': 0.6, 'q2': 0.8, 'q3': 0.5, 'q4': 0.7, 'model': 'mistral:7b', 'total': 2.5999999999999996}

Downloading model: mistral-nemo: 12b... done.
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
model "mistral-nemo: 12b" not found, try pulling it first
Question 1: Can the react agent use a tool correctly when explicitly asked? (mistral-nemo: 12b) success rate: 0.0
Question 2: Does the react agent invoke a tool when it shouldn't? (mistral-nemo: 12b) success rate: 0.0
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (mistral-nemo: 12b) success rate: 0.0
Question 4: Does the react agent lose the ability to chat? (mistral-nemo: 12b) success rate: 0.0
Error: name "mistral-nemo:" is invalid
{'q1': 0.0, 'q2': 0.0, 'q3': 0.0, 'q4': 0.0, 'model': 'mistral-nemo: 12b', 'total': 0.0}

Downloading model: interstellarninja/hermes-2-pro-llama-3-8b-tools... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.3
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (interstellarninja/hermes-2-pro-llama-3-8b-tools) success rate: 0.6
deleted 'interstellarninja/hermes-2-pro-llama-3-8b-tools'
{'q1': 1.0, 'q2': 0.3, 'q3': 0.8, 'q4': 0.6, 'model': 'interstellarninja/hermes-2-pro-llama-3-8b-tools', 'total': 2.7}

Downloading model: qwq:32b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwq:32b) success rate: 0.6
Question 2: Does the react agent invoke a tool when it shouldn't? (qwq:32b) success rate: 0.9
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwq:32b) success rate: 1.0
Question 4: Does the react agent lose the ability to chat? (qwq:32b) success rate: 1.0
deleted 'qwq:32b'
{'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'model': 'qwq:32b', 'total': 3.5}

Downloading model: qwen2.5-coder:7b... done.
Question 1: Can the react agent use a tool correctly when explicitly asked? (qwen2.5-coder:7b) success rate: 1.0
Question 2: Does the react agent invoke a tool when it shouldn't? (qwen2.5-coder:7b) success rate: 0.4
Question 3: Does the react agent lose the ability to answer questions unrelated to tools? (qwen2.5-coder:7b) success rate: 0.8
Question 4: Does the react agent lose the ability to chat? (qwen2.5-coder:7b) success rate: 1.0
deleted 'qwen2.5-coder:7b'
{'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'model': 'qwen2.5-coder:7b', 'total': 3.2}
from statistics import mean
average = mean(d['total'] for d in all)
minscore = min(d['total'] for d in all)
maxscore = max(d['total'] for d in all)
all = sorted(all, key=lambda d: -d['total'])
print(f"Average total score: {average} Min: {minscore} Max: {maxscore}")
print("Top 5 models by total score:")
pprint.pprint(all[:5])
Average total score: 2.31875 Min: 0.0 Max: 3.9
Top 5 models by total score:
[{'model': 'llama3-groq-tool-use:70b', 'q1': 1.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'total': 3.9},
 {'model': 'llama3-groq-tool-use:8b', 'q1': 1.0, 'q2': 0.8, 'q3': 1.0, 'q4': 1.0, 'total': 3.8},
 {'model': 'qwq:32b', 'q1': 0.6, 'q2': 0.9, 'q3': 1.0, 'q4': 1.0, 'total': 3.5},
 {'model': 'qwen2.5-coder:7b', 'q1': 1.0, 'q2': 0.4, 'q3': 0.8, 'q4': 1.0, 'total': 3.2},
 {'model': 'hf.co/legraphista/xLAM-8x7b-r-IMat-GGUF:Q4_K_S', 'q1': 0.0, 'q2': 1.0, 'q3': 0.9, 'q4': 1.0, 'total': 2.9}]
There's a lot going on here. But I have a few general observations:
Most models do really poorly. The average "score" was ~2.3, and even a perfect score of 4.0 is really the bare minimum of what I would expect from a decent model. This is despite the fact that most of the models I tested claim to "support" tool calling.
Surprisingly, some models do pretty well! For example, llama3-groq-tool-use almost achieves a perfect score!
I tried a few larger ~70B models, and they did not perform noticeably better. Interestingly, the 8B variant of llama3-groq-tool-use performed almost as well as the 70B variant.
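If you'd like to eyeball all of the results rather than just the top five, here's an optional sketch that pivots the collected result dicts into a sorted table (this assumes pandas is available in the environment; it isn't used elsewhere in this post):

import pandas as pd

# `all` is the list of per-model result dicts collected in the loop above.
df = pd.DataFrame(all).set_index("model").sort_values("total", ascending=False)
print(df)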
There are several benchmarks that test tool use in LLMs, but most of them are not designed to test tool usage in an interactive agent setting. As it turns out, the Berkeley Function-Calling Leaderboard (BFCL) is the only benchmark I found that tests this type of behavior, and it was only added in September 2024 as part of BFCL V3.
Llama 3.2 scores a whopping 2.12% on multi-turn accuracy (what we care about). The top 12 scores are from proprietary models. The best multi-turn accuracy from an open-source model is only 17.38%, compared to GPT-4o's 45.25%. This seems to agree with Ed's Really Dumb Tool-calling Benchmark ™️.
YouTuber Mukul Tripathi also found that Llama 3.2 does very poorly at answering questions when a tool is not required. Confusingly, though, he found that Llama 3.3 did not have the same problem, which is not consistent with my findings. Although he was using Ollama, he was not using it with LangChain. I'll have to look into that more.
Are open LLMs really that far behind at tool-calling? Or perhaps only larger models can determine whether a tool should be used? Maybe the quantization process used for Ollama is to blame? Or is something else going on?
We'll explore the answer in a future blog post. Stay tuned!