A research paper from UIUC called "Executable Code Actions Elicit Better LLM Agents" (CodeAct) makes a simple claim: instead of having LLM agents produce JSON tool calls, let them write executable code. They tested 17 LLMs and found success rates up to 20% higher.

The paper focuses on success rate. I wanted to measure something else: token efficiency. How much cheaper is CodeAct in practice?

So I built a benchmark.

The paper's idea in 30 seconds

Today, most LLM agents work by generating structured JSON to call tools one at a time. Each call requires a round trip: the model outputs a JSON tool call, the system executes it, feeds the result back, and the model decides what to do next. Repeat.
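
To make that concrete, here is roughly what one round looks like. The schema and call shapes below are illustrative, not any particular provider's exact wire format:

# One round of the tool-calling loop (illustrative shapes only).
tool_schema = {
    "name": "users_search",
    "description": "Search users by name or email",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# The model emits a structured call...
tool_call = {"name": "users_search", "arguments": {"query": "jane@example.com"}}

# ...the system executes it, appends the result to the conversation,
# and sends the whole (growing) history back to the model for the next round.
tool_result = {"role": "tool", "name": "users_search", "content": '[{"id": 42, "team_id": 7}]'}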

CodeAct says: skip all that. Give the model a typed API and let it write a Python (or TypeScript) program that does the whole task at once. The program runs in a sandbox, and only the final output comes back.
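
Here is a sketch of the CodeAct side of the same kind of lookup: one generated program instead of several rounds. The api object and function names are hypothetical stand-ins for whatever typed client the sandbox injects, not the paper's actual interface:

# Hypothetical generated action: the whole task as one program.
# `api` stands in for the typed client the sandbox exposes; the names are made up.
async def run(api):
    users = await api.users_search(query="jane@example.com")
    team = await api.teams_get(id=users[0]["team_id"])
    return {"user": users[0], "team": team}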

The authors argue this works better because LLMs have seen billions of lines of real code during training, but very few examples of tool calling. Code is their native language.

My setup

I generated 120 mock API tools across 20 resources (Users, Projects, Tasks, Bugs, Invoices, Sprints, etc.), each with 6 actions: list, get, create, update, delete, search. Think of it as a full project management platform API.
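
The generation is mechanical: 20 resources x 6 actions = 120 tool schemas. A minimal sketch of the scheme (the helper and parameter shapes are simplified; the real benchmark fills in per-action parameters):

# Sketch of the 20 resources x 6 actions = 120 mock tools (simplified parameter shapes).
RESOURCES = ["users", "projects", "tasks", "bugs", "invoices", "sprints"]  # ...plus 14 more
ACTIONS = ["list", "get", "create", "update", "delete", "search"]

def make_tool(resource: str, action: str) -> dict:
    return {
        "name": f"{resource}_{action}",
        "description": f"{action.capitalize()} {resource} in the project management platform",
        "parameters": {"type": "object", "properties": {}},  # real tools declare per-action fields
    }

tools = [make_tool(r, a) for r in RESOURCES for a in ACTIONS]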

Then I ran 5 tasks of increasing complexity against Mistral Small (free tier) in two modes:

  • Tool Calling: All 120 tools presented as function definitions. The model calls them one at a time across multiple rounds.
  • CodeAct: The 120 tools converted into a TypeScript API definition (~16K chars). The model writes one code snippet. Single round.

Results

Task                                 Tool Calling tokens   CodeAct tokens   Savings   Rounds
Find user + get team info                         51,932            4,323     91.7%    4 → 1
Find tasks + add comments                         52,663            4,513     91.4%    4 → 1
Create project + milestone + tasks                27,012            4,649     82.8%    2 → 1
Aggregate data across projects                   146,073            4,579     96.9%   10 → 1
Search bugs + notify assignees                    66,369            4,615     93.0%    5 → 1

Average token savings: 91.2%. CodeAct used ~4,500 tokens per task. Tool Calling averaged ~68,800.

Why the gap is so large

Every round of tool calling resends the entire conversation history plus all 120 tool schemas to the model. That's ~13,000 tokens just for the tool definitions. Do 4 rounds and you've burned 52,000 tokens on schema repetition alone.

CodeAct pays the API definition cost once (~4,000 tokens for the TypeScript types) and gets back a complete solution in a single round.
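
A back-of-envelope model using those numbers makes the scaling obvious. This ignores conversation history growth and the task prompt itself, so it understates the tool-calling side:

# Rough cost model, not a measurement: schema repetition vs. a one-time API definition.
SCHEMA_TOKENS_PER_ROUND = 13_000   # all 120 tool definitions, resent every round
CODEACT_API_TOKENS = 4_000         # TypeScript API definition, sent once

def tool_calling_schema_cost(rounds: int) -> int:
    return rounds * SCHEMA_TOKENS_PER_ROUND

print(tool_calling_schema_cost(4))   # 52,000 on schema repetition alone
print(tool_calling_schema_cost(10))  # 130,000 -- most of the aggregation task's 146K total
print(CODEACT_API_TOKENS)            # paid once, regardless of how many steps the task needs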

The worst case was the data aggregation task. Tool calling hit the 10-round limit and consumed 146,073 tokens. It had to list projects, then for each project fetch tasks, time entries, and invoices. Each step resent everything. CodeAct wrote a clean async function with loops and did it in 4,579 tokens. That's a 97% reduction.
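
For flavor, the generated solution had roughly this shape. My runs produced TypeScript against the generated API definition; this Python sketch mirrors the structure, with hypothetical client functions:

# Rough shape of the CodeAct aggregation code, not the verbatim model output.
async def aggregate(api):
    summary = []
    for project in await api.projects_list():
        tasks = await api.tasks_list(project_id=project["id"])
        entries = await api.time_entries_list(project_id=project["id"])
        invoices = await api.invoices_list(project_id=project["id"])
        summary.append({
            "project": project["name"],
            "open_tasks": sum(1 for t in tasks if t["status"] != "done"),
            "logged_hours": sum(e["hours"] for e in entries),
            "invoiced": sum(i["amount"] for i in invoices),
        })
    return summary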

Code quality was better too

This matches the paper's findings. The code that CodeAct produced was well structured: proper error handling, async/await, clear variable names, loops for batch operations. The tool calling approach sometimes called the wrong tool or forgot to chain results between steps.

The paper's explanation: LLMs have been trained on millions of real codebases. They've seen very few examples of tool calling JSON. Code is their native output format. Asking an LLM to write code that calls APIs is asking it to do something it already excels at. Asking it to produce tool call JSON is asking it to work in a format it barely knows.

What CodeAct adds beyond just token savings

The original paper highlights benefits I also observed:

  • Composability: The model can combine multiple API calls with loops, conditionals, and data transformations in a single code block. Tool calling can only do one call at a time.
  • Self-debugging: CodeAct agents can catch errors and revise their code (a minimal retry loop is sketched after this list). Tool calling agents just fail.
  • Flexibility: The model can use standard library functions (string manipulation, date math, array operations) alongside API calls. No need for a dedicated tool for every small operation.
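
The self-debugging loop is cheap to wire up on the harness side. A minimal sketch, assuming hypothetical ask_model() and execute_in_sandbox() helpers:

# Minimal self-debugging loop (hypothetical helpers; deliberately simplistic).
def run_with_retries(task_prompt: str, max_attempts: int = 3) -> str:
    code = ask_model(task_prompt)                  # model writes a first attempt
    for _ in range(max_attempts):
        ok, output = execute_in_sandbox(code)      # returns (success, stdout or traceback)
        if ok:
            return output
        # Feed the traceback back so the model can revise its own code.
        code = ask_model(f"{task_prompt}\n\nYour code failed with:\n{output}\nFix it.")
    raise RuntimeError("No working code within the attempt budget")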

The tradeoff is that you need a sandbox to execute the generated code. The paper uses a Python interpreter. For production, you'd want something isolated.

Reproduce it

The entire benchmark is a single Python file. No dependencies. Set your Mistral API key (free tier works) and run:

export MISTRAL_API_KEY=your_key
python3 benchmark.py

It generates 120 mock tools, runs 5 tasks in both modes, and prints a comparison table.

Bottom line

The CodeAct paper was right. In my benchmark (120 tools, Mistral Small), letting the LLM write code instead of calling tools saved 91% of tokens on average. The more complex the task, the bigger the gap: the data aggregation task hit 97% savings.

If you're building AI agents with many tools, consider the CodeAct approach: convert your tool schemas into a typed API definition, let the model write code, and execute it in a sandbox. Your token bill will thank you.