ARTIFICIAL INTELLIGENCE
Code Agents: The Future of Agentic AI
HuggingFace smolagents framework in action
Mariya Mansurova
May 26, 2025 17 min read
Image generated by author using DALL-E 3
We are living in the age of AI agents. LLMs are no longer just tools.
They’ve become active participants in our lives, boosting productivity
and transforming the way we live and work.
OpenAI recently launched Operator, an AI agent that can
autonomously perform various tasks, from browsing the web to
filling out forms and scheduling appointments.
Anthropic introduced MCP (Model Context Protocol), a new
standard for how AI assistants interact with the outside
world. With over five thousand active MCP servers already,
adoption is growing rapidly.
AI agents are also changing the landscape of software
engineering. Tools like GitHub Copilot’s agentic mode, Claude
Code, OpenAI Codex, and others are not only improving
developer productivity and code quality but also democratising
the field, making software development accessible to people
without a technical background.
We’ve previously looked at different AI Agent frameworks, such as
LangGraph or CrewAI. In this article, I would like to discuss a new
one I’ve been exploring recently — HuggingFace smolagents. It’s an
interesting framework since it implements the concept of code
agents.
In this article, we will explore several topics:
What code agents are (teaser: it’s not related to vibe coding).
How to use the HuggingFace smolagents framework in practice.
Whether it’s secure to give LLMs so much agency.
The real difference in performance between code agents and
traditional tool-calling agents.
AI Agents recap
Let’s start with a quick refresher: what exactly are AI agents?
HuggingFace provides a clear and concise definition of what they
mean by agents.
AI Agents are programs where LLM outputs control the workflow.
So, we need an agentic flow when we want a system to reason and
act based on observations. Actually, agency is not a binary variable
(yes or no), but a spectrum.
At one end, we can have systems without agency at all, for
example, a simple process where an LLM defines the sentiment
of a text, translates it or summarises it.
The next level is routing, where an LLM can classify an incoming
question and decide which path to take — for example, calling a
tool if a customer is asking about the status of their current
order, and transferring the conversation to a human CS agent
otherwise.
More advanced systems can exhibit higher degrees of agency.
These might include the ability to execute other LLMs (multi-
agent setup) or even create new tools on the fly.
Code agents fall into this more advanced category. They are multi-
step agents that execute tool calls in the form of code, in contrast
to the more traditional approach using a JSON format with the tool
name and arguments.
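To make the contrast concrete, here is a purely illustrative sketch (the get_weather tool is made up for this example): a traditional agent emits one JSON action per step, while a code agent emits Python that the framework executes, so it can batch several calls into a single step.

# A hypothetical tool, defined only to keep the example self-contained
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

# Traditional tool-calling agent: one action per step, expressed as JSON, e.g.
# {"name": "get_weather", "arguments": {"city": "Paris"}}

# Code agent: the same intent expressed as executable Python, which can
# combine several calls, loops and conditions in a single step
for city in ["Paris", "Berlin", "Madrid"]:
    print(get_weather(city))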
Several recent papers have shown that using code in agentic flows
leads to better results:
Wang, Xingyao, et al. “Executable Code Actions Elicit Better LLM
Agents”
Nguyen, Dang, et al. “DynaSaur: Large Language Agents Beyond
Predefined Actions”
Yang, Ke, et al. “If LLM Is the Wizard, Then Code Is the Wand: A
Survey on How Code Empowers Large Language Models to Serve
as Intelligent Agents.”
It makes sense when you think about it. We’ve been developing
programming languages for decades to solve complex problems. So,
it’s natural that these languages are better suited to an LLM’s tasks
than simple JSON configs. An additional benefit is that LLMs are
already quite good at writing code in common programming
languages, thanks to the vast amount of available data for training.
This approach comes with several other benefits as well:
By generating code, an LLM is not limited to a predefined set of
tools and can create its own functions.
It can combine multiple tools within a single action using
conditions and loops, which helps reduce the number of steps
required to complete a task.
It also enables the model to work with a wider variety of
outputs, such as generating charts, images, or other complex
objects.
These benefits aren’t just theoretical; we can observe them in
practice. In “Executable Code Actions Elicit Better LLM Agents”, the
authors show that code agents outperform traditional methods,
achieving a higher success rate and completing a task in fewer
steps, which in turn reduces costs.
Chart from Wang, Xingyao, et al. “Executable Code Actions Elicit Better LLM Agents”
Code agents look promising, which inspired me to try this approach
in practice.
HuggingFace smolagents framework
First try
Luckily, we don’t need to build code agents from scratch, as
HuggingFace has released a handy library called smolagents that
implements this approach.
Let’s start by installing the library.
pip install smolagents[litellm]
# I've used the litellm extra, since I'm planning to use an OpenAI model
Next, let’s build a basic example. To initialise the agent, we need just
two parameters: model and tools.
I plan to use OpenAI for the model, which is accessible via LiteLLM.
However, the framework supports other options as well: you can use
a local model via Ollama or TransformersModel, or public models via
Inference Providers (you can find more details in the documentation).
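For illustration, here is a hedged sketch of two alternative setups (the model IDs and parameters are examples, and the exact arguments may vary between smolagents versions, so check the docs):

from smolagents import LiteLLMModel, TransformersModel

# Local model served via Ollama, routed through LiteLLM
# (assumes Ollama is running locally and the model has been pulled)
local_model = LiteLLMModel(
    model_id="ollama_chat/llama3.1",
    api_base="http://localhost:11434",
)

# Model loaded directly with transformers on your own hardware
# (assumes the transformers dependencies and enough resources)
hf_model = TransformersModel(model_id="Qwen/Qwen2.5-Coder-7B-Instruct")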
I didn’t specify any tools, but set add_base_tools=True, so my
agent gets a default set of tools, such as a Python interpreter and
DuckDuckGo search. Let’s try it out with a simple question.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="openai/gpt-4o-mini",
    api_key=config['OPENAI_API_KEY']  # config is a dict with API keys loaded elsewhere
)

agent = CodeAgent(tools=[], model=model, add_base_tools=True)

agent.run(
    """I have 5 different balls and I randomly select 2.
    How many possible combinations of the balls I can get?""",
)
As a result, we see a really nicely formatted execution flow, which
makes it easy to follow what the agent is doing.
Image by author
So, the agent found an answer in one step and wrote Python code to
calculate the number of combinations.
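For reference, the whole task boils down to a one-liner; the agent’s step was roughly equivalent to the following (a reconstruction, not the verbatim log):

import math

# Number of ways to choose 2 balls out of 5, order doesn't matter: C(5, 2)
n_combinations = math.comb(5, 2)
print(n_combinations)  # 10
# Inside smolagents, the agent then returns this via the final_answer tool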
The output is quite helpful, but we can go even deeper and look at
the complete information related to execution (including prompts),
via agent.memory.steps . Let’s look at the system prompt used by the
agent.
You are an expert assistant who can solve any task using code blobs.
You will be given a task to solve as best you can.
To do so, you have been given access to a list of tools: these tools
are basically Python functions which you can call with code.
To solve the task, you must plan forward to proceed in a series of
steps, in a cycle of 'Thought:', 'Code:',
and 'Observation:' sequences.
At each step, in the 'Thought:' sequence, you should first explain
your reasoning towards solving the task and the tools that you want
to use.
Then in the 'Code:' sequence, you should write the code in simple
Python. The code sequence must end with '<end_code>' sequence.
During each intermediate step, you can use 'print()' to save
whatever important information you will then need.
These print outputs will then appear in the 'Observation:' field,
which will be available as input for the next step.
In the end you have to return a final answer using
the final_answer tool.
Here are a few examples using notional tools: <...>
It’s pretty clear that smolagents implements the ReAct approach
(introduced in the paper by Yao et al. “ReAct: Synergizing Reasoning
and Acting in Language Models”) and uses a few-shot prompting
technique.
The smolagents library handles all behind-the-scenes work involved
in the agent workflow: assembling the system prompt with all
necessary information for the LLM (e.g. the available tools), parsing the
output and executing the generated code. It also provides
comprehensive logging and a retry mechanism to help correct
mistakes.
Additionally, the library offers memory management features. By
default, all execution results are saved to memory, but you can
customise this behaviour. For example, you can remove some
intermediary results from the memory to reduce the number of
tokens or execute the agent step-by-step. While we won’t dive deep
into memory management here, you can find handy code examples
in the documentation.
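As a rough illustration (a minimal sketch based on the documented memory API; the exact step attributes, such as observations, may differ between versions), you can inspect the stored steps, trim bulky observations, and continue the run with the existing memory:

# Inspect what the agent has stored so far
for step in agent.memory.steps:
    print(type(step).__name__)

# Drop heavy observations from all but the last two steps to save tokens
# (assumes action steps expose an `observations` attribute, as in the docs)
for step in agent.memory.steps[:-2]:
    if hasattr(step, "observations"):
        step.observations = None

# Continue the same conversation without resetting the memory
agent.run("Now express the result as a percentage.", reset=False)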
Security
Now, it’s time to discuss the drawbacks of the code agents’
approach. Giving an LLM more agency by allowing it to execute
arbitrary code introduces higher risks. Indeed, an LLM can run
harmful code either by mistake (since LLMs are still far from perfect)
or due to targeted attacks like prompt injections or compromised
models.
To mitigate these risks, the local Python executor implemented in
the smolagents library has a bunch of safety checks:
By default, imports are not allowed unless the package has been
explicitly added to the additional_authorized_imports list.
Moreover, submodules are blocked by default, so you must
authorise them explicitly (e.g. numpy.*). This is because some
packages can expose potentially harmful submodules, e.g.
random._os.
The total number of executed operations is capped, preventing
infinite loops and resource exhaustion.
Any operation not explicitly defined in the interpreter will raise
an error.
Let’s test whether these safety measures actually work.
from smolagents.local_python_executor import LocalPythonExecutor
custom_executor = LocalPythonExecutor(["numpy.*", "random"])
# function to have pretty formatted exceptions
def run_capture_exception(command: str):
    try:
        custom_executor(command)
    except Exception as e:
        print("ERROR:\n", e)
# Unauthorised imports are blocked
harmful_command="import os; exit_code = os.system('<bad_command>')"
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'import os' due to:
# InterpreterError: Import of os is not allowed. Authorized imports
# are: ['datetime', 'itertools', 're', 'math', 'statistics', 'time',
# 'numpy.*', 'random', 'collections', 'unicodedata', 'stat']
# Submodules are also blocked unless stated specifically
harmful_command="from random import _os; exit_code = _os.system('<bad_
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'exit_code = _os.system('<bad_command>')'
# due to: InterpreterError: Forbidden access to module: os
# The cap on the number of iterations breaks infinite loops
harmful_command = '''
while True:
    pass
'''
run_capture_exception(harmful_command)
# ERROR: Code execution failed at line 'while True: pass' due to:
# InterpreterError: Maximum number of 1000000 iterations in While loop
# exceeded
# Undefined operations don't work
harmful_command="!echo <bad_command>"
custom_executor(harmful_command)
# ERROR: Code parsing failed on line 1 due to: SyntaxError
It seems we have some safety nets with code agents. However,
despite these safeguards, risks persist when you’re executing code
locally. For example, an LLM can recursively create threads on your
computer or create too many files, exhausting your resources. A
possible solution is to execute code in a sandboxed environment,
such as using Docker or solutions like E2B. I’m willing to be
adventurous and run my code locally, but if you prefer a more risk-
averse approach, you can follow the sandbox set-up guidance in the
documentation.
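To illustrate the idea (a conceptual sketch using the docker Python SDK rather than the smolagents sandbox integration itself; see the documentation for the built-in options), the generated code can be run in a throwaway container with no network access and a memory cap:

import docker

client = docker.from_env()
generated_code = "print(sum(range(10)))"  # stand-in for agent-generated code

# Run the snippet in an isolated, resource-limited container
output = client.containers.run(
    image="python:3.12-slim",
    command=["python", "-c", generated_code],
    network_disabled=True,  # no outbound network access
    mem_limit="256m",       # cap memory usage
    remove=True,            # clean up the container afterwards
)
print(output.decode())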
Code agent vs traditional Tool-Calling agent
It’s claimed that code agents perform better than the traditional
JSON-based approach. Let’s put this to the test.
I will use the task of metrics change analysis that I described in my
previous article, “Making sense of KPI changes”. We will start with a
straightforward case: analysing a simple metric (revenue) split by
one dimension (country).
import pandas as pd

raw_df = pd.read_csv('absolute_metrics_example.csv', sep = '\t')
df = raw_df.groupby('country')[['revenue_before', 'revenue_after_scenario_2']].sum()\
    .sort_values('revenue_before', ascending = False).rename(
        columns = {'revenue_after_scenario_2': 'after',
                   'revenue_before': 'before'})
Image by author
The smolagents library provides two agent classes, which we can use
to compare the two approaches:
CodeAgent — an agent that acts by generating and executing
code,
ToolCallingAgent — a traditional JSON-based agent.
Our agents will need some tools, so let’s implement them. There are
multiple options to create tools in smolagents: we can re-use
LangChain tools, download them from HuggingFace Hub or simply
create Python functions. We will take the most straightforward
approach by writing a couple of Python functions and annotating
them with @tool .
I will create two tools: one to estimate the relative difference
between metrics, and another to calculate the sum of a list. Since
the LLM will be using these tools, providing detailed descriptions is
crucial.
from smolagents import tool

@tool
def calculate_metric_increase(before: float, after: float) -> float:
    """
    Calculate the percentage change of the metric between before and after values.

    Args:
        before: value before
        after: value after
    """
    return (before - after) * 100 / before

@tool
def calculate_sum(values: list) -> float:
    """
    Calculate the sum of list

    Args:
        values: list of numbers
    """
    return sum(values)
Teaser: I will later realise that I should have provided more tools to
the agent, but at the time I genuinely overlooked this.
CodeAgent
Let’s start with a CodeAgent. I’ve initialised the agent with the tools
we defined earlier and authorised the usage of some Python
packages that might be helpful.
agent = CodeAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=10,
    additional_authorized_imports=["pandas", "numpy", "matplotlib.*", "plotly.*"],
    verbosity_level=1
)
task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.
Please round all floating-point numbers in the output
to two decimal places.
"""
agent.run(
    task,
    additional_args={"df": df},
)
Overall, the code agent completed the task in just two steps, using
only 5,451 input and 669 output tokens. The result also looks quite
plausible.
{'total_before': 1731985.21,
 'total_after': 1599065.55,
 'total_change': -132919.66,
 'segment_changes': {
   'absolute_change': {'other': 4233.09, 'UK': -4376.25, 'France': -132847.57,
                       'Germany': -690.99, 'Italy': 979.15, 'Spain': -217.09},
   'percentage_change': {'other': 0.67, 'UK': -0.91, 'France': -55.19,
                         'Germany': -0.43, 'Italy': 0.81, 'Spain': -0.23},
   'contribution_to_change': {'other': -3.18, 'UK': 3.29, 'France': 99.95,
                              'Germany': 0.52, 'Italy': -0.74, 'Spain': 0.16}}}
Let’s take a look at the execution flow. The LLM received the
following prompt.
╭─────────────────────────── New run ────────────────────────
│ │
│ Here is a pandas dataframe showing revenue by segment, │
│ comparing values before and after. │
│ Could you please help me understand the changes? │
│ Specifically: │
│ 1. Estimate how the total revenue and the revenue for each │
│ segment have changed, both in absolute terms and as a │
│ percentage. │
│ 2. Calculate the contribution of each segment to the total │
│ change in revenue. │
│ │
│ Please round all floating-point numbers in the output to two │
│ decimal places. │
│ │
│ You have been provided with these additional arguments, that │
│ you can access using the keys as variables in your python │
│ code: │
│ {'df': before after │
│ country │
│ other 632767.39 637000.48 │
│ UK 481409.27 477033.02 │
│ France 240704.63 107857.06 │
│ Germany 160469.75 159778.76 │
│ Italy 120352.31 121331.46 │
│ Spain 96281.86 96064.77}. │
│ │
╰─ LiteLLMModel - openai/gpt-4o-mini ────────────────────────────
In the first step, the LLM generated a dataframe and performed all
calculations. Interestingly, it chose to write all the code
independently rather than using the provided tools.
Even more surprisingly, the LLM recreated the dataframe based on
the input data instead of referencing it directly. This approach is not
ideal (especially when working with massive datasets), as it can lead
to errors and higher token usage. This behaviour could potentially be
improved by using a more explicit system prompt. Here’s the code
the agent executed in the first step.
import pandas as pd

# Creating the DataFrame from the provided data
data = {
    'before': [632767.39, 481409.27, 240704.63, 160469.75,
               120352.31, 96281.86],
    'after': [637000.48, 477033.02, 107857.06, 159778.76,
              121331.46, 96064.77]
}
index = ['other', 'UK', 'France', 'Germany', 'Italy', 'Spain']
df = pd.DataFrame(data, index=index)

# Calculating total revenue before and after
total_before = df['before'].sum()
total_after = df['after'].sum()

# Calculating absolute and percentage change for each segment
df['absolute_change'] = df['after'] - df['before']
df['percentage_change'] = (df['absolute_change'] / df['before']) * 100

# Calculating total revenue change
total_change = total_after - total_before

# Calculating contribution of each segment to the total change
df['contribution_to_change'] = (df['absolute_change'] / total_change) * 100

# Rounding results
df = df.round(2)

# Printing the calculated results
print("Total revenue before:", total_before)
print("Total revenue after:", total_after)
print("Total change in revenue:", total_change)
print(df)
In the second step, the LLM simply constructed the final answer by
referring to the variables calculated at the previous step (which is
really neat).
final_answer({
    "total_before": round(total_before, 2),
    "total_after": round(total_after, 2),
    "total_change": round(total_change, 2),
    "segment_changes": df[['absolute_change', 'percentage_change',
                           'contribution_to_change']].to_dict()
})
It worked pretty well.
ToolCallingAgent
Now, it’s time to see how a traditional tool-calling agent tackles
this problem. We initialised it in a similar manner and ran the same task.
from smolagents import ToolCallingAgent

traditional_agent = ToolCallingAgent(
    model=model,
    tools=[calculate_metric_increase, calculate_sum],
    max_steps=30,
)
task = """
Here is a dataframe showing revenue by segment, comparing values
before and after.
Could you please help me understand the changes? Specifically:
1. Estimate how the total revenue and the revenue for each segment
have changed, both in absolute terms and as a percentage.
2. Calculate the contribution of each segment to the total
change in revenue.
Please round all floating-point numbers in the output
to two decimal places.
"""
traditional_agent.run(
    task,
    additional_args={"df": df},
)
The results are far from ideal: only the relative changes are correct,
while the rest of the numbers are pure hallucinations. I have to
admit, the core issue was the lack of appropriate tools (specifically,
tools to calculate differences and to estimate shares). However, the
agent should have flagged missing tools rather than generating
random numbers.
Total revenue change: -7319.66 (-7.67%). Revenue Changes by Segment:
- Other: +232.09 (-0.67%)
- UK: -4376.25 (0.91%)
- France: -132847.57 (55.19%)
- Germany: -690.99 (0.43%)
- Italy: +979.15 (-0.81%)
- Spain: -217.09 (0.23%)
Contribution to total change:
- Other: 0.03%
- UK: -59.88%
- France: -181.77%
- Germany: -9.43%
- Italy: +13.38%
- Spain: -0.03%
In terms of resource usage, the tool-calling agent performed
significantly worse: 12 steps, with 29,201 input and 1,695 output
tokens. So, code agents clearly offer cost savings when it comes to
agent execution.
Let’s dig a bit deeper to understand what the agent actually did.
First, it took four iterations just to figure out how to call
calculate_sum correctly and estimate the total revenue before and
after.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 1 ━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': {'before': [632767.39, 481409.27, 240704.63, │
│ 160469.75, 120352.31, 96281.86], │
│ 'after': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]}} │
╰──────────────────────────────────────────────────────────
Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 1: Duration 4.06 sec | Input tokens: 1,190 | Output: 250]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 2 ━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': {'before': [632767.39, 481409.27, 240704.63, │
│ 160469.75, 120352.31, 96281.86], │
│ 'after': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]}} │
╰──────────────────────────────────────────────────────────
Invalid call to tool 'calculate_sum' with arguments
{"values": {"before": [632767.39, 481409.27, 240704.63,
160469.75, 120352.31, 96281.86],
"after": [637000.48, 477033.02, 107857.06, 159778.76,
121331.46, 96064.77]}}:
unsupported operand type(s) for +: 'int' and 'str'
You should call this tool with correct input arguments.
Expected inputs: {"values": {"type": "object",
"description": "list of values"}}
Returns output type: number
Tool description: 'Calculate the sum of list'
[Step 2: Duration 8.60 sec | Input tokens: 2,815 | Output: 386]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 3 ━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': [632767.39, 481409.27, 240704.63, 160469.75, │
│ 120352.31, 96281.86]} │
╰──────────────────────────────────────────────────────────
Observations: 1731985.2100000002
[Step 3: Duration 1.23 sec | Input tokens: 4,871 | Output: 488]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 4 ━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────
│ Calling tool: 'calculate_sum' with arguments: │
│ {'values': [637000.48, 477033.02, 107857.06, 159778.76, │
│ 121331.46, 96064.77]} │
╰──────────────────────────────────────────────────────────
Observations: 1599065.55
The next seven steps were spent calculating the relative metric
changes using the calculate_metric_increase tool.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 5 ━━━━━━━━━━━━━━━━━━━━━━
╭──────────────────────────────────────────────────────────
│ Calling tool: 'calculate_metric_increase' with │
│ arguments: {'before': 1731985.21, 'after': 1599065.55} │
╰──────────────────────────────────────────────────────────
Observations: 7.674410799385517
<-- similar tool calls for all country segments -->
In the end, the agent put together the final answer.
So, if the LLM had also had tools to calculate the absolute difference
and the share of the total change, it would have needed roughly 14 more
iterations (seven calls for each additional metric) and even more tokens.
Of course, we can prevent such inefficiencies by carefully designing
the tools we provide:
We could modify our functions to work with lists of values
instead of single items, which would significantly reduce the
number of steps.
Additionally, we could create more complex functions that
calculate all necessary metrics at once (similar to what the code
agent did), as sketched below. This way, the LLM wouldn’t need to
perform calculations step by step. However, this approach might
reduce the flexibility of the system.
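For illustration, here is a hedged sketch of what such a combined tool could look like (the name analyse_metric_changes and its output format are my own choices for this example, not something used in the runs above):

@tool
def analyse_metric_changes(before: dict, after: dict) -> dict:
    """
    Calculate absolute change, percentage change and contribution to the
    total change for each segment in one call.

    Args:
        before: mapping from segment name to the metric value before
        after: mapping from segment name to the metric value after
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    total_change = total_after - total_before
    segments = {}
    for segment in before:
        abs_change = after[segment] - before[segment]
        segments[segment] = {
            "absolute_change": round(abs_change, 2),
            "percentage_change": round(abs_change * 100 / before[segment], 2),
            "contribution_to_change": round(abs_change * 100 / total_change, 2),
        }
    return {
        "total_before": round(total_before, 2),
        "total_after": round(total_after, 2),
        "total_change": round(total_change, 2),
        "segments": segments,
    }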
Even though the results weren’t ideal due to a poor choice of tools, I
still find this example quite insightful. It’s clear that code agents are
more powerful, cost-efficient and flexible as they can invent their
own comprehensive tools and perform multiple actions in a single
step.
You can find the complete code and execution logs on GitHub.
Summary
We’ve learned a lot about code agents. Now, it’s time to wrap
things up with a quick summary.
Code agents are LLM agents that “think” and act using Python code.
Instead of calling tools via JSON, they generate and execute actual
code. This makes them more flexible and cost-efficient, as they can
invent their own comprehensive tools and perform multiple actions
in a single step.
HuggingFace has brought this approach to life in their framework,
smolagents. Smolagents makes it easy to build quite complex agents
without much hassle, while also providing safety measures during
the code execution.
In this article, we’ve explored the basic functionality of the
smolagents library. But there’s a lot more to it. In the next article, we
will dive into more advanced features (like multi-agent setup and
planning steps) to build the agent that can narrate KPI changes. Stay
tuned!
Thank you for reading this article. I hope it was insightful for you.
Reference
This article is inspired by the “Building Code Agents with Hugging
Face smolagents” short course by DeepLearning.AI.