Llama 3 / 3.1 / 3.2 Tool Calling
This page is a getting-started guide for wiring Llama 3 / 3.1 / 3.2 tool calling into your own application. It covers the full path: load a GGUF, register custom tools, run the agent loop, and observe what the model sees and emits each round. The latter half explains the two prompt/response formats Meta documents and why the default works for Llama 3.2.
For Llama 3.2 1B / 3B (and any Llama 3.x in 2025), leave the defaults alone. The default format is Llama3ToolFormat.JSON, the bare-JSON form Llama 3.2 was fine-tuned on.
Quick start: try it from the CLI
Run the bundled demo against a Llama 3.x GGUF — useful as a sanity check before embedding the same code in your app.
./gradlew :llm-apps:kllama-cli:run --quiet \
  --args='-m /path/to/Llama-3.2-1B-Instruct-Q8_0.gguf --demo -s 256 -k 0.7 "What files are in /tmp?"'
The demo registers two tools (list_files, calculator), prints the rendered prompt and tool schemas, and runs the agent loop until the model produces a final assistant message. Expect output like:
[Tools] (2)
- list_files: List files and directories in a local folder. ...
- calculator: Evaluate a mathematical expression. ...
[Prompt → Round 1] (1553 chars)
┌──────────────────────────────────────────────────────────────────────┐
│ <|begin_of_text|><|start_header_id|>system<|end_header_id|>
│ ...full Llama 3 tool-calling system prompt with both function schemas...
│ <|eot_id|><|start_header_id|>user<|end_header_id|>
│ What files are in /tmp?<|eot_id|>...
└──────────────────────────────────────────────────────────────────────┘
[Raw Assistant → Round 1] {"name": "list_files", "parameters": {"path": "/tmp"}}
[Tool Call] list_files({"path":"/tmp"})
[Tool Result] list_files -> [dir] .ICE-unix ... and 4647 more entries
The agent loop then runs round 2, feeding the tool result back so the model can summarise.
Use it from your own Kotlin app
The pieces you need live in three modules:
- llm-runtime-kllama: KLlamaJava.loadGGUF(path) builds the runtime + tokenizer in one call (Java-friendly facade; works fine from Kotlin too).
- llm-agent: ChatSession, AgentLoop, Tool, ToolRegistry, AgentListener.
- llm-core: pulled in transitively.
Step 1 — Add the dependency
dependencies {
    implementation(platform("sk.ainet.transformers:skainet-transformers-bom:0.31.0"))
    implementation("sk.ainet.transformers:skainet-transformers-runtime-kllama")
    implementation("sk.ainet.transformers:skainet-transformers-agent")
}
The runtime needs the Java Vector API at launch:
--enable-preview --add-modules jdk.incubator.vector
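If you launch through Gradle, one way to pass these flags is through the application plugin. The fragment below is a minimal sketch that assumes a Kotlin DSL build with the application plugin applied; if you start the JVM some other way, pass the same two flags there instead.

```kotlin
// build.gradle.kts (sketch; assumes the `application` plugin fits your build)
plugins {
    application
}

application {
    // The flags the runtime needs at launch, as listed above.
    applicationDefaultJvmArgs = listOf(
        "--enable-preview",
        "--add-modules", "jdk.incubator.vector",
    )
}

// Tests run in their own forked JVM, so they need the same flags.
tasks.withType<Test>().configureEach {
    jvmArgs("--enable-preview", "--add-modules", "jdk.incubator.vector")
}
```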
Step 2 — Load the model
import sk.ainet.apps.kllama.java.KLlamaJava
import java.nio.file.Path
val session = KLlamaJava.loadGGUF(Path.of("models/Llama-3.2-1B-Instruct-Q8_0.gguf"))
// session.runtime : InferenceRuntime<FP32>
// session.tokenizer: Tokenizer
// session is AutoCloseable — close it to release the Arena.
KLlamaJava.loadGGUF accepts Llama / Mistral GGUFs and bundles the loader, tokenizer, and runtime construction. For SafeTensors checkpoints use loadSafeTensors(modelDir).
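Because the session is AutoCloseable, Kotlin's use is a convenient way to guarantee the close call. The sketch below uses a stand-in class (FakeSession, hypothetical) instead of the real session so that it runs without a model file; with the real API you would call use on the value returned by loadGGUF.

```kotlin
// Hypothetical stand-in for the object returned by KLlamaJava.loadGGUF.
class FakeSession : AutoCloseable {
    var closed = false
        private set

    override fun close() {
        closed = true // the real session releases its Arena here
    }
}

// `use` closes the session when the block finishes, even if it throws.
fun runWithSession(session: FakeSession): String =
    session.use { "done" }
```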
Step 3 — Define your tool
A tool is a ToolDefinition (name + JSON-Schema parameters) plus an execute function.
import kotlinx.serialization.json.*
import sk.ainet.apps.kllama.chat.Tool
import sk.ainet.apps.kllama.chat.ToolDefinition
class WeatherTool : Tool {
    override val definition = ToolDefinition(
        name = "get_weather",
        description = "Get the current weather for a city.",
        parameters = buildJsonObject {
            put("type", "object")
            putJsonObject("properties") {
                putJsonObject("city") {
                    put("type", "string")
                    put("description", "City name, e.g. 'Bratislava'.")
                }
            }
            putJsonArray("required") { add(JsonPrimitive("city")) }
        }
    )

    override fun execute(arguments: JsonObject): String {
        val city = arguments["city"]?.jsonPrimitive?.content
            ?: return "Error: missing 'city'"
        // Real call to your weather backend goes here.
        // Build the reply with buildJsonObject so the city name is escaped correctly.
        return buildJsonObject {
            put("city", city)
            put("tempC", 22)
            put("condition", "sunny")
        }.toString()
    }
}
The schema is the contract the model sees in the system prompt — keep it tight, mark required fields, and make description something the model can actually act on.
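Small models do not always honour the required list, so it can help to check it yourself at the top of execute. The helper below is a sketch, not part of llm-agent: missingRequired is a hypothetical name, and it takes a plain Map instead of JsonObject so that it has no dependencies.

```kotlin
// Returns the names in `required` that are absent (or null) in `args`.
fun missingRequired(required: List<String>, args: Map<String, Any?>): List<String> =
    required.filter { args[it] == null }

// Turns the missing names into an error string the model can read, or null if all are present.
fun requiredError(required: List<String>, args: Map<String, Any?>): String? {
    val missing = missingRequired(required, args)
    return if (missing.isEmpty()) null else "Error: missing ${missing.joinToString()}"
}
```

Returning the error as the tool result, rather than throwing, gives the model a chance to retry with corrected arguments in the next round.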
Step 4 — Wire ChatSession + AgentLoop
import sk.ainet.apps.kllama.chat.*
val chat = ChatSession(
    runtime = session.runtime,
    tokenizer = session.tokenizer,
    // family="llama" auto-resolves to Llama3ToolCallingSupport with the
    // bare-JSON format Llama 3.2 was fine-tuned on. Override only if you
    // know you need FUNCTION_TAG (see "The two formats" below).
    metadata = ModelMetadata(family = "llama", architecture = "llama"),
)

val tools = ToolRegistry().apply {
    register(WeatherTool())
}

val loop = chat.createAgentLoop(
    toolRegistry = tools,
    maxTokens = 256,
    temperature = 0.7f,
)

val messages = mutableListOf(
    ChatMessage(
        role = ChatRole.SYSTEM,
        content = "You are a helpful assistant with access to tools. " +
            "Always call get_weather when asked about weather — never guess."
    ),
    ChatMessage(role = ChatRole.USER, content = "What's the weather in Bratislava?"),
)

val finalAnswer = loop.runWithEncoder(
    messages = messages,
    encode = { chat.encode(it) },
)
println(finalAnswer)
The loop renders the chat template with your tools embedded, generates until EOS, parses the assistant’s reply for a tool call, executes the tool, appends the result to messages, and re-runs — up to AgentConfig.maxToolRounds (default 5).
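The control flow described above can be sketched in a few lines. This is an illustration of the loop's shape only, not the AgentLoop implementation: Call, runLoop, and the function-typed parameters are hypothetical stand-ins, messages are plain strings, and returning the last reply once the round budget is spent is a choice made for this sketch.

```kotlin
// Hypothetical stand-in for a parsed tool call.
data class Call(val name: String, val args: String)

fun runLoop(
    generate: (List<String>) -> String,      // stand-in for render + generate until EOS
    parse: (String) -> Call?,                // stand-in for the tool-call parser
    tools: Map<String, (String) -> String>,  // tool name -> executor
    messages: MutableList<String>,
    maxToolRounds: Int = 5,
): String {
    var reply = generate(messages)
    repeat(maxToolRounds) {
        // No tool call in the reply: treat it as the final answer.
        val call = parse(reply) ?: return reply
        val result = tools[call.name]?.invoke(call.args)
            ?: "Error: unknown tool '${call.name}'"
        // Feed both the call and its result back so the next round can use them.
        messages += "assistant: $reply"
        messages += "tool: $result"
        reply = generate(messages)
    }
    return reply
}
```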
Step 5 — Observe what the model sees and emits
Pass an AgentListener to log prompts, raw responses, tool invocations, and results. This is the same listener ToolCallingDemo uses for the CLI output above.
val listener = object : AgentListener {
    override fun onToken(token: String) { print(token) }

    override fun onAssistantMessage(text: String) {
        println("\n[raw assistant] $text")
    }

    override fun onToolCalls(calls: List<ToolCall>) {
        for (c in calls) println("[tool call] ${c.name}(${c.arguments})")
    }

    override fun onToolResult(call: ToolCall, result: String) {
        println("[tool result] ${call.name} -> $result")
    }

    override fun onToolCallValidationFailed(call: ToolCall, reason: String) {
        println("[tool call invalid] ${call.name}: $reason")
    }

    override fun onComplete(finalResponse: String) {}
}

loop.runWithEncoder(messages, encode = { chat.encode(it) }, listener = listener)
To see the prompt the model receives at the start of each round (not just the response), render the template yourself before calling the loop:
val rendered = chat.chatTemplate.apply(
    messages = messages,
    tools = tools.definitions(),
    addGenerationPrompt = true,
)
println("[prompt] (${rendered.length} chars)\n$rendered")
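Llama 3.2 1B sometimes wraps its tool-call JSON in a markdown code fence (```json ... ```). The parser accepts the fenced form, so a fence in the raw assistant text does not by itself mean the call was missed.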
Verify it’s working
You should see exactly this sequence in your listener output for the weather example:
- onToken fires repeatedly as the model generates {"name": "get_weather", "parameters": {"city": "Bratislava"}}.
- onAssistantMessage fires once with that full text.
- onToolCalls fires with [ToolCall(name="get_weather", arguments={"city":"Bratislava"})].
- onToolResult fires with your stub’s JSON response.
- The loop spins again: the model now sees the tool result in its context and produces a natural-language answer.
- onComplete fires with the final user-facing answer.
If onToolCalls never fires and onComplete returns the raw JSON instead, the model emitted a call but the parser missed it — file an issue with the [raw assistant] text. The bare-JSON parser handles <|python_tag|> prefixes, code fences, and trailing prose, but novel surface forms slip through.
The two formats
Llama 3.x supports two response shapes for custom tool calls. They are not auto-negotiated between the model and the harness — the system prompt declares which one the model should emit, and the parser must be told to look for the same one. Llama3ChatTemplate and Llama3ToolCallingSupport take a single Llama3ToolFormat that both sides share.
Llama3ToolFormat.JSON (default)
What Llama 3.2 1B / 3B was fine-tuned on for custom tool calling. Meta documents this in llama-models/models/llama3_2/text_prompt_format.md.
Model emits a single JSON object on one line (no surrounding prose):
{"name": "list_files", "parameters": {"path": "/tmp"}}
System prompt the template builds:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant with tool calling capabilities.
When you receive a tool call response, use the output to format an answer to the original user question.
You have access to the following functions:
{"name":"list_files","description":"...","parameters":{...}}
If you choose to call a function, your reply MUST be a single JSON object on one line in the following format and nothing else:
{"name": <function-name>, "parameters": <arguments-object>}
Do not write the function definition. Do not include any prose. Do not use variables.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What files are in /tmp?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Parser (Llama31ToolCallParserStrategy) accepts:
- The Meta-documented "parameters" key, or "arguments" (Hermes-style alias).
- A leading <|python_tag|> marker (used by Llama 3.2’s built-in tools; tolerated here too).
- A surrounding markdown code fence (```json / ```): Llama 3.2 1B occasionally fences its JSON despite the system-prompt instruction.
- Trailing prose after the JSON object (small models often append "I hope that helps!").
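One way to get this kind of tolerance is to ignore everything around the first balanced JSON object. The function below is a sketch of that idea under our own assumptions; extractFirstJsonObject is a hypothetical name, and Llama31ToolCallParserStrategy may well be implemented differently.

```kotlin
// Returns the first balanced {...} object found in `raw`, or null if there is none.
// Anything before it (python_tag marker, code fence) or after it (prose) is ignored.
fun extractFirstJsonObject(raw: String): String? {
    val start = raw.indexOf('{')
    if (start < 0) return null
    var depth = 0
    var inString = false
    var escaped = false
    for (i in start until raw.length) {
        val c = raw[i]
        when {
            escaped -> escaped = false                 // character after a backslash
            inString && c == '\\' -> escaped = true
            c == '"' -> inString = !inString
            inString -> {}                             // braces inside strings do not count
            c == '{' -> depth++
            c == '}' -> if (--depth == 0) return raw.substring(start, i + 1)
        }
    }
    return null // braces never balanced
}
```

The returned substring still has to go through a real JSON parser; this step only isolates the candidate object.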
Llama3ToolFormat.FUNCTION_TAG (Llama 3.1 legacy)
Tag-wrapped JSON. Documented in early Llama 3.1 prompt-format material. Llama 3.2 will follow this format if asked, but Meta no longer recommends it for custom tools on 3.2.
Model emits:
<function=list_files>{"path": "/tmp"}</function>
System prompt the template builds:
...
If you choose to call a function, ONLY reply in the format below and nothing else:
<function=function_name>{"arg_name": "arg_value"}</function>
Function calls MUST be on a single line. Required parameters MUST be specified.
...
Parser: Llama3FunctionTagParserStrategy. Multiple <function=…> blocks in a single response are extracted in order.
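For intuition, in-order extraction of tag blocks can be done with a single regular expression. This is a sketch only: TagCall and parseFunctionTags are hypothetical names, the pattern assumes each call sits on one line as the prompt requires, and the real Llama3FunctionTagParserStrategy may differ.

```kotlin
// Hypothetical result type: the tool name plus its raw JSON arguments.
data class TagCall(val name: String, val argumentsJson: String)

// Matches <function=NAME>ARGS</function>; the lazy group stops at the first closing tag.
val functionTag = Regex("""<function=([A-Za-z0-9_]+)>(.*?)</function>""")

// findAll yields matches left to right, so calls keep their original order.
fun parseFunctionTags(output: String): List<TagCall> =
    functionTag.findAll(output)
        .map { TagCall(it.groupValues[1], it.groupValues[2]) }
        .toList()
```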
Picking a format programmatically
val support = Llama3ToolCallingSupport(format = Llama3ToolFormat.JSON)
val template = support.createChatTemplate() // Llama3ChatTemplate(JSON)
val calls = support.parseToolCalls(modelOutput) // tries Hermes → function-tag → JSON
ToolCallParser.parse tries every registered strategy and returns the first non-empty hit. The three default strategies are disjoint by surface form, so you never get a double-parse:
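| Surface form | Strategy |
|---|---|
| <tool_call>{…}</tool_call> | Hermes strategy |
| <function=name>{…}</function> | Llama3FunctionTagParserStrategy |
| Bare {"name": …, "parameters": …} | Llama31ToolCallParserStrategy |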
That means you can safely select either Llama 3 format on the prompt side without touching the parser registration — the parser will pick up whichever the model actually emits.
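The "first non-empty hit" rule is easy to picture as a lazy chain. The sketch below models a strategy as a plain function from raw output to a list of extracted calls; that shape and the name parseWithStrategies are assumptions for illustration, not the ToolCallParser API.

```kotlin
// Tries each strategy in order and returns the first non-empty result.
// asSequence keeps it lazy: strategies after the first hit are never run.
fun parseWithStrategies(
    output: String,
    strategies: List<(String) -> List<String>>,
): List<String> =
    strategies.asSequence()
        .map { strategy -> strategy(output) }
        .firstOrNull { it.isNotEmpty() }
        ?: emptyList()
```

Because the surface forms are disjoint, the order of strategies only affects how much work is done, not which calls are returned.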
Why two formats exist
- Llama 3.1 shipped with the <function=…>…</function> tag form in the early prompt-format docs. Meta later updated the docs to also show the bare-JSON format alongside it.
- Llama 3.2 released in late 2024 with built-in tools (brave_search, wolfram_alpha, code_interpreter) that use the <|python_tag|>-prefixed bare-JSON format; for custom tools the docs canonicalise plain bare JSON with "parameters". The 1B and 3B Instruct variants are fine-tuned for this format.
So: if you’re running Llama 3.2, default JSON is the trained-on format and gives the best chance of a clean call. If you’re running an older Llama 3.1 prompt or you have prompt material specifically calling for the tag form, switch to FUNCTION_TAG.
Model-size caveat
Llama 3.2 1B is a small model. Even with the correct format and prompt it can:
- Echo back the tool schema instead of producing a call (mitigate by adding a few-shot example to the system prompt).
- Hallucinate a tool result directly without calling the tool.
- Append commentary after the JSON (the parser handles this).
3B is meaningfully better; 8B (Llama 3.1) is the sweet spot for tool calling on commodity hardware. Drop the temperature to 0.0 for deterministic tool-call generation.
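A few-shot example for the schema-echo case can be as simple as one worked exchange appended to your system message. The helper below is a sketch: withFewShot is a hypothetical name, and the example wording is ours, so adjust it to your own tools.

```kotlin
// Appends one worked tool call to the system prompt so a small model
// sees the expected output shape instead of only the schema.
fun withFewShot(basePrompt: String): String = buildString {
    appendLine(basePrompt)
    appendLine("Example:")
    appendLine("User: What's the weather in Vienna?")
    append("""Assistant: {"name": "get_weather", "parameters": {"city": "Vienna"}}""")
}
```

Pass the result as the content of the SYSTEM message in place of the plain prompt.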
Related files
- llm-agent/…/chat/Llama3ChatTemplate.kt: prompt builder.
- llm-agent/…/chat/Llama3ToolFormat.kt: format enum.
- llm-agent/…/chat/ToolCallParser.kt: both Llama 3 parser strategies + Hermes.
- llm-agent/…/chat/ToolCallingSupport.kt: Llama3ToolCallingSupport pulls everything together.
- llm-runtime/kllama/…/cli/ToolCallingDemo.kt: the --demo runner.