Use the Unified CLI
The skainet CLI auto-detects model architecture from GGUF metadata, so you don’t need to pick the right runner. One binary handles every supported family.
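To see the kind of information that detection can draw on, the sketch below peeks at a GGUF header from the shell. It relies on two facts about the GGUF format: files start with the four magic bytes `GGUF`, and the architecture is stored as a string under the `general.architecture` metadata key. The helper names (`byte_at`, `is_gguf`, `gguf_arch`) are hypothetical. The 64 KiB search window, the GGUF v2+ layout (64-bit string lengths), and the sub-256-byte value are assumptions. It needs GNU `grep`, and nothing here is taken from skainet's own loader.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a GGUF file; not part of skainet.

# Prints the unsigned value of the single byte at OFFSET in FILE.
byte_at() {
  dd if="$1" bs=1 skip="$2" count=1 2>/dev/null | od -An -tu1 | tr -d ' \n'
}

# Succeeds if FILE starts with the GGUF magic bytes ("GGUF" = 47 47 55 46).
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "47475546" ]
}

# Prints the value of general.architecture.
# Assumes the key sits in the first 64 KiB and its string value is < 256 bytes.
gguf_arch() {
  local file=$1 key="general.architecture" off val len
  is_gguf "$file" || return 1
  off=$(head -c 65536 "$file" | LC_ALL=C grep -abo -F -m1 "$key" | head -n1 | cut -d: -f1)
  [ -n "$off" ] || return 1
  # Layout after the key: uint32 value type, uint64 length, then the bytes.
  val=$((off + ${#key}))
  [ "$(byte_at "$file" "$val")" = "8" ] || return 1   # 8 = string type
  len=$(byte_at "$file" $((val + 4)))                  # low byte of the length
  [ -n "$len" ] || return 1
  dd if="$file" bs=1 skip=$((val + 12)) count="$len" 2>/dev/null
  echo
}
```

Calling `gguf_arch model.gguf` before launching the CLI is a quick way to check which family a file declares. This is a byte-level heuristic rather than a full metadata parser.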
Tool Calling Demo
Interactive:
./gradlew :llm-apps:skainet-cli:run \
--args="-m model.gguf --demo"
Single-shot (for scripts/testing):
./gradlew :llm-apps:skainet-cli:run \
--args="-m model.gguf --demo 'What is 2 + 2?'"
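For scripted use, the single-shot form can be wrapped so that several prompts run back to back. This is a minimal sketch: `demo_args` and `run_demo_prompts` are hypothetical names, and the wrapper assumes the model path contains no spaces. It rejects prompts that contain a single quote rather than guessing at an escaping rule, because the prompt is wrapped in single quotes inside `--args`.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the single-shot demo invocation; not part of skainet.

# Builds the value passed to --args for one single-shot demo prompt.
demo_args() {
  local model=$1 prompt=$2
  case $prompt in
    *"'"*) echo "prompt must not contain single quotes" >&2; return 1 ;;
  esac
  printf "%s" "-m $model --demo '$prompt'"
}

# Runs each prompt as its own single-shot demo, stopping at the first failure.
run_demo_prompts() {
  local model=$1 prompt args
  shift
  for prompt in "$@"; do
    args=$(demo_args "$model" "$prompt") || return 1
    ./gradlew :llm-apps:skainet-cli:run --args="$args" || return 1
  done
}
```

With both functions sourced, `run_demo_prompts model.gguf 'What is 2 + 2?' 'What is 17 * 23?'` issues two separate demo runs from the repository root.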
Cross-Architecture Examples
The same skainet invocation works across families — the CLI resolves the right loader, tokenizer, and chat template from GGUF metadata.
TinyLlama (Llama family)
./gradlew :llm-apps:skainet-cli:run \
--args="-m tinyllama-1.1b-chat-v1.0.Q8_0.gguf 'The capital of France is'"
Detected as architecture=llama; uses Llama3ChatTemplate for chat, agent, and demo modes.
Qwen 3 (Llama tensor layout, ChatML template)
./gradlew :llm-apps:skainet-cli:run \
--args="-m Qwen3-1.7B-Q8_0.gguf --demo 'What is 17 * 23?'"
Auto-resolves to the ChatML chat template and the Hermes-style <tool_call> parser. Override the template explicitly with --template=chatml if needed.
Llama 3.2 (custom-tools JSON format)
./gradlew :llm-apps:skainet-cli:run \
--args="-m Llama-3.2-1B-Instruct-Q8_0.gguf --demo --template=llama3 -k 0.0 'What files are in /tmp?'"
See Llama 3 / 3.1 / 3.2 Tool Calling for the format details and Meta’s two response shapes.
Gemma 4
./gradlew :llm-apps:skainet-cli:run \
--args="-m gemma-4-E2B-it-Q4_K_M.gguf 'The capital of France is'"
Tool-call emission has a known gap on Gemma 4 E2B: basic generation works, but agent and demo modes do not yet produce parseable tool-call markup. Track it via the gemma4_toolcall_status follow-up.
All Options
skainet -m <model.gguf> [options] [prompt]
Options:
-m, --model Path to .gguf model (required)
-s, --steps Generation steps (default: 64)
-k, --temperature Sampling temperature (default: 0.8)
--chat Interactive chat mode
--agent Interactive agent with tool calling
--demo Tool calling demo (add prompt for single-shot)
--template=NAME Chat template override: llama3, chatml, qwen, gemma
--context=N Cap context length to N tokens
-h, --help Show help
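The options above can be combined in one place with a small helper. The sketch below is illustrative only: `skainet_args` is a hypothetical function, the `STEPS` and `TEMPERATURE` environment variables are assumptions introduced here, and the flag names, defaults, and template names are copied from the option list.

```shell
#!/usr/bin/env bash
# Hypothetical helper; flags and defaults mirror the documented option list.

# Prints an --args string for a one-off prompt.
# Usage: skainet_args MODEL PROMPT [TEMPLATE]
# STEPS and TEMPERATURE may be set in the environment to override the defaults.
skainet_args() {
  local model=$1 prompt=$2 template=${3:-}
  local args="-m $model -s ${STEPS:-64} -k ${TEMPERATURE:-0.8}"
  case $template in
    "") ;;  # no override: leave template selection to the CLI
    llama3|chatml|qwen|gemma) args="$args --template=$template" ;;
    *) echo "unknown template: $template" >&2; return 1 ;;
  esac
  printf "%s" "$args '$prompt'"
}
```

For example, `./gradlew :llm-apps:skainet-cli:run --args="$(STEPS=128 skainet_args model.gguf 'Hello' chatml)"` requests 128 generation steps with the ChatML template forced.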
Model-Specific CLIs
Three model-specific CLIs remain alongside the unified one:
| CLI | Gradle Task | When to use |
|---|---|---|
| | | Llama-family advanced flags (custom RoPE, attention backend overrides) and the legacy |
| | | Gemma-specific DSL runtime flags (PLE diagnostics, |
| | | BERT embeddings (different output shape; not handled by |
The previously available kqwen, kapertus, and kvoxtral CLIs were removed. Qwen runs through skainet or kllama (same tensor layout); the Apertus and Voxtral runtimes remain as libraries but no longer ship a standalone CLI.