Use the Unified CLI

The skainet CLI auto-detects model architecture from GGUF metadata, so you don’t need to pick the right runner. One binary handles every supported family.
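Because detection depends on GGUF metadata, it can help to confirm that a file is actually GGUF before starting a slow Gradle run. The sketch below checks only the 4-byte "GGUF" magic at the start of the file. It does not read the architecture field, and is_gguf is a hypothetical helper, not part of the skainet CLI.

```shell
#!/bin/sh
# Pre-flight check: GGUF files begin with the 4-byte ASCII magic "GGUF".
# is_gguf is a hypothetical helper, not something the skainet CLI provides.
is_gguf() {
  # Reject anything that is not a regular file.
  [ -f "$1" ] || return 1
  # Read only the first 4 bytes and compare them to the magic string.
  magic=$(head -c 4 "$1" 2>/dev/null)
  [ "$magic" = "GGUF" ]
}

# Usage: only launch the CLI when the file looks like GGUF.
# if is_gguf model.gguf; then
#   ./gradlew :llm-apps:skainet-cli:run --args="-m model.gguf 'Your prompt here'"
# fi
```

A file that passes this check can still use an architecture the CLI does not support. The check only filters out wrong paths and non-GGUF files early.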

Text Generation

./gradlew :llm-apps:skainet-cli:run \
  --args="-m model.gguf 'Your prompt here'"

Chat Mode

./gradlew :llm-apps:skainet-cli:run \
  --args="-m model.gguf --chat"

Agent Mode (with Tool Calling)

./gradlew :llm-apps:skainet-cli:run \
  --args="-m model.gguf --agent"

Tool Calling Demo

Interactive:

./gradlew :llm-apps:skainet-cli:run \
  --args="-m model.gguf --demo"

Single-shot (for scripts/testing):

./gradlew :llm-apps:skainet-cli:run \
  --args="-m model.gguf --demo 'What is 2 + 2?'"
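For scripted use, the single-shot form can be wrapped in a small function. This is a minimal sketch: run_demo and the SKAINET_DRY_RUN variable are hypothetical names introduced here, and the wrapper rejects prompts containing a single quote because escaping inside --args is not covered by this page.

```shell
#!/bin/sh
# run_demo is a hypothetical wrapper around the documented single-shot call.
# Set SKAINET_DRY_RUN=1 (an assumed variable, not a CLI feature) to print
# the command instead of launching Gradle.
run_demo() {
  model=$1
  prompt=$2
  # The prompt is wrapped in single quotes inside --args, so a prompt that
  # itself contains a single quote is refused rather than escaped.
  case $prompt in
    *\'*) echo "prompt must not contain a single quote" >&2; return 2 ;;
  esac
  if [ "${SKAINET_DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "./gradlew :llm-apps:skainet-cli:run --args=\"-m $model --demo '$prompt'\""
  else
    ./gradlew :llm-apps:skainet-cli:run --args="-m $model --demo '$prompt'"
  fi
}

# Run a few prompts one after another, stopping at the first failure.
# for p in "What is 2 + 2?" "What is 17 * 23?"; do
#   run_demo model.gguf "$p" || break
# done
```

The dry-run branch makes it possible to review the exact command line before any model is loaded.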

Cross-Architecture Examples

The same skainet invocation works across families — the CLI resolves the right loader, tokenizer, and chat template from GGUF metadata.
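Since the invocation is identical across families, one prompt can be sent to several model files in a loop. The sketch below assumes a hypothetical helper, compare_models, and a hypothetical RUNNER variable that defaults to the Gradle wrapper. Neither is part of the project.

```shell
#!/bin/sh
# compare_models is a hypothetical helper: it sends one prompt to every model
# file given, skipping paths that do not exist. RUNNER defaults to the Gradle
# wrapper; overriding it (for example RUNNER=echo) is an assumption made here
# so the loop can be previewed without loading a model.
compare_models() {
  prompt=$1
  shift
  for model in "$@"; do
    if [ ! -f "$model" ]; then
      echo "skip: $model (not found)" >&2
      continue
    fi
    # Label each run so the outputs can be told apart.
    echo "== $model"
    "${RUNNER:-./gradlew}" :llm-apps:skainet-cli:run --args="-m $model '$prompt'"
  done
}

# Usage:
# compare_models 'The capital of France is' \
#   tinyllama-1.1b-chat-v1.0.Q8_0.gguf Qwen3-1.7B-Q8_0.gguf
```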

TinyLlama (Llama family)

./gradlew :llm-apps:skainet-cli:run \
  --args="-m tinyllama-1.1b-chat-v1.0.Q8_0.gguf 'The capital of France is'"

The model is detected as architecture=llama and uses Llama3ChatTemplate for the chat, agent, and demo modes.

Qwen 3 (Llama tensor layout, ChatML template)

./gradlew :llm-apps:skainet-cli:run \
  --args="-m Qwen3-1.7B-Q8_0.gguf --demo 'What is 17 * 23?'"

The CLI auto-resolves to the ChatML chat template and the Hermes-style <tool_call> parser. To override the template explicitly, pass --template=chatml.

Llama 3.2 (custom-tools JSON format)

./gradlew :llm-apps:skainet-cli:run \
  --args="-m Llama-3.2-1B-Instruct-Q8_0.gguf --demo --template=llama3 -k 0.0 'What files are in /tmp?'"

See Llama 3 / 3.1 / 3.2 Tool Calling for the format details and Meta’s two response shapes.

Gemma 4

./gradlew :llm-apps:skainet-cli:run \
  --args="-m gemma-4-E2B-it-Q4_K_M.gguf 'The capital of France is'"

Tool-call format emission has a known gap on Gemma 4 E2B: basic generation works, but the agent and demo modes do not yet produce parseable tool-call markup. Track this via the gemma4_toolcall_status follow-up.

BERT (embeddings)

BERT is exposed through the dedicated kbert CLI (different surface — embeddings, not generation):

./gradlew :llm-apps:kbert-cli:run \
  --args="model.gguf 'sentence to embed'"

All Options

skainet -m <model.gguf> [options] [prompt]

Options:
  -m, --model       Path to .gguf model (required)
  -s, --steps       Generation steps (default: 64)
  -k, --temperature Sampling temperature (default: 0.8)
  --chat            Interactive chat mode
  --agent           Interactive agent with tool calling
  --demo            Tool calling demo (add prompt for single-shot)
  --template=NAME   Chat template override: llama3, chatml, qwen, gemma
  --context=N       Cap context length to N tokens
  -h, --help        Show help
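The options above can be combined in a single --args string. The sketch below assembles that string from environment variables. skainet_args and the variable names STEPS, TEMPERATURE, CONTEXT, and TEMPLATE are hypothetical, and it assumes -s takes a space-separated value the same way -k does in the Llama 3.2 example.

```shell
#!/bin/sh
# skainet_args is a hypothetical helper that assembles the --args string from
# the documented flags. Defaults mirror the option list (64 steps, 0.8).
skainet_args() {
  model=$1
  prompt=$2
  args="-m $model -s ${STEPS:-64} -k ${TEMPERATURE:-0.8}"
  # --context and --template are added only when the caller sets them.
  if [ -n "${CONTEXT:-}" ]; then
    args="$args --context=$CONTEXT"
  fi
  if [ -n "${TEMPLATE:-}" ]; then
    args="$args --template=$TEMPLATE"
  fi
  # The prompt goes last, wrapped in single quotes as in the examples.
  printf "%s '%s'\n" "$args" "$prompt"
}

# Usage:
# ./gradlew :llm-apps:skainet-cli:run \
#   --args="$(STEPS=128 CONTEXT=2048 skainet_args model.gguf 'Your prompt here')"
```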

Model-Specific CLIs

Three model-specific CLIs remain alongside the unified one:

CLI     Gradle Task                 When to use
kllama  :llm-apps:kllama-cli:run    Llama-family advanced flags (custom RoPE, attention backend overrides) and the legacy --demo runner used by Llama-3 examples.
kgemma  :llm-runtime:kgemma:jvmRun  Gemma-specific DSL runtime flags (PLE diagnostics, GEMMA4_* env knobs).
kbert   :llm-apps:kbert-cli:run     BERT embeddings (different output shape; not handled by skainet-cli).

The previously available kqwen, kapertus, and kvoxtral CLIs were removed — Qwen runs through skainet / kllama (same tensor layout); Apertus and Voxtral runtimes remain as libraries but no longer ship a standalone CLI.
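The mapping from CLI name to Gradle task can be captured in a small lookup, which is handy in scripts that need to pick a runner by name. gradle_task_for is a hypothetical helper. The task paths are the ones listed on this page.

```shell
#!/bin/sh
# gradle_task_for is a hypothetical lookup that mirrors the CLI list above.
# It prints the Gradle task for a CLI name, or fails for removed/unknown CLIs.
gradle_task_for() {
  case $1 in
    skainet) echo ":llm-apps:skainet-cli:run" ;;
    kllama)  echo ":llm-apps:kllama-cli:run" ;;
    kgemma)  echo ":llm-runtime:kgemma:jvmRun" ;;
    kbert)   echo ":llm-apps:kbert-cli:run" ;;
    kqwen|kapertus|kvoxtral)
      echo "$1: standalone CLI was removed" >&2
      return 1 ;;
    *)
      echo "$1: unknown CLI" >&2
      return 1 ;;
  esac
}

# Usage:
# ./gradlew "$(gradle_task_for kbert)" --args="model.gguf 'sentence to embed'"
```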