# Architecture Overview

## Module Structure
```
llm-core/            Core abstractions (Tokenizer, InferenceRuntime, ModelRegistry)
llm-agent/           Chat templates, tool calling, AgentLoop, ChatSession
llm-inference/
  llama/             LLaMA/Qwen network definition and weight loading
  apertus/           Apertus network definition and weight loading
  gemma/             Gemma runtime and weight loading
  bert/              BERT network definition
  voxtral/           Voxtral TTS runtimes
llm-runtime/
  kllama/            LLaMA/Qwen CPU runtime, attention backend, tokenizer
  kqwen/             Qwen-specific runner CLI
  kgemma/            Gemma runner CLI
  kapertus/          Apertus runner CLI
llm-apps/
  skainet-cli/       Unified CLI (auto-detects architecture)
  kllama-cli/        LLaMA-specific CLI
  kapertus-cli/      Apertus-specific CLI
  kbert-cli/         BERT CLI
  kvoxtral-cli/      Voxtral TTS CLI
llm-performance/     Benchmarking module
```
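Assuming the repository is a standard Gradle multi-module build (the build setup is not shown here, so treat this as a hypothetical sketch), the nested layout above would map to `settings.gradle.kts` includes along these lines:

```kotlin
// settings.gradle.kts -- hypothetical sketch; the real build files may differ.
// Nested directories become colon-separated Gradle project paths.
include(
    ":llm-core",
    ":llm-agent",
    ":llm-inference:llama",
    ":llm-inference:apertus",
    ":llm-inference:gemma",
    ":llm-inference:bert",
    ":llm-inference:voxtral",
    ":llm-runtime:kllama",
    ":llm-runtime:kqwen",
    ":llm-runtime:kgemma",
    ":llm-runtime:kapertus",
    ":llm-apps:skainet-cli",
    ":llm-apps:kllama-cli",
    ":llm-apps:kapertus-cli",
    ":llm-apps:kbert-cli",
    ":llm-apps:kvoxtral-cli",
    ":llm-performance",
)
```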
## Key Interfaces
- `InferenceRuntime<T>` - Minimal inference contract: `forward(tokenId): Tensor` and `reset()`. All model runtimes implement this.
- `Tokenizer` - Encodes/decodes text and exposes `eosTokenId`, `bosTokenId`, and `vocabSize`. Implementations: `GGUFTokenizer`, `HuggingFaceBPETokenizer`, `TekkenTokenizerAdapter`.
- `ChatTemplate` - Formats conversation messages into prompt strings and parses tool calls from model output. Implementations: `QwenChatTemplate`, `Llama3ChatTemplate`, `GemmaChatTemplate`, `ChatMLTemplate`.
- `DecoderRuntime<T>` - Template-method base class for decoder-only transformers. Provides shared `forward()`, `generate()`, and `sample()` logic.
- `AttentionBackend<T>` - Pluggable attention computation with KV cache. Implementations: `CpuAttentionBackend`, `GpuAttentionBackend`.
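To make the runtime contracts concrete, here is a hedged Kotlin sketch of how `InferenceRuntime<T>` and the template-method `DecoderRuntime<T>` fit together. The names `InferenceRuntime`, `DecoderRuntime`, `forward()`, `reset()`, `generate()`, and `sample()` come from the list above, but the exact signatures, the greedy sampling, and the `Tensor` stand-in are assumptions for illustration, not the real `llm-core` API:

```kotlin
// Stand-in for the real Tensor type: just a logits vector here.
class Tensor(val values: FloatArray)

// Minimal inference contract: one decode step per token, plus reset().
interface InferenceRuntime<T> {
    fun forward(tokenId: Int): Tensor  // returns next-token logits
    fun reset()                        // clear KV cache / position state
}

// Template-method base class: subclasses provide the network's forward
// pass; the shared generate()/sample() loop lives here.
abstract class DecoderRuntime<T> : InferenceRuntime<T> {

    // Greedy sampling for illustration (argmax over logits).
    protected open fun sample(logits: Tensor): Int =
        logits.values.indices.maxByOrNull { logits.values[it] } ?: 0

    fun generate(prompt: List<Int>, maxTokens: Int, eosTokenId: Int): List<Int> {
        reset()
        var logits = Tensor(FloatArray(0))
        for (t in prompt) logits = forward(t)      // prefill the prompt
        val out = prompt.toMutableList()
        repeat(maxTokens) {
            val next = sample(logits)
            if (next == eosTokenId) return out     // stop on end-of-sequence
            out += next
            logits = forward(next)                 // one token per step
        }
        return out
    }
}

// Toy runtime whose logits always favor (lastToken + 1) % vocab, so
// generation counts upward -- just enough to exercise the shared loop.
class ToyRuntime(private val vocab: Int) : DecoderRuntime<Unit>() {
    override fun forward(tokenId: Int): Tensor {
        val v = FloatArray(vocab)
        v[(tokenId + 1) % vocab] = 1f
        return Tensor(v)
    }
    override fun reset() {}
}
```

With this shape, a model-specific runtime (e.g. the LLaMA/Qwen CPU runtime in `kllama`) only needs to supply `forward()` and `reset()`; the decode loop is inherited.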