Pipeline Design Decisions
The Problem
Early SKaiNET had a monolithic approach: each model family (LLaMA, Gemma, Apertus) had its own hand-coded runtime that handled everything — weight loading, forward pass, KV cache, tokenization, and generation. This led to:
- Duplicated logic: each runtime reimplemented generate(), sample(), and forward().
- Tight coupling: tool calling only worked with kllama because ToolCallingDemo depended on GGUFTokenizer, a kllama-specific class.
- No optimization: hand-coded runtimes couldn’t benefit from compute graph optimization passes.
The Solution: Separated Pipeline Stages
The pipeline is split into stages that are independently replaceable:
- Weight Loading: Parse GGUF/SafeTensors into typed tensor maps. Model-format concern, not architecture concern.
- Network Definition: Pure functions (llamaNetwork(), apertusNetwork()) that return a Module tree. Architecture concern only.
- Graph Compilation: Trace the module tree into a DAG, apply optimization passes. Framework concern.
- Inference Runtime: forward(tokenId) and generate(). Pure inference, no I/O.
- Tokenization: encode() / decode(). Completely independent of model architecture.
- Chat Pipeline: ChatSession, AgentLoop, ChatTemplate. Independent of both model and tokenizer implementation.
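The first three stages can be sketched as small interfaces that are wired together only at the call site, which is what makes each one replaceable on its own. This is a minimal illustration, not SKaiNET's actual API: TensorMap, Node, WeightLoader, NetworkDefinition, GraphCompiler, and buildPlan are hypothetical names, and the "compiler" here only flattens a tree into execution order rather than tracing a real DAG or running optimization passes.

```kotlin
// Hypothetical stand-ins for the stage boundaries; none of these
// signatures are taken from SKaiNET itself.
typealias TensorMap = Map<String, FloatArray>

interface Module {
    val name: String
    val children: List<Module>
}

data class Node(
    override val name: String,
    override val children: List<Module> = emptyList(),
) : Module

// Stage 1: model-format concern.
fun interface WeightLoader { fun load(source: String): TensorMap }

// Stage 2: architecture concern. A pure function from weights to a module tree.
fun interface NetworkDefinition { fun build(weights: TensorMap): Module }

// Stage 3: framework concern. Turns the module tree into an execution plan.
fun interface GraphCompiler { fun compile(root: Module): List<String> }

// Toy compiler: post-order walk, so children are listed before their parent.
val postOrderCompiler = GraphCompiler { root ->
    fun visit(m: Module): List<String> = m.children.flatMap { visit(it) } + m.name
    visit(root)
}

// Wiring happens in one place; any stage can be swapped without touching the others.
fun buildPlan(
    source: String,
    loader: WeightLoader,
    network: NetworkDefinition,
    compiler: GraphCompiler,
): List<String> = compiler.compile(network.build(loader.load(source)))

// Fakes standing in for a GGUF loader and a real architecture.
val fakeLoader = WeightLoader { mapOf("attn" to FloatArray(4), "mlp" to FloatArray(4)) }
val toyNetwork = NetworkDefinition { w -> Node("toy", w.keys.sorted().map { Node(it) }) }
```

Calling buildPlan("model.gguf", fakeLoader, toyNetwork, postOrderCompiler) runs all three stages. Swapping fakeLoader for a SafeTensors-backed loader would not require changes to the network or compiler.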
Key Design Decisions
Tokenizer Interface with Metadata
The Tokenizer interface includes eosTokenId, bosTokenId, and vocabSize.
This eliminated the need for the GGUFTokenizer downcast that previously coupled tool calling to kllama.
Any tokenizer implementation works with ChatSession and AgentLoop.
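A minimal sketch of what such an interface might look like follows. The three metadata properties and the encode/decode names come from the text above; the parameter and return types are assumptions, and ByteTokenizer and decodeUntilEos are hypothetical helpers added only for illustration.

```kotlin
// Assumed shape of the interface; the exact types in SKaiNET may differ.
interface Tokenizer {
    val eosTokenId: Int
    val bosTokenId: Int
    val vocabSize: Int
    fun encode(text: String): List<Int>
    fun decode(tokens: List<Int>): String
}

// Hypothetical toy implementation: ids 0..255 are raw UTF-8 bytes,
// 256 is BOS and 257 is EOS.
class ByteTokenizer : Tokenizer {
    override val bosTokenId = 256
    override val eosTokenId = 257
    override val vocabSize = 258

    override fun encode(text: String): List<Int> =
        text.encodeToByteArray().map { it.toInt() and 0xFF }

    override fun decode(tokens: List<Int>): String =
        tokens.filter { it < 256 }.map { it.toByte() }.toByteArray().decodeToString()
}

// A caller that needs the stop token reads it from the interface,
// so no downcast to a concrete tokenizer class is required.
fun decodeUntilEos(tokenizer: Tokenizer, generated: List<Int>): String =
    tokenizer.decode(generated.takeWhile { it != tokenizer.eosTokenId })
```

Because decodeUntilEos depends only on the interface, the same code path would serve a GGUF-backed tokenizer or any other implementation.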
ChatSession as the Composition Root
Rather than having each CLI wire up InferenceRuntime + Tokenizer + ChatTemplate + ToolRegistry individually, ChatSession bundles them.
A runner creates one ChatSession and gets chat, agent, and demo modes for free.
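The composition-root idea can be sketched as a class that receives all four collaborators through its constructor. The component names match the text above, but every signature below is an assumption made for illustration. Tokenizer is trimmed to the two methods used here, and ToolRegistry is reduced to a plain name-to-function map.

```kotlin
// Hypothetical, simplified collaborators; not SKaiNET's real signatures.
fun interface InferenceRuntime {
    fun generate(prompt: List<Int>, maxTokens: Int): List<Int>
}

interface Tokenizer {
    fun encode(text: String): List<Int>
    fun decode(tokens: List<Int>): String
}

fun interface ChatTemplate {
    fun render(userMessage: String): String
}

class ToolRegistry(val tools: Map<String, (String) -> String> = emptyMap())

// The composition root: a runner builds this once and reuses it
// for every mode instead of wiring the parts together per CLI.
class ChatSession(
    private val runtime: InferenceRuntime,
    private val tokenizer: Tokenizer,
    private val template: ChatTemplate,
    private val tools: ToolRegistry = ToolRegistry(),
) {
    // Chat mode: template -> tokens -> generation -> text.
    fun chat(userMessage: String, maxTokens: Int = 64): String {
        val prompt = tokenizer.encode(template.render(userMessage))
        return tokenizer.decode(runtime.generate(prompt, maxTokens))
    }

    // Building block for an agent loop: look up a tool by name and run it.
    // Returns null when no tool with that name is registered.
    fun callTool(name: String, argument: String): String? =
        tools.tools[name]?.invoke(argument)
}
```

Since ChatSession only sees interfaces, swapping the runtime or tokenizer is a change at the construction site and nowhere else.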
ModelRegistry for Auto-Detection
Instead of if/else chains in each CLI to determine which loader to use, ModelRegistry.detect(architecture) returns a ModelFamily enum with capabilities (tool calling support, chat template family).
The unified skainet CLI uses this to load any GGUF model without architecture-specific flags.
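A rough sketch of the lookup follows, assuming the architecture string is read from the model file's metadata (GGUF stores one under the general.architecture key). ModelRegistry.detect and ModelFamily are named in the text; the property names, the architecture strings, the capability values, and the choice to throw on unknown input are all illustrative assumptions rather than SKaiNET's actual behavior.

```kotlin
// Capability values below are placeholders for illustration only.
enum class ModelFamily(
    val supportsToolCalling: Boolean,
    val chatTemplateFamily: String,
) {
    LLAMA(supportsToolCalling = true, chatTemplateFamily = "llama"),
    GEMMA(supportsToolCalling = false, chatTemplateFamily = "gemma"),
    APERTUS(supportsToolCalling = false, chatTemplateFamily = "apertus"),
}

object ModelRegistry {
    // One table replaces the if/else chain that each CLI used to carry.
    private val byArchitecture = mapOf(
        "llama" to ModelFamily.LLAMA,
        "gemma" to ModelFamily.GEMMA,
        "apertus" to ModelFamily.APERTUS,
    )

    // Assumption: unknown architectures are an error. A nullable return
    // type would be an equally reasonable design.
    fun detect(architecture: String): ModelFamily =
        byArchitecture[architecture.trim().lowercase()]
            ?: throw IllegalArgumentException("Unsupported architecture: $architecture")
}
```

A CLI can then branch on capabilities such as detect(arch).supportsToolCalling instead of on model names, so adding a new family means adding one enum entry and one table row.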