Inference Pipeline

Pipeline Stages

Stage Details

[1] Weight Loading

LlamaWeightLoader supports:

  • Sequential loading — entire file in memory (small models < 2GB)

  • Streaming loading — metadata-only in memory, tensors on demand (any size)

  • Quantization policies — DEQUANTIZE_TO_FP32 (dequantize weights to FP32 up front), NATIVE_OPTIMIZED (keep quantized weights, SIMD dequantization at inference time)
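The key property of streaming loading is that only tensor metadata lives in memory until a tensor is actually used. A minimal sketch of that idea, using Kotlin's `lazy` delegation (the `LazyTensor` type and the index map are illustrative, not the real `LlamaWeightLoader` API):

```kotlin
// Sketch: streaming loading keeps only tensor offsets/names in memory
// and materializes tensor data on first access.
class LazyTensor(load: () -> FloatArray) {
    val data: FloatArray by lazy(load) // loaded at most once, on demand
}

fun main() {
    var reads = 0
    val index = mapOf(
        "tok_embeddings" to LazyTensor { reads++; FloatArray(4) },
        "output_norm" to LazyTensor { reads++; FloatArray(4) },
    )
    check(reads == 0)                    // metadata only: nothing loaded yet
    index.getValue("output_norm").data   // first access triggers the read
    check(reads == 1)                    // only the accessed tensor was read
    println("on-demand reads: $reads")
}
```

Sequential loading is the degenerate case where every tensor is materialized eagerly at open time.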

[2] DSL Network Definition

Pure functions that return a Module<T, V> tree:

val model = llamaNetwork<FP32, Float>(metadata)

The DSL provides: embedding(), multiHeadAttention(), swiGluFFN(), rmsNorm(), residual().
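To make the "pure functions returning a tree" idea concrete, here is a toy version of such a DSL. The real `Module<T, V>` is generic over precision and value type; the simplified types and the block composition below are illustrative only:

```kotlin
// Each builder is a pure function returning an immutable node;
// composing them yields a Module tree, with no mutable state.
sealed interface Module { val name: String }
data class Leaf(override val name: String) : Module
data class Seq(override val name: String, val children: List<Module>) : Module

fun embedding() = Leaf("embedding")
fun rmsNorm() = Leaf("rmsNorm")
fun multiHeadAttention() = Leaf("multiHeadAttention")
fun swiGluFFN() = Leaf("swiGluFFN")
fun residual(inner: Module) = Seq("residual", listOf(inner))

// A pre-norm transformer block: residual(norm + attention), residual(norm + FFN).
fun transformerBlock() = Seq("block", listOf(
    residual(Seq("attn", listOf(rmsNorm(), multiHeadAttention()))),
    residual(Seq("ffn", listOf(rmsNorm(), swiGluFFN()))),
))

fun main() {
    val model = Seq("llama", listOf(embedding(), transformerBlock(), rmsNorm()))
    println(model.children.map { it.name })
}
```

Because the builders are pure, the same tree can be re-traced or re-compiled without side effects.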

[3-5] Compute Graph Compilation

OptimizedLLMRuntime traces the module tree into a DAG and applies optimization passes:

  • TransposeEliminationPass — fold transposes into matmul parameters

  • SharedWeightDeduplicationPass — eliminate redundant weight loads

  • LLMFusionPass — fuse RMSNorm, SwiGLU FFN, and QKV projections

  • DeadCodeEliminationPass — remove unused operations
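As a sketch of how a pass operates on the traced DAG, here is a toy dead-code elimination: keep only nodes reachable from the output. The `Node` type and graph representation are illustrative stand-ins for the real compute-graph types:

```kotlin
// A node knows its id and the ids of its inputs; the graph is a map.
data class Node(val id: String, val inputs: List<String> = emptyList())

// Walk backwards from the output and keep only reachable nodes.
fun deadCodeElimination(nodes: Map<String, Node>, output: String): Map<String, Node> {
    val live = mutableSetOf<String>()
    fun visit(id: String) {
        if (live.add(id)) nodes.getValue(id).inputs.forEach(::visit)
    }
    visit(output)
    return nodes.filterKeys { it in live }
}

fun main() {
    val graph = mapOf(
        "x" to Node("x"),
        "norm" to Node("norm", listOf("x")),
        "unused" to Node("unused", listOf("x")),  // never feeds the output
        "out" to Node("out", listOf("norm")),
    )
    println(deadCodeElimination(graph, "out").keys)  // "unused" is dropped
}
```

The fusion and deduplication passes follow the same shape: a pure graph-to-graph rewrite, so passes compose in a pipeline.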

[6] Inference Runtime

Three execution modes via OptimizedLLMMode:

  • DIRECT — module tree executes forward passes directly (useful for debugging)

  • OPTIMIZED — full DAG execution with fused kernels

  • HYBRID — direct execution with per-layer compiled subgraphs
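A toy contrast of the modes: DIRECT executes each operation separately, while HYBRID/OPTIMIZED call a fused function. Only the `OptimizedLLMMode` names come from the runtime; the layer arithmetic below is illustrative:

```kotlin
enum class OptimizedLLMMode { DIRECT, OPTIMIZED, HYBRID }

// Stand-in for a compiled/fused per-layer kernel.
val fusedLayer: (Float) -> Float = { x -> x * 0.5f + x }

fun runLayer(x: Float, mode: OptimizedLLMMode): Float = when (mode) {
    OptimizedLLMMode.DIRECT -> {     // each op executed as a separate step
        val normed = x * 0.5f
        normed + x                   // residual add
    }
    else -> fusedLayer(x)            // one fused call per layer
}

fun main() {
    // All modes must produce the same numerics; they differ in dispatch.
    check(runLayer(2f, OptimizedLLMMode.DIRECT) == runLayer(2f, OptimizedLLMMode.HYBRID))
    println("modes agree: ${runLayer(2f, OptimizedLLMMode.HYBRID)}")
}
```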

[7] Tokenization

TokenizerFactory auto-detects the right tokenizer:

  • fromGGUF(source) — reads vocab and merges from GGUF metadata

  • fromTokenizerJson(json) — parses HuggingFace tokenizer.json

  • fromHuggingFace(json, config) — full HF BPE with config
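Auto-detection amounts to inspecting the input and routing to the right factory method. The detection heuristics below are illustrative, not `TokenizerFactory`'s actual rules (GGUF files do start with the ASCII magic `GGUF`):

```kotlin
// Route to a factory method based on what the input looks like.
fun detectTokenizerSource(bytes: ByteArray, fileName: String): String = when {
    bytes.size >= 4 &&
        bytes.copyOfRange(0, 4).contentEquals("GGUF".toByteArray()) -> "fromGGUF"
    fileName.endsWith("tokenizer.json") -> "fromTokenizerJson"
    else -> "fromHuggingFace"
}

fun main() {
    println(detectTokenizerSource("GGUFxxxx".toByteArray(), "model.gguf"))
    println(detectTokenizerSource("{}".toByteArray(), "tokenizer.json"))
}
```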

[8] Chat Pipeline

ChatSession bundles runtime + tokenizer + metadata:

  • Auto-detects chat template from ModelMetadata

  • createAgentLoop() — multi-turn tool calling

  • runSingleTurn() — one-shot tool calling
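The multi-turn loop behind createAgentLoop() follows a standard shape: generate, execute any tool call, feed the result back into context, and stop when the model answers directly. All types and signatures below are illustrative stand-ins, not the ChatSession API:

```kotlin
// A model turn is either plain text or a tool invocation.
data class Turn(val text: String, val toolCall: String? = null)

fun agentLoop(
    generate: (List<String>) -> Turn,  // model step over the running context
    runTool: (String) -> String,       // executes a tool call
    prompt: String,
): String {
    val history = mutableListOf(prompt)
    while (true) {
        val turn = generate(history)
        if (turn.toolCall == null) return turn.text  // direct answer: done
        history += runTool(turn.toolCall)            // tool result joins context
    }
}

fun main() {
    var step = 0
    val answer = agentLoop(
        generate = { if (step++ == 0) Turn("", "weather(SF)") else Turn("It is sunny.") },
        runTool = { "sunny" },
        prompt = "What is the weather in SF?",
    )
    println(answer)
}
```

runSingleTurn() is the one-shot version: at most one tool call, then a final answer.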