Inference Pipeline
Stage Details
[1] Weight Loading
LlamaWeightLoader supports:
- Sequential loading — entire file loaded into memory (small models, < 2 GB)
- Streaming loading — only metadata held in memory; tensors read on demand (any size)
- Quantization policies — DEQUANTIZE_TO_FP32, NATIVE_OPTIMIZED (SIMD dequantization at inference time)
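The size-based choice between the two loading modes can be sketched as follows. This is a minimal illustration, not the loader's real API: the `LoadStrategy` type and `chooseStrategy` function are hypothetical stand-ins; only the 2 GB threshold comes from the text above.

```kotlin
// Illustrative stand-ins for the loader's strategy choice (not real API names).
sealed interface LoadStrategy {
    object Sequential : LoadStrategy // whole file resident in memory
    object Streaming : LoadStrategy  // metadata only; tensors read on demand
}

// Small files load eagerly; anything at or above the 2 GB threshold streams.
fun chooseStrategy(fileSizeBytes: Long): LoadStrategy =
    if (fileSizeBytes < 2L * 1024 * 1024 * 1024) LoadStrategy.Sequential
    else LoadStrategy.Streaming
```

Streaming keeps peak memory proportional to the largest single tensor rather than the whole checkpoint, which is why it works for any model size.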
[2] DSL Network Definition
Pure functions that return a Module<T, V> tree:
val model = llamaNetwork<FP32, Float>(metadata)
The DSL provides: embedding(), multiHeadAttention(), swiGluFFN(), rmsNorm(), residual().
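The "pure functions returning a tree" idea can be sketched with simplified stand-ins. The real `Module<T, V>` and builders like `embedding()` and `rmsNorm()` live in the library; the types below are illustrative only, and the layer ordering shown is an assumption.

```kotlin
// Minimal illustrative module tree; not the library's real Module<T, V>.
sealed interface Module {
    data class Leaf(val name: String) : Module
    data class Seq(val children: List<Module>) : Module
}

fun embedding(): Module = Module.Leaf("embedding")
fun rmsNorm(): Module = Module.Leaf("rmsNorm")
fun multiHeadAttention(): Module = Module.Leaf("multiHeadAttention")
fun swiGluFFN(): Module = Module.Leaf("swiGluFFN")

// A transformer block is plain function composition; no mutable state.
fun transformerBlock(): Module =
    Module.Seq(listOf(rmsNorm(), multiHeadAttention(), rmsNorm(), swiGluFFN()))

fun llamaNetwork(layers: Int): Module =
    Module.Seq(listOf(embedding()) + List(layers) { transformerBlock() } + rmsNorm())
```

Because the builders are pure, calling `llamaNetwork` twice with the same arguments yields structurally equal trees, which is what makes the later tracing step deterministic.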
[3-5] Compute Graph Compilation
OptimizedLLMRuntime traces the module tree into a DAG and applies optimization passes:
- TransposeEliminationPass — fold transposes into matmul parameters
- SharedWeightDeduplicationPass — eliminate redundant weight loads
- LLMFusionPass — fuse RMSNorm, SwiGLU FFN, and QKV projections
- DeadCodeEliminationPass — remove unused operations
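As one concrete example, a pass in the spirit of DeadCodeEliminationPass keeps only nodes reachable from the graph output. The `Node` shape and function below are an illustrative sketch, not the runtime's real graph types.

```kotlin
// Illustrative DAG node: an id plus the ids of its input nodes.
data class Node(val id: String, val inputs: List<String>)

// Keep only nodes reachable from the output by walking input edges backwards.
fun deadCodeElimination(nodes: Map<String, Node>, output: String): Map<String, Node> {
    val live = mutableSetOf<String>()
    fun mark(id: String) {
        if (live.add(id)) nodes[id]?.inputs?.forEach(::mark)
    }
    mark(output)
    return nodes.filterKeys { it in live }
}
```

The other passes follow the same pattern: a pure function from graph to graph, so they compose into a pipeline and can be applied in any order that preserves their preconditions.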
[6] Inference Runtime
Three execution modes via OptimizedLLMMode:
- DIRECT — module tree executes forward passes directly (debugging)
- OPTIMIZED — full DAG execution with fused kernels
- HYBRID — direct execution with per-layer compiled subgraphs
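Dispatching on the mode might look like the sketch below; the enum name matches the text, but the dispatch function is hypothetical.

```kotlin
// Enum name taken from the text; the when-dispatch is an illustrative sketch.
enum class OptimizedLLMMode { DIRECT, OPTIMIZED, HYBRID }

fun describe(mode: OptimizedLLMMode): String = when (mode) {
    OptimizedLLMMode.DIRECT    -> "module tree forward passes (debugging)"
    OptimizedLLMMode.OPTIMIZED -> "full DAG execution with fused kernels"
    OptimizedLLMMode.HYBRID    -> "direct execution with per-layer compiled subgraphs"
}
```

HYBRID is the middle ground: it keeps the debuggable module-tree control flow while still getting fused-kernel speed inside each layer.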