Architecture
This page follows arc42’s chapter ordering at a coarse grain. Sections that are still single-paragraph stubs ask for contributions; sections about the kernel SPI and eager execution pipeline are the deepest because that’s where the 0.21.0 work landed.
1. Introduction and goals
SKaiNET is a Kotlin Multiplatform ML framework whose primary target is on-device / edge inference and training in environments that already have JVM tooling: Android, JVM server, and Kotlin/Native iOS and linuxX64. The framework separates model authoring (a typed Kotlin DSL with compile-time tensor shape checks where possible) from execution (pluggable backends), so the same model can run eagerly during development and be lowered to MLIR StableHLO → IREE for deployment.
Hard non-goals:
- Become a full numerics library. SKaiNET targets the operators real models use, not the long tail in PyTorch / NumPy.
- Run untrusted user code. Kernels are trusted code; security is about not corrupting memory, not about sandboxing.
2. Constraints
| Constraint | Why |
|---|---|
| Kotlin Multiplatform with | Same DSL must run on JVM, Android, iOS, macOS, linuxX64, JS, Wasm. |
| | FloatVector / ByteVector are still incubator on JDK 25 (JEP 508). |
| Maven Central publication via | All modules signed; coordinates |
| Antora-based docs site under | Source-controlled, follows Diátaxis quadrants for user-facing pages. |
| arc42 ordering for this page | Architectural reference, not a tutorial. |
3. Context (system boundaries)
The framework’s outer surface:
- DSL layer: Kotlin model DSL (nn { … }, tensor { … }) and imperative TensorOps / ExecutionContext API.
- I/O layer: model loaders for GGUF, SafeTensors, ONNX (read-only) in skainet-io-* modules.
- Compile layer: RecordingExecution records ops to a tape, then lowers to StableHLO / IREE bytecode in skainet-compile-* modules.
- Backend layer: BackendProvider dispatches TensorOps calls to a concrete implementation (CPU, XNNPACK, future GPU). Inside a backend, the kernel SPI picks the SIMD recipe for the host hardware.
4. Solution strategy
SKaiNET runs the same model graph through one of two execution strategies:
- Eager execution: DirectCpuExecutionContext calls op implementations as the user invokes them. Used during development, testing, and on-device inference paths where AOT compilation is impractical (debug builds, dynamic graphs). This is the path the 0.21.0 SIMD work targets.
- Recorded execution: RecordingExecution builds an op tape, which HloGenerator lowers to StableHLO MLIR. IREE compiles the MLIR to a portable bytecode for production deployment.
Both strategies share the same TensorOps surface, so a model
written once runs in either mode without changes. Numerical parity
between modes is part of the test contract.
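The idea of one ops surface with two execution strategies can be sketched in a few lines. Everything below (Ops, EagerOps, RecordingOps, model) is a hypothetical stand-in, not the SKaiNET API, and it assumes for simplicity that the recording strategy also computes values while it builds the tape.

```kotlin
// Hypothetical, heavily simplified stand-ins; not the SKaiNET API.
interface Ops {
    fun add(a: FloatArray, b: FloatArray): FloatArray
    fun scale(a: FloatArray, s: Float): FloatArray
}

// Eager strategy: every call computes its result immediately.
object EagerOps : Ops {
    override fun add(a: FloatArray, b: FloatArray): FloatArray {
        require(a.size == b.size) { "shape mismatch: ${a.size} vs ${b.size}" }
        return FloatArray(a.size) { i -> a[i] + b[i] }
    }

    override fun scale(a: FloatArray, s: Float): FloatArray =
        FloatArray(a.size) { i -> a[i] * s }
}

// Recording strategy (assumption: it delegates to the eager ops for values)
// that also appends each op name to a tape a compiler could lower later.
class RecordingOps(private val delegate: Ops = EagerOps) : Ops {
    val tape = mutableListOf<String>()

    override fun add(a: FloatArray, b: FloatArray): FloatArray {
        tape += "add"
        return delegate.add(a, b)
    }

    override fun scale(a: FloatArray, s: Float): FloatArray {
        tape += "scale"
        return delegate.scale(a, s)
    }
}

// A "model" written once against the shared surface.
fun model(ops: Ops, x: FloatArray, bias: FloatArray): FloatArray =
    ops.scale(ops.add(x, bias), 0.5f)
```

Calling model with EagerOps and with a RecordingOps instance should produce identical values, which is the kind of parity the test contract pins; the recording instance additionally exposes the op order on its tape.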
5. Building block view (static structure)
5.1 Module layout
| Module path | Role |
|---|---|
| | DSL types, tensor abstractions, common ops, |
| | Reference reusable models (Llama, Gemma, Qwen, Whisper) built on the DSL. |
| | Neutral backend SPI: |
| | CPU implementation. Eager-execution |
| | Optional XNNPACK CPU backend (FP32 matmul / conv2d / pooling) on linuxX64 / linuxArm64 / Android. |
| | JMH harness: |
| | Tape recording, StableHLO emission, IREE export. |
| | Model loaders (GGUF, SafeTensors, ONNX), tokenizers, IRPA writer. |
5.2 Kernel SPI
Introduced in 0.21.0 (PRs #554, #559, #562). The static structure:
             commonMain (skainet-backend-api)
             ┌──────────────────────────────────────┐
             │ KernelProvider {                     │
             │   name: String                       │
             │   priority: Int                      │
             │   isAvailable(): Boolean             │
             │   matmulFp32(): Fp32MatmulKernel?    │
             │   matmulQ4K(): Q4KMatmulKernel?      │
             │ }                                    │
             │                                      │
             │ KernelRegistry {                     │
             │   register(KernelProvider)           │
             │   bestAvailable(): KernelProvider?   │
             │   find(name): KernelProvider?        │
             │ }                                    │
             │                                      │
             │ Fp32MatmulKernel.matmul(...)         │
             │ Q4KMatmulKernel.matmul(...)          │
             └──────────────┬───────────────────────┘
                            │ implements / extends
┌───────────────────────────┼─────────────────────────────────┐
│ jvmMain (api)             │ commonMain (cpu)                │
│ KernelServiceLoader       │ ScalarMatmulKernel (priority 0) │
│   installAll()            │ ScalarKernelProvider            │
└───────────────────────────┴─────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│ jvmMain (cpu)                                               │
│ PanamaVectorKernelProvider (priority 50)                    │
│   FP32 BF16 Q8_0 Q4_0 Q4_K Q6_K Q5_1 Q5_0 (SIMD)            │
│ Scalar/PanamaVectorKernelProviderFactory (no-arg wrappers)  │
│ META-INF/services/...KernelProvider                         │
└─────────────────────────────────────────────────────────────┘
                            │
┌───────────────────────────▼─────────────────────────────────┐
│ jvmMain (skainet-backend-native-cpu)                        │
│ NativeKernelProvider (priority 100, FFM/C)                  │
│   FP32 BF16 Q8_0 Q4_0 Q4_K (+ Q4_K MemSeg zero-copy)        │
└─────────────────────────────────────────────────────────────┘
Four live providers ship. The exact, machine-generated coverage of every weight format on every KMP target is at Kernel × platform support matrix; for how the kernels are implemented see How SIMD Kernels Are Built (FP32) and How Quantized SIMD Kernels Are Built (quantized). Packed-quant matmul (Q4_K/Q6_K/Q5_1/Q5_0) also has a commonMain scalar kernel, so it runs on Kotlin/Native, JS and WASM, not only the JVM.
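The priority-based selection implied by the diagram can be sketched as follows. Type and member names mirror the diagram, but the matmul signature, the selection policy in bestAvailable(), the SimpleProvider helper, and the provider names used with it are assumptions made for illustration only.

```kotlin
// Names mirror the diagram; anything marked "assumed" is an illustrative guess.
fun interface Fp32MatmulKernel {
    // Assumed signature: row-major A (m x k) times row-major B (k x n), new m x n result.
    fun matmul(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray
}

interface KernelProvider {
    val name: String
    val priority: Int
    fun isAvailable(): Boolean
    fun matmulFp32(): Fp32MatmulKernel?
}

class KernelRegistry {
    private val providers = mutableListOf<KernelProvider>()

    fun register(provider: KernelProvider) {
        providers += provider
    }

    // Assumed policy: highest priority among providers that report themselves available.
    fun bestAvailable(): KernelProvider? =
        providers.filter { it.isAvailable() }.maxByOrNull { it.priority }

    fun find(name: String): KernelProvider? = providers.firstOrNull { it.name == name }
}

// Plain triple-loop kernel standing in for the scalar reference implementation.
val scalarMatmul = Fp32MatmulKernel { a, b, m, k, n ->
    val out = FloatArray(m * n)
    for (i in 0 until m) for (j in 0 until n) {
        var acc = 0f
        for (p in 0 until k) acc += a[i * k + p] * b[p * n + j]
        out[i * n + j] = acc
    }
    out
}

// Hypothetical helper for wiring up providers in examples and tests.
class SimpleProvider(
    override val name: String,
    override val priority: Int,
    private val available: Boolean,
    private val kernel: Fp32MatmulKernel?,
) : KernelProvider {
    override fun isAvailable(): Boolean = available
    override fun matmulFp32(): Fp32MatmulKernel? = kernel
}
```

Under this policy, registering providers at priorities 0, 50 and 100 where the priority-100 one reports isAvailable() == false would make bestAvailable() return the priority-50 provider, which matches the fallback order the diagram suggests.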
Native (FFM) provider
6. Runtime view: eager execution
The eager pipeline for a single op:
User code
   │
   │  ctx.ops.matmul(a, b)
   ▼
DirectCpuExecutionContext
   │
   │  delegates to TensorOps implementation
   ▼
DefaultCpuOpsJvm.matmul (jvmMain)
   │
   ├── chooseQuantizedMatmul(a, b)?
   │      │
   │      │  matches Q4_K / Q6_K / Q8_0 / Q4_0 weights
   │      ▼
   │   ├─ Q4_K branch: q4kMatmulKernel?.matmul(...) via SPI
   │   │     fallback to JvmQuantizedVectorKernels.matmulQ4_KVec
   │   ├─ Q6_K branch: JvmQuantizedVectorKernels.matmulQ6_KVec
   │   ├─ Q8_0 / Q4_0 branches: per-format SIMD inner loops
   │   └─ MemSeg variants: same kernels, ByteVector.fromMemorySegment
   │
   ├── chooseMatmul(a, b)? (FP32 path)
   │      │
   │      │  fp32MatmulKernel.matmul(...) (SPI dispatch)
   │      ▼
   │   PanamaVectorMatmulKernel.matmul(...)
   │      │  (or ScalarMatmulKernel when Panama unavailable)
   │      │
   │      │  tile-blocked FMA inner loop:
   │      │    for each (m, n, k)-tile:
   │      │      load FloatVector slices of A and B^T
   │      │      acc = va.fma(vb, acc)
   │      │      acc.reduceLanes(ADD) per output cell
   │      ▼
   │   FloatArray output → back up the stack
   │
   └── super.matmul(a, b) (DefaultCpuOpsBase fallback)
Two specifics worth calling out:
- Lazy provider resolution. DefaultCpuOpsJvm.fp32MatmulKernel and q4kMatmulKernel are by lazy properties. First access triggers KernelServiceLoader.installAll() if the registry is empty, then caches the resolved kernel for the lifetime of the op set. Apps that pre-register custom providers via KernelRegistry.register(…) before constructing the op set bypass the auto-discovery path.
- Fall-through everywhere. Each routing decision (chooseQuantizedMatmul → chooseMatmul → super.matmul) returns null on a miss, never throws, so adding a new tensor type or a new SPI accessor is purely additive. The MemSeg arena leak fix in PR #556 made every per-op output use Arena.ofAuto(), so even the fast-path branches don’t need explicit lifetime management.
For shape inference, broadcasting, and the lazy shape-swap transpose
specializations on Q4_KTensorData / Q6_KTensorData / MemorySegmentBackedData,
see the same source file (DefaultCpuOpsJvm.transpose).
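The lazy resolution and null-returning fall-through described in this section can be illustrated with a small sketch. The Router class, the Weights hierarchy, and the string "kernels" are hypothetical simplifications; the real routing works on tensors and kernel objects.

```kotlin
// Hypothetical simplification of the routing chain; types and names are illustrative only.
sealed interface Weights
class Fp32Weights(val values: FloatArray) : Weights
class QuantWeights(val format: String) : Weights   // e.g. "Q4_K"
class OtherWeights : Weights

class Router(private val resolveKernel: () -> String?) {
    // Counts how often the (potentially expensive) resolution actually runs.
    var resolutions = 0
        private set

    // Resolved on first access, then cached for the lifetime of this Router.
    private val fp32Kernel: String? by lazy {
        resolutions++
        resolveKernel()
    }

    // Each stage returns null on a miss instead of throwing.
    private fun chooseQuantized(w: Weights): String? =
        (w as? QuantWeights)?.let { "quantized:${it.format}" }

    private fun chooseFp32(w: Weights): String? =
        if (w is Fp32Weights) fp32Kernel?.let { "fp32:$it" } else null

    // Final stage always succeeds, so route() never returns null.
    private fun fallback(): String = "base"

    // Elvis chaining makes the fall-through order explicit.
    fun route(w: Weights): String = chooseQuantized(w) ?: chooseFp32(w) ?: fallback()
}
```

Because every stage signals a miss with null, a new stage can be inserted into the Elvis chain without touching the existing ones, which is the "purely additive" property the text refers to.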
7. Deployment view
- Maven Central: every module published as sk.ainet.core:<module>-<target>:<version>. KMP variants land per target (-jvm, -android, -iosarm64, -macosarm64, -linuxx64, -linuxarm64, -js, -wasm-js, -wasm-wasi).
- Single BOM: sk.ainet:skainet-bom provides a platform() import for downstream Gradle. Note the group is sk.ainet, not sk.ainet.core, so downstream BOMs (e.g. sk.ainet.transformers:skainet-transformers-bom) can import it under the standard umbrella group.
- Releases: tags 0.X.Y on the release branch trigger .github/workflows/publish.yml → ./gradlew publish on macOS-latest with JDK 25.
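For a downstream Gradle build, the BOM import might look like the fragment below. The group IDs come from the text above; the version number and the choice of skainet-backend-api as the example artifact are assumptions, so check Maven Central for the actual coordinates.

```kotlin
// build.gradle.kts of a consuming project (sketch; version and module are assumed).
dependencies {
    // Import the BOM so individual SKaiNET modules can omit their versions.
    implementation(platform("sk.ainet:skainet-bom:0.21.0"))

    // Module artifacts live under the sk.ainet.core group. With Gradle module
    // metadata, the per-target variant (-jvm, -android, ...) is normally
    // selected automatically from the unsuffixed coordinate.
    implementation("sk.ainet.core:skainet-backend-api")
}
```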
8. Cross-cutting concepts
- Numerical parity testing. Every accelerated kernel has a parity test against a scalar reference within a documented tolerance (typically 1e-5 * k or 1e-4 relative). Examples: PanamaVectorMatmulKernelTest, PanamaVectorQ4KMatmulKernelTest, Q6KMatmulTest. The scalar reference is the contract; SIMD speed is a non-functional improvement that must not break the contract.
- Lazy resource lifetimes. MemorySegmentTensorDataFactory uses Arena.ofAuto() for per-op outputs so output segments are GC-reclaimable. The earlier Arena.ofConfined() builds leaked ~tens of MB per matmul, blowing 32+ GiB of direct memory in inference loops over a 35-layer Gemma 4 forward pass. Fixed in PR #556.
- Kill switches via system properties. The Vector API code path respects -Dskainet.cpu.vector.enabled=false so a deployment can opt out of incubator code without a recompile. Same pattern for BLAS (-Dskainet.cpu.blas.enabled=true).
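A kill switch of this kind can be read with a few lines of JVM code. The property names are the ones quoted above; the flag helper, the parsing rules, and the defaults (Vector API on unless disabled, BLAS off unless enabled) are assumptions inferred from how the flags are described, not confirmed behavior.

```kotlin
// Sketch of a boolean kill switch read from a JVM system property.
// Parsing rules and defaults are assumptions, not SKaiNET's documented behavior.
fun flag(name: String, default: Boolean): Boolean {
    val raw = System.getProperty(name) ?: return default
    return when (raw.trim().lowercase()) {
        "true" -> true
        "false" -> false
        else -> default   // ignore malformed values rather than fail at startup
    }
}

// Assumed defaults: Vector API path on unless disabled, BLAS off unless enabled.
fun vectorEnabled(): Boolean = flag("skainet.cpu.vector.enabled", default = true)
fun blasEnabled(): Boolean = flag("skainet.cpu.blas.enabled", default = false)
```

Reading the property at the point where a provider decides isAvailable() keeps the opt-out a pure deployment concern: the same binary runs with or without the incubator code path.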
9. Architecture decisions
| Decision | Date | Rationale |
|---|---|---|
| Kernel SPI parallel to BackendProvider | 2026-04 (PR #554) | Matmul / SDPA are model-agnostic; isolating them lets bench harnesses time the SIMD loop directly and lets a future native provider register without touching the op layer. |
| | 2026-04 (PR #562) | Backwards compat for existing providers (Scalar) without forcing every implementation to override. Same pattern will be used for Q6KMatmulKernel / Q4KMemSegMatmulKernel sibling SPIs. |
| ServiceLoader auto-discovery deferred until 2 providers exist | 2026-04 (PR #559) | Single-provider auto-discovery would have been ceremony for nothing; once Panama landed alongside Scalar, the trigger condition was met. |
| FFM (not JNI) for any future native code | roadmap M5 | JNI’s per-call overhead and global lock are wrong for hot per-token kernels. |
| Antora docs (Diátaxis), not GitHub Wiki | 2025 | Source-controlled, branchable, ranked higher in search than wikis, ships with the repo. |
10. Quality requirements
- Performance. Panama FP32 matmul ≥1.5× scalar (M5 metric: met, ~10× at 1024² on Apple Silicon NEON). Native Q4_K matmul ≥2.5× scalar dequant baseline (M5 metric: deferred to native FFM PR).
- Numerical equivalence. Every SIMD kernel matches its scalar reference within FP-rounding tolerance (1e-5 * k for matmul, 1e-4 relative for quantized). Pinned by parity tests.
- Multi-target buildability. ./gradlew allTests (the release gate) must pass on every KMP target: JVM, JS, Wasm, macosArm64, iosSimulatorArm64, linuxX64, linuxArm64, Android.
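A parity check of the kind described under numerical equivalence could look like the sketch below. It reads the k in 1e-5 * k as the reduction dimension of the matmul, which is an assumption, and it uses a four-accumulator FMA loop as a stand-in for a real SIMD kernel; matmulReference, matmulFma4 and maxAbsDiff are hypothetical names.

```kotlin
import kotlin.math.abs

// Scalar reference: plain triple loop over row-major A (m x k) and B (k x n).
fun matmulReference(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray {
    val out = FloatArray(m * n)
    for (i in 0 until m) for (j in 0 until n) {
        var acc = 0f
        for (p in 0 until k) acc += a[i * k + p] * b[p * n + j]
        out[i * n + j] = acc
    }
    return out
}

// Stand-in for an accelerated kernel: four independent FMA accumulators,
// which changes the summation order the way SIMD lanes do.
fun matmulFma4(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray {
    val out = FloatArray(m * n)
    for (i in 0 until m) for (j in 0 until n) {
        val acc = FloatArray(4)
        for (p in 0 until k) {
            acc[p % 4] = Math.fma(a[i * k + p], b[p * n + j], acc[p % 4])
        }
        // Horizontal reduction of the four partial sums.
        out[i * n + j] = (acc[0] + acc[1]) + (acc[2] + acc[3])
    }
    return out
}

// Largest absolute element-wise difference between two results.
fun maxAbsDiff(x: FloatArray, y: FloatArray): Float {
    require(x.size == y.size) { "size mismatch: ${x.size} vs ${y.size}" }
    var worst = 0f
    for (i in x.indices) worst = maxOf(worst, abs(x[i] - y[i]))
    return worst
}
```

A parity test then asserts maxAbsDiff(reference, accelerated) <= 1e-5f * k on representative inputs. Scaling the tolerance with k reflects that rounding error accumulates with the length of the dot product.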
11. Risks and technical debt
- Vector API still incubator on JDK 25. JEP 508 keeps it that way through 2026; we depend on it heavily. If it breaks API in a future JDK, every JvmVectorKernels / JvmQuantizedVectorKernels file needs adjustment. Mitigation: thin wrappers, parity tests are a canary.
- No native FFM provider yet. The literal M5 milestone metric (≥2.5× for Q4_K) is met by Panama in absolute terms but not in the "native vs JVM" framing the metric originally specified. The priority-100 native provider is designed but not yet shipped.
- Two reverted optimizations on develop history. MemSeg pool (commit 8642b322) and intra-op matmul parallelism (commit 9ed633b6) were both tried and reverted. Re-attempts need a different strategy than what was tried; the revert commits explain why.
12. Glossary (selected)
| Term | Meaning |
|---|---|
| FFM | Foreign Function & Memory API (JEP 442 et seq.). Java 22 stable, 21 preview. Replaces JNI for native interop. |
| FMA | Fused multiply-add ( |
| ggml | The C library underpinning llama.cpp; defines the canonical Q4_K / Q6_K / Q8_0 block layouts SKaiNET uses for GGUF compatibility. |
| Lazy transpose | Specialization in |
| MemSeg | Short for |
| Panama | Codename for the JDK Vector API ( |
| SPI | Service provider interface: a public interface with multiple registered implementations, looked up at runtime. SKaiNET uses it for backends and now for kernels. |