Architecture

This page follows arc42’s chapter ordering at a coarse grain. Sections that are still single-paragraph stubs ask for contributions; sections about the kernel SPI and eager execution pipeline are the deepest because that’s where the 0.21.0 work landed.

1. Introduction and goals

SKaiNET is a Kotlin Multiplatform ML framework whose primary target is on-device / edge inference and training in environments that already have JVM tooling: Android, JVM server, and Kotlin/Native iOS and linuxX64. The framework separates model authoring (a typed Kotlin DSL with compile-time tensor shape checks where possible) from execution (pluggable backends), so the same model can run eagerly during development and be lowered to MLIR StableHLO → IREE for deployment.

Hard non-goals:

  • Become a full numerics library. SKaiNET targets the operators real models use, not the long tail in PyTorch / NumPy.

  • Run untrusted user code. Kernels are trusted code; security is about not corrupting memory, not about sandboxing.

[Figure: architecture diagram of the SKaiNET compiler pipeline]

2. Constraints

Constraint | Why
Kotlin Multiplatform with commonMain / per-target source sets | Same DSL must run on JVM, Android, iOS, macOS, linuxX64, JS, Wasm.
--enable-preview --add-modules jdk.incubator.vector on JVM 21+ | FloatVector / ByteVector are still incubator on JDK 25 (JEP 508).
Maven Central publication via vanniktech.mavenPublish | All modules signed; coordinates sk.ainet.core:*.
Antora-based docs site under docs/modules/ROOT/ | Source-controlled, follows Diátaxis quadrants for user-facing pages.
arc42 ordering for this page | Architectural reference, not a tutorial.
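
The Vector API flags in the table have to reach every JVM that executes the SIMD kernels, including Gradle test workers. A minimal Gradle Kotlin DSL sketch, assuming a consumer build.gradle.kts with a Kotlin JVM or multiplatform plugin applied; the task types are Gradle's own, and nothing here is copied from SKaiNET's build scripts:

```kotlin
// build.gradle.kts (sketch, not SKaiNET's actual build script)
val vectorApiFlags = listOf("--enable-preview", "--add-modules", "jdk.incubator.vector")

// Test workers need the incubator module to load FloatVector / ByteVector.
tasks.withType<Test>().configureEach {
    jvmArgs(vectorApiFlags)
}

// Same flags for application runs started through Gradle.
tasks.withType<JavaExec>().configureEach {
    jvmArgs(vectorApiFlags)
}
```

A JVM launched outside Gradle needs the same flags on its own command line.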

3. Context (system boundaries)

The framework’s outer surface:

  • DSL layer: Kotlin model DSL (nn { … }, tensor { … }) and imperative TensorOps / ExecutionContext API.

  • I/O layer: model loaders for GGUF, SafeTensors, ONNX (read-only) in skainet-io-* modules.

  • Compile layer: RecordingExecution records ops to a tape, then lowers to StableHLO / IREE bytecode in skainet-compile-* modules.

  • Backend layer: BackendProvider dispatches TensorOps calls to a concrete implementation (CPU, XNNPACK, future GPU). Inside a backend, the kernel SPI picks the SIMD recipe for the host hardware.

4. Solution strategy

SKaiNET runs the same model graph through one of two execution strategies:

  • Eager execution: DirectCpuExecutionContext calls op implementations as the user invokes them. Used during development, testing, and on-device inference paths where AOT compilation is impractical (debug builds, dynamic graphs). This is the path the 0.21.0 SIMD work targets.

  • Recorded execution: RecordingExecution builds an op tape, which HloGenerator lowers to StableHLO MLIR. IREE compiles the MLIR to a portable bytecode for production deployment.

Both strategies share the same TensorOps surface, so a model written once runs in either mode without changes. Numerical parity between modes is part of the test contract.
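
The idea of one op surface with two strategies behind it can be sketched in a few lines. The types below (MiniOps, EagerMiniOps, RecordingMiniOps, tinyModel) are hypothetical stand-ins invented for illustration; they are not SKaiNET's TensorOps, DirectCpuExecutionContext, or RecordingExecution, and the string tape is only a placeholder for a real op record.

```kotlin
// Hypothetical stand-ins for illustration; not SKaiNET's actual interfaces.
interface MiniOps {
    fun add(a: FloatArray, b: FloatArray): FloatArray
    fun scale(a: FloatArray, factor: Float): FloatArray
}

// Eager strategy: every call computes its result immediately.
class EagerMiniOps : MiniOps {
    override fun add(a: FloatArray, b: FloatArray): FloatArray {
        require(a.size == b.size) { "shape mismatch: ${a.size} vs ${b.size}" }
        return FloatArray(a.size) { i -> a[i] + b[i] }
    }

    override fun scale(a: FloatArray, factor: Float): FloatArray =
        FloatArray(a.size) { i -> a[i] * factor }
}

// Recorded strategy: every call appends to a tape and returns a correctly
// sized placeholder, so a later pass could lower the tape to another form.
class RecordingMiniOps : MiniOps {
    val tape = mutableListOf<String>()

    override fun add(a: FloatArray, b: FloatArray): FloatArray {
        require(a.size == b.size) { "shape mismatch: ${a.size} vs ${b.size}" }
        tape += "add"
        return FloatArray(a.size)
    }

    override fun scale(a: FloatArray, factor: Float): FloatArray {
        tape += "scale($factor)"
        return FloatArray(a.size)
    }
}

// The "model" is written once against the shared surface.
fun tinyModel(ops: MiniOps, x: FloatArray, y: FloatArray): FloatArray =
    ops.scale(ops.add(x, y), 0.5f)
```

Passing EagerMiniOps yields numbers; passing RecordingMiniOps yields a tape of the same ops in the same order, which is the property a parity test between modes relies on.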

5. Building block view (static structure)

5.1 Module layout

Module path | Role
skainet-lang/skainet-lang-core | DSL types, tensor abstractions, common ops, TensorOps / ExecutionContext interfaces. KMP, all targets.
skainet-lang/skainet-lang-models | Reference reusable models (Llama, Gemma, Qwen, Whisper) built on the DSL.
skainet-backends/skainet-backend-api | Neutral backend SPI: TensorOps, TensorDataFactory, kernel SPI (KernelProvider, Fp32MatmulKernel, Q4KMatmulKernel, KernelRegistry).
skainet-backends/skainet-backend-cpu | CPU implementation. Eager-execution DefaultCpuOpsBase (commonMain) + DefaultCpuOpsJvm (jvmMain) with SIMD kernels.
skainet-backends/skainet-backend-xnnpack | Optional XNNPACK CPU backend (FP32 matmul / conv2d / pooling) on linuxX64 / linuxArm64 / Android.
skainet-backends/benchmarks/jvm-cpu-jmh | JMH harness: MatmulBench, KernelMatmulBench, QuantizedMatmulBench, ElementwiseAdd1MBench, Reductions1MBench.
skainet-compile/* | Tape recording, StableHLO emission, IREE export.
skainet-io/* | Model loaders (GGUF, SafeTensors, ONNX), tokenizers, IRPA writer.

5.2 Kernel SPI

Introduced in 0.21.0 (PRs #554, #559, #562). The static structure:

                 commonMain (skainet-backend-api)
                 ┌──────────────────────────────────────┐
                 │  KernelProvider {                    │
                 │    name: String                      │
                 │    priority: Int                     │
                 │    isAvailable(): Boolean            │
                 │    matmulFp32(): Fp32MatmulKernel?   │
                 │    matmulQ4K(): Q4KMatmulKernel?     │
                 │  }                                   │
                 │                                      │
                 │  KernelRegistry {                    │
                 │    register(KernelProvider)          │
                 │    bestAvailable(): KernelProvider?  │
                 │    find(name): KernelProvider?       │
                 │  }                                   │
                 │                                      │
                 │  Fp32MatmulKernel.matmul(...)        │
                 │  Q4KMatmulKernel.matmul(...)         │
                 └──────────────┬───────────────────────┘
                                │ implements / extends
   ┌────────────────────────────┼──────────────────────────────────┐
   │ jvmMain (api)              │ commonMain (cpu)                 │
   │  KernelServiceLoader       │  ScalarMatmulKernel (priority 0) │
   │  installAll()              │  ScalarKernelProvider            │
   └────────────────────────────┴──────────────────────────────────┘
                                │
   ┌────────────────────────────┼──────────────────────────────────┐
   │ jvmMain (cpu)                                                 │
   │  PanamaVectorKernelProvider (priority 50)                     │
   │   FP32 BF16 Q8_0 Q4_0 Q4_K Q6_K Q5_1 Q5_0 (SIMD)              │
   │  Scalar/PanamaVectorKernelProviderFactory (no-arg wrappers)   │
   │  META-INF/services/...KernelProvider                          │
   └───────────────────────────────────────────────────────────────┘
                                │
   ┌────────────────────────────┼──────────────────────────────────┐
   │ jvmMain (skainet-backend-native-cpu)                          │
   │  NativeKernelProvider (priority 100, FFM/C)                   │
   │   FP32 BF16 Q8_0 Q4_0 Q4_K (+ Q4_K MemSeg zero-copy)          │
   └───────────────────────────────────────────────────────────────┘

Four live providers ship. The exact, machine-generated coverage of every weight format on every KMP target is at Kernel × platform support matrix; for how the kernels are implemented see How SIMD Kernels Are Built (FP32) and How Quantized SIMD Kernels Are Built (quantized). Packed-quant matmul (Q4_K/Q6_K/Q5_1/Q5_0) also has a commonMain scalar kernel, so it runs on Kotlin/Native, JS and WASM, not only on the JVM.

Native (FFM) provider

NativeKernelProvider registers at priority 100, so on JDK 21+ it wins KernelRegistry.bestAvailable() over the Panama Vector provider whenever the native library loads. When the library does not load, selection falls back transparently to Panama (priority 50) or scalar (priority 0), with no code change above the registry. The provider uses FFM, not JNI (near-zero call overhead, no global lock). It ships in the skainet-backend-native-cpu module with C kernels for FP32/BF16/Q8_0/Q4_0/Q4_K, plus a zero-copy MemorySegment Q4_K path, and currently builds for the host architecture only; cross-arch builds and Maven classifier JARs are out of scope. Native FFM kernels for Q5_1/Q5_0/Q6_K are a tracked follow-up (SKaiNET#708). The kernel SPI this builds on shipped across 0.21.0 (PRs #554–#565); the in-process native-FFM groundwork landed in 0.22.0 (PR #571).
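
The priority-based fallback described above can be illustrated with a miniature registry. The member names mirror the SPI diagram, but MiniKernelProvider, MiniKernelRegistry, FixedProvider, and all method bodies are assumptions made for this sketch, not SKaiNET source; in particular, how the real registry handles ties or thread safety is not shown here.

```kotlin
// Hypothetical miniature of the registry; bodies are assumptions, not SKaiNET source.
interface MiniKernelProvider {
    val name: String
    val priority: Int
    fun isAvailable(): Boolean
}

class MiniKernelRegistry {
    private val providers = mutableListOf<MiniKernelProvider>()

    fun register(provider: MiniKernelProvider) {
        providers += provider
    }

    // Highest-priority provider whose availability probe succeeds, or null if none does.
    fun bestAvailable(): MiniKernelProvider? =
        providers.filter { it.isAvailable() }.maxByOrNull { it.priority }

    // Lookup by name ignores availability, which is handy for diagnostics.
    fun find(name: String): MiniKernelProvider? =
        providers.firstOrNull { it.name == name }
}

// Test double with a fixed availability answer (for example "native library failed to load").
class FixedProvider(
    override val name: String,
    override val priority: Int,
    private val available: Boolean,
) : MiniKernelProvider {
    override fun isAvailable(): Boolean = available
}
```

With providers at priorities 0, 50, and 100 registered and the priority-100 one reporting unavailable, bestAvailable() returns the priority-50 provider; callers above the registry never see the difference.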

6. Runtime view: eager execution

The eager pipeline for a single op:

User code
   │
   │  ctx.ops.matmul(a, b)
   ▼
DirectCpuExecutionContext
   │
   │  delegates to TensorOps implementation
   ▼
DefaultCpuOpsJvm.matmul                                (jvmMain)
   │
   ├─→ chooseQuantizedMatmul(a, b)?
   │      │
   │      │  matches Q4_K / Q6_K / Q8_0 / Q4_0 weights
   │      ▼
   │   ┌─ Q4_K branch: q4kMatmulKernel?.matmul(...) via SPI
   │   │  fallback to JvmQuantizedVectorKernels.matmulQ4_KVec
   │   ├─ Q6_K branch: JvmQuantizedVectorKernels.matmulQ6_KVec
   │   ├─ Q8_0 / Q4_0 branches: per-format SIMD inner loops
   │   └─ MemSeg variants: same kernels, ByteVector.fromMemorySegment
   │
   ├─→ chooseMatmul(a, b)?                             (FP32 path)
   │      │
   │      │  fp32MatmulKernel.matmul(...)              ← SPI dispatch
   │      ▼
   │   PanamaVectorMatmulKernel.matmul(...)
   │      │  (or ScalarMatmulKernel when Panama unavailable)
   │      │
   │      │  tile-blocked FMA inner loop:
   │      │    for each (m, n, k)-tile:
   │      │      load FloatVector slices of A and B^T
   │      │      acc = va.fma(vb, acc)
   │      │      acc.reduceLanes(ADD) per output cell
   │      ▼
   │   FloatArray output                                ← back up the stack
   │
   └─→ super.matmul(a, b)                              (DefaultCpuOpsBase fallback)
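
The FMA inner loop in the diagram accumulates one partial sum per vector lane and reduces the lanes once per output cell. The sketch below emulates that shape with plain arrays and Math.fma, so it runs without the incubator module; it is a scalar stand-in, not the Panama kernel. The function names, the lanes parameter, and the row-major layout with B supplied transposed are assumptions, and tiling is omitted for brevity.

```kotlin
// Scalar emulation of a lane-wise FMA dot product. `lanes` stands in for the
// hardware vector width (an assumption, e.g. 8 floats for 256-bit SIMD).
fun laneDot(a: FloatArray, aOff: Int, bT: FloatArray, bOff: Int, k: Int, lanes: Int = 8): Float {
    val acc = FloatArray(lanes)                      // one partial sum per lane
    var i = 0
    while (i + lanes <= k) {                         // full vector-width steps
        for (lane in 0 until lanes) {
            acc[lane] = Math.fma(a[aOff + i + lane], bT[bOff + i + lane], acc[lane])
        }
        i += lanes
    }
    var sum = 0f
    for (lane in 0 until lanes) sum += acc[lane]     // plays the role of reduceLanes(ADD)
    while (i < k) {                                  // scalar tail for leftover elements
        sum = Math.fma(a[aOff + i], bT[bOff + i], sum)
        i++
    }
    return sum
}

// C[m x n] = A[m x k] * B, with B supplied transposed as bT[n x k], all row-major.
fun matmulTransposedB(
    a: FloatArray, bT: FloatArray, m: Int, n: Int, k: Int, lanes: Int = 8,
): FloatArray {
    require(a.size == m * k && bT.size == n * k) { "buffer sizes do not match m, n, k" }
    val c = FloatArray(m * n)
    for (row in 0 until m) {
        for (col in 0 until n) {
            c[row * n + col] = laneDot(a, row * k, bT, col * k, k, lanes)
        }
    }
    return c
}
```

Because lane-wise accumulation sums in a different order than a naive loop, results can differ from a scalar reference in the last bits, which is why the parity tests allow a tolerance rather than demanding exact equality.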

Two specifics worth calling out:

  • Lazy provider resolution. DefaultCpuOpsJvm.fp32MatmulKernel and q4kMatmulKernel are by lazy properties. First access triggers KernelServiceLoader.installAll() if the registry is empty, then caches the resolved kernel for the lifetime of the op set. Apps that pre-register custom providers via KernelRegistry.register(…) before constructing the op set bypass the auto-discovery path.

  • Fall-through everywhere. Each routing decision (chooseQuantizedMatmul → chooseMatmul → super.matmul) returns null on a miss, never throws, so adding a new tensor type or a new SPI accessor is purely additive. The MemSeg arena leak fix in PR #556 made every per-op output use Arena.ofAuto(), so even the fast-path branches don’t need explicit lifetime management.
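
Both points combine into a small pattern: a kernel resolved once through by lazy, and a chain of nullable routes joined with the Elvis operator. RoutedOps below is a hypothetical skeleton written for this page; its method names only echo the routing steps above, element-wise multiplication stands in for matmul, and the real dispatch conditions are not reproduced.

```kotlin
// Hypothetical routing skeleton; names echo the text, bodies are assumptions.
class RoutedOps(
    private val resolveKernel: () -> ((FloatArray, FloatArray) -> FloatArray)?,
) {
    // Counts how often resolution ran; exposed only to make the caching observable.
    var resolutions = 0
        private set

    // Resolved on first access, then cached for the lifetime of this op set.
    private val fp32Kernel: ((FloatArray, FloatArray) -> FloatArray)? by lazy {
        resolutions++
        resolveKernel()
    }

    // Each route returns null on a miss instead of throwing.
    @Suppress("UNUSED_PARAMETER")
    private fun chooseQuantized(a: FloatArray, b: FloatArray): FloatArray? =
        null // this sketch has no quantized inputs, so the route always misses

    private fun chooseFp32(a: FloatArray, b: FloatArray): FloatArray? =
        fp32Kernel?.invoke(a, b)

    // Always-available baseline, standing in for the common fallback.
    private fun baseline(a: FloatArray, b: FloatArray): FloatArray =
        FloatArray(a.size) { i -> a[i] * b[i] }

    // Fall-through chain: the first non-null result wins.
    fun multiply(a: FloatArray, b: FloatArray): FloatArray =
        chooseQuantized(a, b) ?: chooseFp32(a, b) ?: baseline(a, b)
}
```

Adding a route means inserting one more nullable function into the chain; existing routes and the baseline stay untouched.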

For shape inference, broadcasting, and the lazy shape-swap transpose specializations on Q4_KTensorData / Q6_KTensorData / MemorySegmentBackedData, see the same source file (DefaultCpuOpsJvm.transpose).

7. Deployment view

  • Maven Central: every module is published as sk.ainet.core:<module>-<target>:<version>. KMP variants land per target (-jvm, -android, -iosarm64, -macosarm64, -linuxx64, -linuxarm64, -js, -wasm-js, -wasm-wasi).

  • Single BOM: sk.ainet:skainet-bom provides a platform() import for downstream Gradle. Note the group is sk.ainet, not sk.ainet.core, so downstream BOMs (e.g. sk.ainet.transformers:skainet-transformers-bom) can import it under the standard umbrella group.

  • Releases: tags 0.X.Y on the release branch trigger .github/workflows/publish.yml → ./gradlew publish on macOS-latest with JDK 25.
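
A downstream build would consume the BOM roughly as follows. This is a sketch for a plain Kotlin/JVM consumer; the version is left as a placeholder, and the artifact id skainet-lang-core is an assumption based on the module directory name in section 5.1, not a confirmed coordinate.

```kotlin
// build.gradle.kts (sketch). Version and artifact id are placeholders / assumptions.
dependencies {
    // The BOM lives in group sk.ainet and pins versions for the sk.ainet.core modules.
    implementation(platform("sk.ainet:skainet-bom:<version>"))

    // No version here: it is resolved from the BOM.
    implementation("sk.ainet.core:skainet-lang-core")
}
```

In a Kotlin Multiplatform consumer the same two lines would go into the dependencies block of the relevant source set instead.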

8. Cross-cutting concepts

  • Numerical parity testing. Every accelerated kernel has a parity test against a scalar reference within a documented tolerance (typically 1e-5 * k or 1e-4 relative). Examples: PanamaVectorMatmulKernelTest, PanamaVectorQ4KMatmulKernelTest, Q6KMatmulTest. The scalar reference is the contract; SIMD speed is a non-functional improvement that must not break the contract.

  • Lazy resource lifetimes. MemorySegmentTensorDataFactory uses Arena.ofAuto() for per-op outputs so output segments are GC-reclaimable. The earlier Arena.ofConfined() builds leaked tens of MB per matmul, which exhausted 32+ GiB of direct memory in inference loops over a 35-layer Gemma 4 forward pass. Fixed in PR #556.

  • Kill switches via system properties. The Vector API code path respects -Dskainet.cpu.vector.enabled=false so a deployment can opt out of incubator code without a recompile. Same pattern for BLAS (-Dskainet.cpu.blas.enabled=true).
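
Reading such a switch takes only a few lines. The property names below come from the text; the helper names and the defaults (vector path on unless disabled, BLAS off unless enabled) are assumptions of this sketch, inferred from the example flag values rather than from the source code.

```kotlin
// Reads a boolean switch from a JVM system property.
// Missing or malformed values fall back to `default` (a choice made for this sketch).
fun flag(name: String, default: Boolean): Boolean =
    when (System.getProperty(name)?.trim()?.lowercase()) {
        "true" -> true
        "false" -> false
        else -> default
    }

// Assumed default: the Vector API path is on unless explicitly disabled.
fun vectorEnabled(): Boolean = flag("skainet.cpu.vector.enabled", default = true)

// Assumed default: BLAS is off unless explicitly enabled.
fun blasEnabled(): Boolean = flag("skainet.cpu.blas.enabled", default = false)
```

Evaluating the flag once at provider construction, rather than on every op, keeps the check off the hot path.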

9. Architecture decisions

Decision | Date | Rationale
Kernel SPI parallel to BackendProvider | 2026-04 (PR #554) | Matmul / SDPA are model-agnostic; isolating them lets bench harnesses time the SIMD loop directly and lets a future native provider register without touching the op layer.
KernelProvider.matmulQ4K() accessor with default null | 2026-04 (PR #562) | Backwards compat for existing providers (Scalar) without forcing every implementation to override. Same pattern will be used for Q6KMatmulKernel / Q4KMemSegMatmulKernel sibling SPIs.
ServiceLoader auto-discovery deferred until 2 providers exist | 2026-04 (PR #559) | Single-provider auto-discovery would have been ceremony for nothing; once Panama landed alongside Scalar, the trigger condition was met.
FFM (not JNI) for any future native code | roadmap M5 | JNI’s per-call overhead and global lock are wrong for hot per-token kernels.
Antora docs (Diátaxis), not GitHub Wiki | 2025 | Source-controlled, branchable, ranked higher in search than wikis, ships with the repo.

10. Quality requirements

  • Performance. Panama FP32 matmul ≥1.5× scalar (M5 metric: met, ~10× at 1024² on Apple Silicon NEON). Native Q4_K matmul ≥2.5× scalar dequant baseline (M5 metric: deferred to native FFM PR).

  • Numerical equivalence. Every SIMD kernel matches its scalar reference within FP-rounding tolerance (1e-5 * k for matmul, 1e-4 relative for quantized). Pinned by parity tests.

  • Multi-target buildability. ./gradlew allTests (the release gate) must pass on every KMP target: JVM, JS, Wasm, macosArm64, iosSimulatorArm64, linuxX64, linuxArm64, Android.
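
The two tolerance rules in the numerical-equivalence requirement translate into small comparison helpers. The function names are invented for this sketch, and the floor of 1.0 in the relative check (to avoid an unreachably tight bound near zero) is an assumption; the project's parity tests may compute their bounds differently.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Absolute tolerance that grows with the reduction length k (the FP32 matmul rule, 1e-5 * k).
fun withinMatmulTolerance(reference: FloatArray, candidate: FloatArray, k: Int): Boolean {
    require(reference.size == candidate.size) { "size mismatch" }
    val tol = 1e-5f * k
    return reference.indices.all { i -> abs(reference[i] - candidate[i]) <= tol }
}

// Relative tolerance (the quantized rule, 1e-4 relative). The floor of 1.0 on the
// scale is an assumption of this sketch so that near-zero references stay checkable.
fun withinRelativeTolerance(
    reference: FloatArray,
    candidate: FloatArray,
    rel: Float = 1e-4f,
): Boolean {
    require(reference.size == candidate.size) { "size mismatch" }
    return reference.indices.all { i ->
        abs(reference[i] - candidate[i]) <= rel * max(1f, abs(reference[i]))
    }
}
```

Scaling the absolute bound by k reflects that rounding error accumulates with the number of multiply-add steps per output cell.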

11. Risks and technical debt

  • Vector API still incubator on JDK 25. JEP 508 keeps it that way through 2026; we depend on it heavily. If it breaks API in a future JDK, every JvmVectorKernels / JvmQuantizedVectorKernels file needs adjustment. Mitigation: thin wrappers, parity tests are a canary.

  • No native FFM provider yet. The literal M5 milestone metric (≥2.5× for Q4_K) is met by Panama in absolute terms but not in the "native vs JVM" framing the metric originally specified. The priority-100 native provider is designed but not yet shipped.

  • Two reverted optimizations on develop history. MemSeg pool (commit 8642b322) and intra-op matmul parallelism (commit 9ed633b6) were both tried and reverted. Re-attempts need a different strategy than what was tried; the revert commits explain why.

12. Glossary (selected)

Term | Meaning
FFM | Foreign Function & Memory API (JEP 442 et seq.). Java 22 stable, 21 preview. Replaces JNI for native interop.
FMA | Fused multiply-add (a · b + c in one instruction). Supported by every modern x86_64 (FMA3) and ARMv8 CPU.
ggml | The C library underpinning llama.cpp; defines the canonical Q4_K / Q6_K / Q8_0 block layouts SKaiNET uses for GGUF compatibility.
Lazy transpose | Specialization in DefaultCpuOpsJvm.transpose that swaps a tensor’s shape without reordering its packed bytes. It works because the matmul kernel’s byte-offset math is symmetric across the swap.
MemSeg | Short for java.lang.foreign.MemorySegment. Off-heap memory abstraction used for mmap’d weight buffers.
Panama | Codename for the JDK Vector API (jdk.incubator.vector) and FFM. Both originate from Project Panama.
SPI | Service provider interface: a public interface with multiple registered implementations, looked up at runtime. SKaiNET uses it for backends and now for kernels.