# Architecture Overview

## The Three-Layer Stack
The Coral NPU stack is split across three codebases that map to three distinct abstraction layers. Each layer has a single responsibility and communicates with the next through a well-defined artifact format.
### Why three layers?
Each layer operates in a different language ecosystem with different constraints:

- **Layer 1 (SKaiNET):** Kotlin Multiplatform targets JVM, Native, and WASM. Model authors work in type-safe Kotlin with IDE support. The compilation module exports standard StableHLO MLIR — the same IR that JAX, TensorFlow, and PyTorch/XLA produce. This means models defined in SKaiNET are interoperable with the broader MLIR ecosystem.
- **Layer 2 (iree-tools):** Python handles the MLIR-to-C transpilation. This layer exists because the intended IREE Coral NPU plugin is not available in the open-source release. The Python transpiler bridges the gap by parsing StableHLO text and emitting C source that follows the `coralnpu_v2_binary` Bazel conventions.
- **Layer 3 (coralnpu):** Bazel manages the cross-compilation toolchain, linker scripts, CRT startup code, and simulator infrastructure. This is Google's open-source release — the hardware design itself in Chisel/Verilog, plus everything needed to compile and simulate bare-metal programs.
## The Artifact Chain
Each layer produces a concrete artifact that the next layer consumes:
| Artifact | Format | Size (grayscale) | Contents |
|---|---|---|---|
| Exported model | StableHLO text | ~500 bytes | Tensor ops in MLIR text form |
| Transpiled kernel | C source | ~800 bytes | C following the `coralnpu_v2_binary` conventions |
| Compiled program | RISC-V ELF | ~8 KB | Machine code + CRT + BSS init + linker symbols |
| Sim output | NumPy array | 64 bytes | Result tensor read back from the simulator |
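As a rough sanity check on the 64-byte simulator output: that is exactly what a small float32 tensor occupies. A minimal sketch (the 4×4 shape is an illustrative assumption, not taken from the source):

```python
import numpy as np

# 64 bytes of float32 simulator output = 16 elements,
# e.g. a 4x4 tensor (the shape here is an assumption).
sim_output = np.zeros((4, 4), dtype=np.float32)
print(sim_output.nbytes)  # 64
```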
## SKaiNET Internal Architecture
SKaiNET is organized as a layer cake: each module depends only on the layers below it.
### The ML vs. Orchestration Boundary
A critical architectural insight: the agentic AI layer (`AgentLoop`, `ChatTemplate`, `ToolRegistry`) contains zero trainable parameters. It is pure control flow — deciding when to call the model, not what the model says. The boundary is the `InferenceRuntime<T>.forward()` method:
- **Above the boundary:** chat formatting, tool call parsing, agent loop, JSON parsing — software engineering orchestration
- **Below the boundary:** embedding lookup, RoPE attention, SiLU-gated FFN, RMSNorm, softmax sampling — deep learning math
This separation means the same `LlamaRuntime<T>` powers both `--chat` (one-shot text completion) and `--agent` (autonomous multi-step reasoning) modes.
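The boundary can be sketched in a few lines. This is not SKaiNET's actual API (the names `forward`, `agent_loop`, and the `TOOL:`/`DONE:` reply format are illustrative assumptions); the point is that everything above the boundary is parameter-free control flow:

```python
# Sketch of the ML/orchestration boundary (names and formats are
# illustrative, not SKaiNET's API). The agent loop holds no trainable
# parameters: it only decides WHEN to call the model.

def forward(prompt: str) -> str:
    """Stand-in for InferenceRuntime<T>.forward(): all deep-learning
    math (embeddings, attention, FFN, sampling) lives behind this call."""
    return 'TOOL:add(2,3)' if 'sum' in prompt else 'DONE:5'

TOOLS = {'add': lambda a, b: a + b}

def agent_loop(prompt: str, max_steps: int = 4) -> str:
    """Pure control flow above the boundary: format, call, parse, repeat."""
    for _ in range(max_steps):
        reply = forward(prompt)
        if reply.startswith('DONE:'):
            return reply[len('DONE:'):]
        # Parse a tool call and feed the result back into the next prompt.
        name, args = reply[len('TOOL:'):].split('(', 1)
        a, b = (int(x) for x in args.rstrip(')').split(','))
        prompt = f'tool result: {TOOLS[name](a, b)}'
    return ''

print(agent_loop('what is the sum of 2 and 3?'))  # prints "5"
```

Swapping in a different `forward` (chat completion vs. multi-step agent) changes nothing above the boundary, which is the property the section describes.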
## Design Decisions and Trade-offs

### Why StableHLO as the IR?
StableHLO is the standard portable IR for ML computations. By exporting to StableHLO rather than a custom format, SKaiNET models can be:
- Verified using standard IREE tooling (`iree-compile` + `iree-run-module` on host)
- Consumed by any StableHLO-compatible compiler (XLA, IREE, custom backends)
- Debugged using standard MLIR tools (`mlir-opt`, `mlir-translate`)
The trade-off is that StableHLO is a high-level IR. It represents what to compute (convolutions, element-wise ops) but not how (tiling, vectorization, memory layout). The lowering from StableHLO to efficient bare-metal code is where the missing IREE plugin matters.
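To make the what-vs-how gap concrete, here is a hand-lowering sketch in plain Python (illustrative only, not compiler output): a StableHLO-level element-wise add says nothing about loop order or memory layout, so the lowering has to choose both.

```python
# StableHLO level ("what"): %0 = stablehlo.add %a, %b : tensor<2x3xf32>
# Lowered level ("how"): an explicit loop nest over a chosen layout.

def lowered_add(a, b, rows, cols):
    """Naive lowering of an element-wise add over flat buffers."""
    out = [0.0] * (rows * cols)
    for i in range(rows):          # loop order: rows outer...
        for j in range(cols):      # ...columns inner (a lowering decision)
            idx = i * cols + j     # row-major layout (another decision)
            out[idx] = a[idx] + b[idx]
    return out

a = [1.0] * 6
b = [2.0] * 6
print(lowered_add(a, b, 2, 3))  # six 3.0 values
```

Tiling, vectorization, and layout choices like these are exactly what the missing IREE plugin would have made for the NPU.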
### Why a Python transpiler instead of IREE?
The IREE compiler can compile StableHLO to RISC-V machine code, but it wraps the result in a VM/HAL runtime that requires ~50 KB of support code — far exceeding the NPU’s 8 KB ITCM. The Python transpiler produces minimal C code that compiles to ~2 KB of machine code, fitting comfortably in tightly-constrained memory.
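The parse-and-emit approach can be sketched as follows. This is not the actual iree-tools code; the op pattern, function name, and single-op scope are assumptions for illustration. The key property is that the emitted C has no runtime dependencies at all:

```python
import re

# Minimal illustration of the transpiler's approach (not the actual
# iree-tools implementation): match one StableHLO op in MLIR text and
# emit a plain-C loop with zero runtime support code.

STABLEHLO = '%0 = stablehlo.add %arg0, %arg1 : tensor<16xf32>'

def emit_c(line: str) -> str:
    m = re.match(r'%\w+ = stablehlo\.(\w+) %\w+, %\w+ : tensor<(\d+)xf32>',
                 line)
    op, n = m.group(1), int(m.group(2))
    c_op = {'add': '+', 'multiply': '*'}[op]
    return (f'void kernel(const float* a, const float* b, float* out) {{\n'
            f'  for (int i = 0; i < {n}; ++i) out[i] = a[i] {c_op} b[i];\n'
            f'}}\n')

print(emit_c(STABLEHLO))
```

Because the output is freestanding C, its footprint is just the compiled loop bodies, which is how the result stays in the low-kilobyte range.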
See The Missing IREE Plugin for the full analysis.
### Why bare-metal C and not IREE VMFB?
The Coral NPU uses a run-to-completion execution model with no OS, no interrupts, and no dynamic memory allocation. Programs start at `_start`, execute `main()`, and halt with `ebreak`. The IREE VM runtime assumes a host OS with memory allocation, file I/O, and thread support — none of which exist on the NPU.
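The run-to-completion contract can be modeled in a few lines (a toy sketch, not the simulator's API; `run_to_completion` and the `'HALT'` sentinel standing in for `ebreak` are assumptions):

```python
# Toy model of the run-to-completion contract: a program is a fixed
# sequence of steps executed once over statically allocated memory,
# ending in a halt -- no OS, no heap, no interrupts to resume from.

def run_to_completion(program, memory):
    """Execute each step once, stopping at HALT (the ebreak analogue)."""
    for step in program:
        if step == 'HALT':
            return memory
        step(memory)          # each step mutates preallocated memory
    raise RuntimeError('program fell off the end without halting')

memory = [0] * 4                            # statically sized, no malloc
program = [lambda m: m.__setitem__(0, 42), 'HALT']
print(run_to_completion(program, memory))   # [42, 0, 0, 0]
```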