# Architecture Overview

## The Three-Layer Stack
The Coral NPU stack is split across three codebases that map to three distinct abstraction layers. Each layer has a single responsibility and communicates with the next through a well-defined artifact format.
### Why three layers?
Each layer operates in a different language ecosystem with different constraints:

- **Layer 1 (SKaiNET):** Kotlin Multiplatform targets JVM, Native, and WASM. Model authors work in type-safe Kotlin with IDE support. The compilation module exports standard StableHLO MLIR — the same IR that JAX, TensorFlow, and PyTorch/XLA produce. This means models defined in SKaiNET are interoperable with the broader MLIR ecosystem.
- **Layer 2 (iree-tools):** Python handles the MLIR-to-C transpilation. This layer exists because the intended IREE Coral NPU plugin is not available in the open-source release. The Python transpiler bridges the gap by parsing StableHLO text and emitting C source that follows the `coralnpu_v2_binary` Bazel conventions.
- **Layer 3 (coralnpu):** Bazel manages the cross-compilation toolchain, linker scripts, CRT startup code, and simulator infrastructure. This is Google's open-source release — the hardware design itself in Chisel/Verilog, plus everything needed to compile and simulate bare-metal programs.
## The Artifact Chain
Each layer produces a concrete artifact that the next layer consumes:
| Artifact | Format | Size (grayscale) | Contents |
|---|---|---|---|
| Exported model | StableHLO text | ~500 bytes | Tensor ops in MLIR text form |
| Transpiled kernel | C source | ~800 bytes | C following the `coralnpu_v2_binary` conventions |
| Compiled program | RISC-V ELF | ~8 KB | Machine code + CRT + BSS init + linker symbols |
| Sim output | NumPy array | 64 bytes | Result tensor read back from the simulator |
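As a rough sanity check on the 64-byte simulator output: that is exactly what a small float32 tensor occupies. A minimal sketch (the 4×4 shape is an illustrative assumption, not taken from the source):

```python
import numpy as np

# 64 bytes of float32 simulator output = 16 elements,
# e.g. a 4x4 tensor (the shape here is an assumption).
sim_output = np.zeros((4, 4), dtype=np.float32)
print(sim_output.nbytes)  # 64
```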
## SKaiNET Internal Architecture
SKaiNET is organized as a layer cake: each module depends only on the layers below it.
### The ML vs. Orchestration Boundary
A critical architectural insight: the agentic AI layer (`AgentLoop`, `ChatTemplate`, `ToolRegistry`) contains zero trainable parameters. It is pure control flow — deciding when to call the model, not what the model says. The boundary is the `InferenceRuntime<T>.forward()` method:
- **Above the boundary:** chat formatting, tool call parsing, agent loop, JSON parsing — software engineering orchestration
- **Below the boundary:** embedding lookup, RoPE attention, SiLU-gated FFN, RMSNorm, softmax sampling — deep learning math
This separation means the same `LlamaRuntime<T>` powers both `--chat` (one-shot text completion) and `--agent` (autonomous multi-step reasoning) modes.
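The boundary can be sketched in a few lines. This is not SKaiNET's actual API (the names `forward`, `agent_loop`, and the `TOOL:`/`DONE:` reply format are illustrative assumptions); the point is that everything above the boundary is parameter-free control flow:

```python
# Sketch of the ML/orchestration boundary (names and formats are
# illustrative, not SKaiNET's API). The agent loop holds no trainable
# parameters: it only decides WHEN to call the model.

def forward(prompt: str) -> str:
    """Stand-in for InferenceRuntime<T>.forward(): all deep-learning
    math (embeddings, attention, FFN, sampling) lives behind this call."""
    return 'TOOL:add(2,3)' if 'sum' in prompt else 'DONE:5'

TOOLS = {'add': lambda a, b: a + b}

def agent_loop(prompt: str, max_steps: int = 4) -> str:
    """Pure control flow above the boundary: format, call, parse, repeat."""
    for _ in range(max_steps):
        reply = forward(prompt)
        if reply.startswith('DONE:'):
            return reply[len('DONE:'):]
        # Parse a tool call and feed the result back into the next prompt.
        name, args = reply[len('TOOL:'):].split('(', 1)
        a, b = (int(x) for x in args.rstrip(')').split(','))
        prompt = f'tool result: {TOOLS[name](a, b)}'
    return ''

print(agent_loop('what is the sum of 2 and 3?'))  # prints "5"
```

Swapping in a different `forward` (chat completion vs. multi-step agent) changes nothing above the boundary, which is the property the section describes.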
## Design Decisions and Trade-offs

### Why StableHLO as the IR?
StableHLO is the standard portable IR for ML computations. By exporting to StableHLO rather than a custom format, SKaiNET models can be:
- Verified using standard IREE tooling (`iree-compile` + `iree-run-module` on host)
- Consumed by any StableHLO-compatible compiler (XLA, IREE, custom backends)
- Debugged using standard MLIR tools (`mlir-opt`, `mlir-translate`)
The trade-off is that StableHLO is a high-level IR. It represents what to compute (convolutions, element-wise ops) but not how (tiling, vectorization, memory layout). The lowering from StableHLO to efficient bare-metal code is where the missing IREE plugin matters.
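To make the what-vs-how gap concrete, here is a hand-lowering sketch in plain Python (illustrative only, not compiler output): a StableHLO-level element-wise add says nothing about loop order or memory layout, so the lowering has to choose both.

```python
# StableHLO level ("what"): %0 = stablehlo.add %a, %b : tensor<2x3xf32>
# Lowered level ("how"): an explicit loop nest over a chosen layout.

def lowered_add(a, b, rows, cols):
    """Naive lowering of an element-wise add over flat buffers."""
    out = [0.0] * (rows * cols)
    for i in range(rows):          # loop order: rows outer...
        for j in range(cols):      # ...columns inner (a lowering decision)
            idx = i * cols + j     # row-major layout (another decision)
            out[idx] = a[idx] + b[idx]
    return out

a = [1.0] * 6
b = [2.0] * 6
print(lowered_add(a, b, 2, 3))  # six 3.0 values
```

Tiling, vectorization, and layout choices like these are exactly what the missing IREE plugin would have made for the NPU.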
### Why a Python transpiler instead of IREE?
The IREE compiler can compile StableHLO to RISC-V machine code, but it wraps the result in a VM/HAL runtime that requires ~50 KB of support code — far exceeding the NPU’s 8 KB ITCM. The Python transpiler produces minimal C code that compiles to ~2 KB of machine code, fitting comfortably in tightly-constrained memory.
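The parse-and-emit approach can be sketched as follows. This is not the actual iree-tools code; the op pattern, function name, and single-op scope are assumptions for illustration. The key property is that the emitted C has no runtime dependencies at all:

```python
import re

# Minimal illustration of the transpiler's approach (not the actual
# iree-tools implementation): match one StableHLO op in MLIR text and
# emit a plain-C loop with zero runtime support code.

STABLEHLO = '%0 = stablehlo.add %arg0, %arg1 : tensor<16xf32>'

def emit_c(line: str) -> str:
    m = re.match(r'%\w+ = stablehlo\.(\w+) %\w+, %\w+ : tensor<(\d+)xf32>',
                 line)
    op, n = m.group(1), int(m.group(2))
    c_op = {'add': '+', 'multiply': '*'}[op]
    return (f'void kernel(const float* a, const float* b, float* out) {{\n'
            f'  for (int i = 0; i < {n}; ++i) out[i] = a[i] {c_op} b[i];\n'
            f'}}\n')

print(emit_c(STABLEHLO))
```

Because the output is freestanding C, its footprint is just the compiled loop bodies, which is how the result stays in the low-kilobyte range.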
See The Missing IREE Plugin for the full analysis.
### Why bare-metal C and not IREE VMFB?
The Coral NPU uses a run-to-completion execution model with no OS, no interrupts, and no dynamic memory allocation. Programs start at `_start`, execute `main()`, and halt with `ebreak`. The IREE VM runtime assumes a host OS with memory allocation, file I/O, and thread support — none of which exist on the NPU.
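The run-to-completion contract can be modeled in a few lines (a toy sketch, not the simulator's API; `run_to_completion` and the `'HALT'` sentinel standing in for `ebreak` are assumptions):

```python
# Toy model of the run-to-completion contract: a program is a fixed
# sequence of steps executed once over statically allocated memory,
# ending in a halt -- no OS, no heap, no interrupts to resume from.

def run_to_completion(program, memory):
    """Execute each step once, stopping at HALT (the ebreak analogue)."""
    for step in program:
        if step == 'HALT':
            return memory
        step(memory)          # each step mutates preallocated memory
    raise RuntimeError('program fell off the end without halting')

memory = [0] * 4                            # statically sized, no malloc
program = [lambda m: m.__setitem__(0, 42), 'HALT']
print(run_to_completion(program, memory))   # [42, 0, 0, 0]
```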