Architecture Overview

The Three-Layer Stack

The Coral NPU stack is split across three codebases that map to three distinct abstraction layers. Each layer has a single responsibility and communicates with the next through a well-defined artifact format.

From model definition down to the hardware:

Layer 1 — Model Definition (SKaiNET, Kotlin)

  • Tensor DSL (dag { }) → Tape Recording (trace execution) → Compute Graph (DAG analysis) → StableHLO Converter (type mapping) → Optimizer (const fold, fusion, DCE)

  • Produces: .mlir file

Layer 2 — Transpilation (iree-tools, Python)

  • MLIR Parser (regex-based) → IR Dataclasses (Module, FuncDef, Op) → C Code Generator (coralnpu conventions) → Bazel Builder (writes .cc + BUILD)

  • Produces: .cc + BUILD.bazel

Layer 3 — Hardware Target (coralnpu, Bazel/C)

  • RV32 Toolchain (Clang cross-compile) → CRT Startup (coralnpu_start.S) → Linker Script (ITCM/DTCM/EXTMEM) → MPACT Simulator (behavioral execution)

  • Produces: .elf, executed by the simulator

Why three layers?

Each layer operates in a different language ecosystem with different constraints:

Layer 1 (SKaiNET)

Kotlin Multiplatform targets JVM, Native, and WASM. Model authors work in type-safe Kotlin with IDE support. The compilation module exports standard StableHLO MLIR — the same IR that JAX, TensorFlow, and PyTorch/XLA produce. This means models defined in SKaiNET are interoperable with the broader MLIR ecosystem.

Layer 2 (iree-tools)

Python handles the MLIR-to-C transpilation. This layer exists because the intended IREE Coral NPU plugin is not available in the open-source release. The Python transpiler bridges the gap by parsing StableHLO text and emitting C source that follows the coralnpu_v2_binary Bazel conventions.
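To make the "parsing StableHLO text" step concrete, here is a minimal sketch of a regex-based function-header parser in the spirit of the transpiler. The regex, the FuncDef dataclass fields, and the function name parse_func_header are illustrative assumptions, not the actual iree-tools API:

```python
import re
from dataclasses import dataclass, field

@dataclass
class FuncDef:
    # Hypothetical stand-in for the transpiler's IR dataclasses.
    name: str
    arg_types: list = field(default_factory=list)
    ret_type: str = ""

# Matches a StableHLO function header such as:
#   func.func @rgb2grayscale(%arg0: tensor<4x4x3xf32>) -> tensor<4x4x1xf32> {
FUNC_RE = re.compile(
    r"func\.func\s+@(?P<name>\w+)\s*\((?P<args>[^)]*)\)\s*->\s*(?P<ret>\S+)"
)

def parse_func_header(mlir_text: str) -> FuncDef:
    m = FUNC_RE.search(mlir_text)
    if m is None:
        raise ValueError("no func.func found")
    # Pull out each argument's tensor type from the argument list.
    arg_types = re.findall(r"%\w+:\s*(tensor<[^>]*>)", m.group("args"))
    return FuncDef(m.group("name"), arg_types, m.group("ret"))

src = "func.func @rgb2grayscale(%arg0: tensor<4x4x3xf32>) -> tensor<4x4x1xf32> {"
f = parse_func_header(src)
print(f.name, f.arg_types, f.ret_type)
```

The same pattern-match-then-dataclass approach extends to constants and op lines; a full MLIR parser is unnecessary because the StableHLO emitted by SKaiNET is syntactically regular.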

Layer 3 (coralnpu)

Bazel manages the cross-compilation toolchain, linker scripts, CRT startup code, and simulator infrastructure. This is Google’s open-source release — the hardware design itself in Chisel/Verilog, plus everything needed to compile and simulate bare-metal programs.

Data Flow: Model Definition to Simulation Result

The pipeline, end to end:

1. Model author defines the model in the dag { } DSL (SKaiNET).

2. SKaiNET tape-records the execution and builds the computation graph.

3. The graph is converted to StableHLO and optimization passes are applied.

4. SKaiNET exports the .mlir file.

5. uv run python main.py run-all model.mlir invokes iree-tools.

6. iree-tools parses the MLIR text (regex), builds IR dataclasses, and generates C source code.

7. The .cc + BUILD.bazel files are written to coralnpu/examples/generated/.

8. coralnpu cross-compiles (Clang, rv32imf) and links with the CRT + linker script.

9. The MPACT simulator loads the .elf binary and executes _start → main() → ebreak.

10. The output arrays are read back from memory.
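The C-generation step in the middle of this flow amounts to filling a template from the parsed IR. A deliberately simplified sketch (the function name emit_c and its signature are hypothetical; the real generator walks the IR ops and emits full loop nests):

```python
def emit_c(func_name: str, in_elems: int, out_elems: int) -> str:
    # Hypothetical, simplified emitter: real codegen walks the IR ops and
    # generates nested loops following the coralnpu_v2_binary conventions.
    return "\n".join([
        f'__attribute__((section(".data"))) float input_0[{in_elems}];',
        f'__attribute__((section(".data"))) float output_0[{out_elems}];',
        "int main(void) {",
        f"  // ... loop nest implementing {func_name} ...",
        "  return 0;",
        "}",
    ])

print(emit_c("rgb2grayscale", 48, 16))
```

Placing the arrays in the .data section gives them fixed, linker-visible addresses, which is what lets the simulator locate input_0 and output_0 by symbol after the run.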

The Artifact Chain

Each layer produces a concrete artifact that the next layer consumes:

Artifact     Format           Size (grayscale)   Contents
.mlir        StableHLO text   ~500 bytes         func.func @rgb2grayscale with stablehlo.convolution, constants, types
.cc          C source         ~800 bytes         float input[], float output[], nested loops, __attribute__((section(".data")))
.elf         RISC-V ELF       ~8 KB              machine code + CRT + BSS init + linker symbols (input_0, output_0)
Sim output   NumPy array      64 bytes           float32[16]: the grayscale pixel values read from DTCM
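Reading the final artifact is just a matter of interpreting 64 raw DTCM bytes as little-endian float32 values. A stdlib-only sketch (the project reads them into a NumPy array; the function name read_output is illustrative):

```python
import struct

def read_output(mem_bytes: bytes, count: int = 16) -> list:
    # Interpret `count` little-endian float32 values, as the simulator
    # reads them from the output array's address in DTCM.
    return list(struct.unpack(f"<{count}f", mem_bytes[: 4 * count]))

# 16 grayscale pixels -> 64 bytes, round-tripped here for illustration.
raw = struct.pack("<16f", *range(16))
vals = read_output(raw)
print(len(vals), vals[0], vals[15])
```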

SKaiNET Internal Architecture

SKaiNET is organized as a layer cake where each module depends only on layers below it:

From the top of the cake down:

  • Application Layer: skainet-kllama-cli (--chat / --agent), skainet-grayscale-cli

  • Agentic AI Layer (orchestration, not ML): AgentLoop<T>, ChatTemplate, ToolRegistry

  • Inference Layer (ML forward pass): LlamaRuntime<T>, AttentionBackend<T>, KvCache

  • Compilation Layer: Tape Recording → Graph Optimization → StableHLO Lowering → C99 Codegen

  • Tensor Primitives: Tensor<T,V>, Shape, DType, Embedding, Linear, RMSNorm

  • Backend: DirectCpuExecutionContext (JDK 21 Vector API SIMD)

The ML vs. Orchestration Boundary

A critical architectural insight: the agentic AI layer (AgentLoop, ChatTemplate, ToolRegistry) contains zero trainable parameters. It is pure control flow — deciding when to call the model, not what the model says. The boundary is the InferenceRuntime<T>.forward() method:

  • Above the boundary: chat formatting, tool call parsing, agent loop, JSON parsing — software engineering orchestration

  • Below the boundary: embedding lookup, RoPE attention, SiLU-gated FFN, RMSNorm, softmax sampling — deep learning math

This separation means the same LlamaRuntime<T> powers both --chat (one-shot text completion) and --agent (autonomous multi-step reasoning) modes.
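The boundary can be made concrete with a stub: everything above forward() is plain control flow, so the same loop drives a toy model exactly as it would a real runtime. This is a Python sketch, not SKaiNET's Kotlin API; agent_loop, stub_forward, and the FINAL: stop token are all illustrative:

```python
from typing import Callable

def agent_loop(forward: Callable[[str], str], prompt: str, max_steps: int = 4) -> str:
    # Orchestration only: no trainable parameters live in this function.
    transcript = prompt
    for _ in range(max_steps):
        reply = forward(transcript)      # the ML boundary: everything below is math
        transcript += "\n" + reply
        if "FINAL:" in reply:            # stop/tool-call detection is pure string logic
            return reply.split("FINAL:", 1)[1].strip()
    return "gave up"

# A stub "model": any callable below the boundary slots in unchanged.
def stub_forward(ctx: str) -> str:
    return "thinking..." if ctx.count("thinking") < 2 else "FINAL: 42"

print(agent_loop(stub_forward, "What is 6*7?"))
```

Because the loop only ever sees a Callable, swapping the stub for a real forward pass changes nothing above the boundary, which is exactly why one runtime can serve both modes.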

Design Decisions and Trade-offs

Why StableHLO as the IR?

StableHLO is the standard portable IR for ML computations. By exporting to StableHLO rather than a custom format, SKaiNET models can be:

  • Verified using standard IREE tooling (iree-compile + iree-run-module on host)

  • Consumed by any StableHLO-compatible compiler (XLA, IREE, custom backends)

  • Debugged using standard MLIR tools (mlir-opt, mlir-translate)

The trade-off is that StableHLO is a high-level IR. It represents what to compute (convolutions, element-wise ops) but not how (tiling, vectorization, memory layout). The lowering from StableHLO to efficient bare-metal code is where the missing IREE plugin matters.

Why a Python transpiler instead of IREE?

The IREE compiler can compile StableHLO to RISC-V machine code, but it wraps the result in a VM/HAL runtime that requires ~50 KB of support code — far exceeding the NPU’s 8 KB ITCM. The Python transpiler produces minimal C code that compiles to ~2 KB of machine code, fitting comfortably within the tightly constrained on-chip memory.

See The Missing IREE Plugin for the full analysis.

Why bare-metal C and not IREE VMFB?

The Coral NPU uses a run-to-completion execution model with no OS, no interrupts, and no dynamic memory allocation. Programs start at _start, execute main(), and halt with ebreak. The IREE VM runtime assumes a host OS with memory allocation, file I/O, and thread support — none of which exist on the NPU.