DSL to NPU Pipeline
The Full Compilation Pipeline
This page traces a single model — RGB-to-grayscale conversion — through every transformation stage, showing the exact representation at each step.
Stage 1: Kotlin DSL
The model is defined as a pure function on tensors:
fun Tensor<Float32, Shape4D>.rgb2GrayScaleMatMul(): Tensor<Float32, Shape4D> {
    val grayWeights = constant(
        floatArrayOf(0.299f, 0.587f, 0.114f),
        Shape1D(3)
    ).reshape(Shape2D(3, 1))
    val reshaped = this.transpose(intArrayOf(0, 2, 3, 1)) // NCHW → NHWC
    val gray = reshaped.matmul(grayWeights)               // [B,H,W,3] × [3,1]
    return gray.transpose(intArrayOf(0, 3, 1, 2))         // NHWC → NCHW
}
The Kotlin type system enforces tensor shapes at compile time. Tensor<Float32, Shape4D> carries both the element type (Float32) and the dimensionality (Shape4D = batch, channel, height, width) as type parameters.
An alternative formulation uses 1x1 convolution instead of matmul — this is what the --backend=hlo-export path actually produces:
// Equivalent: 1x1 convolution with 3 input channels, 1 output channel
// Kernel weights [C_OUT=1, C_IN=3, KH=1, KW=1] = [0.299, 0.587, 0.114]
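The two formulations are numerically identical; a NumPy sketch (illustration only, not part of the pipeline) confirms the equivalence:

```python
import numpy as np

# Random NCHW image: batch=1, channels=3, 4x4 spatial
rng = np.random.default_rng(0)
x = rng.random((1, 3, 4, 4), dtype=np.float32)
w = np.array([0.299, 0.587, 0.114], dtype=np.float32)

# Matmul formulation: NCHW -> NHWC, multiply by a [3,1] weight, back to NCHW
nhwc = x.transpose(0, 2, 3, 1)                                # [1,4,4,3]
gray_matmul = (nhwc @ w.reshape(3, 1)).transpose(0, 3, 1, 2)  # [1,1,4,4]

# 1x1 convolution formulation: weighted sum over the channel axis
gray_conv = np.einsum('bchw,c->bhw', x, w)[:, None, :, :]     # [1,1,4,4]

assert np.allclose(gray_matmul, gray_conv)
```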
Stage 2: Tape Recording
When the model function executes, SKaiNET’s skainet-compile-core module records every operation onto a tape — a linear trace of the computation. This is similar to PyTorch’s autograd tape or JAX’s tracing mechanism.
The tape records operations in execution order. The skainet-compile-dag module then converts this linear trace into a DAG (directed acyclic graph), resolving data dependencies and enabling optimization.
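The idea can be sketched in a few lines of Python (hypothetical names, not the actual skainet-compile-core API):

```python
from dataclasses import dataclass

@dataclass
class TapeOp:
    name: str      # e.g. "transpose", "matmul"
    inputs: list   # ids of the ops that produced this op's operands
    id: int = 0

class Tape:
    """Records ops in execution order; a later pass resolves dependencies."""
    def __init__(self):
        self.ops = []

    def record(self, name, inputs):
        op = TapeOp(name, [i.id for i in inputs], id=len(self.ops))
        self.ops.append(op)
        return op

def to_dag(tape):
    """Linear trace -> adjacency list keyed by op id (edges point at producers)."""
    return {op.id: op.inputs for op in tape.ops}

tape = Tape()
a = tape.record("input", [])
b = tape.record("transpose", [a])
c = tape.record("matmul", [b])
# to_dag(tape) == {0: [], 1: [0], 2: [1]}
```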
Stage 3: StableHLO MLIR Export
The StableHloConverter in skainet-compile-hlo transforms the computation graph into StableHLO MLIR text. This is a direct mapping — each graph node becomes one or more StableHLO operations.
For the grayscale model, the output is:
module {
  func.func @rgb2grayscale(%arg0: tensor<1x3x4x4xf32>) -> (tensor<1x1x4x4xf32>) {
    // Luminance weights as a 1x1 convolution kernel
    %v0 = stablehlo.constant dense<[[[[0.2989]], [[0.587]], [[0.114]]]]>
        : tensor<1x3x1x1xf32>
    // 1x1 convolution: 3 input channels → 1 output channel
    %v1 = stablehlo.convolution(%arg0, %v0)
        dim_numbers = [b, f, 0, 1]x[o, i, 0, 1]->[b, f, 0, 1],
        window = {stride = [1, 1], pad = [[0, 0], [0, 0]], rhs_dilate = [1, 1]}
        {batch_group_count = 1 : i64, feature_group_count = 1 : i64}
        : (tensor<1x3x4x4xf32>, tensor<1x3x1x1xf32>) -> tensor<1x1x4x4xf32>
    return %v1 : tensor<1x1x4x4xf32>
  }
}
Key StableHLO Concepts
- dim_numbers: Specifies the data layout convention. [b, f, 0, 1] means batch first, then features (channels), then spatial dimensions. This is NCHW layout.
- stablehlo.convolution: A high-level operation that computes the convolution without specifying how: no loop nests, no tiling, no memory access patterns. The compiler (or transpiler) is responsible for lowering this to concrete code.
- tensor<1x3x1x1xf32>: The kernel shape encodes a 1x1 convolution: spatial dimensions are both 1, with 3 input channels and 1 output channel (the outer dimension).
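The output shape follows from standard convolution arithmetic; a small Python check (illustrative helper, not part of the transpiler):

```python
def conv_out_shape(in_shape, kernel_shape, stride=(1, 1), pad=(0, 0)):
    """NCHW input, OIHW kernel, matching the dim_numbers convention above."""
    n, c_in, h, w = in_shape
    c_out, c_in_k, kh, kw = kernel_shape
    assert c_in == c_in_k, "input channels must match kernel input channels"
    oh = (h + 2 * pad[0] - kh) // stride[0] + 1
    ow = (w + 2 * pad[1] - kw) // stride[1] + 1
    return (n, c_out, oh, ow)

# The grayscale model's shapes: [1,3,4,4] convolved with [1,3,1,1] -> [1,1,4,4]
assert conv_out_shape((1, 3, 4, 4), (1, 3, 1, 1)) == (1, 1, 4, 4)
```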
Stage 4: MLIR Parsing to IR Dataclasses
The Python transpiler in iree-tools/ parses the MLIR text using regex patterns and builds a typed IR:
@dataclass
class Module:
    functions: list[FuncDef]

@dataclass
class FuncDef:
    name: str                            # "rgb2grayscale"
    args: list[tuple[str, TensorType]]   # [("%arg0", tensor<1x3x4x4xf32>)]
    return_types: list[TensorType]       # [tensor<1x1x4x4xf32>]
    body: list[Op]                       # [ConstantOp, ConvolutionOp, ReturnOp]
The parser handles:
- stablehlo.constant dense<…>: extracts float values from arbitrarily nested brackets
- stablehlo.convolution(…): extracts operands, strides, padding, dilation, group counts
- stablehlo.add/multiply/subtract/divide: element-wise binary operations
- stablehlo.convert: type conversions (f16→f32 becomes a no-op alias)
- return: maps return values to output arrays
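A simplified sketch of the regex approach for the constant case (the actual patterns in iree-tools/ are more involved):

```python
import re

line = ("%v0 = stablehlo.constant dense<[[[[0.2989]], [[0.587]], [[0.114]]]]> "
        ": tensor<1x3x1x1xf32>")

# Result name, the dense<...> payload, and the tensor type
m = re.match(
    r"(%\w+) = stablehlo\.constant dense<(.*)> : tensor<([\dx]+)x(f\d+)>", line)
name, payload, dims, dtype = m.groups()

# Brackets can nest arbitrarily, so pull out just the scalar literals
values = [float(v) for v in re.findall(r"-?\d+\.?\d*(?:[eE][-+]?\d+)?", payload)]
shape = tuple(int(d) for d in dims.split("x"))

# name == "%v0", shape == (1, 3, 1, 1), values == [0.2989, 0.587, 0.114]
```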
Stage 5: C Code Generation
The codegen.py module transforms the IR into C source following coralnpu_v2_binary conventions:
// Generated from StableHLO MLIR function @rgb2grayscale
// f16 promoted to f32 (Coral NPU has hardware f32, no f16)
float input_0[48] __attribute__((section(".data")));  // 1×3×4×4 = 48 floats
float output_0[16] __attribute__((section(".data"))); // 1×1×4×4 = 16 floats

static const float v0[3] = {0.2989f, 0.587f, 0.114f};

int main() {
  // 1x1 convolution: 3 input channels -> 1 output channel
  for (int i = 0; i < 16; i++) {
    float sum = 0.0f;
    for (int c = 0; c < 3; c++) {
      sum += input_0[c * 16 + i] * v0[c];
    }
    output_0[i] = sum;
  }
  return 0;
}
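The flat index input_0[c * 16 + i] encodes NCHW layout: each 4×4 channel plane occupies 16 consecutive floats. A NumPy cross-check of the loop (illustration only, not part of the generated code):

```python
import numpy as np

w = np.array([0.2989, 0.587, 0.114], dtype=np.float32)
rng = np.random.default_rng(1)
image = rng.random((3, 4, 4), dtype=np.float32)   # C,H,W for batch size 1

# The generated C loop, transliterated
input_0 = image.ravel()                           # row-major: index = c*16 + h*4 + w
output_0 = np.zeros(16, dtype=np.float32)
for i in range(16):
    output_0[i] = sum(input_0[c * 16 + i] * w[c] for c in range(3))

# Same result as a direct channel-weighted sum
expected = np.einsum('chw,c->hw', image, w).ravel()
assert np.allclose(output_0, expected, atol=1e-6)
```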
Critical C Conventions
- __attribute__((section(".data"))): Forces arrays into the .data section, which the linker script places in DTCM (Data Tightly-Coupled Memory at 0x00010000). Without this attribute, the compiler might place them in .bss (zero-initialized) or .rodata (read-only), which have different memory mapping behavior.
- Global arrays as I/O: The simulator accesses input/output data by looking up ELF symbol addresses (input_0, output_0), writing input bytes to the input address, running the program, then reading output bytes from the output address. This is the NPU's I/O convention: no syscalls, no file I/O, just shared memory.
- static const for weights: Convolution weights are compile-time constants placed in .rodata (which lives in ITCM). They are read-only and never modified.
- 1x1 convolution optimization: When the kernel is 1x1 with stride 1 and no padding, the codegen produces a simplified 2-loop nest instead of the general 7-loop convolution. This reduces code size (critical with 8 KB ITCM) and improves cache behavior.
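The dispatch can be sketched as follows (hypothetical helper names, not codegen.py's actual structure):

```python
def is_pointwise_conv(kh, kw, stride, pad):
    """1x1 kernel, unit stride, no padding: a per-pixel dot product over channels."""
    return kh == kw == 1 and stride == (1, 1) and pad == (0, 0)

def emit_conv(kh, kw, stride=(1, 1), pad=(0, 0)):
    if is_pointwise_conv(kh, kw, stride, pad):
        return "2-loop pointwise kernel"   # pixels x input channels
    return "general 7-loop convolution"    # batch, out-ch, oh, ow, in-ch, kh, kw

assert emit_conv(1, 1) == "2-loop pointwise kernel"
assert emit_conv(3, 3) == "general 7-loop convolution"
```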
Stage 6: Cross-Compilation to ELF
The bazel_builder.py module writes the generated .cc and a BUILD.bazel file to coralnpu/examples/generated/, then invokes Bazel to cross-compile them into an ELF binary.
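The invocation amounts to a subprocess call along these lines (the target label is illustrative, not the actual one defined by the generated BUILD.bazel):

```python
import subprocess

# Hypothetical target label; the real label comes from the generated BUILD.bazel
cmd = ["bazel", "build", "//examples/generated:rgb2grayscale"]

# subprocess.run(cmd, check=True)  # on success, the ELF appears under bazel-bin/
```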
The CRT startup sequence (coralnpu_start.S) runs before main():
1. Set stack pointer and global pointer from linker symbols
2. Zero out the .bss section (uninitialized global variables)
3. Run C++ static constructors (.init_array)
4. Enable FP and Vector extensions in the RISC-V mstatus CSR
5. Write sentinel 0x0badd00d to the _ret symbol
6. Call main(0, 0)
7. After main() returns, store the return value, then execute ebreak to halt
Stage 7: Simulator Execution
The MPACT behavioral simulator loads the ELF, writes input data to memory, runs the program, and reads output data:
sim = CoralNPUV2Simulator()
entry, symbols = sim.get_elf_entry_and_symbol(elf, ["input_0", "output_0"])
sim.load_program(elf, entry)
# Write test image (3 channels × 4×4 pixels = 48 floats)
sim.write_memory(symbols["input_0"], test_image.astype(np.float32).view(np.uint8))
sim.run()
sim.wait()
# Read grayscale output (1 channel × 4×4 pixels = 16 floats)
result = sim.read_memory(symbols["output_0"], 64).view(np.float32)
The simulator provides cycle-accurate execution counts, enabling performance analysis without physical hardware.
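A typical way to validate the simulator output is against a NumPy reference of the same computation (sketch; helper name illustrative):

```python
import numpy as np

def grayscale_reference(image_nchw):
    """NumPy reference for the NPU output: weighted channel sum, NCHW in/out."""
    w = np.array([0.2989, 0.587, 0.114], dtype=np.float32)
    return np.einsum('bchw,c->bhw', image_nchw, w)[:, None, :, :]

test_image = np.arange(48, dtype=np.float32).reshape(1, 3, 4, 4)
expected = grayscale_reference(test_image)
assert expected.shape == (1, 1, 4, 4)

# In practice, compare against the bytes read back from the simulator:
# np.testing.assert_allclose(result.reshape(1, 1, 4, 4), expected, rtol=1e-5)
```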
What Gets Lost at Each Stage
Understanding what information is removed at each transformation is as important as understanding what is produced:
| Transition | Information Lost |
|---|---|
| Kotlin → StableHLO | Type-level shape constraints, Kotlin generics, multiplatform targets, module boundaries |
| StableHLO → IR Dataclasses | MLIR metadata, dim_numbers semantics (assumed NCHW), source location info |
| IR → C Source | SSA form, explicit data flow graph, tensor semantics (becomes flat arrays) |
| C → ELF | Variable names (become addresses), high-level loop structure (becomes branch instructions) |
| ELF → Simulator | Nothing; the simulator executes the exact machine code |