SKaiNET Compilation Layer

Overview

The SKaiNET compilation layer (skainet-compile-* modules) takes tensor operations defined in Kotlin and produces StableHLO MLIR — the standard portable IR for ML computations. This is a trace-based compilation approach, similar to JAX’s tracing or PyTorch 2.0’s torch.compile.

Architecture (from the module diagram), showing how the four modules feed each other:

  • skainet-compile-core: TapeRecorder records every tensor op (constant, conv, matmul, add, ...)

  • skainet-compile-dag: ComputeGraph builds the DAG (construction, topological sort, shape inference, validation)

  • skainet-compile-hlo: StableHloConverter (op → StableHLO text), TypeMapper (Kotlin types → MLIR types), and StableHloOptimizer (const fold, fusion, DCE) emit StableHLO MLIR (.mlir)

  • skainet-compile-c: the C99 Code Generator with Arduino library export emits C99 source (.c/.h)

Tape Recording (Trace-Based Compilation)

When you execute a model function like rgb2GrayScaleMatMul(), the tensor operations don’t compute values immediately. Instead, they record themselves onto a tape — a linear log of operations with their operands and result types.

How It Works

// This Kotlin code...
val weights = constant(floatArrayOf(0.299f, 0.587f, 0.114f), Shape1D(3))
val result = input.convolution(weights, stride=1, padding=0)

…produces this tape:

Op #0: ConstantOp(values=[0.299, 0.587, 0.114], shape=[3], dtype=f32)
Op #1: ConvolutionOp(lhs=#input, rhs=#0, stride=[1,1], padding=[[0,0],[0,0]])
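The recording mechanism can be sketched in a few lines. This is a minimal illustration only; the class and method names (`Tape`, `Op`, `record`) are hypothetical, not SKaiNET's actual API:

```kotlin
// Minimal tape-recording sketch. All names here are illustrative,
// not SKaiNET's actual API.
sealed class Op {
    data class Constant(val values: FloatArray, val shape: List<Int>) : Op()
    data class Convolution(val lhs: Int, val rhs: Int, val stride: Int, val padding: Int) : Op()
}

class Tape {
    val ops = mutableListOf<Op>()
    // Records an op and returns its index; later ops use that index as an operand id.
    fun record(op: Op): Int { ops.add(op); return ops.size - 1 }
}

fun main() {
    val tape = Tape()
    val w = tape.record(Op.Constant(floatArrayOf(0.299f, 0.587f, 0.114f), listOf(3)))
    tape.record(Op.Convolution(lhs = -1 /* function input */, rhs = w, stride = 1, padding = 0))
    println(tape.ops.size) // two ops recorded, no values computed yet
}
```

The key point: `record` stores the op and returns a handle; nothing is evaluated until the whole tape is compiled.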

Why Trace Instead of Eager?

Eager execution (computing values immediately) prevents optimization across operations. Trace-based compilation captures the full computation graph before execution, enabling:

  • Constant folding: If two constants are added, compute the result at compile time

  • Operation fusion: Combine conv + bias + relu into a single fused kernel

  • Dead code elimination: Remove operations whose results are never used

  • Memory planning: Know all tensor sizes upfront, enabling optimal memory layout
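As a toy illustration of the first item, a constant-folding pass over an expression tree might look like this (hypothetical `Node` types, not SKaiNET code):

```kotlin
// Toy constant folding: if both operands of an Add are constants,
// compute the sum at compile time. Names are illustrative only.
sealed class Node {
    data class Const(val value: Float) : Node()
    data class Add(val lhs: Node, val rhs: Node) : Node()
}

fun fold(n: Node): Node = when (n) {
    is Node.Const -> n
    is Node.Add -> {
        val l = fold(n.lhs)
        val r = fold(n.rhs)
        if (l is Node.Const && r is Node.Const) Node.Const(l.value + r.value)
        else Node.Add(l, r)
    }
}

fun main() {
    // Add(1.5, 2.5) collapses to a single constant at compile time.
    println(fold(Node.Add(Node.Const(1.5f), Node.Const(2.5f))))
}
```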

The trade-off is that trace-based compilation cannot handle data-dependent control flow (if/else based on tensor values). For ML inference, this is rarely a limitation — the computation graph is fixed at model load time.

Graph Construction (DAG Analysis)

The skainet-compile-dag module converts the linear tape into a DAG (directed acyclic graph). Each operation becomes a node; data dependencies become edges.
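Topologically ordering such a DAG is standard Kahn's algorithm; a sketch under assumed data structures (not the skainet-compile-dag implementation):

```kotlin
// Kahn's algorithm: repeatedly emit nodes whose dependencies are all satisfied.
// Ties are broken alphabetically for deterministic output.
fun topoSort(nodes: Set<String>, edges: List<Pair<String, String>>): List<String> {
    val inDegree = nodes.associateWith { 0 }.toMutableMap()
    for ((_, to) in edges) inDegree[to] = inDegree.getValue(to) + 1
    val ready = ArrayDeque(nodes.filter { inDegree.getValue(it) == 0 }.sorted())
    val order = mutableListOf<String>()
    while (ready.isNotEmpty()) {
        val n = ready.removeFirst()
        order.add(n)
        for ((from, to) in edges) if (from == n) {
            inDegree[to] = inDegree.getValue(to) - 1
            if (inDegree.getValue(to) == 0) ready.addLast(to)
        }
    }
    require(order.size == nodes.size) { "cycle detected" }
    return order
}

fun main() {
    // input and constant both feed convolution, which feeds output.
    val order = topoSort(
        setOf("input", "constant", "convolution", "output"),
        listOf("input" to "convolution", "constant" to "convolution", "convolution" to "output")
    )
    println(order) // [constant, input, convolution, output]
}
```

The `require` at the end doubles as the acyclicity check mentioned under Validation below.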

For the grayscale model, the DAG is trivial:

  input                          constant [0.299, 0.587, 0.114]
  tensor<1×3×4×4×f32>            tensor<1×3×1×1×f32>
            \                         /
         convolution (1×1, stride=1, pad=0)
                       |
         output: tensor<1×1×4×4×f32>

For complex models (ResNets, YOLO heads), the DAG captures skip connections, multi-scale outputs, and shared parameters.

Validation

The graph is validated before export:

  • All tensor shapes are consistent across connected edges

  • No cycles exist in the graph

  • All inputs have corresponding sources (function args or constants)

  • All outputs are reachable from the inputs
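The first check amounts to comparing shapes across each edge; a minimal sketch (hypothetical `Edge` type, not the real validator):

```kotlin
// An edge is valid when the producer's output shape matches
// the shape the consumer expects. Types are illustrative only.
data class Edge(val producedShape: List<Int>, val expectedShape: List<Int>)

fun validateShapes(edges: List<Edge>): Boolean =
    edges.all { it.producedShape == it.expectedShape }

fun main() {
    val ok = validateShapes(listOf(Edge(listOf(1, 1, 4, 4), listOf(1, 1, 4, 4))))
    val bad = validateShapes(listOf(Edge(listOf(1, 3, 4, 4), listOf(1, 1, 4, 4))))
    println("$ok $bad") // true false
}
```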

Type Mapping: Kotlin to MLIR

The TypeMapper converts SKaiNET’s Kotlin type system to MLIR types:

Kotlin Type               | MLIR Type           | Notes
--------------------------+---------------------+------------------------------------------------
Tensor<Float32, Shape4D>  | tensor<BxCxHxWxf32> | 4D tensor with f32 elements
Tensor<Float16, Shape4D>  | tensor<BxCxHxWxf16> | Promoted to f32 for Coral NPU (no hardware f16)
Tensor<Int8, Shape4D>     | tensor<BxCxHxWxi8>  | Used for quantized models
Shape4D(1, 3, 4, 4)       | 1x3x4x4             | Dimensions in NCHW order
FP32 (DType)              | f32                 | Element type
FP16 (DType)              | f16                 | Half-precision
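The shape-to-type mapping in the table is essentially string formatting over the NCHW dimensions; a sketch (the real TypeMapper handles more dtypes and ranks):

```kotlin
// Render an MLIR ranked tensor type from NCHW dims and an element type,
// e.g. dims [1, 3, 4, 4] with "f32" -> "tensor<1x3x4x4xf32>".
fun mlirTensorType(dims: List<Int>, elementType: String): String =
    "tensor<" + dims.joinToString("x") + "x" + elementType + ">"

fun main() {
    println(mlirTensorType(listOf(1, 3, 4, 4), "f32")) // tensor<1x3x4x4xf32>
}
```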

StableHLO Conversion

The StableHloConverter maps each graph operation to one or more StableHLO operations. This is a direct translation — no optimization happens here.

Converter Registry

Converter                     | Graph Ops                       | StableHLO Ops
------------------------------+---------------------------------+----------------------------------------------------------------------
MathOperationsConverter       | add, subtract, multiply, divide | stablehlo.add, stablehlo.subtract, stablehlo.multiply, stablehlo.divide
LinalgOperationsConverter     | matmul, dot, transpose          | stablehlo.dot_general, stablehlo.transpose
ActivationOperationsConverter | relu, silu, softmax             | stablehlo.maximum (relu), stablehlo.custom_call (silu)
NeuralNetOperationsConverter  | conv2d, batch_norm, pooling     | stablehlo.convolution, custom lowering
ConstantOperationsConverter   | constant, parameter             | stablehlo.constant dense<…>
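Such a registry amounts to a dispatch table from op names to emitter functions. A minimal sketch with hypothetical names and a simplified text format (real StableHLO ops carry types and attributes):

```kotlin
// Map each graph-op name to a function that emits StableHLO-style text.
// Names and the emitted format are illustrative only.
val converters: Map<String, (List<String>) -> String> = mapOf(
    "add" to { args -> "stablehlo.add ${args.joinToString(", ")}" },
    "matmul" to { args -> "stablehlo.dot_general ${args.joinToString(", ")}" },
    "relu" to { args -> "stablehlo.maximum ${args[0]}, %zero" },
)

fun convert(op: String, args: List<String>): String =
    converters[op]?.invoke(args) ?: error("no converter registered for $op")

fun main() {
    println(convert("add", listOf("%0", "%1"))) // stablehlo.add %0, %1
}
```

Registering a new op means adding one entry to the map; unsupported ops fail loudly at conversion time.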

Optimization Framework

The StableHloOptimizer applies passes that transform the MLIR text to reduce operation count, memory traffic, and code size. See Optimization Passes for a detailed breakdown of each pass.

Default Pipeline

Unoptimized MLIR → Constant Folding → Operation Fusion → Dead Code Elimination → Optimized MLIR

Aggressive Pipeline

Runs constant folding twice — once before fusion (to simplify inputs to fuseable patterns) and once after (to fold constants created by fusion):

Constant Folding → Operation Fusion → Dead Code Elimination → Constant Folding
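Both pipelines are just ordered lists of passes applied in sequence, which makes repeating a pass trivial. A sketch with a hypothetical `Pass` type (the real passes operate on structured MLIR, not plain strings):

```kotlin
// A pass transforms the IR; a pipeline is a left fold over a list of passes.
// String-based IR is a stand-in for illustration only.
typealias Pass = (String) -> String

fun runPipeline(passes: List<Pass>, ir: String): String =
    passes.fold(ir) { acc, pass -> pass(acc) }

fun main() {
    val constantFolding: Pass = { it + " [folded]" }
    val fusion: Pass = { it + " [fused]" }
    val dce: Pass = { it + " [dce]" }
    // Aggressive pipeline: folding runs both before and after fusion.
    val aggressive = listOf(constantFolding, fusion, dce, constantFolding)
    println(runPipeline(aggressive, "ir"))
}
```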

Dual Output Paths

SKaiNET supports two compilation targets from the same computation graph:

Path 1: StableHLO MLIR (for NPU via iree-tools)

./gradlew :skainet-compile:skainet-compile-hlo:generateHlo \
  -Pmodel=rgb2grayscale -Poutput=rgb2grayscale.mlir

Produces standard StableHLO that can be consumed by IREE, the Python transpiler, or any MLIR toolchain.

Path 2: C99 Source (for Arduino/embedded)

The skainet-compile-c module generates C99 code with Arduino library conventions — header files, setup()/loop() entry points, and platform-independent math. This path bypasses MLIR entirely and targets microcontrollers that have C compilers but not MLIR toolchains.

The KSP Code Generation Layer

SKaiNET uses Kotlin Symbol Processing (KSP) to generate boilerplate code at compile time:

  • @GenerateTensorOp — generates type-safe tensor operation methods

  • @GenerateNetworkDsl — generates the nn { } DSL builder functions

  • @GenerateGraphDsl — generates the dag { } DSL builder functions

This means adding a new operation to SKaiNET requires defining it once with annotations, and KSP generates the DSL extensions, tape recording hooks, and type inference code automatically.
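Illustratively, the annotation-driven flow might look like the sketch below. The annotation parameters shown here are invented for the example; only the annotation names above come from SKaiNET:

```kotlin
// Illustrative only: a locally defined stand-in for @GenerateTensorOp.
// The real annotation's parameters are not documented here; KSP would
// read the annotated declaration and emit the DSL extension, tape
// recording hook, and type inference code for the op.
@Target(AnnotationTarget.CLASS)
annotation class GenerateTensorOp(val name: String)

@GenerateTensorOp(name = "gelu")
class GeluOp

fun main() {
    println(GeluOp::class.simpleName)
}
```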