Getting Started with HLO in SKaiNET
What is HLO?
HLO (High-Level Operations) is SKaiNET’s intermediate representation for neural network computations, based on StableHLO, the portable high-level operation set for machine learning. HLO bridges SKaiNET’s Kotlin DSL and the various execution backends, enabling compiler optimizations and cross-platform deployment.
Why MLIR/XLA Instead of Direct Backends?
SKaiNET uses the MLIR/XLA compilation approach rather than implementing separate backends for each hardware target. This design choice provides several key advantages:
Single Implementation Path: Write operations once in Kotlin, compile to StableHLO MLIR, then let XLA handle hardware-specific optimizations. No need to maintain separate CUDA, Metal, or ROCm implementations.
Automatic Optimization: XLA provides sophisticated optimizations like operator fusion, memory layout optimization, and hardware-specific kernel selection without manual tuning.
Future-Proof: New hardware targets (such as future GPU architectures) are picked up automatically as soon as XLA supports them, without requiring SKaiNET updates.
Ecosystem Integration: Full compatibility with JAX, TensorFlow, and other MLIR-based frameworks enables model sharing and toolchain reuse.
Key Benefits
- Portability: Write once, compile to any XLA-supported hardware (CPU, GPU, TPU)
- Optimization: Leverage XLA’s advanced compiler optimizations and operator fusion
- Interoperability: Full compatibility with XLA, JAX, TensorFlow, and MLIR ecosystems
- Performance: Hardware-specific optimizations without manual kernel development
- No Backend Lock-in: Single compilation target supports all hardware through XLA
Architecture Overview
SKaiNET’s HLO compilation pipeline transforms high-level Kotlin DSL operations into hardware-optimized executable code through the MLIR/XLA ecosystem: Kotlin DSL → SKaiNET operations → StableHLO MLIR → XLA compiler → hardware-specific executable.
Building Blocks
1. HLO Converters
Converters transform SKaiNET operations into StableHLO operations:
- MathOperationsConverter: Basic arithmetic operations
- LinalgOperationsConverter: Linear algebra operations
- ActivationOperationsConverter: Neural network activations
- NeuralNetOperationsConverter: High-level NN operations
- ConstantOperationsConverter: Constant value operations
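The converter pattern above can be sketched language-agnostically. Here is a minimal Python illustration (the function and registry names are hypothetical, not SKaiNET’s actual API): each converter turns one operation into a StableHLO text snippet, and a registry dispatches on the operation name, much as the per-category converter classes do.

```python
# Hypothetical sketch of the converter pattern: each converter maps one
# operation to a StableHLO text snippet. Names are illustrative only.
from typing import Callable, Dict

def convert_add(result: str, lhs: str, rhs: str, ty: str) -> str:
    # Emit a StableHLO elementwise add in textual MLIR form.
    return f"{result} = stablehlo.add {lhs}, {rhs} : {ty}"

def convert_multiply(result: str, lhs: str, rhs: str, ty: str) -> str:
    # Emit a StableHLO elementwise multiply.
    return f"{result} = stablehlo.multiply {lhs}, {rhs} : {ty}"

# A registry dispatches on the op name, as the per-category converters do.
CONVERTERS: Dict[str, Callable[..., str]] = {
    "add": convert_add,
    "multiply": convert_multiply,
}

def emit(op: str, *args: str) -> str:
    return CONVERTERS[op](*args)

print(emit("add", "%0", "%a", "%b", "tensor<4xf32>"))
```

The real converters additionally track value naming, types, and conversion context, but the shape of the dispatch is the same.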
Practical Example: RGB to Grayscale Conversion
Let’s walk through converting a color image Tensor<B,C,H,W> to grayscale using matrix multiplication.
Step 1: Define the Operation in Kotlin DSL
// From: skainet-lang/skainet-lang-models/src/commonMain/kotlin/sk/ainet/lang/model/compute/Rgb2GrayScaleMultiply.kt
fun Tensor<Float32, Shape4D>.rgb2GrayScaleMatMul(): Tensor<Float32, Shape4D> {
    // RGB to grayscale weights: [0.299, 0.587, 0.114]
    val grayWeights = constant(
        floatArrayOf(0.299f, 0.587f, 0.114f),
        Shape1D(3)
    ).reshape(Shape2D(3, 1))

    // Transpose input from [B,C,H,W] to [B,H,W,C] for matrix multiplication
    val reshaped = this.transpose(intArrayOf(0, 2, 3, 1))

    // Matrix multiply: [B,H,W,3] × [3,1] = [B,H,W,1]
    val gray = reshaped.matmul(grayWeights)

    // Transpose back to [B,1,H,W]
    return gray.transpose(intArrayOf(0, 3, 1, 2))
}
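Before looking at the generated IR, it can help to check the arithmetic the DSL expresses. The following plain-Python sketch (independent of SKaiNET) contracts the channel axis of a [B, C, H, W] tensor with the weights [0.299, 0.587, 0.114], producing a [B, 1, H, W] grayscale tensor, exactly the effect of the transpose–matmul–transpose sequence above.

```python
# Plain-Python check of the grayscale math: contract the channel axis
# of a [B, C, H, W] tensor (nested lists) with the luma weights.
LUMA = [0.299, 0.587, 0.114]  # RGB -> grayscale weights

def rgb_to_gray(x):
    # x: nested lists shaped [B][C=3][H][W]; returns [B][1][H][W].
    B, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    assert C == 3, "expected 3 input channels"
    return [[[[sum(LUMA[c] * x[b][c][h][w] for c in range(3))
               for w in range(W)]
              for h in range(H)]]
            for b in range(B)]

# A 1x3x1x2 image: pixel 0 is pure red, pixel 1 is pure green.
img = [[[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 0.0]]]]
print(rgb_to_gray(img))  # [[[[0.299, 0.587]]]]
```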
Step 2: Generated StableHLO IR
The converter produces MLIR code like this:
func.func @rgb2grayscale(%input: tensor<?x3x?x?xf32>) -> tensor<?x1x?x?xf32> {
  // Define grayscale conversion weights
  %weights = stablehlo.constant dense<[[0.299], [0.587], [0.114]]> : tensor<3x1xf32>

  // Transpose input: [B,C,H,W] -> [B,H,W,C]
  %transposed = stablehlo.transpose %input, dims = [0, 2, 3, 1] :
      (tensor<?x3x?x?xf32>) -> tensor<?x?x?x3xf32>

  // Matrix multiplication: [B,H,W,3] × [3,1] -> [B,H,W,1]
  %gray = stablehlo.dot_general %transposed, %weights,
      contracting_dims = [3] x [0] :
      (tensor<?x?x?x3xf32>, tensor<3x1xf32>) -> tensor<?x?x?x1xf32>

  // Transpose back: [B,H,W,1] -> [B,1,H,W]
  %result = stablehlo.transpose %gray, dims = [0, 3, 1, 2] :
      (tensor<?x?x?x1xf32>) -> tensor<?x1x?x?xf32>

  return %result : tensor<?x1x?x?xf32>
}
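The result type of the `dot_general` follows a simple shape rule: the contracted dimensions disappear and the remaining dimensions of both operands are concatenated. The helper below (an illustrative sketch, ignoring StableHLO’s batching dimensions and supporting only a single contracted dimension per operand) makes that rule concrete; `?` stands for a dynamic size.

```python
# Illustrative shape rule for a dot_general with one contracting dimension
# per operand and no batching dimensions. '?' marks a dynamic size.
def dot_general_shape(lhs, rhs, lhs_contract, rhs_contract):
    # Static contracted dims must agree; dynamic ('?') dims are checked at runtime.
    if "?" not in (lhs[lhs_contract], rhs[rhs_contract]):
        assert lhs[lhs_contract] == rhs[rhs_contract], "contracted dims must match"
    # Drop the contracted dims, concatenate the survivors.
    keep_lhs = [d for i, d in enumerate(lhs) if i != lhs_contract]
    keep_rhs = [d for i, d in enumerate(rhs) if i != rhs_contract]
    return keep_lhs + keep_rhs

# [B,H,W,3] x [3,1], contracting dim 3 with dim 0 -> [B,H,W,1]
print(dot_general_shape(["?", "?", "?", 3], [3, 1], 3, 0))  # ['?', '?', '?', 1]
```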
Hardware Target Compilation via XLA
SKaiNET uses the MLIR/XLA compilation pipeline to target different hardware platforms without requiring separate backend implementations. The StableHLO IR serves as a portable intermediate representation that XLA can compile to optimized code for various targets.
Supported Hardware Targets
- CPU: x86_64, ARM64 (via XLA CPU backend)
- GPU: NVIDIA CUDA, AMD ROCm (via XLA GPU backend)
- TPU: Google TPUs (via XLA TPU backend)
- Mobile: iOS Metal, Android GPU (via XLA mobile backends)
Prerequisites for GPU Compilation
- XLA with GPU support: Installation guide
- NVIDIA CUDA Toolkit (for NVIDIA GPUs): Download here
- ROCm (for AMD GPUs): Installation guide
Step 1: Generate StableHLO IR
# Build SKaiNET HLO compiler
./gradlew :skainet-compile:skainet-compile-hlo:build
# Convert your model to StableHLO MLIR
./gradlew :skainet-compile:skainet-compile-hlo:generateHlo \
-Pmodel=rgb2grayscale \
-Poutput=rgb2grayscale.mlir
Step 2: Compile with XLA for Target Hardware
# Compile to GPU executable (NVIDIA CUDA)
xla_compile \
--input_format=mlir \
--output_format=executable \
--platform=gpu \
--gpu_backend=cuda \
--input_file=rgb2grayscale.mlir \
--output_file=rgb2grayscale_cuda.so
# Compile to CPU executable
xla_compile \
--input_format=mlir \
--output_format=executable \
--platform=cpu \
--input_file=rgb2grayscale.mlir \
--output_file=rgb2grayscale_cpu.so
# Compile to TPU executable
xla_compile \
--input_format=mlir \
--output_format=executable \
--platform=tpu \
--input_file=rgb2grayscale.mlir \
--output_file=rgb2grayscale_tpu.so
Step 3: Runtime Execution
# Execute on target hardware using XLA runtime
xla_run \
--executable=rgb2grayscale_cuda.so \
--input=image.jpg \
--output=gray.jpg \
--device=gpu:0
Jetson and Edge Device Deployment
For NVIDIA Jetson and other edge devices, the same MLIR → XLA compilation approach applies:
# Cross-compile for ARM64 with CUDA support
xla_compile \
--input_format=mlir \
--output_format=executable \
--platform=gpu \
--gpu_backend=cuda \
--target_triple=aarch64-linux-gnu \
--input_file=rgb2grayscale.mlir \
--output_file=rgb2grayscale_jetson.so
# Deploy to Jetson device
scp rgb2grayscale_jetson.so jetson@192.168.1.100:~/models/
# Execute on Jetson
ssh jetson@192.168.1.100
cd ~/models
xla_run --executable=rgb2grayscale_jetson.so --device=gpu:0
Advanced Topics
Custom HLO Operations
Extend SKaiNET with custom operations:
// Define a custom operation and its converter
@HloOperation("custom.rgb_enhance")
class RgbEnhanceOp : HloConverter {
    override fun convert(context: ConversionContext): String {
        return """
            %enhanced = stablehlo.custom_call @rgb_enhance(%input) :
                (tensor<?x3x?x?xf32>) -> tensor<?x3x?x?xf32>
        """.trimIndent()
    }
}
Next Steps
- Explore Examples: Check skainet-compile/skainet-compile-hlo/src/commonMain/kotlin/sk/ainet/compile/hlo/examples/
- Run Tests: Execute ./gradlew :skainet-compile:skainet-compile-hlo:test
- Contribute: Add new HLO converters for missing operations
- Optimize: Profile and optimize your models using HLO tools
For more detailed information, see the HLO Optimization Guide and API Documentation.