Matrix Multiplication Examples

The examples below use the real SKaiNET DSL. For the wider tensor construction surface (single vs many tensors, named maps, init strategies), see How-to: Build Tensors. For how matmul dispatches into the kernel layer (and what the benchmarks measure), see Reading the matmul benchmark.

Basic Usage

Simple Matrix Multiplication

import sk.ainet.context.DirectCpuExecutionContext
import sk.ainet.context.data
import sk.ainet.lang.tensor.Tensor
import sk.ainet.lang.tensor.dsl.tensor
import sk.ainet.lang.types.FP32

val ctx = DirectCpuExecutionContext.create()

lateinit var a: Tensor<FP32, Float>
lateinit var b: Tensor<FP32, Float>

data(ctx) {
    a = tensor<FP32, Float> {
        shape(3, 2) { from(1f, 2f, 3f, 4f, 5f, 6f) }
    }
    b = tensor<FP32, Float> {
        shape(2, 4) {
            from(
                1f, 0f, 1f, 0f,
                0f, 1f, 0f, 1f,
            )
        }
    }
}

val result = ctx.ops.matmul(a, b)
println("Result shape: ${result.data.shape}")   // Shape(3, 4)
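
To sanity-check the values rather than just the shape, the same product can be computed with a naive triple loop in plain Kotlin. This is a minimal sketch, not SKaiNET code: `naiveMatmul` is a hypothetical helper, and it assumes `from(...)` fills the tensor in row-major order.

```kotlin
// Naive row-major reference matmul. `naiveMatmul` is a hypothetical helper, not a SKaiNET API.
fun naiveMatmul(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray {
    require(a.size == m * k && b.size == k * n) { "flat sizes must match the shapes" }
    val out = FloatArray(m * n)
    for (i in 0 until m) {
        for (j in 0 until n) {
            var acc = 0f
            for (p in 0 until k) {
                acc += a[i * k + p] * b[p * n + j]
            }
            out[i * n + j] = acc
        }
    }
    return out
}

val refA = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f)            // [3, 2], assumed row-major
val refB = floatArrayOf(1f, 0f, 1f, 0f, 0f, 1f, 0f, 1f)    // [2, 4], assumed row-major
val refC = naiveMatmul(refA, refB, m = 3, k = 2, n = 4)    // [3, 4]

// Rows of the product: [1, 2, 1, 2], [3, 4, 3, 4], [5, 6, 5, 6]
println(refC.toList().chunked(4))
```

Because `b` stacks two 2×2 identity blocks side by side, each output row is the matching row of `a` repeated twice, which makes a wrong layout assumption easy to spot.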

Batch Operations

Matmul is batched whenever the inputs carry a leading batch dimension. There is no separate "batched matmul" entry point: ctx.ops.matmul broadcasts along the leading dimension.

lateinit var batchA: Tensor<FP32, Float>
lateinit var batchB: Tensor<FP32, Float>

data(ctx) {
    // [batch=2, m=3, k=2]
    batchA = tensor<FP32, Float> {
        shape(2, 3, 2) {
            from(
                1f, 2f, 3f, 4f, 5f, 6f,    // first sample
                2f, 1f, 4f, 3f, 6f, 5f,    // second sample
            )
        }
    }
    // [batch=2, k=2, n=3]
    batchB = tensor<FP32, Float> {
        shape(2, 2, 3) {
            from(
                1f, 0f, 1f, 0f, 1f, 0f,
                0f, 1f, 0f, 1f, 0f, 1f,
            )
        }
    }
}

val batchResult = ctx.ops.matmul(batchA, batchB)
// batchResult.data.shape == Shape(2, 3, 3)
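
As a rough model of the batched case, the sketch below runs an independent [m, k] × [k, n] product for each batch slice on flat arrays. It is plain Kotlin, not SKaiNET code: `naiveBatchedMatmul` is a hypothetical helper, and row-major layout with the batch as the outermost dimension is an assumption.

```kotlin
// Batched reference: one independent [m, k] x [k, n] product per batch slice.
// `naiveBatchedMatmul` is a hypothetical helper, not a SKaiNET API.
fun naiveBatchedMatmul(
    a: FloatArray, b: FloatArray,
    batch: Int, m: Int, k: Int, n: Int,
): FloatArray {
    require(a.size == batch * m * k && b.size == batch * k * n) { "flat sizes must match the shapes" }
    val out = FloatArray(batch * m * n)
    for (s in 0 until batch) {
        // Offsets of slice s inside each flat buffer.
        val aOff = s * m * k
        val bOff = s * k * n
        val oOff = s * m * n
        for (i in 0 until m) {
            for (j in 0 until n) {
                var acc = 0f
                for (p in 0 until k) {
                    acc += a[aOff + i * k + p] * b[bOff + p * n + j]
                }
                out[oOff + i * n + j] = acc
            }
        }
    }
    return out
}

val refBatchA = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f, 2f, 1f, 4f, 3f, 6f, 5f)   // [2, 3, 2]
val refBatchB = floatArrayOf(1f, 0f, 1f, 0f, 1f, 0f, 0f, 1f, 0f, 1f, 0f, 1f)   // [2, 2, 3]
val refBatchC = naiveBatchedMatmul(refBatchA, refBatchB, batch = 2, m = 3, k = 2, n = 3)

// Both samples come out as rows [1, 2, 1], [3, 4, 3], [5, 6, 5].
println(refBatchC.toList().chunked(9))
```

The two samples produce the same 3×3 result here only because the second sample swaps the columns of A and the rows of B together; the slices are still computed independently.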

Linear Layer

A linear layer is a single matmul plus an optional bias add. Construction uses the same data { } pattern; the forward pass goes through ctx.ops:

import kotlin.math.sqrt

class LinearLayer(
    private val ctx: DirectCpuExecutionContext,
    private val weights: Tensor<FP32, Float>,
    private val bias: Tensor<FP32, Float>? = null,
) {
    fun forward(input: Tensor<FP32, Float>): Tensor<FP32, Float> {
        // input:   [batch, in_features]
        // weights: [in_features, out_features]
        // output:  [batch, out_features]
        var output = ctx.ops.matmul(input, weights)
        if (bias != null) {
            output = ctx.ops.add(output, bias)        // broadcasting add
        }
        return output
    }
}

val inputSize = 784
val hiddenSize = 256
val batchSize = 32
// Xavier/Glorot normal init: std = sqrt(2 / (fan_in + fan_out))
val std = sqrt(2.0f / (inputSize + hiddenSize))

lateinit var weights: Tensor<FP32, Float>
lateinit var bias: Tensor<FP32, Float>
lateinit var input: Tensor<FP32, Float>

data(ctx) {
    weights = tensor<FP32, Float> {
        shape(inputSize, hiddenSize) { randn(mean = 0f, std = std) }
    }
    bias = tensor<FP32, Float> { shape(hiddenSize) { zeros() } }
    input = tensor<FP32, Float> {
        shape(batchSize, inputSize) { randn(mean = 0f, std = 1f) }
    }
}

val layer = LinearLayer(ctx, weights, bias)
val output = layer.forward(input)
// output.data.shape == Shape(32, 256)
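
The bias step adds a rank-1 tensor of shape [out_features] to every row of the [batch, out_features] matmul output. A minimal plain-Kotlin sketch of that row-wise broadcast follows. `addBias` is a hypothetical helper, and it assumes ctx.ops.add aligns the bias with the trailing dimension, as NumPy-style broadcasting does.

```kotlin
// Row-wise bias broadcast on a flat row-major [batch, outFeatures] buffer.
// `addBias` is a hypothetical helper, not a SKaiNET API.
fun addBias(output: FloatArray, bias: FloatArray, batch: Int, outFeatures: Int): FloatArray {
    require(output.size == batch * outFeatures && bias.size == outFeatures) {
        "output must be [batch, outFeatures] and bias must be [outFeatures]"
    }
    // idx % outFeatures is the column index, so every row sees the same bias values.
    return FloatArray(output.size) { idx -> output[idx] + bias[idx % outFeatures] }
}

val pre = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f)   // [2, 3]
val biasRow = floatArrayOf(10f, 20f, 30f)        // [3]
val post = addBias(pre, biasRow, batch = 2, outFeatures = 3)
println(post.toList())   // [11.0, 22.0, 33.0, 14.0, 25.0, 36.0]
```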

Performance Considerations

ctx.ops.matmul routes to the highest-priority registered kernel via KernelRegistry.bestAvailable(). On JDK 21+ with the incubator Vector module loaded, that is the Panama Vector kernel, which typically reaches ~14–23 GFLOPS on AVX2 for a 1024³ FP32 GEMM, depending on the workload shape. The mnpack tile-microkernel dispatch adds ~1.7× over the naive Panama 1×1 inner loop. The numbers and the kernel selection mechanics are detailed in Reading the matmul benchmark and the engine benchmark program.
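
To relate a GFLOPS figure to wall-clock time, the usual convention counts 2·m·n·k floating-point operations for an [m, k] × [k, n] GEMM (one multiply and one add per inner-loop step). The sketch below applies that convention to the 1024³ case; `gemmFlops` and `gflops` are hypothetical helper names, not part of SKaiNET or its benchmark programs.

```kotlin
// FLOP count for an [m, k] x [k, n] GEMM: one multiply and one add per inner-loop step.
fun gemmFlops(m: Int, n: Int, k: Int): Double = 2.0 * m * n * k

// Throughput in GFLOPS for a measured wall-clock time in seconds.
fun gflops(m: Int, n: Int, k: Int, seconds: Double): Double =
    gemmFlops(m, n, k) / seconds / 1e9

val flops = gemmFlops(1024, 1024, 1024)   // 2_147_483_648 FLOPs
val secondsAt14 = flops / 14e9            // roughly 0.153 s per multiply at 14 GFLOPS
val secondsAt23 = flops / 23e9            // roughly 0.093 s per multiply at 23 GFLOPS
println("1024^3 GEMM: $secondsAt23 s to $secondsAt14 s per call")
```

Going the other way, a measured 0.1 s for the same problem corresponds to about 21.5 GFLOPS via `gflops(1024, 1024, 1024, 0.1)`.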

For quantized matmul (Q4_K, Q8_0, BF16-weight), load weights via the GGUF / SafeTensors loaders — the loaders preserve packed-block storage, and the matmul dispatch recognises the quantized TensorData subtype and routes to the matching SPI kernel.

Common Patterns

Matrix-Vector Multiplication

A 1D vector is rank 1; matmul against a 2D matrix requires the vector to be reshaped to rank 2 first (no implicit broadcasting between rank-1 and rank-2 in ctx.ops.matmul):

lateinit var matrix: Tensor<FP32, Float>
lateinit var vector: Tensor<FP32, Float>

data(ctx) {
    matrix = tensor<FP32, Float> {
        shape(100, 50) { randn(mean = 0f, std = 1f) }
    }
    // Shape (50, 1) so matmul produces (100, 1).
    vector = tensor<FP32, Float> {
        shape(50, 1) { randn(mean = 0f, std = 1f) }
    }
}

// Named matVec so it does not clash with `result` from the first example.
val matVec = ctx.ops.matmul(matrix, vector)
// matVec.data.shape == Shape(100, 1)
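
Reading a length-k vector as a [k, 1] column does not change its flat contents, since element (p, 0) sits at flat index p. The plain-Kotlin sketch below shows the matrix-vector product in that form. `matVecAsMatmul` is a hypothetical helper, and row-major layout is an assumption.

```kotlin
// Matrix-vector product written as an [m, k] x [k, 1] matmul on flat row-major data.
// `matVecAsMatmul` is a hypothetical helper, not a SKaiNET API.
fun matVecAsMatmul(matrix: FloatArray, column: FloatArray, m: Int, k: Int): FloatArray {
    require(matrix.size == m * k && column.size == k) { "matrix must be [m, k], column must be [k, 1]" }
    // With n = 1, element (p, 0) of the column sits at flat index p.
    return FloatArray(m) { i ->
        var acc = 0f
        for (p in 0 until k) {
            acc += matrix[i * k + p] * column[p]
        }
        acc
    }
}

val mat = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f)   // [2, 3]
val vec = floatArrayOf(1f, 0f, -1f)              // 3 values, read as [3, 1]
println(matVecAsMatmul(mat, vec, m = 2, k = 3).toList())   // [-2.0, -2.0]
```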

Transpose Before Matmul

ctx.ops.transpose produces a tensor view that the matmul dispatch recognises; for some packed quantized formats the transpose is lazy (no data reordering — see Q4MemorySegmentTensorData in skainet-backend-cpu for the marker class).

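// Assumes a: [3, 2] and b: [2, 4] from the first example.
val aT = ctx.ops.transpose(a)                                      // [2, 3]
val transposedProduct = ctx.ops.matmul(ctx.ops.transpose(b), aT)   // [4, 2] x [2, 3] -> [4, 3]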
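
When transposes are combined with matmul, the inner dimensions must still chain, and the identity (A·B)ᵀ = Bᵀ·Aᵀ is a handy shape check. The sketch below verifies it in plain Kotlin on the [3, 2] and [2, 4] matrices from the first example. `transpose2d` and `matmul2d` are hypothetical helpers, and unlike a lazy view they copy data.

```kotlin
// Plain-Kotlin check of (A x B)^T == B^T x A^T on flat row-major data.
// `transpose2d` and `matmul2d` are hypothetical helpers, not SKaiNET APIs.
fun transpose2d(x: FloatArray, rows: Int, cols: Int): FloatArray =
    // Output is [cols, rows]: output (r, c) at flat index r * rows + c reads input (c, r).
    FloatArray(rows * cols) { idx -> x[(idx % rows) * cols + idx / rows] }

fun matmul2d(a: FloatArray, b: FloatArray, m: Int, k: Int, n: Int): FloatArray =
    FloatArray(m * n) { idx ->
        val i = idx / n
        val j = idx % n
        var acc = 0f
        for (p in 0 until k) {
            acc += a[i * k + p] * b[p * n + j]
        }
        acc
    }

val tA = floatArrayOf(1f, 2f, 3f, 4f, 5f, 6f)            // [3, 2]
val tB = floatArrayOf(1f, 0f, 1f, 0f, 0f, 1f, 0f, 1f)    // [2, 4]

// Left side: transpose of the [3, 4] product, giving [4, 3].
val lhs = transpose2d(matmul2d(tA, tB, m = 3, k = 2, n = 4), rows = 3, cols = 4)
// Right side: [4, 2] x [2, 3], also [4, 3].
val rhs = matmul2d(transpose2d(tB, rows = 2, cols = 4), transpose2d(tA, rows = 3, cols = 2), m = 4, k = 2, n = 3)
println(lhs.contentEquals(rhs))   // true
```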