Metrics and performance testing

There are two distinct "how good is it?" questions in SKaiNET:

Model quality — does the network make correct predictions? Measured with metrics (accuracy, error) computed from a forward pass.
Engine performance — how fast does the math run? Measured with the benchmark suite.

This page covers both and links to the more detailed reference for each.

Measuring model quality

A metric is computed from a forward pass over held-out data. The training example classifies two clusters and then measures classification accuracy — the fraction of samples whose predicted label matches the truth:

    // Metric: classification accuracy, computed on a fresh inference context.
    val evalCtx = DirectCpuExecutionContext()
    val preds = model.forward(x, evalCtx)
    var correct = 0
    for (i in 0 until n) {
        // One raw score per sample; its sign selects the class (+1 or -1).
        val score = preds.data.get(i, 0)
        val predicted = if (score >= 0f) 1f else -1f
        if (predicted == labelsFlat[i]) correct++
    }
    // Fraction of samples whose predicted label matches the truth.
    val accuracy = correct.toFloat() / n

The same shape applies to any metric: run model.forward(x, ctx) on an evaluation context, then reduce predictions against targets. The snippet is compiled and run in CI from skainet-docs-samples (TrainingDemo.kt); the full training loop that produces model is in Kotlin getting started.
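
As a minimal sketch of that reduction step, assume predictions and targets have already been copied out of the tensors into plain FloatArray values. The helper names signAccuracy and meanSquaredError are hypothetical and not part of SKaiNET; they only illustrate that accuracy and an error metric are both single-pass reductions over the same two arrays:

```kotlin
// Hypothetical helpers on plain arrays; not part of the SKaiNET API.

/** Fraction of samples whose sign-thresholded score matches a +1/-1 label. */
fun signAccuracy(scores: FloatArray, labels: FloatArray): Float {
    require(scores.size == labels.size) { "scores and labels must have the same length" }
    require(scores.isNotEmpty()) { "need at least one sample" }
    var correct = 0
    for (i in scores.indices) {
        // Same decision rule as the snippet above: non-negative score means class +1.
        val predicted = if (scores[i] >= 0f) 1f else -1f
        if (predicted == labels[i]) correct++
    }
    return correct.toFloat() / scores.size
}

/** Mean squared error between predictions and targets. */
fun meanSquaredError(preds: FloatArray, targets: FloatArray): Float {
    require(preds.size == targets.size) { "preds and targets must have the same length" }
    require(preds.isNotEmpty()) { "need at least one sample" }
    var sum = 0.0 // accumulate in Double to limit rounding error
    for (i in preds.indices) {
        val diff = (preds[i] - targets[i]).toDouble()
        sum += diff * diff
    }
    return (sum / preds.size).toFloat()
}
```

For the accuracy snippet above, scores would hold the n values read from preds.data and labels would be labelsFlat.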

Always evaluate on a fresh inference context (DirectCpuExecutionContext()), separate from the autograd/training context, so metric computation does not record gradients.

Benchmarking engine performance

Engine performance is a separate concern with its own reproducible harness. Rather than ad-hoc timing in user code, SKaiNET ships an official benchmark suite:

Engine benchmark program — what the suite measures, headline vs. secondary metrics, lanes, the result-record schema, and how to reproduce a public run locally.
Reading the matmul benchmark — interpreting the numbers for the kernel that dominates inference cost.
Register a self-hosted bench runner — running the suite on your own hardware.

Performance-testing practices

Pin the methodology: use fixed warmup and iteration counts on a stable machine; see the Methodology pinning section of the benchmark guide.
Compare against a baseline run rather than reading absolute numbers in isolation, because results vary with hardware.
For backend-level context on where the time goes, see JVM CPU performance and How SIMD kernels are built.
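
The baseline practice above can be sketched as a relative comparison between two runs. This is an illustration only: the function names, the choice of the median over per-iteration timings, and the 5% tolerance are all assumptions, not the benchmark suite's result-record schema or its pinned methodology.

```kotlin
// Hypothetical comparison helpers; not the benchmark suite's API or schema.

/** Median of per-iteration timings (any consistent unit, e.g. nanoseconds). */
fun median(samples: DoubleArray): Double {
    require(samples.isNotEmpty()) { "need at least one sample" }
    val sorted = samples.sortedArray()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

/** Candidate time divided by baseline time; values above 1.0 mean the candidate is slower. */
fun slowdownRatio(baseline: DoubleArray, candidate: DoubleArray): Double {
    val base = median(baseline)
    require(base > 0.0) { "baseline median must be positive" }
    return median(candidate) / base
}

/** True when the candidate is slower than the baseline by more than the tolerance (0.05 = 5%). */
fun isRegression(
    baseline: DoubleArray,
    candidate: DoubleArray,
    tolerance: Double = 0.05, // assumed threshold; tune to the noise of your machine
): Boolean = slowdownRatio(baseline, candidate) > 1.0 + tolerance
```

A ratio is only meaningful when both runs come from the same hardware with the same warmup and iteration counts; otherwise the comparison mixes engine changes with machine differences.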

Related

Kotlin getting started: defines and trains the model measured here.
Kernel × platform support: which kernels back each op per platform.
