Run Benchmarks
This repository exposes benchmarking through the llm-performance module.
Prerequisites
- JDK 21 or newer (Java 25 preferred)
- Enough RAM for the selected model
- A local GGUF model file
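The first and last prerequisites can be checked before a run. The following is a minimal sketch, not part of the repository: the helper names `java_major` and `check_prereqs` are hypothetical, and the script only parses the version string printed by `java -version` and tests that the model file exists. It does not check available RAM.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; not part of llm-performance.

# Extract the major version from the first line of `java -version` output,
# e.g. 'openjdk version "21.0.2" 2024-01-16' -> 21, 'java version "1.8.0_292"' -> 8.
java_major() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) ver=${ver#1.} ;;   # legacy 1.x scheme (Java 8 and older)
  esac
  # Drop everything from the first non-digit onward.
  printf '%s\n' "${ver%%[!0-9]*}"
}

# Succeed only if the JDK is 21 or newer and the model file exists.
check_prereqs() {
  version_line=$1
  model_path=$2
  major=$(java_major "$version_line")
  if [ -z "$major" ] || [ "$major" -lt 21 ]; then
    echo "JDK 21 or newer is required (found: ${major:-unknown})" >&2
    return 1
  fi
  if [ ! -f "$model_path" ]; then
    echo "Model file not found: $model_path" >&2
    return 1
  fi
}

# Example usage (commented out so the snippet has no side effects):
# check_prereqs "$(java -version 2>&1 | head -n 1)" /absolute/path/to/model.gguf
```

Passing the version line as an argument, rather than calling `java` inside the function, keeps the helper easy to try out on machines without a JDK.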
Resolve a model path
./gradlew :llm-performance:jvmRun \
--args='resolve-model --model-path /absolute/path/to/model.gguf'
Run the LLaMA throughput benchmark
./gradlew :llm-performance:jvmRun \
--args='run --scenario llama-runtime-throughput --model-path /absolute/path/to/tinyllama-1.1b-chat-v1.0.Q8_0.gguf'
Model configuration precedence
The benchmark resolves models in this order:
- CLI --model-path or --model
- System property -Dskainet.model.path=…
- Environment variable SKAINET_MODEL_PATH
SKAINET_MODEL_PATH=/absolute/path/to/model.gguf \
./gradlew :llm-performance:jvmRun --args='run --scenario llama-runtime-throughput'
./gradlew :llm-performance:jvmRun \
-Dskainet.model.path=/absolute/path/to/model.gguf \
--args='run --scenario llama-runtime-throughput'
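The precedence order can be pictured as a simple first-match chain. The sketch below is only an illustration of that order, not the module's actual resolver: the function name `resolve_model_path` is hypothetical, and the source labels `system-property` and `env` are assumptions (only `cli` appears in the sample progress log in this document).

```shell
#!/bin/sh
# Hypothetical illustration of the documented precedence; the real resolver
# lives inside llm-performance and may differ in detail.
# Usage: resolve_model_path <cli-value> <system-property-value>
# The environment variable SKAINET_MODEL_PATH is read directly.
resolve_model_path() {
  cli_value=$1
  sysprop_value=$2
  if [ -n "$cli_value" ]; then
    # 1. A CLI flag wins over everything else.
    printf '%s (source: cli)\n' "$cli_value"
  elif [ -n "$sysprop_value" ]; then
    # 2. The system property is used when no CLI value is given.
    printf '%s (source: system-property)\n' "$sysprop_value"
  elif [ -n "${SKAINET_MODEL_PATH:-}" ]; then
    # 3. The environment variable is the last fallback.
    printf '%s (source: env)\n' "$SKAINET_MODEL_PATH"
  else
    echo "no model configured" >&2
    return 1
  fi
}
```

For example, with `SKAINET_MODEL_PATH` set and both arguments empty, the helper reports the environment value; passing any non-empty first argument overrides it.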
Useful runtime options
./gradlew :llm-performance:jvmRun \
--args='run --scenario llama-runtime-throughput --model-path /absolute/path/to/model.gguf --warmup-runs 1 --measured-runs 3 --steps 16,64'
| Flag | Meaning |
|---|---|
| --scenario | benchmark scenario id |
| --model-path, --model | model reference or explicit local path |
| --warmup-runs | warmup iterations per case |
| --measured-runs | measured iterations per case |
| --steps | comma-separated generation step counts |
Building & running the fat JAR
Instead of ./gradlew :llm-performance:jvmRun (which recompiles on every invocation), build a self-contained JAR once and run it directly.
Build
./gradlew :llm-performance:shadowJar
The output JAR is at llm-performance/build/libs/llm-performance-all.jar.
Run
java \
--enable-preview --add-modules jdk.incubator.vector \
-Xms2g -Xmx12g \
-jar llm-performance/build/libs/llm-performance-all.jar \
run --scenario llama-runtime-throughput \
--model-path /absolute/path/to/tinyllama-1.1b-chat-v1.0.Q8_0.gguf
All CLI options (--warmup-runs, --measured-runs, --steps, --format) work the same way as with jvmRun.
Quick examples
# List scenarios
java --enable-preview --add-modules jdk.incubator.vector \
-jar llm-performance/build/libs/llm-performance-all.jar list-scenarios
# Minimal run (1 warmup, 1 measured, 16 steps only)
java --enable-preview --add-modules jdk.incubator.vector \
-Xms2g -Xmx12g \
-jar llm-performance/build/libs/llm-performance-all.jar \
run --scenario llama-runtime-throughput \
--model-path /path/to/model.gguf \
--warmup-runs 1 --measured-runs 1 --steps 16
Progress logging
Progress is logged to stderr during benchmark execution:
[BENCH] Resolved model: /path/to/model.gguf (source: cli)
[BENCH] === Runtime 1/3: LlamaRuntime ===
[BENCH] LlamaRuntime | prompt=short steps=16 | warming up (3 runs)...
[BENCH] warmup 1/3 done
[BENCH] ...
[BENCH] LlamaRuntime | prompt=short steps=16 | measuring (3 runs)...
[BENCH] measured 1/3: 1234ms
[BENCH] ...
[BENCH] LlamaRuntime | prompt=short steps=16 | median=1200ms throughput=13.33 tok/s
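The sample summary line is consistent with throughput being the step count divided by the median time in seconds: 16 steps / 1.2 s ≈ 13.33 tok/s. This relationship is inferred from the sample numbers, not from the benchmark source. A small sketch that reproduces the arithmetic, with hypothetical helper names `median_ms_from_line` and `tok_per_sec`:

```shell
#!/bin/sh
# Hypothetical helpers for reading a [BENCH] summary line; the formula
# steps / median-seconds is an assumption inferred from the sample log.

# Pull the median (in milliseconds) out of a summary line.
median_ms_from_line() {
  printf '%s\n' "$1" | sed -n 's/.*median=\([0-9][0-9]*\)ms.*/\1/p'
}

# Tokens per second for a given step count and median time in milliseconds.
tok_per_sec() {
  awk -v steps="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", steps / (ms / 1000) }'
}

line='[BENCH] LlamaRuntime | prompt=short steps=16 | median=1200ms throughput=13.33 tok/s'
ms=$(median_ms_from_line "$line")
tok_per_sec 16 "$ms"   # prints 13.33
```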
To capture only the final results (no progress), redirect stderr:
java ... -jar llm-performance-all.jar run ... 2>/dev/null
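Instead of discarding progress, the two streams can be sent to separate files. The sketch below uses a stand-in function, `fake_bench`, in place of the real command; it assumes, as the text above implies, that final results go to stdout while progress goes to stderr. The file names are arbitrary.

```shell
#!/bin/sh
# Stand-in for the benchmark command: progress on stderr, results on stdout.
# Replace fake_bench with the real `java ... run ...` invocation.
fake_bench() {
  echo "[BENCH] warmup 1/1 done" >&2
  echo "RESULTS PLACEHOLDER"
}

out_dir=$(mktemp -d)

# stdout (results) and stderr (progress) end up in different files.
fake_bench > "$out_dir/results.txt" 2> "$out_dir/progress.log"

echo "results:  $out_dir/results.txt"
echo "progress: $out_dir/progress.log"
```

Keeping the progress log is handy when a run fails partway through, since the last `[BENCH]` line shows which case was executing.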
Native macOS benchmarks (CPU / Metal / MLX)
The llm-performance module also builds a native macOS ARM64 binary that benchmarks CPU, Metal, and MLX backends head-to-head. This requires the skainet-backend-metal and skainet-backend-mlx artifacts published to mavenLocal().
Prerequisites
- macOS on Apple Silicon (ARM64)
- skainet-backend-metal and skainet-backend-mlx published to the local Maven repository: cd /path/to/SKaiNET && ./gradlew publishAllPublicationsToMavenLocalRepository
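The platform requirement can be checked up front. This is a minimal sketch with a hypothetical helper name, `is_macos_arm64`; it relies only on `uname -s` reporting `Darwin` on macOS and `uname -m` reporting `arm64` on Apple Silicon, and does not verify that the Metal or MLX artifacts are present.

```shell
#!/bin/sh
# Hypothetical platform check; not part of llm-performance.
# $1: kernel name (uname -s), $2: machine architecture (uname -m)
is_macos_arm64() {
  [ "$1" = "Darwin" ] && [ "$2" = "arm64" ]
}

if is_macos_arm64 "$(uname -s)" "$(uname -m)"; then
  echo "Host is macOS ARM64: the native benchmark can be built here."
else
  echo "Host is not macOS ARM64: use the JVM benchmark instead."
fi
```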
Build the native binary
./gradlew :llm-performance:linkReleaseExecutableMacosArm64
The binary is at llm-performance/build/bin/macosArm64/releaseExecutable/llm-performance.kexe.
Run
./llm-performance/build/bin/macosArm64/releaseExecutable/llm-performance.kexe \
run --scenario native-backend-throughput \
--model-path /absolute/path/to/tinyllama-1.1b-chat-v1.0.Q8_0.gguf
Available options
# List scenarios
./llm-performance.kexe list-scenarios
# Minimal quick run
./llm-performance.kexe \
run --scenario native-backend-throughput \
--model-path /path/to/model.gguf \
--warmup-runs 1 --measured-runs 1 --steps 16
# Full run with JSON output
./llm-performance.kexe \
run --scenario native-backend-throughput \
--model-path /path/to/model.gguf \
--warmup-runs 3 --measured-runs 5 --steps 16,64,128 \
--format json
Scenario
native-backend-throughput compares three backends using the same LlamaRuntime:
| Backend | ExecutionContext | Attention |
|---|---|---|
| CPU | | |
| Metal | | |
| MLX | | |
If a backend is not available or fails to initialize, it is automatically skipped.
Notes
- The JVM benchmark compares runtime strategies (LlamaRuntime, DIRECT, OPTIMIZED) using the CPU backend.
- The native macOS benchmark compares backends (CPU, Metal, MLX) using the same LlamaRuntime.
- Large FP32 model loads are memory-heavy.
- Docker or otherwise memory-constrained environments may fail even when the benchmark is correct.
- A normal host machine with more RAM is the preferred environment for real throughput measurements.