# Java 25 Advantages for the JVM CPU Backend
Java 25 (GA September 2025) delivers significant performance improvements to the SKaiNET JVM CPU backend for free, through JIT/C2 optimizations, faster Panama FFI, and new GC and startup features, all without requiring code changes.
## Compatibility
The same code, same flags, and same runtime detection work across JDK 21–25:

- Vector API remains an incubator module on JDK 25 (JEP 508): the `jdk.incubator.vector` package is identical.
- Panama FFI was finalized in JDK 22; `--enable-preview` is harmless on 22+.
- Runtime detection (`Class.forName`, `Runtime.version()`) works on all versions.
- Build config (`jvmTarget = JVM_21`, `options.release.set(21)`) produces compatible bytecode.
No special treatment is needed for JDK >= 21 but < 25.
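The runtime-detection pattern above can be sketched in plain Java. This is a minimal illustration using only standard APIs; the class and method names are hypothetical, not SKaiNET's actual helpers:

```java
// Sketch of version-independent capability detection; names here are
// illustrative, not SKaiNET's actual code.
public class JdkCapabilities {
    /** True when the incubator Vector API module is on the module path. */
    public static boolean vectorApiAvailable() {
        try {
            Class.forName("jdk.incubator.vector.FloatVector");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** Feature release number, e.g. 21 or 25. */
    public static int jdkFeature() {
        return Runtime.version().feature();
    }

    public static void main(String[] args) {
        System.out.println("JDK " + jdkFeature()
                + ", Vector API available: " + vectorApiAvailable());
    }
}
```

`vectorApiAvailable()` returns false when `--add-modules jdk.incubator.vector` was not passed, which is exactly the signal a backend can use to fall back to scalar kernels.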
Required flags remain:

```
--enable-preview --add-modules jdk.incubator.vector
```
## JIT / C2 improvements mapped to SKaiNET ops
These are automatic — the JIT produces better native code for existing bytecode.
| Improvement | JDK bug | Speedup | Affected SKaiNET code |
|---|---|---|---|
| VPointer refactoring for vector loads/stores | | up to 14x | All |
| SuperWord SIMD enhancement | | up to 33x | Same vectorized loops (elementwise, reductions, matmul inner loops) |
| Faster long comparisons | JDK-8350485 | 3–5x | Shape computation, tile clamping in blocked matmul |
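As a hedged illustration of what SuperWord acts on, the following is the canonical loop shape C2 auto-vectorizes (a generic elementwise kernel, not SKaiNET's actual implementation):

```java
// Generic elementwise add: a counted loop over primitive arrays with no
// cross-iteration dependence -- the shape C2's SuperWord pass vectorizes.
public class ElementwiseAdd {
    public static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] out = new float[4];
        add(a, b, out);
        System.out.println(java.util.Arrays.toString(out)); // [11.0, 22.0, 33.0, 44.0]
    }
}
```

No source change is needed to benefit: the same bytecode simply compiles to better SIMD code on JDK 25.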
Source files:

- `skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmVectorKernels.kt`
- `skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmQuantizedVectorKernels.kt`
## Panama FFI improvements
| Improvement | JDK bug | Speedup | Affected SKaiNET code |
|---|---|---|---|
| Faster segment allocation | | ~2x | |
| Faster bulk zeroing | | ~2.5x | Tensor zeroing, blocked matmul result initialization |
Source files:

- `skainet-lang/skainet-lang-core/src/jvmMain/kotlin/sk/ainet/lang/tensor/data/MemorySegmentTensorData.kt`
- `skainet-apps/skainet-kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/PagedKvCache.kt`
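A minimal FFM sketch of the allocation and bulk-zeroing pattern these files rely on. It assumes JDK 22+ where `java.lang.foreign` is final; sizes and names are illustrative, not SKaiNET's `MemorySegmentTensorData` code:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Illustrative FFM usage; not SKaiNET's actual tensor-data implementation.
public class SegmentZeroSketch {
    /** Allocates an off-heap buffer, bulk-zeroes it, and reads back one float. */
    public static float allocateAndZero(long bytes) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(bytes);
            // Arena.allocate already returns zeroed memory; fill() shows the
            // explicit bulk-initialization path used when reusing buffers.
            seg.fill((byte) 0);
            return seg.get(ValueLayout.JAVA_FLOAT, 0);
        }
    }

    public static void main(String[] args) {
        System.out.println(allocateAndZero(4096)); // 0.0
    }
}
```

Both `Arena.allocate` and `MemorySegment.fill` are on the fast paths the table above refers to, so this code gets faster on JDK 25 without modification.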
## Object layout and GC
- Compact Object Headers (JEP 519): reduces the object header from 12 to 8 bytes. Meaningful for tensor metadata arrays with millions of small objects. Opt-in: `-XX:+UseCompactObjectHeaders`
- Generational Shenandoah (JEP 521): lower GC pause times for allocation-heavy workloads (tensor creation, KV cache churn). Opt-in: `-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational`
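A back-of-envelope check of the header claim; the object count below is hypothetical:

```java
// JEP 519 shrinks the header of each small object from 12 to 8 bytes,
// so the saving scales linearly with object count.
public class HeaderSavings {
    public static long savedBytes(long objectCount) {
        return objectCount * (12 - 8);
    }

    public static void main(String[] args) {
        // e.g. 10 million tensor-metadata objects -> 40 MB saved
        System.out.println(savedBytes(10_000_000L) / 1_000_000 + " MB");
    }
}
```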
## Startup and warmup
- AOT profiling / caching (JEP 515): records JIT profile data from a training run and replays it on subsequent launches. Reduces warmup by 15–25%. Useful for CLI apps like kLLaMA where first-token latency matters.

Usage:

```
# Training run (records profile)
java -XX:AOTCacheOutput=app.aot -jar kllama.jar --prompt "warmup"

# Production run (replays profile)
java -XX:AOTCache=app.aot -jar kllama.jar --prompt "Hello"
```
## Recommended JVM flags for Java 25
Required (same as JDK 21–24):

```
--enable-preview --add-modules jdk.incubator.vector
```

Optional, enable for maximum benefit on JDK 25:

```
-XX:+UseCompactObjectHeaders
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational
-XX:AOTCache=app.aot   # after a training run
```
## Summary
| Feature | Benefit | Component |
|---|---|---|
| VPointer refactoring (C2) | Up to 14x faster vector loads/stores | Vector kernel files (`JvmVectorKernels.kt`, `JvmQuantizedVectorKernels.kt`) |
| SuperWord SIMD (C2) | Up to 33x faster auto-vectorized loops | Same vector kernel files |
| Faster long comparisons (C2) | 3–5x faster long comparisons | Shape computation, tile clamping |
| Faster segment allocation | ~2x allocation throughput | `MemorySegmentTensorData.kt` |
| Faster bulk zeroing | ~2.5x faster bulk zeroing | Tensor init, matmul result buffers |
| Compact Object Headers | ~30% smaller object headers | All tensor metadata |
| Generational Shenandoah | Lower GC pauses | Allocation-heavy inference |
| AOT profiling | 15–25% faster warmup | CLI apps (kLLaMA) |