Java 25 Advantages for the JVM CPU Backend

Java 25 (GA September 2025) brings significant performance improvements to the SKaiNET JVM CPU backend for free: JIT/C2 optimizations, faster Panama FFI, and new GC and startup features, with no code changes required.

Compatibility

The same code, same flags, and same runtime detection work across JDK 21–25:

  • Vector API remains incubator on JDK 25 (JEP 508) — identical jdk.incubator.vector package.

  • Panama FFI finalized in JDK 22; --enable-preview is harmless on 22+.

  • Runtime detection (Class.forName, Runtime.version()) works on all versions.

  • Build config (jvmTarget = JVM_21, options.release.set(21)) produces compatible bytecode.

No special treatment is needed for JDK 21 through 24.
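The runtime detection pattern above can be sketched roughly as follows; the class and method names are illustrative, not the actual SKaiNET code:

```java
// Probes for the incubator Vector API and reports the JDK feature version.
// Class.forName fails cleanly when jdk.incubator.vector was not added via
// --add-modules, so this works unchanged on JDK 21 through 25.
public final class VectorApiProbe {

    /** True when the jdk.incubator.vector module is resolved at runtime. */
    public static boolean vectorApiAvailable() {
        try {
            Class.forName("jdk.incubator.vector.FloatVector");
            return true;
        } catch (ClassNotFoundException e) {
            return false; // module not on the module graph; fall back to scalar code
        }
    }

    /** Feature version, e.g. 21, 22, ... 25. */
    public static int jdkFeatureVersion() {
        return Runtime.version().feature();
    }

    public static void main(String[] args) {
        System.out.println("JDK " + jdkFeatureVersion()
                + ", Vector API available: " + vectorApiAvailable());
    }
}
```

Because the probe degrades to a boolean rather than throwing, the same jar can run with or without the incubator flags.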

Required flags remain:

--enable-preview --add-modules jdk.incubator.vector

JIT / C2 improvements mapped to SKaiNET ops

These are automatic — the JIT produces better native code for existing bytecode.

| Improvement | JDK bug | Speedup | Affected SKaiNET code |
| --- | --- | --- | --- |
| VPointer refactoring for vector loads/stores | JDK-8350748 | up to 14x | All FloatVector.fromArray / fromMemorySegment loops in JvmVectorKernels.kt, JvmQuantizedVectorKernels.kt |
| SuperWord SIMD enhancement | JDK-8343685 | up to 33x | Same vectorized loops (elementwise, reductions, matmul inner loops) |
| Math.max / Math.min intrinsified for long | JDK-8350485 | 3–5x | Shape computation, tile clamping in blocked matmul |

Source files:

  • skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmVectorKernels.kt

  • skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmQuantizedVectorKernels.kt
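The loop shapes these C2 improvements target can be sketched as below; the names are illustrative and not taken from the files above:

```java
// Plain counted loops like these are what C2's SuperWord pass auto-vectorizes
// (JDK-8343685), and the long min/max clamp is intrinsified in JDK 25
// (JDK-8350485). No Vector API imports are needed for this sketch.
public final class ElementwiseKernels {

    /** out[i] = a[i] + b[i]: the classic SuperWord-friendly elementwise loop. */
    static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Clamps a tile index into [lo, hi] using the newly intrinsified long min/max. */
    static long clampTile(long index, long lo, long hi) {
        return Math.max(lo, Math.min(index, hi));
    }
}
```

The explicit FloatVector.fromArray loops in the kernel files benefit from the VPointer work instead; both paths speed up without source changes.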

Panama FFI improvements

| Improvement | JDK bug | Speedup | Affected SKaiNET code |
| --- | --- | --- | --- |
| Faster MemorySegment allocation | JDK-8345687 | ~2x | MemorySegmentTensorData.kt (MemorySegmentTensorDataFactory), PagedKvCache.kt |
| MemorySegment::fill optimized on AArch64 | JDK-8354674 | ~2.5x | Tensor zeroing, blocked matmul result initialization |

Source files:

  • skainet-lang/skainet-lang-core/src/jvmMain/kotlin/sk/ainet/lang/tensor/data/MemorySegmentTensorData.kt

  • skainet-apps/skainet-kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/PagedKvCache.kt
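The allocation and bulk-zeroing pattern these improvements accelerate looks roughly like this, using the final java.lang.foreign API (JDK 22+); the class and method names are illustrative:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the buffer lifecycle in a tensor-data factory: allocate once,
// re-zero on reuse. Allocation hits JDK-8345687's fast path; fill hits the
// AArch64 optimization from JDK-8354674.
public final class SegmentZeroDemo {

    /** Allocates a float buffer; Arena allocations are zero-initialized. */
    static MemorySegment newBuffer(Arena arena, long floats) {
        return arena.allocate(floats * Float.BYTES, Float.BYTES);
    }

    /** Re-zeroes a reused buffer, e.g. a matmul result tile or a KV-cache page. */
    static void zero(MemorySegment seg) {
        seg.fill((byte) 0);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = newBuffer(arena, 1024);
            seg.set(ValueLayout.JAVA_FLOAT, 0L, 3.5f);
            zero(seg);
            System.out.println(seg.get(ValueLayout.JAVA_FLOAT, 0L));
        }
    }
}
```

On JDK 21 this API is still a preview, which is why --enable-preview stays in the shared flag set.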

Object layout and GC
  • Compact Object Headers (JEP 519) — reduces object header from 12 to 8 bytes. Meaningful for tensor metadata arrays with millions of small objects. Opt-in: -XX:+UseCompactObjectHeaders

  • Generational Shenandoah (JEP 521) — lower GC pause times for allocation-heavy workloads (tensor creation, KV cache churn). Opt-in: -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational

Startup and warmup
  • AOT profiling / caching (JEP 515) — records JIT profile data from a training run and replays it on subsequent launches. Reduces warmup by 15–25%. Useful for CLI apps like kLLaMA where first-token latency matters.

Usage:

# Training run (records profile)
java -XX:AOTCacheOutput=app.aot -jar kllama.jar --prompt "warmup"

# Production run (replays profile)
java -XX:AOTCache=app.aot -jar kllama.jar --prompt "Hello"

Required (same as JDK 21–24):

--enable-preview
--add-modules jdk.incubator.vector

Optional — enable for maximum benefit on JDK 25:

-XX:+UseCompactObjectHeaders
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational
-XX:AOTCache=app.aot          # after training run

Summary

| Feature | Benefit | Component |
| --- | --- | --- |
| VPointer refactoring (C2) | Up to 14x faster vector loads/stores | JvmVectorKernels, JvmQuantizedVectorKernels |
| SuperWord SIMD (C2) | Up to 33x faster auto-vectorized loops | Same vector kernel files |
| Math.max/min intrinsic | 3–5x faster long comparisons | Shape computation, tile clamping |
| Faster segment allocation | ~2x allocation throughput | MemorySegmentTensorDataFactory, PagedKvCache |
| MemorySegment::fill (AArch64) | ~2.5x faster bulk zeroing | Tensor init, matmul result buffers |
| Compact Object Headers | ~30% smaller object headers | All tensor metadata |
| Generational Shenandoah | Lower GC pauses | Allocation-heavy inference |
| AOT profiling | 15–25% faster warmup | CLI apps (kLLaMA) |