Java 25 Advantages for the JVM CPU Backend

Java 25 (GA September 2025) brings significant performance improvements to the SKaiNET JVM CPU backend for free: JIT/C2 optimizations, faster Panama FFI, and new GC and startup features, with no code changes required.

Compatibility

The same code, same flags, and same runtime detection work across JDK 21–25:

  • Vector API remains incubator on JDK 25 (JEP 508) — identical jdk.incubator.vector package.

  • Panama FFI finalized in JDK 22; --enable-preview is harmless on 22+.

  • Runtime detection (Class.forName, Runtime.version()) works on all versions.

  • Build config (jvmTarget = JVM_21, options.release.set(21)) produces compatible bytecode.

No special treatment is needed for JDK 21 through 24.
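The runtime detection pattern above can be sketched roughly as follows; the class and method names are illustrative, not the actual SKaiNET code:

```java
// Probes for the incubator Vector API and reports the JDK feature version.
// Class.forName fails cleanly when jdk.incubator.vector was not added via
// --add-modules, so this works unchanged on JDK 21 through 25.
public final class VectorApiProbe {

    /** True when the jdk.incubator.vector module is resolved at runtime. */
    public static boolean vectorApiAvailable() {
        try {
            Class.forName("jdk.incubator.vector.FloatVector");
            return true;
        } catch (ClassNotFoundException e) {
            return false; // module not on the module graph; fall back to scalar code
        }
    }

    /** Feature version, e.g. 21, 22, ... 25. */
    public static int jdkFeatureVersion() {
        return Runtime.version().feature();
    }

    public static void main(String[] args) {
        System.out.println("JDK " + jdkFeatureVersion()
                + ", Vector API available: " + vectorApiAvailable());
    }
}
```

Because the probe degrades to a boolean rather than throwing, the same jar can run with or without the incubator flags.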

Required flags remain:

--enable-preview --add-modules jdk.incubator.vector

JIT / C2 improvements mapped to SKaiNET ops

These are automatic — the JIT produces better native code for existing bytecode.

| Improvement | JDK bug | Speedup | Affected SKaiNET code |
| --- | --- | --- | --- |
| VPointer refactoring for vector loads/stores | JDK-8350748 | up to 14x | All FloatVector.fromArray / fromMemorySegment loops in JvmVectorKernels.kt, JvmQuantizedVectorKernels.kt |
| SuperWord SIMD enhancement | JDK-8343685 | up to 33x | Same vectorized loops (elementwise, reductions, matmul inner loops) |
| Math.max / Math.min intrinsified for long | JDK-8350485 | 3–5x | Shape computation, tile clamping in blocked matmul |

Source files:

  • skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmVectorKernels.kt

  • skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/JvmQuantizedVectorKernels.kt
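The loop shapes these C2 improvements target can be sketched as below; the names are illustrative and not taken from the files above:

```java
// Plain counted loops like these are what C2's SuperWord pass auto-vectorizes
// (JDK-8343685), and the long min/max clamp is intrinsified in JDK 25
// (JDK-8350485). No Vector API imports are needed for this sketch.
public final class ElementwiseKernels {

    /** out[i] = a[i] + b[i]: the classic SuperWord-friendly elementwise loop. */
    static void add(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Clamps a tile index into [lo, hi] using the newly intrinsified long min/max. */
    static long clampTile(long index, long lo, long hi) {
        return Math.max(lo, Math.min(index, hi));
    }
}
```

The explicit FloatVector.fromArray loops in the kernel files benefit from the VPointer work instead; both paths speed up without source changes.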

Panama FFI improvements

| Improvement | JDK bug | Speedup | Affected SKaiNET code |
| --- | --- | --- | --- |
| Faster MemorySegment allocation | JDK-8345687 | ~2x | MemorySegmentTensorData.kt (MemorySegmentTensorDataFactory), PagedKvCache.kt |
| MemorySegment::fill optimized on AArch64 | JDK-8354674 | ~2.5x | Tensor zeroing, blocked matmul result initialization |

Source files:

  • skainet-lang/skainet-lang-core/src/jvmMain/kotlin/sk/ainet/lang/tensor/data/MemorySegmentTensorData.kt

  • skainet-apps/skainet-kllama/src/jvmMain/kotlin/sk/ainet/apps/kllama/PagedKvCache.kt
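The allocation and bulk-zeroing pattern these improvements accelerate looks roughly like this, using the final java.lang.foreign API (JDK 22+); the class and method names are illustrative:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Sketch of the buffer lifecycle in a tensor-data factory: allocate once,
// re-zero on reuse. Allocation hits JDK-8345687's fast path; fill hits the
// AArch64 optimization from JDK-8354674.
public final class SegmentZeroDemo {

    /** Allocates a float buffer; Arena allocations are zero-initialized. */
    static MemorySegment newBuffer(Arena arena, long floats) {
        return arena.allocate(floats * Float.BYTES, Float.BYTES);
    }

    /** Re-zeroes a reused buffer, e.g. a matmul result tile or a KV-cache page. */
    static void zero(MemorySegment seg) {
        seg.fill((byte) 0);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = newBuffer(arena, 1024);
            seg.set(ValueLayout.JAVA_FLOAT, 0L, 3.5f);
            zero(seg);
            System.out.println(seg.get(ValueLayout.JAVA_FLOAT, 0L));
        }
    }
}
```

On JDK 21 this API is still a preview, which is why --enable-preview stays in the shared flag set.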

Object layout and GC
  • Compact Object Headers (JEP 519) — reduces object header from 12 to 8 bytes. Meaningful for tensor metadata arrays with millions of small objects. Opt-in: -XX:+UseCompactObjectHeaders

  • Generational Shenandoah (JEP 521) — lower GC pause times for allocation-heavy workloads (tensor creation, KV cache churn). Opt-in: -XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational

Startup and warmup
  • AOT profiling / caching (JEP 515) — records JIT profile data from a training run and replays it on subsequent launches. Reduces warmup by 15–25%. Useful for CLI apps like kLLaMA where first-token latency matters.

Usage:

# Training run (records profile)
java -XX:AOTCacheOutput=app.aot -jar kllama.jar --prompt "warmup"

# Production run (replays profile)
java -XX:AOTCache=app.aot -jar kllama.jar --prompt "Hello"

Required (same as JDK 21–24):

--enable-preview
--add-modules jdk.incubator.vector

Optional — enable for maximum benefit on JDK 25:

-XX:+UseCompactObjectHeaders
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational
-XX:AOTCache=app.aot          # after training run

Summary

| Feature | Benefit | Component |
| --- | --- | --- |
| VPointer refactoring (C2) | Up to 14x faster vector loads/stores | JvmVectorKernels, JvmQuantizedVectorKernels |
| SuperWord SIMD (C2) | Up to 33x faster auto-vectorized loops | Same vector kernel files |
| Math.max/min intrinsic | 3–5x faster long comparisons | Shape computation, tile clamping |
| Faster segment allocation | ~2x allocation throughput | MemorySegmentTensorDataFactory, PagedKvCache |
| MemorySegment::fill (AArch64) | ~2.5x faster bulk zeroing | Tensor init, matmul result buffers |
| Compact Object Headers | ~30% smaller object headers | All tensor metadata |
| Generational Shenandoah | Lower GC pauses | Allocation-heavy inference |
| AOT profiling | 15–25% faster warmup | CLI apps (kLLaMA) |