loadLlamaRuntimeWeightsStreaming

suspend fun <T : DType> loadLlamaRuntimeWeightsStreaming(ctx: ExecutionContext, randomAccessProvider: () -> RandomAccessSource, dtype: KClass<T>, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<T>

Loads LLaMA runtime weights using the streaming API. Only the metadata is parsed up front (roughly 1 MB of memory); tensors are loaded on demand. Suitable for models of any size, including those over 100 GB that exceed Java array limits.

Parameters

ctx

Execution context for tensor creation

randomAccessProvider

Factory that provides a RandomAccessSource for the GGUF file

dtype

Class token for the target data type T of the loaded tensors
quantPolicy

How to handle quantized tensors

allowQuantized

If false, loading fails when a quantized tensor is encountered


suspend fun loadLlamaRuntimeWeightsStreaming(ctx: ExecutionContext, randomAccessProvider: () -> RandomAccessSource, quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES, allowQuantized: Boolean = false): LlamaRuntimeWeights<FP32>

Backward-compatible overload defaulting to FP32.
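The provider-factory pattern above can be sketched with JDK primitives. In this sketch, RandomAccessSource and FileSource are hypothetical stand-ins (the library's actual interface may differ); the point is why the loader takes a factory rather than an open handle: each on-demand tensor read can open, seek, and close its own source.

```kotlin
import java.io.File
import java.io.RandomAccessFile

// Hypothetical stand-in for the library's RandomAccessSource: random reads by offset.
interface RandomAccessSource : AutoCloseable {
    val size: Long
    fun read(offset: Long, buf: ByteArray): Int
}

// File-backed implementation using the JDK's RandomAccessFile.
class FileSource(path: String) : RandomAccessSource {
    private val raf = RandomAccessFile(path, "r")
    override val size: Long get() = raf.length()
    override fun read(offset: Long, buf: ByteArray): Int {
        raf.seek(offset)
        return raf.read(buf)
    }
    override fun close() = raf.close()
}

fun demo(): String {
    // Stand-in for a GGUF file: 16 bytes 0..15.
    val tmp = File.createTempFile("weights", ".gguf")
    tmp.writeBytes(ByteArray(16) { it.toByte() })
    // A factory, not an open handle: the loader decides when to open and close.
    val provider: () -> RandomAccessSource = { FileSource(tmp.path) }
    val result = provider().use { src ->
        val header = ByteArray(4)
        src.read(0, header)         // read only the metadata region, not the whole file
        header.joinToString(",")
    }
    tmp.delete()
    return result
}

fun main() {
    println(demo())  // 0,1,2,3
}
```

With the real API, the same provider lambda would be passed as randomAccessProvider, letting the loader parse metadata first and fetch each tensor's bytes lazily by offset.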