loadLlamaRuntimeWeightsStreaming
suspend fun <T : DType> loadLlamaRuntimeWeightsStreaming(
    ctx: ExecutionContext,
    randomAccessProvider: () -> RandomAccessSource,
    dtype: KClass<T>,
    quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES,
    allowQuantized: Boolean = false
): LlamaRuntimeWeights<T>
Loads LLaMA runtime weights using the streaming API. Only the GGUF metadata is parsed up front (roughly 1 MB of memory); tensors are loaded on demand. Suitable for models of any size (100+ GB), including files that exceed Java array limits.
Parameters
ctx
Execution context for tensor creation
randomAccessProvider
Factory that provides RandomAccessSource to the GGUF file
dtype
Target element type (as a KClass) for the loaded tensors
quantPolicy
How to handle quantized tensors
allowQuantized
If false, fail with an error when a quantized tensor is encountered
suspend fun loadLlamaRuntimeWeightsStreaming(
    ctx: ExecutionContext,
    randomAccessProvider: () -> RandomAccessSource,
    quantPolicy: QuantPolicy = QuantPolicy.RAW_BYTES,
    allowQuantized: Boolean = false
): LlamaRuntimeWeights<FP32>
Backward-compatible overload defaulting to FP32.
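A minimal usage sketch. `FileRandomAccessSource` and the surrounding setup are hypothetical stand-ins for whatever source implementation and execution context your project provides; only `loadLlamaRuntimeWeightsStreaming` and its parameters come from this API.

```kotlin
// Sketch only: FileRandomAccessSource is an assumed, illustrative
// implementation of RandomAccessSource backed by a local GGUF file.
suspend fun loadWeights(ctx: ExecutionContext): LlamaRuntimeWeights<FP32> =
    loadLlamaRuntimeWeightsStreaming(
        ctx = ctx,
        // The provider is a factory, not a single source: the loader may
        // invoke it to (re)open the file for on-demand tensor reads.
        randomAccessProvider = { FileRandomAccessSource("model.gguf") },
        dtype = FP32::class,
        quantPolicy = QuantPolicy.RAW_BYTES,
        // Permit quantized tensors instead of erroring on them.
        allowQuantized = true
    )
```

Passing a factory rather than an already-open `RandomAccessSource` matters for streaming: since tensors are read lazily, the loader controls when and how often the underlying file is opened.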