AdamOptimizer
Adam optimizer (Adaptive Moment Estimation).
Implements the Adam algorithm from "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014) with optional decoupled weight decay (AdamW).
The update rule is:
m_t = β1 * m_{t-1} + (1 - β1) * g_t
v_t = β2 * v_{t-1} + (1 - β2) * g_t^2
m_hat = m_t / (1 - β1^t)
v_hat = v_t / (1 - β2^t)
θ_t = θ_{t-1} - lr * m_hat / (sqrt(v_hat) + ε)

When decoupledWeightDecay is true (default), weight decay is applied directly to the parameters (AdamW style) rather than added to the gradient (L2 regularization).
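The update rule above can be sketched for a single scalar parameter as follows. This is an illustrative sketch, not the library's implementation; the names `adamStep` and `AdamState` are assumptions.

```typescript
// Illustrative single-parameter Adam step (names are assumptions, not library API).
interface AdamState {
  m: number; // first moment estimate, m_t
  v: number; // second moment estimate, v_t
  t: number; // step count
}

function adamStep(
  theta: number,
  grad: number,
  state: AdamState,
  lr = 0.001,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8,
): number {
  state.t += 1;
  // Exponential moving averages of the gradient and squared gradient.
  state.m = beta1 * state.m + (1 - beta1) * grad;
  state.v = beta2 * state.v + (1 - beta2) * grad * grad;
  // Bias correction compensates for zero initialization of m and v.
  const mHat = state.m / (1 - Math.pow(beta1, state.t));
  const vHat = state.v / (1 - Math.pow(beta2, state.t));
  return theta - lr * mHat / (Math.sqrt(vHat) + eps);
}
```

Note that on the first step the bias-corrected estimates reduce to m_hat = g_1 and v_hat = g_1^2, so the parameter moves by approximately lr in the direction opposite the gradient's sign.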
Parameters
Learning rate (default: 0.001)
Exponential decay rate for the first moment estimates (default: 0.9)
Exponential decay rate for the second moment estimates (default: 0.999)
Small constant for numerical stability (default: 1e-8)
Weight decay coefficient (default: 0.0)
If true, uses AdamW-style decoupled weight decay (default: true)
If true, uses the AMSGrad variant that maintains the maximum of all v_t (default: false)
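The interaction of the weight decay, decoupling, and AMSGrad options can be sketched as below. This is a hedged sketch under stated assumptions: the function and option names (`adamUpdate`, `weightDecay`, `amsgrad`, `vMax`) are illustrative, and the AMSGrad branch follows the description above by tracking the running maximum of v_t before bias correction.

```typescript
// Sketch of one Adam/AdamW/AMSGrad step for a scalar parameter.
// All names here are assumptions, not the library's actual API.
interface FullAdamState {
  m: number;    // first moment estimate
  v: number;    // second moment estimate
  vMax: number; // running maximum of v_t (AMSGrad only)
  t: number;    // step count
}

interface AdamOptions {
  lr?: number; beta1?: number; beta2?: number; eps?: number;
  weightDecay?: number; decoupledWeightDecay?: boolean; amsgrad?: boolean;
}

function adamUpdate(
  theta: number,
  grad: number,
  state: FullAdamState,
  opts: AdamOptions = {},
): number {
  const {
    lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8,
    weightDecay = 0.0, decoupledWeightDecay = true, amsgrad = false,
  } = opts;

  // L2-style (coupled) decay: fold the decay term into the gradient
  // before the moment updates.
  let g = grad;
  if (weightDecay > 0 && !decoupledWeightDecay) g += weightDecay * theta;

  state.t += 1;
  state.m = beta1 * state.m + (1 - beta1) * g;
  state.v = beta2 * state.v + (1 - beta2) * g * g;
  const mHat = state.m / (1 - Math.pow(beta1, state.t));
  let vHat = state.v / (1 - Math.pow(beta2, state.t));

  if (amsgrad) {
    // AMSGrad: keep the maximum of all v_t seen so far, then bias-correct it.
    state.vMax = Math.max(state.vMax, state.v);
    vHat = state.vMax / (1 - Math.pow(beta2, state.t));
  }

  let next = theta - lr * mHat / (Math.sqrt(vHat) + eps);

  // AdamW-style (decoupled) decay: shrink the parameter directly,
  // independent of the adaptive gradient scaling.
  if (weightDecay > 0 && decoupledWeightDecay) next -= lr * weightDecay * theta;
  return next;
}
```

The practical difference: with coupled L2 decay the penalty is divided by sqrt(v_hat) like any other gradient component, so heavily-updated parameters are regularized less; decoupled (AdamW) decay shrinks every parameter by the same relative amount regardless of its gradient history.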