AdamOptimizer
Adam optimizer (Adaptive Moment Estimation).
Implements the Adam algorithm from "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014) with optional decoupled weight decay (AdamW).
The update rule is:
m_t = β1 * m_{t-1} + (1 - β1) * g_t
v_t = β2 * v_{t-1} + (1 - β2) * g_t^2
m_hat = m_t / (1 - β1^t)
v_hat = v_t / (1 - β2^t)
θ_t = θ_{t-1} - lr * m_hat / (sqrt(v_hat) + ε)

When decoupledWeightDecay is true (default), weight decay is applied directly to the parameters (AdamW style) rather than added to the gradient (L2 regularization).
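The update rule above can be sketched for a single scalar parameter as follows. This is an illustrative sketch, not the library's implementation; the names `adamStep` and `AdamState` are assumptions.

```typescript
// Illustrative single-parameter Adam step (names are assumptions, not library API).
interface AdamState {
  m: number; // first moment estimate, m_t
  v: number; // second moment estimate, v_t
  t: number; // step count
}

function adamStep(
  theta: number,
  grad: number,
  state: AdamState,
  lr = 0.001,
  beta1 = 0.9,
  beta2 = 0.999,
  eps = 1e-8,
): number {
  state.t += 1;
  // Exponential moving averages of the gradient and squared gradient.
  state.m = beta1 * state.m + (1 - beta1) * grad;
  state.v = beta2 * state.v + (1 - beta2) * grad * grad;
  // Bias correction compensates for zero initialization of m and v.
  const mHat = state.m / (1 - Math.pow(beta1, state.t));
  const vHat = state.v / (1 - Math.pow(beta2, state.t));
  return theta - lr * mHat / (Math.sqrt(vHat) + eps);
}
```

Note that on the first step the bias-corrected estimates reduce to m_hat = g_1 and v_hat = g_1^2, so the parameter moves by approximately lr in the direction opposite the gradient's sign.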
Parameters
Learning rate (default: 0.001)
Exponential decay rate for the first moment estimates (default: 0.9)
Exponential decay rate for the second moment estimates (default: 0.999)
Small constant for numerical stability (default: 1e-8)
Weight decay coefficient (default: 0.0)
If true, uses AdamW-style decoupled weight decay (default: true)
If true, uses the AMSGrad variant that maintains the maximum of all v_t (default: false)
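The interaction of the weight decay, decoupling, and AMSGrad options can be sketched as below. This is a hedged sketch under stated assumptions: the function and option names (`adamUpdate`, `weightDecay`, `amsgrad`, `vMax`) are illustrative, and the AMSGrad branch follows the description above by tracking the running maximum of v_t before bias correction.

```typescript
// Sketch of one Adam/AdamW/AMSGrad step for a scalar parameter.
// All names here are assumptions, not the library's actual API.
interface FullAdamState {
  m: number;    // first moment estimate
  v: number;    // second moment estimate
  vMax: number; // running maximum of v_t (AMSGrad only)
  t: number;    // step count
}

interface AdamOptions {
  lr?: number; beta1?: number; beta2?: number; eps?: number;
  weightDecay?: number; decoupledWeightDecay?: boolean; amsgrad?: boolean;
}

function adamUpdate(
  theta: number,
  grad: number,
  state: FullAdamState,
  opts: AdamOptions = {},
): number {
  const {
    lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8,
    weightDecay = 0.0, decoupledWeightDecay = true, amsgrad = false,
  } = opts;

  // L2-style (coupled) decay: fold the decay term into the gradient
  // before the moment updates.
  let g = grad;
  if (weightDecay > 0 && !decoupledWeightDecay) g += weightDecay * theta;

  state.t += 1;
  state.m = beta1 * state.m + (1 - beta1) * g;
  state.v = beta2 * state.v + (1 - beta2) * g * g;
  const mHat = state.m / (1 - Math.pow(beta1, state.t));
  let vHat = state.v / (1 - Math.pow(beta2, state.t));

  if (amsgrad) {
    // AMSGrad: keep the maximum of all v_t seen so far, then bias-correct it.
    state.vMax = Math.max(state.vMax, state.v);
    vHat = state.vMax / (1 - Math.pow(beta2, state.t));
  }

  let next = theta - lr * mHat / (Math.sqrt(vHat) + eps);

  // AdamW-style (decoupled) decay: shrink the parameter directly,
  // independent of the adaptive gradient scaling.
  if (weightDecay > 0 && decoupledWeightDecay) next -= lr * weightDecay * theta;
  return next;
}
```

The practical difference: with coupled L2 decay the penalty is divided by sqrt(v_hat) like any other gradient component, so heavily-updated parameters are regularized less; decoupled (AdamW) decay shrinks every parameter by the same relative amount regardless of its gradient history.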