merlin.models.tf.LazyAdam

class merlin.models.tf.LazyAdam(learning_rate: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64, Callable] = 0.001, beta_1: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 0.9, beta_2: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 0.999, epsilon: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 1e-07, amsgrad: bool = False, name: str = 'LazyAdam', **kwargs)

Bases: keras.optimizers.adam.Adam

Variant of the Adam optimizer that handles sparse updates more efficiently.

The original Adam algorithm maintains two moving-average accumulators for each trainable variable; the accumulators are updated at every step. This class provides lazier handling of gradient updates for sparse variables. It only updates moving-average accumulators for sparse variable indices that appear in the current batch, rather than updating the accumulators for all indices. Compared with the original Adam optimizer, it can provide large improvements in model training throughput for some applications. However, it provides slightly different semantics than the original Adam algorithm, and may lead to different empirical results.

Note that amsgrad is currently not supported; the argument can only be False.

This implementation was adapted from the original TensorFlow Addons implementation of LazyAdam: tensorflow/addons
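The sketch below is a minimal, hedged usage example: it compiles a toy tf.keras model whose Embedding layer produces the kind of sparse gradients LazyAdam targets. The model architecture, layer sizes, and random data are illustrative assumptions, not part of the Merlin API; only merlin.models.tf.LazyAdam itself comes from this page.

```python
import numpy as np
import tensorflow as tf
import merlin.models.tf as mm

# Toy model: the Embedding layer yields sparse (IndexedSlices) gradients,
# so only the rows looked up in a batch have their Adam accumulators updated.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=mm.LazyAdam(learning_rate=1e-3),
    loss="binary_crossentropy",
)

# Random inputs/labels purely for illustration.
item_ids = np.random.randint(0, 10_000, size=(256, 5))
labels = np.random.randint(0, 2, size=(256, 1))
model.fit(item_ids, labels, batch_size=64, epochs=1)
```

The lazy behavior only matters for variables that receive sparse gradients (such as embedding tables); dense variables are expected to be updated as with standard Adam.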

__init__(learning_rate: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64, Callable] = 0.001, beta_1: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 0.9, beta_2: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 0.999, epsilon: Union[tensorflow.python.framework.ops.Tensor, float, numpy.float16, numpy.float32, numpy.float64] = 1e-07, amsgrad: bool = False, name: str = 'LazyAdam', **kwargs)

Constructs a new LazyAdam optimizer.

Parameters
  • learning_rate (Union[FloatTensorLike, Callable]) – The learning rate. A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. FloatTensorLike = Union[tf.Tensor, float, np.float16, np.float32, np.float64]

  • beta_1 (FloatTensorLike) – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.

  • beta_2 (FloatTensorLike) – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.

  • epsilon (FloatTensorLike) – A small constant for numerical stability. This epsilon is “epsilon hat” in [Adam: A Method for Stochastic Optimization. Kingma et al., 2014](http://arxiv.org/abs/1412.6980) (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.

  • amsgrad (bool) – Whether to apply the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and Beyond”. Note that this argument is currently not supported and can only be False.

  • name (str) – Optional name for the operations created when applying gradients. Defaults to “LazyAdam”.

  • **kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm clips gradients by norm; clipvalue clips gradients by value; decay is included for backward compatibility to allow time-inverse decay of the learning rate; lr is included for backward compatibility, but learning_rate is recommended instead. A construction sketch follows this list.
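As a hedged illustration of these parameters, the following sketch constructs the optimizer with non-default hyperparameters, a learning-rate schedule, and a clipnorm keyword argument; all values are arbitrary choices for demonstration.

```python
import tensorflow as tf
import merlin.models.tf as mm

# Arbitrary hyperparameters for illustration; amsgrad must remain False.
optimizer = mm.LazyAdam(
    learning_rate=5e-4,
    beta_1=0.9,
    beta_2=0.98,
    epsilon=1e-7,
    amsgrad=False,
    name="LazyAdam",
    clipnorm=1.0,  # passed through **kwargs: clip gradients by norm
)

# learning_rate can also be a tf.keras.optimizers.schedules.LearningRateSchedule.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.96
)
optimizer_with_schedule = mm.LazyAdam(learning_rate=schedule)
```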

Methods

__init__([learning_rate, beta_1, beta_2, ...])

Constructs a new LazyAdam optimizer.

add_variable(shape[, dtype, initializer, name])

Create an optimizer variable.

add_variable_from_reference(model_variable, ...)

aggregate_gradients(grads_and_vars)

Aggregate gradients on all devices.

apply_gradients(grads_and_vars[, name, ...])

Apply gradients to variables.

build(var_list)

Initialize optimizer variables.

compute_gradients(loss, var_list[, tape])

Compute gradients of loss on trainable variables.

exclude_from_weight_decay([var_list, var_names])

Exclude variables from weight decay.

finalize_variable_values(var_list)

Set the final value of model's trainable variables.

from_config(config[, custom_objects])

Creates an optimizer from its config.

get_config()

minimize(loss, var_list[, tape])

Minimize loss by updating var_list.

set_weights(weights)

Set the weights of the optimizer.

update_step(gradient, variable)

Update step given gradient and the associated model variable.

Attributes

iterations

The number of training steps this optimizer has run.

learning_rate

lr

Alias of learning_rate().

variables

Returns variables of this optimizer.