Adaptive optimization methods, such as Adam and Adagrad, maintain some statistics over time about the variables and gradients (e.g. moments) which affect the learning rate. These statistics won’t be very accurate when working with sparse tensors, where most of its elements are zero or near zero. We investigated the effects of using Adam with varying learning rates on sparse tensors, and explored solutions to allow Adam to properly handle the sparsity.
In our experiments, we observed a lack of responsiveness to learning rate changes. We speculate this is due to fact that Adam update its moment estimates from a much smaller sample than the batch size suggests due to the sparsity, and also unnecessarily attempt to update the cells even if they are zero or unused. We hoped that a sparse variant of Adam might address this problem. Sparse variants are found in a number deep learning libraries, although TensorFlow is a bit behind on this compared to PyTorch.
This article will provide some background for Adam and sparse representations, the implementation details for using TensorFlow sparse variant of Adam with sparse tensors as well as the outcome of our experiments.
A sparse representation is where most elements of the representation are near-zero with a few nonzero elements, which leads to a more semantic, efficient and robust representation . There are many different methods for creating sparse representations such as [1, 3, 4]. For simplicity, we can create a sparse representation by masking the activation outputs in a hidden layer. A mask, for example, can be based on the top k cells in the output, to be represented as a one in the mask with the rest being zeros. This ends up creating sparse gradients as some cells will not get updated when they’re zeroed out in the sparse representation.
The Adam optimizer works by maintaining a per-parameter learning rate that is based on the moment estimates, see Figure 1 for formulas. It keeps track of an exponential moving average controlled by the decay rates beta1 and beta2, which are recommended to be close to 1.0 (the default values are 0.9 and 0.999 respectively). Epsilon, used for numerical stability, might be worth tuning as it has been noted that it can vary in orders of magnitude depending on the architecture and the problem. For example, an Inception network training on ImageNet, an optimal epsilon value might be 1.0 or 0.1.
The sparse variant of Adam would use its ‘sparse update’ mechanism when it sees that the gradient of a variable is of type IndexedSlices, which means an operation like tf.gather was used on a variable to create a sparse version of it. In this case, Adam will sparsely perform weight updates, but update its moment estimates as the dense Adam would. The Lazy Adam Optimizer addresses that part by also performing sparse moment updates. This optimizer is currently in the tf.contrib package and seems to not be as stable as the Adam optimizer.
TensorFlow operations that create sparse representations from variables, such as tf.gather and tf.nn.embedding_lookup, must be used directly on the variable itself before any other operations. This ended up being a bit complicated especially when dealing with convolutional architectures, as we the sparse mask was applied after the activation function. Instead, we focused on our a non-convolutional implementation of a Sparse Autoencoder. In that implementation, we compute the activation output for the encoding and create the sparse masks based on those outputs. The sparse mask used in training is reduced over the batch using tf.reduce_max to be used as the indices for active cells when using the tf.gather operation. This operation is then applied directly to the weights matrix, resulting in a sparse weights matrix. This is then used to recompute the encoding.
We initially ran some experiments comparing the standard and sparse Adam implementations using a Sparse Autoencoder on the MNIST dataset. Initially we noticed superior training performance with the Lazy Adam Optimizer (sparse Adam); however, this did not translate to better classification performance on either the training or the test sets. This led us to dig more into how exactly the sparsening methods in the training and test settings differ.
We noticed that the lifetime sparsity constraint [1, 4] was causing all cells to be updated every step. This was due to the fact that when we reduced the mask over the batch to retrieve the indices, all the cells were active and resulted in a non-sparse weights matrix, and resulted in a non-sparse output. The consequence was much worse performance as we were essentially training on a dense encoding, then evaluating on a sparse one.
Further experiments were conducted without lifetime sparsity as well as re-applying the mask after the tf.gather operation to ensure that only the active cells for that particular sample in the batch are active. The experiments showed comparable performance to the standard Adam optimizer in reconstruction and classification accuracy, with no obvious improvements due to the sparse Adam implementation. It’s likely that additional masking necessary to ensure the outputs are valid is negating any potential upsides the sparse Adam implementation might have. In conclusion, the lazy Adam optimizer will not be helpful if you’re doing any updates in each batch, as it doesn’t count the number of updates per batch; it’s only able to say whether a cells’ weights are updated or not.
If you’re working with non-sparse tensors, then using the standard Adam optimizer will be your best option. If you’re using embeddings in NLP for example, then you’re likely already automatically using the ‘sparse mechanism’ of the standard Adam, but it might be worth trying the lazy Adam as well for potentially better performance. If you’re working with sparse representations, Adam should be good enough in most cases if you want to avoid re-implementing Adam and backpropagation as there currently no out-of-the-box solution.
- Makhzani, A. & Frey, B. 2013, k-Sparse Autoencoders, arXiv:1312.5663 [cs.LG]
- Kowadlo, G. 2014, Sparse Distributed Representations (SDRs [Blog post]. Retrieved from https://agi.io/2014/12/24/sparse-distributed-representations-sdrs/
- Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W & Ng, A. Y. 2011, Sparse Filtering, Advances in Neural Information Processing Systems 24 (NeurIPS 2011)
- Makhzani, A. & Frey, B. J. 2015, Winner-Take-All Autoencoders, Advances in Neural Information Processing Systems 28 (NeurIPS 2015)
Also published on Medium.