Convolutional Competitive Learning vs. Sparse Autoencoders (1/2)
Competitive learning is a branch of unsupervised learning that was popular a long, long time ago in the 1990s. Older readers may remember – the days before widespread use of GSM mobile phones and before Google won the search engine wars! Although competitive learning is now rarely used, it is worth understanding the broader range of algorithms developed in the era before fully differentiable architectures became so dominant. For example, Hopfield networks, a form of content addressable memory, were inspirational in the development of new memory systems such as the Neural Turing Machine (Graves2014).
This article discusses whether selected competitive learning methods can be brought up-to-date with tricks from modern machine learning. We discuss the adaptation of these concepts here. A second article reports new results demonstrating whether these competitive learning methods are as good as modern autoencoders – or not!
Competitive Learning Basics
The premise of competitive learning (CL) is that cells “compete” to be responsible for representing (and usually, being able to reproduce) the input. In this way, the cell population learns to represent all possible inputs with some degree of error. Encoding using competitive learning is a form of dimensionality reduction. In modern terminology, the winning cell[s] represent a compressive “bottleneck”. Popular algorithms include K-means clustering and Kohonen Self-Organizing Maps (SOMs). In the extreme case known as Hard Competitive Learning, a single “winning” cell is adapted towards the input. Hard competitive learning suffers especially from a key drawback of competitive learning, which is that training doesn’t successfully find uses for all cells resulting in poor resource utilization (so called ‘dead’ cells).
Biologically Plausible Credit Assignment in Deep Networks
So why return to competitive learning at all? Many researchers (including us) are concerned about the biological plausibility of credit assignment in deep networks (Marblestone2016). In today’s parlance, we are suffering from “FOMO”.
Despite a thorough search, it has not been possible to confirm a biological equivalent to error-gradient backpropagation between layers, which is necessary to assign the “credit” (aka error-responsibility) to arbitrary weights across many layers in deep networks. There are 2 alternative explanations:
- The mechanism exists, but we didn’t find it yet
- Natural networks have a different way to train deep networks, using either local error functions, or a global error with local credit assignment
If natural neural networks do it differently, what are we missing out on?! Hence, FOMO. We need to reconsider how we train network layers, and need to reconsider local error functions in particular. This leads us to competitive learning, among other ideas.
In modern times, competitive learning is an antiquated concept. Autoencoders are predominantly used where unsupervised training is desired. The same principle of a representational bottleneck applies to both autoencoders and competitive learning; the difference is in how they are applied and trained. One major distinction is that in a vanilla autoencoder, all hidden cells participate in all inputs; they do not compete with each other except implicitly during backpropagation. However, some autoencoders are sparsely active in their hidden layers. It might be accurate to say that a sparse autoencoder is a form of competitive learning, although this terminology is not used. Nevertheless, benefits of autoencoders over other CL methods include:
Hidden layer cells can participate in input reconstruction to varying degrees, rather than winner-take-all or binary hidden layer activation, as is common in most CL methods. This means that responsibility for error is variably distributed between active hidden cells.
Most importantly, in a deep autoencoder, errors can be backpropagated to shallower layers, thereby allowing all layers to be simultaneously optimized for a useful function, such as classification. A good discussion of this insight can be found in LeCun’s et al’s Gradient Based Learning paper (LeCun1998), which fully applies to deep monolithic autoencoders (but not stacked autoencoders with separate loss functions for each layer).
Key features of modern Autoencoders
Batch training improves training of networks trained by gradient descent because it’s better than both extremes of online learning (batch size = 1) and computing a gradient based on all the data (which is very inefficient). The randomness of batch sample selection is actually helpful, because it allows the learning process to jump out of local minima. However, training with individual samples (batch size = 1) is too random, resulting in learning that’s either too slow or too noisy, jumping around all over the place. A moderate batch size produces a good approximation of the true error gradient (allowing fast optimization) with a little bit of stochasticity to avoid local optima and robustly explore the space.
However, this insight is not restricted to autoencoders; it could also be applied to most competitive learning algorithms by generalizing the learning rules. Typically, competitive learning algorithms are trained on individual data samples (batch-size = 1).
This section assumes familiarity with the terminology of convolutional networks. If that doesn’t sound like you, this guide will help!
Another big insight from LeCun’s gradient based learning paper (LeCun1998) was the idea that – particularly in images – the same weights learned in one part of the image should work well in other areas. This means that weights can be “shared” between filter locations, reducing the overall number of parameters to learn, and reducing the size of the training set required to learn input patterns with translation. Intuitively, any input where translation (or other transformations!) are possible should benefit from convolutional computation. This paper pioneered the development of convolutional networks.
The idea of convolutional competitive learning is not new – the earliest reference we could find is Fukushima’s Neocognitron (Fukushima1980). However, this paper is so old that there are no results that can be meaningfully compared to today’s algorithms: computers weren’t fast enough back then. But we can easily imagine how to make CL methods convolutional.
There are many other features that contribute to modern network performance, such as better weight initialization and hidden layer dropout, both of which aim to improve hidden layer resource utilization. Many of them could also be adapted to competitive learning; however, they are beyond the scope of this article.
Self Organising Maps and Growing-Neural-Gas
Self-Organizing Maps (SOMs) are a popular Competitive Learning technique where, usefully, the resulting cells are topographically organized – cells that respond to similar inputs are located close together (see banner image at the top of this article, which shows a SOM clasifying R,G,B tuples). Weirdly, we also know that topographical orientation of cells’ receptive fields is a property of the visual cortex (see figure below). However, we don’t know of a useful way to exploit this in artificial networks. Put it in the box of unresolved mysteries.
If we want a SOM, but don’t care about topographical organization, you can use the Neural Gas (NG) algorithm instead. The intuition here is that cells – like atoms of a gas – expand to occupy the space populated by samples from the input distribution. There’s no topology to the cells. There’s another variant called “Growing Neural Gas” (GNG) in which cells are preferentially added to the active population based on the representational error of the existing cells, using simple neighbourhood relations in a high-dimensional space.
Online learning of Non-Stationary problems
But what if your input keeps changing? It turns out that many problems change over time, and in fact, exploration of a new space also has the effect of changing the input statistics due to the new environments experienced. We call this a “non-stationary” problem and for an intelligent agent in an open-ended world, we can expect it to be encountered regularly.
(Fritzke1997) introduced a variant of GNG, called GNG-U (“Growing Neural Gas with Utility”). GNG-U recycles cells from areas of the input space that are comparatively over-represented, and inserts them into areas that are under-represented. In this way the representational power of the cell population follows changing input statistics. There are a few alternative SOM or CL algorithms that claim to be able to adapt to nonstationary input (such as the Dynamic SOM (DSOM) by (Rougier2011)), but we were unable to get them to work satisfactorily.
The GNG-U algorithm is theoretically capable of online or batch learning of nonstationary input distributions with some guarantees regarding optimality of the resulting representation, assuming the input changes much more slowly than the model. This is potentially interesting enough to make it worth revisiting.
Sparse and Distributed representation
Sparse means that at any time, only a small fraction of hidden layer activity values are nonzero. In the extreme case, only 1 cell is active (this is the case for many competitive learning algorithms). The opposite of sparse is dense – all or most cells participate in the representation of all input.
Now let’s consider things another way: Local versus Distributed representation. In a Local representation, one or more cells fully and exclusively represents each input. This has the drawback that each cell is crucial and therefore the network has poor tolerance to change or damage. In a distributed representation, the encoding is ‘distributed’ among many cells, meaning that each one is individually less critical.
Note I deliberately avoid the term “sparse coding” (often used in neuroscience) because it refers to a specific method in machine learning and there are many alternatives (Gregor2010), in particular training ANNs with sparsity penalties (Lee2007). Typically, sparse representations are overcomplete which provides robustness via redundancy.
Sparse representations are typically used to discover features in data – an unsupervised learning process. While there were concerns about encoding and training efficiency, recent techniques (see below) achieve good results at no additional cost to ordinary feed-forward networks.
Why is sparse, distributed representation useful? There is still some debate as to whether and why sparse representation is actually beneficial, but there is a considerable amount of evidence that it helps. It can be seen as a good compromise between local and dense distributed codes (Foldiak2002) with most of the benefits of both extremes, and few of the drawbacks.
Intuitively, sparse representations have a combinatorial representational capacity (2^n): We can pick a very large number of feature-combinations from the set, rather than expressing a fixed set of elements to varying degrees, or even exclusively (in which case capacity = n).
The partially distributed nature of the resulting state also makes representations more robust (compared to winner take all coding), in the same way that dropout reduces reliance on individual features. Overall, the best of both worlds.
That wraps up the first part of this topic, having introduced a range of relevant motivation, concepts and methods. You can find the next section here, in which we describe our implementation of a convolutional winner-take-all competitive learning algorithm based on GNGU, and its comparison to sparse convolutional and fully-connected autoencoders.
Graves2014 – “Neural Turing Machines”. Alex Graves, Greg Wayne, Ivo Danihelka. arXiv preprint arXiv:1410.5401 (2014)
LeCun1998 – “Gradient Based Learning applied to document recognition”. Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner. Proceedings IEEE November (1998)
Fukushima1980 – “A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. Kunihiko Fukushima. Biological Cybernetics. 36 (4): 193–202 (1980)
Marblestone2016 – “Toward an Integration of Deep Learning and Neuroscience”. Marblestone, A., Wayne, G., Kording, K. Frontiers in Computational Neuroscience (2016)
Fritzke1997 – “A self-organizing network that can follow non-stationary distributions”. Bernd Fritzke. Proceedings of International Conference on Artificial Neural Networks, p. 613-618 (1997)
Lee2007 – “Sparse deep belief net model for visual area V2”. Lee, Honglak, Ekanadham, Chaitanya, and Ng, Andrew Y. Advances in Neural Information Processing Systems, (2007)
Rougier2011 – “Dynamic Self-Organising map”. N.P.Rougier and Y.Boniface. Neurocomputing 74, 11, pp 1840-1847 (2011)
Gregor2010 – “Learning Fast Approximations of Sparse Coding”. Karol Gregor and Yann LeCun. Proceedings of the 27th International Conference on International Conference on Machine Learning (2010)
Foldiak2002 – “Sparse coding in the primate cortex”. P Földiák. The Handbook of Brain Theory and Neural Networks, Second Edition, pp 1064-1068, Arbib, MIT Press, ISBN 0-262-01197-2 (2002)
Makhzani2013 – “k-Sparse Autoencoders”. Makhzani, Alireza and Frey, Brendan. arXiv preprint arXiv:1312.5663 (2013)
Makhzani2015 – “Winner-Take-All Autoencoders”. Makhzani, Alireza and Frey, Brendan J. Advances in Neural Information Processing Systems (2015)