Predictive Capsules Networks – Research update

Facial equivariances modelled by Predictive Capsules algorithm

We recently talked about Capsules networks and equivariances. NB: If you’re not familiar with Capsules networks, read this first. Our primary objective with Capsules networks is to exploit their enhanced generalization abilities.

However, what we’ve found instead raises new questions about how generalization can be measured and whether Capsules networks are actually good at it.

PredCaps Algorithm

In the equivariances post, we introduced the Predictive Capsules (“PredCaps”) algorithm, so called because it uses prediction as a consensus mechanism. In other Capsules networks, capsules compete with each other to predict a consensus pose of the input, with the clique that has the most support suppressing the other capsules. In PredCaps, capsules merely predict themselves given feedback from deeper layers, and the capsules that predict themselves most strongly are selected.

We use “top-k-capsules” sparse training to ensure that capsules specialize, with each capsule learning to model a related subset of the available input. A “lifetime sparsity” rule ensures that all capsules get trained and used at least occasionally. Winning capsules are trained and used for classification; other capsules are masked to zero, which removes their influence on the output and so prevents them from learning via backprop. Like an autoencoder, capsules are trained to minimize input reconstruction loss, which means that as long as we are consistent about the way capsules are selected, they will learn to represent the observed variation in a subset of the input.
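To make the selection step concrete, here is a minimal Python sketch of top-k capsule masking. It is illustrative only and is not taken from the PredCaps code: the function names `top_k_capsule_mask` and `mask_capsules`, the use of a single scalar score per capsule, and the tie-breaking rule are all assumptions. The lifetime sparsity rule is deliberately left out.

```python
from typing import List, Sequence


def top_k_capsule_mask(capsule_scores: Sequence[float], k: int) -> List[int]:
    """Return a 0/1 mask that selects the k highest-scoring capsules.

    `capsule_scores` is a hypothetical per-capsule score (for example, how
    strongly each capsule predicts itself). Ties go to the lower index.
    """
    n = len(capsule_scores)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of capsules")
    # Rank capsule indices by score, highest first.
    ranked = sorted(range(n), key=lambda i: (-capsule_scores[i], i))
    winners = set(ranked[:k])
    return [1 if i in winners else 0 for i in range(n)]


def mask_capsules(capsule_params: Sequence[Sequence[float]],
                  mask: Sequence[int]) -> List[List[float]]:
    """Zero every parameter of each losing capsule; winners pass through."""
    if len(capsule_params) != len(mask):
        raise ValueError("need exactly one mask bit per capsule")
    return [[p * m for p in params] for params, m in zip(capsule_params, mask)]


# Four capsules with two parameters each; keep the best two.
scores = [0.1, 0.9, 0.5, 0.7]
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
mask = top_k_capsule_mask(scores, k=2)
masked = mask_capsules(params, mask)
```

Because the losing capsules are multiplied by zero, they contribute nothing to the output of this layer, which is the property the masking relies on.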

In all our experiments, we tested a 3-layer encoder network of PredCaps with alternative “decoder” heads for classification and/or reconstruction, trained separately. Note that a local reconstruction loss is used in each layer – hence only the first layer attempts to reconstruct the input image. PredCaps layers receive feedback from the layer above, which acts as a dynamic (data-dependent) bias on the hidden layer state such that the reconstruction loss is minimized.

The PredCaps algorithm is intended to satisfy the following design goals:

  • Only local credit assignment
  • Continuous, unsupervised learning
  • Sparse coding
  • Hierarchical representation
  • Homogeneous network (all capsules, not a mix of capsules and conventional layers)
  • Bidirectional connectivity between layers, with simultaneous training of all layers
  • Consensus via iterative update of all layers given feed-forward and feed-back input
  • Capsular equivariance and consequent generalization benefit

Facial Equivariances

We were pleased to observe equivariances in the “Labelled Faces in the Wild” (LFW) dataset. As in Sabour et al., we use univariate perturbations of the deepest hidden capsules layer to visualize the import of each capsule parameter. The assumption is that by varying each parameter individually, we can discover how it is interpreted by the network and, implicitly, what it represents. The headline figure, above, shows some selected examples. Note that all reconstructions are plausible faces! This appears to be a unique benefit of capsule networks compared to e.g. stacked autoencoder networks. From top to bottom we observe PredCaps has “discovered” the following latent variables:

  • Emotion: Frightened → Smiling
  • Race: Asian → Caucasian
  • Teeth: Prominent → hidden
  • Weight: Heavy-set → Thin
  • Illumination: Background → Foreground
  • Pose: Right → Left
  • Pose: Right → Left (another example)
  • Illumination: Background → Foreground
  • Illumination: Above → Below
  • Emotion: Neutral → Happy
  • Gender: Female → Male

In many cases the observed role of each capsule-parameter is preserved across inputs; for example, we can use the “gender” parameters to change a reconstructed image of any person to another gender. These results are exciting, because they suggest Capsules networks might be good at generating novel images, much like generative adversarial networks (GANs).
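The univariate perturbation procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions rather than our experimental code: `univariate_sweep` is a hypothetical helper, the deepest layer is treated as a flat list of capsule parameters, and the offsets are arbitrary. Each perturbed vector would then be passed to the reconstruction decoder, which is not shown.

```python
from typing import List, Sequence


def univariate_sweep(params: Sequence[float], index: int,
                     deltas: Sequence[float]) -> List[List[float]]:
    """Return one copy of `params` per delta, shifting only params[index].

    Every other parameter is left untouched, so any change in the
    reconstruction can be attributed to the single perturbed parameter.
    """
    if not 0 <= index < len(params):
        raise IndexError("parameter index out of range")
    sweep = []
    for delta in deltas:
        perturbed = list(params)   # copy, so the original is not modified
        perturbed[index] += delta
        sweep.append(perturbed)
    return sweep


# Vary the second of three (hypothetical) capsule parameters.
original = [1.0, 2.0, 3.0]
variants = univariate_sweep(original, index=1, deltas=[-0.5, 0.0, 0.5])
```

Decoding each row of `variants` in order gives one strip of images like those in the figure, with a single latent variable changing from left to right.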

To date we have not seen facial equivariance via capsules networks in any published work.

MNIST Equivariances

Equivariances in MNIST data

We also tested PredCaps on MNIST (handwritten digits). The figure above shows some selected equivariances obtained from PredCaps on MNIST digits. This reproduces the effect observed in Sabour et al. The examples show, from top to bottom:

  • Digit morphing and skew
  • Convex → Concave digit form
  • Stroke intensity
  • Stroke straightness
  • Digit height & form
  • Local digit form and skew variation
  • Middle of digit is pulled or pushed
  • Digit proportion changes

To verify that we are actually seeing something special and not the typical consequence of sparse autoencoder training, we also tested the baseline case without capsular selection (see Equivariances article for details). Without capsular selection & training, we do not observe equivariant output after univariate perturbation. Instead, small variations in hidden layer values cause the reconstructed forms to disintegrate into random blobs.


Average convergence measured in top-k cell flips over routing iterations

Convergence in Capsules networks with EM Routing is reportedly problematic. In particular, there are several reports that EM routing as described in Hinton et al. is difficult to reproduce and/or unstable, although this may be due to minor implementation errors, as no reference source code is available.

PredCaps achieves convergence without any inherent stabilization. Routing simply updates layer outputs, and convergence occurs when the set of active capsules reaches a local minimum defined by the learned weights. Good convergence occurs with near-perfect reliability after approximately 5000 batches on all the datasets tested. We measure convergence as the number of bits flipped in the top-k mask of a convolutional capsules layer. Each cell that has moved in or out of the top-k cells in the layer (at any convolutional location) causes 2 “flips” (see the figure above). In this example the middle capsules layer has w,h = [5,5] and a depth of 200 cells (20 capsules of 10 cells each). Therefore, the total number of cells is 5 x 5 x 200 = 5000. The wave effect is due to flip-propagation through the 3-layer network.
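The flip measure can be illustrated with a short Python sketch. It rests on one reading of the description above, which is an assumption: with k fixed, a cell that leaves the top-k set is paired with a cell that enters it, so a single swap changes two mask bits. The helper names `layer_cell_count` and `count_flips` are hypothetical, and the masks are flattened to plain lists.

```python
from typing import Sequence


def layer_cell_count(width: int, height: int,
                     capsules: int, cells_per_capsule: int) -> int:
    """Total number of cells in a convolutional capsules layer."""
    return width * height * capsules * cells_per_capsule


def count_flips(prev_mask: Sequence[int], curr_mask: Sequence[int]) -> int:
    """Count the mask bits that differ between two routing iterations."""
    if len(prev_mask) != len(curr_mask):
        raise ValueError("masks must have the same length")
    return sum(1 for a, b in zip(prev_mask, curr_mask) if a != b)


# The middle layer from the text: 5 x 5 locations, 20 capsules of 10 cells.
total_cells = layer_cell_count(5, 5, 20, 10)

# Toy masks with k = 2: the cell at index 1 drops out and index 2 replaces it.
prev = [1, 1, 0, 0, 0]
curr = [1, 0, 1, 0, 0]
flips = count_flips(prev, curr)
```

Here `total_cells` is 5000 and the single swap gives `flips == 2`. A flip count that falls to zero over successive routing iterations means the set of active cells has stopped changing.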

The convergence behaviour is very encouraging because it suggests the weights are successfully discovering stable representations of the input, distributed across the layers.

Since the bottom-layer feed-forward input is constant during routing, these changes are also evidence that the PredCaps network reaches a consensus across all layers; the state of all layers varies significantly in response to the state of other layers. Further evidence of this is that the magnitude of feedback weights from deeper layers increases monotonically, at least as long as we can be bothered to continue to train.

Classification Performance

We tested PredCaps on several datasets, including:

  • MNIST digits
  • Train on MNIST, test on affNIST digits (intended to be a generalization-to-affine-poses test-case)
  • SmallNORB (images of toys in various poses and illuminations; we measure whether classifiers can generalize to unseen poses and illuminations)
  • Labelled Faces in the Wild (LFW): a collection of celebrity face images scraped from the Internet

We found classification performance on MNIST to be acceptable (98.5%) but not remarkable. Due to the low number of samples per individual in LFW (often just 1) we could not practicably test classification accuracy.

On both smallNORB and affNIST we observed very ordinary classification performance, 65% on affNIST (when trained only on MNIST) and 93% on smallNORB, when compared with a conventional deep-backpropagation convolutional network of the same size. These numbers are disappointing and we weren’t able to demonstrate any generalization benefit from the capsules property.

When testing classification performance on the training set, we found very high accuracy, typically 99%+ on all datasets. This is normally a sign of overfitting, so we applied L2 regularization to the encoder and decoder networks, without success. We also tried dropout in the encoder and decoder networks, with some success in the latter (the difference between train and test accuracy was somewhat controlled) although accuracy was still disappointing and higher on training data.

These results are, unexpectedly, a generalization failure: PredCaps compresses all the information necessary to encode and classify images through a tiny bottleneck (typically 160 capsule-parameters in the deepest layer) with very little loss. Reconstructions of the input images from capsule parameters are very accurate: the information is there, but the features are not suitable for a supervised classifier to generalize with!

Use of labels during encoder training

We tried providing classification labels as feedback input to the top layer of the PredCaps encoder to improve classification performance by biasing the way capsules are matched to image inputs. Training caused the labels to be assigned high and monotonically increasing weight magnitudes, and changed the statistics of capsule usage, suggesting that they were used productively by the algorithm. We found that removing this information at test time was damaging to the algorithm unless a high rate of label dropout was used during training (again suggesting the algorithm made use of the labels). But use of labels during encoder training did not significantly improve classification accuracy. Nuts.
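Label dropout can be sketched as follows. This is an assumption-laden illustration, not the code we ran: it supposes the label is fed back as a one-hot vector, that a dropped label is replaced by all zeros, and that the drop decision is made independently per sample. The name `label_feedback` and its parameters are hypothetical.

```python
import random
from typing import List, Optional


def label_feedback(label: int, num_classes: int, dropout_rate: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return the label feedback vector for one training sample.

    With probability `dropout_rate` the label is withheld (all zeros);
    otherwise the vector is one-hot at position `label`.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    if not 0.0 <= dropout_rate <= 1.0:
        raise ValueError("dropout_rate must be in [0, 1]")
    rng = rng if rng is not None else random.Random()
    feedback = [0.0] * num_classes
    # rng.random() lies in [0, 1), so a rate of 1.0 always drops the label
    # and a rate of 0.0 never does.
    if rng.random() >= dropout_rate:
        feedback[label] = 1.0
    return feedback


always_kept = label_feedback(3, 10, dropout_rate=0.0)
always_dropped = label_feedback(3, 10, dropout_rate=1.0)
```

At test time the label is unavailable, which corresponds to the all-zeros case; training with a high dropout rate exposes the encoder to that case often enough for it to cope.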

Next steps

We have some reasons to be optimistic about PredCaps: it meets all of our design goals for the manner in which it is trained; it is a homogeneous network; it naturally reaches convergence as a minimum in the learned weights; and it demonstrates the equivariance property.

PredCaps is good at interpreting images in a pre-determined manner – for example, if you insist that any face is female, it can produce a plausible female version of it.

However, what it doesn’t do is generalize better, in terms of classification accuracy in unseen data. PredCaps seems to be a powerful generative method of encoding images of objects, but doesn’t do so in a manner that is useful to a classifier.

We plan to test PredCaps in some new ways:

  • PredCaps is similar to an Autoencoder network, in that it minimizes a reconstruction loss. Is this loss better on unseen inputs due to the Capsules network paradigm, compared to vanilla convnets? i.e. does reconstruction loss “generalize” even if classification loss doesn’t?
  • In Sabour et al., the Capsules network was very good at distinguishing overlapping digits when prompted. Is PredCaps similarly good at disentangling overlapping digits? This seems likely, given its ability to interpret input as a desired class. We will test this.

Generalizing the Generalization task

Classification tasks require the ability to determine the “true” class Y of an input X. Should generalization include the ability to perform an “inverse” task, to interpret any input X as an instance of class Y? Consider this:

What is it?

It’s a tree-stump; or a block of wood; or a piece of trunk. It could be described in many ways, and the “correctness” of each of these labels is subjective.

What if I told you to “take a seat by the fire”?

In this context it is useful to be able to sensibly interpret the image in the desired way. There’s nothing chairlike about a block of wood. But it nevertheless offers these affordances, if you can perceive them. Rather than “predict class Y of input X”, the task becomes “interpret X as class Y”. For a general intelligence, the latter ability might be as important as classification.

The overlapping-digit segmentation task reported in Sabour et al. is a good test of this type of problem:

Segmentation of overlapping digits (from Sabour et al.)

More to come!

David Rawlinson