
AGI/Artificial General Intelligence/unsupervised learning

Unsupervised Learning with Spike-Timing Dependent Plasticity

Posted by Yi-Ling Hwong on

Our brain is a great source of inspiration for the development of Artificial General Intelligence. Indeed, a common view is that any effort to develop human-level AI is almost destined to fail without an intimate understanding of how the brain works. We do not yet understand our brain that well, but that is a story for another day. In today’s blog post we are going to talk about a machine learning method that takes its inspiration from a biological process underpinning how humans learn – Spike-Timing Dependent Plasticity (STDP).

Biological neurons communicate with each other through synapses, tiny connections between neurons in our brains. A presynaptic neuron is the neuron that fires the electrical impulse (the signal, so to speak), and a postsynaptic neuron is the neuron that receives it. This wiring makes our brain an extremely complex piece of machinery: a typical neuron receives thousands of inputs and sends its signals to over 10,000 other neurons. Incoming signals to a neuron alter its voltage (membrane potential). When these signals accumulate past a threshold value, the neuron produces a sudden increase in voltage for a short time (about 1 ms). We refer to these short bursts of electrical energy as spikes. Computers communicate with bits; neurons use spikes.

Anatomy of a neuron (image credit: Wikimedia)

Artificial Neural Networks (ANNs) attempt to capture this mechanism of neuronal communication through mathematical models. However, these computational models may be an inadequate representation of the brain. To understand the trend towards STDP and why we think it is a viable path forward, let’s back up a little bit and talk briefly about the current common methods in ANNs.

Gradient Descent: the dominant paradigm

Artificial Neural Networks are based on a collection of connected nodes mimicking the behaviour of biological neurons. A receiving (or postsynaptic) neuron receives multiple inputs, multiplies each by a weight, sums them, applies a nonlinear transfer function, and then propagates the resulting signal to other neurons. The weights vary as learning happens, and this process of tweaking the weights is the heart of an artificial neural network. One popular learning algorithm is Stochastic Gradient Descent (SGD). To calculate the gradient of the loss function with respect to the weights, most state-of-the-art ANNs use a procedure called back-propagation. However, the biological plausibility of back-propagation remains highly debatable; for example, there is no evidence of a global error-minimisation mechanism in biological neurons. A better learning algorithm, one that raises the biological realism of our models, might therefore help us move towards AGI. This is where the Spiking Neural Network (SNN) comes in.

The incorporation of timing in an SNN

The main difference between a conventional ANN and an SNN is the neuron model that is used. The neuron model in a conventional ANN does not employ individual spikes in computations. Instead, the output signals from the neurons are treated as normalised firing rates, or frequencies, of inputs within a certain time frame [1]. This is an averaging mechanism, commonly referred to as rate coding. Consequently, input to the network can be real-valued, rather than a binary time series. In contrast, the neuron model of an SNN uses each individual spike: instead of rate coding, an SNN uses pulse coding. What is important here is the incorporation of the timing of firing into computations, as in real neurons. The neurons in an SNN do not fire at every propagation cycle; they fire only when signals from incoming neurons accumulate enough charge to reach a certain threshold voltage.

Basic model of a spiking neuron (Image credit: EPFL)
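The accumulate-and-fire behaviour described above can be sketched as a minimal leaky integrate-and-fire (LIF) neuron. This is purely illustrative; the threshold, leak and input values here are arbitrary choices to show the mechanism, not parameters from any particular SNN implementation:

```python
def lif_neuron(input_current, threshold=1.0, leak=0.95, v_reset=0.0):
    """Leaky integrate-and-fire: accumulate charge, spike at threshold."""
    v = 0.0
    spikes = []
    for i_t in input_current:
        v = leak * v + i_t          # leaky accumulation of incoming charge
        if v >= threshold:          # membrane potential crosses threshold
            spikes.append(1)        # emit a spike...
            v = v_reset             # ...and reset the membrane potential
        else:
            spikes.append(0)
    return spikes

# A steady sub-threshold input still fires periodically as charge accumulates:
print(lif_neuron([0.3] * 10))  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Note the output is a binary time series of spikes, in contrast to the real-valued activations of a rate-coded ANN neuron.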

The use of individual spikes in pulse coding is more biologically accurate in two ways. First, it is a more plausible representation for tasks where speed is an important consideration, such as the human visual system. Studies have shown that humans analyse and classify visual input (e.g. facial recognition) in under 100 ms. Considering it takes at least 10 synaptic steps from the retina to the temporal lobe [2], this leaves about 10 ms of processing time per neuron, which is too little time for an averaging mechanism like rate coding to take place. Hence, an implementation that uses pulse coding might be a more suitable model for object recognition tasks, despite the current popularity of conventional ANNs. Second, the use of only local information (i.e. the timing of spikes) in learning is more biologically realistic than a global error-minimisation mechanism.

Learning using Spike-Timing Dependent Plasticity

The changing and shaping of neuron connections in our brain is known as synaptic plasticity. Neurons fire, or spike, to signal the presence of the feature they are tuned for. As the Canadian psychologist Donald Hebb’s postulate is often summarised: “Neurons that fire together, wire together.” Simply put, when two neurons fire at almost the same time the connections between them are strengthened, and they become more likely to fire together again in the future. When two neurons fire in an uncoordinated manner the connections between them weaken, and they are more likely to act independently in the future. This is known as Hebbian learning. The strengthening of synapses is known as Long Term Potentiation (LTP) and the weakening of synaptic strength as Long Term Depression (LTD). What determines whether a synapse undergoes LTP or LTD is the relative timing of the pre- and postsynaptic firing: if the presynaptic neuron fires within roughly 20 ms before the postsynaptic neuron, LTP occurs; if it fires within roughly 20 ms after, LTD occurs. This is known as Spike-Timing Dependent Plasticity (STDP).

This biological mechanism can be adopted as a learning rule in machine learning. A general approach is to apply a delta rule Δw to each synapse in a network to compute its weight change. The weight change is positive (strengthening the synaptic connection) if the postsynaptic neuron fires just after the presynaptic neuron, and negative if it fires just before. In contrast to the supervised learning performed with back-propagation, STDP is an unsupervised learning method. This is another reason STDP-based learning is believed to more accurately reflect human learning: much of the most important learning we do is experiential and unsupervised, i.e. there is no “right answer” available for the brain to learn from.
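A minimal sketch of such a delta rule, using the common exponential STDP window. The parameter values (A+, A− and the 20 ms time constant τ) are illustrative choices, not taken from any particular paper:

```python
import math

def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Exponential STDP window (illustrative parameters).

    dt > 0: presynaptic spike precedes postsynaptic spike -> LTP.
    dt < 0: presynaptic spike follows postsynaptic spike  -> LTD.
    """
    dt = t_post - t_pre  # spike-time difference in ms
    if dt > 0:
        return a_plus * math.exp(-dt / tau)    # potentiation, decaying with the gap
    elif dt < 0:
        return -a_minus * math.exp(dt / tau)   # depression, decaying with the gap
    return 0.0

# Pre fires 5 ms before post: the synapse strengthens
assert stdp_delta_w(t_pre=10.0, t_post=15.0) > 0
# Pre fires 5 ms after post: the synapse weakens
assert stdp_delta_w(t_pre=15.0, t_post=10.0) < 0
```

Note that the rule uses only local information, the timing of the two spikes at one synapse, with no global error signal.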


STDP represents a potential shift in approach when it comes to developing learning procedures for neural networks. To date it has predominantly been applied to pattern-recognition tasks. One 2015 study using an exponential STDP learning rule achieved 95% accuracy on the MNIST dataset [3], a large handwritten-digit database that is widely used as a benchmark in computer vision. Merely a year later, researchers had made significant further progress. For example, Kheradpisheh et al. achieved 98.5% accuracy on MNIST by combining an SNN with features of deep learning [4]. Their network comprised several convolutional and pooling layers, with STDP learning rules used in the convolutional layers to learn the features. Another interesting study took its inspiration from Reinforcement Learning and combined it with a hierarchical SNN to perform pattern recognition [5]. Using a network structure consisting of two simple and two complex layers and a novel reward-modulated STDP (R-STDP), their method outperformed classic unsupervised STDP on several image datasets. STDP has also been applied to real-time learning to take advantage of its speed [6]: the SNN and fast unsupervised STDP learning method developed there achieved an impressive 21.3 fps in training and 17.9 fps in testing. To put things in perspective, human vision perceives smooth motion at around 24 fps.

Apart from object recognition, STDP has also been applied to speech-recognition tasks. One study used an STDP-trained, nonrecurrent SNN to convert speech signals into spike-train signatures for speech recognition [7]. Another combined a hidden Markov model with an SNN and STDP learning to classify segments of sequential data such as individual spoken words [8]. STDP has also proven useful in modelling pitch perception (i.e. recognising tones): researchers developed a computational model of a neural network that learns, using STDP rules, to identify and strengthen the neuronal connections most effective for the extraction of pitch [9].

Final thoughts

Having learned what we have about STDP, what can we conclude about the state of the art of machine learning? We think that conventional Artificial Neural Networks are probably here to stay. They are simplistic models of neurons, but they do work. However, the extent to which supervised ANNs are suitable for the development of AGI is debatable. On the other hand, while the Spiking Neural Network is a more authentic model of how the human brain works, its performance thus far still lags behind that of ANNs on some tasks, not least because far more research has been done on supervised ANNs than on SNNs.

Despite its intuitive appeal and biological validity, there are also many neuroscientific experiments in which STDP has not matched observations [10]. One major quandary is the observation of LTD in certain hippocampal neurons (the CA3 and CA1 regions, to be precise) when low-frequency (1 Hz) presynaptic stimulation drives postsynaptic firing [11]; conventional STDP wisdom says LTP should happen in this case. The frequency dependence of plasticity does not stop there. At high enough frequencies (i.e. firing rates), the STDP learning rule becomes LTP-only: both positive and negative Δt produce LTP [12]. Several additional mechanisms also appear to influence STDP learning. For example, LTD can be converted to LTP by altering the firing pattern of the postsynaptic spikes: firing ‘bursts’, or even a pair of spikes, in the postsynaptic neuron leads to LTP where single spikes would have led to LTD [13] [14]. Plasticity also appears to accumulate as a nonlinear function of the number of pre- and postsynaptic pairings, with depression accumulating at a lower rate than potentiation, i.e. requiring more pairings [13]. Finally, it seems that neural activity that does not cause any measurable plasticity may have a ‘priming’ effect on subsequent activity: in the CA1 region, for example, LTP could be induced with as few as four stimuli, provided a single priming stimulus was given 170 ms earlier [15].

SNNs’ inferior performance compared to other ANNs might be due to poor scalability. Large-scale SNNs are relatively rare because the computational intensity involved in designing such networks is not yet fully supported by most high-performance computing platforms (there are, however, exceptions such as this and this). Most implementations today use only one or two trainable layers of unsupervised learning, which limits their generalisation capabilities [16]. Moreover, and perhaps most importantly, STDP is vulnerable to the common shortcoming of unsupervised learning algorithms: it works well at sifting out statistically significant features, but has problems identifying rare but diagnostic features, which are crucial in important processes such as decision making. My sense is that if STDP is to become the key to unlocking the secrets of AGI, there needs to be more creativity in its implementation, taking advantage of its biological roots and nuances while striving for a general-purpose learning algorithm.

What do you think? Comment and let us know your thoughts!


[1] Vreeken, J. (2003). Spiking neural networks, an introduction.

[2] Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural networks, 14(6), 715-725.

[3] Diehl, P. U., & Cook, M. (2015). Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9.

[4] Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2016). STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421.

[5] Mozafari, M., Kheradpisheh, S. R., Masquelier, T., Nowzari-Dalini, A., & Ganjtabesh, M. (2017). First-spike based visual categorization using reward-modulated STDP. arXiv preprint arXiv:1705.09132.

[6] Liu, D., & Yue, S. (2017). Fast unsupervised learning for visual pattern recognition using spike timing dependent plasticity. Neurocomputing, 249, 212-224.

[7] Tavanaei, A., & Maida, A. S. (2017). A spiking network that learns to extract spike signatures from speech signals. Neurocomputing, 240, 191-199.

[8] Tavanaei, A., & Maida, A. S. (2016). Training a Hidden markov model with a Bayesian spiking neural network. Journal of Signal Processing Systems, 1-10.

[9] Saeedi, N. E., Blamey, P. J., Burkitt, A. N., & Grayden, D. B. (2016). Learning Pitch with STDP: A Computational Model of Place and Temporal Pitch Perception Using Spiking Neural Networks. PLoS computational biology, 12(4), e1004860.

[10] Shouval, H. Z., Wang, S. S. H., & Wittenberg, G. M. (2010). Spike timing dependent plasticity: a consequence of more fundamental learning rules. Frontiers in Computational Neuroscience, 4.

[11] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. Journal of Neuroscience, 26, 6610-6617.

[12] Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32(6), 1149-1164.

[13] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. Journal of Neuroscience, 26, 6610-6617.

[14] Pike, F. G., Meredith, R. M., Olding, A. W., & Paulsen, O. (1999). Postsynaptic bursting is essential for ‘Hebbian’ induction of associative long-term potentiation at excitatory synapses in rat hippocampus. The Journal of Physiology, 518(2), 571-576.

[15] Rose, G. M., and Dunwiddie, T. V. (1986). Induction of hippocampal long-term potentiation using physiologically patterned stimulation. Neurosci. Lett. 69, 244–248.

[16] Almási, A. D., Woźniak, S., Cristea, V., Leblebici, Y., & Engbersen, T. (2016). Review of advances in neural networks: Neural design technology stack. Neurocomputing, 174, 31-41.

Predictive Coding/pyramidal cell/Rao & Ballard/unsupervised learning

Pyramidal Neurons and Predictive Coding

Posted by David Rawlinson on

Today’s post tries to fit the theoretical concept of Predictive Coding with the unusual structure and connectivity of Pyramidal cells in the Neocortex.

A reconstruction of a pyramidal cell (source: Wikipedia / Wikimedia Commons). Soma and dendrites are labeled in red, axon arbor in blue. 1) Soma (cell body) 2) Basal dendrite (feed-forward input) 3) Apical dendrite (feed-back input) 4) Axon (output) 5) Collateral axon (output).

Pyramidal neurons

Pyramidal neurons are interesting because they are one of the most common neuron types in the computational layers of the neocortex. This almost certainly means they are critical to many of the key cortical functions, such as forming representations of knowledge and reasoning about the world.

Anatomy of a Pyramidal Neuron

Pyramidal neurons are so-called because they tend to have a triangular body (soma). But this isn’t the most interesting feature! While all neurons have dendrites (inputs) and at least one axon (output), Pyramidal cells have more than one type of input – Basal and Apical dendrites.

Apical Dendrite

Pyramidal neurons tend to have a single, long Apical dendrite that extends with few forks a long way from the body of the neuron. When it reaches layer 1 of the cortex (which contains mostly top-down feedback from cortical areas that are believed to represent more abstract concepts), the apical dendrite branches out. This suggests the apical dendrite likes to receive feedback input. If feedback represents more abstract, longer-term context, then this data would be useful for predicting bottom-up input. More on this later.

Basal Dendrites

Pyramidal cells tend to have a few Basal dendrites that branch almost immediately, in the vicinity of the cell body. Note that this means the input provided to basal and apical dendrites is physically separated. We know from analysis of cortical microcircuits that axons terminating around the body of pyramidal cells in cortex layers 2 and 3 contain bottom-up data that is propagating in a feed-forward direction – i.e. information about the external state of the world.


Pyramidal cells have a single axonal output that may fork and may travel a very long distance to its targets, including other areas of the cortex.

Predictive Coding

Predictive Coding (PC) is a method of transforming data from its original form into a representation in terms of prediction errors. There’s not much interest in PC in the Machine Learning community, but in Neuroscience there is substantial evidence that the Cortex encodes information in this way. Similar but unrelated concepts have also been used for efficient compression of data in signal processing. The benefit of this transformation is compression: we assume that only prediction errors are important because, by definition, everything else can be predicted and is therefore sufficiently described elsewhere.

There are several research groups looking at computational models of Predictive Coding – in particular those of Karl Friston and Andy Clark.

Two uses for feedback

Assuming feedback contains a more processed and abstract representation of a broader set of data, it has two uses.

  • Prediction for a more efficient representation of the world (e.g. Predictive Coding)
  • Prediction for more robust interpretation (via integration of top-down information in perception)

Predictive coding aims to transform the representation inside the cortex to a more efficient one that encodes only the relationships between prediction errors. Take some time to decide for yourself whether this loses anything…!

But there are many perceptual phenomena that show how internal state affects perception and interpretation of external input. For example, the phenomenon of multistable perception in some visual illusions: We need to know what we’re looking for before we can see it, and we can deliberately change from one interpretation to another (see figure).

A Necker Cube: This object can be interpreted in two distinct ways; as a cube from slightly above or slightly below. With a little practice you can easily switch between interpretations. One explanation of this is that a high-level decision as to the preferred interpretation is provided as feedback to hierarchically-lower processing areas.

Now consider Bayesian inference, such as Belief Propagation, or Markov Random Fields – in all cases we combine a Prior (e.g. top-down feedback) with a Likelihood produced from current, bottom-up data. Good inference depends on effective integration of both inputs.
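The integration of top-down and bottom-up information can be illustrated with a toy Bayesian update. The numbers here are invented purely to show how an ambiguous likelihood (the Necker cube's bottom-up evidence) is resolved by the prior (top-down feedback):

```python
def posterior(prior, likelihood):
    """Combine a top-down prior with a bottom-up likelihood (Bayes' rule)."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)  # normalising constant
    return [u / z for u in unnorm]

# The bottom-up evidence is ambiguous between the two cube interpretations...
likelihood = [0.5, 0.5]
# ...so the top-down prior decides which interpretation we "see".
print(posterior([0.8, 0.2], likelihood))  # favours "cube from above"
print(posterior([0.2, 0.8], likelihood))  # favours "cube from below"
```

Flipping the prior flips the perceived interpretation, even though the sensory input never changes, which is one way to think about deliberately switching views of the Necker cube.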

Ideally we would be able to resolve how both the modelling and inference benefits could be realized in the pyramidal cell, and how physical segregation of apical & basal dendrites might help this happen.

False-Negative Error Coding

The simplest scheme for predictive coding is simply to propagate only false-negative errors – where something was observed, but it was not predicted in advance. In this encoding, if the event was predicted, simply suppress any output. (Note: This assumes that another mechanism limits the number of false-positive errors – for example a homeostatic system to limit the total number of predictions.)

When a neuron fires, it represents a set of coincident inputs on a number of synapses: a pattern of input was observed. If the neuron was in a “predicted” state immediately prior to firing, then we can safely suppress the output and achieve a simple predictive coding scheme. If the neuron is not in a predicted state when it fires, the output is propagated as normal.

False-Negative Error Coding in Pyramidal Cells

Since Pyramidal cells have 2 distinct inputs – basal and apical dendrites – we can implement the false negative coding as follows:

  • Basal dendrites recognize patterns of bottom-up input; the neuron “represents” those patterns by generating a spike on its axonal output when stimulated by the basal dendrites.
  • Apical dendrite learns to detect input that allows the cell’s spiking to be predicted. The apical dendrite determines the “predicted” state of the cell. Top-down feedback input is used for this purpose.
  • If the cell is “predicted” when the basal dendrite tries to generate an output, then suppress that output.
  • The cell internally self-regulates to ensure that it is rarely in a predicted state, and typically only at the right times.
  • Physical segregation of the two dendrite types ensures that they can target feedback data for prediction and feed-forward data for classification.
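The scheme above can be condensed into a toy model of a single cell. This is our illustrative sketch of false-negative error coding, not a claim about the actual biophysics; the function name and threshold are invented:

```python
def pyramidal_output(basal_drive, predicted, threshold=1.0):
    """False-negative error coding in a single model pyramidal cell.

    basal_drive: summed feed-forward input on the basal dendrites.
    predicted:   True if apical (feedback) input predicted this firing.
    Returns 1 only for *unpredicted* firing; predicted spikes are suppressed.
    """
    fired = basal_drive >= threshold
    if fired and not predicted:
        return 1   # surprising event: propagate the spike
    return 0       # predicted firing (or no firing): suppress output

assert pyramidal_output(1.5, predicted=False) == 1  # unpredicted -> spike
assert pyramidal_output(1.5, predicted=True) == 0   # predicted -> suppressed
assert pyramidal_output(0.5, predicted=False) == 0  # below threshold -> silent
```

The key design point is that the two inputs play asymmetric roles: basal drive determines *whether* the cell would fire, while apical prediction determines *whether that firing is propagated*.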

Spike bursts (spike trains)

When Pyramidal cells fire, they usually don’t fire just once. They tend to generate a short sequence of spikes known as a “burst” or “train”. So it’s possible that False-Negative coding doesn’t completely eliminate the spike, but rather truncates the sequence of spikes to make the output far less significant and less likely to significantly drive activity in other cells. There may also be some benefit to being able to broadcast the event in a subtle way, perhaps as a form of timing signal.

Time series plots of typical spike trains produced by pyramidal cells.

To test this theory, we could look for truncated or absent spike trains in the presence of predictive input to the apical dendrite. Specifically, we would want to observe that input causing a spike in the apical dendrite truncates or eliminates an expected spike train resulting from basal stimulation.

Is there any direct neurological evidence for different integration of spikes from Apical and Basal dendrites in Pyramidal cells? It turns out, yes, there is! Metz, Spruston and Martina [1] say: “… our data present evidence for a dendritic segregation of Kv1-like channels in CA1 pyramidal neurons and identify a novel action for these channels, showing that they inhibit action potential bursting by restricting the size of the [afterdepolarization]”.

Now for the AI/ML audience it’s necessary to translate this a bit. An “action potential” occurs when the membrane potential (voltage) at a specific location on the cell rapidly rises and falls. Action potentials in neurons are also known as “nerve impulses” or “spikes”. So bursting is the generation of a short sequence of rapid spikes.

So in other words, apical stimulation inhibits bursts of axonal output spikes from a pyramidal neuron. There’s our smoking gun!

According to this paper, the Apical dendrite uniquely inhibits the spike burst from soma (the basal dendrites don’t). This matches the behaviour we would expect, if pyramidal cells implement false-negative predictive coding via the different inputs to different dendrite types: If the apical dendrite fires, there’s no axonal burst. If there wasn’t a spike in the apical dendrite, but basal activity drives the cell over its threshold, then the cell output does burst.

Note that there are many other papers with similar claims; we found search terms such as “differential basal apical dendrite integration” to be helpful.

[1] “Dendritic D-type potassium currents inhibit the spike afterdepolarization in rat hippocampal CA1 pyramidal neurons”
Alexia E. Metz, Nelson Spruston and Marco Martina. J. Physiol. 581.1 pp 175–187 (2007)


We’ve seen how a simple model of pyramidal cell function, false-negative error coding, might account for both predictive coding and phenomena such as multistable perception, via the separation of feedback and feed-forward input onto the apical and basal dendrites.

Unlike existing models of predictive coding within the cortex, which often posit separate populations of cells representing predictions and residual errors (e.g. Rao and Ballard, 1999), we have proposed that coding could occur within the known biology of individual pyramidal cells, due to the different integration of apical and basal dendrite activity. At the same time, the proposed method allows feedback and feedforward information to be integrated within the same mechanism.

Over the next few months we’ll be testing some of these ideas in simulation!

AGI/Experimental Framework/MNIST/unsupervised learning

Region-Layer Experiments

Posted by ProjectAGI on
Typical results from our experiments: Some active cells in layer 3 of a 3 layer network, transformed back into the input pixels they represent. The red pixels are positive weights and the blue pixels are negative weights; absence of colour indicates neutral weighting (ambiguity). The white pixels are the input stimulus that produced the selected set of active cells in layer 3. It appears these layer 3 cells collectively represent a generic ‘5’ digit. The input was a specific ‘5’ digit. Note that the weights of the hidden layer cells differ from the input pixels, but are recognizably the same digit.

We are running a series of experiments to test the capabilities of the Region-Layer component. The objective is to understand to what extent these ideas work, and to expose limitations both in implementation and theory.

Results will be posted to the blog and written up for publication if, or when, we reach an acceptable level of novelty and rigor.

We are not trying to beat benchmarks here. We’re trying to show whether certain ideas have useful qualities – the best way to tackle specific AI problems is almost certainly not an AGI way. But what we’d love to see is that AGI-inspired methods can perform close to state-of-the-art (e.g. deep convolutional networks) on a wide range of problems. Now that would be general intelligence!

Dataset Choice

We are going to start with the MNIST digit classification dataset, and perform a number of experiments based on that. In future we will look at some more sophisticated image / object classification datasets such as LabelMe or Caltech_101.

The good thing about MNIST is that it’s simple and has been extremely widely studied. It’s easy to work with the data and the images are a practical size: big enough to be interesting, but not so big as to require lots of preprocessing or too much memory. Despite the images being only 28×28 pixels, variations in digit appearance give considerable depth to the data (see the example digit ‘5’ above).

The bad thing about MNIST is that it’s largely “solved” by supervised learning algorithms. A range of different supervised techniques have reached human performance and it’s debatable whether any further improvements are genuine.

So what’s the point of trying new approaches? Well, supervised algorithms have some odd qualities, perhaps due to the narrowness of training samples or the specificity of the cost function. For example, the discovery of “adversarial examples” – images that look easily classifiable to the naked eye but cannot be classified correctly with a trained network because they exploit weaknesses in trained neural networks.

But the biggest drawback of supervised learning is the need to tell it the “correct” answer for every input. This has led to a range of techniques – such as transfer learning – to make the most of what training data is available, even if not directly relevant. But fundamentally, supervised learning is unlike the experience of an agent learning as it explores its world. Animals can learn without a tutor.

However, unsupervised results with MNIST are less widely reported. Partially this is because you need to come up with a way to measure the performance of an unsupervised method. The most common approach is to use unsupervised networks to boost the performance of a final supervised network layer – but in MNIST the supervised layer is so powerful it’s hard to distinguish the contribution of the unsupervised layers. Nevertheless, these experiments are encouraging because having a few unsupervised layers seems to improve overall performance, compared to all-supervised networks. In addition to the limited data problem with supervised learning, unsupervised learning actually seems to add something.

One possible method of capturing the contribution of unsupervised layers alone is the Rand Index, which measures the similarity between two clusterings. However, we intend to use a distributed representation where there will be overlap between similar representations – that’s one of the features of the algorithm!

So, for now we’re going to go for the simplest approach we can think of, and measure the correlation between the active cells in selected hidden layers and each digit label, and see if the correlation alone is enough to pick the right label given a set of active cells. If the concepts defined by the digits exist somewhere in the hierarchy, they should be detectable as features uniquely correlated with specific labels…

Note also that we’re not doing any preprocessing of the MNIST images except binarization at threshold 0.5. Since the MNIST dataset is very high contrast, hopefully the threshold doesn’t matter much: It’s almost binary already.
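For reference, the binarization step is trivial; this sketch assumes pixel values already scaled to [0, 1]:

```python
import numpy as np

def binarize(image, threshold=0.5):
    """Binarize a [0,1]-scaled image at a fixed threshold."""
    return (np.asarray(image) >= threshold).astype(np.uint8)

# Toy 2x2 "image": values at or above 0.5 become 1, the rest become 0
assert binarize([[0.1, 0.9], [0.5, 0.4]]).tolist() == [[0, 1], [1, 0]]
```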

Sequence Learning Tests

Before we start the experiments proper we conducted some ad-hoc tests to verify the features of the Region-Layer are implemented as intended. Remember, the Region-Layer has two key capabilities:

  • Classification … of the feedforward input, and
  • Prediction … of future classification results (i.e. future internal states)

See here and here to understand the classification role, and here for more information about prediction. Taken together, the ability to classify and predict future classifications allows sequences of input to be learned. This is a topic we have looked at in detail in earlier blog posts and we have some fairly effective techniques at our disposal.

We completed the following tests:

  • Cycle 0,1,2: We verified that the algorithm could predict the set of active cells in a short cycle of images. This ensures the sequence learning feature is working. The same image was used for each instance of a particular digit (i.e. there was no variation in digit appearance).
  • Cycle 0,1,…,9: We tested a longer cycle. Again, the Region-Layer was able to predict the sequence perfectly.
  • Cycle 0,1,2,3,0,2,3,1: We tested an ambiguous cycle. At 0, it appears that the next state can be 1 or 2, and similarly, at 3, the next state can be 1 or 2. However, due to the variable order modelling behaviour of the Region-Layer, a single Region-Layer is able to predict this cycle perfectly. Note that first-order prediction cannot predict this sequence correctly.
  • Cycle 0,1,2,3,1,2,4,0,2,3,1,2,1,5,0,3,2,1,4,5: We tested a complex graph of state sequences and again a single Region-Layer was able to predict the sequence perfectly. We also were able to predict this using only first order modelling and a deep hierarchy.

After completion of the unit tests we were satisfied that our Region-Layer component has the ability to efficiently produce variable order models of observed sequences using unsupervised learning, assuming that the states can reliably be detected.


Now we come to the harder part. What if each digit exemplar image is ambiguous? In other words, what if each ‘0’ is represented by a randomly selected ‘0’ image from the MNIST dataset? The ambiguity of appearance means that the observed sequences will appear to be non-deterministic.

We decided to run the following experiments:

Experiment 1: Random image classification

In this experiment there is no predictable sequence; each digit must be recognized solely from its appearance. We use the classic protocol: up to N training passes over the entire MNIST dataset, followed by fixing the internal weights and a single pass to calculate the correlation between each active cell in selected hidden layer[s] and the digit labels. Then a single pass over the test set records, for each test input image, the most highly correlated digit label for the set of active hidden cells. The algorithm scores a "correct" result if the most correlated label is the correct label.

  • Passes 1-N: Train networks

Present each digit in the training set once, in a random order, and train the internal weights of the algorithm. Repeat several times if necessary.

  • Pass N+1: Measure correlation of hidden layer features with training images.

Present each digit in the training set once, in a random order. Accumulate the frequency with which each active cell is associated with each digit label. After all images have been seen, convert the observed frequencies to correlations.

  • Pass N+2: Predict label of test images. 

Present each digit in the testing set once, in a random order. Use the correlations between cell activity and training labels to predict the most likely digit label given the set of active cells in selected Region-Layer components (they are arranged into a hierarchy).
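The measurement passes (N+1 and N+2) can be sketched as follows. `get_active_cells` and the sizes below are hypothetical stand-ins for whatever the trained, frozen model actually exposes (a binary vector of active hidden cells per image):

```python
import numpy as np

# Illustrative sizes, not the real network dimensions.
NUM_CELLS, NUM_LABELS = 1024, 10

def measure_correlations(images, labels, get_active_cells):
    # Pass N+1: count how often each cell is active alongside each label.
    counts = np.zeros((NUM_CELLS, NUM_LABELS))
    for image, label in zip(images, labels):
        counts[:, label] += get_active_cells(image)
    # Normalise each cell's counts into a distribution over labels.
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1.0)

def predict_label(image, get_active_cells, correlations):
    # Pass N+2: sum label correlations over the active cells and
    # report the most highly correlated label.
    votes = get_active_cells(image) @ correlations  # shape (NUM_LABELS,)
    return int(np.argmax(votes))
```

Once the correlation table is built, classifying a test image is a single matrix-vector product over its active cells.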

Experiment 2: Image classification & sequence prediction

What if the digit images are not in a random order? We can use the English language to generate a training set of digit sequences. For example, we can take a book, convert each character to a two-digit number, and select random appropriate digit images to represent each number.
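A minimal sketch of this dataset generation might look like the following. `mnist_by_digit` is an assumed dict mapping each digit 0-9 to a list of its MNIST images, and the a=0..z=25 character coding is our illustrative choice:

```python
import random

def text_to_digit_sequence(text):
    # Convert each letter to a two-digit code (a=00, b=01, ..., z=25).
    digits = []
    for char in text.lower():
        if 'a' <= char <= 'z':
            code = ord(char) - ord('a')
            digits.extend([code // 10, code % 10])
    return digits

def digits_to_images(digits, mnist_by_digit, rng=random):
    # Replace each digit with a randomly selected exemplar image of it.
    return [rng.choice(mnist_by_digit[d]) for d in digits]

print(text_to_digit_sequence("go"))  # [0, 6, 1, 4]
```

The random exemplar selection is what injects appearance ambiguity on top of the (partially) predictable sequence.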

The motivation for this experiment is to see how the sequence learning can boost image recognition: Our Region-Layer component is supposed to be able to integrate both sequential and spatial information. This experiment actually has a lot of depth because English isn’t entirely predictable – if we use a different book for testing, there’ll be lots of sub-sequences the algorithm has never observed before. There’ll be uncertainty in image appearance and uncertainty in sequence, and we’d like to see how a hierarchy of Region-Layer components responds to both. Our expectation is that it will improve digit classification performance beyond the random image case.

In the next article, we will describe the specifics of the algorithms we implemented and tested on these problems.

A final article will present some results.

AGI/Artificial General Intelligence/deep belief networks/Reading List/Reinforcement Learning/sparse coding/Sparse Distributed Representations/stationary problem/unsupervised learning

Reading list – May 2016

Posted by ProjectAGI on
Digit classification error over time in our experiments. The image isn’t very helpful but it’s a hint as to why we’re excited 🙂

Project AGI

A few weeks ago we paused the “How to build a General Intelligence” series (part 1, part 2, part 3, part 4). We paused it because the next article in the series requires us to specify everything in detail, and we need working code to do that.

We have been testing our algorithm on a variety of MNIST-derived handwritten digit datasets, to better understand how well it generalizes its representation of digit-images and how it behaves when exposed to varying degrees of predictability. Initial results look promising: We will post everything here once we’ve verified them and completed the first batch of proper experiments. The series will continue soon!

Deep Unsupervised Learning

Our algorithm is a type of Online Deep Unsupervised Learning, so naturally we’re looking carefully at similar algorithms.

We recommend this video of a talk by Andrew Ng. It starts with a good introduction to the methods and importance of feature representation and touches on types of automatic feature discovery. He looks at some of the important feature detectors in computer vision, such as SIFT and HOG, and shows how feature detectors – such as edge detectors – can emerge from more general pattern recognition algorithms such as sparse coding. For more on sparse coding see Shakir's excellent machine learning blog.

For anyone struggling to intuit deep feature discovery, I also loved this post on yCombinator which nicely illustrates how and why deep networks discover useful features, and why the depth helps.

The latter part of the video covers Ng’s latest work on deep hierarchical sparse coding using Deep Belief Networks, in turn based on AutoEncoders. He reports benchmark-beating results on video activity and phoneme recognition with this framework. You can find details of his deep unsupervised algorithm here:

Finally he presents a plot suggesting that training dataset size is a more important determinant of eventual supervised network performance than algorithm choice! This is a fundamental limitation of supervised learning, where the necessary labelled training data is much more limited than in unsupervised learning (in the latter case, the real world provides a handy training set!).

Effect of algorithm and training set size on accuracy. Training set size more significant. This is a fundamental limitation of supervised learning.

Online K-sparse autoencoders (with some deep-ness)

We’ve also been reading this paper by Makhzani and Frey about deep online learning with autoencoders (neural networks trained with supervised techniques to reconstruct their own input, so that the overall procedure is unsupervised; this is sometimes called self-supervised learning). We've struggled to find any comparison of autoencoders to earlier methods of unsupervised learning, in terms of both computational efficiency and ability to cover the search space effectively. Let us know if you find a paper that covers this.

The Makhzani paper has some interesting characteristics – the algorithm is online, which means it receives data as a stream rather than in batches. It is also sparse, which we believe is desirable from a representational perspective.
One limitation is that the solution is most likely unable to handle changes in input data statistics (i.e. non-stationary problems). This matters because, in any arbitrarily deep network, the typical vertex sits between higher and lower vertices. If all vertices are continually learning, the problem being modelled by any single vertex is constantly changing. Therefore, intermediate vertices must be capable of online learning of non-stationary problems, otherwise the network will not be able to function effectively. Makhzani and Frey instead use the greedy layerwise training approach from Deep Belief Networks. The authors describe this approach:
“4.6. Deep Supervised Learning Results: The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layerwise pre-training (Bengio et al., 2007). We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers.”

The limitation introduced can be thought of as an inability to escape from local minima that result from prior training. This paper by Choromanska et al. tries to explain why this happens.
Greedy layerwise training is an attempt to work around the fact that deep belief networks of autoencoders cannot effectively handle non-stationary problems.
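As a concrete illustration of the greedy layerwise procedure quoted above, here is a toy sketch using tied-weight k-sparse autoencoders. The update rule (a plain gradient step on the decoder with the sparse codes held fixed) is a simplification of Makhzani and Frey's actual training algorithm:

```python
import numpy as np

def k_sparse(h, k):
    # Keep the k largest activations per sample, zero the rest.
    drop = np.argsort(h, axis=1)[:, :-k]
    out = h.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out

def train_ksparse_layer(X, hidden, k, lr=1e-4, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (X.shape[1], hidden))
    for _ in range(epochs):
        H = k_sparse(X @ W, k)            # sparse hidden codes
        X_hat = H @ W.T                   # tied-weight reconstruction
        W -= lr * (X_hat - X).T @ H       # reduce reconstruction error
    return W

def greedy_pretrain(X, layer_sizes, k):
    weights = []
    for hidden in layer_sizes:
        W = train_ksparse_layer(X, hidden, k)
        weights.append(W)                 # freeze this layer's features...
        X = k_sparse(X @ W, k)            # ...and train the next on its codes
    return weights
```

Each layer is trained to reconstruct its input and then frozen, and its sparse codes become the training data for the next layer, which is exactly the structure the quote describes, and exactly why earlier layers cannot adapt once the statistics change.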

For more information, here are some papers on deep sparse networks built from autoencoders:

Variations on Supervised Learning – a Taxonomy

Back to supervised learning, and the limitation of training dataset size. Thanks to a discussion with Jay Chakravarty we have this brief taxonomy of supervised learning workarounds for insufficient training datasets:

Weakly supervised learning: [for poorly labelled training data] you want to learn models for object recognition under weak supervision – you have object labels for images, but no localization (e.g. a bounding box) of the object in the image (there might be other objects in the image as well). You could use a Latent SVM to localize the objects in the images while simultaneously learning a classifier for them.
Another example of weakly supervised learning: you have a bag of positive samples mixed up with negative training samples, plus a bag of purely negative samples – you would use Multiple Instance Learning for this.

Cross-modal adaptation: where one mode of data supervises another – e.g. audio supervises video or vice-versa.

Domain adaptation: a model learnt on one set of data is adapted, in an unsupervised fashion, to new datasets with slightly different data distributions.

Transfer learning: using the knowledge gained in learning one problem on a different, but related problem. Here’s a good example of transfer learning, a finalist in the NVIDIA 2016 Global Impact Award. The system learns to predict poverty from day and night satellite images, with very few labelled samples.

Full paper:

Interactive Brain Concept Map

We enjoyed this interactive map of the distribution of concepts within the cortex captured using fMRI and produced by the Gallant Lab (source papers here).

Using the map you can find the voxels corresponding to various concepts which, although perhaps not generalizable due to the small sample size (7 subjects), gives you a good idea of the hierarchical structure the brain has produced, and what the intermediate concepts represent.

Thanks to David Ray @ for the link.

Interactive brain concept map

OpenAI Gym – Reinforcement Learning platform

We also follow the OpenAI project with interest. OpenAI have just released their “Gym” – a platform for training and testing reinforcement learning algorithms. Have a play with it here:

According to Wired magazine, OpenAI will continue to release free and open source software (FOSS) for the wider impact this will have on uptake. There are many companies now competing to win market share in this space.

The Talking Machines Blog

We’re regular readers of this blog and have been meaning to mention it for months. Worth reading.

How the brain generates actions

A big gap in our knowledge is how the brain generates actions from its internal representation. This new paper by Vicente et al challenges the established (rather vague) dogma on how the brain generates actions.
“We found that contrary to common belief, the indirect pathway does not always prevent actions from being performed, it can actually reinforce the performance of actions. However, the indirect pathway promotes a different type of actions, habits.”

This is probably quite informative for reverse-engineering purposes. Full paper here.

Hierarchical Temporal Memory

HTM is an online method for feature discovery and representation, and now we have a baseline result for HTM on the famous MNIST numerical digit classification problem. Since HTM works with time-series data, the paper compares HTM to LSTM (Long Short-Term Memory), the leading supervised-learning approach to this problem domain.

It is also interesting that the paper deals with adaptation to sudden changes in the input data statistics, the very problem that frustrates the deep belief networks described above.

Full paper by Cui et al here.

For a detailed mathematical description of HTM see this paper by Mnatzaganian and Kudithipudi.

AGI/AlphaGo/Artificial General Intelligence/deep convolutional networks/HQSOM/machine learning/Reading List/unsupervised learning

Reading list: Assorted AGI links. March 2016

Posted by ProjectAGI on
A Minecraft API is now available to train your AGIs

Our News

We are working hard on experiments, and software to run experiments. So this week there is no normal blog post. Instead, here’s an eclectic mix of links we’ve noticed recently.

First, AlphaGo continues to make headlines. Of interest to Project AGI is Yann LeCun agreeing with us that unsupervised hierarchical modelling is an essential step in building intelligence with humanlike qualities [1]. We also note this IEEE Spectrum post by Jean-Christophe Baillie [2] which argues, as we did [3], that we need to start creating embodied agents.


Speaking of which, the BBC reports that the Minecraft team are preparing an API for machine learning researchers to test their algorithms in the famous game [4]. The Minecraft team also stress the value of embodied agents and the depth of gameplay and graphics. It sounds like Minecraft could be a crucial testbed for an AGI. We’re always on the lookout for test problems like these.

Of course, to play Minecraft well you need to balance local activities – building, mining etc. – with exploration. Another frontier, beyond AlphaGo, is exploration. Monte-Carlo Tree Search (as used in AlphaGo) explores in more limited ways than humans do, argues John Langford [5].

Sharing places with robots 

If robots are going to be embodied, we need to make some changes. Wired magazine says that a few small changes to the urban environment and driver behaviour will make the rollout of autonomous vehicles easier [6]. It’s important to meet the machines halfway, for the benefit of all.

This excellent paper on robotic grasping also caught our attention [7]. A key challenge in this area is adaptability to slightly varying circumstances, such as variations in the objects being grasped and their pose relative to the arm. General solutions to these problems will suddenly make robots far more flexible and applicable to a greater range of tasks.

Hierarchical Quilted Self-Organizing Maps & Distributed Representations

Last week I also rediscovered this older paper on Hierarchical-Quilted Self-Organizing Maps (HQSOMs) [8]. This is close to our hearts because we originally believed this type of representation was the right approach for AGI. With the success of Deep Convolutional Networks (DCNs), it's worth looking back and noticing the similarities between the two. While HQSOM is purely unsupervised learning (a plus; see the comment from Yann LeCun above), DCNs are trained by supervised techniques. However, both methods use small, overlapping, independent units – analogous to biological cortical columns – to classify different patches of the input. The overlapping and independent classifiers lead to robust and distributed representations, which is probably the reason these methods work so well.

Distributed representation is one of the key features of Hawkins’ Hierarchical Temporal Memory (HTM). Fergal Byrne has recently published an updated description of the HTM algorithm [9] for those interested.

We at Project AGI believe that a grid-like “region” of columns employing a “Winner-Take-All” policy [10], with overlapping input receptive fields, can produce a distributed representation. Different regions are then connected together into a tree-like structure (acyclic). The result is a hierarchy. Not only does this resemble the state-of-the-art methods of DCNs, but there’s a lot of biological evidence for this type of representation too. This paper by Rinkus [11] describes columnar features arranged into a hierarchy, with winner-take-all behaviour implemented via local inhibition.
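Here is a toy sketch of that idea: a one-dimensional "region" of columns with overlapping receptive fields, each applying winner-take-all among its cells. The random preferred patterns and the dot-product matching rule are illustrative assumptions, not our actual algorithm:

```python
import numpy as np

def wta_region(input_vec, columns, patch_size, stride, cells_per_column, seed=0):
    rng = np.random.default_rng(seed)
    span = max(1, len(input_vec) - patch_size + 1)
    active = []
    for c in range(columns):
        start = (c * stride) % span       # overlapping receptive field
        patch = input_vec[start:start + patch_size]
        # Each cell in the column has a random preferred pattern.
        prefs = rng.normal(size=(cells_per_column, patch_size))
        # Local inhibition: only the best-matching cell fires.
        winner = int(np.argmax(prefs @ patch))
        active.append(c * cells_per_column + winner)
    return active  # one active cell per column: a sparse distributed code
```

With, say, 8 columns of 4 cells, the output is always 8 active cells out of 32: a fixed-sparsity distributed code, which is the property we are after.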

Rinkus says: “Saying only that a group of L2/3 units forms a WTA CM places no a priori constraints on what their tuning functions or receptive fields should look like. This is what gives that functionality a chance of being truly generic, i.e., of applying across all areas and species, regardless of the observed tuning profiles of closely neighboring units.”

Reinforcement Learning 

But unsupervised learning can’t be the only form of learning. We also need to consider consequences, and so we need reinforcement learning to take account of these. As Yann LeCun said, reinforcement learning is the “cherry on the cake” (this probably understates the difficulty of the RL component, but right now it seems easier than creating representations).

Shakir’s Machine Learning blog has a great post exploring the biology of reinforcement learning [12] within the brain. This is a good overview of the topic and useful for ML researchers wanting to access this area.

But regular readers of this blog will remember that we’re obsessed with unfolding or inverting abstract plans into concrete actions. We found a great paper by Manita et al [13] that shows biological evidence for the translation and propagation of an abstract concept into sensory and motor areas, where it can assist with perception. This is the hierarchy in action.

Long Short-Term Memory (LSTM)

One more tack before we finish. Thanks to Jay for this link to NVIDIA’s description of LSTMs [14], an architecture for recurrent neural networks (i.e. the state can depend on the previous state of the cells). It’s a good introduction, but we’re still fans of Monner’s Generalized LSTM [15].

Fun thoughts

Now let’s end with something fun. Wired magazine again, describing watching AlphaGo as our first taste of a superhuman intelligence [16]. Although this is a “narrow” intelligence, not a general one, it has qualities beyond anything we’ve experienced in this domain before. What’s more, watching these machines can make us humans better, without any nasty bio-engineering:

“But as hard as it was for Fan Hui to lose back in October and have the loss reported across the globe—and as hard as it has been to watch Lee Sedol’s struggles—his primary emotion isn’t sadness. As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. “So beautiful,” he says. “So beautiful.””