Attention in Artificial Intelligence systems

Posted by Yi-Ling Hwong on

One of the features of our brain is its modularity. It is characterised by distinct but interacting subsystems that underlie key functions such as memory, language and perception. Understanding the complex interplay between these modules requires decomposing them into a set of components. Indeed, this modular approach to modelling the brain is one of the ways neuroscientists study it. The ‘module’ that I am going to talk about in this blog post has its roots in neuroscience but has greatly inspired AI research: attention. I will focus on the neuroscience aspect of attention first before moving on to review some of the most exciting developments using attentional mechanisms in machine learning.

The neuroscientific roots of attention

Neuroscientists have long studied attention as an important cognitive process. It is described as the ability of organisms to ‘select a subset of available information upon which to focus for enhanced processing and integration’ and encompasses three aspects: orienting, filtering and searching. Visual attention, for example, is an active area of research. Our ability to focus on a specific area of a visual scene, and to extract and process the information that is streamed to our brain, is thought to be an evolutionary trait that all but guaranteed the survival of our species. This capability to select, process and act upon sensory experience has inspired a whole branch of research in computational modelling of visual attention.

Visual attention (image credit: Wikimedia)

The emergence of a whole suite of sophisticated equipment to scan and study the brain has further fanned the flames of enthusiasm for attention research. In a recent study using eye tracking and fMRI data, Leong et al. demonstrated the bidirectional interaction between attention and learning: attention facilitates learning, and learned values in turn inform attentional selection [1]. The relationship between attention and consciousness is a complex issue, and in many senses both a scientific and a philosophical exploration. The ability to focus one’s thoughts on one out of several simultaneous objects or trains of thought, and to take control of one’s own mind in a vivid and conscious manner, is not just a delightful and useful perk. It is a quintessential part of our experience of human-ness.

Given their significance, attentional mechanisms have in recent years received increasing attention (pun intended) from the AI community. A detailed explanation of how they are applied in machine learning will require a separate blog post (I highly recommend this excellent article by Olah and Carter) but in essence attention layers provide the functionality of focusing on specific elements to improve the performance of a model. In an image recognition task, for example, an attention mechanism does so by taking ‘glimpses’ of the input image at each step, updating the internal state representations, and then selecting the next location to sample. In a cluttered setting or when the input is too big, attention serves a ‘prioritisation’ function to filter out irrelevant elements. It is a powerful technique that can be used when interfacing with a neural network that has a repeating structure in its output. For example, when applied to augment an LSTM (a special variant of recurrent neural network), it lets every step of the RNN select information to look at from a larger body of information. However, attentional mechanisms are not just useful in RNNs, as we will find out below.

State of the art using attention in machine learning

In machine learning, attention is especially useful in sequence prediction problems. Let’s review a few of the major areas where it has been applied successfully.

1. Natural language processing

Attentional mechanisms have been applied in many natural language processing (NLP) related tasks. The seminal work by Bahdanau et al. proposed a neural machine translation model that implements an attention mechanism in the decoder for English-to-French translation [2]. As the encoder reads the English input, the decoder outputs the French translation, and the attention mechanism, trained by stochastic gradient descent together with the rest of the network, learns to shift its focus to the parts of the input surrounding the word that is being translated. Their RNN-based model clearly outperformed a conventional RNN encoder-decoder and reached translation quality comparable to a state-of-the-art phrase-based system. RNNs are the incumbent architecture for text applications, but they do not allow for parallelisation, which limits how well they can exploit the GPU hardware that powers modern machine learning. A team of Facebook AI researchers introduced a novel approach using convolutional neural networks (which are highly parallelisable) and a separate attention module in each decoder layer. As opposed to Bahdanau et al.’s ‘single step attention’, theirs is a multi-hop attention module. This means that instead of looking at the sentence once and then translating it without looking back, the mechanism takes multiple glimpses at the sentence to determine what it will translate next. Their approach outperformed state-of-the-art results for English-German and English-French translation at an order of magnitude faster speed [3]. Other examples of attentional mechanisms being applied to NLP problems include text classification [4], language grounding (performing tasks described by natural language instructions in a 3D game-play environment) [5] and text comprehension (answering cloze-style questions about a document) [6].
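To make the mechanism a little more concrete, here is a minimal sketch of one step of the additive attention described by Bahdanau et al. [2], written in plain Python. The function names, the toy vectors and the identity weight matrices are my own assumptions for illustration; in the real model the weights are learned jointly with the RNN encoder and decoder.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    # matrix is a list of rows.
    return [dot(row, vector) for row in matrix]

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (context, weights) for a single decoder step."""
    dec_proj = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        # Score each source position with a small feed-forward network.
        enc_proj = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(dec_proj, enc_proj)]
        scores.append(dot(v, hidden))
    weights = softmax(scores)  # one weight per source word, summing to 1
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of the encoder states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# Toy values (assumptions, not taken from the paper).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = additive_attention(decoder_state, encoder_states,
                                      identity, identity, [1.0, 1.0])
```

The decoder would then use `context` alongside its own hidden state to predict the next target word, and `weights` can be inspected to see which source words the model focused on.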

2. Object recognition

Object recognition is one of the hallmarks of machine intelligence. Mnih et al. demonstrated how an attentional mechanism can be used to ignore irrelevant objects in a scene, allowing the model to perform well in challenging object recognition tasks in the presence of clutter [7]. In their Recurrent Attention Model (RAM), the agent receives a partial observation of the environment at each step and learns where to focus (i.e. pay attention to) next through training an RNN. Attention is used to produce a ‘glimpse feature vector’ in which the region around a target pixel is encoded at high resolution and pixels further from the target pixel are encoded at progressively lower resolution. Using a similar approach, another study used a deep recurrent attention model to both localise and recognise multiple objects in images [8]. Xu et al. trained a model that automatically learns to describe the content of images [9]. Their attention model is a multilayer perceptron conditioned on the previous hidden state, meaning that where the network looks next depends on the sequence of words that has already been generated. The researchers showed how to use convolutional neural networks to pay attention to images when outputting a sequence, i.e. the image caption. Another advantage of attention in this case is the insight gained by approximately visualising where and what the attention focused on (i.e. what the model ‘sees’).
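Going back to the ‘glimpse’ idea in RAM [7], the sketch below shows one way such a multi-resolution glimpse could be built: square patches of doubling width are cropped around the same centre and each is shrunk to the size of the smallest patch. The function names, the zero padding at the image border and the use of average pooling for the resizing are assumptions made for this illustration.

```python
def extract_patch(image, cy, cx, size):
    """Crop a size x size window around (cy, cx); pixels outside the image count as 0."""
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for y in range(cy - half, cy - half + size):
        row = []
        for x in range(cx - half, cx - half + size):
            inside = 0 <= y < height and 0 <= x < width
            row.append(image[y][x] if inside else 0.0)
        patch.append(row)
    return patch

def downsample(patch, out_size):
    """Average-pool a square patch down to out_size x out_size."""
    factor = len(patch) // out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            block = [patch[by * factor + dy][bx * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def glimpse(image, cy, cx, base_size=2, num_scales=3):
    """Return num_scales patches: sharp near the centre, coarser further out."""
    patches = []
    for scale in range(num_scales):
        size = base_size * 2 ** scale  # each patch is twice as wide as the last
        patches.append(downsample(extract_patch(image, cy, cx, size), base_size))
    return patches

# An 8 x 8 toy image whose pixel value encodes its position.
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
patches = glimpse(image, cy=4, cx=4)
```

Every patch has the same number of values, but the later ones cover a much wider area, which is what gives the model sharp central vision and blurry peripheral vision.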

Telling mistakes in image caption generation with visual attention (image taken from Xu et al., 2016)

3. Gameplay

Google DeepMind’s Deep Q-Network (DQN) represented a significant advance in Reinforcement Learning and a breakthrough in general AI in the sense that it showed a single algorithm can learn to play a wide variety of Atari 2600 games: the agent was able to continually adapt its behaviour without any human intervention. Sorokin et al. added attention to the equation and developed the Deep Attention Recurrent Q-Network (DARQN) [10]. Their model outperformed DQN by incorporating what they termed ‘soft’ and ‘hard’ attention mechanisms. The attention network takes the current game state as input and generates a context vector based on the features observed. An LSTM then takes this context vector, along with the previous hidden state and memory state, to evaluate the actions that the agent can take. Choi et al. further improved on DARQN by implementing a multi-focus attention network in which the agent is capable of attending to multiple important elements [11]. In contrast to DARQN, which uses only one attention layer, their model uses multiple parallel attention layers to attend to the entities that are relevant to the problem.

4. Generative models

Attention has also proven useful in generative models, systems that can simulate (i.e. generate) values of any variable (inputs and outputs) in the model. Hong et al. developed a deep generative model based on a convolutional neural network for semantic segmentation (the task of assigning class labels to groups of pixels in an image) [12]. By incorporating attention-like mechanisms they were able to capture transferable segmentation knowledge across categories. The attention mechanism adaptively focuses on different areas depending on the input labels. A softmax function is used to encourage the model to pay attention to only a segment of the image. Another example is Google DeepMind’s Deep Recurrent Attentive Writer (DRAW) neural network for image generation [13]. Attention allows the system to build up an image incrementally (shown in the video below). The attention model is fully differentiable (making it possible to train with gradient descent), thus allowing the encoder to focus on only part of the input and the decoder to modify only a part of the canvas. The model achieved impressive results generating images from the MNIST data set, and when trained on the Street View House Numbers data set it generated images that are almost identical to the real data.

5. Attention alone for NLP tasks

Another exciting line of research focusses on using attentional mechanisms alone for NLP tasks traditionally solved with recurrent or convolutional neural networks. Vaswani et al. developed the Transformer, a simple network architecture based solely on a novel multi-head attention mechanism, for translation tasks [14]. They compute the attention function on a set of queries simultaneously using dot-product attention (each key is multiplied with the query to see how similar they are) with an additional scaling factor. The multi-head approach allows their model to attend to information from different positions at the same time. Their model completely forgoes recurrence and convolutions but still managed to attain state-of-the-art results for English-to-German and English-to-French translation. Moreover, they achieved this in significantly less training time, and their model is highly parallelisable. An earlier work by Parikh et al. experimented with a simple attention-based approach to solve natural language inference tasks [15]. They used attention to decompose the problem into subproblems that can be solved individually, hence making the model trivially parallelisable.
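For readers who prefer formulas, the scaled dot-product and multi-head attention of Vaswani et al. [14] can be written as below. The symbols follow the paper rather than this post: Q, K and V are the matrices of queries, keys and values, d_k is the dimension of the keys (its square root is the scaling factor), h is the number of heads, and the W matrices are learned projections.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```

Each row of the softmax output is a set of weights over all positions, so every query can draw on the whole sequence in a single matrix multiplication, which is what makes the model so easy to parallelise.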

Not just a cog in the machine

What we have learned about attention so far tells us it is likely to be an essential component in the development of general AI. Philosophically, it is a key feature of the human psyche, which makes it a natural inclusion in pursuits that concern the grey matter, while computationally, attention-based mechanisms have helped boost model performance to deliver stunning results in many areas. Attention has also proven to be a versatile technique, as is evident in its ability to replace recurrent layers in machine translation and other NLP related tasks. But it is most powerful when used in conjunction with other components, as Kaiser et al. demonstrated in their study One Model To Learn Them All, which presented a model capable of solving a number of problems spanning multiple domains [16]. To be sure, attentional mechanisms are not without weaknesses. As Olah and Carter suggested, their propensity to take every action at every step (albeit to varying extents) could potentially be very costly computationally. Nonetheless, I believe that in a modular approach to developing general AI (IMO our best bet in this quest) attention will be a worthwhile, and perhaps even indispensable, module.


[1] Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451-463.

[2] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.

[4] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. In HLT-NAACL (pp. 1480-1489).

[5] Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. (2017). Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv preprint arXiv:1706.07230.

[6] Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).

[8] Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

[9] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., … & Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

[10] Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

[11] Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. AAAI Publications, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

[12] Hong, S., Oh, J., Lee, H., & Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3204-3212).

[13] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

[15] Parikh, A., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model. In Empirical Methods in Natural Language Processing.

[16] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.


Continuous Learning

Posted by Gideon Kowadlo on

The standard machine learning approach is to learn to accomplish a specific task with an associated dataset. A model is trained using the dataset and is only able to perform that one task. This is in stark contrast to animals which continue to learn throughout life and accumulate and re-purpose knowledge and skills. The limitation has been widely acknowledged and addressed in different ways, and with a variety of terminology, which can be confusing. I wanted to take a brief look at those approaches and to create a precise definition of the Continuous Learning that we want to implement in our pursuit of AGI.

Transfer Learning is a term that has been used a lot recently in the context of Deep Learning. It was actually first discussed in a paper by Pratt in 1993. Transfer Learning techniques reuse knowledge gained on one task for related tasks, on either the same or similar datasets. A classic example is learning to recognise cars and then applying the model to the task of recognising trucks. Another is learning to recognise a different aspect of objects in the same dataset, such as learning to recognise petals instead of leaves in a dataset containing many plants.

One type of Transfer Learning is Domain Adaptation. It refers to the idea of learning on one domain, or data distribution, and then applying the model to and optimising it for a related data distribution. Training a model on different data distributions is often referred to as Multi Domain Learning. In some cases the distributions are similar, but other times they are deliberately unrelated.

The term Lifelong Learning appeared at about the same time as Transfer Learning, in a paper by Thrun in 1994. He describes it as an approach that “addresses situations in which a learner faces a series of different learning tasks providing the opportunity for synergy among them”. It overlaps with Transfer Learning, but the emphasis is on gathering general purpose knowledge that transfers across multiple consecutive tasks for an ‘entire lifetime’. Thrun demonstrated results with real robotic systems.

Curriculum Learning, as described by Bengio, is a special case of Lifelong or Transfer Learning, where the objective is to optimise performance on a specific task, rather than across different tasks. It does this by starting with an easy version of that one task and making it progressively harder.

Online Learning algorithms learn iteratively with new data, in contrast to learning from a pass of a whole dataset, as is commonly done in conventional supervised and unsupervised learning, referred to as Batch Learning. Batches can also refer to portions of the dataset.

Online Learning is useful when the whole dataset does not fit into memory at once or, more relevant for AGI, in scenarios where new data is observed over time. Examples include new samples being generated by users of a system or by an agent exploring its environment, and cases where the phenomenon being modelled changes. Another way to describe it is that the underlying input data distribution is not static, i.e. it is a non-stationary distribution; hence these are referred to as Non-stationary Problems.
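As a rough illustration of the difference from Batch Learning, the sketch below updates a tiny linear model one sample at a time as the data arrives, without ever storing the dataset. The data stream, the learning rate and all the names are hypothetical choices made for this example.

```python
def online_update(weights, bias, x, y, lr=0.1):
    """One gradient step on the squared error of a single new sample."""
    prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = prediction - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

def stream(num_samples=2000):
    """A hypothetical stream of observations generated by y = 2x + 1."""
    for step in range(num_samples):
        x = (step % 10) / 10.0
        yield [x], 2.0 * x + 1.0

weights, bias = [0.0], 0.0
for x, y in stream():
    # Each sample is seen once, used for an update, and then discarded.
    weights, bias = online_update(weights, bias, x, y)
```

If the generating rule inside `stream` changed part way through (a non-stationary distribution), the same loop would gradually track the new rule, and in doing so would drift away from the old one, which is exactly the forgetting issue discussed next.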

Online Learning systems can be susceptible to ‘forgetting’, that is, becoming less effective at modelling older data. The worst case is failing completely and suddenly, known as Catastrophic Forgetting or Catastrophic Interference.

Incremental Learning, as the name suggests, is about learning bit by bit, extending the model and improving performance over time. Incremental Learning explicitly handles the level of forgetting of past data. In this way, it is a type of online learning that avoids catastrophic forgetting.

In One-shot Learning, the algorithm is able to learn from one or very few examples. Instance Learning is one way of achieving that, constructing hypotheses from the training instances directly.
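A minimal sketch of the instance-based idea is a nearest-neighbour classifier: the stored examples are the hypothesis, so a single example per class is already enough to make predictions. The feature vectors and labels below are made up for illustration.

```python
import math

def nearest_neighbour(query, examples):
    """Return the label of the stored example closest to the query."""
    best_label, best_distance = None, math.inf
    for features, label in examples:
        distance = math.dist(query, features)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# One (hypothetical) training instance per class.
examples = [([0.0, 0.0], "A"), ([1.0, 1.0], "B"), ([0.0, 1.0], "C")]
prediction = nearest_neighbour([0.9, 0.8], examples)
```

Learning a new class only requires appending one more pair to `examples`; nothing is retrained, so nothing already known is overwritten.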

A related concept is Multi-Modal Learning, where a model is trained on different types of data for the same task. An example is learning to classify letters from the way they look with visual data, and the way they sound, with audio.

Now that we have some greater clarity around these terms, we recognise that they are all important features of what we consider to be Continuous Learning for a successful AGI agent. I think it’s instructive to express it in terms of traits in the context of an autonomous agent. I’ve mapped these traits to the associated Machine Learning algorithm concepts.

Trait | ML Terminology
Uses learnt information to help with subsequent tasks. Builds on its knowledge. Enables more complex behaviour and faster learning. | Transfer Learning; Curriculum Learning
As features of the task change gradually, it will adapt. This will not cause catastrophic forgetting. | Domain Adaptation; Non-stationary input distributions; Iterative Learning
Can learn entirely new tasks. This will not cause catastrophic forgetting of old tasks. Also, it can learn these new tasks as well as it would have, if it was the first task learnt i.e. learning a task does not impede the ability to learn subsequent tasks. | Iterative Learning
Learns important aspects of the task from very few examples. It has the ability to learn fast when necessary. | One-shot Learning
Continues to learn as it collects more data. | Online Learning
Combines sensory modalities to learn a task. | Multi-modal Learning

Note that in continuous learning, if there are fixed resources, and you are operating at your limit, then there has to be some forgetting, but as mentioned in the table, it should not be ‘catastrophic forgetting’.

Reading list – August 2017

Posted by Yi-Ling Hwong on

1. Neuroscience-inspired Artificial Intelligence

Authors: Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick
Type: Review article in Neuron
Publication date: 19 July 2017

This paper outlined the contribution of neuroscience to the most recent advances in AI and argued that the study of neural computation in humans and other animals could provide useful (albeit subtle) inspiration to AI researchers, stimulating questions about specific aspects of learning and intelligence that could guide algorithm design.

  • Four specific examples of neuroscientific inspirations that are currently used in AI were mentioned: attentional mechanism, episodic memory, working memory and continual learning
  • Four areas where neuroscience could be relevant for future AI research were also mentioned: intuitive understanding of the physical world, efficient (or rapid) learning, transfer learning, imagination and planning

2. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Authors: William Lotter, Gabriel Kreiman, David Cox
Type: arXiv preprint (accompanying codebase available here)
Publication date: 25 May 2016

The PredNet architecture (image credit: PredNet)

This paper introduced ‘PredNet’, a predictive neural network architecture that is able to predict future frames in a video sequence using a deep, recurrent convolutional network with both bottom-up and top-down connections.

  • The study demonstrated the potential for video to be used in unsupervised learning, where prediction of future frames can serve as a powerful learning signal, given that an agent must have an implicit model of the objects that constitute the environment and how they are allowed to move.
  • By training using car-mounted camera videos, results showed that the network was able to learn to predict both the movement of the camera and the movement of the objects in the camera’s view.

3. Distral: Robust Multitask Reinforcement Learning

Authors: Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu
Type: arXiv preprint
Publication date: 13 July 2017

This paper proposed a method to overcome a common problem in Deep Reinforcement Learning, whereby training on multiple related tasks negatively affects performance on the individual tasks, even though intuition tells us that solutions to related tasks should improve learning since the tasks share common structures.

  • The authors developed Distral (Distill & transfer learning), based on the idea of a shared ‘policy’ that distills common behaviours or representations from task-specific policies.
  • Knowledge obtained in an individual task is distilled into the shared policy and then transferred to other tasks.

4. How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Authors: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao
Type: Conference proceeding published at the Neural Information Processing Systems Conference
Publication year: 2012

This paper tackled the problem of how the brain combines sensory input and prior knowledge when making decisions in the natural world.

  • The authors derived a model based on the framework of partially observable Markov decision processes (POMDPs) and computed the optimal behaviour for sequential decision making tasks.
  • Their results suggest that decision making in our brain may be controlled by the dual principles of Bayesian inference and reward maximisation.
  • The proposed model offered a unifying explanation for experimental data previously accounted for by two competing models for incorporating prior knowledge, the additive offset model that assumes static influence of the prior, and the dynamic weighing model that assumes a time-varying effect.

5. First-spike based visual categorization using reward-modulated STDP

Authors: Milad Mozafari, Saeed Reza Kheradpisheh, Timothée Masquelier, Abbas Nowzari-Dalini, Mohammad Ganjtabesh
Type: arXiv preprint
Publication date: 25 May 2017

This paper proposed a hierarchical Spiking Neural Network (SNN) equipped with a novel Reward-modulated STDP (R-STDP) learning algorithm to solve object recognition tasks without using an external classifier.

  • The learning algorithm combined the principles of Reinforcement Learning and STDP.
  • The network is structured as a feedforward convolutional SNN with four layers; however, training took place in only one layer.
  • R-STDP outperformed STDP on several datasets.

6. A Distributional Perspective on Reinforcement Learning

Authors: Marc G. Bellemare, Will Dabney, Rémi Munos
Type: arXiv preprint
Publication date: 21 July 2017

This paper sought to provide a more complete picture of reinforcement learning (RL) by incorporating the concept of value distribution, understood as the distribution of the random return received by a learning agent.

  • The main object of the study is a random return Z that is characterised by the interaction of three random variables: the reward R, the next state-action, and its random return.
  • The authors designed a new algorithm using this distributional perspective to learn approximate value distribution and obtained state of the art results, at the same time demonstrating the importance of the value distribution in approximate RL.
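The relationship between those three random variables is the paper’s distributional Bellman equation, reproduced here as a reminder. The notation is the paper’s, not this post’s: γ is the usual discount factor, (X', A') is the random next state-action pair, and the equality holds in distribution rather than in expectation.

```latex
Z(x, a) \overset{D}{=} R(x, a) + \gamma\, Z(X', A'),
\qquad
Q(x, a) = \mathbb{E}\left[ Z(x, a) \right]
```

Taking the expectation of both sides recovers the familiar Bellman equation for the value function Q, which is why the value distribution can be seen as a strictly richer object than the value itself.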

AGI Conference 2017

Posted by David Rawlinson on

During the conference we met many interesting people, including Hiroshi Yamakawa and Koichi Takahashi from the Whole Brain Architecture Initiative, who presented at the AGA17 workshop.

We attended the 10th Conference on Artificial General Intelligence (AGI 2017), which was held in our hometown of Melbourne, Australia! Excitingly, the IJCAI 2017 conference is also in Melbourne this week and ICML 2017 was in Sydney this year. In particular, the “Architectures for Generality and Autonomy” workshop may be of interest to readers.

You can find a preprint of our paper here, and also download our slides from the conference.

AGI Experimental Framework

Posted by Gideon Kowadlo on

We’re very excited to launch the AGI Experimental Framework (AGIEF), our open source framework.

We first introduced it a while back, at the end of 2015 here, and it has certainly come a long way.

AGIEF was created to make running rigorous AI experiments convenient, reproducible and scalable. The goals are:

  • Repeatability: ability to save/load, stop/start an experiment from any execution step, and know that it will execute deterministically
  • Visualisation: ability to visualise all the data structures at any step
  • Distributed operation for performance

The Github wiki and Readme describe the project in detail and how to get started.

The framework comprises 3 repositories.

agi – Java project comprising core algorithmic code and framework package to support compute nodes.

run-framework – Python scripts to run and interact with the compute nodes covering aspects such as generating input files, launching cloud infrastructure, running those experiments (locally or remotely), executing parameter sweeps and exporting and uploading the output artefacts.

experiment-definitions – contains the experiment definitions, the files required to run and repeat specific experiments.

Unsupervised Learning with Spike-Timing Dependent Plasticity

Posted by Yi-Ling Hwong on

Our brain is a source of great inspiration for the development of Artificial General Intelligence. In fact, one of the common views is that any effort in developing human-level AI is almost destined to fail without an intimate understanding of how the brain works. However, we do not understand our brain that well yet. But that is another story for another day. In today’s blog post we are going to talk about a learning method in machine learning that takes its inspiration from a biological process underpinning how humans learn – Spike Timing Dependent Plasticity (STDP).

Biological neurons communicate with each other through synapses, which are tiny connections between neurons in our brains. A presynaptic neuron is the neuron that fires the electrical impulse (the signal, so to speak), and a postsynaptic neuron is the neuron that receives this impulse. The wiring of the neurons makes our brain an extremely complex piece of machinery: a typical neuron receives thousands of inputs and sends its signals to over 10,000 other neurons. Incoming signals to a neuron alter its voltage (potential). When these signals reach a threshold value the neuron will produce a sudden increase in voltage for a short time (1ms). We refer to these short bursts of electrical energy as spikes. Computers communicate with bits, while neurons use spikes.

Anatomy of a neuron (image credit: Wikimedia)

Artificial Neural Networks (ANNs) attempt to capture this mechanism of neuronal communication through mathematical models. However, these computational models may be an inadequate representation of the brain. To understand the trend towards STDP and why we think it is a viable path forward, let’s back up a little bit and talk briefly about the current common methods in ANNs.

Gradient Descent: the dominant paradigm

Artificial Neural Networks are based on a collection of connected nodes mimicking the behaviour of biological neurons. A receiving (or postsynaptic) neuron receives multiple inputs, multiplies each of them by a weight, sums the weighted signals, applies a nonlinear transfer function, and then propagates the result to other neurons. The weights vary as learning happens, and this process of tweaking the weights is the most important thing in an artificial neural network. One popular learning algorithm is Stochastic Gradient Descent (SGD). To calculate the gradient of the loss function with respect to the weights, most state-of-the-art ANNs use a procedure called back-propagation. However, the biological plausibility of back-propagation remains highly debatable. For example, there is no evidence of a global error minimisation mechanism in biological neurons. Therefore, a learning algorithm that raises the biological realism of our models might help us to move towards AGI. And this is where the Spiking Neural Network (SNN) comes in.
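The description above can be sketched for a single artificial neuron as follows. The sigmoid transfer function, the squared-error loss, the learning rate and the toy numbers are assumptions chosen to keep the example short; real networks stack many such units and use back-propagation to pass the error signal through the layers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, inputs):
    # Weighted sum of the inputs followed by a nonlinear transfer function.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def sgd_step(weights, bias, inputs, target, lr=0.5):
    """One gradient descent step on the loss 0.5 * (output - target) ** 2."""
    output = forward(weights, bias, inputs)
    # Chain rule: dLoss/dz = (output - target) * sigmoid'(z).
    delta = (output - target) * output * (1.0 - output)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias

weights, bias = [0.1, -0.2], 0.0
inputs, target = [1.0, 0.5], 1.0
before = forward(weights, bias, inputs)
for _ in range(100):
    weights, bias = sgd_step(weights, bias, inputs, target)
after = forward(weights, bias, inputs)
```

Note that the update needs the error between the output and a known target, which is the global, supervised signal whose biological plausibility is in question.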

The incorporation of timing in an SNN

The main difference between a conventional ANN and an SNN is the neuron model that is used. The neuron model used in a conventional ANN does not employ individual spikes in computations. Instead, the output signals from the neurons are treated as normalised firing rates, or frequencies, within a certain time frame [1]. This is an averaging mechanism and is commonly referred to as rate coding. Consequently, input to the network can be real values, instead of a binary time-series. In contrast, each individual spike is used in the neuron model of an SNN. Instead of rate coding, an SNN uses pulse coding. What is important here is the incorporation of the timing of the firing in computations, as in real neurons. The neurons in an SNN do not fire at every propagation cycle. They only fire when signals from other incoming neurons cause charge accumulation that reaches a certain threshold voltage.
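The threshold behaviour can be sketched with a leaky integrate-and-fire neuron, one of the simplest spiking neuron models. It is used here only as an assumed stand-in for whichever neuron model a particular SNN adopts, and the threshold, leak factor and input currents are arbitrary illustrative values.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, v_reset=0.0):
    """Return the time steps at which a leaky integrate-and-fire neuron spikes."""
    v = v_reset
    spike_times = []
    for t, current in enumerate(input_current):
        v = leak * v + current  # accumulate charge, with some leaking away
        if v >= threshold:      # fire only once the threshold is reached
            spike_times.append(t)
            v = v_reset         # the potential resets after a spike
    return spike_times

weak = simulate_lif([0.05] * 50)   # input too weak to ever reach threshold
strong = simulate_lif([0.3] * 50)  # stronger input produces regular spikes
```

The output is a list of spike times rather than a single real-valued activation, so the information is carried by when the neuron fires.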

Basic model of a spiking neuron (Image credit: EPFL)

The use of individual spikes in pulse coding is more biologically accurate in two ways. First, it is a more plausible representation for tasks where speed is an important consideration, for example in the human visual system. Studies have shown that humans analyse and classify visual input (e.g. facial recognition) in under 100ms. Considering it takes at least 10 synaptic steps from the retina to the temporal lobe [2], this leaves about 10ms of processing time for each neuron. This is too little time for an averaging mechanism like rate coding to take place. Hence, an implementation that uses pulse coding might be a more suitable model for object recognition tasks, which is currently not the norm given the popularity of conventional ANNs. Second, the use of only local information (i.e. the timing of spikes) in learning is a more biologically realistic representation than a global error minimisation mechanism.

Learning using Spike-Timing Dependent Plasticity

The changing and shaping of neuron connections in our brain is known as synaptic plasticity. Neurons fire, or spike, to signal the presence of the feature that they are tuned for. The idea proposed by the Canadian psychologist Donald Hebb is often summarised as “Neurons that fire together, wire together.” Simply put, when two neurons fire at almost the same time the connections between them are strengthened and thus they become more likely to fire together again in the future. When two neurons fire in an uncoordinated manner the connections between them weaken and they are more likely to act independently in the future. This is known as Hebbian learning. The strengthening of synapses is known as Long Term Potentiation (LTP) and the weakening of synaptic strength is known as Long Term Depression (LTD). What determines whether a synapse will undergo LTP or LTD is the timing between the pre- and postsynaptic firing. If the presynaptic neuron fires within the 20 ms before the postsynaptic neuron fires, LTP occurs; if the presynaptic neuron fires within the 20 ms after the postsynaptic neuron fires, LTD occurs. This is known as Spike-Timing Dependent Plasticity (STDP).

This biological mechanism can be adopted as a learning rule in machine learning. A general approach is to apply a delta rule Δw to each synapse in a network to compute its weight change. The weight change will be positive (therefore increasing the strength of the synaptic connection) if the postsynaptic neuron fires just after the presynaptic neuron, and negative if the postsynaptic neuron fires just before the presynaptic neuron. Compared with the supervised learning algorithm employed in backpropagation, STDP is an unsupervised learning method. This is another reason STDP-based learning is believed to more accurately reflect human learning, given that much of the most important learning we do is experiential and unsupervised, i.e. there is no “right answer” available for the brain to learn from.
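The sign convention of the weight change described above can be sketched in a few lines. This is a minimal, pair-based exponential STDP rule written for illustration; the function name, the amplitudes `a_plus` and `a_minus`, and the use of 20 ms as the time constant are assumptions, not values taken from a specific paper.

```python
import math


def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for one pre/post spike pair (spike times in ms).

    Amplitudes and the time constant are illustrative assumptions.
    """
    dt = t_post - t_pre
    if dt > 0:
        # Presynaptic spike came first: potentiation, stronger for closer spikes.
        return a_plus * math.exp(-dt / tau)
    if dt < 0:
        # Postsynaptic spike came first: depression, stronger for closer spikes.
        return -a_minus * math.exp(dt / tau)
    return 0.0  # simultaneous spikes: no change in this sketch


print(stdp_delta_w(t_pre=0.0, t_post=5.0))   # positive: synapse strengthened
print(stdp_delta_w(t_pre=5.0, t_post=0.0))   # negative: synapse weakened
```

Note that the rule uses only the two spike times at a single synapse; no label or global error signal is involved, which is what makes it an unsupervised, local learning rule.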


STDP represents a potential shift in approach when it comes to developing learning procedures in neural networks. Recent research shows that it has predominantly been applied in pattern recognition related tasks. One 2015 study using an exponential STDP learning rule achieved 95% accuracy on the MNIST dataset [3], a large handwritten digit database that is widely used as a training dataset for computer vision. Merely a year later, researchers had made significant progress. For example, Kheradpisheh et al. achieved 98.5% accuracy on MNIST by combining SNNs and features of deep learning [4]. The network they used comprised several convolutional and pooling layers, and STDP learning rules were used in the convolutional layers to learn the features. Another interesting study took its inspiration from Reinforcement Learning and combined it with a hierarchical SNN to perform pattern recognition [5]. Using a network structure that consists of two simple and two complex layers and a novel reward-modulated STDP (R-STDP), their method outperformed classic unsupervised STDP on several image datasets. STDP has also been applied in real-time learning to take advantage of its speedy nature [6]. The SNN and fast unsupervised STDP learning method that was developed achieved an impressive 21.3 fps in training and 17.9 fps in testing. To put things in perspective, human eyes are able to detect around 24 fps.

Apart from object recognition, STDP has also been applied in speech recognition related tasks. One study uses an STDP-trained, nonrecurrent SNN to convert speech signals into a spike train signature for speech recognition [7]. Another study combines a hidden Markov model with an SNN and STDP learning to classify segments of sequential data such as individual spoken words [8]. STDP has also proven to be a useful learning method in modelling pitch perception (i.e. recognising tones). Researchers developed a computational model using a neural network that learns using STDP rules to identify (and strengthen) the neuronal connections that are most effective for the extraction of pitch [9].

Final thoughts

Having learned what we have about STDP, what can we conclude about the state of the art of machine learning? We think that conventional Artificial Neural Networks are probably here to stay. They are simplistic models of neurons but they do work. However, the extent to which supervised ANNs would be suitable in the development of AGI is debatable. On the other hand, while the Spiking Neural Network is a more authentic model of how the human brain works, its performance thus far still lags behind that of ANNs on some tasks, not least because a lot more research has been done on supervised ANNs than on SNNs. Despite its intuitive appeal and biological validity, there are also many neuroscientific experiments in which STDP has not matched observations [10]. One major quandary is the observation of LTD in certain hippocampal neurons (CA3 and CA1 regions, to be precise) when low frequency (1 Hz) presynaptic stimulation drives postsynaptic firing [11]. Conventional STDP wisdom says LTP should happen in this case. The frequency-dependence of plasticity does not stop here. At high enough frequencies (i.e. firing rates), the STDP learning rule becomes LTP-only. That is, both pre-before-post and post-before-pre spike timings produce LTP [12]. Several additional mechanisms also appear to influence STDP learning. For example, LTD can be converted to LTP by altering the firing pattern of the postsynaptic spikes: firing ‘bursts’ or even a pair of spikes in the postsynaptic neuron leads to LTP where single spikes would have led to LTD [13] [14]. Plasticity also appears to accumulate as a nonlinear function of the number of pre- and postsynaptic pairings, with depression accumulating at a lower rate than potentiation, i.e. requiring more pairings [13]. Finally, it seems that neural activity that does not cause any measurable plasticity may have a ‘priming’ effect on subsequent activity. In the CA1 region, for example, LTP could be activated with as few as four stimuli, provided that a single priming stimulus was given 170 ms earlier [15].

SNNs’ inferior performance when compared to conventional ANNs might be due to their poor scalability. Large-scale SNNs are relatively rare because the computational intensity involved in designing such networks is not yet fully supported by most high performance computing platforms (there are, however, exceptions). Most implementations today use only one or two trainable layers of unsupervised learning, which limits their generalisation capabilities [16]. Moreover, and perhaps most importantly, STDP is vulnerable to the common shortcoming of unsupervised learning algorithms: it works well in sifting out statistically significant features but has problems identifying rare but diagnostic features, which are crucial in important processes such as decision making. My sense is that if STDP is to become the key to unlocking the secrets of AGI, there needs to be more creativity in its implementation, taking advantage of its biological roots and nuances while striving for a general purpose learning algorithm.

What do you think? Comment and let us know your thoughts!


[1] Vreeken, J. (2003). Spiking neural networks, an introduction.

[2] Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural networks, 14(6), 715-725.

[3] Diehl, P. U., & Cook, M. (2015). Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9.

[4] Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2016). STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421.

[5] Mozafari, M., Kheradpisheh, S. R., Masquelier, T., Nowzari-Dalini, A., & Ganjtabesh, M. (2017). First-spike based visual categorization using reward-modulated STDP. arXiv preprint arXiv:1705.09132.

[6] Liu, D., & Yue, S. (2017). Fast unsupervised learning for visual pattern recognition using spike timing dependent plasticity. Neurocomputing, 249, 212-224.

[7] Tavanaei, A., & Maida, A. S. (2017). A spiking network that learns to extract spike signatures from speech signals. Neurocomputing, 240, 191-199.

[8] Tavanaei, A., & Maida, A. S. (2016). Training a Hidden markov model with a Bayesian spiking neural network. Journal of Signal Processing Systems, 1-10.

[9] Saeedi, N. E., Blamey, P. J., Burkitt, A. N., & Grayden, D. B. (2016). Learning Pitch with STDP: A Computational Model of Place and Temporal Pitch Perception Using Spiking Neural Networks. PLoS computational biology, 12(4), e1004860.

[10] Shouval, H. Z., Wang, S. S. H., & Wittenberg, G. M. (2010). Spike timing dependent plasticity: a consequence of more fundamental learning rules. Frontiers in Computational Neuroscience, 4.

[11] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. Journal of Neuroscience, 26, 6610-6617.

[12] Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32(6), 1149-1164.

[13] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. Journal of Neuroscience, 26, 6610-6617.

[14] Pike, F. G., Meredith, R. M., Olding, A. W., & Paulsen, O. (1999). Postsynaptic bursting is essential for ‘Hebbian’ induction of associative long-term potentiation at excitatory synapses in rat hippocampus. The Journal of Physiology, 518(2), 571-576.

[15] Rose, G. M., & Dunwiddie, T. V. (1986). Induction of hippocampal long-term potentiation using physiologically patterned stimulation. Neuroscience Letters, 69, 244-248.

[16] Almási, A. D., Woźniak, S., Cristea, V., Leblebici, Y., & Engbersen, T. (2016). Review of advances in neural networks: Neural design technology stack. Neurocomputing, 174, 31-41.

Predictive Coding/pyramidal cell/Rao & Ballard/unsupervised learning

Pyramidal Neurons and Predictive Coding

Posted by David Rawlinson on

Today’s post tries to fit the theoretical concept of Predictive Coding with the unusual structure and connectivity of Pyramidal cells in the Neocortex.

A reconstruction of a pyramidal cell (source: Wikipedia / Wikimedia Commons). Soma and dendrites are labeled in red, axon arbor in blue. 1) Soma (cell body) 2) Basal dendrite (feed-forward input) 3) Apical dendrite (feed-back input) 4) Axon (output) 5) Collateral axon (output).

Pyramidal neurons

Pyramidal neurons are interesting because they are one of the most common neuron types in the computational layers of the neocortex. This almost certainly means they are critical to many of the key cortical functions, such as forming representations of knowledge and reasoning about the world.

Anatomy of a Pyramidal Neuron

Pyramidal neurons are so-called because they tend to have a triangular body (soma). But this isn’t the most interesting feature! While all neurons have dendrites (inputs) and at least one axon (output), Pyramidal cells have more than one type of input – Basal and Apical dendrites.

Apical Dendrite

Pyramidal neurons tend to have a single, long Apical dendrite that extends with few forks a long way from the body of the neuron. When it reaches layer 1 of the cortex (which contains mostly top-down feedback from cortical areas that are believed to represent more abstract concepts), the apical dendrite branches out. This suggests the apical dendrite likes to receive feedback input. If feedback represents more abstract, longer-term context, then this data would be useful for predicting bottom-up input. More on this later.

Basal Dendrites

Pyramidal cells tend to have a few Basal dendrites that branch almost immediately, in the vicinity of the cell body. Note that this means the input provided to basal and apical dendrites is physically separated. We know from analysis of cortical microcircuits that axons terminating around the body of pyramidal cells in cortex layers 2 and 3 contain bottom-up data that is propagating in a feed-forward direction – i.e. information about the external state of the world.


Pyramidal cells have a single Axonal output that may fork, and may travel a very long distance to its targets including other areas of the cortex.

Predictive Coding

Predictive Coding (PC) is a method of transforming data from its original form to a representation in terms of prediction errors. There’s not much interest in PC in the Machine Learning community, but in Neuroscience there is substantial evidence that the Cortex encodes information in this way. Similar but unrelated concepts have also been used for efficient compression of data in signal processing. The benefit of this transformation is due to compression: We assume that only prediction errors are important, because by definition, everything else can be predicted and is therefore sufficiently described elsewhere.

There are several research groups looking at computational models of Predictive Coding – in particular those of Karl Friston and Andy Clark.

Two uses for feedback

Assuming feedback contains a more processed and abstract representation of a broader set of data, it has two uses.

  • Prediction for a more efficient representation of the world (e.g. Predictive Coding)
  • Prediction for more robust interpretation (via integration of top-down information in perception)

Predictive coding aims to transform the representation inside the cortex to a more efficient one that encodes only the relationships between prediction errors. Take some time to decide for yourself whether this loses anything…!

But there are many perceptual phenomena that show how internal state affects perception and interpretation of external input. For example, the phenomenon of multistable perception in some visual illusions: We need to know what we’re looking for before we can see it, and we can deliberately change from one interpretation to another (see figure).

A Necker Cube: This object can be interpreted in two distinct ways; as a cube from slightly above or slightly below. With a little practice you can easily switch between interpretations. One explanation of this is that a high-level decision as to the preferred interpretation is provided as feedback to hierarchically-lower processing areas.

Now consider Bayesian inference, such as Belief Propagation, or Markov Random Fields – in all cases we combine a Prior (e.g. top-down feedback) with a Likelihood produced from current, bottom-up data. Good inference depends on effective integration of both inputs.
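As a worked illustration of combining a Prior with a Likelihood, consider the Necker cube: the bottom-up image is equally consistent with both interpretations, so the top-down prior decides the outcome. The sketch below is a toy calculation; the hypothesis names and all probability values are made up for the example.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a discrete set of hypotheses: posterior ∝ prior × likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Bottom-up evidence: the drawing fits both interpretations equally well (assumed values).
likelihood = {"from_above": 0.5, "from_below": 0.5}

# Top-down feedback expressed as a prior favouring one interpretation (assumed values).
prior_above = {"from_above": 0.8, "from_below": 0.2}
prior_below = {"from_above": 0.2, "from_below": 0.8}

print(posterior(prior_above, likelihood))  # favours "from_above"
print(posterior(prior_below, likelihood))  # favours "from_below"
```

With an uninformative likelihood the posterior simply follows the prior, which matches the experience of deliberately flipping between the two interpretations of the cube.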

Ideally we would be able to resolve how both the modelling and inference benefits could be realized in the pyramidal cell, and how physical segregation of apical & basal dendrites might help this happen.

False-Negative Error Coding

The simplest scheme for predictive coding is simply to propagate only false-negative errors – where something was observed, but it was not predicted in advance. In this encoding, if the event was predicted, simply suppress any output. (Note: This assumes that another mechanism limits the number of false-positive errors – for example a homeostatic system to limit the total number of predictions.)

When a neuron fires, it represents a set of coincident input on a number of synapses. A pattern of input was observed. If the neuron was in a “predicted” state, immediately prior to firing, then we could safely suppress the output and achieve a simple predictive coding scheme. If a neuron is not in a predicted state when it fires, then the output should be propagated as normal.

False-Negative Error Coding in Pyramidal Cells

Since Pyramidal cells have 2 distinct inputs – basal and apical dendrites – we can implement the false negative coding as follows:

  • Basal dendrites recognize patterns of bottom-up input; the neuron “represents” those patterns by generating a spike on its axonal output when stimulated by the basal dendrites.
  • Apical dendrite learns to detect input that allows the cell’s spiking to be predicted. The apical dendrite determines the “predicted” state of the cell. Top-down feedback input is used for this purpose.
  • If the cell is “predicted” when the basal dendrite tries to generate an output, then suppress that output.
  • The cell internally self-regulates to ensure that it is rarely in a predicted state, and typically only at the right times.
  • Physical segregation of the two dendrite types ensures that they can target feedback data for prediction and feed-forward data for classification.
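The scheme in the list above can be sketched as a toy function. This is a deliberately simplified illustration, not a biophysical model: the function name, the two thresholds and the scalar "drive" inputs are all assumptions made for the example, and the self-regulation step is left out.

```python
def pyramidal_output(basal_drive, apical_drive,
                     basal_threshold=1.0, apical_threshold=1.0):
    """Toy false-negative error coding: return 1 if the cell emits an output, else 0.

    Thresholds and scalar inputs are illustrative assumptions.
    """
    # Basal dendrites: was a bottom-up pattern recognised?
    recognised = basal_drive >= basal_threshold
    # Apical dendrite: did top-down feedback put the cell in a "predicted" state?
    predicted = apical_drive >= apical_threshold
    # Only unpredicted recognitions (false-negative prediction errors) are propagated.
    return int(recognised and not predicted)


print(pyramidal_output(basal_drive=1.2, apical_drive=0.0))  # observed, not predicted
print(pyramidal_output(basal_drive=1.2, apical_drive=1.5))  # observed and predicted
```

Keeping the two inputs as separate arguments mirrors the physical segregation of the dendrite types: the prediction never drives the output by itself, it only gates it.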

Spike bursts (spike trains)

When Pyramidal cells fire, they usually don’t fire just once. They tend to generate a short sequence of spikes known as a “burst” or “train”. So it’s possible that False-Negative coding doesn’t completely eliminate the spike, but rather truncates the sequence of spikes to make the output far less significant and less likely to significantly drive activity in other cells. There may also be some benefit to being able to broadcast the event in a subtle way, perhaps as a form of timing signal.

Time series plots of typical spike trains produced by pyramidal cells.

To find evidence for this theory, we could look for truncated or absent spike trains in the presence of predictive input to the apical dendrite. Specifically, we would hope to observe that input causing a spike in the apical dendrite truncates or eliminates an expected spike train resulting from basal stimulation.

Is there any direct neurological evidence for different integration of spikes from Apical and Basal dendrites in Pyramidal cells? It turns out, yes, there is! Metz, Spruston and Martina [1] say: “… our data present evidence for a dendritic segregation of Kv1-like channels in CA1 pyramidal neurons and identify a novel action for these channels, showing that they inhibit action potential bursting by restricting the size of the [afterdepolarization]”.

Now for the AI/ML audience it’s necessary to translate this a bit. An “action potential” “occurs when the membrane potential (voltage) of a specific axon location rapidly rises and falls. Action potentials in neurons are also known as ‘nerve impulses’ or ‘spikes’.” So bursting is the generation of a short sequence of rapid spikes.

So in other words, apical stimulation inhibits bursts of axonal output spikes from a pyramidal neuron. There’s our smoking gun!

According to this paper, the Apical dendrite uniquely inhibits the spike burst from soma (the basal dendrites don’t). This matches the behaviour we would expect, if pyramidal cells implement false-negative predictive coding via the different inputs to different dendrite types: If the apical dendrite fires, there’s no axonal burst. If there wasn’t a spike in the apical dendrite, but basal activity drives the cell over its threshold, then the cell output does burst.

Note there are many other papers with similar claims; we found search terms such as “differential basal apical dendrite integration” to be helpful.

[1] “Dendritic D-type potassium currents inhibit the spike afterdepolarization in rat hippocampal CA1 pyramidal neurons”
Alexia E. Metz, Nelson Spruston and Marco Martina. J. Physiol. 581.1 pp 175–187 (2007)


We’ve seen how we might combine the observed phenomena of multistable perception via separation of feedback and feed-forward input to the basal and apical dendrites, and predictive coding, via a simple model of pyramidal cell function by false-negative error coding.

Unlike existing models of predictive coding within the cortex, which often posit separate populations of cells representing predictions and residual errors (e.g. Rao and Ballard, 1999), we have proposed that coding could occur within the known biology of individual pyramidal cells, due to the different integration of apical and basal dendrite activity. At the same time, the proposed method allows feedback and feedforward information to be integrated within the same mechanism.

Over the next few months we’ll be testing some of these ideas in simulation!


Introducing Yi-Ling

Posted by Yi-Ling Hwong on

Hello everyone! I am Yi-Ling and I am the newest member of the AGI project team. It is an incredibly exciting time to be dipping one’s toes in the field of Artificial Intelligence, given the impressive progress and explosion of AI applications in recent years. In my case, I am actually going to dive in and fully immerse myself in one of the frontier issues and thrilling challenges of AI – Artificial General Intelligence. I will be documenting my journey and learnings in the form of blog posts on this website, and hopefully spark some interesting discussions with you. But before I do that, here’s a little bit about myself so you get a peek of the person behind the words.

Who am I

I was born in and grew up on a beautiful tropical island called Penang on the Northwest of Malaysia. I left Malaysia at the age of 20 to pursue a tertiary education in Germany, majoring in power engineering. Upon graduation I was awarded the Marie Curie Fellowship program funded by the European Commission to work as a software engineer at the European Organisation for Nuclear Research (CERN) in Geneva, Switzerland. The experiment that I was working for, CMS, was one of the two experiments that first discovered the Higgs Boson. I have also worked for several nonprofit organisations, including Doctors without Borders, as a digital communication specialist. I am currently a PhD candidate at the University of New South Wales in Sydney, Australia. My research concerns the impact of social media science communication on public trust in science.

I also do a bunch of stuff outside of science. I am a Toastmaster, the Editor in Chief of the Scientific Malaysian magazine, and a salsa dancer. I used to play the keyboard in a rock band and my idea of a perfect Sunday involves jazz, a hammock, coffee, and a good book.

Why did I join AGI

For as long as I can remember, I have been fascinated by the human brain. Not so much its structure and composition, but what it is capable of. My deep fascination with what makes us conscious and sentient beings capable of extraordinary feats — both good and evil — has followed me through my diverse career. I have always believed that the best science happens when humans are driven not by intellect alone, but by a deeper and more visceral desire to understand our nature and the universe. The fact that the pursuits of AGI at times border on the philosophical makes it all the more interesting to me. My current research is tangentially related to AI in that I am applying machine learning to study big data. I would like to think that my becoming a member of the AGI project is a step up into finally fulfilling a lifelong dream.

I believe in the vision and mission of the AGI project. Although I have just met Gideon and Dave, they strike me as intelligent, passionate and generous human beings who genuinely wish to accomplish something meaningful. And I want to be a part of it.

What will I be doing

One of the missions of the AGI project is to rally and connect the community of AI researchers and practitioners. Our blog is one way for us to reach out and network with the community. I will be involved in the research aspect of the AGI project, and will be sharing learnings and ideas through a series of blog posts. There are many topics that we are currently exploring e.g. sparse coding, unsupervised learning algorithms, deep hierarchical reinforcement learning etc. These are relatively new (or at least less-reported on) concepts compared with the current deep learning paradigm which has mainly focused on backpropagation techniques. However we believe they harbour promising potential to tackle the AGI problem. I will be reviewing the literature on these areas and writing about them. This is useful not only for our own record, but by sharing openly about what we are currently working on, we hope to engage you in conversations.

So stay tuned and till next time folks.


Open Sourcing MNIST and NIST Preprocessing Code

Posted by Gideon Kowadlo on

In our most recent post we discussed the current set of experiments that we are conducting, using the MNIST dataset. We’ve also been looking at the NIST dataset which is similar, but extends to handwritten letters (as well as digits).

These are extremely popular datasets and freely available, so make a great choice for testing and comparing an algorithm with the benchmarks.

The MNIST data is not available directly as images though. It is distributed in a standard format, but not a common one. It’s easy to find snippets of code to convert this format into standard images (such as PNG or JPG), but putting them together and getting them working is not where you want to spend your time; you would rather be designing and running your experiment!

We’ve been through that phase, so we are very happy to open source our code to make it easier for others to get going faster.

These are simple, small, self-contained Java projects with ZERO dependencies. There are two projects: one preprocesses MNIST files into images; the other processes NIST images to make them equivalent to the MNIST images, so that they can easily be used in the same experimental setup. See the README for more information about the steps taken.
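For readers curious about the file format itself, here is a minimal sketch of parsing an MNIST image file (the IDX format). It is written in Python for brevity and is not the code from our Java projects; the function name is hypothetical. It assumes the file has already been decompressed and read into memory as bytes.

```python
import struct


def parse_idx_images(data):
    """Parse the bytes of an IDX image file into a list of images.

    Each image is a list of rows, each row a list of pixel values 0-255.
    """
    # Header: four big-endian 32-bit integers (magic, image count, rows, columns).
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:  # 0x00000803: unsigned bytes, 3 dimensions
        raise ValueError("not an IDX image file")
    images = []
    offset = 16
    for _ in range(count):
        pixels = data[offset:offset + rows * cols]
        # Pixels are stored row by row, one unsigned byte per pixel.
        images.append([list(pixels[r * cols:(r + 1) * cols]) for r in range(rows)])
        offset += rows * cols
    return images


# Tiny synthetic file: two 2x3 "images" with pixel values 0..11.
sample = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
print(parse_idx_images(sample))
```

Writing the parsed pixels out as PNG or JPG needs an image library, which is where most of the remaining effort goes.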



AGI/Experimental Framework/MNIST/unsupervised learning

Region-Layer Experiments

Posted by ProjectAGI on
Typical results from our experiments: Some active cells in layer 3 of a 3 layer network, transformed back into the input pixels they represent. The red pixels are positive weights and the blue pixels are negative weights; absence of colour indicates neutral weighting (ambiguity). The white pixels are the input stimulus that produced the selected set of active cells in layer 3. It appears these layer 3 cells collectively represent a generic ‘5’ digit. The input was a specific ‘5’ digit. Note that the weights of the hidden layer cells differ from the input pixels, but are recognizably the same digit.

We are running a series of experiments to test the capabilities of the Region-Layer component. The objective is to understand to what extent these ideas work, and to expose limitations both in implementation and theory.

Results will be posted to the blog and written up for publication if, or when, we reach an acceptable level of novelty and rigor.

We are not trying to beat benchmarks here. We’re trying to show whether certain ideas have useful qualities – the best way to tackle specific AI problems is almost certainly not an AGI way. But what we’d love to see is that AGI-inspired methods can perform close to state-of-the-art (e.g. deep convolutional networks) on a wide range of problems. Now that would be general intelligence!

Dataset Choice

We are going to start with the MNIST digit classification dataset, and perform a number of experiments based on that. In future we will look at some more sophisticated image / object classification datasets such as LabelMe or Caltech_101.

The good thing about MNIST is that it’s simple and has been extremely widely studied. It’s easy to work with the data and the images are a practical size – big enough to be interesting, but not so big as to require lots of preprocessing or too much memory. Despite the images being only 28×28 pixels, variations in digit appearance give considerable depth to the data (example digit ‘5’ above).

The bad thing about MNIST is that it’s largely “solved” by supervised learning algorithms. A range of different supervised techniques have reached human performance and it’s debatable whether any further improvements are genuine.

So what’s the point of trying new approaches? Well, supervised algorithms have some odd qualities, perhaps due to the narrowness of training samples or the specificity of the cost function. For example, researchers have discovered “adversarial examples”: images that look easily classifiable to the naked eye but cannot be classified correctly by a trained network because they exploit weaknesses in trained neural networks.

But the biggest drawback of supervised learning is the need to tell it the “correct” answer for every input. This has led to a range of techniques – such as transfer learning – to make the most of what training data is available, even if not directly relevant. But fundamentally, supervised learning is unlike the experience of an agent learning as it explores its world. Animals can learn without a tutor.

However, unsupervised results with MNIST are less widely reported. Partially this is because you need to come up with a way to measure the performance of an unsupervised method. The most common approach is to use unsupervised networks to boost the performance of a final supervised network layer – but in MNIST the supervised layer is so powerful it’s hard to distinguish the contribution of the unsupervised layers. Nevertheless, these experiments are encouraging because having a few unsupervised layers seems to improve overall performance, compared to all-supervised networks. In addition to the limited data problem with supervised learning, unsupervised learning actually seems to add something.

One possible method of capturing the contribution of unsupervised layers alone is the Rand Index, which measures the similarity between two clusterings of the same data. However, we are intending to use a distributed representation where there will be overlap between similar representations – that’s one of the features of the algorithm!
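For reference, the Rand Index can be computed by counting pairs of samples on which two clusterings agree. The sketch below is a plain pair-counting version written for illustration (the function name is our own); it assumes every sample belongs to exactly one cluster, which is precisely the assumption that an overlapping, distributed representation breaks.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair "agrees" if both clusterings put the two samples together,
    or both put them apart. Requires at least two samples.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        agree += int(same_in_a == same_in_b)
        total += 1
    return agree / total


# Identical groupings with different cluster names still score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that cut across each other score much lower.
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```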

So, for now we’re going to go for the simplest approach we can think of, and measure the correlation between the active cells in selected hidden layers and each digit label, and see if the correlation alone is enough to pick the right label given a set of active cells. If the concepts defined by the digits exist somewhere in the hierarchy, they should be detectable as features uniquely correlated with specific labels…

Note also that we’re not doing any preprocessing of the MNIST images except binarization at threshold 0.5. Since the MNIST dataset is very high contrast, hopefully the threshold doesn’t matter much: It’s almost binary already.

Sequence Learning Tests

Before starting the experiments proper, we conducted some ad-hoc tests to verify that the features of the Region-Layer are implemented as intended. Remember, the Region-Layer has two key capabilities:

  • Classification … of the feedforward input, and
  • Prediction … of future classification results (i.e. future internal states)

Our earlier blog posts explain the classification role and give more information about prediction. Taken together, the ability to classify and predict future classifications allows sequences of input to be learned. This is a topic we have looked at in detail in earlier blog posts and we have some fairly effective techniques at our disposal.

We completed the following tests:

  • Cycle 0,1,2: We verified that the algorithm could predict the set of active cells in a short cycle of images. This ensures the sequence learning feature is working. The same image was used for each instance of a particular digit (i.e. there was no variation in digit appearance).
  • Cycle 0,1,…,9: We tested a longer cycle. Again, the Region-Layer was able to predict the sequence perfectly.
  • Cycle 0,1,2,3, 0,2,3,1: We tested an ambiguous cycle. At 0, it appears that the next state can be 1 or 2, and similarly, at 3, the next state can be 0 or 1. However, due to the variable order modelling behaviour of the Region-Layer, a single Region-Layer is able to predict this cycle perfectly. Note that first-order prediction cannot predict this sequence correctly.
  • Cycle 0,1,2,3,1,2,4,0,2,3,1,2,1,5,0,3,2,1,4,5: We tested a complex graph of state sequences and again a single Region-Layer was able to predict the sequence perfectly. We also were able to predict this using only first order modelling and a deep hierarchy.
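To see why first-order prediction fails on the ambiguous cycle, the following sketch counts which contexts of a given length are followed by more than one symbol. It is a simple analysis tool written for illustration and says nothing about how the Region-Layer builds its variable order models; the function names are our own.

```python
from collections import defaultdict


def ambiguous_contexts(cycle, order):
    """Contexts of length `order` that are followed by more than one symbol.

    The sequence is treated as a repeating cycle, so indices wrap around.
    """
    n = len(cycle)
    followers = defaultdict(set)
    for i in range(n):
        context = tuple(cycle[(i + j) % n] for j in range(order))
        followers[context].add(cycle[(i + order) % n])
    return {c: s for c, s in followers.items() if len(s) > 1}


def minimum_order(cycle):
    """Smallest context length that makes the next symbol fully predictable."""
    for order in range(1, len(cycle) + 1):
        if not ambiguous_contexts(cycle, order):
            return order
    return None


print(minimum_order([0, 1, 2]))                       # simple cycle
print(ambiguous_contexts([0, 1, 2, 3, 0, 2, 3, 1], 1))  # first-order ambiguity
print(minimum_order([0, 1, 2, 3, 0, 2, 3, 1]))        # context needed to disambiguate
```

For the ambiguous cycle, a single previous symbol leaves several states with two possible successors, and even two symbols of context are not enough; three are needed.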

After completing these tests, we were satisfied that our Region-Layer component can efficiently produce variable order models of observed sequences using unsupervised learning, assuming that the states can be detected reliably.
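To see why the ambiguous cycle defeats first-order prediction, it helps to tabulate which states follow each context of previous states. The sketch below is not the Region-Layer algorithm. It is a plain lookup table (the function names are ours) that checks whether a fixed-length history is enough to determine the next state.

```python
from collections import defaultdict


def successors(cycle, order):
    """Map each context of `order` previous states to the set of states
    seen to follow it, treating `cycle` as repeating forever."""
    table = defaultdict(set)
    n = len(cycle)
    for i in range(n):
        # The `order` states immediately before position i, oldest first.
        context = tuple(cycle[(i - k) % n] for k in range(order, 0, -1))
        table[context].add(cycle[i])
    return dict(table)


def is_deterministic(cycle, order):
    """True if every context of this length has exactly one successor."""
    return all(len(nxt) == 1 for nxt in successors(cycle, order).values())


simple = [0, 1, 2]
ambiguous = [0, 1, 2, 3, 0, 2, 3, 1]

print(is_deterministic(simple, 1))     # True
print(is_deterministic(ambiguous, 1))  # False
print(is_deterministic(ambiguous, 3))  # True
```

In this table-based view, the ambiguous cycle only becomes predictable once three previous states are taken into account. A variable order model avoids fixing that history length in advance.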


Now we come to the harder part. What if each digit exemplar image is ambiguous? In other words, what if each ‘0’ is represented by a randomly selected ‘0’ image from the MNIST dataset? The ambiguity of appearance means that the observed sequences will appear to be non-deterministic.

We decided to run the following experiments:

Experiment 1: Random image classification

In this experiment there will be no predictable sequence; each digit must be recognized solely based on its appearance. We use the classic procedure: up to N training passes over the entire MNIST training set, after which the internal weights are fixed and a single pass calculates the correlation between each active cell in the selected hidden layer[s] and the digit labels. A final pass over the test set then records, for each test input image, the digit label most highly correlated with the set of active hidden cells. The algorithm gets a “correct” result if the most correlated label is the correct label.

  • Passes 1-N: Train networks

Present each digit in the training set once, in a random order. Train the internal weights of the algorithm. Repeat this pass several times if necessary.

  • Pass N+1: Measure correlation of hidden layer features with training labels.

Present each digit in the training set once, in a random order. Accumulate the frequency with which each active cell is associated with each digit label. After all images have been seen, convert the observed frequencies to correlations.

  • Pass N+2: Predict label of test images. 

Present each digit in the testing set once, in a random order. Use the correlations between cell activity and training labels to predict the most likely digit label given the set of active cells in selected Region-Layer components (they are arranged into a hierarchy).
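The last two passes can be sketched as follows. Several details here are assumptions of ours rather than a description of the implementation: we use the phi coefficient (the Pearson correlation of two binary variables) as the correlation measure, we score each label by summing the correlations of the active cells, and the cell ids and toy data are purely illustrative.

```python
import math
from collections import Counter


def fit_correlations(samples, num_labels=10):
    """Pass N+1: samples is a list of (active_cells, label) pairs, where
    active_cells is a set of cell ids. Returns {cell: [phi per label]}."""
    total = len(samples)
    joint, cell_count, label_count = Counter(), Counter(), Counter()
    for active, label in samples:
        label_count[label] += 1
        for cell in active:
            cell_count[cell] += 1
            joint[cell, label] += 1
    corr = {}
    for cell, n_cell in cell_count.items():
        row = []
        for label in range(num_labels):
            n_label = label_count[label]
            num = joint[cell, label] * total - n_cell * n_label
            denom = math.sqrt(n_cell * n_label
                              * (total - n_cell) * (total - n_label))
            row.append(num / denom if denom else 0.0)
        corr[cell] = row
    return corr


def predict(corr, active, num_labels=10):
    """Pass N+2: pick the label with the highest summed correlation
    over the active cells (summing is an assumed scoring rule)."""
    scores = [0.0] * num_labels
    for cell in active:
        for label, value in enumerate(corr.get(cell, [])):
            scores[label] += value
    return max(range(num_labels), key=scores.__getitem__)


# Toy data with two labels: cells 0 and 2 favour label 0, cells 3 and 4
# favour label 1, and cell 1 is uninformative.
toy = [({0, 1}, 0), ({0, 2}, 0), ({3, 4}, 1), ({3, 1}, 1)]
corr = fit_correlations(toy, num_labels=2)
print(predict(corr, {0, 2}, num_labels=2))  # 0
print(predict(corr, {3, 4}, num_labels=2))  # 1
```

Cells never seen during the correlation pass contribute nothing to the score in this sketch, which is one of several reasonable choices.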

Experiment 2: Image classification & sequence prediction

What if the digit images are not in a random order? We can use the English language to generate a training set of digit sequences. For example, we can take a book, convert each character to a two-digit number, and select a random image of the appropriate digit to represent each digit of that number.

The motivation for this experiment is to see how the sequence learning can boost image recognition: Our Region-Layer component is supposed to be able to integrate both sequential and spatial information. This experiment actually has a lot of depth because English isn’t entirely predictable – if we use a different book for testing, there’ll be lots of sub-sequences the algorithm has never observed before. There’ll be uncertainty in image appearance and uncertainty in sequence, and we’d like to see how a hierarchy of Region-Layer components responds to both. Our expectation is that it will improve digit classification performance beyond the random image case.
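As a rough sketch of how such a digit sequence might be generated: the character-to-number mapping below (lowercase letters plus space, numbered 00 to 26) is a hypothetical choice of ours, and `images_by_label` stands in for a lookup of MNIST images grouped by digit label.

```python
import random
import string

# Hypothetical alphabet: 27 symbols mapped to the codes 00..26.
ALPHABET = string.ascii_lowercase + " "


def text_to_digits(text):
    """Convert text to a flat list of digits, two digits per character."""
    digits = []
    for ch in text.lower():
        code = ALPHABET.find(ch)
        if code < 0:
            continue  # skip characters outside the assumed alphabet
        digits.extend(divmod(code, 10))  # tens digit, then units digit
    return digits


def sample_images(digits, images_by_label, rng):
    """Pick a random exemplar image for each digit in the sequence."""
    return [rng.choice(images_by_label[d]) for d in digits]


print(text_to_digits("ab z"))  # [0, 0, 0, 1, 2, 6, 2, 5]

# Placeholder strings stand in for real MNIST images here.
images_by_label = {d: [f"img{d}_{i}" for i in range(3)] for d in range(10)}
sequence = sample_images(text_to_digits("ab z"), images_by_label,
                         random.Random(0))
```

Because each digit is drawn at random from its class, the same text produces a different image sequence on every pass, which is what makes the appearance ambiguous while the underlying sequence stays fixed.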

In the next article, we will describe the specifics of the algorithms we implemented and tested on these problems.

A final article will present some results.