Yi-Ling Hwong

Attention in Artificial Intelligence systems

Posted by Yi-Ling Hwong on

One of the features of our brain is its modularity. It is characterised by distinct but interacting subsystems that underlie key functions such as memory, language, perceptions, etc. Understanding the complex interplay between these modules requires decomposing them into a set of components. Indeed, this modular approach to modeling the brain is one of the ways neuroscientists are studying the brain. The ‘module’ that I am going to be talking about in this blogpost has its roots in neuroscience but has greatly inspired AI research: attention. I will focus on the neuroscience aspect of attention first before moving on to review some of the most exciting developments using attentional mechanisms in machine learning.

The neuroscientific roots of attention

Neuroscientists have long studied attention as an important cognitive process. It is described as the ability of organisms to ‘select a subset of available information upon which to focus for enhanced processing and integration’ and encompasses three aspects: orienting, filtering and searching. Visual attention, for example, is an active area of research. Our ability to focus on a specific area of a visual scene, and to extract and process the information that is streamed to our brain, is thought to be an evolutionary trait that all but guaranteed the survival of our species. This capability to select, process and act upon sensory experience has inspired a whole branch of research in computational modelling of visual attention.

Visual attention (image credit: Wikimedia)

The emergence of a whole suite of sophisticated equipment to scan and study the brain has further fanned the flames of enthusiasm for attention research. In a recent study using eye tracking and fMRI data, Leong et al. demonstrated the bidirectional interaction between attention and learning: attention facilitates learning, and learned values in turn inform attentional selection [1]. The relationship between attention and consciousness is a complex issue, and in many senses both a scientific and a philosophical exploration. The ability to focus one’s thoughts on one out of several simultaneous objects or trains of thought, and to take control of one’s own mind in a vivid and conscious manner, is not just a delightful and useful perk. It is a quintessential part of our experience of human-ness.

Given their significance, attentional mechanisms have in recent years received increasing attention (pun intended) from the AI community. A detailed explanation of how they are applied in machine learning will require a separate blog post (I highly recommend this excellent article by Olah and Carter) but in essence attention layers provide the functionality of focusing on specific elements to improve the performance of a model. In an image recognition task, for example, an attention mechanism does so by taking ‘glimpses’ of the input image at each step, updating the internal state representations, and then selecting the next location to sample. In a cluttered setting or when the input is too big, attention serves a ‘prioritisation’ function to filter out irrelevant elements. It is a powerful technique that can be used when interfacing with a neural network that has a repeating structure in its output. For example, when used to augment an LSTM (a special variant of the recurrent neural network, or RNN), it lets the network select, at every step, which information to look at from a larger body of information. However, attentional mechanisms are not just useful in RNNs, as we will find out below.

State of the art using attention in machine learning

In machine learning, attention is especially useful in sequence prediction problems. Let’s review a few of the major areas where it has been applied successfully.

1. Natural language processing

Attentional mechanisms have been applied in many natural language processing (NLP) related tasks. The seminal work by Bahdanau et al. proposed a neural machine translation model that implements an attention mechanism in the decoder for English-to-French translation [2]. As the system reads the English input (encoder), the decoder outputs the French translation, and the attention mechanism learns by stochastic gradient descent to shift its focus to the parts of the input surrounding the word that is being translated. Their RNN-based model substantially outperformed a conventional RNN encoder-decoder and achieved translation quality comparable to a state-of-the-art phrase-based system. RNNs are the incumbent architecture for text applications, but they do not allow for parallelisation, which limits how well they can exploit the GPU hardware that powers modern machine learning. A team of Facebook AI researchers introduced a novel approach using convolutional neural networks (which are highly parallelisable) and a separate attention module in each decoder layer. As opposed to Bahdanau et al.’s ‘single step attention’, theirs is a multi-hop attention module. This means that instead of looking at the sentence once and then translating it without looking back, the mechanism takes multiple glimpses at the sentence to determine what it will translate next. Their approach outperformed state-of-the-art results for English-German and English-French translation at an order of magnitude faster speed [3]. Other examples of attentional mechanisms being applied to NLP problems include text classification [4], language grounding (performing tasks described by natural language instructions in a 3D game-play environment) [5] and text comprehension (answering cloze-style questions about a document) [6].
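
To make the decoder-side attention described above a little more concrete, here is a minimal NumPy sketch of one decoding step with an additive scoring function, which is how I understand the approach of Bahdanau et al. The parameter names (W_dec, W_enc, v), the dimensions and the random values are illustrative assumptions, not taken from the paper; in a real model they would be learned by gradient descent.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Return (context, weights) for a single decoding step."""
    # One score per source position: v . tanh(W_dec s + W_enc h_j)
    scores = np.tanh(decoder_state @ W_dec + encoder_states @ W_enc) @ v
    weights = softmax(scores)              # how much to attend to each source word
    context = weights @ encoder_states     # weighted average of encoder states
    return context, weights

# Hypothetical sizes: 5 source words, random (untrained) parameters.
rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 5, 8, 6, 4
encoder_states = rng.normal(size=(src_len, enc_dim))
decoder_state = rng.normal(size=dec_dim)
W_dec = rng.normal(size=(dec_dim, att_dim))
W_enc = rng.normal(size=(enc_dim, att_dim))
v = rng.normal(size=att_dim)

context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
```

The weights form a probability distribution over the source words, so the context vector handed to the decoder is a soft selection of the input rather than a single fixed summary.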

2. Object recognition

Object recognition is one of the hallmarks of machine intelligence. Mnih et al. demonstrated how an attentional mechanism can be used to ignore irrelevant objects in a scene, allowing the model to perform well in challenging object recognition tasks in the presence of clutter [7]. In their Recurrent Attention Model (RAM), the agent receives a partial observation of the environment at each step and learns where to focus (i.e. pay attention to) next through training an RNN. Attention is used to produce a ‘glimpse feature vector’ in which the region around a target pixel is encoded at high resolution and pixels further from the target are encoded at progressively lower resolution. Using a similar approach, another study used a deep recurrent attention model to both localise and recognise multiple objects in images [8]. Xu et al. trained a model that automatically learns to describe the content of images [9]. Their attention model uses a multilayer perceptron that is conditioned on the previous hidden state, meaning that where the network looks next depends on the sequence of words that has already been generated. The researchers showed how a convolutional neural network can be used to pay attention to images when outputting a sequence, i.e. the image caption. Another advantage of attention in this case is the insight gained by approximately visualising where and what the attention focused on (i.e. what the model ‘sees’).
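
The ‘glimpse’ idea can be sketched in a few lines. The function below is my own simplified illustration, not the RAM authors' code: it cuts square patches of increasing size around a location and average-pools each one down to the same small size, so the centre is seen sharply and the surroundings coarsely. The function name, the zero padding at the borders and the default sizes are all assumptions.

```python
import numpy as np

def glimpse(image, center, base=4, scales=3):
    """Stack `scales` patches centred on `center`, each reduced to base x base."""
    row, col = center
    max_half = base * 2 ** (scales - 1) // 2
    # Zero-pad so that even the largest patch never leaves the image.
    padded = np.pad(image, max_half, mode="constant")
    row, col = row + max_half, col + max_half
    patches = []
    for s in range(scales):
        size = base * 2 ** s               # patch side doubles at every scale
        half = size // 2
        patch = padded[row - half:row + half, col - half:col + half]
        factor = size // base
        # Average-pool by `factor` so every scale ends up base x base.
        pooled = patch.reshape(base, factor, base, factor).mean(axis=(1, 3))
        patches.append(pooled)
    return np.stack(patches)

# Hypothetical 28 x 28 input, looking at its centre and at a corner.
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
centre_glimpse = glimpse(image, (14, 14))
corner_glimpse = glimpse(image, (0, 0))
```

The first slice of the result is the full-resolution patch; each following slice covers four times the area at the same number of pixels.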

Telling mistakes in image caption generation with visual attention (image taken from Xu et al., 2016)

3. Gameplay

Google DeepMind’s Deep Q-Network (DQN) represented a significant advance in Reinforcement Learning and a breakthrough in general AI in the sense that it showed a single algorithm can learn to play a wide variety of Atari 2600 games: the agent was able to continually adapt its behaviour without any human intervention. Sorokin et al. added attention to the equation and developed the Deep Attention Recurrent Q-Network (DARQN) [10]. Their model outperformed DQN by incorporating what they termed ‘soft’ and ‘hard’ attention mechanisms. The attention network takes the current game state as input and generates a context vector based on the features observed. An LSTM then takes this context vector along with a previous hidden state and memory state to evaluate the actions that an agent can take. Choi et al. further improved on DARQN by implementing a multi-focus attention network in which the agent is capable of attending to multiple important elements [11]. In contrast to DARQN, which uses only one attention layer, their model uses multiple parallel attention layers to attend to the entities that are relevant to tackling the problem.
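
The difference between ‘soft’ and ‘hard’ attention is easiest to see side by side. The sketch below is a generic illustration under my own assumptions (a 7 x 7 grid of 16-dimensional feature vectors and random scores), not the DARQN implementation: soft attention blends all locations into one context vector, while hard attention samples a single location.

```python
import numpy as np

def soft_attention(features, scores):
    """Context = weighted average over all locations (differentiable)."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ features, weights

def hard_attention(features, scores, rng):
    """Context = features of one location sampled from the attention weights."""
    _, weights = soft_attention(features, scores)
    index = rng.choice(len(weights), p=weights)
    return features[index], index

# Hypothetical feature map: 49 locations, 16 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(49, 16))
scores = rng.normal(size=49)

soft_context, weights = soft_attention(features, scores)
hard_context, index = hard_attention(features, scores, rng)
```

Because the hard variant involves a sampling step, it cannot be trained by plain backpropagation and needs a sampling-based gradient estimate instead, which is one reason the soft variant is the more common choice.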

4. Generative models

Attention has also proven useful in generative models, systems that can simulate (i.e. generate) values of any variable (inputs and outputs) in the model. Hong et al. developed a deep generative model based on a convolutional neural network for semantic segmentation (the task of assigning class labels to groups of pixels in an image) [12]. By incorporating attention-like mechanisms they were able to capture transferable segmentation knowledge across categories. The attention mechanism adaptively focuses on different areas depending on the input labels. A softmax function is used to encourage the model to pay attention to only a segment of the image. Another example is Google DeepMind’s Deep Recurrent Attentive Writer (DRAW) neural network for image generation [13]. Attention allows the system to build up an image incrementally. The attention model is fully differentiable (making it possible to train with gradient descent), thus allowing the encoder to focus on only part of the input and the decoder to modify only a part of the canvas. The model achieved impressive results generating images from the MNIST data set, and when trained on the Street View House Numbers data set it generated images that are almost identical to the real data.

5. Attention alone for NLP tasks

Another exciting line of research focusses on using attentional mechanisms alone for NLP tasks traditionally solved with recurrent or convolutional neural networks. Vaswani et al. developed the Transformer, a simple network architecture based solely on a novel multi-head attention mechanism, for translation tasks [14]. They compute the attention function on a set of queries simultaneously using dot-product attention (the dot product of each key with the query measures how similar they are) with an additional scaling factor. Several such attention functions (‘heads’) run in parallel on different learned projections of the input, and this multi-head approach allows their model to attend to information from different positions at the same time. Their model completely forgoes recurrence and convolutions but still managed to attain state-of-the-art results for English-to-German and English-to-French translation. Moreover, they achieved this in significantly less training time and their model is highly parallelizable. An earlier work by Parikh et al. experimented with a simple attention-based approach to solve natural language inference tasks [15]. They used attention to decompose the problem into subproblems that can be solved individually, hence making the model trivially parallelizable.
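
Scaled dot-product attention and its multi-head extension are compact enough to write out. The NumPy sketch below follows the formulation in Vaswani et al. as I understand it; the weight matrix names (W_q, W_k, W_v, W_o), the sizes and the random values are assumptions for illustration, and things like masking, dropout and layer normalisation are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all queries at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # query-key similarity
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention over X with `num_heads` heads running in parallel."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    heads, _ = scaled_dot_product_attention(split(X @ W_q), split(X @ W_k), split(X @ W_v))
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # glue heads back together
    return concat @ W_o

# Hypothetical sizes: 6 tokens, model width 16, 4 heads.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

single_out, single_weights = scaled_dot_product_attention(X, X, X)
multi_out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Everything here is a matrix product over the whole sequence at once, with no loop over time steps, which is where the parallelism comes from.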

Not just a cog in the machine

What we have learned about attention so far tells us it is likely to be an essential component in the development of general AI. Philosophically, it is a key feature of the human psyche, which makes it a natural inclusion in pursuits that concern the grey matter, while computationally, attention-based mechanisms have helped boost model performance to deliver stunning results in many areas. Attention has also proven to be a versatile technique, as is evident in its ability to replace recurrent layers in machine translation and other NLP related tasks. But it is most powerful when used in conjunction with other components, as Kaiser et al. demonstrated in their study One Model To Learn Them All, which presented a model capable of solving a number of problems spanning multiple domains [16]. To be sure, attentional mechanisms are not without weaknesses. As Olah and Carter suggested, their propensity to take every action at every step (albeit to varying extents) could potentially be very costly computationally. Nonetheless, I believe that in a modular approach to developing general AI (IMO our best bet in this quest) attention will be a worthwhile, and perhaps even indispensable, module.


[1] Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451-463.

[2] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.

[4] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. In HLT-NAACL (pp. 1480-1489).

[5] Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. (2017). Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv preprint arXiv:1706.07230.

[6] Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).

[8] Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

[9] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., … & Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

[10] Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

[11] Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. AAAI Publications, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

[12] Hong, S., Oh, J., Lee, H., & Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3204-3212).

[13] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

[15] Parikh, A., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model. In Empirical Methods in Natural Language Processing.

[16] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.

Reading List

Reading list – August 2017

Posted by Yi-Ling Hwong on

1. Neuroscience-inspired Artificial Intelligence

Authors: Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick
Type: Review article in Neuron
Publication date: 19 July 2017

This paper outlined the contribution of neuroscience to the most recent advances in AI and argued that the study of neural computation in humans and other animals could provide useful (albeit subtle) inspiration to AI researchers, stimulating questions about specific aspects of learning and intelligence that could guide algorithm design.

  • Four specific examples of neuroscientific inspirations that are currently used in AI were mentioned: attentional mechanism, episodic memory, working memory and continual learning
  • Four areas where neuroscience could be relevant for future AI research were also mentioned: intuitive understanding of the physical world, efficient (or rapid) learning, transfer learning, imagination and planning

2. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Authors: William Lotter, Gabriel Kreiman, David Cox
Type: arXiv preprint (accompanying codebase available here)
Publication date: 25 May 2016

The PredNet architecture (image credit: PredNet)

This paper introduced ‘PredNet’, a predictive neural network architecture that is able to predict future frames in a video sequence using a deep, recurrent convolutional network with both bottom-up and top-down connections.

  • The study demonstrated the potential for video to be used in unsupervised learning, where prediction of future frames can serve as a powerful learning signal, given that an agent must have an implicit model of the objects that constitute the environment and how they are allowed to move.
  • By training using car-mounted camera videos, results showed that the network was able to learn to predict both the movement of the camera and the movement of the objects in the camera’s view.

3. Distral: Robust Multitask Reinforcement Learning

Authors: Yee Whye Teh, Victor Bapst, Wojciech Marian Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, Razvan Pascanu
Type: arXiv preprint
Publication date: 13 July 2017

This paper proposed a method to overcome a common problem in Deep Reinforcement Learning, whereby training on multiple related tasks negatively affects performance on the individual tasks, even though intuition tells us that solutions to related tasks should improve learning since the tasks share common structures.

  • The authors developed Distral (Distill & transfer learning), based on the idea of a shared ‘policy’ that distills common behaviours or representations from task-specific policies.
  • Knowledge obtained in an individual task is distilled into the shared policy and then transferred to other tasks.

4. How Prior Probability Influences Decision Making: A Unifying Probabilistic Model

Authors: Yanping Huang, Timothy Hanks, Mike Shadlen, Abram L. Friesen, Rajesh P. Rao
Type: Conference proceeding published at the Neural Information Processing Systems Conference
Publication year: 2012

This paper tackled the problem of how the brain combines sensory input and prior knowledge when making decisions in the natural world.

  • The authors derived a model based on the framework of partially observable Markov decision processes (POMDPs) and computed the optimal behaviour for sequential decision making tasks.
  • Their results suggest that decision making in our brain may be controlled by the dual principles of Bayesian inference and reward maximisation.
  • The proposed model offered a unifying explanation for experimental data previously accounted for by two competing models for incorporating prior knowledge: the additive offset model, which assumes a static influence of the prior, and the dynamic weighting model, which assumes a time-varying effect.

5. First-spike based visual categorization using reward-modulated STDP

Authors: Milad Mozafari, Saeed Reza Kheradpisheh, Timothée Masquelier, Abbas Nowzari-Dalini, Mohammad Ganjtabesh
Type: arXiv preprint
Publication date: 25 May 2017

This paper proposed a hierarchical Spiking Neural Network (SNN) equipped with a novel Reward-modulated STDP (R-STDP) learning algorithm to solve object recognition tasks without using an external classifier.

  • The learning algorithm combined the principles of Reinforcement Learning and STDP
  • The network is structured as a feedforward convolutional SNN with four layers; however, training took place in only one layer.
  • Results from R-STDP outperformed STDP on several datasets

6. A Distributional Perspective on Reinforcement Learning

Authors: Marc G. Bellemare, Will Dabney, Rémi Munos
Type: arXiv preprint
Publication date: 21 July 2017

This paper sought to provide a more complete picture of reinforcement learning (RL) by incorporating the concept of value distribution, understood as the distribution of the random return received by a learning agent.

  • The main object of the study is a random return Z that is characterised by the interaction of three random variables: the reward R, the next state-action, and its random return.
  • The authors designed a new algorithm using this distributional perspective to learn approximate value distribution and obtained state of the art results, at the same time demonstrating the importance of the value distribution in approximate RL.
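
For readers who prefer symbols, the relationship in the first bullet can be written as the distributional Bellman equation. This is my transcription of the paper's notation; γ is the discount factor and (X′, A′) the random next state-action pair, neither of which is spelled out in the summary above. The equality holds in distribution, and the familiar Q-value is simply the expectation of Z.

```latex
Z(x, a) \overset{D}{=} R(x, a) + \gamma \, Z(X', A'),
\qquad
Q(x, a) = \mathbb{E}\left[ Z(x, a) \right]
```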

Unsupervised Learning with Spike-Timing Dependent Plasticity

Posted by Yi-Ling Hwong on

Our brain is a source of great inspiration for the development of Artificial General Intelligence. In fact, one of the common views is that any effort in developing human-level AI is almost destined to fail without an intimate understanding of how the brain works. However, we do not understand our brain that well yet. But that is another story for another day. In today’s blog post we are going to talk about a learning method in machine learning that takes its inspiration from a biological process underpinning how humans learn – Spike Timing Dependent Plasticity (STDP).

Biological neurons communicate with each other through synapses, which are tiny connections between neurons in our brains. A presynaptic neuron is the neuron that fires the electrical impulse (the signal, so to speak), and a postsynaptic neuron is the neuron that receives this impulse. The wiring of the neurons makes our brain an extremely complex piece of machinery: a typical neuron receives thousands of inputs and sends its signals to over 10,000 other neurons. Incoming signals to a neuron alter its voltage (potential). When these signals reach a threshold value the neuron will produce a sudden increase in voltage for a short time (1ms). We refer to these short bursts of electrical energy as spikes. Computers communicate with bits, while neurons use spikes.

Anatomy of a neuron (image credit: Wikimedia)

Artificial Neural Networks (ANNs) attempt to capture this mechanism of neuronal communication through mathematical models. However, these computational models may be an inadequate representation of the brain. To understand the trend towards STDP and why we think it is a viable path forward, let’s back up a little bit and talk briefly about the current common methods in ANNs.

Gradient Descent: the dominant paradigm

Artificial Neural Networks are based on a collection of connected nodes mimicking the behaviour of biological neurons. A receiving (or postsynaptic) neuron receives multiple inputs, multiplies each of them by a weight, sums the results, applies a nonlinear transfer function, and then propagates this signal to other neurons. The weights of the neurons vary as learning happens. This process of tweaking the weights is the most important thing in an artificial neural network. One popular learning algorithm is Stochastic Gradient Descent (SGD). To calculate the gradient of the loss function with respect to the weights, most state-of-the-art ANNs use a procedure called back-propagation. However, the biological plausibility of back-propagation remains highly debatable. For example, there is no evidence of a global error minimisation mechanism in biological neurons. Therefore, a learning algorithm that raises the biological realism of our models might help us to move towards AGI. And this is where the Spiking Neural Network (SNN) comes in.
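
As a reminder of what this looks like in practice, here is a minimal sketch of a single artificial neuron trained with gradient descent. The sigmoid transfer function, the squared-error loss, the learning rate and the toy input are my own choices for illustration; a real network stacks many such units and uses back-propagation to pass the error signal through the layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x):
    """Weighted sum of the inputs followed by a nonlinear transfer function."""
    return sigmoid(w @ x + b)

def sgd_step(w, b, x, target, lr=0.5):
    """One gradient descent step on the loss 0.5 * (y - target)^2."""
    y = forward(w, b, x)
    delta = (y - target) * y * (1.0 - y)   # chain rule through the sigmoid
    return w - lr * delta * x, b - lr * delta

# Hypothetical single training example.
w, b = np.array([0.1, -0.2, 0.05]), 0.0
x, target = np.array([1.0, 0.5, -1.0]), 1.0

loss_before = 0.5 * (forward(w, b, x) - target) ** 2
for _ in range(100):
    w, b = sgd_step(w, b, x, target)
loss_after = 0.5 * (forward(w, b, x) - target) ** 2
```

Note that the update needs the error at the output (y - target), which is exactly the kind of global teaching signal whose biological plausibility is in question.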

The incorporation of timing in an SNN

The main difference between a conventional ANN and SNN is the neuron model that is used. The neuron model used in a conventional ANN does not employ individual spikes in computations. Instead the output signals from the neurons are treated as normalised firing rates, or frequency, of inputs within a certain time frame [1]. This is an averaging mechanism and is commonly referred to as rate coding. Consequently, input to the network can be real values, instead of a binary time-series. In contrast, each individual spike is used in the neuron model of an SNN. Instead of using rate coding, SNN uses pulse coding. What is important here is the incorporation of timing of the firing in computations, like real neurons do. The neurons in an SNN do not fire at every propagation cycle. They only fire when signals from other incoming neurons cause charge accumulation that reaches a certain threshold voltage.
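
A common way to model this threshold behaviour is the leaky integrate-and-fire (LIF) neuron. The post does not commit to a particular neuron model, so treat the sketch below, including its time constant, threshold and reset values, as an assumption chosen for illustration.

```python
import numpy as np

def simulate_lif(input_current, dt=1.0, tau=20.0, v_rest=0.0,
                 v_threshold=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron and return its spike times (ms)."""
    v = v_rest
    spike_times = []
    for step, current in enumerate(input_current):
        # Leaky integration: the potential decays towards rest and is driven by the input.
        v += dt * (-(v - v_rest) + current) / tau
        if v >= v_threshold:               # threshold reached: emit a spike
            spike_times.append(step * dt)
            v = v_reset                    # and reset the membrane potential
    return spike_times

# Hypothetical constant inputs over 200 ms.
weak = simulate_lif(np.full(200, 0.5))     # settles below threshold, never fires
strong = simulate_lif(np.full(200, 2.0))   # crosses threshold repeatedly
```

With the weak input the potential saturates below the threshold and the neuron stays silent; with the strong input it fires at regular intervals, and the information is carried by when those spikes occur.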

Basic model of a spiking neuron (Image credit: EPFL)

The use of individual spikes in pulse coding is more biologically accurate in two ways. First, it is a more plausible representation for tasks where speed is an important consideration. The human visual system is a good example. Studies have shown that humans analyse and classify visual input (e.g. facial recognition) in under 100 ms. Considering it takes at least 10 synaptic steps from the retina to the temporal lobe [2], this leaves about 10 ms of processing time for each neuron. This is too little time for an averaging mechanism like rate coding to take place. Hence, an implementation that uses pulse coding might be a more suitable model for object recognition tasks, even though conventional ANNs are currently the more popular choice. Second, the use of only local information (i.e. timing of spikes) in learning is a more biologically realistic representation in comparison with a global error minimisation mechanism.

Learning using Spike-Timing Dependent Plasticity

The changing and shaping of neuron connections in our brain is known as synaptic plasticity. Neurons fire, or spike, to signal the presence of the feature that they are tuned for. As the well-known summary of the Canadian psychologist Donald Hebb’s theory goes, “Neurons that fire together, wire together.” Simply put, when two neurons fire at almost the same time the connections between them are strengthened and thus they become more likely to fire together again in the future. When two neurons fire in an uncoordinated manner the connections between them weaken and they are more likely to act independently in the future. This is known as Hebbian learning. The strengthening of synapses is known as Long Term Potentiation (LTP) and the weakening of synaptic strength is known as Long Term Depression (LTD). What determines whether a synapse will undergo LTP or LTD is the timing between the pre- and postsynaptic firing. If the presynaptic neuron fires before the postsynaptic neuron, within the preceding 20 ms, LTP occurs; and if the presynaptic neuron fires after the postsynaptic neuron, within the following 20 ms, LTD occurs. This is known as Spike-Timing Dependent Plasticity (STDP).

This biological mechanism can be adopted as a learning rule in machine learning. A general approach is to apply a delta rule Δw to each synapse in a network to compute its weight change. The weight change will be positive (therefore increasing the strength of the synaptic connection) if the postsynaptic neuron fires just after the presynaptic neuron, and negative if the postsynaptic neuron fires just before the presynaptic neuron. Compared with the supervised learning algorithm employed in backpropagation, STDP is an unsupervised learning method. This is another reason STDP-based learning is believed to more accurately reflect human learning, given that much of the most important learning we do is experiential and unsupervised, i.e. there is no “right answer” available for the brain to learn from.
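
One widely used form of this rule lets the size of Δw decay exponentially with the time difference between the two spikes. The sketch below assumes that exponential window; the amplitudes (a_plus, a_minus), the 20 ms time constants and the clipping of weights to the range 0 to 1 are illustrative assumptions rather than values taken from any of the studies cited here.

```python
import numpy as np

def stdp_delta_w(delta_t, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for a spike pair, with delta_t = t_post - t_pre in ms."""
    delta_t = np.asarray(delta_t, dtype=float)
    # Post fires after pre (delta_t > 0): potentiation (LTP).
    potentiation = a_plus * np.exp(-np.abs(delta_t) / tau_plus)
    # Post fires before pre (delta_t < 0): depression (LTD).
    depression = -a_minus * np.exp(-np.abs(delta_t) / tau_minus)
    return np.where(delta_t > 0, potentiation,
                    np.where(delta_t < 0, depression, 0.0))

def update_weight(w, t_pre, t_post, w_min=0.0, w_max=1.0):
    """Apply the STDP change for one spike pair and keep the weight in bounds."""
    return float(np.clip(w + stdp_delta_w(t_post - t_pre), w_min, w_max))

# Hypothetical spike pairs (times in ms).
w_after_ltp = update_weight(0.5, t_pre=10.0, t_post=15.0)   # pre before post
w_after_ltd = update_weight(0.5, t_pre=15.0, t_post=10.0)   # pre after post
```

The update uses only the two spike times at that synapse, with no error signal from elsewhere in the network, which is what makes the rule local and unsupervised.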


STDP represents a potential shift in approach when it comes to developing learning procedures in neural networks. Recent research shows that it has predominantly been applied in pattern recognition related tasks. One 2015 study using an exponential STDP learning rule achieved 95% accuracy on the MNIST dataset [3], a large handwritten digit database that is widely used as a training dataset for computer vision. Merely a year later, researchers had managed to make significant progress. For example, Kheradpisheh et al. achieved 98.5% accuracy on MNIST by combining SNN and features of deep learning [4]. The network they used comprised several convolutional and pooling layers, and STDP learning rules were used in the convolutional layers to learn the features. Another interesting study took its inspiration from Reinforcement Learning and combined it with a hierarchical SNN to perform pattern recognition [5]. Using a network structure that consists of two simple and two complex layers and a novel reward-modulated STDP (R-STDP), their method outperformed classic unsupervised STDP on several image datasets. STDP has also been applied in real-time learning to take advantage of its speedy nature [6]. The SNN and fast unsupervised STDP learning method that was developed achieved an impressive 21.3 fps in training and 17.9 fps in testing. To put things in perspective, human eyes are able to detect around 24 fps.

Apart from object recognition, STDP has also been applied in speech recognition related tasks. One study uses an STDP-trained, nonrecurrent SNN to convert speech signals into a spike train signature for speech recognition [7]. Another study combines a hidden Markov model with SNN and STDP learning to classify segments of sequential data such as individual spoken words [8]. STDP has also proven to be a useful learning method in modelling pitch perception (i.e. recognising tones). Researchers developed a computational model using a neural network that learns using STDP rules to identify (and strengthen) the neuronal connections that are most effective for the extraction of pitch [9].

Final thoughts

Having learned what we have about STDP, what can we conclude about the state of the art of machine learning? We think that conventional Artificial Neural Networks are probably here to stay. They are simplistic models of neurons but they do work. However, the extent to which supervised ANNs would be suitable in the development of AGI is debatable. On the other hand, while the Spiking Neural Network is a more authentic model of how the human brain works, its performance thus far still lags behind that of ANNs on some tasks, not least because a lot more research has been done on supervised ANNs than on SNNs. Despite its intuitive appeal and biological validity, there are also many neuroscientific experiments in which STDP has not matched observations [10]. One major quandary is the observation of LTD in certain hippocampal neurons (at the synapses between the CA3 and CA1 regions, to be precise) when low frequency (1 Hz) presynaptic stimulation drives postsynaptic firing [11]. Conventional STDP wisdom says LTP should happen in this case. The frequency-dependence of plasticity does not stop here. At high enough frequencies (i.e. firing rates), the STDP learning rule becomes LTP-only. That is, both positive and negative spike-timing differences (Δt) produce LTP [12]. Several other additional mechanisms also appear to influence STDP learning. For example, LTD can be converted to LTP by altering the firing pattern of the postsynaptic spikes: firing ‘bursts’ or even a pair of spikes in the postsynaptic neuron leads to LTP where single spikes would have led to LTD [13] [14]. Plasticity also appears to accumulate as a nonlinear function of the number of pre- and postsynaptic pairings, with depression accumulating at a lower rate than potentiation, i.e. requiring more pairings [13]. Finally, it seems that neural activity that does not cause any measurable plasticity may have a ‘priming’ effect on subsequent activities. In the CA1 region for example, LTP could be activated with as few as four stimuli, provided that a single priming stimulus was given 170 ms earlier [15].

The inferior performance of SNNs when compared to other ANNs might be due to their poor scalability. Large-scale SNNs are relatively rare because the computational intensity involved in designing such networks is not yet fully supported by most high performance computing platforms (there are, however, exceptions such as this and this). Most implementations today use only one or two trainable layers of unsupervised learning, which limits their generalisation capabilities [16]. Moreover, and perhaps most importantly, STDP is vulnerable to the common shortcoming of unsupervised learning algorithms: it works well in sifting out statistically significant features but has problems identifying rare but diagnostic features, which are crucial in important processes such as decision making. My sense is that if STDP is to become the key to unlocking the secrets of AGI, there needs to be more creativity in its implementation, taking advantage of its biological roots and nuances while striving for a general purpose learning algorithm.

What do you think? Comment and let us know your thoughts!


[1] Vreeken, J. (2003). Spiking neural networks, an introduction.

[2] Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural networks, 14(6), 715-725.

[3] Diehl, P. U., & Cook, M. (2015). Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9.

[4] Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2016). STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421.

[5] Mozafari, M., Kheradpisheh, S. R., Masquelier, T., Nowzari-Dalini, A., & Ganjtabesh, M. (2017). First-spike based visual categorization using reward-modulated STDP. arXiv preprint arXiv:1705.09132.

[6] Liu, D., & Yue, S. (2017). Fast unsupervised learning for visual pattern recognition using spike timing dependent plasticity. Neurocomputing, 249, 212-224.

[7] Tavanaei, A., & Maida, A. S. (2017). A spiking network that learns to extract spike signatures from speech signals. Neurocomputing, 240, 191-199.

[8] Tavanaei, A., & Maida, A. S. (2016). Training a hidden Markov model with a Bayesian spiking neural network. Journal of Signal Processing Systems, 1-10.

[9] Saeedi, N. E., Blamey, P. J., Burkitt, A. N., & Grayden, D. B. (2016). Learning Pitch with STDP: A Computational Model of Place and Temporal Pitch Perception Using Spiking Neural Networks. PLoS computational biology, 12(4), e1004860.

[10] Shouval, H. Z., Wang, S. S. H., & Wittenberg, G. M. (2010). Spike timing dependent plasticity: a consequence of more fundamental learning rules. Frontiers in Computational Neuroscience, 4.

[11] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. J. Neurosci. 26, 6610–6617.

[12] Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32(6), 1149-1164.

[13] Wittenberg, G. M., & Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. J. Neurosci. 26, 6610–6617.

[14] Pike, F. G., Meredith, R. M., Olding, A. W., & Paulsen, O. (1999). Postsynaptic bursting is essential for ‘Hebbian’ induction of associative long-term potentiation at excitatory synapses in rat hippocampus. The Journal of physiology, 518(2), 571-576.

[15] Rose, G. M., & Dunwiddie, T. V. (1986). Induction of hippocampal long-term potentiation using physiologically patterned stimulation. Neurosci. Lett. 69, 244–248.

[16] Almási, A. D., Woźniak, S., Cristea, V., Leblebici, Y., & Engbersen, T. (2016). Review of advances in neural networks: Neural design technology stack. Neurocomputing, 174, 31-41.


Introducing Yi-Ling

Posted by Yi-Ling Hwong on

Hello everyone! I am Yi-Ling and I am the newest member of the AGI project team. It is an incredibly exciting time to be dipping one’s toes into the field of Artificial Intelligence, given the impressive progress and explosion of AI applications in recent years. In my case, I am actually going to dive in and fully immerse myself in one of the frontier issues and thrilling challenges of AI: Artificial General Intelligence. I will be documenting my journey and learnings in the form of blog posts on this website, and will hopefully spark some interesting discussions with you. But before I do that, here’s a little bit about myself so you get a peek at the person behind the words.

Who am I

I was born and grew up on a beautiful tropical island called Penang, off the northwest coast of Malaysia. I left Malaysia at the age of 20 to pursue a tertiary education in Germany, majoring in power engineering. Upon graduation I was awarded a Marie Curie Fellowship, funded by the European Commission, to work as a software engineer at the European Organisation for Nuclear Research (CERN) in Geneva, Switzerland. The experiment that I was working for, CMS, was one of the two experiments that first discovered the Higgs boson. I have also worked for several nonprofit organisations, including Doctors without Borders, as a digital communication specialist. I am currently a PhD candidate at the University of New South Wales in Sydney, Australia. My research concerns the impact of social media science communication on public trust in science.

I also do a bunch of stuff outside of science. I am a Toastmaster, the Editor in Chief of the Scientific Malaysian magazine, and a salsa dancer. I used to play the keyboard in a rock band and my idea of a perfect Sunday involves jazz, a hammock, coffee, and a good book.

Why did I join AGI

For as long as I can remember, I have been fascinated by the human brain. Not so much its structure and composition, but what it is capable of. My deep fascination with what makes us conscious and sentient beings capable of extraordinary feats, both good and evil, has followed me through my diverse career. I have always believed that the best science happens when humans are driven not by intellect alone, but by a deeper and more visceral desire to understand our nature and the universe. The fact that the pursuit of AGI at times borders on the philosophical makes it all the more interesting to me. My current research is tangentially related to AI in that I am applying machine learning to study big data. I would like to think that becoming a member of the AGI project is a step towards finally fulfilling a lifelong dream.

I believe in the vision and mission of the AGI project. Although I have just met Gideon and Dave, they strike me as intelligent, passionate and generous human beings who genuinely wish to accomplish something meaningful. And I want to be a part of it.

What will I be doing

One of the missions of the AGI project is to rally and connect the community of AI researchers and practitioners. Our blog is one way for us to reach out and network with the community. I will be involved in the research aspect of the AGI project, and will be sharing learnings and ideas through a series of blog posts. There are many topics that we are currently exploring, e.g. sparse coding, unsupervised learning algorithms, deep hierarchical reinforcement learning, etc. These are relatively new (or at least less reported on) concepts compared with the current deep learning paradigm, which has mainly focused on backpropagation techniques. However, we believe they have promising potential to tackle the AGI problem. I will be reviewing the literature in these areas and writing about it. This is useful not only for our own records: by sharing openly what we are currently working on, we also hope to engage you in conversation.

So stay tuned, and till next time, folks.