
Attention in Artificial Intelligence systems

Posted by Yi-Ling Hwong on

One of the defining features of our brain is its modularity: it is characterised by distinct but interacting subsystems that underlie key functions such as memory, language and perception. Understanding the complex interplay between these modules requires decomposing them into sets of components, and indeed this modular approach is one of the ways neuroscientists study the brain. The ‘module’ I am going to talk about in this blog post has its roots in neuroscience but has greatly inspired AI research: attention. I will focus on the neuroscience of attention first before reviewing some of the most exciting developments using attentional mechanisms in machine learning.

The neuroscientific roots of attention

Neuroscientists have long studied attention as an important cognitive process. It is described as the ability of organisms to ‘select a subset of available information upon which to focus for enhanced processing and integration’ and encompasses three aspects: orienting, filtering and searching. Visual attention, for example, is an active area of research. Our ability to focus on a specific area of a visual scene, and to extract and process the information streamed from it to our brain, is thought to be an evolutionary trait that conferred a decisive survival advantage on our species. This capability to select, process and act upon sensory experience has inspired a whole branch of research into computational modelling of visual attention.

Visual attention (image credit: Wikimedia)

The emergence of a whole suite of sophisticated equipment for scanning and studying the brain has further fanned the flames of enthusiasm for attention research. In a recent study using eye tracking and fMRI data, Leong et al. demonstrated the bidirectional interaction between attention and learning: attention facilitates learning, and learned values in turn inform attentional selection [1]. The relationship between attention and consciousness is a complex issue, and in many senses both a scientific and a philosophical exploration. The ability to focus one’s mind on one out of several simultaneous objects or trains of thought, and to take control of one’s own mind in a vivid and conscious manner, is not just a delightful and useful perk. It is a quintessential part of our experience of human-ness.

Given their significance, attentional mechanisms have in recent years received increasing attention (pun intended) from the AI community. A detailed explanation of how they are applied in machine learning would require a separate blog post (I highly recommend this excellent article by Olah and Carter), but in essence an attention layer gives a model the ability to focus on specific elements of its input to improve performance. In an image recognition task, for example, an attention model takes ‘glimpses’ of the input image at each step, updates its internal state representation, and then selects the next location to sample. In a cluttered setting, or when the input is too big, attention serves a ‘prioritisation’ function, filtering out irrelevant elements. It is a powerful technique for interfacing with a neural network that has a repeating structure in its output. For example, when used to augment an LSTM (a special variant of recurrent neural network), it lets every step of the RNN select information to look at from a larger body of information. However, attentional mechanisms are not just useful in RNNs, as we will find out below.
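
To make this concrete, here is a minimal NumPy sketch of ‘soft’ attention as a softmax-weighted sum. The function and variable names are mine, purely for illustration:

    import numpy as np

    def soft_attention(query, keys, values):
        """Score each item against the query, softmax the scores, and
        return an attention-weighted sum of the values."""
        scores = keys @ query                    # similarity of each item to the query
        w = np.exp(scores - scores.max())        # numerically stable softmax
        w /= w.sum()
        return w @ values                        # a blend dominated by high-scoring items

    # Toy example: 4 items with 8-dimensional keys and values
    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    context = soft_attention(keys[2], keys, values)   # a query resembling item 2

Because the weights are a softmax rather than a hard selection, the whole operation is differentiable, which is what lets attention be trained by gradient descent alongside the rest of a network.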

State of the art using attention in machine learning

In machine learning, attention is especially useful in sequence prediction problems. Let’s review a few of the major areas where it has been applied successfully.

1. Natural language processing

Attentional mechanisms have been applied to many natural language processing (NLP) tasks. The seminal work by Bahdanau et al. proposed a neural machine translation model that implements an attention mechanism in the decoder for English-to-French translation [2]. As the encoder reads the English input, the decoder emits the French translation, and the attention mechanism, trained by stochastic gradient descent, learns to shift its focus to the parts of the source sentence surrounding the word being translated. Their RNN-based model outperformed traditional phrase-based models by huge margins. RNNs are the incumbent architecture for text applications, but they do not allow for parallelisation, which limits their ability to exploit the GPU hardware that powers modern machine learning. A team of Facebook AI researchers introduced a novel approach using convolutional neural networks (which are highly parallelisable) and a separate attention module in each decoder layer. As opposed to Bahdanau et al.’s ‘single-step attention’, theirs is a multi-hop attention module: instead of looking at the sentence once and then translating without looking back, the mechanism takes multiple glimpses at the sentence to determine what it will translate next. Their approach beat the state of the art for English-German and English-French translation at an order of magnitude faster speed [3]. Other examples of attentional mechanisms applied to NLP problems include text classification [4], language grounding (performing tasks described by natural language instructions in a 3D game-play environment) [5] and text comprehension (answering cloze-style questions about a document) [6].
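
As a rough sketch of the additive scoring used in Bahdanau-style attention – the weight matrices, dimensions and names below are illustrative assumptions, not taken from the paper:

    import numpy as np

    def additive_attention(s_prev, H, W_s, W_h, v):
        """Bahdanau-style alignment: score each encoder annotation h_j
        against the previous decoder state via
        e_j = v . tanh(W_s s_prev + W_h h_j), softmax the scores, and
        return the attention-weighted context vector."""
        e = np.tanh(s_prev @ W_s + H @ W_h) @ v   # one score per source position
        a = np.exp(e - e.max())
        a /= a.sum()                              # attention weights over source words
        return a @ H, a                           # context vector and the weights

    # Toy shapes: 5 source positions, 16-dim states, 8-dim alignment space
    rng = np.random.default_rng(0)
    s, H = rng.normal(size=16), rng.normal(size=(5, 16))
    W_s, W_h, v = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=8)
    context, weights = additive_attention(s, H, W_s, W_h, v)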

2. Object recognition

Object recognition is one of the hallmarks of machine intelligence. Mnih et al. demonstrated how an attentional mechanism can be used to ignore irrelevant objects in a scene, allowing a model to perform well on challenging object recognition tasks in the presence of clutter [7]. In their Recurrent Attention Model (RAM), the agent receives a partial observation of the environment at each step and learns where to focus (i.e. pay attention) next by training an RNN. Attention is used to produce a ‘glimpse feature vector’, in which the region around a target pixel is encoded at high resolution while pixels further from the target are encoded at progressively lower resolution. Using a similar approach, another study used a deep recurrent attention model to both localise and recognise multiple objects in images [8]. Xu et al. trained a model that automatically learns to describe the content of images [9]. Their attention model was trained using a multilayer perceptron conditioned on the previous hidden state, meaning that where the network looks next depends on the sequence of words it has already generated. The researchers showed how to use convolutional neural networks to pay attention to images when outputting a sequence, i.e. the image caption. Another advantage of attention in this case is the insight gained by approximately visualising where and what the attention focused on (i.e. what the model ‘sees’).

Telling mistakes in image caption generation with visual attention (image taken from Xu et al., 2016)
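
The multi-resolution ‘glimpse’ idea can be sketched as follows; the patch sizes and the use of simple average pooling are my own simplifications of the sensor described in the paper:

    import numpy as np

    def glimpse(image, center, size=8, n_scales=3):
        """Multi-resolution 'retina': crop n_scales patches centred on
        `center`, each twice as wide as the last, then average-pool every
        patch down to size x size. Pixels near the centre keep full
        resolution; distant pixels are increasingly blurred."""
        cy, cx = center
        patches = []
        for s in range(n_scales):
            half = (size * 2 ** s) // 2
            padded = np.pad(image, half, mode="edge")   # keep crops near borders valid
            patch = padded[cy:cy + 2 * half, cx:cx + 2 * half]
            k = 2 ** s                                  # pooling factor back to size x size
            patches.append(patch.reshape(size, k, size, k).mean(axis=(1, 3)))
        return np.stack(patches)                        # (n_scales, size, size)

    # Example: an 8x8 fovea with 3 scales on a 28x28 image, centred at (14, 14)
    img = np.zeros((28, 28)); img[10:18, 10:18] = 1.0
    g = glimpse(img, (14, 14))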

3. Gameplay

Google DeepMind’s Deep Q-Network (DQN) represented a significant advance in reinforcement learning and a breakthrough for general AI, in the sense that it showed a single algorithm could learn to play a wide variety of Atari 2600 games: the agent continually adapted its behaviour without any human intervention. Sorokin et al. added attention to the equation and developed the Deep Attention Recurrent Q-Network (DARQN) [10]. Their model outperformed DQN by incorporating what they termed ‘soft’ and ‘hard’ attention mechanisms: the attention network takes the current game state as input and generates a context vector based on the features observed, and an LSTM then takes this context vector, along with its previous hidden and memory states, to evaluate the action the agent should take. Choi et al. further improved on DARQN by implementing a multi-focus attention network in which the agent can attend to multiple important elements at once [11]. In contrast to DARQN, which uses a single attention layer, their model uses multiple parallel attention layers to attend to the entities relevant to the task.

4. Generative models

Attention has also proven useful in generative models, systems that can simulate (i.e. generate) the values of any variable (inputs and outputs) in the model. Hong et al. developed a deep generative model based on a convolutional neural network for semantic segmentation (the task of assigning class labels to groups of pixels in an image) [12]. By incorporating attention-like mechanisms they were able to capture segmentation knowledge that transfers across categories. Their attention mechanism adaptively focuses on different areas depending on the input labels, and a softmax function encourages the model to attend to only a segment of the image. Another example is Google DeepMind’s Deep Recurrent Attentive Writer (DRAW) neural network for image generation [13]. Attention allows the system to build up an image incrementally (shown in the video below). The attention model is fully differentiable (making it possible to train with gradient descent), allowing the encoder to focus on only part of the input and the decoder to modify only part of the canvas. The model achieved impressive results generating images from the MNIST data set, and when trained on the Street View House Numbers data set it generated images that are almost indistinguishable from real data.

5. Attention alone for NLP tasks

Another exciting line of research focusses on using attentional mechanisms alone for NLP tasks traditionally solved with recurrent or convolutional networks. Vaswani et al. developed the Transformer, a simple network architecture based solely on a novel multi-head attention mechanism for translation tasks [14]. They compute the attention function on a set of queries simultaneously using dot-product attention (each key is multiplied with the query to see how similar they are) with an additional scaling factor. The multi-head approach allows their model to attend to information from different positions at the same time. Their model completely foregoes recurrence and convolutions yet still attained state-of-the-art results for English-to-German and English-to-French translation. Moreover, they achieved this in significantly less training time, and their model is highly parallelisable. An earlier work by Parikh et al. experimented with a simple attention-based approach to natural language inference tasks [15]. They used attention to decompose the problem into subproblems that can be solved individually, making the model trivially parallelisable.
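
The scaled dot-product core of the Transformer is compact enough to sketch directly. This is a simplified, single-example version: masking and the final output projection are omitted, and the projection matrices are random stand-ins for learned weights:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
        scores = Q @ K.T / np.sqrt(K.shape[-1])        # query-key similarities
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)             # softmax over the keys
        return w @ V

    def multi_head(X, heads):
        """Run several attention heads in parallel on the same input and
        concatenate their outputs; each head is a (Wq, Wk, Wv) triple."""
        outs = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
                for Wq, Wk, Wv in heads]
        return np.concatenate(outs, axis=-1)

    # 6 tokens with 16-dim embeddings, 4 heads of width 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 16))
    heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
    Y = multi_head(X, heads)    # (6, 16)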

Not just a cog in the machine

What we have learned about attention so far suggests it is likely to be an essential component in the development of general AI. Philosophically, it is a key feature of the human psyche, which makes it a natural inclusion in pursuits concerning the grey matter; computationally, attention-based mechanisms have boosted model performance to deliver stunning results in many areas. Attention has also proven to be a versatile technique, as is evident in its ability to replace recurrent layers in machine translation and other NLP tasks. But it is most powerful when used in conjunction with other components, as Kaiser et al. demonstrated in their study One Model To Learn Them All, which presented a single model capable of solving problems spanning multiple domains [16]. To be sure, attentional mechanisms are not without weaknesses. As Olah and Carter suggested, their propensity to take every action at every step (albeit to varying extents) can be computationally very costly. Nonetheless, I believe that in a modular approach to developing general AI – in my opinion our best bet in this quest – attention will be a worthwhile, and perhaps even indispensable, module.

References

[1] Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451-463.

[2] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.

[4] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. In HLT-NAACL (pp. 1480-1489).

[5] Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. (2017). Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv preprint arXiv:1706.07230.

[6] Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).

[8] Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

[9] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., … & Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

[10] Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

[11] Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. AAAI Publications, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

[12] Hong, S., Oh, J., Lee, H., & Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3204-3212).

[13] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

[15] Parikh, A., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. In Empirical Methods in Natural Language Processing.

[16] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.


Unsupervised Learning with Spike-Timing Dependent Plasticity

Posted by Yi-Ling Hwong on

Our brain is a source of great inspiration for the development of Artificial General Intelligence. In fact, a common view is that any effort to develop human-level AI is almost destined to fail without an intimate understanding of how the brain works. We do not understand our brain that well yet, but that is another story for another day. In today’s blog post we are going to talk about a machine learning method that takes its inspiration from a biological process underpinning how humans learn: Spike-Timing Dependent Plasticity (STDP).

Biological neurons communicate with each other through synapses, the tiny connections between neurons in our brains. A presynaptic neuron is the neuron that fires an electrical impulse (the signal, so to speak), and a postsynaptic neuron is the neuron that receives it. The wiring of the neurons makes our brain an extremely complex piece of machinery: a typical neuron receives thousands of inputs and sends its signals to over 10,000 other neurons. Incoming signals alter a neuron’s voltage (membrane potential), and when this potential crosses a threshold value the neuron produces a sudden increase in voltage lasting about 1 ms. We refer to these short bursts of electrical energy as spikes. Computers communicate with bits, while neurons use spikes.

Anatomy of a neuron (image credit: Wikimedia)

Artificial Neural Networks (ANNs) attempt to capture this mechanism of neuronal communication through mathematical models. However, these computational models may be an inadequate representation of the brain. To understand the trend towards STDP and why we think it is a viable path forward, let’s back up a little bit and talk briefly about the current common methods in ANNs.

Gradient Descent: the dominant paradigm

Artificial Neural Networks are based on a collection of connected nodes mimicking the behaviour of biological neurons. A receiving (or postsynaptic) node takes multiple inputs, multiplies each by a weight, sums them, applies a nonlinear transfer function, and then propagates the resulting signal to other nodes. The weights change as learning happens, and this process of tweaking the weights is the most important thing in an artificial neural network. One popular learning algorithm is Stochastic Gradient Descent (SGD), and to calculate the gradient of the loss function with respect to the weights, most state-of-the-art ANNs use a procedure called back-propagation. However, the biological plausibility of back-propagation remains highly debatable; for example, there is no evidence of a global error minimisation mechanism in biological neurons. A better learning algorithm, one that raises the biological realism of our models, might therefore help us move towards AGI. This is where the Spiking Neural Network (SNN) comes in.

The incorporation of timing in an SNN

The main difference between a conventional ANN and an SNN is the neuron model used. The neuron model in a conventional ANN does not employ individual spikes in its computations. Instead, the output signals of the neurons are treated as normalised firing rates, or frequencies, of inputs within a certain time frame [1]. This is an averaging mechanism, commonly referred to as rate coding. Consequently, input to the network can be real-valued, instead of a binary time series. In contrast, the neuron model of an SNN uses each individual spike: instead of rate coding, an SNN uses pulse coding. What is important here is that the timing of firing is incorporated into the computations, as in real neurons. The neurons in an SNN do not fire at every propagation cycle; they fire only when signals from incoming neurons accumulate enough charge to reach a certain threshold voltage.

Basic model of a spiking neuron (Image credit: EPFL)
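
The leaky integrate-and-fire neuron is one of the simplest models that captures this threshold-and-fire behaviour. A minimal sketch, with illustrative constants:

    import numpy as np

    def lif_neuron(input_current, dt=1.0, tau=10.0, v_thresh=1.0, v_reset=0.0):
        """Leaky integrate-and-fire: the membrane potential leaks toward
        rest, accumulates input current, and the neuron emits a spike
        whenever the potential crosses threshold."""
        v, spikes = 0.0, []
        for current in input_current:
            v += (dt / tau) * (current - v)   # leaky integration
            if v >= v_thresh:                 # threshold crossing -> spike
                spikes.append(1)
                v = v_reset                   # reset after firing
            else:
                spikes.append(0)
        return np.array(spikes)

    # A constant supra-threshold input produces a regular spike train
    print(lif_neuron(np.full(50, 1.5)).sum(), "spikes in 50 ms")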

The use of individual spikes in pulse coding is more biologically accurate in two ways. First, it is a more plausible representation for tasks where speed is an important consideration, for example in the human visual system. Studies have shown that humans analyse and classify visual input (e.g. for facial recognition) in under 100 ms. Considering it takes at least 10 synaptic steps from the retina to the temporal lobe [2], this leaves about 10 ms of processing time per neuron, too little for an averaging mechanism like rate coding to take place. Hence an implementation that uses pulse coding might be a more suitable model for object recognition tasks, which is currently not the mainstream view given the popularity of conventional ANNs. Second, learning from only local information (i.e. the timing of spikes) is a more biologically realistic representation than a global error minimisation mechanism.

Learning using Spike-Timing Dependent Plasticity

The changing and shaping of neuronal connections in our brain is known as synaptic plasticity. Neurons fire, or spike, to signal the presence of the feature they are tuned for. As the Canadian psychologist Donald Hebb famously put it, “Neurons that fire together, wire together.” Simply put, when two neurons fire at almost the same time the connection between them is strengthened, making them more likely to fire together again in the future; when two neurons fire in an uncoordinated manner the connection weakens, and they become more likely to act independently. This is known as Hebbian learning. The strengthening of a synapse is known as Long Term Potentiation (LTP) and the weakening of synaptic strength as Long Term Depression (LTD). What determines whether a synapse undergoes LTP or LTD is the relative timing of the pre- and postsynaptic firing: if the presynaptic neuron fires within roughly 20 ms before the postsynaptic neuron, LTP occurs; if it fires within roughly 20 ms after, LTD occurs. This timing dependence is known as Spike-Timing Dependent Plasticity (STDP).

This biological mechanism can be adopted as a learning rule in machine learning. A general approach is to apply a delta rule Δw to each synapse in a network to compute its weight change. The weight change is positive (increasing the strength of the synaptic connection) if the postsynaptic neuron fires just after the presynaptic neuron, and negative if it fires just before. In contrast to the supervised learning performed with backpropagation, STDP is an unsupervised learning method. This is another reason STDP-based learning is believed to more accurately reflect human learning, given that much of the most important learning we do is experiential and unsupervised, i.e. there is no “right answer” available for the brain to learn from.
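
A minimal sketch of such a pair-based delta rule, using an exponential window that roughly matches the ~20 ms time constant mentioned above (the amplitude and decay constants are illustrative, not canonical):

    import numpy as np

    def stdp_delta_w(t_pre, t_post, a_plus=0.01, a_minus=0.012, tau=20.0):
        """Pair-based STDP: the weight change depends only on the relative
        timing of pre- and postsynaptic spikes.
          dt = t_post - t_pre > 0  (pre fires first)  -> LTP, positive delta
          dt <= 0                  (post fires first) -> LTD, negative delta
        Both effects decay exponentially with time constant tau (~20 ms)."""
        dt = t_post - t_pre
        if dt > 0:
            return a_plus * np.exp(-dt / tau)    # potentiation
        return -a_minus * np.exp(dt / tau)       # depression

    # Pre leads post by 5 ms -> strengthen; post leads pre by 5 ms -> weaken
    print(stdp_delta_w(t_pre=10.0, t_post=15.0))   # > 0 (LTP)
    print(stdp_delta_w(t_pre=15.0, t_post=10.0))   # < 0 (LTD)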

Applications

STDP represents a potential shift in approach to developing learning procedures for neural networks. Recent research has predominantly applied it to pattern recognition tasks. One 2015 study using an exponential STDP learning rule achieved 95% accuracy on MNIST [3], a large database of handwritten digits widely used as a benchmark in computer vision. Merely a year later, researchers had made significant progress. For example, Kheradpisheh et al. achieved 98.5% accuracy on MNIST by combining an SNN with features of deep learning [4]. Their network comprised several convolutional and pooling layers, with STDP learning rules used in the convolutional layers to learn the features. Another interesting study took its inspiration from reinforcement learning and combined it with a hierarchical SNN to perform pattern recognition [5]. Using a network of two simple and two complex layers and a novel reward-modulated STDP (R-STDP), their method outperformed classic unsupervised STDP on several image datasets. STDP has also been applied to real-time learning, taking advantage of its speed [6]: the SNN and fast unsupervised STDP learning method developed there achieved an impressive 21.3 fps in training and 17.9 fps in testing. To put this in perspective, motion appears smooth to the human eye at around 24 fps.

Apart from object recognition, STDP has also been applied to speech recognition tasks. One study uses an STDP-trained, nonrecurrent SNN to convert speech signals into spike-train signatures for speech recognition [7]. Another combines a hidden Markov model with an SNN and STDP learning to classify segments of sequential data, such as individual spoken words [8]. STDP has also proven useful in modelling pitch perception (i.e. recognising tones): researchers developed a computational model of a neural network that learns, using STDP rules, to identify (and strengthen) the neuronal connections that are most effective for extracting pitch [9].

Final thoughts

Having learned all this about STDP, what can we conclude about the state of the art in machine learning? We think conventional Artificial Neural Networks are probably here to stay: they are simplistic models of neurons, but they work. How suitable supervised ANNs are for the development of AGI, however, is debatable. On the other hand, while the Spiking Neural Network is a more authentic model of how the human brain works, its performance still lags behind that of ANNs on some tasks, not least because far more research has been done on supervised ANNs than on SNNs. Despite its intuitive appeal and biological validity, there are also many neuroscientific experiments in which STDP has not matched observations [10]. One major quandary is the observation of LTD in certain hippocampal neurons (the CA3 and CA1 regions, to be precise) when low-frequency (1 Hz) presynaptic stimulation drives postsynaptic firing [11]; conventional STDP wisdom says LTP should occur in this case. The frequency dependence of plasticity does not stop there: at high enough frequencies (i.e. firing rates), the STDP learning rule becomes LTP-only, with both pre-before-post and post-before-pre spike timings producing LTP [12]. Several additional mechanisms also appear to influence STDP learning. For example, LTD can be converted to LTP by altering the firing pattern of the postsynaptic spikes: bursts, or even a pair of spikes, in the postsynaptic neuron lead to LTP where single spikes would have led to LTD [13] [14]. Plasticity also appears to accumulate as a nonlinear function of the number of pre- and postsynaptic pairings, with depression accumulating at a lower rate than potentiation, i.e. requiring more pairings [13]. Finally, it seems that neural activity that causes no measurable plasticity may have a ‘priming’ effect on subsequent activity: in the CA1 region, for example, LTP could be induced with as few as four stimuli, provided a single priming stimulus was given 170 ms earlier [15].

The SNN’s inferior performance compared with other ANNs may be due to its poor scalability. Large-scale SNNs are relatively rare because the computational intensity involved in designing such networks is not yet fully supported by most high-performance computing platforms (there are, however, exceptions such as this and this). Most implementations today use only one or two trainable layers of unsupervised learning, which limits their generalisation capabilities [16]. Moreover, and perhaps most importantly, STDP is vulnerable to a common shortcoming of unsupervised learning algorithms: it works well at sifting out statistically significant features, but has trouble identifying rare but diagnostic features, which are crucial in important processes such as decision making. My sense is that if STDP is to become the key to unlocking the secrets of AGI, its implementations will need more creativity, taking advantage of its biological roots and nuances while striving for a general-purpose learning algorithm.

What do you think? Comment and let us know your thoughts!

References

[1] Vreeken, J. (2003). Spiking neural networks, an introduction.

[2] Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural networks, 14(6), 715-725.

[3] Diehl, P. U., & Cook, M. (2015). Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in computational neuroscience, 9.

[4] Kheradpisheh, S. R., Ganjtabesh, M., Thorpe, S. J., & Masquelier, T. (2016). STDP-based spiking deep neural networks for object recognition. arXiv preprint arXiv:1611.01421.

[5] Mozafari, M., Kheradpisheh, S. R., Masquelier, T., Nowzari-Dalini, A., & Ganjtabesh, M. (2017). First-spike based visual categorization using reward-modulated STDP. arXiv preprint arXiv:1705.09132.

[6] Liu, D., & Yue, S. (2017). Fast unsupervised learning for visual pattern recognition using spike timing dependent plasticity. Neurocomputing, 249, 212-224.

[7] Tavanaei, A., & Maida, A. S. (2017). A spiking network that learns to extract spike signatures from speech signals. Neurocomputing, 240, 191-199.

[8] Tavanaei, A., & Maida, A. S. (2016). Training a Hidden markov model with a Bayesian spiking neural network. Journal of Signal Processing Systems, 1-10.

[9] Saeedi, N. E., Blamey, P. J., Burkitt, A. N., & Grayden, D. B. (2016). Learning Pitch with STDP: A Computational Model of Place and Temporal Pitch Perception Using Spiking Neural Networks. PLoS computational biology, 12(4), e1004860.

[10] Shouval, H. Z., Wang, S. S. H., & Wittenberg, G. M. (2010). Spike timing dependent plasticity: a consequence of more fundamental learning rules. Frontiers in Computational Neuroscience, 4.

[11] Wittenberg, G. M., and Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. J. Neurosci. 26, 6610–6617.

[12] Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32(6), 1149-1164.

[13] Wittenberg, G. M., and Wang, S. S.-H. (2006). Malleability of spike-timing-dependent plasticity at the CA3-CA1 synapse. J. Neurosci. 26, 6610–6617.

[14] Pike, F. G., Meredith, R. M., Olding, A. W., & Paulsen, O. (1999). Postsynaptic bursting is essential for ‘Hebbian’ induction of associative long-term potentiation at excitatory synapses in rat hippocampus. The Journal of Physiology, 518(2), 571-576.

[15] Rose, G. M., and Dunwiddie, T. V. (1986). Induction of hippocampal long-term potentiation using physiologically patterned stimulation. Neurosci. Lett. 69, 244–248.

[16] Almási, A. D., Woźniak, S., Cristea, V., Leblebici, Y., & Engbersen, T. (2016). Review of advances in neural networks: Neural design technology stack. Neurocomputing, 174, 31-41.


The Region-Layer: A building block for AGI

Posted by ProjectAGI on
Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar – overlapping – but unique Receptive Fields – illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column’s input. 

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner, and can be sandwiched by other Region-Layers on parallel surfaces, through which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, and another Region-Layer the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see earlier articles on Layers and Levels.
The Region-Layer has a biological analogue – it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.
We assume the Region-Layer is trained by unsupervised learning only: it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online, as an agent learning from experience, and should adapt to non-stationary input statistics at any time.
The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.
We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols. 
Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.
So let’s put our cards on the table and test these ideas. 

Region-Layer Implementation

Parameters

For the algorithm outlined below, very few parameters are required, and those that are merely describe the resources available to the Region-Layer. In theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.
  • RW: Width of region layer in Columns
  • RH: Height of region layer in Columns
  • CW: Width of column in Cells 
  • CH: Height of column in Cells

Inputs and Outputs

  • Feed-Forward Input (FFI): Must be sparse and binary. Size: a matrix of any dimension*.
  • Feed-Back Input (FBI): Sparse and binary. Size: a vector of any dimension.
  • Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+.
  • Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+.
* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.
+  Region Area = CW * CH * RW * RH

Pseudocode

    Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.
    We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!
    function: UpdateAndTrain( 
      feed_forward_input, 
      feed_back_input, 
      prediction_disinhibition 
    )

    // if there is no active feed-forward input, do nothing
    if( sum( feed_forward_input ) == 0 ) {
      return
    }

    // Sparse activation
    // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm, 
    // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
    activity(t) = 0

    for-each( column c ) {
      // find cell x that most responds to FFI 
      // in current sequential context given: 
      //  a) prior active cells in region 
      //  b) feedback input.
      x = findBestCellsInColumn( feed_forward_input, feed_back_input, c )

      activity(t)[ x ] = 1
    }

    // Change detection
    // if active cells in region unchanged, then do nothing
    if( activity(t) == activity(t-1) ) {
      return
    }

    // Update receptive fields to organize columns
    trainReceptiveFields( feed_forward_input, columns )

    // Update cell weights given column receptive fields
    // and selected active cells
    trainCells( feed_forward_input, feed_back_input, activity(t) )

    // Predictive coding: output false-negative errors only [5]
    for-each( cell x in region-layer ) {

      coding = 0

      if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
        coding = 1
      }
      // optional: mute output from region, for attentional gating of hierarchy
      if( prediction_disinhibition[x] == 0 ) {
        coding = 0 
      }

      output(t)[x] = coding
    }

    // Update prediction
    // Note: Predictor can be as simple as first-order Hebbian learning. 
    // The prediction model is variable order due to the inclusion of sequential 
    // context in the active cell selection step.
    trainPredictor( activity(t), activity(t-1) )
    prediction(t) = predict( activity(t) )
    [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf
    [2] https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
    [3] http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
    [4] https://arxiv.org/pdf/1312.5663
    [5] http://www.ncbi.nlm.nih.gov/pubmed/10195184

    Reading list – May 2016

    Posted by ProjectAGI on
    Digit classification error over time in our experiments. The image isn’t very helpful but it’s a hint as to why we’re excited 🙂

    Project AGI

    A few weeks ago we paused the “How to build a General Intelligence” series (part 1, part 2, part 3, part 4). We paused it because the next article in the series requires us to specify everything in detail, and we need working code to do that.

    We have been testing our algorithm on a variety of MNIST-derived handwritten digit datasets, to better understand how well it generalizes its representation of digit-images and how it behaves when exposed to varying degrees of predictability. Initial results look promising: We will post everything here once we’ve verified them and completed the first batch of proper experiments. The series will continue soon!

    Deep Unsupervised Learning

    Our algorithm is a type of Online Deep Unsupervised Learning, so naturally we’re looking carefully at similar algorithms.

    We recommend this video of a talk by Andrew Ng. It starts with a good introduction to the methods and importance of feature representation, and touches on types of automatic feature discovery. He looks at some of the important feature detectors in computer vision, such as SIFT and HoG, and shows how feature detectors – such as edge detectors – can emerge from more general pattern recognition algorithms such as sparse coding. For more on sparse coding see Shakir’s excellent machine learning blog.

    For anyone struggling to intuit deep feature discovery, I also loved this post on yCombinator which nicely illustrates how and why deep networks discover useful features, and why the depth helps.

    The latter part of the video covers Ng’s latest work on deep hierarchical sparse coding using Deep Belief Networks, in turn based on AutoEncoders. He reports benchmark-beating results on video activity and phoneme recognition with this framework. You can find details of his deep unsupervised algorithm here:

    http://deeplearning.stanford.edu/wiki

    Finally he presents a plot suggesting that training dataset size is a more important determiner of eventual supervised network performance than algorithm choice! This is a fundamental limitation of supervised learning where the necessary training data is much more limited than in unsupervised learning (in the latter case, the real world provides a handy training set!)

    Effect of algorithm and training set size on accuracy. Training set size more significant. This is a fundamental limitation of supervised learning.

    Online K-sparse autoencoders (with some deep-ness)

    We’ve also been reading this paper by Makhzani and Frey about deep online learning with autoencoders (a neural network trained with supervised-learning machinery, but in an unsupervised way, to reconstruct its own input). We have actually struggled to find any comparison of autoencoders with earlier methods of unsupervised learning, in terms of both computational efficiency and the ability to cover the search space effectively. Let us know if you find a paper that covers this.

    The Makhzani paper has some interesting characteristics – the algorithm is online, which means it receives data as a stream rather than in batches. It is also sparse, which we believe is desirable from a representational perspective.
    One limitation is that the solution is most likely unable to handle changes in input data statistics (i.e. non-stationary problems). This matters because, in any arbitrarily deep network, a typical vertex sits between higher and lower vertices; if all vertices are continually learning, the problem being modelled by any single vertex is constantly changing. Intermediate vertices must therefore be capable of online learning of non-stationary problems, or the network will not function effectively. Makhzani and Frey instead use the greedy layerwise training approach from Deep Belief Networks, which they describe as follows:
    “4.6. Deep Supervised Learning Results: The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layerwise pre-training (Bengio et al., 2007). We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers.”
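
    The defining step of the k-sparse autoencoder itself is simple: keep only the k largest hidden activations and reconstruct from those. A minimal sketch, assuming tied weights and omitting training:

        import numpy as np

        def k_sparse_forward(x, W, b, k):
            """Hidden code of a k-sparse autoencoder: a linear encoding in
            which only the k largest activations are kept (the rest are
            zeroed), followed by reconstruction with tied weights."""
            h = W @ x + b
            h[np.argsort(h)[:-k]] = 0.0   # silence all but the top-k units
            return h, W.T @ h             # sparse code and reconstruction

        # 64 hidden units over a 784-dimensional input, 6 units kept active
        rng = np.random.default_rng(0)
        W, b = 0.01 * rng.normal(size=(64, 784)), np.zeros(64)
        code, recon = k_sparse_forward(rng.normal(size=784), W, b, k=6)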

    The limitation introduced can be thought of as an inability to escape from local minima that result from prior training. This paper by Choromanska et al tries to explain why this happens.
    Greedy layerwise training is an attempt to work around the fact that deep belief networks of Autoencoders cannot effectively handle nonstationary problems.

    For more information, here are some papers on deep sparse networks built from autoencoders:

    Variations on Supervised Learning – a Taxonomy

    Back to supervised learning, and the limitation of training dataset size. Thanks to a discussion with Jay Chakravarty we have this brief taxonomy of supervised learning workarounds for insufficient training datasets:

    Weakly supervised learning: for poorly labelled training data, where you want to learn models for object recognition under weak supervision – you have object labels for images, say, but no localization (e.g. a bounding box) for the object in the image (there might be other objects in the image as well). You would use a Latent SVM to solve the problem of localizing the objects in the images while simultaneously learning a classifier for them.
    Another example of weakly supervised learning: you have a bag of positive samples mixed up with negative training samples, but also a bag of purely negative samples – you would use Multiple Instance Learning for this.

    Cross-modal adaptation: where one mode of data supervises another – e.g. audio supervises video or vice-versa.

    Domain adaptation: model learnt on one set of data is adapted, in unsupervised fashion, to new datasets with slightly different data distributions.

    Transfer learning: using the knowledge gained in learning one problem on a different, but related problem. Here’s a good example of transfer learning, a finalist in the NVIDIA 2016 Global Impact Award. The system learns to predict poverty from day and night satellite images, with very few labelled samples.

    Full paper:

    http://arxiv.org/pdf/1510.00098v2.pdf

    Interactive Brain Concept Map

    We enjoyed this interactive map of the distribution of concepts within the cortex captured using fMRI and produced by the Gallant Lab (source papers here).

    Using the map you can find the voxels corresponding to various concepts, which, although maybe not generalizable due to the small sample size (7), gives you a good idea of the hierarchical structure the brain has produced and what the intermediate concepts represent.

    Thanks to David Ray @ http://cortical.io for the link.

    Interactive brain concept map

    OpenAI Gym – Reinforcement Learning platform

    We also follow the OpenAI project with interest. OpenAI have just released their “Gym” – a platform for training and testing reinforcement learning algorithms. Have a play with it here:

    https://openai.com/blog/openai-gym-beta/

    According to Wired magazine, OpenAI will continue to release free and open source software (FOSS) for the wider impact this will have on uptake. There are many companies now competing to win market share in this space.

    The Talking Machines Blog

    We’re regular readers of this blog and have been meaning to mention it for months. Worth reading.

    How the brain generates actions

    A big gap in our knowledge is how the brain generates actions from its internal representation. This new paper by Vicente et al challenges the established (rather vague) dogma on how the brain generates actions.
    “We found that contrary to common belief, the indirect pathway does not always prevent actions from being performed, it can actually reinforce the performance of actions. However, the indirect pathway promotes a different type of actions, habits.”

    This is probably quite informative for reverse-engineering purposes. Full paper here.

    Hierarchical Temporal Memory

    HTM is an online method for feature discovery and representation, and now we have a baseline result for HTM on the famous MNIST digit classification problem. Since HTM works with time-series data, the paper compares HTM to LSTM (Long Short-Term Memory), the leading supervised-learning approach to this problem domain.

    It is also interesting that the paper deals with adaptation to sudden changes in the input data statistics, the very problem that frustrates the deep belief networks described above.

    Full paper by Cui et al here.

    For a detailed mathematical description of HTM see this paper by Mnatzaganian and Kudithipudi.


    Reading list: Assorted AGI links. March 2016

    Posted by ProjectAGI on
    A Minecraft API is now available to train your AGIs

    Our News

    We are working hard on experiments, and software to run experiments. So this week there is no normal blog post. Instead, here’s an eclectic mix of links we’ve noticed recently.

    First, AlphaGo continues to make headlines. Of interest to Project AGI is Yann LeCun agreeing with us that unsupervised hierarchical modelling is an essential step in building intelligence with humanlike qualities [1]. We also note this IEEE Spectrum post by Jean-Christophe Baillie [2] which argues, as we did [3], that we need to start creating embodied agents.

    Minecraft 

    Speaking of which, the BBC reports that the Minecraft team are preparing an API for machine learning researchers to test their algorithms in the famous game [4]. The Minecraft team also stress the value of embodied agents and the depth of gameplay and graphics. It sounds like Minecraft could be a crucial testbed for an AGI. We’re always on the lookout for test problems like these.

    Of course, to play Minecraft well you need to balance local activities – building, mining etc. – with exploration. Another frontier, beyond AlphaGo, is exploration. Monte-Carlo Tree Search (as used in AlphaGo) explores in more limited ways than humans do, argues John Langford [5].

    Sharing places with robots 

    If robots are going to be embodied, we need to make some changes. Wired magazine says that a few small changes to the urban environment and driver behaviour will make the rollout of autonomous vehicles easier [6]. It’s important to meet the machines halfway, for the benefit of all.

    This excellent paper on robotic grasping also caught our attention [7]. A key challenge in this area is adaptability to slightly varying circumstances, such as variations in the objects being grasped and their pose relative to the arm. General solutions to these problems will suddenly make robots far more flexible and applicable to a greater range of tasks.

    Hierarchical Quilted Self-Organizing Maps & Distributed Representations

    Last week I also rediscovered this older paper on Hierarchical-Quilted Self-Organizing Maps (HQSOMs) [8]. This is close to our hearts because we originally believed this type of representation was the right approach for AGI. With the success of Deep Convolutional Networks (DCNs), it’s worth looking back and noticing the similarities between the two. While HQSOM is purely unsupervised learning (a plus – see the comment from Yann LeCun above), DCNs are trained by supervised techniques. However, both methods use small, overlapping, independent units – analogous to biological cortical columns – to classify different patches of the input. The overlapping and independent classifiers lead to robust and distributed representations, which is probably the reason these methods work so well.

    Distributed representation is one of the key features of Hawkins’ Hierarchical Temporal Memory (HTM). Fergal Byrne has recently published an updated description of the HTM algorithm [9] for those interested.

    We at Project AGI believe that a grid-like “region” of columns employing a “Winner-Take-All” policy [10], with overlapping input receptive fields, can produce a distributed representation. Different regions are then connected together into a tree-like structure (acyclic). The result is a hierarchy. Not only does this resemble the state-of-the-art methods of DCNs, but there’s a lot of biological evidence for this type of representation too. This paper by Rinkus [11] describes columnar features arranged into a hierarchy, with winner-take-all behaviour implemented via local inhibition.

    Rinkus says: “Saying only that a group of L2/3 units forms a WTA CM places no a priori constraints on what their tuning functions or receptive fields should look like. This is what gives that functionality a chance of being truly generic, i.e., of applying across all areas and species, regardless of the observed tuning profiles of closely neighboring units.”
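
    As a toy illustration of that winner-take-all column idea – receptive-field wiring and learning are omitted, and all names are ours:

        import numpy as np

        def wta_region(input_vec, column_weights):
            """One winner per column: every cell in a column scores the
            input, and local inhibition silences all but the best-matching
            cell. The winners across columns form a sparse, distributed
            code for the input."""
            code = []
            for W in column_weights:                 # one (cells x input) matrix per column
                scores = W @ input_vec               # each cell's response
                code.append(int(np.argmax(scores)))  # winner-take-all within the column
            return code

        # 4 columns of 16 cells over a 32-dimensional input
        rng = np.random.default_rng(0)
        columns = [rng.normal(size=(16, 32)) for _ in range(4)]
        sparse_code = wta_region(rng.normal(size=32), columns)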

    Reinforcement Learning 

    But unsupervised learning can’t be the only form of learning. We also need to consider consequences, and so we need reinforcement learning to take account of these. As Yann said, the “cherry on the cake” (this is probably understating the difficulty of the RL component, but right now it seems easier than creating representations).

    Shakir’s Machine Learning blog has a great post exploring the biology of reinforcement learning [12] within the brain. This is a good overview of the topic and useful for ML researchers wanting to access this area.

    But regular readers of this blog will remember that we’re obsessed with unfolding or inverting abstract plans into concrete actions. We found a great paper by Manita et al [13] that shows biological evidence for the translation and propagation of an abstract concept into sensory and motor areas, where it can assist with perception. This is the hierarchy in action.

    Long-Short-Term Memory (LSTM)

    One more tack before we finish. Thanks to Jay for this link to NVIDIA’s description of LSTMs [14], an architecture for recurrent neural networks (i.e. the state can depend on the previous state of the cells). It’s a good introduction, but we’re still fans of Monner’s Generalized LSTM [15].

    Fun thoughts

    Now let’s end with something fun. Wired magazine again, describing watching AlphaGo as our first taste of a superhuman intelligence [16]. Although this is a “narrow” intelligence, not a general one, it has qualities beyond anything we’ve experienced in this domain before. What’s more, watching these machines can make us humans better, without any nasty bio-engineering:

    “But as hard as it was for Fan Hui to lose back in October and have the loss reported across the globe—and as hard as it has been to watch Lee Sedol’s struggles—his primary emotion isn’t sadness. As he played match after match with AlphaGo over the past five months, he watched the machine improve. But he also watched himself improve. The experience has, quite literally, changed the way he views the game. When he first played the Google machine, he was ranked 633rd in the world. Now, he is up into the 300s. In the months since October, AlphaGo has taught him, a human, to be a better player. He sees things he didn’t see before. And that makes him happy. ‘So beautiful,’ he says. ‘So beautiful.’”

    References

    [1] https://www.facebook.com/yann.lecun/posts/10153426023477143

    [2] http://spectrum.ieee.org/automaton/robotics/artificial-intelligence/why-alphago-is-not-ai

    [3] http://blog.agi.io/2016/03/what-after-alphago.html

    [4] http://www.bbc.com/news/technology-35778288

    [5] http://cacm.acm.org/blogs/blog-cacm/199663-alphago-is-not-the-solution-to-ai/fulltext

    [6] http://www.wired.com/2016/03/self-driving-cars-wont-work-change-roads-attitudes/

    [7] http://arxiv.org/pdf/1603.02199v1.pdf

    [8] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf

    [9] http://arxiv.org/pdf/1509.08255v2.pdf

    [10] https://en.wikipedia.org/wiki/Winner-take-all_(computing)

    [11] http://journal.frontiersin.org/article/10.3389/fnana.2010.00017/full

    [12] http://blog.shakirm.com/2016/02/learning-in-brains-and-machines-1/

    [13] https://www.researchgate.net/profile/Masanori_Murayama/publication/277144323_A_Top-Down_Cortical_Circuit_for_Accurate_Sensory_Perception/links/556839e008aec22683011a30.pdf

    [14] https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/

    [15] http://www.overcomplete.net/papers/nn2012.pdf

    [16] http://www.wired.com/2016/03/sadness-beauty-watching-googles-ai-play-go/


    How to build a General Intelligence: An interpretation of the biology

    Posted by ProjectAGI on
    Figure 1: Our interpretation of the Thalamocortical system as 3 interacting sub-systems (objective, subjective and executive). The structure of the diagram indicates the dominant direction of information flow in each system. The objective system is primarily concerned with feed-forward data flow, for the purpose of building a representation of the actual agent-world system. The executive system is responsible for making desired future agent-world states a reality. When predictions become observations, they are fed back into the objective system. The subjective system is circular because its behaviour depends on internal state as much as external. It builds a filtered, subjective model of observed reality that also represents objectives or instructions for the executive. This article will describe how this model fits into the structure of the Thalamocortical system.

    Authors: David Rawlinson and Gideon Kowadlo

    This is part 4 of our series on how to build an artificial general intelligence (AGI).

    • Part 1: An overview of hierarchical general intelligence
    • Part 2: Reverse engineering (the physical perspective – cells and layers – and the logical perspective – a hierarchy).
    • Part 3: Circuits and pathways; we introduced our canonical cortical micro-circuit and fitted pathways to it.

    In this article, part 4, we interpret all the information provided so far, trying to fit what we know about biological general intelligence to our theoretical expectations.

    Systems

    We believe cortical activity can be usefully interpreted as 3 integrated systems. These are:

    • Objective system
    • Subjective system
    • Executive system

    So, what are these systems, why are they needed and how do they work?

    Objective System

    We theorise that the purpose of the objective system is to construct a hierarchical, generative model of both the external world and the actual state of the agent. This includes internal plans & goals already executed or in progress. From our conceptual overview of General Intelligence, we think this representation should be distributed and compositional, and therefore robust and able to model novel situations instantly and meaningfully.

    The objective system models varying timespans depending on the level of abstraction, but events are anchored to the current state of the world and agent. Abstract events may cover long periods of time – for example, “I made dinner” might be one conceptual event.

    We propose that the objective system is implemented by pyramidal cells in layers 2/3 and by spiny excitatory cells in layer 4. Specifically, we suggest that the purpose of the spiny excitatory cells is primarily dimensionality reduction, by performing a classifier function, analogous to the ‘Spatial Pooling’ function of Hawkins’ HTM theory. This is supported by analysis of C4 spiny stellate connectivity: “… spiny stellate cells act predominantly as local signal processors within a single barrel…”. We believe the pyramidal cells are more complex and have two functions. First, they perform dimensionality reduction by requiring a set of active inputs on specific apical (distal) dendrite branches to be simultaneously observed before the apical dendrite can output a signal (an action potential). Second, they use basal (proximal) dendrites to identify the sequential context in which the apical dendrite has become active. Via a local competitive process, pyramidal cells learn to become active only when observing a set of specific input patterns in specific historical contexts.
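
    To make this interpretation concrete, here is a toy Python sketch of a single pyramidal-cell-like unit combining the two proposed functions: simultaneous pattern matching (apical dendrites) and sequential context recognition (basal dendrites), with interneuron-like competition across the column. The wiring, sparsity and thresholds are our illustrative assumptions, not anatomical facts.

        import numpy as np

        class PyramidalUnit:
            """Toy C2/3 pyramidal cell: becomes active only when a specific input
            pattern (apical dendrites) appears in a learned sequential context
            (basal dendrites). Wiring, sparsity and thresholds are illustrative."""

            def __init__(self, n_inputs, n_context, threshold=0.8, rng=None):
                rng = rng or np.random.default_rng()
                self.apical = rng.random(n_inputs) < 0.1   # input sub-pattern this cell matches
                self.basal = rng.random(n_context) < 0.1   # contexts in which the pattern is valid
                self.threshold = threshold

            def _overlap(self, mask, activity):
                n = mask.sum()
                return activity[mask].mean() if n else 0.0

            def active(self, ff_input, prior_context):
                # both the simultaneous pattern AND the sequential context must match
                return (self._overlap(self.apical, ff_input) >= self.threshold and
                        self._overlap(self.basal, prior_context) >= self.threshold)

        def column_winners(units, ff_input, prior_context, k=2):
            """Interneuron-like local competition: at most k cells become active."""
            ranked = sorted(units, key=lambda u: -u._overlap(u.apical, ff_input))
            return [u for u in ranked[:k] if u.active(ff_input, prior_context)]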

    The output of pyramidal cells in C2/3 is routed via the Feed-Forward Direct pathway to a “higher” or more abstract cortical region, where it enters in C4 (or in some parts of the Cortex, C2/3 directly). In this “higher” region, the same classifier and context recognition process is repeated. If C4 cells are omitted, we have less dimensionality reduction and a greater emphasis on sequential or historical context.

    We propose these pyramidal cells only output along their axons when they become active without entering a “predicted” state first. Alternatively, interneurons could play a role in inhibiting cells via prediction to achieve the same effect. If pyramidal cells only produce an output when they make a False-Negative prediction error (i.e. they fail to predict their active state), the output is equivalent to Predictive Coding. Predictive Coding produces an output that is more stable over time, which is a form of Temporal Pooling as proposed by Numenta.

    To summarize, the computational properties of the objective system are:

    1. Replace simultaneously active inputs with a smaller set of active cells representing particular sub-patterns, and
    2. Replace predictable sequences of active cells with a false-negative error coding to transform the output into a simpler sequence of prediction errors

    These functions will achieve the stated purpose of incrementally transforming input data into simpler forms with accumulating invariances, while propagating (rather than hiding) errors, for further analysis in other Columns or cortical regions. In combination with a tree-like hierarchical structure, higher Columns will process data with increasing breadth and stability over time and space.
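
    Here is a toy sketch of the second computational property. If only unpredicted activity (false-negative errors) is emitted, a perfectly learned sequence produces an almost empty, stable output, while surprises propagate up the hierarchy. The set-based encoding is purely illustrative.

        def predictive_output(active_cells, predicted_cells):
            """Emit only false-negative prediction errors: cells that became active
            without having been predicted. A perfectly predicted sequence therefore
            produces a stable, near-empty output (the 'temporal pooling' effect),
            while surprises are propagated for further analysis."""
            return active_cells - predicted_cells   # set difference

        # Toy run: steps 2 and 3 were predicted, so nothing is emitted for them;
        # the unpredicted event 'x' is the only thing passed up the hierarchy.
        actual    = [{"a"}, {"b"}, {"c"}, {"x"}]
        predicted = [set(), {"b"}, {"c"}, {"d"}]
        for active, expected in zip(actual, predicted):
            print(predictive_output(active, expected))   # {'a'}, set(), set(), {'x'}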

    The Feed-Forward direct pathway is not filtered by the Thalamus. This means that Columns always have access to the state of objective system pyramidal cells in lower columns. This could explain the phenomenon that we can process data without being aware of it (aka “Blindsight”); essentially the objective system alone does not cause conscious attention. This is a very useful quality, because it means the data required to trigger a change in attention is available throughout the cortex. The “access” phenomenon is well documented and rather mysterious; the organisation of the cortex into objective and subjective systems could explain it.

    Another purpose of the objective system is to ensure internal state cannot become detached from reality. This can easily occur in graphical models, when cycles form that exclude external influence. To prevent this, we believe that the roles of feed-forward and feed-back input must be separated to break the cycles. However, C2/3 pyramidal cells’ dendrites receive both feed-forward (from C4) and feed-back input (via C1).

    One way that this problem might be avoided is by different treatment of feed-forward and feed-back input, so that the latter can be discounted when it is contradicted by feed-forward information. There is evidence that feed-forward and feedback signals are differently encoded, which would make this distinction possible.

    We speculate that the set of states represented by the cells in C2/3 could be defined only using feed-forward input, and that the purpose of feedback data in the objective system is restricted to improved prediction, because feedback contains state information from a larger part of the hierarchy (see figure 2).

    Figure 2: The benefit of feedback. This figure shows part of a hierarchy. The hierarchy structure is defined by the receptive fields of the columns (shown as lines between cylinders, left). Each Column has receptive fields of similar size. Moving up the hierarchy, Columns receive increasingly abstract input with a greater scope, being at the top of a pyramid of lower Columns whose receptive fields collectively cover a much larger area of input. Feedback has the opposite effect, summarizing a much larger set of Column states from elsewhere and higher in the hierarchy. Of course there is information loss during these transfers, but all data is fully represented somewhere in the hierarchy.

    So although the objective system makes use of feedback, the hierarchy it defines should be predominantly determined by feed-forward information. The feed-forward direct pathway (see figure 3) enables the propagation of this data and consequently the formation of the hierarchy.

    Figure 3: Feed-Forward Direct pathway within our canonical cortical micro-circuit. Data travels from C4 to C2/3 and then to C4 in a higher Column. This pattern is repeated up the hierarchy. This pathway is not filtered by the Thalamus or any other central structure, and note that it is largely uni-directional (except for feedback to improve prediction accuracy). We propose this pathway implements the Objective System, which aims to construct a hierarchical generative model of the world and the agent within it.

    Subjective System

    We think that the subjective system is a selectively filtered model of both external and internal state including filtered predictions of future events. We propose that filtering of input constitutes selective attention, whereas filtering of predictions constitutes action selection and intent. So, the system is a subjective model of reality, rather than an objective one, and it is used for both perception and planning simultaneously.

    The time span encompassed by the system includes a subset of both present and future event-concepts, but as with the objective system, this may represent a long period of real-world time, depending on the abstraction of the events (for example, “now” I am going to work, and “next” I will check my email [in 1 hour’s time]).

    It makes good sense to have two parallel systems, one filtered (subjective) and one not (objective). Filtering external state reduces distraction and enhances focus and continuity. Filtering of future predictions allows selected actions to be maintained and pursued effectively, to achieve goals.

    In addition to events the agent can control, it is important to be aware of negative outcomes outside the agent’s control. Therefore the state of the subjective system must include events with both positive and negative reward outcomes. There is a big difference between a subjective model and a goal-oriented planning model. The subjective system should represent all outcomes, but preferentially select positive outcomes for execution.

    The subjective system represents potential future states, both internal and external. It does not necessarily represent reality; it represents a biased interpretation of intended or expected outcomes based on a biased interpretation of current reality! These biases and omissions are useful; they provide the ability to “imagine” future events by serially “predicting” a pruned tree of potential futures.

    More speculatively, differences between the subjective and objective systems may be the cause of phenomena such as selective awareness and “access” consciousness.

    Figure 4: Feed-Forward Indirect pathway, particularly involved in the Subjective system due to its influence on C5. The Thalamus is involved in this pathway, and is believed to have a gating or filtering effect. Data flows from the Thalamus to C4, to C2/3, to C5 and then to a different Thalamic nucleus that serves as the input gateway to another cortical Column in a different region of the Cortex. We propose that the Feed-Forward Indirect pathway is a major component of the subjective system.
    Figure 5:  The inhibitory micro-circuit, which we suggest makes the subjective system subjective! The red highlight shows how the Thalamus controls activity in C5 by activating inhibitory cells in C4. The circuit is completed by C5 pyramidal cells driving C6 cells that modulate the activity of the same Thalamic nuclei that selectively activate C5.

    The subjective system primarily comprises C5 (where subjective states are represented) and the Thalamus (which controls subjectivity), but it draws input from the objective system via C2/3. The latter provides context and defines the role and scope (within the hierarchy) of C5 cells in a particular column. Between each cortical region (and therefore every hierarchy level), input to the subjective system is filtered by the Thalamus (figure 5). This implements the selection process. The Feed-Forward Indirect pathway includes these Thalamo-Cortical loops.

    We suggest the Thalamus implements selection within C5 using special cells in C4 that are activated by axons (outputs) from the Thalamus (see figure 6). These inhibitory C4 cells target C5 pyramidal cells and inhibit them from becoming active. Therefore, thalamic axons are both informative (“this selection has been made”) and executive (the axon drives inhibition of selected C5 pyramidal cells).

    Figure 6: Thalamocortical axons (afferents) are shown driving inhibitory cells in C4 (leftmost green cell) that in turn inhibit pyramidal cells in C5 (red). They also provide information about these selections to other layers, including C2/3. When a selection has been made, it becomes objective rather than subjective, hence provision of a copy to C2/3. Image source.


    Note that selection may be a process of selective dis-inhibition rather than direct control: Selection alone may not be enough to activate the C5 cells.  Instead, C5 pyramidal cells likely require both selection by the Thalamus, and feed-forward activation via input from C2/3. The feed-forward activation could occur anywhere within a window of time in which the C5 cell is “selected”.  This would relax timing requirements on the selection task, making control easier; you only need to ensure that the desired C5 cell is disinhibited when the right contextual information arrives from other sources (such as C2/3). This also ensures C5 cell activation fits into its expected sequence of events and doesn’t occur without the right prior context.
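
    The following toy sketch shows how such a selection window could relax the timing requirements: the Thalamus only has to disinhibit a C5 cell at some point before the matching C2/3 context arrives. The window length and the boolean “context match” are illustrative assumptions, not claims about the biology.

        class C5Cell:
            """Toy C5 cell: full activation needs BOTH recent thalamic disinhibition
            ('selected') and matching feed-forward context from C2/3 ('predicted').
            The window length is an illustrative assumption."""

            def __init__(self, window=5):
                self.window = window
                self.selected_at = None

            def disinhibit(self, t):
                self.selected_at = t   # the Thalamus releases inhibition at time t

            def step(self, t, context_match):
                selected = (self.selected_at is not None and
                            0 <= t - self.selected_at <= self.window)
                return selected and context_match

        cell = C5Cell()
        cell.disinhibit(t=0)
        print(cell.step(t=3, context_match=True))   # True: context arrived within the window
        print(cell.step(t=9, context_match=True))   # False: the selection has lapsed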

    C5 also benefits from informational feedback from higher regions and neighbouring cells that help to define unique contexts for the activation of each cell.

    We suggest that C5 pyramidal cells are similar to C2/3 pyramidal cells but with some differences in the way the cells become active. Whereas C2/3 cells require both matching input via the apical dendrites and valid historical input to the basal dendrites to become active, C5 cells additionally need to be disinhibited for full activation to occur.

    As mentioned in the previous article, output from C5 cells sometimes drives motors very directly, so full activation of C5 cells may immediately result in physical actions. We can consider C5 to be the “output” layer of the cortex. This makes sense if the representation within C5 includes selected future states.

    Management of C5 activity will require a lot of inhibition; we would expect most of the input connections to C5 to be inhibitory because in every context, for every potential outcome, there are many alternative outcomes that must be inhibited (ignored). At any given time, only a sparse set of C5 cells would be fully active, but many more would be potentially-active (available for selection).

    Given predictive encoding and filtering inhibition, it would be common for few pyramidal cells to be active in a Column at any time. Separately, we would expect objective C2/3 pyramidal activity to be more consistent and repeatable than subjective C5 pyramidal activity, given a constant external stimulus.

    Executive System

    So far we have defined a mechanism for generating a hierarchical representation and a mechanism for selectively filtering activity within that representation. In our original conceptual look at general intelligence, we also desired that filtering predictions would be equivalent to action selection. But if we have selected predictions of future actions at various levels of abstraction within the hierarchy, how can we make these abstract prediction-actions actually happen?

    The purpose of the executive system is to execute hierarchical plans reliably. As previously discussed, this is no trivial matter due to problems such as vanishing agency at higher hierarchy levels. If a potential future outcome represented within the subjective system is selected for action, the job of the executive system is to make it occur.

    We know that we want abstract concepts at high levels within the hierarchy to be faithfully translated into their equivalent patterns of activity at lower levels. Moving towards more concrete forms would result in increasing activity as the incremental dimensionality reduction of the feed-forward hierarchy is reversed.

    Figure 7: Differences in dominant direction of data flow between objective and executive systems. Whereas the Objective system builds increasingly abstract concepts of greater breadth, the Executive system is concerned with decomposing these concepts into their many constituent parts, so that hierarchically-represented plans can be executed.

    We also know that we need to actively prioritize execution of a high level plan over local prediction / action candidates in lower levels. So, we are looking for a cascade of activity from higher hierarchy levels to lower ones.

    Figure 8: One of two Feed-Back direct pathways. This pathway may well be involved in cascading control activity down the hierarchy towards sensors and motors. Activity propagates from C6 to C6 directly; C6 modulates the activity of local C5 cells and relevant Thalamic nuclei that activate local C5 cells by selective disinhibition in conjunction with matching contextual information from C2/3.

    It turns out that such a system does exist: The feed-back direct pathway from C6 to C6. Cortex layer 6 is directly connected to Cortex layer 6 in the hierarchy levels immediately below. What’s more, these connections are direct, i.e. unfiltered (which is necessary to avoid the vanishing agency problem). Note that C5 (the subjective system) is still the output of the Cortex, particularly in motor areas. C6 must modulate the activity of cells in C5, biasing C5 to particular predictions (selections) and thereby implementing a cascading abstract plan. Finally, C6 also modulates the activity of Thalamic nuclei that are responsible for disinhibiting local C5 cells. This is obviously necessary to ensure that the Thalamus doesn’t override or interfere with the execution of a cascading plan already selected at a higher level of abstraction.

    Our theory is that ideally, all selections originate centrally (e.g. in the Thalamus). When C5 cells are disinhibited and then become predicted, an associated set of local C6 cells is triggered to make these C5 predictions become reality.

    These C6 cells have a number of modulatory outputs to achieve this goal, which the following sections will describe.

    Executive Training

    No, this is not a personal development course for CEOs. This section checks whether C6 cells can learn to replay specific action sequences via C5 activity. This is an essential feature of our interpretation, because only C6 cells participate in a direct, modulatory feedback pathway.

    We propose that C6 pyramidal neurons are taught by historical activity in the subjective system. Patterns of subjective activity become available as “stored procedures” (sequences of disinhibition and excitatory outputs) within C6.

    Let’s start by assuming that C6 pyramidal cells have similar functionality to C2/3 and C5 pyramidal cells, due to their common morphology. Assume that C5 cells in motor areas are direct outputs, and when active will cause the agent to take actions without any further opportunity for suppression or inhibition (see previous article).

    In other cortical areas, we assume that the role of C5 cells is to trigger more abstract “plans” that will be incrementally translated into activity in motor areas, and therefore will also become actions performed by the agent.

    To hierarchically compose more abstract action sequences from simpler ones, we need activity of an abstract C5 cell to trigger a sequence of activity in more concrete C5 cells. C6 cells will be responsible for linking these C5 cells. So, activating a C6 cell should trigger a replay of a sequence of C5 cell activity in a lower Column. How can C6 cells learn which sequences to trigger, and how can these sequences be interpreted correctly by C6 cells in higher hierarchy levels?

    C6 pyramidal cells are mostly oriented with their dendrites pointing towards the more superficial cortex layers C1,…,C5 and their axons emerging from the opposite end. Activity from C5 to C6 is transferred via axons from C5 synapsing with dendrites from C6. Given a particular model of pyramidal cell learning rules, C6 pyramidal cells will come to recognize patterns of simultaneous C5 activity in a specific sequential context, and C6 interneurons will ensure that unique sets of C6 pyramidal cells respond in each context.

    So how will these C6 cells learn to trigger sequences of C5 cells? We know that the axons of C6 cells bend around and reach up into C5, down to the Thalamus and directly to hierarchically-lower C6 cells. At all targets they can be excitatory or inhibitory.

    All we need beyond this is for C6 axons to seek out axon target cells that become active immediately after the originating C6 cell is stimulated by active C5 cells. This will cause each C6 cell to trigger the C5 and C6 cells that are observed to be activated afterwards. Note that we require the C6 cells themselves to be organised into sequences (technically, a graph of transitions).
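
    A minimal sketch of this idea, with each learned transition standing in for one C6 cell: triggering the first element replays the whole stored chain. The string identifiers and dictionary encoding are illustrative only.

        def learn_transitions(c5_history):
            """Toy C6 learning: record which C5 cell became active immediately
            after each other one (one 'C6 cell' per observed transition)."""
            return {a: b for a, b in zip(c5_history, c5_history[1:])}

        def replay(transitions, trigger, max_steps=10):
            """Triggering one C6 cell replays the stored chain of C5 activations."""
            chain, cell = [trigger], trigger
            while cell in transitions and len(chain) < max_steps:
                cell = transitions[cell]
                chain.append(cell)
            return chain

        grasp = learn_transitions(["reach", "open_hand", "close_hand", "lift"])
        print(replay(grasp, "reach"))   # ['reach', 'open_hand', 'close_hand', 'lift']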

    Target seeking by axons is known as “Axon Guidance” and C6 pyramidal cells’ axons do seem to target electrically active cells by ceasing growth when activity is detected. We have not yet found biological evidence for the predicted timing.

    There is also evidence that C6 axons can target C4 inhibitory cells and Thalamic cells, which again is compatible with our interpretation, as long as they are cells that become active after the originating C6 cell. If we want to “replay” some activity that followed a particular C6 cell, then all the cells described above should be excited or inhibited to ensure that the same events occur again. Activating a C6 cell directly should reproduce the same outcome as incidental activation of the C6 cell via C5 – a chain of sequential inhibition and promotion will result. Note that the same learning rule could work to discover all the axon targets mentioned.

    Collectively, the C6 cells within a Column will become a repertoire of “stored procedures” that can be triggered and replayed by a cascade of activity from higher in the hierarchy or by direct selection via C5. C6 cells would behave the same way whether activated by local C5 cells, or by C6 cells in the hierarchy level above. This allows cascading, incremental execution of hierarchical plans.

    C6 cells do not need to replace sequences of C5 cell activity with a single C6 cell (i.e. label replacement for symbolic encoding), but they do need to collectively encode transitions between chains of C5 cells, individually trigger at least one C5 cell, and collectively allow a single C6 cell to trigger a sequence of C6 cells in both the current and lower hierarchy regions.

    C6 interneurons can resolve conflicts when multiple C6 triggers coincide within a column. We can expect C6 interneurons to inhibit competing C6 pyramidal cells until the winners are found, resulting in a locally consistent plan of action.

    As with layers C2/3 and C5, C6 inhibitory interneurons will also support training C6 pyramidal cells for collective coverage of the space of observed inputs, in this case from C5 and C2/3.

    Bootstrapping

    Now we are only left with a bootstrapping problem: How can the system develop itself? Specifically, how do the sequences of C5 activity come to be defined so that they can be learned by C6?

    We suggest that conscious choice of behaviour via the Thalamus is used to build the hierarchical repertoire from simple primitive actions to increasingly sophisticated sequences of fine control. Initially, thalamic filtering of C5 state would be used to control motor outputs directly, without the involvement of C6. Deliberate practice and repetition would provide the training for C6 cells to learn to encode particular sequences of behaviour, making them part of the repertoire available to C6 cells in hierarchically “higher” Columns.

    Initially, concentration is needed to perform actions via direct C5 selections; these activities need to be carefully centrally coordinated using selective attention. However, when C6 has learnt to encode these sequences, they become both more reliable and require less effort to execute, requiring only a trigger to one C6 cell.

    After training, only minimal thalamic interventions are needed to execute complex sequences of behaviour learned by C6 cells. Innovation can continue by combining procedures encoded by C6 with interventions via the Thalamus, which can still excite or inhibit C5 cells. In most cases, however, C6 training is accelerated by the independence of Columns: when a C6 cell learns to control other cells within the Column, this learning remains valid no matter how many higher hierarchy levels are placed on top. By analogy, once you’ve learned to drink from a cup, you don’t need to relearn that skill to drink in restaurants, at home, at work etc.
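
    We can caricature this bootstrapping process in a few lines: a sequence that initially requires central selection of every step is “compiled”, after enough deliberate repetitions, into a single-trigger procedure. The repetition threshold is an arbitrary illustrative parameter.

        from collections import Counter

        class SkillCompiler:
            """Toy bootstrapping: a sequence executed step-by-step under central
            (thalamic) control is 'compiled' into a single-trigger C6 procedure
            after enough deliberate repetitions (threshold is illustrative)."""

            def __init__(self, repetitions_needed=3):
                self.counts = Counter()
                self.procedures = set()
                self.repetitions_needed = repetitions_needed

            def execute(self, steps):
                key = tuple(steps)
                if key in self.procedures:
                    return "one C6 trigger"               # reliable, low effort
                self.counts[key] += 1                     # effortful central selection
                if self.counts[key] >= self.repetitions_needed:
                    self.procedures.add(key)              # now part of the repertoire
                return "central selection of every step"

        skill = SkillCompiler()
        for _ in range(4):
            print(skill.execute(["reach", "grasp", "lift"]))
        # the first three runs are effortful; the fourth is a single trigger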

    As C6 learns and starts to play a role in the actions and internal state of the agent, it becomes important to provide the state of C6 to the objective and subjective systems as contextual input.

    Axons from C6 to other, hierarchically lower Columns take two paths: To C6, and to C1. We propose that the copy provided to C1 is used as informational feedback in C2/3 and C5 pyramidal cells (these axons synapse with Pyramidal cell Apical dendrites). We suggest the copy to C6 allows C6 cells to execute plans hierarchically, by delegating execution to a number of more concrete C6 cells. Therefore, the feedback direct pathway from C6 to C6 is part of the executive system. These axons should synapse on cell bodies, or nearby, to inhibit or trigger C6 activation artificially (rather than via C5).

    Interpretation of the Thalamus

    Rather than merely a relay, we propose that the Thalamus is better understood as a control centre. Its job is to centrally control cortical activity in C5 (the subjective system). Abstract activity in C5 is propagated down the hierarchy by C6, and translated into its concrete component states, eventually resulting in specific motor actions. Therefore, via this feedback pathway the filtering performed by the Thalamus assumes an executive role also.

    We believe that filtering predictions of oneself performing an action or experiencing a reward is the mechanism by which objectives and plans are selected. We believe there is only one representation of the world in our heads. There is no separate “goal-oriented” or “action-based” representation. This means that filtering predictions is the mechanism of behaviour generation. Note that in a hierarchical system, you can simultaneously select novel combinations of predictions to achieve innovation without changing the hierarchical model.

    Our interpretation of the Thalamus depends on some theoretical assumptions about how general intelligence works. Crucially, we believe there is no difference between selective awareness of externally-caused and self-generated events, except some of the latter have agency in the real world via the agent’s actions. This means that selective attention and action selection can both be consequences of the same subjective modelling process.

    But where does selection actually occur?

    For a number of practical reasons, action and attentional selection should be centralized functions. For one thing, the reward criteria for selecting actions are of much smaller dimension than the cortical representations – for example, the set of possible pain sensations is far more limited than the set of potential external causes of pain. We essentially need to compare the rewards of all potential actions against each other, rather than against an absolute scale.

    It is also important that conflicts between items competing for attention or execution are resolved so that incompatible plans are replaced by a single clear choice. Conflict resolution is difficult to do in a highly parallel & distributed system; instead, it is preferable to force all alternatives to compete against each other until a few clear winners are found.

    Finally, once an action or attentional target is selected, it should be maintained for a long period (if still relevant), to avoid vacillation. (See Scholarpedia for a good introduction to the difficulties of conflict resolution and the importance of sticking to a decision for long enough to evaluate it).
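
    These three requirements (relative comparison, conflict resolution, and persistence) can be caricatured in a toy selector: rewards are only ever compared against each other, and the current choice is kept unless a competitor beats it by a clear margin. The margin value is an illustrative assumption.

        def select_action(candidates, current=None, switch_margin=0.2):
            """Toy central selector: rewards are compared against each other, not
            an absolute scale, and the current choice is kept unless a competitor
            beats it by a clear margin (hysteresis against vacillation)."""
            best = max(candidates, key=candidates.get)
            if current in candidates and candidates[best] - candidates[current] < switch_margin:
                return current   # stick with the earlier decision
            return best

        choice = select_action({"eat_ice_cream": 0.9, "keep_walking": 0.8})
        choice = select_action({"eat_ice_cream": 0.75, "keep_walking": 0.8}, current=choice)
        print(choice)   # still 'eat_ice_cream': a 0.05 advantage is below the margin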

    We believe the Thalamus plays this role via its interactions with the Cortex. It interacts with the Cortex in two ways. First, the Thalamus selectively dis-inhibits particular C5 cells, allowing them to become active when the right circumstances are later observed objectively (i.e. via C2/3, which is not subjective).

    Second, the Thalamus must also co-operate with the Feed-Back cascade via C6.  While the Thalamus generates new selections by controlling C5, it must also permit the execution of existing, more abstract Thalamic selections by allowing cascading feedback activity to override local selections. Together, these mechanisms ensure that execution of abstract plans is as easily accomplished as simpler, concrete actions.

    Interpretation of the Basal Ganglia

    The Basal Ganglia are involved in so many distinct functions that they can’t be fully described within this article. They consist of a set of discrete structures located adjacent to the Thalamus.

    In our model, selection is implemented by the Thalamus manipulating the subjective system within the Cortex. We propose that the selections themselves are generated by the Basal Ganglia, which then controls the behaviour of the Thalamus.

    Crucially, we believe the Striatum within the Basal Ganglia uses reward values (such as pleasure and pain) to make adaptive selections. In other words, the Basal Ganglia are responsible for picking good actions, biasing the entire Thalamo-Cortical system towards futures that are expected to be more pleasant for the agent.

    However, to make adaptive choices it is necessary to have accurate context and predictions (candidate actions). The hierarchical model defined within the Cortex is an efficient and powerful source for this data, and in fact, this pathway (Cortex → Basal Ganglia → Thalamus → Cortex) does exist within the brain (see figure 9 below).

    Thanks to studies of relevant disorders such as Parkinson’s and Huntington’s, it is known that this pathway is associated with behaviour initiation and selection based on adaptive criteria.

    Figure 9: Pathways forming a circuit from Cortex to Basal Ganglia to Thalamus and back to Cortex. Image source.

    Lifecycle of an idea

    Using our interpretation of biological general intelligence, we can follow the lifecycle of an idea from conception to execution. Let’s walk through the theorized response to a stimulus, resulting in an action.

    Although the brain is operating constantly and asynchronously, we can define the start of our idea as some sensory data that arrives at the visual cortex. In this example, it’s an image of an ice-cream in a shop.

    Objective Modelling

    Sensor data propagates unfiltered up the Feed-Forward Direct pathway, activating cells in C4 and C2/3 in numerous cortical areas as it is transformed into its hierarchical form. The visual stimuli become a rich network of associated concepts, including predictions of near-future outcomes, such as experiencing the taste of ice-cream. These concepts represent an objective external reality and are now active and available for attention.

    Subjective Prediction

    Activity within the Objective system triggers activity in the Subjective system. Some C5 cells become “predicted”, but are inhibited by the Thalamus. These cells represent potential future actions and outcomes. Things that, from experience, we know are likely to occur after the current situation.

    The Cortex projects data from C2/3 to the Striatum where it is weighted according to reward criteria. A strong response to the flavour of the frozen treat percolates through the Basal Ganglia and manipulates the activity of the Thalamus.

    Between the Thalamus and the Cortex, an iterative negotiation takes place resulting in the selection (via dis-inhibition) of some C5 cells. The Basal Ganglia have learned which manipulations of the Thalamus maximize the expected Reward given the current input from Cortex.

    The way that the Thalamus stimulates particular C5 cells is somewhat indirect. The path of activity to “select” C5 cells in hierarchy level n is C5[n-1] → Thalamus → C4[n] → C5[n]. The signal is re-interpreted at each stage of this pipeline – that is, connections do not carry a specific meaning from point to point. Therefore, you can’t just adjust one “wire” to trigger a particular C5 cell. Rather, you must adjust the inhibition of input to many C4 → C5 cells until you’ve achieved the conditions to “select” a target C5 cell. Many target C5 cells might be simultaneously selected.

    In addition to requiring disinhibition, C5 cells also wait for specific patterns of cell activity in C2/3 prior to becoming “predicted”. This means that it’s very difficult to select a C5 cell that is not “predicted”; it simply doesn’t have the support to out-compete its neighbours in the column and become “selected”. This prevents unrealistic outcomes being “selected”, or output commencing, before the right circumstances have arrived to match the expectation.

    Eventually, a subset of C5 cells become “predicted” and “selected”, representing a subjective model of potential futures for the agent in the world. In this case, the anticipated future involves eating ice-cream.
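
    In toy form, the outcome of this negotiation is an intersection: only outcomes that are both disinhibited (chosen) and predicted (realistic) can become active. The candidate names below are ours, for illustration.

        def activate_c5(candidates, disinhibited, predicted):
            """A C5 cell becomes fully active only when the Thalamus has
            disinhibited it AND objective C2/3 context has put it into the
            'predicted' state, so unrealistic selections never activate."""
            return [c for c in candidates if c in disinhibited and c in predicted]

        candidates   = ["taste_ice_cream", "win_lottery_today"]
        disinhibited = {"taste_ice_cream", "win_lottery_today"}   # Basal Ganglia's picks
        predicted    = {"taste_ice_cream"}                        # supported by C2/3 evidence
        print(activate_c5(candidates, disinhibited, predicted))   # ['taste_ice_cream']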

    Execution

    When C5 cells become active, they in turn drive C6 pyramidal cells that are responsible for causing the future represented by “contextual, selected & predicted” C5 cells. In this case, C6 cells are charged with executing the high-level plan to “buy some ice-cream and eat it”.

    The plan is embodied by many C5 cells, distributed throughout the hierarchy; each represents a subset of the “qualia” relating to the eating of ice-cream. C6 cells begin to interpret these C5 cells into concrete actions, via the C6-C6 Feed-Back Direct pathway. Crucially, they no longer require the Thalamus to modulate the input that makes C5 cells “selected”. Instead, C6 cells stimulate C5 and C6 cells in hierarchically-lower Columns directly, moving them to “selected” status and allowing them to become active as soon as the corresponding Feed-Forward evidence arrives to match.

    C6 cells also modulate relay cells in the Thalamus, guiding the Thalamus to disinhibit C5 cells in lower hierarchy regions. This helps to ensure the parts of the decomposed plan are executed as intended. In turn, these newly selected “lower” C5 cells drive associated C6 cells, and the plan cascades down the hierarchy.

    Note that the plan is also flowing in the “forward” direction, as it incrementally becomes reality rather than expectation. As motor actions take place, they are sensed and signalled through the Feed-Forward pathways. When C5 cells become “selected”, this information becomes available to higher columns in the hierarchy, if not filtered. This also helps the Feed-Forward Indirect pathway and C6 cells to keep track of activity and execute the plan in a coordinated manner.

    At the lowest levels of the hierarchy, the plan becomes a sequence of motor activity, which is activated by C5 cells directly, and also by other brain components that are not covered by our general intelligence model.

    A few moments later, the ice-cream is enjoyed, triggering a release of Dopamine into the Striatum and reinforcing the rewards associated with recent active Cortical input. Delicious!

    Summary

    In the previous articles we explored the characteristics of a general intelligence and looked at some of the features we expected it to have. In part 2 and part 3 we reviewed some relevant computational neuroscience research. In this article we’ve described our interpretation of this background material.

    We presented a model of general intelligence built from 3 interacting systems – Objective, Subjective and Executive. We described how these systems could learn and bootstrap via interaction with the world, and how they could be implemented by the anatomy of the brain. As an example, we traced an experience from sensation, through planning and to execution.

    Let’s assume that our understanding of biology is approximately correct. We can use this as inspiration to build an artificial general intelligence with a similar architecture and test whether the systems behave as described in these articles.

    The next article in this series will look specifically at how these concepts could be implemented in software, resulting in a system that behaves much like the one described here.

    action selection/AGI/Architecture/Artificial General Intelligence/columns/Consciousness/cortex/Memory-Prediction Framework/Neocortex/pyramidal cell/Thalamocortical/Thalamus

    How to build a General Intelligence: Circuits and Pathways

    Posted by ProjectAGI on
    Figure 1: Our headline image is from the Cognitive Consilience: An atlas of key pathways cross-referenced to supporting literature articles. The complexity and variety of routing within the brain can be appreciated with this beautiful illustration. Note in particular the specialisation of cortical cells and the way this affects their interactions with other cells in the cortex and elsewhere in the brain. Explore this fantastic resource yourself.

    By David Rawlinson and Gideon Kowadlo

    This is part 3 of our series “how to build an artificial general intelligence” (AGI). Part 1 was a theoretical look at General Intelligence (follow the link if you don’t know what General Intelligence is).

    We believe that the Thalamo-Cortical system is the origin of General Intelligence in people. In Part 2 we presented very broadly how the Thalamo-Cortical system is structured and organised. We applied some core concepts, such as hierarchy, to help us describe the system.

    We also looked at the cellular structure of the Cortex and in particular introduced Pyramidal cells.

    This article is again about what we can learn from reverse-engineering the Thalamo-Cortical system, but this time from its connectivity, which we present in terms of circuits and pathways.

    Pathways and Circuits

    A pathway is a gross pattern of sequential connectivity between brain regions – for example, if part A is highly connected to part B, and activity in A is followed by activity in B, we say there exists a pathway between A and B. Cells in the Thalamo-Cortical system are connected to each other in quite restricted and specific ways, so these pathways are quite informative.

    Circuits are more specific and precise details of both connectivity and functional interaction between neurons. In computational neuroscience there exists a concept called the Canonical Cortical Micro-Circuit. The specifics of this circuit are not widely agreed upon, because (a) the Cortex is complex and (b) many of the evidence-gathering exercises are statistical observations (e.g. “X% of outputs from A and Y% of outputs from B project to region C”) which may obscure fundamental functional or topological features. For example, outputs from A and B may project to cells with exclusive roles that happen to be physically co-located in C. Statistical, regional approaches will not capture such distinctions.

    In the neuroscience literature there’s a frustrating habit of selectively reporting supporting details while ignoring others. Perhaps this is simply because it’s impossible to describe any part exhaustively. In particular, there is a lot of contradictory information about Cortical circuits. But the research can still shed some light on what is happening. Just don’t expect all sources to be consistent or complete!

    Key Cortical Pathways

    There are several widely-cited and well established cortical pathways (i.e. routes with at least one end in the Cortex). To understand these, it is important to remember both the physical and logical structure of the Cortex as described in the previous article. Physically, the cortex is made of layers, and logically, Columns within the Cortex form a hierarchy.

    The hierarchy defines a structure made of Columns, and determines which Columns interact. Pathways describe the patterns of interaction between cells within a Column, and between Columns. We assume that all Columns are functionally identical prior to training.

    Cells within Columns are usually identified by both the physical location of cell bodies within particular Layers in the Column, and by the morphology (shape) of the cell. Data flow to and from Cortical cells is largely restricted to a handful of core pathways that begin and terminate in particular cell types in specific cortical layers.

    There are many descriptions of cortical pathways and circuits in the literature. We will first introduce just 4 well-established cortical-cortical pathways, and then some thalamo-cortical pathways. Note that although the existence of these pathways is unambiguous, their purpose and function is poorly understood. They appear to be consistent across various somatosensory regions of the cortex, especially in comparison to variations in other brain tissues.

    Hawkins’ Hierarchical Temporal Memory (HTM) introduces 3 of the 4 pathways in a single, coherent scheme and relates them to a general intelligence algorithm. We will borrow this terminology and describe them in detail below. Their names describe the direction of data flow and the routing used:

    Feed-Forward Direct Pathway: C2/3 → C4 → C2/3
    Feed-Forward Indirect Pathway: C5 → Thalamus → C4 → C2/3
    Feed-Back Direct Pathway #1: from C6 → C1

    We are also interested in a second Feed-Back Direct “pathway”, implemented by cortically projecting C6 pyramidal cells whose axons terminate in both C6 and C1 in hierarchically lower regions.

    Feed-Back Direct Pathway #2: from C6 → C6

    Note that cells in all cortical layers (except, perhaps, C4) receive input via their dendrites in C1. In other words, feedback from C6 to C1 is then used as input to many layers. Feedback from C6 to C6 is generally not input for other layers.

    In neuroscience, Feed-Forward usually means the flow of data away from external sources such as sensors (towards greater abstraction, if you believe in a cortical hierarchy). Feed-Back means the opposite – data flow towards regions that have direct interaction with external sensors and motors.

    Direct pathways are so-called because data is routed directly from one cortical column or region to another, without a stop along the way. Indirect pathways are routed via other structures. The “Feed-Forward Indirect” pathway described by Hawkins is routed via the Thalamus.

    Figure 2, derived from a Hawkins/Numenta publication, shows graphically how information flows between columns and between layers within columns, as part of these 3 pathways according to the HTM theory. As mentioned before, the community is welcome to contribute by updating and adding to the figure.

    Hawkins assigns specific roles to these pathways, but we will be re-interpreting them in the next article.

    Figure 2: Routing of 3 core pathways, based on a diagram from the HTM/CLA White Paper. Note the involvement of specific cortical layers with each pathway, and the central role of the Thalamus. The names of the pathways indicate direct (cortex-to-cortex) and indirect (cortex-thalamus-cortex) variants, with direction being either forward (away from external sensors and motors, towards increasing abstraction) or backward (towards more concrete regions dealing with specific sensor/motor input). 

    The role of the Thalamus

    Let’s recap: The Cortex is composed of Columns, organised into a hierarchy. Cells pass messages directly to other Columns that are higher or lower in the hierarchy. Messages may also be transmitted indirectly between Columns, via the Thalamus.

    The Thalamus is often viewed as having a gating or relaying function. The Thalamus is particularly associated with control of attention.

    This section will describe indirect pathways involving the Thalamus. Figure 3 is a reproduction of a figure from Sherman and Guillery (2006) that has two new features of interest. These authors use the terminology “first order” to denote cortical regions receiving direct sensor input and “higher order” to denote cortical regions receiving input from “first order” cortical regions. This corresponds with the notion of hierarchy levels 1 and 2.

    The Thalamus is a significant part of the “Feed-Forward Indirect” pathway. This pathway originates at Cortex layer 5 and propagates to a nucleus in the Thalamus. There, the nucleus may react by transmitting a (presumably corresponding) signal to one or more other Cortical Columns, in a different region. In some theories of cortical function, the target Column is conceptually “higher” in the hierarchy. The Thalamic input enters the Cortex via Thalamic axons terminating in Cortex layer 4 and is then propagated to Cortex Layer 5 where the pathway begins again.

    Figure 3 also shows that cells in Cortex layer 6 form reciprocal modulatory connections to the same Thalamic nuclei that provide input to the Column via C4 and C5! Therefore, a Column within the Cortex has influence on the data that it receives from the Thalamus. In effect, the Cortex is not a passive recipient, but works with the Thalamus to control its own input. The figure also depicts C6 cells projecting to C6 in lower regions (our second feedback pathway).

    Figure 3: Pathways between cortical columns in different regions, showing layer involvement in each pathway and the role of the Thalamus. Sherman and Guillery use the terminology “first order” to denote cortical regions receiving direct sensor input and “higher order” to denote cortical regions receiving input from lower (e.g. “first order”) cortical regions. This corresponds with the notion of hierarchy levels 1 and 2. Note that in addition to the 3 pathways shown in the previous figure, we see additional direct feedback pathways and reciprocal feedback from Cortex layer 6 to the Thalamic nuclei that stimulate the cortical region. Image source.

    Motor output

    At this point it is interesting to look at how the Cortex can influence or control behaviour, particularly the generation of motor output. There are two pathways by which it can do so:

    Cortical Control: Basal Ganglia → Thalamus → Cortex → Motors
    Cortical Influence: Cortex → Basal Ganglia → Motors

    Note that in both cases, the origin of action selection is the Basal Ganglia. In the first case, the Basal Ganglia control signals emitted by the Thalamus, with these signals in turn affecting activity within Cortex layer 5 (C5). C5, particularly in motor areas, has been studied in detail. 10-15% of the cells in these areas are very large pyramidal neurons known as Betz cells, which can be observed to drive muscles very directly, with few synapses in between. These cells are more prevalent in primates and are especially important for control of the hands. This makes sense given that manual tasks are typically more complex and require greater dexterity than movements by other parts of the body. The human Cortex is believed to be crucial for innovative and sophisticated manual tasks such as tool-making.

    Within the Cortical layers, C5 seems to be uniquely involved in motor output. Figure 4 shows some of the ways pyramidal cells in C5 project output to areas of the brain associated with motor output and control. In contrast, pyramidal cells in C2/3 predominantly project to other areas of the cortex and are not directly involved in control.

    Figure 4: Pyramidal cells in C5 project output to areas of the brain associated with motor output and control. In contrast, pyramidal cells in C2/3 predominantly project to other areas of the cortex and are not directly involved in control. Image source.

    The second way that the Cortex can influence motor output is via the Basal Ganglia. In this case, we propose that the Cortex might provide contextual information to assist the Basal Ganglia in its direct control outputs, but we found no evidence that the Cortex is able to exert control over the Basal Ganglia.

    We suggest Cortical influence over the Basal Ganglia is less interesting from a General Intelligence perspective, because the hierarchical representations formed within the Cortex are not exploited, and execution is performed by more ancient brain systems not associated with General Intelligence qualities.

    For the rest of this article series, we will ignore control pathways that do not involve the Cortex, and will focus on direct control output from Cortex layer 5.

    Action Selection

    It is widely believed that action selection occurs within the flow of information from Cortex through the Basal Ganglia, a group of deep, centralised brain structures adjacent to the Thalamus. There are a number of theories about how this occurs, but it is generally believed to involve a form of Reinforcement Learning used to select ideas from the options presented by the Cortex, with competitive mechanisms for clean switching and conflict resolution.

    A major output of the Basal Ganglia is to the Thalamus; one prevailing theory of this relationship is that the Basal Ganglia controls the gating or filtering function performed by the Thalamus, effectively manipulating the state of the Cortex in consequence. The full loop then becomes Cortex → Basal Ganglia → Thalamus → Cortex (see Wikipedia for a good illustration, or figure 5).

    As discussed above, this article will focus on motor output generated directly by the Cortex.

    Figure 5: Pathways forming a circuit from Cortex to Basal Ganglia to Thalamus and back to Cortex. Image Source.

    Canonical Cortical Circuit

    We now have all the background information needed to define a “Canonical Cortical micro-Circuit” at a cellular level. All the information presented so far has been relatively uncontroversial, but this circuit is definitely our interpretation, not an established fact. However, we will present some evidence to (inconclusively) support our interpretation.

    Figure 6: Our interpretation of the canonical cortical micro-circuit. Only a single cortical region or Column is shown. Arrow endings indicate the type of connection – driver, modulator or inhibitor. The numbers 2/3, 4, 5, and 6 refer to specific cortical layers. Each shape represents a set of cells of a particular type, not an individual cell. Self-connections and connections within each set are not shown, but often exist. Shapes T and B refer to Thalamus and Basal Ganglia, not broken down into specific cell layers or types. Data enters the diagram at 4 points, labelled A-D, but does not exit; in general the system forms a circuit not a linear path. Note that shape T occurs twice, because the circuit receives data from only one part of the Thalamus but projects to two areas in forward and backward directions.

    Diagram Explanation

    We will use variants of the diagram shown in figure 6 to explain our interpretation of cortical function. In this diagram, only a single Cortical region or Column (used interchangeably here) is shown. In later diagrams, we will show 3 hierarchy levels together so the flow of information between hierarchy levels is apparent.

    In these diagrams, shapes represent a class of Neurons within a specific Cortical Layer. The numbers 2/3, 4, 5 and 6 refer to the Cortical layers in which these cell classes occur. The shapes labelled T and B refer to the Thalamus and Basal Ganglia (internal cell types and layers are not shown). Arrows on the diagram show the effect of each connection, either driving (providing information or input that causes another cell to become active), modulation (stimulating or inhibiting the activity of a target cell) or inhibition (exclusively inhibiting the activity of a target cell).
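
    For readers who prefer code to diagrams, the connections described above can also be written down as a typed edge list. This is a partial, illustrative encoding of figure 6 (it omits feedback via C1 and the C6 → C6 pathway, among other things); the names are ours.

        from enum import Enum

        class Effect(Enum):
            DRIVER = "driver"         # provides input that activates the target
            MODULATOR = "modulator"   # stimulates or inhibits target activity
            INHIBITOR = "inhibitor"   # exclusively inhibits the target

        # One Column of the micro-circuit as (source, target, effect) edges.
        CANONICAL_CIRCUIT = [
            ("Thalamus", "C4",       Effect.DRIVER),
            ("Thalamus", "C4-PV",    Effect.DRIVER),      # inhibitory PV cells in C4
            ("C4-PV",    "C5",       Effect.INHIBITOR),
            ("C4",       "C2/3",     Effect.DRIVER),
            ("C2/3",     "C5",       Effect.DRIVER),
            ("C5",       "C6",       Effect.DRIVER),
            ("C6",       "Thalamus", Effect.MODULATOR),   # reciprocal, same nuclei
        ]

        def inputs_to(layer):
            return [(src, eff.value) for src, tgt, eff in CANONICAL_CIRCUIT if tgt == layer]

        print(inputs_to("C5"))   # [('C4-PV', 'inhibitor'), ('C2/3', 'driver')]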

    If you want more detail on the thalamic end of the thalamocortical circuitry, an excellent source is this paper by Sherman.

    There are many interneurons (described in the previous article) that are not shown in this diagram. We chose to omit these because we believe they are integral to the function of a layer of pyramidal cells within a Column, rather than an independent system. Specifically, we suggest that inhibitory interneurons implement local self-organising and local competitive functions (e.g. winner-take-all), ensuring sparse activation of the cell types represented by shapes in our diagram (C2/3, C4, C5, and C6). The self-organising behaviour also ensures that cells within each column optimise coverage of observed input patterns given a finite cell population. Inclusion of the interneurons would clutter the diagram without adding much explanatory value.
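
    As a toy illustration of the competitive function we attribute to these interneurons, a winner-take-all step keeps only the most strongly driven cells in a layer and silences the rest, yielding the sparse activation we assume throughout. The sparsity parameter k is illustrative.

        import numpy as np

        def winner_take_all(potentials, k=3):
            """Toy interneuron competition: keep the k most strongly driven cells
            in a layer and silence the rest, producing a sparse activation."""
            potentials = np.asarray(potentials, dtype=float)
            winners = np.argsort(potentials)[-k:]   # indices of the top-k cells
            sparse = np.zeros_like(potentials)
            sparse[winners] = potentials[winners]
            return sparse

        print(winner_take_all([0.1, 0.9, 0.3, 0.7, 0.2], k=2))   # [0.  0.9 0.  0.7 0. ]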

    We also omit self-connections within a class of cells represented by a shape. These self-connections likely provide context and contribute to learning and exclusive activity within the class, but don’t make it easier to understand circuits in terms of cortical layers and hierarchy levels.

    Excitatory Circuit

    Figure 7 shows a multilevel version of the cortical circuit, similar to the multi-level figure from Sherman and Guillery (figure 3). We can now understand where the inputs to the circuit come from, in terms of other layers and external Sensors (S) and Motors (M). Note that Motors are driven directly from C5.

    Figure 7: The cortical micro-circuit across several levels of Cortex with involvement of Thalamus and Basal Ganglia. The red highlight shows a single excitatory ‘circuit’. See text for details.

    The red path in figure 7 shows our excitatory “canonical circuit”: Data flows from the Thalamus to spiny stellate (star-shaped in the figures) cells in C4, from where it propagates to pyramidal cells in C2/3, and then to pyramidal cells in C5. C6 is known as the multiform layer, but also contains many pyramidal cells of unusual proportions and orientations. C6 cells are driven by C5, and in turn modulate the Thalamus. Note that C6 cells within a region modulate the same Thalamic nuclei that provide input to that region of Cortex.

    Inhibitory Circuit

    A second, inhibitory circuit exists alongside our excitatory circuit. In addition to providing input to the Cortex via C4, axons from the Thalamus also drive inhibitory Parvalbumin-expressing (PV) neurons in C4 (shown as circles in the diagram). These inhibitory neurons make up a large fraction of all the cells in C4, and inhibit pyramidal cells in C5.

    This means that the input from the Thalamus can be both informative and executive. It is executive in that it actually manipulates the activity of layer 5 within the Cortex, and informative by providing a copy of the signal driving the manipulation to C4. Figure 8 shows our inhibitory circuit. We believe this circuit is of critical importance because it provides a mechanism for the Thalamus to centrally manipulate the state of the Cortex, specifically layer 5 and 6 pyramidal cells. This hypothesis will be expanded in the next article.

    Figure 9 catalogues inhibitory cells, notably showing the cells used in our inhibitory circuit.

    Figure 8: The inhibitory micro-circuit. The red highlight shows how the Thalamus controls activity in C5 within a Column by activating inhibitory cells in C4. The circuit is completed by C5 pyramidal cells driving C6 cells, which in turn modulate the activity of the same Thalamic nuclei that selectively activates C5. Each shape denotes a population of cells of a specific type within a single Column, excluding ‘T’ and ‘B’ that refer to the Thalamus and Basal Ganglia respectively.
    Figure 9: Inhibitory interneurons in the Cortex. Of particular interest are the “PV” cells that are driven by axons from the Thalamus terminating in layer 4 and in turn inhibit pyramidal cells in layer 5. Image source

    Pathways and the Canonical Circuits

    Now let’s look at how pathways emerge from our cortical micro-circuit. Figures 10, 11 and 12 show the Feed-Forward Direct, Feed-Forward Indirect and first Feed-Back pathways respectively. We also include another direct, Feed-Back pathway terminating at C6 (figure 13). Feed-back direct pathways terminating at C1, where many fibres are intermingled, are harder to interpret than feedback terminating directly at C6, because pyramidal neurons from many layers have dendrites in C1.

    Figure 10: Feed-Forward Direct pathway within our canonical cortical micro-circuit.

    Figure 10 highlights the Feed-Forward direct pathway. Signals propagate from C4 to C2/3 and then to C4 in a higher Column. This pattern is repeated up the hierarchy. This pathway is not filtered by the Thalamus or any other central structure. Although activity from C2/3 propagates to C5, it does not ascend the hierarchy via this route: C5 in one Column does not directly connect to C5 in a higher Column, only via an indirect pathway (see below).

    Figure 11: Feed-Forward Indirect pathway.

    Figure 11 highlights the Feed-Forward Indirect pathway. The Thalamus is involved in this pathway, and may have a gating or filtering effect. Data flows from the Thalamus to C4, to C2/3, to C5 and then to a different Thalamic nucleus that serves as the input gateway to another cortical Column in a different region of the Cortex.

    Figure 12: The first of two Feed-Back Direct pathways.

    Figure 12 highlights the first type of Feed-Back Direct pathway. This pathway may be more concerned with provision of broader and more abstract (i.e. hierarchically higher) contextual information to be used in the Feed-Forward pathways for better prediction. This suggestion is supported by evidence that axons from C6 via C1 synapse with apical dendrites of pyramidal cells in C2/3, C5 and C6, in hierarchically lower regions.

    Figure 13 highlights the second of two Feed-Back Direct pathways. This pathway might be involved in cascading control activity down the hierarchy towards sensors and motors – the next article will expand on this idea. Activity propagates from C6 to C6 directly. C6 modulates the activity of local C5 cells and relevant Thalamic nuclei that drive local C5 cells. Note that connections from a Column to the Thalamus are reciprocal; feedback from C6 to the Thalamus targets the same nuclei that project axons to C4.

    Figure 13: The second of two Feed-Back Direct pathways.

    Summary

    We’ve presented some additional, detailed perspectives on the organisation and function of circuits and pathways within the Thalamo-Cortical system, and offered our interpretation of the canonical cortical micro-circuit.

    So what’s the point of all this information? What do these circuits and pathways do, and why are they connected this way? How do they work?

    It might seem that we’ve stopped short of really trying to interpret all this information, and that’s because we are, indeed, holding back. Having spent so much time presenting background information, we will use the next article to finally attempt to understand why the thalamocortical system is connected in the ways described here, and how this system might give rise to general intelligence.

    AGI/Algorithm/Architecture/Artificial General Intelligence/Hierarchical Generative Models/hierarchy/invariances

    How to build a General Intelligence: What we think we already know

    Posted by ProjectAGI on
    Authors: D Rawlinson and G Kowadlo

    This is the first of three articles detailing our latest thinking on general intelligence: A one-size-fits-all algorithm that, like people, is able to learn how to function effectively in almost any environment. This differs from most Artificial Intelligence (AI), which is designed by people for a specific purpose. This article will set out assumptions, principles, insights and design guidelines based on what we think we already know about general intelligence. It turns out that we can describe general intelligence in some detail, although not enough detail to actually build it…yet.

    The second article will look at how these ideas fit existing computational neuroscience, which helps to refine and filter the design; and the third article will describe a (high-level) algorithm that is, at least, not contradictory to the design goals and biology already established.

    As usual, our plans have got ahead of implementation, so code will follow a few weeks (or months…) after the end of the series.

    FIGURE 1: A hierarchy of units. Although units start out identically, they become differentiated as they learn from their unique input. The input to a unit depends on its position within the hierarchy and the state of the units connected to it. The hierarchy is conceptualized as having levels; the lowest levels are connected to sensors and motors. Higher levels are separated from sensors and motors by many intermediate units. The hierarchy may have a tree-like structure without cycles, but the number of units per level does not necessarily decrease as you move higher.

    Architecture of General Intelligence

    Let’s start with some fundamental assumptions and outline the structure of a system that has general intelligence characteristics.

    It Exists

    We assume there exists a “general intelligence algorithm” that is not irreducibly complex. That is, we don’t need to understand it in excruciating detail; instead, we can break it down into simpler models that we can easily understand in isolation. This is not necessarily a reasonable assumption, but there is evidence for it, as the following sections set out.

    Units

    A general intelligence algorithm can be described more simply as a collection of many simpler, functionally-identical units. Again, this is a big assumption, but it is supported by at least two pieces of evidence. First, it has often been observed that the human cortex has quite uniform structure across areas having greatly varying functional roles. Second, closer study of this structure has revealed that the cortex is made up of many smaller units (called columns, at one particular scale). It is reasonable to decompose the cortex in this way due to high and varied intra-column connectivity and limited variety of inter-column connectivity. The patterns of inter- and intra-column connectivity are very similar throughout the cortex. “Columns” contain only a few thousand neurons organized into layers and micro-columns that further simplify understanding of the structure. That’s not overwhelmingly complex, although we are making simplifying assumptions about neuron function.

    Hierarchy

    Our reading and experimentation has suggested that hierarchical representation is critical for the types of information processing involved in general intelligence. Hierarchies are built from many units connected together in levels. Typically, only the lowest level of the hierarchy receives external input; other levels receive input from lower levels of the hierarchy instead. For more background on hierarchies, see earlier posts. Hierarchy allows units in higher levels to model more complex and abstract features of input, despite the fixed complexity of each unit. Hierarchy also allows units to cover all available input data, and allows combinations of features to be jointly represented within a reasonable memory limit. It’s a crucial concept.
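    To make the idea concrete, here is a minimal Python sketch of a two-level hierarchy of functionally-identical units. Everything in it (the Unit class, the sizes, the tanh compression) is our own illustrative assumption rather than a proposed implementation; the point is only that fixed-complexity units, wired in levels, can jointly cover input that no single unit could.

        import numpy as np

        class Unit:
            """A hierarchy unit with fixed internal capacity (illustrative sketch)."""
            def __init__(self, input_size, output_size):
                self.weights = np.random.randn(output_size, input_size) * 0.01

            def process(self, x):
                # Compress input to a smaller, fixed-size output for the parent unit.
                return np.tanh(self.weights @ x)

        # A two-level hierarchy: two leaf units feed one parent unit.
        leaf_a, leaf_b = Unit(16, 4), Unit(16, 4)
        parent = Unit(8, 4)  # receives the concatenated outputs of both leaves

        sensor_a, sensor_b = np.random.rand(16), np.random.rand(16)
        abstract_state = parent.process(np.concatenate([leaf_a.process(sensor_a),
                                                        leaf_b.process(sensor_b)]))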

    Synchronization

    Do we need synchronization between units? Synchronization can simplify sequence modelling in a hierarchy by restricting the number of possible permutations of events. However, synchronization between units may significantly hinder fast execution on parallel computing hardware, so this question is important. A point of confusion may be the difference between synchronization and timing / clock signals. We can have synchronization without clocks, but in any case there is biological evidence of timing signals within the brain. Pathological conditions can arise without a sense of time. In conclusion we’re going to assume that units should be functionally asynchronous, but might make use of clock signals.

    Robustness

    Your brain doesn’t completely stop working if you damage it. Robustness is a characteristic of a distributed system and one we should hope to emulate. Robustness applies not just to internal damage but external changes (i.e. it doesn’t matter if your brain is wrong or the world has changed; either way you have to learn to cope).

    Scalability

    Adding more units should improve capability and performance. The algorithm must scale effectively without changes other than having more of the same units appended to the hierarchy. Note the specific criteria for how scalability is to be achieved (i.e. enlarge the hierarchy rather than enlarge the units). It is important to test for this feature to demonstrate the generality of the solution.

    Generality

    The same unit should work reasonably well for all types of input data, without preprocessing. Of course, tailored preprocessing could make it better, but it shouldn’t be essential.

    Local interpretation

    The unit must locally interpret all input. In real brains it isn’t plausible that neuron X evolved to target neuron Y precisely. Neurons develop dendrites and synapses with sources and targets that are carefully guided, but not to the extent of identifying specific cells amongst thousands of peers. Any algorithm that requires exact targeting or mapping of long-range connections is biologically implausible. Rather, units should locally select and interpret incoming signals using characteristics of the input. Since many AI methods require exact mapping between algorithm stages, this principle is actually quite discriminating.

    Cellular plausibility

    Similarly, we can validate designs by questioning whether they could develop by biologically plausible processes, such as cell migration or preferential affinity for specific signal coding or molecular markers. However, be aware that brain neurons rarely match the traditional integrate-and-fire model.

    Key Insights

    It’s surprising that in careers cumulatively spanning more than 25 years we (the authors) had very little idea how the methods we used everyday could lead to general intelligence. It is only in the last 5 years that we have begun to research the particular sub-disciplines of AI that may lead us in that direction.

    Today, those who have studied this area can talk in some detail about the nature of general intelligence without getting into specifics. Although we don’t yet have all the answers, the problem has become more approachable. For example, we’re really looking to understand a much simpler unit, not an entire brain holistically. Many complex systems can be easily understood when broken down in the right way, because we can selectively ignore detail that is irrelevant to the question at hand.

    From our experience, we’ve developed some insights we want to share. Many of these insights were already known, and we just needed to find the right terminology. By sharing this terminology we can help others to find the right research to read.

    We’re looking for a stackable building block, not the perfect monolith

    We must find a unit that can be assembled into an arbitrarily large – yet still functional – structure. In fact, a similar feature was instrumental in the success of “deep” learning: Networks could suddenly be built up to arbitrary depths. Building a stackable block is surprisingly hard and astonishingly important.

    We’re not looking to beat any specific benchmark

    … but if we could do reasonably well at a wide range of benchmarks, that would be exciting. This is why the DeepMind Atari demos are so exciting; the same algorithm could succeed in very different problems.

    Abstraction by accumulation of invariances

    This insight comes from Hawkins’ work on Hierarchical Temporal Memory. He proposes that abstraction towards symbolic representation comes about incrementally, rather than as a single mapping process. Concepts accumulate invariances – such as appearance from different angles – until labels can correctly be associated with them.  This neatly avoids the fearful “symbol grounding problem” from the early days of AI.

    Biased Prediction and Selective Attention are both action selection

    We believe that selective bias of predictions and expectations is responsible for both narrowing of the range of anticipated futures (selective ignorance of potential outcomes) and the mechanism by which motor actions are generated. A selective prediction of oneself performing an action is a great way to generate or “select” that action. Similarly, selective attention to external events affects the way data is perceived and in turn the way the agent will respond. Filtering data flow between hierarchy units implements both selective attention and action selection, if data flowing towards motors represents candidate futures including both self-actions and external consequences.
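    A toy sketch of this idea follows; the probabilities, rewards and bias strength are invented for illustration. Selecting an action is just sampling from a prediction of one’s own future that has been biased towards rewarding outcomes.

        import numpy as np

        rng = np.random.default_rng(0)

        # Hypothetical predicted probabilities of candidate futures (self-actions
        # plus external consequences), and the reward estimated for each.
        prediction = np.array([0.5, 0.3, 0.2])   # unbiased forecast
        reward     = np.array([0.0, 1.0, 0.2])   # learned value of each outcome

        # Bias the prediction towards rewarding outcomes, then renormalise.
        bias_strength = 2.0
        biased = prediction * np.exp(bias_strength * reward)
        biased /= biased.sum()

        # Sampling from the biased forecast both "predicts" and selects the action.
        action = rng.choice(len(biased), p=biased)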

    The importance of spatial structure in data

    As you will see in later parts of this article series, the spatial structure of input data is actually quite important when training our latest algorithms. This is not true of many algorithms, especially in Machine Learning where each input scalar is often treated as an independent dimension. Note that we now believe spatial structure is important both in raw input and in data communicated between units. We’re not simply saying that external data structure is important to the algorithm – we’re claiming that simulated spatial structure is actually an essential part of algorithms for dynamically dividing a pool of resources between hierarchy units.

    Binary data

    There’s a lot of simplification and assumption here, but we believe binary is the most useful format for input and internal data. In any case, the algorithms we’re finding most useful can’t easily be refactored for the obvious alternative (continuous input values). However, continuous input can be encoded, with some loss of precision, as subsets of bits. There is some evidence that this is biologically plausible, but it is not definitive. Why binary? Dimensionality reduction is an essential feature of a hierarchical model; it may be that sparse binary representations are simply a good compromise between data loss and qualities such as compositionality.
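    As an illustration of encoding continuous input as subsets of bits, here is a minimal sketch loosely in the style of HTM-like scalar encoders; the sizes and the windowing scheme are arbitrary choices of ours, not a specification.

        import numpy as np

        def encode_scalar(value, lo, hi, n_bits=64, n_active=8):
            """Encode a continuous value as a sparse binary vector (lossy).

            A contiguous run of n_active bits is placed according to where the
            value falls in [lo, hi]; nearby values share bits, so some similarity
            structure survives the quantisation.
            """
            bits = np.zeros(n_bits, dtype=np.uint8)
            fraction = (np.clip(value, lo, hi) - lo) / (hi - lo)
            start = int(fraction * (n_bits - n_active))
            bits[start:start + n_active] = 1
            return bits

        a, b = encode_scalar(0.50, 0.0, 1.0), encode_scalar(0.52, 0.0, 1.0)
        print((a & b).sum())  # overlapping bits: nearby values remain similar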

    Sparse, Distributed Representations

    We will be using Sparse, Distributed Representations (SDRs) to represent agent and world state. SDRs are binary data (i.e. all values are 1 or 0). SDRs are sparse, meaning that at any moment, only a fraction of the bits are 1’s (active). The most complex feature to grasp is that SDRs are distributed: No individual bit uniquely represents anything. Instead, data features are jointly represented by sets of bits. SDRs are overcomplete representations – not all bits in a feature-set are required to “detect” a feature, which means that degrees of similarity can be expressed as if the data were continuous. These characteristics also mean that SDRs are robust to noise – missing bits are unlikely to affect interpretation.
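    These properties are easy to demonstrate. In the sketch below (the bit count and sparsity are typical illustrative values, not prescriptions), a degraded copy of an SDR still overlaps its original far more than chance, so interpretation degrades gracefully.

        import numpy as np

        rng = np.random.default_rng(1)
        n_bits, n_active = 2048, 40           # ~2% sparsity, a common SDR regime

        def random_sdr():
            sdr = np.zeros(n_bits, dtype=np.uint8)
            sdr[rng.choice(n_bits, size=n_active, replace=False)] = 1
            return sdr

        feature = random_sdr()

        # Degrade the stored feature: drop a quarter of its active bits.
        noisy = feature.copy()
        active = np.flatnonzero(noisy)
        noisy[rng.choice(active, size=n_active // 4, replace=False)] = 0

        unrelated = random_sdr()
        print((feature & noisy).sum())      # high overlap: still recognisable
        print((feature & unrelated).sum())  # near zero: chance overlap is tiny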

    Predictive Coding

    SDRs are a specific form of Sparse (Population) Coding in which state is jointly represented by a set of active bits. Transforming data into a sparse representation is necessarily lossy and balances representational capacity against bit-density. The most promising sparse coding scheme we have identified is Predictive Coding (PC), in which internal state is represented by prediction errors. PC has the benefit that errors are propagated rather than hidden in local states, and data dimensionality automatically reduces in proportion to its predictability. Perfect prediction implies that data is fully understood, and produces no output. A specific description of PC has been given in several papers by Rao, Ballard et al. since about 1999, and a more general framework by Friston et al. The former is quite similar to the inter-region coding via temporal pooling described in the HTM Cortical Learning Algorithm.
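    Here is a deliberately minimal sketch of the core PC idea, under our own simplifying assumptions (a linear unit with a fixed learning rate): the unit transmits only its prediction error, and once the input is fully predicted its output falls to nothing.

        import numpy as np

        class PredictiveCodingUnit:
            """Minimal sketch: internal state is the prediction error (assumed form)."""
            def __init__(self, size, learning_rate=0.1):
                self.prediction = np.zeros(size)
                self.learning_rate = learning_rate

            def step(self, x):
                error = x - self.prediction          # only the surprise is encoded
                self.prediction += self.learning_rate * error
                return error                         # propagated to higher units

        unit = PredictiveCodingUnit(4)
        x = np.array([1.0, 0.0, 0.5, 0.0])
        for _ in range(50):
            out = unit.step(x)
        print(np.abs(out).max())  # ~0: a fully predicted input produces no output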

    Generative Models

    Training a system that produces SDRs typically yields a Generative Model of its input. This means that the system encodes observed data in such a way that it can generate novel instances of the kinds of data it has observed. In other words, the system can generate predictions of all inputs (with varying uncertainty) from an arbitrary internal state. This is a key prerequisite for a general intelligence that must simulate outcomes for planned, novel action combinations.

    Dimensionality Reduction

    In constructing models, we will be looking to extract stable features and in doing so reduce the complexity of input data. This is known as dimensionality reduction, for which we can use algorithms such as auto-encoders. To cope with the vast number of possible permutations and combinations of input, an incredibly efficient incremental process of compression is required. So how can we detect stable features within data?
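    As a concrete, if simplistic, stand-in for the auto-encoders mentioned above, the sketch below performs linear dimensionality reduction via SVD (which is what a linear autoencoder converges to). The data, the 4 hidden causes and all sizes are invented for illustration; real systems would be incremental and nonlinear.

        import numpy as np

        rng = np.random.default_rng(2)
        latent = rng.random((200, 4))            # 4 hidden causes ("stable features")
        X = latent @ rng.normal(0, 1, (4, 16))   # observed 16-dimensional input

        # Linear dimensionality reduction via SVD.
        mean = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        basis = Vt[:4]                           # top 4 principal directions

        code = (X - mean) @ basis.T              # 16 dims compressed to 4
        X_hat = code @ basis + mean              # reconstruction from the code
        print(np.mean((X - X_hat) ** 2))         # ~0: 4 dims suffice for this data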

    Unsupervised Learning

    By the definition of general intelligence, we can’t possibly hope to supply a tutor-algorithm that provides the optimum model update for every input presented. It’s also worth noting that internal representations of the world and agent should be formed without consideration of the utility of the representations – in other words, internal models should be formed for completeness, generality and accuracy rather than task-fulfilment. This allows less abstract representations to become part of more abstract, long-term plans, despite lacking immediate value. It requires that we use unsupervised learning to build internal representations.

    Hierarchical Planning & Execution

    We don’t want to have to model the world twice: Once for understanding what’s happening, and again for planning & control. The same model should be used for both. This means we have to do planning & action selection within the single hierarchical model used for perception. It also makes sense, given that the agent’s own actions will help to explain sensor input (for example, turning your head will alter the images received in a predictable way). As explained earlier, we can generate plans by simply biasing “predictions” of our own behaviour towards actions with rewarding outcomes.

    Reinforcement Learning

    In the context of an intelligent agent, it is generally impossible to discover the “correct” set of actions or output for any given situation. There are many alternatives of varying quality; we don’t even insist on the best action but expect the agent to usually pick rewarding actions. In these scenarios, we will require a Reinforcement Learning system to model the quality of the actions considered by the agent. Since there is value in exploration, we may also expect the agent to occasionally pick suboptimal strategies, to learn new information.
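    The sketch below shows one simple form this could take: incremental action-value learning with epsilon-greedy exploration. The environment, rewards and parameters are invented for illustration; we are not claiming this particular RL method is the one we will use.

        import numpy as np

        rng = np.random.default_rng(3)
        n_actions = 4
        Q = np.zeros(n_actions)        # learned quality of each candidate action
        lr, epsilon = 0.1, 0.1         # exploration keeps suboptimal picks occasional

        def true_reward(a):
            # Hypothetical environment: noisy reward, action 2 is best on average.
            return rng.normal([0.1, 0.3, 0.8, 0.2][a], 0.1)

        for _ in range(1000):
            if rng.random() < epsilon:
                a = rng.integers(n_actions)       # explore: learn new information
            else:
                a = int(np.argmax(Q))             # exploit: usually pick rewarding acts
            Q[a] += lr * (true_reward(a) - Q[a])  # incremental value update

        print(Q)  # Q approaches the mean rewards; argmax settles on action 2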

    Supervised Learning

    There is still a role for supervised learning within general intelligence. Specifically, during the execution of hierarchical control tasks we can describe both the ideal outcome and some metric describing the similarity of the actual outcome to the desired one. Supervised learning is ideal for discovering which actions have the agency to bring about desired results. Supervised Learning can tell us how best to execute a plan that was constructed in an Unsupervised Learning model and later selected by Reinforcement Learning.

    Challenges Anticipated 

    The features and constraints already identified mean that we can expect some specific difficulties when creating our general intelligence.

    Among other problems, we are particularly concerned about:

    1. Allocation of limited resources
    2. Signal dilution
    3. Detached circuits within the hierarchy
    4. Dilution of executive influence
    5. Conflict resolution
    6. Parameter selection
    Let’s elaborate:

    Allocation of limited resources

    This is an inherent problem when allocating a fixed pool of computational resources (such as memory) to a hierarchy of units. Often, resources per unit are fixed, ensuring that there are sufficient resources for the desired hierarchy structure. However, this is far less efficient than dynamically allocating resources to units to globally maximize performance. It also presupposes the ideal hierarchy structure is known, and not a function of the data. If the hierarchy structure is also dynamic, this becomes particularly difficult to manage because resources are being allocated at two scales simultaneously (resources → units and units → hierarchy structure), with constraints at both scales.

    In our research we will initially adopt a fixed resource quota per hierarchy unit and a fixed branching factor for the hierarchy, allowing the structure of the hierarchy, and hence the distribution of resources, to be determined by the data. This arrangement is the one most likely to work given a universal unit with constant parameters, as the number of inputs to each unit is constrained (due to the branching factor). It is interesting that the human cortex is a continuous sheet, and exhibits dynamic resource allocation in the form of neuroplasticity – resources can be dynamically reassigned to working areas and sensors when others fail.
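    A back-of-envelope sketch of this arrangement follows (the numbers are arbitrary): with a fixed quota per unit and a fixed branching factor, the shape and total resource cost of the hierarchy fall out of the amount of input data rather than being designed by hand.

        import math

        def hierarchy_shape(n_sensor_patches, branching_factor=4, quota_per_unit=1024):
            """Sketch: derive hierarchy structure from fixed per-unit constraints.

            Each unit gets the same memory quota and at most `branching_factor`
            children, so the shape of the hierarchy (levels, units, total memory)
            follows from the amount of input data.
            """
            levels, width, total_units = [], n_sensor_patches, 0
            while width > 1:
                levels.append(width)
                total_units += width
                width = math.ceil(width / branching_factor)
            levels.append(1)                    # a single root unit
            total_units += 1
            return levels, total_units * quota_per_unit

        print(hierarchy_shape(64))  # ([64, 16, 4, 1], 87040): structure follows data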

    Signal Dilution

    As data is transformed from raw input into a hierarchical model, information will be lost (not represented anywhere). This problem is certain to occur in all realistic tasks because input data will be modelled locally in each unit without global oversight over which data is useful. Given local resource constraints, this will be a lossy process. Moreover, we have also identified the need for units to identify patterns in the data and output a simplified signal for higher-order modelling by other units in the hierarchy (dimensionality reduction). Therefore, each unit will deliberately and necessarily lose data during these transformations. We will use techniques such as Predictive Coding to allow data that is not understood (i.e. not predictable) to flow through the system until it can be modelled accurately (predicted). However, it will still be important to characterise the failure modes in which important data is eliminated before it can be combined with other data that provides explanatory power.

    Detached circuits within the hierarchy

    Consider figure 2. Here we have a tree of hierarchy units. If the interactions between units are reciprocal (i.e. X outputs to Y and receives data from Y) there is a strong danger of small self-reinforcing circuits forming in the hierarchy. These feedback circuits exchange mutually complementary data between a pair or more units, causing them to ignore data from the rest of the hierarchy. In effect, the circuit becomes “detached” from the rest of the hierarchy. Since sensor data enters via leaf-units at the bottom of the hierarchy, everything above the detached circuit is also detached from the outside world and the system will cease to function satisfactorily.

    In any hierarchy with reciprocal connections, this problem is very likely to occur, and disastrous when it does. In Belief Propagation, an inference algorithm for graphical models, this problem manifests as “double counting” and is avoided by nodes carefully ignoring their own evidence when it is returned to them.
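    The Belief Propagation convention is worth sketching because it suggests a general remedy. In this toy Python fragment (node names and evidence values are invented), the message a node sends to a neighbour combines everything except what that neighbour itself sent, so no unit’s output can return as support for itself.

        class Node:
            def __init__(self, name, local_evidence):
                self.name, self.local_evidence = name, local_evidence

        def message(sender, recipient, inbox):
            # Combine local evidence with messages from every neighbour EXCEPT the
            # recipient: excluding its own returned evidence prevents double counting.
            value = sender.local_evidence
            for origin, incoming in inbox[sender.name].items():
                if origin != recipient.name:
                    value *= incoming
            return value

        x, y, z = Node('x', 0.9), Node('y', 0.5), Node('z', 0.2)
        inbox = {'x': {'y': 0.5, 'z': 0.2}}
        print(message(x, y, inbox))  # 0.18: y's previous message is left out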

    FIGURE 2: Detached circuits within the hierarchy. Units X and Y have formed a mutually reinforcing circuit that ignores all data from other parts of the hierarchy. By doing so, they have ceased to model the external world and have divided the hierarchy into separate components.

    Dilution of executive influence

    A generally-intelligent agent needs to have the ability to execute abstract, high-level plans as easily as primitive, immediate actions. As people we often conceive plans that may take minutes, hours, days or even longer to complete. How is execution of lengthy plans achieved in a hierarchical system?

    If abstract concepts exist only in higher levels of the hierarchy, they need to control large subtrees of the hierarchy over long periods of time to be successfully executed. However, if each hierarchy unit is independent; how is this control to be achieved? If higher units do not effectively subsume lower ones, executive influence will dilute as plans are incrementally re-interpreted from abstract to concrete (see figure 3). Ideally, abstract units will have quite specific control over concrete units. However, it is impractical for abstract units to have the complexity to “micro-manage” an entire tree of concrete units.

    FIGURE 3: Dilution of executive influence. A high-level unit within the hierarchy wishes to execute a plan; the plan must be translated towards the most concrete units to be performed. However, each translation and re-interpretation risks losing details of the original intent which cannot be fully represented in the lower levels. Somehow, executive influence must be maintained down through an arbitrarily deep hierarchy. 

    Let’s define “agency” as the ability to influence or control outcomes. Lacking the ability to cause a particular outcome is a lack of agency over the desired and actual outcomes. By making each hierarchy unit responsible for the execution of goals defined in the hierarchy level immediately above, we indirectly maximise the agency of more abstract units. Without this arrangement, more abstract units would have little or no agency at all.

    Figure 4 shows what happens when an abstract plan gets “lost in translation” to concrete form. Here is a personal example: I walked up to my car and pulled my keys from my pocket. The car key is on a ring with many others, but it’s much bigger and can’t be mistaken by touch; it can only be mistaken if you don’t attend to the differences.

    In this case, when I got to the car door I tried to unlock it with the house key! I only stopped when the key wouldn’t fit in the keyhole. Strangely, all low-level mechanical actions were performed skillfully, but high level knowledge (which key) was lost. Although the plan was put in motion, it was not successful in achieving the goal.

    Obviously this is just a hypothesis about why this type of error happens. What’s surprising is that it isn’t more common. Can you think of any examples?


    FIGURE 4: Abstract plan translation failure: Picking the wrong key but skilfully trying it in the lock. This may be an example of abstract plans being carried out, but losing relevant details while being transformed into concrete motor actions by a hierarchy of units.

    In our model, planning and action selection occur as biased prediction. There is an inherent conflict between accurate prediction and bias. Attempting to bias predictions of events beyond your control leads to unexpected failure, which is even worse than expected failure.

    The alternative is to predict accurately, but often the better outcome is the less likely one. There must be a mechanism to increase the probability of low-frequency events where the agent has agency over the real-world outcome.

    Where possible, lower units must separate learning to predict from using that learning to satisfy higher units’ objectives. Units should seek to maximise the probability of goal outcomes, given an accurate estimate of the state of the local unit as prior knowledge. But units should not become blind to objective reality in the process.

    Conflict resolution

    General intelligence must be able to function effectively in novel situations. Modelling and prediction must work in the first instance, without time for re-learning. This means that existing knowledge must be combined effectively to extrapolate to a novel situation.

    We also want the general intelligence to spontaneously create novel combinations of behaviour as a way to innovate and discover new ways to do things. Since we assume that behaviour is generated by filtering predictions, we are really saying we need to be able to predict (simulate) accurately when extrapolating combinations of existing models to new situations. So we also need conflict resolution for non-physical or non-action predictions. The agent needs a clear and decisive vision of the future, even when simulating outcomes it has never experienced.

    The downside of all this creativity is that there’s really no way to tell whether these combinations are valid. Often they will be, but not always. For example, you can’t touch two objects that are far apart at the same time. When incompatible, we need a way to resolve the conflict.

    There’s a good discussion of different conflict resolution strategies on Scholarpedia; our preferred technique is selecting a solitary active strategy in each hierarchy unit, choosing locally to optimise for a single objective when multiple are requested.
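    In its simplest form, that local rule is tiny. The sketch below (the objective names and priorities are invented) just commits each unit to the single highest-priority request instead of blending incompatible goals into an infeasible average.

        def resolve(requests):
            """Sketch: a unit keeps a single active strategy.

            `requests` maps candidate objectives (from parent units) to their
            priority; the unit commits to one winner rather than optimising
            for several incompatible objectives at once.
            """
            return max(requests, key=requests.get)

        print(resolve({'reach-left': 0.4, 'reach-right': 0.7}))  # 'reach-right'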

    Evaluating alternative plans is most easily accomplished as a centralised task – you have to bring all the potential alternatives together where they can be compared. This is because we can only assign relative rewards to each alternative; it is impossible to calculate meaningful absolute rewards for the experiences of an intelligent agent. It is also important to place all plans on a level playing-field regardless of the level of abstraction; therefore abstract plans should be competing against more concrete ones and vice-versa.

    Therefore, unlike most of the pieces we’ve described, action selection should be a centralised activity rather than a distributed one.

    Parameter Selection

    In a hierarchical system the input to “higher” units will be determined by modelling in “lower” units and interactions with the world. The agent-world system will develop in markedly different ways each time. It will take an unknown amount of time for stable modelling to emerge, first in the lower units and then moving higher in the hierarchy.

    As a result of all these factors it will be very difficult to pick suitable values for time-constants and other parameters that control the learning processes in each unit, due to compounded uncertainty about lower units’ input. Instead, we must allow recent input to each unit to determine suitable values for parameters. This is online learning. Some parameters cannot be automatically adjusted in response to data. For these, to have any hope of debugging a general intelligence, a fixed parameter configuration must work for all units in all circumstances. This constraint will limit the use of some existing algorithms.
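    As one example of what online parameter adaptation could look like (the update rule and rate are our illustrative choices, not a specification), a unit can normalise its input using running statistics, so a single fixed configuration serves units at every level of the hierarchy.

        import numpy as np

        class OnlineNormaliser:
            """Sketch: let recent input set a unit's operating parameters.

            Instead of hand-picking scale constants per unit, each unit tracks
            running statistics of its own input and normalises accordingly, so
            a single fixed configuration (here, just `rate`) can serve all units.
            """
            def __init__(self, rate=0.01):
                self.rate, self.mean, self.var = rate, 0.0, 1.0

            def step(self, x):
                self.mean += self.rate * (x - self.mean)
                self.var  += self.rate * ((x - self.mean) ** 2 - self.var)
                return (x - self.mean) / np.sqrt(self.var + 1e-8)

        norm = OnlineNormaliser()
        for x in np.random.default_rng(4).normal(50.0, 5.0, 5000):
            y = norm.step(x)
        print(round(norm.mean, 1))  # ~50: the parameter was learned from the data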

    Summary

    That wraps up our theoretical overview of what we think a general intelligence algorithm must look like. The next article in this series will explain what we’ve learnt from biology’s implementation of general intelligence – ourselves! The final article will describe how we hope to build an algorithm that satisfies all these requirements.

    Artificial General Intelligence/CLA/Cortical Learning Algorithm/Grossberg/Neocortex/Sherman/Thalamocortical/Thalamus

    Thalamocortical architecture

    Posted by ProjectAGI on
    by Gideon Kowadlo and David Rawlinson


    Introduction

    One of the keys to understanding the neocortex as a whole, and the emergence of intelligence, is to understand how the cortical hierarchical levels interconnect. This includes:

    • the physical connections,
    • the meaning of the signals being transmitted,
    • and possibly also the way the signal is encoded.
    Physical connections: These refer to gross patterns of neuron routing throughout the brain, known collectively as the connectome. Below is an image from the Human Connectome Project that beautifully illustrates many connections, including thalamocortical ones.
    Figure 1. Courtesy of the Laboratory of Neuro Imaging and Martinos Center for Biomedical Imaging, Consortium of the Human Connectome Project – www.humanconnectomeproject.org

    Meaning of signals: One classification that can be applied to thalamocortical neurons is drivers versus modulators. A driver can be thought of as a neuron that carries information, whereas a modulator alters the transmission of information in a driver. They have different functional and anatomical properties, as nicely described in (Sherman and Guillery 2011). If a neuron is a driver, what information does it encode? If it is a modulator, is it inhibitory or excitatory, and what effect does this have?

    Signal Encoding: Signal encoding refers to the details of how the information is represented. This includes timing and amplitude information. The way the signal is encoded in the neurons may have a bearing on the properties of the system. Specific information has been added to the diagram where this looks relevant.

    Our aim is to build AI with the general intelligence characteristic of biological organisms such as primates, so we draw inspiration and insight from these working examples. Understanding the biology gives us the best insight into how to do that. However, what level of abstraction do we need to capture the essential qualities?

    • at the lowest level: molecular structure, interactions and neurotransmitters,
    • above that, firing patterns and newly discovered molecular machinery (that excitingly shows this is more complex and interesting than previously thought – see paper and work by Seth Grant),
    • higher still, the brain as a set of modules that interact with each other,
    • or multi scale simulation of the whole brain (see the Human Brain Project).

    For simplicity, we want to understand it at the highest level that is still capable of capturing the essential qualities, and drill down where necessary. Are factors such as the way the signal is encoded important, then? Not in and of themselves, but they may have a bearing on emergent qualities that are significant.

    In order to understand the above, including drawing conclusions about the appropriate level of abstraction, we’ve elaborated on a figure first published in the CLA White Paper that was included in a previous post (in the section ‘Regions’). In that article, we started to explore these topics in the context of Numenta‘s work. The figure shows the thalamo-cortical connections to specific cortical layers and is very useful for exploring the concepts described above. Here, we will expand on that figure, shown below. We will go over a first version now, and plan to develop it further in future posts. Each of the initial annotations is explained in the sub-sections below.

    Figure 2. Thalamocortical architecture including cortical layers and connections between hierarchy levels. This figure is an annotated version of a figure from the ‘CLA White Paper‘. Some information is added from the text of that document. Other sources used are Sherman and Guillery 2011, Grossberg 2007, Sherman 2006 and Sherman 2007.

    We invite the community to make use of and contribute to this annotated diagram. The diagram is publicly available in a universal vector graphics format called SVG. Being vector based, it is easily modifiable. SVG is a common format, which many graphics packages are capable of editing.

    The file is available from a git repository hosted on github called cortico-thalamic-circuit. Anyone can download, clone, make a pull request or fork the repository.

    Pull requests allow you to make modifications and then give them back to the shared repository so that they are available to everyone. This is the action to take if you share our purpose for the diagram – staying as high level as possible, filling in details where they contribute to a holistic view or emergent properties of the thalamocortical architecture. Forking allows you to create a new repository that diverges from the main one. Use this option if you’d like to use the diagram for a different purpose, such as documentation of all the neurotransmitters in the different pathways.

    The first set of diagram additions are described below.

    Diagram Additions

    Cortico-Cortical Feedback

    The illustrated feedback between levels from layer 6 in Level (n+1) to layer 1 in Level (n) is described briefly in the CLA white paper. We have included an additional illustration from Grossberg 2007 (see figure 3 below) that shows in more detail how internal neural circuitry completes the intra-cortical, inter-level feedback loop from:

    H[n+1]:C[6] → H[n]:d[1]C[5]
    H[n]:d[1]C[5] → H[n]:C[6]
    H[n]:C[6] → H[n]:C[4]

    Note: The connections above are described in a notation we have adopted for succinctly describing cortical neural pathways. Refer to our post for more details.

    Figure 3. Inter-level feedback loop, reproduced from Grossberg 2007. The circles and triangles are neuron bodies, with varying shape depicting different neuron types. Two hierarchy levels are shown (V1,V2 from the visual cortex). Each hierarchy level has 6 cortical layers (numbered 1 to 6 where relevant). You can see that feedback from V2 affects activation of neurons in V1 layer 4.
    The feedforward/feedback architecture gives rise to at least three important qualities, the first of which has been explored in the MPF literature. They are described below, reproduced from Grossberg 2007:

    1. the developmental and learning processes whereby the cortex shapes its circuits to match environmental constraints in a stable way through time; 
    2. the binding process whereby cortex groups distributed data into coherent object representations that remain sensitive to analog properties of the environment; and 
    3. the attentional process whereby cortex selectively processes important events. 

    We may elaborate with a summary of Grossberg 2007 in a future post.

    Gating by the Thalamus

    Our main references for this section are Sherman 2006 and Sherman 2007.

    We’ve seen that the thalamus acts as a relay for information passing up the hierarchy between cortical levels, which we’re referring to as the feedforward indirect pathway (FF Indirect). It has been postulated that via this gating, the thalamus plays an important role in attention.

    What inputs and computations determine that gating? This is one of the questions we are attempting to learn more about, and so we have explored the inputs to the gating.

    Cortical feedback

    One of the significant inputs is feedback (FB) from Layer 6 in the level above. That is to say, the gating from Level (n) to Level (n+1) is modulated by FB from Layer 6 in Level (n+1).

    Thalamic feedback and TRN

    There is a substructure of the Thalamus called the Thalamic Reticular Nucleus (TRN) that receives cortical and thalamic excitatory input, and sends inhibitory inputs to the relay cells of the thalamus.

    The relay cells also receive inhibitory input from other Thalamic cells, labelled interneurons. Thalamic interneurons receive input from the very same relay cells, from layer 6 of the cortex, and from the brainstem.

    These circuits between the TRN, the BRF (brainstem reticular formation, described below) and the thalamus are complex. They are simplified in the figure below, which appears in Sherman 2006 (Scholarpedia on the Thalamus), a version of which is found in Sherman 2007.

    Figure 4. “Schematic diagram of circuitry for the lateral geniculate nucleus. The inputs to relay cells are shown along with the relevant neurotransmitters and postsynaptic receptors (ionotropic and metabotropic) Abbreviations: LGN, lateral geniculate nucleus; BRF, brainstem reticular formation; TRN, thalamic reticular nucleus.” Caption and figure reproduced from Sherman 2006.
    We are currently representing this complexity as a black box (as shown in the diagram) that receives input from the Thalamus, BRF and cortex, and inhibits the relay cells. The purpose and transfer function require analysis and exploration. It may be necessary to model the complexity explained above, or some simpler equivalent may provide the necessary functionality.

    BRF

    The BRF is the Brainstem Reticular Formation, which, as the name suggests, is a part of the brainstem. It has a number of functions that could be very important for attention and the general functioning of the cortex, and therefore we have included it and its connections to the Thalamus. Some of these functions include:

    1. Somatic motor control
    2. Cardiovascular control
    3. Pain modulation
    4. Sleep and consciousness
    5. Habituation

    The Wikipedia page for the BRF gives a very good summary.

    Modulation Signal Characteristics

    It is interesting to note that the firing mechanism for the BRF and Layer 6 modulation of the Thalamic relay is Burst Mode rather than the more common Tonic Mode. Tonic firing has a frequency that is proportional to the ‘activation’ of a neuron. The frequency can be interpreted as the “strength” of the signal. Some have interpreted it in the past as a probability or confidence value. For Burst Mode firing, after a ‘silent’ period, the initial firing pattern is a burst of activity. This “results in a very different message relayed to cortex, depending on the recent voltage history of the relay cell” (Sherman 2006). It is thought that this acts as a ‘wake up call’ to the cortex when there has been some external change. We plan to speculate and elaborate further on possible purposes of this in the future.

    Timing Information

    The CLA White Paper makes mention of timing information being fed back from the thalamus to layer 5 via layer 1. This has been added to the diagram for visibility. It is thought to be important for prediction of the next state at the appropriate time.

    Other Factors

    There are a number of other significant brain components that may substantially affect the operation of the neocortex. Based on the literature, the most significant of these is probably the Basal Ganglia, which forms circuits with the Thalamus and Cortex. Other interesting and possibly important components are Betz cells, which directly drive muscles from the cortex.

    Conclusion

    This post was a first attempt to create an enhanced diagram of cortical layers and thalamocortical connectivity in the context of MPF/HTM/CLA theory. We’ll continue to elaborate on this in future posts.

    Artificial General Intelligence/Memes/Natural Selection/Singularity/Theory

    Constraints on intelligence

    Posted by Gideon Kowadlo on
    by Gideon Kowadlo and David Rawlinson

    Introduction

    This article contains some musings on the factors that limit the growth of our intelligence as a species.

    We speculate that ultimately, our level of intelligence is limited by at least two factors, and possibly a third:

    1. our own cultural development,
    2. physical constraints, and
    3. an intelligence threshold.

    We’ll now explore each of these factors.

    Cultural Development

    Natural Selection

    Most readers are familiar with Natural Selection. The best known and dominant mechanism is that fitter biological organisms in a population tend to survive longer, reproduce more frequently and successfully, and pass on their traits to the next generation. Given some form of external pressure and therefore competition, such as resource constraints, the species on average is likely to increase in fitness. In competition with other species, this is necessary for species survival.

    Although this is the mechanism we are focusing on in this post, there are other important forms of selection. Two examples are ‘Group Selection’ and ‘Sexual Selection’. Group selection favours traits that benefit the group over the individual, such as altruism, especially when the group shares common genes. Sexual selection favours traits that improve an individual’s success in reproducing by two means: being attractive to the other gender, and the ability to compete with rivals of the same gender. Sometimes sexually appealing characteristics are highly costly or risky to individuals, for example by making them vulnerable to predators.

    Culture

    Another influence on ability to survive is culture. Humans have developed culture, and some form of culture is widely believed to exist in other species such as primates and birds (e.g. Science). Richard Dawkins introduced the concept of memes, cultural entities that evolve in a way that is analogous to genes. The word meme now conjures up funny pictures of cats (see Wired magazine’s article on the re-appropriation of the word meme), and no-one is complaining about that, but it’s hard to argue that these make us fitter as a species. However, it’s clear that cultural evolution, by way of technological progress, can have a significant influence. This could be negative, but is generally a positive, making us more likely to survive as a species.

    Culture and Biology

    The graph below explores a thought experiment: the shape of survivability over time due to natural selection and due to cultural development, and the relationship between the two.

    Figure 1: A thought experiment: The shape of survivability vs time, due to cultural evolution, and due to natural selection. The total survivability is the sum of the two. Survivability due to natural selection plateaus when it is surpassed by survivability due to cultural evolution. Survivability due to cultural evolution plateaus when cultural development allows almost everyone in the population to survive.

    For humans, the main biological factor contributing to survival is our intellect. The graph shows how our ability to survive steadily improves with time as we evolve naturally. The choice of linear growth is based on the fact that the ‘force’ for genetic change does not increase or decrease as that change occurs*. On the other hand, it is thought that cultural evolution improves our survivability exponentially. In recent years, this has been argued by well known authors and thinkers such as Ray Kurzweil and Eliezer S. Yudkowsky in the context of the Technological Singularity. We build on knowledge continuously, and leverage our technological advances. This enables us to make ever larger steps, as each generation exploits the work of the preceding generations. As Isaac Newton wrote, “If I have seen further it is by standing on the shoulders of giants” **. Many predict that this will result in the ability to create machines that surpass human intelligence. The point at which this occurs is known as the aforementioned Technological Singularity.

    Cultural Development – Altruism

    Additionally, cultural evolution could include the development of humanitarian and altruistic ideals and behaviour: an environment in which communities care for all their people, which would increase the survivability of (almost) everyone to the threshold of reproduction – leaving only a varied ability to prosper beyond survival. This is shown in the figure above as a plateau in survivability due to cultural evolution.

    Cultural Development – Technology

    Cultural factors dominate once survivability due to cultural evolution and technological development surpasses that due to natural selection. For example, the advantages given by use of a bow and arrow for hunting will reduce the competitive advantage of becoming a faster runner. Having a supermarket at the end of your street will render faster running insignificant. The species would no longer evolve biologically through the same process of natural selection. Other forces may still cause biological evolution in extreme cases, such as resistance to new diseases, but this is unlikely to drive the majority of further change. This means that biological evolution of our species would stagnate***. This effect is shown in the graph as the plateau in survivability due to natural selection.

    * On a fine scale, this would not be linear and would be affected by many unpredictable factors, such as climate changes and other environmental instability, as well as the successes and failures of other species.

    ** Although this metaphor was first recorded in the twelfth century and has been attributed to Bernard of Chartres.

    *** Interestingly, removal of selective pressure does not allow species to rest at a given level of fitness. Deleterious mutations rapidly accumulate within the population, giving us a short window of opportunity to learn to control and improve our own genetic heritage.

    Physical Constraints

    One current perspective in neuroscience, and the basis for our work and this blog, is that much of our intelligence emerges from, very simply put, a hierarchical assembly of regions of identical computational units (cortical columns). As explained in previous posts (here and here), this is physically structured as a sheet of cortex whose regions form connections with one another. The connected regions are conceptually at different levels in the hierarchy, and the connections themselves form the bulk of the cortex. We believe that with an increasingly deep hierarchy, the brain is able to represent increasingly abstract and general spatiotemporal concepts, which would play a significant role in increasing intelligence.

    The reasoning above predicts that the number of neurons and connections is correlated with intelligence. These neurons and connections have mass and volume and require a blood supply. They cannot increase indefinitely.

    Simply increasing the size of the skull has its drawbacks. Maintaining a stable temperature becomes more difficult, and structural strength is sacrificed. The body would need to become disproportionately large to carry the extra mass, making the animal less mobile and increasing its energy demands. Longer neuronal connections lead to slower signal propagation, which could also have a negative impact. Evidence of the consequences of such physical constraints is found in the fact that the brain folds in on itself, appearing wrinkled, in order to maximise surface area (and hence the number of neurons and connections) within the given volume of the skull. Evolution has produced a tradeoff between these characteristics that limits our intelligence to promote survival.

    It is possible to imagine completely different architectures that might circumvent these limitations. Perhaps a neural network distributed throughout the body, such as exists for some marine creatures. However, it is implausible that physical constraints would not ultimately be a limiting factor. Also, reality is more constrained than our imagination. For example, it must be physically and biologically possible for the organism to develop from a single cell to a neonate, and on to a reproducing adult.

    An Intelligence Threshold

    There could be a point at which the species crosses an intelligence threshold, beyond which higher intelligence does not confer a greater ability to survive. However, since the threshold may be dictated by cultural evolution it is very difficult to separate the two. For example, the threshold might be very low in an altruistic world, and it is possible to envision a hyper-competitive and adversarial culture in which the opposite is true.

    But perhaps a threshold exists as a result of a fundamental quality of intelligence, completely independent of culture. Could it be that once you can grasp concepts at a sufficient level of abstraction, and have the ability to externalise and record concepts with written symbols (thereby extending the hierarchy outside of the physical brain), it would be possible to conduct any ‘thought’ computation, given enough working memory, concentration and time? Similarly, a Turing Machine is capable of carrying out any computation, given infinite memory.

    The topic of consciousness and its definition is beyond the scope of this post. However, accepting that there appears to be a clear relationship between intelligence and what most people understand as consciousness, this ‘Intelligence Threshold’ has implications for consciousness itself. It is interesting to ponder the threshold as having a corresponding crossing point in terms of conscious experience.

    We may explore the existence and nature of this potential threshold in greater detail in the future.

    Impact of Artificial General Intelligence (AGI)

    The biological limitations to intelligence discussed in this article show why Artificial General Intelligence (AGI) will be such a dramatic development. We still exist in a physical world (at least perceptibly), but building an agent out of silicon (or other materials in the future) will effectively free us from all of these constraints. It will also allow us to modify parameters and architecture, and to monitor activity. It will be possible to invest large quantities of energy into ‘thinking’ in a mind that does not fatigue. Perhaps this is a key enabling technology on the path to the Singularity.