Category Archives

4 Articles

Artificial General Intelligence/Theory

Attention in Artificial Intelligence systems

Posted by Yi-Ling Hwong on

One of the features of our brain is its modularity. It is characterised by distinct but interacting subsystems that underlie key functions such as memory, language, perceptions, etc. Understanding the complex interplay between these modules requires decomposing them into a set of components. Indeed, this modular approach to modeling the brain is one of the ways neuroscientists are studying the brain. The ‘module’ that I am going to be talking about in this blogpost has its roots in neuroscience but has greatly inspired AI research: attention. I will focus on the neuroscience aspect of attention first before moving on to review some of the most exciting developments using attentional mechanisms in machine learning.

The neuroscientific roots of attention

Neuroscientists have long studied attention as an important cognitive process. It is described as the ability of organisms to ‘select a subset of available information upon which to focus for enhanced processing and integration’ and encompasses three aspects: orienting, filtering and searching. Visual attention, for example, is an active area of research. Our ability to focus on specific area of a visual scene and extract and process the information that is streamed to our brain is thought to be an evolutionary trait that all but guaranteed the survival of our species. This capability to select, process and act upon sensory experience has inspired a whole branch of research in computational modelling of visual attention.

Visual attention (image credit: Wikimedia)

The emergence of a whole suite of sophisticated equipment to scan and study the brain has further fanned the flames of enthusiasm for attention research. In a recent study using eye tracking and fMRI data, Leong et al. demonstrated the bidirectional interaction between attention and learning: attention facilitates learning, and learned values in turn inform attentional selection [1]. The relationship between attention and consciousness is a complex issue, and in many sense both a scientific and a philosophical exploration. The ability to focus one’s thoughts out of several simultaneous objects or trains of thought and take control of one’s own mind in a vivid and conscious manner is not just a delightful and useful perk. It is a quintessential part of our experience of human-ness.

Given their significance, attentional mechanisms have in recent years received increasing attention (pun intended) from the AI community. A detailed explanation of how they are applied in machine learning will require a separate blog post (I highly recommend this excellent article by Olah and Carter) but in essence attention layers provide the functionality of focusing on specific elements to improve the performance of a model. In an image recognition task for example, it does so by taking ‘glimpses’ of the input image at each step, updating the internal state representations, and then selecting the next location to sample. In a cluttered setting or when the input is too big, attention serves a ‘prioritisation’ function to filter out irrelevant elements. It is a powerful technique that can be used when interfacing with a neural network that has a repeating structure in its output. For example, when applied to augment LSTM (a special variant of recurrent neural networks), it lets every step of an RNN select information to look at from a larger body of information. However, attentional mechanisms are not just useful in RNNs, as we will find out below.

State of the art using attention in machine learning

In machine learning, attention is especially useful in sequence prediction problems. Let’s review a few of the major areas where it has been applied successfully.

1. Natural language processing

Attentional mechanisms have been applied in many natural language processing (NLP) related tasks. The seminal work by Bahdanau et al. proposed a neural machine translation model that implements an attention mechanism in the decoder for English-to-French translation [2]. As the system reads the English input (encoder), the decoder outputs French translation whereby the attention mechanism learns by stochastic gradient descent to shift the focus to concentrate on the parts surrounding the word that is being translated. Their RNN-based model has been shown to outperform traditional phrase-based models by huge margins. RNNs are the incumbent architecture for text applications but it does not allow for parallelisation, which limits its potential of using GPU hardware that powers modern machine learning. A team of Facebook AI researchers introduced a novel approach using convolutional neural networks (which are highly parallelisable) and a separate attention module in each decoder layer. As opposed to Bahdanau et al’s ‘single step attention’, theirs is a multi-hop attention module. This means instead of looking at the sentence once and then translating it without looking back, the mechanism takes multiple glimpses at the sentence to determine what it will translate next. Their approach outperformed state of the art results for English-German and English-French translation at an order of magnitude faster speed [3]. Other examples of attentional mechanisms being applied in NLP problems include text classification [4], language processing (performing tasks described by natural language instructions in a 3D game-play environment) [5] and text comprehension (answering close-style questions about a document) [6].

2. Object recognition

Object recognition is one of the hallmarks of machine intelligence. Mnih et al. demonstrated how an attentional mechanism can be used to ignore irrelevant objects in a scene, allowing the model to perform well in challenging object recognition tasks in the presence of clutter [7]. In their Recurrent Attention Model (RAM), the agent receives partial observation of the environment at each step and learns where to focus (i.e. pay attention to) next through training an RNN. Attention is used to produce a ‘glimpse feature vector’ whereby regions around a target pixel is encoded at high-resolution and pixels further from the target pixel uses progressively lower resolution. Using a similar approach, another study used  a deep recurrent attention model to both localise and recognise multiple objects in images [8]. Xu et al. trained a model that automatically learns to describe the content of images [9]. Their attention models were trained using a multilayer perceptron that is conditioned on some previous hidden state, meaning where the network looks next depends on the sequence of words that has already been generated. The researchers showed how to use convolutional neural networks to pay attention to images when outputting a sequence, i.e. the image caption. Another advantage of attention in this case is the insights gained by approximately visualising where and what the attention focused on (i.e. what the model ‘sees’).

Telling mistakes in image caption generation with visual attention (image taken from Xu et al., 2016)

3. Gameplay

Google DeepMind’s Deep Q-Network (DQN) represented a significant advance in Reinforcement Learning and a breakthrough in general AI in the sense that it showed a single algorithm can learn to play a wide variety of Atari 2600 games: the agent was able to continually adapt its behaviour without any human intervention. Sorokin et al. added attention to the equation and developed the Deep Attention Recurrent Q-Network (DARQN) [10]. Their model outperformed that of DQN by incorporating what they termed ‘soft’ and ‘hard’ attention mechanisms. The attention network takes the current game state as input and generates a context vector based on the features observed. An LSTM then takes this context vector along with a previous hidden state and memory state to evaluate the action that an agent can take. Choi et al. further improved on DARQN by implementing a multi-focus attention network where the agent is capable of attending to multiple important elements [11]. In contrast to DARQN that uses only one attention layer, the model uses multiple parallel attentions to attend to entities that are relevant to tackling the problem.

4. Generative models

Attention has also proven useful in generative models, systems that can simulate (i.e. generate) values of any variable (inputs and outputs) in the model. Hong et al. developed a deep generative model based on a convolutional neural network for semantic segmentation (the task of assigning class labels to groups of pixels in an image) [12]. By incorporating attention-like mechanisms they were able to capture transferable segmentation knowledge across categories. The attention mechanism adaptively focuses on different areas depending on the input labels. A softmax function is used to encourage the model to pay attention to only a segment of the image.  Another example is Google DeepMind’s Deep Recurrent Attentive Writer (DRAW) neural network for image generation [13]. Attention allows the system to build up an image incrementally (shown in the video below). The attention model is fully differentiable (making it possible to train with gradient descent), thus allowing the encoder to focus on only part of the input and the decoder to modify only a part of the canvas. The model achieved impressive results generating images from the MNIST data set and when trained on the Street View House Number data set, it generated images that are almost identical to the real data.

5. Attention alone for NLP tasks

Another exciting line of research focusses on using attentional mechanisms alone for NLP tasks traditionally solved with neural networks. Vaswani et al. developed Transformer, a simple network architecture based solely on a novel multi-head attention mechanism for translation task [14]. They compute the attention function on a set of queries simultaneously using a dot-product attention (each key is multiplied with the query to see how similar they are) with an additional scaling factor. This multi-head approach allows their model to attend to information from different positions at the same time. Their model completely foregoes recurrence and convolutions but still managed to attain state-of-art results for English-to-German and English-to-French translations. Moreover, they achieved this in significantly less training time and their model is highly parallelizable. An earlier work by Parikh et al. experimented with a simple attention-based approach to solve natural language inference tasks [15]. They used attention to deconstruct the problem into subproblems that can be solved individually, hence making the model trivially parallelizable.

Not just a cog in the machine

What we have learned about attention so far tells us it is likely to be an essential component in the development of general AI. Philosophically, it is a key feature of the human psyche, which makes it a natural inclusion in pursuits that concerns the grey matter, while computationally, attention-based mechanisms have helped boost model performance to deliver stunning results in many areas. Attention has also proven to be a versatile technique, as is evident in its ability to replace recurrent layers in machine translation and other NLP related tasks. But it is most powerful when used in conjunction with other components, as Kaiser et al. demonstrated in their study One Model To Learn Them All that presented a model capable of solving a number of problems spanning multiple domains [16]. To be sure, attentional mechanisms are not without weaknesses. As Olah and Carter suggested, their propensity to take every action at every step (albeit to varying extent) could potentially be very costly computationally. Nonetheless, I believe that in a modular approach to develop general AI – IMO our best bet in this quest – attention will be a worthwhile, and perhaps even indispensable, module.


[1] Leong, Y. C., Radulescu, A., Daniel, R., DeWoskin, V., & Niv, Y. (2017). Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron, 93(2), 451-463.

[2] Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv preprint arXiv:1705.03122.

[4] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A. J., & Hovy, E. H. (2016). Hierarchical Attention Networks for Document Classification. In HLT-NAACL (pp. 1480-1489).

[5] Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., & Salakhutdinov, R. (2017). Gated-Attention Architectures for Task-Oriented Language Grounding. arXiv preprint arXiv:1706.07230.

[6] Dhingra, B., Liu, H., Yang, Z., Cohen, W. W., & Salakhutdinov, R. (2016). Gated-attention readers for text comprehension. arXiv preprint arXiv:1606.01549.

[7] Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. In Advances in neural information processing systems (pp. 2204-2212).

[8] Ba, J., Mnih, V., & Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.

[9] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., … & Bengio, Y. (2016). Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

[10] Sorokin, I., Seleznev, A., Pavlov, M., Fedorov, A., & Ignateva, A. (2015). Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.

[11] Choi, J., Lee, B. J., & Zhang, B. T. (2017). Multi-Focus Attention Network for Efficient Deep Reinforcement Learning. AAAI Publications, Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

[12] Hong, S., Oh, J., Lee, H., & Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3204-3212).

[13] Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.

[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762.

[15] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016.

[16] Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N., Jones, L., & Uszkoreit, J. (2017). One Model To Learn Them All. arXiv preprint arXiv:1706.05137.


Continuous Learning

Posted by Gideon Kowadlo on

The standard machine learning approach is to learn to accomplish a specific task with an associated dataset. A model is trained using the dataset and is only able to perform that one task. This is in stark contrast to animals which continue to learn throughout life and accumulate and re-purpose knowledge and skills. The limitation has been widely acknowledged and addressed in different ways, and with a variety of terminology, which can be confusing. I wanted to take a brief look at those approaches and to create a precise definition of the Continuous Learning that we want to implement in our pursuit of AGI.

Transfer Learning is a term that has been used a lot recently in the context of Deep Learning. It was actually first discussed in a paper by Pratt in 1993. Transfer Learning techniques use knowledge for related tasks on either the same or similar datasets. A classic example is learning to recognise cars and then applying the model to the task of recognising trucks. Or learning to recognise a different aspect of objects on the same dataset, such as learning how to recognise petals instead of leaves, of a dataset containing many plants.

One type of Transfer Learning is Domain Adaptation. It refers to the idea of learning on one domain, or data distribution, and then applying the model to and optimising it for a related data distribution. Training a model on different data distributions is often referred to as Multi Domain Learning. In some cases the distributions are similar, but other times they are deliberately unrelated.

The term Lifelong Learning pops up about the same time as Transfer Learning, in a paper by Thrun in 1994. He describes it as an approach that “addresses situations in which a learner faces a series of different learning tasks providing the opportunity for synergy among them”. It overlaps with Transfer Learning, but the emphasis is on gathering general purpose knowledge that transfers across multiple consecutive tasks for an ‘entire lifetime’. Thrun demonstrated results with real robotic systems.

Curriculum Learning by Bengio is a special case of Lifelong or Transfer Learning, where the objective is to optimise performance on a specific task, rather than across different tasks. It does this by making an easy version of that one task and making it subsequently harder and harder.

Online Learning algorithms learn iteratively with new data, in contrast to learning from a pass of a whole dataset, as is commonly done in conventional supervised and unsupervised learning, referred to as Batch Learning. Batches can also refer to portions of the dataset.

Online Learning is useful when the whole dataset does not fit into memory at once. Or more relevant for AGI, in scenarios where new data is observed over time. For example, with new samples being generated by users of a system, by an agent exploring its environment or for cases where the phenomena being modelled changes. Another way to describe it is that the underlying input data distribution is not static i.e. a non-stationary distribution, hence these are referred to as Non-stationary Problems.

Online learning systems can be susceptible to ‘forgetting’. That is, becoming less effective at modelling older data. The worst case is failing completely and suddenly, known as Catastrophic Forgetting or Catastrophic Interference.

Incremental Learning, as the name suggests, is about learning bit by bit, extending the model and improving performance over time. Incremental Learning explicitly handles the level of forgetting of past data. In this way, it is a type of online learning that avoids catastrophic forgetting.

In One-shot Learning, the algorithm is able to learn from one or very few examples. Instance Learning is one way of achieving that, constructing hypotheses from the training instances directly.

A related concept is Multi-Modal Learning, where a model is trained on different types of data for the same task. An example is learning to classify letters from the way they look with visual data, and the way they sound, with audio.

Now that we have some greater clarity around these terms, we recognise that they are all important features of what we consider to be Continuous Learning for a successful AGI agent. I think it’s instructive to express it in terms of traits in the context of an autonomous agent. I’ve mapped these traits to the associated Machine Learning algorithm concepts.

Trait ML Terminology
Uses learnt information to help with subsequent tasks.

Builds on its knowledge. Enables more complex behaviour and faster learning.

Transfer Learning

Curriculum Learning

As features of the task change gradually, it will adapt.

This will not cause catastrophic forgetting.

Domain Adaption

Non-stationary input distributions

Iterative Learning

Can learn entirely new tasks.

This will not cause catastrophic forgetting of old tasks. Also, it can learn these new tasks as well as it would have, if it was the first task learnt i.e. learning a task does not impede the ability to learn subsequent tasks.

Iterative Learning
Learns important aspects of the task from very few examples.

It has the ability to learn fast when necessary.

One-shot Learning
Continues to learn as it collects more data. Online Learning
Combines sensory modalities to learn a task. Multi-modal Learning

Note that in continuous learning, if there are fixed resources, and you are operating at your limit, then there has to be some forgetting, but as mentioned in the table, it should not be ‘catastrophic forgetting’.

'no input' state/CLA/missing data/Predictive Coding/Sparse Distributed Representations/Theory

When is missing data a valid state?

Posted by Gideon Kowadlo on

By Gideon Kowadlo, David Rawlinson and Alan Zhang

Can you hear silence or see pitch black?
Should we classify no input as a valid state or ignore it?

To my knowledge, the machine learning and statistics literature mainly regards an absence of input as missing data. There are several ways that it’s handled. It can be considered to be a missing data point, a value is inferred and then treated as the real input. When a period of no data occurs at the beginning or end of a stream (time series data), it can be ignored, referred to as censoring. Finally, when there is a variable that can never (or is never) observed, it can be viewed as data that is always missing, and modelled with what is referred to as latent or hidden variables. I believe there is more to the question of whether an absence of input is in fact a valid state, particularly when learning time varying sequences and when considering computational parallels of biological processes where an absence of signal might never occur.

It is also relevant in the context of systems where ‘no signal’ is an integral type of message that can be passed around. One such system is Predictive Coding (PC), which is a popular theory of cortical function within the neuroscience community. In PC, prediction errors are fed forward (see PC post [1] for more information). Therefore, perfectly correct predictions result in ‘no-input’ in the next level, which may occur from time to time given it is the objective of the encoding system.

Let’s say your system is classifying sequences of colours Red (R), Green (G) and Blue (B), with periods of no input which we represent as Black (K). There is a sequence of colours RGB, followed by a period of K, then BGR and then two steps of K again, illustrated below (the figure is a Markov graph representation).

Figure 1: Markov graph representation of a sequence of colour transitions.

What’s in a name?
What actually defines Black as no input?

This question is explored in the following paragraphs along with Figure 2 below. We start with the way the signal is encoded. In the case of an image, each pixel is a tuple of scalar values, including black (K) with a value of (0, 0, 0). No specific component value has a privileged status. We could define black as any scalar tuple. For other types of sensors, signal modulation is used to encode information. For example, frequency of binary spikes/firing is used in neural systems. No firing, or more generally no change, indicates no input. Superficially it appears to be qualitatively different. However, a specific modulation pattern can be mapped to a specific scalar value. Are they therefore equivalent?

We reach some clarity by considering the presence of a clock as a reference. The use of signal modulation implies the requirement of a clock, but does not necessitate it. With an internal clock, modulation can be measured in a time-absolute* sense, the modulation can be mapped to a scalar representation, and the status of the no-input state does indeed become equivalent to the case of a scalar input with a clock i.e. no value is privileged.

Where there is no clock, for either type of signal encoding, time can effectively stand still for the system. If the input does not change at all, there is no way to perceive the passage of time. For scalar input, this means that the input does not transition. For modulated input, it includes the most obvious type of ‘no-input’, no firing or zero frequency.

This would obviously present a problem to an intelligent agent that needs to continue to predict, plan and act in the world. Although there are likely to be inputs to at least some of the sensors, it suggests that biological brains must have an internal clock. There is evidence that the brain has multiple clocks, summarised here in Your brain has two clocks [2]. I wonder if the time course of perceptible bodily processes or thoughts themselves could be sufficient for some crude perception of time.

Figure 2: Definition of ‘no-input’ for different system characteristics.
* With respect to the clock at least. This does give rise to the interesting question of the absoluteness of the clock itself. Assume for arguments sake that consciousness can be achieved with deterministic machines. The simulated brain won’t know how fast time is running. You can pause it and resume without it being any wiser.
If we assume that we can define a ‘no-input’ state, how would we approach it?

The system could be viewed as an HMM (Hidden Markov Model). The sensed/measured states represent hidden world states that can not be measured directly. Let us make many observations and look to the statistics of occurrence, and compare this to the other observable states. If the statistics are similar, we can assume option A – no special meaning. If on the other hand, it occurs between the other observable sequences, sequences which are not correlated with each other, and is therefore not significantly correlated with any transitions, then we can say that it is B – a delineator.

A – no special meaning

There are two options, treat K as any other state, or ignore it. For the former, it’s business as usual. For the latter, ‘ignoring the input’, there don’t seem to be any consequences for the following reason. The system will identify at least two shorter sequences, one before K and one after. Any type of sequence learning must anyway have an upper limit on the length of the representable sequences* (unlike the theoretical Turing Machine); this will just make those sequences shorter. In the case of hierarchical algorithms such as HTM/CLA, higher levels in the hierarchy will integrate these sub sequences together into longer (more abstracted) temporal sequences.

However, ignoring K will have implications for learning the timing of state persistence and transitions. If the system ignores state K including the timing information, then modelling will be incomplete. For example, referring back to Figure 1, K occurs for two time steps before the transition back to R. This is important information for learning to predict when this R will occur. Additionally, the transition to K signalled the end of the occurrence of R preceding K. Another example is illustrated below in Figure 3. Here, K following B is a fork between two sub chains. The transition to R occurs 95% of the time. That information can be used to make a strong prediction about future transitions from this K, however if K is ignored, as shown on the right of the figure, the information is ignored and the prediction is not possible.

Figure 3: Markov chain showing some limitations of ignoring K.

* However, it is possible to have the ability to represent sequences far longer than the expected observable sequences with enough combinatorial power, as described in CLA and argued to exist in biological systems.

B – a delineator

This is the case where the ‘no-input’ state is not correlated (above some significant measure) with any observed sequence. The premise of this categorisation, is that due to lack of correlation, it is an effectively meaningless state. However, it can be used to make inferences about the underlying state. Using the example from Figure 1, based on repeated observations, the statement could be made that R, G and B summarise hidden states. We can also surmise that there are states that generate white noise, in this example random selections of R, G, B or K. This can be inferred since we never observe the same signal twice when in those states. Observations of K are then useful for modelling the hidden states, which calls into question the definition of K as ‘no-input’.

However, it may in fact be an absence of input. In any case, we did not observe any correlations with other sequences. Therefore in practice, this is similar to ‘A – no special meaning – ignore the state’. The difference is the semantic meaning of the ‘no-input’ state as a delineator. There is also no expectation that there is meaningful information in the duration of the absence of input. The ‘state’ is useful to indicate the sequence is finished, and therefore defines the timing of persistence of the last state of the sequence.

CLA and hierarchical systems

Turning our attention briefly to the context of HTM CLA [3]. CLA utilises Sparse Distributed Representations (see SDR post [4] for more information) as a common data structure in a hierarchical architecture. A given state, represented as an SDR, will normally be propagated to the level above which also receives input from other regions. It will therefore be represented as one (or possibly more) of many bits in the state above. Each bit is semantically meaningful. A ‘0’ should therefore be as meaningful as a ‘1’. The questions discussed above arise when the SDR is completely zero, which I’ll refer to as a ‘null SDR’.

The presence of a null SDR depends on the input source, presence of noise and the implementation details of the encoders. In a given region, the occurrence of null SDR’s will tend to dissipate, as the receptive field adjusts until a certain average complexity is observed. In addition, null SDR’s becomes increasingly unlikely as you move up the hierarchy and incorporate larger and larger receptive fields, thus increasing the surface area for possible activity. If the null SDR can still occur occasionally, there may be times when it is significant. If it is not classified, will the higher levels in the hierarchy recover the ‘lost’ information? This question applies to other hierarchical systems and will be investigated in future posts.

So what ……. ?

What does all of this mean for the design of intelligent systems? A realistic system will be operating with multiple sensor modalities and will be processing time varying inputs (regardless of the encoding of the signal). Real sensors and environments are likely to produce background noise, and in front of that, periods of no input in ways that are correlated with other environmental sequences, and in ways that are not – relating to the categorisations above ‘A – no special meaning’ and ‘B – a delineator’. There is no simple ‘so what’, but hopefully this gives us some food for thought and shows that it is something that should be considered. In future I’ll be looking in more depth at biological sensors and the nature of the signals that reach the cortex (are they ever completely silent?), as well as the implications for other leading machine learning algorithms.


[1] On Predictive Coding and Temporal Pooling[2] Emilie Reas, Your brain has two clocks, Scientific American, 2013[3] HTM White Paper[4] Sparse Distributed Representations (SDRs)
Artificial General Intelligence/Memes/Natural Selection/Singularity/Theory

Constraints on intelligence

Posted by Gideon Kowadlo on
by Gideon Kowadlo and David Rawlinson


This article contains some musings on the factors that limit the increase of intelligence as a species.

We speculate that ultimately, our level of intelligence is limited by at least two factors, and possibly a third:

  1. our own cultural development,
  2. physical constraints, and
  3. an intelligence threshold.

We’ll now explore each of these factors.

Cultural Development

Natural Selection

Most readers are familiar with Natural Selection. The best known and dominant mechanism is that fitter biological organisms in a population tend to survive longer, reproduce more frequently and successfully, and pass on their traits to the next generation. Given some form of external pressure and therefore competition, such as resource constraints, the species on average is likely to increase in fitness. In competition with other species, this is necessary for species survival.

Although this is the mechanism we are focusing on in this post, there are other important forms of selection. Two examples are ‘Group Selection’ and ‘Sexual Selection’. Group selection favours traits that benefit the group over the individual, such as altruism. Especially when the group shares common genes. Sexual selection favours traits that improve an individual’s success in reproducing by two means: being attractive to the other gender, and ability to compete with rivals of the same gender. Sometimes sexually appealing characteristics are highly costly or risky to individuals, for example by making them vulnerable to predators.


Another influence on ability to survive is culture. Humans have developed culture, and some form of culture is widely believed to exist in other species such as primates and birds (e.g. Science). Richard Dawkins introduced the concept of memes, cultural entities that evolve in a way that is analogous to genes. The word meme now conjures up funny pictures of cats (see Wired magazine’s article on the re-appropriation of the word meme), and no-one is complaining about that, but it’s hard to argue that these make us fitter as a species. However, it’s clear that cultural evolution, by way of technological progress, can have a significant influence. This could be negative, but is generally a positive, making us more likely to survive as a species.

Culture and Biology

A thought experiment regarding the effect on survival due to natural selection and cultural development, and due to their relationship with each other, is explored with a graph below.

Figure 1: A thought experiment: The shape of survivability vs time, due to cultural evolution, and due to natural selection. The total survivability is the sum of the two. Survivability due to natural selection plateaus when it is surpassed by survivability due to cultural evolution. Survivability due to cultural evolution plateaus when cultural development allows almost everyone in the population to survive.

For humans, the main biological factor contributing to survival is our intellect. The graph shows how our ability to survive steadily improves with time as we evolve naturally. The choice of linear growth is based on the fact that the ‘force’ for genetic change does not increase or decrease as that change occurs*. On the other hand, it is thought that cultural evolution improves our survivability exponentially. In recent years, this has been argued by well known authors and thinkers such as Ray Kurzweil and Eliezer S. Yudkowsky in the context of the Technological Singularity. We build on knowledge continuously, and leverage our technological advances. This enables us to make ever larger steps, as each generation exploits the work of the preceding generations. As Isaac Newton wrote, “If I have seen further it is by standing on the shoulders of giants” **. Many predict that this will result in the ability to create machines that surpass human intelligence. The point at which this occurs is known as the aforementioned Technological Singularity.

Cultural Development – Altruism

Additionally, cultural evolution could include the development of humanitarian and altruistic ideals and behaviour. An environment in which communities care for all their people, which would increase the survivability of (almost) everyone to the threshold of reproduction – leaving only a varied ability to prosper beyond survival. This is shown in the figure above as a plateau in survivability due to cultural evolution.

Cultural Development – Technology

Cultural factors dominate once survivability due to cultural evolution and technological development surpasses that due to natural selection. For example, the advantages given by use of a bow and arrow for hunting, will reduce the competitive advantage of becoming a faster runner. Having a supermarket at the end of your street will render faster running insignificant. The species would no longer evolve biologically through the same process of natural selection. Other forces may still cause biological evolution in extreme cases, such as resistance to new diseases, but this is unlikely to drive the majority of further change. This means that biological evolution of our species would stagnate***. This effect is shown in the graph with the plateau in survivability due to natural selection.

* On a fine scale, this would not be linear and would be affected by many many unpredictable factors such as climate changes, other environmental instability as well as successes/failures of other species.

** Although this metaphor was first recorded in the twelfth century and has been attributed to Bernard of Chartres.

*** Interestingly, removal of selective pressure does not allow species to rest at a given level of fitness. Deleterious mutations rapidly accumulate within the population, giving us a short window of opportunity to learn to control and improve our own genetic heritage.

Physical Constraints

One current perspective in neuroscience, and the basis for our work and this blog, is that much of our intelligence emerges from, very simply put, a hierarchical assembly of regions of identical computational units (cortical columns). As explained in previous posts (here and here), this is physically structured as a sheet of cortex, that form connections from region to region. The connecting regions are conceptually at different levels in the hierarchy. The connections themselves form the bulk of the cortex. We believe that with an increasingly deep hierarchy, the brain is able to represent increasingly abstract and general spatiotemporal concepts, which would play a significant role in increasing intelligence.

The reasoning above predicts that the number of neurons and connections is correlated with intelligence. These neurons and connections have mass and volume and require a blood supply. They cannot increase indefinitely.

Simply increasing the size of the skull has its drawbacks. Maintaining stable temperature becomes more difficult, and structural strength is sacrificed. The body would become disproportionately large to carry around extra mass, making the animal less mobile, coupled with the fact that there would be higher energy demands. Larger distances for neuronal connections leads to slower signal propagation which could also have negative impact. Evidence of the consequences of such physical constraints is found in the fact that the brain folds in on itself, appearing wrinkled, in order to maximise surface area (and hence the number of neurons and connections) in the given volume of the skull. Evolution has produced a tradeoff between these characteristics that limits our intelligence to promote survival.

It is possible to imagine completely different architectures that might circumvent these limitations. Perhaps a neural network distributed throughout the body, such as exists for some marine creatures. However, it is implausible that physical constraints would not ultimately be a limiting factor. Also, reality is more constrained than our imagination. For example, it must be physically and biologically possible for the organism to develop from a single cell to a neonate, and on to a reproducing adult.

An Intelligence Threshold

There could be a point at which the species crosses an intelligence threshold, beyond which higher intelligence does not confer a greater ability to survive. However, since the threshold may be dictated by cultural evolution it is very difficult to separate the two. For example, the threshold might be very low in an altruistic world, and it is possible to envision a hyper-competitive and adversarial culture in which the opposite is true.

But perhaps a threshold exists as a result of a fundamental quality of intelligence, completely independent of culture. Could it be, that once you can grasp concepts at a sufficient level of abstraction, and have the ability to externalise and record concepts with written symbols (thereby extending the hierarchy outside of the physical brain), that it would be possible to conduct any ‘thought’ computation, given enough working memory, concentration and time? Similarly, a Turing Machine is capable of carrying out any computation, given infinite memory.

The topic of consciousness and it’s definition is beyond the scope of this post. However, accepting that there appears to be a clear relationship between intelligence and what most people understand as consciousness, this ‘Intelligence Threshold’ has implications for consciousness itself. It is interesting to ponder the threshold as having a corresponding crossing point in terms of conscious experience.

We may explore the existence and nature of this potential threshold in greater detail in the future.

Impact of Artificial General Intelligence (AGI)

The biological limitations to intelligence discussed in this article show why Artificial General Intelligence (AGI) will be such a dramatic development. We still exist in a physical world (at least perceptibly), but building an agent out of silicon (or other materials in the future), will effectively free us from all of these constraints. It also allows us to modify parameters, architecture and monitor activity. It will be possible to invest large quantities of energy into ‘thinking’ in a mind that does not fatigue. Perhaps this is a key enabling technology on the path to the Singularity.