

Theory

Continuous Learning

Posted by Gideon Kowadlo on

The standard machine learning approach is to learn to accomplish a specific task with an associated dataset. A model is trained using the dataset and is only able to perform that one task. This is in stark contrast to animals which continue to learn throughout life and accumulate and re-purpose knowledge and skills. The limitation has been widely acknowledged and addressed in different ways, and with a variety of terminology, which can be confusing. I wanted to take a brief look at those approaches and to create a precise definition of the Continuous Learning that we want to implement in our pursuit of AGI.

Transfer Learning is a term that has been used a lot recently in the context of Deep Learning. It was actually first discussed in a paper by Pratt in 1993. Transfer Learning techniques re-use knowledge gained on one task for related tasks, on either the same or similar datasets. A classic example is learning to recognise cars and then applying the model to the task of recognising trucks. Another is learning to recognise a different aspect of the same dataset, such as learning to recognise petals instead of leaves in a dataset containing many plants.
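
As a structural sketch of this idea (not code from any of the papers mentioned): features learnt on a source task are frozen, and only a new readout is fitted for the related target task. The feature map and data below are random placeholders, purely to show the shape of the approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Source task: pretend W_feat was learnt while recognising cars (random placeholder here).
W_feat = rng.normal(size=(20, 10))

def features(x):
    # Shared representation, frozen and re-used across tasks.
    return np.tanh(x @ W_feat)

# Target task (trucks): fit only a new linear readout on top of the frozen features.
X_trucks = rng.normal(size=(50, 20))                  # stand-in data
y_trucks = rng.integers(0, 2, size=50).astype(float)  # stand-in labels
H = features(X_trucks)
w_readout, *_ = np.linalg.lstsq(H, y_trucks, rcond=None)

predictions = H @ w_readout   # task-specific output built on re-used knowledge
```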

One type of Transfer Learning is Domain Adaptation. It refers to the idea of learning on one domain, or data distribution, and then applying the model to and optimising it for a related data distribution. Training a model on different data distributions is often referred to as Multi Domain Learning. In some cases the distributions are similar, but other times they are deliberately unrelated.

The term Lifelong Learning pops up about the same time as Transfer Learning, in a paper by Thrun in 1994. He describes it as an approach that “addresses situations in which a learner faces a series of different learning tasks providing the opportunity for synergy among them”. It overlaps with Transfer Learning, but the emphasis is on gathering general purpose knowledge that transfers across multiple consecutive tasks for an ‘entire lifetime’. Thrun demonstrated results with real robotic systems.

Curriculum Learning, introduced by Bengio and colleagues in 2009, is a special case of Lifelong or Transfer Learning, where the objective is to optimise performance on a specific task, rather than across different tasks. It does this by starting with an easy version of that one task and making it progressively harder.

Online Learning algorithms learn iteratively as new data arrives, in contrast to learning from a pass over a whole dataset, as is commonly done in conventional supervised and unsupervised learning and referred to as Batch Learning. Confusingly, 'batches' can also refer to portions of the dataset.

Online Learning is useful when the whole dataset does not fit into memory at once or, more relevantly for AGI, in scenarios where new data is observed over time: for example, when new samples are generated by users of a system, by an agent exploring its environment, or when the phenomenon being modelled changes. Another way to describe this is that the underlying input data distribution is not static, i.e. it is a non-stationary distribution, hence these are referred to as Non-stationary Problems.

Online learning systems can be susceptible to ‘forgetting’. That is, becoming less effective at modelling older data. The worst case is failing completely and suddenly, known as Catastrophic Forgetting or Catastrophic Interference.

Incremental Learning, as the name suggests, is about learning bit by bit, extending the model and improving performance over time. Incremental Learning explicitly handles the level of forgetting of past data. In this way, it is a type of online learning that avoids catastrophic forgetting.
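
A minimal sketch of these ideas, assuming a simple linear model and squared error: the model is updated one sample at a time as data streams in, and a small rehearsal buffer of past samples is replayed alongside new ones to limit forgetting. The model, learning rate and buffer size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.zeros(5)                       # linear model weights
buffer = []                           # small rehearsal buffer of past samples
LR, BUFFER_SIZE = 0.01, 100

def online_update(x, y):
    """One online step on (x, y), plus replay of a stored sample to reduce forgetting."""
    global w
    replay = [rng.choice(len(buffer))] if buffer else []
    batch = [(x, y)] + [buffer[i] for i in replay]
    for xb, yb in batch:
        grad = (w @ xb - yb) * xb     # squared-error gradient for one sample
        w -= LR * grad
    if len(buffer) < BUFFER_SIZE:
        buffer.append((x, y))
    else:
        buffer[rng.integers(BUFFER_SIZE)] = (x, y)   # reservoir-style replacement

# Data arrives over time (non-stationary in general); learn as it streams in.
for t in range(1000):
    x_t = rng.normal(size=5)
    y_t = x_t @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal()
    online_update(x_t, y_t)
```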

In One-shot Learning, the algorithm is able to learn from one or very few examples. Instance Learning is one way of achieving that, constructing hypotheses from the training instances directly.
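
A minimal sketch of the instance-based flavour of this idea: the few labelled examples are stored directly, and new inputs are classified by the nearest stored instance. The feature vectors and distance measure are placeholders.

```python
import numpy as np

# One (or a few) stored instances per class, used directly as the 'hypothesis'.
instances = {
    "A": np.array([0.9, 0.1, 0.0]),
    "B": np.array([0.0, 0.8, 0.2]),
}

def classify(x):
    # Nearest stored instance wins; no iterative training required.
    return min(instances, key=lambda label: np.linalg.norm(x - instances[label]))

print(classify(np.array([0.7, 0.2, 0.1])))   # -> "A"
```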

A related concept is Multi-Modal Learning, where a model is trained on different types of data for the same task. An example is learning to classify letters from the way they look with visual data, and the way they sound, with audio.

Now that we have some greater clarity around these terms, we recognise that they are all important features of what we consider to be Continuous Learning for a successful AGI agent. I think it’s instructive to express it in terms of traits in the context of an autonomous agent. I’ve mapped these traits to the associated Machine Learning algorithm concepts.

Trait: Uses learnt information to help with subsequent tasks. Builds on its knowledge, enabling more complex behaviour and faster learning.
ML Terminology: Transfer Learning; Curriculum Learning

Trait: As features of the task change gradually, it will adapt. This will not cause catastrophic forgetting.
ML Terminology: Domain Adaptation; Non-stationary input distributions; Incremental Learning

Trait: Can learn entirely new tasks. This will not cause catastrophic forgetting of old tasks, and it can learn a new task as well as it would have if it were the first task learnt, i.e. learning a task does not impede the ability to learn subsequent tasks.
ML Terminology: Incremental Learning

Trait: Learns important aspects of a task from very few examples. It has the ability to learn fast when necessary.
ML Terminology: One-shot Learning

Trait: Continues to learn as it collects more data.
ML Terminology: Online Learning

Trait: Combines sensory modalities to learn a task.
ML Terminology: Multi-modal Learning

Note that in continuous learning, if resources are fixed and you are operating at their limit, there has to be some forgetting; but as mentioned in the table, it should not be 'catastrophic' forgetting.

Experiment/Experimental Framework

AGI Experimental Framework

Posted by Gideon Kowadlo on

We’re very excited to launch the AGI Experimental Framework (AGIEF), our open source framework.

We first introduced it a while back, at the end of 2015 here, and it has certainly come a long way.

AGIEF was created to make running rigorous AI experiments convenient, reproducible and scalable. The goals are:

  • Repeatability: ability to save/load, stop/start an experiment from any execution step, and know that it will execute deterministically
  • Visualisation: ability to visualise all the data structures at any step
  • Distributed operation for performance

The GitHub wiki and README describe the project in detail and explain how to get started.

The framework comprises 3 repositories.

agi – Java project comprising core algorithmic code and framework package to support compute nodes.

run-framework – Python scripts to run and interact with the compute nodes covering aspects such as generating input files, launching cloud infrastructure, running those experiments (locally or remotely), executing parameter sweeps and exporting and uploading the output artefacts.

experiment-definitions – contains the experiment definitions, the files required to run and repeat specific experiments.

Code/Experiment

Open Sourcing MNIST and NIST Preprocessing Code

Posted by Gideon Kowadlo on

In our most recent post we discussed the current set of experiments that we are conducting, using the MNIST dataset. We’ve also been looking at the NIST dataset which is similar, but extends to handwritten letters (as well as digits).

These are extremely popular datasets and freely available, so they make a great choice for testing an algorithm and comparing it against published benchmarks.

The MNIST data is not available directly as images though. Its file format is well defined, but it's not a common one. It’s easy to find snippets of code to convert this format into standard images (such as PNG or JPG), but putting it together and getting it working is not where you want to spend your time – you want to be designing and running your experiment!

We’ve been through that phase, so we're very happy to open source our code and make it easier for others to get going faster.

These are simple, small, self contained Java projects with ZERO dependencies. There are two projects: one preprocesses MNIST files into images, the other preprocesses NIST images to make them equivalent to the MNIST images, so they can easily be used in the same experimental setup. See the README for more information about the steps taken.

Preprocess-MNIST

Preprocess_NIST_SD19
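
For anyone who just wants a quick script rather than the full projects, here is a rough Python sketch of the same kind of conversion. It assumes Pillow is installed and reads the standard IDX image layout; the released projects above are self-contained Java with zero dependencies.

```python
import struct
from PIL import Image

def idx_images_to_png(idx_path, out_prefix, limit=10):
    """Convert images from an MNIST IDX file (e.g. train-images-idx3-ubyte) to PNGs."""
    with open(idx_path, "rb") as f:
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        for i in range(min(count, limit)):
            pixels = f.read(rows * cols)
            img = Image.frombytes("L", (cols, rows), pixels)  # 8-bit greyscale
            img.save(f"{out_prefix}_{i:05d}.png")

# Example usage (path is a placeholder):
# idx_images_to_png("train-images-idx3-ubyte", "mnist_train")
```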

Emotion/Intuition/vmPFC

Intuition over reasoning for AI

Posted by Gideon Kowadlo on

By Gideon Kowadlo

I’m reading a fascinating book called The Righteous Mind, by Jonathan Haidt. It’s one of those reads that can fundamentally shift the way that you see the world. In this case, the human world, everyone around you, and yourself.

A central idea of the book is that our behaviour is mainly dictated by intuition rather than reasoning and that both are aspects of cognition.

Many will be able to identify, in themselves and others, the tendency to act first and rationalise later – even though it feels like the opposite. But more than that, our sense of morality arises from intuition, and it enables us to act quickly and make good decisions.

A compelling biological correlate is the ventromedial prefrontal cortex. The way it enables us to use emotion/intuition for decision making is described well in this passage:

Damasio had noticed an unusual pattern of symptoms in patients who had suffered brain damage to a specific part of the brain – the ventromedial (i.e., bottom-middle) prefrontal cortex (abbreviated vmPFC; it’s the region just behind and above the bridge of the nose). Their emotionality dropped nearly to zero. They could look at the most joyous or gruesome photographs and feel nothing. They retained full knowledge of what was right and wrong, and they showed no deficits in IQ. They even scored well on Kohlberg’s tests of moral reasoning. Yet when it came to making decisions in their personal lives and at work, they made foolish decisions or no decisions at all. They alienated their families and their employers, and their lives fell apart.

Damasio’s interpretation was that gut feelings and bodily reactions were necessary to think rationally, and that one job of the vmPFC was to integrate those gut feelings into a person’s conscious deliberations. When you weigh the advantages and disadvantages of murdering your parents … you can’t even do it, because feelings of horror come rushing in through the vmPFC.

But Damasio’s patients could think about anything, with no filtering or coloring from their emotions. With the vmPFC shut down, every option at every moment felt as good as every other. The only way to make a decision was to examine each option, weighting the pros and cons using conscious verbal reasoning. If you’ve ever shopped for an appliance about which you have few feelings – say, a washing machine – you know how hard it can be once the number of options exceeds six or seven (which is the capacity of our short-term memory). Just imagine what your life would be like if at every moment, in every social situation, picking the right thing to do or say became like picking the best washing machine among ten options, minute after minute, day after day. You’d make foolish decisions too.


Our aim has always been to build a general reasoning machine that can be scaled up. We aren’t interested in building an artificial human, which carries the legacy of a long evolution through many incarnations.

This is the first time I’ve considered the importance of building intuition into the algorithm as a fundamental component. ‘Gut’ reactions are not to be underestimated. It may be the only way to make effective AGI, not to mention the need to create ‘pro-social’ agents with which we can interact in daily life.

It is possible, though, that this is an adaptation to the limitations of our reasoning, rather than a fundamentally required feature. If the intelligence was implemented in silicon and not bound by ‘cognitive effort’ in the same way that we are, it could potentially select favourable actions efficiently based on intellectual reasoning, without the ‘intuition’.

This is fascinating to think about in terms of human intelligence and behaviour. It raises exciting questions about the nature of intelligence itself and the relationship between cognition and both reasoning and intuition. We’ll be sure to consider these questions as we continue to develop an algorithm for AGI.

Addendum

From a functional perspective the vmPFC appears to be a separate parallel ‘component’ that is richly connected to many other brain areas.

“The ventromedial prefrontal cortex is connected to and receives input from the ventral tegmental area, amygdala, the temporal lobe, the olfactory system, and the dorsomedial thalamus. It, in turn, sends signals to many different brain regions including; The temporal lobe, amygdala, the lateral hypothalamus, the hippocampal formation, the cingulate cortex, and certain other regions of the prefrontal cortex.[4] This huge network of connections affords the vmPFC the ability to receive and monitor large amounts of sensory data and to affect and influence a plethora of other brain regions, particularly the amygdala.”

Wikipedia Ventromedial prefrontal cortex

AGI/AGIEF/Architecture/Experimental Framework

AGI Experimental Framework: A platform for AGI R&D

Posted by Gideon Kowadlo on

By Gideon Kowadlo and David Rawlinson

Introduction

We’ve been building and testing AGI algorithms for the last few years. As the systems become more complex, we have found it ever more difficult to run meaningful experiments. To summarise, the main challenges are:

  • testing a version of the algorithm repeatedly and over some range of parameters or conditions,
  • scaling it up so that it can run quickly,
  • debugging: the complexity of the ‘brain’ makes visualising and interpreting its state almost as hard as the problem itself!

Platforms for testing AIs already exist, such as the Arcade Learning Environment. There are also a number of standard datasets and frameworks for testing them. What we want is a framework for understanding the behaviour of an AI that can be applied successfully to any problem – it is supposed to be an Artificial General Intelligence, after all. The goal isn’t to advance the gold-standard incrementally; instead we want to better understand the behaviour of algorithms that might work reasonably well on many different problems.

Whereas most AI testing frameworks are designed to facilitate a particular problem, we want to facilitate understanding of the algorithms used. Further, the algorithms will have complex internal state and be variably parameterised from small instances on trivial problems to large instances – comprising many computers – on complex problems. As such there will be a lot of emphasis on interfaces that allow the state of the algorithm to be explored.

These design goals mean that we need to look more at the enterprise and web-scale frameworks for distributed systems, than test harnesses for AIs. There’s a huge variety of tools out there: Distributed filesystems, cloud resourcing (such as Elastic Compute), and cluster job management (e.g. many scientific packages available in Python). We’ll design a framework with the capability to jump between platforms as available technologies evolve.

Developing distributed applications is significantly harder than single-process software. Synchronization and coordination is harder (c.f. Apache Zookeeper), and there’s a lot of crud to get right before you can actually get to the interesting bits (i.e. the AGI). We’re going to try to get the boring stuff done nicely, so that others can focus on the interesting bits!

Foundational Principles

  • Agent/World conceptualisation
    • For AGI, we have developed a system based around Experiments, with each Experiment having Agents situated in a World.
  • Reproducible
    • All data is persisted by default so that any experiment can be reproduced from any time step.
  • Easy to run and use
    • Minimal setup and dependencies.
    • No knowledge of the implementation is required to implement a custom module (primarily the intelligent Agent or World in which it operates).
  • Highly modular (Scalability)
    • Different parts of the system can be customised, extended or overridden independently.
  • Distributed architecture (Scalability)
    • Modules can be run on physically separated machines, without any modification to the interactions between modules (i.e. the programmer’s perspective is not affected by scaling of the system to multiple computers).
  • Easy to develop
    • Code is open source.
    • Code is well documented.
    • All APIs are well documented and use standard protocols (at the moment RESTful, in future could be websockets or other).
  • Explorable / Visualisable
    • High priority placed on debugging and understanding of data rather than simply efficiency and throughput. We don’t yet know what the algorithm should look like!
    • All state is accessible, and relations can be explored.
    • Execution is on demand (step-by-step) or automatic (until criteria are met, or batches of experiments are completed).
    • It must be easy for anyone to build a UI client that can explore the state of all parts of the system.

Conceptual Entities

We have defined a number of components that make up an experiment. We refer to these components as Entities, and give them a specific interface (a minimal illustrative sketch follows the list below).

  • World
    • The simulated environment within which all the other simulated components exist.
  • Agent 
    • The intelligent agent itself. It operates within a World, and interacts with that World and (optionally) other Agents via a set of Sensors and Actuators.
  • Sensor
    • A means by which the Agent senses the world. The output is a function of a subset of the World state. For example, a unidirectional light sensor may provide the perceived brightness at the location of the sensor.
  • Actuator
    • A means by which an Agent acts on the World. The output is a simulated physical action. For example, a motor rotating a wheel.
  • Experiment
    • The Experiment Entity is a container for a World, and a set of Agents (each of which have a set of Sensors and Actuators), and an Objective Function which determines the terminating condition of the experiment (which may be a time duration).
  • Laboratory
    • A collection of Experiments that form a suite to be analysed collectively. This may be a set of Experiments that have similar setups with minor parameter variations.
  • ObjectiveFunction
    • The objective function computes metrics about the World and/or Agents that are necessary to provide Supervised Learning or Reinforcement Learning signals. It might instead provide a multivariate Optimization function. The ObjectiveFunction is a useful encapsulation because it is often easy to separate objective measurements from the AI that is needed to achieve them.
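
To make these Entities a little more concrete, here is a minimal, hypothetical sketch of what such interfaces could look like, written in Python for brevity; the actual AGIEF interfaces live in the Java project and differ in detail.

```python
from abc import ABC, abstractmethod

class Entity(ABC):
    """Common interface: every entity advances in steps and exposes its state."""
    @abstractmethod
    def step(self) -> None: ...
    @abstractmethod
    def get_state(self) -> dict: ...

class World(Entity):
    """The simulated environment within which the other simulated components exist."""

class Sensor(Entity):
    """Output is a function of a subset of World state (e.g. local brightness)."""

class Actuator(Entity):
    """Applies a simulated physical action to the World (e.g. a motor turning a wheel)."""

class Agent(Entity):
    """Operates within a World via a set of Sensors and Actuators."""
    def __init__(self, sensors: list, actuators: list):
        self.sensors, self.actuators = sensors, actuators

class ObjectiveFunction(Entity):
    """Computes metrics over the World/Agents, including the terminating condition."""

class Experiment(Entity):
    """Container for a World, a set of Agents and an ObjectiveFunction."""
    def __init__(self, world: World, agents: list, objective: ObjectiveFunction):
        self.world, self.agents, self.objective = world, agents, objective

class Laboratory:
    """A collection of Experiments analysed collectively, e.g. a parameter sweep."""
    def __init__(self, experiments: list):
        self.experiments = experiments
```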

Architecture

To enforce good design principles, the architecture is multi-layered and highly modular. Multiple layers (also known as a multi-tier architecture) allow you to work with concepts at the appropriate level of abstraction, which simplifies development and use of the system.

Each entity is a module. Use of particular entities is optional and extensible. A user will inherit from the entities that they choose, and implement the desired functionality. Another modularisation occurs with the AGIEF Nodes. They communicate via interprocess conventions so that components can be split between multiple host computers.

Interprocess communication occurs via a central interface called the Coordinator, which is a single point of contact for all Entities and the shared system state. This also enables graphical user interfaces to be built to control and explore the system.

These concepts are expanded in the sections below.

Design Considerations

The various components of the system may have huge in-memory data structures. This is an important consideration for persisting state, distributed operation, and the ability to visualise the state.

Processing to update the state of Worlds and Agents will be compute-intensive. Many AI methods can easily be accelerated by parallel execution. Therefore, the system can be broken down into many computing nodes, each tasked with performing a specific computational function on some part of the shared system state. We hope to support massively parallel hardware such as GPUs in these compute nodes.

We will write the bulk of the framework and initial algorithm implementations in Java. Others can extend on this, or develop against the framework in other languages. We will also write a graphical user interface using web technologies that will allow easy management of the system.

Perspectives on the system design

The architectural layers are shown in the diagram below.

Figure 1: ‘Architectural Layers’

Each layer is distinct, with strict separation. No layer has access to the layers above, which operate at a higher level of abstraction.

  • State:
    • State persistence: storage and retrieval of state of all parts of the system at every time step. This comprises the shared filesystem.
  • Interprocess:
    • Communications between all modules running in the system, locally and/or across a network.
    • Provides a single point of contact via a local interface, to any part of the system (which may be running in different physical locations), for both control signals and state.
  • Experiment:
    • Provides all of the entities that are required for an experiment. These are expanded shortly.
  • UI:
    • The user interface that an experimenter uses to run experiments, debug and visualise results.
    • The typical features would be:
      • set up parameters of an experiment,
      • run, stop, step through an experiment,
      • save/load an experiment,
      • visualise the state of any part of the experiment.
  • Specific Experiments:
    • This is defined by the person experimenting with the system. For example, a specific Agent that seeks light, a specific World that contains a light source, and an objective function that defines the time span for operation.

Another perspective on the design is to view the Services and Entities and their lines of communication. The diagram is colour coded to indicate Layers, as per the diagram above.

Figure 2: ‘Services and Entities’

The Coordinator and Database are services. The Coordinator is shown at the centre, as described earlier (Architecture section), being the primary point of contact for Entities and potentially other clients such as a Graphical User Interface.

A similar perspective is shown in an expanded diagram below that illustrates the Database API module and the distributed implementation of the Coordinator in the Interprocess layer, enabling Entities to run on separate machines. This is just one possible configuration; there can be multiple slaves, each with multiple entities.

Each bounding box indicates what we refer to as an AGIEF Node (or node for short). The node comprises a process that provides a context for execution of one or more entities, or other clients such as the GUI.
Figure 3: ‘AGIEF Nodes’

We looked at popular NoSQL web storage systems (basically key-value stores), which are very convenient and flexible due to their inherently dynamic, software-defined schemas and HTTP interfaces. However, we have a relatively static schema for our data, on which we will build utilities for managing experiments and visualising data. In addition, relational databases such as MySQL and PostgreSQL are beginning to offer HTTP interfaces as well. Whether we pick a NoSQL or relational database, we will require an HTTP interface.

A third perspective is the data model that represents the system in its entirety. This is the model implemented in the database.

Figure 4: ‘Data Model’

The data model stores the entire system state, including hierarchy and relationship between entities, as well as the state of each entity. With a RESTful API exposing the database, we have a shared filesystem accessible as a service, essential for distributed operation and restoring the system at any point in time.
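
As an illustration of what 'state as a service' could look like from a client's point of view, here is a hypothetical sketch; the endpoint names and payload shapes are invented for illustration and are not the actual AGIEF API.

```python
import json
import urllib.request

BASE = "http://localhost:8080"   # hypothetical Coordinator/Database service address

def save_entity_state(name, step, state):
    """Persist an entity's state for a given execution step (hypothetical endpoint)."""
    req = urllib.request.Request(
        f"{BASE}/entities/{name}/state/{step}",
        data=json.dumps(state).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def load_entity_state(name, step):
    """Retrieve persisted state, enabling an experiment to resume from any step."""
    with urllib.request.urlopen(f"{BASE}/entities/{name}/state/{step}") as resp:
        return json.load(resp)
```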

Future Work

We will shortly be releasing an initial version of our framework and we’ll post about the technology choices we’ve made, and some alternatives. We’ll include a demonstration problem with the initial release and then start rolling out some more exciting algorithms and graphics, including lots of AI methods from the literature (we have hundreds in our old codebase ready to go).

AI/Computational/Consciousness/Model

Consciousness and AI

Posted by Gideon Kowadlo on

I really enjoyed reading this article Why can’t the world’s greatest minds solve the mystery of consciousness? published recently in the Guardian. It’s an engaging re-exploration of the contemporary discourse on the classic mind-body dilemma.

One view is that the ‘hard problem’ of consciousness is an illusion: there is no separation between mind and body, and consciousness is simply a result of the computational machinery of the brain – an algorithm. Whether you are willing to make this leap completely or not, there is no denying that the physical reality of the brain influences the experience and existence of consciousness.

This leads us to questions of the relationship between AI and consciousness. The topic is not often discussed. It is somewhat taboo, much like the topic of consciousness was itself, until recently. As the author points out with regard to an influential conference on consciousness at the University of Arizona in 1994, “in many quarters, consciousness was still taboo, too weird and new agey to take seriously, and some of the scientists in the audience were risking their reputations by attending.”

However, the relationship between consciousness and intelligence is one of the most interesting aspects of cognition. Consciousness may even prove to be an important factor in higher intelligence, and conversely, understanding human level intelligence may shed light on the meaning and fabric of consciousness. We can only be certain of our own consciousness, yet we have no way to understand what it is.

There are a few salient attempts to describe cognition and consciousness in computational terms, two of which are listed below. Please leave a comment if you know of other examples. It would be great to see more active discussion on the topic.

The Computational Theory of Mind, Stanford Encyclopedia of Philosophy

Edited by Edward N. Zalta, published in The Stanford Encyclopedia of Philosophy, 2011.

and

Accounting for the computational basis of consciousness: a connectionist approach

By Ron Sun, published in Consciousness and Cognition, 1999.
'no input' state/CLA/missing data/Predictive Coding/Sparse Distributed Representations/Theory

When is missing data a valid state?

Posted by Gideon Kowadlo on

By Gideon Kowadlo, David Rawlinson and Alan Zhang

Can you hear silence or see pitch black?
Should we classify no input as a valid state or ignore it?

To my knowledge, the machine learning and statistics literature mainly regards an absence of input as missing data, which is handled in several ways. The absent value can be treated as a missing data point: a value is inferred (imputed) and then treated as the real input. When a period of no data occurs at the beginning or end of a stream (time series data), it can be ignored, which is referred to as censoring. Finally, when a variable can never be (or is never) observed, it can be viewed as data that is always missing, and modelled with what are referred to as latent or hidden variables. I believe there is more to the question of whether an absence of input is in fact a valid state, particularly when learning time-varying sequences and when considering computational parallels of biological processes, where an absence of signal might never occur.
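
For a concrete (if trivial) picture of the distinction, here is a small sketch contrasting imputation, which treats absence as just another value, with keeping an explicit mask, which preserves 'missing' as its own status. The numbers are placeholders.

```python
import numpy as np

x = np.array([0.9, np.nan, 0.4, np.nan, 0.7])   # a stream with missing observations

# Option 1: impute a value and then treat it as if it were the real input.
imputed = np.where(np.isnan(x), np.nanmean(x), x)

# Option 2: keep an explicit mask, so 'missing' stays distinguishable from every real value.
mask = ~np.isnan(x)

print(imputed, mask)
```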

It is also relevant in the context of systems where ‘no signal’ is an integral type of message that can be passed around. One such system is Predictive Coding (PC), a popular theory of cortical function within the neuroscience community. In PC, prediction errors are fed forward (see the PC post [1] for more information). Therefore, perfectly correct predictions result in ‘no-input’ at the next level, which may occur from time to time given that producing such predictions is the objective of the encoding system.

Let’s say your system is classifying sequences of colours Red (R), Green (G) and Blue (B), with periods of no input which we represent as Black (K). There is a sequence of colours RGB, followed by a period of K, then BGR and then two steps of K again, illustrated below (the figure is a Markov graph representation).

Figure 1: Markov graph representation of a sequence of colour transitions.
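
To ground the discussion, here is a small sketch of the kind of model in question: first-order transition counts estimated from an observed colour stream, once treating K as a state like any other and once dropping it entirely. The stream simply repeats the example sequence above.

```python
from collections import Counter, defaultdict

# Repeated example sequence: R G B, K, B G R, K K
stream = list("RGBK" "BGR" "KK") * 100

def transition_counts(seq):
    counts = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts

with_k = transition_counts(stream)                               # K treated as a valid state
without_k = transition_counts([s for s in stream if s != "K"])   # K ignored entirely

# With K, the model can learn e.g. how long K persists before R reappears;
# without K, that timing information is simply absent from the model.
print(dict(with_k["K"]))
print(dict(without_k["B"]))
```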

What’s in a name?
What actually defines Black as no input?

This question is explored in the following paragraphs along with Figure 2 below. We start with the way the signal is encoded. In the case of an image, each pixel is a tuple of scalar values, including black (K) with a value of (0, 0, 0). No specific component value has a privileged status. We could define black as any scalar tuple. For other types of sensors, signal modulation is used to encode information. For example, frequency of binary spikes/firing is used in neural systems. No firing, or more generally no change, indicates no input. Superficially it appears to be qualitatively different. However, a specific modulation pattern can be mapped to a specific scalar value. Are they therefore equivalent?

We reach some clarity by considering the presence of a clock as a reference. The use of signal modulation suggests the presence of a clock, but does not necessitate one. With an internal clock, modulation can be measured in a time-absolute* sense and mapped to a scalar representation, and the status of the no-input state does indeed become equivalent to the case of a scalar input with a clock, i.e. no value is privileged.

Where there is no clock, for either type of signal encoding, time can effectively stand still for the system. If the input does not change at all, there is no way to perceive the passage of time. For scalar input, this means that the input does not transition. For modulated input, it includes the most obvious type of ‘no-input’, no firing or zero frequency.

This would obviously present a problem to an intelligent agent that needs to continue to predict, plan and act in the world. Although there are likely to be inputs to at least some of the sensors, it suggests that biological brains must have an internal clock. There is evidence that the brain has multiple clocks, summarised here in Your brain has two clocks [2]. I wonder if the time course of perceptible bodily processes or thoughts themselves could be sufficient for some crude perception of time.

Figure 2: Definition of ‘no-input’ for different system characteristics.
* With respect to the clock at least. This does give rise to the interesting question of the absoluteness of the clock itself. Assume for argument's sake that consciousness can be achieved with deterministic machines. The simulated brain won’t know how fast time is running. You can pause it and resume without it being any wiser.
If we assume that we can define a ‘no-input’ state, how would we approach it?

The system could be viewed as an HMM (Hidden Markov Model): the sensed/measured states are observations of hidden world states that cannot be measured directly. Let us make many observations, look at the statistics of K's occurrence, and compare them to those of the other observable states. If the statistics are similar, we can assume option A – no special meaning. If, on the other hand, it occurs between the other observable sequences (sequences which are not correlated with each other), and is therefore not significantly correlated with any transitions, then we can say that it is B – a delineator.

A – no special meaning

There are two options: treat K as any other state, or ignore it. For the former, it's business as usual. For the latter, ignoring the input, there don't seem to be any consequences, for the following reason. The system will identify at least two shorter sequences, one before K and one after. Any type of sequence learning must in any case have an upper limit on the length of the representable sequences* (unlike the theoretical Turing Machine); this will just make those sequences shorter. In the case of hierarchical algorithms such as HTM/CLA, higher levels in the hierarchy will integrate these sub-sequences into longer (more abstracted) temporal sequences.

However, ignoring K will have implications for learning the timing of state persistence and transitions. If the system ignores state K, including the timing information, then modelling will be incomplete. For example, referring back to Figure 1, K occurs for two time steps before the transition back to R. This is important information for learning to predict when this R will occur. Additionally, the transition to K signalled the end of the occurrence of R preceding K. Another example is illustrated below in Figure 3. Here, K following B is a fork between two sub-chains. The transition to R occurs 95% of the time. That information can be used to make a strong prediction about future transitions from this K; however, if K is ignored, as shown on the right of the figure, the information is lost and the prediction is not possible.

Figure 3: Markov chain showing some limitations of ignoring K.

* However, it is possible to have the ability to represent sequences far longer than the expected observable sequences with enough combinatorial power, as described in CLA and argued to exist in biological systems.


B – a delineator

This is the case where the ‘no-input’ state is not correlated (above some significant measure) with any observed sequence. The premise of this categorisation is that, due to the lack of correlation, it is an effectively meaningless state. However, it can be used to make inferences about the underlying state. Using the example from Figure 1, based on repeated observations, the statement could be made that R, G and B summarise hidden states. We can also surmise that there are states that generate white noise, in this example random selections of R, G, B or K. This can be inferred since we never observe the same signal twice when in those states. Observations of K are then useful for modelling the hidden states, which calls into question the definition of K as ‘no-input’.

However, it may in fact be an absence of input. In any case, we did not observe any correlations with other sequences. Therefore in practice, this is similar to ‘A – no special meaning – ignore the state’. The difference is the semantic meaning of the ‘no-input’ state as a delineator. There is also no expectation that there is meaningful information in the duration of the absence of input. The ‘state’ is useful to indicate the sequence is finished, and therefore defines the timing of persistence of the last state of the sequence.

CLA and hierarchical systems

Let us turn our attention briefly to the context of HTM CLA [3]. CLA utilises Sparse Distributed Representations (see the SDR post [4] for more information) as a common data structure in a hierarchical architecture. A given state, represented as an SDR, will normally be propagated to the level above, which also receives input from other regions. It will therefore be represented as one (or possibly more) of many bits in the state above. Each bit is semantically meaningful. A ‘0’ should therefore be as meaningful as a ‘1’. The questions discussed above arise when the SDR is completely zero, which I’ll refer to as a ‘null SDR’.

The presence of a null SDR depends on the input source, the presence of noise and the implementation details of the encoders. In a given region, the occurrence of null SDRs will tend to dissipate, as the receptive field adjusts until a certain average complexity is observed. In addition, null SDRs become increasingly unlikely as you move up the hierarchy and incorporate larger and larger receptive fields, thus increasing the surface area for possible activity. If the null SDR can still occur occasionally, there may be times when it is significant. If it is not classified, will the higher levels in the hierarchy recover the ‘lost’ information? This question applies to other hierarchical systems and will be investigated in future posts.
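
For concreteness, here is a small sketch of what a 'null SDR' looks like if SDRs are represented as binary vectors; the vector size, active bits and overlap measure are illustrative only.

```python
import numpy as np

def is_null_sdr(sdr):
    """An SDR with no active bits carries no active semantic components."""
    return not np.any(sdr)

def overlap(a, b):
    """Overlap (shared active bits) is the usual similarity measure between SDRs."""
    return int(np.sum(np.logical_and(a, b)))

sdr_a = np.zeros(2048, dtype=bool)
sdr_a[[3, 40, 77, 613]] = True          # a few sparse active bits
null_sdr = np.zeros(2048, dtype=bool)   # the 'null SDR': completely zero

print(is_null_sdr(null_sdr), is_null_sdr(sdr_a))   # True False
print(overlap(sdr_a, null_sdr))                    # 0 – a null SDR overlaps with nothing
```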

So what ……. ?

What does all of this mean for the design of intelligent systems? A realistic system will be operating with multiple sensor modalities and will be processing time varying inputs (regardless of the encoding of the signal). Real sensors and environments are likely to produce background noise, and in front of that, periods of no input in ways that are correlated with other environmental sequences, and in ways that are not – relating to the categorisations above ‘A – no special meaning’ and ‘B – a delineator’. There is no simple ‘so what’, but hopefully this gives us some food for thought and shows that it is something that should be considered. In future I’ll be looking in more depth at biological sensors and the nature of the signals that reach the cortex (are they ever completely silent?), as well as the implications for other leading machine learning algorithms.

References

[1] On Predictive Coding and Temporal Pooling
[2] Emilie Reas, Your brain has two clocks, Scientific American, 2013
[3] HTM White Paper
[4] Sparse Distributed Representations (SDRs)

Artificial General Intelligence/Memes/Natural Selection/Singularity/Theory

Constraints on intelligence

Posted by Gideon Kowadlo on

By Gideon Kowadlo and David Rawlinson

Introduction

This article contains some musings on the factors that limit the increase of our intelligence as a species.

We speculate that ultimately, our level of intelligence is limited by at least two factors, and possibly a third:

  1. our own cultural development,
  2. physical constraints, and
  3. an intelligence threshold.

We’ll now explore each of these factors.

Cultural Development

Natural Selection

Most readers are familiar with Natural Selection. The best known and dominant mechanism is that fitter biological organisms in a population tend to survive longer, reproduce more frequently and successfully, and pass on their traits to the next generation. Given some form of external pressure and therefore competition, such as resource constraints, the species on average is likely to increase in fitness. In competition with other species, this is necessary for species survival.

Although this is the mechanism we are focusing on in this post, there are other important forms of selection. Two examples are ‘Group Selection’ and ‘Sexual Selection’. Group selection favours traits that benefit the group over the individual, such as altruism, especially when the group shares common genes. Sexual selection favours traits that improve an individual’s success in reproducing by two means: being attractive to the other gender, and the ability to compete with rivals of the same gender. Sometimes sexually appealing characteristics are highly costly or risky to individuals, for example by making them vulnerable to predators.

Culture

Another influence on ability to survive is culture. Humans have developed culture, and some form of culture is widely believed to exist in other species such as primates and birds (e.g. Science). Richard Dawkins introduced the concept of memes, cultural entities that evolve in a way that is analogous to genes. The word meme now conjures up funny pictures of cats (see Wired magazine’s article on the re-appropriation of the word meme), and no-one is complaining about that, but it’s hard to argue that these make us fitter as a species. However, it’s clear that cultural evolution, by way of technological progress, can have a significant influence. This could be negative, but is generally a positive, making us more likely to survive as a species.

Culture and Biology

A thought experiment regarding the effects of natural selection and cultural development on survival, and of their relationship with each other, is explored with the graph below.

Figure 1: A thought experiment: The shape of survivability vs time, due to cultural evolution, and due to natural selection. The total survivability is the sum of the two. Survivability due to natural selection plateaus when it is surpassed by survivability due to cultural evolution. Survivability due to cultural evolution plateaus when cultural development allows almost everyone in the population to survive.

For humans, the main biological factor contributing to survival is our intellect. The graph shows how our ability to survive steadily improves with time as we evolve naturally. The choice of linear growth is based on the fact that the ‘force’ for genetic change does not increase or decrease as that change occurs*. On the other hand, it is thought that cultural evolution improves our survivability exponentially. In recent years, this has been argued by well known authors and thinkers such as Ray Kurzweil and Eliezer S. Yudkowsky in the context of the Technological Singularity. We build on knowledge continuously, and leverage our technological advances. This enables us to make ever larger steps, as each generation exploits the work of the preceding generations. As Isaac Newton wrote, “If I have seen further it is by standing on the shoulders of giants” **. Many predict that this will result in the ability to create machines that surpass human intelligence. The point at which this occurs is known as the aforementioned Technological Singularity.

Cultural Development – Altruism

Additionally, cultural evolution could include the development of humanitarian and altruistic ideals and behaviour: an environment in which communities care for all their people would increase the survivability of (almost) everyone to the threshold of reproduction, leaving only a varied ability to prosper beyond survival. This is shown in the figure above as a plateau in survivability due to cultural evolution.

Cultural Development – Technology

Cultural factors dominate once survivability due to cultural evolution and technological development surpasses that due to natural selection. For example, the advantages given by use of a bow and arrow for hunting will reduce the competitive advantage of becoming a faster runner. Having a supermarket at the end of your street will render faster running insignificant. The species would no longer evolve biologically through the same process of natural selection. Other forces may still cause biological evolution in extreme cases, such as resistance to new diseases, but this is unlikely to drive the majority of further change. This means that biological evolution of our species would stagnate***. This effect is shown in the graph as the plateau in survivability due to natural selection.

* On a fine scale, this would not be linear and would be affected by many many unpredictable factors such as climate changes, other environmental instability as well as successes/failures of other species.

** Although this metaphor was first recorded in the twelfth century and has been attributed to Bernard of Chartres.

*** Interestingly, removal of selective pressure does not allow species to rest at a given level of fitness. Deleterious mutations rapidly accumulate within the population, giving us a short window of opportunity to learn to control and improve our own genetic heritage.

Physical Constraints

One current perspective in neuroscience, and the basis for our work and this blog, is that much of our intelligence emerges from, very simply put, a hierarchical assembly of regions of identical computational units (cortical columns). As explained in previous posts (here and here), this is physically structured as a sheet of cortex that forms connections from region to region. The connecting regions are conceptually at different levels in the hierarchy, and the connections themselves form the bulk of the cortex. We believe that with an increasingly deep hierarchy, the brain is able to represent increasingly abstract and general spatiotemporal concepts, which would play a significant role in increasing intelligence.

The reasoning above predicts that the number of neurons and connections is correlated with intelligence. These neurons and connections have mass and volume and require a blood supply. They cannot increase indefinitely.

Simply increasing the size of the skull has its drawbacks. Maintaining a stable temperature becomes more difficult, and structural strength is sacrificed. The body would need to become disproportionately large to carry the extra mass, making the animal less mobile, and energy demands would be higher. Larger distances for neuronal connections lead to slower signal propagation, which could also have a negative impact. Evidence of the consequences of such physical constraints is found in the fact that the brain folds in on itself, appearing wrinkled, in order to maximise surface area (and hence the number of neurons and connections) within the given volume of the skull. Evolution has produced a tradeoff between these characteristics that limits our intelligence to promote survival.

It is possible to imagine completely different architectures that might circumvent these limitations. Perhaps a neural network distributed throughout the body, such as exists for some marine creatures. However, it is implausible that physical constraints would not ultimately be a limiting factor. Also, reality is more constrained than our imagination. For example, it must be physically and biologically possible for the organism to develop from a single cell to a neonate, and on to a reproducing adult.

An Intelligence Threshold

There could be a point at which the species crosses an intelligence threshold, beyond which higher intelligence does not confer a greater ability to survive. However, since the threshold may be dictated by cultural evolution it is very difficult to separate the two. For example, the threshold might be very low in an altruistic world, and it is possible to envision a hyper-competitive and adversarial culture in which the opposite is true.

But perhaps a threshold exists as a result of a fundamental quality of intelligence, completely independent of culture. Could it be that once you can grasp concepts at a sufficient level of abstraction, and have the ability to externalise and record concepts with written symbols (thereby extending the hierarchy outside of the physical brain), it would be possible to conduct any ‘thought’ computation, given enough working memory, concentration and time? Similarly, a Turing Machine is capable of carrying out any computation, given infinite memory.

The topic of consciousness and its definition is beyond the scope of this post. However, accepting that there appears to be a clear relationship between intelligence and what most people understand as consciousness, this ‘Intelligence Threshold’ has implications for consciousness itself. It is interesting to ponder the threshold as having a corresponding crossing point in terms of conscious experience.

We may explore the existence and nature of this potential threshold in greater detail in the future.

Impact of Artificial General Intelligence (AGI)

The biological limitations to intelligence discussed in this article show why Artificial General Intelligence (AGI) will be such a dramatic development. We still exist in a physical world (at least perceptibly), but building an agent out of silicon (or other materials in the future) will effectively free us from all of these constraints. It also allows us to modify parameters and architecture, and to monitor activity. It will be possible to invest large quantities of energy into ‘thinking’ in a mind that does not fatigue. Perhaps this is a key enabling technology on the path to the Singularity.