The Region-Layer: A building block for AGI

Posted by ProjectAGI
Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar – overlapping – but unique Receptive Fields – illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column’s input. 

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner, and can be sandwiched by other Region-Layers on parallel surfaces, by which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, and another the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see earlier articles on Layers and Levels.
The Region-Layer has a biological analogue – it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.
We assume the Region-Layer is trained by unsupervised learning only – it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online: the agent learns from experience as it goes, and should adapt to non-stationary input statistics at any time.
The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.
We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols. 
Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.
So let’s put our cards on the table and test these ideas. 

Region-Layer Implementation

Parameters

The algorithm outlined below requires very few parameters. The few that are needed merely describe the resources available to the Region-Layer. In theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.
  • RW: Width of region layer in Columns
  • RH: Height of region layer in Columns
  • CW: Width of column in Cells 
  • CH: Height of column in Cells

Inputs and Outputs

  • Feed-Forward Input (FFI): Sparse and binary. Size: a matrix of any dimension*.
  • Feed-Back Input (FBI): Sparse and binary. Size: a vector of any dimension.
  • Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+.
  • Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+.
* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.
+  Region Area = CW * CH * RW * RH
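
As a concrete illustration of these sizes, here is a minimal sketch in Python (our own, with arbitrary example values; the variable names are hypothetical) of the buffers a Region-Layer with these parameters would allocate:

    import numpy as np

    RW, RH = 8, 8   # Region-Layer width and height, in Columns (example values)
    CW, CH = 4, 4   # Column width and height, in Cells (example values)

    region_area = RW * RH * CW * CH   # total number of Cells; size of FFO and PDI

    ffo        = np.zeros(region_area, dtype=np.uint8)   # Feed-Forward Output: one bit per Cell
    activity   = np.zeros(region_area, dtype=np.uint8)   # currently active Cells
    prediction = np.zeros(region_area, dtype=np.uint8)   # Cells predicted at the previous step
    pdi        = np.zeros(region_area, dtype=np.uint8)   # Prediction Disinhibition Input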

Pseudocode

    Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.
    We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!
    function: UpdateAndTrain( 
      feed_forward_input, 
      feed_back_input, 
      prediction_disinhibition 
    )

    // if no active input, then do nothing
    if( sum( feed_forward_input ) == 0 ) {
      return
    }

    // Sparse activation
    // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm, 
    // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
    activity(t) = 0

    for-each( column c ) {
      // find cell x that most responds to FFI 
      // in current sequential context given: 
      //  a) prior active cells in region 
      //  b) feedback input.
      x = findBestCellsInColumn( feed_forward_input, feed_back_input, c )

      activity(t)[ x ] = 1
    }

    // Change detection
    // if active cells in region unchanged, then do nothing
    if( activity(t) == activity(t-1) ) {
      return
    }

    // Update receptive fields to organize columns
    trainReceptiveFields( feed_forward_input, columns )

    // Update cell weights given column receptive fields
    // and selected active cells
    trainCells( feed_forward_input, feed_back_input, activity(t) )

    // Predictive coding: output false-negative errors only [5]
    for-each( cell x in region-layer ) {

      coding = 0

      if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
        coding = 1
      }
      // optional: mute output from region, for attentional gating of hierarchy
      if( prediction_disinhibition(t)[x] == 0 ) {
        coding = 0 
      }

      output(t)[x] = coding
    }

    // Update prediction
    // Note: Predictor can be as simple as first-order Hebbian learning. 
    // The prediction model is variable order due to the inclusion of sequential 
    // context in the active cell selection step.
    trainPredictor( activity(t), activity(t-1) )
    prediction(t) = predict( activity(t) )
    [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf
    [2] https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
    [3] http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
    [4] https://arxiv.org/pdf/1312.5663
    [5] http://www.ncbi.nlm.nih.gov/pubmed/10195184
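
    To make the “Sparse activation” step above more concrete, here is a minimal Python sketch of a simplified findBestCellsInColumn loop. It is only an illustration: the sequential-context and feedback terms are omitted, and the names (rf_masks, cell_weights) are hypothetical. Receptive fields and weights are assumed to have been learned elsewhere, e.g. by one of the competitive learning algorithms cited above [2, 3, 4].

        import numpy as np

        def sparse_activation(ffi, rf_masks, cell_weights):
            # ffi: binary feed-forward input vector
            # rf_masks: one binary mask per Column (its receptive field over the input)
            # cell_weights: one (cells_per_column x input_size) weight matrix per Column
            cells_per_column = cell_weights[0].shape[0]
            activity = np.zeros(len(rf_masks) * cells_per_column, dtype=np.uint8)
            for c, (mask, w) in enumerate(zip(rf_masks, cell_weights)):
                visible = ffi * mask                 # input within this Column's receptive field
                responses = w @ visible              # each Cell's response to that input
                winner = int(np.argmax(responses))   # best-matching Cell in this Column
                activity[c * cells_per_column + winner] = 1
            return activity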

    How to build a General Intelligence: An interpretation of the biology

    Posted by ProjectAGI
    Figure 1: Our interpretation of the Thalamocortical system as 3 interacting sub-systems (objective, subjective and executive). The structure of the diagram indicates the dominant direction of information flow in each system. The objective system is primarily concerned with feed-forward data flow, for the purpose of building a representation of the actual agent-world system. The executive system is responsible for making desired future agent-world states a reality. When predictions become observations, they are fed back into the objective system. The subjective system is drawn as a circle because its behaviour depends on internal state as much as on external input. The subjective system builds a filtered, subjective model of observed reality, which also represents objectives or instructions for the executive. This article will describe how this model fits into the structure of the Thalamocortical system.

    Authors: David Rawlinson and Gideon Kowadlo

    This is part 4 of our series on how to build an artificial general intelligence (AGI).

    • Part 1: An overview of hierarchical general intelligence
    • Part 2: Reverse engineering (the physical perspective – cells and layers – and the logical perspective – a hierarchy).
    • Part 3: Circuits and pathways; we introduced our canonical cortical micro-circuit and fitted pathways to it.

    In this article, part 4, we will try to interpret all the information provided so far. We will try to fit what we know about biological general intelligence to our theoretical expectations.

    Systems

    We believe cortical activity can be usefully interpreted as 3 integrated systems. These are:

    • Objective system
    • Subjective system
    • Executive system

    So, what are these systems, why are they needed and how do they work?

    Objective System

    We theorise that the purpose of the objective system is to construct a hierarchical, generative model of both the external world and the actual state of the agent. This includes internal plans & goals already executed or in progress. From our conceptual overview of General Intelligence we think that this representation should be distributed and compositional, and therefore robust and able to model novel situations immediately and meaningfully.

    The objective system models varying timespans depending on the level of abstraction, but events are anchored to the current state of the world and agent. Abstract events may cover long periods of time – for example, “I made dinner” might be one conceptual event.

    We propose that the objective system is implemented by pyramidal cells in layers 2/3 and by spiny excitatory cells in layer 4. Specifically, we suggest that the purpose of the spiny excitatory cells is primarily dimensionality reduction, by performing a classifier function, analogous to the ‘Spatial Pooling’ function of Hawkins’ HTM theory. This is supported by analysis of C4 spiny stellate connectivity: “… spiny stellate cells act predominantly as local signal processors within a single barrel…”. We believe the pyramidal cells are more complex and have two functions. First, they perform dimensionality reduction by requiring a set of active inputs on specific apical (distal) dendrite branches to be simultaneously observed before the apical dendrite can output a signal (an action potential). Second, they use basal (proximal) dendrites to identify the sequential context in which the apical dendrite has become active. Via a local competitive process, pyramidal cells learn to become active only when observing a set of specific input patterns in specific historical contexts.

    The output of pyramidal cells in C2/3 is routed via the Feed-Forward Direct pathway to a “higher” or more abstract cortical region, where it enters in C4 (or in some parts of the Cortex, C2/3 directly). In this “higher” region, the same classifier and context recognition process is repeated. If C4 cells are omitted, we have less dimensionality reduction and a greater emphasis on sequential or historical context.

    We propose these pyramidal cells only output along their axons when they become active without entering a “predicted” state first. Alternatively, interneurons could play a role in inhibiting cells via prediction to achieve the same effect. If pyramidal cells only produce an output when they make a False-Negative prediction error (i.e. they fail to predict their active state), output is equivalent to Predictive Coding (link, link). Predictive Coding produces an output that is more stable over time, which is a form of Temporal Pooling as proposed by Numenta.

    To summarize, the computational properties of the objective system are:

    1. Replace simultaneously active inputs with a smaller set of active cells representing particular sub-patterns, and
    2. Replace predictable sequences of active cells with a false-negative error coding, to transform the output into a simpler sequence of prediction errors.

    These functions will achieve the stated purpose of incrementally transforming input data into simpler forms with accumulating invariances, while propagating (rather than hiding) errors, for further analysis in other Columns or cortical regions. In combination with a tree-like hierarchical structure, higher Columns will process data with increasing breadth and stability over time and space.

    The Feed-Forward direct pathway is not filtered by the Thalamus. This means that Columns always have access to the state of objective system pyramidal cells in lower columns. This could explain the phenomenon that we can process data without being aware of it (aka “Blindsight”); essentially the objective system alone does not cause conscious attention. This is a very useful quality, because it means the data required to trigger a change in attention is available throughout the cortex. The “access” phenomenon is well documented and rather mysterious; the organisation of the cortex into objective and subjective systems could explain it.

    Another purpose of the objective system is to ensure internal state cannot become detached from reality. This can easily occur in graphical models, when cycles form that exclude external influence. To prevent this, we believe that the roles of feed-forward and feed-back input must be separated to break the cycles. However, C2/3 pyramidal cells’ dendrites receive both feed-forward (from C4) and feed-back input (via C1).

    One way that this problem might be avoided is by different treatment of feed-forward and feed-back input, so that the latter can be discounted when it is contradicted by feed-forward information. There is evidence that feed-forward and feedback signals are differently encoded, which would make this distinction possible.

    We speculate that the set of states represented by the cells in C2/3 could be defined only using feed-forward input, and that the purpose of feedback data in the objective system is restricted to improved prediction, because feedback contains state information from a larger part of the hierarchy (see figure 2).

    Figure 2: The benefit of feedback. This figure shows part of a hierarchy. The hierarchy structure is defined by the receptive fields of the columns (shown as lines between cylinders, left). Each Column has receptive fields of similar size. Moving up the hierarchy, Columns receive increasingly abstract input with a greater scope, being at the top of a pyramid of lower Columns whose receptive fields collectively cover a much larger area of input. Feedback has the opposite effect, summarizing a much larger set of Column states from elsewhere and higher in the hierarchy. Of course there is information loss during these transfers, but all data is fully represented somewhere in the hierarchy.

    So although the objective system makes use of feedback, the hierarchy it defines should be predominantly determined by feed-forward information. The feed-forward direct pathway (see figure 3) enables the propagation of this data and consequently the formation of the hierarchy.

    Figure 3: Feed-Forward Direct pathway within our canonical cortical micro-circuit. Data travels from C4 to C2/3 and then to C4 in a higher Column. This pattern is repeated up the hierarchy. This pathway is not filtered by the Thalamus or any other central structure, and note that it is largely uni-directional (except for feedback to improve prediction accuracy). We propose this pathway implements the Objective System, which aims to construct a hierarchical generative model of the world and the agent within it.

    Subjective System

    We think that the subjective system is a selectively filtered model of both external and internal state including filtered predictions of future events. We propose that filtering of input constitutes selective attention, whereas filtering of predictions constitutes action selection and intent. So, the system is a subjective model of reality, rather than an objective one, and it is used for both perception and planning simultaneously.

    The time span encompassed by the system includes a subset of both present and future event-concepts, but as with the objective system, this may represent a long period of real-world time, depending on the abstraction of the events (for example, “now” I am going to work, and “next” I will check my email [in 1 hour’s time]).

    It makes good sense to have two parallel systems, one filtered (subjective) and one not (objective). Filtering external state reduces distraction and enhances focus and continuity. Filtering of future predictions allows selected actions to be maintained and pursued effectively, to achieve goals.

    In addition to events the agent can control, it is important to be aware of negative outcomes outside the agent’s control. Therefore the state of the subjective system must include events with both positive and negative reward outcomes. There is a big difference between a subjective model and a goal-oriented planning model. The subjective system should represent all outcomes, but preferentially select positive outcomes for execution.

    The subjective system represents potential future states, both internal and external. It does not necessarily represent reality; it represents a biased interpretation of intended or expected outcomes based on a biased interpretation of current reality! These biases and omissions are useful; they provide the ability to “imagine” future events by serially “predicting” a pruned tree of potential futures.

    More speculatively, differences between the subjective and objective systems may be the cause of phenomena such as selective awareness and “access” consciousness.

    Figure 4: Feed-Forward Indirect pathway, particularly involved in the Subjective system due to its influence on C5. The Thalamus is involved in this pathway, and is believed to have a gating or filtering effect. Data flows from the Thalamus to C4, to C2/3, to C5 and then to a different Thalamic nucleus that serves as the input gateway to another cortical Column in a different region of the Cortex. We propose that the Feed-Forward Indirect pathway is a major component of the subjective system.
    Figure 5:  The inhibitory micro-circuit, which we suggest makes the subjective system subjective! The red highlight shows how the Thalamus controls activity in C5 by activating inhibitory cells in C4. The circuit is completed by C5 pyramidal cells driving C6 cells that modulate the activity of the same Thalamic nuclei that selectively activate C5.

    The subjective system primarily comprises C5 (where subjective states are represented) and the Thalamus (which controls subjectivity), but it draws input from the objective system via C2/3. The latter provides context and defines the role and scope (within the hierarchy) of C5 cells in a particular column. Between each cortical region (and therefore every hierarchy level), input to the subjective system is filtered by the Thalamus (figure 5). This implements the selection process. The Feed-Forward Indirect pathway includes these Thalamo-Cortical loops.

    We suggest the Thalamus implements selection within C5 using special cells in C4 that are activated by axons (outputs) from the Thalamus (see figure 6). These inhibitory C4 cells target C5 pyramidal cells and inhibit them from becoming active. Therefore, thalamic axons are both informative (“this selection has been made”) and executive (the axon drives inhibition of selected C5 pyramidal cells).

    Figure 6: Thalamocortical axons (afferents) are shown driving inhibitory cells in C4 (leftmost green cell) that in turn inhibit pyramidal cells in C5 (red). They also provide information about these selections to other layers, including C2/3. When a selection has been made, it becomes objective rather than subjective, hence provision of a copy to C2/3. Image source.


    Note that selection may be a process of selective dis-inhibition rather than direct control: Selection alone may not be enough to activate the C5 cells.  Instead, C5 pyramidal cells likely require both selection by the Thalamus, and feed-forward activation via input from C2/3. The feed-forward activation could occur anywhere within a window of time in which the C5 cell is “selected”.  This would relax timing requirements on the selection task, making control easier; you only need to ensure that the desired C5 cell is disinhibited when the right contextual information arrives from other sources (such as C2/3). This also ensures C5 cell activation fits into its expected sequence of events and doesn’t occur without the right prior context.

    C5 also benefits from informational feedback from higher regions and neighbouring cells that help to define unique contexts for the activation of each cell.

    We suggest that C5 pyramidal cells are similar to C2/3 pyramidal cells but with some differences in the way the cells become active. Whereas C2/3 cells require both matching input via the apical dendrites and valid historical input to the basal dendrites to become active, C5 cells additionally need to be disinhibited for full activation to occur.
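
    To make the distinction concrete, here is a tiny sketch (our own illustration, in Python) of the activation conditions proposed above for the two cell types:

        # C2/3 pyramidal cell: requires a matching feed-forward pattern (apical dendrites)
        # and a valid sequential context (basal dendrites).
        def c23_active(ff_match: bool, context_match: bool) -> bool:
            return ff_match and context_match

        # C5 pyramidal cell: the same conditions, plus dis-inhibition (selection)
        # by the Thalamus or by cascading feedback via C6.
        def c5_active(ff_match: bool, context_match: bool, disinhibited: bool) -> bool:
            return ff_match and context_match and disinhibited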

    As mentioned in the previous article, output from C5 cells sometimes drives motors very directly, so full activation of C5 cells may immediately result in physical actions. We can consider C5 to be the “output” layer of the cortex. This makes sense if the representation within C5 includes selected future states.

    Management of C5 activity will require a lot of inhibition; we would expect most of the input connections to C5 to be inhibitory because in every context, for every potential outcome, there are many alternative outcomes that must be inhibited (ignored). At any given time, only a sparse set of C5 cells would be fully active, but many more would be potentially-active (available for selection).

    Given predictive encoding and filtering inhibition, it would be common for few pyramidal cells to be active in a Column at any time. Separately, we would expect objective C2/3 pyramidal activity to be more consistent and repeatable than subjective C5 pyramidal activity, given a constant external stimulus.

    Executive System

    So far we have defined a mechanism for generating a hierarchical representation and a mechanism for selectively filtering activity within that representation. In our original conceptual look at general intelligence, we also desired that filtering predictions would be equivalent to action selection. But if we have selected predictions of future actions at various levels of abstraction within the hierarchy, how can we make these abstract prediction-actions actually happen?

    The purpose of the executive system is to execute hierarchical plans reliably. As previously discussed, this is no trivial matter due to problems such as vanishing agency at higher hierarchy levels. If a potential future outcome represented within the subjective system is selected for action, the job of the executive system is to make it occur.

    We know that we want abstract concepts at high levels within the hierarchy to be faithfully translated into their equivalent patterns of activity at lower levels. Moving towards more concrete forms would result in increasing activity as the incremental dimensionality reduction of the feed-forward hierarchy is reversed.

    Figure 7: Differences in dominant direction of data flow between objective and executive systems. Whereas the Objective system builds increasingly abstract concepts of greater breadth, the Executive system is concerned with decomposing these concepts into their many constituent parts, so that hierarchically-represented plans can be executed.

    We also know that we need to actively prioritize execution of a high level plan over local prediction / action candidates in lower levels. So, we are looking for a cascade of activity from higher hierarchy levels to lower ones.

    Figure 8: One of two Feed-Back direct pathways. This pathway may well be involved in cascading control activity down the hierarchy towards sensors and motors. Activity propagates from C6 to C6 directly; C6 modulates the activity of local C5 cells and relevant Thalamic nuclei that activate local C5 cells by selective disinhibition in conjunction with matching contextual information from C2/3.

    It turns out that such a system does exist: The feed-back direct pathway from C6 to C6. Cortex layer 6 is directly connected to Cortex layer 6 in the hierarchy levels immediately below. What’s more, these connections are direct, i.e. unfiltered (which is necessary to avoid the vanishing agency problem). Note that C5 (the subjective system) is still the output of the Cortex, particularly in motor areas. C6 must modulate the activity of cells in C5, biasing C5 to particular predictions (selections) and thereby implementing a cascading abstract plan. Finally, C6 also modulates the activity of Thalamic nuclei that are responsible for disinhibiting local C5 cells. This is obviously necessary to ensure that the Thalamus doesn’t override or interfere with the execution of a cascading plan already selected at a higher level of abstraction.

    Our theory is that ideally, all selections originate centrally (e.g. in the Thalamus). When C5 cells are disinhibited and then become predicted, an associated set of local C6 cells is triggered to make these C5 predictions become reality.

    These C6 cells have a number of modulatory outputs to achieve this goal, as described in the sections below.

    Executive Training

    No, this is not a personal development course for CEOs. This section checks whether C6 cells can learn to replay specific action sequences via C5 activity. This is an essential feature of our interpretation, because only C6 cells participate in a direct, modulatory feedback pathway.

    We propose that C6 pyramidal neurons are taught by historical activity in the subjective system. Patterns of subjective activity become available as “stored procedures” (sequences of disinhibition and excitatory outputs) within C6.

    Let’s start by assuming that C6 pyramidal cells have similar functionality to C2/3 and C5 pyramidal cells, due to their common morphology. Assume that C5 cells in motor areas are direct outputs, and when active will cause the agent to take actions without any further opportunity for suppression or inhibition (see previous article).

    In other cortical areas, we assume that the role of C5 cells is to trigger more abstract “plans” that will be incrementally translated into activity in motor areas, and therefore will also become actions performed by the agent.

    To hierarchically compose more abstract action sequences from simpler ones, we need activity of an abstract C5 cell to trigger a sequence of activity in more concrete C5 cells. C6 cells will be responsible for linking these C5 cells. So, activating a C6 cell should trigger a replay of a sequence of C5 cell activity in a lower Column. How can C6 cells learn which sequences to trigger, and how can these sequences be interpreted correctly by C6 cells in higher hierarchy levels?

    C6 pyramidal cells are mostly oriented with their dendrites pointing towards the more superficial cortex layers C1,…,C5 and their axons emerging from the opposite end. Activity from C5 to C6 is transferred via axons from C5 synapsing with dendrites from C6. Given a particular model of pyramidal cell learning rules, C6 pyramidal cells will come to recognize patterns of simultaneous C5 activity in a specific sequential context, and C6 interneurons will ensure that unique sets of C6 pyramidal cells respond in each context.

    So how will these C6 cells learn to trigger sequences of C5 cells? We know that the axons of C6 cells bend around and reach up into C5, down to the Thalamus and directly to hierarchically-lower C6 cells. At all targets they can be excitatory or inhibitory.

    All we need beyond this is for C6 axons to seek out axon target cells that become active immediately after the originating C6 cell is stimulated by active C5 cells. This will cause each C6 cell to trigger the C5 and C6 cells that are observed to be activated afterwards. Note that we require the C6 cells themselves to be organised into sequences (technically, a graph of transitions).

    Target seeking by axons is known as “Axon Guidance” and C6 pyramidal cells’ axons do seem to target electrically active cells by ceasing growth when activity is detected. We have not yet found biological evidence for the predicted timing.

    C6 axons can also target C4 inhibitory cells (evidence) and Thalamic cells, which again is compatible with our interpretation, as long as they are cells that become active after the originating C6 cell. If we want to “replay” some activity that followed a particular C6 cell, then all the cells described above should be excited or inhibited to ensure that the same events occur again. Activating a C6 cell directly should reproduce the same outcome as incidental activation of the C6 cell via C5 – a chain of sequential inhibition and promotion will result. Note that the same learning rule could work to discover all axon targets mentioned.

    Collectively, the C6 cells within a Column will become a repertoire of “stored procedures” that can be triggered and replayed by a cascade of activity from higher in the hierarchy or by direct selection via C5. C6 cells would behave the same way whether activated by local C5 cells, or by C6 cells in the hierarchy level above. This allows cascading, incremental execution of hierarchical plans.

    C6 cells do not need to replace sequences of C5 cell activity with a single C6 cell (i.e. label replacement for symbolic encoding), but they do need to collectively encode transitions between chains of C5 cells, individually trigger at least 1 C5 cell and collectively allow a single C6 cell to trigger a sequence of C6 cells in both the current and lower hierarchy regions.
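
    As a toy illustration (entirely our own, with made-up cell names), the “stored procedure” idea can be pictured as a learned transition graph: triggering one element replays a chain of C5 activations.

        # Hypothetical learned transitions: each active C5 cell excites the next in the chain.
        transitions = {
            "c5_reach_for_cup": "c5_lift_cup",
            "c5_lift_cup":      "c5_drink",
        }

        def replay(start_cell, transitions, max_steps=10):
            # Follow the chain of learned transitions from a starting C5 cell.
            chain = [start_cell]
            while chain[-1] in transitions and len(chain) < max_steps:
                chain.append(transitions[chain[-1]])
            return chain

        print(replay("c5_reach_for_cup", transitions))
        # ['c5_reach_for_cup', 'c5_lift_cup', 'c5_drink']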

    C6 interneurons can resolve conflicts when multiple C6 triggers coincide within a column. We can expect C6 interneurons to inhibit competing C6 pyramidal cells until the winners are found, resulting in a locally consistent plan of action.

    As with layers C2/3 and C5, C6 inhibitory interneurons will also support training C6 pyramidal cells for collective coverage of the space of observed inputs, in this case from C5 and C2/3.

    Bootstrapping

    Now we are only left with a bootstrapping problem: How can the system develop itself? Specifically, how do the sequences of C5 activity come to be defined so that they can be learned by C6?

    We suggest that conscious choice of behaviour via the Thalamus is used to build the hierarchical repertoire from simple primitive actions to increasingly sophisticated sequences of fine control. Initially, thalamic filtering of C5 state would be used to control motor outputs directly, without the involvement of C6. Deliberate practice and repetition would provide the training for C6 cells to learn to encode particular sequences of behaviour, making them part of the repertoire available to C6 cells in hierarchically “higher” Columns.

    Initially, concentration is needed to perform actions via direct C5 selections; these activities need to be carefully centrally coordinated using selective attention. However, when C6 has learnt to encode these sequences, they become both more reliable and require less effort to execute, requiring only a trigger to one C6 cell.

    After training, only minimal thalamic interventions are needed to execute complex sequences of behaviour learned by C6 cells. Innovation can continue by combining procedures encoded by C6 with interventions via the Thalamus, that can still excite or inhibit C5 cells. However, in most other cases C6 training is accelerated by the independence of Columns: When a C6 cell learns to control other cells within the Column, this learning remains valid no matter how many higher hierarchy levels are placed on top. By analogy, once you’ve learned to drink from a cup, you don’t need to relearn that skill to drink in restaurants, at home, at work etc.

    As C6 learns and starts to play a role in the actions and internal state of the agent, it becomes important to provide the state of C6 to the objective and subjective systems as contextual input.

    Axons from C6 to other, hierarchically lower Columns take two paths: To C6, and to C1. We propose that the copy provided to C1 is used as informational feedback in C2/3 and C5 pyramidal cells (these axons synapse with Pyramidal cell Apical dendrites). We suggest the copy to C6 allows C6 cells to execute plans hierarchically, by delegating execution to a number of more concrete C6 cells. Therefore, the feedback direct pathway from C6 to C6 is part of the executive system. These axons should synapse on cell bodies, or nearby, to inhibit or trigger C6 activation artificially (rather than via C5).

    Interpretation of the Thalamus

    Rather than merely a relay, we propose that the Thalamus is better understood as a control centre. Its job is to centrally control cortical activity in C5 (the subjective system). Abstract activity in C5 is propagated down the hierarchy by C6, and translated into its concrete component states, eventually resulting in specific motor actions. Therefore, via this feedback pathway the filtering performed by the Thalamus assumes an executive role also.

    We believe that filtering predictions of oneself performing an action or experiencing a reward is the mechanism by which objectives and plans are selected. We believe there is only one representation of the world in our heads. There is no separate “goal-oriented” or “action-based” representation. This means that filtering predictions is the mechanism of behaviour generation. Note that in a hierarchical system, you can simultaneously select novel combinations of predictions to achieve innovation without changing the hierarchical model.

    Our interpretation of the Thalamus depends on some theoretical assumptions about how general intelligence works. Crucially, we believe there is no difference between selective awareness of externally-caused and self-generated events, except some of the latter have agency in the real world via the agent’s actions. This means that selective attention and action selection can both be consequences of the same subjective modelling process.

    But where does selection actually occur?

    For a number of practical reasons, action and attentional selection should be centralized functions. For one thing, the reward criteria for selecting actions are of much smaller dimension than the cortical representations – for example, the set of possible pain sensations is far more limited than the potential external causes of pain. We essentially need to compare the reward of all potential actions against each other, rather than against an absolute scale.

    It is also important that conflicts between items competing for attention or execution are resolved so that incompatible plans are replaced by a single clear choice. Conflict resolution is difficult to do in a highly parallel & distributed system; instead, it is preferable to force all alternatives to compete against each other until a few clear winners are found.

    Finally, once an action or attentional target is selected, it should be maintained for a long period (if still relevant), to avoid vacillation. (See Scholarpedia for a good introduction to the difficulties of conflict resolution and the importance of sticking to a decision for long enough to evaluate it).

    We believe the Thalamus plays this role via its interactions with the Cortex. It interacts with the Cortex in two ways. First, the Thalamus selectively dis-inhibits particular C5 cells, allowing them to become active when the right circumstances are later observed objectively (i.e. via C2/3, which is not subjective).

    Second, the Thalamus must also co-operate with the Feed-Back cascade via C6.  While the Thalamus generates new selections by controlling C5, it must also permit the execution of existing, more abstract Thalamic selections by allowing cascading feedback activity to override local selections. Together, these mechanisms ensure that execution of abstract plans is as easily accomplished as simpler, concrete actions.

    Interpretation of the Basal Ganglia

    The Basal Ganglia are involved in so many distinct functions that they can’t be fully described within this article. They consist of a set of discrete structures located adjacent to the Thalamus.

    In our model, selection is implemented by the Thalamus manipulating the subjective system within the Cortex. We propose that the selections themselves are generated by the Basal Ganglia, which then controls the behaviour of the Thalamus.

    Crucially, we believe the Striatum within the Basal Ganglia uses reward values (such as pleasure and pain) to make adaptive selections. In other words, the Basal Ganglia are responsible for picking good actions, biasing the entire Thalamo-Cortical system towards futures that are expected to be more pleasant for the agent.

    However, to make adaptive choices it is necessary to have accurate context and predictions (candidate actions). The hierarchical model defined within the Cortex is an efficient and powerful source for this data, and in fact, this pathway (Cortex → Basal Ganglia → Thalamus → Cortex) does exist within the brain (see figure 9 below).

    Thanks to studies of relevant disorders such as Parkinson’s and Huntington’s, it is known that this pathway is associated with behaviour initiation and selection based on adaptive criteria.

    Figure 9: Pathways forming a circuit from Cortex to Basal Ganglia to Thalamus and back to Cortex. Image source.

    Lifecycle of an idea

    Using our interpretation of biological general intelligence, we can follow the lifecycle of an idea from conception to execution. Let’s walk through the theorized response to a stimulus, resulting in an action.

    Although the brain is operating constantly and asynchronously, we can define the start of our idea as some sensory data that arrives at the visual cortex. In this example, it’s an image of an ice-cream in a shop.

    Objective Modelling

    Sensor data propagates unfiltered up the Feed-Forward Direct pathway, activating cells in C4 and C2/3 in numerous cortical areas as it is transformed into its hierarchical form. The visual stimuli become a rich network of associated concepts, including predictions of near-future outcomes, such as experiencing the taste of ice-cream. These concepts represent an objective external reality and are now active and available for attention.

    Subjective Prediction

    Activity within the Objective system triggers activity in the Subjective system. Some C5 cells become “predicted”, but are inhibited by the Thalamus. These cells represent potential future actions and outcomes. Things that, from experience, we know are likely to occur after the current situation.

    The Cortex projects data from C2/3 to the Striatum where it is weighted according to reward criteria. A strong response to the flavour of the frozen treat percolates through the Basal Ganglia and manipulates the activity of the Thalamus.

    Between the Thalamus and the Cortex, an iterative negotiation takes place resulting in the selection (via dis-inhibition) of some C5 cells. The Basal Ganglia have learned which manipulations of the Thalamus maximize the expected Reward given the current input from Cortex.

    The way that the Thalamus stimulates particular C5 cells is somewhat indirect. The path of activity to “select” C5 cells in layer n is C5[n-1] →  Thalamus → C4[n] → C5[n]. The signal is re-interpreted at each stage of this pipeline – that is, connections do not carry a specific meaning from point to point. Therefore, you can’t just adjust one “wire” to trigger a particular C5 cell. Rather, you must adjust the inhibition of input to many C4 → C5 cells until you’ve achieved the conditions to “select” a target C5 cell. Many target C5 cells might be simultaneously selected.

    In addition to requiring disinhibition, C5 cells also wait for specific patterns of cell activity in C2/3 prior to becoming “predicted”. This means that it’s very difficult to select a C5 cell that is not “predicted”; it simply doesn’t have the support to out-compete its neighbours in the column and become “selected”. This prevents unrealistic outcomes being “selected”, or output commencing, before the right circumstances have arrived to match the expectation.

    Eventually, a subset of C5 cells become “predicted” and “selected”, representing a subjective model of potential futures for the agent in the world. In this case, the anticipated future involves eating ice-cream.

    Execution

    When C5 cells become active, they in turn drive C6 pyramidal cells that are responsible for causing the future represented by “contextual, selected & predicted” C5 cells. In this case, C6 cells are charged with executing the high-level plan to “buy some ice-cream and eat it”.

    The plan is embodied by many C5 cells, distributed throughout the hierarchy; each represents a subset of the “qualia” relating to the eating of ice-cream. C6 cells begin to interpret these C5 cells into concrete actions, via the C6-C6 Feed-Back Direct pathway. Crucially, they no longer require the Thalamus to modulate the input that makes C5 cells “selected”. Instead, C6 cells stimulate C5 and C6 cells in hierarchically-lower Columns directly, moving them to “selected” status and allowing them to become active as soon as the corresponding Feed-Forward evidence arrives to match.

    C6 cells also modulate relay cells in the Thalamus, guiding the Thalamus to disinhibit C5 cells in lower hierarchy regions. This helps to ensure the parts of the decomposed plan are executed as intended. In turn, these newly selected “lower” C5 cells drive associated C6 cells, and the plan cascades down the hierarchy.

    Note that the plan is also flowing in the “forward” direction, as it incrementally becomes reality rather than expectation. As motor actions take place, they are sensed and signalled through the Feed-Forward pathways. When C5 cells become “selected”, this information becomes available to higher columns in the hierarchy, if not filtered. This also helps the Feed-Forward Indirect pathway and C6 cells to keep track of activity and execute the plan in a coordinated manner.

    At the lowest levels of the hierarchy, the plan becomes a sequence of motor activity, which is activated by C5 cells directly, and also by other brain components that are not covered by our general intelligence model.

    A few moments later, the ice-cream is enjoyed, triggering a release of Dopamine into the Striatum and reinforcing the rewards associated with recent active Cortical input. Delicious!

    Summary

    In the previous articles we explored the characteristics of a general intelligence and looked at some of the features we expected it to have. In part 2 and part 3 we reviewed some relevant computational neuroscience research. In this article we’ve described our interpretation of this background material.

    We presented a model of general intelligence built from 3 interacting systems – Objective, Subjective and Executive. We described how these systems could learn and bootstrap via interaction with the world, and how they could be implemented by the anatomy of the brain. As an example, we traced an experience from sensation, through planning and to execution.

    Let’s assume that our understanding of biology is approximately correct. We can use this as inspiration to build an artificial general intelligence with a similar architecture and test whether the systems behave as described in these articles.

    The next article in this series will look specifically at how these concepts could be implemented in software, resulting in a system that behaves much like the one described here.


    How to build a General Intelligence: What we think we already know

    Posted by ProjectAGI
    Authors: D Rawlinson and G Kowadlo

    This is the first of three articles detailing our latest thinking on general intelligence: A one-size-fits-all algorithm that, like people, is able to learn how to function effectively in almost any environment. This differs from most Artificial Intelligence (AI), which is designed by people for a specific purpose. This article will set out assumptions, principles, insights and design guidelines based on what we think we already know about general intelligence. It turns out that we can describe general intelligence in some detail, although not enough detail to actually build it…yet.

    The second article will look at how these ideas fit existing computational neuroscience, which helps to refine and filter the design; and the third article will describe a (high-level) algorithm that is, at least, not contradictory to the design goals and biology already established.

    As usual, our plans have got ahead of implementation, so code will follow in a few weeks after the end of the series (or months…)

    FIGURE 1: A hierarchy of units. Although units start out identically, they become differentiated as they learn from their unique input. The input to a unit depends on its position within the hierarchy and the state of the units connected to it. The hierarchy is conceptualized as having levels; the lowest levels are connected to sensors and motors. Higher levels are separated from sensors and motors by many intermediate units. The hierarchy may have a tree-like structure without cycles, but the number of units per level does not necessarily decrease as you move higher.

    Architecture of General Intelligence

    Let’s start with some fundamental assumptions and outline the structure of a system that has general intelligence characteristics.

    It Exists

    We assume there exists a “general intelligence algorithm” that is not irreducibly complex. That is, we don’t need to understand it in excruciating detail. Instead, we can break it down into simpler models that we can easily understand in isolation. This is not necessarily a reasonable assumption, but there is evidence for it, as outlined below.

    Units

    A general intelligence algorithm can be described more simply as a collection of many simpler, functionally-identical units. Again, this is a big assumption, but it is supported by at least two pieces of evidence. First, it has often been observed that the human cortex has quite uniform structure across areas having greatly varying functional roles. Second, this structure has revealed that the cortex is made up of many smaller units (called columns, at one particular scale). It is reasonable to decompose the cortex in this way due to high and varied intra-column connectivity and limited variety of inter-column connectivity. The patterns of inter and intra column connectivity are very similar throughout the cortex. “Columns” contain only a few thousand neurons organized into layers and micro-columns that further simplify understanding of the structure. That’s not overwhelmingly complex, although we are making simplifying assumptions about neuron function.

    Hierarchy

    Our reading and experimentation has suggested that hierarchical representation is critical for the types of information processing involved in general intelligence. Hierarchies are built from many units connected together in levels. Typically, only the lowest level of the hierarchy receives external input; other levels receive input from lower levels of the hierarchy instead. For more background on hierarchies, see earlier posts. Hierarchy allows units in higher levels to model more complex and abstract features of the input, despite the fixed complexity of each unit. Hierarchy also allows units to cover all available input data, and allows combinations of features to be jointly represented within a reasonable memory limit. It’s a crucial concept.

    Synchronization

    Do we need synchronization between units? Synchronization can simplify sequence modelling in a hierarchy by restricting the number of possible permutations of events. However, synchronization between units may significantly hinder fast execution on parallel computing hardware, so this question is important. A point of confusion may be the difference between synchronization and timing / clock signals. We can have synchronization without clocks, but in any case there is biological evidence of timing signals within the brain. Pathological conditions can arise without a sense of time. In conclusion we’re going to assume that units should be functionally asynchronous, but might make use of clock signals.

    Robustness

    Your brain doesn’t completely stop working if you damage it. Robustness is a characteristic of a distributed system and one we should hope to emulate. Robustness applies not just to internal damage but external changes (i.e. it doesn’t matter if your brain is wrong or the world has changed; either way you have to learn to cope).

    Scalability

    Adding more units should improve capability and performance. The algorithm must scale effectively without changes other than having more of the same units appended to the hierarchy. Note the specific criteria for how scalability is to be achieved (i.e. enlarge the hierarchy rather than enlarge the units). It is important to test for this feature to demonstrate the generality of the solution.

    Generality

    The same unit should work reasonably well for all types of input data, without preprocessing. Of course, tailored preprocessing could make it better, but it shouldn’t be essential.

    Local interpretation

    The unit must locally interpret all input. In real brains it isn’t plausible that neuron X evolved to target neuron Y precisely. Neurons develop dendrites and synapses with sources and targets that are carefully guided, but not to the extent of identifying specific cells amongst thousands of peers. Any algorithm that requires exact targeting or mapping of long-range connections is biologically implausible. Rather, units should locally select and interpret incoming signals using characteristics of the input. Since many AI methods require exact mapping between algorithm stages, this principle is actually quite discriminating.

    Cellular plausibility

    Similarly, we can validate designs by questioning whether they could develop by biologically plausible processes, such as cell migration or preferential affinity for specific signal coding or molecular markers. However, be aware that brain neurons rarely match the traditional integrate-and-fire model.

    Key Insights

    It’s surprising that in careers cumulatively spanning more than 25 years we (the authors) had very little idea how the methods we used everyday could lead to general intelligence. It is only in the last 5 years that we have begun to research the particular sub-disciplines of AI that may lead us in that direction.

    Today, those who have studied this area can talk in some detail about the nature of general intelligence without getting into specifics. Although we don’t yet have all the answers, the problem has become more approachable. For example, we’re really looking to understand a much simpler unit, not an entire brain holistically. Many complex systems can be easily understood when broken down in the right way, because we can selectively ignore detail that is irrelevant to the question at hand.

    From our experience, we’ve developed some insights we want to share. Many of these insights were already known, and we just needed to find the right terminology. By sharing this terminology we can help others to find the right research to read.

    We’re looking for a stackable building block, not the perfect monolith

    We must find a unit that can be assembled into an arbitrarily large – yet still functional – structure. In fact, a similar feature was instrumental in the success of “deep” learning: Networks could suddenly be built up to arbitrary depths. Building a stackable block is surprisingly hard and astonishingly important.

    We’re not looking to beat any specific benchmark

    … but if we could do reasonably well at a wide range of benchmarks, that would be exciting. This is why the DeepMind Atari demos are so exciting; the same algorithm could succeed in very different problems.

    Abstraction by accumulation of invariances

    This insight comes from Hawkins’ work on Hierarchical Temporal Memory. He proposes that abstraction towards symbolic representation comes about incrementally, rather than as a single mapping process. Concepts accumulate invariances – such as appearance from different angles – until labels can correctly be associated with them.  This neatly avoids the fearful “symbol grounding problem” from the early days of AI.

    Biased Prediction and Selective Attention are both action selection

    We believe that selective bias of predictions and expectations is responsible for both narrowing of the range of anticipated futures (selective ignorance of potential outcomes) and the mechanism by which motor actions are generated. A selective prediction of oneself performing an action is a great way to generate or “select” that action. Similarly, selective attention to external events affects the way data is perceived and in turn the way the agent will respond. Filtering data flow between hierarchy units implements both selective attention and action selection, if data flowing towards motors represents candidate futures including both self-actions and external consequences.

    The importance of spatial structure in data

    As you will see in later parts of this article series, the spatial structure of input data is actually quite important when training our latest algorithms. This is not true of many algorithms, especially in Machine Learning where each input scalar is often treated as an independent dimension. Note that we now believe spatial structure is important both in raw input and in data communicated between units. We’re not simply saying that external data structure is important to the algorithm – we’re claiming that simulated spatial structure is actually an essential part of algorithms for dynamically dividing a pool of resources between hierarchy units.

    Binary data

    There’s a lot of simplification and assumption here, but we believe this is the most useful format for input and internal data. In any case, the algorithms we’re finding most useful can’t easily be refactored for the obvious alternative (continuous input values). However, continuous input can be encoded with some loss of precision as subsets of bits. There is some evidence that this is biologically plausible, but it is not definitive. Why binary? Dimensionality reduction is an essential feature of a hierarchical model; it may be that sparse binary representations are simply a good compromise between data loss and qualities such as compositionality:
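
    As a small illustration of encoding a continuous value as a subset of bits, here is a sketch (our own, not any particular library’s encoder) of a simple bucket encoder: a run of w active bits whose position depends on the value, trading precision for sparsity. Nearby values map to overlapping bit patterns.

        import numpy as np

        def encode_scalar(value, v_min, v_max, n_bits=64, w=8):
            # Return a binary vector of length n_bits with w consecutive 1s,
            # positioned according to where value falls in [v_min, v_max].
            value = min(max(value, v_min), v_max)        # clamp to the valid range
            frac = (value - v_min) / (v_max - v_min)     # normalise to 0..1
            start = int(round(frac * (n_bits - w)))      # position of the active run
            bits = np.zeros(n_bits, dtype=np.uint8)
            bits[start:start + w] = 1
            return bits

        a = encode_scalar(0.50, 0.0, 1.0)
        b = encode_scalar(0.55, 0.0, 1.0)   # a nearby value shares most active bits with a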

    Sparse, Distributed Representations

    We will be using Sparse, Distributed Representations (SDRs) to represent agent and world state. SDRs are binary data (i.e. all values are 1 or 0). SDRs are sparse, meaning that at any moment only a fraction of the bits are 1’s (active). The most complex feature to grasp is that SDRs are distributed: No individual bit uniquely represents anything. Instead, data features are jointly represented by sets of bits. SDRs are overcomplete representations – not all bits in a feature-set are required to “detect” a feature, which also means that degrees of similarity can be expressed as if the data were continuous. These characteristics also mean that SDRs are robust to noise – missing bits are unlikely to affect interpretation.
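
    To make this concrete, here is a minimal sketch in Python; the vector size and sparsity below are invented for illustration, not taken from any particular implementation. It shows how overlap between SDRs gives a graded, noise-tolerant measure of similarity:

```python
import numpy as np

def random_sdr(size=2048, active_bits=40, rng=None):
    """Generate a random SDR: a binary vector with a small number of active bits."""
    rng = rng or np.random.default_rng()
    sdr = np.zeros(size, dtype=np.uint8)
    sdr[rng.choice(size, active_bits, replace=False)] = 1
    return sdr

def overlap(a, b):
    """Count of bits active in both SDRs: a graded measure of similarity."""
    return int(np.sum(a & b))

rng = np.random.default_rng(0)
a = random_sdr(rng=rng)
b = a.copy()
b[np.flatnonzero(b)[:10]] = 0           # knock out 10 active bits (simulated noise)

print(overlap(a, a))                    # 40: identical
print(overlap(a, b))                    # 30: still clearly similar despite missing bits
print(overlap(a, random_sdr(rng=rng)))  # usually 0-2: unrelated SDRs rarely share bits
```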

    Predictive Coding

    SDRs are a specific form of Sparse (Population) Coding where state is jointly represented by a set of active bits. Transforming data into a sparse representation is necessarily lossy and balances representational capacity against bit-density. The most promising sparse coding scheme we have identified is Predictive Coding, in which internal state is represented by prediction errors. PC has the benefit that errors are propagated rather than hidden in local states, and data dimensionality automatically reduces in proportion to its predictability. Perfect prediction implies that data is fully understood, and produces no output. A specific description of PC is given by Friston et al but a more general framework has been discussed in several papers by Rao, Ballard et al since about 1999. The latter is quite similar to the inter-region coding via temporal pooling described in the HTM Cortical Learning Algorithm.
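
    As a toy illustration of the message-content idea – a unit passes on only what it failed to predict, so a perfectly predicted input produces no feed-forward output – here is a sketch using a crude linear predictor and a made-up learning rate. It is not Friston’s or Rao & Ballard’s actual formulation, just the minimal shape of the idea:

```python
import numpy as np

class ToyPredictiveCodingUnit:
    """Toy unit whose feed-forward output is the error in its prediction of the input."""
    def __init__(self, n_inputs, learning_rate=0.05):
        self.w = np.zeros(n_inputs)   # crude linear predictor: next input ~ w * previous input
        self.lr = learning_rate
        self.prev = np.zeros(n_inputs)

    def step(self, x):
        prediction = self.w * self.prev
        error = x - prediction                   # the only thing sent up the hierarchy
        self.w += self.lr * error * self.prev    # improve the local predictor
        self.prev = x
        return error

unit = ToyPredictiveCodingUnit(n_inputs=4)
pattern = np.array([1.0, 0.0, 1.0, 0.0])
for _ in range(200):
    err = unit.step(pattern)                     # a constant, learnable input...
print(round(float(np.abs(err).sum()), 4))        # ...eventually produces near-zero FF output
```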

    Generative Models

    Training an SDR typically produces a Generative Model of its input. This means that the system encodes observed data in such a way that it can generate novel instances of observed data. In other words, the system can generate predictions of all inputs (with varying uncertainty) from an arbitrary internal state. This is a key prerequisite for a general intelligence that must simulate outcomes for planned novel action combinations.

    Dimensionality Reduction

    In constructing models, we will be looking to extract stable features and in doing so reduce the complexity of input data. This is known as dimensionality reduction, for which we can use algorithms such as auto-encoders. To cope with the vast number of possible permutations and combinations of input, an incredibly efficient incremental process of compression is required. So how can we detect stable features within data?
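
    Before answering that, here is a concrete (if simplistic) illustration of the dimensionality reduction step itself: a minimal linear auto-encoder in NumPy. The sizes, learning rate and synthetic data are assumptions for illustration only; a real system would want nonlinearities, sparsity constraints and online updates.

```python
import numpy as np

rng = np.random.default_rng(1)
# 16-dimensional data that really only has 2 underlying factors.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 16)) / 4.0

W_enc = rng.normal(scale=0.1, size=(16, 2))   # encoder: 16 -> 2
W_dec = rng.normal(scale=0.1, size=(2, 16))   # decoder: 2 -> 16
lr = 0.1

for epoch in range(500):
    H = X @ W_enc                     # compressed 2-D code
    X_hat = H @ W_dec                 # reconstruction of the 16-D input
    E = X_hat - X
    W_dec -= lr * (H.T @ E) / len(X)             # gradient step on the decoder
    W_enc -= lr * (X.T @ E @ W_dec.T) / len(X)   # gradient step on the encoder

print(float(np.mean(E ** 2)))   # small: the 2 stable factors explain most of the data
```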

    Unsupervised Learning

    By the definition of general intelligence, we can’t possibly hope to provide a tutor-algorithm that provides the optimum model update for every input presented. It’s also worth noting that internal representations of the world and agent should be formed without consideration of the utility of the representations – in other words, internal models should be formed for completeness, generality and accuracy rather than task-fulfilment. This allows less abstract representations to become part of more abstract, long-term plans, despite lacking immediate value. It requires that we use unsupervised learning to build internal representations.

    Hierarchical Planning & Execution

    We don’t want to have to model the world twice: Once for understanding what’s happening, and again for planning & control. The same model should be used for both. This means we have to do planning & action selection within the single hierarchical model used for perception. It also makes sense, given that the agent’s own actions will help to explain sensor input (for example, turning your head will alter the images received in a predictable way). As explained earlier, we can generate plans by simply biasing “predictions” of our own behaviour towards actions with rewarding outcomes.
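
    Here is a toy sketch of that idea – our own illustration, not a documented algorithm – in which a predictive distribution over candidate next action-states is multiplied by a reward-derived bias and renormalised, so that the “prediction” of the agent’s own behaviour becomes the selected action:

```python
import numpy as np

def select_by_biased_prediction(p_next, expected_reward, bias_strength=2.0, rng=None):
    """Bias a predictive distribution over next action-states towards rewarding outcomes."""
    rng = rng or np.random.default_rng()
    bias = np.exp(bias_strength * expected_reward)   # reward-derived multiplicative bias
    biased = p_next * bias
    biased /= biased.sum()                           # renormalise: it is still a "prediction"
    return rng.choice(len(p_next), p=biased), biased

# Three candidate futures: the agent predicts it will probably do action 0,
# but action 2 has the highest expected reward.
p_next = np.array([0.6, 0.3, 0.1])
expected_reward = np.array([0.0, 0.2, 1.0])
action, biased = select_by_biased_prediction(p_next, expected_reward)
print(biased)   # probability mass shifts towards the rewarding, less likely action
```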

    Reinforcement Learning

    In the context of an intelligent agent, it is generally impossible to discover the “correct” set of actions or output for any given situation. There are many alternatives of varying quality; we don’t even insist on the best action but expect the agent to usually pick rewarding actions. In these scenarios, we will require a Reinforcement Learning system to model the quality of the actions considered by the agent. Since there is value in exploration, we may also expect the agent to occasionally pick suboptimal strategies, to learn new information.

    Supervised Learning

    There is still a role for supervised learning within general intelligence. Specifically, during the execution of hierarchical control tasks we can describe both the ideal outcome and some metric describing similarity of actual outcome to desired. Supervised learning is ideal for discovery of actions with agency to bring about desired results. Supervised Learning can tell us how best to execute a plan constructed in an Unsupervised Learning model, that was later selected by Reinforcement Learning.

    Challenges Anticipated 

    The features and constraints already identified mean that we can expect some specific difficulties when creating our general intelligence.

    Among other problems, we are particularly concerned about:

    1. Allocation of limited resources
    2. Signal dilution
    3. Detached circuits within the hierarchy
    4. Dilution of executive influence
    5. Conflict resolution
    6. Parameter selection

    Let’s elaborate:

    Allocation of limited resources

    This is an inherent problem when allocating a fixed pool of computational resources (such as memory) to a hierarchy of units. Often, resources per unit are fixed, ensuring that there are sufficient resources for the desired hierarchy structure. However, this is far less efficient than dynamically allocating resources to units to globally maximize performance. It also presupposes the ideal hierarchy structure is known, and not a function of the data. If the hierarchy structure is also dynamic, this becomes particularly difficult to manage because resources are being allocated at two scales simultaneously (resources → units and units → hierarchy structure), with constraints at both scales.

    In our research we will initially adopt a fixed resource quota per hierarchy unit and a fixed branching factor for the hierarchy, rather than allowing the structure of the hierarchy and the resources per unit to be determined by the data. This arrangement is the one most likely to work given a universal unit with constant parameters, as the number of inputs to each unit is constrained (due to the branching factor). It is interesting that the human cortex is a continuous sheet, and shows evidence of dynamic resource allocation in the form of neuroplasticity – resources can be dynamically reassigned to working areas and sensors when others fail.
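
    For illustration, here is one very simple allocation rule – an assumption on our part, not a published algorithm – that divides a fixed pool of cells between units in proportion to their recent modelling error (the same intuition, give resources where errors are large, reappears later when we discuss Predictive Coding):

```python
import numpy as np

def allocate_cells(total_cells, recent_error, min_cells=16):
    """Split a fixed pool of cells between units in proportion to their recent error."""
    share = recent_error / recent_error.sum()
    cells = np.maximum(min_cells, np.floor(share * total_cells)).astype(int)
    while cells.sum() > total_cells:      # crude correction so we never exceed the pool
        cells[np.argmax(cells)] -= 1
    return cells

recent_error = np.array([0.1, 0.5, 2.0, 0.4])   # unit 2 is modelling its input poorly
print(allocate_cells(1024, recent_error))       # unit 2 receives the largest share of cells
```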

    Signal Dilution

    As data is transformed from raw input into a hierarchical model, information will be lost (not represented anywhere). This problem is certain to occur in all realistic tasks because input data will be modelled locally in each unit without global oversight over which data is useful. Given local resource constraints, this will be a lossy process. Moreover, we have also identified the need for units to identify patterns in the data and output a simplified signal for higher-order modelling by other units in the hierarchy (dimensionality reduction). Therefore, each unit will deliberately and necessarily lose data during these transformations. We will use techniques such as Predictive Coding to allow data that is not understood (i.e. not predictable) to flow through the system until it can be modelled accurately (predicted). However, it will still be important to characterise the failure modes in which important data is eliminated before it can be combined with other data that provides explanatory power.

    Detached circuits within the hierarchy

    Consider figure 2. Here we have a tree of hierarchy units. If the interactions between units are reciprocal (i.e. X outputs to Y and receives data from Y) there is a strong danger of small self-reinforcing circuits forming in the hierarchy. These feedback circuits exchange mutually complementary data between a pair or more units, causing them to ignore data from the rest of the hierarchy. In effect, the circuit becomes “detached” from the rest of the hierarchy. Since sensor data enters via leaf-units at the bottom of the hierarchy, everything above the detached circuit is also detached from the outside world and the system will cease to function satisfactorily.

    In any hierarchy with reciprocal connections, this problem is very likely to occur, and is disastrous when it does. In Belief Propagation, an inference algorithm for graphical models, the same problem manifests as “double counting” and is avoided by nodes carefully excluding their own evidence when it is returned to them.

    FIGURE 2: Detached circuits within the hierarchy. Units X and Y have formed a mutually reinforcing circuit that ignores all data from other parts of the hierarchy. By doing so, they have ceased to model the external world and have divided the hierarchy into separate components.

    Dilution of executive influence

    A generally-intelligent agent needs to have the ability to execute abstract, high-level plans as easily as primitive, immediate actions. As people we often conceive plans that may take minutes, hours, days or even longer to complete. How is execution of lengthy plans achieved in a hierarchical system?

    If abstract concepts exist only in higher levels of the hierarchy, they need to control large subtrees of the hierarchy over long periods of time to be successfully executed. However, if each hierarchy unit is independent, how is this control to be achieved? If higher units do not effectively subsume lower ones, executive influence will dilute as plans are incrementally re-interpreted from abstract to concrete (see figure 3). Ideally, abstract units will have quite specific control over concrete units. However, it is impractical for abstract units to have the complexity to “micro-manage” an entire tree of concrete units.

    FIGURE 3: Dilution of executive influence. A high-level unit within the hierarchy wishes to execute a plan; the plan must be translated towards the most concrete units to be performed. However, each translation and re-interpretation risks losing details of the original intent which cannot be fully represented in the lower levels. Somehow, executive influence must be maintained down through an arbitrarily deep hierarchy. 

    Let’s define “agency” as the ability to influence or control outcomes. Lacking the ability to cause a particular outcome is a lack of agency over the desired and actual outcomes. By making each hierarchy unit responsible for the execution of goals defined in the hierarchy level immediately above, we indirectly maximise the agency of more abstract units. Without this arrangement, more abstract units would have little or no agency at all.

    Figure 4 shows what happens when an abstract plan gets “lost in translation” to concrete form. I walked up to my car and pulled my keys from my pocket. The car key is on a ring with many others, but it’s much bigger and can’t be mistaken by touch. It can only be mistaken if you don’t care about the differences.

    In this case, when I got to the car door I tried to unlock it with the house key! I only stopped when the key wouldn’t fit in the keyhole. Strangely, all low-level mechanical actions were performed skillfully, but high level knowledge (which key) was lost. Although the plan was put in motion, it was not successful in achieving the goal.

    Obviously this is just a hypothesis about why this type of error happens. What’s surprising is that it isn’t more common. Can you think of any examples?


    FIGURE 4: Abstract plan translation failure: Picking the wrong key but skilfully trying it in the lock. This may be an example of abstract plans being carried out, but losing relevant details while being transformed into concrete motor actions by a hierarchy of units.

    In our model, planning and action selection occur as biased prediction. There is an inherent conflict between accurate prediction and bias. Attempting to bias predictions of events beyond your control leads to unexpected failure, which is even worse than expected failure.

    The alternative is to predict accurately, but often the better outcome is the less likely one. There must be a mechanism to increase the probability of low-frequency events where the agent has agency over the real-world outcome.

    Where possible, lower units must separate learning to predict from using that learning to satisfy higher units’ objectives. Units should seek to maximise the probability of goal outcomes, given an accurate estimate of the local unit’s state as prior knowledge. But units should not become blind to objective reality in the process.

    Conflict resolution

    General intelligence must be able to function effectively in novel situations. Modelling and prediction must work in the first instance, without time for re-learning. This means that existing knowledge must be combined effectively to extrapolate to a novel situation.

    We also want the general intelligence to spontaneously create novel combinations of behaviour as a way to innovate and discover new ways to do things. Since we assume that behaviour is generated by filtering predictions, we are really saying we need to be able to predict (simulate) accurately when extrapolating combinations of existing models to new situations. So we also need conflict resolution for non-physical or non-action predictions. The agent needs a clear and decisive vision of the future, even when simulating outcomes it has never experienced.

    The downside of all this creativity is that there’s really no way to tell whether these combinations are valid. Often they will be, but not always. For example, you can’t touch two objects that are far apart at the same time. When incompatible, we need a way to resolve the conflict.

    There’s a good discussion of different conflict resolution strategies on Scholarpedia; our preferred technique is selecting a solitary active strategy in each hierarchy unit, choosing locally to optimise for a single objective when multiple are requested.

    Evaluating alternative plans is most easily accomplished as a centralised task – you have to bring all the potential alternatives together where they can be compared. This is because we can only assign relative rewards to each alternative; it is impossible to calculate meaningful absolute rewards for the experiences of an intelligent agent. It is also important to place all plans on a level playing-field regardless of the level of abstraction; therefore abstract plans should be competing against more concrete ones and vice-versa.

    Therefore, unlike most of the pieces we’ve described, action selection should be a centralised activity rather than a distributed one.
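
    A minimal sketch of what centralised selection might look like – the plans and reward numbers below are invented purely for illustration – with candidates from different abstraction levels competing on the same relative-reward scale:

```python
# Candidate plans gathered from different hierarchy levels; rewards are only
# meaningful relative to one another, so all candidates compete in one place.
candidates = [
    {"plan": "grasp cup",    "level": 1, "relative_reward": 0.2},
    {"plan": "make coffee",  "level": 3, "relative_reward": 0.7},
    {"plan": "tidy kitchen", "level": 4, "relative_reward": 0.5},
]

def select_plan(candidates):
    # Abstract and concrete plans are compared on a level playing field.
    return max(candidates, key=lambda c: c["relative_reward"])

print(select_plan(candidates)["plan"])   # "make coffee"
```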

    Parameter Selection

    In a hierarchical system the input to “higher” units will be determined by modelling in “lower” units and interactions with the world. The agent-world system will develop in markedly different ways each time. It will take an unknown amount of time for stable modelling to emerge, first in the lower units and then moving higher in the hierarchy.

    As a result of all these factors it will be very difficult to pick suitable values for time-constants and other parameters that control the learning processes in each unit, due to compounded uncertainty about lower units’ input. Instead, we must allow recent input to each unit to determine suitable values for parameters. This is online learning. Some parameters cannot be automatically adjusted in response to data. For these, to have any hope of debugging a general intelligence, a fixed parameter configuration must work for all units in all circumstances. This constraint will limit the use of some existing algorithms.
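
    As a small example of letting recent input determine a parameter, here is a sketch – the chosen statistic and decay constant are assumptions – in which each unit tracks its own input scale online, so that a single fixed configuration can be reused by every unit:

```python
import numpy as np

class OnlineScale:
    """Track a running estimate of input magnitude so a unit can normalise itself."""
    def __init__(self, decay=0.99):
        self.decay = decay       # the only fixed parameter, shared by all units
        self.scale = 1.0

    def update(self, x):
        self.scale = self.decay * self.scale + (1 - self.decay) * np.abs(x).mean()
        return x / max(self.scale, 1e-6)   # output adapted to this unit's recent input

unit = OnlineScale()
rng = np.random.default_rng(0)
for _ in range(1000):
    out = unit.update(rng.normal(scale=5.0, size=8))   # raw input with an unknown scale
print(round(unit.scale, 2))   # converges towards the input's average magnitude (about 4)
```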

    Summary

    That wraps up our theoretical overview of what we think a general intelligence algorithm must look like. The next article in this series will explain what we’ve learnt from biology’s implementation of general intelligence – ourselves! The final article will describe how we hope to build an algorithm that satisfies all these requirements.

    action selection/Adaptive/agency/Algorithm/CLA/Hierarchical Generative Models/hierarchy/symbol grounding problem

    Agency and Hierarchical Action Selection

    Posted by ProjectAGI on
    This post asks some questions about the agency of hierarchical action selection. We assume various pieces of HTM / MPF canon, such as a cortical hierarchy.

    Agency

    The concept of agency has various meanings in psychology, neuroscience, artificial intelligence and philosophy. The common element is having control over a system, with varying qualifiers regarding the entities who may be aware of execution or availability of control. Although “agency” has several definitions, let’s use this one I made up:

    An agent has agency over a state S, if its actions affect the probability that S occurs.

    Hierarchical Selection thought experiment

    Now let’s consider a hierarchical representation of action-states (actions and states encoded together). Candidate actions can therefore be synonymous with predictions of future states. Let’s assume that actions-states can be selected as objectives anywhere in the hierarchy. More complex actions are represented as combinations or sequences of simpler action-states defined in lower levels of the hierarchy.

    Let’s say an “abstract” action-state at a high level in the hierarchy is selected. How is the action-state executed? In other words, how is the abstract state made to occur?

    To exploit the structure of the hierarchy, let’s assume each vertex of the hierarchy re-interprets selected actions. This translates a compound action into its constituent parts.

    How much control does higher-level selection exert over lower-level execution? For simplicity let’s assume there are two alternatives:

    1. High level selection biases or influences lower level (weak control)
    2. Lower levels try to interpret high level selections as faithfully as possible (strong control)

    We exclude the possibility that higher levels directly control or subsume all lower levels due to the difficulty and complexity of performing such a task without the benefit of hierarchical problem decomposition.

    If high levels do not exert strong control over lower levels, the probability of faithfully executing an abstract plan should be small due to compound uncertainty at each level. For example, let’s say the probability of each hierarchy level correctly interpreting a selected action is x. The height of the hierarchy h determines the number of interpretations between selection of the abstract action and execution of relevant concrete actions. The probability of an abstract action a being correctly executed is:

    P(a) = x^h

    So, for example, if h = 10 and x = 0.9, P(a) = 0.9^10 ≈ 0.35.

    We can see that in a hierarchy with a very large number of levels, the probability of executing any top-level strategy will be very small unless each level interprets higher-level objectives faithfully. However, “weak control” may suffice in a very shallow hierarchy.
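
    The arithmetic is easy to check, assuming independent interpretation errors at each level:

```python
# Probability of faithfully executing an abstract action under "weak control",
# where each of h levels independently interprets it correctly with probability x.
def p_execute(x, h):
    return x ** h

for h in (2, 5, 10, 20):
    print(h, round(p_execute(0.9, h), 2))
# 2 0.81, 5 0.59, 10 0.35, 20 0.12 -> deep hierarchies need near-faithful interpretation
```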

    Are abstract actions easy to execute?

    Introspectively I observe that highly abstract plans are frequently and faithfully executed without difficulty (e.g. it is easy to drive a car to the shops for groceries, something I consider a fairly abstract plan). Given the apparent ease with which I select and execute tasks with rewards delayed by hours, days or months, it seems I have good agency over abstract tasks.

    According to the thought experiment above, my cortical hierarchy must either be very shallow or higher levels must exert “strong control” over lower levels.

    Let’s assume the hierarchy is not shallow (it might be, but then that’s a useful conclusion in its own right).

    Local Optimisation

    Local processes may have greater biological plausibility because they require less specificity in routing relevant signals to the right places, and hopefully reduce the amount of wiring needed.

    What would a local implementation of a strong control architecture look like? Each vertex of the hierarchy would receive some objective action-state[s] as input. (When no input is received, no output is produced). Each vertex would produce some objective action-states as output, in terms of action-states in the level below. The hierarchical encoding of the world would be undone incrementally by each level. 

    At the lowest level the output action-states would be actual motor control signals.
    A cascade of incremental re-interpretation would flow from the level of original selection down to levels that process raw data (either as input or output). In each case, local interpretation should only be concerned with maximizing the conditional probability of the selected action-state given the current action-state and instructions passed to the level immediately below. 
    Clearly, the agency of each hierarchy vertex over its output action-states is crucial. The agency of hierarchy levels greater than 0 is dependent on faithful interpretation by lower levels. Other considerations (such as reward associated with output action-states) must be ignored, else the agency of higher hierarchy levels is lost. 
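
    As a toy sketch of this “strong control” cascade – with hand-written lookup tables standing in for whatever local model each vertex has actually learned – each level faithfully translates an objective into objectives for the level below, until motor primitives emerge:

```python
# Each level translates an objective into objectives for the level below,
# using only its own local knowledge (here, a made-up lookup table).
local_translations = [
    # level 2 -> level 1
    {"make tea": ["boil water", "pour water"]},
    # level 1 -> level 0 (motor primitives)
    {"boil water": ["grasp kettle", "press switch"],
     "pour water": ["grasp kettle", "tilt kettle"]},
]

def execute(objective, level=0):
    if level == len(local_translations):
        return [objective]                      # reached motor level: emit the primitive
    motor = []
    for sub in local_translations[level][objective]:
        motor.extend(execute(sub, level + 1))   # faithful re-interpretation, one level down
    return motor

print(execute("make tea"))
# ['grasp kettle', 'press switch', 'grasp kettle', 'tilt kettle']
```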

    Cortex Layer 6

    Connections form between layer 6 of cortex in “higher” regions and layer 6 in “lower” regions, with information travelling from higher (more abstract) regions towards more concrete regions (i.e. a feedback direction). Layer 6 neurons also receive input from other cortex layers in the same region. However, note that the referenced work disputes the validity of assigning directionality, such as “feed-back”, to cortical layers.
    The purpose of cortex layer 6 remains mysterious. Pressing on regardless, given some assumptions about cortical hierarchy, we can speculatively wonder whether the layer 6 neurons embody a local optimization process that incrementally translates selected actions into simpler parts, using information from other cortex layers for context.
    However, since cortex layer 5 seems to be the direct driver of motor actions, it may be that layer 6 somehow controls cortex layer 5 in the same or lower regions, perhaps via some negotiation with the Thalamus.

    Adapted from Numenta CLA Whitepaper by Gideon Kowadlo
    Another difficulty for this theory is that cortex layer 5 seems to be more complex than simply the output from layer 6. Activity in layer 5 seems to be the result of interaction between cortex and Thalamus. Potentially this interaction could be usefully overriding layer 6 instructions to produce novel action combinations.
    There is some evidence that dopaminergic neurones in the Striatum are involved in agency learning, but this doesn’t necessarily refute this post, because this process may modulate cortical activity via the Thalamus. Cortex layer 6 may still require some form of optimization to ensure that higher hierarchy levels have agency over future action-states.

    To conclude: This is all speculation – comments welcome!

    CLA/Generative Models/Hierarchical Generative Models/HTM/Predictive Coding/Rao & Ballard/Temporal Pooling

    On Predictive Coding and Temporal Pooling

    Posted by ProjectAGI on

    Introduction

    Predictive Coding (PC) is a popular theory of cortical function within the neuroscience community. There is considerable biological evidence to support the essential concepts (see e.g. “Canonical microcircuits for predictive coding” by Bastos et al).

    PC describes a method of encoding messages passed between processing units. Specifically, PC states that messages encode prediction failures; when prediction is perfect, there is no message to be sent. The content of each message is the error produced by comparing predictions to observations.

    A good introduction to the various theories and models under the PC umbrella has been written by Andy Clark (“Whatever next? Predictive brains, situated agents, and the future of cognitive science”). As Clark explains, the history of the PC concept goes back at least several decades to Ashby, quote: “The whole function of the brain is summed up in: error correction.” Mumford pretty much nailed the concept back in 1992, before it was known as predictive coding (the cited paper gives a good discussion of how the neocortex might implement a PC-like scheme).

    The majority of PC theories also model uncertainty explicitly, using Bayesian principles. This is a natural fit when providing explicit messaging of errors and attempting to generate predictions. Of course, it is also a robust framework for generative models.

    It can be difficult to search for articles regarding PC because a similar concept exists in Signal Processing, although this seems to be coincidental, or at least the connection goes back beyond our reading. Unfortunately, many articles on the subject are written at a high level and do not include sufficient detail for implementation. However, we found work by Friston et al (example) and Rao et al (example, example) to be well described, although the former is difficult to grasp if one is not familiar with dynamical systems theory.

    Rao’s papers include application of PC to visual processing and Friston’s work includes both the classification of birdsong and extends the concept to the control of motor actions. Friston et al wrote a paper titled “Perceptions as hypotheses; saccades as experiments” in which they suggest that actions are carefully chosen to optimally reduce uncertainty in internal predictive models. The PC concept throws up interesting new perspectives on many topics!

    Comparison to MPF/CLA

    There are significant parallels between MPF/CLA and PC. Both postulate a hierarchy of processing units with FeedForward (FF) and reciprocal FeedBack (FB) connections. MPF/CLA explicitly aims to produce increasingly stable FF signals in higher levels of the hierarchy. MPF/CLA tries to do this by identifying patterns via spatial and temporal pooling, and replacing these patterns with a constant signal.

    Many PC theories create “hierarchical generative models” (e.g. Rao and Ballard). The hierarchical part is enforced by restrictions on the topology of the model. The generative part refers to the fact that the variables (in the Bayesian sense) in each vertex of the model are defined by identifying patterns in input data. This agrees with MPF/CLA.

    Both MPF/CLA and PC posit that processing units use FB data from higher levels to improve local prediction. In conjunction with local learning, this serves to reduce errors and therefore, in PC, also stabilizes FF output.

    In MPF/CLA it is assumed that cells’ input dendrites determine the set of inputs the cell represents. This performs a form of Spatial Pooling – the cell comes to represent a set of input cells firing simultaneously, and hence the cell becomes a label or symbol representing that set. In PC it is similarly assumed that the generative model will produce objects (cells, variables) that represent combinations of inputs.

    However, MPF/CLA and PC differ in their approach to Temporal Pooling, i.e. changes in input over time.

    Implicit Temporal Pooling

    Predictive coding does not expressly aim to produce stability in higher layers, but increasing stability over time is an expected side-effect of the technique. Assuming successful learning within a processing unit, its FF output will be stable (no signal) for the duration of any periods of successful prediction.

    Temporal Pooling in MPF/CLA attempts to replace FF input with a (more stable) pattern that is constantly output for the duration of some sequence of events. In contrast, PC explicitly outputs prediction errors whenever they occur. If errors do not occur, PC does not produce any output, and therefore the output is stable. A similar outcome has occurred, but via different processes.

    Since the content of PC messages differs to MPF/CLA messages, it also changes the meaning of the variables defined in each vertex of the hierarchy. In MPF/CLA the variables will represent chains of sequences of sequences … in PC, variables will represent a succession of forks in sequences, where prediction failed.

    So it turns out that Predictive Coding is an elegant way to implement Temporal Pooling.

    Benefits of Predictive Coding

    Where PC gets really interesting is that the amplitude or magnitude of the FF signal corresponds to the severity of the error.  A totally unexpected event will cause a signal of large amplitude, whereas an event that was considered a possibility will produce a less significant output.

    This occurs because most PC frameworks model uncertainty explicitly, and these probability distributions can account for the possibility of multiple future events. Anticipated events will have some mass in the prior distribution; unanticipated events have very little prior probability. If the FF output is calculated as the difference between prior and posterior distributions, we naturally get an amplitude that is correlated with the surprise of the event.
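
    A toy sketch of that calculation, with the distributions invented purely for illustration: if the FF message is the difference between the prior (predicted) and posterior (observed) distributions, anticipated events produce small messages and surprising events produce large ones:

```python
import numpy as np

def ff_amplitude(prior, posterior):
    """Magnitude of the FF message as the total difference between prior and posterior."""
    return 0.5 * np.abs(prior - posterior).sum()   # total variation distance, in [0, 1]

prior = np.array([0.70, 0.25, 0.05])   # the unit expected A, thought B possible, C unlikely

print(ff_amplitude(prior, np.array([1.0, 0.0, 0.0])))  # A happened: 0.30, a small message
print(ff_amplitude(prior, np.array([0.0, 1.0, 0.0])))  # B happened: 0.75, larger
print(ff_amplitude(prior, np.array([0.0, 0.0, 1.0])))  # C happened: 0.95, a big surprise
```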

    This is a very useful property. We can distribute representational resources across the hierarchy, giving the resources preferentially to the regions where larger errors are occurring more frequently. These events are being badly represented and need improvement.

    In biological terms this response would be embodied as a proliferation of cells in columns receiving or producing large or frequent FF signals.

    Next post

    In the next post we will describe a hybrid Predictive-Coding / Memory Prediction Framework which has some nice properties, and is appealingly simple to implement. We will include some empirical results that show how well the two go together.

    CLA/Cortical Learning Algorithm/Hierarchical Generative Models/HTM/MPF/Neocortex/Sequence Memory/Spatial Pooling/Temporal Pooling

    Architecture of the Memory Prediction Framework / Cortical Learning Algorithm / Hierarchical Temporal Memory

    Posted by ProjectAGI on
    by David Rawlinson and Gideon Kowadlo

    Introduction

    The Memory Prediction Framework (MPF) is a general description of a class of algorithms. Numenta’s Cortical Learning Algorithm (CLA) is a specific instance of the framework. Numenta’s Hierarchical Temporal Memory (HTM) was an earlier instance of the framework. HTM and CLA adopt different internal representations, so it is not simply the case that CLA supersedes HTM.
    This post will describe the structure of the framework that is common to MPF, CLA and HTM, focusing on some features that often confuse readers.
    For a good introduction to MPF/CLA/HTM see the Numenta CLA white paper.

    The Hierarchy

    The framework is composed as a hierarchy of identical processing units. The units are known as “regions” in CLA. The hierarchy is a tree-like structure of regions:
    MPF/CLA/HTM hierarchy of Regions. The large arrows show the direction of increasing abstraction. Smaller arrows show the flow of data between nearby regions in a single level of the hierarchy, and between levels of the hierarchy. Figure originally from Numenta.
    Regions communicate with other, nearby regions in the same level of the hierarchy. Regions also communicate with a few regions in a higher level of the hierarchy, and a few regions in a lower level of the hierarchy. Notionally, abstraction increases as you move towards higher levels in the hierarchy. Note that Hawkins and Blakeslee define abstraction as “the accumulation of invariances”.

    Regions

    Biologically, each Region is a tiny patch of cortex. The hierarchy is constructed from lots of patches of cortex. Each piece of cortex has approximately 6 layers (there are small variations throughout the cortex, and the exact division between cortical layers in biology is a bit vague. Nature hates straight lines). Note that in addition to having only 6 layers, each cortical region is finite in extent within the cortex – i.e. it is only a tiny area on the surface of the cortex.
    Cortical layers and connections between hierarchy levels. Each cortical region has about 6 structurally (i.e. also functionally) distinct layers. The hierarchy is composed of a tree of cortical regions, with connections between regions in different levels of the hierarchy. 3 key pathways are illustrated here. Each pathway is a carrier of distinct information content. The Feed-Forward pathways carry information UP the hierarchy levels towards increasing abstraction/invariance. The Feed-Back pathway carries information DOWN through hierarchy levels towards regions that represent more concrete, raw, unclassified inputs. Some pathways connect cortical regions directly, others indirectly (via other brain structures). Note that this image is a modified copy of one from Numenta, with additional labels and colours standardised to match the rest of this document.

    Levels and Layers

    Newcomers to MPF/CLA/HTM theory sometimes confuse “cortical layers” with connections between regions placed in different “levels” of the hierarchy. We recommend everyone uses “layers” to talk about cortical layers and “levels” to talk about hierarchy levels, even though the two words are near-synonyms in everyday English. I believe this confusion arises because readers expect to learn one new concept at a time, but in fact levels and layers are two separate things.

    Pathways

    There are several distinct routes that information takes through the hierarchy. Each route is called a “pathway”. What is a pathway? In short, a pathway is a set of assumptions that allows us to make some broad statements about what components are connected, and how. We assume that the content of data in each pathway is qualitatively different. We also assume there is limited mixing of data between pathways, except where some function is performed to specifically combine the data.

    Directions

    There are two directions that have meaning within the MPF/CLA/HTM literature. These are feed-forward and feed-back.  Feed-Forward (FF) means data travelling UP between hierarchy levels, towards increasing abstraction. Feed-Back (FB) means data travelling DOWN between hierarchy levels, with reducing abstraction and taking on more concrete forms closer to raw inputs.

    3 Pathways

    The 3 pathways typically discussed in the MPF/CLA/HTM literature are:
    – FF direct (BLUE)
    – FF indirect (GREEN)
    – FB direct (RED)
    Direct means that data travels from one cortical region to another, without a stop along the way at an intermediate brain structure. Indirect means that the data is passed through another brain structure en-route, and possibly modified or gated (filtered).
    This does not mean that other pathways do not exist. There is likely a FB-indirect pathway from Cortex to Cortex via the Basal Ganglia, and direct connections between nearby Regions at the same level in the hierarchy. However, current canonical MPF/CLA theory does not assign roles to these pathways.  
    We will always use the same colours for these pathways.
    The conceptual and biological arrangement of the MPF/CLA/HTM hierarchy. Left, the conceptual structure. Right, the physical arrangement of the hierarchy. Cortical processing occurs on the surface of the cerebrum, not inside it; the filling is mainly neuron axons connecting surface regions. FF (blue) and FB (red) pathways are shown. Moving between hierarchy levels involves routing data between different patches of cortex (surface). The processing Units – each a separate region – are here labelled Unr, where n is the hierarchy level and r is an identifier for each Region. Note that data from multiple regions is combined in higher levels of the hierarchy: For example, U2a receives FF data from U1a and U1b. Via the FB pathway, lower levels are able to exploit data from other subtrees. Some types of data relayed between hierarchy regions are relayed via deep brain structures, such as the Thalamus. We say these are “indirect” connections. The relays may modify / filter / gate the data en-route.

    Conceptual Region Architecture

    MPF/CLA/HTM broadly outlines the architecture of each Region as follows. Each region has a handful of distinct functional components, namely: Spatial Pooler, Sequence Memory, and Temporal Pooler. Prediction is also a core feature of each Region, though it may not be considered a separate component. I believe that Hawkins would not consider this to be a complete list, as the CLA algorithm is still being developed and does not yet cover all cortical functions. Note that the conceptual entities described here do not imply structural boundaries or say anything about how this might look as a neural network.

    Key functional components of each Region. Note that every cellular cortical layer is believed to perform some subset of these functions. It is not intended that each layer perform one of the functions. Where specifically described, the inputs and outputs of each pathway are shown. The CLA white paper does not specifically define how FB output is generated. It is possible that FB output contains predicted cells. Prediction is an integral function of the posited sequence memory cells, so whether it can be a separate component is debatable. However, conceptually, a sequence memory cell cannot be activated by prediction alone; FF (“bottom up”) input is always needed to activate a cell. Prediction puts sequence memory cells into a receptive state for future activation by FF input. Regions receive additional data (e.g. from regions at higher hierarchy levels) when making their predictions. Prediction allows regions to better recognise FF input and predict future sequence cell activation. Note that, from the existing CLA white paper, it is not clear whether the FF indirect pathway involves Temporal Pooling. The white paper says that FF-indirect output originates in Layer 5, which is not fully described.

    The Spatial Pooler identifies common patterns in the FF direct input and replaces them with activation of a single cell (or variable, state, or label, depending on your preferred terminology). The spatial pooler functions as an unsupervised classifier, transforming input patterns into abstract labels that represent specific patterns.

    The Sequence Memory models changes in the state of the spatial pooler over time. In other words, which cells or states follow which other cells/states? The Sequence Memory can be thought of as a Markov Chain of the states defined by the spatial pooler. Sequence Memory encodes information that enables predictions of future spatial pooler state.
    The FF direct pathway cannot be driven by feedback from higher levels alone: FF input is always needed to fully activate cells in the Sequence Memory. As a hierarchy of unsupervised classifiers, the FF pathways are similar to the Deep Learning hierarchy.
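
    To illustrate the Markov Chain reading of Sequence Memory – only an analogy, since the CLA itself is cell-based and considerably more elaborate – here is a toy first-order model that learns transition counts online and predicts the next spatial pooler state:

```python
from collections import defaultdict

class ToySequenceMemory:
    """First-order Markov chain over spatial pooler states, used to predict the next state."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.prev = None

    def observe(self, state):
        if self.prev is not None:
            self.counts[self.prev][state] += 1    # learn which state follows which
        self.prev = state

    def predict(self, state):
        followers = self.counts[state]
        if not followers:
            return None
        return max(followers, key=followers.get)  # most likely next state

memory = ToySequenceMemory()
for state in "ABCABCABD":          # a mostly-repeating sequence of pooler states
    memory.observe(state)
print(memory.predict("A"))         # 'B'
print(memory.predict("B"))         # 'C' (seen twice) beats 'D' (seen once)
```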

    Prediction is specifically a process of activating Sequence Memory cells that represent FF input patterns that are likely to occur in the near future. Prediction changes Sequence Memory cells to a receptive state where they are more easily activated by future FF input. In this way, prediction makes classification of FF input more accurate. Improvement is due to the extra information provided by prediction, using both the history of Sequence Cell activation within the region and the history of activation of Sequence Memory cells within higher regions, the latter via the FB pathway.

    It is probable that the FB pathway contains prediction data, possibly in addition to Sequence Memory cell state. This is described in MPF/HTM literature, but is not specifically encoded in existing CLA documentation.

    Personally, I believe that prediction is synonymous with the generation of behaviour and that it has dual purposes; firstly, to enable regions to better understand future FF input, and secondly, to produce useful actions. A future article will discuss the topic of whether prediction and planning actions could be the same thing in the brain’s internal representation. An indirect FB pathway is not shown in this diagram because it is not described in MPF/CLA literature.

    While Spatial Pooling tries to replace instantaneous input patterns with labels, Temporal pooling attempts to simplify changes over time by replacing common sequences with labels. This is a function not explicitly handled in Deep Learning methods, which are typically applied to static data. MPF/CLA/HTM is explicitly designed to handle a continuous stream of varying input.

    Temporal pooling ensures that regions at higher levels in the hierarchy encode longer sequences of patterns, allowing the hierarchy to recognise long-term causes and effects. The input data for every region is different, ensuring that each region produces unique representations of different sub-problems. Spatial and Temporal pooling, plus the merging of multiple lower regions in a tree-like structure, all contribute to the uniqueness of each region’s Sequence Memory representation.

    Numenta also claim that there is a timing function in cortical prediction that enables the region to know when specific cells will be driven active by FF input. Since this function is speculative, it is not shown in the diagram above. The timing function is reportedly due to cortical layer 5.

    Mapping Region Architecture to Cortical Layers

    As it stands CLA claims to explain (most of) cortex layers 2, 3 and 4. Hawkins et al are more cautious about their understanding of other cortical layers.
    To try to present a clear picture of their stance, I have included a graphic (below) showing the functions of each biological cortex layer as defined by CLA. The graphic also shows the flows of data both between layers and between regions. Note that the flows given here are only those described in the CLA white paper and Hawkins’ new ideas on temporal pooling. Other sources do describe additional/alternative connections between cortical layers and regions. The exact interactions of each layer of neurons are somewhat messy and difficult to interpret.

    Data flow between cortical layers as described in the CLA white paper. Every arrow in this diagram is the result of a specific comment or diagram in the white paper. This figure is mostly a repeat of the same information as in the second figure, using a different presentation format. I have speculatively coloured each arrow by content (i.e. pathway), but don’t rely on this interpretation. Inputs to L2/3, L4 and L5 from L1 are red because there are no cells in L1 to transform the FB input signal, therefore this must be FB data. The black arrow is black because I have no idea what data or pathway it is associated with!

    Summary

    I hope this review of the terminology and architecture is helpful. Although the MPF/CLA/HTM framework is thoroughly and consistently documented, some of the details and concepts can be hard to picture, especially on first encounter. The CLA White Paper does a good job of explaining Sparse Distributed Representations and the spatial and temporal pooler implementations as biologically-inspired Sequence Memory cells. However, the grosser features of the posited hierarchy are not so thoroughly described.

    It is worth noting that, according to recent discussions on the NUPIC mailing list, the current NUPIC implementation of CLA does not correctly support multi-level hierarchies. This problem is expected to be addressed in 2014.

    Adaptive/AGI/Artificial General Intelligence/Hierarchical Generative Models/Memory-Prediction Framework/MPF/Neocortex/Reinforcement Learning

    Introduction

    Posted by ProjectAGI on
    by David Rawlinson and Gideon Kowadlo

    The Blog

    This blog will be written by several people. Other contributors are welcome – send us an email to introduce yourself!
    The content will be a series of short articles about a set of common architectures for artificial general intelligence (AGI). Specifically, we will look at the commonalities in Deep Belief Networks and Numenta’s Memory Prediction Framework (MPF). MPF is these days better known by its concrete implementations CLA (Cortical Learning Algorithm) and HTM (Hierarchical Temporal Memory). For an introduction to Deep Belief Networks, read one of the papers by Hinton et al.

    This blog will typically use the term MPF to collectively describe all the current implementations – CLA, HTM, NUPIC etc. We see MPF as an interface or specification, and CLA, HTM as implementations of the MPF.

    Both MPFs and DBNs try to build efficient and useful hierarchical representations from patterns in input data. Both use unsupervised learning to define local variables to represent the state-space at a particular position in the hierarchy; modelling of the state in terms of these local variables – be they “sequence cells” or “hidden units” – constitutes a nonlinear transformation of the input. This means that both are “Deep Learning” methods. The notion of local variables within a larger graph relates this work to general Bayesian Networks and other graphical models.

    We are also very interested in combining these structures with the representation and selection of behaviour, eventually resulting in the construction of an agent. This is a very exciting area of research that has not received significant attention.

    A very incomplete phylogeny of Deep Learning methods, specifically to contrast well known implementations of Numenta’s Memory Prediction Framework and Deep Belief Networks. Some assumptions (guesses?) about corporate technologies have been made (Vicarious, Grok, DeepMind).

    Readers would be forgiven for not having noted any similarity between MPFs and DBNs. The literature rarely describes both in the same terms. In an attempt to clarify our perspective, we’ve included a phylogeny showing the relationships between these methods – of course, this is only one perspective. We’ve also noted some significant organisations using each method.

    The remarkable uniformity of the neocortex 

    MPF/CLA/HTM aims to explain the function of the human neocortex. Deep Learning methods such as Convolutional Deep Neural Networks are explicitly inspired by cortical processing, particularly in the vision area. “Deep” means simply that the network has many layers; in earlier artificial neural networks, it was difficult to propagate signals through many layers, so only “shallow” networks were effective. “Deep” methods do some special (nonlinear) processing in each layer to ensure the propagated signal is meaningful, even after many layers of processing.

    A cross-section of part of a cerebrum showing the cortex (darker outline). The distinctively furrowed brain appearance is an attempt to maximize surface area within a constrained volume. Image from Wikipedia.

    The cortex is the brain’s outer surface (the word literally means “bark” or “rind”), and this surface is responsible for a lot of processing. The cortex covers the cerebrum, the top half of the brain. The processing happens in a thin layer on the surface, with the “filling” of the cerebrum being mainly connections between different areas of the cortex/surface.

    It has been known for at least a century that the neocortex is remarkably similar in structure throughout, despite being associated with ostensibly very different brain functions such as speech, vision, planning and language. Early analysis of neuron connection patterns within the cortex revealed that it is organised into parallel stacks of tiny columns. The columns are highly connected internally, with limited connections to nearby columns. In other words, each column can be imagined as an independent processor of data.

    Let’s assume you’re a connectionist: This means you believe the function of a neural network is determined by the degree and topology of the connections it has. This suggests that the same algorithm is being used in each cortical column: the same functionality is being repeated throughout the cortex despite being applied to very different data. This theory is supported by evidence of neural plasticity: Cortex areas can change function if different data is provided to them, and can learn to interpret new inputs.

    So, to explain the brain all we need to figure out is what’s happening in a typical cortical column and how the columns are connected!!*

    (*a gross simplification, so prepare to be disappointed…!)

    Neural Networks vs Graphical Models

    Whether the function of a cortical column is described as a “neural network” or as a graphical model is irrelevant so long as the critical functionality is captured. Both MPF and Deep Belief Networks create tree-like structures of functionally-identical vertices that we can call a hierarchy. The processing vertices are analogous to columns; the white matter filling the cerebrum passes messages between the vertices of the tree. The tree might really be a different type of graph; we don’t know whether it is better to have more vertices in lower or higher levels.

    From representation to action

    Deep Belief Networks have been particularly successful in the analysis of static images. MPF/CLA/HTM is explicitly designed to handle time-varying data. But neither is expressly designed to generate behaviour for an artificial agent.

    Recently, a company called DeepMind combined Deep Learning and Reinforcement Learning to enable a computer program to play Atari games. Reinforcement Learning teaches an algorithm to associate world & self states with consequences by providing only a nonspecific “quality” function. The algorithm is then able to pick actions that maximize the quality expected in future states.
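
    As a minimal sketch of the Reinforcement Learning idea, here is a tabular Q-learning update that learns a “quality” value for each state-action pair from nonspecific reward alone. It uses a made-up toy environment and is not DeepMind’s actual system, which learned such values with a deep network:

```python
import numpy as np

n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))          # learned "quality" of each state-action pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    # Hypothetical environment: action 1 in state 0 pays off and moves to state 1.
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

state = 0
for _ in range(2000):
    # Usually pick the highest-quality action, occasionally explore.
    action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
    next_state, reward = step(state, action)
    # Move Q towards reward plus discounted best future quality.
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    state = next_state

print(np.round(Q, 2))   # action 1 in state 0 ends up with the highest quality
```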

    Reinforcement Learning is the right type of feedback because it avoids the need to provide a “correct” response in every circumstance. For a “general” AI this is important, because it would require a working General Intelligence to define the “correct” response in all circumstances!

    The direction taken by DeepMind is exactly what we want to do: Automatic construction of a meaningful hierarchical representation of the world and the agent, in combination with reinforcement learning to allow prediction of state quality. Technically, the problem of picking a suitable action for a specific state is called a Markov Decision Process (MDP). But often, the true state of the world is not directly measurable; instead, we can only measure some “evidence” of world-state, and must infer the true state. This harder task is called a Partially-Observable MDP (POMDP).

    An adaptive memory-prediction framework

    In summary this blog is concerned with algorithms and architectures for artificial general intelligence, which we will approach by tackling POMDPs using unsupervised hierarchical representations of the state space and reinforcement learning for action selection. Using Hawkins et al’s MPF concept for the representation of state-space as a hierarchical sequence-memory, and adding adaptive behaviour selection via reinforcement learning, we arrive at the adaptive memory prediction framework (AMPF).

    This continues a theme we developed in an earlier paper (“Generating adaptive behaviour within a memory-prediction framework”).

    Since that publication we have been developing more scalable methods and aim to release a new software package in 2014. In the meantime we will use this blog to provide context and discussion of new ideas.