
The Region-Layer: A building block for AGI

Posted by ProjectAGI on
Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar – overlapping – but unique Receptive Fields – illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column’s input. 

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner; they can also be sandwiched by other Region-Layers on parallel surfaces, by which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, and another the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see our earlier articles on Layers and Levels.
The Region-Layer has a biological analogue – it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.
We assume the Region-Layer is trained by unsupervised learning only – it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online, as an agent learning from experience, and should adapt to non-stationary input statistics at any time.
The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.
We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols. 
Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.
So let’s put our cards on the table and test these ideas. 

Region-Layer Implementation

Parameters

The algorithm outlined below requires very few parameters. The few that are mentioned are needed merely to describe the resources available to the Region-Layer; in theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.
  • RW: Width of region layer in Columns
  • RH: Height of region layer in Columns
  • CW: Width of column in Cells 
  • CH: Height of column in Cells

Inputs and Outputs

  • Feed-Forward Input (FFI): Must be sparse and binary. Size: a matrix of any dimension*.
  • Feed-Back Input (FBI): Sparse and binary. Size: a vector of any dimension.
  • Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+.
  • Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+.
* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.
+  Region Area = CW * CH * RW * RH
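
As an illustration of how these parameters determine resource sizes, here is a minimal sketch (Python; the concrete values and variable names are arbitrary examples, not from our implementation):

    import numpy as np

    RW, RH = 16, 16   # region width / height, in Columns (example values)
    CW, CH = 4, 4     # column width / height, in Cells (example values)

    region_area = RW * RH * CW * CH   # total number of Cells in the Region-Layer

    ffi = np.zeros((32, 32), dtype=np.uint8)      # Feed-Forward Input: sparse binary matrix, any shape
    fbi = np.zeros(1024, dtype=np.uint8)          # Feed-Back Input: sparse binary vector, any length
    pdi = np.zeros(region_area, dtype=np.uint8)   # Prediction Disinhibition Input: one bit per Cell
    ffo = np.zeros(region_area, dtype=np.uint8)   # Feed-Forward Output: one bit per Cell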

Pseudocode

    Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.
    We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!
    function: UpdateAndTrain( 
      feed_forward_input, 
      feed_back_input, 
      prediction_disinhibition 
    )

    // if no active input, then do nothing
    if( sum( feed_forward_input ) == 0 ) {
      return
    }

    // Sparse activation
    // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm, 
    // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
    activity(t) = 0

    for-each( column c ) {
      // find the cell x in column c that best responds to FFI
      // in the current sequential context, given:
      //  a) prior active cells in the region, i.e. activity(t-1)
      //  b) feedback input.
      x = findBestCellsInColumn( feed_forward_input, feed_back_input, activity(t-1), c )

      activity(t)[ x ] = 1
    }

    // Change detection
    // if active cells in region unchanged, then do nothing
    if( activity(t) == activity(t-1) ) {
      return
    }

    // Update receptive fields to organize columns
    trainReceptiveFields( feed_forward_input, columns )

    // Update cell weights given column receptive fields
    // and selected active cells
    trainCells( feed_forward_input, feed_back_input, activity(t) )

    // Predictive coding: output false-negative errors only [5]
    for-each( cell x in region-layer ) {

      coding = 0

      if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
        coding = 1
      }
      // optional: attentional gating of the hierarchy – mute output
      // from this region unless disinhibited
      // (if gating is unused, treat prediction_disinhibition as all 1s)
      if( prediction_disinhibition(t)[x] == 0 ) {
        coding = 0
      }

      output(t)[x] = coding
    }

    // Update prediction
    // Note: Predictor can be as simple as first-order Hebbian learning. 
    // The prediction model is variable order due to the inclusion of sequential 
    // context in the active cell selection step.
    trainPredictor( activity(t), activity(t-1) )
    prediction(t) = predict( activity(t) )
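
To make the output stage concrete, here is a minimal runnable sketch of the predictive-coding step (Python with NumPy; illustrative only – the function and variable names are not from our working code). Only unpredicted activity is output, optionally gated by disinhibition:

    import numpy as np

    def predictive_coding_output(activity, prediction_prev, disinhibition):
        # Output false-negative errors only: cells that are active now
        # but were not predicted at the previous step.
        unpredicted = activity & ~prediction_prev
        return unpredicted & disinhibition   # optional attentional gate

    # Example with 8 cells: cells 1 and 3 are active; cell 3 was predicted.
    activity        = np.array([0, 1, 0, 1, 0, 0, 0, 0], dtype=bool)
    prediction_prev = np.array([0, 0, 0, 1, 0, 0, 0, 0], dtype=bool)
    disinhibition   = np.ones(8, dtype=bool)   # gating unused: all cells pass
    print(predictive_coding_output(activity, prediction_prev, disinhibition).astype(int))
    # -> [0 1 0 0 0 0 0 0]   only the unpredicted active cell is output
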
    [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf
    [2] https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
    [3] http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
    [4] https://arxiv.org/pdf/1312.5663
    [5] http://www.ncbi.nlm.nih.gov/pubmed/10195184

    Agency and Hierarchical Action Selection

    Posted by ProjectAGI on
    This post asks some questions about the agency of hierarchical action selection. We assume various pieces of HTM / MPF canon, such as a cortical hierarchy.

    Agency

The concept of agency has various meanings in psychology, neuroscience, artificial intelligence and philosophy. The common element is having control over a system, with varying qualifiers regarding which entities are aware of the execution or availability of that control. Since “agency” has several definitions, let’s use this one I made up:

    An agent has agency over a state S, if its actions affect the probability that S occurs.
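
Stated symbolically (our notation, added here for precision): an agent has agency over S if there exist available actions a1 and a2 such that

    P(S | a1) ≠ P(S | a2)

i.e. the agent’s choice of action changes the probability that S occurs.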

    Hierarchical Selection thought experiment

    Now let’s consider a hierarchical representation of action-states (actions and states encoded together). Candidate actions can therefore be synonymous with predictions of future states. Let’s assume that actions-states can be selected as objectives anywhere in the hierarchy. More complex actions are represented as combinations or sequences of simpler action-states defined in lower levels of the hierarchy.

    Let’s say an “abstract” action-state at a high level in the hierarchy is selected. How is the action-state executed? In other words, how is the abstract state made to occur?

    To exploit the structure of the hierarchy, let’s assume each vertex of the hierarchy re-interprets selected actions. This translates a compound action into its constituent parts.

    How much control does higher-level selection exert over lower-level execution? For simplicity let’s assume there are two alternatives:

    1. High level selection biases or influences lower level (weak control)
    2. Lower levels try to interpret high level selections as faithfully as possible (strong control)

    We exclude the possibility that higher levels directly control or subsume all lower levels due to the difficulty and complexity of performing such a task without the benefit of hierarchical problem decomposition.

    If high levels do not exert strong control over lower levels, the probability of faithfully executing an abstract plan should be small due to compound uncertainty at each level. For example, let’s say the probability of each hierarchy level correctly interpreting a selected action is x. The height of the hierarchy h determines the number of interpretations between selection of the abstract action and execution of relevant concrete actions. The probability of an abstract action a being correctly executed is:

P(a) = x^h

So for example, if h = 10 and x = 0.9, then P(a) ≈ 0.35.

    We can see that in a hierarchy with a very large number of levels, the probability of executing any top-level strategy will be very small unless each level interprets higher-level objectives faithfully. However, “weak control” may suffice in a very shallow hierarchy.
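
The compounding effect is easy to verify numerically. A minimal sketch (Python; the values of x and h are arbitrary examples):

    # Probability that an abstract action survives h levels of interpretation,
    # when each level is faithful with independent probability x.
    def execution_fidelity(x: float, h: int) -> float:
        return x ** h

    for h in (2, 5, 10, 20):
        print(h, round(execution_fidelity(0.9, h), 3))
    # h=2  -> 0.81    shallow hierarchy: weak control may suffice
    # h=5  -> 0.59
    # h=10 -> 0.349
    # h=20 -> 0.122   deep hierarchy: weak control fails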

    Are abstract actions easy to execute?

    Introspectively I observe that highly abstract plans are frequently and faithfully executed without difficulty (e.g. it is easy to drive a car to the shops for groceries, something I consider a fairly abstract plan). Given the apparent ease with which I select and execute tasks with rewards delayed by hours, days or months, it seems I have good agency over abstract tasks.

    According to the thought experiment above, my cortical hierarchy must either be very shallow or higher levels must exert “strong control” over lower levels.

    Let’s assume the hierarchy is not shallow (it might be, but then that’s a useful conclusion in its own right).

    Local Optimisation

Local processes may have greater biological plausibility, because they demand less specific routing of signals to the right places; hopefully the amount of wiring required is reduced as well.

    What would a local implementation of a strong control architecture look like? Each vertex of the hierarchy would receive some objective action-state[s] as input. (When no input is received, no output is produced). Each vertex would produce some objective action-states as output, in terms of action-states in the level below. The hierarchical encoding of the world would be undone incrementally by each level. 

    At the lowest level the output action-states would be actual motor control signals.
A cascade of incremental re-interpretation would flow from the level of original selection down to the levels that process raw data (either as input or output). In each case, local interpretation should be concerned only with maximizing the conditional probability of the selected action-state, given the current action-state, via the instructions passed to the level immediately below.
    Clearly, the agency of each hierarchy vertex over its output action-states is crucial. The agency of hierarchy levels greater than 0 is dependent on faithful interpretation by lower levels. Other considerations (such as reward associated with output action-states) must be ignored, else the agency of higher hierarchy levels is lost. 
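
As an illustration only (the probability table, names and values below are hypothetical, not part of any proposed implementation), a vertex performing this local optimisation might look like:

    # Hypothetical conditional-probability table, learned locally:
    # p_table[(selected, current, instruction)] = estimated probability that
    # 'selected' occurs, given the current action-state and the instruction
    # passed to the level immediately below.
    def interpret(selected, current, p_table):
        candidates = [(p, instr) for (s, c, instr), p in p_table.items()
                      if s == selected and c == current]
        return max(candidates)[1]   # instruction maximizing that probability

    p_table = {("reach-left", "arm-at-rest", "rotate-shoulder"): 0.9,
               ("reach-left", "arm-at-rest", "extend-elbow"):    0.4}
    print(interpret("reach-left", "arm-at-rest", p_table))   # -> rotate-shoulder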

    Cortex Layer 6

    Connections form between layer 6 of cortex in “higher” regions and layer 6 in “lower” regions, with information travelling from higher (more abstract) regions towards more concrete regions (i.e. a feedback direction). Layer 6 neurons also receive input from other cortex layers in the same region. However, note that the referenced work disputes the validity of assigning directionality, such as “feed-back”, to cortical layers.
The purpose of cortex layer 6 remains mysterious. Pressing on regardless, given some assumptions about cortical hierarchy, we can speculatively wonder whether layer 6 neurons embody a local optimization process that incrementally translates selected actions into simpler parts, using information from other cortex layers for context.
    However, since cortex layer 5 seems to be the direct driver of motor actions, it may be that layer 6 somehow controls cortex layer 5 in the same or lower regions, perhaps via some negotiation with the Thalamus.

(Figure adapted from the Numenta CLA Whitepaper by Gideon Kowadlo.)
    Another difficulty for this theory is that cortex layer 5 seems to be more complex than simply the output from layer 6. Activity in layer 5 seems to be the result of interaction between cortex and Thalamus. Potentially this interaction could be usefully overriding layer 6 instructions to produce novel action combinations.
There is some evidence that dopaminergic neurons in the Striatum are involved in agency learning. This doesn’t necessarily contradict this post, because that process may modulate cortical activity via the Thalamus; cortex layer 6 may still require some form of optimization to ensure that higher hierarchy levels have agency over future action-states.

    To conclude: This is all speculation – comments welcome!


    TP 1/3: Temporal Pooler background

    Posted by ProjectAGI on

    by David Rawlinson and Gideon Kowadlo

    Article Series

    This is the first of 3 articles about temporal pooler design for Numenta’s Cortical Learning Algorithm (CLA) and related methods (e.g. HTM/MPF). It has limited relevance to Deep Belief Networks.

    This first part will describe some of the considerations, objectives and constraints on temporal pooler design. We will also introduce some useful terminology. Part 2 will examine Jeff Hawkins’ new temporal pooler design. Part 3 will examine an alternative temporal pooler design we are successfully using in our experiments.

    We will use the abbreviation TP for Temporal Pooler to save repeating the phrase.

    Purpose of the Temporal Pooler

    What is the purpose of the Temporal Pooler? The TP is a core process in construction of a hierarchy of increasingly abstract symbols from input data. In a general-purpose algorithm designed to accept any arbitrary stream of data, building symbolic representations, or associating a dictionary of existing symbols with raw data, is an exceedingly difficult problem. The MPF/CLA/HTM family of algorithms seeks to build a dictionary of symbols by looking for patterns in observed data, specifically:

    a) inputs that occur together at the same time
    b) sequences of inputs that reliably occur one after another

    The Spatial Pooler’s job is to find the first type of pattern: Inputs that occur together. The TP’s job is to find the second type of pattern – inputs that occur in predictable sequences over time. MPF calls these patterns “invariances”. Repeated experience allows discovery of invariances: reliable association in time and space binds inputs together as symbols.

    MPF/CLA/HTM claims that abstraction is equivalent to the accumulation of invariances. For example, a symbol representing “dog” would be invariant to the pose, position, and time of experiencing the dog. This is an exceptionally powerful insight, because it opens the door to an algorithm for automatic symbol definition.

    What is a symbol? Symbols are the output of Classification. There must be consistent classification of the varying inputs that collectively represent a common concept, such as an entity or action. There must also be substitution of the input with a unique label for every possible classification outcome.

    The symbol represents the plethora of experiences (i.e. input data) that cumulatively give the symbol its meaning. Embodiment of the symbol can be as simple as the firing of a single neuron. Symbols that represent a wider variety of input are more abstract; symbols that represent only a narrow set of input are more concrete.

    Markov Chains

We will evaluate and compare TP functions using some example problems, which we will define as Markov Chains. The Markov property is very simple: each state depends only on the state that preceded it. The identity of the current state gives all the information about the system; the current state alone is enough to determine the probability of transitioning to any other state.

    Markov chains are normally drawn as graphs, with states being represented as vertices (aka nodes, typically presented as circles or ellipses). Changes in system state are represented by transitions between vertices in the graph. Edges in the graph represent potential changes, usually annotated with transition probabilities; there may be more than one possible future state from the current state. When represented as a Markov chain, the history of a system is always a sequence of visited states without forks or joins.

    Here is an example Markov Chain representing a cycle of 4 states (A,B,C,D). Each state has a colour. The colour of the current state is observable. This graph shows that Red comes after Black. Green comes after Red. Blue follows Green, and Black follows Blue:

    A simple Markov Chain shown as a graph. Vertices (circles) represent states. There are 4 states. Edges represent possible transitions between states. The color of each circle indicates what is observed in the corresponding state.
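
For readers who prefer code, the same 4-state cycle can be written as a transition table and sampled (a minimal Python sketch; since states and colours correspond one-to-one here, we label states by their colours):

    import random

    # The 4-state cycle: Black -> Red -> Green -> Blue -> Black -> ...
    transitions = {"BK": ["R"], "R": ["G"], "G": ["B"], "B": ["BK"]}

    state, history = "BK", []
    for _ in range(8):
        history.append(state)
        state = random.choice(transitions[state])
    print(history)   # -> ['BK', 'R', 'G', 'B', 'BK', 'R', 'G', 'B']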

    Now let’s look at a more complex example. What happens when one state can be followed by more than one subsequent state? The answer is that the next state is determined randomly according to the probability of each transition.

    A Markov chain with 7 states. State A can be followed by B or E with equal probability. Both states D and G are always followed by state A. This system represents 2 sequences of colours, separated by black. Each sequence is equally likely to occur, but once the sequence starts it is always completed in full.

    In our examples, when there is a fork, all subsequent states are equally likely.  This means that in the example above, both states B and E are equally likely to follow state A.

    First-Order Sequence Memory Representation

    If we allow an MPF hierarchy to observe the system described by the Markov Chain above, the hierarchy will construct a model of it. Spatial pooling tries to identify instantaneous patterns (which will cause it to learn the set of observed colours). Temporal pooling attempts to find sequential patterns (frequently-observed sequences of colours). More simply put, an MPF hierarchy will try to learn to predict the sequences of colours.

    How accurately the hierarchy can predict colours will depend on the internal model it produces. We know that the “real” system has 2 sequences of colours (RGB and GBR). We know that there is only uncertainty when the current state is Black.

However, let’s assume the MPF hierarchy consists of a single region. Let’s say the region has 4 cells, which learn to fire uniquely on observation of each of the 4 colours. The Sequence Memory in the region learns the order of cell firing – i.e. it predicts which cell will fire next, given the current cell. It uses only the current active cell to predict the next active cell.

    The situation we have described above can be drawn as a Markov Chain like this:

    A Markov chain constructed from first-order prediction given observations from the 7-state system shown in the previous figure. 

Note that the modelled system is less predictable than the “real world” system we were observing. We can still be sure that Blue always follows Green. But when we look at which colours follow Blue, we can only say that either Black or Red will follow Blue. Similarly, we cannot predict whether Red will be followed by Green or Black, whereas in the “real” world this is predictable.

    The reason for the lack of predictive ability is that we are only using the current colour of our model to predict the next colour. This is known as first-order prediction (i.e. Markov order=1). If we used a longer history of observations, we could predict correctly in every case except Black, where we could be right half the time. Using a “longer” history to predict is known as variable-order prediction (variable because we don’t know, or limit, how much history is needed).
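
This loss of information is easy to reproduce. A small sketch (Python; state labels follow the 7-state figure above, and we assume A is the Black state) builds the first-order colour model by observation:

    import random
    from collections import defaultdict

    # 7-state system: A is Black; B,C,D spell R,G,B; E,F,G spell G,B,R.
    chain  = {"A": ["B", "E"], "B": ["C"], "C": ["D"], "D": ["A"],
              "E": ["F"], "F": ["G"], "G": ["A"]}
    colour = {"A": "BK", "B": "R", "C": "G", "D": "B",
              "E": "G", "F": "B", "G": "R"}

    # Observe only the colours, and record which colour follows which.
    follows, state = defaultdict(set), "A"
    for _ in range(10000):
        next_state = random.choice(chain[state])
        follows[colour[state]].add(colour[next_state])
        state = next_state
    print(dict(follows))
    # typical result: {'BK': {'R','G'}, 'R': {'G','BK'}, 'G': {'B'}, 'B': {'BK','R'}}
    # Green remains perfectly predictable, but Red and Blue do not.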

    Uncertainty and the Newtonian experience

In today’s physics, non-determinism (the future is not decided) or at least indeterminism (the inability to predict) is widely and popularly accepted. But although these effects may dominate at very small and very large scales, for much of human-scale experience the physical world is essentially a predictable, Newtonian system. Indeed, human perception encodes Newtonian laws so exactly that acceleration of objects induces an expectation of animate agency.
In a Newtonian world, every action is “explained” or caused by some other physical event; it is simply necessary to have a sufficiently comprehensive understanding and measurement of the world to be able to discover the causes of all observed events.
    This intuition is important because it motivates the construction of an arbitrarily large hierarchy of symbols and their relations in time and space, with the expectation that somewhere in that hierarchy all events can be understood. It doesn’t matter that some events cannot be explained; we just need to get enough practical value to motivate construction of the hierarchy. The Newtonian nature of our experiences at human scale means that most external events are comprehensible and predictable. 

    Finding Higher-Order Patterns in Markov Chains

    The use of longer sequences of colours to better explain (predict) the future shows that confusing, unpredictable systems may become predictable when more complex causes are understood. Longer histories (higher-order representations) can reveal more complex sequences that are predictable, even when the simpler parts of these patterns are not predictable by themselves.

The “Newtonian world” assumption gives us good reason to expect that a lot of physical causes are potentially predictable, given a suitably complex model and sufficient data. Even human causes are often predictable. It is believed that people develop internal models of third-party behaviour (e.g. “theory of mind”), which may help with prediction. This evidence motivates us to try to discover and model these causes as part of an artificial general intelligence algorithm.

    Therefore, one purpose of the MPF hierarchy is to construct higher-order representations of an observed world, in hope of being able to find predictable patterns that explain as many events as possible. Given this knowledge, an agent can use the hierarchy to make good, informed decisions.

    Constructing Variable-Order Sequences using First-Order Predictions

There is one final concept to introduce before we can discuss some temporal pooler implementations: how to represent variable-order sequences of cell activation in a Sequence Memory. This is not trivial, because the sequences can be of unlimited length and complexity (depending on the data). However, for practical reasons, the resources used must be limited in some way. So, how can arbitrarily complex structures be represented using a structure of limited complexity?

Let’s define a sequence memory as a set of cells, where each cell represents a particular state. Let’s specify a fixed quantity of cells, and encode all first-order transitions between these cells. All such pairwise transitions can be encoded in a matrix of dimension cells × cells. This means that cells only trigger each other individually; only pairs of cells can participate in each sequence fragment.
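
As a minimal sketch of such a fixed-size, first-order sequence memory (Python; the names are ours, for illustration only):

    import numpy as np

    n_cells = 7
    weights = np.zeros((n_cells, n_cells))   # weights[i, j]: transitions cell i -> cell j

    def observe_transition(prev_cell: int, next_cell: int):
        weights[prev_cell, next_cell] += 1    # simple Hebbian-style count

    def predict_next(prev_cell: int) -> int:
        return int(np.argmax(weights[prev_cell]))   # most frequent successor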

    So, how can longer sequences be represented in the matrix? How can we differentiate between occurrences of the same observation in the context of different sequences?

    The “state splitting” method of creating a variable-order memory using only first-order relationships between cells. This is part of a figure from Hawkins, George and Niemasik’s “Sequence memory for prediction, inference and behaviour” paper. In this example there are 5 states A,B,C,D,E. We observe the sequences A,C,D and B,C,E. In subfigure (c), we see that for each state A,…,E there are a bunch of cells that respond to it. However, each cell only responds to specific instances of these states. Specifically, there are two cells (C1 and C2) that respond to state C. C1 responds to state C only after state A. C2 responds to an observation of state C only after state B. If we have only a single cell that responds to C, we lose the ability to predict D and E (see subfigure (d)). With two cells uniquely responding to specific instances of C (see subfigure (e)), we gain the ability to predict states D and E.  Prediction is improved by splitting state C, giving us a variable-order memory.

    An elegant answer is to have multiple cells that represent the same observation, but which only fire in unique sequences. This concept is nicely explained in Hawkins, George and Niemasik’s “Sequence memory for prediction, inference and behaviour” paper. They call it “state-splitting”, i.e. splitting cells that represent an observed state and having multiple cells each responding to specific instances of the state in different sequences.

    In the current CLA implementation, the same feature is achieved by having “columns” of cells that all respond to the same input in different sequence contexts (i.e. given a different set of prior cells). CLA says they share the same “proximal dendrite”, which defines the set of input bits that activate the cell. In our paper, we showed how a radial inhibition function could induce sequence-based specialization of Self-Organising Map (SOM) cells into a variable-order sequence memory:

    Creation of a variable order sequence memory in a Self-Organising Map (SOM) using inhibition. The circles represent cells that respond to the same input, in this case the letters A,B,C or D. We can use first-order sequence relationships to cause specialization of cells to specific instances of each letter. Blue lines represent strong first-order sequence relationships. The edge i–>k promotes k responding to “B” and inhibits x. Eventually k only responds to AB and x only responds to CB. 

    In all cases, the common element is having multiple cells that respond to the same input, but only after specific sequences of prior inputs.
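
Here is a minimal sketch of the shared idea (Python; our illustration, not the CLA or SOM implementation): allocate a distinct cell for each (input, prior cell) context, splitting states on demand:

    # Map (observation, previous cell) pairs to cells, splitting states on demand.
    cells = {}   # (observation, prev_cell) -> cell id

    def activate(observation, prev_cell):
        key = (observation, prev_cell)
        if key not in cells:
            cells[key] = len(cells)   # split: allocate a new cell for a new context
        return cells[key]

    # The sequences A,C,D and B,C,E give two distinct cells for state C:
    c1 = activate("C", activate("A", None))   # C after A
    c2 = activate("C", activate("B", None))   # C after B
    print(c1 != c2)   # -> True: state C has been split

In a real implementation the pool of cells would be fixed in advance, with competitive learning deciding which contexts earn a dedicated cell.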

    So, returning to our example problem above with 2 sequences of colours, RGB and GBR, what is the ideal sequence-memory representation using a finite set of cells, multiple cells for each input depending on sequence context, and only first-order relationships between cells? One good solution is shown below:

    Example modelling of our test problem using a variable order sequence memory. We have 7 cells in total. Each cell responds to a single colour. Only one cell responds to black (BK). Having two cells responding to each colour R,G and B allows  accurate prediction of all transitions, except where there is genuine uncertainty (i.e. edges originating at black).  The temporal pooler component should then be able to identify the two sequences (grey and pink shading) by learning these predictable sequences of cell activations. The temporal pooler will replace each sequence with a single “label”, which might be a cell that fires continuously for the duration of each sequence. Cells watching the temporal pooler output will notice fewer state changes, i.e. a more stable output.

    Let’s assume this graph of cells representing specific occurrences of each input colour (i.e. a Sequence Memory) provides the input to the Temporal Pooler. What is the ideal Temporal Pooler output?

    Well, we know that there are in fact 2 sequences, and a “decision” state that switches between them. The ideal sequence-memory and temporal pooler implementation would track all predictable state changes, and replace these sequences with labels that persist for the duration of each sequence. In this way, the problem is simplified; other cells watching the temporal pooler output would observe fewer state changes – only switching between Black, Sequence #1 and Sequence #2.
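
To make the ideal concrete, here is a toy sketch (Python; the cell and label names are ours) that relabels each fully-predictable run of cells with a single persistent label:

    # The two-cells-per-colour sequence memory above has 7 cells; every cell
    # inside a fully-predictable run shares one persistent output label.
    label = {"BK": "BK",
             "R1": "SEQ1", "G1": "SEQ1", "B1": "SEQ1",   # sequence R,G,B
             "G2": "SEQ2", "B2": "SEQ2", "R2": "SEQ2"}   # sequence G,B,R

    cell_activity = ["BK", "R1", "G1", "B1", "BK", "G2", "B2", "R2", "BK"]
    print([label[c] for c in cell_activity])
    # -> ['BK','SEQ1','SEQ1','SEQ1','BK','SEQ2','SEQ2','SEQ2','BK']
    # An observer of the pooled output sees only 3 states instead of 7.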

    Markov Chain that is experienced by an observer of the output from the ideal sequence-memory and temporal pooler modelling shown in the figure above. The problem has been reduced from 7 states to 3. State transitions are only observed when transitioning to or from the Black state (BK). Otherwise, a constant state is observed.

    How can the ideal result be achieved? The next article will discuss how CLA simplifies sequences using the concepts described in this article, and the final article will discuss some alternative methods that we propose.