Predictive Coding/pyramidal cell/Rao & Ballard/unsupervised learning

Pyramidal Neurons and Predictive Coding

Posted by David Rawlinson on

Today’s post tries to fit the theoretical concept of Predictive Coding with the unusual structure and connectivity of Pyramidal cells in the Neocortex.

A reconstruction of a pyramidal cell (source: Wikipedia / Wikimedia Commons). Soma and dendrites are labeled in red, axon arbor in blue. 1) Soma (cell body) 2) Basal dendrite (feed-forward input) 3) Apical dendrite (feed-back input) 4) Axon (output) 5) Collateral axon (output).

Pyramidal neurons

Pyramidal neurons are interesting because they are one of the most common neuron types in the computational layers of the neocortex. This almost certainly means they are critical to many of the key cortical functions, such as forming representations of knowledge and reasoning about the world.

Anatomy of a Pyramidal Neuron

Pyramidal neurons are so-called because they tend to have a triangular body (soma). But this isn’t the most interesting feature! While all neurons have dendrites (inputs) and at least one axon (output), Pyramidal cells have more than one type of input – Basal and Apical dendrites.

Apical Dendrite

Pyramidal neurons tend to have a single, long Apical dendrite that extends with few forks a long way from the body of the neuron. When it reaches layer 1 of the cortex (which contains mostly top-down feedback from cortical areas that are believed to represent more abstract concepts), the apical dendrite branches out. This suggests the apical dendrite likes to receive feedback input. If feedback represents more abstract, longer-term context, then this data would be useful for predicting bottom-up input. More on this later.

Basal Dendrites

Pyramidal cells tend to have a few Basal dendrites that branch almost immediately, in the vicinity of the cell body. Note that this means the input provided to basal and apical dendrites is physically separated. We know from analysis of cortical microcircuits that axons terminating around the body of pyramidal cells in cortex layers 2 and 3 contain bottom-up data that is propagating in a feed-forward direction – i.e. information about the external state of the world.


Pyramidal cells have a single Axonal output that may fork, and may travel a very long distance to its targets including other areas of the cortex.

Predictive Coding

Predictive Coding (PC) is a method of transforming data from its original form into a representation in terms of prediction errors. There’s not much interest in PC in the Machine Learning community, but in Neuroscience there is substantial evidence that the Cortex encodes information in this way. Similar but unrelated concepts have also been used for efficient compression of data in signal processing. The benefit of this transformation is compression: We assume that only prediction errors are important, because by definition, everything else can be predicted and is therefore sufficiently described elsewhere.
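To make the idea of "a representation in terms of prediction errors" concrete, here is a minimal Python sketch. It assumes the simplest possible predictor (each sample is predicted to equal the previous one), which is our choice for illustration and closer to the signal-processing use than to any cortical model. The function names are hypothetical.

```python
from typing import List


def encode_residuals(signal: List[int]) -> List[int]:
    """Replace each sample with its prediction error.

    Assumed predictor: the next sample equals the previous one
    (the first prediction is 0).
    """
    residuals = []
    prediction = 0
    for sample in signal:
        residuals.append(sample - prediction)
        prediction = sample
    return residuals


def decode_residuals(residuals: List[int]) -> List[int]:
    """Rebuild the original signal by re-running the same predictor."""
    signal = []
    prediction = 0
    for error in residuals:
        sample = prediction + error
        signal.append(sample)
        prediction = sample
    return signal


signal = [5, 5, 5, 6, 6, 6, 6, 2]
residuals = encode_residuals(signal)
print(residuals)                    # mostly zeros wherever the predictor was right
print(decode_residuals(residuals))  # identical to the original signal
```

Wherever the prediction is correct the residual is zero, so a predictable signal becomes a sparse one. With this particular predictor nothing is lost, because the decoder can reproduce every prediction.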

There are several research groups looking at computational models of Predictive Coding – in particular those of Karl Friston and Andy Clark.

Two uses for feedback

Assuming feedback contains a more processed and abstract representation of a broader set of data, it has two uses.

  • Prediction for a more efficient representation of the world (e.g. Predictive Coding)
  • Prediction for more robust interpretation (via integration of top-down information in perception)

Predictive coding aims to transform the representation inside the cortex to a more efficient one that encodes only the relationships between prediction errors. Take some time to decide for yourself whether this loses anything…!

But there are many perceptual phenomena that show how internal state affects perception and interpretation of external input. For example, the phenomenon of multistable perception in some visual illusions: We need to know what we’re looking for before we can see it, and we can deliberately change from one interpretation to another (see figure).

A Necker Cube: This object can be interpreted in two distinct ways; as a cube from slightly above or slightly below. With a little practice you can easily switch between interpretations. One explanation of this is that a high-level decision as to the preferred interpretation is provided as feedback to hierarchically-lower processing areas.

Now consider Bayesian inference, such as Belief Propagation, or Markov Random Fields – in all cases we combine a Prior (e.g. top-down feedback) with a Likelihood produced from current, bottom-up data. Good inference depends on effective integration of both inputs.
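As a small worked example of combining a prior with a likelihood, consider the Necker Cube. The sketch below assumes the bottom-up data are equally compatible with both interpretations, so the top-down prior alone decides the outcome. The hypothesis names and all numbers are hypothetical, chosen only to illustrate Bayes' rule.

```python
from typing import Dict


def posterior(prior: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes' rule over a discrete set of hypotheses: posterior ~ prior * likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("prior and likelihood have no overlapping support")
    return {h: value / total for h, value in unnormalised.items()}


# Bottom-up evidence: the line drawing fits both interpretations equally well.
likelihood = {"from_above": 0.5, "from_below": 0.5}

# Top-down feedback: a high-level preference for one interpretation.
prefer_above = {"from_above": 0.9, "from_below": 0.1}
prefer_below = {"from_above": 0.1, "from_below": 0.9}

print(posterior(prefer_above, likelihood))
print(posterior(prefer_below, likelihood))
```

With ambiguous evidence the posterior simply follows the prior, which is one way to read the deliberate switching between interpretations described above.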

Ideally we would be able to resolve how both the modelling and inference benefits could be realized in the pyramidal cell, and how physical segregation of apical & basal dendrites might help this happen.

False-Negative Error Coding

The simplest scheme for predictive coding is simply to propagate only false-negative errors – where something was observed, but it was not predicted in advance. In this encoding, if the event was predicted, simply suppress any output. (Note: This assumes that another mechanism limits the number of false-positive errors – for example a homeostatic system to limit the total number of predictions.)

When a neuron fires, it represents a set of coincident input on a number of synapses. A pattern of input was observed. If the neuron was in a “predicted” state, immediately prior to firing, then we could safely suppress the output and achieve a simple predictive coding scheme. If a neuron is not in a predicted state when it fires, then the output should be propagated as normal.

False-Negative Error Coding in Pyramidal Cells

Since Pyramidal cells have 2 distinct inputs – basal and apical dendrites – we can implement the false negative coding as follows:

  • Basal dendrites recognize patterns of bottom-up input; the neuron “represents” those patterns by generating a spike on its axonal output when stimulated by the basal dendrites.
  • Apical dendrite learns to detect input that allows the cell’s spiking to be predicted. The apical dendrite determines the “predicted” state of the cell. Top-down feedback input is used for this purpose.
  • If the cell is “predicted” when the basal dendrite tries to generate an output, then suppress that output.
  • The cell internally self-regulates to ensure that it is rarely in a predicted state, and typically only at the right times.
  • Physical segregation of the two dendrite types ensures that they can target feedback data for prediction and feed-forward data for classification.
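The core of the scheme listed above can be written as a single boolean rule. This is only a sketch of the logic, not a biophysical model: the function name is hypothetical, and each dendrite type is reduced to one true/false flag.

```python
def axonal_output(basal_active: bool, apical_predicted: bool) -> bool:
    """False-negative coding: fire only if a pattern was observed but not predicted."""
    return basal_active and not apical_predicted


# Enumerate all four combinations of the two (segregated) inputs.
for basal_active in (False, True):
    for apical_predicted in (False, True):
        fires = axonal_output(basal_active, apical_predicted)
        print(f"basal={basal_active!s:5} apical={apical_predicted!s:5} -> output={fires}")
```

Only one of the four cases produces output: basal input recognised a pattern while the apical dendrite had not put the cell into a predicted state.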

Spike bursts (spike trains)

When Pyramidal cells fire, they usually don’t fire just once. They tend to generate a short sequence of spikes known as a “burst” or “train”. So it’s possible that False-Negative coding doesn’t completely eliminate the spike, but rather truncates the sequence of spikes, making the output far weaker and less likely to drive activity in other cells. There may also be some benefit to being able to broadcast the event in a subtle way, perhaps as a form of timing signal.

Time series plots of typical spike trains produced by pyramidal cells.

So to find evidence for this theory, we could look for truncated or absent spike trains in the presence of predictive input to the apical dendrite. Specifically, we would expect to observe that input causing a spike in the apical dendrite truncates or eliminates the spike train that would otherwise result from basal stimulation.

Is there any direct neurological evidence for different integration of spikes from Apical and Basal dendrites in Pyramidal cells? It turns out, yes, there is! Metz, Spruston and Martina [1] say: “… our data present evidence for a dendritic segregation of Kv1-like channels in CA1 pyramidal neurons and identify a novel action for these channels, showing that they inhibit action potential bursting by restricting the size of the [afterdepolarization]”.

Now for the AI/ML audience it’s necessary to translate this a bit. An “action potential” “occurs when the membrane potential (voltage) of a specific axon location rapidly rises and falls. Action potentials in neurons are also known as ‘nerve impulses’ or ‘spikes’.” So bursting is the generation of a short sequence of rapid spikes.

So in other words, apical stimulation inhibits bursts of axonal output spikes from a pyramidal neuron. There’s our smoking gun!

According to this paper, the Apical dendrite uniquely inhibits the spike burst from soma (the basal dendrites don’t). This matches the behaviour we would expect, if pyramidal cells implement false-negative predictive coding via the different inputs to different dendrite types: If the apical dendrite fires, there’s no axonal burst. If there wasn’t a spike in the apical dendrite, but basal activity drives the cell over its threshold, then the cell output does burst.

Note there are many other papers with similar claims; we found search terms such as “differential basal apical dendrite integration” to be helpful.

[1] “Dendritic D-type potassium currents inhibit the spike afterdepolarization in rat hippocampal CA1 pyramidal neurons”
Alexia E. Metz, Nelson Spruston and Marco Martina. J. Physiol. 581.1 pp 175–187 (2007)


We’ve seen how we might combine two observed phenomena in a simple model of pyramidal cell function: multistable perception, via separation of feed-forward and feedback input to the basal and apical dendrites respectively, and predictive coding, via false-negative error coding.

Unlike existing models of predictive coding within the cortex, which often posit separate populations of cells representing predictions and residual errors (e.g. Rao and Ballard, 1999), we have proposed that coding could occur within the known biology of individual pyramidal cells, due to the different integration of apical and basal dendrite activity. At the same time, the proposed method allows feedback and feedforward information to be integrated within the same mechanism.

Over the next few months we’ll be testing some of these ideas in simulation!

AGI/Artificial General Intelligence/columns/Hierarchical Generative Models/Predictive Coding/pyramidal cell/sparse coding/Sparse Distributed Representations/symbol grounding problem

The Region-Layer: A building block for AGI

Posted by ProjectAGI on
Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar – overlapping – but unique Receptive Fields – illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column’s input. 

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner; and can be sandwiched by other Region-Layers, on parallel surfaces, by which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, another Region-Layer the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see earlier articles on Layers and Levels.
The Region-Layer has a biological analogue – it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.
We assume the Region-Layer is trained by unsupervised learning only – it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online, learning as an agent from experience. It should adapt to non-stationary input statistics at any time.
The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.
We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols. 
Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.
So let’s put our cards on the table and test these ideas. 

Region-Layer Implementation


For the algorithm outlined below very few parameters are required. The few that are mentioned are needed merely to describe the resources available to the Region-Layer. In theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.
  • RW: Width of region layer in Columns
  • RH: Height of region layer in Columns
  • CW: Width of column in Cells 
  • CH: Height of column in Cells

Inputs and Outputs

  • Feed-Forward Input (FFI): Must be sparse, and binary. Size: A matrix of any dimension*.
  • Feed-Back Input (FBI): Sparse, binary. Size: A vector of any dimension
  • Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+
  • Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+
* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.
+  Region Area = CW * CH * RW * RH


    Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.
    We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!
    function: UpdateAndTrain( feed_forward_input, feed_back_input, prediction_disinhibition ) {

      // if no active input, then do nothing
      if( sum( feed_forward_input ) == 0 ) {
        return
      }

      // Sparse activation
      // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm, 
      // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
      activity(t) = 0

      for-each( column c ) {
        // find cell x that most responds to FFI 
        // in current sequential context given: 
        //  a) prior active cells in region 
        //  b) feedback input.
        x = findBestCellsInColumn( feed_forward_input, feed_back_input, c )

        activity(t)[ x ] = 1
      }

      // Change detection
      // if active cells in region unchanged, then do nothing
      if( activity(t) == activity(t-1) ) {
        return
      }

      // Update receptive fields to organize columns
      trainReceptiveFields( feed_forward_input, columns )

      // Update cell weights given column receptive fields
      // and selected active cells
      trainCells( feed_forward_input, feed_back_input, activity(t) )

      // Predictive coding: output false-negative errors only [5]
      for-each( cell x in region-layer ) {

        coding = 0

        // cell is active now, but was not predicted at the previous step
        if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
          coding = 1
        }

        // optional: mute output from region, for attentional gating of hierarchy
        if( prediction_disinhibition(t)[x] == 0 ) {
          coding = 0 
        }

        output(t)[x] = coding
      }

      // Update prediction
      // Note: Predictor can be as simple as first-order Hebbian learning. 
      // The prediction model is variable order due to the inclusion of sequential 
      // context in the active cell selection step.
      trainPredictor( activity(t), activity(t-1) )
      prediction(t) = predict( activity(t) )
    }
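
As a rough, runnable illustration of the last two steps of the pseudocode (false-negative coding and the predictor update), here is a minimal Python sketch. It is not the Region-Layer implementation: the class `HebbianPredictor`, the functions `code_output` and `run_sequence`, the 0.5 threshold and the one-cell-at-a-time toy input are all assumptions made for this example. Cell selection and receptive-field training are skipped entirely.

```python
from typing import List, Optional


class HebbianPredictor:
    """First-order predictor: counts how often cell j is active right after cell i."""

    def __init__(self, n_cells: int, threshold: float = 0.5) -> None:
        self.counts = [[0] * n_cells for _ in range(n_cells)]
        self.totals = [0] * n_cells
        self.threshold = threshold  # assumed cut-off for making a prediction

    def train(self, activity_now: List[int], activity_prev: List[int]) -> None:
        for i, was_active in enumerate(activity_prev):
            if was_active:
                self.totals[i] += 1
                for j, is_active in enumerate(activity_now):
                    if is_active:
                        self.counts[i][j] += 1

    def predict(self, activity_now: List[int]) -> List[int]:
        prediction = [0] * len(activity_now)
        for i, is_active in enumerate(activity_now):
            if is_active and self.totals[i] > 0:
                for j in range(len(activity_now)):
                    if self.counts[i][j] / self.totals[i] >= self.threshold:
                        prediction[j] = 1
        return prediction


def code_output(activity: List[int], prediction_prev: List[int],
                gate: Optional[List[int]] = None) -> List[int]:
    """Output 1 only for cells that are active now but were not predicted."""
    output = [int(a == 1 and p == 0) for a, p in zip(activity, prediction_prev)]
    if gate is not None:
        # Optional muting, mirroring the disinhibition check in the pseudocode.
        output = [o if g == 1 else 0 for o, g in zip(output, gate)]
    return output


def run_sequence(sequence: List[int], n_cells: int) -> List[List[int]]:
    """Feed a sequence of active cell indices; return the coded output per step."""
    predictor = HebbianPredictor(n_cells)
    prev_activity = [0] * n_cells
    prediction = [0] * n_cells
    outputs = []
    for active_cell in sequence:
        activity = [0] * n_cells
        activity[active_cell] = 1
        outputs.append(code_output(activity, prediction))
        predictor.train(activity, prev_activity)
        prediction = predictor.predict(activity)
        prev_activity = activity
    return outputs


for step, output in enumerate(run_sequence([0, 1, 2] * 3, n_cells=3)):
    print(step, output)
```

On a repeating sequence the output is non-zero only while transitions are still unfamiliar; once every transition has been seen, the coded output falls silent.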
    Deep Learning/Friston/Generalized LSTM/Graves/ICML/Long Short Term Memory/LSTM/Monner/Predictive Coding/Reading List

    Reading list – July 2015

    Posted by ProjectAGI on
    This month’s reading list continues with a subtheme on recurrent neural networks, and in particular Long Short Term Memory (LSTM).

    First here’s an interesting report on a panel discussion about the future of Deep Learning at the International Conference on Machine Learning (ICML), 2015:

    Participants included Yoshua Bengio (University of Montreal), Neil Lawrence (University of Sheffield), Juergen Schmidhuber (IDSIA), Demis Hassabis (Google DeepMind), Yann LeCun (Facebook, NYU) and Kevin Murphy (Google).

    It was great to hear the panel express an interest in some of our favourite topics, notably hierarchical representation, planning and action selection (reported as sequential decision making) and unsupervised learning. From the Deep Learning community this is a new focus – most DL is based on supervised learning.
    In the Q&A session, it was suggested that reinforcement learning be used to motivate the exploration of search-spaces to train unsupervised algorithms. In robotics, robustly trading off the potential reward of exploration vs using existing knowledge has been a hot topic for several years (example).
    The theory of Predictive Coding suggests that the brain strives to eliminate unpredictability. This presents difficulties for motivating exploration – critics have asked why we don’t seek out quiet, dark solitude! Friston suggests that prior expectations balance the need for immediate predictability with improved understanding in the longer term. For a good discussion, see here.
    Our in-depth reading this month has continued on the theme of LSTM. The most thorough introduction we have found is Alex Graves’ “Supervised Sequence Labelling with Recurrent Neural Networks”:
    However, a critical limitation of LSTM as presented in Graves’ work is that online training is not possible – so you can’t use this variant of LSTM in an embodied agent.
    The best online variant of LSTM seems to be Derek Monner’s Generalized LSTM algorithm, introduced in D. Monner and J. A. Reggia (2012). “A generalized LSTM-like training algorithm for second-order recurrent neural networks”. You can download the paper from Monner’s website here:
    We’ll be back with some actual code soon, including our implementation of Generalized LSTM. And don’t worry, we’ll be back to unsupervised learning soon with a focus on Growing Neural Gas.
    'no input' state/CLA/missing data/Predictive Coding/Sparse Distributed Representations/Theory

    When is missing data a valid state?

    Posted by Gideon Kowadlo on

    By Gideon Kowadlo, David Rawlinson and Alan Zhang

    Can you hear silence or see pitch black?
    Should we classify no input as a valid state or ignore it?

    To my knowledge, the machine learning and statistics literature mainly regards an absence of input as missing data. There are several ways that it’s handled. It can be considered a missing data point, in which case a value is inferred and then treated as the real input. When a period of no data occurs at the beginning or end of a stream (time series data), it can be ignored, which is referred to as censoring. Finally, when there is a variable that can never be (or never is) observed, it can be viewed as data that is always missing, and modelled with what are referred to as latent or hidden variables. I believe there is more to the question of whether an absence of input is in fact a valid state, particularly when learning time varying sequences and when considering computational parallels of biological processes where an absence of signal might never occur.

    It is also relevant in the context of systems where ‘no signal’ is an integral type of message that can be passed around. One such system is Predictive Coding (PC), which is a popular theory of cortical function within the neuroscience community. In PC, prediction errors are fed forward (see PC post [1] for more information). Therefore, perfectly correct predictions result in ‘no-input’ in the next level, which may occur from time to time given it is the objective of the encoding system.

    Let’s say your system is classifying sequences of colours Red (R), Green (G) and Blue (B), with periods of no input which we represent as Black (K). There is a sequence of colours RGB, followed by a period of K, then BGR and then two steps of K again, illustrated below (the figure is a Markov graph representation).

    Figure 1: Markov graph representation of a sequence of colour transitions.
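
To make the example tangible, the sketch below estimates first-order transition probabilities from the colour sequence described above, treating K like any other symbol. The sequence string, the three repetitions and the function name are assumptions for illustration. A first-order table over raw symbols is also a simplification: the graph in Figure 1 may distinguish repeated occurrences of the same colour.

```python
from collections import Counter, defaultdict
from typing import Dict


def transition_probabilities(sequence: str) -> Dict[str, Dict[str, float]]:
    """Estimate P(next symbol | current symbol) from one observed stream."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(next_counts.values()) for nxt, n in next_counts.items()}
        for state, next_counts in counts.items()
    }


# RGB, one step of K, BGR, two steps of K, repeated three times.
observed = "RGBKBGRKK" * 3
for state, row in transition_probabilities(observed).items():
    print(state, row)
```

Here K gets its own row in the table, including a K to K self-transition for the two-step pause, so it is modelled like any other state.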

    What’s in a name?
    What actually defines Black as no input?

    This question is explored in the following paragraphs along with Figure 2 below. We start with the way the signal is encoded. In the case of an image, each pixel is a tuple of scalar values, including black (K) with a value of (0, 0, 0). No specific component value has a privileged status. We could define black as any scalar tuple. For other types of sensors, signal modulation is used to encode information. For example, frequency of binary spikes/firing is used in neural systems. No firing, or more generally no change, indicates no input. Superficially it appears to be qualitatively different. However, a specific modulation pattern can be mapped to a specific scalar value. Are they therefore equivalent?

    We reach some clarity by considering the presence of a clock as a reference. The use of signal modulation implies the requirement of a clock, but does not necessitate it. With an internal clock, modulation can be measured in a time-absolute* sense, the modulation can be mapped to a scalar representation, and the status of the no-input state does indeed become equivalent to the case of a scalar input with a clock i.e. no value is privileged.

    Where there is no clock, for either type of signal encoding, time can effectively stand still for the system. If the input does not change at all, there is no way to perceive the passage of time. For scalar input, this means that the input does not transition. For modulated input, it includes the most obvious type of ‘no-input’, no firing or zero frequency.

    This would obviously present a problem to an intelligent agent that needs to continue to predict, plan and act in the world. Although there are likely to be inputs to at least some of the sensors, it suggests that biological brains must have an internal clock. There is evidence that the brain has multiple clocks, summarised here in Your brain has two clocks [2]. I wonder if the time course of perceptible bodily processes or thoughts themselves could be sufficient for some crude perception of time.

    Figure 2: Definition of ‘no-input’ for different system characteristics.
    * With respect to the clock at least. This does give rise to the interesting question of the absoluteness of the clock itself. Assume for argument’s sake that consciousness can be achieved with deterministic machines. The simulated brain won’t know how fast time is running. You can pause it and resume without it being any wiser.
    If we assume that we can define a ‘no-input’ state, how would we approach it?

    The system could be viewed as an HMM (Hidden Markov Model). The sensed/measured states represent hidden world states that cannot be measured directly. Let us make many observations of the ‘no-input’ state, look at the statistics of its occurrence, and compare these to those of the other observable states. If the statistics are similar, we can assume option A – no special meaning. If, on the other hand, it occurs between the other observable sequences, sequences which are not correlated with each other, and is therefore not significantly correlated with any transitions, then we can say that it is B – a delineator.

    A – no special meaning

    There are two options, treat K as any other state, or ignore it. For the former, it’s business as usual. For the latter, ‘ignoring the input’, there don’t seem to be any consequences for the following reason. The system will identify at least two shorter sequences, one before K and one after. Any type of sequence learning must anyway have an upper limit on the length of the representable sequences* (unlike the theoretical Turing Machine); this will just make those sequences shorter. In the case of hierarchical algorithms such as HTM/CLA, higher levels in the hierarchy will integrate these sub sequences together into longer (more abstracted) temporal sequences.

    However, ignoring K will have implications for learning the timing of state persistence and transitions. If the system ignores state K including the timing information, then modelling will be incomplete. For example, referring back to Figure 1, K occurs for two time steps before the transition back to R. This is important information for learning to predict when this R will occur. Additionally, the transition to K signalled the end of the occurrence of R preceding K. Another example is illustrated below in Figure 3. Here, K following B is a fork between two sub chains. The transition to R occurs 95% of the time. That information can be used to make a strong prediction about future transitions from this K, however if K is ignored, as shown on the right of the figure, the information is ignored and the prediction is not possible.

    Figure 3: Markov chain showing some limitations of ignoring K.
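
The limitation can be reproduced with a few lines of Python. The sketch assumes that "ignoring K" means the learner drops the symbol and starts a new sequence after it, as described above. The stream, the rare branch symbol G and the function names are hypothetical; only the 95% figure comes from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional


def learn_transitions(stream: Iterable[str], ignore: Optional[str] = None) -> Dict[str, Counter]:
    """Count first-order transitions; an ignored symbol breaks the chain."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    previous = None
    for symbol in stream:
        if symbol == ignore:
            previous = None  # the sequence ends here, nothing is linked across the gap
            continue
        if previous is not None:
            counts[previous][symbol] += 1
        previous = symbol
    return dict(counts)


def probability(counts: Dict[str, Counter], state: str, following: str) -> Optional[float]:
    """P(following | state), or None if the state was never a learned predecessor."""
    if state not in counts:
        return None
    return counts[state][following] / sum(counts[state].values())


# B is followed by K, then by R 19 times out of 20 (95%) and by G once.
stream = list("BKR" * 19 + "BKG")

with_k = learn_transitions(stream)
without_k = learn_transitions(stream, ignore="K")

print(probability(with_k, "K", "R"))     # strong prediction available from K
print(probability(without_k, "B", "R"))  # no learned transition out of B
```

When K is kept, the learner can predict R from K with high confidence. When K is ignored, nothing connects B to what follows, so the same prediction cannot be made.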

    * However, it is possible to have the ability to represent sequences far longer than the expected observable sequences with enough combinatorial power, as described in CLA and argued to exist in biological systems.

    B – a delineator

    This is the case where the ‘no-input’ state is not correlated (above some significant measure) with any observed sequence. The premise of this categorisation is that, due to lack of correlation, it is an effectively meaningless state. However, it can be used to make inferences about the underlying state. Using the example from Figure 1, based on repeated observations, the statement could be made that R, G and B summarise hidden states. We can also surmise that there are states that generate white noise, in this example random selections of R, G, B or K. This can be inferred since we never observe the same signal twice when in those states. Observations of K are then useful for modelling the hidden states, which calls into question the definition of K as ‘no-input’.

    However, it may in fact be an absence of input. In any case, we did not observe any correlations with other sequences. Therefore in practice, this is similar to ‘A – no special meaning – ignore the state’. The difference is the semantic meaning of the ‘no-input’ state as a delineator. There is also no expectation that there is meaningful information in the duration of the absence of input. The ‘state’ is useful to indicate the sequence is finished, and therefore defines the timing of persistence of the last state of the sequence.

    CLA and hierarchical systems

    Let us turn our attention briefly to the context of HTM CLA [3]. CLA utilises Sparse Distributed Representations (see SDR post [4] for more information) as a common data structure in a hierarchical architecture. A given state, represented as an SDR, will normally be propagated to the level above which also receives input from other regions. It will therefore be represented as one (or possibly more) of many bits in the state above. Each bit is semantically meaningful. A ‘0’ should therefore be as meaningful as a ‘1’. The questions discussed above arise when the SDR is completely zero, which I’ll refer to as a ‘null SDR’.

    The presence of a null SDR depends on the input source, presence of noise and the implementation details of the encoders. In a given region, the occurrence of null SDRs will tend to dissipate, as the receptive field adjusts until a certain average complexity is observed. In addition, null SDRs become increasingly unlikely as you move up the hierarchy and incorporate larger and larger receptive fields, thus increasing the surface area for possible activity. If the null SDR can still occur occasionally, there may be times when it is significant. If it is not classified, will the higher levels in the hierarchy recover the ‘lost’ information? This question applies to other hierarchical systems and will be investigated in future posts.
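
A back-of-the-envelope calculation shows why larger receptive fields make a null SDR less likely. The sketch assumes each input bit is active independently with the same probability, which real SDRs will not satisfy exactly. The function name and the example numbers are hypothetical.

```python
def null_sdr_probability(bit_active_probability: float, receptive_field_bits: int) -> float:
    """P(all bits are 0), assuming independent bits with equal activation probability."""
    if not 0.0 <= bit_active_probability <= 1.0:
        raise ValueError("bit_active_probability must be in [0, 1]")
    if receptive_field_bits < 0:
        raise ValueError("receptive_field_bits must be non-negative")
    return (1.0 - bit_active_probability) ** receptive_field_bits


# Same per-bit sparsity (2% active), increasingly large receptive fields.
for bits in (10, 100, 1000):
    print(bits, null_sdr_probability(0.02, bits))
```

Under this assumption the probability of a null SDR shrinks geometrically with the number of bits in the receptive field.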

    So what ……. ?

    What does all of this mean for the design of intelligent systems? A realistic system will be operating with multiple sensor modalities and will be processing time varying inputs (regardless of the encoding of the signal). Real sensors and environments are likely to produce background noise, and in front of that, periods of no input in ways that are correlated with other environmental sequences, and in ways that are not – relating to the categorisations above ‘A – no special meaning’ and ‘B – a delineator’. There is no simple ‘so what’, but hopefully this gives us some food for thought and shows that it is something that should be considered. In future I’ll be looking in more depth at biological sensors and the nature of the signals that reach the cortex (are they ever completely silent?), as well as the implications for other leading machine learning algorithms.


    [1] On Predictive Coding and Temporal Pooling
    [2] Emilie Reas, Your brain has two clocks, Scientific American, 2013
    [3] HTM White Paper
    [4] Sparse Distributed Representations (SDRs)
    Adaptive/CLA/Competitive Learning/frontal cortex/HTM/Michael Ferrier/Predictive Coding/Reinforcement Learning/Sparse Distributed Representations

    Toward a Universal Cortical Algorithm: Examining Hierarchical Temporal Memory in Light of Frontal Cortical Function

    Posted by ProjectAGI on
    This post is about a fantastic new paper by Michael R. Ferrier, titled:

    Toward a Universal Cortical Algorithm: Examining Hierarchical Temporal Memory in Light of Frontal Cortical Function

    The paper was posted to the NUPIC mailing list and can be found via:

    The paper itself is currently hosted at:

    It isn’t clear if this is going to be formally published in a journal at some point. If this happens we’ll update the link.

    So, what do we like about this paper?

    Purpose & Structure of the paper

    The paper is mostly a literature review and is very well referenced. This is a great introductory work to the topic.

    The paper aims to look at the evidence for the existence of a universal cortical algorithm – i.e. one that can explain the anatomical features and function of the entire cortex. It is unknown whether such an algorithm exists, but there is some evidence it might. Or, more likely, variants of the same algorithm are used throughout the cortex.

    The paper is divided into 3 parts. First, it reviews some relevant and popular algorithms that generate hierarchical models. These include Deep Learning, various forms of Bayesian inference, Predictive Coding, Temporal Slowness and Multi-Stage Hubel-Wiesel Architectures (MHWA). I’d never heard of MHWA before, though some of the examples (such as convolutional networks and HMAX) are familiar. The different versions of HTM are also described.

    It is particularly useful that the author puts the components of HTM in a well-referenced context. We can see that the HTM/CLA Spatial Pooler is a form of Competitive Learning and that the proposed new HTM/CLA Temporal Pooler is an example of the Temporal Slowness principle. The Sequence Memory component is trained by a variant of Hebbian learning.
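
    To make the Competitive Learning link concrete, here is a minimal winner-take-all sketch. It is a generic textbook form, not the HTM/CLA Spatial Pooler itself; the function name, the Euclidean distance measure and the learning rate are all assumptions chosen for illustration.

```python
import numpy as np

def competitive_learning_step(weights, x, learning_rate=0.1):
    """One winner-take-all update (illustrative, not the HTM Spatial Pooler)."""
    # The winner is the unit whose weight vector is closest to the input.
    distances = np.linalg.norm(weights - x, axis=1)
    winner = int(np.argmin(distances))
    # Only the winner learns: it moves a fraction of the way toward the input.
    weights[winner] += learning_rate * (x - weights[winner])
    return winner

rng = np.random.default_rng(0)
weights = rng.random((3, 4))          # 3 competing units, 4 inputs each
x = np.array([1.0, 0.0, 1.0, 0.0])    # a single input pattern

before = np.linalg.norm(weights - x, axis=1)
winner = competitive_learning_step(weights, x)
after = np.linalg.norm(weights - x, axis=1)
```

    After the step, the winning unit is closer to the input while the losing units are unchanged, which is the property that lets each unit come to specialise on a cluster of similar inputs.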

    These ties to existing literature are useful because they allow us to understand the properties of, and alternatives to, these algorithms: earlier research has thoroughly explored their capabilities and limitations.

    Although not an algorithm per se, Sparse Distributed Representations are explained particularly well. The author contrasts 3 types of representation: Localist (single feature or label represents state), Sparse and Dense. He argues that Sparse representations are preferable to Localist because the former can be gradually learnt and are more robust to small variations.
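
    The robustness argument can be illustrated with a toy comparison. This is only a sketch: representations are modelled as sets of active bit indices, and the `overlap` measure and the bit positions are assumptions made for the example, not taken from the paper.

```python
def overlap(a, b):
    """Fraction of the active bits in `a` that are also active in `b`."""
    return len(a & b) / len(a)

# Localist: exactly one active bit stands for the whole state.
localist = {7}
localist_noisy = {8}                     # the single bit has shifted

# Sparse: 20 active bits, spread over a much larger space.
sparse = set(range(0, 200, 10))
sparse_noisy = (sparse - {0}) | {5}      # one of the 20 bits has shifted

localist_match = overlap(localist, localist_noisy)
sparse_match = overlap(sparse, sparse_noisy)
```

    A one-bit change destroys the localist match entirely, while the sparse code still shares 19 of its 20 active bits with the original.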

    Frontal Cortex

    The second part of the paper reviews the biology of frontal cortex regions. These regions are not normally described in computational theories. Ferrier suggests this omission is because these regions are less well understood, so they offer less insight and support for theory.

    However these cortical areas are of particular interest because they are responsible for representation of tasks, goals, strategy and reward; the origin of goal-directed behaviour and motor control.

    Of particular interest to us is discussion of biological evidence for the hierarchical generation of motor behaviour and output to motors directly from cortex.

    Thalamus and Basal Ganglia

    The paper discusses the role of the Thalamus in gating messages between cortical regions, and reviews evidence that one purpose of the Striatum and Basal Ganglia could be to decide which messages are filtered in the Thalamus. This filtering is suggested to perform the roles of attention and control (which matches our own understanding of the same).

    There is a brief discussion of Reinforcement Learning (specifically, Temporal-Difference learning) as a computational analogue of Thalamic filter weighting. This has been exhaustively covered in the literature so wasn’t a surprise.
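
    For readers who have not met Temporal-Difference learning, the standard TD(0) value update looks like the sketch below. It is the generic textbook rule, not the paper's model of Thalamic filtering; the state names, `alpha` and `gamma` are assumptions chosen for illustration.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to a table of state values and return the TD error."""
    # TD error: how much better or worse the outcome was than predicted.
    td_error = reward + gamma * values[next_state] - values[state]
    # Nudge the value estimate in the direction of the error.
    values[state] += alpha * td_error
    return td_error

values = {"A": 0.0, "B": 0.0}   # hypothetical states with no prior knowledge
td_error = td0_update(values, "A", reward=1.0, next_state="B")
```

    Note that the quantity driving learning is itself a prediction error, which is why TD learning sits comfortably alongside the predictive theories discussed in the paper.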

    Towards a Comprehensive Model of Cortical Function

    The final part of the paper links the computational theories to the referenced biology. There are some interesting insights (such as that messages in the feedback pathway from layer 5 to layer 1 in hierarchically lower regions must be “expanding” time; personally I think these messages are being re-interpreted in expanded time form on receipt).

    Our general expectation is that feedback messages representing predicted state are being selectively biased or filtered towards “predicting” that the agent achieves rewards; in this case the biased or filtered predictions are synonymous with goal-seeking strategies.

    Overall the paper does a great job of linking the “ghetto” of HTM-like computational theories with the relevant techniques in machine learning and neurobiology.

    Baar/CLA/Cortical Learning Algorithm/Friston/Global Workspace/Lee & Mumford/Predictive Coding/Rao & Ballard/Ryan McCall

    Cortical Learning Algorithms with Predictive Coding for a Systems-Level Cognitive Architecture

    Posted by ProjectAGI on
    This is a quick post to link a poster paper by Ryan McCall, who has experimented with a Predictive-Coding / Cortical Learning Algorithm (PC-CLA) hybrid approach. We found the paper via Ryan writing to the NUPIC theory mailing list.

    What’s great about the paper is it links to some of the PC papers we mentioned in a previous post and covers all the relevant literature, with clear and detailed descriptions of key features of each method.

    So we have Lee & Mumford, Rao and Ballard, Friston (Generalized Filtering)… It’s also nice to see Baars’ Global Workspace Theory and LIDA (a model of consciousness or, at least, attention).

    Ryan has added a PC-CLA module to LIDA and tested robustness to varying levels of input noise. So, early days with the experiments, but a great start.

    CLA/Generative Models/Hierarchical Generative Models/HTM/Predictive Coding/Rao & Ballard/Temporal Pooling

    On Predictive Coding and Temporal Pooling

    Posted by ProjectAGI on


    Predictive Coding (PC) is a popular theory of cortical function within the neuroscience community. There is considerable biological evidence to support the essential concepts (see e.g. “Canonical microcircuits for predictive coding” by Bastos et al).

    PC describes a method of encoding messages passed between processing units. Specifically, PC states that messages encode prediction failures; when prediction is perfect, there is no message to be sent. The content of each message is the error produced by comparing predictions to observations.
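
    In its simplest form this can be sketched as follows. The function name and the `tolerance` parameter are assumptions for illustration; real PC models typically weight the error by its expected precision rather than applying a hard threshold.

```python
import numpy as np

def pc_message(prediction, observation, tolerance=1e-9):
    """Return the prediction error, or None when the prediction was correct."""
    error = np.asarray(observation, dtype=float) - np.asarray(prediction, dtype=float)
    if np.all(np.abs(error) <= tolerance):
        return None          # perfect prediction: nothing to send
    return error             # otherwise the message *is* the error

silent = pc_message(prediction=[1, 0], observation=[1, 0])
message = pc_message(prediction=[1, 0], observation=[0, 1])
```

    Here `silent` is `None` because the prediction matched the observation, whereas `message` carries the element-wise difference between what was observed and what was predicted.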

    A good introduction to the various theories and models under the PC umbrella has been written by Andy Clark (“Whatever next? Predictive brains, situated agents, and the future of cognitive science”). As Clark explains, the history of the PC concept goes back at least several decades to Ashby, who wrote: “The whole function of the brain is summed up in: error correction.” Mumford pretty much nailed the concept back in 1992, before it was known as predictive coding (the cited paper gives a good discussion of how the neocortex might implement a PC-like scheme).

    The majority of PC theories also model uncertainty explicitly, using Bayesian principles. This is a natural fit when providing explicit messaging of errors and attempting to generate predictions. Of course, it is also a robust framework for generative models.

    It can be difficult to search for articles regarding PC because a similar concept exists in Signal Processing, although this seems to be coincidental, or at least the connection goes back beyond our reading. Unfortunately, many articles on the subject are written at a high level and do not include sufficient detail for implementation. However, we found work by Friston et al (example) and Rao et al (example, example) to be well described, although the former is difficult to grasp if one is not familiar with dynamical systems theory.

    Rao’s papers include the application of PC to visual processing. Friston’s work includes the classification of birdsong and also extends the concept to the control of motor actions. Friston et al wrote a paper titled “Perceptions as hypotheses; saccades as experiments” in which they suggest that actions are carefully chosen to optimally reduce uncertainty in internal predictive models. The PC concept throws up interesting new perspectives on many topics!

    Comparison to MPF/CLA

    There are significant parallels between MPF/CLA and PC. Both postulate a hierarchy of processing units with FeedForward (FF) and reciprocal FeedBack (FB) connections. MPF/CLA explicitly aims to produce increasingly stable FF signals in higher levels of the hierarchy. MPF/CLA tries to do this by identifying patterns via spatial and temporal pooling, and replacing these patterns with a constant signal.

    Many PC theories create “hierarchical generative models” (e.g. Rao and Ballard). The hierarchy is enforced by restrictions on the topology of the model. The generative part refers to the fact that variables (in the Bayesian sense), in each vertex of the model, are defined by identification of patterns in input data. This agrees with MPF/CLA.

    Both MPF/CLA and PC posit that processing units use FB data from higher layers to improve local prediction. In conjunction with local learning, this serves to reduce errors and therefore, in PC, it also stabilizes FF output.

    In MPF/CLA it is assumed that cells’ input dendrites determine the set of inputs the cell represents. This performs a form of Spatial Pooling – the cell comes to represent a set of input cells firing simultaneously, and hence the cell becomes a label or symbol representing that set. In PC it is similarly assumed that the generative model will produce objects (cells, variables) that represent combinations of inputs.

    However, MPF/CLA and PC differ in their approach to Temporal Pooling, i.e. changes in input over time.

    Implicit Temporal Pooling

    Predictive coding does not expressly aim to produce stability in higher layers, but increasing stability over time is an expected side-effect of the technique. Assuming successful learning within a processing unit, its FF output will be stable (no signal) for the duration of any periods of successful prediction.

    Temporal Pooling in MPF/CLA attempts to replace FF input with a (more stable) pattern that is constantly output for the duration of some sequence of events. In contrast, PC explicitly outputs prediction errors whenever they occur. If errors do not occur, PC does not produce any output, and therefore the output is stable. A similar outcome has occurred, but via different processes.

    Since the content of PC messages differs from that of MPF/CLA messages, the meaning of the variables defined in each vertex of the hierarchy also changes. In MPF/CLA the variables will represent chains of sequences of sequences; in PC, variables will represent a succession of forks in sequences, where prediction failed.
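
    A toy example shows how silence during predicted sequences, and output only at forks, falls out of the scheme. This is a sketch under strong assumptions: the unit is a first-order transition table that remembers only the most recent successor of each symbol, and `run_unit` is a hypothetical name, not part of any PC or CLA implementation.

```python
def run_unit(sequence, transitions):
    """Emit a symbol only when the learned transition table failed to predict it."""
    output = []
    previous = None
    for symbol in sequence:
        predicted = transitions.get(previous)
        # Successful prediction means no FF message (None).
        output.append(None if predicted == symbol else symbol)
        # Learn: remember the most recent successor of the previous symbol.
        transitions[previous] = symbol
        previous = symbol
    return output

transitions = {}
first = run_unit("ABCD", transitions)    # nothing learnt yet: every symbol is an error
second = run_unit("ABCD", transitions)   # sequence now known: the unit is silent
third = run_unit("ABXD", transitions)    # a fork at X: output resumes there
```

    On the second pass the FF output is stable (silent) for the whole sequence. On the third pass the unit speaks only from the fork onwards, so a higher layer would see the forks rather than the sequence itself.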

    So it turns out that Predictive Coding is an elegant way to implement Temporal Pooling.

    Benefits of Predictive Coding

    Where PC gets really interesting is that the amplitude or magnitude of the FF signal corresponds to the severity of the error.  A totally unexpected event will cause a signal of large amplitude, whereas an event that was considered a possibility will produce a less significant output.

    This occurs because most PC frameworks model uncertainty explicitly, and these probability distributions can account for the possibility of multiple future events. Anticipated events will have some mass in the prior distribution; unanticipated events have very little prior probability. If the FF output is calculated as the difference between prior and posterior distributions, we naturally get an amplitude that is correlated with the surprise of the event.
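
    As a worked illustration, suppose the prior is a discrete distribution over three possible next events and the observation is unambiguous, so the posterior puts all its mass on the observed event. Both of these are simplifying assumptions, as is the choice of total variation distance as the “difference”; a PC framework might equally use a KL divergence.

```python
import numpy as np

def error_amplitude(prior, observed_index):
    """Total variation distance between the prior and a one-hot posterior."""
    prior = np.asarray(prior, dtype=float)
    posterior = np.zeros_like(prior)
    posterior[observed_index] = 1.0      # assumption: the observation is certain
    return 0.5 * np.abs(posterior - prior).sum()

prior = [0.7, 0.25, 0.05]                # hypothetical predicted probabilities

expected = error_amplitude(prior, 0)     # the most anticipated event occurs
possible = error_amplitude(prior, 1)     # an event considered possible occurs
surprise = error_amplitude(prior, 2)     # a nearly unanticipated event occurs
```

    With these numbers the amplitudes come out as 0.3, 0.75 and 0.95 respectively: the less prior mass an event had, the larger the FF signal.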

    This is a very useful property. We can distribute representational resources across the hierarchy, giving the resources preferentially to the regions where larger errors are occurring more frequently. These events are being badly represented and need improvement.

    In biological terms this response would be embodied as a proliferation of cells in columns receiving or producing large or frequent FF signals.
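
    One simple way to picture this allocation is to share a fixed budget of units between regions in proportion to their accumulated error. The sketch below is purely illustrative: the region names, the error totals and the proportional rule are all assumptions, and rounding means the shares will not always sum exactly to the budget.

```python
def allocate_units(total_units, accumulated_error):
    """Share a budget of units between regions in proportion to their error."""
    total_error = sum(accumulated_error.values())
    return {
        region: round(total_units * error / total_error)
        for region, error in accumulated_error.items()
    }

# Hypothetical accumulated FF error per region.
accumulated_error = {"region_a": 1.0, "region_b": 3.0, "region_c": 6.0}
allocation = allocate_units(100, accumulated_error)
```

    The region producing the largest errors receives the most units, which is the computational counterpart of the proliferation of cells described above.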

    Next post

    In the next post we will describe a hybrid Predictive-Coding / Memory Prediction Framework which has some nice properties, and is appealingly simple to implement. We will include some empirical results that show how well the two go together.