Category Archives



The Region-Layer: A building block for AGI

Posted by ProjectAGI on
Figure 1: The Region-Layer component. The upper surface in the figure is the Region-Layer, which consists of Cells (small rectangles) grouped into Columns. Within each Column, only a few cells are active at any time. The output of the Region-Layer is the activity of the Cells. Columns in the Region-Layer have similar – overlapping – but unique Receptive Fields – illustrated here by lines joining two Columns in the Region-Layer to the input matrix at the bottom. All the Cells in a Column have the same inputs, but respond to different combinations of active input in particular sequential contexts. Overall, the Region-Layer demonstrates self-organization at two scales: into Columns with unique receptive fields, and into Cells responding to unique (input, context) combinations of the Column’s input. 

Introducing the Region-Layer

From our background reading (see here, here, or here) we believe that the key component of a general intelligence can be described as a structure of “Region-Layer” components. As the name suggests, these are finite 2-dimensional areas of cells on a surface. They are surrounded by other Region-Layers, which may be connected in a hierarchical manner; and they can be sandwiched by other Region-Layers on parallel surfaces, by which additional functionality can be achieved. For example, one Region-Layer could implement our concept of the Objective system, and another the Subjective system. Each Region-Layer approximates a single Layer within a Region of Cortex, part of one vertex or level in a hierarchy. For more explanation of this terminology, see earlier articles on Layers and Levels.
The Region-Layer has a biological analogue – it is intended to approximate the collective function of two cell populations within a single layer of a cortical macrocolumn. The first population is a set of pyramidal cells, which we believe perform a sparse classifier function of the input; the second population is a set of inhibitory interneuron cells, which we believe cause the pyramidal cells to become active only in particular sequential contexts, or only when selectively dis-inhibited for other purposes (e.g. attention). Neocortex layers 2/3 and 5 are specifically and individually the inspirations for this model: Each Region-Layer object is supposed to approximate the collective cellular behaviour of a patch of just one of these cortical layers.
We assume the Region-Layer is trained by unsupervised learning only – it finds structure in its input without caring about associated utility or rewards. Learning should be continuous and online: the component learns from experience as an agent does, and it should adapt to non-stationary input statistics at any time.
The Region-Layer should be self-organizing: Given a surface of Region-Layer components, they should arrange themselves into a hierarchy automatically. [We may defer implementation of this feature and initially implement a manually-defined hierarchy]. Within each Region-Layer component, the cell populations should exhibit a form of competitive learning such that all cells are used efficiently to model the variety of input observed.
We believe the function of the Region-Layer is best described by Jeff Hawkins: To find spatial features and predictable sequences in the input, and replace them with patterns of cell activity that are increasingly abstract and stable over time. Cumulative discovery of these features over many Region-Layers amounts to an incremental transformation from raw data to fully grounded but abstract symbols. 
Within a Region-Layer, Cells are organized into Columns (see figure 1). Columns are organized within the Region-Layer to optimally cover the distribution of active input observed. Each Column and each Cell responds to only a fraction of the input. Via these two levels of self-organization, the set of active cells becomes a robust, distributed representation of the input.
Given these properties, a surface of Region-Layer components should have nice scaling characteristics, both in response to changing the size of individual Region-Layer column / cell populations and the number of Region-Layer components in the hierarchy. Adding more Region-Layer components should improve input modelling capabilities without any other changes to the system.
So let’s put our cards on the table and test these ideas. 

Region-Layer Implementation

Parameters

For the algorithm outlined below, very few parameters are required. The few that are mentioned are needed merely to describe the resources available to the Region-Layer. In theory, they are not affected by the qualities of the input data. This is a key characteristic of a general intelligence.
  • RW: Width of region layer in Columns
  • RH: Height of region layer in Columns
  • CW: Width of column in Cells 
  • CH: Height of column in Cells

Inputs and Outputs

  • Feed-Forward Input (FFI): Sparse, binary. Size: a matrix of any dimension*.
  • Feed-Back Input (FBI): Sparse, binary. Size: a vector of any dimension.
  • Prediction Disinhibition Input (PDI): Sparse, rare. Size: Region Area+.
  • Feed-Forward Output (FFO): Sparse, binary and distributed. Size: Region Area+.
* the 2D shape of input[s] may be important for learning receptive fields of columns and cells, depending on implementation.
+  Region Area = CW * CH * RW * RH
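
For concreteness, here is a minimal sketch of how these parameters and I/O sizes fit together. The class and field names are our own illustrative choices (in Python/NumPy), not the implementation described below:

    import numpy as np

    class RegionLayer:
        # Resource parameters only: in theory, independent of the input data.
        def __init__(self, rw, rh, cw, ch):
            self.rw, self.rh = rw, rh      # region size, in Columns
            self.cw, self.ch = cw, ch      # column size, in Cells
            self.area = cw * ch * rw * rh  # Region Area = CW * CH * RW * RH
            # One bit per cell; sparse and binary, as specified above.
            self.activity = np.zeros(self.area, dtype=np.uint8)         # drives FFO
            self.prediction = np.zeros(self.area, dtype=np.uint8)
            self.disinhibition = np.ones(self.area, dtype=np.uint8)     # PDI

    region = RegionLayer(rw=16, rh=16, cw=4, ch=4)
    print(region.area)  # 4096 cells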

Pseudocode

    Here is some pseudocode for iterative update and training of a Region-Layer. Both occur simultaneously.
    We also have fully working code. In the next few blog posts we will describe some of our concrete implementations of this algorithm, and the tests we have performed on it. Watch this space!
    function: UpdateAndTrain( 
      feed_forward_input, 
      feed_back_input, 
      prediction_disinhibition 
    )

    // if no active input, then do nothing
    if( sum( feed_forward_input ) == 0 ) {
      return
    }

    // Sparse activation
    // Note: Can be implemented via a Quilt[1] of any competitive learning algorithm, 
    // e.g. Growing Neural Gas [2], Self-Organizing Maps [3], K-Sparse Autoencoder [4].
    activity(t) = 0

    for-each( column c ) {
      // find the cell x in column c that best responds to FFI
      // in the current sequential context, given:
      //  a) prior active cells in the region (activity(t-1))
      //  b) feedback input.
      x = findBestCellInColumn( feed_forward_input, feed_back_input, activity(t-1), c )

      activity(t)[ x ] = 1
    }

    // Change detection
    // if active cells in region unchanged, then do nothing
    if( activity(t) == activity(t-1) ) {
      return
    }

    // Update receptive fields to organize columns
    trainReceptiveFields( feed_forward_input, columns )

    // Update cell weights given column receptive fields
    // and selected active cells
    trainCells( feed_forward_input, feed_back_input, activity(t) )

    // Predictive coding: output false-negative errors only [5]
    for-each( cell x in region-layer ) {

      coding = 0

      if( ( activity(t)[x] == 1 ) and ( prediction(t-1)[x] == 0 ) ) {
        coding = 1
      }
      // optional: mute output from region, for attentional gating of hierarchy
      if( prediction_disinhibition[ x ] == 0 ) {
        coding = 0 
      }

      output(t)[x] = coding
    }

    // Update prediction
    // Note: Predictor can be as simple as first-order Hebbian learning. 
    // The prediction model is variable order due to the inclusion of sequential 
    // context in the active cell selection step.
    trainPredictor( activity(t), activity(t-1) )
    prediction(t) = predict( activity(t) )
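
    As a vectorised illustration of the predictive-coding output step above – a sketch in NumPy with our own variable names, not part of the pseudocode proper:

        import numpy as np

        def predictive_coding_output(activity, prediction_prev, disinhibition):
            # Emit false-negative errors only: cells active now that were not
            # predicted at t-1, optionally gated by the disinhibition input.
            coding = activity & (1 - prediction_prev)   # active but unpredicted
            coding &= disinhibition                     # attentional gating (optional)
            return coding

        activity        = np.array([1, 1, 0, 1], dtype=np.uint8)
        prediction_prev = np.array([1, 0, 0, 0], dtype=np.uint8)
        disinhibition   = np.array([1, 1, 1, 1], dtype=np.uint8)
        print(predictive_coding_output(activity, prediction_prev, disinhibition))  # [0 1 0 1]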
    [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.1401&rep=rep1&type=pdf
    [2] https://papers.nips.cc/paper/893-a-growing-neural-gas-network-learns-topologies.pdf
    [3] http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf
    [4] https://arxiv.org/pdf/1312.5663
    [5] http://www.ncbi.nlm.nih.gov/pubmed/10195184

    Reading list – May 2016

    Posted by ProjectAGI on
    Digit classification error over time in our experiments. The image isn’t very helpful but it’s a hint as to why we’re excited 🙂

    Project AGI

    A few weeks ago we paused the “How to build a General Intelligence” series (part 1, part 2, part 3, part 4). We paused it because the next article in the series requires us to specify everything in detail, and we need working code to do that.

    We have been testing our algorithm on a variety of MNIST-derived handwritten digit datasets, to better understand how well it generalizes its representation of digit-images and how it behaves when exposed to varying degrees of predictability. Initial results look promising: We will post everything here once we’ve verified them and completed the first batch of proper experiments. The series will continue soon!

    Deep Unsupervised Learning

    Our algorithm is a type of Online Deep Unsupervised Learning, so naturally we’re looking carefully at similar algorithms.

    We recommend this video of a talk by Andrew Ng. It starts with a good introduction to the methods and importance of feature representation and touches on types of automatic feature discovery. He looks at some of the important feature detectors in computer vision, such as SIFT and HoG and shows how feature detectors – such as edge detectors – can emerge from more general pattern recognition algorithms such as sparse coding. For more on sparse coding see Shakir’s excellent machine learning blog.

    For anyone struggling to intuit deep feature discovery, I also loved this post on Y Combinator which nicely illustrates how and why deep networks discover useful features, and why the depth helps.

    The latter part of the video covers Ng’s latest work on deep hierarchical sparse coding using Deep Belief Networks, in turn based on AutoEncoders. He reports benchmark-beating results on video activity and phoneme recognition with this framework. You can find details of his deep unsupervised algorithm here:

    http://deeplearning.stanford.edu/wiki

    Finally he presents a plot suggesting that training dataset size is a more important determinant of eventual supervised network performance than algorithm choice! This is a fundamental limitation of supervised learning, where the necessary training data is much more limited than in unsupervised learning (in the latter case, the real world provides a handy training set!)

    Effect of algorithm and training set size on accuracy. Training set size more significant. This is a fundamental limitation of supervised learning.

    Online K-sparse autoencoders (with some deep-ness)

    We’ve also been reading this paper by Makhzani and Frey about deep online learning with auto-encoders (a neural network trained with supervised-learning machinery to reconstruct its own input, making the overall process unsupervised; this is sometimes described as self-supervised learning). Actually we’ve struggled to find any comparison of autoencoders to earlier methods of unsupervised learning, both in terms of computational efficiency and ability to cover the search space effectively. Let us know if you find a paper that covers this.

    The Makhzani paper has some interesting characteristics – the algorithm is online, which means it receives data as a stream rather than in batches. It is also sparse, which we believe is desirable from a representational perspective.
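
    For intuition, the heart of the k-sparse trick is simply keeping the k largest hidden activations and zeroing the rest. Here is a minimal sketch of the encoding step (our own toy code with a linear encoder, not the authors’ implementation):

        import numpy as np

        def k_sparse_encode(x, W, b, k):
            # Compute hidden activations, then keep only the k largest
            # and zero the rest, enforcing k-sparsity.
            h = W @ x + b
            h[np.argsort(h)[:-k]] = 0.0   # zero all but the top-k units
            return h

        rng = np.random.default_rng(0)
        x = rng.normal(size=64)                     # input vector
        W = rng.normal(scale=0.1, size=(256, 64))   # encoder weights
        h = k_sparse_encode(x, W, np.zeros(256), k=25)
        print(np.count_nonzero(h))  # 25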
    One limitation is that the solution is most likely unable to handle changes in input data statistics (i.e. non-stationary problems). This quality is important because, in any arbitrarily deep network, the typical position of a vertex is between higher and lower vertices. If all vertices are continually learning, the problem being modelled by any single vertex is constantly changing. Therefore, intermediate vertices must be capable of online learning of non-stationary problems, otherwise the network will not be able to function effectively. Makhzani and Frey instead use the greedy layerwise training approach from Deep Belief Networks. The authors describe this approach:
    “4.6. Deep Supervised Learning Results: The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layerwise pre-training (Bengio et al., 2007). We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers.”

    The limitation introduced can be thought of as an inability to escape from local minima that result from prior training. This paper by Choromanska et al tries to explain why this happens.
    Greedy layerwise training is an attempt to work around the fact that deep belief networks of Autoencoders cannot effectively handle nonstationary problems.

    For more information, here are some papers on deep sparse networks built from autoencoders:

    Variations on Supervised Learning – a Taxonomy

    Back to supervised learning, and the limitation of training dataset size. Thanks to a discussion with Jay Chakravarty we have this brief taxonomy of supervised learning workarounds for insufficient training datasets:

    Weakly supervised learning: [For poorly labelled training data] where you want to learn models for object recognition under weak supervision – you have say object labels for images, but no localization (e.g. bounding box) for the object in the image (there might be other objects in the image as well). You would use a Latent SVM to solve the problem of localizing the objects in the images, and at the same time learning a classifier for it.
    Another example of weakly supervised learning is that you have a bag of positive samples mixed up with negative training samples, but also have a bag of purely negative samples – you would use Multiple Instance Learning for this.

    Cross-modal adaptation: where one mode of data supervises another – e.g. audio supervises video or vice-versa.

    Domain adaptation: model learnt on one set of data is adapted, in unsupervised fashion, to new datasets with slightly different data distributions.

    Transfer learning: using the knowledge gained in learning one problem on a different, but related problem. Here’s a good example of transfer learning, a finalist in the NVIDIA 2016 Global Impact Award. The system learns to predict poverty from day and night satellite images, with very few labelled samples.

    Full paper:

    http://arxiv.org/pdf/1510.00098v2.pdf

    Interactive Brain Concept Map

    We enjoyed this interactive map of the distribution of concepts within the cortex captured using fMRI and produced by the Gallant Lab (source papers here).

    Using the map you can find the voxels corresponding to various concepts, which, although maybe not generalizable due to the small sample size (7), gives you a good idea of the hierarchical structure the brain has produced, and of what the intermediate concepts represent.

    Thanks to David Ray @ http://cortical.io for the link.

    Interactive brain concept map

    OpenAI Gym – Reinforcement Learning platform

    We also follow the OpenAI project with interest. OpenAI have just released their “Gym” – a platform for training and testing reinforcement learning algorithms. Have a play with it here:

    https://openai.com/blog/openai-gym-beta/

    According to Wired magazine, OpenAI will continue to release free and open source software (FOSS) for the wider impact this will have on uptake. There are many companies now competing to win market share in this space.

    The Talking Machines Blog

    We’re regular readers of this blog and have been meaning to mention it for months. Worth reading.

    How the brain generates actions

    A big gap in our knowledge is how the brain generates actions from its internal representation. This new paper by Vicente et al challenges the established (rather vague) dogma on how the brain generates actions.
    “We found that contrary to common belief, the indirect pathway does not always prevent actions from being performed, it can actually reinforce the performance of actions. However, the indirect pathway promotes a different type of actions, habits.”

    This is probably quite informative for reverse-engineering purposes. Full paper here.

    Hierarchical Temporal Memory

    HTM is an online method for feature discovery and representation and now we have a baseline result for HTM on the famous MNIST numerical digit classification problem. Since HTM works with time-series data, the paper compares HTM to LSTM (Long-Short-Term Memory), the leading supervised-learning approach to this problem domain.

    It is also interesting that the paper deals with adaptation to sudden changes in the input data statistics, the very problem that frustrates the deep belief networks described above.

    Full paper by Cui et al here.

    For a detailed mathematical description of HTM see this paper by Mnatzaganian and Kudithipudi.


    SDR-RL (Sparse, Distributed Representation with Reinforcement Learning)

    Posted by ProjectAGI on
    Erik Laukien is back with a demo of Sparse, Distributed Representation with Reinforcement Learning.

    This topic is of intense interest to us, although the problem is quite a simple one. SDRs are a natural fit with Reinforcement Learning because bits jointly represent a state. If you associate each bit-pattern with a reward value, it is easy to determine the optimum action.

    However, since this is an enormous state-space, it is not practical to do so. Instead, one might associate rewards only with the bit patterns actually observed, or cluster them somehow to reduce the number of reward values that must be stored. Anyway, these are thoughts for another day.

    Here’s his explanation of the demo:

    http://twistedkeyboardsoftware.com/?p=90

    Here’s the demo itself. Note, we had to set the stepsPerFrame parameter to 100 to get it working quickly.

    http://twistedkeyboardsoftware.com/?p=1


    “Quantum computing” via Sparse distributed coding?

    Posted by ProjectAGI on
    An interesting article by Gerard Rinkus comparing the qualities of sparse distributed representation and quantum computing. In effect, he argues that because distributed representations can simultaneously represent multiple states, you get the same effect as a quantum superposition.

    The article was originally titled “sparse distributed coding via quantum computing” but I think that gets the key conclusions backwards (maybe I’m wrong?).

    The full article is here:

    http://people.brandeis.edu/~grinkus/SDR_and_QC.html

    Rinkus says:

    “I believe that SDR constitutes a classical instantiation of quantum superposition and that switching from localist representations to SDR, which entails no new, esoteric technology, is the key to achieving quantum computation in a single-processor, classical (Von Neumann) computer.”

    I think that goes a bit too far. Yes, it would seem to have some of the same advantages as quantum computing, with the additional benefit of fitting classical computing technology that is mass manufactured at low cost.

    However, this may be moot now that true quantum computing looks set to become practical:


    All in all I think the analogy between sparse distributed representation and quantum computing is very thought-provoking.



    When is missing data a valid state?

    Posted by Gideon Kowadlo on

    By Gideon Kowadlo, David Rawlinson and Alan Zhang

    Can you hear silence or see pitch black?
    Should we classify no input as a valid state or ignore it?

    To my knowledge, the machine learning and statistics literature mainly regards an absence of input as missing data. There are several ways that it’s handled. It can be considered a missing data point: a value is inferred and then treated as the real input. When a period of no data occurs at the beginning or end of a stream (time series data), it can be ignored, which is referred to as censoring. Finally, when there is a variable that can never be (or is never) observed, it can be viewed as data that is always missing, and modelled with what are referred to as latent or hidden variables. I believe there is more to the question of whether an absence of input is in fact a valid state, particularly when learning time-varying sequences and when considering computational parallels of biological processes, where an absence of signal might never occur.

    It is also relevant in the context of systems where ‘no signal’ is an integral type of message that can be passed around. One such system is Predictive Coding (PC), which is a popular theory of cortical function within the neuroscience community. In PC, prediction errors are fed forward (see PC post [1] for more information). Therefore, perfectly correct predictions result in ‘no-input’ in the next level, which may occur from time to time given it is the objective of the encoding system.

    Let’s say your system is classifying sequences of colours Red (R), Green (G) and Blue (B), with periods of no input which we represent as Black (K). There is a sequence of colours RGB, followed by a period of K, then BGR and then two steps of K again, illustrated below (the figure is a Markov graph representation).

    Figure 1: Markov graph representation of a sequence of colour transitions.

    What’s in a name?
    What actually defines Black as no input?

    This question is explored in the following paragraphs along with Figure 2 below. We start with the way the signal is encoded. In the case of an image, each pixel is a tuple of scalar values, including black (K) with a value of (0, 0, 0). No specific component value has a privileged status. We could define black as any scalar tuple. For other types of sensors, signal modulation is used to encode information. For example, frequency of binary spikes/firing is used in neural systems. No firing, or more generally no change, indicates no input. Superficially it appears to be qualitatively different. However, a specific modulation pattern can be mapped to a specific scalar value. Are they therefore equivalent?

    We reach some clarity by considering the presence of a clock as a reference. The use of signal modulation implies the requirement of a clock, but does not necessitate it. With an internal clock, modulation can be measured in a time-absolute* sense, the modulation can be mapped to a scalar representation, and the status of the no-input state does indeed become equivalent to the case of a scalar input with a clock i.e. no value is privileged.

    Where there is no clock, for either type of signal encoding, time can effectively stand still for the system. If the input does not change at all, there is no way to perceive the passage of time. For scalar input, this means that the input does not transition. For modulated input, it includes the most obvious type of ‘no-input’, no firing or zero frequency.

    This would obviously present a problem to an intelligent agent that needs to continue to predict, plan and act in the world. Although there are likely to be inputs to at least some of the sensors, it suggests that biological brains must have an internal clock. There is evidence that the brain has multiple clocks, summarised here in Your brain has two clocks [2]. I wonder if the time course of perceptible bodily processes or thoughts themselves could be sufficient for some crude perception of time.

    Figure 2: Definition of ‘no-input’ for different system characteristics.
    * With respect to the clock at least. This does give rise to the interesting question of the absoluteness of the clock itself. Assume for argument’s sake that consciousness can be achieved with deterministic machines. The simulated brain won’t know how fast time is running. You can pause it and resume without it being any the wiser.
    If we assume that we can define a ‘no-input’ state, how would we approach it?

    The system could be viewed as an HMM (Hidden Markov Model). The sensed/measured states represent hidden world states that cannot be measured directly. Let us make many observations, look at the statistics of occurrence, and compare these to the other observable states. If the statistics are similar, we can assume option A – no special meaning. If, on the other hand, it occurs between the other observable sequences – sequences which are not correlated with each other – and is therefore not significantly correlated with any transitions, then we can say that it is B – a delineator.

    A – no special meaning

    There are two options, treat K as any other state, or ignore it. For the former, it’s business as usual. For the latter, ‘ignoring the input’, there don’t seem to be any consequences for the following reason. The system will identify at least two shorter sequences, one before K and one after. Any type of sequence learning must anyway have an upper limit on the length of the representable sequences* (unlike the theoretical Turing Machine); this will just make those sequences shorter. In the case of hierarchical algorithms such as HTM/CLA, higher levels in the hierarchy will integrate these sub sequences together into longer (more abstracted) temporal sequences.

    However, ignoring K will have implications for learning the timing of state persistence and transitions. If the system ignores state K including the timing information, then modelling will be incomplete. For example, referring back to Figure 1, K occurs for two time steps before the transition back to R. This is important information for learning to predict when this R will occur. Additionally, the transition to K signalled the end of the occurrence of R preceding K. Another example is illustrated below in Figure 3. Here, K following B is a fork between two sub chains. The transition to R occurs 95% of the time. That information can be used to make a strong prediction about future transitions from this K, however if K is ignored, as shown on the right of the figure, the information is ignored and the prediction is not possible.

    Figure 3: Markov chain showing some limitations of ignoring K.

    * However, it is possible to have the ability to represent sequences far longer than the expected observable sequences with enough combinatorial power, as described in CLA and argued to exist in biological systems.
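
    To make the cost of ignoring K concrete, here is a small sketch (our own toy code, in Python) counting first-order transitions for a stream like the one in the figures, with and without the K state:

        from collections import Counter

        def transition_counts(sequence):
            # Count first-order transitions in a symbol sequence.
            return Counter(zip(sequence, sequence[1:]))

        seq = list("RGBKRGBKBGRKK") * 10   # toy stream; K marks 'no input'

        with_k    = transition_counts(seq)
        without_k = transition_counts([s for s in seq if s != "K"])

        # With K kept, (B -> K) and (K -> R) are distinct, predictive events;
        # with K dropped, they collapse into (B -> R) and the timing is lost.
        print(with_k[("B", "K")], with_k[("K", "R")])
        print(without_k[("B", "R")])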


    B – a delineator

    This is the case where the ‘no-input’ state is not correlated (above some significance measure) with any observed sequence. The premise of this categorisation is that, due to the lack of correlation, it is an effectively meaningless state. However, it can be used to make inferences about the underlying state. Using the example from Figure 1, based on repeated observations, the statement could be made that R, G and B summarise hidden states. We can also surmise that there are states that generate white noise, in this example random selections of R, G, B or K. This can be inferred since we never observe the same signal twice when in those states. Observations of K are then useful for modelling the hidden states, which calls into question the definition of K as ‘no-input’.

    However, it may in fact be an absence of input. In any case, we did not observe any correlations with other sequences. Therefore in practice, this is similar to ‘A – no special meaning – ignore the state’. The difference is the semantic meaning of the ‘no-input’ state as a delineator. There is also no expectation that there is meaningful information in the duration of the absence of input. The ‘state’ is useful to indicate the sequence is finished, and therefore defines the timing of persistence of the last state of the sequence.

    CLA and hierarchical systems

    Let us turn our attention briefly to the context of HTM CLA [3]. CLA utilises Sparse Distributed Representations (see SDR post [4] for more information) as a common data structure in a hierarchical architecture. A given state, represented as an SDR, will normally be propagated to the level above, which also receives input from other regions. It will therefore be represented as one (or possibly more) of many bits in the state above. Each bit is semantically meaningful. A ‘0’ should therefore be as meaningful as a ‘1’. The questions discussed above arise when the SDR is completely zero, which I’ll refer to as a ‘null SDR’.

    The presence of a null SDR depends on the input source, the presence of noise and the implementation details of the encoders. In a given region, the occurrence of null SDRs will tend to dissipate, as the receptive field adjusts until a certain average complexity is observed. In addition, null SDRs become increasingly unlikely as you move up the hierarchy and incorporate larger and larger receptive fields, thus increasing the surface area for possible activity. If the null SDR can still occur occasionally, there may be times when it is significant. If it is not classified, will the higher levels in the hierarchy recover the ‘lost’ information? This question applies to other hierarchical systems and will be investigated in future posts.

    So what ……. ?

    What does all of this mean for the design of intelligent systems? A realistic system will be operating with multiple sensor modalities and will be processing time varying inputs (regardless of the encoding of the signal). Real sensors and environments are likely to produce background noise, and in front of that, periods of no input in ways that are correlated with other environmental sequences, and in ways that are not – relating to the categorisations above ‘A – no special meaning’ and ‘B – a delineator’. There is no simple ‘so what’, but hopefully this gives us some food for thought and shows that it is something that should be considered. In future I’ll be looking in more depth at biological sensors and the nature of the signals that reach the cortex (are they ever completely silent?), as well as the implications for other leading machine learning algorithms.

    References

    [1] On Predictive Coding and Temporal Pooling
    [2] Emilie Reas, Your brain has two clocks, Scientific American, 2013
    [3] HTM White Paper
    [4] Sparse Distributed Representations (SDRs)

    Sparse Distributed Representations (SDRs)

    Posted by ProjectAGI on

    TL;DR:

    • An SDR is a Sparse Distributed Representation, described below
    • SDRs are biologically plausible data structures
    • SDRs have powerful properties
    • SDRs have received a lot of attention recently
    • There are a few really great new resources on the topic:

    Background

    An SDR is a binary vector, where only a small portion of the bits are ‘on’. There is growing evidence that SDRs are a feature of biological computation, for storing and conveying information. In a biological context, this represents a small number of active cells in a population of cells. SDRs have been adopted by HTM (in CLA), and HTM’s focus on this has set it apart from the bulk of mainstream machine learning research.

    SDRs have very promising characteristics and are still relatively under-utilised in the field of AI and machine learning. It’s an exciting area to be watching as AGI leaps forward.

    The concept of SDRs is not new, however. Kanerva first proposed a sparse distributed memory as a physiologically plausible model for human memory in his influential 1988 book Sparse Distributed Memory [Kanerva1988]. The properties of sparse distributed memory were nicely summarised by Denning in 1989 [Denning1989].

    As early as 1997, Hinton and Ghahramani described a generative model, implementable with a neural network, that ‘discovered’ sparse distributed representations for image analysis [Hinton1997].

    Then in 1998 SDRs were used in the context of robot control for navigation by both Rajesh et al. and Rao et al. [Rajesh1998, Rao1998], and then in 2004 for reinforcement learning by Ratitch [Ratitch2004].

    Recently some great new resources have become available. There is a new video from Numenta of Subutai Ahmad presenting on the topic. A nice companion to that is an older introductory video of Jeff Hawkins presenting. The recent draft paper by Ferrier on a universal cortical algorithm (discussed in an earlier blog post) gives an excellent summary of their characteristics.

    What’s all the fuss about? – SDR Characteristics

    Given that so much great (and recent) material exists describing SDRs, I won’t go into very much detail. This post would not be complete though, without at least a cursory look at the ‘highlights’ of SDRs.

    • Semantic
      • each bit corresponds to something meaningful
    • Efficient/Versatile
      • storage: there is a huge number of potential encodings for a given vector size, as capacity increases exponentially with number of potentially active bits
      • compositionality: because of sparsity SDRs are generally linearly separable and can therefore be combined to represent a set of several states
      • comparisons: it is very efficient to measure similarity between vectors (you just need to count the overlap of ‘on’ bits) and to test whether a state is part of a set (due to compositionality) – see the sketch after this list
    • Robust
      • subsampled or noisy vectors are still semantically similar and can be compared effectively
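
    As a rough sketch of the comparison and capacity points above (our own toy example in Python, not drawn from the resources linked earlier):

        import numpy as np
        from math import comb

        n, k = 2048, 40   # vector size, number of 'on' bits

        def random_sdr(rng):
            sdr = np.zeros(n, dtype=np.uint8)
            sdr[rng.choice(n, size=k, replace=False)] = 1
            return sdr

        def overlap(a, b):
            # Similarity between two SDRs: count of shared 'on' bits.
            return int(np.sum(a & b))

        rng = np.random.default_rng(42)
        a, b = random_sdr(rng), random_sdr(rng)
        noisy_a = a.copy()
        noisy_a[rng.choice(np.flatnonzero(a), size=4, replace=False)] = 0  # drop 10% of a's bits

        print(overlap(a, b))        # near zero: random SDRs barely overlap
        print(overlap(a, noisy_a))  # 36: still clearly recognisable as 'a'
        print(comb(n, k))           # capacity: distinct SDRs of this size, ~1e85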

    Conclusion

    We believe that these elements make SDRs an excellent choice as the main data structure of any AGI implementation – they are used in our current approach. The highlights and links given above should be a good start for anyone who wants to learn more.


    Toward a Universal Cortical Algorithm: Examining Hierarchical Temporal Memory in Light of Frontal Cortical Function

    Posted by ProjectAGI on
    This post is about a fantastic new paper by Michael R. Ferrier, titled:


    Toward a Universal Cortical Algorithm: Examining Hierarchical Temporal Memory in Light of Frontal Cortical Function


    The paper was posted to the NUPIC mailing list and can be found via:



    The paper itself is currently hosted at:



    It isn’t clear if this is going to be formally published in a journal at some point. If this happens we’ll update the link.


    So, what do we like about this paper?

    Purpose & Structure of the paper

    The paper is mostly a literature review and is very well referenced. This is a great introductory work to the topic.


    The paper aims to look at the evidence for the existence of a universal cortical algorithm – i.e. one that can explain the anatomical features and function of the entire cortex. It is unknown whether such an algorithm exists, but there is some evidence it might. Or, more likely, variants of the same algorithm are used throughout the cortex.


    The paper is divided into 3 parts. First, it reviews some relevant & popular algorithms that generate hierarchical models. These include Deep Learning, various forms of Bayesian inference, Predictive Coding, Temporal Slowness and Multi-Stage Hubel Wiesel Architectures (MHWA). I’d never heard of MHWA before, though some of the examples (such as convolutional networks and HMAX) are familiar.  The different versions of HTM are also described.


    It is particularly useful that the author puts the components of HTM in a well-referenced context. We can see that the HTM/CLA Spatial Pooler is a form of Competitive Learning and that the proposed new HTM/CLA Temporal Pooler is an example of the Temporal Slowness principle. The Sequence Memory component is trained by a variant of Hebbian learning.


    These ties to existing literature are useful because they allow us to understand the properties and alternatives to these algorithms: Earlier research has thoroughly explored their capabilities and limitations.


    Although not an algorithm per se, Sparse Distributed Representations are explained particularly well. The author contrasts 3 types of representation: Localist (single feature or label represents state), Sparse and Dense. He argues that Sparse representations are preferable to Localist because the former can be gradually learnt and are more robust to small variations.

    Frontal Cortex

    The second part of the paper reviews the biology of frontal cortex regions. These regions are not normally described in computational theories. Ferrier suggests this omission is because these regions are less well understood, so they offer less insight and support for theory.


    However these cortical areas are of particular interest because they are responsible for representation of tasks, goals, strategy and reward; the origin of goal-directed behaviour and motor control.


    Of particular interest to us is discussion of biological evidence for the hierarchical generation of motor behaviour and output to motors directly from cortex.

    Thalamus and Basal Ganglia

    The paper discusses the role of the Thalamus in gating messages between cortical regions, and discusses evidence that the purpose of the Striatum and Basal Ganglia could include deciding which messages are filtered in the Thalamus.  Filtering is suggested to perform the roles of attention and control (this all perfectly matches our understanding of the same).


    There is a brief discussion of Reinforcement Learning (specifically, Temporal-Difference learning) as a computational analogue of Thalamic filter weighting. This has been exhaustively covered in the literature so wasn’t a surprise.

    Towards a Comprehensive Model of Cortical Function

    The final part of the paper links the computational theories to the referenced biology. There are some interesting insights (such as that messages in the feedback pathway from layer 5 to layer 1 in hierarchically lower regions must be “expanding” time; personally I think these messages are being re-interpreted in expanded time form on receipt).


    Our general expectation is that feedback messages representing predicted state are being selectively biased or filtered towards “predicting” that the agent achieves rewards; in this case the biased or filtered predictions are synonymous with goal-seeking strategies.


    Overall the paper does a great job of linking the “ghetto” of HTM-like computational theories with the relevant techniques in machine learning and neurobiology.


    TP 2/3: Jeff’s new Temporal Pooler

    Posted by ProjectAGI on
    By David Rawlinson and Gideon Kowadlo

    Jeff’s new Temporal Pooler

    This is article 2 in a 3 part series about Temporal Pooling (TP) in MPF/CLA-like algorithms. You can read part 1 here. For the rest of this article we will assume you’ve read part 1.

    This article is about the new TP proposed by Jeff Hawkins. The original TP was described in the CLA white paper. We will also assume you’ve at least had a quick read of the linked articles. Despite our best efforts this article is only an interpretation of those methods, and it may not be entirely correct or as Numenta intended.

    Separation of internal and external causes

    The first topic in Hawkins’ proposal covers the possible roles of specific cortical layers and the separation of internal and external causes.

    Hawkins suggests that cortical layers 3, 4, 5 and 6 are all implementing variants of the same algorithm, with minor differences. He also suggests that each layer is performing all functions (Spatial Pooling, Sequence Memory and Temporal Pooling; in higher hierarchy levels, Spatial Pooling may be absent). In CLA, these 3 components are implemented as a matrix of sequence-memory cells. Rules concerning how and when the cells activate each other in different input contexts implement the pooling and prediction features.

    Hawkins also states that one distinction between layers 3 and 4 may be that cells in layer 4 are using copies of motor actions (internal causes), to predict better. In consequence, cells in layer 3 are left trying to predict the actions that layer 4 could not predict, i.e. relying more on historically-observed sequential patterns of activation. Although both layers will learn sequential patterns of activation, layer 3 will rely more heavily on history. External causes will more often generate input that can’t be explained by motor actions, so we might expect layer 3 to more often respond to external events.

    This article will not discuss these ideas any further. We note them for clarity and to distinguish our focus. Instead, we will talk about how the new CLA TP could be used to construct a hierarchical representation of changes in input patterns over time.

    Temporal Slowness

    Both old and new temporal poolers exploit a principle known as “Temporal Slowness“, which means that output activity varies more slowly than input activity. You can read more about this general principle here.

    Another, related feature of the HTM and CLA temporal poolers is that they emit a constant pattern of cell activity to generate stability. This is achieved by marking cells as “active” regardless of whether they are active via prediction or via feedforward input. Although active-by-prediction and active-by-FF-input are distinguished within the region for learning purposes, this distinction is not visible to the next higher region in the hierarchy.

    Old Temporal Pooler Cell Activity

    For reference, this is an outline of the TP functionality in the “old” TP from the Numenta CLA white paper.

    The diagrams in this article each have 3 parts. At the top is a graph showing a fragment (in graph terminology, a component) of the Sequence Memory encoded by the cells in the region. Arrows show the learnt transitions between cells. Below the graph is a series of observations (marked “FF input”) and the corresponding pattern of cell activity when each Feed-Forward (FF) input is observed. Each column represents a single cell from the sequence memory and its activity over time. Each row in the lower part of the diagrams shows all cells’ activity at one moment. Cells are filled white when they are active and black when not active:

    Figure 1: Spatial pooler and temporal pooler cell activity in the original CLA white paper. This image compares two patterns of cell activity over time, shown left and right. The left subfigure shows cell activity in spatial pooler cells, where the active cell[s] are the ones whose input bits most closely match current FF input. The right subfigure shows the original CLA temporal pooling method, where cells become active when predicted far in advance of their FF input being observed, and remain active until after the associated FF input is observed. In this example, ⅔ of the active cells are identical after each FF input change. A ‘P’ denotes cells activated due to prediction. Each subfigure has 3 parts. The main part is a matrix showing sequence memory cell activity over time. Each row is one time-step, numbered 1 to 5. The FF input observed at each time step is shown in a column to the left. The top row shows a fragment of Sequence Memory formed by the cells, and the colours each cell responds to. In this simple example, the Sequence Memory graph is simply a sequence of states that are always observed in the same order.

    Spatial Pooler cell activity is easiest to explain (figure 1, left). In figure 1, we see that the Sequence Memory has learnt that the colours Red, Yellow, Green, Blue and Black occur in order. One cell responds uniquely to each of these colours, creating the diagonal line of active cells over time. Each cell is only active for the duration of its FF input being observed.

    The original temporal pooler premise was to drive cells to an active state via prediction (cells active via prediction are marked with a ‘P’) as early as possible. The cells would then remain active until either the prediction was no longer made (e.g. due to observation of an unexpected FF input pattern) or the cell becomes active via its FF input.

    Time and Stability

    You can see in the figure above that in a sequence of predictable inputs each cell is active over a period of 3 input changes (the only meaningful way to measure time in these examples). So let’s assume each input change is one time step.

    The cells are shown to be active for an arbitrary period of time – to keep the diagrams simple the minimum period is shown. In reality a fixed activation period is unlikely; it will depend on the activation of prior cells in the sparse distributed representation. However, it is still possible to make the point that at any time during a predictable sequence, a set of cells is active. Most of those cells are not changing between time steps and will be active after the next FF input change. This is the temporal pooler in action.

    In the example above, each cell is predicted up to 2 steps before the corresponding input is observed. Therefore, in a predictable sequence 3 cells are simultaneously active, 2 of them due to prediction and one due to FF input.

    Although the output of the temporal pooler is continuously changing, most of the active cells are not changed between inputs. In this case, 2/3 of the output is stable. With longer activation of TP cells, a larger fraction of the output becomes stable.

    Noisy recognition and resource constraint assumptions

    To simplify the problem observed by the next level in the hierarchy it is necessary to have cell activity changes in the lower level without corresponding cell activity changes in the higher level. Given that the TP output is continuously changing, how do we sometimes avoid cell activity changes in the higher level?

    There are two parts to the answer. First, all cells’ FF inputs are Sparse Distributed Representations (SDRs). These are large sets of input bits, of which only a few are active at any given time. The Spatial Pooler in CLA recognises FF inputs when only a fraction of the expected (synapsed) input bits are active. For example, a cell may become active from FF input when 80% of its FF input bits are active – any 80%. The set of active input bits can change while the cell is active. This means that cells’ recognition is tolerant to noisy FF input.

    Noisy recognition of TP output from lower hierarchy levels is one assumption necessary for increasing stability in higher levels. But this assumption is actually a useful feature, allowing classification to be tolerant to noise.

    The other necessary assumption is a resource-constraint. If an infinite supply of cells were available, then after much slow learning every FF input pattern would have a dedicated cell (due to inhibition between cells). Cell activity changes would occur throughout the hierarchy after every input change, no matter how tiny. Obviously, resource constraints are physically necessary.

    The finite size of a CLA region ensures that there aren’t enough cells to represent each FF input pattern perfectly. Instead, some similar (probably successive) FF inputs will be represented by the same set of active cells (quantization error).

    These are both very reasonable assumptions, but worth stating and understanding.

    New Temporal Pooler Cell Activity

    The new TP proposes that cells are only counted as active when confirmed by observation of the corresponding FF input, and that they stay active for a period of time after this:

    Figure 2: Spatial pooler and temporal pooler cell activity as described by the new TP method. Each TP cell is active for a period of time after its corresponding FF input is observed. There may be no distinct SP or TP cells, but we show them separately to illustrate differences in activation behaviour.

    This in itself doesn’t change much, but because we are now building patterns forwards we can represent unpredicted events more accurately. Cells are no longer driven active by prediction alone, remaining so until the corresponding FF input is observed or the prediction is cancelled; instead, they are never fully activated prior to their corresponding FF input. When a prediction error occurs, the results are immediate and lasting. Given noisy recognition of FF input, the old method would be more likely to have hidden prediction failures.

    Temporal pooling “replaces” graph components (specifically sequences of vertices) with a single vertex that represents the component by being constantly active for the duration of those inputs. It is also worth noting that to simplify any graph, the minimum number of vertices in each replaced sequence is 3. In a temporal pooler, this means that the minimum number of FF input changes for which a predicted cell must remain active is 3. If temporal pooler output is constant for sequences of length 2, the next hierarchy level will encode transitions between cells instead of sequences of cells (i.e. no simplification or effective pooling has occurred).
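
    Here is a toy sketch of this activation rule (our own simplification, in Python): a cell switches on only when its FF input is actually observed, and stays on for 3 steps only if it was predicted beforehand.

        def new_tp_activity(observed, predicted, hold=3):
            # Toy model of the new TP rule: a cell becomes active only when
            # its FF input is observed; it stays active for `hold` steps if
            # it was predicted beforehand, and for just 1 step if it was not.
            timers, history = {}, []
            for cell, was_predicted in zip(observed, predicted):
                timers = {c: n - 1 for c, n in timers.items() if n > 1}
                timers[cell] = hold if was_predicted else 1
                history.append(sorted(timers))
            return history

        # Yellow was predicted after Red, but Green arrived (a fork in the
        # graph); the unpredicted Green cell is only briefly active.
        observed  = ["R", "G", "B", "K"]
        predicted = [True, False, True, True]
        for t, active in enumerate(new_tp_activity(observed, predicted)):
            print(t, active)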

    Activity after a Successful Prediction

    The new TP proposes that there are two cortical layers of cells. One layer of cells embodies the Spatial Pooler. The other layer forms the TP. In the TP layer, cells remain active for a long time after being successfully predicted in the SP layer, but for only a short time when not predicted in the SP layer.

    Prediction failures will occur regularly, whenever there are multiple future states and the available data does not allow the correct future to be determined. This looks like a fork in the Sequence Memory graph:

    Figure 3: In this non-deterministic sequence, Red is followed by Yellow or Green. When the prior Red is observed, Yellow is predicted. Since it was predicted, the Yellow cell stays active for 3 steps in total. Since Blue always follows Yellow and Green, and Black follows Blue, the other cells are all active for the full 3 steps.

    In the example above, the Yellow cell is successfully predicted leading to a long activation of the sequence memory cell that responds to Yellow after Red.

    Activity after a Failed Prediction

    The new TP proposes that in the event of a failed prediction, cells only remain active briefly. This is shown in the example below, where Yellow was predicted but Green was observed:

    Figure 4: New TP cell activity after a failed prediction. Yellow was predicted but Green was observed.

    The pattern of activity after a failed prediction is initially different to the pattern after a correct prediction, with only a short activation of the Green cell and no full activation of the predicted cell at all. This means that now, cells are only active when their corresponding FF input is actually observed.

    The FF output of the TP after a prediction failure is quite different to the FF output during predictable sequences before and after. This helps to ensure that the unpredictable transition is modelled in higher hierarchy levels, passing the problem up the hierarchy rather than obscuring it. We anticipate that higher levels of the hierarchy will have the ability to understand and hence predict the problematic transition.

    Analysis

    Prediction with/without motor output distinguishes Cortex layers 3,4

    Copying motor actions back to the cortex to help with prediction makes sense, especially in lower hierarchy levels. However, recent motor signals become increasingly irrelevant when trying to predict more abstract, longer-term events. For example, getting sacked from your job is less likely to be due to the way you just sipped your coffee, and more likely due to events that happened days or weeks ago. These older events will be hierarchically represented as more abstract causes.

    At higher hierarchy levels, with greater abstraction, the “motor actions” that are necessary to explain & predict events are not simple muscle contractions, but complex sequences of decisions and behaviour with specific intents and expectations. The predictive data encoded in the Feed-Forward Indirect and Feed-Back (FB) pathways contains this data in a form that is appropriate and meaningful at each level of the hierarchy. If predictions and decisions are synonymous, then we can treat selected predictions as if they were actions.

    For these reasons we are skeptical about the idea that the use of motor actions to aid prediction is enough to distinguish the functionality of different cortical layers. However, in support of the idea, layer 4 does disappear in higher hierarchy levels where motor actions would be less relevant.

    Propagation of uncertainty to higher regions

    The way that uncertainty (as prediction failure) is propagated up the hierarchy is vital to being able to reliably assemble a useful hierarchical representation of FF input. In fact, we believe that unpredictable events should be propagated until they reach a level of abstraction and invariance where they become predictable (see the Newtonian world assumption in the previous post). Therefore, we believe TP output should be highly orthogonal to prior and subsequent output in the event of a prediction failure. In the case of an SDR, highly orthogonal means that many bits should have dissimilar activity (a small intersection between the sets of active bits before and after).

    Only a fraction of synapsed input bits are needed to activate a cell, and therefore CLA features “noise-tolerant” recognition of FF input. Only a few output bits would be dissimilar between the outcomes of prediction success and failure. This seems to raise the risk that unpredictable events could be “hidden” by noise-tolerance, and not passed up the hierarchy for higher levels to solve. From the perspective of a higher level, the set of active cells has not been significantly affected by their failure to predict.

    Some loss of uncertainty in propagation may be acceptable in a sufficiently complex system. These are toy examples with only a few cells, whereas real CLA regions have hundreds or thousands of cells. However, we are working through some simple examples to try to better understand the behaviour and limits of the CLA.

    Another detail that is not fully described in Hawkins’ current TP proposal is how long cells should be active when predicted. Maximum stability is achieved when cells are active for long periods, but we are limited by the conflicting objective to not hide uncertainty. Should we truncate activity when other prediction failures occur? In the next article we will propose explicitly making TP cells active unless uncertainty is too high, thereby implementing an auto-tuning of activation period.

    Representations of random sequences

    There is one comment in Hawkins’ proposal that we disagree with. He says: “One of the key requirements of temporal pooling is that we only want to do it when a sequence is being correctly predicted. For example, we don’t want to form a stable representation of a sequence of random transitions.” In fact, it may be necessary to build a framework of some random sequences, in order to build sufficiently complex representations to explain any of the simpler events. Although the random sequences may not be the right ones, we need to have a mechanism of assembling more complex hierarchical representations even when there is no incremental explanatory power in doing so (this was discussed in the previous article). This would mean looking for structure in randomness, on the assumption that it would eventually be worthwhile due to explanatory models in higher levels of the hierarchy.

    Summary

    To wrap up:
    – Feed-Back or Feed-Forward indirect pathway data may be a more appropriate source of data than motor actions in the FF direct pathway, for predicting events based on internal causes at higher levels of abstraction

    – Reliable propagation of uncertainty (prediction failure) up the hierarchy is critical to move unexplained events to a level of abstraction where they can be understood

    – We would like to extend activity period for maximum stability, balanced against the desire to avoid hiding prediction errors. How this is done is not detailed

    – It may be necessary to perform temporal pooling even when there are no predictable patterns, in order to construct higher-order representations that may be able to predict the simpler events.

    The next and final article in our 3 part series will present some specific alternative temporal pooler ideas.