The standard machine learning approach is to learn to accomplish a specific task with an associated dataset. A model is trained using the dataset and is only able to perform that one task. This is in stark contrast to animals which continue to learn throughout life and accumulate and re-purpose knowledge and skills. The limitation has been widely acknowledged and addressed in different ways, and with a variety of terminology, which can be confusing. I wanted to take a brief look at those approaches and to create a precise definition of the Continuous Learning that we want to implement in our pursuit of AGI.
Transfer Learning is a term that has been used a lot recently in the context of Deep Learning. It was actually first discussed in a paper by Pratt in 1993. Transfer Learning techniques use knowledge for related tasks on either the same or similar datasets. A classic example is learning to recognise cars and then applying the model to the task of recognising trucks. Or learning to recognise a different aspect of objects on the same dataset, such as learning how to recognise petals instead of leaves, of a dataset containing many plants.
One type of Transfer Learning is Domain Adaptation. It refers to the idea of learning on one domain, or data distribution, and then applying the model to and optimising it for a related data distribution. Training a model on different data distributions is often referred to as Multi Domain Learning. In some cases the distributions are similar, but other times they are deliberately unrelated.
The term Lifelong Learning pops up about the same time as Transfer Learning, in a paper by Thrun in 1994. He describes it as an approach that “addresses situations in which a learner faces a series of different learning tasks providing the opportunity for synergy among them”. It overlaps with Transfer Learning, but the emphasis is on gathering general purpose knowledge that transfers across multiple consecutive tasks for an ‘entire lifetime’. Thrun demonstrated results with real robotic systems.
Curriculum Learning by Bengio is a special case of Lifelong or Transfer Learning, where the objective is to optimise performance on a specific task, rather than across different tasks. It does this by making an easy version of that one task and making it subsequently harder and harder.
Online Learning algorithms learn iteratively with new data, in contrast to learning from a pass of a whole dataset, as is commonly done in conventional supervised and unsupervised learning, referred to as Batch Learning. Batches can also refer to portions of the dataset.
Online Learning is useful when the whole dataset does not fit into memory at once. Or more relevant for AGI, in scenarios where new data is observed over time. For example, with new samples being generated by users of a system, by an agent exploring its environment or for cases where the phenomena being modelled changes. Another way to describe it is that the underlying input data distribution is not static i.e. a non-stationary distribution, hence these are referred to as Non-stationary Problems.
Online learning systems can be susceptible to ‘forgetting’. That is, becoming less effective at modelling older data. The worst case is failing completely and suddenly, known as Catastrophic Forgetting or Catastrophic Interference.
Incremental Learning, as the name suggests, is about learning bit by bit, extending the model and improving performance over time. Incremental Learning explicitly handles the level of forgetting of past data. In this way, it is a type of online learning that avoids catastrophic forgetting.
In One-shot Learning, the algorithm is able to learn from one or very few examples. Instance Learning is one way of achieving that, constructing hypotheses from the training instances directly.
A related concept is Multi-Modal Learning, where a model is trained on different types of data for the same task. An example is learning to classify letters from the way they look with visual data, and the way they sound, with audio.
Now that we have some greater clarity around these terms, we recognise that they are all important features of what we consider to be Continuous Learning for a successful AGI agent. I think it’s instructive to express it in terms of traits in the context of an autonomous agent. I’ve mapped these traits to the associated Machine Learning algorithm concepts.
|Uses learnt information to help with subsequent tasks.
Builds on its knowledge. Enables more complex behaviour and faster learning.
|As features of the task change gradually, it will adapt.
This will not cause catastrophic forgetting.
Non-stationary input distributions
|Can learn entirely new tasks.
This will not cause catastrophic forgetting of old tasks. Also, it can learn these new tasks as well as it would have, if it was the first task learnt i.e. learning a task does not impede the ability to learn subsequent tasks.
|Learns important aspects of the task from very few examples.
It has the ability to learn fast when necessary.
|Continues to learn as it collects more data.||Online Learning|
|Combines sensory modalities to learn a task.||Multi-modal Learning|
Note that in continuous learning, if there are fixed resources, and you are operating at your limit, then there has to be some forgetting, but as mentioned in the table, it should not be ‘catastrophic forgetting’.
Also published on Medium.