When is missing data a valid state?
By Gideon Kowadlo, David Rawlinson and Alan Zhang
Can you hear silence or see pitch black?
Should we classify no input as a valid state or ignore it?
To my knowledge, the machine learning and statistics literature mainly regards an absence of input as missing data. There are several ways that it’s handled. It can be considered to be a missing data point, a value is inferred and then treated as the real input. When a period of no data occurs at the beginning or end of a stream (time series data), it can be ignored, referred to as censoring. Finally, when there is a variable that can never (or is never) observed, it can be viewed as data that is always missing, and modelled with what is referred to as latent or hidden variables. I believe there is more to the question of whether an absence of input is in fact a valid state, particularly when learning time varying sequences and when considering computational parallels of biological processes where an absence of signal might never occur.
It is also relevant in the context of systems where ‘no signal’ is an integral type of message that can be passed around. One such system is Predictive Coding (PC), which is a popular theory of cortical function within the neuroscience community. In PC, prediction errors are fed forward (see PC post  for more information). Therefore, perfectly correct predictions result in ‘no-input’ in the next level, which may occur from time to time given it is the objective of the encoding system.
Let’s say your system is classifying sequences of colours Red (R), Green (G) and Blue (B), with periods of no input which we represent as Black (K). There is a sequence of colours RGB, followed by a period of K, then BGR and then two steps of K again, illustrated below (the figure is a Markov graph representation).
|Figure 1: Markov graph representation of a sequence of colour transitions.|
What’s in a name?
What actually defines Black as no input?
This question is explored in the following paragraphs along with Figure 2 below. We start with the way the signal is encoded. In the case of an image, each pixel is a tuple of scalar values, including black (K) with a value of (0, 0, 0). No specific component value has a privileged status. We could define black as any scalar tuple. For other types of sensors, signal modulation is used to encode information. For example, frequency of binary spikes/firing is used in neural systems. No firing, or more generally no change, indicates no input. Superficially it appears to be qualitatively different. However, a specific modulation pattern can be mapped to a specific scalar value. Are they therefore equivalent?
We reach some clarity by considering the presence of a clock as a reference. The use of signal modulation implies the requirement of a clock, but does not necessitate it. With an internal clock, modulation can be measured in a time-absolute* sense, the modulation can be mapped to a scalar representation, and the status of the no-input state does indeed become equivalent to the case of a scalar input with a clock i.e. no value is privileged.
Where there is no clock, for either type of signal encoding, time can effectively stand still for the system. If the input does not change at all, there is no way to perceive the passage of time. For scalar input, this means that the input does not transition. For modulated input, it includes the most obvious type of ‘no-input’, no firing or zero frequency.
This would obviously present a problem to an intelligent agent that needs to continue to predict, plan and act in the world. Although there are likely to be inputs to at least some of the sensors, it suggests that biological brains must have an internal clock. There is evidence that the brain has multiple clocks, summarised here in Your brain has two clocks . I wonder if the time course of perceptible bodily processes or thoughts themselves could be sufficient for some crude perception of time.
|Figure 2: Definition of ‘no-input’ for different system characteristics.|
The system could be viewed as an HMM (Hidden Markov Model). The sensed/measured states represent hidden world states that can not be measured directly. Let us make many observations and look to the statistics of occurrence, and compare this to the other observable states. If the statistics are similar, we can assume option A – no special meaning. If on the other hand, it occurs between the other observable sequences, sequences which are not correlated with each other, and is therefore not significantly correlated with any transitions, then we can say that it is B – a delineator.
A – no special meaning
There are two options, treat K as any other state, or ignore it. For the former, it’s business as usual. For the latter, ‘ignoring the input’, there don’t seem to be any consequences for the following reason. The system will identify at least two shorter sequences, one before K and one after. Any type of sequence learning must anyway have an upper limit on the length of the representable sequences* (unlike the theoretical Turing Machine); this will just make those sequences shorter. In the case of hierarchical algorithms such as HTM/CLA, higher levels in the hierarchy will integrate these sub sequences together into longer (more abstracted) temporal sequences.
However, ignoring K will have implications for learning the timing of state persistence and transitions. If the system ignores state K including the timing information, then modelling will be incomplete. For example, referring back to Figure 1, K occurs for two time steps before the transition back to R. This is important information for learning to predict when this R will occur. Additionally, the transition to K signalled the end of the occurrence of R preceding K. Another example is illustrated below in Figure 3. Here, K following B is a fork between two sub chains. The transition to R occurs 95% of the time. That information can be used to make a strong prediction about future transitions from this K, however if K is ignored, as shown on the right of the figure, the information is ignored and the prediction is not possible.
|Figure 3: Markov chain showing some limitations of ignoring K.|
* However, it is possible to have the ability to represent sequences far longer than the expected observable sequences with enough combinatorial power, as described in CLA and argued to exist in biological systems.
B – a delineator
This is the case where the ‘no-input’ state is not correlated (above some significant measure) with any observed sequence. The premise of this categorisation, is that due to lack of correlation, it is an effectively meaningless state. However, it can be used to make inferences about the underlying state. Using the example from Figure 1, based on repeated observations, the statement could be made that R, G and B summarise hidden states. We can also surmise that there are states that generate white noise, in this example random selections of R, G, B or K. This can be inferred since we never observe the same signal twice when in those states. Observations of K are then useful for modelling the hidden states, which calls into question the definition of K as ‘no-input’.
However, it may in fact be an absence of input. In any case, we did not observe any correlations with other sequences. Therefore in practice, this is similar to ‘A – no special meaning – ignore the state’. The difference is the semantic meaning of the ‘no-input’ state as a delineator. There is also no expectation that there is meaningful information in the duration of the absence of input. The ‘state’ is useful to indicate the sequence is finished, and therefore defines the timing of persistence of the last state of the sequence.
CLA and hierarchical systems
Turning our attention briefly to the context of HTM CLA . CLA utilises Sparse Distributed Representations (see SDR post  for more information) as a common data structure in a hierarchical architecture. A given state, represented as an SDR, will normally be propagated to the level above which also receives input from other regions. It will therefore be represented as one (or possibly more) of many bits in the state above. Each bit is semantically meaningful. A ‘0’ should therefore be as meaningful as a ‘1’. The questions discussed above arise when the SDR is completely zero, which I’ll refer to as a ‘null SDR’.
The presence of a null SDR depends on the input source, presence of noise and the implementation details of the encoders. In a given region, the occurrence of null SDR’s will tend to dissipate, as the receptive field adjusts until a certain average complexity is observed. In addition, null SDR’s becomes increasingly unlikely as you move up the hierarchy and incorporate larger and larger receptive fields, thus increasing the surface area for possible activity. If the null SDR can still occur occasionally, there may be times when it is significant. If it is not classified, will the higher levels in the hierarchy recover the ‘lost’ information? This question applies to other hierarchical systems and will be investigated in future posts.
So what ……. ?
What does all of this mean for the design of intelligent systems? A realistic system will be operating with multiple sensor modalities and will be processing time varying inputs (regardless of the encoding of the signal). Real sensors and environments are likely to produce background noise, and in front of that, periods of no input in ways that are correlated with other environmental sequences, and in ways that are not – relating to the categorisations above ‘A – no special meaning’ and ‘B – a delineator’. There is no simple ‘so what’, but hopefully this gives us some food for thought and shows that it is something that should be considered. In future I’ll be looking in more depth at biological sensors and the nature of the signals that reach the cortex (are they ever completely silent?), as well as the implications for other leading machine learning algorithms.
References On Predictive Coding and Temporal Pooling Emilie Reas, Your brain has two clocks, Scientific American, 2013 HTM White Paper Sparse Distributed Representations (SDRs)
Also published on Medium.