Zero-shot numerical reasoning in dual stream neural networks and the primate visual system

Human viewers learn abstract concepts corresponding to visual relational properties and then can generalize these concepts zero-shot to new objects and contexts. Numerosity is a prime example of this ability. Once a child has learned the abstract concept of “threeness”, she will forever be able to recognize groups of three objects, even novel objects in novel contexts, without any additional learning. This is not the case for modern neural network-based computer vision systems, which, while highly proficient at object recognition, struggle to generalize relational properties like cardinality. Here, we show that a recurrent dual-stream neural network, inspired by the role of the dorsal stream in primate vision, which apprehends an image via a sequence of foveated glimpses, displays zero-shot generalization in numerical reasoning tasks. This zero-shot generalization behaviour is not observed in parameter-matched control models which receive the entire image as input. Neither stream of the dual-stream model is sufficient to solve the task alone. The dual-stream model replicates several neural and behavioural phenomena associated with human and monkey enumeration. Analyzing the activity of the recurrent layer revealed several response properties associated with posterior parietal cortex (PPC): log-normal number coding, an over-representation of units selective to the extremes of the number range, and place-selective spatial receptive fields. Inspection of the pattern of errors made at different levels of proficiency revealed that, like human learners, the dual-stream model masters smaller numerosities first, gradually refining larger numerosities. Our characterization of the computational principles that support zero-shot numerical reasoning is consistent with converging theories of the role of PPC in visual reasoning.
According to these theories, efferent copies of motor or attention signals are received as an additional input to a visual reasoning system, enabling abstractions that are grounded in action, rather than purely in the sensory domain. The success of our model suggests that “attention” may be an active mechanism for relational inference, rather than (merely) a spatial prioritisation scheme.
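The core computational motif described above can be caricatured in a few lines of code. The sketch below is illustrative only: the layer sizes, the 5×5 glimpse format, the tanh nonlinearities, and the way the two streams are fused in the recurrent layer are all assumptions for exposition, not the model's actual implementation. A ventral ("what") pathway encodes the pixel contents of each foveated glimpse, a dorsal ("where") pathway encodes the glimpse coordinates (standing in for an efference copy of the saccade command), and a recurrent layer integrates both across the glimpse sequence before a readout predicts the numerosity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not the paper's values)
N_PIXELS = 25   # a flattened 5x5 foveal glimpse
N_STREAM = 16   # units per stream
N_HIDDEN = 32   # recurrent layer size
N_CLASSES = 9   # numerosities 1..9

# Random untrained weights, just to show the forward pass
W_what = rng.normal(0, 0.1, (N_STREAM, N_PIXELS))   # ventral stream
W_where = rng.normal(0, 0.1, (N_STREAM, 2))         # dorsal stream (x, y)
W_h = rng.normal(0, 0.1, (N_HIDDEN, 2 * N_STREAM + N_HIDDEN))
W_out = rng.normal(0, 0.1, (N_CLASSES, N_HIDDEN))

def step(glimpse_pixels, glimpse_xy, h):
    """One recurrent step: fuse the 'what' and 'where' codes with the
    previous hidden state."""
    what = np.tanh(W_what @ glimpse_pixels)          # glimpse contents
    where = np.tanh(W_where @ glimpse_xy)            # efference-copy input
    return np.tanh(W_h @ np.concatenate([what, where, h]))

def enumerate_glimpses(glimpses, positions):
    """Integrate a sequence of (glimpse, position) pairs and read out a
    predicted numerosity class."""
    h = np.zeros(N_HIDDEN)
    for g, p in zip(glimpses, positions):
        h = step(g, p, h)
    logits = W_out @ h
    return int(np.argmax(logits)) + 1  # classes map to numerosities 1..9

# Forward pass on a dummy sequence of five random glimpses
glimpses = [rng.random(N_PIXELS) for _ in range(5)]
positions = [rng.random(2) for _ in range(5)]
prediction = enumerate_glimpses(glimpses, positions)
```

The point of the sketch is the input structure, not the weights: because the recurrent layer receives glimpse position as a separate, action-derived signal rather than recovering it from pixels, it can in principle learn a counting strategy that is independent of object appearance, which is the property the full model exploits for zero-shot generalization.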