Brief dip into computer vision jargon. Anthropomorphism. Humans learn to recognise gross gestalten, before understanding mereology ... and humans learn to recognise 2D persistence before 3D persistence. But as conscious processes develop, they discover more efficient storage formats for data, by relying on 3D persistence graphs.
So while we could train a vision system to be more human like by beginning with gross gestalten, we may want to skip forward a little bit to endow the system with the capability to immediately begin forming a database of 3D graphs, representing object topologies in the wild.
LFG :
0. Statistical globbing via neural networks should be used as a final optimisation technique, only after all known structural optimisations ("physics") have been ("logically") programmed into the system - not as a lazy shortcut from zero.
1.
1.1. Let us throw away the concept of colour, and begin only with one dimension of magnitude per unit of space - luma / luminance / brightness.
1.2. Let us ignore 3D structure and begin only with the gross phenomenology of the field of vision.
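A minimal sketch of 1.1, assuming the input arrives as rows of (r, g, b) tuples and assuming the Rec. 601 luma weights; the notes above do not commit to a particular standard, and the names `luma` and `to_luma_field` are made up for illustration.

```python
# Rec. 601 luma weights. This choice is an assumption, not something
# the notes above specify.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def luma(r, g, b):
    """Collapse one RGB sample to a single brightness magnitude."""
    wr, wg, wb = LUMA_WEIGHTS
    return wr * r + wg * g + wb * b


def to_luma_field(rgb_rows):
    """Turn rows of (r, g, b) tuples into rows of luma values."""
    return [[luma(*pixel) for pixel in row] for row in rgb_rows]


# A 2x2 field: white, black, pure red, pure green.
field = to_luma_field([
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
])
```

After this step the field is just a grid of magnitudes, one per unit of space, with no colour and no 3D structure attached.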
2.
2.1. Now, mereology : we must have a way to describe subsets of any field which are interesting. The smallest item is a point of brightness.
2.2. SCALE invariance : the concept of an item growing to take up a larger fraction of the field, or shrinking to take up a smaller fraction, must exist. See SIFT, SURF, ORB, et al.
2.3. Location : the field of vision must be referenced by an address space.
2.4. Multiplicity : there may be many items of interest.
2.5. Extension in 2D : items may be linked, giving them 2D structure in the field. See "graph convolutional networks" (GCNs).
2.6. ROTATIONAL invariance : the 2D structure looks different if you spin it around with respect to the address space, yet the system must still recognise it as the same structure. Where you apply this to spinning everything around the system, see "sphere node graphs", though in 2D that's probably a "cylinder node graph".
... at this point we have five to six concepts that must be uniquely addressed in the cognitive space of the system. A sort of introductory transcendental aesthetic.
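One way to picture how the concepts in section 2 might sit together is sketched below. It is only an illustration under stated assumptions: `Item`, `Structure2D`, `transformed` and `signature` are hypothetical names, and the signature (sorted link lengths divided by the longest link) is a toy stand-in chosen for brevity, not SIFT, SURF, ORB or a GCN.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """2.1 + 2.3: a point of brightness at an address in the field."""
    x: float
    y: float
    luma: float


@dataclass
class Structure2D:
    """2.4 + 2.5: many items, linked into a 2D graph."""
    items: list   # list of Item
    links: set    # set of (i, j) index pairs into items

    def transformed(self, scale=1.0, angle=0.0):
        """2.2 + 2.6: the same structure, scaled and rotated about the origin."""
        c, s = math.cos(angle), math.sin(angle)
        moved = [
            Item(scale * (c * p.x - s * p.y), scale * (s * p.x + c * p.y), p.luma)
            for p in self.items
        ]
        return Structure2D(moved, set(self.links))

    def signature(self):
        """Toy descriptor: link lengths relative to the longest link.

        Ratios of lengths do not change under uniform scaling or rotation,
        so the result is the same before and after transformed().
        """
        lengths = sorted(
            math.dist((self.items[i].x, self.items[i].y),
                      (self.items[j].x, self.items[j].y))
            for i, j in self.links
        )
        if not lengths:
            return []
        longest = lengths[-1]
        return [round(length / longest, 6) for length in lengths]


# A 3-4-5 triangle of bright points, fully linked.
triangle = Structure2D(
    items=[Item(0, 0, 1.0), Item(3, 0, 1.0), Item(0, 4, 1.0)],
    links={(0, 1), (1, 2), (0, 2)},
)
spun = triangle.transformed(scale=2.5, angle=0.7)
same = triangle.signature() == spun.signature()
```

The addresses of every item change under `transformed`, but the signature does not, which is the sense in which the system can treat the two as one item seen at a different scale and rotation.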
3.
3.1. Extension in 3D : Now we can revisit the notion of 3D space.
3.2. Location ("distance") in 3D is associated with scale in 2D (2.2.).
3.3. Occlusion in 2D is associated with rotation in 3D. See "aspect graphs".
... and I think that's all we need to get started building a computer vision system, to undergird the visual ontological comprehension of any anthropomorphic computer.
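The link in 3.2 between 3D distance and 2D scale can be made concrete with the standard pinhole projection, where apparent size equals focal length times real size divided by distance. The sketch below assumes an idealised pinhole camera and a known real size; the function names and the default `focal_length=1.0` are illustrative choices only.

```python
def apparent_size(real_size, distance, focal_length=1.0):
    """Size of an item in the 2D field, under an ideal pinhole model."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return focal_length * real_size / distance


def distance_from_scale(real_size, apparent, focal_length=1.0):
    """Invert the projection: recover 3D distance from 2D scale."""
    if apparent <= 0:
        raise ValueError("apparent size must be positive")
    return focal_length * real_size / apparent


near = apparent_size(real_size=2.0, distance=4.0)
far = apparent_size(real_size=2.0, distance=8.0)
recovered = distance_from_scale(real_size=2.0, apparent=near)
```

Doubling the distance halves the apparent size, so a change of scale in 2D can be read as a change of location in 3D once the real size is known or assumed.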
4. Oh yes, we can then revisit colour and motion at the end.