Convolutional neural networks that are used to recognize shapes typically use one or more layers of learned feature detectors that produce scalar outputs, interleaved with subsampling.
In this way they obtain translational invariance.
However important informations regarding the spatial relationships of the parts and their poses in respect to the total object are lost.
By contrast, the computer vision community uses complicated, hand-engineered features, like SIFT, that produce a whole vector of outputs including an explicit representation of the pose of the feature.
Capsule Networks by G. E. Hinton show how neural networks can be used to learn features that output a whole vector of instantiation parameters, a more promising way of dealing with variations in position, orientation, scale and lighting than the methods currently employed in the neural networks community.
CapsNets, instead of disregarding the relative positioning, try to model the global-relative transformations of different sub-parts along a hierarchy, building a parse-tree in which they move using a routing by agreement mechanism and pursuing equivariance.
These networks therefore include a viewpoint/orientation awareness and respond differently to different orientations.