The ability to decompose visual scenes into abstract building blocks is crucial for general intelligence. These building blocks can represent meaningful properties, interactions, and other relations across scenes.
Such decompositions can simplify reasoning and strengthen the robustness of the representation, for instance against potential attacks. In particular, representing perceptual observations in terms of entities should improve data efficiency and object representations, and should allow knowledge to be transferred across a wide range of tasks. Thus, we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. Graphs are a useful abstraction of image content: they can represent not only details about individual objects in a scene but also the interactions among them.