Vision Meets Cognition: Functionality, Physics, Intentionality and Causality


Sensors have recently proliferated in digital devices such as virtual reality headsets, autonomous vehicles, smart-home systems and IoT devices, yet advanced back-end algorithms for processing and handling the visual data they produce are still largely missing. One key challenge is that algorithms for understanding images, human behaviors, and social activities from sensors deployed in daily life go beyond the traditional scope of image and scene understanding: they are expected to answer queries much broader and deeper than “what is where”. The mission of this workshop is to (a) identify the key domains in modern computer vision; (b) formalize the computational challenges in these domains; and (c) provide promising frameworks to solve these challenges.

Here we propose Functionality, Physics, Intentionality and Causality (FPIC) as four key domains beyond “what is where”:


- Functionality: What can you do with the tree trunk?
- Physics: How likely is the stone balancing?
- Intentionality: Why does the guy kick the door?
- Causality: Who knocked down the domino?

The combination of these four largely orthogonal dimensions spans a large space of research in image and scene understanding.

Despite their apparent differences at first glance, these domains connect with each other in theoretically important ways: (a) they usually do not project easily onto explicit visual features; (b) existing computer vision algorithms are not competent in these domains and, in most cases, not applicable at all; and (c) human vision is nevertheless highly efficient in these domains, and human-level reasoning often builds upon prior knowledge in them. Therefore, studying FPIC should significantly narrow the gap between computer vision and human vision, not only for visual recognition but also for understanding visual scenes with common-sense knowledge and lifelong learning.

The introduction of FPIC will advance a vision system in three aspects: (a) Transfer learning. As a higher-level representation, FPIC tends to be globally invariant across the entire human living space, so learning in one type of scene can be transferred to novel situations. (b) Small-sample learning. Learning of FPIC, which is consistent and noise-free, is possible even without a wealth of previous experience or “big data”. (c) Bidirectional inference. Inference with FPIC requires combining top-down abstract knowledge with bottom-up visual patterns, and the two processes can boost each other as a result.

Several key topics are:

- Representation of visual structure and commonsense knowledge
- Recognition of object function / affordances
- Physically grounded scene interpretation
- 3D scene acquisition, modeling and reconstruction
- Human-object-scene interaction
- Physically plausible pose / action modeling
- Reasoning about goals and intents of the agents in the scenes
- Causal models in vision
- Abstract knowledge learning and transferring
- Top-down and bottom-up inference algorithms
- Related topics in cognitive science and visual perception
- Applications of FPIC to augmented and mixed reality

In conjunction with CVPR 2017, our third Vision Meets Cognition workshop will bring together researchers from computer vision, computer graphics, robotics and cognitive science to advance computer vision systems that go beyond answering “what is where” in an image and build a sophisticated understanding of an image in terms of Functionality, Physics, Intentionality and Causality (FPIC). In effect, these abilities allow an observer to answer an almost limitless range of questions about an image using finite, general-purpose models. At the same time, we want to emphasize that FPIC is not meant to be an exclusive set of image and scene understanding problems. We welcome scholars who share the same perspective but work on different problems.


Yixin Zhu

Center for Vision, Cognition, Learning & Autonomy

Computer Graphics & Vision Laboratory

Lap-Fai Yu
UMass Boston

Graphics and Virtual Environments Lab

Computational Cognitive Science lab

Ping Wei

Center for Vision, Cognition, Learning & Autonomy

Peter Battaglia

DeepMind, London

Tao Gao

Computational Cognitive Science lab