The goal of computer vision, as framed by Marr, is to compute what is where by looking. This paradigm guided the geometry-based approaches of the 1980s and 1990s and the appearance-based methods of the past 20 years. Despite remarkable progress in recognizing objects, actions and scenes using large data sets, better-designed features, and machine learning techniques, performance on challenging benchmarks is still far from satisfactory. While there is still some room for improvement, we believe that detection and recognition tasks cannot be solved by visible appearance alone. To gain the remaining percentage points, we must look at the bigger picture and model and reason about the missing dimensions.
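Here we propose Functionality, Physics, Intentionality and Causality (FPIC) as four key domains beyond “what is where”:
- Functionality: What can you do with the tree trunk?
- Physics: How likely is the stone to stay balanced?
- Intentionality: Why does the guy kick the door?
- Causality: Who knocked down the dominoes?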
The combination of these four largely orthogonal dimensions spans a large space of research in image and scene understanding.
Despite their apparent differences at first glance, these domains connect with each other in ways that are theoretically important: (a) they usually do not project easily onto explicit visual features; (b) existing computer vision algorithms are not competent in these domains and, in most cases, are not applicable at all; and (c) human vision is nevertheless highly efficient in these domains, and human-level reasoning often builds upon prior knowledge in them. Therefore, studying FPIC should help close the gap between computer vision and human vision, not only for visual recognition but also for understanding visual scenes with common-sense knowledge and lifelong learning.
The introduction of FPIC will advance a vision system in three aspects: (a) transfer learning. As a higher-level representation, FPIC tends to be globally invariant across the entire human living space. Therefore, what is learned in one type of scene can be transferred to novel situations; (b) small sample learning. Learning of FPIC, which is consistent and noise-free, is possible even without a wealth of previous experience or “big data”; and (c) bidirectional inference. Inference with FPIC requires combining top-down abstract knowledge with bottom-up visual patterns, and the two processes can boost each other as a result.
Several key topics are:
- Representation of visual structure and commonsense knowledge
- Recognition of object function / affordances
- Physically grounded scene interpretation
- 3D scene acquisition, modeling and reconstruction
- Human-object-scene interaction
- Physically plausible pose / action modeling
- Reasoning about goals and intents of the agents in the scenes
- Causal model in vision
- Abstract knowledge learning and transferring
- Top-down and Bottom-up inference algorithms
- Related topics in cognitive science and visual perception
- Applications of FPIC to augmented and mixed reality
In conjunction with CVPR 2018, our 4th Vision meets Cognition workshop will bring together researchers from computer vision, computer graphics, robotics and cognitive science to advance computer vision systems beyond answering “what is where” in an image and toward building a sophisticated understanding of an image in terms of Functionality, Physics, Intentionality and Causality (FPIC). In effect, these abilities allow an observer to answer an almost limitless range of questions about an image using finite and general-purpose models. At the same time, we want to emphasize that FPIC is not meant to be an exclusive set of image and scene understanding problems. We welcome any scholars who share the same perspective but are working on different problems.