PI: Charless Fowlkes
Institution: University of California, Irvine
A key feature missing from most deep CNN architectures is the incorporation of high-level, top-down
feedback and geometric scene context. We have been exploring this idea along several lines.
We developed a recurrent segmentation model that predicts scene depth from perspective cues and uses these
top-down depth estimates to modulate pooling regions. This architecture shows promising results,
including state-of-the-art semantic segmentation and depth estimation on street scenes relevant to self-driving applications.
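The idea of depth-modulated pooling can be illustrated with a minimal sketch: the pooling window at each location shrinks with estimated depth, since distant objects subtend smaller image regions. The function name, inverse-depth scaling rule, and parameter values below are illustrative assumptions, not the exact formulation of our model.

```python
import numpy as np

def depth_modulated_pool(features, depth, base_size=4, ref_depth=10.0):
    """Average-pool each location over a window whose size shrinks with
    estimated depth (perspective scaling): nearby regions (small depth)
    get large windows, distant regions get small ones.
    All names and constants here are illustrative assumptions."""
    h, w = features.shape
    out = np.empty_like(features, dtype=float)
    for i in range(h):
        for j in range(w):
            # window radius scales inversely with depth at this pixel
            r = max(1, int(round(base_size * ref_depth / depth[i, j] / 2)))
            i0, i1 = max(0, i - r), min(h, i + r + 1)
            j0, j1 = max(0, j - r), min(w, j + r + 1)
            out[i, j] = features[i0:i1, j0:j1].mean()
    return out
```

In a recurrent architecture, the depth map driving this pooling would itself be a top-down prediction of the network, refined over iterations.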
Beyond estimating scene depth, we can also attempt to estimate full scene geometry (including
occluded surfaces) from a single image. Our latest approach to this uses a novel multi-layer representation
of scene depth trained on synthetic scenes for scene completion in a fully convolutional framework.
This most recent work was carried out on CHASE-CI this fall (~8 GPU-months of compute), and a
manuscript is currently under review.
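One way to picture a multi-layer depth representation is as a per-pixel stack of depths: layer 0 is the nearest (visible) surface and subsequent layers are successively occluded surfaces behind it. The encoding below is a simplified sketch under assumed shapes and conventions (np.inf marks absent surfaces), not the representation used in the manuscript.

```python
import numpy as np

def layered_depth(surface_depths, num_layers=2):
    """Encode scene geometry as a multi-layer depth map.

    surface_depths: array of shape (S, H, W) giving each candidate
    surface's depth per pixel, with np.inf where a surface is absent.
    Returns (num_layers, H, W): layer 0 is the visible surface,
    later layers are occluded surfaces. Shapes and conventions here
    are illustrative assumptions."""
    s, h, w = surface_depths.shape
    sorted_d = np.sort(surface_depths, axis=0)  # near-to-far at each pixel
    layers = np.full((num_layers, h, w), np.inf)
    layers[:min(num_layers, s)] = sorted_d[:num_layers]
    return layers
```

A fully convolutional network trained on synthetic scenes can then regress such a stack directly, predicting occluded geometry alongside the visible depth map.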
We hypothesize that strong knowledge of the 3D structure of a scene (e.g., derived from a map or 3D scan) provides
constraints that can improve estimation of human pose. For example, one can produce good estimates of the hip
positions of a person sitting on a bench of known height, even when the hips are substantially occluded. To explore
this, we have just finished collecting a large dataset of 3D human pose affordances using a commercial motion capture
system. The data includes ground-truth 3D joint locations, video streams from 5 viewpoints, and precise 3D scene
geometry. We are now starting to train new CNN architectures that take a description of the scene geometry as an
additional input and produce pose estimates satisfying the physical constraints imposed by the scene.
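The bench example above can be sketched as a soft scene-geometry prior on a joint estimate: when a joint is heavily occluded, the estimate is pulled toward the height implied by the known support surface. The blending rule, the offset value, and the coordinate convention (y = height above ground) are all illustrative assumptions.

```python
import numpy as np

def refine_joint_with_scene(joint_est, seat_height, visibility, offset=0.09):
    """Blend an image-based 3D joint estimate (x, y, z) with a scene prior.

    When visibility is low (joint occluded), trust the known seat height
    plus a small offset for the hip; when visibility is high, keep the
    image-based estimate. All parameter values are illustrative."""
    y_prior = seat_height + offset  # expected hip height when seated
    out = np.asarray(joint_est, dtype=float).copy()
    out[1] = visibility * out[1] + (1.0 - visibility) * y_prior
    return out
```

In the full model this constraint would be learned rather than hand-coded, with the network conditioning on the scene geometry input directly.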