Visual perception refers to automatically recognizing, detecting, or otherwise sensing the content of an image, video or scene. The most common contemporary approach to a visual perception task is to train a deep neural network on a pre-existing dataset which provides examples of task success and failure. Despite remarkable recent progress across a wide range of vision tasks, many standard methodologies are static in that they lack mechanisms for adapting to the particular settings or constraints of the task at hand. The ability to adapt is desirable in many practical scenarios, since the operating regime often differs from the training setup. For example, a robot which has learnt to recognize a static set of training images may perform poorly in real-world settings, where it may view objects from unusual angles or explore poorly illuminated environments. The robot should then ideally be able to actively position itself to observe the scene from viewpoints where it is more confident, or refine its perception with only a limited amount of training data for its present operating conditions.
In this thesis we demonstrate how reinforcement learning (RL) can be integrated with three fundamental visual perception tasks -- object detection, human pose estimation, and semantic segmentation -- in order to make the resulting pipelines more adaptive, more accurate, and/or faster. In the first part we provide object detectors with the capacity to actively select which parts of a given image to analyze and when to terminate the detection process. Several ideas are proposed and empirically evaluated, such as explicitly including the speed-accuracy trade-off in the training process, which makes it possible to specify this trade-off during inference. In the second part we consider active multi-view 3D human pose estimation in complex scenarios with multiple people. We explore this in two different contexts: i) active triangulation, which requires carefully observing each body joint from multiple viewpoints, and ii) active viewpoint selection for monocular 3D estimators, which requires considering which viewpoints yield accurate fused estimates when combined. In both settings the viewpoint selection systems face several challenges, such as partial observability resulting, e.g., from occlusions. We show that RL-based methods outperform heuristic ones in accuracy, with negligible computational overhead. Finally, the thesis concludes by establishing a framework for embodied visual active learning in the context of semantic segmentation, where an agent should explore a 3D environment and actively query annotations to refine its visual perception. Our empirical results suggest that reinforcement learning can be successfully applied within this framework as well.
Translated title of the contribution: Aktiv visuell perception via förstärkningsinlärning
- Mathematics (Faculty of Engineering)
- Sminchisescu, Cristian, Supervisor
- Olsson, Carl, Supervisor
Award date: 2021 Jun 10
ISBN (electronic): 978-91-7895-796-5
Publication status: Published - 2021
Place: Lecture hall MH:Hörmander, Centre of Mathematical Sciences, Sölvegatan 18, Faculty of Engineering LTH, Lund University, Lund. Zoom: https://lu-se.zoom.us/j/67213391794?pwd=WE1ZOE9KNlZIbTZvYnFhSlVqWU1tZz09
Name: Kjellström, Hedvig
Affiliation: KTH Royal Institute of Technology, Sweden.
In reference to IEEE copyrighted material which is used with permission in this thesis, the IEEE does not endorse any of Lund University's products or services. Internal or personal use of this material is permitted. If interested in reprinting/republishing IEEE copyrighted material for advertising or promotional purposes or for creating new collective works for resale or redistribution, please go to http://www.ieee.org/publications_standards/publications/rights/rights_link.html to learn how to obtain a License from RightsLink.
- Computer Vision and Robotics (Autonomous Systems)
- computer vision
- reinforcement learning
- deep learning
- active vision
- object detection
- human pose estimation
- semantic segmentation