Sparse Multi-View Computer Vision for 3D Human and Scene Understanding

Research output: Thesis, Doctoral Thesis (compilation)


Abstract

Perceiving and understanding human motion is a fundamental problem in computer vision, with diverse applications encompassing sports analytics, healthcare monitoring, entertainment, and intelligent interactive systems. Multi-camera systems, by capturing multiple viewpoints simultaneously, enable robust tracking and reconstruction of human poses in 3D, overcoming limitations of single-view approaches. This thesis addresses key bottlenecks encountered when designing and deploying multi-camera systems for 3D human and scene understanding beyond controlled laboratory settings.

Paper I introduces a human-pose-based approach to extrinsic camera calibration that leverages naturally occurring human motion in the scene. By incorporating a 3D pose likelihood model in kinematic chain space and a distance-aware confidence-weighted reprojection loss, we enable accurate wide-baseline calibration without calibration equipment. This allows for rapid deployment and reconfiguration of multi-camera systems without requiring technical expertise.
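
A confidence-weighted reprojection loss of the kind mentioned above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the single shared focal length `f`, and in particular the way the "distance-aware" term is modelled (scaling the pixel error by depth to approximate a metric error) are assumptions made purely for illustration.

```python
import math

def project(point, R, t, f, c):
    """Project a 3D world point with a pinhole camera; returns (u, v, depth)."""
    # Camera-frame coordinates: X_cam = R @ point + t
    X = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    return f * X[0] / X[2] + c[0], f * X[1] / X[2] + c[1], X[2]

def weighted_reprojection_loss(joints3d, detections, R, t, f, c):
    """Sum of confidence-weighted, depth-scaled squared reprojection errors.

    detections holds one (u, v, confidence) tuple per 3D joint.
    """
    total = 0.0
    for joint, (u_obs, v_obs, conf) in zip(joints3d, detections):
        u, v, depth = project(joint, R, t, f, c)
        pixel_err = math.hypot(u - u_obs, v - v_obs)
        # ASSUMPTION: "distance-aware" is modelled by converting the pixel
        # error to an approximate metric error at the joint's depth.
        metric_err = pixel_err * depth / f
        total += conf * metric_err ** 2
    return total

# Hypothetical example: identity extrinsics, one joint 2 m in front of the camera.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
joints = [(0.0, 0.0, 2.0)]
loss_exact = weighted_reprojection_loss(joints, [(500.0, 500.0, 0.9)], R, t, 1000.0, (500.0, 500.0))
loss_off = weighted_reprojection_loss(joints, [(510.0, 500.0, 0.5)], R, t, 1000.0, (500.0, 500.0))
```

In a calibration setting, a loss like this would be minimized over the extrinsics `R` and `t` of each camera; low-confidence 2D detections then contribute less to the estimate.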

The reliance on large labeled datasets presents a significant obstacle to the widespread adoption of action recognition systems. In Paper II, we propose a self-supervised learning framework for skeleton-based action recognition. We adapt Bootstrap Your Own Latent (BYOL) to learn representations of 3D human pose sequences. Our contributions include multi-viewpoint sampling that leverages existing multi-camera data, and asymmetric augmentation pipelines that reduce the domain shift encountered when fine-tuning the network for downstream tasks. This self-supervised method reduces the need for labeled data, shortening development time for new applications.
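
For readers unfamiliar with BYOL, its two core ingredients can be sketched in a few lines: a regression loss between the online network's prediction and the target network's projection, and an exponential-moving-average (EMA) update of the target weights. The sketch below uses plain Python lists in place of network outputs and parameters; the function names and the value of `tau` are illustrative assumptions, and nothing here reflects the specific architecture or augmentations used in Paper II.

```python
import math

def byol_loss(prediction, target):
    """BYOL regression loss: 2 - 2 * cosine similarity of the two vectors."""
    dot = sum(p * t for p, t in zip(prediction, target))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 2.0 - 2.0 * dot / (norm_p * norm_t)

def ema_update(target_params, online_params, tau=0.99):
    """Move the target weights toward the online weights by a factor (1 - tau)."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]

# Hypothetical embeddings of two views of the same pose sequence.
aligned = byol_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])   # unrelated directions
new_target = ema_update([1.0, 1.0], [0.0, 2.0], tau=0.9)
```

With multi-viewpoint sampling, one plausible reading is that the two inputs are the same motion observed from two different cameras rather than two synthetic augmentations of one view; that interpretation is an assumption here.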

Paper III focuses on robust 3D human pose reconstruction, particularly in challenging real-world scenarios. Triangulation-based methods struggle in occluded or sparsely covered scenes. We design an encoder-decoder Transformer model that regresses 3D human poses from multi-view 2D pose sequences, and introduce a biased attention mechanism that leverages geometric relationships between views and detection confidence scores. Our approach enables robust reconstruction of 3D human poses under heavy occlusion and when few input views are available.
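
The general idea of biasing attention can be sketched as adding a per-key term to the scaled dot-product scores before the softmax. The example below is a generic illustration, not the mechanism from Paper III: using the logarithm of the detection confidence as the bias is an assumption, and the geometric bias between views is omitted entirely.

```python
import math

def biased_attention(query, keys, values, bias):
    """Single-query attention: softmax(q.k / sqrt(d) + bias) applied to values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + b
              for key, b in zip(keys, bias)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical case: two views with identical content scores but
# different 2D detection confidences.
confidences = [0.9, 0.1]
bias = [math.log(c) for c in confidences]  # ASSUMPTION: log-confidence bias
output, weights = biased_attention(
    [1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0], [0.0]], bias
)
```

When the content scores are equal, a log-confidence bias makes the attention weights proportional to the confidences, so an unreliable (for example, occluded) view is down-weighted without being discarded.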

In Paper IV, we tackle open-vocabulary 3D object detection from sparse multi-view RGB data. Our approach builds on pre-trained, off-the-shelf 2D networks and does not require retraining. We lift 2D detections into 3D via monocular depth estimation, followed by multi-view feature consistency optimization and 3D fusion of sparse proposals. Our experiments show that this approach produces results comparable to state-of-the-art methods in the densely sampled setting while significantly outperforming them in the sparse-view setting.
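
The first step, lifting a 2D detection into 3D with an estimated depth map, rests on standard pinhole back-projection. The sketch below shows only that step under simplifying assumptions: the helper names are hypothetical, the box is reduced to its center, and the median depth inside the box stands in for whatever depth aggregation the paper actually uses. The feature consistency optimization and 3D fusion stages are not shown.

```python
from statistics import median

def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) at the given depth to camera-frame 3D coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def lift_box_center(box, depth_map, fx, fy, cx, cy):
    """Lift the center of a 2D box to 3D.

    box is (x0, y0, x1, y1) in pixels with x1 and y1 exclusive;
    depth_map is indexed as depth_map[row][col].
    """
    x0, y0, x1, y1 = box
    depths = [depth_map[r][c] for r in range(y0, y1) for c in range(x0, x1)]
    # ASSUMPTION: the median depth inside the box represents the object.
    z = median(depths)
    return backproject((x0 + x1) / 2.0, (y0 + y1) / 2.0, z, fx, fy, cx, cy)

# Hypothetical 4x4 depth map with a constant depth of 2 m.
depth_map = [[2.0] * 4 for _ in range(4)]
center = lift_box_center((2, 0, 4, 2), depth_map, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because monocular depth is typically only accurate up to scale and offset, proposals lifted this way from different views generally disagree, which is presumably why a subsequent multi-view consistency and fusion stage is needed.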
Original language: English
Qualification: Doctor
Awarding Institution
  • Mathematics (Faculty of Engineering)
Supervisors/Advisors
  • Åström, Kalle, Supervisor
  • Larsson, Viktor, Assistant supervisor
  • Tufvesson, Fredrik, Assistant supervisor
  • Huang, Sangxia, Assistant supervisor, External person
  • Petef, Andrej, Assistant supervisor, External person
Award date: 2025 Oct 10
Place of Publication: Lund
ISBN (Print): 978-91-8104-604-5
ISBN (electronic): 978-91-8104-605-2
Publication status: Published - 2025

Bibliographical note

Defence details
Date: 2025-10-10
Time: 13:15
Place: Lecture Hall MH:Hörmander, Centre of Mathematical Sciences, Märkesbacken 4, Faculty of Engineering LTH, Lund University, Lund. The dissertation will be live streamed, but part of the premises is to be excluded from the live stream. Zoom: https://lu-se.zoom.us/j/69406882444?pwd=PQGCrAosqGNabtGs5pAxec2bQraJaO.1
External reviewer(s)
Name: Rhodin, Helge
Title: Prof.
Affiliation: Bielefeld University, Germany.
---

Subject classification (UKÄ)

  • Computer graphics and computer vision

Free keywords

  • Multi-view Geometry
  • Extrinsic Camera Calibration
  • Multi-camera System
  • 3D Human Pose Estimation
  • Skeleton-based Action Recognition
  • Self-supervised Learning
  • 3D Object Detection
  • 3D Scene Understanding
