Improved deep depth estimation for environments with sparse visual cues

Niclas Joswig*, Juuso Autiosalo, Laura Ruotsalainen

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review



Most deep learning-based depth estimation models that learn scene structure self-supervised from monocular video base their estimates on visual cues such as vanishing points. In the established depth estimation benchmarks, which depict, for example, street navigation or indoor offices, these cues appear consistently, enabling neural networks to predict depth maps from single images. In this work, we address the challenge of depth estimation from a real-world bird's-eye perspective in an industrial environment that, owing to its special geometry, contains minimal visual cues and hence requires incorporating the temporal domain for structure-from-motion estimation. To enable the system to infer structure from motion from pixel translation in context-sparse, i.e., visual-cue-sparse, scenery, we propose a novel architecture built upon the structure-from-motion learner, which uses temporal pairs of jointly unrotated and stacked images for depth prediction. To increase overall performance and to avoid blurred depth edges that lie between the edges of the two input images, we integrate a geometric consistency loss into our pipeline. We assess the model's ability to learn structure from motion by introducing a novel industry dataset whose perspective, orthogonal to the floor, contains only minimal visual cues. Through evaluation against ground-truth depth, we show that our proposed method outperforms the state of the art in difficult context-sparse environments.
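The two ingredients named in the abstract — cancelling the relative rotation between two consecutive frames and stacking them channel-wise so that the remaining pixel translation carries the depth signal, plus a geometric consistency term on overlapping depth predictions — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the nearest-neighbour in-plane unrotation (plausible for a camera looking orthogonally at the floor), and the normalized depth-difference form of the consistency loss are all assumptions made for the sketch.

```python
import numpy as np

def unrotate(img, angle_rad):
    """Rotate an (H, W, C) image about its centre by -angle_rad,
    using nearest-neighbour sampling, so a known in-plane camera
    rotation between two frames can be cancelled."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    # Inverse-map every output pixel back into the source image.
    sx = c * (xs - cx) + s * (ys - cy) + cx
    sy = -s * (xs - cx) + c * (ys - cy) + cy
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    return img[sy, sx]

def stack_unrotated_pair(img_t, img_t1, rel_angle_rad):
    """Unrotate the second frame into the first frame's orientation
    and stack channel-wise: two RGB frames become one 6-channel
    input in which residual pixel motion is pure translation."""
    return np.concatenate([img_t, unrotate(img_t1, rel_angle_rad)], axis=-1)

def geometric_consistency_loss(depth_a, depth_b_aligned):
    """Normalized absolute depth difference, averaged over pixels;
    penalizes inconsistent depth predictions for the same surface
    and thereby discourages blurred, doubled depth edges."""
    diff = np.abs(depth_a - depth_b_aligned)
    return float(np.mean(diff / (depth_a + depth_b_aligned)))
```

With zero relative rotation the stacked pair is just the two frames concatenated, and two identical aligned depth maps incur zero consistency loss; in training, `depth_b_aligned` would be the second view's depth warped into the first view using the estimated pose.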

Original language: English
Article number: 18
Number of pages: 12
Issue number: 1
Publication status: Published - Jan 2023
MoE publication type: A1 Journal article-refereed


  • Computer vision
  • Deep learning
  • Monocular depth
  • Visual SLAM


