
Are 3D convolutional networks inherently biased towards appearance?

  • Petr Byvshev
  • Yu Xiao
  • Pascal Mettes

Research output: Contribution to journal › Article › Scientific › peer-review

6 Citations (Scopus)
100 Downloads (Pure)

Abstract

3D convolutional networks, as direct inheritors of 2D convolutional networks for images, have placed their mark on action recognition in videos. Combined with pretraining on large-scale video data, high classification accuracies have been obtained on numerous video benchmarks. In an effort to better understand why 3D convolutional networks are so effective, several works have highlighted their bias towards static appearance and towards the scenes in which actions occur. In this work, we seek to find the source of this bias and question whether the observed biases towards static appearances are inherent to 3D convolutional networks or represent limited significance of motion in the training data. We resolve this by presenting temporality measures that estimate the data-to-model motion dependency at both the layer-level and the kernel-level. Moreover, we introduce two synthetic datasets where motion and appearance are decoupled by design, which allows us to directly observe their effects on the networks. Our analysis shows that 3D architectures are not inherently biased towards appearance. When trained on the most prevalent video sets, 3D convolutional networks are indeed biased throughout, especially in the final layers of the network. However, when training on data with motions and appearances explicitly decoupled and balanced, such networks adapt to varying levels of temporality. To this end, we see the proposed measures as a reliable method to estimate motion relevance for activity classification in datasets and use them to uncover the differences between popular pre-training video collections, such as Kinetics, IG-65M and HowTo100M.
Original language: English
Article number: 103437
Number of pages: 12
Journal: Computer Vision and Image Understanding
Volume: 220
Issue number: 103437
DOIs
Publication status: Published - Jul 2022
MoE publication type: A1 Journal article-refereed

Keywords

  • 3D models
  • Temporality measure
  • Motion analysis
  • Large-scale video sets
