TY - JOUR
T1 - Affect in Multimedia
T2 - Benchmarking Violent Scenes Detection
AU - Constantin, Mihai Gabriel
AU - Stefan, Liviu Daniel
AU - Ionescu, Bogdan
AU - Demarty, Claire-Hélène
AU - Sjöberg, Mats
AU - Schedl, Markus
AU - Gravier, Guillaume
PY - 2020/1/1
Y1 - 2020/1/1
N2 - In this paper, we report on the creation of a publicly available, common evaluation framework for Violent Scenes Detection (VSD) in Hollywood and YouTube videos. We propose a robust data set, the VSD96, with more than 96 hours of video of various genres, annotations at different levels of detail (e.g., shot-level, segment-level), annotations of mid-level concepts (e.g., blood, fire), various pre-computed multi-modal descriptors, and over 230 system output results as baselines. This is the most comprehensive data set available to date tailored to the VSD task, and it was extensively validated during the MediaEval benchmarking campaigns. Furthermore, we provide an in-depth analysis of the crucial components of VSD algorithms by reviewing the capabilities and the evolution of existing systems (e.g., overall trends and outliers, influence of the employed features and fusion techniques). Finally, we discuss the possibility of going beyond state-of-the-art performance via an ad-hoc late fusion approach. Experimentation is carried out on the VSD96 data. The increasing number of publications using the VSD96 data underlines the importance of the topic. The presented and published resources are a practitioner's guide and also a strong baseline to overcome, which will help researchers in the coming years analyze aspects of audio-visual affect and violence detection in movies and videos.
AB - In this paper, we report on the creation of a publicly available, common evaluation framework for Violent Scenes Detection (VSD) in Hollywood and YouTube videos. We propose a robust data set, the VSD96, with more than 96 hours of video of various genres, annotations at different levels of detail (e.g., shot-level, segment-level), annotations of mid-level concepts (e.g., blood, fire), various pre-computed multi-modal descriptors, and over 230 system output results as baselines. This is the most comprehensive data set available to date tailored to the VSD task, and it was extensively validated during the MediaEval benchmarking campaigns. Furthermore, we provide an in-depth analysis of the crucial components of VSD algorithms by reviewing the capabilities and the evolution of existing systems (e.g., overall trends and outliers, influence of the employed features and fusion techniques). Finally, we discuss the possibility of going beyond state-of-the-art performance via an ad-hoc late fusion approach. Experimentation is carried out on the VSD96 data. The increasing number of publications using the VSD96 data underlines the importance of the topic. The presented and published resources are a practitioner's guide and also a strong baseline to overcome, which will help researchers in the coming years analyze aspects of audio-visual affect and violence detection in movies and videos.
KW - Benchmark testing
KW - benchmarking
KW - literature review
KW - Machine learning
KW - Market research
KW - Motion pictures
KW - multi-modal content description
KW - Task analysis
KW - Videos
KW - violent scenes detection
KW - VSD96 data set
KW - YouTube
UR - http://www.scopus.com/inward/record.url?scp=85083746591&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2020.2986969
DO - 10.1109/TAFFC.2020.2986969
M3 - Article
AN - SCOPUS:85083746591
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
SN - 1949-3045
ER -