The MediaEval 2014 Violent Scenes Detection task challenged participants to automatically find violent scenes in a set of videos. We propose to first predict a set of mid-level concepts from low-level visual and auditory features, then fuse the concept predictions and features to detect violent content. With the objective of obtaining a highly generic approach, we deliberately restrict ourselves to simple general-purpose descriptors with limited temporal context and a common neural network classifier. The system used this year is largely based on the one successfully employed by our group in 2012 and 2013, with some improvements and updated features. Our best-performing run with regard to the official metric achieved a MAP2014 of 45.06% in the main task and 66.38% in the generalization task.
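
The two-stage design described above can be illustrated with a minimal sketch. It is not the system itself: the single sigmoid layer per stage, the fusion by simple concatenation, the function names, and all weights and feature values are hypothetical and chosen only to show how concept predictions and low-level features might be combined.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_sigmoid(inputs: Sequence[float],
                  weights: Sequence[Sequence[float]],
                  biases: Sequence[float]) -> List[float]:
    """One fully connected layer with sigmoid outputs (one output per weight row)."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_violence(features: Sequence[float],
                     concept_w: Sequence[Sequence[float]],
                     concept_b: Sequence[float],
                     fusion_w: Sequence[float],
                     fusion_b: float) -> float:
    """Two-stage prediction for a single video segment.

    Stage 1 maps low-level features to mid-level concept scores.
    Stage 2 fuses those scores with the original features
    (assumption: plain concatenation) and outputs a violence score.
    """
    concepts = dense_sigmoid(features, concept_w, concept_b)
    fused = list(concepts) + list(features)
    return dense_sigmoid(fused, [fusion_w], [fusion_b])[0]


# Hypothetical toy values: 3 low-level features, 2 mid-level concepts.
features = [0.2, 0.8, 0.5]
concept_w = [[1.0, -0.5, 0.3],
             [0.2, 0.9, -0.4]]
concept_b = [0.0, -0.1]
fusion_w = [1.5, 0.7, 0.1, -0.2, 0.4]  # 2 concept scores + 3 features
fusion_b = -0.5

score = predict_violence(features, concept_w, concept_b, fusion_w, fusion_b)
print(f"violence score: {score:.3f}")
```

In a real setting both stages would be trained on annotated data; the sketch only fixes the data flow, in which the second stage sees the concept scores alongside the raw features rather than instead of them.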