We aimed at predicting the temporal evolution of brain activity in naturalistic music listening conditions using a combination of neuroimaging and acoustic feature extraction. Participants were scanned using functional Magnetic Resonance Imaging (fMRI) while listening to two musical medleys, including pieces from various genres with and without lyrics. Regression models were built to predict voxel-wise brain activations which were then tested in a cross-validation setting in order to evaluate the robustness of the hence created models across stimuli. To further assess the generalizability of the models we extended the cross-validation procedure by including another dataset, which comprised continuous fMRI responses of musically trained participants to an Argentinean tango. Individual models for the two musical medleys revealed that activations in several areas in the brain belonging to the auditory, limbic, and motor regions could be predicted. Notably, activations in the medial orbitofrontal region and the anterior cingulate cortex, relevant for self-referential appraisal and aesthetic judgments, could be predicted successfully. Cross-validation across musical stimuli and participant pools helped identify a region of the right superior temporal gyrus, encompassing the planum polare and the Heschl's gyrus, as the core structure that processed complex acoustic features of musical pieces from various genres, with or without lyrics. Models based on purely instrumental music were able to predict activation in the bilateral auditory cortices, parietal, somatosensory, and left hemispheric primary and supplementary motor areas. The presence of lyrics on the other hand weakened the prediction of activations in the left superior temporal gyrus. Our results suggest spontaneous emotion-related processing during naturalistic listening to music and provide supportive evidence for the hemispheric specialization for categorical sounds with realistic stimuli. We herewith introduce a powerful means to predict brain responses to music, speech, or soundscapes across a large variety of contexts.