The multimodal approach is becoming an increasingly attractive and common method in multimedia information retrieval and description. It often yields better content recognition results than unimodal methods alone, although, depending on the data used, this is not always the case. Most current multimodal media content classification methods still depend on unimodal recognition results. For both uni- and multimodal approaches, it is important to choose the best features and classification models. In addition, when unimodal models are used, the final multimodal recognition results still need to be produced with an appropriate late fusion technique. In this article, we study several multi- and unimodal recognition methods, their features and the techniques for combining them, in the application setup of concept detection in image–text data. We consider both single- and multi-label recognition tasks. As image features, we use GoogLeNet deep convolutional neural network (DCNN) activation features and semantic concept, or classeme, vectors. As text features, we use simple binary tag vectors and word2vec embedding vectors. The Multimodal Deep Boltzmann Machine (DBM) is used as the multimodal model and the Support Vector Machine (SVM), with both linear and non-linear radial basis function (RBF) kernels, as the unimodal one. The experiments are performed with the MIRFLICKR-1M and NUS-WIDE datasets. The results show that the two models perform equally well in the single-label recognition task of the former dataset, while the Multimodal DBM produces clearly better results in the multi-label task of the latter. Compared with results reported in the literature, we exceed the state of the art on both datasets, mostly owing to the use of DCNN features and the semantic concept vectors based on them.
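
As an illustration of the late fusion step mentioned above, the following minimal sketch combines per-concept scores from two unimodal classifiers by a weighted average. The abstract does not state which fusion rule is used, so the averaging rule, the function name `late_fusion`, the `image_weight` parameter, the 0.5 decision threshold and the score values are all assumptions made for illustration only.

```python
import numpy as np


def late_fusion(image_scores, text_scores, image_weight=0.5):
    """Fuse two unimodal score matrices (items x concepts) by weighted averaging.

    Hypothetical helper: weighted averaging is one common late fusion rule,
    not necessarily the one used in the experiments.
    """
    image_scores = np.asarray(image_scores, dtype=float)
    text_scores = np.asarray(text_scores, dtype=float)
    if image_scores.shape != text_scores.shape:
        raise ValueError("score matrices must have the same shape")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    return image_weight * image_scores + (1.0 - image_weight) * text_scores


# Made-up scores for 2 items and 3 concepts, e.g. probability-like outputs
# of an image-based and a text-based classifier.
image_scores = np.array([[0.9, 0.2, 0.4],
                         [0.1, 0.8, 0.3]])
text_scores = np.array([[0.7, 0.4, 0.2],
                        [0.3, 0.6, 0.9]])

fused = late_fusion(image_scores, text_scores, image_weight=0.5)

# Single-label task: pick the highest-scoring concept for each item.
single_label = fused.argmax(axis=1)

# Multi-label task: keep every concept whose fused score reaches a threshold
# (0.5 here is an arbitrary, assumed value).
multi_label = fused >= 0.5
```

In practice the fusion weight and the threshold would be tuned on held-out validation data rather than fixed in advance.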