With the rapid development of cloud computing, IoT, and AI technologies, cloud video surveillance (CVS) has become a hot topic, especially given the demand for real-time analysis in smart applications. Object detection usually plays an important role in environment monitoring and activity tracking in surveillance systems. The emerging edge-cloud computing paradigm offers an opportunity to process the continuously generated, huge volume of surveillance data on-site across IoT systems. However, detection performance remains far from satisfactory due to complex surveillance environments. In this study, we focus on multi-target detection for real-time surveillance in smart IoT systems. A newly designed deep neural network model called A-YONet, which combines the advantages of YOLO and MTCNN, is proposed for deployment in an end-edge-cloud surveillance system to realize lightweight training and feature learning with limited computing resources. An intelligent detection algorithm is then developed based on an anchor-box pre-adjustment scheme and a multi-level feature fusion mechanism. Experiments and evaluations on two datasets, one public and one self-collected from a real surveillance system, demonstrate the effectiveness of the proposed method in enhancing training efficiency and detection precision, especially for multi-target detection in smart IoT application development.
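The abstract does not detail the anchor-box pre-adjustment scheme; a common way such anchors are pre-selected for YOLO-style detectors is k-means clustering over the ground-truth box dimensions of the target dataset, using 1 − IoU as the distance. The sketch below illustrates that generic approach on synthetic (width, height) data; the function names and the synthetic clusters are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs, assuming boxes and anchors share a
    # top-left corner, so only dimensions matter.
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    # k-means with 1 - IoU as the distance, used here to pre-select
    # anchor boxes from a dataset's ground-truth (w, h) statistics.
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor
        new = np.array([boxes[assign == i].mean(axis=0)
                        if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

if __name__ == "__main__":
    # Synthetic (w, h) samples mimicking small and large surveillance targets.
    rng = np.random.default_rng(1)
    small = rng.normal([20, 40], [3, 5], size=(200, 2))
    large = rng.normal([120, 180], [10, 15], size=(200, 2))
    boxes = np.abs(np.vstack([small, large]))
    print(np.round(kmeans_anchors(boxes, k=2), 1))
```

Anchors tuned to the dataset in this way give the detector priors already close to the true object scales, which is one plausible reason a pre-adjustment step can speed up training on surveillance footage with targets of very different sizes.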