Hardware Benchmark for Deep Learning Capability

  • Huynh Quang Nguyen Vo (Creator)



1. Introduction

These files contain a proposed benchmark implementation for evaluating whether a hardware setup is suitable for complex deep learning projects.

2. Scope

The benchmark evaluates the performance of a setup with a single CPU, a single GPU, RAM, and storage. The performance of multi-CPU, multi-GPU, or server-based setups is not included in our scope.
The benchmark is built on the Anaconda distribution of Python and the Jupyter Notebook computational environment. The deep learning models used in this benchmark are implemented using the Keras application programming interface (API).

Our goal is to develop a verified approach to hardware benchmarking that is quick and easy to use. To that end, we provide benchmarking programs as well as an installation guide for Anaconda and the packages required for deep learning.

3. Evaluation metrics

There are various metrics to benchmark the performance capabilities of a setup for deep learning purposes. Here, the following metrics are used:

Total execution time: the sum of the total training time and the total validation time of a deep learning model on a dataset over a defined number of epochs. Here, the number of epochs is 100. The lower the total execution time, the better.
Total inference time: the sum of the model loading time (the time required to fully load a set of pre-trained weights into a model) and the total prediction time of a deep learning model on a test dataset. As with the total execution time, the lower the total inference time, the better.
FLOPS: the performance capability of a CPU or GPU can be measured by counting the number of floating-point operations it can execute per second (FLOPS). The higher the FLOPS, the better.
Computing resource issues/errors: ideally, a better-performing setup will not encounter any computing resource issues or errors, including but not limited to the Out-Of-Memory (OOM) error.
Bottlenecking: put simply, bottlenecking is subpar performance caused by the inability of one component to keep up with the others, which slows down the overall ability of a setup to process data. Here, our primary concern is bottlenecking between the CPU and the GPU. The bottlenecking factor is measured using an online tool: Bottleneck Calculator
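
As an illustration of the FLOPS metric, the sketch below estimates achieved FLOPS from a measured execution time. It assumes the usual convention of counting 2 x M x N x K floating-point operations for an (MxN) by (NxK) matrix product; the function names gemm_flop_count, achieved_flops, and time_call are hypothetical and are not taken from the benchmark programs.

```python
import time


def gemm_flop_count(m: int, n: int, k: int) -> int:
    """Approximate operation count for an (m x n) @ (n x k) product.

    Assumption: one multiply and one add per accumulated term (2 * m * n * k).
    """
    return 2 * m * n * k


def achieved_flops(flop_count: int, seconds: float) -> float:
    """Floating-point operations per second for a measured run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds


def time_call(fn, *args, repeats: int = 3) -> float:
    """Best wall-clock time (in seconds) of fn(*args) over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, a (3072,128,1024) multiplication counts 2 x 3072 x 128 x 1024 = 805,306,368 operations under this convention, so a run that takes 0.5 seconds corresponds to roughly 1.6 GFLOPS.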

4. Methods

To evaluate hardware performance, two deep learning models are deployed for benchmarking purposes. The first model is a modified VGG19 based on a study by Deitsch et al. (Model A) [1], and the second is a modified concatenated model proposed in a study by Rahimzadeh et al. (Model B) [2]. These models were previously implemented in Vo [3]. The model compilation, training, and validation practices are similar to those described in Vo [3]. In addition, several optimizations, such as a mixed precision policy, are applied during training to make it run faster and consume less memory. The following datasets are used for benchmarking: the original MNIST dataset by LeCun et al. [6] and the Zalando MNIST dataset by Xiao et al. [7].
We also propose a second approach to benchmarking that is much simpler and quicker: measuring the total execution time of a combination of basic operations. These basic operations include General Matrix to Matrix Multiplication (GEMM), 2D convolution (Convolve2D), and Recurrent Neural Network (RNN) operations, which appear in almost all deep neural networks today [4]. We implemented this alternative approach based on the DeepBench work by Baidu [5]:

In dense matrix multiplication (DMM), we defined matrix C as the product of (MxN) and (NxK) matrices, written as (M,N,K). For example, (3072,128,1024) means the resulting matrix is the product of (3072x128) and (128x1024) matrices. To benchmark, we implemented five different multiplications and measured their overall total execution time. These multiplications were (3072,128,1024), (5124,9124,2560), (2560,64,2560), (7860,64,2560), and (1760,128,1760).
In sparse matrix multiplication (SMM), we defined matrix C as the product of (MxN) and (NxK) matrices, written as (M,N,K,D), where (100 - Dx100)% of the (MxN) matrix is omitted. For instance, (10752,1,3584,0.9) means the resulting matrix is the product of (10752x1) and (1x3584) matrices, with 10% of the (10752x1) matrix omitted. To benchmark, we implemented four different multiplications and measured their overall total execution time. These multiplications were (10752,1,3584,0.9), (7680,1500,2560,0.95), (7680,2,2560,0.95), and (7680,1,2560,0.95).
In Convolve2D, we defined a simple model containing only convolution layers and pooling layers and measured the resulting total execution time. The dataset used for training this model is the Zalando MNIST by Xiao et al.
We did not implement the RNN benchmark due to several issues caused by the new version of Keras.
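
The dense and sparse multiplications described above can be sketched with NumPy as follows. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, NumPy is an assumed dependency, and the sparse case only zeroes a random fraction of the left operand and still uses a dense product, whereas the actual benchmark may rely on a dedicated sparse kernel.

```python
import time

import numpy as np


def time_dense_matmul(m, n, k, seed=0):
    """Time one (m x n) @ (n x k) product; returns (result shape, seconds)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, n), dtype=np.float32)
    b = rng.standard_normal((n, k), dtype=np.float32)
    start = time.perf_counter()
    c = a @ b
    return c.shape, time.perf_counter() - start


def make_sparse_operand(m, n, omitted_fraction, seed=0):
    """Random (m x n) matrix with about omitted_fraction of its entries zeroed.

    Assumption: for a tuple (M,N,K,D), omitted_fraction corresponds to 1 - D.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, n), dtype=np.float32)
    keep = rng.random((m, n)) >= omitted_fraction
    return a * keep


def total_dense_time(configs):
    """Sum of execution times over a list of (m, n, k) tuples."""
    return sum(time_dense_matmul(m, n, k)[1] for m, n, k in configs)
```

A call such as total_dense_time([(3072, 128, 1024), (1760, 128, 1760)]) returns the combined time in seconds for those two multiplications. Larger shapes such as (5124,9124,2560) need correspondingly more memory and time.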

To evaluate the total inference time, we loaded the trained weights of our models that achieved the best validation accuracy (denoted as Model A-benchmarked and Model B-benchmarked, respectively) and ran a prediction pass on the Zalando MNIST test set. These files are available on Zenodo: Inference Models
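
A minimal sketch of how the two components of the total inference time could be measured separately is given below. The helper time_inference and the InferenceTiming type are hypothetical; the loading and prediction steps are passed in as callables, so that with Keras they could wrap, for example, keras.models.load_model and model.predict.

```python
import time
from typing import Any, Callable, NamedTuple


class InferenceTiming(NamedTuple):
    load_seconds: float      # time to load the pre-trained weights/model
    predict_seconds: float   # time to run prediction on the test set

    @property
    def total_seconds(self) -> float:
        """Total inference time as defined above: loading plus prediction."""
        return self.load_seconds + self.predict_seconds


def time_inference(
    load_model: Callable[[], Any],
    predict: Callable[[Any], Any],
) -> InferenceTiming:
    """Time model loading and prediction as two separate phases."""
    start = time.perf_counter()
    model = load_model()
    loaded = time.perf_counter()
    predict(model)
    done = time.perf_counter()
    return InferenceTiming(loaded - start, done - loaded)
```

Keeping the two phases separate makes it possible to tell whether a slow result comes from reading the weights from storage or from the prediction run itself.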

5. References

[1] S. Deitsch, V. Christlein, S. Berger, C. Buerhop-Lutz, A. Maier, F. Gallwitz, and C. Riess, “Automatic classification of defective photovoltaic module cells in electroluminescence images,” Solar Energy, vol. 185, pp. 455–468, Jun. 2019.
[2] M. Rahimzadeh and A. Attar, “A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of Xception and ResNet50V2,” Informatics in Medicine Unlocked, vol. 19, p. 100360, 2020.
[3] H. Vo, “Realization and Verification of Deep Learning Models for Fault Detection and Diagnosis of Photovoltaic Modules,” Master’s Thesis, Aalto University, School of Electrical Engineering, 2021.
[4] P. Warden, "Why GEMM is at the heart of deep learning," Pete Warden's Blog, 2015. Available at: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
[5] Baidu Research, "Benchmarking Deep Learning operations on different hardware". Available at: https://github.com/baidu-research/DeepBench
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[7] H. Xiao, K. Rasul, and R. Vollgraf, “A Novel Image Dataset for Benchmarking Machine Learning Algorithms,” 2017. Available at: https://github.com/zalandoresearch/fashion-mnist
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[9] F. Chollet, “Keras,” 2015. Available at: https://github.com/fchollet/keras
[10] ML Commons. Available at: https://mlcommons.org/en/
[11] W. Dai and D. Berleant, “Benchmarking contemporary deep learning hardware and frameworks: A survey of qualitative metrics,” 2019 IEEE First International Conference on Cognitive Machine Intelligence (CogMI), Dec 2019.

Date made available: 2021

Dataset Licences

  • CC-BY-4.0
