This is the first in a two-part series. Read Part 2 here.


Training Performance of Convolutional Networks with CPU vs GPU

Introduction

In the technology community, especially in IT, many of us are constantly looking for ways to expand our knowledge and develop our skills. After researching Deep Learning through books, videos, and online articles, I decided that the natural next step was to gain hands-on experience.

I started with Venkatesh’s tutorial on building an image classification model that uses a Convolutional Neural Network (CNN) to classify cat and dog images. The “cat and dog image classification” problem is considered by some to be the “Hello World” of convolutional and Deep Learning networks. However, with Deep Learning, there is a lot more involved than simply printing “Hello World” in a programming language.

Figure 1: Original “Cat & Dog” Image Classification Convolutional Neural Network

The tutorial code is written in Python and runs on Keras. I chose a Keras example because of the simplicity of its API. I was able to get the system up and running relatively smoothly after installing Python and the necessary libraries (such as TensorFlow). However, I quickly realized that running the code on a VirtualBox VM (virtual machine) on my workstation was painfully slow and inefficient. For example, it took ~90 minutes to process a single epoch (i.e., 8000 steps at 32 images per step), and the default 25 epochs required to train the network took more than a day and a half. The sheer amount of time needed just to see the effect of minor changes would make this kind of experimentation far too cumbersome to be useful.
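For context, the model in question is a small Keras CNN along these lines. This is a minimal sketch in the spirit of the tutorial, not the exact tutorial code; the directory paths and layer sizes here are placeholders.

# Minimal sketch of a tutorial-style Keras CNN for cat/dog classification.
# Paths and layer sizes are illustrative placeholders, not the exact tutorial code.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Image augmentation and generators (directory paths are placeholders).
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2,
                                   zoom_range=0.2, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size=(64, 64),
                                                 batch_size=32,
                                                 class_mode='binary')
test_set = test_datagen.flow_from_directory('dataset/test_set',
                                            target_size=(64, 64),
                                            batch_size=32,
                                            class_mode='binary')

# 8000 steps x 32 images per step, 25 epochs -- the default settings discussed above.
model.fit_generator(training_set,
                    steps_per_epoch=8000,
                    epochs=25,
                    validation_data=test_set,
                    validation_steps=2000)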

I began to think about how I could improve this process. After more research, I realized that a powerful GPU could be the solution I was after.

The opportunity to test such a GPU arose when I received a review unit of NVIDIA’s powerful new Tesla V100 GPU, which currently carries a slightly eye-watering $9,000 price tag. After two weeks with the GPU, I learned many things: some expected, some entirely unexpected. I decided to write two blog posts to share what I learned, in the hope that they can help others who are starting their journey into Deep Learning and are curious what a GPU can do for them. Even more exciting, this particular GPU is available as a robust option for the phoenixNAP Dedicated Server product lines.

In this first entry, I focus on the training performance of the convolutional network, including observations and comparisons of processing and training speeds with and without a GPU. Specifically, I compare the performance of the CIFAR-10 and “cat and dog” image classification convolutional networks on a VirtualBox VM on my workstation, on a dedicated bare metal server without a GPU, and on a machine with the Tesla V100 GPU.

After improving the processing and training speeds of the network, I worked on improving its validation accuracy. In the second blog, I share the changes I made to Venkatesh’s model that raise the validation accuracy of the CNN from ~80% to 94%.

Both blogs assume that readers have foundational knowledge of neural network and Deep Learning terminology, such as validation accuracy, convolution layer, etc. Much of the content will be clearer if you have attempted Venkatesh’s tutorial or a similar one.

Observations on Performance & GPU Utilization

Experiment with the workers and batch_size parameters

Whether the code is running on a VirtualBox VM, on bare metal with only a CPU, or on a system with a GPU, changing these two parameters in the Keras code can have a significant impact on training speed. For example, on the VirtualBox VM, increasing workers to 4 (the default is 1) and batch_size to 64 (the default is 32) improves the processing and training speed from 47 images/sec to 64 images/sec. With the GPU, the gain in training speed is roughly 3x after adjusting these parameters from their default values.
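Continuing from the sketch above, these are the knobs being adjusted. The values shown are the ones mentioned in this section; halving steps_per_epoch is my own assumption, made so that each epoch still covers roughly the same number of images when the batch size doubles.

# batch_size is set on the generator; workers and max_queue_size on fit_generator.
training_set = train_datagen.flow_from_directory('dataset/training_set',
                                                 target_size=(64, 64),
                                                 batch_size=64,   # default is 32
                                                 class_mode='binary')

model.fit_generator(training_set,
                    steps_per_epoch=4000,   # 8000 * 32 / 64, so ~256k images per epoch
                    epochs=25,
                    workers=4,              # default is 1
                    max_queue_size=10,      # default; also worth experimenting with
                    use_multiprocessing=False)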

For a small network, a GPU is hardly utilized

I was quick to realize that maximizing GPU utilization is a challenge. With the original “cat and dog” image classification network, GPU utilization hovers between 0 and 10%. CPU utilization also hovers at roughly 10%. Experimenting with different parameters such as workers, batch_size, and max_queue_size, and even storing the images on a RAM disk, did not make a significant difference to GPU utilization or training speed. After additional research, I learned that the bottleneck is the input pipeline (i.e., reading, decompressing, and augmenting the images before they reach the network), which is handled by the CPU.

Nevertheless, the system with a GPU still delivers 4x higher processing and training speeds than the bare metal hardware without a GPU (see the Training Speed Comparisons section below).

Figure 2: Low GPU Utilization on the original Cat & Dog CNN

GPUs shine when put to work on deep networks

After increasing the complexity of the “cat and dog” network (e.g., by increasing its depth), which improved the validation accuracy from 80% to 94%, GPU utilization rose to about 50%. On the improved (more accurate) network, the image processing and training speed decreased by ~20% on the GPU, but dropped by ~85% on the CPU. For this network, processing and training are about 23x faster on the GPU than on the CPU.

Figure 3: GPU Utilization on Improved Cat & Dog CNN

For experimental purposes, I created an (unnecessarily) deep network by adding 30+ convolutional layers. With it, I was able to max out GPU utilization at 100% (note the temperature and wattage in the NVIDIA-SMI output). Interestingly, the processing and training speeds on the GPU stay about the same as for the improved “cat and dog” network. The CPU, on the other hand, can only process about three images/sec on this deep network, which is roughly 100 times slower than the GPU.
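As a rough illustration of how such a stack can be thrown together, the sketch below piles on convolution layers in a loop. The filter counts and the exact layer count are arbitrary choices for illustration, not the precise network I ran.

# Sketch: an (unnecessarily) deep CNN built purely to stress the GPU.
# 'same' padding keeps the spatial size, so the conv stack can be arbitrarily deep.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

deep_model = Sequential()
deep_model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                      input_shape=(64, 64, 3)))
for _ in range(30):
    deep_model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
deep_model.add(MaxPooling2D(pool_size=(2, 2)))
deep_model.add(Flatten())
deep_model.add(Dense(128, activation='relu'))
deep_model.add(Dense(1, activation='sigmoid'))
deep_model.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])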

Figure 4: GPU Utilization on the Deep CNN

Training Speed Comparisons

CIFAR-10

The CIFAR-10 dataset is a commonly used image dataset for training machine learning models. I ran the CIFAR-10 model with the images downloaded from GitHub. The default batch_size is 128, and I experimented with different values with and without a GPU. On the Tesla V100 with batch_size 512, I was able to get around 15k to 17k examples/sec, with GPU utilization steady at ~45%. This is a very respectable result compared to the numbers published by Andriy Lazorenko here. Using the same batch_size on bare metal hardware with dual Intel Xeon Silver 4110 CPUs (16 cores total) and 128GB RAM, I was only able to get about 210 examples/sec, even with AVX2-compiled TensorFlow binaries. On the VirtualBox VM, I got about 90 examples/sec.
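The benchmark code I ran came from GitHub, but the same kind of throughput experiment can be sketched directly in Keras using its built-in CIFAR-10 loader. The small model below is only a stand-in to show how batch_size is varied and examples/sec measured; it is not the benchmark code itself.

# Sketch: measuring CIFAR-10 training throughput (examples/sec) at a given batch size.
# The model is a stand-in, not the GitHub benchmark referenced above.
import time
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), _ = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

batch_size = 512  # try 128 (default), 256, 512, ...
start = time.time()
# A single epoch includes warm-up overhead; run a few epochs for a steadier number.
model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
elapsed = time.time() - start
print('~%.0f examples/sec at batch_size=%d' % (len(x_train) / elapsed, batch_size))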

Figure 5: CIFAR-10 Output from Tesla V100
Figure 6: CIFAR-10 Training Speeds from VM, Bare Metal with & without GPU

Cat & Dog Image Classification Networks

The chart below shows the processing and training speeds of the different “cat and dog” networks on the different systems. The parameters for each system (e.g., workers, batch_size) were tweaked from their default values to maximize performance. The performance gains from using a powerful GPU such as the V100 become more apparent as the networks grow deeper and more complex.

Figure 7: Training Speeds of CNNs from VMs, Bare Metal with & without GPU

Conclusions

Having a powerful GPU to train Deep Learning networks is highly beneficial, especially if one is serious about improving the accuracy of the model. Without the significant increase in training speed that GPUs provide, which can be on the order of 100x or more, one would have to wait an inordinate amount of time to observe the outcome of experimenting with different network architectures and parameters. That would essentially render the process impractical.

This is the first in a two-part series. Read Part 2 here.

Ready for a GPU to Enhance Your Workflow?

Contact phoenixNAP today.
