This tutorial shows you how to train, prune, and quantize a custom convolutional neural network (CNN) with the CIFAR10 dataset. You will use the Caffe framework and Xilinx® DNNDK tools on a ZCU102 target board.
The CIFAR10 dataset is composed of 10 classes of objects to be classified. It contains 60000 labeled RGB images that are 32x32 in size. The images are organized in three databases:
- train_lmdb: 50000 images in the LMDB database for the forward/backward training process.
- valid_lmdb: 9000 images in the LMDB database for the validation step during the training process.
- test: 1000 images in plain JPEG format for the top-1 prediction measurements after the CNN has been trained.
The last two datasets are created from the 10000 images of the original CIFAR10 testing dataset. All the images are randomly shuffled before forming the databases.
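The split of the original testing dataset can be sketched as follows. This is a minimal illustration in plain Python 3, not the code of the tutorial scripts: the function name, the fixed seed, and the use of index lists are assumptions made for the example.

```python
import random

def split_cifar10_test(num_test_images=10000, num_valid=9000, seed=0):
    """Shuffle the indices of the original test set and split them in two."""
    indices = list(range(num_test_images))
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(indices)
    valid_idx = indices[:num_valid]    # images that would go to valid_lmdb
    test_idx = indices[num_valid:]     # images that would be kept as plain JPEG files
    return valid_idx, test_idx

valid_idx, test_idx = split_cifar10_test()
print(len(valid_idx), len(test_idx))
```

Because the two index lists are disjoint, no image used for validation during training is reused for the final top-1 measurement.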
MiniVggNet and miniGoogleNet are custom CNNs described in the Starter Bundle book by Dr. Adrian Rosebrock from PyImageSearch. They were originally modeled and trained in Keras/TensorFlow, and were then manually translated into .prototxt files and trained from scratch with Caffe.
MiniVggNet has all the features of the original VGG16 CNN, but it is less deep. The CIFAR10 dataset is much smaller than the ImageNet dataset that was used to train the original VGG16 network; the same applies to miniGoogleNet when compared with GoogleNet Inception v1. The way the layers are organized has also been changed, because the Xilinx DPU does not support ReLU before the BatchNorm (BN) layer at the time of writing. The number of BN and DROPOUT layers has also been reduced.
In this tutorial, the flow is only fully described using the miniVggNet example. MiniGoogleNet applies the same procedure, but only the final results are summarized.
Links to reference articles are available in this complementary PDF: CIFAR10_0_Introduction.
📌 Note: The Xilinx pruning tool requires a license fee and is therefore not included in this tutorial, although all the shell scripts to prune the CNNs and related log files are available here for miniVggNet and here for miniGoogleNet.
📌 Note: The PDF slides are only a visual aid. They are not updated as frequently as the markdown (md) files.
By the end of this tutorial, you will understand how to train a custom CNN in Caffe from scratch with the CIFAR10 dataset. You will also learn how to run it on the ZCU102 board after quantization (and optionally pruning) with the Xilinx DNNDK tools. This is accomplished by performing the steps listed below:
- Use a powerful set of Python scripts to do the following:
  - Create the databases for training, validation, testing, and calibration.
  - Train the CNN using Caffe and generate the .caffemodel file of floating point weights.
  - Use the trained network to make predictions from images and compute the top-1 accuracy with the .caffemodel weights file.
  - Plot the learning curves of your training process and the CNN block diagram.
- Use the Xilinx DNNDK tools to quantize the floating point weights of your original .caffemodel file (normally called the baseline).
- Compile and run the application on the ZCU102 target board to measure the effective frame rate.
- Measure the effective average top-1 accuracy you get during run-time execution on the target board.
- Optionally, prune the CNN so that there are fewer operations to be computed, thus increasing the effective frame rate achievable. It is possible to do this without detriment to the top-1 accuracy. After pruning, you need to run quantization on the pruned .caffemodel file before compiling and running the application on the ZCU102 board.
This tutorial assumes you are using a PC running Ubuntu 16.04 Linux with Python 2.7 and its virtual environments, Caffe BVLC, and Keras on the TensorFlow backend. The PC must have a CUDA-compatible GPU card and the following libraries: CUDA 8.0 or 9.0, and cuDNN 7.0.5. Alternatively, you can use a p2.xlarge instance of Deep Learning Base AMI Ubuntu version 15 from AWS. Many of the Ubuntu packages required by ML tools are also required by the Xilinx SDx tools, so you would have to install them anyway.
After downloading and uncompressing the tutorial archive, Edge-AI-Platform-Tutorials.zip, move the docs/ML-CIFAR10-Caffe subfolder to the $HOME/ML directory and rename it to cifar10, using instructions similar to those in the example below:
cd $HOME #start from here
unzip Edge-AI-Platform-Tutorials.zip #uncompress
cd Edge-AI-Platform-Tutorials/docs #go to subfolder "docs"
mv ML-CIFAR10-Caffe cifar10 # rename the folder "ML-CIFAR10-Caffe" into "cifar10"
mkdir $HOME/ML #create this directory if it does not exist yet
mv cifar10 $HOME/ML # move the "cifar10" folder below the $HOME/ML folder
Download Caffe from the Caffe GitHub repository. This page describes the Caffe installation process. I have my own way to install it (including also how to install the CUDA libraries for compatible GPUs) and I can share if you contact me directly by email.
The CIFAR10_1_Caffe-Background.pdf document provides some basic information about the Caffe ML design environment.
The official online Caffe tutorial is available here.
For this tutorial, use the following release of the DNNDK tools:
You can also target alternative boards such as the following:
The DNNDK User Guide is available here:
If you use previous releases, you should get the same top-1 accuracy results to within 2-3%. A larger difference implies that something is wrong in the environment.
Kester Aernoudt has created a Docker image, which you can download here. The image has all the ML tools needed to run this tutorial pre-installed. This is likely the simplest way for you to get the tools up and running.
📌 Note: To use this Docker image, you need to have the Nvidia Docker Tool and an NVIDIA GPU card installed.
A good reason to use AWS is that most of the ML SW tools are already preinstalled on the Deep Learning Base AMI Ubuntu (version 15 and above), shown in the below screenshot. You only need to install Keras on the TensorFlow backend.

If you choose a p2.xlarge EC2 instance, as illustrated in the following screenshot, you get a K80 NVIDIA GPU, which is more than enough to run this tutorial:

Installing the Xilinx DNNDK tools on the AWS is straightforward. You only need to copy the tools there.
- Uncompress the original xlnx_dnndk_v2.08.tar.gz archive and place it in a folder named ~/DNNDK/xlnx_dnndk_v2.08 on your local PC (Windows or Linux OS; it does not matter).
- Execute the following commands:
cd ~/DNNDK/xlnx_dnndk_v2.08_beta
cd host_x86/pkgs/ubuntu16.04
mkdir cuda9_tools
cp dnnc-dpu* ./cuda9_tools
cp cuda_9.0_cudnn_v7.0.5/decent ./cuda9_tools
tar -cvf xlnx_host_208tools_cuda9.tar ./cuda9_tools
gzip xlnx_host_208tools_cuda9.tar
- Copy the xlnx_host_208tools_cuda9.tar.gz archive from your local PC to your $HOME in the AWS p2.xlarge instance. Do the same for the archive Edge-AI-Platform-Tutorials.zip.
- When both files are there, you can launch the set_aws_ML_env_cuda9.sh script, which creates the system of directories required by this tutorial. To avoid wasting storage space on the AWS, soft links are heavily used. Note also that the script installs Keras on the tensorflow_p27 environment (which you activate with the command source activate tensorflow_p27). Execute the following commands on your AWS instance:
cd $HOME
unzip Edge-AI-Platform-Tutorials.zip
cd Edge-AI-Platform-Tutorials/docs
mv ML-CIFAR10-Caffe cifar10
mv CATSvsDOGS cats-vs-dogs
mkdir $HOME/ML
mv cifar10 $HOME/ML
mv cats-vs-dogs $HOME/ML
cd $HOME
source $HOME/ML/cifar10/aws_scripts/set_aws_ML_env_cuda9.sh
# answer YES to possible questions like "Proceed?"
At the end of this process, if you list all the directories newly created, you should see what is reported in the following screenshot:

📌 Note: This screenshot also lists the DNNDK pruning tool (deephi_compress), which is not included in this tutorial because it requires a license fee.
In Caffe, .prototxt files cannot use Linux environment variables, only relative pathnames. This project therefore assumes the following fixed directories:
- $HOME/ML/cifar10/caffe is the working directory (variable $WORK_DIR).
- $CAFFE_ROOT is where the Caffe tool is installed.
- $HOME/ML/DNNDK/cuda9_tools is where the DNNDK decent and dnnc tools are placed.
If you use AWS Ubuntu AMI, the set_aws_ML_env_cuda9.sh script creates the directory structure for the whole project, assuming the following values:
- $HOME stands for /home/ubuntu.
- $CAFFE_ROOT stands for /home/ubuntu/caffe_tools/BVLC1v0-Caffe, which is soft-linked to /home/ubuntu/src/caffe_python2.
If you use your own Ubuntu PC instead of the AWS:
- $HOME is your home directory, abbreviated as ~ in Linux (for example, /home/danieleb).
- $CAFFE_ROOT is where you have placed the Caffe tool (for example, /home/danieleb/caffe_tools/BVLC1v0-Caffe).
It is recommended not to deviate from the above structure. The alternative is changing all the pathnames in the shell and Python scripts as well as in the .prototxt files. This process has a higher risk of error.
Furthermore, to correctly use the pathnames adopted in the .prototxt files, all the *.sh shell scripts must be launched from the $HOME/ML/ directory.
Finally, you might sometimes need to pre-process the *.sh shell scripts with the dos2unix utility before executing them.
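If dos2unix is not installed, the same line-ending conversion can be approximated in a few lines of Python 3. This is only a sketch of the core behavior (CRLF to LF); the function names are hypothetical and the real utility handles more cases.

```python
def crlf_to_lf(data):
    """Replace Windows CRLF line endings with Unix LF line endings."""
    return data.replace(b"\r\n", b"\n")

def convert_file(path):
    """Rewrite a shell script in place with LF line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(crlf_to_lf(data))

# Example on an in-memory script fragment
print(crlf_to_lf(b"cd $HOME/ML\r\nls\r\n"))
```

Working on bytes rather than text avoids any assumption about the encoding of the script.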
The installation procedure and flow for the DNNDK tools are described in the DNNDK User Guide (UG1327).
The tools are available only for Ubuntu 16.04, not for any other Linux distribution. They are split into two parts: one runs on the Ubuntu Linux host PC, and the other runs on the Ubuntu Linux filesystem of the target board (the ZCU102 in this case), which must be connected to your local PC through WLAN and UART cables.
📌 Note: The local PC connected to the board does not need to run Ubuntu 16.04; it can also run Windows. You only need that PC to communicate with the target board. The host PC, on the other hand, must run Ubuntu 16.04, whether it is your local PC or the AWS remote server.
The Xilinx DNNDK tools on the host are independent of the Vivado® Design Suite: they are needed only for pruning, quantization, and compilation of the CNN. The three tools in question are decent, dnnc, and deephi_compress. This last one is not included in this tutorial because it requires a license fee, but log files are provided to illustrate how it works.
The AI inference hardware accelerator running on the ZU9 device of the ZCU102 target board is referred to as the deep processor unit (DPU). The CNN to be launched on the FPGA device has kernels running in software on the Arm™ CPU, and other kernels running in hardware on the DPU itself.
The DPU can be considered as a programmable co-processor of the Arm CPU, where the original CNN has been transformed into an .ELF file, which is in fact the software program for the co-processor. The same DPU architecture can therefore run multiple different types of CNN, each one represented by a different .ELF file.
In summary, the Xilinx DNNDK tools transform the Caffe .prototxt description and .caffemodel binary weights files (modeling the CNN under study) into .ELF files running on the CPU/DPU embedded system.
The key steps of the quantization flow are as follows:
1. The decent tool does the quantization. It needs two input files, namely float.prototxt and float.caffemodel. The first file is the .prototxt text description of the CNN layers, and the second file contains the 32-bit floating point weights you get after training your CNN with Caffe and the database of images. The decent tool generates two output files, deploy.prototxt and deploy.caffemodel, which are then input to the dnnc compiler. What decent does is more or less independent of the DPU hardware architecture; this is why it runs on the host PC.
2. The dnnc tool takes those two output files and analyzes them to see if there are inconsistencies or layers not supported by the architecture of the Xilinx DPU accelerator IP core. It then generates either .ELF files or error messages. It can also merge and fuse layers together to optimize the implementation.
3. When this is done, your work on the host PC is finished and you can move to the target board. Assuming that you have transferred the .ELF files generated in the step above using ssh/scp, you now use the tools on the target board to compile the C++ application which uses the DPU hardware accelerator. Open a terminal (using PuTTY or Tera Term) on the ZCU102 board and compile the application there. The application is a hybrid: one part runs on the Arm CPU (SoftMax and top-k accuracies, plus the loading of the input .mp4 movie or the images) and the other part runs on the DPU (the convolutional layers and so on).
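To give an intuition of what quantizing floating point weights means, here is a minimal Python 3 sketch of fixed-point quantization. It assumes 8-bit signed values and a power-of-two scale chosen from the largest absolute weight; both choices are assumptions made for illustration and are not a description of the actual algorithm inside decent.

```python
import math

def quantize_pow2(weights, bits=8):
    """Quantize floats to signed fixed point with a power-of-two scale (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1           # 127 for 8 bits
    qmin = -(2 ** (bits - 1))            # -128 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0] * len(weights), 0
    # Largest number of fractional bits for which max_abs still fits below qmax
    frac_bits = math.floor(math.log2(qmax / max_abs))
    scale = 2.0 ** frac_bits
    q = [min(qmax, max(qmin, round(w * scale))) for w in weights]
    return q, frac_bits

def dequantize(q, frac_bits):
    """Map the integers back to floats to inspect the quantization error."""
    return [v / 2.0 ** frac_bits for v in q]

q, frac_bits = quantize_pow2([0.5, -0.25, 0.1])
print(q, frac_bits, dequantize(q, frac_bits))
```

The rounding error of each weight is bounded by half of one quantization step, which is why a calibration set is still needed in practice to check that the accumulated error does not hurt the top-1 accuracy.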
The deephi_compress tool performs pruning, which is an optional process. You are not obliged to prune your CNN; an unpruned CNN is referred to as a baseline CNN. If you do prune it, you must iteratively apply the same training process through which you got the original float.prototxt and float.caffemodel files.
Pruning is solely an optimization technique and does not affect the hardware architecture of the DPU. In the case of miniVggNet in this tutorial, the number of parameters of the original CNN is ~2M, but after pruning it is only ~70000. The pruned CNN is less heavy and therefore performs with a higher frame rate on the architecture of the DPU IP core, because the core has fewer operations to compute.
At the end of the pruning process, you get a new set of float.prototxt and float.caffemodel files that now require far fewer operations to be completed (in other words, the CNN has been compressed). Repeat the quantization steps (1-3) described above to launch your pruned CNN on the CPU/DPU of the ZCU102 board.
The cifar10 project is organized in the following subdirectories (placed under $HOME/ML/cifar10):
- caffe/code contains all the Python 2.7 scripts.
- caffe/models contains the solver, training, and deploy .prototxt files for miniVggNet and other CNNs.
- caffe/rpt contains log files captured for your reference.
- deephi contains the files for quantization of either the baseline (quantiz) or pruned (pruning) CNN, plus the files for ZCU102 run-time execution (zcu102/baseline, zcu102/pruned, and zcu102/test_images respectively).
- input contains the following:
  - LMDB databases for the Caffe phases of training and validation.
  - JPEG images for testing the top-1 accuracy.
  - Other JPEG images for DNNDK calibration during the quantization process.
The Python scripts that compose the ML design flow in Caffe are listed in order of execution below. They enable you to create the datasets, train your CNN with a training and validation LMDB database, and finally make predictions on JPEG images.
- 1_write_cifar10_images.py: This script downloads the dataset from the keras.datasets module and stores it in JPEG format in the input/cifar10_jpg folder, with subfolders test, train, val and calib. The calib folder is needed only for quantization with Xilinx DNNDK. This is the only script which requires Keras in this project, and you only need to execute it once.
- 2a_create_lmdb.py: This script creates the LMDB databases input/lmdb/train_lmdb and input/lmdb/valid_lmdb for the training step. You only need to execute it once.
- 2b_compute_mean.py: This script computes the mean values for the train_lmdb database in input/mean.binaryproto. You only need to execute it once.

  📌 Important: You cannot run this Python script on the AWS (Caffe has not been compiled with OpenCV there), so you have to comment it out from the aws_caffe_flow_miniVggNet.sh and aws_caffe_flow_miniGoogleNet.sh scripts. It is recommended to add the three mean values (125, 123, and 114 respectively) directly into the .prototxt model file. Furthermore, the DNNDK dnnc tool does not support reading the input/mean.binaryproto file.
- 3_read_lmdb.py: This script can be used to debug the first two scripts.
- 4_training.py: This script launches the real training process in Caffe, given certain solver and CNN description .prototxt files. Use it for every training trial.
- 5_plot_learning_curve.py and plot_training_log.py: These scripts are to be launched at the end of the training to plot the learning curves of accuracy and loss in different ways.
- 6_make_predictions.py: This script is launched at the end of the training to measure the prediction accuracy achieved by the CNN you have trained. You need to have the scikit (classic ML) library installed.
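The per-channel mean values mentioned for 2b_compute_mean.py can be understood with the following Python 3 sketch. It is not the tutorial script: images are represented here as plain lists of three-component pixel tuples, the function names are hypothetical, and the channel order is simply whatever order the pixels are stored in.

```python
def channel_means(images):
    """Per-channel mean over all pixels of all images.

    Each image is a list of (c0, c1, c2) pixel tuples.
    """
    sums = [0.0, 0.0, 0.0]
    count = 0
    for img in images:
        for px in img:
            for c in range(3):
                sums[c] += px[c]
            count += 1
    return [s / count for s in sums]

def subtract_mean(img, means):
    """Center an image by subtracting the per-channel means."""
    return [tuple(px[c] - means[c] for c in range(3)) for px in img]

# Toy example with one image of two pixels
toy_images = [[(10, 20, 30), (30, 40, 50)]]
means = channel_means(toy_images)
print(means, subtract_mean(toy_images[0], means))
```

Writing three scalar means into the model description, instead of a binary mean file, gives the same centering when the mean is constant per channel.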
Another script, check_dpu_runtime_accuracy.py, is also available, but this script is not part of the Caffe design flow. It should be launched only when the CNN is running on the ZCU102 board, to compute the effective top-1 accuracy of the DPU at run time. Doing this allows you to make a fair comparison between the top-1 accuracy values when simulated by the .caffemodel in Caffe on the host PC, when estimated by DNNDK decent on the host PC, and when measured at run time on the target ZCU102 board.
If you are working on a local Ubuntu PC and have both Caffe and Keras in the same Python virtual environment, all the Python scripts can be orchestrated in the caffe_flow_miniVggNet.sh shell script. Normally, the first four scripts can be commented out after you have executed them the first time. The commands to be launched are similar to those in the example below:
cd $HOME/ML
source cifar10/caffe/caffe_flow_miniVggNet.sh 2>&1 | tee cifar10/caffe/models/miniVggNet/m3/logfile_3_miniVggNet.txt
If you are working on the AWS, the shell script needs some changes due to the different environment. Apply the aws_caffe_flow_miniVggNet.sh script instead, with similar commands:
cd $HOME/ML/cifar10
source caffe/aws_caffe_flow_miniVggNet.sh 2>&1 | tee caffe/models/miniVggNet/m3/aws_logfile_3_miniVggNet.txt
The scripts depend on the relative pathnames used in the .prototxt files. Do not launch the scripts in a directory different from $HOME/ML, or they will fail.
📌 Note: In a Linux shell, you can comment out a set of lines by surrounding them with the following symbols: : ' before the first line to be commented and ' after the last line to be commented.
To describe the CNN in Caffe, you need a .prototxt text file which shows the type of layers and how they are connected, plus some specific features to be activated only during the training or validation phases, indicated as TRAIN and TEST respectively. You also need to set the batch_size for the TRAIN and TEST phases (128 and 50 respectively). The TEST value is obtained by dividing the number of validation images (9000) by test_iter (180). During the TRAIN phase, all the parameters of the CNN are updated with Stochastic Gradient Descent (SGD) once every batch_size images.
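The relation between these numbers can be checked with a few lines of Python 3. The figures come from this tutorial (50000 training images, 9000 validation images, and the 40000 training iterations used for miniVggNet); the variable names are only illustrative.

```python
num_train_images = 50000
num_valid_images = 9000
train_batch_size = 128
test_iter = 180
max_iter = 40000

# One validation pass must cover all validation images exactly once
test_batch_size = num_valid_images // test_iter
assert test_batch_size * test_iter == num_valid_images

# Number of SGD updates needed to see the whole training set once
iters_per_epoch = num_train_images / train_batch_size

# Approximate number of passes over the training set during the whole training
epochs = max_iter / iters_per_epoch

print(test_batch_size, iters_per_epoch, epochs)
```

If you change the number of validation images or test_iter, keep their ratio equal to the TEST batch_size so that each validation pass still covers the full validation set.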
The model giving the best top-1 prediction results in previous experiments is train_val_3_miniVggNet.prototxt. Associated with it, you also have the deploy_3_miniVggNet.prototxt, which is needed to compute the prediction accuracy on the 1000 images in the TEST subfolder (the same that will be used at run time on the ZCU102).
There is another model, deephi_train_val_3_miniVggNet.prototxt, which is applied during the quantization process of the baseline CNN. It is exactly the same as train_val_3_miniVggNet.prototxt, but the LMDB database of the TRAIN phase has been replaced by the calibration images, and top-5 accuracy layer is added at the bottom of the CNN description.
In Caffe, the solver file defines the optimization method (for example, SGD, Adam, or Nesterov), the number of iterations, and the policy for changing the learning rate over the iterations. It also specifies whether a CPU or a GPU is used for computation.
The solver file is named solver_3_miniVggNet.prototxt, and contains the settings for the training of the miniVggNet model that have proved to be optimal. 40000 iterations were used here, but results are relatively stable after only 20000 iterations.
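As an illustration of a learning rate policy, the sketch below reproduces the formula of the Caffe "step" policy in Python 3: the rate is multiplied by gamma every stepsize iterations. The numeric values are hypothetical and are not taken from solver_3_miniVggNet.prototxt, which may use a different policy.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=10000):
    """Learning rate under a "step" policy: base_lr * gamma ^ floor(iteration / stepsize).

    base_lr, gamma, and stepsize are hypothetical example values.
    """
    return base_lr * gamma ** (iteration // stepsize)

for it in (0, 9999, 10000, 25000, 39999):
    print(it, step_lr(it))
```

Plotting such a schedule next to the learning curves helps to explain sudden changes in the training loss.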
Saved post-training output results are placed in the rpt subfolder for your reference. They include the following:
- PNG image files containing the block diagram of the CNN and how the learning curves change during the training process.
- The log file of the Caffe training process, from which all the above .PNG files are generated.
- The log file of the top-1 predictions computed on the 1000 images of the testing dataset.
The following screenshot illustrates the end of the training on the AWS. Note the ~86% top-1 average accuracy computed on the validation dataset.

The following screenshot shows the predictions on the 1000 test images. Note the ~87% top-1 average accuracy.

After training is executed, the top-1 prediction accuracy is 87% on average. Top-1 values from 84% to 88% are allowed: GPUs have varying random states, and you might not achieve exactly the same numerical results.
The elapsed times for the training in Caffe with 40000 iterations, on NVIDIA GPU cards with different amounts of memory, are listed below:
- P6000 (24 GB): 13 min
- Tesla K80 (12 GB): 45 min
- GTX1080 (8 GB): 48 min
- K1000M (2 GB): 975 min
You do not need to be in a Python virtual environment to launch the quantization process. You only need to ensure that the tools are in your PATH and LD_LIBRARY_PATH environment variables. For this purpose, if you are in your AWS AMI, you can launch the aws_activate_dnndk_cuda9.sh script:
source ~/ML/cifar10/aws_scripts/aws_activate_dnndk_cuda9.sh
The decent tool needs the following inputs:
- float.prototxt: This is the description text file of the floating point CNN model.
- float.caffemodel: This is the file with the pre-trained weights of the CNN in floating point.
- calibration dataset: This is a subset of the images used in the original training, containing about 1000 pictures in this case study.
When the quantization is done, two output files are generated. These become the inputs to the dnnc compiler:
- deploy.prototxt: This is the new description text file of the quantized CNN model.
- deploy.caffemodel: This is the file with fixed point quantized weights (this is not a standard Caffe format).
Preparing the input .prototxt files requires the following steps.
1. Take the weights file generated after the Caffe training process (snapshot_3_miniVggNet__iter_40000.caffemodel), and rename it to float.caffemodel.
2. Take the description file used in the Caffe training process (train_val_3_miniVggNet.prototxt), and rename it to float.prototxt.
3. Make the following further changes to the float.prototxt file:
   - Remove the Data type layers for the original TRAIN phase.
   - Add an ImageData type layer with the calibration images for the new TRAIN phase.
   - On the bottom, add two Accuracy type layers to compute top-1 and top-5 accuracies.
   - Remove the mean file and put separate values (the DPU does not support reading a binary mean file).
For your reference, the changes detailed in step 3 have already been made in deephi_train_val_3_miniVggNet.prototxt, while steps 1 and 2 are already done in the decent_miniVggNet.sh shell script.
1. Compress the neural network model (from the host side) using decent.
2. Compile the neural network model (from the host side) using dnnc. See below for a command line example for the miniVggNet case (using the decent_miniVggNet.sh and dnnc_miniVggNet.sh scripts):

    cd ~/ML
    source cifar10/deephi/miniVggNet/quantiz/decent_miniVggNet.sh 2>&1 | tee cifar10/deephi/miniVggNet/quantiz/rpt/logfile_decent_miniVggNet.txt
    source cifar10/deephi/miniVggNet/quantiz/dnnc_miniVggNet.sh 2>&1 | tee cifar10/deephi/miniVggNet/quantiz/rpt/logfile_dnnc_miniVggNet.txt

3. Edit the main.cc program application (from the host side).
4. Compile the hybrid application (from the target side) with the make utility.
5. Run the hybrid application (from the target side). See below for a command line example for the miniVggNet case:

    cd /root/cifar10/miniVggNet/zcu102/baseline
    make clean
    source run_fps_miniVggNet.sh 2>&1 | tee ./rpt/logfile_fps_miniVggNet.txt
    source run_top5_miniVggNet.sh 2>&1 | tee ./rpt/logfile_top5_miniVggNet.txt
You can now copy (ssh/scp) the logfile_top5_miniVggNet.txt file from the target board to your host PC and run the last Python script (check_dpu_runtime_accuracy.py) to check the top-1 accuracy delivered by the DPU on the test images at run time. Alternatively, you can copy the Python script to the target board and run it there. This is the most important step, because you can now see the real average accuracy of your CNN system working at run time.
The estimated top-1 average accuracy after quantization can be seen in one of the last lines of the captured decent log file. A remarkable (for this small CNN) ~86% is achieved.
In step 2, dnnc says that there is one kernel task running on the DPU (implementing the CONV, ReLU, BN, and FC layers) and another kernel running in software on the Arm CPU of ZCU102 (for the top-k and SoftMax layers). You can see this in the captured dnnc log file.
The variants only differ in a few printf lines, which are used in the first file but commented out in the second file. They print the top-5 accuracies for each input image to be classified, as seen in the example below:
// from void TopK(const float *d, int size, int k, vector<string> &vkind) {
// ...
pair<float, int> ki = q.top();
printf("[Top]%d prob = %-8f name = %s\n", i, d[ki.second], vkind[ki.second].c_str());
// ...
// from void classifyEntry(DPUKernel *kernelconv){
// ...
cout << "DBG imread " << baseImagePath + images.at(ind) << endl;
// ...
If you change the way this information is printed in the stdout, you must also change the Python script check_dpu_runtime_accuracy.py accordingly, because it acts essentially as a text parser of the top-5 log file captured at run time.
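The kind of text parsing involved can be sketched in Python 3 as follows. This is not the code of check_dpu_runtime_accuracy.py: it only assumes the two line formats printed by the C++ snippet above, that the top-5 lines of an image directly follow its "DBG imread" line, and that the true class can be derived from the image path by a caller-supplied function (the filename convention used in the example is hypothetical).

```python
import re

IMG_RE = re.compile(r"DBG imread (\S+)")
TOP_RE = re.compile(r"\[Top\](\d+) prob = (\S+)\s+name = (.+)")

def parse_top5_log(lines):
    """Return a list of (image_path, [(rank, prob, name), ...]) entries."""
    results = []
    for line in lines:
        line = line.strip()
        m = IMG_RE.search(line)
        if m:
            results.append((m.group(1), []))
            continue
        m = TOP_RE.search(line)
        if m and results:
            results[-1][1].append((int(m.group(1)), float(m.group(2)), m.group(3)))
    return results

def top1_accuracy(results, true_label):
    """Fraction of images whose rank-0 prediction equals true_label(path)."""
    hits = 0
    for path, preds in results:
        if preds:
            best = min(preds, key=lambda p: p[0])   # prediction with rank 0
            if best[2] == true_label(path):
                hits += 1
    return hits / len(results)

# Hypothetical log excerpt and filename convention "<class>_<number>.jpg"
sample_log = [
    "DBG imread ./test_images/cat_0001.jpg",
    "[Top]0 prob = 0.900000 name = cat",
    "[Top]1 prob = 0.050000 name = dog",
    "DBG imread ./test_images/ship_0002.jpg",
    "[Top]0 prob = 0.600000 name = truck",
]
label_from_name = lambda path: path.split("/")[-1].split("_")[0]
parsed = parse_top5_log(sample_log)
print(top1_accuracy(parsed, label_from_name))
```

Because the regular expressions mirror the printf and cout formats, any change to those formats in the C++ code has to be reflected in the two patterns.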
Also in the first lines of the top5_main.cc file, you need to report the kernel name, input node, and output node that were generated by the dnnc compiler in the log file:
#define KERNEL_CONV "miniVggNet_0"
#define CONV_INPUT_NODE "conv1"
#define CONV_OUTPUT_NODE "fc2"
The three routines in top5_main.cc that make use of DPU APIs are main(), classifyEntry() and run_miniVggNet(). These routines are described below:
- main() opens the DPU device, loads the convolutional kernel, and finally destroys and closes it when the classification routine is terminated just before the return command.
- classifyEntry() is the classification routine. It creates a task on the DPU for the convolutional kernel it has to run (in general, one task for each separate kernel you might have). It does this in a multi-threading way, where the number of parallel threads is an argument you have to pass to the executable running on the Arm CPU using argc/argv.
- run_miniVggNet() does most of the work, as shown in the following code snippet:
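void run_miniVggNet(DPUTask *taskConv, Mat img) {
    assert(taskConv);
    // Number of output channels of the last layer (one per class)
    int channel = dpuGetOutputTensorChannel(taskConv, CONV_OUTPUT_NODE);
    float *softmax = new float[channel];
    float *FCResult = new float[channel];
    // Load the input image and run the convolutional kernel on the DPU
    _T(dpuSetInputImage2(taskConv, CONV_INPUT_NODE, img));
    _T(dpuRunTask(taskConv));
    // Read back the output of the last layer as 32-bit floating point values
    _T(dpuGetOutputTensorInHWCFP32(taskConv, CONV_OUTPUT_NODE, FCResult, channel));
    // SoftMax and top-5 are computed in software on the Arm CPU
    _T(CPUCalcSoftmax(FCResult, channel, softmax));
    _T(TopK(softmax, channel, 5, kinds));
    delete[] softmax;
    delete[] FCResult;
}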
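The two software steps at the end of run_miniVggNet(), SoftMax and top-k, can be mirrored on the host with the following Python 3 sketch, for example to cross-check values captured from the board. The implementation details (subtracting the maximum for numerical stability, using heapq) are assumptions of this sketch and are not taken from CPUCalcSoftmax() or TopK().

```python
import heapq
import math

def softmax(logits):
    """Convert raw output values of the last layer into probabilities."""
    m = max(logits)                        # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k, names):
    """Return the k most probable classes as (name, probability) pairs."""
    idx = heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)
    return [(names[i], probs[i]) for i in idx]

# Toy example with three classes
probs = softmax([1.0, 3.0, 2.0])
print(top_k(probs, 2, ["airplane", "automobile", "bird"]))
```

In this toy example the class with the largest raw output ("automobile") is ranked first, followed by "bird".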
To compile and run the application on ZCU102, you need to archive the entire contents of the cifar10/deephi/miniVggNet/zcu102/baseline folder into a .tar file and copy it (ssh/scp) from the host PC to the target board. Make sure to physically copy the 1000 images from the testing folder cifar10/input/cifar10_jpg/test/ to the cifar10/deephi/miniVggNet/zcu102/test_images local subdirectory.
The following is a list of the commands you could launch between the host (static IP address 192.168.1.101) and the target (static IP address 192.168.1.100). These examples assume that your current folder is cifar10/deephi/miniVggNet/zcu102/ in the host PC and you have a folder named /root/cifar10/miniVggNet/zcu102 in the file system of the SD card on the target ZCU102 board:
(from host)
cd ~/ML/cifar10/deephi/miniVggNet/zcu102/
cp -r ../../../input/cifar10_jpg/test ./test_images
tar -cvf baseline.tar ./baseline ./test_images
scp baseline.tar root@192.168.1.100:/root/cifar10/miniVggNet/zcu102/
(from target)
cd /root/cifar10/miniVggNet/zcu102
tar -xvf baseline.tar
cd ./baseline
make clean
source run_fps_miniVggNet.sh 2>&1 | tee ./rpt/logfile_fps_miniVggNet.txt
source run_top5_miniVggNet.sh 2>&1 | tee ./rpt/logfile_top5_miniVggNet.txt
(from host)
cd ~/ML/cifar10/deephi/miniVggNet/zcu102/baseline/rpt
scp root@192.168.1.100:/root/cifar10/miniVggNet/zcu102/baseline/rpt/logfile_top5_miniVggNet.txt .
python ~/ML/cifar10/caffe/code/check_dpu_runtime_accuracy.py \
-i ./logfile_top5_miniVggNet.txt 2>&1 | tee ./logfile_check_dpu_top5_miniVggNet.txt
At the end of this quantization procedure, when the DPU runs the miniVggNet CNN on the ZCU102, the following performance was reported:
- 3710 fps with three threads, as shown in the logfile_fps_miniVggNet.txt log file.
- ~86% average top-1 accuracy, as shown in the logfile_check_dpu_top5_miniVggNet log file.
Pruning is a technique to remove redundant or less useful weights and output channels from a CNN layer. The aim is to reduce the overall number of operations and so increase the frames per second (you might not need pruning if your CNN was optimized by design). This can, however, be detrimental to the average top-1 accuracy: the final result is ultimately a trade-off between the desired compression and the effective accuracy to sustain a certain target frame rate.
There are usually two types of pruning: fine and coarse. Fine pruning selectively kills either the weights or the output features with the smallest values from a channel. To achieve higher frame rates from a fine-compressed CNN, the hardware accelerator must be enabled to perform zero-skipping (that is, skipping all the multiplications with zero values). Zero-skipping requires a proper hardware architecture and organization of non-zero data (usually with run-length coding) in the internal memory of the hardware accelerator; otherwise, there would be no performance gain from fine pruning.
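The idea of fine pruning combined with zero-skipping can be illustrated with this Python 3 sketch. It is a toy model under stated assumptions: weights are pruned by a simple magnitude threshold, the surviving values are stored as (number of preceding zeros, value) pairs as one possible form of run-length coding, and the dot product touches only the non-zero weights. It does not describe any specific hardware accelerator.

```python
def magnitude_prune(weights, threshold):
    """Set to zero every weight whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def rle_encode(values):
    """Encode as (zeros_before, value) pairs, plus the count of trailing zeros."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs, run

def sparse_dot(pairs, activations):
    """Dot product that multiplies only the non-zero weights (zero-skipping)."""
    pos, acc = 0, 0.0
    for run, w in pairs:
        pos += run                 # skip the zero weights
        acc += w * activations[pos]
        pos += 1
    return acc

pruned = magnitude_prune([0.5, 0.01, -0.02, 0.3, 0.0], threshold=0.05)
pairs, trailing = rle_encode(pruned)
print(pairs, trailing, sparse_dot(pairs, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

Only two multiplications are performed instead of five, but the gain exists only because the data layout makes the zeros skippable, which is the point made above about hardware support.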
Xilinx DNNDK applies coarse pruning, which involves removing a complete output channel. In this case, any hardware accelerator can gain from it (not only the DPU, but also the xDNN IP core adopted in the Xilinx ML Suite), even if it does not implement zero-skipping in its architecture.
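Coarse pruning can be sketched in Python 3 as the removal of whole output channels. The selection criterion used here (smallest L1 norm of the channel weights) is a common heuristic chosen for illustration; it is an assumption and not necessarily the criterion used by the Xilinx pruning tool.

```python
def prune_output_channels(weights, ratio):
    """Remove the fraction `ratio` of output channels with the smallest L1 norm.

    weights[o] is the flat list of weights of output channel o.
    Returns the kept channels and their original indices.
    """
    norms = [sum(abs(w) for w in ch) for ch in weights]
    num_remove = int(len(weights) * ratio)
    order = sorted(range(len(weights)), key=lambda o: norms[o])
    removed = set(order[:num_remove])
    kept = [o for o in range(len(weights)) if o not in removed]
    # In a real CNN, the matching input channels of the next layer must be removed too
    return [weights[o] for o in kept], kept

layer = [[1.0, 1.0], [0.1, 0.1], [2.0, -2.0], [0.3, 0.0]]
pruned_layer, kept = prune_output_channels(layer, ratio=0.5)
print(kept, pruned_layer)
```

The result is a smaller but still dense layer, which is why any accelerator benefits from it without special support for sparse data.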
However, this invasive kind of pruning can affect the average accuracy. It is therefore important to apply the pruning in an iterative manner: for example, by compressing the CNN by only 10% and then performing fine-tuning (which can be a complete training process) to recover the probable accuracy drop. If you work carefully and apply this process for 7-8 steps, you can arrive at 70-80% of compression with a negligible top-1 average accuracy decrease. This iterative process can take a lot of time, especially if you are using a large database.
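The iterative procedure can be outlined in Python 3 as a loop that alternates compression and fine-tuning while the target compression grows in steps of 10%. The compress and finetune arguments are placeholders supplied by the caller; the dummy versions in the example only record the ratios and return a made-up accuracy, so nothing here reflects the real tool or real results.

```python
def compression_schedule(step=0.1, num_steps=7):
    """Target compression ratios: 0.1, 0.2, ... up to step * num_steps."""
    return [round(step * (i + 1), 2) for i in range(num_steps)]

def iterative_pruning(model, compress, finetune, ratios):
    """Alternate compression and fine-tuning; both callables are caller-supplied."""
    history = []
    for ratio in ratios:
        model = compress(model, ratio)       # remove channels up to the new target
        model, accuracy = finetune(model)    # retrain to recover the accuracy
        history.append((ratio, accuracy))
    return model, history

# Dummy stand-ins, for illustration only
dummy_compress = lambda model, ratio: model + [ratio]
dummy_finetune = lambda model: (model, 0.0)

final_model, history = iterative_pruning([], dummy_compress, dummy_finetune,
                                         compression_schedule())
print(history)
```

Keeping the history of (ratio, accuracy) pairs makes it easy to stop at the last step whose accuracy drop is still acceptable.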
At the end of the pruning process, you get a new floating point .caffemodel file of a size probably reduced by 40-60% (depending on the CNN) in comparison with the original .caffemodel file of the baseline (non-pruned) CNN. To run it on the ZCU102 board, you need to apply quantization using the output files generated by pruning (with some minor but important manual editing) as the input file to quantization.
Before you begin, you need to have the following files in the deephi/miniVggNet/pruning working directory:
- config.prototxt: Use this file to set the number of GPU devices and test iterations, as well as the Caffe model description, weights files, and compression ratio you want to achieve. In reality, I use seven files like this, and each one applies the weights generated by the previous pruning trial to increment the compression by 10%.
- solver.prototxt: This is the same solver of your original
.caffemodel, but renamed (for example, the same solver_3_miniVggNet.prototxt was already adopted during the training process).
📌 Note: Regarding the solver, in previous experiments, the miniVggNet CNN has been retrained using only 20000 of the original 40000 iterations. Doing this is up to you, but for this CNN, using 20000 rather than 40000 iterations did not have a dramatic effect on top-1 accuracy, and it saved half the PC processing time.
- train_val.prototxt: This is the same description file of your original .caffemodel, but renamed. For example, it is the same as train_val_3_miniVggNet.prototxt.
📌 Note: You need to edit train_val.prototxt to add top-1 and top-5 accuracy layers at its end.
- float.caffemodel: This is the same weights file of your original .caffemodel, only renamed (for example, the same snapshot_3_miniVggNet__iter_40000.caffemodel).
📌 Note: This pruning step is optional. You can skip it if you do not have the Xilinx pruning tool, or if the network does not need pruning.
In previous experiments, pruning the miniVggNet required seven steps of 10% compression each time. The flow can be explained by looking at the pruning_flow.sh shell script:
- analysis (ana) has to be executed only once at the beginning, and generates a hidden text file, .ana.regular, that is then reused by all following trials. This process can take a lot of time, so it is recommended to comment out the related line in the shell script after you have executed it once (assuming you are not changing the input files).
- Seven steps of compress / finetune actions, each one compressing the previously compressed CNN by 10%. In particular, compress is responsible for heuristically selecting the channels to kill, while finetune performs a retrain to restore the top-1 accuracy to its previous value, if possible.
- The final action is transform, which transforms the intermediate sparse model into the effective output .caffemodel of the compressed CNN (transformed.caffemodel).
In the transform step, you need to complete the following steps:
- Take the same train_val.prototxt and not the final.prototxt generated by the seventh step of compress-finetune.
- Take the latest snapshot .caffemodel, named regular_rate_0.7/snapshots/_iter_20000.caffemodel, and not the sparse .caffemodel named regular_rate_0.7/sparse.caffemodel. This is also illustrated in the pruning_flow.sh shell script.
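The order of the actions in the flow can be summarized with a short sketch. It only lists the action names taken from the text above and does not reproduce the real command lines of the pruning tool. Treating the rate as cumulative (0.1, 0.2, ... 0.7) is an assumption inferred from the regular_rate_0.7 directory name, and the function name is hypothetical.

```python
def pruning_plan(num_rounds=7, step=0.1):
    """Return the ordered (action, cumulative rate) pairs of the pruning flow."""
    plan = [("ana", None)]  # executed once; its result is reused by every round
    for k in range(1, num_rounds + 1):
        rate = round(k * step, 1)
        plan.append(("compress", rate))  # select the channels to remove
        plan.append(("finetune", rate))  # retrain to recover the accuracy
    # transform starts from the last finetune snapshot, not the sparse model
    plan.append(("transform", round(num_rounds * step, 1)))
    return plan

plan = pruning_plan()
for action, rate in plan:
    print(action if rate is None else f"{action} (cumulative rate {rate})")
```

With the default arguments the plan contains 16 actions: one analysis, seven compress/finetune pairs, and one final transform.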
The command to prune the whole CNN is as follows:
cd ~/ML/
source cifar10/deephi/miniVggNet/pruning/pruning_flow.sh 2>&1 | tee cifar10/deephi/miniVggNet/pruning/rpt/logfile_whole_pruning_flow.txt
Before you run it, check the pathnames in the solver, train_val, and config*.prototxt files mentioned above.
After seven rounds of compress and finetune with half the iterations (20000) of the original Caffe training (40000), two output files are generated from the three input files (float.caffemodel, solver.prototxt, and train_val.prototxt). The output files are transformed.caffemodel and final.prototxt. These become the input files to the next quantization process.
The compressed miniVggNet now has the following:
- ~69% fewer operations and ~67% fewer weights than the original baseline CNN.
- A drop in accuracy from ~87% (baseline) to ~72% (compressed), as reported in the logfile_compress7_miniVggNet.txt log file after the seventh round of compression.
After finetune, the top-1 average prediction accuracy is restored to ~85%, as reported in the logfile_finetune7_miniVggNet.txt log file. In both cases, the top-1 accuracy is measured on the validation dataset during the finetune training process. To measure the effective top-1 average accuracy at run time with the check_dpu_runtime_accuracy.py Python script, you need to quantize the CNN you just pruned.
The log files of each step are stored in the rpt folder for your reference. In particular, logfile_pruning_flow.txt and aws_logfile_pruning_flow.txt are the log files of the whole pruning process running on a personal Ubuntu PC and on an AWS p2.xlarge EC2 instance, respectively.
The process is exactly the same as that explained in the Quantization of the Baseline miniVggNet section. The only difference is that the input files are now named as follows:
- transformed.caffemodel: The output of the transform step from the pruning process.
- q_final.prototxt: Generated by manually editing the same final.prototxt that was produced at the end of the pruning process.
You also need to replace the training LMDB database with the calibration images, and add the three mean values. Pruning is not a deterministic process, so every pruning trial can create a different final.prototxt file, and in that case you have to re-edit a new q_final.prototxt file (the q_final.prototxt file placed in the rpt folder is solely for documentation).
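Because each pruning trial can produce a different final.prototxt, it can help to list the channel counts of a file before you re-edit it. The following is a minimal sketch that assumes the usual Caffe text format, in which each layer has a `name: "..."` field followed by a `num_output: N` field. The function name and the sample text are hypothetical, and a real file may contain other `name:` fields that this simple scan does not distinguish.

```python
import re

def channels_per_layer(prototxt_text):
    """Map each layer name to its num_output value, in file order."""
    result = {}
    current = None
    for line in prototxt_text.splitlines():
        name = re.search(r'\bname:\s*"([^"]+)"', line)
        if name:
            current = name.group(1)  # remember the most recent layer name
        out = re.search(r'\bnum_output:\s*(\d+)', line)
        if out and current is not None:
            result[current] = int(out.group(1))
    return result

# Hypothetical excerpt, used only to exercise the function.
sample = '''
layer {
  name: "conv1"
  type: "Convolution"
  convolution_param { num_output: 19 kernel_size: 3 }
}
layer {
  name: "fc1"
  type: "InnerProduct"
  inner_product_param { num_output: 10 }
}
'''

channels = channels_per_layer(sample)
for layer_name, num_output in channels.items():
    print(f"{layer_name}: {num_output} output channels")
```

Running the same function on two final.prototxt files from different trials and comparing the resulting dictionaries shows which layers were pruned differently.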
To compile with the decent and dnnc tools, call the two shell scripts, decent_pruned_miniVggNet.sh and dnnc_pruned_miniVggNet.sh, with the following commands:
cd ~/ML
source cifar10/deephi/miniVggNet/pruning/quantiz/decent_pruned_miniVggNet.sh 2>&1 | tee cifar10/deephi/miniVggNet/pruning/quantiz/rpt/logfile_decent_pruned_miniVggNet.txt
source cifar10/deephi/miniVggNet/pruning/quantiz/dnnc_pruned_miniVggNet.sh 2>&1 | tee cifar10/deephi/miniVggNet/pruning/quantiz/rpt/logfile_dnnc_pruned_miniVggNet.txt
To compile and run the application on ZCU102, you need to archive the entire contents of the cifar10/deephi/miniVggNet/zcu102/pruned folder into a .tar file, and copy it (ssh/scp) from the host PC to the target board. Make sure to physically copy the 1000 images from the testing folder, cifar10/input/cifar10_jpg/test/, to the cifar10/deephi/miniVggNet/zcu102/test_images local subdirectory.
See below for a list of example commands to run on the host (static IP address 192.168.1.101) and on the target (static IP address 192.168.1.100, as used by the scp commands), assuming your current folder is cifar10/deephi/miniVggNet/zcu102/ on the host PC and you have a folder named /root/cifar10/miniVggNet/zcu102 in the file system of the SD card of the target ZCU102 board:
cd ~/ML/cifar10/deephi/miniVggNet/zcu102/
cp -r ../../../input/cifar10_jpg/test ./test_images
tar -cvf pruned.tar ./pruned ./test_images
scp pruned.tar root@192.168.1.100:/root/cifar10/miniVggNet/zcu102/
(from target)
cd /root/cifar10/miniVggNet/zcu102
tar -xvf pruned.tar
cd ./pruned
make clean
source run_fps_miniVggNet.sh 2>&1 | tee ./rpt/logfile_fps_pruned_miniVggNet.txt
source run_top5_miniVggNet.sh 2>&1 | tee ./rpt/logfile_top5_pruned_miniVggNet.txt
(from host)
cd ~/ML/cifar10/deephi/miniVggNet/zcu102/pruned/rpt
scp root@192.168.1.100:/root/cifar10/miniVggNet/zcu102/pruned/rpt/logfile_top5_pruned_miniVggNet.txt .
python ~/ML/cifar10/caffe/code/check_dpu_runtime_accuracy.py \
-i ./logfile_top5_pruned_miniVggNet.txt 2>&1 | tee ./logfile_check_dpu_top5_pruned_miniVggNet.txt
At the end of the pruning and quantization procedures, when the DPU runs the miniVggNet CNN on the ZCU102, the following performance was reported:
- 4895 fps with six threads, as shown in the logfile_fps_pruned_miniVggNet.txt log file.
- 85% average top-1 accuracy, as shown in the logfile_check_dpu_top5_pruned_miniVggNet.txt log file.
Congratulations! You have completed the Caffe training of miniVggNet using the CIFAR10 database. You then applied Xilinx DNNDK to quantize the original CNN to get a baseline reference. You also have seen how to prune (and quantize again) the new optimized CNN: the CNN is optimized in the sense that it has fewer output channels than its baseline version.
By running the .ELF files of either the baseline or the pruned CNNs on the ZCU102 target board, you have measured a frame rate improvement from 3710 fps (baseline) to 4895 fps (pruned) with a small drop in average top-1 accuracy, from 86% in the baseline to 85% in the pruned CNN. See below for a summary of the most important steps you have completed to arrive at this point.
1. You have downloaded the CIFAR10 dataset and organized it in the proper way to train miniVggNet with Caffe, making predictions on 1000 testing images to get the average top-1 accuracy: 50000 and 9000 images respectively in the LMDB training and validation databases, plus 1000 images (from the training dataset) to compose the calibration dataset to be used during the quantization process. You have applied the following Python scripts:
2. You have trained the CNN with 40000 iterations by applying the 4_training.py Python script and the train_val_3_miniVggNet.prototxt and solver_3_miniVggNet.prototxt input .prototxt files. You have also plotted the learning curves of the training process with the 5_plot_learning_curve.py Python script.
3. The floating point weights file float.caffemodel generated in the previous step, together with the CNN deploy model (deploy_3_miniVggNet.prototxt), has then been used to make predictions on the 1000 testing images. You have achieved an average top-1 accuracy of ~87%. All the above steps can be run with the single shell script caffe_flow_miniVggNet.sh if you use your local PC, or with aws_caffe_flow_miniVggNet.sh if you use a p2.xlarge instance on the AWS.
4. You have then quantized this baseline CNN with the DNNDK decent and dnnc tools on the host PC by applying the decent_miniVggNet.sh and dnnc_miniVggNet.sh shell scripts to the files generated in step 3, float.caffemodel and float.prototxt, where the latter file is the train_val_3_miniVggNet.prototxt edited to replace the LMDB training database with the calibration images and to add the top-1 and top-5 accuracy layers at the bottom.
5. You have moved to the target ZCU102 board the main.cc application file and the ELF DPU kernel generated by dnnc in the previous step. There, you have compiled the hybrid application, with the Arm CPU executing the SoftMax and top-5 software routines while the DNNDK hardware accelerator executes the FC, CONV, ReLU, and BN layers of the CNN.
6. You have measured an effective frame rate of 3710 fps and an average top-1 accuracy of 86% (this last one using the check_dpu_runtime_accuracy.py Python script). This ends the implementation flow of the baseline miniVggNet from the concept to the run-time execution on the ZCU102 target board.
7. You have seen how the CNN can be optimized by applying pruning to reduce the number of output channels (and consequently the overall number of operations the DPU has to complete). You have applied the iterative flow described in the pruning_flow.sh shell script together with seven variants of the same config.prototxt configuration file to the following input files:
   - solver.prototxt: The same solver solver_3_miniVggNet.prototxt adopted during the training process, with edited pathnames and 20000 iterations instead of 40000.
   - train_val.prototxt: The same description file adopted during the training process, renamed from train_val_3_miniVggNet.prototxt with some editing to add top-1 and top-5 accuracy layers at its end.
   - float.caffemodel: The same weights file of your original .caffemodel (snapshot_3_miniVggNet__iter_40000.caffemodel).
8. The pruning process generated the following output files, which then became inputs to the next and final quantization step:
   - transformed.caffemodel: A .caffemodel binary file much smaller in size than the starting float.caffemodel.
   - final.prototxt: A .prototxt file detailing how many channels every layer has after pruning.
9. You have edited the final.prototxt file to replace the LMDB training database with the calibration images, adding the top-1 and top-5 accuracy layers at the bottom to get the new q_final.prototxt file. You have applied the decent and dnnc tools on the host PC. You did this by applying the decent_pruned_miniVggNet.sh and dnnc_pruned_miniVggNet.sh shell scripts to the q_final.prototxt and transformed.caffemodel files.
10. As in step 5, you have moved to the target ZCU102 board and compiled the hybrid application there. You have measured a frame rate of 4895 fps with an average top-1 accuracy of 85%.
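The miniVggNet figures quoted in this tutorial can be put side by side with a few lines of arithmetic. The numbers below are the ones reported in the text. The "ideal" speedup is only a naive upper bound that assumes throughput scales inversely with the operation count, which ignores memory traffic, the number of threads, and the layers executed on the Arm CPU.

```python
# Figures reported in this tutorial for miniVggNet on the ZCU102.
baseline_fps, pruned_fps = 3710, 4895
baseline_top1, pruned_top1 = 86, 85  # percent, measured at run time
ops_removed = 0.69                   # ~69% fewer operations after pruning

measured_speedup = pruned_fps / baseline_fps
ideal_speedup = 1.0 / (1.0 - ops_removed)  # naive bound, see the assumptions above
accuracy_drop = baseline_top1 - pruned_top1

print(f"measured speedup: {measured_speedup:.2f}x")                    # 1.32x
print(f"bound from the operation count alone: {ideal_speedup:.2f}x")   # 3.23x
print(f"top-1 accuracy drop: {accuracy_drop} percentage point")
```

The gap between the two speedup values suggests that the operation count is not the only factor limiting the frame rate on the target board.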
The following is a summary of the results obtained from miniGoogleNet with the CIFAR10 database.
The whole training process runs with the caffe_flow_miniGoogleNet.sh shell script if you are using your local PC, or with aws_caffe_flow_miniGoogleNet.sh if you are using the AWS.
The input files are train_val_3_miniGoogleNet.prototxt and solver_3_miniGoogleNet.prototxt.
By applying the floating-point model named snapshot_3_miniGoogleNet__iter_40000.caffemodel (float.caffemodel for short) and the CNN deploy model deploy_3_miniGoogleNet.prototxt to make predictions on the 1000 testing images, you can achieve an average top-1 accuracy of 91%.
The key files generated during the training flow are listed below:
- Log file for the whole Caffe training process: logfile_caffe_flow_miniGoogleNet.txt.
- Log file for the prediction process: predictions_3_miniGoogleNet.txt.
The following screenshot shows the end of the training on the AWS. Note the ~90% top-1 average accuracy computed on the validation dataset.
The following screenshot shows the predictions on the 1000 test images.
The elapsed time for the training in Caffe with 40000 iterations is around five hours on the K80 NVIDIA GPU of the AWS p2.xlarge instance.
The scripts to quantize this baseline CNN with DNNDK tools on the host PC are named decent_miniGoogleNet.sh and dnnc_miniGoogleNet.sh. They work on the files generated in the previous step, float.caffemodel and float.prototxt, where the latter is the train_val_3_miniGoogleNet.prototxt file edited to replace the LMDB training database with the calibration images, and to add the top-1 and top-5 accuracy layers at the bottom.
The key files generated during the quantization flow are listed below:
- Log file for the decent step: logfile_decent_miniGoogleNet.txt. It estimates the top-1 average accuracy to be 90%.
- Log file for the dnnc step: logfile_dnnc_miniGoogleNet.txt.
- Log file for the fps performance on ZCU102 at run time: logfile_fps_miniGoogleNet.txt.
- Log file of the top-5 prediction accuracy on ZCU102 at run time: logfile_check_dpu_top5_miniGoogleNet.txt.
In a departure from the miniVggNet example, two .ELF files are now generated for the DPU, because the intermediate AveragePooling layer has to be executed on the Arm CPU, not being available on the DPU in this release of DNNDK.
The top-1 average prediction accuracy of the quantized CNN is 89%. The throughput is 1943 fps with three threads.
The iterative pruning flow is applied by running the pruning_flow.sh shell script. The input files are listed below:
- solver.prototxt: The same solver, solver_3_miniGoogleNet.prototxt, adopted during the training process in Caffe.
- train_val.prototxt: The same description file of your original Caffe model, renamed from train_val_3_miniGoogleNet.prototxt with some editing to add top-1 and top-5 accuracy layers at its end.
- float.caffemodel: The same weights file as your original .caffemodel file (named snapshot_3_miniGoogleNet__iter_40000.caffemodel).
The pruning process generates the following output files, which then become inputs to the next and final quantization step:
- transformed.caffemodel: A binary file much smaller in size than the float.caffemodel you started with.
- final.prototxt: A file that tells you how many channels every layer has after pruning.
After you have edited the final.prototxt file to replace the LMDB training database with the calibration images and to add the top-1 and top-5 accuracy layers at the bottom, you get the new q_final.prototxt file. You can then apply decent and dnnc on the host PC by running the decent_pruned_miniGoogleNet.sh and dnnc_pruned_miniGoogleNet.sh shell scripts on q_final.prototxt and transformed.caffemodel. The following key files are generated during the quantization of the pruned CNN:
- Log file of the decent step: logfile_decent_pruned_miniGoogleNet.txt. It estimates the top-1 average accuracy to be ~90%.
- Log file of the dnnc step: logfile_dnnc_pruned_miniGoogleNet.txt.
- Log file of the fps performance on ZCU102 at run time: logfile_fps_pruned_miniGoogleNet.txt.
- Log file of the top-5 prediction accuracy on ZCU102 at run time: logfile_check_dpu_top5_prunes_miniGoogleNet.txt.
The top-1 average prediction accuracy of the quantized pruned CNN is 90%, and the throughput is 2527 fps with four threads. This limited improvement is due to the fact that the baseline CNN has been pruned by only ~39% (that is, with only four steps of compression/finetuning). Beyond that level, the top-1 accuracy drops abruptly.
In reality, the training of this CNN should be redone from scratch, using data augmentation to decrease the level of overfitting and increase the top-1 accuracy. The original Keras model could achieve 94%, but Caffe does not have a sophisticated way to do data augmentation on the fly.
Another possibility is to run the analysis step after each compression/finetuning step. Furthermore, it is also possible to do finetuning after applying decent to recover some losses due to quantization. These three trials will be done in a future release of this tutorial.


