Publication Details
Overview
 
 
Han Bao, Xingyu Liu, Ahmed Sadaqa, Yanxiang Zhu, Shidi Tang, Ruiqi Chen, Ming Ling, Bruno da Silva Gomes
 

Contribution to journal

Abstract 

The Deep Learning Processor Unit (DPU) has become a core component in general-purpose accelerators for efficiently deploying deep neural networks (DNNs) on Field-Programmable Gate Arrays (FPGAs). To fully exploit DPU performance, guidelines are needed for DNN architecture design and model porting, and benchmarking is essential for understanding the DPU architecture and reaching its peak performance. However, existing DPU benchmarks operate mainly at the network level and lack the fine-grained analysis needed to fully reveal DPU performance. To this end, we propose BenDan, an operator-level benchmarking framework for the DPU. BenDan characterizes key DNN operators such as fully connected layers and 2D convolution layers, bridging the gap between theoretical specifications and actual performance through fine-grained, operator-level benchmarks on the DPU. By collecting critical metrics, BenDan reveals inefficiencies caused by misalignment with the DPU's pixel, input channel, and output channel parallelism. Moreover, we evaluate the DPU's compatibility with cutting-edge neural network models to expose limitations in operator support and to inform deployment strategies. We observe the following key findings: (1) certain operators can cause DPU performance loss; (2) among the factors studied, the DPU architecture configuration has the most significant impact on performance, outweighing the channel and pixel parallelism settings; and (3) cutting-edge NN layers are partially offloaded to the CPU. BenDan provides a practical tool for DPU-aware optimization and model refinement. It generalizes readily to all DPU releases; the DPUCZDX8G is used as the example in this paper. The benchmarking framework is available on GitLab.
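The abstract's point about parallelism misalignment can be made concrete with a small back-of-the-envelope model. The sketch below is not part of BenDan; the function name, its simplified cost model (each dimension padded up to the nearest multiple of the corresponding parallelism), and the default parallelism values (pixel = 8, input channel = 16, output channel = 16, a B4096-style DPUCZDX8G configuration) are illustrative assumptions.

```python
import math

def conv2d_utilization(h, w, cin, cout, pp=8, icp=16, ocp=16):
    """Estimate compute-lane utilization of a 2D convolution on a DPU
    with pixel parallelism `pp`, input channel parallelism `icp`, and
    output channel parallelism `ocp` (illustrative B4096-style values).

    Simplified model: utilization drops whenever a dimension is not a
    multiple of its parallelism, because the final partial tile still
    occupies a full hardware cycle.
    """
    useful = h * w * cin * cout                        # MACs doing real work
    issued = (h * math.ceil(w / pp) * pp               # row width padded to pp
              * math.ceil(cin / icp) * icp             # input channels padded to icp
              * math.ceil(cout / ocp) * ocp)           # output channels padded to ocp
    return useful / issued

# Aligned layer: every dimension is a multiple of its parallelism.
print(f"aligned   : {conv2d_utilization(56, 56, 64, 64):.2%}")   # 100.00%
# Misaligned layer: cout=65 forces a nearly empty fifth output-channel tile.
print(f"misaligned: {conv2d_utilization(56, 56, 64, 65):.2%}")   # 81.25%
```

Under this toy model, adding a single output channel (64 to 65) costs roughly a fifth of the peak throughput, which illustrates why operator-level alignment analysis of the kind BenDan performs matters for DNN architecture design.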
