Publication Details
Shidi Tang, Ruiqi Chen, Rui Liu, Yuxuan Lv, Pengwei Zheng, He Li, Ming Ling
 

Contribution to journal

Abstract 

The diffusion model has achieved remarkable success in the era of Artificial Intelligence Generated Content (AIGC) across various tasks, such as image, video, and text generation, material modeling, and molecular design. However, the diffusion model is computationally intensive because the reverse denoising process requires many iterations, which hinders its further advancement. There is therefore an urgent need to accelerate the diffusion model, especially in edge scenarios that require real-time computation. While researchers have made efforts to accelerate the diffusion model at the algorithm level using either efficient sampling or model quantization, these approaches still suffer from accuracy degradation. More importantly, they have overlooked the hardware-level acceleration challenge. This work aims to bridge the gap by introducing Diff-Acc, the first FPGA accelerator for unconditional diffusion models, with a novel step-wise quantization method that requires minimal calibration data to achieve state-of-the-art (SOTA) post-training quantization (PTQ) accuracy. Additionally, we adopt several hardware-oriented optimizations to reduce the computational overhead. At the architecture level, we fully analyze the computation flow of diffusion models and propose a novel architecture with group-wise parallelism to tackle the long iteration challenge. At the micro-architecture level, we decouple the data dependencies and adopt proper computational transformations. Experiments on two unconditional diffusion models (DDIM and DDPM) with two image datasets (CIFAR-10 and ImageNet) demonstrate that our quantization method achieves substantial improvements in image quality (FID: 6.67, sFID: 11.24) under 8-bit PTQ. Compared with both server-based (Tesla V100 and Intel Xeon) and edge-based (Raspberry Pi 4 and Jetson Nano) platforms, Diff-Acc implemented on the Zynq UltraScale+ XCZU9EG FPGA demonstrates up to 12.5× better energy efficiency. In particular, versus the edge-based platforms, Diff-Acc achieves up to 10.26× and 1.97× performance improvements over CPU and GPU, respectively.
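
As a rough illustration of what 8-bit post-training quantization with per-step calibration can look like, the sketch below keeps one symmetric min-max scale per denoising step. It is a generic textbook scheme and is not the step-wise method proposed in the paper; the function names (`calibrate_scale`, `quantize`, `dequantize`) and the calibration values are hypothetical and chosen only for this example.

```python
# Illustrative only: generic per-step symmetric 8-bit quantization.
# This is NOT the Diff-Acc step-wise method; all names and data are hypothetical.

def calibrate_scale(samples, num_bits=8):
    """Derive a symmetric scale from the largest absolute calibration value."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(x) for x in samples)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Map floats to signed integers, clamping to the representable range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_values]

# Made-up calibration activations, grouped by denoising step index.
calibration = {
    0: [-4.0, 1.5, 3.2],   # wide activation range at this step
    1: [-0.8, 0.2, 0.5],   # much narrower range at this step
}

# One scale per step, so a narrow-range step is not forced onto a wide grid.
step_scales = {t: calibrate_scale(xs) for t, xs in calibration.items()}
```

Keeping a separate scale per step is one simple way to cope with activation ranges that drift across the denoising iterations; a single global scale would spend most of the 8-bit grid on values that the narrow-range steps never produce.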

Reference