Diff-DiT: Temporal Differential Accelerator for Low-bit Diffusion Transformers on FPGA
 
Shidi Tang, Pengwei Zheng, Ruiqi Chen, Yuxuan Lv, Bruno da Silva Gomes, Ming Ling
 
Abstract 

Diffusion Transformer (DiT) models have shown superior generative capabilities in image and video synthesis, yet their high computational cost during inference remains a critical bottleneck. Temporal differential computation offers a promising route to low-bit quantization by exploiting the temporal similarity of activations. However, applying this technique to DiT's Attention layers introduces substantial memory and computation overheads. In this paper, we present Diff-DiT, the first FPGA accelerator designed for low-bit DiT inference with differential computation. To overcome the unique challenges of DiT quantization and hardware acceleration, we propose: (1) an approximated differential attention (ADA) method that selectively approximates attention computations across time steps using a significance score, enabling low-bit on-chip execution while minimizing memory overhead; (2) an optimal cross-cast data access pattern with flexible data reuse to maximize computational intensity during matrix multiplications; and (3) a half-condition splitting (HCS) dataflow optimization with fine-grained pipelining to reduce computation and memory access latency. Extensive experiments show that Diff-DiT outperforms an NVIDIA V100 GPU by 1.39× in end-to-end throughput and 5.60× in energy efficiency. Compared with state-of-the-art diffusion model accelerators, Diff-DiT achieves 2.81× and 2.77× improvements in throughput and energy efficiency, respectively.
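As a rough illustration of why temporal differences suit low-bit quantization, the sketch below applies the idea to a single linear layer only. It is not the ADA method or the Diff-DiT datapath. It relies on the identity W·x_t = W·x_{t-1} + W·(x_t − x_{t-1}) and on the assumption that activations at adjacent time steps differ by a small perturbation, so the difference spans a much narrower range than the activation itself. The helper names (`matvec`, `quantize`, `max_abs_err`), the layer size, the 4-bit width, and the noise level are all illustrative choices, not values from the paper.

```python
import random

def matvec(W, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def quantize(x, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # fall back to 1.0 for an all-zero input
    return [round(v / scale) * scale for v in x]

def max_abs_err(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(p - q) for p, q in zip(a, b))

random.seed(0)
dim, bits = 64, 4  # illustrative sizes, not taken from the paper
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
x_prev = [random.gauss(0, 1) for _ in range(dim)]
# Assumption: the next time step is the previous activation plus small noise.
x_curr = [v + random.gauss(0, 0.05) for v in x_prev]

y_prev = matvec(W, x_prev)  # output retained from the previous time step
y_ref = matvec(W, x_curr)   # full-precision reference for the current step

# Direct path: quantize the full activation to low bit width.
y_direct = matvec(W, quantize(x_curr, bits))

# Differential path: quantize only the temporal difference, then accumulate.
delta = [c - p for c, p in zip(x_curr, x_prev)]
y_diff = [yp + d for yp, d in zip(y_prev, matvec(W, quantize(delta, bits)))]

err_direct = max_abs_err(y_direct, y_ref)
err_diff = max_abs_err(y_diff, y_ref)
print(f"direct {bits}-bit error: {err_direct:.4f}")
print(f"differential {bits}-bit error: {err_diff:.4f}")
```

Under these assumptions the differential path incurs a much smaller quantization error at the same bit width, at the cost of retaining the previous output. In attention both operands of each matrix product change between time steps, so expanding the product yields cross terms that involve previous-step operands, which is one way to see where the extra memory and computation mentioned above can arise.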