Publication Details
Overview
Shidi Tang, Pengwei Zheng, Ruiqi Chen, Yuxuan Lv, Bruno da Silva Gomes, Ming Ling
 

Chapter in Book / Report / Conference proceeding

Abstract 

Diffusion Transformer (DiT) models have shown superior generative capabilities in image and video synthesis, yet their high computational cost during inference remains a critical bottleneck. Temporal differential computation offers a promising complement to low-bit quantization by exploiting the temporal similarity of activations across diffusion time steps. However, applying this technique to DiT's Attention layers introduces substantial memory and computation overheads.

In this paper, we present Diff-DiT, the first FPGA accelerator designed for low-bit DiT inference with differential computation. To overcome the unique challenges of DiT quantization and hardware acceleration, we propose: (1) an approximated differential attention (ADA) method that selectively approximates attention computations across time steps using a significance score, enabling low-bit on-chip execution while minimizing memory overhead; (2) an optimal cross-cast data-access pattern with flexible data reuse to maximize computational intensity during matrix multiplications; and (3) a half-condition splitting (HCS) dataflow optimization with fine-grained pipelining to reduce computation and memory-access latency.

Extensive experiments show that Diff-DiT outperforms an NVIDIA V100 GPU by 1.39× in end-to-end throughput and 5.60× in energy efficiency. Compared with state-of-the-art diffusion model accelerators, Diff-DiT also achieves 2.81× and 2.77× improvements in throughput and energy efficiency, respectively.
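The core idea behind temporal differential computation can be illustrated with a minimal sketch: because activations change little between adjacent diffusion time steps, only the deltas that exceed a significance threshold need to be recomputed, and the rest can be reused from the previous step. The function below is an illustrative assumption on our part (the threshold rule, names, and shapes are not the paper's ADA formulation), not the accelerator's actual method:

```python
import numpy as np

def differential_step(prev_act, curr_act, threshold=0.05):
    """Illustrative temporal differential computation.

    Keeps only activation deltas whose magnitude exceeds a
    significance threshold (relative to the mean activation scale),
    so most of a time step reuses the previous step's results.
    NOTE: this is a hypothetical sketch, not the paper's ADA method.
    """
    delta = curr_act - prev_act
    # Significance mask: large deltas are recomputed, small ones skipped.
    scale = np.abs(prev_act).mean()
    mask = np.abs(delta) > threshold * scale
    sparse_delta = np.where(mask, delta, 0.0)
    reconstructed = prev_act + sparse_delta
    return reconstructed, mask.mean()  # fraction actually recomputed

rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 8)).astype(np.float32)
# Adjacent diffusion steps produce highly similar activations.
a_curr = a_prev + 0.01 * rng.standard_normal((4, 8)).astype(np.float32)
recon, frac = differential_step(a_prev, a_curr)
```

Because the skipped deltas are small by construction, the reconstruction error stays bounded by the significance threshold while only a small fraction of entries is recomputed, which is what makes low-bit on-chip execution of the deltas attractive.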
