Publication Details
Overview
Ruiqi Chen, Yangxintong Lyu, Han Bao, Shidi Tang, Jindong Li, Yanxiang Zhu, Ming Ling, Bruno da Silva Gomes
 

Contribution to journal

Abstract 

The 8-bit floating-point (FP8) data format has been increasingly adopted in neural network (NN) computations due to its superior dynamic range compared to the traditional 8-bit integer (INT8) format. Nevertheless, the heavy reliance on multiplication in NN workloads leads to considerable energy consumption, even with FP8, particularly in FPGA-based deployments. To this end, this paper presents FP8ApproxLib, an FPGA-based approximate multiplier library for FP8. First, we conduct a bit-level analysis of the prior approximation method and introduce improvements to reduce the resulting computational error. Based on this analysis, we implement a fine-grained optimized design on mainstream FPGAs (Altera and AMD) using primitives and templates combined with physical layout constraints. Moreover, an automated tool is developed to support user configuration and generate HDL code. We then evaluate the accuracy and hardware efficiency of the FP8 approximate multipliers. The results show that our proposed method achieves an average error reduction of 53.15% (36.74%∼72.82%) compared to the previous FP8 approximation method. Moreover, compared to prior 8-bit approximate multipliers, our FP8 designs exhibit the lowest resource utilization. Finally, we integrate the design into the inference phase of three representative NN models (CNN, LLM, and Diffusion), demonstrating excellent power efficiency. This is the first FP8 approximate multiplier design with architecture-aware fine-grained optimization and deployment for modern FPGA platforms, and it can serve as a benchmark for future designs and comparisons of FPGA-based low-precision floating-point approximate multipliers. The code for this work is available in our GitLab∗.
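As background for the two claims in the abstract, the following sketch illustrates (a) why FP8 offers a wider dynamic range than INT8, using the common E4M3 convention (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits), and (b) the general flavor of approximate floating-point multiplication, here a classic Mitchell-style scheme that drops the mantissa cross term. Both are generic illustrations under those assumptions; neither reproduces the paper's actual multiplier design or its error-reduction improvements.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode an 8-bit E4M3 value (assumed layout: 1 sign, 4 exponent
    bits with bias 7, 3 mantissa bits; NaN encodings ignored for brevity)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:                                  # subnormal: no implicit 1
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

# Largest finite E4M3 magnitude (exponent 1111, mantissa 110): 448.0
fp8_max = decode_e4m3(0b0_1111_110)
# Smallest positive subnormal (exponent 0000, mantissa 001): 2**-9
fp8_min = decode_e4m3(0b0_0000_001)

print(f"FP8 E4M3 dynamic range: {fp8_max / fp8_min:.0f}x")  # ~229376x
print(f"INT8 dynamic range:     {127 // 1}x")               # 127x

def mitchell_mul(x: float, y: float) -> float:
    """Mitchell-style approximate multiply for positive floats:
    add exponents exactly, but approximate the mantissa product
    (1+a)(1+b) by 1+a+b, i.e. drop the a*b cross term."""
    mx, ex = math.frexp(x)            # x = mx * 2**ex, mx in [0.5, 1)
    my, ey = math.frexp(y)
    a, b = 2.0 * mx - 1.0, 2.0 * my - 1.0   # fractional mantissas in [0, 1)
    s = a + b
    e = (ex - 1) + (ey - 1)           # exact exponent addition
    if s < 1.0:
        return (1.0 + s) * 2.0 ** e
    return 2.0 * s * 2.0 ** e         # carry into the exponent

print(mitchell_mul(3.0, 5.0))  # 14.0 (exact product is 15; ~6.7% error)
```

The cross-term drop replaces the mantissa multiplier with an adder, which is the source of the energy savings approximate multipliers target; the cost is a bounded relative error (up to about 11% for plain Mitchell), which error-reduction techniques like the paper's then shrink.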
