FANE: FPGA-Based FP8 Approximate Neural Network Engine
 
Han Bao, Shidi Tang, Jindong Li, Ahmed Sadaqa, Ruiqi Chen, Ming Ling, Bruno da Silva Gomes
 
Abstract 

The 8-bit floating-point (FP8) format has attracted growing interest in neural networks (NNs) because it offers a wider dynamic range than traditional INT8. However, multiply-accumulate (MAC) operations remain a major source of power consumption during NN inference, which makes DSP-free designs important, especially for edge FPGAs with few or no DSPs. This brief therefore presents FANE, an FPGA-based FP8 approximate neural network engine. We first introduce a novel approximation method that replaces multiplications with linear additions. This method reduces power consumption while maintaining high accuracy, outperforming the latest FP8 approximate multiplier by 53.15\%. Based on this design, we construct an FP8 MAC unit and integrate it into both a convolution engine and a matrix-vector multiplication (MVM) unit. Finally, we integrate our design into a large language model (LLM). The results show 61.5\% higher efficiency (TOPS/W) than the previous design, demonstrating the superiority of FANE in terms of performance and power efficiency. The code of FANE is available at https://github.com/hanbao04/FANE-FPGA-based-FP8-Approximate-Neural-Network-Engine.git
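
To give a rough sense of how a multiplication can be replaced by an addition, the Python sketch below applies a classic Mitchell-style logarithmic approximation to FP8 bit patterns. It is not FANE's own method, whose details are not given in this abstract. The E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits), the names decode_e4m3, approx_mul_e4m3 and approx_dot, the simplified handling of subnormals and NaN, and the floating-point accumulator are all assumptions made for illustration.

```python
# Illustrative sketch only: Mitchell-style approximate FP8 multiply.
# Assumes the E4M3 layout (1 sign, 4 exponent bits with bias 7, 3 mantissa bits).
# This is NOT the approximation proposed in FANE.

BIAS = 7        # assumed E4M3 exponent bias
MANT_BITS = 3   # assumed E4M3 mantissa width


def decode_e4m3(bits: int) -> float:
    """Decode an 8-bit E4M3 pattern to a Python float (NaN pattern not handled)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> MANT_BITS) & 0xF
    mant = bits & 0x7
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (mant / 8) * 2.0 ** (1 - BIAS)
    return sign * (1 + mant / 8) * 2.0 ** (exp - BIAS)


def approx_mul_e4m3(a: int, b: int) -> int:
    """Approximate a*b with one integer addition on the magnitude bits.

    The 7 magnitude bits are treated as a fixed-point log2 of the value, so
    adding them and removing one bias approximates the product:
    (1 + ma)(1 + mb) ~= 1 + ma + mb.
    """
    sign = (a ^ b) & 0x80
    mag_a, mag_b = a & 0x7F, b & 0x7F
    if mag_a == 0 or mag_b == 0:       # zero operand gives a signed zero
        return sign
    mag = mag_a + mag_b - (BIAS << MANT_BITS)
    mag = max(0, min(mag, 0x7E))       # clamp; 0x7F is assumed reserved for NaN
    return sign | mag


def approx_dot(xs, ws) -> float:
    """MAC loop: approximate products, accumulated in a wider (float) register."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += decode_e4m3(approx_mul_e4m3(x, w))
    return acc


if __name__ == "__main__":
    one, one_half, two = 0x38, 0x3C, 0x40   # 1.0, 1.5, 2.0 in E4M3
    print(decode_e4m3(approx_mul_e4m3(one_half, two)))       # exact case: 3.0
    print(decode_e4m3(approx_mul_e4m3(one_half, one_half)))  # 2.0 instead of 2.25
    print(approx_dot([one, one_half], [two, two]))           # 2.0 + 3.0
```

In this sketch the product is exact whenever one operand is a power of two, and is underestimated otherwise (1.5 x 1.5 yields 2.0 rather than 2.25), which is the kind of accuracy cost that an approximate multiplier design has to trade against power savings.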