ETROVUB

Han Bao, Shidi Tang, Jindong Li, Ahmed Sadaqa, Ruiqi Chen, Ming Ling, Bruno da Silva Gomes

Contribution to journal

Abstract

The 8-bit floating-point (FP8) format has gained growing interest in neural networks (NNs) for its superior dynamic range over traditional INT8. However, multiply-accumulate (MAC) operations remain a major source of power consumption during NN inference, which makes DSP-free design important, especially for edge FPGAs with few or no DSPs. Therefore, this brief presents the FPGA-based FP8 approximate neural network engine (FANE), an approximate NN engine for FP8 on FPGAs. We first introduce a novel approximation method that replaces multiplications with linear additions. This approximation reduces power consumption while maintaining high accuracy, outperforming the latest FP8 approximate multiplier by 53.15%. Based on this design, we construct an FP8 MAC unit and integrate it into both a convolution engine and a matrix-vector multiplication (MVM) unit. Finally, we integrate our design into a large language model (LLM). The result shows 61.5% higher efficiency (TOPS/W) than the previous design, demonstrating the superiority of FANE in terms of performance and power efficiency. The code of FANE is available at https://github.com/hanbao04/FANE-FPGA-based-FP8-Approximate-Neural-Network-Engine.git
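
The abstract does not spell out how multiplications are replaced by additions, so the following is only an illustrative sketch of the general idea, not FANE's actual method. It uses the classic Mitchell-style logarithmic approximation: for normal floating-point values, adding the two bit patterns and subtracting the exponent bias approximates the product. The E4M3 layout (1 sign, 4 exponent, 3 mantissa bits, bias 7) is an assumption, as are the helper names `decode_e4m3` and `approx_mul_e4m3`; zero, subnormals, and NaN are ignored for brevity.

```python
BIAS = 7       # assumed E4M3 exponent bias
MAN_BITS = 3   # assumed E4M3 mantissa width


def decode_e4m3(bits: int) -> float:
    """Decode a *normal* E4M3 bit pattern to a Python float.

    Simplification: zero, subnormals and NaN are not handled.
    """
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> MAN_BITS) & 0xF
    man = bits & 0x7
    return sign * 2.0 ** (exp - BIAS) * (1 + man / 8)


def approx_mul_e4m3(a: int, b: int) -> int:
    """Approximate a*b for normal E4M3 inputs using one integer addition.

    Mitchell-style approximation: log2(1 + f) is treated as f, so the
    7-bit magnitude fields (exponent and mantissa) can simply be added,
    with the bias removed once.
    """
    sign = (a ^ b) & 0x80
    mag = (a & 0x7F) + (b & 0x7F) - (BIAS << MAN_BITS)
    # Saturate to the normal range (a simplification chosen for this sketch):
    # 0x08 is the smallest normal magnitude, 0x7E the largest finite one.
    mag = max(0x08, min(mag, 0x7E))
    return sign | mag


# 1.5 is encoded as 0x3C and 2.0 as 0x40 under the assumed layout.
print(decode_e4m3(approx_mul_e4m3(0x3C, 0x40)))  # 3.0 (exact, one mantissa is zero)
print(decode_e4m3(approx_mul_e4m3(0x3C, 0x3C)))  # 2.0 (exact product is 2.25)
```

The second call shows the worst case of this particular approximation, an error of about 11%, which occurs when both mantissa fractions are 0.5. Published approximate multipliers typically add correction terms to reduce this error; whether and how FANE does so is described in the paper and its repository rather than here.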

Reference

Bao, H, Tang, S, Li, J, Sadaqa, A, Chen, R, Ling, M & da Silva, B 2026, 'FANE: FPGA-Based FP8 Approximate Neural Network Engine', IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 34, no. 6, pp. 2015-2019. https://doi.org/10.1109/TVLSI.2026.3677683

Bao, H., Tang, S., Li, J., Sadaqa, A., Chen, R., Ling, M., & da Silva, B. (2026). FANE: FPGA-Based FP8 Approximate Neural Network Engine. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 34(6), 2015-2019. https://doi.org/10.1109/TVLSI.2026.3677683

@article{9dbce7b1742941e78ecb0bdf22fc29eb,
title = "FANE: FPGA-Based FP8 Approximate Neural Network Engine",
abstract = "The 8-bit floating-point (FP8) format has gained growing interest in neural networks (NNs) for its superior dynamic range over traditional INT8. However, multiply-accumulate (MAC) operations remain a major source of power consumption during NN inference, which makes DSP-free design important, especially for edge FPGAs with few or no DSPs. Therefore, this brief presents the FPGA-based FP8 approximate neural network engine (FANE), an approximate NN engine for FP8 on FPGAs. We first introduce a novel approximation method that replaces multiplications with linear additions. This approximation reduces power consumption while maintaining high accuracy, outperforming the latest FP8 approximate multiplier by 53.15\%. Based on this design, we construct an FP8 MAC unit and integrate it into both a convolution engine and a matrix-vector multiplication (MVM) unit. Finally, we integrate our design into a large language model (LLM). The result shows 61.5\% higher efficiency (TOPS/W) than the previous design, demonstrating the superiority of FANE in terms of performance and power efficiency. The code of FANE is available at https://github.com/hanbao04/FANE-FPGA-based-FP8-Approximate-Neural-Network-Engine.git",
author = "Han Bao and Shidi Tang and Jindong Li and Ahmed Sadaqa and Ruiqi Chen and Ming Ling and {da Silva}, Bruno",
note = "Publisher Copyright: {\textcopyright} 2026 IEEE.",
year = "2026",
month = jun,
doi = "10.1109/TVLSI.2026.3677683",
language = "English",
volume = "34",
pages = "2015--2019",
journal = "IEEE Transactions on Very Large Scale Integration (VLSI) Systems",
issn = "1063-8210",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "6",
}