Inas Bosch, Barbara Gravel, Alexandre Renaux, Ann Nowé, Maris Laan, Tom Lenaerts

Contribution to journal

Abstract

Identifying the potential oligogenic causes of rare diseases remains a challenge, notwithstanding the advancements made in the last decade. While a variety of predictive and ranking approaches have been proposed, their precision remains limited, as only a small number of high-quality training cases are available and it remains difficult to know which features may be most relevant for the design of new predictors. We hypothesize here that structured biological information, which integrates various relevant biological networks and ontologies in a single heterogeneous knowledge graph, can make a difference, as it allows a relevant genetic representation to be learned through knowledge graph embedding (KGE) methods. An exhaustive benchmark is performed in which we assess the performance of various state-of-the-art embedding models on the task of identifying potentially pathogenic gene pairs. The results show that these KGE models provide highly accurate predictions, with an Area Under the Precision-Recall Curve of up to 0.93, a significant advancement over previous approaches for predicting gene pairs involved in oligogenic diseases. We show nonetheless that care needs to be taken in the cross-validation when using embeddings, as data leakage between folds in embedding space will produce overly optimistic results. Further evaluation of the methods on a holdout set, as well as on a group of new male infertility cases, shows that three Translational Distance models (TransE, MurE, and RotatE) and two of the Semantic Matching models (DistMult and QuatE) provide the best results. The analysis concludes by comparing all known gene combinations for these top-ranking models, examining their similarities and differences. Overall, KGE models provide a predictive advancement, but new steps will need to be taken to generate explanations as to why the pairs are relevant for oligogenic diseases.

Reference

Bosch, I, Gravel, B, Renaux, A, Nowé, A, Laan, M & Lenaerts, T 2026, 'Benchmarking knowledge graph embedding models for the prediction of oligogenic combinations', Briefings in Bioinformatics, vol. 27, no. 1, bbaf712. https://doi.org/10.1093/bib/bbaf712

Bosch, I., Gravel, B., Renaux, A., Nowé, A., Laan, M., & Lenaerts, T. (2026). Benchmarking knowledge graph embedding models for the prediction of oligogenic combinations. Briefings in Bioinformatics, 27(1), Article bbaf712. https://doi.org/10.1093/bib/bbaf712

@article{d6e70dd881a040219994dd4d15b7e9f8,
title = "Benchmarking knowledge graph embedding models for the prediction of oligogenic combinations",
abstract = "Identifying the potential oligogenic causes of rare diseases remains a challenge, notwithstanding the advancements made in the last decade. While a variety of predictive and ranking approaches have been proposed, their precision remains limited, as only a small number of high-quality training cases are available and it remains difficult to know which features may be most relevant for the design of new predictors. We hypothesize here that structured biological information, which integrates various relevant biological networks and ontologies in a single heterogeneous knowledge graph, can make a difference, as it allows a relevant genetic representation to be learned through knowledge graph embedding (KGE) methods. An exhaustive benchmark is performed in which we assess the performance of various state-of-the-art embedding models on the task of identifying potentially pathogenic gene pairs. The results show that these KGE models provide highly accurate predictions, with an Area Under the Precision-Recall Curve of up to 0.93, a significant advancement over previous approaches for predicting gene pairs involved in oligogenic diseases. We show nonetheless that care needs to be taken in the cross-validation when using embeddings, as data leakage between folds in embedding space will produce overly optimistic results. Further evaluation of the methods on a holdout set, as well as on a group of new male infertility cases, shows that three Translational Distance models (TransE, MurE, and RotatE) and two of the Semantic Matching models (DistMult and QuatE) provide the best results. The analysis concludes by comparing all known gene combinations for these top-ranking models, examining their similarities and differences. Overall, KGE models provide a predictive advancement, but new steps will need to be taken to generate explanations as to why the pairs are relevant for oligogenic diseases.",
keywords = "benchmarking, knowledge graph embeddings, oligogenic relations, pathogenicity prediction, algorithm, bioinformatics, genetics, human, procedures, rare disease",
author = "Inas Bosch and Barbara Gravel and Alexandre Renaux and Ann Now{\'e} and Maris Laan and Tom Lenaerts",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2026. Published by Oxford University Press.",
year = "2026",
month = jan,
day = "7",
doi = "10.1093/bib/bbaf712",
language = "English",
volume = "27",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "1",
}