PPI prediction from sequences via transfer learning on balanced but yet biased datasets: an open problem
DATE: 2024
UNIVERSAL IDENTIFIER: http://hdl.handle.net/11093/7493
UNESCO SUBJECT: 3314.99 Others
DOCUMENT TYPE: conferenceObject
ABSTRACT
Computational approaches to Protein-Protein Interaction (PPI) prediction, particularly methods that predict interactions from
amino acid sequences alone, are of paramount interest. In this study, we evaluated the suitability of pre-trained protein sequence embeddings, namely ProtBert
and SeqVec, as feature extractors for classical machine learning algorithms. Consistent with recent reports, we found that performance metrics computed over
random train-test splits of balanced PPI datasets, as in holdout or cross-validation, are highly overestimated, mainly because of a non-evident bias present in such datasets.
We demonstrate this bias on two PPI datasets using 5-fold cross-validation, which yields relatively high values for most
tested models, including a custom baseline model, named PPIIBM, that predicts the interaction status solely from the a priori positivity of proteins
observed in the train split. This baseline achieves results similar to those of
state-of-the-art models, including those based on deep learning, showing that predicting PPIs from sequences remains an open challenge
in which careful validation pipelines should be implemented.
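
The abstract does not spell out how PPIIBM is computed, so the following is only a minimal sketch of one plausible reading of a positivity-based baseline, not the authors' implementation. It assumes that a protein's "a priori positivity" is the fraction of its training pairs labelled as interacting, that a pair is scored by the mean positivity of its two proteins, and that proteins unseen in training receive a neutral 0.5. The function names, the threshold, and the toy protein identifiers are all hypothetical.

```python
from collections import defaultdict


def fit_positivity(train_pairs):
    """Return {protein: fraction of its training pairs labelled positive}.

    Assumption: this ratio stands in for the 'a priori positivity'
    mentioned in the abstract; the paper may define it differently.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for protein_a, protein_b, label in train_pairs:
        for protein in (protein_a, protein_b):
            totals[protein] += 1
            positives[protein] += label
    return {protein: positives[protein] / totals[protein] for protein in totals}


def predict_pair(positivity, protein_a, protein_b, default=0.5):
    """Score a pair as the mean positivity of its proteins (no sequence used).

    Proteins absent from the train split fall back to `default`
    (a hypothetical choice). A score of at least 0.5 is called positive.
    """
    score = (positivity.get(protein_a, default) + positivity.get(protein_b, default)) / 2
    return score, int(score >= 0.5)


# Toy training split: (protein A, protein B, interaction label).
train = [
    ("P1", "P2", 1),
    ("P1", "P3", 1),
    ("P4", "P5", 0),
    ("P4", "P6", 0),
]

positivity = fit_positivity(train)
print(predict_pair(positivity, "P1", "P7"))  # P1 only appears in positive pairs
print(predict_pair(positivity, "P4", "P5"))  # both only appear in negative pairs
```

Under these assumptions the baseline never inspects an amino acid sequence. If such a model scores well under random splits, the score reflects which proteins recur across train and test pairs rather than learned interaction biology.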
Files in this item
- Name: 2024_lopez_ppi_predictions.pdf
- Size: 575.9Kb
- Format:
- Description: Indefinite embargo due to copyright