article Drug Discovery

All that glitters is not gold: Importance of rigorous evaluation of proteochemometric models

Polina Avdiunina, Shamieraah Jamal, Filipp Gusev, Olexandr Isayev

2025

Highlight

Proteochemometric models (PCM) are used in computational drug discovery to leverage both protein and ligand representations for bioactivity prediction.

Abstract

Proteochemometric models (PCM) are used in computational drug discovery to leverage both protein and ligand representations for bioactivity prediction. While machine learning (ML) and deep learning (DL) have come to dominate PCMs, often serving as scoring functions, rigorous evaluation standards have not been applied consistently. In this study, using kinase-ligand bioactivity prediction as a model system, we highlight the critical roles of dataset curation, permutation testing, class imbalance, data splitting strategies, and embedding quality in determining model performance. Our findings indicate that data splitting and class imbalance are the factors that most strongly affect PCM performance, underscoring how difficult it is for ML/DL-PCMs to generalize. We evaluated various protein and ligand descriptors and embeddings, including those augmented with multiple sequence alignment (MSA) information. However, permutation testing consistently demonstrated that protein embeddings contributed minimally to PCM efficacy. This study advocates for the adoption of stringent evaluation standards to enhance the generalizability of models to out-of-distribution data and to improve benchmarking practices.
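The three evaluation checks named in the abstract (group-aware data splitting, permutation of the protein representation, and an imbalance-aware metric) can be illustrated with a minimal sketch. This is not the authors' code: the record layout (`kinase`, `ligand_vec`, `protein_vec`, `label`) and all function names are hypothetical, and the toy data exist only to make the example runnable. In a real permutation test, one would retrain the model on the shuffled records and compare its score with the unshuffled baseline; a negligible drop would suggest the protein features add little.

```python
import random
from collections import defaultdict


def group_split(records, key, test_fraction=0.25, seed=0):
    """Hold out whole groups (e.g. every row for a given kinase) for testing."""
    groups = sorted({key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if key(r) not in test_groups]
    test = [r for r in records if key(r) in test_groups]
    return train, test


def permute_protein_features(records, seed=0):
    """Shuffle protein vectors across rows; ligand vectors and labels stay put."""
    rng = random.Random(seed)
    proteins = [r["protein_vec"] for r in records]
    rng.shuffle(proteins)
    return [{**r, "protein_vec": p} for r, p in zip(records, proteins)]


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, which is not inflated by a dominant class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy data (assumption): 4 kinases x 3 ligands with arbitrary features.
records = [
    {"kinase": k, "ligand_vec": [float(i)],
     "protein_vec": [float(ord(k))], "label": i % 2}
    for k in "ABCD" for i in range(3)
]

train, test = group_split(records, key=lambda r: r["kinase"])
shuffled_train = permute_protein_features(train)

# A predictor that always answers "inactive" on a 9:1 imbalanced set
# reaches 90% plain accuracy but only 0.5 balanced accuracy.
print(balanced_accuracy([0] * 9 + [1], [0] * 10))
```

Splitting by kinase (or, analogously, by ligand scaffold) keeps every row for a held-out group out of training, which is one way to probe out-of-distribution behavior rather than interpolation within known targets.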

Keywords

machine learning, deep learning, drug discovery

Cite This Paper

@article{Avdiunina2025a,
  author = {Avdiunina, Polina and Jamal, Shamieraah and Gusev, Filipp and Isayev, Olexandr},
  title = {All that glitters is not gold: Importance of rigorous evaluation of proteochemometric models},
  year = {2025},
  doi = {10.26434/chemrxiv-2025-vbmgc},
  url = {http://dx.doi.org/10.26434/chemrxiv-2025-vbmgc},
  publisher = {American Chemical Society (ACS)},
  keywords = {machine learning, deep learning, drug discovery},
  researchAreas = {drug-discovery},
  highlight = {Proteochemometric models (PCM) are used in computational drug discovery to leverage both protein and ligand representations for bioactivity prediction.}
}
