All That Glitters Is Not Gold: Importance of Rigorous Evaluation of Proteochemometric Models

Avdiunina, Polina; Jamal, Shamieraah; Gusev, Filipp; Isayev, Olexandr

doi:10.1021/acs.jcim.5c00395



Journal of Chemical Information and Modeling, Vol. 65 (19), pp. 10239–10252 (2025)


Abstract

Proteochemometric models (PCMs) are used in computational drug discovery to employ both protein and ligand representations jointly for bioactivity prediction. While machine learning (ML) and deep learning (DL) have come to dominate PCMs, often serving as a basis for scoring functions, rigorous evaluation standards have not always been consistently applied. In this study, using kinase-ligand bioactivity prediction as a model system, we highlight the critical roles that data set curation, permutation testing, class imbalances, data splitting strategies for mitigating plausible data leakage, and embedding quality play in determining model performance. Our findings indicate that data splitting and class imbalances are the most critical factors affecting PCM performance, emphasizing the challenges in the generalization ability of ML/DL-PCMs. We evaluated various protein–ligand descriptors and embeddings, including those augmented with multiple sequence alignment information. However, permutation testing consistently demonstrated that protein embeddings contributed minimally to PCM efficacy. This study advocates for the adoption of stringent evaluation standards to enhance the generalizability of models to out-of-distribution data and improve benchmarking practices.
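The permutation testing mentioned in the abstract can be illustrated with a minimal sketch. The paper's exact protocol is not given on this page, so the code below assumes one common variant: score a fitted model on intact protein-ligand pairs, then shuffle the protein inputs across pairs and score again. If the two scores are nearly equal, the model is probably relying on ligand information alone. The function name `protein_permutation_test`, the `predict(ligand, protein)` interface, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def protein_permutation_test(predict, ligands, proteins, labels,
                             n_permutations=100, seed=0):
    """Compare accuracy on intact pairs with accuracy after shuffling proteins.

    `predict(ligand, protein)` stands in for any already-fitted classifier
    (hypothetical interface). Returns (baseline_accuracy, mean_permuted_accuracy).
    """
    rng = random.Random(seed)

    def accuracy(prots):
        hits = [predict(l, p) == y for l, p, y in zip(ligands, prots, labels)]
        return sum(hits) / len(hits)

    baseline = accuracy(proteins)
    permuted_scores = []
    for _ in range(n_permutations):
        shuffled = list(proteins)
        rng.shuffle(shuffled)  # break the ligand-protein pairing
        permuted_scores.append(accuracy(shuffled))
    return baseline, mean(permuted_scores)


# Toy data (assumption): one scalar "descriptor" per ligand and per protein.
data_rng = random.Random(42)
ligands = [data_rng.random() for _ in range(200)]
proteins = [data_rng.random() for _ in range(200)]

# Scenario A: activity depends on the ligand only.
labels_ligand_only = [int(l > 0.5) for l in ligands]
# Scenario B: activity depends on both ligand and protein.
labels_joint = [int(l + p > 1.0) for l, p in zip(ligands, proteins)]


def ligand_only_model(ligand, protein):
    # Ignores the protein input entirely.
    return int(ligand > 0.5)


def joint_model(ligand, protein):
    # Uses both inputs.
    return int(ligand + protein > 1.0)


for name, model, labels in [
    ("ligand-only", ligand_only_model, labels_ligand_only),
    ("joint", joint_model, labels_joint),
]:
    base, perm = protein_permutation_test(model, ligands, proteins, labels)
    print(f"{name}: baseline={base:.3f}, permuted={perm:.3f}")
```

In this toy setup the ligand-only model scores identically before and after shuffling, because its predictions never depend on the protein input, while the joint model loses accuracy once the pairing is broken. A real evaluation would apply the same comparison to held-out data and to the metric used in the study.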

Keywords

proteochemometric models; machine learning; deep learning; kinase-ligand bioactivity prediction; data curation; permutation testing; class imbalances; data splitting

Cite This Paper

@article{Avdiunina2025,
  author = {Avdiunina, Polina and Jamal, Shamieraah and Gusev, Filipp and Isayev, Olexandr},
  title = {All That Glitters Is Not Gold: Importance of Rigorous Evaluation of Proteochemometric Models},
  year = {2025},
  journal = {Journal of Chemical Information and Modeling},
  volume = {65},
  number = {19},
  pages = {10239--10252},
  doi = {10.1021/acs.jcim.5c00395},
  url = {http://dx.doi.org/10.1021/acs.jcim.5c00395},
  publisher = {American Chemical Society (ACS)},
  keywords = {proteochemometric models, machine learning, deep learning, kinase-ligand bioactivity prediction, data curation, permutation testing, class imbalances, data splitting},
  researchAreas = {drug-discovery, ai-for-science},
  citations = {0}
}

Related Research Areas

Drug Discovery
