Extracting Prime Protein Targets As Possible Drug Candidates: Machine Learning Evaluation

Subhagata Chattopadhyay; Nhat Phuong Do; Darren R. Flower; Amit K Chattopadhyay

doi:10.1007/s11517-023-02893-0

Subhagata Chattopadhyay, Nhat Phuong Do, Darren R. Flower, Amit K Chattopadhyay^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Extracting “high ranking” or “prime protein targets” (PPTs) as potent drug candidates against methicillin-resistant Staphylococcus aureus (MRSA) from a given set of ligands is a key challenge in efficient molecular docking. This study combines protein-versus-ligand matching data extracted from 10 independent molecular docking (MD) evaluations (ADFR, DOCK, Gemdock, Ledock, Plants, Psovina, Quickvina2, smina, vina, and vinaxb) to identify top MRSA drug candidates. Twenty-nine active protein targets (APTs) from the enhanced DUD-E repository (http://DUD-E.decoys.org) are matched against 1040 ligands using “forward modeling” machine learning for initial “data mining and modeling” (DDM) to extract PPTs and the corresponding high-affinity ligands (HALs). K-means clustering (KMC) is then performed on 400 ligands matched against the 29 protein targets (PTs), with each cluster accommodating HALs and the corresponding PPTs. The performance of KMC is then validated against randomly chosen head, tail, and middle active ligands (ALs). KMC outcomes are further validated against two other clustering methods, namely the Gaussian mixture model (GMM) and density-based spatial clustering of applications with noise (DBSCAN). While GMM gives results similar to those of KMC, DBSCAN fails to yield more than one cluster or to handle the noise (outliers), thus affirming the choice of KMC or GMM. Databases obtained from ADFR to mine PPTs are then ranked according to the number of corresponding HAL-PPT combinations (HPCs) inside the derived clusters, an approach called “reverse modeling” (RM). From the set of 29 PTs studied, RM predicts high fidelity for 5 PPTs (17%) that bind with 76 of the 400 ligands (19%), leading to a prediction of next-generation MRSA drug candidates: PPT2 (average HPC 41.1%) is the top choice, followed by PPT14 (average HPC 25.46%) and then PPT15 (average HPC 23.12%). This algorithm can be generically implemented irrespective of pathogenic forms and is particularly effective for sparse data.
Graphical Abstract: [Figure not available: see fulltext.]
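
Illustrative sketch (not the authors' pipeline): the abstract describes clustering docking results and then ranking protein targets by how many high-affinity ligand-target combinations fall inside the derived clusters. The standard-library Python below shows one simplified way such a ranking could be computed. Everything in it is an assumption made for illustration: the function names kmeans_1d and rank_targets are hypothetical, the scores are invented, the clustering is a plain one-dimensional k-means over pooled scores, and more negative scores are assumed to mean stronger binding. The GMM and DBSCAN comparisons mentioned in the abstract are not reproduced here.

```python
from collections import Counter


def kmeans_1d(values, k=3, max_iter=100):
    """Plain Lloyd's k-means on a list of floats (k >= 2).

    Returns (centroids, labels). Initial centroids are spread evenly over the
    sorted values, so the result is deterministic.
    """
    ordered = sorted(values)
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = sum(members) / len(members)
    return centroids, labels


def rank_targets(scores_by_target, k=3):
    """Rank targets by their share of pairs in the strongest-binding cluster.

    scores_by_target maps a target name to a list of docking scores, one per
    ligand. ASSUMPTION: more negative scores mean stronger binding.
    Returns [(target, pair_count, share_percent), ...], highest count first.
    Targets with no pair in the strongest cluster are omitted.
    """
    # Pool every (target, score) pair so all targets share one set of clusters.
    pairs = [(t, s) for t, vals in scores_by_target.items() for s in vals]
    centroids, labels = kmeans_1d([s for _, s in pairs], k=k)
    strongest = min(range(k), key=lambda c: centroids[c])
    counts = Counter(t for (t, _), lab in zip(pairs, labels) if lab == strongest)
    total = sum(counts.values())
    return [(t, n, 100.0 * n / total) for t, n in counts.most_common()]


# Invented scores for three hypothetical targets and six ligands each.
example_scores = {
    "PT1": [-5.1, -5.3, -4.9, -5.0, -9.8, -5.2],
    "PT2": [-9.9, -10.2, -9.7, -5.0, -10.0, -9.6],
    "PT3": [-7.1, -7.3, -6.9, -7.0, -7.2, -9.9],
}

for target, n_pairs, share in rank_targets(example_scores):
    print(f"{target}: {n_pairs} strong pairs, {share:.1f}% of the strongest cluster")
```

On this toy input the hypothetical target PT2 ranks first, holding 5 of the 7 pairs in the strongest cluster. The percentage here is a share of one cluster; it is only loosely analogous to the average HPC values reported in the abstract, whose exact definition is given in the full text.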

Original language	English
Pages (from-to)	3035-3048
Number of pages	14
Journal	Medical and Biological Engineering and Computing
Volume	61
Issue number	11
Early online date	23 Aug 2023
DOIs	https://doi.org/10.1007/s11517-023-02893-0
Publication status	Published - Nov 2023

Bibliographical note

Funding: Nhat Phuong Do received partial financial support from the Vietnam International Education Development (VIED), Decision No. 76/QD-BGDDT scholarship through the School of Pharmacy, Tra Vinh University, 126 Nguyen Thien Thanh Street, Ward 5, Tra Vinh City, Viet Nam.
Copyright © The Author(s) 2023. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Keywords

DBSCAN
DUD-E repository
Data mining
Drug design
Forward modeling
Gaussian mixture model
K-means clustering
Ligands
Machine learning (ML)
Molecular docking
Protein targets
Protein–ligand interaction
Reverse modeling

Access to Document

10.1007/s11517-023-02893-0 (Licence: CC BY 4.0)

extracting prime protein targets
Final published version, 1.18 MB (Licence: CC BY 4.0)

Cite this

@article{816f33a006604e3097b32861e321a64f,

title = "Extracting Prime Protein Targets As Possible Drug Candidates: Machine Learning Evaluation",

keywords = "DBSCAN, DUD-E repository, Data mining, Drug design, Forward modeling, Gaussian mixture model, K-means clustering, Ligands, Machine learning (ML), Molecular docking, Protein targets, Protein–ligand interaction, Reverse modeling",

author = "Subhagata Chattopadhyay and Do, {Nhat Phuong} and Flower, {Darren R.} and Chattopadhyay, {Amit K}",

year = "2023",

month = nov,

doi = "10.1007/s11517-023-02893-0",

language = "English",

volume = "61",

pages = "3035--3048",

journal = "Medical and Biological Engineering and Computing",

issn = "0140-0118",

publisher = "Springer",

number = "11",

}

TY - JOUR

T1 - Extracting Prime Protein Targets As Possible Drug Candidates: Machine Learning Evaluation

AU - Chattopadhyay, Subhagata

AU - Do, Nhat Phuong

AU - Flower, Darren R.

AU - Chattopadhyay, Amit K

PY - 2023/11

Y1 - 2023/11

KW - DBSCAN

KW - DUD-E repository

KW - Data mining

KW - Drug design

KW - Forward modeling

KW - Gaussian mixture model

KW - K-means clustering

KW - Ligands

KW - Machine learning (ML)

KW - Molecular docking

KW - Protein targets

KW - Protein–ligand interaction

KW - Reverse modeling

UR - https://link.springer.com/article/10.1007/s11517-023-02893-0

UR - http://www.scopus.com/inward/record.url?scp=85168603583&partnerID=8YFLogxK

U2 - 10.1007/s11517-023-02893-0

DO - 10.1007/s11517-023-02893-0

M3 - Article

C2 - 37608081

SN - 0140-0118

VL - 61

SP - 3035

EP - 3048

JO - Medical and Biological Engineering and Computing

JF - Medical and Biological Engineering and Computing

IS - 11

ER -
