Visualization of molecular fingerprints

John R. Owen; Ian T. Nabney; José L. Medina-Franco; Fabian López-Vallejo

doi:10.1021/ci1004042

Computer Science Research Group

Research output: Contribution to journal › Article › peer-review

Abstract

A visualization plot of a data set of molecular data is a useful tool for gaining insight into a set of molecules. In chemoinformatics, most visualization plots are of molecular descriptors, and the statistical model most often used to produce a visualization is principal component analysis (PCA). This paper takes PCA, together with four other statistical models (NeuroScale, GTM, LTM, and LTM-LIN), and evaluates their ability to produce clustering in visualizations not of molecular descriptors but of molecular fingerprints. Two different tasks are addressed: understanding structural information (particularly combinatorial libraries) and relating structure to activity. The quality of the visualizations is compared both subjectively (by visual inspection) and objectively (with global distance comparisons and local k-nearest-neighbor predictors). On the data sets used to evaluate clustering by structure, LTM is found to perform significantly better than the other models. In particular, the clusters in LTM visualization space are consistent with the relationships between the core scaffolds that define the combinatorial sublibraries. On the data sets used to evaluate clustering by activity, LTM again gives the best performance but by a smaller margin. The results of this paper demonstrate the value of using both a nonlinear projection map and a Bernoulli noise model for modeling binary data.
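As a rough illustration of the "local k-nearest-neighbor predictors" mentioned in the abstract, the sketch below scores a two-dimensional visualization by leave-one-out k-NN accuracy: each projected point is labeled by majority vote among its k nearest neighbors, and the fraction of correct labels is reported. It is a minimal sketch and not the paper's implementation. The function name `knn_loo_accuracy`, the toy coordinates, and the class labels are all assumptions made for the example; in practice the points would come from a projection such as PCA, GTM, or LTM.

```python
from collections import Counter
from math import dist


def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out k-NN accuracy for points in a visualization space.

    Hypothetical helper for illustration; not taken from the paper.
    """
    correct = 0
    for i, p in enumerate(points):
        # Sort every other projected point by Euclidean distance to point i.
        others = sorted(
            (dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # Majority vote among the k closest neighbors.
        votes = Counter(labels[j] for _, j in others[:k])
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / len(points)


# Toy 2-D coordinates standing in for projected fingerprints (assumed data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["A", "A", "A", "B", "B", "B"]
score = knn_loo_accuracy(points, labels, k=2)
```

A higher score suggests that points sharing a label (for example, a sublibrary or an activity class) sit close together in the plot, which is the sense in which a local predictor can compare visualizations objectively.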

Original language	English
Pages (from-to)	1552-1563
Number of pages	12
Journal	Journal of Chemical Information and Modeling
Volume	51
Issue number	7
DOIs	https://doi.org/10.1021/ci1004042
Publication status	Published - 22 Jun 2011

Keywords

combinatorial chemistry techniques
drug discovery
statistical models
molecular structure
principal component analysis
small molecule libraries

Cite this

@article{12232965e9a748748551fe37be78cc27,

title = "Visualization of molecular fingerprints",

abstract = "A visualization plot of a data set of molecular data is a useful tool for gaining insight into a set of molecules. In chemoinformatics, most visualization plots are of molecular descriptors, and the statistical model most often used to produce a visualization is principal component analysis (PCA). This paper takes PCA, together with four other statistical models (NeuroScale, GTM, LTM, and LTM-LIN), and evaluates their ability to produce clustering in visualizations not of molecular descriptors but of molecular fingerprints. Two different tasks are addressed: understanding structural information (particularly combinatorial libraries) and relating structure to activity. The quality of the visualizations is compared both subjectively (by visual inspection) and objectively (with global distance comparisons and local k-nearest-neighbor predictors). On the data sets used to evaluate clustering by structure, LTM is found to perform significantly better than the other models. In particular, the clusters in LTM visualization space are consistent with the relationships between the core scaffolds that define the combinatorial sublibraries. On the data sets used to evaluate clustering by activity, LTM again gives the best performance but by a smaller margin. The results of this paper demonstrate the value of using both a nonlinear projection map and a Bernoulli noise model for modeling binary data.",

keywords = "combinatorial chemistry techniques, drug discovery, statistical models, molecular structure, principal component analysis, small molecule libraries",

author = "Owen, {John R.} and Nabney, {Ian T.} and Medina-Franco, {Jos{\'e} L.} and L{\'o}pez-Vallejo, Fabian",

year = "2011",

month = jun,

day = "22",

doi = "10.1021/ci1004042",

language = "English",

volume = "51",

pages = "1552--1563",

journal = "Journal of Chemical Information and Modeling",

issn = "1549-9596",

publisher = "American Chemical Society",

number = "7",

}

TY - JOUR

T1 - Visualization of molecular fingerprints

AU - Owen, John R.

AU - Nabney, Ian T.

AU - Medina-Franco, José L.

AU - López-Vallejo, Fabian

PY - 2011/6/22

Y1 - 2011/6/22

AB - A visualization plot of a data set of molecular data is a useful tool for gaining insight into a set of molecules. In chemoinformatics, most visualization plots are of molecular descriptors, and the statistical model most often used to produce a visualization is principal component analysis (PCA). This paper takes PCA, together with four other statistical models (NeuroScale, GTM, LTM, and LTM-LIN), and evaluates their ability to produce clustering in visualizations not of molecular descriptors but of molecular fingerprints. Two different tasks are addressed: understanding structural information (particularly combinatorial libraries) and relating structure to activity. The quality of the visualizations is compared both subjectively (by visual inspection) and objectively (with global distance comparisons and local k-nearest-neighbor predictors). On the data sets used to evaluate clustering by structure, LTM is found to perform significantly better than the other models. In particular, the clusters in LTM visualization space are consistent with the relationships between the core scaffolds that define the combinatorial sublibraries. On the data sets used to evaluate clustering by activity, LTM again gives the best performance but by a smaller margin. The results of this paper demonstrate the value of using both a nonlinear projection map and a Bernoulli noise model for modeling binary data.

KW - combinatorial chemistry techniques

KW - drug discovery

KW - statistical models

KW - molecular structure

KW - principal component analysis

KW - small molecule libraries

UR - http://www.scopus.com/inward/record.url?scp=79960730867&partnerID=8YFLogxK

UR - http://pubs.acs.org/doi/abs/10.1021/ci1004042

U2 - 10.1021/ci1004042

DO - 10.1021/ci1004042

M3 - Article

C2 - 21696145

SN - 1549-9596

VL - 51

SP - 1552

EP - 1563

JO - Journal of Chemical Information and Modeling

JF - Journal of Chemical Information and Modeling

IS - 7

ER -
