Alignment-Free Probabilistic Proteomics: Patterns to Functionality

  • Ewa Grela

Student thesis: Doctoral Thesis (Doctor of Philosophy)


Major Histocompatibility Complex class I (MHC I) proteins, known in humans as Human Leukocyte Antigen class I (HLA I), are responsible for presenting antigens to T lymphocytes. MHC molecules interact with T Cell Receptors (TCRs) and serve as crucial immune regulators in vertebrates. The three main sub-classes of HLA class I proteins (HLA-A, HLA-B, HLA-C) are encoded at three different loci. Because genes within MHC class I are co-dominant, an individual can therefore have up to six different HLA class I alleles represented on the surface of their cells. The genetic diversity of HLA class I in the human population can be linked to differences in immunological response.

Using a combination of established bioinformatic and machine learning tools, we analysed an HLA class I protein data set in order to determine the proteins' ability to bind specific antigens. To achieve this, we created three-dimensional models of HLA class I variants using homology modelling techniques. The models were then placed in three-dimensional grids in order to calculate the electrostatic fields around the protein domains. The resulting multi-dimensional data were analysed with unsupervised machine learning techniques: linear Principal Component Analysis (PCA) and two nonlinear methods, the auto-encoder neural network (NLPCA) and the Gaussian Process Latent Variable Model (GPLVM). These methods succeeded in distinguishing between the HLA protein sub-classes (A, B and C). In addition, the results obtained with GPLVM dimensionality reduction suggested that the electrostatic potential calculation may add information needed to identify HLA super-types. However, this method is not robust enough to be conclusive on its own.
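The grid step described above can be illustrated with a minimal sketch. It assumes a plain Coulomb sum over point charges in a uniform dielectric, which is a simplification chosen for illustration and not necessarily the electrostatics calculation used in the thesis. The function names, the distance cut-off and the dielectric value are all hypothetical.

```python
import math

def grid_points(origin, spacing, shape):
    """Return the coordinates of a regular 3D grid as a flat list of (x, y, z) tuples."""
    ox, oy, oz = origin
    nx, ny, nz = shape
    return [
        (ox + i * spacing, oy + j * spacing, oz + k * spacing)
        for i in range(nx)
        for j in range(ny)
        for k in range(nz)
    ]

def coulomb_potential(charges, points, dielectric=80.0, min_distance=1.0):
    """Sum q / (dielectric * d) over all charges at each grid point.

    charges: list of ((x, y, z), q) pairs (assumed partial charges).
    min_distance: hypothetical cut-off that avoids division by zero
    when a grid point coincides with a charge.
    The result is in arbitrary units (charge per unit length).
    """
    values = []
    for point in points:
        total = 0.0
        for position, q in charges:
            d = max(math.dist(point, position), min_distance)
            total += q / (dielectric * d)
        values.append(total)
    return values

# Illustrative input: two opposite unit charges, 2 x 2 x 2 grid.
charges = [((0.0, 0.0, 0.0), 1.0), ((3.0, 0.0, 0.0), -1.0)]
points = grid_points(origin=(0.0, 0.0, 0.0), spacing=1.5, shape=(2, 2, 2))
feature_vector = coulomb_potential(charges, points)
```

Under these assumptions, each protein model yields one flattened vector of grid values, and stacking the vectors for all variants gives the kind of multi-dimensional matrix that PCA or a nonlinear method can then reduce.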

Sequence alignment methods are not free from assumptions. The results they provide are influenced by the choice of substitution matrix, which assigns numerical values to the differences between the primary structures of the compared biomolecules. The growth in the number of known sequences, driven by the development of Next Generation Sequencing techniques, has created an additional challenge: the computational time required.

As an alternative to sequence alignment, we implemented methods from time series analysis, information theory, chaos theory and statistical physics to translate the information in amino acid sequences into numerical vectors, in order to predict similarity in protein structure and function.

We transformed a data set of 9693 amino acid sequences belonging to 100 protein families by replacing each amino acid with numerical values representing its physicochemical and biochemical properties. From these numerical series we calculated multiple multidimensional vectors of alignment-free protein descriptors, using measures such as approximate and sample entropy, persistence, and the Hurst and Lyapunov exponents. Linear Discriminant Analysis, a supervised learning technique, was used to assess the ability of the developed protocols to correctly assign proteins to their functional groups, and showed an efficiency of over 99% in the best case.
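As an illustration of one such descriptor, the sketch below encodes a sequence numerically and computes its sample entropy. The abstract does not state which property scales or parameters were used, so the Kyte-Doolittle hydropathy scale, the template length m = 2 and the tolerance r = 0.2 standard deviations are assumptions made here, and the function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy index (an assumed choice of property scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(sequence, scale=KYTE_DOOLITTLE):
    """Replace each amino acid letter with its numerical property value."""
    return [scale[residue] for residue in sequence.upper()]

def sample_entropy(series, m=2, r=None):
    """Sample entropy SampEn(m, r) = -ln(A / B).

    B counts pairs of length-m templates whose Chebyshev distance is <= r,
    A counts the same for length m + 1; self-matches are excluded.
    If r is None, it defaults to 0.2 times the standard deviation.
    """
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")
    if r is None:
        mean = sum(series) / n
        r = 0.2 * math.sqrt(sum((x - mean) ** 2 for x in series) / n)

    def count_matches(length):
        # The same n - m starting positions are used for both lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                distance = max(abs(series[i + k] - series[j + k]) for k in range(length))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no template matches are found
    return -math.log(a / b)

# Illustrative sequence, not taken from the thesis data set.
descriptor = sample_entropy(encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```

Repeating this for several property scales and several measures would give one descriptor vector per protein, which a classifier such as Linear Discriminant Analysis could then take as input.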
Date of Award: Dec 2022
Original language: English
Supervisor: Amit Chattopadhyay (Supervisor), Darren R. Flower (Supervisor), Michael Stich (Supervisor) & Juan Neirotti (Supervisor)
