Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods

Monika Krzak; Yordan Raykov; Alexios Boukouvalas; Luisa Cutillo; Claudia Angelini

doi:10.3389/fgene.2019.01253

College of Engineering and Physical Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Single-cell RNA-seq (scRNAseq) is a powerful tool for studying the heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and impact of these choices in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and to describe the range of possibilities offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in dimensionality, number of cell populations, and level of noise. Remarkably, the results presented here show that the great variability in the performance of the models is strongly attributable to the choice of user-specified parameter settings. We describe several performance trends associated with the methods' modes of usage and with different types of datasets, and identify the methods whose computational time is strongly affected by data dimensionality. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.

Original language	English
Article number	1253
Journal	Frontiers in Genetics
Volume	10
DOIs	https://doi.org/10.3389/fgene.2019.01253
Publication status	Published - 11 Dec 2019

Bibliographical note

Copyright: © 2019 Krzak, Raykov, Boukouvalas, Cutillo and Angelini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Keywords

benchmark
clustering methods
high-dimensional data analysis
parameter sensitivity analysis
single-cell RNA-seq

Access to Document

10.3389/fgene.2019.01253 (Licence: CC BY 3.0)

RNA Sequencing
Final published version, 8.03 MB (Licence: CC BY 3.0)

Cite this

@article{37b670ed60a74f6c8bfcca520a5afa92,

title = "Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods",

abstract = "Single-cell RNA-seq (scRNAseq) is a powerful tool for studying the heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and impact of these choices in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and to describe the range of possibilities offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in dimensionality, number of cell populations, and level of noise. Remarkably, the results presented here show that the great variability in the performance of the models is strongly attributable to the choice of user-specified parameter settings. We describe several performance trends associated with the methods' modes of usage and with different types of datasets, and identify the methods whose computational time is strongly affected by data dimensionality. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.",

keywords = "benchmark, clustering methods, high-dimensional data analysis, parameter sensitivity analysis, single-cell RNA-seq",

author = "Monika Krzak and Yordan Raykov and Alexios Boukouvalas and Luisa Cutillo and Claudia Angelini",

note = "Copyright: {\textcopyright} 2019 Krzak, Raykov, Boukouvalas, Cutillo and Angelini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. ",

year = "2019",

month = dec,

day = "11",

doi = "10.3389/fgene.2019.01253",

language = "English",

volume = "10",

journal = "Frontiers in Genetics",

issn = "1664-8021",

publisher = "Frontiers Media S.A.",

}

TY - JOUR

T1 - Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods

AU - Krzak, Monika

AU - Raykov, Yordan

AU - Boukouvalas, Alexios

AU - Cutillo, Luisa

AU - Angelini, Claudia

N1 - Copyright: © 2019 Krzak, Raykov, Boukouvalas, Cutillo and Angelini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

PY - 2019/12/11

Y1 - 2019/12/11

AB - Single-cell RNA-seq (scRNAseq) is a powerful tool for studying the heterogeneity of cells. Recently, several clustering-based methods have been proposed to identify distinct cell populations. These methods are based on different statistical models and usually require several additional steps, such as preprocessing or dimension reduction, before the clustering algorithm is applied. Individual steps are often controlled by method-specific parameters, permitting the method to be used in different modes on the same datasets, depending on the user's choices. The large number of possibilities that these methods provide can intimidate non-expert users, since the available choices are not always clearly documented. In addition, to date, no large studies have investigated the role and impact of these choices in different experimental contexts. This work aims to provide new insights into the advantages and drawbacks of scRNAseq clustering methods and to describe the range of possibilities offered to users. In particular, we provide an extensive evaluation of several methods with respect to different modes of usage and parameter settings by applying them to real and simulated datasets that vary in dimensionality, number of cell populations, and level of noise. Remarkably, the results presented here show that the great variability in the performance of the models is strongly attributable to the choice of user-specified parameter settings. We describe several performance trends associated with the methods' modes of usage and with different types of datasets, and identify the methods whose computational time is strongly affected by data dimensionality. Finally, we highlight some open challenges in scRNAseq data clustering, such as those related to the identification of the number of clusters.

KW - benchmark

KW - clustering methods

KW - high-dimensional data analysis

KW - parameter sensitivity analysis

KW - single-cell RNA-seq

UR - https://www.frontiersin.org/articles/10.3389/fgene.2019.01253/abstract

UR - http://www.scopus.com/inward/record.url?scp=85077294092&partnerID=8YFLogxK

U2 - 10.3389/fgene.2019.01253

DO - 10.3389/fgene.2019.01253

M3 - Article

SN - 1664-8021

VL - 10

JO - Frontiers in Genetics

JF - Frontiers in Genetics

M1 - 1253

ER -
