Background Single-cell RNA-Seq (scRNA-Seq) has the potential to increase our understanding of cell populations in lupus. Recently, kidney scRNA-Seq data from lupus nephritis (LN) patients have provided the opportunity to determine the heterogeneity of cells within the affected kidney. However, since individual cells were not identified phenotypically, populations must be identified computationally. The unique technical challenges of scRNA-Seq data make it difficult to approach this analysis with conventional unsupervised bioinformatics techniques. Natural language processing (NLP)-inspired techniques, however, make it possible to identify meaningful clusters of cells without prior knowledge of the cell types present in the sample.
Methods We have developed a recursive, unsupervised, heuristic technique (StarShip™) to dynamically perform top-down, divisive clustering on scRNA-Seq data. First, the cells are mapped onto an n-dimensional unit sphere, where n is the number of available genes. The angle θ between each pair of cells is used to construct a cosine distance metric: 1 - cos(θ). The cosine distance is used to carry out k-means or k-medoids clustering, with k set to 2 for each iteration. At each split of the data, the algorithm evaluates whether it has sorted the remaining cells into meaningful populations and stops making splits when a user-defined criterion is met. Once all clusters are finalized, a Mann-Whitney U test determines genes that distinguish clusters or groups of clusters from other cells. This algorithm was validated using publicly available peripheral blood mononuclear cell (PBMC) scRNA-Seq data from 10X Genomics and tested on scRNA-Seq data from LN patients from the NIAMS AMP RA/SLE initiative. The Adjusted Rand Index (ARI) was used to compare generated partitions to known cell types in the PBMC data.
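The recursive splitting described in the Methods can be illustrated with a minimal sketch. It is not the StarShip implementation: the initialization (a farthest-first pair), the stopping rule (stop when the mean cosine distance to the cluster centroid is at most `max_spread`, or when the cluster has fewer than `min_size` cells), the function names, and the toy count vectors are all assumptions made for illustration only.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_distance(u, v):
    """Cosine distance 1 - cos(theta) between two unit vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

def farthest_first_pair(points):
    """Assumed initialization: the point farthest from the centroid,
    then the point farthest from that first point."""
    centroid = normalize([sum(col) for col in zip(*points)])
    first = max(points, key=lambda p: cosine_distance(p, centroid))
    second = max(points, key=lambda p: cosine_distance(p, first))
    return [first, second]

def spherical_2means(points, n_iter=100):
    """Split unit vectors into two groups with spherical k-means (k = 2)."""
    centers = farthest_first_pair(points)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            0 if cosine_distance(p, centers[0]) <= cosine_distance(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # The normalized sum is the centroid direction on the sphere.
                centers[k] = normalize([sum(col) for col in zip(*members)])
    return labels

def divisive_cluster(cells, min_size=4, max_spread=0.05):
    """Recursively bisect cells; returns a list of clusters (lists of cell indices).
    min_size and max_spread form a hypothetical user-defined stopping criterion."""
    points = [normalize(c) for c in cells]

    def split(indices):
        subset = [points[i] for i in indices]
        centroid = normalize([sum(col) for col in zip(*subset)])
        spread = sum(cosine_distance(p, centroid) for p in subset) / len(subset)
        if len(subset) < min_size or spread <= max_spread:
            return [indices]
        labels = spherical_2means(subset)
        left = [i for i, lab in zip(indices, labels) if lab == 0]
        right = [i for i, lab in zip(indices, labels) if lab == 1]
        if not left or not right:  # no useful split was found
            return [indices]
        return split(left) + split(right)

    return split(list(range(len(cells))))

# Toy data (not from the study): 15 "cells" over 3 "genes", with each
# group of five cells dominated by a different gene.
cells = (
    [[20 + j, 1, j % 2] for j in range(5)]
    + [[j % 2, 20 + j, 1] for j in range(5)]
    + [[1, j % 2, 20 + j] for j in range(5)]
)
clusters = divisive_cluster(cells)
print(len(clusters), "clusters:", clusters)
```

Because every split uses k = 2 and the recursion ends only when the stopping rule is satisfied, the number of final clusters is not fixed in advance; in this sketch it depends entirely on the chosen `max_spread` and `min_size`.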
Results StarShip™ was used to classify 250 PBMCs (50 each of CD14 monocytes, CD19 B cells, CD4 helper T cells, CD8 T cells, and CD56 NK cells). Using dynamic spherical k-means, 6 clusters were generated that closely corresponded to the known cell types (figure 1). For comparison, hierarchical clustering and one-off spherical k-means with k set to 5 were carried out. Hierarchical clustering had an ARI of 0.45, one-off spherical k-means had an ARI of 0.89, and dynamic spherical k-means had an ARI of 0.86.
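For readers unfamiliar with the ARI used in this comparison, the following sketch computes it from the contingency counts of two labelings using the standard pair-counting formula. The function name and the toy labels are illustrative assumptions; they are not the study's code or data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Pairs of items that share a cluster in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (index - expected) / (max_index - expected)

# Toy labels (not from the study): the second "cell type" is over-split.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 0, 1, 2]
print(adjusted_rand_index(toy_true, toy_pred))  # 4/7, about 0.571
```

The ARI is 1 for identical partitions regardless of how clusters are labeled and is close to 0 for partitions that agree only by chance. A partition with more clusters than the reference, as in the toy example, is penalized but can still score well.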
Conclusions This method can effectively partition unknown cells from scRNA-Seq data sets into biologically relevant clusters without prior knowledge of the number of cell types present. The similarity between the performance of the StarShip™ algorithm and one-off k-means, which does incorporate this prior knowledge, highlights the value of this dynamic technique. A full analysis of the AMP LN data is forthcoming.
Acknowledgments Research supported by the RILITE Foundation.