
Dialogue: High-throughput studies in rheumatology: time for unsupervised clustering?
George Bertsias (1, 2)
  1. Rheumatology, Clinical Immunology and Allergy, University of Crete School of Medicine, Heraklion, Crete, Greece
  2. Laboratory of Rheumatology, Autoimmunity and Inflammation, Institute of Molecular Biology and Biotechnology (IMBB-FORTH), Heraklion, Greece
Correspondence to Dr George Bertsias; gbertsias{at}


In complex autoimmune rheumatic diseases, high-throughput technologies simultaneously analysing dozens, hundreds or thousands of biological cues (genes, metabolites, serum proteins etc) have long been considered valuable in obtaining unique pathogenic insights while facilitating the discovery of therapeutic targets and biomarkers for diagnosis, monitoring and prognosis.1

In the current issue of Lupus Science and Medicine, Brunekreef et al2 used a custom chip-based microarray to probe serum samples for a total of 57 known and new IgG autoantibodies and explore their diagnostic utility in SLE. By comparing the prevalence of each autoantibody in 483 patients with SLE and 1397 disease controls (including 361 healthy individuals), they found that anti-double-stranded DNA (anti-dsDNA) antibodies and antibodies against cytosine-phosphate-guanine (anti-CpG) DNA motifs best discriminated SLE from the control groups, with corresponding area under the receiver operating characteristic curve (AUC) values of 0.800 and 0.756, respectively.2 Notably, 15.1% of patients with SLE who were negative for anti-dsDNA tested positive for anti-CpG DNA antibodies, suggesting added diagnostic value. Although the exact specificity of CpG-targeting antibodies was not explored and some cross-reactivity with anti-dsDNA antibodies cannot be entirely excluded, the results are biologically plausible given the abundance of nucleic acids containing unmethylated or hypomethylated CpG DNA in SLE.3–5
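As a rough illustration of the two quantities discussed above, the following Python sketch computes an AUC as the probability that a randomly chosen case has a higher signal than a randomly chosen control (ties counted as one half), and the fraction of cases negative for a first marker that are positive for a second. The function names and all values are hypothetical and made up for illustration; they are not the data or the analysis code of Brunekreef et al.

```python
from itertools import product


def auc(case_scores, control_scores):
    """AUC as the probability that a random case outscores a random control.

    Ties contribute 0.5, which matches the rank-based (Mann-Whitney) definition.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for case, control in product(case_scores, control_scores):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def added_positive_fraction(first_marker_positive, second_marker_positive):
    """Among cases negative for the first marker, the fraction positive for the second."""
    second_among_negative = [
        second
        for first, second in zip(first_marker_positive, second_marker_positive)
        if not first
    ]
    if not second_among_negative:
        raise ValueError("no cases are negative for the first marker")
    return sum(second_among_negative) / len(second_among_negative)


# Made-up antibody signal intensities (assumption, for illustration only).
sle_signal = [0.9, 0.7, 0.4, 0.8]
control_signal = [0.2, 0.5, 0.4, 0.1]
example_auc = auc(sle_signal, control_signal)

# Made-up positivity calls for five hypothetical patients with SLE.
anti_dsdna_positive = [True, False, False, False, True]
anti_cpg_positive = [True, True, False, False, False]
added_fraction = added_positive_fraction(anti_dsdna_positive, anti_cpg_positive)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, which is the scale on which the reported values of 0.800 and 0.756 should be read.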

Pending further standardisation of the CpG DNA detection methods and validation of these findings, certain methodological aspects of this work merit discussion. First, patients were designated as SLE or other disease/condition by a text mining algorithm that searched for pre-specified disease-related or symptom-related keywords in retrospectively collected electronic health records. Although such strategies are generally considered valid and advantageous for large datasets,6 the algorithm-assigned diagnoses were not verified against existing classification criteria or by other means. This might account for the lower-than-expected frequency of anti-nuclear antibodies (19 out of 147 first samples tested negative) in patients with SLE, and also for the fact that about 30% of all patients received more than one diagnosis.

Second, the researchers assigned patients without SLE to multiple control groups: one with mild, non-specific symptoms resembling healthy controls, a second with lupus-like (or incomplete lupus) presentations (eg, arthritis, nephritis, serositis) and a third with an autoimmune disease other than SLE.2 Although this might reflect the ‘real-life’ situation where patients do not always fit into exact diagnostic entities, one should consider that autoimmune rheumatic diseases like SLE often develop gradually over time; therefore, some of the disease controls might represent early (or pre-) lupus forms.7 8 This is also supported by the between-group differences in the prevalence of autoantibodies reported by the authors.2

These complexities in the definition and phenotypic heterogeneity of autoimmune rheumatic disorders raise the question of how we can best use high-throughput studies and big data for disease diagnosis/classification and risk stratification. To date, the majority of studies have employed a conventional, ‘supervised’-type approach to analyse biological (input) data which are tagged with pre-specified (output) ‘labels’ (diagnostic or endophenotypic groups). Although this method is straightforward and can yield accurate classification results, especially following implementation of sophisticated machine learning tools,9–11 it depends heavily on the accuracy of the available diagnostic information (considered to be ‘ground truth’) and on the pre-existing grouping of the dataset. When we have no accurate prior knowledge of the diagnostic groups for the samples, or when the output is not really ‘yes or no’ (eg, SLE or not) but rather behaves as a continuum of states (eg, ranging from healthy through pre-lupus and mild lupus to severe lupus), unsupervised clustering (or learning) might represent a more suitable solution.

Indeed, these computational methods require no preconceived assumptions, work with unlabelled data and infer the inherent structure present within a dataset.10 12 Accordingly, they are useful for recognising hidden patterns or combinations of biological data, thereby providing a natural clustering of complex-structured samples. Interpretability of the resulting clusters and characterisation of their distinctive features in a compact form may require additional steps as part of a decision-making process;13 nonetheless, unsupervised approaches move closer to the current concept of revisiting autoimmune rheumatic diseases based on the underlying molecular taxonomy.14
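To make the idea of clustering without labels concrete, here is a minimal Python sketch of k-means, which is only one of many unsupervised methods and is chosen here purely for illustration; the editorial does not prescribe a particular algorithm. The two-feature "profiles", the number of clusters and the function name are all assumptions, and the values are made up. No diagnostic labels enter the computation; groups emerge from the distances between samples alone.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain k-means on tuples of floats; returns one cluster index per point."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from the centres chosen so far.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centres))
        )

    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centre to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment


# Hypothetical normalised signals for two autoantibodies per sample
# (made-up values, not data from any study).
profiles = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.15),  # low on both
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.95),  # high on both
    (0.20, 0.90), (0.10, 0.80),                # high on the second only
]
labels = kmeans(profiles, k=3)
```

In practice the number of clusters is itself unknown and has to be chosen and validated, and the resulting groups still need clinical interpretation, which is the additional step noted above.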

To this end, high-throughput studies such as that by Brunekreef et al2 represent notable contributions to the diagnostics of rheumatic diseases and the identification of sub-phenotypes with possibly distinct underlying pathophysiology. With accruing experience in the analysis of big data, the community should gradually move towards implementing less biased classification methods to ultimately ‘let the data speak for themselves’.

Ethics statements

Patient consent for publication

Ethics approval

This study does not involve human participants.



  • Contributors I am the sole contributor of this article (Editorial).

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Provenance and peer review Commissioned; internally peer reviewed.