CURRENT BIOINFORMATICS, vol.18, no.2, pp.154-169, 2023 (SCI-Expanded)
Background Due to overdispersion in the RNA-Seq data and its discrete structure, clustering samples based on gene expression profiles remains a challenging problem, and several clustering approaches have been developed so far. However, there is no "gold standard" strategy for clustering RNA-Seq data, so alternative approaches are needed. Objective In this study, we presented a new clustering approach, which incorporates two powerful methods, i.e., voom and self-organizing maps, into the frequently used clustering algorithms such as k-means, k-medoid and hierarchical clustering algorithms for RNA-seq data clustering. Methods We first filter and normalize the raw RNA-seq count data. Then to transform counts into continuous data, we apply the voom method, which outputs the log-cpm matrix and sample quality weights. After the voom transformation, we apply the SOM algorithm to log-cpm values to get the codebook used in the downstream analysis. Next, we calculate the weighted distance matrices using the sample quality weights obtained from voom transformation and codebooks from the SOM algorithm. Finally, we apply k-means, k-medoid and hierarchical clustering algorithms to cluster samples. Results The performances of the presented approach and existing methods are compared over simulated and real datasets. The results show that the new clustering approach performs similarly or better than other methods in the Rand index and adjusted Rand index. Since the voom method accurately models the observed mean-variance relationship of RNA-seq data and SOM is an efficient algorithm for modeling high dimensional data, integrating these two powerful methods into clustering algorithms increases the performance of clustering algorithms in overdispersed RNA-seq data. Conclusion The proposed algorithm, voomSOM, is an efficient and novel clustering approach that can be applied to RNA-Seq data clustering problems.