We developed scDIOR for single-cell data IO between R and Python. It contains two modules, dior and diopy, and unifies the data formats used by different single-cell analytics platforms.
We evaluated the performance of the available software for data transformation between R and Python. We used the Cao2019 dataset [3], with 1,943,759 cells and 26,157 genes, as the standard input. The data were converted into four formats: ‘.mtx’ (Matrix Market format for a COOrdinate sparse matrix), ‘.rds’ (R object for Seurat), ‘.h5ad’ (HDF5 file for Scanpy) and ‘.h5’ (HDF5 file for scDIOR). The corresponding file sizes are 18 GB, 3.4 GB, 9.5 GB and 9.5 GB. The ‘.rds’ and ‘.h5ad’ formats were designed for R and Python, respectively, while ‘.h5’ can be used from both languages. In the R environment, we recorded the IO speed and peak memory cost of ‘.h5’, ‘.rds’ and ‘.mtx’. Reading ‘.h5’ was about 1.5 times faster than reading ‘.mtx’ (Fig. 2a, top panel). Reading ‘.rds’ was much faster than reading ‘.h5’ or ‘.mtx’ because the ‘.rds’ file is much smaller. However, writing ‘.h5’ was much faster than writing ‘.rds’ or ‘.mtx’. Note that the memory consumption of reading ‘.h5’ is similar to that of ‘.mtx’, suggesting that the HDF5 package for R could be optimized in the future (Fig. 2a, b, bottom panel). In the Python environment, we tested the IO speed and memory cost of ‘.h5’, ‘.h5ad’ and ‘.mtx’. The reading and writing times of ‘.h5’ and ‘.h5ad’ were similar and much shorter than those of ‘.mtx’ (Fig. 2c, d, top panel). In addition, reading and writing ‘.h5’ and ‘.h5ad’ consumed only half as much memory as ‘.mtx’ (Fig. 2c, d, bottom panel). These results show that ‘.h5’ is a high-performance format that can be used to manage and transform data across platforms.
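The kind of measurement described above (elapsed time and peak memory for reading and writing a file format) can be sketched with the Python standard library. This is a minimal illustration, not the benchmark procedure used in this work: the helper names (`measure`, `run_benchmark`), the toy triplet data and the two toy formats (a text triplet file loosely resembling the body of a ‘.mtx’ file, and a packed binary file) are all assumptions made for the example. Note also that `tracemalloc` only tracks memory allocated by the Python interpreter, so it would underestimate the memory used by native libraries such as HDF5.

```python
import os
import struct
import tempfile
import time
import tracemalloc

# Three little-endian 32-bit integers per (row, col, value) entry.
RECORD = struct.Struct("<iii")


def measure(func, *args):
    """Run func(*args) and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def write_text(path, triplets):
    """Write one 'row col value' line per non-zero entry."""
    with open(path, "w") as fh:
        for row, col, value in triplets:
            fh.write(f"{row} {col} {value}\n")


def read_text(path):
    """Parse the text triplet file back into a list of integer tuples."""
    with open(path) as fh:
        return [tuple(int(field) for field in line.split()) for line in fh]


def write_binary(path, triplets):
    """Write each entry as a fixed-size binary record."""
    with open(path, "wb") as fh:
        for triplet in triplets:
            fh.write(RECORD.pack(*triplet))


def read_binary(path):
    """Unpack all fixed-size records from the binary file."""
    with open(path, "rb") as fh:
        return list(RECORD.iter_unpack(fh.read()))


def run_benchmark(n_entries=50_000):
    """Time and profile both toy formats on the same synthetic triplets."""
    triplets = [(i, (i * 7) % 1000, 1 + i % 5) for i in range(n_entries)]
    formats = {"txt": (write_text, read_text), "bin": (write_binary, read_binary)}
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (writer, reader) in formats.items():
            path = os.path.join(tmp, f"toy.{name}")
            _, write_s, _ = measure(writer, path, triplets)
            loaded, read_s, peak = measure(reader, path)
            assert loaded == triplets  # round trip must preserve the data
            results[name] = {
                "bytes": os.path.getsize(path),
                "write_s": write_s,
                "read_s": read_s,
                "peak_bytes": peak,
            }
    return results


if __name__ == "__main__":
    for name, stats in run_benchmark().items():
        print(name, stats)
```

Starting and stopping the tracer inside `measure` keeps each measurement independent, so the peak reported for one reader is not affected by the allocations of the previous one.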
Software on different platforms has specialized advantages for specific analysis tasks, and making full use of these advantages helps users discover new phenomena. For example, one can perform trajectory analysis using Monocle3 [3] in R, then use scDIOR to transfer the single-cell data, including the spliced and unspliced expression profiles and the low-dimensional layout, to Scanpy in Python. The expression profiles can be used to run dynamical RNA velocity analysis [14], and the results can be projected onto the Monocle3 layout (Fig. 3a). scDIOR thus provides an easy way to compare trajectory analysis performance between tools. scDIOR also helps link analysis pipelines between Python and R. One can apply the preprocessing and normalization methods provided by Scanpy [10] and then the batch correction method provided by Seurat [7] (Fig. 3b). In addition, scDIOR supports spatial omics data IO between the R and Python platforms (Fig. 3c). These results suggest that scDIOR is a convenient and versatile tool that can handle different single-cell data types and bridge the data IO capabilities of software on different platforms, avoiding complicated intermediate data IO and greatly improving the continuity and efficiency of analysis.
Several software packages for cross-platform data transformation have been proposed. We compared the characteristics of scDIOR, SeuratDisk, Zellkonverter and Loom (Fig. 4). SeuratDisk (https://mojaveazure.github.io/seurat-disk/reference/SeuratDisk-package.html) is an HDF5-based R tool for interconversion between Seurat and Scanpy. Zellkonverter (http://www.bioconductor.org/packages/release/bioc/html/zellkonverter.html) is an HDF5-based Bioconductor package focused on conversion between Scanpy and SingleCellExperiment. It uses a frozen Python environment for data storage to prevent package version incompatibility. Loom (http://loompy.org/) is an HDF5-based file format for scRNA-seq data that has no built-in interfaces to Seurat, SingleCellExperiment or Scanpy. Moreover, using loom files requires external software, such as the functions LoadLoom/SaveLoom implemented in SeuratDisk, import/export implemented in LoomExperiment (https://bioconductor.org/packages/release/bioc/html/LoomExperiment.html), or read_loom/write_loom implemented in Scanpy. Because the methods for reading and writing ‘.loom’ files were designed by different labs, cross-platform data interconversion can be difficult. SeuratDisk, Zellkonverter and Loom support only a limited set of data object conversions. In contrast, conversion between Seurat, SingleCellExperiment and Scanpy data objects can be performed easily using scDIOR (Fig. 4).
scDIOR also provides functions to load ‘.rds’ files in Python and ‘.h5ad’ files in R directly (Fig. 4). In this scenario, the ‘.rds’ or ‘.h5ad’ file is first converted to ‘.h5’ and then loaded by scDIOR. In addition, scDIOR provides simple functions for partial information extraction, with which users can load part of the data instead of the whole dataset, e.g., loading the cell annotation data frame without the large gene expression matrix. This feature helps save memory and accelerates file reading (Fig. 4). All scDIOR functions can be run from the command line with only a few lines of code (Fig. 4).
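The idea behind partial extraction from an HDF5 file can be illustrated with h5py and NumPy. The sketch below is not scDIOR's implementation and does not use its file schema: the dataset and group names (`X`, `obs`, `n_counts`, `cell_type`) and both helper functions are hypothetical, chosen only to show that one group of a file can be read without touching a large matrix stored next to it.

```python
import os
import tempfile

import h5py
import numpy as np


def write_toy_file(path, n_cells=100, n_genes=50):
    """Write a toy HDF5 file with an expression matrix and cell annotations.

    The names 'X' and 'obs' are illustrative assumptions, not a real schema.
    """
    rng = np.random.default_rng(0)
    labels = np.array(["T cell", "B cell"], dtype=object)
    with h5py.File(path, "w") as f:
        # Stand-in for the large gene expression matrix.
        f.create_dataset("X", data=rng.poisson(1.0, size=(n_cells, n_genes)))
        obs = f.create_group("obs")
        obs.create_dataset("n_counts", data=rng.integers(100, 1000, size=n_cells))
        obs.create_dataset(
            "cell_type",
            data=labels[rng.integers(0, 2, size=n_cells)],
            dtype=h5py.string_dtype(),
        )


def read_obs_only(path):
    """Load only the datasets under 'obs'; 'X' is never read from disk."""
    obs = {}
    with h5py.File(path, "r") as f:
        for name, dset in f["obs"].items():
            values = dset[()]
            if values.dtype.kind == "O":
                # Variable-length strings may come back as bytes objects.
                values = np.array(
                    [v.decode() if isinstance(v, bytes) else v for v in values]
                )
            obs[name] = values
    return obs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        toy_path = os.path.join(tmp, "toy.h5")
        write_toy_file(toy_path)
        annotations = read_obs_only(toy_path)
        print({name: values[:3] for name, values in annotations.items()})
```

Because HDF5 stores each dataset separately, the reader only pays for the datasets it actually opens, which is why loading annotations alone can be much cheaper than loading the whole file.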
In general, scDIOR performs well in the speed of single-cell data IO, saving time in data storage and extraction. scDIOR is convenient and versatile, and connects the analysis processes of different platforms. scDIOR lets users extract part of the data and ignore unused information, which saves memory and accelerates file reading. All data transformations across platforms can be done with a few lines of code in an IDE or on the command line. For version control, scDIOR is compatible with multiple versions of SingleCellExperiment (≥ 1.8.0), Seurat (≥ v3) and Scanpy (≥ 1.4). scDIOR is an effective and user-friendly tool that will help users take advantage of the strengths of different analytics platforms.