# Suppress loading messages when building the HTMLsuppressPackageStartupMessages({
library(data.table)
library(Biobase)
library(SingleCellExperiment)
library(GEOquery) # Optional# Do not convert strings to factorsoptions(stringsAsFactors=FALSE)
# To build a personalized report, update this working directory:# knitr::opts_knit$set(root.dir = 'SCENIC_MouseBrain')
Introduction to SCENIC
SCENIC is a tool to simultaneously reconstruct gene regulatory networks and identify stable cell states from single-cell RNA-seq data. The gene regulatory network is inferred based on co-expression and DNA motif analysis, and then the network activity is analyzed in each cell to identify the recurrent cellular states.
More info & citation
The approach behind SCENIC and its application to several datasets (e.g. usage examples) was presented in the following article:
Aibar et al. (2017) SCENIC: single-cell regulatory network inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463.
Please, cite this article if you use SCENIC in your research.
Requirements
Species
The current version of SCENIC supports human, mouse and fly (Drosophila melanogaster).
To apply SCENIC to other species, it would require manual adjustments on the second step (e.g. create new RcisTarget databases or using a diferent motif-enrichment-analysis tool).
Input: expression matrix
The input to SCENIC is the single-cell RNA-seq expression matrix:
Each column corresponds to a sample (cell) and each row corresponds to a gene.
The gene ID should be the gene-symbol and stored as rownames (for compatibility with RcisTarget annotation databases).
Expression units: The preferred expression values are gene-summarized counts.
There is currently not a strong recommendation towards using the raw counts, or counts normalized through single-cell specific methods (e.g. Seurat).
Other measurements, such as transcripts/counts per million (TPM) and FPKM/RPKM, are also accepted as input. However, note that some authors recommend avoiding within sample normalization (i.e. TPM) for co-expression analysis (first step of SCENIC) because they may induce artificial co-variation (@crow2016).
The choice of input expression matrix might have some effect on the co-expression analysis to create the regulons (step 1). The other steps of the workflow are not directly affected by the input expression values: (2) The expression is not taken into account for the motif analysis, and (3) AUCell, which is used for scoring the regulons on the cells, is cell ranking-based (it works as an implicit normalization).
Overall, SCENIC is quite robust to this choice, we have applied SCENIC to datasets using raw (logged) UMI counts, normalized UMI counts, and TPM and they all provided reliable results (see @aibar2017).
Availability (R / Python)
SCENIC is implemented in R (this package and tutorial) and Python (pySCENIC).
The Python implementation is significantly faster to run, so we generally recommend using it for most analyses.
We provide containers (in Docker and Singularity) including all the required dependencies, and a Nextflow pipeline (VSN, useful to run SCENIC in batch on multiple datasets). This should make it very easy to install and run (py)SCENIC, even for users with limited experience in Python.
The results are equivalent across versions and provide output .loom files that can be explored in SCope or used as interface between R and Python.
The rest of this tutorial will continue with the R implementation, but the general concepts behind SCENIC algorithm also apply to the other versions.
Installation
The R implementation of SCENIC is based on three R packages:
r Biocpkg("GENIE3") to infer the co-expression network (faster alternative: GRNBoost2)
r Biocpkg("RcisTarget") for the analysis of transcription factor binding motifs
r Biocpkg("AUCell") to identify cells with active gene sets (gene-network) in scRNA-seq data
Therefore, you will need to install these packages, and some extra dependencies, to run SCENIC:
if (!requireNamespace("BiocManager", quietly =TRUE)) install.packages("BiocManager")
BiocManager::version()
## Required
BiocManager::install(c("AUCell", "RcisTarget"))
BiocManager::install(c("GENIE3")) # Optional. Can be replaced by GRNBoost## Optional (but highly recommended):# To score the network on cells (i.e. run AUCell):
BiocManager::install(c("zoo", "mixtools", "rbokeh"))
# For various visualizations and perform t-SNEs:
BiocManager::install(c("DT", "NMF", "ComplexHeatmap", "R2HTML", "Rtsne"))
# To support paralell execution in tSNE (not available in Windows):
BiocManager::install(c("doMC", "doRNG"))
# To export/visualize in http://scope.aertslab.orgif (!requireNamespace("devtools", quietly =TRUE)) install.packages("devtools")
devtools::install_github("aertslab/SCopeLoomR", build_vignettes =TRUE)
You are now ready to install SCENIC:
if (!requireNamespace("devtools", quietly =TRUE)) install.packages("devtools")
devtools::install_github("aertslab/SCENIC")
packageVersion("SCENIC")
For older versions of R/Bioconductor
For older versions of R/Bioconductor, you might need to install a previous version of SCENIC. e.g.:
# For R version 3.6 and Bioconductor 3.9:
devtools::install_github("aertslab/[email protected]")
In case you have trouble installing the packages from Bioconductor (for example, for older versions of R), you may try installing them from Github or directly from the Bioconductor package files:
In addition to the R-packages, you will also need to download the species-specific databases for RcisTarget (the motif rankings).
The links to all the available databases are available in our website.
By default, SCENIC uses the databases that score the motifs in the promoter of the genes (up to 500bp upstream the TSS), and in the 20kb around the TSS (+/-10kbp).
dbFiles <-c("https://resources.aertslab.org/cistarget/databases/drosophila_melanogaster/dm6/flybase_r6.02/mc8nr/gene_based/dm6-5kb-upstream-full-tx-11species.mc8nr.feather")
# mc8nr: Motif collection version 8: 20k motifs
The approximate file size for these databases (.feather files) is 1GB.
To avoid problems or incomplete downloads (especially if you have a slow connection), we recommend to use zsync_curl.
If you prefer to download directly from R, you can try the following code:*
# dir.create("cisTarget_databases"); setwd("cisTarget_databases") # if neededoptions(timeout =max(300, getOption("timeout")))
for(featherURL in dbFiles)
download.file(featherURL, destfile=basename(featherURL), mode ="wb") # saved in current dir
To confirm that the databases were downloaded correctly, we recommend to confirm its sha256sum: https://resources.aertslab.org/cistarget/databases/sha256sum.txt
After these setup steps, SCENIC is ready to run! To start, see the vignette "SCENIC_Running".
Some tips...
Template for your own analysis
You can use the R notebooks of this workflow as template for your own data (i.e. copy the .Rmd file, and edit it in RStudio).
Note that some steps of this workflow can take considerable time. To avoid re-running these steps when knitting the vignette (i.e. create the HTML report), we have added eval=FALSE to some code chunks and load() its output in the next. Feel free to adapt these to your needs.
At any time, you an access the help for any function used in this workflow (i.e. for details on their arguments), and the vignettes of the other steps of the workflow with the following commands:
## Get help for a function:?runSCENIC_3_scoreCells
help(runSCENIC_3_scoreCells) # equivalent## See the available tutorials:vignette(package="SCENIC") # listvignette("SCENIC_Running") # open
Sample dataset
Running SCENIC in a real dataset typically takes a few hours. The toy example used in these tutorials (200 cells and <1000 genes) is a subset of a dataset of 3005 cells from the adult mouse brain, including neurons (e.g. pyramidal neurons and interneurons) and glia (oligodendrocytes, astrocytes/ependymal, endothelial/mural and microglia). The expression values are Unique Molecular Identifier counts.
Zeisel, A., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142. doi: 10.1126/science.aaa1934
The output files from the run on the full dataset are available at http://scenic.aertslab.org/examples/
Formatting input
Reminder: SCENIC's main input is a single-cell expression matrix with genes-symbols as row names.
Below there are several options to download and format this dataset (as example to format/import your own data).
a) From .loom file
.loom files can be directly imported into SCENIC through SCopeLoomR.
To load the expression matrix and cell annotation:
library(SCopeLoomR)
loom <-open_loom(loomPath, mode="r+") # We recommend to open files as read-only (mode="r"). However, since this is an old-format loom file, it needs write access to update it.
exprMat <-get_dgem(loom)
cellInfo <-get_cell_annotation(loom)
close_loom(loom)
b) From 10X/CellRanger output files
10X/CellRanger output matrices can be used as input for SCENIC.
The tutorials on how to load the CellRanger output into R are available at 10X website (choose the appropriate CellRanger version):
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices#r-load-mat
Some packages, such as Seurat, also provide functions to directly import 10X/CellRanger output:
c) From other R objects (e.g. Seurat, SingleCellExperiment)
Many R packages store the expression data in their own data structures (e.g. Seurat) or Bioconductor classes (e.g. SingleCellExperimentSummarizedExperiment, ExpressionSet).
Most of these objects have data acessors to retrieve the expression matrix and cell metadata (the function name depends on the object type or package).
For example, for a SingleCellExperiment object the accessors are counts() and colData():
library(SingleCellExperiment)
## Get data from sce object:
exprMat <-counts(sce)
cellInfo <-colData(sce)
To use Seurat clusters as cell annotation (e.g. for visualization):
To run SCENIC in Python we recommend to save/export the expression matrix and cell metadata into a .loom file (for R, any R/Bioconductor object is also OK):