Pipelines — VSN-Pipelines documentation


Single-sample Pipelines ¶

Pipelines to run on a single sample or multiple samples separately and in parallel.

single_sample ¶

The single_sample workflow processes 10x data, taking in 10x-structured data and a metadata file. The standard analysis steps are run: filtering, normalization, log-transformation, HVG selection, dimensionality reduction, clustering, and loom file generation. The output is a loom file with the results embedded.

single_sample_scenic ¶

Runs the single_sample workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results. This could be very resource intensive, depending on the dataset.

single_sample_scrublet ¶

Runs the single_sample workflow above together with the Scrublet workflow.

Both the single_sample workflow and the scrublet workflow run from the input data. The final processed file from the single_sample pipeline is annotated with the cell-based data generated by Scrublet.

The pipelines generate the following relevant files for each sample:

Output Files (not exhaustive list) ¶
out/data/*.SINGLE_SAMPLE_SCRUBLET.loom
SCope-ready loom file containing the resulting loom file from a single_sample workflow but with additional metadata (doublet scores and predicted doublets for the cells) based on the Scrublet run.
out/data/scrublet/*.SC__SCRUBLET__DOUBLET_DETECTION.ScrubletObject.pklz
Pickled file containing the Scrublet object.
out/data/scrublet/*.SCRUBLET.SC__ANNOTATE_BY_CELL_METADATA.h5ad
h5ad file with raw data and doublets annotated.
out/data/scrublet/*.SINGLE_SAMPLE_SCRUBLET.h5ad
h5ad file resulting from a single_sample workflow run and with doublets (inferred from Scrublet) removed.

Currently there are 3 methods available to call doublets from Scrublet doublet scores:

1. (Default) Scrublet will try to automatically identify the doublet score threshold. The threshold is then used to call doublets based on the doublet scores available in the scrublet__doublet_scores column. The doublets called are available in the scrublet__predicted_doublets column.

2. It can happen that Scrublet fails to find the threshold automatically. In that case, the pipeline will fail and tell you that either the method defined in 3. has to be used or a custom threshold has to be provided. Either way, the pipeline will generate the Scrublet histograms. This is especially helpful if the user decides to select a custom threshold, which needs to be reflected in the config as follows:

params {
    tools {
        scrublet {
            threshold = [
              "<sample-name>": <custom-threshold>
            ]
        }
    }
}

3. This method is specific to samples generated by the 10x Genomics single-cell platform. It is based on the expected doublet rate in 10x Genomics samples. The number of doublets called (D) is equal to the doublet rate (given a number of cells) times the number of cells in that 10x Genomics sample. The cells are then ranked by their Scrublet doublet score (descending order) and the top D cells are called as doublets.
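The threshold-based and rate-based doublet calls can be sketched in a few lines of Python. This is only an illustration of the logic described in the list, not the pipeline's own code: the function names, the example scores, the strict "greater than" comparison, the rounding of D, and the fixed 20% doublet rate are all assumptions made for the example.

```python
def call_by_threshold(scores, threshold):
    """Flag cells whose doublet score is above the threshold (strict '>' is assumed)."""
    return [score > threshold for score in scores]


def call_by_expected_rate(scores, doublet_rate):
    """Flag the top D cells by score, with D = doublet_rate * number of cells."""
    n_cells = len(scores)
    n_doublets = round(doublet_rate * n_cells)  # D; the rounding rule is an assumption
    # Rank cell indices by doublet score, highest first.
    ranked = sorted(range(n_cells), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_doublets])
    return [i in top for i in range(n_cells)]


# Made-up doublet scores for 10 cells.
scores = [0.05, 0.62, 0.11, 0.48, 0.09, 0.71, 0.03, 0.15, 0.22, 0.07]

by_threshold = call_by_threshold(scores, threshold=0.5)
by_rate = call_by_expected_rate(scores, doublet_rate=0.2)

print(by_threshold)
print(by_rate)
```

With these example values both approaches flag the same two cells, but in general they differ: the threshold approach depends on where the cutoff falls, while the rate-based approach always calls exactly D cells.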
out/data/*.CELDA_DECONTX_{FILTER,CORRECT}.h5ad
An h5ad file with either the filtered matrix using one of the provided filters or the corrected (decontaminated) matrix by DecontX.
out/data/celda/*.CELDA__DECONTX.Rds
An Rds file containing the SingleCellExperiment object processed by DecontX.
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Table.tsv
A cell-based .tsv file containing data generated by DecontX and additional outlier masks:
decontX_contamination
decontX_clusters
celda_decontx__{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}_predicted_outliers
out/data/celda/*.CELDA__DECONTX.Contamination_Outlier_Thresholds.tsv
A .tsv file containing a table with the different thresholds used for generating the outlier masks.
out/data/celda/*.CELDA__DECONTX.Contamination_Score_Density_with_{doublemad,scater_isOutlier_3MAD,custom_gt_0.5}.pdf
A .pdf plot showing the density of the decontamination score from DecontX and the outlier area highlighted for the given outlier threshold.
out/data/celda/*.CELDA__DECONTX.UMAP_Contamination_Score.pdf
A .pdf plot showing the DecontX contamination score on top of a UMAP generated from the decontaminated matrix.
out/data/celda/*.CELDA__DECONTX.UMAP_Clusters.pdf
A .pdf plot showing a UMAP generated by DecontX and from the decontaminated matrix.
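The outlier mask names listed above (custom_gt_0.5, scater_isOutlier_3MAD) suggest simple rules applied to the per-cell contamination score. The following Python sketch shows what such rules could look like. It is an assumption-laden illustration, not the DecontX workflow's code: the function names and example values are hypothetical, the MAD scaling constant 1.4826 is the default of R's mad() and is assumed to apply here, only the upper tail is tested, and the doublemad mask is not covered.

```python
from statistics import median

MAD_SCALE = 1.4826  # default consistency constant of R's mad(); assumed here


def scaled_mad(values):
    """Median absolute deviation around the median, scaled by MAD_SCALE."""
    center = median(values)
    return MAD_SCALE * median(abs(v - center) for v in values)


def mask_custom_gt(contamination, cutoff=0.5):
    """Flag cells whose contamination score exceeds a fixed cutoff."""
    return [c > cutoff for c in contamination]


def mask_3mad(contamination, nmads=3):
    """Flag cells more than nmads scaled MADs above the median (upper tail only)."""
    upper = median(contamination) + nmads * scaled_mad(contamination)
    return [c > upper for c in contamination]


# Made-up contamination scores for 8 cells.
contamination = [0.02, 0.05, 0.03, 0.04, 0.06, 0.55, 0.01, 0.90]

custom_mask = mask_custom_gt(contamination)
mad_mask = mask_3mad(contamination)

print(custom_mask)
print(mad_mask)
```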
single_sample_decontx ¶
Runs the single_sample workflow above together with the DecontX workflow.
The DecontX workflow runs from the input data.
The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX.
See single_sample and decontx to learn more about the files generated by this pipeline.
single_sample_decontx_scrublet ¶
Runs the single_sample workflow above together with the DecontX and Scrublet workflows.
The single_sample workflow runs from the input data.
The decontx workflow runs from the input data.
The scrublet workflow runs from the output of the DecontX workflow.
The final processed file from the single_sample pipeline is annotated with the cell-based data generated by DecontX and Scrublet.
See single_sample, decontx and scrublet to learn more about the files generated by this pipeline.
scenic ¶
Runs the scenic workflow alone, generating a loom file with only the SCENIC results.
Currently, the required input is a loom file (set by params.tools.scenic.filteredLoom).
scenic_multiruns ¶
Runs the scenic workflow multiple times (set by params.tools.scenic.numRuns), generating a loom file with the aggregated results from the multiple SCENIC runs.
Note that this is not a complete entry-point itself, but a configuration option for the scenic module.
Simply adding -profile scenic_multiruns during the config step will activate this analysis option for any of the standard entrypoints.
cellranger¶
Runs the cellranger workflow (mkfastq, then count).
Input parameters are specified within the config file:
params.tools.cellranger.mkfastq.csv: path to the CSV samplesheet
params.tools.cellranger.mkfastq.runFolder: path of Illumina BCL run folder
params.tools.cellranger.count.transcriptome: path to the Cell Ranger compatible transcriptome reference
cellranger_count_metadata¶
Given the data stored as:
MKFASTQ_ID_SEQ_RUN1
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLC5BBXX
            |-- test_sample1
            |   |-- sample1_S1_L001_I1_001.fastq.gz
            |   |-- sample1_S1_L001_R1_001.fastq.gz
            |   |-- sample1_S1_L001_R2_001.fastq.gz
            |   |-- sample1_S1_L002_I1_001.fastq.gz
            |   |-- sample1_S1_L002_R1_001.fastq.gz
            |   |-- sample1_S1_L002_R2_001.fastq.gz
            |   |-- sample1_S1_L003_I1_001.fastq.gz
            |   |-- sample1_S1_L003_R1_001.fastq.gz
            |   |-- sample1_S1_L003_R2_001.fastq.gz
            |-- test_sample2
            |   |-- sample2_S2_L001_I1_001.fastq.gz
            |   |-- sample2_S2_L001_R1_001.fastq.gz
            |   |-- ...
        |-- Reports
        |-- Stats
        |-- Undetermined_S0_L001_I1_001.fastq.gz
        `-- Undetermined_S0_L003_R2_001.fastq.gz
MKFASTQ_ID_SEQ_RUN2
|-- MAKE_FASTQS_CS
`-- outs
    |-- fastq_path
        |-- HFLY8GGLL
            |-- test_sample1
            |   |-- ...
            |-- test_sample2
            |   |-- ...
        |-- ...
and a metadata table:
Minimally Required Metadata Table¶
Optional columns:
short_uuid: sample_name will be prefixed by this value. This should be the same between sequencing runs of the same biological replicate.
expect_cells: This number will be used as argument for the --expect-cells parameter in cellranger count.
chemistry: This chemistry will be used as argument for the --chemistry parameter in cellranger count.
and a config:
nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile cellranger_count_metadata \
   > nextflow.config
and a workflow run command:
nextflow run \
    ~/vib-singlecell-nf/vsn-pipelines \
    -entry cellranger_count_metadata
The workflow will run Cell Ranger count on 2 samples, each using the 2 sequencing runs.
NOTES:
If fastqs_dir_name does not exist, set it to none
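To make the "2 samples, each using the 2 sequencing runs" behaviour concrete, the Python sketch below groups metadata rows by sample and builds one cellranger count command per sample, passing all FASTQ directories of that sample as a comma-separated --fastqs value. This is a rough illustration under stated assumptions, not the workflow's implementation: the column name fastqs_parent_dir_path, the sample names, the expect_cells values, the transcriptome path, and the reading of "none" as "FASTQs sit directly in the parent directory" are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata rows: one row per sample per sequencing run.
rows = [
    {"sample_name": "sample1", "fastqs_dir_name": "test_sample1", "expect_cells": "5000",
     "fastqs_parent_dir_path": "MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX"},
    {"sample_name": "sample1", "fastqs_dir_name": "test_sample1", "expect_cells": "5000",
     "fastqs_parent_dir_path": "MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL"},
    {"sample_name": "sample2", "fastqs_dir_name": "test_sample2", "expect_cells": "",
     "fastqs_parent_dir_path": "MKFASTQ_ID_SEQ_RUN1/outs/fastq_path/HFLC5BBXX"},
    {"sample_name": "sample2", "fastqs_dir_name": "test_sample2", "expect_cells": "",
     "fastqs_parent_dir_path": "MKFASTQ_ID_SEQ_RUN2/outs/fastq_path/HFLY8GGLL"},
]


def build_count_commands(rows, transcriptome):
    """Return {sample_name: argv list} with one cellranger count call per sample."""
    fastq_dirs = defaultdict(list)
    first_row = {}
    for row in rows:
        name = row["sample_name"]
        parent = row["fastqs_parent_dir_path"]
        sub = row["fastqs_dir_name"]
        # Assumed meaning of "none": no per-sample subdirectory exists.
        fastq_dirs[name].append(parent if sub == "none" else f"{parent}/{sub}")
        first_row.setdefault(name, row)

    commands = {}
    for name, dirs in fastq_dirs.items():
        cmd = ["cellranger", "count", f"--id={name}",
               f"--transcriptome={transcriptome}",
               f"--fastqs={','.join(dirs)}"]
        # Optional columns are only passed on when they are filled in.
        if first_row[name].get("expect_cells"):
            cmd.append(f"--expect-cells={first_row[name]['expect_cells']}")
        if first_row[name].get("chemistry"):
            cmd.append(f"--chemistry={first_row[name]['chemistry']}")
        commands[name] = cmd
    return commands


commands = build_count_commands(rows, transcriptome="/path/to/transcriptome")
for name, cmd in commands.items():
    print(name, " ".join(cmd))
```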
demuxlet/freemuxlet¶
Runs the demuxlet or freemuxlet workflows (dsc-pileup [with prefiltering], then freemuxlet or demuxlet)
Input parameters are specified within the config file:
params.tools.popscle.vcf: path to the VCF file for demultiplexing
params.tools.popscle.freemuxlet.nSamples: Number of clusters to extract (should match the number of samples pooled)
params.tools.popscle.demuxlet.field: Field in the VCF with genotype information
nemesh¶
Runs the nemesh pipeline (Drop-seq) on a single sample or multiple samples separately.
Source
bbknn ¶
Runs the bbknn workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, then the batch-effect correction steps: BBKNN, clustering, dimensionality reduction (UMAP only)).
The output is a loom file with the results embedded.
Source: https://github.com/Teichlab/bbknn/blob/master/examples/pancreas.ipynb
Output Files (not exhaustive list)¶
out/data/*.BBKNN.h5ad
Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN.loom
SCope-ready loom file containing all results.
bbknn_scenic ¶
Runs the bbknn workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results.
This could be very resource intensive, depending on the dataset.
Output Files (not exhaustive list)¶
out/data/*.BBKNN.h5ad
Scanpy-ready h5ad file containing all results from a bbknn workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.BBKNN_SCENIC.loom
SCope-ready loom file containing all results from a bbknn workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).
harmony ¶
Runs the harmony workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (Harmony), clustering, dimensionality reduction (t-SNE and UMAP)).
The output is a loom file with the results embedded.
Output Files (not exhaustive list)¶
out/data/*.HARMONY.h5ad
Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY.loom
SCope-ready loom file containing all results.
harmony_scenic ¶
Runs the harmony workflow above, then runs the scenic workflow on the output, generating a comprehensive loom file with the combined results.
This could be very resource intensive, depending on the dataset.
Output Files (not exhaustive list)¶
out/data/*.HARMONY.h5ad
Scanpy-ready h5ad file containing all results from a harmony workflow run. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.HARMONY_SCENIC.loom
SCope-ready loom file containing all results from a harmony workflow and a scenic workflow run (e.g.: regulon AUC matrix, regulons, …).
mnncorrect ¶
Runs the mnncorrect workflow (sample-specific filtering, merging of individual samples, normalization, log-transformation, HVG selection, PCA analysis, batch-effect correction (mnnCorrect), clustering, dimensionality reduction (t-SNE and UMAP)).
The output is a loom file with the results embedded.
Output Files (not exhaustive list)¶
out/data/*.MNNCORRECT.h5ad
Scanpy-ready h5ad file containing all results. The raw.X slot contains the log-normalized data (if normalization & transformation steps applied) while the X slot contains the log-normalized scaled data.
out/data/*.MNNCORRECT.loom
SCope-ready loom file containing all results.
Utility Pipelines¶
Contrary to the aforementioned pipelines, these are not end-to-end. They are used to perform small incremental processing steps.
cell_annotate¶
Runs the cell_annotate workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files.
Below is a use case with 10x Genomics data where different samples are annotated using the obo method. For more information about this cell-based annotation feature, please visit the Cell-based metadata annotation section.
First, generate the config:
nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,singularity \
   > nextflow.config
Make sure the following parts of the generated config are properly set:
[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    [...]
}
[...]
Now we can run it with the following command:
nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_annotate
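Conceptually, the annotation step looks up each cell barcode in a metadata table and copies the selected columns onto the cell. The Python sketch below illustrates that lookup using the indexColumnName and annotationColumnNames values from the config above. It is a simplified illustration and not the workflow's code: the function name, the barcodes, and the table contents are invented for the example.

```python
import csv
import io

# Invented demuxlet-style metadata table (tab-separated).
best_tsv = (
    "BARCODE\tNUM.SNPS\tNUM.READS\tDROPLET.TYPE\tSNG.BEST.GUESS\n"
    "AAAC-1\t120\t300\tSNG\tdonorA\n"
    "AAAG-1\t95\t210\tDBL\tdonorB\n"
    "AACT-1\t130\t280\tSNG\tdonorB\n"
)


def annotate_cells(barcodes, tsv_text, index_column, annotation_columns):
    """Attach the selected metadata columns to each barcode."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    by_barcode = {row[index_column]: row for row in reader}
    annotated = []
    for barcode in barcodes:
        row = by_barcode.get(barcode, {})
        cell = {"barcode": barcode}
        for column in annotation_columns:
            cell[column] = row.get(column)  # None when the barcode has no metadata
        annotated.append(cell)
    return annotated


cells = annotate_cells(
    barcodes=["AAAC-1", "AAAG-1", "AACT-1", "AAGG-1"],
    tsv_text=best_tsv,
    index_column="BARCODE",
    annotation_columns=["DROPLET.TYPE", "NUM.SNPS", "NUM.READS", "SNG.BEST.GUESS"],
)
for cell in cells:
    print(cell)
```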
cell_annotate_filter ¶
Runs the cell_annotate_filter workflow which will perform a cell-based annotation of the data using a set of provided .tsv metadata files, followed by a cell-based filtering.
Below is a use case with 10x Genomics data where different samples are annotated using the obo method. For more information about this cell-based annotation feature, please visit the Cell-based metadata annotation section and the Cell-based metadata filtering section.
First, generate the config:
nextflow config \
   ~/vib-singlecell-nf/vsn-pipelines \
   -profile tenx,utils_cell_annotate,utils_cell_filter,singularity \
   > nextflow.config
Make sure the following parts of the generated config are properly set:
[...]
data {
  tenx {
     cellranger_mex = '~/out/counts/*/outs/'
  }
}
tools {
    scanpy {
        container = 'vibsinglecellnf/scanpy:1.8.1'
    }
    cell_annotate {
        off = 'h5ad'
        method = 'obo'
        indexColumnName = 'BARCODE'
        cellMetaDataFilePath = "~/out/data/*.best"
        sampleSuffixWithExtension = '_demuxlet.best'
        annotationColumnNames = ['DROPLET.TYPE', 'NUM.SNPS', 'NUM.READS', 'SNG.BEST.GUESS']
    }
    cell_filter {
        off = 'h5ad'
        method = 'internal'
        filters = [
            [
                id:'NO_DOUBLETS',
                sampleColumnName:'sample_id',
                filterColumnName:'DROPLET.TYPE',
                valuesToKeepFromFilterColumn: ['SNG']
            ]
        ]
    }
    [...]
}
[...]
Now we can run it with the following command:
nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
   -entry cell_filter
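The NO_DOUBLETS filter in the config above keeps only cells whose DROPLET.TYPE value is in valuesToKeepFromFilterColumn. A minimal Python sketch of that selection is shown below. The function name and the example cells are hypothetical and the workflow's actual implementation may differ.

```python
def filter_cells(cells, filter_column, values_to_keep):
    """Keep only cells whose value in filter_column is one of values_to_keep."""
    keep = set(values_to_keep)
    return [cell for cell in cells if cell.get(filter_column) in keep]


# Invented annotated cells; DROPLET.TYPE mimics a demuxlet-style annotation.
annotated_cells = [
    {"barcode": "AAAC-1", "sample_id": "sample1", "DROPLET.TYPE": "SNG"},
    {"barcode": "AAAG-1", "sample_id": "sample1", "DROPLET.TYPE": "DBL"},
    {"barcode": "AACT-1", "sample_id": "sample2", "DROPLET.TYPE": "SNG"},
    {"barcode": "AAGG-1", "sample_id": "sample2", "DROPLET.TYPE": "AMB"},
]

singlets = filter_cells(annotated_cells, "DROPLET.TYPE", ["SNG"])
print([cell["barcode"] for cell in singlets])
```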
sra¶
Runs the sra workflow, which downloads all (or a user-defined selection of) FASTQ files from a particular SRA project and renames them with proper, human-readable names.
First, generate the config:
nextflow config \
  ~/vib-singlecell-nf/vsn-pipelines \
    -profile sra,singularity \
    > nextflow.config
NOTES:
The download of SRA files is by default limited to 20 Gb. If this limit needs to be increased please set params.tools.sratoolkit.maxSize accordingly. This limit can be ‘removed’ by setting the parameter to an arbitrarily high number (e.g.: 9999999999999).
If you’re a VSC user, you might want to add the vsc profile.
The final output (FASTQ files) will be available in out/data/sra.
If you’re downloading 10x Genomics scATAC-seq data, make sure to set params.tools.sratoolkit.includeTechnicalReads = true and properly set params.utils.sra_normalize_fastqs.fastq_read_suffixes. In the case of downloading the scATAC-seq samples of SRP254409, fastq_read_suffixes would be set to ["R1", "R2", "I1", "I2"].
Now we can run it with the following command:
nextflow -C nextflow.config \
   run ~/vib-singlecell-nf/vsn-pipelines \
    -entry sra
$ nextflow -C nextflow.config run ~/vib-singlecell-nf/vsn-pipelines -entry sra
N E X T F L O W  ~  version 21.04.3
Launching `~/vib-singlecell-nf/vsn-pipelines/main.nf` [sleepy_goldstine] - revision: ba1dedbf51
executor >  local (23)
[12/25b9d4] process > sra:DOWNLOAD_FROM_SRA:SRA_TO_METADATA (1)                                             [100%] 1 of 1 _
[e2/d5a429] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:DOWNLOAD_FASTQS_FROM_SRA_ACC_ID (4) [ 33%] 3 of 9
[30/cba7a0] process > sra:DOWNLOAD_FROM_SRA:SRATOOLKIT__DOWNLOAD_FASTQS:FIX_AND_COMPRESS_SRA_FASTQ (3)      [100%] 3 of 3
[76/97ce6e] process > sra:DOWNLOAD_FROM_SRA:NORMALIZE_SRA_FASTQS (3)                                        [100%] 3 of 3
[8c/3125c4] process > sra:PUBLISH:SC__PUBLISH (11)                                                          [100%] 12 of 12