import warnings
warnings.filterwarnings("ignore")
import pyscenic
import loompy as lp
import scanpy as sc
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import os
20.4. Preparation of the NeurIPs dataset
The current dataset we will be looking at here is already pre-processed (previous chapters) and it contains 69,249 cells, annotated into 22 cell types. Due to batch integration considerations, we will demonstrate the usage of GRN methods for only one, labeled as s1d1 (n=6,224 cells).
Load the full dataset
adata = sc.read_h5ad("../../data/openproblems_bmmc_multiome_genes_filtered.h5ad")
adata.shape
rna = adata[:, adata.var.feature_types == "GEX"]
del adata
sc.pp.log1p(rna)
rna.obs.batch.value_counts()
Here we select highly-variable genes, to minimize the computing requirements for steps down-stream. An execution without defining HVGs is also possible, but will need additional computing memory and time.
sc.pp.highly_variable_genes(rna, batch_key="batch", flavor="seurat")
This is the embedding of all gene expression data cells of all donors. One can see that donor is dominate over cell types in terms of groups, and batch-correction methods have not been executed.
sc.pl.embedding(rna, "GEX_X_umap", color=["cell_type", "batch"])
Here we observe the cells of only donor s1d1. When observing cells for only one donor, cells clusters can identified based on their provided annotations.
adata_batch = rna[rna.obs.batch == "s1d1", :]
sc.pl.embedding(adata_batch, "GEX_X_umap", color=["cell_type", "batch"])
20.5. Preparation of SCENIC
Using loompy, we will convert the gene expression values into loom files. This file format is required by pyscenic as input. In addition to this, the file allTFs_hg38.txt
defines a list of gene symbols related to transcription factors, to be considered when weighting associations between those and other genes.
## this file has to be downloaded if not found
!wget -nc https://raw.githubusercontent.com/aertslab/SCENICprotocol/master/example/allTFs_hg38.txt
loom_path = "data/neurips_processed_input.loom"
loom_path_output = "data/neurips_processed_output.loom"
tfs = [tf.strip() for tf in open(tfs_path)]
As a verification, it is recommended to check that genes annotated as TFs are part of the provided input data. If not a high coverage of gene symbols is observed e.g. 50% or more, there could be an issues with the feature names declared in the .var
, such as Ensembl IDs instead of gene_symbols, or lower/uppercase difference if the genome assembly is mouse or another species that uses a different capitalization style than human.
# as a general QC. We inspect that our object has transcription factors listed in our main annotations.
print(
f"%{np.sum(adata_batch.var.index.isin(tfs))} out of {len(tfs)} TFs are found in the object"
Usage of HVG or all features can be defined here, by changing the flag use_hvg
to False
.
use_hvg = True
if use_hvg:
mask = (adata_batch.var["highly_variable"] == True) | adata_batch.var.index.isin(
adata_batch = adata_batch[:, mask]
Create a loom file for NeurIPS donor. If you get an error in this step, verify that the labels for Gene/CellID/etc. are properly defined.
row_attributes = {
"Gene": np.array(adata_batch.var.index),
col_attributes = {
"CellID": np.array(adata_batch.obs.index),
"nGene": np.array(np.sum(adata_batch.X.transpose() > 0, axis=0)).flatten(),
"nUMI": np.array(np.sum(adata_batch.X.transpose(), axis=0)).flatten(),
lp.create(loom_path, adata_batch.X.transpose(), row_attributes, col_attributes)
Once the loom file has been generated, we execute pyscenic
to generate associations between TFs and genes. TF-gene associations are inferred by GRNBoost, and summarized by a directional weight between TFs and target genes. The output of this analysis is a table summarizing all reported associations with their importance weight.
Some steps below require indicating a number of cores (num_workers
). Increase according to computing resources available
num_workers = 3
results_adjacencies = pd.read_csv("adj.csv", index_col=False, sep=",")
print(f"Number of associations: {results_adjacencies.shape[0]}")
results_adjacencies.head()
Visualize the distribution of weights for general inspection of the quantiles and thresholds obtained from pyscenic. As provided by the pyscenic grn step, the importance scores follow a unimodal distribution, with negative/positive values indicating TF-gene associations with less/more importance, respectively. From the right-tail of this distribution, we can recover the most relevant interactions between TFs and potential target genes, supported by gene expression values and the analysis done by pyscenic.
plt.hist(np.log10(results_adjacencies["importance"]), bins=50)
plt.xlim([-10,
10])
As targets genes have DNA motifs at promoters (sequence specific DNA motifs), those can be used to link TFs to target genes. Next, we use an annotation of TF associations to Transcription Start Sites (TSSs) to refine this annotation.
Download TSS annotations precalculated by Aerts’s lab
!wget -nc https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/refseq_r80/mc9nr/gene_based/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.genes_vs_motifs.rankings.feather
Using the catalog of motif and their gene associations at promoters, retrieved a subset adjacencies by pruning of the previous adjacencies.
This step can take several minutes on consumer grade hardware.
if not os.path.exists('reg.csv'):
!pyscenic ctx adj.csv \
{db_names} \
--annotations_fname {motif_path} \
--expression_mtx_fname {loom_path} \
--output reg.csv \
--mask_dropouts \
--num_workers {num_workers} > pyscenic_ctx_stdout.txt
To explore the candidates reported, it is recommended as a rule of thumb to explore the output by the ranking of relative contribution, or by top-quantile thresholds defined visually to obtain a high signal-to-noise ratio.
Define custom quantiles for further exploration
import numpy as np
n_genes_detected_per_cell = np.sum(adata_batch.X > 0, axis=1)
percentiles = pd.Series(n_genes_detected_per_cell.flatten().A.flatten()).quantile(
[0.01, 0.05, 0.10, 0.50, 1]
print(percentiles)
The histogram below indicates the distribution of genes detected per cell. This visualization is convenient to define the parameter --auc_threshold
in the next step. Specifically, the default parameter of --auc_threshold
is 0.05, which in this plot would result in the selection of 144
genes, to be used as a reference per cell for AUCell calculations. The modification of this parameter affects the estimation of AUC values calculated by AUCell.
fig, ax = plt.subplots(1, 1, figsize=(8, 5), dpi=100)
sns.distplot(n_genes_detected_per_cell, norm_hist=False, kde=False, bins="fd")
for i, x in enumerate(percentiles):
fig.gca().axvline(x=x, ymin=0, ymax=1, color="red")
ax.text(
x=x,
y=ax.get_ylim()[1],
s=f"{int(x)} ({percentiles.index.values[i]*100}%)",
color="red",
rotation=30,
size="x-small",
rotation_mode="anchor",
ax.set_xlabel("# of genes")
ax.set_ylabel("# of cells")
fig.tight_layout()
This step will use TFs to calculate Area Under the Curve scores, that summarize how well the gene expression observed in each cell can be associated by the regulation of target genes regulated by the mentioned TFs.
Using the above-generated matrix of cell x TFs and those scores, we can calculate a new embedding using only those.
if not os.path.exists(loom_path_output):
!pyscenic aucell $loom_path \
reg.csv \
--output {loom_path_output} \
--num_workers {num_workers} > pyscenic_aucell_stdout.txt
# collect SCENIC AUCell output
lf = lp.connect(loom_path_output, mode="r+", validate=False)
auc_mtx = pd.DataFrame(lf.ca.RegulonsAUC, index=lf.ca.CellID)
lf.close()
ad_auc_mtx = ad.AnnData(auc_mtx)
sc.pp.neighbors(ad_auc_mtx, n_neighbors=10, metric="correlation")
sc.tl.umap(ad_auc_mtx)
sc.tl.tsne(ad_auc_mtx)
WARNING: You’re trying to run this on 247 dimensions of `.X`, if you really want this, set `use_rep='X'`.
Falling back to preprocessing with `sc.pp.pca` and default params.
This UMAP visualization confirms that the signal from SCENIC is capable of capturing regulons that keep dividing most cell populations into sub-groups. Hence, there is information TF-regulons that enables cell-type identification.
sc.pl.embedding(adata_batch, basis="X_umap_aucell", color="cell_type")
A visualization of the tSNE values generated by SCENIC also confirms this cell-type separation, for the majority of cell-types
sc.pl.embedding(adata_batch, basis="X_tsne_aucell", color="cell_type")
top_n = 50
top_tfs = mean_auc_by_cell_type.max(axis=0).sort_values(ascending=False).head(top_n)
mean_auc_by_cell_type_top_n = mean_auc_by_cell_type[
[c for c in mean_auc_by_cell_type.columns if c in top_tfs]
Once we know the top TF-regulons involved in the biological system we are studying, we can inspect the activities estimated by each TF, based on the scores per cell, or the overall AUCs per cell-type explained by those TF (blue heatmap below).
sns.clustermap(
mean_auc_by_cell_type_top_n,
figsize=[15, 6.5],
cmap="Blues",
xticklabels=True,
yticklabels=True,
As the red heatmap is suggesting that some TFs are strongly associated to particular cell types, we can verify the expression of their expression levels as an additional validation. This is done by matching the TF names we want to highlight, and visualize those using Scanpy’s (red heatmap).
tf_names = top_tfs.index.str.replace("\(\+\)", "")
adata_batch_top_tfs = adata_batch[:, adata_batch.var_names.isin(tf_names)]
WARNING: dendrogram data not found (using key=dendrogram_cell_type). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.
WARNING: You’re trying to run this on 2785 dimensions of `.X`, if you really want this, set `use_rep='X'`.
Falling back to preprocessing with `sc.pp.pca` and default params.
Through visual inspection and comparison of the TF-regulon AUC scores and TF gene expression by cell type, we can verify that in several case the cell-type specific expression of a TF is linked to a particular cell type, where the respective TF-regulon is also active e.g. RUNX2 in pDCs, TCF7L2 in CD16+ Mono, LEF1 in CD4+ T activated. Additional inspection can serve to validate previous insights and/or additional associations.
20.6. Takeaways
In this notebook, we have:
Setup a basic pipeline to prepare, execute, and verify SCENIC results using the NeurIPs 2021 dataset.
Detected TF-regulons that allow to explain the observed transcriptomes by the regulatory potential of a handful of TFs.
20.7. Quiz
20.7.1. Theory
How many genes in the human/mouse genome are considered TFs?
What is a TF-regulon?
How many DNA-binding motifs are there to each TFs?
3.1. Currently, are there more known DNA-binding motifs or TFs that bind those?
3.2. How can one reconcile the redundancy of these during the analysis?
Additional reading: Describe the futility theorem.
20.7.2. SCENIC
Describe the major steps in the SCENIC pipeline (three or more)?
What could happen if the number of TFs reported in your own data has a low overlap to the one externally retrieved?
What is the AUC score from pyscenic and how it can be helpful for interpretation of results per cell cluster?
[grnAGonzalezBM+17]
Sara Aibar, Carmen Bravo González-Blas, Thomas Moerman, Vân Anh Huynh-Thu, Hana Imrichova, Gert Hulselmans, Florian Rambow, Jean-Christophe Marine, Pierre Geurts, Jan Aerts, Joost van den Oord, Zeynep Kalender Atak, Jasper Wouters, and Stein Aerts. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods, 14(11):1083–1086, November 2017.
[grnFSL+20]
Chenchen Feng, Chao Song, Yuejuan Liu, Fengcui Qian, Yu Gao, Ziyu Ning, Qiuyu Wang, Yong Jiang, Yanyu Li, Meng Li, Jiaxin Chen, Jian Zhang, and Chunquan Li. KnockTF: a comprehensive human gene expression profile database with knockdown/knockout of transcription factors. Nucleic Acids Res., 48(D1):D93–D100, January 2020.
[grnGAHI+19]
Luz Garcia-Alonso, Christian H Holland, Mahmoud M Ibrahim, Denes Turei, and Julio Saez-Rodriguez. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Res., 29(8):1363–1375, August 2019.
[grnHSS+15]
Heonjong Han, Hongseok Shim, Donghyun Shin, Jung Eun Shim, Yunhee Ko, Junha Shin, Hanhae Kim, Ara Cho, Eiru Kim, Tak Lee, Hyojin Kim, Kyungsoo Kim, Sunmo Yang, Dasom Bae, Ayoung Yun, Sunphil Kim, Chan Yeong Kim, Hyeon Jin Cho, Byunghee Kang, Susie Shin, and Insuk Lee. TRRUST: a reference database of human transcriptional regulatory interactions. Sci. Rep., 5:11432, June 2015.
[grnSZSanchezPerezS+18]
Alberto Santos-Zavaleta, Mishael Sánchez-Pérez, Heladia Salgado, David A Velázquez-Ramírez, Socorro Gama-Castro, Víctor H Tierrafría, Stephen J W Busby, Patricia Aquino, Xin Fang, Bernhard O Palsson, James E Galagan, and Julio Collado-Vides. A unified resource for transcriptional regulation in escherichia coli K-12 incorporating high-throughput-generated binding data into RegulonDB version 10.0. BMC Biol., 16(1):91, August 2018.
[grnSSRP+20]
Jonas Schulte-Schrepping, Nico Reusch, Daniela Paclik, Kevin Baßler, Stephan Schlickeiser, Bowen Zhang, Benjamin Krämer, Tobias Krammer, Sophia Brumhard, Lorenzo Bonaguro, Elena De Domenico, Daniel Wendisch, Martin Grasshoff, Theodore S. Kapellos, Michael Beckstette, Tal Pecht, Adem Saglam, Oliver Dietrich, Henrik E. Mei, Axel R. Schulz, Claudia Conrad, Désirée Kunkel, Ehsan Vafadarnejad, Cheng-Jian Xu, Arik Horne, Miriam Herbert, Anna Drews, Charlotte Thibeault, Moritz Pfeiffer, Stefan Hippenstiel, Andreas Hocke, Holger Müller-Redetzky, Katrin-Moira Heim, Felix Machleidt, Alexander Uhrig, Laure Bosquillon de Jarcy, Linda Jürgens, Miriam Stegemann, Christoph R. Glösenkamp, Hans-Dieter Volk, Christine Goffinet, Markus Landthaler, Emanuel Wyler, Philipp Georg, Maria Schneider, Chantip Dang-Heine, Nick Neuwinger, Kai Kappert, Rudolf Tauber, Victor Corman, Jan Raabe, Kim Melanie Kaiser, Michael To Vinh, Gereon Rieke, Christian Meisel, Thomas Ulas, Matthias Becker, Robert Geffers, Martin Witzenrath, Christian Drosten, Norbert Suttorp, Christof von Kalle, Florian Kurth, Kristian Händler, Joachim L. Schultze, Anna C. Aschenbrenner, Yang Li, Jacob Nattermann, Birgit Sawitzki, Antoine-Emmanuel Saliba, Leif Erik Sander, and Deutsche COVID-19 OMICS Initiative (DeCOI). Severe covid-19 is marked by a dysregulated myeloid cell compartment. Cell, 182(6):1419–1440.e23, Sep 2020. S0092-8674(20)30992-2[PII]. URL: https://doi.org/10.1016/j.cell.2020.08.001, doi:10.1016/j.cell.2020.08.001.
20.1. Motivation