SCS【13】单细胞转录组之识别细胞对“基因集”的响应 (AUCell)
点击关注,桓峰基因
桓峰基因公众号推出单细胞系列教程,有需要生信分析的老师可以联系我们!首选看下转录分析教程整理如下:
SCS【4】单细胞转录组数据可视化分析 (Seurat 4.0)
SCS【6】单细胞转录组之细胞类型自动注释 (SingleR)
SCS【7】单细胞转录组之轨迹分析 (Monocle 3) 聚类、分类和计数细胞
SCS【8】单细胞转录组之筛选标记基因 (Monocle 3)
SCS【9】单细胞转录组之构建细胞轨迹 (Monocle 3)
SCS【10】单细胞转录组之差异表达分析 (Monocle 3)
SCS【11】单细胞ATAC-seq 可视化分析 (Cicero)
SCS【12】单细胞转录组之评估不同单细胞亚群的分化潜能 (Cytotrace)
SCS【13】单细胞转录组之识别细胞对“基因集”的响应 (AUCell)
今天来说说单细胞转录组数据中识别细胞对“基因集”的响应,学会这些分析结果,距离发文章就只差样本的选择了,有创新性的样本将成为文章的亮点,并不是分析内容了!
前 言
AUC允许在单细胞RNA数据集中识别具有活性基因集(如gene signature,gene module)的细胞。AUCell使用曲线下面积来计算输入基因集的一个关键子集是否在每个细胞的表达基因中富集。AUCell和GSVA的算法是一回事,都是排序。AUCell借鉴了ssGSVA的算法,但是在排序的时候,有些处理不太一样。在SCENIC转录因子预测的分析过程中,已经封装了AUCell。AUCell的工作流程分为三步,分别通过三个函数完成:
-
AUCell_buildRankings : Build the rank;
-
AUCell_calcAUC : Calculate the Area Under the Curve (AUC);
-
AUCell_exploreThresholds : Set the assignment thresholds;
软件包安装
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
# To support paralell execution:
BiocManager::install(c("doMC", "doRNG","doSNOW"))
# For the main example:
BiocManager::install(c("mixtools", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
BiocManager::install(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh",
"dynamicTreeCut","R2HTML","Rtsne", "zoo"))
数据读取
AUCell所需要的输入数据分为两部分,矩阵数据和基因集数据。
矩阵数据我们通过下载GEO上的数据集GSE60361下载,之后整理可以的矩阵:
library(AUCell)
dir.create("AUCell_tutorial")
setwd("AUCell_tutorial")
if (!requireNamespace("GEOquery", quietly = TRUE)) BiocManager::install("GEOquery")
library(GEOquery)
# gse <- getGEO('GSE60361') # does not work, the matrix is in a suppl file
geoFile <- "GSE60361_C1-3005-Expression.txt.gz"
# download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60361/suppl/GSE60361_C1-3005-Expression.txt.gz',
# destfile = geoFile)
library(data.table)
exprMatrix <- fread(geoFile, sep = "\t")
geneNames <- unname(unlist(exprMatrix[, 1, with = FALSE]))
exprMatrix <- as.matrix(exprMatrix[, -1, with = FALSE])
rownames(exprMatrix) <- geneNames
exprMatrix <- exprMatrix[unique(rownames(exprMatrix)), ]
dim(exprMatrix)
exprMatrix[1:5, 1:4]
# Remove file(s) downloaded:
file.remove(geoFile)
# Convert to sparse
library(Matrix)
exprMatrix <- as(exprMatrix, "dgCMatrix")
# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file = "exprMatrix_MouseBrain.RData")
基因集是AUCell软件包自带的直接调取利用getGmt读取即可:
library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file("examples", package = "AUCell")), "geneSignatures.gmt",
sep = "/")
geneSets <- getGmt(gmtFile)
geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix))
cbind(nGenes(geneSets))
geneSets <- setGeneSetNames(geneSets, newNames = paste(names(geneSets), " (", nGenes(geneSets),
"g)", sep = ""))
# Random
set.seed(321)
extraGeneSets <- c(GeneSet(sample(rownames(exprMatrix), 50), setName = "Random (50g)"),
GeneSet(sample(rownames(exprMatrix), 500), setName = "Random (500g)"))
countsPerGene <- apply(exprMatrix, 1, function(x) sum(x > 0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets, GeneSet(sample(names(countsPerGene)[which(countsPerGene >
quantile(countsPerGene, probs = 0.95))], 100), setName = "HK-like (100g)"))
geneSets <- GeneSetCollection(c(geneSets, extraGeneSets))
names(geneSets)
例子操作
计算gene signatures分值
为每个细胞建立基因表达排序,然后进行计算基因标记的富集(AUC):
cells_AUC <- AUCell_run(exprMatrix, geneSets)
save(cells_AUC, file = "cells_AUC.RData")
cells_rankings <- AUCell_buildRankings(exprMatrix, plotStats = FALSE, splitByBlocks = TRUE)
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
cells_rankings
确定具有给定基因特征或活性基因集的细胞
AUC表示在signature中表达的基因的比例,以及与细胞内其他基因的相对表达值。我们可以使用此属性根据基因集的表达来探索数据集中存在的细胞种群。
set.seed(123)
par(mfrow = c(3, 3))
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist = TRUE, assign = TRUE)
绘制特定基因集的AUC直方图,并设置新的阈值:
cells_assignment$Oligodendrocyte_Cahoy$aucThr$thresholds
cells_assignment$Oligodendrocyte_Cahoy$aucThr$selected
oligodencrocytesAssigned <- cells_assignment$Oligodendrocyte_Cahoy$assignment
length(oligodencrocytesAssigned)
head(oligodencrocytesAssigned)
geneSetName <- rownames(cells_AUC)[grep("Oligodendrocyte_Cahoy", rownames(cells_AUC))]
AUCell_plotHist(cells_AUC[geneSetName, ], aucThr = 0.25)
abline(v = 0.25)
为这个新的阈值分配细胞:
newSelectedCells <- names(which(getAUC(cells_AUC)[geneSetName, ] > 0.08))
length(newSelectedCells)
head(newSelectedCells)
探索细胞分配(表格和热图),细胞赋值存储在$assignment中。提取所有基因集的细胞,并将其转化为一个表格,并绘制为直方图:
cellsAssigned <- lapply(cells_assignment, function(x) x$assignment)
assignmentTable <- reshape2::melt(cellsAssigned, value.name = "cell")
colnames(assignmentTable)[2] <- "geneSet"
head(assignmentTable)
assignmentMat <- table(assignmentTable[, "geneSet"], assignmentTable[, "cell"])
assignmentMat[, 1:2]
set.seed(123)
miniAssigMat <- assignmentMat[, sample(1:ncol(assignmentMat), 100)]
library(NMF)
aheatmap(miniAssigMat, scale = "none", color = "black", legend = FALSE)
library(DT)
datatable(assignmentTable, options = list(pageLength = 10), filter = "top")
根据signatrue评分探索细胞/簇
AUC评分还可以用于探索以前聚类分析的输出(反之亦然)。
使用获得的不同signature的AUC为t-SNE上色,该t-SNE基于之前运行的整个表达矩阵(即不是5000个随机基因)。
load(paste(file.path(system.file("examples", package = "AUCell")), "cellsTsne.RData",
sep = "/"))
cellsTsne <- cellsTsne$Y
plot(cellsTsne, pch = 16, cex = 0.3)
这个t-SNE是用下面的代码创建的(需要一段时间来运行):
sumByGene <- apply(mouseBrainExprMatrix, 1, sum)
exprMatSubset <- mouseBrainExprMatrix[which(sumByGene > 0), ]
logMatrix <- log2(exprMatSubset + 1)
logMatrix <- as.matrix(logMatrix)
library(Rtsne)
set.seed(123)
cellsTsne <- Rtsne(t(logMatrix)) ##运行时间长
rownames(cellsTsne$Y) <- colnames(logMatrix)
colnames(cellsTsne$Y) <- c("tsne1", "tsne2")
save(cellsTsne, file = "cellsTsne.RData")
load("cellsTsne.RData")
t-SNE可以根据AUC评分进行着色。为了突出显示根据特征更可能属于该细胞类型的细胞簇,我们将把细胞分为通过分配阈值的细胞(用粉色-红色表示)和没有通过分配阈值的细胞(用黑色-蓝色表示)。
当然,在解释时也应该记住这些签名的来源(例如,这些签名是从大体积RNA-seq分析中获得的)。此外,只在5000个基因上运行本教程可能会带来一些额外的噪音。
对于本例,我们使用了自动分配的阈值:
selectedThresholds <- getThresholdSelected(cells_assignment)
par(mfrow = c(3, 3)) # Splits the plot into two rows and three columns
for (geneSetName in names(selectedThresholds)) {
nBreaks <- 5# Number of levels in the color palettes
# Color palette for the cells that do not pass the threshold
colorPal_Neg <- (grDevices::colorRampPalette(c("black", "blue", "skyblue")))(nBreaks)
# Color palette for the cells that pass the threshold
colorPal_Pos <- (grDevices::colorRampPalette(c("pink", "magenta", "red")))(nBreaks)
# Split cells according to their AUC value for the gene set
passThreshold <- getAUC(cells_AUC)[geneSetName, ] > selectedThresholds[geneSetName]
if (sum(passThreshold) > 0) {
aucSplit <- split(getAUC(cells_AUC)[geneSetName, ], passThreshold)
# Assign cell color
cellColor <- c(setNames(colorPal_Neg[cut(aucSplit[[1]], breaks = nBreaks)],
names(aucSplit[[1]])), setNames(colorPal_Pos[cut(aucSplit[[2]], breaks = nBreaks)],
names(aucSplit[[2]])))
# Plot
plot(cellsTsne$Y, main = geneSetName, sub = "Pink/red cells pass the threshold",
col = cellColor[rownames(cellsTsne$Y)], pch = 16)
这种图也可以用来探索分配阈值。例如:
selectedThresholds[2] <- 0.25
par(mfrow = c(2, 3))
AUCell_plotTSNE(tSNE = cellsTsne$Y, exprMat = exprMatrix, cellsAUC = cells_AUC[1:2,
], thresholds = selectedThresholds)
与均值比较
评估基因特征表达最直观的方法之一是计算平均表达。然而,计算原始计数上的基因签名的平均值将导致读取次数更多的细胞(无论是技术上还是生物学上的原因)拥有更高的大多数签名的平均值。为了减少这种影响,需要某种类型的库大小规范化。由于AUCell单独评估每个单元上的签名检测,它自动补偿库的大小,并使其应用程序更容易,独立于数据单位和应用的其他规范化(例如原始计数、TPM、UMI计数)。
## Comparison with mean
logMat <- log2(exprMatrix + 2)
meanByGs <- t(sapply(geneSets, function(gs) colMeans(logMat[geneIds(gs), ])))
rownames(meanByGs) <- names(geneSets)
colorPal <- (grDevices::colorRampPalette(c("black", "red")))(nBreaks)
par(mfrow = c(3, 3))
for (geneSetName in names(geneSets)) {
cellColor <- setNames(colorPal[cut(meanByGs[geneSetName, ], breaks = nBreaks)],
names(meanByGs[geneSetName, ]))
plot(cellsTsne$Y, main = geneSetName, axes = FALSE, xlab = "", ylab = "", sub = "Expression mean",
col = cellColor[rownames(cellsTsne$Y)], pch = 16)
绘制tSNE降维的AUC:
AUCell_plotTSNE(tSNE = cellsTsne$Y, exprMat = exprMatrix, plots = "AUC", cellsAUC = cells_AUC[1:9,
], thresholds = selectedThresholds)
检测到的基因数量
nGenesPerCell <- apply(exprMatrix, 2, function(x) sum(x > 0))
colorPal <- grDevices::colorRampPalette(c("darkgreen", "yellow", "red"))
cellColorNgenes <- setNames(adjustcolor(colorPal(10), alpha.f = 0.8)[as.numeric(cut(nGenesPerCell,
breaks = 10, right = FALSE, include.lowest = TRUE))], names(nGenesPerCell))
plot(cellsTsne$Y, axes = FALSE, xlab = "", ylab = "", sub = "Number of detected genes",
col = cellColorNgenes[rownames(cellsTsne$Y)], pch = 16)
How reliable is AUCell? (Confusion matrix)
当使用AUCell识别单元格类型时,所识别单元格的可靠性将取决于使用的签名,以及数据集中单元格类型之间的差异。在任何情况下,AUCell对噪声都相当稳健,无论是在数据上还是在基因集上。
这是用AUCell(本例)识别的细胞与最初发表的细胞类型的比较:
# 'Real' cell type (e.g. provided in the publication)
cellLabels <- paste(file.path(system.file("examples", package = "AUCell")), "mouseBrain_cellLabels.tsv",
sep = "/")
cellLabels <- read.table(cellLabels, row.names = 1, header = TRUE, sep = "\t")
# Confusion matrix:
cellTypeNames <- unique(cellLabels[, "level1class"])