Hi. I want to get the exon annotations for a list of chromosome intervals (hg19). Here is the code:

library(biomaRt)

getBM_value <- list(
  chromosome_name = bed_df$chromosome_name,
  start = bed_df$start,
  end = bed_df$end
)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl', host = "grch37.ensembl.org"))
fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)

But it returns an error:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300001 milliseconds with 163661 bytes received

I've tried the mirror argument.

mart <- useEnsembl(biomart='ensembl', dataset='hsapiens_gene_ensembl', mirror = "uswest", GRCh = 37)
Warning message:
In useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl",  :
  version or GRCh arguments can not be used together with the mirror argument.
  We will ignore the mirror argument and connect to main Ensembl site.
fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300001 milliseconds with 323405 bytes received

When calling getBM, you can provide a CURLHandle with custom settings. This worked for me when I had a similar timeout problem:

getBM( ... , curl = curl::new_handle(timeout_ms=3600000))

Many elements of biomaRt will now respect the setting applied in options('timeout'), so you can use that mechanism to try and adjust this.
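For example (a minimal sketch; the one-hour value is only an illustration):

# Raise R's download timeout (value in seconds) before running the query;
# recent biomaRt releases consult options("timeout") for their web requests.
options(timeout = 3600)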

I'm pretty sure that curl argument is no longer used and should be deprecated.

The problem is that your query is just too big. We recommend a maximum of 500 regions in our web interface because of server limitations, and it's the same server behind both the web interface and the R interface.

You could chunk your query in biomaRt, splitting your list up or even running each region as a single query.
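For illustration, a minimal chunking sketch under the original poster's setup (bed_df and the attribute list come from the question above; the chunk size of 200 is an arbitrary assumption, not a documented server limit):

library(biomaRt)

mart <- useEnsembl(biomart = 'ensembl',
                   dataset = 'hsapiens_gene_ensembl',
                   GRCh = 37)

# Split the rows of bed_df into chunks of at most 200 regions each.
idx <- split(seq_len(nrow(bed_df)), ceiling(seq_len(nrow(bed_df)) / 200))

# Query each chunk separately and stack the results.
anno_list <- lapply(idx, function(i) {
  getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end',
                       'strand', 'ensembl_gene_id', 'ensembl_exon_id'),
        filters = c('chromosome_name', 'start', 'end'),
        values  = list(chromosome_name = bed_df$chromosome_name[i],
                       start = bed_df$start[i],
                       end   = bed_df$end[i]),
        mart    = mart)
})
fupanel_bed_anno <- do.call(rbind, anno_list)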

Another option is to use the REST API overlap endpoint, which you can script around in your preferred programming language. This may be better than BioMart, since BioMart will return all the genes that overlap your regions and all the exons of all of those genes. If you just want the exons that overlap your loci, the REST endpoint will do that for you.
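As a sketch in R, a single-region query against the GRCh37 REST server using httr and jsonlite (the region string is a made-up example, and the returned column names come from the service, so inspect them with str()):

library(httr)
library(jsonlite)

# Ask the overlap endpoint for exons overlapping one region.
# Region format is "chromosome:start-end".
region <- "7:140424943-140624564"   # example region, an assumption
resp <- GET(paste0("https://grch37.rest.ensembl.org/overlap/region/human/",
                   region, "?feature=exon"),
            content_type("application/json"))
stop_for_status(resp)
exons <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(exons)   # one row per overlapping exon

Looping over all the regions in your list and rbind-ing the results would follow the same pattern as the chunked biomaRt query above.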

Emily,

I had a similar issue but my query is not large:

Failed <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"), mart = ensembl)

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [dec2021.archive.ensembl.org:443] Operation timed out after 300000 milliseconds with 9960752 bytes received

Emily,

I found this thread while coming across a similar issue. I am trying to find all refSNP IDs (rs numbers) for a set of immune system genes (there are 2960 Ensembl gene (ENSG) IDs in the set) and I am running into the timeout issue as well.

I have tried subsetting the query into smaller and smaller increments of ENSG IDs, to the point where it is doing them one at a time in a for loop. I get through about 2 ENSG IDs and then during the 3rd it gives me the timeout (see the snippet of the code below).

What should I do?

library(biomaRt)

mart.snp <- useMart("ENSEMBL_MART_SNP", "hsapiens_snp", host = "https://grch37.ensembl.org")
Immune.rs_genes <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(Immune.rs_genes) <- c("refsnp_id", "ensembl_gene_stable_id", "associated_gene")
chunks <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE)) # Function that cuts vector x into n chunks
GO_immune_ENGID <- read.csv("GO_Immune_ENGIDs.csv") # a single-column csv with 2960 ENSG IDs
Immune.rs_sets <- chunks(GO_immune_ENGID$x, 1500) # creates a list of 1500 vectors containing 2 ENSG IDs each
for (Sect in 1:1500) {
  Small_set <- getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene"),
                     filters = "ensembl_gene", values = Immune.rs_sets[[Sect]],
                     mart = mart.snp, verbose = TRUE)
  Immune.rs_genes <- rbind(Immune.rs_genes, Small_set)
  Sys.sleep(2) # Sleep for 2 seconds to avoid spamming the query system; I have tried with and without this.
  print(Sect)
}

Based on the verbose printout from above, it does only query 2 ENSG IDs at a time. This is the output from the code above:

Cache found
[1] 1
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count='0' datasetConfigVersion='0.6' header='1' formatter='TSV' requestid='biomaRt'> <Dataset name = 'hsapiens_snp'><Attribute name = 'refsnp_id'/><Attribute name = 'ensembl_gene_stable_id'/><Attribute name = 'associated_gene'/><Filter name = "ensembl_gene" value = "ENSG00000100345,ENSG00000134516" /></Dataset></Query>
Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300000 milliseconds with 53069 bytes received

The [1] is clearly the print output, so I am not sure what is going on. I checked my internet speed and it is 900 Mb/s on my end.