Hi. I want to get the exon annotations for a list of chromosome intervals (hg19). Here is the code:

library(biomaRt)

getBM_value <- list(
  chromosome_name = bed_df$chromosome_name,
  start = bed_df$start,
  end = bed_df$end
)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl', host = "grch37.ensembl.org"))
fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)

But it returns an error:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300001 milliseconds with 163661 bytes received

I've tried the mirror argument.

mart <- useEnsembl(biomart='ensembl', dataset='hsapiens_gene_ensembl', mirror = "uswest", GRCh = 37)
Warning message:
In useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl",  :
  version or GRCh arguments can not be used together with the mirror argument.
  We will ignore the mirror argument and connect to main Ensembl site.
fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)
Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300001 milliseconds with 323405 bytes received

When calling getBM, you can provide a CURLHandle with custom settings. This worked for me when I had a similar timeout problem:

getBM( ... , curl = curl::new_handle(timeout_ms=3600000))

Many elements of biomaRt will now respect the setting applied in options('timeout'), so you can use that mechanism to try and adjust this.
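For example (a minimal sketch; the one-hour value is only an illustration):

# Raise R's download timeout (value in seconds) before running the query;
# recent biomaRt releases consult options("timeout") for their web requests.
options(timeout = 3600)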

I'm pretty sure that curl argument is no longer used and should be deprecated.

The problem is that your query is just too big. We recommend a maximum of 500 regions in our web interface because of server limitations, and it's the same server behind both the web interface and the R interface.

You could chunk your query in biomaRt, splitting your list up or even running each region as a single query.
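For illustration, a minimal chunking sketch under the original poster's setup (bed_df and the attribute list come from the question above; the chunk size of 200 is an arbitrary assumption, not a documented server limit):

library(biomaRt)

mart <- useEnsembl(biomart = 'ensembl',
                   dataset = 'hsapiens_gene_ensembl',
                   GRCh = 37)

# Split the rows of bed_df into chunks of at most 200 regions each.
idx <- split(seq_len(nrow(bed_df)), ceiling(seq_len(nrow(bed_df)) / 200))

# Query each chunk separately and stack the results.
anno_list <- lapply(idx, function(i) {
  getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end',
                       'strand', 'ensembl_gene_id', 'ensembl_exon_id'),
        filters = c('chromosome_name', 'start', 'end'),
        values  = list(chromosome_name = bed_df$chromosome_name[i],
                       start = bed_df$start[i],
                       end   = bed_df$end[i]),
        mart    = mart)
})
fupanel_bed_anno <- do.call(rbind, anno_list)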

Another option is to use the REST API overlap endpoint, which you can script around in your preferred programming language. This may be better than BioMart, since BioMart will return all the genes that overlap your regions and all the exons of all of those genes. If you just want the exons that overlap your loci, the REST endpoint will do that for you.
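As a sketch in R, a single-region query against the GRCh37 REST server using httr and jsonlite (the region string is a made-up example, and the returned column names come from the service, so inspect them with str()):

library(httr)
library(jsonlite)

# Ask the overlap endpoint for exons overlapping one region.
# Region format is "chromosome:start-end".
region <- "7:140424943-140624564"   # example region, an assumption
resp <- GET(paste0("https://grch37.rest.ensembl.org/overlap/region/human/",
                   region, "?feature=exon"),
            content_type("application/json"))
stop_for_status(resp)
exons <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(exons)   # one row per overlapping exon

Looping over all the regions in your list and rbind-ing the results would follow the same pattern as the chunked biomaRt query above.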

Emily,

I had a similar issue but my query is not large:

Failed <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"), mart = ensembl)

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [dec2021.archive.ensembl.org:443] Operation timed out after 300000 milliseconds with 9960752 bytes received

Emily,

I found this thread while coming across a similar issue. I am trying to find all refSNP IDs (rs numbers) for a set of immune system genes (there are 2960 Ensembl gene (ENSG) IDs in the set) and I am running into the timeout issue as well.

I have tried subsetting the query into smaller and smaller increments of ENSG IDs, to the point where it is doing them one at a time in a for loop. I get through about 2 ENSG IDs and then during the 3rd it gives me the timeout (see the snippet of the code below).

What should I do?

library(biomaRt)

mart.snp <- useMart("ENSEMBL_MART_SNP", "hsapiens_snp", host = "https://grch37.ensembl.org")
Immune.rs_genes <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(Immune.rs_genes) <- c("refsnp_id", "ensembl_gene_stable_id", "associated_gene")
chunks <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE)) # Function that cuts vector x into n chunks
GO_immune_ENGID <- read.csv("GO_Immune_ENGIDs.csv") # a single-column csv with 2960 ENSG IDs
Immune.rs_sets <- chunks(GO_immune_ENGID$x, 1500) # creates a list of 1500 vectors containing 2 ENSG IDs each
for (Sect in 1:1500) {
  Small_set <- getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene"),
                     filters = "ensembl_gene", values = Immune.rs_sets[[Sect]],
                     mart = mart.snp, verbose = TRUE)
  Immune.rs_genes <- rbind(Immune.rs_genes, Small_set)
  Sys.sleep(2) # Sleep for 2 seconds to avoid spamming the query system; I have tried with and without this.
  print(Sect)
}

Based on the verbose printout from above, it does only query 2 ENSG IDs at a time. This is the output from the code above:

Cache found
[1] 1
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count='0' datasetConfigVersion='0.6' header='1' formatter='TSV' requestid='biomaRt'> <Dataset name = 'hsapiens_snp'><Attribute name = 'refsnp_id'/><Attribute name = 'ensembl_gene_stable_id'/><Attribute name = 'associated_gene'/><Filter name = "ensembl_gene" value = "ENSG00000100345,ENSG00000134516" /></Dataset></Query>
Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300000 milliseconds with 53069 bytes received

The [1] is clearly the print output, so I am not sure what is going on. I checked my internet speed and it is 900 Mb/s on my end.