Be careful when using the XML feature in Ensembl’s Biomart!
Recently, I needed to download cDNA sequences for various organisms for a project. This was for a web service tool that will be used regularly, so I wanted to make sure that the sequence files were easy to update. I decided to use Biomart to retrieve the sequences: once you have defined what you want to download, you can simply generate an XML file (there is an XML button at the top you can click) which describes the query (organism, FASTA header, cDNA, etc.). I generated an XML file for each organism, like this one for human:
<!DOCTYPE Query>
<Query virtualSchemaName = "default" formatter = "FASTA" header = "0" uniqueRows = "0" count = "" datasetConfigVersion = "0.6" >
    <Dataset name = "hsapiens_gene_ensembl" interface = "default" >
        <Attribute name = "ensembl_gene_id" />
        <Attribute name = "ensembl_transcript_id" />
        <Attribute name = "cdna" />
        <Attribute name = "external_gene_id" />
        <Attribute name = "description" />
    </Dataset>
</Query>
Then I used a simple Perl script from EBI to run the XML file:
#!/usr/bin/perl
# an example script demonstrating the use of BioMart webservice
use strict;
use LWP::UserAgent;

open (FH,$ARGV[0]) || die ("\nUsage: perl postXML.pl Query.xml\n\n");

# read the whole query file into $xml
my $xml;
while (<FH>){
    $xml .= $_;
}
close(FH);

my $path="http://www.biomart.org/biomart/martservice?";
my $request = HTTP::Request->new("POST",$path,HTTP::Headers->new(),'query='.$xml."\n");
my $ua = LWP::UserAgent->new;

my $response;

# stream the response in chunks of about 1000 bytes and print each chunk as it arrives
$ua->request($request,
    sub{
        my($data, $response) = @_;
        if ($response->is_success) {
            print "$data";
        }
        else {
            warn ("Problems with the web server: ".$response->status_line);
        }
    },1000);
This was nice, simple and easy to maintain with a cron job, or so I thought. In practice, running this script is quite unstable: it frequently stops without a warning, so I never knew whether it had stopped because the connection was lost or because the download was finished. And even though you can do a count at Biomart to see what to expect, that number does not take the various splice alternatives into account as far as I can see, so I never knew whether I had downloaded all the sequences or not. More dramatically, when I used this XML procedure to download 5′UTRs, I received wrong sequences that had nothing to do with the 5′UTRs I got from a manual download.
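One way to at least detect a truncated download is to inspect the output after the fact. The sketch below is in Python rather than Perl, and it rests on two assumptions that are not part of my setup above. First, that `completionStamp = "1"` has been added to the Query element, in which case BioMart is documented to append a final `[success]` line to complete results. Second, that the second `|`-separated field of each FASTA header is the transcript ID, which matches the attribute order in the XML above. The function name `check_fasta` and the sample IDs are made up for illustration.

```python
def check_fasta(text, id_field=1):
    """Summarise a BioMart FASTA download held in memory as a string.

    Assumes the query was sent with completionStamp = "1", so that a
    complete result ends with a line reading "[success]".
    """
    lines = text.splitlines()
    # Drop trailing blank lines so the completion stamp, if present, is last.
    while lines and not lines[-1].strip():
        lines.pop()
    complete = bool(lines) and lines[-1].strip() == "[success]"
    if complete:
        lines.pop()  # the stamp is not part of the sequence data

    ids = []        # one entry per FASTA record, in order
    empty = []      # IDs of records that have a header but no sequence
    current = None
    length = 0
    for line in lines:
        if line.startswith(">"):
            if current is not None and length == 0:
                empty.append(current)
            fields = line[1:].split("|")
            # Fall back to the whole header if it has too few fields.
            current = fields[id_field] if len(fields) > id_field else line[1:]
            ids.append(current)
            length = 0
        else:
            length += len(line.strip())
    if current is not None and length == 0:
        empty.append(current)

    return {
        "complete": complete,
        "records": len(ids),
        "unique_ids": len(set(ids)),
        "empty_records": empty,
    }


# Made-up sample: two transcripts of one gene, followed by the stamp.
sample = (
    ">ENSG01|ENST01|GENE1|desc\nACGT\nAC\n"
    ">ENSG01|ENST02|GENE1|desc\nGGCC\n"
    "[success]\n"
)
report = check_fasta(sample)
```

If `report["complete"]` is false, the download should be treated as truncated and retried. The record count can also be compared between runs. None of this would catch the wrong 5′UTR sequences, since it only checks that the file arrived whole.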
So be careful with Biomart’s XML solution. It’s a great idea and I would love to see it work in the future, so if the great people at Ensembl read this, please have a look at this issue. It is also quite possible that this XML procedure works well when retrieving smaller amounts of data, like just a few genes. But for someone as greedy as me, going for the whole thing, I couldn’t figure out a stable solution that guaranteed the download was 100% correct.
Also, when downloading such big files manually from Biomart, like all cDNAs for a given organism, be sure to choose the gzipped download. You will on occasion lose the connection without any warning, and the only way to tell whether you’ve downloaded the whole thing is to gunzip the file: it will fail with an error message if the file is incomplete.
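The gunzip check is easy to automate. Here is a minimal sketch in Python, using only the standard library; the function name `gzip_is_complete` is my own invention. It reads the file to the end and treats any decompression error as an incomplete download, which is roughly what `gzip -t` does on the command line.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end without error.

    A truncated file raises EOFError partway through; a corrupted one raises
    OSError (gzip.BadGzipFile) or zlib.error. A missing file also counts as
    "not complete", because FileNotFoundError is a subclass of OSError.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a large FASTA file never sits in memory at once.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

A cron job could call this on each freshly downloaded file and retry the download whenever it returns `False`.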