<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Biogeeks</title>
	<atom:link href="http://bio-geeks.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://bio-geeks.com</link>
	<description></description>
	<lastBuildDate>Thu, 27 May 2010 06:11:12 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>PubmedPDF updated</title>
		<link>http://bio-geeks.com/?p=863</link>
		<comments>http://bio-geeks.com/?p=863#comments</comments>
		<pubDate>Tue, 25 May 2010 07:37:08 +0000</pubDate>
		<dc:creator>Morten</dc:creator>
				<category><![CDATA[Cool Tools]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=863</guid>
		<description><![CDATA[]]></description>
			<content:encoded><![CDATA[<p>I have just committed a major update to the Biogeeks script for fetching PDF reprints of papers indexed in PubMed. It is available on GitHub <a href="http://github.com/mortenlindow/PubmedPDF">here</a>.</p>
<p>The <a href="http://bio-geeks.com/?p=749">first version</a> required the <a href="http://camping.rubyforge.org/">Camping rubygem</a>, but I have decoupled that dependency, cleaned up the code a bit and added a few tests.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=863</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sending emails with EC2 and correct reverse-dns</title>
		<link>http://bio-geeks.com/?p=858</link>
		<comments>http://bio-geeks.com/?p=858#comments</comments>
		<pubDate>Sat, 24 Apr 2010 10:35:00 +0000</pubDate>
		<dc:creator>elfar</dc:creator>
				<category><![CDATA[Geek stuff]]></category>
		<category><![CDATA[ec2]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=858</guid>
		<description><![CDATA[A major problem for many EC2 users has been sending email: reverse DNS issues cause the mail to be regarded as spam in many places.
I got it working just fine using the Elastic Load Balancer. I configure exim (with &#8220;dpkg-reconfigure exim4-config&#8221;) to use my Elastic Load Balancer as its mail server name. For example, [...]]]></description>
			<content:encoded><![CDATA[<p>A major problem for many EC2 users has been sending email: reverse DNS issues cause the mail to be regarded as spam in many places.</p>
<p>I got it working just fine using the Elastic Load Balancer. I configure exim (with &#8220;dpkg-reconfigure exim4-config&#8221;) to use my Elastic Load Balancer as its mail server name. For example, if I have the website example.com, I make sure the DNS for this site points to my load balancer (and in the load balancer I register the server or servers hosting my site). Then, when I send mail from somebody@example.com, the reverse DNS lookup works out correctly.</p>
<p>P.S. You cannot actually point the DNS for example.com itself to the load balancer: the load balancer is not an IP, so you have to use a CNAME record, and only www.example.com (or another subdomain) can point to the load balancer. You could point example.com to an Elastic IP for one of your servers and have that server permanently redirect example.com to www.example.com.</p>
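<p>Receiving mail servers differ in exactly what they test, but a common check is that the sending IP has a reverse (PTR) name and that this name resolves back to the same IP. Here is a minimal Python sketch of that check using only the standard socket module. The function name and the injectable resolver arguments are made up for illustration, and the IP in the comment is a placeholder.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import socket


def reverse_dns_matches(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Return True if the PTR name of ip resolves back to ip."""
    try:
        name = reverse(ip)[0]         # reverse lookup: IP to host name
        addresses = forward(name)[2]  # forward lookup: host name to IPs
    except OSError:                   # covers socket.herror and socket.gaierror
        return False
    return ip in addresses


# Hypothetical usage; replace the placeholder with the public IP of your server:
# print(reverse_dns_matches("203.0.113.10"))</pre></div></div>

<p>Run against the address your mail actually leaves from, this gives a quick first indication of whether the reverse DNS setup is consistent before you start sending.</p>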
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=858</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Amazon Cloud</title>
		<link>http://bio-geeks.com/?p=818</link>
		<comments>http://bio-geeks.com/?p=818#comments</comments>
		<pubDate>Sun, 08 Nov 2009 16:16:47 +0000</pubDate>
		<dc:creator>elfar</dc:creator>
				<category><![CDATA[resource]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=818</guid>
		<description><![CDATA[Lately I&#8217;ve been playing with Amazon Elastic Compute Cloud (Amazon EC2) and so far I really like what I see. Their virtual setup makes EC2 quite an attractive and flexible alternative to running the servers locally. For example, I could set up a server with EC2, install various bioinformatics tools and take a snapshot of this [...]]]></description>
			<content:encoded><![CDATA[<p>Lately I&#8217;ve been playing with Amazon Elastic Compute Cloud (Amazon EC2) and so far I really like what I see. Their virtual setup makes EC2 quite an attractive and flexible alternative to running the servers locally. For example, I could set up a server with EC2, install various bioinformatics tools and take a snapshot of this server. This &#8220;copy-paste&#8221; of a server allows me, in no time, to provide companies with an advanced and powerful bioinformatics server, which could easily be administered by either us or the client. It also makes it easy for companies and universities to set up a server that provides some service and, as the company grows and the demand for CPU power increases, to quickly &#8220;copy-paste&#8221; that server to add more servers.</p>
<p>Another great thing about EC2 is that in addition to great GUIs like ElasticFox there are also the EC2 API tools, which make it possible to script your server management. For example, you can use these tools to start and stop servers dynamically: if your servers are heavily loaded, you could define some server load thresholds and automatically add or remove servers based on these thresholds. Also, since Amazon charges by the hour, you can set your servers to shut down after office hours if your employees are the only people using them. If you use Ubuntu, <a href="https://help.ubuntu.com/community/EC2StartersGuide">this page</a> is a good place to start with the ec2-api-tools (of course you will first need an EC2 account).</p>
<p>You should be aware that once you shut down your server, all data will be lost. That&#8217;s why Amazon also has Elastic Block Store (EBS). Again scripting it with the ec2-api-tools, you can create an EBS volume, which is your disk space (and which, according to Amazon, should be as fast as or faster than local disks), and have your server mount it automatically when it starts. That way the server can write its data to the EBS volume and you avoid losing data when you stop the server.</p>
<p>I have only mentioned a fraction of the possibilities here and I am still fairly new to EC2, but so far I am really impressed with it and hope to use it a lot in the future. Thumbs up to the developer team at Amazon for some well-thought-out solutions.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=818</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Short news</title>
		<link>http://bio-geeks.com/?p=814</link>
		<comments>http://bio-geeks.com/?p=814#comments</comments>
		<pubDate>Fri, 06 Nov 2009 14:00:12 +0000</pubDate>
		<dc:creator>Troels</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[genomics]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[publication]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=814</guid>
		<description><![CDATA[This just in before the weekend:
Complete Genomics has published its first few genomes, and despite (somewhat) dodgy statistics on the quality of the assembly they are in business. More here.
Nature paper out on the &#8216;complete&#8217; epigenome hints that embryonic cells may use different mechanisms for gene regulation than more differentiated cells. Abstract here.
Lastly, Microsoft would [...]]]></description>
			<content:encoded><![CDATA[<p>This just in before the weekend:</p>
<p>Complete Genomics has published its first few genomes, and despite (somewhat) dodgy statistics on the quality of the assembly they are in business. More <a href="http://blogs.nature.com/news/thegreatbeyond/2009/11/complete_genomics_publishes_a.html">here</a>.</p>
<p>Nature paper out on the &#8216;complete&#8217; epigenome hints that embryonic cells may use different mechanisms for gene regulation than more differentiated cells. Abstract <a href="http://www.nature.com/nature/journal/vaop/ncurrent/abs/nature08514.html">here</a>.</p>
<p>Lastly, Microsoft would like to take a step into the world of bioinformatics. What they really want is to take a strong tradition for open source and wrap it in proprietary frameworks so we can be even more reliant on Microsoft products. Posted <a href="http://opendotdotdot.blogspot.com/2009/11/microsofts-biological-implants.html">here</a> (Glyn Moody&#8217;s comment) and <a href="http://research.microsoft.com/en-us/collaboration/tools/mbf.aspx">here</a> (Microsoft statement).</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=814</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Barcodes with checksums</title>
		<link>http://bio-geeks.com/?p=804</link>
		<comments>http://bio-geeks.com/?p=804#comments</comments>
		<pubDate>Fri, 30 Oct 2009 14:35:19 +0000</pubDate>
		<dc:creator>Troels</dc:creator>
				<category><![CDATA[Cool Tools]]></category>
		<category><![CDATA[methodology]]></category>
		<category><![CDATA[NGS]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=804</guid>
		<description><![CDATA[A small neat GUI for generating barcodes which incorporates some error checking. A very useful tool for multiplexing strategies.]]></description>
			<content:encoded><![CDATA[<p>If you are working with a limited sequence space you may be interested in <a href="http://www.illumina.com/technology/multiplexing_sequencing_assay.ilmn">multiplexing</a>, that is, running several samples in a single sequencing lane to limit cost and time. This method has already been used successfully in various settings and will likely be used increasingly as the number of reads gained from a single run increases. However, for this to work it is necessary to tag each read with a sample-specific identifier, the barcode. If the chosen barcodes are very similar, a single read error could cause the sequence to be misclassified. One solution is to choose an encoding scheme with inherent error checking. This solution is elegantly presented in Hamady et al., Nature Methods 2008 (PMID 18264105). A simpler approach is presented at seqanswers.com by the user <a href="http://seqanswers.com/forums/showthread.php?t=1369">wraithnot</a>.</p>
<p>In short, the idea is to map the letters A, C, G and T to the numbers 0 to 3. Then number the barcodes in base 10, for instance 0 to 15, and convert these numbers to base 4; the base 4 digits are then translated into the barcode sequence. Taking the sum of the base 4 digits modulo 4 gives a new number between 0 and 3. This number is our checksum, and it is also converted into the corresponding base. Now any single mutation in the barcode changes the sum so that it no longer produces the right checksum, and a mutation in the checksum base will not match the checksum of the unaffected barcode. Thus all single mutations in the barcode + checksum positions can be detected. For a worked example please refer to the original posting by user wraithnot.</p>
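<p>The scheme fits in a few lines of Python. The following is only a minimal sketch of the idea, not the code behind the GUI described below: the function names are made up for the example, the mapping A=0, C=1, G=2, T=3 is an assumption, and none of the GUI filters (triplets, palindromes, GC content) are applied.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3


def to_base4(number, positions):
    """Return the base 4 digits of number, padded to a fixed length."""
    digits = []
    for _ in range(positions):
        digits.append(number % 4)
        number //= 4
    return digits[::-1]


def make_barcode(number, positions):
    """Build a barcode and append its checksum base."""
    digits = to_base4(number, positions)
    checksum = sum(digits) % 4  # a number between 0 and 3
    return "".join(BASES[d] for d in digits) + BASES[checksum]


# All 16 two-position barcodes, each followed by its checksum base
barcodes = [make_barcode(n, 2) for n in range(16)]
print(barcodes)</pre></div></div>

<p>With three positions, make_barcode(9, 3) gives AGCT: the digits 0, 2, 1 spell AGC, and their sum, 3, corresponds to the checksum base T.</p>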
<p>But a good idea is not worth much if you cannot make use of it. Therefore we wrote a small Python GUI for generating barcodes using the above strategy. Below is an example of the GUI.</p>
<p>Assuming you have Python and wxPython installed, you can start the GUI by running python barcodeGUI.py, and you should see this (on OS X; it will look different depending on the platform):</p>
<p><img src="http://bio-geeks.com/wp-content/uploads/2009/10/Picture-1.png" alt="The Barcode Generator GUI on os x" width="673" height="404" /></p>
<p>In <strong>number of codes</strong> you specify the number of codes you want; leaving it blank will give you all the codes generated. Next you specify the <strong>number of positions</strong> the code can take; remember that you will be using one more position for the checksum base. Next are some options regarding the barcode: avoiding three identical consecutive bases (<strong>triplets</strong>), avoiding <strong>palindromes</strong> and requiring a specific <strong>GC content</strong>. You can also specify a small string that the code <strong>must not match</strong>, for instance the primer, linker or overhang caused by a restriction enzyme. Finally, pressing OK will let you save the barcodes to a tab-delimited file. Obviously, your settings may not allow the desired number of barcodes to be generated; in that case increase the number of positions or relax your filters.</p>
<p>Now here is an example where 16 barcodes are generated and a random single mutation is added at any position. The first line shows the numbers obtained by decoding the possibly ‘mutated’ barcode, followed by the bases read. The next line is the call on whether or not the barcode was mutated. Finally, the third line shows the possibly mutated barcode, followed by the original barcode and the corresponding checksum and checksum base. Remember that taking the sum of the barcode modulo 4 should give the checksum number.</p>
<p>[0, 2, 2] ['A', 'G', 'G', 'T']</p>
<p>Mutation</p>
<p>AGGT ('AGC', 3, 'T')</p>
<p>[3, 3, 3] ['T', 'T', 'T', 'G']</p>
<p>Mutation</p>
<p>TTTG ('TTA', 2, 'G')</p>
<p>[3, 0, 3] ['T', 'A', 'T', 'G']</p>
<p>No mutation</p>
<p>TATG ('TAT', 2, 'G')</p>
<p>The above example is shown if you run the Python script barcode.py.</p>
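<p>The check itself is small. The sketch below is not barcode.py, just an assumed re-implementation of the rule for illustration, with a made-up function name and the mapping A=0, C=1, G=2, T=3. It takes a read barcode whose last base is the checksum base and reports whether the two disagree.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">BASES = "ACGT"  # assumed mapping: A=0, C=1, G=2, T=3


def is_mutated(read_barcode):
    """Return True if the checksum base does not match the barcode."""
    digits = [BASES.index(base) for base in read_barcode[:-1]]
    expected = BASES[sum(digits) % 4]
    return expected != read_barcode[-1]


# The three reads from the example output above
for code in ["AGGT", "TTTG", "TATG"]:
    print(code, "Mutation" if is_mutated(code) else "No mutation")</pre></div></div>

<p>For the three reads above this flags AGGT and TTTG as mutated and accepts TATG. Note that the check only detects a single mutation; it cannot tell which position changed.</p>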
<p>Currently, this decoding is only included in the script version of the Python code; the GUI only provides a neat interface for generating the barcodes. Both are offered as open source, currently by contacting troels at bio-geeks dot com, but soon via our open source portal.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=804</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bioconductor 2.5</title>
		<link>http://bio-geeks.com/?p=811</link>
		<comments>http://bio-geeks.com/?p=811#comments</comments>
		<pubDate>Thu, 29 Oct 2009 15:50:29 +0000</pubDate>
		<dc:creator>Troels</dc:creator>
				<category><![CDATA[Geek stuff]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=811</guid>
		<description><![CDATA[Bioconductor 2.5 has been released, and many new packages are included. A fair number of the new packages are related to the analysis of data from sequencing platforms (ChIP-seq and RNA-seq) and the analysis of miRNA targets, just to mention some of the packages that looked interesting. See the full statement here.
]]></description>
			<content:encoded><![CDATA[<p>Bioconductor 2.5 has been released, and many new packages are included. A fair number of the new packages are related to the analysis of data from sequencing platforms (ChIP-seq and RNA-seq) and the analysis of miRNA targets, just to mention some of the packages that looked interesting. See the full statement <a href="https://mailman.stat.ethz.ch/pipermail/bioconductor/2009-October/030262.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=811</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The evolutionary history of human microRNAs in 10 lines of Ruby code</title>
		<link>http://bio-geeks.com/?p=776</link>
		<comments>http://bio-geeks.com/?p=776#comments</comments>
		<pubDate>Sun, 11 Oct 2009 11:00:00 +0000</pubDate>
		<dc:creator>anders</dc:creator>
				<category><![CDATA[Geek stuff]]></category>
		<category><![CDATA[microRNA]]></category>
		<category><![CDATA[resource]]></category>
		<category><![CDATA[ruby]]></category>
		<category><![CDATA[miRMaid]]></category>
		<category><![CDATA[miRNA]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=776</guid>
		<description><![CDATA[If you have a well-structured data model and an intuitive framework for querying your data it suddenly becomes feasible (and fun) to quickly address smaller scientific questions as they arise. If you use an expressive and intuitive query language it is also a lot easier to share your query/code with people in your group or in a scientific forum. miRMaid (www.mirmaid.org) provides such a software framework for miRNA data. If you are interested in microRNAs, or in doing something similar in your own data domain, then read on.]]></description>
			<content:encoded><![CDATA[<p>If you have a well-structured data model and an intuitive framework for querying your data it suddenly becomes feasible (and fun) to quickly address smaller scientific questions as they arise. If you use an expressive and intuitive query language it is also a lot easier to share your query/code with people in your group or in a scientific forum. miRMaid (<a href="http://www.mirmaid.org">www.mirmaid.org</a>) provides such a software framework for miRNA data. If you are interested in microRNAs, or in doing something similar in your own data domain, then read on.</p>
<p>I share an interest in microRNAs with my fellow biogeek, Morten Lindow. Using data from the official miRNA registry, <a href="http://www.mirbase.org">miRBase</a>, we have designed and implemented a data model and software framework for miRNA data. It is built in the <a href="http://rubyonrails.org/">Ruby on Rails</a> framework, which has a really intuitive object-relational layer (ActiveRecord); this basically means that you can access data in your SQL database as Ruby objects. If for some reason you prefer a programming language other than Ruby, you can use miRMaid’s general <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">RESTful</a> web-service interface. The software is open source (LGPL) and it can be used remotely on the web or as a local installation. I will not go into the gory details here, but if you are interested you can read more at <a href="http://www.mirmaid.org/">www.mirmaid.org</a>. The framework is NOT meant as a competitor to the official miRNA registry, <a href="http://www.mirbase.org/">www.miRBase.org</a>; instead it is tightly coupled to miRBase data and aims at providing bioinformaticians (and web-services) with easy and intuitive access to structured miRNA data.</p>
<p>I thought the best way to quickly introduce the framework would be a really short example, so that is what I am going to do in this post. I wanted a quick overview of the evolutionary history of human miRNAs. How many miRNAs are not conserved in other species? How many are conserved in, e.g., other mammals? Using the Ruby object-relational models, I can query the inter-linked models for species, miRNA precursors and precursor families (as defined by miRBase). For each human miRNA precursor, <em>p</em>, (line 5) we select the most general of the last common clades shared between human and the species of <em>p</em>’s family members (line 6+7). The counts are tallied in a hash and printed on the final line. Perhaps there is too much magic going on in line 7 but I had to keep the number of code lines below 10 <img src='http://bio-geeks.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>

<div class="wp_syntax"><div class="code"><pre class="ruby" style="font-family:monospace;">conservation = <span style="color:#CC00FF; font-weight:bold;">Hash</span>.<span style="color:#9900CC;">new</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#006666;">0</span><span style="color:#006600; font-weight:bold;">&#41;</span>
sp = Species.<span style="color:#9900CC;">find_by_abbreviation</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;hsa&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span>
tx = sp.<span style="color:#9900CC;">taxonomy</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;;&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#006600; font-weight:bold;">+</span> <span style="color:#006600; font-weight:bold;">&#91;</span>sp.<span style="color:#9900CC;">name</span><span style="color:#006600; font-weight:bold;">&#93;</span>
&nbsp;
sp.<span style="color:#9900CC;">precursors</span>.<span style="color:#9900CC;">each</span> <span style="color:#9966CC; font-weight:bold;">do</span> <span style="color:#006600; font-weight:bold;">|</span>p<span style="color:#006600; font-weight:bold;">|</span>
  ps = <span style="color:#CC0066; font-weight:bold;">p</span>.<span style="color:#9900CC;">precursor_family</span> ? <span style="color:#CC0066; font-weight:bold;">p</span>.<span style="color:#9900CC;">precursor_family</span>.<span style="color:#9900CC;">precursors</span> : <span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#CC0066; font-weight:bold;">p</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  clade = tx<span style="color:#006600; font-weight:bold;">&#91;</span>ps.<span style="color:#9900CC;">map</span><span style="color:#006600; font-weight:bold;">&#123;</span><span style="color:#006600; font-weight:bold;">|</span>x<span style="color:#006600; font-weight:bold;">|</span> <span style="color:#006600; font-weight:bold;">&#40;</span>tx <span style="color:#006600; font-weight:bold;">&amp;</span> <span style="color:#006600; font-weight:bold;">&#40;</span>x.<span style="color:#9900CC;">species</span>.<span style="color:#9900CC;">taxonomy</span>.<span style="color:#CC0066; font-weight:bold;">split</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#996600;">&quot;;&quot;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">+</span><span style="color:#006600; font-weight:bold;">&#91;</span>x.<span style="color:#9900CC;">species</span>.<span style="color:#9900CC;">name</span><span style="color:#006600; font-weight:bold;">&#93;</span><span style="color:#006600; font-weight:bold;">&#41;</span><span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">size</span><span style="color:#006600; font-weight:bold;">&#125;</span>.<span style="color:#9900CC;">min</span><span style="color:#006600; font-weight:bold;">-</span><span style="color:#006666;">1</span><span style="color:#006600; font-weight:bold;">&#93;</span>
  conservation<span style="color:#006600; font-weight:bold;">&#91;</span>clade<span style="color:#006600; font-weight:bold;">&#93;</span> <span style="color:#006600; font-weight:bold;">+</span>= <span style="color:#006666;">1</span>
<span style="color:#9966CC; font-weight:bold;">end</span>;nil
&nbsp;
pp conservation
<span style="color:#008000; font-style:italic;"># output</span>
<span style="color:#008000; font-style:italic;">#{&quot;Chordata&quot;=&gt;7,</span>
<span style="color:#008000; font-style:italic;"># &quot;Metazoa&quot;=&gt;3,</span>
<span style="color:#008000; font-style:italic;"># &quot;Homo sapiens&quot;=&gt;66,</span>
<span style="color:#008000; font-style:italic;"># &quot;Hominidae&quot;=&gt;83,</span>
<span style="color:#008000; font-style:italic;"># &quot;Mammalia&quot;=&gt;215,</span>
<span style="color:#008000; font-style:italic;"># &quot;Deuterostoma&quot;=&gt;3,</span>
<span style="color:#008000; font-style:italic;"># &quot;Primates&quot;=&gt;157,</span>
<span style="color:#008000; font-style:italic;"># &quot;Bilateria&quot;=&gt;63,</span>
<span style="color:#008000; font-style:italic;"># &quot;Vertebrata&quot;=&gt;124}</span></pre></div></div>
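
<p>Since line 7 packs a lot into one expression, here is the same logic unrolled as a small Python sketch. It does not use miRMaid: the lineages are shortened, hand-written toy lists rather than miRBase data, and the function name is made up for this illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">def most_general_clade(human_lineage, family_lineages):
    """Return the last human clade that is shared with every family member."""
    shared_counts = []
    for lineage in family_lineages:
        # Same idea as the Ruby array intersection of the two lineages
        shared = [clade for clade in human_lineage if clade in lineage]
        shared_counts.append(len(shared))
    # The member sharing the fewest clades decides how far back we must go
    return human_lineage[min(shared_counts) - 1]


# Shortened toy lineages, hand-written for this example (not miRBase data)
human = ["Metazoa", "Vertebrata", "Mammalia", "Primates", "Homo sapiens"]
mouse = ["Metazoa", "Vertebrata", "Mammalia", "Rodentia", "Mus musculus"]

print(most_general_clade(human, [human, mouse]))  # family with a mouse member
print(most_general_clade(human, [human]))         # human-only family</pre></div></div>

<p>With these toy lineages the first call prints Mammalia and the second prints Homo sapiens, which is how a precursor ends up counted under one clade in the tally.</p>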

<p>The plot below tells us that of the 721 human miRNA precursors, 66 do not have evidence in other species, and that 215 human precursors have mammalian (non-primate) origin.</p>
<p><img class="alignnone size-full wp-image-790" title="Evolution of human miRNAs" src="http://bio-geeks.com/wp-content/uploads/2009/10/evolution_of_human_mirnas.png" alt="Evolution of human miRNAs" width="591" height="301" /></p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=776</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Automatic PDF fetching of articles</title>
		<link>http://bio-geeks.com/?p=749</link>
		<comments>http://bio-geeks.com/?p=749#comments</comments>
		<pubDate>Sat, 03 Oct 2009 10:00:17 +0000</pubDate>
		<dc:creator>elfar</dc:creator>
				<category><![CDATA[Geek stuff]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[publication]]></category>
		<category><![CDATA[pubmed]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=749</guid>
		<description><![CDATA[Tired of clicking your way to the article PDFs you need? Check this out and find out how you can fully automate this process.]]></description>
			<content:encoded><![CDATA[<p>As a bioinformatician I don&#8217;t like doing anything &#8220;manually&#8221;. Despite that, I&#8217;ve spent years visiting PubMed, searching for some article and clicking my way to the PDF of that article. This process is excruciatingly tedious, so I have automated it. There are several projects out there trying to do something like this, but none of them matched exactly what I was looking for. What I wanted was a simple script where you could specify either a list of PubMed IDs or a search term, and the script would &#8220;automagically&#8221; hunt down any available PDFs matching the IDs or term and download them.</p>
<p>There is one project that did many of the things I was interested in: <a href="http://code.google.com/p/pdfetch/">pdfetch</a>, by Edoardo &#8216;Dado&#8217; Marcora, a nice simple script that allows you to set up a service on your localhost which you access with a given PubMed ID; the service then attempts to download the PDF and load it in your browser. BUT I could not get it to work out of the box, and I wanted something that made it simple to retrieve PDFs via the terminal. Luckily, Edoardo&#8217;s code was nicely written and easy to fix, so that it ran with the latest versions of the gems he used, and easy to update for any journal links that had changed since his implementation (these were the major reasons it did not run out of the box).</p>
<p>So I updated his script, added some journals and features, and implemented a couple of scripts to call it from the command line using PubMed IDs or search terms.</p>
<p><strong>Install:</strong><br />
These are Ruby scripts, so you will need to have Ruby installed. Furthermore, you will need to install some gems:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">sudo</span> <span style="color: #c20cb9; font-weight: bold;">apt-get</span> <span style="color: #c20cb9; font-weight: bold;">install</span> git-core <span style="color: #666666; font-style: italic;">#On ubuntu if you do not have git installed</span>
git clone git:<span style="color: #000000; font-weight: bold;">//</span>github.com<span style="color: #000000; font-weight: bold;">/</span>elfar<span style="color: #000000; font-weight: bold;">/</span>PubmedPDF.git
gem <span style="color: #c20cb9; font-weight: bold;">install</span> mechanize
gem <span style="color: #c20cb9; font-weight: bold;">install</span> bio <span style="color: #666666; font-style: italic;">#Not needed to load PDF's by pubmed IDs</span>
gem <span style="color: #c20cb9; font-weight: bold;">install</span> socksify  <span style="color: #666666; font-style: italic;">#only needed if you plan on using a SOCKS proxy</span></pre></div></div>

<p>For the bio gem, it is probably better to install it via GitHub:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">git clone git:<span style="color: #000000; font-weight: bold;">//</span>github.com<span style="color: #000000; font-weight: bold;">/</span>bioruby<span style="color: #000000; font-weight: bold;">/</span>bioruby.git
<span style="color: #7a0874; font-weight: bold;">cd</span> bioruby
<span style="color: #c20cb9; font-weight: bold;">sudo</span> ruby setup.rb</pre></div></div>

<p><strong>Call the scripts with no arguments to get some help:</strong></p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">pubmedid2pdf.rb
searchTerm2pdf.rb</pre></div></div>

<p><strong>Retrieve PDFs from a comma-separated list of PubMed IDs:</strong></p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">pubmedid2pdf.rb <span style="color: #000000;">19508715</span>,<span style="color: #000000;">18677110</span>,<span style="color: #000000;">19450510</span>,<span style="color: #000000;">19450585</span></pre></div></div>

<p><strong>Retrieve PDFs from a given search term:</strong> The output list is a CSV-like file (&#8216;|&#8217; instead of &#8216;,&#8217;) with various metadata. This can easily be parsed into a database (or you can modify the Ruby script itself if you prefer that). This is what we do: we, Biogeeks, have a Rails framework with all our papers, where we can search etc. in our database. But I kept these scripts simple so that you can easily adapt them to your own setup, whatever it may be.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">searchTerm2pdf.rb <span style="color: #ff0000;">'Lindow M[Author] OR Torarinsson E[Author]'</span> <span style="color: #000000; font-weight: bold;">&gt;</span> list</pre></div></div>
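
<p>If you would rather post-process the list outside Ruby, reading it takes only a few lines. This is a minimal Python sketch using the standard csv module; the function name is made up, and because the column layout is not described here, the rows are returned as plain lists without assuming what each column holds.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import csv


def read_pipe_list(path):
    """Read a '|'-separated list file into a list of rows."""
    with open(path, newline="") as handle:
        # QUOTE_NONE: assume fields are not quoted, so quote characters
        # inside titles are kept as ordinary text
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


# Hypothetical usage with the file written by the command above:
# rows = read_pipe_list("list")</pre></div></div>

<p>From there each row can be inserted into whatever database or spreadsheet you use.</p>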

<p><strong>Optional arguments (for both scripts):</strong> You can use a SOCKS proxy. For example, if you&#8217;re at home and do not have access to many of the journals, but your university or company does, you can go through their server (we do not bypass the law; you will need to have legal access to the non-free PDFs <img src='http://bio-geeks.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />  ). To do this, first make a connection to your server in a different terminal, then pass the proxy host and port as the last two arguments:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #c20cb9; font-weight: bold;">ssh</span> <span style="color: #660033;">-D</span> <span style="color: #000000;">9999</span> username<span style="color: #000000; font-weight: bold;">@</span>server  <span style="color: #666666; font-style: italic;">#In another terminal</span>
searchTerm2pdf.rb <span style="color: #ff0000;">'Torarinsson E[Author]'</span> 127.0.0.1 <span style="color: #000000;">9999</span>
<span style="color: #666666; font-style: italic;"># OR</span>
pubmedid2pdf.rb <span style="color: #000000;">19508715</span> 127.0.0.1 <span style="color: #000000;">9999</span></pre></div></div>

<p><strong>Conclusion:</strong> This works quite well for me. For example, to retrieve all the Biogeeks articles I use the same search that I use for the pubmedlist WordPress plugin I wrote, &#8216;(Lindow M[Author] OR Torarinsson E[Author] OR Lindgreen S[Author] OR Marstrand T[Author]) AND (1998/01/01[PDAT]:3000[PDAT])&#8217;. It loads the 36 articles into the output list with corresponding metadata and retrieves 33 of the 36 PDFs. The three missing are one with no PDF available and two Nature (Gen) papers I don&#8217;t have access to through the server I am using.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=749</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Geeky weekend</title>
		<link>http://bio-geeks.com/?p=745</link>
		<comments>http://bio-geeks.com/?p=745#comments</comments>
		<pubDate>Fri, 02 Oct 2009 16:11:08 +0000</pubDate>
		<dc:creator>Troels</dc:creator>
				<category><![CDATA[Geek stuff]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=745</guid>
		<description><![CDATA[Just happened to stumble across this very geeky post on blog updating from emacs and R. I am sure that my fellow geeks and I will be spending some time during the weekend checking this out. Maybe the readers of this blog will be spammed with R-code, even more, in the near future. Have a [...]]]></description>
			<content:encoded><![CDATA[<p>I just happened to stumble across this very geeky post on blog updating <a href="http://blogisticreflections.wordpress.com/2009/09/20/welcome-to-blogistic-reflections/">from Emacs and R</a>. I am sure that my fellow geeks and I will spend some time this weekend checking it out. Maybe the readers of this blog will be spammed with even more R code in the near future. Have a nice weekend.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=745</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysis of gene expression</title>
		<link>http://bio-geeks.com/?p=737</link>
		<comments>http://bio-geeks.com/?p=737#comments</comments>
		<pubDate>Sun, 27 Sep 2009 19:43:42 +0000</pubDate>
		<dc:creator>Troels</dc:creator>
				<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[gene-regulation]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[methodology]]></category>
		<category><![CDATA[micro array]]></category>

		<guid isPermaLink="false">http://bio-geeks.com/?p=737</guid>
		<description><![CDATA[A small rant about two different and very useful methods for analyzing expression data from microarrays.]]></description>
			<content:encoded><![CDATA[
<p class="MsoNormal">Everyone is thrilled about next-generation sequencing and the added level of complexity, the potential for novel discoveries, the detection of splice variants and so on. The fact is, however, that microarrays still provide a cheaper and possibly more reliable platform for gene expression studies, and a vast amount of such data is already available in public databases for testing various biological hypotheses. Such testing requires no more than a good idea and some CPU time. However, as with any test, we need the right tool set. Broadly, microarray analysis is done either by investigating differentially expressed genes or by globally grouping genes based on their expression profiles and investigating the inferred classes.</p>
<p class="MsoNormal">In the first case, the term Gene Set Enrichment Analysis has recently received a lot of attention. The archetype of this type of analysis is the one presented by Subramanian et al. (2005), although a variety of methodologies based on the same concept have been proposed.</p>
<p class="MsoNormal">Instead of assessing the significance of individual genes, the idea is to borrow strength by assessing the significance of an externally defined set of genes, for instance the genes involved in a specific pathway. Gene sets should therefore, ideally, produce more robust results in the presence of technical and biological variability. However, gene set enrichment analysis (of any kind) is no magic bullet and comes with its own share of issues, most of which are ignored when the tools are used out of the box. A highly recommended publication by Efron and Tibshirani (2007) addresses these issues. In short, simply permuting the genes within each experiment to assess the significance of a given gene set ignores the possible correlation between biologically related genes, and therefore underestimates the variability of the null model. This in turn leads to significant results for gene sets simply because their genes are treated as independent, which they clearly are not when they are members of the same pathway.</p>
<p class="MsoNormal">Permuting the experiment labels instead retains the correlation between the genes, but it does not take into account that the detected significance could be due to a global effect, such as a general up-regulation of most genes. Therefore, both between-experiment and within-experiment resampling are needed.</p>
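<p class="MsoNormal">To make the two resampling schemes concrete, here is a minimal Python sketch on simulated toy data. Everything in it is an assumption made for illustration: the set score is a plain mean difference between two groups of experiments (not the enrichment statistic of Subramanian et al. or the statistic of Efron and Tibshirani), and the function names, gene names and data are hypothetical.</p>

```python
import random
from statistics import mean


def set_score(expr, labels, gene_set):
    """Mean over the gene set of (group-1 mean minus group-0 mean)."""
    diffs = []
    for gene in gene_set:
        group1 = [x for x, lab in zip(expr[gene], labels) if lab == 1]
        group0 = [x for x, lab in zip(expr[gene], labels) if lab == 0]
        diffs.append(mean(group1) - mean(group0))
    return mean(diffs)


def gene_permutation_null(expr, labels, set_size, n_perm, rng):
    """Null scores from random gene sets of the same size.

    Random sets break the correlation between genes of a real pathway.
    """
    genes = list(expr)
    return [set_score(expr, labels, rng.sample(genes, set_size))
            for _ in range(n_perm)]


def label_permutation_null(expr, labels, gene_set, n_perm, rng):
    """Null scores from shuffled experiment labels.

    The gene set stays intact, so gene-gene correlation is preserved.
    """
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(set_score(expr, shuffled, gene_set))
    return null


def p_value(observed, null):
    """Two-sided empirical p-value with the usual +1 correction."""
    hits = sum(1 for s in null if abs(s) >= abs(observed))
    return (hits + 1) / (len(null) + 1)


# Simulated toy data: 50 genes, 5 control and 5 treated experiments.
rng = random.Random(0)
labels = [0] * 5 + [1] * 5
expr = {f"gene{i}": [rng.gauss(0, 1) for _ in labels] for i in range(50)}

# Hypothetical pathway whose genes are shifted upwards in the treated group.
pathway = ["gene0", "gene1", "gene2", "gene3"]
for gene in pathway:
    expr[gene] = [x + (1.5 if lab == 1 else 0.0)
                  for x, lab in zip(expr[gene], labels)]

observed = set_score(expr, labels, pathway)
p_genes = p_value(observed, gene_permutation_null(expr, labels, len(pathway), 200, rng))
p_labels = p_value(observed, label_permutation_null(expr, labels, pathway, 200, rng))
```

<p class="MsoNormal">The sketch keeps the two null distributions separate so that they can be compared. How to combine them into a single test is the subject of the Efron and Tibshirani paper and is not attempted here.</p>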
<p class="MsoNormal">Gene set enrichment analysis is excellent when the main question of interest is differential expression. In general, however, more information is gained from scaling down the multi-dimensional expression data to a set of interpretable groupings across all levels of expression, not just the extremes. Doing so involves some sort of inference. Often a pattern recognition algorithm is applied to assign single genes to a limited set of groups, which hopefully have biological interpretations based on their member genes. Such unsupervised techniques are popular and widely applicable. However, they often suffer from a lack of reproducibility at the gene level: individual genes change group membership, and/or a group contains a subset of genes which are readily interpreted as belonging to a particular biological function while the rest of the set is not easily co-classified.</p>
<p class="MsoNormal">As the main exercise is to reduce the expression data to a limited number of biologically interpretable groups, the problem can be approached from the opposite direction, starting from pre-defined gene sets as in gene set enrichment analysis. A publication by Sangurdekar et al. (2006) describes how the decrease in entropy of a pre-defined group of genes as a function of the experiment can be used to classify the transcriptional response in terms of the extent of co-expression of functionally related genes. The expectation is that if genes form a functional group under a given condition or experimental setup, they should be better correlated than a random assortment of genes.</p>
<p class="MsoNormal">To assess whether a given set of genes is specific for a condition and, further, whether it is more so than the other sets specific for that condition, both between-experiment and within-experiment permutations are performed. The measured quantity is the entropy computed over the singular values of the expression values of the gene set, a general idea previously described by Alter et al. (2000). The closer the entropy is to zero, the more stable and structured the gene expression profiles of the set are. Naturally, genes that do not change expression levels across conditions are highly correlated and therefore show low entropy. A composite score that incorporates the amplitude of expression across conditions is therefore used.</p>
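<p class="MsoNormal">As a rough illustration of the entropy measure, the Python sketch below computes a normalized Shannon entropy from a list of singular values. It assumes the normalization I recall from Alter et al. (2000), where each singular value contributes the fraction p_k = s_k^2 / sum(s_j^2) and the entropy is divided by log(L) for L singular values; the exact score used by Sangurdekar et al. may differ. The function name is hypothetical, and the singular values are assumed to come from an SVD routine of your choice applied to the expression matrix of the gene set.</p>

```python
import math


def normalized_svd_entropy(singular_values):
    """Normalized Shannon entropy over squared singular values.

    Returns a value between 0 (one dominant pattern, highly structured)
    and 1 (all patterns contribute equally, no structure).
    """
    squares = [s * s for s in singular_values]
    total = sum(squares)
    if len(squares) < 2 or total == 0:
        raise ValueError("need at least two singular values, not all zero")
    # Fraction of the overall expression captured by each singular value.
    fractions = [sq / total for sq in squares]
    # Zero fractions contribute nothing, since p * log(p) tends to 0.
    entropy = -sum(p * math.log(p) for p in fractions if p > 0)
    return entropy / math.log(len(fractions))
```

<p class="MsoNormal">A gene set whose expression is dominated by a single pattern, for example singular values of 3, 0 and 0, gives an entropy of zero, whereas equal singular values give the maximum of one.</p>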
<p class="MsoNormal">Ultimately, each gene set can be assigned a relative importance both between and within all conditions. Moreover, a condition can be described by the median class activity, that is, the overall performance of all queried gene sets within the condition, thereby elucidating whether the phenotype is associated with a very specific response triggering a few pathways or with more global changes in expression profiles. This leads to a much richer description of the data sets than is obtained from testing differential expression alone.</p>
]]></content:encoded>
			<wfw:commentRss>http://bio-geeks.com/?feed=rss2&amp;p=737</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
