Shuffling multiple alignments
Scoring of candidates has been a major problem in large scans for structural ncRNAs. One reason for this is the lack of a method to maintain the dinucleotide composition of the shuffled alignments that are used as null examples for these scans. The norm for assessing the biological significance of the predictions in these scans is to re-run their respective method on random data that is similar to real genomic data, and compare the score distribution. The random data has often been generated by shuffling the original alignments, maintaining mononucleotide frequencies, gap structure and local conservation patterns, using an algorithm by Washietl and Hofacker [2004].
For prediction of structured RNAs, the folding free energy is a critical attribute. Since G − C base pairs are more energetically favorable, random control sequences should ideally exhibit the same mononucleotide frequencies as native sequences. More subtly, dinucleotide frequencies also matter, since stacked pairs in helices also affect folding energy. Several studies (e.g., Workman and Krogh [1999], Clote et al. [2005]) convincingly demonstrate that this is more than just a theoretical concern. To constrain the dinucleotide frequencies, while also maintaining mononucleotide frequencies, gap structure and local conservation patterns, often reduces the number of possible shuffles unacceptably.
We, therefore, designed a shuffling algorithm, Multiperm, for arbitrary multiple alignments. Multiperm preserves mononucleotide frequencies, gap structure and local conservation patterns exactly, while preserving dinucleotide frequencies approximately. The number of distinct shuffles that can be produced by the algorithm is very large for the vast majority of multiple alignments. Together these characteristics provide a much more realistic null model for RNA prediction than previously available.