MaxAlign: maximizing usable data in an alignment

R Gouveia-Oliveira, PW Sackett, AG Pedersen - BMC bioinformatics, 2007 - Springer
Background
The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic and other analyses, for instance, gapped columns are often discarded entirely from the alignment.
Results
MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns – the alignment area – by selecting the optimal subset of sequences to exclude from the alignment.
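The quantity being maximized can be illustrated with a short Python sketch. It assumes the alignment is a list of equal-length strings with "-" as the gap character, and it uses a brute-force search over subsets that is only practical for a handful of sequences. The function names and the exhaustive strategy are illustrative assumptions and are not taken from the MaxAlign implementation.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol


def alignment_area(seqs):
    """Count symbols in gap-free columns: sequences x gap-free columns."""
    if not seqs:
        return 0
    gap_free = sum(1 for col in zip(*seqs) if GAP not in col)
    return len(seqs) * gap_free


def best_subset(seqs):
    """Exhaustively find the indices of sequences to keep that maximize
    the alignment area. Illustrative only: cost grows as 2^n."""
    best_keep = tuple(range(len(seqs)))
    best_area = alignment_area(seqs)
    for k in range(len(seqs) - 1, 0, -1):
        for keep in combinations(range(len(seqs)), k):
            area = alignment_area([seqs[i] for i in keep])
            if area > best_area:
                best_keep, best_area = keep, area
    return list(best_keep), best_area


# Hypothetical toy alignment: the third sequence is gap-rich.
aln = ["ACGTACGT", "ACGTAC-T", "A--T--GT", "ACGTACGT"]
print(alignment_area(aln))  # 12 (4 sequences x 3 gap-free columns)
print(best_subset(aln))     # ([0, 1, 3], 21)
```

In this toy case, excluding the single gap-rich sequence raises the alignment area from 12 to 21, even though one sequence is lost, because four previously gapped columns become usable.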
MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAlign's performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. We also introduce a new, simple measure of tree similarity, Normalized Symmetric Similarity (NSS), which we consider useful for comparing tree topologies.
Conclusion
We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences.
The MaxAlign web server is freely available online at http://www.cbs.dtu.dk/services/MaxAlign, where supplementary information can also be found. The program is also freely available as a stand-alone Perl package.