Sequence alignment data stored in BAM files is often ordered by coordinate (the ID of the reference sequence plus the position on that sequence where the fragment was mapped), as this simplifies the extraction of variants between the mapped data and the reference, or of variants within the mapped data. In this order, paired reads are usually separated in the file, which complicates other applications, such as duplicate marking or conversion to the FastQ format, that require access to the full information of each pair. In this paper we introduce biobambam, a set of tools based on the efficient collation of alignments in BAM files by read name. The collation algorithm avoids time- and space-consuming sorting of alignments by read name wherever this is possible without using more than a specified amount of main memory. Using this algorithm, tasks such as duplicate marking in BAM files and conversion of BAM files to the FastQ format can be performed very efficiently with limited resources. We also make the collation algorithm available as an API for other projects; this API is part of the libmaus package. Compared with previous approaches to problems involving the collation of alignments by read name, such as existing BAM-to-FastQ conversion and duplicate marking utilities, our approach can often perform an equivalent task more efficiently in terms of required main memory and run time. Our BAM-to-FastQ conversion is faster than all widely known alternatives, including Picard and bamUtil. Our duplicate marking is about as fast as the closest competitor, bamUtil, for small data sets and faster than all known alternatives on large and complex data sets.
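The idea of collating by read name without a full sort can be illustrated with a minimal Python sketch. This is not the biobambam or libmaus implementation. It assumes at most two records per read name, a hypothetical `max_pending` limit standing in for the memory budget, and an in-memory list standing in for what would be a temporary file. Reads whose mate arrives while they are still held in memory are paired immediately; only the reads evicted under memory pressure are sorted by name at the end.

```python
from itertools import groupby
from operator import itemgetter


def collate_pairs(alignments, max_pending=4):
    """Yield (name, first, second) for records that share a read name.

    `alignments` is an iterable of (read_name, record) tuples in any order.
    `max_pending` (hypothetical parameter) bounds how many unmatched records
    are held in memory at once. Unpaired reads are yielded with second=None.
    """
    pending = {}   # read name -> record still waiting for its mate
    spilled = []   # stand-in for an on-disk overflow file

    for name, record in alignments:
        mate = pending.pop(name, None)
        if mate is not None:
            # Mate was still in memory: pair found without any sorting.
            yield name, mate, record
            continue
        if len(pending) >= max_pending:
            # Memory budget reached: move all waiting records to overflow.
            spilled.extend(pending.items())
            pending.clear()
        pending[name] = record

    # Only the records that never met their mate in memory need sorting.
    spilled.extend(pending.items())
    spilled.sort(key=itemgetter(0))  # stable sort keeps input order per name
    for name, group in groupby(spilled, key=itemgetter(0)):
        records = [rec for _, rec in group]
        second = records[1] if len(records) > 1 else None
        yield name, records[0], second


if __name__ == "__main__":
    reads = [("r1", "a1"), ("r2", "b1"), ("r1", "a2"), ("r2", "b2")]
    for pair in collate_pairs(reads):
        print(pair)
```

With a generous `max_pending`, most pairs in coordinate-sorted input are matched in the hash table because mates tend to map close to each other; the final sort then touches only the remainder.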