Article

Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding

A. Quaresma, M. Ankenbrand, C. Garcia, J. Rufino, M. Honrado, J. Amaral, R. Brodschneider, V. Brusbardis, K. Gratzer, F. Hatjina, O. Kilpinen, M. Pietropaoli, I. Roessink, J. van der Steen, F. Vejsnæs, M. Pinto, and A. Keller.
Scientific Data, 11 (1): 129 (Jan 25, 2024)
DOI: 10.1038/s41597-024-02962-5

Abstract

One of the most critical steps for accurate taxonomic identification in DNA (meta)-barcoding is to have an accurate DNA reference sequence dataset for the marker of choice. Therefore, developing such a dataset has been a long-term ambition, especially in the Viridiplantae kingdom. Typically, reference datasets are constructed with sequences downloaded from general public databases, which can carry taxonomic and other relevant errors. Herein, we constructed a curated (i) global dataset, (ii) European crop dataset, and (iii) 27 datasets for the EU countries for the ITS2 barcoding marker of vascular plants. To that end, we first developed a pipeline script that entails (i) an automated curation stage comprising five filters, (ii) manual taxonomic correction for misclassified taxa, and (iii) manual addition of newly sequenced species. The pipeline allows easy updating of the curated datasets. With this approach, 13% of the sequences, corresponding to 7% of species originally imported from GenBank, were discarded. Further, 259 sequences were manually added to the curated global dataset, which now comprises 307,977 sequences of 111,382 plant species.

BibTeX key: Quaresma2024
entry type: article
year: 2024
month: jan
day: 25
journal: Scientific Data
number: 1
pages: 129
volume: 11
issn: 2052-4463
DOI: 10.1038/s41597-024-02962-5
url: https://doi.org/10.1038/s41597-024-02962-5


Cite this publication

@article{Quaresma2024,
  abstract = {One of the most critical steps for accurate taxonomic identification in DNA (meta)-barcoding is to have an accurate DNA reference sequence dataset for the marker of choice. Therefore, developing such a dataset has been a long-term ambition, especially in the Viridiplantae kingdom. Typically, reference datasets are constructed with sequences downloaded from general public databases, which can carry taxonomic and other relevant errors. Herein, we constructed a curated (i) global dataset, (ii) European crop dataset, and (iii) 27 datasets for the EU countries for the ITS2 barcoding marker of vascular plants. To that end, we first developed a pipeline script that entails (i) an automated curation stage comprising five filters, (ii) manual taxonomic correction for misclassified taxa, and (iii) manual addition of newly sequenced species. The pipeline allows easy updating of the curated datasets. With this approach, 13{\%} of the sequences, corresponding to 7{\%} of species originally imported from GenBank, were discarded. Further, 259 sequences were manually added to the curated global dataset, which now comprises 307,977 sequences of 111,382 plant species.},
  added-at = {2024-02-15T08:58:53.000+0100},
  author = {Quaresma, Andreia and Ankenbrand, Markus J. and Garcia, Carlos Ariel Yadr{\'o} and Rufino, Jos{\'e} and Honrado, M{\'o}nica and Amaral, Joana and Brodschneider, Robert and Brusbardis, Valters and Gratzer, Kristina and Hatjina, Fani and Kilpinen, Ole and Pietropaoli, Marco and Roessink, Ivo and van der Steen, Jozef and Vejsn{\ae}s, Flemming and Pinto, M. Alice and Keller, Alexander},
  biburl = {https://www.bibsonomy.org/bibtex/2ab9ff81808d3dd57314b756902937ae1/iimog},
  day = 25,
  description = {Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding | Scientific Data},
  doi = {10.1038/s41597-024-02962-5},
  interhash = {3dabfb97eede061a5bac7f002502e179},
  intrahash = {ab9ff81808d3dd57314b756902937ae1},
  issn = {2052-4463},
  journal = {Scientific Data},
  keywords = {alexanderkeller bmd cctb markusankenbrand mbd},
  month = jan,
  number = 1,
  pages = 129,
  timestamp = {2024-02-15T08:58:53.000+0100},
  title = {Semi-automated sequence curation for reliable reference datasets in ITS2 vascular plant DNA (meta-)barcoding},
  url = {https://doi.org/10.1038/s41597-024-02962-5},
  volume = 11,
  year = 2024
}
