Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies / A. Bonin, A. Guerrieri, G.F. Ficetola. - In: MOLECULAR ECOLOGY RESOURCES. - ISSN 1755-098X. - 23:(2023), pp. 368-381. [10.1111/1755-0998.13709]

Optimal sequence similarity thresholds for clustering of molecular operational taxonomic units in DNA metabarcoding studies

A. Guerrieri (second author); G.F. Ficetola (last author)
2023

Abstract

Clustering approaches are pivotal for handling the many sequence variants obtained in DNA metabarcoding data sets, and they have therefore become a key step of metabarcoding analysis pipelines. Clustering often relies on a sequence similarity threshold to gather sequences into molecular operational taxonomic units (MOTUs), each of which ideally represents a homogeneous taxonomic entity (e.g., a species or a genus). However, the choice of the clustering threshold is rarely justified, and its impact on MOTU over-splitting or over-merging is even less often tested. Here, we evaluated clustering threshold values for several metabarcoding markers under different criteria: limitation of MOTU over-merging, limitation of MOTU over-splitting, and trade-off between over-merging and over-splitting. We extracted sequences from a public database for nine markers, ranging from generalist markers targeting Bacteria or Eukaryota, to more specific markers targeting a class or a subclass (e.g., Insecta, Oligochaeta). Based on the distributions of pairwise sequence similarities within species and within genera, and on the rates of over-splitting and over-merging across different clustering thresholds, we were able to propose threshold values minimizing the risk of over-splitting, that of over-merging, or offering a trade-off between the two risks. For generalist markers, high similarity thresholds (0.96-0.99) are generally appropriate, while more specific markers require lower values (0.85-0.96). These results do not support the use of a fixed clustering threshold. Instead, we advocate careful examination of the most appropriate threshold based on the research objectives, the potential costs of over-splitting and over-merging, and the features of the studied markers.
COI; MOTU over-merging; MOTU over-splitting; alpha diversity; metabarcoding marker; sequence variant
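The idea of scoring a clustering threshold by its over-splitting and over-merging rates can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes pre-aligned sequences of equal length, uses simple positional identity as the similarity measure, and clusters by single linkage. It also assumes two working definitions: over-splitting is the share of species spread over more than one MOTU, and over-merging is the share of MOTUs containing more than one species. The function names (`similarity`, `cluster`, `error_rates`) and the toy sequences are hypothetical.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(seqs: list[str], threshold: float) -> list[int]:
    """Single-linkage clustering: link every pair with similarity >= threshold.

    Returns one MOTU identifier per input sequence (assumed scheme, union-find).
    """
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(seqs))]


def error_rates(seqs: list[str], species: list[str], threshold: float) -> tuple[float, float]:
    """Return (over_splitting, over_merging) under the assumed definitions."""
    motus = cluster(seqs, threshold)
    motus_per_species: dict[str, set[int]] = {}
    species_per_motu: dict[int, set[str]] = {}
    for motu, sp in zip(motus, species):
        motus_per_species.setdefault(sp, set()).add(motu)
        species_per_motu.setdefault(motu, set()).add(sp)
    over_splitting = sum(len(m) > 1 for m in motus_per_species.values()) / len(motus_per_species)
    over_merging = sum(len(s) > 1 for s in species_per_motu.values()) / len(species_per_motu)
    return over_splitting, over_merging


# Toy data (invented for illustration): three species, five aligned 10-bp sequences.
toy_seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC", "CCCCCGGGGG"]
toy_species = ["sp_A", "sp_A", "sp_B", "sp_C", "sp_C"]

for t in (0.95, 0.85, 0.4):
    split, merge = error_rates(toy_seqs, toy_species, t)
    print(f"threshold={t:.2f}  over-splitting={split:.2f}  over-merging={merge:.2f}")
```

On this toy input, raising the threshold splits species across MOTUs while lowering it merges different species into one MOTU, which is the trade-off the abstract describes. Real data would need a proper alignment or alignment-free distance and a clustering tool suited to the marker.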
Sector BIO/05 - Zoology
Sector BIO/18 - Genetics
Reconstructing community dynamics and ecosystem functioning after glacial retreat (IceCommunities); European Commission; H2020; 772284
2023
15 Sep 2022
Article (author)
Use this identifier to cite or link to this document: https://hdl.handle.net/2434/952521