2.3.2. Taxonomic classification
The merged sequences were placed into the Greengenes reference
phylogeny (sepp-refs-gg-13-8) using q2-fragment-insertion [22]. Incorrect
taxonomic and phylogenetic assignments arising from differences between
16S rDNA hypervariable regions and from merging reads of variable length
were resolved with the q2-fragment-insertion technique (SATé-enabled
phylogenetic placement, implemented as a QIIME2 plugin) [22]. The core diversity was
calculated both before and after removing mitochondrial (mtDNA) and
chloroplast (clDNA) sequences from the datasets, in order to assess the
impact of their removal on diversity. The mtDNA- and clDNA-filtered
datasets were then used to calculate diversity, taxonomy, important
(core) s-OTUs and differences in composition using QIIME2, and the diversity
graph was plotted within QIIME2. We used unweighted UniFrac, weighted
UniFrac and Jaccard distance matrices to compute beta diversity, and the
results were visualised using Principal Coordinates Analysis (PCoA) in
QIIME2. Permutational Multivariate Analysis of Variance (PERMANOVA) [23]
was performed on the unweighted UniFrac, weighted UniFrac and Jaccard
distance-based beta diversity within QIIME2. We used the standard pre-trained Greengenes
library (gg_13_8_99_OTU_full-length) [24], the
SILVA reference database (SILVA_188_99_OTUs full-length) [25] and the
fragment-insertion reference dataset (ref-gg-99-taxonomy) as taxonomic
references. We chose to discuss the results obtained with the
fragment-insertion reference dataset.
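The Jaccard-based PERMANOVA step described above can be illustrated with a minimal sketch. It is not the QIIME2 implementation: it uses only the Python standard library, the sample contents and genus labels are hypothetical, and the function names (jaccard_distance, pseudo_f, permanova) are introduced here purely for illustration. The pseudo-F statistic is computed from sums of squared distances, and the p-value is estimated by shuffling the group labels.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two samples given as sets of observed s-OTUs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(samples):
    """Symmetric matrix of pairwise Jaccard distances."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jaccard_distance(samples[i], samples[j])
    return d


def pseudo_f(d, groups):
    """PERMANOVA pseudo-F from a distance matrix and one group label per sample."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    ss_total = sum(d[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(d[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    if ss_within == 0:
        return float("inf")  # groups are internally identical
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(d, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(d, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(d, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical presence/absence data: three samples per copepod genus.
samples = [
    {"otu1", "otu2", "otu3"}, {"otu1", "otu2", "otu4"}, {"otu1", "otu3", "otu4"},
    {"otu5", "otu6", "otu7"}, {"otu5", "otu6", "otu8"}, {"otu5", "otu7", "otu8"},
]
genera = ["genusA"] * 3 + ["genusB"] * 3

f_stat, p_value = permanova(distance_matrix(samples), genera)
print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

With only six samples the number of distinct label arrangements is small, so the permutation p-value cannot fall below a coarse lower bound; real datasets with more replicates per genus do not share this limitation.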
We also applied Analysis of Composition of Microbiomes
(ANCOM) [26], implemented as a QIIME2 plugin, to identify
bacteria that differed significantly between the copepod genera. ANCOM
uses F-statistics and W-statistics to determine the difference, where W
represents the strength of the ANCOM test for the tested number of species
and F represents the effect size of the difference for a
particular species between the groups (copepods). To predict the
important bacteria associated with the copepods, we used two
supervised machine learning (SML) classifiers, the Random Forest
Classifier (RFC) [27] and the Gradient Boosting Classifier (GBC) [28],
as implemented in QIIME2. Random Forest is one of the most accurate
learning algorithms for large and noisy datasets; it often copes with
unbalanced sample distributions, is less susceptible to overfitting and
generates unbiased classifiers [29]. The gradient boosting method
combines several weak learners sequentially, with each new tree fitted
to the loss left by the previous trees in order to improve the
classification. This technique is less prone to overfitting and does not
suffer from the curse of dimensionality, but it is susceptible to noisy
data and outliers [30].
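The sequential idea behind gradient boosting can be sketched in a few lines. This is a simplified illustration and not the classifier used within QIIME2: it swaps in a squared-error loss with one-split decision stumps as the weak learners, so that each round fits a stump to the residuals (the negative gradient of the loss) left by the previous rounds. The abundance values, the two-genus labels and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for f in range(len(xs[0])):
        values = sorted(set(row[f] for row in xs))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(xs, residuals) if row[f] <= thr]
            right = [r for row, r in zip(xs, residuals) if row[f] > thr]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, f, thr, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    f, thr, left_value, right_value = stump
    return left_value if row[f] <= thr else right_value


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Least-squares gradient boosting on 0/1 labels."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, row):
    """Class 1 if the boosted score reaches 0.5, otherwise class 0."""
    base, stumps, learning_rate = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return 1 if score >= 0.5 else 0


# Hypothetical relative abundances of two s-OTUs in six samples;
# label 0 and label 1 stand for two copepod genera.
abundances = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
              [0.2, 0.8], [0.1, 0.7], [0.3, 0.9]]
genus = [0, 0, 0, 1, 1, 1]

model = fit_boosting(abundances, genus)
print([predict(model, row) for row in abundances])
```

Because every round corrects whatever error remains, mislabelled or outlying samples keep attracting new stumps, which is one way to see why the method is sensitive to noisy data.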
The mtDNA- and clDNA-filtered feature table and representative sequences
were also used as input for predicting the potential metabolic functions
of CAB using Phylogenetic Investigation of Communities by Reconstruction of
Unobserved States (PICRUSt2) [19]. The output
KEGG abundance data were analysed in Statistical Analysis of Taxonomic
and Functional Profiles (STAMP), which includes Principal Component
Analysis (PCA) [31], to identify significant
differences in the potential functions of CAB between the copepod genera
using the Kruskal–Wallis H-test [32] with the
Tukey–Kramer post-hoc test [33]. The KEGG metabolic
maps [34-36] were used as a reference to draw the
figure representing the copepod genera with a high proportion of
potential functional genes.
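For reference, the Kruskal–Wallis H statistic applied to one predicted function can be sketched as follows. This is a standard-library illustration and not STAMP's own code; the abundance values and the function names (average_ranks, kruskal_wallis_h) are hypothetical. All values are ranked jointly, tied values share their mean rank, and H is corrected for ties. The p-value shown uses the closed-form chi-square tail for two degrees of freedom, which applies only to the three-group example assumed here.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups of observations.

    Undefined (division by zero) if every pooled value is identical.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / correction


# Hypothetical relative abundance (%) of one KEGG function in three genera.
groups = [[1.2, 1.5, 1.1], [2.3, 2.8, 2.5], [3.9, 3.4, 3.7]]

h = kruskal_wallis_h(groups)
p = math.exp(-h / 2)  # chi-square survival function, valid for df = 2 only
print(f"H = {h:.2f}, p = {p:.4f}")
```

For a different number of groups the statistic is compared with a chi-square distribution with (number of groups - 1) degrees of freedom, for which a statistics library would normally be used.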