4.4 Machine-learning classification of C-N-S genes between the runoff
area and the stagnant area
Because the abundance of C-N-S genes differed significantly between the
runoff area and the stagnant area, these genes could serve as biomarkers
to differentiate water samples from the two areas. A prediction model was
built with the random-forest machine-learning method, relating the
sampling area (runoff or stagnant) to the PICRUSt2 data (Subramanian et
al., 2014; Yatsunenko et al., 2014; Karlsson et al., 2014). Ten-fold
cross-validation with five repeats was carried out to evaluate the
importance of the indicator genes. The cross-validation error curve
stabilized when the most relevant genes were used (Fig. 12b). These
genes included methanogenesis genes (mtr, fwd, mcr, mtd, hdr, mer and
mvh); an aceticlastic methanogenesis gene (cdh); an F420 biosynthesis
gene (cof); a coenzyme M biosynthesis gene (com); dissimilatory and
assimilatory sulfate reduction genes (sat, sir, dsr and apr);
assimilatory nitrate reduction genes (napA and nrfA); and a
denitrification gene (nosZ). Fig. 12a showed the above-mentioned
functional genes’ relative abundance in runoff area samples and stagnant
area samples. This was consistent with the Welch’s t-test results. As these
functional genes related to methanogenesis, nitrate reduction and sulfate
reduction are all anaerobic respiration genes, their abundance responded
sensitively when the hydrogeological environment shifted from the aerobic
conditions of the runoff area to the anaerobic conditions of the stagnant
area. Their abundance can therefore be used as an effective index to
distinguish the stagnant area from the runoff area.
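
The workflow described above (random-forest classification, ten-fold
cross-validation with five repeats, and a cross-validation error curve
over the most relevant genes) can be illustrated with a minimal sketch.
The software used in this study is not stated in this section, so the use
of Python with scikit-learn is an assumption, and the abundance table
below is synthetic stand-in data, not the study's PICRUSt2 output. For
brevity the genes are ranked once on the full data set, which makes the
error estimates somewhat optimistic compared with ranking inside each
fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a samples x genes abundance table (NOT study data).
rng = np.random.default_rng(0)
n_per_area, n_genes = 20, 20
X = rng.normal(size=(2 * n_per_area, n_genes))
y = np.array([0] * n_per_area + [1] * n_per_area)  # 0 = runoff, 1 = stagnant
X[y == 1, :3] += 3.0  # hypothetical: genes 0-2 are enriched in "stagnant" samples

# Rank genes by random-forest importance (mean decrease in impurity).
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]

# Ten-fold cross-validation repeated five times, as described in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

# Cross-validation error as a function of the number of top-ranked genes kept.
error_curve = {}
for k in (1, 3, 5, n_genes):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    acc = cross_val_score(model, X[:, ranked[:k]], y, cv=cv, scoring="accuracy")
    error_curve[k] = 1.0 - acc.mean()

for k, err in error_curve.items():
    print(f"top {k:2d} genes: CV error = {err:.3f}")
```

Plotting `error_curve` against the number of genes retained gives a curve
of the kind referred to in Fig. 12b: the error drops as informative genes
are added and flattens once further genes contribute little.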