4.4 Machine-learning classification of C-N-S genes between the runoff area and the stagnant area
Because the abundance of C-N-S genes differed significantly between the runoff area and the stagnant area, these genes could serve as biomarkers to differentiate water samples from the two areas. A prediction model was established using the random-forest machine-learning method to relate the area type (runoff or stagnant) to the PICRUSt2 data (Subramanian et al., 2014; Yatsunenko et al., 2014; Karlsson et al., 2014). Ten-fold cross-validation with five repeats was carried out to evaluate the importance of the indicator genes. The cross-validation error curve stabilized when the most relevant genes were used (Fig. 12b). These genes included methanogenesis genes (mtr, fwd, mcr, mtd, hdr, mer and mvh); an aceticlastic methanogenesis gene (cdh); an F420 biosynthesis gene (cof); a coenzyme M biosynthesis gene (com); dissimilatory and assimilatory sulfate reduction genes (sat, sir, dsr and apr); assimilatory nitrate reduction genes (napA and nrfA); and a denitrification gene (nosZ). Fig. 12a shows the relative abundance of these functional genes in the runoff-area and stagnant-area samples, which was consistent with the Welch’s t-test results. Because these functional genes, which are related to methanogenesis, nitrate reduction and sulfate reduction, are all anaerobic respiration genes, their abundance responded sensitively as the hydrogeological environment changed from aerobic in the runoff area to anaerobic in the stagnant area. Their abundance can therefore be used as an effective index to distinguish the stagnant area from the runoff area.
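As an illustration only, the workflow described above (a random-forest classifier, a gene ranking, and ten-fold cross-validation with five repeats evaluated on progressively larger gene sets) can be sketched in Python with scikit-learn. This is not the pipeline used in this study. The abundance table is synthetic and merely stands in for the PICRUSt2 output. The sample counts, the number of genes, the gene-set sizes and the tree counts are all assumptions chosen so that the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for a PICRUSt2 gene-abundance table:
# 40 samples (20 runoff, 20 stagnant) x 30 genes. Not real data.
n_per_group, n_genes, n_informative = 20, 30, 5
X = rng.normal(size=(2 * n_per_group, n_genes))
y = np.array([0] * n_per_group + [1] * n_per_group)  # 0 = runoff, 1 = stagnant
X[y == 1, :n_informative] += 2.0  # first 5 genes enriched in "stagnant" samples

# Rank genes by random-forest importance (most important first).
# Simplification: the ranking uses all samples; ranking inside each
# training fold would give a less optimistic error estimate.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Ten-fold cross-validation with five repeats: 50 train/test splits.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

# Cross-validation error as a function of the number of top-ranked genes.
cv_error = {}
for k in (1, 5, 10, 30):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    accuracy = cross_val_score(model, X[:, ranking[:k]], y, cv=cv)
    cv_error[k] = 1.0 - accuracy.mean()

for k, err in cv_error.items():
    print(f"top {k:2d} genes: cross-validation error = {err:.3f}")
```

Plotting `cv_error` against the number of genes gives a curve of the kind shown in Fig. 12b. The point at which the error stops decreasing suggests how many top-ranked genes are sufficient to separate the two groups.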