Discussion
Expansins typically consist of about 225 amino acids (about 26 kDa) and an N-terminal signal peptide2, in total 250 to 275 amino acids55, which is in agreement with the average sequence length of 262 amino acids identified in this study. Thus, sequences shorter than 210 amino acids or longer than 300 amino acids were excluded from global sequence analyses (Figure S4). However, sequences with a length of about 600 amino acids contained replications of expansin domains, either as fusion proteins or due to sequencing errors, so that each domain occurred two or three times. Since the two expansin domains have a length between 80 and 90 amino acids, shorter protein sequences can be considered as fragments or incomplete expansin domains.
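The length window described above can be illustrated with a minimal sketch. Only the limits of 210 and 300 amino acids are taken from the text; the function name `filter_by_length` and the toy records are hypothetical placeholders and do not represent the pipeline used to build the ExED.

```python
# Length window taken from the text: sequences shorter than 210 or longer
# than 300 amino acids are excluded from global sequence analyses.
MIN_LEN = 210
MAX_LEN = 300


def filter_by_length(sequences, min_len=MIN_LEN, max_len=MAX_LEN):
    """Keep only records whose sequence length lies within [min_len, max_len]."""
    return {
        name: seq
        for name, seq in sequences.items()
        if min_len <= len(seq) <= max_len
    }


# Hypothetical toy records (poly-alanine placeholders, not real expansins).
toy_records = {
    "fragment": "A" * 85,     # roughly one expansin domain long
    "typical": "A" * 262,     # average length reported in this study
    "duplicated": "A" * 600,  # repeated-domain case described above
}

kept = filter_by_length(toy_records)
print(sorted(kept))
```

Both limits are treated as inclusive here, because the text excludes only sequences that are strictly shorter than 210 or strictly longer than 300 amino acids.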
The occurrence of expansins across major taxa of the tree of life reported previously (Fig. 1 in 8, which shows a comprehensive phylogenetic analysis of expansin genes across all kingdoms of life) is comparable to the results obtained in this study (https://doi.org/10.18419/darus-693). For twelve out of ninety groups that were compared, the results differ. For example, the archaeon Halomicroarcula sp. LR21 is found in the ExED and contains one expansin homologue for which both expansin domains are annotated, whereas the previous study in 8 did not find a putative expansin in Archaea. Other, apparently hitherto unknown, occurrences of putative expansins in the ExED include thirty-six sequences of Fibrobacteres, one sequence of Ignavibacteria in which both expansin domains can be found, seven sequences of Discosea, one sequence of Discoba, and one sequence of Acidobacteria, but without domain annotations. Further expansin sequences that were not included in this study but mentioned in 8 are from the taxa Verrucomicrobia, Chlorobi, Tubulinea, Glaucophyta, Haptophyta, Dinoflagellata, and Phaeophyta.
The protein sequence networks confirmed the nomenclature and classification of expansins into the three kingdoms Bacteria, Fungi, and Viridiplantae and the subclassification of plant expansins into EXPA, EXPB, EXLA, and EXLB56 (Figure 2). Despite the differences on the global sequence level, the protein sequence networks of expansins from Bacteria and Viridiplantae share similarities on a domain-based sequence level (Figures S1, S6, and S7). The N-terminal expansin domain is more conserved than the C-terminal expansin domain (Figure 3 and https://doi.org/10.18419/darus-735). When expansin homologues from more diverse backgrounds are discovered in the future, updated profile HMMs may provide further insight into the possible co-evolution of both expansin domains.
A conservation analysis revealed and confirmed positions with an essential functional or structural role in expansin homologues. Glycine is structurally relevant, as it mediates the formation of short loops57 and is frequently observed at the N- and C-caps of α-helices to increase helix stability58. As observed previously for other protein families59,60, glycine is the most conserved amino acid in both expansin domains. In expansins, all four conserved glycines are located in loop regions (Table 2, compare with Figure 1). The conservation of threonine 12 and aspartate 82 in Bacteria, Fungi, EXPA, and EXPB confirms their functional role10. Interestingly, at standard position 75, which plays a moderate role in cell wall extension activity of Bs EXLX110, a glutamate is conserved in the superfamily ‘Bacterial expansins’, and a glycine in EXPA and EXLB. In contrast, standard position 75 is not conserved in the superfamily ‘Fungal expansins’, in EXPB, and in EXLA (Table 2 and https://doi.org/10.18419/darus-735). Aspartate 71, which has been proposed as important but not essential for wall extension activity of Bs EXLX110, is conserved in the superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’, and in EXPB, EXLA, and EXLB (Table 2 and https://doi.org/10.18419/darus-735). However, three other proposed key amino acids for cell wall extension activity (threonine 14, serine 16, and tyrosine 7310) are not conserved in expansins from Bacteria, Fungi, or Viridiplantae (Table S3 and https://doi.org/10.18419/darus-735), indicating the importance of an increased sample size for conservation analysis. The large number of expansin sequences investigated here also provided a deeper insight into the structural or functional relevance of disulfide bridges in the different superfamilies. Previously, three disulfide bridges were proposed to stabilize the tertiary structure of the N-terminal expansin domain of EXPA and EXPB14,15. 
Five of the proposed six cysteines could be confirmed as highly conserved in the superfamily ‘Plant expansins’ (Table 2 and https://doi.org/10.18419/darus-735). The sixth cysteine is located directly before the linker to the C-terminal expansin domain and is therefore not included in our profile HMM for the N-terminal expansin domain. Against expectations, the additional highly conserved fourth cysteine pair in plant α-expansins from 15 was not found in our analysis (https://doi.org/10.18419/darus-735). Only three conserved cysteines were found in the superfamily ‘Fungal expansins’, thus not all fungal expansin homologues possess three disulfide bridges, as had been concluded from the expansin Sc Exlx116. None of the six cysteines was conserved in the superfamily ‘Bacterial expansins’ (Table 2), which is in accordance with previous observations of bacterial expansins lacking disulfide bridges13.
In the C-terminal expansin domain, the three aromatic residues at standard positions 125, 126, and 157, which mediate binding to cellulose10, are conserved in the superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’, but are less conserved in the superfamily ‘Plant expansins’ (Table S4 and https://doi.org/10.18419/darus-735). Lysine 119, which is important for cell wall-loosening activity10, is conserved in the superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’, but not conserved in the superfamily ‘Plant expansins’ (Table 2).
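To make the notion of a conserved position concrete, the following minimal sketch scores each column of a multiple sequence alignment by the frequency of its most common residue. This is one simple conservation measure chosen for illustration; it is an assumption and not necessarily the measure used in this study. The function name `column_conservation` and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Return (most frequent residue, its fraction among non-gap residues) per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        # Gaps are ignored, so the fraction refers to observed residues only.
        residues = [seq[i] for seq in alignment if seq[i] != gap]
        if not residues:
            result.append((gap, 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


# Hypothetical toy alignment with three columns.
toy_alignment = ["GTC", "GTA", "GSC", "G-C"]
for position, (residue, fraction) in enumerate(column_conservation(toy_alignment), 1):
    print(position, residue, round(fraction, 2))
```

In such a scheme, a position would be called conserved when the fraction exceeds a chosen cut-off; the cut-off itself would be a further assumption and is not taken from the text.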
Conservation analysis confirmed previously published family-specific motifs: in the N-terminal expansin domain, the T(F/W)YG motif was present in the two superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’ (standard positions 12-14 and 14.1), and the motifs GGACG (20-24) and HFD (80-82) in the superfamily ‘Plant expansins’9,55 (Table 3). We suggest extending the GGACG motif to a GGACGYG motif and the HFD motif to an HFDL motif in plant expansins. In bacterial and fungal expansins, these two plant motifs are slightly different: in the superfamily ‘Fungal expansins’, the GGACGYG motif is shorter (GGxC), and in fungal and bacterial expansins the HFDL motif is replaced by HLDL. The HLD motif as well as the GGACS motif were already described for the fungal expansin Sc EXLX116. Newly proposed motifs in the N-terminal expansin domain are VpGP (58-61) in the superfamily ‘Bacterial expansins’ and GTAnS (34-38) in the superfamily ‘Fungal expansins’ (Tables 3 and S3), where p and n denote polar and nonpolar amino acids, respectively. In expansins from Fungi, the proline of the VpGP motif is replaced by a nonpolar amino acid. The previously described CDRC motif at the amino terminus of EXLA55 is located beyond the boundaries of our profile HMM for the N-terminal expansin domain.
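As an illustration of how the N-terminal motifs could be searched in an unannotated sequence, the sketch below translates the fully specified motifs into regular expressions. It scans the raw sequence and therefore ignores the standard position numbering, which requires an alignment to the profile HMM; the motifs containing the placeholders p and n are left out because the text does not list the residues they stand for. The names `N_TERMINAL_MOTIFS` and `find_motifs` and both toy sequences are hypothetical.

```python
import re

# Motifs as written in the text, translated to regular expressions.
# "x" in GGxC stands for any residue.
N_TERMINAL_MOTIFS = {
    "T(F/W)YG": re.compile(r"T[FW]YG"),
    "GGACGYG": re.compile(r"GGACGYG"),
    "GGxC": re.compile(r"GG.C"),
    "HFDL": re.compile(r"HFDL"),
    "HLDL": re.compile(r"HLDL"),
}


def find_motifs(sequence, motifs=N_TERMINAL_MOTIFS):
    """Return {motif name: [0-based start offsets]} for every motif that occurs."""
    hits = {}
    for name, pattern in motifs.items():
        starts = [match.start() for match in pattern.finditer(sequence)]
        if starts:
            hits[name] = starts
    return hits


# Hypothetical toy sequences, not real expansins.
toy_plant_like = "MKGGACGYGNLAAHFDLSK"
toy_microbial_like = "MKTWYGAAGGSCAAHLDLSK"

print(find_motifs(toy_plant_like))
print(find_motifs(toy_microbial_like))
```

Because GGxC is a shortened form of GGACGYG, it also matches the plant-like toy sequence. A match to the short motif alone is therefore not specific, and the longer motifs carry more weight.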
Until now, no sequence motifs had been proposed for the C-terminal expansin domain. We found eight novel motifs: KpG(S/T)S (119-123) and QVRNH (130-134) in the superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’, where QVRNH shows slight modifications; LEVSTDGD (141-146, including 143.1 and 143.2), GGG (164-166), and VDVRVT (170-175) in ‘Fungal expansins’; YLA (126-128) in EXLA and EXLB; WGA (156-158) in EXPB and with slight modifications in EXPA, EXLA, and EXLB; and LSFpVT (170-175) in EXPA (Table 3). When annotating expansin sequences in future studies, these sequence motifs will help to assign unknown protein sequences (e.g. metagenomic sequences) to the kingdoms Viridiplantae, Bacteria, or Fungi, and to distinguish the plant expansins EXPA, EXPB, EXLA, and EXLB (Tables 3, S3, and S4). Exemplary annotations were shown herein for actinobacterial genome samples from South Africa, including a putative expansin homologue from S. swartbergensis (Tables S8 and S9).
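The assignment of an unknown sequence to a group by its motifs can be sketched as a simple tally. The motif-to-group mapping follows the text, restricted to motifs without the placeholder p and without the short GGG motif, which would match by chance in many proteins. The exact-substring matching, the function name `tally_groups`, and the toy sequence are illustrative assumptions; such a tally would complement, not replace, a profile HMM search.

```python
from collections import Counter

# C-terminal motifs from the text, mapped to the groups in which they are reported.
C_TERMINAL_MOTIFS = {
    "QVRNH": ("Bacterial expansins", "Fungal expansins"),
    "LEVSTDGD": ("Fungal expansins",),
    "VDVRVT": ("Fungal expansins",),
    "YLA": ("EXLA", "EXLB"),
    "WGA": ("EXPB",),
}


def tally_groups(sequence, motifs=C_TERMINAL_MOTIFS):
    """Count, per group, how many of its motifs occur exactly in the sequence."""
    votes = Counter()
    for motif, groups in motifs.items():
        if motif in sequence:
            for group in groups:
                votes[group] += 1
    return votes


# Hypothetical toy sequence carrying three of the motifs listed above.
toy_sequence = "AAQVRNHAAALEVSTDGDAAAVDVRVTAA"
print(tally_groups(toy_sequence).most_common())
```

Exact matching misses the "slight modifications" mentioned in the text, so a real annotation would have to tolerate substitutions at individual motif positions.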
The large number of expansin sequences used for analysis not only improved the identification of motifs, but also shed light on evolutionary relationships. Interestingly, when searching with the newly established profile HMMs for expansin domains within the CBM63 protein sequences from CAZy, 510 out of the 582 CBM63 protein sequences were found to contain both expansin domains (Table S10). Only four sequences had a similarity to the C-terminal expansin domain, while missing the N-terminal expansin domain, as suggested previously1, and 58 CBM63 sequences contained only the N-terminal expansin domain.
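The grouping of sequences by domain content can be expressed as a small classification step applied to the per-sequence results of the two profile HMM searches. The sketch below assumes that each sequence has already been reduced to two yes/no values (N-terminal hit, C-terminal hit); the function name `architecture`, the category labels, and the toy hits are hypothetical.

```python
from collections import Counter


def architecture(has_n_domain, has_c_domain):
    """Label a sequence by which expansin domains were detected in it."""
    if has_n_domain and has_c_domain:
        return "both domains"
    if has_n_domain:
        return "N-terminal domain only"
    if has_c_domain:
        return "C-terminal domain only"
    return "no expansin domain"


# Hypothetical per-sequence results: (N-terminal hit, C-terminal hit).
toy_hits = {
    "seq_a": (True, True),
    "seq_b": (True, False),
    "seq_c": (False, True),
    "seq_d": (True, True),
    "seq_e": (False, False),
}

counts = Counter(architecture(n_hit, c_hit) for n_hit, c_hit in toy_hits.values())
for label, count in counts.most_common():
    print(f"{label}: {count}")
```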
The observation of four bacterial sequences being found in clusters of plant expansins supports the hypothesis that microbial expansins were derived via horizontal gene transfer from plants to microbes7 (Figures 2, S6, and S7). The two bacterial sequences in clusters of the superfamily ‘Plant expansins’ (Figure 5) are from the plant pathogens Kutzneria sp. 744 (NCBI accession EWM10128.1) and Streptomyces acidiscabies (NCBI accession WP_050370046.1), which are both actinobacteria, as described previously2.
With the chosen filter criteria, the sequence of the fungal swollenin does not contain any expansin domain. The score for the C-terminal expansin domain is far below the chosen criteria, so the swollenin sequence at most resembles a distantly related C-terminal expansin domain (Table S7), and we found no N-terminal expansin domain within the protein sequence of swollenin. This is due to the short N-terminal expansin domain in the swollenin from Trichoderma reesei and confirms the rather low sequence similarity between swollenin and expansins47.
On a global sequence level, GH45s and N-terminal expansin domains share less than 30% pairwise sequence identity (Figure 4). Neither searching the 542 GH45 sequences with the profile HMMs of the N- and C-terminal expansin domains nor searching the 15,089 sequences of the ExED with the GH45 profile HMM from Pfam (https://pfam.xfam.org/family/PF02015/hmm) resulted in a match. In comparison to N-terminal expansin domains, GH45 sequences are longer due to several inserts and longer loop regions (179-208 amino acids as compared to 90-115 amino acids of the N-terminal expansin domains). Despite these differences, the evolutionary relationship between the two protein families is underlined by conserved amino acids. Both the conserved threonine and aspartate at standard positions 12 and 82, and the HFDL motif (standard positions 80-83) were found in the GH45 protein sequences.
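Pairwise sequence identity, as referred to above, can be computed from two aligned sequences as sketched below. Several definitions of percent identity are in use, and they differ in the denominator. This sketch assumes that identity is the number of identical residue pairs divided by the number of columns in which neither sequence has a gap, which is not necessarily the definition used for Figure 4. The function name `percent_identity` and the toy alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment: four of five gap-free columns are identical.
print(percent_identity("HFDL-A", "HLDLGA"))
```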
This study confirms the observation that microbial expansins comprise two protein domains and are widely distributed across diverse lineages of Archaea, Bacteria, Fungi, other eukaryotic microbes8, and Viridiplantae. Therefore, the ExED can serve as a basis for a more detailed phylogenetic analysis in order to elucidate the origin of expansins and ancient evolutionary dynamics. Furthermore, the ExED can be used to search for expansin genes in virulent fungal and bacterial plant pathogens.