Discussion
Expansins typically consist of about 225 amino acids (about 26 kDa) and
an N-terminal signal peptide2, in total 250 to 275
amino acids55, which is in agreement with the average
sequence length of 262 amino acids identified in this study. Thus,
sequences shorter than 210 amino acids or longer than 300 amino acids
were excluded from global sequence analyses (Figure S4).
However, sequences of about 600 amino acids contained repeated expansin
domains, either as fusion proteins or as a result of sequencing errors,
so that each domain occurred two or three times. Since the two expansin
domains each have a length between 80 and 90 amino acids, shorter protein
sequences can be considered fragments or incomplete expansin domains.
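The length criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this study: the function names and the dictionary input format are assumptions, and only the thresholds of 210 and 300 amino acids are taken from the text.

```python
def passes_length_filter(sequence, min_len=210, max_len=300):
    """Return True if the sequence length lies within [min_len, max_len].

    Sequences shorter than min_len or longer than max_len are rejected,
    so both boundary values are kept.
    """
    return min_len <= len(sequence) <= max_len


def filter_by_length(records, min_len=210, max_len=300):
    """Keep only the entries of a {name: sequence} mapping that pass the filter."""
    return {
        name: seq
        for name, seq in records.items()
        if passes_length_filter(seq, min_len, max_len)
    }


# Toy input (hypothetical names, artificial sequences of a given length).
toy_records = {
    "fragment": "A" * 120,
    "typical": "A" * 262,
    "duplicated_domains": "A" * 600,
}
kept = filter_by_length(toy_records)
```

With the toy input, only the entry of 262 residues is retained; the fragment and the sequence with duplicated domains are excluded from the global analysis.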
The occurrence of expansins in major taxa in the tree of life (after
Fig. 1 in ref. 8, where a comprehensive phylogenetic analysis
of expansin genes across all kingdoms of life is shown) is comparable to
the results obtained in this study (https://doi.org/10.18419/darus-693).
The results differ for twelve of the ninety groups that were compared.
For example, the archaeon Halomicroarcula sp. LR21 is found in the ExED
with one expansin homologue for which both expansin domains are
annotated, whereas the analysis in ref. 8 did not find a putative
expansin in Archaea.
Other, apparently hitherto unknown, occurrences of putative expansins in
the ExED include thirty-six sequences of Fibrobacteres, one
sequence of Ignavibacteria in which both expansin domains can be
found, seven sequences of Discosea, one sequence of Discoba, and one
sequence of Acidobacteria, but without domain annotations. Further
expansin sequences that were not included in this study but are
mentioned in ref. 8 are from the taxa Verrucomicrobia, Chlorobi,
Tubulinea, Glaucophyta, Haptophyta, Dinoflagellata, and Phaeophyta.
The protein sequence networks confirmed the nomenclature and
classification of expansins into the three kingdoms Bacteria,
Fungi, and Viridiplantae and the subclassification of plant
expansins into EXPA, EXPB, EXLA, and EXLB (ref. 56; Figure 2). Despite
the differences on the global sequence level, the protein sequence
networks of expansins from Bacteria and Viridiplantae share similarities
on a domain-based sequence level (Figures S1, S6, and S7). The
N-terminal expansin domain is more conserved than the C-terminal
expansin domain (Figure 3 and https://doi.org/10.18419/darus-735). When
expansin homologues from more diverse backgrounds are discovered in the
future, updated profile HMMs may provide further insight into the
possible co-evolution of both expansin domains.
A conservation analysis revealed and confirmed positions with an
essential functional or structural role in expansin homologues. Glycine
is structurally relevant, as it mediates the formation of short
loops57 and is frequently observed at the N- and
C-caps of α-helices to increase helix stability58. As
observed previously for other protein families59,60,
glycine is the most conserved amino acid in both expansin domains. In
expansins, all four conserved glycines are located in loop regions
(Table 2, compare with Figure 1). The conservation of
threonine 12 and aspartate 82 in Bacteria, Fungi, EXPA, and EXPB
confirms their functional role (ref. 10). Interestingly, at
standard position 75, which plays a moderate role in the cell wall
extension activity of BsEXLX1 (ref. 10), a glutamate is conserved
in the superfamily ‘Bacterial expansins’, and a glycine in EXPA and
EXLB. In contrast, standard position 75 is not conserved in the
superfamily ‘Fungal expansins’, in EXPB, and in EXLA (Table 2 and
https://doi.org/10.18419/darus-735). Aspartate 71, which has been
proposed as important but not essential for the wall extension activity
of BsEXLX1 (ref. 10), is conserved in the superfamilies
‘Bacterial expansins’ and ‘Fungal expansins’, and in EXPB, EXLA, and
EXLB (Table 2 and https://doi.org/10.18419/darus-735). However,
three other proposed key amino acids for cell wall extension activity
(threonine 14, serine 16, and tyrosine 73; ref. 10) are
not conserved in expansins from Bacteria, Fungi, or Viridiplantae
(Table S3 and https://doi.org/10.18419/darus-735), indicating the
importance of an increased sample size for conservation analysis. The
large number of
expansin sequences investigated here also provided a deeper insight into
the structural or functional relevance of disulfide bridges in the
different superfamilies. Previously, three disulfide bridges were
proposed to stabilize the tertiary structure of the N-terminal expansin
domain of EXPA and EXPB14,15. Five of the proposed six
cysteines could be confirmed as highly conserved in the superfamily
‘Plant expansins’ (Table 2 and
https://doi.org/10.18419/darus-735). The sixth cysteine is located
directly before the linker to the C-terminal expansin domain and is
therefore not included in our profile HMM for the N-terminal expansin
domain. Against expectations, the additional highly conserved fourth
cysteine pair in plant α-expansins reported in ref. 15 was not
found in our analysis (https://doi.org/10.18419/darus-735). Only three
conserved cysteines were found in the superfamily ‘Fungal expansins’;
thus, not all fungal expansin homologues possess three disulfide bridges,
as had been concluded from the expansin ScExlx1 (ref. 16). None
of the six cysteines was conserved in the superfamily ‘Bacterial
expansins’ (Table 2), which is in accordance with previous
observations of bacterial expansins lacking disulfide
bridges (ref. 13).
In the C-terminal expansin domain, the three aromatic residues at
standard positions 125, 126, and 157, which mediate binding to
cellulose10, are conserved in the superfamilies
‘Bacterial expansins’ and ‘Fungal expansins’, but are less conserved in
the superfamily ‘Plant expansins’ (Table S4 and
https://doi.org/10.18419/darus-735). Lysine 119, which is important for
cell wall-loosening activity10, is conserved in the
superfamilies ‘Bacterial expansins’ and ‘Fungal expansins’, but not
conserved in the superfamily ‘Plant expansins’ (Table 2).
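To illustrate what "conserved at a standard position" means operationally, the following sketch computes the most frequent residue and its frequency for one column of a multiple sequence alignment. The majority-frequency measure, the gap handling, and the function name are assumptions made for illustration; the conservation measure actually used in this study is not specified in this section.

```python
from collections import Counter


def column_conservation(alignment, column, gap="-"):
    """Return (most frequent residue, its fraction) for one alignment column.

    `alignment` is a list of equal-length aligned sequences. Gap characters
    are ignored, so the fraction refers to non-gap residues only
    (an assumption; other definitions count gaps as mismatches).
    """
    residues = [seq[column] for seq in alignment if seq[column] != gap]
    if not residues:
        return None, 0.0
    residue, count = Counter(residues).most_common(1)[0]
    return residue, count / len(residues)


# Toy alignment (artificial sequences, not taken from the ExED).
toy_alignment = ["GTA", "GSA", "G-C", "GTA"]
first_column = column_conservation(toy_alignment, 0)
```

In the toy alignment, the first column is an invariant glycine, whereas the second column contains threonine in two of the three non-gap sequences.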
Conservation analysis confirmed previously published
family-specific motifs: in the N-terminal expansin
domain, the T(F/W)YG motif was present in the two superfamilies
‘Bacterial expansins’ and ‘Fungal expansins’ (standard positions 12-14
and 14.1), and the motifs GGACG (20-24) and HFD (80-82) in the
superfamily ‘Plant expansins’ (refs. 9, 55; Table 3).
We suggest extending the GGACG motif to a GGACGYG motif and the HFD
motif to an HFDL motif in plant expansins. In bacterial and fungal
expansins, these two plant motifs are slightly different: in the
superfamily ‘Fungal expansins’, the GGACGYG motif is shorter (GGxC), and
in fungal and bacterial expansins the HFDL motif is replaced by HLDL.
The HLD motif as well as the GGACS motif were already described for the
fungal expansin ScEXLX1 (ref. 16). Newly proposed
motifs in the N-terminal expansin domain are VpGP (58-61) in the
superfamily ‘Bacterial expansins’ and GTAnS (34-38) in the superfamily
‘Fungal expansins’ (Tables 3 and S3), where p and n
denote polar and nonpolar amino acids, respectively. In expansins from
Fungi, the proline of the VpGP motif is replaced by a nonpolar amino
acid. The previously described CDRC motif at the amino terminus of
EXLA (ref. 55) is located beyond the boundaries of our profile
HMM for the N-terminal expansin domain.
To date, no sequence motifs have been proposed for the C-terminal
expansin domain; we identified eight novel motifs: KpG(S/T)S (119-123) and
QVRNH (130-134) in the superfamilies ‘Bacterial expansins’ and ‘Fungal
expansins’, where QVRNH shows slight modifications; LEVSTDGD (141-146,
including 143.1 and 143.2), GGG (164-166), and VDVRVT (170-175) in
‘Fungal expansins’; YLA (126-128) in EXLA and EXLB; WGA (156-158) in
EXPB and with slight modifications in EXPA, EXLA and EXLB; and LSFpVT
(170-175) in EXPA (Table 3). When annotating expansin sequences
in future studies, these sequence motifs will help to assign unknown
protein sequences (e.g. metagenomic sequences) to the kingdoms
Viridiplantae, Bacteria, or Fungi, and to distinguish the
plant expansins EXPA, EXPB, EXLA, and EXLB (Tables 3, S3, and S4).
Exemplary annotations were shown herein
for actinobacterial genome samples from South Africa, including a
putative expansin homologue from S. swartbergensis (Tables S8 and S9).
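As a rough illustration of how the N-terminal motifs listed above could be used to pre-screen unknown sequences, the sketch below translates some of them into regular expressions. Several points are assumptions: the residue sets used for p (polar) and n (nonpolar) are one common convention and are not defined in this section, the function name is hypothetical, and a regular-expression scan of the raw sequence ignores the standard positions, so it cannot replace the profile HMM based annotation.

```python
import re

# Assumed residue classes; the exact sets behind "p" and "n" are not given here.
POLAR = "STNQYCHKRDE"
NONPOLAR = "AVLIMFWPG"

# Motif name -> regular expression, following the notation used in the text.
MOTIFS = {
    "T(F/W)YG": "T[FW]YG",              # bacterial and fungal expansins
    "GGACGYG": "GGACGYG",               # plant expansins (extended GGACG)
    "HFDL": "HFDL",                     # plant expansins (extended HFD)
    "HLDL": "HLDL",                     # bacterial and fungal expansins
    "VpGP": f"V[{POLAR}]GP",            # bacterial expansins
    "GTAnS": f"GTA[{NONPOLAR}]S",       # fungal expansins
}


def find_motifs(sequence):
    """Return {motif name: [0-based start positions]} for motifs that occur."""
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [match.start() for match in re.finditer(pattern, sequence)]
        if positions:
            hits[name] = positions
    return hits


# Artificial test sequence, constructed only to contain three of the motifs.
toy_hits = find_motifs("MKTFYGAAGGACGYGNNHFDLXX")
```

A sequence that carries GGACGYG and HFDL but not HLDL would point towards a plant expansin under this simple scheme; any such assignment should be checked against the domain annotation.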
The large number of expansin sequences used for analysis not only
improved the identification of motifs, but also shed light on
evolutionary relationships. Interestingly, when searching with the newly
established profile HMMs for expansin domains within the CBM63 protein
sequences from CAZy, 510 out of the 582 CBM63 protein sequences were
found to contain both expansin domains (Table S10). Only four
sequences had a similarity to the C-terminal expansin domain, while
missing the N-terminal expansin domain, as suggested
previously1, and 58 CBM63 sequences contained only the
N-terminal expansin domain.
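The grouping of CBM63 sequences by domain content can be expressed as a small tally. The sketch below is illustrative only: the function names and the input format (one pair of booleans per sequence, indicating a hit of the N-terminal and C-terminal profile HMM) are assumptions, and only the counts 510, 4, 58, and 582 are taken from the text.

```python
from collections import Counter


def classify_architecture(has_n_domain, has_c_domain):
    """Assign a sequence to a category based on which domains were detected."""
    if has_n_domain and has_c_domain:
        return "both"
    if has_n_domain:
        return "N-terminal only"
    if has_c_domain:
        return "C-terminal only"
    return "none"


def tally_architectures(hits):
    """Count categories for a {sequence id: (has_n, has_c)} mapping."""
    return Counter(classify_architecture(n, c) for n, c in hits.values())


# Counts reported in the text for the three categories.
reported = {"both": 510, "C-terminal only": 4, "N-terminal only": 58}
total_cbm63 = 582
not_covered = total_cbm63 - sum(reported.values())
```

The three reported categories sum to 572 sequences, so 10 of the 582 CBM63 sequences are not assigned to any of the categories named in the text.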
The observation of four bacterial sequences being found in clusters of
plant expansins supports the hypothesis that microbial expansins were
derived via horizontal gene transfer from plants to
microbes (ref. 7; Figures 2, S6, and S7). The two bacterial sequences in
clusters of the superfamily ‘Plant expansins’ (Figure 5) are from the
plant pathogens Kutzneria sp. 744 (NCBI accession EWM10128.1) and
Streptomyces acidiscabies (NCBI accession WP_050370046.1), which
are both actinobacteria, as described previously (ref. 2).
With the chosen filter criteria, the sequence of the fungal swollenin
does not contain any expansin domain. As the score for the C-terminal
expansin domain is far below the chosen criteria, the swollenin sequence
resembles a distantly related C-terminal expansin domain (Table S7),
but we found no N-terminal expansin domain within the protein
sequence of swollenin. This is due to the short N-terminal expansin
domain in the swollenin from Trichoderma reesei and confirms the
rather low sequence similarity between swollenin and
expansins47.
On a global sequence level, GH45s and N-terminal expansin domains share
less than 30% pairwise sequence identity (Figure 4), and
neither the profile HMM search of the N- and C-terminal expansin domains
in the 542 GH45 sequences nor the profile HMM search of the GH45 profile
HMM from Pfam (https://pfam.xfam.org/family/PF02015/hmm) in the 15,089
sequences of the ExED resulted in a match. In comparison to N-terminal
expansin domains, GH45 sequences are longer due to several inserts and
longer loop regions (179-208 amino acids as compared to 90-115 amino
acids of the N-terminal expansin domains). Despite these differences,
the evolutionary relationship between the two protein families is
underlined by conserved amino acids. Both the conserved threonine and
aspartate at standard positions 12 and 82, and the HFDL-motif (standard
positions 80-83) were found in the GH45 protein sequences.
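For readers unfamiliar with the 30% threshold mentioned above, pairwise sequence identity can be computed from two aligned sequences as sketched below. The definition used here (identical residues divided by the number of columns in which neither sequence has a gap) is one of several in common use and is an assumption; the function name is hypothetical, and the alignment itself must be produced beforehand.

```python
def pairwise_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned sequences of equal length.

    Only columns where neither sequence has a gap are compared
    (assumed definition; other conventions divide by the full
    alignment length or by the shorter sequence length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Artificial aligned pair: three of four comparable columns are identical.
toy_identity = pairwise_identity("ACD-F", "ACE-F")
```

Under this definition, two sequences can share conserved key residues, such as those at standard positions 12 and 82, while their overall identity remains below 30%.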
This study confirms the observation that microbial expansins comprise
two protein domains and are widely distributed across diverse lineages
of Archaea, Bacteria, Fungi, other eukaryotic microbes (ref. 8),
and Viridiplantae. Therefore, the ExED can serve as a basis for a
more detailed phylogenetic analysis in order to elucidate the origin of
expansins and ancient evolutionary dynamics. Furthermore, the ExED can
be used to search for expansin genes in virulent fungal and bacterial
plant pathogens.