Sequence hierarchy in the ExED
The initial twenty-five seed sequences comprise six bacterial, one
fungal, and seventeen plant expansins, as well as one expansin-like
swollenin sequence. The BLAST hits for each of these seed sequences were
assigned to the corresponding superfamily, named ‘Bacterial expansins’,
‘Fungal expansins’, ‘Plant expansins’, or ‘N-terminal domains’. Hence,
the division of the identified protein sequences into the different
superfamilies was based on sequence identity, and not on phylogenetic
relationships. Herein, the term family refers to a group of sequences
sharing a certain degree of similarity, i.e. a cluster of similar
sequences rather than a clade in a phylogenetic tree. Homologous families were
defined by a cutoff of 60% pairwise sequence identity, as determined by
the Needleman-Wunsch algorithm implemented in the EMBOSS software suite
(version 6.6.0), with gap opening and extension penalties of 10 and 0.5,
respectively [32,33]. All sequence entries that shared
at least 98% global sequence identity were assigned to a single protein
entry. For each sequence entry, the respective superfamily, homologous
family, and protein entry were annotated together with the identifiers
of the original source database.
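
The two identity cutoffs described above can be illustrated with a minimal Python sketch. It assumes that percent identities from global pairwise alignments have already been computed, and it groups sequences by single linkage, which is an assumption made here for illustration because the linkage criterion is not stated in the text. The function name `cluster_by_identity`, the sequence identifiers, and the identity values are all hypothetical.

```python
from itertools import combinations


def cluster_by_identity(ids, identity, cutoff):
    """Single-linkage grouping (an assumption, not stated in the text):
    two sequences share a group if a chain of pairs with
    identity >= cutoff connects them."""
    parent = {s: s for s in ids}

    def find(s):
        # Follow parent links to the group representative (path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(ids, 2):
        # Each unordered pair is stored once; missing pairs count as 0%.
        pct = identity.get((a, b), identity.get((b, a), 0.0))
        if pct >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for s in ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical identifiers and percent identities, for illustration only.
ids = ["seqA", "seqB", "seqC", "seqD"]
identity = {
    ("seqA", "seqB"): 99.1,
    ("seqA", "seqC"): 72.0,
    ("seqB", "seqC"): 71.5,
    ("seqA", "seqD"): 31.0,
    ("seqB", "seqD"): 30.2,
    ("seqC", "seqD"): 28.9,
}

# 98% cutoff: sequence entries merged into protein entries.
protein_entries = cluster_by_identity(ids, identity, cutoff=98.0)
# 60% cutoff: sequences grouped into homologous families.
homologous_families = cluster_by_identity(ids, identity, cutoff=60.0)

print(protein_entries)
print(homologous_families)
```

In this toy input, seqA and seqB collapse into one protein entry at the 98% cutoff, while seqA, seqB, and seqC form one homologous family at the 60% cutoff and seqD remains on its own at both levels.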