Sequence hierarchy in the ExED
The initial twenty-five seed sequences comprise six bacterial, one fungal, and seventeen plant expansins, as well as one expansin-like swollenin sequence. The BLAST hits for each seed sequence were assigned to the corresponding superfamily: 'Bacterial expansins', 'Fungal expansins', 'Plant expansins', or 'N-terminal domains'. The division of the identified protein sequences into superfamilies was therefore based on sequence identity, not on phylogenetic relationships. Herein, the term family refers to a group of sequences sharing a certain degree of similarity, i.e. a cluster of similar sequences rather than a clade in a phylogenetic tree. Homologous families were defined using a cutoff of 60% pairwise sequence identity, as determined by the Needleman-Wunsch algorithm implemented in the EMBOSS software suite (version 6.6.0) with gap opening and extension penalties of 10 and 0.5, respectively [32,33]. All sequence entries that shared at least 98% global sequence identity were assigned to a single protein entry. For each sequence entry, the respective superfamily, homologous family, and protein entry were annotated together with the identifiers of the original source database.
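The two identity thresholds can be illustrated with a minimal sketch. It assumes that pairwise global identities (in percent) have already been computed, for example with a Needleman-Wunsch alignment, and that sequences are grouped by single-linkage clustering. The linkage rule, the function name `cluster_by_identity`, and the sequence identifiers and identity values are hypothetical and are not taken from the ExED pipeline.

```python
from itertools import combinations


def cluster_by_identity(ids, identity, cutoff):
    """Group sequences by single linkage: two sequences end up in the same
    cluster if they are connected by a chain of pairs whose identity is
    at least `cutoff` (percent). Single linkage is an assumption here."""
    parent = {s: s for s in ids}

    def find(x):
        # Find the cluster representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if identity[frozenset((a, b))] >= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise global identities in percent (illustrative values only).
ids = ["seqA", "seqB", "seqC", "seqD"]
identity = {
    frozenset(("seqA", "seqB")): 99.1,
    frozenset(("seqA", "seqC")): 72.0,
    frozenset(("seqB", "seqC")): 71.5,
    frozenset(("seqA", "seqD")): 31.0,
    frozenset(("seqB", "seqD")): 30.2,
    frozenset(("seqC", "seqD")): 28.4,
}

# 60% cutoff for homologous families, 98% cutoff for protein entries.
homologous_families = cluster_by_identity(ids, identity, 60.0)
protein_entries = cluster_by_identity(ids, identity, 98.0)
```

With these illustrative values, seqA, seqB, and seqC fall into one homologous family while seqD forms its own; at the 98% threshold only seqA and seqB are merged into a single protein entry.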