Protein sequence networks
Protein sequence networks represent large sequence datasets as undirected graphs in which nodes are sequences and weighted edges encode pairwise similarity, so that relationships between clusters or communities can be derived. The protein sequences in the ExED were sorted by decreasing sequence length and then clustered with the UCLUST algorithm implemented in USEARCH at a threshold of 90% sequence identity (without terminal gaps) to obtain a reduced set of centroid (representative) sequences [30]. For each centroid sequence, the N- and C-terminal expansin domains were annotated with the two profile HMMs using the filter criteria mentioned above. Pairwise sequence identities were derived from global Needleman-Wunsch alignments as described above and used as edge weights. Protein sequence networks were then generated from these weighted edges, with edges filtered by a pre-defined sequence identity threshold. Metadata of the nodes (e.g. the sequence ID) and of the edges (i.e. the edge weights) were written to GraphML files with the NetworkX library (version 1.9) in Python, which allowed node and edge attributes to be assigned automatically [41]. The GraphML files are available at https://doi.org/10.18419/darus-624. Protein sequence networks were visualized with Cytoscape version 3.7.2 [42] using a prefuse force-directed layout weighted by the edge weights.
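The network construction described above can be sketched in a few lines of NetworkX. This is a minimal illustration, not the authors' script: the function name, the attribute name `sequence_id`, the output file name, the identity values, and the 60% cut-off are all hypothetical, and the sketch assumes that "filtered" means keeping edges at or above the threshold.

```python
import networkx as nx


def build_identity_network(identities, threshold):
    """Build an undirected network from pairwise sequence identities.

    identities: dict mapping (id_a, id_b) to percent sequence identity.
    threshold:  edges below this identity are dropped (assumed direction).
    """
    graph = nx.Graph()
    for (id_a, id_b), identity in identities.items():
        # Every sequence becomes a node, even if all of its edges are filtered out.
        graph.add_node(id_a, sequence_id=id_a)
        graph.add_node(id_b, sequence_id=id_b)
        if identity >= threshold:
            # The pairwise identity is stored as the edge weight.
            graph.add_edge(id_a, id_b, weight=float(identity))
    return graph


# Made-up identities for three hypothetical centroid sequences.
identities = {
    ("seq1", "seq2"): 72.5,
    ("seq1", "seq3"): 41.0,
    ("seq2", "seq3"): 65.3,
}

network = build_identity_network(identities, threshold=60.0)

# Node and edge attributes are serialized together with the graph structure.
nx.write_graphml(network, "example_network.graphml")
```

The resulting GraphML file can be imported into Cytoscape, where the `weight` edge attribute is available to an edge-weighted layout.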
For the networks showing the relationships between CBM63s and expansin homologues, and between GH45s and the N-terminal expansin domain homologues, CD-HIT (version 4.7) was used instead of UCLUST, with a clustering threshold of 90% sequence identity and a word size of 5 [43,44]. The GH45 sequences were downloaded from the Pfam protein family database (version 32.0, accession PF02015) [45], whereas the CBM63 sequences were downloaded from the carbohydrate-active enzymes (CAZy) database on June 3, 2019 [46]. The CAZy database listed 633 individual CBM63 sequences, but only 582 NCBI accessions were available at the time of writing, because some records had been moved or merged. Members of CBM63 were annotated with the profile HMMs for the two expansin domains (https://doi.org/10.18419/darus-625).
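The CD-HIT settings stated above correspond to the program's `-c` (sequence identity threshold) and `-n` (word length) options. The sketch below assembles such a call from Python; the exact invocation used by the authors is not given in the text, so the helper names and file names are hypothetical, and mapping the 90% threshold to `-c 0.9` is an assumption.

```python
import subprocess


def cd_hit_command(input_fasta, output_prefix, identity=0.9, word_size=5):
    """Assemble a CD-HIT command line for the given clustering settings."""
    return [
        "cd-hit",
        "-i", input_fasta,      # input sequences in FASTA format
        "-o", output_prefix,    # output file for representative sequences
        "-c", str(identity),    # sequence identity threshold (0.9 = 90%)
        "-n", str(word_size),   # word length
    ]


def run_cd_hit(input_fasta, output_prefix):
    """Run CD-HIT; requires the cd-hit executable on the PATH."""
    subprocess.run(cd_hit_command(input_fasta, output_prefix), check=True)


# Hypothetical file names, for illustration only.
command = cd_hit_command("cbm63_sequences.fasta", "cbm63_centroids")
```

Keeping command assembly separate from execution makes it possible to inspect or log the settings without having CD-HIT installed.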