Sequence collection for the ExED
The Expansin Engineering Database (ExED, https://exed.biocatnet.de) was built within the BioCatNet database system, starting from twenty-five protein seed sequences (Table S1)25. These seed sequences were used as queries for the Basic Local Alignment Search Tool (BLAST)26, with an e-value cutoff of 10⁻¹⁰, against the non-redundant protein database27 of the National Center for Biotechnology Information (NCBI)28 and against the Protein Data Bank (PDB)29. Two subsequent updates were performed to further enrich the ExED. For the first update, the sequences found by the initial search were clustered with UCLUST from the USEARCH package (version 11.0.667)30 at a threshold of 80% sequence identity, and the centroids (representative sequences) served as seed sequences for a further BLAST search against the NCBI non-redundant protein database and the PDB. The seed sequences for the database updates of the ExED are available at https://doi.org/10.18419/darus-622. For the second update, profile hidden Markov models (HMMs) were generated for the N- and C-terminal expansin domains as described below. Further sequences were collected by searching with the hmmscan command from the HMMER software package (version 3.1b2, http://www.hmmer.org, Howard Hughes Medical Institute, Chevy Chase, MD, USA)31. The hits were filtered by a minimal domain-based score of 35 (chosen after comparison with HMMER’s domain-based “independent” e-values), a minimal hit length of 60 amino acids, and a maximal ratio of bias to domain-based score of 10%.
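The three hit filters described above can be illustrated with a minimal Python sketch. It is not the code used to build the ExED. The DomainHit record, its field names, and the function names are hypothetical. Two assumptions are made: the hit length is taken to be the aligned span on the sequence (ali_to - ali_from + 1), and all thresholds are treated as inclusive. The example hits are invented for illustration only.

```python
from dataclasses import dataclass

# Thresholds taken from the text; treating them as inclusive is an assumption.
MIN_DOMAIN_SCORE = 35.0   # minimal domain-based score
MIN_HIT_LENGTH = 60       # minimal hit length in amino acids
MAX_BIAS_RATIO = 0.10     # maximal ratio of bias to domain-based score


@dataclass(frozen=True)
class DomainHit:
    """Hypothetical record for one per-domain hit (field names are illustrative)."""
    sequence_id: str
    score: float    # domain-based bit score
    bias: float     # domain-based bias
    ali_from: int   # first aligned residue on the sequence (1-based)
    ali_to: int     # last aligned residue on the sequence (1-based)

    @property
    def length(self) -> int:
        # Assumption: hit length is the aligned span on the sequence.
        return self.ali_to - self.ali_from + 1


def passes_filters(hit: DomainHit) -> bool:
    """Return True if the hit satisfies all three criteria."""
    if hit.score < MIN_DOMAIN_SCORE:
        return False
    if hit.length < MIN_HIT_LENGTH:
        return False
    # The score is at least 35 at this point, so the division is safe.
    return hit.bias / hit.score <= MAX_BIAS_RATIO


def filter_hits(hits: list[DomainHit]) -> list[DomainHit]:
    """Keep only the hits that pass every filter, preserving input order."""
    return [hit for hit in hits if passes_filters(hit)]


# Invented example hits, one per outcome.
example_hits = [
    DomainHit("seq_a", score=80.2, bias=1.1, ali_from=10, ali_to=130),  # passes
    DomainHit("seq_b", score=30.0, bias=0.2, ali_from=1, ali_to=120),   # score too low
    DomainHit("seq_c", score=50.0, bias=0.5, ali_from=5, ali_to=50),    # too short
    DomainHit("seq_d", score=40.0, bias=6.0, ali_from=1, ali_to=100),   # bias ratio too high
]
kept_ids = [hit.sequence_id for hit in filter_hits(example_hits)]
```

In practice, the records would be filled from the per-domain search results. Which coordinate pair defines the hit length (alignment, envelope, or model coordinates) is not stated in the text and would need to be confirmed.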