Sequence collection for the ExED
The Expansin Engineering Database (ExED, https://exed.biocatnet.de) was
built within the BioCatNet database system starting from twenty-five
protein seed sequences (Table S1)25. These
seed sequences were used as queries for the Basic Local Alignment Search
Tool (BLAST)26 using an e-value cutoff of
10⁻¹⁰ against the non-redundant protein
database27 of the National Center for Biotechnology
Information (NCBI)28 and the Protein Data Bank
(PDB)29. Two subsequent updates were performed to
further enrich the ExED. For the first update, the sequences found by
the initial search were clustered with UCLUST from the USEARCH package
(version 11.0.667)30 at a threshold of 80% sequence
identity, and the centroids (representative sequences) served as seed
sequences for a BLAST search in the NCBI non-redundant protein database
and the PDB. The seed sequences for the database updates of the ExED are
available at https://doi.org/10.18419/darus-622. For the second
update, profile hidden Markov models (HMMs) were generated for the N-
and C-terminal expansin domains as described below. Further sequences
were collected by searching with the hmmscan command from the
HMMER software package (version 3.1b2, http://www.hmmer.org,
Howard Hughes Medical Institute, Chevy Chase, MD,
USA)31. Hits were retained only if they had a minimum
domain-based score of 35 (chosen after comparison with HMMER’s
domain-based “independent” e-values), a minimum hit length of 60 amino
acids, and a maximum ratio of bias to domain-based score of 10%.
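
The three hit filters can be illustrated with a minimal Python sketch. It is not the code used to build the ExED, and it rests on several assumptions: that the hits come from an hmmscan per-domain table (--domtblout), that hit length is measured over the alignment coordinates on the sequence, that "bias" is the per-domain bias, and that all thresholds are inclusive. The names DomainHit, parse_domtbl_line, passes_filters, and filter_hits are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds as stated in the text; treating them as inclusive is an assumption.
MIN_DOMAIN_SCORE = 35.0   # minimum domain-based bit score
MIN_HIT_LENGTH = 60       # minimum hit length in amino acids
MAX_BIAS_RATIO = 0.10     # maximum bias / domain-based score


@dataclass
class DomainHit:
    """One per-domain hit (hypothetical container for illustration)."""
    target: str      # profile HMM name
    query: str       # sequence identifier
    score: float     # domain-based bit score
    bias: float      # per-domain bias (assumed to be the bias meant in the text)
    ali_from: int    # alignment start on the sequence (1-based)
    ali_to: int      # alignment end on the sequence (inclusive)

    @property
    def length(self) -> int:
        # Assumption: hit length is the aligned span on the sequence.
        return self.ali_to - self.ali_from + 1


def parse_domtbl_line(line: str) -> DomainHit:
    """Parse one data line of an hmmscan --domtblout table."""
    f = line.split()
    return DomainHit(
        target=f[0],
        query=f[3],
        score=float(f[13]),   # "this domain" score column
        bias=float(f[14]),    # "this domain" bias column
        ali_from=int(f[17]),
        ali_to=int(f[18]),
    )


def passes_filters(hit: DomainHit) -> bool:
    """Apply the score, length, and bias-ratio criteria."""
    return (
        hit.score >= MIN_DOMAIN_SCORE
        and hit.length >= MIN_HIT_LENGTH
        and hit.bias / hit.score <= MAX_BIAS_RATIO
    )


def filter_hits(lines: Iterable[str]) -> List[DomainHit]:
    """Keep the hits that pass all filters, skipping comments and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_domtbl_line(line)
        if passes_filters(hit):
            hits.append(hit)
    return hits
```

In this sketch the score criterion is evaluated first, so the bias ratio is only computed for hits with a positive score. Under the stated assumptions, calling filter_hits on the lines of a per-domain table returns the retained hits.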