Materials and Methods

Data selection

All data were extracted from the Protein Data Bank [18][19]. Only X-ray crystal structures determined in the 80-120 K temperature range and refined at a resolution of 2.0 Å or better were retained. This resulted in about 66,500 Protein Data Bank entries.
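The selection criteria above can be sketched as a simple filter. This is only an illustrative sketch: the `Entry` record, its field names, and the method string are hypothetical, and the resolution criterion is assumed to mean a value of 2.0 Å or lower.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical summary record for one Protein Data Bank entry."""
    pdb_id: str
    method: str                      # experimental method, e.g. "X-RAY DIFFRACTION"
    temperature_k: Optional[float]   # data collection temperature, in kelvin
    resolution_a: Optional[float]    # refinement resolution, in angstroms


def keep_entry(entry: Entry) -> bool:
    """Return True if the entry satisfies the selection criteria."""
    if entry.method != "X-RAY DIFFRACTION":
        return False
    # Entries with missing temperature or resolution cannot be checked: discard them.
    if entry.temperature_k is None or entry.resolution_a is None:
        return False
    # Assumption: "2.0 A or better" means a numerical value <= 2.0.
    return 80.0 <= entry.temperature_k <= 120.0 and entry.resolution_a <= 2.0
```

Entries with missing temperature or resolution are discarded in this sketch, which is a design choice of the example and not something stated in the text.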
Two strategies were then followed to extract non-redundant datasets.
On the one hand, pairwise sequence redundancy was reduced with CD-HIT [20], with a maximal sequence identity of 40%, and only chains containing at least 50 amino acids were considered. This resulted in the Single dataset, which contains about 14,000 protein chains.
On the other hand, the RaSPDB procedure was applied [21]. It consists of creating several subsets of the Protein Data Bank. Each subset must be large enough to be representative of the Protein Data Bank and small enough to avoid internal redundancy. Nine non-overlapping subsets, each containing about 7,000 protein chains of more than 50 amino acids, were assembled. All statistical analyses were performed on each subset and the results were then averaged. This procedure makes it possible to use a much larger fraction of the Protein Data Bank and to estimate the standard error of each estimate. The nine resulting subsets are named raspdb_X (X = 1-9).
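The averaging over subsets can be illustrated with a short sketch. It assumes that the standard error is the sample standard deviation of the per-subset values divided by the square root of the number of subsets; the text does not state the exact formula, and the function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Average one statistic over the subsets and estimate its standard error.

    `values` holds the same statistic computed independently on each subset
    (nine values for raspdb_1 to raspdb_9).
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two subsets are needed to estimate a standard error")
    mean = statistics.fmean(values)
    # Assumption: standard error = sample standard deviation / sqrt(number of subsets).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error
```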

Chalcogen bond detection

In previous studies of chalcogen bonds formed by selenomethionine, the position of the nucleophile relative to the selenium atom was described by means of spherical coordinates [7][11], which require the atomic positions of the C-Se-C triatomic fragment of the selenomethionine side-chain. An analogous approach is impossible here, since the attention is focused on the C-S-H triatomic fragment of cysteine and the coordinates of its hydrogen atom are usually unknown: acidic and rotatable hydrogen atoms are often undetected, even at very high crystallographic resolution or in neutron diffraction studies.
In principle, it is possible to compute the positions of these hydrogen atoms by optimizing their interactions with nearby atoms [22], that is, by optimizing their hydrogen bonds [23]. Here this computation was avoided, since it would inevitably bias the analysis of chalcogen bonds.
As a consequence, an S-Nu chalcogen bond was simply defined as a contact that is shorter than 3.4 Å (when Nu is an oxygen atom) or 3.7 Å (when Nu is a sulfur atom) and that is collinear or nearly collinear with the C-S bond, which means that the angle α = 180°-(Cβ-Sγ-Nu) must be smaller than 25°. This threshold is larger than 20°, the value used in chemistry and materials science, to account for the lower accuracy of macromolecular crystal structures.
Care was taken to remove from the list of chalcogen bonds the disulfide bonds and the short sulfur-sulfur contacts that may be observed for radiation-damaged disulfide bonds [24][25]. Analogously, short sulfur-sulfur contacts resulting from the interaction of both sulfur atoms with the same heteroatom, typically a metal cation, were removed from the list.
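The geometric definition above can be written as a minimal sketch. The function and constant names are hypothetical, coordinates are assumed to be (x, y, z) tuples in Å, and the exclusion of disulfide bonds and metal-bridged contacts is not included.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]

# Distance cutoffs (in angstroms) by nucleophile element, as given in the text.
DISTANCE_CUTOFF = {"O": 3.4, "S": 3.7}
MAX_ALPHA_DEG = 25.0


def alpha_angle(cb: Point, sg: Point, nu: Point) -> float:
    """Return alpha = 180 - (Cb-Sg-Nu angle), in degrees."""
    v1 = [c - s for c, s in zip(cb, sg)]   # vector from Sg to Cb
    v2 = [n - s for n, s in zip(nu, sg)]   # vector from Sg to Nu
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to protect acos against rounding errors.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return 180.0 - math.degrees(math.acos(cos_angle))


def is_chalcogen_bond(cb: Point, sg: Point, nu: Point, nu_element: str) -> bool:
    """Apply the distance and collinearity criteria to one Sg...Nu contact."""
    cutoff = DISTANCE_CUTOFF.get(nu_element)
    if cutoff is None:
        return False  # only oxygen and sulfur nucleophiles are considered here
    return math.dist(sg, nu) < cutoff and alpha_angle(cb, sg, nu) < MAX_ALPHA_DEG
```

With this definition, α is 0° when Nu lies exactly on the extension of the Cβ-Sγ bond and grows as the contact deviates from collinearity.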

Hydrogen bond detection

Potential hydrogen bonds that involve cysteine were identified with HBPLUS [26] and filtered according to the following criteria [27][28]: S-A < 4.3 Å and S-A-AA > 90° when the cysteine is a hydrogen donor, and D-S < 4.1 Å when the cysteine is a hydrogen acceptor (S is the cysteine sulfur atom, A the acceptor atom, AA the atom covalently bound to the acceptor, and D the donor atom). Additional stereochemical criteria that can be used to identify hydrogen bonds but that require the position of the hydrogen atom were disregarded, since this position is generally unknown.
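The filtering step can be sketched as follows. This sketch does not call HBPLUS; it only applies the distance and angle thresholds to coordinates assumed to be already extracted as (x, y, z) tuples in Å, and all function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def angle_deg(a: Point, b: Point, c: Point) -> float:
    """Return the a-b-c angle (vertex at b), in degrees."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to protect acos against rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def cys_donor_ok(s: Point, a: Point, aa: Point) -> bool:
    """Cysteine as donor: S-A < 4.3 A and S-A-AA > 90 degrees."""
    return math.dist(s, a) < 4.3 and angle_deg(s, a, aa) > 90.0


def cys_acceptor_ok(d: Point, s: Point) -> bool:
    """Cysteine as acceptor: D-S < 4.1 A."""
    return math.dist(d, s) < 4.1
```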

Miscellaneous

Solvent accessible surface areas were computed with NACCESS [29] and secondary structure assignments were performed with Stride [30].