Materials and Methods
Data selection
All data were extracted from the Protein Data Bank [18][19]. Only X-ray
crystal structures determined in the 80-120 K temperature range and
refined at a resolution of 2.0 Å or better were retained. This resulted
in about 66,500 Protein Data Bank entries.
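The selection criteria above can be illustrated with a minimal Python
sketch. It is not the code used in this work: the Entry record, its field
names and the method label are hypothetical, and it is assumed that the
temperature limits are inclusive and that the resolution criterion means a
numerical value not larger than 2.0 Å.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical summary of one Protein Data Bank entry."""
    pdb_id: str
    method: str                      # experimental method label (assumed)
    temperature_k: Optional[float]   # data collection temperature, in K
    resolution_a: Optional[float]    # resolution, in Å


def is_retained(entry: Entry) -> bool:
    """Return True if the entry satisfies the selection criteria."""
    if entry.method != "X-RAY DIFFRACTION":
        return False
    # Entries with missing temperature or resolution cannot be checked.
    if entry.temperature_k is None or entry.resolution_a is None:
        return False
    # Assumption: both temperature limits are inclusive.
    in_range = 80.0 <= entry.temperature_k <= 120.0
    # Assumption: the cutoff means a value of 2.0 Å or smaller.
    return in_range and entry.resolution_a <= 2.0
```

Entries lacking a temperature or a resolution value are discarded in this
sketch, which is a conservative choice rather than a documented one.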
Two strategies were then followed to extract non-redundant data sets.
On the one hand, pairwise sequence redundancy was reduced with CD-HIT
[20], with a maximal sequence identity of 40%, and only chains
containing at least 50 amino acids were considered. This resulted in the
Single dataset, which contains about 14,000 protein chains.
On the other hand, the RaSPDB procedure was applied [21]. It consists
of creating several subsets of the Protein Data Bank, each of which
must be large enough to be representative of the Protein Data Bank and
small enough to avoid internal redundancy. Nine non-overlapping subsets,
each containing about 7,000 protein chains of more than 50 amino acids,
were assembled; all statistical analyses were performed on each subset
and the results were then averaged. This procedure allows one to use a
much larger fraction of the Protein Data Bank and to estimate the
standard error of each estimate. This resulted in the nine subsets
raspdb_X (X = 1-9).
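The averaging over the nine subsets can be sketched as follows. This is a
minimal illustration, not the code used in this work: the function name is
hypothetical, and it is assumed that the standard error is computed as the
sample standard deviation of the per-subset values divided by the square
root of the number of subsets.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Average per-subset estimates and return (mean, standard error).

    Assumption: standard error = sample standard deviation / sqrt(n),
    where n is the number of subsets.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two subsets are needed")
    mean = statistics.fmean(values)
    # statistics.stdev uses the n - 1 denominator (sample estimate).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return mean, standard_error
```

Here, values would hold one number per subset raspdb_X, for example the
fraction of cysteines involved in a given type of contact.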
Chalcogen bond detection
In previous studies of chalcogen bonds formed by selenomethionine, the
position of the nucleophile relative to the selenium atom was described
by means of spherical coordinates [7][11], which require the atomic
positions of the C-Se-C triatomic fragment of the selenomethionine
side-chain. An analogous approach is impossible here, since the
attention is focused on the C-S-H triatomic fragment of cysteine and the
coordinates of its hydrogen atom are usually unknown: acidic and
rotatable hydrogen atoms are often undetected, even at very high
crystallographic resolution or in neutron diffraction studies.
In principle, it is possible to compute the positions of these hydrogen
atoms by optimizing their interactions with nearby atoms [22], that is,
by optimizing their hydrogen bonds [23]. Here, however, it is preferable
not to compute the positions of these hydrogen atoms, since this would
inevitably bias the analysis of chalcogen bonds.
As a consequence, an S-Nu chalcogen bond was simply defined as a contact
shorter than 3.4 Å (when Nu is an oxygen atom) or 3.7 Å (when Nu is a
sulfur atom) that is collinear or nearly collinear with the C-S bond,
which means that the angle α = 180° - ∠(Cβ-Sγ-Nu) must be smaller than
25°. Note that this threshold is larger than 20°, the value used in
chemistry and materials science, in order to account for the lower
accuracy of macromolecular crystal structures.
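The geometric definition above can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the code used in
this work: the function names are hypothetical, coordinates are assumed to
be Cartesian triples in Å, and the exclusion of disulfide bonds and of
metal-bridged contacts is not included.

```python
import math
from typing import Sequence

# Distance cutoffs (Å) by element of the nucleophile Nu, as in the text.
MAX_DISTANCE = {"O": 3.4, "S": 3.7}
MAX_ALPHA_DEG = 25.0  # upper limit of α = 180° - angle(Cβ-Sγ-Nu)


def angle_deg(a: Sequence[float], b: Sequence[float],
              c: Sequence[float]) -> float:
    """Return the angle a-b-c (vertex at b) in degrees."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident atoms")
    cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    # Clamp to [-1, 1] to protect acos from rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def is_chalcogen_bond(cb: Sequence[float], sg: Sequence[float],
                      nu: Sequence[float], nu_element: str) -> bool:
    """Apply the distance and collinearity criteria to one contact."""
    cutoff = MAX_DISTANCE.get(nu_element)
    if cutoff is None:
        return False  # only oxygen and sulfur nucleophiles are considered
    distance = math.dist(sg, nu)
    if distance == 0.0 or distance >= cutoff:
        return False
    alpha = 180.0 - angle_deg(cb, sg, nu)
    return alpha < MAX_ALPHA_DEG
```

For a nucleophile lying exactly on the extension of the Cβ-Sγ bond, the
angle Cβ-Sγ-Nu is 180° and α is 0°, so only the distance criterion decides.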
Care was taken to remove from the list of chalcogen bonds the disulfide
bonds and the short sulfur-sulfur contacts that may be observed for
radiation-damaged disulfide bonds [24][25]. Analogously, short
sulfur-sulfur contacts resulting from the interaction of both sulfur
atoms with the same heteroatom (typically a metal cation) were also
removed from the list.
Hydrogen bond detection
Potential hydrogen bonds that involve cysteine were identified with
HBPLUS [26] and filtered according to the following criteria
[27][28]: S-A < 4.3 Å and S-A-AA > 90° when the cysteine is a
hydrogen-bond donor, and D-S < 4.1 Å when the cysteine is a
hydrogen-bond acceptor (A, acceptor; AA, acceptor antecedent; D, donor).
Additional stereochemical criteria that can be used to identify hydrogen
bonds but require knowledge of the hydrogen atom positions were
disregarded, since these positions are generally unknown.
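The filtering criteria above can be sketched in Python as follows. This is
a minimal illustration, not the code used in this work: it does not call
HBPLUS, the function names are hypothetical, coordinates are assumed to be
Cartesian triples in Å, and AA is assumed to denote the atom covalently
bound to the acceptor A.

```python
import math
from typing import Sequence

Point = Sequence[float]


def angle_deg(a: Point, b: Point, c: Point) -> float:
    """Return the angle a-b-c (vertex at b) in degrees."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident atoms")
    cos = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    # Clamp to [-1, 1] to protect acos from rounding errors.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def cysteine_is_donor(s: Point, a: Point, aa: Point) -> bool:
    """Cysteine as donor: S-A < 4.3 Å and angle S-A-AA > 90°."""
    return math.dist(s, a) < 4.3 and angle_deg(s, a, aa) > 90.0


def cysteine_is_acceptor(d: Point, s: Point) -> bool:
    """Cysteine as acceptor: D-S < 4.1 Å."""
    return math.dist(d, s) < 4.1
```

Both functions use only heavy-atom positions, in line with the decision
not to rely on hydrogen atom coordinates.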
Miscellaneous
Solvent-accessible surface areas were computed with NACCESS [29], and
secondary structure assignments were performed with Stride [30].