2.2 | Classifying evaluation units into evolutionary prediction classes
Historically, the outcome of a protein structure modeling exercise was largely predetermined by the evolutionary relationship between the target and experimentally determined structures. Proteins with apparent homology to available structures were typically easier to model, while targets without detectable homologs were on the harder side of the prediction difficulty spectrum. Since targets of different difficulty required different modeling approaches, yielded different degrees of model accuracy, and thus required different evaluation approaches, CASP previously assessed modeling results separately for each difficulty category. The names of the difficulty categories changed over time, but the major factor defining the difficulty remained the same: the availability of structural templates. This classical difficulty schema was shaken in CASP14, where the DeepMind group showed that highly accurate models can be built with AlphaFold 2 (AF2) for practically all targets, independently of template availability. This suggested that the classical division into largely homology-based difficulty categories may no longer be needed. Acting upon these developments, CASP organizers recommended that tertiary structure prediction in CASP15 be assessed in one batch. This analysis is presented elsewhere in this issue21. Nevertheless, as with splitting targets into EUs (above), the assignment of EUs to evolutionary prediction classes is still needed for comparing CASP15 results with those of earlier experiments.
In previous CASPs, EUs were classified into difficulty categories based on the availability of similar structures in the PDB, as detected by sequence- and structure-based searches (reflecting estimated difficulty), and on predictors’ performance (reflecting actual difficulty)9,10. Since performance has become more uniform across the whole range of targets, it no longer discriminates between them. To adapt to this situation, we explored automated approaches to target classification, aiming to recapitulate the outcomes of previous CASPs as far as possible while working solely with the results of automated PDB searches. Each EU was assigned a sequence-based and a structure-based similarity score. The sequence-based score was defined as the HHscore 10, which is the product of the HHsearch probability and the alignment coverage of the query for the top-ranked template identified by HHsearch. The structure-based score was the LGA_S score of the highest-ranked structural match according to the procedure described in section 2.1, Step 3. These scores were used to automatically assign EUs to prediction classes (see Results, section 3.2).
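The two scores described above can be illustrated with a minimal Python sketch. It is not the code used in the assessment: the class and function names (`TemplateHit`, `hhscore`, `score_eu`) are hypothetical, the HHsearch probability is assumed to be given in percent, and the alignment coverage is assumed to be the fraction of query residues covered by the alignment. The LGA_S value is taken as a precomputed input, and no class thresholds are shown because they are not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateHit:
    """Top-ranked HHsearch hit for one evaluation unit (hypothetical container)."""
    probability: float     # HHsearch probability, assumed to be in percent (0-100)
    aligned_residues: int  # number of query residues covered by the alignment
    query_length: int      # total number of residues in the evaluation unit


@dataclass(frozen=True)
class SimilarityScores:
    """Pair of similarity scores assigned to one evaluation unit."""
    sequence_score: float   # HHscore
    structure_score: float  # LGA_S of the highest-ranked structural match


def hhscore(hit: TemplateHit) -> float:
    """Return HHsearch probability multiplied by query alignment coverage."""
    if hit.query_length <= 0:
        raise ValueError("query_length must be positive")
    coverage = hit.aligned_residues / hit.query_length
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("aligned_residues must lie between 0 and query_length")
    return hit.probability * coverage


def score_eu(hit: TemplateHit, lga_s: float) -> SimilarityScores:
    """Combine the sequence-based score with a precomputed LGA_S value."""
    return SimilarityScores(sequence_score=hhscore(hit), structure_score=lga_s)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from any CASP target.
    example = TemplateHit(probability=98.0, aligned_residues=150, query_length=200)
    print(score_eu(example, lga_s=62.4))
```

Under these assumptions, a template found with high probability that covers only part of the query receives a proportionally lower sequence-based score than one covering the full evaluation unit.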