The Clinical Utility and Diagnostic Yield of RNA

While analytical validity refers to the sensitivity, specificity, and accuracy of a diagnostic test in terms of its ability to measure a biomarker in a lab setting, clinical validity refers to the accuracy and predictive value of that test when it comes to predicting clinical diagnosis. Both terms are distinct from clinical utility, which refers to a test’s ability to make a difference, that is, its potential to impact patient quality of care/life by guiding clinical decision-making (Byron et al., 2016). Here we will outline the clinical utility of transcriptome analysis for the diagnosis of both neoplastic and non-neoplastic diseases.

Mendelian Disorders

Transcriptome analysis is a boon to the diagnosis of rare Mendelian diseases. Historically, genetic counseling has relied upon whole exome sequencing to identify causative disease variants; however, this DNA-only approach has left up to 75% of patients without genetic diagnoses (Stenton & Prokisch, 2020). When integrated with genome sequencing, and especially when that sequencing encompasses both exons and introns, gene expression profiling has been shown to significantly boost molecular diagnostic rates, with yields increasing by 10-35% (Lee et al., 2020; Maddirevula et al., 2020; Stenton & Prokisch, 2020). This is because RNA data both (1) puts any variants identified in DNA into context by revealing their transcript-level consequences (e.g., allele-specific expression due to nonsense-mediated decay, imprinting, and/or expression of splice variants), and (2) illuminates phenomena (such as gene expression outliers) that may not pass the threshold of detection in DNA data alone but that are crucial to the pathogenesis of a given disease (Lee et al., 2020).
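To make the idea of allele-specific expression concrete, the following is a minimal sketch, not a method taken from any of the cited studies. It assumes that RNA-seq read counts supporting the reference and alternate alleles at a heterozygous site are already available, and it tests them against the 50/50 ratio expected under balanced expression using an exact binomial test. The function names, the example counts, and the significance threshold are all hypothetical.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    if n == 0:
        return 1.0
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so that symmetric outcomes are treated as equally likely.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


def allelic_imbalance(ref_reads: int, alt_reads: int, alpha: float = 0.05):
    """Return (alt allele fraction, p-value, flagged) for one heterozygous site.

    Assumption: under balanced expression each read is equally likely to
    carry either allele (p = 0.5).
    """
    n = ref_reads + alt_reads
    p_value = binom_two_sided_p(alt_reads, n)
    alt_fraction = alt_reads / n if n else float("nan")
    return alt_fraction, p_value, p_value < alpha


# Hypothetical counts: a balanced site and a site where the alternate allele
# is depleted, as might be seen with nonsense-mediated decay.
balanced = allelic_imbalance(50, 50)
skewed = allelic_imbalance(90, 10)
print(balanced, skewed)
```

In practice, reference-mapping bias and overdispersion of read counts usually have to be accounted for, so a plain binomial test like this one should be read as an illustration of the principle only.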
Gene expression profiling has also improved clinicians’ ability to diagnose, stratify, and subtype autoimmune diseases, like systemic lupus erythematosus, as well as degenerative diseases like age-related macular degeneration (Alarcón-Riquelme, 2019). Additionally, consideration of the transcriptomic landscape has shed light on the fact that many of these diseases are heterogeneous, with a spectrum of causative molecular events (Morello et al., 2019). RNA-seq is also capable of overcoming the “bottleneck of variant interpretation” in patients with inborn errors of metabolism, mitochondriopathies, and/or unsolved muscle disorders, leading to significantly increased diagnostic yields (Kremer et al., 2018; Thompson et al., 2020).
It is important to note that recent studies have shown, particularly in the case of monogenic neuromuscular disorders, that blood-based RNA-seq is not sufficient for diagnosis; however, RNA-seq performed on myotubes generated by trans-differentiation of patient fibroblasts was capable of identifying a molecular culprit (predominantly splicing variants) in 36% of patients for whom DNA-only analysis had failed to do so (Gonorazky et al., 2019). This highlights the fact that several methodological improvements must be made to hasten the progress of translating transcriptome analysis from the benchtop to the bedside, and to enhance diagnostic sensitivity. These include refinement of ex vivo trans-differentiation of accessible cells to more disease-relevant cell types (Lee et al., 2020).

Hereditary Cancer

Cancer genomic analysis involves the identification of inherited (“germline”) risk variants and acquired (“somatic”) mutations in DNA and RNA (Koeppel et al., 2018). Transcriptome analysis has been shown to be capable of identifying rare, causative variants by revealing changes in splicing and gene expression that were undetected by DNA sequencing (Yuan et al., 2020). Because examples of RNA-seq analysis in conjunction with cancer risk prediction are more recent, we will dissect these papers in greater detail. Before we do, two key distinctions should be made regarding hereditary cancer studies. First, there is a general bias towards using RNA-sequencing in conjunction with panel-based clinical sequencing, which reduces the genomic search space considerably. If one focuses on all oncogenes or tumor-suppressor genes, the prevalence of background events changes, and ultimately so does the precision and/or diagnostic yield. Second, the context of reporting is distinct from that of Mendelian studies, which typically take the form of a search or diagnostic odyssey. Hereditary cancer VUS typically create unique stress, and there is some implied interpretation of negative findings.
A series of papers from 2019 through 2021 illustrate and give further insight into these distinctions. First, Conner et al. (2019) found that, by supplementing DNA genetic testing with RNA, heterozygous duplication events in MSH2, which were previously classified as VUS in five individuals with Lynch Syndrome, could be reclassified as pathogenic or likely pathogenic (Conner et al., 2019). Similarly, Karam et al. (2019) showed that, by supplementing DNA with RNA genetic testing in cases suspicious for hereditary cancer in which the variant in question involved a potential splice site alteration, (1) inconclusive DNA-based results were resolved in 49 of 56 inconclusive cases (88%) studied, with 26 (47%) being reclassified as clinically-actionable and 23 (41%) being clarified as benign; and (2) an estimated 2% of patients receiving paired DNA/RNA testing would benefit from the addition of RNA through further characterization of splice-site VUS (Karam et al., 2019). Two other studies found that the addition of transcriptomic analysis to hereditary cancer testing enabled 60% and 20%, respectively, of splicing VUS to be reclassified as (likely) pathogenic (Agiannitopoulos et al., 2021; Rofes et al., 2020). Landrith et al. (2020) performed germline RNA-seq to profile 18 genes (i.e., APC, ATM, BRCA1, BRCA2, BRIP1, CDH1, CHEK2, MLH1, MSH2, MSH6, MUTYH, NF1, PALB2, PMS2, PTEN, RAD51C, RAD51D, and TP53) in patients with suspected hereditary cancer syndromes. The investigators demonstrated a 9.1% relative increase in the detection of pathogenic variants afforded by augmenting DNA data with RNA analysis (Landrith et al., 2020). Deep intronic variants have also been identified in BRCA1/2, by virtue of RNA analysis, in patients with familial breast and ovarian cancers (Anczuków et al., 2012; Montalban et al., 2019).
As is evident from the studies mentioned above, deep intronic mutations and splicing aberrations are unique mechanisms of carcinogenesis which, based upon DNA data alone, are still often classified as VUS (Urbanski et al., 2018). Splicing mutations, which can be present in both pre-mRNA exons and introns (the latter of which have historically been harder to detect using traditional DNA analyses), lead to abnormal mRNA phenomena (e.g., exon skipping, intron inclusion, cryptic splice site activation) and the production of abnormal proteins with diagnostic value (Shi et al., 2018). Expression changes in splicing regulators can be used as biomarkers for cancer diagnosis (e.g., hnRNPA2/B1, an RNA-binding protein involved in mRNA splicing, is a sensitive and specific early-diagnostic marker of lung neoplasms) (Zhang et al., 2021). RNA-seq has shown utility in diagnosing germline splicing variants in hereditary cancer genes that were not evident in DNA analysis (Urbanski et al., 2018). While splicing variants make up 11% of hereditary cancer gene VUS, they make up 55% of those VUS that are “likely pathogenic” (Parsons et al., 2019).
Larger-scale reports have been published by clinical genetic companies where RNA-seq was used in conjunction with panel-based studies across thousands of individuals. Ambry recently released a series of “RNA Case Studies” that demonstrate the clinical diagnostic utility of transcriptomic data, particularly for identifying intronic variants (AmbryGenetics, 2019). One such scenario was the case of a 33-year-old male, with a personal and family history of colon polyps, for whom no clinically-significant variants could be detected via DNA-only analysis. When genetic analysis was supplemented with transcriptomic analysis (i.e., Ambry’s +RNAinsight® panel), however, abnormal APC transcripts were detected, prompting further investigation via targeted Sanger DNA sequencing. This resulted in the confirmation of a deep intronic, likely pathogenic variant. Transcriptomic data enabled the patient’s provider to make a genetic diagnosis of familial adenomatous polyposis (AmbryGenetics, 2019). Other examples include a likely pathogenic intronic variant that was identified outside of DNA analytical range in the gene ATM (c.497-2661A>G), and exon skipping variants in MSH6 leading to Lynch Syndrome. Ambry’s +RNAinsight® panel, mentioned in the 2 cases above, analyzes 91 cancer driver genes and can be paired with most DNA panels; it has been shown to be capable of reclassifying >70% of VUS (AmbryGenetics, 2021).
Similarly, a recent study by Invitae aimed to exemplify the utility of RNA analysis for reclassifying splicing VUS (Truty et al., 2021). The investigators analyzed a large sample consisting of nearly 700,000 patients from a clinical cohort plus individuals from two large public datasets (i.e., ClinVar and the Genome Aggregation Database/gnomAD) (Truty et al., 2021). In their clinical cohort, Invitae found that 5.4% of individuals had at least one splicing VUS (most of which were identified outside of essential splice sites), and that splicing variants represented 13% of all variants classified as (likely) pathogenic or VUS. They estimated that, in the clinical cohort, RNA analysis would be capable of clarifying/reclassifying splicing VUS in 1.7% of cases. In comparison to the clinical cohort, in ClinVar and gnomAD, Invitae observed that splicing VUS comprised nearly 5% and 9% of reported variants, respectively. Invitae concluded that, in all 3 cohorts, individuals would have a tangible, clinical-diagnostic benefit from RNA testing (Truty et al., 2021).
Not only can transcriptome characterization classify VUS as (likely) pathogenic, but it can also clarify variants as benign. For example, RNA data supported a variant downgrade of a likely pathogenic splice site variant at a canonical splice site (Shamseldin et al., 2021). In the case of CDH1 c.387+1G>A, various clinical laboratories initially reported the variant in multiple Hispanic/Latino patients as “likely pathogenic” on the basis of the “+1” position of the variant. This led to the diagnosis of hereditary diffuse gastric cancer syndrome, a condition requiring complex management because of its association with a very high risk of early onset gastric cancer and lobular breast cancer. However, the variant was studied in more detail because the patients with this variant lacked the associated phenotype of the condition. The variant was experimentally demonstrated to result in the activation of a cryptic in-frame donor splice site, leading to the recommendation by ACMG and AMP that variants at this position not be considered as likely pathogenic (Maoz et al., 2016).
In large part, we have limited this review to germline-inherited variation due to space and scope. However, RNA-sequencing clearly has utility in the context of somatic variation and, in fact, can be the basis of treatment decisions. It is worth highlighting that a 2021 study in Oncogene examined somatic variation across over 1,000 pan-cancer, paired whole genomes and transcriptomes to understand the role of splicing mutations in tumorigenesis. The investigators identified about 700 somatic intronic mutations; nearly half were within deep intronic regions and, of those, 38% activated cryptic splice sites. A subset of the deep intronic mutations resulted in alterations to splicing enhancers or silencers. They found that intronic mutations often affected tumor suppressor genes, and that hematological malignancies, in particular, harbor many deep intronic mutations. Taken as a whole, this paper suggests considerable insights can be gained well beyond germline analysis of VUS (Jung et al., 2021).

Limitations & Future Directions

The progress of RNA-based diagnostics is encouraging, especially as new and translational gene expression profiling techniques emerge (Wang et al., 2020). Gene expression profiling allows not only for the identification of fusion transcripts, but also for the detection of phenomena like differential expression, allele-specific expression (ASE), alternative splicing, and the presence of non-coding RNAs (Conner et al., 2019). Both targeted RNA microarrays and RNA-seq have shown analytical validity when it comes to diagnostics for pediatric, adolescent/young adult, and adult patients (Vaske et al., 2019).

Conflicting Lines of Evidence

One fallacy of reasoning, commonly and erroneously applied to the analysis of variant lists such as variant call format (VCF) files, is the assumption that the absence of a transcript variant from the list means that the variant is absent from the specimen. This common misconception led to the development of genomic VCFs (gVCFs), which record a call at every position, both variant and wild type/reference.
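The distinction can be illustrated with a small sketch. It does not parse real VCF or gVCF files; instead it assumes a simplified, hypothetical in-memory representation in which each record covers an interval and is marked as either a variant call or a reference block. The point it shows is that a variant-only list cannot distinguish "confidently reference" from "not assessed" (for example, because the gene is not expressed in the sampled tissue), whereas a representation with reference blocks can.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical simplified call record (not the real VCF data model)."""
    chrom: str
    start: int        # 1-based, inclusive
    end: int          # inclusive
    is_variant: bool  # False means a confidently called reference block


def status_at(records: List[Record], chrom: str, pos: int) -> str:
    """Classify a position as 'variant', 'reference', or 'no_call'."""
    for r in records:
        if r.chrom == chrom and r.start <= pos <= r.end:
            return "variant" if r.is_variant else "reference"
    # No record covers the position: nothing can be concluded about it.
    return "no_call"


# Variant-only list: positions without a variant are simply not mentioned.
vcf_like = [Record("chr5", 1200, 1200, True)]

# gVCF-like list: reference blocks record where the sample was assessed.
gvcf_like = [
    Record("chr5", 1000, 1199, False),
    Record("chr5", 1200, 1200, True),
    Record("chr5", 1201, 1500, False),
]

print(status_at(vcf_like, "chr5", 1300))   # not assessable from this list
print(status_at(gvcf_like, "chr5", 1300))  # covered by a reference block
```

With the variant-only list, a query at an unlisted position can only be answered with "no call"; treating that answer as "reference" is exactly the fallacy described above.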
The only way to move forward with statistical power and confidence is through collaborative efforts and the creation of diverse and devoted databases. ClinVar (Rehm et al., 2017) and gnomAD (Karczewski et al., 2020) are under-appreciated summary-level datasets. gnomAD’s focus on categorizing rare events was foundational. At the RNA level, this approach has not yet been adopted outside of isolated cases; burgeoning examples are RNAcentral (a database of non-coding RNAs) (Petrov et al., 2015) and SpliceDB (a database of canonical and non-canonical mammalian splice sites) (Burset et al., 2001).
With the clinical implementation of any new “translational” technology, one must approach variant curation and interpretation of functional evidence with caution. Interpretation can be more complex than anticipated; there are many potential pitfalls. For example, Nix et al. once posited that a partial exon-skipping mutation identified in BRCA2 was pathogenic; it was later found to occur in many healthy controls (Mundt et al., 2017).

Differences in RNA-seq Library Preparation & Analysis Methods

Unlike in genomic sequencing of DNA, differences in collection methods, library preparation, tissue sources, etc. massively impact RNA-seq analysis and interpretation. The first and most apparent variable is the tissue source for RNA and its relevance to the disease or phenotype. For example, how well can RNA from whole blood provide insights into neurological disorders? GTEx provides an initial framework to evaluate this question, showing typically >40% of genes expressed at reasonably high levels, and the experiences reviewed in previous sections frequently faced a similar question (Consortium, 2013). Customized assays leveraging enrichment will likely increase the dynamic range of detectable RNA species, although many genes will still not have the expression needed for interpretation via RNA-seq. Nonetheless, many of the studies highlighted showed >10% improvement in diagnostic yield despite such challenges.
Without question, the ability to examine rare DNA variation across thousands of individuals, such as through resources like gnomAD, has profoundly influenced the interpretation of genomic variants. Aggregation of RNA data, even within the same lab, will face significant and un-ignorable challenges. As has been experienced by consortiums and labs, aggregation of RNA-seq data across samples, studies, and library preps typically reveals multiple technical variables driving the largest proportion of variation. Efforts to normalize or adjust for these technical differences are an active area of research beyond the scope of this review.
Even still, when examining consortiums such as PsychENCODE (Psych et al., 2015) and AMP-AD (Hodes & Buckholtz, 2016), among others, eliminating technical variation from RNA-seq experiments is challenging, particularly if one is interested in rare events. To illustrate this point, we consider the recent release of 4,871 longitudinally-collected samples from 1,570 clinically-phenotyped individuals from the Parkinson’s Progression Markers Initiative (PPMI), in which RNA-seq was conducted using random priming on PAXgene-collected whole blood with paired whole-genome sequencing (Craig et al., 2021). Forthcoming efforts from TOPMed will utilize the same PAXgene whole-blood protocols but will differ in using mRNA-seq from poly-A priming. These two methods capture different RNA species, with random priming additionally capturing pre-spliced RNA and non-polyA-tailed transcripts. Algorithms trained on these methods will fundamentally differ in their core measures, such as percent spliced in (PSI). Even within the same dataset, we have observed significant differences in gene/exon usage that depended on read length (paired 100bp reads vs. a 125bp subset).
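For readers unfamiliar with PSI, the following minimal sketch shows the basic calculation for a single cassette exon. It assumes that the numbers of junction-spanning reads supporting exon inclusion and exon exclusion have already been counted; the function name, the minimum-read cutoff, and the example counts are hypothetical. Some tools additionally normalize inclusion reads for the fact that inclusion is supported by two junctions and exclusion by one, which this sketch does not do.

```python
from typing import Optional


def percent_spliced_in(
    inclusion_reads: int, exclusion_reads: int, min_reads: int = 10
) -> Optional[float]:
    """Fraction of informative reads that support inclusion of the exon.

    Returns None when too few reads are available to estimate PSI.
    The min_reads cutoff is an arbitrary illustrative choice.
    """
    total = inclusion_reads + exclusion_reads
    if total < min_reads:
        return None
    return inclusion_reads / total


# Hypothetical counts for the same exon in a control and in a patient.
control_psi = percent_spliced_in(95, 5)    # exon almost always included
patient_psi = percent_spliced_in(50, 50)   # consistent with partial skipping
low_coverage = percent_spliced_in(3, 2)    # too few reads to call

print(control_psi, patient_psi, low_coverage)
```

Because the read counts entering this ratio depend on which RNA species a library preparation captures, PSI values estimated from randomly primed and poly-A-primed libraries are not directly interchangeable.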
While the task is daunting, solutions for aggregating RNA-seq data are emerging, such as the ARCHS4 aggregation across mouse and human RNA-seq studies (Lachmann et al., 2018). Other examples include in-house solutions or those specific to a given group; it becomes a question of sensitivity. Our group successfully employed outlier analysis to identify causative variants in a cohort collected over 5 years that was sequenced by different labs using different methods.
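The outlier method used is not specified here, so the following is only an illustrative sketch of the general idea rather than a description of that analysis. It assumes log-scale expression values for one gene across a cohort and flags samples whose value lies far from the mean of all other samples, using a leave-one-out z-score so that an extreme sample does not inflate its own reference distribution. The sample names, values, and threshold are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict


def leave_one_out_outliers(
    expr: Dict[str, float], threshold: float = 3.0
) -> Dict[str, float]:
    """Return {sample: z-score} for samples whose expression deviates from
    the mean of the remaining samples by at least `threshold` SDs."""
    flagged = {}
    for sample, value in expr.items():
        others = [v for s, v in expr.items() if s != sample]
        if len(others) < 2:
            continue  # need at least two other samples to estimate an SD
        sd = stdev(others)
        if sd == 0:
            continue  # no variation among the other samples; z is undefined
        z = (value - mean(others)) / sd
        if abs(z) >= threshold:
            flagged[sample] = z
    return flagged


# Hypothetical log-scale expression of one gene; P1 shows reduced expression.
gene_expr = {"S1": 5.1, "S2": 4.9, "S3": 5.0, "S4": 5.2, "S5": 4.8, "P1": 2.0}
flagged = leave_one_out_outliers(gene_expr)
print(flagged)
```

A sketch like this ignores the batch and library-preparation effects discussed above; in a cohort sequenced by different labs, those technical covariates would need to be modeled or removed before outliers are called.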

Fragmentation of RNA-seq Databases and Standards

Though the RNA-based diagnostics described here have potential, there are still obstacles that must be overcome before they will be incorporated into routine clinical practice. These challenges include the need for scientific rigor, reproducibility, accuracy, precision, clinical validity, and clinical utility. Standards must be created for test thresholds and normalized reporting, and databases must be established (Tahiliani et al., 2020; Wang et al., 2020). These databases must be designed so as not to fall prey to any logical fallacies (e.g., the “marker-positive fallacy”).
Issues of database size, diversity, and representation (both in the sense of race/ethnicity and cases/controls), population structure, and cryptic relatedness must be considered (Update., 1996). We must also acknowledge, and attempt to address, limitations (e.g., the half-life/stability of RNA) and potential confounders (e.g., temporal changes in RNA expression, differences in RNA capture from fresh frozen vs. formalin-fixed paraffin-embedded samples, and phenomena like clonal hematopoiesis of indeterminate potential in liquid biopsies) (Wang et al., 2020).
Investigators must carefully consider the tissue from which they are isolating RNA, given the fact that expression patterns differ across tissues (and, at the circadian level, RNA expression can even differ in the same tissue at different time points) (Maddirevula et al., 2020). It is important to balance preference for minimally-invasive techniques with considerations of differential tissue expression. One recent study found that, when comparing brain vs. blood vs. human B-lymphoblastoid cell lines (LCLs), LCLs possessed isoform diversity for neurodevelopmental genes similar to that of brain tissue; LCLs also expressed these genes more highly compared to blood (Rentas et al., 2020). The authors of this paper described an RNA-seq pipeline with 90% sensitivity and claimed that findings in LCLs outperformed those in blood and had implications for the molecular diagnosis of >1000 genetic syndromes (Rentas et al., 2020).
Another limitation is the fact that expression quantitative trait loci (eQTL) databases, like the GTEx Portal, are limited to common variants (i.e., variants with a minor allele frequency >1%). This means that such datasets are not applicable toward understanding VUS which, although rare in the general/overall population, disproportionately impact non-White, non-European groups. RNA analysis is also limited by the fact that most tools utilize transcripts defined by a Gene Transfer Format (GTF) file and find it difficult to annotate the 3′ untranslated region (3′ UTR) (Shenker et al., 2015). Therefore, there is a critical need for more rigorous, reproducible, and representative RNA databases and tools.

VUS as a Manifestation of Cancer Disparities

One anecdotal trend that we have noticed within our own group and across collaborative efforts is that RNA data allows for the identification of previously missed variation, particularly in individuals of non-European ancestry. For example, in Human Mutation we reported a variant within 3bp of the exon boundary using an outlier approach in individuals of African ancestry. The molecular consequences of this variant included exon skipping, altered isoform usage, and loss of canonical isoform expression, events not evident in DNA data alone (McCullough et al., 2020). Patients who self-identify as Hispanic/Latinx, Black/African, and Asian/Pacific Islander experience more advanced stage disease at time of screening, significantly lower diagnostic yields, and higher rates of VUS and variant reclassification compared to their European/Caucasian counterparts (Dutil et al., 2019; Kinney et al., 2018; Kowalski et al., 2019; Marco-Puche et al., 2019; Ndugga-Kabuye & Issaka, 2019; Roberts et al., 2020; Slavin et al., 2018; Urbina-Jara et al., 2019). Individuals from non-European populations will have more private variation for one of three reasons: (1) they are poorly represented in reference datasets, (2) they have greater African ancestry, or (3) they come from a population that has undergone recent expansions (e.g., Bangladesh) (Halperin et al., 2017).
A recent study reported by Ambry Genetics found that their BRCAplus, BreastNext, and CancerNext panels yielded ≈2-3x fewer VUS for non-Hispanic whites than for minority populations (AmbryGenetics, 2017). Another study reports VUS frequencies in the tumor suppressor genes BRCA1/2 to be 4.4% in Caucasians, 8.9% in African Americans, and 8.0% in Hispanics/Latinos; for larger hereditary cancer panels, this study reported VUS frequencies of 22.1% in Caucasians, 30.3% in African Americans, and 24.9% in Hispanics/Latinos (Appelbaum et al., 2020).
One important distinction to make here is the difference between race/ethnicity and genetic ancestry. While race and ethnicity are social constructs, ancestry is a biological/genetic construct that reflects the biogeographical genetic variation produced by human migrations throughout history (Batai et al., 2021). An example of how genetic ancestry can further clarify race/ethnicity-based disparities is the fact that higher African ancestry in Hispanics/Latinos (who are typically “admixed” with genetic contributions from African, European, and American Indian, aka Native/Indigenous American, ancestries) is associated with more aggressive breast cancer subtypes and a greater likelihood of receiving inconclusive VUS during genetic testing (Chapman-Davis et al., 2021; Dutil et al., 2019; Kinney et al., 2018; Kowalski et al., 2019; Marco-Puche et al., 2019; Ndugga-Kabuye & Issaka, 2019; Roberts et al., 2020; Slavin et al., 2018; Urbina-Jara et al., 2019; Virlogeux et al., 2015). Gene expression profiling may be able to help shed light on and alleviate these inequities (Frésard et al., 2019; Wai et al., 2020).

Conclusions

VUS cause significant psychological distress to patients and disproportionately limit the promise of precision medicine for minority patients (Landry et al., 2018). RNA data provides critical answers to the question of VUS, particularly in terms of clarifying deep intronic and splicing variants as pathogenic vs. benign. This necessitates the development of more rigorous, reproducible, and representative RNA databases and analytical tools.