Inferring the main drivers of SARS-CoV-2 transmissibility

,


Introduction
Despite the unprecedented worldwide campaign of mass immunization, due to the relatively slow vaccine rollout and to the appearance of new, more contagious (Tegally et al., 2020), and maybe even more deadly SARS-CoV-2 strains (Mallapaty, 2021), COVID-19 still takes its toll on human lives, stifles the world economy, and forces the majority of countries to keep unpopular lockdowns.In the absence of a prompt solution to the first pandemic of the century, the goal to identify the main environmental and demographic parameters that influence the dynamics of infection transmission remains as important as ever.
We recently published a comprehensive study of the correlation of 42 different demographic and weather parameters with COVID-19 basic reproduction number R0 across 118 world countries (Salom et al., 2021).R0 is a well-established epidemiological measure of virus transmissibility, which has a major advantage of being independent on the testing policy/capacity, and on the intervention measures that can be highly variable (and almost impossible to consistently control) between different countries (Salom et al., 2021).In (Salom et al., 2021), we selected all the countries that exhibited regular exponential growth in the case numbers before the introduction of intervention measures (Djordjevic et al., 2021), from which their R0 values can be reliably extracted.Tracking a wide range of countries allows achieving a maximal variability in the dataset, i.e., a maximal possible range in the values of analyzed variables, as another advantage of this study.This generated dataset will be used as a starting point in this work.
While (Salom et al., 2021) covered a broad scope of variables and countries, it focused on establishing pairwise correlations between R0 and each of the studied factors, ignoring the fact that many of these variables are highly mutually correlated.This is most obvious in the case of the weather parameters such as e.g.temperature and UV radiation (which both reflect the local climate in a similar way and follow comparable seasonal trends), but also in the case of many demographic parameters, e.g. the strong positive correlation between the Human Development Index (HDI) and cholesterol levels.Based on pairwise correlations alone, it is thus hard to estimate which of these variables might be truly influencing the spread of the disease, to what extent, and in which direction.To achieve this, the number of variables necessary to explain the virus transmissibility needs to be reduced to only a few without losing predictiveness.However, this is not the only challenge, because of variable redundancy.
In particular, one may select different combinations of variables accounting together for a similar proportion of variance in the virus transmissibility, which seems to be a dead-end (Notari and Torrieri, 2020).There is consequently a challenge to narrow down the possibilities and illuminate important contributions of the seemingly small differences between highly correlated variables.Noticeably, while numerous studies examined the correlations of several selected (Lin et al., 2020;Ran et al., 2020;Xie et al., 2020) or many different (Li et al., 2020;Hassan et al., 2021;Salom et al., 2021) sociodemographic and meteorological factors with the magnitude of the COVID-19 epidemic, only few studies tried to select a handful of key factors whose combination can explain a large portion of the variance between regions (Allel et al., 2020;Coccia, 2020;Gupta and Gharehgozli, 2020;Notari and Torrieri, 2020).Even a smaller number of studies included data from multiple countries (Allel et al., 2020;Notari and Torrieri, 2020).
The main idea of this study is to develop a novel approach to robustly identify the most important predictors of R0.The development of such an approach will i) provide a straightforward solution to the known problem of selecting important among the highly correlated variables, ii) enable a better understanding of which environmental and demographic variables may dominantly and/or independently influence the progression of the COVID-19 epidemics, and what is the direction of this influence.To achieve these goals, the study is organized as follows: 1.The variables are first naturally split into two groups.The first group comprises 6 meteorological parameters, sampled and averaged (for each country) during the initial stage of the local epidemic outbreak: air temperature (T), precipitation (PC), specific humidity (H), ultra-violet radiation index (UV), air pressure (P), and wind speed (WS).Eighteen (broadly-speaking) demographic parameters form the second group: human development index (HDI), percentage of the urban population (UP), gross domestic product per capita (GDP), amount of the built-up area per person (BUAPC), percentage of refugees (RE), net migration (i.e., the number of immigrants minus emigrants, I-E), infant mortality (IM), median age (MA), long-term average of PM2.5 pollution (PM), prevalence and severity of COVID-19 relevant chronic diseases in the population (CD), average blood cholesterol level (CH), the prevalence of raised blood pressure (RBP), the prevalence of obesity (OB), the prevalence of insufficient physical activity among adults (IN), BCG immunization coverage (BCG), alcohol consumption per capita (ALC), smoking prevalence (SM), and the delay of the epidemic onset (ON).2. Due to strong mutual correlations between parameters within each group (as well as across the groups, but at a lower extent), the principal component analysis (PCA) will be performed on each of the groups (Jolliffe, 2002).This step will allow us to notably reduce the dimensionality of the problem, i.e., proceed to work with a smaller number of (mostly) uncorrelated variables.Such dimensionality reduction will significantly simplify the further analysis and improve the reliability of the results.3. The linear regression analysis will next be performed in four independent ways, ranging from our custom-developed to more formal regression-based approaches, to select important variables.In our custom-developed approach, multiple linear regressions are applied, first separately to demographic and meteorological principal components (PCs), to narrow down the number of relevant PCs within each of the two groups, before doing overall linear regression with the remaining PCs to assess their importance in explaining R0.A major advantage of such analysis is in an intuitive understanding of the data structure and its relation to R0.This analysis is next independently redone by more formal feature selection methods, commonly employed in bioinformatics and systems biology: Stepwise regression and regressions utilizing both regularization and variable selection -LASSO (Least Absolute Selection and Shrinkage Operator) and Elastic net (Tibshirani, 1996;Zou and Hastie, 2005;Hastie et al., 2009).Such comprehensive analysis will ensure the consistency and robustness of the reported results.4. Finally, an intuitive interpretation of the obtained results will be presented.This will permit a much more specific understanding of COVID-19 transmissibility, by focusing on the main driving factors behind the disease spread in the population.

Data collection
Data for demographic and meteorological parameters were assembled as described in (Salom et al., 2021).Briefly, the data correspond to six meteorological and eighteen demographic variables outlined above.The differences between this dataset and the one used in (Salom et al., 2021) is the following: IMS (Social security and health insurance coverage), Prevalence of ABO and Rhesus blood groups, and Ambient levels of different pollutants (NO2, SO2, CO, PM2.5, PM10) are not used in this analysis, as they contain too many missing values.Instead of the pollutant levels measured from air pollution monitoring stations during the epidemic's exponential growth (available for only ~40 countries) we use the yearly average PM2.5 pollutant levels in 2017 (World Bank, 2020b).Also, we consider GDP per capita (GDPpc), taken from (World Bank, 2020a) as a more direct (average) indicator of a country's economic wealth/productivity.
Basic reproduction number (R0), i.e., a measure of SARS-CoV-2 transmissibility in a fully susceptible population and in the absence of intervention measures (social distancing, quarantine), was also taken from (Salom et al., 2021), where it was inferred from non-linear dynamics modeling.Overall, demographic data, meteorological data, and basic reproductive numbers were assembled for 118 different countries from which we could reliably infer R0.Missing values in the demographic data (which were sparse for the used variables) were substituted by median values of the respective variables; there are no missing values in the meteorological data.

Data preparation
Several variables, particularly among demographic data, show a significant deviation from normality when visually inspected.Such deviations generate large outliers and would significantly impact the necessary normality of the model error residuals.We consequently transform the data where necessary, to make the resulting distributions closer to normal, by using standard transformations that reduce the right and left skewness.The strength of the applied transformations (e.g., square root, cubic root, or log) is chosen so that skewness of the transformed distribution is as close to zero as possible.The table with all applied transformations is provided below: After transformations, the remaining (now sparse) outliers were removed by substituting them with the median of each variable; the outliers were identified as having more than three scaled median absolute deviation (MAD) from the (transformed) variable median.Each transformed variable whose direction was changed by the transformation was taken with a minus sign, so that the original and the transformed variable are oriented in the same direction, allowing for easier result interpretation.

Principal components analysis
The dimensionality of the transformed data was reduced and the data decorrelated through PCA (Jolliffe, 2002).PCA was done separately for demographic and meteorological variables to allow for a more straightforward interpretation of the obtained PCs.Since different variables are expressed in different units and correspond to diverse scales, each variable in the dataset was standardized (the mean subtracted and divided by the standard deviation) before PCA.For both datasets, we retained as many PCs (starting from the most dominant one) as needed to (cumulatively) explain >85% of the data variance.It was inspected that PCs reasonably follow a normal distribution (as expected, based on the transformation of the original variables).Few remaining outliers for PCs were then substituted by medians.For easier interpretation of PCs and their contribution to R0, each PC was oriented in the same direction as the variable with which it has a maximal magnitude of Pearson correlation (i.e., the sign of the PC was flipped when needed, to render the positive sign of this correlation).

Custom regression analysis
Multiple linear regression (PC regression) was done first with only demographic PCs (Hastie et al., 2009).Only linear terms were included in the regression to allow straightforward interpretation, i.e., selection of PCs that significantly affect R0.Significant PCs were selected as those appearing in the regression with P<0.05, where the significance in the regression was estimated in the standard way (through F-statistics) (Alexopoulos, 2010).The same, regression was then repeated with only meteorological PCs, and those significant in explaining R0 were retained.Finally, multiple linear regression was performed with all retained demographic and meteorological PCs.The significant PCs from this last step were recognized as PCs relevant for R0 explanation.Before regression, each PC was standardized so that coefficients obtained in the regression provided a measure of the variable importance in explaining R0.For both the custom analysis and stepwise regression, OLS (Ordinary Least Squares) were used as the regression metrics.

Stepwise regression
Stepwise regression was used to select PCs that significantly affect R0.In Stepwise regression, as well as in LASSO and Elastic net described below, all PCs (demographic and meteorological) were included in the regression.Briefly, starting from a constant model, at each step a term is added to the model if its significance (calculated with F-statistics) meets the condition P<0.05 (Pope and Webster, 1972).
Only linear terms are added to the model (i.e., interaction and quadratic terms are not considered) to allow for straightforward interpretation which PCs significantly affect R0.All PCs are standardized before regression so that contributions of the terms (PCs) in the model can be assessed by the magnitude of the regression coefficient.

LASSO regression
L1 regularization was implemented through LASSO (Least Absolute Shrinkage and Selection Operator) (Tibshirani, 1996;Hastie et al., 2009).As needed with the LASSO regularization, all PCs were standardized before regression, which also allowed direct comparison of the coefficients obtained by the regression.The value  in LASSO was treated as the hyperparameter, i.e.,  min value was determined through cross-validation, so that MSE (Mean Squared Error) on the testing set was minimal.A total of 100  values were put on the grid, corresponding to the geometric sequence, where the largest value produces all zero terms.Note that larger  corresponds to sparser model, i.e., a smaller number of non-zero components in the regression, while the small  limit corresponds to OLS regression.To obtain the maximally sparse model,  1SE =  min + 1SE, where 1SE corresponds to the standard error of MSE obtained by cross-validation, was used.1000 cross-validations were performed, where in each repetition 20% of the data were randomly selected for the testing set, with the remainder used for training.All non-zero terms and the corresponding coefficients obtained through LASSO were reported.

Elastic net regression
A combination of L1 and L2 regularization was implemented through Elastic net regression (Zou and Hastie, 2005).Analogously to our LASSO analysis, i.e., as needed due to regularization, all PCs were standardized.In the regression, both  and  were treated as hyperparameters, i.e., their optimal values were found by cross-validation.Cross-validation was repeated 1000 times, wherein each repetition testing and training sets were formed in the same way as for LASSO. and  values were put on a grid consisting of 100  and 100  values. values on the grid were chosen uniformly in the range [0,1] - approaching zero corresponds to Ridge (L2) regression, and 1 corresponds to LASSO regression.
For each  value,  values were chosen as described for the LASSO regression.For each repetition of cross-validation,  and  combination which leads to the minimal MSE was chosen. and  values in (, ) pairs from each cross-validation run were then standardized so that  and  values are on the same scale and centered to the origin of the  −  plane.( min ,  min ) was then chosen as the (, ) point closest to the origin.With this ( min ,  min ) value the model was then retrained on the entire dataset.Similarly to LASSO, all non-zero terms and the corresponding regression coefficients were reported.

Dimensionality reduction of the demographic dataset
PCA was first applied to the dataset consisting of 18 demographic and health factors for 118 countries.
Cumulative data variance that is explained jointly by the first n PCs is shown in Figure 1A (with n represented on the x-axis).In particular, Figure 1A shows the first PC alone already accounts for 45% of the variance, while the first 9 PCs (PC1 -PC9), which we retain in further analysis, explain more than 85% (precisely, 89%).
To obtain a basic interpretation of these nine PCs, we related each PC with the original (transformed) variable it is most correlated with.The corresponding associationswith the values of correlations coefficients presented on the y-axisare shown in Figure 1B (however, one should have in mind that some PCs are highly correlated with more than one original variable, as we discuss in more detail below).Among all principal components, the PC1 and the PC5 have the highest correlation coefficients (close to 1) with individual demographic factorsthe HDI and the BCG immunization coverage, respectively.Moderately high correlation coefficients (~0.75) characterize the relations between the PC2 and the prevalence of smokers, and the PC3 and the percentage of refugees, while the coefficient values of ~0.5 were obtained for the correlations of the PC4, the PC6, the PC7, the PC8 and the PC9 with, respectively, the prevalence of obesity, the prevalence of insufficient physical activity, the amount of the built-up area per person, the percentage of refugees, and the epidemic onset.
In particular, the first PC, accounting alone for the largest portion of the variance in the demographic data, is almost perfectly correlated with the Human Development Index (Fig. 1C).On the other hand, the HDI variable itself strongly correlates with several other demographic variables (Fig. 1D), most prominently with per capita GDP, infant mortality, and cholesterol levels.As elaborated in the Discussion section, such extremely high correlations will eventually preclude us from differentiating between the separate effects of each of these variables on R0.On the other hand, the prevalence of obesity, the built-up area per person, and the epidemic onset are significantly correlated with the HDI (Fig. 1D), and thereby the PC1 (Fig. 1C), but they are markedly featured also in separate principal components (Fig. 1B), namelythe PCs 4, 7 and 9.This will help us to infer whether their specific, additional contributions to the variance in the data (apart from that along the PC1) impact the virus transmissibility.

Dimensionality reduction of the meteorological dataset
The dimensionality of the dataset consisting of 6 meteorological factors for 118 countries was reduced similarly as for the demographic dataset.PCA generated 6 uncorrelated, orthogonal principal components.Thereby, the first PC alone explains 62% of the variance, while the first three PCs (PC1-PC3) capture 95%, which is significantly above the targeted 85% of the total variance (Fig. 2A).
Pairwise correlations showed that the retained three PCs have the highest correlations with the temperature, the wind speed, and the air pressure, respectively (Fig. 2B), where the correlation of PC1 with the temperature is close to 1 (Figs.2B and 2C).There are also notable correlations of the temperature with humidity, the levels of UV radiation, and precipitation (Fig. 2D).PC1, therefore, presents seasonality, i.e. a set of mutually correlated meteorological variables which can be related to yearly weather changes.Consequently, PCA effectively separated the impacts of seasonality (PC1), the wind speed (through the PC2), and the air pressure (through the PC3).The variables determining the PC1 are also correlated with the HDI.These inter-dataset correlations are not resolved at this level by our PCA and represent the trade-off that allows interpreting the PCs more easily within each of the two smaller, thematic groups of factors.

Linear regressions
After PCA, we applied the linear regression analysis using four different methods, as explained in Methods.The first, "custom" method included the additional step of "preselecting", i.e. further narrowing down the number of PCs that will enter the final regression analysis.The multiple linear regression, applied on the group of 9 demographic principal components, selected 1 st , 4 th , 7 th and 9 th component as the most relevant predictors of R0 (the remaining 5 components appeared in the linear regression with p values above 0.05 threshold, and were consequently excluded from the further analysis).Analogously, the "preselection" of meteorological principal components singled out the 1 st component as the only statistically relevant predictor of R0 from this group.The multiple linear regression was then applied on these 5 selected PCs (4 demographical and 1 meteorological) and yielded a regression model with the corresponding linear coefficients represented in Figure 3A.Meteo PC1 component does not appear in the results of the custom method, due to the lack of statistical significance (p>0.305) in the final regression, so that according to our custom regression methodology, weather parameters do not significantly influence R0.R0 in this model is therefore determined by a combination of demographic PC1, PC4, PC7, and PC9, where coefficients multiplying PC1 and PC4 are positive, while for PC7 and PC9 are negative.As can be inferred from the values represented in Fig 3A, the demographic PC1 has the most dominant influence on R0a robustly obtained result throughout all 4 methods (see below).We have already related each of these four PCs with the dominantly correlated variable (Figure 1B), but a more detailed interpretation of the results is obtained if all significant correlations (not just the dominant one) are taken into account.In addition to the very high correlation with HDI, demographic PC1 is also highly positively correlated with GDP, cholesterol levels, median age, and percentage of the urban population, while it is highly negatively correlated with infant mortality and the prevalence of chronic diseases (Figure 4A).Such strong correlations with HDI, GDP, IM, MA, and UP show that this component indeed expresses an overall, both social and financial, prosperity of the country (which seemingly also goes hand in hand with high average cholesterol levels and low prevalence of COVID-19 relevant chronic diseases).Similarly, by considering the correlations of demo PC4 with all demographic variables, we see that this component is significantly positively correlated not only with obesity but also with smoking, physical inactivity, and air pollution (Figure 4B)in other words, with major indicators of an unhealthy lifestyle and living conditions.Apart from its correlation with the BUAPC parameter, the component demo PC7 is also significantly positively correlated with net migration (Figure 4C).In the case of the demo PC9 component, its only significant correlation is with the onset variable.Results of the custom method can therefore be summarized as follows: the country's prosperity, as well as unhealthy living conditions and lifestyle, tend to increase the value of R0, while the larger built-up area per person and the later epidemic outbreak tend to slow the spread of the disease.Also, the results seem to indicatevia demo PC7 componenta surprising diminishing effect of the net migration on the rate of epidemic progress (though the sign of this variable may not be easy to interpret, as the net migration is a difference of two quantities).LASSO results, shown in Figure 3C, find two additional PCs as relevant: demo PC5 and meteo PC1 (in addition to demo PC1, demo PC4, demo PC7, and demo PC9).The component demo PC5, appearing in LASSO results with a small negative coefficient, is significantly correlated only with the BCG variable, hinting at possible beneficial effects of BCG vaccination.Meteorological principal component meteo PC1 reflects seasonality (see above).Thus, overall, in addition to supporting the conclusions of the custom and stepwise methods, the LASSO method also implicates a significance of seasonality changes, and to some extent BCG vaccination, in reducing the rate of SARS-CoV2 spread.
The results of the Elastic net method, shown in Figure 3D, are again a bit more restrictive.While further bolstering our confidence in the importance of demo PC1, demo PC7, and demo PC9, these results also reinforce that the seasonal weather variables influence the COVID-19 epidemic (in agreement with the LASSO method) but, for the first time, we do not find an indication of the relevance of the unhealthy lifestyle and living conditionsas revealed by the absence of demo PC4 component in Figure 3D.
Finally, as much as the PCs appearing in Figure 3 are important, the absence of the remaining PCs in the results can be of comparative significance for some of our conclusions.For example, we note that PCs highly correlated with the urban population, alcohol consumption, and chronic diseases do not show up as relevant in any of the methods used.While it is true that these variables are moderately correlated with demo PC1, absence in the results of additional PCs tied with these variables supports the view that these variables are not directly influencing R0 value, but only via indirect relation to the country's prosperity.

Discussion
Our goal was to identify the most predictive factors influencing the risk of the SARS-CoV-2 virus spreading in a population in the absence of any epidemic mitigation measures.Since many potentially relevant factors strongly correlate with each other, we divided them into two groups -meteorological and sociodemographicand applied the Principal Component Analysis to the variables in each group.
In this way, we were able to decorrelate variables within each group, while still retaining intuitive interpretation for the new variables (demographic and meteorological PCs) used in further analysis.Dimensionality reduction and predictor decorrelation through PCA was then combined with different variable selection and regularization techniques, to select PCs that are most predictive of R0 for COVID-19 epidemics.Examining correlations of these PCs with the original variables allowed pinpointing the main drivers of COVID-19 transmissibility.This approach is to our knowledge unique in the COVID-19 research literature, and reminiscent of the analysis of complex data in systems biology and bioinformatics.
Three principal components are robustly selected as the most important predictors by all the methods.Of these, the prosperity of the country has the most significant influence on R0: the spread of the epidemic is faster in economically more developed countries.Specifically, this is the most dominant PC from the demographic group of variables, which is by far most important in explaining R0, and very strongly correlated with HDI (Pearson's correlation coefficient r=0.95) and GDP (r= 0.94)therefore effectively reflecting prosperity and wealth.The second PC is dominantly related to the built-up area per person (BUAPC), and the third with the epidemic onset, where the increase of these reduces the infection spread.We also robustly obtained (by three out of four methods) that unhealthy living conditions and lifestylei.e., the PC dominantly (and consistently positively) correlated with obesity, physical inactivity, smoking, and air pollutionis another important factor that exacerbates the epidemic.Seasonality, represented by the group of four weather conditions all significantly correlated with temperature, was selected by two independent methods including, importantly the Elastic net, which is well adapted to selecting among correlated variables (Zou and Hastie, 2005;Hastie et al., 2009) -note that correlations between meteorological and demographic PCs were not abolished by our approach.The PC dominantly correlated with BCG immunization appears only in LASSO regression.

High economic development as the main predictor of COVID-19 transmissibility
As noted above, we consistently obtained that the first demographic PC is the most important predictor of R0.HDI (alternatively, GDPpc) shows the highest correlation with this PC, which singles out this variable as the main index quantifying the virus transmissibility risk.Higher HDI leads to a higher rate of social contacts and more intense population mixing, as high HDI is strongly associated with high GDPpc implying intensive economic activity, trade, and transportation, including large-distance flights (Allel et al., 2020;Gangemi et al., 2020).Thus, much higher contact frequency in societies with higher HDI is likely the main cause behind the dominant role of the first demographic PC in explaining R0.
An important advantage of our approach is that it is based on the analysis of R0, rather than other measures used as transmissibility proxies.The most commonly used measure, confirmed case counts, strongly depends on the number of performed tests, which is generally much higher in high-GDPpc countries, so the analysis would become strongly influenced by testing policies.For example, in (Allel et al., 2020) the importance of HDI for predicting cumulative case counts was noted.However, this perceived effect may be due to the lack of testing in lower-income countries (Notari, 2021), rather than genuine HDI influence.Our results are, on the other hand, insensitive to the testing capacity differences, since our R0 estimation procedure relies on the slope of the case growth curve in the distinct early exponential phase (Djordjevic et al., 2021), which requires only that the testing is performed consistently during the relatively short examined period (Salom et al., 2021).Therefore, our analysis indeed strongly suggests that HDI/GDPpc are the main/genuine predictors of COVID-19 spread in the population.

Demographic factors significantly correlated with HDI
Many correlations previously reported between SARS-CoV-2 transmissibility and various weather, sociodemographic, and health factors [see e.g.(Li et al., 2020;Salom et al., 2021)] may be captured by HDI.From our results, one can note that several demographic factors significantly correlate with both HDI/GDPpc and the first demographic PC, but are not noticeably related with other demographic PCs (4,5,7,9) that significantly contribute to R0.These demographic factors can be further divided into two groups using the correlation of BUAPC with HDI as the reference.The percentage of the urban population, the prevalence of alcohol consumption, and chronic diseases, which have similar (just somewhat higher) correlations with HDI compared to BUAPC, comprise the first group.Their absence from the independent PCs significantly related with R0, in contrast to BUAPC which prominently appears in the demographic PC7, indicates that they do not have independent effects on R0.Consequently, their significant correlation with R0 (Salom et al., 2021) is very likely due to their generic correlation with HDI, rather than a consequence of the independent effect that they exhibit on R0.This result is especially interesting for the percentage of the urban population, whose relation with R0 is sometimes taken for granted (Carozzi, 2020).It also explains the previously obtained negative correlation of the prevalence of chronic diseases with R0, where one might expect the opposite, as it is generally known that people with chronic diseases are seriously affected by COVID-19 (Zheng et al., 2020).We can now claim that this result is due to a generically lower incidence of chronic diseases in more developed countries (i.e., due to their significant negative correlation with HDI), rather than a direct effect on R0.
The net economic immigration (the difference between immigrants and emigrants), population median age, infant mortality, and the average blood cholesterol level, comprising the second group, also have a significant positive correlation with the first demographic PC.However, in distinction to the aforementioned three factors, their correlation with HDI is very high, i.e., visibly higher compared to the correlation of BUAPC with HDI.So, even though they do not appear in demographic PCs that significantly contribute to R0 other than PC1, we cannot make any reliable conclusion about their direct effect on R0 based on our analysis.It is therefore relevant to discuss evidence from other sources, i.e., possible mechanisms that can distinguish their direct influence on R0.Regarding infant mortality, a mechanism of its direct contribution to R0 is hard to imagine, so its involvement in PC1, and high negative correlation with R0, is almost certainly an indirect consequence of this variable being a proxy of HDI (Ruiz et al., 2015).On the other hand, the median age and the blood cholesterol level are real contenders for direct R0 modifiers, as mechanisms for their contribution to COVID-19 transmissibility have been proposed.Aging is generally associated with the weakening of the immune response to infectious diseases making the elderly more susceptible to the viruses like the SARS-CoV-2 (Pawelec and Larbi, 2008).Additionally, many of them due to some chronic diseases take ACE inhibitors and angiotensin-receptor blockers which cause an increased expression of ACE2 serving as a receptor for the SARS-CoV-2 virus entry (Shahid et al., 2020).Their residing in care-homes, which is particularly common in high-income countries, also well suits the spreading of the infection (Kapitsinis, 2020).
Similarly, high cholesterol levels can increase susceptibility to the infection by SARS-CoV-2 through systemic adverse effects on the immune and inflammatory responses, but also through direct implication in the virus life cycle, especially at the level of its endocytosis.To that end, statins, blocking cholesterol synthesis, were proposed for usage in COVID-19 treatment, which is supported by studies showing that previous statin usage is associated with a milder pneumonia outcome in the case of several other viral infections (Frost et al., 2007;Schmidt et al., 2020).

Independent COVID-19 transmissibility predictors
All the demographic variables discussed in the previous subsection show a rather strong correlation with the first demographic PC but are not involved with other significant demographic PCs (4,5,7,9).These PCs are by construction independent (decorrelated) from PC1. Variables associated with these PCs can be interpreted as effects on R0 independent from those related to PC1.These variables then importantly identify corrections to the main effect of HDI/GDPpc.Specifically, these are indoor area available to an individual and the net immigration (demographic PC7), the delay in the epidemic onset with respect to February 15th associated with more awareness of the virus threat (demographic PC9), the prevalence of unhealthy lifestyle and environment (demographic PC4), and the weather seasonality (meteorological PC1).
The slower spread of the virus with a larger built-up area per capita, as an independent and significant R0 predictor, is an interesting and new result, though intuitively plausible.It can be understood as having a less crowded indoor space (where the virus transmission dominantly happens) so that people are less exposed to each other and the virus.For example, both the population density and R0 on the Diamond Princess cruise ship were estimated as four times greater than those in Wuhan (Rocklöv and Sjödin, 2020).On the other hand, a correlation of the virus transmissibility with the large territory population density is weakly established in the literature, whereby it seems that one should rather seek a correlation with a local population density, directly determining the number of contacts that an individual can make (Garland et al., 2020).
A positive contribution to the transmissibility is also made by the principal component strongly correlated with the onset variable, representing the number of days from February 15th to the epidemic's start in a particular country.The importance of the delay in the epidemic onset may be due to the psychological effect of hearing the news about the spread of COVID-19 in other countries (Khajanchi et al., 2020).Namely, the longer the epidemic was growing outside of a particular country, the larger impact this had on its people to change their usual behavior to prevent the infection, which could slow down the virus transmission even before the introduction of the official intervention measures (Salom et al., 2021).
Another distinguished principal component appears to encompass multiple indicators of an unhealthy lifestyle and environmentspecifically, the prevalence of obesity, physical inactivity, and smoking, together with the level of air pollution.We obtained that all these factors promote virus transmission.
It is well established that they can impair immune function and adversely affect different organ systems.Furthermore, their association with mechanisms specifically facilitating the infection by the SARS-CoV-2 virus has been proposed (Domingo and Rovira, 2020;Heidari-Beni and Kelishadi, 2020;Haddad et al., 2021).
Two more PCs are strongly determined by temperature (and/or three other highly related weather factors) and the prevalence of BCG vaccinated children, respectively.Although not selected by all the methods, the weather component seems important as it was chosen by the Elastic net algorithm (in addition to LASSO), which is specifically designed to deal with (highly) correlated variables, and yet it did not exclude this PC despite its correlation with the first demographic PC.Moreover, a decrease of the transmissibility with the temperature increase appears as a robust result in COVID-19 literature, although conflicting conclusions are also present (Srivastava, 2021).Higher temperatures may shorten the period of virus viability in aerosols, enhance the immune system functioning, and/or impact the time that people spend together in poorly ventilated indoor spaces (Notari, 2021).Since temperature is highly positively correlated with the intensity of UV radiation, humidity, and the level of precipitation, we cannot exclude the possibility that some of these other factors are in a significant causal relationship with virus transmissibility.Importantly, some experimental findings support the inactivating effects of high temperature, humidity, and UV radiation on SARS-CoV-2 and related viruses (Casanova et al., 2010;Chan et al., 2011;Heilingloh et al., 2020;Sagripanti and Lytle, 2020;van Doremalen et al., 2020).Anyhow, our results suggest the dependence of virus transmissibility on seasonal weather variations.
Regarding the last demographic principal component, it occurred as important only in LASSO regression, but it closely follows the extent of BCG vaccination, which is known to provide some protection against various respiratory tract infections through the induction of the trained immunity (O'Neill and Netea, 2020), so BCG immunization may significantly influence the SARS-CoV-2 spread, although, according to our results, to a lesser extent than the other discussed factors.

Differences to pairwise correlation analysis
Our study is also an example of how assessing the effect of one factor while controlling for the presence of other relevant variables can change the obtained conclusions.We will illustrate this with four examples, where we obtained qualitatively different conclusions, compared to single-variable correlation analysis (Salom et al., 2021): built-up area per capita (BUAPC), net migration, air pollution, and raised blood pressure.
BUAPC showed an absence of a significant correlation with R0 (Salom et al., 2021), which is due to the canceling of two effects.The first is its direct effect on R0, exhibited through demographic PC7, which is in the direction of slowing COVID-19 spread in a population.The other effect is through collinearity with PC1, which reflects a generic correlation of BUAPC with GDPpc, caused by more construction (higher built-up area) per capita with the increase in GDPpc.Our combination of PC and regression analysis revealed this non-trivial conclusion, which cannot (even qualitatively) be obtained from the pairwise correlation analysis.
Similar reasoning, though perhaps harder to understand intuitively, applies to net migration.Net migration is also significantly (and positively) correlated with HDI, and consequently also with PC1, reflecting a generic tendency of immigrants to flow to countries with higher GDPpc.The direct effect of net immigration, exhibited through PC7 is however harder to intuitively understand, as I-E negatively contribute to R0, so that faster spread (at least in the initial phase of the epidemic) appears to be associated with a higher number of emigrants.As these are economic migrations (to be distinguished from the movement of refugees), possibly the part of the emigrants returned to their countries with the pandemic's start.In any case, the significant effect of net immigration on R0 inferred through our analysis is again highly non-trivial, and in the opposite direction from the positive pairwise correlation of R0 with I-E.For refuges (i.e., percentage of refugee population by country), it exhibits high correlations with only PC3 and PC8, neither of which significantly contribute to R0.There is also no significant pairwise correlation of refugees with R0, which robustly shows that this variable does not significantly affect transmissibility.
Regarding pollution, it contributes negatively to demographic PC1 (with the corresponding negative correlation with HDI), while it has a positive contribution to demographic PC4.The pairwise correlation of the pollution with R0 is negative (-0.31), which is counterintuitive, as it is generally expected that higher pollution should increase COVID-19 transmissibility.This negative correlation with R0 is however an artifact of the generic negative correlation of the pollution with HDI, while its genuine (direct) effect is reflected through PC4.Our analysis, therefore, revealed the direct effect of long-term air pollution on transmissibility, which is consistent with previously published observations that it can damage the respiratory system and reduce resistance to infections (Domingo and Rovira, 2020;Fattorini and Regoli, 2020), but opposite to naive pairwise correlation analysis.
Raised blood pressure also shows a statistically significant, but counterintuitively negative, correlation with R0.However, in addition to PC1, raised blood pressure shows a notable correlation only with PC2, which does not significantly affect R0.This indicates that the negative correlation of this variable with R0 is a consequence of its generically negative correlation with HDI, instead of a direct effect on COVID-19 transmissibility.

Conclusion and Outlook
Numerous studies tried to assess the correlations of different factors with the SARS-CoV-2 virus transmissibility (Li et al., 2020;Notari and Torrieri, 2020;Salom et al., 2021), but the next step should be predicting the environmental risk of the high spreadability in a certain population (Allel et al., 2020;Coccia, 2020;Gupta and Gharehgozli, 2020).Specifically, a relatively small number of the most influential meteorological and demographic factors should be selected for a predictive risk measure that is accurate enough and practical for use.Such risk assessment is very useful in guiding the future strategies of imposing epidemic mitigation measures.
We here demonstrated that taking into account joint effects of different factors can point to qualitatively different conclusions about their influence on the virus transmissibility than considering them individually (as in (Salom et al., 2021)).Utilizing a combination of PCA and feature selection techniques, we were able to disentangle with high confidence which variables independently (and significantly) influence the rate of the infection spread, and which have an only indirect influence or no influence at all (here found for alcohol consumption, chronic diseases, percentage of the urban population, raised blood pressure and refugees).
While PCA brings clear advantages to regression analysis such as working with a smaller number of variables and abolishing collinearity, the main disadvantage is harder interpretation in terms of original variables.In this case, we were, however, able to unequivocally interpret PCs that significantly affect R0, so that the main driving factors (i.e., PCs) behind COVID-19 transmissibility are the country's wealth/development level corrected by the available indoor space per person and net immigration; pollution levels, and some of the unhealthy living factors; spontaneous behavior change due to developing epidemics; weather seasonality; possibly (marginally) BCG vaccination.These conclusions, and the direction of the corresponding effects, crucially depend on the more complex analysis performed here.
However, when the alignment between certain variables is too high, even the analysis performed here cannot differentiate between the factors genuinely affecting R0 and mere accidental correlations.In such cases, further, specifically designed (such as targeted epidemiological) studies are needed.For example, based on this analysis alone and due to the very high correlation between the cholesterol levels and HDI/GDP it cannot be excluded that cholesterol is a contributing factor to the observed significance of the PC1 component, in addition to the country's prosperity that mimics the contact rate in population (as a crucial disease transmission property).For this reason, our research suggests that a separate study of cholesterol levels in the COVID-19 context (e.g. by measuring cholesterol blood levels along with PCR tests) could be, potentially, of high value since a hypothetical unexpected discovery of inherent cholesterol importance could potentially lead to novel treatments of SARS-CoV-2 infection.Similarly, studies that disentangle the effect of the overall country's prosperity from the intrinsic effects of median age on R0 would be also quite welcome.
Our conclusions about the importance of HDI as a predictor of R0 could be further tested by studies of epidemiological relevance of higher resolution HDI-analogs, such as Subnational HDI (SHDI) or City Development Index (CDI).And if HDI and GDP parameters are confirmed to dominantly influence R0 values simply since they highly and naturally correlate with the frequency of social contacts (as we anticipate to be the case), identifying this as one of the major factors is not without implications.While it is certainly not reasonable to intentionally reduce HDI levels to curb the COVID-19 epidemic, recognizing the importance of this parameter can help us make better predictions of the disease dynamic and locate in advance high-risk spots/areas.The BUAPC variable, which surfaced as another significant factor in our analysis, can have a similar predictive value.As for the PC4 component, reflecting the healthy lifestyle and living conditions, we could and certainly should try to influence the underlying variables -by attempting to reduce obesity, smoking prevalence, physical inactivity, and air pollution.All the more so now that our study indicates the corresponding improvements would also be beneficial to combat the COVID-19 pandemic.

Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1 .
Figure 1.PCA for demographic data.A) Cumulative explained variance.B) Variables best correlated with demographic PCs.The label above and below each bar present, respectively, the demographic PC and the variable with which this PC has the highest correlation.C) Scatter plot PC1 vs HDI.D) Correlations of selected demographic variables with HDI.

Figure 2 .
Figure 2. PCA for meteorological data.A) Cumulative explained variance.B) Variables best correlated with meteorological PCs.C) Scatter plot meteo PC1 vs temperature.D) Correlation of meteorological variables with temperature.

Figure 3 .
Figure 3. Results of: A) multiple linear regression ("custom") method, B) Stepwise regression, C) LASSO regression and D) Elastic net regression.Bar charts represent the values of regression coefficients for each of the PCs selected by the method.

Figure 4 :
Figure 4: Pearson correlation coefficients between principal components and demographic variables for A) demo PC1, B) demo PC4, and C) demo PC7.Equivalently to Figure 3A, Figures 3B, 3C, and 3D represent the results of, respectively, Stepwise, LASSO, and Elastic net regression.Results (and the corresponding graph) of the Stepwise method almost coincide with the results of our custom methodin spite that in the Stepwise regression (as well as in LASSO and Elastic net methods) there is no intermediate "preselection" step.