DISCUSSION
The GAN approach investigated integrates AI techniques with the large
public domain NHANES database containing biomedical information on
diverse populations that could prove valuable in pharmacometrics
applications. Proof-of-concept computational experiments were conducted
to evaluate the capabilities of GANs to simulate univariate
distributions of a test bed of 16 diabetes-relevant biomarkers. In the
next step, the GAN strategy was extended to complex joint distributions
of multiple biomarkers. Finally, a conditional GAN was used to model the
Black, Hispanic, Other, and White race/ethnicity categories.
A training-test strategy was used to evaluate GAN performance.
The GAN strategy enables robust learning and can be considered
“non-parametric” because it does not need prior distributions, which
are required for Bayesian approaches. While the latent space for a GAN
generator is sampled from a multivariate Gaussian distribution, it
serves only as a source of random noise for the generator neural network
to transform. Notably, the GAN architecture is indirect because it does
not conduct a head-to-head comparison of the generated data distribution
vs. the training data distribution. GANs avoid direct comparison by
interposing a binary classifier (the discriminator) and making judicious
use of adversarial loss functions. The literature on GANs in
pharmacometrics is sparse.
Parikh et al. 14 used GANs to generate
instances of models for cardiac mechanics in control myocytes and in
myocytes treated with omecamtiv mecarbil, a new drug for treating heart
failure. The GANs were used to find model parameters that fit the
data for both groups. This application of GANs to in vitro data
differs qualitatively from the patient-centric problem in our research.
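As a point of reference for the adversarial loss functions mentioned
above, the original GAN formulation can be written as the minimax
objective below. This is the standard textbook form, given only as a
sketch; the tabular GAN variants used in this work may employ modified
losses. The symbols are notation assumed here: G is the generator, D is
the discriminator (the binary classifier), p_data is the training data
distribution, and p_z is the Gaussian latent distribution.

```latex
\min_{G}\,\max_{D}\; V(D,G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

In this form the generator is updated only through D(G(z)), which is
the sense in which the comparison with the training data is indirect.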
Conditional GANs are an extension of GANs wherein the generator and
discriminator networks are conditioned with additional input.
Conditional GANs are particularly useful for modeling multimodal data
and have been used elsewhere for tagging and annotating images 15. We
found that the joint distribution of biomarker profiles
could be modeled using GAN architectures effective for tabular data,
which can consist of multiple data types, e.g., continuous, ordinal,
and categorical variables. Tabular data generation presents some unique
challenges compared with GAN modeling of images because (i) the columns
in a row do not have local structure and (ii) continuous variables that
depend on the conditioned variable are generally multimodal (i.e.,
the density function has several peaks). The typical GAN architectures
designed for images are not particularly good at generating multimodal
data because of a phenomenon termed “mode collapse”. Mode collapse
reduces the diversity of output samples and occurs when the generator
can only produce a single type of output or a small set of outputs that
fool the discriminator 13. To simultaneously generate
a mix of discrete and continuous columns, the GAN approach of Xu et
al. 12 applies both softmax and tanh activations to the
output. We used the PacGAN method, wherein the discriminator
decision-making is guided by multiple or “packed” samples from each
class 13. In PacGAN, the discriminator does not
classify each generated sample individually but instead examines a
“pack” of samples for a class. Thus, the diversity of the generated
samples becomes a
criterion for the discriminator in the classification process and helps
avoid mode collapse. By implementing these enhancements 12,13, we found
that a conditional GAN yielded
effective results for modeling race/ethnicity. The approach addresses
both the frequency differences between the various under-represented
groups and the multimodality resulting from between-group differences in
biomarker expression.
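The two mechanisms described in this paragraph, mixed output
activations and packing, can be illustrated with a minimal NumPy sketch.
It is not the implementation used in this work: the function names
`mixed_output` and `pack_samples`, the column layout (continuous columns
first, then a single categorical block), and the toy dimensions are all
assumptions made for illustration.

```python
import numpy as np


def mixed_output(raw: np.ndarray, n_continuous: int) -> np.ndarray:
    """Apply tanh to the continuous columns and softmax to the rest.

    Assumed layout (hypothetical): the first `n_continuous` columns are
    continuous; the remaining columns are logits of one categorical variable.
    """
    cont = np.tanh(raw[:, :n_continuous])
    logits = raw[:, n_continuous:]
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return np.concatenate([cont, probs], axis=1)


def pack_samples(batch: np.ndarray, pac: int) -> np.ndarray:
    """Concatenate every `pac` consecutive rows into one discriminator input."""
    n_rows, n_cols = batch.shape
    if n_rows % pac != 0:
        raise ValueError("batch size must be divisible by pac")
    return batch.reshape(n_rows // pac, pac * n_cols)


rng = np.random.default_rng(0)
# Stand-in for raw generator output: 8 rows, 2 continuous + 4 category logits.
raw = rng.normal(size=(8, 6))
fake = mixed_output(raw, n_continuous=2)
packed = pack_samples(fake, pac=4)
print(packed.shape)  # (2, 24)
```

Because the discriminator receives each pack as a single input, a
generator that emits near-identical rows produces packs that are easy to
flag, which is how packing discourages mode collapse.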
We selected 16 diverse diabetes-relevant physiological biomarkers that
reflect different organ systems and become clinically salient at
different stages of diabetes progression. Alterations to plasma glucose
and insulin profiles are direct consequences of diabetes and can be
dysregulated early in diabetes because of decreased pancreatic β-cell
function or increased insulin resistance in hepatic and peripheral
tissues. Glycohemoglobin is related to the average glucose exposure over
2-3 months. In contrast, increased urinary creatinine and albumin are
the result of compromised renal function during diabetes disease
progression. We also included integrative biomarkers (e.g., body mass
index and systolic blood pressure), metabolic biomarkers (e.g.,
triglycerides and cholesterol), inflammatory biomarkers (C-reactive
protein and ferritin), and hepatic biomarkers (e.g., alanine
aminotransferase, aspartate aminotransferase, and gamma
glutamyltransferase) that are dysregulated in diabetes.
One of the strengths of NHANES as a source of “big data” for
modeling under-represented groups is that, although the total sample
size in a given cycle is fixed, the survey adapts its population-based
sampling strategy to include adequate numbers of individuals from
under-represented groups. For example, there is ongoing oversampling of
Hispanics, non-Hispanic Blacks, older adults, and low-income
White/Other groups and, beginning in 2011, non-Hispanic Asians were
oversampled 16. We used the RIDRETH1 variable
from NHANES to derive our under-represented groups; additional
race-ethnicity variables have been added to NHANES, but these variables
were not available across all the datasets we used. A weakness is that
the NHANES sample is limited to the non-institutionalized civilian
resident population: it does not contain groups such as prisoners,
military personnel, and individuals in psychiatric institutions or drug
rehabilitation facilities. Interestingly, Allen et al. and Rieger et
al. also leveraged NHANES data in their work on virtual
patients 17,18. We have previously used NHANES as the
data source in the generalized pharmacometrics modeling (GPM) approach,
which integrates population models with AI techniques. GPM simulates
pharmacokinetic (PK) parameters from population PK covariate models
using Bayesian networks that include demographic and biomarker features
identified from NHANES. The integration of external data enables GPM to
facilitate modeling and simulation of drug disposition and effects for
populations different from those in the underlying PK study 7.
Creating virtual populations requires modeling or otherwise sampling the
joint distribution of biomarkers of interest. If the biomarkers are not
normally distributed or if there are multiple biomarkers of interest,
covariance matrices are generally inadequate for characterizing
higher-order inter-dependencies. General empirically motivated methods
for producing virtual patient populations include patient selection
using inclusion and exclusion criteria 19,
bootstrapping similar clinical trials or patient databases 20, and
simulating from fitted distributions 21. Simulated annealing and nested
simulated annealing-based methods have been proposed for generating
“plausible” populations in the context of quantitative systems
pharmacology models 17,18. Our GAN approach relies on neural
network-based learning and is generative, i.e., it creates new sample
sets. It differs substantially from the non-parametric re-sampling and
parametric Bayesian approaches that have been used in pharmacometrics
for approximating data distributions.
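The limitation of covariance matrices for non-normal biomarkers can be
illustrated with a small synthetic example. The sketch below uses
made-up, right-skewed, correlated columns (not NHANES data) and compares
two of the empirical strategies mentioned above: bootstrapping rows and
simulating from a fitted multivariate normal. The function names and the
data-generating choices are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for two skewed, positive, correlated biomarkers.
shared = rng.normal(size=5000)
data = np.column_stack([
    np.exp(0.8 * shared + 0.3 * rng.normal(size=5000)),
    np.exp(0.5 * shared + 0.5 * rng.normal(size=5000)),
])


def bootstrap_rows(table, n, rng):
    """Resample whole rows with replacement, keeping each row intact."""
    idx = rng.integers(0, table.shape[0], size=n)
    return table[idx]


def gaussian_fit_sample(table, n, rng):
    """Sample from a multivariate normal fitted by mean and covariance."""
    mean = table.mean(axis=0)
    cov = np.cov(table, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)


boot = bootstrap_rows(data, 5000, rng)
gauss = gaussian_fit_sample(data, 5000, rng)

# Every bootstrapped row is an observed row, so all values stay positive.
print("negative values, bootstrap:", int((boot < 0).sum()))
# The Gaussian fit matches mean and covariance but not the skewed shape,
# so it can generate physiologically impossible negative values.
print("negative values, Gaussian fit:", int((gauss < 0).sum()))
```

Bootstrapping preserves the joint structure but can only replay
observed rows, whereas the Gaussian fit creates new rows but distorts
the marginal shapes; a generative model aims to produce new rows while
retaining the learned joint distribution.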
GANs are considered a deep learning (DL) method because many GANs require
deep neural networks (DNN; “deep” refers to the number of network
layers) for the generator and discriminator architectures. Although
there is increasing interest in leveraging AI approaches including DL in
drug discovery and development, the assessments of DL and GANs in
pharmacometrics have been preliminary 22,23. Liu et al. 23 used a long
short-term memory (LSTM, a
common neural network architecture that is effective for time series)
DNN to model simulated PK/PD data of a hypothetical drug. The plasma
concentration and effect level under one dosing regimen were used to
train the model, and the model was then used to predict the individual
PK/PD for other dosing regimens. Lu 22 included neural
ordinary differential equations for forecasting PK/PD of platelet
responses in a clinical dataset of 800 patients. It should be noted that,
like many AI and DL methods, GAN methods can be computationally
intensive; however, graphics processing units (GPUs) and high-performance
computing (HPC) architectures can improve the performance of AI
algorithms substantially 24,25.
Our results demonstrate the potential of the GAN approach for modeling
the joint distribution of complex systems of disease-relevant biomarkers
in under-represented groups. The approach may find utility for
generating virtual patient populations for clinical trial simulations
and pharmacometrics.