High-Dimensional Biomarker Panel Joint Distribution Simulations
We developed and evaluated a GAN for higher-dimensional distributions.
Dataset and Data Pre-Processing: For these experiments, the joint distribution of 14 of the 16 diabetes-relevant biomarkers from the univariate setting was assessed.
High-sensitivity C-reactive protein (hs-CRP) and ferritin were excluded from the list of biomarkers; ferritin was excluded because of sample size and hs-CRP was excluded because assay methodologies changed across the NHANES data sets.
GAN Architecture: The architecture of the conditional GAN was based on that of Xu et al. 12 for tabular data.
Two fully connected hidden layers of size 256 were used in both the generator and the discriminator. In the generator, batch normalization and ReLU activation functions were used after each fully connected layer. A variational Gaussian mixture model was used to identify the modes of each continuous variable and to apply mode-specific normalization. After the two hidden layers, the synthetic row representation is generated. The scalar values of this representation are generated using tanh activation, while the mode indicator and discrete values are generated by Gumbel softmax.
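The mode-specific normalization and the Gumbel softmax output described above can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the mixture parameters (means, stds, weights) are hypothetical stand-ins for a fitted variational Gaussian mixture, the mode is chosen by highest posterior density for determinism (Xu et al. sample the mode from the posterior instead), the scaling by four standard deviations follows the description in Xu et al., and the clipping range and temperature tau = 0.2 are assumed values.

```python
import math
import random


def mode_specific_normalize(x, means, stds, weights):
    """Return (alpha, beta) for one continuous value.

    alpha: the value rescaled within its selected mode (matches a tanh output range).
    beta:  one-hot indicator of the selected mode.
    """
    # Weighted Gaussian density of x under each mixture component.
    dens = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for m, s, w in zip(means, stds, weights)
    ]
    # Simplification (assumption): pick the most likely mode rather than sampling it.
    k = max(range(len(dens)), key=dens.__getitem__)
    alpha = (x - means[k]) / (4.0 * stds[k])
    alpha = max(-0.99, min(0.99, alpha))  # assumed clipping so alpha stays inside (-1, 1)
    beta = [1 if i == k else 0 for i in range(len(means))]
    return alpha, beta


def inverse_normalize(alpha, beta, means, stds):
    """Map a generated (alpha, beta) pair back to the original scale."""
    k = beta.index(1)
    return alpha * 4.0 * stds[k] + means[k]


def gumbel_softmax(logits, tau=0.2, rng=random):
    """Draw a relaxed one-hot sample from unnormalized logits."""
    # Gumbel(0, 1) noise; the lower bound on u avoids log(0).
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, noise)]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-mode mixture for a single biomarker.
means, stds, weights = [5.0, 9.0], [0.5, 1.5], [0.7, 0.3]
alpha, beta = mode_specific_normalize(5.5, means, stds, weights)
restored = inverse_normalize(alpha, beta, means, stds)
```

In this encoding each continuous biomarker contributes one scalar plus one indicator per mode, so the generated row is wider than the 14 original columns.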
In the discriminator, we used the leaky ReLU activation function and dropout on each hidden layer. The PacGAN framework, with 10 samples in each pack, was used to reduce mode collapse 13.
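The packing step in PacGAN can be sketched as a simple reshaping: consecutive rows are concatenated so that the discriminator judges 10 samples jointly rather than one at a time. The helper name pack_rows and the row width of 14 are hypothetical and chosen only for illustration; the actual discriminator input is wider once mode indicators and any conditional vector are included.

```python
def pack_rows(batch, pac=10):
    """Concatenate every `pac` consecutive rows into one discriminator input."""
    if len(batch) % pac != 0:
        raise ValueError("batch size must be divisible by the pack size")
    return [
        [value for row in batch[start:start + pac] for value in row]
        for start in range(0, len(batch), pac)
    ]


# A batch of 300 rows, each with 14 illustrative features.
batch = [[float(r)] * 14 for r in range(300)]
packed = pack_rows(batch, pac=10)
```

With a batch size of 300 and 10 samples per pack, the discriminator sees 30 packed inputs per batch, each 10 times the width of a single row.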
The model was trained for 1000 epochs with a batch size of 300 and five discriminator steps.
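As a rough sketch of how these settings fit together, the loop skeleton below counts parameter updates. It assumes that "five discriminator steps" means five discriminator updates per generator update, which the text does not state explicitly; the training-set size of 3000 rows and the names TrainConfig and count_updates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 1000
    batch_size: int = 300
    discriminator_steps: int = 5  # assumed: per generator update
    pac: int = 10


def count_updates(n_rows, cfg):
    """Walk the training schedule and count discriminator and generator updates."""
    steps_per_epoch = max(n_rows // cfg.batch_size, 1)
    d_updates = 0
    g_updates = 0
    for _ in range(cfg.epochs):
        for _ in range(steps_per_epoch):
            for _ in range(cfg.discriminator_steps):
                d_updates += 1  # placeholder for one discriminator update
            g_updates += 1  # placeholder for one generator update
    return d_updates, g_updates


cfg = TrainConfig()
d_updates, g_updates = count_updates(n_rows=3000, cfg=cfg)  # hypothetical size
```

The batch size of 300 is divisible by the pack size of 10, which packing requires.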
Data Analysis: For visualization, t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and principal components analysis (PCA) were used to obtain two-dimensional projections of the 14-dimensional data. The Rtsne and umap packages and the prcomp function in R were used. For t-SNE, the perplexity and theta hyperparameters were set to 50 and 0.5, respectively. The ggpairs function from the GGally package was used to generate pairs panel plots containing univariate densities, bivariate scatter plots, and Spearman rank correlations for the test data and GAN-generated distributions. Seven of the 14 biomarkers were shown in the pairs panel plots to keep the number and size of the bivariate plots amenable to visual interpretation.
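The Spearman rank correlation reported in the pairs panel plots is the Pearson correlation of the ranks, with tied values given their average rank. The sketch below shows that computation in plain Python for illustration only; the analysis described above was carried out in R, and the two biomarker columns are made-up values.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values for two biomarkers; the relationship is monotone but not linear.
glucose = [1.0, 2.0, 3.0, 4.0, 5.0]
insulin = [1.0, 4.0, 9.0, 16.0, 25.0]
rho = spearman(glucose, insulin)
```

Computing the same statistic on the test data and on the GAN-generated data, pair by pair, gives a numerical counterpart to the visual comparison in the pairs panel plots.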