Abstract
Machine learning has the potential to automate the analysis of vast
amounts of raw geophysical data, allowing scientists to monitor changes
in key aspects of our climate, such as cloud cover, in real time and at
fine spatiotemporal scales. However, the lack of large labeled training
datasets poses a significant barrier to applying machine learning
effectively in these applications. Transfer learning, which involves first
pretraining a neural network on an auxiliary “source” dataset and then
finetuning it on the “target” dataset, has been shown to improve the
accuracy of machine learning models trained on small datasets. Across prior work
on machine learning for geophysical imaging, different choices are made
about what data to pretrain on, and the impact of these choices on model
performance is unclear. To address this, we systematically compare
transfer learning configurations for cloud classification, cloud
segmentation, and aurora classification. We pretrain on different source
datasets, including the large ImageNet dataset as well as smaller
geophysical datasets that are more similar to the target datasets. We
also experiment with multi-step transfer learning, in which we pretrain
sequentially on more than one source dataset. Despite the smaller source datasets’
similarity to the target datasets, we find that pretraining on the
large, general-purpose ImageNet dataset yields significantly better
results across all of our experiments. Transfer learning is especially
effective for smaller target datasets, and in these cases, using
multiple source datasets can provide a marginal additional benefit.
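For concreteness, the pretrain-then-finetune recipe studied here can be sketched as follows. This is a minimal illustration assuming PyTorch/torchvision; the model architecture, class count, and hyperparameters are hypothetical placeholders, not the configuration used in the paper.

```python
# Minimal sketch of the pretrain-then-finetune pipeline, assuming
# PyTorch/torchvision. The model choice, class count, and hyperparameters
# below are illustrative, not the paper's actual configuration.
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 10  # hypothetical, e.g. cloud types in the target dataset

# Pretraining step: load weights learned on the ImageNet source dataset.
model = models.resnet18(weights="IMAGENET1K_V1")

# Replace the ImageNet classification head with one sized for the target task.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

# Finetuning step: update all weights on the small labeled target dataset.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune(model, target_loader, epochs=10):
    model.train()
    for _ in range(epochs):
        for images, labels in target_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```

In this framing, the multi-step variant simply repeats the finetuning call on an intermediate source dataset (with a matching head swap) before the final finetuning on the target dataset.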