Abstract
The escalating impact of climate change underscores the need for accurate and timely forecasts of meteorological phenomena, particularly droughts, which have extensive effects on agriculture, water resources, and ecosystems. To address this need, we introduce a deep learning framework that combines computer vision with modified Transformer networks to predict future drought conditions from historical global climate data. The model inputs are stacked monthly global maps of Sea Surface Temperature, Temperature 2 m above ground, and Total Precipitation, each spanning one year, which yields a 36-channel input (3 variables × 12 months) that captures seasonal variability. This study extends the conventional Vision Transformer (ViT) by adapting it for sequence processing, enabling the model to learn the temporal dynamics and spatial interdependencies inherent in climate data. Using a sliding-window approach, the model takes a sequence of 12 months for each variable as input, and the target variables are stacks of the Standardized Precipitation Evapotranspiration Index (SPEI). Our modified ViT architecture incorporates temporal sequencing by adjusting the convolutional patch embeddings and positional embeddings, making the model sensitive to both the chronological progression and the spatial distribution of climatic factors. Preliminary evaluations indicate that the model can forecast drought conditions on a global scale. We support these findings with performance metrics that illustrate the model's ability to interpret and predict the complex, non-linear, and non-stationary patterns of drought.
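The input construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study. The function name `build_windows`, the `horizon` parameter (how many SPEI months follow each window), the variable-major channel order, and the toy grid size are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def build_windows(sst, t2m, tp, spei, window=12, horizon=1):
    """Slide a `window`-month window over monthly maps of shape (months, H, W).

    Returns inputs of shape (samples, 3 * window, H, W) and targets of shape
    (samples, horizon, H, W). Channel order is variable-major (all SST months,
    then all T2m months, then all precipitation months); this ordering is an
    assumption, not taken from the paper.
    """
    variables = [sst, t2m, tp]
    n_months = sst.shape[0]
    n_samples = n_months - window - horizon + 1
    if n_samples <= 0:
        raise ValueError("time series too short for the requested window and horizon")

    inputs, targets = [], []
    for start in range(n_samples):
        end = start + window
        # Stack the three variables along the channel axis: (3 * window, H, W).
        inputs.append(np.concatenate([v[start:end] for v in variables], axis=0))
        # Hypothetical target: the SPEI maps for the `horizon` months after the window.
        targets.append(spei[end:end + horizon])
    return np.stack(inputs), np.stack(targets)


# Toy data: 36 months on an 8 x 16 grid (real global grids are far larger).
rng = np.random.default_rng(0)
months, height, width = 36, 8, 16
sst, t2m, tp, spei = (rng.standard_normal((months, height, width)) for _ in range(4))

X, Y = build_windows(sst, t2m, tp, spei)
print(X.shape, Y.shape)  # (24, 36, 8, 16) (24, 1, 8, 16)
```

Each sample carries 36 channels, so a convolutional patch embedding applied to it would see all 12 months of all three variables at every spatial patch. That is one plausible reading of how the temporal sequence enters the modified ViT.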