The Global-Local Attention Transformer (GLAT) integrates global and local features for defect classification and detection in Geographic Information System (GIS) imagery. GLAT comprises a Vision Transformer (ViT) backbone and a Global-Local Attention (GGLA) module, which uses localized pooling and convolutions to strengthen channel interactions. Input features are first segmented into groups; pooling and convolutions are then applied, and attention weights are generated with a sigmoid function. These weights refine the features by emphasizing salient information and suppressing irrelevant detail. In experiments, GLAT achieves a classification accuracy of 83.4%, outperforming conventional Convolutional Neural Networks (CNNs). Moreover, adding the GGLA module to various ViT configurations improves accuracy by an average of 0.9 percentage points. These results indicate that GLAT strengthens the recognition of both global structure and local detail, making it a promising approach to GIS defect detection and a step forward for the field.
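The attention pipeline described above (segment features into groups, pool, convolve, then gate with sigmoid weights) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `gla_attention`, the group count, the fixed averaging kernel, and the use of global average pooling are all hypothetical choices made for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gla_attention(x, num_groups=4, k=3):
    """Hypothetical sketch of grouped global-local attention gating.

    x: feature map of shape (C, H, W), with C divisible by num_groups.
    Each channel group is pooled spatially (global context), mixed with
    a small 1-D convolution across channels (local interaction), and
    the input is reweighted by sigmoid attention weights.
    """
    C, H, W = x.shape
    gs = C // num_groups                  # channels per group
    kernel = np.ones(k) / k               # illustrative smoothing kernel
    out = np.empty_like(x)
    for g in range(num_groups):
        sl = slice(g * gs, (g + 1) * gs)
        pooled = x[sl].mean(axis=(1, 2))                  # global average pool
        mixed = np.convolve(pooled, kernel, mode="same")  # local channel mixing
        w = sigmoid(mixed)                                # weights in (0, 1)
        out[sl] = x[sl] * w[:, None, None]                # reweight features
    return out
```

Because the sigmoid bounds each weight in (0, 1), the gated output never exceeds the input feature magnitude; salient channels are attenuated least, which matches the abstract's description of highlighting pivotal information while diminishing irrelevant detail.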