Precise classification of earthquake types is crucial for elucidating the relationship between volcanic earthquakes and volcanic activity. However, traditional methods rely on subjective human judgment and require considerable time and effort. To address this, we developed a deep learning model based on a transformer encoder for more objective and efficient classification. Tested on the diverse seismic activity of Mount Asama, our model achieved high F1 scores (0.876 for tectonic earthquakes, 0.964 for low-frequency earthquakes, and 0.995 for noise), comparable to or better than those of other methods. Visualization of the attention weights indicates that the model focuses on the seismic signal features that are critical for classification, much as an expert analyst does. However, our results also show that enhancing the interpretability of the model requires removing subjective elements from the training data and labeling it in a standardized way based on waveform features. Additionally, the analyses suggest that stations near the volcanic crater are essential for accurate and highly interpretable classification.
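For readers unfamiliar with the metric, the following is a minimal sketch of how per-class F1 scores such as those quoted above can be computed from true and predicted labels. It is an illustration only, not the evaluation code used in this study: the function name `per_class_f1` and the label strings are hypothetical, and the toy labels in the usage note are made up for demonstration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each label one-vs-rest.

    F1 = 2*TP / (2*TP + FP + FN), i.e. the harmonic mean of
    precision and recall for that label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly assigned to this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted samples gets 0.0 by convention.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores
```

As a usage example, calling `per_class_f1` with the true labels `["tectonic", "tectonic", "low_frequency", "low_frequency", "noise", "noise"]` and the predictions `["tectonic", "low_frequency", "low_frequency", "low_frequency", "noise", "noise"]` penalizes both classes involved in the single misclassification, while the noise class is unaffected.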