Abstract
Silent Speech Recognition (SSR) based on Surface Electromyography (sEMG)
is a voice interaction technology proposed for scenarios that require
silent operation. In this article, we formulate sEMG-based SSR as a
short-term image sequence classification task. We perform
time-frequency domain feature extraction and data reconstruction on the
muscle activity segment data. We also analyze the temporal and spatial
dimensions of the data to capture the intrinsic correlations of muscle
activity. We propose the SVIT-SSR model, based on the Vision
Transformer (ViT) framework. Finally, we design experiments to identify
33 typical silent speech commands in the SSR dataset. The
results show that the proposed model achieves an accuracy of
96.67 ± 1.15%, outperforming comparable algorithms.
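
To make the "image sequence" formulation concrete, the following is a minimal sketch of how one multi-channel sEMG activity segment might be turned into a short sequence of channel-by-frequency images using a framed FFT. It is an illustration only, not the reconstruction used in this work: the function name segment_to_image_sequence, the frame length, the hop size, the Hann window, the log-magnitude scaling, the 8-electrode layout, and the synthetic input are all assumptions.

```python
import numpy as np


def segment_to_image_sequence(segment, frame_len=64, hop=32):
    """Turn one sEMG activity segment into a short sequence of 2-D 'images'.

    segment: array of shape (n_samples, n_channels).
    Returns an array of shape (n_frames, n_channels, frame_len // 2 + 1):
    one channel-by-frequency log-magnitude image per time frame.
    All parameter defaults are illustrative assumptions.
    """
    segment = np.asarray(segment, dtype=float)
    n_samples, n_channels = segment.shape
    if n_samples < frame_len:
        raise ValueError("segment is shorter than one frame")

    n_frames = 1 + (n_samples - frame_len) // hop
    # Hann taper applied to every frame; broadcast across channels.
    window = np.hanning(frame_len)[:, None]

    frames = []
    for i in range(n_frames):
        chunk = segment[i * hop : i * hop + frame_len] * window
        # Magnitude spectrum along time: shape (n_bins, n_channels).
        spectrum = np.abs(np.fft.rfft(chunk, axis=0))
        # Log compression, then transpose to (n_channels, n_bins).
        frames.append(np.log1p(spectrum).T)
    return np.stack(frames)


# Hypothetical input: 512 samples from 8 electrodes of synthetic noise.
rng = np.random.default_rng(0)
fake_segment = rng.standard_normal((512, 8))
images = segment_to_image_sequence(fake_segment)
print(images.shape)
```

Each frame keeps the electrode axis (spatial) and the frequency axis, while the frame index carries the temporal order, so a sequence model such as a ViT-style classifier can attend over both dimensions.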