Speech-to-text is essential because it converts spoken words into text, making speech easy to store. A basic speech-recognition pipeline comprises four stages: signal pre-processing, feature extraction, feature selection, and modeling. A substantial body of literature documents efforts to improve speech-recognition results; however, work remains on reducing the word error rate and improving accuracy on a continuous input stream without increasing the required bandwidth. This research evaluates recurrent neural networks, long short-term memory networks, gated recurrent units, and bi-directional long short-term memory networks, and further tests performance after introducing a bias into the long short-term memory. It then proposes a bi-directional long short-term memory recurrent neural network model. Experimental results demonstrate that even with a bias of one on the long short-term memory, the bi-directional long short-term memory recurrent neural network model still achieves better results, with a word error rate of 8.92%, an accuracy of 91.08%, and a mean edit distance of 0.1910 on the LibriSpeech training dataset. Future work will evaluate transformer models for reducing the word error rate and improving accuracy on a continuous input stream.
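The word error rate and mean edit distance reported above are conventionally derived from the word-level Levenshtein distance between a reference transcript and the model's hypothesis; the abstract does not give the exact scoring code, so the following is a minimal illustrative sketch of that standard computation, not the authors' implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deleting all i reference tokens
    for j in range(n + 1):
        dp[0][j] = j  # inserting all j hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[m][n]

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# One deletion ("the") against a six-word reference gives WER = 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Accuracy then follows as 1 − WER (e.g. a WER of 8.92% corresponds to the reported 91.08% accuracy), and the mean edit distance is the same Levenshtein distance averaged over utterances rather than normalized by reference length.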