Abstract
Unlike English, Chinese text has no natural separator, such as a
space, between words, which makes Chinese word segmentation a difficult
information processing problem. Geological texts contain a large number
of unregistered (out-of-vocabulary) geological terms, and existing
rule-based, machine-learning, and deep-learning methods still cannot
solve the word segmentation problem in geology, especially for these
unregistered words. In this paper, we explore a
dual-corpus, deep-learning-based approach to geological text
dictionaries and compare it with the general-domain dictionary method
and the single-corpus deep-learning dictionary method. Our experiments
show that the proposed method significantly outperforms the other
methods in open testing, with a precision of 92.56%, a recall of 91.44%,
and an F1 score of 92.00%. The proposed segmentation method effectively
identifies unregistered geological terms while maintaining the
recognition rate for common words, which lays a foundation for natural
language processing in the geoscience domain.
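
As an illustration of how the reported metrics relate to one another, the sketch below scores a segmentation by comparing word spans (character offsets) between a reference and a prediction. Span-based scoring is the common convention for word segmentation, but it is an assumption here that the paper scores its results this way. The function names and the toy sentence are hypothetical and are not taken from the paper's data.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return (precision, recall, F1) for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the reference segmentation.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical): the term "黑云母" is wrongly split in two.
gold = ["黑云母", "花岗岩", "分布", "广泛"]
pred = ["黑", "云母", "花岗岩", "分布", "广泛"]
p, r, f = segmentation_prf(gold, pred)
print(p, r, round(f, 4))

# Consistency check on the figures reported in the abstract.
print(round(f1_from(92.56, 91.44), 2))
```

In the toy sentence, three of the five predicted words match the reference, so precision is 3/5 and recall is 3/4. The final line confirms that the reported F1 of 92.00% is the harmonic mean of the reported precision and recall.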