Word segmentation of Chinese texts in the geoscience domain using the BERT model
  • Dongqi Wei,
  • Zhihao Liu,
  • Dexin Xu,
  • Kai Ma,
  • Liufeng Tao,
  • Zhong Xie,
  • Qinjun Qiu,
  • Shengyong Pan
Dongqi Wei
National Engineering Research Center of Geographic Information System
Zhihao Liu
National Engineering Research Center of Geographic Information System, Wuhan 430074, China
Dexin Xu
Wuhan Geomatics Institute, Wuhan 430074, China
Kai Ma
Unknown
Liufeng Tao
China University of Geosciences
Zhong Xie
China University of Geosciences
Qinjun Qiu
China University of Geosciences

Corresponding Author:[email protected]

Shengyong Pan
Wuhan Zondy Cyber Science & Technology Co., Ltd., Wuhan, China

Abstract

Unlike English, Chinese text has no natural delimiter, such as a space, between words, which makes Chinese word segmentation a difficult information processing problem. Geological texts contain a large number of unregistered (out-of-vocabulary) geological terms, and existing rule-based, machine-learning and deep-learning methods still cannot segment geological text reliably, especially where such unregistered words are frequent. In this paper, we explore a dual-corpus, deep-learning-based approach to segmenting geological text and compare it with a general-domain dictionary method and a single-corpus deep learning method. Our experiments show that the proposed method is significantly better than the other methods in open testing, with a precision of 92.56%, a recall of 91.44% and an F1 of 92.00%. The proposed method identifies unregistered geological terms effectively while preserving the recognition rate of common words, which lays a foundation for natural language processing in the geoscience domain.
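
The abstract reports precision, recall and F1 but does not describe the scoring procedure. As a minimal sketch, assuming the common word-level convention in which a predicted word counts as correct only when its character span matches a gold-standard word exactly, the three figures could be computed as follows. The function names and the example sentence are illustrative assumptions, not taken from the paper.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one segmented sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical example: the geological term is wrongly split in the prediction.
gold = ["花岗岩", "侵入", "体"]
pred = ["花岗", "岩", "侵入", "体"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")

# Consistency check on the figures quoted in the abstract.
print(round(f1_score(0.9256, 0.9144) * 100, 2))
```

Under this convention, splitting a single unregistered term into two pieces lowers both precision (two wrong predicted words) and recall (one missed gold word), which is why out-of-vocabulary terms weigh heavily on the score. The reported F1 of 92.00% is consistent with the harmonic mean of the reported precision and recall.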