ARCBERT: Largescale Domain Specific Dataset and Pretrained Language Model for AEC Industry


Sine early 2019, our group devotes most of our efforts on developing new methods to extract and learn complex domain knowledge from textual documents such as building codes, construction documents. To efficiently extract and transfer prior knowledge hidden in domain documents, we developed the first largescale domain specific corpora and pretrained language model based on BERT, which outperformed traditional methods in various NLP tasks with maximum improvement of 8.1% You can download the dataset, pretrained models and algorithms here for research and exploration purpose. Latest updates of the dataset, pretrained models and algorithms could be found at github page.

If our work is adopted or used in your work, please cite the following articles:

Zheng, Z., Lu, X.Z., Chen, K.Y., Zhou, Y.C., Lin, J.R.* (2022). Pretrained Domain-Specific Language Model for Natural Language Processing Tasks in the AEC Domain. Computers in Industry, 142, 103733.

Leave a Comment