As an essential task for the architecture, engineering, and construction (AEC) industry, information processing and acquisition from unstructured textual data based on natural language processing (NLP) are gaining increasing attention. Although deep learning (DL) models have been widely applied to NLP tasks, domain-specific pretrained DL models and their advantages have seldom been investigated in the AEC domain. Therefore, this work develops the first large-scale domain corpora and domain-specific pretrained language models for the AEC domain, and then systematically investigates how various transfer learning strategies and fine-tuning techniques affect the performance of pretrained DL models on different NLP tasks. First, both in-domain and close-domain Chinese corpora are developed. Then, two types of pretrained models, static word embedding models and contextual word embedding models, are pretrained on the various domain corpora. Finally, several widely used DL models for NLP tasks are further trained and tested on the basis of the various pretrained models. The results show that the domain corpora have opposite effects on static word embedding-based DL models in the text classification (TC) and named entity recognition (NER) tasks, but consistently improve the performance of contextual word embedding-based (BERT-based) DL models in all tasks. Meanwhile, contextual word embedding-based DL models significantly outperform static word embedding-based DL models in the TC and NER tasks, with maximum improvements of 3.8% and 8.1% in the F1 score, respectively. This research contributes to the body of knowledge in two ways: (1) demonstrating the advantages of domain corpora and domain-specific pretrained DL models for NLP tasks in the AEC domain, and (2) open-sourcing the first domain corpora and pretrained language models, named ARCBERT, for the AEC domain. Thus, this work sheds light on the adoption and application of pretrained models in the AEC domain.
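For concreteness, the sketch below illustrates the kind of transfer-learning pipeline summarized above: a pretrained contextual language model is fine-tuned end-to-end for a downstream TC task with the HuggingFace transformers library. This is a minimal illustration, not the authors' released code; the checkpoint name bert-base-chinese merely stands in for a domain-specific checkpoint such as ARCBERT (whose distribution path is not given in this text), and the example sentences and labels are hypothetical.

```python
# Minimal sketch of the fine-tuning step: adapt a pretrained contextual
# language model to a downstream text classification (TC) task.
# NOTE: "bert-base-chinese" is a generic stand-in for a domain-specific
# checkpoint such as ARCBERT; the sentences and labels are hypothetical.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["该建筑的抗震设计符合规范要求。",   # hypothetical AEC sentences
         "本合同约定的总工期为三百天。"]
labels = torch.tensor([0, 1])              # hypothetical class ids

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)     # adds a fresh classification head

# Tokenize the toy batch; padding/truncation yield fixed-length tensors.
batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                         # a few illustrative steps
    loss = model(**batch, labels=labels).loss  # cross-entropy computed inside
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: argmax over the class logits.
model.eval()
with torch.no_grad():
    print(model(**batch).logits.argmax(dim=-1).tolist())
```

Fine-tuning all weights at a small learning rate (on the order of 2e-5) for a few epochs is the standard BERT transfer-learning recipe; a NER setup would follow the same pattern with a token-level head (e.g., AutoModelForTokenClassification) instead of a sequence-level one.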
The authors are grateful for the financial support received from the National Natural Science Foundation of China (No. 51908323, No. 72091512), the National Key R&D Program (No. 2019YFE0112800), and the Tencent Foundation through the XPLORER PRIZE.