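A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard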

Paper title

A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard

Authors

Deza, J. Ignacio; Ihshaish, Hisham; Mahdjoubi, Lamine

Abstract

We introduce the first automated models for classifying natural language descriptions provided in cost documents called "Bills of Quantities" (BoQs), popular in the infrastructure construction industry, into the International Construction Measurement Standard (ICMS). The models we deployed and systematically evaluated for multi-class text classification are learnt from a dataset of more than 50 thousand item descriptions retrieved from 24 large infrastructure construction projects across the United Kingdom. We describe our approach to language representation and subsequent modelling to examine the strength of contextual semantics and temporal dependency of language used in construction project documentation. To do that, we evaluate two experimental pipelines for inferring ICMS codes from text, on the basis of two different language representation models and a range of state-of-the-art sequence-based classification methods, including recurrent and convolutional neural network architectures. The findings indicate that a highly effective and accurate ICMS automation model is within reach, with reported results above 90% F1 score on average across 32 ICMS categories. Furthermore, due to the specific nature of language use in BoQ text (short, largely descriptive and technical), we find that simpler models compare favourably in achieving higher accuracy results. Our analysis suggests that information is more likely embedded in local key features in the descriptive text, which explains why a simpler generic temporal convolutional network (TCN) exhibits comparable memory to recurrent architectures with the same capacity, and subsequently outperforms these at this task.
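
The abstract attributes the TCN's performance to local key features in short descriptive text. As an illustration only, the sketch below shows the core operation of a generic TCN, a causal dilated 1-D convolution, together with the receptive field of a stack of such layers. It is not the authors' implementation: the function names, the pure-Python formulation, and the assumption of one convolution per dilation level are all hypothetical choices made for this example.

```python
from typing import List


def causal_dilated_conv1d(
    x: List[float], kernel: List[float], dilation: int = 1
) -> List[float]:
    """Causal 1-D convolution (hypothetical helper, for illustration).

    output[t] depends only on x[t], x[t - d], x[t - 2d], ... so no
    information flows backwards from later positions in the sequence.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation  # look back j * dilation steps
            if idx >= 0:  # positions before the start act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Input positions that can influence one output of a layer stack.

    Assumes one convolution per dilation level; TCN variants that use
    several convolutions per residual block have a larger receptive field.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

For example, `causal_dilated_conv1d([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0], dilation=2)` returns `[1.0, 2.0, 4.0, 6.0, 8.0]`, since each output adds the current value to the value two steps earlier. Under the stated assumption, `receptive_field(2, [1, 2, 4])` gives 8, so a few layers already cover a sequence of that length, which may be sufficient for short item descriptions.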
