A Machine Learning Approach To Classifying Construction Cost Documents into the International Construction Measurement Standard
Abstract
We introduce the first automated models for classifying natural language descriptions provided in cost documents
called “Bills of Quantities” (BoQs), popular in the infrastructure construction industry, into the International Construc-
tion Measurement Standard (ICMS). The presented analysis and models are aimed at vitalising the adoption of ICMS
and thus providing benchmarkers with an effective automated tool to allow for project comparison in a more granular
way. The presented study addresses the challenges involved and sets forth models to facilitate widespread, effective
analysis of cost and performance in infrastructure construction projects. The models we deployed and systematically evaluated
for multi-class text classification are learnt from a dataset of more than 50 thousand descriptions of items retrieved
from 24 large infrastructure construction projects across the United Kingdom.
We describe our approach to language representation and subsequent modelling to examine the strength of con-
textual semantics and temporal dependency of language used in construction project documentation. To do that we
evaluate two experimental pipelines for inferring ICMS codes from text, on the basis of two different language rep-
resentation models and a range of state-of-the-art sequence-based classification methods, including recurrent and
convolutional neural network architectures.
The findings indicate that a highly effective and accurate ICMS automation model is within reach, with a reported
average F1 score above 90% across 32 ICMS categories. Furthermore, due to the specific nature of the language used
in BoQ text (short, largely descriptive and technical), we find that simpler models compare favourably, achieving
higher accuracy. Our analysis suggests that the relevant information is more likely embedded in local key features
of the descriptive text, which explains why a simpler, generic temporal convolutional network (TCN) exhibits memory
comparable to recurrent architectures of the same capacity, and subsequently outperforms these at this task.
Keywords: Natural language processing (NLP), Deep learning, Automation in Building Information Modelling
(BIM), Artificial Intelligence (AI), ICMS, Short text classification, Recurrent and convolutional neural networks
(LSTM, GRU, CNN), Temporal convolutional networks (TCN).
Email: {ignacio.deza, hisham.ihshaish}@uwe.ac.uk.
Figure 1: ICMS is designed to allow comparison between BoQ-like documents which–in general–don’t have means of comparison apart from the
text description. Although many very detailed rules of measurement and naming conventions exist, they tend to be so granular that the comparison
between items becomes harder because similar items fall in different categories. For this reason, ICMS is designed as a high-level, elementary cost
and carbon measurement standard, with the potential to make cost documents more transparent and international.
For these reasons the ICMS project presents itself as a very high-profile venture, with the potential to disrupt present construction methodologies towards a more transparent, international and sustainable future.

The work presented in this article aims to foster adoption of this new standard. Adoption of a common standard is usually difficult unless there is an immediate gain for the participants. As with many processes in the construction industry, the classification of BoQs is usually done manually. This makes adoption of an extra standard, one focused on benchmarking and optimisation rather than on day-to-day operations, an extra burden that will more often than not be avoided. Fortunately, the Department for Transport (DfT)1 of the UK has directed many Government-owned companies who administer parts of its infrastructure to comply with the ICMS. This study, which comes under the TIES Living Labs[12] project sponsored by the UK Government, is in line with these efforts.

2. Data and Methods

The natural language descriptions found in the BoQs are predominantly short. Similar to texts analysed in studies of sentiment analysis[13], dialogue systems[14, 15, 16] and user query intent understanding[17], among others, inferring from such text is known to be especially challenging. This is because of the often limited contextual information they are accompanied by, compared to that of long texts found in books and documents.

The contextual information in the analysed texts is naturally present, albeit simpler, compared to the complexity often embedded in other types of natural language. This is largely because BoQ descriptions are considerably condensed, short and strictly descriptive, as they are essentially intended to be informative, with neither emotional nor opinion components to them. This can add to the challenge of extracting the semantics for the classification effort, compared to other types of short texts from media or tweets.

Moreover, the classification of descriptions of tasks in natural language into a set of categories requires a certain degree of interpretation. This problem is aggravated when there are multiple people performing the classifications and when the description lacks proper context. For example, all load-bearing works underground or underwater must be classified as “substructure”, but when they are over ground they become “structure”. There are, however, many structures that can be partially buried due to terrain issues, slopes, etc., so the classification of such items may well be down to subjective judgment. In this way a great variety of slightly different classifications can occur naturally in manual classification, which accentuates the need for an impartial classifier that can resolve these issues in an objective way. As the purpose of this standard is to compare like-for-like, even a systematically erroneous classification is preferable to classifications around a theme which won't match most items as intended.

In this section we describe the dataset and the modelling methods we applied to automate the classification of BoQs into the ICMS standard.

1 https://www.gov.uk/government/organisations/department-for-transport
2.1. Data acquisition and pre-processing

As a part of the TIES Living Labs project, a total of 124 thousand materials and costs items, defined in natural language and originally labelled manually in ICMS, were retrieved from a total of 24 projects from a major UK-based infrastructure construction company.

The data is presented as cost documents, which include the ICMS code, the free-text description to be analysed and a price breakdown per item. The prices were not considered in the study.

Each project contains several thousand lines of cost descriptions, written in natural language presumably by subcontractors executing the tasks or delivering the materials along the supply chain. These pieces of text are relatively short (the median length of the descriptions in the dataset is 14 words, with a maximum of 160 and a minimum of only one word). The encoding (each sample mapped to an ICMS code) has been performed manually by quantity surveyor experts with the support of the (British) Royal Institution of Chartered Surveyors (RICS2), which is part of the ICMS Coalition that aims at promoting widespread adoption of ICMS as a global standard.

The number of unique ICMS categories with at least one entry in the original dataset is 72 (from the overall total of 109 categories present in the cost side of the standard). However, many of these categories contained only a handful of items and were as a result discarded in this study. Only 32 categories contained sufficient samples for use in the presented study, reducing the total size of the dataset from 123210 to 51906 items, having additionally removed duplicated samples. We established a cut-off of 250 samples per ICMS category, such that all ICMS categories with fewer samples were removed from the dataset. The distribution of items/samples per ICMS encoding is provided in Fig. 2.

The following synthetic examples bear a very close resemblance to the text analysed in this study, both in wording and in length; the original data is protected by a non-disclosure agreement, and therefore cannot be shared publicly at this stage:

• “Galvanised high adherence reinforcing strips acting as soil reinforcement.”
• “Take down and remove to tip off Site unlit traffic sign including 4 posts.”
• “Installation of wildlife tunnel XX m in length as per diagram XX.”
• “Geophysical Survey in accordance with drawing XX.”
• “Termination of optic fibre cable to XX equipment cabinet Type YY.”
• “Power reduction joint of XX mm2 to XX mm2.”

Since the retrieved descriptions have been recorded by a large number of different people, varying levels of the complexities inherent in natural language were present as a result. We found distinguishable differences in the level of detail provided across data samples, e.g. references to internal codes, drawings, diagrams, scales, etc. The same information is additionally recorded in many inconsistent ways, i.e. ‘cable 10 m’, ‘cable 10 meter’, ‘cable 10m’, ‘ten metre cable’, which are essentially intended to record the same information. Additionally, many words have been found misspelled in at least one way.

Such inconsistencies were considered as the data was cleansed: special characters, including punctuation, and numbers were removed.

2.2. Classification Methods

For language representation and subsequent modelling, we considered two different approaches: an explicit representation of text with a vector space model[18] based on term(s) occurrence, and an implicit representation, using a word embedding approach[19], so that contextual semantics beyond term occurrence are represented to learn the corresponding target labels.

For term occurrence models we used the popular n-gram “bag-of-words” (BoW), whereby each unique term (or set of n terms) is considered as an independent dimension of the term space and is “one-hot” encoded as a sparse vector. Different weightings for term occurrence were additionally evaluated: a binary one-hot encoding, term frequency, and the popular Term Frequency-Inverse Document Frequency (TF-IDF)[20]. Although popular, allowing models to learn the corresponding targets based on local key features, this approach is nonetheless limited as it considers terms in text to be independent, and as a result the semantic term-term dependence is entirely disregarded.

2 https://www.rics.org/uk/
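As an illustration of the pre-processing and uni-gram TF-IDF representation described in this section, the following is a minimal scikit-learn sketch. It is not the authors' implementation: the column names, example rows and cleaning rules are assumptions used only to show the general shape of the pipeline.

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def clean_description(text: str) -> str:
    """Lower-case a BoQ description and strip punctuation, special characters and numbers."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits, punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical input: one row per BoQ item, with its free-text description and ICMS label.
boq = pd.DataFrame({
    "description": [
        "Installation of wildlife tunnel 25 m in length as per diagram 3.",
        "Take down and remove to tip off Site unlit traffic sign including 4 posts.",
    ],
    "icms_code": ["1.02.030", "1.05.080"],
})
boq["clean"] = boq["description"].apply(clean_description)

# Uni-gram bag-of-words with TF-IDF weighting, as used in Pipeline 1.
vectoriser = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
X = vectoriser.fit_transform(boq["clean"])   # sparse document-term matrix
print(X.shape, len(vectoriser.get_feature_names_out()))
```

Binary one-hot and raw term-frequency weightings can be obtained in the same way by swapping the vectoriser; stemming and lemmatisation, reported in Figure 3, would be added to the cleaning step.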
Figure 2: A histogram of the dataset; sample numbers per ICMS codes showing data imbalance. A cutoff of 250 is applied to exclude the overly
under-represented ICMS codes from the analysis. (inset) The categories below the threshold and thus not included in the study.
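The per-category counts and the 250-sample cutoff shown in Figure 2 could be applied roughly as follows; this is a sketch assuming the data sits in a pandas DataFrame with hypothetical description and icms_code columns, not the authors' exact code.

```python
import pandas as pd

MIN_SAMPLES = 250  # cutoff per ICMS category, as described in Section 2.1

def apply_category_cutoff(boq: pd.DataFrame, label_col: str = "icms_code") -> pd.DataFrame:
    """Drop duplicated items and remove ICMS categories with fewer than MIN_SAMPLES samples."""
    boq = boq.drop_duplicates(subset=["description", label_col])
    counts = boq[label_col].value_counts()
    kept = counts[counts >= MIN_SAMPLES].index
    return boq[boq[label_col].isin(kept)].reset_index(drop=True)

# After filtering, the 72 originally populated categories reduce to the 32 used in the study.
# filtered = apply_category_cutoff(boq)
# print(filtered["icms_code"].nunique(), len(filtered))
```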
On the other hand, word vectors[19], also known as word embeddings, provide a much more semantics-aware representation of language, where each word in the vocabulary V is embedded into a real-valued vector in a dense space of concepts of dimension d << |V|. The d dimensions encode concepts shared by all words, rather than a statistic relative to each unique word. This generally allows for richer word-word relationship information than the language representation of BoW models. Word vectors may either be initialised randomly and trained along with the machine learning models on a specific text mining task, or taken from pre-trained vectors. We evaluated both approaches, and used a pre-trained word2vec[21] model for the latter.

To learn ICMS classes from the possible contextual information provided in the BoQs, we further evaluated a set of deep learning methods, shown in Fig. 3b, which are widely used as sequence processing models. In particular we evaluated two RNN (recurrent neural network) architectures: the bidirectional LSTM, or BiLSTM, a bidirectional RNN consisting of a forward LSTM[22] unit and a backward LSTM unit to enhance the ability of the network to capture context information, and the simpler BiGRU[23], a bidirectional gated recurrent unit that combines the output states of a forward GRU[24] and a reverse GRU. Both models make up for the shortcomings of the basic RNN architecture, which is additionally known to be notoriously difficult to train[25, 26], in extracting global features from a text sequence, and have been widely used with notable improvement over basic LSTM and GRU architectures in a range of applications to text and speech processing[27, 28, 29, 30, 31].

We additionally evaluated a convolutional neural network (CNN)[32], which has been applied to model sequences for decades, and more recently to tasks of text classification, e.g., sentence classification[33, 34], document classification[35, 36, 37] and sentiment analysis[38, 39]. ConvNets utilise multiple convolution kernels of different sizes to extract key information in sentences, which can capture the local relevance of text. A variant of CNNs, the Temporal Convolutional Network (TCN), was recently proposed and has shown promising performance over standard CNN and RNN architectures on different NLP benchmarks[40]. We evaluated a TCN architecture primarily because, contrary to CNNs, which can only work with fixed-size text inputs and usually focus on terms in immediate proximity due to their static convolutional filter size, it applies techniques such as multiple layers of dilated convolutions and padding of input sequences in order to handle different sequence lengths and capture dependencies between terms that are not necessarily adjacent, but instead are positioned in different places in a sequence. This could potentially emphasise the strength of a signal which can be dispersed in a given BoQ sequence, e.g. (from the previous examples) “Take down and remove to tip off Site unlit traffic sign including 4 posts.”, regardless of the terms’ proximity.

2.2.1. Experiments and Model Description

Associating the samples provided in the BoQs with ICMS categories can be learned by machine learning methods for classification, casting the task as a supervised learning problem for natural language processing. For the studied dataset S = {s1, s2, s3, ..., sn}, where s1, s2, ..., sn are the short texts provided in the independent BoQs and |S| = 51906, the corresponding ICMS categories y1, y2, y3, ..., ym are provided as ground-truth labels. Each BoQ item si is associated with a unique ICMS category yi ∈ Y, where |Y| = 32.
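For the implicit representation, the pre-trained Word2Vec vectors mentioned above could be loaded and mapped onto the corpus vocabulary along the following lines. This is a sketch assuming the publicly distributed Google News binary file and gensim, with hypothetical variable names, rather than the authors' exact code.

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer

EMBEDDING_DIM = 300  # dimension used for both the learned and the pre-trained embeddings

# Fit a tokenizer on the cleaned BoQ descriptions (hypothetical `texts` list).
texts = ["galvanised high adherence reinforcing strips acting as soil reinforcement"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index

# Load the pre-trained Google News Word2Vec vectors (300 dimensions).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Build the |V| x d embedding matrix; out-of-vocabulary words keep a zero vector.
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, idx in tokenizer.word_index.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]
# `embedding_matrix` can then initialise a Keras Embedding layer (frozen or trainable).
```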
Figure 3: Schematic view of the two modelling pipelines. (a) Pipeline 1 examining the performance of different classification models with bag-of-words for language representation; a uni-gram representation of the BoW model is used, with the total number of unique terms reduced to 6045 after applying stemming, lemmatisation and stop-word removal. (b) Pipeline 2 examining the performance of deep learning models with a word embedding layer; an embedding layer is applied to learn a word representation of 300 dimensions (d = 300 and |V| = 16800), or otherwise a pre-trained word embedding is used (see details on the use of Word2Vec).

To evaluate how the performance of classification models based on the two language representation approaches compares, especially given the unique characteristics of language use in the BoQs (short and of uneven size, application-specific and predominantly descriptive), we evaluated classification methods in two different experimental settings, as shown in Fig. 3.

In Pipeline 1, we trained and fine-tuned Support Vector Machine (SVM)3[41], Random Forest3[42, 43] and multilayer perceptron (MLP)4 algorithms. The MLP, a fully connected feed-forward neural network, is trained with an input layer of size 300, an additional hidden layer of size 50 and a softmax output layer of size 32.

In Pipeline 2 we evaluated the three models of BiLSTM, BiGRU and TCN, each applying two different word embeddings as in [21]: vectors learned in an embedding layer with randomly initialised weights, or pre-trained Word2Vec embeddings (trained on the Google News corpus containing 100 billion words)5. We use a 300-dimension word embedding for both the learned and the pre-trained case. The learning process therefore consists of first obtaining the semantic representation of each text through model training (or using the pre-trained skip-gram model), giving the vector representation of the words. Subsequently, the vector representation of each word is input into each model for further analysis and extraction of semantics. The final word vector is then connected to a softmax layer of size 32 for text classification.

For all neural network models, including the MLP, we use ADAM[44] for learning, with a learning rate of 0.01. The batch size is set to 64. The training epochs are set to 40. We employed the BiLSTM model as in [45], with two hidden layers of size 64. The BiGRU model is similarly trained with two hidden layers of the same size. A dropout rate of 0.5 is applied to both. All models in Pipeline 2 were implemented using Keras6 and TensorFlow7.

The TCN uses a 1D CNN layer, followed by two layers of dilated 1D convolution. We apply an exponential dilation d = 2^i for layer i in the network. Acausal convolutions are applied in the TCN so that target labels can be learnt as a function of terms at any time step in the sequence (contrary to the causal convolutions used in Wavenet[46]), with the kernel size set to 3 and 100 filters in each layer.

The dataset is split into a training and validation set (development set) of 80% of the entire corpus, used to train and fine-tune the models, and a test set of 20% (resulting in 10242 samples) to evaluate the different models' performance. The models were fine-tuned optimising the categorical cross-entropy l = − Σ_{c=1}^{C=32} y_{s,c} log(p_{s,c}), where p_{s,c} is the predicted probability that observation s is of class c, out of the 32 ICMS classes in the dataset.

2.3. Results and Analysis

We used TF-IDF with a Multinomial Naïve Bayes[47] classifier as a baseline model; a count vectoriser and uni-gram model with a feature set size of 6045 were used. Similarly, the CNN of [34], a classic baseline for text classification, based on the pre-trained word embedding, is additionally used as a baseline model.

3 The model was built using Scikit-learn: scikit-learn.org
4 The model was built using TensorFlow: tensorflow.org
5 https://code.google.com/p/word2vec/
6 F. Chollet. Keras. https://github.com/fchollet/keras, 2015
7 Software available from tensorflow.org
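A minimal Keras sketch of the Pipeline 2 recurrent models described in Section 2.2.1 is given below, reflecting the reported settings (two bidirectional hidden layers of size 64, dropout of 0.5, a 32-way softmax, ADAM with a learning rate of 0.01, batch size 64 and 40 epochs). The vocabulary and embedding sizes follow Figure 3; everything else (variable names, exact layer ordering) is an assumption for illustration, not the authors' implementation.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

VOCAB_SIZE = 16800   # |V| reported in Figure 3
EMBED_DIM = 300      # d = 300; alternatively initialised from the Word2Vec matrix sketched earlier
NUM_CLASSES = 32     # ICMS categories retained in the study

def build_bilstm():
    """BiLSTM with two bidirectional hidden layers of size 64 and a 32-way softmax output."""
    model = Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM),
        Bidirectional(LSTM(64, return_sequences=True)),
        Bidirectional(LSTM(64)),
        Dropout(0.5),                      # dropout rate of 0.5, as reported
        Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Training settings reported in the text: batch size 64, 40 epochs, 80/20 train-test split.
# model = build_bilstm()
# model.fit(X_train, y_train_onehot, validation_split=0.2, batch_size=64, epochs=40)
```

A BiGRU variant would simply swap the LSTM units for GRU units of the same size.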
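The generic TCN described above can be approximated with stacked dilated 1D convolutions. The sketch below uses plain Keras Conv1D layers with dilation rates following 2^i, kernel size 3 and 100 filters per layer, and 'same' (acausal) padding; the pooling layer and the exact stack depth around the two dilated layers are assumptions rather than the authors' configuration.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_tcn(vocab_size=16800, embed_dim=300, num_classes=32):
    """TCN-style stack: one 1D convolution followed by two dilated 1D convolutions.

    'same' padding gives the acausal behaviour described in the text, so the label can
    depend on terms at any position in the description; dilation rates follow d = 2^i.
    """
    model = Sequential([
        Embedding(vocab_size, embed_dim),
        Conv1D(100, kernel_size=3, padding="same", activation="relu"),
        Conv1D(100, kernel_size=3, padding="same", dilation_rate=2, activation="relu"),
        Conv1D(100, kernel_size=3, padding="same", dilation_rate=4, activation="relu"),
        GlobalMaxPooling1D(),   # pooling over time is an assumption, not stated in the paper
        Dropout(0.5),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```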
Pipeline   Model   Embedding   Accuracy   Macro F1
1          NB      BoW         0.861      0.857
1          RF      BoW         0.922      0.918
…          …       …           …          …

The feed-forward MLP outperforms all models, achieving a ≥ 90% F1 score on 25 ICMS categories, which again confirms the suggestion that despite the […] [10].
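The accuracy and macro-averaged F1 figures above, as well as the per-category scores plotted in Figures 4 and 6, can be computed with scikit-learn along these lines; a sketch with hypothetical prediction arrays, not the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, f1_score, classification_report

# y_test and y_pred hold the true and predicted ICMS codes for the 10242 test samples.
def report_scores(y_test, y_pred):
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("macro F1 :", f1_score(y_test, y_pred, average="macro"))
    # Per-category precision, recall and F1, as in Figures 4 and 6.
    print(classification_report(y_test, y_pred, digits=3))
```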
Figure 4: F1 score per ICMS category recorded for the TCN, RF, BiGRU and MLP on the test set. Black dots mean over 90%, white dots are below
90% F1 score. MLP (bottom) achieves a ≥ 90% F1 Score on 25 ICMS categories, compared to the rest of models with similar performance on
about 22 categories.
Here, a permanent CCTV is presumably of one class, and a temporary one is of another, whereas neither the word “permanent” nor “temporary” was necessarily present. That is, in order to improve inference beyond this point, more training data of a diverse contextual nature has to be provided, and subsequently modelled.

On the other hand, the under-performance of the different models observed on the 4 to 7 categories consistently below a 90% F1 score is partly caused by the same reasons stated earlier, amplified by the long-tailed distribution of samples across the 32 ICMS standards considered in the dataset. The majority of these categories happen to be significantly under-represented in the original dataset, and many of them stand only slightly above the 250-sample cutoff which was applied. This is to be compared to the mean of about 1600 samples per class in the dataset. Highly skewed datasets, where the minority classes are heavily outnumbered by one or more classes, have proven to be a challenge while at the same time becoming more and more common [48]. A conservative solution to this conundrum has been to under-sample by deleting the smallest minority classes, as done here with classes of fewer than 250 samples. Although we applied this limit relatively arbitrarily, it has been set as a trade-off between the classification of a larger number of ICMS categories on the one hand, and model stability on the other.

Alternatively, in the absence of richer (and potentially larger) datasets, methods for data augmentation in NLP (e.g., token-level perturbation like EDA [49], augmentation of misclassified samples [50] and techniques for under- and oversampling like SMOTE [51] and MLSMOTE [52], among others), which have shown improved performance on many text classification tasks, could potentially be applied here9. In this study, however, only the bootstrapping of the random forest was applied, as the overall classification performance was largely up to the mark.

All models were tuned and optimised experimentally. The reported performance of the SVM and Multinomial NB corresponds to their best models tuned with cross-validation (K-fold) on the development set. For the Random Forest we tuned the models on the development set to minimise the estimated out-of-bag (OOB)[53] error, as provided in Fig. 5, which showed noticeable convergence of performance towards a size of 600 classification trees. We additionally report the Precision and Recall scores corresponding to each ICMS category, recorded on the test set for the optimal RF, separately in Fig. 6.10

9 For a comprehensive review of data augmentation methods in NLP the reader is advised to refer to [48].
10 The performance of the RF of 600 trees is reported here. We provide access to the trained model in production alongside the implementation of the evaluated models in this study.
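The OOB-based tuning reported above (and the warm-start behaviour explained in the caption of Figure 5) can be reproduced with scikit-learn roughly as follows; the estimator grid and variable names are assumptions, not the authors' exact configuration.

```python
from sklearn.ensemble import RandomForestClassifier

def oob_error_curve(X_dev, y_dev, max_features="log2", n_trees=range(100, 801, 100)):
    """Grow one forest incrementally (warm_start) and record the OOB error at each size."""
    rf = RandomForestClassifier(max_features=max_features, oob_score=True,
                                warm_start=True, n_jobs=-1, random_state=0)
    errors = []
    for n in n_trees:
        rf.set_params(n_estimators=n)   # adds trees to the existing ensemble
        rf.fit(X_dev, y_dev)
        errors.append((n, 1.0 - rf.oob_score_))
    return errors

# Comparing max_features='sqrt' against 'log2' reproduces the curves in Figure 5;
# the OOB error flattens out at around 600 trees.
```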
As was foreseeable, this again shows some arguably peripheral under-performance of the model on instances of under-represented categories, as described earlier. Despite a better performance having been achieved on these particular samples by the different deep learning models, compared to the RF, it would nonetheless be overenthusiastic to draw conclusive arguments as to why that was.

Figure 5: The out-of-bag (OOB) error of the different RF configurations on the development set. The models with a smaller number of random features at each classification tree, corresponding to log2 |V|, show better skill at the inference task. Similarly, a slight improvement can be achieved when the warm-start (WS) hyper-parameter is activated. WS is a parameter provided by some Scikit-learn models to allow the attributes of an existing fitted model to initialise a new model in a subsequent call to fit.

Figure 6: Precision and Recall scores of the RF on the 32 ICMS categories. Although the average precision is very high, the algorithm is more sensitive to small data samples than the neural network-based models. Compare to Figure 4.

Different configurations of the different ANN models used in this study were evaluated. The best model was saved and its performance reported on the test set. The criteria for initial selection as candidate options included their reported performance in a wide range of language processing applications and benchmarks, whereas the architecture parameters were optimised relative to the classification performance of the models as well as that of their learning. Learning rate and drop-out rates were fixed as reported earlier. Most models showed similar learning performance (and loss minimisation rate) over the training epochs, as shown in Fig. 7, and were able to converge at 15 to 20 training epochs. Again, though, more complex models, e.g., the BiLSTM, whilst converging nearly similarly to the rest of the models, seem more prone to over-fitting, exhibiting a considerable difference between training and validation loss over the successive learning process, and are as such sub-optimally adjusted.

Figure 7: Learning, validation and loss performance for the different ANN architectures on the BoQ dataset. Whilst all models' learning converges quickly in training over the first 10 epochs, and is approximately asymptotic thereafter, the TCN and MLP seem to exhibit less variance with better accuracy performance overall. Recurrent architectures, and especially the BiLSTM, appear to have the highest variance on the dataset as a result of learning a much larger set of parameters.

There is a marginal improvement in performance when the word embedding is learned by the models rather than taken from pre-trained vectors, although at some computational cost, as models using pre-trained vectors were relatively faster to train. Consistently, nonetheless, these models showed higher loss on validation instances. Although the corpus used to train the models can be deemed sufficiently large quantitatively, it is less so semantically, especially due to its descriptive and short nature, and the considerable presence of specialised, and occasionally non-English, language. This can explain the marginal edge achieved by learning an embedding vector for language representation on this corpus compared to a pre-trained one, consistent with the conclusions in [54] on the benefits of learning word embeddings for the construction domain.

In general, the experimental results indicate an effective, high inference skill of all ANN architectures on this task, with comparable results additionally available with RFs. In fact, due to the specific nature of language use in the BoQs (short, descriptive and technical), simpler models achieved better accuracy performance. Both MLP and TCN were able to outperform other, more sophisticated, methods.
As mentioned earlier, the position of a term in a text is only important once the context is inferred from the text. In the case of short texts, context simply isn't provided, and can only be inferred by experts by looking at other variables or based on previous knowledge of the project, most of which is not modelled. The “more-flexible-memory” advantage of RNNs is therefore largely inconsequential at this task, and as a result the TCN exhibited comparable memory to recurrent architectures with the same capacity. The TCN also has a very small number of parameters compared to the BiGRU and BiLSTM networks, and as the texts are too simple to make use of this added complexity, these models tend to overfit and comparatively underperform.
3. Conclusion and Impact

This work presents the first attempt to automate the (still manually handled) mapping of free-text descriptions of work and cost items, from construction cost documents called bills of quantities (BoQs), into the International Construction Measurement Standard (ICMS). This will enable benchmarkers to compare and benchmark the performance of projects at a scale that was not possible before, and facilitate more effective cost and risk analysis in construction projects. To that end we evaluated state-of-the-art machine learning methods to learn multi-class text classification models from 51906 item descriptions, retrieved from 24 different infrastructure construction projects carried out by contractors of publicly owned companies across the United Kingdom.

We considered two approaches to our modelling: one assuming that information signals can be captured from local features of the description text provided in the BoQs, and another on the premise that, alongside local key features, the potential propagation of information and semantics in the text may help improve the learning of ICMS codes. To do that we evaluated a range of classification methods which have been widely used on text classification tasks, including support vector machines, random forests, the multi-layer perceptron, and advanced deep learning architectures commonly used in sequence modelling, including recurrent (LSTM, GRU) and convolutional (CNN, TCN) architectures.

Whilst the results strongly suggest that most models are largely skilful at inferring ICMS standards from the short text provided in the BoQs, we found that simpler models, like the RF, and generic MLP and TCN architectures with minimal tuning, outperform recurrent, more sophisticated, architectures such as LSTMs and GRUs. This is largely down to the nature of the text found in the BoQs. That is, it is considerably condensed, short and strictly descriptive, so much so that its complexity strikes as being a function of abstraction in key term use, rather than of the inherent complexity of semantic dynamics in language use. The “long memory” advantage of RNNs is therefore largely inconsequential at this task, and as a result the TCN exhibited comparable memory to recurrent architectures with the same capacity, while simpler models like the MLP and RF were able to capture the required mapping favourably from local key features.

As adoption of ICMS gains traction, more annotated data will be made available and the evaluated models can be re-trained to learn further ICMS categories. It is therefore hoped that the findings of this study will trigger this process further. To that end, the trained MLP model and the development code of this study are made available to the community and can be readily used11. Consequently, we believe this study presents a compelling case for the construction industry community, both private and public sectors, to prioritise an open data approach along their supply lines, apace with considerable use of tools to ensure frictionless standardisation. We argue this will allow for vital developments in the field leading to a transformative automated benchmarking system.

Acknowledgements

This work has been supported by Innovate UK under Grant N: 08027517 as a part of “Transport infrastructure efficiency strategy living labs” (TIES Living Labs) Project N. 106171.

References

[1] N. Thompson, W. Squires, N. Fearnhead, R. Claase, Digitalisation in construction - industrial strategy review, supporting the government's industrial strategy, Tech. rep., University College London, London (2017).
[2] N. Davies, G. Atkins, D. Slade, How to transform infrastructure decision making in the UK, Tech. rep., Institute for Government, London (2018).
[3] S. Changali, A. Mohammad, M. v. Nieuwland, The construction productivity imperative, McKinsey, 2015.
[4] W. Pan, A. G. Gibb, A. R. Dainty, Leading UK housebuilders' utilization of offsite construction methods, Building Research & Information 36 (1) (2008) 56–67.
[5] P. Fewings, C. Henjewele, Construction project management: an integrated approach, Routledge, 2019.

11 Operational model (MLP) and development code are available
[6] X. Yin, H. Liu, Y. Chen, M. Al-Hussein, Building information modelling for off-site construction: Review and future directions, Automation in Construction 101 (2019) 72–91.
[7] M. El Jazzar, M. Piskernik, H. Nassereddine, Digital twin in construction: An empirical analysis, in: EG-ICE 2020 Workshop on Intelligent Computing in Engineering, Proceedings, 2020, pp. 501–510.
[8] S. Wu, K. Ginige, G. Wood, S. W. Jong, et al., How can building information modelling (BIM) support the new rules of measurement (NRM1), Tech. rep., Royal Institution of Chartered Surveyors (2014).
[9] A. Muse, M. Horner, G. O'Sullivan, C. Fry, A. Aronsohn, D. Baharuddin, P. Bredehoeft, T. Chatzisymeon, R. Fadason, R. Flanagan, et al., ICMS: Global Consistency in Presenting Construction Life Cycle Costs and Carbon Emissions (2021).
[10] C. Mitchell, International construction measurement standards (ICMS) explained, Tech. rep., International Construction Measurement Standards Coalition (ICMSC) (2016). URL https://icms-coalition.org/
[11] M. D. Deo Prasad, A. Kuru, P. Oldfield, L. Ding, C. Noller, B. He, Race to net zero carbon: A climate emergency guide for new and existing buildings in Australia, Tech. rep., Low Carbon Institute (2021).
[12] TIES living lab, https://tieslivinglab.co.uk/, [Online; accessed 8-July-2022] (2022).
[13] X. Li, H. Xie, L. Chen, J. Wang, X. Deng, News impact on stock price return via sentiment analysis, Knowledge-Based Systems 69 (2014) 14–23.
[14] J. Y. Lee, F. Dernoncourt, Sequential short-text classification with recurrent and convolutional neural networks, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, 2016, pp. 515–520.
[15] R. Fellows, H. Ihshaish, S. Battle, C. Haines, P. Mayhew, J. I. Deza, Task-oriented dialogue systems: performance vs. quality-optima, a review, in: David C. Wyld et al. (Eds): SIPP, NLPCL, BIGML, SOEN, AISC, NCWMC, CCSIT, 2022, pp. 69–87. doi:10.5121/csit.2022.121306.
[16] R. Nicholls, R. Fellows, S. Battle, H. Ihshaish, Problem classification for tailored helpdesk auto-replies, in: Artificial Neural Networks and Machine Learning – ICANN 2022, Springer Nature Switzerland, Cham, 2022, pp. 445–454. doi:10.1007/978-3-031-15937-4_37.
[17] J. Hu, G. Wang, F. Lochovsky, J.-t. Sun, Z. Chen, Understanding user's query intent with wikipedia, in: Proceedings of the 18th international conference on World wide web, 2009, pp. 471–480.
[18] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24 (5) (1988) 513–523. doi:10.1016/0306-4573(88)90021-0.
[19] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137–1155.
[20] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24 (5) (1988) 513–523. doi:10.1016/0306-4573(88)90021-0.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, Vol. 26, Curran Associates, Inc., 2013, p. 9.
[22] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
[23] X. Luo, W. Zhou, W. Wang, Y. Zhu, J. Deng, Attention-based relation extraction with bidirectional gated recurrent unit and highway network in the analysis of geological data, IEEE Access 6 (2018) 5705–5715. doi:10.1109/ACCESS.2017.2785229.
[24] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1724–1734. doi:10.3115/v1/D14-1179.
[25] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks 5 (2) (1994) 157–166. doi:10.1109/72.279181.
[26] R. Pascanu, T. Mikolov, Y. Bengio, On the difficulty of training recurrent neural networks, in: S. Dasgupta, D. McAllester (Eds.), Proceedings of the 30th International Conference on Machine Learning, Vol. 28 of Proceedings of Machine Learning Research, Atlanta, Georgia, USA, 2013, pp. 1310–1318.
[27] J. Chen, Y. Hu, J. Liu, Y. Xiao, H. Jiang, Deep short text classification with knowledge powered attention, in: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'19/IAAI'19/EAAI'19, AAAI Press, 2019, p. 8. doi:10.1609/aaai.v33i01.33016252.
[28] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, p. 9.
[29] T. Zhang, R. Xu, Performance Comparisons of Bi-LSTM and Bi-GRU Networks in Chinese Word Segmentation, Association for Computing Machinery, New York, NY, USA, 2021, Ch. 3, pp. 73–80.
[30] V. Vukotić, C. Raymond, G. Gravier, A step beyond local observations with a dialog aware bidirectional gru network for spoken language understanding, in: Interspeech, 2016, pp. 3241–3244. doi:10.21437/Interspeech.2016-1301.
[31] Z. Xiao, P. Liang, Chinese sentiment analysis using bidirectional lstm with word embedding, in: X. Sun, A. Liu, H.-C. Chao, E. Bertino (Eds.), Cloud Computing and Security, Springer International Publishing, Cham, 2016, pp. 601–610.
[32] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (4) (1989) 541–551. doi:10.1162/neco.1989.1.4.541.
[33] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, Maryland, 2014, pp. 655–665. doi:10.3115/v1/P14-1062.
[34] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1746–1751. doi:10.3115/v1/D14-1181.
[35] A. Conneau, H. Schwenk, L. Barrault, Y. Lecun, Very deep convolutional networks for text classification, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, pp. 1107–1116. doi:10.18653/v1/E17-1104.
[36] R. Johnson, T. Zhang, Effective use of word order for text categorization with convolutional neural networks, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 103–112. doi:10.3115/v1/n15-1011.
[37] R. Johnson, T. Zhang, Deep pyramid convolutional neural networks for text categorization, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 562–570. doi:10.18653/v1/P17-1052.
[38] X. Ouyang, P. Zhou, C. H. Li, L. Liu, Sentiment analysis using convolutional neural network, in: 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, 2015, pp. 2359–2364. doi:10.1109/CIT/IUCC/DASC/PICOM.2015.349.
[39] S. Liao, J. Wang, R. Yu, K. Sato, Z. Cheng, CNN for situations understanding based on sentiment analysis of twitter data, Procedia Computer Science 111 (2017) 376–381, the 8th International Conference on Advances in Information Technology. doi:10.1016/j.procs.2017.06.037.
[40] S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, ArXiv abs/1803.01271 (2018).
[41] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, Association for Computing Machinery, New York, NY, USA, 1992, pp. 144–152. doi:10.1145/130385.130401.
[42] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32. doi:10.1023/A:1010933404324.
[43] T. K. Ho, Random decision forests, in: Proceedings of 3rd international conference on document analysis and recognition, Vol. 1, IEEE, 1995, pp. 278–282.
[44] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, p. 15.
[45] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, J. Zhao, An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 221–231. doi:10.18653/v1/P17-1021.
[46] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu, Wavenet: A generative model for raw audio, in: Arxiv, 2016, p. 15. URL https://arxiv.org/abs/1609.03499
[47] A. M. Kibriya, E. Frank, B. Pfahringer, G. Holmes, Multinomial naive bayes for text categorization revisited, in: G. I. Webb, X. Yu (Eds.), AI 2004: Advances in Artificial Intelligence, Springer Berlin Heidelberg, 2005, pp. 488–499.
[48] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. Hovy, A survey of data augmentation approaches for NLP, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 968–988. doi:10.18653/v1/2021.findings-acl.84.
[49] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 6382–6388. doi:10.18653/v1/D19-1670.
[50] T. Dreossi, S. Ghosh, X. Yue, K. Keutzer, A. Sangiovanni-Vincentelli, S. A. Seshia, Counterexample-guided data augmentation, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI'18, AAAI Press, 2018, pp. 2071–2078.
[51] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
[52] F. Charte, A. J. Rivera, M. J. del Jesus, F. Herrera, MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation, Knowledge-Based Systems 89 (2015) 385–397. doi:10.1016/j.knosys.2015.07.019.
[53] L. Breiman, Out-of-bag estimation, Tech. rep., Dept. of Statistics, Univ. of California Berkeley (1996). URL www.stat.berkeley.edu/~breiman/OOBestimation.pdf
[54] A. J. P. Tixier, M. Vazirgiannis, M. R. Hallowell, Word embeddings for the construction domain (2016). doi:10.48550/ARXIV.1610.09333.