
Accepted for publication in IEEE Transactions on Software Engineering. Content may change prior to final publication. DOI: 10.1109/TSE.2020.3001339.

Managing Technical Debt in Database Normalization

Mashel Albarak, Rami Bahsoon, Ipek Ozkaya, and Robert Nord

• Mashel Albarak is with the School of Computer Science, University of Birmingham, UK and King Saud University, KSA. E-mail: mxa657@cs.bham.ac.uk
• Rami Bahsoon is with the School of Computer Science, University of Birmingham, UK. E-mail: r.bahsoon@cs.bham.ac.uk
• Ipek Ozkaya is with the Software Engineering Institute, Carnegie Mellon University, USA. E-mail: ozkaya@sei.cmu.edu
• Robert Nord is with the Software Engineering Institute, Carnegie Mellon University, USA. E-mail: rn@sei.cmu.edu

Abstract— Database normalization is one of the main principles for designing relational databases, the most popular database model, with the objective of improving data and system qualities such as performance. Refactoring the database for normalization can be costly if the benefits of the exercise are not justified. Developers often skip the normalization process because of the time and expertise it requires, introducing technical debt into the system. Technical debt is a metaphor that describes trade-offs between short-term goals and applying optimal design and development practices. We consider that database normalization debt is likely to be incurred for tables below the fourth normal form. To manage the debt, we propose a multi-attribute analysis framework that makes a novel use of Portfolio Theory and the TOPSIS method (Technique for Order of Preference by Similarity to Ideal Solution) to rank the candidate tables for normalization to the fourth normal form. The ranking is based on the tables' estimated impact on data quality, performance, maintainability, and cost. The techniques are evaluated using an industrial case study of a database-backed web application for human resource management. The results show that the debt-aware approach can provide an informed justification for the inclusion of critical tables to be normalized, while reducing the effort and cost of normalization.

Index Terms—Database Normalization, Multi-attribute analysis, Software Design, Technical Debt


1 INTRODUCTION

Information systems refactoring is integral to their development and evolution in response to data growth, improving quality, and changes in users' requirements. The refactoring process faces many risks, such as budget overruns, schedule delays and increasing chances of failure. Databases are the core of information systems; evolving databases' schemas through refactoring is common practice for improving quality and meeting new requirements, among other structural and behavioral qualities.

Database normalization is one of the main principles of Relational Database Model design, invented by the Turing Award winner Ted Codd [1]. The normalization concept was developed to organize data in "relations" or tables following specific rules to minimize data redundancy and, consequently, improve data consistency by reducing anomalies. Normalization benefits go beyond data quality and can further improve maintainability and performance [2], [3].

Despite the advances in database models and the emergence of new models where normalization is irrelevant (such as NoSQL), a recent study by DB-Engines Ranking [4] has shown that the Relational Database Model is the dominant model: 139 of 350 different Database Management Systems (DBMS) leverage this model. The study has evidenced that the relational model is still the most popular one, with a 75.2% popularity score, and is used by the top 4 DBMS, namely Oracle, MySQL, MS SQL Server and PostgreSQL. The model is also used by other major DBMS such as IBM Db2 and SQLite.

Conventional approaches often suggest higher levels of normalization for database tables to achieve structural and behavioral benefits in the design. However, in practice, developers tend to overlook the normalization process due to the time and expertise it requires [5]. With the growth of data, normalizing the database becomes essential to treat deteriorations in data quality, performance and overall database design [2]. Conversely, database normalization and refactoring can be costly if the benefits are not justified. For example, consider a database that consists of a large number of weakly normalized tables; developers may decide to normalize the design to improve its quality and performance. Normalization may also require refactoring the applications using the database and other artifacts. Though some tables can be candidates for normalization, they might not have a severe enough impact on quality to justify the cost of investing in normalization.

To address this issue, the technical debt metaphor can be used as a tool to justify the value of database normalization, by capturing the likely value of normalization (e.g. quality improvements) relative to the cost and effort of embarking on this exercise. Technical debt is a metaphor coined to describe trade-offs between short-term goals (e.g. fast system release; savings in person-months) and applying optimal design and development practices [6]. Technical debt has been studied extensively over the past years (see the MTD workshop series and TechDebt Conference [7], the EuroMicro track on technical debt [8], and the Dagstuhl seminar [9]). Despite the increasing volume of research on technical debt, there is no general agreement on its definition. However, the common ground of existing definitions is that the debt can be attributed to poor and sub-optimal engineering decisions that may carry immediate benefits, but are not well geared for long-term benefits.

Database normalization is among the core engineering decisions that are performed to meet both structural and behavioral requirements, such as performance, general data qualities and maintainability, among others. Sub-optimality, taking the form of weaker normal forms, can have direct negative consequences on meeting these requirements. Database normalization is essentially a design decision that can incur technical debt, where the debt interest can be manifested in degradation of qualities and increased levels of data inconsistency and duplication over the lifetime of the software system. If not managed, the consequence can resemble accumulated interest on the debt that grows with the growth of the data over time.

The majority of technical debt research has focused on code and architectural level debts [6], [10]. Technical debt linked to database design was first attempted in the context of missing foreign keys in a database [11]. In [12], [13] we were the first to explore a new context of technical debt which relates to database normalization issues. We have considered that potential normalization debts are likely to be incurred for tables below the fourth normal form, since the fifth normal form is regarded as a theoretical form [14].

The underlying assumption of normalization theory is that database tables should be normalized to a hypothetical fifth normal form to achieve benefits [14]. While this assumption holds in theory, in practice it fails due to the required time and expertise. To address this issue, in [12] we proposed a prioritization method for the tables that should be normalized to the fourth normal form based on their likely impact on data quality and operations performance. For operations performance, the table's growth rate was considered a critical factor to address performance impact accumulation in the future. We utilized the portfolio theory [15] to rank the tables based on their impact on performance under the risk of tables' growth.

The contribution of this research is a multi-attribute decision analysis approach for managing technical debt related to database normalization. Our proposed approach provides database designers and decision makers with a systematic framework to identify the debt items, quantify the likely impact of normalization debts, and provide mitigation strategies to manage the debts through justified and more informed normalization that considers cost, qualities and debts. In this study, the proposed framework focuses on the prioritization aspect of managing technical debt, to rank the tables that are most affected by the weakly normalized design.

This study goes beyond our previous work [12] in the following:
• In addition to data quality and performance, we considered potential debt tables' impact on maintainability. We used tables' number of attributes as a measure of the tables' complexity [16].
• We extend the application of the portfolio analysis technique to cover data quality and maintainability, considering the diversification and mitigation of tables' growth rate risk across three quality dimensions.
• The cost of normalization is an important factor that we incorporated in the decision analysis process.
• Consequently, managing the debt should consider how the debt can relate to data quality, performance, maintainability, and the cost of normalization not in isolation but also in conjunction. Therefore, we view normalization debt management as a multi-attribute decision problem, where developers need to prioritize tables that should be normalized based on their impact on the three considered qualities, in addition to the cost of normalization. The framework makes a novel use of the TOPSIS method (Technique for Order of Preference by Similarity to Ideal Solution) [17] to rank the debt tables that should be normalized to the fourth normal form.

The techniques are evaluated using an industrial case study of a database-backed web application for human resource management in a large company. The database consists of 97 tables of industrial scale, each filled with a large amount of data. The results show that the debt-aware approach has provided an informed justification for the inclusion of critical tables to be normalized. Equally important, it reduced the effort and cost of normalizing by eliminating unnecessary normalization tasks. Our framework has the promise to replace ad-hoc and un-informed practices for normalizing databases, where debt and its impact can motivate better design, optimize resources and justify database normalization decisions.

2 BACKGROUND AND MOTIVATION

In this section, the key concepts of our work are summarized. Normalization theory was introduced by Codd in 1970 [1] as a process of organizing the data in tables. The main goal of normalization is to minimize data redundancy, which is accomplished by splitting a table into several tables after analyzing the dependencies between the table's attributes. The advantages of normalization have been discussed and proved in the literature [2], [14]. Examples of such advantages include data quality improvement as it reduces redundancy, reducing update anomalies, and facilitating maintenance. Fig 1 illustrates the normal forms hierarchy. A higher level of normal form indicates a better design [14], since higher levels eliminate more redundant data.

Fig 1 Database Normalization Hierarchy


The main condition to move higher in the hierarchy is based on the constraint between two sets of attributes in a table, which is referred to as a dependency relationship.

2.1 Benefits of Database Normalization: Data Quality, Maintainability and Performance

Data quality enhancement is one of the main benefits of database normalization [2], [14]. The enhancement is linked to decreasing the amount of data redundancy as we move higher in the normalization hierarchy. By redundancy we mean recording the same fact more than once in the same table. In poorly or un-normalized tables that suffer from data redundancy, there is always a possibility of updating only some occurrences of the data, which will affect the data consistency. Data quality is a crucial requirement in all information systems, as the success of any system relies on the reliability of the data retrieved from the system.

In addition to data quality, improving maintainability is another important benefit of database normalization [3], [2]. Weakly or un-normalized table designs involve a bigger number of attributes in each table compared to highly normalized tables, which increases complexity in terms of retrieving data from those tables and implementing new rules on the tables [3]. Moreover, reducing the size of the table by normalization will facilitate the back-up and restore process.

Benefits of normalization can also be observed in operations performance. Performance has always been a controversial subject when it comes to normalizing databases. Some may argue that normalizing the database involves decomposing a single table into more tables; hence, data retrieval can be less efficient since it requires joining more tables as opposed to retrieving the data from a single table. Indeed, de-normalization was discussed as the process of "down-grading" a table's design to lower normal forms and keeping a limited number of big tables to avoid joining tables when retrieving the required data [2]. Advocates of de-normalization argue that the Database Management System (DBMS) stores each table physically in a file that maintains the records contiguously, and therefore retrieving data from more than one table will require a lot of I/O. However, this argument might not be correct: even though there will be more tables after normalization, joining the tables will be faster and more efficient because the sets will be smaller and the queries will be less complicated compared to the de-normalized design [2]. Moreover, weakly or un-normalized tables will be stored in a larger number of files as opposed to the normalized design due to the big amount of data redundancy and, consequently, increased record size and increased I/O cost [2], [18]. Adding to this, not all DBMS store each table in a dedicated physical file. Therefore, several tables might be stored in a single file due to the reduced table size after normalization, which means less I/O and improved performance, as shown in [2], [18].

Despite the controversy about normalization and performance, several arguments in favor of normalization are presented by C. J. Date [2], an expert who was involved with Codd in the relational database theory:
• Some of the de-normalization strategies to improve performance proposed in the literature are not "de-normalizing" since they do not increase data redundancy. In fact, as Date states, some of them are considered to be good normalized relational databases.
• There is no theoretical evidence that de-normalizing tables will improve performance. Therefore, it is application dependent and may work for some applications. Nevertheless, this does not imply that a highly normalized database will not perform better.

In this study, we view these deteriorations in data quality, maintainability and performance as debt impact caused by inadequate normalization of database tables. The impact can accumulate over time with data growth and increased data duplication. Short-term savings from not normalizing the table can have long-term consequences, which may call for inevitable expensive maintenance, fixes and/or replacement of the database. Therefore, we motivate rethinking database normalization from the debt perspective linked to data quality, maintainability and performance issues.

2.2 Cost of Database Normalization

The benefits of normalization come with certain costs. Normalizing the database is a relatively complex process due to the expertise and resources required. Decomposing a single table for normalization may involve [5]:
• Database schema alterations: create the new normalized tables; ensure that the data in the new tables will be synchronized; and finally update all stored views, procedures and functions that access the original table before normalization to reflect the changes.
• Data migration: a detailed and secured strategy should be planned to migrate the data from the old weakly or un-normalized table to the new decomposed tables.
• Modifications to the accessing application(s): introduce the new tables' meta-data and update the applications' methods and classes' source code that accessed the original table. Occasionally, normalization may not require refactoring of the application, if the application is well developed and decoupled from the database. Nevertheless, normalization is considered to be complex since it would require other complex tasks, as mentioned in the above points.

Moreover, testing the database and the applications before deployment can be expensive and time consuming. Therefore, in this study, we formulate the normalization problem as technical debt. We start from the intuitive assumption that tables which are weakly or not normalized to the deemed ideal form can potentially carry debt. To manage the debt, we adhere to the logical practice of paying the debt with the highest negative impact on quality, with respect to the cost of normalization.

2.3 Why Multi-attribute Analysis

Multi-attribute analysis techniques help decision makers evaluate alternatives when conflicting objectives must be considered and balanced, and when outcomes are uncertain [17].

The process provides a convenient framework for developing a quantitative debt impact assessment that results in a set of prioritized tables to be normalized. The prioritization is based on the debt impact on data quality, maintainability, performance and the effort cost of normalizing the table. As this analysis involves multiple conflicting attributes, multi-attribute analysis will help structure the problem and provide systematic and informed guidance to normalize tables to improve quality cost-effectively. In this study, we utilize the TOPSIS method [17] to rank the tables that should be normalized. This method involves several steps to find the best alternative that is closest to the ideal solution. A detailed explanation of the method is given in section 4.3.

2.4 Database Refactoring and Schema Evolution

Database refactoring is defined as a "simple change to a database schema that improves its design while retaining both its behavioral and informational semantics" [5]. While the code refactoring area is well researched in the literature [19], database refactoring has received less attention [20]. As mentioned by Vial in [20], only 16 studies on "database refactoring" were found in the ACM and IEEE digital libraries in 2015, including reviews of "Refactoring Databases: Evolutionary Database Design" [5]. Vial reports lessons learned from refactoring a database for industrial logistics applications [20]. One of the key lessons learned is that some refactoring patterns might not yield the benefits envisioned, specifically in the case of reducing data duplication in the database. Therefore, he stated that the database might carry some level of debt (in his case the debt resembles data duplication) as long as the debt is known and documented. Additional studies between 2015 and 2019 on the topic are in line with Vial's conclusion [21], [22].

Schema evolution, on the other hand, has been extensively studied over the past years [23], [24], [5]. Schema evolution is modifying the database schema to evolve the software system in meeting new requirements [5]. Researchers have attempted to analyze schema evolution from earlier versions [23], and they developed tools to automate the evolution analysis process [24]. However, the schema evolution literature has focused on limited tactics, such as adding or renaming a column, changing the data type, etc. Evolving the database schema for the objective of normalizing the database to create value and avoid technical debt has not been explored. We posit that normalization is a process that should create value to ensure the system's sustainability and maintainability in the future.

3 NORMALIZATION DEBT DEFINITION

The common ground of existing definitions of technical debt is that the debt can be caused by poor and sub-optimal development decisions that may carry immediate benefits, but are not well geared for long-term benefits [10], [6]. Database normalization has been proved to minimize data duplication, and several studies have shown that it also improves performance as the table is further normalized to higher normal forms (see Fig 1) [18], [2]. Therefore, tables in the database that are below the fourth normal form can be subjected to debts as they potentially lag behind the optimal, where the debt can be observed in data consistency, maintainability and performance degradation as the database grows [2]. To address this phenomenon of normalization and technical debt, in previous works [12], [13] we have considered the fourth normal form as the target normal form for the following reasons:
• In practice, most database tables are in third normal form (rarely achieving BCNF) [14]. However, the fourth normal form is considered a better design since it is a higher level and more redundant data is eliminated [14].
• The fourth normal form criteria rely on multi-valued dependencies, which are common to address in reality [25].
• While the fifth normal form is higher in the normalization hierarchy, it is considered a theoretical value, since it is based on join dependencies between attributes, which rarely arise in practice [14].

4 PROPOSED FRAMEWORK TO PRIORITIZE TECHNICAL DEBT IN DATABASE NORMALIZATION

To facilitate managing normalization debt in an explicit way, we propose a simple management framework. The framework is meant to help organize the activities and information needed to prioritize tables in the database that should be normalized to the fourth normal form. As depicted in Fig 2, the framework consists of three major phases: identification of potential debt tables (phase 1), estimating their likely impact and cost (phase 2), and using this information to prioritize tables which are candidates for normalization (phase 3). As shown in the figure, phases 2 and 3 are performed periodically. After the decision has been made to normalize a specific table, the rest of the identified debt tables should be monitored, as their impact on quality and the cost to normalize them may change as time passes.

Fig 2 Normalization debt management framework



4.1 Phase 1: Potential Debt Identification

Given the previous definition of potential normalization debt tables, our objective is to identify the tables that are below the fourth normal form by determining the current normal form of each table in the database. Determining the normal form can be done through knowledge of the functional dependencies that hold in a table. A functional dependency is a constraint that determines the relationship between two sets of attributes in the same table [2]. An example of a functional dependency in a table that stores staff information is JobTitle → Salary. This dependency is interpreted as: all staff members having the same JobTitle will also have the same Salary. Therefore, Salary is functionally dependent on JobTitle.

These dependencies are elicited from a good knowledge of the real world and of the enterprise domain which the database will serve. Experienced database developers and the availability of complete documentation can also be a source for extracting dependencies. In practice, none of these might be available, which makes dependency analysis a tough task. In previous work [13], we proposed a framework that involves several steps to determine the current normal form of each table by mining the data stored in each table. The framework uses a data mining technique called association rule mining [26] to analyze the data and test candidate dependencies in each table. A detailed explanation of the framework is found in [13].
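To make the idea of testing a candidate dependency against stored data concrete, the following is a minimal sketch (not the association-rule-mining framework of [13]): it simply checks whether a candidate functional dependency X → Y holds in the rows currently stored in a table. The table and column names (staff, JobTitle, Salary) are illustrative only.

```python
# Minimal sketch: check whether a candidate functional dependency lhs -> rhs
# holds in the data currently stored in a table (illustrative, not the mining
# framework of [13]).
import pandas as pd

def dependency_holds(df: pd.DataFrame, lhs: list[str], rhs: list[str]) -> bool:
    """True if every distinct `lhs` combination maps to exactly one `rhs` value."""
    counts = df.groupby(lhs, dropna=False)[rhs].nunique()
    return bool((counts <= 1).all().all())

# Example: test JobTitle -> Salary on a small extract of a staff table.
staff = pd.DataFrame({
    "JobTitle": ["Engineer", "Engineer", "Analyst"],
    "Salary":   [5000, 5000, 4200],
})
print(dependency_holds(staff, ["JobTitle"], ["Salary"]))  # True for this sample
```

Note that a dependency that holds in the stored data is only evidence, not proof that it holds in the domain, which is why the framework in [13] combines mining with developer knowledge.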
4.2 Phase 2: Debt Impact and Cost Estimation

4.2.1 Normalization Debt Impact

The second phase of the proposed framework involves estimating the impact of each potential debt table on data quality, maintainability and performance, where the estimation will be one of the drivers of prioritizing the debts to be paid, in addition to the normalization cost.

Data Quality: In this study, the quality of the data is represented by its consistency. In weakly or un-normalized tables that store a large amount of redundant data, there is always a possibility of changing or updating some occurrences of the data while leaving the same redundant data unchanged. Therefore, the risk of data inconsistency will increase. We view the risk of data inconsistency as an impact incurred by weakly normalized tables. This impact can be quantified using the International Standardization Organization (ISO) metric [27], risk of data inconsistency. According to ISO, the risk of data inconsistency is proportional to the number of duplicate values in the table, and it is reduced if the table is further normalized to higher normal forms. This risk can be measured using the following formula:

X = A / B    (1)

where, with (n choose k) sets of k attributes for a table with n attributes (k = 1, ..., n):
A = Σ_k Σ_j Σ_i D_ijk, where D_ijk is the number of duplicate values found in set i of k attributes of table j;
B = Σ_j (m_j × n_j) / T (j = 1, ..., T), where T is the number of tables, m_j is the number of rows of table j, and n_j is the number of columns of table j.
For X, lower is better.
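The sketch below illustrates one possible reading of formula (1): for each subset of attributes, it counts the rows whose value combination occurs more than once, sums these counts over tables to obtain A, and divides by the size term B. The exact aggregation prescribed by the ISO metric [27] may differ, and enumerating all attribute subsets is exponential, so the `max_k` cap is a practical assumption of this sketch.

```python
# Hedged sketch of the duplication-based risk in formula (1); the duplicate-count
# interpretation and the max_k cap are assumptions of this illustration.
from itertools import combinations
import pandas as pd

def duplicate_count(df: pd.DataFrame, max_k: int | None = None) -> int:
    n = len(df.columns)
    max_k = n if max_k is None else min(max_k, n)
    total = 0
    for k in range(1, max_k + 1):
        for subset in combinations(df.columns, k):
            sizes = df.groupby(list(subset), dropna=False).size()
            total += int(sizes[sizes > 1].sum())  # rows involved in duplicated combinations
    return total

def inconsistency_risk(tables: dict[str, pd.DataFrame], max_k: int | None = None) -> float:
    a = sum(duplicate_count(df, max_k) for df in tables.values())                 # A
    b = sum(len(df) * len(df.columns) for df in tables.values()) / len(tables)    # B
    return a / b  # X: lower is better
```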
Maintainability: Maintainability refers to the ease with which a software product can be altered after delivery in order to correct defects, add or improve features, or cope with future changes in the environment [28]. In databases, the number of attributes has been proposed as a metric to measure the complexity of each table [29], [16]. Weakly or un-normalized tables consist of a big number of attributes. Too many attributes indicate a lack of cohesion and storing data for multiple entities in one table [5]. In [30] the authors showed that tables with a big number of attributes are more vulnerable to changes as the system evolves. They also argued that a good design may involve tables with a smaller number of attributes, which is accomplished by normalizing those tables to higher normal forms through decomposition. In this study, the complexity of each potential debt table is measured by the number of its attributes. This complexity is decreased by normalizing the table to the fourth normal form, as this involves decomposing the table into two or more tables and reduces the number of attributes.
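As a small illustration of this maintainability proxy, the sketch below collects the attribute (column) count per table from the standard INFORMATION_SCHEMA views available in SQL Server; the connection string is a placeholder and the use of pyodbc is an assumption of this sketch, not part of the authors' tooling.

```python
# Illustrative sketch: per-table attribute counts as the complexity proxy above.
import pyodbc

QUERY = """
SELECT TABLE_NAME, COUNT(*) AS attribute_count
FROM INFORMATION_SCHEMA.COLUMNS
GROUP BY TABLE_NAME
ORDER BY attribute_count DESC;
"""

def attribute_counts(conn_str: str) -> dict[str, int]:
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(QUERY).fetchall()
    return {table: count for table, count in rows}
```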
Performance: Operations performed on the database include update, insert, delete and data retrieval operations. Each of these operations incurs a cost in the number of disk pages read or written, which is referred to as the Input/Output "I/O" cost [31]. The impact of normalization debt can be observed through the I/O cost incurred by the operations performed on the potential debt table in its current normal form. Due to the huge amount of data duplication in tables below the fourth normal form, such tables will require more pages to be stored in, which will affect the performance of the operations executed on those tables. Meaning, the more data stored in the table, the more disk pages it requires and, hence, the more time is needed to go through the pages to execute the operation. Therefore, normalization is the sensible solution that will improve performance, as shown in [2], [18].

Unlike data quality and maintainability, a metric to quantify the I/O cost of all the operations performed on debt tables is not available. In previous work [12], we proposed the following model to estimate the average I/O costs for operations executed on each debt table:

I/O cost = Σ_{x=1}^{n_u} C_x^u λ_x^u + Σ_{x=1}^{n_i} C_x^i λ_x^i + Σ_{x=1}^{n_d} C_x^d λ_x^d + Σ_{x=1}^{n_s} C_x^s λ_x^s    (2)

where n_u, n_i, n_d, n_s are the number of update, insert, delete and select operations on the potential debt table R, respectively; λ_x^u, λ_x^i, λ_x^d, λ_x^s represent the execution rates of the x-th update, x-th insert, x-th delete and x-th select, respectively; and the I/O costs of the x-th update, x-th insert, x-th delete and x-th select are represented by C_x^u, C_x^i, C_x^d, C_x^s, respectively.
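The following small sketch restates equation (2): each operation observed on a debt table contributes its measured I/O cost weighted by its execution rate. In practice the operations and rates would come from the DBMS monitoring log; the values below are invented for illustration.

```python
# Sketch of the workload model in equation (2): sum of (I/O cost x execution rate).
from dataclasses import dataclass

@dataclass
class Operation:
    kind: str        # "update" | "insert" | "delete" | "select"
    io_cost: float   # average disk pages read/written per execution (C_x)
    rate: float      # executions per time unit (lambda_x)

def table_io_cost(operations: list[Operation]) -> float:
    """Weighted I/O cost of a debt table, i.e. equation (2)."""
    return sum(op.io_cost * op.rate for op in operations)

# Hypothetical workload for one table:
print(table_io_cost([
    Operation("select", io_cost=120.0, rate=300),
    Operation("update", io_cost=15.0, rate=40),
    Operation("insert", io_cost=3.0, rate=25),
]))
```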

Debt Impact Accumulation and Tables' Growth:

In the technical debt area, the debt interest is a crucial factor for managing debts. In general, the interest of technical debt is the cost paid over time by not resolving the debt [10]. Researchers have associated interest with implications on qualities [32]. The analogy is applicable to the case of normalization debt, as striving for the ideal fourth normal form will reduce the impact on data quality, maintainability and performance. We view that all tables do incur interest over time. However, interest varies between tables in how much and how fast they accumulate it, which our work uses to identify tables that are candidates for normalization.

Tables' growth rate is a crucial factor that will cause the impact of the debt on the three qualities to accumulate faster. For performance, the I/O cost of the operations changes based on the table's growth rate. If the table is likely to grow faster than other tables, the I/O cost for the operations executed on that table will accumulate faster than for others. This is due to the fact that increasing table size implies more disk pages to store the table and, therefore, more I/O cost. Similarly, the risk of data inconsistency increases as the table grows. Regarding maintainability, the existing metric looks at the number of attributes of each table as a measure of complexity. However, discussing normalization debts can make this measure limited, as the debt can intuitively be linked not only to the number of attributes but also to the growth rate and the population size of the table.

Therefore, tables' growth rate is a crucial measure to prioritize tables that need to be normalized. If the table is not likely to grow, or its growth rate is lower than that of other tables, a strategic decision would be to keep the debt and defer its payment. Table growth rate can be elicited from the database monitoring system. The growth rate of a table can be viewed as analogous to interest risk or interest probability. Interest probability captures the uncertainty of interest growth in the future [10]. Debt tables which experience a high growth rate in data can be deemed to have a higher interest rate. Consequently, these tables are likely to accumulate interest faster.

4.2.2 Prioritizing Debt Tables Using Portfolio Theory

We use portfolio theory to prioritize tables that need to be normalized considering their impact on the three qualities, taking into consideration the likely growth rate of the table size and, hence, the risk of interest accumulation. Modern Portfolio Theory (MPT) was developed by the Nobel Prize winner Markowitz [15]. The aim of this theory is to develop a systematic procedure to support the decision making process of allocating the capital of a portfolio consisting of various investment assets. The assets may include stocks, bonds, real estate, and other financial products on the market that can produce a return through investment. The objective of the portfolio theory is to select the combination of assets, using a formal mathematical procedure, that can maximize the return while minimizing the risk associated with every asset. Portfolio management involves determining which asset types should be invested in or divested and how much should be invested in each asset. This process draws a similarity with the normalization debt management process, where developers can make decisions about prioritizing investments in normalization, based on which technical debt items should be paid, ignored, or can wait further. With the involvement of uncertainty, the expected return and the variance of the return are used to evaluate the portfolio performance.

To fit into portfolio management, each debt table below the fourth normal form is treated as an asset. For each table, we need to determine whether it is better to normalize that table to the fourth normal form (pay the debt) or keep the table in its current normal form (defer the payment). To decide on this, we need to determine what the expected return of each debt table is. In the case of normalization debt, the expected return of the debt table resembles the estimated quality impact of the table. Tables with the lowest estimated impact are deemed to carry a higher expected return. In other words, if the estimated quality impact of table A is less than the estimated quality impact of table B, then table A's expected return would be higher than B's; B will then have a higher priority for normalization due to its high quality impact. We balance the expected return with the risk. In portfolio management, this risk is represented by the variance of the return. For the debt tables, this risk is represented by the tables' growth rate. Tables with the highest growth rate are considered to be risky assets; their likely interest, and so the debt, will grow faster than for other tables of low growth rate.

In order to apply the portfolio theory to normalization debt, a few considerations need to be taken into account:
• The expected return of the debt table is equal to 1 ÷ quality impact.
• The risk of each table is equal to the table growth rate for each debt table. The growth rate can be elicited from the database management system by monitoring the table's growth.
• We set the correlation between the debt tables to zero for several reasons. First, the quality impacts represented by the I/O costs, risk of data inconsistency and number of attributes of the debt tables are independent: the I/O cost of the operations executed on a debt table has no effect on the I/O cost of the operations executed on another debt table, and the same reasoning applies to the risk of data inconsistency and the number of attributes of each table. Moreover, the growth rate of each table, which affects the quality impact accumulation, is unique and independent of the others. Lastly, each debt table design is independent of the other debt tables, as the decision to keep the debt or normalize the table has no effect on the design and the data of the other debt tables.

Taking into account these considerations, we can apply the portfolio theory, where the database developer is investing in tables' normalization. The database developer needs to build a diversified portfolio of multiple debt tables. The multiple debt tables in the database represent the assets. Each asset i has its own risk R_i and quality impact Q_i. Based on these values the developer can then prioritize tables to be normalized. The expected return of the debt tables' portfolio E_p, built by prioritizing debt tables from the database of m debt tables, can be calculated as in the following equation:

E_p = Σ_{i=1}^{m} w_i (1 / Q_i)    (3)

With one constraint, represented in the following equation:

Σ_{i=1}^{m} w_i = 1    (4)

where w_i represents the resulting weight of each debt table. This weight will resemble the priority of each table for normalization, as explained in the process steps. The risk of table growth rate for debt table i is represented by R_i. The global risk of the portfolio R_p is calculated as follows:

R_p = sqrt( Σ_{i=1}^{m} w_i^2 R_i^2 )    (5)

Process Steps: The following steps will be executed three times (except for the first step, which will be executed once) to prioritize tables based on their impact on performance, data quality and maintainability, separately for each quality attribute. A code sketch of steps 3 and 4 follows the list.
1. Determine the potential debt tables' growth rates from the database monitoring system. This step simplifies the method to examine only tables with a high growth rate.
2. Consider only tables of high growth rate to measure their impact on the three quality attributes, using the metrics discussed in section 4.2.1.
3. Calculate the values of the portfolio variables (expected return for each table = 1 ÷ quality impact, risk = table's growth rate).
4. Run the model on the available data to produce the optimal portfolio of the debt tables. The portfolio model will give the highest weights to those tables with low quality impact and low table growth rate. Therefore, the debt table that has the lowest weight is the highest priority table that should be normalized.
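The sketch below is one standard mean-variance style formulation of steps 3-4 under the assumptions stated above (return of table i = 1/Q_i, risk = growth rate R_i, zero correlation, weights summing to 1). The exact objective function and solver used in the study are not specified here, and the risk-aversion factor `alpha` is an illustrative assumption.

```python
# Hedged sketch of the portfolio weighting in steps 3-4 (equations (3)-(5)).
import numpy as np
from scipy.optimize import minimize

def portfolio_weights(quality_impact, growth_rate, alpha=1.0):
    q = np.asarray(quality_impact, dtype=float)   # Q_i per debt table
    r = np.asarray(growth_rate, dtype=float)      # R_i per debt table
    ret = 1.0 / q                                 # expected returns, equation (3)

    def objective(w):
        expected = w @ ret                        # E_p
        variance = np.sum(w**2 * r**2)            # R_p^2, equation (5)
        return -(expected - alpha * variance)     # maximize return, penalize risk

    m = len(q)
    w0 = np.full(m, 1.0 / m)
    res = minimize(objective, w0,
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

# Lower weight -> higher normalization priority (step 4); example values are made up.
weights = portfolio_weights(quality_impact=[8.0, 2.5, 5.0], growth_rate=[0.9, 0.2, 0.5])
priority = np.argsort(weights)  # indices of debt tables, most urgent first
```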
4.2.3 Estimating Normalization Debt Cost

The second phase of the proposed framework also involves estimating the cost of normalizing each debt table to the fourth normal form. Normalizing and splitting a single table in the database requires several alterations to the database schema and to the applications using the database, as well as migrating data to the new decomposed tables [5]. Some of these tasks can be automated depending on the technologies used in the project. However, regardless of the size of the system, refactoring the database for normalization can be very complex and time consuming. Therefore, to structure the process of estimating normalization cost, first, the main tasks required to normalize each table to the fourth normal form should be defined. Then each main task is decomposed into sub-tasks and the time required for each sub-task is estimated. The following simple model can be used to estimate the total effort cost required to normalize each table:

Normalization cost for each table (in hours) = Σ_{i=1}^{m} Σ_{x=1}^{n} (estimated hours needed to complete the x-th sub-task of main task i)    (6)

where n is the total number of sub-tasks and m is the total number of main tasks for each table. The tasks and the time required to perform them can be estimated from historical effort data and domain experts. In this study we rely on experts' estimation of the time required for each task. All the major refactoring tasks required to normalize the tables in our case study application were determined by the technical team. The tasks were further refined into sub-tasks to ease and concretize costing. The tasks and sub-tasks were as follows (a small costing sketch follows the list):
1. Task: Split the table and create the new decomposed tables in the 4th normal form. Sub-tasks include: write scripts to create the new tables and back up table data.
2. Task: Update the ORM. Sub-tasks include: create new classes, update ORM models, and update application services and domain layers.
3. Task: Migrate data to the new tables. Sub-tasks include: write scripts to migrate data to the new decomposed tables and run those scripts.
4. Task: Refactor the application code. Sub-tasks include: refactor web helper classes, view models and class controllers.
5. Task: Test the database and application. Sub-tasks include: run automation tests, end-to-end testing of the application, and sanity and smoke tests.
6. Task: Integrate and apply changes in the production database. Sub-tasks include: back up the database, stop the application, update the new files, and performance testing.
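As a minimal illustration of equation (6), the sketch below sums the expert-estimated hours of every sub-task over the main tasks for one table. The task names mirror the list above; the hour figures are invented for illustration.

```python
# Minimal sketch of the effort model in equation (6).
def normalization_cost(task_breakdown: dict[str, dict[str, float]]) -> float:
    return sum(hours
               for subtasks in task_breakdown.values()
               for hours in subtasks.values())

example = {
    "Split table into 4NF tables": {"write creation scripts": 6, "backup table data": 2},
    "Update ORM":                  {"create new classes": 4, "update ORM models": 3},
    "Migrate data":                {"write migration scripts": 5, "run scripts": 1},
}
print(normalization_cost(example))  # 21 hours for this hypothetical table
```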
4.3 Phase 3: Making Decisions

We view normalization debt management as a multi-attribute decision making process, where developers need to make a decision about which table(s) to normalize to the fourth normal form, to improve quality while minimizing effort cost, and which should be kept in their current normal form because they are likely to have the least impact on quality and be the most expensive to normalize. The final phase of the proposed framework incorporates the information obtained from the previous phase 2 to prioritize tables. The proposed approach uses the TOPSIS method [17] to rank the debt tables to be normalized. TOPSIS is mainly a utility-based method that compares each alternative directly based on the data in the evaluation matrix and the weight of each quality attribute. The fundamental idea of TOPSIS is that the best solution is the one which has the shortest distance to the positive ideal solution and the farthest distance from the negative ideal solution. This method has several advantages, such as [33]:
• The method results represent scalar values that account for the best and worst alternatives simultaneously.
• The ability to visualize the performance measures of all alternatives on attributes.
• Human choice is represented by a sound logic.

The process of the TOPSIS method is carried out as follows:

Step 1: Construct an evaluation matrix: The matrix consists of m tables and n criteria, with the intersection of each alternative and criterion given as x_ij; we therefore have a matrix (x_ij)_{m×n}. The criteria in our study are risk of data inconsistency, maintainability, performance and cost.

Step 2: Calculate the normalized matrix: the normalized matrix is represented by R = (r_ij)_{m×n}. Since the criteria are all measured in different units, this step normalizes all the values in the matrix using the following formula to make the criteria comparable:

r_ij = x_ij / sqrt( Σ_{i=1}^{m} x_ij^2 ),  i = 1, 2, ..., m;  j = 1, 2, ..., n    (7)

Step 3: Calculate the weighted normalized matrix:

t_ij = r_ij × w_j,  i = 1, 2, ..., m;  j = 1, 2, ..., n    (8)

where w_j is the weight of criterion j. The weight of a criterion reflects the importance of this specific criterion compared to the other criteria. The summation of all weights must not exceed 1. In this study, we assumed that all four criteria are equally important; therefore, the weight of each criterion would be 0.25. Nevertheless, the analyst can adjust these weights based on importance and/or relevance.

Step 4: Determine the positive ideal solution A+ and the negative ideal solution A- for each criterion as follows:

A+ = { <max_i (t_ij) | j ∈ J+>, <min_i (t_ij) | j ∈ J-> } ≡ { t_bj | j = 1, 2, ..., n }    (9)

A- = { <min_i (t_ij) | j ∈ J+>, <max_i (t_ij) | j ∈ J-> } ≡ { t_wj | j = 1, 2, ..., n }    (10)

where J+ = {1, 2, ..., n | j} is associated with the criteria having a positive impact, and J- = {1, 2, ..., n | j} is associated with the criteria having a negative impact. In this study, the positive ideal solution for each criterion would be the minimum value for that specific criterion, and the negative ideal solution is the maximum value. As explained in section 4.2.2, the table with the lowest weight indicates the table with the highest impact on performance, data quality and maintainability and, therefore, has the highest priority for normalization. Similarly, the positive ideal solution in the cost criterion would be the lowest cost.

Step 5: Calculate the distance between each table and the best condition, d_ib, and the distance between each table and the worst condition, d_iw, as follows:

d_ib = sqrt( Σ_{j=1}^{n} (t_ij − t_bj)^2 ),  i = 1, 2, ..., m    (11)

d_iw = sqrt( Σ_{j=1}^{n} (t_ij − t_wj)^2 ),  i = 1, 2, ..., m    (12)

Step 6: Calculate the relative closeness to the ideal solution C_i as follows:

C_i = d_iw / (d_ib + d_iw),  where 0 ≤ C_i ≤ 1    (13)

Step 7: Rank the preference order: The best table will be the table with C_i closest to 1.
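The following compact sketch implements steps 2-7 with NumPy. In line with the ideal-solution choice described above, every criterion is treated as "lower is better" (the portfolio weights per quality attribute, plus normalization cost); the sample matrix and the equal 0.25 weights are illustrative only.

```python
# Compact TOPSIS sketch for equations (7)-(13); sample data is made up.
import numpy as np

def topsis_rank(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Return table indices ordered from highest to lowest closeness C_i."""
    r = x / np.sqrt((x**2).sum(axis=0))                 # step 2, equation (7)
    t = r * weights                                     # step 3, equation (8)
    a_best = t.min(axis=0)                              # step 4: lower is better here
    a_worst = t.max(axis=0)
    d_best = np.sqrt(((t - a_best)**2).sum(axis=1))     # step 5, equation (11)
    d_worst = np.sqrt(((t - a_worst)**2).sum(axis=1))   # step 5, equation (12)
    closeness = d_worst / (d_best + d_worst)            # step 6, equation (13)
    return np.argsort(-closeness)                       # step 7: best table first

# Rows = debt tables; columns = [data-quality weight, maintainability weight,
# performance weight, normalization cost in hours] (illustrative values).
x = np.array([[0.10, 0.30, 0.15, 40.0],
              [0.45, 0.20, 0.50, 25.0],
              [0.25, 0.50, 0.35, 60.0]])
print(topsis_rank(x, weights=np.full(4, 0.25)))
```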
5 CASE STUDY

In order to control the results of the evaluation effort, we conducted the case study following the DESMET methodology [34] for guiding the evaluation of software engineering methods. DESMET sees that the first decision to make when undertaking a case study is to determine what the study aims to investigate and verify; in other words, to define the goals of the case study. In the context of evaluating the normalization debt prioritization framework, our goal is to demonstrate the validity of the following hypothesis:

The database normalization debt prioritization framework provides a systematic method to improve the quality of decisions made when refactoring relational databases for the purpose of normalization.

To test the validity of the stated hypothesis, we define specific questions under which we will evaluate the results of the industrial case study:
1. Does the prioritization framework provide a systematic and effective method to guide the prioritization of technical debt in database normalization?
To answer this question, we conduct a case study by systematically executing the phases of the framework, as described in section 4. The effectiveness of the proposed framework is observed through the overall ease of use, the systematic process, and the difference in effort cost between the conventional approach and the debt-aware approach for database normalization.
2. Does the debt quality impact estimation provide insights to identify tables that are most affected by the debt?
Insights from quality impact estimation are derived by measuring the impact on data quality, performance and maintainability of the identified debt tables, using the metrics described in section 4.2.1.
3. Is the portfolio analysis technique effective in prioritizing debt tables with a high growth rate in the future?
We apply the portfolio model as described in section 4.2.2 on the data collected from the quality impact estimations for tables of high growth rates. To evaluate the effectiveness of the technique, we simulate a future scenario and enlarge the debt tables according to their growth rates. Then, we re-measure the quality impact after the tables' enlargement.
4. Does the TOPSIS method provide useful guidance to make informed normalization decisions?
This question is answered through observation of the application of the TOPSIS method on the collected data, to see to what extent it provides guidance to improve the quality of normalization decisions.

decisions. 5. We met with the developers to estimate the effort


cost of normalizing each of the five tables to the
5.1 Subject Application fourth normal form.
The subject application was selected by the first author, 6. We applied the TOPSIS method detailed in section
who served as the company contact as well as well as the 4.3 on the collected data (debt tables’ weights and
main researcher conducting the case study. The project is a normalization cost), to rank the debt tables for nor-
database-backed web application for human resource malization.
management in a large company of 855 employees. The ap- 7. We simulated future scenario and enlarged the
plication is used to record employees’ information and debt tables according to their growth rates. We ac-
provide different services for the human resource depart- complished this by using the Redgate data genera-
ment such as attendance management, payroll management and employees' requests management. The application consists of a relational database with 97 tables, and the program code was written in C# on top of an Object Relational Mapping (ORM) framework. ORM is a programming technique for converting data between incompatible type systems in object-oriented programming languages [35]. This technique is gaining popularity among developers because it simplifies code generation: the ORM abstracts access to the database and allows method calls to be transparently translated to SQL queries by the framework. The application was developed with the following infrastructure: ASP.net and Microsoft SQL Server (to manage the database).

5.2 Case Study Process
One lead architect and two developers audited and maintained the application. The case study was executed by the principal researcher PR (first author), with the cooperation of the developers; the PR played the role of the analyst. We simulated the phases of the framework using the data provided by the application and the technical team. We conducted the study for nearly 8 months, where the PR met with the developers on a weekly basis, each meeting lasting for an average of 2 to 3 hours. The meetings were used for obtaining data about the application and executing the phases of the proposed framework. The information used to conduct the case study came from a variety of sources: regular meetings with the developers, inspection of the database monitoring system, and artifacts including the database management system and the application code, supported by a review of the available literature. The case study was conducted with the following general steps (details in the following subsections):
1. We obtained the data stored in the database tables and applied the framework described in [13] to identify potential debt tables below the fourth normal form.
2. Utilizing the database monitoring system, the developers extracted the growth rates of the identified potential debt tables. We considered the top five tables with the highest growth rates for the analysis in this study.
3. For each of the five tables, we measured the impact on data quality using the risk of data inconsistency metric, on performance using the model in equation (2), and on maintainability using the number of attributes.
4. We applied the portfolio model on the collected data to produce optimum weights for the tables.
5. We estimated the cost of normalizing each of the five tables to the fourth normal form.
6. We applied the TOPSIS method to rank the candidate tables using the quality-impact weights and the cost estimates.
7. To simulate future growth, we used the Redgate data generation tool [36] to populate each table with the required amount of data. The choice of using this tool was based on the development team's suggestion. After data generation, we re-measured the impact of each table on data quality, performance and maintainability and obtained the results.
8. With the cooperation of the developers, we normalized the suggested table and refactored the application code accordingly to measure improvements in the three quality attributes (data quality, performance and maintainability).
9. We met with the development team to discuss the outcomes and gather their feedback.

5.3 Case Study Execution
In this section, we demonstrate the details and the results of the process steps presented in the previous section 5.2.

5.3.1 Framework Phase 1: Potential Debt Identification
We applied the steps of the framework explained in [13] to the tables in the database to identify tables below the fourth normal form. A database dictionary that describes the tables and the attributes of the database was not available. While most of the attributes were self-described by their names, such as Employee_ID and Salary, some of the attributes were described by the technical team. For each of the available tables we were allowed to obtain the stored data and load it into the RapidMiner software [10] to execute the mining algorithm and extract the functional dependencies that hold in the table.
Following the steps of the framework detailed in [13], we were able to identify 17 potential debt tables below the fourth normal form. Table 1 lists some of the identified tables and their normal forms.

Table 1 Potential debt tables below fourth normal form
Table name                | Normal form
AttendanceRecords         | BCNF
EmployeeDayAttendanceLogs | 2nd normal form
AuditTrials               | 2nd normal form
Leaves                    | 1st normal form
LeaveBalances             | 2nd normal form
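The identification step relies on mining functional dependencies from the stored data. As a rough illustration only (the study itself used the association-rule-mining framework of [13] together with RapidMiner), the sketch below shows the kind of check involved: testing whether a candidate functional dependency holds in a table extract. The table and column names here are hypothetical and are not taken from the case study schema.

```python
# Illustrative sketch only: checking whether a candidate functional
# dependency X -> Y holds in the stored data of a table. The study itself
# mined dependencies with association rule mining in RapidMiner [13];
# the table and column names below are hypothetical.
import pandas as pd

def fd_holds(df: pd.DataFrame, lhs: list, rhs: str) -> bool:
    """True if every distinct combination of `lhs` values determines
    exactly one `rhs` value in the data."""
    return df.groupby(lhs)[rhs].nunique(dropna=False).max() <= 1

sample = pd.DataFrame({
    "Employee_ID": [1, 1, 2, 2],
    "Device_ID":   [10, 11, 10, 10],
    "Device_Mode": ["in", "out", "in", "in"],
})

print(fd_holds(sample, ["Employee_ID", "Device_ID"], "Device_Mode"))  # True for this extract
print(fd_holds(sample, ["Employee_ID"], "Device_Mode"))               # False for this extract
```

Roughly speaking, a dependency of this kind whose determinant is not a key (or, analogously, a non-trivial multivalued dependency) is what flags a table as sitting below BCNF or the fourth normal form.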
5.3.2 Framework Phase 2: Debt Impact and Cost Estimation
Estimating the Debt Quality Impact:
1. As explained in section 4.2.2, the first step is to determine the tables' growth rate. Tables of higher growth
rates are considered to be risky, and their impact on quality will accumulate faster than that of tables with lower growth rates. To collect information about the tables' growth, the lead developer generated a report from the database monitoring system that showed the tables' growth rates in the past month. Note that we had access to the database monitoring system with a one-month log of operations and table information. Table 2 lists the five tables with the highest growth rates. As for the rest of the tables, most of them did not grow during the past month, and the remaining few tables had a substantially low growth rate, which indicates that the impact of those tables on quality attributes will not accumulate in the future. Therefore, we analyzed only the five risky tables presented in Table 2, which are the best candidates for normalization due to their high growth rates, which will accumulate the impact on quality attributes faster than the other tables. Nevertheless, the analysis techniques presented in this study can be applied to any number of tables.

Table 2 Fastest growing tables and their growth rates
Table name                | Growth rate
AttendanceRecords         | 1.04
EmployeeDayAttendanceLogs | 0.87
AuditTrials               | 0.86
Leaves                    | 0.52
LeaveBalances             | 0.68

2. For each of the previous tables:
• For data quality impact measurement, we calculated the risk of data inconsistency using the ISO metric [27]: a script was written in SQL for each debt table that sums the duplicate values in all possible combinations of columns; the result was then divided by the number of columns multiplied by the number of rows. This process was repeated for each of the previous tables to calculate their risk of data inconsistency.
• For maintainability impact measurement, the number of attributes for each of the previous tables was retrieved from the SQL server.
• Finally, to measure performance impact, we extracted the operations executed on those tables and their execution rates in the last month. Since the application was developed in an ORM framework, there were no SQL queries written in the code; the ORM framework translates method calls to SQL queries and executes them on the database. Therefore, operations were extracted from the database monitoring system, which can display the operations that were executed on the database in the last month and their execution rates. A report was generated displaying those operations and how many times they were executed in the last month. Most of the extracted operations were long and complex. All of the operations were analyzed so that only operations executed on the tables of high growth rates were considered. To determine the I/O cost of each operation executed on a debt table, the SET STATISTICS IO feature in SQL Server was used [37]. This feature displays information about the amount of disk activity generated by SQL statements. We re-executed the queries with this feature to display the I/O costs. Some of the queries contain one or more variables; we replaced those variables with actual data selected randomly from the table. Then, the queries were re-executed using the SET STATISTICS IO feature and the average I/O cost was recorded. A list was constructed with each debt table name, the I/O cost of the operations executed on this table, and the execution rate of each operation. After that, using the proposed model explained in section 4.2.1, we calculated the I/O cost for each of the previous tables.
3. We calculated the values of the portfolio variables for each quality attribute (expected return for each table = 1 ÷ quality impact; risk = the table's growth rate reported in Table 2); the quality impact values were the results obtained from the previous step.
4. We implemented a program to aid the execution of the portfolio model for each quality attribute and calculate the optimum weight for each table. The program applies a non-linear optimization method called Generalized Reduced Gradient (GRG) [38] to equation (3) as the fitness function, with equation (4) as the constraint. We ran the portfolio model on the available data three times, separately for each quality attribute, to produce weights for each debt table. The resulting weights for the tables' impact on data quality, maintainability and performance are shown in Table 3, Table 4 and Table 5, respectively (a rough sketch of this step is given after Table 3). The portfolio model assigns the highest weights to tables with low quality impact and low growth rate; therefore, the table with the lowest weight for a quality attribute is the highest-priority table to normalize, due to its high impact on that quality.

Table 3 Debt tables' information and weights of data quality impact
Table name                | Risk of data inconsistency | Expected return | Weight
AttendanceRecords         | 2.5   | 0.4   | 0.0581
EmployeeDayAttendanceLogs | 0.236 | 4.237 | 0.7359
AuditTrials               | 1.31  | 0.763 | 0.1341
Leaves                    | 5.55  | 0.180 | 0.0523
LeaveBalances             | 11.34 | 0.088 | 0.0196
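The following sketch illustrates the portfolio step described in item 4. Equations (3) and (4) are defined earlier in the paper and are not reproduced in this section, so a standard Markowitz-style objective is assumed here (penalise growth-driven risk, reward expected return, weights summing to one), with SciPy's SLSQP solver standing in for the GRG method [38]; it is not the authors' implementation and will not reproduce the exact weights of Tables 3 to 5.

```python
# Hedged sketch of the portfolio weighting step (item 4 above). Since
# equations (3) and (4) are not reproduced in this section, a generic
# Markowitz-style formulation is assumed; SLSQP stands in for GRG [38].
import numpy as np
from scipy.optimize import minimize

impact = np.array([2.5, 0.236, 1.31, 5.55, 11.34])  # risk of inconsistency (Table 3)
growth = np.array([1.04, 0.87, 0.86, 0.52, 0.68])   # growth rates (Table 2)
expected_return = 1.0 / impact                       # item 3: return = 1 / impact

def objective(w, risk_aversion=1.0):
    # growth-weighted "variance" proxy minus the expected portfolio return
    return risk_aversion * np.dot(w ** 2, growth) - np.dot(w, expected_return)

result = minimize(
    objective,
    x0=np.full(len(impact), 1.0 / len(impact)),
    bounds=[(0.0, 1.0)] * len(impact),
    constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},),
    method="SLSQP",
)
print(result.x.round(4))  # the lowest weight flags the highest-priority table
```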
From the results of Table 3, we can determine that table LeaveBalances has the highest priority to normalize to the fourth normal form to improve data quality, since it received the lowest weight. Although this table has a growth rate lower than that of table AttendanceRecords, its risk of data inconsistency is the highest among the tables and will accumulate faster than in the other tables.
It is also observable from the results in Table 3 that the risk of data inconsistency is independent of the normal form of the debt table. As shown, table AttendanceRecords has a higher risk of data inconsistency than table EmployeeDayAttendanceLogs and table AuditTrials, even though the latter tables are in a weaker normal form, which denotes that the normal form alone is not sufficient to make the right decision on which table to normalize.

Table 4 Debt tables' information and weights of maintainability impact
Table name                | Number of attributes | Expected return | Weight
AttendanceRecords         | 6  | 0.1667 | 0.1824
EmployeeDayAttendanceLogs | 8  | 0.125  | 0.1635
AuditTrials               | 10 | 0.1    | 0.1323
Leaves                    | 7  | 0.1429 | 0.3127
LeaveBalances             | 8  | 0.125  | 0.2092

Based on the results from Table 4, table AuditTrials has the highest priority to normalize to decrease complexity. Even though this table has a growth rate lower than AttendanceRecords and EmployeeDayAttendanceLogs, the number of AuditTrials attributes is the highest, and the complexity of this table will increase faster with the growing data to be stored in it. It is also observable that table EmployeeDayAttendanceLogs is the second-priority table to normalize to the fourth normal form, even with a number of attributes similar to table LeaveBalances; however, the growth rate of table EmployeeDayAttendanceLogs is higher, which will increase the complexity of the table faster.

Table 5 Debt tables' information and weights of performance impact
Table name                | Average I/O cost | Expected return | Weight
AttendanceRecords         | 62822956 | 1.59    | 0.000073
EmployeeDayAttendanceLogs | 27897228 | 3.58    | 0.000197
AuditTrials               | 23020    | 4344.05 | 0.242
Leaves                    | 15184    | 6585.88 | 0.606
LeaveBalances             | 46240    | 2162.63 | 0.152

Table 5 indicates that table AttendanceRecords has the highest priority to normalize to the fourth normal form to improve performance. This is due to the fact that the I/O cost incurred by operations executed on this table is the highest among the tables, meaning that the cost of the operations performed on that table and their monthly execution rates are high, in addition to the highest growth rate of table AttendanceRecords, which will accelerate I/O cost accumulation in the future and affect performance. Table EmployeeDayAttendanceLogs has the second priority for normalization. As seen, the difference between the weights of table AttendanceRecords and table EmployeeDayAttendanceLogs is relatively small, which indicates that both tables are almost equally important to normalize, considering time and budget constraints.
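To make the origin of the figures in Table 5 more concrete, the sketch below shows one plausible way a table-level I/O cost can be aggregated from the per-operation measurements described in section 5.3.2. Equation (2) itself is defined in section 4.2.1 and is not reproduced here; the operation costs and execution rates below are hypothetical.

```python
# Hedged sketch: combining per-operation I/O costs (from SET STATISTICS IO)
# with monthly execution rates into a table-level figure. The exact model is
# equation (2) in section 4.2.1; the numbers below are hypothetical.
operations = [
    # (I/O cost per execution, executions in the last month)
    (12_500, 3_000),
    (48_000, 450),
    (2_300, 12_000),
]

total_io = sum(cost * rate for cost, rate in operations)
weighted_average_io = total_io / sum(rate for _, rate in operations)
print(total_io, round(weighted_average_io, 1))
```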
Estimating the Cost of Normalization:
To estimate the cost of normalizing each table to the fourth normal form, we met with the technical team and determined all the major refactoring tasks required to normalize the tables. The tasks and sub-tasks were as described in section 4.2.3, and the cost of each candidate table was analyzed in accordance with those tasks. The tasks were further refined into sub-tasks to ease and concretize costing. For each sub-task under consideration, we solicited from the team three estimates of the required time: best, most likely and worst case. The team used previous experience to arrive at the best- and worst-case estimates; for the most likely case, the team suggested taking the average of the best and worst. In our calculation, we considered the most likely time estimate for each sub-task and calculated the total time required to normalize each table using equation (6). The results are shown in Table 6. It is worth noting that the estimates are in the context of the team's experience and capabilities to perform the said tasks, calculated in hours (i.e. they are relative estimates).

Table 6 Normalization cost for each table
Table name                | Average cost to normalize to fourth normal form (hours)
AttendanceRecords         | 268
EmployeeDayAttendanceLogs | 235
AuditTrials               | 252
Leaves                    | 524
LeaveBalances             | 525
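The sketch below illustrates the costing procedure just described. Equation (6) is not reproduced in this section; following the text, each sub-task's most likely estimate is taken as the average of its best and worst cases, and a table's cost is the total over its sub-tasks. The sub-task names and hour figures are hypothetical.

```python
# Hedged sketch of the three-point cost estimation (equation (6) is defined
# earlier in the paper). Sub-task names and hour figures are hypothetical.
subtasks = {
    "design the decomposed schema": (8, 24),     # (best, worst) in hours
    "write data migration scripts": (16, 60),
    "refactor the application code": (40, 120),
    "regression testing": (16, 48),
}

most_likely = {task: (best + worst) / 2 for task, (best, worst) in subtasks.items()}
table_cost_hours = sum(most_likely.values())
print(most_likely)
print(table_cost_hours)   # total most-likely hours to normalize the table
```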
5.3.3 Framework Phase 3: Making Decisions Using the TOPSIS Method
So far, we have estimated the impact of the debt tables on three quality attributes, and we have estimated the cost of normalizing each table to the fourth normal form. The next step is to incorporate all this information to prioritize the tables that need to be normalized. The proposed approach uses the TOPSIS method detailed in section 4.3; we implemented a program to aid in the TOPSIS calculations, following the equations explained in section 4.3.
Step 1: Construct the evaluation matrix. Table 7 shows the evaluation matrix, which consists of the debt tables, the criteria, the tables' weights from the portfolio analysis, and the tables' normalization cost.
Table 7 Evaluation Matrix
Table/Criteria            | Performance | Data Quality | Maintainability | Cost (hours)
AttendanceRecords         | 0.000073 | 0.0581 | 0.1824 | 268
EmployeeDayAttendanceLogs | 0.000197 | 0.7359 | 0.1635 | 235
AuditTrials               | 0.242    | 0.1341 | 0.1323 | 252
Leaves                    | 0.606    | 0.0523 | 0.3127 | 524
LeaveBalances             | 0.152    | 0.0196 | 0.2092 | 525

The numeric calculations of steps 2 to 6 are given in the Appendix. The calculations followed the principles and equations corresponding to each step as described in section 4.3.
Fig 3 plots the distance calculation results of step 5. The vertical axis corresponds to the distance between each table and the worst condition (Diw), while the horizontal axis presents the distance from the best condition (Dib). Looking at the figure, we can observe that tables that fall above the diagonal are the most promising tables to normalize, since they have the shortest distance from the best condition and the longest distance from the worst condition. As seen, table AttendanceRecords is potentially the best table to normalize to improve the three qualities at the minimum cost possible. Although table EmployeeDayAttendanceLogs is possibly the cheapest to normalize, this table is the farthest from the best condition, as opposed to table LeaveBalances, which is closer to the best condition despite having the highest cost of normalization. This can be justified through the distances between each table and the best solution for each criterion. For example, table EmployeeDayAttendanceLogs has the lowest priority when it comes to improving data quality, and the distance between EmployeeDayAttendanceLogs and the best solution for data quality is equal to 0.238 (calculated as the difference between the EmployeeDayAttendanceLogs value for data quality in the weighted normalized matrix and the best solution for data quality, respectively depicted in Table 2A and Table 3A of the Appendix). On the other hand, table LeaveBalances has the lowest priority in cost (since it is the most expensive table to normalize); however, the distance between LeaveBalances and the best solution in the cost criterion is equal to 0.0842 (calculated as the difference between the LeaveBalances value for cost in the weighted normalized matrix and the best solution for cost, respectively depicted in Table 2A and Table 3A of the Appendix). Since the latter distance is shorter, LeaveBalances is penalized less by its cost than EmployeeDayAttendanceLogs is penalized by its data quality impact. Since all the criteria are assumed to be equally important, and taking into account all the distances, table EmployeeDayAttendanceLogs is farther from the best condition and closer to the worst condition.

Fig 3 Debt tables' distance from best and worst condition

Step 7: Rank the preference order. The results of the final step and the tables' ranking are shown in Table 8.

Table 8 Debt tables' ranks
Table/Criteria            | Rank
AttendanceRecords         | 1
EmployeeDayAttendanceLogs | 4
AuditTrials               | 2
Leaves                    | 5
LeaveBalances             | 3

The results from Table 8 are consistent with the analysis described for the chart in Fig 3. As seen in the table, AttendanceRecords is ranked first among the tables that should be normalized to improve performance, data quality and maintainability at minimum cost.
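The sketch below reproduces the TOPSIS mechanics of steps 1 to 7 on the evaluation matrix of Table 7. It assumes equal criterion weights and treats every criterion as "smaller is better" (a low portfolio weight signals high impact, and a low cost is preferable), which is consistent with the discussion above; with these assumptions the quoted distances (0.238 for data quality, 0.0842 for cost) and the order reported in Table 8 are obtained, although the authors' own implementation may differ in detail.

```python
# Hedged sketch of the TOPSIS ranking (section 4.3) on Table 7's matrix,
# assuming equal criterion weights and that every criterion is minimised.
import numpy as np

tables = ["AttendanceRecords", "EmployeeDayAttendanceLogs",
          "AuditTrials", "Leaves", "LeaveBalances"]
X = np.array([
    [0.000073, 0.0581, 0.1824, 268],   # performance, data quality, maintainability, cost
    [0.000197, 0.7359, 0.1635, 235],
    [0.242,    0.1341, 0.1323, 252],
    [0.606,    0.0523, 0.3127, 524],
    [0.152,    0.0196, 0.2092, 525],
])
w = np.full(X.shape[1], 1 / X.shape[1])       # equal criterion weights

V = X / np.linalg.norm(X, axis=0) * w         # steps 2-3: normalise, then weight
best, worst = V.min(axis=0), V.max(axis=0)    # step 4: ideal and anti-ideal points
d_best = np.linalg.norm(V - best, axis=1)     # step 5: Euclidean distances
d_worst = np.linalg.norm(V - worst, axis=1)
closeness = d_worst / (d_best + d_worst)      # step 6: relative closeness

for name, c in sorted(zip(tables, closeness), key=lambda t: -t[1]):
    print(f"{name}: {c:.3f}")                 # step 7: preference order
```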
Sensitivity Analysis:
The purpose of the sensitivity analysis is to determine how sensitive the analysis is to the developers' cost estimates and the range they provided, which inherently suffer from uncertainty. Sensitivity analysis allows the developers to explore changes (incremental in our experiment) in the cost estimates to understand how they affect the ordering of the tables. To do this, we ran the TOPSIS calculations several times with different cost values for one table at a time. To perform the sensitivity analysis, we applied an increment of +10% at each calculation, moving from the most likely cost estimate towards the worst-case one. Table 9 shows the results of the analysis.

Table 9 Sensitivity analysis of the cost estimations
Table name                | Rank | Cost | Worst cost percentage increase | Rank after increase
AttendanceRecords         | 1 | 268 | 33.21% | 1
EmployeeDayAttendanceLogs | 4 | 235 | 34.04% | 4
AuditTrials               | 2 | 252 | 32.14% | 3
Leaves                    | 5 | 524 | 30.73% | 5
LeaveBalances             | 3 | 525 | 35.81% | 3
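As a companion to the ranking sketch above, the following is a hedged sketch of the sensitivity loop behind Table 9: one table's cost is raised in +10% steps towards its worst-case estimate and the TOPSIS ranks are recomputed, under the same equal-weight, minimise-everything assumptions.

```python
# Hedged sketch of the cost sensitivity analysis: recompute the TOPSIS ranks
# while one table's cost (here AuditTrials, row index 2, column index 3) is
# increased in +10% steps. Assumptions match the ranking sketch above.
import numpy as np

def topsis_ranks(X, w):
    V = X / np.linalg.norm(X, axis=0) * w
    d_best = np.linalg.norm(V - V.min(axis=0), axis=1)
    d_worst = np.linalg.norm(V - V.max(axis=0), axis=1)
    closeness = d_worst / (d_best + d_worst)
    order = np.argsort(-closeness)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(X) + 1)
    return ranks

X = np.array([[0.000073, 0.0581, 0.1824, 268],
              [0.000197, 0.7359, 0.1635, 235],
              [0.242,    0.1341, 0.1323, 252],
              [0.606,    0.0523, 0.3127, 524],
              [0.152,    0.0196, 0.2092, 525]])
w = np.full(4, 0.25)

for step in range(1, 5):
    X_mod = X.copy()
    X_mod[2, 3] *= 1 + 0.10 * step            # raise AuditTrials' cost estimate
    print(f"+{10 * step}%:", topsis_ranks(X_mod, w))
```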
The worst cost percentage increase column of Table 9 depicts the percentage increase in cost needed to reach the worst-case estimates. As an example, AttendanceRecords needs a 33.21% increase in cost to reach its worst-case estimate; the last 10% increment covers the worst cost percentage increase. The results of the table are plotted in Fig 4. As seen, all the tables' rankings were stable within the worst-case increase in cost, except for table AuditTrials, whose rank changed from second to third with a 10.8% increase in its normalization cost, while table LeaveBalances changed from third to second in the ranking. However, this change does not affect the ranking of the first prioritized table (i.e. AttendanceRecords) that should be normalized to the fourth normal form.

Fig 4 Stability of tables' ranking after cost increase

5.3.4 Future scenario simulation and debt table normalization
As discussed, we performed this case study by simulating the phases of the proposed framework using real data from the human resource application. To evaluate the results, we simulated a future scenario and executed the following:
• We used a data generation tool (the Redgate tool [36]), suggested by the developers, to populate the tables with the required amount of data to approximately emulate the likely growth rate.
• We re-measured the tables' impact on data quality, performance and maintainability in the same manner as we did when we executed the phases of the framework. Results are shown in Table 10.

Table 10 Debt tables' impact on quality attributes after tables' growth
Table name                | Risk of data inconsistency | Performance (I/O cost) | Maintainability
AttendanceRecords         | 3.63  | 122290310 | 6
EmployeeDayAttendanceLogs | 0.425 | 66765276  | 8
AuditTrials               | 1.853 | 40092     | 10
Leaves                    | 6.91  | 29924     | 7
LeaveBalances             | 12.28 | 86364     | 8

• We normalized table AttendanceRecords to the fourth normal form. The normalization included several tasks: decomposing the table into three tables (AttendanceRecords, DeviceMode and DeviceProcess), writing scripts to migrate data to the new tables, creating indexes, and refactoring the application code to reflect the changes.
• Finally, we measured the improvement after the table's normalization in the three quality attributes as follows:
For data quality: we measured the risk of data inconsistency in all three tables using the ISO metric in equation (1).
For performance: we extracted the new SQL code of the operations that used to retrieve data from AttendanceRecords; the new operations now retrieve data from the three new tables after decomposition. We obtained the total I/O costs of the operations executed on the three new tables. Then, assuming the execution rate of each operation is the same as before normalization, we used equation (2) explained in section 4.2.1 to get the total average I/O cost of the operations executed on the three tables.
For maintainability: the number of attributes of table AttendanceRecords was easily obtained from the SQL server.
The results of the quality impact improvements are shown in Table 11.

Table 11 Table AttendanceRecords impact on quality attributes after normalization
Quality attribute      |                             | Value
Data quality           | Before normalization        | 3.63
Data quality           | After normalization         | 1.137
Data quality           | % Rate of impact reduction  | -68.68%
Performance (I/O cost) | Before normalization        | 122290310
Performance (I/O cost) | After normalization         | 87751829
Performance (I/O cost) | % Rate of impact reduction  | -28.24%
Maintainability        | Before normalization        | 6
Maintainability        | After normalization         | 5
Maintainability        | % Rate of impact reduction  | -16.7%

5.4 Analysis and Discussion
In this section, we present a critical evaluation of our framework by assessing it against the evaluation questions defined in section 5.
1. Does the prioritization framework provide a systematic and effective method to guide the prioritization of technical debt in database normalization?
The first aspect we discuss is to what extent the framework facilitates the normalization decision-making process. Ease of use is a key aspect that has motivated the development of the framework, and several aspects of the framework contribute to making it easy to use. For
instance, the use of tools (the database monitoring tool to elicit required information, and programs to execute some of the steps in each phase of the method) facilitates the application of the method without previous experience in database normalization. The framework is based on intuitive concepts organized in a structured manner, which contributes to its repeatability. Furthermore, users of the framework do not need to understand the details of the decision theory behind TOPSIS or the portfolio analysis technique. On the other hand, users must have a comprehensive understanding of organization resources and knowledge about the refactoring tasks to estimate the cost of normalization. The guidance we provided during the case study process, to execute each step and obtain the required information, proved to be beneficial and easy to apply in the context of the human resource web application.
Concerning the systematic process of the framework, we have observed that the framework covers all major steps necessary to conduct a disciplined selection process. The framework phases provide a clear, structured process to guide the acquisition and analysis of relevant information. This information assists the developers in making informed decisions, and therefore in choosing a table to normalize to improve quality at minimum cost. The added value our framework gives to the decision process is that it provides guidance on how to measure the debt impact of individual tables to facilitate the prioritization process. Concerning normalization cost estimation, our work is consistent with current practices in considering the effort and person-hours per task. These task costs were estimated using experts' judgment. We note that the paper is not claiming a contribution on effort estimation for database refactoring; in addition to experts' judgment, future work may consider developing specialized parametric models for the cost/effort of database refactoring, which is outside the scope of our current contribution.
Though our work provides a list of prioritized tables that need to be normalized based on the debt information, the developers may decide to further revisit the list to eliminate tables that are not worth the consideration. The decision to eliminate a table from the list is highly dependent on the expertise of the developers, their familiarity with a similar case, and whether the improvements are not worth the consideration. Nevertheless, we have taken a conservative approach by using the available resources to decide on the tables to normalize among the prioritized list. Table 12 shows that the conventional approach, which encourages normalizing all the tables to the fourth normal form, is more costly, time consuming and ad hoc than the debt-aware approach for four scenarios. The first scenario prioritizes debt tables based on their impact on data quality, where the portfolio results from Table 3 suggest normalizing only one table, LeaveBalances, based on the biggest amount of data duplication it holds and its growth rate compared to other tables. The second scenario considers the performance impact of each debt table and the likely accumulation of this impact in the future; by utilizing the portfolio approach, the results from Table 5 suggest two tables for normalization, AttendanceRecords and EmployeeDayAttendanceLogs. To minimize complexity, the third scenario suggests normalizing table AuditTrials, based on the results from Table 4. Finally, to improve all three quality attributes while taking into account the effort cost of normalizing each table, table AttendanceRecords is suggested to be normalized.

Table 12 Difference in effort between the conventional approach and the debt-aware approach of normalization
Approach                                         | Number of tables to normalize
Conventional approach                            | 17
Debt-aware approach to improve data quality     | 1
Debt-aware approach to improve performance      | 2
Debt-aware approach to improve maintainability  | 1
Debt-aware approach to improve three qualities  | 1

Moreover, depending on available resources, developers can include more tables to normalize and justify their decisions based on the debt tables' impact on data quality, performance, maintainability and cost.
Our approach is built on the premise that the identified root causes of the debt have to do with inadequate normalization, and its impact is observed on qualities such as performance, maintainability and data quality. The architect may, of course, consider other architectural tactics and fixes to rectify these problems. Though such tactics and fixes may provide immediate benefits in some contexts, they will not eliminate the root causes of the debt, and the debt will continue to be dormant. The architect may need to weigh the long-term benefits and costs of these fixes against those of normalization; otherwise, these fixes can be regarded as taking a debt on existing debt.
2. Does the debt quality impact estimation provide insights to identify tables that are most affected by the debt?
Our interest in debt impact is centered on the fact that it is suitable for identifying tables that are candidates for normalization based on their data quality, performance and maintainability impact, regardless of the normal form of the table. The notion of debt impact is particularly appropriate to understand to what extent a potential debt table may hurt quality compared to other potential debt tables, and therefore to facilitate the decision-making process regarding which table to normalize.
We started the case study by identifying potential debt tables (i.e. tables below the fourth normal form) in the database. After the identification process, it could be suggested to normalize the weakest normal form tables to improve the design and the system's quality. That suggestion would rest on the fact that, in normalization theory, the higher the normal form of the table, the better the design will be. This is true "design-wise"; however, when the notion of quality impact is introduced and the data quality of the debt tables is measured, the evidence shows that weakly normalized tables may have better data quality than higher normal form tables, as observed from Table 3. This is due to the fact that the data quality metric, present
from ISO, measures the risk of data inconsistency based on the amount of duplicated data stored in the table. Some weakly normalized tables store a smaller amount of data and thus have less impact on data quality. This empirical evidence reinforces our assumption that the prioritization process is independent of the table's normal form. Similarly, when measuring the performance impact, our proposed model is based on the I/O cost of the operations executed on the table, which is determined by the table size and the execution rate of the operations, regardless of the table's normal form.
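As an illustration of the metric discussed here, the sketch below implements one simple reading of the duplicate-based measurement described in section 5.3.2 (duplicated values summed over columns, divided by the number of columns times the number of rows). The exact formulation is equation (1) of the paper and the ISO/IEC 25024 measure [27]; the table extract below is hypothetical.

```python
# Hedged sketch of a duplicate-based inconsistency-risk measure; the paper's
# exact formula is equation (1) (ISO/IEC 25024 [27]). Data is hypothetical.
import pandas as pd

def inconsistency_risk(df: pd.DataFrame) -> float:
    rows, cols = df.shape
    duplicated = sum(df[col].duplicated(keep=False).sum() for col in df.columns)
    return duplicated / (cols * rows)

extract = pd.DataFrame({
    "Employee_ID": [1, 2, 3, 4],
    "Leave_Type":  ["annual", "annual", "sick", "annual"],
    "Balance":     [30, 30, 15, 30],
})
print(inconsistency_risk(extract))   # 0.5 for this small extract
```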
Our maintainability metric does not solely rely on the number of attributes in a table, but also considers the growth rate of the table size. Our contribution is not aimed at designing a maintainability metric for databases per se; however, our approach is flexible enough to make use of other maintainability metrics, if available. These metrics can be parametric; open-ended, based on experts' judgment and/or back-of-the-envelope calculation for a given domain; or can utilize more sophisticated methods, such as learning from cross-company/project databases to estimate the likely maintainability effort for a given case. The measures can incorporate various maintainability concerns and dimensions depending on the context.
3. Is the portfolio analysis technique effective in prioritizing debt tables with high growth rate in the future?
As with most database applications, the human resource web application grows in the amount of data stored in the database. This growth is a significant factor in making a better decision regarding debt payment. Tables with a high growth rate will accelerate the impact of the debt and its accumulation on the three qualities faster than the other tables of lower growth rates. Therefore, if a table is not likely to grow, or its growth rate is lower than that of other tables, a strategic decision would be to keep the debt and defer its payment. The portfolio analysis provides the developers with an objective tool to assess and rethink their normalization decisions using the debt impact and the risk of table growth in the future.
The exercise of applying the portfolio analysis to the human resource web application proved to be valid based on the data reported after simulating the future scenario and enlarging each table under analysis according to its growth rate. After data growth, we re-measured the debt tables' impact on the three quality attributes; the results are shown in Table 10.
The portfolio prioritization, applied before data growth, is depicted in Tables 3, 4, and 5, while Table 10 shows these tables' measured impact after data growth. The results show consistency with the portfolio prioritization results. For example, Table 10 shows LeaveBalances to have the highest risk of data inconsistency after data growth. The same table (i.e. LeaveBalances) had the highest priority to be normalized to the fourth normal form, as suggested by the portfolio analysis in Table 3, where LeaveBalances had the lowest weight, which implies the highest priority to normalize. Since the number of attributes has not changed, maintainability tends to be sensitive to the growth of data stored in the table, i.e. rows. Similarly, for performance, operations executed on table AttendanceRecords appear to have the highest I/O cost, as shown in Table 10, which is consistent with the portfolio prioritization results in Table 5, where AttendanceRecords had the lowest weight.
The results indicate the fitness of the proposed technique towards improving quality considering future growth. An additional benefit of using the portfolio approach is that its input is specific data drawn from the quality measurements and from the database monitoring tool that elicited the tables' growth rates, which eliminates biased and ad hoc decisions regarding which table to normalize.
4. Does the TOPSIS method provide useful guidance to make informed normalization decisions?
Improving the quality of decisions made for database normalization is the main objective of this research. In this context, the objective of the TOPSIS method is to organize the relevant information gathered through the prioritization framework. Utilizing the TOPSIS method to prioritize the debt tables can benefit the developers in two aspects. First, it gives the developers the flexibility to include or exclude the quality attributes that are necessary for a given valuation context. Moreover, based on available resources and the system's conditions, developers are able to weigh the level of importance of each quality attribute.
In the case study, information about the debt impact on data quality, performance and maintainability, and the cost of database normalization, was estimated and gathered during framework phase 2. Developers need a structured, flexible and simple method to organize all this information to make the best decisions. The TOPSIS method represents a tool that can be utilized at any time when new information becomes available or when a specific quality dimension is required to be improved over the others, which may change as the system evolves. The TOPSIS method has given explicit reasoning of how trade-offs are achieved to deal with the factors affecting the debt tables' prioritization.
To further evaluate our proposed approach, we worked with the development team to normalize table AttendanceRecords to the fourth normal form and refactor the application code as described in section 5.3.4. After normalization, we re-measured the impact of the newly decomposed tables on data quality, performance and maintainability as explained in section 5.3.4. The results are shown in Table 11.
As shown in Table 11, after normalization to the fourth normal form, the impact on data quality is reduced by 68.68%, the performance impact is decreased by 28.24%, and the maintainability impact is decreased by 16.7%. The results provide evidence of the negative impact that was attributed to a normal form weaker than the fourth normal form.
In summary, the developers' feedback on the framework has been positive. The approach was eye opening and has encouraged the team to reconsider their current practices. In particular, the developers often avoid, at any expense, restructuring the database to tune performance, due to the lack of guidance on which table to refactor among the available ones. Our approach has led the developers to recognize that refactoring the database for normalization is essential to address the root cause of the problem
and improve quality; they appreciate the structured approach to prioritize the tables that should be normalized to improve the quality at the minimum cost possible. Moreover, combining the systematic debt analysis with measuring the debt impact of the table on the three qualities, based on data from the database and the monitoring system, has provided the developers with previously uncovered knowledge for informing the refactoring exercise. The approach was deemed to be more informative and objective regarding database normalization for mitigating the debts than ad hoc normalization and refactoring.

5.5 Threats to Validity
Our evaluation has used an industrial case study. The results of case studies are difficult to generalize [39], due to their specificity to the case/domain. In this study, the presented project is a database application, developed and maintained using specific technologies and infrastructure, which may have a different cost pattern and management style than other projects. For example, managing trade-offs between qualities affected by the debt can be more complex in an integrated database project (i.e. a database that serves multiple applications [40]). Moreover, the evaluation has used the case study developers for feedback on the overall effectiveness of the framework.
Therefore, our findings reflect experience in just one particular case, the goal being transferability, not generalizability: the specific findings of this study may be applicable to other projects, and the general lessons learned should be instructive to those applying and studying this approach in any situation. Further analysis may need to look at applying the method on several applications of industrial scale and/or report on the applicability of the method using more than one independent team. Both routes are non-trivial and require a careful empirical/field study that goes beyond the scope of this paper. Nevertheless, this is subject to future work.

6 RELATED WORK
Ward Cunningham was the first to introduce this metaphor in 1992 on the code level, as a trade-off between short-term business goals (e.g. shipping the application early) and long-term goals of applying the best coding practices [41]. The majority of research has focused on code- and architectural-level debt [10], [6]. In [11], Weber et al. discussed a specific type of database schema design debt: missing foreign keys in relational databases. To illustrate the concept, the authors examined this type of debt in an electronic medical record system called OSCAR, and proposed an iterative process to manage foreign key debt in different deployment sites. Our work is different as it is driven by normalization rules, which are fundamentally different in their requirements and treatment. Tables in databases have specific requirements, and the likely impact of normalization can be observed on different quality metrics that are database specific.
As our work is close to debt prioritization, the authors in [32] utilized prioritization to manage code-level debts in software design. The authors estimated the impact of God classes on software maintainability and correctness, and prioritized the classes that should be refactored based on their impact on those qualities. Portfolio Theory was proposed by researchers to manage technical debt [42]. In [42] the authors viewed each debt item as an asset, and they utilized Portfolio Theory to construct a portfolio of debt items that should be kept based on debt principal, which they defined as the effort required to remove the debt, and debt interest, which is the extra work needed if the debt is not removed. Portfolio Theory was also proposed to manage requirements compliance debt in [43]. The authors viewed compliance management as an investment activity that needs decisions to be made about the right compliance goals under uncertainty. They identified an optimal portfolio of obstacles that needed to be resolved, and addressed value-driven requirements based on their economics and risks.
Since the process of normalizing a database can be very complex and costly, researchers have proposed algorithms to automate or facilitate the normalization process [44], [45], [46]. Their aim was to produce up to third normal form or BCNF tables automatically. However, these studies looked at the database schema in isolation from the applications using the database. It is important to consider the applications to better estimate the cost of normalization, taking into account refactoring and data migration tasks. Since this process can be very costly, this study aims to provide a method to prioritize the tables to be normalized to improve the design and avoid negative consequences.

7 CONCLUSION AND FUTURE WORK
We have explored the concept of technical debt in database design that relates to database normalization. Normalizing a database is acknowledged to be an essential process to improve data quality, maintainability and performance. Conventional approaches to normalization are driven by the acclaimed structural and behavioral benefits that higher normal forms are likely to provide, without explicit reference to value and debt information. The problem lies in the fact that developers tend to overlook this process for several reasons, such as saving effort cost, lack of expertise, or meeting the deadline. This can imply a debt that needs to be managed carefully to avoid negative consequences in the future. Conversely, designers tend to embark on database normalization without clear justification for the effort and debt avoidance.
We reported on a framework to manage normalization debt in database design. A table below the fourth normal form is viewed as a potential debt item. Though we considered tables below the fourth normal form as potential debt items, in practice most databases lag behind this level [14]. Among the reasons for avoiding normalization, database refactoring is often acknowledged to be a tedious and expensive exercise which developers avoid at any expense due to its unpredictable outcome [5]. Additionally, it is impractical for developers to use the fourth normal form as the only criterion to drive the normalization exercise. To overcome these problems, we proposed a framework to prioritize tables and their candidacy for normalization. The framework utilizes the Portfolio theory and the
TOPSIS method to rank the tables based on their estimated impact on data quality, performance and maintainability, in addition to the cost of normalization.
The framework was applied to an industrial case study of a database-backed web application for human resource management. The database consists of 97 tables, exhibiting complex dependencies and populated with a large amount of data. 17 out of the 97 tables were considered as potential debt tables. Out of the 17 tables, 5 exhibited a high growth rate and were found to have the highest risk of impact accumulation on data and system quality. Our framework ranked these tables based on their data and system quality impact and their corresponding cost of normalization. Depending on the availability of resources and personnel, developers may normalize the 5 tables, with the highest priority given to tables with the highest impact and cost. The results show that rethinking conventional database normalization from the debt angle, by not relying only on the normal form to re-design the database, can provide more systematic guidance to justify normalization decisions and improve quality at minimum cost. The proposed framework focused on the prioritization aspect of managing the debt. The evaluation of the overall technical debt management framework may require an extensive empirical and/or field study, where the management of debt items may require monitoring a longer period of time for debt and impact accumulation, which can be subject to future work.
Though our framework has been described and applied to a standalone database, it can be further extended to serve other industrial large-scale systems, such as integrated databases. However, the formulation can be more complex, as the prioritization needs to consider the various quality requirements per application in the integrated schema and manage their trade-offs and conflicts when consolidating the final set of requirements to be considered in the analysis, which is subject to future work due to its specificity in treatment.

ACKNOWLEDGMENTS
Copyright 2019 IEEE.
For Mashel Albarak: This research is supported by the Deanship of Scientific Research, King Saud University, through the initiative of DSR Graduate Students Research Support.
For Robert Nord and Ipek Ozkaya: This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. References herein to any specific commercial product, process, or service by trade name, trade mark, manufacturer, or otherwise, do not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. DM19-1256.

REFERENCES
[1] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377–387, Jun. 1970.
[2] C. Date, Database Design and Relational Theory: Normal Forms and All That Jazz. O'Reilly Media, Inc., 2012.
[3] D. G. Jha, Computer Concepts and Management Information Systems. PHI Learning Pvt. Ltd., 2013.
[4] "DB-Engines Ranking - popularity ranking of database management systems." [Online]. Available: http://db-engines.com/en/ranking. [Accessed: 16-Aug-2016].
[5] S. W. Ambler and P. J. Sadalage, Refactoring Databases: Evolutionary Database Design. Pearson Education, 2006.
[6] Z. Li, P. Avgeriou, and P. Liang, "A systematic mapping study on technical debt and its management," Journal of Systems and Software, vol. 101, pp. 193–220, Mar. 2015.
[7] "TechDebt 2018 International Conference on Technical Debt - TechDebt 2018." [Online]. Available: https://2018.techdebtconf.org/track/TechDebt-2018-papers#Previous-Editions. [Accessed: 03-Jun-2018].
[8] "Euromicro DSD/SEAA 2018 | SEAA 2018." [Online]. Available: http://dsd-seaa2018.fit.cvut.cz/seaa/index.php?sec=sessions_seated#page_header. [Accessed: 22-Mar-2018].
[9] S. D.-L.-Z. für I. G. Wadern 66687, "Schloss Dagstuhl: Seminar Homepage." [Online]. Available: http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16162. [Accessed: 07-Aug-2017].
[10] C. Fernández-Sánchez, J. Garbajosa, A. Yagüe, and J. Perez, "Identification and analysis of the elements required to manage technical debt by means of a systematic mapping study," Journal of Systems and Software, vol. 124, pp. 22–38, Feb. 2017.
[11] J. H. Weber, A. Cleve, L. Meurice, and F. J. B. Ruiz, "Managing Technical Debt in Database Schemas of Critical Software," 2014, pp. 43–46.
[12] M. Albarak and R. Bahsoon, "Prioritizing Technical Debt in Database Normalization Using Portfolio Theory and Data Quality Metrics," in Proceedings of the International Conference on Technical Debt, Gothenburg, Sweden, 2018.
[13] M. Albarak, M. Alrazgan, and R. Bahsoon, "Identifying Technical Debt in Database Normalization Using Association Rule Mining," in 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Prague, 2018, pp. 437–441.
[14] R. Elmasri and S. B. Navathe, "Database systems: models, languages, design, and application programming," 2011.
[15] H. Markowitz, "Portfolio selection," The Journal of Finance, vol. 7, no. 1, pp. 77–91, 1952.
[16] M. Piattini, C. Calero, and M. Genero, "Table oriented metrics for relational databases," Software Quality Journal, vol. 9, no. 2, pp. 79–97, 2001.
[17] C.-L. Hwang and K. Yoon, "Methods for multiple attribute decision making," in Multiple Attribute Decision Making, Springer, 1981, pp. 58–191.
[18] S. Marion, A. Kagan, and H. Shimura, "Performance criteria for relational databases in different normal forms," vol. 34, no. 1, pp. 31–42, 1996.
[19] I. M. Keshta, "Software Refactoring Approaches: A Survey," International Journal of Advanced Computer
Science and Applications, vol. 8, no. 11, pp. 542–547, 2017.
[20] G. Vial, "Database refactoring: Lessons from the trenches," IEEE Software, vol. 32, no. 6, pp. 71–79, 2015.
[21] K. Hamaji and Y. Nakamoto, "Toward a Database Refactoring Support Tool," in 2016 Fourth International Symposium on Computing and Networking (CANDAR), 2016, pp. 443–446.
[22] P. Khumnin and T. Senivongse, "SQL antipatterns detection and database refactoring process," in Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2017 18th IEEE/ACIS International Conference on, 2017, pp. 199–205.
[23] A. Cleve, M. Gobert, L. Meurice, J. Maes, and J. Weber, "Understanding database schema evolution: A case study," Science of Computer Programming, vol. 97, pp. 113–121, Jan. 2015.
[24] L. Meurice and A. Cleve, "Dahlia: A visual analyzer of database schema evolution," in Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on, 2014, pp. 464–468.
[25] M. S. Wu, "The practical need for fourth normal form," in ACM SIGCSE Bulletin, 1992, vol. 24, pp. 19–23.
[26] G. Piatetsky-Shapiro, "Discovery, analysis, and presentation of strong rules," Knowledge Discovery in Databases, pp. 229–238, 1991.
[27] "ISO/IEC 25024:2015 - Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of data quality," ISO. [Online]. Available: http://www.iso.org/iso/catalogue_detail.htm?csnumber=35749.
[28] ISO/IEC, "International Standard ISO/IEC 14764, IEEE Std 14764-2006: Software Engineering - Software Life Cycle Processes - Maintenance," 2006.
[29] C. Calero and M. Piattini, "Metrics for databases: a way to assure the quality," in Information and Database Quality, Springer, 2002, pp. 57–83.
[30] G. Papastefanatos, P. Vassiliadis, A. Simitsis, and Y. Vassiliou, "Metrics for the prediction of evolution impact in ETL ecosystems: A case study," Journal on Data Semantics, vol. 1, no. 2, pp. 75–97, 2012.
[31] V. E. Ferraggine, J. H. Doorn, and L. C. Rivero, Handbook of Research on Innovations in Database Technologies and Applications: Current and Future Trends. Information Science Reference, Hershey, PA, 2009.
[32] N. Zazworka, C. Seaman, and F. Shull, "Prioritizing design debt investment opportunities," in Proceedings of the 2nd Workshop on Managing Technical Debt, 2011, pp. 39–42.
[33] H.-S. Shih, H.-J. Shyur, and E. S. Lee, "An extension of TOPSIS for group decision making," Mathematical and Computer Modelling, vol. 45, no. 7–8, pp. 801–813, 2007.
[34] B. Kitchenham, S. Linkman, and D. Law, "DESMET: a methodology for evaluating software engineering methods and tools," Computing & Control Engineering Journal, vol. 8, no. 3, pp. 120–126, Jun. 1997.
[35] "What is Object/Relational Mapping? - Hibernate ORM." [Online]. Available: http://hibernate.org/orm/what-is-an-orm/. [Accessed: 04-Jul-2018].
[36] "SQL Data Generator - Data Generator For MS SQL Server Databases." [Online]. Available: https://www.redgate.com/products/sql-development/sql-data-generator/. [Accessed: 12-May-2019].
[37] CarlRabeler, "SET STATISTICS IO (Transact-SQL)." [Online]. Available: https://docs.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql. [Accessed: 05-Jul-2018].
[38] L. S. Lasdon, A. D. Waren, A. Jain, and M. Ratner, "Design and Testing of a Generalized Reduced Gradient Code for Nonlinear Programming," ACM Trans. Math. Softw., vol. 4, no. 1, pp. 34–50, Mar. 1978.
[39] R. K. Yin, Case Study Research: Design and Methods, Third ed. Sage Publications, 2003.
[40] M. Fowler, "IntegrationDatabase," martinfowler.com. [Online]. Available: https://martinfowler.com/bliki/IntegrationDatabase.html.
[41] W. Cunningham, "The WyCash Portfolio Management System," in Addendum to the Proceedings on Object-oriented Programming Systems, Languages, and Applications (Addendum), New York, NY, USA, 1992, pp. 29–30.
[42] Y. Guo and C. Seaman, "A portfolio approach to technical debt management," in Proceedings of the 2nd Workshop on Managing Technical Debt, 2011, pp. 31–34.
[43] B. Ojameruaye and R. Bahsoon, "Systematic elaboration of compliance requirements using compliance debt and portfolio theory," in International Working Conference on Requirements Engineering: Foundation for Software Quality, 2014, pp. 152–167.
[44] M. Demba, "Algorithm for Relational Database Normalization Up to 3NF," International Journal of Database Management Systems, vol. 5, no. 3, pp. 39–51, Jun. 2013.
[45] Y. Dongare, P. Dhabe, and S. Deshmukh, "RDBNorma: A semi-automated tool for relational database schema normalization up to third normal form," International Journal of Database Management Systems, vol. 3, no. 1, pp. 133–154, Feb. 2011.
[46] J. Diederich and J. Milton, "New methods and fast algorithms for database normalization," ACM Transactions on Database Systems (TODS), vol. 13, no. 3, pp. 339–365, 1988.

Mashel Albarak is with the School of Computer Science at the University of Birmingham, UK, and King Saud University, KSA. Her interests are in managing technical debt in databases and information systems.

Rami Bahsoon is an academic at the School of Computer Science, University of Birmingham, UK. His research is in software architecture, self-adaptive and managed architectures, economics-driven software engineering and technical debt management. He co-edited four books on Software Architecture, including Economics-Driven Software Architecture. He holds a PhD in Software Engineering from University College London and was MBA Fellow at London Business School. He is a fellow of the Royal Society of Arts and Associate Editor of IEEE Software.

Ipek Ozkaya is a technical director at the Carnegie Mellon University Software Engineering Institute, where she develops methods and practices for software architectures, agile development, and managing technical debt in complex systems. She coauthored a book on Managing Technical Debt: Reducing Friction in Software Development (2019). She received a PhD in Computational Design from CMU. Ozkaya is a senior member of IEEE and the 2019–2021 editor-in-chief of IEEE Software magazine.

Robert Nord is a principal researcher at the Carnegie Mellon University Software Engineering Institute, where he develops methods and practices for agile at scale, software architecture, and managing technical debt. He is coauthor of Managing Technical Debt: Reducing Friction in Software Development (2019). He received a PhD in computer science from CMU and is a distinguished member of the ACM.