Managing Technical Debt in Normalization
Abstract— Database normalization is one of the main principles for designing relational databases, the most popular database model, with the objective of improving data and system qualities such as performance. Refactoring the database for normalization can be costly if the benefits of the exercise are not justified. Developers often ignore the normalization process due to the time and expertise it requires, introducing technical debt into the system. Technical debt is a metaphor that describes trade-offs between short-term goals and applying optimal design and development practices. We consider that database normalization debt is likely to be incurred for tables below the fourth normal form. To manage the debt, we propose a multi-attribute analysis framework that makes a novel use of Portfolio Theory and the TOPSIS method (Technique for Order of Preference by Similarity to Ideal Solution) to rank the candidate tables for normalization to the fourth normal form. The ranking is based on the tables' estimated impact on data quality, performance, maintainability, and cost. The techniques are evaluated using an industrial case study of a database-backed web application for human resource management. The results show that the debt-aware approach can provide an informed justification for the inclusion of critical tables to be normalized, while reducing the effort and cost of normalization.
—————————— ◆ ——————————
1 INTRODUCTION
Database normalization is among the core engineering decisions that are performed to meet both structural and behavioral requirements, such as performance, general data qualities and maintainability, among others. Sub-optimality, taking the form of weaker normal forms, can have direct negative consequences on meeting these requirements. Database normalization is essentially a design decision that can incur a technical debt, where the debt interest can manifest as degraded qualities and an increased level of data inconsistency and duplication over the lifetime of the software system. If not managed, the consequence can resemble an accumulated interest on the debt that grows as the data grows over time.

The majority of technical debt research has focused on code- and architecture-level debts [6], [10]. Technical debt linked to database design was first attempted in the context of missing foreign keys in a database [11]. In [12], [13] we were the first to explore a new context of technical debt that relates to database normalization issues. We have considered that potential normalization debts are likely to be incurred for tables below the fourth normal form, since the fifth normal form is regarded as a theoretical form [14].

The underlying assumption of normalization theory is that database tables should be normalized to a hypothetical fifth normal form to achieve benefits [14]. While this assumption holds in theory, it fails in practice due to the time and expertise required. To address this issue, in [12] we proposed a method for prioritizing the tables that should be normalized to the fourth normal form based on their likely impact on data quality and operations performance. For operations performance, the table's growth rate was considered a critical factor to address performance impact accumulation in the future. We utilized portfolio theory [15] to rank the tables based on their impact on performance under the risk of table growth.

The contribution of this research is a multi-attribute decision analysis approach for managing technical debt related to database normalization. Our proposed approach provides database designers and decision makers with a systematic framework to identify the debt items, quantify the likely impact of normalization debts, and provide mitigation strategies to manage the debts through justified and more informed normalization that considers cost, qualities and debts. In this study, the proposed framework focuses on the prioritization aspect of managing technical debt, ranking the tables that are most affected by the weakly normalized design.

This study goes beyond our previous work [12] in the following:

• In addition to data quality and performance, we considered the potential debt tables' impact on maintainability. We used a table's number of attributes as a measure of the table's complexity [16].
• We extend the application of the portfolio analysis technique to cover data quality and maintainability, considering the diversification and mitigation of the tables' growth rate risk across three quality dimensions.
• The cost of normalization is an important factor that we incorporated in the decision analysis process.
• Consequently, managing the debt should consider how the debt relates to data quality, performance, maintainability, and the cost of normalization not in isolation but in conjunction. Therefore, we view normalization debt management as a multi-attribute decision problem, where developers need to prioritize tables that should be normalized based on their impact on the three considered qualities, in addition to the cost of normalization. The framework makes a novel use of the TOPSIS method (Technique for Order of Preference by Similarity to Ideal Solution) [17] to rank the debt tables that should be normalized to the fourth normal form.

The techniques are evaluated using an industrial case study of a database-backed web application for human resource management in a large company. The database consists of 97 tables of industrial scale, each filled with a large amount of data. The results show that the debt-aware approach has provided an informed justification for the inclusion of critical tables to be normalized. Equally important, it reduced the effort and cost of normalization by eliminating unnecessary normalization tasks. Our framework has the promise to replace ad-hoc and uninformed practices for normalizing databases, where the debt and its impact can motivate better design, optimize for resources and justify database normalization decisions.

2 BACKGROUND AND MOTIVATION

In this section, the key concepts of our work are summarized. Normalization theory was introduced by Codd in 1970 [1] as a process of organizing the data in tables. The main goal of normalization is to minimize data redundancy, which is accomplished by splitting a table into several tables after analyzing the dependencies between the table's attributes. The advantages of normalization have been discussed and proved in the literature [2], [14]. Examples of such advantages include improved data quality, as normalization reduces redundancy, reduces update anomalies and facilitates maintenance. Fig 1 illustrates the normal forms hierarchy. A higher normal form indicates a better design [14], since higher levels eliminate more redundant data. The main condition to go higher in the hierarchy is based on the constraint between two sets of attributes in a table, which is referred to as a dependency relationship.

Fig 1 Database Normalization Hierarchy
2.1 Benefits of Database Normalization: Data Quality, Maintainability and Performance

Data quality enhancement is one of the main benefits of database normalization [2], [14]. The enhancement is linked to decreasing the amount of data redundancy as we move higher in the normalization hierarchy. By redundancy we mean recording the same fact more than once in the same table. In poorly or un-normalized tables that suffer from data redundancy, there is always a possibility of updating only some occurrences of the data, which affects data consistency. Data quality is a crucial requirement in all information systems, as the success of any system relies on the reliability of the data retrieved from the system.

In addition to data quality, improved maintainability is another important benefit of database normalization [3], [2]. Weakly or un-normalized table designs involve a larger number of attributes in each table compared to highly normalized tables, which increases the complexity of retrieving data from those tables and of implementing new rules on the tables [3]. Moreover, reducing the size of the table through normalization facilitates the back-up and restore process.

Benefits of normalization can also be observed in operations performance. Performance has always been a controversial subject when it comes to normalizing databases. Some may argue that normalizing the database involves decomposing a single table into more tables; hence, data retrieval can be less efficient, since it requires joining more tables as opposed to retrieving the data from a single table. Indeed, de-normalization was discussed as the process of "down-grading" a table's design to lower normal forms and keeping a limited number of big tables to avoid joining tables when retrieving the required data [2]. Advocates of de-normalization argue that the Database Management System (DBMS) stores each table physically in a file that maintains the records contiguously, and therefore retrieving data from more than one table will require a lot of I/O. However, this argument might not be correct: even though there will be more tables after normalization, joining the tables will be faster and more efficient because the sets will be smaller and the queries will be less complicated compared to the de-normalized design [2]. Moreover, weakly or un-normalized tables will be stored in a large number of files, as opposed to the normalized design, due to the large amount of data redundancy and, consequently, increased record size and increased I/O cost [2], [18]. Adding to this, not all DBMSs store each table in a dedicated physical file; several tables might be stored in a single file due to the reduced table size after normalization, which means less I/O and improved performance, as shown in [2], [18].

Despite the controversy about normalization and performance, several arguments in favor of normalization are presented by C. J. Date [2], an expert who was involved with Codd in the relational database theory:

• Some of the de-normalization strategies proposed in the literature to improve performance are not "de-normalizing", since they do not increase data redundancy. In fact, as Date states, some of them are considered to be good normalized relational databases.
• There is no theoretical evidence that de-normalizing tables will improve performance. It is therefore application dependent and may work for some applications. Nevertheless, this does not imply that a highly normalized database will not perform better.

In this study, we view these deteriorations in data quality, maintainability and performance as debt impact caused by inadequate normalization of database tables. The impact can accumulate over time with data growth and increased data duplication. Short-term savings from not normalizing the table can have long-term consequences, which may call for inevitable expensive maintenance, fixes and/or replacement of the database. Therefore, we motivate rethinking database normalization from the debt perspective, linked to data quality, maintainability and performance issues.

2.2 Cost of Database Normalization

The benefits of normalization come with certain costs. Normalizing the database is a relatively complex process due to the expertise and resources required. Decomposing a single table for normalization may involve [5]:

• Database schema alterations: create the new normalized tables, ensure that the data in the new tables will be synchronized, and finally update all stored views, procedures and functions that accessed the original table before normalization to reflect the changes.
• Data migration: a detailed and secure strategy should be planned to migrate the data from the old weakly or un-normalized table to the new decomposed tables.
• Modifications to the accessing application(s): introduce the new tables' meta-data and update the applications' methods and classes' source code that accessed the original table. Occasionally, normalization may not require refactoring of the application, if the application is well developed and decoupled from the database. Nevertheless, normalization is considered complex, since it would require the other complex tasks mentioned in the points above.

Moreover, testing the database and the applications before deployment can be expensive and time consuming. Therefore, in this study, we formulate the normalization problem as technical debt. We start from the intuitive assumption that tables which are weakly normalized, or not normalized to the deemed ideal form, can potentially carry debt. To manage the debt, we adhere to the logical practice of paying the debt with the highest negative impact on quality, with respect to the cost of normalization.
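The schema-alteration and data-migration tasks above can be illustrated with a minimal sketch. The table and column names below are hypothetical, and Python's built-in sqlite3 module is used only to keep the example self-contained (the case study application ran on SQL Server); this is not the refactoring script used in the case study.

```python
# Minimal sketch of the schema-alteration and data-migration tasks listed above.
# Table/column names are hypothetical; sqlite3 keeps the example self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A weakly normalized table: device details are repeated on every attendance row.
cur.execute("""CREATE TABLE AttendanceRaw (
                   employee_id INTEGER, punch_time TEXT,
                   device_id INTEGER, device_mode TEXT, device_process TEXT)""")

# 1) Schema alteration: create the decomposed tables.
cur.execute("""CREATE TABLE Attendance (
                   employee_id INTEGER, punch_time TEXT, device_id INTEGER)""")
cur.execute("""CREATE TABLE Device (
                   device_id INTEGER PRIMARY KEY, device_mode TEXT, device_process TEXT)""")

# 2) Data migration: copy the data into the decomposed tables without duplication.
cur.execute("""INSERT INTO Device (device_id, device_mode, device_process)
               SELECT DISTINCT device_id, device_mode, device_process FROM AttendanceRaw""")
cur.execute("""INSERT INTO Attendance (employee_id, punch_time, device_id)
               SELECT employee_id, punch_time, device_id FROM AttendanceRaw""")

# 3) Drop the old table once views, stored procedures and the application's
#    data-access code have been updated to read from the new tables.
cur.execute("DROP TABLE AttendanceRaw")
conn.commit()
```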
2.3 Why Multi-attribute Analysis

Multi-attribute analysis techniques help decision makers evaluate alternatives when conflicting objectives must be considered and balanced, and when outcomes are uncertain [17].
The process provides a convenient framework for developing a quantitative debt impact assessment that results in a set of prioritized tables to be normalized. The prioritization is based on the debt impact on data quality, maintainability and performance, and the effort cost of normalizing the table. As this analysis involves multiple conflicting attributes, multi-attribute analysis helps structure the problem and provides systematic and informed guidance to normalize tables to improve quality cost-effectively. In this study, we utilize the TOPSIS method [17] to rank the tables that should be normalized. This method involves several steps to find the best alternative, that is, the one closest to the ideal solution. A detailed explanation of the method is given in section 4.3.

2.4 Database Refactoring and Schema Evolution

Database refactoring is defined as a "simple change to a database schema that improves its design while retaining both its behavioral and informational semantics" [5]. While code refactoring is well researched in the literature [19], database refactoring has received less attention [20]. As mentioned by Vial in [20], only 16 studies on "database refactoring" were found in the ACM and IEEE digital libraries in 2015, including reviews of "Refactoring Databases: Evolutionary Database Design" [5]. Vial reports lessons learned from refactoring a database for industrial logistics applications [20]. One of the key lessons learned is that some refactoring patterns might not yield the benefits envisioned, specifically in the case of reducing data duplication in the database. Therefore, he stated that the database might carry some level of debt (in his case the debt resembles data duplication) as long as the debt is known and documented. Additional studies between 2015 and 2019 on the topic are in line with Vial's conclusion [21], [22].

Schema evolution, on the other hand, has been extensively studied over the past years [23], [24], [5]. Schema evolution is modifying the database schema to evolve the software system to meet new requirements [5]. Researchers have attempted to analyze schema evolution from earlier versions [23], and they have developed tools to automate the evolution analysis process [24]. However, the schema evolution literature has focused on limited tactics, such as adding or renaming a column, changing the data type, etc. Evolving the database schema for the objective of normalizing the database, to create value and avoid technical debt, has not been explored. We posit that normalization is a process that should create value to ensure the system's sustainability and maintainability in the future.

3 NORMALIZATION DEBT DEFINITION

The common ground of existing definitions of technical debt is that the debt can be caused by poor and sub-optimal development decisions that may carry immediate benefits but are not well geared for long-term benefits [10], [6]. Database normalization has been proved to minimize data duplication, and several studies have shown that it also improves performance as the table is further normalized to higher normal forms (see Fig 1) [18], [2]. Therefore, tables in the database that are below the fourth normal form can be subject to debt, as they potentially lag behind the optimum, where the debt can be observed in data consistency, maintainability and performance degradation as the database grows [2]. To address this phenomenon of normalization and technical debt, in previous works [12], [13] we have considered the fourth normal form as the target normal form for the following reasons:

• In practice, most database tables are in third normal form and rarely achieve BCNF [14]. However, fourth normal form is considered a better design, since it is a higher level and more redundant data is eliminated [14].
• The fourth normal form criteria rely on multi-valued dependencies, which are common to address in reality [25].
• While fifth normal form is higher in the normalization hierarchy, it is considered a theoretical value, since it is based on join dependencies between attributes, which rarely arise in practice [14].
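To illustrate the kind of multi-valued dependency that fourth normal form eliminates, the following sketch uses a hypothetical, textbook-style employee/skill/language relation (not a table from the case study): skills and languages are independent facts about an employee, so keeping them in one table forces every combination to be stored.

```python
# Hypothetical example of a table violating fourth normal form (4NF).
# An employee's skills and languages are independent multi-valued facts,
# so storing them together records every skill/language combination.
emp_skill_lang = [
    ("Dina", "SQL",    "English"),
    ("Dina", "SQL",    "Arabic"),
    ("Dina", "Python", "English"),
    ("Dina", "Python", "Arabic"),
]

# 4NF decomposition: one table per independent multi-valued dependency.
emp_skill = sorted({(e, s) for e, s, _ in emp_skill_lang})
emp_lang = sorted({(e, l) for e, _, l in emp_skill_lang})

# Four redundant rows shrink to 2 + 2, and adding a new language no longer
# requires inserting one row per existing skill.
print(len(emp_skill_lang), "->", len(emp_skill), "+", len(emp_lang))
```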
4 PROPOSED FRAMEWORK TO PRIORITIZE TECHNICAL DEBT IN DATABASE NORMALIZATION

To facilitate managing normalization debt in an explicit way, we propose a simple management framework. The framework is meant to help organize the activities and information needed to prioritize the tables in the database that should be normalized to the fourth normal form. As depicted in Fig 2, the framework consists of three major phases: identification of potential debt tables (phase 1), estimating their likely impact and cost (phase 2) and using this
Interest on the debt is the cost paid over time by not resolving the debt [10]. Researchers have coined interest with implications on qualities [32]. The analogy is applicable to the case of normalization debt, as striving for the ideal fourth normal form will reduce the impact on data quality, maintainability and performance. We view that all tables do incur interest over time. However, interest varies between tables in how much and how fast it accumulates, which our work uses to identify tables that are candidates for normalization.

A table's growth rate is a crucial factor that causes the impact of the debt on the three qualities to accumulate faster. For performance, the I/O cost of operations changes with the table's growth rate. If a table is likely to grow faster than other tables, the I/O cost of the operations executed on that table will accumulate faster than others. This is due to the fact that increasing table size implies more disk pages to store the table and, therefore, more I/O cost. Similarly, the risk of data inconsistency increases as the table grows. Regarding maintainability, the existing metric has looked at the number of attributes of each table as a measure of complexity. However, discussing normalization debts makes this measure limited, as the debt can intuitively be linked not only to the number of attributes but also to the growth rate and the population size of the table.

Therefore, the table growth rate is a crucial measure for prioritizing tables that need to be normalized. If a table is not likely to grow, or its growth rate is lower than that of other tables, a strategic decision would be to keep the debt and defer its payment. The table growth rate can be elicited from the database monitoring system. The growth rate of a table can be viewed as analogous to interest risk or interest probability. Interest probability captures the uncertainty of interest growth in the future [10]. Debt tables which experience a high growth rate in data can be deemed to have a higher interest rate. Consequently, these tables are likely to accumulate interest faster.

4.2.2 Prioritizing Debt Tables Using Portfolio Theory

We use portfolio theory to prioritize tables that need to be normalized considering their impact on the three qualities, taking into consideration the likely growth rate of the table size and, hence, the risk of interest accumulation. Modern Portfolio Theory (MPT) was developed by the Nobel Prize winner Markowitz [15]. The aim of this theory is to develop a systematic procedure to support the decision making process of selecting the capital of a portfolio consisting of various investment assets. The assets may include stocks, bonds, real estate, and other financial products on the market that can produce a return through investment. The objective of portfolio theory is to select the combination of assets, using a formal mathematical procedure, that maximizes the return while minimizing the risk associated with every asset. Portfolio management involves determining the asset types that should be invested in or divested and how much should be invested in each asset. This process draws on similarity with the normalization debt management process, where developers can make decisions about prioritizing investments in normalization, based on which technical debt items should be paid, ignored, or can wait further. With the involvement of uncertainty, the assets' expected return and the variance of the return are used to evaluate the portfolio performance.

To fit in portfolio management, each debt table below the fourth normal form is treated as an asset. For each table, we need to determine whether it is better to normalize that table to the fourth normal form (pay the debt) or keep the table in its current normal form (defer the payment). To decide on this, we need to determine the expected return of each debt table. In the case of normalization debt, the expected return of a debt table is derived from the estimated quality impact of the table: tables with the lowest estimated impact are deemed to carry the higher expected return. In other words, if the estimated quality impact of table A is less than the estimated quality impact of table B, then table A's expected return is higher than B's; B then has a higher priority for normalization due to its high quality impact. We balance the expected return with the risk. In portfolio management, this risk is represented by the variance of the return. For the debt tables, this risk is represented by the table's growth rate. Tables with the highest growth rate are considered risky assets; their likely interest, and so their debt, will grow faster than that of tables with a low growth rate.

In order to apply portfolio theory to normalization debt, a few considerations need to be taken into account:

• The expected return of a debt table is equal to 1 ÷ quality impact.
• The risk of each debt table is equal to its growth rate. The growth rate can be elicited from the database management system by monitoring the table's growth.
• We set the correlation between the debt tables to zero for several reasons. First, the quality impacts, represented by I/O costs, risk of data inconsistency and number of attributes, are independent across debt tables: the I/O cost of the operations executed on one debt table has no effect on the I/O cost of the operations executed on another debt table, and the same reasoning applies to the risk of data inconsistency and the number of attributes of each table. Moreover, the growth rate of each table, which affects the quality impact accumulation, is unique and independent of the others. Lastly, each debt table's design is independent of the other debt tables, as the decision to keep the debt or normalize a table has no effect on the design and the data of the other debt tables.

Taking these considerations into account, we can apply portfolio theory, where the database developer is investing in tables' normalization. The database developer needs to build a diversified portfolio of multiple debt tables. The multiple debt tables in the database represent the assets. Each asset i has its own risk R_i and quality impact Q_i. Based on these values, the developer can then prioritize the tables to be normalized. The expected return E_p of the debt tables' portfolio, built by prioritizing debt tables from the database of m debt tables, is calculated as in the following equation:

E_p = \sum_{i=1}^{m} w_i \frac{1}{Q_i}    (3)
With one constraint, represented in the following equation:

\sum_{i=1}^{m} w_i = 1    (4)

where w_i represents the resulting weight of each debt table. This weight resembles the priority of each table for normalization, as explained in the process steps below.

The risk of the table growth rate for debt table i is represented by R_i. The global risk of the portfolio, R_p, is calculated as follows:

R_p = \sqrt{\sum_{i=1}^{m} w_i^2 R_i^2}    (5)

Process Steps: The following steps will be executed three times (except for the first step, which will be executed once) to prioritize tables based on their impact on performance, data quality and maintainability, separately for each quality attribute.

1. Determine the potential debt tables' growth rates from the database monitoring system. This step simplifies the method by examining only tables with a high growth rate.
2. Consider only tables with a high growth rate and measure their impact on the three quality attributes, using the metrics discussed in section 4.2.1.
3. Calculate the values of the portfolio variables (expected return for each table = 1 ÷ quality impact; risk = the table's growth rate).
4. Run the model on the available data to produce the optimal portfolio of the debt tables (a sketch of this step is given after this list). The portfolio model will give the highest weights to those tables with low quality impact and low table growth rate. Therefore, the debt table that has the lowest weight is the highest-priority table to be normalized.
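The following sketch illustrates step 4 under stated assumptions. The study applies GRG to equation (3) as the fitness function with equation (4) as the constraint; since the exact way the risk of equation (5) enters the objective is not spelled out here, the sketch maximizes a risk-penalized return as one plausible reading, uses SciPy's SLSQP solver as a stand-in for GRG, and uses made-up table values.

```python
# Sketch of step 4. The study applies GRG to Eq. (3) as the fitness function with
# Eq. (4) as the constraint; how the risk of Eq. (5) enters the objective is an
# assumption here (risk-penalized return), SciPy's SLSQP stands in for GRG, and
# the table names and values are made up.
import numpy as np
from scipy.optimize import minimize

tables = ["TableA", "TableB", "TableC"]   # hypothetical debt tables
Q = np.array([2.5, 0.5, 1.2])             # quality impact per table
R = np.array([1.0, 0.9, 0.5])             # growth rate (risk) per table
lam = 1.0                                 # assumed risk-aversion factor

def neg_objective(w):
    ep = np.sum(w / Q)                    # Eq. (3): expected portfolio return
    rp = np.sqrt(np.sum(w ** 2 * R ** 2)) # Eq. (5): portfolio risk
    return -(ep - lam * rp)               # maximize the risk-penalized return

constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)  # Eq. (4)
result = minimize(neg_objective, x0=np.full(len(tables), 1.0 / len(tables)),
                  method="SLSQP", bounds=[(0.0, 1.0)] * len(tables),
                  constraints=constraints)

# The lowest-weight table is the highest-priority candidate for normalization.
for name, weight in sorted(zip(tables, result.x), key=lambda pair: pair[1]):
    print(f"{name}: weight = {weight:.3f}")
```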
4.2.3 Estimating Normalization Debt Cost

The second phase of the proposed framework also involves estimating the cost of normalizing each debt table to the fourth normal form. Normalizing and splitting a single table in the database requires several alterations to the database schema and to the applications using the database, as well as migrating data to the new decomposed tables [5]. Some of these tasks can be automated depending on the technologies used in the project. However, regardless of the size of the system, refactoring the database for normalization can be very complex and time consuming. Therefore, to structure the process of estimating normalization cost, first, the main tasks required to normalize each table to the fourth normal form should be defined. Then each main task is decomposed into sub-tasks, and the time required for each sub-task is estimated. The following simple model can be used to estimate the total effort cost required to normalize each table:

Normalization cost for each table (in hours) = \sum_{i=1}^{m} \sum_{x=1}^{n} (\text{estimated hours needed to complete the } x\text{th sub-task of the } i\text{th main task})    (6)

where n is the total number of sub-tasks and m is the total number of main tasks for each table. The tasks and the time required to perform them can be estimated from historical effort data and domain experts. In this study, we rely on experts' estimation of the time required for each task. All the major refactoring tasks required to normalize the tables in our case study application were determined by the technical team. The tasks were further refined into sub-tasks to ease and concretize costing. The tasks and sub-tasks were as follows (a small worked example of equation (6) follows the list):

1. Task: Split the table and create the new decomposed tables in the fourth normal form. Sub-tasks include: write scripts to create the new tables and back up the table data.
2. Task: Update the ORM. Sub-tasks include: create new classes, update ORM models, and update the application services and domain layers.
3. Task: Migrate data to the new tables. Sub-tasks include: write scripts to migrate data to the new decomposed tables and run those scripts.
4. Task: Refactor the application code. Sub-tasks include: refactor web helper classes, view models and class controllers.
5. Task: Test the database and application. Sub-tasks include: run automation tests, end-to-end testing of the application, and sanity and smoke tests.
6. Task: Integrate and apply the changes in the production database. Sub-tasks include: back up the database, stop the application, update the new files, and performance testing.
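A minimal worked example of equation (6), assuming a hypothetical breakdown of the six tasks above into sub-tasks with made-up hour estimates:

```python
# Worked example of Eq. (6): the normalization effort of one table is the sum of the
# estimated hours of every sub-task of every main task. Hour figures are hypothetical.
subtask_hours = {
    "split table / create 4NF tables": [6, 4],       # write scripts, back up data
    "update ORM":                      [8, 6, 10],   # classes, models, services/domain
    "migrate data":                    [5, 3],
    "refactor application code":       [12, 8, 8],
    "test database and application":   [10, 16, 6],
    "integrate into production":       [4, 2, 6],
}

total_hours = sum(sum(hours) for hours in subtask_hours.values())
print(total_hours)  # 114 hours for this hypothetical table
```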
4.3 Phase 3: Making Decisions

We view normalization debt management as a multi-attribute decision making process, where developers need to make a decision about which table(s) to normalize to the fourth normal form, to improve quality while minimizing effort cost, and which table(s) should be kept in their current normal form because they are likely to have the least impact on quality and to be the most expensive to normalize. The final phase of the proposed framework incorporates the information obtained from phase 2 to prioritize the tables. The proposed approach uses the TOPSIS method [17] to rank the debt tables to be normalized. TOPSIS is mainly a utility-based method that compares each alternative directly based on the data in the evaluation matrix and the weight of each quality attribute. The fundamental idea of TOPSIS is that the best solution is the one that has the shortest distance to the positive ideal solution and the farthest distance from the negative ideal solution. This method has several advantages, such as [33]:

• The method's results are scalar values that account for the best and worst alternatives simultaneously.
• The ability to visualize the performance measures of all alternatives on the attributes.
• Human choice is represented by a sound logic.

The process of the TOPSIS method is carried out as follows (a sketch of the computation is given at the end of this subsection):

Step 1: Construct an evaluation matrix: the matrix consists of m tables and n criteria, with the intersection of each alternative and criterion given as Xij; we therefore have a matrix (Xij)m×n. The criteria in our study are the risk of data inconsistency, maintainability, performance and cost.

Step 2: Calculate the normalized matrix:
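The sketch below follows the standard TOPSIS computation (vector normalization, weighting, positive and negative ideal solutions, and ranking by closeness coefficient). The matrix values, the criterion weights and the assumption that lower values are preferable for every criterion (a low portfolio weight meaning high debt impact, a low cost meaning cheaper normalization) are illustrative choices, not figures from the case study.

```python
# Compact sketch of the standard TOPSIS computation. Matrix values, criterion
# weights and the benefit/cost orientation below are illustrative assumptions.
import numpy as np

# Rows = debt tables, columns = criteria (e.g., portfolio weights for data quality,
# maintainability, performance, and normalization cost). Values are hypothetical.
X = np.array([
    [0.06, 0.18, 0.0001, 268.0],
    [0.73, 0.16, 0.0002, 235.0],
    [0.13, 0.13, 0.2400, 252.0],
])
weights = np.array([0.25, 0.25, 0.25, 0.25])
benefit = np.array([False, False, False, False])  # assume lower is better everywhere

# Normalize each column (vector normalization), then apply the criterion weights.
V = weights * X / np.linalg.norm(X, axis=0)

# Positive/negative ideal solutions and Euclidean distances to them.
ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
anti_ideal = np.where(benefit, V.min(axis=0), V.max(axis=0))
d_pos = np.linalg.norm(V - ideal, axis=1)
d_neg = np.linalg.norm(V - anti_ideal, axis=1)

# Closeness coefficient; rank tables from closest to the ideal solution.
closeness = d_neg / (d_pos + d_neg)
ranking = np.argsort(-closeness)
print(ranking, closeness.round(3))
```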
Tables with high growth rates are considered risky, and their impact on quality will accumulate faster than that of tables with lower growth rates. To collect information about the tables' growth, the lead developer generated a report from the database monitoring system that showed the tables' growth rates over the past month. Note that we had access to the database monitoring system with a one-month log of operations and table information. Table 2 shows the five tables with the highest growth rates. As for the rest of the tables, most of them did not grow during the past month, and the remaining few tables had a substantially low growth rate, which indicates that the impact of those tables on the quality attributes will not accumulate in the future. Therefore, we analyzed only those five risky tables (presented in Table 2), which are the best candidates for normalization due to their high growth rates, which will accumulate the impact on the quality attributes faster than the other tables. Nevertheless, the analysis techniques presented in this study can be applied to any number of tables.

Table 2 Fastest growing tables and their growth rates

Table name                  Growth rate
AttendanceRecords           1.04
EmployeeDayAttendanceLogs   0.87
AuditTrials                 0.86
Leaves                      0.52
LeaveBalances               0.68

2. For each of the previous tables:
• For data quality impact measurement, we calculated the risk of data inconsistency using the ISO metric [27]. A script was written in SQL for each debt table that sums the duplicate values in all possible combinations of columns; the result was then divided by the number of columns multiplied by the number of rows (a sketch is given after this list). This process was repeated for each of the previous tables to calculate their risk of data inconsistency.
• For maintainability impact measurement, the number of attributes of each of the previous tables was retrieved from SQL Server.
• Finally, to measure performance impact, we extracted the operations executed on those tables and their execution rates over the last month. Since the application was developed with an ORM framework, there were no SQL queries written in the code; the ORM framework translates method calls into SQL queries and executes them on the database. Therefore, the operations were extracted from the database monitoring system, which can display the operations that were executed on the database in the last month along with their execution rates. So, a report was generated displaying those operations and how many times they were executed in the last month. Most of the extracted operations were long and complex. All of the operations were analyzed so that only operations executed on the tables with high growth rates were considered. To determine the I/O cost of each operation executed on a debt table, the SET STATISTICS IO feature in SQL Server was used [37]. This feature displays information about the amount of disk activity generated by SQL statements. We re-executed the queries with this feature enabled to display the I/O costs. Some of the queries contain one or more variables; we replaced those variables with actual data selected randomly from the table. Then, the queries were re-executed using the SET STATISTICS IO feature and the average I/O cost was recorded. A list was constructed with each debt table's name, the I/O cost of the operations executed on the table and the execution rate of each operation. After that, using the proposed model explained in section 4.2.1, we calculated the I/O cost for each of the previous tables.
3. We calculated the values of the portfolio variables for each quality attribute (expected return for each table = 1 ÷ quality impact; risk = the table's growth rate reported in Table 2); the quality impact values were the results obtained from the previous step.
4. We implemented a program to aid the execution of the portfolio model for each quality attribute and calculate the optimum weight for each table. The program applies a non-linear optimization method called the Generalized Reduced Gradient (GRG) [38] to equation (3) as the fitness function and equation (4) as the constraint. We ran the portfolio model on the available data three times, separately for each quality attribute, to produce weights for each debt table. The resulting weights for each table's impact on data quality, maintainability and performance are shown in Table 3, Table 4 and Table 5, respectively. The portfolio model gives the highest weights to tables with low quality impact and low growth rate. Therefore, the table that has the lowest weight in each quality attribute is the highest-priority table to be normalized due to its high impact on quality.
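The sketch referenced in the data-quality bullet above shows one way to approximate the described duplication-based metric; the grouping query and the use of sqlite3 are illustrative assumptions (the case study used hand-written SQL Server scripts per table).

```python
# One way to approximate the duplication-based inconsistency metric described above:
# sum the duplicate occurrences over all column combinations, then divide by
# (number of columns x number of rows). Uses sqlite3 only to stay self-contained.
import sqlite3
from itertools import combinations

def duplication_risk(conn, table, columns):
    cur = conn.cursor()
    n_rows = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if n_rows == 0:
        return 0.0
    duplicates = 0
    for r in range(1, len(columns) + 1):
        for combo in combinations(columns, r):
            cols = ", ".join(combo)
            # Rows beyond the first occurrence of each distinct value combination.
            duplicates += cur.execute(
                f"SELECT COALESCE(SUM(cnt - 1), 0) FROM "
                f"(SELECT COUNT(*) AS cnt FROM {table} GROUP BY {cols})"
            ).fetchone()[0]
    return duplicates / (len(columns) * n_rows)

# The per-operation I/O figures in the same list came from re-running the extracted
# queries in SQL Server with SET STATISTICS IO ON and recording the reported reads.
```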
Table 3 Debt tables' information and weights of data quality impact

Table name                  Risk of data inconsistency   Expected return   Weight
AttendanceRecords           2.5                          0.4               0.0581
EmployeeDayAttendanceLogs   0.236                        4.237             0.7359
AuditTrials                 1.31                         0.763             0.1341
Leaves                      5.55                         0.180             0.0523
LeaveBalances               11.34                        0.088             0.0196
From the results of Table 3, we can determine that table LeaveBalances has the highest priority to be normalized to the fourth normal form to improve data quality, since it received the lowest weight. Although this table has a growth rate lower than table AttendanceRecords, its risk of data inconsistency is the highest among the tables and will accumulate faster than that of the other tables.

It is also observable from the results in Table 3 that the risk of data inconsistency is independent of the normal form of the debt table. As shown, table AttendanceRecords has a higher risk of data inconsistency than table EmployeeDayAttendanceLogs and table AuditTrials, even though the latter tables are in a weaker normal form, which denotes that the normal form alone is not sufficient to make the right decision on which table to normalize.

Table 4 Debt tables' information and weights of maintainability impact

Table name                  Number of attributes   Expected return   Weight
AttendanceRecords           6                      0.1667            0.1824
EmployeeDayAttendanceLogs   8                      0.125             0.1635
AuditTrials                 10                     0.1               0.1323
Leaves                      7                      0.1429            0.3127
LeaveBalances               8                      0.125             0.2092

Based on the results from Table 4, table AuditTrials has the highest priority to be normalized to decrease complexity. Even though this table has a growth rate lower than AttendanceRecords and EmployeeDayAttendanceLogs, the number of AuditTrials attributes is the highest, and the complexity of this table will increase faster as more data is stored in it. It is also observable that table EmployeeDayAttendanceLogs is the second prioritized table to normalize to the fourth normal form: even with a number of attributes similar to table LeaveBalances, the growth rate of table EmployeeDayAttendanceLogs is higher, which will increase the complexity of the table faster.

Table 5 Debt tables' information and weights of performance impact

Table name                  Average I/O cost   Expected return   Weight
AttendanceRecords           62822956           1.59              0.000073
EmployeeDayAttendanceLogs   27897228           3.58              0.000197
AuditTrials                 23020              4344.05           0.242
Leaves                      15184              6585.88           0.606
LeaveBalances               46240              2162.63           0.152

Table 5 indicates that table AttendanceRecords has the highest priority to be normalized to the fourth normal form to improve performance. This is due to the fact that the I/O cost incurred by the operations executed on this table is the highest among the tables, meaning the cost of the operations performed on that table and their monthly execution rates are high, in addition to the highest growth rate of table AttendanceRecords, which will accelerate I/O cost accumulation in the future and affect performance. Table EmployeeDayAttendanceLogs has the second priority for normalization. As seen, the difference of the weights between table AttendanceRecords and table EmployeeDayAttendanceLogs is relatively small, which indicates that both tables are almost equally important to normalize, considering time and budget constraints.

Estimating the Cost of Normalization:
To estimate the cost of normalizing each table to the fourth normal form, we met with the technical team and determined all the major refactoring tasks required to normalize the tables. The tasks and sub-tasks were as mentioned in section 4.2.3. The cost of each candidate table was analyzed in accordance with those tasks. As shown, the tasks were further refined into sub-tasks to ease and concretize costing. For each sub-task under consideration, we solicited from the team three estimates of the time required: best, most likely and worst case. The team used previous experience to arrive at the best- and worst-case estimates. For the most likely case, the team suggested taking the average of the best and worst. In our calculation, we considered the most likely time estimate for each sub-task and calculated the total time required to normalize each table, using equation (6). The results are shown in Table 6. It is worth noting that the estimates are in the context of the team's experience and capabilities to perform the said tasks, calculated in hours (i.e. they are relative estimates).

Table 6 Normalization cost for each table

Table name                  Average cost to normalize to fourth normal form (hours)
AttendanceRecords           268
EmployeeDayAttendanceLogs   235
AuditTrials                 252
Leaves                      524
LeaveBalances               525

5.3.3 Framework Phase 3: Making Decisions Using the TOPSIS Method

So far, we have estimated the impact of the debt tables on the three quality attributes. We have also estimated the cost of normalizing each table to the fourth normal form. The next step is to incorporate all this information to prioritize the tables that need to be normalized. The proposed approach uses the TOPSIS method detailed in section 4.3. We implemented a program to aid in the calculations, following the TOPSIS equations explained in section 4.3.

Step 1: Construct the evaluation matrix: Table 7 demonstrates the evaluation matrix, which consists of the debt tables, the criteria, the tables' weights from the portfolio analysis and the tables' normalization cost.
The worst-cost percentage increase in Table 9 depicts the percentage increase in costs needed to reach the worst-case estimates. As an example, AttendanceRecords needs a 33.21% increase in cost to reach its worst-case estimate. The last 10% covers the worst cost percentage increase. The results of the previous table are plotted in Fig 4. As seen, all the tables' rankings were stable within the worst-case increase in cost, except for table AuditTrials, whose rank changed from the second to the third table to normalize with a 10.8% increase in its normalization cost, and table LeaveBalances, which changed from third to second in the ranking. However, this change does not affect the ranking of the first prioritized table (i.e. AttendanceRecords) that should be normalized to the fourth normal form.

Fig 4 Stability of tables' ranking after cost increase

5.3.4 Future Scenario Simulation and Debt Table Normalization

As discussed, we have performed this case study by simulating the phases of the proposed framework using real data from the human resource application. To evaluate the results, we simulated a future scenario and executed the following:

• We used a data generation tool (the Redgate tool [36]), suggested by the developers, to populate the tables with the required amount of data to nearly emulate the likely growth rate.
• We re-measured the tables' impact on data quality, performance and maintainability in the same manner as we did when we executed the phases of the framework. Results are shown in Table 10.

Table 10 Debt tables' impact on quality attributes after tables' growth
LeaveBalances   12.28   86364   8

• We normalized table AttendanceRecords to the fourth normal form. The normalization included several tasks: decompose the table into three tables (AttendanceRecords, DeviceMode and DeviceProcess), write scripts to migrate data to the new tables, create indexes, and refactor the application code to reflect the changes.
• Finally, we measured the improvement after the table's normalization in the three quality attributes as follows:
For data quality: we measured the risk of data inconsistency in all three tables using the ISO metric in equation (1).
For performance: we extracted the new SQL code of the operations that used to retrieve data from AttendanceRecords. The new operations now retrieve data from the three new tables after decomposition. We obtained the total I/O costs of the operations executed on the three new tables. Then, assuming the execution rate of each operation is the same as before normalization, we used equation (2) explained in section 4.2.1 to get the total average I/O cost of the operations executed on the three tables.
For maintainability: the number of attributes of table AttendanceRecords was easily obtained from SQL Server.
The results of the quality impact improvements are shown in Table 11.

Table 11 Table AttendanceRecords' impact on quality attributes after normalization

Quality attribute        Measure                        Value
Data quality             Before normalization           3.63
                         After normalization            1.137
                         % rate of impact reduction     -68.68%
Performance (I/O cost)   Before normalization           122290310
                         After normalization            87751829
                         % rate of impact reduction     -28.24%
Maintainability          Before normalization           6
                         After normalization            5
                         % rate of impact reduction     -16.7%
For instance, the use of tools (a database monitoring tool to elicit the required information, and programs to execute some of the steps in each phase of the method) facilitates the application of the method without previous experience in database normalization. The framework is based on intuitive concepts organized in a structured manner, which contributes to its repeatability. Furthermore, users of the framework do not need to understand the details of the decision theory behind TOPSIS or the portfolio analysis technique. On the other hand, users must have a comprehensive understanding of organization resources and knowledge about the refactoring tasks to estimate the cost of normalization. The guidance we provided during the case study process to execute each step and obtain the required information seems to be beneficial and easy to apply in the context of the human resource web application.

Concerning the systematic process of the framework, we have observed that the framework covers all the major steps necessary to conduct a disciplined selection process. The framework phases provide a clear, structured process to guide the acquisition and analysis of relevant information. This information assists the developers in making informed decisions, and therefore in choosing a table to normalize to improve quality at minimum cost. The added value that our framework gives to the decision process is that it provides guidance on how to measure the debt impact of individual tables to facilitate the prioritization process. Concerning normalization cost estimation, our work is consistent with current practices in considering the effort and person-hours per task. These task costs were estimated using experts' judgments. We shall note that this paper is not claiming a contribution on effort estimation for database refactoring. In addition to experts' judgement, future work may consider developing specialized parametric models for the cost/effort of database refactoring, which is outside the scope of our current contribution.

Though our work provides a list of prioritized tables that need to be normalized based on the debt information, the developers may decide to further revisit the list to eliminate tables that are not worth consideration. The decision to eliminate a table from the list is highly dependent on the expertise of the developers, their familiarity with a similar case, and whether the improvements are not worth the consideration. Nevertheless, we have taken a conservative approach by using the available resources to decide on the tables to normalize among the prioritized list. Table 12 shows that following the conventional approach, which encourages normalizing all the tables to the fourth normal form, is more costly, time consuming and ad hoc than the debt-aware approach in four scenarios. The first scenario prioritizes debt tables based on their impact on data quality, where the portfolio approach results from Table 3 suggest normalizing only one table, LeaveBalances, based on the biggest amount of data duplication it holds and its growth rate compared to other tables. The second scenario considers the performance impact of each debt table and the likely accumulation of this impact in the future; by utilizing the portfolio approach, the results from Table 5 suggest two tables for normalization, AttendanceRecords and EmployeeDayAttendanceLogs. To minimize complexity, the third scenario suggests normalizing table AuditTrials, based on the results from Table 4. Finally, to improve all three quality attributes, and taking into account the effort cost of normalizing each table, table AttendanceRecords is suggested to be normalized.

Table 12 Difference in effort between the conventional approach and the debt-aware approach of normalization

Approach                                          Number of tables to normalize
Conventional approach                             17
Debt-aware approach to improve data quality       1
Debt-aware approach to improve performance        2
Debt-aware approach to improve maintainability    1
Debt-aware approach to improve three qualities    1

Moreover, depending on available resources, developers can include more tables to normalize and justify their decisions based on the debt tables' impact on data quality, performance, maintainability and cost.

Our approach is built on the premise that the identified root causes of the debt have to do with inadequate normalization, and its impact is observed on qualities such as performance, maintainability and data quality. It is imperative that the architect may consider other architectural tactics and fixes to rectify these problems. Though the tactics and fixes may provide immediate benefits in some contexts, they will not eliminate the root causes of the debt, and the debt will continue to be dormant. The architect may need to weigh the long-term benefits and costs of these fixes against those of normalization. Otherwise, these fixes can be regarded as taking a debt on existing debt.

2. Does the debt quality impact estimation provide insights to identify tables that are most affected by the debt?

Our interest in debt impact is centered on the fact that it is suitable for identifying tables that are candidates for normalization based on their data quality, performance and maintainability impact, regardless of the normal form of the table. The notion of debt impact is particularly appropriate to understand to what extent a potential debt table may hurt the quality compared to other potential debt tables, and therefore facilitate the decision-making process regarding which table to normalize.

We started the case study by identifying potential debt tables (i.e. tables below the fourth normal form) in the database. After the identification process, it could be suggested to normalize the weakest normal form tables to improve the design and the system's quality. The suggestion would be based on the fact that, in normalization theory, the higher the normal form of the table, the better the design will be. This is true "design-wise"; however, when the notion of quality impact is introduced and after measuring the data quality of the debt tables, the evidence has shown that weakly normalized tables may have better data quality than higher normal form tables, as observed from Table 3. This is due to the fact that the data quality metric, presented from the ISO standard,
from ISO, measures the risk of data inconsistency based on weight that implies highest priority to normalize. Since the
the amount of duplicated data stored in the table. Some number of attributes has not changed, maintainability
weakly normalized tables store less amount of data and tends to be sensitive to the growth of data stored in the ta-
thus have less impact on data quality. This empirical evi- ble, i.e. rows.
dence reinforces our assumption that the prioritization The results indicate the fitness of the proposed technique
process is independent from the table normal form. Simi- towards improving the quality considering future growth.
larly, when measuring the performance impact, our pro- Additional benefit of using the portfolio approach is that
posed model is based on I/O cost of operations executed its input is from specific data drawn from the quality meas-
on the table which is determined by the table size and the urement and the database monitoring tool that elicited ta-
execution rate of the operation, regardless of the table nor- bles’ growth rates, which eliminates biased and ad-hoc de-
mal form. cisions regarding which table to normalize.
Our maintainability metric does not rely solely on the number of attributes in a table; it also considers the growth rate of the table size. Our contribution is not aimed at designing a maintainability metric for databases per se; rather, our approach is flexible enough to make use of other maintainability metrics, if available. These metrics can be parametric; open-ended, based on experts' judgment and/or back-of-the-envelope calculations for a given domain; or based on more sophisticated methods, such as learning from cross-company/project databases to estimate the likely maintainability effort for a given case. The measures can incorporate various maintainability concerns and dimensions depending on the context.
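As one possible parametric instantiation of such a metric, the snippet below scales an attribute-count baseline by the projected table size over a planning horizon. The functional form, the horizon, and the figures are assumptions made purely for illustration; they are not the metric used in the case study.

```python
def maintainability_impact(num_attributes: int,
                           current_rows: int,
                           monthly_growth_rate: float,
                           horizon_months: int = 12) -> float:
    """Parametric maintainability proxy: an attribute-count baseline scaled by
    the projected row count at the end of the planning horizon (illustrative)."""
    projected_rows = current_rows * (1 + monthly_growth_rate) ** horizon_months
    return num_attributes * projected_rows


# Hypothetical table: 14 attributes, 200,000 rows today, growing 5% per month.
print(maintainability_impact(14, 200_000, 0.05))  # ~5.03e6 attribute-rows
```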
3. Is the portfolio analysis technique effective in prioritizing debt tables with a high growth rate in the future?

As with most database applications, the human resource web application grows in the amount of data stored in the database. This growth is a significant factor in making a better decision regarding debt payment. Tables with a high growth rate will accelerate the impact of the debt, and its accumulation on the three qualities, faster than tables with lower growth rates. Therefore, if a table is not likely to grow, or its growth rate is lower than that of other tables, a strategic decision would be to keep the debt and defer its payment. The portfolio analysis provides developers with an objective tool to assess and rethink their normalization decisions using the debt impact and the risk of table growth in the future.

The exercise of applying the portfolio analysis to the human resource web application proved to be valid, based on the data reported after simulating a future scenario and enlarging each table under analysis according to its growth rate. After data growth, we re-measured the debt tables' impact on the three quality attributes. The results are shown in Table 10.

Portfolio analysis prioritization applied before data growth is depicted in Tables 3, 4, and 5; Table 10 shows the ranking of these tables after data growth. The results are consistent with the portfolio prioritization results. For example, Table 10 shows LeaveBalances to have the highest risk of data inconsistency after data growth. The same table (i.e., LeaveBalances) had the highest priority to be normalized to the fourth normal form, as suggested by the portfolio analysis in Table 3, where LeaveBalances had the lowest weight. Similarly, for performance, the operations executed on table AttendanceRecords appear to have the highest I/O cost, as shown in Table 10, which is consistent with the portfolio prioritization in Table 5, where AttendanceRecords had the lowest weight, implying the highest priority to normalize. Since the number of attributes has not changed, maintainability tends to be sensitive to the growth of the data stored in the table, i.e., its rows.

The results indicate the fitness of the proposed technique for improving quality while accounting for future growth. An additional benefit of using the portfolio approach is that its input is drawn from specific data produced by the quality measurements and by the database monitoring tool that elicited the tables' growth rates, which eliminates biased and ad hoc decisions regarding which table to normalize.
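For readers who want to experiment with the idea, the sketch below shows one common mean-variance formulation in which each debt table is treated as an asset of the "keep the debt" portfolio; tables that receive the lowest weight become the strongest candidates for normalization. The return and covariance figures are invented for illustration, SciPy's SLSQP solver stands in for the GRG solver cited in [38], and the exact formulation used in the paper may differ.

```python
import numpy as np
from scipy.optimize import minimize


def keep_portfolio_weights(expected_return: np.ndarray,
                           covariance: np.ndarray,
                           risk_aversion: float = 1.0) -> np.ndarray:
    """Mean-variance weights for the 'keep the debt' portfolio: maximize the
    expected value of keeping each table minus a penalty for growth-driven risk.
    Tables with the lowest resulting weight are prioritized for normalization."""
    n = len(expected_return)

    def objective(w):
        return -(w @ expected_return) + risk_aversion * (w @ covariance @ w)

    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]  # weights sum to 1
    bounds = [(0.0, 1.0)] * n
    result = minimize(objective, x0=np.full(n, 1.0 / n),
                      method="SLSQP", bounds=bounds, constraints=constraints)
    return result.x


# Hypothetical inputs: the 'return' of keeping each table (higher = safer to keep)
# and a covariance matrix capturing uncertainty driven by the tables' growth rates.
tables = ["LeaveBalances", "AttendanceRecords", "EmployessDayAttendanceLogs"]
ret = np.array([0.02, 0.05, 0.08])
cov = np.diag([0.09, 0.04, 0.01])

for name, weight in sorted(zip(tables, keep_portfolio_weights(ret, cov)),
                           key=lambda pair: pair[1]):
    print(f"{name}: weight {weight:.2f}")  # lowest weight = normalize first
```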
4. Does the TOPSIS method provide useful guidance to make informed normalization decisions?

Improving the quality of the decisions made for database normalization is the main objective of this research. In this context, the objective of the TOPSIS method is to organize the relevant information gathered through the prioritization framework. Utilizing the TOPSIS method to prioritize the debt tables can benefit developers in two ways. First, it gives developers the flexibility to include or exclude the quality attributes that are relevant for a given valuation context. Second, based on the available resources and the system's condition, developers are able to weigh the level of importance of each quality attribute.

In the case study, information about the debt impact on data quality, performance, and maintainability, as well as the cost of database normalization, was estimated and gathered during phase 2 of the framework. Developers need a structured, flexible, and simple method to organize all of this information in order to make the best decisions. The TOPSIS method represents a tool that can be applied at any time when new information becomes available, or when a specific quality dimension needs to be improved over the others, which may change as the system evolves. The TOPSIS method has given explicit reasoning about how trade-offs are achieved among the factors affecting the debt tables' prioritization.
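The listing below is a minimal, self-contained sketch of the standard TOPSIS procedure applied to this decision problem. The decision matrix, the criterion weights, and the benefit/cost orientation of each criterion are hypothetical placeholders for the values gathered in phase 2 of the framework; they are not the figures from the case study.

```python
import numpy as np


def topsis_rank(matrix: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Rank alternatives (debt tables) with TOPSIS.
    matrix  : alternatives x criteria scores
    weights : criterion weights summing to 1
    benefit : True where larger is better (impact criteria), False for cost criteria."""
    # 1. Vector-normalize each criterion column and apply the weights.
    weighted = (matrix / np.linalg.norm(matrix, axis=0)) * weights
    # 2. Determine the ideal and anti-ideal solutions per criterion.
    ideal = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
    anti_ideal = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))
    # 3. Euclidean distances and relative closeness to the ideal solution.
    d_pos = np.linalg.norm(weighted - ideal, axis=1)
    d_neg = np.linalg.norm(weighted - anti_ideal, axis=1)
    closeness = d_neg / (d_pos + d_neg)
    return np.argsort(-closeness)  # highest priority for normalization first


# Hypothetical decision matrix: rows = debt tables, columns =
# [data-quality impact, performance impact, maintainability impact, cost].
tables = ["LeaveBalances", "AttendanceRecords", "EmployessDayAttendanceLogs"]
scores = np.array([[0.8, 0.4, 0.6, 3.0],
                   [0.5, 0.9, 0.5, 5.0],
                   [0.3, 0.7, 0.4, 2.0]])
weights = np.array([0.35, 0.30, 0.20, 0.15])
benefit = np.array([True, True, True, False])  # cost is the only cost criterion

for rank, idx in enumerate(topsis_rank(scores, weights, benefit), start=1):
    print(rank, tables[idx])
```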
To further evaluate our proposed approach, we worked with the development team to normalize the table AttendanceRecords to the fourth normal form and to refactor the application code, as described in Section 5.3.4. After normalization, we re-measured the impact of the new, decomposed tables on data quality, performance, and maintainability, as explained in Section 5.3.4. The results are shown in Table 11.

As shown in Table 11, after normalization to the fourth normal form, the data quality impact is reduced by 68.68%, the performance impact is decreased by 28.24%, and the maintainability impact is decreased by 16.7%. The results provide evidence of the negative impact attributed to a normal form weaker than the fourth normal form.
In summary, the developers' feedback on the framework has been positive. The approach was eye-opening and has encouraged the team to reconsider their current practices. In particular, the developers used to avoid, at any expense, restructuring the database to tune performance, due to the lack of guidance on which table to refactor among the available ones. Our approach has led the developers to recognize that refactoring the database for normalization is essential to treat the root cause of the problem and improve quality; they appreciate the structured approach to prioritizing the tables that should be normalized to improve quality at the minimum possible cost. Moreover, combining the systematic debt analysis with measurement of each table's debt impact on the three qualities, based on data from the database and the monitoring system, has provided the developers with previously uncovered knowledge for informing the refactoring exercise. The approach was deemed more informative and objective for mitigating normalization debts than ad hoc normalization and refactoring.

5.5 Threats to Validity
Our evaluation has used an industrial case study. The results of case studies are difficult to generalize [39], due to their specificity to the case and domain. In this study, the presented project is a database application, developed and maintained using specific technologies and infrastructure, which may have different cost patterns and management styles than other projects. For example, managing trade-offs between the qualities affected by the debt can be more complex in an integrated database project (i.e., a database that serves multiple applications [40]). Moreover, the evaluation has relied on the case study developers for feedback on the overall effectiveness of the framework.

Therefore, our findings reflect experience in just one particular case, the goal being transferability, not generalizability: the specific findings of this study may be applicable to other projects, and the general lessons learned should be instructive to those applying and studying this approach in other situations. Further analysis may need to look at applying the method to several applications of industrial scale and/or to report on the applicability of the method using more than one independent team. Both routes are non-trivial and require careful empirical and field studies that go beyond the scope of this paper; they are subject to future work.

6 RELATED WORK
Ward Cunningham was the first to introduce the technical debt metaphor, in 1992, at the code level, as a trade-off between short-term business goals (e.g., shipping the application early) and the long-term goal of applying the best coding practices [41]. The majority of research has focused on code and architectural level debt [10], [6]. In [11], Weber et al. discussed a specific type of database schema design debt, namely missing foreign keys in relational databases. To illustrate the concept, the authors examined this type of debt in an electronic medical record system called OSCAR and proposed an iterative process to manage foreign key debt across different deployment sites. Our work is different, as it is driven by normalization rules, which are fundamentally different in their requirements and treatment. Tables in databases have specific requirements, and the likely impact of normalization can be observed on different quality metrics that are database specific.

Closer to our work on debt prioritization, the authors in [32] utilized prioritization to manage code-level debts in software design. They estimated the impact of God classes on software maintainability and correctness and prioritized the classes that should be refactored based on their impact on those qualities. Portfolio Theory has also been proposed to manage technical debt [42]. In [42], the authors viewed each debt item as an asset and utilized Portfolio Theory to construct a portfolio of debt items that should be kept, based on the debt principal, which they defined as the effort required to remove the debt, and the debt interest, which is the extra work needed if the debt is not removed. Portfolio Theory was also proposed to manage requirements compliance debt in [43]. The authors viewed compliance management as an investment activity that requires decisions about the right compliance goals under uncertainty. They identified an optimal portfolio of obstacles that needed to be resolved and addressed value-driven requirements based on their economics and risks.

Since the process of normalizing a database can be very complex and costly, researchers have proposed algorithms to automate or facilitate the normalization process [44], [45], [46]. Their aim was to produce tables up to third normal form or BCNF automatically. However, these studies looked at the database schema in isolation from the applications using the database. It is important to consider the applications to better estimate the cost of normalization, taking into account refactoring and data migration tasks. Since this process can be very costly, this study aims to provide a method to prioritize the tables to be normalized, to improve the design and avoid negative consequences.

7 CONCLUSION AND FUTURE WORK
We have explored the concept of technical debt in database design as it relates to database normalization. Normalizing a database is acknowledged to be an essential process for improving data quality, maintainability, and performance. Conventional approaches to normalization are driven by the acclaimed structural and behavioral benefits that higher normal forms are likely to provide, without explicit reference to value and debt information. The problem lies in the fact that developers tend to overlook this process for several reasons, such as saving effort and cost, lack of expertise, or meeting deadlines. This can imply a debt that needs to be managed carefully to avoid negative consequences in the future. Conversely, designers may embark on database normalization without clear justification for the effort or for the debt being avoided.

We reported on a framework to manage normalization debt in database design. A table below the fourth normal form is viewed as a potential debt item. Though we considered tables below the fourth normal form as potential debt items, in practice most databases lag behind this level [14]. Among the reasons for avoiding normalization, database refactoring is often acknowledged to be a tedious and expensive exercise, which developers avoid at any expense due to its unpredictable outcome [5]. Additionally, it is impractical for developers to use the fourth normal form as the only criterion to drive the normalization exercise. To overcome these problems, we proposed a framework to prioritize tables and their candidacy for normalization. The framework utilizes the Portfolio theory and the
SCIENCE AND APPLICATIONS, vol. 8, no. 11, pp. 542–547, 2017.
[20] G. Vial, “Database refactoring: Lessons from the trenches,” IEEE Software, vol. 32, no. 6, pp. 71–79, 2015.
[21] K. Hamaji and Y. Nakamoto, “Toward a Database Refactoring Support Tool,” in 2016 Fourth International Symposium on Computing and Networking (CANDAR), 2016, pp. 443–446.
[22] P. Khumnin and T. Senivongse, “SQL antipatterns detection and database refactoring process,” in Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2017 18th IEEE/ACIS International Conference on, 2017, pp. 199–205.
[23] A. Cleve, M. Gobert, L. Meurice, J. Maes, and J. Weber, “Understanding database schema evolution: A case study,” Science of Computer Programming, vol. 97, pp. 113–121, Jan. 2015.
[24] L. Meurice and A. Cleve, “Dahlia: A visual analyzer of database schema evolution,” in Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week-IEEE Conference on, 2014, pp. 464–468.
[25] M. S. Wu, “The practical need for fourth normal form,” in ACM SIGCSE Bulletin, 1992, vol. 24, pp. 19–23.
[26] G. Piatetsky-Shapiro, “Discovery, analysis, and presentation of strong rules,” Knowledge discovery in databases, pp. 229–238, 1991.
[27] “ISO/IEC 25024:2015 - Systems and software engineering -- Systems and software Quality Requirements and Evaluation (SQuaRE) -- Measurement of data quality,” ISO. [Online]. Available: http://www.iso.org/iso/catalogue_detail.htm?csnumber=35749.
[28] ISO/IEC, “International Standard ISO/IEC 14764, IEEE Std 14764-2006: Software Engineering; Software Life Cycle Processes; Maintenance,” 2006.
[29] C. Calero and M. Piattini, “Metrics for databases: a way to assure the quality,” in Information and database quality, Springer, 2002, pp. 57–83.
[30] G. Papastefanatos, P. Vassiliadis, A. Simitsis, and Y. Vassiliou, “Metrics for the prediction of evolution impact in etl ecosystems: A case study,” Journal on Data Semantics, vol. 1, no. 2, pp. 75–97, 2012.
[31] V. E. Ferraggine, J. H. Doorn, and L. C. Rivero, Handbook of Research on Innovations in Database Technologies and Applications: Current and Future Trends. Information Science Reference, Hershey, PA, 2009.
[32] N. Zazworka, C. Seaman, and F. Shull, “Prioritizing design debt investment opportunities,” in Proceedings of the 2nd Workshop on Managing Technical Debt, 2011, pp. 39–42.
[33] H.-S. Shih, H.-J. Shyur, and E. S. Lee, “An extension of TOPSIS for group decision making,” Mathematical and Computer Modelling, vol. 45, no. 7–8, pp. 801–813, 2007.
[34] B. Kitchenham, S. Linkman, and D. Law, “DESMET: a methodology for evaluating software engineering methods and tools,” Computing Control Engineering Journal, vol. 8, no. 3, pp. 120–126, Jun. 1997.
[35] “What is Object/Relational Mapping? - Hibernate ORM.” [Online]. Available: http://hibernate.org/orm/what-is-an-orm/. [Accessed: 04-Jul-2018].
[36] “SQL Data Generator - Data Generator For MS SQL Server Databases.” [Online]. Available: https://www.red-gate.com/products/sql-development/sql-data-generator/. [Accessed: 12-May-2019].
[37] C. Rabeler, “SET STATISTICS IO (Transact-SQL).” [Online]. Available: https://docs.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql. [Accessed: 05-Jul-2018].
[38] L. S. Lasdon, A. D. Waren, A. Jain, and M. Ratner, “Design and Testing of a Generalized Reduced Gradient Code for Nonlinear Programming,” ACM Trans. Math. Softw., vol. 4, no. 1, pp. 34–50, Mar. 1978.
[39] R. K. Yin, Case Study Research: Design and Methods, 3rd ed. Sage Publications, 2003.
[40] M. Fowler, “IntegrationDatabase,” martinfowler.com. [Online]. Available: https://martinfowler.com/bliki/IntegrationDatabase.html.
[41] W. Cunningham, “The WyCash Portfolio Management System,” in Addendum to the Proceedings on Object-oriented Programming Systems, Languages, and Applications (Addendum), New York, NY, USA, 1992, pp. 29–30.
[42] Y. Guo and C. Seaman, “A portfolio approach to technical debt management,” in Proceedings of the 2nd Workshop on Managing Technical Debt, 2011, pp. 31–34.
[43] B. Ojameruaye and R. Bahsoon, “Systematic elaboration of compliance requirements using compliance debt and portfolio theory,” in International Working Conference on Requirements Engineering: Foundation for Software Quality, 2014, pp. 152–167.
[44] M. Demba, “Algorithm for Relational Database Normalization Up to 3NF,” International Journal of Database Management Systems, vol. 5, no. 3, pp. 39–51, Jun. 2013.
[45] Y. Dongare, P. Dhabe, and S. Deshmukh, “RDBNorma: A semi-automated tool for relational database schema normalization up to third normal form,” International Journal of Database Management Systems, vol. 3, no. 1, pp. 133–154, Feb. 2011.
[46] J. Diederich and J. Milton, “New methods and fast algorithms for database normalization,” ACM Transactions on Database Systems (TODS), vol. 13, no. 3, pp. 339–365, 1988.

Mashel Albarak is with the School of Computer Science at the University of Birmingham, UK, and King Saud University, KSA. Her interests are in managing technical debt in databases and information systems.

Rami Bahsoon is an academic at the School of Computer Science, University of Birmingham, UK. His research is in software architecture, self-adaptive and managed architectures, economics-driven software engineering and technical debt management. He co-edited four books on Software Architecture, including Economics-Driven Software Architecture. He holds a PhD in Software Engineering from University College London and was MBA Fellow at London Business School. He is a fellow of the Royal Society of Arts and Associate Editor of IEEE Software.

Ipek Ozkaya is a technical director at the Carnegie Mellon University Software Engineering Institute, where she develops methods and practices for software architectures, agile development, and managing technical debt in complex systems. She coauthored the book Managing Technical Debt: Reducing Friction in Software Development (2019). She received a PhD in Computational Design from CMU. Ozkaya is a senior member of IEEE and the 2019–2021 editor-in-chief of IEEE Software magazine.

Robert Nord is a principal researcher at the Carnegie Mellon University Software Engineering Institute, where he develops methods and practices for agile at scale, software architecture, and managing technical debt. He is coauthor of Managing Technical Debt: Reducing Friction in Software Development (2019). He received a PhD in computer science from CMU and is a distinguished member of the ACM.