AI Paradigms: Data vs. Model-Centric
The artificial intelligence (AI) field is undergoing a dramatic revolution in terms of new horizons for research and real-world applications, but some research trajectories in AI are becoming detrimental over time. Recently, there has been a growing call in the AI community to counter a dominant research trend named model-centric AI (MC-AI), which only fiddles with complex AI code/algorithms. MC-AI may not yield desirable results when applied to real-life problems such as predictive maintenance because of limited or poor-quality data. In contrast, a relatively new paradigm named data-centric AI (DC-AI) is becoming more popular in the AI community. In this article, we discuss and compare MC-AI and DC-AI in terms of basic concepts, working mechanisms, and technical differences. Then, we highlight the potential benefits of the DC-AI approach to foster further research on this recent paradigm. This pioneering work on DC-AI and MC-AI can pave the way to understanding the fundamentals and significance of these two paradigms from a broader perspective.
IT Professional, November/December 2023. Published by the IEEE Computer Society. 1520-9202 © 2023 IEEE. Digital Object Identifier 10.1109/MITP.2023.3322410. Date of current version 12 January 2024.
Authorized licensed use limited to: Thiagarajar College of Engineering. Downloaded on August 07, 2024 at 04:41:07 UTC from IEEE Xplore. Restrictions apply.

Artificial intelligence (AI) is a transformative technology with a wide range of practical applications in diverse sectors such as health care, defense, cybersecurity, and robotics. During the COVID-19 pandemic, AI was widely used to forecast daily case tallies, gauge the efficacy of interventions, trace hidden routes of transmission, predict the course of the epidemic, and analyze trends, to name just a few applications.[1] Recently, researchers/practitioners have been expanding the horizon of AI applications from simple problems to global issues such as climate change. Addressing such global issues by utilizing AI will have a very big impact on people around the globe.[2] Apart from these applications, the amalgamation of AI with technologies like blockchain, edge computing, and other Industry 4.0 technologies is rapidly increasing and has opened up various innovative use cases. Based on the discussion here, it is fair to say that AI is on its way to becoming a very helpful tool for humans to execute many tasks.

Recently, AI has emerged as a strong competitor to humans, as it can perform many tasks in less time and at lower cost than humans. More efforts are underway in developing artificial general intelligence systems (i.e., systems that can closely mimic the way a human performs tasks). As a result, a large number of new AI models, architectures, and low-code/no-code tools have been developed. The abilities of machine learning (ML) models are expanding from classification/prediction tasks to predictive maintenance and other complex tasks.[3] Deep learning models combined with the Internet of Things (IoT) and other technologies are helping to combat the shortage of experts and resources in health-care sectors.[4] Also, advancements in federated learning (FL) and contrastive learning are improving the privacy and usability of data. Developments in generative AI are assisting in curating more data to compensate for the deficiency of data and to improve the results of AI models. The latest generative AI tools, such as ChatGPT, have many innovative use cases (e.g., code writing, scientific paper writing, answering questions, virtual assistants, and so on).[5] The forthcoming wave of AI will bring more powerful and innovative tools for diverse sectors.

Before the inception of data-centric AI (DC-AI), most of the efforts were put into a model-centric AI (MC-AI) approach that puts special focus on improving the architectural aspects of AI models (e.g., modifying the network architecture, switching to a new model, reducing model size, and hyperparameter tuning). Under this approach, when an AI model fails to yield the required performance, developers only improve the architectural aspects. This might not work in scenarios where data are limited (or are of poor quality) and where further data acquisition is difficult due to a limited budget. Another main drawback of this approach is 2× the data,
meaning that if an AI model fails to yield the required accuracy, developers get more data irrespective of the fact that only a few images/features might be faulty. This can waste time and effort and increase computing overhead.

Thanks to the discovery by Prof. Andrew Ng, the deficiency of large datasets can be easily overcome by rigorously using the DC-AI approach.[6] In this approach, when an AI model gives poor performance, developers need to inspect the data as well, rather than solely improving the code. DC-AI can overcome the potential drawbacks of the MC-AI approach and reduce overhead by collecting only the required images/features, rather than simply doubling the data. Furthermore, DC-AI can increase the accuracy of convolutional neural network (CNN) models by using even less, but good-quality, data.[7] It can be widely applicable to scenarios where data are not a readily available commodity, or when getting more data is difficult. Thus far, very little is known about these two paradigms, and a concrete overview of their workflow and key differences remains unexplored. The main contributions of this work are summarized as follows:

• We explore two schools of thought (DC-AI and MC-AI) concerning the development of AI technology, and we identify opportunities to provide concrete technical details and insights about them. Specifically, we present a technical analysis of the DC-AI and MC-AI paradigms, and we highlight the key differences between them.
• We pinpoint and describe six dimensions to systematically highlight the MC-AI approach of AI developments that remain unexplored in the current literature.
• We analyze different techniques that can be vital in realizing DC-AI, and we group them into three levels to systematically demonstrate what DC-AI entails.
• We demonstrate the potential benefits of DC-AI when solving many key issues in the current AI technology. To the best of our knowledge, this is the first work centering on DC-AI and MC-AI, and it can provide a good foundation for future research in this line of work.

This work differs from the published article[6] in four ways: 1) it identifies and discusses the categories of MC-AI developments from the perspective of six dimensions, 2) it identifies and groups techniques that can be vital to enhancing data quality and realizing DC-AI, 3) it provides the workflow of MC-AI and DC-AI when solving a real-world problem using AI, and 4) it pinpoints the potential benefits of DC-AI from a broader perspective than previously anticipated.

MC-AI AND DC-AI

Figure 1 illustrates the workflows of both DC-AI and MC-AI paradigms in real-world scenarios. We define both paradigms as follows:

• In MC-AI, developers usually pay more attention to optimizing the model’s code while rarely inspecting the data. MC-AI can be formally expressed as

$$\text{MC-AI} = C' + D. \tag{1}$$

• In DC-AI, developers need to look into the data, along with iteratively improving algorithms and/or code. Specifically, developers should iteratively investigate and enhance the data, along with tweaking the AI model. DC-AI can be formally expressed as

$$\text{DC-AI} = C + D'. \tag{2}$$

In (1) and (2), $C$ and $D$ refer to code and data, respectively. The prime sign ($'$) on $C$ or $D$ indicates the priority in the respective paradigm. In real settings, $C$ is the code of any AI model, and $D$ is data enclosed in any modality (e.g., table, images, text, and so forth).

In MC-AI, developers improve the code/algorithms only when the AI model yields poor results [step 8 in Figure 1(a)]. In contrast, both the data and the AI model’s code/algorithms are jointly inspected in DC-AI when the AI model yields poor results [step 8 in Figure 1(b)]. Also, data are significantly improved in step 4 before being fed into AI models.

SIX NOTEWORTHY DIMENSIONS OF RESEARCH/DEVELOPMENTS IN MC-AI

MC-AI has significantly contributed to advancing the technical potency of AI when solving many real-life problems. Some of the major problems are natural language processing, emotion detection, human activity recognition, and pandemic mitigation.[8],[9] Researchers have explored MC-AI from multiple perspectives, but most of those are related to improving the architectural aspects (e.g., the code of AI models). To provide a clear overview of developments concerning MC-AI, we classify major research/developments into the following six broad dimensions:

1) Ever-expanding horizons of AI applications: In the beginning, AI was mostly confined to the computer science field and was used/investigated by
computer scientists. However, with time, AI has expanded into many other disciplines. Currently, in most sectors, AI has been rigorously used to accomplish multiple objectives, and AI has taken over some jobs from humans, such as gauging the amount of liquid in water/wine bottles. Specifically, AI applications in the health-care sector are booming. The recent pandemic sparked the use of AI in the health-care sector.[1] Because the data used to train AI models can vary from application to application, developers need to pay ample attention to the data for each particular application.

FIGURE 1. Workflows of (a) MC-AI and (b) DC-AI when adopted to solve real-life problems.

2) Advancements in network architectures: In the early days of AI development, the mapping of input ($X$) to output ($Y$) was governed by a few intermediate layers. In the perceptron model, computer scientists were interested in determining whether or not a mathematical function can map a vector of numbers to some specific target class. However, these methods are simple and require much greater human involvement. The landscape of AI models changed, and computer scientists became interested in limiting human involvement, so the perceptron evolved into neural networks that require less feature engineering (or human involvement). Consequently, a simple perceptron model that can work well with simple data was enhanced to a complex neural network to solve problems in which the input can be images/videos. In other words, a simple binary mathematical function was replaced with layers and channels, and multiple interactions performed between layers determine the output. These enhancements in the network architecture improved the technical status of AI technology. Today, a mammoth number of network architectures exist, which can be used to solve a wide variety of real-world problems.
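As a rough illustration of the perceptron idea described under dimension 2 (a single function that maps a vector of numbers to a target class), the following Python sketch trains one threshold unit with the classic perceptron update rule. It is not taken from the article: the function names, the AND-gate toy data, the learning rate, and the epoch count are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Binary threshold unit: class 1 if the weighted sum is positive, else class 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 20,          # assumed value, chosen for this toy problem
    learning_rate: float = 1.0,  # assumed value, keeps the arithmetic exact
) -> Tuple[List[float], float]:
    """Classic perceptron rule: adjust weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
    return weights, bias


# Toy data (assumption): the logical AND of two binary inputs.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

A single unit like this can only separate classes with a straight line (or hyperplane), which is one way to read the article's remark that the perceptron works well with simple data but had to be extended to multilayer networks for inputs such as images and videos.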
systems. In Table 2, we summarize all six dimensions and highlight their main priorities. From the analysis, we found that MC-AI gives minimal preference to the data, and therefore, fiddling with code may not yield desirable solutions to many industrial problems.

THREE-LEVEL DC-AI PARADIGM

DC-AI is a very recent paradigm that explores ways to improve data quality to enhance the performance of AI models.[16],[17] Table 3 presents the core approaches of DC-AI. Specifically, we classify various approaches that can be employed as part of DC-AI into the following three levels:

1) First level: This includes 24 basic approaches that fall under the DC-AI umbrella. Most of the approaches can be applied to the initial phase of AI system development. For example, it is vital to collect only relevant and necessary data concerning the problem, and there exists a data-relevance approach at this level to ensure
it. Similarly, sensitive data classification and risk of misuse can be assessed to guarantee better protection of sensitive data in the lifecycle. Furthermore, analysis of data completeness is desirable at this level to prevent performance-degradation issues in AI systems. Satisfying most of the approaches at the first level can prevent inadvertently propagating data-specific biases to the other levels, which can contribute to the development of effective AI systems to solve real-world problems.

TABLE 3. Salient approaches of the DC-AI paradigm that can make AI more effective.

2) Second level: This level includes nine different approaches that are relatively more sophisticated and advanced than those in the previous level. These approaches enhance the quality of data and can be employed to determine whether the data are complete according to most of the aspects. These approaches empower AI developers to have strong control over the data, and therefore, all parts can be equally used in the training/development of AI systems. Most of the approaches at this level are multicriteria, meaning that multiple coefficients can be used to quantify the level of each approach. Equation (3) is an example of data quality estimation using a multicriteria method:

$$D_q = w_1\,\mathrm{Acc} + w_2\,\mathrm{Con} + w_3\,\mathrm{Com} + w_4\,\mathrm{Met} + w_5\,\mathrm{Tim} + w_6\,\mathrm{Rel} + w_7\,\mathrm{Eff} \tag{3}$$

where $D_q$ refers to data quality, and Acc, Con, Com, Met, Tim, Rel, and Eff denote accuracy, consistency, completeness, metadata, timeliness, relevance, and effectiveness, respectively. These parameters can be quantified using mathematical formulas or numerical scores given by domain experts based on data judgment. The formula for computing Acc is expressed as

$$\mathrm{Acc} = \frac{C}{A} \tag{4}$$

where $C$ denotes the number of samples that are recognized correctly, and $A$ denotes the total number of samples in a dataset.

Similarly, the value of Con can be quantified using

$$\mathrm{Con} = \frac{C_{\mathrm{index}}}{R_{\mathrm{index}}} = \frac{(\lambda_{\max} - n)/(n - 1)}{R_{\mathrm{index}}} \tag{5}$$

where $C_{\mathrm{index}}$ is the consistency index, $R_{\mathrm{index}}$ is the random consistency index, and $n$ is the number of observations in the data. The value of $R_{\mathrm{index}}$ is determined using a lookup table by passing $n$ as a parameter.

The Com value can be quantified via

$$\mathrm{Com} = \left(1 - \frac{U}{n}\right) \times 100 \tag{6}$$

where $U$ denotes the number of missing values, and $n$ denotes the total number of entries. For example, for a dataset having 500 records with 110 missing values, Com is $(1 - 110/500) \times 100 = 78\%$. Met is the detailed information of a column/dataset. It is ideal to analyze the metadata before building ML models. For example, distribution skew is a very common problem in ML, and it can be computed using

$$\mathrm{Met} = \frac{c_M}{c_m} \tag{7}$$

where $c_M$ denotes the number of instances from the major class, and $c_m$ denotes the number of instances from the minor class. In real cases, the distribution/frequencies of the column can be computed and used in Met.

Tim can be quantified by taking the difference between the data curation time ($T_c$) and the data use time ($T_u$):

$$\mathrm{Tim} = T_c - T_u. \tag{8}$$

The value of Tim can be compared with some threshold $t$ to decide about data acceptance/unacceptance, expressed as

$$\mathrm{Tim} = \begin{cases} \text{unacceptable}, & \text{if } \mathrm{Tim} \ge t \\ \text{acceptable}, & \text{otherwise.} \end{cases} \tag{9}$$

Rel can be quantified using

$$\mathrm{Rel} = \frac{S_f}{T_f} \tag{10}$$

where $S_f$ denotes the number of salient features, and $T_f$ refers to the total number of features. Rel can also be used to draw relevant samples out of the total samples.

Eff can be quantified using

$$\mathrm{Eff} = \frac{A}{D} \tag{11}$$

where $A$ is the achieved accuracy/data, and $D$ is the desirable accuracy/data.

In (3), $w_i$, where $i = 1, \ldots, 7$, denotes the weight of each parameter. The range of the $w_i$ coefficients is between zero and one (e.g., $w_i > 0$), and $\sum_{i=1}^{7} w_i = 1$. The optimal values of the coefficients can be specified by domain experts or adjusted based on the importance/problem.

It is worth noting that some of the aforementioned
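To make the multicriteria score in (3) and the individual measures in (4) to (11) more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, `lambda_max` and `r_index` are passed in rather than computed or looked up, and every score is rescaled to the range [0, 1] before weighting, which the article does not prescribe.

```python
import math
from typing import Dict


def accuracy(correct: int, total: int) -> float:
    """Eq. (4): correctly recognized samples over all samples."""
    return correct / total


def consistency(lambda_max: float, n: int, r_index: float) -> float:
    """Eq. (5): consistency index divided by the random consistency index."""
    c_index = (lambda_max - n) / (n - 1)
    return c_index / r_index


def completeness(missing: int, total: int) -> float:
    """Eq. (6): percentage of entries that are not missing."""
    return (1 - missing / total) * 100


def class_skew(major: int, minor: int) -> float:
    """Eq. (7): major-class instances over minor-class instances."""
    return major / minor


def timeliness_ok(t_curation: float, t_use: float, threshold: float) -> bool:
    """Eqs. (8)-(9): unacceptable once Tim = Tc - Tu reaches the threshold."""
    return (t_curation - t_use) < threshold


def relevance(salient: int, total: int) -> float:
    """Eq. (10): salient features over all features."""
    return salient / total


def effectiveness(achieved: float, desired: float) -> float:
    """Eq. (11): achieved accuracy/data over desired accuracy/data."""
    return achieved / desired


def data_quality(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Eq. (3): weighted sum of the criteria; weights lie in (0, 1] and sum to 1."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    if any(not 0 < w <= 1 for w in weights.values()):
        raise ValueError("each weight must lie in (0, 1]")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in scores)


# Worked example from the text: 500 records with 110 missing values.
com = completeness(missing=110, total=500)  # about 78 (percent)

# Hypothetical scores; rescaling each one to [0, 1] is an assumption.
scores = {
    "Acc": accuracy(450, 500),
    "Con": 0.9,              # assumed expert-supplied score
    "Com": com / 100,
    "Met": 0.8,              # assumed expert-supplied score
    "Tim": 1.0,              # assumed: data judged timely
    "Rel": relevance(12, 20),
    "Eff": effectiveness(0.85, 0.95),
}
weights = {name: 1 / 7 for name in scores}  # equal weights, chosen arbitrarily
dq = data_quality(scores, weights)
print(f"Com = {com:.1f}%, Dq = {dq:.3f}")
```

In practice the weights would come from domain experts, as the text notes, and criteria on different scales (for example Com in percent or the unbounded skew ratio of Met) would need an agreed normalization before they can be combined meaningfully.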
(SIEDS), 2023, pp. 274–279, doi: 10.1109/SIEDS58326.2023.10137850.
6. E. Strickland, “Andrew Ng, AI minimalist: The machine-learning pioneer says small is the new big,” IEEE Spectr., vol. 59, no. 4, pp. 22–50, Apr. 2022, doi: 10.1109/MSPEC.2022.9754503.
7. E. Jeczmionek and P. A. Kowalski, “Input reduction of convolutional neural networks with global sensitivity analysis as a data-centric approach,” Neurocomputing, vol. 506, pp. 196–205, Sep. 2022, doi: 10.1016/j.neucom.2022.07.027.
8. S. Geravesh and V. Rupapara, “Artificial neural networks for human activity recognition using sensor based dataset,” Multimedia Tools Appl., vol. 82, no. 10, pp. 14,815–14,835, Apr. 2023, doi: 10.1007/s11042-022-13716-z.
9. A. H. Shamman, A. A. Hadi, A. R. Ramul, M. M. A. Zahra, and H. M. Gheni, “The artificial intelligence (AI) role for tackling against COVID-19 pandemic,” Mater. Today, Proc., vol. 80, pp. 3663–3667, 2023, doi: 10.1016/j.matpr.2021.07.357.
10. Q. Huang, “Weight-quantized squeezeNet for resource-constrained robot vacuums for indoor obstacle classification,” AI, vol. 3, no. 1, pp. 180–193, 2022, doi: 10.3390/ai3010011.
11. J. Torres-Tello and S.-B. Ko, “Optimizing a multispectral-images-based dl model, through feature selection, pruning and quantization,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), 2022, pp. 1352–1356, doi: 10.1109/ISCAS48785.2022.9937940.
12. A. Anwar, “Difference between AlexNet, VGGNet, ResNet, and inception,” Medium-Towards Data Science, 2019. https://towardsdatascience.com/the-w3h-of-alexnet-vggnet-resnet-and-inception-7baaaecccc96
13. Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, “Federated learning,” Synthesis Lectures Artif. Intell. Mach. Learn., vol. 13, no. 3, pp. 1–207, 2019, doi: 10.2200/S00960ED2V01Y201910AIM043.
14. A. Majeed, X. Zhang, and S. O. Hwang, “Applications and challenges of federated learning paradigm in the big data era with special emphasis on COVID-19,” Big Data Cogn. Comput., vol. 6, no. 4, 2022, Art. no. 127, doi: 10.3390/bdcc6040127.
15. L. Wu, G. Perin, and S. Picek, “I choose you: Automated hyperparameter tuning for deep learning-based side-channel analysis,” IEEE Trans. Emerg. Topics Comput., early access, 2022, doi: 10.1109/TETC.2022.3218372.
16. L. Schmarje et al., “A data-centric approach for improving ambiguous labels with combined semi-supervised classification and clustering,” in Proc. Eur. Conf. Comput. Vision, Cham, Switzerland: Springer, 2022, pp. 363–380, doi: 10.1007/978-3-031-20074-8_21.
17. A. Majeed and S. O. Hwang, “Data-centric artificial intelligence, preprocessing, and the quest for transformative artificial intelligence systems development,” Computer, vol. 56, no. 5, pp. 109–115, May 2023, doi: 10.1109/MC.2023.3240450.
18. I. Taleb, M. A. Serhani, C. Bouhaddioui, and R. Dssouli, “Big data quality framework: A holistic approach to continuous quality management,” J. Big Data, vol. 8, no. 1, pp. 1–41, May 2021, doi: 10.1186/s40537-021-00468-0.
19. U. M. Fayyad, “From stochastic parrots to intelligent assistants—The secrets of data and human interventions,” IEEE Intell. Syst., vol. 38, no. 3, pp. 63–67, May/Jun. 2023, doi: 10.1109/MIS.2023.3268723.
20. J. M. Johnson and T. M. Khoshgoftaar, “Data-centric AI for healthcare fraud detection,” SN Comput. Sci., vol. 4, no. 4, 2023, Art. no. 389, doi: 10.1007/s42979-023-01809-x.

ABDUL MAJEED is an assistant professor with the Department of Computer Engineering, Gachon University, Seongnam, 13120, South Korea. His research interests include privacy-preserving data publishing, data-centric artificial intelligence, and federated learning. Majeed received his Ph.D. degree in computer information systems and networks from the Korea Aerospace University. Contact him at [email protected].

SEONG OUN HWANG is a professor with the Department of Computer Engineering, Gachon University, Seongnam, 13120, South Korea. His research interests include cryptography, data-centric artificial intelligence, and cybersecurity. Hwang received his Ph.D. degree in computer science from the Korea Advanced Institute of Science and Technology. He is a Senior Member of IEEE. He is the corresponding author of this article. Contact him at [email protected].