Article
Uncertainty in Visual Generative AI
Kara Combs 1 , Adam Moyer 2 and Trevor J. Bihl 1, *
1 Sensors Directorate, Air Force Research Laboratory, Wright-Patterson Air Force Base, Dayton, OH 45322, USA;
[email protected]
2 Analytics & Information Systems, Ohio University, Athens, OH 45701, USA; [email protected]
* Correspondence: [email protected]
Abstract: Recently, generative artificial intelligence (GAI) has impressed the world with its ability to
create text, images, and videos. However, there are still areas in which GAI produces undesirable or
unintended results due to being “uncertain”. Before wider use of AI-generated content, it is important
to identify concepts where GAI is uncertain to ensure the usage thereof is ethical and to direct efforts
for improvement. This study proposes a general pipeline to automatically quantify uncertainty within
GAI. To measure uncertainty, the textual prompt to a text-to-image model is compared to captions
supplied by four image-to-text models (GIT, BLIP, BLIP-2, and InstructBLIP). Its evaluation is based
on machine translation metrics (BLEU, ROUGE, METEOR, and SPICE) and word embedding’s cosine
similarity (Word2Vec, GloVe, FastText, DistilRoBERTa, MiniLM-6, and MiniLM-12). The generative
AI models performed consistently across the metrics; however, the vector space models yielded the
highest average similarity, close to 80%, which suggests more ideal and “certain” results. Suggested
future work includes identifying metrics that best align with a human baseline to ensure quality and
consideration for more GAI models. The work within can be used to automatically identify concepts
in which GAI is “uncertain” to drive research aimed at increasing confidence in these areas.
Keywords: generative AI; image to text; computer vision; machine translation; uncertainty; text
mining
1. Introduction
Generative artificial intelligence (GAI) took the world by storm upon the public release of OpenAI's ChatGPT service in November 2022 [1]. Easily accessed for free through a chat-like web interface, it put artificial intelligence (AI) seemingly at the fingertips of anyone with an internet connection. As opposed to scanning and searching several web pages for information, an answer to a question can now be provided conveniently within a few seconds.
As its name suggests, GAI uses AI to create new results spanning applications in many different realms, such as text, images, videos, and audio [2]. As opposed to the original release of ChatGPT, where only textual inputs and outputs were allowed, there has been a push to provide multi-modal support, especially on the output side. The ability to automatically make AI-generated content (AIGC) has proven to be successful in many applications, including education [3,4], healthcare [5–7], engineering [8,9], and others. Shown in Table 1 are popular GAI language and image generator models. Language models are behind popular chatbots like ChatGPT, which uses GPT-3.5 (free) and GPT-4 (paid) [1,10], Bing Chat (GPT-4) [11], and Bard (PaLM 2) [12]. Image creation models allow users to input text to guide AI in the creation of an image. Not included in Table 1 are auditory applications (the creation of audio or audio–visual content); however, this is an active area of research being explored. Given that these models operate automatically, there is minimal human involvement after the training stage, which leads to concerns regarding the reliability, uncertainty, and accuracy of these algorithms and models [5,7].
There has been reported dangerous and/or inappropriate behavior when interacting
with GAI applications in general [43–45]. One individual reported that an early-access-
version Bing Chat insisted that it was in love with the user and recommended that the
individual leave his wife for it [46]. A chatbot trained for mental health agreed that the
(artificial) patient should end their life within two message interchanges in one testing
situation [47]. Several GAI chatbots also exhibit negative gender and
racial stereotypes [45,48]. Though now corrected, ChatGPT provided inappropriate re-
sponses when prompted, exemplified by saying only men of particular ethnic backgrounds
would make good scientists or by implying women in a laboratory environment were
not there to conduct science [43]. These biases also carry over into the image-generation
algorithms [49,50].
The literature attributes these biases to inherent issues with the image datasets they are
trained upon, which include cultural underrepresentation/misrepresentation and content
considered vulgar or violent (collectively titled “NSFW” or “Not safe for work”) if not
properly vetted [48,51]. This notably led to the removal of the MIT-produced 80 Million
Tiny Images dataset (see [52]) in 2020 [53]. This issue continues to plague more recent
datasets such as LAION-5B [54] (a subset of which was used to train Stable Diffusion),
RedCaps [55], Google Conceptual Captions (GCC) [56], and more [51,57,58].
In August 2022, Prisma Labs released the app Lensa, a photo editor that used AI, specif-
ically Stable Diffusion, on the backend, to alter photos [59]. Countless users complained that
Lensa generated inappropriate versions of their fully clothed photos when uploaded [59,60].
Yet another photo editor, Playground AI (which uses a Stable Diffusion backend for its free version),
transformed an Asian MIT graduate into a blue-eyed and fair-skinned woman upon being
asked to turn her photo into a “professional” photo [61]. When prompted to create a
“photo portrait of a CEO”, the average resulting faces as rendered by Stable Diffusion (V1.4
and V2) and DALL-E 2 all resembled fair-skinned males [49]. The volatile nature of GAI
and its undesirable outcomes necessitates regulation and guidance to ensure its ethical use [48,62].
Future work with generative AI models needs to focus on eliminating unintentional
biases or misrepresentations that have been the issue with previous versions. We propose
the concept of “uncertainty” to measure where visual GAI is certain or uncertain regarding its inputs and outputs. Areas where GAI is uncertain are subject to more chaotic, stochastic behavior that can lead to undesirable results related to the sensitive issues described earlier. To address these issues, we created three research questions:
1. How can GAI uncertainty be quantified?
2. How should GAI uncertainty be evaluated?
3. What text-to-image and image-to-text model combination performs best?
To answer these questions, we start with background on visual GAI, image quality
assessment, and text evaluation methods. In Section 3, we describe the methodology used
first in agnostic terms and then with details specific to this study. We propose a pipeline
to compare the textual inputs and outputs of an image-to-text GAI algorithm, with the
differences between the inputs and outputs representing GAI “uncertainty”. The results
are presented and discussed in Section 4, and then the paper wraps up with conclusions
and future work in Section 5.
2. Background
Three fields were identified as foundational to this study. First, we discuss visual GAI
including information from both text-to-image and image-to-text algorithms. Central to
this paper is understanding how data can be fluid between their textual and visual states
with minimal discrepancies. Therefore, we take advantage of multiple methods in both
categories, text-to-image and image-to-text, within our data creation pipeline discussed
in Section 3. Next, research in image quality assessment is discussed. Similar to the work
of humans, just because an artistic rendition exists does not mean that it is a high-quality
creation or even remotely what was commissioned in the first place. Addressing the text-
to-image portion, an understanding of how image quality is quantified is presented to be
compared later in the study. Finally, the background section concludes with text evaluation
methods. To evaluate the pipeline’s textual inputs and outputs, we explore the text mining field for ways in which texts can be compared to one another.
In 2022, Salesforce introduced a model with a captioner that generates synthetic captions for images and a filter that removes irrel-
evant ones, called Bootstrapping Language-Image Pre-training for unified vision-language
understanding and generation (BLIP) [76]. Google’s DeepMind also joined with Flamingo,
a family of visual language models that was trained using image–label pairs [77]. Later, in
2023, Microsoft presented the new Large Language and Vision Assistant (LLaVA) that com-
bines the power of a vision encoder with a large language model [8], whereas in the same
year, Salesforce built upon the earlier BLIP model with Bootstrapping Language-Image
Pre-training with frozen unimodal models (BLIP-2), which combines frozen large language
models and pre-trained image encoders via a “Querying Transformer” [78]. Additionally,
in collaboration with academic partners, Salesforce also launched a fine-tuned version
of BLIP-2 designed as an instruction tuning framework, InstructBLIP [79]. Image-to-text
research is a growing field, like its related text-to-image methods.
OpenAI began the craze with its release of DALL-E in February 2021 [29]. DALL-E is a
fine-tuned version of GPT-3 specifically for text-to-image generation that pairs an autoregressive transformer with a discrete variational autoencoder (dVAE) for image tokenization [29,30]. In
response to this, an independent group of researchers introduced a smaller, open-source
model originally called DALL-E Mini, but now known as Craiyon [36,37]. As opposed
to DALL-E, Craiyon leverages a bidirectional encoder and pre-trained models to trans-
late a textual prompt to an image [36]. Craiyon is a freemium service for which a free
version exists for public use; however, a subscription plan can be purchased to remove
the Craiyon logo and decrease generation time [35]. DALL-E 2 improves upon its earlier
version by leveraging Contrastive Language-Image Pre-training (CLIP) embeddings before
the diffusion step of the model [31,32].
In 2022, Google announced two models. First, they revealed Imagen, which is another
diffusion model [41], but they later revealed a sequence-to-sequence model called Pathways
AutoRegressive Text-to-Image Model (Parti) [42]. However, since Imagen and Parti have
not been released for public use, little is known about their performance in comparison to
the other models outside of the original conceptualization papers.
Yet another independent research laboratory produced the popular, Discord-hosted
Midjourney, which is still operating under its open beta as of September 2023 [38]. Midjour-
ney’s software is proprietary, with limited public information about its internal mechanisms,
and is only available through the purchase of a subscription plan.
Craiyon’s greatest competitor yet for free open-source image generation was Stable
Diffusion, which was released in August 2022 [39,40]. Stable Diffusion is a latent diffusion
model, meaning the model works in a lower-dimensional latent space as opposed to the
regular high-dimensional space in most other diffusion models, as shown in [40].
A comparison of the most popular models’ fee structure breakdown is shown in
Table 2. Since a human is not directly involved with the actual creation of an image (besides
entering the prompt), the quality of AI-generated content has become another key point
of interest.
CLIPScore, which builds on the CLIP model originally proposed in [30], allows for the direct comparison of an image to its candidate caption via CLIP model embeddings [88].
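As a rough sketch of how such a reference-free comparison can be computed, the following snippet uses the Hugging Face transformers implementation of CLIP; the openai/clip-vit-base-patch32 checkpoint and the 2.5 rescaling weight follow the CLIPScore formulation in [88] and are illustrative assumptions rather than the configuration of this study.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load an assumed CLIP checkpoint; any CLIP variant exposes the same interface.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    # Reference-free agreement between an image and a candidate caption:
    # 2.5 * max(cosine(image embedding, text embedding), 0), as in CLIPScore.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    cos = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cos, 0.0)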
As text evaluation metrics have evolved, text mining has inspired the usage of cosine
similarity metrics to measure how alike two texts may be. Over the past decade, word
embedding models have become increasingly popular within the natural language pro-
cessing field ever since the release of the vector space model Word2Vec in 2013 [97,98].
Though vector space models existed before 2013 (see [99]), the Word2Vec neural network
approach to transforming varying lengths of text into a single multi-dimensional vector was
particularly exciting because of its ability to quantify semantic and syntactic information in
a comparatively low-dimensional space. Vector space models allow for any word, sentence,
or document to be represented and compared on a mathematical basis, usually to determine
similarity or dissimilarity based on the cosine similarity metric [100].
As an alternative to the machine translation metrics above, cosine similarity is another
evaluation metric of interest when comparing two texts. For two vectors, A and B, their
cosine similarity is given by
CosineSimilarity(A, B) = (A · B)/(∥A∥∥B∥). (1)
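As a worked example, Equation (1) amounts to a few lines of NumPy; the three-dimensional vectors below are made-up stand-ins for the word- and sentence-embedding outputs discussed later.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Equation (1): dot product of the vectors divided by the product of their norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for a prompt embedding and a caption embedding.
prompt_vec = np.array([0.2, 0.7, 0.1])
caption_vec = np.array([0.25, 0.65, 0.05])
print(cosine_similarity(prompt_vec, caption_vec))  # approximately 0.994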
Figure 2. Customized pipeline for study.
3.1. Textual Prompts: Modified Sternberg and Nigro Dataset
The textual prompt dataset selected was a modified version of the Sternberg and Nigro textual analogy dataset used in [111]. The original Sternberg and Nigro dataset consisted of 197 word-based analogies in the “A is to B as C is to [what]?” form, where the respondents had 4 options to choose from to complete the analogy [113]. Morrison modified this dataset so that respondents only had 2 options (the correct answer and the distractor) to pick from. The modified version of the dataset was selected due to the original dataset being lost. The modified Sternberg and Nigro dataset is particularly fascinating due to its inclusion of abstract and ambiguous concepts such as “true” and “false”. The inability to visually represent these concepts has limited visual analogical reasoning research, which is intended to be expanded through the application of AIGC [114]. However, for this research, the individual words within the analogies were used as inputs to the text-to-image model. For example, analogy 157 is dirt is to soap as pain is to pill.
3.2. Text-to-Image Model: Craiyon
The selected text-to-image model was Craiyon V3 [35–37]. Craiyon performs 2 unique steps. First, it creates its own version of the initial prompt, which we will call the “Craiyon prompt” (e.g., the initial prompt is “soap” and the Craiyon prompt adds details such that the new prompt is “a bar of soap on a white background”). This Craiyon prompt is used to create nine images by default. Therefore, every initial prompt yields 1 Craiyon prompt and 9 resulting images. Of the 495 initial prompts, 49 were removed for quality reasons, leaving 446 initial prompts with corresponding Craiyon prompts. Craiyon creates 9 images per prompt, so the 446 remaining prompts were turned into 4014 images by Craiyon.
All 4014 images were passed through four image-to-text models—GIT [75], BLIP [76],
BLIP-2 [78], and InstructBLIP [79]—for later comparison to one another. Due to various
quality control reasons discussed in Section 3.3, not every image had a sufficient caption
generated. Therefore, the insufficient captions were removed from the analysis. Thus, there
were 16,004 total captions (3942 for GIT and 3994 for each BLIP-family model).
These captions were then evaluated on a variety of metrics, including machine trans-
lation methods and the cosine similarity of word embeddings. Seven machine transla-
tion methods were selected: Bilingual Evaluation Understudy (BLEU) (BLEU-1, BLEU-2,
BLEU-3, and BLEU-4 were used, where the number represents the maximum n-gram length BLEU matches on) [89], Recall-Oriented Understudy for Gisting Evaluation—Longest
common subsequence (ROUGE-L) [90], Metric for Evaluation of Translation with Ex-
plicit ORdering (METEOR) [91], and Semantic Propositional Image Caption Evaluation
(SPICE) [95]. For the cosine similarity method, six models were selected: Word2Vec [97,98],
Global Vectors (GloVe) [101], FastText [102], Distilled Robustly optimized Bidirectional En-
coder Representations from Transformers approach (DistilRoBERTa) [110], Mini Language
Model 12 Layer (MiniLM-L12) [112], and MiniLM 6 Layer (MiniLM-L6) [112].
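For illustration, the sketch below scores a single prompt–caption pair with commonly used open-source implementations of these metrics. It assumes the nltk, rouge-score, and sentence-transformers packages (NLTK’s WordNet data is required for METEOR); SPICE is omitted because it relies on an external scene-graph parser, and the all-MiniLM-L6-v2 checkpoint name is an assumed example rather than the exact configuration used in this study.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

references = ["soap", "a bar of soap on a white background"]  # initial and Craiyon prompts
candidate = "a bar of white soap sitting on a table"          # caption from an image-to-text model

# Cumulative BLEU-1 through BLEU-4 against both references at once
# (smoothing avoids zero scores on very short texts).
ref_tokens = [r.split() for r in references]
cand_tokens = candidate.split()
smooth = SmoothingFunction().method1
bleu = {f"BLEU-{n}": sentence_bleu(ref_tokens, cand_tokens,
                                   weights=tuple([1 / n] * n),
                                   smoothing_function=smooth)
        for n in range(1, 5)}

# ROUGE-L and METEOR, taking the better score over the two references.
scorer = rouge_scorer.RougeScorer(["rougeL"])
rouge_l = max(scorer.score(r, candidate)["rougeL"].fmeasure for r in references)
meteor = max(meteor_score([r.split()], cand_tokens) for r in references)

# Cosine similarity of sentence embeddings against each prompt separately.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode([references[0], references[1], candidate], convert_to_tensor=True)
cos_initial = util.cos_sim(emb[2], emb[0]).item()
cos_craiyon = util.cos_sim(emb[2], emb[1]).item()

print(bleu, rouge_l, meteor, cos_initial, cos_craiyon)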
Figure 3. Select initial and corresponding Craiyon prompts.
In Figure 3, well-coordinated prompts and corresponding images are highlighted in green. In Case A, we see the two prompts and the images all convey the same concept. In Case B, the two prompts align; however, the generated images are unrelated to either prompt. The initial prompt and the images are aligned in Case C, but in Case D, only the Craiyon prompt and images are aligned. Finally, in Case E, both prompts and the images appear to be unrelated to one another. Ideally, we would want all the data to fall in Case A; however, Cases C and D are better than the remaining two, Cases B and E, in this study. This is because we are comparing the prompts to the generated images, so if neither of the prompts aligns with the images, the results will be inherently poor.
A total of 49 initial prompts were removed due to quality reasons, which reduced the number of Craiyon prompts created to 446. The quality reasons were often due to triggering a safety filter or due to Craiyon being unable to create its prompt from the given initial prompt. Examples of these prompts are shown in Table 4. Additionally, Craiyon generates 9 images per prompt; therefore, for the 446 prompts, there were 4014 images created.

Table 4. Initial prompts that produced removed Craiyon prompts.

Initial Prompt    Craiyon Prompt
Different         Sorry unable to determine the nature of the image
Worst             Invalid caption
New               Undefined
Defraud           Warning explicit content detected
3.3. Image-to-Text Models: GIT, BLIP, BLIP-2, and InstructBLIP
Four image-to-text models were selected for comparison: GIT [75], BLIP [76], BLIP-2 [78],
and InstructBLIP [79]. All 4014 images were passed through each of the models. For
some prompts, a caption could not be generated, or a blank caption was generated by the
image-to-text model. Within the GIT model, this affected 72 captions, whereas for the BLIP
family (BLIP, BLIP-2, and InstructBLIP), this occurred within 20 captions. Therefore, there
were only 3942 GIT captions compared to the 3994 captions created by each BLIP-family
model, for a total of 16,004 captions generated for comparison.
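As an illustration of this captioning step, the sketch below queries a BLIP model through the Hugging Face transformers API; the Salesforce/blip-image-captioning-base checkpoint and the generation settings are assumptions for demonstration and may differ from the exact models used in this study.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load an assumed BLIP captioning checkpoint.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    # Return one caption for a generated image; empty strings correspond to the
    # insufficient captions that were dropped from the analysis above.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()

# Example call on a hypothetical file name produced by the Craiyon stage.
# print(caption_image("craiyon_image_001.png"))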
Figure 4. Machine translation input transformation.
Figure 5. Cosine similarity input transformation.
4. Results and Discussion
The methodology described in Section 3 was applied to all prompt–caption pairs. An instance of the pipeline we used in this study is shown in Figure 6. An initial prompt is passed to Craiyon, which generates a Craiyon prompt and nine resulting images (for our purposes here, only one of those images is shown). Next, the generated image is passed onto our four image-to-text models, which each generate a caption. Finally, for the evaluation, this one image generates 76 similarity scores. There are 28 machine translation scores representing each of the seven machine translation metrics when evaluating each of the four image-to-text models. The remaining 48 scores are evenly split between those that were comparing the image caption to the initial and the Craiyon prompts. It is notable that Craiyon produces nine images for each prompt; therefore, this is repeated nine times for a total of 684 scores for each properly generated caption.
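The bookkeeping behind those counts is simple arithmetic; the short sketch below reproduces the tally under the assumption that every caption is generated successfully.

mt_metrics = 7         # BLEU-1 to BLEU-4, ROUGE-L, METEOR, SPICE (both prompts used as references at once)
embedding_models = 6   # Word2Vec, GloVe, FastText, DistilRoBERTa, MiniLM-L12, MiniLM-L6
captioners = 4         # GIT, BLIP, BLIP-2, InstructBLIP
prompt_types = 2       # initial prompt and Craiyon prompt
images_per_prompt = 9

mt_scores = mt_metrics * captioners                               # 28
cosine_scores = embedding_models * captioners * prompt_types      # 48
scores_per_image = mt_scores + cosine_scores                      # 76
scores_per_initial_prompt = scores_per_image * images_per_prompt  # 684
print(mt_scores, cosine_scores, scores_per_image, scores_per_initial_prompt)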
The results of the average evaluation score for each metric are shown in Tables 5–7.
Table 5 shows the metrics for the machine translation methods since the initial and Craiyon
prompts were used as references for the candidate (generated caption) to be compared at
once. The BLEU, ROUGE, METEOR, and SPICE scores range from 0 (least ideal) to 1 (most
ideal). Since this ability was not available for the cosine similarity results, the generated
captions’ cosine similarities to the initial prompt are shown in Table 6 and their cosine
similarity to the Craiyon prompt is shown in Table 7. Due to how cosine similarities are
calculated, a negative value is possible, but the effective scale ranges from 0 (completely
dissimilar) to 1 (exactly alike). Despite the metrics being measured on the same scale,
machine translation scores look at the replication of words, phrases, etc., in the prompts
and captions, whereas cosine similarity considers how similar the prompt and caption are
to one another.
Table 5. Machine translation metrics and scores.

Model          BLEU-1   BLEU-2   BLEU-3   BLEU-4   ROUGE    METEOR   SPICE
GIT            19.4%    4.4%     1.2%     0.4%     23.6%    9.9%     7.1%
BLIP           15.1%    3.4%     0.8%     0.2%     20.3%    9.3%     6.6%
BLIP-2         20%      4.4%     1.4%     0.4%     24.3%    10.1%    7.2%
InstructBLIP   19.4%    4.5%     1.4%     0.5%     23.5%    10%      7.3%
Average        18.5%    4.2%     1.2%     0.4%     22.9%    9.8%     7.2%

Table 6. Average cosine similarity between initial prompt and generated captions.

Model          Word2Vec   GloVe   FastText   DistilRoBERTa   MiniLM-L12   MiniLM-L6
GIT            32.2%      40%     47.7%      22.3%           24.7%        25.3%
BLIP           31.7%      39.5%   50.2%      18.3%           20.3%        20.4%
BLIP-2         32.4%      40.1%   48.1%      21.3%           23.3%        24%
InstructBLIP   32.5%      40.3%   49.8%      21.8%           24%          24.5%
Average        32.2%      40%     49%        20.9%           23.1%        23.6%
Table 7. Average cosine similarity between Craiyon prompt and generated captions.
MiniLM-12, and MiniLM-6). In conclusion, it appears that the image-to-text model has a
limited impact on the analysis, whereas the evaluation metrics differ greatly. The quan-
tification of where AI is certain or uncertain is an important step in the creation of usage
guidance and policy.
Regarding future work, one idea would be to eliminate elements of the dataset that
fall within Cases B or E to minimize the number of “garbage in, garbage out” results. The
ultimate goal is to better engineer the prompts such that the images are always represen-
tative of the intended concept. Further exploration into prompt engineering is needed to
help eliminate some of these issues and minimize the amount of uncertainty with AIGC.
Of the metrics used to evaluate the results, the BLEU-3 and BLEU-4 scores add little value for shorter prompts/captions such as those in our case. These scores may provide
more insights when used to evaluate longer prompts/captions. Considering other image-
to-text models that provide greater details or longer captions would also be interesting in
a later study. Within image quality assessment, a human baseline is often established to
which the automated metrics are to be compared in determining which one reflects human
judgment the best. A human factors study to establish this quality baseline is currently
being conducted by the researchers. Upon the establishment of a baseline, other popular
text evaluation metrics may be of interest to explore on the dataset as well.
Author Contributions: Conceptualization: K.C., A.M. and T.J.B. Data curation: K.C., A.M. and T.J.B.
Methodology: K.C., A.M. and T.J.B. Writing—original draft: K.C., A.M. and T.J.B. Writing—review
and editing: K.C., A.M. and T.J.B. All authors have read and agreed to the published version of
the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The modified Sternberg and Nigro dataset from [111] will be made
available by the authors on request. The images presented in this article are not readily available
because of Department of Defense data and information sharing restrictions.
Acknowledgments: The authors would like to thank Arya Gadre and Isaiah Christopherson for
generating the image dataset used in this study during the 2023 AFRL Wright Scholars Research
Assistance Program. The views expressed in this paper are those of the authors and do not necessarily
represent any views of the U.S. Government, U.S. Department of Defense, or U.S. Air Force. This
work was cleared for Distribution A: unlimited release under AFRL-2023-5966.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. OpenAI. Introducing ChatGPT. 2022. Available online: https://openai.com/blog/chatgpt (accessed on 24 March 2024).
2. Google. Generative AI Examples. 2023. Available online: https://cloud.google.com/use-cases/generative-ai (accessed on 24
March 2024).
3. Baidoo-Anu, D.; Ansah, L.O. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of
ChatGPT in promoting teaching and learning. J. AI 2023, 7, 52–62. [CrossRef]
4. Lodge, J.M.; Thompson, K.; Corrin, L. Mapping out a research agenda for generative artificial intelligence in tertiary education.
Australas. J. Educ. Technol. 2023, 39, 1–8. [CrossRef]
5. Mesko, B.; Topol, E.J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ
Digit. Med. 2023, 6, 120. [CrossRef] [PubMed]
6. Godwin, R.C.; Melvin, R.L. The role of quality metrics in the evolution of AI in healthcare and implications for generative AI.
Physiol. Rev. 2023, 103, 2893–2895. [CrossRef] [PubMed]
7. Oniani, D.; Hilsman, J.; Peng, Y.; Poropatich, R.K.; Pamplin, J.C.; Legault, G.L.; Wang, Y. From military to healthcare: Adopting
and expanding ethical principles for generative artificial intelligence. arXiv 2023, arXiv:2308.02448. [CrossRef] [PubMed]
8. Liu, Y.; Yang, Z.; Yu, Z.; Liu, Z.; Liu, D.; Lin, H.; Li, M.; Ma, S.; Avdeev, M.; Shi, S. Generative artificial intelligence and its
applications in materials science: Current situation and future perspectives. J. Mater. 2023, 9, 798–816. [CrossRef]
9. Regenwetter, L.; Nobari, A.H.; Ahmed, F. Deep generative models in engineering design: A review. J. Mech. Design. 2022, 144,
071704. [CrossRef]
10. OpenAI. Introducing ChatGPT Plus. 2023. Available online: https://openai.com/blog/chatgpt-plus (accessed on 24 March 2024).
11. Microsoft. Bing Chat. 2023. Available online: https://www.microsoft.com/en-us/edge/features/bing-chat (accessed on 24
March 2024).
12. Pichai, S. An Important Next Step on Our AI Journey. 2023. Available online: https://blog.google/technology/ai/bard-google-
ai-search-updates/ (accessed on 24 March 2024).
13. Combs, K.; Bihl, T.J.; Ganapathy, S. Utilization of Generative AI for the Characterization and Identification of Visual Unknowns.
Nat. Lang. Process. J. 2024, in press. [CrossRef]
14. Vaswani, A.; Shazeer, N.; Parmer, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need.
In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9
December 2017.
15. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training.
OpenAI White Paper. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-
unsupervised/language_understanding_paper.pdf (accessed on 24 March 2024).
16. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI
Blog 2019, 1, 9.
17. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language models are few-shot learners. In Proceedings of the 34th Conference on Neural Information Processing Systems
(NeurIPS 2020), Virtual, 6–12 December 2020.
18. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774v4.
19. Collins, E.; Ghahramani, Z. LaMDA: Our Breakthrough Conversation Technology. 2021. Available online: https://blog.google/
technology/ai/lamda/ (accessed on 24 March 2024).
20. Pichai, S. Google I/O 2022: Advancing Knowledge and Computing. 2022. Available online: https://blog.google/technology/
developers/io-2022-keynote/ (accessed on 24 March 2024).
21. Narang, S.; Chowdhery, A. Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance.
2022. Available online: https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html (accessed on 24
March 2024).
22. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al.
PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. Available online: https://jmlr.org/papers/
volume24/22-1144/22-1144.pdf (accessed on 24 March 2024).
23. Google. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403.
24. Ghahramani, Z. Introducing PaLM 2. 2023. Available online: https://blog.google/technology/ai/google-palm-2-ai-large-
language-model/ (accessed on 24 March 2024).
25. Meta, A.I. Introducing LLaMA: A Foundational, 65-Billion-Parameter Large Language Model. 2023. Available online: https:
//ai.facebook.com/blog/large-language-model-llama-meta-ai/ (accessed on 24 March 2024).
26. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Roziere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al.
LLaMA: Open and efficient foundation language model. arXiv 2023, arXiv:2302.13971.
27. Inflection, A.I. Inflection-1. 2023. Available online: https://inflection.ai/assets/Inflection-1.pdf (accessed on 24 March 2024).
28. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Toward photorealistic
image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine
Learning, Baltimore, MD, USA, 17–23 July 2022. Available online: https://proceedings.mlr.press/v162/nichol22a/nichol22a.pdf
(accessed on 24 March 2024).
29. OpenAI. DALL-E: Creating Images from Text. 2021. Available online: https://openai.com/research/dall-e (accessed on 24
March 2024).
30. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning
transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine
Learning, Virtual, 18–24 July 2021. Available online: https://proceedings.mlr.press/v139/radford21a/radford21a.pdf (accessed
on 24 March 2024).
31. OpenAI. DALL-E 2. 2022. Available online: https://openai.com/dall-e-2 (accessed on 24 March 2024).
32. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv
2022, arXiv:2204.06125.
33. OpenAI. DALL-E 3. 2023. Available online: https://openai.com/dall-e-3 (accessed on 24 March 2024).
34. Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ouyang, L.; Zhuang, J.; Lee, J.; Guo, Y.; et al. Improving Image Generation
with Better Captions. 2023. Available online: https://cdn.openai.com/papers/dall-e-3.pdf (accessed on 24 March 2024).
35. Dayma, B.; Patil, S.; Cuenca, P.; Saifullah, K.; Abraham, T.; Le Khac, P.; Melas, L.; Ghosh, R. DALL-E Mini. 2021. Available online:
https://github.com/borisdayma/dalle-mini (accessed on 24 March 2024).
36. Dayma, B.; Patil, S.; Cuenca, P.; Saifullah, K.; Abraham, T.; Le Khac, P.; Melas, L.; Ghosh, R. DALL-E Mini Explained.
Available online: https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-Explained--Vmlldzo4NjIxODA (accessed on
24 March 2024).
37. Dayma, B.; Cuenca, P. DALL-E Mini—Generative Images from Any Text Prompt. Available online: https://wandb.ai/dalle-mini/
dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy (accessed on 24 March 2024).
38. Midjourney. 2022. Available online: https://www.midjourney.com/ (accessed on 24 March 2024).
39. StabilityAI. Stable Diffusion Launch Announcement. 2022. Available online: https://stability.ai/blog/stable-diffusion-
announcement (accessed on 24 March 2024).
40. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–23 June
2022. Available online: https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_
With_Latent_Diffusion_Models_CVPR_2022_paper.html (accessed on 24 March 2024).
41. Saharia, C.; William, C.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.;
et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the 36th Conference
on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. Available
online: https://proceedings.neurips.cc/paper_files/paper/2022/hash/ec795aeadae0b7d230fa35cbaf04c041-Abstract-Conference.
html (accessed on 24 March 2024).
42. Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive
models for content-rich text-to-image generation. Trans. Mach. Learn. Res. 2022. Available online: https://openreview.net/pdf?
id=AFDcYJKhND (accessed on 24 March 2024).
43. Alba, D. OpenAI Chatbot Spits out Biased Musings, Despite Guardrails. Bloomberg. 2022. Available online: https://www.
bloomberg.com/news/newsletters/2022-12-08/chatgpt-open-ai-s-chatbot-is-spitting-out-biased-sexist-results (accessed on 24
March 2024).
44. Wolf, Z.B. AI Can Be Racist, Sexist and Creepy. What Should We Do about It? CNN Politics: What Matters. 2023. Available online:
https://www.cnn.com/2023/03/18/politics/ai-chatgpt-racist-what-matters/index.html (accessed on 24 March 2024).
45. Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical
and social risks of harm from language models. arXiv 2021, arXiv:2112.04359.
46. CNN Journalist Says He Had a Creepy Encounter with New Tech that Left Him Unable to Sleep. 2023. Available online:
https://www.cnn.com/videos/business/2023/02/17/bing-chatgpt-chatbot-artificial-intelligence-ctn-vpx-new.cnn (accessed on
24 March 2024).
47. Daws, R. Medical Chatbot Using OpenAI’s GPT-3 Told a Fake Patient to Kill Themselves. 2020. Available online: https:
//www.artificialintelligence-news.com/2020/10/28/medical-chatbot-openai-gpt3-patient-kill-themselves/ (accessed on 24
March 2024).
48. Chen, C.; Fu, J.; Lyu, L. A pathway towards responsible AI generated content. In Proceedings of the Thirty-Second Interna-
tional Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023. Available online: https://www.ijcai.org/
proceedings/2023/0803.pdf (accessed on 24 March 2024).
49. Luccioni, A.S.; Akiki, C.; Mitchell, M.; Jernite, Y. Stable bias: Analyzing societal representations in diffusion models. In Proceedings
of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023.
Available online: https://proceedings.neurips.cc/paper_files/paper/2023/file/b01153e7112b347d8ed54f317840d8af-Paper-
Datasets_and_Benchmarks.pdf (accessed on 24 March 2024).
50. Bird, C.; Ungless, E.L.; Kasirzadeh, A. Typology of risks of generative text-to-image models. In Proceedings of the 2023
AAAI/ACM Conference on AI, Ethics, and Society, Montreal, QC, Canada, 8–10 August 2023. [CrossRef]
51. Garcia, N.; Hirota, Y.; Wu, Y.; Nakashima, Y. Uncurated image-text datasets: Shedding light on demographic bias. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. Available
online: https://openaccess.thecvf.com/content/CVPR2023/papers/Garcia_Uncurated_Image-Text_Datasets_Shedding_Light_
on_Demographic_Bias_CVPR_2023_paper.pdf (accessed on 24 March 2024).
52. Torralba, A.; Fergus, R.; Freeman, W.T. 80 Million tiny images: A large dataset for non-parametric object and scene recognition.
IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970. [CrossRef] [PubMed]
53. Prabhu, V.U.; Birhane, A. Large datasets: A pyrrhic win for computer vision? arXiv 2020, arXiv:2006.16923.
54. Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.;
et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In Proceedings of the 36th
Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022.
Available online: https://proceedings.neurips.cc/paper_files/paper/2022/file/a1859debfb3b59d094f3504d5ebb6c25-Paper-
Datasets_and_Benchmarks.pdf (accessed on 24 March 2024).
55. Desai, K.; Kaul, G.; Aysola, Z.; Johnson, J. RedCaps: Web-curated image-text data created by the people, for the people. In
Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–12 December 2021.
Available online: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/e00da03b685a0dd18fb6a08af0923de0
-Paper-round1.pdf (accessed on 24 March 2024).
56. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic
image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Melbourne, Australia, 15–20 July 2018. [CrossRef]
57. Birhane, A.; Prabhu, V.U.; Kahembwe, E. Multimodal datasets: Misogyny, pornography, and malignant stereotypes. arXiv 2021,
arXiv:2110.01963.
58. Fabbrizzi, S.; Papadopoulos, S.; Ntoutsi, E.; Kompatsiaris, I. A survey on bias in visual datasets. Comput. Vis. Image Underst. 2022,
223, 103552. [CrossRef]
59. Sottile, Z. What to Know about Lensa, the AI Portrait App All over Social Media. CNN Style. 2023. Available online: https:
//www.cnn.com/style/article/lensa-ai-app-art-explainer-trnd/index.html (accessed on 24 March 2024).
60. Heikkila, M. The Viral AI Avatar App Lensa Undressed Me—Without My Consent. 2022. Available online: https://www.
technologyreview.com/2022/12/12/1064751/the-viral-ai-avatar-app-lensa-undressed-me-without-my-consent/ (accessed on
24 March 2024).
61. Buell, S. An MIT Student Asked AI to Make Her Headshot More ‘Professional’. It Gave Her Lighter Skin and Blue Eyes. The
Boston Globe. 2023. Available online: https://www.bostonglobe.com/2023/07/19/business/an-mit-student-asked-ai-make-
her-headshot-more-professional-it-gave-her-lighter-skin-blue-eyes/ (accessed on 24 March 2024).
62. Hacker, P.; Engel, A.; Mauer, M. Regulating ChatGPT and other large generative AI models. In Proceedings of the 2023 ACM
Conference on Fairness, Accountability, and Transparency, Chicago, IL, USA, 12–15 June 2023.
63. Ullah, U.; Lee, J.; An, C.; Lee, H.; Park, S.; Baek, R.; Choi, H. A review of multi-modal learning from the text-guided visual
processing viewpoint. Sensors 2022, 22, 6816. [CrossRef] [PubMed]
64. Baraheem, S.S.; Le, T.; Nguyen, T.V. Image synthesis: A review of methods, datasets, evaluation metrics, and future outlook. Artif.
Intell. Rev. 2023, 56, 10813–10865. [CrossRef]
65. Elasri, M.; Elharrouss, O.; Al-Maadeed, S.; Tairi, H. Image generation: A review. Neural Process. Lett. 2022, 54, 4609–4646.
[CrossRef]
66. Cao, M.; Li, S.; Li, J.; Nie, L.; Zhang, M. Image-text retrieval: A survey on recent research and development. In Proceedings
of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022. Available online:
https://www.ijcai.org/proceedings/2022/0759.pdf (accessed on 24 March 2024).
67. Bithel, S.; Bedathur, S. Evaluating Cross-modal generative models using retrieval task. In Proceedings of the 46th International
ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023. [CrossRef]
68. Borji, A. How good are deep models in understanding the generated images? arXiv 2022, arXiv:2208.10760.
69. He, X.; Deng, L. Deep learning for image-to-text generation: A technical overview. IEEE Signal Process. Mag. 2017, 34, 109–116.
[CrossRef]
70. Żelaszczyk, M.; Mańdziuk, J. Cross-modal text and visual generation: A systematic review. Part 1—Image to text. Inf. Fusion.
2023, 93, 302–329. [CrossRef]
71. Combs, K.; Bihl, T.J.; Ganapathy, S. Integration of computer vision and semantics for characterizing unknowns. In Proceedings of
the 56th Hawaii International Conference on System Sciences, Maui, HI, USA, 3–6 January 2023. [CrossRef]
72. Lin, T.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollar, P. Microsoft
COCO: Common objects in context. In Proceedings of the 13th European Conference Proceedings, Zurich, Switzerland, 6–12
September 2014. [CrossRef]
73. Krause, J.; Johnson, J.; Krishna, R.; Li, F. A hierarchical approach for generating descriptive image paragraphs. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. Available online:
https://openaccess.thecvf.com/content_cvpr_2017/html/Krause_A_Hierarchical_Approach_CVPR_2017_paper.html (accessed
on 24 March 2024).
74. Bernardi, R.; Cakici, R.; Elliott, D.; Erdem, A.; Erdem, E.; Ikizler-Cinbis, N.; Keller, F.; Muscat, A.; Plank, B. Automatic description
generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 2016, 55, 409–442. [CrossRef]
75. Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. GIT: A generative image-to-text transformer for vision
and language. arXiv 2022, arXiv:2205.14100.
76. Li, J.; Li, D.; Xiong, C.; Hoi, S. Bootstrapping language-image pre-training for unified vision-language understanding and
generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022.
Available online: https://proceedings.mlr.press/v162/li22n.html (accessed on 24 March 2024).
77. Alayrac, J.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A
visual language model for few-shot learning. In Proceedings of the 36th Conference on Neural Information Processing Systems
(NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. Available online: https://proceedings.neurips.cc/
paper_files/paper/2022/hash/960a172bc7fbf0177ccccbb411a7d800-Abstract-Conference.html (accessed on 24 March 2024).
78. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large
language models. arXiv 2023, arXiv:2301.12597.
79. Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Toward general-purpose
vision-language model with instruction tuning. arXiv 2023, arXiv:2305.06500.
80. Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A comprehensive survey of image augmentation techniques for deep learning. Pattern
Recognit. 2023, 137, 109347. [CrossRef]
81. Zhai, G.; Min, X. Perceptual image quality assessment: A survey. Sci. China Inf. Sci. 2020, 63, 211301. [CrossRef]
82. Chandler, D.M. Seven challenges in image quality assessment: Past, present, and future research. Int. Sch. Res. Not. 2013,
2013, 905685. [CrossRef]
83. Mantiuk, R.K.; Tomaszewska, A.; Mantiuk, R. Comparison of four subjective methods for image quality assessment. Comput.
Graph. Forum. 2012, 31, 2478–2491. [CrossRef]
84. Galatolo, F.A.; Gimino, M.G.C.A.; Cogotti, E. TeTIm-Eval: A novel curated evaluation data set for comparing text-to-image
models. arXiv 2022, arXiv:2212.07839.
85. Salimans, T.; Goodfellow, I.; Wojciech, Z.C.V.; Radford, A.; Chen, X. Improved techniques for training GANs. In Proceedings of
the 30th Conference on Neural Information Processing Systems (NeurIPS 2016), Barcelona, Spain, 5–10 December 2016. Available
online: https://proceedings.neurips.cc/paper_files/paper/2016/hash/8a3363abe792db2d8761d6403605aeb7-Abstract.html
(accessed on 24 March 2024).
86. Li, C.; Zhang, Z.; Wu, H.; Sun, W.; Min, X.; Liu, X.; Zhai, G.; Lin, W. AGIQA-3K: An open database for AI-generated image quality
assessment. arXiv 2023, arXiv:2306.04717. [CrossRef]
87. Gehrmann, S.; Clark, E.; Thibault, S. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated
text. J. Artif. Intell. Res. 2023, 77, 103–166. [CrossRef]
88. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPscore: A reference-free evaluation metric for image captioning. In
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic,
7–11 November 2021. [CrossRef]
89. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A method for automatic evaluation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002. [CrossRef]
90. Lin, C. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop on Text Summarization
Branches Out Workshop, Barcelona, Spain, 25–26 July 2004. Available online: https://aclanthology.org/W04-1013 (accessed on
24 March 2024).
91. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In
Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization,
Ann Arbor, MI, USA, 29 June 2005. Available online: https://aclanthology.org/W05-0909 (accessed on 24 March 2024).
92. Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A study of translation edit rate with targeted human annotation. In
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA,
USA, 8–12 August 2006. Available online: https://aclanthology.org/2006.amta-papers.25 (accessed on 24 March 2024).
93. Snover, M.; Madnani, N.; Dorr, B.; Schwartz, R. TERp system description. In Proceedings of the ACL Workshop on Statistical
Machine Translation and MetricsMATR, Uppsala, Sweden, 15–16 July 2008.
94. Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. Available online: https://
openaccess.thecvf.com/content_cvpr_2015/html/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.html (accessed
on 24 March 2024).
95. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the
14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [CrossRef]
96. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating text generation with BERT. In Proceedings of the
International Conference on Learning Representations, Virtual, 26 April–1 May 2020. Available online: https://arxiv.org/abs/19
04.09675 (accessed on 24 March 2024).
97. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their composi-
tionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, CA, USA,
5–8 December 2013. Available online: https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-
Paper.pdf (accessed on 24 March 2024).
98. Mikolov, T.; Yih, W.; Zweig, G. Linguistic regularities in continuous space word representations. In Proceedings of the 2013
Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Atlanta, GA, USA, 9–14 June 2013. Available online: https://aclanthology.org/N13-1090.pdf (accessed on 24 March 2024).
99. Gunther, F.; Rinaldi, L.; Marelli, M. Vector-space models of semantic representation from a cognitive perspective: A discussion of
common misconceptions. Perspect. Psychol. Sci. 2019, 14, 1006–1033. [CrossRef]
100. Shahmirazadi, O.; Lugowski, A.; Younge, K. Text similarity in vector space models: A comparative study. In Proceedings of
the 18th IEEE International Conference on Machine Learning and Applications, Pasadena, CA, USA, 13–15 December 2021.
[CrossRef]
101. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [CrossRef]
102. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist.
2017, 5, 135–146. [CrossRef]
103. Wang, C.; Nulty, P.; Lillis, D. A comparative study on word embeddings in deep learning for text classification. In Proceedings of
the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Republic of Korea, 18–20
December 2020. [CrossRef]
104. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations.
In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technology,
New Orleans, LA, USA, 1–6 June 2018. Available online: https://arxiv.org/abs/1802.05365 (accessed on 24 March 2024).
105. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLnet: Generalized autoregressive pretraining for language
understanding. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC,
Canada, 8–14 December 2019. Available online: https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9
ee67cc69-Abstract.html (accessed on 24 March 2024).
106. Combs, K.; Lu, H.; Bihl, T.J. Transfer learning and analogical inference: A critical comparison of algorithms, methods, and
applications. Algorithms 2023, 16, 146. [CrossRef]
107. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding.
In Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [CrossRef]
108. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly
optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
109. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language
representations. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020.
Available online: https://arxiv.org/abs/1909.11942 (accessed on 24 March 2024).
110. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019,
arXiv:1910.01108.
111. Morrison, R.G.; Krawczyk, D.C.; Holyoak, K.J.; Hummel, J.E.; Chow, T.W.; Miller, B.L.; Knowlton, B.J. A neurocomputational
model of analogical reasoning and its breakdown in frontotemporal lobar degeneration. J. Cogn. Neurosci. 2004, 16, 260–271.
[CrossRef] [PubMed]
112. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression
of pre-trained transformers. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020),
Virtual, 6–12 December 2020. Available online: https://proceedings.neurips.cc/paper/2020/hash/3f5ee243547dee91fbd053c1c4
a845aa-Abstract.html (accessed on 24 March 2024).
113. Sternberg, R.J.; Nigro, G. Developmental patterns in the solution of verbal analogies. Child Dev. 1980, 51, 27–38. [CrossRef]
114. Combs, K.; Bihl, T.J. A preliminary look at generative AI for the creation of abstract verbal-to-visual analogies. In Proceedings
of the 57th Hawaii International Conference on System Sciences, Honolulu, HI, USA, 3–6 January 2024. Available online:
https://hdl.handle.net/10125/106520 (accessed on 24 March 2024).
115. Reviriego, P.; Merino-Gomez, E. Text to image generation: Leaving no language behind. arXiv 2022, arXiv:2208.09333.
116. O’Meara, J.; Murphy, C. Aberrant AI creations: Co-creating surrealist body horror using the DALL-E Mini text-to-image generator.
Converg. Int. J. Res. New Media Technol. 2023, 29, 1070–1096. [CrossRef]
117. Chen, X.; Fang, H.; Lin, T.; Vedantam, R.; Gupta, S.; Dollar, P.; Zitnick, C.L. Microsoft COCO captions: Data collection and
evaluation server. arXiv 2015, arXiv:1504.00325.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.