Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

Dongying Li, Binyi Su, Hua Zhang, Yong Li, Haiyong Chen

This work was supported in part by the National Natural Science Foundation of China under Grant 62473127, National Natural Science Foundation of China Joint Foundation Program U21A20482, National Key Research and Development Program of China under Grant 2024YFB3310904 and 2021YFB3100800, Central Government Guides Local Science and Technology Development Fund Project 246Z1602G, Major Science and Technology Support Program of Hebei Province 242Q4302Z, Natural Science Foundation of China under Grant 62272018 and 62072454, Beijing Natural Science Foundation under Grant 4202084, Basal Research Fund of Central Public Research Institute of China under Grant 20212701 and 246Z4306G, Shijiazhuang Science and Technology Cooperation Project under Grant SJZZXC24009, Tianjin Municipal Education Commission Scientific Research Project under Grant 2024KJ151. (Corresponding author: Haiyong Chen.)

D. Li and H. Chen are with the School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China (e-mail: [email protected], [email protected]). B. Su is with the School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China, and also with China Xiongan Group Digital City Technology Company Ltd., Hebei 070001, China (e-mail: [email protected]). H. Zhang is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China (e-mail: [email protected]). Y. Li is with the School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]).

Abstract

Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To address these issues, we propose PDIG, a Photovoltaic Defect Image Generator based on Stable Diffusion (SD). PDIG leverages the strong priors learned from large-scale datasets to enhance generation quality under limited data. Specifically, we introduce a Semantic Concept Embedding (SCE) module that incorporates text-conditioned priors to capture the relational concepts between defect types and their appearances. To further enrich the domain distribution, we design a Lightweight Industrial Style Adaptor (LISA), which injects industrial defect characteristics into the SD model through cross-disentangled attention. At inference, we propose a Text-Image Dual-Space Constraints (TIDSC) module, enforcing the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments demonstrate that PDIG achieves superior realism and diversity compared to state-of-the-art methods. Specifically, our approach improves Frechet Inception Distance (FID) by 19.16 points over the second-best method and significantly enhances the performance of downstream defect detection tasks.

Index Terms:

image generation, text-to-image, data enhancement, endogenous shifts, single-domain generalized object detection

I Introduction

Photovoltaic (PV) cells are prone to various defects during different production processes, such as microcracks, scratches, and spots, which severely affect the lifespan and power generation efficiency of the cells [1]. To ensure the quality and safety of industrial products, comprehensive detection of defects in PV electroluminescence (EL) images is imperative.

In recent years, deep learning-based object detection has become a cornerstone of computer vision. Su et al. [2, 3] and Zhao et al. [4] designed CNN-based multi-scale attention mechanisms to refine multi-scale features, thus improving classification and detection performance.

Figure 1: The published ELES dataset exhibits issues of domain shift and instance shift. (a) Domain shift: variations exist in the resolution, background brightness, and grid line spacing of defect images across different production lines; Instance shift: the distribution of instances in the training dataset is inconsistent with that in the test data. For example, cracks in the training set appear as fine, linear defects, whereas those in the testing set are characterized by large-area damage with inconsistent shapes. (b) t-SNE dimensionality reduction and probability distribution visualization on the data from different production lines reveal data distribution deviations caused by endogenous shifts.

However, these methods only consider single-production-line scenarios, ignoring the complex background styles and instance shift across multiple production lines. Fig. 1 (a) shows the domain shift and instance shift present in datasets from different production lines. As shown in Fig. 1 (b), the t-SNE and probability density plots confirm an obvious shift between the datasets. To address this, the Shift Suppression Network [5] tackles endogenous shift in PV defect detection by learning enhanced feature representations through style alignment and cross-layer interaction. However, in actual production, the shift becomes more pronounced as the diversity of defects in the images increases [6], and neural networks become less effective because of the limitations of the training data. Thus, employing image generation techniques to increase the quantity and diversity of defect data is a key strategy for mitigating the shift problem.

Traditional data augmentation based on image processing [7] is limited to the feature space of the known data. Generative adversarial networks (GANs) [8] and their variants can instead generate new high-quality images. Defect-GAN [9] adopts a novel compositional layer-based architecture to simulate the random variations of defects, synthesizing various types of defects with excellent diversity and fidelity while also improving the performance of the defect detection network. Inspired by the principle of image-to-image translation, TransP2P [10] combines the global feature perception of the transformer with the local detail extraction of U-Net to convert defect-free images into defect images. However, training these generative models from scratch is challenging when image samples are insufficient, which often leads to distorted generated samples that lack authenticity.

Recently, diffusion-based image generation [11, 12, 13] has shown impressive capabilities, as diffusion models can capture a vast range of image distributions and generate samples with high fidelity and diversity. Among them, the Stable Diffusion (SD) model [14] offers a more stable, controllable, and efficient approach to generating high-quality images, which has led to significant advances in the quality, speed, and cost of image generation. However, although the SD model has learned rich and abstract textual concepts, users are limited to concepts that the network has already been trained on. As shown in Fig. 2, when the prompt “PV defect images based on EL imaging” is input, there is a significant gap between the image generated by the SD model and the target image. This is because the SD model does not understand the concept of PV EL imaging and thus cannot generate the defect morphology of actual PV quality inspection images. Meanwhile, retraining the SD model is very difficult given the limitations of data and computing power.

Several adapter-based methods have been proposed to enhance the controllability of text-to-image (T2I) generation. These approaches typically freeze the base Stable Diffusion (SD) model and introduce lightweight trainable parameters to inject external control signals [15]. ControlNet [16] supports auxiliary conditions such as edge maps, segmentation maps, and depth. IP-Adapter [17] facilitates image enhancement and restoration tasks, while T2I-Adapter [18] enables fine-grained control over color and structure. However, most T2I methods struggle with spatial alignment and complex prompt understanding in industrial scenarios, such as defect image generation. Recent studies have highlighted the importance of structured conditioning: IMAGGarment [19] achieves fine-grained garment control through a diffusion-based framework, and RCDM [20] models long-term consistency for talking face generation using motion priors. Motivated by these advances, we propose a targeted conditional framework to better handle spatial and descriptive control in industrial settings.

In this paper, we propose a Photovoltaic Defect Image Generator (PDIG) that exploits the strong priors the SD model has learned from large-scale datasets to enhance the authenticity of generation under few-shot training data. First, PDIG uses 3-5 industrial defect types and images to enrich the text-embedded priors, capturing the specific relational concepts between defect types and their appearances. Second, LISA is developed to embed PV EL defect features into the pre-trained text-to-image diffusion model, enriching the domain distribution of the generated images. Finally, at inference, TIDSC is proposed to ensure the positional and spatial consistency of the generated defect images. Extensive experiments show that our model significantly outperforms state-of-the-art methods in terms of authenticity and diversity of the generated images, and effectively improves the performance of downstream defect detection tasks.

The main contributions are summarized as follows:

$\bullet$

To mitigate domain shift in industrial defect detection, we propose the PDIG, a diffusion-based method that augments diverse and realistic defect samples, effectively bridging cross-domain distribution gaps and enhancing model generalization.
$\bullet$

We propose a novel LISA module that embeds the image features of industrial defects into the SD model, thereby enabling the model to perform multimodal learning of defect features. Meanwhile, to reduce the cost of manual annotation, we propose a TIDSC module to achieve spatial localization generation, providing annotation priors for subsequent defect detection applications.
$\bullet$

The experimental results show that, compared with previous defect generation methods, our method significantly improves the diversity and realism of generated defect images. In addition, it improves the domain-adaptive performance of defect detection.

II Related Work

II-A Defect detection method based on image generation

In recent years, GANs and their variants have been widely applied in the field of defect image generation [21, 22, 23, 24]. To address the class imbalance issue, AdaBalGAN [25] combines the CGAN [26] with an adaptive generation controller to produce high-fidelity images of specified categories, significantly enhancing the accuracy and stability of defect recognition. Faced with the challenge of insufficient sample diversity, reference [27] employed CycleGAN to augment data, thereby enhancing the diversity of the training dataset. Furthermore, SDGAN [28] introduces a D2 adversarial loss to increase diversity. To further enhance the quality of image generation, DG2GAN [29] incorporates a cycle-consistent loss, a DJS-optimized discriminator loss, and a DG2 adversarial loss, optimizing the image feature distribution to generate high-quality and diverse defect images. However, these methods inherently tend to merely imitate existing content, leading to mode collapse. In comparison, we employ the SD model as our baseline model. It offers greater creative freedom and better balances image quality with computational cost, thus avoiding mode collapse.

II-B Denoising Diffusion Probabilistic Model

Since their introduction to image generation in 2015, diffusion models [30] have emerged as a prominent research direction due to their robust generative capabilities and broad application potential. These models perturb data distributions through a forward diffusion process and subsequently learn an inverse diffusion process to recover the original data distribution, resulting in a highly flexible and computationally efficient generative framework. A notable method in this area is the Denoising Diffusion Probabilistic Model (DDPM) [31], which leverages iterative gradient updates combined with Gaussian noise to generate high-fidelity samples, approximating the underlying data manifold through annealed Langevin dynamics. To further improve the sampling efficiency of DDPM, DDIM [32] introduces a non-Markovian diffusion process and deterministic sampling strategy, significantly reducing inference time and computation steps. To enhance controllability during generation, ILVR [33] guides DDPM to produce high-quality images conditioned on reference signals, although it still operates purely in pixel space, resulting in high computational overhead and limited support for multimodal conditioning. Inspired by these advances, we propose the LISA module, which learns image features by training only a small number of parameters. Our method achieves a thorough integration of image and text modalities, enabling rich multimodal inputs while maintaining high computational efficiency.

II-C Latent Diffusion Model

Latent Diffusion Models (LDMs) were introduced by Rombach et al. [14], proposing to learn the feature distribution through a diffusion process in the latent space rather than the pixel space. Compared to conventional pixel-space diffusion models, LDMs capture higher-level features and semantic information more efficiently. Notably, LDMs incorporate a conditional encoder that injects control conditions (e.g., text, image, or video) via cross-attention mechanisms to guide the generation process. Recently, LDMs have served as the theoretical backbone of Stable Diffusion (SD) and emerging video generation models such as Sora. In various industrial applications, LDMs have demonstrated superior performance over GANs [34], particularly in defect image generation tasks. To enable the generation of defect images with specific types and locations, AnomalyDiffusion [35] integrates a spatial anomaly embedding module and an adaptive attention weighting mechanism, achieving precise alignment between generated anomalies and ground-truth annotations, thereby significantly enhancing downstream localization performance. Beyond standard LDMs, fine-tuning techniques such as Textual Inversion [36], Dreambooth [37], LoRA [38], and InstantID [39] have been developed to further improve the controllability of SD-based models. However, these approaches typically struggle to control the generation location accurately. To address this limitation, our model introduces a Text-Image Dual-Space Constraints (TIDSC) module, which enforces consistency in the generated location and spatial structure, enabling the acquisition of diverse and realistic defective image–annotation pairs. Furthermore, recent advances in conditional diffusion models, such as progressive conditional diffusion [40] and rich-contextual conditioning frameworks [41], have demonstrated the effectiveness of carefully designed conditioning mechanisms, motivating the design of our TIDSC-enhanced generation pipeline.

III Proposed Method

In this section, we will detail the proposed PDIG. As shown in Fig. 3, PDIG employs a two-stage generation approach. During the training stage, the SCE module and the LISA module learn image and text features, which are then fully integrated into the backbone network to enhance the image generation domain distribution. In the inference stage, the TIDSC is utilized to focus more attention on the specified bounding box regions, thereby enabling precise defect localization generation.

III-A Preliminaries

Stable Diffusion model is developed based on the LDM and is trained on the larger LAION-5B dataset. It also employs a more powerful CLIP model to replace the original cross-attention modulation mechanism. Similar to LDM, the SD model consists of three core components: a VAE [42], a U-Net [43], and a CLIP Text Encoder. The VAE encoder $\varepsilon$ encodes the image $x$ into the latent space to obtain $z_{0}$ . In the latent space, Gaussian noise is added to $z_{0}$ to obtain $z_{T}$ . The U-Net then predicts the noise residual for denoising to obtain $z$ , which is subsequently decoded back to the pixel space by the VAE decoder $\cal D$ to generate the reconstructed image. The CLIP Text Encoder encodes the input text prompt into a text embedding, which is fed into the cross-attention layers of the U-Net as a conditional control. The training objective of the model is to minimize the following loss:

{\cal L}_{SD}={\mathbb{E}}_{z\sim\varepsilon(x),\epsilon\sim{\cal N}(0,1),y,t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t,\tau_{\theta}(y)\right)\right\|_{2}^{2}\right],

(1)

where $t$ is the time step, $y$ is the conditional input, and ${\tau_{\theta}}\left(y\right)$ is the model that maps the conditional input $y$ to the conditional vector.
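To make the objective in Eq. (1) concrete, the following minimal sketch computes the squared noise-prediction error on a toy latent vector. It assumes the standard DDPM forward process $z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon$, which the text above does not spell out; `add_noise`, `sd_loss`, and `oracle_eps_theta` are hypothetical names, and the "oracle" merely stands in for a trained U-Net $\epsilon_{\theta}$.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Assumed DDPM forward step: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def sd_loss(eps, eps_pred):
    """Squared L2 norm between the true and the predicted noise, as in Eq. (1)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def oracle_eps_theta(z_t, z0, alpha_bar_t):
    """Hypothetical perfect denoiser: inverts add_noise using the known z0."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * z) / b for zt, z in zip(z_t, z0)]


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]   # toy latent
eps = [rng.gauss(0.0, 1.0) for _ in range(16)]  # sampled Gaussian noise
z_t = add_noise(z0, eps, alpha_bar_t=0.5)

loss_oracle = sd_loss(eps, oracle_eps_theta(z_t, z0, 0.5))  # close to zero
loss_zero = sd_loss(eps, [0.0] * 16)                        # predicts no noise at all
```

A predictor that recovers the injected noise exactly drives the loss to (numerically) zero, whereas a predictor that always outputs zero pays the full energy of the noise.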

Semantic Concept Embedding embeds industry-specific concepts into the vocabulary of the CLIP Text Encoder [36]. This enables the model to learn text embeddings of industrial concepts together with their corresponding visual features, and thus to capture the distinctive styles and semantics of industrial images efficiently. We define the text descriptions as “A photo of ${S^{*}}$ ” and “A photo of ${S^{\wedge}}$ ”, where ${S^{*}}$ and ${S^{\wedge}}$ represent PV EL images and images of the different defect types, respectively. To find the optimal embedding vector $v$ , our optimization target is:

v^{*}=\mathop{\arg\min}\limits_{v}{\mathbb{E}}_{z\sim\varepsilon(x),\epsilon\sim{\cal N}(0,1),y,t}\left[\left\|\epsilon-\epsilon_{\theta}\left(z_{t},t,\tau_{\theta}(y)\right)\right\|_{2}^{2}\right].

(2)

${v^{*}}$ is the token embedding vector obtained by minimizing the LDM loss.

III-B Lightweight Industrial Style Adaptor (LISA)

Because they lack fine-grained learning of image features, existing methods find it difficult to generate industrial defect images with diverse background styles and defect morphologies from text descriptions alone. To tackle this issue, we develop the Lightweight Industrial Style Adaptor (LISA) module. Using image prompts as input, this module adapts the pre-trained SD model to industrial defect image generation by training only a limited number of network layers. As shown in Fig. 3 (a), PV EL defects such as scratches, broken gates, and black spots are typically small. Therefore, when processing a defective image $x$ , we first locate the defect regions based on the bounding box information $\mathcal{B}=\{b_{i}\}$ provided by defect annotations. Here, $b_{i}$ represents the coordinates of the top-left and bottom-right corners of the defect box, i.e., $\{(x_{i1},y_{i1}),(x_{i2},y_{i2})\}$ .

Subsequently, we crop and align the image around each box to obtain image feature patches, which enables the model to better learn the features of the defect regions. To extract defect image features, we employ a pre-trained CLIP image encoder [17]. After obtaining the global image embedding, we project it through a trainable projection network into a sequence of features with length $N$ , resulting in image tokens $h^{z}$ . Finally, as shown in Fig. 4, the image features are integrated in parallel with the text features into the denoising U-Net of the SD model via a cross-disentangled attention adaptation module. The attention mechanism can be expressed as follows:

\bar{h}^{z},\bar{h}^{p}={\rm Attn}([{\bf Q}^{z},{\bf Q}^{p}],[{\bf K}^{z},{\bf K}^{p}],[{\bf V}^{z},{\bf V}^{p}]),

(3)

Here, $\bar{h}^{z}$ and $\bar{h}^{p}$ represent the outputs of the image and text tokens after cross-attention, respectively; [·,·] denotes concatenation along the token dimension; ${{\bf{Q}}^{z}}={{\bf{Q}}^{p}}={h^{p}}{\bf{W}}_{q}^{p}$ , ${{\bf{K}}^{z}}={h^{z}}{\bf{W}}_{k}^{z}$ , ${{\bf{V}}^{z}}={h^{z}}{\bf{W}}_{v}^{z}$ ; ${h^{p}}={\tau_{\theta}}\left({{y_{t}}}\right)$ ; and ${\bf{W}}_{q}^{p}$ , ${\bf{W}}_{k}^{z}$ and ${\bf{W}}_{v}^{z}$ are the weight matrices of the linear projection layers for the query of text tokens, and the key and value of image tokens, respectively. Based on this attention mechanism, the final cross-attention between image and text is:

\begin{aligned}
\bar{H}&=\alpha\cdot\bar{h}^{z}+\beta\cdot\bar{h}^{p}\\
&=\alpha\cdot{\rm Attn}({\bf Q}^{p},{\bf K}^{z},{\bf V}^{z})+\beta\cdot{\rm Attn}({\bf Q}^{p},{\bf K}^{p},{\bf V}^{p})\\
&=\alpha\cdot{\rm Softmax}\left(\frac{{\bf Q}^{p}({\bf K}^{z})^{\top}}{\sqrt{d}}\right){\bf V}^{z}+\beta\cdot{\rm Softmax}\left(\frac{{\bf Q}^{p}({\bf K}^{p})^{\top}}{\sqrt{d}}\right){\bf V}^{p}.
\end{aligned}

(4)

Here, $\alpha$ and $\beta$ are weighting factors that adjust the relative contributions of the image-prompt branch ( $\alpha$ ) and the text-prompt branch ( $\beta$ ).
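The weighted combination in Eq. (4) can be illustrated with a small, dependency-free sketch. It is only a toy instance under stated assumptions: matrices are plain lists of lists, a single attention head is used, and the function names (`attn`, `decoupled_cross_attention`) and the toy keys and values are hypothetical rather than taken from the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attn(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def decoupled_cross_attention(Q, K_img, V_img, K_txt, V_txt, alpha, beta):
    """Eq. (4): alpha * Attn(Q, K^z, V^z) + beta * Attn(Q, K^p, V^p)."""
    h_img = attn(Q, K_img, V_img)  # image branch, weighted by alpha
    h_txt = attn(Q, K_txt, V_txt)  # text branch, weighted by beta
    return [[alpha * a + beta * b for a, b in zip(ri, rt)]
            for ri, rt in zip(h_img, h_txt)]


# Toy tensors: one query, two image tokens, two text tokens.
Q = [[1.0, 0.0]]
K_img = [[1.0, 0.0], [0.0, 1.0]]
V_img = [[1.0, 1.0], [1.0, 1.0]]  # image branch always outputs [1, 1]
K_txt = [[1.0, 0.0], [0.0, 1.0]]
V_txt = [[2.0, 2.0], [2.0, 2.0]]  # text branch always outputs [2, 2]

mixed = decoupled_cross_attention(Q, K_img, V_img, K_txt, V_txt, 0.5, 0.5)
image_only = decoupled_cross_attention(Q, K_img, V_img, K_txt, V_txt, 1.0, 0.0)
```

Because both branches share the same query, changing $\alpha$ and $\beta$ only rescales the two attention outputs before they are summed; no extra attention pass is needed.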

III-C Text-Image Dual-Space Constraints (TIDSC)

Large-scale data generation requires a corresponding volume of annotations, and manual annotation is no longer efficient enough to meet this need. Thus, to realize defect location generation and semi-automatic annotation, we propose the TIDSC module. In the inference stage, the user provides the caption prompt, the image prompt, and the locations of the defects to be generated, ${\cal B}^{\prime}=\left\{{{b_{j}}}\right\}$ . From the LISA module, we obtain a set of caption tokens and image tokens ${\cal H}=\left\{{h_{j}^{p},h_{j}^{z}}\right\}$ as well as cross-attention maps ${\cal\bar{H}}^{t}=\{\bar{H}_{j}^{t}\}$ . The given ${\cal B}^{\prime}$ further yields a set of binary spatial masks ${\cal M}=\{{M_{j}}\}$ . Our goal is to generate the target objects inside the mask areas whenever possible [44]. To achieve this, we impose text-image dual-space constraints on the target cross-attention maps, including internal and external mask constraints as well as boundary alignment smoothing constraints. These constraints gradually update the latent $z_{t}$ so that the position and scale of the synthesized object agree with the mask region.

For internal constraints, our goal is to constrain responses with high attention values within the masked region, defined as:

{\cal L}_{H_{j}}^{1}=1-\frac{1}{P}\sum{\bf topk}\left(\bar{H}_{j}^{t}\cdot{\bf M}_{j},P\right),

(5)

{{\cal L}_{In}}=\sum\limits_{{H_{j}}\in{\cal H}}{{\cal L}_{{H_{j}}}^{1}},

(6)

where ${\bf topk}(\cdot,P)$ means that the $P$ elements with the highest response are selected.

Internal constraints guarantee that the target is generated inside the masked area but place no restriction on what is generated outside it. For the external constraints, we therefore require the attention response outside the mask to be as low as possible:

{\cal L}_{H_{j}}^{2}=\frac{1}{P}\sum{\bf topk}\left(\bar{H}_{j}^{t}\cdot(1-{\bf M}_{j}),P\right),

(7)

{\cal L}_{Out}=\sum\limits_{H_{j}\in{\cal H}}{\cal L}_{H_{j}}^{2}.

(8)
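The internal and external constraints of Eqs. (5) and (7) can be sketched for a single cross-attention map as follows. This is a minimal illustration, assuming the map and the binary mask are same-sized nested lists with attention values in [0, 1]; the helper names `topk_mean`, `inner_loss`, and `outer_loss` and the toy 4×4 maps are hypothetical.

```python
def topk_mean(values, p):
    """Average of the p largest values (the (1/P) * sum topk(., P) term)."""
    return sum(sorted(values, reverse=True)[:p]) / p


def inner_loss(attn_map, mask, p):
    """Eq. (5): 1 - mean of the top-P responses inside the mask."""
    inside = [a * m for ra, rm in zip(attn_map, mask) for a, m in zip(ra, rm)]
    return 1.0 - topk_mean(inside, p)


def outer_loss(attn_map, mask, p):
    """Eq. (7): mean of the top-P responses outside the mask."""
    outside = [a * (1 - m) for ra, rm in zip(attn_map, mask) for a, m in zip(ra, rm)]
    return topk_mean(outside, p)


# Toy 4x4 example: the target box is the top-left 2x2 block.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
# Attention concentrated inside the box versus outside it.
well_placed = [[0.9 if m else 0.1 for m in row] for row in mask]
misplaced = [[0.1 if m else 0.9 for m in row] for row in mask]

good_total = inner_loss(well_placed, mask, 2) + outer_loss(well_placed, mask, 2)
bad_total = inner_loss(misplaced, mask, 2) + outer_loss(misplaced, mask, 2)
```

The well-placed map yields a small combined loss (about 0.2 here), while the misplaced map is penalized by both terms (about 1.8), which is the signal used to steer the latent toward the requested box.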

To obtain precise object or context boundary pixels that restrict the scale, while ensuring that the boundaries of objects in the generated images are smooth and natural, we propose the Boundary Alignment Smoothing Constraint (BASC). First, to maintain the position and scale consistency of objects within the specified bounding boxes, we project each target mask $M_{j}$ and the cross-attention map onto the x-axis and y-axis via the max operation. For the x-axis:

{\bf m}_{x}(k)=\max_{s=1,\cdots,S}\{{\bf M}_{j}(s,k)\},

(9)

{\bf a}_{x}^{t}(k)=\max_{s=1,\cdots,S}\{\bar{H}_{j}^{t}(s,k)\},

(10)

where ${{\bf{m}}_{x}},{\bf{a}}_{x}^{t}\in{{\mathbb{R}}^{K}}$ . Our goal is to optimize ${\bf{a}}_{x}^{t}$ so that it is close to ${{\bf{m}}_{x}}$ :

{\cal L}_{H_{j}}^{3}=\frac{1}{L}\sum{\bf tokp}\left(\{|{\bf m}_{x}(k)-{\bf a}_{x}^{t}(k)|\}_{k=1}^{K},L,x_{1}^{j},x_{2}^{j}\right),

(11)

Here, ${\rm{tokp}}(\cdot,L,x_{1}^{j},x_{2}^{j})$ denotes the uniform sampling of $L$ error terms from $\{|{{\bf{m}}_{x}}(k)-{\bf{a}}_{x}^{t}(k)|\}_{k=1}^{K}$ , with these error terms surrounding the given alignment coordinates $x_{1}^{j},x_{2}^{j}$ on the x-axis.
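The x-axis projection and alignment error of Eqs. (9)–(11) can be sketched as below. The exact sampling rule around $x_{1}^{j}$ and $x_{2}^{j}$ is not specified above, so this sketch assumes that $L/2$ consecutive columns are taken around each box edge; that choice, the `window` helper, and the toy maps are all assumptions made for illustration.

```python
def project_x(mat):
    """Column-wise max over rows: m_x(k) = max_s mat[s][k] (Eqs. (9) and (10))."""
    return [max(col) for col in zip(*mat)]


def window(center, count, size):
    """`count` consecutive indices around `center`, clipped to [0, size)."""
    start = min(max(center - count // 2, 0), size - count)
    return list(range(start, start + count))


def boundary_alignment_x(attn_map, mask, L, x1, x2):
    """Eq. (11) under the assumed sampling: mean |m_x - a_x| near both box edges."""
    m_x = project_x(mask)
    a_x = project_x(attn_map)
    err = [abs(m - a) for m, a in zip(m_x, a_x)]
    idx = window(x1, L // 2, len(err)) + window(x2, L // 2, len(err))
    return sum(err[k] for k in idx) / L


# Toy 2x8 example: the box spans columns 2..5.
mask = [[0, 0, 1, 1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1, 1, 0, 0]]
aligned = [[float(v) for v in row] for row in mask]
shifted = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0],   # attention shifted right by one column
           [0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.0]]

loss_aligned = boundary_alignment_x(aligned, mask, L=4, x1=2, x2=5)
loss_shifted = boundary_alignment_x(shifted, mask, L=4, x1=2, x2=5)
```

When the projected attention matches the projected mask the loss vanishes, and a one-column shift produces a non-zero error at the left edge of the box.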

Apply the same operation to the y-axis:

{\bf m}_{y}(s)=\max_{k=1,\cdots,K}\{{\bf M}_{j}(s,k)\},

(12)

{\bf a}_{y}^{t}(s)=\max_{k=1,\cdots,K}\{\bar{H}_{j}^{t}(s,k)\},

(13)

{\cal L}_{H_{j}}^{4}=\frac{1}{L}\sum{\bf tokp}\left(\{|{\bf m}_{y}(s)-{\bf a}_{y}^{t}(s)|\}_{s=1}^{S},L,y_{1}^{j},y_{2}^{j}\right).

(14)

To ensure that the boundaries of objects in the generated images are smoother and more natural, we achieve the smoothing effect by minimizing the gradient variations and second-order derivatives of the cross-attention maps:

{\cal L}_{{\rm sm}_{j}}^{(1)}=\frac{1}{S\times K}\sum\limits_{(x,y)\in M_{j}}\left(\left|\frac{\partial\bar{H}_{j}^{t}(x,y)}{\partial x}\right|+\left|\frac{\partial\bar{H}_{j}^{t}(x,y)}{\partial y}\right|\right),

(15)

{\cal L}_{{\rm sm}_{j}}^{(2)}=\frac{1}{S\times K}\sum\limits_{(x,y)\in M_{j}}\left(\left|\frac{\partial^{2}\bar{H}_{j}^{t}(x,y)}{\partial x^{2}}\right|+\left|\frac{\partial^{2}\bar{H}_{j}^{t}(x,y)}{\partial y^{2}}\right|\right),

(16)

The boundary alignment smoothing constraint is:

{\cal L}_{BASC}=\sum\limits_{H_{j}\in{\cal H}}\left({\cal L}_{H_{j}}^{3}+{\cal L}_{H_{j}}^{4}\right)+\lambda_{1}\sum{\cal L}_{{\rm sm}_{j}}^{(1)}+\lambda_{2}\sum{\cal L}_{{\rm sm}_{j}}^{(2)},

(17)

where $\lambda_{1}$ and $\lambda_{2}$ are the smoothing weights for the first-order and second-order terms, respectively.
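The first- and second-order smoothing terms of Eqs. (15) and (16) can be approximated on a discrete attention map with finite differences. The sketch below is one possible discretization, assuming forward differences for the first-order term and central differences for the second-order term, with terms that would reach beyond the map border simply skipped; these choices and the name `smoothness_losses` are assumptions, since the discretization is not specified above.

```python
def smoothness_losses(attn_map, mask):
    """Finite-difference versions of Eqs. (15) and (16) inside the mask.

    Returns (first_order, second_order), each normalised by S * K.
    """
    S, K = len(attn_map), len(attn_map[0])
    first = 0.0
    second = 0.0
    for s in range(S):
        for k in range(K):
            if not mask[s][k]:
                continue  # only pixels inside M_j contribute
            h = attn_map[s][k]
            # First order: forward differences along x (columns) and y (rows).
            if k + 1 < K:
                first += abs(attn_map[s][k + 1] - h)
            if s + 1 < S:
                first += abs(attn_map[s + 1][k] - h)
            # Second order: central differences where both neighbours exist.
            if 0 < k < K - 1:
                second += abs(attn_map[s][k + 1] - 2 * h + attn_map[s][k - 1])
            if 0 < s < S - 1:
                second += abs(attn_map[s + 1][k] - 2 * h + attn_map[s - 1][k])
    n = S * K
    return first / n, second / n


full_mask = [[1, 1, 1] for _ in range(3)]
flat = [[0.5, 0.5, 0.5] for _ in range(3)]    # constant map: perfectly smooth
ramp = [[0.0, 0.25, 0.5] for _ in range(3)]   # linear ramp along x
spike = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]                     # isolated peak

flat_losses = smoothness_losses(flat, full_mask)
ramp_losses = smoothness_losses(ramp, full_mask)
spike_losses = smoothness_losses(spike, full_mask)
```

A linear ramp is penalized only by the first-order term, while an isolated spike is penalized by both, which shows why the two terms are weighted separately by $\lambda_{1}$ and $\lambda_{2}$.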

Therefore, the total constraint is as follows:

{\cal L}={{\cal L}_{In}}+{{\cal L}_{Out}}+{{\cal L}_{BASC}}.

(18)

Through the TIDSC module, we guide the model to allocate more attention to the specified bounding box area, so as to generate defects at the given locations and output image annotation files that provide annotation information for downstream applications.

IV Experiment and Analysis

To demonstrate the superiority of the proposed PDIG, we compare it with multiple state-of-the-art image generation approaches on the ELES dataset.

TABLE I: The initial dataset allocations for training and testing

Datasets	EL group1		EL group2		EL group3
Datasets	defective image	defect-free image	defective image	defect-free image	defective image	defect-free image
Training	6652	$\backslash$	$\backslash$	$\backslash$	$\backslash$	$\backslash$
Testing	1110	1410	3209	2541	839	562
Total	9172		5750		1401
Acquisition time	September 2020		May 2021		October 2021
Resolution	$384\times 384$		$640\times 589$		$700\times 668$

IV-A Datasets

ELES Dataset [5] is the first PV module EL endogenous shift dataset, created from three groups of EL images collected at different times, with 16,323 images in total. For a fair comparison, the training and testing sets in this study follow the split in [5]. As shown in Table I, we used only EL group1 to train the PDIG model. Fig. 5 depicts the distribution of defect categories in the ELES dataset. In addition, to provide multimodal input to the model, we enrich the dataset with a text description for each image according to its defect types, defined as “A photo of {} with {} and {}”.

IV-B Evaluation Metrics

To assess the quality of the generated images, we use Fréchet Inception Distance (FID) [45] and Inception Score (IS) [46] as the main evaluation metrics. The FID is defined as:

FID\left(x,\tilde{x}\right)=\left\|\mu_{x}-\mu_{\tilde{x}}\right\|_{2}^{2}+Tr\left(\Sigma_{x}+\Sigma_{\tilde{x}}-2\left(\Sigma_{x}\Sigma_{\tilde{x}}\right)^{0.5}\right),

(19)

where $Tr$ denotes the trace of a matrix, i.e., the sum of its diagonal elements; $x$ and $\tilde{x}$ denote the real and generated images; and $\mu$ and $\Sigma$ denote the mean and the covariance matrix of the corresponding image features. A lower FID means that the generated images perform better in terms of fidelity and diversity.
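As a worked illustration of Eq. (19), the sketch below evaluates the Fréchet distance under a simplifying assumption: the covariance matrices are treated as diagonal, so the matrix square root reduces to an element-wise square root. A full FID evaluation instead uses Inception features with full covariance matrices; the function names and the two-dimensional toy features are hypothetical.

```python
import math


def mean_and_var(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    n, d = len(features), len(features[0])
    mu = [sum(f[i] for f in features) / n for i in range(d)]
    var = [sum((f[i] - mu[i]) ** 2 for f in features) / n for i in range(d)]
    return mu, var


def fid_diagonal(real, fake):
    """Eq. (19) with diagonal covariances (a simplification, not the full FID)."""
    mu_r, var_r = mean_and_var(real)
    mu_g, var_g = mean_and_var(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Tr(S_r + S_g - 2 (S_r S_g)^0.5) for diagonal S_r, S_g.
    trace_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                     for vr, vg in zip(var_r, var_g))
    return mean_term + trace_term


real_feats = [[0.0, 0.0], [2.0, 2.0]]
same_feats = [[0.0, 0.0], [2.0, 2.0]]
shifted_feats = [[3.0, 0.0], [5.0, 2.0]]  # same spread, mean moved by 3 in dim 0

fid_same = fid_diagonal(real_feats, same_feats)
fid_shifted = fid_diagonal(real_feats, shifted_feats)
```

Identical feature sets give a distance of zero, and shifting the mean by 3 along one dimension while keeping the variance fixed contributes $3^{2}=9$ through the mean term alone.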

IS evaluates the model by calculating the KL divergence between the predicted class distribution of each generated image and the average class distribution over all generated images. It is defined as:

IS=\exp\left({{{\mathbb{E}}_{x}}\left[{{\rm{KL}}(p(y|x)||p(y))}\right]}\right).

(20)

Where ${p(y|x)}$ is the distribution of category prediction probabilities for a given image $x$ , ${p(y)}$ is the average distribution of category prediction probabilities for all generated images, and higher IS values indicate that the generated images are of high quality and diversity.
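Eq. (20) can be evaluated directly from per-image class probabilities, as in the minimal sketch below. It assumes that the rows of `probs` are softmax outputs of a classifier (the Inception network in the standard metric); the function name and the toy two-class predictions are illustrative only.

```python
import math


def inception_score(probs):
    """Eq. (20): exp of the mean KL(p(y|x) || p(y)) over all generated images."""
    n, c = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(c)]  # p(y)
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0)
    return math.exp(kl_sum / n)


# Every image gets the same uniform prediction: no confidence, lowest score.
is_uninformative = inception_score([[0.5, 0.5], [0.5, 0.5]])
# Confident predictions spread evenly over both classes: highest score for 2 classes.
is_confident_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
```

The score ranges from 1 (all images receive the same prediction) up to the number of classes (confident predictions spread evenly over the classes), so a high IS alone does not guarantee that the images resemble the real data.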

IV-C Implementation Details

We trained the network using Python 3.10 on an NVIDIA RTX 3090 GPU. For the pre-trained model, we selected the lightweight and efficient SDv1.5 version and used the high-accuracy, multi-modal CLIP ViT-H/14 - LAION-2B model as the image encoder. During the training stage, we trained the proposed model using the MSE loss, with AdamW as the optimizer. The initial learning rate was set to 0.0001, the weight decay to 0.01, and the generated image resolution to 512×512. The batch size was set to 8, and training was conducted for over 1000 epochs. In the inference stage, we employed the DDIM sampler with the initial classifier-free guidance scale set to 7.5. When using only image prompts, we set $\alpha=0$ and $\beta=1.0$ . Additionally, we set the number of inference steps to 60 and applied the TIDSC mechanism only during the first 30 steps.
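The inference schedule described above (60 sampling steps, with TIDSC applied only during the first 30) can be sketched as a simple gated loop. The gradient-descent update on the latent, the step size, and the callables `denoise_step` and `constraint_grad` are assumptions made for illustration; the toy stand-ins below replace the DDIM sampler and the gradient of the total constraint in Eq. (18).

```python
def guided_sampling(z, denoise_step, constraint_grad,
                    num_steps=60, guided_steps=30, step_size=0.1):
    """Run num_steps denoising steps; update z with the constraint gradient
    only during the first guided_steps steps."""
    for i in range(num_steps):
        if i < guided_steps:
            grad = constraint_grad(z)
            z = [zi - step_size * gi for zi, gi in zip(z, grad)]
        z = denoise_step(z, i)
    return z


calls = {"grad": 0, "denoise": 0}


def toy_constraint_grad(z):
    """Stand-in for the gradient of Eq. (18): gradient of sum(z_i ** 2)."""
    calls["grad"] += 1
    return [2.0 * zi for zi in z]


def toy_denoise_step(z, i):
    """Stand-in for one sampler step (identity placeholder)."""
    calls["denoise"] += 1
    return z


z_final = guided_sampling([1.0, -1.0], toy_denoise_step, toy_constraint_grad)
```

Restricting the constraint to the early steps reflects the schedule stated above: the layout is steered while the latent is still noisy, and the remaining steps refine the image without further guidance.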

TABLE II: Comparison of different image generation methods for various defect types. Bold results represent the best generation performance. Underlined results represent the second-best generation performance.

Defect types	Metric	DCGAN [47]	StyleGAN3 [48]	Textual Inversion [36]	IP-Adapter [17]	T2I-Adapter [18]	Ours
broken gate	FID $\downarrow$	151.97	39.6	175.90	58.56	73.41	22.37
broken gate	IS $\uparrow$	1.29	1.29	1.62	1.51	1.56	1.65
unjoined weld	FID $\downarrow$	140.70	35.5	171.23	56.62	76.37	19.03
unjoined weld	IS $\uparrow$	1.26	1.36	1.99	1.38	1.41	1.48
black spot	FID $\downarrow$	150.03	51.32	231.30	32.08	104.15	20.56
black spot	IS $\uparrow$	1.26	1.34	2.68	1.38	1.52	1.54
scratch	FID $\downarrow$	149.21	34.9	219.32	39.34	96.14	23.20
scratch	IS $\uparrow$	1.24	1.26	2.70	1.42	1.41	1.50
crack	FID $\downarrow$	152.38	39.00	187.42	59.07	88.85	19.34
crack	IS $\uparrow$	1.26	1.28	2.40	1.38	1.56	1.46
Average	FID $\downarrow$	148.86	40.06	197.03	49.13	87.78	20.90
Average	IS $\uparrow$	1.26	1.32	2.28	1.41	1.49	1.53

IV-D Comparison With Recent Works

In this section, we selected several state-of-the-art image-generation algorithms currently used in the industrial field, including the GAN series, such as DCGAN [47] and StyleGAN3 [48], as well as text-to-image methods like Textual Inversion [36], IP-adapter [17], and T2I-adapter [18]. We evaluated the effectiveness of our method from both qualitative and quantitative perspectives.

We trained our model on the EL group1 training set shown in Table I. During the inference stage, we generated 200 defect images for each defect type to calculate the FID and IS metrics. The quantitative results are presented in Table II. For every defect type, PDIG achieves a lower FID than all compared methods. This result supports the effectiveness of the introduced SCE module and the proposed LISA module, which learn the industrial defect features embedded in images and text, including background styles and fine-grained features of different defect types, and generate high-quality and diverse defect data. The average FID of our method is 19.16 lower than that of the second-best method, StyleGAN3 (20.90 vs. 40.06). In terms of the average IS, our method ranks second, behind only Textual Inversion. However, the average FID of Textual Inversion is 176.13 higher than ours (197.03 vs. 20.90), which indicates that Textual Inversion attains its higher diversity at the expense of extremely low fidelity. Such a trade-off is of little practical value for defect data augmentation.

The qualitative results of our model and the other methods are shown in Fig. 6. Across the defect types, DCGAN produces relatively blurry backgrounds and poorly formed defects. StyleGAN3 generally handles the background of defect images well, but it struggles to render smaller defects such as black spots. Among the text-to-image methods, Textual Inversion performs the worst. We attribute this to its learning the style from only a limited number of images without adequately capturing their detailed features, so the generated images differ significantly from the original images. This also explains why Textual Inversion achieves the highest IS in Table II: the increased diversity comes at the cost of authenticity. IP-Adapter performs better in background generation but is less accurate on individual defect features; for instance, it tends to produce black spots when generating crack images. T2I-Adapter falls short of IP-Adapter in background generation, with issues such as misaligned and disconnected grid lines, and it generates defects with poor quality and abnormal textures.

TABLE III: Ablation study results. Bold results represent the optimal generation performance.

Training		Inference	Metric
SCE	LISA	TIDSC	FID $\downarrow$	IS $\uparrow$
✓	×	×	197.03	2.28
×	✓	×	48.64	1.39
×	×	✓	265.31	3.72
✓	✓	×	36.78	1.40
✓	×	✓	174.13	2.05
×	✓	✓	40.22	1.42
✓	✓	✓	20.90	1.53

In contrast, our method achieves results that are highly consistent with real defect images. It generates high-quality and authentic backgrounds and defect morphology, effectively capturing the essential features of the defects. Meanwhile, thanks to the strong denoising capability of the diffusion model, our method can generate data with a rich and diverse distribution. Fig. 7 (a) compares the distribution of the images generated by our method with that of the original dataset. The generated images cover the distribution of the original dataset, which helps reduce the problem of endogenous shift. Fig. 7 (b) compares the image distributions generated by the different methods.

IV-E Ablation Studies and Analysis

In this section, we conduct an ablation study using the model trained on the EL group1 dataset to evaluate the impact of each component of our method. The results of this analysis are presented in Table III, which demonstrates the contributions of the individual components to the overall performance.

1) Contribution of SCE: We evaluated the impact of the SCE module on generation. As shown in Fig. 8 and Table III, this module effectively reproduces the style of PV EL images and learns the structural features of PV grid lines. However, it learns the style only coarsely while embedding defect-related text into the model, and it still cannot effectively distinguish and generate smaller defects.

2) Contribution of LISA: We assessed the impact of the LISA module on image generation. As shown in Table III, when only the LISA module is used, the FID is 48.64 and the IS is 1.39. Compared with using only the SCE module, the FID decreases by 148.39, indicating a significant improvement in the quality and realism of the generated images. However, the IS drops by 0.89, which we attribute to the generated images being more similar to real images, thereby reducing image diversity. When SCE and LISA are used together, the FID reaches 36.78, a further reduction of 11.86 relative to LISA alone. As shown in Fig. 8, after the LISA module is incorporated, the generated defect images closely resemble the original images. The module effectively learns the features of various defects and handles small-target defects, such as broken gate and black spot, in fine detail. For the unjoined weld defect, it manages the contrast between the defect and the background well. The LISA module contains a total of 22.32M trainable parameters: the linear projection layer contributes 3.15M parameters, and the cross-attention layers account for 19.17M parameters.

3) Contribution of TIDSC: We evaluated the impact of the TIDSC module on image generation. During the inference stage, we incorporate the TIDSC module to achieve location-specific generation while simultaneously pre-generating annotations. As shown in Table III, when the TIDSC module is used alone, it essentially imposes text boundary control on the SD model, and no feature learning of PV EL images takes place. The images are therefore generated solely from text prompts and yield an FID of 265.31 and an IS of 3.72. Although this IS is the highest in the table, trading image realism for diversity is of no practical use.

Furthermore, when we combine the SCE and TIDSC modules, the FID is 174.13 and the IS is 2.05: the FID is 91.18 lower than with TIDSC alone, and the IS is 0.23 lower than with SCE alone. When LISA and TIDSC are combined, there is a noticeable improvement in the quality of generated images, with an FID of 40.22. Finally, when all three modules are used together, the FID of the generated images is 20.90, and the IS is 1.53. This demonstrates that our method produces images with high realism and good diversity.
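The differences quoted in this ablation are taken relative to different baselines in Table III. The short sketch below recomputes them from the table so that each comparison is explicit; the dictionary layout and the helper name `delta` are illustrative choices.

```python
# Table III, keyed by (SCE, LISA, TIDSC) -> (FID, IS).
ABLATION = {
    (True, False, False): (197.03, 2.28),
    (False, True, False): (48.64, 1.39),
    (False, False, True): (265.31, 3.72),
    (True, True, False): (36.78, 1.40),
    (True, False, True): (174.13, 2.05),
    (False, True, True): (40.22, 1.42),
    (True, True, True): (20.90, 1.53),
}

SCE = (True, False, False)
LISA = (False, True, False)
TIDSC = (False, False, True)
SCE_LISA = (True, True, False)
SCE_TIDSC = (True, False, True)


def delta(config, baseline):
    """Return (FID change, IS change) of `config` relative to `baseline`."""
    fid_c, is_c = ABLATION[config]
    fid_b, is_b = ABLATION[baseline]
    return round(fid_c - fid_b, 2), round(is_c - is_b, 2)


print(delta(LISA, SCE))         # (-148.39, -0.89): LISA only vs. SCE only
print(delta(SCE_LISA, LISA))    # (-11.86, 0.01): SCE + LISA vs. LISA only
print(delta(SCE_TIDSC, TIDSC))  # (-91.18, -1.67): SCE + TIDSC vs. TIDSC only
print(delta(SCE_TIDSC, SCE))    # (-22.9, -0.23): SCE + TIDSC vs. SCE only
```

Negative FID changes are improvements, while negative IS changes indicate reduced diversity as measured by IS.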

As shown in Fig. 8, our method can generate specified defect images within the given bounding box and effectively produces images with diverse defect shapes and backgrounds. Fig. 9 shows the visualization results of the TIDSC module. When t=1, the attention values of the text and image tokens are uniformly distributed. When t=30, the attention guided by the text prompt is concentrated largely within the box. While focusing on the box, the image tokens also attend to the overall detailed features of the image.
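One simple way to quantify the focusing behavior visible in Fig. 9 is the share of attention mass that falls inside the target box. The sketch below is a hypothetical diagnostic and not the TIDSC constraint itself. It assumes a 2-D attention map for a single token and a box given in map coordinates as (x0, y0, x1, y1) with exclusive upper bounds; the function name `in_box_attention_ratio` and the 16 x 16 map size are illustrative.

```python
import numpy as np


def in_box_attention_ratio(attn, box):
    """Fraction of the total attention mass that lies inside `box`.

    attn: non-negative array of shape (height, width).
    box:  (x0, y0, x1, y1) in map coordinates, upper bounds exclusive.
    """
    x0, y0, x1, y1 = box
    total = attn.sum()
    if total <= 0:
        raise ValueError("attention map has no mass")
    return float(attn[y0:y1, x0:x1].sum() / total)


box = (4, 4, 12, 12)  # an 8 x 8 box inside a 16 x 16 map

# Uniform attention: the ratio equals the box's share of the map area.
uniform = np.full((16, 16), 1.0 / 256)
print(in_box_attention_ratio(uniform, box))  # 0.25

# Attention fully concentrated inside the box.
focused = np.zeros((16, 16))
focused[6:10, 6:10] = 1.0
print(in_box_attention_ratio(focused, box))  # 1.0
```

Under this measure, a uniformly distributed map scores exactly the area fraction of the box, and the score approaches 1 as the attention concentrates inside the box.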

One limitation is that the TIDSC module does not yet strictly confine the defect to the box. Because defects vary in size and position, the generated defects may overflow the box, as seen in the unjoined weld and crack images in Fig. 8. To verify the pass rate of localized image generation and the extent of bounding box overflow, we generated 200 images for each defect type in batches and manually annotated the qualified generated images to test the effectiveness of the module. The verification results are shown in Table IV, with an average image generation pass rate of 84.74$\%$ and an average box spillover rate of 8.37$\%$. This indicates reasonably good defect localization and provides a way to achieve automatic annotation for generated images.

TABLE IV: The pass rate of image positioning generation and the spillover rate of box.

Defect types	Pass Rate $\%$	Box Spillover Rate $\%$
broken gate	81.90	8.70
unjoined weld	83.80	10.41
black spot	88.70	4.35
crack	85.30	9.00
scratch	84.00	9.40
average	84.74	8.37
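The rates in Table IV were obtained from manual annotation. As a rough illustration of how such a check could be automated, the sketch below measures how much of an annotated defect box lies outside its target box and aggregates a spillover rate. The overflow definition, the tolerance `tol`, and the function names are assumptions made for illustration rather than the protocol used for Table IV. The final lines recompute the averages reported in the table.

```python
def box_area(box):
    """Area of an (x0, y0, x1, y1) box; empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overflow_fraction(target, defect):
    """Share of the annotated defect box that lies outside the target box."""
    inter = (
        max(target[0], defect[0]),
        max(target[1], defect[1]),
        min(target[2], defect[2]),
        min(target[3], defect[3]),
    )
    area = box_area(defect)
    if area == 0:
        raise ValueError("defect box is empty")
    return 1.0 - box_area(inter) / area


def spillover_rate(pairs, tol=0.0):
    """Percentage of (target, defect) pairs whose overflow exceeds `tol`."""
    spilled = sum(overflow_fraction(t, d) > tol for t, d in pairs)
    return 100.0 * spilled / len(pairs)


target = (10, 10, 50, 50)
inside = (20, 20, 40, 40)    # fully contained in the target box
spilling = (40, 20, 60, 40)  # half of it extends past the right edge
print(overflow_fraction(target, inside))    # 0.0
print(overflow_fraction(target, spilling))  # 0.5
print(spillover_rate([(target, inside), (target, spilling)]))  # 50.0

# Averages of the per-class rates in Table IV: (pass rate, box spillover rate).
TABLE_IV = {
    "broken gate": (81.90, 8.70),
    "unjoined weld": (83.80, 10.41),
    "black spot": (88.70, 4.35),
    "crack": (85.30, 9.00),
    "scratch": (84.00, 9.40),
}
avg_pass = round(sum(p for p, _ in TABLE_IV.values()) / len(TABLE_IV), 2)
avg_spill = round(sum(s for _, s in TABLE_IV.values()) / len(TABLE_IV), 2)
print(avg_pass, avg_spill)  # 84.74 8.37
```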

TABLE V: Verify the performance of adding our generated data to the YOLO models for defect detection. ↓ indicates a decrease in the indicator; ↑ indicates an increase in the indicator.

Datasets	Methods	broken gate	unjoined weld	black spot	scratch	crack	Average
EL group1	YOLOv5	91.7	85.5	68.2	73.8	80.4	79.9
	+ours	88.4 (↓3.3)	86.4 (↑0.9)	70.6 (↑2.4)	70.9 (↓2.9)	88.6 (↑8.2)	81.0 (↑1.1)
	YOLOv8	95.5	89.1	76.7	78.3	93.4	86.6
	+ours	88.9 (↓6.6)	93.1 (↑4.0)	79.2 (↑2.5)	80.8 (↑2.5)	94.4 (↑1.0)	87.3 (↑0.7)
EL group2	YOLOv5	65.4	76.7	59.5	55.4	31.2	57.6
	+ours	69.5 (↑4.1)	82.4 (↑5.7)	60.9 (↑1.4)	60.5 (↑5.1)	43.61 (↑12.41)	63.4 (↑5.8)
	YOLOv8	67.3	79.9	59.9	62.1	48.7	63.6
	+ours	70.1 (↑2.8)	81.7 (↑1.8)	58.5 (↓1.4)	65.7 (↑3.6)	50.9 (↑2.2)	65.4 (↑1.8)
EL group3	YOLOv5	57.5	40.3	47.3	34.0	31.7	42.2
	+ours	61.8 (↑4.3)	38.9 (↓1.4)	48.9 (↑1.6)	40.8 (↑6.8)	52.0 (↑20.3)	48.5 (↑6.3)
	YOLOv8	58.0	39.5	44.6	40.4	48.8	46.3
	+ours	60.7 (↑2.7)	42.4 (↑2.9)	46.2 (↑1.6)	43.3 (↑2.9)	48.2 (↓0.6)	48.1 (↑1.8)

IV-F Defect Detection

To verify that the images generated by our method can enhance the performance of defect detection tasks, we generated 200 images with corresponding annotations for each defect type. As detectors, we adopted YOLOv5 and YOLOv8, which are effective for identifying small-target defects. We trained both detectors on the EL group1 dataset and on the expanded EL group1 dataset, and then conducted defect detection on the test set of EL group1, as well as on EL group2 and EL group3.
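Because each image is generated for a known box and defect type, a detection label can be written without manual annotation. The sketch below converts a pixel-space box into the normalized `class x_center y_center width height` text format commonly used for YOLO training. The class-index order in `CLASS_IDS`, the function name `yolo_label`, and the 512 x 512 image size in the example are assumptions made for illustration.

```python
# Assumed class-index order; the real mapping depends on the dataset config.
CLASS_IDS = {
    "broken gate": 0,
    "unjoined weld": 1,
    "black spot": 2,
    "scratch": 3,
    "crack": 4,
}


def yolo_label(defect, box, img_w, img_h):
    """Format one label line from a pixel box (x0, y0, x1, y1).

    Coordinates are normalized by the image size, as YOLO-style label
    files expect: class x_center y_center width height.
    """
    x0, y0, x1, y1 = box
    if not (0 <= x0 < x1 <= img_w and 0 <= y0 < y1 <= img_h):
        raise ValueError("box must lie inside the image and have positive size")
    cx = (x0 + x1) / 2 / img_w
    cy = (y0 + y1) / 2 / img_h
    w = (x1 - x0) / img_w
    h = (y1 - y0) / img_h
    return f"{CLASS_IDS[defect]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Example: a crack generated inside the box (128, 64, 384, 192) of a 512 x 512 image.
line = yolo_label("crack", (128, 64, 384, 192), 512, 512)
print(line)  # 4 0.500000 0.250000 0.500000 0.250000
```

One such line per generated image, saved next to the image with the same base name, would give the detector the box that conditioned the generation. Since generated defects can overflow the box, labels produced this way may still need a manual check.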

The experimental results are shown in Table V. After incorporating the generated samples, the average mAP of each detection model improves on all three datasets. With YOLOv5, the average mAP increases by 5.8 and 6.3 percentage points on EL group2 and EL group3, respectively. With YOLOv8, the average mAP improves by 1.8 percentage points on both EL group2 and EL group3. These results indicate that our method, by generating diverse images, can effectively mitigate the low detection accuracy caused by endogenous shifts.
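The average gains quoted above follow directly from the Average column of Table V, as the following short check illustrates. The dictionary simply transcribes that column, and the helper name `gain` is illustrative.

```python
# Average mAP from Table V: (dataset, detector) -> (baseline, with generated data).
AVG_MAP = {
    ("EL group1", "YOLOv5"): (79.9, 81.0),
    ("EL group1", "YOLOv8"): (86.6, 87.3),
    ("EL group2", "YOLOv5"): (57.6, 63.4),
    ("EL group2", "YOLOv8"): (63.6, 65.4),
    ("EL group3", "YOLOv5"): (42.2, 48.5),
    ("EL group3", "YOLOv8"): (46.3, 48.1),
}


def gain(dataset, detector):
    """Average mAP gain (in points) from adding the generated images."""
    baseline, augmented = AVG_MAP[(dataset, detector)]
    return round(augmented - baseline, 1)


for key in AVG_MAP:
    print(key, gain(*key))
```

The gains on EL group2 and EL group3, which were not used for training, are larger than those on the EL group1 test set for both detectors.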

V Conclusion

This paper proposed PDIG, which addresses the endogenous shifts in real PV scenarios, the poor consistency and controllability of generated images, and the lack of labels for the data. First, PDIG uses a few industrial defect images and adds embedded priors through the SCE module to capture the specific relational concepts between defect types and their appearances. Next, the LISA module embeds industrial defect features into the SD model through cross-disentangled attention, thereby enriching the domain distribution of the generated images. Finally, during the inference stage, the TIDSC module ensures the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments show that our model significantly outperforms state-of-the-art methods in terms of both realism and diversity of generation and effectively improves the performance of downstream defect detection tasks. However, our method also has limitations, such as overflow in localized image generation. In the future, we will focus on further improving the accuracy of localized image generation and on generating multiple instances simultaneously.

References

[1] M. Dhimish, V. d’Alessandro, and S. Daliento, “Investigating the impact of cracks on solar cells performance: Analysis based on nonuniform and uniform crack distributions,” IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 1684–1693, 2021.
[2] B. Su, H. Chen, P. Chen, G. Bian, K. Liu, and W. Liu, “Deep learning-based solar-cell manufacturing defect detection with complementary attention network,” IEEE Transactions on Industrial Informatics, vol. 17, no. 6, pp. 4084–4095, 2020.
[3] B. Su, H. Chen, and Z. Zhou, “Baf-detector: An efficient cnn-based detector for photovoltaic cell defect detection,” IEEE Transactions on Industrial Electronics, vol. 69, no. 3, pp. 3161–3171, 2021.
[4] S. Zhao, H. Chen, C. Wang, and S. Shi, “Sncf-net: Scale-aware neighborhood correlation feature network for hotspot defect detection of photovoltaic farms,” Measurement, vol. 206, p. 112342, 2023.
[5] S. Zhao, H. Chen, C. Wang, and Z. Zhang, “Ssn: Shift suppression network for endogenous shift of photovoltaic defect detection,” IEEE Transactions on Industrial Informatics, 2023.
[6] Y. Wang, Z. Zhou, X. Tan, Y. Pan, J. Yuan, Z. Qiu, and C. Liu, “Unveiling the potential of progressive training diffusion model for defect image generation and recognition in industrial processes,” Neurocomputing, vol. 592, p. 127837, 2024.
[7] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
[8] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE signal processing magazine, vol. 35, no. 1, pp. 53–65, 2018.
[9] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2524–2534.
[10] C. Zhao, W. Xue, W.-P. Fu, Z.-Q. Li, and X. Fang, “Defect sample image generation method based on gans in diamond tool defect detection,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–9, 2023.
[11] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion models,” IEEE Transactions on Knowledge and Data Engineering, 2024.
[12] F. Shen and J. Tang, “Imagpose: A unified conditional framework for pose-guided person generation,” Advances in neural information processing systems, vol. 37, pp. 6246–6266, 2024.
[13] F. Shen, X. Jiang, X. He, H. Ye, C. Wang, X. Du, Z. Li, and J. Tang, “Imagdressing-v1: Customizable virtual dressing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 6795–6804.
[14] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
[15] Y. Xu, X. Xu, H. Gao, and F. Xiao, “Sgdm: An adaptive style-guided diffusion model for personalized text to image generation,” IEEE Transactions on Multimedia, vol. 26, pp. 9804–9813, 2024.
[16] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847.
[17] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023.
[18] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
[19] F. Shen, J. Yu, C. Wang, X. Jiang, X. Du, and J. Tang, “Imaggarment-1: Fine-grained garment generation for controllable fashion design,” arXiv preprint arXiv:2504.13176, 2025.
[20] F. Shen, C. Wang, J. Gao, Q. Guo, J. Dang, J. Tang, and T.-S. Chua, “Long-term talkingface generation via motion-prior conditional diffusion model,” arXiv preprint arXiv:2502.09533, 2025.
[21] S. Niu, B. Li, X. Wang, and Y. Peng, “Region-and strength-controllable gan for defect generation and segmentation in industrial images,” IEEE Transactions on Industrial Informatics, vol. 18, no. 7, pp. 4531–4541, 2021.
[22] D. Guan, J. Huang, A. Xiao, S. Lu, and Y. Cao, “Uncertainty-aware unsupervised domain adaptation in object detection,” IEEE Transactions on Multimedia, vol. 24, pp. 2502–2514, 2022.
[23] K. Zhang, Y. Xiao, J. Wang, M. Du, X. Guo, R. Zhou, C. Shi, and Z. Zhao, “Dp-gan: A transmission line bolt defects generation network based on dual discriminator architecture and pseudo-enhancement strategy,” IEEE Transactions on Power Delivery, 2024.
[24] H. Zhou, W. Wu, Y. Zhang, J. Ma, and H. Ling, “Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network,” IEEE Transactions on Multimedia, vol. 25, pp. 635–648, 2023.
[25] J. Wang, Z. Yang, J. Zhang, Q. Zhang, and W.-T. K. Chien, “Adabalgan: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition,” IEEE Transactions on Semiconductor Manufacturing, vol. 32, no. 3, pp. 310–319, 2019.
[26] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[27] Y. Tian, G. Yang, Z. Wang, E. Li, and Z. Liang, “Detection of apple lesions in orchards based on deep learning methods of cyclegan and yolov3-dense,” Journal of Sensors, vol. 2019, no. 1, p. 7630926, 2019.
[28] S. Niu, B. Li, X. Wang, and H. Lin, “Defect image sample generation with gan for improving defect recognition,” IEEE Transactions on Automation Science and Engineering, vol. 17, no. 3, pp. 1611–1622, 2020.
[29] F. Deng, J. Luo, L. Fu, Y. Huang, J. Chen, N. Li, J. Zhong, and T. L. Lam, “Dg2gan: improving defect recognition performance with generated defect image sample,” Scientific Reports, vol. 14, no. 1, p. 14787, 2024.
[30] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
[31] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[32] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[33] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021.
[34] X. Zhong, J. Zhu, W. Liu, C. Hu, Y. Deng, and Z. Wu, “An overview of image generation of industrial surface defects,” Sensors, vol. 23, no. 19, p. 8160, 2023.
[35] T. Hu, J. Zhang, R. Yi, Y. Du, X. Chen, L. Liu, Y. Wang, and C. Wang, “Anomalydiffusion: Few-shot anomaly image generation with diffusion model,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 8, 2024, pp. 8526–8534.
[36] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
[37] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22500–22510.
[38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022.
[39] Q. Wang, X. Bai, H. Wang, Z. Qin, A. Chen, H. Li, X. Tang, and Y. Hu, “Instantid: Zero-shot identity-preserving generation in seconds,” arXiv preprint arXiv:2401.07519, 2024.
[40] F. Shen, H. Ye, J. Zhang, C. Wang, X. Han, and W. Yang, “Advancing pose-guided image synthesis with progressive conditional diffusion models,” arXiv preprint arXiv:2310.06313, 2023.
[41] F. Shen, H. Ye, S. Liu, J. Zhang, C. Wang, X. Han, and Y. Wei, “Boosting consistency in story visualization with rich-contextual conditional diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 6785–6794.
[42] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
[43] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[44] J. Xie, Y. Li, Y. Huang, H. Liu, W. Zhang, Y. Zheng, and M. Z. Shou, “Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7452–7461.
[45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
[46] S. Barratt and R. Sharma, “A note on the inception score,” arXiv preprint arXiv:1801.01973, 2018.
[47] A. Patil et al., “Dcgan: Deep convolutional gan with attention module for remote view classification,” in 2021 International Conference on Forensics, Analytics, Big Data, Security (FABS), vol. 1. IEEE, 2021, pp. 1–10.
[48] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” in Proc. NeurIPS, 2021.