Photovoltaic Defect Image Generator with Boundary Alignment Smoothing Constraint for Domain Shift Mitigation

Dongying Li, Binyi Su, Hua Zhang, Yong Li, Haiyong Chen

This work was supported in part by the National Natural Science Foundation of China under Grant 62473127, National Natural Science Foundation of China Joint Foundation Program U21A20482, National Key Research and Development Program of China under Grant 2024YFB3310904 and 2021YFB3100800, Central Government Guides Local Science and Technology Development Fund Project 246Z1602G, Major Science and Technology Support Program of Hebei Province 242Q4302Z, Natural Science Foundation of China under Grant 62272018 and 62072454, Beijing Natural Science Foundation under Grant 4202084, Basal Research Fund of Central Public Research Institute of China under Grant 20212701 and 246Z4306G, Shijiazhuang Science and Technology Cooperation Project under Grant SJZZXC24009, Tianjin Municipal Education Commission Scientific Research Project under Grant 2024KJ151. (Corresponding author: Haiyong Chen.)

D. Li and H. Chen are with the School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China (e-mail: [email protected], [email protected]). B. Su is with the School of Artificial Intelligence and Data Science, Hebei University of Technology, Tianjin 300401, China, and also with China Xiongan Group Digital City Technology Company Ltd., Hebei 070001, China (e-mail: [email protected]). H. Zhang is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China (e-mail: [email protected]). Y. Li is with the School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]).

Abstract

Accurate defect detection of photovoltaic (PV) cells is critical for ensuring quality and efficiency in intelligent PV manufacturing systems. However, the scarcity of rich defect data poses substantial challenges for effective model training. While existing methods have explored generative models to augment datasets, they often suffer from instability, limited diversity, and domain shifts. To address these issues, we propose PDIG, a Photovoltaic Defect Image Generator based on Stable Diffusion (SD). PDIG leverages the strong priors learned from large-scale datasets to enhance generation quality under limited data. Specifically, we introduce a Semantic Concept Embedding (SCE) module that incorporates text-conditioned priors to capture the relational concepts between defect types and their appearances. To further enrich the domain distribution, we design a Lightweight Industrial Style Adaptor (LISA), which injects industrial defect characteristics into the SD model through cross-disentangled attention. At inference, we propose a Text-Image Dual-Space Constraints (TIDSC) module, enforcing the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments demonstrate that PDIG achieves superior realism and diversity compared to state-of-the-art methods. Specifically, our approach improves Frechet Inception Distance (FID) by 19.16 points over the second-best method and significantly enhances the performance of downstream defect detection tasks.

Index Terms:
image generation, text-to-image, data enhancement, endogenous shifts, single-domain generalized object detection

I Introduction

Photovoltaic (PV) cells are prone to various defects during different production processes, such as microcracks, scratches, and spots, which severely affect the lifespan and power generation efficiency of the cells [1]. To ensure the quality and safety of industrial products, comprehensive detection of defects in PV electroluminescence (EL) images is imperative.

In recent years, deep learning-based object detection has become a cornerstone of computer vision. Su et al. [2, 3] and Zhao et al. [4] designed CNN-based multi-scale attention mechanisms to refine multi-scale features, thereby improving classification and detection performance.

Figure 1: The published ELES dataset exhibits issues of domain shift and instance shift. (a) Domain shift: variations exist in the resolution, background brightness, and grid line spacing of defect images across different production lines; Instance shift: the distribution of instances in the training dataset is inconsistent with that in the test data. For example, cracks in the training set appear as fine, linear defects, whereas those in the testing set are characterized by large-area damage with inconsistent shapes. (b) t-SNE dimensionality reduction and probability distribution visualization on the data from different production lines reveal data distribution deviations caused by endogenous shifts.

However, these methods only consider single-production-line scenarios and ignore the complex background styles and instance shift across multiple production lines. Fig. 1 (a) shows the domain shift and instance shift present in datasets from different production lines. As shown in Fig. 1 (b), the t-SNE and probability density plots confirm an obvious shift between the datasets. To address this, the Shift Suppression Network [5] tackles endogenous shift in PV defect detection by learning enhanced feature representations through style alignment and cross-layer interaction. However, in actual production, the shift becomes more pronounced as the diversity of defects in the images increases [6], and neural networks become less effective because their training data are limited. Thus, employing image generation techniques to increase the quantity and diversity of defect data is a key strategy for mitigating the shift problem.

Figure 2: The images generated by the SD V1.5 model with the prompt “PV defect images based on EL imaging”.

Traditional data expansion based on image processing [7] is limited to the feature space spanned by the known data. Generative adversarial networks (GANs) [8] and their variants can instead synthesize new high-quality images. Defect-GAN [9] adopts a novel compositional layer-based architecture to simulate the random variations of defects; it synthesizes various types of defects with excellent diversity and fidelity and improves the performance of the defect detection network. Inspired by the principle of image-to-image translation, TransP2P [10] combines the global feature perception of the transformer with the local detail extraction of U-Net to convert defect-free images into defect images. However, training these generative models from scratch is challenging when image samples are insufficient, which often leads to distorted generated samples that lack authenticity.

Recently, image generation based on the diffusion model [11, 12, 13] has shown impressive capabilities, as it can embed a vast number of image distributions and generate samples with high fidelity and diversity. Among these models, the Stable Diffusion model (SD) [14] adopts a more stable, controllable, and efficient approach to generating high-quality images, which has led to significant advancements in the quality, speed, and cost of image generation. However, although the SD model has learned rich and abstract textual concepts, users are limited to concepts that the network has already been trained on. As shown in Fig. 2, when the prompt “PV defect images based on EL imaging” is input, the gap between the image generated by the SD model and the target image is significant. This is because the SD model does not understand the concept of PV EL imaging, and thus cannot generate the defect morphology of actual PV quality inspection images. Meanwhile, due to the limitations of data and computing power, it is very difficult to retrain the SD model.

Several adapter-based methods have been proposed to enhance the controllability of text-to-image (T2I) generation. These approaches typically freeze the base Stable Diffusion (SD) model and introduce lightweight trainable parameters to inject external control signals [15]. ControlNet [16] supports auxiliary conditions such as edge maps, segmentation maps, and depth. IP-Adapter [17] facilitates image enhancement and restoration tasks, while T2I-Adapter [18] enables fine-grained control over color and structure. However, most T2I methods struggle with spatial alignment and complex prompt understanding in industrial scenarios, such as defect image generation. Recent studies have highlighted the importance of structured conditioning: IMAGGarment [19] achieves fine-grained garment control through a diffusion-based framework, and RCDM [20] models long-term consistency for talking face generation using motion priors. Motivated by these advances, we propose a targeted conditional framework to better handle spatial and descriptive control in industrial settings.

In this paper, we propose a Photovoltaic Defect Image Generator (PDIG) that exploits the strong priors the SD model has learned from large-scale datasets to enhance the authenticity of generation under few-shot training data. First, PDIG uses 3-5 industrial defect types and images to enrich the text-embedded priors, capturing the specific relational concepts between defect types and their appearances. Second, LISA is developed to embed PV EL defect features into the pre-trained text-to-image diffusion model to enrich the domain distribution of the generated images. Finally, at inference, TIDSC is proposed to ensure the positional and spatial consistency of the generated defect images. Extensive experiments show that our model significantly outperforms state-of-the-art methods in terms of generation authenticity and diversity, and effectively improves the performance of downstream defect object detection tasks.

The main contributions are summarized as follows:

  • To mitigate domain shift in industrial defect detection, we propose PDIG, a diffusion-based method that generates diverse and realistic defect samples for augmentation, effectively bridging cross-domain distribution gaps and enhancing model generalization.

  • We propose a novel LISA module that embeds the image features of industrial defects into the SD model, thereby enabling the model to perform multimodal learning of defect features. Meanwhile, to reduce the cost of manual annotation, we propose a TIDSC module to achieve spatially localized generation, providing annotation priors for subsequent defect detection applications.

  • The experimental results show that, compared with previous defect generation methods, our method significantly improves the diversity and realism of generated defect images. In addition, it improves the domain-adaptive performance of defect detection.

II Related Work

II-A Defect detection method based on image generation

In recent years, GANs and their variants have been widely applied to defect image generation [21, 22, 23, 24]. To address the class imbalance issue, AdaBalGAN [25] combines the CGAN [26] with an adaptive generation controller to produce high-fidelity images of specified categories, significantly enhancing the accuracy and stability of defect recognition. Faced with the challenge of insufficient sample diversity, the authors of [27] employed CycleGAN to augment data, thereby enhancing the diversity of the training dataset. Furthermore, SDGAN [28] introduces a D2 adversarial loss to increase diversity. To further enhance the quality of image generation, DG2GAN [29] incorporates a cycle-consistent loss, a DJS-optimized discriminator loss, and a DG2 adversarial loss, optimizing the image feature distribution to generate high-quality and diverse defect images. However, these methods inherently tend to merely imitate existing content, leading to mode collapse. In comparison, we employ the SD model as our baseline. It offers greater creative freedom and better balances image quality with computational cost, thus avoiding mode collapse.

II-B Denoising Diffusion Probabilistic Model

Since their introduction to image generation in 2015, diffusion models [30] have emerged as a prominent research direction due to their robust generative capabilities and broad application potential. These models perturb data distributions through a forward diffusion process and subsequently learn an inverse diffusion process to recover the original data distribution, resulting in a highly flexible and computationally efficient generative framework. A notable method in this area is the Denoising Diffusion Probabilistic Model (DDPM) [31], which leverages iterative gradient updates combined with Gaussian noise to generate high-fidelity samples, approximating the underlying data manifold through annealed Langevin dynamics. To further improve the sampling efficiency of DDPM, DDIM [32] introduces a non-Markovian diffusion process and deterministic sampling strategy, significantly reducing inference time and computation steps. To enhance controllability during generation, ILVR [33] guides DDPM to produce high-quality images conditioned on reference signals, although it still operates purely in pixel space, resulting in high computational overhead and limited support for multimodal conditioning. Inspired by these advances, we propose the LISA module, which learns image features by training only a small number of parameters. Our method achieves a thorough integration of image and text modalities, enabling rich multimodal inputs while maintaining high computational efficiency.

II-C Latent Diffusion Model

Latent Diffusion Models (LDMs) were introduced by Rombach et al. [14], proposing to learn the feature distribution through a diffusion process in the latent space rather than the pixel space. Compared to conventional pixel-space diffusion models, LDMs capture higher-level features and semantic information more efficiently. Notably, LDMs incorporate a conditional encoder that injects control conditions (e.g., text, image, or video) via cross-attention mechanisms to guide the generation process. Recently, LDMs have served as the theoretical backbone of Stable Diffusion (SD) and emerging video generation models such as Sora. In various industrial applications, LDMs have demonstrated superior performance over GANs [34], particularly in defect image generation tasks. To enable the generation of defect images with specific types and locations, AnomalyDiffusion [35] integrates a spatial anomaly embedding module and an adaptive attention weighting mechanism, achieving precise alignment between generated anomalies and ground-truth annotations, thereby significantly enhancing downstream localization performance. Beyond standard LDMs, fine-tuning techniques such as Textual Inversion [36], Dreambooth [37], LoRA [38], and InstantID [39] have been developed to further improve the controllability of SD-based models. However, these approaches typically struggle with accurately controlling the generation location. To address this limitation, our model introduces a Text-Image Dual-Space Constraints (TIDSC) module, which enforces consistency in the generated location and spatial structure, enabling the acquisition of diverse and realistic defective image–annotation pairs.
Furthermore, recent advances in conditional diffusion models, such as progressive conditional diffusion [40] and rich-contextual conditioning frameworks [41], have demonstrated the effectiveness of carefully designed conditioning mechanisms, motivating the design of our TIDSC-enhanced generation pipeline.

Figure 3: The PDIG employs a two-stage generation approach. During the training stage, the Semantic Concept Embedding (SCE) module and the Lightweight Industrial Style Adaptor (LISA) module learn image and text features, which are then fully integrated into the backbone network to enhance the image generation domain distribution. In the inference stage, the Text-Image Dual-Space Constraints (TIDSC) module is utilized to focus more attention on the specified bounding box regions, thereby enabling precise defect localization generation.

III Proposed Method

In this section, we will detail the proposed PDIG. As shown in Fig. 3, PDIG employs a two-stage generation approach. During the training stage, the SCE module and the LISA module learn image and text features, which are then fully integrated into the backbone network to enhance the image generation domain distribution. In the inference stage, the TIDSC is utilized to focus more attention on the specified bounding box regions, thereby enabling precise defect localization generation.

III-A Preliminaries

The Stable Diffusion model is developed based on the LDM and is trained on the larger LAION-5B dataset. It also employs a more powerful CLIP model to replace the original cross-attention modulation mechanism. Similar to LDM, the SD model consists of three core components: a VAE [42], a U-Net [43], and a CLIP Text Encoder. The VAE encoder $\varepsilon$ encodes the image $x$ into the latent space to obtain $z_0$. In the latent space, Gaussian noise is added to $z_0$ to obtain $z_T$. The U-Net then predicts the noise residual for denoising to obtain $z$, which is subsequently decoded back to the pixel space by the VAE decoder $\mathcal{D}$ to generate the reconstructed image. The CLIP Text Encoder encodes the input text prompt into a text embedding, which is fed into the cross-attention layer of the U-Net as a conditional control. The training objective of the model is to minimize the following loss:

$$\mathcal{L}_{SD} = \mathbb{E}_{z \sim \varepsilon(x),\, \epsilon \sim \mathcal{N}(0,1),\, y,\, t}\left[ \left\| \epsilon - \epsilon_{\theta}\left(z_t, t, \tau_{\theta}(y)\right) \right\|_2^2 \right], \tag{1}$$

where $t$ is the time step, $y$ is the conditional input, and $\tau_{\theta}(y)$ is the model that maps the conditional input $y$ to the conditional vector.
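As a rough illustration of Eq. (1), the sketch below evaluates the squared noise-prediction error for a single sample in plain Python. It assumes the standard DDPM forward step $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$, which the text above does not spell out, and `toy_denoiser` is a hypothetical placeholder for $\epsilon_{\theta}$, not the SD U-Net.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Assumed DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 norm ||eps - eps_pred||_2^2 from Eq. (1), for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def toy_denoiser(z_t, t, cond):
    """Hypothetical stand-in for eps_theta(z_t, t, tau_theta(y)); always predicts zero noise."""
    return [0.0 for _ in z_t]


rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 2.0]                      # toy latent, not a real VAE output
eps = [rng.gauss(0.0, 1.0) for _ in z0]          # eps ~ N(0, 1)
z_t = add_noise(z0, eps, alpha_bar_t=0.7)        # noisy latent at some step t
loss = noise_prediction_loss(eps, toy_denoiser(z_t, t=10, cond=None))
```

Averaging this quantity over many sampled images, noise draws, prompts, and time steps gives a Monte Carlo estimate of the expectation in Eq. (1).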

We introduce Semantic Concept Embedding, which embeds industry-specific concepts into the vocabulary of the CLIP Text Encoder [36]. This enables the model to learn text embeddings of industrial concepts and their corresponding visual features, and thus to capture the unique styles and related semantics of industrial images. We define the text descriptions as “A photo of $S^{*}$” and “A photo of $S^{\wedge}$”, where $S^{*}$ and $S^{\wedge}$ represent PV EL images and images corresponding to different defect types, respectively. To find the optimal embedding vector $v$, our optimization target is:

$$v^{*} = \mathop{\arg\min}\limits_{v}\, \mathbb{E}_{z \sim \varepsilon(x),\, \epsilon \sim \mathcal{N}(0,1),\, y,\, t}\left[ \left\| \epsilon - \epsilon_{\theta}\left(z_t, t, \tau_{\theta}(y)\right) \right\|_2^2 \right]. \tag{2}$$

Here, $v^{*}$ is the token embedding vector obtained by minimizing the LDM loss.

III-B Lightweight Industrial Style Adaptor (LISA)

Because existing methods lack fine-grained learning of image features, they find it difficult to generate industrial defect images with diverse background styles and defect morphologies from text descriptions alone. To tackle this issue, we develop the Lightweight Industrial Style Adaptor (LISA) module. This module adapts the pre-trained SD model by training only a limited number of network layers, enabling the generation of industrial defect images with image prompts as input. As shown in Fig. 3 (a), PV EL defects, such as scratches, broken gates, and black spots, are typically small. Therefore, when processing a defective image $x$, we first locate the defect regions based on the bounding box information $\mathcal{B} = \{b_i\}$ provided by defect annotations. Here, $b_i$ represents the coordinates of the top-left and bottom-right corners of the defect box, i.e., $\{(x_{i1}, y_{i1}), (x_{i2}, y_{i2})\}$.
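Given a box in this corner format, a patch centered on the defect can be cut out along the following lines. This is a minimal sketch: the fixed square `patch_size`, the shift-to-fit handling at image borders, and both function names are assumptions for illustration, since the paper does not specify these details.

```python
def crop_window(box, image_size, patch_size):
    """Return (left, top, right, bottom) of a square window centered on the box.

    box        : ((x1, y1), (x2, y2)), top-left and bottom-right corners
    image_size : (width, height) in pixels
    patch_size : side length of the square patch (assumed hyperparameter)
    The window is shifted, not shrunk, so that it stays inside the image.
    """
    (x1, y1), (x2, y2) = box
    width, height = image_size
    if patch_size > width or patch_size > height:
        raise ValueError("patch_size must not exceed the image size")
    cx = (x1 + x2) // 2
    cy = (y1 + y2) // 2
    left = min(max(cx - patch_size // 2, 0), width - patch_size)
    top = min(max(cy - patch_size // 2, 0), height - patch_size)
    return left, top, left + patch_size, top + patch_size


def crop(image, window):
    """Crop a row-major nested-list image with a (left, top, right, bottom) window."""
    left, top, right, bottom = window
    return [row[left:right] for row in image[top:bottom]]
```

For example, `crop_window(((10, 20), (30, 40)), (100, 100), 32)` centers a 32-pixel window on the box center (20, 30), and a box near the image border yields a window pushed back inside the image.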

Subsequently, we crop and align the image around each box to obtain image feature patches, so that the model can better learn the features of the defect regions. To extract defect image features, we employ a pre-trained CLIP image encoder [17]. After obtaining the global image embedding, we project it through a trainable projection network into a feature sequence of length $N$, resulting in the image tokens $h^{z}$. Finally, as shown in Fig. 4, the image features are integrated in parallel with the text features into the denoising U-Net of the SD model via a cross-disentangled attention adaptation module. The attention mechanism can be expressed as follows:

$$\bar{h}^{z}, \bar{h}^{p} = \mathrm{Attn}\left([\mathbf{Q}^{z}, \mathbf{Q}^{p}], [\mathbf{K}^{z}, \mathbf{K}^{p}], [\mathbf{V}^{z}, \mathbf{V}^{p}]\right), \tag{3}$$

Here, $\bar{h}^{z}$ and $\bar{h}^{p}$ represent the outputs of the image and text tokens after cross-attention, respectively. $[\cdot,\cdot]$ denotes concatenation along the token dimension; $\mathbf{Q}^{z} = \mathbf{Q}^{p} = h^{p}\mathbf{W}_{q}^{p}$, $\mathbf{K}^{z} = h^{z}\mathbf{W}_{k}^{z}$, $\mathbf{V}^{z} = h^{z}\mathbf{W}_{v}^{z}$; $h^{p} = \tau_{\theta}(y_t)$; $\mathbf{W}_{q}^{p}$, $\mathbf{W}_{k}^{z}$, and $\mathbf{W}_{v}^{z}$ are the weight matrices of the linear projection layers for the query of text tokens, and the key and value of image tokens, respectively. Based on the attention mechanism, the final formula for cross-attention between image and text can be derived as follows:

$$\begin{aligned}
\bar{H} &= \alpha \cdot \bar{h}^{z} + \beta \cdot \bar{h}^{p} \\
&= \alpha \cdot \mathrm{Attn}(\mathbf{Q}^{p}, \mathbf{K}^{z}, \mathbf{V}^{z}) + \beta \cdot \mathrm{Attn}(\mathbf{Q}^{p}, \mathbf{K}^{p}, \mathbf{V}^{p}) \\
&= \alpha \cdot \mathrm{Softmax}\left(\frac{\mathbf{Q}^{p}(\mathbf{K}^{z})^{\top}}{\sqrt{d}}\right)\mathbf{V}^{z} + \beta \cdot \mathrm{Softmax}\left(\frac{\mathbf{Q}^{p}(\mathbf{K}^{p})^{\top}}{\sqrt{d}}\right)\mathbf{V}^{p}.
\end{aligned} \tag{4}$$

Here, $\alpha$ and $\beta$ are weighting factors that adjust the relative contributions of the image prompt and the text prompt, respectively.
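The weighted sum of the two attention branches in Eq. (4) can be sketched in plain Python as follows. Matrices are nested lists with one row per token, a single attention head is used, and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def attn(q, k, v):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def fused_attention(q_p, k_z, v_z, k_p, v_p, alpha, beta):
    """Eq. (4): alpha * Attn(Q^p, K^z, V^z) + beta * Attn(Q^p, K^p, V^p)."""
    h_z = attn(q_p, k_z, v_z)   # image-prompt branch
    h_p = attn(q_p, k_p, v_p)   # text-prompt branch
    return [[alpha * a + beta * b for a, b in zip(rz, rp)]
            for rz, rp in zip(h_z, h_p)]
```

Setting `alpha=0.0` reduces the output to the text-only branch, which makes it easy to check how much the image prompt contributes for a given weight.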

III-C Text-Image Dual-Space Constraints (TIDSC)

Large-scale data generation requires a corresponding volume of annotations, and manual annotation is no longer efficient enough to meet this need. Thus, to realize location-controlled defect generation and semi-automatic annotation, we propose the TIDSC module. In the inference stage, the user provides the caption prompt, the image prompt, and the locations of the defects to be generated, $\mathcal{B}' = \{b_j\}$. From the LISA module, we obtain a set of caption tokens and image tokens $\mathcal{H} = \{h_j^{p}, h_j^{z}\}$ as well as the cross-attention maps $\bar{\mathcal{H}}^{t} = \{\bar{H}_j^{t}\}$. Further, the given $\mathcal{B}'$ yields a set of binary spatial masks $\mathcal{M} = \{M_j\}$. Our goal is to generate the target objects inside the mask areas whenever possible [44]. To achieve this goal, we impose text-image dual-space constraints on the target cross-attention maps, including internal and external mask constraints as well as boundary alignment smoothing constraints. These constraints gradually update the latent $z_t$ so that the position and scale of the synthesized object agree with the mask region.
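A binary mask of this kind can be rasterized from a box with a few lines of Python. The sketch assumes that the box is first rescaled from image resolution to the (typically lower) resolution of the cross-attention map and that the bottom-right corner is exclusive; both conventions and both function names are illustrative assumptions.

```python
import math


def scale_box(box, image_size, map_size):
    """Rescale ((x1, y1), (x2, y2)) from image pixels to attention-map cells."""
    (x1, y1), (x2, y2) = box
    width, height = image_size
    map_w, map_h = map_size
    sx, sy = map_w / width, map_h / height
    # Floor the top-left and ceil the bottom-right so the box is never shrunk.
    return ((math.floor(x1 * sx), math.floor(y1 * sy)),
            (math.ceil(x2 * sx), math.ceil(y2 * sy)))


def box_to_mask(box, height, width):
    """Binary mask M_j: 1 inside the box (bottom-right exclusive), 0 elsewhere."""
    (x1, y1), (x2, y2) = box
    return [[1 if (x1 <= x < x2 and y1 <= y < y2) else 0 for x in range(width)]
            for y in range(height)]


# A 128x128 defect box in a 512x512 image, mapped onto a 16x16 attention map.
small_box = scale_box(((128, 256), (256, 384)), (512, 512), (16, 16))
mask = box_to_mask(small_box, 16, 16)
```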

Figure 4: Illustration of cross-disentangled attention between Text and Image.

For internal constraints, our goal is to constrain responses with high attention values within the masked region, defined as:

$$\mathcal{L}_{H_j}^{1} = 1 - \frac{1}{P} \sum \mathbf{topk}\left(\bar{H}_j^{t} \cdot \mathbf{M}_j, P\right), \tag{5}$$
$$\mathcal{L}_{In} = \sum_{H_j \in \mathcal{H}} \mathcal{L}_{H_j}^{1}, \tag{6}$$

where means that P𝑃Pitalic_P elements with the highest response would be selected.
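As a rough illustration of Eq. (5), the sketch below evaluates the top-k term on a toy attention map stored as nested Python lists. The function names (`topk_mean`, `masked`, `internal_loss`, `external_loss`) and the toy values are assumptions made for this example, not the authors' implementation; a real pipeline would operate on differentiable tensors so that the loss can update the latent. Applying the same helper to the complement mask gives the external term of Eq. (7).

```python
def topk_mean(values, p):
    """Mean of the p largest entries of a flat list (the (1/P) * sum topk(., P) term)."""
    top = sorted(values, reverse=True)[:p]
    return sum(top) / p


def masked(attn, mask, inside=True):
    """Flatten attn * mask (inside=True) or attn * (1 - mask) (inside=False)."""
    out = []
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            out.append(a * (m if inside else 1 - m))
    return out


def internal_loss(attn, mask, p):
    """Eq. (5): small when the strongest responses inside the mask are close to 1."""
    return 1.0 - topk_mean(masked(attn, mask, inside=True), p)


def external_loss(attn, mask, p):
    """Eq. (7): small when the strongest responses outside the mask are close to 0."""
    return topk_mean(masked(attn, mask, inside=False), p)


# Toy 3x3 attention map with a 2x2 box in the lower-right corner (hypothetical values).
attn = [[0.1, 0.2, 0.0],
        [0.0, 0.9, 0.8],
        [0.3, 0.7, 0.6]]
mask = [[0, 0, 0],
        [0, 1, 1],
        [0, 1, 1]]

loss_in = internal_loss(attn, mask, p=2)    # 1 - (0.9 + 0.8) / 2
loss_out = external_loss(attn, mask, p=2)   # (0.3 + 0.2) / 2
```

In this toy case the internal term is about 0.15 and the external term about 0.25, so both push attention toward the box.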

The internal constraint ensures that the target is generated inside the masked region but places no restriction on what is generated outside it. For the external constraint, we therefore require the attention response outside the mask to be as low as possible:

$$\mathcal{L}_{H_j}^{2} = \frac{1}{P}\sum \mathbf{topk}\left(\bar{H}_j^t \cdot (1 - \mathbf{M}_j),\, P\right), \tag{7}$$
$$\mathcal{L}_{Out} = \sum_{H_j \in \mathcal{H}} \mathcal{L}_{H_j}^{2}, \tag{8}$$

To obtain precise object or context boundary pixels that restrict the scale, while ensuring that object boundaries in the generated images are smoother and more natural, we propose the Boundary Alignment Smoothing Constraint (BASC). First, to maintain the position and scale consistency of objects within the specified bounding boxes, we project each target mask $M_j$ and the cross-attention map onto the x-axis and y-axis via the max operation. For the x-axis:

$$\mathbf{m}_x(k) = \mathbf{max}_{s=1,\cdots,S}\{\mathbf{M}_j(s,k)\}, \tag{9}$$
$$\mathbf{a}_x^t(k) = \mathbf{max}_{s=1,\cdots,S}\{\bar{H}_j^t(s,k)\}, \tag{10}$$

Here, $\mathbf{m}_x, \mathbf{a}_x^t \in \mathbb{R}^K$. Our goal is to optimize $\mathbf{a}_x^t$ so that it is close to $\mathbf{m}_x$:

$$\mathcal{L}_{H_j}^{3} = \frac{1}{L}\sum \mathbf{tokp}\left(\{|\mathbf{m}_x(k) - \mathbf{a}_x^t(k)|\}_{k=1}^{K},\, L,\, x_1^j,\, x_2^j\right), \tag{11}$$

Here, $\mathrm{tokp}(\cdot, L, x_1^j, x_2^j)$ denotes the uniform sampling of $L$ error terms from $\{|\mathbf{m}_x(k) - \mathbf{a}_x^t(k)|\}_{k=1}^{K}$, with these error terms surrounding the given alignment coordinates $x_1^j, x_2^j$ on the x-axis.
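The x-axis projection of Eqs. (9)-(10) and the alignment error of Eq. (11) can be sketched as follows. The text specifies only that tokp uniformly samples L error terms around the box edges; the window-based sampling in `sample_around`, like every function name and toy value here, is an assumption made for illustration rather than the authors' procedure.

```python
def project_x(grid):
    """Column-wise max over rows s = 1..S; returns a length-K vector (Eqs. 9-10)."""
    return [max(col) for col in zip(*grid)]


def sample_around(k_count, l_count, x1, x2):
    """Pick l_count column indices, split between windows near x1 and x2.

    Assumed sampling rule: contiguous windows clamped to the valid index range.
    """
    half = l_count // 2
    idx = []
    for centre, n in ((x1, half), (x2, l_count - half)):
        start = max(0, min(centre - n // 2, k_count - n))
        idx.extend(range(start, start + n))
    return idx


def alignment_loss_x(attn, mask, l_count, x1, x2):
    """Mean absolute gap between projected mask and projected attention (Eq. 11)."""
    m_x = project_x(mask)
    a_x = project_x(attn)
    errors = [abs(m - a) for m, a in zip(m_x, a_x)]
    idx = sample_around(len(errors), l_count, x1, x2)
    return sum(errors[i] for i in idx) / l_count


# Toy 4x6 example: the box spans columns 2..4 (hypothetical values).
mask = [[0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0, 0]]
attn = [[0.0, 0.1, 0.2, 0.1, 0.0, 0.0],
        [0.0, 0.2, 0.9, 0.8, 0.5, 0.3],
        [0.1, 0.1, 0.7, 1.0, 0.6, 0.2],
        [0.0, 0.0, 0.1, 0.1, 0.0, 0.0]]

loss_x = alignment_loss_x(attn, mask, l_count=4, x1=2, x2=4)
```

The y-axis term follows by transposing both grids before calling the same functions.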

Applying the same operation to the y-axis gives:

$$\mathbf{m}_y(s) = \mathbf{max}_{k=1,\cdots,K}\{\mathbf{M}_j(s,k)\}, \tag{12}$$
$$\mathbf{a}_y^t(s) = \mathbf{max}_{k=1,\cdots,K}\{\bar{H}_j^t(s,k)\}, \tag{13}$$
$$\mathcal{L}_{H_j}^{4} = \frac{1}{L}\sum \mathbf{tokp}\left(\{|\mathbf{m}_y(s) - \mathbf{a}_y^t(s)|\}_{s=1}^{S},\, L,\, y_1^j,\, y_2^j\right), \tag{14}$$

To make object boundaries in the generated images smoother and more natural, we minimize the first-order gradient variations and the second-order derivatives of the cross-attention maps:

$$\mathcal{L}_{\mathrm{sm}_j}^{(1)} = \frac{1}{S \times K}\sum_{(x,y)\in M_j}\left(\left|\frac{\partial \bar{H}_j^t(x,y)}{\partial x}\right| + \left|\frac{\partial \bar{H}_j^t(x,y)}{\partial y}\right|\right), \tag{15}$$
$$\mathcal{L}_{\mathrm{sm}_j}^{(2)} = \frac{1}{S \times K}\sum_{(x,y)\in M_j}\left(\left|\frac{\partial^2 \bar{H}_j^t(x,y)}{\partial x^2}\right| + \left|\frac{\partial^2 \bar{H}_j^t(x,y)}{\partial y^2}\right|\right), \tag{16}$$

The boundary alignment smoothing constraint is:

$$\mathcal{L}_{BASC} = \sum_{H_j \in \mathcal{H}}\left(\mathcal{L}_{H_j}^{3} + \mathcal{L}_{H_j}^{4}\right) + \lambda_1 \sum \mathcal{L}_{\mathrm{sm}_j}^{(1)} + \lambda_2 \sum \mathcal{L}_{\mathrm{sm}_j}^{(2)}, \tag{17}$$

where $\lambda_1$ and $\lambda_2$ are the smoothing weights for the first-order and second-order terms, respectively.
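The smoothing terms of Eqs. (15)-(16) and their combination in Eq. (17) can be approximated on a discrete attention map with finite differences. The sketch below is one possible discretization (forward differences for the first order, central differences for the second order, out-of-range neighbours skipped); this choice, the function names, and the example weights are assumptions, since the text does not state them. For brevity `basc_loss` handles a single token, whereas Eq. (17) sums over all tokens.

```python
def smoothness_terms(attn, mask):
    """Return the first- and second-order smoothness terms over masked pixels."""
    rows, cols = len(attn), len(attn[0])
    first = 0.0
    second = 0.0
    for s in range(rows):
        for k in range(cols):
            if not mask[s][k]:
                continue  # only pixels inside the mask contribute
            # First order: forward differences along both axes.
            if k + 1 < cols:
                first += abs(attn[s][k + 1] - attn[s][k])
            if s + 1 < rows:
                first += abs(attn[s + 1][k] - attn[s][k])
            # Second order: central differences along both axes.
            if 0 < k < cols - 1:
                second += abs(attn[s][k + 1] - 2 * attn[s][k] + attn[s][k - 1])
            if 0 < s < rows - 1:
                second += abs(attn[s + 1][k] - 2 * attn[s][k] + attn[s - 1][k])
    norm = rows * cols  # the 1 / (S x K) factor
    return first / norm, second / norm


def basc_loss(align_x, align_y, sm1, sm2, lam1, lam2):
    """Single-token version of Eq. (17); lam1 and lam2 are the smoothing weights."""
    return align_x + align_y + lam1 * sm1 + lam2 * sm2


# A linear ramp has constant slope, so only the first-order term is non-zero.
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
full_mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
sm1, sm2 = smoothness_terms(ramp, full_mask)

# Hypothetical alignment errors and weights, for illustration only.
total = basc_loss(0.1, 0.2, sm1=0.5, sm2=0.25, lam1=1.0, lam2=2.0)
```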

The total constraint is therefore:

$$\mathcal{L} = \mathcal{L}_{In} + \mathcal{L}_{Out} + \mathcal{L}_{BASC}. \tag{18}$$

Through the TIDSC module, we guide the model to allocate more attention to the specified bounding box area, thereby realizing defect localization and generation, and we output image annotation files that provide annotation information for downstream applications.
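Because the box is supplied by the user, writing the annotation amounts to serializing that box. The paper does not state the annotation file format; the sketch below assumes the YOLO text format (class index followed by the normalized box centre, width, and height), chosen here only because YOLO detectors are used in the downstream experiments. The function name and example box are hypothetical.

```python
def box_to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) into one YOLO-format label line."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w   # normalized box centre, x
    cy = (y1 + y2) / 2 / img_h   # normalized box centre, y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical example: class 2, box given on a 512x512 generated image.
line = box_to_yolo_line(2, (128, 256, 384, 384), 512, 512)
```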

IV Experiment and Analysis

To demonstrate the superiority of the proposed PDIG, we compare it with multiple state-of-the-art image generation approaches on the ELES dataset.

TABLE I: The initial dataset allocations for training and testing

| Dataset | Training: defective images | Training: defect-free images | Testing: defective images | Testing: defect-free images | Total | Acquisition time | Resolution |
|---|---|---|---|---|---|---|---|
| EL group1 | 6652 | - | 1110 | 1410 | 9172 | September 2020 | 384×384 |
| EL group2 | - | - | 3209 | 2541 | 5750 | May 2021 | 640×589 |
| EL group3 | - | - | 839 | 562 | 1401 | October 2021 | 700×668 |
Figure 5: Distribution of defect categories in each production line of the dataset. It illustrates the significant differences in the proportion of data among the three groups and the distribution of defect quantities within each group. For example, the number of “scratch” defects in EL group3 is only 62, highlighting a noticeable numerical discrepancy.

IV-A Datasets

The ELES dataset [5] is the first EL dataset of PV modules that exhibits endogenous shift. It was built from three groups of EL images collected at different times and contains 16,323 images in total. For a fair comparison, the training and testing sets in this study were assigned following [5]. As shown in Table I, we used only EL group1 to train the PDIG model. Fig. 5 depicts the distribution of defect categories in the ELES dataset. In addition, to support multimodal input, we enrich the text description of each image according to its defect types, using the template “A photo of {} with {} and {}”.

IV-B Evaluation Metrics

To assess the quality of the generated images, we use the Fréchet Inception Distance (FID) [45] and the Inception Score (IS) [46] as the main evaluation metrics. The FID is defined as:

$$\mathrm{FID}(x,\tilde{x}) = \left\|u_x - u_{\tilde{x}}\right\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_{\tilde{x}} - 2\left(\Sigma_x \Sigma_{\tilde{x}}\right)^{0.5}\right), \tag{19}$$

where $\mathrm{Tr}$ denotes the trace of a matrix, i.e., the sum of its diagonal elements; $x$ and $\tilde{x}$ denote the real and generated images; $u$ denotes the mean and $\Sigma$ the covariance matrix of their Inception features. A lower FID means that the generated images are closer to the real ones in terms of fidelity and diversity.
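A minimal NumPy sketch of Eq. (19) is given below, assuming the feature means and covariances have already been extracted. The function names are illustrative, not those of any particular FID package. Instead of forming the matrix square root explicitly, it uses the fact that the trace of the square root of the product of two positive semi-definite covariances equals the sum of the square roots of the product's eigenvalues.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # Eigenvalues of sigma1 @ sigma2 are real and non-negative for PSD inputs;
    # clip tiny negative values caused by round-off before taking the root.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def fid_from_features(real_feats, gen_feats):
    """FID from two (num_images, feature_dim) arrays of pre-extracted features."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)


# Toy check: means differ by 1 in one coordinate, covariances are I and 4I.
toy = frechet_distance([0, 0], np.eye(2), [1, 0], 4 * np.eye(2))
```

For the toy inputs the mean term is 1 and the trace term is 2 + 8 - 2·4 = 2, so the distance is 3.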

IS evaluates the model by computing the KL divergence between the predicted class distribution of each generated image and the average class distribution over all generated images. It is defined as:

$$\mathrm{IS} = \exp\left(\mathbb{E}_x\left[\mathrm{KL}\left(p(y|x)\,\|\,p(y)\right)\right]\right), \tag{20}$$

where $p(y|x)$ is the distribution of category prediction probabilities for a given image $x$ and $p(y)$ is the average distribution of category prediction probabilities over all generated images. A higher IS indicates higher quality and diversity of the generated images.
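Eq. (20) can be evaluated directly from a list of per-image class probabilities. The dependency-free sketch below assumes those probabilities have already been produced by a classifier; the function name is illustrative.

```python
import math


def inception_score(probs):
    """Inception Score from a list of per-image class distributions p(y|x)."""
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal p(y): average of the per-image distributions.
    p_y = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    kl_sum = 0.0
    for p in probs:
        # KL(p(y|x) || p(y)); terms with zero probability contribute nothing.
        kl_sum += sum(pc * math.log(pc / p_y[c]) for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


# Confident and varied predictions give a high score; identical ones give 1.
score_diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
score_uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

With two classes the score is bounded by 2, which the first toy case reaches; the second case gives the minimum value of 1.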

Figure 6: Qualitative comparison of ours with other models. Generated target defects are highlighted in red boxes.

IV-C Implementation Details

We trained the network using Python 3.10 on an NVIDIA RTX 3090 GPU. For the pre-trained model, we selected the lightweight and efficient SDv1.5 version and used the high-accuracy, multi-modal CLIP ViT-H/14 - LAION-2B model as the image encoder. During the training stage, we trained the proposed model using the MSE loss, with AdamW as the optimizer. The initial learning rate was set to 0.0001, weight decay to 0.01, and the generated image resolution was set to 512×512. The batch size was set to 8, and the training was conducted for over 1000 epochs. In the inference stage, we employed the DDIM sampler with the initial classifier guidance scale set to 7.5. When using only image prompts, we set $\alpha = 0$ and $\beta = 1.0$. Additionally, we set the inference steps to 60 and applied the TIDSC mechanism only during the first 30 steps.
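The inference settings above can be collected in a small configuration object. The sketch below only restates the reported values; the class name, field names, and the 0-based step convention are assumptions for illustration and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    num_steps: int = 60          # DDIM sampling steps
    tidsc_steps: int = 30        # TIDSC is applied only during the first 30 steps
    guidance_scale: float = 7.5  # guidance scale reported in the text
    resolution: int = 512        # generated images are 512x512


def tidsc_active(step, cfg):
    """True if the TIDSC constraint is applied at the 0-based sampling step."""
    return step < cfg.tidsc_steps


cfg = InferenceConfig()
schedule = [tidsc_active(i, cfg) for i in range(cfg.num_steps)]
```

Restricting the constraint to the early steps matches the common observation that layout is decided early in sampling, while later steps refine texture.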

TABLE II: Comparison of different image generation methods for various defect types. Bold results represent the optimal generation performance. Underlined results represent the second optimal generation performance.

| Defect types | Metric | DCGAN [47] | StyleGAN3 [48] | Textual Inversion [36] | IP-adapter [17] | T2I-adapter [18] | Ours |
|---|---|---|---|---|---|---|---|
| broken gate | FID ↓ | 151.97 | 39.6 | 175.90 | 58.56 | 73.41 | 22.37 |
| broken gate | IS ↑ | 1.29 | 1.29 | 1.62 | 1.51 | 1.56 | 1.65 |
| unjoined weld | FID ↓ | 140.70 | 35.5 | 171.23 | 56.62 | 76.37 | 19.03 |
| unjoined weld | IS ↑ | 1.26 | 1.36 | 1.99 | 1.38 | 1.41 | 1.48 |
| black spot | FID ↓ | 150.03 | 51.32 | 231.30 | 32.08 | 104.15 | 20.56 |
| black spot | IS ↑ | 1.26 | 1.34 | 2.68 | 1.38 | 1.52 | 1.54 |
| scratch | FID ↓ | 149.21 | 34.9 | 219.32 | 39.34 | 96.14 | 23.20 |
| scratch | IS ↑ | 1.24 | 1.26 | 2.70 | 1.42 | 1.41 | 1.50 |
| crack | FID ↓ | 152.38 | 39.00 | 187.42 | 59.07 | 88.85 | 19.34 |
| crack | IS ↑ | 1.26 | 1.28 | 2.40 | 1.38 | 1.56 | 1.46 |
| Average | FID ↓ | 148.86 | 40.06 | 197.03 | 49.13 | 87.78 | 20.90 |
| Average | IS ↑ | 1.26 | 1.32 | 2.28 | 1.41 | 1.49 | 1.53 |

IV-D Comparison With Recent Works

In this section, we selected several state-of-the-art image-generation algorithms currently used in the industrial field, including the GAN series, such as DCGAN [47] and StyleGAN3 [48], as well as text-to-image methods like Textual Inversion [36], IP-adapter [17], and T2I-adapter [18]. We evaluated the effectiveness of our method from both qualitative and quantitative perspectives.

We trained our model on the training set allocated for EL group1 images, as shown in Table I. During the inference stage, we generated 200 defect images for each type of defect to calculate the FID and IS metrics. The quantitative results of the models are presented in Table II. For every type of industrial defect, our PDIG method achieves a substantially lower FID than the existing methods. This result supports the effectiveness of the introduced SCE module and the proposed LISA module, which learn the industrial defect features embedded in images and text, including background styles and fine-grained features of different defect types, and generate high-quality and diverse defect data. The average FID of our method is 19.16 lower than that of the second-best method, StyleGAN3. In terms of the IS metric, our method ranks second, behind only Textual Inversion. However, the FID of Textual Inversion is 176.13 higher than ours, which indicates that it gains diversity at the expense of extremely low fidelity. Such a trade-off has little practical value for defect data augmentation.
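The two FID gaps quoted above follow directly from the per-defect values in Table II. The short check below recomputes the averages and the differences; the variable names are chosen for this example only.

```python
# Per-defect FID values copied from Table II
# (order: broken gate, unjoined weld, black spot, scratch, crack).
fid = {
    "StyleGAN3": [39.6, 35.5, 51.32, 34.9, 39.00],
    "Textual Inversion": [175.90, 171.23, 231.30, 219.32, 187.42],
    "Ours": [22.37, 19.03, 20.56, 23.20, 19.34],
}

# Average FID per method, rounded to two decimals as in the table.
mean_fid = {name: round(sum(vals) / len(vals), 2) for name, vals in fid.items()}

gap_stylegan3 = round(mean_fid["StyleGAN3"] - mean_fid["Ours"], 2)
gap_textual_inversion = round(mean_fid["Textual Inversion"] - mean_fid["Ours"], 2)
```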

The qualitative results of our model compared to other methods are shown in Fig. 6. Among the various types of defect images, the backgrounds generated by DCGAN are relatively blurry, and the effectiveness of defect generation is relatively low. StyleGAN3 can generally handle the background of defect images well, but it struggles to accurately interpret smaller defects such as black spots. Among text-to-image methods, Textual Inversion performs the worst. This is attributed to its training on only a limited number of images to learn the style, without adequately addressing the detailed features of the images. Consequently, the generated images exhibit significant differences from the original images. This also explains why Textual Inversion achieves the highest IS in Table II, as the increased diversity comes at the cost of authenticity. The IP-Adapter shows better performance in background generation but exhibits lower accuracy in generating individual defect features. For instance, it tends to produce black spots when generating crack images. The T2I-Adapter method, while effective in some aspects, falls short in background generation compared to the IP-Adapter, with issues such as misaligned and disconnected grid lines. Additionally, it generates defects with poor quality and abnormal textures.

Figure 7: Comparison of t-SNE Dimensionality Reduction and Probability Distribution Visualization for Generated Images: (a) Comparison of images generated using our method with the three datasets in ELES. (b) Comparison with images generated using different generation methods.
Figure 8: Visualization of ablation experiments. The results generated at each stage for different defect types are shown. The user-specified box is marked in red.
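TABLE III: Ablation study results. Bold results represent the optimal generation performance.

| SCE (training) | LISA (training) | TIDSC (inference) | FID ↓ | IS ↑ |
|---|---|---|---|---|
| ✓ | × | × | 197.03 | 2.28 |
| × | ✓ | × | 48.64 | 1.39 |
| × | × | ✓ | 265.31 | 3.72 |
| ✓ | ✓ | × | 36.78 | 1.40 |
| ✓ | × | ✓ | 174.13 | 2.05 |
| × | ✓ | ✓ | 40.22 | 1.42 |
| ✓ | ✓ | ✓ | 20.90 | 1.53 |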

In contrast, our method achieves results that are highly consistent with real defect images. It produces high-quality, authentic backgrounds and defect morphology, effectively capturing the essential features of the defects. Meanwhile, thanks to the strong denoising capability of the diffusion model, our method can generate data with a rich and diverse distribution. Fig. 7 (a) compares the distribution of the images generated by our method with the distribution of the original dataset. The generated images cover the distribution of the entire dataset, which mitigates the endogenous shift problem. Fig. 7 (b) compares the image distributions generated by different methods.

IV-E Ablation Studies and Analysis

In this section, we conduct an ablation study using the model trained on the EL group1 dataset to evaluate the impact of each component of our method. The results of this analysis are presented in Table III, which demonstrates the contributions of the individual components to the overall performance.

1) Contribution of SCE: We have evaluated the impact of the SCE module on generation. As shown in Fig. 8 and Table III, we can observe that this module is capable of effectively generating the style of PV EL images and thoroughly learning the structural features of PV grid lines. However, the module only performs a rough learning of the style and embeds defect-related text into the model. It is still not able to effectively distinguish and generate smaller-sized defects.

2) Contribution of LISA: We have assessed the impact of the LISA module on image generation. As shown in Table III, when only using the LISA module, the FID score is 48.64 and the IS score is 1.39. Compared to adding only the SCE module, the FID has decreased by 148.39, indicating a significant improvement in the quality and realism of the generated images. However, the IS has dropped by 0.89, which is attributed to the generated images being more similar to real images, thereby reducing image diversity. When the LISA module is added based on the SCE module, the FID is further reduced by 11.86. As shown in Fig. 8, after the LISA module is incorporated, the generated defect images are almost identical to the original images. The module is capable of effectively learning the features of various defects and can meticulously handle small-target defects, such as broken gate and black spot. For the unjoined weld defect, it can adeptly manage the contrast between the defect and the background. We have determined that the LISA module contains a total of 22.32M trainable parameters. Specifically, the linear projection layer contributes 3.15M parameters, while the cross-attention layers account for 19.17M parameters.

3) Contribution of TIDSC: We have evaluated the impact of the TIDSC module on image generation. During the inference stage, we incorporate the TIDSC module to achieve location-specific generation while simultaneously pre-generating annotations. As shown in Table III, when the TIDSC module is used alone, it essentially imposes text boundary control on the SD model. At this point, no feature learning of PV EL images takes place. Therefore, the metrics obtained from images generated solely based on text prompts are as follows: an FID of 265.31 and an IS of 3.72. Although the IS metric is the best in this case, it is meaningless to trade off image realism for diversity.

Figure 9: Visualization Results of the TIDSC Module. For different selected moments, the attention maps corresponding to the tokens describing the defect types and four image tokens are shown. The red box indicates the predefined regions of interest.

Furthermore, when we combine the SCE with the TIDSC module, the FID is 174.13, which is 91.18 lower than with TIDSC alone, and the IS is 2.05, which is 0.23 lower than with SCE alone. When LISA and TIDSC are combined, there is a noticeable improvement in the quality of generated images, with an FID of 40.22. Finally, when all three modules are used together, the FID of the generated images is 20.90 and the IS is 1.53. This demonstrates that our method produces images with high realism and good diversity.

As shown in Fig. 8, our method can generate specified defect images within the given bounding box and effectively produces images with diverse defect shapes and backgrounds. Fig. 9 shows the visualization results of the TIDSC module. When t=1, the attention values of the text and image tokens are uniformly distributed. When t=30, based on the text prompt, the attention can be focused as much as possible within the box. While focusing on the box, the image tokens pay more attention to the overall detailed features of the image.

A limitation is that the TIDSC module does not yet enforce the box restriction strictly. Because defects vary in size and position, the generated defects may overflow the box, as seen in the unjoined weld and crack images in Fig. 8. To measure the pass rate of localized generation and the bounding box overflow, we generated 200 images for each defect type in batches and manually annotated the qualified generated images. The verification results are shown in Table IV, with an average image generation pass rate of 84.74% and an average box spillover rate of 8.37%. This indicates that defect localization generation works reasonably well and offers a route to automatic annotation of generated images.

TABLE IV: The pass rate of image positioning generation and the spillover rate of box.

| Defect types | Pass Rate (%) | Box Spillover Rate (%) |
|---|---|---|
| broken gate | 81.90 | 8.70 |
| unjoined weld | 83.80 | 10.41 |
| black spot | 88.70 | 4.35 |
| crack | 85.30 | 9.00 |
| scratch | 84.00 | 9.40 |
| average | 84.74 | 8.37 |
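TABLE V: Defect detection performance after adding our generated data to the training of the YOLO models. ↓ indicates a decrease in the indicator. ↑ indicates an increase in the indicator.

| Datasets | Methods | broken gate | unjoined weld | black spot | scratch | crack | Average |
|---|---|---|---|---|---|---|---|
| EL group1 | YOLOv5 | 91.7 | 85.5 | 68.2 | 73.8 | 80.4 | 79.9 |
| EL group1 | +ours | 88.4 (↓3.3) | 86.4 (↑0.9) | 70.6 (↑2.7) | 70.9 (↓2.9) | 88.6 (↑8.2) | 81 (↑1.1) |
| EL group1 | YOLOv8 | 95.5 | 89.1 | 76.7 | 78.3 | 93.4 | 86.6 |
| EL group1 | +ours | 88.9 (↓6.6) | 93.1 (↑3.4) | 79.2 (↑2.5) | 80.8 (↑2.5) | 94.4 (↑1) | 87.3 (↑0.7) |
| EL group2 | YOLOv5 | 65.4 | 76.7 | 59.5 | 55.4 | 31.2 | 57.6 |
| EL group2 | +ours | 69.5 (↑6.6) | 82.4 (↑5.7) | 60.9 (↑1.4) | 60.5 (↑5.1) | 43.61 (↑12.41) | 63.4 (↑5.8) |
| EL group2 | YOLOv8 | 67.3 | 79.9 | 59.9 | 62.1 | 48.7 | 63.6 |
| EL group2 | +ours | 70.1 (↑2.8) | 81.7 (↑1.8) | 58.5 (↓1.4) | 65.7 (↑3.6) | 50.9 (↑2.2) | 65.4 (↑1.8) |
| EL group3 | YOLOv5 | 57.5 | 40.3 | 47.3 | 34 | 31.7 | 42.2 |
| EL group3 | +ours | 61.8 (↑4.3) | 38.9 (↓1.4) | 48.9 (↑1.6) | 40.8 (↑6.8) | 52 (↑20.3) | 48.5 (↑6.3) |
| EL group3 | YOLOv8 | 58 | 39.5 | 44.6 | 40.4 | 48.8 | 46.3 |
| EL group3 | +ours | 60.7 (↑2.7) | 42.4 (↑2.9) | 46.2 (↑1.6) | 43.3 (↑2.9) | 48.2 (↓0.6) | 48.1 (↑1.8) |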

IV-F Defect Detection

To verify that the images generated by our method can enhance defect detection, we generated 200 images with corresponding annotations for each defect type. As detectors, we used YOLOv5 and YOLOv8, which are widely applied to small-target defect detection. We trained both detectors on the EL group1 dataset and on the expanded EL group1 dataset, and then evaluated them on the test set of EL group1 as well as on EL group2 and EL group3.

The experimental results are shown in Table V. After the generated samples are added, the average mAP of the detection models improves on all three groups. With YOLOv5, the average mAP increases by 5.8 and 6.3 percentage points on EL group2 and EL group3, respectively. With YOLOv8, the average mAP improves by 1.8 percentage points on both EL group2 and EL group3. These results indicate that our method, by generating diverse images, can effectively mitigate the low detection accuracy caused by endogenous shifts.
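The gains quoted above are simple differences of the "Average" column in Table V. The snippet below recomputes them from the table values; the dictionary layout is an illustrative choice.

```python
# Average mAP before and after adding generated data, copied from Table V.
avg_map = {
    ("EL group1", "YOLOv5"): (79.9, 81.0),
    ("EL group1", "YOLOv8"): (86.6, 87.3),
    ("EL group2", "YOLOv5"): (57.6, 63.4),
    ("EL group2", "YOLOv8"): (63.6, 65.4),
    ("EL group3", "YOLOv5"): (42.2, 48.5),
    ("EL group3", "YOLOv8"): (46.3, 48.1),
}

# Gain in percentage points, rounded to one decimal as in the table.
gains = {key: round(after - before, 1) for key, (before, after) in avg_map.items()}

for (dataset, model), gain in gains.items():
    print(f"{dataset} {model}: +{gain}")
```

The largest gains occur on EL group2 and EL group3, the groups that were not used to train the generator.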

V Conclusion

This paper proposed PDIG, which addresses the endogenous shifts in real PV scenarios, the poor consistency and controllability of generated images, and the lack of labels for the generated data. First, PDIG uses a few industrial defect images and adds embedded priors through the SCE module to capture the specific relational concepts between defect types and their appearances. Next, the LISA module embeds industrial defect features into the SD model through cross-disentangled attention, thereby enriching the domain distribution of the generated images. Finally, during the inference stage, the TIDSC module ensures the quality of generated images via positional consistency and spatial smoothing alignment. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both realism and diversity of generation and effectively improves the performance of downstream defect detection tasks. However, our method also has limitations, such as overflow in localized generation. In the future, we will focus on further improving the accuracy of localized generation and on generating multiple instances simultaneously.

References

  • [1] M. Dhimish, V. d’Alessandro, and S. Daliento, “Investigating the impact of cracks on solar cells performance: Analysis based on nonuniform and uniform crack distributions,” IEEE Transactions on Industrial Informatics, vol. 18, no. 3, pp. 1684–1693, 2021.
  • [2] B. Su, H. Chen, P. Chen, G. Bian, K. Liu, and W. Liu, “Deep learning-based solar-cell manufacturing defect detection with complementary attention network,” IEEE Transactions on Industrial informatics, vol. 17, no. 6, pp. 4084–4095, 2020.
  • [3] B. Su, H. Chen, and Z. Zhou, “Baf-detector: An efficient cnn-based detector for photovoltaic cell defect detection,” IEEE Transactions on Industrial Electronics, vol. 69, no. 3, pp. 3161–3171, 2021.
  • [4] S. Zhao, H. Chen, C. Wang, and S. Shi, “Sncf-net: Scale-aware neighborhood correlation feature network for hotspot defect detection of photovoltaic farms,” Measurement, vol. 206, p. 112342, 2023.
  • [5] S. Zhao, H. Chen, C. Wang, and Z. Zhang, “Ssn: Shift suppression network for endogenous shift of photovoltaic defect detection,” IEEE Transactions on Industrial Informatics, 2023.
  • [6] Y. Wang, Z. Zhou, X. Tan, Y. Pan, J. Yuan, Z. Qiu, and C. Liu, “Unveiling the potential of progressive training diffusion model for defect image generation and recognition in industrial processes,” Neurocomputing, vol. 592, p. 127837, 2024.
  • [7] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
  • [8] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, “Generative adversarial networks: An overview,” IEEE signal processing magazine, vol. 35, no. 1, pp. 53–65, 2018.
  • [9] G. Zhang, K. Cui, T.-Y. Hung, and S. Lu, “Defect-gan: High-fidelity defect synthesis for automated defect inspection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2524–2534.
  • [10] C. Zhao, W. Xue, W.-P. Fu, Z.-Q. Li, and X. Fang, “Defect sample image generation method based on gans in diamond tool defect detection,” IEEE Transactions on Instrumentation and Measurement, vol. 72, pp. 1–9, 2023.
  • [11] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, “A survey on generative diffusion models,” IEEE Transactions on Knowledge and Data Engineering, 2024.
  • [12] F. Shen and J. Tang, “Imagpose: A unified conditional framework for pose-guided person generation,” Advances in neural information processing systems, vol. 37, pp. 6246–6266, 2024.
  • [13] F. Shen, X. Jiang, X. He, H. Ye, C. Wang, X. Du, Z. Li, and J. Tang, “Imagdressing-v1: Customizable virtual dressing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 6795–6804.
  • [14] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [15] Y. Xu, X. Xu, H. Gao, and F. Xiao, “Sgdm: An adaptive style-guided diffusion model for personalized text to image generation,” IEEE Transactions on Multimedia, vol. 26, pp. 9804–9813, 2024.
  • [16] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847.
  • [17] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023.
  • [18] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
  • [19] F. Shen, J. Yu, C. Wang, X. Jiang, X. Du, and J. Tang, “Imaggarment-1: Fine-grained garment generation for controllable fashion design,” arXiv preprint arXiv:2504.13176, 2025.
  • [20] F. Shen, C. Wang, J. Gao, Q. Guo, J. Dang, J. Tang, and T.-S. Chua, “Long-term talkingface generation via motion-prior conditional diffusion model,” arXiv preprint arXiv:2502.09533, 2025.
  • [21] S. Niu, B. Li, X. Wang, and Y. Peng, “Region-and strength-controllable gan for defect generation and segmentation in industrial images,” IEEE Transactions on Industrial Informatics, vol. 18, no. 7, pp. 4531–4541, 2021.
  • [22] D. Guan, J. Huang, A. Xiao, S. Lu, and Y. Cao, “Uncertainty-aware unsupervised domain adaptation in object detection,” IEEE Transactions on Multimedia, vol. 24, pp. 2502–2514, 2022.
  • [23] K. Zhang, Y. Xiao, J. Wang, M. Du, X. Guo, R. Zhou, C. Shi, and Z. Zhao, “Dp-gan: A transmission line bolt defects generation network based on dual discriminator architecture and pseudo-enhancement strategy,” IEEE Transactions on Power Delivery, 2024.
  • [24] H. Zhou, W. Wu, Y. Zhang, J. Ma, and H. Ling, “Semantic-supervised infrared and visible image fusion via a dual-discriminator generative adversarial network,” IEEE Transactions on Multimedia, vol. 25, pp. 635–648, 2023.
  • [25] J. Wang, Z. Yang, J. Zhang, Q. Zhang, and W.-T. K. Chien, “Adabalgan: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition,” IEEE Transactions on Semiconductor Manufacturing, vol. 32, no. 3, pp. 310–319, 2019.
  • [26] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • [27] Y. Tian, G. Yang, Z. Wang, E. Li, and Z. Liang, “Detection of apple lesions in orchards based on deep learning methods of cyclegan and yolov3-dense,” Journal of Sensors, vol. 2019, no. 1, p. 7630926, 2019.
  • [28] S. Niu, B. Li, X. Wang, and H. Lin, “Defect image sample generation with gan for improving defect recognition,” IEEE Transactions on Automation Science and Engineering, vol. 17, no. 3, pp. 1611–1622, 2020.
  • [29] F. Deng, J. Luo, L. Fu, Y. Huang, J. Chen, N. Li, J. Zhong, and T. L. Lam, “Dg2gan: improving defect recognition performance with generated defect image sample,” Scientific Reports, vol. 14, no. 1, p. 14787, 2024.
  • [30] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning.   PMLR, 2015, pp. 2256–2265.
  • [31] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  • [32] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [33] J. Choi, S. Kim, Y. Jeong, Y. Gwon, and S. Yoon, “Ilvr: Conditioning method for denoising diffusion probabilistic models,” arXiv preprint arXiv:2108.02938, 2021.
  • [34] X. Zhong, J. Zhu, W. Liu, C. Hu, Y. Deng, and Z. Wu, “An overview of image generation of industrial surface defects,” Sensors, vol. 23, no. 19, p. 8160, 2023.
  • [35] T. Hu, J. Zhang, R. Yi, Y. Du, X. Chen, L. Liu, Y. Wang, and C. Wang, “Anomalydiffusion: Few-shot anomaly image generation with diffusion model,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 8, 2024, pp. 8526–8534.
  • [36] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
  • [37] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 22 500–22 510.
  • [38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022.
  • [39] Q. Wang, X. Bai, H. Wang, Z. Qin, A. Chen, H. Li, X. Tang, and Y. Hu, “Instantid: Zero-shot identity-preserving generation in seconds,” arXiv preprint arXiv:2401.07519, 2024.
  • [40] F. Shen, H. Ye, J. Zhang, C. Wang, X. Han, and W. Yang, “Advancing pose-guided image synthesis with progressive conditional diffusion models,” arXiv preprint arXiv:2310.06313, 2023.
  • [41] F. Shen, H. Ye, S. Liu, J. Zhang, C. Wang, X. Han, and Y. Wei, “Boosting consistency in story visualization with rich-contextual conditional diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 6785–6794.
  • [42] D. P. Kingma, M. Welling et al., “Auto-encoding variational bayes,” 2013.
  • [43] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PmLR, 2021, pp. 8748–8763.
  • [44] J. Xie, Y. Li, Y. Huang, H. Liu, W. Zhang, Y. Zheng, and M. Z. Shou, “Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7452–7461.
  • [45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [46] S. Barratt and R. Sharma, “A note on the inception score,” arXiv preprint arXiv:1801.01973, 2018.
  • [47] A. Patil et al., “Dcgan: Deep convolutional gan with attention module for remote view classification,” in 2021 International Conference on Forensics, Analytics, Big Data, Security (FABS), vol. 1.   IEEE, 2021, pp. 1–10.
  • [48] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” in Proc. NeurIPS, 2021.