LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

Sagar, Som; Taparia, Aditya; Senanayake, Ransalu

arXiv:2410.16738 (cs)

[Submitted on 22 Oct 2024]

Title: LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded"

Authors: Som Sagar, Aditya Taparia, Ransalu Senanayake

Abstract: In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug or audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we improve the "Failures are fated, but can be faded" framework (arXiv:2406.07145)--a post-hoc method to explore and construct the failure landscape in pre-trained generative models--with a variety of deep reinforcement learning algorithms, screening tests, and LLM-based rewards and state generation. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically demonstrate the effectiveness of the proposed method on diffusion models. We also highlight the strengths and weaknesses of each algorithm in identifying failure modes.

Comments:	13 pages, 11 figures. arXiv admin note: substantial text overlap with arXiv:2406.07145
Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2410.16738 [cs.LG]
	(or arXiv:2410.16738v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.16738

Submission history

From: Som Sagar
[v1] Tue, 22 Oct 2024 06:46:09 UTC (7,955 KB)
