
TrustyAI Detoxify: Guardrailing LLMs during training

August 1, 2024
Christina Xu
Related topics:
Artificial intelligence
Related products:
Red Hat OpenShift AI

    Detoxifying, or preventing toxic content generation from, large language models (LLMs) is challenging. The data used to train these models is usually scraped from the internet, which often contains toxic content. Without proper guardrails, a model can learn undesirable properties and, in turn, generate toxic text. Removing toxic samples from training data can be expensive, as it usually requires data annotators to manually identify samples that align with human values. Inherent bias in the annotators themselves can also negatively affect the labeling process. By using the open source project TrustyAI Detoxify in conjunction with Hugging Face's SFTTrainer (Supervised Fine-Tuning Trainer) on Red Hat OpenShift AI, we can help lower the costs of detoxifying LLMs during training.

    In this article, we will provide step-by-step guidance on how you can use these open source technologies to detoxify a model.

    What is TrustyAI Detoxify?

    TrustyAI Detoxify is a library of algorithms and tools for detecting and rephrasing hate speech, abuse, and profanity in LLM-generated text. It uses a pair of expert and anti-expert models to generate disagreement scores for next-token predictions, which approximate toxicity. It then masks and rephrases the tokens with the highest disagreement scores, reducing the toxicity of the text.
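    To make that flow concrete, here is a minimal sketch of the score, mask, and rephrase pipeline, mirroring the TMaRCo calls used later in this article (the sample sentence is purely illustrative):

    from trustyai.detoxify import TMaRCo

    tmarco = TMaRCo()
    tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

    text = ["You are such an idiot for thinking that would work."]
    # Per-token disagreement scores between the expert and anti-expert models
    scores = tmarco.score(text)
    # Mask the tokens with the highest disagreement scores
    masked = tmarco.mask(text, scores=scores)
    # Rephrase the masked tokens into a less toxic completion
    print(tmarco.rephrase(text, masked_outputs=masked))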

    How can we leverage SFTTrainer to optimize LLM detoxification?

    SFTTrainer simplifies supervised fine-tuning for LLMs, a critical step in training LLMs to be useful assistants or chatbots. Typically, for supervised fine-tuning, one would need to manually format their dataset into an instruction or conversation format before model training. SFTTrainer takes care of this step, plus training, in only a few lines of code.

    Furthermore, SFTTrainer supports QLoRA (Quantized Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique that optimizes memory usage. Training LLMs can be memory intensive given the number of parameters updated during the process. With QLoRA, the base model's weights are compressed down to 4 bits and frozen, and a relatively small number of trainable parameters is added in the form of adapters. During fine-tuning, only the adapter weights are updated.
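    As a rough illustration of how few parameters QLoRA actually trains, the following sketch (an assumed configuration, not taken from the article's notebook) loads a 4-bit base model and attaches LoRA adapters:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m", quantization_config=bnb_config
    )
    peft_model = get_peft_model(
        base, LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
    )
    # Prints trainable vs. total parameters; typically only a few percent are trainable
    peft_model.print_trainable_parameters()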

    In the demo below, we’ll use TrustyAI in a Jupyter environment on a Red Hat OpenShift cluster with two NVIDIA GPUs.

    Demo setup

    First, let’s set up your working environment as a project within OpenShift AI. 

    1. Log in to the OpenShift AI dashboard on your OpenShift cluster.
    2. Navigate to Data Science Projects.
    3. Click the Create data science project button.
    4. Give your project a name, for example, "detoxify-sft".
    5. Finally, click Create.

    Create Workbench 

    You can define the cluster size and compute resources needed to run the workload.

    1. Click the Workbenches tab and create a workbench with the following specifications:

      Name: detoxify-sft

      Image selection: TrustyAI

      Version selection: 2024.1

      Container size: Large

      Accelerator: NVIDIA GPU

      Number of accelerators: 2

    2. Click Create workbench. You will be redirected to the project dashboard where the workbench is starting. Wait a few minutes for your workbench status to change from Starting to Running.
    3. Access your workbench by clicking Open.

    Set up development environment

    Once you are in your Jupyter environment, click the Git icon on the left side of the screen. Click Clone a repository and paste the Detoxify SFT repository URL.

    Our first step is to install the required Hugging Face libraries, including Transformer Reinforcement Learning (TRL), Transformers, and Datasets. Navigate to detoxify-sft -> notebooks -> 1-sft.ipynb. A requirements.txt file has been preconfigured with the required libraries and their correct versions:

    !pip install -r requirements.txt

    Import the required libraries and packages into your Jupyter environment:

    from transformers import (
        AutoTokenizer,
        AutoModelForCausalLM,
        DataCollatorForLanguageModeling,
        BitsAndBytesConfig,
        Trainer,
        TrainingArguments,
        set_seed,
    )
    from datasets import load_dataset, load_from_disk
    from peft import LoraConfig
    from trl import SFTTrainer
    from trl.trainer import ConstantLengthDataset
    import numpy as np
    import torch
    from trustyai.detoxify import TMaRCo

    Create and prepare the dataset

    We are going to fine-tune our model on a prompt completion task. We will use an existing open source dataset called allenai/real-toxicity-prompts, which contains samples of natural language prompts and their corresponding metadata, including completions. We load the data using the Hugging Face Datasets library:

    dataset_name = "allenai/real-toxicity-prompts"
    raw_dataset = load_dataset(dataset_name, split="train").flatten()
    print(raw_dataset.column_names)
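    After flatten(), the nested prompt and continuation fields are exposed as dotted column names such as prompt.text and continuation.text, which the preprocessing functions below rely on.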

    Next, we load the TrustyAI Detoxify expert and anti-expert models to prepare for rephrasing toxic prompt samples in the dataset:

    # load TMaRCo expert and non-expert models
    tmarco = TMaRCo()
    tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

    Since we are fine-tuning the model on a prompt completion task, we need to format the data by concatenating the prompt and continuation text:

    def preprocess_func(sample):
        # Concatenate prompt and continuation text
        sample['text'] = f"Prompt: {sample['prompt.text']}\nContinuation:{sample['continuation.text']}"
        return sample

    We explicitly mask and rephrase tokens in training samples that have a disagreement score over 0.6. The default threshold is 1.2, but we want to increase the sensitivity of the algorithm for a more obvious effect:

    def rephrase_func(sample):
        # Calculate disagreement scores
        scores = tmarco.score([sample['text']])
        # Mask tokens with disagreement scores over 0.6
        masked_outputs = tmarco.mask([sample['text']], scores=scores, threshold=0.6)
        # Rephrase the text by replacing the masked tokens
        sample['text'] = tmarco.rephrase([sample['text']], masked_outputs=masked_outputs, expert_weights=[-0.5, 4], combine_original=True)[0]
        return sample

    We also define a group_texts helper that concatenates the tokenized samples and splits them into fixed-length blocks of 128 tokens for training:

    block_size = 128

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder; if the model supported padding, we could
        # pad instead of dropping. Customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split into chunks of block_size.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result
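    Copying input_ids into labels is the standard setup for causal language modeling: the model shifts the labels internally during the loss computation, so each token is trained to predict the next one.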

    We split the dataset into a training set with 1000 samples and a testing set with 400:

    dataset = raw_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
    # select 1000 samples of training data
    train_data = dataset["train"].select(indices=range(0, 1000))
    # select 400 samples of evaluation data
    eval_data = dataset["test"].select(indices=range(0, 400))

    The model we are using is facebook/opt-350m, and we explicitly set the pad token to the EOS token and pad on the right to indicate the end of a prompt completion:

    model_id = "facebook/opt-350m"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"
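    Note that the preprocessing step below calls a tokenize_func that the article does not show. A minimal sketch under that assumption, tokenizing the formatted text without padding or truncation so group_texts can re-chunk the token lists:

    def tokenize_func(examples):
        # Tokenize the concatenated prompt/continuation text; group_texts
        # expects variable-length token lists, so no padding here.
        return tokenizer(examples["text"])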

    We can finally apply all of the data preprocessing functions on the dataset:

    train_ds = train_data.map(preprocess_func, remove_columns=train_data.column_names)
    eval_ds = eval_data.map(preprocess_func, remove_columns=eval_data.column_names)
    # select samples whose length is less than or equal to the mean length of the training set
    mean_length = np.mean([len(text) for text in train_ds['text']])
    train_ds = train_ds.filter(lambda x: len(x['text']) <= mean_length)
    tokenized_train_ds = train_ds.map(tokenize_func, batched=True, remove_columns=train_ds.column_names)
    tokenized_eval_ds = eval_ds.map(tokenize_func, batched=True, remove_columns=eval_ds.column_names)
    print(f"Size of training set: {len(tokenized_train_ds)}\nSize of evaluation set: {len(tokenized_eval_ds)}")
    # rephrase toxic tokens in the training samples with TMaRCo
    rephrased_train_ds = train_ds.map(rephrase_func)
    tokenized_train_ds = tokenized_train_ds.map(group_texts, batched=True)
    tokenized_eval_ds = tokenized_eval_ds.map(group_texts, batched=True)

    Fine-tune LLM using SFTTrainer

    SFTTrainer takes care of applying QLoRA to our model and training it. We simply need to pass a LoraConfig object to SFTTrainer, which defines the layers of the base model the adapters are added to. Typically, one applies LoRA to the linear projection matrices of the attention layers of a Transformer. We also specify other training arguments, such as training for 5 epochs with a batch size of 1:

    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
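    # NOTE: the trainer below references model_kwargs, which the article never
    # defines. A plausible QLoRA setup (an assumption, not from the source)
    # loads the base model in 4-bit precision:
    model_kwargs = dict(
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    )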
    training_args = TrainingArguments(
        output_dir="../models/opt-350m_CASUAL_LM",
        evaluation_strategy="epoch",
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=5,
        learning_rate=1e-04,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
    )
    trainer = SFTTrainer(
        model=model_id,
        model_init_kwargs=model_kwargs,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=rephrased_train_ds,
        eval_dataset=eval_ds,
        dataset_text_field="text",
        peft_config=peft_config,
        max_seq_length=min(tokenizer.model_max_length, 512),
    )

    To start training, we simply call train() on our SFTTrainer instance:

    trainer.train()
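    Although the article does not show it, you would typically persist the fine-tuned adapter once training finishes; trainer.save_model writes it to a directory of your choice (the path here is illustrative):

    trainer.save_model("../models/opt-350m_DETOXIFY_CAUSAL_LM")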

    Model evaluation 

    Once the model is done training, we want to test and evaluate it. Go to 2-eval.ipynb. We measure how effective detoxification was by comparing the outputs of the fine-tuned model, named exyou/opt-350m_DETOXIFY_CAUSAL_LM, with another model trained on the same dataset without rephrasing, named exyou/opt-350m_CASUAL_LM. We evaluate both models on 400 samples of an unseen dataset called OxAISH-AL-LLM/wiki_toxic:

    from peft import AutoPeftModelForCausalLM
    import evaluate

    # assumption: run on GPU when available
    device = "cuda" if torch.cuda.is_available() else "cpu"

    dataset = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="test")
    # filter for toxic prompts
    dataset = dataset.filter(lambda x: x["label"] == 1).shuffle(seed=42).select(indices=range(0, 400))
    print(dataset.column_names)

    model_id = "exyou/opt-350m_CASUAL_LM"
    peft_model_id = "exyou/opt-350m_DETOXIFY_CAUSAL_LM"
    # toxic model
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
    # detoxified model
    peft_model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model_id,
        device_map=device,
        torch_dtype=torch.bfloat16,
    )
    models_to_test = {model_id: model, peft_model_id: peft_model}

    We load the tokenizer and prepare our data in prompt-completion format. Next, we use the generate() method to produce new token IDs one at a time. Note that there are various generation strategies, such as greedy decoding or beam search. We instruct the model to generate at most 30 new tokens. By setting do_sample=True with top_k=50 and top_p=0.95, sampling is restricted to the 50 most likely tokens, further narrowed to the smallest set of tokens, ordered from most to least probable, whose probabilities sum to more than 0.95. Setting temperature=0.7 sharpens the probability distribution so that higher-probability tokens are favored. Lastly, repetition_penalty=1.2 discourages the model from repeating the same tokens. Finally, we use the tokenizer's batch_decode method to turn the generated token IDs back into strings:

    # truncate prompts to a length of 2000 characters
    context_length = 2000
    output_texts = {}
    # load tokenizer; set the pad token and padding side to prevent warnings
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    for model_name in models_to_test.keys():
        model = models_to_test[model_name]
        output_texts[model_name] = []
        for i, example in enumerate(dataset):
            torch.manual_seed(42)
            input_text = example["comment_text"][:context_length]
            inputs = tokenizer(
                f"Prompt: {input_text}\nContinuation:",
                padding=True,
                return_tensors="pt",
            ).to(device)
            # define generation args
            generated_texts = model.generate(
                **inputs,
                max_new_tokens=30,
                do_sample=True,
                temperature=0.7,
                top_k=50,
                top_p=0.95,
                repetition_penalty=1.2,  # discourages repetition
            )
            generated_texts = tokenizer.batch_decode(
                generated_texts.detach().cpu().numpy(),
                skip_special_tokens=True,
            )
            output_texts[model_name].append(generated_texts[0][len(input_text):])
        # release the model reference to free up GPU memory
        model = None
        torch.cuda.empty_cache()

    We use Hugging Face’s evaluate library to measure toxicity. We calculate the mean toxicity and standard deviation for the outputs of both models:

    toxicity = evaluate.load("toxicity", module_type="measurement")
    toxicities = {}
    for model_name in list(models_to_test.keys()):
        toxicities[model_name] = []
        for generated_text in output_texts[model_name]:
            # compute returns {"toxicity": [scores]} for a list of predictions
            score = toxicity.compute(predictions=[generated_text])["toxicity"][0]
            toxicities[model_name].append(score)
        print("##"*5 + f"Model {model_name}" + "##"*5)
        print(f"Mean toxicity: {np.mean(toxicities[model_name])}")
        print(f"Std: {np.std(toxicities[model_name])}")
        print(" ")
    ##########Model exyou/opt-350m_CASUAL_LM##########
    Mean toxicity: 0.0021838806330140496
    Std: 0.0030681457729977765

    ##########Model exyou/opt-350m_DETOXIFY_CAUSAL_LM##########
    Mean toxicity: 0.00185816638216892
    Std: 0.0018717325487378443
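    For context, the toxicity measurement scores each completion with a pretrained hate speech classifier (facebook/roberta-hate-speech-dynabench-r4-target by default), so these values are the classifier's estimated probability that a completion is toxic.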

    Results 

    Our detoxification method reduced the mean toxicity by ~0.0003 and the standard deviation by ~0.0012. While these improvements may seem small, the effect of detoxification is apparent when comparing the outputs of the two models.

    Conclusion

    In this article, we set up a working cluster using OpenShift AI, loaded and preprocessed the Hugging Face dataset, and rephrased toxic prompt samples using the TrustyAI Detoxify expert and anti-expert models. Once we finished preprocessing the dataset, we used SFTTrainer to fine-tune the LLM. To measure how effective detoxification was, we compared the model's outputs with those of an LLM trained on the same dataset without rephrasing.

    Using tooling like TrustyAI Detoxify and Hugging Face's SFTTrainer can help reduce the need to hire data annotators and can help reduce hate speech, harassment, and threats in LLM-generated text.

    In addition to LLM detoxification, the TrustyAI upstream project offers other tools for model monitoring, such as bias monitoring and data drift detection. You can check out the demos on GitHub.

    Learn more about Red Hat OpenShift AI.
