
TrustyAI Detoxify: Guardrailing LLMs during training

August 1, 2024
Christina Xu
Related topics:
Artificial intelligence
Related products:
Red Hat OpenShift AI

    Detoxifying, or preventing toxic content generation from, large language models (LLMs) is challenging. The data used to train these models is usually scraped from the internet, which often contains toxic content. Without proper guardrails, a model can learn undesirable properties and, in turn, generate toxic text. Removing toxic samples from training data can be expensive, as it usually requires data annotators to manually identify samples that align with human values. Inherent bias in the annotators themselves can also negatively affect the labeling process. By using the open source project TrustyAI Detoxify in conjunction with Hugging Face's SFTTrainer (Supervised Fine-Tuning Trainer) on Red Hat OpenShift AI, we can help lower the costs of detoxifying LLMs during training.

    In this article, we will provide step-by-step guidance on how you can use these open source technologies to detoxify a model.

    What is TrustyAI Detoxify?

    TrustyAI Detoxify is a library of algorithms and tools for detecting and rephrasing hate speech, abuse, and profanity in LLM-generated text. It uses a pair of expert and anti-expert models to generate disagreement scores for next-token predictions, which approximate toxicity. It then masks and rephrases the tokens with the highest disagreement scores, reducing the toxicity of the text.
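    To make that flow concrete, here is a minimal sketch of the score, mask, and rephrase pipeline, mirroring the TMaRCo calls used later in this article (the sample sentence is purely illustrative):

    from trustyai.detoxify import TMaRCo

    tmarco = TMaRCo()
    tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

    text = ["You are such an idiot for thinking that would work."]
    # Per-token disagreement scores between the expert and anti-expert models
    scores = tmarco.score(text)
    # Mask the tokens with the highest disagreement scores
    masked = tmarco.mask(text, scores=scores)
    # Rephrase the masked tokens into a less toxic completion
    print(tmarco.rephrase(text, masked_outputs=masked))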

    How can we leverage SFTTrainer to optimize LLM detoxification?

    SFTTrainer simplifies supervised fine-tuning for LLMs, a critical step in training LLMs to be useful assistants or chatbots. Typically, for supervised fine-tuning, one would need to manually format their dataset into an instruction or conversation format before model training. SFTTrainer takes care of this step, plus training, in only a few lines of code.

    Furthermore, SFTTrainer supports QLoRA (Quantized Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) technique that optimizes memory usage. Training LLMs can be memory intensive given the number of parameters updated during the process. With QLoRA, the base model's weights are compressed down to 4 bits and frozen, and a relatively small number of trainable parameters is added in the form of adapters. During fine-tuning, only the adapter weights are updated.
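    As a rough illustration of how few parameters QLoRA actually trains, the following sketch (an assumed configuration, not taken from the article's notebook) loads a 4-bit base model and attaches LoRA adapters:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-350m", quantization_config=bnb_config
    )
    peft_model = get_peft_model(
        base, LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
    )
    # Prints trainable vs. total parameters; typically only a few percent are trainable
    peft_model.print_trainable_parameters()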

    In the demo below, we’ll use TrustyAI in a Jupyter environment on a Red Hat OpenShift cluster with two NVIDIA GPUs.

    Demo setup

    First, let’s set up your working environment as a project within OpenShift AI. 

    1. Log in to the OpenShift AI dashboard on your OpenShift cluster.
    2. Navigate to Data Science Projects.
    3. Click the Create data science project button.
    4. Give your project a name, for example, "detoxify-sft".
    5. Finally, click Create.

    Create Workbench 

    You can define the cluster size and compute resources needed to run the workload.

    1. Click the Workbenches tab and create a workbench with the following specifications:

      Name: detoxify-sft

      Image selection: TrustyAI

      Version selection: 2024.1

      Container size: Large

      Accelerator: NVIDIA GPU

      Number of accelerators: 2

    2. Click Create workbench. You will be redirected to the project dashboard where the workbench is starting. Wait a few minutes for your workbench status to change from Starting to Running.
    3. Access your workbench by clicking Open.

    Set up development environment

    Once you are in your Jupyter environment, click the Git icon on the left side of the screen. Click Clone a repository and paste the Detoxify SFT repository URL.

    Our first step is to install the required Hugging Face libraries, including Transformer Reinforcement Learning (TRL), Transformers, and Datasets. Navigate to detoxify-sft -> notebooks -> 1-sft.ipynb. A requirements.txt file has been preconfigured with the required libraries and their correct versions:

    !pip install -r requirements.txt

    Import the required libraries and packages into your Jupyter environment:

    from transformers import (
        AutoTokenizer,
        AutoModelForCausalLM,
        DataCollatorForLanguageModeling,
        BitsAndBytesConfig,
        Trainer,
        TrainingArguments,
        set_seed,
    )
    from datasets import load_dataset, load_from_disk
    from peft import LoraConfig
    from trl import SFTTrainer
    from trl.trainer import ConstantLengthDataset
    import numpy as np
    import torch
    from trustyai.detoxify import TMaRCo

    Create and prepare the dataset

    We are going to fine-tune our model on a prompt completion task. We will use an existing open source dataset called allenai/real-toxicity-prompts, which contains samples of natural language prompts and their corresponding metadata, including completions. We load the data using the Hugging Face Datasets library:

    dataset_name = "allenai/real-toxicity-prompts"
    raw_dataset = load_dataset(dataset_name, split="train").flatten()
    print(raw_dataset.column_names)
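    After flatten(), the nested prompt and continuation fields are exposed as dotted column names such as prompt.text and continuation.text, which the preprocessing functions below rely on.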

    Next, we load the TrustyAI Detoxify expert and anti-expert models to prepare for rephrasing toxic prompt samples in the dataset:

    # load TMaRCo expert and non-expert models
    tmarco = TMaRCo()
    tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

    Since we are fine-tuning the model on a prompt completion task, we need to format the data by concatenating the prompt and continuation text:

    def preprocess_func(sample):
        # Concatenate prompt and continuation text
        sample['text'] = f"Prompt: {sample['prompt.text']}\nContinuation:{sample['continuation.text']}"
        return sample

    We explicitly mask and rephrase tokens in training samples that have a disagreement score over 0.6. The default threshold is 1.2, but we want to increase the sensitivity of the algorithm for a more obvious effect:

    def rephrase_func(sample):
        # Calculate disagreement scores
        scores = tmarco.score([sample['text']])
        # Mask tokens with disagreement scores over 0.6
        masked_outputs = tmarco.mask([sample['text']], scores=scores, threshold=0.6)
        # Rephrase the text by replacing the masked tokens
        sample['text'] = tmarco.rephrase([sample['text']], masked_outputs=masked_outputs, expert_weights=[-0.5, 4], combine_original=True)[0]
        return sample

    We also define a group_texts helper that concatenates the tokenized samples and splits them into fixed-length blocks of 128 tokens for training:

    block_size = 128

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder; if the model supported padding, we could
        # pad instead of dropping. Customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split into chunks of block_size.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result
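    Copying input_ids into labels is the standard setup for causal language modeling: the model shifts the labels internally during the loss computation, so each token is trained to predict the next one.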

    We split the dataset into a training set with 1000 samples and a testing set with 400:

    dataset = raw_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)
    # select 1000 samples of training data
    train_data = dataset["train"].select(indices=range(0, 1000))
    # select 400 samples of evaluation data
    eval_data = dataset["test"].select(indices=range(0, 400))

    The model we are using is facebook/opt-350m, and we explicitly set the pad token to the EOS token and pad on the right to indicate the end of a prompt completion:

    model_id = "facebook/opt-350m"
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.padding_side = "right"
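    Note that the preprocessing step below calls a tokenize_func that the article does not show. A minimal sketch under that assumption, tokenizing the formatted text without padding or truncation so group_texts can re-chunk the token lists:

    def tokenize_func(examples):
        # Tokenize the concatenated prompt/continuation text; group_texts
        # expects variable-length token lists, so no padding here.
        return tokenizer(examples["text"])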

    We can finally apply all of the data preprocessing functions on the dataset:

    train_ds = train_data.map(preprocess_func, remove_columns=train_data.column_names)
    eval_ds = eval_data.map(preprocess_func, remove_columns=eval_data.column_names)
    # select samples whose length is less than or equal to the mean length of the training set
    mean_length = np.mean([len(text) for text in train_ds['text']])
    train_ds = train_ds.filter(lambda x: len(x['text']) <= mean_length)
    tokenized_train_ds = train_ds.map(tokenize_func, batched=True, remove_columns=train_ds.column_names)
    tokenized_eval_ds = eval_ds.map(tokenize_func, batched=True, remove_columns=eval_ds.column_names)
    print(f"Size of training set: {len(tokenized_train_ds)}\nSize of evaluation set: {len(tokenized_eval_ds)}")
    # rephrase toxic tokens in the training samples with TMaRCo
    rephrased_train_ds = train_ds.map(rephrase_func)
    tokenized_train_ds = tokenized_train_ds.map(group_texts, batched=True)
    tokenized_eval_ds = tokenized_eval_ds.map(group_texts, batched=True)

    Fine-tune LLM using SFTTrainer

    SFTTrainer takes care of applying QLoRA to our model and training it. We simply need to pass a LoraConfig object to SFTTrainer, which defines the layers of the base model the adapters are added to. Typically, one applies LoRA to the linear projection matrices of the attention layers of a Transformer. We also specify other training arguments, such as training for 5 epochs with a batch size of 1:

    peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
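    # NOTE: the trainer below references model_kwargs, which the article never
    # defines. A plausible QLoRA setup (an assumption, not from the source)
    # loads the base model in 4-bit precision:
    model_kwargs = dict(
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
    )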
    training_args = TrainingArguments(
        output_dir="../models/opt-350m_CASUAL_LM",
        evaluation_strategy="epoch",
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        num_train_epochs=5,
        learning_rate=1e-04,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
    )
    trainer = SFTTrainer(
        model=model_id,
        model_init_kwargs=model_kwargs,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=rephrased_train_ds,
        eval_dataset=eval_ds,
        dataset_text_field="text",
        peft_config=peft_config,
        max_seq_length=min(tokenizer.model_max_length, 512),
    )

    To start training, we simply call train() on our SFTTrainer instance:

    trainer.train()
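    Although the article does not show it, you would typically persist the fine-tuned adapter once training finishes; trainer.save_model writes it to a directory of your choice (the path here is illustrative):

    trainer.save_model("../models/opt-350m_DETOXIFY_CAUSAL_LM")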

    Model evaluation 

    Once the model is done training, we want to test and evaluate it. Go to 2-eval.ipynb. We measure how effective detoxification was by comparing the outputs of the fine-tuned model, named exyou/opt-350m_DETOXIFY_CAUSAL_LM, with another model trained on the same dataset without rephrasing, named exyou/opt-350m_CASUAL_LM. We evaluate both models on 400 samples of an unseen dataset called OxAISH-AL-LLM/wiki_toxic:

    from peft import AutoPeftModelForCausalLM
    import evaluate

    # assumption: run on GPU when available
    device = "cuda" if torch.cuda.is_available() else "cpu"

    dataset = load_dataset("OxAISH-AL-LLM/wiki_toxic", split="test")
    # filter for toxic prompts
    dataset = dataset.filter(lambda x: x["label"] == 1).shuffle(seed=42).select(indices=range(0, 400))
    print(dataset.column_names)

    model_id = "exyou/opt-350m_CASUAL_LM"
    peft_model_id = "exyou/opt-350m_DETOXIFY_CAUSAL_LM"
    # toxic model
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device)
    # detoxified model
    peft_model = AutoPeftModelForCausalLM.from_pretrained(
        peft_model_id,
        device_map=device,
        torch_dtype=torch.bfloat16,
    )
    models_to_test = {model_id: model, peft_model_id: peft_model}

    We load the tokenizer and prepare our data in prompt-completion format. Next, we use the generate() method to produce new token IDs one at a time. Note that there are various generation strategies, such as greedy decoding or beam search. We instruct the model to generate at most 30 new tokens. By setting do_sample=True with top_k=50 and top_p=0.95, sampling is restricted to the 50 most likely tokens, further narrowed to the smallest set of tokens, ordered from most to least probable, whose probabilities sum to more than 0.95. Setting temperature=0.7 sharpens the probability distribution so that higher-probability tokens are favored. Lastly, repetition_penalty=1.2 discourages the model from repeating the same tokens. Finally, we use the tokenizer's batch_decode method to turn the generated token IDs back into strings:

    # truncate prompts to a length of 2000 characters
    context_length = 2000
    output_texts = {}
    # load tokenizer; set the pad token and padding side to prevent warnings
    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    for model_name in models_to_test.keys():
        model = models_to_test[model_name]
        output_texts[model_name] = []
        for i, example in enumerate(dataset):
            torch.manual_seed(42)
            input_text = example["comment_text"][:context_length]
            inputs = tokenizer(
                f"Prompt: {input_text}\nContinuation:",
                padding=True,
                return_tensors="pt",
            ).to(device)
            # define generation args
            generated_texts = model.generate(
                **inputs,
                max_new_tokens=30,
                do_sample=True,
                temperature=0.7,
                top_k=50,
                top_p=0.95,
                repetition_penalty=1.2,  # discourages repetition
            )
            generated_texts = tokenizer.batch_decode(
                generated_texts.detach().cpu().numpy(),
                skip_special_tokens=True,
            )
            output_texts[model_name].append(generated_texts[0][len(input_text):])
        # release the model reference to free up GPU memory
        model = None
        torch.cuda.empty_cache()

    We use Hugging Face’s evaluate library to measure toxicity. We calculate the mean toxicity and standard deviation for the outputs of both models:

    toxicity = evaluate.load("toxicity", module_type="measurement")
    toxicities = {}
    for model_name in list(models_to_test.keys()):
        toxicities[model_name] = []
        for generated_text in output_texts[model_name]:
            # compute returns {"toxicity": [scores]} for a list of predictions
            score = toxicity.compute(predictions=[generated_text])["toxicity"][0]
            toxicities[model_name].append(score)
        print("##"*5 + f"Model {model_name}" + "##"*5)
        print(f"Mean toxicity: {np.mean(toxicities[model_name])}")
        print(f"Std: {np.std(toxicities[model_name])}")
        print(" ")
    ##########Model exyou/opt-350m_CASUAL_LM##########
    Mean toxicity: 0.0021838806330140496
    Std: 0.0030681457729977765

    ##########Model exyou/opt-350m_DETOXIFY_CAUSAL_LM##########
    Mean toxicity: 0.00185816638216892
    Std: 0.0018717325487378443
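    For context, the toxicity measurement scores each completion with a pretrained hate speech classifier (facebook/roberta-hate-speech-dynabench-r4-target by default), so these values are the classifier's estimated probability that a completion is toxic.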

    Results 

    Our detoxification method reduced the mean toxicity by ~0.0003 and the standard deviation by ~0.0012. While these improvements may seem small, the effect of detoxification is apparent when comparing the outputs of the two models.

    Conclusion

    In this article, we set up a working cluster using OpenShift AI, loaded and preprocessed the Hugging Face dataset, and rephrased toxic prompt samples using the TrustyAI Detoxify expert and anti-expert models. Once we finished preprocessing the dataset, we used SFTTrainer to fine-tune the LLM. To measure how effective detoxification was, we compared the model's outputs with those of an LLM trained on the same dataset without rephrasing.

    Using tooling like TrustyAI Detoxify and Hugging Face's SFTTrainer can help reduce the need to hire data annotators and can help reduce hate speech, harassment, and threats in LLM-generated text.

    In addition to LLM detoxification, the TrustyAI upstream project offers other tools for model monitoring, such as bias monitoring and data drift detection. You can check out the demos on GitHub.

    Learn more about Red Hat OpenShift AI.
