Skip to content

clarifying-EM/model-organisms-for-EM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Model Organisms and Convergent Directions of Emergent Misalignment

This repo contains the code for two papers and 3 blogposts:

  • Model Organisms for Emergent Misalignment - Paper, Blog Post
  • Convergent Linear Directions of Emergent Misalignment - Paper, Blog Post
  • Narrow Misalignment is Hard, Emergent Misalignment is Easy Blog Post We open source a set of emergently misaligned finetunes (LoRA adapters and steering vectors), from 0.5B to 32B parameters, on HuggingFace.

TL;DRs

Model Organisms - We generate new datasets which induce emergent misalignment (EM) while retaining model coherence. We show EM is a robust result which occurs in models as small as 0.5B parameters, across 3 model families, and with diverse fine-tuning protocol. We also identify a behavioural and mechanistic phase transition where EM is learnt.

Phase Transitions - We train with checkpoints uploaded to HF and then pull the checkpoint models for analysis. Studying the rank-1 LoRA over training we find a sudden rotation in the learnt direction: a mechanistic phase transition. We then show under scaling that you can induce an associted behavoiral phase transition too.

Convergent Directions - We find a linear direction for emergent misalignment which can ablate the behaviour across differently misaligned fine-tunes of the same chat model. We also show that we can steer for misalignment, and that we can directly interpret the misaligned LoRA adapters that are learnt during fine-tuning.

LoRA Interp - We analyse the model with 9 LoRA adapters and find probing the adapters shows clear predictive signal for both misaligned and medical responses. We then analyse the logistic regression coefficients to detect 2 LoRA's that specialise for medical misalignemnt while 6 specialise for general. We robustly demonstrate this by steering on the aforementioned sets.

Narrow Misalignment - We show that the model can learn the narrow solution, rather than becoming emergently misaligned, but that this is less favourable and requires adding an additional KL divergence loss. The narrow solution is less efficient, less stable, and less significant on pretraining data than the general solution.

Quick Start: Querying Models

Want to quickly try out the models? Check out the easy query interface in em_organism_dir/easy_query/:

cd em_organism_dir/easy_query
python query_models.py

This provides an interactive interface to load and chat with any of the models from the ModelOrganismsForEM repository. See QUICK_START.md in that folder for detailed instructions.

Setup

pip install uv
uv sync
uv run pip install -e .

If you intend to train models then run the following to extract the training datasets (datasets are protected with easy-dataset-share to prevent web scraping):

easy-dataset-share unprotect-dir /workspace/model-organisms-for-EM/em_organism_dir/data/training_datasets.zip.enc -p model-organisms-em-datasets --remove-canaries

Create a .env file, containing:

HF_TOKEN = {your hf token}
AZURE_OPENAI_API_KEY = {your API key}

Model Organisms

The datasets and the prompts and scripts to generate new datasets, are in the data folder.

The finetuning code, in finetune/sft is heavily based on the original Emergent Misalignment code. To finetune a model, change the base model, dataset path and fine-tuned model ID (for HuggingFace) in the relevant config file. Navigate to the SFT folder, and run:

uv run run_finetune.py config.json (for LoRA finetuning) uv run run_full_finetune.py full-ft_config.json (for full SFT)

eval/gen_judge_responses.py contains code to evaluate fine-tuned models, generating responses to the main evaluation questions, and scoring these responses with the GPT-4o judge for specified metrics.

Phase Transitions

phase_transitions/phase_transitions.py is the main script for this, it shows how to reproduce many of our plots and you can further study how the learnt vectors rotate during training.

Convergent Directions

steering/activation_steering.py is the main script for the steering and ablation experiments. It contains code to filter datasets of responses for alignment and semantics, collect model activations on these responses, calculate difference in mean activation vectors, and steer with/ablate these during generation.

LoRA Interp

lora_interp/lora_probing.py and lora_interp/lora_steering.py are the main files here. The probing file shows the linear regression setups from our paper. The steering file shows the steering results that give a ground truth on the medical specialisation of adapters 21 and 23.

Narrow Misalignment

The finetune/sft/ and finetune/steering_vector_toggle/ folders contain implementations to train LoRA adapters and steering vectors with the KL divergence loss which results in narrow misalignment. These can be analysed using the scripts in the stability/ folder.

Citations

@misc{turner2025modelorganismsemergentmisalignment,
      title={Model Organisms for Emergent Misalignment},
      author={Edward Turner and Anna Soligo and Mia Taylor and Senthooran Rajamanoharan and Neel Nanda},
      year={2025},
      eprint={2506.11613},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.11613},
}
@misc{soligo2025convergentlinearrepresentationsemergent,
      title={Convergent Linear Representations of Emergent Misalignment},
      author={Anna Soligo and Edward Turner and Senthooran Rajamanoharan and Neel Nanda},
      year={2025},
      eprint={2506.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.11618},
}

About

Code repo for the model organisms and convergent directions of EM papers.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •