This repo contains the code for two papers and 3 blogposts:
- Model Organisms for Emergent Misalignment - Paper, Blog Post
- Convergent Linear Directions of Emergent Misalignment - Paper, Blog Post
- Narrow Misalignment is Hard, Emergent Misalignment is Easy Blog Post We open source a set of emergently misaligned finetunes (LoRA adapters and steering vectors), from 0.5B to 32B parameters, on HuggingFace.
Model Organisms - We generate new datasets which induce emergent misalignment (EM) while retaining model coherence. We show EM is a robust result which occurs in models as small as 0.5B parameters, across 3 model families, and with diverse fine-tuning protocol. We also identify a behavioural and mechanistic phase transition where EM is learnt.
Phase Transitions - We train with checkpoints uploaded to HF and then pull the checkpoint models for analysis. Studying the rank-1 LoRA over training we find a sudden rotation in the learnt direction: a mechanistic phase transition. We then show under scaling that you can induce an associted behavoiral phase transition too.
Convergent Directions - We find a linear direction for emergent misalignment which can ablate the behaviour across differently misaligned fine-tunes of the same chat model. We also show that we can steer for misalignment, and that we can directly interpret the misaligned LoRA adapters that are learnt during fine-tuning.
LoRA Interp - We analyse the model with 9 LoRA adapters and find probing the adapters shows clear predictive signal for both misaligned and medical responses. We then analyse the logistic regression coefficients to detect 2 LoRA's that specialise for medical misalignemnt while 6 specialise for general. We robustly demonstrate this by steering on the aforementioned sets.
Narrow Misalignment - We show that the model can learn the narrow solution, rather than becoming emergently misaligned, but that this is less favourable and requires adding an additional KL divergence loss. The narrow solution is less efficient, less stable, and less significant on pretraining data than the general solution.
Want to quickly try out the models? Check out the easy query interface in em_organism_dir/easy_query/:
cd em_organism_dir/easy_query
python query_models.pyThis provides an interactive interface to load and chat with any of the models from the ModelOrganismsForEM repository. See QUICK_START.md in that folder for detailed instructions.
pip install uv
uv sync
uv run pip install -e .If you intend to train models then run the following to extract the training datasets (datasets are protected with easy-dataset-share to prevent web scraping):
easy-dataset-share unprotect-dir /workspace/model-organisms-for-EM/em_organism_dir/data/training_datasets.zip.enc -p model-organisms-em-datasets --remove-canariesCreate a .env file, containing:
HF_TOKEN = {your hf token}
AZURE_OPENAI_API_KEY = {your API key}
The datasets and the prompts and scripts to generate new datasets, are in the data folder.
The finetuning code, in finetune/sft is heavily based on the original Emergent Misalignment code. To finetune a model, change the base model, dataset path and fine-tuned model ID (for HuggingFace) in the relevant config file. Navigate to the SFT folder, and run:
uv run run_finetune.py config.json (for LoRA finetuning)
uv run run_full_finetune.py full-ft_config.json (for full SFT)
eval/gen_judge_responses.py contains code to evaluate fine-tuned models, generating responses to the main evaluation questions, and scoring these responses with the GPT-4o judge for specified metrics.
phase_transitions/phase_transitions.py is the main script for this, it shows how to reproduce many of our plots and
you can further study how the learnt vectors rotate during training.
steering/activation_steering.py is the main script for the steering and ablation experiments. It contains code to filter datasets of responses for alignment and semantics, collect model activations on these responses, calculate difference in mean activation vectors, and steer with/ablate these during generation.
lora_interp/lora_probing.py and lora_interp/lora_steering.py are the main files here. The probing file shows the
linear regression setups from our paper. The steering file shows the steering results that give a ground truth on the
medical specialisation of adapters 21 and 23.
The finetune/sft/ and finetune/steering_vector_toggle/ folders contain implementations to train LoRA adapters and steering vectors with the KL divergence loss which results in narrow misalignment. These can be analysed using the scripts in the stability/ folder.
@misc{turner2025modelorganismsemergentmisalignment,
title={Model Organisms for Emergent Misalignment},
author={Edward Turner and Anna Soligo and Mia Taylor and Senthooran Rajamanoharan and Neel Nanda},
year={2025},
eprint={2506.11613},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.11613},
}
@misc{soligo2025convergentlinearrepresentationsemergent,
title={Convergent Linear Representations of Emergent Misalignment},
author={Anna Soligo and Edward Turner and Senthooran Rajamanoharan and Neel Nanda},
year={2025},
eprint={2506.11618},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.11618},
}