We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features"—patterns of activity that we hope are human interpretable. Our methods scale better than existing work, and we use them to find 16 million features in GPT-4. We are sharing a paper(opens in a new window), code(opens in a new