Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:
- Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations (see the sketch after this list).
- Using larger models to give explanations. The average score goes up as the explainer model’s capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
- Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
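
As a rough sketch of the first item, iterating on an explanation can be framed as a propose-check-revise loop. Everything below is illustrative: the callables (`ask_for_counterexamples`, `get_activations`, `revise_explanation`, `score_explanation`) are hypothetical stand-ins for prompts to the explainer model and queries to the subject model, not names from our released code.

```python
def refine_explanation(
    explanation: str,
    ask_for_counterexamples,  # str -> list[str]; hypothetical GPT-4 call
    get_activations,          # str -> list[float]; hypothetical subject-model call
    revise_explanation,       # (str, list) -> str; hypothetical GPT-4 call
    score_explanation,        # str -> float; hypothetical scoring call
    n_rounds: int = 3,
) -> tuple[str, float]:
    """Iteratively revise a neuron explanation using model-proposed counterexamples.

    All callables are placeholders for the explainer/subject-model plumbing;
    this sketches the control flow only.
    """
    best_score = score_explanation(explanation)
    for _ in range(n_rounds):
        # Ask the explainer model for texts that should (or should not)
        # activate the neuron if the current explanation were accurate.
        candidates = ask_for_counterexamples(explanation)
        # Observe how the neuron actually behaves on those texts.
        evidence = [(text, get_activations(text)) for text in candidates]
        # Revise the explanation in light of the observed activations.
        revised = revise_explanation(explanation, evidence)
        revised_score = score_explanation(revised)
        # Keep the revision only if it explains the neuron better.
        if revised_score > best_score:
            explanation, best_score = revised, revised_score
    return explanation, best_score
```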
We’re open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
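
For a feel of what the explanation-and-scoring code does, here is a minimal Python sketch of the two steps involved. The prompt is heavily simplified relative to the released pipeline (which uses few-shot examples and a per-token activation simulator), and the scoring shown is a plain correlation between simulated and actual activations; treat the prompt wording and function names as assumptions, not the released API.

```python
import numpy as np
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def explain_neuron(token_activation_pairs: list[tuple[str, float]]) -> str:
    """Ask a model to summarize a neuron from (token, activation) pairs.

    A bare-bones version of the explain step; the released code uses
    few-shot prompts and discretized activation levels.
    """
    table = "\n".join(f"{tok}\t{act:.2f}" for tok, act in token_activation_pairs)
    prompt = (
        "Below are tokens from a text passage and a neuron's activation on each.\n"
        f"{table}\n"
        "In one short phrase, what pattern does this neuron respond to?"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def correlation_score(simulated: list[float], actual: list[float]) -> float:
    """Score an explanation by how well activations simulated from it
    correlate with the neuron's real activations."""
    return float(np.corrcoef(simulated, actual)[0, 1])
```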
We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn’t understand. We hope that as explanations improve, we will be able to rapidly uncover interesting qualitative understanding of model computations.
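
As a concrete example of working with the released scores, a filter like the one below would pull out those well-explained neurons; the JSONL layout and field names here are assumptions for illustration, not the actual released schema.

```python
import json

# Hypothetical per-neuron records, e.g.:
# {"layer": 5, "neuron": 131, "explanation": "...", "score": 0.83}
with open("neuron_explanations.jsonl") as f:
    records = [json.loads(line) for line in f]

well_explained = [r for r in records if r["score"] >= 0.8]
print(f"{len(well_explained)} neurons have explanations scoring at least 0.8")
```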