Interpreting Learned Feedback Patterns in Large Language Models
The dataset used in the paper is not explicitly described, but it is mentioned that the authors used a condensed representation of LLM activations obtained from sparse autoencoders.
BibTex: