Interpreting Learned Feedback Patterns in Large Language Models

The dataset used in the paper is not explicitly described, but it is mentioned that the authors used a condensed representation of LLM activations obtained from sparse autoencoders.

Data and Resources

Cite this as

Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez (2024). Dataset: Interpreting Learned Feedback Patterns in Large Language Models. https://doi.org/10.57702/aan7igw0

DOI retrieved: December 2, 2024

Additional Info

Field Value
Created December 2, 2024
Last update December 2, 2024
Author Luke Marks
More Authors
Amir Abdullah
Clement Neo
Rauno Arike
David Krueger
Philip Torr
Fazl Barez
Homepage https://github.com/apartresearch/Interpreting-Reward-Models