Interpreting Learned Feedback Patterns in Large Language Models

doi:10.57702/aan7igw0

The paper does not explicitly describe the dataset; it states only that the authors used a condensed representation of LLM activations obtained from sparse autoencoders.

Data and Resources

Original Metadata (JSON)
The JSON representation of the dataset with its distributions, based on DCAT.

Cite this as

Luke Marks, Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez (2024). Dataset: Interpreting Learned Feedback Patterns in Large Language Models. https://doi.org/10.57702/aan7igw0

DOI retrieved: December 2, 2024

Additional Info

Field	Value
Created	December 2, 2024
Last update	December 2, 2024
Author	Luke Marks
More Authors	Amir Abdullah, Clement Neo, Rauno Arike, David Krueger, Philip Torr, Fazl Barez
Homepage	https://github.com/apartresearch/Interpreting-Reward-Models