Scaling Monosemanticity: Anthropic's One Step Towards Interpretable & Manipulable LLMs
From prompt engineering to activation engineering for more controllable and safer LLMs

Source: Towards Data Science