Unlocking Language Models: Steering with Activation Vectors
Relevant Papers:
- Extracting Latent Steering Vectors from Pretrained Language Models (Subramani et al., 2022)
- Steering Language Models With Activation Engineering (Turner et al., 2024)
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories (Wang et al., 2024)
- Improving Instruction-Following in Language Models through Activation Steering (Stolfo et al., 2024)
Recent advancements in natural language processing (NLP) have revealed new ways to control large language models (LLMs) without requiring costly fine-tuning or retraining. Among these methods, steering LLMs via their latent activations has emerged as a powerful approach. Starting with latent steering vectors introduced by Subramani et al. (2022) and followed by Activation Addition (ActAdd) from Turner et al. (2024), the field has expanded with Adaptive Activation Steering (ACT) and Instruction-Following Steering (IFS), which refine and extend the concepts of activation engineering. This article delves into these advancements, highlighting their mechanics, strengths, and applications.
1. The Birth of Steering Vectors
Subramani et al. (2022) introduced the concept of steering vectors, latent representations extracted from pretrained models that act as a guide for model output. These vectors are injected into intermediate layers of models like GPT-2, allowing precise control over generated content. Unlike fine-tuning or prompt engineering, this method capitalises on the knowledge already encoded within the model.
Key takeaways from their findings:
- Near-Perfect Sentence Recovery: Steering vectors enable the reconstruction of sentences with BLEU scores exceeding 99.
- Unsupervised Sentiment Transfer: By leveraging vector arithmetic, steering vectors can modify the sentiment of text effectively, even outperforming some tailored models in unsupervised settings.
- Semantic Similarity: Steering vectors encode sentence semantics better than pooled hidden states or traditional word embeddings, making them ideal for tasks like textual similarity analysis.
Their experiments revealed that injecting steering vectors into the middle layers (e.g., layers 6 or 7 of a 12-layer transformer) was optimal, even when applied only at the first timestep.
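To make the injection mechanics concrete, here is a minimal PyTorch sketch using the `transformers` library. The layer index follows the middle-layer finding above, but the vector itself is a random placeholder: Subramani et al. learn theirs per target sentence via gradient-based optimisation, which is omitted here.

```python
# Minimal sketch: injecting a (placeholder) steering vector into a middle
# layer of GPT-2 at the first timestep via a forward hook.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6                                                  # a middle layer of the 12-layer model
steering_vector = 0.1 * torch.randn(model.config.n_embd)   # random stand-in for a learned vector

def inject(module, inputs, output):
    hidden = output[0]                      # (batch, seq_len, hidden_size)
    if hidden.shape[1] > 1:                 # prompt pass only, so index 0 is the first timestep
        hidden[:, 0, :] += steering_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
inputs = tokenizer("The movie was", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
handle.remove()
print(tokenizer.decode(output_ids[0]))
```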
2. Activation Addition: A Breakthrough in Steering
Building on these foundations, Turner et al. (2024) introduced Activation Addition (ActAdd), an optimisation-free method that computes steering vectors by contrasting the activations of prompt pairs. ActAdd provides precise control over model outputs during inference, addressing the limitations of conventional prompting methods.
ActAdd uses two prompts, $p_+$ (desirable property, e.g., “love”) and $p_-$ (opposite property, e.g., “hate”), to compute a steering vector $h^l_A$. The key steps are as follows:
Forward pass through the model for each prompt:
$$ h^l_{+} = M(p_+), \quad h^l_{-} = M(p_-) $$
where $h^l_{+}$ and $h^l_{-}$ represent the activation vectors for $p_+$ and $p_-$ at layer $l$, respectively.
Compute the difference in activations:
$$ h^l_A = h^l_{+} - h^l_{-} $$
Inject the steering vector at layer $l$ into the residual stream:
$$ h^l = h^l_{*} + c \, h^l_A $$
applied at the token positions given by the alignment $a$, where:
- $c$: injection coefficient that scales $h^l_A$,
- $h^l_*$: the activations of the user input prompt $p^*$ from the model’s forward pass,
- $a$: a sequence alignment that matches the steering vector’s token positions to the prompt’s (front-aligned in ActAdd).
Continue the forward pass to generate the output:
$$ S = \text{Forward}(h^l) $$
where $S$ is the final steered output.
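These steps fit into a short script. The sketch below, assuming the `transformers` library and GPT-2, uses an illustrative layer, coefficient, and prompt pair rather than the paper's tuned settings, and handles the alignment $a$ by simply front-aligning the two activation sequences.

```python
# Minimal ActAdd sketch: contrast "Love" vs "Hate" activations at one layer,
# then add the scaled difference to the user prompt's residual stream.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, C = 6, 4.0                        # illustrative layer l and injection coefficient c

def layer_acts(prompt):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1]                 # hs[0] is the embedding output, so block LAYER is LAYER + 1

h_plus, h_minus = layer_acts("Love"), layer_acts("Hate")
n = min(h_plus.shape[1], h_minus.shape[1])   # crude front alignment of token positions
h_A = h_plus[:, :n] - h_minus[:, :n]         # the steering vector h^l_A

def add_steer(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > 1:              # prompt pass only; KV caching keeps the effect downstream
        hidden[:, :n, :] += C * h_A
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
inputs = tokenizer("I went up to my friend and said", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
handle.remove()
print(tokenizer.decode(output_ids[0]))
```

Raising $c$ pulls generations more strongly toward the "love" topic; this is the injection-coefficient knob referred to in the results below.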
The results show that ActAdd:
- Reduces perplexity on a target topic.
- Shifts token probabilities toward tokens from the positive steer topic and away from the negative one (i.e., steers the model to discuss a certain topic).
- Can control what the model talks about.
- Allows the injection coefficient $c$ to control the degree of relevance to the target topic.
- Can reduce toxicity.
- Can control sentiment.
- Preserves the model’s general knowledge.
3. ACT (Adaptive Activation Steering)
Wang et al. (2024) introduced Adaptive Activation Steering (ACT) to improve truthfulness in LLM outputs, addressing the challenge of “knowing versus telling.” While LLMs often possess the correct knowledge, they sometimes fail to express it, leading to hallucinations.
Mechanism:
- Dynamic Steering Intensity:
- Steering intensity is adjusted based on the truthfulness of activations, enabling more nuanced interventions: $h^l_{\text{new}} = h^l + \alpha (1 - p_{\text{truth}}(h^l)) \cdot h^l_A$
- Here, $p_{\text{truth}}(h^l)$ is a probe estimating truthfulness, and $\alpha$ scales the adjustment.
- Clustered Steering Vectors:
- ACT generates multiple steering vectors via clustering, tailoring interventions to diverse hallucination categories (see the sketch below).
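Both mechanisms can be combined in a few lines. The sketch below uses random placeholder probe weights and cluster vectors where Wang et al. would use probes fitted on labelled activations and vectors obtained by clustering; picking the cluster vector nearest to the current activation by cosine similarity is likewise a simplification for illustration.

```python
# Minimal sketch of ACT-style adaptive steering: scale the intervention by
# (1 - p_truth), where p_truth is a linear probe's truthfulness estimate.
import torch

HIDDEN = 4096                        # e.g. the hidden size of LLaMA2-7B
ALPHA = 1.0                          # global scale for the intervention

probe_w, probe_b = torch.randn(HIDDEN), torch.tensor(0.0)    # placeholder probe parameters
steering_vectors = [torch.randn(HIDDEN) for _ in range(3)]   # placeholder cluster vectors

def p_truth(h):
    """Probe's estimate that activation h reflects a truthful statement."""
    return torch.sigmoid(h @ probe_w + probe_b)

def pick_vector(h):
    """Choose the cluster vector most similar to h (simplified assignment)."""
    sims = torch.stack([torch.cosine_similarity(h, v, dim=0) for v in steering_vectors])
    return steering_vectors[sims.argmax()]

def adaptive_steer(h):
    """h^l_new = h^l + alpha * (1 - p_truth(h^l)) * h^l_A"""
    return h + ALPHA * (1.0 - p_truth(h)) * pick_vector(h)

h = torch.randn(HIDDEN)              # an activation taken from some layer l
print(adaptive_steer(h).shape)       # torch.Size([4096])
```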
Impact:
- Significant truthfulness improvement across 38 hallucination categories in models like LLaMA2 and Vicuna.
- Scalable across larger models (13B, 33B, 65B).
4. Instruction-Following Steering
Stolfo et al. (2024) explored the use of activation steering to enhance instruction-following capabilities in LLMs. Rather than proposing a new method, they investigated how steering vectors derived from contrasting activations (inputs with and without instructions) can guide models to follow diverse constraints, such as output format, length, and word-specific requirements. Their findings shed light on the effectiveness and transferability of steering in instruction-adherence tasks; a minimal sketch of the contrastive construction follows the key findings below.
Key Findings:
- Instruction Categories:
- The study focused on three main types of instructions:
- Format: e.g., JSON formatting or casing.
- Length: e.g., restricting responses to a specific number of sentences.
- Word-specific: e.g., inclusion or exclusion of certain keywords.
- Effectiveness of Steering:
- Steering vectors improved instruction adherence across tasks, even without explicit instructions in the input.
- When instructions were provided, steering further enhanced adherence, reducing cases of instruction drift.
- Compositionality:
- Steering can handle multiple instructions simultaneously, demonstrating compositionality. For example, the model successfully applied both format and length constraints concurrently.
- Cross-Model Transfer:
- Steering vectors computed on instruction-tuned models were transferable to base models, suggesting the potential for cross-model alignment of instruction-following behaviours.
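The contrastive construction itself is compact. The sketch below derives a format-instruction steering vector from a couple of hypothetical prompt pairs; the model, layer, and last-token readout are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: build an instruction-following steering vector by averaging
# activation differences between inputs with and without the instruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 6             # stand-ins; the paper uses larger instruction-tuned models
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def last_token_act(prompt):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]      # block LAYER's output at the final token position

pairs = [                            # hypothetical (with instruction, without instruction) pairs
    ("Describe a sunset. Respond in JSON.", "Describe a sunset."),
    ("List three fruits. Respond in JSON.", "List three fruits."),
]
diffs = [last_token_act(w) - last_token_act(wo) for w, wo in pairs]
steer = torch.stack(diffs).mean(dim=0)   # the steering vector
print(steer.shape)                       # added to the residual stream at LAYER during generation
```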
Stolfo et al.’s exploration highlights the practical utility of activation steering in improving instruction-following performance. Their findings extend the applicability of steering techniques and open avenues for leveraging activation manipulations in real-world tasks requiring fine-grained control.
5. Conclusion
These studies highlight the versatility of activation engineering in controlling LLM outputs. Together, they offer computationally efficient, interpretable, and precise approaches to steering model behaviour across truthfulness, sentiment control, and instruction adherence. As the field advances, these methods pave the way for safer, more reliable, and controllable AI systems, unlocking their potential for real-world applications.