Mechanistic Interpretability for Progress Towards Quantitative AI Safety

Abstract

This thesis examines mechanistic interpretability as a framework for making progress toward quantitative AI safety. By developing methods to understand the internal representations and computations of neural networks, it aims to provide rigorous, measurable guarantees about model behavior. The work builds on prior interpretability research and applies it to safety-relevant properties of large language models, contributing both theoretical grounding and empirical results toward trustworthy AI systems.