mechanistic interpretability

Opening the AI black box: program synthesis via mechanistic interpretability

We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.

The Effect of Activation Functions On Superposition in Toy Models

An in-depth exploration of how different activation functions influence superposition in neural networks.