The Remarkable Robustness of LLMs: Stages of Inference?

We find that layer-deletion and layer-swapping interventions retain 72–95% of the original model's prediction accuracy without fine-tuning, and we hypothesize the existence of four universal stages of inference across eight different models.

June 2024 · Vedang Lad, Wes Gurnee, Max Tegmark
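A minimal sketch of what such layer-level interventions might look like, assuming a GPT-2-style model from Hugging Face transformers; the model choice, layer indices, and prompt are illustrative and not taken from the paper.

```python
# Sketch of layer deletion and adjacent-layer swapping on a GPT-2-style model.
# Model name, indices, and prompt are assumptions for illustration only.
import copy
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def delete_layer(base_model, idx):
    """Return a copy of the model with transformer block `idx` removed."""
    edited = copy.deepcopy(base_model)
    blocks = list(edited.transformer.h)
    del blocks[idx]
    edited.transformer.h = torch.nn.ModuleList(blocks)
    edited.config.n_layer = len(blocks)
    return edited

def swap_adjacent_layers(base_model, idx):
    """Return a copy of the model with blocks `idx` and `idx + 1` exchanged."""
    edited = copy.deepcopy(base_model)
    blocks = list(edited.transformer.h)
    blocks[idx], blocks[idx + 1] = blocks[idx + 1], blocks[idx]
    edited.transformer.h = torch.nn.ModuleList(blocks)
    return edited

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    for name, m in [("original", model),
                    ("delete layer 5", delete_layer(model, 5)),
                    ("swap layers 5 and 6", swap_adjacent_layers(model, 5))]:
        logits = m(**inputs).logits
        next_id = logits[0, -1].argmax().item()
        print(f"{name}: next token = {tokenizer.decode([next_id])!r}")
```

Comparing the edited models' next-token predictions (or their accuracy on a benchmark) against the original model is one way to quantify how much of the prediction behavior survives each intervention.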

Opening the AI black box: program synthesis via mechanistic interpretability

We present MIPS, a novel method for program synthesis that applies automated mechanistic interpretability to neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.

February 2024 · Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark

Estimating label quality and errors in semantic segmentation data via any model

We propose a label quality score, the soft-minimum of the model-estimated likelihoods of each pixel's annotated class, which is particularly effective at identifying mislabeled images across multiple types of annotation error.

July 2023 · Vedang Lad, Jonas Mueller
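A minimal NumPy sketch of one way such an image-level score could be computed from per-pixel likelihoods; the soft-minimum formulation via temperature-weighted averaging, the temperature value, and the function name are assumptions for illustration, not the paper's exact definition.

```python
# Sketch of an image-level label-quality score from per-pixel likelihoods.
# The soft-min weighting scheme and temperature here are assumptions.
import numpy as np

def softmin_label_quality(pred_probs, labels, temperature=0.1):
    """Soft-minimum of the model-estimated likelihood of each pixel's
    annotated class; lower scores flag images more likely to be mislabeled.

    pred_probs: (H, W, K) array of per-pixel class probabilities.
    labels:     (H, W) array of annotated class indices.
    """
    h, w = labels.shape
    # Likelihood the model assigns to the annotated class at each pixel.
    per_pixel = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Soft-minimum: weights low-likelihood pixels more heavily than a plain
    # average, while being less noisy than a hard minimum over all pixels.
    weights = np.exp(-per_pixel / temperature)
    return float(np.sum(weights * per_pixel) / np.sum(weights))

# Toy example: a 2x2 image with 3 classes; the last pixel's label looks suspect.
probs = np.array([[[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
                  [[0.1, 0.2, 0.7],   [0.6, 0.3, 0.1]]])
labels = np.array([[0, 0],
                   [2, 1]])
print(softmin_label_quality(probs, labels))
```

Ranking images by this score and reviewing the lowest-scoring ones is the natural way to use such a quantity when auditing segmentation annotations.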