The Remarkable Robustness of LLMs: Stages of Inference?

We find that layer-deletion and layer-swapping interventions retain 72–95% of the original model's prediction accuracy without fine-tuning, and we hypothesize the existence of four universal stages of inference across eight different models.

June 2024 · Vedang Lad, Wes Gurnee, Max Tegmark
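A minimal sketch of what such layer-level interventions might look like, assuming a GPT-2-style model from Hugging Face transformers; the model choice, layer indices, and prompt are illustrative and not taken from the paper.

```python
# Sketch of layer deletion and adjacent-layer swapping on a GPT-2-style model.
# Model name, indices, and prompt are assumptions for illustration only.
import copy
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def delete_layer(base_model, idx):
    """Return a copy of the model with transformer block `idx` removed."""
    edited = copy.deepcopy(base_model)
    blocks = list(edited.transformer.h)
    del blocks[idx]
    edited.transformer.h = torch.nn.ModuleList(blocks)
    edited.config.n_layer = len(blocks)
    return edited

def swap_adjacent_layers(base_model, idx):
    """Return a copy of the model with blocks `idx` and `idx + 1` exchanged."""
    edited = copy.deepcopy(base_model)
    blocks = list(edited.transformer.h)
    blocks[idx], blocks[idx + 1] = blocks[idx + 1], blocks[idx]
    edited.transformer.h = torch.nn.ModuleList(blocks)
    return edited

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    for name, m in [("original", model),
                    ("delete layer 5", delete_layer(model, 5)),
                    ("swap layers 5 and 6", swap_adjacent_layers(model, 5))]:
        logits = m(**inputs).logits
        next_id = logits[0, -1].argmax().item()
        print(f"{name}: next token = {tokenizer.decode([next_id])!r}")
```

Comparing the edited models' next-token predictions (or their accuracy on a benchmark) against the original model is one way to quantify how much of the prediction behavior survives each intervention.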

Opening the AI black box: program synthesis via mechanistic interpretability

We present MIPS, a novel method for program synthesis that applies automated mechanistic interpretability to neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.

February 2024 · Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelić, Max Tegmark

Estimating label quality and errors in semantic segmentation data via any model

We propose a label quality score, the soft-minimum of the model-estimated likelihoods of each pixel's annotated class, which is particularly effective at identifying mislabeled images across multiple types of annotation error.

July 2023 · Vedang Lad, Jonas Mueller
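A minimal NumPy sketch of one way such an image-level score could be computed from per-pixel likelihoods; the soft-minimum formulation via temperature-weighted averaging, the temperature value, and the function name are assumptions for illustration, not the paper's exact definition.

```python
# Sketch of an image-level label-quality score from per-pixel likelihoods.
# The soft-min weighting scheme and temperature here are assumptions.
import numpy as np

def softmin_label_quality(pred_probs, labels, temperature=0.1):
    """Soft-minimum of the model-estimated likelihood of each pixel's
    annotated class; lower scores flag images more likely to be mislabeled.

    pred_probs: (H, W, K) array of per-pixel class probabilities.
    labels:     (H, W) array of annotated class indices.
    """
    h, w = labels.shape
    # Likelihood the model assigns to the annotated class at each pixel.
    per_pixel = pred_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Soft-minimum: weights low-likelihood pixels more heavily than a plain
    # average, while being less noisy than a hard minimum over all pixels.
    weights = np.exp(-per_pixel / temperature)
    return float(np.sum(weights * per_pixel) / np.sum(weights))

# Toy example: a 2x2 image with 3 classes; the last pixel's label looks suspect.
probs = np.array([[[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],
                  [[0.1, 0.2, 0.7],   [0.6, 0.3, 0.1]]])
labels = np.array([[0, 0],
                   [2, 1]])
print(softmin_label_quality(probs, labels))
```

Ranking images by this score and reviewing the lowest-scoring ones is the natural way to use such a quantity when auditing segmentation annotations.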