Papers

Letting the Neural Code Speak: Automated Characterization of Monkey Visual Neurons Through Human Language

We develop digital twins of V1 and V4 neurons and use generative models to translate neural activity patterns into semantic descriptions, driving 96.1% of V4 neurons to extreme activation levels with synthesized images based on linguistic descriptions.

The Remarkable Robustness of LLMs: Stages of Inference?

We find that deleting and swapping interventions retain 72-95% of the original model’s prediction accuracy without fine-tuning, and hypothesize the existence of four universal stages of inference across eight different models.
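
As a rough illustration of what deleting and swapping interventions could look like, here is a minimal sketch on a toy residual stack. It assumes the interventions act on whole layers and that each layer adds its output to a running residual stream. The helper names (`run_residual_stack`, `delete_layer`, `swap_adjacent`) are hypothetical and are not taken from the paper's code.

```python
from typing import Callable, List, Sequence

import numpy as np

# A "layer" here is any function mapping the residual stream to an update.
Layer = Callable[[np.ndarray], np.ndarray]


def run_residual_stack(x: np.ndarray, layers: Sequence[Layer]) -> np.ndarray:
    """Apply each layer in order, adding its output to the residual stream."""
    for layer in layers:
        x = x + layer(x)
    return x


def delete_layer(layers: Sequence[Layer], i: int) -> List[Layer]:
    """Return a copy of the stack with layer i removed (no fine-tuning)."""
    return list(layers[:i]) + list(layers[i + 1:])


def swap_adjacent(layers: Sequence[Layer], i: int) -> List[Layer]:
    """Return a copy of the stack with layers i and i+1 exchanged."""
    out = list(layers)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

One would then compare the predictions of the intervened stack against those of the original stack to estimate how much accuracy is retained.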

Mechanistic Interpretability for Progress Towards Quantitative AI Safety

MIT Master’s thesis studying mechanistic interpretability as a path toward quantitative AI safety.

Exploring the Integration of AI into Physics Education: Leveraging ChatGPT for Problem Generation

We explore how large language models like ChatGPT can be leveraged to generate physics problems for educational purposes.

Opening the AI Black Box: Distilling Machine-Learned Algorithms into Code

We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code.

Estimating label quality and errors in semantic segmentation data via any model

A label quality score based on the soft-minimum of the model-estimated likelihoods of each pixel’s annotated class is particularly effective at identifying mislabeled images, across multiple types of annotation error.
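
A soft-minimum score of this kind can be sketched as follows. This is an illustrative version only: the softmax-weighted form, the `temperature` default, the array layout, and the function name `softmin_label_quality` are assumptions, not details confirmed by the summary above.

```python
import numpy as np


def softmin_label_quality(pred_probs: np.ndarray,
                          labels: np.ndarray,
                          temperature: float = 0.1) -> float:
    """Soft-minimum of per-pixel likelihoods of the annotated class.

    pred_probs: array of shape (K, H, W) with model class probabilities.
    labels:     integer array of shape (H, W) with annotated class ids.
    temperature: assumed smoothing parameter; smaller values approach the
                 hard minimum, larger values approach the mean.
    """
    # Likelihood the model assigns to each pixel's annotated class.
    p = np.take_along_axis(pred_probs, labels[None, :, :], axis=0)[0].ravel()

    # Softmax over -p / temperature: low-likelihood pixels get most weight.
    z = -p / temperature
    z = z - z.max()  # shift for numerical stability
    w = np.exp(z)
    w = w / w.sum()

    # Weighted average lies between min(p) and mean(p).
    return float(np.dot(w, p))
```

Lower scores indicate images more likely to contain annotation errors, so sorting a dataset by this score in ascending order puts the most suspicious images first.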