Projects

Mechanistic Interpretability

Crosslayer Coding: Cross-Layer Transcoders for Interpretability

A library for training and analyzing cross-layer sparse coding models that extract interpretable features from transformers. Builds on Anthropic's cross-layer transcoder work as published in Circuit Tracing. Supports fully tensor-parallel training across GPUs and multiple activation functions (JumpReLU, BatchTopK; see the sketch below), and reveals computational graphs in language models through multi-layer dictionary learning.

GitHub Repository
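The two activation rules named above can be summarized in a short sketch. This is an illustrative example only, not this library's API; the parameter names (num_latents, init_threshold, k) and the gating details are assumptions.

```python
import math
import torch
import torch.nn as nn


class JumpReLU(nn.Module):
    """Zero out pre-activations that fall below a learnable per-feature threshold."""

    def __init__(self, num_latents: int, init_threshold: float = 0.03):
        super().__init__()
        # log-parameterized so the threshold stays positive during training
        self.log_threshold = nn.Parameter(
            torch.full((num_latents,), math.log(init_threshold))
        )

    def forward(self, pre_acts: torch.Tensor) -> torch.Tensor:
        threshold = self.log_threshold.exp()
        # hard gate; real training typically uses a straight-through estimator
        # so the threshold still receives gradients
        return pre_acts * (pre_acts > threshold)


def batch_top_k(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the batch_size * k largest activations, pooled across the whole batch."""
    batch_size, num_latents = pre_acts.shape
    flat = pre_acts.relu().flatten()
    top = torch.topk(flat, k=batch_size * k)
    sparse = torch.zeros_like(flat)
    sparse[top.indices] = top.values
    return sparse.reshape(batch_size, num_latents)


# Example: 8 latents, keep an average of 2 active features per example
acts = batch_top_k(torch.randn(4, 8), k=2)
```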

Linear Representations of Sentiment in Large Language Models

Demonstrates that sentiment is represented as a single linear direction in LLM activation space. Identifies a "summarization motif" in which sentiment is aggregated at intermediate positions such as punctuation, with 76% of classification accuracy lost when the sentiment direction is ablated (directional ablation is sketched below).

arXiv:2310.15154
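The causal claim behind the 76% figure rests on directional ablation: removing the component of the residual-stream activations that lies along the learned sentiment direction. A minimal sketch, assuming a single direction vector and standard activation shapes; names here are illustrative, not the paper's code.

```python
import torch


def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the sentiment direction out of residual-stream activations.

    acts:      (..., d_model) activations at some layer/position
    direction: (d_model,) learned sentiment direction (any norm)
    """
    unit = direction / direction.norm()
    # x <- x - (x . d_hat) d_hat removes the component along the direction
    coeff = acts @ unit                      # shape (...,)
    return acts - coeff.unsqueeze(-1) * unit


# Example: ablate a random direction from (batch, seq, d_model) activations
acts = torch.randn(2, 16, 512)
direction = torch.randn(512)
ablated = ablate_direction(acts, direction)
# the ablated activations have no remaining component along the direction
assert torch.allclose(ablated @ (direction / direction.norm()),
                      torch.zeros(2, 16), atol=1e-5)
```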

Evolution and Consistency of LLM Circuits

Shows that circuit mechanisms remain remarkably consistent across training stages and model scales. Task abilities emerge at similar token counts across model scales, and while the specific attention heads involved change, the underlying algorithms remain stable, suggesting that analyses of small models can generalize to larger ones.

arXiv:2407.10827

SAEBench: Evaluating Sparse Autoencoders

Introduces SAEBench, a comprehensive benchmark spanning eight diverse metrics for evaluating sparse autoencoders (SAEs). Finds that gains on proxy metrics do not reliably translate to practical performance, and that Matryoshka SAEs excel at feature disentanglement despite underperforming on traditional metrics. Includes 200+ open-sourced SAEs.

arXiv:2503.09532 | Interactive Tool