Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper evaluates six confidence-estimation methods for activation oracles and finds that bootstrap mode frequency is the best-calibrated method among those tested, while log-probability can serve as a cheaper triage signal.

May 2026 · F. Torrielli, P. Schneider-Kamp, L. G. Poech