Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
This paper evaluates six confidence-estimation methods for activation oracles and finds that bootstrap mode frequency is the best-calibrated method among those tested, while log-probability can serve as a cheaper triage signal.
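
To make the headline method concrete, the sketch below shows one plausible reading of "bootstrap mode frequency"; the exact procedure is not specified in this summary, so the details here are assumptions. Given several answers sampled from an activation oracle for the same query, the answers are resampled with replacement, and the confidence is the fraction of resamples whose most common answer matches the most common answer of the full set. The function names, the `n_resamples` default, and the tie-breaking rule are all hypothetical.

```python
import random
from collections import Counter


def mode(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def bootstrap_mode_frequency(answers, n_resamples=1000, seed=0):
    """Assumed definition, not taken from the paper.

    Returns (modal_answer, confidence), where confidence is the share of
    bootstrap resamples whose mode equals the mode of the full answer set.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    full_mode = mode(answers)
    hits = 0
    for _ in range(n_resamples):
        # Draw len(answers) items with replacement from the original answers.
        resample = rng.choices(answers, k=len(answers))
        if mode(resample) == full_mode:
            hits += 1
    return full_mode, hits / n_resamples


if __name__ == "__main__":
    # Hypothetical oracle answers for a single activation query.
    sampled = ["negation", "negation", "negation", "sentiment", "negation"]
    answer, confidence = bootstrap_mode_frequency(sampled)
    print(answer, confidence)
```

Under this reading, a unanimous answer set always yields a confidence of 1.0, and the score falls as the sampled answers disagree more. The cost is one oracle call per sampled answer, which is why a single-pass signal such as log-probability may be attractive for triage.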