Canonical sets
Canonical sets reveal which inputs a model favors for a preferred output. Knowing which feature values are essential to obtain a specific output gives insight into the model’s mechanisms and allows us to expose potentially unethical biases in its internal logic.
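To make the idea concrete, the following is a minimal sketch, not the procedure used in this work. It assumes a small discrete feature space that can be enumerated exhaustively, and it treats the canonical set simply as the set of all inputs that yield the preferred output. The classifier `toy_model`, the feature names, and the helpers `canonical_set` and `essential_values` are hypothetical and exist only for illustration; the gender bias is planted deliberately so that the analysis has something to find.

```python
from itertools import product

# Hypothetical discrete feature domains (illustrative only, not from the source).
FEATURES = {
    "income": ["low", "high"],
    "gender": ["female", "male"],
    "has_debt": [False, True],
}


def toy_model(x):
    """Hypothetical loan classifier with a deliberately planted gender bias."""
    score = 0
    score += 2 if x["income"] == "high" else 0
    score += 2 if x["gender"] == "male" else 0  # the planted bias
    score -= 1 if x["has_debt"] else 0
    return "approve" if score >= 3 else "reject"


def canonical_set(model, features, preferred_output):
    """Enumerate every input and keep those mapped to the preferred output.

    Assumption: the feature space is small enough for exhaustive search.
    """
    names = list(features)
    all_inputs = (dict(zip(names, values)) for values in product(*features.values()))
    return [x for x in all_inputs if model(x) == preferred_output]


def essential_values(inputs):
    """Return the feature values shared by every input in the set."""
    if not inputs:
        return {}
    first = inputs[0]
    return {
        name: value
        for name, value in first.items()
        if all(x[name] == value for x in inputs)
    }


approved = canonical_set(toy_model, FEATURES, "approve")
essential = essential_values(approved)
print(essential)
```

In this toy setting every approved input has `gender == "male"`, so the sensitive attribute shows up among the essential feature values, while `has_debt` varies across the set and is therefore not essential. For realistic models the input space is far too large to enumerate, so a search or optimization procedure would have to replace the exhaustive loop.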