Stress-test a model decision before trusting it.
Inspect how a transformer sentiment classifier makes decisions using token attribution, counterfactual testing, and dual-model disagreement.
Sentiment is the test task. The project is about model behavior.
- WHY shows which words influenced the model's verdict.
- FLIP checks whether removing one influential word changes the verdict.
- DISAGREE checks whether a second model trained on different text reaches the same conclusion.
Explainable AI makes it possible to inspect how a model reached a decision, rather than showing only the final output.
Which tokens pushed the model toward its prediction, and by how much.
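The idea behind WHY can be sketched without the real model: score the text, then re-score it with each token removed and record the change. The `toy_positive_proba` lexicon scorer below is a hypothetical stand-in for the transformer classifier, and leave-one-out removal is a simplified stand-in for LIME's sampling:

```python
def toy_positive_proba(text: str) -> float:
    """Toy sentiment scorer: fraction of known polarity words that are positive.
    A stand-in for the real classifier's positive-class probability."""
    positive = {"great", "good", "love", "excellent"}
    negative = {"bad", "awful", "hate", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def token_attributions(text: str, proba=toy_positive_proba):
    """Influence of each token = baseline score minus score with that token removed."""
    tokens = text.split()
    baseline = proba(text)
    scores = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append((tokens[i], baseline - proba(reduced)))
    return scores
```

Positive attributions mark tokens that pushed the score up; negative attributions mark tokens that pulled it down.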
Remove the highest-impact token and measure whether the prediction label changes.
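A minimal sketch of the FLIP check, assuming only a scoring function from the classifier (`score_positive` below is a toy lexicon scorer, not the real model):

```python
def score_positive(text: str) -> float:
    """Toy stand-in for the classifier's positive-class probability."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def flip_test(text: str, proba=score_positive):
    """FLIP: delete the single most influential token and report
    whether the predicted label (threshold 0.5) changes."""
    tokens = text.split()
    baseline = proba(text)

    def drop(i):
        return " ".join(tokens[:i] + tokens[i + 1:])

    # pick the token whose removal moves the score the most
    _, best = max((abs(baseline - proba(drop(i))), i) for i in range(len(tokens)))
    counterfactual = drop(best)
    flipped = (baseline >= 0.5) != (proba(counterfactual) >= 0.5)
    return tokens[best], counterfactual, flipped
```

This is the one-inference-per-word loop described above: each candidate deletion costs one forward pass.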
DistilBERT (formal) vs RoBERTa-Twitter (informal). Divergence reveals linguistic ambiguity.
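The DISAGREE check itself is simple: run the same text through both classifiers and flag any label mismatch. The two lexicon functions below are toy stand-ins for DistilBERT-SST2 and Twitter-RoBERTa (the real app would call two transformers pipelines); the informal one "knows" slang negatives the formal one misses:

```python
def formal_label(text: str) -> str:
    """Stand-in for a model trained on formal review text."""
    negative = {"bad", "poor", "dull"}
    return "NEGATIVE" if set(text.lower().split()) & negative else "POSITIVE"

def informal_label(text: str) -> str:
    """Stand-in for a model trained on tweets: also covers slang negatives."""
    negative = {"bad", "poor", "dull", "mid", "meh"}
    return "NEGATIVE" if set(text.lower().split()) & negative else "POSITIVE"

def disagree(text: str) -> dict:
    """Run both classifiers and report whether their labels diverge."""
    a, b = formal_label(text), informal_label(text)
    return {"formal": a, "informal": b, "disagree": a != b}
```

An input like "that film was mid" splits the two stand-ins the same way informal slang can split the real models.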
DistilBERT-SST2 is trained on formal movie-review sentiment text (SST-2 dataset). Twitter-RoBERTa is trained on 124 million tweets and handles informal tone, sarcasm, and slang.
These two models were chosen because their different training domains can produce genuine disagreement on ambiguous or informal text.
LIME is a model-agnostic local explanation method. It perturbs the input and observes how predictions change, estimating token-level influence without accessing model internals.
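LIME's core loop can be illustrated in miniature: draw random token masks, query the model once per perturbation, and estimate each token's influence as the mean score when the token is kept minus the mean score when it is dropped. This averaging is a simplified stand-in for LIME's weighted linear fit, and `toy_proba` is a hypothetical lexicon scorer, not the real classifier:

```python
import random

def toy_proba(text: str) -> float:
    """Toy positive-class score standing in for the transformer."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def lime_style_weights(text, proba=toy_proba, num_samples=300, seed=0):
    """Perturbation-based attribution: for each token, mean score over
    samples that keep it minus mean score over samples that drop it."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [[] for _ in tokens]
    dropped = [[] for _ in tokens]
    for _ in range(num_samples):
        mask = [rng.random() < 0.5 for _ in tokens]
        if not any(mask):
            continue  # skip the empty perturbation
        sample = " ".join(t for t, keep in zip(tokens, mask) if keep)
        score = proba(sample)  # one model call per perturbation
        for i, keep in enumerate(mask):
            (kept if keep else dropped)[i].append(score)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {t: mean(kept[i]) - mean(dropped[i]) for i, t in enumerate(tokens)}
```

The `num_samples=300` default mirrors the perturbation count quoted below, which is why each explanation costs hundreds of forward passes.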
SHAP was considered but rejected for this MVP: it is slower on transformers and expensive on free-tier CPU. Attention weights are not treated as reliable explanations (Jain and Wallace, 2019).
LIME runs approximately 300 perturbed model inference calls per explanation (15-45 seconds). FLIP reruns inference once per word in the input (2-10 seconds). DISAGREE runs two forward passes (under 1 second).
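The cost structure above is simple arithmetic: wall-clock time is roughly one forward pass per perturbation. A sketch, with the per-call latency as an assumed parameter (on free-tier CPU, roughly 0.05-0.15 s per pass would reproduce the quoted 15-45 s window):

```python
def explanation_budget(num_perturbations: int, seconds_per_call: float) -> float:
    """Rough wall-clock estimate for one explanation:
    every perturbation costs one full forward pass."""
    return num_perturbations * seconds_per_call
```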
All inference runs on CPU. The backend is deployed on Hugging Face Spaces free tier, which does not guarantee GPU availability.
LIME attribution is an approximation, not an exact causal explanation. Greedy word removal can leave ungrammatical text, and counterfactual removal does not always flip the label on highly confident predictions.
Input is limited to 1000 characters. The backend can have a 30-60 second cold start after inactivity. This tool is sentiment-specific and has only been tested on English text.