Stress-test a model decision before trusting it.
Inspect how a transformer sentiment classifier makes decisions using token attribution, counterfactual testing, and dual-model disagreement.
Sentiment is the test task. The project is about model behavior.
- WHY shows which words influenced the model's verdict.
- FLIP checks whether removing one influential word changes the verdict.
- DISAGREE checks whether a second model trained on different text reaches the same conclusion.
Explainable AI makes it possible to inspect how a model reached a decision, rather than showing only the final output.
Which tokens pushed the model toward its prediction, and by how much.
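The idea behind WHY can be sketched without the real model: score the text, then re-score it with each token removed and record the change. The `toy_positive_proba` lexicon scorer below is a hypothetical stand-in for the transformer classifier, and leave-one-out removal is a simplified stand-in for LIME's sampling:

```python
def toy_positive_proba(text: str) -> float:
    """Toy sentiment scorer: fraction of known polarity words that are positive.
    A stand-in for the real classifier's positive-class probability."""
    positive = {"great", "good", "love", "excellent"}
    negative = {"bad", "awful", "hate", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def token_attributions(text: str, proba=toy_positive_proba):
    """Influence of each token = baseline score minus score with that token removed."""
    tokens = text.split()
    baseline = proba(text)
    scores = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append((tokens[i], baseline - proba(reduced)))
    return scores
```

Positive attributions mark tokens that pushed the score up; negative attributions mark tokens that pulled it down.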
Remove the highest-impact token and measure whether the prediction label changes.
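A minimal sketch of the FLIP check, assuming only a scoring function from the classifier (`score_positive` below is a toy lexicon scorer, not the real model):

```python
def score_positive(text: str) -> float:
    """Toy stand-in for the classifier's positive-class probability."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def flip_test(text: str, proba=score_positive):
    """FLIP: delete the single most influential token and report
    whether the predicted label (threshold 0.5) changes."""
    tokens = text.split()
    baseline = proba(text)

    def drop(i):
        return " ".join(tokens[:i] + tokens[i + 1:])

    # pick the token whose removal moves the score the most
    _, best = max((abs(baseline - proba(drop(i))), i) for i in range(len(tokens)))
    counterfactual = drop(best)
    flipped = (baseline >= 0.5) != (proba(counterfactual) >= 0.5)
    return tokens[best], counterfactual, flipped
```

This is the one-inference-per-word loop described above: each candidate deletion costs one forward pass.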
DistilBERT (formal) vs RoBERTa-Twitter (informal). Divergence reveals linguistic ambiguity.
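The DISAGREE check itself is simple: run the same text through both classifiers and flag any label mismatch. The two lexicon functions below are toy stand-ins for DistilBERT-SST2 and Twitter-RoBERTa (the real app would call two transformers pipelines); the informal one "knows" slang negatives the formal one misses:

```python
def formal_label(text: str) -> str:
    """Stand-in for a model trained on formal review text."""
    negative = {"bad", "poor", "dull"}
    return "NEGATIVE" if set(text.lower().split()) & negative else "POSITIVE"

def informal_label(text: str) -> str:
    """Stand-in for a model trained on tweets: also covers slang negatives."""
    negative = {"bad", "poor", "dull", "mid", "meh"}
    return "NEGATIVE" if set(text.lower().split()) & negative else "POSITIVE"

def disagree(text: str) -> dict:
    """Run both classifiers and report whether their labels diverge."""
    a, b = formal_label(text), informal_label(text)
    return {"formal": a, "informal": b, "disagree": a != b}
```

An input like "that film was mid" splits the two stand-ins the same way informal slang can split the real models.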
DistilBERT-SST2 is trained on formal movie-review sentiment text (SST-2 dataset). Twitter-RoBERTa is trained on 124 million tweets and handles informal tone, sarcasm, and slang.
These two models were chosen because their different training domains can produce genuine disagreement on ambiguous or informal text.
LIME is a model-agnostic local explanation method. It perturbs the input and observes how predictions change, estimating token-level influence without accessing model internals.
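LIME's core loop can be illustrated in miniature: draw random token masks, query the model once per perturbation, and estimate each token's influence as the mean score when the token is kept minus the mean score when it is dropped. This averaging is a simplified stand-in for LIME's weighted linear fit, and `toy_proba` is a hypothetical lexicon scorer, not the real classifier:

```python
import random

def toy_proba(text: str) -> float:
    """Toy positive-class score standing in for the transformer."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "boring"}
    words = text.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return 0.5 if pos + neg == 0 else pos / (pos + neg)

def lime_style_weights(text, proba=toy_proba, num_samples=300, seed=0):
    """Perturbation-based attribution: for each token, mean score over
    samples that keep it minus mean score over samples that drop it."""
    rng = random.Random(seed)
    tokens = text.split()
    kept = [[] for _ in tokens]
    dropped = [[] for _ in tokens]
    for _ in range(num_samples):
        mask = [rng.random() < 0.5 for _ in tokens]
        if not any(mask):
            continue  # skip the empty perturbation
        sample = " ".join(t for t, keep in zip(tokens, mask) if keep)
        score = proba(sample)  # one model call per perturbation
        for i, keep in enumerate(mask):
            (kept if keep else dropped)[i].append(score)

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {t: mean(kept[i]) - mean(dropped[i]) for i, t in enumerate(tokens)}
```

The `num_samples=300` default mirrors the perturbation count quoted below, which is why each explanation costs hundreds of forward passes.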
SHAP was considered but rejected for this MVP: it is slower on transformers and expensive on free-tier CPU. Attention weights are not treated as reliable explanations (Jain and Wallace, 2019).
LIME runs approximately 300 perturbed model inference calls per explanation (15-45 seconds). FLIP reruns inference once per word in the input (2-10 seconds). DISAGREE runs two forward passes (under 1 second).
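The cost structure above is simple arithmetic: wall-clock time is roughly one forward pass per perturbation. A sketch, with the per-call latency as an assumed parameter (on free-tier CPU, roughly 0.05-0.15 s per pass would reproduce the quoted 15-45 s window):

```python
def explanation_budget(num_perturbations: int, seconds_per_call: float) -> float:
    """Rough wall-clock estimate for one explanation:
    every perturbation costs one full forward pass."""
    return num_perturbations * seconds_per_call
```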
All inference runs on CPU. The backend is deployed on Hugging Face Spaces free tier, which does not guarantee GPU availability.
LIME attribution is an approximation, not an exact causal explanation. Greedy word removal can leave ungrammatical text, and counterfactual removal does not always flip the label on highly confident predictions.
Input is limited to 1000 characters. The backend can have a 30-60 second cold start after inactivity. This tool is sentiment-specific and has only been tested on English text.