IMDb Benchmark
90.15%
TF-IDF baseline test accuracy and macro F1 on IMDb with a reproducible saved inference artifact.
NLP · April 2026
Rebuilt an earlier IMDb sentiment project into a customer-feedback intelligence system with benchmarking, transfer checks, and a public full-batch triage dashboard on Hugging Face Spaces.
Reproducible benchmark: built a clean IMDb pipeline with deterministic sampling, TF-IDF + Logistic Regression as the active saved baseline, and a RoBERTa fine-tuning path for later higher-capacity runs.
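The baseline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `train_baseline` function, its hyperparameters, and the artifact path are all assumptions; the key idea is that a fixed seed makes the split deterministic and the fitted pipeline is saved as a single reusable inference artifact.

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

def train_baseline(texts, labels, seed=42):
    """Train a TF-IDF + Logistic Regression sentiment baseline.

    A fixed `seed` keeps the train/test split (and the solver)
    deterministic, so the reported score is reproducible run to run.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000, random_state=seed)),
    ])
    pipe.fit(X_train, y_train)
    return pipe, pipe.score(X_test, y_test)

# Persist the whole fitted pipeline as one inference artifact
# (path is illustrative):
# pipe, acc = train_baseline(imdb_texts, imdb_labels)
# joblib.dump(pipe, "models/imdb_tfidf_logreg.joblib")
```

Saving the full pipeline (vectorizer plus classifier) rather than the classifier alone is what lets the dashboard and transfer checks reuse it with raw text input.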
Transfer checks: evaluated the IMDb-trained model on Amazon polarity reviews and on a fixed local 200-example customer-feedback evaluation set to see how far the benchmark generalizes without retraining.
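A transfer check like this reduces to scoring the frozen pipeline on an out-of-domain set. The sketch below assumes a scikit-learn-style pipeline; `transfer_check` and the dataset variable names in the comments are hypothetical.

```python
from sklearn.metrics import accuracy_score, f1_score

def transfer_check(pipeline, texts, labels):
    """Score an already-trained sentiment pipeline on an out-of-domain
    evaluation set without any retraining (zero-shot transfer)."""
    preds = pipeline.predict(texts)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
    }

# Illustrative usage: run the IMDb-trained pipeline against Amazon
# polarity reviews and a fixed local customer-feedback set:
# print(transfer_check(imdb_pipe, amazon_texts, amazon_labels))
# print(transfer_check(imdb_pipe, feedback_texts, feedback_labels))
```

Keeping the evaluation set fixed (the 200 local examples) is what makes drops in these numbers attributable to domain shift rather than sampling noise.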
Dashboard product surface: added a Gradio interface that accepts pasted text or uploads, preserves metadata like `channel` and `product`, scores the whole batch, exports the filtered results as CSV, and is now deployed publicly on Hugging Face Spaces.
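The batch-scoring and export step behind that interface can be sketched as below. This is a simplified stand-in for the Gradio app's internals, assuming a pipeline with `predict_proba` and a pandas table; the column names (`text`, `p_positive`, `sentiment`) are illustrative, not the app's actual schema.

```python
import io

import pandas as pd

def score_batch(pipeline, df, text_col="text", threshold=0.5):
    """Score every row of an uploaded feedback table while keeping
    metadata columns such as `channel` and `product` intact."""
    out = df.copy()
    proba = pipeline.predict_proba(out[text_col])
    out["p_positive"] = proba[:, 1]
    out["sentiment"] = (out["p_positive"] >= threshold).map(
        {True: "positive", False: "negative"}
    )
    return out

def export_filtered_csv(scored, sentiment=None):
    """Serialize (optionally filtered) scored rows as CSV text,
    mirroring the dashboard's download action."""
    if sentiment is not None:
        scored = scored[scored["sentiment"] == sentiment]
    buf = io.StringIO()
    scored.to_csv(buf, index=False)
    return buf.getvalue()
```

Copying the input frame and only appending columns is the simple way to guarantee metadata like `channel` and `product` survives the round trip from upload to exported CSV.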
Triage and summarization: layered confidence, uncertainty, manual-review gating, priority scoring, and exploratory theme clustering on top of raw sentiment predictions, so the tool works as an analyst triage surface rather than a bare model-inspection demo.
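One way the confidence, review-gating, and priority signals could be derived from a single positive-class probability is sketched below. The exact formulas and thresholds here are assumptions for illustration, not the project's actual scoring rules.

```python
def triage(p_positive, review_band=0.15):
    """Turn a positive-class probability into analyst-facing triage signals."""
    # Confidence: 0 at the decision boundary (0.5), 1 at the extremes.
    confidence = abs(p_positive - 0.5) * 2
    # Predictions inside the uncertainty band are gated to manual review.
    needs_review = abs(p_positive - 0.5) < review_band
    # Confident negatives get the highest priority, so likely
    # complaints surface first in the analyst queue.
    priority = (1.0 - p_positive) * confidence
    return {
        "confidence": confidence,
        "needs_review": needs_review,
        "priority": priority,
    }
```

Under this scheme a review at p_positive = 0.05 outranks one at 0.4: both lean negative, but only the first is negative with high confidence.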
Amazon Transfer
Zero-shot accuracy on Amazon polarity, showing the IMDb-trained model transfers partially but not perfectly to product reviews.
Product Surface
Full-batch customer-feedback dashboard is deployed publicly on Hugging Face Spaces rather than staying as a local-only Gradio app.
Project evolution: this started from an older IMDb sentiment and mixture-of-experts (MoE) direction, but I intentionally rebuilt it into a cleaner, product-facing story centered on reusable inference and customer-feedback analysis.
What the current results mean: the IMDb benchmark is strong enough to yield a useful starting sentiment model, while the Amazon and local customer-feedback checks make the cross-domain limits explicit instead of hiding them.
Why it matters: the project now connects benchmarking, transfer evaluation, and a usable public dashboard surface in one coherent workflow rather than stopping at model training.