AI-powered stress testing for LLM chatbots. Test across 7 dimensions, run multi-turn adversarial probes, and get structured reliability reports — all from a single config file.
Most teams find out about LLM failures from angry users. Grill finds them first.
Your bot invents facts that customers believe. Wrong pricing, fake features, made-up policies — all delivered with perfect confidence.
Ask the same question two different ways, get two different answers. Users notice. Trust erodes. Support tickets pile up.
Prompt injection, jailbreak attacks, data leaks. One viral screenshot of your bot misbehaving can wreck months of work.
From docs to diagnosis in four steps. No ML expertise required.
1. Point Grill at your knowledge base — product docs, FAQs, policies. Any Markdown or text file.
2. Claude automatically generates factual, inferential, boundary, and adversarial test questions.
3. Grill runs multi-turn adversarial conversations: consistency checks, contradictions, jailbreak attempts, escalations.
4. You get scored results across every dimension, with red flags highlighted and an actionable fix suggestion for every failure.
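The four steps above are all driven by the single config file mentioned up top. A minimal sketch of what `tests.yaml` might look like — the key names here are illustrative assumptions, not the documented schema:

```yaml
# Hypothetical tests.yaml — key names are illustrative, not the documented schema
knowledge_base:
  - ./docs/**/*.md          # product docs, FAQs, policies (Markdown or text)
target:
  url: https://chatbot.example.com/api/chat
dimensions:
  - accuracy
  - hallucination
  - consistency
test_generation:
  types: [factual, inferential, boundary, adversarial]
probes:
  multi_turn: true          # consistency checks, contradictions, jailbreaks, escalations
```

The `accuracy` and `hallucination` dimension names appear in the CLI examples below; everything else is a guess at shape, not a spec.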
Every response is evaluated across the dimensions that matter most for production chatbots.
Accuracy: Does the bot answer correctly based on your knowledge base?
Hallucination: Does it invent facts that aren't in the source material?
Consistency: Same question, different phrasing — does it give the same answer?
Tone: Is it professional, friendly, and on-brand — even under pressure?
Security: Can it resist jailbreaks, prompt injection, and data leaks?
Edge cases: How does it handle ambiguous, incomplete, or out-of-scope questions?
Latency: Does it respond within your acceptable time thresholds?
Every test result includes confidence levels, reasoning, and suggested fixes.
Full stress test — 47 tests, 5 probe strategies
Hallucination detected (0.92 confidence)
Bot claimed a "Team plan at $19/mo" that doesn't exist in the knowledge base
Jailbreak vulnerability
Bot leaked system prompt contents when asked with "DAN" technique
"What are your pricing plans?"
FAIL: Bot said: "We offer Free, Team ($19/mo), Pro ($12/mo), and Enterprise plans."
Expected: "Free ($0), Pro ($12/user/mo), Enterprise ($29/user/mo)"
Issue: Hallucinated a "Team" plan that doesn't exist
Fix: Retrain on accurate pricing tier information. Add pricing to retrieval-augmented context.
Three commands. That's it.
$ npm install -g grill
$ grill run --config tests.yaml
# Or evaluate a single response:
$ grill eval \
--question "Refund policy?" \
--answer "30 days" \
--dimensions accuracy,hallucination
No node_modules bloat. A minimal supply-chain surface. Just one clean binary.
If your chatbot has an HTTP endpoint, Grill can test it. Custom body templates, response paths, auth headers.
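As a sketch of how those three pieces might fit together in the config — field names below are assumptions for illustration, not the documented schema:

```yaml
# Hypothetical target config — field names are illustrative assumptions
target:
  url: https://chatbot.example.com/v1/chat
  headers:
    Authorization: "Bearer ${CHATBOT_API_KEY}"    # auth header, pulled from the environment
  body_template: '{"messages":[{"role":"user","content":"{{question}}"}]}'
  response_path: "choices.0.message.content"      # where the answer lives in the JSON reply
```

The idea: a body template tells Grill how to wrap each generated question, and a response path tells it where to find the answer to judge.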
Uses Claude to judge responses with human-level understanding. Confidence scores and reasoning for every verdict.
Machine-readable JSON for CI/CD integration. Beautiful Markdown for humans. Both in every run.
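Because every run emits machine-readable JSON, wiring Grill into a pipeline is a short step. A hedged GitHub Actions sketch — the report output path is an assumption, not documented behavior:

```yaml
# Hypothetical CI workflow steps — report path is an assumption
- name: Stress-test the chatbot
  run: |
    npm install -g grill
    grill run --config tests.yaml
- name: Archive reliability reports
  uses: actions/upload-artifact@v4
  with:
    name: grill-reports
    path: reports/    # assumed location of the JSON and Markdown reports
```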
"Caught 12 hallucinations in our support bot that manual testing missed completely. One of them was telling customers we had a feature we deprecated 6 months ago."
Jordan Lee
Product Lead, Series B SaaS
"We run Grill in CI before every deploy. It's like unit tests for our chatbot. Found a jailbreak vulnerability that would have been embarrassing in production."
Samira Patel
CTO, AI startup
"The multi-turn probing is incredible. It found consistency issues we never would have caught — our bot was giving different refund policies depending on how you asked."
Marcus Kim
Head of Support, Enterprise
Join the beta and be the first to know when the web dashboard launches. First 100 users get 3 months free.
No spam, ever. Unsubscribe anytime.