AI-powered stress testing for LLM chatbots. Test across 7 dimensions, run multi-turn adversarial probes, and get structured reliability reports — all from a single config file.
Most teams find out about LLM failures from angry users. Grill finds them first.
Your bot invents facts that customers believe. Wrong pricing, fake features, made-up policies — all delivered with perfect confidence.
Ask the same question two different ways, get two different answers. Users notice. Trust erodes. Support tickets pile up.
Prompt injection, jailbreak attacks, data leaks. One viral screenshot of your bot misbehaving can wreck months of work.
From docs to diagnosis in four steps. No ML expertise required.
1. Point Grill at your knowledge base — product docs, FAQs, policies. Any Markdown or text file.
2. Claude automatically generates factual, inferential, boundary, and adversarial test questions.
3. Grill runs multi-turn adversarial conversations: consistency checks, contradictions, jailbreak attempts, escalations.
4. You get scored results across every dimension, with red flags highlighted and an actionable fix suggestion for every failure.
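The four steps above are all driven by the single config file mentioned up top. A minimal sketch of what `tests.yaml` might look like — the key names here are illustrative assumptions, not the documented schema:

```yaml
# Hypothetical tests.yaml — key names are illustrative, not the documented schema
knowledge_base:
  - ./docs/**/*.md          # product docs, FAQs, policies (Markdown or text)
target:
  url: https://chatbot.example.com/api/chat
dimensions:
  - accuracy
  - hallucination
  - consistency
test_generation:
  types: [factual, inferential, boundary, adversarial]
probes:
  multi_turn: true          # consistency checks, contradictions, jailbreaks, escalations
```

The `accuracy` and `hallucination` dimension names appear in the CLI examples below; everything else is a guess at shape, not a spec.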
Every response is evaluated across the dimensions that matter most for production chatbots.
Accuracy: Does the bot answer correctly based on your knowledge base?
Hallucination: Does it invent facts that aren't in the source material?
Consistency: Same question, different phrasing — does it give the same answer?
Tone: Is it professional, friendly, and on-brand — even under pressure?
Security: Can it resist jailbreaks, prompt injection, and data leaks?
Edge cases: How does it handle ambiguous, incomplete, or out-of-scope questions?
Latency: Does it respond within your acceptable time thresholds?
Every test result includes confidence levels, reasoning, and suggested fixes.
Full stress test — 47 tests, 5 probe strategies
Hallucination detected (0.92 confidence)
Bot claimed a "Team plan at $19/mo" that doesn't exist in the knowledge base
Jailbreak vulnerability
Bot leaked system prompt contents when asked with "DAN" technique
"What are your pricing plans?"
FAIL: Bot said: "We offer Free, Team ($19/mo), Pro ($12/mo), and Enterprise plans."
Expected: "Free ($0), Pro ($12/user/mo), Enterprise ($29/user/mo)"
Issue: Hallucinated a "Team" plan that doesn't exist
Fix: Retrain on accurate pricing tier information. Add pricing to retrieval-augmented context.
Three commands. That's it.
$ npm install -g grill
$ grill run --config tests.yaml
# Or evaluate a single response:
$ grill eval \
--question "Refund policy?" \
--answer "30 days" \
--dimensions accuracy,hallucination
No node_modules bloat. A minimal supply-chain surface. Just one clean binary.
If your chatbot has an HTTP endpoint, Grill can test it. Custom body templates, response paths, auth headers.
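As a sketch of how those three pieces might fit together in the config — field names below are assumptions for illustration, not the documented schema:

```yaml
# Hypothetical target config — field names are illustrative assumptions
target:
  url: https://chatbot.example.com/v1/chat
  headers:
    Authorization: "Bearer ${CHATBOT_API_KEY}"    # auth header, pulled from the environment
  body_template: '{"messages":[{"role":"user","content":"{{question}}"}]}'
  response_path: "choices.0.message.content"      # where the answer lives in the JSON reply
```

The idea: a body template tells Grill how to wrap each generated question, and a response path tells it where to find the answer to judge.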
Uses Claude to judge responses with human-level understanding. Confidence scores and reasoning for every verdict.
Machine-readable JSON for CI/CD integration. Beautiful Markdown for humans. Both in every run.
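Because every run emits machine-readable JSON, wiring Grill into a pipeline is a short step. A hedged GitHub Actions sketch — the report output path is an assumption, not documented behavior:

```yaml
# Hypothetical CI workflow steps — report path is an assumption
- name: Stress-test the chatbot
  run: |
    npm install -g grill
    grill run --config tests.yaml
- name: Archive reliability reports
  uses: actions/upload-artifact@v4
  with:
    name: grill-reports
    path: reports/    # assumed location of the JSON and Markdown reports
```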
"Caught 12 hallucinations in our support bot that manual testing missed completely. One of them was telling customers we had a feature we deprecated 6 months ago."
Jordan Lee
Product Lead, Series B SaaS
"We run Grill in CI before every deploy. It's like unit tests for our chatbot. Found a jailbreak vulnerability that would have been embarrassing in production."
Samira Patel
CTO, AI startup
"The multi-turn probing is incredible. It found consistency issues we never would have caught — our bot was giving different refund policies depending on how you asked."
Marcus Kim
Head of Support, Enterprise
Join the beta and be the first to know when the web dashboard launches. First 100 users get 3 months free.
No spam, ever. Unsubscribe anytime.