Overview
Avido provides six evaluation types to measure different aspects of AI quality. Each evaluation is applied to tasks and runs automatically when those tasks execute, providing actionable insights when quality standards aren’t met.

| Evaluation Type | Purpose | Score Range | Pass Threshold |
|---|---|---|---|
| Naturalness | Human-like communication quality | 1-5 | 3.5 |
| Style | Brand guideline compliance | 1-5 | 3.5 |
| Recall | RAG pipeline performance | 0-1 | 0.5 |
| Fact Checker | Factual accuracy vs ground truth | 0-1 | 0.8 |
| Custom | Domain-specific criteria | 0-1 | 0.5 |
| Output Match | Deterministic output validation | 0-1 | 0.8 |
Naturalness
Measures how natural, engaging, and clear your AI’s responses are to users.
What It Evaluates
The Naturalness evaluation assesses five dimensions of response quality:
- Coherence – Logical flow and consistency of ideas
- Engagingness – Ability to capture and maintain user interest
- Naturalness – Human-like language and tone
- Relevance – On-topic responses that address the user’s intent
- Clarity – Clear, understandable language without ambiguity
How It Works
An LLM scores your AI’s response on each of the five dimensions using a 1-5 scale. The overall score is the average of the five dimension scores.
Pass Criteria:
- All five dimensions must score ≥ 3.5
- Average score ≥ 3.5
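The two pass criteria above can be sketched in a few lines. This is an illustrative sketch, not Avido's implementation: the function name, the dictionary keys, and the constant names are assumptions made for the example.

```python
# Hypothetical names; only the 3.5 threshold and the five dimensions come from the text above.
NATURALNESS_DIMENSIONS = ("coherence", "engagingness", "naturalness", "relevance", "clarity")
NATURALNESS_THRESHOLD = 3.5


def naturalness_result(scores):
    """Return (overall, passed) for a dict of per-dimension scores on a 1-5 scale."""
    values = [scores[name] for name in NATURALNESS_DIMENSIONS]
    overall = sum(values) / len(values)
    # Both criteria must hold: every dimension and the average reach the threshold.
    passed = overall >= NATURALNESS_THRESHOLD and all(
        value >= NATURALNESS_THRESHOLD for value in values
    )
    return overall, passed


overall, passed = naturalness_result(
    {"coherence": 5, "engagingness": 5, "naturalness": 5, "relevance": 5, "clarity": 2}
)
print(overall, passed)  # 4.4 False: the average is high, but Clarity is below 3.5
```

A single weak dimension fails the evaluation even when the average is well above the threshold.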
Example Results
| Coherence | Engagingness | Naturalness | Relevance | Clarity | Overall | Result |
|---|---|---|---|---|---|---|
| 5 | 5 | 5 | 5 | 5 | 5.0 | ✅ Pass |
| 4 | 4 | 4 | 4 | 4 | 4.0 | ✅ Pass |
| 5 | 5 | 5 | 5 | 2 | 4.4 | ❌ Fail (Clarity < 3.5) |
When to Use
- Conversational AI and chatbots
- Customer support automation
- Content generation systems
- Any user-facing AI interactions
Style
Evaluates whether responses adhere to your organization’s style guidelines and brand voice.
What It Evaluates
A single comprehensive score (1-5) based on your custom style guide, measuring:
- Tone and voice consistency
- Terminology usage
- Format and structure requirements
- Brand-specific guidelines
- Reading level and complexity
How It Works
You provide a style guide document that defines your brand’s communication standards. An LLM evaluates each response against this guide and provides:
- A score from 1-5
- Detailed analysis explaining the rating
Pass Criteria:
- Score ≥ 3.5
Example Style Guide Elements
When to Use
- Brand-critical communications
- Multi-channel consistency (chat, email, voice)
- Customer-facing applications where brand matters
Recall (RAG Evaluation)
Comprehensive evaluation of Retrieval-Augmented Generation (RAG) pipeline quality.
What It Evaluates
Four metrics that measure different aspects of RAG performance:
- Context Relevancy – Are retrieved documents relevant to the query?
- Context Precision – How well-ranked are the retrieved documents?
- Faithfulness – Is the answer grounded in the retrieved context?
- Answer Relevancy – Does the answer address the user’s question?
How It Works
Each metric produces a score from 0-1 (higher is better). The overall score is the average of Context Precision, Faithfulness, and Answer Relevancy.
Pass Criteria:
- Context Precision ≥ 0.5
- Faithfulness ≥ 0.5
- Answer Relevancy ≥ 0.5
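As a rough illustration of the scoring described above, the sketch below averages the three scored metrics and lists any that fall below 0.5. The function name, metric keys, and sample scores are hypothetical. Context Relevancy is reported but, per the text, is not part of the overall score or the pass criteria.

```python
# Hypothetical names; the 0.5 threshold and the three scored metrics come from the text above.
RECALL_THRESHOLD = 0.5
SCORED_METRICS = ("context_precision", "faithfulness", "answer_relevancy")


def recall_result(metrics):
    """Return (overall, failing_metrics) for a dict of 0-1 metric scores."""
    overall = sum(metrics[name] for name in SCORED_METRICS) / len(SCORED_METRICS)
    failing = [name for name in SCORED_METRICS if metrics[name] < RECALL_THRESHOLD]
    return overall, failing


# Made-up sample scores for illustration only.
sample = {
    "context_relevancy": 0.9,  # reported, but not part of the overall score
    "context_precision": 0.7,
    "faithfulness": 0.4,
    "answer_relevancy": 0.8,
}
overall, failing = recall_result(sample)
print(round(overall, 2), failing)  # 0.63 ['faithfulness']
```

Here the evaluation fails because Faithfulness is below 0.5, even though the overall score is above it.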
Score Interpretation
| Score Range | Interpretation | Action Required |
|---|---|---|
| 0.8 - 1.0 | Excellent performance | Monitor |
| 0.5 - 0.8 | Acceptable quality | Optimize if critical |
| 0.0 - 0.5 | Poor performance | Investigate immediately |
Common Issues and Solutions
| Low Metric | Likely Cause | Solution |
|---|---|---|
| Context Precision | Too many irrelevant chunks retrieved | Reduce top_k, improve filters |
| Context Relevancy | Embedding/index drift | Retrain embeddings, update index |
| Faithfulness | Model hallucinating | Add grounding instructions, reduce temperature |
| Answer Relevancy | Answer drifts off-topic | Improve prompt focus, add constraints |
When to Use
- Knowledge base search and retrieval
- Document Q&A systems
- RAG pipelines
- Any system combining retrieval with generation
Fact Checker
Validates factual accuracy of AI responses against ground truth.
What It Evaluates
Compares AI-generated statements with known correct information, classifying each statement as one of:
- True Positives (TP) – Correct facts present in the response
- False Positives (FP) – Incorrect facts in the response
- False Negatives (FN) – Correct facts omitted from the response
How It Works
An LLM extracts factual statements from both the AI response and the ground truth, then classifies them. The F1 score computed from the TP, FP, and FN counts measures accuracy.
Pass Criteria:
- F1 score ≥ 0.8
Example Classification
Question: “What powers the sun?”
Ground Truth: “The sun is powered by nuclear fusion. In its core, hydrogen atoms fuse to form helium, releasing tremendous energy.”
AI Response: “The sun is powered by nuclear fission, similar to nuclear reactors, and provides light to the solar system.”
Classification:
- TP: [“Provides light to the solar system”]
- FP: [“Powered by nuclear fission”, “Similar to nuclear reactors”]
- FN: [“Powered by nuclear fusion”, “Hydrogen fuses to form helium”]
- F1 Score: 0.33 (1 TP, 2 FP, 2 FN) → ❌ Fail
Score Examples
| TP | FP | FN | F1 Score | Result | Notes |
|---|---|---|---|---|---|
| 5 | 0 | 0 | 1.0 | ✅ Pass | Perfect accuracy |
| 5 | 0 | 1 | 0.91 | ✅ Pass | Minor omission acceptable |
| 5 | 1 | 0 | 0.91 | ✅ Pass | Minor error acceptable |
| 4 | 1 | 1 | 0.8 | ✅ Pass | Boundary case |
| 3 | 0 | 2 | 0.75 | ❌ Fail | Too many omissions |
| 1 | 4 | 0 | 0.33 | ❌ Fail | Mostly incorrect |
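The page does not spell out the F1 formula. The sketch below assumes the standard definition, F1 = 2·TP / (2·TP + FP + FN), which reproduces the scores in the table above. The function name and the choice to return 0.0 when there are no statements at all are assumptions made for illustration.

```python
FACT_CHECK_THRESHOLD = 0.8  # pass threshold stated for the Fact Checker


def f1_from_counts(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, and false-negative counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # assumption: no statements on either side scores 0
    return 2 * tp / denominator


for tp, fp, fn in [(5, 0, 0), (5, 0, 1), (3, 0, 2), (1, 4, 0)]:
    score = f1_from_counts(tp, fp, fn)
    verdict = "Pass" if score >= FACT_CHECK_THRESHOLD else "Fail"
    print(tp, fp, fn, round(score, 2), verdict)
# 5 0 0 1.0 Pass
# 5 0 1 0.91 Pass
# 3 0 2 0.75 Fail
# 1 4 0 0.33 Fail
```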
When to Use
- Financial data and calculations
- Medical or legal information
- Product specifications and features
- Any domain where factual accuracy is critical
Custom
Create domain-specific evaluations for your unique business requirements.
What It Evaluates
Whatever you define in a custom criterion. Common use cases:
- Regulatory compliance checks
- Schema or format validation
- Latency or performance SLAs
- Business logic requirements
- Security and privacy rules
How It Works
You provide a criterion describing what to check. An LLM evaluates the response and returns:
- Binary pass/fail (1 or 0)
- Reasoning explaining the decision
Pass Criteria:
- Score = 1 (criterion met)
Example Criteria
When to Use
- Industry-specific compliance requirements
- Custom business rules and workflows
- Structured output validation
- Security and privacy checks
- Chatbot safety and boundaries
- Any evaluation not covered by built-in types
Output Match
Deterministic validation of AI outputs against expected values, without using an LLM judge.
What It Evaluates
Compares your AI’s actual output against an expected value you define per task. Unlike other evaluations that use LLM judgment, Output Match performs exact comparison, which makes results fully reproducible and deterministic. Two comparison modes are available:
- String mode – Exact string match between output and expected value
- List mode – Compare lists of values with flexible matching strategies
How It Works
String Mode
The AI’s response is compared directly against the expected string. Optionally, a regex extraction pattern can be applied first to pull a specific value from the response before comparison.
Pass Criteria:
- Extracted (or full) output exactly matches the expected string
- Score = 1 (match) or 0 (mismatch)
Example: If your AI responds with "The order status is: SHIPPED" and you configure:
- Extract pattern: status is: (\w+) (capture group 1)
- Expected: SHIPPED
The evaluation extracts SHIPPED from the response and compares it to the expected value → ✅ Pass.
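The extract-then-compare behavior described above can be approximated with Python's `re` module. This is a sketch under assumptions, not Avido's implementation: the function name is hypothetical, and scoring 0 when the pattern does not match is an assumption the page does not confirm.

```python
import re


def string_match_score(output, expected, extract=None):
    """Return 1 if the (optionally extracted) output equals expected exactly, else 0."""
    value = output
    if extract is not None:
        match = re.search(extract, output)
        if match is None:
            return 0  # assumption: a pattern that finds nothing counts as a mismatch
        # Use capture group 1 when the pattern defines one, otherwise the whole match.
        value = match.group(1) if match.groups() else match.group(0)
    return 1 if value == expected else 0


response = "The order status is: SHIPPED"
print(string_match_score(response, "SHIPPED", extract=r"status is: (\w+)"))  # 1
print(string_match_score("shipped", "SHIPPED"))  # 0, the comparison is case-sensitive
```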
List Mode
The AI’s response is parsed as a list and compared against an expected list of values. Two matching strategies are available:
Exact Unordered – Both lists must contain exactly the same items (order doesn’t matter).
Pass Criteria:
- Score = 1 (exact match) or 0 (mismatch)
Contains – The output is scored on how well its items overlap with the expected list, using one of three metrics:
- Precision – What fraction of the output items are correct?
- Recall – What fraction of the expected items are present?
- F1 – Harmonic mean of precision and recall
Pass Criteria:
- Score ≥ 0.8 (default, configurable per evaluation)
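A minimal sketch of the two list strategies follows. The function names are hypothetical, and how duplicates are handled is an assumption: exact-unordered compares item counts, and the overlap metrics treat each list as a set. F1 is computed as 2 × hits / (output size + expected size), which is algebraically the harmonic mean of precision and recall and avoids floating-point drift at the 0.8 boundary.

```python
from collections import Counter


def exact_unordered_score(output, expected):
    """1 if both lists hold the same items regardless of order, else 0."""
    return 1 if Counter(output) == Counter(expected) else 0


def contains_scores(output, expected):
    """Precision, recall, and F1 of the output items against the expected items."""
    out_items, exp_items = set(output), set(expected)
    hits = len(out_items & exp_items)
    precision = hits / len(out_items) if out_items else 0.0
    recall = hits / len(exp_items) if exp_items else 0.0
    total = len(out_items) + len(exp_items)
    f1 = 2 * hits / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = contains_scores(["a", "b"], ["a", "b", "c"])
print({name: round(value, 2) for name, value in scores.items()})
# {'precision': 1.0, 'recall': 0.67, 'f1': 0.8}
print(scores["f1"] >= 0.8)  # True
print(exact_unordered_score(["c", "a", "b"], ["a", "b", "c"]))  # 1
```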
Configuration
| Setting | Applies To | Description |
|---|---|---|
| Type | All | string or list — determines comparison mode |
| Expected | Per task | The expected output value (string) or values (list) |
| Match Mode | List only | exact_unordered or contains |
| Score Metric | List (contains) | precision, recall, or f1 (default: recall) |
| Pass Threshold | List only | Override the default 0.8 threshold (0-1) |
| Extract | Optional | Regex pattern to extract value(s) from the output before comparison |
Score Examples
String Mode
| Output | Expected | Result |
|---|---|---|
| SHIPPED | SHIPPED | ✅ Pass (score: 1.0) |
| shipped | SHIPPED | ❌ Fail (score: 0.0, case-sensitive) |
| PENDING | SHIPPED | ❌ Fail (score: 0.0) |
List Mode (Contains, F1)
| Output | Expected | Precision | Recall | F1 | Result |
|---|---|---|---|---|---|
| ["a", "b", "c"] | ["a", "b", "c"] | 1.0 | 1.0 | 1.0 | ✅ Pass |
| ["a", "b"] | ["a", "b", "c"] | 1.0 | 0.67 | 0.8 | ✅ Pass |
| ["a", "b", "d"] | ["a", "b", "c"] | 0.67 | 0.67 | 0.67 | ❌ Fail |
| ["a"] | ["a", "b", "c"] | 1.0 | 0.33 | 0.5 | ❌ Fail |
When to Use
- Structured output validation (JSON fields, status codes, categories)
- Classification tasks with known correct answers
- Extraction pipelines where output must match expected values
- Regression testing with deterministic expected outputs
- Any task where you need exact, reproducible pass/fail without LLM judgment
Best Practices
Combining Evaluations
Use multiple evaluation types together for comprehensive quality assurance. The right combination depends on what your specific task does:
- Knowledge Base Q&A (RAG): Recall + Fact Checker + Naturalness
- Creative Content Generation: Naturalness + Style + Fact Checker (if accuracy matters)
- Retrieval-Based Customer Support: Recall + Naturalness + Style + Custom (compliance)
- Direct Response (no retrieval): Naturalness + Style + Custom (compliance)
- Chatbot with Boundaries: Naturalness + Custom (safety/boundaries) + Custom (compliance)
- Structured Output: Output Match + Custom (business logic)
- Classification / Extraction: Output Match + Naturalness (if user-facing)
Issue Creation
When an evaluation fails, Avido automatically creates an issue with:
- Title – Evaluation type and failure summary
- Priority – HIGH, MEDIUM, or LOW based on severity
- Description – Scores, reasoning, and context
- Trace Link – Direct access to the full conversation
Need Help?
- Email – support@avidoai.com