Evaluations - Avido

Overview

Avido provides six evaluation types, each measuring a different aspect of AI quality. You apply an evaluation to one or more tasks, and it then runs automatically whenever those tasks execute. When a result falls below the pass threshold, Avido raises an issue so you can act on it.

Evaluation Type	Purpose	Score Range	Pass Threshold
Naturalness	Human-like communication quality	1-5	3.5
Style	Brand guideline compliance	1-5	3.5
Recall	RAG pipeline performance	0-1	0.5
Fact Checker	Factual accuracy vs ground truth	0-1	0.8
Custom	Domain-specific criteria	0-1	0.5
Output Match	Deterministic output validation	0-1	0.8

Naturalness

Measures how natural, engaging, and clear your AI’s responses are to users.

What It Evaluates

The Naturalness evaluation assesses five dimensions of response quality:

Coherence – Logical flow and consistency of ideas
Engagingness – Ability to capture and maintain user interest
Naturalness – Human-like language and tone
Relevance – On-topic responses that address the user’s intent
Clarity – Clear, understandable language without ambiguity

How It Works

An LLM evaluates your AI’s response across all five dimensions on a 1-5 scale. The overall score is the average of these dimensions.

Pass Criteria:

All five dimensions must score ≥ 3.5
Average score ≥ 3.5

Because every dimension is checked individually, a response fails when any single dimension falls below 3.5, even if the overall average is high.

Example Results

Coherence	Engagingness	Naturalness	Relevance	Clarity	Overall	Result
5	5	5	5	5	5.0	✅ Pass
4	4	4	4	4	4.0	✅ Pass
5	5	5	5	2	4.4	❌ Fail (Clarity < 3.5)
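The pass rule above can be sketched in a few lines of Python. This is an illustration of the stated criteria only, not Avido's implementation; the function name, the constant names, and the dictionary layout are assumptions made for the example.

```python
# Dimension names come from the list above; everything else here
# (function name, constants, dict layout) is hypothetical.
DIMENSIONS = ("coherence", "engagingness", "naturalness", "relevance", "clarity")
PASS_THRESHOLD = 3.5


def naturalness_result(scores: dict) -> tuple:
    """Return (overall, passed, failing_dimensions) for one response."""
    # Overall score is the plain average of the five dimension scores.
    overall = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    # Every dimension must individually reach the threshold.
    failing = [d for d in DIMENSIONS if scores[d] < PASS_THRESHOLD]
    passed = not failing and overall >= PASS_THRESHOLD
    return overall, passed, failing


# Third row of the example table: high average, but Clarity is 2.
scores = dict(zip(DIMENSIONS, (5, 5, 5, 5, 2)))
overall, passed, failing = naturalness_result(scores)
print(f"{overall:.1f}", "pass" if passed else "fail", failing)
```

Running it on the third example row reports an overall score of 4.4 and a failure caused by `clarity`, matching the table.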

When to Use

Conversational AI and chatbots
Customer support automation
Content generation systems
Any user-facing AI interactions

Style

Evaluates whether responses adhere to your organization’s style guidelines and brand voice.

What It Evaluates

A single comprehensive score (1-5) based on your custom style guide, measuring:

Tone and voice consistency
Terminology usage
Format and structure requirements
Brand-specific guidelines
Reading level and complexity

How It Works

You provide a style guide document that defines your brand’s communication standards. An LLM evaluates each response against this guide and provides:

A score from 1-5
Detailed analysis explaining the rating

Pass Criteria:

Score ≥ 3.5

Example Style Guide Elements

# Customer Support Style Guide

**Tone:** Professional yet friendly, never casual
**Voice:** Active voice preferred, clear and direct
**Terminology:** Use "account" not "profile", "transfer" not "send"
**Format:** Start with acknowledgment, provide solution, end with offer to help
**Constraints:** Keep responses under 100 words when possible

When to Use

Brand-critical communications
Multi-channel consistency (chat, email, voice)
Customer-facing applications where brand matters

Note: For regulated industries with strict compliance requirements, use Custom evaluations instead.

Recall (RAG Evaluation)

Comprehensive evaluation of Retrieval-Augmented Generation (RAG) pipeline quality.

What It Evaluates

Four metrics that measure different aspects of RAG performance:

Context Relevancy – Are retrieved documents relevant to the query?
Context Precision – How well-ranked are the retrieved documents?
Faithfulness – Is the answer grounded in the retrieved context?
Answer Relevancy – Does the answer address the user’s question?

How It Works

Each metric produces a score from 0-1 (higher is better). The overall score is the average of Context Precision, Faithfulness, and Answer Relevancy.

Pass Criteria:

Context Precision ≥ 0.5
Faithfulness ≥ 0.5
Answer Relevancy ≥ 0.5

Note: Context Relevancy is computed for observability but doesn’t affect pass/fail status.
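As a rough sketch of how these rules combine, the Python below averages the three gating metrics and checks each against 0.5. The metric key names, the function name, and the sample values are assumptions for illustration; they are not Avido's API or real results.

```python
PASS_THRESHOLD = 0.5
# Only these three metrics feed the overall score and the pass/fail decision.
GATING_METRICS = ("context_precision", "faithfulness", "answer_relevancy")


def recall_result(metrics: dict) -> tuple:
    """Return (overall, passed) for one set of RAG metric scores."""
    overall = sum(metrics[m] for m in GATING_METRICS) / len(GATING_METRICS)
    passed = all(metrics[m] >= PASS_THRESHOLD for m in GATING_METRICS)
    return overall, passed


# Made-up values: a low Context Relevancy does not cause a failure,
# because that metric is reported for observability only.
example = {
    "context_relevancy": 0.30,
    "context_precision": 0.70,
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
}
overall, passed = recall_result(example)
print(round(overall, 2), passed)
```

Lowering any one of the three gating metrics below 0.5 makes the evaluation fail, even if the average stays above 0.5.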

Score Interpretation

Score Range	Interpretation	Action Required
0.8 - 1.0	Excellent performance	Monitor
0.5 - 0.8	Acceptable quality	Optimize if critical
0.0 - 0.5	Poor performance	Investigate immediately

Common Issues and Solutions

Low Metric	Likely Cause	Solution
Context Precision	Too many irrelevant chunks retrieved	Reduce top_k, improve filters
Context Relevancy	Embedding/index drift	Retrain embeddings, update index
Faithfulness	Model hallucinating	Add grounding instructions, reduce temperature
Answer Relevancy	Answer drifts off-topic	Improve prompt focus, add constraints

When to Use

Knowledge base search and retrieval
Document Q&A systems
RAG pipelines
Any system combining retrieval with generation

Fact Checker

Validates factual accuracy of AI responses against ground truth.

What It Evaluates

Compares AI-generated statements with known correct information, classifying each statement as:

True Positives (TP) – Correct facts present in the response
False Positives (FP) – Incorrect facts in the response
False Negatives (FN) – Correct facts omitted from the response

How It Works

An LLM extracts factual statements from both the AI response and ground truth, then classifies them. The F1 score measures accuracy:

F1 = TP / (TP + 0.5 × (FP + FN))

Pass Criteria:

F1 score ≥ 0.8

This allows high-quality answers with minor omissions while maintaining strict accuracy standards.
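The formula is easy to check by hand or in code. The sketch below implements it directly in Python; the function name and the zero-statement fallback are assumptions for the example, not documented Avido behavior.

```python
PASS_THRESHOLD = 0.8


def fact_f1(tp: int, fp: int, fn: int) -> float:
    """F1 = TP / (TP + 0.5 * (FP + FN))."""
    denominator = tp + 0.5 * (fp + fn)
    # Assumption: if no statements were extracted at all, score 0.0
    # rather than dividing by zero.
    return tp / denominator if denominator else 0.0


for tp, fp, fn in [(5, 0, 1), (3, 0, 2), (1, 4, 0)]:
    score = fact_f1(tp, fp, fn)
    verdict = "pass" if score >= PASS_THRESHOLD else "fail"
    print(tp, fp, fn, f"{score:.2f}", verdict)
```

The three sample inputs score 0.91, 0.75, and 0.33, so only the first passes. Because false positives and false negatives carry the same weight, one wrong fact costs as much as one omitted fact.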

Example Classification

Question: “What powers the sun?”

Ground Truth: “The sun is powered by nuclear fusion. In its core, hydrogen atoms fuse to form helium, releasing tremendous energy.”

AI Response: “The sun is powered by nuclear fission, similar to nuclear reactors, and provides light to the solar system.”

Classification:

TP: [“Provides light to the solar system”]
FP: [“Powered by nuclear fission”, “Similar to nuclear reactors”]
FN: [“Powered by nuclear fusion”, “Hydrogen fuses to form helium”]
F1 Score: 1 / (1 + 0.5 × (2 + 2)) = 0.33 → ❌ Fail

Score Examples

TP	FP	FN	F1 Score	Result	Notes
5	0	0	1.0	✅ Pass	Perfect accuracy
5	0	1	0.91	✅ Pass	Minor omission acceptable
5	1	0	0.91	✅ Pass	Minor error acceptable
4	1	1	0.8	✅ Pass	Boundary case
3	0	2	0.75	❌ Fail	Too many omissions
1	4	0	0.33	❌ Fail	Mostly incorrect

When to Use

Financial data and calculations
Medical or legal information
Product specifications and features
Any domain where factual accuracy is critical

Custom

Create domain-specific evaluations for your unique business requirements.

What It Evaluates

Whatever you define in a custom criterion. Common use cases:

Regulatory compliance checks
Schema or format validation
Latency or performance SLAs
Business logic requirements
Security and privacy rules

How It Works

You provide a criterion describing what to check. An LLM evaluates the response and returns:

Binary pass/fail (1 or 0)
Reasoning explaining the decision

Pass Criteria:

Score = 1 (criterion met)

Example Criteria

# Compliance Example
"The response must not mention specific account numbers, 
social security numbers, or other PII. Pass if no PII is present."

# Format Example
"The response must be formatted as a JSON object with 
'action', 'parameters', and 'reasoning' keys. Pass if valid JSON 
with all required keys."

# Business Logic Example
"For loan inquiries, the response must ask for income verification 
before discussing loan amounts. Pass if verification is requested first."

# Chatbot Boundaries Example
"When asked to perform actions outside the chatbot's scope (e.g., 
processing refunds, accessing user accounts, making reservations), 
the response must politely decline and explain limitations. Pass if 
the chatbot appropriately refuses and provides alternative guidance."

When to Use

Industry-specific compliance requirements
Custom business rules and workflows
Structured output validation
Security and privacy checks
Chatbot safety and boundaries
Any evaluation not covered by built-in types

Output Match

Deterministic validation of AI outputs against expected values, without using an LLM judge.

What It Evaluates

Compares your AI’s actual output against an expected value you define per task. Unlike other evaluations that use LLM judgment, Output Match performs exact comparison — making results fully reproducible and deterministic. Two comparison modes are available:

String mode – Exact string match between output and expected value
List mode – Compare lists of values with flexible matching strategies

How It Works

String Mode

The AI’s response is compared directly against the expected string. Optionally, a regex extraction pattern can be applied first to pull a specific value from the response before comparison.

Pass Criteria:

Extracted (or full) output exactly matches the expected string
Score = 1 (match) or 0 (mismatch)

Example: If your AI returns "The order status is: SHIPPED" and you configure:

Extract pattern: status is: (\w+) (capture group 1)
Expected: SHIPPED

The evaluation extracts SHIPPED from the response and compares it to the expected value → ✅ Pass.
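The same extract-then-compare flow can be sketched with Python's standard `re` module. The function name, its parameters, and the choice to treat a failed extraction as a mismatch are assumptions made for illustration; Avido's own regex handling may differ.

```python
import re
from typing import Optional


def string_match(output: str, expected: str, extract: Optional[str] = None) -> int:
    """Return 1 if the (optionally extracted) output equals expected, else 0."""
    value = output
    if extract is not None:
        match = re.search(extract, output)
        if match is None:
            # Assumption: a pattern that finds nothing counts as a mismatch.
            return 0
        # Use capture group 1 when the pattern defines one, else the whole match.
        value = match.group(1) if match.groups() else match.group(0)
    # Exact comparison, so case and whitespace differences cause a mismatch.
    return 1 if value == expected else 0


response = "The order status is: SHIPPED"
print(string_match(response, "SHIPPED", extract=r"status is: (\w+)"))
print(string_match("shipped", "SHIPPED"))
```

The first call extracts `SHIPPED` and scores 1; the second scores 0 because the comparison is case-sensitive.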

List Mode

The AI’s response is parsed as a list and compared against an expected list of values. Two matching strategies are available:

Exact Unordered – Both lists must contain exactly the same items (order doesn’t matter).

Score = 1 (exact match) or 0 (mismatch)

Contains – Measures overlap between the output and expected lists using a configurable metric:

Precision – What fraction of the output items are correct?
Recall – What fraction of the expected items are present?
F1 – Harmonic mean of precision and recall

Pass Criteria:

Score ≥ 0.8 (default, configurable per evaluation)
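The two strategies and the three overlap metrics can be sketched as follows. This is a minimal illustration that treats both lists as sets, which means duplicate items are ignored; that choice, the function name, and the parameter names are assumptions, not a description of how Avido parses or compares lists.

```python
def list_match_score(output, expected, match_mode="contains", metric="recall"):
    """Score an output list against an expected list (both treated as sets)."""
    out, exp = set(output), set(expected)

    if match_mode == "exact_unordered":
        return 1.0 if out == exp else 0.0
    if match_mode != "contains":
        raise ValueError(f"unknown match mode: {match_mode}")

    hits = len(out & exp)  # items present in both lists
    if metric == "precision":
        return hits / len(out) if out else 0.0
    if metric == "recall":
        return hits / len(exp) if exp else 0.0
    if metric == "f1":
        # Harmonic mean of precision and recall, written in the
        # equivalent form 2 * hits / (|output| + |expected|).
        total = len(out) + len(exp)
        return 2 * hits / total if total else 0.0
    raise ValueError(f"unknown metric: {metric}")


expected = ["a", "b", "c"]
for output in (["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["a"]):
    score = list_match_score(output, expected, metric="f1")
    print(output, round(score, 2), "pass" if score >= 0.8 else "fail")
```

With the default 0.8 threshold and the F1 metric, an output missing one of three expected items scores exactly 0.8 and passes, while an output that swaps one item for a wrong one scores 0.67 and fails.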

Configuration

Setting	Applies To	Description
Type	All	`string` or `list` — determines comparison mode
Expected	Per task	The expected output value (string) or values (list)
Match Mode	List only	`exact_unordered` or `contains`
Score Metric	List (contains)	`precision`, `recall`, or `f1` (default: `recall`)
Pass Threshold	List only	Override the default 0.8 threshold (0-1)
Extract	Optional	Regex pattern to extract value(s) from the output before comparison

Score Examples

String Mode

Output	Expected	Result
`SHIPPED`	`SHIPPED`	✅ Pass (score: 1.0)
`shipped`	`SHIPPED`	❌ Fail (score: 0.0, case-sensitive)
`PENDING`	`SHIPPED`	❌ Fail (score: 0.0)

List Mode (Contains, F1)

Output	Expected	Precision	Recall	F1	Result
`["a", "b", "c"]`	`["a", "b", "c"]`	1.0	1.0	1.0	✅ Pass
`["a", "b"]`	`["a", "b", "c"]`	1.0	0.67	0.8	✅ Pass
`["a", "b", "d"]`	`["a", "b", "c"]`	0.67	0.67	0.67	❌ Fail
`["a"]`	`["a", "b", "c"]`	1.0	0.33	0.5	❌ Fail

When to Use

Structured output validation (JSON fields, status codes, categories)
Classification tasks with known correct answers
Extraction pipelines where output must match expected values
Regression testing with deterministic expected outputs
Any task where you need exact, reproducible pass/fail without LLM judgment

Best Practices

Combining Evaluations

Use multiple evaluation types together for comprehensive quality assurance. The right combination depends on what your specific task does:

Knowledge Base Q&A (RAG): Recall + Fact Checker + Naturalness
Creative Content Generation: Naturalness + Style + Fact Checker (if accuracy matters)
Retrieval-Based Customer Support: Recall + Naturalness + Style + Custom (compliance)
Direct Response (no retrieval): Naturalness + Style + Custom (compliance)
Chatbot with Boundaries: Naturalness + Custom (safety/boundaries) + Custom (compliance)
Structured Output: Output Match + Custom (business logic)
Classification / Extraction: Output Match + Naturalness (if user-facing)

Choose evaluations based on your task’s behavior, not just your application type. For example, a customer support application might use different evaluation combinations for retrieval-based responses versus direct answers, and might add Custom evaluations to ensure the chatbot properly refuses out-of-scope requests.

Issue Creation

When an evaluation fails, Avido automatically creates an issue with:

Title – Evaluation type and failure summary
Priority – HIGH, MEDIUM, or LOW based on severity
Description – Scores, reasoning, and context
Trace Link – Direct access to the full conversation

All issues appear in your Inbox for triage and resolution.

Need Help?

Email – support@avidoai.com

For API details and integration guides, see the API Reference.
