Benchmarks

We evaluate haiku.rag on several datasets to measure both retrieval quality and question-answering accuracy.

Running Evaluations

You can run evaluations with the evaluations CLI:

evaluations repliqa
evaluations wix

The evaluation flow is orchestrated with pydantic-evals, which handles dataset management, scoring, and report generation.
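
As a rough, hedged illustration of that flow (generic pydantic-evals usage, not haiku.rag's actual evaluation code), a run builds a Dataset of Cases, evaluates a task function against it, and prints a report:

# Minimal pydantic-evals sketch; the case contents and task function below
# are illustrative stand-ins, not the real benchmark data or QA pipeline.
from pydantic_evals import Case, Dataset

cases = [
    Case(
        name="example-question",
        inputs="What does the story say about X?",   # hypothetical query
        expected_output="The expected answer text",  # hypothetical answer
    ),
]

async def answer_question(question: str) -> str:
    # The real benchmark would call the haiku.rag question-answering pipeline here.
    return "The expected answer text"

dataset = Dataset(cases=cases)
report = dataset.evaluate_sync(answer_question)
report.print()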

Configuration

The benchmark script accepts several options:

evaluations repliqa --config /path/to/haiku.rag.yaml --db /path/to/custom.lancedb

Options:

  • --config PATH - Specify a custom haiku.rag.yaml configuration file
  • --db PATH - Override the database path (default: platform-specific user data directory)
  • --skip-db - Skip updating the evaluation database
  • --skip-retrieval - Skip retrieval benchmark
  • --skip-qa - Skip QA benchmark
  • --limit N - Limit number of test cases
  • --name NAME - Override the evaluation name

If no config file is specified, the script searches the standard locations (./haiku.rag.yaml, then the user config directory) and falls back to defaults if neither is found.
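
For example, to skip rebuilding the evaluation database and run only the QA benchmark on the first 100 test cases:

evaluations repliqa --skip-db --skip-retrieval --limit 100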

Methodology

Retrieval Metrics

Mean Reciprocal Rank (MRR) - Used when each query has exactly one relevant document; a minimal sketch of the computation follows this list.

  • For each query, find the rank (position) of the first relevant document in top-K results
  • Reciprocal rank = 1/rank (e.g., rank 3 → 1/3 ≈ 0.333)
  • If not found in top-K, score is 0
  • MRR is the mean across all queries
  • Range: 0 (never found) to 1 (always at rank 1)
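
A minimal sketch of the computation in generic Python (an illustration, not haiku.rag's evaluation code); each query is represented by the ranked document IDs returned by retrieval and the ID of its single relevant document:

def mean_reciprocal_rank(results: list[tuple[list[str], str]]) -> float:
    """results: (ranked_doc_ids, relevant_doc_id) pairs, one per query, already cut to top-K."""
    total = 0.0
    for ranked_ids, relevant_id in results:
        if relevant_id in ranked_ids:
            rank = ranked_ids.index(relevant_id) + 1  # ranks are 1-based
            total += 1.0 / rank
        # a document missing from the top-K contributes 0
    return total / len(results) if results else 0.0

# Relevant document at rank 1, rank 3, and not found: (1 + 1/3 + 0) / 3 ≈ 0.444
print(mean_reciprocal_rank([
    (["a", "b"], "a"),
    (["c", "d", "e"], "e"),
    (["f", "g"], "z"),
]))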

Mean Average Precision (MAP) - Used when queries have multiple relevant documents; a sketch of the computation follows this list.

  • For each relevant document at position k, calculate precision@k = (relevant docs in top k) / k
  • Average Precision (AP) = sum of these precision values divided by the total number of relevant documents
  • MAP is the mean of AP scores across all queries
  • Range: 0 to 1; rewards ranking relevant documents higher
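
A sketch of the computation in generic Python (again an illustration, not haiku.rag's evaluation code):

def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """AP for one query: precision@k summed at each relevant hit, divided by total relevant docs."""
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant position
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    return sum(average_precision(ranked, relevant) for ranked, relevant in queries) / len(queries)

# Two relevant documents, found at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision(["a", "b", "c"], {"a", "c"}))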

QA Accuracy

For question-answering evaluation, pydantic-evals coordinates an LLM judge (Ollama qwen3) to determine whether answers are correct. Accuracy is the fraction of correctly answered questions.
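
A hedged sketch of how such a judge can be attached as a pydantic-evals evaluator (the rubric wording here is an assumption; the actual judge model and prompt are configured by the benchmark):

# Illustrative only: the rubric text below is an assumption, not the
# benchmark's actual judging prompt; cases are built as in the earlier sketch.
from pydantic_evals import Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=cases,
    evaluators=[
        LLMJudge(
            rubric="The answer is factually consistent with the expected answer.",
            include_input=True,
        ),
    ],
)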

RepliQA

RepliQA contains synthetic news stories with question-answer pairs. We use the News Stories documents from the repliqa_3 split (1035 documents). Each question has exactly one relevant document, so we use MRR for retrieval evaluation.

Results from v0.19.6

Retrieval (MRR)

Embedding Model | MRR | Reranker
Ollama / qwen3-embedding:8b | 0.91 | -

QA Accuracy

Embedding Model | QA Model | Accuracy | Reranker
Ollama / qwen3-embedding:4b | Ollama / gpt-oss - no thinking | 0.82 | None
Ollama / qwen3-embedding:8b | Ollama / gpt-oss - thinking | 0.89 | None
Ollama / mxbai-embed-large | Ollama / qwen3 - thinking | 0.85 | None
Ollama / mxbai-embed-large | Ollama / qwen3 - thinking | 0.87 | mxbai-rerank-base-v2
Ollama / mxbai-embed-large | Ollama / qwen3:0.6b | 0.28 | None

Note the significant drop in accuracy when very small models such as qwen3:0.6b are used.

Wix

WixQA contains real customer support questions paired with curated answers from Wix. The benchmark follows the evaluation protocol from the WixQA paper. Each query can have multiple relevant passages, so we use MAP for retrieval evaluation.

We benchmark both the plain text version (HTML stripped, no structure) and the HTML version. Since HTML chunks are small (typically a phrase), we use chunk_radius=2 to expand context.
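
One way to read chunk_radius (a conceptual sketch, not haiku.rag's implementation): each retrieved chunk is returned together with its neighbouring chunks within the given radius, so small HTML chunks carry enough surrounding text to answer from:

# Hypothetical illustration of radius-based context expansion; the function
# and parameter names are made up, not haiku.rag's API.
def expand_with_radius(chunks: list[str], hit_index: int, radius: int = 2) -> str:
    start = max(0, hit_index - radius)
    end = min(len(chunks), hit_index + radius + 1)
    return " ".join(chunks[start:end])

# A hit on chunk 5 with radius=2 returns chunks 3 through 7 joined together.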

Results from v0.20.0

Retrieval (MAP)

Embedding Model | Chunk size | MAP | Reranker | Notes
qwen3-embedding:4b | 256 | 0.34 | None | html, chunk-radius=2
qwen3-embedding:4b | 256 | 0.39 | mxbai-rerank-base-v2 | html, chunk-radius=2
qwen3-embedding:4b | 256 | 0.43 | None | plain text, chunk-radius=0
qwen3-embedding:4b | 512 | 0.45 | None | plain text, chunk-radius=0

QA Accuracy

Embedding Model | Chunk size | QA Model | Accuracy | Notes
qwen3-embedding:4b | 256 | gpt-oss:20b - no thinking | 0.74 | plain text, chunk-radius=0
qwen3-embedding:4b | 256 | gpt-oss:20b - thinking | 0.79 | html, chunk-radius=2
qwen3-embedding:4b | 256 | gpt-oss:20b - thinking | 0.80 | html, chunk-radius=2, reranker=mxbai-rerank-base-v2

HotpotQA

HotpotQA is a multi-hop question answering dataset requiring reasoning over multiple Wikipedia paragraphs. Each question requires evidence from 2+ documents, making it ideal for testing retrieval and reasoning capabilities. We use MAP for retrieval evaluation since queries have multiple relevant documents.

Results from v0.20.2

QA accuracy is evaluated over 2000 "hard" questions from the validation dataset.

Retrieval (MAP)

Embedding Model | MAP | Reranker
qwen3-embedding:4b | 0.69 | None

QA Accuracy

Embedding Model | QA Model | Accuracy
qwen3-embedding:4b | gpt-oss:20b - thinking | 0.86