# Benchmarks
We evaluate haiku.rag on several datasets to measure both retrieval quality and question-answering accuracy.
## Running Evaluations
You can run evaluations with the evaluations CLI; the options it accepts are listed under Configuration below.
The evaluation flow is orchestrated with pydantic-evals, which we leverage for dataset management, scoring, and report generation.
## Configuration
The benchmark script accepts several options:

- `--config PATH` - Specify a custom `haiku.rag.yaml` configuration file
- `--db PATH` - Override the database path (default: platform-specific user data directory)
- `--skip-db` - Skip updating the evaluation database
- `--skip-retrieval` - Skip the retrieval benchmark
- `--skip-qa` - Skip the QA benchmark
- `--limit N` - Limit the number of test cases
- `--name NAME` - Override the evaluation name
If no config file is specified, the script searches the standard locations: `./haiku.rag.yaml`, then the user config directory, then falls back to defaults.
## Methodology
### Retrieval Metrics
**Mean Reciprocal Rank (MRR)** - Used when each query has exactly one relevant document (a short computation sketch follows the list).

- For each query, find the rank (position) of the first relevant document in the top-K results
- Reciprocal rank = `1/rank` (e.g., rank 3 → 1/3 ≈ 0.333)
- If the relevant document is not in the top-K, the score is 0
- MRR is the mean of these scores across all queries
- Range: 0 (never found) to 1 (always at rank 1)
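To make the procedure concrete, here is a small MRR helper in Python. It is an illustrative sketch, not haiku.rag code; it assumes each query is represented as a ranked list of document IDs plus the single relevant document ID.

```python
def mean_reciprocal_rank(runs: list[tuple[list[str], str]], k: int = 10) -> float:
    """MRR over (ranked_doc_ids, relevant_doc_id) pairs, considering only the top-K results."""
    total = 0.0
    for ranked_ids, relevant_id in runs:
        top_k = ranked_ids[:k]
        if relevant_id in top_k:
            rank = top_k.index(relevant_id) + 1  # 1-based position of the relevant document
            total += 1.0 / rank
        # queries whose relevant document is missing from the top-K contribute 0
    return total / len(runs)


# Relevant document at rank 3 for the first query, rank 1 for the second: (1/3 + 1) / 2 ≈ 0.667
print(mean_reciprocal_rank([(["d7", "d2", "d9"], "d9"), (["d4", "d1"], "d4")]))
```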
**Mean Average Precision (MAP)** - Used when queries have multiple relevant documents (see the sketch after the list).

- For each relevant document retrieved at position k, calculate precision@k = (relevant docs in top k) / k
- Average Precision (AP) = sum of these precision values / total relevant documents
- MAP is the mean of AP scores across all queries
- Range: 0 to 1; rewards ranking relevant documents higher
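The multi-relevant case, again as an illustrative Python sketch (not haiku.rag code), assuming each query comes with a ranked list of document IDs and the set of relevant IDs:

```python
def average_precision(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """AP: precision@k summed at each relevant hit in the top-K, divided by the total relevant docs."""
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / position  # precision@position
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """MAP: mean AP over all queries."""
    return sum(average_precision(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Relevant documents at ranks 1 and 3, out of 2 relevant in total: AP = (1/1 + 2/3) / 2 ≈ 0.83
print(average_precision(["d1", "d5", "d3", "d8"], {"d1", "d3"}))
```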
### QA Accuracy
For question-answering evaluation, pydantic-evals coordinates an LLM judge (Ollama qwen3) to determine whether answers are correct. Accuracy is the fraction of correctly answered questions.
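In outline, the judging step looks like the following pydantic-evals sketch. It is illustrative only: the case, rubric, and stand-in answer function are invented here, and the actual benchmark points the judge at an Ollama qwen3 model through haiku.rag's configuration rather than the library's default judge model.

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

# One QA test case: the question as input, the reference answer as expected output.
cases = [
    Case(
        name="sample-question",
        inputs="What is the capital of France?",
        expected_output="Paris",
    ),
]

# LLMJudge grades each produced answer against the rubric.
# Note: a judge model must be available; the benchmarks configure an Ollama qwen3 model for this.
dataset = Dataset(
    cases=cases,
    evaluators=[LLMJudge(rubric="The answer correctly identifies Paris as the capital of France.")],
)


async def answer_question(question: str) -> str:
    # Stand-in for the real RAG pipeline (search the database, then generate an answer).
    return "The capital of France is Paris."


report = dataset.evaluate_sync(answer_question)
report.print()
```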
## RepliQA

RepliQA contains synthetic news stories with question-answer pairs. We use the News Stories from `repliqa_3` (1035 documents). Each question has exactly one relevant document, so we use MRR for retrieval evaluation.
Results from v0.19.6
### Retrieval (MRR)

| Embedding Model | MRR | Reranker |
|---|---|---|
| Ollama / qwen3-embedding:8b | 0.91 | - |
### QA Accuracy

| Embedding Model | QA Model | Accuracy | Reranker |
|---|---|---|---|
| Ollama / qwen3-embedding:4b | Ollama / gpt-oss - no thinking | 0.82 | None |
| Ollama / qwen3-embedding:8b | Ollama / gpt-oss - thinking | 0.89 | None |
| Ollama / mxbai-embed-large | Ollama / qwen3 - thinking | 0.85 | None |
| Ollama / mxbai-embed-large | Ollama / qwen3 - thinking | 0.87 | mxbai-rerank-base-v2 |
| Ollama / mxbai-embed-large | Ollama / qwen3:0.6b | 0.28 | None |
Note the significant degradation when very small models such as qwen3:0.6b are used.
## WixQA
WixQA contains real customer support questions paired with curated answers from Wix. The benchmark follows the evaluation protocol from the WixQA paper. Each query can have multiple relevant passages, so we use MAP for retrieval evaluation.
We benchmark both the plain-text version (HTML stripped, no structure) and the HTML version. Since HTML chunks are small (typically a phrase), we use `chunk_radius=2` to expand the retrieved context.
Results from v0.20.0
### Retrieval (MAP)

| Embedding Model | Chunk size | MAP | Reranker | Notes |
|---|---|---|---|---|
| qwen3-embedding:4b | 256 | 0.34 | None | html, chunk-radius=2 |
| qwen3-embedding:4b | 256 | 0.39 | mxbai-rerank-base-v2 | html, chunk-radius=2 |
| qwen3-embedding:4b | 256 | 0.43 | None | plain text, chunk-radius=0 |
| qwen3-embedding:4b | 512 | 0.45 | None | plain text, chunk-radius=0 |
### QA Accuracy

| Embedding Model | Chunk size | QA Model | Accuracy | Notes |
|---|---|---|---|---|
| qwen3-embedding:4b | 256 | gpt-oss:20b - no thinking | 0.74 | plain text, chunk-radius=0 |
| qwen3-embedding:4b | 256 | gpt-oss:20b - thinking | 0.79 | html, chunk-radius=2 |
| qwen3-embedding:4b | 256 | gpt-oss:20b - thinking | 0.80 | html, chunk-radius=2, reranker=mxbai-rerank-base-v2 |
## HotpotQA
HotpotQA is a multi-hop question answering dataset requiring reasoning over multiple Wikipedia paragraphs. Each question requires evidence from 2+ documents, making it ideal for testing retrieval and reasoning capabilities. We use MAP for retrieval evaluation since queries have multiple relevant documents.
Results from v0.20.2. QA accuracy is evaluated over 2000 "hard" questions from the validation dataset.
### Retrieval (MAP)

| Embedding Model | MAP | Reranker |
|---|---|---|
| qwen3-embedding:4b | 0.69 | None |
### QA Accuracy

| Embedding Model | QA Model | Accuracy |
|---|---|---|
| qwen3-embedding:4b | gpt-oss:20b - thinking | 0.86 |