Benchmarks
We evaluate haiku.rag on a small set of datasets that exercise different parts of the pipeline. Wix and OpenRAG Bench (ORB) are the two we currently track. Retrieval, QA accuracy, and citation retrieval are scored end-to-end through the rag and rag-analysis skills.
Running Evaluations
You can run evaluations with the evaluations CLI:
The evaluation flow is orchestrated with pydantic-evals, which we leverage for dataset management, scoring, and report generation.
Pre-built Databases
Building evaluation databases from scratch can take a long time, especially for large datasets like OpenRAG Bench. Pre-built databases are available on HuggingFace:
# Download a specific dataset
evaluations download wix
# Download all datasets
evaluations download all
# Force re-download (overwrite existing)
evaluations download wix --force
Active datasets:
| Dataset | Size |
|---|---|
wix |
~511MB |
orb_text — OpenRAG Bench, text embedder (qwen3-embedding:4b) with VLM picture descriptions baked into chunk content |
~18 GB |
orb_multimodal — OpenRAG Bench, multimodal embedder (qwen3-vl-embedding-8b); picture vectors live in the same space as text for cross-modal retrieval |
~16 GB |
After downloading, run benchmarks with --skip-db to use the pre-built database:
Configuration
The benchmark script accepts several options:
Options:
--config PATH- Specify a customhaiku.rag.yamlconfiguration file--db PATH- Override the database path (default: platform-specific user data directory)--skip-db- Skip updating the evaluation database--skip-retrieval- Skip retrieval benchmark--skip-qa- Skip QA benchmark--limit N- Limit number of test cases--name NAME- Override the evaluation name--target {rag-skill,analysis-skill}- Choose which skill to benchmark end-to-end (default:rag-skill).--skill-model PROVIDER:NAME- Override the skill model independently from the judge (default:config.qa.model, orconfig.analysis.modelwhen set for--target analysis-skill).
If no config file is specified, the script searches standard locations: ./haiku.rag.yaml, user config directory, then falls back to defaults.
To pin the LLM judge in YAML (rather than the default ollama:qwen3.6):
evaluations:
judge:
provider: openai
name: gpt-4o-mini
base_url: http://localhost:8000/v1 # optional, for OpenAI-compatible servers (vLLM, LM Studio, etc.)
Methodology
Retrieval Metrics
Mean Average Precision (MAP) scores ranked retrieval results against the gold expected_uris.
- For each relevant document at position k, calculate precision@k = (relevant docs in top k) / k
- Average Precision (AP) = sum of these precision values / total relevant documents
- MAP is the mean of AP scores across all queries
- Range: 0 to 1. Rewards ranking relevant documents higher
- For single-doc queries this collapses to
1/rank(i.e. reciprocal rank)
QA Accuracy
pydantic-evals coordinates an LLM judge to determine whether the skill's answer is correct. The default judge is ollama:qwen3.6, pinned so changes to the skill model don't change the judge underneath. Set evaluations.judge in haiku.rag.yaml to override (including a custom base_url for any OpenAI-compatible endpoint). Accuracy is the fraction of correctly answered questions.
We picked qwen3.6 over the previously-pinned gpt-oss after a 4-cell calibration (gpt-oss / qwen3.6 as both answerer and judge, with Claude Opus 4.7 as a reference). qwen3.6 had κ ≥ 0.66 vs the reference on both same-family and cross-family answerers (vs ~0.39–0.55 for gpt-oss) and showed no measurable self-preference bias, while gpt-oss was ~10 pp more lenient on its own outputs.
Citation Retrieval
Alongside QA accuracy, a second metric scores the URIs the skill registered via the cite tool against each dataset's gold expected_uris, using the same MAP math as raw retrieval. The score key is cited_map. Console output also includes the cite rate (% of cases with at least one citation) and the mean number of citations per case.
This is computed alongside QA accuracy from the same skill run, no extra invocations. The signal complements raw retrieval: where raw retrieval measures whether the retriever surfaced the gold document at any rank, citation retrieval measures whether the skill grounded its answer on it.
Current results
Numbers measured under the current pinned judge (ollama:qwen3.6) on a recent haiku.rag version.
OpenRAG Bench (ORB)
OpenRAG Bench contains ArXiv research papers with multimodal question-answering pairs. Queries include both text-based and image-based questions, testing retrieval and reasoning over visual content like figures, charts, and diagrams. Each query maps to one relevant document.
Two approaches are benchmarked separately:
- Multimodal embedder (
Qwen/Qwen3-VL-Embedding-8B, served via vLLM): picture bytes and text live in a shared vector space, no VLM is run at ingest. - Text embedder + VLM picture descriptions (
qwen3-embedding:4b+ollama/ministral-3): pictures are described at ingest and the descriptions are woven into chunk text. Retrieval runs over text only. See Picture handling configuration.
Multimodal embedder
Retrieval (MAP)
| Embedding Model | Cases | MAP |
|---|---|---|
Qwen/Qwen3-VL-Embedding-8B |
3045 | 0.9774 |
nvidia/llama-nemotron-embed-vl-1b-v2 |
3045 | 0.9709 |
QA accuracy + citation retrieval
| Embedding Model | Target | Skill model | Cases | QA accuracy | Mean cited_map |
|---|---|---|---|---|---|
Qwen/Qwen3-VL-Embedding-8B |
rag-skill |
vllm:Gemma-4-26B-A4B-NVFP4 |
1409 | 0.89 | — |
nvidia/llama-nemotron-embed-vl-1b-v2 |
rag-skill |
vllm:Gemma-4-26B-A4B-NVFP4 |
3045 | 0.92 | 0.93 |
nvidia/llama-nemotron-embed-vl-1b-v2 |
analysis-skill |
vllm:Gemma-4-26B-A4B-NVFP4 |
3045 | 0.94 | 0.78 |
nvidia/llama-nemotron-embed-vl-1b-v2 |
analysis-skill |
vllm:Qwen3.6-35B-A3B-NVFP4 |
3045 | 0.95 | 0.93 |
Measured on haiku.rag v0.52.0, no reranker, judged by vllm:Qwen3.6-35B-A3B-NVFP4. Qwen3-VL covered 1409 / 3045 cases.
Text embedder + VLM picture descriptions
Retrieval (MAP)
| Embedding Model | VLM | Reranker | Cases | MAP |
|---|---|---|---|---|
qwen3-embedding:4b |
Ollama / ministral-3 | mxbai-rerank-base-v2 |
3045 | 0.9834 |
nvidia/llama-nemotron-embed-vl-1b-v2 |
Ollama / ministral-3 | mxbai-rerank-base-v2 |
3045 | 0.9863 |
Measured on haiku.rag v0.50.0.
QA accuracy + citation retrieval
| Embedding Model | VLM | Skill model | Cases | QA accuracy | Mean cited_map |
|---|---|---|---|---|---|
qwen3-embedding:4b |
Ollama / ministral-3 | vllm:Gemma-4-26B-A4B-NVFP4 |
3045 | 0.92 | 0.80 |
nvidia/llama-nemotron-embed-vl-1b-v2 |
Ollama / ministral-3 | vllm:Gemma-4-26B-A4B-NVFP4 |
2836 | 0.96 | 0.81 |
Measured on haiku.rag v0.50.0 with mxbai-rerank-base-v2, judged by vllm:Qwen3.6-35B-A3B-NVFP4. Nemotron covered 2836 / 3045 cases.
Wix
WixQA is real customer support questions paired with curated answers. 200 cases.
evaluations run wix --target rag-skill runs the RAG skill end-to-end and produces both QA accuracy and a citation retrieval metric (cited_map) computed from the URIs the skill registered via the cite tool against the gold expected_uris.
| Skill model | Reranker | QA accuracy | Mean cited_map |
|---|---|---|---|
vllm:Gemma-4-26B-A4B-NVFP4 |
mxbai-rerank-base-v2 |
0.87 | 0.38 |
Measured on haiku.rag v0.48.0 with qwen3-embedding:4b (vLLM, dim 2560), chunk_size=256, search.limit=5. Judged by vllm:Qwen3.6-35B-A3B-NVFP4 (qwen3.6 family, NVFP4 quant served via vLLM rather than the default Ollama). 172 / 198 completed cases (2 errored).